Robots.txt

robots.txt is a plain text file that plays an important role in the indexing of website content. With this file, webmasters can specify which of a site’s pages should be captured and indexed by a crawler such as the Googlebot and which should not. This makes robots.txt extremely interesting for search engine optimization as well.

The basis of robots.txt and the associated control of indexing is the Robots Exclusion Protocol (REP), also known as the Robots Exclusion Standard, published in 1994. This protocol defines ways for webmasters to steer search engine crawlers and their work. However, it should be noted that robots.txt is only a guideline for the search engines, one they are not technically forced to follow. The file cannot be used to assign access rights or prevent access. But since the major search engines such as Google, Yahoo and Bing have committed themselves to adhering to this guideline, robots.txt can be used to control the indexing of one’s own site very reliably.

In order for the file to actually be read, it must be located in the root directory of the domain, and the file name must be written entirely in lower case (“robots.txt”). The path specifications within the file are case-sensitive as well.
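
A quick illustration of the required location, using example.com purely as a placeholder domain:

  https://www.example.com/robots.txt        (is found and read)
  https://www.example.com/pages/robots.txt  (is ignored by crawlers)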

In addition, it is important to note that pages can still end up in the index even if they have been excluded in robots.txt. This is especially the case for pages with many backlinks: the crawlers discover these links on other, non-blocked pages, and backlinks are an important criterion for the search engines.

How is the robots.txt structured?

The structure of the file is very simple. At the beginning, the so-called “user agents” are specified, i.e. the crawlers to which the following rules are to apply. A user agent is essentially nothing more than the name under which a search engine’s crawler identifies itself. To enter the correct names here, however, you need to know what the individual providers have called their user agents; a short example follows the list below. The most common user agents are:

  • Googlebot (normal Google search engine)
  • Googlebot-News (no longer a separate bot; its instructions are also followed by the normal Googlebot)
  • Googlebot-Image (Google image search)
  • Googlebot-Video (Google video search)
  • Googlebot-Mobile (Google mobile search)
  • AdsBot-Google (Google AdWords)
  • Slurp (Yahoo)
  • bingbot (Bing)
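
As a brief sketch, a rule group in robots.txt first names one or more of these user agents and then lists the rules that apply to them (the directory name /internal/ is just a made-up example):

  User-agent: Googlebot
  User-agent: bingbot
  Disallow: /internal/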

So the first line of robots.txt might look like this: “User-agent: Googlebot”. Once the desired user agents are specified, the actual instructions follow. Usually, these begin with “Disallow:”, after which the webmaster specifies which directory or directories the crawlers should ignore during indexing. As an alternative to the Disallow command, an Allow entry can also be made. This makes it possible to explicitly release individual directories or files, for example a subdirectory within an otherwise blocked area. However, the Allow entry is optional, while every rule group should contain at least one Disallow line (which may also be left empty).
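
A minimal sketch of this combination, with made-up directory names: the /shop/ area is blocked for all crawlers, but a single subdirectory within it remains open.

  User-agent: *
  Allow: /shop/public/
  Disallow: /shop/

Google resolves such conflicts in favour of the most specific (longest) matching rule, so the Allow line wins for everything below /shop/public/.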

In addition to specifying individual directories, “Disallow” (or “Allow”) can also be combined with so-called wildcards, i.e. placeholders that allow more general rules for indexing the directories. On the one hand there is the asterisk (*), which stands for any character string. With the entry “Disallow: /*”, for example, the entire domain could be excluded from indexing (the simpler “Disallow: /” has the same effect), while “User-agent: *” applies the following rules to all web crawlers. The second wildcard is the dollar sign ($). It can be used to specify that a rule should only apply to the end of a string. With the entry “Disallow: /*.pdf$”, all URLs ending in “.pdf” could be excluded from indexing. These wildcards are not part of the original standard, but they are supported by the major search engines such as Google and Bing.
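
A short sketch of the two wildcards in combination, again with made-up paths: for all crawlers, the first rule blocks every URL that contains a query string, the second blocks all URLs ending in “.pdf”.

  User-agent: *
  Disallow: /*?
  Disallow: /*.pdf$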

In addition, an XML sitemap can be referenced in robots.txt. This requires an entry according to the following pattern: “Sitemap: http://www.beispiel.de/sitemap.xml”. Furthermore, comment lines can be inserted; to do this, the line in question simply has to begin with a hash sign (#).
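
Putting the pieces together, a complete robots.txt might look like the following sketch; the domain, sitemap URL and directory names are purely illustrative.

  # Rules for all crawlers
  User-agent: *
  Allow: /internal/public/
  Disallow: /internal/

  # Keep the Google image crawler out entirely
  User-agent: Googlebot-Image
  Disallow: /

  Sitemap: https://www.example.com/sitemap.xml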

The robots.txt and SEO

Since robots.txt determines which subpages are available for search engine indexing, it is obvious that the file also plays an important role in search engine optimization. If, for example, a directory of the domain is excluded, all SEO measures on the corresponding pages will come to nothing, because the crawlers simply do not look at them. Conversely, robots.txt can also be used deliberately for SEO, for example to exclude certain pages and thus avoid being penalized for duplicate content.
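
As an illustration of this idea, with a made-up directory name: if printer-friendly copies of articles live under /print/ and merely duplicate the regular pages, that directory could be excluded for all crawlers.

  User-agent: *
  Disallow: /print/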

In general, it can be said that robots.txt is enormously important for search engine optimization, because it can have a massive impact on the ranking of a page. Accordingly, it must be maintained carefully, because errors that prevent important pages from being captured by the crawlers can quickly creep in. Caution is especially advisable when using wildcards, where a typo or a moment of carelessness can have a particularly far-reaching effect. For this reason, inexperienced users are well advised to start with few or only very loose restrictions in the file. Further rules can then be added step by step, so that, for example, SEO measures take better effect.

Help with the creation of robots.txt

Although robots.txt is a simple text file that can be written with any text editor, errors in it carry a lot of weight, as described in the section above, and in the worst case can massively harm the ranking of a page.

Fortunately, for all those who do not want to venture into robots.txt on their own, there are numerous free tools on the Internet that make creating the file much easier, including Pixelfolk and Ryte. There are also free tools for checking the file, for example at TechnicalSEO.com and Ryte. Google itself offers a corresponding robots.txt tester as well, which can be accessed via its Webmaster Tools (now Search Console).
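
For a quick local plausibility check, Python’s standard library also ships a simple parser, urllib.robotparser, which follows the original Robots Exclusion Protocol. The following sketch uses made-up rules and a placeholder domain; note that this module does not necessarily interpret Google-specific wildcard rules the way Google does.

  from urllib import robotparser

  # Made-up example rules, as they might appear in a robots.txt file
  rules = [
      "User-agent: *",
      "Allow: /internal/public/",
      "Disallow: /internal/",
  ]

  parser = robotparser.RobotFileParser()
  parser.parse(rules)

  # Ask whether a given crawler may fetch a given URL
  print(parser.can_fetch("Googlebot", "https://www.example.com/internal/page.html"))      # False
  print(parser.can_fetch("Googlebot", "https://www.example.com/internal/public/a.html"))  # True
  print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post.html"))          # True

Instead of parse(), the set_url() and read() methods can be used to fetch and check a live robots.txt directly from a site.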

Despite its simple structure, and although many site owners pay little attention to it, robots.txt is a very important factor when it comes to SEO measures and the ranking of a page. It is true that the rules set in the file are not binding. In most cases, however, they are respected by the user agents of the search engines, so webmasters can use robots.txt to quickly and easily determine which directories and pages of their domain should be made available to the search engines for indexing.

Due to the far-reaching effect of the file, however, it is advisable to first familiarize yourself a little with the required syntax or to use one of the tools available free of charge on the Internet. Otherwise, there is a risk of excluding pages from indexing that should actually be covered by the search engines, and vice versa.
