What is Robots.txt?


Robots.txt is a text file you put on your website to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but search engines usually obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like putting a "please do not enter" note on an unlocked door: you cannot prevent thieves from coming in, but the nice guys will not open the door and enter. That is why, if you have really sensitive content, it is naive to depend on robots.txt to protect it from being indexed and displayed in search results.

The location of robots.txt is important. It must be in the root directory of your domain, because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look only in the root directory (i.e. http://mydomain.com/robots.txt), and if they do not find it there, they simply assume that the site does not have a robots.txt file and index everything they find along the way. So if you do not put robots.txt in the right place, do not be surprised when search engines index your whole website.
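The "root directory only" rule can be sketched with Python's standard urllib.parse module: whatever page a crawler is visiting, it derives the robots.txt location from the scheme and host alone (mydomain.com here is just a placeholder, as in the text above).

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # Crawlers look for robots.txt at the root of the host only;
    # the path of the page being crawled is irrelevant.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://mydomain.com/blog/2009/05/post.html"))
# http://mydomain.com/robots.txt
```

This is why a robots.txt placed in a subdirectory is simply never consulted.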

The idea and structure of robots.txt were developed over a decade ago. If you are interested in learning more about it, visit http://www.robotstxt.org/ or go straight to the Standard for Robot Exclusion, because this article deals only with the most important aspects of a robots.txt file. Next we will continue with the structure of a robots.txt file.

Structure of a Robots.txt File

The structure of a robots.txt file is simple (and barely flexible): it is essentially a list of user agents and disallowed files and directories. Fundamentally, the syntax is as follows:

User-agent:

Disallow:

User-agent: names the search engines' crawlers a record applies to, and Disallow: lists the files and directories to be excluded from indexing. In addition to User-agent: and Disallow: entries, you can include comment lines by putting the # sign at the beginning of the line:

# All user agents are disallowed to see the /temp directory.

User-agent: *

Disallow: /temp/
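You can check how a crawler would interpret a file like this with Python's standard urllib.robotparser module (the crawler name AnyBot is made up for illustration):

```python
import urllib.robotparser

rules = """\
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Any crawler may fetch the home page, but nothing under /temp/.
print(parser.can_fetch("AnyBot", "/index.html"))       # True
print(parser.can_fetch("AnyBot", "/temp/draft.html"))  # False
```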

The Traps of a Robots.txt File

When you start making complicated files (i.e. you decide to allow different user agents access to different directories), problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user agents or directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.

The more serious issue is with logical errors. For example:

User-agent: *

Disallow: /temp/

User-agent: Googlebot

Disallow: /temp/

Disallow: /images/

Disallow: /cgi-bin/

The above example is from a robots.txt that allows all agents to access everything on the website except the /temp directory, and then adds a second, more restrictive record for Googlebot. The trap here is assuming the two records combine. According to the Standard for Robot Exclusion, a crawler obeys only the single record that matches its user agent, and the * record is just a fallback for crawlers that match no other record. So Googlebot ignores the first record entirely and follows only its own. Here that happens to work, because Disallow: /temp/ is repeated in the Googlebot record, but if you forgot to repeat it there, Googlebot would be free to crawl /temp/ even though you thought you had told all user agents to stay out. You see, the structure of a robots.txt file is simple, yet serious mistakes can be made easily.
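This record-matching behaviour can be verified with Python's standard urllib.robotparser module (SomeOtherBot is an invented name standing in for any crawler without its own record):

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
Disallow: /images/
Disallow: /cgi-bin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own record, so /images/ is off limits to it...
print(parser.can_fetch("Googlebot", "/images/logo.png"))      # False
# ...while a crawler that only matches the * record may still fetch it.
print(parser.can_fetch("SomeOtherBot", "/images/logo.png"))   # True
# Both are kept out of /temp/, each by its own record.
print(parser.can_fetch("SomeOtherBot", "/temp/x.html"))       # False
```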

Tools to Generate & Validate a Robots.txt File

Given the simple syntax of a robots.txt file, you can always read it yourself to see if everything is OK, but it is much easier to use a validator like this one: http://tool.motoricerca.info/robots-checker.phtml. These tools report common mistakes such as missing slashes or colons, which, if undetected, compromise your efforts. For example, if you have typed:

User agent: *

Disallow: /temp/

this is wrong because there is no hyphen between "User" and "agent", so the syntax is incorrect.
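The heart of such a validator is a simple line check. This is a minimal sketch, not a full validator: it only flags missing colons and directive names outside a small known set, which is exactly the class of typo discussed above.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text):
    """Report lines that are neither comments, blanks, nor known 'Field: value' directives."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        field, sep, _value = stripped.partition(":")
        if not sep:
            problems.append((lineno, "missing colon"))
        elif field.strip().lower() not in KNOWN_DIRECTIVES:
            problems.append((lineno, f"unknown directive {field.strip()!r}"))
    return problems

# Catches both mistakes from the examples above.
print(lint_robots_txt("User agent: *\nDisallow /temp/\n"))
```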

In cases where you have a complex robots.txt file (i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude), writing the file by hand can be a real pain. But do not worry: there are tools that will generate the file for you. What is more, there are visual tools that let you point and click to select which files and folders are to be excluded. Even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For example, the Server-Side Robots Generator offers a drop-down list of user agents and a text box for you to list the files you do not want indexed. Honestly, it is not much help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is better than nothing.
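The core of such a generator tool is trivial. As a sketch, one record per user agent with its list of disallowed paths (an empty Disallow: line, meaning "allow everything", is emitted for agents with no exclusions):

```python
def generate_robots_txt(rules):
    """Build robots.txt text from a {user_agent: [disallowed_paths]} mapping."""
    blocks = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        if paths:
            lines.extend(f"Disallow: {path}" for path in paths)
        else:
            lines.append("Disallow:")  # empty Disallow means nothing is excluded
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

print(generate_robots_txt({
    "*": ["/temp/"],
    "Googlebot": ["/temp/", "/images/"],
}))
```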
