SEO | Robots.txt | Optimize Websites

Thursday, December 14, 2006

Where to create the robots.txt file

The robot will simply look for a "/robots.txt" URL on your site, where a site is defined as an HTTP server running on a particular host and port number. For example, for a site at http://www.example.com/ the robot requests http://www.example.com/robots.txt.

Note that there can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.
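As a rough illustration of "one robots.txt per host and port", here is a small Python sketch that derives the robots.txt location from any page URL. The function name robots_txt_url and the example.com addresses are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host:port, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page in a user directory still maps to the single top-level file.
print(robots_txt_url("http://www.example.com/~joe/docs/index.html"))
# A different port counts as a different site with its own robots.txt.
print(robots_txt_url("http://www.example.com:8080/shop/cart?id=1"))
```

Whatever path the page lives under, the result always points at the top level of the same host and port, which is why a robots.txt inside a user directory is never consulted.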

Also, remember that URLs are case sensitive, and "/robots.txt" must be all lower-case.


So, you need to provide the "/robots.txt" in the top-level of your URL space. How to do this depends on your particular server software and configuration.

For most servers it means creating a file in your top-level server directory. On a UNIX machine this might be /usr/local/etc/httpd/htdocs/robots.txt


What to put into the robots.txt file?

The "/robots.txt" file usually contains a record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/


In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records.
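To see why a blank line inside a record matters, here is a minimal sketch using Python's standard-library urllib.robotparser (my choice of parser, not something the text above prescribes). The agent name ExampleBot and the example.com URL are invented for illustration. With this parser, a blank line right after the User-agent line ends the record before any Disallow line is read, so the rules that follow are ignored. Other parsers may be more forgiving.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A well-formed record: no blank line between User-agent and Disallow.
good = parse_rules("User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n")

# The same rules with a stray blank line after the User-agent line.
broken = parse_rules("User-agent: *\n\nDisallow: /cgi-bin/\nDisallow: /tmp/\n")

url = "http://www.example.com/tmp/scratch.html"
print("well-formed record allows /tmp/:", good.can_fetch("ExampleBot", url))
print("broken record allows /tmp/:", broken.can_fetch("ExampleBot", url))
```

The first check reports that the URL is blocked. The second reports that it is allowed, because the Disallow lines no longer belong to any record.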

Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot"; it is not a wildcard pattern. Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".
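A short sketch of what "URL prefix, not pattern" means in practice, again using Python's urllib.robotparser as one example of a parser that follows the original standard. The helper name allowed, the agent ExampleBot and the example.com host are assumptions for illustration. Individual crawlers may behave differently, since this only shows literal prefix matching.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


prefix_rule = "User-agent: *\nDisallow: /tmp/\n"
wildcard_rule = "User-agent: *\nDisallow: /tmp/*\n"

# The rule is a plain prefix: everything under /tmp/ is excluded...
print(allowed(prefix_rule, "/tmp/scratch.html"))
# ...but a path that merely looks similar is not.
print(allowed(prefix_rule, "/tmpfiles/a.html"))
# Here the '*' is taken literally, so the rule does not match /tmp/scratch.html.
print(allowed(wildcard_rule, "/tmp/scratch.html"))
```

The last call is the trap: a parser that treats the asterisk as an ordinary character ends up allowing the very directory the author meant to exclude.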

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:


To exclude all robots from the entire server
User-agent: *
Disallow: /


To allow all robots complete access
User-agent: *
Disallow:


Or create an empty "/robots.txt" file.

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/


To exclude a single robot
User-agent: BadBot
Disallow: /


To allow a single robot
User-agent: WebCrawler
Disallow:



User-agent: *
Disallow: /
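The two-record example above can be checked with a few lines of Python. This is only a sketch using the standard-library urllib.robotparser; the page URL and the second agent name are made up.

```python
from urllib.robotparser import RobotFileParser

# Two records separated by a blank line: one for WebCrawler, one for everyone else.
rules = [
    "User-agent: WebCrawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

page = "http://www.example.com/index.html"
# The record that names a robot takes precedence over the catch-all record.
print("WebCrawler:", parser.can_fetch("WebCrawler", page))
print("SomeOtherBot:", parser.can_fetch("SomeOtherBot", page))
```

WebCrawler is let in by its own record with the empty Disallow line, while any other robot falls through to the "*" record and is excluded.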


To exclude all files except one
This is currently a bit awkward, as the original standard has no "Allow" field. The easy way is to put all the files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/docs/


Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html




Some more commands on Robots.txt

Write a Robots.txt File - Writing the File, Using Disallows

The second line, known as the directive, is written as:
Disallow:

If you add a folder after the Disallow statement, the search spider should skip that folder when indexing and move on to others where there is no restriction.

Disallow: /images/

This is a special example, just for Perfect 10. This one-minute bit of instruction could have saved a ton in wasted legal fees on a frivolous lawsuit. As this is a basic step in building websites, it is incumbent on website owners to protect their intellectual property; it is not a third-party search engine's duty.

You can also disallow specific files this way:

Disallow: /cheeseyporn.htm

One use I recommend all the time is keeping robots out of your cgi-bin directory:

Disallow: /cgi-bin/

If you leave the Disallow directive line blank, this indicates that ALL files may be retrieved and/or indexed by the specified robot(s). This would let all robots index all files:
User-agent: *
Disallow:


And vice versa, you can keep all robots out easily:
User-agent: *
Disallow: /


In the example above, the single forward slash (/) stands for your root directory. Since the root directory is blocked, none of the other folders and files can be indexed or crawled. Your site will be removed from search engines once they read your robots.txt and update their indexes.



Some suggestions for Robots.txt from Forums

1) I have submitted a Google sitemap and Yahoo feeds for 3 of my sites. I have not included a robots.txt file in the root directory, as I want all my pages indexed in search engines.

My question is, is it ok without a robots.txt file or should I include a robots.txt file as

User-agent: *
Disallow:


2) If you want the robots to visit all pages, there is no need for a robots.txt file.

If there is no robots.txt on your site, the robots will still try to find it and there will be "404 not found" messages in your log file. This is not a problem for the robots, but you might prefer to avoid this with a robots.txt such as:

User-agent: *
Disallow:



3) to avoid later security problems around hackers on your site
I would disallow any path leading into

- admin folders
- cgi-bin
- configuration folders
- any logfile/statistics pages (stats have URLs into all private pages as well, such as cgi-bin, admin, etc., unless cleanly configured, hence online stats are a potential security risk for the site owner)
- also exclude any folders or sub-folders containing non-content pages

this keeps your indexed pages limited to the real content and at least partially prevents hackers from finding, via Google, paths to particular scripts known at a given time to have a security issue
Google is the prime resource for cyber criminals and the easiest way for hackers to find sites open for abuse (I had the lesson of being a victim several times last winter; each time the hacker attempt was initiated via a Google search result)

hence a typical robots.txt might look like

User-agent: *
Disallow: /cgi-bin
Disallow: /logs
Disallow: /any-software/admin
Disallow: /your_blog/trackback.php
Disallow: /some-software/include
Disallow: /your-scripts/templates
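If you maintain such a list of excluded paths by hand, a tiny generator can keep the file well formed. The following is only a sketch: the function name build_robots_txt and its input shape (a mapping from user-agent to a list of path prefixes) are my own assumptions, and the paths are taken from the placeholder names in the post above.

```python
def build_robots_txt(records: dict[str, list[str]]) -> str:
    """Render {user-agent: [disallowed path prefixes]} as robots.txt text."""
    blocks = []
    for agent, prefixes in records.items():
        lines = [f"User-agent: {agent}"]
        # An empty list becomes a bare "Disallow:", which excludes nothing.
        lines += [f"Disallow: {prefix}" for prefix in prefixes] or ["Disallow:"]
        # No blank lines inside a record; one line per excluded prefix.
        blocks.append("\n".join(lines))
    # Exactly one blank line between records.
    return "\n\n".join(blocks) + "\n"


robots_txt = build_robots_txt({
    "*": ["/cgi-bin", "/logs", "/any-software/admin"],
})
print(robots_txt, end="")
```

Generating the file this way rules out the two formatting mistakes that are easiest to make by hand: several paths on one Disallow line, and a blank line in the middle of a record.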


like almost all pages and formats on the web - robots.txt has a validator

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

and a home page

http://www.robotstxt.org



4) the key source for all cyber crime known to me is Google
Google does strictly respect robots.txt
hence excluding admin and script sections of a site in robots.txt also successfully disables the major information-source of cyber criminals

all known hacker attacks always came from Google search and NO single other source is known to me in the past 9 yrs of full time web publishing

all MAJOR SE such as MSN, Y, G and ask.com do fully respect robots.txt and no other SE is of any significance to obtain correct and current CC-relevant information




More about Robots.txt

This issue I thought I’d take a look at a subject which is of absolute importance to those of us who use search engines, but is something we know virtually nothing about, and that is how do web pages end up in search engine directories?

If you’re short on time, the quick summary is that search engines of the free text variety (rather than the Index/Directory type) employ specialised utilities which visit a site, copy the information they find back to base, and then include this information the next time that they update their index for users. OK, that’s it. You can move onto the next article now.

Oh, you’re still here! Well, in that case, let’s look at the entire issue in a little more detail. These utilities are often called robots, spiders or crawlers. As already described, they reach out and grab pages from the Internet, and if it’s a new page or a page that has been updated since the last time that they visited, they will take a copy of the data. They find these pages either because the web author has gone to a search engine and asked for their site to be indexed, or the robot has found their site by following a link from another page. As a result, if the author doesn’t tell the engines about a particular page, and doesn’t have any links to it, it’s highly unlikely that the page will be found.

Robots are working all the time; the ones employed by AltaVista for example will spider about 10,000,000 pages a day. If your website has been indexed by a search engine, you can be assured that at some point a robot has visited your site and by following all your links, will have copied all the pages that it can find. It might not do this in one go; if your site is particularly large for example it could put something of a strain on the server which wouldn’t please your technical people, so many robots will stagger their visits over the period of several days, just indexing a few pages at a time until they’ve taken copies of everything that they can. Also, if you have ever submitted to a search engine that says that it will instantly register and index your site in actual fact it won’t do – it will make a preliminary visit and make a note to come back and grab the rest of your data at a later date.

How can you tell if your site has been visited by one of these robots? The answer, as with most answers related to the Internet is ‘it depends’. Obviously one way is to go to a search engine and run a search for your site; if it’s retrieved, then your site has been visited. An easy way of doing this at somewhere like AltaVista for example is to do a search for:

host: <URL of your site> such as host:philb

and if you get some results back, you know that your site has been indexed. (It might be worth checking at this point however just to make sure that the search engine has got all of your pages, and also that they are the current versions).

However, this is a pretty laborious way of checking – what is much more sensible and easier to do is to access the log files that are automatically kept on your site. As you may know, if you visit a site, your browser is requesting some data, and details of these transactions are kept by the host server in a log file. This file can be viewed using appropriate software (usually called an Access Analyser) and will provide information such as the IP address of the machine that made the request, which pages were viewed, the browser being used, the Operating System, the domain name, country of origin and so on. In exactly the same way that browsers leave this little trace mark behind them, so do the robot programs.

Consequently, if you know the name of the robot or spider that a particular search engine employs it should be a relatively easy task to identify them. Well (and here’s that answer again), it depends. If your site is popular your log files will be enormous and it could take a very long time to work through them to find names that make some sense to you. Added to this is the fact that there are over 250 different robots in operation and some, many or none of them might have visited in any particular period of time. So it’s not a perfect way of identifying them. Besides, new robots are being introduced all the time, old ones may change their names, all of which can make the whole process much more difficult.

There is however a simpler solution that makes a refreshing change! Before I tell you what it is, a quick diversion. It’s quite possible that, if you’re a web author you might not want the robots to visit certain pages on your site – they might not be finished, or you might have some personal information that you don’t want everyone to see, or a page may only be published for a short period of time. Whatever the reason, you don’t want the pages indexed. The people who produce search engines realise that as well, so a solution was created to allow an author to tell the robots not to index pages, or to follow certain links, or to ignore certain subdirectories on servers – the Robot Exclusion Standard. This is done in two ways, either by using the meta tag facility (a meta tag being an HTML tag that is inserted into a page and which is only viewed by search engines, not the viewer of a web page via their browser) or by adding a small text file in the top level of the space allocated to you by your web hosting service.

It is this second method that is of interest to us. Since search engine robots know that authors might not want them to index everything, when they visit sites they will look for the existence of this file, called robots.txt, just to check whether they are allowed to index a site in its entirety. (For the purposes of this article it’s not necessary to go into detail about what is or is not included in the robots.txt file, but if you’re interested in this, a good page to visit is the description of it at AltaVista (http://doc.altavista.com/adv_search/ast_haw_avoiding.html). Alternatively, if you want to read the nitty gritty you might want to look at “A Standard for Robot Exclusion” (http://info.webcrawler.com/mak/projects/robots/norobots.html).) Admittedly, not all robots do look for this file, but the majority do, and will abide by what they find.

Right – that’s the quick diversion out of the way, so back to the article. When you look at your statistics you should pay particular attention to any requests for the robots.txt file, because the only requests for it will be from the robot and spider programs; it’s not the sort of thing that a browser will go looking for. It should then be a much simpler matter of then being able to identify which search engines have visited your site over the particular period in question. If you see that ‘Scooter’ has requested the file you can then track that back to AltaVista – once you know enough to link Scooter to that search engine of course. A very useful site which lists details on over 250 robots and spiders can be found at The Web Robots Page at http://info.webcrawler.com/mak/projects/robots/robots.html and just reading the list of names can be quite fascinating – I noticed one called Ariadne for example (part of a research project at the University of Munich), another called Dragon Bot (collects pages related to South East Asia) and Googlebot (no prizes for guessing where that one comes from!)
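The idea of spotting robots by their requests for robots.txt can be sketched in a few lines of Python. This assumes your server writes the common "combined" access log format; the sample log lines, IP addresses and user-agent strings below are invented for illustration, and the function name robots_txt_requesters is hypothetical.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count the user-agent strings that requested /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample data; real logs would be read from the server's log file.
SAMPLE_LOG = [
    '192.0.2.10 - - [14/Dec/2006:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/3.3"',
    '192.0.2.11 - - [14/Dec/2006:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.12 - - [14/Dec/2006:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 404 210 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [14/Dec/2006:10:05:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/3.3"',
]

for agent, hits in robots_txt_requesters(SAMPLE_LOG).most_common():
    print(hits, agent)
```

The browser request for /index.html is skipped, leaving only the agents that asked for robots.txt. A robot can report any user-agent string it likes, so treat the names as a hint rather than proof.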

Are there any disadvantages to using this approach? Yes, you’ve guessed it, the answer is once again ‘it depends’. If you only have very limited access to statistical data it is possible that you will get an artificially high count of users, when in actual fact the number of ‘real’ users is much less than this. Unless you extract more information from your statistics it’s going to be very difficult to isolate the real users from the spiders. Some people do claim that the spiders cause problems due to the bandwidth they use to collect their data, particularly if they use the ‘rapid fire’ approach of attempting to get a large number of pages in a very short space of time. This in turn leads to poor retrieval times for ‘real’ people who want to view pages on a site. Since anyone with sufficient knowledge can create spiders or robots their number will only increase in the future, and although the robots.txt file is of some use in this case, there is no requirement or standardisation that says that all robots need to adhere to it; some may very well completely ignore it. These problems have been addressed in an interesting article written by Martijn Koster, which although now several years old still makes relevant and interesting reading: http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html

However, like them, loathe them or be completely indifferent to them, spiders are one of the most important ways that we have of being able to access the information that’s out there.



Wednesday, December 13, 2006

Advanced Use of Robots.txt

With all of the SEO tips, tricks, and tutorials available to you, probably the easiest to achieve is the use of the robots.txt file. This is a simple file that gives instructions to search engine robots, or spiders, on how to crawl your website, and which files and directories to stay out of, and to not index in their databases.

In an earlier tutorial by Clint Dixon, he showed you how to write a robots.txt file, and what information to include, such as User-Agent and the Disallow directive, instructing the search engine spiders on how to crawl your site. In this article, I want to build upon what he showed you, and give you more information on the importance of the robots.txt in your SEO efforts, and some of the consequences of not having one, or having one written incorrectly.

Behavior of Search Engines When Encountering Robots.txt
Search engines behave differently upon encountering, or not, the robots.txt file during a crawl. You only have to follow your web stats to know that the robots.txt is one of the most requested files by search engine spiders. Many spiders check the robots.txt file first before ever performing a crawl, and some even pre-empt crawls by checking for the presence of, and commands in, the file; only to leave and come back another day. In the absence of this file, many search engine spiders will still crawl your website, but you will find that even more will simply leave without indexing. While this is seen as one of the most extreme consequences of excluding the robots.txt file, I will also show you consequences that I consider to be far worse.

Some of the major search engine spiders and robots have distinct behavior patterns upon reading the robots.txt that you can track in your stats. Sometimes, however, it is nice to have an outsider’s perspective on robot behaviors in order to compare it to what you may have noticed. I view a lot of robots.txt files, and sites with and without them, so I’ve been able to come up with a few behavior patterns I would like to share with you.

MSNbot

MSN’s search engine robot is called MSNbot. The MSNbot has quite a voracious appetite for spidering websites. Some webmasters love it and try to feed it as much as possible. Other webmasters don't see any reason to use up bandwidth for a search engine that doesn't bring them traffic. Either way, MSNbot will not spider your website unless you have the robots.txt. Once it finds your robots.txt, it will wander the site, almost timidly at first. Then MSNbot builds up courage and indexes files rapidly. So much so, that use of the crawl-delay directive is recommended with this robot. I’ll cover this more later.

Recent events could be the cause of this. Several months ago, MSN received many complaints that MSNbot was ignoring directives written into the robots.txt files, such as crawling directories it has been instructed to stay out of. Engineers looked into the problem, and I believe they changed a few things to help control this type of behavior with the robot.
In the process, they may have changed it in such a way as to instruct the MSNbot to follow the robots.txt to the letter, and for websites that didn’t have one, it probably got confused and just left, not having a letter of the law to go by. While this is probably mere speculation, the spidering behavior of the robot seems to fit this assessment.

Yahoo’s Inktomi Slurp

Yahoo incorporated Inktomi’s search engine crawler, which is now known as Slurp. Inktomi/Yahoo's Slurp seems to gobble greedily for a couple of days, disappear, come back, gobble more, and disappear again. Without the robots.txt, however, it will crawl fairly slowly, until it just kind of fades away, unless it finds great, unique content. But still, without the presence of the robots.txt, it may not crawl very deeply into your website.

Googlebot
On Google’s website, they instruct webmasters on the use of the robots.txt, and recommend that you do so. SEOs know that Google’s “guidelines” for webmasters are actually more like step by step directions on how to optimize for the search engine. So if Google makes mention of the robots.txt, then I would definitely follow those recommendations to a T.

Google will crawl a site, robots.txt or no, sporadically either way, but it will heed the instructions in the file if it is there. Googlebot has been known to only crawl one or two levels deep without the presence of the robots.txt file.

IA_Archiver

Alexa’s search engine robot is called ia_archiver. It is an aggressive spider with a big appetite; however it is also very polite. It tends to limit its crawls to a couple hundred pages at a time, crawling without using extraneously large amounts of bandwidth, and slow enough as to not overload the server. It will continue its crawl over a couple of days, and then come back after that fairly consistently as well. So much so, that by analyzing your web stats, you can almost predict when ia_archiver will perform its next crawl. Alexa’s ia_archiver obeys the robots.txt commands and directives.

There are many other spiders and robots that exhibit particular behaviors when crawling your site. The good ones will follow the robots.txt directives, and many of the bad ones will not. Later, I’ll show you a few ways to help prevent some problems you might encounter from search engine robots, and how to utilize your robots.txt to help.

Advanced Robots.txt Commands and Features

While the basic commands that make up a robots.txt file come in just two types, User-agent and Disallow, there are some additional commands and features that can be used. I should let you know, however, that not all search engine spiders understand these commands. It’s important to know which ones do and which do not.

Crawl Delay

Some robots have been known to crawl web pages at lightning speed, forcing webmasters to ban the robots' IP addresses or to disallow them from crawling their websites. Some web servers have automatic flood triggers implemented, with automatic IP-banning software in place. If a search engine spider crawls too quickly, it can trigger these IP bans, blocking the subsequent crawling activities of the search engine. While some of these robots deserve a ban, there are others that you most likely do not wish to have banned.

Instead of the following example, which bans the robot from crawling any of your pages, another solution to this problem was offered: the crawl delay command.


User-agent: MSNbot
Disallow: /


MSNbot was probably the most notorious offender. In an SEO forum, “msndude” gave some insight into this: “With regards to aggressiveness of the crawl: we are definitely learning and improving. We take politeness very seriously and we work hard to make sure that we are fixing issues as they come up…
I also want to make folks aware of a feature that MSNbot supports…what we call a crawl delay. Basically it allows you to specify via robots.txt an amount of time (in seconds) that MSNbot should wait before retrieving another page from that host. The syntax in your robots.txt file would look something like:

User-Agent: MSNbot
Crawl-Delay: 20


“This instructs MSNbot to wait 20 seconds before retrieving another page from that host. If you think that MSNbot is being a bit aggressive this is a way to have it slow down on your host while still making sure that your pages are indexed.”
Other search engine spiders that support this command are Slurp, Ocelli, Teoma/AskJeeves, Spiderline and many others.
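If you write your own crawler, the delay can be read back out of the file. Here is a minimal sketch with Python's standard-library urllib.robotparser, which exposes the value through crawl_delay(). The second agent name and the page URL are made up, and how a given search engine actually honors the delay is up to that engine.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-Agent: MSNbot",
    "Crawl-Delay: 20",
])

# The delay, in seconds, applies only to the robot named in the record.
print("MSNbot delay:", parser.crawl_delay("MSNbot"))
# A robot with no matching record gets no delay value (None).
print("ExampleBot delay:", parser.crawl_delay("ExampleBot"))
# A crawl delay slows the robot down; it does not exclude any pages.
print("MSNbot may fetch:", parser.can_fetch("MSNbot", "http://www.example.com/page.html"))
```

A polite crawler would sleep for the returned number of seconds between requests to the same host, and fall back to its own default when the value is None.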

Googlebot does not officially support this command, however it is usually fairly well-mannered and doesn’t need it. If you are not sure which robots understand this command, a simple question presented to the search engine’s support team could easily help you with this. There is a good list of search engine robots at RobotsTxt.org with contact information if you are unsure how to reach them. It’s not always easy to know which website the robot belongs to. You may not know, for example, that Slurp belongs to Yahoo, or that Scooter belonged to AltaVista.

Meta Tag Instructions
With the availability of search engine robot technology, there are thousands of search engine robots. There just isn’t a way to list them all, along with their capabilities and disadvantages. Many of these lesser known robots don’t even attempt to view your robots.txt. So what do you do then? Many webmasters find it handy to be able to place a few commands directly into their meta tags to instruct robots. These tags are placed in the "head" section like any other meta tags.


<meta name="robots" content="noindex">

The meta tag above tells the robot not to index this page.

<meta name="robots" content="noindex,nofollow">

The tag above tells a robot that it should neither index this document nor analyze it for links.
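To check what a page actually tells robots, you can pull the robots meta tag out of the HTML. The sketch below uses Python's standard-library html.parser; the class and function names are my own, and the sample HTML is invented. It assumes the usual form of the tag, a meta element with name="robots" and a comma-separated content attribute.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated and case insensitive.
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
found = robots_directives(page)
print("index this page:", "noindex" not in found)
print("follow its links:", "nofollow" not in found)
```

A page without the tag yields an empty set, which a robot would read as no restrictions.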

Other tags you might have use of are below:





Unfortunately, there is no way to guarantee that these less than polite robots will follow your instructions in your meta tags any more than they will follow your robots.txt. In these extreme cases, it would be to your benefit to view your server logs, find out the ip address of this erring robot, and just ban it.

Bandwidth Limitations

Another complaint for having a search engine spider crawl un-instructed lies in the area of bandwidth. A search engine spider could easily eat up a gigabyte of bandwidth in a single crawl. For those of you paying for only so much bandwidth, this could be a big, if not just expensive, problem.

Without a robots.txt file, search engine spiders will request it anyway, causing a 404 Error to be presented. If you have a custom 404 Page Not Found error page, then you are going to be wasting bandwidth. A robots.txt file is a small file, and will cause less bandwidth usage than not having one. Usually the crawl-delay directive can help with this.

Some webmasters believe that another good way to keep a search engine spider from using too much bandwidth is with the revisit-after tag. However, many believe this to be a myth.

Most search engine robots, like Google, do not honor this command. If you feel that Googlebot is crawling too frequently and using too much bandwidth, you can visit Google’s help pages and fill out a form requesting Googlebot to crawl your site less often.

You can also block all robots except the ones you specify, as well as provide different sets of instructions for different robots. The robots.txt file is very flexible in this way.

Using Robots.txt for Corporate Security

While some of you are familiar with a company called Perfect 10 and its security issues, some are not. Perfect 10 is an adult company with copyrighted pictures of models. They filed a preliminary injunction against Google in August of 2005. According to BusinessWire.com, “The motion for preliminary injunction seeks to enjoin Google from copying, displaying, and distributing Perfect 10 copyrighted images. Perfect 10 filed a complaint against Google, Inc. for copyright infringement and other claims in November of 2004. It is Perfect 10's contention that Google is displaying hundreds of thousands of adult images, from the most tame to the most exceedingly explicit, to draw massive traffic to its web site, which it is converting into hundreds of millions of dollars of advertising revenue. Perfect 10 claims that under the guise of being a "search engine," Google is displaying, free of charge, thousands of copies of the best images from Perfect 10, Playboy, nude scenes from major movies, nude images of supermodels, as well as extremely explicit images of all kinds. Perfect 10 contends that it has sent 35 notices of infringement to Google covering over 6,500 infringing URLs, but that Google continues to display over 3,000 Perfect 10 copyrighted images without authorization.”

What is interesting in this situation is that the blame actually lies with Perfect 10, Inc. The company failed to direct the search engine to stay out of its image directory. Two simple lines in a robots.txt file on their web server would have easily barred Google from indexing these images in the first place, a practice which Google themselves mention in their guidelines for webmasters.

User-agent: Googlebot-Image
Disallow: /images
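As a quick check of what those two lines do, here is a sketch using Python's standard-library urllib.robotparser. The example.com URLs are made up, and this shows how one parser reads the record, not how Google's own crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot-Image",
    "Disallow: /images",
])

image = "http://www.example.com/images/model.jpg"
home = "http://www.example.com/index.html"

# The image crawler is kept out of the image directory...
print(parser.can_fetch("Googlebot-Image", image))
# ...but may still fetch everything else.
print(parser.can_fetch("Googlebot-Image", home))
# Robots not named in the record are unaffected by it.
print(parser.can_fetch("Googlebot", image))
```

Only the robot named in the record is excluded, and only from paths beginning with /images. To keep every robot out of the directory, the record would need "User-agent: *" instead.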


One good piece of advice given in an SEO forum is this: “If you want to keep something private on the web, .htaccess and passwords are your friends. If you want to keep something out of Google (or any other search engine), robots.txt and meta tags are your friends. If someone can type a URL into a browser and find your page, don't count on a secret URL remaining secret. Use passwords or robots.txt to protect data.”

Using robots.txt to keep search engines out of sensitive areas is a simple task, and a step that every webmaster can use. Search engines have been known to index members-only areas, development documents, and even employee personnel records. It is the responsibility of the webmaster to ensure the protection of their sensitive data and copyrighted material. A search engine spider cannot be expected to know the difference between copyrighted material and other data, especially when the search engine itself makes clear what an easy deterrent to this type of behavior would be. This is one of the many consequences a webmaster will face if they do not utilize their robots.txt file.


Between Clint’s article and this one, I hope you understand the importance of using a robots.txt on your web server. Ultimately, it’s up to you to help control the behaviors of search engine robots when spidering your site’s pages. Using robots.txt is easy, and there is no excuse for lack of security, spider bandwidth issues or not getting indexed because you failed to do this simple thing. If you need help generating a robots.txt, there are many websites that give you step by step instructions, or can even generate the file for you. With this powerful tool at your disposal, you need to make use of it. It’s your own fault if you don’t.