Some suggestions for robots.txt from forums
1) I have submitted Google sitemaps and Yahoo feeds for 3 of my sites. I have not included a robots.txt file in the root directory, as I want all my pages indexed by the search engines.
My question is: is it OK to go without a robots.txt file, or should I include a robots.txt file such as
User-agent: *
Disallow:
2) If you want the robots to visit all pages, there is no need for a robots.txt file.
If there is no robots.txt on your site, the robots will still request it, and each request will leave a "404 not found" entry in your log file. This is not a problem for the robots, but you might prefer to avoid those entries by serving a robots.txt such as:
User-agent: *
Disallow:
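To see what the empty Disallow line in the posts above does, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "ExampleBot", the host www.example.com and the helper is_allowed are made up for illustration, and the block-all file is added only as a contrast; it is not part of the forum posts.

```python
from urllib.robotparser import RobotFileParser

# The allow-all file quoted in the posts above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Contrast case (not from the posts): a single slash blocks the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" and www.example.com are hypothetical placeholders.
page = "http://www.example.com/any/page.html"
print("allow-all file:", is_allowed(ALLOW_ALL, "ExampleBot", page))
print("block-all file:", is_allowed(BLOCK_ALL, "ExampleBot", page))
```

Under this parser, an empty Disallow value places no restriction on robots, so the two-line file behaves like having no robots.txt at all, minus the 404 entries in the log.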
3) To avoid later security problems with hackers on your site, I would disallow any path leading into
- admin folders
- cgi-bin
- configuration folders
- any logfile/statistics pages (unless cleanly configured, stats contain URLs of all private pages as well, such as cgi-bin, admin, etc., hence online stats are a potential security risk for the site owner)
- any folders or sub-folders containing non-content pages
This keeps the number of indexed pages down to the real content and at least partially prevents hackers from finding, via Google, paths to particular scripts known at a given time to have a security issue.
Google is the prime resource for cyber criminals and the easiest way for hackers to find sites open to abuse. (I learned this lesson as a victim several times last winter; each time, the hacking attempt was initiated via a Google search result.)
Hence a typical robots.txt might look like
User-agent: *
Disallow: /cgi-bin
Disallow: /logs
Disallow: /any-software/admin
Disallow: /your_blog/trackback.php
Disallow: /some-software/include
Disallow: /your-scripts/templates
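Before uploading a file like the one above, it can help to check which paths it actually blocks. The following is a rough sketch using Python's standard urllib.robotparser; the test paths, the bot name "ExampleBot", the host www.example.com and the helper split_paths are all hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample file from the post above.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /logs
Disallow: /any-software/admin
Disallow: /your_blog/trackback.php
Disallow: /some-software/include
Disallow: /your-scripts/templates
"""

# Hypothetical paths, chosen only to exercise the rules.
TEST_PATHS = [
    "/index.html",
    "/articles/robots.html",
    "/cgi-bin/formmail.pl",
    "/logs/stats.html",
    "/any-software/admin/login.php",
    "/your_blog/trackback.php",
    "/logs-archive/2006.html",
]


def split_paths(robots_txt, paths, agent="ExampleBot",
                site="http://www.example.com"):
    """Return (allowed, blocked) lists of paths for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, blocked = [], []
    for path in paths:
        if parser.can_fetch(agent, site + path):
            allowed.append(path)
        else:
            blocked.append(path)
    return allowed, blocked


allowed, blocked = split_paths(SAMPLE_ROBOTS_TXT, TEST_PATHS)
print("allowed:", allowed)
print("blocked:", blocked)
```

Note that Disallow rules are matched as path prefixes here, so "Disallow: /logs" also blocks the made-up "/logs-archive/2006.html". Writing "Disallow: /logs/" with a trailing slash would limit the rule to that folder.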
Like almost all pages and formats on the web, robots.txt has a validator:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
and a home page:
http://www.robotstxt.org
4) The key source for all cyber crime known to me is Google.
Google strictly respects robots.txt.
Hence excluding the admin and script sections of a site in robots.txt also cuts off the major information source of cyber criminals.
Every hacker attack known to me came from a Google search; in 9 years of full-time web publishing I have not seen a single other source.
All major search engines, such as MSN, Yahoo, Google and ask.com, fully respect robots.txt, and no other search engine is significant enough to supply correct and current information relevant to cyber criminals.

