A robots.txt file is a plain text file placed on your server to tell search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing entirely, to keep certain areas of your site from being indexed, or to issue individual indexing instructions to specific search engines.
The file itself is a simple text file that can be created in Notepad. It needs to be saved to the root directory of your site, that is, the directory where your home page or index page is.
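Because the file always sits in the root directory, its address can be worked out from any page on the site. The following is a minimal Python sketch of that idea; the function name robots_url and the example.com address are illustrative assumptions, not part of any particular tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address where a spider would look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site
print(robots_url("http://www.example.com/products/list.htm"))
# prints http://www.example.com/robots.txt
```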
Why Do I Need One?
All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders or bots arrive on your site. So even if you do not currently need to exclude spiders from any part of your site, having a robots.txt file is still a good idea: it can act as a sort of invitation into your site.
There are a number of situations where you may wish to exclude spiders from some or all of your site.
The very fact that search engines are looking for one is reason enough to put it on your site. Have you looked at your site statistics recently? If your stats include a section on ‘files not found’, you are sure to see many entries where search engine spiders looked for, and failed to find, a robots.txt file on your site.
Creating the robots.txt file
There is nothing difficult about creating a basic robots.txt file. It can be created using Notepad or whatever your favorite text editor is. Each entry has just two lines:
User-Agent: [Spider or Bot name]
Disallow: [Directory or File Name]
The Disallow line can be repeated for each directory or file you want to exclude, and the whole entry can be repeated for each spider or bot you want to address.
A few examples will make it clearer.
1. Exclude a file from an individual Search Engine
You have a file, privatefile.htm, in a directory called ‘private’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:
User-Agent: Googlebot
Disallow: /private/privatefile.htm
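One way to check an entry like this before uploading it is to feed it to Python's standard-library urllib.robotparser module. This is only a sketch of such a check: real search engines use their own parsers, and SomeOtherBot is a made-up name standing in for any spider other than Googlebot.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: Googlebot
Disallow: /private/privatefile.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is named in the entry, so the file is off limits to it.
googlebot_allowed = parser.can_fetch("Googlebot", "/private/privatefile.htm")
# A bot that no entry names is not restricted by this file.
other_allowed = parser.can_fetch("SomeOtherBot", "/private/privatefile.htm")

print(googlebot_allowed)  # False
print(other_allowed)      # True
```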
2. Exclude a section of your site from all spiders and bots
You are building a new section of your site in a directory called ‘newsection’ and do not wish it to be indexed before you are finished. In this case you do not need to name each robot that you wish to exclude; you can simply use the wildcard character, ‘*’, to exclude them all.
User-Agent: *
Disallow: /newsection/
Note the forward slash at the beginning and end of the directory name: the rule matches every path that starts with /newsection/, so no file in that directory will be indexed.
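The effect of the two slashes can be seen by testing a few paths against the rule. The sketch below uses Python's standard-library urllib.robotparser; the bot name AnyBot and the file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /newsection/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A file inside the directory is excluded.
inside = parser.can_fetch("AnyBot", "/newsection/draft.htm")
# A file elsewhere on the site is unaffected.
outside = parser.can_fetch("AnyBot", "/index.htm")
# A file whose name merely begins with "newsection" is also unaffected,
# because its path does not start with "/newsection/".
sibling = parser.can_fetch("AnyBot", "/newsection.htm")

print(inside, outside, sibling)  # False True True
```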
3. Allow all spiders to index everything
Once again you can use the wildcard, ‘*’, to let all spiders know they are welcome. Just leave the second line, Disallow, empty: an empty Disallow excludes nothing.
User-agent: *
Disallow:
4. Allow no spiders to index any part of your site
This requires just a tiny change from the command above, so be careful!
User-agent: *
Disallow: /
If you use this command while building your site, don’t forget to remove it once your site is live!
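Since examples 3 and 4 differ by a single character, it can be worth testing which one you actually have. Here is a small sketch using Python's standard-library urllib.robotparser; the helper name allowed, the bot name AnyBot and the page name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # example 3
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # example 4


def allowed(rules: str, agent: str, path: str) -> bool:
    """Parse a robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(ALLOW_ALL, "AnyBot", "/page.htm"))  # True
print(allowed(BLOCK_ALL, "AnyBot", "/page.htm"))  # False
```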
Getting More Complicated
If you have a more complex set of requirements, you will need a robots.txt file with a number of different commands. Be careful when creating such a file: you do not want to accidentally shut out spiders, or block areas, that you really want indexed.
Let’s take quite a complex scenario. You want most spiders to index most of your site, with a handful of exceptions that are introduced one at a time below.
Let’s take this one in stages!
1. First you would ban all search engines from the directories you do not want indexed at all:
User-agent: *
Disallow: /cgi-bin/
Disallow: /_borders/
Disallow: /_derived/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_map/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/
It is not necessary to create a new entry for each directory; it is quite acceptable to list them under a single User-agent line, as above.
2. The next thing we want to do is to keep AltaVista out altogether. The AltaVista bot is called Scooter.
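When the list of directories is long, an entry of this shape can also be generated rather than typed by hand. This is a minimal sketch; build_group is a hypothetical helper name and the three directories are just a sample from the list above.

```python
def build_group(agent: str, paths: list[str]) -> str:
    """Build one robots.txt entry: a User-agent line plus one Disallow per path."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


entry = build_group("*", ["/cgi-bin/", "/_private/", "/_vti_bin/"])
print(entry)
```

The printed text can be pasted straight into robots.txt.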
User-Agent: Scooter
Disallow: /
This entry overrides the first one for Scooter only. A bot that finds an entry naming it follows that entry instead of the ‘*’ entry, so all other bots can index the whole site apart from the directories specified in 1 above, while Scooter can index nothing. The same applies to every named entry below: if a named bot should also stay out of the directories in 1, repeat those Disallow lines in its own entry.
3. Now you want to keep Google away from your images. Google fetches images with a separate bot, called Googlebot-Image, from the one that indexes pages generally. You have a couple of choices here:
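As a rough check of how the first two entries interact, the sketch below runs them through Python's standard-library urllib.robotparser. The directory list is shortened to one line for brevity, and ExampleBot is a made-up name for any spider that has no entry of its own.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/

User-Agent: Scooter
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Scooter has its own entry, which shuts it out of the whole site.
scooter_home = parser.can_fetch("Scooter", "/index.html")
# Any other bot falls back to the '*' entry.
other_home = parser.can_fetch("ExampleBot", "/index.html")
other_cgi = parser.can_fetch("ExampleBot", "/cgi-bin/form.pl")

print(scooter_home, other_home, other_cgi)  # False True False
```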
User-Agent: Googlebot-Image
Disallow: /images/
That will work if you are very organized and keep all your images strictly in the images folder.
User-Agent: Googlebot-Image
Disallow: /
This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.
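The difference between the two choices shows up as soon as an image lives outside the images folder. The following sketch compares them with Python's standard-library urllib.robotparser; the helper name allowed and both image paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

IMAGES_FOLDER_ONLY = "User-Agent: Googlebot-Image\nDisallow: /images/\n"
WHOLE_SITE = "User-Agent: Googlebot-Image\nDisallow: /\n"


def allowed(rules: str, path: str) -> bool:
    """Report whether Googlebot-Image may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("Googlebot-Image", path)


for path in ("/images/logo.gif", "/photos/team.jpg"):
    print(path, allowed(IMAGES_FOLDER_ONLY, path), allowed(WHOLE_SITE, path))
# /images/logo.gif False False
# /photos/team.jpg True False
```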
4. Finally, you have two pages called content1.html and content2.html, which are optimized for Google and Lycos respectively. So, you want to hide content1.html from Lycos (the Lycos spider is called T-Rex):
User-Agent: T-Rex
Disallow: /content1.html
and content2.html from Google.
User-Agent: Googlebot
Disallow: /content2.html
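Putting the four stages into one file, the whole scenario can be tested in one go. The sketch below uses Python's standard-library urllib.robotparser with a shortened directory list and the first of the two image choices; ExampleBot is a made-up name for a spider with no entry of its own. One result deserves attention: under this parser, as under the current Robots Exclusion Protocol specification (RFC 9309), a bot with its own entry follows only that entry, so Googlebot is not kept out of /cgi-bin/ unless those Disallow lines are repeated in its entry.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /_private/

User-Agent: Scooter
Disallow: /

User-Agent: Googlebot-Image
Disallow: /images/

User-Agent: T-Rex
Disallow: /content1.html

User-Agent: Googlebot
Disallow: /content2.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

checks = [
    ("Scooter", "/index.html"),
    ("Googlebot-Image", "/images/logo.gif"),
    ("T-Rex", "/content1.html"),
    ("T-Rex", "/content2.html"),
    ("Googlebot", "/content2.html"),
    ("Googlebot", "/content1.html"),
    ("Googlebot", "/cgi-bin/form.pl"),   # allowed: Googlebot ignores the '*' entry
    ("ExampleBot", "/cgi-bin/form.pl"),  # blocked by the '*' entry
]

results = {check: parser.can_fetch(*check) for check in checks}
for (agent, path), ok in results.items():
    print(f"{agent:16} {path:20} {'allowed' if ok else 'blocked'}")
```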
Source: www.outfront.net