Search engine traffic is something every site owner wants. To learn what your website is about, a search engine uses a “spider” that crawls the web, looking for updates and reading your pages.
However, you may not want the spider to read every part of your site. If you are working on a project online that you are not ready to show, if you have duplicate content, or if you simply want to ban the “spider” from your site, there is one tool you can use: the robots.txt file. Keep in mind that the file is only a request that well-behaved spiders honor, and that it is publicly readable itself, so it is not a way to protect private content.
The robots.txt file is a simple text file that can be created with any text editor. A file written in Notepad works exactly the same as one written in Dreamweaver. The file must be placed in the root folder of your site. Before a spider crawls a site, the first thing it looks for is the robots.txt file. So, even if the spider is about to visit your-domain.com/shop/order.php, it will first check whether your-domain.com/robots.txt exists.
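To illustrate where a spider looks, here is a minimal Python sketch that works out the robots.txt address from any page address. The function name robots_url is made up for this example, and the sketch assumes the page URL includes a scheme such as http://.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url.

    Hypothetical helper for illustration; assumes page_url has a scheme.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, point the path at the site root,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://your-domain.com/shop/order.php"))
# prints http://your-domain.com/robots.txt
```

Whatever folder the page lives in, the result always points at the root of the site, which is why the file has no effect if it is placed in a subfolder.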
In the file itself, you enter the rules for the search engine spiders that try to visit your site. The first thing you have to specify is which spiders the rules apply to. You do this with the “User-agent:” record. For example:
User-agent: * – this will affect all spiders that obey the robots.txt file
User-agent: Googlebot – this will apply only to the Google spider, which identifies itself as Googlebot
Once you have specified the spiders you want to target, it’s time to set the rules.
Disallow: / – this rule will stop the specified spiders from looking at your site.
Disallow: /tmp/ – this rule will stop robots from looking in the /tmp/ folder of your site, while the rest of the site will be crawled.
Disallow: /tmp/test.html – this will stop spiders from looking into a specific file.
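If you want to check how a set of rules will be read before uploading the file, Python's standard urllib.robotparser module can evaluate them. The sketch below is only an illustration: the rule set, the helper build_parser and the spider name SomeBot are invented for the example, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example rule set (an assumption for this sketch): every spider is kept
# out of /tmp/, and Googlebot is kept out of the whole site.
RULES = """\
User-agent: *
Disallow: /tmp/

User-agent: Googlebot
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Hypothetical helper: load robots.txt text into a parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# "SomeBot" is a made-up spider that falls under the "*" rules.
print(parser.can_fetch("SomeBot", "http://your-domain.com/tmp/test.html"))    # False
print(parser.can_fetch("SomeBot", "http://your-domain.com/shop/order.php"))   # True
print(parser.can_fetch("Googlebot", "http://your-domain.com/shop/order.php")) # False
```

Each group of rules starts with its own User-agent line, and a blank line separates one group from the next.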
You can also use a robots <META> tag. This tag is placed in the <HEAD> section of an individual page and has the following format:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
The NOFOLLOW value tells the spider not to follow the links on your page, and the NOINDEX value tells it not to add the page to the search index. Note that the spider has to fetch the page to see this tag, so the page should not also be disallowed in robots.txt.
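To see which values a page actually declares, you can read the tag back with Python's built-in html.parser module. This is a rough sketch, not how any particular search engine processes the tag. The class RobotsMetaParser and the function robots_directives are names invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: collects the values of robots <META> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # CONTENT is a comma-separated list such as "NOINDEX, NOFOLLOW".
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    """Return the lower-cased robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)  # True True
```

The values are compared in lower case here because the tag is commonly written in either case.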