Control the search engine bots using the robots.txt file
Search engine traffic is something every site owner aims for. To learn what your website is about, a search engine uses a “spider”, which crawls the web, looking for updates and reading your pages. However, you may not want the spider to read parts of your site. If you are working on a project online that you don’t want search engines to pick up yet, if you are using duplicate content, or if you simply want to ban the “spider” from your site, there is one tool you can use – the robots.txt file.

The robots.txt file is a simple text file that can be created with any text editor. A file written in Notepad works exactly the same as one written in DreamWeaver. The file needs to be placed in the root folder of your site. When a spider crawls a site, the first thing it looks for is the robots.txt file. So, even if the spider is about to look at your-domain.com/shop/order.php, it will first check whether your-domain.com/robots.txt is present.

In the file itself, you enter the rules for the search engine spiders that are trying to visit your site. The first thing you have to specify is which spiders the rules apply to. You do this with the “User-agent:” record. For example:

User-agent: * – this will affect all spiders that obey the robots.txt file
User-agent: Googlebot – this will apply only to the Google spider

Once you have specified the spiders you want to target, it’s time to set the rules:

Disallow: / – this rule will stop the specified spiders from crawling any part of your site.
Disallow: /tmp/ – this rule will stop robots from looking in the /tmp/ folder of your site, while the rest of the site will still be crawled.
Disallow: /tmp/test.html – this will stop spiders from reading a specific file.

Additionally, you can also specify a robots <META> tag.
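Before turning to the meta tag, the robots.txt rules above can be tried out locally. The following is a minimal sketch of how a well-behaved spider might read them, using Python's standard urllib.robotparser module. The sample rules, the bot name "OtherBot" and the page URLs are made up for illustration, and real search engines may resolve overlapping rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content built from the rules discussed above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /tmp/

User-agent: *
Disallow: /tmp/test.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# (user agent, URL) pairs to check against the rules.
checks = [
    ("Googlebot", "http://your-domain.com/tmp/page.html"),
    ("Googlebot", "http://your-domain.com/shop/order.php"),
    ("OtherBot", "http://your-domain.com/tmp/test.html"),
    ("OtherBot", "http://your-domain.com/tmp/other.html"),
]

for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:10} {url} -> {verdict}")
```

In this sketch the Googlebot group blocks everything under /tmp/, while every other spider is only kept away from /tmp/test.html.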
The robots <META> tag is set on an individual page and has the following format:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The NOFOLLOW value tells the spider not to follow the links on your page, and the NOINDEX value tells it not to add the page to the search index.

Originally published Saturday, October 24th, 2009 at 12:48 pm, updated October 22, 2009 and is filed under The Free Reseller Program.
Tags: website, SEO tips, search engine, robots.txt, spider