The Truth About Web Crawlers
Wouldn't it be nice to be able to leave some code on your web site to tell the search engine spider crawlers to make your site number one? Unfortunately, a robots.txt file or robots meta tag won't do that, but they can help the crawlers index your site better and block out the unwanted ones.

First, a few definitions:

Search engine spiders or crawlers - A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.

Robots.txt - The robots exclusion standard, or robots.txt protocol, is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website. The robots.txt protocol is purely advisory and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy.
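The crawl loop described above (start from a list of URLs, extract the hyperlinks from each page, and visit those in turn) can be sketched in a few lines of Python. This is only an illustration, so instead of real HTTP requests it crawls a made-up in-memory "web"; the FAKE_WEB pages and paths are invented for the example.

```python
from html.parser import HTMLParser

# Tiny in-memory "web": page path -> HTML, standing in for real HTTP fetches.
FAKE_WEB = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B again</a>',
    "/b": '<a href="/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start):
    """Breadth-first crawl: visit each URL once, queue newly found links."""
    to_visit, seen = [start], set()
    while to_visit:
        url = to_visit.pop(0)
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        extractor = LinkExtractor()
        extractor.feed(FAKE_WEB[url])
        to_visit.extend(extractor.links)
    return seen

print(sorted(crawl("/")))  # ['/', '/a', '/b']
```

A real crawler would replace the dictionary lookup with an HTTP fetch and add the "set of policies" mentioned above: politeness delays, a depth limit, and a check of each site's robots.txt before fetching.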
Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended; otherwise all files with names starting with that substring will match, rather than just those in the intended directory.

Meta tag - Meta tags are used to provide structured data about data. In the early 2000s, search engines veered away from reliance on meta tags, as many web sites used inappropriate keywords, or were keyword stuffing, to obtain any and all traffic possible. Some search engines, however, still take meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that cheat (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up in the rankings, these websites will go down, or on some search engines will be kicked off completely.

Index a site - The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell the harmful web crawlers to leave your web site alone, and give helpful hints to the ones you want to crawl your site. Below is an example of how to disallow a web crawler from searching your site:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

ia_archiver is the crawler name for the Wayback Machine that you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # allows you to write comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad (or any plain text editor) and save the file to the root directory of your web site as robots.txt.
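The trailing-'/' caveat above can be checked with Python's standard urllib.robotparser module, which applies the same prefix matching that crawlers use. The /duplicate path names are illustrative, chosen to match the example later in this article:

```python
from urllib import robotparser

# Rules WITHOUT the trailing slash: matches everything starting with "/duplicate".
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /duplicate",
])
print(rp.can_fetch("*", "/duplicate/page.html"))     # False (blocked, as intended)
print(rp.can_fetch("*", "/duplicate-content.html"))  # False (blocked by accident!)

# Rules WITH the trailing slash: only the directory itself is blocked.
rp2 = robotparser.RobotFileParser()
rp2.parse([
    "User-agent: *",
    "Disallow: /duplicate/",
])
print(rp2.can_fetch("*", "/duplicate/page.html"))     # False (blocked, as intended)
print(rp2.can_fetch("*", "/duplicate-content.html"))  # True (no longer caught)
```

The same module can also be pointed at a live site with set_url() and read(), which is a quick way to test your rules before a crawler does.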
Web crawlers look for this document first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages). You can deter crawlers from indexing the 'duplicate' directory by typing this into your robots.txt file:

User-agent: *
Disallow: /duplicate/

The * after User-agent says that this action applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Between each group of User-agent and Disallow lines there must be a blank line in order for the file to function correctly. So this is how you would combine the above two commands into one robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/

One thing to note that is very important: anyone can access the robots.txt file of a site, so if you have information that you don't want anyone to see, don't include it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, the crawlers won't index it anyway.

An alternative to blocking indexing of your site is to put a meta tag into the page. It looks like this:

<meta name="robots" content="noindex, nofollow">

You put this into the <head> tag of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. So, as an example,

<meta name="robots" content="noindex, follow">

tells the robot crawlers not to index the page, but to follow the hyperlinks on the page.

Did you know that Google has its own meta tag? It looks like this:

<meta name="googlebot" content="noindex, nofollow, noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. You will want this done if you update the content on your site frequently.
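A crawler reads these directives by scanning the page's meta tags, which you can do yourself with Python's standard html.parser module. A minimal sketch, using a made-up sample page for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any robots or googlebot meta tag in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() in ("robots", "googlebot"):
                self.directives.append((d["name"].lower(), d.get("content", "")))

# Hypothetical page using the meta tag discussed above.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body>Hi</body></html>'

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # [('robots', 'noindex, follow')]
```

A well-behaved crawler would then drop the page from its index when it sees noindex, while still queueing the page's links because of follow.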
This prevents the web user from seeing outdated content that stays stale because it is stored in the cache. You can use this meta tag to talk specifically to Google's robots, to avoid complications or if you are optimizing your site for Google's search engine.

Recommended software tool to automate submitting and link creation: Blogs AutoFiller - http://blog-submitter.cafe150.com
About the Author
Maksym Nesen is the leading programmer of the Oksima team. He developed Blogs Auto Filler, a product which saves time and money for people who use blogs for advertising: http://blog-submitter.cafe150.com