The World Wide Web (WWW), or the ‘Web’ for short, is a collection of billions of documents written in a way that enables them to cite each other using ‘hyperlinks’, which is why they are a form of ‘hypertext’. These documents, or ‘Web pages’, are typically a few thousand characters long, written in a diversity of languages, and cover essentially all topics of human endeavor.
The World Wide Web has become highly popular in the last few years, and is now one of the primary means of publishing information on the Internet. When the Web grew beyond a few sites and a small number of documents, it became clear that manually browsing a significant portion of the hypertext structure was no longer possible, let alone an effective method for resource discovery. Browsing is a useful but restrictive means of finding information: given a page with many links to follow, exploring them all in search of a specific piece of information is tedious and unreliable.
1.1 What are Web Crawlers?
A Web Crawler or Web Robot is a program that traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
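This recursive traversal can be sketched as a simple graph search: maintain a frontier of URLs to visit and a set of URLs already seen, so that pages citing each other do not cause infinite loops. The sketch below, in Python, replaces real HTTP fetching with a caller-supplied `fetch_links` function and a small hypothetical in-memory "web", so it illustrates only the traversal logic, not a production crawler.

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Traverse a hypertext graph breadth-first from start_url.

    fetch_links(url) returns the list of URLs referenced by that page;
    in a real crawler it would download the page over HTTP and parse
    out its hyperlinks.
    """
    seen = {start_url}            # avoid re-fetching pages (the Web has cycles)
    frontier = deque([start_url])  # pages discovered but not yet retrieved
    order = []                     # retrieval order, for illustration
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical data).
pages = {
    "/index": ["/a", "/b"],
    "/a": ["/b", "/index"],   # cycle back to the start page
    "/b": [],
}
print(crawl("/index", lambda url: pages.get(url, [])))  # → ['/index', '/a', '/b']
```

The `seen` set is the essential ingredient: without it, the cycle between `/a` and `/index` would make the recursion run forever. Real crawlers add many refinements on top of this skeleton, such as politeness delays and the robots exclusion protocol, discussed later.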
These programs are sometimes called "web robots", "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, may be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves from site to site. In reality, a robot is implemented as a single software system that retrieves documents from remote sites using standard Web protocols.
Normal Web browsers are not robots, because they are operated by a human and do not automatically retrieve referenced documents (other than inline images).
Web crawlers are almost as old as the web itself. The first crawler, Matthew Gray’s Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several...