Web Crawler

Chapter 1
Introduction

1.1 Background

The World Wide Web (WWW) is a vast network offering an enormous amount of information. This information is spread across servers all over the world, and no human could process all of this data by hand. Search engines are among the most important services for indexing and finding these pages; without them it would not be possible to retrieve information quickly and conveniently. The question is what goes on behind these search engines, and why is it possible to get relevant data so fast?

The answer is web crawlers. A web crawler combs through the internet and copies pages so that they can be indexed. Crawlers can also be used to harvest e-mail addresses (mostly for spam), validate HTML code, or check the links on a page, but they are most commonly used by search engines to keep their indexes up to date and deliver relevant, fresh search results.

The algorithm behind a web robot is quite simple: it starts with a prepared list of links (the seeds) and visits each of these pages. On each page it identifies the links it contains and adds them to the list. In this recursive way it visits all discovered links according to a set of policies.
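The sketch below illustrates this basic loop in Python. It is only a minimal illustration, not part of the original text: the seed URL, page limit, politeness delay, and the use of the third-party requests and beautifulsoup4 packages are assumptions made for the example.

```python
# Minimal breadth-first crawler sketch (illustrative only).
# Assumes: pip install requests beautifulsoup4
from collections import deque
from urllib.parse import urljoin, urlparse
import time

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=100, delay=1.0):
    """Visit pages starting from seed_urls, collecting newly found links."""
    queue = deque(seed_urls)   # the prepared list of links to visit
    visited = set()            # policy: never fetch the same URL twice

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue           # skip unreachable pages

        # Identify the links on this page and add them to the list.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).scheme in ("http", "https") and link not in visited:
                queue.append(link)

        time.sleep(delay)      # simple politeness policy

    return visited


if __name__ == "__main__":
    pages = crawl(["https://example.com"])   # hypothetical seed URL
    print(f"Crawled {len(pages)} pages")
```

Using a queue gives a breadth-first traversal of the link graph; a real crawler would add further policies on top of this, such as respecting robots.txt, limiting crawl depth, and re-visiting pages to detect changes.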

1.2 History of Web Crawlers

The first search engines were developed for small and coherent collections, but the World Wide Web (the Web) is neither small nor particularly coherent. Because the Web changes every day, new algorithms and technologies are needed to keep the indexed information up to date.

The very first crawler for the internet was the so-called World Wide Web Wanderer, developed in June 1993. It was originally designed to collect web statistics. Later it was combined with a simple indexing method, which made it possible to search URLs.

In October 1993 Aliweb appeared on the internet. It introduced some new ideas for crawling the Web to obtain better results: it relied on manually produced index files containing content such as the URL, title, description and keywords, and a spider was responsible for indexing those files...
