Crawler

HOW SEARCH ENGINES WORK
AND A WEB CRAWLER APPLICATION
Monica Peshave
Department of Computer Science
University of Illinois at Springfield
Springfield, IL 62703
mpesh01s@uis.edu
Advisor: Kamyar Dezhgosha
University of Illinois at Springfield
One University Plaza, MS HSB137
Springfield, IL 62703-5407
kdezh1@uis.edu
Abstract
The main purpose of this project is to present the anatomy of a large-scale Hypertext
Transfer Protocol (HTTP) based Web search engine, using the system architecture of
large search engines such as Google and Yahoo as a prototype. Additionally, a web crawler
is developed and implemented in Java v1.4.2 to demonstrate the operation of a
typical Web crawler.
The paper describes in detail the basic tasks a search engine performs and gives an
overview of how the search engine system works as a whole. A WebCrawler application
is implemented in the Java programming language. The GUI of the developed application
lets the user specify the start URL, the maximum number of URLs to be crawled, and the
crawl order (breadth first or depth first). This paper also lists proposed
functionalities as well as features not supported by the web crawler application.
Key Words – Spider, Indexer, Repository, Lexicon, Robot Protocol, Document indexer,
Hit list.
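
As a rough illustration of the crawl options named in the abstract (start URL, maximum number of URLs, breadth-first or depth-first order), the following is a minimal sketch and not the application's actual code. It assumes a modern Java version rather than v1.4.2, and it walks a small in-memory link map instead of fetching pages over HTTP. The class name CrawlOrderSketch, the method crawl, and the page names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlOrderSketch {

    // Returns the pages in the order they would be visited.
    // 'links' is a stand-in for fetching a page and extracting its links.
    public static List<String> crawl(Map<String, List<String>> links,
                                     String startUrl,
                                     int maxUrls,
                                     boolean breadthFirst) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued
        List<String> visited = new ArrayList<>();

        frontier.addLast(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty() && visited.size() < maxUrls) {
            // Breadth first takes the oldest queued URL (FIFO);
            // depth first takes the newest one (LIFO).
            String url = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            visited.add(url);

            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {                // add() is false for duplicates
                    frontier.addLast(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical link structure: a -> b, c; b -> d; c -> e
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("e"));

        System.out.println(crawl(links, "a", 4, true));  // breadth first
        System.out.println(crawl(links, "a", 4, false)); // depth first
    }
}
```

The only difference between the two orders in this sketch is which end of the frontier the next URL is taken from. Because the depth-first variant takes the most recently queued URL, it follows the last link listed on a page first. A real crawler would also need HTTP fetching, link extraction, and Robot Protocol handling, none of which is shown here.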
1 Introduction
Engineering a search engine is a challenging task. Search engines index tens to hundreds
of millions of web pages involving a comparable number of distinct terms. They answer
tens of millions of queries every day. Despite the importance of large-scale search
engines on the web, very little academic research has been conducted on them.
Furthermore, due to rapid advances in technology and the proliferation of the web,
creating a web search engine today is very different from doing so three years ago.
There are differences in the ways various search engines work, but they all perform
three basic tasks:
1. They search the...
