
Agents, Crawlers, Robots and Spiders




COIN 51 (Winter Quarter, ONLINE) FUNDAMENTALS OF INTERNET TECHNOLOGY
By Priya (Middle name), Puvvada (Last name)
Monday Feb 02 2004
Agents, crawlers, robots and spiders

INTRODUCTION

There are over 4 million sites on the Web now, and new ones are added every single day. The Web, being a collection of web pages that reside on millions of computers all over the world, is not organized in any orderly fashion, as we would hope it to be. There are no catalogs listing titles, authors, and topics in alphabetical, chronological, or numerical order. This is the main reason why search engines were developed. To my knowledge, there are two main types of search engines: individual search engines and meta search engines.

A search engine does not exactly go forth and search these millions of computers for the information users asked for. Search engines are programs that search through databases of HTML documents that are indexed by key words. Search engines rely on software programs called robots to build these databases. A search engine spider is an automated software program used to locate and collect data from web pages for inclusion in a search engine's database and to follow links to find new pages on the World Wide Web. The term "agent" is more commonly applied to web browsers and mirroring software. (1)
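The idea of a database "indexed by key words" can be sketched in a few lines of Python. This is only a minimal illustration, not how any real search engine stores its data; the page URLs, the page text, and the function names build_index and search are all invented for the example.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, standing in for documents a robot has collected.
pages = {
    "http://example.com/a": "Internet robots build search databases",
    "http://example.com/b": "Spiders follow links on the Internet",
}
index = build_index(pages)
print(sorted(search(index, "internet robots")))  # ['http://example.com/a']
```

The search never touches the pages themselves; it only looks words up in the index that was built beforehand, which is the point made in the paragraph above.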

Web robots are often referred to as crawlers, spiders, wanderers, worms, ants, or bots for short. In general, robots don't literally move from one site to another. Rather, the software visits a site, scans it for links to other sites, and then moves on to those other sites. Robots of major search sites can visit a million or more sites a day. They build databases by indexing the contents of web sites. Depending on how they were programmed, indexing robots may parse a page's title, its description, its first few paragraphs, its Meta tags, or even the entire body of the document. So, if we use the words "Internet" and "robots" 50 times in this page, there is a good chance this page will be pulled up by a search engine in response to a request for "Internet robots." But then this page would obviously not make any sense to anybody. This is where Meta tags come in handy. Tags are codes that tell browsers how to display text, images, and other files in a web page. Meta tags are different because they provide information that is not displayed on the web page itself. This includes the author, content, and description of the page. Robots and search engines use the keywords and descriptions in Meta tags to index HTML documents.
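To make this concrete, here is a small Python sketch of how an indexing robot might pull the title, the Meta tags, and the links out of one page, using the standard html.parser module. The class name PageScanner and the sample page are made up for this example, and a real robot would need to handle many more cases.

```python
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collect the title, named Meta tags, and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            # Meta tags carry information that is never displayed on the page.
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Links are what the robot follows to find more pages.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page used only for this example.
html = """<html><head>
<title>Internet Robots</title>
<meta name="description" content="How web robots index pages">
<meta name="keywords" content="robots, spiders, crawlers">
</head><body><a href="/spiders.html">Spiders</a></body></html>"""

scanner = PageScanner()
scanner.feed(html)
print(scanner.title)
print(scanner.meta["keywords"])
print(scanner.links)
```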

Robots serve many purposes other than indexing. There are robots that do nothing but check or validate links and web pages, robots that monitor new sites, and robots that verify mirror sites, which are web sites replicated on other networks or servers.

When we look through server logs or web site traffic reports, we have probably come across some weird and wonderful names for search engine spiders, including "Fluffy the Spider" and Slurp. Depending upon the type of web traffic reports we receive, we may find spiders listed in the "Agents" section of our statistics. Not all spiders are good. Some agents are generated by software such as Teleport Pro, an application that allows people to download a full "mirror" of our site onto their hard drives for viewing later on, or sometimes for more insidious purposes such as plagiarism. If we have a large or image-heavy site, the practice of web site stripping could also have a serious impact on our bandwidth usage each month.

What is a WWW robot? In Brief?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. Even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
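The traversal in this definition can be sketched in Python as follows. To keep the example runnable without a network, the "Web" here is just an invented dictionary of pages and their links, and the get_links argument stands in for fetching a document and extracting its references. The sketch uses a queue rather than literal recursion, which visits the same set of documents while avoiding endless loops between pages that link to each other.

```python
from collections import deque

def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first from start, following each link once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)          # "retrieve" the document
        for link in get_links(url):  # then queue every referenced document
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link structure standing in for real HTTP requests.
web = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}
print(crawl("/index.html", lambda url: web.get(url, [])))
# ['/index.html', '/a.html', '/b.html', '/c.html']
```

The max_pages limit is one simple example of the kind of heuristic the definition mentions; a program with such a limit is still a robot.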

Each search engine uses a Web robot to build its database. A robot is a program that typically searches the Web to find new web sites and to update information about old web sites that are already in the database. One of the robot's important tasks is to delete information from the database when a web site no longer exists. Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus, but in reality a robot simply visits sites by requesting documents from them. (2)

What are Web crawlers? In Brief?

WebCrawler is a primitive search engine that appears to judge relevance simply by the number of occurrences of a string in a document, without regard to where in the document the string appears. This means large pages are consistently at the top of the results returned from WebCrawler. WebCrawler ignores META tags, and timeliness is not a consideration. Machine-calculated samples of text are returned as summaries. A web crawler is the same thing as a robot, but note that WebCrawler is also the name of a specific robot. Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If, for example, we change our web pages, crawler-based search engines eventually find these changes, and that can affect how we are listed. Page titles, body copy and other elements all play a role.
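The effect of ranking purely by the number of occurrences can be shown with a short Python sketch. This is not WebCrawler's actual code, which we have not seen; it only illustrates the behavior described above, using invented pages and a made-up function name.

```python
def rank_by_occurrences(pages, term):
    """Order pages by how many times term appears, most first."""
    term = term.lower()
    counts = {url: text.lower().count(term) for url, text in pages.items()}
    matching = [url for url in counts if counts[url] > 0]
    return sorted(matching, key=lambda url: counts[url], reverse=True)

# Hypothetical pages: a short focused one and a long repetitive one.
pages = {
    "/focused.html": "A short guide to the spider.",
    "/long.html": "spider " * 40 + "and much unrelated text",
    "/other.html": "Nothing relevant here.",
}
print(rank_by_occurrences(pages, "spider"))  # ['/long.html', '/focused.html']
```

The long page wins only because it repeats the word more often, which is why a scheme like this tends to push large pages to the top.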

All crawler-based search engines have the basic parts described below, but there are differences in how these parts are tuned. That is why the same search on different search engines often produces different results. Some of the significant differences between the major crawler-based search engines are summarized on the Search Engine Features Page. Information on this page has been drawn from the help pages of each search engine, along with knowledge gained from articles, reviews, books, independent research, tips from others and additional information received directly from the various search engines. (3)

Crawler-based search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.

Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed." Until it is indexed -- added to the index -- it is not available to those searching with the search engine.

Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. (4)
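The difference between a page being "spidered" and being "indexed" can be modeled with a toy Python class. Everything here is an assumption made for illustration: the class name, the method names, and the very simple ranking by word count are not taken from any real search engine.

```python
class SearchEngine:
    """Toy model of the three parts: spider, index, and search software."""

    def __init__(self):
        self.pending = {}   # pages the spider has read but not yet indexed
        self.index = {}     # pages available to searchers

    def spider(self, url, text):
        """Record a page the spider has just read."""
        self.pending[url] = text

    def update_index(self):
        """Move every pending page into the index."""
        self.index.update(self.pending)
        self.pending.clear()

    def search(self, term):
        """Return indexed URLs whose text contains term, best match first."""
        term = term.lower()
        hits = [(text.lower().count(term), url)
                for url, text in self.index.items()
                if term in text.lower()]
        return [url for count, url in sorted(hits, key=lambda h: (-h[0], h[1]))]

engine = SearchEngine()
engine.spider("/new.html", "A new page about spiders")
print(engine.search("spiders"))   # [] because the page is not yet indexed
engine.update_index()
print(engine.search("spiders"))   # ['/new.html']
```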

What is an agent? In Brief?

The word "agent" is used with many meanings in computing these days and can imply various functions. Types of agents include:

Autonomous agents: These are programs that actually travel between sites, deciding themselves when to move and what to do. However, these types of programs can only travel between special servers and are currently not widespread on the Internet.

Intelligent agents: These are programs that help users perform various tasks, such as choosing a product or service, or helping to fill out an online form. They generally have little to do with networking.

User-agent: This is a technical name for programs that perform networking tasks for a user. Web user-agents include programs like Netscape Navigator and Microsoft Internet Explorer, and e-mail user-agents include programs such as Qualcomm Eudora. (5)
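A user-agent usually announces its name to the web server in a User-Agent header sent with each request. The Python sketch below only builds such a request with the standard urllib.request module and does not send it; the agent name ExampleBot and its URL are invented for the example.

```python
from urllib.request import Request

# Hypothetical agent name and info URL; a real program would use its own.
USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot.html)"

def build_request(url, user_agent=USER_AGENT):
    """Create an HTTP request that announces the given agent name."""
    return Request(url, headers={"User-Agent": user_agent})

request = build_request("http://example.com/index.html")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```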

What are Spiders? In Brief?

Spiders are the same as robots, but the name sounds cooler in the press. Those from the major search engines can sometimes be identified from their host names. These often incorporate part of the search engine's name or the company's name. For example, one of WebCrawler's host names is spidey.webcrawler.com. A better way of spotting spiders is to look for their agent names, or what some people call browser names. Spiders have their own names, just like browsers. For example, Netscape identifies itself by saying Mozilla. AltaVista's spider says Scooter, while HotBot's spider is named Slurp. There are many ways of finding spiders as they visit our sites. The two most common ways are to collect the information as each user visits or to evaluate our log files at a later date. Both methods are good, but not all log files are alike. Each server can have different information and some do not allow changes to the information collected.

The most important pieces of information that we need are:
a) The IP address of the visitor.
b) The User Agent of the program that they are using.
c) The file that they are requesting.
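As an illustration, these three pieces of information can be pulled out of a server log line with a regular expression. The sketch below assumes the widely used "combined" access log layout; as noted above, not all log files are alike, so the pattern would need adjusting for other servers. The log entry itself, including the version number after Scooter, is invented.

```python
import re

# Assumes the common "combined" access log layout; other servers differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:\S+) (?P<file>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the visitor's IP, requested file, and User Agent, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical log entry written for this example.
line = ('192.0.2.10 - - [02/Feb/2004:10:15:32 -0800] '
        '"GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/3.2"')
print(parse_log_line(line))
```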

With these three pieces of information we can collect other information that will allow us to determine whether the visitor is a spider or not. The User Agent lets us know the name of the program that is being used by every visitor to our site. Most people who visit will use Netscape or Internet Explorer. Both of these web browsers have the word "mozilla" somewhere in the User Agent of the program. Most spiders use their own names as the User Agent and even give us information about themselves, like the version number or a web site that we can visit to get more information about them. Although the User Agent is a great way to find out who is visiting our site, it can be faked. Both spiders and people fake User Agents for the purpose of seeing what may be behind a simple cloaking program (see User Agent Cloak). With a simple User Agent faking program we can appear to be any browser or spider that we wish. This means that any spiders that we find through the User Agents must be thoroughly verified before adding them into our cloaking program.
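A first guess based on the User Agent alone might look like the Python sketch below. The list of spider names contains only the two mentioned in this paper, the sample User Agent strings are illustrative, and the function name is made up. Because the User Agent can be faked, the result is only a guess that still has to be verified.

```python
# Spider names mentioned in this paper; a real list would be much longer.
KNOWN_SPIDERS = ("scooter", "slurp")

def classify_user_agent(user_agent):
    """Guess 'spider', 'browser', or 'unknown' from the User Agent alone."""
    agent = user_agent.lower()
    # Check spider names first, in case a spider's string also says "mozilla".
    if any(name in agent for name in KNOWN_SPIDERS):
        return "spider"
    if "mozilla" in agent:
        return "browser"
    return "unknown"

# Illustrative User Agent strings, not taken from a real log.
for agent in ("Scooter/3.2",
              "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
              "Teleport Pro/1.29"):
    print(agent, "->", classify_user_agent(agent))
```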

Once we have the IP address of the potential spider, we have the most powerful piece of information we could want. With the IP address we can find out who owns the rights to use that address, the name of the computer (provided it has one), and the path that it takes to get from our computer to the IP address that we are interested in. With that information we will be able to find out who owns the computer that visited our site and, more often than not, for what purpose the visit was intended. Some search engines are "nice" and use the name of their spider in the DNS name of the computer. AltaVista uses the word "scooter" in many of the names of the computers that run its spider. Others have consistent naming conventions. For example, most of Excite's computers seem to be named after musical instruments. Most have a naming convention that is only understood by the people who name them. (6)
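Looking up the name of the computer behind an IP address can be sketched in Python with the standard socket module. The helper names below are invented, and so is the host name in the stand-in lookup table, which is there only so the example runs without a network connection. Finding a spider's name in the host name is a hint, not proof.

```python
import socket

def reverse_lookup(ip_address):
    """Return the DNS name registered for an IP address, or None."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None

def spider_name_in_host(ip_address, spider_names, resolver=reverse_lookup):
    """Return the first spider name found in the visitor's host name."""
    hostname = resolver(ip_address)
    if hostname is None:
        return None
    hostname = hostname.lower()
    for name in spider_names:
        if name.lower() in hostname:
            return name
    return None

# A stand-in resolver with an invented host name, so the example runs offline.
fake_dns = {"192.0.2.10": "scooter1.example.com"}
print(spider_name_in_host("192.0.2.10", ["scooter", "slurp"], fake_dns.get))
```

Leaving out the third argument makes the function use a real reverse DNS lookup instead of the stand-in table.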