Francis Crimmins. Focussed Crawling Review

This review is taken from: http://flame.cs.dal.ca/~eem/Research/Research/focused-crawler-review.html

Francis Crimmins, 10 September 2001. dev.funnelback.com/focused-crawler-review.html

1 Abstract

This document gives a literature review of focused crawling.

2 Introduction

A web crawler is a program which automatically traverses the web by downloading documents and following links from page to page [Koster 1999]. A general-purpose web crawler normally tries to gather as many pages as it can from a particular set of sites. In contrast, a focused crawler is designed to only gather documents on a specific topic, thus reducing the amount of network traffic and downloads.
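As a point of reference, here is a minimal sketch of such a general-purpose crawler in Python, using only the standard library. The regex-based link extraction, the function name and the absence of politeness rules (robots.txt, crawl delays, error handling) are simplifications for illustration, not anything prescribed by the literature; a focused crawler would differ mainly in how it filters and orders the queue.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch pages and follow links until max_pages."""
    queue = deque(seed_urls)              # FIFO queue gives breadth-first order
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                      # skip pages that cannot be fetched
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)     # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```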

3 Focused Crawling

One of the first focused web crawlers is looked at in [De Bra et al. 1994]. This work describes a client-based real-time retrieval system for hypertext documents, based on depth-first search. The "school of fish" metaphor is used, with multiple processes or threads following links from pages. The "fish" follow more links from relevant pages, based on keyword and regular expression matching. The authors acknowledge that this type of system can make heavy demands on the network, and propose various caching strategies to deal with this.
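A rough sketch of this "fish" behaviour is given below. It assumes an in-memory link graph and page texts so the example stays self-contained; the energy budget, the function names and the simple regex relevance test are illustrative assumptions rather than the exact algorithm from the paper.

```python
import re

def fish_search(graph, texts, start, pattern, energy=3):
    """Depth-first 'fish' search over an in-memory web (graph: url -> links,
    texts: url -> page text). Relevant pages replenish the energy budget."""
    regex = re.compile(pattern, re.IGNORECASE)
    relevant = []
    stack = [(start, energy)]             # LIFO stack gives depth-first order
    seen = {start}
    while stack:
        url, budget = stack.pop()
        is_relevant = bool(regex.search(texts.get(url, "")))
        if is_relevant:
            relevant.append(url)
        # Children of relevant pages get a fresh budget; children of
        # irrelevant pages get one less, so the "fish" eventually dies off.
        child_budget = energy if is_relevant else budget - 1
        if child_budget <= 0:
            continue
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                stack.append((link, child_budget))
    return relevant
```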

Experiences with a focused crawler are related in [Chakrabarti et al. 1999]. Their research goals included looking at linkage sociology (who links to whom), sites vs. pages, and semi-supervised learning. The crawler starts from a canonical topic taxonomy and user-specified starting points (e.g. bookmarks). Users mark interesting pages as they browse; these pages are then placed in a category in the taxonomy. This was bootstrapped using the Yahoo hierarchy (260,000 documents).

The main components of the focused crawler were a classifier, a distiller and a crawler. The classifier makes relevance judgements on pages to decide on link expansion, while the distiller estimates the centrality of pages to determine visit priorities; the latter is based on connectivity analysis. They discuss the harvest ratio, which measures the rate at which relevant pages are acquired and how effectively irrelevant pages are filtered away.
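The sketch below shows how a harvest ratio could be tracked during such a crawl. `fetch` and `extract_links` are hypothetical helpers, `is_relevant` stands in for the classifier, and the distiller's priority ordering is omitted for brevity; expanding links only from relevant pages is a simplification of the paper's approach.

```python
from collections import deque

def focused_crawl(seeds, fetch, extract_links, is_relevant, max_pages=500):
    """Crawl from the seeds, expanding links of relevant pages only, and
    report the harvest ratio (relevant pages / pages fetched)."""
    queue, seen = deque(seeds), set(seeds)
    fetched = relevant = 0
    while queue and fetched < max_pages:
        page = fetch(queue.popleft())
        if page is None:
            continue
        fetched += 1
        if is_relevant(page):
            relevant += 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return relevant / fetched if fetched else 0.0
```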

They state that it is desirable to explore out from keyword-based and limited-radius search for resources. Another observation was that the web graph is rapidly mixing: random links lead to random topics within an extremely short radius. At the same time, long paths and large subgraphs with topical coherence do exist.

Another type of focused crawling is looked at in [Cho et al. 1998]. The research involved making crawling more efficient by crawling "more important" pages first. They look at various measures of importance for a page, e.g. similarity to a driving query, the number of pages pointing to the page (backlinks), PageRank and location (in a hierarchy). The PageRank of a particular page is recursively defined as the weighted sum of the PageRanks of the pages which point to it. They implemented a crawler which used various ordering metrics based on these importance measures. The work was carried out on a copy of the Stanford University web (approximately 225,000 pages).
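The recursive definition can be made concrete with a small power-iteration sketch over a toy link graph; the damping factor, the iteration count and the crude handling of pages without out-links are conventional simplifications rather than details taken from the paper.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to its out-links."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue                  # sink pages simply leak rank here
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```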

They found that the crawler which used the backlink count behaved like a depth-first search, visiting pages in a particular "cluster" before moving on. It was often biased by the set of starting points. The crawler which used PageRank did not seem to be affected by this, and combined breadth and depth in a better way. They also found that using a backlink count does not work as well in a small domain, e.g. a subset of pages.

A site mapping application which uses focused crawling is described in [Hersovici et al. 1998]. The crawler starts from a particular point and tries to follow links to relevant documents. It uses the vector space model to compute the relevance of a particular page to the driving query. Child pages inherit this score, modified by some decay factor. They also use the text around links as well as the anchor text itself to decide whether to follow a particular link. This work is based on improvements to the algorithms detailed in [De Bra et al. 1994].
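A minimal sketch of this score propagation is shown below; the bag-of-words cosine similarity, the decay and anchor weights, and the way the inherited and neighbourhood scores are combined are illustrative assumptions, not the exact Shark-Search formula.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def child_score(query, parent_score, anchor_text, context_text,
                decay=0.5, anchor_weight=0.8):
    """Priority of a child link: inherited, decayed parent relevance plus the
    relevance of the anchor text and of the text surrounding the link."""
    inherited = decay * parent_score
    neighbourhood = (anchor_weight * cosine(query, anchor_text)
                     + (1 - anchor_weight) * cosine(query, context_text))
    return inherited + neighbourhood      # higher score -> crawled earlier
```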

A type of focused crawling which concentrates on the "hidden web" is looked at in [Raghavan & Garcia-Molina 2000]. Here the objective is to gain access to content "hidden" behind web search forms. The researchers propose a task-specific, human-assisted approach, and built a system called HiWE (Hidden Web Exposer). They model a form as a set of (element, domain) pairs and try to determine suitable input values based on labels, form layout etc. The label value set (LVS) is populated using explicit initialization, built-in categories (dates), wrapped data sources and crawling experience. The crawler seems to perform better on larger forms (more descriptive labels and finite domains).
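The sketch below illustrates the general idea of matching form elements against a label value set; the `LVS` contents, the substring matching and the function names are invented for illustration and are much cruder than HiWE's approximate label matching.

```python
LVS = {                                   # label value set: category -> values
    "state":   ["California", "Texas", "New York"],
    "date":    ["2001-01-01", "2001-06-30"],
    "company": ["IBM", "Compaq"],
}

def candidate_values(form_elements):
    """form_elements: list of (label, finite_domain_or_None) pairs.
    Returns a mapping from label to the candidate values to submit."""
    assignments = {}
    for label, domain in form_elements:
        if domain:                        # finite domain, e.g. a select box
            assignments[label] = list(domain)
            continue
        words = label.lower().split()
        for category, values in LVS.items():
            if any(category in word for word in words):
                assignments[label] = values
                break
    return assignments

print(candidate_values([("Company Name", None), ("State", ["CA", "TX"])]))
```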

Work on evaluating different crawling strategies is described in [Menczer et al. 2001]. Classifiers were built for each of 100 topics, to be used to evaluate the crawled pages. The authors argue that a good focused crawler should remain in the vicinity of the topic in vector space. They plot a trajectory over time and assess the crawlers based on their ability to remain on topic. Three different focused crawling strategies were evaluated:

  • BestFirst crawler: a priority queue ordered by the similarity between the topic and the page where the link was found (a sketch of this strategy is given below).
  • PageRank crawler: crawls in PageRank order, recomputing the ranks every 25 pages.
  • InfoSpiders: uses a neural net with back-propagation and considers the text around links.

It was found that BestFirst performed best, followed by InfoSpiders and then PageRank. The researchers suggest that PageRank is too general for use in topic-driven tasks.
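A sketch of the BestFirst strategy is given below; `fetch`, `extract_links` and `similarity` are hypothetical helpers, and seeding the frontier with a fixed score is an assumption made to keep the example short.

```python
import heapq

def best_first_crawl(topic, seeds, fetch, extract_links, similarity, max_pages=500):
    """Expand the link whose source page was most similar to the topic."""
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen, visited = set(seeds), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        visited.append(url)
        score = similarity(topic, page)         # relevance of the source page
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited
```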

The use of context graphs to guide a focused crawler is detailed in [Diligenti et al. 2000]. The authors state that a major problem in this area is appropriate credit assignment to pages during the crawl. For example, some off-topic pages lead to on-topic pages, as a result of page hierarchies. To combat this, a context focused crawler was built. This uses a general search engine to get the pages linking to a specific document and build up a "context graph" for the page. This is then used to train a set of classifiers to assign documents to different categories based on their expected link distance to the target. Graphs and classifiers are constructed for each seed document, with layers being built up to a specified level. Thus the crawler gains knowledge about topics that are directly or indirectly related to the target topic, as well as a very simple model of the paths to the target.

A Naive Bayes classifier is used for each layer, and the shortest paths are tried first. The context focused crawler (CFC) was compared to an ordinary breadth-first crawler and a "traditional" focused crawler. The metric used was the proportion of relevant downloaded pages. The experiments showed that the CFC maintained a higher level of relevance.
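The following sketch shows how per-layer classifiers could drive the visit order. The `layer_classifiers` mapping is a set of hypothetical callables returning a score per layer, whereas the paper trains a Naive Bayes classifier for each layer of the context graph.

```python
import heapq

def context_focused_crawl(seeds, fetch, extract_links, layer_classifiers,
                          max_pages=500):
    """layer_classifiers: dict mapping layer number (0 = target) to a callable
    that scores how well a page fits that layer. Links found on pages that are
    predicted to be close to a target are expanded first."""
    frontier = [(0, url) for url in seeds]      # (priority, url); lower = sooner
    heapq.heapify(frontier)
    seen, targets = set(seeds), []
    while frontier and len(targets) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        # Predict the page's layer, i.e. its expected link distance to a target.
        scores = {layer: clf(page) for layer, clf in layer_classifiers.items()}
        predicted = max(scores, key=scores.get)
        if predicted == 0:
            targets.append(url)                 # layer 0 pages are on topic
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (predicted, link))
    return targets
```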

A focused crawler which tries to learn the linkage structure while crawling is described in [Aggarwal et al. 2001]. This involves looking for specific features in a page which make it more likely that the page links to a given topic. These features may include page content, URL structure and link structure (siblings etc.). The authors claim that the learning approach is a more general framework than assuming a standard pre-defined structure while crawling.
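One way to picture this is a scorer that learns, during the crawl, which features co-occur with relevant pages. The sketch below uses only URL tokens and a simple smoothed lift score, which is a much reduced version of the predicate-and-feature framework in the paper.

```python
import re
from collections import Counter

class LinkScorer:
    """Learns, while crawling, which URL tokens co-occur with relevant pages
    and scores unseen URLs by the lift of their tokens over the base rate."""

    def __init__(self):
        self.feature_seen = Counter()       # token -> pages whose URL had it
        self.feature_relevant = Counter()   # token -> relevant pages with it
        self.pages = self.relevant = 0

    @staticmethod
    def tokens(url):
        return set(re.split(r"[^a-z0-9]+", url.lower())) - {""}

    def update(self, url, is_relevant):
        self.pages += 1
        self.relevant += int(is_relevant)
        for token in self.tokens(url):
            self.feature_seen[token] += 1
            if is_relevant:
                self.feature_relevant[token] += 1

    def score(self, url):
        base = (self.relevant + 1) / (self.pages + 2)    # smoothed base rate
        lifts = [((self.feature_relevant[t] + 1) / (self.feature_seen[t] + 2)) / base
                 for t in self.tokens(url)]
        return sum(lifts) / len(lifts) if lifts else 1.0
```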

A web topic management system (WTMS) is looked at in [Mukherjea 2000]. It uses a focused crawling technique which only downloads pages which are "near" a relevant page (parents, children and siblings), based on a vector space representation of pages. The crawler will also load all URLs containing the topic keyword. Finally, it keeps a relevance threshold for a particular area (directory), and will stop downloading from that location if relevance drops below a certain level. The other aspect of this work is how to visualize topics, using logical groupings of sites based on hubs and authorities.
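The per-directory cut-off could look roughly like the sketch below; the threshold value, the minimum sample count and the class interface are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

class DirectoryFilter:
    """Tracks average page relevance per directory and blocks directories
    whose average has fallen below the threshold."""

    def __init__(self, threshold=0.3, min_samples=5):
        self.stats = defaultdict(lambda: [0.0, 0])   # dir -> [score sum, count]
        self.threshold, self.min_samples = threshold, min_samples

    @staticmethod
    def directory(url):
        parts = urlsplit(url)
        return parts.netloc + parts.path.rsplit("/", 1)[0]

    def record(self, url, relevance):
        entry = self.stats[self.directory(url)]
        entry[0] += relevance
        entry[1] += 1

    def allow(self, url):
        total, count = self.stats[self.directory(url)]
        if count < self.min_samples:
            return True                              # not enough evidence yet
        return total / count >= self.threshold
```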

The use of reinforcement learning for building domain-specific search engines is briefly looked at in [McCallum et al 1999]. Naive Bayes classifiers are used to classify hyperlinks based on both the full page text and the anchor text. This is used to compute a "reward value" for each link, which is used to train a crawler using reinforcement learning. Tests on computer science department web sites showed that the focused crawler found more research papers in less time than an ordinary breadth-first crawler.
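The reward computation can be illustrated on a toy link graph, as sketched below. Valuing a link as a discounted sum of the on-topic pages reachable from it follows the general idea described above, while the graph, the discount factor and the function names are invented for the example; in training, such (link, value) pairs would be used to fit the classifier that scores links at crawl time.

```python
from collections import deque

def discounted_reward(graph, start, is_target, gamma=0.5):
    """Value of crawling `start`: discounted count of on-topic pages reachable
    from it, with nearer pages counting for more (breadth-first distances)."""
    value, seen, queue = 0.0, {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if is_target(url):
            value += gamma ** depth
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return value

graph = {"dept": ["group", "news"], "group": ["paper1", "paper2"], "news": []}
print(discounted_reward(graph, "dept", lambda u: u.startswith("paper")))
```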

Finally, a study reported in [Najork & Wiener 2001] shows that breadth-first search discovers high-quality pages early on in the crawl. The Mercator web crawler and a connectivity server were used in the experiments, with over 300 million pages being downloaded. Like the work reported in [Cho et al. 1998], PageRank was used as the measure of quality of a page. The authors speculate that the most important pages have many links from numerous hosts, and these will be found early on, regardless of the start point.

4 Conclusions

A review of focused crawling has been presented.

5 References

[Aggarwal et al. 2001] "Intelligent Crawling on the World Wide Web with Arbitrary Predicates", C. Aggarwal, F. Al-Garawi and P. Yu. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001. http://www10.org/cdrom/papers/110/index.html

[Chakrabarti et al. 1999] "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", S. Chakrabarti, M. van den Berg and B. Dom. In Proceedings of the 8th International WWW Conference, Toronto, Canada, May 1999. http://www8.org/w8-papers/5a-search-query/crawling/index.html

[Cho et al. 1998] "Efficient Crawling Through URL Ordering", J. Cho, H. Garcia-Molina, L. Page. In Proceedings of the 7th International WWW Conference, Brisbane, Australia, April 1998. http://www7.scu.edu.au/programme/fullpapers/1919/com1919.htm

[De Bra et al. 1994] "Information Retrieval in Distributed Hypertexts", P. De Bra, G. Houben, Y. Kornatzky and R. Post. In Proceedings of the 4th RIAO Conference, 481 - 491, New York, 1994. http://citeseer.nj.nec.com/debra94information.html

[Diligenti et al. 2000] "Focused Crawling Using Context Graphs", M. Diligenti, F. Coetzee, S. Lawrence, C. Giles and M. Gori. In Proceedings of the 26th International Conference on Very Large Databases (VLDB 2000), Cairo, Egypt, September 2000. http://citeseer.nj.nec.com/diligenti00focused.html

[Hersovici et al. 1998] "The Shark-Search Algorithm - An Application: Tailored Web Site Mapping", M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim and S. Ur. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998. http://www7.scu.edu.au/programme/fullpapers/1849/com1849.htm

[Koster 1999] "The Web Robots Pages", M. Koster. 1999. http://info.webcrawler.com/mak/projects/robots/robots.html

[McCallum et al 1999] "Building Domain-Specific Search Engines with Machine Learning Techniques", A. McCallum, K. Nigam, J. Rennie, and K. Seymore. In 1999 AAAI Spring Symposium on Intelligent Agents in Cyberspace, Stanford University, USA, March 1999. http://www.ri.cmu.edu/pubs/pub_2716.html

[Menczer et al. 2001] "Evaluating Topic-Driven Web Crawlers", F. Menczer, G. Pant, P. Srinivasan and M. Ruiz. In Proceedings of the 24th Annual International ACM/SIGIR Conference, New Orleans, USA, 2001. http://dollar.biz.uiowa.edu/~fil/papers.html

[Mukherjea 2000] "WTMS: A System for Collecting and Analysing Topic-Specific Web Information", S. Mukherjea. In Proceedings of the 9th International World Wide Web Conference, Amsterdam, Netherlands, May 15-19, 2000. http://www9.org/w9cdrom/293/293.html

[Najork & Wiener 2001] "Breadth-First Search Crawling Yields High-Quality Pages", M. Najork and J. Wiener. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001. http://www10.org/cdrom/papers/208/index.html

[Raghavan & Garcia-Molina 2000] "Crawling the Hidden Web", S. Raghavan and H. Garcia-Molina. Stanford Digital Libraries Technical Report, 2000. http://dbpubs.stanford.edu:8090/pub/2000-36