Focused Crawling and Collection Synthesis. Donna Bergmark, Cornell Information Systems. CUL Metadata WG Meeting, December 20, 2002.


1 Focused Crawling and Collection Synthesis. Donna Bergmark, Cornell Information Systems

2 Outline: Crawlers; Collection Synthesis; Focused Crawling; Some Results; Student Project (Fall 2002)

3 Definition: Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

4 Crawlers – some background: Resource discovery; Crawlers and internet history; Crawling and crawlers; Mercator

5 Resource Discovery. Finding info on the Web: –Surfing (random strategy, goal is serendipity) –Searching (inverted indices; specific info) –Crawling (“all” the info). Uses for crawling: –Find stuff –Gather stuff –Check stuff

6 Crawlers and internet history: 1991: HTTP; 1992: 26 servers; 1993: 60+ servers, self-register, archie; 1994 (early): first crawlers; 1996: search engines abound; 1998: focused crawling; 1999: web graph studies; 2002: use for digital libraries

7 Crawling and Crawlers: The Web overlays the internet; a crawl overlays the Web. [Figure label: seed]

8 Crawler Issues: The web is so big; Visit order; The URL itself; Politeness; Robot traps; The hidden web; System considerations

9 Standard for Robot Exclusion: Martijn Koster (1994); http://any-server:80/robots.txt; Maintained by the webmaster; Forbids access to pages and directories; Commonly excluded: /cgi-bin/; Adherence is voluntary for the crawler
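
As an illustration of the exclusion standard on the slide above, here is a minimal sketch of how a polite crawler might consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module; the file contents, the agent name "example-crawler", and the URLs are made up for the example, and a real crawler would first download the file from the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would fetch this from
# http://<server>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "example-crawler" is an invented user-agent name.
allowed = parser.can_fetch("example-crawler", "http://any-server/index.html")
blocked = parser.can_fetch("example-crawler", "http://any-server/cgi-bin/search")
print(allowed, blocked)
```

Nothing forces the crawler to make this check, which is the sense in which adherence is voluntary.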

10 Robot Traps: Cycles in the Web graph; Infinite links on a page; Traps set out by the Webmaster

11 The Hidden Web: Dynamic pages increasing; Subscription pages; Username and password pages; Research in progress on how crawlers can “get into” the hidden web

12 System Issues: Crawlers are complicated systems; Efficiency is of utmost importance; Crawlers are demanding of system and network resources

14 Mercator Features: Written in Java; One file configures a crawl; Can add your own code: –Extend one or more of Mercator’s base classes –Add totally new classes called by your own; Industrial-strength crawler: –uses its own DNS and java.net package

15 Collection Synthesis. The NSDL: –National Science Digital Library –Educational materials for K-thru-grave –A collection of digital collections. Collection (automatically derived): –20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall

16 Crawler is the Key: A general search engine is good for precise results, few in number; A search engine must cover all topics, not just scientific; For automatic collection assembly, a Web crawler is needed; A focused crawler is the key

17 Focused Crawling

18 Focused Crawling [Figure: two panels, “Breadth-first crawl” and “Focused crawl”, each showing numbered nodes below a root R; in the focused crawl two nodes are marked X]

19 Collections and Clusters. Traditional: document universe is divided into clusters, or collections; each collection is represented by its centroid. Web: size of document universe is infinite; agglomerative clustering is used instead. Two aspects: –Collection descriptor –Rule for when items belong to that Collection

20 [Figure: panels labeled Q = 0.2 and Q = 0.6]

21 The Setup: A virtual collection of items about Chebyshev Polynomials

22 Adding a Centroid: An empty collection of items about Chebyshev Polynomials

23 Document Vector Space: Classic information retrieval technique; Each word is a dimension in N-space; Each document is a vector in N-space; Example: [figure]; Normalize the weights; Both the “centroid” and the downloaded document are term vectors
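
To make the vector-space idea above concrete, here is a minimal sketch, assuming raw word counts as weights and a simple lowercase tokenizer (the slides do not say which weighting or tokenization was used). Both the centroid and the document become unit-length term vectors, so their cosine correlation is just a dot product.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count lowercase word occurrences and scale the vector to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def correlation(a, b):
    """Cosine correlation of two unit-length term vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Illustrative inputs only: a two-word "centroid" and a short document.
centroid = term_vector("chebyshev polynomials")
document = term_vector("chebyshev polynomials are orthogonal polynomials")
print(round(correlation(centroid, document), 3))
```

A document sharing no terms with the centroid scores 0, and a document with exactly the same term proportions scores 1.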

24 Agglomerate: A collection with 3 items about Chebyshev polynomials

25 Where does the Centroid come from? “Chebyshev Polynomials”; A really good centroid for a collection about Chebyshev polynomials

26 Building a Centroid
1. Google(“Chebyshev Polynomials”) → {url1 … url-n}
2. Let H be a hash (k,v) where k = word, v = frequency
3. For each url in {url1 … url-n} do
   D ← download(url)
   V ← term vector(D)
   For each term t in V do
     If t not in H, add it; H(t)++
4. Compute tf-idf weights. C ← top 20 terms.
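
The centroid-building steps above can be sketched in Python as follows. This is an illustration under assumptions, not the code used in the project: the documents are passed in as already-downloaded strings (standing in for the search and download steps), the tokenizer is a simple lowercase word splitter, and the IDF values come from a caller-supplied table named idf, with unknown terms given a weight of 1.0.

```python
import re
from collections import Counter

def build_centroid(documents, idf, top_k=20):
    """Return the top_k terms by tf-idf weight across the seed documents."""
    freq = Counter()                      # H: word -> total frequency
    for text in documents:                # each text stands in for download(url)
        freq.update(re.findall(r"[a-z]+", text.lower()))
    # Assumption: tf-idf = total frequency * caller-supplied IDF (default 1.0).
    weights = {t: f * idf.get(t, 1.0) for t, f in freq.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])

# Illustrative inputs only.
docs = [
    "Chebyshev polynomials are orthogonal",
    "the Chebyshev polynomials of the first kind",
]
idf_table = {"the": 0.0, "are": 0.1, "of": 0.1}
centroid_terms = build_centroid(docs, idf_table, top_k=2)
print(centroid_terms)
```

The IDF table is what pushes common words such as "the" out of the top terms, leaving the topical vocabulary.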

27 Dictionary: Given centroids C1, C2, C3 …, the Dictionary is C1 + C2 + C3 …: –Terms are the union of terms in Ci –Term frequencies are the total frequency in Ci –Document frequency is how many C’s have t –Term IDF is as from Berkeley. Dictionary is 300-500 terms
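
A minimal sketch of the merge described above, assuming each centroid is a mapping from term to frequency (the slides do not give the actual data structure). It computes only the union of terms, the total term frequency, and the document frequency; the Berkeley IDF values are not reproduced here.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids (term -> frequency) into total TF and DF tables."""
    term_freq = Counter()
    doc_freq = Counter()
    for centroid in centroids:
        term_freq.update(centroid)         # adds each term's frequency
        doc_freq.update(centroid.keys())   # one count per centroid containing the term
    return term_freq, doc_freq

# Illustrative centroids only.
c1 = {"chebyshev": 5, "polynomial": 7}
c2 = {"polynomial": 3, "fourier": 4}
tf, df = build_dictionary([c1, c2])
print(sorted(tf.items()), sorted(df.items()))
```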

28 Focused Crawling: Recall the cartoon for a focused crawl [figure: numbered nodes below a root R, two nodes marked X]. A simple way to do it is with 2 “knobs”

29 Focusing the Crawl. Threshold: a page is on-topic if its correlation to the closest centroid is above this value. Cutoff: follow links from pages whose “distance” from the closest on-topic ancestor is less than the cutoff
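
The two knobs can be sketched over a small in-memory link graph. Everything here is an assumption made for illustration: the graph, the page scores, the function name focused_crawl, and the reading of "distance" as the number of consecutive off-topic pages since the last on-topic ancestor. The strict "less than" comparison follows the slide's wording, and on-topic is taken as correlation >= threshold.

```python
from collections import deque

def focused_crawl(links, score, seeds, threshold, cutoff):
    """Breadth-first crawl that stops expanding paths that stay off-topic.

    links: page -> list of linked pages (hypothetical in-memory web graph)
    score: page -> correlation with the closest centroid
    """
    collected = []
    seen = set(seeds)                       # also protects against cycles
    queue = deque((page, 0) for page in seeds)   # (page, parent's distance)
    while queue:
        page, parent_distance = queue.popleft()
        on_topic = score[page] >= threshold
        # Distance resets to 0 at an on-topic page, else grows by one step.
        distance = 0 if on_topic else parent_distance + 1
        if on_topic:
            collected.append(page)
        if distance < cutoff:               # follow links only while within the cutoff
            for child in links.get(page, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, distance))
    return collected

# Illustrative graph: "b" is on-topic but sits behind two off-topic pages.
links = {"seed": ["a", "x"], "x": ["y"], "y": ["b"]}
score = {"seed": 0.9, "a": 0.5, "x": 0.1, "y": 0.1, "b": 0.8}
print(focused_crawl(links, score, ["seed"], threshold=0.3, cutoff=1))
print(focused_crawl(links, score, ["seed"], threshold=0.3, cutoff=3))
```

With a small cutoff the crawl never reaches "b"; raising the cutoff lets it pass through the off-topic pages "x" and "y" to find it.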

30 Illustration [Figure: crawl tree with numbered nodes; labels “Cutoff = 1” and “Corr >= threshold”]

31 [Figure: labels “Closest” and “Furthest”]

32 Collection “Evaluation”: Assume higher correlations are good; With human relevance assessments, one can also compute a “precision” curve; Precision P(n) after considering the n most highly ranked items is the number of relevant items among them, divided by n.
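
As a worked example of the precision definition above, the sketch below computes P(n) for each n over a ranked list of relevance judgments. The judgments are invented for illustration; in practice they would come from human assessors looking at the items in descending order of correlation.

```python
def precision_at(ranked_relevance, n):
    """P(n): number of relevant items among the top n, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(1 for relevant in ranked_relevance[:n] if relevant) / n

# Hypothetical judgments for the five highest-ranked items (True = relevant).
judgments = [True, True, False, True, False]
curve = [precision_at(judgments, n) for n in range(1, len(judgments) + 1)]
print(curve)
```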

33 [Figure labeled Cutoff = 0, Threshold = 0.3]

35 Tunneling with Cutoff: Nugget – dud – dud … – dud – nugget; Notation: 0 – X – X … – X – 0; Fixed cutoff: 0 – X1 – X2 – … Xc; Adaptive cutoff: 0 – X1 – X2 – … X?

36 Statistics Collected: 500,000 documents; Number of seeds: 4; Path data for all but seeds; 6620 completed paths (0-x…x-0); 100,000s of incomplete paths (0-x…x..)

37 [Figure: Nuggets that are x steps from a nugget]

38 [Figure: Nuggets that are x steps from a seed and/or a nugget]

39 Better parents have better children.

40 Using the Empirical Observations: Use the path history; Use the page quality (cosine correlation); Current distance should increase exponentially as you get away from quality nodes; Distance = 0 if this is a nugget, otherwise: 1 or (1 − corr) × exp(2 × parent’s distance / cutoff)
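
The distance rule above can be sketched as a small function. This is a hedged illustration: the function and parameter names are invented, a nugget is taken to be a page whose correlation is at or above the threshold, and only the exponential branch is implemented because the slide does not say when the alternative value of 1 applies.

```python
import math

def adaptive_distance(corr, parent_distance, cutoff, threshold):
    """Distance of a page from quality nodes, per the exponential rule.

    Returns 0 for a nugget (corr >= threshold, an assumption); otherwise
    (1 - corr) * exp(2 * parent_distance / cutoff). The slide's alternative
    value of 1 is not modeled here.
    """
    if corr >= threshold:
        return 0.0
    return (1.0 - corr) * math.exp(2.0 * parent_distance / cutoff)

# Illustrative values: the same weak page under a near and a far parent.
near = adaptive_distance(corr=0.1, parent_distance=0.0, cutoff=20, threshold=0.3)
far = adaptive_distance(corr=0.1, parent_distance=10.0, cutoff=20, threshold=0.3)
print(near, far)
```

The same low-scoring page is penalized more heavily the further its parent already is from a nugget, so long off-topic paths reach the cutoff quickly.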

41 Results: Details in the ECDL paper; Smaller frontier → more docs/second; More documents downloaded in same time; Higher-scoring documents were downloaded; Cutoff of 20 averaged 7 steps at the cutoff

42 Fall 2002 Student Project [Diagram labels: Query; Mercator; Centroid; Collection Description; Term vectors; Centroids, Dictionary; Collection URLs; Chebyshev P.s; HTML]

43 Conclusion: We’ve covered crawling (history, technology, use); Focused crawling with tunneling; Adaptive cutoff with tunneling; We have a good experimental setup for exploring automatic collection synthesis

