(C) 2003, The University of Michigan. Information Retrieval, Handout #7, March 24, 2003.


1 Information Retrieval, Handout #7, March 24, 2003

2 Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M&F 11-12
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Mondays, 1-4 PM in 409 West Hall

3 Schedule
Readings for 03/31:
– Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
– Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
– Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002

4 Schedule
March 24
– The link-content hypothesis
– XML retrieval
March 31
– Information extraction
– Language reuse
April 7
– Language modeling for IR
– The Lemur system

5 Schedule
HW3 assigned 03/24
HW3 due 04/07
Final projects due 04/11
Final project presentations 04/14
Final exam 04/21
– 2-3 essay questions, 2-3 problems

6 The link-content hypothesis

7 Web structure
Kleinberg and Lawrence, "The structure of the Web", Science 294, 1849-1850

8 Web structure
16-20 links on average
The fraction of pages with n in-links is approximately $n^{-\alpha}$ for $\alpha \approx 2.1$
Kleinberg/Lawrence: 100,000 coherent communities (e.g., people concerned with oil spills off the coast of Japan)
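
As a rough illustration of the power-law claim on this slide, the sketch below normalises a distribution proportional to n^(-2.1) over a finite range of in-link counts. The cutoff `max_links` and the function name are assumptions made for the example; they are not part of the handout.

```python
def inlink_fractions(exponent=2.1, max_links=1000):
    """Return {n: fraction of pages with n in-links} for n = 1..max_links.

    Assumes a pure power law n ** -exponent, truncated at max_links
    (the truncation is an assumption so the weights can be normalised).
    """
    weights = {n: n ** -exponent for n in range(1, max_links + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


fractions = inlink_fractions()
# Under this model, pages with 1 in-link outnumber pages with 10 in-links
# by a factor of 10 ** 2.1.
ratio = fractions[1] / fractions[10]
print(ratio)
```

The ratio does not depend on the cutoff, which is why a truncated model is enough to show how quickly the tail falls off.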

9 Topical locality [Davison 00]
Most web pages are linked to others with related content; this helps users navigate the Web.
The presence of topical locality is important for building focused crawlers.
Traditionally, search engines indexed only titles and/or the first few lines of each document. Now they index all links.
"More evil than Satan himself"

10 Experimental design
Local crawl of 100,000 pages
Starts from HotBot and AltaVista
Biased towards English-language pages
Retrieve one outgoing link from each page

11 TFIDF cosine similarity
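
The formula for this slide is not preserved in the transcript. The following is a minimal sketch of one common TF-IDF cosine variant (raw term frequency times log inverse document frequency); the exact weighting used in the handout is unknown, and the function names and toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term frequency times log inverse document frequency (one common variant).
        vectors.append({t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for the example.
docs = [
    "focused crawling of web pages".split(),
    "web pages link to related web pages".split(),
    "poker cruise deals".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares "web" and "pages"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

Documents with shared terms get a positive score and documents with no shared terms score zero, which is the property the link-content experiments rely on when comparing linked, sibling, and random page pairs.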

12 Other metrics
Query-document overlap
Query term probability

13 Experimental results
100,000 URLs, but only 89,891 retrievable
An additional 111,107 URLs: two children per initial page
www.geocities.com (561), www.webring.com (419), www.amazon.com (303), etc.
18% top-level pages
50% .com, 27% .edu

14 Textual similarity
TFIDF similarity
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random

15 Structure and content [Menczer 01]
Cluster hypothesis (van Rijsbergen 79)
Link-cluster conjecture (Menczer): preservation of semantics across links

16 Experimental design
Open Directory Project (dmoz.org)
896,233 URLs from 97,614 topics
150,000 URLs from 47,174 topics
10,000 from each of the 15 top-level branches

17 Measures of similarity
Cosine
Link similarity
Semantic similarity
[Diagram: topics c1 and c2 and their lowest common ancestor, lca]
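
The formulas for these measures are not preserved in the transcript. The sketch below shows one plausible reading, and every definition in it is an assumption: link similarity as the Jaccard overlap of two pages' link neighbourhoods, and semantic similarity as a Lin-style information-theoretic score based on the lowest common ancestor (lca) of two topics in a directory tree. The toy tree, the topic probabilities, and all function names are hypothetical.

```python
import math


def link_similarity(neighbours_a, neighbours_b):
    """Jaccard overlap of two sets of URLs (assumed definition)."""
    union = neighbours_a | neighbours_b
    if not union:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(union)


def ancestors(topic, parent):
    """Path from a topic up to the root, including the topic itself."""
    path = [topic]
    while topic in parent:
        topic = parent[topic]
        path.append(topic)
    return path


def lowest_common_ancestor(c1, c2, parent):
    """First topic on c1's path to the root that is also on c2's path."""
    on_c2_path = set(ancestors(c2, parent))
    for topic in ancestors(c1, parent):
        if topic in on_c2_path:
            return topic
    raise ValueError("topics are not in the same tree")


def semantic_similarity(c1, c2, parent, prob):
    """2 * log P(lca) / (log P(c1) + log P(c2)), an assumed Lin-style measure."""
    lca = lowest_common_ancestor(c1, c2, parent)
    denom = math.log(prob[c1]) + math.log(prob[c2])
    if denom == 0.0:  # both topics are the root
        return 1.0
    return 2.0 * math.log(prob[lca]) / denom


# Hypothetical directory: child -> parent, and the share of pages under each topic.
parent = {"Arts": "Top", "Games": "Top", "Poker": "Games", "Chess": "Games"}
prob = {"Top": 1.0, "Arts": 0.4, "Games": 0.6, "Poker": 0.2, "Chess": 0.1}

same_branch = semantic_similarity("Poker", "Chess", parent, prob)
across_branches = semantic_similarity("Poker", "Arts", parent, prob)
jaccard = link_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Under these assumed definitions, topics that meet only at the root score zero, and a topic compared with itself scores one.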

18 Correlations between similarities
Over $3.84 \times 10^9$ pairs
Highest for News and Home (correlation > 0.2)
Lowest for Arts and Games (correlation < 0.05)
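
A correlation between two similarity measures over a set of page pairs can be computed as sketched below. That the handout's figures are Pearson coefficients is an assumption, and the scores are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical similarity scores for five page pairs.
content_sim = [0.10, 0.25, 0.40, 0.05, 0.30]
link_sim = [0.00, 0.10, 0.20, 0.00, 0.05]
r = pearson(content_sim, link_sim)
```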

19 Fit
[Plot and fitted equation not preserved; fitted parameters 1 and 2 are 1.8 and 0.6]

20 Document closures for Q&A
[Diagram; surviving labels: capital P LP Madrid spain capital]

21 Document closures for IR
[Diagram; surviving labels: Physics P LP Physics Department University of Michigan]

22 The perltree experiments
23.6% of the Excite log (2.5 M queries)
– 60% have both words in WordNet
– 27% have one word in WordNet
– 13% have no words in WordNet
200 queries from the log
200 random queries

23 Two-word queries
jimi SAT seats david caesar poker cruise yellow science Tishara trim yankee witnesses naked swaybar cheats rides Precious drugs university Clock engines metal choreography anthony swinging psychoanalysis webdesign pic lens toys online speech therapy Malcolm McDowell cellular accessories migrant farmworkers witch tv davis instruments Adult Games chichen itza freighter Cruises used motorcycles feng shui revolucion mexicana zeebrugee belgium electronic greetings

24 Query analysis
Words:
– Familiarity
– Ambiguity
– IDF
Queries:
– GoogleSize
– SemDist
– DistribSim

25 Query analysis
            Fam1  Fam2  Amb1  Amb2  IDF1  IDF2  Gsize    SemD  DistS
Excite (E)  1.42  1.89  1.70  2.36  4.00  4.74  670,000  0.39  0.06
Random (R)  1.54  1.61  2.06  2.29  4.40  4.55  329,000  0.29  0.02

26 Link-based language models
WT2g corpus
247,491 pages
3,118,248 links
948,036 unique words


28 Procedure
Given a query q1 q2:
– Get top 50 hits from AltaVista (A)
– Extract links that contain q1 or q2
– Get pages that are linked (B)
– Extract links from A ∪ B that point to A ∪ B
– Index A ∪ B using glimpse
– Compute link fertility
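
The set-building steps of this procedure can be sketched on an in-memory link graph. This is an illustration under stated assumptions, not the original perltree code: "links that contain q1 or q2" is read here as a match on anchor-text words, the `outlinks` structure and the function name are hypothetical, and the glimpse indexing and link-fertility steps are left out because the transcript does not define them.

```python
def expand_hits(top_hits, outlinks, q1, q2, limit=50):
    """Build A, B, and the links internal to A | B for a two-word query.

    top_hits: ranked list of URLs returned by the search engine.
    outlinks: dict url -> list of (anchor_text, target_url) pairs (assumed format).
    """
    a = set(top_hits[:limit])
    b = set()
    for url in a:
        for anchor, target in outlinks.get(url, []):
            words = anchor.lower().split()
            # Assumption: a link "contains" a query word if its anchor text does.
            if q1 in words or q2 in words:
                b.add(target)
    pool = a | b
    # Links whose source and destination both lie in A | B.
    internal_links = {
        (src, dst)
        for src in pool
        for _, dst in outlinks.get(src, [])
        if dst in pool and dst != src
    }
    new_pages = b - a  # pages reached by links that were not in the top hits
    return pool, internal_links, new_pages


# Toy graph, invented for the example.
outlinks = {
    "a1": [("feng shui tips", "b1"), ("home", "a2")],
    "a2": [("shui garden", "a1"), ("contact", "x9")],
    "b1": [("back", "a1")],
}
pool, internal_links, new_pages = expand_hits(["a1", "a2"], outlinks, "feng", "shui")
```

In the toy graph, `b1` is the only page added beyond the original hits, and the link to `x9` is dropped because its target lies outside A ∪ B.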

29 Results
New links pointing to pages that were not in the AltaVista top 50
– E = +11.7%, R = +8.9%
Improvements higher for
– rarer words
– lower distributional similarity
– lower semantic distance

30 Topic distillation [Chakrabarti et al. 01]
Topic drift
Returning snippets rather than full documents
Clique attacks (www.411fun.com, www.411fashion.com, www.411loans.com)

