
1 Information Retrieval. Dragomir R. Radev, University of Michigan. September 19, 2005. (C) 2005, The University of Michigan.

2 About the instructor. Dragomir R. Radev, Associate Professor, University of Michigan: School of Information; Department of Electrical Engineering and Computer Science; Department of Linguistics. Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan. Treasurer, North American Chapter of the ACL. Ph.D., 1998, Computer Science, Columbia University. Home page:

3 Introduction

4 IR systems: Google, Vivísimo, AskJeeves, NSIR, Lemur, MG, Nutch

5 Examples of IR systems. Conventional (library catalog): search by keyword, title, author, etc. Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language. Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …). Question answering systems (AskJeeves, NSIR, Answerbus): search in (restricted) natural language.


8 Need for IR. Advent of the WWW: more than 8 billion documents indexed on Google. How much information? 200 TB according to Lyman and Varian. Search, routing, filtering. Users' information needs.

9 Some definitions of Information Retrieval (IR). Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests." Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."

10 Sample queries (from Excite) In what year did baseball become an offical sport? play station codes. com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running

11 Mappings and abstractions. Reality → Data. Information need → Query. (From Korfhage's book.)

12 Typical IR system: (Crawling), Indexing, Retrieval, User interface.

13 Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve COLLECTION: a set of documents INDEX: a representation of information that makes querying easier TERM: word or concept that appears in a document or a query

14 Documents

15 Documents. Not just printed paper. Collections vs. documents. Data structures: representations. Bag-of-words method. Document surrogates: keywords, summaries. Encoding: ASCII, Unicode, etc.

16 Document preprocessing. Formatting. Tokenization (Pauls, Willow Dr., Dr. Willow, , New York, ad hoc). Casing (cat vs. CAT). Stemming (computer, computation). Soundex.

17 Document representations. Term-document matrix (m × n). Term-term matrix (m × m × n). Document-document matrix (n × n). Example: 3,000,000 documents (n) with 50,000 terms (m). Sparse matrices. Boolean vs. integer matrices.

18 Document representations. Term-document matrix: evaluating queries (e.g., (A B) C); storage issues. Inverted files: storage issues; evaluating queries; advantages and disadvantages.
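The inverted-file idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation; the sample documents and the `build_inverted_index` name are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "computer information retrieval",
        2: "computer retrieval",
        3: "information"}
index = build_inverted_index(docs)
```

Unlike the full term-document matrix, the index stores only the nonzero entries, which is what makes it attractive for the sparse matrices mentioned on the previous slide.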

19 IR models

20 Major IR models Boolean Vector Probabilistic Language modeling Fuzzy retrieval Latent semantic indexing

21 Major IR tasks Ad-hoc Filtering and routing Question answering Spoken document retrieval Multimedia retrieval

22 Venn diagrams (regions w, x, y, z; documents D1, D2).

23 Boolean model (sets A, B).

24 Boolean queries. restaurants AND (Mideastern OR vegetarian) AND inexpensive. What types of documents are returned? Stemming. Thesaurus expansion. Inclusive vs. exclusive OR. Confusing uses of AND and OR: dinner AND sports AND symphony. 4 OF (Pentium, printer, cache, PC, monitor, computer, personal).

25 Boolean queries. Weighting (Beethoven AND sonatas). Precedence: coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses. Use of negation: potential problems. Conjunctive and disjunctive normal forms. Full CNF and DNF.

26 Transformations. De Morgan's Laws: NOT (A AND B) = (NOT A) OR (NOT B); NOT (A OR B) = (NOT A) AND (NOT B). CNF or DNF? Reference librarians prefer CNF: why?

27 Boolean model Partition Partial relevance? Operators: AND, NOT, OR, parentheses

28 Exercise. D1 = computer information retrieval. D2 = computer retrieval. D3 = information. D4 = computer information. Q1 = information retrieval. Q2 = information ¬computer.
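One way to check this exercise with set operations on postings lists, reading the juxtaposition in Q1 as AND. The `postings` helper is a toy written for this example, not code from the course:

```python
def postings(docs):
    """Term -> set of the names of documents containing that term."""
    idx = {}
    for name, text in docs.items():
        for term in text.split():
            idx.setdefault(term, set()).add(name)
    return idx

docs = {"D1": "computer information retrieval",
        "D2": "computer retrieval",
        "D3": "information",
        "D4": "computer information"}
idx = postings(docs)
all_docs = set(docs)

# Q1 = information AND retrieval
q1 = idx["information"] & idx["retrieval"]
# Q2 = information AND NOT computer
q2 = idx["information"] & (all_docs - idx["computer"])
```

NOT is implemented as complement against the whole collection, which is why Boolean systems restrict where negation may appear.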

29 Exercise. Regions of the four-set Venn diagram (binary encoding: bit 1 = Swift, bit 2 = Shakespeare, bit 4 = Milton, bit 8 = Chaucer):
0 (none); 1 Swift; 2 Shakespeare; 3 Shakespeare, Swift; 4 Milton; 5 Milton, Swift; 6 Milton, Shakespeare; 7 Milton, Shakespeare, Swift; 8 Chaucer; 9 Chaucer, Swift; 10 Chaucer, Shakespeare; 11 Chaucer, Shakespeare, Swift; 12 Chaucer, Milton; 13 Chaucer, Milton, Swift; 14 Chaucer, Milton, Shakespeare; 15 Chaucer, Milton, Shakespeare, Swift.
Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

30 Stop lists. The most common words in English account for 50% or more of a given text. Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in": another 10%; the next 12 words: another 10%. Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63.

31 Vector models (axes Term 1, Term 2, Term 3; points Doc 1, Doc 2, Doc 3).

32 Vector queries. Each document is represented as a vector. Inefficient representations (bit vectors). Dimensional compatibility.

33 The matching process. Document space. Matching is done between a document and a query (or between two documents). Distance vs. similarity. Euclidean distance, Manhattan distance, word overlap, Jaccard coefficient, etc.

34 Miscellaneous similarity measures. The cosine measure: σ(D,Q) = Σ (d_i × q_i) / √(Σ d_i² × Σ q_i²) = |X ∩ Y| / √(|X| × |Y|). The Jaccard coefficient: σ(D,Q) = |X ∩ Y| / |X ∪ Y|.

35 Exercise. Compute the cosine measures σ(D1,D2) and σ(D1,D3) for the documents: D1 =, D2 = and D3 =. Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
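The document vectors for this exercise did not survive transcription, so here is a small sketch of the four measures so the exercise can be run on vectors of your own choosing; the test vectors below are stand-ins, not the ones from the slide:

```python
import math

def cosine(d, q):
    """sigma(D,Q) = sum(d_i * q_i) / sqrt(sum(d_i^2) * sum(q_i^2))."""
    num = sum(di * qi for di, qi in zip(d, q))
    den = math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))
    return num / den

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

def jaccard(x, y):
    """Set form: |X intersect Y| / |X union Y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)
```

Cosine is a similarity (1.0 for parallel vectors), while Euclidean and Manhattan are distances (0.0 for identical vectors), so they rank document pairs in opposite directions.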

36 Evaluation

37 Relevance. Difficult to define: fuzzy, inconsistent. Methods: exhaustive, sampling, pooling, search-based.

38 Contingency table.
                relevant   not relevant
retrieved          w            x        n1 = w + x
not retrieved      y            z
                n2 = w + y                    N

39 Precision and Recall. Recall = w / (w + y). Precision = w / (w + x).

40 Exercise. Go to Google and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query: e.g., Tolkien, "JRR Tolkien", +JRR Tolkien +"Lord of the Rings", etc. For each query, compute the precision (P) based on the first 10 documents returned. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.

41 [From Salton's book]

42 Interpolated average precision (e.g., 11-point). Interpolation: what is the precision at recall = 0.5?

43 Issues. Why not use accuracy A = (w + z)/N? Average precision. Average P at given document cutoff values. Report P when P = R. F measure: F = (β² + 1)PR / (β²P + R). F1 measure: F1 = 2 / (1/R + 1/P): the harmonic mean of P and R.
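The contingency-table definitions and the F measure can be sketched directly; a minimal illustration, with the counts w, x, y taken from the table two slides back (the function names are mine):

```python
def precision(w, x):
    """w = relevant retrieved, x = non-relevant retrieved."""
    return w / (w + x)

def recall(w, y):
    """y = relevant but not retrieved."""
    return w / (w + y)

def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta=1 gives F1."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```

With beta = 1 this reduces to 2/(1/R + 1/P), the harmonic mean mentioned on the slide; beta > 1 weights recall more heavily, beta < 1 precision.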

44 Kappa. N: number of items (index i). n: number of categories (index j). k: number of annotators.

45 Kappa example (from Manning, Schuetze, Raghavan).
        J1+   J1-
J2+     300    20
J2-      10    70

46 Kappa (cont'd). P(A) = 370/400 = 0.925. P(-) = (80 + 90)/800 = 0.2125. P(+) = (320 + 310)/800 = 0.7875. P(E) = 0.2125² + 0.7875² = 0.665. K = (0.925 - 0.665)/(1 - 0.665) = 0.776. Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good.
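The kappa computation above can be sketched as a function. This follows the pooled-marginals variant used in the Manning/Schuetze/Raghavan example (both judges' marginals are averaged when estimating chance agreement); the function name and table layout are mine:

```python
def kappa(table):
    """table[i][j]: items judge 2 put in category i and judge 1 in category j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # observed agreement: the diagonal of the table
    p_a = sum(table[i][i] for i in range(k)) / n
    # chance agreement from pooled marginals over both judges
    p_e = 0.0
    for j in range(k):
        row_total = sum(table[j])
        col_total = sum(table[i][j] for i in range(k))
        p_e += ((row_total + col_total) / (2 * n)) ** 2
    return (p_a - p_e) / (1 - p_e)
```

On the 2x2 table from the previous slide this reproduces the value computed by hand above, about 0.776, i.e. tentatively acceptable but below the 0.8 "good" threshold.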

47 Relevance collections. TREC ad hoc collections, 2–6 GB. TREC Web collections, 2–100 GB.

48 Sample TREC query. Number: 305. Title: Most Dangerous Vehicles. Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. [List of relevant LA Times (LA) and Financial Times (FT) document IDs.]

49 (C) 2005, The University of Michigan49 LA March 16, 1989, Thursday, Home Edition Business; Part 4; Page 1; Column 5; Financial Desk 586 words AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS By LINDA WILLIAMS, Times Staff Writer The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. Several Fatalities However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the Bronco II models, the agency said. According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai.... Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY

50 TREC (cont'd)

51 Word distribution models

52 (C) 2005, The University of Michigan52 Shakespeare Romeo and Juliet: And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; … A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;

53 The BNC (Adam Kilgarriff). the (det), be (v), of (prep), and (conj), a (det), in (prep), to (infinitive marker), have (v), it (pron), to (prep), for (prep), i (pron), that (conj), you (pron), he (pron), on (prep), with (prep), do (v), at (prep), by (prep). Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2).

54 Stop lists. The most common words in English account for 50% or more of a given text. Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in": another 10%; the next 12 words: another 10%. Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63.

55 Zipf's law. Rank × Frequency ≈ Constant.

56 Zipf's law is fairly general! Frequency of accesses to web pages: in particular the access counts on the Wikipedia page, with s approximately equal to 0.3; page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5. Words in the English language: for instance, in Shakespeare's play Hamlet, with s approximately 0.5. Sizes of settlements. Income distributions amongst individuals. Sizes of earthquakes. Notes in musical performances.

57 Zipf's law (cont'd). Limitations: low and high frequencies; lack of convergence. Power law with coefficient c = -1: y = k·x^c. Li (1992): typing words one letter at a time, including spaces.

58 Heaps' law. Size of vocabulary: V(n) = K·n^β, where n is the text size in tokens. In English, K is between 10 and 100, and β is between 0.4 and 0.6.
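Heaps' law can be checked empirically by tracking vocabulary growth token by token and comparing it to the K·n^β prediction. A minimal sketch (the function names and the K, β values in the test are illustrative, not from the slides):

```python
def vocab_growth(tokens):
    """V(n): number of distinct types seen after the first n tokens."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def heaps_v(n, K, beta):
    """Heaps' law prediction V(n) = K * n**beta."""
    return K * n ** beta
```

Fitting K and β to a real corpus is a linear regression of log V(n) against log n; the slide's ranges (K between 10 and 100, β between 0.4 and 0.6) are what such fits typically give for English.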

59 Heaps' law (cont'd). Related to Zipf's law: generative models. Zipf's and Heaps' law coefficients change with language. Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws' Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, Springer-Verlag, pp. 332–335.

60 Indexing

61 Methods. Manual: e.g., Library of Congress subject headings, MeSH. Automatic.


63 (C) 2005, The University of Michigan63 Medicine CLASS R - MEDICINE Subclass R R5-920Medicine (General) R General works R History of medicine. Medical expeditions R Medicine as a profession. Physicians R Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R Directories R Missionary medicine. Medical missionaries R Medical philosophy. Medical ethics R Medicine and disease in relation to psychology. Terminal care. Dying R Medical personnel and the public. Physician and the public R Practice of medicine. Medical practice economics R Medical education. Medical schools. Research R Medical technology R Biomedical engineering. Electronics. Instrumentation R Computer applications to medicine. Medical informatics R864Medical records R Medical physics. Medical radiology. Nuclear medicine

64 Finding the most frequent terms in a document. Typically stop words: the, and, in. Not content-bearing. Terms vs. words. Luhn's method.

65 Luhn's method (plot of words vs. frequency).

66 Computing term salience. Term frequency (TF). Document frequency (DF). Inverse document frequency (IDF).

67 Applications of TFIDF: cosine similarity, indexing, clustering.

68 Vector-based matching. The cosine measure: sim(D,C) = Σ_k (d_k · c_k · idf(k)) / √(Σ_k d_k² · Σ_k c_k²).

69 IDF: inverse document frequency. N: number of documents. d_k: number of documents containing term k. f_ik: absolute frequency of term k in document i. w_ik: weight of term k in document i. idf_k = log2(N/d_k) + 1 = log2 N − log2 d_k + 1. TF·IDF is used for automated indexing and for topic discrimination.
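The IDF formula from this slide is short enough to state directly in code; a minimal sketch with function names of my choosing:

```python
import math

def idf(N, d_k):
    """idf_k = log2(N / d_k) + 1 (N documents, d_k of them contain term k)."""
    return math.log2(N / d_k) + 1

def tfidf(f_ik, N, d_k):
    """w_ik = f_ik * idf_k: weight of term k in document i."""
    return f_ik * idf(N, d_k)
```

A term that occurs in every document (d_k = N) gets idf = 1, the minimum under this formulation, while a term in only a handful of documents is weighted up sharply; this is what separates the topical words on the next slides from function words like "would" and "which".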

70 Asian and European news deng china beijing chinese xiaoping jiang communist body party died leader state people nato albright belgrade enlargement alliance french opposition russia government told would their which

71 Other topics shuttle space telescope hubble rocket astronauts discovery canaveral cape mission florida center compuserve massey salizzoni bob online executive interim chief service second world president

72 Compression

73 Compression. Methods: fixed-length codes; Huffman coding; Ziv-Lempel codes.

74 Fixed-length codes. Binary representations: ASCII. Representational power: 2^k symbols, where k is the number of bits.

75 Variable-length codes. Alphabet (Morse code):
A .-     N -.
B -...   O ---
C -.-.   P .--.
D -..    Q --.-
E .      R .-.
F ..-.   S ...
G --.    T -
H ....   U ..-
I ..     V ...-
J .---   W .--
K -.-    X -..-
L .-..   Y -.--
M --     Z --..

76 Most frequent letters in English: E T A O I N S H R D L U (hy/subs/frequencies.html; demo: nt-char.cfm). Also bigrams: TH HE IN ER AN RE ND AT ON NT (hy/subs/digraphs.html).

77 Useful links about cryptography

78 Huffman coding. Developed by David Huffman (1952). Average of 5 bits per character (37.5% compression). Based on frequency distributions of symbols. Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols.
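The merge-the-two-least-frequent step can be sketched with a priority queue. This is a compact illustration, not the course's code; the example frequencies at the bottom are arbitrary, and the tie-breaking counter is just a detail to keep the heap happy:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build prefix codes by repeatedly merging the two least frequent nodes."""
    tie = count()  # tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent subtree
        f2, _, right = heapq.heappop(heap)  # second least frequent
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman_codes(freqs)
```

Frequent symbols end up near the root and get short codes, rare ones get long codes, and no code is a prefix of another, so a bit stream decodes unambiguously.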


80 (Huffman tree over the symbols a–j.)


82 Exercise Consider the bit string: Use the Huffman code from the example to decode it. Try inserting, deleting, and switching some bits at random locations and try decoding.

83 Ziv-Lempel coding. Two types; one is known as LZ77 (used in GZIP). Code: a set of triples <a, b, c>. a: how far back in the decoded text to look for the upcoming text segment. b: how many characters to copy. c: new character to add to complete the segment.
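The decoder side of the triple scheme just described is only a few lines, and makes the copy-from-history idea concrete. A minimal sketch (the function name and the triples in the test are my own examples):

```python
def lz77_decode(triples):
    """Each triple (back, length, char): copy `length` characters starting
    `back` positions before the current end of the output, then append `char`."""
    out = []
    for back, length, char in triples:
        for _ in range(length):
            out.append(out[-back])  # char-by-char so overlapping copies work
        out.append(char)
    return "".join(out)
```

Copying one character at a time is deliberate: when length exceeds back, the copy overlaps the text it is producing, which is how LZ77 encodes runs like "aaaa" with a single triple.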

84 (C) 2005, The University of Michigan84 p pe pet peter peter_ peter_pi peter_piper peter_piper_pic peter_piper_pick peter_piper_picked peter_piper_picked_a peter_piper_picked_a_pe peter_piper_picked_a_peck_ peter_piper_picked_a_peck_o peter_piper_picked_a_peck_of peter_piper_picked_a_peck_of_pickl peter_piper_picked_a_peck_of_pickled peter_piper_picked_a_peck_of_pickled_pep peter_piper_picked_a_peck_of_pickled_pepper peter_piper_picked_a_peck_of_pickled_peppers

85 Links on text compression: data compression; the Calgary corpus; Huffman coding; LZ.

86 Relevance feedback and query expansion

87 Relevance feedback. Problem: the initial query may not be the most appropriate way to satisfy a given information need. Idea: modify the original query so that it gets closer to the right documents in the vector space.

88 Relevance feedback. Automatic. Manual. Method: identifying feedback terms. Q' = a1·Q + a2·R − a3·N. Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|.
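The query-modification formula above (the Rocchio form) can be sketched over plain vectors. The defaults follow the slide: a1 = 1, a2 = 1/|R|, a3 = 1/|N|. The function name and the vectors in the test are illustrative, not from the slides:

```python
def rocchio(q, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Q' = a1*Q + a2*sum(R) - a3*sum(N) over the relevant (R) and
    non-relevant (N) document vectors."""
    if a2 is None:
        a2 = 1.0 / len(relevant) if relevant else 0.0
    if a3 is None:
        a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_q = []
    for i in range(len(q)):
        r_sum = sum(d[i] for d in relevant)
        n_sum = sum(d[i] for d in nonrelevant)
        new_q.append(a1 * q[i] + a2 * r_sum - a3 * n_sum)
    return new_q
```

The effect is geometric: the query point is pulled toward the centroid of the relevant documents and pushed away from the centroid of the non-relevant ones.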

89 Example. Q = safety minivans. D1 = car safety minivans tests injury statistics (relevant). D2 = liability tests safety (relevant). D3 = car passengers injury reviews (non-relevant). R = ? S = ? Q' = ?

90 Pseudo relevance feedback. Automatic query expansion: thesaurus-based expansion (e.g., using latent semantic indexing; later…); distributional similarity; query log mining.

91 Examples. Lexical semantics (hypernymy): Book: publication, product, fact, dramatic composition, record. Computer: machine, expert, calculator, reckoner, figurer. Fruit: reproductive structure, consequence, product, bear. Politician: leader, schemer. Newspaper: press, publisher, product, paper, newsprint. Distributional clustering: Book: autobiography, essay, biography, memoirs, novels. Computer: adobe, computing, computers, developed, hardware. Fruit: leafy, canned, fruits, flowers, grapes. Politician: activist, campaigner, politicians, intellectuals, journalist. Newspaper: daily, globe, newspapers, newsday, paper.

92 Examples (query logs) Book: booksellers, bookmark, blue Computer: sales, notebook, stores, shop Fruit: recipes cake salad basket company Games: online play gameboy free video Politician: careers federal office history Newspaper: online website college information Schools: elementary high ranked yearbook California: berkeley san francisco southern French: embassy dictionary learn

93 Problems with automatic query expansion Adding frequent words may dilute the results (example?)

94 Stemming

95 Goals. Motivation: computer, computers, computerize, computational, computerization; user, users, using, used. Representing related words as one token. Simplify matching. Reduce storage and computation. Also known as: term conflation.

96 Methods. Manual (tables): achievement → achiev; achiever → achiev; etc. Affix removal (Harman 1991, Frakes 1992): if a word ends in "ies" but not "eies" or "aies", then ies → y; if a word ends in "es" but not "aes", "ees", or "oes", then es → e; if a word ends in "s" but not "us" or "ss", then s → NULL. (Apply only the first applicable rule.)
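The three affix-removal rules above translate almost one-for-one into code; a minimal sketch of the Harman-style "S stemmer" (the function name is mine), applying only the first matching rule as the slide prescribes:

```python
def s_stemmer(word):
    """Affix-removal stemming: apply only the first applicable rule."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # es -> e
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # s -> NULL
    return word
```

The exception suffixes keep the rules from mangling words like "bus" or "caress"; Porter's algorithm, on the next slides, is a much more elaborate version of the same rule-cascade idea.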

97 Porter's algorithm (Porter 1980). Home page:. Reading assignment:. Consonant-vowel sequences: CVCV…C; CVCV…V; VCVC…C; VCVC…V. Shorthand: [C]VCVC…[V].

98 Porter's algorithm (cont'd). [C](VC){m}[V]; {m} indicates repetition. Examples: m=0: TR, EE, TREE, Y, BY. m=1: TROUBLE, OATS, TREES, IVY. m=2: TROUBLES, PRIVATE, OATEN. Conditions: *S: the stem ends with S (and similarly for the other letters). *v*: the stem contains a vowel. *d: the stem ends with a double consonant (e.g., -TT, -SS). *o: the stem ends cvc, where the second c is not W, X or Y (e.g., -WIL, -HOP).

99 (C) 2005, The University of Michigan99 Step 1a SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step 1b (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing Step 1b1 If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file

100 (C) 2005, The University of Michigan100 Step 1c (*v*) Y -> I happy -> happi sky -> sky Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible

101 (C) 2005, The University of Michigan101 Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler

102 (C) 2005, The University of Michigan102 Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll

103 Porter's algorithm (cont'd). Example: the word duplicatable: duplicat (rule 4); duplicate (rule 1b1); duplic (rule 3). The rule in step 4 removing "ic" cannot also be applied, since only one rule from each step may be applied.
% cd /clair4/class/ir-w03/tf-idf
%./ computers
computers comput

104 Porter's algorithm

105 Stemming Not always appropriate (e.g., proper names, titles) The same applies to casing (e.g., CAT vs. cat)

106 String matching

107 String matching methods. Index-based. Full or approximate: e.g., theater = theatre.

108 Index-based matching. Inverted files. Position-based inverted files. Block-based inverted files. Example text: "This is a text. A text has many words. Words are made from letters." Index: Text: 11, 19; Words: 33, 40; From: 55.

109 Inverted index (trie). Letters: 60. Made: 50. Many: 28. Text: 11, 19. Words: 33, 40. (Trie edge labels: l, m, t, w; under m, the branch a splits into d and n.)

110 Sequential searching. No indexing structure given. Given: database d and search pattern p. Example: find words in the earlier example. Brute force method: try all possible starting positions; O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn). Typical runtime is actually O(n), given that mismatches are easy to notice.

111 Knuth-Morris-Pratt. Average runtime similar to brute force. Worst-case runtime is linear: O(n). Idea: reuse knowledge. Needs preprocessing of the pattern.

112 Knuth-Morris-Pratt (cont'd). Example database: ABC ABC ABC ABDAB ABCDABCDABDE. Pattern: ABCDABD. Failure function of the pattern:
pos:  1 2 3 4 5 6 7
char: A B C D A B D
fail: 0 0 0 0 1 2 0
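The preprocessing step (the failure function) and the search itself can be sketched as follows. This is an illustrative implementation, not the course's; the search text in the example is the classic ABCDABD demonstration string, chosen by me:

```python
def failure_function(pattern):
    """fail[i]: length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to a shorter matched prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the starting positions of all occurrences of pattern in text."""
    fail, k, hits = failure_function(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

matches = kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD")
```

The failure function is exactly the "reuse knowledge" idea from the previous slide: on a mismatch the search never re-reads text characters, which gives the O(n) worst case.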


114 Boyer-Moore. Used in text editors.

115 Word similarity. Hamming distance: when words are of the same length. Levenshtein distance: number of edits (insertions, deletions, replacements): color --> colour (1); survey --> surgery (2); com puter --> computer ? Longest common subsequence (LCS): lcs(survey, surgery) = surey.

116 Levenshtein edit distance. Examples: theatre -> theater; Ghaddafi -> Qadafi; computer -> counter. Edit distance (inserts, deletes, substitutions). Edit transcript. Done through dynamic programming.

117 Recurrence relation. Three dependencies: D(i,0) = i; D(0,j) = j; D(i,j) = min[D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j)]. Simple edit distance: t(i,j) = 0 if S1(i) = S2(j), else 1.
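The recurrence fills a table bottom-up; a minimal sketch (function name mine), using the word pairs from the earlier slide as checks:

```python
def levenshtein(s1, s2):
    """Edit distance via the recurrence D(i,j) above."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + t)  # match or substitution
    return D[m][n]
```

Recovering the edit transcript mentioned above is a matter of walking back from D(m,n) to D(0,0) along the cells that produced each minimum, as the traceback slides illustrate.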

118 Example (Gusfield 1997). Edit-distance table for VINTNER (rows) vs. WRITERS (columns), with row and column 0 initialized to D(i,0) = i and D(0,j) = j.

119 Example (cont'd) (Gusfield 1997). The table filled in row by row using the recurrence.

120 Tracebacks (Gusfield 1997). Pointers back through the table recover the edit transcript.

121 Weighted edit distance. Used to emphasize the relative cost of different edit operations. Useful in bioinformatics: homology information; BLAST; Blosum. http://eta.embl-

122 Web sites:

123 Clustering

124 Clustering. Exclusive vs. overlapping clusters. Hierarchical vs. flat clusters. The cluster hypothesis: documents in the same cluster are relevant to the same query.

125 Representations for document clustering. Typically vector-based: words (cat, dog, etc.); features (document length, author name, etc.). Each document is represented as a vector in an n-dimensional space. Similar documents appear nearby in the vector space (distance measures are needed).

126 Hierarchical clustering. Dendrograms. E.g., language similarity:

127 Another example Kingdom = animal Phylum = Chordata Subphylum = Vertebrata Class = Osteichthyes Subclass = Actinoptergyii Order = Salmoniformes Family = Salmonidae Genus = Oncorhynchus Species = Oncorhynchus kisutch (Coho salmon)

128 Clustering using dendrograms. REPEAT: compute pairwise similarities; identify closest pair; merge pair into single node; UNTIL only one node left. Q: what is the equivalent Venn diagram representation? Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A

129 (C) 2005, The University of Michigan129 Methods Single-linkage –One common pair is sufficient –disadvantages: long chains Complete-linkage –All pairs have to match –Disadvantages: too conservative Average-linkage Centroid-based (online) –Look at distances to centroids Demo: –/clair4/class/ir-w05/clustering

130 (C) 2005, The University of Michigan130 k-means Needed: small number k of desired clusters hard vs. soft decisions Example: Weka

131 (C) 2005, The University of Michigan131 k-means 1 initialize cluster centroids to arbitrary vectors 2 while further improvement is possible do 3 for each document d do 4 find the cluster c whose centroid is closest to d 5 assign d to cluster c 6 end for 7 for each cluster c do 8 recompute the centroid of cluster c based on its documents 9 end for 10 end while
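A direct Python transcription of the pseudocode above (a sketch: squared Euclidean distance and random initial centroids are assumptions, since the slide leaves both open):

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    rnd = random.Random(seed)
    centroids = rnd.sample(docs, k)               # line 1: arbitrary initial centroids
    assign = None
    for _ in range(iters):                        # line 2: while improvement is possible
        def closest(d):
            return min(range(k),
                       key=lambda c: sum((x - y) ** 2
                                         for x, y in zip(d, centroids[c])))
        new_assign = [closest(d) for d in docs]   # lines 3-6: assign to nearest centroid
        if new_assign == assign:                  # no change: converged
            break
        assign = new_assign
        for c in range(k):                        # lines 7-9: recompute centroids
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return assign, centroids
```

This makes hard assignments; the "soft" variant on the slide would instead give each document fractional membership in every cluster.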

132 (C) 2005, The University of Michigan132 Example Cluster the following vectors into two groups: –A = –B = –C = –D = –E = –F =

133 (C) 2005, The University of Michigan133 Complexity Complexity = O(kn) because at each step, n documents have to be compared to k centroids.

134 (C) 2005, The University of Michigan134 Weka A general environment for machine learning (e.g. for classification and clustering) Book by Witten and Frank

135 (C) 2005, The University of Michigan135 Demos al_html/AppletKM.html k-means mo/kmcluster

136 (C) 2005, The University of Michigan136 Human clustering Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).

137 (C) 2005, The University of Michigan137 Lexical networks

138 (C) 2005, The University of Michigan138 Lexical Networks Used to represent relationships between words Example: WordNet - created by George Miller's team at Princeton Based on synsets (sets of synonymous, interchangeable words) and lexical matrices

139 (C) 2005, The University of Michigan139 Lexical matrix

140 (C) 2005, The University of Michigan140 Synsets Disambiguation –{board, plank} –{board, committee} Synonyms –substitution –weak substitution –synonyms must be of the same part of speech

141 (C) 2005, The University of Michigan141 $./wn board -hypen Synonyms/Hypernyms (Ordered by Frequency) of noun board 9 senses of board Sense 1 board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping Sense 2 board => sheet, flat solid => artifact, artefact => object, physical object => entity, something Sense 3 board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something

142 (C) 2005, The University of Michigan142 Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface => artifact, artefact => object, physical object => entity, something Sense 6 board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something

143 (C) 2005, The University of Michigan143 Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

144 (C) 2005, The University of Michigan144 Antonymy x vs. not-x rich vs. poor? {rise, ascend} vs. {fall, descend}

145 (C) 2005, The University of Michigan145 Other relations Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to X is a part of Y, X is a member of Y. Hyponymy: {tree} is a hyponym of {plant}. Hierarchical structure based on hyponymy (and hypernymy).

146 (C) 2005, The University of Michigan146 Other features of WordNet Index of familiarity Polysemy

147 (C) 2005, The University of Michigan147 board used as a noun is familiar (polysemy count = 9) bird used as a noun is common (polysemy count = 5) cat used as a noun is common (polysemy count = 7) house used as a noun is familiar (polysemy count = 11) information used as a noun is common (polysemy count = 5) retrieval used as a noun is uncommon (polysemy count = 3) serendipity used as a noun is very rare (polysemy count = 1) Familiarity and polysemy

148 (C) 2005, The University of Michigan148 Compound nouns advisory board appeals board backboard backgammon board baseboard basketball backboard big board billboard binder's board binder board blackboard board game board measure board meeting board member board of appeals board of directors board of education board of regents board of trustees

149 (C) 2005, The University of Michigan149 Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

150 (C) 2005, The University of Michigan150 Top-level concepts {act, action, activity} {animal, fauna} {artifact} {attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time}

151 (C) 2005, The University of Michigan151 WordNet and DistSim wn reason -hypen - hypernyms wn reason -synsn - synsets wn reason -simsn - synonyms wn reason -over - overview of senses wn reason -famln - familiarity/polysemy wn reason -grepn - compound nouns /data2/tools/relatedwords/relate reason

152 (C) 2005, The University of Michigan152 System comparison

153 (C) 2005, The University of Michigan153 Comparing two systems Comparing A and B One query? Average performance? Need: A to consistently outperform B [this slide: courtesy James Allan]

154 (C) 2005, The University of Michigan154 The sign test Example 1: –A > B (12 times) –A = B (25 times) –A < B (3 times) –p < (significant at the 5% level) Example 2: –A > B (18 times) –A < B (9 times) –p < (not significant at the 5% level) [this slide: courtesy James Allan]
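The two examples can be checked with a small two-sided sign test: ties (A = B) are dropped, and under the null hypothesis each remaining comparison is a fair coin. This is a sketch written for this transcript, not course code:

```python
from math import comb

def sign_test_p(wins, losses):
    # two-sided binomial sign test; ties are assumed already dropped
    n = wins + losses
    k = max(wins, losses)
    one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)
```

With 12 wins vs. 3 losses the p-value falls below 0.05 (significant at the 5% level); with 18 vs. 9 it does not.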

155 (C) 2005, The University of Michigan155 Other tests The t test: –Takes into account the actual performances, not just which system is better – ml The sign test: – gn_Test.html

156 (C) 2005, The University of Michigan156 Techniques for dimensionality reduction: SVD and LSI

157 (C) 2005, The University of Michigan157 Techniques for dimensionality reduction Based on matrix decomposition (goal: preserve clusters, explain away variance) A quick review of matrices –Vectors –Matrices –Matrix multiplication

158 (C) 2005, The University of Michigan158 SVD: Singular Value Decomposition A = U Σ V^T This decomposition exists for all matrices, dense or sparse If A has 5 columns and 3 rows, then U will be 3x3 and V will be 5x5 In Matlab, use [U,S,V] = svd (A)
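The same check in NumPy rather than Matlab (note that numpy.linalg.svd returns V already transposed):

```python
import numpy as np

# A with 3 rows and 5 columns: U comes out 3x3 and V 5x5
A = np.arange(15, dtype=float).reshape(3, 5)
U, s, Vt = np.linalg.svd(A)        # full SVD: A = U @ diag(s) @ Vt
S = np.zeros((3, 5))
np.fill_diagonal(S, s)             # singular values on the diagonal of a 3x5 Σ
assert U.shape == (3, 3) and Vt.shape == (5, 5)
assert np.allclose(A, U @ S @ Vt)  # the factors reconstruct A exactly
```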

159 (C) 2005, The University of Michigan159 Term matrix normalization D1 D2 D3 D4 D5

160 (C) 2005, The University of Michigan160 Example (Berry and Browne) T1: baby T2: child T3: guide T4: health T5: home T6: infant T7: proofing T8: safety T9: toddler D1: infant & toddler first aid D2: babies & childrens room (for your home) D3: child safety at home D4: your babys health and safety: from infant to toddler D5: baby proofing basics D6: your guide to easy rust proofing D7: beanie babies collectors guide

161 (C) 2005, The University of Michigan161 Document term matrix

162 (C) 2005, The University of Michigan162 Decomposition u = v =

163 (C) 2005, The University of Michigan163 Decomposition s = Spread on the v1 axis

164 (C) 2005, The University of Michigan164 Rank-4 approximation s4 =

165 (C) 2005, The University of Michigan165 Rank-4 approximation u*s4*v'

166 (C) 2005, The University of Michigan166 Rank-4 approximation u*s

167 (C) 2005, The University of Michigan167 Rank-4 approximation s4*v'

168 (C) 2005, The University of Michigan168 Rank-2 approximation s2 =

169 (C) 2005, The University of Michigan169 Rank-2 approximation u*s2*v'

170 (C) 2005, The University of Michigan170 Rank-2 approximation u*s

171 (C) 2005, The University of Michigan171 Rank-2 approximation s2*v'

172 (C) 2005, The University of Michigan172 Documents to concepts and terms to concepts A(:,1)'*u*s >> A(:,1)'*u*s >> A(:,1)'*u*s >> A(:,2)'*u*s >> A(:,3)'*u*s

173 (C) 2005, The University of Michigan173 Documents to concepts and terms to concepts >> A(:,4)'*u*s >> A(:,5)'*u*s >> A(:,6)'*u*s >> A(:,7)'*u*s

174 (C) 2005, The University of Michigan174 Contd >> (s2*v'*A(1,:)')' >> (s2*v'*A(2,:)')' >> (s2*v'*A(3,:)')' >> (s2*v'*A(4,:)')' >> (s2*v'*A(5,:)')'

175 (C) 2005, The University of Michigan175 Contd >> (s2*v'*A(6,:)')' >> (s2*v'*A(7,:)')' >> (s2*v'*A(8,:)')' >> (s2*v'*A(9,:)')

176 (C) 2005, The University of Michigan176 Properties A*A' A'*A A is a document to term matrix. What is A*A', what is A'*A?

177 (C) 2005, The University of Michigan177 Latent semantic indexing (LSI) Dimensionality reduction = identification of hidden (latent) concepts Query matching in latent space

178 (C) 2005, The University of Michigan178 Useful pointers a_definition.htm xing.html

179 (C) 2005, The University of Michigan179 Models of the Web

180 (C) 2005, The University of Michigan180 Size The Web is the largest repository of data and it grows exponentially. –320 Million Web pages [Lawrence & Giles 1998] –800 Million Web pages, 15 TB [Lawrence & Giles 1999] –8 Billion Web pages indexed [Google 2005] Amount of data –roughly 200 TB [Lyman et al. 2003]

181 (C) 2005, The University of Michigan181 Bow-tie model of the Web SCC 56M, IN 44M, OUT 44M, TENDRILS 44M, DISC 17M (Broder et al. WWW 2000, Dill et al. VLDB 2001) 24% of pages reachable from a given page

182 (C) 2005, The University of Michigan182 Power laws Web site size (Huberman and Adamic 1999) Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for the out-degree and 2.1 for the in-degree Others: call graphs among telephone carriers, citation networks (Redner 1998), the Erdős collaboration graph, the collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.

183 (C) 2005, The University of Michigan183 Small-world networks Diameter = average length of the shortest path between all pairs of nodes. Milgram experiment (1967) –Kansas/Omaha --> Boston (42/160 letters) –diameter = 6 –Six degrees of separation Albert et al. 1999 –average distance between two vertices is d = 0.35 + 2.06 log10 n; for n = 10^9, d ≈ 18.9

184 (C) 2005, The University of Michigan184 Clustering coefficient Cliquishness (C): the fraction of the k_v(k_v – 1)/2 possible pairs of neighbors of a vertex v that are actually connected. [Table (Watts and Strogatz 1998): n, k, d, d_rand, C, C_rand for the film-actor network, the power grid, and the C. elegans neural network; in each case C is much larger than C_rand]

185 (C) 2005, The University of Michigan185 Models of the Web Erdős/Rényi 59, 60 Barabási/Albert 99 Watts/Strogatz 98 Kleinberg 98 Menczer 02 Radev 03 Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology

186 (C) 2005, The University of Michigan186 Self-triggerability across hyperlinks Document closures for information retrieval Self-triggerability [Mosteller & Wallace 84] Poisson distribution Two-Poisson [Bookstein & Swanson 74] Negative Binomial, K-mixture [Church & Gale 95] Triggerability across hyperlinks?

187 (C) 2005, The University of Michigan187 Evolving Word-based Web Observations: –Links are made based on topics –Topics are expressed with words –Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) Model –Pick n –Generate n lengths according to a power-law distribution –Generate n documents using a trigram model Model (cont'd) –Pick words in decreasing order of r. –Generate hyperlinks with random directionality Outcome –Generates power-law degree distributions –Generates topical communities –Natural variation of PageRank: LexRank

188 (C) 2005, The University of Michigan188 Social network analysis for IR

189 (C) 2005, The University of Michigan189 Social networks Induced by a relation Symmetric or not Examples: –Friendship networks –Board membership –Citations –Power grid of the US –WWW

190 (C) 2005, The University of Michigan190 Krebs 2004

191 (C) 2005, The University of Michigan191 Prestige and centrality Degree centrality: how many neighbors each node has. Closeness centrality: how close a node is to all of the other nodes Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. Prestige = same as centrality but for directed graphs.
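The first two centrality measures are straightforward to compute. A small sketch (written for this transcript; an undirected, connected graph stored as an adjacency dict is assumed):

```python
from collections import deque

def degree_centrality(graph):
    # graph: {node: set of neighbors}; degree = number of neighbors
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    scores = {}
    for s in graph:
        dist = {s: 0}
        q = deque([s])
        while q:                          # BFS computes shortest-path lengths from s
            u = q.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        # closeness: (n-1) divided by the total distance to all other nodes
        scores[s] = (len(graph) - 1) / sum(dist.values())
    return scores
```

On a star graph the center gets closeness 1.0 and each leaf 0.6, matching the intuition that the hub is "closest to everyone."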

192 (C) 2005, The University of Michigan192 Graph-based representations Square connectivity (adjacency) matrix Graph G (V,E)

193 (C) 2005, The University of Michigan193 Markov chains A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. Path = sequence (x_0, x_1, …, x_n), where x_i = x_{i-1} E The probability of a path can be computed as a product of probabilities for each step i. Random walk = find x_j given x_0, E, and j.

194 (C) 2005, The University of Michigan194 Stationary solutions The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: –E is stochastic –E is irreducible –E is aperiodic To make these conditions true: –All rows of E add up to 1 (and no value is negative) –Make sure that E is strongly connected –Make sure that E is not bipartite Example: PageRank [Brin and Page 1998]: use teleportation

195 (C) 2005, The University of Michigan195 Example This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph.

196 (C) 2005, The University of Michigan196 Eigenvectors An eigenvector is an implicit direction for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle. The largest eigenvalue of a stochastic matrix E is real: λ_1 = 1. For λ_1, the left (principal) eigenvector is p and the right eigenvector is 1 (the all-ones vector). In other words, E^T p = p.

197 (C) 2005, The University of Michigan197 Computing the stationary distribution function PowerStatDist (E): begin p^(0) = u; (or p^(0) = [1,0,…,0]) i = 1; repeat p^(i) = E^T p^(i-1); L = ||p^(i) - p^(i-1)||_1; i = i + 1; until L < ε return p^(i) end Solution for the stationary distribution
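A Python version of PowerStatDist (a sketch; E is assumed to already satisfy the three Ergodic Theorem conditions, e.g., after PageRank-style teleportation has been mixed in):

```python
def stationary(E, eps=1e-10):
    # E: row-stochastic transition matrix as a list of lists
    n = len(E)
    p = [1.0 / n] * n                         # p(0) = u, the uniform vector
    while True:
        # p(i) = E^T p(i-1): new_p[j] = sum_k p[k] * E[k][j]
        new_p = [sum(p[k] * E[k][j] for k in range(n)) for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))   # L1 change
        p = new_p
        if delta < eps:
            return p
```

For the 2-state chain E = [[0.9, 0.1], [0.5, 0.5]] this converges to (5/6, 1/6), which can be verified directly from p = pE.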

198 (C) 2005, The University of Michigan198 Example [The walk's distribution over the example graph shown at t=0, t=1, and t=10]

199 (C) 2005, The University of Michigan199 How Google works Crawling Anchor text Fast query processing PageRank

200 (C) 2005, The University of Michigan200 More about PageRank Named after Larry Page, founder of Google (and UM alum) Reading: "The anatomy of a large-scale hypertextual web search engine" by Brin and Page. Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).

201 (C) 2005, The University of Michigan201 HITS Query-dependent model (Kleinberg 97) Hubs and authorities (e.g., cars, Honda) Algorithm –obtain root set using input query –expand the root set by radius one –run iterations on the hub and authority scores together –report top-ranking authorities and hubs

202 (C) 2005, The University of Michigan202 The link-content hypothesis Topical locality: page is similar ( ) to the page that points to it ( ). Davison (TF*IDF, 100K pages) –0.31 same domain –0.23 linked pages –0.19 sibling –0.02 random Menczer (373K pages, non-linear least squares fit) Chakrabarti (focused crawling) - prob. of losing the topic Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer =1.8, 2 =0.6,

203 (C) 2005, The University of Michigan203 Measuring the Web

204 (C) 2005, The University of Michigan204 Bharat and Broder 1998 Based on crawls of HotBot, Altavista, Excite, and InfoSeek 10,000 queries in mid and late 1997 Estimate is 200M pages Only 1.4% are indexed by all of them

205 (C) 2005, The University of Michigan205 Example (from Bharat&Broder) A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).

206 (C) 2005, The University of Michigan206 Crawling the web

207 (C) 2005, The University of Michigan207 Basic principles The HTTP/HTML protocols Following hyperlinks Some problems: –Link extraction –Link normalization –Robot exclusion –Loops –Spider traps –Server overload

208 (C) 2005, The University of Michigan208 Example U-Ms root robots.txt file: –User-agent: * –Disallow: /~websvcs/projects/ –Disallow: /%7Ewebsvcs/projects/ –Disallow: /~homepage/ –Disallow: /%7Ehomepage/ –Disallow: /~smartgl/ –Disallow: /%7Esmartgl/ –Disallow: /~gateway/ –Disallow: /%7Egateway/
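The standard library can interpret such rules directly. A sketch using urllib.robotparser with an abbreviated copy of the rules above (the test URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# rules abbreviated from the slide's umich.edu robots.txt excerpt
rules = """\
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /~homepage/
Disallow: /~smartgl/
Disallow: /~gateway/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

# a compliant crawler checks can_fetch() before every GET
assert not rp.can_fetch("*", "http://www.umich.edu/~homepage/index.html")
assert rp.can_fetch("*", "http://www.umich.edu/news/index.html")
```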

209 (C) 2005, The University of Michigan209 Example crawler E.g., poacher – /examples/poacher –/data0/projects/perltree-index

210 (C) 2005, The University of Michigan210 &ParseCommandLine(); &Initialise(); $robot->run($siteRoot) #======================================================================= # Initialise() - initialise global variables, contents, tables, etc # This function sets up various global variables such as the version number # for WebAssay, the program name identifier, usage statement, etc. #======================================================================= sub Initialise { $robot = new WWW::Robot( 'NAME' => $BOTNAME, 'VERSION' => $VERSION, ' ' => $ , 'TRAVERSAL' => $TRAVERSAL, 'VERBOSE' => $VERBOSE, ); $robot->addHook('follow-url-test', \&follow_url_test); $robot->addHook('invoke-on-contents', \&process_contents); $robot->addHook('invoke-on-get-error', \&process_get_error); } #======================================================================= # follow_url_test() - tell the robot module whether is should follow link #======================================================================= sub follow_url_test {} #======================================================================= # process_get_error() - hook function invoked whenever a GET fails #======================================================================= sub process_get_error {} #======================================================================= # process_contents() - process the contents of a URL we've retrieved #======================================================================= sub process_contents { run_command($COMMAND, $filename) if defined $COMMAND; }

211 (C) 2005, The University of Michigan211 Focused crawling Topical locality –Pages that are linked are similar in content (and vice- versa: Davison 00, Menczer 02, 04, Radev et al. 04) The radius-1 hypothesis –given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page) Focused crawling –Keeping a priority queue of the most relevant pages
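A priority-queue crawler along these lines might look as follows. This is a sketch written for this transcript: fetch, links, and relevance are caller-supplied stand-ins (no real networking), and a child URL simply inherits its parent page's relevance as priority, per the radius-1 hypothesis:

```python
import heapq

def focused_crawl(seed_urls, fetch, links, relevance, budget=100):
    # fetch(url) -> page; links(page) -> outgoing urls;
    # relevance(page) -> score in [0, 1]   (all supplied by the caller)
    frontier = [(-1.0, u) for u in seed_urls]   # max-heap via negated priorities
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)        # most promising URL first
        page = fetch(url)
        visited.append(url)
        score = relevance(page)
        for v in links(page):                   # children inherit the page's score
            if v not in seen:
                seen.add(v)
                heapq.heappush(frontier, (-score, v))
    return visited
```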

212 (C) 2005, The University of Michigan212 Question answering

213 (C) 2005, The University of Michigan213 People ask questions Excite corpus of 2,477,283 queries (one day's worth) 8.4% of them are questions –43.9% factual (what is the country code for Belgium) –56.1% procedural (how do I set up TCP/IP) or other In other words, 100 K questions per day

214 (C) 2005, The University of Michigan214 People ask questions In what year did baseball become an offical sport? Who is the largest man in the world? Where can i get information on Raphael? where can i find information on puritan religion? Where can I find how much my house is worth? how do i get out of debt? Where can I found out how to pass a drug test? When is the Super Bowl? who is California's District State Senator? where can I buy extra nibs for a foutain pen? how do i set up tcp/ip ? what time is it in west samoa? Where can I buy a little kitty cat? what are the symptoms of attention deficit disorder? Where can I get some information on Michael Jordan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live? Which Frenchman declined the Nobel Prize for Literature for ideological reasons? What is the largest city in Northern Afghanistan?

215 (C) 2005, The University of Michigan215

216 (C) 2005, The University of Michigan216 Question answering What is the largest city in Northern Afghanistan?

217 (C) 2005, The University of Michigan217 Possible approaches Map? Knowledge base: Find x: city(x) ∧ located(x, Northern Afghanistan) ∧ ¬∃y: city(y) ∧ located(y, Northern Afghanistan) ∧ greaterthan(population(y), population(x)) Database? World factbook? Search engine?

218 (C) 2005, The University of Michigan218 The TREC Q&A evaluation Run by NIST [Voorhees and Tice 2000] 2GB of input 200 questions Essentially fact extraction –Who was Lincoln's secretary of state? –What does the Peugeot company manufacture? Questions are based on text Answers are assumed to be present No inference needed

219 (C) 2005, The University of Michigan219 Q: When did Nelson Mandela become president of South Africa? A: 10 May 1994 Q: How tall is the Matterhorn? A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches Q: How tall is the replica of the Matterhorn at Disneyland? A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years Q: If Iraq attacks a neighboring country, what should the US do? A: ?? Question answering

220 (C) 2005, The University of Michigan220

221 (C) 2005, The University of Michigan221 NSIR Current project at U-M – Reading: –[Radev et al., 2005a] Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005

222 (C) 2005, The University of Michigan222

223 (C) 2005, The University of Michigan223 [Screenshot: Google results for the question, with the answer buried in snippets such as "... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban ..."]

224 (C) 2005, The University of Michigan224 Document retrieval Query modulation Sentence retrieval Answer extraction Answer ranking What is the largest city in Northern Afghanistan? (largest OR biggest) city Northern Afghanistan Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan Gudermes Mazar-e-Sharif Gudermes

225 (C) 2005, The University of Michigan225

226 (C) 2005, The University of Michigan226

227 (C) 2005, The University of Michigan227 Research problems Source identification: –semi-structured vs. text sources Query modulation: –best paraphrase of a NL question given the syntax of a search engine? –Compare two approaches: noisy channel model and rule-based Sentence ranking –n-gram matching, Okapi, co-reference? Answer extraction –question type identification –phrase chunking –no general-purpose named entity tagger available Answer ranking –what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency Evaluation (MRDR) –accuracy, reliability, timeliness

228 (C) 2005, The University of Michigan228 Document retrieval Use existing search engines: Google, AlltheWeb, NorthernLight No modifications to question CF: work on QASM (ACM CIKM 2001)

229 (C) 2005, The University of Michigan229 Sentence ranking Weighted N-gram matching: Weights are determined empirically, e.g., 0.6, 0.3, and 0.1
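One plausible reading of the weighted N-gram matcher (a sketch written for this transcript: the 0.6/0.3/0.1 weights come from the slide, but normalizing each overlap by the number of query n-grams is an assumption):

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_score(query, sentence, weights=(0.6, 0.3, 0.1)):
    # weighted 1-, 2-, and 3-gram overlap between query and sentence
    q, s = query.lower().split(), sentence.lower().split()
    score = 0.0
    for n, w in enumerate(weights, start=1):
        qn = ngrams(q, n)
        if qn:
            score += w * len(qn & ngrams(s, n)) / len(qn)
    return score
```

A sentence containing the entire query verbatim scores 1.0; a sentence sharing no words scores 0.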

230 (C) 2005, The University of Michigan230 Probabilistic phrase reranking Answer extraction: probabilistic phrase reranking. What is: p(ph is answer to q | q, ph) Evaluation: TRDR –Example: ranks (2,8,10) give .725 –Document, sentence, or phrase level Criterion: presence of answer(s) High correlation with manual assessment
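TRDR sums the reciprocals of the ranks at which correct answers appear, which is how (2,8,10) yields .725:

```python
def trdr(answer_ranks):
    # Total Reciprocal Document Rank: credit 1/r for every rank r
    # at which a correct answer appears in the result list
    return sum(1.0 / r for r in answer_ranks)

# 1/2 + 1/8 + 1/10 = 0.725
```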


232 (C) 2005, The University of Michigan232 Question Type Identification Wh-type not sufficient: Who: PERSON 77, DESCRIPTION 19, ORG 6 What: NOMINAL 78, PLACE 27, DEF26, PERSON 18, ORG 16, NUMBER 14, etc. How: NUMBER 33, LENGTH 6, RATE 2, etc. Ripper: –13 features: Question-Words, Wh-Word, Word-Beside- Wh-Word, Is-Noun-Length, Is-Noun-Person, etc. –Top 2 question types Heuristic algorithm: –About 100 regular expressions based on words and parts of speech

233 (C) 2005, The University of Michigan233 Ripper performance %-TREC8,9,10 30%17.03%TREC10TREC8,9 24%22.4%TREC8TREC9 Test Error Rate Train Error Rate TestTraining

234 (C) 2005, The University of Michigan234 Regex performance 7.6%5.5%4.6%TREC8,9, %6%7.4%TREC8,9 18%15%7.8%TREC9 Test on TREC10 Test on TREC8 Test on TREC9 Training

235 (C) 2005, The University of Michigan235 Phrase ranking Phrases are identified by a shallow parser (ltchunk from Edinburgh) Four features: –Proximity –POS (part-of-speech) signature (qtype) –Query overlap –Frequency

236 (C) 2005, The University of Michigan236 Proximity Phrasal answers tend to appear near words from the query Average distance = 7 words, range = 1 to 50 words Use linear rescaling of scores

237 (C) 2005, The University of Michigan237 Part of speech signature NO (100%) NO (86.7%) PERSON (3.8%) NUMBER (3.8%) ORG (2.5%) PERSON (37.4%) PLACE (29.6%) DATE (21.7%) NO (7.6%) NO (75.6%) NUMBER (11.1%) PLACE (4.4%) ORG (4.4%) PLACE (37.3%) PERSON (35.6%) NO (16.9%) ORG (10.2%) ORG (55.6%) NO (33.3%) PLACE (5.6%) DATE (5.6%) VBD DT NN NNP DT JJ NNP NNP NNP DT NNP Phrase TypesSignature Example: Hugo/NNP Young/NNP P (PERSON | NNP NNP) =.458 Example: the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN P (PERSON | DT NNP NNP NNP NN) = 0 Penn Treebank tagset (DT = determiner, JJ = adjective)

238 (C) 2005, The University of Michigan238 Query overlap and frequency Query overlap: –What is the capital of Zimbabwe? –Possible choices: Mugabe, Zimbabwe, Luanda, Harare Frequency: –Not necessarily accurate but rather useful

239 (C) 2005, The University of Michigan239 Reranking Rank, probability, and phrase: the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN._ International_NNP Space_NNP Station_NNP Alpha_NNP International_NNP Space_NNP Station_NNP to_TO become_VB a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP NASA_NNP Johnson_NNP Space_NNP Center_NNP will_MD form_VB The_DT purpose_NN prime_JJ contracts_NNS First_NNP American_NNP this_DT bulletin_NN board_NN Space_NNP :_: 'Spirit_NN '_'' of_IN Alan_NNP Shepard_NNP Proximity = .5164

240 (C) 2005, The University of Michigan240 Reranking Rank, probability, and phrase: Space_NNP Administration_NNP._ SPACE_NNP CALENDAR_NNP _ First_NNP American_NNP International_NNP Space_NNP Station_NNP Alpha_NNP her_PRP$ third_JJ space_NN mission_NN NASA_NNP Johnson_NNP Space_NNP Center_NNP the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP the_DT Red_NNP Planet_NNP._ First_NNP American_NNP Alan_NNP Shepard_NNP February_NNP Space_NNP International_NNP Space_NNP Station_NNP Qtype = .7288 Proximity * qtype = .3763

241 (C) 2005, The University of Michigan241 Reranking Rank, probability, and phrase: Neptune_NNP Beach_NNP._ February_NNP Go_NNP Space_NNP Go_NNP Alan_NNP Shepard_NNP First_NNP American_NNP Space_NNP May_NNP First_NNP American_NNP woman_NN Life_NNP Sciences_NNP Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN the_DT Moon_NNP International_NNP Space_NNP Station_NNP Space_NNP Research_NNP A_NNP Session_NNP All four features

242 (C) 2005, The University of Michigan242

243 (C) 2005, The University of Michigan243

244 (C) 2005, The University of Michigan244

245 (C) 2005, The University of Michigan245 Document level performance [Table: average TRDR per engine (Google, NorthernLight, AlltheWeb) on the TREC 8 corpus (200 questions)]

246 (C) 2005, The University of Michigan246 Sentence level performance [Table: average TRDR per engine and condition (GO/NL/AW × O/L/U)]

247 (C) 2005, The University of Michigan247 Phrase level performance [Table: Combined, Global proximity, Appearance order, and Upperbound rankings for Google S+P, Google D+P, NorthernLight, and AlltheWeb] Experiments performed Oct-Nov. 2001

248 (C) 2005, The University of Michigan248

249 (C) 2005, The University of Michigan249

250 (C) 2005, The University of Michigan250 Text classification

251 (C) 2005, The University of Michigan251 Introduction Text classification: assigning documents to predefined categories Hierarchical vs. flat Many techniques: generative (Naïve Bayes) vs. discriminative (maxent, SVM, regression), plus instance-based methods (knn) Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) Discriminative: model p(y|x) directly.

252 (C) 2005, The University of Michigan252 Generative models: knn K-nearest neighbors Very easy to program Issues: choosing k, b?

253 (C) 2005, The University of Michigan253 Feature selection: The 2 test For a term t: Testing for independence: P(C=0,It=0) should be equal to P(C=0) P(It=0) –P(C=0) = (k 00 +k 01 )/n –P(C=1) = 1-P(C=0) = (k 10 +k 11 )/n –P(I t =0) = (k 00 +K 10 )/n –P(I t =1) = 1-P(I t =0) = (k 01 +k 11 )/n ItIt 01 C0k 00 k 01 1k 10 k 11

254 (C) 2005, The University of Michigan254 Feature selection: The 2 test High values of 2 indicate lower belief in independence. In practice, compute 2 for all words and pick the top k among them.

255 (C) 2005, The University of Michigan255 Feature selection: mutual information No document length scaling is needed Documents are assumed to be generated according to the multinomial model

256 (C) 2005, The University of Michigan256 Naïve Bayesian classifiers The Naïve Bayesian classifier assumes statistical independence of the features given the class
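A minimal multinomial Naïve Bayes with add-one (Laplace) smoothing makes the independence assumption concrete: the log-probability of a document given a class is just the sum of per-term log-probabilities. The toy documents and labels below are invented for illustration:

```python
# Multinomial Naive Bayes with add-one smoothing.
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (token list, label). Returns (log priors, cond. log-probs)."""
    class_docs = defaultdict(int)
    class_terms = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        class_terms[label].update(tokens)
        vocab.update(tokens)
    n = len(docs)
    prior = {c: log(class_docs[c] / n) for c in class_docs}
    condlog = {}
    for c, counts in class_terms.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen terms get nonzero probability
        condlog[c] = {t: log((counts[t] + 1) / (total + len(vocab)))
                      for t in vocab}
    return prior, condlog

def classify_nb(model, tokens):
    prior, condlog = model
    # Independence assumption: sum the per-term log-probabilities
    scores = {c: prior[c] + sum(condlog[c][t] for t in tokens if t in condlog[c])
              for c in prior}
    return max(scores, key=scores.get)

docs = [("ball game team".split(), "sports"),
        ("game score win".split(), "sports"),
        ("vote election party".split(), "politics")]
model = train_nb(docs)
print(classify_nb(model, "game win".split()))  # -> sports
```

Terms outside the training vocabulary are simply ignored at classification time, which is one common convention.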


258 (C) 2005, The University of Michigan258 Well-known datasets 20 newsgroups Reuters –Cats: grain, acquisitions, corn, crude, wheat, trade… WebKB –Classes: course, student, faculty, staff, project, dept, other –NB performance (2000) –P=26,43,18,6,13,2,94 –R=83,75,77,9,73,100,35

259 (C) 2005, The University of Michigan259 Support vector machines Introduced by Vapnik in the early 90s.
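A linear SVM can be sketched as batch subgradient descent on the L2-regularized hinge loss. This is an illustration of the objective, not the implementation behind any results quoted later in the deck; the toy data and hyperparameters are invented:

```python
# Minimal linear SVM: batch subgradient descent on
# lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w.x_i).
def train_svm(X, y, lam=0.01, lr=0.1, epochs=1000):
    """y in {-1, +1}; returns the learned weight vector."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]          # regularizer term
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:                     # hinge loss is active
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy term-count vectors over the vocabulary {ball, game, vote, election}
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 1, 1, 2]]
y = [1, 1, -1, -1]          # +1 = sports, -1 = politics
w = train_svm(X, y)
print([predict(w, x) for x in X])
```

The regularizer pushes toward a large-margin separator, which is the property the "soft margins" and "VC dimension" topics on the next slide build on.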

260 (C) 2005, The University of Michigan260 Semi-supervised learning EM Co-training Graph-based
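As one concrete instance of the graph-based family (the slide only names the three method families), here is a minimal label-propagation sketch: labeled seed nodes are clamped, and every unlabeled node repeatedly takes the weighted average of its neighbors' scores. The similarity graph is invented for illustration:

```python
# Iterative label propagation on a similarity graph.
def label_propagation(W, labels, iters=50):
    """W: symmetric weight matrix (list of lists);
    labels: dict node -> +1.0/-1.0 for the labeled seeds.
    Returns a +1/-1 label for every node."""
    n = len(W)
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        g = f[:]
        for i in range(n):
            if i in labels:
                continue                      # clamp the labeled seeds
            total = sum(W[i])
            if total:
                g[i] = sum(W[i][j] * f[j] for j in range(n)) / total
        f = g
    return [1 if x >= 0 else -1 for x in f]

# 4-node chain 0 -- 1 -- 2 -- 3 with labeled seeds at both ends:
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(label_propagation(W, {0: 1.0, 3: -1.0}))  # -> [1, 1, -1, -1]
```

The two interior nodes inherit the label of the nearer seed, which is the intuition behind using unlabeled data: similarity edges carry label information from the few labeled examples to the rest.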

261 (C) 2005, The University of Michigan261 Additional topics Soft margins VC dimension Kernel methods

262 (C) 2005, The University of Michigan262 SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters. NB is also good in many circumstances

263 (C) 2005, The University of Michigan263 Readings Books: 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval; Addison-Wesley/ACM Press 2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003 Papers: Barabasi and Albert "Emergence of scaling in random networks" Science (286), 1999 Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998 Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998 Bush "As We May Think" The Atlantic Monthly 1945 Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999 Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998 Davison "Topical locality on the Web" SIGIR 2000 Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999 Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990

264 (C) 2005, The University of Michigan264 Readings Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004 Jeong and Barabasi "Diameter of the world wide web" Nature (401), 1999 Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000 Haveliwala "Topic-sensitive pagerank" WWW 2002 Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000 Lawrence and Giles "Accessibility of information on the Web" Nature (400), 1999 Lawrence and Giles "Searching the World-Wide Web" Science (280), 1998 Menczer "Links tell us about lexical and semantic Web content" arXiv 2001 Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998 Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005 Singhal "Modern Information Retrieval: an Overview" IEEE 2001

265 (C) 2005, The University of Michigan265 More readings Gerard Salton, Automatic Text Processing, Addison-Wesley (1989) Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997) Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983) C. J. van Rijsbergen, Information Retrieval, Butterworths (1979) Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994) ACM SIGIR Proceedings, SIGIR Forum ACM conferences in Digital Libraries

266 (C) 2005, The University of Michigan266 Thank you! Благодаря! (Bulgarian: "Thank you!")
