Information Retrieval

1 Information Retrieval
September 19, 2005 Information Retrieval Dragomir R. Radev University of Michigan (C) 2005, The University of Michigan

2 About the instructor Dragomir R. Radev
Associate Professor, University of Michigan School of Information Department of Electrical Engineering and Computer Science Department of Linguistics Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan Treasurer, North American Chapter of the ACL Ph.D., 1998, Computer Science, Columbia University Home page: (C) 2005, The University of Michigan

3 Introduction

4 IR systems Google Vivísimo AskJeeves NSIR Lemur MG Nutch

5 Examples of IR systems Conventional (library catalog). Search by keyword, title, author, etc. Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language. Multimedia (QBIC, WebSeek, SaFe) Search by visual appearance (shapes, colors,… ). Question answering systems (AskJeeves, NSIR, Answerbus) Search in (restricted) natural language (C) 2005, The University of Michigan



8 Need for IR Advent of WWW - more than 8 Billion documents indexed on Google How much information? 200TB according to Lyman and Varian 2003. Search, routing, filtering User’s information need (C) 2005, The University of Michigan

9 Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects).” (C) 2005, The University of Michigan

10 Sample queries (from Excite)
In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running (C) 2005, The University of Michigan

11 Mappings and abstractions
Reality Data Information need Query From Korfhage’s book (C) 2005, The University of Michigan

12 Typical IR system (Crawling) Indexing Retrieval User interface

13 Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve COLLECTION: a set of documents INDEX: a representation of information that makes querying easier TERM: word or concept that appears in a document or a query (C) 2005, The University of Michigan

14 Documents

15 Documents Not just printed paper collections vs. documents
data structures: representations Bag of words method document surrogates: keywords, summaries encoding: ASCII, Unicode, etc. (C) 2005, The University of Michigan

16 Document preprocessing
Formatting Tokenization (Paul’s, Willow Dr., Dr. Willow, , New York, ad hoc) Casing (cat vs. CAT) Stemming (computer, computation) Soundex (C) 2005, The University of Michigan

17 Document representations
Term-document matrix (m x n) term-term matrix (m x m x n) document-document matrix (n x n) Example: 3,000,000 documents (n) with 50,000 terms (m) sparse matrices Boolean vs. integer matrices (C) 2005, The University of Michigan

18 Document representations
Term-document matrix Evaluating queries (e.g., (AB)C) Storage issues Inverted files Evaluating queries Advantages and disadvantages (C) 2005, The University of Michigan

19 IR models

20 Major IR models Boolean Vector Probabilistic Language modeling
Fuzzy retrieval Latent semantic indexing (C) 2005, The University of Michigan

21 Major IR tasks Ad-hoc Filtering and routing Question answering
Spoken document retrieval Multimedia retrieval (C) 2005, The University of Michigan

22 Venn diagrams [Figure: Venn diagram of document sets D1 and D2 with overlap regions labeled w, x, y, z]

23 Boolean model [Figure: Venn diagram of sets A and B]

24 Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive What types of documents are returned? Stemming thesaurus expansion inclusive vs. exclusive OR confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal) (C) 2005, The University of Michigan

25 Boolean queries Weighting (Beethoven AND sonatas) precedence
coffee AND croissant OR muffin raincoat AND umbrella OR sunglasses Use of negation: potential problems Conjunctive and Disjunctive normal forms Full CNF and DNF (C) 2005, The University of Michigan

26 Transformations De Morgan’s Laws: CNF or DNF?
NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) CNF or DNF? Reference librarians prefer CNF - why? (C) 2005, The University of Michigan

27 Boolean model Partition Partial relevance?
Operators: AND, NOT, OR, parentheses (C) 2005, The University of Michigan

28 Exercise D1 = “computer information retrieval”
D2 = “computer retrieval” D3 = “information” D4 = “computer information” Q1 = “information ∧ retrieval” Q2 = “information ∧ ¬computer”
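The exercise above can be checked with a minimal sketch of Boolean retrieval: each document becomes a set of terms, and a query is a conjunction of required terms with optional negated terms. The set-based evaluation strategy and function names are illustrative assumptions, not the lecture's own code.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}
index = {name: set(text.split()) for name, text in docs.items()}

def evaluate(required, forbidden=()):
    """Return documents containing all required terms and none of the forbidden ones."""
    return sorted(
        name for name, terms in index.items()
        if all(t in terms for t in required) and not any(t in terms for t in forbidden)
    )

# Q1 = information AND retrieval
print(evaluate(["information", "retrieval"]))             # ['D1']
# Q2 = information AND NOT computer
print(evaluate(["information"], forbidden=["computer"]))  # ['D3']
```

Only D1 contains both query terms of Q1, and only D3 contains "information" without "computer", matching the intended answers.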

29 Exercise [Figure: Venn diagram with regions numbered 1–15 over the sets Swift, Shakespeare, Milton, and Chaucer] ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

30 Stop lists most common words in English account for 50% or more of a given text. Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63 (C) 2005, The University of Michigan

31 Vector models [Figure: documents Doc 1–Doc 3 plotted as vectors in the space spanned by Term 1, Term 2, and Term 3]

32 Vector queries Each document is represented as a vector
non-efficient representations (bit vectors) dimensional compatibility (C) 2005, The University of Michigan

33 The matching process Document space
Matching is done between a document and a query (or between two documents) distance vs. similarity Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc. (C) 2005, The University of Michigan

34 Miscellaneous similarity measures
The Cosine measure: σ(D,Q) = Σ (di × qi) / (√Σ(di)² × √Σ(qi)²) = |X ∩ Y| / (|X| × |Y|)
The Jaccard coefficient: σ(D,Q) = |X ∩ Y| / |X ∪ Y|

35 Exercise Compute the cosine measures  (D1,D2) and  (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1> Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients. (C) 2005, The University of Michigan
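The exercise above can be worked through with a short sketch. The weighted Jaccard variant (sum of minimums over sum of maximums) is one common generalization to weighted vectors and is an assumption here; the slide defines Jaccard on sets.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_weighted(x, y):
    # sum-of-mins over sum-of-maxes generalization (an assumption)
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(round(cosine(D1, D2), 3))  # 1.0  (D2 points in the same direction as D1)
print(round(cosine(D1, D3), 3))  # 0.6
print(round(euclidean(D1, D3), 3), manhattan(D1, D3))
```

Note that cosine ignores vector length: D1 and D2 are parallel, so their cosine is exactly 1 even though their Euclidean distance is large.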

36 Evaluation

37 Relevance Difficult to define: fuzzy, inconsistent
Methods: exhaustive, sampling, pooling, search-based (C) 2005, The University of Michigan

38 Contingency table
              retrieved    not retrieved
relevant          w              x          n1 = w + x
not relevant      y              z
              n2 = w + y                    N

39 Precision and Recall Recall = w / (w + x) Precision = w / (w + y)

40 Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.

41 [From Salton’s book]

42 Interpolated average precision (e.g., 11pt)
Interpolation – what is precision at recall=0.5? (C) 2005, The University of Michigan

43 Issues Why not use accuracy A = (w+z)/N? Average precision
Average P at given “document cutoff values” Report when P=R F measure: F = (β²+1)PR/(β²P+R) F1 measure: F1 = 2/(1/R+1/P): harmonic mean of P and R
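The measures above can be sketched directly from the contingency-table counts. The counts used here (w = relevant and retrieved, x = relevant but missed, y = non-relevant but retrieved) are illustrative, not from the slides.

```python
def precision(w, y):
    return w / (w + y)

def recall(w, x):
    return w / (w + x)

def f_measure(p, r, beta=1.0):
    # F = (b^2 + 1) P R / (b^2 P + R); beta=1 gives F1, the harmonic mean
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

w, x, y = 30, 20, 10  # hypothetical retrieval outcome
p, r = precision(w, y), recall(w, x)
print(p, r, round(f_measure(p, r), 3))  # 0.75 0.6 0.667
```

The harmonic mean penalizes imbalance: a system with P = 0.75 and R = 0.6 scores well below the arithmetic mean of the two.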

44 Kappa N: number of items (index i) n: number of categories (index j)
k: number of annotators (C) 2005, The University of Michigan

45 Kappa example (from Manning, Schuetze, Raghavan)
         J1+    J1-
J2+      300     10
J2-       20     70

46 Kappa (cont’d) P(A) = 370/400 = 0.925
P(+) = (300 + 300 + 10 + 20)/800 = 0.7875; P(−) = (10 + 20 + 70 + 70)/800 = 0.2125
P(E) = 0.7875² + 0.2125² = 0.665
K = (0.925 − 0.665)/(1 − 0.665) = 0.776
Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
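The kappa computation for the two-judge example above can be sketched as follows; the chance-agreement term pools both judges' marginals, as in the Manning, Raghavan and Schuetze example.

```python
# 2x2 agreement counts from the example: 300 both-positive, 70 both-negative,
# 10 and 20 disagreements
both_pos, j2pos_j1neg, j2neg_j1pos, both_neg = 300, 10, 20, 70
N = both_pos + j2pos_j1neg + j2neg_j1pos + both_neg  # 400 items

p_a = (both_pos + both_neg) / N  # observed agreement
# pooled marginals over both judges (2N judgments in total)
p_pos = (2 * both_pos + j2pos_j1neg + j2neg_j1pos) / (2 * N)
p_neg = (2 * both_neg + j2pos_j1neg + j2neg_j1pos) / (2 * N)
p_e = p_pos ** 2 + p_neg ** 2  # agreement expected by chance
kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 4), round(p_e, 4), round(kappa, 3))  # 0.925 0.6653 0.776
```

Since 0.776 falls between 0.67 and 0.8, the agreement in this example is tentatively acceptable but not yet good.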

47 Relevance collections
TREC ad hoc collections, 2-6 GB TREC Web collections, 2-100GB (C) 2005, The University of Michigan

48 Sample TREC query <top> <num> Number: 305
<title> Most Dangerous Vehicles <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. </top>

49
<DOCNO> LA </DOCNO> <DOCID> </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... 
</TEXT> <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC> (C) 2005, The University of Michigan

50 TREC (cont’d) http://trec.nist.gov/tracks.html

51 Word distribution models

52 Shakespeare Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; (C) 2005, The University of Michigan

53 The BNC (Adam Kilgarriff)
the (det), be (v), of (prep), and (conj), a (det), in (prep), to (infinitive-marker), have (v), it (pron), to (prep), for (prep), i (pron), that (conj), you (pron), he (pron), on (prep), with (prep), do (v), at (prep), by (prep) Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) Pp

54 Stop lists most common words in English account for 50% or more of a given text. Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63 (C) 2005, The University of Michigan

55 Zipf’s law Rank × Frequency ≈ Constant
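The relation above can be checked on the top Romeo and Juliet counts from the Shakespeare slide earlier. The very top ranks typically deviate from a constant product; the curve flattens out for middle ranks.

```python
# Rank-frequency products for the ten most frequent words in Romeo and
# Juliet (counts taken from the Shakespeare slide).
counts = [("and", 667), ("the", 661), ("i", 570), ("to", 515), ("a", 447),
          ("of", 382), ("my", 356), ("is", 343), ("that", 343), ("in", 314)]
products = [rank * freq for rank, (_, freq) in enumerate(counts, start=1)]
for rank, (word, freq) in enumerate(counts, start=1):
    print(f"{rank:2d}  {word:4s} {freq:4d}  rank*freq = {rank * freq}")
```

Frequencies decrease monotonically while the products grow toward a plateau, which is the usual shape of real rank-frequency data against the idealized Zipf constant.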

56 Zipf's law is fairly general!
Frequency of accesses to web pages in particular the access counts on the Wikipedia page, with s approximately equal to 0.3 page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5 Words in the English language for instance, in Shakespeare’s play Hamlet with s approximately 0.5 Sizes of settlements Income distributions amongst individuals Size of earthquakes Notes in musical performances (C) 2005, The University of Michigan

57 Zipf’s law (cont’d) Limitations: low and high frequencies; lack of convergence
Zipf’s law is a power law with coefficient c = -1: y = k·x^c
Li (1992) – typing words one letter at a time, including spaces

58 Heaps’ law Size of vocabulary: V(n) = K·n^β
In English, K is between 10 and 100, β is between 0.4 and 0.6. [Figure: V(n) grows sublinearly with n]
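The formula can be sketched with illustrative constants chosen from the stated English ranges (K = 44, β = 0.5 are assumptions, not measured values).

```python
# Heaps' law: V(n) = K * n^beta -- vocabulary grows sublinearly with
# collection size n (in tokens). K and beta are illustrative.
K, beta = 44.0, 0.5

def predicted_vocab(n):
    return K * n ** beta

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}  V(n) ~ {predicted_vocab(n):,.0f}")
```

With β = 0.5, multiplying the collection size by 100 only multiplies the predicted vocabulary by 10, which is the practical point of the law for index sizing.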

59 Heaps’ law (cont’d) Related to Zipf’s law: generative models
Zipf’s and Heaps’ laws’ coefficients change with language Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science 2004, Springer-Verlag, pp. 332–335.

60 Indexing

61 Methods Manual: e.g., Library of Congress subject headings, MeSH
Automatic (C) 2005, The University of Michigan

62 LOC subject headings http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WORKS B -- PHILOSOPHY. PSYCHOLOGY. RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY (GENERAL) AND HISTORY OF EUROPE E -- HISTORY: AMERICA F -- HISTORY: AMERICA G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL) (C) 2005, The University of Michigan

63 Medicine CLASS R - MEDICINE Subclass R R5-920 Medicine (General)
R General works R History of medicine. Medical expeditions R Medicine as a profession. Physicians R Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R Directories R Missionary medicine. Medical missionaries R Medical philosophy. Medical ethics R Medicine and disease in relation to psychology. Terminal care. Dying R Medical personnel and the public. Physician and the public R Practice of medicine. Medical practice economics R Medical education. Medical schools. Research R Medical technology R Biomedical engineering. Electronics. Instrumentation R Computer applications to medicine. Medical informatics R864 Medical records R Medical physics. Medical radiology. Nuclear medicine (C) 2005, The University of Michigan

64 Finding the most frequent terms in a document
Typically stop words: the, and, in Not content-bearing Terms vs. words Luhn’s method (C) 2005, The University of Michigan

65 Luhn’s method [Figure: word frequency plotted against rank, with upper and lower frequency cutoffs delimiting the significant words]

66 Computing term salience
Term frequency (TF) Document frequency (DF) Inverse document frequency (IDF)

67 Applications of TFIDF Cosine similarity Indexing Clustering

68 Vector-based matching
The cosine measure: sim(D,C) = Σk (dk · ck · idf(k)) / √(Σk (dk)² · Σk (ck)²)

69 IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination: N: number of documents dk: number of documents containing term k fik: absolute frequency of term k in document i wik: weight of term k in document i idfk = log2(N/dk) + 1 = log2N - log2dk + 1 (C) 2005, The University of Michigan
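The idf definition above can be sketched on a toy collection (the four documents here are illustrative assumptions, not from the lecture):

```python
import math

def idf(N, dk):
    # idf_k = log2(N / d_k) + 1, as defined above
    return math.log2(N / dk) + 1

docs = [
    "car safety tests",
    "car reviews",
    "safety statistics",
    "baseball scores",
]
N = len(docs)
df = {}  # document frequency d_k for each term
for d in docs:
    for term in set(d.split()):
        df[term] = df.get(term, 0) + 1

def tfidf(term, doc):
    tf = doc.split().count(term)  # absolute frequency f_ik
    return tf * idf(N, df[term])

print(idf(N, df["car"]))       # df=2 of 4 docs -> log2(2) + 1 = 2.0
print(idf(N, df["baseball"]))  # df=1 of 4 docs -> log2(4) + 1 = 3.0
```

Terms appearing in fewer documents get higher idf, so rare terms dominate the weighted cosine match while common ones are discounted.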

70 Asian and European news
deng china beijing chinese xiaoping jiang communist body party died leader state people nato albright belgrade enlargement alliance french opposition russia government told would their which (C) 2005, The University of Michigan

71 Other topics 120.385 shuttle 99.487 space 90.128 telescope
hubble rocket astronauts discovery canaveral cape mission florida center compuserve massey salizzoni bob online executive interim chief service second world president (C) 2005, The University of Michigan

72 Compression

73 Compression Methods Fixed length codes Huffman coding Ziv-Lempel codes

74 Fixed length codes Binary representations ASCII
Representational power (2k symbols where k is the number of bits) (C) 2005, The University of Michigan

75 Variable length codes Alphabet (international Morse code):
A .-    B -...  C -.-.  D -..   E .     F ..-.  G --.   H ....  I ..
J .---  K -.-   L .-..  M --    N -.    O ---   P .--.  Q --.-  R .-.
S ...   T -     U ..-   V ...-  W .--   X -..-  Y -.--  Z --..
0 -----  1 .----
Demo:

76 Most frequent letters in English
E T A O I N S H R D L U Demo: Also: bigrams: TH HE IN ER AN RE ND AT ON NT (C) 2005, The University of Michigan

77 Useful links about cryptography

78 Huffman coding Developed by David Huffman (1952)
Average of 5 bits per character (37.5% compression) Based on frequency distributions of symbols Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols (C) 2005, The University of Michigan
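The algorithm above can be sketched with a heap: pop the two least frequent subtrees, merge them, and push the result back. The sample frequencies are illustrative assumptions.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    tiebreak = count()  # unique integers so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"e": 120, "t": 90, "a": 80, "o": 75, "q": 2}  # illustrative counts
codes = huffman_codes(freqs)
print(codes)
# frequent symbols get codes no longer than rare ones
assert len(codes["e"]) <= len(codes["q"])
# prefix property: no code is a prefix of another
vals = list(codes.values())
assert not any(a != b and b.startswith(a) for a in vals for b in vals)
```

The prefix property is what makes the variable-length output decodable without separators, which fixed-length ASCII gets for free.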


80 [Figure: Huffman code tree over symbols a–j, with internal nodes formed by merging the two least frequent subtrees]


82 Exercise Consider the bit string: Use the Huffman code from the example to decode it. Try inserting, deleting, and switching some bits at random locations and try decoding. (C) 2005, The University of Michigan

83 Ziv-Lempel coding Two types - one is known as LZ77 (used in GZIP)
Code: set of triples <a,b,c> a: how far back in the decoded text to look for the upcoming text segment b: how many characters to copy c: new character to add to complete segment (C) 2005, The University of Michigan

84 LZ77 example
<0,0,p> p
<0,0,e> pe
<0,0,t> pet
<2,1,r> peter
<0,0,_> peter_
<6,1,i> peter_pi
<8,2,r> peter_piper
<6,3,c> peter_piper_pic
<0,0,k> peter_piper_pick
<7,1,d> peter_piper_picked
<7,1,a> peter_piper_picked_a
<9,2,e> peter_piper_picked_a_pe
<9,2,_> peter_piper_picked_a_peck_
<0,0,o> peter_piper_picked_a_peck_o
<0,0,f> peter_piper_picked_a_peck_of
<17,5,l> peter_piper_picked_a_peck_of_pickl
<12,1,d> peter_piper_picked_a_peck_of_pickled
<16,3,p> peter_piper_picked_a_peck_of_pickled_pep
<3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
<0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
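The trace above can be replayed with a minimal decoder for the triple scheme: copy `length` characters starting `back` positions before the current end of the output, then append the new character.

```python
def lz77_decode(triples):
    """Decode <back, length, char> triples into the original string."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):         # copy one char at a time so the
            out.append(out[start + i])  # copied region may overlap the output
        out.append(ch)
    return "".join(out)

triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers
```

Copying character by character (rather than slicing) matters because a triple may reference a region that overlaps the text being produced.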

85 Links on text compression
Data compression: Calgary corpus: Huffman coding: LZ (C) 2005, The University of Michigan

86 Relevance feedback and query expansion

87 Relevance feedback Problem: initial query may not be the most appropriate to satisfy a given information need. Idea: modify the original query so that it gets closer to the right documents in the vector space (C) 2005, The University of Michigan

88 Relevance feedback Automatic Manual Method: identifying feedback terms
Q’ = a1Q + a2R - a3N Often a1 = 1, a2 = 1/|R| and a3 = 1/|N| (C) 2005, The University of Michigan

89 Example Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant D2 = “liability tests safety” - relevant D3 = “car passengers injury reviews” - non-relevant R = ? S = ? Q’ = ? (C) 2005, The University of Michigan
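One way to work the example above is the Rocchio-style update Q' = a1·Q + a2·R − a3·N over binary-count term vectors, with a1 = 1, a2 = 1/|R|, a3 = 1/|N| as on the previous slide. The vector representation is an illustrative assumption.

```python
from collections import Counter

query = "safety minivans"
relevant = ["car safety minivans tests injury statistics",
            "liability tests safety"]
nonrelevant = ["car passengers injury reviews"]

vocab = sorted(set((query + " " + " ".join(relevant + nonrelevant)).split()))

def vec(text):
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

a1, a2, a3 = 1.0, 1.0 / len(relevant), 1.0 / len(nonrelevant)
q = vec(query)
sum_r = [sum(col) for col in zip(*map(vec, relevant))]
sum_n = [sum(col) for col in zip(*map(vec, nonrelevant))]
q_new = [a1 * qi + a2 * ri - a3 * ni for qi, ri, ni in zip(q, sum_r, sum_n)]

# 'safety' ends up with the largest weight; terms only in the
# non-relevant document go negative
for term, w in sorted(zip(vocab, q_new), key=lambda p: -p[1]):
    if w != 0:
        print(f"{term:12s} {w:+.2f}")
```

Terms shared by the relevant documents ("safety", "tests") are boosted, while terms unique to the non-relevant one ("passengers", "reviews") are pushed below zero and can be dropped or used as negative evidence.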

90 Pseudo relevance feedback
Automatic query expansion Thesaurus-based expansion (e.g., using latent semantic indexing – later…) Distributional similarity Query log mining (C) 2005, The University of Michigan

91 Examples Lexical semantics (Hypernymy): Distributional clustering:
Book: publication, product, fact, dramatic composition, record Computer: machine, expert, calculator, reckoner, figurer Fruit: reproductive structure, consequence, product, bear Politician: leader, schemer Newspaper: press, publisher, product, paper, newsprint Distributional clustering: Book: autobiography, essay, biography, memoirs, novels Computer: adobe, computing, computers, developed, hardware Fruit: leafy, canned, fruits, flowers, grapes Politician: activist, campaigner, politicians, intellectuals, journalist Newspaper: daily, globe, newspapers, newsday, paper (C) 2005, The University of Michigan

92 Examples (query logs) Book: booksellers, bookmark, blue
Computer: sales, notebook, stores, shop Fruit: recipes cake salad basket company Games: online play gameboy free video Politician: careers federal office history Newspaper: online website college information Schools: elementary high ranked yearbook California: berkeley san francisco southern French: embassy dictionary learn (C) 2005, The University of Michigan

93 Problems with automatic query expansion
Adding frequent words may dilute the results (example?) (C) 2005, The University of Michigan

94 Stemming

95 Goals Motivation: Representing related words as one token
Computer, computers, computerize, computational, computerization User, users, using, used Representing related words as one token Simplify matching Reduce storage and computation Also known as: term conflation (C) 2005, The University of Michigan

96 Methods Manual (tables) Affix removal (Harman 1991, Frakes 1992)
Achievement  achiev Achiever  achiev Etc. Affix removal (Harman 1991, Frakes 1992) if a word ends in “ies” but not “eies” or “aies” then “ies”  “y” If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es”  “e” If a word ends in “s” but not “us” or “ss” then “s”  NULL (apply only the first applicable rule) (C) 2005, The University of Michigan

97 Porter’s algorithm (Porter 1980)
Home page: Reading assignment: Consonant-vowel sequences: CVCV ... C CVCV ... V VCVC ... C VCVC ... V Shorthand: [C]VCVC ... [V] (C) 2005, The University of Michigan

98 Porter’s algorithm (cont’d)
[C](VC){m}[V] {m} indicates repetition Examples: m=0 TR, EE, TREE, Y, BY m=1 TROUBLE, OATS, TREES, IVY m=2 TROUBLES, PRIVATE, OATEN Conditions: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). (C) 2005, The University of Michigan

99 SSES -> SS caresses -> caress
Step 1a SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step 1b (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing Step 1b1 If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file (C) 2005, The University of Michigan

100 (*v*) Y -> I happy -> happi sky -> sky Step 2
Step 1c (*v*) Y -> I happy -> happi sky -> sky Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible (C) 2005, The University of Michigan

101 Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler (C) 2005, The University of Michigan

102 (m>1) E -> probate -> probat rate -> rate
Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll (C) 2005, The University of Michigan

103 Porter’s algorithm (cont’d)
Example: the word “duplicatable”
duplicat rule 4
duplicate rule 1b1
duplic rule 3
The application of another rule in step 4, removing “ic,” cannot be applied since only one rule from each step is allowed to be applied.
% cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput

104 Porter’s algorithm

105 Stemming Not always appropriate (e.g., proper names, titles)
The same applies to casing (e.g., CAT vs. cat) (C) 2005, The University of Michigan

106 String matching

107 String matching methods
Index-based Full or approximate E.g., theater = theatre (C) 2005, The University of Michigan

108 Index-based matching Inverted files Position-based inverted files
Block-based inverted files This is a text. A text has many words. Words are made from letters. Text: 11, 19 Words: 33, 40 From: 55 (C) 2005, The University of Michigan
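The example above can be reproduced with a minimal position-based inverted file that records 1-based character offsets, which matches the slide's numbers (Text: 11, 19; Words: 33, 40; From: 55).

```python
import re

def build_inverted_file(text):
    """Map each lowercased term to its 1-based character offsets."""
    index = {}
    for m in re.finditer(r"[A-Za-z]+", text):
        index.setdefault(m.group().lower(), []).append(m.start() + 1)
    return index

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_file(text)
print(index["text"])     # [11, 19]
print(index["words"])    # [33, 40]
print(index["from"])     # [55]
print(index["letters"])  # [60]
```

Case-folding merges "Words" and "words" into one posting list; storing offsets rather than just document IDs is what enables phrase and proximity queries.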

109 Inverted index (trie) [Figure: trie branching on first letters (l, m, t, w) with leaves Letters: 60; Made: 50; Many: 28; Text: 11, 19; Words: 33, 40]

110 Sequential searching No indexing structure given
Given: database d and search pattern p. Example: find “words” in the earlier example Brute force method try all possible starting positions O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn) Typical runtime is actually O(n) given that mismatches are easy to notice (C) 2005, The University of Michigan

111 Knuth-Morris-Pratt Average runtime similar to BF
Worst case runtime is linear: O(n) Idea: reuse knowledge Need preprocessing of the pattern (C) 2005, The University of Michigan

112 Knuth-Morris-Pratt (cont’d)
Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
Partial-match (failure) table for the pattern:
pos:  0  1  2  3  4  5  6
char: A  B  C  D  A  B  D
T:   -1  0  0  0  0  1  2

113 Knuth-Morris-Pratt (cont’d)
ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ (C) 2005, The University of Michigan
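The search above can be sketched as follows; this variant stores, for each prefix, the length of its longest proper prefix that is also a suffix (a 0-based convention rather than the Wikipedia table's −1-based one).

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # failure function: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # scan the text, never moving the text pointer backwards
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

db = "ABC ABC ABC ABDAB ABCDABCDABDE"
print(kmp_search(db, "ABCDABD"))  # 22
```

Because the text pointer only advances, the scan is O(n) after the O(m) table construction, which is the knowledge-reuse idea from the previous slide.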

114 Boyer-Moore Used in text editors Demos

115 Word similarity Hamming distance - when words are of the same length
Levenshtein distance - number of edits (insertions, deletions, replacements) color --> colour (1) survey --> surgery (2) com puter --> computer ? Longest common subsequence (LCS) lcs (survey, surgery) = surey (C) 2005, The University of Michigan

116 Levenshtein edit distance
Examples: Theatre-> theater Ghaddafi->Qadafi Computer->counter Edit distance (inserts, deletes, substitutions) Edit transcript Done through dynamic programming (C) 2005, The University of Michigan

117 Recurrence relation Three dependencies: D(i,0) = i
D(0,j) = j
D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
Simple edit distance: t(i,j) = 0 iff S1(i) = S2(j)
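The recurrence above fills a table by dynamic programming; a direct sketch, checked against the earlier examples (color/colour = 1, survey/surgery = 2):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via D(i,j) = min of delete, insert, substitute."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # base case: delete all of s1's first i chars
    for j in range(n + 1):
        D[0][j] = j          # base case: insert all of s2's first j chars
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1  # t(i,j)
            D[i][j] = min(D[i - 1][j] + 1,          # delete
                          D[i][j - 1] + 1,          # insert
                          D[i - 1][j - 1] + t)      # match/substitute
    return D[m][n]

print(edit_distance("color", "colour"))     # 1
print(edit_distance("survey", "surgery"))   # 2
print(edit_distance("theatre", "theater"))  # 2
```

Keeping which of the three choices won at each cell gives the edit transcript via traceback, as the following slides illustrate.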

118 Example [Table: edit-distance dynamic-programming matrix (Gusfield 1997); cell values lost in transcription]

119 Example (cont’d) [Table: the same matrix with the optimal-path cells marked *; values lost in transcription (Gusfield 1997)]

120 Tracebacks [Table: traceback pointers through the edit-distance matrix; values lost in transcription (Gusfield 1997)]

121 Weighted edit distance
Used to emphasize the relative cost of different edit operations. Useful in bioinformatics: homology information, BLAST, BLOSUM substitution matrices. (C) 2005, The University of Michigan

122 Web sites: http://www.merriampark.com/ld.htm
(C) 2005, The University of Michigan

123 Clustering (C) 2005, The University of Michigan

124 Clustering Exclusive/overlapping clusters Hierarchical/flat clusters
The cluster hypothesis Documents in the same cluster are relevant to the same query (C) 2005, The University of Michigan

125 Representations for document clustering
Typically: vector-based Words: “cat”, “dog”, etc. Features: document length, author name, etc. Each document is represented as a vector in an n-dimensional space Similar documents appear nearby in the vector space (distance measures are needed) (C) 2005, The University of Michigan
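A standard distance measure over such document vectors is the cosine similarity; a minimal sketch (illustrative Python):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two document vectors: 1.0 for vectors
    pointing the same way, 0.0 for orthogonal vectors (no shared
    features)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```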

126 Hierarchical clustering Dendrograms
E.g., language similarity (dendrogram figure omitted in the transcript). (C) 2005, The University of Michigan

127 Another example Kingdom = animal Phylum = Chordata
Subphylum = Vertebrata Class = Osteichthyes Subclass = Actinopterygii Order = Salmoniformes Family = Salmonidae Genus = Oncorhynchus Species = Oncorhynchus kisutch (Coho salmon) (C) 2005, The University of Michigan

128 Clustering using dendrograms
Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation? (C) 2005, The University of Michigan
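The REPEAT/UNTIL loop above can be sketched as single-linkage agglomeration (illustrative Python; `dist` is any pairwise distance function, and the function name is ours):

```python
def agglomerate(items, dist):
    """Repeatedly find the closest pair of clusters (single linkage:
    smallest inter-item distance) and merge them until one node is
    left; the list of merges encodes the dendrogram bottom-up."""
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: min(
            dist(a, b) for a in clusters[p[0]] for b in clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```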

129 Methods
Single-linkage: a single close pair suffices; disadvantage: long chains
Complete-linkage: all pairs have to match; disadvantage: too conservative
Average-linkage
Centroid-based (online): look at distances to centroids
Demo: /clair4/class/ir-w05/clustering (C) 2005, The University of Michigan

130 k-means Needed: small number k of desired clusters
hard vs. soft decisions Example: Weka (C) 2005, The University of Michigan

131 k-means
1  initialize cluster centroids to arbitrary vectors
2  while further improvement is possible do
3    for each document d do
4      find the cluster c whose centroid is closest to d
5      assign d to cluster c
6    end for
7    for each cluster c do
8      recompute the centroid of cluster c based on its documents
9    end for
10 end while
(C) 2005, The University of Michigan

132 Example Cluster the following vectors into two groups: A = <1,6>
B = <2,2> C = <4,0> D = <3,3> E = <2,5> F = <2,1> (C) 2005, The University of Michigan
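Running the pseudocode from the previous slide on these six vectors with k = 2 (illustrative Python; starting from centroids A and C is our choice):

```python
def kmeans(points, centroids, iters=10):
    """Hard k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda c: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[c])))
            clusters[best].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                     else centroids[c]
                     for c, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]   # A..F
cents, clusts = kmeans(pts, centroids=[(1, 6), (4, 0)])
```

This is the "hard" variant from the earlier slide; a soft version would assign fractional memberships instead.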

133 Complexity Complexity = O(kn) per iteration, because at each step n documents have to be compared to k centroids. (C) 2005, The University of Michigan

134 Weka A general environment for machine learning (e.g. for classification and clustering) Book by Witten and Frank (C) 2005, The University of Michigan

135 Demos http://vivisimo.com/
(C) 2005, The University of Michigan

136 Human clustering Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998). (C) 2005, The University of Michigan

137 Lexical networks (C) 2005, The University of Michigan

138 Lexical Networks Used to represent relationships between words
Example: WordNet - created by George Miller’s team at Princeton Based on synsets (synonyms, interchangeable words) and lexical matrices (C) 2005, The University of Michigan

139 Lexical matrix (C) 2005, The University of Michigan

140 Synsets
Disambiguation: a polysemous word appears in several synsets, e.g. {board, plank} vs. {board, committee}
Synonymy is tested by substitution (or weak substitution); synonyms must be of the same part of speech (C) 2005, The University of Michigan

141 $ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board: 9 senses of board
Sense 1: board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping
Sense 2: board => sheet, flat solid => artifact, artefact => object, physical object => entity, something
Sense 3: board, plank => lumber, timber => building material
(C) 2005, The University of Michigan

142 Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface Sense 6 board, table => fare => food, nutrient => substance, matter (C) 2005, The University of Michigan

143 Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings (C) 2005, The University of Michigan

144 Antonymy “x” vs. “not-x” “rich” vs. “poor”?
{rise, ascend} vs. {fall, descend} (C) 2005, The University of Michigan

145 Other relations Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”. Hyponymy: {tree} is a hyponym of {plant}. Hierarchical structure based on hyponymy (and hypernymy). (C) 2005, The University of Michigan

146 Other features of WordNet
Index of familiarity Polysemy (C) 2005, The University of Michigan

147 Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9) bird used as a noun is common (polysemy count = 5) cat used as a noun is common (polysemy count = 7) house used as a noun is familiar (polysemy count = 11) information used as a noun is common (polysemy count = 5) retrieval used as a noun is uncommon (polysemy count = 3) serendipity used as a noun is very rare (polysemy count = 1) (C) 2005, The University of Michigan

148 Compound nouns advisory board appeals board backboard backgammon board
baseboard basketball backboard big board billboard binder's board binder board blackboard board game board measure board meeting board member board of appeals board of directors board of education board of regents board of trustees (C) 2005, The University of Michigan

149 Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board") (C) 2005, The University of Michigan

150 Top-level concepts {act, action, activity} {animal, fauna} {artifact}
{attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time} (C) 2005, The University of Michigan

151 WordNet and DistSim
wn reason -hypen   (hypernyms)
wn reason -synsn   (synsets)
wn reason -simsn   (synonyms)
wn reason -over    (overview of senses)
wn reason -famln   (familiarity/polysemy)
wn reason -grepn   (compound nouns)
/data2/tools/relatedwords/relate reason
(C) 2005, The University of Michigan

152 System comparison (C) 2005, The University of Michigan

153 Comparing two systems Comparing A and B One query?
Average performance? Need: A to consistently outperform B [this slide: courtesy James Allan] (C) 2005, The University of Michigan

154 The sign test Example 1: Example 2: A > B (12 times)
p < (significant at the 5% level) Example 2: A > B (18 times) A < B (9 times) p < (not significant at the 5% level) [this slide: courtesy James Allan] (C) 2005, The University of Michigan

155 Other tests
The t test: takes into account the actual performances, not just which system is better.
The sign test: uses only which system wins on each query.
(C) 2005, The University of Michigan
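The sign test from the previous slide reduces to a binomial tail; a minimal sketch (illustrative Python):

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided sign test: under the null hypothesis each query is a
    fair coin toss, so p = 2 * P[Binomial(n, 1/2) <= min(wins, losses)]."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, A beating B on 12 of 15 queries gives p ≈ 0.035 (significant at the 5% level), while 18 of 27 gives p ≈ 0.12 (not significant), matching the pattern on the earlier slide.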

156 Techniques for dimensionality reduction: SVD and LSI
(C) 2005, The University of Michigan

157 Techniques for dimensionality reduction
Based on matrix decomposition (goal: preserve clusters, explain away variance) A quick review of matrices Vectors Matrices Matrix multiplication (C) 2005, The University of Michigan

158 SVD: Singular Value Decomposition
A = U S V^T. This decomposition exists for all matrices, dense or sparse. If A has 3 rows and 5 columns, then U will be 3x3 and V will be 5x5. In Matlab, use [U,S,V] = svd (A) (C) 2005, The University of Michigan

159 Term matrix normalization
Term-by-document matrix for documents D1-D5, before and after length normalization (the numeric values did not survive the transcript). (C) 2005, The University of Michigan

160 Example (Berry and Browne)
T1: baby  T2: child  T3: guide  T4: health  T5: home  T6: infant  T7: proofing  T8: safety  T9: toddler
D1: infant & toddler first aid
D2: babies & children’s room (for your home)
D3: child safety at home
D4: your baby’s health and safety: from infant to toddler
D5: baby proofing basics
D6: your guide to easy rust proofing
D7: beanie babies collector’s guide
(C) 2005, The University of Michigan
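One way to build the 9-term by 7-document incidence matrix for this example (illustrative Python; the crude 3-letter prefix matching that lets “babies” and “baby's” count as “baby” is our simplification, not the book's stemmer):

```python
terms = ["baby", "child", "guide", "health", "home",
         "infant", "proofing", "safety", "toddler"]
docs = [
    "infant & toddler first aid",
    "babies & children's room (for your home)",
    "child safety at home",
    "your baby's health and safety: from infant to toddler",
    "baby proofing basics",
    "your guide to easy rust proofing",
    "beanie babies collector's guide",
]

def occurs(term, doc):
    # crude prefix match standing in for real stemming (our assumption)
    return any(w.startswith(term[:3]) for w in doc.lower().split())

# binary term-document matrix: rows = terms, columns = documents
A = [[1 if occurs(t, d) else 0 for d in docs] for t in terms]
```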

161 Document term matrix (C) 2005, The University of Michigan

162 Decomposition (C) 2005, The University of Michigan u =
v = (C) 2005, The University of Michigan

163 Decomposition Spread on the v1 axis s = 1.5849 0 0 0 0 0 0
(C) 2005, The University of Michigan

164 Rank-4 approximation s4 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0
(C) 2005, The University of Michigan

165 Rank-4 approximation u*s4*v'
(C) 2005, The University of Michigan

166 Rank-4 approximation u*s4 -1.1056 -0.1203 0.0207 -0.5558 0 0 0
(C) 2005, The University of Michigan

167 Rank-4 approximation s4*v'
(C) 2005, The University of Michigan

168 Rank-2 approximation s2 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0
(C) 2005, The University of Michigan

169 Rank-2 approximation u*s2*v'
(C) 2005, The University of Michigan

170 Rank-2 approximation u*s2 -1.1056 -0.1203 0 0 0 0 0
(C) 2005, The University of Michigan

171 Rank-2 approximation s2*v'
(C) 2005, The University of Michigan

172 Documents to concepts and terms to concepts
A(:,1)'*u*s >> A(:,1)'*u*s4 >> A(:,1)'*u*s2 >> A(:,2)'*u*s2 >> A(:,3)'*u*s2 (C) 2005, The University of Michigan

173 Documents to concepts and terms to concepts
>> A(:,4)'*u*s2 >> A(:,5)'*u*s2 >> A(:,6)'*u*s2 >> A(:,7)'*u*s2 (C) 2005, The University of Michigan

174 Cont’d (C) 2005, The University of Michigan >> (s2*v'*A(1,:)')'
>> (s2*v'*A(2,:)')' >> (s2*v'*A(3,:)')' >> (s2*v'*A(4,:)')' >> (s2*v'*A(5,:)')' (C) 2005, The University of Michigan

175 Cont’d (C) 2005, The University of Michigan >> (s2*v'*A(6,:)')'
>> (s2*v'*A(7,:)')' >> (s2*v'*A(8,:)')' >> (s2*v'*A(9,:)')‘ (C) 2005, The University of Michigan

176 Properties A is a document to term matrix. What is A*A’, what is A’*A?
A'*A (C) 2005, The University of Michigan

177 Latent semantic indexing (LSI)
Dimensionality reduction = identification of hidden (latent) concepts Query matching in latent space (C) 2005, The University of Michigan

178 Useful pointers http://lsa.colorado.edu
(C) 2005, The University of Michigan

179 Models of the Web (C) 2005, The University of Michigan

180 Size The Web is the largest repository of data and it grows exponentially. 320 Million Web pages [Lawrence & Giles 1998] 800 Million Web pages, 15 TB [Lawrence & Giles 1999] 8 Billion Web pages indexed [Google 2005] Amount of data roughly 200 TB [Lyman et al. 2003] 1.5 million terabytes = 1.5 exabytes = 1.5 E+18 bytes. At 2,000 bytes/page = 1 E 15 pages = 1 billion years at 1 page/minute 1 in every 28 page views on the Web is a search results page (3.5% of all page views) (June 1, 1999, Alexa Insider) (C) 2005, The University of Michigan

181 Bow-tie model of the Web
TENDRILS 44 M, SCC 56 M, IN 44 M, OUT 44 M, DISC 17 M; 24% of pages reachable from a given page (C) 2005, The University of Michigan Broder & al. WWW 2000, Dill & al. VLDB 2001

182 Power laws Web site size (Huberman and Adamic 1999)
Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3. (C) 2005, The University of Michigan

183 Small-world networks Diameter = average length of the shortest path between all pairs of nodes. Milgram experiment (1967): Kansas/Omaha --> Boston (42 of 160 letters arrived), diameter = 6. Albert et al. – average distance between two vertices is d = 0.35 + 2.06 log10 n; for n = 10^9, d = 18.89. Six degrees of separation (C) 2005, The University of Michigan

184 Clustering coefficient
Cliquishness (C): the fraction of the kv(kv – 1)/2 pairs of neighbors of a vertex that are themselves connected. Examples (Watts and Strogatz 1998):
              n        k      d      drand   C      Crand
Actors        225226   61     3.65   2.99    0.79   0.00027
Power grid    4941     2.67   18.7   12.4    0.08   0.005
C. elegans    282      14     2.65   2.25    0.28   0.05
(C) 2005, The University of Michigan

185 Models of the Web Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology Erdös/Rényi 59, 60 Barabási/Albert 99 Watts/Strogatz 98 A B a b Kleinberg 98 Indegree/outdegree plots Evolving networks: fundamental object of statistical physics Social networks Bow tie model (not?) My Erdos number is 4 Growth of the Web (p 130) Why is the Web different (MN) – how do users really create the Web Fat-tailed distributions, small worlds (W&S, # of triangles), hard to destroy Communities (Kleinberg) Evolutionary networks, equilibrium/non-equilibrium Evaluation metrics: degree distribution, clustering coefficient, diameter (give comparison) W/S model is equlibrium (term from statistical mechanics) EN have history, memory Small word networks (+result) D&M Zipf What are typical clustering coefficients and diameters Random graph – clustering coefficient is much smaller Mesoscopic objects Phase transitions Bipartite networks Pagerank Google changed the way the Web works Menczer 02 Radev 03 (C) 2005, The University of Michigan

186 Self-triggerability across hyperlinks
pj pi Document closures for information retrieval Self-triggerability [Mosteller&Wallace 84]  Poisson distribution Two-Poisson [Bookstein&Swanson 74] Negative Binomial, K-mixture [Church&Gale 95] Triggerability across hyperlinks? p p’ by with from p p’ photo dream path Perltree/Lexpagerank/Web Models/Lex/Link Predict communities based on a few words Realistic attachment model (based on words) Dilemma: add links/remove links Property-driven attachment (C) 2005, The University of Michigan

187 Evolving Word-based Web
Observations: Links are made based on topics Topics are expressed with words Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) Model Pick n Generate n lengths according to a power-law distribution Generate n documents using a trigram model Model (cont’d) Pick words in decreasing order of r. Generate hyperlinks with random directionality Outcome Generates power-law degree distributions Generates topical communities Natural variation of PageRank: LexRank (C) 2005, The University of Michigan

188 Social network analysis for IR
(C) 2005, The University of Michigan

189 Social networks Induced by a relation Symmetric or not Examples:
Friendship networks Board membership Citations Power grid of the US WWW (C) 2005, The University of Michigan

190 Krebs 2004 (C) 2005, The University of Michigan

191 Prestige and centrality
Degree centrality: how many neighbors each node has. Closeness centrality: how close a node is to all of the other nodes Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. Prestige = same as centrality but for directed graphs. Pagerank/HITS (C) 2005, The University of Michigan

192 Graph-based representations
Square connectivity (adjacency) matrix for a graph G(V,E) over nodes 1-8 (the matrix and drawing did not survive the transcript); the graph may be directed or undirected (C) 2005, The University of Michigan

193 Markov chains A homogeneous Markov chain is defined by an initial distribution x0 and a Markov kernel E. Path = sequence (x0, x1, …, xn), with xi = xi-1·E. The probability of a path can be computed as a product of probabilities for each step i. Random walk = find xj given x0, E, and j. (C) 2005, The University of Michigan

194 Stationary solutions The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: E is stochastic E is irreducible E is aperiodic To make these conditions true: All rows of E add up to 1 (and no value is negative) Make sure that E is strongly connected Make sure that E is not bipartite Example: PageRank [Brin and Page 1998]: use “teleportation” Stochastic, aperiodic, irreducible (strongly connected) Unique solutions (C) 2005, The University of Michigan

195 Example 1 2 3 4 5 7 6 8 t=0 t=1 This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph. (C) 2005, The University of Michigan

196 Eigenvectors An eigenvector is an implicit “direction” for a matrix.
Mv = λv, where v is non-zero, though λ can be any complex number in principle. The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 In other words, ETp = p. (C) 2005, The University of Michigan

197 Computing the stationary distribution
Solution for the stationary distribution:
function PowerStatDist (E):
begin
  p(0) = u;   (or p(0) = [1,0,…,0])
  i = 1;
  repeat
    p(i) = ET p(i-1)
    L = ||p(i) - p(i-1)||1;
    i = i + 1;
  until L < ε
  return p(i)
end
Power methods (KHMG) (C) 2005, The University of Michigan
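The PowerStatDist pseudocode maps directly onto a few lines (illustrative Python; E is a row-stochastic matrix given as nested lists):

```python
def power_stat_dist(E, eps=1e-12):
    """Power method: start from the uniform vector u and iterate
    p <- E^T p until the L1 change drops below eps."""
    n = len(E)
    p = [1.0 / n] * n
    while True:
        q = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(p, q)) < eps:
            return q
        p = q
```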

198 Example t=0 1 2 3 4 5 7 6 8 t=1 t=10 (C) 2005, The University of Michigan

199 How Google works Crawling Anchor text Fast query processing Pagerank
(C) 2005, The University of Michigan

200 More about PageRank Named after Larry Page, founder of Google (and UM alum). Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page. Independent of query (although more recent work by Haveliwala (WWW 2002) has also introduced topic-sensitive PageRank). (C) 2005, The University of Michigan
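A toy PageRank sketch with teleportation (illustrative Python; the three-page graph and the function itself are ours, not from the paper):

```python
def pagerank(links, d=0.85, iters=60):
    """With probability 1-d the surfer teleports to a uniformly random
    page, which makes the chain stochastic, aperiodic, and irreducible,
    so the stationary distribution is unique."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links[p] or pages    # dangling page: jump anywhere
            for q in out:
                new[q] += d * pr[p] / len(out)
        pr = new
    return pr

pr = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this graph c collects rank from both a and b, so it ends up ranked above a, which in turn outranks b.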

201 HITS Query-dependent model (Kleinberg 97)
Hubs and authorities (e.g., cars, Honda) Algorithm: obtain root set using input query; expand the root set by radius one; run iterations on the hub and authority scores together; report top-ranking authorities and hubs (C) 2005, The University of Michigan
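The hub/authority iteration itself can be sketched as follows (illustrative Python on a tiny made-up graph): a page's authority score sums the hub scores of pages pointing at it, a page's hub score sums the authority scores it points at, and both vectors are normalized each round.

```python
def hits(links, iters=50):
    """Iterate hub and authority scores together on the (expanded)
    root set, normalizing after each update."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q])
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# a and b both point at c: c is the authority, a and b are equal hubs
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```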

202 The link-content hypothesis
Topical locality: page is similar () to the page that points to it (). Davison (TF*IDF, 100K pages) 0.31 same domain 0.23 linked pages 0.19 sibling 0.02 random Menczer (373K pages, non-linear least squares fit) Chakrabarti (focused crawling) - prob. of losing the topic 1=1.8, 2=0.6, (C) 2005, The University of Michigan Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001

203 Measuring the Web (C) 2005, The University of Michigan

204 Bharat and Broder 1998 Based on crawls of HotBot, Altavista, Excite, and InfoSeek 10,000 queries in mid and late 1997 Estimate is 200M pages Only 1.4% are indexed by all of them (C) 2005, The University of Michigan

205 Example (from Bharat&Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998). (C) 2005, The University of Michigan

206 Crawling the web (C) 2005, The University of Michigan

207 Basic principles The HTTP/HTML protocols Following hyperlinks
Some problems: Link extraction Link normalization Robot exclusion Loops Spider traps Server overload (C) 2005, The University of Michigan

208 Example U-M’s root robots.txt file: http://www.umich.edu/robots.txt
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /%7Ewebsvcs/projects/
Disallow: /~homepage/
Disallow: /%7Ehomepage/
Disallow: /~smartgl/
Disallow: /%7Esmartgl/
Disallow: /~gateway/
Disallow: /%7Egateway/
(C) 2005, The University of Michigan

209 Example crawler E.g., poacher
/data0/projects/perltree-index (C) 2005, The University of Michigan

210 &ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
sub Initialise {
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

# follow_url_test() - tell the robot module whether it should follow a link
sub follow_url_test {}

# process_get_error() - hook function invoked whenever a GET fails
sub process_get_error {}

# process_contents() - process the contents of a URL we've retrieved
sub process_contents {
    run_command($COMMAND, $filename) if defined $COMMAND;
}
(C) 2005, The University of Michigan

211 Focused crawling Topical locality The radius-1 hypothesis
Pages that are linked are similar in content (and vice-versa: Davison 00, Menczer 02, 04, Radev et al. 04) The radius-1 hypothesis given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page) Focused crawling Keeping a priority queue of the most relevant pages (C) 2005, The University of Michigan

212 Question answering (C) 2005, The University of Michigan

213 People ask questions Excite corpus of 2,477,283 queries (one day’s worth) 8.4% of them are questions 43.9% factual (what is the country code for Belgium) 56.1% procedural (how do I set up TCP/IP) or other In other words, 100 K questions per day (C) 2005, The University of Michigan

214 People ask questions In what year did baseball become an offical sport? Who is the largest man in the world? Where can i get information on Raphael? where can i find information on puritan religion? Where can I find how much my house is worth? how do i get out of debt? Where can I found out how to pass a drug test? When is the Super Bowl? who is California's District State Senator? where can I buy extra nibs for a foutain pen? how do i set up tcp/ip ? what time is it in west samoa? Where can I buy a little kitty cat? what are the symptoms of attention deficit disorder? Where can I get some information on Michael Jordan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live? Which Frenchman declined the Nobel Prize for Literature for ideological reasons? What is the largest city in Northern Afghanistan? Each question can have multiple answers. (C) 2005, The University of Michigan

215 (C) 2005, The University of Michigan

216 Question answering What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan

217 Possible approaches Map?
Knowledge base? Find x: city(x) ∧ located(x, ”Northern Afghanistan”) ∧ ¬∃ y: city(y) ∧ located(y, ”Northern Afghanistan”) ∧ greaterthan(population(y), population(x)) Database? World factbook? Search engine? (C) 2005, The University of Michigan

218 The TREC Q&A evaluation
Run by NIST [Voorhees and Tice 2000] 2GB of input 200 questions Essentially fact extraction Who was Lincoln’s secretary of state? What does the Peugeot company manufacture? Questions are based on text Answers are assumed to be present No inference needed (C) 2005, The University of Michigan

219 Question answering Q: When did Nelson Mandela become president of South Africa? A: 10 May 1994 Q: How tall is the Matterhorn? A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches Q: How tall is the replica of the Matterhorn at Disneyland? A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years Q: If Iraq attacks a neighboring country, what should the US do? A: ?? (C) 2005, The University of Michigan

220 (C) 2005, The University of Michigan

221 NSIR Current project at U-M Reading:
Reading: [Radev et al., 2005a] Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005 (C) 2005, The University of Michigan

222 (C) 2005, The University of Michigan

223 ... Afghanistan, Kabul, 2, Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450, of the Gauteng, Northern Province, Mpumalanga ... ... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... ... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States k ... Kabul is the capital and largest city of Afghanistan met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html ... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated english.pravda.ru/hotspots/2001/09/17/ ... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... ... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment uk.fc.yahoo.com/photos/a/afghanistan.html (C) 2005, The University of Michigan Google

224 Query modulation Document retrieval Sentence retrieval
What is the largest city in Northern Afghanistan? Query modulation (largest OR biggest) city “Northern Afghanistan” Document retrieval Sentence retrieval Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan Answer extraction Gudermes Mazer-e-Sharif Answer ranking (C) 2005, The University of Michigan Mazer-e-Sharif Gudermes

225 (C) 2005, The University of Michigan

226 (C) 2005, The University of Michigan

227 Research problems
Source identification: semi-structured vs. text sources
Query modulation: best paraphrase of a NL question given the syntax of a search engine? Compare two approaches: noisy channel model and rule-based
Sentence ranking: n-gram matching, Okapi, co-reference?
Answer extraction: question type identification, phrase chunking; no general-purpose named entity tagger available
Answer ranking: what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
Evaluation (MRDR): accuracy, reliability, timeliness
(C) 2005, The University of Michigan

228 Document retrieval Use existing search engines: Google, AlltheWeb, NorthernLight No modifications to question CF: work on QASM (ACM CIKM 2001) (C) 2005, The University of Michigan

229 Sentence ranking Weighted N-gram matching:
Weights are determined empirically, e.g., 0.6, 0.3, and 0.1 (C) 2005, The University of Michigan
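Weighted n-gram matching between a question and a candidate sentence might look like this (illustrative Python; the 0.6/0.3/0.1 weights for uni-, bi-, and trigrams follow the slide, everything else is our sketch):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_score(question, sentence, weights=(0.6, 0.3, 0.1)):
    """Weighted fraction of the question's n-grams that also occur in
    the candidate sentence, for n = 1, 2, 3."""
    q = question.lower().split()
    s = sentence.lower().split()
    score = 0.0
    for n, w in enumerate(weights, start=1):
        qn, sn = set(ngrams(q, n)), set(ngrams(s, n))
        if qn:
            score += w * len(qn & sn) / len(qn)
    return score
```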

230 Probabilistic phrase reranking
Answer extraction: probabilistic phrase reranking. What is p(ph is answer to q | q, ph)? Evaluation: TRDR (total reciprocal document rank); example: answers at ranks (2,8,10) give 1/2 + 1/8 + 1/10 = .725. Document, sentence, or phrase level. Criterion: presence of answer(s). High correlation with manual assessment (C) 2005, The University of Michigan

231 Phrase types PERSON PLACE DATE NUMBER DEFINITION ORGANIZATION DESCRIPTION ABBREVIATION KNOWNFOR RATE LENGTH MONEY REASON DURATION PURPOSE NOMINAL OTHER (C) 2005, The University of Michigan

232 Question Type Identification
Wh-type not sufficient: Who: PERSON 77, DESCRIPTION 19, ORG 6 What: NOMINAL 78, PLACE 27, DEF26, PERSON 18, ORG 16, NUMBER 14, etc. How: NUMBER 33, LENGTH 6, RATE 2, etc. Ripper: 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc. Top 2 question types Heuristic algorithm: About 100 regular expressions based on words and parts of speech (C) 2005, The University of Michigan

233 Ripper performance
Training      Test     Train Error Rate   Test Error Rate
TREC9         TREC8    22.4%              24%
TREC8,9       TREC10   17.03%             30%
TREC8,9,10    -        20.69%             -
(C) 2005, The University of Michigan

234 Regex performance
Training      Test on TREC9   Test on TREC8   Test on TREC10
TREC9         7.8%            15%             18%
TREC8,9       7.4%            6%              18.2%
TREC8,9,10    4.6%            5.5%            7.6%
(C) 2005, The University of Michigan

235 Phrase ranking Phrases are identified by a shallow parser (ltchunk from Edinburgh) Four features: Proximity POS (part-of-speech) signature (qtype) Query overlap Frequency (C) 2005, The University of Michigan

236 Proximity Phrasal answers tend to appear near words from the query
Average distance = 7 words, range = 1 to 50 words Use linear rescaling of scores (C) 2005, The University of Michigan

237 Part of speech signature
Penn Treebank tagset (DT = determiner, JJ = adjective) NO (100%) NO (86.7%) PERSON (3.8%) NUMBER (3.8%) ORG (2.5%) PERSON (37.4%) PLACE (29.6%) DATE (21.7%) NO (7.6%) NO (75.6%) NUMBER (11.1%) PLACE (4.4%) ORG (4.4%) PLACE (37.3%) PERSON (35.6%) NO (16.9%) ORG (10.2%) ORG (55.6%) NO (33.3%) PLACE (5.6%) DATE (5.6%) VBD DT NN NNP DT JJ NNP NNP NNP DT NNP Phrase Types Signature Example: “Hugo/NNP Young/NNP” P (PERSON | “NNP NNP”) = .458 Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN” P (PERSON | “DT NNP NNP NNP NN”) = 0 (C) 2005, The University of Michigan

238 Query overlap and frequency
What is the capital of Zimbabwe? Possible choices: Mugabe, Zimbabwe, Luanda, Harare Frequency: Not necessarily accurate but rather useful (C) 2005, The University of Michigan

239 Reranking Proximity = .5164 Rank Probability and phrase
the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._ International_NNP Space_NNP Station_NNP Alpha_NNP International_NNP Space_NNP Station_NNP to_TO become_VB a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP NASA_NNP Johnson_NNP Space_NNP Center_NNP will_MD form_VB The_DT purpose_NN prime_JJ contracts_NNS First_NNP American_NNP this_DT bulletin_NN board_NN Space_NNP :_: 'Spirit_NN '_'' of_IN Alan_NNP Shepard_NNP Proximity = .5164 (C) 2005, The University of Michigan

240 Reranking Qtype = .7288 Proximity * qtype = .3763
Rank Probability and phrase Space_NNP Administration_NNP ._ SPACE_NNP CALENDAR_NNP _ First_NNP American_NNP International_NNP Space_NNP Station_NNP Alpha_NNP her_PRP$ third_JJ space_NN mission_NN NASA_NNP Johnson_NNP Space_NNP Center_NNP the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP the_DT Red_NNP Planet_NNP ._ First_NNP American_NNP Alan_NNP Shepard_NNP February_NNP Space_NNP International_NNP Space_NNP Station_NNP Qtype = Proximity * qtype = .3763 (C) 2005, The University of Michigan

241 Reranking
Rank, probability, and phrase (all four features):
Neptune_NNP Beach_NNP ._
February_NNP
Go_NNP Space_NNP Go_NNP
Alan_NNP Shepard_NNP
First_NNP American_NNP
Space_NNP
May_NNP
First_NNP American_NNP woman_NN
Life_NNP Sciences_NNP
Space_NNP Shuttle_NNP Discovery_NNP
STS-60_NN
the_DT Moon_NNP
International_NNP Space_NNP Station_NNP
Space_NNP Research_NNP
A_NNP Session_NNP
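The slides combine evidence multiplicatively (e.g. Proximity * qtype = .3763); a sketch of that combination for any number of per-candidate feature scores:

```python
def combined_score(feature_scores):
    """Combine per-candidate feature scores (e.g. proximity,
    question-type match, frequency, overlap) by taking their product,
    mirroring the "proximity * qtype" example in the slides."""
    score = 1.0
    for s in feature_scores:
        score *= s
    return score
```

The product reproduces the slide's arithmetic: 0.5164 * 0.7288 is approximately 0.3763. A side effect of multiplication is that any feature scoring 0 vetoes the candidate outright.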


245 Document level performance
TREC 8 corpus (200 questions):
Engine      Avg     #>0
Google      1.3361  164
NLight      1.0495  163
AlltheWeb   0.8355  149

246 Sentence level performance
Engine  GO O  GO L  GO U  NL O  NL L  NL U  AW O  AW L  AW U
Avg     0.49  0.54  2.55  0.44  0.48  2.53  0.26  0.31  2.13
#>0: 135 137 159 119 121 99 148

247 Phrase level performance
Experiments performed Oct-Nov 2001.
                  Google S+P  Google D+P  NorthernLight  AlltheWeb
Combined          0.199       0.157       0.117          0.105
Global proximity  0.0646      0.058       0.054          0.038
Appearance order  0.068       0.048       0.026
Upperbound        1.941       2.698       2.652          2.176


250 Text classification

251 Introduction
Text classification: assigning documents to predefined categories
Hierarchical vs. flat
Many techniques: generative (e.g., Naïve Bayes) vs. discriminative (e.g., kNN, maxent, SVM, regression)
Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
Discriminative: model p(y|x) directly

252 k-nearest neighbors (kNN)
Classify a document by majority vote among its k most similar training documents
Very easy to program
Issues: choosing k, b?
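A minimal kNN text classifier over bag-of-words vectors, using cosine similarity (an illustrative sketch; the deck does not specify the similarity measure or weighting):

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, doc, k=3):
    """Label `doc` by majority vote among the k most similar training
    documents. `train` is a list of (bag_of_words, label) pairs."""
    neighbors = sorted(train, key=lambda ex: cosine(ex[0], doc),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

The whole classifier is a sort plus a vote, which is why the slide calls it very easy to program; the cost is that every test document is compared against the full training set.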

253 Feature selection: The χ² test
For a term t, build a 2x2 contingency table of class label C against the term-occurrence indicator It:
         It = 0   It = 1
C = 0    k00      k01
C = 1    k10      k11
Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
P(C=0) = (k00+k01)/n    P(C=1) = 1-P(C=0) = (k10+k11)/n
P(It=0) = (k00+k10)/n   P(It=1) = 1-P(It=0) = (k01+k11)/n

254 Feature selection: The χ² test
High values of χ² indicate lower belief in independence.
In practice, compute χ² for all words and pick the top k among them.
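The χ² statistic for the 2x2 table can be computed directly from the four counts (a standard formulation; k00..k11 follow the contingency table defined above):

```python
def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for independence between a term's
    occurrence indicator It and the class label C, from the 2x2
    table [[k00, k01], [k10, k11]]; higher values suggest dependence."""
    n = k00 + k01 + k10 + k11
    total = 0.0
    # For each cell: (observed, its row total over C, its column total over It).
    for observed, row, col in ((k00, k00 + k01, k00 + k10),
                               (k01, k00 + k01, k01 + k11),
                               (k10, k10 + k11, k00 + k10),
                               (k11, k10 + k11, k01 + k11)):
        expected = row * col / n
        if expected:
            total += (observed - expected) ** 2 / expected
    return total
```

A perfectly independent table (all cells equal) scores 0; a term that co-occurs strongly with one class scores high and survives the top-k cut.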

255 Feature selection: mutual information
No document-length scaling is needed
Documents are assumed to be generated according to the multinomial model
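Mutual information between term occurrence and class can be estimated from the same 2x2 counts as the χ² test (this binary-event variant is an assumption on my part; the slide's multinomial formulation may differ in detail):

```python
import math

def mutual_information(k00, k01, k10, k11):
    """Mutual information I(It; C) between term occurrence and class,
    estimated from the 2x2 contingency counts [[k00, k01], [k10, k11]].
    Zero when term and class are independent."""
    n = k00 + k01 + k10 + k11
    mi = 0.0
    # For each cell: (count, marginal over C, marginal over It).
    for k, p_c, p_t in ((k00, k00 + k01, k00 + k10),
                        (k01, k00 + k01, k01 + k11),
                        (k10, k10 + k11, k00 + k10),
                        (k11, k10 + k11, k01 + k11)):
        if k:
            mi += (k / n) * math.log2(k * n / (p_c * p_t))
    return mi
```

Like χ², this is computed per term and the top-scoring terms are kept as features.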

256 Naïve Bayesian classifiers
Assuming statistical independence of the terms given the class: P(c | d) ∝ P(c) ∏t P(t | c)
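A minimal multinomial Naïve Bayes classifier built on the independence assumption above (an illustrative sketch; the add-one smoothing is my choice, not the slide's):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes from (token_list, label) pairs.
    Returns (class_counts, per-class word counts, vocabulary)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    """Pick argmax_c [log P(c) + sum_t log P(t | c)], treating the
    terms as statistically independent given the class; P(t | c) is
    add-one smoothed."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        total = sum(word_counts[c].values())
        lp = math.log(class_counts[c] / n_docs)
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Working in log space avoids underflow when the product runs over many terms.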

257 Spam recognition
Return-Path:
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima"
Reply-To:
To:
Date: Tue, 14 Jan :06:
Subject: Gooday

DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

258 Well-known datasets
20 newsgroups
Reuters-21578 (categories: grain, acquisitions, corn, crude, wheat, trade, ...)
WebKB (categories: course, student, faculty, staff, project, dept, other)
NB performance (2000): P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35

259 Support vector machines
Introduced by Vapnik in the early 1990s.
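A linear SVM can be trained with a simple stochastic subgradient method; this Pegasos-style sketch (Shalev-Shwartz et al.) is a toy illustration of the hinge-loss objective, not the solvers used in the cited text-classification results:

```python
import random

def train_linear_svm(data, dim, lam=0.01, epochs=100, seed=0):
    """Pegasos-style stochastic subgradient training of a linear SVM.
    `data` is a list of (feature_vector, label) with labels in {-1, +1};
    lam is the regularization strength."""
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(dim):
                w[i] *= 1 - eta * lam          # shrink (regularization)
                if margin < 1:                 # hinge-loss violation
                    w[i] += eta * y * x[i]
    return w

def predict(w, x):
    """Sign of the learned linear function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

For text, x would be a sparse tf-idf vector; a sparse-dict implementation follows the same update rule.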

260 Semi-supervised learning
Co-training
Graph-based
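Self-training, a simpler relative of co-training, illustrates the semi-supervised idea: train on the labeled data, label the unlabeled pool, promote the most confident predictions into the training set, and repeat. A generic sketch (`fit` and `predict_conf` are caller-supplied; all names are illustrative):

```python
def self_train(labeled, unlabeled, fit, predict_conf, rounds=5, batch=10):
    """Generic self-training loop. `fit(labeled)` returns a model;
    `predict_conf(model, x)` returns a (label, confidence) pair.
    Each round promotes the `batch` most confident pool items."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = fit(labeled)
    for _ in range(rounds):
        if not pool:
            break
        scored = sorted(((predict_conf(model, x), x) for x in pool),
                        key=lambda p: p[0][1], reverse=True)
        labeled += [(x, lab) for (lab, _), x in scored[:batch]]
        pool = [x for _, x in scored[batch:]]
        model = fit(labeled)
    return model, labeled
```

Co-training differs in that two classifiers over independent feature views label the pool for each other, which limits the self-reinforcement of early mistakes.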

261 Additional topics
Soft margins
VC dimension
Kernel methods

262 NB also good in many circumstances
SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters.
NB is also good in many circumstances.

263 Readings
Books:
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
2. Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003.
Papers:
Barabasi and Albert, "Emergence of scaling in random networks", Science (286), 1999
Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
Brin and Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
Bush, "As We May Think", The Atlantic Monthly, 1945
Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
Cho, Garcia-Molina, and Page, "Efficient Crawling Through URL Ordering", WWW 1998
Davison, "Topical locality on the Web", SIGIR 2000
Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
Deerwester, Dumais, Landauer, Furnas, and Harshman, "Indexing by latent semantic analysis", JASIS 41(6), 1990

264 Readings
Erkan and Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", JAIR 22, 2004
Jeong and Barabasi, "Diameter of the world wide web", Nature (401), 1999
Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
Haveliwala, "Topic-sensitive pagerank", WWW 2002
Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, and Upfal, "The Web as a graph", PODS 2000
Lawrence and Giles, "Accessibility of information on the Web", Nature (400), 1999
Lawrence and Giles, "Searching the World-Wide Web", Science (280), 1998
Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", JASIST 2005
Singhal, "Modern Information Retrieval: an Overview", IEEE 2001

265 More readings
Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
ACM SIGIR Proceedings, SIGIR Forum
ACM conferences in Digital Libraries

266 Thank you! Благодаря! ("Thank you!" in Bulgarian)

