Adaptive Information Retrieval Advaith Siddharthan References Introduction to Information Retrieval, Manning, Raghavan and Schütze, online book at

1 Adaptive Information Retrieval Advaith Siddharthan References: Introduction to Information Retrieval, Manning, Raghavan and Schütze, online book at http://nlp.stanford.edu/IR-book/ Wikipedia, just search for terms used in these slides...

2 Overview 3 Lectures: Information Retrieval –History and Evolution –Vector Models Links as recommendations –PageRank / HITS Personalised Web Search

3 Why Study Information Retrieval? Google –Searches billions of pages –Gives back results in under a second –Worth $30,000,000,000

4 Library Index Card

5 Organising documents Fields associated with a document –Author, Title, Year, Publisher, Number of pages, etc. –Subject Areas Curated by Librarians Creating a classification scheme for all the books in a library is a lot of work You can only search on indexed fields Can cards be indexed on multiple fields?

6 Edge-notched Cards (1896)

7 Key Notions Index Fields and Terms –Values assigned to fields for each document –E.g., Author, Title, Subject Query –field:value that can be combined by boolean logic operators: AND, OR, NOT Retrieval –Finding documents that match query
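A minimal sketch of the field:value query idea, assuming a tiny made-up catalogue (the card ids, authors and subjects are invented for illustration and `match` is a hypothetical helper). Each field:value lookup returns a set of cards, and AND, OR and NOT become set intersection, union and difference.

```python
# Hypothetical catalogue: card id -> indexed fields (all values are made up).
CARDS = {
    1: {"author": "Smith", "subject": "retrieval"},
    2: {"author": "Jones", "subject": "retrieval"},
    3: {"author": "Smith", "subject": "statistics"},
    4: {"author": "Brown", "subject": "retrieval"},
}


def match(field, value, cards=CARDS):
    """Return the set of card ids whose `field` equals `value`."""
    return {cid for cid, fields in cards.items() if fields.get(field) == value}


ALL_CARDS = set(CARDS)

# author:Smith AND subject:retrieval
and_result = match("author", "Smith") & match("subject", "retrieval")

# author:Smith OR author:Jones
or_result = match("author", "Smith") | match("author", "Jones")

# NOT subject:retrieval
not_result = ALL_CARDS - match("subject", "retrieval")

print(and_result, or_result, not_result)
```

Only indexed fields can be queried this way, which is the limitation the library slides point out.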

8 Key Notions Index Fields –Notch hole to activate index term Retrieval –Put needle through search term –Relevant documents fall out (holes are notched) Logical operations on query terms –AND: Put multiple needles –OR : Put one needle. Then put another –NOT: Keep cards on Needle, rather than ones that fall out

9 Edge-notched cards (1896) Allows indexing over multiple fields Automates computation of search Reminds you of Punch cards? Jacquard Loom (1801) –first machine to use punch cards to control a sequence of operations Next big jump... 1998?

10 Boolean Search Very little has changed for Information Retrieval over closed document collections –Documents labeled with terms from a domain-specific ontology –Search with boolean operators permitted over these terms

11 MESH: Medical Subject Headings C11 Eye Diseases –C11.93 Asthenopia –C11.187 Conjunctival Diseases C11.187.169 Conjunctival Neoplasms C11.187.183 Conjunctivitis –C11.187.183.220 Conjunctivitis, Allergic »C11.187.183.220.889 Trachoma C11.187.781 Pterygium C11.187.810 Xerophthalmia... www.nlm.nih.gov/mesh

12 ACM Classification for CS B Hardware –B.3 Memory structures B.3.1 Semiconductor Memories – Dynamic memory (DRAM) – Read-only memory (ROM) – Static memory (SRAM) B.3.2 Design Styles B.3.3 Performance Analysis – Simulation –Worst-case analysis www.acm.org/class/

13 Limitations Manual effort by trained catalogers: –required to create classification scheme –and for annotation of documents with subject classes Users need to be aware of subject classes BUT –high precision searches –works well for closed collections of documents (libraries, etc.)

14 The Internet NOT a closed collection –Billions of webpages –Documents change on a daily basis –Not possible to index or search by manually constructed subject classes How does Indexing work? How does Search work?

15 Simple Indexing Model Bag-of-Words –Documents and queries are represented as a bag of words –Ignore order of words –Ignore morphology/syntax (cat vs cats etc) –Just count the number of matches between words in document and query This already works rather well!

16 Vector Space Model Ranks Documents for relevance to query Documents and queries are vectors –What do vectors look like? –How do you compute relevance?

17 Coordinate Matching D1) Athletes face dope raids: UK dope body. D2) Athletes urged to snitch on dopers at Olympics. Q) Athletes dope Olympics Q. D1 = 2 (Athletes + dope) Q. D2 = 2 (Athletes + Olympics)

18 Term Frequency D1) Athletes face dope raids: UK dope body. D2) Athletes urged to snitch on dopers at Olympics. Q) Athletes dope Olympics Q. D1 = 3 (Athletes + 2*dope) Q. D2 = 2 (Athletes + Olympics)
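The two scores from the coordinate matching and term frequency slides can be reproduced with a short sketch. The tokeniser here (lowercase, alphabetic runs only) is an assumption, not something the slides specify, and the function names are illustrative.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def coordinate_match(query, doc):
    """Count the distinct query words that appear in the document."""
    return len(set(tokens(query)) & set(tokens(doc)))


def term_frequency_match(query, doc):
    """Sum, over distinct query words, how often each occurs in the document."""
    counts = Counter(tokens(doc))
    return sum(counts[word] for word in set(tokens(query)))


D1 = "Athletes face dope raids: UK dope body."
D2 = "Athletes urged to snitch on dopers at Olympics."
Q = "Athletes dope Olympics"

for name, doc in (("D1", D1), ("D2", D2)):
    print(name, coordinate_match(Q, doc), term_frequency_match(Q, doc))
```

Because no stemming is applied, "dopers" in D2 does not match "dope" in the query, which is why D2 scores 2 under both schemes.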

19 Similarity Metrics Each cell is the number of times the word occurs in the document or query (simplification, more later...)

          Doc1    Doc2    Doc3    ...  DocN    Query
Term1     ct_11   ct_12   ct_13   ...  ct_1N   q_1
Term2     ct_21   ct_22   ct_23   ...  ct_2N   q_2
...
TermM     ct_M1   ct_M2   ct_M3   ...  ct_MN   q_M

20 Comparison Metrics Dot Product $\mathrm{Sim}(Doc_n, Query) = Doc_n \cdot Query = ct_{1n} q_1 + ct_{2n} q_2 + \dots + ct_{Mn} q_M = \sum_{m=1}^{M} ct_{mn} q_m$ But, there can be a large dot product just because documents are very long, so normalise by lengths: Cosine of vectors

21 Comparison Metrics $\cos(Q, D) = \frac{Q \cdot D}{|Q| \, |D|}$ Cosine of the angle between the Document and Query vectors (diagram for M=3)
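A small sketch of why the dot product is normalised by vector length. The vectors are stored as dicts of term counts, and the counts are invented for illustration: the "long" document is the "short" one repeated ten times, so its dot product with the query is ten times larger while its cosine is unchanged.

```python
import math


def dot(q, d):
    """Dot product of two sparse count vectors stored as dicts."""
    return sum(count * d.get(term, 0) for term, count in q.items())


def cosine(q, d):
    """Cosine of the angle between two sparse count vectors."""
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot(q, d) / (norm_q * norm_d)


query = {"athletes": 1, "dope": 1, "olympics": 1}
short_doc = {"athletes": 1, "dope": 2}
long_doc = {"athletes": 10, "dope": 20}  # same content repeated ten times

print(dot(query, short_doc), dot(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```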

22 Problems? D1) Athletes face dope raids: UK dope body. D2) Athletes urged to snitch on dopers at Olympics. Q) Athletes dope Olympics –But both documents are about the London Olympics and about doping –Indexing on words rather than subject classes

23 Problems? Dimensions are not independent –Drug and Dope are closer together than Dope and London –Apache could mean the server, the helicopter or the tribe. These should be different dimensions Therefore, the cosine is not necessarily an accurate reflection of similarity

24 Index terms What makes a good index term? –The term should describe some aspect of the document –The term should not be so generic that it also describes all the other documents in the collection –A good index term distinguishes a document from the rest of the collection

25 Not all terms are equal Zipf's Law: Frequency*rank of a word in a large text collection is constant Evidence from Tom Sawyer:

Term      Frequency   Rank   Rank*Frequency
he        877         10     8770
but       420         20     8400
one       172         50     8600
name      21          400    8400
family    8           1000   8000
brushed   4           2000   8000
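The rank*frequency products on this slide can be recomputed from the frequency and rank columns. The sketch below does that, and also includes a hypothetical helper, `rank_frequency_products`, that ranks the words of any token list by frequency so the same check can be run on other text.

```python
from collections import Counter

# (term, frequency, rank) rows as given on the Tom Sawyer slide.
ROWS = [
    ("he", 877, 10),
    ("but", 420, 20),
    ("one", 172, 50),
    ("name", 21, 400),
    ("family", 8, 1000),
    ("brushed", 4, 2000),
]

products = {term: freq * rank for term, freq, rank in ROWS}
print(products)


def rank_frequency_products(words):
    """Rank words by descending frequency and return (word, rank * frequency)."""
    ranked = Counter(words).most_common()
    return [(word, rank * freq) for rank, (word, freq) in enumerate(ranked, start=1)]
```

The products stay within roughly ten percent of one another even though the frequencies span more than two orders of magnitude, which is the sense in which the slide calls the product constant.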

26 Text Coverage Coverage with N most frequent words

N        Coverage
1        5% (the)
10       42% (the, and, a, he, but...)
100      65%
1000     90%
10000    99%

Most frequent words are not informative! Least frequent words are typos or too specialised

27 Inverse Document Frequency In a vector model, different words should have different weights –Search for Query: Tom and Jerry –A match on Tom or Jerry should count for more than a match on and The more documents a word appears in, the less useful it is as an index term Documents are characterised by words which are relatively rare in other docs

28 Inverse Document Frequency Numerator = number of documents in collection Denominator = number of documents containing term $t_i$ $\mathrm{idf}_i = \log \left( \frac{|D|}{|\{d : t_i \in d\}|} \right)$

29 tf*idf Normalise term frequency by length of document $\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$ $(\mathrm{tf} \cdot \mathrm{idf})_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$ tf*idf is high for a term in a document if its frequency in the document is high, and its frequency in rest of collection is low
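A minimal sketch of the tf and idf definitions from these two slides. The three-document collection is made up to echo the "Tom and Jerry" query, the natural logarithm is an assumption (the slides do not fix a base), and returning 0 for a term that appears in no document is a choice made here to avoid dividing by zero.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of the term divided by the total number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, docs):
    """log(|D| / number of documents containing the term), natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(docs) / containing)


def tf_idf(term, doc_tokens, docs):
    """Weight of a term in one document relative to the whole collection."""
    return tf(term, doc_tokens) * idf(term, docs)


# Made-up toy collection, already tokenised.
DOCS = [
    ["tom", "and", "jerry"],
    ["tom", "and", "huck"],
    ["becky", "and", "huck"],
]

for term in ("tom", "and", "jerry"):
    print(term, round(tf_idf(term, DOCS[0], DOCS), 3))
```

In this collection "and" occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while "jerry" occurs in only one document and gets the largest weight.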

30 What is an index term? Use Natural Language terms, but how? –Words separated by whitespace or punctuation? End of sentence. BUT Ph.D.? –De-capitalise? Turkey vs turkey? –Stem? plastered – plaster, BUT wander - wand –Multi-word terms? Cheque book –Index stop words? The Who
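The pitfalls listed on this slide show up even in a very small sketch. The splitter and the suffix-stripping stemmer below are deliberately naive and hypothetical (they are not any real stemming algorithm); they are written only to reproduce the slide's examples.

```python
import re


def split_words(text):
    """Split on anything that is not a letter, so punctuation ends a token."""
    return re.findall(r"[A-Za-z]+", text)


def naive_stem(word):
    """Strip one common suffix if enough of the word remains (deliberately crude)."""
    for suffix in ("ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Punctuation splitting breaks the abbreviation apart.
print(split_words("End of sentence. BUT Ph.D.?"))

# De-capitalising merges the country and the bird.
print("Turkey".lower() == "turkey".lower())

# Stemming helps for one word and conflates unrelated words for another.
print(naive_stem("plastered"), naive_stem("wander"))
```

Each shortcut helps recall in the common case and loses a distinction in the awkward one, which is the trade-off the slide's questions are pointing at.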

31 Cheating the system Indexing done by algorithm, not humans No control over documents in collection Websites try to show up at the top of a search Check for cheating –Lists of keywords at the end of document –Text and background are same colour –Text does not fit a statistical model of naturalness (checks for keyword packing) But it's a game that can't be won, UNTIL...

