Adaptive Information Retrieval. Advaith Siddharthan. References: Introduction to Information Retrieval, Manning, Raghavan and Schütze (book available online); Wikipedia (just search for terms used in these slides...)

Overview. 3 lectures:
–Information Retrieval: history and evolution; vector models
–Links as recommendations: PageRank / HITS
–Personalised Web Search

Why Study Information Retrieval? Google:
–Searches billions of pages
–Gives back results in under a second
–Worth $30,000,000,000

Library Index Card

Organising documents. Fields associated with a document:
–Author, Title, Year, Publisher, Number of pages, etc.
–Subject areas, curated by librarians
Creating a classification scheme for all the books in a library is a lot of work. You can only search on indexed fields. Can cards be indexed on multiple fields?

Edge-notched Cards (1896)

Key Notions. Index fields and terms:
–Values assigned to fields for each document
–E.g., Author, Title, Subject
Query:
–field:value pairs that can be combined with the boolean logic operators AND, OR, NOT
Retrieval:
–Finding documents that match the query

Key Notions (on edge-notched cards). Index fields:
–Notch a hole to activate an index term
Retrieval:
–Put a needle through the search term's hole
–Relevant documents fall out (their holes are notched)
Logical operations on query terms (see the sketch below):
–AND: put multiple needles through at once
–OR: put one needle through, then another
–NOT: keep the cards remaining on the needle, rather than the ones that fall out
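The needle-and-notch mechanics map directly onto set operations. Below is a minimal sketch in Python; the card names and index terms are invented for illustration.

```python
# Each "card" (document) is notched for a set of index terms; putting a
# needle through a term's hole drops exactly the cards notched for it.
cards = {
    "card1": {"chemistry", "acids"},
    "card2": {"chemistry", "metals"},
    "card3": {"biology", "acids"},
}

def drop(term):
    """Cards that fall off the needle for `term` (i.e. notched for it)."""
    return {name for name, terms in cards.items() if term in terms}

print(drop("chemistry") & drop("acids"))  # AND: both needles -> {'card1'}
print(drop("chemistry") | drop("acids"))  # OR: one needle, then another
print(set(cards) - drop("acids"))         # NOT: cards left on the needle
```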

Edge-notched cards (1896):
–Allow indexing over multiple fields
–Automate the computation of search
Reminds you of punch cards? The Jacquard Loom (1801) was the first machine to use punch cards to control a sequence of operations. Next big jump?

Boolean Search. Very little has changed for information retrieval over closed document collections:
–Documents are labelled with terms from a domain-specific ontology
–Search with boolean operators is permitted over these terms

MESH: Medical Subject Headings
C11 Eye Diseases
–C11.93 Asthenopia
–Conjunctival Diseases
   Conjunctival Neoplasms
   Conjunctivitis
   –Conjunctivitis, Allergic
      »Trachoma
   Pterygium
   Xerophthalmia
...

ACM Classification for CS
B Hardware
–B.3 Memory structures
   B.3.1 Semiconductor Memories
   –Dynamic memory (DRAM)
   –Read-only memory (ROM)
   –Static memory (SRAM)
   B.3.2 Design Styles
   B.3.3 Performance Analysis
   –Simulation
   –Worst-case analysis

Limitations. Manual effort by trained catalogers is required:
–to create the classification scheme
–and to annotate documents with subject classes
Users need to be aware of the subject classes. BUT:
–high-precision searches
–works well for closed collections of documents (libraries, etc.)

The Internet is NOT a closed collection:
–Billions of webpages
–Documents change on a daily basis
–Not possible to index or search by manually constructed subject classes
How does indexing work? How does search work?

Simple Indexing Model: Bag-of-Words
–Documents and queries are represented as bags of words
–Ignore the order of words
–Ignore morphology/syntax (cat vs cats, etc.)
–Just count the number of matches between the words in the document and the query (see the sketch below)
This already works rather well!
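A minimal sketch of this matching, assuming lowercasing and whitespace tokenisation (both simplifications, as discussed later in these slides):

```python
from collections import Counter

def bag(text):
    """Bag of words: token counts, ignoring order, case and morphology."""
    return Counter(text.lower().split())

def match_score(query, document):
    """Count matching tokens between the query and document bags."""
    q, d = bag(query), bag(document)
    return sum(min(q[w], d[w]) for w in q)

print(match_score("athletes dope olympics",
                  "Athletes urged to snitch on dopers at Olympics"))  # 2
```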

Vector Space Model. Ranks documents for relevance to a query. Documents and queries are vectors:
–What do the vectors look like?
–How do you compute relevance?

Coordinate Matching
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
Q · D1 = 2 (Athletes + dope)
Q · D2 = 2 (Athletes + Olympics)

Term Frequency
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
Q · D1 = 3 (Athletes + 2*dope)
Q · D2 = 2 (Athletes + Olympics)
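The two slides above differ only in whether vector entries are binary (coordinate matching) or raw counts (term frequency). A small sketch reproducing the numbers, with naive punctuation stripping as an assumed simplification:

```python
from collections import Counter

def tokens(text):
    # Naive normalisation: lowercase and strip basic punctuation.
    return text.lower().replace(":", "").replace(".", "").split()

D1 = Counter(tokens("Athletes face dope raids: UK dope body."))
D2 = Counter(tokens("Athletes urged to snitch on dopers at Olympics."))
Q  = Counter(tokens("Athletes dope Olympics"))

def dot(q, d):
    return sum(q[w] * d[w] for w in q)

def binarise(bag):
    return Counter({w: 1 for w in bag})

# Coordinate matching: binary vectors.
print(dot(binarise(Q), binarise(D1)), dot(binarise(Q), binarise(D2)))  # 2 2
# Term frequency: raw counts ("dope" occurs twice in D1).
print(dot(Q, D1), dot(Q, D2))                                          # 3 2
```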

Similarity Metrics. Each cell is the number of times the word occurs in the document or query (a simplification, more later...):

        Doc1   Doc2   Doc3   ...  DocN   Query
Term1   ct11   ct12   ct13   ...  ct1N   q1
Term2   ct21   ct22   ct23   ...  ct2N   q2
...
TermM   ctM1   ctM2   ctM3   ...  ctMN   qM

Comparison Metrics: Dot Product
Sim(Doc_n, Query) = Doc_n · Query = ct_1n*q_1 + ct_2n*q_2 + ... + ct_Mn*q_M = Σ_m ct_mn*q_m
But there can be a large dot product just because documents are very long, so normalise by the vector lengths: the cosine of the vectors.

Comparison Metrics: Cosine
Cosine(Q, D) = Q·D / (|Q| |D|)
This is the cosine of the angle between the document and query vectors (the slide's diagram shows the M=3 case).
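A minimal sketch of the formula above, over term-count dictionaries:

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine(Q, D) = Q·D / (|Q| |D|) over term-count vectors."""
    dot = sum(q[w] * d.get(w, 0) for w in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

Q  = Counter({"athletes": 1, "dope": 1, "olympics": 1})
D1 = Counter({"athletes": 1, "face": 1, "dope": 2, "raids": 1,
              "uk": 1, "body": 1})
print(round(cosine(Q, D1), 3))  # 0.577: dot = 3, |Q| = sqrt(3), |D1| = 3
```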

Problems?
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
–But both documents are about the London Olympics and about doping
–We are indexing on words rather than subject classes

Problems? Dimensions are not independent:
–Drug and Dope are closer together than Dope and London
–Apache could mean the server, the helicopter or the tribe; these should be different dimensions
Therefore, the cosine is not necessarily an accurate reflection of similarity.

Index terms. What makes a good index term?
–The term should describe some aspect of the document
–The term should not be so generic that it also describes all the other documents in the collection
–A good index term distinguishes a document from the rest of the collection

Not all terms are equal. Zipf's Law: frequency × rank of a word in a large text collection is roughly constant. Evidence from Tom Sawyer (counts from Manning and Schütze):

Term      Frequency   Rank   Rank*Frequency
he        877         10     8770
but       410         20     8200
one       172         50     8600
name      21          400    8400
family    8           1000   8000
brushed   4           2000   8000
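A quick way to check Zipf's law empirically on any corpus; `corpus.txt` is a placeholder for whatever large plain-text file you have to hand:

```python
from collections import Counter

with open("corpus.txt") as f:  # placeholder: any large plain-text file
    ranked = Counter(f.read().lower().split()).most_common()

# If Zipf's law holds, rank * frequency stays roughly constant.
for rank in (10, 20, 50, 400, 1000, 2000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(f"{rank:5d}  {word:15s} {freq:6d}  {rank * freq}")
```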

Text Coverage. Coverage with the N most frequent words:
–1: 5% (the)
–10: 42% (the, and, a, he, but...)
–100: 65%
–1000: 90%
The most frequent words are not informative! The least frequent words are typos, or too specialised.
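The coverage figures above can be reproduced for any corpus in a few lines (the percentages will vary with the text used; `corpus.txt` is again a placeholder):

```python
from collections import Counter

def coverage(tokens, n):
    """Percentage of all tokens covered by the n most frequent words."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(n))
    return 100.0 * covered / len(tokens)

with open("corpus.txt") as f:  # placeholder corpus file
    toks = f.read().lower().split()
for n in (1, 10, 100, 1000):
    print(n, round(coverage(toks, n), 1), "%")
```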

Inverse Document Frequency. In a vector model, different words should have different weights:
–Search for the query: Tom and Jerry
–A match on documents containing Tom or Jerry should count for more than a match on and
The more documents a word appears in, the less useful it is as an index term. Documents are characterised by words which are relatively rare in other docs.

Inverse Document Frequency
idf_i = log ( |D| / |{d : t_i ∈ d}| )
Numerator: the number of documents in the collection. Denominator: the number of documents containing term t_i.

tf*idf. Normalise term frequency by the length of the document:
tf_i,j = n_i,j / Σ_k n_k,j
tf*idf_i,j = tf_i,j × idf_i
tf*idf is high for a term in a document if its frequency in the document is high, and its frequency in the rest of the collection is low.
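A compact sketch of these formulas over a toy two-document collection (the documents are the slides' running example):

```python
import math
from collections import Counter

docs = ["athletes face dope raids uk dope body",
        "athletes urged to snitch on dopers at olympics"]
bags = [Counter(d.split()) for d in docs]
N = len(bags)

def tf(term, bag):
    # Term frequency normalised by document length: n_ij / sum_k n_kj
    return bag[term] / sum(bag.values())

def idf(term):
    # idf_i = log(|D| / |{d : t_i in d}|)
    df = sum(1 for b in bags if term in b)
    return math.log(N / df) if df else 0.0

def tfidf(term, bag):
    return tf(term, bag) * idf(term)

print(tfidf("dope", bags[0]))      # high: frequent here, absent elsewhere
print(tfidf("athletes", bags[0]))  # 0.0: occurs in every document
```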

What is an index term? Use natural language terms, but how? (See the sketch after this list.)
–Words separated by whitespace or punctuation? End of sentence. BUT Ph.D.?
–De-capitalise? Turkey vs turkey?
–Stem? plastered – plaster, BUT wander – wand
–Multi-word terms? Cheque book
–Index stop words? The Who
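A small illustration of the first of these trade-offs; the regular expressions are illustrative assumptions, not a recommended tokeniser:

```python
import re

text = "The Who finished a Ph.D. on cheque book fraud in Turkey."

# Split on any non-word character: "Ph.D." is broken into "ph", "d".
print([t for t in re.split(r"\W+", text.lower()) if t])

# Allow internal dots so "ph.d." survives as one term, but now the
# sentence-final dot sticks to "turkey.": the other horn of the dilemma.
print(re.findall(r"\w+(?:\.\w+)*\.?", text.lower()))
```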

Cheating the system. Indexing is done by algorithms, not humans, and there is no control over the documents in the collection, so websites try to game their way to the top of a search. Check for cheating:
–Lists of keywords at the end of a document
–Text and background in the same colour
–Text that does not fit a statistical model of naturalness (checks for keyword packing)
But it's a game that can't be won, UNTIL...