Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR.


1 Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR

2 Motivation
- IR: representation, storage, organization of, and access to unstructured data
- Focus is on the user information need
- Example information needs:
  - When did the Buffalo Bills last win the Super Bowl?
  - Find all docs containing information on cricket players who are (i) temperamental, (ii) popular in their countries, and (iii) play in international test series
- Emphasis is on the retrieval of information (not data)

3 Motivation
- Data retrieval
  - which docs contain a set of keywords?
  - well-defined semantics
  - a single erroneous object implies failure
- Information retrieval
  - information about a subject or topic
  - deals with unstructured text
  - semantics is frequently loose; small errors are tolerated
- An IR system must
  - interpret the contents of information items
  - generate a ranking which reflects relevance
  - the notion of relevance is most important

4 Basic Concepts: The User Task
- Retrieval of information or data
  - purposeful: the needle-in-a-haystack problem
- Browsing
  - glancing around, e.g., Formula 1 racing: cars, Le Mans, France, tourism
- Filtering (push rather than pull)

5 Query
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- Could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia
  - slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - other operations (e.g., find the phrase "Romans and countrymen") not feasible
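The grep-style approach above amounts to a full linear scan of the corpus on every query. A minimal sketch, with made-up stand-in snippets for the play texts (chosen to agree with the answers on the "Answers to query" slide):

```python
# Naive "grep" approach: scan every play's full text on every query.
# Cost is proportional to the total size of the corpus, every time.
plays = {
    "Antony and Cleopatra": "... Brutus ... Caesar ...",            # stand-in text
    "Julius Caesar": "... Brutus ... Caesar ... Calpurnia ...",     # excluded by NOT
    "Hamlet": "... Brutus ... Caesar ...",
    "The Tempest": "... no principals mentioned ...",               # no match
}

def naive_query(plays):
    """Plays containing Brutus AND Caesar but NOT Calpurnia."""
    hits = []
    for title, text in plays.items():
        if "Brutus" in text and "Caesar" in text and "Calpurnia" not in text:
            hits.append(title)
    return hits

print(naive_query(plays))
```

Even this simple AND/NOT query touches every byte of every document, which is why the slides move to a precomputed index next.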

6 Term-document incidence
- Matrix entry is 1 if the play contains the word, 0 otherwise
- The rows used on the next slide (columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):
    Brutus     1 1 0 1 0 0
    Caesar     1 1 0 1 1 1
    Calpurnia  0 1 0 0 0 0

7 Incidence vectors
- So we have a 0/1 vector for each term
- To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND
- 110100 AND 110111 AND 101111 = 100100

8 Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [aside to Domitius Enobarbus]: Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: / I was killed i' the Capitol; / Brutus killed me.

9 Bigger document collections
- Consider N = 1 million documents, each with about 1K terms
- Average 6 bytes/term (including spaces/punctuation), so about 6 GB of data in the documents
- Say there are M = 500K distinct terms among these

10 Can't build the matrix
- A 500K x 1M matrix has half a trillion 0's and 1's
- But it has no more than one billion 1's (at most 1K distinct terms per document)
- The matrix is extremely sparse: >99% zeros
- A better representation records only the 1 positions: the inverted index
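The figures on the last two slides follow from quick arithmetic:

```python
# Arithmetic behind the collection-size and sparsity claims.
N = 1_000_000        # documents
M = 500_000          # distinct terms
terms_per_doc = 1_000

data_bytes = N * terms_per_doc * 6   # 6 bytes/term -> ~6 GB of raw text
cells = M * N                        # incidence matrix entries
max_ones = N * terms_per_doc         # at most 1K distinct terms per doc
sparsity = 1 - max_ones / cells

print(data_bytes)   # 6_000_000_000 (~6 GB)
print(cells)        # 500_000_000_000 (half a trillion)
print(max_ones)     # 1_000_000_000 (one billion)
print(sparsity)     # 0.998 -> more than 99% zeros
```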

11 Ad-Hoc Retrieval
- The most standard IR task
- The system provides documents from the collection that are relevant to an arbitrary user information need
- Information need: the topic the user wants to know about
- Query: the user's abstraction of the information need
- Relevance: a document is relevant if the user perceives it as valuable with respect to the information need

12 Issues to be Addressed by IR
- How to improve quality of retrieval
  - Precision: what fraction of the returned results are relevant to the information need?
  - Recall: what fraction of the relevant documents in the collection are returned by the system?
- Understanding the user information need
- Faster indexes and smaller query response times
- Better understanding of user behaviour
  - interactive retrieval
  - visualization techniques
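Precision and recall as defined above can be computed from two sets of document IDs; the IDs below are hypothetical, chosen only to illustrate the two ratios:

```python
# Precision and recall for one query (hypothetical document IDs).
returned = {1, 2, 5, 8}        # documents the system returned
relevant = {2, 5, 8, 13, 21}   # documents actually relevant to the need

true_positives = returned & relevant          # returned AND relevant
precision = len(true_positives) / len(returned)  # fraction of returned that are relevant
recall = len(true_positives) / len(relevant)     # fraction of relevant that are returned

print(precision, recall)  # 0.75 0.6
```

Note the trade-off the slide hints at: returning more documents can only raise recall, but usually lowers precision.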

13 Inverted index
- For each term T, store a list of all documents that contain T
- Do we use an array or a list for this?
- Example postings:
    Brutus:    2 4 8 16 32 64 128
    Caesar:    1 2 3 5 8 13 21 34
    Calpurnia: 13 16
- What happens if the word Caesar is added to document 14?

14 Inverted index
- Linked lists generally preferred to arrays
  - dynamic space allocation
  - insertion of terms into documents is easy
  - space overhead of pointers
- Dictionary (the terms) points to postings (the docID lists):
    Brutus:    2 4 8 16 32 64 128
    Caesar:    1 2 3 5 8 13 21 34
    Calpurnia: 13 16
- Postings sorted by docID (more later on why)
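A minimal sketch of the dictionary-plus-postings structure, using Python lists in place of linked lists and the postings from the slides; `add_posting` is a hypothetical helper answering slide 13's question about adding Caesar to document 14:

```python
from bisect import insort

# Inverted index: dictionary mapping each term to a postings list
# kept sorted by docID (docIDs taken from the slides' example).
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, preserving sorted docID order."""
    postings = index.setdefault(term, [])
    if doc_id not in postings:
        insort(postings, doc_id)  # binary-search insertion keeps the list sorted

# Slide 13's question: what happens if Caesar is added to document 14?
add_posting(index, "Caesar", 14)
print(index["Caesar"])  # [1, 2, 3, 5, 8, 13, 14, 21, 34]
```

With array-backed lists the insertion shifts later entries; a linked list avoids that shift, which is the trade-off the slide is weighing.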

15 Inverted index construction
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer -> token stream: Friends Romans Countrymen
- Linguistic modules -> modified tokens: friend roman countryman
- Indexer -> inverted index: friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16
- More on these later
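The three stages above can be sketched end to end. The `normalize` step here is a crude lowercase-plus-strip-final-"s" stand-in for real linguistic modules (stemming, stopword removal), and the two documents are made up for illustration:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokenizer: turn raw text into a token stream."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic modules (crude stand-in): lowercase, strip a final 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_index(docs):
    """Indexer: map each normalized term to a sorted list of docIDs."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):   # ascending docIDs
        for term in sorted(set(normalize(t) for t in tokenize(text))):
            index[term].append(doc_id)          # postings stay sorted by docID
    return dict(index)

docs = {1: "Friends, Romans, countrymen.", 2: "Romans counted friends."}
print(build_index(docs))
```

Each stage is a separate function, mirroring the slide's pipeline: the tokenizer and linguistic modules transform the text, and only the indexer touches the postings.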

16 Basic Concepts
- Logical view of the documents: documents represented by a set of index terms or keywords
- Document representation viewed as a continuum from full text to index terms; the logical view of docs might shift along it:
  structure recognition -> accents, spacing, stopwords -> noun groups -> stemming -> automatic or manual indexing

17 The Retrieval Process (shown as a system diagram on the slide)
- Indexing: Document Collection -> Text Operations -> logical view of the docs -> Indexer -> inverted file (the index)
- Retrieval: user need -> User Interface -> text query -> Query Operations -> Searching against the index -> retrieved docs -> Ranking (using indexing criteria and preferences) -> ranked docs -> user feedback, fed back into query operations

18 Applications of IR
- Specialized domains: biomedical, legal, patents, intelligence
- Summarization
- Cross-lingual retrieval and information access
- Question-answering systems (e.g., Ask Jeeves)
- Web/text mining: data mining on unstructured text
- Multimedia IR: images, document images, speech, music
- Web applications: shopbots, personal assistant agents

19 IR Techniques
- Machine learning: clustering, SVMs, latent semantic indexing, etc.; improving relevance feedback, query processing, etc.
- Natural language processing and computational linguistics: better indexing and query processing; incorporating domain knowledge (e.g., synonym dictionaries); benefits of NLP yet to be shown for large-scale IR
- Information extraction: highly focused NLP; named entity tagging, relationship/event detection
- Text indexing and compression
- User interfaces and visualization
- AI: advanced QA systems, inference, etc.

