INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur


1 INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur pabitra@gmail.com

2 Information Retrieval
- Problem definition: given a user’s information need, find documents satisfying that need.
- “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.).
- Types of information: text, images/graphics, speech, video, etc. Text is still the most commonly used.

3 Information Retrieval
- Information Retrieval is a research-driven theoretical and experimental discipline.
- The focus is on different aspects of the information-seeking process:
  - Computer scientist: fast and accurate search engine
  - Librarian: organization and indexing of information
  - Cognitive scientist: the process in the searcher’s mind
  - …
- Progress is influenced by advances in Computational Linguistics, Information Visualization, Cognitive Psychology, HCI, …

4 Information Retrieval
- Basic principle:
  - Document -> list of keywords / content-descriptors / terms
  - User’s information need -> (natural-language) query -> list of keywords
  - Measure the overlap between query and documents.

5 Stages of IR
[Diagram; labels: Creation; Indexing, organizing; Indexed and structured information; Information Retrieval (Searching, Browsing)]

6 IR process
[Diagram; labels: Real world; Collection of documents; Document representations; Anomalous state of knowledge; Information need; Query; Matching; Results]

7 Document Representation: Indexing Inverted index
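
The slide names the inverted index without showing one. As a minimal sketch (whitespace tokenization and the sample documents are assumptions made for illustration), it can be built as a map from each term to the sorted ids of the documents that contain it:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection.
docs = {
    1: "information retrieval finds documents",
    2: "an inverted index maps terms to documents",
    3: "retrieval uses an inverted index",
}
index = build_inverted_index(docs)
```

Keeping each postings list sorted is what makes the list-merging operations of Boolean retrieval cheap.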

8 Vocabulary
- Vocabulary (indexing language): the set of concepts (terms or phrases) that can be used to index documents in a collection.
- Controlled
  - Specific to specialized domains
  - Potential for increased consistency of indexing and precision of retrieval
- Uncontrolled (free)
  - Potentially all the terms in the documents
  - Potential for increased recall

9 Indexing
- Tokenize: identify individual words.
- Stopword removal: eliminate common words, e.g. and, of, the.
- Stemming: reduce words to a common root, e.g. analysis, analyze, analyzing -> analy; use standard algorithms (Porter).
- Thesaurus: find synonyms for words in the document.
- Phrases: find multi-word terms, e.g. computer science, data mining; use syntactic/linguistic methods or “statistical” methods.
- Named entities: identify names of people, organizations and places; dates; monetary or other amounts, etc.
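
The first three steps can be sketched as below. This is only an illustration: the stopword list is a small assumed sample, and `crude_stem` is a toy suffix stripper, not the Porter algorithm the slide recommends.

```python
import re

# Assumption: a tiny sample stopword list; real systems use longer lists.
STOPWORDS = {"and", "of", "the", "a", "an", "in", "to", "is"}
# Assumption: toy suffix list, checked in this order.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and return its runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


terms = index_terms("The indexing of documents and queries")
```

For the sample sentence this yields `index`, `document` and `queri`; a stem only has to be consistent across word forms, not a dictionary word.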

10 Boolean Retrieval Model
- Keywords combined using AND, OR, (AND) NOT, e.g. (medicine OR treatment) AND (hypertension OR “high blood pressure”)
- Efficient and easy to implement (list merging)
  - AND: intersection
  - OR: union
- Drawbacks
  - OR: one match is as good as many
  - AND: one miss is as bad as all
  - No ranking
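
The list merging mentioned above can be sketched as two linear passes over sorted postings lists. The postings below are made-up document ids, and the query is a shortened form of the slide's example.

```python
def intersect(p1, p2):
    """AND: ids present in both sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: ids present in either sorted postings list, without duplicates."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])
    out.extend(p2[j:])
    return out


# Hypothetical postings lists (sorted document ids).
postings = {
    "medicine": [1, 4, 7],
    "treatment": [2, 4, 9],
    "hypertension": [4, 5, 9],
}
# (medicine OR treatment) AND hypertension
result = intersect(
    union(postings["medicine"], postings["treatment"]),
    postings["hypertension"],
)
```

The result is an unordered set of matches: document 4 matches both OR terms and document 9 only one, yet nothing in the output distinguishes them, which is the "no ranking" drawback.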

11 Term Weighting
- Any text item (“document”) is represented as a list of terms and associated weights.
- Term = keyword or content-descriptor
- Weight = measure of the importance of a term in representing the information contained in the document

12 Vector Space Model
- Term frequency (tf): repeated words are strongly related to content.
- Inverse document frequency (idf): an uncommon term is more important.
- Normalization by document length
  - Long documents contain many distinct words.
  - Long documents contain the same word many times.
  - Term weights for long documents should be reduced.
  - Use # bytes, # distinct words, Euclidean length, etc.
- Weight = tf x idf / normalization
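
One possible reading of "Weight = tf x idf / normalization" is sketched below. The slide fixes neither the idf formula nor the normalizer; this sketch assumes idf = log(N / df) and Euclidean-length normalization, and the three sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumption: raw term count times log(N / df).
        raw = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Assumption: Euclidean length as the normalizer (1.0 if all weights are 0).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Hypothetical toy collection.
vectors = tfidf_vectors([
    "apple banana apple",
    "banana cherry",
    "cherry cherry date",
])
```

In the first document "apple" outweighs "banana" twice over: it is repeated (higher tf) and occurs in fewer documents (higher idf).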

13 Retrieval
- Measure vocabulary overlap between user query and documents.
  - Use the inverted index.
  - Cosine of the angle between document and query vectors
- Ranked retrieval
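
A minimal sketch of cosine-based ranking over sparse term-weight vectors follows. The vectors are invented, and for brevity every document is scored directly; a real system would use the inverted index to touch only documents that share a term with the query.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_vec, doc_vecs):
    """Return document ids ordered by decreasing cosine similarity."""
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical weighted vectors.
doc_vecs = {
    "d1": {"road": 1.0, "accident": 2.0},
    "d2": {"road": 1.0, "map": 3.0},
    "d3": {"cricket": 1.0},
}
query_vec = {"road": 1.0, "accident": 1.0}
ranking = rank(query_vec, doc_vecs)
```

Unlike the Boolean model, a partial match (d2) is still returned, only lower in the list.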

14 Query Expansion
- Searching depends on matching keywords between the user query and the document.
- Nature of language -> searchers and document creators may use different keywords to denote the same “concept”.
  - Example: fatalities in road accidents on G.T. Road
- Vocabulary mismatch -> poor retrieval quality
- Problem aggravated by short queries + large, heterogeneous databases
- Solution: expand the query by adding related words/phrases.
- Issues:
  - Select which terms to add to the query.
  - Calculate weights for the added terms.
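
The two issues listed above can be made concrete with a small sketch. Everything here is assumed for illustration: the related-terms table is hand-made, and giving added terms a fixed weight of 0.5 is just one simple choice.

```python
def expand_query(terms, related, added_weight=0.5):
    """Return {term: weight}: original terms at 1.0, added terms at added_weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for r in related.get(t, []):
            # setdefault never lowers the weight of an original query term.
            weights.setdefault(r, added_weight)
    return weights


# Hypothetical hand-made table of related terms.
RELATED = {
    "fatalities": ["deaths", "casualties"],
    "accidents": ["crashes", "collisions"],
}
expanded = expand_query(["fatalities", "road", "accidents"], RELATED)
```

Down-weighting the added terms keeps the expanded query anchored to what the user actually typed.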

15 Relevance Feedback
- The original query is used to retrieve some number of documents.
- The user examines some of the retrieved documents and provides feedback about which are relevant and which are non-relevant.
- The system uses the feedback to “learn” a better query:
  - select/emphasize words that occur more frequently in relevant documents than in non-relevant documents;
  - eliminate/de-emphasize words that occur more frequently in non-relevant than in relevant documents.
- The resulting query should bring in more relevant documents and fewer non-relevant documents.
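
The slide does not name a specific method. One standard way to realize this kind of update is the Rocchio formula, sketched here as an assumption; the alpha, beta and gamma values are commonly quoted defaults, not values taken from the slides.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones.

    All arguments are sparse {term: weight} vectors (or lists of them).
    """
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant)
    for doc in nonrelevant:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    # Terms pushed to zero or below are eliminated from the query.
    return {t: w for t, w in new_q.items() if w > 0}


# Hypothetical feedback on a one-word query.
new_query = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

Here "cat" enters the query because it occurs only in the relevant document, while "car" ends up with a negative weight and is dropped.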

16 Link/Citation Analysis
- In uncontrolled environments like the WWW, documents are uncontrolled and untrusted, with commercial implications.
- The presence of terms by itself does not signify relevance.
- Spamming
- Importance of author
- Link/citation analysis

17 Page Rank
- Used in the Google search engine
- ’Global’ ranking of every web page, calculated from the hyperlink structure of the web (content ignored)
- Documents with matching keywords are returned in global rank order.
- Principle: highly linked pages are more important than pages with few links. A page has a high rank if the sum of the ranks of its backlinks is high.
- Most effective for underspecified (general) queries
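
The principle can be sketched as a power iteration over a small link graph. This is an illustrative version under stated assumptions, not Google's implementation: the damping factor 0.85, the fixed iteration count, the uniform spreading of rank from pages without outlinks, and the four-page graph are all choices made here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outlinks spreads its rank uniformly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

C ranks highest because it has the most backlinks, and A outranks B and D because its single backlink comes from the highly ranked C.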

18 Page Rank

19 Open Source Search Engines
- Lucene
- Terrier
- Zettair
- …..
- Lucene is the search engine used by DSpace.

20 Lucene/Solr Architecture
[Architecture diagram; labels: Request Handlers (/select, /spell, /admin); Update Handlers (XML, CSV, binary); Response Writers (XML, Binary, JSON); Data Import Handler (SQL/RSS); Extracting Request Handler (PDF/WORD); Apache Tika; Search Components (Query, Spelling, Faceting, Highlighting, Debug, Statistics, More like this, Distributed Search, Clustering); Update Processors (Signature, Logging, Indexing); Schema; Config; Caching; Faceting; Filtering; Search; Query Parsing; Highlighting; Index Replication; Analysis; Apache Lucene (Core Search, IndexReader/Searcher, Indexing, IndexWriter, Text Analysis)]

21 Evaluation Background
- User has an information need.
- Information need is converted into a query.
- Documents are relevant or non-relevant.
- Ideal system retrieves all and only the relevant documents.

22 Set Based Metrics
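
The body of this slide did not survive in the transcript. The usual set-based measures are precision, recall and their harmonic mean (F1); assuming those are what the slide covered, they can be computed as follows on made-up document ids.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based metrics over a retrieved set and a relevant set of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical run: 4 documents retrieved, 5 relevant, 2 in common.
p, r, f = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

The ideal system described on the previous slide, which retrieves all and only the relevant documents, scores 1.0 on all three.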

23 Evaluation Forums TREC, CLEF, NTCIR

24 References
- Introduction to Information Retrieval. Manning, Raghavan, Schütze.
- Lucene in Action. Manning Publications.

