Basic IR Concepts & Techniques
ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign

Text Information Systems: Applications
[Diagram: three application areas of a TIS]
- Access: select information
- Mining: create knowledge
- Organization: add structure/annotations

Two Modes of Information Access: Pull vs. Push
Pull mode
- Users take the initiative and "pull" relevant information out of a text information system (TIS)
- Works well when a user has an ad hoc information need
Push mode
- The system takes the initiative and "pushes" relevant information to users
- Works well when a user has a stable information need or the system has good knowledge about a user's need

Pull Mode: Querying vs. Browsing
Querying
- The user enters a (keyword) query, and the system returns relevant documents
- Works well when the user knows exactly what keywords to use
Browsing
- The system organizes information with structures, and the user navigates to relevant information by following paths enabled by those structures
- Works well when the user wants to explore information, doesn't know what keywords to use, or can't conveniently enter a query (e.g., on a smartphone)

Information Seeking as Sightseeing
Sightseeing: do you know the address of the attraction?
- Yes: take a taxi and go directly to the site
- No: walk around, or take a taxi to a nearby place and then walk around
Information seeking: do you know exactly what you want to find?
- Yes: use the right keywords as a query and find the information directly
- No: browse the information space, or start with a rough query and then browse
Querying is faster, but browsing is useful when querying fails or when the user wants to explore

Text Mining: Two Different Views
Data mining view: explore patterns in textual data
- Find latent topics
- Find topical trends
- Find outliers and other hidden patterns
Natural language processing view: make inferences based on partial understanding of natural language text
- Information extraction
- Knowledge representation + inference
The two views are often mixed in practice

Applications of Text Mining
Direct applications
- Discovery-driven (bioinformatics, business intelligence, etc.): we have specific questions; how can we exploit data mining to answer them?
- Data-driven (the Web, literature, customer reviews, etc.): we have a lot of data; what can we do with it?
Indirect applications
- Assist information access (e.g., discover latent topics to better summarize search results)
- Assist information organization (e.g., discover hidden structures)

IR Topics (Broader View): Text Information Systems (TIS)
[Diagram: capabilities of a TIS built on natural language content analysis: search, text filtering, categorization, summarization, clustering, extraction, topic analysis, visualization. Retrieval applications support information access, mining applications support knowledge acquisition, and both connect to information organization.]

Elements of TIS: Natural Language Content Analysis
Natural language processing (NLP) is the foundation of a TIS
- Enables understanding of the meaning of text
- Provides a semantic representation of text for the TIS
Current NLP techniques mostly rely on statistical machine learning enhanced with limited linguistic knowledge
- Shallow techniques are robust, but deeper semantic analysis is feasible only for very limited domains
Some TIS capabilities require deeper NLP than others
Most text information systems use very shallow NLP (a "bag of words" representation), as illustrated below
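To make the "bag of words" point concrete, here is a minimal sketch in Python (an illustration added for this transcript, not from the slides; the whitespace tokenization is deliberately naive):

    from collections import Counter

    def bag_of_words(text: str) -> Counter:
        # Shallowest possible representation: lowercase, split on whitespace,
        # and keep only term counts; word order and syntax are discarded.
        return Counter(text.lower().split())

    print(bag_of_words("new campaign news about the new candidate"))
    # 'new' counts twice; the positions that distinguished its two uses are lost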

Elements of TIS: Text Access
- Search: take a user's query and return relevant documents
- Filtering/recommendation: monitor an incoming stream and recommend relevant items to users (or discard non-relevant ones)
- Categorization: classify a text object into one of a set of predefined categories
- Summarization: take one or more text documents and generate a concise summary of the essential content

Elements of TIS: Text Mining
- Topic analysis: take a set of documents, and extract and analyze the topics in them
- Information extraction: extract entities, relations between entities, and other "knowledge nuggets" from text
- Clustering: discover groups of similar text objects (terms, sentences, documents, …)
- Visualization: visually display patterns in text data

IR Topics (Narrow View)
[Diagram: retrieval pipeline. The user's query and the documents are converted into a query representation and a document representation (indexing), a ranking component matches them to produce results (searching), and the user's judgments drive query modification and learning, all behind an interface.]
1. Evaluation
2. Retrieval (ranking) models
3. Document representation/structure
4. Efficiency & scalability
5. Search result summarization/presentation
6. User interface (browsing)
7. Feedback/learning
Our focus: 1, 2, 7

Typical TR System Architecture
[Diagram: documents flow through a tokenizer and an indexer into an index (the document representation); the user's query becomes a query representation; a scorer matches the two to produce results; the user's judgments feed a feedback loop that updates the query.]

Tokenization
Normalize lexical units: words with similar meanings should be mapped to the same indexing term
Stemming: map all inflectional forms of a word to the same root form, e.g.,
- computer -> compute
- computation -> compute
- computing -> compute
Some languages (e.g., Chinese) pose challenges in word segmentation
A toy stemming sketch follows below.
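A toy suffix-stripping stemmer in Python, just to illustrate the mappings above (a sketch only; the three rules were chosen to reproduce the slide's examples, whereas real stemmers such as Porter's use a much larger rule set and map all three forms to the truncated root "comput"):

    # Illustrative rules only -- not a real stemming algorithm.
    SUFFIX_RULES = [
        ("ation", "e"),  # computation -> compute
        ("ing", "e"),    # computing   -> compute
        ("er", "e"),     # computer    -> compute
    ]

    def stem(word: str) -> str:
        # Apply the longest matching suffix rule; leave unknown words unchanged.
        for suffix, repl in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
            if word.endswith(suffix):
                return word[: -len(suffix)] + repl
        return word

    for w in ["computation", "computing", "computer"]:
        print(w, "->", stem(w))  # all three map to "compute"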

Indexing
Indexing = converting documents into data structures that enable fast search
The inverted index is the dominant indexing method (used by all search engines): the basic idea is to enable quick lookup of all the documents containing a particular term
Other indices (e.g., a document index) may be needed for feedback
A minimal sketch appears below.
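A minimal inverted index in Python (illustrative only; production indexes store postings lists with term frequencies and positions, compress them, and keep them on disk):

    from collections import defaultdict

    def build_inverted_index(docs):
        # Map each term to the set of IDs of documents that contain it.
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    docs = ["news about presidential campaign",
            "news about organic food campaign"]
    index = build_inverted_index(docs)
    print(sorted(index["campaign"]))  # [0, 1]: one lookup finds every matching doc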

How to Design a Ranking Function?
Query q = q_1, …, q_m, where q_i ∈ V
Document d = d_1, …, d_n, where d_i ∈ V
Ranking function: f(q, d) ∈ ℝ
A good ranking function should rank relevant documents above non-relevant ones
Key challenge: how do we measure the likelihood that document d is relevant to query q?
Retrieval model = a formalization of relevance (a computational definition of relevance)

Many Different Retrieval Models
Similarity-based models
- A document that is more similar to the query is assumed to be more likely relevant to it
- relevance(d, q) = similarity(d, q)
- e.g., the vector space model
Probabilistic models (language models)
- Compute the probability that a given document is relevant to the query under a probabilistic model
- relevance(d, q) = p(R = 1 | d, q), where R ∈ {0, 1} is a binary random variable
- e.g., query likelihood
A sketch of query-likelihood scoring follows below.
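A sketch of query-likelihood scoring with Jelinek-Mercer smoothing (a standard way to instantiate the model; the smoothing method and the lambda value are choices made for this illustration, not prescribed by the slide):

    import math
    from collections import Counter

    def query_likelihood(query, doc, collection, lam=0.1):
        # Score log p(q|d) under a smoothed unigram language model:
        # p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|collection).
        # Assumes every query word occurs somewhere in the collection.
        doc_tf, coll_tf = Counter(doc), Counter(collection)
        score = 0.0
        for w in query:
            p_d = doc_tf[w] / len(doc)          # max-likelihood doc model
            p_c = coll_tf[w] / len(collection)  # background collection model
            score += math.log((1 - lam) * p_d + lam * p_c)
        return score

    docs = [["news", "about", "campaign"], ["organic", "food", "news"]]
    background = [w for d in docs for w in d]
    print(query_likelihood(["campaign", "news"], docs[0], background))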

Relevance Feedback
[Diagram: the user issues a query to the retrieval engine over the document collection; the results d_1 … d_k come back; the user judges them (d_1 +, d_2 -, d_3 +, …, d_k -); the feedback module uses these judgments to produce an updated query.]
Users make explicit relevance judgments on the initial results (the judgments are reliable, but users don't want to make the extra effort)
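The slide doesn't commit to a particular query-update method; a classic choice in the vector space model is Rocchio feedback. A minimal sketch (the alpha/beta/gamma values are conventional defaults, not taken from the slide):

    def rocchio(query_vec, relevant, nonrelevant,
                alpha=1.0, beta=0.75, gamma=0.15):
        # Vectors are dicts mapping term -> weight. Move the query toward
        # the centroid of judged-relevant docs and away from the centroid
        # of judged-non-relevant ones.
        updated = {t: alpha * w for t, w in query_vec.items()}
        for doc in relevant:
            for t, w in doc.items():
                updated[t] = updated.get(t, 0.0) + beta * w / len(relevant)
        for doc in nonrelevant:
            for t, w in doc.items():
                updated[t] = updated.get(t, 0.0) - gamma * w / len(nonrelevant)
        # Negative weights are usually clipped to zero.
        return {t: w for t, w in updated.items() if w > 0}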

Pseudo/Blind/Automatic Feedback
[Diagram: the same loop, but with no user in it; the top results (e.g., top 10) are simply assumed to be relevant (d_1 +, d_2 +, d_3 +, …) and fed back to produce the updated query.]
The top-k initial results are simply assumed to be relevant (the judgments aren't reliable, but no user activity is required)
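Pseudo feedback can reuse the same update machinery: treat the top-k results as relevant, with no negative examples. A sketch, building on the hypothetical rocchio function above:

    def pseudo_feedback(query_vec, ranked_doc_vecs, k=10):
        # Blind feedback: assume the k top-ranked documents are relevant.
        assumed_relevant = ranked_doc_vecs[:k]
        return rocchio(query_vec, assumed_relevant, nonrelevant=[])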

Implicit Feedback
[Diagram: the same loop, but the judgments come from clickthroughs: clicked results are marked + and skipped ones -.]
User-clicked documents are assumed to be relevant and skipped ones non-relevant (the judgments aren't completely reliable, but they require no extra effort from users)

Evaluation: Two Different Reasons
Reason 1: to assess how useful an IR system/technology would be (for an application)
- Measures should reflect the utility to users in a real application
- Usually done through user studies (interactive IR evaluation)
Reason 2: to compare different systems and methods (to advance the state of the art)
- Measures only need to be correlated with the utility to actual users, so they don't have to accurately reflect the exact utility
- Usually done through test collections (test-set IR evaluation)

What to Measure?
Effectiveness/accuracy: how accurate are the search results?
- Measures a system's ability to rank relevant documents above non-relevant ones
Efficiency: how quickly can a user get the results? How much computing resource is needed to answer a query?
- Measures space and time overhead
Usability: how useful is the system for real user tasks?
- Assessed through user studies

The Cranfield Evaluation Methodology
A methodology for laboratory testing of system components, developed in the 1960s
Idea: build reusable test collections and define measures
- A sample collection of documents (simulates a real document collection)
- A sample set of queries/topics (simulates user queries)
- Relevance judgments (ideally made by the users who formulated the queries) -> an ideal ranked list
- Measures that quantify how well a system's result matches the ideal ranked list
A test collection can then be reused many times to compare different systems; a sketch of one common measure follows below.
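One widely used measure for comparing a system's ranked list against the relevance judgments is average precision (the slide doesn't name a specific measure; this one is chosen here as an illustration):

    def average_precision(ranked_ids, relevant_ids):
        # Average of precision@k over the ranks k at which a relevant
        # document appears; rewards placing relevant docs near the top.
        hits, precision_sum = 0, 0.0
        for k, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / k
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    print(average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"}))
    # (1/1 + 2/4) / 2 = 0.75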

What You Should Know
- Information access modes: pull vs. push
- Pull mode: querying vs. browsing
- Basic elements of a TIS:
  - search, filtering/recommendation, categorization, summarization
  - topic analysis, information extraction, clustering, visualization
- The terms for the major concepts and techniques (e.g., query, document, retrieval model, feedback, evaluation, inverted index)