Information Retrieval Techniques

1 Information Retrieval Techniques

2 Issues and Challenges in IR

3 Issues and Challenges in IR
- Query Formulation
  - Describing information need
- Relevance
  - Relevant to query (system relevancy)
  - Relevant to information need (user relevancy)
- Evaluation
  - System oriented (bypass user)
  - User oriented (relevance feedback)

4 What makes IR “experimental”?
Evaluation
- How do we design experiments that answer our questions?
- How do we assess the quality of the documents that come out of the IR black box?
- Can we do this automatically?

5 Simplification?
[Diagram: the search process as a pipeline]
Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery
Feedback loops:
- source reselection
- query reformulation, vocabulary learning, relevance feedback
Is this itself a vast simplification?

6 The Central Problem in IR
Why is IR hard? Because language is hard!
- Information Seeker: Concepts → Query Terms
- Authors: Concepts → Document Terms
Do these represent the same concepts?

7 Problems in Query Formulation
Stefano Mizzaro's Model of Relevance in IR
- RIN: Real Information Need (Target)
- PIN: Perceived Information Need (Mentality)
- EIN: Expressed Information Need (Natural Language)
- FIN: Formal Information Need (Query)
Paper reference: 4 dimensions of Relevance by Stephen W. Draper

8 Taylor’s Model
- The visceral need (Q1): the actual, but unexpressed, need for information
- The conscious need (Q2): the conscious within-brain description of the need
- The formalized need (Q3): the formal statement of the question
- The compromised need (Q4): the question as presented to the information system
Robert S. Taylor. (1962) The Process of Asking Questions. American Documentation, 13(4),

9 Taylor’s Model and IR Systems
[Diagram] Visceral need (Q1) → Conscious need (Q2) → Formalized need (Q3) → Compromised need (Q4) → IR System → Results
Labels in the diagram: Question Negotiation; naïve users

10 The classic search model
- User task: Get rid of mice in a politically correct way
  - Misconception?
- Info need: Info about removing mice without killing them
  - Misformulation?
- Query: how trap mice alive
- Search: Search engine over the Collection → Results
- Query refinement: Results feed back into the Query

11 Building Blocks of IRS-I
- Different models of information retrieval
  - Boolean model
  - Vector space model
  - Language models
- Representing the meaning of documents
  - How do we capture the meaning of documents?
  - Is meaning just the sum of all terms?
- Indexing
  - How do we actually store all those words?
  - How do we access indexed terms quickly?
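As a rough illustration of the Boolean model named above, the sketch below treats each document as a set of terms and answers an AND query by requiring every query term to be present. The three toy documents and the function name boolean_and are assumptions made for this example; they do not come from the slides.

```python
# Toy collection (hypothetical documents, for illustration only).
docs = {
    1: "how to trap mice alive",
    2: "mice and rats in the city",
    3: "trap music history",
}


def boolean_and(query, docs):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    matches = []
    for doc_id, text in docs.items():
        doc_terms = set(text.lower().split())  # document as a bag of terms
        if all(term in doc_terms for term in terms):
            matches.append(doc_id)
    return sorted(matches)


print(boolean_and("trap mice", docs))  # only document 1 has both terms
```

A document either matches or it does not; a vector space model would replace this all-or-nothing test with a graded similarity score.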

12 Building Blocks of IRS-II
- Relevance Feedback
  - How do humans (and machines) modify queries based on retrieved results?
- User Interaction
  - Information retrieval meets computer-human interaction
  - How do we present search results to users in an effective manner?
  - What tools can systems provide to aid the user in information seeking?

13 IR Extensions
- Filtering and Categorization
  - Traditional information retrieval: static collection, dynamic queries
  - What about static queries against dynamic collections?
- Multimedia Retrieval
  - Thus far, we’ve been focused on text…
  - What about images, sounds, video, etc.?
- Question Answering
  - We want answers, not just documents!

14 Structured, Unstructured, Semi-structured

15 What about databases?
What are examples of databases?
- Banks storing account information
- Retailers storing inventories
- Universities storing student grades
What exactly is a (relational) database?
- Think of them as a collection of tables
- They model some aspect of “the world”

16 A (Simple) Database Example
Tables: Student, Department, Course, Enrollment

17 IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in “tables”

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < AND Manager = Smith.
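A minimal sketch of such a query over the employee table, combining a numerical range condition with an exact text match. The salary threshold is missing from the slide text, so the value 60000 used here is purely an assumption, as is the function name structured_query.

```python
# Rows copied from the employee table on the slide.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

# Assumed threshold: the slide's own value did not survive extraction.
SALARY_LIMIT = 60000


def structured_query(rows, salary_limit, manager):
    """Range condition on a number AND exact match on a text field."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < salary_limit and row["manager"] == manager
    ]


print(structured_query(employees, SALARY_LIMIT, "Smith"))  # ['Ivy']
```

Every row either satisfies both conditions or it does not, which is why results of structured queries are exact.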

18 Database Queries
What would you want to know from a database?
- What classes is John Arrow enrolled in?
- Who has the highest grade in LBSC 690?
- Who’s in the history department?
- Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest address?
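The first two questions can be sketched as SQL queries using Python's built-in sqlite3 module. The schema, the second student, the course titles, and the grades below are all hypothetical; the slides name the tables and the questions but give no column definitions or data.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
INSERT INTO student VALUES (1, 'John Arrow'), (2, 'Jane Doe');
INSERT INTO course VALUES ('LBSC 690', 'Example course A'),
                          ('HIST 101', 'Example course B');
INSERT INTO enrollment VALUES (1, 'LBSC 690', 90), (2, 'LBSC 690', 95),
                              (1, 'HIST 101', 80);
""")

# "What classes is John Arrow enrolled in?"
rows = conn.execute(
    """SELECT e.course_id
       FROM enrollment AS e JOIN student AS s ON s.id = e.student_id
       WHERE s.name = ?
       ORDER BY e.course_id""",
    ("John Arrow",),
).fetchall()
classes = [row[0] for row in rows]

# "Who has the highest grade in LBSC 690?"
top = conn.execute(
    """SELECT s.name
       FROM enrollment AS e JOIN student AS s ON s.id = e.student_id
       WHERE e.course_id = ?
       ORDER BY e.grade DESC LIMIT 1""",
    ("LBSC 690",),
).fetchone()[0]

print(classes, top)
```

Both queries are formally defined and return exactly the rows that satisfy them, in contrast to the ranked, approximate answers an IR system gives.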

19 Unstructured data
Typically refers to free text
Allows
- Keyword queries including operators
- More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
Classic model for searching text documents

22 Semi-structured data
In fact almost no data is “unstructured”
- E.g., this slide has distinctly identified zones such as the Title and Bullets
- … to say nothing of linguistic structure
Facilitates “semi-structured” search such as
- Title contains data AND Bullets contain search
Or even
- Title is about Object Oriented Programming AND Author something like stro*rup
- where * is the wild-card operator
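A small sketch of fielded search over records with named zones, including the wild-card pattern stro*rup. The three records and the helper names are invented for illustration; wild-card matching uses fnmatchcase from Python's standard fnmatch module, which is one possible way to interpret the * operator, not necessarily what a real IR system does.

```python
from fnmatch import fnmatchcase

# Hypothetical records with named zones (fields); not taken from the slides.
records = [
    {"title": "Semi-structured data", "bullets": "zones enable fielded search",
     "author": "Smith"},
    {"title": "Object Oriented Programming", "bullets": "classes and objects",
     "author": "Stroustrup"},
    {"title": "Unstructured data", "bullets": "free text only",
     "author": "Strong"},
]


def zone_contains(record, zone, word):
    """True if the given zone contains the word as a whole token."""
    return word.lower() in record[zone].lower().split()


def author_like(record, pattern):
    """True if the author matches a wild-card pattern such as 'stro*rup'."""
    return fnmatchcase(record["author"].lower(), pattern.lower())


# Title contains data AND Bullets contain search
hits = [r["title"] for r in records
        if zone_contains(r, "title", "data")
        and zone_contains(r, "bullets", "search")]

# Author something like stro*rup
authors = [r["author"] for r in records if author_like(r, "stro*rup")]

print(hits, authors)
```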

23 Comparing IR to databases
Fields
- Structured: Clear semantics (SSN, age)
- Unstructured: No fields (other than text)
Queries
- Structured: Defined (relational algebra, SQL)
- Unstructured: Free text (“natural language”), Boolean
Recoverability
- Structured: Critical (concurrency control, recovery, atomic operations)
- Unstructured: Downplayed, though still an issue
Matching
- Structured: Exact (results are always “correct”)
- Unstructured: Imprecise (need to measure effectiveness)
Hopkins IR Workshop 2005 Copyright © Victor Lavrenko

24 Databases vs. IR
What we’re retrieving
- IR: Mostly unstructured. Free text with some metadata.
- Databases: Structured data. Clear semantics based on a formal model.
Queries we’re posing
- IR: Vague, imprecise information needs (often expressed in natural language).
- Databases: Formally (mathematically) defined queries. Unambiguous.
Results we get
- IR: Sometimes relevant, often not.
- Databases: Exact. Always correct in a formal sense.
Interaction with system
- IR: Interaction is important.
- Databases: One-shot queries.
Other issues
- IR: Issues downplayed.
- Databases: Concurrency, recovery, atomicity are all critical.

25 IRS IN ACTION (TASKS)
Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

26 Outline
- What is the IR problem?
- How to organize an IR system? (Or the main processes in IR)
  - Indexing
  - Retrieval

27 The problem of IR
Goal = find documents relevant to an information need from a large document set
[Diagram: Info. need → Query → IR system (Retrieval over the Document collection) → Answer list]

28 IR problem
First applications: in libraries (1950s)
- ISBN:
- Author: Salton, Gerard
- Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Editor: Addison-Wesley
- Date: 1989
- Content: <Text>
External attributes and internal attribute (content)
- Search by external attributes = Search in DB
- IR: search by content

29 Possible approaches
1. String matching (linear search in documents)
   - Slow
   - Difficult to improve
2. Indexing (*)
   - Fast
   - Flexible to further improvement

30 Indexing-based IR
[Diagram]
- Document → indexing → Representation (keywords)
- Query → indexing (Query analysis) → Representation (keywords)
- Query evaluation compares the two representations

31 Main problems in IR
- Document and query indexing
  - How to best represent their contents?
- Query evaluation (or retrieval process)
  - To what extent does a document correspond to a query?
- System evaluation
  - How good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
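The two system evaluation questions correspond to the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch follows; the document ids are made up, and returning 0.0 for an empty set is a convention chosen for this example.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # precision 2/4, recall 2/3
```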

32 The basic indexing pipeline
- Documents to be indexed: Friends, Romans, countrymen.
- Tokenizer → token stream: Friends Romans Countrymen
- Linguistic modules → modified tokens: friend roman countryman
- Indexer → inverted index: friend, roman, countryman, each pointing to a postings list of document numbers (2 4 13 16 1 in the figure)
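The pipeline stages can be sketched in a few lines. This is an illustration under simplifying assumptions: the "linguistic modules" are replaced by lowercasing plus a tiny hand-written lookup table (NORMALIZE), and the second document and both document ids are invented.

```python
import re
from collections import defaultdict

# Stand-in for the linguistic modules: a hand-written table, not a real
# morphological analyzer.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split raw text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Linguistic modules: lowercase, then map known forms to a base form."""
    token = token.lower()
    return NORMALIZE.get(token, token)


def build_index(docs):
    """Indexer: map each term to a sorted postings list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in sorted(terms):
            index[term].append(doc_id)
    return dict(index)


# Document 1 is the slide's example; document 2 is hypothetical.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
print(index)
```

Looking up a term is then a single dictionary access instead of a scan over every document.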

33 Document indexing
Goal = Find the important meanings and create an internal representation
Factors to consider:
- Accuracy to represent meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for computer to manipulate
What is the best representation of contents?
- Char. string (char trigrams): not precise enough
- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise
[Figure: String, Word, Phrase, Concept plotted against Coverage (Recall) and Accuracy (Precision)]

34 Parsing a document
Sec. 2.1
- What format is it in? pdf/word/excel/html?
- What language is it in?
- What character set is in use?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …

35 Complications: Format/language
Sec. 2.1
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)
Nontrivial issues. Requires some design decisions.


37 Stopwords / Stoplist
Function words do not bear useful information for IR: of, in, about, with, I, although, …
Stoplist: contains stopwords, which are not to be used as index terms
- Prepositions
- Articles
- Pronouns
- Some adverbs and adjectives
- Some frequent words (e.g. document)
The removal of stopwords usually improves IR effectiveness
A few “standard” stoplists are commonly used.

38 Stop words
With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
- They have little semantic content: the, a, and, to, be
- There are a lot of them: ~30% of postings for top 30 words
But the trend is away from doing this:
- Good compression techniques mean the space for including stop words in a system is very small
- Good query optimization techniques mean you pay little at query time for including stop words
- You need them for:
  - Phrase queries: “King of Denmark”
  - Various song titles, etc.: “Let it be”, “To be or not to be”
  - “Relational” queries: “flights to London”
Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
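The sketch below shows why dropping stop words can hurt the queries listed above. The stop list is a small hypothetical one assembled from the example words on these slides plus a few assumed additions ("or", "not", "it"); it is not one of the standard lists.

```python
# Hypothetical stop list: slide examples plus a few assumed extras.
STOP_WORDS = {
    "the", "a", "and", "to", "be", "of", "in", "about", "with", "i",
    "although", "or", "not", "it",
}


def remove_stop_words(query):
    """Lowercase the query and drop every token found in the stop list."""
    return [tok for tok in query.lower().split() if tok not in STOP_WORDS]


for query in ["King of Denmark", "To be or not to be", "flights to London"]:
    print(query, "->", remove_stop_words(query))
```

With this list, "To be or not to be" loses every term, and "flights to London" loses the word that tells the direction of travel apart from "flights from London".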

39 Stemming
Reason: Different word forms may bear similar meaning (e.g. search, searching); create a “standard” representation for them
Stemming: Removing some endings of words
- computer, compute, computes, computing, computed, computation → comput
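A deliberately crude suffix-stripping sketch that reproduces the example above. The suffix list and the minimum stem length are assumptions chosen for this illustration; this is not a real stemming algorithm such as Porter's, and it will mis-stem many ordinary words.

```python
# Assumed suffix list, ordered so that longer endings are tried first.
SUFFIXES = ("ation", "ing", "er", "es", "ed", "e", "s")
MIN_STEM_LENGTH = 3  # assumed guard so very short words are left alone


def crude_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


forms = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({crude_stem(w) for w in forms})  # all six forms collapse to one stem
```

Because every form maps to the same index term, a query for "computing" can match a document that only says "computed".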

40 Stemming
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
- language dependent
- e.g., automate(s), automatic, automation all reduced to automat.
Example of stemmed text:
- Original: for example compressed and compression are both accepted as equivalent to compress.
- Stemmed: for exampl compress and compress ar both accept as equival to compress
