Information Retrieval Techniques MS(CS) Lecture 2 AIR UNIVERSITY MULTAN CAMPUS
Issues and Challenges in IR RELEVANCE ?
Issues and Challenges in IR Query Formulation – Describing information need Relevance – Relevant to query (system relevancy) – Relevant to information need (User relevancy) Evaluation – System oriented (Bypass User) – User Oriented (Relevance Feedback)
What makes IR “experimental”? Evaluation – How do we design experiments that answer our questions? – How do we assess the quality of the documents that come out of the IR black box? – Can we do this automatically?
Simplification? [Diagram of the search process: Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery. Feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection.] Is this itself a vast simplification?
The Central Problem in IR [Diagram: Information Seeker → Concepts → Query Terms; Authors → Concepts → Document Terms] Do these represent the same concepts?
Problems in Query Formulation Stefano Mizzaro’s Model of Relevance in IR – RIN: Real Information Need (Target) – PIN: Perceived Information Need (Mentality) – EIN: Expressed Information Need (Natural Language) – FIN: Formal Information Need (Query) Paper reference: 4 dimensions of Relevance by Stephen W. Draper
Taylor’s Model – The visceral need (Q1): the actual, but unexpressed, need for information – The conscious need (Q2): the conscious within-brain description of the need – The formalized need (Q3): the formal statement of the question – The compromised need (Q4): the question as presented to the information system Robert S. Taylor. (1962) The Process of Asking Questions. American Documentation, 13(4),
Taylor’s Model and IR Systems [Diagram: Visceral need (Q1) → Conscious need (Q2) → Formalized need (Q3) → Compromised need (Q4) → IR System → Results. Labels: naïve users; Question Negotiation]
The classic search model [Diagram: User task (“Get rid of mice in a politically correct way”) → Info need (“Info about removing mice without killing them”) → Query (“how trap mice alive”) → Search engine over the Collection → Results → Query refinement, back to Query. Labels: Misconception? (between user task and info need); Misformulation? (between info need and query)]
Building Blocks of IRS-I Different models of information retrieval – Boolean model – Vector space model – Language models Representing the meaning of documents – How do we capture the meaning of documents? – Is meaning just the sum of all terms? Indexing – How do we actually store all those words? – How do we access indexed terms quickly?
Building Blocks of IRS-II Relevance Feedback – How do humans (and machines) modify queries based on retrieved results? User Interaction – Information retrieval meets computer-human interaction – How do we present search results to users in an effective manner? – What tools can systems provide to aid the user in information seeking?
IR Extensions Filtering and Categorization – Traditional information retrieval: static collection, dynamic queries – What about static queries against dynamic collections? Multimedia Retrieval – Thus far, we’ve been focused on text… – What about images, sounds, video, etc.? Question Answering – We want answers, not just documents!
Can you guess which kind of data IR mainly focuses on? – Structured – Unstructured – Semi-structured
What about databases? What are examples of databases? – Banks storing account information – Retailers storing inventories – Universities storing student grades What exactly is a (relational) database? – Think of them as a collection of tables – They model some aspect of “the world”
A (Simple) Database Example [Figure: Student Table, Department Table, Course Table, Enrollment Table]
IR vs. databases: Structured vs unstructured data Structured data tends to refer to information in “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith
Ivy       Smith
Typically allows numerical range and exact match (for text) queries, e.g., Salary < AND Manager = Smith.
Database Queries What would you want to know from a database? – What classes is John Arrow enrolled in? – Who has the highest grade in LBSC 690? – Who’s in the history department? – Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest address?
Unstructured data Typically refers to free text Allows – Keyword queries including operators – More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse Classic model for searching text documents
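Keyword queries with operators can be sketched with plain set operations. The three documents below are made up for illustration, and `postings` is a hypothetical helper that scans the texts directly instead of using a real index.

```python
# Hypothetical mini-collection; the texts are not from the lecture.
docs = {
    1: "drug abuse prevention programs",
    2: "prescription drug prices",
    3: "substance abuse counselling",
}

def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

both = postings("drug") & postings("abuse")       # drug AND abuse
either = postings("drug") | postings("abuse")     # drug OR abuse
drug_only = postings("drug") - postings("abuse")  # drug AND NOT abuse

print(both, either, drug_only)
```

Document 3 is arguably about drug abuse but never uses the word "drug", so the keyword query `drug AND abuse` misses it. That gap is what the more sophisticated “concept” queries on the slide are meant to address.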
Semi-structured data In fact almost no data is “unstructured” E.g., this slide has distinctly identified zones such as the Title and Bullets … to say nothing of linguistic structure Facilitates “semi-structured” search such as – Title contains data AND Bullets contain search Or even – Title is about Object Oriented Programming AND Author something like stro*rup, where * is the wild-card operator
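The wild-card part of the query above can be sketched with Python's standard `fnmatch` module, where `*` matches any run of characters. The list of author surnames is invented for the example, and `wildcard_match` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Hypothetical author surnames; not from the lecture.
authors = ["stroustrup", "strostrup", "strup", "knuth"]

def wildcard_match(pattern, terms):
    """Return the terms matching a pattern in which * stands for any run of characters."""
    return [term for term in terms if fnmatchcase(term, pattern)]

matches = wildcard_match("stro*rup", authors)
print(matches)
```

A real system would normally answer such queries from a dedicated index over the vocabulary rather than by scanning every term; the scan here only shows what the operator means.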
Databases vs. IR
What we’re retrieving – IR: Mostly unstructured. Free text with some metadata. Databases: Structured data. Clear semantics based on a formal model.
Queries we’re posing – IR: Vague, imprecise information needs (often expressed in natural language). Databases: Formally (mathematically) defined queries. Unambiguous.
Results we get – IR: Sometimes relevant, often not. Databases: Exact. Always correct in a formal sense.
Interaction with system – IR: Interaction is important. Databases: One-shot queries.
Other issues – IR: Issues downplayed. Databases: Concurrency, recovery, atomicity are all critical.
IRS IN ACTION (TASKS) Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan
Outline What is the IR problem? How to organize an IR system? (Or the main processes in IR) Indexing Retrieval
The problem of IR Goal = find documents relevant to an information need from a large document set [Diagram: Info. need → Query → IR system (Retrieval over the Document collection) → Answer list]
IR problem First applications: in libraries (1950s) ISBN: Author: Salton, Gerard Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer Publisher: Addison-Wesley Date: 1989 Content: external attributes and internal attribute (content) Search by external attributes = Search in DB IR: search by content
Possible approaches 1. String matching (linear search in documents) – Slow – Difficult to improve 2. Indexing (*) – Fast – Flexible to further improvement
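The two approaches can be contrasted in a short sketch. The three-document collection and the function names are assumptions made for illustration: the linear scan reads every document for each query, while the index is built once and then answers a query with a single lookup.

```python
from collections import defaultdict

# Hypothetical mini-collection; the texts are not from the lecture.
docs = {
    1: "automatic text processing",
    2: "retrieval of information by computer",
    3: "text retrieval by content",
}

def linear_search(term, docs):
    """Approach 1: scan every document for the term at query time."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def build_index(docs):
    """Approach 2: map each term to the set of documents containing it, built once."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def index_search(term, index):
    """Answer a query with one dictionary lookup instead of a scan."""
    return sorted(index.get(term, set()))

index = build_index(docs)
print(linear_search("retrieval", docs), index_search("retrieval", index))
```

Both functions return the same answer; the difference is where the work happens. The index also gives a natural place to add later improvements such as term normalization or weighting, which is one reading of "flexible to further improvement".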
Main problems in IR Document and query indexing – How to best represent their contents? Query evaluation (or retrieval process) – To what extent does a document correspond to a query? System evaluation – How good is a system? – Are the retrieved documents relevant? (precision) – Are all the relevant documents retrieved? (recall)
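The two evaluation questions correspond to two ratios: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch, with made-up document ids and a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returns 4 documents, 3 documents are truly relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3). Returning 0.0 for an empty set is a convention chosen for this sketch.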
The basic indexing pipeline Documents to be indexed: “Friends, Romans, countrymen.” → Tokenizer → Token stream: Friends, Romans, Countrymen → Linguistic modules → Modified tokens: friend, roman, countryman → Indexer → Inverted index: friend, roman, countryman
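The three stages can be sketched as three small functions. The normalization rules here are assumptions made only to reproduce the slide's example: lowercasing, dropping a trailing "s", and a tiny hand-written table (`IRREGULAR`, hypothetical) for irregular plurals. Real linguistic modules are considerably more involved.

```python
import re
from collections import defaultdict

# Hypothetical lookup table for irregular forms; a stand-in for real linguistic modules.
IRREGULAR = {"countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic module: lowercase, then reduce simple plurals to the singular."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def index_documents(docs):
    """Indexer: build an inverted index mapping each term to a sorted list of doc ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(token) for token in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)

docs = {1: "Friends, Romans, countrymen."}
index = index_documents(docs)
print(tokenize(docs[1]), index)
```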
Document indexing Goal = Find the important meanings and create an internal representation Factors to consider: – Accuracy to represent meanings (semantics) – Exhaustiveness (cover all the contents) – Facility for computer to manipulate What is the best representation of contents? – Char. string (char trigrams): not precise enough – Word: good coverage, not precise – Phrase: poor coverage, more precise – Concept: poor coverage, precise [Figure: String, Word, Phrase, Concept plotted by Coverage (Recall) against Accuracy (Precision)]
Parsing a document What format is it in? – pdf/word/excel/html? What language is it in? What character set is in use? Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically …
Complications: Format/language Documents being indexed can include docs from many different languages – A single index may have to contain terms of several languages. Sometimes a document or its components can contain multiple languages/formats – French email with a German pdf attachment. What is a unit document? – A file? – An email? (Perhaps one of many in an mbox.) – An email with 5 attachments? – A group of files (PPT or LaTeX as HTML pages)
HOW TO CONSTRUCT INDEX OF TERMS?
Stopwords / Stoplist Function words do not bear useful information for IR: of, in, about, with, I, although, … Stoplist: contains stopwords, which are not to be used as index terms – Prepositions – Articles – Pronouns – Some adverbs and adjectives – Some frequent words (e.g. document) The removal of stopwords usually improves IR effectiveness A few “standard” stoplists are commonly used.
Stop words With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: – They have little semantic content: the, a, and, to, be – There are a lot of them: ~30% of postings for top 30 words But the trend is away from doing this: – Good compression techniques mean the space for including stop words in a system is very small – Good query optimization techniques mean you pay little at query time for including stop words. – You need them for: Phrase queries: “King of Denmark” Various song titles, etc.: “Let it be”, “To be or not to be” “Relational” queries: “flights to London”
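A short sketch shows why the example queries above suffer once stop words are dropped. The stop list is a small illustrative set chosen for this example, not one of the “standard” lists, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative stop list; an assumption for this sketch, not a standard list.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it", "in"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop any stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

relational = remove_stop_words("flights to London")
phrase = remove_stop_words("King of Denmark")
title = remove_stop_words("To be or not to be")
print(relational, phrase, title)
```

With this list, "flights to London" can no longer be told apart from "flights from London", the phrase "King of Denmark" loses its connecting word, and "To be or not to be" becomes an empty query.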
Stemming Reason: – Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them Stemming: – Removing some endings of word: computer, compute, computes, computing, computed, computation → comput
Stemming Reduce terms to their “roots” before indexing “Stemming” suggests crude affix chopping – language dependent – e.g., automate(s), automatic, automation all reduced to automat. Before stemming: “for example compressed and compression are both accepted as equivalent to compress.” After stemming: “for exampl compress and compress ar both accept as equival to compress”
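Crude affix chopping can be sketched as stripping the first matching suffix from a short list. The suffix list and the minimum stem length are assumptions picked to handle the "comput" family from the earlier slide; this is not the stemmer that produced the slide's output, and real stemmers such as Porter's apply far more careful rules.

```python
# Assumed suffix list, longest first; chosen for illustration only.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

forms = ["computer", "compute", "computes", "computing", "computed", "computation"]
stems = {crude_stem(form) for form in forms}
print(stems)
```

All six forms collapse to the single stem "comput", so a query for any one of them matches documents containing the others. The same rules behave badly on other words (for example, they would cut "automation" down to "autom"), which is why stemming rules are tuned per language.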