Issues and Challenges in IR
- Query formulation: describing the information need
- Relevance
  - Relevant to the query (system relevance)
  - Relevant to the information need (user relevance)
- Evaluation
  - System oriented (bypasses the user)
  - User oriented (relevance feedback)
What Makes IR "Experimental"? Evaluation
- How do we design experiments that answer our questions?
- How do we assess the quality of the documents that come out of the IR "black box"?
- Can we do this automatically?
Simplification?
- The search process as a pipeline: Source Selection -> Query Formulation -> Search -> Selection -> Examination -> Delivery (passing along a resource, then a query, then a ranked list, then documents)
- Feedback loops: source reselection back to Source Selection; query reformulation, vocabulary learning, and relevance feedback back to Query Formulation
- Is this itself a vast simplification?
The Central Problem in IR
- The information seeker's concepts become query terms; the authors' concepts become document terms
- Do these represent the same concepts?
- Why is IR hard? Because language is hard!
Problems in Query Formulation: Stefano Mizzaro's model of relevance in IR
- RIN: Real Information Need (the target)
- PIN: Perceived Information Need (in the user's mind)
- EIN: Expressed Information Need (natural language)
- FIN: Formal Information Need (the query)
Paper reference: "4 Dimensions of Relevance" by Stephen W. Draper
Taylor's Model
- The visceral need (Q1): the actual, but unexpressed, need for information
- The conscious need (Q2): the conscious, within-brain description of the need
- The formalized need (Q3): the formal statement of the question
- The compromised need (Q4): the question as presented to the information system
Robert S. Taylor (1962). The Process of Asking Questions. American Documentation, 13(4).
Taylor's Model and IR Systems
- The visceral need (Q1) becomes a conscious need (Q2); question negotiation (especially important for naïve users) turns it into a formalized need (Q3) and then a compromised need (Q4), which is submitted to the IR system to produce results
The Classic Search Model
- User task: get rid of mice in a politically correct way (misconception?)
- Info need: info about removing mice without killing them (misformulation?)
- Query: "how trap mice alive"
- The query goes to the search engine, which searches the collection and returns results; the results may prompt query refinement
Building Blocks of IRS - I
- Different models of information retrieval (see the vector space sketch below)
  - Boolean model
  - Vector space model
  - Language models
- Representing the meaning of documents
  - How do we capture the meaning of documents?
  - Is meaning just the sum of all terms?
- Indexing
  - How do we actually store all those words?
  - How do we access indexed terms quickly?
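To make the vector space model bullet concrete, here is a minimal sketch (a toy illustration, not any particular system; the documents and query are made up): documents and the query become raw term-frequency vectors, and documents are ranked by cosine similarity to the query.

```python
import math
from collections import Counter

def tf_vector(text):
    # Naive whitespace tokenization and lowercasing; real systems do much more.
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "databases store structured tables of records",
}
query = tf_vector("relevant information")
ranking = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranking)  # d1 ranks above d2
```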
Building Blocks of IRS - II
- Relevance feedback: how do humans (and machines) modify queries based on retrieved results? (see the sketch below)
- User interaction: information retrieval meets human-computer interaction
  - How do we present search results to users in an effective manner?
  - What tools can systems provide to aid the user in information seeking?
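One classic way a machine modifies a query from feedback is the Rocchio formulation in the vector space setting. The sketch below is illustrative only: the alpha/beta/gamma weights and the example vectors are assumptions, not values from the slides.

```python
from collections import Counter

def centroid(vectors):
    # Average term weights over a set of document vectors.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()} if vectors else {}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward relevant documents and away from non-relevant ones.
    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel_c) | set(nonrel_c)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0) + beta * rel_c.get(t, 0) - gamma * nonrel_c.get(t, 0)
        if w > 0:  # negative weights are usually dropped
            new_q[t] = w
    return new_q

q = {"mice": 1, "trap": 1}
relevant = [{"mice": 2, "humane": 1, "trap": 1}]
nonrelevant = [{"mice": 1, "poison": 2}]
print(rocchio(q, relevant, nonrelevant))  # "humane" enters the query; "poison" is suppressed
```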
IR Extensions
- Filtering and categorization: traditional IR assumes a static collection and dynamic queries; what about static queries against dynamic collections?
- Multimedia retrieval: thus far we've focused on text; what about images, sounds, video, etc.?
- Question answering: we want answers, not just documents!
Can you guess which kind of data IR mainly focuses on?
- Structured
- Unstructured
- Semi-structured
What About Databases?
- Examples of databases: banks storing account information; retailers storing inventories; universities storing student grades
- What exactly is a (relational) database?
  - Think of it as a collection of tables
  - The tables model some aspect of "the world"
A (Simple) Database Example
- Student table, Department table, Course table, Enrollment table
IR vs. Databases: Structured vs. Unstructured Data
- Structured data tends to refer to information in "tables":

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

- Typically allows numerical range and exact-match (for text) queries, e.g., Salary < ... AND Manager = Smith.
Database Queries
- What would you want to know from a database?
  - What classes is John Arrow enrolled in? (sketched in SQL below)
  - Who has the highest grade in LBSC 690?
  - Who's in the history department?
  - Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters who were born on a Monday, who has the longest address?
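As a sketch of the first question, here is how it might be expressed against a relational database. The schema, column names, and sample rows below are hypothetical; they simply mirror the Student/Course/Enrollment tables from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student    (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Course     (cid TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE Enrollment (sid INTEGER, cid TEXT);
    INSERT INTO Student    VALUES (1, 'John Arrow'), (2, 'Jane Doe');
    INSERT INTO Course     VALUES ('LBSC 690', 'Course A'), ('LBSC 670', 'Course B');
    INSERT INTO Enrollment VALUES (1, 'LBSC 690'), (2, 'LBSC 670');
""")

# The query is formally defined and has a single, unambiguous answer.
rows = conn.execute("""
    SELECT Course.cid, Course.title
    FROM Student
    JOIN Enrollment ON Enrollment.sid = Student.sid
    JOIN Course     ON Course.cid     = Enrollment.cid
    WHERE Student.name = 'John Arrow'
""").fetchall()
print(rows)  # [('LBSC 690', 'Course A')]
```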
Unstructured Data
- Typically refers to free text
- Allows:
  - Keyword queries, including operators
  - More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
- This is the classic model for searching text documents
Semi-Structured Data
- In fact, almost no data is truly "unstructured"
- E.g., this slide has distinctly identified zones such as the Title and Bullets, to say nothing of linguistic structure
- This facilitates "semi-structured" search such as:
  - Title contains "data" AND Bullets contain "search"
- Or even:
  - Title is about Object Oriented Programming AND Author something like stro*rup, where * is the wildcard operator
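A toy illustration of such a fielded query with a wildcard (the documents and field names here are invented; Python's fnmatch supplies the wildcard matching for the author pattern "stro*rup"):

```python
from fnmatch import fnmatch

docs = [
    {"title": "Object Oriented Programming in C++", "author": "Stroustrup", "bullets": "classes inheritance"},
    {"title": "Data Structures and Search", "author": "Knuth", "bullets": "search trees indexing"},
]

def field_search(docs, title_pattern, author_pattern):
    # Match each field against its own pattern, case-insensitively.
    return [d for d in docs
            if fnmatch(d["title"].lower(), title_pattern.lower())
            and fnmatch(d["author"].lower(), author_pattern.lower())]

# Title is about Object Oriented Programming AND Author something like stro*rup
print(field_search(docs, "*object oriented programming*", "stro*rup"))
```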
Databases vs. IR
- What we're retrieving
  - IR: mostly unstructured free text with some metadata
  - Databases: structured data with clear semantics based on a formal model
- Queries we're posing
  - IR: vague, imprecise information needs (often expressed in natural language)
  - Databases: formally (mathematically) defined queries; unambiguous
- Results we get
  - IR: sometimes relevant, often not
  - Databases: exact; always correct in a formal sense
- Interaction with the system
  - IR: interaction is important
  - Databases: one-shot queries
- Other issues
  - IR: downplayed
  - Databases: concurrency, recovery, and atomicity are all critical
IRS in Action (Tasks)
- Information Retrieval and Web Search (Pandu Nayak and Prabhakar Raghavan)
Outline
- What is the IR problem?
- How do we organize an IR system? (i.e., the main processes in IR)
  - Indexing
  - Retrieval
The Problem of IR
- Goal: find documents relevant to an information need from a large document set
- An information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list
The IR Problem
- First applications: in libraries (1950s). Example catalogue record:
  - ISBN:
  - Author: Salton, Gerard
  - Title: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer
  - Publisher: Addison-Wesley
  - Date: 1989
  - Content: <text>
- A record has external attributes and an internal attribute (the content)
- Search by external attributes = search in a DB
- IR = search by content
Possible Approaches
1. String matching (linear search through the documents)
   - Slow
   - Difficult to improve
2. Indexing (*) (contrasted in the sketch below)
   - Fast
   - Flexible, allowing further improvements
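The two approaches can be contrasted on a toy collection (illustrative only): string matching scans every document for every query, while indexing builds an inverted index once and then answers queries by dictionary lookup.

```python
from collections import defaultdict

docs = {
    1: "automatic text processing by computer",
    2: "information retrieval in digital libraries",
    3: "text retrieval and indexing",
}

# 1. String matching: linear search through all documents for each query.
def linear_search(term):
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

# 2. Indexing: build the inverted index once ...
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# ... then each query is a fast lookup instead of a full scan.
def index_search(term):
    return sorted(index.get(term, set()))

print(linear_search("retrieval"))  # [2, 3]
print(index_search("retrieval"))   # [2, 3]
```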
Main Problems in IR
- Document and query indexing: how do we best represent their contents?
- Query evaluation (the retrieval process): to what extent does a document correspond to a query?
- System evaluation: how good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
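Precision and recall on a small made-up example (the document IDs and judgments are invented): precision is the fraction of retrieved documents that are relevant, recall the fraction of relevant documents that were retrieved.

```python
retrieved = {"d1", "d2", "d3", "d4"}
relevant  = {"d1", "d3", "d7"}

hits = retrieved & relevant                 # relevant documents that were retrieved
precision = len(hits) / len(retrieved)      # 2 / 4 = 0.5
recall    = len(hits) / len(relevant)       # 2 / 3 ~= 0.67
print(precision, recall)
```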
Document Indexing
- Goal: find the important meanings and create an internal representation
- Factors to consider:
  - Accuracy in representing meaning (semantics)
  - Exhaustiveness (covering all the contents)
  - Ease of manipulation by computer
- What is the best representation of contents? (see the sketch below)
  - Character strings (char trigrams): not precise enough
  - Words: good coverage, not precise
  - Phrases: poorer coverage, more precise
  - Concepts: poor coverage, precise
- Moving from strings to words to phrases to concepts trades coverage (recall) for accuracy (precision)
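Two of these representations on a single sentence, as a toy sketch: character trigrams give broad coverage but little precision, while word tokens (and simple two-word phrases) are more precise but cover less.

```python
text = "automatic text processing"

chars = text.replace(" ", "_")
trigrams = [chars[i:i + 3] for i in range(len(chars) - 2)]   # 'aut', 'uto', 'tom', ...

words = text.split()                                          # ['automatic', 'text', 'processing']
phrases = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]  # ['automatic text', 'text processing']

print(trigrams[:5], words, phrases)
```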
Parsing a Document
- What format is it in? (PDF / Word / Excel / HTML?)
- What language is it in?
- What character set is in use?
- Each of these is a classification problem, which we will study later in the course
- But these tasks are often done heuristically … (see the sketch below)
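A sketch of the kind of heuristic format detection alluded to above (an illustration, not a production detector; the function name and path are made up): look at the first bytes of the file. PDF files begin with "%PDF", ZIP-based formats such as .docx/.xlsx begin with "PK", and HTML usually begins with markup.

```python
def sniff_format(path):
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK"):
        return "zip-based (docx/xlsx/pptx, ...)"
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown (fall back to other heuristics)"

# Example call (the path is hypothetical):
# print(sniff_format("report.pdf"))
```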
Complications: Format/Language
- Documents being indexed can come from many different languages
  - A single index may have to contain terms from several languages
- Sometimes a document or its components can contain multiple languages/formats
  - e.g., a French email with a German PDF attachment
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (e.g., a PPT deck or a LaTeX document rendered as HTML pages)?
- These are nontrivial issues that require design decisions
Stopwords / Stoplist
- Function words do not carry useful information for IR: of, in, about, with, I, although, …
- Stoplist: a list of stopwords that are not to be used as index terms
  - Prepositions
  - Articles
  - Pronouns
  - Some adverbs and adjectives
  - Some frequent words (e.g., "document")
- Removing stopwords usually improves IR effectiveness
- A few "standard" stoplists are commonly used
Stop Words
- With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for the top 30 words
- But the trend is away from doing this:
  - Good compression techniques mean the space needed to include stop words in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"
- Nevertheless: "Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results." (Though you can explicitly ask for them to remain.)
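A minimal stopword-removal sketch; the stoplist below is a tiny made-up sample, not one of the "standard" lists mentioned above.

```python
STOPLIST = {"the", "a", "an", "of", "in", "to", "and", "with", "is", "be"}

def remove_stopwords(text):
    # Keep only tokens that are not on the stoplist.
    return [t for t in text.lower().split() if t not in STOPLIST]

print(remove_stopwords("The removal of stopwords usually improves the effectiveness of IR"))
# ['removal', 'stopwords', 'usually', 'improves', 'effectiveness', 'ir']
```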
Stemming
- Reason: different word forms may carry similar meaning (e.g., search, searching); create a "standard" representation for them
- Stemming: removing some word endings
  - computer, compute, computes, computing, computed, computation -> comput
Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping (see the sketch below)
  - Language dependent
  - e.g., automate(s), automatic, automation all reduced to automat
- Example: "for example compressed and compression are both accepted as equivalent to compress" becomes "for exampl compress and compress ar both accept as equival to compress"
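A deliberately crude suffix-chopping stemmer, just to illustrate the idea of affix chopping on the "comput" example above; the suffix list is an assumption for this toy, and real systems use the Porter stemmer or similar.

```python
SUFFIXES = ["ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s", "e"]

def crude_stem(word):
    # Strip the first matching suffix, as long as a short stem remains.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["computer", "compute", "computes", "computing", "computed", "computation"]:
    print(w, "->", crude_stem(w))   # all reduce to "comput"
```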