1 Information Retrieval
2 What is IR? IR is concerned with the representation, storage, organization, and accessing of information items. [Salton] Information includes text, audio, images, …. For simplicity, we consider text only: text information retrieval. [Figure labels: Information Space, User Request, Documents, Language]
3 The role of databases Databases hold specific data items Organization is explicit Keys relate items to each other Queries are constrained, but effective in retrieving the data that is there Databases generally respond to specific queries with specific results Searching for items not anticipated by the designers can be difficult
4 Information vs. Information sources User needs information Distinguish data, information, knowledge Information sources Very well organized, indexed, controlled Totally unorganized, uncharacterized, uncontrolled Something in between Connect the two in a way that matches information needs to information available.
5 The Web Extreme opposite of a database No organization, no overall structure, no index or key to the content Searching and browsing are supported, but generally are not complete. (You will not know if you got every good response to your request. You may be able to tell that you got the response that meets your need, but may not know if you got the best response available.) Each HTML page is considered as a document
6 Information Retrieval vs. Data Retrieval Data retrieval consists of determining which documents of the collection contain the keywords in the user query. Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query while retrieving as few non-relevant documents as possible.
7 A General Text-Information Retrieval Model An information retrieval model is a quadruple [D, Q, F, R(qi, dj)], where D is a set composed of logical views (or representations) for the documents in the collection; Q is a set composed of logical views (or representations) for the user information needs, called “queries”; F is a framework for modeling document representations, queries, and their relationships; R(qi, dj) is a ranking function which associates a real number with a query qi in Q and a document representation dj in D. (It is a similarity measure that maps a query to the documents most similar to that query.)
8 Retrieval models Probabilistic IR (Bayesian, Naïve Bayes): compute the probability of relevance of a document to a given query. Statistical IR (vector space, concept space). Machine-learning-based techniques (extracting knowledge or identifying patterns): symbolic learning (ID3); neural networks (anywhere that is required); evolution-based algorithms (for adapting F as the matching function). The effectiveness of an IR system depends on the ability of the document representation to capture the “meaning” of the documents with respect to the users’ needs.
9 Text retrieval: Overall Architecture [Figure labels: Users, Queries (Q), Relevance Feedback, Matching Algorithm (R), Document Representation (D), Documents, Retrieved Documents, F; the diagram is divided into a User Side and an Information Space side]
10 Preparing queries and documents Convert file format. Text segmentation. Term extraction Stemming, eliminating stop words Term weighting Phrase construction Storing indexed documents Similar stages for preparing queries, but instead of storage it passes to matching algorithm.
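The preparation stages listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, `crude_stem` is a toy suffix stripper (not the Porter algorithm or any stemmer the slides refer to), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "it", "of", "and", "is", "in", "to", "for"}


def tokenize(text):
    """Text segmentation: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Term extraction, stop-word removal, stemming, raw term counts."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)


doc_terms = index_terms(
    "The indexing of documents and the ranking of retrieved documents"
)
print(doc_terms)
```

The same function could be applied to a query; instead of being stored, the resulting term counts would be handed to the matching algorithm.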
11 The Retrieval Process [Figure. Components: User Interface, Text Operations, Query Operations, Searching, Indexing, DB Manager Module, Index, Text Database, Ranking. Flow labels: User’s need, Text, Logical view, Query, User’s feedback, Inverted file, Retrieved Docs, Ranked Docs]
12 Vector space The vector space model was introduced in the 1960s (Salton, Cornell Univ., SMART system). Dj = (Wj1, Wj2, …, Wjt); if the kth term doesn’t exist in the document then Wjk = 0. Qj = (wj1, wj2, …, wjt). The result is a sparse term-document matrix. [Figure labels: Doc, Term, Query]
13 Term Weighting Local: Binary (1), TF, Log, LOGN, ATF1. Global: None (1), IDF, Entropy, IDFB, IDFP. Normalization: None (1), Cosine, PUQN. [Table formulas not recovered; surviving fragment: if f_ij == 0 then 0 else]
14 Vector space graphical representation Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3. [Figure: D1, D2, and Q drawn in the space with axes T1, T2, T3] Is D1 or D2 more similar to Q? How to measure the degree of similarity? Distance? Angle? Projection?
15 Similarity Measure (Matching Function) F Similarity between document D_i and query Q can be computed as the inner vector product: $sim(D_i, Q) = \sum_{k=1}^{t} d_{ik} q_k$, where d_ik is the weight of term k in document i and q_k is the weight of term k in the query. Binary example over the terms (retrieval, database, architecture, computer, text, management, information): D = 1, 1, 1, 0, 1, 1, 0 Q = 1, 0, 1, 0, 0, 1, 1 sim(D, Q) = 3
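17 Similarity Measures Each measure has a form for keyword sets and a form for weighted vectors. Inner product: $|d_i \cap q|$ for sets; $\sum_k d_{ik} q_k$ for vectors. Cosine: $\frac{|d_i \cap q|}{\sqrt{|d_i|}\,\sqrt{|q|}}$ for sets; $\frac{\sum_k d_{ik} q_k}{\sqrt{\sum_k d_{ik}^2}\,\sqrt{\sum_k q_k^2}}$ for vectors. Jaccard: $\frac{|d_i \cap q|}{|d_i \cup q|}$ for sets; $\frac{\sum_k d_{ik} q_k}{\sum_k d_{ik}^2 + \sum_k q_k^2 - \sum_k d_{ik} q_k}$ for vectors.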
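To make the comparison concrete, here is a small Python sketch that scores the example vectors D1, D2, and Q from the vector-space slide. It assumes the usual vector forms of the three measures (inner product, cosine, and the extended Jaccard coefficient); the function names are illustrative only.

```python
import math


def inner_product(d, q):
    """Sum of pairwise products of term weights."""
    return sum(dk * qk for dk, qk in zip(d, q))


def cosine(d, q):
    """Inner product normalized by both vector lengths."""
    norm = math.sqrt(inner_product(d, d)) * math.sqrt(inner_product(q, q))
    return inner_product(d, q) / norm if norm else 0.0


def jaccard(d, q):
    """Extended (vector) Jaccard coefficient."""
    dot = inner_product(d, q)
    denom = inner_product(d, d) + inner_product(q, q) - dot
    return dot / denom if denom else 0.0


# Example vectors from the vector-space slide.
D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
for name, d in (("D1", D1), ("D2", D2)):
    print(name, inner_product(d, Q), round(cosine(d, Q), 3), round(jaccard(d, Q), 3))

# Binary example from the matching-function slide.
binary_sim = inner_product((1, 1, 1, 0, 1, 1, 0), (1, 0, 1, 0, 0, 1, 1))
```

Under all three measures D1 scores higher than D2 for this query, because D1 puts most of its weight on T3, the only term the query uses.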
18 Comments on Vector Space Models Simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections.
19 Problems with Vector Space There is no real theoretical basis for the assumption of a term space: it is more for visualization than having any real basis, and most similarity measures work about the same regardless of model. Terms are not really orthogonal dimensions: terms are not independent of all other terms.
20 Semantic IR Different vocabularies for users and authors (or indexers). Polysemy problem (words having multiple meanings). Synonymy problem (multiple words having the same meaning). Using a dictionary of synonyms and polysemous words inside the IR system. Latent Semantic Indexing (LSI): uses singular value decomposition (SVD); identifies the correlation between terms by means of singular values (e.g. car and auto alongside gasoline,… in different docs). SVD provides a solution to this, and in doing so it: captures all the information in the original array, without loss; reduces the size of the matrix to operate on (deals with the non-sparse parts); places similar elements closer to each other; allows the reconstruction of the original matrix, with some loss of precision.
21 Information Retrieval Systems Information retrieval (IR) systems use a simpler data model than database systems Information organized as a collection of documents Documents are unstructured, no schema Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents e.g., find documents containing the words “database systems” Can be used even on textual descriptions provided with non-textual data such as images IR on Web documents has become extremely important E.g. Google, AltaVista, …
22 Information Retrieval Systems (Cont.) Differences from database systems IR systems don’t deal with transactional updates (including concurrency control and recovery) Database systems deal with structured data, with schemas that define the data organization IR systems deal with some querying issues not generally addressed by database systems Approximate searching by keywords Ranking of retrieved answers by estimated degree of relevance
23 Query Modification Process F: accepts relevance judgements from the user and produces as output sets of relevant and nonrelevant documents G: implements the feedback formula (for rewriting the original query) [Figure labels: Original query Q, Retrieval Process, Ranked output, Relevancy judgement, F, Rel. & nonrel. documents, G, Reformulated query Q’]
24 The Effect of Relevance Feedback [Figure: scatter of relevant and nonrelevant documents, showing the original query, which retrieved five documents, and the reformulated query]
25 The Basic Idea of Query Modification Terms that occur in relevant documents are added to the original query vectors, or the weight of such terms is increased by an appropriate factor in constructing the new query statements Terms occurring in nonrelevant documents are deleted from the original query statements, or the weight of such terms is appropriately reduced
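One common way to turn this idea into a formula is a Rocchio-style update; the slides do not name a specific formula, so the sketch below is an assumption. Query and documents are term-to-weight dictionaries, and the coefficients `alpha`, `beta`, and `gamma` are illustrative defaults, not values from the source.

```python
def reformulate(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query update (assumed formula, see lead-in).

    new weight = alpha * original weight
               + beta  * mean weight in relevant documents
               - gamma * mean weight in nonrelevant documents
    Terms whose weight drops to zero or below are removed from the query.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    new_query = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:
            new_query[term] = weight
    return new_query


# Hypothetical feedback round: the user wants the car, not the animal.
q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "car": 2.0}]
nonrel = [{"jaguar": 1.0, "cat": 3.0}]
q2 = reformulate(q, rel, nonrel)
print(q2)
```

Here "car" enters the query because it occurs in a relevant document, while "cat" ends with a negative weight and is dropped.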
26 Keyword Search In full text retrieval, all the words in each document are considered to be keywords. We use the word term to refer to the words in a document. Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not. Ands are implicit, even if not explicitly specified. Ranking of documents on the basis of estimated relevance to a query is critical. Relevance ranking is based on factors such as: Term frequency: frequency of occurrence of the query keyword in the document. Inverse document frequency: how many documents the query keyword occurs in (the fewer, the more importance is given to the keyword). Hyperlinks to documents: more links to a document means the document is more important.
27 Relevance Ranking Using Terms TF-IDF (term frequency/inverse document frequency) ranking: Let n(d) = number of terms in the document d and n(d, t) = number of occurrences of term t in the document d. Then the relevance of a document d to a term t is $r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$. The log factor is to avoid excessive weightage to frequent terms. The relevance of a document to a query Q is $r(d, Q) = \sum_{t \in Q} \frac{r(d, t)}{n(t)}$, where n(t) is the number of documents that contain term t.
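A short Python sketch of this ranking follows. It reads the slide's formulas as r(d, t) = log(1 + n(d, t)/n(d)) and r(d, Q) = sum over t in Q of r(d, t)/n(t), and assumes n(t) is the number of documents containing t. The three-document collection and the function names are made up for illustration.

```python
import math


def term_relevance(doc_tokens, term):
    """r(d, t) = log(1 + n(d, t) / n(d))."""
    n_d = len(doc_tokens)
    n_dt = doc_tokens.count(term)
    return math.log(1 + n_dt / n_d) if n_d else 0.0


def query_relevance(doc_tokens, query_terms, collection):
    """r(d, Q) = sum of r(d, t) / n(t) over the query terms."""
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in collection if term in d)  # documents containing term
        if n_t:
            score += term_relevance(doc_tokens, term) / n_t
    return score


# Hypothetical collection of already-tokenized documents.
docs = [
    ["database", "systems", "database"],
    ["information", "retrieval", "systems"],
    ["web", "search"],
]
query = ["database", "systems"]
scores = [query_relevance(d, query, docs) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The first document ranks highest because it contains "database", which is rarer in the collection than "systems" and therefore divided by a smaller n(t).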
28 Relevance Ranking Using Terms (Cont.) Most systems add to the above model Words that occur in title, author list, section headings, etc. are given greater importance Words whose first occurrence is late in the document are given lower importance Very common words such as “a”, “an”, “the”, “it” etc are eliminated Called stop words Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart Documents are returned in decreasing order of relevance score Usually only top few documents are returned, not all
29 Relevance Using Hyperlinks When using keyword queries on the Web, the number of documents is enormous (many billions) Number of documents relevant to a query can be enormous if only term frequencies are taken into account Most of the time people are looking for pages from popular sites Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords Problem: hard to find actual popularity of site Solution: next slide
30 Relevance Using Hyperlinks (Cont.) Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site Count only one hyperlink from each site (why?) Popularity measure is for site, not for individual page Most hyperlinks are to root of site Site-popularity computation is cheaper than page popularity computation Refinements When computing prestige based on links to a site, give more weightage to links from sites that themselves have higher prestige Definition is circular Set up and solve system of simultaneous linear equations Above idea is basis of the Google PageRank ranking mechanism
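The circular prestige definition can be approximated iteratively instead of solving the linear system directly. The sketch below is a simplified PageRank-style power iteration, not Google's actual mechanism; the damping factor of 0.85, the handling of sites without outgoing links, and the toy link graph are all assumptions made for illustration.

```python
def prestige(links, damping=0.85, iterations=50):
    """Iteratively estimate site prestige from a site-to-sites link map.

    Each site shares its current prestige equally among the sites it links
    to, so a link from a prestigious site counts for more.
    """
    sites = set(links) | {t for targets in links.values() for t in targets}
    n = len(sites)
    rank = {s: 1.0 / n for s in sites}
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / n for s in sites}
        for s in sites:
            targets = links.get(s, set())
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Site with no outgoing links: spread its prestige evenly.
                share = damping * rank[s] / n
                for t in sites:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph: one hyperlink counted per linking site.
site_links = {"a": {"c"}, "b": {"c"}, "c": {"a"}, "d": {"c"}}
ranks = prestige(site_links)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
```

Site "c" ends up with the highest prestige because three sites link to it, and "a" outranks "b" and "d" because its only incoming link comes from the prestigious "c".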
31 Relevance Using Hyperlinks (Cont.) Connections to social networking theories that ranked prestige of people E.g. the president of the US has a high prestige since many people know him Someone known by multiple prestigious people has high prestige Hub and authority based ranking A hub is a page that stores links to many pages (on a topic) An authority is a page that contains actual information on a topic Each page gets a hub prestige based on prestige of authorities that it points to Each page gets an authority prestige based on prestige of hubs that point to it Again, prestige definitions are cyclic, and can be got by solving linear equations Use authority prestige when ranking answers to a query
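The mutually recursive hub and authority definitions can likewise be approximated by repeated updates. The following is a minimal sketch of that iteration (often called HITS); the normalization step, the iteration count, and the toy graph are assumptions.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """Alternate authority and hub updates, normalizing after each step."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority prestige: sum of hub prestige of pages pointing to p.
        auth = {p: sum(hub[s] for s in pages if p in links.get(s, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub prestige: sum of authority prestige of pages p points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1 and h2 are link pages, a1 and a2 hold content.
page_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_score, auth_score = hubs_and_authorities(page_links)
print(hub_score, auth_score)
```

When ranking answers to a query, the pages would then be ordered by `auth_score`, as the slide suggests.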
32 Similarity Based Retrieval Similarity based retrieval - retrieve documents similar to a given document Similarity may be defined on the basis of common words E.g. find k terms in A with highest r(d, t) and use these terms to find relevance of other documents; each of the terms carries a weight of r (d,t) Similarity can be used to refine answer set to keyword query User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these
33 Synonyms and Homonyms Synonyms E.g. document: “motorcycle repair”, query: “motorcycle maintenance” need to realize that “maintenance” and “repair” are synonyms System can extend query as “motorcycle and (repair or maintenance)” Homonyms E.g. “object” has different meanings as noun/verb Can disambiguate meanings (to some extent) from the context Extending queries automatically using synonyms can be problematic Need to understand intended meaning in order to infer synonyms Or verify synonyms with user Synonyms may have other meanings as well
34 Indexing of Documents An inverted index maps each keyword K_i to the set S_i of documents that contain the keyword. Documents are identified by identifiers. The inverted index may record keyword locations within each document, to allow proximity-based ranking, and counts of the number of occurrences of the keyword, to compute TF. and operation: finds documents that contain all of K_1, K_2, ..., K_n, i.e. the intersection S_1 ∩ S_2 ∩ ... ∩ S_n. or operation: finds documents that contain at least one of K_1, K_2, …, K_n, i.e. the union S_1 ∪ S_2 ∪ ... ∪ S_n. Each S_i is kept sorted to allow efficient intersection/union by merging. “not” can also be efficiently implemented by merging of sorted lists.
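A minimal Python sketch of an inverted index with merge-based and/or operations is given below. Document identifiers are integers, each posting list is kept sorted, and the sample documents and function names are hypothetical. Positions and occurrence counts are omitted for brevity.

```python
from collections import defaultdict


def build_index(docs):
    """Map each keyword to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def intersect(a, b):
    """Merge two sorted id lists, keeping ids present in both (and)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Merge two sorted id lists, keeping ids present in either (or)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


sample_docs = {
    1: "database systems concepts",
    2: "information retrieval systems",
    3: "database retrieval",
}
inverted = build_index(sample_docs)
and_result = intersect(inverted["database"], inverted["systems"])
or_result = union(inverted["database"], inverted["retrieval"])
print(and_result, or_result)
```

Because both lists are sorted, each merge touches every identifier at most once, which is what makes the and/or operations efficient.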
35 Measuring Retrieval Effectiveness IR systems save space by using index structures that support only approximate retrieval. May result in: false negative (false drop) - some relevant documents may not be retrieved. false positive - some irrelevant documents may be retrieved. For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics: Precision - what percentage of the retrieved documents are relevant to the query. Recall - what percentage of the documents relevant to the query were retrieved.
36 Performance Evaluation of Information Retrieval Systems
37 Why is System Evaluation Needed? There are many retrieval systems on the market, which one is the best? When the system is in operation, is the performance satisfactory? Does it deviate from the expectation? To fine tune a query to obtain the best result (for a particular set of documents and application) To determine the effects of changes made to an existing system (system A versus system B) Efficiency: speed Effectiveness: how good the result is?
38 Difficulties in Evaluating IR System Effectiveness is related to relevancy of items retrieved Relevancy is not a binary evaluation but a continuous function Relevancy, from a human judgement standpoint, is subjective - depends upon a specific user’s judgement situational - relates to user’s requirement cognitive - depends on human perception and behavior temporal - changes over time
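39 Retrieval Effectiveness - Precision and Recall [Figure: the entire document collection, with the set of relevant documents overlapping the set of retrieved documents.] Contingency table: relevant and retrieved = retrieved & relevant; relevant and not retrieved = not retrieved but relevant; irrelevant and retrieved = retrieved & irrelevant; irrelevant and not retrieved = not retrieved & irrelevant.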
40 Precision and Recall Precision: evaluates the correlation of the query to the database; an indirect measure of the completeness of the indexing algorithm. Recall: the ability of the search to find all of the relevant items in the database. Among the three numbers involved, only two are always available: the total number of items retrieved and the number of relevant items retrieved; the total number of relevant items is usually not available. Unfortunately, precision and recall affect each other in opposite directions! Given a system: broadening a query will increase recall but lower precision; increasing the number of documents returned has the same effect.
41 Total Number of Relevant Items Problem: which documents are actually relevant, and which are not Usual solution: human judges Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries In an uncontrolled environment (e.g., the web), it is unknown. Two possible approaches to get estimates Sampling across the database and performing relevance judgment on the returned items Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total number of relevant documents in the collection
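42 Relationship between Recall and Precision [Figure: plot with axes precision and recall. High precision, low recall: returns mostly relevant documents but misses many relevant ones. High recall, low precision: returns most of the relevant documents but includes much junk. High precision and high recall: the ideal.]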
43 Fallout Rate Problems with precision and recall: A query on “Hong Kong” will return most relevant documents but it doesn’t tell you how good or how bad the system is! (What is the chance that a randomly picked document is relevant to the query?) The number of irrelevant documents in the collection is not taken into account; recall is undefined when there is no relevant document in the collection; precision is undefined when no document is retrieved. Fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in the collection). A good system should have high recall and low fallout.
44 Fallout (cont) Fallout can be viewed as the inverse of recall. A 0/0 situation is very unlikely: the number of non-relevant items in a collection can safely be assumed to be non-zero. Fallout is the probability that a nonrelevant item is retrieved. (Recall: the probability that a relevant item is retrieved.) Among the three measures, precision, recall and fallout, fallout is least sensitive to the accuracy of the search process. A good system should have high recall and low fallout.
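The three measures can be computed side by side from the retrieved set, the relevant set, and the collection size. The sketch below assumes the standard definition of fallout (nonrelevant items retrieved divided by all nonrelevant items in the collection) and returns `None` where a measure is undefined; the document ids and collection size are made up.

```python
def evaluate(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout); None marks an undefined value."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    nonrelevant_total = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    fallout = (len(retrieved) - hits) / nonrelevant_total if nonrelevant_total else None
    return precision, recall, fallout


# Hypothetical run: 100 documents, 5 relevant, 3 retrieved (2 of them relevant).
precision, recall, fallout = evaluate(
    retrieved={1, 2, 10}, relevant={1, 2, 3, 4, 5}, collection_size=100
)
print(precision, recall, fallout)
```

Note how fallout stays defined in the two cases the slides call out: when nothing is retrieved it is 0, and when no document is relevant it is still a valid ratio.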
45 Computation of Recall and Precision Suppose: total no. of relevant docs = 5. Going down the ranked list: R = 1/5 = 0.2, P = 1/1 = 1; R = 2/5 = 0.4, P = 2/2 = 1; R = 2/5 = 0.4, P = 2/3 = 0.67; and at the end of the list R = 5/5 = 1, P = 5/13 = 0.38.
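The values on this slide can be reproduced by walking down a ranked result list and recomputing recall and precision after each document. In the sketch below, the relevance of ranks 1 to 3 and the fact that the fifth relevant document appears at rank 13 follow from the slide's numbers; the positions of the third and fourth relevant documents (ranks 6 and 9) are assumptions, since the original figure is not available.

```python
def precision_recall_at_ranks(ranked_relevance, total_relevant):
    """Return (rank, recall, precision) after each position in the ranking."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((rank, hits / total_relevant, hits / rank))
    return points


# True marks a relevant document; ranks 6 and 9 are assumed positions.
ranked = [True, True, False, False, False, True, False,
          False, True, False, False, False, True]
points = precision_recall_at_ranks(ranked, total_relevant=5)
for rank, r, p in points:
    print(rank, round(r, 2), round(p, 2))
```

Plotting precision against recall for these points gives the kind of curve used to compare systems.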
46 Computation of Recall and Precision [Figure: plot with axes recall and precision]
47 Compare Two or More Systems Computing recall and precision values for two or more systems Superimposing the results in the same graph The curve closest to the upper right-hand corner of the graph indicates the best performance TREC (Text REtrieval Conference) Benchmark
48 Web Crawling Web crawlers are programs that locate and gather information on the Web Recursively follow hyperlinks present in known documents, to find other documents Starting from a seed set of documents Fetched documents Handed over to an indexing system Can be discarded after indexing, or stored as a cached copy Crawling the entire Web would take a very large amount of time Search engines typically cover only a part of the Web, not all of it Take months to perform a single crawl
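The link-following loop can be sketched as a breadth-first traversal from the seed set. To keep the example self-contained and runnable offline, the "Web" here is a hypothetical in-memory dictionary and `fetch_links` is a stand-in for downloading a page and extracting its hyperlinks; a real crawler would also need politeness rules, error handling, and persistent storage for the link set.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds, following each link once."""
    frontier = deque(seeds)   # links waiting to be crawled
    seen = set(seeds)         # links already queued or crawled
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)   # here the page would go to the indexing system
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


# Hypothetical link graph standing in for real pages.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
crawl_order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(crawl_order)
```

The `max_pages` limit reflects the point that a search engine typically covers only part of the Web.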
49 Web Crawling (Cont.) Crawling is done by multiple processes on multiple machines, running in parallel Set of links to be crawled stored in a database New links found in crawled pages added to this set, to be crawled later Indexing process also runs on multiple machines Creates a new copy of index instead of modifying old index Old index is used to answer queries After a crawl is “completed” new index becomes “old” index Multiple machines used to answer queries Indices may be kept in memory Queries may be routed to different machines for load balancing
50 Browsing Storing related documents together in a library facilitates browsing: users can see not only the requested document but also related ones. Browsing is facilitated by a classification system that organizes logically related documents together. Organization is hierarchical: classification hierarchy
51 A Classification Hierarchy For A Library System
52 Classification DAG Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important. Classification hierarchy is thus Directed Acyclic Graph (DAG)
53 A Classification DAG For A Library Information Retrieval System
54 Web Directories A Web directory is just a classification directory on Web pages E.g. Yahoo! Directory, Open Directory project Issues: What should the directory hierarchy be? Given a document, which nodes of the directory are categories relevant to the document Often done manually Classification of documents into a hierarchy may be done based on term similarity
55 What is Computational Creativity? Computational Creativity is a small sub-field of artificial intelligence Its focus is the study and support, through computational methods, of behaviour which, in humans, would be deemed “creative” Ranging from intelligent digital libraries to systems which create music, art, scientific theories, etc.
56 Work in CCC Work in CCC falls into various categories: literary forensics (Dr Peter Smith/Dr Gea De Jong) computer-based musicology (Dr Geraint Wiggins/Tim Crawford/David Lewis/Michael Gale) intelligent digital signal & score processing (Dr Michael Casey/Dr Geraint Wiggins/Dr Darrell Conklin/Dave Meredith/Miguel Ferrand) computational music cognition (Dr Geraint Wiggins/Dr Andrés Melo/Dave Meredith/Marcus Pearce/Miguel Ferrand) intelligent composition and performance systems (Dr Geraint Wiggins/Dr Darrell Conklin/Dr John Drever/Marcus Pearce/Tak-Shing Chan/Prof Simon Emmerson/Prof Denis Smalley) formal models of creative systems (Dr Geraint Wiggins)
57 Content-Based Information Retrieval Driven by large volumes of multimedia. Search terabytes of sound and images by similarity. International standardisation makes it work globally (like the WWW).
58 MPEG-7 International Standard ISO/IEC/JTC-1/SC29/WG11 [MPEG] ISO Part 4 (Audio) [MPEG-7] Multimedia Content Description Interface
59 Audio Information Retrieval MPEG-7 Database A pre-indexed Collection of Sounds
60 Audio Information Retrieval [Diagram labels: Audio Query, Extract, MPEG-7 Database, Segment, Match, Result List] Audio Query: a sound, a scene, or a list of sounds.
61 Audio Information Retrieval [Same diagram] Extract: feature extraction from audio.
62 Audio Information Retrieval [Same diagram] Segment: partitioning of audio into chunks.
63 Audio Information Retrieval [Same diagram] Match: find similar chunks of audio.
64 Audio Information Retrieval [Same diagram, with added labels: Creativity Support Application, User, Collect, Relate] Result List: use results for creativity support.
65 Arabic IR
66 Creating the Database Method 1: Without Morphology Index the text based on the form of the word Method 2: With Morphology Index the text based on the stem of the word
67 Retrieval Systems Monolingual retrieval system Arabic query Returns Arabic Documents Cross lingual retrieval system (Arabic Translingual System) English query Translated Using Online Dictionary with human selection of terms Returns Arabic Documents and translations
68 Monolingual retrieval system Arabic query Retrieve text Display Morphology
69 Monolingual retrieval system Enter Query, Select Data Source, Search
70 Monolingual retrieval system List of Documents and Top Document Returned
71 Arabic Translingual System Arabic query Retrieve text Display/Translate Morphology English query Translate
72 Arabic Translingual System Type query in English, Select Translate option
73 Arabic Translingual System Double click on any word to see dictionary entry
74 Arabic Translingual System Click on translation button for gisting translation
75 Translingual System Include syntax in translation system Expand the bi-directional dictionaries Improve Onomasticon Perform automatic disambiguation of translated queries in cross-language system (necessary for TREC-9) Using ontology? Participate in TREC
76 Distributed IR [Figure: an Information Need directed (?) at Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n] Common scenarios: Multiple partitions, single service Independent engines, single organization Independent engines, affiliated organizations Independent engines, unaffiliated organizations Defining dimensions: Cooperative vs. uncooperative engines Centralized vs. decentralized solutions