
1 University of Palestine, Topics in CIS, ITBS 3202, Ms. Eman Alajrami, 2nd Semester 2008-2009

2 Chapter 2, Part 1: Information Retrieval Models

3 Introduction
Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or a group of related words) that has some meaning of its own (usually a noun).
Advantages:
- Simple.
- The semantics of the documents and of the user's information need can be expressed naturally through sets of index terms.

4 IR Models
Ranking algorithms are at the core of information retrieval systems: they predict which documents are relevant and which are not.

5 A taxonomy of information retrieval models
User task: Retrieval (ad hoc, filtering) or Browsing.
- Classic models: Boolean, Vector, Probabilistic.
  - Set-theoretic extensions: Fuzzy, Extended Boolean.
  - Algebraic extensions: Generalized Vector, Latent Semantic Indexing, Neural Networks.
  - Probabilistic extensions: Inference Network, Belief Network.
- Structured models: Non-Overlapping Lists, Proximal Nodes.
- Browsing: Flat, Structure Guided, Hypertext.

6 Basic Concept
Each document is described by a set of representative keywords called index terms. An index term is simply a word whose semantics help in recalling the document's main themes.
Consider a collection of one hundred documents. A word that appears in every one of the hundred documents is completely useless as an index term, because it tells us nothing about which documents the user is interested in. A word that appears in just 5 documents, however, is quite useful, because it narrows down considerably the space of documents that might be of interest to the user.
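
A minimal sketch of this idea in Python, with a made-up toy collection: counting each term's document frequency shows that a term occurring in every document discriminates nothing, while a rarer term narrows the candidate set.

```python
# A minimal sketch (toy collection assumed) of why rare terms make better
# index terms: terms appearing in every document cannot discriminate.
from collections import Counter

docs = {
    "d1": {"university", "sport", "football"},
    "d2": {"university", "sport", "news"},
    "d3": {"university", "sport", "swimming"},
    "d4": {"university", "sport", "jordan"},
    "d5": {"university", "sport", "football"},
}

df = Counter(term for terms in docs.values() for term in terms)
for term, count in df.most_common():
    verdict = "useless as an index term" if count == len(docs) else "narrows the search"
    print(f"{term}: in {count}/{len(docs)} documents -> {verdict}")
```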

7 Retrieval Strategy
An IR strategy is a technique by which a relevance measure is obtained between a query and a document.
- Manual systems: Boolean, Fuzzy Set.
- Automatic systems: Vector Space Model, Language Models, Latent Semantic Indexing.

8 Basic Concepts
In the classic models, each document is described by a set of representative keywords called index terms.
- Index terms are mainly nouns.
- Index term weights are usually assumed to be mutually independent.

9 Boolean Model
Binary decision criterion; a data retrieval model.
A query is a Boolean expression, which can be represented as a disjunction of conjunctive vectors.
Advantage: clean formalism, simplicity.
Disadvantage: exact matching may lead to retrieval of too few or too many documents.

10 Boolean IR
Documents are composed of TERMS (words, stems).
Results are expressed in set-theoretic terms, e.g. the documents containing both term A and term B (A AND B), or those containing either both A and B or term C ((A AND B) OR C).
Pre-1970s model; the dominant industrial model through 1994 (Lexis-Nexis, DIALOG).
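
A minimal sketch of this set-theoretic reading, with made-up document-ID sets for terms A, B and C:

```python
# Boolean retrieval as set algebra: AND -> intersection, OR -> union
# (the document-ID sets below are made up for illustration).
docs_with_A = {1, 2, 3, 5}
docs_with_B = {2, 3, 4}
docs_with_C = {4, 5}

a_and_b = docs_with_A & docs_with_B                       # A AND B
a_and_b_or_c = (docs_with_A & docs_with_B) | docs_with_C  # (A AND B) OR C

print(a_and_b)        # {2, 3}
print(a_and_b_or_c)   # {2, 3, 4, 5}
```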

11 Information Retrieval Models
An information retrieval model is a formal framework that supports all the major phases of the information retrieval process, including:
- Item (document) representation
- User-need representation
- Matching of needs to items
- Ranking of retrieved items
An information retrieval model is analogous to a database model (relational, object-oriented, semi-structured, etc.).

12 A Generic Model
D: a set of document representations.
Q: a set of user-need representations (queries).
R: D x Q -> real numbers. A function that assigns each document and each query a real number that represents the ranking (relevance) of the document with respect to the query.
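
A minimal sketch of this signature in Python; the particular scoring rule (fraction of query terms present in the document) is an assumption chosen only to make R concrete, not part of the generic model.

```python
# A toy ranking function R: D x Q -> real numbers.
def rank(document: set, query: set) -> float:
    """Assigns a (document, query) pair a real number: here, the
    fraction of query terms the document contains (an assumption)."""
    if not query:
        return 0.0
    return len(document & query) / len(query)

doc = {"sport", "university", "football"}
print(rank(doc, {"university", "jordan"}))   # 0.5
```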

13 Common Preprocessing Steps
- Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers).
- Break the text into tokens (keywords) on whitespace.
- Stem tokens to "root" words, e.g. computational -> compute.
- Remove common stopwords (e.g. a, the, it).
- Detect common phrases (possibly using a domain-specific dictionary).
- Build an inverted index (keyword -> list of documents containing it).
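
A minimal sketch of this pipeline in Python; the tiny stopword list and the crude suffix-stripping "stemmer" are stand-in assumptions, not a real stemming algorithm or lexicon.

```python
# A minimal preprocessing pipeline: strip markup/punctuation, tokenize,
# remove stopwords, crudely stem, and build an inverted index.
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "and", "of", "in"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML-like markup
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # strip punctuation and numbers
    tokens = text.lower().split()               # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    # crude suffix stripping as a stand-in for a real stemmer
    return [re.sub(r"(ational|ing|s)$", "", t) for t in tokens]

def build_inverted_index(docs):
    index = defaultdict(set)                    # keyword -> ids of docs containing it
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return dict(index)

docs = {"d1": "The computational sports news.", "d2": "<p>Sport at the university.</p>"}
print(build_inverted_index(docs))
```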

14 Common Assumptions
The three classic models are all based on document representations that are sets of index terms.
Index terms (keywords): words (mostly nouns) extracted from a document to summarize its contents. Method: could extract all distinct words, only those that appear in a global lexicon, etc. (to be discussed in Topic 8).
Weights: not all extracted index terms carry the same importance in summarizing the contents of a document.
Let t be the number of terms in the entire system, let k_i be a term, and let d_j be a document.
w_ij >= 0 is a weight associated with the pair (k_i, d_j); w_ij = 0 when k_i does not appear in d_j.
Document d_j is associated with the index-term vector d_j = (w_1j, w_2j, ..., w_tj).
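
A minimal sketch (the vocabulary and documents are made up) of the index-term vector d_j = (w_1j, ..., w_tj), using binary weights as the Boolean model will: w_ij = 1 if term k_i appears in d_j, otherwise 0.

```python
# Binary index-term vectors: w_ij = 1 if k_i appears in d_j, else 0
# (terms and documents below are made up for illustration).
terms = ["k1", "k2", "k3", "k4"]          # the t terms of the system
documents = {
    "d1": {"k1", "k2", "k4"},
    "d2": {"k3"},
}

def term_vector(doc_terms):
    return [1 if k in doc_terms else 0 for k in terms]

for doc_id, doc_terms in documents.items():
    print(doc_id, term_vector(doc_terms))  # d1 [1, 1, 0, 1], d2 [0, 0, 1, 0]
```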

15 Common Assumptions (cont.)
A naïve assumption is made that the terms in a document are mutually independent.
Term independence: the appearance of one term in a document is unrelated to the appearance of another. This assumption is made for simplicity of calculations, but is often wrong.
Example: the terms computer and network are not independent. If the term computer appears, the probability that the term network appears is higher than if the term computer does not appear, whereas independence requires P(network | computer) = P(network).
Nonetheless, it has been demonstrated that performance is still good under this naïve assumption.
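
A minimal sketch of checking this in practice, with a made-up toy collection: estimate P(network) and P(network | computer) as document proportions and compare them.

```python
# Estimating P(network) vs. P(network | computer) from a toy collection
# (the documents are made up for illustration).
docs = [
    {"computer", "network", "protocol"},
    {"computer", "network", "router"},
    {"computer", "software"},
    {"football", "sport"},
    {"university", "news"},
    {"jordan", "sport"},
]

has_computer = [d for d in docs if "computer" in d]
p_network = sum("network" in d for d in docs) / len(docs)
p_network_given_computer = sum("network" in d for d in has_computer) / len(has_computer)

print(f"P(network) = {p_network:.2f}")                           # 0.33
print(f"P(network | computer) = {p_network_given_computer:.2f}")  # 0.67
```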

16 The Boolean Model
Document: a document is a set of index terms without weights; or, equivalently, with binary term weights: w_ij = 0 or w_ij = 1, i.e. a keyword is either present in or absent from a document.
Query: a query is a Boolean expression of index terms, i.e. index terms connected with AND, OR and NOT, according to the usual conventions.
Ranking: with each index term k_i we associate the set D_ki of the documents in which k_i appears: D_ki = {d_j | w_ij = 1}, and the Boolean expression is converted to a set-theoretic expression:
- Each term k_i is substituted by the set D_ki.
- The Boolean operators AND (^), OR (v) and NOT (~) are substituted, respectively, by the set operators intersection, union and complement.
The documents in the resulting set are relevant; all others are non-relevant.

17 The Boolean Model (cont.)
Example: terms K1, ..., K8. Documents:
D1 = {K1, K2, K3, K4, K5}
D2 = {K1, K2, K3, K4}
D3 = {K2, K4, K6, K8}
D4 = {K1, K3, K5, K7}
D5 = {K4, K5, K6, K7, K8}
D6 = {K1, K2, K3, K4}
Query: K1 AND (K2 OR NOT K3)
Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
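
A minimal Python sketch of this example: each term is substituted by its document set D_ki, and AND, OR, NOT are substituted by intersection, union and complement, as described on the previous slide.

```python
# The slide's example evaluated as set algebra.
documents = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(documents)

def D(term):
    """D_ki: the set of documents in which the term appears."""
    return {doc for doc, terms in documents.items() if term in terms}

# Query: K1 AND (K2 OR NOT K3)
answer = D("K1") & (D("K2") | (all_docs - D("K3")))
print(sorted(answer))   # ['D1', 'D2', 'D6']
```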

18 The Boolean Model (cont.)
A popular retrieval model because:
- Easy to understand for simple queries.
- Simple and clean formalism.
- Reasonably efficient implementations are possible for normal queries.
- Adopted by many early commercial IR systems.
However, a document is either relevant or non-relevant to a query (i.e., no "strong" or "weak" relevance). Hence, a query splits the collection into two distinct sets of documents, relevant and non-relevant, and there is no ranking from most relevant to least relevant. This may lead to answers with too few or too many documents.

19 Drawbacks of the Boolean Model
- Retrieval is based on a binary decision criterion with no notion of partial matching.
- No ranking of the documents is provided (absence of a grading scale).
- Very rigid: AND means all; OR means any.
- Difficult to express complex user requests: the information need has to be translated into a Boolean expression, which most users find awkward.
- The Boolean queries formulated by users are most often too simplistic.
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.
- Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?

20 Example
Assume that we have 5 documents D1, D2, D3, D4, D5 with the following terms found in each document:
D1: {play, sport, football, swimming, university, just}
D2: {jordan, news, sport, university}
D3: {football, yarmouk, university, sport, play}
D4: {university, just, sport, jordan}
D5: {play, jordan, university, football}

21 Example (continued)
The following query will be used to search for relevant documents:
"Yarmouk" AND ("University" OR NOT "Just")
Translate the query into disjunctive normal form (DNF):
dnf1: Yarmouk AND NOT University AND NOT Just = (100)
dnf2: Yarmouk AND University AND NOT Just = (110)
dnf3: Yarmouk AND University AND Just = (111)
The set of vectors representing the query is Qdnf = {(111), (110), (100)}.

22 Example (continued)
Truth table for K1 ^ (K2 v ~K3), where K1 = Yarmouk, K2 = University, K3 = Just:
K1 K2 K3 | K1 ^ (K2 v ~K3) | Disjunct
0  0  0  | 0               |
0  0  1  | 0               |
0  1  0  | 0               |
0  1  1  | 0               |
1  0  0  | 1               | K1 ^ ~K2 ^ ~K3
1  0  1  | 0               |
1  1  0  | 1               | K1 ^ K2 ^ ~K3
1  1  1  | 1               | K1 ^ K2 ^ K3
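
A short sketch that reproduces this truth table by enumerating all assignments of (K1, K2, K3) and evaluating K1 AND (K2 OR NOT K3); the rows where the expression is true are exactly the DNF disjuncts (100), (110), (111).

```python
# Enumerate the truth table of K1 AND (K2 OR NOT K3).
from itertools import product

for k1, k2, k3 in product([0, 1], repeat=3):
    value = 1 if (k1 and (k2 or not k3)) else 0
    print(f"K1={k1} K2={k2} K3={k3} -> {value}")
```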

23 Example (continued)
Term-document incidence matrix:
Term:  play | sport | football | swimming | jordan | news | yarmouk | university | just
D1:    1    | 1     | 1        | 1        | 0      | 0    | 0       | 1          | 1
D2:    0    | 1     | 0        | 0        | 1      | 1    | 0       | 1          | 0
D3:    1    | 1     | 1        | 0        | 0      | 0    | 1       | 1          | 0
D4:    0    | 1     | 0        | 0        | 1      | 0    | 0       | 1          | 1
D5:    1    | 0     | 1        | 0        | 1      | 0    | 0       | 1          | 0

24 Matching
To find the similarity between a document and the query, we search for a conjunctive component (DNF vector) such that every key term in the query has the same value in that component as it does in the document.

25 Query key-term matrix
       Yarmouk | University | Just
dnf1:  1       | 0          | 0
dnf2:  1       | 1          | 0
dnf3:  1       | 1          | 1

26 Result
The query returns document D3 as the only relevant document.
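
A minimal sketch tying the example together: compute each document's bit pattern over the query terms (Yarmouk, University, Just) and keep the documents whose pattern appears in Qdnf = {(1,1,1), (1,1,0), (1,0,0)}.

```python
# Match each document's pattern over the query terms against the DNF vectors.
documents = {
    "D1": {"play", "sport", "football", "swimming", "university", "just"},
    "D2": {"jordan", "news", "sport", "university"},
    "D3": {"football", "yarmouk", "university", "sport", "play"},
    "D4": {"university", "just", "sport", "jordan"},
    "D5": {"play", "jordan", "university", "football"},
}
query_terms = ("yarmouk", "university", "just")
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}      # the DNF vectors of the query

relevant = [
    doc_id
    for doc_id, terms in documents.items()
    if tuple(1 if t in terms else 0 for t in query_terms) in q_dnf
]
print(relevant)   # ['D3']
```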

