Web Search - Summer Term 2006 II. Information Retrieval (Basics Cont.)


1 Web Search - Summer Term 2006 II. Information Retrieval (Basics Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University

2 Organizational Remarks
Exercises: Please register for the exercises by sending me an e-mail containing:
- Your name
- Matrikelnummer (student ID number)
- Studiengang (degree program: BA, MSc, Diploma, ...)
- Plans for the exam (yes, no, undecided)
This is just to organize the exercises; there are no consequences if you decide to drop the course. Please register before the exercises start. Later registration might be possible under certain circumstances (contact me).

3 Recap: IR System & Tasks Involved
[Figure: IR system and the tasks involved. Labeled components: information need, user interface, query, query processing (parsing & term processing), logical view of the information need, documents, select data for indexing, parsing & term processing, index, searching, ranking, results, result representation, performance evaluation.]

4 Models for Information Retrieval
Mainly used in science and research, and probably less often in real systems.
But: research results have significance for practice, e.g. because they increase our understanding and allow more fact-based statements.
General advantages of theoretical models:
- Behavior can be clearly understood and reconstructed, characteristics can be proven, etc.
- Plug-and-play, i.e. it is easy to build on previous work, with a strong theoretical background and framework.

5 Models for IR - Taxonomy
Classic models:
- Boolean model (based on set theory)
- Vector space model (based on algebra)
- Probabilistic models (based on probability theory)
Set theoretic extensions:
- Fuzzy set model
- Extended Boolean model
Algebraic extensions:
- Generalized vector model
- Latent semantic indexing
- Neural networks
Probabilistic extensions:
- Inference networks
- Belief network
Further models:
- Structured models
- Models for browsing
- Filtering
SOURCE: R. BAEZA-YATES [1], PAGE 20+21

6 Formal Specification of the Task
Definition: An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where
- $D$ is a set composed of logical views (or representations) of the documents in the collection.
- $Q$ is a set composed of logical views (or representations) of the user information needs. Such representations are called queries.
- $F$ is a framework for modeling document representations, queries, and their relationships.
- $R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i$ in $Q$ and a document representation $d_j$ in $D$. Such a ranking defines an ordering among the documents with regard to the query $q_i$.
SOURCE: R. BAEZA-YATES [1], PAGE 23

7 Formal Specific. of the Task (Cont.)
Generally, we represent the query and the documents through a set of terms $T = \{t_1, ..., t_k\}$, where $k$ is the number of unique index terms in the system.
We assume $w_{i,j}$ to be a weight for term $t_i$ in document $d_j$, with $w_{i,j} = 0$ if $t_i$ is not in $d_j$.
Document $d_j$ can be represented as an index term vector $d_j = (w_{1,j}, w_{2,j}, ..., w_{k,j})$.
$g_i$ is a function with $g_i(d_j) = w_{i,j}$ (i.e. given a document $d_j$, $g_i$ delivers the weight of term $t_i$ in $d_j$).
CF. R. BAEZA-YATES [1], PAGE 25
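The representation above can be sketched in a few lines of Python. This is only an illustration: the two toy documents, whitespace tokenization, and the use of raw term counts as weights are assumptions, and the names `build_vocabulary`, `term_vector`, and `g` are hypothetical helpers that do not come from the slides.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of unique index terms T = [t_1, ..., t_k]."""
    return sorted({term for doc in docs for term in doc.split()})

def term_vector(doc, vocabulary):
    """Return d_j = (w_1j, ..., w_kj); raw term counts serve as weights (assumption)."""
    counts = Counter(doc.split())
    # A term that does not occur in the document gets weight 0.
    return [counts.get(term, 0) for term in vocabulary]

def g(i, vector):
    """g_i(d_j): weight of term t_i in d_j (i is 1-based, as in the definition)."""
    return vector[i - 1]

# Toy collection (assumed for illustration only).
docs = ["web search engine", "web web crawler"]
vocabulary = build_vocabulary(docs)
vectors = [term_vector(d, vocabulary) for d in docs]
```

Here `vocabulary` plays the role of T, each entry of `vectors` is one index term vector, and `g(4, vectors[1])` looks up the weight of the fourth term in the second document.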

8 Classic Retrieval Models
1. Boolean Model (set theoretic)

9 Boolean Retrieval Model - Queries
Based on set theory and Boolean algebra
Documents: index term vector $d_j = (w_{1,j}, ..., w_{k,j})$ with $w_{i,j} \in \{0, 1\}$
Queries: terms combined with AND, OR, NOT, written as a Boolean expression in disjunctive normal form (DNF)
Example: [formula not preserved in the transcript]
CF. R. BAEZA-YATES [1], CH

10 Boolean Retr. Model - Definition
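A query $q$ is defined as a Boolean expression $q_{dnf}$ in DNF, with $q_{cc}$ being the conjunctive components of $q_{dnf}$. The index term weight variables $w_{i,j}$ are 0 or 1.
We define the similarity $sim$ of a document $d_j$ with query $q$ as

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \in q_{dnf} : \forall t_i,\ g_i(d_j) = g_i(q_{cc}) \\ 0 & \text{otherwise} \end{cases}$$

(A document is considered relevant if $sim = 1$ and irrelevant otherwise.)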
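A minimal sketch of Boolean matching against a DNF query follows. The three-term query `t1 AND (t2 OR NOT t3)` is a hypothetical example chosen for illustration, not necessarily the one on the slide, and `to_dnf` and `boolean_sim` are assumed helper names. The DNF is obtained here by brute-force enumeration of all 0/1 assignments, which is only practical for very small vocabularies.

```python
from itertools import product

def to_dnf(predicate, k):
    """List every 0/1 assignment of k terms that satisfies the query.

    Each returned tuple is one conjunctive component q_cc of q_dnf.
    """
    return [bits for bits in product((0, 1), repeat=k) if predicate(*bits)]

def boolean_sim(doc_vector, q_dnf):
    """Return 1 if the binary document vector equals some q_cc, else 0."""
    return 1 if any(tuple(doc_vector) == q_cc for q_cc in q_dnf) else 0

# Hypothetical query over terms (t1, t2, t3): t1 AND (t2 OR NOT t3)
q_dnf = to_dnf(lambda t1, t2, t3: t1 and (t2 or not t3), 3)

relevant = boolean_sim((1, 0, 0), q_dnf)      # contains t1, lacks t3
not_relevant = boolean_sim((1, 0, 1), q_dnf)  # contains t3 without t2
```

The result is strictly binary: a document either matches one conjunctive component exactly or it is rejected, which is the behavior the definition describes.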

11 Boolean Retrieval Model
Advantages:
- Precise, clean formalism
- Offers great control and transparency
- Simplicity: easy math, easy implementation
- Good for domains where results are ranked by means other than relevance, e.g. chronologically
Disadvantages:
- Query might be hard to specify
- Binary decision (relevant or not)
- Often too many or too few results

12 Classic Retrieval Models
1. Boolean Model (set theoretic) 2. Vector Model (algebraic)

13 Vector Model - Definition
Based on vector algebra
Main advantage (compared to Boolean models): considers non-binary weights and calculates a similarity measure between query and document
Formal definition:
- $w_{i,q}$ is the weight associated with the pair $(t_i, q)$, with $w_{i,q} \ge 0$
- $k$ is the number of unique index terms
With this, we can define
- Query vector $q = (w_{1,q}, w_{2,q}, ..., w_{k,q})$
- Document vector $d_j = (w_{1,j}, w_{2,j}, ..., w_{k,j})$

14 Vector Model - Definition (Cont.)
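The similarity between a query and a document can then be quantified by the correlation of the respective vectors, e.g.
- using the inner product:

$$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{k} w_{i,j} \, w_{i,q}$$

- using the cosine of the angle between the two vectors:

$$sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{k} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{k} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{k} w_{i,q}^2}}$$

Weights: often TF*IDF (or variants of it)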
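Both similarity measures can be sketched in plain Python. The function names are illustrative, and returning 0.0 for an all-zero vector is a convention assumed here (the cosine is undefined in that case), not something stated on the slides.

```python
import math

def inner_product(d, q):
    """Sum of the pairwise products of document and query weights."""
    return sum(w_d * w_q for w_d, w_q in zip(d, q))

def cosine_similarity(d, q):
    """Inner product normalized by the lengths of both vectors."""
    norm_d = math.sqrt(inner_product(d, d))
    norm_q = math.sqrt(inner_product(q, q))
    if norm_d == 0 or norm_q == 0:
        # Assumed convention: an empty vector is similar to nothing.
        return 0.0
    return inner_product(d, q) / (norm_d * norm_q)
```

One reason to prefer the cosine is that it ignores vector length: a long document that repeats the same terms many times does not score higher than a short one pointing in the same direction, whereas the raw inner product grows with document length.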

15 Vector Model - Illustration
Simple example with 3 terms: [figure not preserved in the transcript]
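As a stand-in for a three-term illustration, the sketch below computes TF*IDF weights for a toy collection. The counts are invented for illustration, and the variant used (raw term frequency times the natural log of N divided by document frequency) is one common choice assumed here; the slides do not say which variant is meant. `tf_idf_vectors` is a hypothetical helper name.

```python
import math

def tf_idf_vectors(term_counts):
    """Turn per-document term counts into TF*IDF weight vectors.

    term_counts: list of count lists, all over the same k terms.
    Assumed variant: weight = tf * ln(N / df).
    """
    n_docs = len(term_counts)
    k = len(term_counts[0])
    # df[i]: number of documents that contain term i at least once.
    df = [sum(1 for doc in term_counts if doc[i] > 0) for i in range(k)]
    idf = [math.log(n_docs / df[i]) if df[i] else 0.0 for i in range(k)]
    return [[doc[i] * idf[i] for i in range(k)] for doc in term_counts]

# Three documents over three terms (t1, t2, t3); counts are made up.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]
weights = tf_idf_vectors(counts)
```

In this toy collection t1 occurs in every document, so its IDF is zero and it contributes nothing to any vector, while the rarer terms t2 and t3 dominate the weights.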

16 Vector Model
Advantages:
- Fast and easy
- Finds similar documents (no binary decision)
- Ranking based on similarity
- Often better results than Boolean search (because of the term weighting)
Disadvantages:
- Terms are assumed to be independent

17 Classic Retrieval Models
1. Boolean Model (set theoretic) 2. Vector Model (algebraic)

18 Classic Retrieval Models
1. Boolean Model (set theoretic) 2. Vector Model (algebraic) 3. Probabilistic Models (probabilistic)

