

2 Information Retrieval




1 2 Information Retrieval

2 Motivation Information Retrieval has been a field of activity for many years
It was long seen as an area of narrow interest. The advent of the Web changed this perception: universal repository of knowledge free (low-cost) universal access no central editorial board many problems though: IR is seen as key to finding the solutions! Information Retrieval and Knowledge Organisation - 2 Information Retrieval

3 Motivation Information Retrieval: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not data) Focus is on the user's information need Example: Find all documents containing information about car accidents which happened in Vienna and had people injured The information need is expressed as a query

4 Generic Schema of an Information System
representation of resources (index/meta-data) representation of information need (query) user information resources Comparison (Ranking) Information Retrieval systems do not search through the documents but through the representation (also called index, meta-data or description). source: (Ferber 2004)

5 Example D1 D2 D3 Heavy accident Because of a heavy car accident 4 people died yesterday morning in Vienna. More vehicles In this quarter more cars became registered in Vienna. Truck causes accident In Vienna a trucker drove into a crowd of people. Four people were injured. Information need: documents containing information about accidents with heavy vehicles in Vienna Query: accident heavy vehicles vienna Expected result: document D3 but: not all terms of the query occur in the document, and the occurring terms „accident“ and „heavy“ also occur in D1

6 Retrieval System
[Diagram: an information need is turned by query processing into query terms; document resources are indexed into indexing terms, IDs are assigned, and documents and IDs are stored; at retrieval (search) time the interface answers a query with a sorted list of weighted document IDs (ranking); with the ID the document can be retrieved.] Each document is represented by a set of representative keywords or index terms An index term is a document word useful for remembering the document's main themes The index is stored in an efficient data structure Queries are answered using the index

7 manual indexing – key words
the user specifies key words he/she assumes to be useful Usually, key words are nouns because nouns have meaning by themselves there are two possibilities the user can assign any terms the user can select from a predefined set of terms ( controlled vocabulary) automatic indexing – full text search search engines assume that all words are index terms (full text representation) the system generates index terms from the words occurring in the text

8 Automatic Indexing: 1. Decompose a Document into Terms
rules determine how texts are decomposed into terms by defining separators like punctuation marks, blanks or hyphens Additional preprocessing, e.g. exclude specific strings (stop words, numbers) generate normal form stemming substitute characters (e.g. upper case – lower case, Umlaut) D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna D2: more vehicles in this quarter more cars became registered in vienna D3: truck causes accident in vienna a trucker drove into a crowd of people four people were injured
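The decomposition step can be sketched as follows; the separator set and the small stop-word list are illustrative assumptions, not the ones used in the lecture.

```python
import re

# Illustrative stop-word list (an assumption, not the lecture's list)
STOP_WORDS = {"a", "of", "in", "this", "the", "were", "into"}

def decompose(text):
    """Decompose a text into terms: split at separators (blanks,
    punctuation, hyphens), lower-case, drop stop words and numbers."""
    tokens = re.split(r"[\s.,;:!?()\-]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]

d1 = "Because of a heavy car accident 4 people died yesterday morning in Vienna."
print(decompose(d1))
# ['because', 'heavy', 'car', 'accident', 'people', 'died', 'yesterday', 'morning', 'vienna']
```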

9 Automatic Indexing 2. Index represented as an inverted List
For each term: a list of documents in which the term occurs additional information can be stored with each document like frequency of occurrence positions of occurrence Term Document IDs a D1,D3 accident D1,D3 became D2 because D1 car D1 cars D2 died D1 heavy D1 in D1,D2,D3 more D2 of D1 people D1,D3 quarter D2 registered D2 truck D3 vehicles D2 An inverted list is similar to an index in a book

10 Index as Inverted List with Frequency
In this example the inverted list contains the document identifier and the frequency of the term in the document. term (document; frequency) a (D1,1) (D3,2) accident (D1,2) (D3,1) became (D2,1) because (D1,1) car (D1,1) cars (D2,1) died (D1,1) heavy (D1,2) in (D1,1) (D2,1) (D3,1) more (D2,1) of (D1,1) people (D1,1) (D3,2) quarter (D2,1) registered (D2,1) truck (D3,1) vehicles (D2,1) ...
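A minimal sketch of building such an inverted list with frequencies; the document texts are the ones from the example slides, tokenized by whitespace only for brevity.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of (document ID, frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = {
    "D1": "heavy accident because of a heavy car accident 4 people died yesterday morning in vienna",
    "D2": "more vehicles in this quarter more cars became registered in vienna",
    "D3": "truck causes accident in vienna a trucker drove into a crowd of people four people were injured",
}
index = build_inverted_index(docs)
print(index["accident"])  # [('D1', 2), ('D3', 1)]
print(index["people"])    # [('D1', 1), ('D3', 2)]
```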

11 Problems of Information Retrieval
Word form A word can occur in different forms, e.g. singular or plural. Example: A query for „car“ should also find documents containing the word „cars“ Meaning A single term can have different meanings; on the other hand, the same meaning can be expressed using different terms. Example: when searching for „car“, documents containing „vehicle“ should also be found. Wording, phrases The same issue can be expressed in various ways Example: searching for „motorcar“ should also find documents containing „motorized car“

12 Word Forms Flexion: Conjugation and declension of a word
car – cars run – ran – running Derivations: words having the same stem form – format – formation Compounds: information management – management of information In German, compounds are written as single words, sometimes with a hyphen Informationsmanagement Informations-Management

13 Word Meaning and Phrases
Dealing with words having the same or similar meaning Synonyms record – file – dossier seldom – not often Variants in spelling (e.g. BE vs. AE) organisation – organization night – nite Abbreviations UN – United Nations Polysemes: words with multiple meanings Bank

14 2.1 Dealing with Word Forms and Phrases
We distinguish two ways to deal with word forms and phrases Indexing without preprocessing All occurring word forms are included in the index Different word forms are unified at search time string operations Indexing with preprocessing Unification of word forms during indexing Terms are normal forms of occurring word forms the index is largely independent of the concrete formulation of the text computerlinguistic approach

15 2.1.1 Indexing Without Preprocessing
Index: contains all the word forms occurring in the documents Query: Searching for specific word forms is possible (e.g. searching for „cars“ but not for „car“) To search for different word forms, string operations can be applied Operators for truncation and masking, e.g. ? covers exactly one character * covers an arbitrary number of characters Context operators, e.g. [n] exact distance between terms <n> maximal distance between terms Truncation schreib#: schreiben, schreibt, schreibst, schreibe schreib$$: schreiben, schreibst #schreiben: schreiben, beschreiben, anschreiben, verschreiben $$schreiben: beschreiben, anschreiben Masking schr$$b#: schreiben, schrieb / schrauben h$$s#: Haus, Häuser / Hanse, hausen, hassen Context operators exact word distance ($): retrieval $ information: retrieval of information, retrieval with information loss maximal word distance (#): text # # retrieval: text retrieval, text and fact retrieval word order (,): information retrieval, retrieval of information same sentence (.): information # retrieval. does not match ...this information. Retrieval of data... but also not ...storage of information. Its retrieval...
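How a retrieval system might expand the ? and * operators against the vocabulary of such an index can be sketched by translating the pattern into a regular expression; the function name and vocabulary are my own illustration.

```python
import re

def expand_wildcards(pattern, vocabulary):
    """Expand '?' (exactly one character) and '*' (arbitrarily many
    characters) by translating the pattern into a regular expression
    and collecting all matching index terms."""
    regex = re.compile(re.escape(pattern).replace(r"\?", ".").replace(r"\*", ".*") + r"\Z")
    return sorted(term for term in vocabulary if regex.match(term))

vocabulary = ["car", "cars", "vehicle", "vehicles", "people", "accident"]
print(expand_wildcards("car?", vocabulary))      # ['cars']
print(expand_wildcards("vehicle*", vocabulary))  # ['vehicle', 'vehicles']
```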

16 Index Without Preprocessing and Query
Term Document IDs a D1,D3 accident D1,D3 became D2 because D1 car D1 cars D2 died D1 heavy D1 in D1,D2,D3 more D2 of D1 people D1,D3 quarter D2 registered D2 truck D3 vehicles D2 Query: vehicle? car? people

17 Truncation and Masking: Searching for Different Word Forms
Truncation: Wildcards cover characters at the beginning and end of words – prefix or suffix schreib* finds schreiben, schreibt, schreibst, schreibe, … ??schreiben finds anschreiben, beschreiben, but not verschreiben Masking deals with characters inside words – in particular in German, declension and conjugation affect not only suffix and prefix schr??b* can find schreiben, schrieb h??s* can find Haus, Häuser Disadvantage: With truncation and masking, not only the intended words are found schr??b* also finds schrauben h??s* also finds Hans, Hanse, hausen, hassen and also words in other languages like horse

18 Context Operators exact word distance maximal word distance
Context operators allow searching for variations of text phrases exact word distance Bezug [3] Telefonat finds Bezug nehmend auf unser Telefonat maximal word distance text <2> retrieval finds text retrieval, text and fact retrieval For context operators to be applicable, the positions of the words must be stored in the index
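A sketch of a positional index and a maximal-word-distance operator; reading the distance as the number of words allowed between the two terms is one plausible interpretation of <n>, and the function names are my own.

```python
from collections import defaultdict

def positional_index(docs):
    """For each term, store the word positions per document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def max_distance(index, t1, t2, n):
    """Documents where t2 follows t1 with at most n words in between."""
    hits = set()
    for doc_id, positions in index[t1].items():
        for p1 in positions:
            for p2 in index[t2].get(doc_id, []):
                if p1 < p2 <= p1 + n + 1:
                    hits.add(doc_id)
    return sorted(hits)

docs = {"D4": "text retrieval", "D5": "text and fact retrieval", "D6": "retrieval of text"}
idx = positional_index(docs)
print(max_distance(idx, "text", "retrieval", 2))  # ['D4', 'D5']
```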

19 Indexing Without Preprocessing
Efficiency Efficient indexing Overhead at retrieval time to apply string operators Word forms the user has to codify all possible word forms and phrases in the query using truncation and masking operators no support given by the search engine the retrieval engine is language independent Phrases Variants in text phrases can be coded using context operators

20 2.1.2 Preprocessing of the Index – Computerlinguistic Approach
Each document is represented by a set of representative keywords or index terms An index term is a document word useful for remembering the document’s main themes Index contains standard forms of useful terms Restrict allowed terms Normalisation: map terms to a standard form

21 Restricting allowed Index Terms
Objective: increase efficiency and effectivity by neglecting terms that do not contribute to the assessment of a document‘s relevance There are two possibilities to restrict allowed index terms Explicitly specify allowed index terms controlled vocabulary Specify terms that are not allowed as index terms stopwords

22 Stop Words Stop words are terms that are not stored in the index
Candidates for stop words are words that occur very frequently A term occurring in every document is useless as an index term, because it does not tell anything about which document the user might be interested in a word which occurs in only 0.001% of the documents is quite useful because it narrows down the space of documents which might be of interest for the user words with no/little meaning terms that are not words (e.g. numbers) Examples: General: articles, conjunctions, prepositions, auxiliary verbs (to be, to have) occur very often and in general have no meaning as search criteria application-specific stop words are also possible

23 Normalisation of Terms
There are various possibilities to compute standard forms N-Grams stemming: removing suffixes or prefixes

24 N-Grams Index: sequences of characters of length N
Example: „persons“ 3-grams (N=3): per, ers, rso, son, ons 4-grams (N=4): pers, erso, rson, sons N-grams can also cross word boundaries Example: „persons from switzerland“ 3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and
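A sketch of N-gram generation; blanks are replaced by '_' so that N-grams can cross word boundaries, as in the example above.

```python
def ngrams(text, n):
    """All character sequences of length n; blanks become '_'
    so that n-grams can cross word boundaries."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("persons", 3))  # ['per', 'ers', 'rso', 'son', 'ons']
print(ngrams("persons", 4))  # ['pers', 'erso', 'rson', 'sons']
```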

25 Stemming Stemming: remove suffixes and prefixes to find a common stem, e.g. remove –ing and –ed for verbs remove plural –s for nouns There are a number of exceptions, e.g. –ing and –ed may belong to a stem as in red or ring irregular verbs like go – went – gone, run – ran – run Approaches for stemming: rule-based approach lexicon-based approach

26 Rules for Stemming in English
Kuhlen (1977) derived a rule set for stemming of most English words: Ending Replacement Condition 1 ies y 2 XYes XY XY = Co, ch, sh, ss, zz or Xx 3 XYs XY = XC, Xe, Vy, Vo, oa or ea 4 ies' 5 Xes' X 6 Xs' 7 X 's 8 X' 9 XYing XY = CC, XV, Xx 10 XYe XY = VC 11 ied 12 XYed XY = CC, XV, Xx 13 X and Y are any letters C stands for a consonant V stands for any vowel Source: (Ferber 2003)
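The flavour of such rules can be shown with a drastically simplified stemmer; these few suffix rules and the length condition are my own simplification, not Kuhlen's actual rule set.

```python
# Drastically simplified suffix rules in the spirit of the table above;
# the real rule set attaches conditions on the letters before the ending.
RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

def stem(word):
    for ending, replacement in RULES:
        # require a remaining stem of at least two letters, so that
        # words like "red" or "ring" are left untouched
        if word.endswith(ending) and len(word) >= len(ending) + 2:
            return word[: -len(ending)] + replacement
    return word

print(stem("flies"))    # 'fly'
print(stem("cars"))     # 'car'
print(stem("injured"))  # 'injur'
print(stem("red"))      # 'red'
```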

27 Problems for Stemming In English a small number of rules covers most of the words In German it is more difficult because the stem also changes for many words insertion of Umlauts, e.g. Haus – Häuser new prefixes, e.g. laufen – gelaufen separation/retention of prefixes, e.g. mitbringen – er brachte den Brief mit überbringen – er überbrachte den Brief Irregular insertion of linking elements („Fugen“) when building compounds Schwein-kram, Schwein-s-haxe, Schwein-e-braten These are problems that cannot easily be dealt with by general rules operating only on strings

28 Lexicon-based Approaches for Stemming
Basic idea: A lexicon contains stems for word forms complete lexicon: for each possible word form the stem is stored persons – person went – go running – run going – go ran – run gone – go word stem lexicon: for each stem all the data necessary to derive all word forms are stored Distinction of different flexion classes specification of anomalies Example: To compute the stem of Flüssen, the last characters are removed successively and the Umlaut is exchanged until a valid stem is found (Lezius 1995) Case/Ending - n en sen ... normal Flüssen- Flüsse-n Flüss-en Flüs-sen Umlaut Flussen- Flusse-n Fluss-en Flus-sen Source: (Ferber 2003)

29 Index with Stemming and Stop Word Elimination
Terms Document IDs accident D1,D3 car D1, D2 cause D3 crowd D3 die D1 drive D3 four D3 heavy D1 injur D3 more D2 morning D1 people D1, D3 quarter D2 register D2 truck D3 trucker D3 vehicle D2 vienna D1, D2, D3 yesterday D1 D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna D2: more vehicles in this quarter more cars became registered in vienna D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

30 2.2 Classical Information Retrieval Models
Classical Models Boolean Model Vector Space Model Probabilistic Model Alternative Models user preferences Associative Search Social Filtering

31 Classic IR Models - Basic Concepts
Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents The importance of the index terms is represented by weights associated with them Let ti be an index term dj be a document wij be a weight associated with (ti,dj) The weight wij quantifies the importance of the index term for describing the document contents (Stop words can be regarded as terms where wij = 0 for every document) (Baeza-Yates & Ribeiro-Neto 1999)

32 Classic IR Models - Basic Concepts
ti is an index term dj is a document n is the total number of docs T = (t1, t2, …, tk) is the set of all index terms wij >= 0 is a weight associated with (ti,dj) wij = 0 indicates that the term does not belong to the doc vec(dj) = (w1j, w2j, …, wkj) is a weighted vector associated with the document dj gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ti,dj) fi is the number of documents containing term ti source: teaching material of Ribeiro-Neto

33 Index vectors as Matrix
the vectors vec(dj) = (w1j, w2j, …, wkj) associated with the documents dj can be represented as a matrix Each column represents a document vector vec(dj) = (w1j, w2j, …, wkj) the document dj contains a term ti if wij > 0 Each row represents a term vector tvec(ti) = (wi1, wi2, …, win) the term ti is in document dj if wij > 0 d1 d2 d3 d4 t1 w1,1 w1,2 w1,3 w1,4 t2 w2,1 w2,2 w2,3 w2,4 t3 w3,1 w3,2 w3,3 w3,4 ... tn wn,1 wn,2 wn,3 wn,4

34 Boolean Document Vectors
d1 d2 d3 accident 1 0 1 car 1 1 0 cause 0 0 1 crowd 0 0 1 die 1 0 0 drive 0 0 1 four 0 0 1 heavy 1 0 0 injur 0 0 1 more 0 1 0 morning 1 0 0 people 1 0 1 quarter 0 1 0 register 0 1 0 truck 0 0 1 trucker 0 0 1 vehicle 0 1 0 vienna 1 1 1 yesterday 1 0 0 d1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna d2: more vehicles in this quarter more cars became registered in vienna d3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

35 2.2.1 The Boolean Model Simple model based on set theory
precise semantics neat formalism Binary index: Terms are either present or absent. Thus, wij ∈ {0,1} Queries are specified as Boolean expressions using the operators AND (∧), OR (∨), and NOT (¬) q = ta ∧ (tb ∨ ¬tc) vehicle OR car AND accident

36 Boolean Retrieval Function
The retrieval function can be defined recursively R(ti,dj) = TRUE, if wij = 1 (i.e. ti is in dj ) = FALSE, if wij = 0 (i.e. ti is not in dj ) R(q1 AND q2,dj) = R(q1,dj) AND R(q2,dj) R(q1 OR q2,dj) = R(q1,dj) OR R(q2,dj) R(NOT q,dj) = NOT R(q,dj) The Boolean function computes only the values 0 and 1, i.e. Boolean retrieval classifies documents into two categories relevant (R = 1) irrelevant (R = 0)
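This recursive definition maps directly onto set operations over an inverted index; a sketch using the small stemmed example index (only three terms shown).

```python
# Boolean retrieval as set operations over an inverted index.
index = {
    "accident": {"d1", "d3"},
    "car": {"d1", "d2"},
    "vehicle": {"d2"},
}
all_docs = {"d1", "d2", "d3"}

def AND(a, b): return a & b
def OR(a, b): return a | b
def NOT(a): return all_docs - a

# (vehicle OR car) AND accident
print(sorted(AND(OR(index["vehicle"], index["car"]), index["accident"])))  # ['d1']
# (vehicle AND car) OR accident
print(sorted(OR(AND(index["vehicle"], index["car"]), index["accident"])))  # ['d1', 'd2', 'd3']
```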

37 Example of Boolean Retrieval
d1 d2 d3 accident 1 0 1 car 1 1 0 cause 0 0 1 crowd 0 0 1 die 1 0 0 drive 0 0 1 four 0 0 1 heavy 1 0 0 injur 0 0 1 more 0 1 0 morning 1 0 0 people 1 0 1 quarter 0 1 0 register 0 1 0 truck 0 0 1 trucker 0 0 1 vehicle 0 1 0 vienna 1 1 1 yesterday 1 0 0 Query: (vehicle OR car) AND accident R((vehicle OR car) AND accident, d1) = 1 R((vehicle OR car) AND accident, d2) = 0 R((vehicle OR car) AND accident, d3) = 0 Query: (vehicle AND car) OR accident R((vehicle AND car) OR accident, d1) = 1 R((vehicle AND car) OR accident, d2) = 1 R((vehicle AND car) OR accident, d3) = 1

38 Drawbacks of the Boolean Model
Retrieval is based on a binary decision criterion no notion of partial matching No ranking of the documents is provided (absence of a grading scale) The query q = t1 OR t2 OR t3 is satisfied by documents containing one, two or three of the terms t1, t2, t3 No weighting of terms, wij ∈ {0,1} The information need has to be translated into a Boolean expression, which most users find awkward The Boolean queries formulated by the users are most often too simplistic As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

39 2.2.2 Vector Space Model The index can be regarded as an n-dimensional space wij > 0 whenever ti ∈ dj Each term corresponds to a dimension To each term ti a unit vector vec(i) is associated The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) a document can be regarded as a vector starting from the origin Example: d1 d2 accident 4 3 car 3 2 vehicle 1 3 (4,3,1) vehicle accident car (3,2,3)

40 2.2.2.1 Coordinate Matching Documents and query are represented as
document vectors vec(dj) = (w1j, w2j, …, wkj) query vector vec(q) = (w1q,...,wkq) Vectors have binary values wij = 1 if term ti occurs in document dj wij = 0 else Ranking: Return the documents containing at least one query term rank by number of occurring query terms Ranking function: scalar product R(q,d) = q * d = Σ (i = 1..n) qi * di multiply components and sum up

41 Coordinate Matching: Example
the query vector represents the terms of the query (cf. stemming) Query: accident heavy vehicles vienna d1 d2 d3 q accident car cause crowd die drive four heavy injur more morning people quarter register truck trucker vehicle vienna yesterday Result: q * d1 = 3 q * d2 = 2 q * d3 = 2
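The coordinate-matching score is simply the number of shared terms; a sketch computing the scores 3, 2, 2 for this query, with document term sets taken from the stemmed index of slide 29.

```python
def coordinate_match(query_terms, doc_terms):
    """Scalar product of binary vectors = number of shared terms."""
    return len(set(query_terms) & set(doc_terms))

docs = {
    "d1": {"heavy", "accident", "car", "people", "die", "yesterday", "morning", "vienna"},
    "d2": {"more", "vehicle", "quarter", "car", "register", "vienna"},
    "d3": {"truck", "cause", "accident", "trucker", "drive", "crowd", "people", "four", "injur", "vienna"},
}
query = {"accident", "heavy", "vehicle", "vienna"}
for doc_id in sorted(docs):
    print(doc_id, coordinate_match(query, docs[doc_id]))
# d1 3, d2 2, d3 2
```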

42 Assessment of Coordinate Matching
Advantage compared to the Boolean Model: Ranking Three main drawbacks the frequency of terms in documents is not considered no weighting of terms larger documents are privileged

43 2.2.2.2 Term Weighting Use of binary weights is too limiting
Non-binary weights provide consideration for partial matches These term weights are used to compute a degree of similarity between a query and each document How to compute the weights wij and wiq ? A good weight must take into account two effects: quantification of intra-document contents (similarity) tf factor, the term frequency within a document quantification of inter-document separation (dissimilarity) idf factor, the inverse document frequency wij = tf(i,j) * idf(i) (Baeza-Yates & Ribeiro-Neto 1999)

44 TF - Term Frequency Let freq(i,j) be the raw frequency of term ti within document dj (i.e. the number of occurrences of term ti in document dj) A simple tf factor can be computed as f(i,j) = freq(i,j) A normalized tf factor is given by f(i,j) = freq(i,j) / max(freq(l,j)) where the maximum is computed over all terms which occur within the document dj For reasons of simplicity, in this example f(i,j) = freq(i,j) (Baeza-Yates & Ribeiro-Neto 1999)

45 IDF – Inverse Document Frequency
IDF can also be interpreted as the amount of information associated with the term ti. A term occurring in few documents is more useful as an index term than a term occurring in nearly every document Let ni be the number of documents containing term ti N be the total number of documents A simple idf factor can be computed as idf(i) = 1/ni A normalized idf factor is given by idf(i) = log (N/ni) the log is used to make the values of tf and idf comparable.

46 Example with TF and IDF In this example a simple tf factor
f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used It is advantageous to store IDF and TF separately IDF d1 d2 d3 accident car cause crowd die drive four heavy injur more morning people quarter register truck trucker vehicle vienna yesterday
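With the simple factors f(i,j) = freq(i,j) and idf(i) = 1/ni, the weights can be computed as sketched below; the document term lists follow the stemmed example and the function names are my own.

```python
def tf(term, doc_terms):
    """Raw term frequency freq(i,j)."""
    return doc_terms.count(term)

def idf(term, docs):
    """Simple inverse document frequency 1/ni."""
    n_i = sum(1 for terms in docs.values() if term in terms)
    return 1 / n_i if n_i else 0.0

def weight(term, doc_id, docs):
    """wij = tf(i,j) * idf(i)."""
    return tf(term, docs[doc_id]) * idf(term, docs)

docs = {
    "d1": "heavy accident heavy car accident people die yesterday morning vienna".split(),
    "d2": "more vehicle quarter more car register vienna".split(),
    "d3": "truck cause accident trucker drive crowd people people injur vienna".split(),
}
print(weight("accident", "d1", docs))  # 2 * 1/2 = 1.0
print(weight("vienna", "d1", docs))    # 1 * 1/3
```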

47 Indexing a new Document
Changes of the index when adding a new document d a new document vector with tf factors for d is created idf factors for terms occurring in d are adapted All other document vectors remain unchanged

48 Ranking Scalar product computes co-occurrences of terms in document and query Drawback: the scalar product privileges large documents over small ones Euclidean distance between the endpoints of the vectors Drawback: Euclidean distance privileges small documents over large ones Angle between vectors the smaller the angle between query and document vector, the more similar they are the angle is independent of the size of the document the cosine is a good measure of the angle t1 t2 q d

49 Cosine Ranking Formula
t1 t2 q dj the more the directions of query q and document dj coincide, the more relevant is dj the cosine formula takes into account the ratio of the terms, not their concrete number Let θ be the angle between q and dj Because all values wij >= 0, the angle θ is between 0° and 90° the larger θ, the smaller cos θ the smaller θ, the larger cos θ cos 0° = 1 cos 90° = 0 cos(q,dj) = (q · dj) / (|q| · |dj|)
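For the two document vectors of the slide-39 example, the cosine formula can be sketched as follows; the query vector (asking for accident and vehicle) is my own choice for illustration.

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document vector."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

# dimensions: (accident, car, vehicle) as in the slide-39 example
d1 = [4, 3, 1]
d2 = [3, 2, 3]
q = [1, 0, 1]   # assumed query: accident, vehicle
print(round(cosine(q, d1), 3))  # 0.693
print(round(cosine(q, d2), 3))  # 0.905
```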

50 The Vector Model The best term-weighting schemes use weights which are given by wij = f(i,j) * log(N/ni) this strategy is called a tf-idf weighting scheme For the query term weights, a suggestion is wiq = (0.5 + 0.5 * freq(i,q) / max(freq(l,q))) * log(N/ni) (Baeza-Yates & Ribeiro-Neto 1999)

51 The Vector Model The vector model with tf-idf weights is a good ranking strategy for general collections The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute. Advantages: term-weighting improves the quality of the answer set partial matching allows retrieval of docs that approximate the query conditions the cosine ranking formula sorts documents according to their degree of similarity to the query Disadvantages: assumes independence of index terms (??); not clear that this is bad though (Baeza-Yates & Ribeiro-Neto 1999)

52 2.2.3 Extensions of the Classical Models
Combination of Boolean model vector model indexing with and without preprocessing Extended index with additional information like document format (.doc, .pdf, …) language Using information about links in hypertext link structure anchor text

53 Boolean Operators in the Vector Model
Many search engines allow queries with Boolean operators Retrieval: Boolean operators are used to select relevant documents in the example, only documents containing „accident“ and either „vehicle“ or „car“ are considered relevant ranking of the relevant documents is based on the vector model tf-idf weighting cosine ranking formula d1 d2 d3 q accident car cause crowd die drive four heavy injur more morning people quarter register truck trucker vehicle vienna yesterday (vehicle OR car) AND accident

54 Queries with Wild Cards in the Vector Model
d1 d2 d3 q accident car cars causes crowd died drove four heavy injured more morning people quarter registered truck trucker vehicles vienna yesterday vector model based on an index without preprocessing the index contains all word forms occurring in the documents Queries allow wildcards (masking and truncation) Principle of query answering First, wildcards are expanded to all matching terms (here vehicle* matches „vehicles“) then ranking according to the vector model accident heavy vehicle* vienna

55 Using Link Information in Hypertext
Ranking: link structure is used to calculate a quality ranking for each web page PageRank® HITS – Hypertext Induced Topic Selection (Authority and Hub) Hilltop Indexing: text of a link (anchor text) is associated both with the page the link is on and with the page the link points to

56 The PageRank Calculation
PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 19981) PageRank uses the link structure of web pages Original version of the PageRank calculation: PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) with PR(A) being the PageRank of page A, PR(Ti) being the PageRank of the pages Ti that contain a link to page A C(Ti) being the number of links going out of page Ti d being a damping factor with 0 <= d <= 1 1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems. Vol. 30, 1998

57 The PageRank Calculation - Explanation
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: This means that the more outbound links a page Ti has, the less will page A benefit from a link to it on page Ti. The weighted PageRanks of the pages Ti are then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank. Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d which can be set between 0 and 1.

58 Damping Factor and the Random Surfer Model
The PageRank algorithm and the damping factor are motivated by the model of a random surfer.
The random surfer finds a page A by following a link from a page Ti to page A, or by choosing a web page at random (e.g. typing the URL).
The probability that the random surfer clicks on a particular link is given by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti).
The justification of the damping factor is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.
d is the probability that the random surfer keeps clicking on links – this is why the sum of PageRanks is multiplied by d.
(1-d) is the probability that the surfer jumps to another page at random after he stops clicking links.
Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so every page has a minimum PageRank. (According to Brin and Page, d = 0.85 is a good value.)

59 Calculation of the PageRank - Example
We regard a small web consisting of only three pages A, B and C and the link structure shown in the figure.
To keep the calculation simple, d is set to 0.5.
These are the equations for the PageRank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Solving these equations we get the following PageRank values for the single pages:
PR(A) = 14/13 = 1.077
PR(B) = 10/13 = 0.769
PR(C) = 15/13 = 1.154

60 Iterative Calculation of the PageRank - Example
Because of the size of the actual web, the Google search engine uses an approximate, iterative computation of PageRank values:
each page is assigned an initial starting value
the PageRanks of all pages are then calculated in several computation cycles
Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.75 1.125
2 1.0625 0.765625 1.1484375
3 1.07421875 0.76855469 1.15283203
... (the values converge to 14/13, 10/13 and 15/13)
According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
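The iterative calculation above can be sketched as a short Python function. This is an illustrative sketch of the original PageRank formula with in-place updates; the page names and the `pagerank` helper are assumptions, not Google's implementation.

```python
def pagerank(links, d=0.5, iterations=100):
    """Iteratively compute PageRank values.

    links maps each page to the list of pages it links to.
    Implements the original formula PR(A) = (1-d) + d * sum(PR(T)/C(T))
    with in-place updates, starting from an initial value of 1 per page.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        for page in pr:
            # pages T that contain a link to this page
            inbound = [t for t in links if page in links[t]]
            pr[page] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound)
    return pr

# The three-page example: A links to B and C, B links to C, C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))  # ≈ {'A': 1.0769, 'B': 0.7692, 'C': 1.1538}
```

The first iterations reproduce the table above, and the values converge to the exact solutions 14/13, 10/13 and 15/13.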

61 Alternative Link Analysis Algorithms (I): HITS
Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999.
HITS rates web pages for their authority and hub values:
The authority value estimates the value of the content of the page; a good authority is a page that is pointed to by many good hubs.
The hub value estimates the value of its links to other pages; a good hub is a page that points to many good authorities (examples of hubs are good link collections).
Every page i is assigned a hub weight hi and an authority weight ai, defined mutually recursively: ai is the sum of the hub weights of the pages linking to page i, and hi is the sum of the authority weights of the pages that page i links to.
Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 46, No. 5, 1999.
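The mutually recursive hub and authority weights can be computed by power iteration, as sketched below. The link structure and normalisation details are illustrative assumptions.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores by power iteration (HITS).

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a good authority is pointed to by many good hubs
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # a good hub points to many good authorities
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalise so the weights do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Small example: C is pointed to by A and B, so it becomes the top authority;
# A points to B and C, so it becomes the top hub.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
hub, auth = hits(links)
```

Unlike PageRank, the scores are query-dependent in Kleinberg's original formulation (computed on a subgraph around the query results); the sketch applies the iteration to a fixed graph.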

62 Alternative Link Analysis Algorithms (II): Hilltop
The Hilltop algorithm1) rates documents based on their incoming links from so-called expert pages.
Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic.
Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.
Websites which have backlinks from many of the best expert pages are authorities and are ranked high.
A good directory page is an example of an expert page (cf. hubs).
The determination of expert pages is a central point of the Hilltop algorithm.
1) The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003 Google bought the patent of the algorithm.

63 Anchor-Text The Google search engine uses the text of links twice
First, the text of a link is associated with the page that the link is on.
In addition, it is associated with the page the link points to.
Advantages:
Anchors provide an additional description of a web page – from a user's point of view.
Documents without text can be indexed, such as images, programs, and databases.
Disadvantage: search results can be manipulated (cf. Google bombing1))
Example anchor text: "The polar bear Knut was born in the zoo of Berlin"
A Google bomb influences the ranking of the search engine. It is created when a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. In the meantime, Google bombs have been defused by Google.

64 Natural Language Queries
The query "i need information about accidents with cars and other vehicles" is equivalent to "information accident car vehicle".
Natural language queries are treated like any other query:
stop word elimination
stemming
but no interpretation of the meaning of the query
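The two steps above can be sketched as follows. The stop-word list is a made-up fragment, and the crude plural-stripping rule stands in for a real stemmer such as Porter's; both are assumptions for illustration.

```python
# Hypothetical stop-word list (real systems use much larger lists)
STOP_WORDS = {"i", "need", "about", "with", "and", "other"}

def normalise_query(query):
    """Reduce a natural language query to index terms:
    stop-word elimination plus a naive plural-stripping 'stemmer'."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]   # crude stemming: strip a trailing plural 's'
        terms.append(word)
    return terms

print(normalise_query("I need information about accidents with cars and other vehicles"))
# → ['information', 'accident', 'car', 'vehicle']
```

Note that nothing here interprets the meaning of the query; the words are merely normalised and matched.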

65 Searching Similar Documents
It is often difficult to express the information need as a query.
An alternative search method is to search for documents similar to a given document d.

66 Finding Similar Documents – Principle and Example
Example: Find the documents most similar to d1
[Term–document matrix with IDF weights for d1, d2, d3 over the terms accident, car, cause, crowd, die, drive, four, heavy, injur, more, morning, people, quarter, register, truck, trucker, vehicle, vienna, yesterday; the weight values were lost in transcription]
Principle:
Use a given document d as a query
Compare all documents di with d
Example (scalar product):
IDF * d1 * d2 = 0.83
IDF * d1 * d3 = 2.33
The approach is the same as for a query:
same index
same ranking function
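The document-as-query principle can be sketched with a scalar product over term-weight vectors. The weights below are invented for illustration (the original slide's values were lost); only the mechanism matters.

```python
def dot_product(a, b):
    """Scalar-product similarity of two term-weight vectors (dicts)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

# Hypothetical tf*idf weights for the three example documents
d1 = {"accident": 1.0, "car": 0.5, "die": 0.5, "people": 0.3, "vienna": 0.2}
d2 = {"car": 0.5, "register": 1.0, "vienna": 0.2}
d3 = {"accident": 1.0, "truck": 1.0, "crowd": 0.8, "people": 0.3, "vienna": 0.2}

# Using d1 itself as the query: d3 shares more weighted terms with d1 than d2 does
print(dot_product(d1, d2))
print(dot_product(d1, d3))
```

Because the document is simply treated as a (long) query, the same index and the same ranking function can be reused unchanged.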

67 The Vector Space Model The vector space model ...
is relatively simple and clear,
is efficient,
ranks documents,
can be applied to any collection of documents.
The model has many heuristic components and parameters, e.g.
determination of index terms
calculation of tf and idf
ranking function
The best parameter setting depends on the document collection.

68 2.3 Implementation of the Index
The vector space model is usually implemented with an inverted index.
For each term, a pointer references a "posting list" with an entry for each document containing the term.
The posting lists can be implemented as linked lists or as more efficient data structures that reduce the storage requirements (index pruning).
To answer a query, the corresponding posting lists are retrieved and the documents are ranked, i.e. efficient retrieval of posting lists is essential.
Source: D. Grossman, O. Frieder (2004) Information Retrieval, Springer-Verlag
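A minimal inverted index can be sketched as follows; the documents and the (doc id, term frequency) posting format are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index: term -> posting list of (doc_id, term frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # materialise each posting list as a sorted list of (doc_id, tf) pairs
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = {
    "d1": "heavy accident in vienna",
    "d2": "more cars registered in vienna",
    "d3": "truck causes accident in vienna",
}
index = build_inverted_index(docs)
print(index["accident"])   # → [('d1', 1), ('d3', 1)]
print(index["vienna"])     # → [('d1', 1), ('d2', 1), ('d3', 1)]
```

To answer a query, only the posting lists of the query terms are fetched and merged, so the documents themselves are never scanned at query time.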

69 Implementing the Term Structure as a Trie
Sequentially scanning the index for query terms/posting lists is inefficient.
A trie is a tree structure:
each node is an array with one element for each character*)
each element contains a link to another node
Example: structure of a node in a trie
*) the characters and their order are identical for each node; therefore they do not need to be stored explicitly.
Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen – Eine Einführung mit Java. dpunkt Verlag 2004

70 The Index as a Trie The leaves of the trie are the index terms, pointing to the corresponding posting lists. Searching a term in a trie: the search starts at the root; subsequently, for each character of the term, the reference to the corresponding subtree is followed, until either a leaf with the term is found or the search stops without success. (Saake, Sattler 2004)
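The lookup procedure can be sketched in Python. One simplification to note: the sketch uses a dict for the child references instead of the fixed per-character array described on the previous slide.

```python
class TrieNode:
    """A trie node: maps single characters to child nodes;
    nodes that end an index term carry its posting list."""
    def __init__(self):
        self.children = {}
        self.postings = None   # posting list for index terms, None otherwise

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, postings):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = postings

    def search(self, term):
        """Follow one child reference per character of the term;
        return the posting list, or None if the search fails."""
        node = self.root
        for ch in term:
            if ch not in node.children:
                return None            # search stops without success
            node = node.children[ch]
        return node.postings

trie = Trie()
trie.insert("accident", {"d1", "d3"})
trie.insert("car", {"d2"})
print(trie.search("accident"))  # → {'d1', 'd3'}
print(trie.search("heavy"))     # → None
```

The lookup cost depends only on the length of the term, not on the size of the vocabulary.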

71 Patricia Trees Idea: Skip irrelevant parts of terms
This is achieved by storing in each node the number of characters to be skipped.
Example: (Saake, Sattler 2004)
Patricia = Practical Algorithm To Retrieve Information Coded In Alphanumeric

72 2.4 Evaluating Search Methods
[Venn diagram: the set of all documents contains the set of relevant documents and the set of found documents; their intersection is the set of relevant documents found, while relevant documents outside the intersection are the relevant documents that are not found]

73 Performance Measures of Information Retrieval: Recall and Precision
answer set DA, relevant documents DR, relevant documents in the answer set DRA
Several different measures for evaluating the performance of information retrieval systems have been proposed; two important ones are:
Recall: fraction of the relevant documents that are successfully retrieved:
R = |DRA| / |DR|
Precision: fraction of the documents retrieved that are relevant to the user's information need:
P = |DRA| / |DA|
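Both measures are one-liners over the sets defined above; the document ids and relevance judgements in the example are invented for illustration.

```python
def recall(relevant, answer):
    """Fraction of the relevant documents that are successfully retrieved."""
    return len(relevant & answer) / len(relevant)

def precision(relevant, answer):
    """Fraction of the retrieved documents that are relevant."""
    return len(relevant & answer) / len(answer)

relevant = {"d1", "d3", "d5", "d7"}   # hypothetical relevance judgements (DR)
answer = {"d1", "d2", "d3"}           # hypothetical answer set (DA)
print(recall(relevant, answer))       # → 0.5   (2 of 4 relevant docs found)
print(precision(relevant, answer))    # → 0.666... (2 of 3 retrieved docs relevant)
```

Note the trade-off: returning more documents can only raise recall, but typically lowers precision.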

74 F-Measure The F-measure is a mean of precision and recall
F = 2 * P * R / (P + R)
In this version, precision and recall are equally weighted. The more general version allows giving preference to recall or precision:
Fβ = (1 + β²) * P * R / (β² * P + R)
F2 weights recall twice as much as precision
F0.5 weights precision twice as much as recall
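The general Fβ formula can be written directly as a small function:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))           # → 0.5 (balanced F1)
print(f_measure(0.25, 1.0, beta=2))  # recall-oriented F2
```

Being a harmonic mean, the F-measure is dragged down by whichever of the two values is smaller, so a system cannot score well by maximising only one of them.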

75 Computing Recall and Precision
Evaluation:
Perform a predefined set of queries; the search engine delivers a ranked list of documents.
Use the first X documents of the result list as the answer set.
Compute recall and precision for the first X documents of the ranked result list.
How do you know which documents are relevant?
A general reference set of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems.
For companies it is more important to evaluate information retrieval systems using their own documents:
Collect a representative set of documents
Specify queries and associated relevant documents
Evaluate search engines by computing recall and precision for the query results

76 2.5 User Adaptation Take into account information about a user to filter documents particularly relevant to this user
Relevance Feedback: retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries
Explicit User Profiles: subscription; user-specific weights of terms
Social Filtering: similar users get similar documents

77 2.5.1 Relevance Feedback given by the User
The user specifies the relevance of each document. Example: for the query "Pisa" only the documents about the education assessment are regarded as relevant.
In the next pass, the top-ranked documents are only about the education assessment.
This example is from the SmartFinder system from empolis. The mindaccess system from Insiders GmbH uses the same technology.

78 Relevance Feedback: Probabilistic Model
Assumption: Given a user query, there is an ideal answer set.
Idea: An initial answer is iteratively improved based on user feedback.
Approach:
An initial set of documents is retrieved somehow.
The user inspects these documents looking for the relevant ones (usually, only the top-ranked documents need to be inspected).
The IR system uses this information to refine the description of the ideal answer set.
By repeating this process, it is expected that the description of the ideal answer set will improve.
The description of the ideal answer set is modeled in probabilistic terms. (Baeza-Yates & Ribeiro-Neto 1999)

79 Probabilistic Ranking
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only.
Probabilistic ranking is:
sim(dj, q) = P(R | dj) / P(R̄ | dj)
Definitions:
wij ∈ {0,1} (i.e. weights are binary)
sim(dj, q) is the similarity of document dj to the query q
P(R | dj) is the probability that document dj is relevant
P(R̄ | dj) is the probability that document dj is not relevant
dj is the document vector of dj
(Baeza-Yates & Ribeiro-Neto 1999)

80 Computing Probabilistic Ranking
Probabilistic ranking can be computed as:
sim(dj, q) ≈ Σi wiq * wij * ( log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | R̄)) / P(ki | R̄) ) )
where
P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R of relevant documents
P(ki | R̄) stands for the probability that the index term ki is present in a document randomly selected from the set of non-relevant documents
wiq is the weight of term ki in the query
wij is the weight of term ki in document dj
(Baeza-Yates & Ribeiro-Neto 1999)

81 Relevance Feedback: Probabilistic Model
The probabilities that a term ki is present in the set of relevant (non-relevant) documents can be estimated as:
P(ki | R) = Vi / V
P(ki | R̄) = (ni − Vi) / (N − V)
with
N the total number of documents
ni the number of documents containing term ki
V the number of relevant documents retrieved by the probabilistic model
Vi the number of relevant documents containing term ki
There are different ways to find the relevant documents V:
Automatically: V can be specified as the top r documents found
By user feedback: the user specifies for each retrieved document whether it is relevant or not
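The estimates above can be combined into the per-term weight of the ranking formula. The sketch below adds the common +0.5 smoothing (an assumption not stated on the slide) to avoid division by zero for small V; the example counts are invented.

```python
import math

def term_weight(N, n_i, V, V_i):
    """Relevance-feedback term weight from the probabilistic model.

    N   total number of documents
    n_i number of documents containing term ki
    V   number of retrieved documents judged relevant
    V_i number of relevant documents containing ki
    Uses +0.5 smoothing so the logs are always defined.
    """
    p_rel = (V_i + 0.5) / (V + 1)                # estimate of P(ki | R)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)   # estimate of P(ki | not R)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

# Hypothetical numbers: 1000 docs, term in 50 of them,
# user judged 10 documents relevant, 8 of those contain the term
print(term_weight(N=1000, n_i=50, V=10, V_i=8))
```

A term that is frequent in the judged-relevant documents but rare in the collection receives a large positive weight, which is exactly the feedback effect the model is after.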

82 2.5.2 Explicit User Profiles
Idea: Use knowledge about the user to provide information that is particularly relevant for him/her:
users specify topics of interest as a set of terms
these terms represent the user profile
documents containing the terms of the user profile are preferred
[Diagram: profile acquisition turns the user's information need/preferences into a user profile; document representation turns the documents into an index; the ranking function compares both]

83 User profiles for subscribing to information
[Term–document matrix with IDF weights for d1, d2, d3 and user profiles U1, U2 over the terms accident, car, cause, crowd, die, drive, four, heavy, injur, more, morning, people, quarter, register, truck, trucker, vehicle, vienna, yesterday; the weight values were lost in transcription]
User profiles are treated as queries. Example: news feed
As soon as a new document arrives, it is tested for similarity with the user profiles:
the vector space model can be applied
a document is regarded as relevant if the ranking reaches a specified threshold
Example:
User 1 is interested in any car accident
User 2 is interested in deadly car accidents with trucks

84 User profiles for Individual Queries
Users specify the importance of terms in their profiles.
User profiles are used as additional term weights: different rankings for different users.
[Term–document matrix with IDF weights for d1, d2, d3, user profiles U1, U2 and query q; the weight values were lost in transcription]
Example ranking for user 1:
IDF * d1 * U1 * q = 1.4
IDF * d2 * U1 * q = 1
IDF * d3 * U1 * q = 0.5
Ranking for user 2:
IDF * d1 * U2 * q = 2.2
IDF * d2 * U2 * q = 0.1
IDF * d3 * U2 * q = 0.5
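Profile-weighted ranking multiplies one more factor into the usual term-by-term product. All weights below are invented for illustration; the slide's original values were lost.

```python
def profile_ranking(idf, doc, profile, query):
    """Rank a document with user-specific term weights:
    sum over query terms of idf * document weight * profile weight * query weight.
    Terms absent from the profile get a neutral weight of 1."""
    return sum(
        idf.get(t, 0.0) * doc.get(t, 0.0) * profile.get(t, 1.0) * w
        for t, w in query.items()
    )

# Hypothetical weights: user 1 cares about any car accident,
# user 2 about deadly truck accidents
idf = {"accident": 1.0, "car": 0.7, "truck": 1.5, "die": 1.5}
doc = {"accident": 1, "car": 1, "die": 1}
user1 = {"accident": 2.0, "car": 2.0}
user2 = {"truck": 2.0, "die": 2.0}
query = {"accident": 1, "car": 1}
print(profile_ranking(idf, doc, user1, query))  # higher score for user 1
print(profile_ranking(idf, doc, user2, query))
```

The same document and the same query thus produce different scores for different users, which is exactly the point of the profile weights.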

85 Acquisition and Maintenance of User Profiles
There are different ways to specify user profiles:
Manual: the user specifies topics of interest (and weights) explicitly, by selecting predefined terms or by a query. Problem: maintenance.
User feedback: the user collects relevant documents; terms in the selected documents are regarded as important. Problem: how to motivate the user to give feedback (a similar approach is used by spam filters – classification).
Heuristics: observing user behaviour. Example: if a user has had a document open for a long time, it is assumed that he/she read it and therefore it might be relevant. Problem: heuristics might be wrong.

86 Social Filtering Idea: Information is relevant if other users who showed similar behaviour regarded the information as relevant
Relevance is specified by the users
User profiles are compared
Example: A simple variant can be found at Amazon:
purchases of books and CDs are stored
"people who bought this book also bought ..."
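The "people who bought this also bought …" variant can be sketched as simple co-occurrence counting over purchase histories. The purchase data and item names are invented; real recommenders use far more elaborate similarity measures.

```python
from collections import Counter

def also_bought(purchases, item, top_n=3):
    """'People who bought this also bought ...': count how often other
    items co-occur with the given item across user purchase histories."""
    counts = Counter()
    for user_items in purchases.values():
        if item in user_items:
            counts.update(i for i in user_items if i != item)
    return [i for i, _ in counts.most_common(top_n)]

# Hypothetical purchase histories
purchases = {
    "alice": {"book_a", "book_b", "cd_x"},
    "bob": {"book_a", "book_b"},
    "carol": {"book_a", "cd_y"},
}
print(also_bought(purchases, "book_a"))  # book_b co-occurs most often
```

Note that no content analysis of the items is involved: the recommendation rests purely on the behaviour of similar users.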

87 Query According to the Informatics Approach
Formulate the following information need as a query according to the informatics approach: "Find all documents ...
that contain advice on credit assessment for persons from Bärlund."
that are about a house by the sea"
on the topic of document management."
about the management of data in a database"

88 Assume you made a search in Google
Assume you made a search in Google. Of the 150 documents found, you can use 9 for your work; the remaining documents are useless.
Which measure can be used to evaluate the result? Compute the measure.
Which information is missing to make a complete evaluation of the result?

