Presentation on theme: "2 Information Retrieval" (Prof. Dr. Knut Hinkelmann, Information Retrieval and Knowledge Organisation). Presentation transcript:

1 2 Information Retrieval

2 Motivation
Information Retrieval has been a field of activity for many years, but it was long seen as an area of narrow interest. The advent of the Web changed this perception:
- universal repository of knowledge
- free (low-cost) universal access
- no central editorial board
- many problems, though: IR is seen as key to finding the solutions!

3 Motivation
Information Retrieval: representation, storage, organization of, and access to information items. The emphasis is on the retrieval of information (not data); the focus is on the user's information need.
- Example: find all documents containing information about car accidents which
  - happened in Vienna
  - had people injured
The information need is expressed as a query.

4 Generic Schema of an Information System
Information Retrieval systems do not search through the documents themselves but through a representation of them (also called index, meta-data or description). source: (Ferber 2004)
(Diagram: information resources → representation of resources (index/meta-data); user → representation of information need (query); comparison (ranking) between the two representations.)

5 Example
Information need: documents containing information about accidents with heavy vehicles in Vienna
Query: accident heavy vehicles vienna
Documents:
- D1: Heavy accident. Because of a heavy car accident 4 people died yesterday morning in Vienna.
- D2: More vehicles. In this quarter more cars became registered in Vienna.
- D3: Truck causes accident. In Vienna a trucker drove into a crowd of people. Four people were injured.
Expected result: document D3, but:
- not all terms of the query occur in the document
- the query terms „accident" and „heavy" also occur in D1

6 Retrieval System
Each document is represented by a set of representative keywords or index terms; an index term is a document word useful for remembering the document's main themes.
Indexing: documents from the document resources are assigned IDs and stored; indexing terms are extracted, and the index is stored in an efficient system or data structure.
Retrieval (search): the user enters the information need as a text query at the interface; query processing derives the query terms; queries are answered using the index; ranking produces the answer, a sorted list of document IDs, with which the documents can be retrieved.

7 Indexing
Manual indexing – key words
- the user specifies key words he/she assumes useful
- usually, key words are nouns, because nouns have meaning by themselves
- there are two possibilities:
  1. the user can assign any terms
  2. the user can select from a predefined set of terms (→ controlled vocabulary)
Automatic indexing – full text search
- search engines assume that all words are index terms (full text representation)
- the system generates index terms from the words occurring in the text

8 Automatic Indexing: 1. Decompose a Document into Terms
Rules determine how texts are decomposed into terms by defining separators like
- punctuation marks, blanks or hyphens
Additional preprocessing, e.g.
- exclude specific strings (stop words, numbers)
- generate normal form
- stemming
- substitute characters (e.g. upper case – lower case, Umlaut)
D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles in this quarter more cars became registered in vienna
D3: truck causes accident in vienna a trucker drove into a crowd of people four people were injured

9 Automatic Indexing: 2. Index Represented as an Inverted List
For each term: the list of documents in which the term occurs. Additional information can be stored with each document, like
- frequency of occurrence
- positions of occurrence
Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…
An inverted list is similar to an index in a book.
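
The construction above can be sketched in a few lines of Python; the document texts are the three example documents from the slides (already lower-cased and tokenized by whitespace, a simplifying assumption):

```python
# Build an inverted list: map each term to the document IDs containing it.
docs = {
    "D1": "heavy accident because of a heavy car accident 4 people "
          "died yesterday morning in vienna",
    "D2": "more vehicles in this quarter more cars became registered in vienna",
    "D3": "truck causes accident in vienna a trucker drove into a crowd "
          "of people four people were injured",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
```

A set is used per term so that repeated occurrences within one document produce a single posting; frequencies or positions would require lists instead.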

10 Index as Inverted List with Frequency
term        (document; frequency)
a           (D1,1) (D3,2)
accident    (D1,2) (D3,1)
became      (D2,1)
because     (D1,1)
car         (D1,1)
cars        (D2,1)
died        (D1,1)
heavy       (D1,2)
in          (D1,1) (D2,2) (D3,1)
more        (D2,2)
of          (D1,1)
people      (D1,1) (D3,2)
quarter     (D2,1)
registered  (D2,1)
truck       (D3,1)
vehicles    (D2,1)
…
In this example the inverted list contains the document identifier and the frequency of the term in the document.

11 Problems of Information Retrieval
Word form
- A word can occur in different forms, e.g. singular or plural.
- Example: a query for „car" should also find documents containing the word „cars".
Meaning
- A single term can have different meanings; on the other hand, the same meaning can be expressed using different terms.
- Example: when searching for „car", documents containing „vehicle" should also be found.
Wording, phrases
- The same issue can be expressed in various ways.
- Example: searching for „motorcar" should also find documents containing „motorized car".

12 Word Forms
Flexion: conjugation and declension of a word
- car – cars
- run – ran – running
Derivations: words having the same stem
- form – format – formation
Compositions (compound words)
- information management – management of information
- In German, compositions are written as single words, sometimes with a hyphen: Informationsmanagement, Informations-Management

13 Word Meaning and Phrases
Dealing with words having the same or similar meaning:
Synonyms
- record – file – dossier
- seldom – not often
Variants in spelling (e.g. BE vs. AE)
- organisation – organization
- night – nite
Abbreviations
- UN – United Nations
Polysemes: words with multiple meanings
- bank

14 2.1 Dealing with Word Forms and Phrases
We distinguish two ways to deal with word forms and phrases:
Indexing without preprocessing
- all occurring word forms are included in the index
- different word forms are unified at search time
- → string operations
Indexing with preprocessing
- unification of word forms during indexing
- index terms are normal forms of the occurring word forms
- the index is largely independent of the concrete formulation of the text
- → computational-linguistics approach

15 Indexing Without Preprocessing
Index: contains all the word forms occurring in the documents
Query:
- Searching for specific word forms is possible (e.g. searching for „cars" but not for „car")
- To search for different word forms, string operations can be applied
- Operators for truncation and masking, e.g.
  - ? covers exactly one character
  - * covers an arbitrary number of characters
- Context operators, e.g.
  - [n] exact distance between terms
  - a further operator for the maximal distance between terms

16 Index Without Preprocessing and Query
Query: vehicle? car? people
Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…

17 Truncation and Masking: Searching for Different Word Forms
Truncation: wildcards cover characters at the beginning or end of words – prefix or suffix
- schreib* finds schreiben, schreibt, schreibst, schreibe, …
- ??schreiben finds anschreiben, beschreiben, but not verschreiben
Masking deals with characters inside words – in German in particular, declension and conjugation affect not only suffix and prefix
- schr??b* can find schreiben, schrieb
- h??s* can find Haus, Häuser
Disadvantage: with truncation and masking, not only the intended words are found
- schr??b* also finds schrauben
- h??s* also finds Hans, Hanse, hausen, hassen and also words in other languages like horse
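
At search time such patterns are expanded against the index vocabulary. A minimal sketch: Python's `fnmatch` happens to use the same operators (`?` for exactly one character, `*` for arbitrarily many); the vocabulary below is an assumed excerpt, not from the slides:

```python
from fnmatch import fnmatchcase

# Assumed vocabulary excerpt for demonstration.
vocabulary = ["schreiben", "schreibt", "anschreiben", "beschreiben",
              "verschreiben", "schrauben", "schrieb"]

def expand_wildcard(pattern, vocabulary):
    """Return all vocabulary terms covered by a truncation/masking pattern."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]
```

Note that `expand_wildcard("schr??b*", vocabulary)` also returns "schrauben", illustrating the disadvantage stated above.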

18 Context Operators
Context operators allow searching for variations of text phrases
- exact word distance: Bezug [3] Telefonat finds „Bezug nehmend auf unser Telefonat"
- maximal word distance: a query for „text retrieval" with a maximal distance finds „text retrieval" as well as „text and fact retrieval"
For context operators to be applicable, the positions of the words must be stored in the index.
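
A sketch of how a positional index supports such operators; the reading of `[n]` as "exactly n words between the two terms" is an assumption consistent with the Bezug example:

```python
# Positional index: each term maps to the list of its word positions.
def positional_index(text):
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def in_context(index, a, b, n, exact=True):
    """True if b occurs after a with exactly (or at most) n words in between."""
    for pa in index.get(a, []):
        for pb in index.get(b, []):
            between = pb - pa - 1
            if between >= 0 and (between == n if exact else between <= n):
                return True
    return False

idx = positional_index("Bezug nehmend auf unser Telefonat")
```

Here `in_context(idx, "bezug", "telefonat", 3)` succeeds because three words ("nehmend auf unser") lie between the two terms.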

19 Indexing Without Preprocessing
Efficiency
- efficient indexing
- overhead at retrieval time to apply the string operators
Word forms
- the user has to codify all possible word forms and phrases in the query using truncation and masking operators
- no support given by the search engine
- the retrieval engine is language independent
Phrases
- variants in text phrases can be coded using context operators

20 Preprocessing of the Index – Computational-Linguistics Approach
Each document is represented by a set of representative keywords or index terms. An index term is a document word useful for remembering the document's main themes. The index contains standard forms of useful terms:
1. Restrict the allowed terms
2. Normalisation: map terms to a standard form

21 Restricting Allowed Index Terms
Objective:
- increase efficiency and effectiveness by neglecting terms that do not contribute to the assessment of a document's relevance
There are two possibilities to restrict the allowed index terms:
1. Explicitly specify the allowed index terms → controlled vocabulary
2. Specify terms that are not allowed as index terms → stop words

22 Stop Words
Stop words are terms that are not stored in the index. Candidates for stop words are
- words that occur very frequently
  - a term occurring in every document is useless as an index term, because it does not tell anything about which documents the user might be interested in
  - a word which occurs in only 0.001% of the documents is quite useful, because it narrows down the space of documents which might be of interest to the user
- words with no/little meaning
- terms that are not words (e.g. numbers)
Examples:
- general: articles, conjunctions, prepositions, auxiliary verbs (to be, to have) – they occur very often and in general have no meaning as a search criterion
- application-specific stop words are also possible

23 Normalisation of Terms
There are various possibilities to compute standard forms:
- N-grams
- stemming: removing suffixes or prefixes

24 N-Grams
Index: sequences of characters of length N
- Example: „persons"
  - 3-grams (N=3): per, ers, rso, son, ons
  - 4-grams (N=4): pers, erso, rson, sons
N-grams can also cross word boundaries
- Example: „persons from switzerland"
  - 3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and
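
The generation of such n-grams can be sketched as a one-liner; replacing blanks with `_` lets grams cross word boundaries as in the example above:

```python
def ngrams(text, n):
    """All character n-grams of text; spaces become '_' so that
    grams may cross word boundaries."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

For example, `ngrams("persons", 3)` yields per, ers, rso, son, ons.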

25 Stemming
Stemming: remove suffixes and prefixes to find a common stem, e.g.
- remove -ing and -ed for verbs
- remove the plural -s for nouns
There are a number of exceptions, e.g.
- -ing and -ed may belong to a stem, as in red or ring
- irregular verbs like go – went – gone, run – ran – run
Approaches for stemming:
- rule-based approach
- lexicon-based approach

26 Rules for Stemming in English
Kuhlen (1977) derived a rule set for stemming most English words (X and Y are any letters, C stands for a consonant, V stands for any vowel):
    Ending  Replacement  Condition
 1  ies     y
 2  XYes    XY           XY = Co, ch, sh, ss, zz or Xx
 3  XYs     XY           XY = XC, Xe, Vy, Vo, oa or ea
 4  ies'    y
 5  Xes'    X
 6  Xs'     X
 7  X's     X
 8  X'      X
 9  XYing   XY           XY = CC, XV, Xx
10  XYing   XYe          XY = VC
11  ied     y
12  XYed    XY           XY = CC, XV, Xx
13  XYed    XYe          XY = VC
Source: (Ferber 2003)
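
A minimal rule-based stemmer in the spirit of this table; only a handful of rules, applied first-match and without the letter-class conditions, so it is a sketch rather than an implementation of Kuhlen's full rule set:

```python
# Simplified suffix-replacement rules (no condition checks!).
RULES = [
    ("ies", "y"),   # cf. rule 1:  flies -> fly
    ("ied", "y"),   # cf. rule 11: carried -> carry
    ("ing", ""),    # cf. rule 9 (simplified)
    ("ed",  ""),    # cf. rule 12 (simplified)
    ("es",  ""),    # cf. rule 2 (simplified)
    ("s",   ""),    # cf. rule 3 (simplified)
]

def stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for ending, replacement in RULES:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    return word
```

Because the conditions are omitted, words like "ring" would be mangled; the real rules use the letter-class conditions in the table to avoid this.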

27 Problems for Stemming
In English a small number of rules covers most of the words. In German it is more difficult, because for many words the stem changes as well:
- insertion of Umlauts, e.g. Haus – Häuser
- new prefixes, e.g. laufen – gelaufen
- separation/retention of prefixes, e.g.
  - mitbringen – er brachte den Brief mit
  - überbringen – er überbrachte den Brief
- irregular insertion of linking elements (Fugen) when building composita
  - Schwein-kram, Schwein-s-haxe, Schwein-e-braten
These problems cannot easily be dealt with by general rules operating only on strings.

28 Lexicon-based Approaches for Stemming
Principal idea: a lexicon contains stems for word forms.
Complete lexicon: for each possible word form the stem is stored
- persons – person, went – go, running – run, going – go, ran – run, gone – go
Word stem lexicon: for each stem all the data necessary to derive all its word forms are stored
- distinction of different flexion classes
- specification of anomalies
- Example: to compute the stem of „Flüssen", the last characters are removed successively (Flüsse-n, Flüss-en, Flüs-sen, …) and the Umlaut is exchanged (Flusse-n, Fluss-en, Flus-sen, …) until a valid stem is found (Lezius 1995)
Source: (Ferber 2003)

29 Index with Stemming and Stop Word Elimination
Documents:
- D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
- D2: more vehicles in this quarter more cars became registered in vienna
- D3: truck causes accident in vienna a trucker drove into a crowd of people four people were injured
Index:
Term       Document IDs
accident   D1, D3
car        D1, D2
cause      D3
crowd      D3
die        D1
drive      D3
four       D3
heavy      D1
injur      D3
more       D2
morning    D1
people     D1, D3
quarter    D2
register   D2
truck      D3
trucker    D3
vehicle    D2
vienna     D1, D2, D3
yesterday  D1

30 2.2 Classical Information Retrieval Models
Classical models
- Boolean model
- vector space model
- probabilistic model
Alternative models
- user preferences
- associative search
- social filtering

31 Classic IR Models – Basic Concepts
Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents. The importance of the index terms is represented by weights associated with them.
Let
- ti be an index term
- dj be a document
- wij be a weight associated with (ti, dj)
The weight wij quantifies the importance of the index term for describing the document contents. (Stop words can be regarded as terms with wij = 0 for every document.)
(Baeza-Yates & Ribeiro-Neto 1999)

32 Classic IR Models – Basic Concepts
- ti is an index term
- dj is a document
- n is the total number of documents
- T = (t1, t2, …, tk) is the set of all index terms
- wij >= 0 is a weight associated with (ti, dj); wij = 0 indicates that the term does not belong to the document
- vec(dj) = (w1j, w2j, …, wkj) is the weighted vector associated with the document dj
- gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ti, dj)
- fi is the number of documents containing term ti
source: teaching material of Ribeiro-Neto

33 Index Vectors as Matrix
The vectors vec(dj) = (w1j, w2j, …, wkj) associated with the documents dj can be represented as a matrix:
      d1    d2    d3    d4
t1    w1,1  w1,2  w1,3  w1,4
t2    w2,1  w2,2  w2,3  w2,4
t3    w3,1  w3,2  w3,3  w3,4
…
tn    wn,1  wn,2  wn,3  wn,4
Each column represents a document vector
- vec(dj) = (w1j, w2j, …, wkj)
- the document dj contains a term ti if wij > 0
Each row represents a term vector
- tvec(ti) = (wi1, wi2, …, win)
- the term ti is in document dj if wij > 0

34 Boolean Document Vectors
Documents:
- d1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
- d2: more vehicles in this quarter more cars became registered in vienna
- d3: truck causes accident in vienna a trucker drove into a crowd of people four people were injured
           d1  d2  d3
accident   1   0   1
car        1   1   0
cause      0   0   1
crowd      0   0   1
die        1   0   0
drive      0   0   1
four       0   0   1
heavy      1   0   0
injur      0   0   1
more       0   1   0
morning    1   0   0
people     1   0   1
quarter    0   1   0
register   0   1   0
truck      0   0   1
trucker    0   0   1
vehicle    0   1   0
vienna     1   1   1
yesterday  1   0   0

35 The Boolean Model
A simple model based on set theory
- precise semantics
- neat formalism
Binary index: terms are either present or absent, thus wij ∈ {0,1}
Queries are specified as Boolean expressions using the operators AND (∧), OR (∨), and NOT (¬)
- q = ta ∧ (tb ∨ ¬tc)
- Example: (vehicle OR car) AND accident

36 Boolean Retrieval Function
The retrieval function can be defined recursively:
- R(ti, dj) = TRUE if wij = 1 (i.e. ti is in dj), FALSE if wij = 0 (i.e. ti is not in dj)
- R(q1 AND q2, dj) = R(q1, dj) AND R(q2, dj)
- R(q1 OR q2, dj) = R(q1, dj) OR R(q2, dj)
- R(NOT q, dj) = NOT R(q, dj)
The Boolean function computes only the values 0 or 1, i.e. Boolean retrieval classifies documents into two categories:
- relevant (R = 1)
- irrelevant (R = 0)
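
The recursive definition translates directly into code. A sketch, with queries as nested tuples and the documents reduced to the term sets needed here (an assumed representation, not the slides' notation):

```python
# Binary index as term sets per document (excerpt of the example index).
DOCS = {
    "d1": {"accident", "car", "heavy", "vienna"},
    "d2": {"car", "vehicle", "vienna"},
    "d3": {"accident", "vienna"},
}

def R(query, doc):
    """query is a term string, or ('AND', q1, q2), ('OR', q1, q2), ('NOT', q1)."""
    if isinstance(query, str):
        return query in DOCS[doc]          # R(ti, dj)
    op = query[0]
    if op == "AND":
        return R(query[1], doc) and R(query[2], doc)
    if op == "OR":
        return R(query[1], doc) or R(query[2], doc)
    if op == "NOT":
        return not R(query[1], doc)
    raise ValueError(op)

# (vehicle OR car) AND accident
q = ("AND", ("OR", "vehicle", "car"), "accident")
```

Evaluating `q` against d1, d2, d3 yields the two-way relevant/irrelevant classification described above.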

37 Example of Boolean Retrieval
Using the Boolean document vectors of slide 34:
Query: (vehicle OR car) AND accident
- R((vehicle OR car) AND accident, d1) = 1
- R((vehicle OR car) AND accident, d2) = 0
- R((vehicle OR car) AND accident, d3) = 0
Query: (vehicle AND car) OR accident
- R((vehicle AND car) OR accident, d1) = 1
- R((vehicle AND car) OR accident, d2) = 1
- R((vehicle AND car) OR accident, d3) = 1

38 Drawbacks of the Boolean Model
Retrieval is based on a binary decision criterion
- no notion of partial matching
- no ranking of the documents is provided (absence of a grading scale)
- the query q = t1 OR t2 OR t3 is satisfied equally by documents containing one, two or three of the terms t1, t2, t3
No weighting of terms: wij ∈ {0,1}
The information need has to be translated into a Boolean expression, which most users find awkward; the Boolean queries formulated by users are most often too simplistic. As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

39 Vector Space Model
The index can be regarded as an n-dimensional space
- wij > 0 whenever ti ∈ dj
Each term corresponds to a dimension
- to each term ti a unitary vector vec(i) is associated
- the unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
A document can be regarded as
- a vector starting from the origin
- a point in space
Example (dimensions accident, car, vehicle):
          d1  d2
accident  4   3
car       3   2
vehicle   1   3
d1 corresponds to the point (4,3,1), d2 to the point (3,2,3).

40 Coordinate Matching
Documents and query are represented as
- document vectors vec(dj) = (w1j, w2j, …, wkj)
- query vector vec(q) = (w1q, …, wkq)
Vectors have binary values:
- wij = 1 if term ti occurs in document dj
- wij = 0 else
Ranking:
- return the documents containing at least one query term
- rank by the number of occurring query terms
Ranking function: the scalar product (multiply the components and sum up)
- R(q,d) = q · d = Σ(i=1..n) qi * di

41 Coordinate Matching: Example
Query: accident heavy vehicles vienna (the query vector represents the stemmed query terms)
           d1  d2  d3  q
accident   1   0   1   1
car        1   1   0   0
cause      0   0   1   0
crowd      0   0   1   0
die        1   0   0   0
drive      0   0   1   0
four       0   0   1   0
heavy      1   0   0   1
injur      0   0   1   0
more       0   1   0   0
morning    1   0   0   0
people     1   0   1   0
quarter    0   1   0   0
register   0   1   0   0
truck      0   0   1   0
trucker    0   0   1   0
vehicle    0   1   0   1
vienna     1   1   1   1
yesterday  1   0   0   0
Result: q · d1 = 3, q · d2 = 2, q · d3 = 2
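
The scalar product over binary vectors can be sketched with sparse dictionaries; the vectors below restate the non-zero entries of the query and document columns above:

```python
# Binary query and document vectors as sparse dicts (non-zero entries only).
q  = {"accident": 1, "heavy": 1, "vehicle": 1, "vienna": 1}
d1 = {"accident": 1, "car": 1, "die": 1, "heavy": 1, "morning": 1,
      "people": 1, "vienna": 1, "yesterday": 1}
d2 = {"car": 1, "more": 1, "quarter": 1, "register": 1, "vehicle": 1, "vienna": 1}
d3 = {"accident": 1, "cause": 1, "crowd": 1, "drive": 1, "four": 1,
      "injur": 1, "people": 1, "truck": 1, "trucker": 1, "vienna": 1}

def score(q, d):
    """Scalar product: number of query terms occurring in the document."""
    return sum(q[t] * d.get(t, 0) for t in q)
```

With these vectors, d1 scores 3 (accident, heavy, vienna) and d2 and d3 score 2 each, matching the result above.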

42 Assessment of Coordinate Matching
Advantage compared to the Boolean model: ranking
Three main drawbacks:
- the frequency of terms in documents is not considered
- no weighting of terms
- larger documents are privileged

43 Term Weighting
The use of binary weights is too limiting
- non-binary weights allow for partial matches
- these term weights are used to compute a degree of similarity between a query and each document
How to compute the weights wij and wiq? A good weight must take into account two effects:
- quantification of intra-document contents (similarity) → tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity) → idf factor, the inverse document frequency
- wij = tf(i,j) * idf(i)
(Baeza-Yates & Ribeiro-Neto 1999)

44 TF – Term Frequency
Let freq(i,j) be the raw frequency of term ti within document dj (i.e. the number of occurrences of ti in dj).
A simple tf factor can be computed as
- f(i,j) = freq(i,j)
A normalized tf factor is given by
- f(i,j) = freq(i,j) / max(freq(l,j))
where the maximum is computed over all terms which occur within the document dj.
For reasons of simplicity, in this example f(i,j) = freq(i,j):
           d1  d2  d3  q
accident   2   0   1   1
car        1   1   0   0
cause      0   0   1   0
crowd      0   0   1   0
die        1   0   0   0
drive      0   0   1   0
four       0   0   1   0
heavy      2   0   0   1
injur      0   0   1   0
more       0   2   0   0
morning    1   0   0   0
people     1   0   2   0
quarter    0   1   0   0
register   0   1   0   0
truck      0   0   1   0
trucker    0   0   1   0
vehicle    0   1   0   1
vienna     1   1   1   1
yesterday  1   0   0   0
(Baeza-Yates & Ribeiro-Neto 1999)

45 IDF – Inverse Document Frequency
A term occurring in few documents is more useful as an index term than a term occurring in nearly every document. IDF can also be interpreted as the amount of information associated with the term ti.
Let
- ni be the number of documents containing term ti
- N be the total number of documents
A simple idf factor can be computed as
- idf(i) = 1/ni
A normalized idf factor is given by
- idf(i) = log(N/ni)
The log is used to make the values of tf and idf comparable.

46 Example with TF and IDF
In this example a simple tf factor f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used. It is of advantage to store IDF and TF separately.
           IDF  d1  d2  d3
accident   1/2  2   0   1
car        1/2  1   1   0
cause      1    0   0   1
crowd      1    0   0   1
die        1    1   0   0
drive      1    0   0   1
four       1    0   0   1
heavy      1    2   0   0
injur      1    0   0   1
more       1    0   2   0
morning    1    1   0   0
people     1/2  1   0   2
quarter    1    0   1   0
register   1    0   1   0
truck      1    0   0   1
trucker    1    0   0   1
vehicle    1    0   1   0
vienna     1/3  1   1   1
yesterday  1    1   0   0
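
Computing the weights wij = tf(i,j) * idf(i) with these simple factors can be sketched as follows; the postings are an excerpt of the example index:

```python
# term -> {doc: raw frequency}; excerpt of the example index above.
tf = {
    "accident": {"d1": 2, "d3": 1},
    "heavy":    {"d1": 2},
    "vehicle":  {"d2": 1},
    "vienna":   {"d1": 1, "d2": 1, "d3": 1},
}

def weight(term, doc):
    """w_ij = tf(i,j) * idf(i) with the simple factor idf(i) = 1/n_i."""
    postings = tf[term]
    idf = 1 / len(postings)          # n_i = number of documents with the term
    return postings.get(doc, 0) * idf
```

For example, weight("accident", "d1") = 2 * 1/2 = 1.0, while the ubiquitous "vienna" contributes only 1/3 per occurrence.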

47 Indexing a New Document
Changes to the indexes when adding a new document d:
- a new document vector with tf factors for d is created
- the idf factors for the terms occurring in d are adapted
All other document vectors remain unchanged.

48 Ranking
Scalar product: computes co-occurrences of terms in document and query
- drawback: the scalar product privileges large documents over small ones
Euclidean distance between the endpoints of the vectors
- drawback: the Euclidean distance privileges small documents over large ones
Angle between the vectors
- the smaller the angle between query and document vector, the more similar they are
- the angle is independent of the size of the document
- the cosine is a good measure of the angle

49 Cosine Ranking Formula
cos(q, dj) = (q · dj) / (|q| · |dj|)
The more the directions of query q and document dj coincide, the more relevant is dj; the cosine formula takes into account the ratio of the terms, not their concrete number.
Let θ be the angle between q and dj. Because all values wij >= 0, the angle θ is between 0° and 90°:
- the larger θ, the smaller cos θ
- the smaller θ, the larger cos θ
- cos 0° = 1
- cos 90° = 0
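
The formula can be sketched over sparse weight vectors; note in the test below that scaling a document vector does not change its cosine score, which is exactly the size independence argued above:

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|) for sparse dict vectors."""
    dot = sum(q[t] * d.get(t, 0) for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```

A document with the same direction as the query scores 1.0; a document sharing no terms with the query scores 0.0.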

50 The Vector Model
The best term-weighting schemes use weights which are given by
- wij = f(i,j) * log(N/ni)
- this strategy is called a tf-idf weighting scheme
For the query term weights, a suggestion is
- wiq = (0.5 + 0.5 * freq(i,q) / max(freq(l,q))) * log(N/ni)
(Baeza-Yates & Ribeiro-Neto 1999)

51 The Vector Model
The vector model with tf-idf weights is a good ranking strategy for general collections. It is usually as good as the known ranking alternatives, and it is simple and fast to compute.
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
Disadvantages:
- assumes independence of index terms; it is not clear that this is bad, though
(Baeza-Yates & Ribeiro-Neto 1999)

52 Extensions of the Classical Models
Combination of
- Boolean model
- vector model
- indexing with and without preprocessing
Extended index with additional information like
- document format (.doc, .pdf, …)
- language
Using information about links in hypertext
- link structure
- anchor text

53 Boolean Operators in the Vector Model
Many search engines allow queries with Boolean operators, e.g. (vehicle OR car) AND accident
Retrieval:
- the Boolean operators are used to select the relevant documents
- in the example, only documents containing „accident" and either „vehicle" or „car" are considered relevant
- ranking of the relevant documents is then based on the vector model
  - tf-idf weighting
  - cosine ranking formula
(term-frequency table as on slide 44)

54 Queries with Wildcards in the Vector Model
The vector model is based here on an index without preprocessing; the index contains all the word forms occurring in the documents. Queries allow wildcards (masking and truncation), e.g. accident heavy vehicle* vienna
Principle of query answering:
- first, wildcards are expanded to all matching terms (here vehicle* matches „vehicles")
- then ranking proceeds according to the vector model
           d1  d2  d3  q
accident   2   0   1   1
car        1   0   0   0
cars       0   1   0   0
causes     0   0   1   0
crowd      0   0   1   0
died       1   0   0   0
drove      0   0   1   0
four       0   0   1   0
heavy      2   0   0   1
injured    0   0   1   0
more       0   2   0   0
morning    1   0   0   0
people     1   0   2   0
quarter    0   1   0   0
registered 0   1   0   0
truck      0   0   1   0
trucker    0   0   1   0
vehicles   0   1   0   1
vienna     1   1   1   1
yesterday  1   0   0   0

55 Using Link Information in Hypertext
Ranking: the link structure is used to calculate a quality ranking for each web page
- PageRank®
- HITS – Hypertext Induced Topic Selection (authorities and hubs)
- Hilltop
Indexing: the text of a link (anchor text) is associated both
- with the page the link is on and
- with the page the link points to

56 The PageRank Calculation
PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 1998. 1)
PageRank uses the link structure of web pages. Original version of the PageRank calculation:
- PR(A) = (1-d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
with
- PR(A) being the PageRank of page A,
- PR(Ti) being the PageRank of the pages Ti that contain a link to page A,
- C(Ti) being the number of links going out of page Ti,
- d being a damping factor with 0 <= d <= 1
1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems, Vol. 30, 1998.

57 The PageRank Calculation - Explanation

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.

The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: the more outbound links a page Ti has, the less page A benefits from a link to it on page Ti.

The weighted PageRanks of the pages Ti are then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d which can be set between 0 and 1.

58 Damping Factor and the Random Surfer Model

The PageRank algorithm and the damping factor are motivated by the model of a random surfer. The random surfer finds a page A by
 following a link from a page Ti to page A, or
 by random choice of a web page (e.g. typing the URL).

The probability that the random surfer clicks on a particular link is given by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti).

The justification of the damping factor is that the surfer does not click on an infinite number of links, but sometimes gets bored and jumps to another page at random.
 d is the probability that the random surfer keeps clicking on links – this is why the sum of PageRanks is multiplied by d
 (1-d) is the probability that the surfer jumps to another page at random after he stops clicking links.

Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank. (According to Brin and Page, d = 0.85 is a good value.)

59 Calculation of the PageRank - Example

We regard a small web consisting of only three pages A, B and C and the link structure shown in the figure (A links to B and C, B links to C, C links to A). To keep the calculation simple, d is set to 0.5.

These are the equations for the PageRank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solving these equations we get the following PageRank values for the single pages:
PR(A) = 14/13 = 1.0769
PR(B) = 10/13 = 0.7692
PR(C) = 15/13 = 1.1538

60 Iterative Calculation of the PageRank - Example

Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of the PageRank values:
 each page is assigned an initial starting value
 the PageRanks of all pages are then calculated in several computation cycles.

(Table: iteration number vs. PR(A), PR(B), PR(C) for the three-page example.)

According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
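The computation cycles can be sketched for the three-page example; the link structure is the one inferred from the example equations (A links to B and C, B to C, C to A), and after enough iterations the values converge to the exact solutions 14/13, 10/13 and 15/13.

```python
# Iterative PageRank for the three-page example with d = 0.5.
# Link structure assumed from the example: A -> B, A -> C; B -> C; C -> A.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # initial starting values

for _ in range(100):  # computation cycles
    pr["A"] = (1 - d) + d * pr["C"]
    pr["B"] = (1 - d) + d * (pr["A"] / 2)
    pr["C"] = (1 - d) + d * (pr["A"] / 2 + pr["B"])
```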

61 Alternative Link Analysis Algorithms (I): HITS

Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999 1)

HITS rates web pages by their authority and hub values:
 The authority value estimates the value of the content of the page; a good authority is a page that is pointed to by many good hubs.
 The hub value estimates the value of its links to other pages; a good hub is a page that points to many good authorities (examples of hubs are good link collections).

Every page i is assigned a hub weight h_i and an authority weight a_i: the authority weight of a page is the sum of the hub weights of the pages pointing to it, and its hub weight is the sum of the authority weights of the pages it points to; both are computed iteratively.

1) Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 46, No. 5, pp. 604-632, 1999.
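The mutually recursive update can be sketched as a power iteration on a tiny graph; the link structure here is an illustrative assumption (the same three-page graph as in the PageRank example), and the weights are normalised each cycle so they stay bounded.

```python
# Sketch of the HITS hub/authority iteration on a small assumed link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # authority weight: sum of the hub weights of the pages pointing to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub weight: sum of the authority weights of the pages p points to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalise both vectors to unit length
    for vec in (auth, hub):
        n = sum(v * v for v in vec.values()) ** 0.5
        for p in vec:
            vec[p] /= n
```

In this graph C, pointed to by both A and B, ends up the best authority, while A, which links to two good authorities, ends up the best hub.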

62 Alternative Link Analysis Algorithms (II): Hilltop

The Hilltop algorithm 1) rates documents based on their incoming links from so-called expert pages.
 Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic.
 Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.
 Websites which have backlinks from many of the best expert pages are authorities and are ranked high.

A good directory page is an example of an expert page (cp. hubs). The determination of expert pages is a central point of the Hilltop algorithm.

1) The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003 Google bought the patent on the algorithm.

63 Anchor Text

The Google search engine uses the text of links twice:
 First, the text of a link is associated with the page that the link is on.
 In addition, it is associated with the page the link points to.

Advantages:
 Anchors provide an additional description of a web page – from a user's point of view.
 Documents without text can be indexed, such as images, programs, and databases.

Disadvantage:
 Search results can be manipulated (cf. Google bombing 1)). A Google bomb influences the ranking of the search engine. It is created when a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. Meanwhile, Google bombs are defused by Google.

Example anchor text: „The polar bear Knut was born in the zoo of Berlin“

64 Natural Language Queries

Natural language queries are treated like any other query:
 stop word elimination
 stemming
but there is no interpretation of the meaning of the query.

i need information about accidents with cars and other vehicles

is equivalent to

information accident car vehicle
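The reduction above can be sketched in a few lines. This is a toy: the stop-word list is an illustrative assumption, and the "stemmer" only strips a plural "s" (a real system would use a proper stemmer such as Porter's).

```python
# Toy query normalisation: stop-word removal plus naive plural stripping.
STOP_WORDS = {"i", "need", "about", "with", "and", "other"}

def stem(word):
    # naive stemmer: strip a trailing plural "s" (not a real Porter stemmer)
    return word[:-1] if word.endswith("s") else word

def normalise(query):
    return [stem(w) for w in query.lower().split() if w not in STOP_WORDS]

terms = normalise("i need information about accidents with cars and other vehicles")
# terms == ["information", "accident", "car", "vehicle"]
```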

65 Searching Similar Documents

It is often difficult to express the information need as a query. An alternative search method is to search for documents that are similar to a given document d.

66 Finding Similar Documents – Principle and Example

Principle: use a given document d as a query and compare all documents d_i with d.

Example: find the documents most similar to d1 (scalar product):
IDF * d1 * d2 = …
IDF * d1 * d3 = …

The approach is the same as for a normal query:
 same index
 same ranking function

term | IDF | d1 d2 d3
accident | … | 2 0 1
car | … | 1 1 0
cause | 1 | 0 0 1
crowd | 1 | 0 0 1
die | 1 | 1 0 0
drive | 1 | 0 0 1
four | 1 | 0 0 1
heavy | 1 | 2 0 0
injur | 1 | 0 0 1
more | 1 | 0 2 0
morning | 1 | 1 0 0
people | … | 1 0 2
quarter | 1 | 0 1 0
register | 1 | 0 1 0
truck | 1 | 0 0 1
trucker | 1 | 0 0 1
vehicle | 1 | 0 1 0
vienna | … | 1 1 1
yesterday | 1 | 1 0 0
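The "more like this" comparison can be sketched as follows. The term counts follow the running example, but the IDF is set to 1 for every term purely to keep the illustration simple, so the scores are not the deck's (elided) values.

```python
# "More like this": document d1 itself is used as the query and the other
# documents are ranked by the IDF-weighted scalar product.
docs = {
    "d1": {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1,
           "people": 1, "vienna": 1, "yesterday": 1},
    "d2": {"car": 1, "more": 2, "quarter": 1, "register": 1,
           "vehicle": 1, "vienna": 1},
    "d3": {"accident": 1, "cause": 1, "crowd": 1, "drive": 1, "four": 1,
           "injur": 1, "people": 2, "truck": 1, "trucker": 1, "vienna": 1},
}

def scalar_product(a, b, idf=lambda t: 1.0):
    return sum(idf(t) * w * b.get(t, 0) for t, w in a.items())

ranking = sorted((d for d in docs if d != "d1"),
                 key=lambda d: scalar_product(docs["d1"], docs[d]),
                 reverse=True)
```

With these toy weights d3 (shared terms: accident, people, vienna) ranks above d2 (shared terms: car, vienna).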

67 The Vector Space Model

The vector space model...
…is relatively simple and clear,
…is efficient,
…ranks documents,
…can be applied to any collection of documents.

The model has many heuristic components and parameters, e.g.
 determination of the index terms
 calculation of tf and idf
 the ranking function
The best parameter setting depends on the document collection.

68 2.3 Implementation of the Index

The vector space model is usually implemented with an inverted index. For each term, a pointer references a „posting list“ with an entry for each document containing the term.

The posting lists can be implemented as
 linked lists or
 more efficient data structures that reduce the storage requirements (index pruning).

To answer a query, the corresponding posting lists are retrieved and the documents are ranked, i.e. efficient retrieval of the posting lists is essential.

Source: D. Grossman, O. Frieder (2004) Information Retrieval, Springer-Verlag
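A minimal inverted index can be sketched with a dictionary of posting lists; the document texts below are an illustrative, pre-tokenised stand-in for the running example, not a full indexing pipeline.

```python
# Minimal inverted index: each term points to a posting list mapping
# document id to term frequency.
from collections import defaultdict

docs = {
    "d1": "heavy accident heavy car accident people die vienna",
    "d2": "more vehicle more car register vienna",
    "d3": "truck accident vienna trucker people people",
}

index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

# answering a query means fetching the posting lists of its terms
accident_postings = index["accident"]
```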

69 Implementing the Term Structure as a Trie

Sequentially scanning the index for query terms/posting lists is inefficient. A trie is a tree structure in which
 each node is an array with one element for each character
 each element contains a link to another node *)

Example: structure of a node in a trie (figure)

*) The characters and their order are identical for each node; therefore they do not need to be stored explicitly.

Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen – Eine Einführung mit Java. dpunkt Verlag 2004

70 The Index as a Trie

The leaves of the trie are the index terms, pointing to the corresponding posting lists.

Searching a term in a trie:
 the search starts at the root
 subsequently, for each character of the term, the reference to the corresponding subtree is followed until
 a leaf with the term is found, or
 the search stops without success.

(Saake, Sattler 2004)
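The lookup procedure above can be sketched with a dictionary-based trie. This differs from the slides' fixed-size array nodes (a dict maps only the characters actually present), but the search walk is the same; the terms and posting lists are illustrative.

```python
# Dictionary-based trie: each node maps a character to a child node;
# a special "$" key marks the end of a term and stores its posting list.
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, term, postings):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = postings

    def search(self, term):
        node = self.root
        for ch in term:
            if ch not in node:
                return None  # search stops without success
            node = node[ch]
        return node.get("$")  # posting list if a term ends here

trie = Trie()
trie.insert("accident", ["d1", "d3"])
trie.insert("car", ["d1", "d2"])
```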

71 Patricia Trees

Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric

Idea: skip irrelevant parts of the terms. This is achieved by storing in each node the number of characters to be skipped.

Example: (Saake, Sattler 2004)

72 2.4 Evaluating Search Methods

(Venn diagram: the set of all documents contains the set of documents found and the set of relevant documents; their overlap is the set of relevant documents found, and the relevant documents outside the overlap are the relevant documents that are not found.)

73 Performance Measures of Information Retrieval: Recall and Precision

Several different measures for evaluating the performance of information retrieval systems have been proposed; two important ones are:

Recall: the fraction of the relevant documents that are successfully retrieved.

R = |D_RA| / |D_R|

Precision: the fraction of the retrieved documents that are relevant to the user's information need.

P = |D_RA| / |D_A|

with D the set of all documents, D_A the answer set, D_R the relevant documents, and D_RA the relevant documents in the answer set.
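Both measures can be computed directly from the document sets; the document ids below are illustrative.

```python
# Recall and precision from document sets.
relevant = {"d1", "d3", "d5", "d7"}   # D_R: all relevant documents
answer = {"d1", "d2", "d3", "d4"}     # D_A: the answer set
found_relevant = relevant & answer    # D_RA: relevant documents found

recall = len(found_relevant) / len(relevant)      # |D_RA| / |D_R|
precision = len(found_relevant) / len(answer)     # |D_RA| / |D_A|
```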

74 F-Measure

The F-measure is the harmonic mean of precision and recall:

F = 2 * P * R / (P + R)

In this version, precision and recall are equally weighted. The more general version allows giving preference to recall or precision:

F_β = (1 + β²) * P * R / (β² * P + R)

 F_2 weights recall twice as much as precision
 F_0.5 weights precision twice as much as recall
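The general formula can be sketched as a single function; the precision/recall values passed in are illustrative.

```python
# F-measure: F1 is the harmonic mean of precision and recall;
# the general F_beta weights recall beta times as much as precision.
def f_beta(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.5, 0.5)              # equally weighted
f2 = f_beta(0.25, 1.0, beta=2.0)   # favours recall
```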

75 Computing Recall and Precision

Evaluation: perform a predefined set of queries.
 The search engine delivers a ranked list of documents.
 Use the first X documents of the result list as the answer set.
 Compute recall and precision for the first X documents of the ranked result list.

How do you know which documents are relevant?
1. A general reference set of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems.
2. For companies it is more important to evaluate information retrieval systems using their own documents:
 collect a representative set of documents
 specify queries and the associated relevant documents
 evaluate search engines by computing recall and precision for the query results

76 2.5 User Adaptation

Take into account information about a user to filter documents particularly relevant to this user:
 Relevance Feedback
 retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries
 Explicit User Profiles
 subscription
 user-specific weights of terms
 Social Filtering
 similar users get similar documents

77 Relevance Feedback given by the User

The user specifies the relevance of each document. Example: for the query "Pisa", only the documents about the education assessment are regarded as relevant. In the next pass, the top-ranked documents are only about the education assessment.

This example is from the SmartFinder system from empolis. The mindaccess system from Insiders GmbH uses the same technology.

78 Relevance Feedback: Probabilistic Model

Assumption: given a user query, there is an ideal answer set.
Idea: an initial answer is iteratively improved based on user feedback.

Approach:
 An initial set of documents is retrieved somehow.
 The user inspects these docs looking for the relevant ones (usually, only the top-ranked ones need to be inspected).
 The IR system uses this information to refine the description of the ideal answer set.
 By repeating this process, it is expected that the description of the ideal answer set will improve.

The description of the ideal answer set is modeled in probabilistic terms. (Baeza-Yates & Ribeiro-Neto 1999)

79 Probabilistic Ranking

Given a user query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find the document d_j interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only.

Probabilistic ranking is the odds of relevance:

sim(d_j, q) = P(R | d_j) / P(R̄ | d_j)

Definitions:
 w_ij ∈ {0,1} (i.e. weights are binary)
 sim(d_j, q): similarity of document d_j to the query q
 d_j: the document vector of d_j
 P(R | d_j): the probability that document d_j is relevant
 P(R̄ | d_j): the probability that document d_j is not relevant

(Baeza-Yates & Ribeiro-Neto 1999)

80 Computing Probabilistic Ranking

Probabilistic ranking can be computed as:

sim(d_j, q) ≈ Σ_i w_iq * w_ij * ( log( P(k_i | R) / (1 − P(k_i | R)) ) + log( (1 − P(k_i | R̄)) / P(k_i | R̄) ) )

where
 P(k_i | R) stands for the probability that the index term k_i is present in a document randomly selected from the set R of relevant documents,
 w_iq is the weight of term k_i in the query,
 P(k_i | R̄) stands for the probability that the index term k_i is present in a document randomly selected from the set of non-relevant documents,
 w_ij is the weight of term k_i in document d_j.

(Baeza-Yates & Ribeiro-Neto 1999)

81 Relevance Feedback: Probabilistic Model

The probabilities that a term k_i is present in the set of relevant (resp. non-relevant) documents can be estimated as:

P(k_i | R) = V_i / V
P(k_i | R̄) = (n_i − V_i) / (N − V)

with
 N: total number of documents
 n_i: number of documents containing term k_i
 V: number of relevant documents retrieved by the probabilistic model
 V_i: number of relevant documents containing term k_i

There are different ways to find the relevant documents V:
 automatically: V can be specified as the top r documents found
 by user feedback: the user specifies for each retrieved document whether it is relevant or not
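The resulting per-term weight from the ranking formula can be sketched with these estimates. Note the sketch omits the usual 0.5 smoothing that real systems add to avoid division by zero when V_i = 0 or V_i = V, and the counts below are illustrative.

```python
# Term weight from relevance-feedback counts (binary independence model,
# unsmoothed): combines log-odds of the term in relevant vs. non-relevant docs.
from math import log

def term_weight(N, n_i, V, V_i):
    p = V_i / V                  # P(k_i | R)
    q = (n_i - V_i) / (N - V)    # P(k_i | not R)
    return log(p / (1 - p)) + log((1 - q) / q)

# term frequent in the relevant documents but rare overall -> high weight
w = term_weight(N=1000, n_i=50, V=10, V_i=8)
```

A term that is equally frequent in relevant and non-relevant documents gets weight 0, i.e. it does not discriminate.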

82 Explicit User Profiles

Idea: use knowledge about the user to provide information that is particularly relevant for him/her:
 users specify topics of interest as a set of terms
 these terms represent the user profile
 documents containing the terms of the user profile are preferred

(Figure: profile acquisition turns the information need/preferences into a user profile; document representation turns the documents into an index; the ranking function compares user profile, query and index.)

83 User Profiles for Subscribing to Information

User profiles are treated as queries. Example: news feed.
 As soon as a new document arrives, it is tested for similarity with the user profiles.
 The vector space model can be applied.
 A document is regarded as relevant if the ranking reaches a specified threshold.

Example: user 1 is interested in any car accident; user 2 is interested in deadly car accidents with trucks.

(Table: IDF and term weights for the documents d1–d3 and the profile vectors U1 and U2.)
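The subscription test can be sketched as follows; the profile weights and the threshold are illustrative assumptions, not the (elided) values from the slide's table.

```python
# Subscription sketch: profiles are standing queries; a newly arriving
# document is delivered to a user when its score reaches the threshold.
profiles = {
    "U1": {"accident": 1, "car": 1},               # any car accident
    "U2": {"accident": 1, "die": 1, "truck": 1},   # deadly truck accidents
}

def score(profile, doc):
    return sum(w * doc.get(t, 0) for t, w in profile.items())

def subscribers(doc, threshold=2):
    return [u for u, p in profiles.items() if score(p, doc) >= threshold]

# a new document arrives (term-count vector)
new_doc = {"accident": 2, "car": 1, "die": 1, "heavy": 2, "vienna": 1}
notified = subscribers(new_doc)
```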

84 User Profiles for Individual Queries

 Users specify the importance of terms.
 User profiles are used as additional term weights.
 This yields different rankings for different users.

Example: ranking for user 1:
IDF * d1 * U1 * q = …
IDF * d2 * U1 * q = …
IDF * d3 * U1 * q = …

Ranking for user 2:
IDF * d1 * U2 * q = …
IDF * d2 * U2 * q = …
IDF * d3 * U2 * q = …

(Table: IDF, term weights for d1–d3, the profile vectors U1 and U2, and the query q.)
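The IDF * d * U * q product can be sketched as a per-term multiplication; all weights below are illustrative assumptions (the slide's own values were not preserved), and terms missing from a profile default to weight 1.

```python
# Personalised ranking sketch: the user profile contributes an extra
# per-term factor to the scalar product IDF * d * U * q.
def personalised_score(idf, doc, user, query):
    return sum(idf.get(t, 1.0) * doc.get(t, 0) * user.get(t, 1.0) * qw
               for t, qw in query.items())

idf = {"accident": 1.0, "car": 1.0}
doc = {"accident": 2, "car": 1}
user1 = {"accident": 2.0}   # user 1 cares strongly about accidents
user2 = {"car": 0.5}        # user 2 downweights cars

s1 = personalised_score(idf, doc, user1, {"accident": 1, "car": 1})
s2 = personalised_score(idf, doc, user2, {"accident": 1, "car": 1})
```

The same document thus scores differently for the two users.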

85 Acquisition and Maintenance of User Profiles

There are different ways to specify user profiles:

manual: the user specifies topics of interest (and weights) explicitly
 selection of predefined terms or a query
 Problem: maintenance

user feedback: the user collects relevant documents
 terms in the selected documents are regarded as important
 Problem: how to motivate the user to give feedback
 (a similar approach is used by spam filters - classification)

heuristics: observing user behaviour
 Example: if a user has opened a document for a long time, it is assumed that he/she read it and therefore it might be relevant
 Problem: heuristics might be wrong

86 Social Filtering

Idea: information is relevant if other users who showed similar behaviour regarded the information as relevant.
 Relevance is specified by the users.
 User profiles are compared.

Example: a simple variant can be found at Amazon
 purchases of books and CDs are stored
 „people who bought this book also bought …“

87 Query in the Informatics Approach

Formulate the following information need as a query according to the informatics approach: "Find all documents...
…that contain advice on credit assessment for persons from Bärlund."
…that are about a house by the sea."
…on the topic of document management."
…about the management of data in a database."

88 Assume you made a search in Google. From the 150 documents found, you can use 9 for your work; the remaining documents are useless.
 Which measure can be used to evaluate the result? Compute the measure.
 Which information is missing to make a complete evaluation of the result?

