Slide 1: Special Topics in Computer Science. The Art of Information Retrieval. Chapter 2: Modeling. Alexander Gelbukh

Slide 2: Previous chapter
- User Information Need
  - vague
  - semantic, not formal
- Document relevance
  - order, not retrieve
- Huge amount of information
  - efficiency concerns
  - tradeoffs
- Art more than science

Slide 3: Modeling
- Still science: computation is formal
- No good methods to work with (vague) semantics
- Thus, simplify to get a (formal) model
- Develop (precise) math over this (simple) model
- Why math, if the model is not precise (simplified)?
  - (Diagram: with math, phenomenon ~ model = step 1 = step 2 = ... = result; without a model, phenomenon, step 1, step 2, ... lead to "?!")

Slide 4: Modeling in IR: the idea
- Tag documents with fields
  - as in a (relational) DB: customer = {name, age, address}
  - unlike a DB, very many fields: the individual words!
  - e.g., bag of words: {word_1, word_2, ...} -> {3, 5, 0, 0, 2, ...}
- Define a similarity measure between the query and such a record
  - unlike a DB, order (rank), do not just retrieve (yes/no)
  - justify your model (optional, but nice)
- Develop math and algorithms for fast access
  - as relational algebra does in a DB
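A minimal bag-of-words sketch (not from the slides; the toy two-document collection is made up): each document becomes a vector of term counts over a shared vocabulary, exactly the "very many fields" idea above.

```python
# Bag-of-words illustration: one count per vocabulary term, per document.
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(text, vocab):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]  # one weight (here: raw count) per term

for d in docs:
    print(bag_of_words(d, vocab))
```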

Slide 5: Taxonomy of IR systems

Slide 6: Aspects of an IR system
- IR model
  - Boolean, Vector, Probabilistic
- Logical view of documents
  - full text, bag of words, ...
- User task
  - retrieval, browsing
- Independent, though some combinations are more compatible

Slide 7: Taxonomy of IR models
- Boolean (set-theoretic)
  - fuzzy
  - extended Boolean
- Vector (algebraic)
  - generalized vector
  - latent semantic indexing
  - neural network
- Probabilistic
  - inference network
  - belief network

Slide 8: Taxonomy of other aspects
- Text structure
  - non-overlapping lists
  - proximal nodes model
- Browsing
  - flat
  - structure guided
  - hypertext

Slide 9: Appropriate models

Slide 10: Retrieval operation mode
- Ad hoc
  - static document collection
  - interactive
  - ordered (ranked) results
- Filtering (ad hoc on newly arriving docs)
  - changing document collection: notification of new matches
  - not interactive: machine learning techniques can be used
  - yes/no results rather than ranking

Slide 11: Characterization of an IR model
- D = {d_j}: collection of formal representations of the documents
  - e.g., keyword vectors
- Q = {q_i}: possible formal representations of the user information need (queries)
- F: framework for modeling these two; the rationale behind the next component
- R(q_i, d_j): Q x D -> R, the ranking function
  - defines an ordering of the documents for a query
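A minimal sketch of this characterization in code (illustration only; the term-weight representation and the overlap scorer are assumptions, one possible choice of F and R):

```python
# (D, Q, F, R): documents and queries share one formal representation
# (term -> weight mappings), and R scores a (query, document) pair with a real.
from typing import Callable, Dict

DocRep = Dict[str, float]    # formal representation d_j: term -> weight
QueryRep = Dict[str, float]  # formal representation q_i: term -> weight
Ranking = Callable[[QueryRep, DocRep], float]  # R: Q x D -> reals

def overlap_score(q: QueryRep, d: DocRep) -> float:
    """One possible ranking function: weighted term overlap."""
    return sum(q[t] * d.get(t, 0.0) for t in q)
```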

Slide 12: Specific IR models

Slide 13: IR models
- Classical
  - Boolean
  - Vector
  - Probabilistic
  - (clear ideas, but some disadvantages)
- Refined
  - each one with refinements
  - solve many of the problems of the basic models
  - give good examples of possible developments in the area
  - not well investigated: we can work on this

Slide 14: Basic notions
- Document: a set of index terms
  - mainly nouns
  - maybe all words; then we have the full-text logical view
- Term weights
  - some terms are better than others
  - terms that are less frequent in this doc and more frequent in other docs are less useful
- Document as an index-term vector d_j = (w_1j, w_2j, ..., w_tj)
  - weights of the terms in the doc
  - t is the number of index terms in the whole collection
  - weights of different terms are assumed independent (a simplification)

Slide 15: Boolean model
- Weights in {0, 1}
  - doc: a set of words
- Query: a Boolean expression
  - R(q_i, d_j) in {0, 1}
- Good:
  - clear semantics, neat formalism, simple
- Bad:
  - no ranking (this is data retrieval); retrieves too many or too few documents
  - difficult to translate a User Information Need into a Boolean query
  - no term weighting
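A minimal Boolean-retrieval sketch (not from the slides; the three toy documents are made up): documents are term sets, a query is evaluated by set membership, and the answer is yes/no with no ranking.

```python
# Boolean model illustration: set membership and Boolean connectives only.
docs = {
    "d1": {"cat", "mouse"},
    "d2": {"cat", "dog"},
    "d3": {"mouse", "cheese"},
}

def matches(doc_terms, query):
    """query is a predicate over the term set; returns True/False (no ranking)."""
    return query(doc_terms)

# Query: "mouse AND NOT cat"
query = lambda terms: ("mouse" in terms) and ("cat" not in terms)
print([d for d, terms in docs.items() if matches(terms, query)])  # ['d3']
```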

Slide 16: Vector model
- Weights are non-binary
- Ranking; much better results (with respect to the User Information Need)
- R(q_i, d_j) = correlation between the query vector and the doc vector
- E.g., the cosine measure (the slide notes a typo in the book's formula; the standard form is reconstructed below)
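The formula itself did not survive the transcript; in the notation of slide 14, the standard cosine measure of the vector model is:

```latex
\[
  \mathrm{sim}(d_j, q) \;=\; \cos\theta
  \;=\; \frac{\vec{d}_j \cdot \vec{q}}{\|\vec{d}_j\|\,\|\vec{q}\|}
  \;=\; \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}
             {\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}
\]
```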

Slide 17: Projection (figure)

Slide 18: Weights
- How are the weights w_ij obtained? Many variants. One way: the TF-IDF balance
- TF: term frequency
  - how strongly is the term related to the doc?
  - if it appears many times, it is important
  - proportional to the number of times it appears
- IDF: inverse document frequency
  - how useful is the term for distinguishing documents?
  - if it appears in many docs, it is not important
  - inversely proportional to the number of docs in which it appears
- The two pull in opposite directions. How to balance them?

Slide 19: TF-IDF ranking
- TF: term frequency
- IDF: inverse document frequency
- Balance: TF x IDF
  - other formulas exist. Art.
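The TF and IDF formulas were images on the slide; a common instantiation (one variant among the "other formulas" the slide mentions, with a made-up toy collection) is sketched below: w_ij = (f_ij / max_l f_lj) * log(N / n_i), where f_ij is the raw count of term i in doc j, N is the number of docs, and n_i is the number of docs containing term i.

```python
# TF-IDF illustration: normalized term frequency times inverse document frequency.
import math
from collections import Counter

docs = [
    "cat chases mouse",
    "dog chases cat",
    "mouse eats cheese",
]
N = len(docs)
counts = [Counter(d.split()) for d in docs]
df = Counter(t for c in counts for t in c)   # n_i: number of docs containing term i

def tf_idf(term, j):
    c = counts[j]
    if term not in c:
        return 0.0
    tf = c[term] / max(c.values())           # normalized term frequency
    idf = math.log(N / df[term])             # inverse document frequency
    return tf * idf

# A rare, distinguishing term scores higher than a widespread one:
print(round(tf_idf("cheese", 2), 3), round(tf_idf("chases", 1), 3))
```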

Slide 20: Advantages of the vector model
- One of the best-known strategies
- Improves quality (term weighting)
- Allows approximate (partial) matching
- Gives ranking by similarity (cosine formula)
- Simple, fast
- But:
  - does not consider term dependencies
    - considering them in a bad way hurts quality
    - no known good way
  - no logical expressions (e.g., negation: mouse AND NOT cat)

Slide 21: Probabilistic model
- Assumptions:
  - there is a set of relevant docs
  - docs have probabilities of being relevant
  - after a Bayes calculation: probabilities of terms being important for defining the relevant docs
- Initial idea: interact with the user
  - generate an initial set
  - ask the user to mark some of its docs as relevant or not
  - re-estimate the probabilities of the keywords; repeat
- Can be done without the user
  - just re-calculate the probabilities assuming the user's acceptance is the same as the predicted ranking
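The "Bayes calculation" and the re-estimation loop are usually written as follows in the classical binary-independence formulation (standard textbook form, not taken from the transcript); N is the number of docs, n_i the number containing term k_i, V the top-ranked set, and V_i its subset containing k_i:

```latex
% Classical binary-independence ranking:
\[
  \mathrm{sim}(d_j, q) \;\propto\; \sum_{i=1}^{t} w_{iq}\, w_{ij}
    \left[ \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
         + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]
\]
% Initial guess, before any (real or simulated) feedback:
\[
  P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}
\]
% Re-estimation from the top-ranked set V:
\[
  P(k_i \mid R) = \frac{|V_i|}{|V|}, \qquad
  P(k_i \mid \bar{R}) = \frac{n_i - |V_i|}{N - |V|}
\]
```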

Slide 22: (Dis)advantages of the Probabilistic model
- Advantage:
  - theoretical adequacy: ranks by probability of relevance
- Disadvantages:
  - need to guess the initial ranking
  - binary weights; ignores term frequencies
  - independence assumption (not clear if this is bad)
  - does not perform well (?)

Slide 23: Alternative Set-Theoretic models: Fuzzy set model
- Takes term relationships (a thesaurus) into account
  - "Bible" is related to "Church"
- Fuzzy membership of a term in a document
  - a document containing "Bible" also contains a little bit of "Church", but not entirely
- Fuzzy set logic applied to such fuzzy membership
  - logical expressions with AND, OR, and NOT
- Provides ranking, not just yes/no
- Not well investigated
  - Why not investigate it?
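A minimal fuzzy-retrieval sketch (illustration only: the thesaurus weight is made up, and the standard min/max/complement fuzzy operators are used here; the model in the book uses its own correlation-based formulas):

```python
# Fuzzy membership via a term-term thesaurus: a doc containing "bible"
# also "contains a little bit of" "church".
thesaurus = {("bible", "church"): 0.8}  # hypothetical correlation weight

def membership(term, doc_terms):
    """Fuzzy degree to which `term` belongs to a document (a set of terms)."""
    if term in doc_terms:
        return 1.0
    return max((w for (a, b), w in thesaurus.items()
                if b == term and a in doc_terms), default=0.0)

doc = {"bible", "psalm"}
mu_bible = membership("bible", doc)    # 1.0
mu_church = membership("church", doc)  # 0.8, via the thesaurus
# Queries "bible AND church" and "bible OR NOT church":
print(min(mu_bible, mu_church), max(mu_bible, 1.0 - mu_church))
```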

Slide 24: Alternative Set-Theoretic models: Extended Boolean model
- A combination of the Boolean and Vector models
- Compared with the Boolean model, adds a distance from the query
  - some documents satisfy the query better than others
- Compared with the Vector model, adds the distinction between AND and OR combinations
- A parameter (the degree of the norm) adjusts the behavior between Boolean-like and Vector-like
  - it can even differ within one query
- Not well investigated. Why not investigate it?
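A minimal sketch of the norm parameter at work (the p-norm form of the extended Boolean model; the document weights are made up): p = 1 behaves like the vector model (a plain average), and as p grows the AND and OR scores approach the strict Boolean min and max.

```python
# p-norm AND/OR of the extended Boolean model.
def or_sim(weights, p):
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def and_sim(weights, p):
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

x = [0.9, 0.2]  # hypothetical document weights for the two query terms
for p in (1, 2, 10):
    print(p, round(and_sim(x, p), 3), round(or_sim(x, p), 3))
# p=1: AND = OR = 0.55 (vector-like); p=10: AND -> 0.2, OR -> 0.9 (Boolean-like)
```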

Slide 25: Alternative Algebraic models: Generalized Vector Space model
- Classical independence assumptions:
  - all combinations of terms are possible, none are equivalent (they form a basis of the vector space)
  - pairwise orthogonality: cos(k_i, k_j) = 0
- This model relaxes the pairwise orthogonality: cos(k_i, k_j) != 0
- Operates on combinations (co-occurrences) of index terms, not individual terms
- More complex, more expensive, not clearly better
- Not well investigated. Why not investigate it?

Slide 26: Alternative Algebraic models: Latent Semantic Indexing model
- Index by larger units, "concepts": sets of terms that are used together
- Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
- Group index terms together (map them into a lower-dimensional space), so that some terms become equivalent
  - not exactly equivalent, but this is the idea
  - eliminates unimportant details
  - depends on a parameter (which details are unimportant?)
- Not well investigated. Why not investigate it?
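A minimal LSI sketch (illustration only; the tiny term-document matrix is made up): a truncated SVD of the term-document matrix maps terms and docs into a k-dimensional "concept" space, where k is the parameter the slide mentions.

```python
# LSI via truncated SVD: rows = terms, columns = docs.
import numpy as np

A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 1, 0],   # "automobile"
    [0, 0, 1, 1],   # "flower"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                            # keep only k concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # low-rank approximation
print(np.round(A_k, 2))  # docs gain weight for related terms they never contained
```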

Slide 27: Alternative Algebraic models: Neural Network model
- NNs are good at matching
- Iteratively uses the found documents as auxiliary queries
  - spreading activation
  - terms -> docs -> terms -> docs -> ...
- Like a built-in thesaurus
- The first round gives the same result as the Vector model
- No evidence whether it is good
- Not well investigated. Why not investigate it?

Slide 28: Alternative Probabilistic models: Bayesian Inference Network model
- (One of the authors of the book worked on this; in fact it is not so important)
- Probability as belief (not as frequency)
  - belief in the importance of terms; query terms have belief 1.0
- Similar to a Neural Net
  - documents found increase the importance of their terms
  - and thus act as new queries
  - but with different propagation formulas
- Flexible in combining sources of evidence
- Can be applied with different ranking strategies (Boolean or TF-IDF)
- Good quality of results (warning: the authors work on this)


Slide 30: Alternative Probabilistic models: Belief Network model
- (Introduced by one of the authors of the book)
- Better network topology
  - separation of the document and term spaces
  - more general than the Inference Network model
- Bayesian network models:
  - do not include cycles and thus have linear complexity, unlike Neural Nets
  - combine distinct evidence sources (including user feedback)
  - are a neat formalism
  - a better alternative to combinations of Boolean and Vector

Slide 31: Models for structured text
- Example queries: "cat" in the 3rd chapter; "cat" in the same paragraph as "dog"; sections containing "cat"
- Non-overlapping lists
  - chapters, sections, paragraphs as regions
  - technically treated much like terms (ranges of positions)
- Proximal nodes model (suggested by the authors)
  - chapters, sections, paragraphs as objects (nodes)

Slide 32: Models for browsing
- Flat browsing
  - just a flat list, as on a sheet of paper
  - no context cues provided
- Structure guided
  - hierarchy
  - like the directory tree on a computer
- Hypertext (the Internet!)
  - no limitations of sequential writing
  - modeled by a directed graph: links from unit A to unit B (units: docs, chapters, etc.)
  - a map (with the traversed path) can be helpful

Slide 33: The Web
- The Internet is not hypertext
  - the authors reserve "hypertext" for well-organized hypertext
  - the Internet: not a repository but a heap of information

Slide 34: Research issues
- How do people judge relevance?
  - ranking strategies
- How to combine different sources of evidence?
- What interfaces can help users understand and formulate their Information Need?
  - user interfaces: an open issue
- Meta-search engines: combine results from different Web search engines
  - their result sets barely intersect
  - how to combine the rankings?

Slide 35: Conclusions
- Modeling is needed for formal operations
- The Boolean model is the simplest
- The Vector model is the best combination of quality and simplicity
  - TF-IDF term weighting
  - this (or similar) weighting is used in all the further models
- Many interesting and not well-investigated variations
  - possible future work

Slide 36: Thank you! Till October 2.
