Modern Information Retrieval, Chapter 2: Modeling
Can keywords be used to represent a document or a query? Treating keywords as the query and plain matching as query processing does not, in general, produce good results. This motivates the ranking algorithm, the notion of document relevance, and the IR model.
Taxonomy of IR models
Ad hoc and filtering retrieval. Ad hoc retrieval: the document collection is static and queries are submitted against it. Filtering retrieval: the queries are static and the documents stream in. A user profile describes the user's preferences: keywords, relevance feedback, and dynamic keyword adjustment.
Formal characterization of IR models
Classic IR: index terms. Deciding on the importance of a term is difficult: consider the term's semantics as well as its distribution across all documents. Weights are used to quantify the importance of the index terms for describing the document contents.
The assumption that index term weights are mutually independent simplifies the task and allows fast ranking computation.
Boolean model. Index term weights are binary. A query is a Boolean expression with not, and, and or as connectives. Users might find it difficult to specify their information needs in this form. It is a dominant model for commercial systems.
Advantages and disadvantages: each document is judged either relevant or non-relevant, with nothing in between. Given = (0,1,0), is document d_j an answer?
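The binary decision described above can be sketched in a few lines of Python. This is only an illustration: the three index terms k_a, k_b, k_c, the query k_a AND (k_b OR NOT k_c), the sample documents, and the function names are all assumptions made for the example, not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

# A document is a tuple of binary index-term weights (k_a, k_b, k_c).
Doc = Tuple[int, int, int]


def hypothetical_query(doc: Doc) -> bool:
    """Illustrative query (an assumption): k_a AND (k_b OR NOT k_c)."""
    k_a, k_b, k_c = (bool(w) for w in doc)
    return k_a and (k_b or not k_c)


def boolean_retrieve(docs: Dict[str, Doc], query: Callable[[Doc], bool]) -> List[str]:
    """Return the names of the documents that satisfy the query, unranked."""
    return [name for name, doc in docs.items() if query(doc)]


# Made-up collection of four documents.
docs = {
    "d1": (1, 1, 0),
    "d2": (0, 1, 0),
    "d3": (1, 0, 0),
    "d4": (1, 0, 1),
}
matches = boolean_retrieve(docs, hypothetical_query)
print(matches)
```

Under this assumed query, a document with weights (0,1,0) is rejected because k_a is absent. The result is a plain set of matches with no ordering among them, which is the main limitation the vector model addresses.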
Vector model. Given a set of index terms, it allows partial matching and ranks documents by a similarity measure. Coordinate matching: the number of query index terms contained in a document decides the degree of similarity. Three drawbacks: it takes no account of term frequency, term scarcity, or document size.
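Coordinate matching as described above can be sketched as follows. The function name, the query, and the three toy documents are assumptions for illustration only.

```python
def coordinate_match(query_terms, doc_terms):
    """Count how many distinct query terms occur in the document."""
    return len(set(query_terms) & set(doc_terms))


# Made-up query and documents, tokenized by whitespace.
query = ["vector", "model"]
docs = {
    "d1": "the vector model ranks documents".split(),
    "d2": "vector vector vector spaces".split(),
    "d3": "boolean model with binary weights".split(),
}

scores = {name: coordinate_match(query, terms) for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Here d2 repeats "vector" three times yet scores the same as d3, which contains "model" once. This shows the first drawback: the count ignores how often a term occurs in a document.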
sim(d_j, q) = d_j · q favors long documents; sim(d_j, q) = (d_j · q) / |d_j|; sim(d_j, q) = 1 - D(d_j, q) discriminates against long documents.
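The effect of document length on the first two measures above can be checked with a small Python sketch. The vectors and function names are assumptions chosen for illustration. The third measure is left out because the slides do not define D.

```python
import math


def dot(d, q):
    """Inner product of two equal-length weight vectors."""
    return sum(x * y for x, y in zip(d, q))


def norm(v):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(x * x for x in v))


def sim_inner(d, q):
    """sim(d, q) = d . q"""
    return dot(d, q)


def sim_normalized(d, q):
    """sim(d, q) = (d . q) / |d|"""
    return dot(d, q) / norm(d)


# Made-up vectors: the long document has larger weights everywhere,
# including on a term the query does not mention.
q = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]
long_doc = [3.0, 3.0, 4.0]

print(sim_inner(short_doc, q), sim_inner(long_doc, q))
print(sim_normalized(short_doc, q), sim_normalized(long_doc, q))
```

With these vectors the plain inner product ranks the long document first, while dividing by |d_j| puts the short, on-topic document ahead.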
Computing index term weights. Term frequency (tf factor): how well the term describes the document contents. Inverse document frequency (idf factor): how well the term distinguishes the document from the others in the collection. How should these two effects be balanced?
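One common way to balance the two factors is to multiply them, w = tf × idf. The sketch below assumes a particular variant that the slides do not spell out: tf is the raw count divided by the largest count in the document, and idf is the natural log of N / n_i, where N is the number of documents and n_i the number of documents containing the term. The function name and the toy collection are likewise illustrative.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return {doc_name: {term: weight}} with weight = tf * idf (assumed variant)."""
    n_docs = len(docs)
    # n_i: number of documents that contain term i.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        max_count = max(counts.values())
        weights[name] = {
            # Normalized term frequency times inverse document frequency.
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Made-up collection: "model" appears in every document.
docs = {
    "d1": ["retrieval", "retrieval", "model"],
    "d2": ["model", "boolean"],
    "d3": ["model", "vector", "vector"],
}
weights = tf_idf_weights(docs)
print(weights["d1"])
```

Because "model" occurs in every document, its idf is log(1) = 0 and it contributes nothing to any document's weight, whereas a term confined to one document keeps a high weight there.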
The term-weighting scheme improves retrieval performance. The partial matching strategy allows approximate query results. The results are ranked by degree of similarity. The vector model is a popular retrieval model nowadays because of its simplicity and performance.