
Search A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014.



Presentation transcript:

1 Search A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014

2 Back in those days Once upon a time, there were no search engines. We had access to a much smaller amount of information, and we had to find it manually.

3 Search engine A user needs some information. Assumption: the required information is present somewhere. A search engine tries to bridge this gap. How:
– The user "expresses" the information need as a query
– The engine returns a list of documents, or answers by some better means

4 Search engine A user needs some information. Assumption: the required information is present somewhere. A search engine tries to bridge this gap. Simplest model:
– The user submits a query: a set of words (terms)
– The search engine returns documents "matching" the query
– Assumption: matching the query satisfies the information need
– Modern search has come a long way from this simple model, but the fundamentals are still required

5 Basic approach Example collection:
This is in Indian Statistical Institute, Kolkata, India
Statistically flying is the safest mode of journey
Diwali is a huge festival in India
India's population is huge
Thank god it is a holiday
This is autumn
There is no end of learning
– Documents contain terms; documents are represented by the terms present in them
– Match queries and documents by terms
– For simplicity: ignore positions, treat documents as "bags of words"
– There may be many matching documents, so we need to rank them
Query: india statistics

6 Vector space model Each term represents a dimension; documents are vectors in the term space. The term-document matrix is a very sparse matrix. The query is also a vector in the term space:

term          d1  d2  d3  d4  d5   q
diwali         1   0   0   0   0   0
india          1   0   0   1   1   1
flying         0   1   0   0   0   0
population     0   0   0   1   0   0
autumn         0   0   1   0   0   0
statistical    0   1   0   0   1   1

– The similarity of each document d with the query q is measured by the cosine similarity (the dot product of d and q, normalized by the norms of the vectors)
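As an illustration, cosine similarity over such sparse vectors fits in a few lines of Python (a minimal sketch; the dictionary-based vectors are one way to represent sparse columns, and the example data is illustrative):

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Dot product of two sparse term-count vectors, normalized by their norms."""
    terms = set(doc_vec) | set(query_vec)
    dot = sum(doc_vec.get(t, 0) * query_vec.get(t, 0) for t in terms)
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# A document containing "flying" and "statistical", query "india statistics"
d = {"flying": 1, "statistical": 1}
q = {"india": 1, "statistical": 1}
print(cosine_similarity(d, q))  # 1 / (sqrt(2) * sqrt(2)), i.e. about 0.5
```

Only the shared dimension ("statistical") contributes to the dot product; the norms penalize documents that contain many other terms.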

7 Scoring function: TF.iDF How important is a term t in a document d? Approach: take two factors into account:
– With what significance does t occur in d? [term frequency]
– Does t occur in many other documents also? [document frequency]
– Called TF.iDF: TF × iDF; there are many variants for TF and iDF
Variants for TF(t, d):
1. Number of times t occurs in d: freq(t, d)
2. Logarithmically scaled frequency: 1 + log(freq(t, d)) for all t in d; 0 otherwise
3. Augmented frequency: 0.5 + 0.5 × freq(t, d) / max_t' freq(t', d), which avoids bias towards longer documents (half the score for just being present, the rest a function of the frequency)
Inverse document frequency of t: iDF(t) = log(N / DF(t)), where N = total number of documents and DF(t) = number of documents in which t occurs.
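These variants translate directly into code. A sketch of the TF variants and iDF above (the function names and the toy numbers are chosen here for illustration):

```python
import math

def tf_raw(freq):
    # Variant 1: number of times t occurs in d
    return freq

def tf_log(freq):
    # Variant 2: 1 + log(freq(t, d)) for terms present in d; 0 otherwise
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_augmented(freq, max_freq):
    # Variant 3: half the score for just being present, the rest from frequency
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def idf(term, df, n_docs):
    # iDF(t) = log(N / DF(t)); assumes DF(t) >= 1
    return math.log(n_docs / df[term])

def tfidf(term, freq, df, n_docs):
    return tf_log(freq) * idf(term, df, n_docs)

# Toy numbers: 7 documents, "india" occurring twice in d and in 3 documents overall
print(tfidf("india", 2, {"india": 3}, 7))  # (1 + log 2) * log(7/3), about 1.43
```

A rare term (small DF) gets a large iDF, so its occurrences weigh more than occurrences of a common term.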

8 BM25 From the Okapi IR system: Okapi BM25. If the query q = {q1, …, qn}, where the qi's are the words in the query:

score(d, q) = Σi iDF(qi) × freq(qi, d) × (k1 + 1) / (freq(qi, d) + k1 × (1 − b + b × |d| / avgdl))

where N = total number of documents (used inside iDF), avgdl = average length of documents, and k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0. BM25 consistently exhibited better performance than TF.iDF in TREC.
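A sketch of BM25 scoring under the formula above (the iDF here uses a commonly seen smoothed variant, and all statistics in the example are made up):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    doc_tf: term -> frequency in this document; df: term -> document frequency.
    """
    score = 0.0
    for t in query_terms:
        f = doc_tf.get(t, 0)
        if f == 0 or t not in df:
            continue
        # A commonly used smoothed iDF variant (guaranteed non-negative)
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score

# Made-up toy statistics: 7 documents, average length 10, query "india statistics"
print(bm25_score(["india", "statistics"], {"india": 2, "statistics": 1},
                 doc_len=8, avgdl=10, df={"india": 3, "statistics": 2}, n_docs=7))
```

Note how the k1 term saturates the contribution of repeated occurrences, and how b scales the length normalization between none (b = 0) and full (b = 1).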

9 Relevance Simple IR model: query, documents, returned results. A relevant document is one that satisfies the information need expressed by the query:
– Merely matching query terms does not make a document relevant
– Relevance is a human perception, not a mathematical statement
– With the query "india statistics", the user may want statistics on the population of India
– The document "Indian Statistical Institute" matches the query terms, but is not relevant
To evaluate the effectiveness of a system, we need, for each query:
1. Given a result, an assessment of whether it is relevant
2. The set of all relevant results, assessed (pre-validated) in advance
If the second is available, it serves the purpose of the first as well.
Measures: precision, recall, F-measure (the harmonic mean of precision and recall).
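Given relevance assessments, the three measures can be computed directly (a small sketch; the doc ids in the example are made up):

```python
def evaluate(retrieved, relevant):
    """Precision, recall and F1 of a retrieved result set against the
    pre-assessed set of all relevant results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1

# 3 results returned, 2 of them among the 4 relevant documents
p, r, f = evaluate(["d2", "d4", "d7"], ["d4", "d7", "d9", "d11"])
print(p, r, f)  # precision 2/3, recall 1/2, F1 = 4/7 (about 0.571)
```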

10 Inverted index Standard representation: document → terms. Inverted index: term → documents. For each term t, store the list of the documents in which t occurs. For the example collection of seven documents above:

diwali: d3
india: d2, d3, d7
flying: d1
population: d7
autumn: d4
statistical: d1, d2

Scores?

11 Inverted index For each term t, store the documents in which t occurs together with a score (note: these scores are dummy values, not computed by any formula):

diwali: d3 (0.5)
india: d2 (0.7), d3 (0.3), d7 (0.4)
flying: d1 (0.3)
population: d7 (0.6)
autumn: d4 (0.8)
statistical: d1 (0.2), d2 (0.5)
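Building such a term → documents mapping is straightforward (a minimal sketch over a toy collection; the documents and ids here are illustrative, not the slide's exact ones):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "statistically flying is safe",
        2: "india statistical institute",
        3: "diwali festival in india"}
index = build_index(docs)
print(index["india"])  # [2, 3]
```

A scored index would store (doc id, score) pairs in each posting list instead of bare ids.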

12 Positional index Storing just documents and scores follows the bag-of-words model, so we cannot perform proximity searches or phrase-query searches. A positional inverted index also stores the position of each occurrence of term t in each document d where t occurs; each posting then has the form d (score): ⟨list of positions of t in d⟩, e.g. india: d2 (0.7): ⟨…⟩, d3 (0.3): ⟨…⟩, d7 (0.4): ⟨…⟩.
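A positional index and a phrase query over it might be sketched as follows (illustrative only; tokenization is naive whitespace splitting, and scores are omitted to keep the structure clear):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc id -> sorted list of positions of the term in that doc}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Docs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    if any(t not in index for t in terms):
        return []
    result = []
    for doc_id in index[terms[0]]:
        if all(doc_id in index[t] for t in terms[1:]):
            starts = index[terms[0]][doc_id]
            if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
                   for p in starts):
                result.append(doc_id)
    return sorted(result)

docs = {1: "indian statistical institute", 2: "statistical institute of india"}
idx = build_positional_index(docs)
print(phrase_query(idx, "statistical institute"))  # [1, 2]
```

A proximity query would relax the consecutive-position check to a window of positions.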

13 Pre-processing
– Removal of stopwords (of, the, and, …): modern search does not completely remove stopwords, since such words add meaning to sentences as well as to queries
– Stemming: map words to their stem (root). Statistics, statistically, statistical → statistic (same root). Slight loss of information (the form of the word also matters), but it unifies differently expressed queries on the same topic
– Lemmatization: doing this properly, with morphological analysis of words
– Normalization: unify equivalent words as much as possible (U.S.A, USA; Windows, windows)
Stemming, lemmatization, normalization and synonym finding are all important subfields in their own right!
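A toy pre-processing pipeline along these lines (the stopword list and the crude suffix-stripping rules here are made up for illustration; real systems use Porter-style stemmers or proper morphological lemmatizers):

```python
STOPWORDS = {"of", "the", "and", "is", "a", "in"}

def rough_stem(word):
    # Crude suffix stripping, for illustration only
    for suffix in ("ically", "ical", "ics", "ally", "al", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().replace(".", "").split()
    return [rough_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Statistics and statistical analysis"))
```

Here "statistics" and "statistical" both map to the same root, illustrating how differently expressed queries get unified, while the mangled output for other words shows the information loss the slide mentions.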

14 Creating an inverted index For each document, write out pairs (term, docId); sort by term; group, and compute DF.

Step 1, pairs in document order: (statistic, 1), (fly, 1), (safe, 1), …, (india, 2), (statistic, 2), (india, 3), …
Step 2, sorted by term: (india, 2), (india, 3), (india, 7), …, (fly, 1), (safe, 1), (statistic, 1), (statistic, 2), …
Step 3, grouped with DF: india (df=3): 2, 3, 7; fly (df=1): 1; statistic (df=2): 1, 2; …
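The three steps map directly to code (a sketch; itertools.groupby after sorting does the grouping, and the toy documents are illustrative):

```python
from itertools import groupby

def build_index_by_sorting(docs):
    """(term, doc id) pairs -> sort by term -> group, computing DF."""
    pairs = []
    for doc_id, text in docs.items():
        for term in set(text.split()):        # one pair per (term, document)
            pairs.append((term, doc_id))
    pairs.sort()                              # step 2: sort by term
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = sorted(doc_id for _, doc_id in group)
        index[term] = {"df": len(postings), "postings": postings}
    return index

docs = {1: "india statistical institute",
        2: "statistical journey",
        3: "india festival"}
idx = build_index_by_sorting(docs)
print(idx["india"])  # {'df': 2, 'postings': [1, 3]}
```

At web scale the same sort-group pattern is what a MapReduce-style indexing job performs, with the sort done externally on disk.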

15 Traditional architecture Indexing side: different types of documents → basic format conversion and parsing → analysis (stemming, normalization, …) → indexing → index. Query side: the user's query goes to a query handler (query parsing), then to core query processing (accessing the index, ranking); a results handler displays the results back to the user.

16 Query processing The postings lists are sorted by doc id, with one pointer into each list:

List 1: doc 17 (0.3), doc 21 (0.2), doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 91 (0.1)
List 2: doc 5 (0.6), doc 14 (0.6), doc 17 (0.6), doc 21 (0.3), doc 38 (0.6), doc 44 (0.1), doc 83 (0.5)
List 3: doc 10 (0.1), doc 17 (0.7), doc 61 (0.3), doc 65 (0.1), doc 81 (0.2), doc 83 (0.9)

17-27 Merge At each step, pick the smallest doc id among the pointers, sum that document's scores across the lists that contain it, append it to the merged list, and advance the corresponding pointers. The first steps emit doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), then doc 17 (0.3 + 0.6 + 0.7 = 1.6), and so on.

28 The merged list, still sorted by doc id: doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), doc 21 (0.5), doc 25 (0.6), doc 38 (0.6), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 78 (0.5), doc 81 (0.2), doc 83 (1.8), doc 91 (0.1). A (partial) sort then yields the top-2: doc 83 (1.8), doc 17 (1.6). Complexity of extracting the top-k from the n merged entries: O(k log n).

29 Merge Simple and efficient, with minimal overhead: lists sorted by doc id → merge → merged list. But the lists have to be scanned fully!
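The doc-id-ordered merge of the preceding slides can be implemented with a small heap holding one pointer per list (a sketch using the three example lists; scores are summed per document and the top-k extracted at the end):

```python
import heapq

def merge_and_rank(lists, k):
    """Merge doc-id-sorted posting lists of (doc_id, score) pairs, summing
    scores per document, then partially sort to get the top-k.
    Note: every list is scanned fully."""
    # Heap of (doc id at pointer, list index, pointer position)
    heap = [(lst[0][0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    totals = {}
    while heap:
        doc_id, i, j = heapq.heappop(heap)   # smallest doc id among pointers
        totals[doc_id] = totals.get(doc_id, 0.0) + lists[i][j][1]
        if j + 1 < len(lists[i]):            # advance this list's pointer
            heapq.heappush(heap, (lists[i][j + 1][0], i, j + 1))
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

l1 = [(17, 0.3), (21, 0.2), (25, 0.6), (78, 0.5), (83, 0.4), (91, 0.1)]
l2 = [(5, 0.6), (14, 0.6), (17, 0.6), (21, 0.3), (38, 0.6), (44, 0.1), (83, 0.5)]
l3 = [(10, 0.1), (17, 0.7), (61, 0.3), (65, 0.1), (81, 0.2), (83, 0.9)]
print(merge_and_rank([l1, l2, l3], 2))  # top-2: doc 83 (1.8), then doc 17 (1.6)
```

heapq.nlargest is exactly the "(partial) sort" of the slide: it extracts the k best entries without fully sorting the merged list.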

30 Top-k algorithms If there are millions of documents in the lists, can the ranking be done without accessing the lists fully?
– Exact top-k algorithms (used more in databases): the family of threshold algorithms (Ronald Fagin et al.): the Threshold Algorithm (TA), the No Random Access algorithm (NRA) [we will discuss this one as an example], the Combined Algorithm (CA), and other follow-up works
– Inexact top-k algorithms: exact top-k is not required, since the scores are only a crude approximation of "relevance" (a human perception); several heuristics exist. Further reading: the IR book by Manning, Raghavan and Schuetze, Ch. 7

31 NRA (No Random Access) Algorithm Now the lists are sorted by score:

List 1: doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 17 (0.3), doc 21 (0.2), doc 91 (0.1)
List 2: doc 17 (0.6), doc 38 (0.6), doc 14 (0.6), doc 5 (0.6), doc 83 (0.5), doc 21 (0.3), doc 44 (0.1)
List 3: doc 83 (0.9), doc 17 (0.7), doc 61 (0.3), doc 81 (0.2), doc 65 (0.1), doc 10 (0.1)

Fagin's NRA algorithm reads one doc from every list per round, maintaining for each seen doc d an interval [current score, best score]: the current (worst) score sums the scores seen so far, and the best score adds, for every list where d has not yet appeared, that list's current bottom score. The maximum possible score of any unseen doc is the sum of the current bottom scores. The algorithm stops when the min top-2 current score is at least the best score of every other candidate and of any unseen doc.

32 Round 1 (reads 0.6, 0.6, 0.9; maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1). Candidates: doc 83 [0.9, 2.1], doc 17 [0.6, 2.1], doc 25 [0.6, 2.1]. Min top-2 score: 0.6, still below the best-scores of candidates; continue.

33 Round 2 (reads 0.5, 0.6, 0.7; maximum score for unseen docs: 1.8). Candidates: doc 17 [1.3, 1.8], doc 83 [0.9, 2.0], doc 25 [0.6, 1.9], doc 38 [0.6, 1.8], doc 78 [0.5, 1.8]. Min top-2 score: 0.9; continue.

34 Round 3 (reads 0.4, 0.6, 0.3; maximum score for unseen docs: 1.3). Candidates: doc 83 [1.3, 1.9], doc 17 [1.3, 1.7], doc 25 [0.6, 1.5], doc 78 [0.5, 1.4]. Min top-2 score: 1.3; no new doc can get into the top-2, but extra candidates are left in the queue; continue.

35 Round 4 (reads 0.3, 0.6, 0.2; maximum score for unseen docs: 1.1). Candidates: doc 17 = 1.6 (fully seen), doc 83 [1.3, 1.9], doc 25 [0.6, 1.4]. Min top-2 score: 1.3; still extra candidates in the queue; continue.

36 Round 5 (reads 0.2, 0.5, 0.1; maximum score for unseen docs: 0.8). doc 83 = 1.8, doc 17 = 1.6. Min top-2 score: 1.6, no extra candidate in the queue. Done!

More approaches:
– Periodically also perform random accesses on documents to reduce uncertainty (CA)
– Sophisticated scheduling of accesses on the lists
– Crude approximation: NRA may take a long time to stop; just stop after a while with an approximate top-k. Who cares whether the results are perfect according to the scores?
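A compact sketch of NRA as traced above (the bookkeeping is simplified and ties are broken arbitrarily, so this is illustrative rather than an optimized implementation):

```python
def nra_top_k(lists, k):
    """Fagin's NRA sketch: lists are score-sorted [(doc_id, score), ...].
    Sequential accesses only; maintain [worst, best] score bounds per doc."""
    seen = {}    # doc_id -> {list index: score seen there}
    depth = 0
    while True:
        for i, lst in enumerate(lists):          # one read from every list
            if depth < len(lst):
                doc, s = lst[depth]
                seen.setdefault(doc, {})[i] = s
        depth += 1
        # Current "bottom" score of each list (0 once the list is exhausted)
        bottoms = [lst[depth - 1][1] if depth - 1 < len(lst) else 0.0
                   for lst in lists]
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(b for i, b in enumerate(bottoms)
                                  if i not in seen[d])
                for d in seen}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        min_top = min(worst[d] for d in top) if len(top) == k else float("-inf")
        unseen_best = sum(bottoms)               # best score of any unseen doc
        done = (min_top >= unseen_best and
                all(min_top >= best[d] for d in seen if d not in top))
        if done or all(depth >= len(lst) for lst in lists):
            return [(d, worst[d]) for d in top]

l1 = [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1)]
l2 = [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3), (44, 0.1)]
l3 = [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)]
print(nra_top_k([l1, l2, l3], 2))  # top-2: doc 83 (1.8) and doc 17 (1.6)
```

On these lists the stopping condition first holds after five rounds, matching the trace on the slides.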

37 References Primarily: the IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/

