Web Algorithmics: Web Search Engines

1 Web Algorithmics Web Search Engines

2 Goal of a Search Engine
Retrieve docs that are “relevant” for the user query
  Doc: Word or PDF file, web page, email, blog, e-book, ...
  Query: “bag of words” paradigm
  Relevant ?!?

3 Two main difficulties
The Web: extracting “significant data” is difficult!
  Language and encodings: hundreds…
  Distributed authorship: spam, format-less content, …
  Dynamic: in one year 35% of pages survive, 20% remain untouched
The User: matching “user needs” is difficult!
  Query composition: short (2.5 terms on average) and imprecise
  Query results: 85% of users look at just one result page
  Several needs: informational, navigational, transactional

4 Evolution of Search Engines
First generation (1995-1997: AltaVista, Excite, Lycos, etc.): use only on-page, web-text data
  Word frequency and language
Second generation (1998: Google): use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor text (how people refer to a page)
Third generation (Google, Yahoo, MSN, ASK, …): answer “the need behind the query”
  Focus on the “user need” rather than on the query
  Integrate multiple data sources
  Click-through data
Fourth generation: Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

9 This is a search engine!!!

10 Web Algorithmics The structure of a Search Engine

11 The structure
[Figure: search engine architecture. Labels: Web, Crawler, Page archive, Control, Page Analyzer, Indexer, index structures (text, structure, auxiliary), Query resolver, Ranker, user query (?)]

13 Problem: Indexing
Consider Wikipedia En:
  Collection size ≈ 10 Gbytes
  # docs ≈ 4 × 10^6
  # terms in total > 1 billion (avg term length = 6 chars)
  # distinct terms = several millions
Which kind of data structure do we build to support word-based searches?

14 DB-based solution: Term-Doc matrix
Entry is 1 if the doc (play) contains the word, 0 otherwise
  # terms > 1M
  # docs ≈ 4M
  Space ≈ 1M × 4M bits ≈ 4 Tb!
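As a rough check of the space estimate, assuming one bit per matrix entry and the rounded counts given on the slide:

```latex
10^{6}\ \text{terms} \times 4\cdot 10^{6}\ \text{docs}
  = 4\cdot 10^{12}\ \text{bits} \approx 4\ \text{Tb} \approx 0.5\ \text{TB}
```

That is about 50 times the size of the 10 Gbyte collection itself, and almost every entry is 0.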

15 Current solution: Inverted index
  Brutus    → 1 2 3 5 8 13 21 34
  the       → 2 4 6 10 32
  Calpurnia → 13 16
Currently they get 2 to 4% of the original text
A term like Calpurnia may use log2 N bits per occurrence
A term like the should take about 1 bit per occurrence
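A minimal sketch of how such an index could be built, assuming whitespace tokenization and lowercasing. The toy documents and the name `build_inverted_index` are illustrative, not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it.

    `docs` is a dict {doc_id: text}. Tokenization is a plain
    lowercase/whitespace split, which is an assumption of this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Posting lists are kept sorted so they can be gap-coded and merged.
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "the noble Brutus",
    3: "Calpurnia warned the Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])  # posting list for "brutus"
```

Only the terms that actually occur in a document produce an entry, which is why the index is so much smaller than the term-doc matrix.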

16 Gap-coding for postings
Sort the docIDs
Store gaps between consecutive docIDs:
  Brutus: 33, 47, 154, 159, 202 …
  becomes 33, 14, 107, 5, 43 …
Two advantages:
  Space: store smaller integers (clustering?)
  Speed: query requires just a scan
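The transformation can be sketched as follows; the function names `to_gaps` and `from_gaps` are illustrative. Decoding is a running sum, which is why a query needs just one sequential scan.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace sorted docIDs by the first ID followed by consecutive gaps."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def from_gaps(gaps):
    """Rebuild the docIDs with a prefix sum over the gaps."""
    return list(accumulate(gaps))


brutus = [33, 47, 154, 159, 202]
gaps = to_gaps(brutus)
print(gaps)             # [33, 14, 107, 5, 43]
print(from_gaps(gaps))  # [33, 47, 154, 159, 202]
```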

17 γ code for integer encoding
For x > 0, let Length = ⌊log2 x⌋ + 1 (the number of bits of x in binary)
The codeword is (Length − 1) zeros followed by x in binary
  e.g., 9 is represented as 000 1001
The γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
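A small sketch of the γ code, using strings of '0'/'1' characters instead of packed bits for readability (an assumption of this example, not something the slides specify).

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length - 1) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("gamma code is defined only for x > 0")
    binary = bin(x)[2:]               # Length = len(binary) bits
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":         # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


print(gamma_encode(9))                # 0001001
stream = "".join(gamma_encode(g) for g in [33, 14, 107, 5, 43])
print(gamma_decode(stream))           # [33, 14, 107, 5, 43]
```

The codeword for 9 has 2⌊log2 9⌋ + 1 = 7 bits, and the codewords are prefix-free, so gaps can be concatenated without separators.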

18 Rice code (simplification of Golomb code)
It is a parametric code: it depends on k
Quotient q = ⌊(v − 1)/k⌋, and the remainder is r = v − k·q − 1
Codeword layout: [q times 0s] 1 [r in log2 k bits]
Useful when integers are concentrated around k
How do we choose k? Usually k ≈ 0.69 × mean(v) [Bernoulli model]
Optimal for Pr(x) = p(1 − p)^(x−1), where mean(x) = 1/p, and i.i.d. integers
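The following sketch follows the quotient/remainder definition above, assuming k is a power of two so that the remainder fits in exactly log2 k bits. Bit strings are used for readability and the function names are illustrative.

```python
def rice_encode(v, k):
    """Rice code of v > 0: q zeros, a 1, then r in log2(k) bits."""
    if v <= 0:
        raise ValueError("v must be positive")
    if k <= 0 or k & (k - 1):
        raise ValueError("this sketch assumes k is a power of two")
    q = (v - 1) // k
    r = v - k * q - 1
    width = k.bit_length() - 1        # log2(k)
    remainder = format(r, "b").zfill(width) if width else ""
    return "0" * q + "1" + remainder


def rice_decode(bits, k):
    """Decode a concatenation of Rice codewords with parameter k."""
    width = k.bit_length() - 1
    values = []
    i = 0
    while i < len(bits):
        q = 0
        while bits[i] == "0":         # unary quotient
            q += 1
            i += 1
        i += 1                        # skip the terminating 1
        r = int(bits[i:i + width], 2) if width else 0
        i += width
        values.append(k * q + r + 1)
    return values


print(rice_encode(9, 4))              # 00100  (q = 2, r = 0)
stream = "".join(rice_encode(v, 4) for v in [1, 4, 5, 9])
print(rice_decode(stream, 4))         # [1, 4, 5, 9]
```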

19 PForDelta coding
[Figure: a block of 128 numbers; small values are packed in b bits each, large values are stored as exceptions]
Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions
Several approaches to encode exceptions
Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions
Translate data
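A deliberately simplified sketch of the idea, not the actual PForDelta layout: values that fit in b bits go into fixed-width slots, and the others are kept in a separate exception list as (position, value) pairs. This exception format and the names `choose_b`, `pfor_encode` and `pfor_decode` are assumptions of this example; real implementations pack the slots into machine words and encode exceptions more compactly.

```python
def choose_b(block, coverage_pct=90):
    """Smallest b such that at least coverage_pct% of the values fit in b bits."""
    if not block:
        raise ValueError("empty block")
    ordered = sorted(block)
    # Number of values that must fit, rounded up (integer arithmetic only).
    needed = (coverage_pct * len(ordered) + 99) // 100
    cutoff = ordered[max(needed, 1) - 1]
    return max(1, cutoff.bit_length())


def pfor_encode(block, b):
    """Split a block into b-bit slots plus a list of exceptions."""
    limit = 1 << b
    slots, exceptions = [], []
    for pos, value in enumerate(block):
        if value < limit:
            slots.append(value)
        else:
            slots.append(0)                   # placeholder in the slot array
            exceptions.append((pos, value))   # patched back at decode time
    return slots, exceptions


def pfor_decode(slots, exceptions):
    """Copy the slots, then patch the exceptions in place."""
    out = list(slots)
    for pos, value in exceptions:
        out[pos] = value
    return out


block = [1, 0, 1, 1, 2, 3, 42, 1, 3, 23]      # hypothetical small block
slots, exceptions = pfor_encode(block, b=2)
print(exceptions)                             # [(6, 42), (9, 23)]
print(pfor_decode(slots, exceptions) == block)
```

With b = 2 only the two large values become exceptions; raising b to 6 would remove the exceptions but spend 6 bits on every small value.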

20 Interpolative coding
M    = 1 2 3 5 7 9 11 15 18 19 20 21
gaps = 1 1 1 2 2 2 2 4 3 1 1 1
Recursive coding: preorder traversal of a balanced binary tree
At every step we know the lowest possible value, the highest possible value, and the number of values; initially num = |M| = 12, lo = 1, hi = 21
Take the middle element: h = 6, M[6] = 9
It is 1 + 5 = 6 ≤ M[h] ≤ 21 − (12 − 6) = 15
We can encode 9 in ⌈log2 (15 − 6 + 1)⌉ = 4 bits
Recurse on the left part (lo = 1, hi = 8, num = 5) and on the right part (lo = 10, hi = 21, num = 6)
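The recursion can be traced with a short sketch that records, for each encoded element, its value, the range it is known to lie in, and the number of bits needed for that range. It reports bit counts only and does not emit an actual bit stream. The middle-element choice (the slide's h = 6 for 12 values) and the name `interpolative_steps` are assumptions of this example.

```python
def interpolative_steps(values, lo, hi):
    """Preorder list of (value, min_possible, max_possible, bits).

    `values` must be strictly increasing integers within [lo, hi].
    """
    steps = []

    def recurse(seq, lo, hi):
        n = len(seq)
        if n == 0:
            return
        m = (n - 1) // 2                  # 0-based middle (h - 1 on the slide)
        min_val = lo + m                  # m smaller values must fit below
        max_val = hi - (n - 1 - m)        # the remaining values must fit above
        span = max_val - min_val + 1
        bits = (span - 1).bit_length()    # equals ceil(log2(span))
        steps.append((seq[m], min_val, max_val, bits))
        recurse(seq[:m], lo, seq[m] - 1)
        recurse(seq[m + 1:], seq[m] + 1, hi)

    recurse(list(values), lo, hi)
    return steps


M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]
steps = interpolative_steps(M, lo=1, hi=21)
print(steps[0])   # (9, 6, 15, 4)
print(steps[1])   # (3, 3, 6, 2)
```

The first step reproduces the slide's computation: 9 lies in [6, 15], a range of 10 values, so 4 bits suffice. When the range collapses to a single value the element costs 0 bits, which is where the method gains on dense runs.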

21 Query processing
1) Retrieve all pages matching the query
  Brutus → 1 2 3 5 8 13 21 34
  the    → 2 4 6 13 32
  Caesar → 4 13 17

22 Some optimization
Best order for query processing? Shorter lists first…
  Brutus    → 1 2 3 5 8 13 21 34
  Caesar    → 2 4 8 13 31
  Calpurnia → 13 16
Query: Brutus AND Calpurnia AND Caesar
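A sketch of AND-query processing with the shorter-lists-first heuristic. The posting lists are one reading of the slide's figure, whose digit boundaries were lost in extraction, and the names `intersect` and `and_query` are illustrative.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Intersect the posting lists of all terms, shortest list first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:                # nothing left to match: stop early
            break
        result = intersect(result, postings)
    return result


# Posting lists as read from the slide (the digit grouping is an assumption).
index = {
    "Brutus": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [2, 4, 8, 13, 31],
    "Calpurnia": [13, 16],
}
print(and_query(index, ["Brutus", "Calpurnia", "Caesar"]))  # [13]
```

Starting from Calpurnia keeps every intermediate result at most two docIDs long, so the later merges have little left to compare.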

23 Phrase queries
Expand the posting lists with word positions
  to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Larger space occupancy, about ×10 on the Web
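A two-word phrase query over the positional postings above can be sketched like this. The dict-of-dicts layout and the name `phrase_query` are assumptions of this example; only the posting data comes from the slide.

```python
def phrase_query(index, first, second):
    """Return {doc_id: [positions of `first`]} where `second` follows immediately."""
    hits = {}
    left = index.get(first, {})
    right = index.get(second, {})
    for doc_id in sorted(left.keys() & right.keys()):
        second_positions = set(right[doc_id])
        starts = [p for p in left[doc_id] if p + 1 in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits


# Positional postings from the slide: term -> {doc_id: [positions]}.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_query(index, "to", "be"))  # {4: [16, 190, 429, 433]}
```

Only document 4 contains both words, and four of its occurrences of "to" are directly followed by "be".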

24 Query processing
1) Retrieve all pages matching the query
2) Order pages according to various scores:
  Term position & frequency (body, title, anchor, …)
  Link popularity
  User clicks or preferences
  Brutus → 1 2 3 5 8 13 21 34
  the    → 2 4 6 13 32
  Caesar → 4 13 17

25 Generating the snippets !

26 The big fight: find the best ranking...

27 Ranking: Google vs Google.cn

