
1. Google & Document Retrieval
Qing Li, School of Computing and Informatics, Arizona State University

2. Outline
- A brief introduction to Google
- Architecture of a Web search engine
- Key techniques of search engines: indexing, matching & ranking
- Open-source code for search engines

3. Google Search Engine
- The name "Google" comes from "googol", the number 1 followed by 100 zeros; it reflects the company's mission to organize the immense amount of information available on the web.
- Information types: text, image, video

4. Google Service

5. Google Web Searching

6. Life of a Google Query

7. Web Search System
[Diagram: the Web is crawled to gather data; indexing builds the index (postings such as K1 -> d1, d2 and K2 -> d1, d3); the search engine matches user queries against the index to satisfy information needs.]

8. Conventional Overview of Text Retrieval
[Diagram, in two halves. Text processing: raw text goes through text analysis to produce the index. User/system interaction: information needs are analyzed to form a query. The search engine matches and ranks index entries against the query, drawing on knowledge resources & tools, and returns the retrieval result.]

9. Text Processing (1): Indexing
- An index is a list of terms with relevant information: frequency of terms, location of terms, etc.
- Index terms represent document content and separate documents from one another, e.g. "economy" vs. "computer" in a Financial Times news article.
- Building an index involves extracting the index terms and computing their weights.

10. Text Processing (2): Extraction
- Extract index terms at the word or phrase level.
- Morphological analysis (stemming in English): "information", "informed", "informs", "informative" -> inform
- Remove common words listed on a stop list: "a", "an", "the", "is", "are", "am", ...
- n-grams: "정보검색시스템" ("information retrieval system") -> "_정", "정보", "보검", "검색", ... (bigrams); surprisingly effective in some languages.
A sketch of these steps follows below.
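Below is a minimal sketch of these extraction steps, assuming a toy stop list and crude suffix rules; a real system would use a full stop list and a proper stemmer such as Porter's.

```python
# Toy extraction pipeline: tokenize, drop stop words, crudely stem,
# and emit character bigrams. The stop list and suffix rules are
# illustrative stand-ins, not a real morphological analyzer.

STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

def crude_stem(word):
    # Toy suffix stripping; a real system would use a Porter-style stemmer.
    for suffix in ("ative", "ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def bigrams(word):
    # Character bigrams (no boundary padding, unlike the slide's "_정").
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(extract_terms("The informative article informs readers"))
# -> ['inform', 'article', 'inform', 'reader']
print(bigrams("정보검색시스템"))
# -> ['정보', '보검', '검색', '색시', '시스', '스템']
```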

11. An Example
Building an indexing vocabulary from a collection of 1,033 abstracts in biomedicine:
- Identify all unique words in the collection: 13,471 terms
- Delete 170 common function words included in the stop list: 13,301 terms left
- Delete all terms with collection frequency equal to 1 (terms occurring in one document with frequency 1): 7,236 terms left
- Remove terminal "s" endings and combine identical word forms: 6,056 terms left
- Delete 30 very high-frequency terms occurring in over 24% of the documents: 6,026 terms left
The 6,026 remaining terms form the final indexing vocabulary. A sketch of this pipeline follows below.
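A hedged sketch of that pruning pipeline; `docs`, the stop list, and the crude terminal-"s" conflation are illustrative stand-ins, and the 24% cutoff is taken from the slide.

```python
from collections import Counter

def build_vocabulary(docs, stop_words, high_freq_cutoff=0.24):
    """Prune an indexing vocabulary roughly as in the slide's pipeline.
    `docs` is a list of token lists. The slide applies the steps
    sequentially; here they are folded into one pass."""
    # Strip terminal "s" so identical word forms merge (crude conflation).
    docs = [[w[:-1] if w.endswith("s") and len(w) > 3 else w for w in doc]
            for doc in docs]
    cf = Counter(w for doc in docs for w in doc)   # collection frequency
    df = Counter()                                 # document frequency
    for doc in docs:
        df.update(set(doc))
    return {w for w in cf
            if w not in stop_words                         # stop list
            and cf[w] > 1                                  # drop hapax terms
            and df[w] <= high_freq_cutoff * len(docs)}     # drop very common terms
```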

12. Text Processing (3): Term Weights
- Term weights are computed statistically from frequency information and measure the importance of a term in a document.
- E.g. TF*IDF, where TF is the total frequency of a term in a document and IDF is the inverse document frequency, derived from DF (the number of documents the term appears in).
- High TF, low DF: a good word to represent the text. High TF, high DF: a bad word.

13. An Example
- TF for "Arizona" is 1 in Doc 1 and 2 in Doc 2.
- DF for "Arizona" in this collection (Doc 1 & Doc 2) is 2, so a naive inverse, IDF = 1/DF, is 1/2, and TW = TF * IDF.
- Normalization of TF is critical to retrieval effectiveness: it prevents a bias towards longer documents.
  TF = 0.5 + 0.5 * (TF / Max TF)
  TW = TF * log2(N / DF + 1)
(The slide also notes that log10 of 10^-34 is -34, apparently illustrating how the logarithm compresses scale.)
A sketch of this weighting follows below.
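A small sketch of the two formulas above. The TF, DF, and N values come from the "Arizona" example; the Max TF of 3 per document is an assumed value for illustration.

```python
import math

def term_weight(tf, max_tf, df, n_docs):
    """Normalized TF times a log-scaled IDF, as on the slide:
    TW = (0.5 + 0.5 * TF / Max TF) * log2(N / DF + 1)."""
    norm_tf = 0.5 + 0.5 * (tf / max_tf)   # guards against long-document bias
    idf = math.log2(n_docs / df + 1)      # rarer terms score higher
    return norm_tf * idf

# "Arizona": TF = 1 in Doc 1, TF = 2 in Doc 2, DF = 2, N = 2 documents.
# Max TF = 3 in each document is an assumption for the example.
print(term_weight(1, 3, 2, 2))  # Doc 1 -> about 0.667
print(term_weight(2, 3, 2, 2))  # Doc 2 -> about 0.833
```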

14. Text Processing (4): Storing Indexing Results
[Diagram: from raw text to an index. The index maps each word (e.g. "Arizona", "University") to word information: its frequency in Document 1 and in Document 2.]

15. Text Processing (5): Storing Indexing Results
[Diagram: an inverted file. A directory lists terms (e.g. "search", "Google", "ASU", "tiger"), each with a pointer into a posting file; each posting list records the documents containing the term (Doc #1, Doc #2, Doc #5, ...). A query is answered by looking its terms up in the directory and following the pointers.]
A sketch of this structure follows below.
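A minimal in-memory sketch of an inverted file, assuming each posting holds a document id plus the term's positions; a real engine keeps the directory and posting file as separate, compressed structures as in the diagram.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, positions).
    `docs` maps doc_id -> token list; ids and tokens are illustrative."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))   # one posting per document
    return index

docs = {1: ["google", "search", "asu"],
        2: ["asu", "tiger"],
        5: ["search", "tiger"]}
index = build_inverted_index(docs)
print(index["search"])   # -> [(1, [1]), (5, [0])]
```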

16. Matching & Ranking
- Retrieval models: Boolean (exact match), vector space, probabilistic, inference network, language model, ...
- Weighting schemes: for index terms and query terms, and for the parameters in the formulas

17. Vector Space Model
Treat each document and query as a vector. With axes (dog, cat):
- Doc 1 contains "dog" twice and "cat" not at all: Doc 1 = (2, 0)
- Doc 2 contains "dog" twice and "cat" twice: Doc 2 = (2, 2)

18. Vector Space Model
- The document contains "dog" twice and "cat" twice: Doc = (2, 2).
- Query 1 is "dog": Query 1 = (1, 0). Query 2 is "cat, dog": Query 2 = (1, 1).
- cos(Q1, Doc) < cos(Q2, Doc): if we use the angle as a similarity measure, Q2 is more similar to Doc than Q1.

19. Vector Space Model
Given vectors X = (x1, ..., xn) and Y = (y1, ..., yn):
- Dot product: X · Y = x1*y1 + ... + xn*yn
- Cosine similarity: cos(X, Y) = (X · Y) / (|X| * |Y|), where |X| = sqrt(x1^2 + ... + xn^2)

20. Vector Space Model
- Document D1 contains "dog" three times, "mouse" twice, and "cat" once; the query Q is "dog, mouse".
- The slide's vectors: D1 = (1, 2, 3), Q = (1, 1, 0).
- Similarity = (1*1 + 2*1 + 3*0) / (|D1| * |Q|) = 3 / (sqrt(14) * sqrt(2)) ≈ 0.567
- Here the term weight is decided by the term frequency alone.
A sketch of this computation follows below.
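A small sketch of the cosine computation above, reproducing the slide's vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (1, 2, 3)   # document vector from the slide
q = (1, 1, 0)    # query vector from the slide
print(cosine(d1, q))   # -> 3 / (sqrt(14) * sqrt(2)) ≈ 0.567
```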

21. Matching & Ranking
Techniques for efficiency:
- New storage structures, especially for new document types
- Accumulators for efficient generation of ranked output
- Compression/decompression of indexes
Techniques for Web search engines:
- Use of hyperlinks: PageRank (inlinks & outlinks); HITS (authority vs. hub pages)
- Use in conjunction with directory services (e.g. Yahoo)

22. PageRank
- Basic idea: more links to a page imply a better page, but all links are not created equal: links from a more important page should count more than links from a weaker page.
- Basic PageRank R(A) for page A: R(A) = sum over all pages x linking to A of R(x) / outDegree(x), where outDegree(x) is the number of edges leaving page x, i.e. the hyperlinks on page x. Each page distributes its rank boost over all the pages it points to.
- Example graph (A links to B and C, B links to C, C links to A):
  PR(A) = PR(C) / 1
  PR(B) = PR(A) / 2
  PR(C) = PR(A) / 2 + PR(B) / 1

23. PageRank
- The PageRank definition is recursive: the rank of a page depends on and influences the ranks of other pages, but eventually the ranks converge.
- To compute PageRank: choose an arbitrary initial R_old and use it to compute R_new; repeat, setting R_old to R_new, until R converges (the difference between the old and new R is sufficiently small).
- Rank values typically converge in 50-100 iterations; rank orders converge even faster. (A sketch of the iteration appears after slide 28 below.)

24. Problems with Basic PageRank
- The Web is not a strongly connected graph.
- Rank sink: a single page (node) with no outward links accumulates rank, and nodes that are not part of the sink get a rank of 0.

25. Extended PageRank
- Remove all nodes without outlinks; these pages get no rank.
- Add a decay factor d, a constant typically between 0.8 and 0.9, with n the number of nodes/pages:
  PR(A) = (1 - d) / n + d * (sum over all pages x linking to A of PR(x) / outDegree(x))
- d represents the fraction of a page's rank that is distributed among the pages it links to; the rest of its rank is distributed among all pages.
- In the random surfer model, the decay factor corresponds to the user getting bored (or unhappy) with the links on a given page and jumping to any random page (not linked to).

26. Example
Set d = 0.5 and ignore n. Small graphs can be solved directly:
PR(A) = 0.5 + 0.5 * PR(C)
PR(B) = 0.5 + 0.5 * (PR(A) / 2)
PR(C) = 0.5 + 0.5 * (PR(A) / 2 + PR(B))
Solving gives:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

27. Example
PR(A) = 0.5 + 0.5 * PR(C)
PR(B) = 0.5 + 0.5 * (PR(A) / 2)
PR(C) = 0.5 + 0.5 * (PR(A) / 2 + PR(B))
Set the initial values of PR(A), PR(B), PR(C) to 1 and update in place, so each equation uses the most recent values. After the first iteration:
PR(A) = 0.5 + 0.5 * 1 = 1
PR(B) = 0.5 + 0.5 * (1 / 2) = 0.75
PR(C) = 0.5 + 0.5 * (1 / 2 + 0.75) = 1.125
After the second iteration:
PR(A) = 0.5 + 0.5 * 1.125 = 1.0625
PR(B) = 0.5 + 0.5 * (1.0625 / 2) = 0.765625
PR(C) = 0.5 + 0.5 * (1.0625 / 2 + 0.765625) = 1.1484375

28. Example
For larger graphs, use the iteration method:

Iteration   PR(A)         PR(B)         PR(C)
0           1             1             1
1           1             0.75          1.125
2           1.0625        0.765625      1.1484375
3           1.07421875    0.76855469    1.15283203
4           1.07641602    0.76910400    1.15365601
5           1.07682800    0.76920700    1.15381050
6           1.07690525    0.76922631    1.15383947
7           1.07691973    0.76922993    1.15384490
8           1.07692245    0.76923061    1.15384592
9           1.07692296    0.76923074    1.15384611
10          1.07692305    0.76923076    1.15384615
11          1.07692307    0.76923077    1.15384615
12          1.07692308    0.76923077    1.15384615

A sketch that reproduces this table follows below.
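A short sketch reproducing the iteration above, assuming in-place updates in the order A, B, C as in the worked example on slide 27; the converged values (14/13, 10/13, 15/13) do not depend on the update order.

```python
def pagerank(links, d=0.5, iterations=12):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / outdegree(q)) over all
    pages q linking to p, updating ranks in place (n is ignored, as in
    the slides)."""
    pr = {page: 1.0 for page in links}   # iteration 0: every rank is 1
    for _ in range(iterations):
        for page in pr:                  # insertion order: A, B, C
            incoming = [q for q, outs in links.items() if page in outs]
            pr[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    return pr

# The slides' graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
# -> {'A': 1.0769..., 'B': 0.7692..., 'C': 1.1538...}
```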

29. Problems with PageRank
- Biased against new Web pages, which have had little time to accumulate inlinks; this can be addressed with a boost factor.
- No balance between relevancy and popularity: very popular pages (such as search engines and web portals) may be ranked artificially high because of their popularity, even if they are not very related to the query.
- Despite these problems, PageRank seems to work fairly well in practice.

30. Open-Source Search Engine Code
- Lucene Search Engine: http://lucene.apache.org/
- SWISH: http://swish-e.org/
- Glimpse: http://webglimpse.net/
- and more

31. References
- L. Page & S. Brin. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies, Working Paper 1999-0120, 1998.
- Steven Levy. All Eyes on Google. Newsweek, April 12, 2004.
- E. Brown, J. Callan, B. Croft. Fast Incremental Indexing for Full-Text Information Retrieval. Proceedings of the 20th International Conference on Very Large Databases (VLDB), 1994.
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the Seventh International World Wide Web Conference (WWW 98), 1998.

