The Inside Story Christine Reilly CSCI 6175 September 27, 2011.


1 The Inside Story Christine Reilly CSCI 6175 September 27, 2011

2 Back in the late 1990s…

3 Problems With 1990s Search Engines
Spam: top results were ads
Users only look at top 10 results
Rapid growth of the Web

4 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

5 Step 1: Crawl to Retrieve Pages
Page text: “Welcome to my page. I have links to other pages on my page.”
URL List: http://page1.html, http://page2.html
Other URLs in the figure: http://page1.html, http://link1.html, http://link2.html

6 Step 1: Crawl to Retrieve Pages
Page text: “Welcome to another page. I also have links to other pages on my page.”
URL List: http://page1.html, http://page2.html
Other URLs in the figure: http://page2.html, http://link1.html, http://link2.html, http://link3.html, http://link4.html
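The crawl step above can be sketched as a breadth-first walk over a URL list. This is a minimal illustration, not the crawler described in the talk: `TOY_WEB`, `crawl`, and the link structure are hypothetical, and an in-memory dictionary stands in for fetching pages over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
# A real crawler would download each page and parse out its links.
TOY_WEB = {
    "http://page1.html": ["http://link1.html", "http://link2.html"],
    "http://page2.html": ["http://link1.html", "http://link3.html", "http://link4.html"],
    "http://link1.html": [],
    "http://link2.html": [],
    "http://link3.html": [],
    "http://link4.html": [],
}


def crawl(seed_urls, web, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were retrieved."""
    url_list = deque(seed_urls)  # pages still waiting to be fetched
    seen = set(seed_urls)        # avoids queueing the same URL twice
    retrieved = []
    while url_list and len(retrieved) < max_pages:
        url = url_list.popleft()
        retrieved.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                url_list.append(link)
    return retrieved
```

Calling `crawl(["http://page1.html", "http://page2.html"], TOY_WEB)` visits the two seed pages first and then each newly discovered link once.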

7 Issues With Web Crawling
How to crawl as much of the web as possible
Choose order of pages to crawl
Storing all the pages
When to re-crawl
Don’t irritate the page owner

8 Step 2-a: Create Hit List
Page http://pageX.html: “All About Bicycles. Bicycles are fun to ride. But watch out for cars on the road.”
Hits:
…
pageX; Bicycles; 50; h1; 1st
pageX; Bicycles; 60; norm; 1st
pageX; fun; 67; norm; none
pageX; ride; 81; norm; none
…
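A hit list like the one above could be produced along these lines. This is a sketch under assumptions: `make_hits` is a hypothetical helper, positions are simple word offsets (the slide does not say how its positions 50, 60, 67, 81 are counted), and the last field is a capitalization flag rather than the slide's "1st"/"none" labels.

```python
import re


def make_hits(doc_id, title, body):
    """Return (doc_id, word, position, kind, capitalized) tuples for one page.

    kind is "h1" for words in the title and "norm" for body text.
    """
    hits = []
    position = 0  # assumption: position is a running word offset
    for kind, text in (("h1", title), ("norm", body)):
        for word in re.findall(r"[A-Za-z0-9']+", text):
            hits.append((doc_id, word, position, kind, word[0].isupper()))
            position += 1
    return hits
```

For the example page, `make_hits("pageX", "All About Bicycles", "Bicycles are fun to ride.")` yields one "h1" hit and one "norm" hit for the word "Bicycles".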

9 Step 2-b: Create Anchors File
Page http://pageX.html: “All About Bicycles. Bicycles are fun to ride. But watch out for cars on the road.”
Anchors:
pageX; linkM; Bicycles
pageX; linkN; cars
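One way to collect (page, link target, anchor text) records such as those above is with Python's standard `html.parser`. The class and function names here are hypothetical, and the HTML in the usage note is an assumed rendering of the example page.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (page_id, href, anchor_text) for every <a href=...> element."""

    def __init__(self, page_id):
        super().__init__()
        self.page_id = page_id
        self.anchors = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.anchors.append((self.page_id, self._href, text))
            self._href = None


def extract_anchors(page_id, html):
    collector = AnchorCollector(page_id)
    collector.feed(html)
    collector.close()
    return collector.anchors
```

Given `<h1>All About <a href="linkM">Bicycles</a></h1><p>Watch out for <a href="linkN">cars</a>.</p>`, `extract_anchors("pageX", ...)` returns the two records shown on the slide.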

10 More Steps
Create inverted index sorted by word
Create lexicon
Search uses lexicon, inverted index, and PageRank
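Inverting per-document hits into a word-sorted index can be sketched as follows. The dictionary layout and the name `build_inverted_index` are assumptions chosen for readability; they are not the on-disk format from the talk.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Invert {doc_id: [(word, position), ...]} into a word-sorted index.

    Returns (lexicon, inverted): lexicon is the sorted list of words and
    inverted maps each word to [(doc_id, [positions]), ...].
    """
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, hits in forward_index.items():
        for word, position in hits:
            postings[word.lower()][doc_id].append(position)
    lexicon = sorted(postings)
    inverted = {
        word: [(doc_id, sorted(postings[word][doc_id])) for doc_id in sorted(postings[word])]
        for word in lexicon
    }
    return lexicon, inverted
```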

11 Search Process
Parse the query
Find documents that have all search terms
Compute the rank of each document
Return the top k documents (sorted by rank)

12 Search for “bicycle”
Inverted Index:
bicycle; pageA; 30, 70
bicycle; pageB; 98, 1100
car; pageA; 103
car; pageC; 107
car; pageD; 119, 598, 2004
Results: pageA, pageB
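The lookup above amounts to intersecting the document sets of the query terms. A minimal sketch using the slide's example data follows; the `search` function and the nested-dictionary layout are illustrative assumptions, and ranking is left out.

```python
# The slide's example index as word -> {doc_id: [positions]}.
INVERTED_INDEX = {
    "bicycle": {"pageA": [30, 70], "pageB": [98, 1100]},
    "car": {"pageA": [103], "pageC": [107], "pageD": [119, 598, 2004]},
}


def search(query, index):
    """Return the sorted doc ids that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets))
```

`search("bicycle", INVERTED_INDEX)` gives `["pageA", "pageB"]`, matching the slide, while `search("bicycle car", INVERTED_INDEX)` keeps only `pageA`.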

13 Ranking Results of a Query
Hit type: title, anchor, URL, large font, etc.
PageRank (more about that next)
Documents with words appearing closer together have higher weight
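How the three signals might be combined is sketched below. Everything numeric here is an assumption: the slides give no weights or formula, so `TYPE_WEIGHTS`, `proximity_bonus`, and `score` are hypothetical and only show the general shape of such a scoring function.

```python
# Hypothetical weights per hit type; the slides do not give actual values.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "plain": 1.0}


def proximity_bonus(positions_a, positions_b):
    """Larger when the closest occurrences of two query terms are nearer."""
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / gap if gap > 0 else 1.0


def score(hit_types, positions_a, positions_b, pagerank):
    """Combine hit-type weights, term proximity, and PageRank into one number."""
    type_score = sum(TYPE_WEIGHTS[t] for t in hit_types)
    return (type_score + proximity_bonus(positions_a, positions_b)) * pagerank
```

With these assumed weights, a document whose query terms sit next to each other scores higher than one where they are far apart, all else being equal.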

14 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

15 Data Storage
Use specialized data structures
Avoid expensive disk seeks

16 Repository of Crawled Web Pages
Record layout: docId | urlLen | pageLen | url | page
Pages compressed using zlib
All other data structures can be rebuilt from repository and list of crawler errors
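A record with this layout could be written and read back as sketched below. The field widths are assumptions (three unsigned 32-bit integers), as is the choice to store the compressed length in `pageLen`; the slide names the fields but not their sizes.

```python
import struct
import zlib

# Assumed fixed-width header: docId, urlLen, pageLen as unsigned 32-bit ints.
HEADER = struct.Struct("<III")


def pack_record(doc_id, url, page):
    """Serialize one crawled page as header + url + zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(page.encode("utf-8"))
    # Assumption: pageLen holds the compressed length so the record can be skipped.
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def unpack_record(record):
    """Inverse of pack_record; returns (doc_id, url, page)."""
    doc_id, url_len, page_len = HEADER.unpack_from(record)
    start = HEADER.size
    url = record[start:start + url_len].decode("utf-8")
    page_start = start + url_len
    page = zlib.decompress(record[page_start:page_start + page_len]).decode("utf-8")
    return doc_id, url, page
```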

17 Hit Data Structure
2 bytes per hit
3 types of hits:
– Plain
– Fancy (URL, title, meta tag, etc.)
– Anchor text
Parts of the hit data structure (bits used by each part):
Plain: Cap (1) | Font (3) | Position (12)
Fancy: Cap (1) | Font = 7 | Type (4) | Position (8)
Anchor: Cap (1) | Font = 7 | Type (4) | Hash (4) | Position (4)
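The plain-hit layout (1 + 3 + 12 bits) fits in a 16-bit integer, which can be sketched with shifts and masks. The function names are hypothetical, and two details are assumptions: the field order from most to least significant bit follows the order on the slide, and positions too large for 12 bits are clamped.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: Cap (1) | Font (3) | Position (12)."""
    if not 0 <= font_size <= 6:
        # Font value 7 marks fancy and anchor hits in the layout above.
        raise ValueError("plain hits use font sizes 0-6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 0xFFF)  # assumption: clamp to the 12-bit maximum
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF
```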

18 Forward Index
(n) = number of bits used
docId | wordId (24) | num hits (8) | list of hits | wordId (24) | num hits (8) | list of hits | null wordId
docId | wordId (24) | num hits (8) | list of hits | wordId (24) | num hits (8) | list of hits | wordId (24) | num hits (8) | list of hits | null wordId

19 Inverted Index
Lexicon (three entries shown), each: wordId | num Docs
Index (four entries shown), each: docId (27) | num Hits (5) | Hit List
…
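The 27-bit docId and 5-bit hit count of an index entry add up to 32 bits, so they can share one machine word. The sketch below uses Python's `struct` module; the big-endian byte order and the function names are assumptions, and the hit list that follows the header is omitted.

```python
import struct


def pack_posting_header(doc_id, num_hits):
    """Pack docId (27 bits) and num Hits (5 bits) into 4 bytes."""
    if not 0 <= doc_id < (1 << 27):
        raise ValueError("doc_id must fit in 27 bits")
    if not 0 <= num_hits < (1 << 5):
        raise ValueError("num_hits must fit in 5 bits")
    return struct.pack(">I", (doc_id << 5) | num_hits)


def unpack_posting_header(data):
    """Return (doc_id, num_hits) from a 4-byte header."""
    (word,) = struct.unpack(">I", data)
    return word >> 5, word & 0x1F
```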

20 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

21 Importance of a Web Page
Simple approximation: count backlinks
– Can easily create many links to my own page
– A page with one link from a “good” web page should get a higher importance
Better method: PageRank
– Use graph of the web
– Measure relative importance of web pages

22 Simplified PageRank
[Figure: example link graph annotated with rank values; values shown include 100, 9, 50, 3, 53, 26.5, and 25.]

23 The Real PageRank
Handles cycles of pages
Random Surfer: periodically jump to a random page
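The random-surfer idea can be sketched as an iterative computation: at each step a surfer follows an outgoing link with probability `damping` and otherwise jumps to a random page. The value 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions for illustration; the slides do not specify them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute ranks for {page: [pages it links to]}; ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Random jump: every page receives an equal baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

On a two-page cycle such as `{"A": ["B"], "B": ["A"]}` the ranks settle at 0.5 each instead of growing without bound, which is the cycle problem the random jump addresses.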

24 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

25 Quality of Results
A simple example query showed high-quality results
Today, Google is used by a very large number of people

26 Other Performance Metrics
Storage: All data used takes 55 GB
– Better compression -> 7 GB
System Performance
– Crawl: 9 days first time, 2.6 days (48.5 pages/s) second time
– Indexer: 54 pages/s; runs in parallel with crawl
– Sorting with 4 parallel machines: 24 hours
Search Performance: not a focus of the research
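As a rough check on scale, the second crawl's rate and duration can be multiplied out. This is only arithmetic on the two figures quoted above, assuming the 48.5 pages/s rate held for the whole 2.6 days; `pages_crawled` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_crawled(pages_per_second, days):
    """Total pages fetched at a constant rate over the given number of days."""
    return pages_per_second * days * SECONDS_PER_DAY


# 48.5 pages/s sustained for 2.6 days comes to roughly 10.9 million pages.
second_crawl_pages = pages_crawled(48.5, 2.6)
```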

27 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

28 Modern Search Challenge
Return relevant results
– Find hotel in NYC with certain amenities
– Assemble a geographically distributed committee
Current search engines: the user must sift through many results to find the relevant information

29 Information Extraction
Extract meaningful data from text, store as structured data
Example:
– Text: “Paris is the stylish capital of France”
– Data tuple: (Paris, capital of, France)
Automatically create collections of data that are currently human curated
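The Paris example can be reproduced with a single hand-written pattern. This is a toy sketch: `CAPITAL_PATTERN` and `extract_tuple` are hypothetical, the pattern covers only this one sentence shape, and practical information extraction systems rely on much richer language analysis.

```python
import re

# Hypothetical pattern for "<Subject> is the [adjectives] capital of <Object>".
CAPITAL_PATTERN = re.compile(
    r"(?P<subject>[A-Z]\w+) is the (?:\w+ )*?(?P<relation>capital of) (?P<object>[A-Z]\w+)"
)


def extract_tuple(text):
    """Return (subject, relation, object), or None if the pattern does not match."""
    match = CAPITAL_PATTERN.search(text)
    if match is None:
        return None
    return match.group("subject"), match.group("relation"), match.group("object")
```

`extract_tuple("Paris is the stylish capital of France")` returns `("Paris", "capital of", "France")`; the adjective "stylish" is skipped rather than stored.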

30 Outline: Motivation; How to Build a Search Engine; Storing All the Data; PageRank; Google Performance; The Future of Search; Conclusions

31 Ways to improve search:
– Format of text on page
– Following page links
Search must scale as the web grows
Search has come a long way, but new techniques will improve it

32 Questions?

