
“The Anatomy of a Large-Scale Hypertextual Web Search Engine” – Presented by Ahmed Khaled Al-Shantout, ICS 542 - 072




1 “The Anatomy of a Large-Scale Hypertextual Web Search Engine” Presented by Ahmed Khaled Al-Shantout, ICS 542 - 072

2 Outline
– Paper Objective
– Introduction & History
– Related Work
– Design Goals
– Google Search Engine Features
– Google Architecture
– Results & Performance
– Conclusion
– References

3 Paper Objective: to describe the anatomy of a large-scale web search engine = Google

4 Introduction & History
Why the name Google? From "googol" = 10^100, i.e. very large scale.
Searching the Web is different from classical information retrieval:
– The Web is vast and growing exponentially.
– The Web is heterogeneous – images, HTML, files, etc.
– IR on small, well-controlled, homogeneous collections is much easier.
Human-maintained lists can't keep up:
– Yahoo! is a human-maintained directory, and so is subjective, slow to improve, and expensive to build and maintain.
Google makes use of hypertext information to get better results.

5 Related Work
– WWWW (World Wide Web Worm), 1994: the first web search engine – indexed about 110,000 web pages and handled about 1,500 queries/day.
– In 1997, the top search engines claimed to index about 2 million web documents and handled about 20 million queries/day.
– By 2000, a search engine was expected to index more than a billion documents and serve more than 100 million queries/day.
Problems with search engines at the time:
– Human-maintained indices are subjective.
– Automated search engines return low-quality results.
– Advertisers can mislead automated search engines.

6 Solution = Google
Problem statement: the system must handle the problem in a very efficient way –
– Storage requirements
– Efficient processing by the indexing system
– Handling a huge number of queries per second
– Producing high-quality results

7 Design Goals
– Deliver results with very high precision, even at the expense of recall.
– Using hypertextual information – such as font size, links, titles, anchor text, etc. – can improve search quality. As of 1997, only one of the top four commercial search engines could even return itself in its own top ten results!
– Make search engine technology transparent, i.e. advertising shouldn't bias results.
– Bring search engine technology into the academic environment to support novel research on large web data sets.
– Make the system easy to use for most people, e.g. users shouldn't have to specify more than a couple of words.

8 Google Search Engine Features
Two main features increase result precision:
– Uses the link structure of the web (PageRank).
– Uses the text surrounding hyperlinks (anchor text) to improve document retrieval.
Other features:
– Takes word proximity within documents into account.
– Uses font size, word position, etc. to weight words.
– Stores the full raw HTML of pages.

9 PageRank
What does it mean for a web page to have a high rank?
– Many pages point to it, so it is an important one.
– Some important pages (such as Yahoo!) point to it.
PR(A) = (1 - d) + d * [PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)]
– d is the damping factor, used to prevent misleading the system into giving a higher ranking.
– T1 … Tn are the pages that point to A.
– C(T1) is the number of links going out of page T1.
Not all links are treated the same.
PageRank is calculated using a simple iterative algorithm.
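The iterative calculation can be sketched as follows – a minimal power-iteration implementation of the formula above. The three-page graph and its page names are made up for illustration, and every page has at least one out-link, so no dangling-page handling is needed:

```python
# Minimal power-iteration PageRank, following the slide's formula:
# PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page -> list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}              # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Contributions from every page t that links to `page`,
            # each divided by t's out-link count C(t).
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" ends up ranked highest: both A and B point to it.
```

With this (1 - d) formulation the ranks sum to the number of pages rather than to 1; the iteration converges geometrically because d < 1.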

10 Anchor Text
The text of a link (e.g. "Chapter1").
Objective: to return non-textual objects – files, databases, images, etc. – which cannot be indexed by a text-based search engine.
It also returns pages that have not been crawled.
Example: Target Page: PDF file (www …) | Link Text: Chapter1 | Page: Dr. Wasfi's Page

11 Google Architecture
Most of Google was built in C and C++ for efficiency. It runs on Solaris and Linux.

12 [Architecture diagram showing: URL Server, Crawler, Store Server, Repository, Indexer, Barrels, Sorter, Lexicon, URL Resolver, Anchors, Links, Doc Index, Searcher, PageRank]

13 Crawling the Web
Fetches URLs and gathers web pages into the store. It is a challenging task:
– Needs to be fast to keep the information up to date.
– Has to interact with the outside world – web servers, name servers, etc.
To scale well, Google uses distributed crawling:
– Each crawler keeps about 300 connections open at once.
– Up to 100 web pages per second using 4 crawlers.
– A DNS cache improves performance.
– A connection can be in one of these states: looking up DNS, connecting to host, sending request, and receiving response.
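The per-connection state machine and DNS cache described above can be sketched as below. This is only an illustration – the resolver is a stub with no real network I/O, and the host name and IP scheme are made up:

```python
# Sketch of a crawler connection's life cycle plus a DNS cache.
from enum import Enum, auto
from functools import lru_cache

class ConnState(Enum):
    LOOKUP_DNS = auto()
    CONNECTING = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()
    DONE = auto()

# The real system keeps ~300 connections open per crawler and consults a
# local DNS cache; here lru_cache plays the role of that cache.
@lru_cache(maxsize=None)
def resolve(host):
    """Stub DNS lookup; a real crawler would query the resolver here."""
    resolve.lookups += 1                  # count only non-cached lookups
    return "10.0.0.%d" % (hash(host) % 256)
resolve.lookups = 0

def advance(state):
    """Move a connection to its next state in the fixed order above."""
    order = list(ConnState)
    return order[min(order.index(state) + 1, len(order) - 1)]

# Fetch the same host twice: the second DNS lookup is served from cache.
for _ in range(2):
    state = ConnState.LOOKUP_DNS
    ip = resolve("example.com")
    while state is not ConnState.DONE:
        state = advance(state)
```

In a real crawler each of the hundreds of connections would sit in one of these states concurrently, driven by non-blocking I/O rather than a simple loop.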

14 Major Data Structures
– Repository: contains the full HTML of every page, compressed using the zlib standard.
– Document Index: keeps information about each document, ordered by docID.
– Hit Lists: a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
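In the paper, a "plain" hit is packed into two bytes: a capitalization bit, 3 bits of (relative) font size, and 12 bits of word position. A small sketch of that bit layout (field names are my own):

```python
# Pack one plain hit into a 16-bit integer:
# bit 15 = capitalization, bits 14-12 = font size, bits 11-0 = position.

def pack_hit(cap, font, pos):
    assert 0 <= font < 8 and 0 <= pos < 4096   # 3-bit and 12-bit ranges
    return (int(cap) << 15) | (font << 12) | pos

def unpack_hit(h):
    return bool(h >> 15), (h >> 12) & 0x7, h & 0xFFF

hit = pack_hit(cap=True, font=3, pos=1000)
assert unpack_hit(hit) == (True, 3, 1000)
```

Packing hits this tightly is what makes it feasible to store every occurrence of every word across millions of documents.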

15 Forward Index
Stored in the barrels. Each barrel holds a range of wordIDs. If a document has a word in that barrel's range, the docID is recorded in that barrel, followed by the matching wordIDs and their hit lists, e.g.:
docID | wordID=1, #hits=2, hit list | wordID=2, #hits=3, hit list
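The barrel partitioning can be sketched as follows. The barrel count, wordID ranges, docIDs, and hit strings are all made up for illustration:

```python
# Sketch of the forward index: wordIDs are range-partitioned across barrels,
# and a docID is recorded in a barrel only when the document contains a
# word whose wordID falls in that barrel's range.

NUM_BARRELS = 4
WORDS_PER_BARREL = 100    # barrel b covers wordIDs [b*100, (b+1)*100)

def build_forward_index(docs):
    """docs: dict docID -> list of (wordID, hit) pairs."""
    barrels = [dict() for _ in range(NUM_BARRELS)]
    for doc_id, hits in docs.items():
        for word_id, hit in hits:
            barrel = barrels[word_id // WORDS_PER_BARREL]
            # Per barrel: docID -> {wordID: hit list}
            barrel.setdefault(doc_id, {}).setdefault(word_id, []).append(hit)
    return barrels

docs = {1: [(5, "hit-a"), (150, "hit-b")], 2: [(7, "hit-c")]}
barrels = build_forward_index(docs)
# docID 1 appears in barrel 0 (wordID 5) and in barrel 1 (wordID 150).
```

Because each barrel already groups its records by document, the sorter only has to re-sort within a barrel to produce the inverted index.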

16 Inverted Index
Same as the forward index, except that it has been processed by the sorter. The lexicon points each wordID at its doclist; each doclist entry holds a docID, a hit count, and the hit list:
Lexicon: wordID, #docs → [docID, #hits, hit list] [docID, #hits, hit list] …
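The sorter's job can be sketched in a few lines: turn a forward barrel (keyed by docID) inside out so it is keyed by wordID, with each word's doclist sorted by docID. The sample data is made up:

```python
# Invert one forward barrel: docID -> {wordID: hit list} becomes
# wordID -> sorted list of (docID, hit list) pairs.

def invert_barrel(forward_barrel):
    inverted = {}
    for doc_id, words in forward_barrel.items():
        for word_id, hits in words.items():
            inverted.setdefault(word_id, []).append((doc_id, hits))
    # Sort doclists by docID, mirroring the sorted on-disk barrel layout.
    return {w: sorted(docs) for w, docs in sorted(inverted.items())}

forward = {1: {5: ["h1"]}, 2: {5: ["h2"], 9: ["h3"]}}
inv = invert_barrel(forward)
# inv[5] == [(1, ["h1"]), (2, ["h2"])]; inv[9] == [(2, ["h3"])]
```

The real sorter does this out of core, sorting each barrel by wordID on disk rather than in a dictionary, but the transformation is the same.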

17 Results & Performance
Quality of the results is the most important metric for a search engine. The authors claim that Google outperforms the major commercial search engines. Example:

18 [image slide – example query results]

19 Storage Requirements
– Scale well.
– Utilize storage efficiently.
– Use compression.
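As noted on the data-structures slide, the repository stores each page zlib-compressed (a trade of compression ratio for speed). A minimal round-trip, with a made-up page:

```python
# Compress a page the way the repository does, then verify the round trip.
import zlib

page = b"<html><body>" + b"the quick brown fox " * 200 + b"</body></html>"
packet = zlib.compress(page, 6)          # zlib's default-ish speed/ratio level

assert zlib.decompress(packet) == page   # lossless round trip
assert len(packet) < len(page)           # repetitive HTML compresses well
```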

20 System Performance
The major operations are crawling, indexing, and sorting:
– 9 days to download 26 million pages – an average of 48.5 pages/second.
– Indexer: 54 pages/second.
– The whole sorting operation takes 24 hours.

21 Search Performance
Search performance was not the most important issue in the design at that time. The response time for a query was between 1 and 10 seconds – mostly disk I/O time. The system did not yet have any query caching or subindices on common terms as optimizations.

22 Search Performance

23 [Architecture diagram repeated from slide 12: URL Server, Crawler, Store Server, Repository, Indexer, Barrels, Sorter, Lexicon, URL Resolver, Anchors, Links, Doc Index, Searcher, PageRank]

24 Conclusion & Future Work
The main concern of Google's design is to be a scalable web search engine that provides high-quality results, through:
– PageRank
– Anchor text

25 Future Work
To improve search efficiency:
– Cache queries.
– Smart disk allocation.
– Subindices.
Other directions:
– Updates – handling old and new pages.
– Boolean operators, negation, and stemming.
– Relevance feedback and clustering.
– User context.
– Result summarization.
– PageRank can be personalized by increasing the weight of a user's home page or bookmarks.

26 References
S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine." WWW7 / Computer Networks 30(1-7): 107-117 (1998).

27 Q & A – Thanks for your attention!

