1
Searching on the WWW: The Google Phenomenon (Snyder, pp. 119-141)
2
Searching
The best place to look for something is where it is likely to be found.
This is the key to finding information.
4
Searching
A lot of information can be found on the WWW.
But that is also the flaw of the WWW:
- Too much information
- No organization
- No structure
5
Organizing Information
- Classification
- Hierarchy: categories with sub-categories
What are some problems with hierarchies?
6
Problem with Hierarchies
I want to find the requirements for the Minor in IS.
[Diagram: a hierarchy with Siena College at the top, branching into Academics, Financial Aid, and Athletics; under Academics are Degree Requirements, Forms, and Schools; the path continues through Science, CS, and Arts down to the IS Minor.]
7
Webs or Networks
Multiple paths.
[Diagram: the same Siena College pages as before, now connected as a network with multiple paths leading to the IS Minor.]
8
Problem with Web Networks
It becomes less important to get things into the correct category.
Information architects don't worry too much about:
- Classification
- Organization
9
Problem with Web Networks
As different networks of information are connected:
- Excessive redundant links emerge
- Different organization strategies clash
- A mess is created
10
Search Engines to the Rescue
An alternative to searching via navigation.
http://www.marketwatch.com/newsimages/misc/search_engines_timeline.pdf
11
How Search Engines Work
1. Web crawling: a program (robot) surfs from hyperlink to hyperlink, accumulating pages.
2. Web indexing: each accumulated page is added to a database.
   - The URL of the web page is stored.
   - Each word, its occurrences, and sometimes its position are stored.
12
How Search Engines Work
3. User search: the search actually runs against the index database.
4. Sophisticated algorithms are used to retrieve and rank pages that "match" the user's search.
13
Step 1: Web Crawling
The hardest task.
5-10 million new web pages are added to the Internet every day.
Robots need to know where to start looking.
You need the help and cooperation of web page creators.
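A minimal sketch of what such a crawler might look like, using only the Python standard library. The seed URL, page limit, and timeout are assumptions made for the illustration, not anything specified in the slides.

```python
# Minimal breadth-first crawler sketch (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Surf from hyperlink to hyperlink, accumulating page HTML by URL."""
    seen, pages = set(), {}
    queue = deque([seed_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text pages are simply skipped
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return pages
```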
14
Step 2: Web Indexing
Each URL consists of a list of words:
URL1: word5 word74 word195 word456
URL2: word7 word82 word135
URL3: word5 word74 word165 word288
URL4: word21 word59
URL5: word25 word74 word188 word432
URL6: word7 word186 word430
URL7: word2 word398
URL8: word34 word39 word84 word193
...
15
Step 2: Web Indexing
Inverted index: each word consists of a list of URLs:
word1: URL19 URL39 URL82 URL91
word2: URL27 URL41 URL66 URL67
word3: URL49 URL75 URL65
word4: URL29 URL89
word5: URL12 URL48 URL66
word6: URL53 URL73 URL123 URL144
word7: URL3 URL41 URL77
...
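A small sketch of the two index shapes described above: a forward index (URL to words) and the inverted index built from it (word to URLs). The sample entries are a subset of the slide's placeholder data.

```python
from collections import defaultdict

# Forward index: each URL maps to the list of words found on that page.
forward_index = {
    "URL1": ["word5", "word74", "word195", "word456"],
    "URL2": ["word7", "word82", "word135"],
    "URL3": ["word5", "word74", "word165", "word288"],
}

# Inverted index: each word maps to the set of URLs that contain it.
inverted_index = defaultdict(set)
for url, words in forward_index.items():
    for word in words:
        inverted_index[word].add(url)

print(inverted_index["word74"])  # {'URL1', 'URL3'}
```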
16
Step 3: User Search
Searching the index database must be quick.
The database is sorted by keywords (the primary index).
The English language has about 600,000 words; luckily, only about a tenth of them are widely used.
The database server needs to store the primary index in memory (RAM).
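With an inverted index like the one sketched earlier held in memory, answering a multi-word query reduces to set lookups and intersections. This is only a sketch of the idea, assuming an implicit AND between query words, not any particular engine's implementation.

```python
def search(query, inverted_index):
    """Return the URLs that contain every word in the query (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(inverted_index.get(words[0], set()))
    for word in words[1:]:
        results &= inverted_index.get(word, set())  # intersect per word
    return results

# Example: search("word5 word74", inverted_index) -> {'URL1', 'URL3'}
```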
17
Step 4: Ranking the Results
Searches on common words can return millions of pages.
Ordering or ranking becomes more important as the amount of data increases.
Intuitive measures:
- Number of occurrences of the search words
- Search word appears in the title, keywords, etc.
- "Importance" of the web page
- User feedback
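A toy scoring function in the spirit of the "intuitive measures" listed above: count occurrences of the search words and add a bonus when a word appears in the title. The page structure and the weight of 10 are arbitrary assumptions for the sketch.

```python
def score_page(page, query_words):
    """Toy relevance score: occurrence count plus a bonus for title hits.
    `page` is assumed to be a dict with 'title' and 'body' strings."""
    body_words = page["body"].lower().split()
    title_words = page["title"].lower().split()
    score = 0.0
    for word in query_words:
        score += body_words.count(word)   # number of occurrences in the body
        if word in title_words:
            score += 10.0                 # arbitrary bonus for a title match
    return score

def rank(pages, query):
    """Order pages from most to least relevant for the query."""
    words = query.lower().split()
    return sorted(pages, key=lambda p: score_page(p, words), reverse=True)
```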
18
Search Engine Issues
- Logical statements: AND, OR, etc.
- Phrases: "Grilled Cheese"
- Images: the Dali example
- Dishonesty: the XXX example
- Differences in vocabulary: the IBM issue
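Phrase queries such as "Grilled Cheese" are one reason an index sometimes stores word positions, as noted in the indexing step. Below is a sketch of how stored positions make phrase matching possible; the positional index data and the helper name are invented for the illustration.

```python
# Positional inverted index: word -> {url: [positions]}.  Sample data is made up.
positional_index = {
    "grilled": {"URL8": [4, 20], "URL12": [7]},
    "cheese":  {"URL8": [5],     "URL12": [30]},
}

def phrase_match(phrase, index):
    """Return URLs where the phrase's words occur consecutively, in order."""
    words = phrase.lower().split()
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))     # must contain every word
    hits = set()
    for url in candidates:
        # Check whether some starting position is followed by the rest in order.
        if any(all(pos + i in index[w][url] for i, w in enumerate(words))
               for pos in index[words[0]][url]):
            hits.add(url)
    return hits

print(phrase_match("Grilled Cheese", positional_index))  # {'URL8'}
```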
19
Search Engines (Catch-22)
Search engine companies make money by placing ads.
More searches = bigger audience = more $$$ from ads.
Best thing: get as many people as possible to use your search engine.
Worst thing: what if everyone exclusively uses search engines to search the WWW?
20
How Google Became the Best
The PageRank algorithm (based on the Clever algorithm).
PageRank is a measure of importance.
Links from important pages improve your PageRank.
[Diagram: example pages annotated with PageRank values.]
21
PageRank Continued
http://www.iprcom.com/papers/pagerank/
Simplistic explanation:
- Initially, all pages have the same PageRank.
- An iterative process increases the PageRank of each page, based on direct links first (highly weighted, x1), then one-hop links, then two-hop links, ..., then ten-hop links (very low weighting, x0.001), and so on.
- The algorithm ignores cycles and does not reward cliques.
- The process repeats until the page ranks stabilize (stop increasing).
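The hop-by-hop description above is a simplification; the standard formulation computes PageRank iteratively over the whole link graph until the values stabilize. Below is a sketch of that power-iteration view, with a made-up four-page graph and the commonly used damping factor of 0.85; it is not the exact procedure the slide describes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initially all pages are equal
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share     # links pass on importance
        rank = new_rank                           # repeat until the values stabilize
    return rank

# Made-up four-page graph for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```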
22
PageRank Intuition
ESPN.com is highly ranked because:
- Several other highly ranked pages point to it.
- Millions of low-ranked pages point to it.
Any page connected to, or part of, ESPN.com benefits from this.
Intuitively, ESPN is an information authority on sports.
23
PageRank Intuition
Breimer.net is poorly ranked because very few pages point to it (none, to be exact).
The page is not an authority on "Breimer" until other pages acknowledge its existence via a link.