1
Searching on the WWW: The Google Phenomenon (Snyder, pp. 119-141)
2
Searching
The best place to look for something is where it is likely to be found.
This is the key to finding information.
4
Searching
A lot of information can be found on the WWW.
But that is also the flaw of the WWW:
- Too much information
- No organization
- No structure
5
Organizing Information
- Classification
- Hierarchy: categories with sub-categories
What are some problems with hierarchies?
6
Problem with Hierarchies
I want to find the requirements for the Minor in IS.
[Diagram: a hierarchy with Siena College at the top, branching into Academics, Financial Aid, and Athletics; under Academics are Degree Requirements, Forms, and Schools; the path continues through Science, CS, and Arts down to the IS Minor.]
7
Webs or Networks
Multiple paths.
[Diagram: the same Siena College pages as before, now connected as a network with multiple paths leading to the IS Minor.]
8
Problem with Web Networks
It becomes less important to get things into the correct category.
Information architects don't worry too much about:
- Classification
- Organization
9
Problem with Web Networks
As different networks of information are connected:
- Excessive redundant links emerge
- Different organization strategies clash
- A mess is created
10
Search Engines to the Rescue
An alternative to searching via navigation.
http://www.marketwatch.com/newsimages/misc/search_engines_timeline.pdf
11
How Search Engines Work
1. Web crawling: a program (robot) surfs from hyperlink to hyperlink, accumulating pages.
2. Web indexing: each accumulated page is added to a database.
   - The URL of the web page is stored.
   - Each word, its occurrences, and sometimes its position are stored.
12
How Search Engines Work
3. User search: the search actually runs against the index database.
4. Sophisticated algorithms are used to retrieve and rank pages that "match" the user's search.
13
Step 1: Web Crawling
The hardest task.
5-10 million new web pages are added to the Internet every day.
Robots need to know where to start looking.
You need the help and cooperation of web page creators.
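A minimal sketch of what such a crawler might look like, using only the Python standard library. The seed URL, page limit, and timeout are assumptions made for the illustration, not anything specified in the slides.

```python
# Minimal breadth-first crawler sketch (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Surf from hyperlink to hyperlink, accumulating page HTML by URL."""
    seen, pages = set(), {}
    queue = deque([seed_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text pages are simply skipped
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return pages
```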
14
Step 2: Web Indexing
Each URL consists of a list of words:
URL1: word5 word74 word195 word456
URL2: word7 word82 word135
URL3: word5 word74 word165 word288
URL4: word21 word59
URL5: word25 word74 word188 word432
URL6: word7 word186 word430
URL7: word2 word398
URL8: word34 word39 word84 word193
...
15
Step 2: Web Indexing
Inverted index: each word consists of a list of URLs:
word1: URL19 URL39 URL82 URL91
word2: URL27 URL41 URL66 URL67
word3: URL49 URL75 URL65
word4: URL29 URL89
word5: URL12 URL48 URL66
word6: URL53 URL73 URL123 URL144
word7: URL3 URL41 URL77
...
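A small sketch of the two index shapes described above: a forward index (URL to words) and the inverted index built from it (word to URLs). The sample entries are a subset of the slide's placeholder data.

```python
from collections import defaultdict

# Forward index: each URL maps to the list of words found on that page.
forward_index = {
    "URL1": ["word5", "word74", "word195", "word456"],
    "URL2": ["word7", "word82", "word135"],
    "URL3": ["word5", "word74", "word165", "word288"],
}

# Inverted index: each word maps to the set of URLs that contain it.
inverted_index = defaultdict(set)
for url, words in forward_index.items():
    for word in words:
        inverted_index[word].add(url)

print(inverted_index["word74"])  # {'URL1', 'URL3'}
```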
16
Step 3: User Search
Searching the index database must be quick.
The database is sorted by keywords (the primary index).
The English language has about 600,000 words; luckily, only about a tenth of them are widely used.
The database server needs to store the primary index in memory (RAM).
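With an inverted index like the one sketched earlier held in memory, answering a multi-word query reduces to set lookups and intersections. This is only a sketch of the idea, assuming an implicit AND between query words, not any particular engine's implementation.

```python
def search(query, inverted_index):
    """Return the URLs that contain every word in the query (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(inverted_index.get(words[0], set()))
    for word in words[1:]:
        results &= inverted_index.get(word, set())  # intersect per word
    return results

# Example: search("word5 word74", inverted_index) -> {'URL1', 'URL3'}
```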
17
Step 4: Ranking the Results
Searches on common words can return millions of pages.
Ordering or ranking becomes more important as the amount of data increases.
Intuitive measures:
- Number of occurrences of the search words
- Search word appears in the title, keywords, etc.
- "Importance" of the web page
- User feedback
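A toy scoring function in the spirit of the "intuitive measures" listed above: count occurrences of the search words and add a bonus when a word appears in the title. The page structure and the weight of 10 are arbitrary assumptions for the sketch.

```python
def score_page(page, query_words):
    """Toy relevance score: occurrence count plus a bonus for title hits.
    `page` is assumed to be a dict with 'title' and 'body' strings."""
    body_words = page["body"].lower().split()
    title_words = page["title"].lower().split()
    score = 0.0
    for word in query_words:
        score += body_words.count(word)   # number of occurrences in the body
        if word in title_words:
            score += 10.0                 # arbitrary bonus for a title match
    return score

def rank(pages, query):
    """Order pages from most to least relevant for the query."""
    words = query.lower().split()
    return sorted(pages, key=lambda p: score_page(p, words), reverse=True)
```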
18
Search Engine Issues
- Logical statements: AND, OR, etc.
- Phrases: "Grilled Cheese"
- Images: the Dali example
- Dishonesty: the XXX example
- Differences in vocabulary: the IBM issue
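Phrase queries such as "Grilled Cheese" are one reason an index sometimes stores word positions, as noted in the indexing step. Below is a sketch of how stored positions make phrase matching possible; the positional index data and the helper name are invented for the illustration.

```python
# Positional inverted index: word -> {url: [positions]}.  Sample data is made up.
positional_index = {
    "grilled": {"URL8": [4, 20], "URL12": [7]},
    "cheese":  {"URL8": [5],     "URL12": [30]},
}

def phrase_match(phrase, index):
    """Return URLs where the phrase's words occur consecutively, in order."""
    words = phrase.lower().split()
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))     # must contain every word
    hits = set()
    for url in candidates:
        # Check whether some starting position is followed by the rest in order.
        if any(all(pos + i in index[w][url] for i, w in enumerate(words))
               for pos in index[words[0]][url]):
            hits.add(url)
    return hits

print(phrase_match("Grilled Cheese", positional_index))  # {'URL8'}
```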
19
Search Engines (Catch-22)
Search engine companies make money by placing ads.
More searches = bigger audience = more $$$ from ads.
Best thing: get as many people as possible to use your search engine.
Worst thing: what if everyone exclusively uses search engines to search the WWW?
20
How Google Became the Best
The PageRank algorithm (based on the Clever algorithm).
PageRank is a measure of importance.
Links from important pages improve your PageRank.
[Diagram: example pages annotated with PageRank values.]
21
PageRank Continued
http://www.iprcom.com/papers/pagerank/
Simplistic explanation:
- Initially, all pages have the same PageRank.
- An iterative process increases the PageRank of each page, based on direct links first (highly weighted, x1), then one-hop links, then two-hop links, ..., then ten-hop links (very low weighting, x0.001), and so on.
- The algorithm ignores cycles and does not reward cliques.
- The process repeats until the page ranks stabilize (stop increasing).
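The hop-by-hop description above is a simplification; the standard formulation computes PageRank iteratively over the whole link graph until the values stabilize. Below is a sketch of that power-iteration view, with a made-up four-page graph and the commonly used damping factor of 0.85; it is not the exact procedure the slide describes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initially all pages are equal
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share     # links pass on importance
        rank = new_rank                           # repeat until the values stabilize
    return rank

# Made-up four-page graph for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```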
22
PageRank Intuition
ESPN.com is highly ranked because:
- Several other highly ranked pages point to it.
- Millions of low-ranked pages point to it.
Any page connected to, or part of, ESPN.com benefits from this.
Intuitively, ESPN is an information authority on sports.
23
PageRank Intuition
Breimer.net is poorly ranked because very few pages point to it (none, to be exact).
The page is not an authority on "Breimer" until other pages acknowledge its existence via a link.