
1 Searching the web
Enormous amount of information
–In 1994, 100 thousand pages indexed
–In 1997, 100 million pages indexed
–In June, 2000, 500 million pages indexed
–In January, 2001, 1.3 billion pages indexed
Estimate is that this is around 10-15% of the web

2 Finding Out About
People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
Simplest approach:
–Knowledge is organized into chunks (pages)
–Goal is to return appropriate chunks

3 Search Engines
Goal of a search engine is to return appropriate chunks
Steps involved include
–asking a question
–finding answers
–evaluating answers
–presenting answers
Value of a search engine depends on how well it does on all of these.

4 Asking a question
A query reflects some information need
Query syntax needs to allow the information need to be expressed
–Keywords
–Combining terms
  Simple: “required”, NOT (+ and -)
  Boolean expressions with and/or/not and nested parentheses
  Variations: strings, NEAR, capitalization
–Simplest syntax that works
–Typically more acceptable if predictable
Another set of problems when information isn’t text: graphics, music
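The simple "+ and -" syntax on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `parse_query` and `matches` are hypothetical, a document is treated as a bag of whitespace-separated words, and unmarked terms are treated as optional (at least one must match when no term is required). Real engines differ on all of these points.

```python
def parse_query(query):
    """Split a query into required (+term), excluded (-term), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(query, text):
    """Return True if the text satisfies the query (assumed semantics, see lead-in)."""
    required, excluded, optional = parse_query(query)
    words = set(text.lower().split())
    if not required <= words:      # every +term must be present
        return False
    if excluded & words:           # no -term may be present
        return False
    # With no required terms, at least one optional term must appear.
    if not required and optional:
        return bool(optional & words)
    return True
```

For example, `matches("+jaguar -car", "the jaguar is a big cat")` is true, while the same query rejects "jaguar car dealership". The predictability point on the slide shows up here: the rules are short enough that a user can guess what a query will return.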

5 Finding the Information
Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so search engines index pages.
Two basic approaches
–Index and classify by hand
–Automate
For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue.
Most major search sites now provide both.

6 Indexing by Hand
Indexing by hand involves having a person look at web pages and assign them to categories.
Assumes a hierarchy of categories exists into which pages are placed
Each document can go into multiple categories
Produces very high quality indices
Can retrieve by browsing the hierarchy
Very expensive to create
–YAHOO is best-known early example

7 Automated Indexing
Automated indexing involves parsing documents to pull out key words and creating a table which links keywords to documents
Doesn’t have any predefined categories or keywords
Can cover a much higher proportion of the web
Can update more quickly
Much lower quality, therefore important to have some kind of relevance ranking
–Alta Vista was a well-known early example
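The "table which links keywords to documents" is usually called an inverted index. A minimal sketch follows, assuming documents are given as a dict from an identifier to its text. The names `tokenize`, `build_index`, and `search_all` are illustrative, and treating every alphanumeric run as a keyword is only one possible answer to the "what is a keyword" question.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Pull out lowercase alphanumeric words (an assumed definition of 'keyword')."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every query word (keywords ANDed together)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "a.html": "Web search engines index pages",
    "b.html": "Engines of growth",
    "c.html": "Search the web",
}
index = build_index(docs)
```

With this data, `search_all(index, "web search")` returns the ids `a.html` and `c.html`. No categories are defined in advance; the keywords come entirely from the documents, which is why this approach scales but needs relevance ranking on top.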

8 Automating Search
We will focus on automated search and indexing. Always balancing various factors:
–Recall and Precision
  If there are 100 relevant documents and you find 50, your recall is 50%.
  If you find 100 documents, and 10 of them are on topic, your precision is 10%.
  Which is more important varies with query and with coverage
–Speed, storage, completeness, timeliness
  How fast can you locate and index documents? Answer a query?
  How much room do you need on your server?
  What percent of the web do you cover?
  How many dead links do you have?
  How long before information is found by your search engine?
–Ease of use vs power of queries
  Full Boolean queries: very rich, very confusing. Alta Vista Advanced Search.
  Simplest is “and”ing together keywords; fast, straightforward. Google
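The recall and precision figures on this slide can be reproduced with a small helper. This is a sketch: the function name `precision_recall` is hypothetical, and documents are represented as arbitrary hashable ids.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = set(range(100))                                # 100 relevant documents

# Find 50 of the 100 relevant documents: recall is 50%.
_, recall = precision_recall(set(range(50)), relevant)

# Find 100 documents, 10 of them on topic: precision is 10%.
found = set(range(10)) | set(range(1000, 1090))
precision, _ = precision_recall(found, relevant)
```

Here `recall` is 0.5 and `precision` is 0.1, matching the two examples above. The two measures pull against each other: returning more documents tends to raise recall and lower precision.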

9 Search Engine Basics
A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents
A searcher uses the hash table to retrieve documents based on words
A ranking system decides the order in which to present the documents: their relevance
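The five components can be sketched end to end on a toy example. Everything below is an assumption made for illustration: the `WEB` dict stands in for real pages (no network access), the function names are hypothetical, and ranking by how often the word occurs is just one simple relevance measure, not the one any particular engine uses.

```python
from collections import Counter, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a": ("search engines crawl the web", ["b", "c"]),
    "b": ("a crawler follows links to new pages", ["a", "d"]),
    "c": ("ranking orders search results; search quality matters", ["d"]),
    "d": ("dead end page about cats", []),
}


def crawl(start):
    """Spider: follow links breadth-first from a start page, visiting each page once."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in WEB[url][1]:
            if link not in seen and link in WEB:
                seen.add(link)
                queue.append(link)
    return order


def parse(text):
    """Parser: extract lowercase words, stripping surrounding punctuation."""
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in words if w]


def build_index(urls):
    """Indexer: hash table mapping word -> Counter of {url: occurrences}."""
    index = {}
    for url in urls:
        for word in parse(WEB[url][0]):
            index.setdefault(word, Counter())[url] += 1
    return index


def search(index, word):
    """Searcher and ranker: URLs containing the word, most occurrences first."""
    postings = index.get(word.lower(), Counter())
    ranked = sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked]


index = build_index(crawl("a"))
```

Starting from page `a`, the crawl reaches all four pages, and `search(index, "search")` lists `c` before `a` because the word occurs twice on `c`.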

10 Search Engines: Not So Basic
Summary of document
Cache of document
Format Filters (pdf, postscript, etc)
Duplicate identification and removal
“More like this”
Content Filters
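Of the features listed, duplicate identification is the easiest to sketch. One simple approach, assumed here rather than taken from the slides, is to hash each page's normalized text and keep only the first page seen for each hash. The names `fingerprint` and `remove_duplicates` are hypothetical. This only catches pages that are identical after lowercasing and whitespace normalization; near-duplicates need other techniques.

```python
import hashlib


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages):
    """Keep the first URL seen for each distinct fingerprint."""
    seen = set()
    unique = {}
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique[url] = text
    return unique


pages = {
    "site.com/page": "Searching the Web",
    "mirror.com/page": "searching   the web",   # same content, different spacing and case
    "other.com/page": "Indexing by hand",
}
unique = remove_duplicates(pages)
```

In this example `unique` keeps `site.com/page` and `other.com/page` and drops the mirror copy.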

11 Evaluating Search Engines
Generally, the usability of a search engine includes several factors:
–Coverage. Most important: if it doesn’t even look at a page it can’t retrieve it. Matters more when looking for rare information.
–Relevance ranking. If precision is low and relevance ranking is poor, it takes too much wading to find the desired result. Matters more in noisy domains.
–Ease of use. This includes the query syntax, and also factors like how crowded the page is. A lot of personal preference here.
–Other features. In some settings special features may be critical.

12 Some Well-Known Search Engines
Google. www.google.com
AltaVista
Yahoo
Lycos...

