1 Searching the Web Representation and Management of Data on the Internet

2 What does a Search Engine do? Processes users' queries Finds pages with related information Returns a list of resources Why can't we use an ordinary database system that is reachable via an ordinary Web server? What are the difficulties in creating a search engine?

3 Motivation The web is –Used by millions –Contains lots of information –Link based –Incoherent –Changes rapidly –Distributed Traditional information retrieval was built with the exact opposite in mind

4 The Web's Characteristics Size –Over a billion pages available (Google is a spelling of googol = 10^100) –5-10K per page => tens of terabytes –Size doubles every 2 years Change –23% change daily –About half of the pages do not exist after 10 days –Bowtie structure

5 Bowtie Structure Core: Strongly connected component (28%) Reachable from the core (22%) Reach the core (22%)

6 Search Engine Components User Interface Query processor Crawler Indexer Ranker

7 An HTML form for inserting a search query Usually a query is a list of words What was the most popular query in Google in the last year? What does it mean to be popular in Google?

10 Crawling the Web

11 Basic Crawler (Spider) A crawler finds Web pages to download into a search engine cache It maintains a queue of pages and repeatedly performs: removeBestPage( ), findLinksInPage( ), insertIntoQueue( )
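The loop above can be sketched as follows. This is a minimal sketch: `download`, `find_links`, and the `importance` metric are placeholders supplied by the caller (the names are mine, not from the slides), and a real crawler would use a priority queue rather than re-sorting a list.

```python
def crawl(seed_urls, max_pages, download, find_links, importance):
    """Basic crawler loop: repeatedly remove the best page from the
    queue, download it, and insert its links back into the queue."""
    queue = list(seed_urls)
    cache = {}                                    # url -> page contents
    while queue and len(cache) < max_pages:
        queue.sort(key=importance, reverse=True)  # removeBestPage()
        url = queue.pop(0)
        if url in cache:                          # already downloaded
            continue
        cache[url] = page = download(url)
        for link in find_links(page):             # findLinksInPage()
            if link not in cache:
                queue.append(link)                # insertIntoQueue()
    return cache

# Toy web: each "page" is simply its list of outgoing links
web = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(crawl(["a"], 10, web.get, lambda p: p, lambda u: 0)))  # ['a', 'b', 'c']
```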

12 Choosing Pages to Download Q: Which pages should be downloaded? A: It is usually not possible to download all pages because of space limitations, so try to get the most important pages Q: When is a page important? A: Use a metric – by interest, by popularity, by location, or a combination

13 Interest Driven Suppose that there is a query Q that contains the words we are interested in Define the importance of a page P by its textual similarity to the query Q Example: use a formula that combines –The number of appearances of words from Q in P –For each word of Q, how frequently it is used overall (why is this important?) Problem: We must decide if a page is important while crawling. However, we don't know how rare a word is until the crawl is complete Solution: Use an estimate

14 Popularity Driven The importance of a page P is proportional to the number of pages with a link to P This is also called the number of backlinks of P As before, we need to estimate this amount There is a more sophisticated metric, called PageRank (taught on Tuesday)

15 Location Driven The importance of P is a function of its URL Example: –Words appearing in the URL (e.g., edu or ac) –Number of "/" in the URL Easily evaluated, requires no data from previous crawls Note: We can also use a combination of all three metrics

16 Refreshing Web Pages Pages that have been downloaded must be refreshed periodically Q: Which pages should be refreshed? Q: How often should we refresh a page?

17 Freshness Metric A cached page is fresh if it is identical to the version on the Web Suppose that S is a set of pages (i.e., a cache) Freshness(S) = (number of fresh pages in S) / (number of pages in S)

18 Age Metric The age of a page is the number of days since it was refreshed Suppose that S is a set of pages (i.e., a cache) Age(S) = Average age of pages in S
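Both metrics are straightforward to compute for a toy cache. A sketch; the dictionary shapes (`url -> contents` and `url -> last refresh date`) are my own assumption:

```python
from datetime import date

def freshness(cache, live_web):
    """Fraction of cached pages identical to the current Web version."""
    fresh = sum(1 for url, page in cache.items() if live_web.get(url) == page)
    return fresh / len(cache)

def age(cache, last_refresh, today):
    """Average number of days since each cached page was refreshed."""
    return sum((today - last_refresh[url]).days for url in cache) / len(cache)

cache = {"a": "v1", "b": "v1"}
web = {"a": "v1", "b": "v2"}               # "b" changed after we cached it
print(freshness(cache, web))                # 0.5
last = {"a": date(2024, 1, 1), "b": date(2024, 1, 3)}
print(age(cache, last, date(2024, 1, 5)))   # 3.0  (average of 4 and 2 days)
```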

19 Refresh Goal Crawlers can refresh only a certain number of pages in a period of time The page download resource can be allocated in many ways Goal: Minimize the age of the cache and maximize its freshness We need a refresh strategy

20 Refresh Strategies Uniform Refresh: The crawler revisits all pages with the same frequency, regardless of how often they change Proportional Refresh: The crawler revisits a page with frequency proportional to the page's change rate (i.e., if it changes more often, we visit it more often) Which do you think is better?

21 Trick Question A two-page database: e1 changes daily, e2 changes once a week We can visit one page per week How should we visit pages? –e1 e2 e1 e2 e1 e2 e1 e2... [uniform] –e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional] –e1 e1 e1 e1 e1 e1... –e2 e2 e2 e2 e2 e2... –?

22 Proportional is Often Not Good! Visit the fast-changing e1 => get 1/2 day of freshness Visit the slow-changing e2 => get 1/2 week of freshness Visiting e2 is a better deal!

23 Another Example The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day Simplified change model: –The day is split into 9 equal intervals: e1 changes once in each interval, and e2 changes once during the day –We don't know when the pages change within the intervals The crawler can download one page a day Our goal is to maximize freshness

24 Which Page Do We Refresh? Suppose we refresh e2 at midday If e2 changed in the first half of the day, it remains fresh for the rest (half) of the day –50% chance of a 0.5 day freshness increase –50% chance of no increase –Expected freshness increase: 0.25 days

25 Which Page Do We Refresh? Suppose we refresh e1 at midday If e1 changed in the first half of its current interval, and we refresh in the middle of that interval, it remains fresh for the remaining half of the interval = 1/18 of a day –50% chance of a 1/18 day freshness increase –50% chance of no increase –Expected freshness increase: 1/36 days
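The two expectations can be checked with a one-line formula: a page that changes once per interval of length 1/r days, refreshed mid-interval, gains freshness with probability 1/2, for half an interval.

```python
def expected_gain(changes_per_day):
    """Expected freshness gain (in days) from one mid-interval refresh
    of a page that changes once per interval of length
    1/changes_per_day: with probability 1/2 the page already changed,
    and the refresh keeps it fresh for the remaining half interval."""
    interval = 1.0 / changes_per_day
    return 0.5 * (interval / 2)

print(expected_gain(1))   # 0.25  -> refreshing e2 (changes once a day)
print(expected_gain(9))   # 1/36 of a day -> refreshing e1 (changes 9 times a day)
```

So refreshing the slowly changing e2 is worth nine times as much expected freshness as refreshing e1, matching the slides' conclusion.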

26 Not Every Page is Equal! Suppose that e1 is accessed twice as often as e2 Then it is twice as important to us that e1 is fresh as it is that e2 is fresh

27 Politeness Issues When a crawler crawls a site, it uses the site's resources: –The web server needs to find the file in its file system –The web server needs to send the file over the network If a crawler requests many pages at a high rate, it may –crash the site's web server, or –be banned from the site Solution: Ask for pages "slowly"

28 Politeness Issues (cont) A site may identify pages that it doesn't want to be crawled (how?) A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite) Convention: put a file called robots.txt in the site's root directory to identify pages that should not be crawled (e.g., http://www.cnn.com/robots.txt)

29 robots.txt The User-agent directive identifies the crawlers whose access should be restricted The Disallow directive identifies the pages that should be restricted
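A hypothetical robots.txt (the bot names and paths below are invented for illustration) can be checked with Python's standard-library parser:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("BadBot", "/index.html"))         # False: banned everywhere
print(rp.can_fetch("PoliteBot", "/index.html"))      # True
print(rp.can_fetch("PoliteBot", "/private/x.html"))  # False
```

A polite crawler calls `can_fetch` before downloading each URL; nothing in the protocol enforces this, which is exactly the point made on the previous slide.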

30 Other Issues Suppose that a search engine uses several crawlers at the same time (in parallel) How can we make sure that they are not doing the same work (i.e., visiting the same pages)?

31 Index Repository

32 Storage Challenges Scalability: Should be able to store huge amounts of data (data spans disks or computers) Dual Access Mode: Random access (find specific pages) and Streaming access (find large subsets of pages) Large Batch Updates: Reclaim old space, avoid access/update conflicts Obsolete Pages: Remove pages no longer on the web (how do we find these pages?)

33 Storage Challenges Storage cost: Should be able to store the huge amounts of data at a reasonable cost (a disk that can store a few terabytes is very expensive, so what do search engines such as Google do?)

34 Update Strategies Updates are generated by the crawler Several characteristics: –The time at which the crawl occurs and the repository receives the information –Whether the crawl's information replaces the entire database or modifies parts of it

35 Batch Crawler vs. Steady Crawler Batch mode –Periodically executed –Allocated a certain amount of time Steady mode –Runs all the time –Always sends results back to the repository

36 Partial vs. Complete Crawls A batch-mode crawler can either do –A complete crawl every run, replacing the entire cache –A partial crawl, replacing only a subset of the cache The repository can implement –In-place update: replaces the data in the cache, thus quickly refreshing pages –Shadowing: creates a new index with the updates and later replaces the previous one, thus avoiding refresh-access conflicts

37 Partial vs. Complete Crawls Shadowing resolves the conflicts between updates and reads for queries Batch mode fits well with shadowing A steady crawler fits well with in-place updates

38 Types of Indices Content index: allows us to easily find pages with certain words Links index: allows us to easily find links between pages Utility index: allows us to easily find pages in a certain domain, or of a certain type, etc. Q: What do we need these for?

39 Is the Following Content Index Good? Consider a table with one row per (word, page) pair and columns: Word | Frequency | UrlId We want to quickly find pages with a specific word Is this a good way of storing a content index?

40 Is the Following Content Index Good? NO –If a word appears in a thousand documents, then the word is repeated in a thousand rows. Why waste the space? –If a word appears in a thousand documents, we have to access a thousand rows in order to find the documents –It does not easily support queries that require multiple words

41 Inverted Keyword Index bush: (1, 5, 11, 17) saddam: (3, 5, 11, 17) war: (3, 5, 17, 28) butterfly: (4, 22) A hashtable with words as keys and lists of matching documents as the values; the lists are sorted by urlId

42 Query: "bush saddam war" bush: (1, 5, 11, 17) saddam: (3, 5, 11, 17) war: (3, 5, 17, 28) Answers: 5, 17 Algorithm: Always advance the pointer(s) with the lowest urlId
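The merge described on the slide can be sketched directly: keep one pointer per posting list, and whenever the pointers disagree, advance those at the lowest urlId; whenever they all agree, that urlId contains every query word.

```python
def intersect(postings):
    """Merge sorted posting lists: advance the pointer(s) with the
    lowest urlId; when all pointers agree, that urlId is an answer."""
    pos = [0] * len(postings)
    result = []
    while all(pos[i] < len(postings[i]) for i in range(len(postings))):
        current = [postings[i][pos[i]] for i in range(len(postings))]
        lo = min(current)
        if lo == max(current):          # all pointers agree
            result.append(lo)
            pos = [p + 1 for p in pos]
        else:                           # advance pointers at the lowest urlId
            for i in range(len(postings)):
                if current[i] == lo:
                    pos[i] += 1
    return result

index = {"bush": [1, 5, 11, 17], "saddam": [3, 5, 11, 17], "war": [3, 5, 17, 28]}
print(intersect([index[w] for w in ["bush", "saddam", "war"]]))  # [5, 17]
```

Because every list is sorted by urlId, each pointer moves forward only, so the merge is linear in the total length of the posting lists.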

43 Challenges The index build must be: –Fast –Economical Incremental indexing must be supported Tradeoff when using compression: memory is saved but time is lost compressing and uncompressing

44 How do we Distribute the Indices Between Files? Local inverted file –Each file contains disjoint random pages of the index –A query is broadcast to all files –The result is the merge of the query answers Global inverted file –Each file is responsible for a subset of the terms in the collection –A query is "sent" only to the appropriate files What will happen if a disk crashes (which scheme is better in this case?)

45 Ranking

46 A Naïve Approach Let Q (the query) be a set of words Let count_Q(P) be the number of occurrences of words of Q in P A naïve approach: –If count_Q(P1) > count_Q(P2) then P1 should be ranked higher than P2 What are the problems with the naïve approach?

47 Testing the Naïve Approach Q = "green men mars" –P1 = "I live in a green house with a green roof" –P2 = "There is no life form on Mars" –P3 = "Men don't like green cars" –P4 = "I saw some little green men yesterday" In what order do you think that these 'pages' should be returned?
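Running the naïve count on these four 'pages' makes the problem concrete: P1, P3, and P4 all tie, even though P4 is clearly the most relevant. The tokenization below is a simplifying assumption of mine (lowercase, apostrophes split).

```python
query = {"green", "men", "mars"}
pages = {
    "P1": "I live in a green house with a green roof",
    "P2": "There is no life form on Mars",
    "P3": "Men don't like green cars",
    "P4": "I saw some little green men yesterday",
}

def count_q(query, page):
    """count_Q(P): occurrences in the page of words from the query."""
    words = page.lower().replace("'", " ").split()
    return sum(1 for w in words if w in query)

scores = {name: count_q(query, text) for name, text in pages.items()}
print(scores)  # {'P1': 2, 'P2': 1, 'P3': 2, 'P4': 2}
```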

48 The Vector Space Model The Vector Space Model (VSM) is a way of representing documents through the words that they contain It is a standard technique in Information Retrieval The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

49 How Does it Work? Each document is broken down into a word frequency table The tables are called vectors and can be stored as arrays A vocabulary is built from all the words in all documents in the system Each document is represented as a vector based on the vocabulary

50 Example Document A: "A dog and a cat." –Frequencies: a: 2, dog: 1, and: 1, cat: 1 Document B: "A frog." –Frequencies: a: 1, frog: 1

51 Example (continued) The vocabulary contains all the words that are used: –a, dog, and, cat, frog The vocabulary is sorted –a, and, cat, dog, frog

52 Example (continued) Over the sorted vocabulary (a, and, cat, dog, frog): Document A: "A dog and a cat." –Vector: (2,1,1,1,0) Document B: "A frog." –Vector: (1,0,0,0,1)
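The vocabulary and vectors from the example can be built in a few lines; the naïve normalization (lowercase, strip periods) is my simplification, sufficient for these two documents.

```python
def build_vocabulary(docs):
    """Sorted list of all words used in any document."""
    return sorted({w for d in docs for w in d.lower().replace(".", "").split()})

def to_vector(doc, vocab):
    """Frequency of each vocabulary word in the document."""
    words = doc.lower().replace(".", "").split()
    return [words.count(w) for w in vocab]

docs = ["A dog and a cat.", "A frog."]
vocab = build_vocabulary(docs)
print(vocab)                      # ['a', 'and', 'cat', 'dog', 'frog']
print(to_vector(docs[0], vocab))  # [2, 1, 1, 1, 0]
print(to_vector(docs[1], vocab))  # [1, 0, 0, 0, 1]
```

Queries are vectorized the same way, e.g. `to_vector("dog", vocab)` gives `[0, 0, 0, 1, 0]`, matching the next slide.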

53 Queries Queries can be represented as vectors in the same way as documents: –"dog" = (0,0,0,1,0) –"frog" = (0,0,0,0,1) –"dog and frog" = (0,1,0,1,1)

54 Similarity Measures There are many different ways to measure how similar two documents are, or how similar a document is to a query The cosine measure is a very common similarity measure Using a similarity measure, a set of documents can be compared to a query and the most similar documents returned

55 The Cosine Measure For two vectors d and d' the cosine similarity between them is given by: sim(d, d') = (d · d') / (|d| |d'|) Here d · d' is the dot product of d and d', calculated by multiplying corresponding frequencies together and summing, and |d| is the Euclidean length of d The cosine measure calculates the angle between the vectors in a high-dimensional virtual space

56 Example Let d = (2,1,1,1,0) and d' = (0,0,0,1,0) –d · d' = 2·0 + 1·0 + 1·0 + 1·1 + 0·0 = 1 –|d| = √(2²+1²+1²+1²+0²) = √7 ≈ 2.646 –|d'| = √(0²+0²+0²+1²+0²) = √1 = 1 –Similarity = 1/(2.646 · 1) ≈ 0.378
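The cosine measure is a direct translation of the formula, and reproduces the worked example:

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d = [2, 1, 1, 1, 0]            # "A dog and a cat."
q = [0, 0, 0, 1, 0]            # query "dog"
print(round(cosine(d, q), 3))  # 0.378
```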

57 Ranking Documents A user enters a query The query is compared to all documents using a similarity measure The user is shown the documents in decreasing order of similarity to the query term

58 Vocabulary Stopword lists –Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing –Examples: a, and, to, is, of, in, if, would, very, when, you, … –Stopword lists contain frequent words to be excluded –Stopword lists need to be used carefully, e.g., "to be or not to be"
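Stopword removal is a simple filter; the list below is the slide's example list. Note how the "to be or not to be" caveat plays out: the query loses its "to"s and shrinks.

```python
STOPWORDS = {"a", "and", "to", "is", "of", "in", "if", "would", "very", "when", "you"}

def remove_stopwords(words):
    """Drop words that appear in the stopword list."""
    return [w for w in words if w not in STOPWORDS]

print(remove_stopwords("a dog and a cat".split()))     # ['dog', 'cat']
print(remove_stopwords("to be or not to be".split()))  # ['be', 'or', 'not', 'be']
```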

59 Stemming Suppose that a user is interested in finding pages about "running shoes" In many cases it is desirable to also return pages containing shoe instead of shoes, and pages containing run or runs instead of running In order to accommodate such variations, a stemmer is used

60 Stemming (continued) A stemmer receives a keyword as input, and returns its stem (or normal form) For example, the stem of running might be run Instead of checking whether a word w appears in a page P, a search engine might check if there is a word w' in P that has the same stem as w, i.e., stem(w) = stem(w')
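A toy stemmer for illustration only; real engines use proper algorithms such as Porter's stemmer, and the suffix list here is my own minimal assumption, just enough for the slide's examples:

```python
def naive_stem(word):
    """Deliberately simplistic stemmer: strip a few common suffixes,
    keeping at least three characters of the stem."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def same_stem(w1, w2):
    """The matching rule from the slide: stem(w) == stem(w')."""
    return naive_stem(w1) == naive_stem(w2)

print(naive_stem("running"), naive_stem("runs"), naive_stem("shoes"))  # run run shoe
print(same_stem("running", "run"))                                     # True
```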

61 Term Weighting Not all words are equally useful A word is most likely to be highly relevant to document A if it is: –Infrequent in other documents –Frequent in document A The cosine measure needs to be modified to reflect this

62 Normalised Term Frequency (tf) A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document This is known as the tf factor Example: –Given the raw frequency vector (2,1,1,1,0) –We get the tf vector (1, 0.5, 0.5, 0.5, 0) This stops large documents from scoring higher

63 Inverse Document Frequency (idf) A calculation designed to make rare words more important than common words The idf of a word w is given by idf(w) = log(N / n_w) where N is the number of documents and n_w is the number of pages that contain the word w

64 tf-idf The tf-idf weighting scheme multiplies each word in each document by its tf factor and its idf factor –TF-IDF(P, Q) = Σ_{w in Q} tf(P, w) · idf(w) Different schemes are usually used for query vectors Different variants of tf-idf are also used
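Putting the last three slides together, here is one common variant (a sketch, normalising tf by the maximum frequency as defined on the tf slide; query words are given by vocabulary index, an encoding I chose for brevity):

```python
import math

def tf(page_vector):
    """Normalised term frequency: frequency / max frequency in page."""
    m = max(page_vector)
    return [f / m for f in page_vector]

def idf(n_docs, docs_with_word):
    """idf(w) = log(N / n_w): rare words get a higher weight."""
    return math.log(n_docs / docs_with_word)

def tf_idf_score(page_vector, query_indices, doc_freqs, n_docs):
    """TF-IDF(P, Q) = sum over query words of tf(P, w) * idf(w)."""
    tfv = tf(page_vector)
    return sum(tfv[i] * idf(n_docs, doc_freqs[i]) for i in query_indices)

doc_a = [2, 1, 1, 1, 0]      # "A dog and a cat." over (a, and, cat, dog, frog)
doc_b = [1, 0, 0, 0, 1]      # "A frog."
doc_freqs = [2, 1, 1, 1, 1]  # number of documents containing each word
# Query "dog" (vocabulary index 3): only doc_a mentions it
print(round(tf_idf_score(doc_a, [3], doc_freqs, 2), 3))  # 0.347
print(tf_idf_score(doc_b, [3], doc_freqs, 2))            # 0.0
```

Note that the ubiquitous word "a" (index 0) contributes nothing here even when queried, since idf(2/2) = log 1 = 0: this is how tf-idf automatically downweights common words.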

65 Traditional Ranking Faults (e.g., TF-IDF) Many pages containing a term may be of poor quality or not relevant People put popular words in irrelevant sites to promote the sites Queries are short, so containing the words from a query does not indicate importance

66 Additional Factors for Ranking Links: If an important page links to P, then P must be important Words on links: If a page links to P with the query keyword in the link text, then P must really be about the keyword Style of words: If a keyword appears in P in a title, header, or large font size, it is more important

67 The Hidden Web Challenge

68 The Hidden (Deep) Web Web pages that are protected by a password Web pages that require filling in a registration form in order to reach them Web pages that are dynamically created from data in a database (e.g., search results) In a weaker sense: –Web pages to which no other page links –Pages that search engines are not allowed to crawl (by robots.txt)

69 One of the Challenges in Archiving the Web Can we reach all of the Web by crawling? Why do we care about parts that are not reachable by ordinary web crawlers? It is estimated that the deep web is 500 times larger than the visible web What will be the effect of web services on the ratio between the visible web and the hidden web?

