1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.


1 Announcements
Research Paper due today
Research Talks
– Nov. 29 (Monday): Kayatana and Lance
– Dec. 1 (Wednesday): Mark and Jeremy
– Dec. 3 (Friday): Joe and Anton
– Dec. 5 (Monday): Colin and Paul

2 Web Search
Lecture 23

3 Searching the Web
Only search what is indexed
– 1999: about 800 million documents on the indexable web; the largest index, Northern Light, covered about 16% of it [7]
– 2004: about 8 billion URLs indexed by Google, the largest index [1]; its share of the indexable web is unknown

4 Visualizing the Web
View the web as a directed graph of nodes and edges
– a set of abstract nodes (the pages)
– joined by directional edges (the hyperlinks)
Structure provides significant insight about the content
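As a minimal sketch of the directed-graph view, the pages can be stored as dictionary keys and the hyperlinks as lists of targets. The four-page web and the helper name `invert` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical four-page web: each key is a page (node),
# each value lists the pages it links to (outgoing edges).
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def invert(out_links):
    """Return a dict mapping each page to the sorted pages that link to it."""
    in_links = defaultdict(list)
    for page in out_links:
        in_links[page]  # touch the key so pages without in-links still appear
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    return {page: sorted(sources) for page, sources in in_links.items()}


in_links = invert(out_links)
```

Here page C is pointed to by A, B and D, while nothing points to D. That asymmetry is the structural information the later ranking methods exploit.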

5 Example Graph [6]

6 Citation Analysis [2]
Use structure to identify important, or prominent, nodes
Garfield’s impact factor
– Quantitative “score” for each journal, proportional to the average number of citations per paper published in the previous two years
– More heavily cited journals have more overall impact on a field
Consider it better to receive citations from an important journal

7 Influence Weights
Pinski and Narin’s notion of influence weights
– Strength of the connection from one journal to another: the percentage of citations in the first journal that refer to the second
– Equilibrium: the weight of each journal J equals the sum of the weights of all journals citing J (scaled by the strengths of the connections)
If a journal receives regular citations from other journals of large weight, it will acquire large weight
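The equilibrium condition can be written compactly. The symbols below are assumed notation, not taken from the slide: w_J is the influence weight of journal J, and s_{IJ} is the connection strength from journal I to journal J.

```latex
w_J = \sum_{I} s_{IJ}\, w_I,
\qquad
s_{IJ} = \frac{\text{number of citations in } I \text{ that refer to } J}
              {\text{total number of citations in } I}
```

A journal cited heavily (large s_{IJ}) by journals that are themselves weighty (large w_I) ends up with a large w_J, which is the recursive idea described above.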

8 On the Web
Lots of dead ends in the link structure
– Prominent sites may have no links to the outside world
– Use a “smoothing” operation that gives every page a small, positive connection strength to every other page
Compute equilibrium weights with respect to the modified connection strengths
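One way the smoothing step might look in code is sketched below. The mixing parameter `epsilon`, the uniform treatment of dead-end rows, and the three-page example are all assumptions made for illustration; the slide does not specify them.

```python
def equilibrium_weights(strength, epsilon=0.05, iterations=200):
    """Power iteration on smoothed connection strengths.

    strength[i][j] is the fraction of page i's links that point to page j.
    A row of zeros marks a dead end (a page with no outgoing links).
    """
    n = len(strength)
    smoothed = []
    for row in strength:
        if sum(row) == 0:
            # Assumption: a dead end connects equally to every page.
            smoothed.append([1.0 / n] * n)
        else:
            # Blend the real strengths with a small uniform strength.
            smoothed.append([(1 - epsilon) * s + epsilon / n for s in row])

    weights = [1.0 / n] * n
    for _ in range(iterations):
        # New weight of j: weights of pages pointing to j, scaled by strength.
        weights = [
            sum(weights[i] * smoothed[i][j] for i in range(n))
            for j in range(n)
        ]
    return weights


# Hypothetical chain: page 0 -> page 1 -> page 2, and page 2 is a dead end.
strength = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
weights = equilibrium_weights(strength)
```

Because every smoothed row sums to 1, the total weight stays at 1 across iterations, and every page ends up with a strictly positive weight even though page 2 links nowhere.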

9 Different Model on the Web
Prominent sites do not link to other prominent sites
– Search engines won’t link to other search engines because they are competitors
– They want to keep users on their own sites
A large collection of pages links to many prominent sites in a focused manner
– These pages act as resource lists and guides to search engines

10 Hubs and Authorities
Authorities – the most prominent sources of primary content for a topic
Hubs – high-quality guides and resource lists that direct users to recommended authorities
Each page is assigned a hub weight and an authority weight
– authority weight: proportional to the sum of the hub weights of the pages that link to it
– hub weight: proportional to the sum of the authority weights of the pages that it links to
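The two mutually dependent weights can be computed by alternating the two updates and normalizing after each round. The following is a minimal sketch under assumptions not in the slide: the function name `hits`, the fixed iteration count, the Euclidean normalization, and the five-page example graph are illustrative choices.

```python
import math


def hits(out_links, iterations=50):
    """Return (hub, authority) weight dicts for a graph given as out-links."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in out_links.get(p, [])) for p in pages}
        # Normalize so the weights stay bounded ("proportional to").
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two guide pages pointing at three content pages.
graph = {
    "guide1": ["site1", "site2"],
    "guide2": ["site1", "site2", "site3"],
}
hub, auth = hits(graph)
```

In this toy graph the guides get all the hub weight and the content sites get all the authority weight; site3, recommended by only one guide, scores lower than site1 and site2.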

11 Simplified PageRank Algorithm [5]
Formula used by Google to rank pages
– Let u be a web page
– F_u is the set of pages u points to
– B_u is the set of pages that point to u
– N_u = |F_u| is the number of links from u
– c is a factor used for normalization
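The formula itself did not survive in this transcript. Using the definitions listed on the slide, the simplified ranking in Page et al. [5] can be written as follows, where R(u) denotes the rank of page u:

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page v divides its own rank evenly among the N_v pages it links to, and u collects the shares from every page in B_u.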

12 Simplified PageRank Calculation, where c = 1
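The worked figure from this slide is not in the transcript, so the sketch below uses a hypothetical three-page graph instead. It repeatedly applies the simplified rule with c = 1 (each page splits its rank evenly over its out-links) and assumes every page has at least one out-link and appears as a key.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate R(u) = sum over v in B_u of R(v) / N_v, with c = 1."""
    pages = list(out_links)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = simplified_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this graph the ranks settle at R(A) = 0.4, R(B) = 0.2, R(C) = 0.4. This can be checked by hand: A receives all of C's rank, B receives half of A's, and C receives half of A's plus all of B's.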

13 PageRank Formula
Account for sinks
Complete formula
– d is set empirically to about 0.15 to 0.2 by the system
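The complete formula is missing from the transcript, so the following is only a sketch of one common formulation, not necessarily the one shown on the slide. It assumes d is the probability of jumping to a random page (which matches the stated range of 0.15 to 0.2) and that the rank held by sinks is spread evenly over all pages: R(u) = d/n + (1 - d) * (sum over v in B_u of R(v)/N_v + S/n), where n is the number of pages and S is the total rank of the sinks.

```python
def pagerank(out_links, d=0.15, iterations=100):
    """PageRank with random jumps (probability d) and sink handling."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)

    rank = {u: 1.0 / n for u in pages}
    for _ in range(iterations):
        # Rank currently held by sinks (pages with no out-links).
        sink_mass = sum(rank[u] for u in pages if not out_links.get(u))
        # Every page gets the random-jump share plus its share of sink rank.
        new_rank = {u: d / n + (1 - d) * sink_mass / n for u in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_rank[u] += (1 - d) * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical chain with a sink: A -> B -> C, and C links nowhere.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": []})
```

Without the two correction terms, rank would drain into C and vanish on each iteration; with them, the ranks keep summing to 1 and every page retains a positive score.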

14 Using Queries to Find Documents
Vector Space Model – Content Relevance
Slide by Mark Levene [3]

15 Term Frequency (TF)
Count the number of occurrences of each term.
Bag-of-words approach.
Ignore stopwords such as is, a, of, the, …
Stemming: computer is replaced by comput, as are its variants computers, computing, computation and computed.
Normalise TF by dividing by the document length, the byte size of the document, or the maximum number of occurrences of a word in the bag.
Example text: chess computer programming chess game chess game is a
Slide by Mark Levene [3]
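A minimal sketch of the counting and normalization steps, applied to the example text from the slide. The stopword set contains only the four examples named above, normalization divides by the maximum count (one of the three options listed), and stemming is left out; the function name `term_frequencies` is an illustrative choice.

```python
from collections import Counter

# Only the stopwords named on the slide; a real list would be much longer.
STOPWORDS = {"is", "a", "of", "the"}


def term_frequencies(text):
    """Return term -> count / max count, ignoring stopwords (no stemming)."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


tf = term_frequencies("chess computer programming chess game chess game is a")
```

Here "chess" occurs three times and gets TF 1.0, "game" occurs twice and gets 2/3, and "computer" and "programming" each get 1/3. Word order plays no role, which is what "bag of words" means.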

16 Inverse Document Frequency (IDF)
N is the number of documents in the corpus.
n_i is the number of documents in which word i appears.
The log dampens the effect of IDF.
IDF is also the number of bits needed to represent the term.
Slide by Mark Levene [3]
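The formula is not in the transcript. The standard definition consistent with the symbols above is shown below; the base-2 logarithm is an assumption suggested by the "number of bits" remark.

```latex
\mathrm{IDF}_i = \log_2 \frac{N}{n_i}
```

For instance, a word that appears in 1 of 1024 documents has IDF 10, while a word that appears in every document has IDF 0 and contributes nothing to ranking.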

17 Ranking with TF-IDF
i refers to document i
j refers to word (or term) j in document i
q is the query, which is a sequence of terms
score_j is the score for document j given q
Rank results according to the scoring function.
Slide by Mark Levene [3]
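The scoring function itself is missing from the transcript. A common choice, assumed here, is to sum TF times IDF over the query terms that occur in a document. The sketch below normalizes TF by document length, uses a base-2 log for IDF, and breaks ties by document position; the function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def rank_documents(docs, query):
    """Return document indices ordered by descending TF-IDF score for query."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Number of documents in which each word appears.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scored = []
    for index, words in enumerate(tokenized):
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(words)
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scored.append((score, index))

    # Highest score first; equal scores keep their original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


docs = [
    "chess game strategy",
    "computer programming guide",
    "computer chess programming chess",
]
order = rank_documents(docs, "chess programming")
```

The third document matches both query terms and ranks first; the other two each match one term with the same weight, so they tie and keep their original order.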

18 Factor in Link Metrics
Multiply by the PageRank (PR) of the document (web page).
We do not know exactly how Google factors in the PR; it may be that log(PR) is used.
Slide by Mark Levene [3]
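Both options mentioned on the slide can be sketched in a few lines. This is only an illustration of the idea, not how Google combines the signals: the function name `combined_score` is hypothetical, and the log variant assumes PageRank values scaled to be greater than 1, since log(PR) is zero or negative otherwise.

```python
import math


def combined_score(content_score, pagerank, use_log=False):
    """Scale a content-relevance score by PageRank or by log(PageRank)."""
    link_factor = math.log(pagerank) if use_log else pagerank
    return content_score * link_factor


# Hypothetical results as (page, TF-IDF score, PageRank on a scale above 1).
results = [("page1", 0.4, 2.0), ("page2", 0.3, 8.0)]
ranked = sorted(results, key=lambda r: combined_score(r[1], r[2]), reverse=True)
```

In this example page2 has the weaker content match but the higher PageRank, and it overtakes page1 under either variant. The log variant narrows the gap, which is the dampening effect a logarithm provides.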

19 Rate of Change on the Web [4]
Search engines update their index periodically to keep up with the evolving web
– an obsolete index leads to irrelevant or “broken” search results
– update both content and link structure
Sources of change
– the content of existing pages changes
– new pages are added

20 What’s New on the Web?
New pages are created at a rate of 8% per week [4]
– New pages borrow a significant amount of content from old pages
– After one year, 50% of the content on the web is new
Only 20% of the pages available today will still be accessible after one year

21 New Link Structure
After a year, about 80% of the links on the Web will have been replaced with new ones
Links change at about 25% per week
– week-old rankings may not reflect the current ranking of the pages very well

22 Change in Old Pages
After one week, 30% of the changed pages differ by more than 5%
After one year, fewer than 50% of the changed pages differ by more than 5%
The creation of new pages is a more significant source of change on the Web

23 Impact on Search Engines
Need to continually update links, because this data changes more rapidly than content
– most links persist for less than 6 months
Pages are removed and replaced by new ones at rapid rates
– sometimes it is better to use a cached version of a page
Pages that persist usually do not change very much
– past change does not predict future change

24 Citations
[1] Google. www.google.com
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.
[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt
[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.

