
Presentation transcript:

1 Announcements
Research Paper due today
Research Talks
–Nov. 29 (Monday) Kayatana and Lance
–Dec. 1 (Wednesday) Mark and Jeremy
–Dec. 3 (Friday) Joe and Anton
–Dec. 5 (Monday) Colin and Paul

2 Web Search Lecture 23

3 Searching the Web
Only search what is indexed
–1999: 800 million documents indexed by Northern Light [7]
  Largest index: 16% of the indexable web
–2004: 8 billion URLs indexed by Google [1]
  Largest index: ?% of the indexable web

4 Visualizing the Web
View the web as a directed graph of nodes and edges
–set of abstract nodes (the pages)
–joined by directional edges (the hyperlinks)
Structure provides significant insight about the content
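
As a minimal sketch of the directed-graph view, the pages can be stored as keys of a dictionary and the hyperlinks as lists of targets. The four page names and the helper `in_links` are hypothetical, chosen only for illustration.

```python
# Hypothetical four-page web; the page names are made up for illustration.
links = {
    "A": ["B", "C"],   # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],        # nothing links to D
}

def in_links(graph):
    """Invert the out-link map: for each page, list the pages that point to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming

backlinks = in_links(links)
print(backlinks["C"])   # the pages that link to C
```

Keeping both directions available is convenient because link-based ranking needs the pages a page points to as well as the pages that point to it.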

5 Example Graph [6]

6 Citation Analysis [2]
Use structure to identify important, or prominent, nodes
Garfield's impact factor
–Quantitative "score" for each journal, proportional to the average number of citations per paper published in the previous two years
–More heavily cited journals have more overall impact on a field
It is considered better to receive citations from an important journal

7 Influence Weights
Pinski and Narin's notion of influence weights
–strength of the connection from one journal to another: the percentage of citations in the first journal that refer to the second
–equilibrium: the weight of each journal J equals the sum of the weights of all journals citing J (scaled by the strengths of the connections)
If a journal receives regular citations from other journals of large weight, it will acquire large weight
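
The equilibrium described on this slide can be approximated by repeated updating. The sketch below is one plausible reading, not Pinski and Narin's exact procedure: the function name, the fixed iteration count, and the three-journal citation counts are assumptions, and every journal is assumed to cite at least one other journal.

```python
def influence_weights(citations, iterations=100):
    """citations[i][j] = number of citations from journal i to journal j."""
    journals = list(citations)
    # Connection strength: share of journal i's citations that refer to journal j.
    strength = {}
    for i in journals:
        total = sum(citations[i].values())
        strength[i] = {j: count / total for j, count in citations[i].items()}
    # Start from equal weights and iterate towards the equilibrium.
    weights = {j: 1.0 / len(journals) for j in journals}
    for _ in range(iterations):
        new = {j: 0.0 for j in journals}
        for i in journals:
            for j, s in strength[i].items():
                new[j] += weights[i] * s   # citing weight scaled by connection strength
        weights = new
    return weights

# Hypothetical citation counts between three journals.
citations = {
    "X": {"Y": 3, "Z": 1},
    "Y": {"X": 2, "Z": 2},
    "Z": {"X": 1, "Y": 1},
}
weights = influence_weights(citations)
print(weights)
```

In this toy data journal Y ends up with the largest weight because it receives most of the citations made by X.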

8 On the web
Many dead ends in the link structure
–Prominent sites may have no links to the outside world
–Use a "smoothing" operation, giving all pages a small, positive connection strength to every other page
Compute equilibrium weights with respect to the modified connection strengths
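
One way the smoothing operation could look in code is sketched below. The blending factor `epsilon`, the uniform treatment of dead ends, and the fact that a page also gets a small connection to itself are assumptions made for this illustration, not details taken from the slide.

```python
def smooth(strength, epsilon=0.1):
    """Blend each page's connection strengths with a small uniform strength."""
    pages = list(strength)
    n = len(pages)
    smoothed = {}
    for p in pages:
        row = strength[p]
        if not row:
            # Dead end: no out-links at all, so spread the strength uniformly.
            smoothed[p] = {q: 1.0 / n for q in pages}
        else:
            # Keep most of the original strength, add a small positive share for every page.
            smoothed[p] = {q: (1 - epsilon) * row.get(q, 0.0) + epsilon / n for q in pages}
    return smoothed

# Hypothetical two-page web: A links only to B, and B is a dead end.
smoothed = smooth({"A": {"B": 1.0}, "B": {}})
print(smoothed)
```

After smoothing every row is strictly positive and still sums to 1, so the equilibrium weights are well defined even though B has no out-links.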

9 Different Model on the Web
Prominent sites do not link to other prominent sites
–Search engines won't link to other search engines because they are competitors
–They want to keep users on their own sites
Large collections of pages link to many prominent sites in a focused manner
–act as resource lists and guides to search engines

10 Hubs and Authorities
Authorities: most prominent sources of primary content for a topic
Hubs: high-quality guides and resource lists that direct users to recommended authorities
Each page is assigned a hub weight and an authority weight
–authority weight: proportional to the sum of the hub weights of the pages that link to it
–hub weight: proportional to the sum of the authority weights of the pages that it links to
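
The two mutually dependent weights can be computed by alternating the two update rules and normalising after each round. This is a minimal sketch under assumptions: the function name `hits`, the fixed iteration count, the Euclidean normalisation, and the guide and site names are illustrative choices, not taken from the slide.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) weights for a dict mapping page -> list of linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages that link to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the weights stay bounded (the "proportional to" part).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical example: two resource lists pointing at three content sites.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB", "siteC"],
    "siteA": [], "siteB": [], "siteC": [],
}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
```

Here the guides collect all of the hub weight and the sites all of the authority weight, with siteC trailing because only one guide recommends it.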

11 Simplified PageRank Algorithm [5]
Formula used by Google to rank pages
–Let u be a web page
–F_u is the set of pages u points to
–B_u is the set of pages that point to u
–N_u = |F_u|
–c is a factor used for normalization
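
The formula itself did not survive in the transcript. With the symbols defined on this slide, the simplified PageRank of [5] is usually written as follows, where R(u) denotes the rank of page u (the symbol R is taken from [5], not from the slide text):

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page v that links to u passes on its own rank R(v), divided evenly among its N_v outgoing links.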

12 Simplified PageRank Calculation where c = 1
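
The worked figure for this slide is missing from the transcript. As a stand-in, the sketch below iterates the simplified rule (each page shares its rank equally among the pages it links to, with c = 1) on a hypothetical three-page graph; the graph, the function name, and the iteration count are assumptions.

```python
def simplified_pagerank(links, iterations=100, c=1.0):
    """Iterate R(u) = c * sum of R(v) / N_v over the pages v that link to u."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                # Page v splits its rank evenly over its N_v out-links.
                new[u] += c * rank[v] / len(targets)
        rank = new
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
print(ranks)
```

On this graph the ranks settle near 0.4 for A and C and 0.2 for B. The sketch only behaves well when every page has at least one out-link, which is the limitation the complete formula addresses.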

13 PageRank Formula
Account for sinks
Complete Formula
–d is empirically set to about 0.15 to 0.2 by the system
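
The complete formula is not preserved in the transcript, so the following is only a sketch of one common formulation. It assumes that d is the probability of jumping to a random page (which matches the stated range of 0.15 to 0.2), that the rank of a sink is redistributed evenly over all pages, and that the graph and function name are hypothetical.

```python
def pagerank(links, d=0.15, iterations=100):
    """Assumed form: R(u) = d/n + (1 - d) * (link contributions + sink share)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by sinks (pages with no out-links) is shared by all pages.
        sink_rank = sum(rank[p] for p in pages if not links[p])
        new = {p: d / n + (1 - d) * sink_rank / n for p in pages}
        for v, targets in links.items():
            for u in targets:
                new[u] += (1 - d) * rank[v] / len(targets)
        rank = new
    return rank

# Hypothetical graph in which D is a sink.
links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}
ranks = pagerank(links)
print(ranks)
```

With this treatment the ranks always sum to 1, and no page can fall below d/n, so a sink no longer drains rank from the rest of the graph.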

14 Using Queries to find Documents Vector Space Model – Content Relevance Slide by Mark Levene [3]

15 Term Frequency (TF)
Count the number of occurrences of each term.
Bag-of-words approach
Ignore stopwords such as is, a, of, the, …
Stemming: computer is replaced by comput, as are its variants computers, computing, computation, and computed.
Normalise TF by dividing by doc length, byte size of doc, or max num of occurrences of a word in the bag.
Example bag (from the slide figure): chess computer programming chess game chess game is a
Slide by Mark Levene [3]
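
A minimal sketch of the counting step, applied to the example words from this slide. The stopword set is a tiny illustrative list, the normalisation shown is the "max number of occurrences" option, and stemming is deliberately left out; the function name is an assumption.

```python
from collections import Counter

STOPWORDS = {"is", "a", "of", "the"}   # tiny illustrative list, not a standard one

def term_frequencies(text):
    """Bag-of-words term frequencies, normalised by the most frequent term."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}   # nothing left after stopword removal
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies("chess computer programming chess game chess game is a")
print(tf)
```

In this bag "chess" occurs three times and gets the maximum value of 1.0, "game" gets two thirds, and the stopwords "is" and "a" are dropped.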

16 Inverse Document Frequency (IDF)
N is the number of documents in the corpus.
n_i is the number of docs in which word i appears.
Log dampens the effect of IDF.
IDF is also the number of bits to represent the term.
Slide by Mark Levene [3]
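
The IDF formula is missing from the transcript. The sketch below assumes the common definition IDF_i = log(N / n_i) and uses a base-2 logarithm to match the "number of bits" remark; the function name and the four toy documents are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """documents is a list of token lists; returns IDF_i = log2(N / n_i) per term."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):           # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / n_i) for term, n_i in doc_freq.items()}

# Hypothetical corpus of four short documents.
docs = [
    ["chess", "game"],
    ["chess", "computer"],
    ["computer", "programming"],
    ["chess"],
]
idf = inverse_document_frequency(docs)
print(idf)
```

A term that appears in only one of the four documents scores 2 bits, while "chess", which appears in three of them, scores well under 1.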

17 Ranking with TF-IDF
j refers to document j
i refers to word (or term) i in doc j
q is the query, which is a sequence of terms
score_j is the score for document j given q
Rank results according to the scoring function.
Slide by Mark Levene [3]
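
The scoring function is not preserved in the transcript. A common form, assumed here, sums TF times IDF over the query terms, with j indexing documents (as in score_j) and i indexing terms. The TF normalisation, the base-2 logarithm, the function name, and the toy corpus are all illustrative assumptions.

```python
import math
from collections import Counter

def rank_documents(query, documents):
    """Return (doc_index, score) pairs, best first; documents are non-empty token lists."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    scores = []
    for j, doc_counts in enumerate(counts):
        max_count = max(doc_counts.values())
        score = 0.0
        for term in query:
            n_i = sum(1 for c in counts if term in c)   # docs containing the term
            if n_i == 0:
                continue   # term appears nowhere, so it contributes nothing
            tf = doc_counts[term] / max_count
            idf = math.log2(n_docs / n_i)
            score += tf * idf
        scores.append((j, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical corpus and query.
docs = [
    ["chess", "game", "chess"],
    ["computer", "programming"],
    ["computer", "chess", "game"],
]
ranking = rank_documents(["computer", "programming"], docs)
print(ranking)
```

Document 1 comes first because it contains both query terms, document 2 matches only "computer", and document 0 matches nothing and scores zero.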

18 Factor in Link Metrics
Multiply by the PageRank of the document (web page).
We do not know exactly how Google factors in the PR; it may be that log(PR) is used.
Slide by Mark Levene [3]

19 Rate of change on the Web [4]
Search engines update their index periodically in order to keep up with the evolving web
–an obsolete index leads to irrelevant or "broken" search results
–update both content and link structure
Sources of change
–content of pages changes
–new pages are added

20 What's new on the Web?
New pages are created at a rate of 8% per week [4]
–New pages borrow a significant amount of content from old pages
–After one year, 50% of the content on the web is new
Only 20% of the pages available today will still be accessible after one year

21 New Link Structure
After a year, about 80% of links on the Web will be replaced with new ones
25% change per week
–week-old rankings may not reflect the current ranking of the pages very well

22 Change in old pages
After one week: 30% of the changed pages have a difference > 5%
After one year: less than 50% of changed pages have a difference > 5%
Creation of new pages is a more significant source of change on the Web

23 Impact on Search Engines
Need to continually update links: this data changes more rapidly than content
–most links persist for less than 6 months
Pages are removed and replaced by new ones at rapid rates
–Sometimes better to use a cached version of a page
Pages that persist usually do not change very much
–Past change does not predict future change

24 Citations
[1] Google. Google.
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es).
[3] M. Levene. Lecture 4: Searching the Web.
[4] A. Ntoulas et al. What's New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of The Thirteenth International World Wide Web Conference, New York, May 17-22.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. April.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.