Link Analysis David Kauchak cs160 Fall 2009 adapted from:

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Chapter 5: Introduction to Information Retrieval
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.
. Markov Chains as a Learning Tool. 2 Weather: raining today40% rain tomorrow 60% no rain tomorrow not raining today20% rain tomorrow 80% no rain tomorrow.
From last time What’s the real point of using vector spaces?: A user’s query can be viewed as a (very) short document. Query becomes a vector in the same.
Web Search – Summer Term 2006 VI. Web Search - Ranking (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
The PageRank Citation Ranking “Bringing Order to the Web”
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Link Analysis, PageRank and Search Engines on the Web
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Topic-Sensitive PageRank Taher H. Haveliwala. PageRank Importance is propagated A global ranking vector is pre-computed.
INF 2914 Web Search Lecture 4: Link Analysis Today’s lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
Information Retrieval
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Evaluation David Kauchak cs458 Fall 2012 adapted from:
Evaluation David Kauchak cs160 Fall 2009 adapted from:
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis.
ITCS 6265 Lecture 17 Link Analysis This lecture Anchor text Link analysis for ranking Pagerank and variants HITS.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.
Hyperlink Analysis for the Web. Information Retrieval Input: Document collection Goal: Retrieve documents or text with information content that is relevant.
INF 141 COURSE SUMMARY Crista Lopes. Lecture Objective Know what you know.
Hyperlink Analysis for the Web. Information Retrieval Input: Document collection Goal: Retrieve documents or text with information content that is relevant.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
Web-Mining Agents Community Analysis Tanya Braun Universität zu Lübeck Institut für Informationssysteme.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
1 CS 430: Information Discovery Lecture 5 Ranking.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
15-499:Algorithms and Applications
Search Engines and Link Analysis on the Web
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Basic Information Retrieval
CS 440 Database Management Systems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Chapter 5: Information Retrieval and Web Search
Presentation transcript:

Link Analysis David Kauchak cs160 Fall 2009 adapted from:

Administrative Course feedback homeworks Midterm review today Assign 4 out soon

The Web as a Directed Graph A hyperlink between pages denotes author perceived relevance AND importance Page A hyperlink Page B How can we use this information?

Query-independent ordering First generation: using link counts as simple measures of popularity Two basic suggestions: Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5) Directed popularity: Score of a page = number of its in-links (3) problems?

What is pagerank? The random surfer model Imagine a user surfing the web randomly using a web browser The pagerank score of a page is the probability that that user will visit a given page Globe-Mascot-Cartoon-Character-Surfing-On-A-Blue-And-Yellow-Surfboard.jpg

Random surfer model We want to model the behavior of a “random” user interfacing the web through a browser Model is independent of content (i.e. just graph structure) What types of behavior should we model and how? Where to start Following links on a page Typing in a url (bookmarks) What happens if we get a page with no outlinks Back button on browser

Random surfer model Start at a random page Go out of the current page along one of the links on that page, equiprobably “Teleporting” If a page has no outlinks always jump to random page With some fixed probability, randomly jump to any other page, otherwise follow links 1/3

The questions… Given a graph and a teleporting probability, we have some probability of visiting every page What is that probability for each page in the graph? o/zsHdSlKc-q4/s640/searchology-web-graph.png

Markov process A markov process is defined by: x 1, x 2, …, x n a set of states An n by n transition matrix describing the probability of transitioning to state x j given that you’re in state x i x1x1 x2x2 xnxn … x1x1 … x2x2 xnxn x ij : p(x j |x i ) rows must sum to 1

Given that a person’s last cola purchase was Coke, there is a 90% chance that his next cola purchase will also be Coke. If a person’s last cola purchase was Pepsi, there is an 80% chance that his next cola purchase will also be Pepsi. coke pepsi Markov Process Coke vs. Pepsi Example transition matrix: pepsi coke

Given that a person is currently a Pepsi purchaser, what is the probability that he will purchase Coke two purchases from now? Markov Process Coke vs. Pepsi Example (cont) coke pepsi transition matrix: pepsi coke

Given that a person is currently a Pepsi purchaser, what is the probability that he will purchase Coke two purchases from now? Markov Process Coke vs. Pepsi Example (cont) Pepsi  ? ?  Coke pepsicoke pepsicoke pepsicoke pepsi coke

14 Given that a person is currently a Coke purchaser, what is the probability that he will purchase Pepsi three purchases from now? Markov Process Coke vs. Pepsi Example (cont) pepsicoke pepsicoke pepsicoke pepsi coke

Steady state In general, we can calculate the probability after n purchases as P n We might also ask the question, what is the probability of being in state coke or pepsi? This is described by the steady state distribution of a markov process Note, this is a distribution over states not state transitions How might we obtain this?

Steady state We have some initial vector describing the probabilities of each starting state For example, coke drinker: x = [1, 0] We could have said a person that drinks coke 80% of the time, x = [0.80, 0.20] pepsicoke pepsicoke pepsicoke

Steady state Most common start with some initial x xP, xP 2, xP 3, xP 4, … For many processes, this will eventually settle

18 Simulation: Markov Process Coke vs. Pepsi Example (cont) week - i Pr[X i = Coke] 2/3 coke pepsi

Back to pagerank Can we use this to solve our random surfer problem? States are web pages Transitions matrix is the probability of transitioning to page A given at page B “Teleport” operation makes sure that we can get to any page from any other page Matrix is much bigger, but same approach is taken… P, P 2, P 3, P 4, …

Pagerank summary Preprocessing: Given a graph of links, build matrix P From it compute steady state of each state An entry is a number between 0 and 1: the pagerank of a page Query processing: Retrieve pages meeting query Integrate pagerank score with other scoring (e.g. tf-idf) Rank pages by this combined score

The reality Pagerank is used in google, but so are many other clever heuristics

Pagerank: Issues and Variants How realistic is the random surfer model? Modeling the back button Surfer behavior sharply skewed towards short paths Search engines, bookmarks & directories make jumps non-random Note that all of these just vary how we create our initial transition probability matrix

Biased surfer models Random teleport to any page is not very reasonable Biased Surfer Models Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection) Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)

Topic Specific Pagerank Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: Selects a category based on a query & user- specific distribution over the categories Teleport to a page uniformly at random within the chosen category What is the challenge?

Topic Specific Pagerank Ideas? Offline:Compute pageranks for individual categories Query independent as before Each page has multiple pagerank scores – one for each category, with teleportation only to that category Online: Distribution of weights over categories computed by query context classification Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks

Spamming pagerank

Other link analysis Pagerank is not the only link analysis method Many improvements/variations of pagerank Hubs and authorities

Midterm review: general notes We’ve covered a lot of material Anything from lecture, readings, homeworks and assignments is fair game Today’s material NOT on the midterm T/F, short answer, short workthrough problems Some questions like homework, but also many “conceptual” questions Won’t need calculator 1hr and 15 mins goes by fast Grading for the class

Midterm review indexes representation skip pointers why we need an index boolean index merge operation query optimization phrase queries (query proximity)

Midterm review index construction implementing efficiently sort-based spimi distributed index contruction map reduce dealing with data that refreshes frequently

Midterm review index compression dictionary compression variable width entries blocking front-coding postings list compression gaps gap encoding/compression

Midterm review documents in the index text preprocessing tokenization text normalization stop lists java regex computer hardware basics data set analysis statistics heaps’ law zipf's law

Midterm review ranked retrieval vector space representation and retrieval representing documents and queries as vectors cosine similarity measure normalization/reweighting techniques calculating similarities from index Speeding up ranking calculations approximate top K approaches (e.g. champion lists) cluster pruning

Midterm review Evaluation precision recall F1 11-point precision MAP Kappa statistic A/B testing

Midterm review snippet/summary generation spelling correction edit distance word n-grams jaccard coefficient relevance feedback

Midterm review web basic web search engine spam estimating the size of the web (or the size of a search engine's index) duplicate detection at web scale crawling