Information Retrieval Search Engine Technology (11)


1 Prof. Dragomir R. Radev radev@umich.edu
Information Retrieval Search Engine Technology (11) Prof. Dragomir R. Radev

2 SET/IR – W/S 2009 17. continued

3 [Slide from Reka Albert]

4 [Slide from Reka Albert]

5 The strength of weak ties
Granovetter’s study: finding jobs.
Weak ties: more people can be reached through weak ties than through strong ties (e.g., through your 7th and 8th best friends).
More here:

6 Prestige and centrality
Degree centrality: how many neighbors each node has.
Closeness centrality: how close a node is to all of the other nodes.
Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes.
Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
Prestige: the same as centrality, but for directed graphs (e.g., PageRank, HITS).
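The first two definitions can be made concrete with a small sketch. The graph below is a hypothetical four-node path (not from the slides), and closeness is taken here as (n − 1) divided by the sum of shortest-path distances, which is one common normalization and an assumption on my part.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def degree_centrality(graph):
    """Number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def closeness_centrality(graph):
    """(n - 1) / (sum of distances to all other nodes); assumes a connected graph."""
    n = len(graph)
    return {
        node: (n - 1) / sum(bfs_distances(graph, node).values())
        for node in graph
    }

# Hypothetical undirected path graph: a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

In this toy graph the two middle nodes score higher than the endpoints on both measures.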

7 SET/IR – W/S 2009 18. Graph-based methods
Harmonic functions. Random walks. PageRank.

8 Random walks and harmonic functions
Drunkard’s walk: start at a position x on a line with positions 0, 1, 2, 3, 4, 5. What is the probability of reaching 5 before reaching 0?
Harmonic functions: p(0) = 0, p(N) = 1, and p(x) = ½·p(x−1) + ½·p(x+1) for 0 < x < N (in general, replace ½ with the bias in the walk).
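As a short worked check (my own derivation, not on the slide), the linear function p(x) = x/N satisfies all three conditions for the unbiased walk, so a walk starting at x = 2 with N = 5 reaches 5 before 0 with probability 2/5:

```latex
p(x) = \frac{x}{N}: \qquad
p(0) = 0, \qquad
p(N) = 1, \qquad
\tfrac{1}{2}\,\frac{x-1}{N} + \tfrac{1}{2}\,\frac{x+1}{N} = \frac{x}{N} = p(x)
\quad \text{for } 0 < x < N .
```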

9 The original Dirichlet problem
(**) Distribution of temperature in a sheet of metal. One end of the sheet has temperature t=0, the other end: t=1. Laplace’s differential equation: $\nabla^2 u = 0$. This is a special (steady-state) case of the (transient) heat equation: $\partial u / \partial t = \alpha \nabla^2 u$. In general, the solutions to Laplace’s equation are called harmonic functions.

10 Learning harmonic functions
The method of relaxations:
  Discrete approximation.
  Assign fixed values to the boundary points.
  Assign arbitrary values to all other points.
  Adjust their values to be the average of their neighbors.
  Repeat until convergence.
Monte Carlo method:
  Perform a random walk on the discrete representation.
  Compute f as the probability of a random walk ending in a particular fixed point.
Eigenvector methods:
  Look at the stationary distribution of a random walk.
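The first two methods can be sketched on the drunkard's walk from the earlier slide (boundary values 0 at position 0 and 1 at position n). This is a minimal illustration under my own assumptions: the function names, tolerance, trial count, and seed are hypothetical, and the relaxation updates values in place rather than keeping a separate copy per sweep.

```python
import random

def relax(n, tol=1e-12, max_sweeps=100_000):
    """Method of relaxations on the line 0..n with p(0) = 0 and p(n) = 1."""
    p = [0.0] * (n + 1)   # arbitrary starting values for the interior points
    p[n] = 1.0            # fixed boundary values: p[0] = 0, p[n] = 1
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for x in range(1, n):
            average = 0.5 * (p[x - 1] + p[x + 1])
            biggest_change = max(biggest_change, abs(average - p[x]))
            p[x] = average  # in-place update of the interior point
        if biggest_change < tol:
            break
    return p

def monte_carlo(n, start, trials=20_000, seed=0):
    """Fraction of unbiased random walks from `start` that hit n before 0."""
    rng = random.Random(seed)
    reached_n = 0
    for _ in range(trials):
        x = start
        while 0 < x < n:
            x += rng.choice((-1, 1))
        if x == n:
            reached_n += 1
    return reached_n / trials

p = relax(5)
estimate = monte_carlo(5, start=2)
```

Both approaches should agree with p(x) = x/n up to the relaxation tolerance and the sampling noise of the Monte Carlo estimate.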

11 Eigenvectors and eigenvalues
An eigenvector is an implicit “direction” for a matrix: Av = λv, where v (the eigenvector) is non-zero, though λ (the eigenvalue) can be any complex number in principle. Computing eigenvalues: solve det(A − λI) = 0.

12 Eigenvectors and eigenvalues
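Example: $A = \begin{pmatrix} -1 & 3 \\ 2 & 0 \end{pmatrix}$
$\det(A - \lambda I) = (-1-\lambda)(-\lambda) - 3 \cdot 2 = 0$
Then: $\lambda + \lambda^2 - 6 = 0$; $\lambda_1 = 2$; $\lambda_2 = -3$
For $\lambda_1 = 2$: $(A - 2I)x = 0$ gives $-3x_1 + 3x_2 = 0$ and $2x_1 - 2x_2 = 0$. Solutions: $x_1 = x_2$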
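The arithmetic in this example can be checked with a few lines of code. The matrix A = [[-1, 3], [2, 0]] is inferred from the determinant expression and the solution x1 = x2; the helper names are hypothetical.

```python
import math

def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]].

    Solves det(A - lambda*I) = lambda^2 - (a + d)*lambda + (a*d - b*c) = 0.
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[-1, 3], [2, 0]]
lam1, lam2 = eigenvalues_2x2(-1, 3, 2, 0)

# Any vector with x1 = x2 should be an eigenvector for lam1.
v = [1, 1]
Av = mat_vec(A, v)
```

Here `Av` should equal `lam1` times `v`, which is the eigenvector condition for the larger eigenvalue.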

13 Stochastic matrices
Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0.
Example:
The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. For λ1, the left (principal) eigenvector is p, and the right eigenvector is 1 (the all-ones vector). In other words, E^T p = p.

14 Electrical networks and random walks
Ergodic (connected) Markov chain with transition matrix P: w = Pw.
[Figure: resistor network with nodes a, b, c, d and resistances 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω. From Doyle and Snell 2000.]

15 Electrical networks and random walks
[Figure: resistor network with nodes a, b, d, resistances 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω, and a 1 V source.]
vx is the probability that a random walk starting at x will reach a before reaching b.
The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits.
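The link between voltages and random walks can be sketched as follows. The exact wiring of the slide's circuit is not recoverable from the transcript, so the network below is hypothetical; it only reuses the resistance values 1 Ω and 0.5 Ω. The sketch assumes the standard fact from Doyle and Snell that, with v_a = 1 and v_b = 0, each interior voltage is the conductance-weighted average of its neighbors' voltages, which is also the probability that a walk stepping in proportion to conductance reaches a before b.

```python
def solve_voltages(conductance, fixed, tol=1e-12, max_sweeps=100_000):
    """Relax interior voltages to the conductance-weighted average of neighbors."""
    voltages = {node: fixed.get(node, 0.0) for node in conductance}
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for node, neighbors in conductance.items():
            if node in fixed:
                continue  # boundary nodes keep their assigned voltage
            total = sum(neighbors.values())
            new_v = sum(c * voltages[nb] for nb, c in neighbors.items()) / total
            biggest_change = max(biggest_change, abs(new_v - voltages[node]))
            voltages[node] = new_v
        if biggest_change < tol:
            break
    return voltages

# Hypothetical network (conductance = 1 / resistance):
# a-c: 1 ohm, c-d: 1 ohm, c-b: 0.5 ohm, d-b: 0.5 ohm
conductance = {
    "a": {"c": 1.0},
    "b": {"c": 2.0, "d": 2.0},
    "c": {"a": 1.0, "d": 1.0, "b": 2.0},
    "d": {"c": 1.0, "b": 2.0},
}
voltages = solve_voltages(conductance, fixed={"a": 1.0, "b": 0.0})
```

For this hypothetical network, solving the two averaging equations by hand gives v_c = 3/11 and v_d = 1/11, which the relaxation should reproduce.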

16 Markov chains
A homogeneous Markov chain is defined by an initial distribution x0 and a Markov kernel E.
Path = sequence (x0, x1, …, xn), where xi = xi−1·E.
The probability of a path can be computed as a product of probabilities for each step i.
Random walk = find xj given x0, E, and j.
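A minimal sketch of these definitions, assuming E is row-stochastic and distributions are row vectors. The two-state kernel and all function names are hypothetical, chosen only for illustration.

```python
def step(x, E):
    """One step of the chain: row vector x times kernel E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def distribution_after(x0, E, j):
    """Distribution over states after j steps, starting from x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(x0, E, path):
    """Probability of a specific state sequence: initial prob times each transition."""
    prob = x0[path[0]]
    for current, nxt in zip(path, path[1:]):
        prob *= E[current][nxt]
    return prob

# Hypothetical two-state kernel; each row sums to 1.
E = [[0.9, 0.1],
     [0.5, 0.5]]
x0 = [1.0, 0.0]

x2 = distribution_after(x0, E, 2)
p_path = path_probability(x0, E, [0, 0, 1])
```

With this kernel, two steps from state 0 give the distribution (0.86, 0.14), and the path 0, 0, 1 has probability 1 × 0.9 × 0.1 = 0.09.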

17 Stationary solutions
The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
  E is stochastic
  E is irreducible
  E is aperiodic
To make these conditions true:
  All rows of E add up to 1 (and no value is negative)
  Make sure that E is strongly connected
  Make sure that E is not bipartite
Example: PageRank [Brin and Page 1998]: use “teleportation”
  Stochastic, aperiodic, irreducible (strongly connected)
  Unique solutions

18 Example
[Figure: graph with nodes 1 to 8, shown at t=0 and t=1.]
This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.
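Superimposing the uniform graph E’ on E can be sketched as a weighted mixture of the two transition matrices. The mixing weight `teleport=0.15`, the treatment of nodes without out-links, and the three-node graph are assumptions for illustration; the slides do not specify them.

```python
def add_teleportation(adjacency, teleport=0.15):
    """Mix the row-normalized link graph with the uniform transition graph.

    Each row becomes (1 - teleport) * E_row + teleport * uniform_row.
    Rows with no out-links are replaced by the uniform row (an assumption).
    """
    n = len(adjacency)
    uniform = 1.0 / n
    mixed = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            mixed.append([uniform] * n)
        else:
            mixed.append(
                [(1 - teleport) * (w / out_degree) + teleport * uniform for w in row]
            )
    return mixed

# Hypothetical link graph: 0 -> 1, 0 -> 2, 1 -> 2, node 2 has no out-links.
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]
G = add_teleportation(adjacency)
```

Every entry of the result is strictly positive and every row sums to 1, so the mixed chain is stochastic, irreducible, and aperiodic.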

19 Eigenvectors
An eigenvector is an implicit “direction” for a matrix: Ev = λv, where v is non-zero, though λ can be any complex number in principle. The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. For λ1, the left (principal) eigenvector is p, and the right eigenvector is 1 (the all-ones vector). In other words, E^T p = p.

20 Computing the stationary distribution
Solution for the stationary distribution:
function PowerStatDist(E):
begin
  p(0) = u; (or p(0) = [1, 0, …, 0])
  i = 1;
  repeat
    p(i) = E^T p(i−1);
    L = ||p(i) − p(i−1)||_1;
    i = i + 1;
  until L < ε
  return p(i−1)
end
Power methods (KHMG). Convergence rate is O(m).
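The pseudocode above translates almost line for line into Python. This is a sketch under stated assumptions: E is row-stochastic and ergodic, the start vector is uniform, and the threshold `eps`, the iteration cap, and the two-state test kernel are my own choices.

```python
def power_stat_dist(E, eps=1e-12, max_iters=10_000):
    """Power iteration for the stationary distribution p with E^T p = p."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = u, the uniform distribution
    for _ in range(max_iters):
        # new_p = E^T p: entry j collects probability flowing into state j
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 norm of the update
        p = new_p
        if change < eps:
            break
    return p

# Hypothetical two-state kernel; its stationary distribution is (5/6, 1/6).
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = power_stat_dist(E)
```

For this kernel the balance condition 0.1·p0 = 0.5·p1 gives p = (5/6, 1/6), which the iteration should approach quickly because the second eigenvalue is 0.4.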

21 Example
[Figure: graph with nodes 1 to 8, shown at t=0, t=1, and t=10.]

22 PageRank Developed at Stanford and allegedly still being used at Google. Not query-specific, although query-specific varieties exist. In general, each page is indexed along with the anchor texts pointing to it. Among the pages that match the user’s query, Google shows the ones with the largest PageRank. Google also uses vector-space matching, keyword proximity, anchor text, etc.

23

24

25 SET/IR – W/S 2009 19. Hubs and authorities
Bipartite graphs. HITS and SALSA. Models of the web.

26 HITS
Hyperlink-Induced Topic Search. Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine. HITS is query-specific. Hubs and authorities: e.g., collections of bookmarks about cars (hubs) vs. actual sites about cars (authorities). [Figure: Honda, Ford, VW, Car and Driver.]

27 HITS
Each node in the graph is ranked for hubness (h) and authoritativeness (a). Some nodes may have high scores on both.
Example authorities for the query “java”:
  java.sun.com
  digitalfocus.com/digitalfocus/… (The Java developer)
  lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html
  sunsite.unc.edu/javafaq/javafaq.html

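28 HITS
HITS algorithm:
  obtain a root set (using a search engine) related to the input query
  expand the root set by radius one on either side (typically to size )
  run iterations on the hub and authority scores together
  report the top-ranking authorities and hubs
Eigenvector interpretation: with A the adjacency matrix of the expanded graph, the updates are a = A^T h and h = A a, so a converges to the principal eigenvector of A^T A and h to the principal eigenvector of A A^T.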
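The iteration step of the algorithm can be sketched as below. It assumes the standard update rule (a page's authority score is the sum of the hub scores of pages linking to it; its hub score is the sum of the authority scores of pages it links to) with L2 normalization after each round. The four-page graph, the fixed iteration count, and the function names are hypothetical; obtaining and expanding the root set is not shown.

```python
import math

def normalize(scores):
    """Scale a score vector to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else list(scores)

def hits(adjacency, iterations=50):
    """Iterate hub and authority scores together on a link graph.

    adjacency[i][j] = 1 means page i links to page j.
    """
    n = len(adjacency)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # authority of j: sum of hub scores of the pages pointing to j
        auths = [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # hub score of i: sum of authority scores of the pages i points to
        hubs = [sum(adjacency[i][j] * auths[j] for j in range(n)) for i in range(n)]
        auths = normalize(auths)
        hubs = normalize(hubs)
    return hubs, auths

# Hypothetical graph: pages 0 and 1 are link collections, 2 and 3 are content sites.
# 0 -> 2, 0 -> 3, 1 -> 2
adjacency = [[0, 0, 1, 1],
             [0, 0, 1, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
hubs, auths = hits(adjacency)
```

In this toy graph, page 2 ends up as the strongest authority because both link collections point to it, and page 0 is the strongest hub because it points to both content sites.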

29 Example [slide from Baldi et al.]

30 HITS
HITS is now used by Ask.com and Teoma.com.
It can also be used to identify communities (e.g., based on synonyms as well as controversial topics).
Example for “jaguar”:
  The principal eigenvector gives pages about the animal.
  The positive end of the second nonprincipal eigenvector gives pages about the football team.
  The positive end of the third nonprincipal eigenvector gives pages about the car.
Example for “abortion”:
  The positive end of the second nonprincipal eigenvector gives pages on “planned parenthood” and “reproductive rights”.
  The negative end of the same eigenvector includes “pro-life” sites.
SALSA (Lempel and Moran 2001)

31 Models of the Web
Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology.
Erdős/Rényi 59, 60. Barabási/Albert 99. Watts/Strogatz 98. Kleinberg 98.
[Figure: panels A, B, a, b.]
Indegree/outdegree plots.
Social networks.
Bow tie model (not?).
My Erdős number is 4.
Growth of the Web (p 130).
Why is the Web different (MN): how do users really create the Web?
Fat-tailed distributions, small worlds (W&S, # of triangles), hard to destroy.
Communities (Kleinberg).
Evolutionary networks, equilibrium/non-equilibrium.
Evaluation metrics: degree distribution, clustering coefficient, diameter (give comparison).
The W/S model is an equilibrium model (term from statistical mechanics). EN have history, memory.
Small-world networks (+result).
D&M.
Zipf.
What are typical clustering coefficients and diameters?
Random graph: clustering coefficient is much smaller.
Mesoscopic objects.
Phase transitions.
Bipartite networks.
PageRank.
Google changed the way the Web works.
Menczer 02. Radev 03.

32 Evolving Word-based Web
Observations:
  Links are made based on topics.
  Topics are expressed with words.
  Words are distributed very unevenly (Zipf, Benford, self-triggerability laws).
Model:
  Pick n.
  Generate n lengths according to a power-law distribution.
  Generate n documents using a trigram model.
Model (cont’d):
  Pick words in decreasing order of r.
  Generate hyperlinks with random directionality.
Outcome:
  Generates power-law degree distributions.
  Generates topical communities.
Natural variation of PageRank: LexRank.

33 Readings paper by Church and Gale (

