
1 Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning
Frank Lin, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
PhD Thesis Proposal, October 8, 2010, Pittsburgh, PA, USA
Thesis Committee: William W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu

2 Motivation
Graph data is everywhere, and some of it is very big. We want to do machine learning on it. For non-graph data, it often makes sense to represent it as a graph for unsupervised and semi-supervised learning: these methods often require a similarity measure between data points, which is naturally represented as edges between nodes.

3 Motivation
We want to do clustering and semi-supervised learning on large graphs. We need scalable methods that provide quality results.

4 Thesis Goal
Make contributions toward fast, space-efficient, effective, and simple graph-based learning methods that scale up to large datasets.

5 Road Map: Basic Learning Methods and Applications to a Web-Scale Knowledge Base
Unsupervised / Clustering: Graph: Power Iteration Clustering; Induced Graph: PIC with Path Folding; Applications: Learning Synonym NP's, Learning Polyseme NP's, Finding New Types and Relations; Support: Distributed Implementation, Avoiding Collision, Hierarchical Learning.
Semi-Supervised: Graph: MultiRankWalk; Induced Graph: MRW with Path Folding; Applications: Noun Phrase Categorization; Support: Distributed Implementation.

6 Road Map
The same table as slide 5, with entries marked as Prior Work or Proposed Work.

7 Talk Outline
Prior Work:
- Power Iteration Clustering (PIC)
- PIC with Path Folding
- MultiRankWalk (MRW)
Proposed Work:
- MRW with Path Folding
- MRW for Populating a Web-Scale Knowledge Base: Noun Phrase Categorization
- PIC for Extending a Web-Scale Knowledge Base: Learning Synonym NP's, Learning Polyseme NP's, Finding New Ontology Types and Relations
- PIC Extensions
- Distributed Implementations

8 Power Iteration Clustering
Spectral clustering methods are nice, and a natural choice for graph data, but they are rather expensive (slow). Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!

9 Background: Spectral Clustering
Idea: instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the "significant" eigenvectors of a (Laplacian) similarity matrix. A popular spectral clustering method: normalized cuts (NCut).

10 Background: Spectral Clustering
[Figure: dataset and normalized cut results; the 2nd and 3rd smallest eigenvectors (value vs. index) form the clustering space, separating clusters 1, 2, and 3.]

11 Background: Spectral Clustering
Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and similarity function s
2. Derive A from s; let W = I - D^-1 A, where I is the identity matrix and D is a diagonal square matrix with D_ii = Σ_j A_ij
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues as the "significant" eigenvectors
5. Project the data points onto the space spanned by these vectors
6. Run k-means on the projected data points
Note: finding eigenvectors and eigenvalues of a matrix is very slow in general. Can we find a similar low-dimensional embedding for clustering without eigenvectors?
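A minimal sketch of the NCut recipe above, assuming numpy and scikit-learn are available and that A is a small dense similarity matrix (the full eigendecomposition is the expensive step the slides complain about); the function name and defaults are illustrative, not from the thesis:

```python
import numpy as np
from sklearn.cluster import KMeans

def ncut_cluster(A, k):
    D_inv = np.diag(1.0 / A.sum(axis=1))      # D^-1, with D_ii = sum_j A_ij
    W = np.eye(A.shape[0]) - D_inv @ A        # W = I - D^-1 A
    vals, vecs = np.linalg.eig(W)             # the slow step: O(n^3) in general
    order = np.argsort(vals.real)             # eigenvalues in ascending order
    embed = vecs[:, order[1:k]].real          # 2nd..kth smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(embed)
```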

12 The Power Iteration
The power iteration, or power method, is a simple iterative method for finding the dominant eigenvector of a matrix: v^(t+1) = c W v^t, where W is a square matrix, v^t is the vector at iteration t (v^0 is typically a random vector), and c is a normalizing constant that keeps v^t from getting too large or too small. It typically converges quickly and is fairly efficient if W is a sparse matrix.
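A minimal power-method sketch assuming numpy, with the normalizing constant c taken to be 1/||W v|| (one common choice; any norm that keeps v bounded works):

```python
import numpy as np

def power_iteration(W, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])        # v^0: a random vector
    for _ in range(iters):
        v = W @ v
        v /= np.linalg.norm(v)        # keep v from getting too large or too small
    return v                          # approximates the dominant eigenvector
```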

13 The Power Iteration
The power iteration again: v^(t+1) = c W v^t. What if we let W = D^-1 A, the row-normalized similarity matrix (like Normalized Cut)?

14 The Power Iteration [figure]

15 Power Iteration Clustering
The 2nd to kth eigenvectors of W = D^-1 A are roughly piecewise constant with respect to the underlying clusters, each separating one cluster from the rest of the data (Meila & Shi 2001). A linear combination of piecewise constant vectors is also piecewise constant!

16 Spectral Clustering
[Figure: the same dataset and normalized cut results; the 2nd and 3rd smallest eigenvectors (value vs. index) form the clustering space.]

17 [Figure: a weighted sum a·(2nd eigenvector) + b·(3rd eigenvector); since each is piecewise constant, so is their combination.]

18 Power Iteration Clustering
Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.
[Figure: dataset and PIC results, using the one-dimensional embedding v^t.]

19 When to Stop
Recall v^t ∝ W^t v^0. Writing v^0 = c_1 e_1 + c_2 e_2 + … + c_n e_n in terms of the eigenvectors e_i of W (eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n), we get
v^t ∝ c_1 λ_1^t e_1 + … + c_n λ_n^t e_n = c_1 λ_1^t [ e_1 + Σ_{i≥2} (c_i/c_1)(λ_i/λ_1)^t e_i ].
Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1. At the beginning, v changes fast ("accelerating") to converge locally due to the "noise terms" (i = k+1 … n) with small λ. When the noise terms have gone to zero, v changes slowly ("constant speed") because only the larger-λ terms (i = 2 … k) are left, and their eigenvalue ratios are close to 1.

20 Power Iteration Clustering
A basic power iteration clustering (PIC) algorithm:
Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C_1, C_2, …, C_k
1. Pick an initial vector v^0
2. Repeat: set v^(t+1) ← W v^t; set δ^(t+1) ← |v^(t+1) - v^t|; increment t; stop when |δ^t - δ^(t-1)| ≈ 0
3. Use k-means to cluster the points on v^t and return clusters C_1, C_2, …, C_k
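A sketch of the basic PIC algorithm above, assuming numpy and scikit-learn and a dense affinity matrix A; the L1 normalization of v and the tolerance value are my choices, and the stopping rule follows the slide (stop when the change in velocity |δ^t - δ^(t-1)| is near zero):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, eps=1e-5, max_iter=1000):
    W = A / A.sum(axis=1, keepdims=True)    # row-normalize: W = D^-1 A
    v = np.random.rand(W.shape[0])
    v /= np.abs(v).sum()
    delta_prev = np.inf
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()        # keep v bounded
        delta = np.abs(v_new - v).sum()     # velocity delta_t
        v = v_new
        if abs(delta - delta_prev) < eps:   # acceleration ~ 0: stop
            break
        delta_prev = delta
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```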

21 PIC Runtime
[Table: runtime comparison of PIC against Normalized Cut and a faster Normalized Cut implementation; entries marked "ran out of memory (24GB)" indicate runs that could not complete.]

22 PIC Accuracy on Network Datasets
[Scatter plot: each point compares PIC accuracy to NCut or NJW accuracy on a dataset. Upper triangle: PIC does better; lower triangle: NCut or NJW does better.]

23 Talk Outline (repeats the outline from slide 7)

24 Clustering Text Data
Spectral clustering methods are nice. We want to use them for clustering text data, and a lot of it.

25 The Problem with Text Data
Documents are often represented as feature vectors of words, e.g.:
"The importance of a Web page is an inherently subjective matter, which depends on the readers…"
"In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…"
"You're not cool just because you have a lot of followers on twitter, get over yourself…"
[Table: word-count vectors for the three documents over a small vocabulary (cool, web, search, …, you).]

26 The Problem with Text Data
Feature vectors are often sparse, but the similarity matrix is not!
Feature vectors: mostly zeros - any document contains only a small fraction of the vocabulary.
Similarity matrix: mostly non-zero - any two documents are likely to have a word in common.

27 The Problem with Text Data
A similarity matrix is the input to many clustering methods, including spectral clustering. Spectral clustering requires the computation of the eigenvectors of a similarity matrix of the data - in general O(n^3), and approximation methods are still not very fast. The similarity matrix itself takes O(n^2) time to construct, O(n^2) space to store, and at least O(n^2) time to operate on. Too expensive! Does not scale up to big datasets!

28 The Problem with Text Data
We want to use the similarity matrix for clustering (like spectral clustering), but:
- without calculating eigenvectors
- without constructing or storing the similarity matrix
Solution: Power Iteration Clustering + Path Folding

29 Path Folding
Recall the basic PIC algorithm (pick v^0; repeat v^(t+1) ← W v^t until |δ^t - δ^(t-1)| ≈ 0; then run k-means on v^t). The key operation in PIC is v^(t+1) ← W v^t - note: a matrix-vector multiplication! So we have a fast clustering method, but there is still the W that requires O(n^2) storage space and construction and operation time!

30 Path Folding
What's so good about matrix-vector multiplication? If we can decompose the matrix, say W = ABC, then we arrive at the same solution doing a series of matrix-vector multiplications: W v^t = A(B(C v^t)). Isn't this more expensive? Well, not if the original matrix is dense and the factors are sparse!

31 Path Folding
As long as we can decompose the matrix into a series of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications. This means that we can turn an operation that requires O(n^2) storage and runtime into one that requires O(n) storage and runtime! This is exactly the case for text data, and many other kinds of data as well.
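A toy check of the decomposition idea, assuming numpy and scipy; the sizes and density are made up, and the dense product is only affordable here because the example is small. The point is that F (F^T v) gives the same result as (F F^T) v without ever forming the n-by-n matrix:

```python
import numpy as np
import scipy.sparse as sp

n, m = 5000, 2000                                  # documents x vocabulary (toy sizes)
F = sp.random(n, m, density=0.001, format="csr")   # sparse feature matrix
v = np.random.rand(n)

slow = (F @ F.T) @ v        # forms the (mostly dense) n-by-n similarity structure
fast = F @ (F.T @ v)        # two sparse matrix-vector products instead
assert np.allclose(slow, fast)
```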

32 Path Folding
Example - inner product similarity: W = D^-1 (F F^T), where F is the original feature matrix, F^T is the feature matrix transposed, and D is the diagonal matrix that normalizes W so its rows sum to 1.
Construction: F is given (storage O(n)); F^T: just use F; D: storage n.

33 Path Folding
Example - inner product similarity. Iteration update: v^(t+1) ← D^-1 (F (F^T v^t)).
Construction: O(n); storage: O(n); operation: O(n).
Okay… how about a similarity function we actually use for text data?

34 Path Folding
Example - cosine similarity. Iteration update: v^(t+1) ← D^-1 (N (F (F^T (N v^t)))), where N is the diagonal cosine-normalizing matrix.
Construction: O(n); storage: O(n); operation: O(n).
Compact storage: we don't need a cosine-normalized version of the feature vectors.
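A sketch of this folded update, assuming numpy/scipy and that F is a scipy.sparse CSR feature matrix with no empty rows; following the slide, the similarity matrix is never materialized, and D (the row sums of the implicit similarity matrix) is itself computed by applying the folded product to a vector of ones. Variable names are illustrative, not from the thesis:

```python
import numpy as np
import scipy.sparse as sp

def folded_step(F, v):
    # ||F_i|| for each document row (assumes no all-zero rows)
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    apply_N = lambda x: x / norms                        # N = diag(1/||F_i||)
    sim_times = lambda x: apply_N(F @ (F.T @ apply_N(x)))  # (N F F^T N) x
    d = sim_times(np.ones(F.shape[0]))                   # row sums of the implicit similarity matrix
    return sim_times(v) / d                              # W v = D^-1 (N F F^T N) v
```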

35 Path Folding
We refer to this technique as path folding due to its connections to "folding" a bipartite graph into a unipartite graph.

36 Results
An accuracy result: [scatter plot; each point is the accuracy on a 2-cluster text dataset. Upper triangle: we win; lower triangle: spectral clustering wins; diagonal: tied (most datasets). No statistically significant difference - same accuracy.]

37 Results
A scalability result: [plot; y-axis: algorithm runtime (log scale), x-axis: data size (log scale), with linear and quadratic reference curves. Spectral clustering (red and blue) follows the quadratic curve; our method (green) follows the linear curve.]

38 Talk Outline (repeats the outline from slide 7)

39 MultiRankWalk
Classification labels are expensive to obtain. Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification. When it comes to network data, what is a general, simple, and effective method that requires very few labels? Our answer: MultiRankWalk (MRW).

40 Random Walk with Restart Imagine a network, and starting at a specific node, you follow the edges randomly. But with some probability, you “jump” back to the starting node (restart!). If you recorded the number of times you land on each node, what would that distribution look like?

41 Random Walk with Restart
What if we start at a different node? [Figure: the walk distribution for a different start node.]

42 Random Walk with Restart
The walk distribution r satisfies a simple equation: r = (1 - d) u + d W r, where u indicates the start node(s), W is the transition matrix of the network, (1 - d) is the restart probability, and d is the "keep-going" probability (damping factor).

43 Random Walk with Restart
Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure: repeatedly set r^(t+1) ← (1 - d) u + d W r^t until r converges.
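A minimal sketch of that iterative procedure, assuming numpy; here W is taken to be the column-stochastic transition matrix of the network, u puts the start mass on the start node(s), and a fixed iteration count stands in for a convergence test:

```python
import numpy as np

def rwr(W, u, d=0.85, iters=100):
    r = u.copy()
    for _ in range(iters):
        r = (1.0 - d) * u + d * (W @ r)   # r = (1-d) u + d W r
    return r
```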

44 RWR for Classification
Simple idea: use RWR for classification. Run RWR with the start nodes being the labeled points in class A, and again with the start nodes being the labeled points in class B. Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B.
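A sketch of that classification idea, reusing the rwr() function from the previous sketch (assumed in scope); seeds maps each class label to the indices of its labeled nodes, and every node is assigned to the class whose walk visits it most:

```python
import numpy as np

def mrw_classify(W, seeds, d=0.85):
    # seeds: dict mapping class label -> list of labeled node indices
    n = W.shape[0]
    scores = {}
    for c, idx in seeds.items():
        u = np.zeros(n)
        u[idx] = 1.0 / len(idx)                 # start mass spread over class-c seeds
        scores[c] = rwr(W, u, d)                # rwr() from the sketch above
    classes = list(scores)
    stacked = np.vstack([scores[c] for c in classes])
    return [classes[i] for i in stacked.argmax(axis=0)]   # most-frequented class wins
```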

45 Results
Accuracy: MRW vs. the harmonic functions method (HF). [Scatter plot: upper triangle: MRW better; lower triangle: HF better. Point color and style roughly indicate the number of labeled examples.] With a larger number of labels, both methods do well; MRW does much better with only a few labels!

46 Results
How much better is MRW using authoritative seed preference? [Plot: y-axis: MRW F1 score minus wvRN F1; x-axis: number of seed labels per class.] The gap between MRW and wvRN narrows with authoritative seeds, but it remains prominent on some datasets with a small number of seed labels.

47 Talk Outline (repeats the outline from slide 7)

48 MRW with Path Folding
MRW is based on random walk with restart (RWR): r ← (1 - d) u + d W r, whose core operation is a matrix-vector multiplication. "Path folding" can therefore be applied to MRW as well, to efficiently learn from induced graph data - most other methods preprocess the induced graph data to create a sparse similarity matrix. Path folding can also be applied to iterative implementations of the harmonic functions method.
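A short sketch of how the two pieces might fit together, assuming the folded_step idea from the PIC sketch: the W r product in the RWR update is supplied as a function that multiplies by the sparse factors, so the induced similarity graph is never materialized. This is an illustration, not the thesis implementation:

```python
import numpy as np

def folded_rwr(apply_W, u, d=0.85, iters=100):
    # apply_W(x) computes W x via sparse factors, e.g. D^-1 (N (F (F^T (N x))))
    r = u.copy()
    for _ in range(iters):
        r = (1.0 - d) * u + d * apply_W(r)
    return r
```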

49 On to real, large, difficult applications…
We now have a basic set of tools to do efficient unsupervised and semi-supervised learning on graph data. We want to apply them to cool, challenging problems…

50 Talk Outline (repeats the outline from slide 7)

51 A Web-Scale Knowledge Base
Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

52 Noun Phrase and Context Data
As a part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
- NP-context co-occurrence data
- NP-NP co-occurrence data
These datasets can be treated as graphs.

53 Noun Phrase and Context Data
NP-context data: from a sentence such as "… know that drinking pomegranate juice may not be a bad …", the NP "pomegranate juice" is extracted with the before-context "know that drinking _" and the after-context "_ may not be a bad".
[Figure: NP-context graph linking NPs (pomegranate juice, JELL-O, Jagermeister) to contexts ("_ may not be a bad", "_ is made from", "_ promotes responsible") with co-occurrence counts as edge weights.]

54 Noun Phrase and Context Data
NP-NP data: from a sentence such as "… French Constitution Council validates burqa ban …", the NPs "French Constitution Council" and "burqa ban" co-occur.
[Figure: NP-NP graph over NPs such as French Constitution Council, burqa ban, French Court, hot pants, veil, JELL-O, Jagermeister.] Context can be used for weighting edges or making a more complex graph.

55 Talk Outline (repeats the outline from slide 7)

56 Noun Phrase Categorization
We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs. Challenges:
- Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages)
- What's the right function for NP-NP categorical similarity?
- Which learned category assignments should we "promote" to the knowledge base?
- How do we evaluate it?

57 Noun Phrase Categorization (same content as slide 56)

58 Noun Phrase Categorization
Preliminary experiment:
- Small subset of the NP-context data: 88k NPs, 99k contexts
- Find the category "city", starting with a handful of seeds
- Ground truth set of 2,404 city NPs, created using …

59 Noun Phrase Categorization
Preliminary result: [figure]

60 Talk Outline (repeats the outline from slide 7)

61 Learning Synonym NP's
Currently the knowledge base sees NPs with different strings as different entities. It would be useful to know when two different surface forms refer to the same semantic entity. A natural thing to try: use PIC to cluster NPs. But do we get synonyms?

62 Learning Synonym NP's
The NP-context graph yields categorical clusters (e.g., President Obama, Senator Obama, Barack Obama, Senator McCain, George W. Bush, Nicolas Sarkozy, Tony Blair).
The NP-NP graph yields semantically related clusters (e.g., President Obama, Senator Obama, Barack Obama, Senator McCain, African American, Democrat, Bailout).
But together they may help us identify synonyms! Other methods, e.g. string similarity, may help to further refine results: President Obama, Senator Obama, Barack Obama, BarackObama.com, Obamamania, Anti-Obama, Michelle Obama.

63 Talk Outline (repeats the outline from slide 7)

64 Learning Polyseme NP's
Currently a noun phrase found on a web page can only be mapped to a single entity in the knowledge base. It would be useful to know when two different entities share the same surface form. Example: "Jordan"?

65 Learning Polyseme NP's
Proposal: cluster the NP's neighbors - cluster contexts in the NP-context graph and NPs in the NP-NP graph, and identify salient clusters as references to different entities.
Results from clustering the neighbors of "Jordan" in the NP-NP graph (two of the clusters):
Cluster 1: Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships, glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman, Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John, Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr. Cox, third pick, Phoebe, side, third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple, Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement
Cluster 2: 1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology, microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi, Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation, Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states

66 Talk Outline (repeats the outline from slide 7)

67 Finding New Types and Relations
The proposed methods for finding synonyms and polysemes identify particular nodes and relationships between nodes, based on the characteristics of the graph structure, using PIC. Can we generalize these methods to find any unary or binary relationships between nodes in the graph? Different relationships may require different kinds of graphs and similarity functions. Can we learn effective similarity functions for clustering efficiently? This is admittedly the most open-ended part of the proposal. Tom's suggestion: given a set of NPs, can you find k categories and l relationships that cover the most NPs?

68 Talk Outline (repeats the outline from slide 7)

69 PIC Extension: Avoiding Collisions
One robustness question for vanilla PIC as data size and complexity grow: how many (noisy) clusters can you fit in one dimension without them "colliding"? [Figure: one embedding where the cluster signals are cleanly separated, and one where they are a little too close for comfort.]

70 PIC Extension: Avoiding Collisions
We propose two general solutions:
1. Precondition the starting vector; instead of random, use a degree-based or skewed (1, 2, 4, 8, 16, etc.) start vector - the best choice may depend on particular properties of the data.
2. Run PIC d times with different random starts and construct a d-dimensional embedding (a sketch follows) - it is unlikely that two clusters collide on all d dimensions, and we can afford it because PIC is fast and space-efficient!
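A sketch of the second proposal, assuming numpy and scikit-learn; for brevity it uses a fixed iteration count per run rather than the PIC acceleration-based stopping rule, and the function name and defaults are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def pic_multi_embed(A, k, d=4, iters=50, seed=0):
    W = A / A.sum(axis=1, keepdims=True)   # row-normalized affinity matrix
    rng = np.random.default_rng(seed)
    cols = []
    for _ in range(d):                     # one PIC run per random start vector
        v = rng.random(W.shape[0])
        v /= np.abs(v).sum()
        for _ in range(iters):
            v = W @ v
            v /= np.abs(v).sum()
        cols.append(v)
    embedding = np.column_stack(cols)      # n x d embedding; d can stay << k
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```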

71 PIC Extension: Avoiding Collisions
Preliminary results on network classification datasets: [plot per dataset and number of clusters k. RED: PIC embedding with a random start vector; GREEN: PIC using a degree start vector; BLUE: PIC using 4 random start vectors.] One-dimensional PIC embeddings lose accuracy at higher k's compared to NCut and NJW, but using 4 random vectors instead helps! Note that the number of vectors << k.

72 PIC Extension: Avoiding Collisions
Preliminary results on name disambiguation datasets: again, using 4 random vectors seems to work, and again note that the number of vectors << k.

73 PIC Extension: Hierarchical Clustering
Real, large-scale data may not have a "flat" clustering structure; a hierarchical view may be more useful. Good news: the dynamics of a PIC embedding display a hierarchically convergent behavior!

74 PIC Extension: Hierarchical Clustering
Why? Recall the PIC embedding at time t: v^t ∝ c_1 λ_1^t [ e_1 + Σ_{i≥2} (c_i/c_1)(λ_i/λ_1)^t e_i ], where the e_i are the eigenvectors (structure). Less significant eigenvectors / structures (small λ) go away first, one by one; more salient structures (big λ) stick around. There may not be a clear eigengap - rather a gradient of cluster saliency.

75 PIC Extension: Hierarchical Clustering
[Figure: the same dataset you've seen. PIC has already converged to 8 clusters, but let's keep on iterating (yes, it might take a while) - "N" is still a part of the "2009" cluster.] Similar behavior is also noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering).

76 Talk Outline (repeats the outline from slide 7)

77 Distributed / Parallel Implementations
Distributed / parallel implementations of learning methods are necessary to support large-scale data, given the direction of hardware development. PIC, MRW, and their path-folding variants have at their core sparse matrix-vector multiplications, and sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework. We propose to use such a framework; alternatives include an existing graph analysis tool. [The specific tools appeared as logos on the slide and are not captured in this transcript.]
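An illustrative, framework-agnostic sketch of why sparse matrix-vector multiplication parallelizes well: each nonzero W_ij contributes W_ij * v_j to row i (the "map" step), and per-row contributions are summed (the "reduce" step). This is only a conceptual outline, not tied to any particular distributed system:

```python
from collections import defaultdict

def map_phase(nonzeros, v):
    # nonzeros: iterable of (i, j, w) triples for the sparse matrix W
    for i, j, w in nonzeros:
        yield i, w * v[j]

def reduce_phase(pairs):
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial          # (W v)_i = sum_j W_ij v_j
    return sums

# e.g. result = reduce_phase(map_phase(W_triples, v)); in a distributed run the
# map and reduce steps execute in parallel across partitions of the matrix.
```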

78 Talk Outline (repeats the outline from slide 7)

79 Questions?

80 Additional Information

81 PIC: Some “Powering” Methods at a Glance How far can we go with a one- or low-dimensional embedding?

82 PIC: Versus Popular Fast Sparse Eigencomputation Methods
For symmetric matrices / for general matrices, in order of successive improvement:
- Power Method: basic; numerically unstable, can be slow
- Lanczos Method (symmetric) / Arnoldi Method (general): more stable, but require lots of memory
- Implicitly Restarted Lanczos Method (IRLM, symmetric) / Implicitly Restarted Arnoldi Method (IRAM, general): more memory-efficient
Method | Time | Space
IRAM | (O(m^3) + (O(nm) + O(e)) × O(m - k)) × (# restarts) | O(e) + O(nm)
PIC | O(e) × (# iterations) | O(e)
Randomized sampling methods are also popular.

83 PIC: Related Clustering Work
- Spectral clustering (Roxborough & Sen 1997, Shi & Malik 2000, Meila & Shi 2001, Ng et al. 2002)
- Kernel k-means (Dhillon et al. 2007)
- Modularity clustering (Newman 2006)
- Matrix powering: Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000), matrix powering (Zhou & Woodruff 2004), diffusion maps (Lafon & Lee 2006), Gaussian blurring mean-shift (Carreira-Perpinan 2006)
- Mean-shift clustering: mean-shift (Fukunaga & Hostetler 1975, Cheng 1995, Comaniciu & Meer 2002), Gaussian blurring mean-shift (Carreira-Perpinan 2006)

84 PICwPF: Results
[Scatter plot: each point represents the accuracy result from a dataset. Lower triangle: k-means wins; upper triangle: PIC wins.]

85 PICwPF: Results
The two methods have almost the same behavior; overall, neither method is statistically significantly better than the other.

86 PICwPF: Results
We are not sure why NCUTiram did not work as well as NCUTevd. Lesson: approximate eigencomputation methods may require expertise to work well.

87 PICwPF: Results
PIC is O(n) per iteration and the runtime curve looks linear… but I don't like eyeballing curves, and perhaps the number of iterations increases with the size or difficulty of the dataset? [Correlation plot; correlation statistic (0 = none, 1 = correlated).]

88 PICwPF: Results
Linear runtime implies a constant number of iterations. The number of iterations to "acceleration-convergence" is hard to analyze, but it is faster than a single complete run of power iteration to convergence; on our datasets 10-20 iterations is typical and 30-35 is exceptional.

89 PICwPF: Related Work
Faster spectral clustering: approximate eigendecomposition (Lanczos, IRAM), sampled eigendecomposition (Nyström) - not O(n) time methods.
Sparser matrix: sparse construction (k-nearest-neighbor graph, k-matching), graph sampling / reduction - still require O(n^2) construction in general, and are not O(n) space methods.

90 PICwPF: Results [figure]

91 MRW: RWR for Classification We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks

92 MRW: Seed Preference
Obtaining labels for data points is expensive; we want to minimize the cost of obtaining labels. Observations: some labels are inherently more useful than others, and some labels are easier to obtain than others. Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for - but are these labels also more useful than others?

93 Seed Preference
Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label. The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data. We consider 3 preferences:
- Random
- Link Count (nodes with the highest counts make the list)
- PageRank (nodes with the highest scores make the list)
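A small sketch of the three seed preferences, assuming networkx is available and the network is a networkx graph; it simply returns the top-n candidate nodes to hand to a labeler:

```python
import random
import networkx as nx

def pick_seeds(G, n, preference="random"):
    if preference == "random":
        return random.sample(list(G.nodes), n)
    if preference == "linkcount":
        ranked = sorted(G.nodes, key=lambda v: G.degree(v), reverse=True)
    elif preference == "pagerank":
        pr = nx.pagerank(G)
        ranked = sorted(G.nodes, key=lambda v: pr[v], reverse=True)
    else:
        raise ValueError("unknown preference")
    return ranked[:n]   # nodes with the highest counts/scores make the list
```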

94 MRW: The Question
What really makes MRW and wvRN different? Network-based SSL often boils down to label propagation, and MRW and wvRN represent two general propagation methods - note that they are called by many names:
MRW | wvRN
Random walk with restart | Reverse random walk
Regularized random walk | Random walk with sink nodes
Personalized PageRank | Hitting time
Local & global consistency | Harmonic functions on graphs
  | Iterative averaging of neighbors
Great… but we still don't know why their behavior differs on these network datasets!

95 MRW: The Question
It's difficult to answer exactly why MRW does better with a smaller number of seeds, but we can gather probable factors from their propagation models:
1. MRW: centrality-sensitive | wvRN: centrality-insensitive
2. MRW: exponential drop-off / damping factor | wvRN: no drop-off / damping
3. MRW: propagation of different classes done independently | wvRN: propagation of different classes interacts

96 MRW: The Question An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative: 1.000neoconservatives.blogspot.com 1.000strangedoctrines.typepad.com 1.000jmbzine.com 0.593presidentboxer.blogspot.com 0.585rooksrant.com 0.568purplestates.blogspot.com 0.553ikilledcheguevara.blogspot.com 0.540restoreamerica.blogspot.com 0.539billrice.org 0.529kalblog.com 0.517right-thinking.com 0.517tom-hanna.org 0.514crankylittleblog.blogspot.com 0.510hasidicgentile.org 0.509stealthebandwagon.blogspot.com 0.509carpetblogger.com 0.497politicalvicesquad.blogspot.com 0.496nerepublican.blogspot.com 0.494centinel.blogspot.com 0.494scrawlville.com 0.493allspinzone.blogspot.com 0.492littlegreenfootballs.com 0.492wehavesomeplanes.blogspot.com 0.491rittenhouse.blogspot.com 0.490secureliberty.org 0.488decision08.blogspot.com 0.488larsonreport.com 0.020firstdownpolitics.com 0.019neoconservatives.blogspot.com 0.017jmbzine.com 0.017strangedoctrines.typepad.com 0.013millers_time.typepad.com 0.011decision08.blogspot.com 0.010gopandcollege.blogspot.com 0.010charlineandjamie.com 0.008marksteyn.com 0.007blackmanforbush.blogspot.com 0.007reggiescorner.blogspot.com 0.007fearfulsymmetry.blogspot.com 0.006quibbles-n-bits.com 0.006undercaffeinated.com 0.005samizdata.net 0.005pennywit.com 0.005pajamahadin.com 0.005mixtersmix.blogspot.com 0.005stillfighting.blogspot.com 0.005shakespearessister.blogspot.com 0.005jadbury.com 0.005thefulcrum.blogspot.com 0.005watchandwait.blogspot.com 0.005gindy.blogspot.com 0.005cecile.squarespace.com 0.005usliberals.about.com 0.005twentyfirstcenturyrepublican.blogspot.com Seed labels underlined 1. Centrality-sensitive: seeds have different scores and not necessarily the highest 2. Exponential drop- off: much less sure about nodes further away from seeds 3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?) We still don’t really understand it yet.

97 MRW: Related Work
MRW is very much related to:
- "Local and global consistency" (Zhou et al. 2004): similar formulation, different view
- "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM
- "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping
Seed preference is related to the field of active learning: active learning chooses which data point to label next based on previous labels (the labeling is interactive), while seed preference is a batch labeling method. Authoritative seed preference is a good baseline for active learning on network data!

