Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data Frank Lin and William W. Cohen School of Computer Science, Carnegie Mellon University.


1 Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data. Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University. KDD-MLG 2011, August 21, 2011, San Diego, CA, USA.

2 Overview: graph-based SSL; the problem with text data; implicit manifolds (two GSSL methods, three similarity functions); a framework; results; related work.

3 Graph-based SSL. Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along the edges of a graph, so they are a natural fit for network data. Given a graph in which a few nodes are labeled class A and a few are labeled class B, how do we label the rest of the nodes?

4 The Problem with Text Data. Documents are often represented as feature vectors of words. For example, three documents (a passage on how the importance of a Web page is an inherently subjective matter that depends on the readers; the abstract presenting Google, a prototype of a large-scale search engine; and a tweet: "You're not cool just because you have a lot of followers on twitter, get over yourself") might map to rows of a document-term matrix:

        cool  web  search  make  over  you
doc 1      0    4       8     2     5    3
doc 2      0    8       7     4     3    2
doc 3      1    0       0     0     1    2

5 The Problem with Text Data. Feature vectors are often sparse, but the similarity matrix is not! The document-term matrix is mostly zeros: any document contains only a small fraction of the vocabulary. The pair-wise similarity matrix, by contrast, is mostly non-zero: any two documents are likely to have a word in common.

6 The Problem with Text Data. A similarity matrix is the input to many GSSL methods, but it takes O(n^2) time to construct, O(n^2) space to store, and at least O(n^2) time to operate on. Too expensive: it does not scale up to big datasets! How can we apply GSSL methods efficiently?
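The sparsity contrast on the previous two slides can be sketched numerically. This is a toy illustration with my own numbers (not from the talk), assuming a random sparse document-term matrix F: the similarity matrix S = F Fᵀ comes out far denser than F itself.

```python
import numpy as np
from scipy import sparse

n_docs, vocab = 200, 1000
# Each document uses about 5% of the vocabulary (~50 words).
F = sparse.random(n_docs, vocab, density=0.05, format="csr", random_state=0)

S = (F @ F.T).toarray()                        # the n x n similarity matrix
feat_density = F.nnz / (n_docs * vocab)
sim_density = np.count_nonzero(S) / (n_docs ** 2)
print(feat_density, sim_density)
```

With these settings the feature matrix is 5% non-zero while the similarity matrix is mostly non-zero, because almost every pair of documents shares at least one word.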

7 The Problem with Text Data. Solutions: 1. Make the matrix sparse: a lot of cool work has gone into how to do this (if you do SSL on large-scale non-network data, there is a good chance you are using this already). 2. Implicit manifolds: this is what we'll talk about!

8 Implicit Manifolds. What do you mean by that? A pair-wise similarity matrix is a manifold on which the data points "reside". It is implicit because we never explicitly construct the manifold (the pair-wise similarity matrix), yet the results we get are exactly the same as if we did. Sounds too good; what's the catch?

9 Implicit Manifolds. Two requirements for using implicit manifolds on text(-like) data: 1. The GSSL method can be implemented with matrix-vector multiplications. 2. We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications. As long as these are met, we can obtain exactly the same results without ever constructing a similarity matrix! Let's look at some specifics, starting with two GSSL methods.
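A minimal sketch of requirement 2 (my own notation, not the authors' code): multiplying a vector by the dense similarity matrix S = F Fᵀ is done as two sparse matrix-vector products, so S itself is never constructed.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
F = sparse.random(500, 2000, density=0.01, format="csr", random_state=1)
v = rng.standard_normal(500)

implicit = F @ (F.T @ v)     # two sparse mat-vec products, O(nnz(F)) work
explicit = (F @ F.T) @ v     # the dense route we are avoiding at scale

assert np.allclose(implicit, explicit)
```

The parenthesization is the whole trick: F @ (F.T @ v) never forms an n x n intermediate, while (F @ F.T) @ v does.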

10 Two GSSL Methods: the harmonic functions method (HF) and MultiRankWalk (MRW). If you've never heard of them, you might have heard one of these:
- Gaussian fields and harmonic functions classifier (Zhu et al. 2003)
- Weighted-vote relational neighbor classifier (Macskassy & Provost 2007)
- Weakly-supervised classification via random walks (Talukdar et al. 2008)
- Adsorption (Baluja et al. 2008)
- Learning on diffusion maps (Lafon & Lee 2006)
- and others...
HF's cousins:
- Partially labeled classification using Markov random walks (Szummer & Jaakkola 2001)
- Learning with local and global consistency (Zhou et al. 2004)
- Graph-based SSL as a generative model (He et al. 2007)
- Ghost edges for classification (Gallagher et al. 2008)
- and others...
MRW and its cousins propagate via forward random walks (with restart); HF and its cousins propagate via backward random walks.

11 Two GSSL Methods: the harmonic functions method (HF) and MultiRankWalk (MRW). In both of these iterative implementations, the core computation is a sequence of matrix-vector multiplications! Let's look at their implicit-manifold qualification...
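As a concrete sketch of those iterative implementations (my own toy graph and variable names; the restart parameter and iteration counts are illustrative, not the authors' settings): HF repeatedly averages neighbor labels while clamping the labeled nodes, and MRW runs one random walk with restart per class, restarting at that class's seed nodes.

```python
import numpy as np

# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3, 3-4.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
labeled = {0: 0, 4: 1}                   # node -> class

# HF: f <- P f, re-clamping the labeled nodes after every update.
f = np.zeros((5, 2))
for node, c in labeled.items():
    f[node, c] = 1.0
for _ in range(200):
    f = P @ f
    for node, c in labeled.items():
        f[node] = 0.0
        f[node, c] = 1.0

# MRW: one random walk with restart per class, r <- (1-a) u + a P^T r,
# where u is uniform over that class's seed nodes.
alpha = 0.85                             # illustrative restart setting
scores = np.zeros((5, 2))
for c in range(2):
    seeds = [n for n, cls in labeled.items() if cls == c]
    u = np.zeros(5)
    u[seeds] = 1.0 / len(seeds)
    r = u.copy()
    for _ in range(200):
        r = (1 - alpha) * u + alpha * (P.T @ r)
    scores[:, c] = r

hf_pred = f.argmax(axis=1)
mrw_pred = scores.argmax(axis=1)
print(hf_pred, mrw_pred)
```

Both loops touch the graph only through products with P (or Pᵀ), which is exactly what makes the implicit-manifold decomposition applicable.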

12 Three Similarity Functions. 1. Inner product similarity: simple, good for binary features. 2. Cosine similarity: often used for document categorization. 3. Bipartite graph walk: can be viewed as a walk on a bipartite graph, or as feature reweighting by relative frequency; related to TF-IDF term weighting. Side note: perhaps all the cool work on matrix factorization could be useful here too...

13 Putting Them Together. Example: HF + inner product similarity. The iteration update becomes a chain of sparse products (the parentheses are important!), and the diagonal matrix D can be calculated as D(i,i) = d(i), where d = F F^T 1. Construction: O(n). Storage: O(n). Operation: O(n). How about a similarity function we actually use for text data?
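The slide's construction can be sketched as follows (my notation): for inner-product similarity S = F Fᵀ, the degree vector d = S 1 is computed as F (Fᵀ 1) without materializing S, and one HF-style update D⁻¹ S f likewise becomes a right-to-left chain of sparse products.

```python
import numpy as np
from scipy import sparse

F = sparse.random(300, 5000, density=0.02, format="csr", random_state=2)
ones = np.ones(F.shape[0])

d_implicit = F @ (F.T @ ones)                  # d = F F^T 1, S never formed
d_explicit = np.asarray((F @ F.T).sum(axis=1)).ravel()  # the expensive way
assert np.allclose(d_implicit, d_explicit)

# One HF-style update f <- D^{-1} (F F^T) f, again all sparse products:
f = np.ones(F.shape[0])
f_new = (F @ (F.T @ f)) / d_implicit
```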

14 Putting Them Together. Example: HF + cosine similarity, where N_c is the diagonal cosine-normalizing matrix. The iteration update is again a chain of sparse products, and the diagonal matrix D can be calculated as D(i,i) = d(i), where d = N_c F F^T N_c 1. Compact storage: we don't need a cosine-normalized version of the feature vectors. Construction: O(n). Storage: O(n). Operation: O(n).
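A sketch of the cosine case (my notation, with the diagonal of N_c kept as a plain vector of inverse row norms): d = N_c F Fᵀ N_c 1 is evaluated right to left as sparse products, so no cosine-normalized copy of F is ever stored; the explicitly normalized matrix below is built only to verify the result.

```python
import numpy as np
from scipy import sparse

F = sparse.random(300, 5000, density=0.02, format="csr", random_state=3)
row_norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
nc = 1.0 / row_norms                 # diagonal of N_c, stored as a vector

ones = np.ones(F.shape[0])
d = nc * (F @ (F.T @ (nc * ones)))   # d = N_c F F^T N_c 1, right to left

# Check against the explicitly normalized matrix (built here only to verify):
Fn = sparse.diags(nc) @ F
d_check = np.asarray((Fn @ Fn.T).sum(axis=1)).ravel()
assert np.allclose(d, d_check)
```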

15 A Framework. So what's the point? 1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do so efficiently. 2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones). 3. Building an SSL tool set: pick the one that works for you, drawing on all the great work that has been done.

16 A Framework. Choose your SSL method: Harmonic Functions or MultiRankWalk. ("Hmm, I have a large dataset with very few training labels; what should I try?" "How about MultiRankWalk with a low restart probability?")

17 A Framework. ... and pick your similarity function: Inner Product, Cosine Similarity, or Bipartite Graph Walk. ("But the documents in my dataset are kinda long..." "Can't go wrong with cosine similarity!")

18 Results. On two commonly used document categorization datasets and two less common NP classification datasets. Goal: to show these methods do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions.

19 Results: document categorization.

20 Results: noun phrase classification datasets.

21 Questions?

22 Additional Information

23 MRW: RWR for Classification. We refer to this method as MultiRankWalk: it classifies data using multiple rankings produced by random walks with restart.

24 MRW: Seed Preference. Obtaining labels for data points is expensive, and we want to minimize the cost of obtaining them. Observations: some labels are inherently more useful than others, and some labels are easier to obtain than others. Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for, but are these labels also more useful than others?

25 Seed Preference. Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label. The list of seeds can be generated uniformly at random, or we can apply a seed preference based on simple properties of the unlabeled data. We consider three preferences: random; link count (nodes with the highest link counts make the list); and PageRank (nodes with the highest PageRank scores make the list).
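The three preferences can be sketched as follows (my own function and parameter names, not the authors' code); each call returns the k nodes to hand to the labeler.

```python
import numpy as np

def pick_seeds(W, k, preference="random", damping=0.85, iters=100, seed=0):
    """Return the indices of k nodes to label, under a seed preference."""
    n = W.shape[0]
    if preference == "random":
        return np.random.default_rng(seed).choice(n, size=k, replace=False)
    if preference == "link_count":
        return np.argsort(W.sum(axis=1))[::-1][:k]   # highest-degree nodes
    if preference == "pagerank":
        P = W / W.sum(axis=1, keepdims=True)
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - damping) / n + damping * (P.T @ r)
        return np.argsort(r)[::-1][:k]               # highest-PageRank nodes
    raise ValueError(preference)

# Star graph: node 0 is the hub, so both non-random preferences rank it first.
W = np.zeros((6, 6))
W[0, 1:] = 1.0
W[1:, 0] = 1.0
print(pick_seeds(W, 2, "link_count"), pick_seeds(W, 2, "pagerank"))
```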

26 MRW: The Question. What really makes MRW and wvRN different? Network-based SSL often boils down to label propagation, and MRW and wvRN represent two general propagation methods. Note that they are called by many names:

MRW                          wvRN
Random walk with restart     Reverse random walk
Regularized random walk      Random walk with sink nodes
Personalized PageRank        Hitting time
Local & global consistency   Harmonic functions on graphs
                             Iterative averaging of neighbors

Great... but we still don't know why their behavior differs on these network datasets!

27 MRW: The Question. It's difficult to answer exactly why MRW does better with a smaller number of seeds, but we can identify probable factors from their propagation models: 1. MRW is centrality-sensitive; wvRN is centrality-insensitive. 2. MRW has an exponential drop-off (damping factor); wvRN has no drop-off / damping. 3. In MRW, propagation of different classes is done independently; in wvRN, propagation of different classes interacts.

28 MRW: The Question. An example from a political blog dataset: MRW vs. wvRN scores for how politically conservative a blog is (seed labels were underlined on the original slide: neoconservatives.blogspot.com, strangedoctrines.typepad.com, and jmbzine.com).

wvRN scores (seeds clamped at 1.000):
1.000 neoconservatives.blogspot.com
1.000 strangedoctrines.typepad.com
1.000 jmbzine.com
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com

MRW scores:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com
0.017 jmbzine.com
0.017 strangedoctrines.typepad.com
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com

Observations: 1. Centrality-sensitive: under MRW, seeds have different scores and are not necessarily the highest. 2. Exponential drop-off: MRW is much less sure about nodes further away from seeds. 3. Classes propagate independently: charlineandjamie.com scores as both very likely conservative and very likely liberal (good or bad?). We still don't really understand it yet.

29 MRW: Related Work. MRW is very much related to:
- "Local and global consistency" (Zhou et al. 2004): similar formulation, different view.
- "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM.
- "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, with heuristic stopping.
Seed preference is related to the field of active learning: active learning chooses which data point to label next based on previous labels (the labeling is interactive), whereas seed preference is a batch labeling method. Authoritative seed preference is a good baseline for active learning on network data!

30 Results. How much better is MRW with authoritative seed preference? The y-axis is the MRW F1 score minus the wvRN F1 score; the x-axis is the number of seed labels per class. The gaps between MRW and wvRN narrow with authoritative seeds, but they remain prominent on some datasets with small numbers of seed labels.

31 A Web-Scale Knowledge Base. The Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

32 Noun Phrase and Context Data. As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced: NP-context co-occurrence data and NP-NP co-occurrence data. These datasets can be treated as graphs.

33 Noun Phrase and Context Data. NP-context data: for example, from the sentence fragment "... know that drinking pomegranate juice may not be a bad ...", the NP "pomegranate juice" co-occurs with the before-context "know that drinking _" (count 3) and the after-context "_ may not be a bad" (count 2). [Figure: an NP-context graph linking NPs such as pomegranate juice, JELL-O, and Jagermeister to contexts such as "_ is made from" and "_ promotes responsible", with edges weighted by co-occurrence counts (5, 8, 2, 1 in the example).]
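A toy sketch of the NP-context data as a weighted bipartite graph (the NPs and contexts come from the slide's example, but the assignment of the counts 5, 8, 2, 1 to particular edges is my hypothetical): rows are NPs, columns are contexts, and a one-step NP-to-context-to-NP walk gives an implicit NP-NP similarity.

```python
import numpy as np
from scipy import sparse

nps = ["pomegranate juice", "JELL-O", "Jagermeister"]
contexts = ["know that drinking _", "_ may not be a bad",
            "_ is made from", "_ promotes responsible"]

# Edge list: (NP index, context index, co-occurrence count).
rows   = [0, 0, 1, 2, 1, 2]
cols   = [0, 1, 2, 2, 3, 3]
counts = [3, 2, 5, 8, 2, 1]
B = sparse.csr_matrix((counts, (rows, cols)),
                      shape=(len(nps), len(contexts)))

# One step of the bipartite walk: NPs sharing a context become similar.
np_np = (B @ B.T).toarray()
print(np_np)
```

This is the same implicit-manifold pattern as before: B Bᵀ is never needed explicitly when the downstream method only multiplies vectors by it.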

34 Noun Phrase and Context Data. NP-NP data: for example, from the fragment "... French Constitution Council validates burqa ban ...", the NPs "French Constitution Council" and "burqa ban" co-occur. [Figure: an NP-NP graph linking NPs such as French Constitution Council, burqa ban, French Court, hot pants, veil, JELL-O, and Jagermeister.] The context can be used for weighting edges or for building a more complex graph.

35 Noun Phrase Categorization. We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs. Challenges: a large, noisy dataset (10m NPs and 8.6m contexts from 500m web pages); what's the right function for NP-NP categorical similarity?; which learned category assignments should we "promote" to the knowledge base?; and how do we evaluate it?

37 Noun Phrase Categorization. Preliminary experiment: a small subset of the NP-context data (88k NPs, 99k contexts); find the category "city", starting with a handful of seeds; ground truth set of 2,404 city NPs created using

