Semi-Supervised Classification of Network Data Using Very Few Labels

1 Semi-Supervised Classification of Network Data Using Very Few Labels
Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University. ASONAM 2010, Odense, Denmark

2 Overview
Preview; MultiRankWalk (Random Walk with Restart, RWR for Classification); Seed Preference; Experiments; Results; The Question

3 Preview
Classification labels are expensive to obtain.
Semi-supervised learning (SSL) learns from both labeled and unlabeled data for classification.

4 Preview [Adamic & Glance 2005]

5 Preview
When it comes to network data, what is a general, simple, and effective method that requires very few labels, one that researchers could use as a strong baseline when developing more complex and domain-specific methods?
Our answer: MultiRankWalk (MRW), and label high-PageRank nodes first (authoritative seeding).

6 Preview
[Figure: accuracy (y-axis) vs. number of training labels (x-axis); MRW (red) vs. a popular method (blue). Annotation: only 1 training label per class!]

7 Preview
[Figure: the popular method using authoritative seeding (red and green) vs. random seeding (blue). Annotations: label “authoritative seeds” first; same blue line as before.]

9 Random Walk with Restart
Imagine a network, and starting at a specific node, you follow the edges randomly. But (perhaps you’re afraid of wandering too far) with some probability, you “jump” back to the starting node (restart!). If you record the number of times you land on each node, what would that distribution look like?

10 Random Walk with Restart
What if we start at a different node? [Figure label: start node]

11 Random Walk with Restart
The walk distribution r satisfies a simple equation whose terms are: the start node(s) vector u, the transition matrix of the network, the restart probability, and the “keep-going” probability (damping factor).
This is equivalent to the well-known PageRank ranking if all nodes are start nodes (u is uniform).
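One way to write an equation matching the labels on this slide, assuming the symbols W for the transition matrix and d for the damping factor (neither symbol is named in the text), is:

```latex
\mathbf{r} = (1 - d)\,\mathbf{u} + d\,W\mathbf{r}
```

Under this reading, u is the distribution over start nodes, 1 - d is the restart probability, and d is the “keep-going” probability.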

12 Random Walk with Restart
Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:
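The procedure itself is not reproduced here. A minimal sketch of one such iteration, assuming the update r ← (1 − d)·u + d·W·r with W the column-normalized transition matrix and d the damping factor, could look like the following. The function name, the adjacency-dict input, and the default d = 0.85 are illustrative assumptions, not taken from the slides.

```python
def random_walk_with_restart(adj, start_nodes, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the RWR distribution by repeated substitution.

    adj: dict mapping each node to a list of its neighbors
         (assumed undirected, every node has at least one neighbor).
    start_nodes: nodes the walker restarts at; u is uniform over them.
    damping: the "keep-going" probability d (0.85 is an assumed default).
    """
    starts = set(start_nodes)
    u = {v: (1.0 / len(starts) if v in starts else 0.0) for v in adj}
    r = dict(u)  # start the iteration from the restart distribution
    for _ in range(max_iter):
        # Restart term: (1 - d) * u
        nxt = {v: (1.0 - damping) * u[v] for v in adj}
        # Walk term: d * W * r, each node splits its mass evenly among neighbors
        for v in adj:
            share = damping * r[v] / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        change = sum(abs(nxt[v] - r[v]) for v in adj)
        r = nxt
        if change < tol:  # stop once the scores no longer move
            break
    return r


# Usage on a small path graph a - b - c - d, restarting at "a".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(path, ["a"])
```

Each pass costs time proportional to the number of edges, which is why the iteration scales to large sparse networks.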

14 RWR for Classification
Simple idea: use RWR for classification.
Run RWR with the labeled points in class A as start nodes.
Run RWR with the labeled points in class B as start nodes.
Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B.

15 RWR for Classification
We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
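The idea can be sketched in a few lines: run one RWR per class and give each node the class whose walk visits it most. This is an illustrative sketch, not the authors' code; the function names, the adjacency-dict input, the fixed iteration count, and the default damping of 0.85 are all assumptions.

```python
def rwr(adj, start_nodes, damping=0.85, iters=200):
    """Random walk with restart scores, by repeated substitution (assumed form)."""
    starts = set(start_nodes)
    u = {v: (1.0 / len(starts) if v in starts else 0.0) for v in adj}
    r = dict(u)
    for _ in range(iters):
        nxt = {v: (1.0 - damping) * u[v] for v in adj}
        for v in adj:
            share = damping * r[v] / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        r = nxt
    return r


def multi_rank_walk(adj, seeds, damping=0.85):
    """Classify every node by the class whose RWR scores it highest.

    seeds: dict mapping each class label to its list of labeled nodes.
    Ties go to whichever class comes first in `seeds`.
    """
    scores = {c: rwr(adj, nodes, damping) for c, nodes in seeds.items()}
    return {v: max(scores, key=lambda c: scores[c][v]) for v in adj}


# Two triangles joined by the edge c - d, one seed label per class.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
predicted = multi_rank_walk(graph, {"A": ["a"], "B": ["f"]})
```

Because each class gets its own walk, the same routine extends to more than two classes without change.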

17 Seed Preference
Obtaining labels for data points is expensive, and we want to minimize the cost of obtaining them.
Observations: some labels are inherently more useful than others; some labels are easier to obtain than others.
Question: “Authoritative” or “popular” nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

18 Seed Preference
Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label. The list (seeds) can be generated uniformly at random, or we can have a seed preference according to simple properties of the unlabeled data.
We consider 3 preferences:
Random: the list is generated uniformly at random
Link Count: nodes with the highest counts make the list
PageRank: nodes with the highest scores make the list
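The three preferences amount to three ways of ordering the unlabeled nodes before handing them to the labeler. A rough sketch follows, assuming an undirected graph stored as an adjacency dict; the function names, the preference strings, and the PageRank settings (damping 0.85, 200 iterations) are hypothetical choices for illustration.

```python
import random


def pagerank(adj, damping=0.85, iters=200):
    """Plain PageRank by power iteration (assumes every node has a neighbor)."""
    n = len(adj)
    r = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in adj}
        for v in adj:
            share = damping * r[v] / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        r = nxt
    return r


def ranked_candidates(adj, preference="pagerank", rng=None):
    """Return all nodes ordered by the chosen seed preference, best first."""
    nodes = list(adj)
    if preference == "random":
        rng = rng or random.Random()
        rng.shuffle(nodes)
        return nodes
    if preference == "linkcount":
        score = {v: len(adj[v]) for v in nodes}  # number of links per node
    elif preference == "pagerank":
        score = pagerank(adj)
    else:
        raise ValueError("unknown preference: " + preference)
    return sorted(nodes, key=lambda v: score[v], reverse=True)


# A star graph: the hub should lead both authoritative orderings.
star = {"hub": ["a", "b", "c", "d"], "a": ["hub"], "b": ["hub"],
        "c": ["hub"], "d": ["hub"]}
by_links = ranked_candidates(star, "linkcount")
by_pagerank = ranked_candidates(star, "pagerank")
by_chance = ranked_candidates(star, "random", random.Random(0))
```

None of the orderings looks at labels, so the list can be prepared before any labeling budget is spent.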

20 Experiments
Test the effectiveness of MRW and compare seed preferences on five real network datasets:
Political Blogs (liberal vs. conservative)
Citation Networks (7 and 6 academic fields, respectively)

21 Experiments
We compare MRW against a currently very popular network SSL method: wvRN, the “weighted-vote relational neighbor classifier”.
You may know wvRN as the harmonic functions method, adsorption, random walk with sink nodes, and so on.
It is recommended as a strong network SSL baseline in (Macskassy & Provost 2007).

22 Experiments
To simulate a human expert labeling data, we use the “ranked-at-least-n-per-class” method.
Political blog example with n=2 (classes: conservative, liberal). Ranked list: blogsforbush.com, dailykos.com, moorewatch.com, right-thinking.com, talkingpointsmemo.com, instapundit.com, michellemalkin.com, atrios.blogspot.com, littlegreenfootballs.com, washingtonmonthly.com, powerlineblog.com, drudgereport.com
We have at least 2 labels per class. Stop.
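One reading of the example above is: walk down the ranked list, label every node encountered, and stop as soon as each class has at least n labels. The sketch below follows that reading; the function name, the `oracle` callback standing in for the human expert, and the toy node names are assumptions made for illustration.

```python
def ranked_at_least_n_per_class(ranked_nodes, oracle, classes, n):
    """Label nodes in ranked order until every class has at least n labels.

    ranked_nodes: nodes ordered by seed preference, best first.
    oracle: callable returning the true label of a node (the "human expert").
    classes: all class labels that must be covered.
    Returns a dict node -> label for every node labeled along the way.
    """
    counts = {c: 0 for c in classes}
    seeds = {}
    for node in ranked_nodes:
        if all(k >= n for k in counts.values()):
            break  # at least n labels per class: stop
        label = oracle(node)
        seeds[node] = label
        counts[label] += 1
    return seeds


# Hypothetical ranked list with hidden labels X and Y, n = 2.
truth = {"n1": "X", "n2": "X", "n3": "X", "n4": "Y", "n5": "Y", "n6": "X"}
seeds = ranked_at_least_n_per_class(list(truth), truth.get, ["X", "Y"], 2)
```

Note that under this reading a class that ranks highly may end up with more than n labels, as X does here.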

24 Results
MRW vs. wvRN with random seed preference, averaged over 20 runs.
MRW does extremely well with just one randomly selected label per class!
MRW is drastically better with a small number of seed labels; performance is not significantly different with larger numbers of seeds.

25 Results
wvRN with different seed preferences.
PageRank is slightly better than LinkCount, but in general not significantly so.
LinkCount or PageRank is much better than Random with a smaller number of seed labels.

26 Results
Does MRW benefit from seed preference?
Yes, on certain datasets with a small number of seed labels; note the already very high F1 on most datasets.
[Figure annotation: a rare instance where authoritative seeds hurt performance, though the difference is not statistically significant.]

27 Results
How much better is MRW using authoritative seed preference?
[Figure: y-axis: MRW F1 score minus wvRN F1; x-axis: number of seed labels per class.]
The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels.

28 Results Summary
MRW is much better than wvRN with a small number of seed labels.
MRW is more robust to varying quality of seed labels than wvRN.
Authoritative seed preference boosts algorithm effectiveness with a small number of seed labels.
We recommend MRW and authoritative seed preference as a strong baseline for semi-supervised classification on network data.

30 The Question
What really makes MRW and wvRN different?
Network-based SSL often boils down to label propagation. MRW and wvRN represent two general propagation methods; note that they are called by many names:
MRW: random walk with restart, regularized random walk, personalized PageRank, local & global consistency
wvRN: reverse random walk, random walk with sink nodes, hitting time, harmonic functions on graphs, iterative averaging of neighbors
Great... but we still don’t know why they behave differently on these network datasets!

31 The Question
It’s difficult to answer exactly why MRW does better with a smaller number of seeds. But we can gather probable factors from their propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive.
2. MRW has an exponential drop-off (damping factor); wvRN has no drop-off or damping.
3. In MRW, propagation of different classes is done independently; in wvRN, the propagation of different classes interacts.
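For contrast with the damped walk of MRW, here is a sketch of the “iterative averaging of neighbors” view of wvRN for two classes: seed scores are clamped and every other node repeatedly takes the mean of its neighbors' scores. This is one common formulation and not necessarily the exact variant used in the experiments; the function name, the 0.5 initial score, and the fixed iteration count are assumptions.

```python
def wvrn_scores(adj, seeds, target, iters=200):
    """Score each node for membership in `target` by neighbor averaging.

    adj: dict node -> list of neighbors (undirected, unweighted).
    seeds: dict node -> known label; seed scores stay clamped at 1.0 or 0.0.
    """
    score = {v: 0.5 for v in adj}  # assumed neutral starting value
    for v, label in seeds.items():
        score[v] = 1.0 if label == target else 0.0
    for _ in range(iters):
        nxt = {}
        for v in adj:
            if v in seeds:
                nxt[v] = score[v]  # clamped: seeds never change
            else:
                # No damping: the score is simply the mean over neighbors
                nxt[v] = sum(score[w] for w in adj[v]) / len(adj[v])
        score = nxt
    return score


# Path a - b - c - d with one seed per class at the two ends.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = wvrn_scores(path, {"a": "A", "d": "B"}, target="A")
```

Here the scores fall off linearly between the two seeds and both classes are resolved in one shared computation, which illustrates the “no drop-off” and “classes interact” rows above.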

32 The Question
An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative. Seed labels are underlined on the slide.
MRW scores:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com
0.017 jmbzine.com
0.017 strangedoctrines.typepad.com
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com
wvRN scores:
1.000 neoconservatives.blogspot.com
1.000 strangedoctrines.typepad.com
1.000 jmbzine.com
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com
1. Centrality-sensitive: seeds have different scores, and not necessarily the highest.
2. Exponential drop-off: much less sure about nodes further away from seeds.
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?).
We still don’t completely understand it yet.

33 Questions?

34 Related Work
MRW is very much related to:
“Local and global consistency” (Zhou et al. 2004)
“Web content categorization using link information” (Gyongyi et al. 2006)
“Graph-based semi-supervised learning as a generative model” (He et al. 2007)
Slide annotations on these works: similar formulation, different view; RWR ranking as features to SVM; random walk without restart, heuristic stopping.
Seed preference is related to the field of active learning.
Active learning chooses which data point to label next based on previous labels; the labeling is interactive.
Seed preference is a batch labeling method.
Authoritative seed preference is a good baseline for active learning on network data!

