Semi-Supervised Classification of Network Data Using Very Few Labels


Semi-Supervised Classification of Network Data Using Very Few Labels
Frank Lin and William W. Cohen
School of Computer Science, Carnegie Mellon University
ASONAM 2010, 2010-08-11, Odense, Denmark

Overview
- Preview
- Random Walk with Restart
- RWR for Classification (MultiRankWalk)
- Seed Preference
- Experiments
- Results
- The Question

Preview
Classification labels are expensive to obtain. Semi-supervised learning (SSL) learns from both labeled and unlabeled data for classification.

Preview
[Figure: the political blogosphere network, from Adamic & Glance 2005]

Preview
When it comes to network data, what is a general, simple, and effective method that requires very few labels? One that researchers could use as a strong baseline when developing more complex and domain-specific methods?
Our answer: MultiRankWalk (MRW), plus authoritative seeding: label high-PageRank nodes first.

Preview
[Figure: accuracy vs. number of training labels] MRW (red) vs. a popular method (blue). MRW does well with only 1 training label per class!

Preview
[Figure: the popular method using authoritative seeding (red & green) vs. random seeding (blue); the blue line is the same as before. With authoritative seeding, the "authoritative seeds" are labeled first.]

Random Walk with Restart
Imagine a network. Starting at a specific node, you follow the edges at random, but (perhaps because you're afraid of wandering too far) with some probability you "jump" back to the starting node (restart!). If you record the number of times you land on each node, what would that distribution look like?

Random Walk with Restart
[Figure: the walk distribution, concentrated around the start node] What if we start at a different node?

Random Walk with Restart
The walk distribution r satisfies a simple equation:

r = (1 − d)·u + d·W·r

where u is the distribution over the start node(s), W is the transition matrix of the network, 1 − d is the restart probability, and d is the "keep-going" probability (damping factor). This is equivalent to the well-known PageRank ranking if all nodes are start nodes (u is uniform)!

Random Walk with Restart
Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure: start with r = u and repeatedly apply r ← (1 − d)·u + d·W·r until convergence.
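The iterative procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `rwr` and the convergence tolerance are my own choices.

```python
import numpy as np

def rwr(W, start_nodes, d=0.85, tol=1e-8, max_iter=1000):
    """Random walk with restart via power iteration.

    W: column-stochastic transition matrix (W[i, j] = P(step j -> i)).
    start_nodes: indices of the restart node(s).
    d: "keep-going" probability (damping factor); restart prob. is 1 - d.
    """
    n = W.shape[0]
    u = np.zeros(n)
    u[start_nodes] = 1.0 / len(start_nodes)  # uniform over start nodes
    r = u.copy()
    for _ in range(max_iter):
        r_new = (1 - d) * u + d * (W @ r)    # the RWR update
        if np.abs(r_new - r).sum() < tol:    # L1 change small -> converged
            return r_new
        r = r_new
    return r
```

With u uniform over all nodes this reduces to ordinary PageRank, as the previous slide notes.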

RWR for Classification
Simple idea: use RWR for classification. Run one RWR with the start nodes being the labeled points in class A, and another RWR with the start nodes being the labeled points in class B. Nodes frequented more by RWR(A) belong to class A; the rest belong to class B.

RWR for Classification We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
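The idea above can be sketched in a few lines. This is my own illustrative code, not the authors' implementation; the function name and parameters are hypothetical.

```python
import numpy as np

def multirankwalk(W, seeds_by_class, d=0.85, n_iter=200):
    """One RWR per class, restarting from that class's labeled seeds;
    each node is assigned the class whose walk visits it most often.

    W: column-stochastic transition matrix of the network.
    seeds_by_class: dict mapping class label -> list of seed node indices.
    """
    n = W.shape[0]
    scores = {}
    for label, seeds in seeds_by_class.items():
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)          # restart distribution over seeds
        r = u.copy()
        for _ in range(n_iter):
            r = (1 - d) * u + d * (W @ r)    # RWR iteration
        scores[label] = r
    labels = list(scores)
    stacked = np.vstack([scores[c] for c in labels])
    return [labels[i] for i in stacked.argmax(axis=0)]  # per-node argmax
```

Nothing restricts this to two classes; one ranking is produced per class, hence "multiple rankings using random walks".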

Seed Preference
Obtaining labels for data points is expensive, and we want to minimize the cost of obtaining them. Observations: some labels are inherently more useful than others, and some labels are easier to obtain than others. Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for, but are these labels also more useful than others?

Seed Preference
Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label. The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data. We consider 3 preferences:
- Random
- LinkCount: nodes with the highest link counts make the list
- PageRank: nodes with the highest PageRank scores make the list
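The three preferences could be sketched like this (illustrative code under my own naming; PageRank is computed by power iteration with a uniform restart, as in the RWR slides, with an assumed damping factor of 0.85):

```python
import numpy as np

def seed_order(A, preference, rng=None):
    """Order in which nodes are offered to the labeler, under one of
    three seed preferences. A: symmetric adjacency matrix."""
    n = A.shape[0]
    if preference == "random":
        if rng is None:
            rng = np.random.default_rng(0)
        return list(rng.permutation(n))
    if preference == "linkcount":
        return list(np.argsort(-A.sum(axis=1)))   # highest degree first
    if preference == "pagerank":
        W = A / A.sum(axis=0)                     # column-stochastic
        u = np.full(n, 1.0 / n)                   # uniform restart = PageRank
        r = u.copy()
        for _ in range(100):
            r = 0.15 * u + 0.85 * (W @ r)
        return list(np.argsort(-r))               # highest PageRank first
    raise ValueError(f"unknown preference: {preference}")
```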

Experiments
We test the effectiveness of MRW and compare seed preferences on five real network datasets: Political Blogs (Liberal vs. Conservative) and Citation Networks (7 and 6 academic fields, respectively).

Experiments
We compare MRW against a currently very popular network SSL method: wvRN, the "weighted-vote relational neighbor" classifier. You may know wvRN as the harmonic functions method, adsorption, random walk with sink nodes, ... It is recommended as a strong network SSL baseline in (Macskassy & Provost 2007).

Experiments
To simulate a human expert labeling data, we use the "ranked-at-least-n-per-class" method: go down a ranked list of data points, obtaining a label for each one, and stop as soon as there are at least n labels per class. Political blog example with n = 2 (labels are conservative or liberal), going down the list: blogsforbush.com, dailykos.com, moorewatch.com, right-thinking.com, talkingpointsmemo.com, instapundit.com, michellemalkin.com, atrios.blogspot.com, littlegreenfootballs.com, washingtonmonthly.com, powerlineblog.com, drudgereport.com, ... Once we have at least 2 labels per class, we stop.
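The procedure above can be sketched as follows (illustrative code; `oracle` is a hypothetical stand-in for the human expert):

```python
def ranked_at_least_n(ranked_nodes, oracle, classes, n):
    """Label nodes down a ranked list until every class has >= n labels."""
    counts = {c: 0 for c in classes}
    seeds = {c: [] for c in classes}
    for node in ranked_nodes:
        c = oracle(node)          # simulated human expert labels this node
        seeds[c].append(node)
        counts[c] += 1
        if all(v >= n for v in counts.values()):
            break                 # every class has at least n labels: stop
    return seeds
```

The ranked list itself comes from the chosen seed preference (random, LinkCount, or PageRank).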

Results
MRW vs. wvRN with random seed preference, averaged over 20 runs: MRW is drastically better with a small number of seed labels; performance is not significantly different with larger numbers of seeds. MRW does extremely well with just one randomly selected label per class!

Results
wvRN with different seed preferences: LinkCount or PageRank is much better than Random with smaller numbers of seed labels. PageRank is slightly better than LinkCount, but in general not significantly so.

Results
Does MRW benefit from seed preference? Yes, on certain datasets with a small number of seed labels; note the already very high F1 on most datasets. There is one rare instance where authoritative seeds hurt performance, but the difference is not statistically significant.

Results
How much better is MRW using authoritative seed preference? [Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class] The gaps between MRW and wvRN narrow with authoritative seeds, but are still prominent on some datasets with small numbers of seed labels.

Results Summary
- MRW is much better than wvRN with a small number of seed labels
- MRW is more robust than wvRN to varying quality of seed labels
- Authoritative seed preference boosts algorithm effectiveness with a small number of seed labels
- We recommend MRW with authoritative seed preference as a strong baseline for semi-supervised classification on network data

The Question
What really makes MRW and wvRN different? Network-based SSL often boils down to label propagation, and MRW and wvRN represent two general propagation methods. Note that each is called by many names:
- MRW: random walk with restart; regularized random walk; personalized PageRank; local & global consistency
- wvRN: reverse random walk; random walk with sink nodes; hitting time; harmonic functions on graphs; iterative averaging of neighbors
Great... but we still don't know why their behavior differs on these network datasets!

The Question
It's difficult to answer exactly why MRW does better with a smaller number of seeds. But we can gather probable factors from their propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive.
2. MRW has an exponential drop-off (damping factor); wvRN has no drop-off / damping.
3. In MRW, propagation of different classes is done independently; in wvRN, propagation of the different classes interacts.
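To make the contrast concrete, here is a minimal sketch of wvRN-style "iterative averaging of neighbors" with clamped seeds, for two classes. This is my own illustration of the general idea, not the authors' or Macskassy & Provost's code; note the absence of any damping factor, and that both classes share one score, unlike MRW's independent per-class walks.

```python
import numpy as np

def wvrn(A, seed_pos, seed_neg, n_iter=200):
    """Harmonic-function-style label propagation for two classes:
    clamp the seeds, then repeatedly set each unlabeled node's score
    to the average of its neighbors' scores.

    A: symmetric adjacency matrix. Returns a score in [0, 1] per node
    (1.0 = positive class, 0.0 = negative class).
    """
    n = A.shape[0]
    score = np.full(n, 0.5)                  # unlabeled nodes start neutral
    score[seed_pos] = 1.0
    score[seed_neg] = 0.0
    clamped = np.zeros(n, dtype=bool)
    clamped[seed_pos] = clamped[seed_neg] = True
    deg = A.sum(axis=1)
    for _ in range(n_iter):
        new = (A @ score) / deg              # average of neighbors' scores
        score = np.where(clamped, score, new)  # seeds stay clamped
    return score
```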

The Question
We still don't completely understand it yet, but:
1. Centrality-sensitive: under MRW, seeds have different scores, and not necessarily the highest ones.
2. Exponential drop-off: MRW is much less sure about nodes further away from the seeds.
3. Classes propagate independently: under MRW, charlineandjamie.com scores as both very likely a conservative and very likely a liberal blog (good or bad?).

An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative. The seed labels were underlined on the original slide; they correspond to the blogs scored 1.000 in the wvRN column.

MRW scores:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com
0.017 jmbzine.com
0.017 strangedoctrines.typepad.com
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com

wvRN scores:
1.000 neoconservatives.blogspot.com
1.000 strangedoctrines.typepad.com
1.000 jmbzine.com
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com

Questions?

Related Work
MRW is very much related to:
- "Local and global consistency" (Zhou et al. 2004): similar formulation, different view
- "Web content categorization using link information" (Gyongyi et al. 2006): random walk without restart, heuristic stopping
- "Graph-based semi-supervised learning as a generative model" (He et al. 2007): RWR ranking as features to SVM

Seed preference is related to the field of active learning. Active learning chooses which data point to label next based on previous labels; the labeling is interactive. Seed preference, by contrast, is a batch labeling method. Authoritative seed preference is a good baseline for active learning on network data!