# Semi-Supervised Learning With Graphs William Cohen 1.

## Presentation on theme: "Semi-Supervised Learning With Graphs William Cohen 1."— Presentation transcript:

Semi-Supervised Learning With Graphs William Cohen 1

HW7 Hints m movies n users m movies x1y1 x2y2.. …… xnyn a1a2..…am b1b2……bm v11 … …… vij … vnm ~ V[i,j] = user i’s rating of movie j r W H V 2

Homework 7 - Hints What’s parallel and what’s sequential? for epoch t=1,….,T in sequence – for stratum s=1,…,K in sequence for block b=1,… in stratum s in parallel – for triple (i,j,rating) in block b in sequence » do SGD step for (i,j,rating) Hint 1: You need some loops around rdd operations Hint 2: Look at rdd.mapPartitions() to see how to control a parallel/sequential breakdown like the inner loop only needs part of W,H this needs r/w access to (part of) W and H How? 3

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory as shared synchronized memory: you should worry about transmission cost and blocking Abhinav hint: Python dictionaries are thread-safe 4

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory as broadcast memory: workers get unsynchronized copies of the data: worry here is extra broadcast cost hint: mapPartitions ~= Hadoop map 5

Homework 7 - Hints What is rdd.mapPartitions(foo)? Function foo is an iterator that loops over the entries in a partition (aka shard) of rdd and yields a sequence of entries: def foo(iterator): doSomething() #setup for x in iterator … yield … … doAnotherThing() #breakdown 6

Homework 7 - Hints What is rdd.mapPartitions(foo)? Function foo is an iterator that loops over the entries in a partition (aka shard) of rdd and yields a sequence of entries: def foo(iterator): doSomething() #setup: modifiedEntries = set() for (i,j,r) in iterator … yield … # update copies of W[i,*], H[*,j] … doAnotherThing() #output modified entries of W,H input: broadcast W,H, ratings Vij output: updated W,H 7

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory shared memory individual copies of H, W memory – W,H on disk, joined in with (i,j,r) should worry about inflating the size of the ratings – tis scheme inserts many copies of each W – Hybrid schemes: triples (i,j,r)  (i,{(j1,r1),(j2,r2),…,}, W[i,*]) H in memory 8

Homework 7 - Hints Change grain size of computation Rather than streaming through (r,i,j) tasks: – create SGD subtasks W1,H1,V11 W2,H2,V22 W3,H3,V33 – each map task does SGD on one subtask Should worry about subtask size (part of ratings are in memory) 9

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory shared memory individual copies of H, W memory – W,H on disk, joined in with (i,j,r) should worry about inflating the size of the ratings – tis scheme inserts many copies of each W – Hybrid schemes: triples (i,j,r)  (i,{(j1,r1),(j2,r2),…,}, W[i,*]) H in memory 10

Homework 7 – Avuncular Advice This is a really interesting assignment – think and learn! …But don’t stress out We know this assignment is challenging and we’ll curve it appropriately HW8 is being reduced in scope PS. If you’re an undergrad doing carnival stuff talk to me about the (useless) extension 11

Semi-Supervised Learning With Graphs William Cohen 12

Flashback – Graph Algorithms so far…. PageRank and how to scale it up Personalized PageRank/Random Walk with Restart and – how to implement it – how to use it for extracting part of a graph Other uses for graphs? – not so much 13

Main topics today Scalable semi-supervised learning on graphs – SSL with RWR – SSL with coEM/wvRN/HF Scalable unsupervised learning on graphs – Power iteration clustering – … 14

Semi-supervised learning A pool of labeled examples L A (usually larger) pool of unlabeled examples U Can you improve accuracy somehow using U? 15

Semi-Supervised Bootstrapped Learning/Self-training Paris Pittsburgh Seattle Cupertino mayor of arg1 live in arg1 San Francisco Austin denial arg1 is home of traits such as arg1 anxiety selfishness Berlin Extract cities: 16

Semi-Supervised Bootstrapped Learning via Label Propagation Paris live in arg1 San Francisco Austin traits such as arg1 anxiety mayor of arg1 Pittsburgh Seattle denial arg1 is home of selfishness 17

Semi-Supervised Bootstrapped Learning via Label Propagation Paris live in arg1 San Francisco Austin traits such as arg1 anxiety mayor of arg1 Pittsburgh Seattle denial arg1 is home of selfishness Nodes “near” seedsNodes “far from” seeds Information from other categories tells you “how far” (when to stop propagating) arrogance traits such as arg1 denial selfishness 18

ASONAM-2010 (Advances in Social Networks Analysis and Mining) 19

Network Datasets with Known Classes UBMCBlog AGBlog MSPBlog Cora Citeseer 20

RWR - fixpoint of: Seed selection 1.order by PageRank, degree, or randomly 2.go down list until you have at least k examples/class Seed selection 1.order by PageRank, degree, or randomly 2.go down list until you have at least k examples/class 21

Results – Blog data Random Degree PageRank We’ll discuss this soon…. 22

Results – More blog data Random Degree PageRank 23

Results – Citation data Random Degree PageRank 24

Seeding – MultiRankWalk 25

Seeding – HF/wvRN 26

What is HF aka coEM aka wvRN? 27

CoEM/HF/wvRN One definition [MacKassey & Provost, JMLR 2007]:… Another definition: A harmonic field – the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors’ scores; [X. Zhu, Z. Ghahramani, and J. Lafferty, ICML 2003] Another definition: A harmonic field – the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors’ scores; [X. Zhu, Z. Ghahramani, and J. Lafferty, ICML 2003] 28

CoEM/wvRN/HF Another justification of the same algorithm…. … start with co-training with a naïve Bayes learner 29

CoEM/wvRN/HF One algorithm with several justifications…. One is to start with co-training with a naïve Bayes learner And compare to an EM version of naïve Bayes – E: soft-classify unlabeled examples with NB classifier – M: re-train classifier with soft-labeled examples 30

CoEM/wvRN/HF A second experiment – each + example: concatenate features from two documents, one of class A+, one of class B+ – each - example: concatenate features from two documents, one of class A-, one of class B- – features are prefixed with “A”, “B”  disjoint 31

CoEM/wvRN/HF A second experiment – each + example: concatenate features from two documents, one of class A+, one of class B+ – each - example: concatenate features from two documents, one of class A-, one of class B- – features are prefixed with “A”, “B”  disjoint NOW co-training outperforms EM 32

CoEM/wvRN/HF Co-training with a naïve Bayes learner vs an EM version of naïve Bayes – E: soft-classify unlabeled examples with NB classifier – M: re-train classifier with soft-labeled examples vs an EM version of naïve Bayes – E: soft-classify unlabeled examples with NB classifier – M: re-train classifier with soft-labeled examples incremental hard assignments iterative soft assignments 33

Co-Training Rote Learner My advisor + - - pages hyperlinks - - - - + + - - - + + + + - - + + - 34

Co-EM Rote Learner: equivalent to HF on a bipartite graph Pittsburgh + - - contexts NPs - - - - + + - - - + + + + - - + + - lives in _ 35

What is HF aka coEM aka wvRN? Algorithmically: HF propagates weights and then resets the seeds to their initial value MRW propagates weights and does not reset seeds Algorithmically: HF propagates weights and then resets the seeds to their initial value MRW propagates weights and does not reset seeds 36

MultiRank Walk vs HF/wvRN/CoEM Seeds are marked S MRWHF 37

Back to Experiments: Network Datasets with Known Classes UBMCBlog AGBlog MSPBlog Cora Citeseer 38

MultiRankWalk vs wvRN/HF/CoEM 39

How well does MWR work? 40

Parameter Sensitivity 41

Semi-supervised learning A pool of labeled examples L A (usually larger) pool of unlabeled examples U Can you improve accuracy somehow using U? These methods are different from EM – optimizes Pr(Data|Model) How do SSL learning methods (like label propagation) relate to optimization? 42

SSL as optimization slides from Partha Talukdar 43

44

yet another name for HF/wvRN/coEM 45

match seeds smoothness prior 46

47

48

49

How to do this minimization? First, differentiate to find min is at Jacobi method: To solve Ax=b for x Iterate: … or: 50

51

52

precision- recall break even point /HF/… 53

/HF/… 54

/HF/… 55

from mining patterns like “musicians such as Bob Dylan” from HTML tables on the web that are used for data, not formatting 56

57

58

More recent work (AIStats 2014) Propagating labels requires usually small number of optimization passes – Basically like label propagation passes Each is linear in – the number of edges – and the number of labels being propagated Can you do better? – basic idea: store labels in a countmin sketch – which is basically an compact approximation of an object  double mapping 59

Flashback: CM Sketch Structure Each string is mapped to one bucket per row Estimate A[j] by taking min k { CM[k,h k (j)] } Errors are always over-estimates Sizes: d=log 1/ , w=2/   error is usually less than  ||A|| 1 +c h 1 (s) h d (s) d=log 1/  w = 2/  from: Minos Garofalakis 60

More recent work (AIStats 2014) Propagating labels requires usually small number of optimization passes – Basically like label propagation passes Each is linear in – the number of edges – and the number of labels being propagated – the sketch size – sketches can be combined linearly without “unpacking” them: sketch(av + bw) = a*sketch(v)+b*sketch(w) – sketchs are good at storing skewed distributions 61

More recent work (AIStats 2014) Label distributions are often very skewed – sparse initial labels – community structure: labels from other subcommunities have small weight 62

More recent work (AIStats 2014) Freebase Flick-10k “self-injection”: similarity computation 63

More recent work (AIStats 2014) Freebase 64

More recent work (AIStats 2014) 100 Gb available 65