1 Statistical Learning from Relational Data
Daphne Koller, Stanford University
Joint work with many, many people

2 Relational Data is Everywhere
- The web: webpages (and the entities they represent), hyperlinks
- Social networks: people, institutions, friendship links
- Biological data: genes, proteins, interactions, regulation
- Bibliometrics: papers, authors, journals, citations
- Corporate databases: customers, products, transactions

3 Relational Data is Different
- Data instances are not independent: the topics of linked webpages are correlated
- Data instances are not identically distributed: instances are heterogeneous (papers, authors)
- No IID assumption, and this is a good thing

4 New Learning Tasks
- Collective classification of related instances: labeling an entire website of related webpages
- Relational clustering: finding coherent clusters in the genome
- Link prediction and classification: predicting when two people are likely to be friends
- Pattern detection in networks of related objects: finding groups (research groups, terrorist groups)

5 Probabilistic Models
- Uncertainty model: a space of "possible worlds" and a probability distribution over this space
- Worlds are often defined via a set of state variables (medical diagnosis: diseases, symptoms, findings, ...); each world is an assignment of values to the variables
- The number of worlds is exponential in the number of variables: 2^n if we have n binary variables (see the sketch below)
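A minimal sketch of the "possible worlds" view: the variable names and the distribution below are made-up illustrations, not anything from the talk.

```python
import random
from itertools import product

# Three binary state variables: each of the 2**3 = 8 joint assignments
# is one "possible world".
variables = ["disease", "symptom", "finding"]
worlds = list(product([0, 1], repeat=len(variables)))
assert len(worlds) == 2 ** len(variables)

# An arbitrary positive score per world, normalized into a distribution.
random.seed(0)
scores = {w: random.random() for w in worlds}
Z = sum(scores.values())
P = {w: s / Z for w, s in scores.items()}

# Any query is a sum over worlds, e.g. P(disease = 1):
p_disease = sum(p for w, p in P.items() if w[0] == 1)
print(f"P(disease=1) = {p_disease:.3f}")
```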

6 Outline
- Relational Bayesian networks*
- Relational Markov networks
- Collective classification
- Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor

7 Bayesian Networks
- Nodes = variables; edges = direct influence
- Graph structure encodes independence assumptions: Letter is conditionally independent of Intelligence given Grade
- Example network over Difficulty, Intelligence, Grade, SAT, Letter/Job; e.g., Grade has parents Difficulty and Intelligence, with CPD P(G|D,I) over grades A, B, C

8 BN Semantics
- Compact and natural representation: if nodes have at most k parents, we need on the order of 2^k * n parameters instead of 2^n
- Parameters are natural and easy to elicit
- Conditional independencies in the BN structure + local probability models = the full joint distribution over the domain: $P(D,I,G,S,L) = P(D)\,P(I)\,P(G \mid D,I)\,P(S \mid I)\,P(L \mid G)$
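A small sketch of this factorization for the student network above; all CPD numbers are made up for illustration.

```python
# Joint distribution of the student network via the BN chain rule:
# P(D,I,G,S,L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G).
# All probability values below are illustrative assumptions, not from the talk.

P_D = {0: 0.6, 1: 0.4}                      # Difficulty: 0 = easy, 1 = hard
P_I = {0: 0.7, 1: 0.3}                      # Intelligence: 0 = low, 1 = high
P_G = {                                     # Grade in {0:A, 1:B, 2:C}, given (D, I)
    (0, 0): {0: 0.3, 1: 0.4, 2: 0.3},
    (0, 1): {0: 0.9, 1: 0.08, 2: 0.02},
    (1, 0): {0: 0.05, 1: 0.25, 2: 0.7},
    (1, 1): {0: 0.5, 1: 0.3, 2: 0.2},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}                        # SAT given I
P_L = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.4, 1: 0.6}, 2: {0: 0.99, 1: 0.01}}  # Letter given G

def joint(d, i, g, s, l):
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]

# Sanity check: the factored distribution sums to 1 over all worlds.
total = sum(joint(d, i, g, s, l)
            for d in P_D for i in P_I for g in range(3) for s in range(2) for l in range(2))
print(f"sum over all worlds = {total:.6f}")
```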

9 Reasoning Using BNs
- The full joint distribution specifies the answer to any query: P(variable | evidence about the others)
- E.g., observing Letter and SAT changes the posterior over Intelligence and Difficulty
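Continuing the sketch above, a query answered by brute-force enumeration over worlds (feasible only for tiny networks; practical BN inference uses variable elimination or similar):

```python
def query_intelligence_given(l_obs, s_obs):
    # P(I | L = l_obs, S = s_obs), summing the joint over all other variables.
    unnorm = {i: 0.0 for i in P_I}
    for d in P_D:
        for i in P_I:
            for g in range(3):
                unnorm[i] += joint(d, i, g, s_obs, l_obs)
    Z = sum(unnorm.values())
    return {i: p / Z for i, p in unnorm.items()}

print(query_intelligence_given(l_obs=1, s_obs=1))  # strong letter, high SAT
```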

10 Bayesian Networks: Problem
- Bayesian nets use a propositional representation, but the real world has objects related to each other
- The template (Intelligence, Difficulty, Grade) must be duplicated into ground variables: Intell_Jane, Diffic_CS101, Grade_Jane_CS101; Intell_George, Diffic_Geo101, Grade_George_Geo101; Intell_George, Diffic_CS101, Grade_George_CS101
- These "instances" are not independent: e.g., George's A in one course and C in another both depend on Intell_George

11 Relational Schema
- Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
- Classes: Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Registration (Grade, Satisfaction)
- Relations: Teach (Professor, Course); Take / In (Student, Registration, Course)

12 St. Nordaf University
A possible world: Prof. Smith and Prof. Jones (Teaching-Ability) teach Welcome to CS101 and Welcome to Geo101 (Difficulty); George and Jane (Intelligence) are registered in these courses, with Grade and Satisfaction attributes on each registration.

13 Relational Bayesian Networks
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies; links define the potential interactions
- Template model over Student.Intelligence, Course.Difficulty, Reg.Grade (values A, B, C), Reg.Satisfaction, and Professor.Teaching-Ability
[K. & Pfeffer; Poole; Ngo & Haddawy]

14 RBN Semantics
Ground model:
- Variables: the attributes of all objects (the teaching ability of Prof. Smith and Prof. Jones, the difficulty of Welcome to CS101 and Welcome to Geo101, the intelligence of George and Jane, and the grade and satisfaction of each registration)
- Dependencies: determined by the relational links and the template model

15 The Web of Influence
Observing one student's grade (A vs. C) shifts beliefs about the course's difficulty (easy/hard), which in turn shifts beliefs about other students' intelligence (low/high): evidence flows through the shared relational structure.

16 Likelihood Function
- Likelihood of a BN with shared parameters
- The joint likelihood is a product of likelihood terms, one for each attribute X.A and its family
- For each X.A, the likelihood aggregates counts from all occurrences x.A in the world $\omega$: $L(\theta : \omega) = \prod_{X.A} \prod_{x \in X} P(x.A \mid \mathrm{Pa}(x.A))$ (outer product over attributes, inner product over objects)
[Friedman, Getoor, K., Pfeffer, 1999]

17 Likelihood Function: Multinomials
Sufficient statistics are aggregated counts: $C_{X.A}(a, u)$ = the number of objects $x$ of class $X$ with $x.A = a$ and parent assignment $u$.
Log-likelihood: $\ell(\theta : \omega) = \sum_{X.A} \sum_{a, u} C_{X.A}(a, u) \log \theta_{a \mid u}$
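A sketch of the aggregation on a toy ground world; the registration table and its field names are illustrative assumptions.

```python
import math
from collections import Counter

# Toy ground world: one row per Registration object, with the values of
# Reg.Grade and its parents (Course.Difficulty, Student.Intelligence).
registrations = [
    {"difficulty": "easy", "intelligence": "high", "grade": "A"},
    {"difficulty": "easy", "intelligence": "low",  "grade": "B"},
    {"difficulty": "hard", "intelligence": "high", "grade": "A"},
    {"difficulty": "hard", "intelligence": "low",  "grade": "C"},
    {"difficulty": "hard", "intelligence": "low",  "grade": "C"},
]

# Aggregated sufficient statistics C(a, u): counts of (grade, parent values)
# pooled over *all* Registration objects, because the CPD is shared.
counts = Counter((r["grade"], (r["difficulty"], r["intelligence"])) for r in registrations)

def log_likelihood(theta):
    """theta[u][a] = P(Grade = a | parents = u), shared across all objects."""
    return sum(n * math.log(theta[u][a]) for (a, u), n in counts.items())
```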

18 RBN Parameter Estimation
- MLE parameters: $\hat{\theta}_{a \mid u} = C_{X.A}(a, u) \,/\, \sum_{a'} C_{X.A}(a', u)$
- Bayesian estimation: a Dirichlet prior for each attribute X.A; the posterior uses the aggregated sufficient statistics
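Continuing the sketch, the MLE and a Dirichlet-smoothed Bayesian estimate from the same aggregated counts (the prior strength alpha is an arbitrary choice here):

```python
from collections import defaultdict

def estimate(counts, alpha=0.0):
    """MLE when alpha = 0; mean of the Dirichlet posterior when alpha > 0."""
    values = sorted({a for (a, _) in counts})
    by_parent = defaultdict(dict)
    for u in {u for (_, u) in counts}:
        total = sum(counts.get((a, u), 0) for a in values) + alpha * len(values)
        for a in values:
            by_parent[u][a] = (counts.get((a, u), 0) + alpha) / total
    return dict(by_parent)

theta_mle = estimate(counts)             # maximum likelihood
theta_bayes = estimate(counts, alpha=1.0)  # Dirichlet(1, ..., 1) prior
```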

19 Learning RBN Structure
- Define the set of legal RBN structures: those with legal class dependency graphs
- Define a scoring function (Bayesian score): a product of family scores, one for each X.A, using aggregated sufficient statistics
- Search for a high-scoring legal structure
[Friedman, Getoor, K., Pfeffer, 1999]

20 Learning RBN Structure
- All operations are done at the class level
- Dependency structure = parents for X.A; acyclicity is checked using the class dependency graph
- Scores are computed at the class level: individual objects contribute only to the sufficient statistics, which can be obtained efficiently using standard DB queries (see the sketch below)
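The "standard DB queries" here are GROUP BY aggregations; a sketch on the toy registration table above, using pandas (my choice of tool, not the talk's):

```python
import pandas as pd

df = pd.DataFrame(registrations)

# One GROUP BY per candidate family: counts of Grade for each setting of
# the candidate parents, pooled over all Registration objects.
suff_stats = (df.groupby(["difficulty", "intelligence", "grade"])
                .size()
                .rename("count")
                .reset_index())
print(suff_stats)
```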

21 Outline
- Relational Bayesian networks*
- Relational Markov networks†
- Collective classification
- Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor
† with Ben Taskar, Pieter Abbeel

22 Why Undirected Models?
- Symmetric, non-causal interactions. E.g., on the web, the categories of linked pages are correlated, but we cannot introduce directed edges because of cycles
- Patterns involving multiple entities, e.g., "triangle" patterns on the web, where directed edges are not appropriate
- The "solution" of imposing an arbitrary direction fails: it is not clear how to parameterize a CPD for variables involved in multiple interactions, and it is very difficult within a class-based parameterization
[Taskar, Abbeel, K. 2001]

23 Markov Networks
- A Markov network is an undirected graph over a set of variables V
- The graph is associated with a set of potentials $\phi_i$
- Each potential is a factor over a subset $V_i$; the variables in $V_i$ must form a (sub)clique in the network
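A minimal sketch of Markov network semantics: the joint distribution is the normalized product of the clique potentials. The graph and all potential values below are illustrative.

```python
from itertools import product

# Chain-shaped Markov network over three binary variables A - B - C,
# with one potential per edge (each edge is a clique). Values are made up.
phi_AB = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
phi_BC = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}

def unnormalized(a, b, c):
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Partition function and normalized joint, by enumeration.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))
P = {(a, b, c): unnormalized(a, b, c) / Z for a, b, c in product([0, 1], repeat=3)}
print(f"Z = {Z}, P(0,0,0) = {P[(0, 0, 0)]:.3f}")
```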

24 Markov Networks
Example: a Markov network over five people (Laura, Noah, Mary, James, Kyle), with the same template potential applied to every clique of connected people.

25 Relational Markov Networks
- Universals: probabilistic patterns hold for all groups of objects
- Locality: represent local probabilistic dependencies; sets of links give us the possible interactions
- Example template: for two students (Student 1, Student 2) in the same study group taking a course, a template potential relates Reg1.Grade and Reg2.Grade (with Student.Intelligence and Course.Difficulty attributes on the participating objects)

26 RMN Semantics
Ground model: George, Jane, and Jill take Welcome to CS101 and Welcome to Geo101; the template potential is instantiated once for each pair of registrations in the same study group (the Geo study group and the CS study group), tying the grades of group-mates together.

27 Outline
- Relational Bayesian networks
- Relational Markov networks
- Collective classification*: discriminative training, web page classification, link prediction
- Relational clustering
* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel

28 Collective Classification
- Model structure: a probabilistic relational model over Course, Student, Reg; learning from training data, inference on new data to reach conclusions
- Example: train on one year of student intelligence, course difficulty, and grades; given only the grades in the following year, predict all students' intelligence
- Training data: features $\delta.x$ and labels $\delta.y^*$; new data: features $\delta'.x$, predict labels $\delta'.y$

29 Learning RMN Parameters
- Same study-group template as before: a potential over (Reg1.Grade, Reg2.Grade) for students in the same study group, with Student.Intelligence and Course.Difficulty attributes
- Parameterize the potentials as a log-linear model: $\phi(y_1, y_2) = \exp(w \cdot f(y_1, y_2))$
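A sketch of a log-linear template potential; the single "same grade" feature is an illustrative choice, and real models use richer feature vectors.

```python
import math

GRADES = ["A", "B", "C"]

def features(g1, g2):
    # One illustrative template feature: do two study-group-mates get the
    # same grade?
    return [1.0 if g1 == g2 else 0.0]

def potential(w, g1, g2):
    # Log-linear parameterization: phi(y1, y2) = exp(w . f(y1, y2)).
    return math.exp(sum(wi * fi for wi, fi in zip(w, features(g1, g2))))

w = [0.7]  # weights to be learned; the value here is arbitrary
print(potential(w, "A", "A"), potential(w, "A", "C"))
```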

30 Max Likelihood Estimation
Estimation: $\max_w \; \log P_w(\delta.x, \delta.y^*)$
Classification: $\arg\max_y P_w(y \mid \delta'.x)$
Problem: we don't care about the joint distribution $P(\delta.x, \delta.y)$

31 Web → KB
Example (WebKB): Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), connected by Advisor-of, Project-of, and Member links.
[Craven et al.]

32 Web Classification Experiments
- WebKB dataset: four CS department websites; bag of words on each page; links between pages; anchor text for links
- Experimental setup: trained on three universities, tested on the fourth; repeated for all four combinations

33 Standard Classification
Predict each page's Category from its words (Word_1, ..., Word_N). Categories: faculty, course, project, student, other. Example cues: "professor", "department", "extract information", "computer science", "machine learning".

34 Standard Classification
- Add features from incoming links: link words LinkWord_1, ..., LinkWord_N (e.g., "working with Tom Mitchell")
- 4-fold CV: trained on 3 universities, tested on the 4th
- A discriminatively trained naïve Markov model = logistic regression
(Bar chart: test-set error of the Logistic baseline, axis 0 to 0.18.)

35 Power of Context
Professor? Student? Post-doc?

36 Collective Classification
Linked pages are modeled jointly: each page has a Category and words (Word_1, ..., Word_N), and each link contributes a compatibility potential $\phi$(From.Category, To.Category).

37 Collective Classification
- Classify all pages collectively, maximizing the joint label probability
- (Bar chart, test-set error 0 to 0.18: the Links model improves on Logistic.)
[Taskar, Abbeel, K., 2002]

38 More Complex Structure

39 (Diagram: each page's category C and words W_1, ..., W_n, with section variables S tying together pages listed under Faculty, Students, and Courses.)

40 Collective Classification: Results
(Bar chart, test-set error 0 to 0.18, for the Logistic, Links, Section, and Link+Section models.) 35.4% error reduction over logistic.
[Taskar, Abbeel, K., 2002]

41 Max Conditional Likelihood
Estimation: $\max_w \; \log P_w(\delta.y^* \mid \delta.x)$
Classification: $\arg\max_y P_w(y \mid \delta'.x)$
Problem: we don't care about the conditional distribution $P(\delta.y \mid \delta.x)$ either
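A sketch of conditional-likelihood training on a single tiny clique, reusing the log-linear pair potential above and enumerating the label space (feasible only because it is tiny); the one observed training pair is made up.

```python
from itertools import product

def cond_log_lik_and_grad(w, observed=("A", "A")):
    # P_w(y1, y2) over all grade pairs for one study-group clique, plus
    # the gradient d/dw log P_w(observed) = f(observed) - E_w[f].
    scores = {(g1, g2): potential(w, g1, g2) for g1, g2 in product(GRADES, repeat=2)}
    Z = sum(scores.values())
    probs = {y: s / Z for y, s in scores.items()}
    expected = [sum(p * features(*y)[k] for y, p in probs.items())
                for k in range(len(w))]
    grad = [o - e for o, e in zip(features(*observed), expected)]
    return math.log(probs[observed]), grad

# One gradient-ascent step on the conditional log-likelihood:
ll, grad = cond_log_lik_and_grad(w)
w = [wi + 0.1 * gi for wi, gi in zip(w, grad)]
```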

42 Max Margin Estimation
What we really want: correct class labels.
Estimation: $\max_{\gamma,\ \|w\| = 1} \gamma$ subject to $w^\top f(\delta.x, \delta.y^*) \ge w^\top f(\delta.x, y) + \gamma \, \Delta(\delta.y^*, y)$ for all $y$, where $\Delta(\delta.y^*, y)$ is the number of labeling mistakes in $y$
Classification: $\arg\max_y w^\top f(\delta'.x, y)$
A quadratic program with exponentially many constraints.
[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])

43 Max Margin Markov Networks
- We use the structure of the Markov network to provide an equivalent formulation of the QP, exponential only in the tree width of the network
- Complexity = max-likelihood classification; can solve approximately in networks where the induced width is too large (analogous to loopy belief propagation)
- Can use kernel-based features: SVMs meet graphical models
[Taskar, Guestrin, K., 2003]

44 WebKB Revisited
16.1% relative reduction in error compared to conditional-likelihood RMNs

45 Predicting Relationships
Even more interesting: relationships between objects. E.g., Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), with Advisor-of and Member links to predict.

46 Predicting Relations
- Introduce an exists/type attribute for each potential link
- Learn a discriminative model for this attribute; collectively predict its value in the new world
- Model: the Exists/Type variable of a candidate relation depends on the From page and To page (words and Category) and on the link words (LinkWord_1, ..., LinkWord_N)
- 72.9% error reduction over flat models
[Taskar, Wong, Abbeel, K., 2003]

47 Outline
- Relational Bayesian networks
- Relational Markov networks
- Collective classification
- Relational clustering: movie data*, biological data†
* with Ben Taskar, Eran Segal
† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe'er, Haidong Wang, Micha Shapira, David Botstein

48 Relational Clustering
- Model structure: a probabilistic relational model over Course, Student, Reg
- Learning from unlabeled relational data: clustering of instances
- Example: given only students' grades, cluster similar students

49 Learning w. Missing Data: EM
- The EM algorithm applies essentially unchanged (see the sketch below)
- E-step: computes expected sufficient statistics, aggregated over all objects in a class
- M-step: uses ML (or MAP) parameter estimation
- Key difference: in general, the hidden variables are not independent, so computing the expected sufficient statistics requires inference over the entire network
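A minimal EM sketch in the spirit of relational clustering: students with hidden cluster assignments and observed grades, where the E-step pools expected counts over all students because the parameters are shared. The data and the independence of students here are simplifying assumptions (the talk's point is precisely that, in general, the hidden variables are coupled and the E-step needs joint inference).

```python
import math, random
from collections import defaultdict

random.seed(0)
GRADES = ["A", "B", "C"]
# Made-up data: each student's list of grades across courses.
students = [["A", "A", "B"], ["C", "B", "C"], ["A", "B", "A"], ["C", "C", "B"]]
K = 2  # number of hidden clusters

# Random initialization of P(cluster) and P(grade | cluster).
pi = [1.0 / K] * K
theta = []
for _ in range(K):
    raw = {g: random.random() + 0.1 for g in GRADES}
    z = sum(raw.values())
    theta.append({g: v / z for g, v in raw.items()})

for it in range(50):
    # E-step: posterior over each student's hidden cluster, then expected
    # sufficient statistics aggregated over all students in the class.
    exp_pi = [0.0] * K
    exp_counts = [defaultdict(float) for _ in range(K)]
    for grades in students:
        post = [pi[k] * math.prod(theta[k][g] for g in grades) for k in range(K)]
        z = sum(post)
        post = [p / z for p in post]
        for k in range(K):
            exp_pi[k] += post[k]
            for g in grades:
                exp_counts[k][g] += post[k]
    # M-step: ML estimation from the aggregated expected counts.
    pi = [c / len(students) for c in exp_pi]
    theta = [{g: exp_counts[k][g] / sum(exp_counts[k].values()) for g in GRADES}
             for k in range(K)]

print(pi)
```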

50 Learning w. Missing Data: EM
Illustration: EM alternates between inferring the hidden course difficulties (easy/hard) and student intelligences (low/high) and re-estimating the shared CPD P(Registration.Grade | Course.Difficulty, Student.Intelligence) over grades A/B/C, for all courses and students.
[Dempster et al. 77]

51 Movie Data Internet Movie Database http://www.imdb.com

52 Discovering Hidden Types
Model: Movie (Genres, Rating, Year, #Votes, MPAA Rating), Actor, and Director, each with a hidden Type attribute. Learn the model using EM.
[Taskar, Segal, K., 2001]

53 Discovering Hidden Types
Directors: Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher / Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors: Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman / Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, ...
Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, ... / Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October
[Taskar, Segal, K., 2001]

54 Biology 101: Gene Expression
- DNA (a gene's coding and control regions) is transcribed to RNA and translated to protein; e.g., the transcription factor Swi5 binds a gene's control region to regulate its expression
- Cells express different subsets of their genes in different tissues and under different conditions

55 Gene Expression Microarrays
- Measure the mRNA level of all genes in one condition
- Hundreds of experiments; highly noisy
- The result is a genes x experiments matrix (induced / repressed): entry (i, j) is the expression of gene i in experiment j

56 Standard Analysis
- Cluster genes by similarity of their expression profiles
- Manually examine the clusters to understand what is common to the genes in each cluster

57 General Approach
- Expression level is a function of gene properties and experiment properties: the expression level of Gene i in Experiment j depends on the attributes of both
- Observed properties: gene sequence, array condition, ...; hidden properties: gene cluster (e.g., module assignment)
- Learn the model that best explains the data

58 Clustering as a PRM
A naïve Bayes PRM: each gene's hidden cluster g.C determines its expression levels g.E_1, ..., g.E_k through per-experiment CPDs P(E_i.L | g.C).

59 Modular Regulation
- Learn functional modules: clusters of genes that are similarly controlled
- Learn a control program for each module: expression as a function of control genes, represented as a decision tree over regulator activity (e.g., a tree that first tests HAP4, then CMK1)

60 Module Network PRM
The expression Level of a gene depends on its module (Cluster) and on the activity levels of control genes in the experiment (Control_1, Control_2, ..., Control_k); each cluster has its own decision tree over regulator activity (regulators shown include HAP4, CMK1, BMH1, Yer184c, GIC2, USV1, FAR1, APG1).
[Segal, Regev, Pe'er, Koller, Friedman, 2003]

61 Experimental Results
- Yeast stress data (Gasch et al.): 2355 genes that showed activity; 173 experiments (microarrays) covering diverse environmental stress conditions (e.g., heat shock)
- Learned a module network with 50 modules: cluster assignments are hidden variables, and the structure of the dependency trees is unknown; the model was learned using the structural EM algorithm
[Segal et al., Nature Genetics, 2003]

62 Biological Evaluation
- Find sets of co-regulated genes (regulatory modules): 46/50
- Find the regulators of each module: 30/50
[Segal et al., Nature Genetics, 2003]

63 Experimental Results
- Hypothesis: regulator 'X' regulates process 'Y'
- Experiment: knock out 'X' and rerun the microarray experiments, then check the predictions
[Segal et al., Nature Genetics, 2003]

64 Differentially Expressed Genes
Knockout time courses vs. wild type: ΔYpl230w (0-24 hrs.), 341 differentially expressed genes at >16x; ΔPpt1 (0-60 min.), 602 genes at >4x; ΔKin82 (0-60 min.), 281 genes at >4x.
[Segal et al., Nature Genetics, 2003]

65 Biological Experiments Validation
Were the differentially expressed genes predicted as targets? Rank modules by enrichment for differentially expressed genes. All regulators regulate predicted modules.

ΔPpt1:
  #   Module                                Significance
  14  Ribosomal and phosphate metabolism    8/32, 9e-3
  11  Amino acid and purine metabolism      11/53, 1e-2
  15  mRNA, rRNA and tRNA processing        9/43, 2e-2
  39  Protein folding                       6/23, 2e-2
  30  Cell cycle                            7/30, 2e-2

ΔYpl230w:
  #   Module                                Significance
  39  Protein folding                       7/23, 1e-4
  29  Cell differentiation                  6/41, 2e-2
  5   Glycolysis and folding                5/37, 4e-2
  34  Mitochondrial and protein fate        5/37, 4e-2

ΔKin82:
  #   Module                                Significance
  3   Energy and osmotic stress I           8/31, 1e-4
  2   Energy, osmolarity & cAMP signaling   9/64, 6e-3
  15  mRNA, rRNA and tRNA processing        6/43, 2e-2

[Segal et al., Nature Genetics, 2003]

66 Biology 102: Pathways
Pathways are sets of genes that act together to achieve a common function.

67 Finding Pathways: Attempt I Use protein-protein interaction data

69 Finding Pathways: Attempt I
- Use protein-protein interaction data
- Problems: the data is very noisy, and structure is lost: there is one large connected component in the interaction graph (3527/3589 genes)

70 Finding Pathways: Attempt II
- Use expression microarray clusters (e.g., Pathway I, Pathway II)
- Problems: expression is only a 'weak' indicator of interaction, and interacting pathways are not separable

71 Finding Pathways: Our Approach
- Use both types of data to find pathways (e.g., Pathways I-IV)
- Find "active" interactions using gene expression
- Find pathway-related co-expression using interactions
[Segal, Wang, K., 2003]

72 Probabilistic Model
- Each gene g has a pathway assignment g.C and expression levels in N arrays (Exp_1, ..., Exp_N)
- Each protein-product interaction between genes g1 and g2 contributes a compatibility potential $\phi$(g1.C, g2.C) favoring interacting genes in the same pathway
- Cluster all genes collectively, maximizing the joint model likelihood (see the sketch below)
[Segal, Wang, K., 2003]
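A sketch of how the two data sources combine in one objective; the Gaussian expression term, the diagonal compatibility potential, and all numbers are illustrative assumptions, not the paper's exact model.

```python
K = 3  # number of pathways

def log_expression_term(gene_expr, c, means):
    # Illustrative per-gene expression likelihood: Gaussian around a
    # pathway-specific mean in each array (unit variance assumed).
    return sum(-0.5 * (e - means[c]) ** 2 for e in gene_expr)

def log_compat(c1, c2, bonus=1.0):
    # Diagonal compatibility: interacting genes prefer the same pathway.
    return bonus if c1 == c2 else 0.0

def joint_score(assign, expr, interactions, means):
    score = sum(log_expression_term(expr[g], assign[g], means) for g in assign)
    score += sum(log_compat(assign[g1], assign[g2]) for g1, g2 in interactions)
    return score

# Toy instance: two interacting genes plus one unrelated gene.
expr = {"g1": [2.0, 1.8], "g2": [1.9, 2.1], "g3": [-1.0, -0.9]}
interactions = [("g1", "g2")]
means = [2.0, 0.0, -1.0]
assign = {"g1": 0, "g2": 0, "g3": 2}
print(joint_score(assign, expr, interactions, means))
```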

73 Capturing Protein Complexes
- Evaluated on an independent data set of interacting proteins
- (Plot: number of complexes vs. complex coverage (%), our method vs. standard expression clustering.)
- 124 complexes covered at 50% by our method; 46 complexes covered at 50% by clustering
[Segal, Wang, K., 2003]

74 RNAse Complex Pathway
Discovered pathway: YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4.
- Includes all 10 known pathway genes
- Only 5 genes were found by clustering
[Segal, Wang, K., 2003]

75 Interaction Clustering
The RNAse complex was found by interaction clustering only as part of a cluster with 138 genes.
[Segal, Wang, K., 2003]

76 Truth in Advertising
- Huge graphical models: 3,000-50,000 hidden variables; hundreds of thousands of observed nodes; very densely connected
- Learning requires multiple iterations of model updates, each of which requires running inference on the model
- Inference: exact inference is intractable, so we use belief propagation; a single inference iteration takes 1-6 hours
- Algorithmic ideas are key to scaling (a minimal sketch of belief propagation follows)
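A minimal sum-product (loopy) belief propagation sketch for a pairwise MRF; the graph and potentials are tiny and illustrative, and real systems at the scale above need far more engineering (message scheduling, damping, caching).

```python
import math
from itertools import product

# Pairwise MRF: variables with unary potentials, edges with pairwise potentials.
STATES = [0, 1]
unary = {"a": [1.0, 2.0], "b": [1.0, 1.0], "c": [3.0, 1.0]}
edges = [("a", "b"), ("b", "c"), ("a", "c")]  # contains a loop
pair = {e: {(s, t): (2.0 if s == t else 1.0) for s, t in product(STATES, STATES)}
        for e in edges}

# Directed messages m[(i, j)][x_j]: initialized uniform, updated synchronously.
msgs = {(i, j): [1.0, 1.0] for e in edges for i, j in (e, e[::-1])}

def neighbors(i):
    return [j for e in edges for a, j in (e, e[::-1]) if a == i]

for _ in range(30):  # fixed number of loopy iterations
    new = {}
    for (i, j) in msgs:
        e = (i, j) if (i, j) in pair else (j, i)
        out = []
        for xj in STATES:
            total = 0.0
            for xi in STATES:
                p = pair[e][(xi, xj)] if e == (i, j) else pair[e][(xj, xi)]
                incoming = 1.0
                for k in neighbors(i):
                    if k != j:
                        incoming *= msgs[(k, i)][xi]
                total += unary[i][xi] * p * incoming
            out.append(total)
        z = sum(out)
        new[(i, j)] = [v / z for v in out]  # normalize for numerical stability
    msgs = new

# Approximate marginals (beliefs): unary potential times all incoming messages.
for i in unary:
    b = [unary[i][xi] * math.prod(msgs[(k, i)][xi] for k in neighbors(i))
         for xi in STATES]
    z = sum(b)
    print(i, [v / z for v in b])
```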

77 Relational Data: A New Challenge
- Data consists of different types of instances; instances are related in complex networks; instances are not independent
- New tasks for machine learning: collective classification, relational clustering, link prediction, group detection
- This is an opportunity

78 http://robotics.stanford.edu/~koller/

