Probabilistic Models of Relational Data
Daphne Koller, Stanford University
Joint work with: Lise Getoor, Ming-Fai Wong, Eran Segal, Avi Pfeffer, Pieter Abbeel, Nir Friedman, Ben Taskar

Why Relational?
The real world is composed of objects that have properties and are related to each other.
Natural language is all about objects and how they relate to each other:
“George got an A in Geography 101”

Attribute-Based Worlds
“Smart students get A’s in easy classes”:
Smart_Jane & easy_CS101 ⇒ GetA_Jane_CS101
Smart_Mike & easy_Geo101 ⇒ GetA_Mike_Geo101
Smart_Jane & easy_Geo101 ⇒ GetA_Jane_Geo101
Smart_Rick & easy_CS221 ⇒ GetA_Rick_CS221
World = assignment of values to attributes / truth values to propositional symbols

Object-Relational Worlds
World = relational interpretation:
Objects in the domain
Properties of these objects
Relations (links) between objects
∀x,y (Smart(x) & Easy(y) & Take(x,y) ⇒ Grade(A,x,y))

Why Probabilities?
All universals are false (almost):
“Smart students get A’s in easy classes”
True universals are rarely useful:
“Smart students get either A, B, C, D, or F”
“The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities …” — James Clerk Maxwell

Probable Worlds
Probabilistic semantics:
A set of possible worlds
Each world associated with a probability
[Figure: the twelve possible worlds over course difficulty (easy/hard), student intelligence (smart/weak), and grade (A/B/C)]

Representation: Design Axes
Two axes: what the world contains (attributes / objects / sequences) × categorical vs. probabilistic (world state vs. epistemic state):
Categorical — Attributes: propositional logic, CSPs; Objects: first-order logic, relational databases; Sequences: automata, grammars
Probabilistic — Attributes: Bayesian nets, Markov nets; Sequences: n-gram models, HMMs, prob. CFGs
(The probabilistic–objects cell is the subject of this talk.)

Outline
Bayesian Networks: representation & semantics; reasoning
Probabilistic Relational Models
Collective Classification
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP

Bayesian Networks
Nodes = variables; edges = direct influence
Graph structure encodes independence assumptions:
Letter is conditionally independent of Intelligence given Grade
[Figure: BN over Difficulty, Intelligence, Grade, SAT, Letter, with CPD P(G|D,I)]

BN Semantics
Conditional independencies in BN structure + local probability models = full joint distribution over the domain:
P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)
Compact & natural representation:
Nodes have ≤ k parents ⇒ n·2^k vs. 2^n parameters
Parameters natural and easy to elicit
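To make the factored semantics concrete, here is a minimal runnable sketch of the student network; the CPD numbers are invented for illustration and are not from the talk:

```python
# Minimal sketch of BN factored semantics for the student network.
# All CPD values are made-up illustration numbers, not from the talk.

P_D = {"easy": 0.6, "hard": 0.4}                      # P(Difficulty)
P_I = {"weak": 0.7, "smart": 0.3}                     # P(Intelligence)
P_G = {                                               # P(Grade | D, I)
    ("easy", "weak"):  {"A": 0.3, "B": 0.4, "C": 0.3},
    ("easy", "smart"): {"A": 0.9, "B": 0.08, "C": 0.02},
    ("hard", "weak"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "smart"): {"A": 0.5, "B": 0.3, "C": 0.2},
}
P_S = {"weak": {"low": 0.95, "high": 0.05},           # P(SAT | I)
       "smart": {"low": 0.2, "high": 0.8}}
P_L = {"A": {"yes": 0.9, "no": 0.1},                  # P(Letter | G)
       "B": {"yes": 0.6, "no": 0.4},
       "C": {"yes": 0.01, "no": 0.99}}

def joint(d, i, g, s, l):
    """P(D,I,G,S,L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G)."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]

print(joint("easy", "smart", "A", "high", "yes"))
```

Any entry of the full joint over the five variables is a product of five local factors, which is exactly what makes the representation compact.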

Reasoning Using BNs
Full joint distribution specifies the answer to any query: P(variable | evidence about others)
[Figure: querying the network given evidence on Letter and SAT]
“Probability theory is nothing but common sense reduced to calculation.” — Pierre Simon Laplace

BN Inference
BN inference is NP-hard, but we can use graph structure:
Graph separation ⇒ conditional independence
Do separate inference in the parts; results are combined over the interface
Complexity: exponential in the largest separator
Structured BNs allow effective inference; exact inference in dense BNs is intractable
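Building on the joint() sketch above, brute-force query answering looks like this; it is feasible only because the network is tiny, whereas structured exact algorithms (e.g., variable elimination) exploit the separators just described:

```python
from itertools import product

# Brute-force query on the tiny student network above:
# P(Intelligence | Letter = "yes", SAT = "high").

def query_intelligence(l_obs, s_obs):
    scores = {}
    for i in ["weak", "smart"]:
        # Sum out the hidden variables D and G, clamping the evidence L, S.
        scores[i] = sum(
            joint(d, i, g, s_obs, l_obs)
            for d, g in product(["easy", "hard"], ["A", "B", "C"])
        )
    z = sum(scores.values())          # normalize by P(evidence)
    return {i: p / z for i, p in scores.items()}

print(query_intelligence("yes", "high"))
```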

Approximate BN Inference
Belief propagation is an iterative message-passing algorithm for approximate inference in BNs.
Each iteration (until “convergence”): nodes pass “beliefs” as messages to neighboring nodes
Cons: limited theoretical guarantees; might not converge
Pros: linear time per iteration; works very well in practice, even for dense networks
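A minimal sketch of loopy belief propagation on a pairwise model; the triangle graph, unary potentials, and agreement potentials are hypothetical illustrations (the sum-product update is the standard one, not anything specific to the talk’s networks):

```python
import numpy as np

# Loopy belief propagation on a hypothetical pairwise model over
# three binary variables arranged in a loop (triangle).
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
unary = {n: np.array([0.6, 0.4]) for n in nodes}               # node potentials
pair = {e: np.array([[1.0, 0.5], [0.5, 1.0]]) for e in edges}  # favor agreement

msgs = {}                          # msgs[(i, j)] = message from i to j
for i, j in edges:
    msgs[(i, j)] = np.ones(2)
    msgs[(j, i)] = np.ones(2)

def neighbors(n):
    return [j for i, j in msgs if i == n]

for _ in range(50):                # fixed iteration budget
    new = {}
    for (i, j) in msgs:
        psi = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
        incoming = unary[i].copy()
        for k in neighbors(i):
            if k != j:
                incoming *= msgs[(k, i)]
        m = incoming @ psi         # sum-product update over x_i
        new[(i, j)] = m / m.sum()  # normalize for numerical stability
    msgs = new

for n in nodes:                    # final (approximate) beliefs
    b = unary[n].copy()
    for k in neighbors(n):
        b *= msgs[(k, n)]
    print(n, b / b.sum())
```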

Outline
Bayesian Networks
Probabilistic Relational Models: language & semantics; web of influence
Collective Classification
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP

Bayesian Networks: Problem
Bayesian nets use a propositional representation; the real world has objects, related to each other
[Figure: the Intelligence/Difficulty/Grade fragment replicated for (Jane, CS101), (George, Geo101), and (George, CS101)]
These “instances” are not independent

Probabilistic Relational Models
Combine advantages of relational logic & BNs:
Natural domain modeling: objects, properties, relations
Generalization over a variety of situations
Compact, natural probability models
Integrate uncertainty with the relational model:
Properties of domain entities can depend on properties of related entities
Uncertainty over the relational structure of the domain

St. Nordaf University
[Figure: an example world — Prof. Smith and Prof. Jones (Teaching-Ability) teach “Welcome to CS101” and “Welcome to Geo101” (Difficulty); George and Jane are registered in the courses, with Grade and Satisfaction on each registration and Intelligence on each student]

Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
Classes & attributes: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction)
Relations: Teach (Professor–Course); Take / In (Student–Registration–Course)
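One way to render this schema in code is sketched below; the class and attribute names follow the slide, but the plain-dataclass encoding is mine, not the PRM formalism itself:

```python
from dataclasses import dataclass

# A direct rendering of the relational schema from the slide.

@dataclass
class Professor:
    name: str
    teaching_ability: str = "?"        # attribute, e.g. low/high

@dataclass
class Course:
    name: str
    instructor: Professor              # Teach relation
    difficulty: str = "?"              # easy/hard

@dataclass
class Student:
    name: str
    intelligence: str = "?"            # weak/smart

@dataclass
class Registration:                    # Take / In relations
    student: Student
    course: Course
    grade: str = "?"                   # A/B/C
    satisfaction: str = "?"
```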

Probabilistic Relational Models [K. & Pfeffer; Poole; Ngo & Haddawy]
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions
[Figure: class-level dependency model over Professor.Teaching-Ability, Course.Difficulty, Student.Intelligence, Reg.Grade, and Reg.Satisfaction, with a CPD for Grade]

PRM Semantics
Instantiated PRM ⇒ BN:
variables: attributes of all objects
dependencies: determined by links & PRM
[Figure: the St. Nordaf world unrolled into a ground BN over Prof. Smith, Prof. Jones, CS101, Geo101, George, and Jane]
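The instantiation step can be sketched directly, reusing the schema classes above: a hypothetical mini-world is unrolled into one ground variable per (object, attribute) and one dependency per link implied by the class-level model:

```python
# Sketch: unrolling a PRM into a ground BN for a concrete world.
smith = Professor("Smith"); jones = Professor("Jones")
cs101 = Course("CS101", smith); geo101 = Course("Geo101", jones)
george = Student("George"); jane = Student("Jane")
regs = [Registration(jane, cs101), Registration(george, cs101),
        Registration(george, geo101)]

nodes, edges = [], []
for c in (cs101, geo101):
    nodes.append((c.name, "Difficulty"))
for s in (george, jane):
    nodes.append((s.name, "Intelligence"))
for r in regs:
    g = (f"{r.student.name}-{r.course.name}", "Grade")
    nodes.append(g)
    # Grade depends on the registration's course and student attributes:
    edges.append(((r.course.name, "Difficulty"), g))
    edges.append(((r.student.name, "Intelligence"), g))

print(len(nodes), "ground variables;", len(edges), "dependencies")
```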

The Web of Influence
[Figure: observed grades (A, C) in CS101 and Geo101 shift beliefs about course difficulty (easy/hard) and student intelligence (low/high) across the whole ground network]

Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering: learning models from data; collective classification of webpages
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP

Learning PRMs [Friedman, Getoor, K., Pfeffer]
[Figure: a learner takes a relational database (Course, Student, Reg tables) plus expert knowledge and produces a PRM]

Learning PRMs
Parameter estimation:
Probabilistic model with shared parameters (grades for all students share the same model)
Can use standard techniques for max-likelihood or Bayesian parameter estimation
Structure learning:
Define a scoring function over structures
Use combinatorial search to find a high-scoring structure
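Shared parameters make the estimation step almost trivial; here is a sketch with hypothetical data, where every registration pools into the counts for the single shared Grade CPD:

```python
from collections import Counter

# Sketch: shared-parameter maximum likelihood for the PRM's Grade CPD.
# Every student's grade is governed by the same P(Grade | D, I), so all
# registrations pool into one table of counts. Hypothetical data below.

data = [  # (course difficulty, student intelligence, grade)
    ("easy", "smart", "A"), ("easy", "smart", "A"), ("easy", "weak", "B"),
    ("hard", "weak", "C"), ("hard", "smart", "B"), ("hard", "smart", "A"),
]

counts = Counter(data)
parent_counts = Counter((d, i) for d, i, _ in data)

def mle_grade_cpd():
    cpd = {}
    for (d, i, g), n in counts.items():
        cpd.setdefault((d, i), {})[g] = n / parent_counts[(d, i)]
    return cpd

print(mle_grade_cpd())
```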

Web ⇒ KB [Craven et al.]
[Figure: extracting a knowledge base from the web — Tom Mitchell (Professor), Sean Slattery (Student), WebKB Project, with Advisor-of, Project-of, and Member links]

Web Classification Experiments
WebKB dataset: four CS department websites; bag of words on each page; links between pages; anchor text for links
Experimental setup: trained on three universities, tested on the fourth; repeated for all four combinations

Standard Classification
Naïve Bayes over page words only
Categories: faculty, course, project, student, other
[Figure: naïve Bayes model — Category with Word 1 … Word N children; example page text: “Professor … department … extract information … computer science … machine learning …”]
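A minimal naïve Bayes page classifier over bags of words, with a tiny hypothetical training set standing in for WebKB (Laplace smoothing added so unseen words do not zero out a class):

```python
import math
from collections import Counter, defaultdict

train = [  # hypothetical (category, bag-of-words) training pages
    ("faculty", "professor department machine learning".split()),
    ("course",  "syllabus homework lecture computer science".split()),
    ("student", "student advisor research homepage".split()),
]

prior = Counter(c for c, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for c, words in train:
    word_counts[c].update(words)
    vocab.update(words)

def classify(words):
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c] / len(train))
        total = sum(word_counts[c].values())
        for w in words:
            # Laplace smoothing over the vocabulary
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(classify("machine learning professor".split()))
```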

Exploiting Links
Add the anchor text of incoming links as features (e.g., “working with Tom Mitchell …”)
[Figure: model extended with LinkWord 1 … LinkWord N; results compare “words only” vs. “link words”]

Collective Classification [Getoor, Segal, Taskar, Koller]
Classify all pages collectively, maximizing the joint label probability
Approximate inference: belief propagation
[Figure: linked-page model — From-Page.Category and To-Page.Category connected through Link.Exists; results compare “words only”, “link words”, and “collective”]

Learning with Missing Data: EM [Dempster et al. 77]
Learn P(Registration.Grade | Course.Difficulty, Student.Intelligence) when course difficulty (easy/hard) and student intelligence (low/high) are unobserved
[Figure: EM alternates between inferring hidden course and student attributes and re-estimating the shared grade CPD]
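To show the E-step/M-step alternation in its simplest form, here is EM on a stand-in problem (a mixture of two biased coins) rather than the full PRM; the structure — infer hidden quantities, then re-estimate shared parameters from expected counts — is the same:

```python
import math

# EM sketch on a stand-in problem: a mixture of two biased coins.
# Each row is the number of heads in 10 flips of one (unknown) coin.

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

heads = [9, 8, 2, 1, 7, 3]      # hypothetical data, 10 flips per row
n = 10
pA, pB = 0.6, 0.4               # initial parameter guesses

for _ in range(100):
    # E-step: posterior responsibility of coin A for each row
    resp = []
    for k in heads:
        a, b = binom_pmf(k, n, pA), binom_pmf(k, n, pB)
        resp.append(a / (a + b))
    # M-step: re-estimate the biases from expected counts
    pA = sum(r * k for r, k in zip(resp, heads)) / (n * sum(resp))
    pB = sum((1 - r) * k for r, k in zip(resp, heads)) / (n * sum(1 - r for r in resp))

print(round(pA, 3), round(pB, 3))
```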

Discovering Hidden Types
Internet Movie Database

Discovering Hidden Types [Taskar, Segal, Koller]
[Figure: Movie (Genres, Rating, Year, #Votes, MPAA Rating), Actor, and Director classes, each with a hidden Type attribute]

Discovering Hidden Types: Example Clusters
Directors: Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher / Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors: Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman / Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, …
Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, … / Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October

Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models: Markov networks; relational Markov networks
Collective Classification Revisited
PRMs for NLP

Directed Models: Limitations
Acyclicity constraint limits expressive power (e.g., two objects linked to by a student are probably not both professors)
Acyclicity forces modeling of all potential links: network size O(N²), inference is quadratic
Generative training: trained to fit all of the data, not to maximize accuracy
Solution: undirected models [Lafferty, McCallum, Pereira]
Allow arbitrary patterns over sets of objects & links
Influence flows over existing links, exploiting link-graph sparsity: network size O(N)
Allow discriminative training: maximize P(labels | observations)

Markov Networks
Graph structure encodes independence assumptions:
Chris conditionally independent of Eve given Alice & Dave
[Figure: network over Alice, Betty, Chris, Dave, and Eve, with a compatibility potential ψ(A, B, C)]
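Written out, the semantics is the standard Markov network definition (this formula is supplied here, not slide text): the joint is a normalized product of clique compatibility potentials.

```latex
P(x_1,\dots,x_n) \;=\; \frac{1}{Z} \prod_{c \,\in\, \text{cliques}} \psi_c(\mathbf{x}_c),
\qquad
Z \;=\; \sum_{x_1,\dots,x_n} \prod_{c} \psi_c(\mathbf{x}_c)
```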

Relational Markov Networks [Taskar, Abbeel, Koller ’02]
Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions
[Figure: template potential over the grades of two registrations (Reg.Grade, Reg2.Grade) whose students share a study group]
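A sketch of how one template potential grounds out over a hypothetical world: a single shared table φ(grade₁, grade₂) is instantiated over every pair of registrations whose students share a study group:

```python
from itertools import combinations

# Sketch: grounding an RMN template potential. Study groups and the
# potential values are hypothetical illustrations.

study_groups = {
    "geo-group": ["jane-Geo101", "george-Geo101"],
    "cs-group":  ["jane-CS101", "george-CS101", "jill-CS101"],
}

# Template: students who study together tend to earn similar grades.
phi = {("A", "A"): 2.0, ("A", "B"): 1.0, ("B", "A"): 1.0, ("B", "B"): 2.0}

def ground_potentials():
    """One potential instance per pair of co-group registrations,
    all sharing the same parameters phi."""
    instances = []
    for regs in study_groups.values():
        for r1, r2 in combinations(regs, 2):
            instances.append((r1, r2, phi))
    return instances

for r1, r2, _ in ground_potentials():
    print(r1, "--", r2)
```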

RMN Semantics
Instantiated RMN ⇒ MN:
variables: attributes of all objects
dependencies: determined by links & RMN
[Figure: ground Markov network over George, Jane, and Jill’s grades, linked through the CS and Geo study groups]

Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited: discriminative training of RMNs; webpage classification; link prediction
PRMs for NLP

Learning RMNs
Parameter estimation is not closed form
Convex problem ⇒ unique global maximum
Maximize L = log P(Grades, Intelligence | Difficulty)
[Figure: gradient computation for the template potential ψ(Reg1.Grade, Reg2.Grade)]
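The gradient of a log-linear model’s conditional log-likelihood is the familiar “empirical minus expected feature counts”, and concavity is what gives the unique global maximum noted above. A tiny self-contained sketch of gradient ascent (random hypothetical features, one training example):

```python
import numpy as np

# Sketch: gradient ascent for a log-linear (conditional) model.
# gradient = empirical feature counts - model's expected feature counts.
# Hypothetical setup: 3 labels, 4 features per (x, y) pair, one example.

rng = np.random.default_rng(0)
num_labels, num_feats = 3, 4
feats = rng.normal(size=(num_labels, num_feats))   # f(x, y) for one fixed x
y_obs = 1                                          # observed label
w = np.zeros(num_feats)

for step in range(200):
    scores = feats @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # P(y | x; w)
    grad = feats[y_obs] - p @ feats                # empirical - expected
    w += 0.5 * grad                                # fixed step size

log_z = np.logaddexp.reduce(feats @ w)
print("P(y|x):", np.round(np.exp(feats @ w - log_z), 3))
```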

Flat Models
Logistic regression: P(Category | Words)
[Figure: discriminative flat model over Page.Category given Word 1 … Word N and LinkWord 1 … LinkWord N]
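For comparison, the flat baseline itself is a few lines with scikit-learn; the pages below are a tiny hypothetical stand-in for WebKB:

```python
# Sketch: the flat baseline as bag-of-words logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = ["professor research publications advisor",
         "syllabus homework lecture notes exam",
         "student homepage coursework resume"]
labels = ["faculty", "course", "student"]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(pages, labels)
print(clf.predict(["research publications of a professor"]))
```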

Exploiting Links
[Figure: linked-page model — From-Page.Category and To-Page.Category connected through links]
42.1% relative reduction in error relative to the generative approach

More Complex Structure
[Figure: patterns relating a Faculty page’s category and words to the categories of linked Students and Courses pages]

Collective Classification: Results
35.4% relative reduction in error relative to the strong flat approach

Scalability
WebKB data set size: 1300 entities, 180K attributes, 5800 links
Network size per school:
Directed model: 200,000 variables, 360,000 edges
Undirected model: 40,000 variables, 44,000 edges
                    Training     Classification
Directed models     3 sec        180 sec
Undirected models   20 minutes   15-20 sec
The difference in training time decreases substantially when some training data is unobserved or we want to model with hidden variables

Predicting Relationships
Even more interesting are the relationships between objects
e.g., verbs are almost always relationships
[Figure: Tom Mitchell (Professor) — Advisor-of → Sean Slattery (Student); both Members of the WebKB Project]

Flat Model
Predict the link Type from the two pages’ words and the link words
Types: NONE, advisor, instructor, TA, member, project-of
[Figure: Rel.Type given From-Page words, To-Page words, and LinkWord 1 … LinkWord N]

Flat Model
[Figure: results chart]

Collective Classification: Links
[Figure: link model — Rel.Type now also depends on the From-Page and To-Page Categories, inferred jointly with the page words and link words]

Link Model
[Figure: results chart]

Triad Model
[Figure: template over triads — Professor Advisor-of Student, both Members of the same Group]

Triad Model
[Figure: template over triads — Professor Instructor of a Course, Student TA of the same Course, Professor Advisor-of Student]

Triad Model
[Figure: results chart]

WebKB++
Four new department web sites: Berkeley, CMU, MIT, Stanford
Labeled page type (8 types): faculty, student, research scientist, staff, research group, research project, course, organization
Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE
Data set size: 11K pages, 110K links, 2 million words

Link Prediction: Results
Error measured over links predicted to be present
Link presence cutoff is at the precision/recall break-even point (≈30% for all models)
[Chart: relative reduction in error relative to the strong flat approach]

Summary
PRMs inherit key advantages of probabilistic graphical models: coherent probabilistic semantics; exploit the structure of local interactions
Relational models are inherently more expressive
“Web of influence”: use all available information to reach powerful conclusions
Exploit both relational information and the power of probabilistic reasoning

Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP*: word-sense disambiguation; relation extraction; natural language understanding (?)
* An outsider’s perspective, or “Why should I care?”

Word Sense Disambiguation
“Her advisor gave her feedback about the draft.”
Neighboring words alone may not provide enough information to disambiguate
We can gain insight by considering compatibility between the senses of related words
[Figure: candidate senses for advisor, feedback, and draft — financial, academic, physical, figurative, electrical, criticism, wind, paper]

Collective Disambiguation
“Her advisor gave her feedback about the draft.”
Objects: words in the text
Attributes: sense, gender, number, POS, …
Links: grammatical relations (subject-object, modifier, …); close semantic relations (is-a, cause-of, …); same word in different sentences (one sense per discourse)
Compatibility parameters: learned from tagged data; based on prior knowledge (e.g., WordNet, FrameNet)
Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially?
Can we integrate inter-word relationships directly into our probabilistic model?

Relation Extraction
“ACME’s board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] … In an attempt to improve the company’s image, ACME is considering former judge Mary Miller for the job. [7/01] … As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01]”
[Figure: extracted graph linking Miller and Jackson to the ACME CEO role via Candidate, Departs, Concerns, Made-Announcement, and (inferred) Hired?? relations]

Understanding Language
“Professor Sarah met Jane. She explained the hole in her proof.”
[Figure: the hole in the proof — “Theorem: P=NP … N=1”]
Most likely interpretation: Sarah is the professor; Jane is the student

Resolving Ambiguity [Goldman & Charniak; Pasula & Russell]
“Professor Sarah met Jane. She explained the hole in her proof.”
Professors often meet with students ⇒ Jane is probably a student (attribute values)
Professors like to explain ⇒ “she” is probably Prof. Sarah (link types, object identity)
Probabilistic reasoning about objects, their attributes, and the relationships between them

Acquiring Semantic Models
Statistical NLP reveals patterns:
Standard models learn patterns at the word level
But word-patterns are only implicit surrogates for underlying semantic patterns
“Teacher” objects tend to participate in certain relationships
Can use this pattern for objects not explicitly labeled as a teacher
[Figure: verbs taking “teacher” as object — be, train, hire, pay, fire, serenade — with corpus frequencies 24%, 3%, 1.5%, 1.4%, 0.3%, …]

Competing Approaches
Desiderata: semantic understanding; scaling up (via learning); robustness to noise & ambiguity
Logical approaches address semantic understanding; statistical approaches address scaling up and noise & ambiguity
The approaches are complementary; PRMs aim to combine them

Statistics: from Words to Semantics
Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships
Learn statistical models of semantics from text
Reason using the models to obtain a global semantic understanding of the text
[Image: Georgia O’Keeffe, “Ladder to the Moon”]