
Slide 1: Learning from data with graphical models
Padhraic Smyth
Information and Computer Science, University of California, Irvine
www.datalab.uci.edu

Slide 2: Outline
– Graphical models: a general language for specification and computation with probabilistic models
– Learning graphical models from data: EM algorithm, Bayesian learning, etc.
– Applications: learning topics from documents, discovering clusters of curves, other work

Slide 3: The Data Revolution
Technological advances in the past 20 years:
– Sensors
– Storage
– Computational power
– Databases and indexing
– Machine learning
Science and engineering are increasingly data driven.

Slide 4: Examples of Digital Data Sets
The Web: > 4.3 billion pages
– Link graph
– Text on Web pages
Genomic data
– Sequence = 3 billion base pairs (human)
– 25k genes: expression, location, networks, …
Sloan Digital Sky Survey
– 15 terabytes
– ~500 million sky objects
Earth sciences
– NASA MODIS satellite
– Entire Earth at 15 m to 250 m resolution every day
– 37 spectral bands

Slides 5–12: [image-only slides; content not included in the transcript]

Slide 13: Questions of Interest
– Can we detect significant change?
– Can we identify, classify, and catalog certain phenomena?
– Can we segment/cluster pixels into land-cover types?
– Can we measure and predict seasonal variation?
– And so on…
Previously, scientists did much of this work by hand. Now, automated algorithms are essential.

Slide 14: Learning from Data
Uncertainty abounds:
– Measurement error
– Unobserved phenomena
– Model and parameter uncertainty
– Forecasting and prediction
Probability is the "language of uncertainty"; there has been a significant shift in computer science toward probabilistic models in recent years.

Slide 15: Preliminaries
Probability:
– X = x: a statement about the world (true or false)
– P(X = x): degree of belief that the world is in state X = x
Set of variables U = {X_1, X_2, …, X_N}
Joint probability distribution: P(U) = p(X_1, X_2, …, X_N)
If every variable takes K values, P(U) is a table of K × K × … × K = K^N numbers.
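To make the blow-up concrete, here is a minimal Python sketch (not from the talk; all names are illustrative): even a modest model needs a K^N table, and any query over the full joint touches every entry.

```python
import numpy as np

K, N = 3, 10                      # 3 states per variable, 10 variables
joint = np.random.rand(*([K] * N))
joint /= joint.sum()              # normalize so the table sums to 1

print(joint.size)                 # 3**10 = 59049 entries already

# Marginalizing down to P(X1) touches every entry: O(K**N) work.
marginal_x1 = joint.sum(axis=tuple(range(1, N)))
print(marginal_x1)                # a length-K vector
```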

Slide 16: Conditional Probabilities
Many problems of interest involve computing conditional probabilities:
– Prediction: P(X_{N+1} | x_1, …, x_N)
– Classification/diagnosis/detection: arg max_y { P(Y = y | x_1, …, x_N) }
– Learning: P(θ | x_1, …, x_N)
Note: computing P(X_{N+1} | X_1 = x) by brute force has time complexity O(K^N).

Slide 17: Two Problems
Problem 1: Computational complexity
– Inference computations scale as O(K^N)
Problem 2: Model specification
– To specify p(U) we need a table of K^N numbers
– Where do these numbers come from?

Slide 18: A Very Brief History
These problems were recognized in the 1970s and 80s in artificial intelligence (AI), e.g., in early work on diagnostic reasoning systems such as medical diagnosis with N = 100 different symptoms.
1970s/80s solutions: invent new formalisms
– Certainty factors
– Fuzzy logic
– Non-monotonic logic
1985: "In defense of probability", P. Cheeseman, IJCAI-85
1990s: the marriage of statistics and computer science

Slide 19: Two Key Ideas
Problem 1: Computational complexity
– Idea: represent the dependency structure as a graph and exploit sparseness in computation
Problem 2: Model specification
– Idea: learn models from data using statistical learning principles

Slide 20:
"…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers."
Glenn Shafer and Judea Pearl, Introduction to Readings in Uncertain Reasoning, Morgan Kaufmann, 1990

Slide 21: Graphical Models
Dependency structure is encoded by an acyclic directed graph:
– Node = random variable
– Edges encode dependencies; the absence of an edge implies conditional independence
– Directed and undirected versions exist
Why is this useful?
– A language for communication
– A language for computation
Origins:
– Wright, 1920s
– 1988: Spiegelhalter and Lauritzen in statistics; Pearl in computer science
– Also known as Bayesian networks, belief networks, causal networks, etc.

Slide 22: Examples of 3-way Graphical Models
[Graph: A → B, A → C]
Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A.

Slide 23: Examples of 3-way Graphical Models
[Graph: A → C ← B]
Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)

Slide 24: Examples of 3-way Graphical Models
[Graph: A → B → C]
Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)

Slide 25: Directed Graphical Models
[Graph: A → C ← B]
p(A,B,C) = p(C|A,B) p(A) p(B)

Slide 26: Directed Graphical Models
[Graph: A → C ← B]
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X_1, X_2, …, X_N) = ∏_i p(X_i | parents(X_i))
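As an illustration of the factorization, a small Python sketch (the CPTs and names are hypothetical, not from the talk): the joint over the slide's three-node graph is assembled from one conditional-table lookup per factor.

```python
import numpy as np

# Hypothetical CPTs for p(A,B,C) = p(C|A,B) p(A) p(B), all variables binary.
pA = np.array([0.6, 0.4])                       # p(A)
pB = np.array([0.7, 0.3])                       # p(B)
pC_AB = np.random.rand(2, 2, 2)                 # unnormalized p(C | A, B)
pC_AB /= pC_AB.sum(axis=2, keepdims=True)       # normalize over C

def joint(a, b, c):
    """p(A=a, B=b, C=c) via the factorization: one lookup per factor."""
    return pA[a] * pB[b] * pC_AB[a, b, c]

# Sanity check: the factored joint sums to 1 over all 8 configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # ~1.0
```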

Slide 27: Example
[Directed graph on variables A, B, C, D, E, F, G]

Slide 28: Example
[Same graph, with C = c and G = g observed]
Say we want to compute p(a | c, g).

Slide 29: Example
Direct calculation: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g)
Complexity of the sum is O(K^4).

Slide 30: Example
Reordering the sums: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f|g)

Slide 31: Example
Summing out f: Σ_f p(e,f|g) = p(e|g), leaving Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) p(e|g)

Slide 32: Example
Summing out e: Σ_e p(d|e) p(e|g) = p(d|g), leaving Σ_b p(a|b) Σ_d p(b|d,c) p(d|g)

Slide 33: Example
Summing out d: Σ_d p(b|d,c) p(d|g) = p(b|c,g), leaving Σ_b p(a|b) p(b|c,g)

Slide 34: Example
Summing out b: Σ_b p(a|b) p(b|c,g) = p(a|c,g)
Complexity of each step is O(K), compared to O(K^4) for the direct sum.
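The same elimination order can be written directly as a chain of small array products. A sketch with randomly generated conditional tables (variable names follow the slides; c and g are treated as fixed, observed indices, so they appear as conditioning constants rather than table axes):

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)

def cpt(*shape):
    """Random conditional table, normalized over the first axis (the child)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

p_a_given_b  = cpt(K, K)          # p(a | b)         -> shape [a, b]
p_b_given_dc = cpt(K, K)          # p(b | d, c=c0)   -> shape [b, d]
p_d_given_e  = cpt(K, K)          # p(d | e)         -> shape [d, e]
p_ef_given_g = rng.random((K, K)) # p(e, f | g=g0)   -> shape [e, f]
p_ef_given_g /= p_ef_given_g.sum()

# Eliminate variables innermost-first, exactly as in the reordered sum:
p_e_given_g  = p_ef_given_g.sum(axis=1)     # sum over f
p_d_given_g  = p_d_given_e @ p_e_given_g    # sum over e
p_b_given_cg = p_b_given_dc @ p_d_given_g   # sum over d
p_a_given_cg = p_a_given_b @ p_b_given_cg   # sum over b

print(p_a_given_cg.sum())  # ~1.0; each step is a small K x K product,
                           # never the brute-force O(K**4) sum
```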

Slide 35: A More General Algorithm
Message Passing (MP) algorithm (Pearl, 1988; Lauritzen and Spiegelhalter, 1988):
– Declare one node (any node) to be the root
– Schedule two phases of message-passing: nodes pass messages up to the root, then messages are distributed back to the leaves
– In time O(N), we can compute any probability of interest

Slides 36–40: Sketch of the MP algorithm in action
[A sequence of diagrams showing messages passed in steps 1–4; images not included in the transcript]

Slide 41: Complexity of the MP Algorithm
Efficient: complexity scales as O(N K^m), where
– N = number of variables
– K = arity of the variables
– m = maximum number of parents of any node
Compare to O(K^N) for the brute-force method.

Slide 42: Graphs with "Loops"
[Graph on A–G with multiple paths between some nodes]
The message-passing algorithm does not work when there are multiple paths between two nodes.

Slide 43: Graphs with "Loops"
General approach: "cluster" variables together to convert the graph to a tree.

Slide 44: Junction Tree
[The same graph with B and E merged into a single node {B, E}]

Slide 45: Junction Tree
Good news: we can perform the MP algorithm on this tree.
Bad news: complexity is now O(K^2) for the merged node.

Slide 46: Probability Calculations on Graphs
The structure of the graph reveals:
– The computational strategy: sparser graphs mean faster computation
– The dependency relations
Automated computation of conditional probabilities:
– A fully general algorithm for arbitrary graphs
– Exact vs. approximate answers
Extensions:
– For continuous variables, replace sums with integrals
– To identify most likely values, replace the sum with the max operator
– Conditional probability tables can be approximated

Slide 47: Hidden or Latent Variables
In many applications there are two sets of variables:
– Variables whose values we can directly measure
– Variables that are "hidden" and cannot be measured
Examples:
– Speech recognition: observed = acoustic voice signal; hidden = label of the word spoken
– Face tracking in images: observed = pixel intensities; hidden = position of the face in the image
– Text modeling: observed = counts of words in a document; hidden = topics that the document is about

Slide 48: Mixture Models
[Graph: S → Y]
S: hidden discrete variable; Y: observed variable(s)

Slide 49: A Graphical Model for Clustering
[Graph: S → Y_1, …, S → Y_d]
S: hidden discrete (cluster) variable
Y_1, …, Y_d: observed variables, assumed conditionally independent given S

Slide 50: A Graphical Model for Clustering
[Same graph as slide 49]
Clusters = p(Y_1, …, Y_d | S = s)
Clustering = learning these probability distributions from data
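A small sketch of this model's density, assuming binary features and hypothetical parameters (not from the talk): the mixture marginal p(y_1, …, y_d) = Σ_s p(s) ∏_j p(y_j | s) follows directly from the graph.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 5                                  # 3 clusters, 5 binary features

p_s = np.array([0.5, 0.3, 0.2])              # hypothetical p(S)
p_y_given_s = rng.random((K, d))             # p(Y_j = 1 | S = s)

def p_y(y):
    """Mixture density p(y_1..y_d) = sum_s p(s) * prod_j p(y_j | s)."""
    y = np.asarray(y)
    # conditional independence given S: one product of factors per cluster
    per_cluster = np.prod(np.where(y == 1, p_y_given_s, 1 - p_y_given_s), axis=1)
    return np.dot(p_s, per_cluster)

print(p_y([1, 0, 1, 1, 0]))
```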

Slide 51: Mixtures of Markov Chains
[Graph: S → (Y_1 → Y_2 → Y_3 → … → Y_N)]
We can model sets of sequences as coming from K different Markov chains.

Slide 52: Mixtures of Markov Chains
We can model sets of sequences as coming from K different Markov chains, which provides a useful method for clustering sequences (Cadez, Heckerman, Meek, Smyth, 2003):
– Used to cluster 1 million Web user navigation patterns
– The algorithm is part of SQL Server 2005
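A minimal sketch of the mixture likelihood (the parameters here are hypothetical, not the paper's): each component is an ordinary Markov chain, and a sequence's marginal probability is a weighted sum over components.

```python
import numpy as np

def chain_loglik(seq, init, trans):
    """log p(seq) under one Markov chain (initial probs + transition matrix)."""
    ll = np.log(init[seq[0]])
    for a, b in zip(seq[:-1], seq[1:]):
        ll += np.log(trans[a, b])
    return ll

def mixture_loglik(seq, weights, inits, transes):
    """log p(seq) under a K-component mixture: log sum_k w_k p(seq | chain k)."""
    comps = [np.log(w) + chain_loglik(seq, i, t)
             for w, i, t in zip(weights, inits, transes)]
    return np.logaddexp.reduce(comps)

# Hypothetical 2-component mixture over a 2-symbol alphabet:
inits = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
transes = [np.array([[0.8, 0.2], [0.2, 0.8]]),
           np.array([[0.3, 0.7], [0.7, 0.3]])]
print(mixture_loglik([0, 0, 1, 0], [0.5, 0.5], inits, transes))
```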

Slide 53: Hidden Markov Model (HMM)
[Graph: hidden chain S_1 → S_2 → … → S_N, with each S_t → Y_t observed]
Two key conditional independence assumptions.
Widely used in speech recognition, protein models, error-correcting codes, …
Comments:
– Inference about S given Y is O(N)
– If S is continuous, we have a Kalman filter
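The O(N) inference the slide mentions is the forward recursion. A sketch assuming discrete observations and hypothetical parameter names (pi, A, B are my own):

```python
import numpy as np

def forward(obs, pi, A, B):
    """HMM filtering: returns p(S_t | y_1..y_t) for each t.

    pi: initial state probs (K,); A: transitions (K, K), A[i, j] =
    p(S_t = j | S_{t-1} = i); B: emissions (K, M), B[i, y] = p(y | S = i).
    Cost is O(N * K**2): linear in sequence length, as the slide notes.
    """
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()                 # normalize to avoid underflow
    out = [alpha]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]    # predict one step, then condition on y
        alpha /= alpha.sum()
        out.append(alpha)
    return np.array(out)
```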

Slide 54: Generalized HMMs
[Graph: the HMM of slide 53 augmented with additional nodes I_1, …, I_n coupled to the chain]

Slide 55: Learning from Data

Slides 56–59: Probabilistic Model
[Build-up diagram: a probabilistic model on one side and real-world data on the other, connected by P(Data | Parameters) in one direction and P(Parameters | Data) in the other]

Slide 60: Parameters and Data
[Graph: θ → y_1, θ → y_2, θ → y_3, θ → y_4]
Data = {y_1, …, y_4}; θ = model parameters
Data observations are assumed IID here.

Slide 61: Plate Notation
[The graph θ → y_1, …, θ → y_n is drawn compactly as θ → y_i inside a plate labeled i = 1:n]

Slide 62: Maximum Likelihood
[Plate model: θ → y_i, i = 1:n]
Data = {y_1, …, y_n}; θ = model parameters
Likelihood(θ) = p(Data | θ) = ∏_i p(y_i | θ)
Maximum likelihood: θ_ML = arg max_θ { Likelihood(θ) }
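For a concrete case, a short sketch (simulated data, not from the talk): with IID Gaussian observations the likelihood maximum has a closed form, the sample mean and the (biased) sample variance.

```python
import numpy as np

# Simulated IID Gaussian data; the ML estimates should recover (5.0, 4.0).
y = np.random.default_rng(2).normal(loc=5.0, scale=2.0, size=1000)

mu_ml = y.mean()                     # argmax over mu of prod_i p(y_i | theta)
var_ml = ((y - mu_ml) ** 2).mean()   # note: divides by n, not n-1
print(mu_ml, var_ml)
```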

Slide 63: Being Bayesian
[Plate model with a prior node feeding θ]
Prior(θ) = p(θ)
Bayesian learning: p(θ | evidence) = p(θ | data, prior)
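A minimal conjugate example of Bayesian learning (my own illustration, not from the talk): with a Beta prior on a Bernoulli parameter, p(θ | data) is again a Beta density and can be updated by counting.

```python
# Bernoulli likelihood + Beta prior -> Beta posterior, in closed form.
data = [1, 0, 1, 1, 0, 1, 1, 1]       # hypothetical coin flips
a_prior, b_prior = 1.0, 1.0           # Beta(1, 1) = uniform prior on theta

heads = sum(data)
a_post = a_prior + heads              # add successes to the first count
b_post = b_prior + len(data) - heads  # add failures to the second count

posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)                 # shrinks the raw frequency toward 0.5
```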

Slide 64: Learning = Inference in a Graph
θ is unknown; learning is the process of computing p(θ | y's, prior).
Information "flows" from the observed (green) nodes to the θ node.

Slide 65: Example: Gaussian Model
[Plate model: mean and variance parameters, each with its own prior, feeding y_i, i = 1:n]
Note: priors and parameters are assumed independent here.

Slide 66: Example: Bayesian Regression
[Plate model: θ and x_i feeding y_i, i = 1:n, with noise variance σ²]
Model: y_i = f[x_i; θ] + e, where e ~ N(0, σ²)
p(y_i | x_i) ~ N(f[x_i; θ], σ²)
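When f is linear in θ, maximizing this Gaussian likelihood reduces to least squares. A sketch on simulated data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)    # hypothetical data

X = np.column_stack([x, np.ones_like(x)])                 # f(x; theta) = a*x + b
theta, *_ = np.linalg.lstsq(X, y, rcond=None)             # ML = least squares here
sigma2 = np.mean((y - X @ theta) ** 2)                    # ML noise variance
print(theta, sigma2)                                      # ~[2.0, 1.0], ~0.01
```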

Slide 67: Learning with Hidden Variables
[Plate model: θ → S_i → y_i, i = 1:n, with S_i hidden]
Finding the θ that maximizes L(θ) is difficult, so we use iterative optimization algorithms.

Slide 68: Learning with Hidden Variables
– Guess at some initial parameters θ⁰
– E-step (inference in a graph): for each case and each unknown variable, compute p(S | known data, θ⁰)
– M-step (optimization): maximize L(θ) using p(S | …), yielding new parameter estimates θ¹
This is the EM algorithm: it is guaranteed to converge to a (local) maximum of L(θ).
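A compact sketch of EM for the simplest interesting case, a two-component 1-D Gaussian mixture (the initialization and names are my own; the talk's models are richer):

```python
import numpy as np

def em_gmm_1d(y, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture -- a minimal sketch."""
    w = np.array([0.5, 0.5])                  # mixing weights
    mu = np.array([y.min(), y.max()])         # crude initial guess
    var = np.array([y.var(), y.var()])
    for _ in range(n_iter):
        # E-step: responsibilities p(S = k | y_i, current params).
        dens = np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates; each sweep cannot decrease L(theta).
        nk = resp.sum(axis=0)
        w = nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / nk
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Simulated two-cluster data: estimates should land near (0, 5) and unit variance.
y = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(y))
```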

Slides 69–72: Mixture Model: E-Step and M-Step
[Plate diagrams of the mixture model S_i → y_i, i = 1:n, illustrating the alternating E- and M-steps]

Slide 73:
[Data plot] Data from Prof. Christine McLaren, Dept. of Epidemiology, UC Irvine

Slide 74: Mixture Model
[Plate model: C_i → y_i, i = 1:n]
Two hidden clusters, C = 1 and C = 2; p(y | C = k) is Gaussian.
Model for y: a mixture of two 2-D Gaussians.

Slides 75–80: [image-only slides; content not included in the transcript]

Slide 81:
[Final clustering result, with the two groups labeled "Anemia Group" and "Control Group"]

Slide 82: Application 1: Probabilistic Topic Modeling in Text Documents
Collaborators: Mark Steyvers, UC Irvine; Michal Rosen-Zvi, UC Irvine; Tom Griffiths, Stanford/MIT

Slide 83: The slides on author-topic modeling are in a separate PowerPoint file (with "author-topic" in the title).

Slide 84: Application 2: Modeling Sets of Curves
Collaborators: Scott Gaffney, Dasha Chudova, UC Irvine; Andy Robertson, Suzana Camargo, Columbia University

Slide 85: Graphical Models for Curves
[Graph: θ and t feeding y]
y = f(t; θ), e.g., y = at² + bt + c with θ = {a, b, c}
Data = {(y_1, t_1), …, (y_T, t_T)}

Slide 86: Graphical Models for Curves
[Plate model: θ and σ feeding y, with a plate over the T points]
y ~ Gaussian density with mean = f(t; θ) and variance = σ²

Slide 87: Example
[Scatter plot of y versus t]

Slide 88: Example
[The same scatter plot with the underlying curve f(t; θ) overlaid; the curve is hidden]

Slides 89–90: Sets of Curves
[Plots of several curves; images not included in the transcript]

Slide 91: Clustering "Non-Vector" Data
Challenges with the data:
– Items may be of different "lengths", "sizes", etc.
– Not easily representable in vector spaces
– Distance is not naturally defined a priori
Possible approaches:
– "Convert" the data into a fixed-dimensional vector space and apply standard vector clustering, but this loses information
– Use hierarchical clustering, but this is O(N²) and requires a distance measure
– Probabilistic clustering with mixtures: define a generative mixture model for the data and learn distance and clustering simultaneously

Slide 92: Graphical Models for Sets of Curves
[Plate model: the curve model of slide 86 with an outer plate over N curves]
Each curve: P(y_i | t_i, θ) = a product of Gaussians

Slide 93: Curve-Specific Transformations
[Plate model with an extra per-curve transformation parameter φ_i]
e.g., E[y_i] = at² + bt + c + φ_i, with θ = {a, b, c, φ_1, …, φ_N}
Note: we can learn the function parameters and the shifts simultaneously with EM.
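One way to see the idea is the simpler case where the per-curve transformation is an additive offset rather than a time shift: the shared coefficients and the offsets can then be fit by alternating least squares. A sketch under that assumption (the talk itself learns shifts jointly via EM; all names here are mine):

```python
import numpy as np

def fit_shared_poly_with_offsets(t, Y, n_iter=20):
    """Alternate between fitting shared quadratic coefficients (a, b, c)
    for E[y] = a t^2 + b t + c + phi_i and one offset phi_i per curve.
    t: (T,) common time grid; Y: (n_curves, T) observed curves."""
    n_curves, _ = Y.shape
    phi = np.zeros(n_curves)
    T = np.column_stack([t ** 2, t, np.ones_like(t)])
    for _ in range(n_iter):
        # Fit (a, b, c) to all curves at once, offsets subtracted out.
        resid = (Y - phi[:, None]).ravel()
        T_all = np.tile(T, (n_curves, 1))
        abc, *_ = np.linalg.lstsq(T_all, resid, rcond=None)
        # Re-estimate each curve's offset as its mean residual.
        phi = (Y - T @ abc).mean(axis=1)
        phi -= phi.mean()   # keep c and the offsets identifiable
    return abc, phi
```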

Slide 94: Learning Shapes and Shifts
[Two plots: the original data and the data after learning]
Data: smoothed growth-acceleration data from teenagers.
EM is used to learn a spline model plus a time-shift for each curve.

Slide 95: Clustering: Mixtures of Curves
[Plate model with an additional hidden cluster variable c per curve]
Each set of trajectory points comes from one of K models.
The model for group k is a Gaussian curve model.
The marginal probability for a trajectory is a mixture model.

Slide 96: Clustering Methodology
Mixtures of polynomials and splines:
– Model the data as mixtures of noisy regression models
– 2-D (x, y) position as a function of time
– Use the model as a first-order approximation for clustering
Compared to vector-based clustering, this:
– Provides a quantitative (e.g., predictive) model
– Can handle variable-length trajectories, missing measurements, a background/outlier process, and coupling of other "features" (e.g., intensity)

Slide 97: Winter Storm Tracks
– Highly damaging weather
– Important water source
– Climate change implications

Slide 98: Data
– Sea-level pressure on a global grid
– Four times a day (every 6 hours), over 30 years
– Blue indicates low pressure

Slide 99: Clusters of Trajectories

Slide 100: Cluster Shapes for Pacific Cyclones

Slide 101: Tropical Cyclones, Western North Pacific, 1983–2002

Slide 102: [image-only slide; content not included in the transcript]

Slide 103: Hierarchical Bayesian Models
[Plate model in which each curve i has its own parameters θ_i, drawn from a shared prior]
Each curve is allowed to have its own parameters, which amounts to "clustering in parameter space".

Slide 104: Summary
Graphical models:
– A representation language for complex probabilistic models
– Provide a systematic framework for model description and probabilistic computation
– A general framework for learning from data
Applications:
– Author-topic models for text data
– Mixtures of regressions for sets of curves
– Many more
Extensions:
– Hierarchical Bayesian models
– Spatio-temporal models
– …

Slide 105: Further Information
Papers online at www.datalab.uci.edu:
– Graphical models: Smyth, Heckerman, and Jordan, Neural Computation, 1997
– Topic models: Steyvers et al., ACM SIGKDD 2004; Rosen-Zvi et al., UAI 2004
– Curve clustering: Gaffney and Smyth, NIPS 2004; Scott Gaffney, PhD thesis, UCI, 2004
Author-Topic Browser: www.datalab.uci.edu/author-topic
– Java demo of the online browser
– Additional tables and results

