Inferring Mixtures of Markov Chains. Tuğkan Batu, Sudipto Guha, Sampath Kannan, University of Pennsylvania.


Inferring Mixtures of Markov Chains. Tuğkan Batu, Sudipto Guha, Sampath Kannan. University of Pennsylvania

An Example: Browsing habits. You read sports and cartoons. You are equally likely to read either next, and you do not remember what you read last. You would expect a "random" sequence: SCSSCSSCSSCCSCCCSSSSCSC…

Suppose there were two readers. I like health and entertainment: I always read the entertainment page first and then the health page. The sequence would be EHEHEHEHEHEHEH…

Two readers, one log file. If there is only one log file, and we assume there is no correlation between the two readers, we might see SECHSSECSHESCSSHCCESCHCCSESHESSHECSHCE… Is there enough information to tell that there are two people browsing? What are they browsing? How are they browsing?

Clues in the stream? Yes, somewhat. H and E have a special relationship: they cannot belong to different (uncorrelated) people. It is not clear about S and C. Suppose there were 3 uncorrelated persons… SECHSSECSHESCSSHCCESCHCCSESHESSHECSHCE

Markov Chains as Stochastic Sources. [Diagram: a Markov chain and its output sequence.]

Markov chains on S, E, C, H. Modeled by two chains: one on {S, C} with transition probability 1/2 between the states, and one on {E, H} with deterministic (probability 1) transitions. Their interleaving cannot be Markovian.
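
The generative process on this slide can be sketched in a few lines of Python. The transition probabilities follow the running example (S and C symmetric with probability 1/2; E and H alternating deterministically); the fair-coin gate and the start states are illustrative assumptions, not from the paper:

```python
import random

# Two chains from the running example: {S, C} with symmetric 1/2
# transitions, and {E, H} alternating deterministically.
CHAINS = [
    {"S": {"S": 0.5, "C": 0.5}, "C": {"S": 0.5, "C": 0.5}},
    {"E": {"H": 1.0}, "H": {"E": 1.0}},
]

def generate(n, seed=0):
    """Interleave the two chains with an i.i.d. fair 'gate' coin flip."""
    rng = random.Random(seed)
    state = ["S", "E"]  # current state of each chain (arbitrary starts)
    out = []
    for _ in range(n):
        i = rng.randrange(2)  # fair gate: pick which chain moves next
        succ = list(CHAINS[i][state[i]])
        weights = [CHAINS[i][state[i]][v] for v in succ]
        state[i] = rng.choices(succ, weights=weights)[0]
        out.append(state[i])
    return "".join(out)

stream = generate(40)
```

Note that, by construction, "HH" can never appear in the output: H only comes from the second chain, whose emissions strictly alternate between H and E.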

Another example. Consider network traffic logs in which malicious attacks were made. Can you tell the pattern of attack apart from the rest of the log? Intrusion detection, log validation, etc.

Yet another example. Consider a genome sequence. Each genome sequence has "coding" regions and "non-coding" regions; (separate) Markov chains (usually of higher order) are used to model these two kinds of regions. Can we predict anything about such regions?

The origins of the problem. Two or more probabilistic processes; we are observing their interleaved behavior. We do not know which state belongs to which process (a cold start).

The Problem. [Diagram: MC1 and MC2 produce an interleaved stream.] Observe the stream; infer MC1 and MC2.

How About…? [Diagram: MC1 and MC2 feed a gate function.] How powerful is this gate function? Clearly, a powerful function can produce arbitrary sequences…

Power of the Gate function. A powerful gate function can encode powerful models: hidden or hierarchical Markov models… Assume a simple (k-way) coin flip for now.

Streaming Model(s). Processor memory is small (polylogarithmic?) compared to the input size. One or more passes, but data is read left-to-right in each pass. Input order is adversarial or "natural".

For our problem we assume: the stream is polynomially long in the number of states of each Markov chain (we may need a rather long stream); nonzero probabilities are bounded away from 0; the space available is some small polynomial in the number of states.

Related Work. [Freund & Ron] considered the gate function to be a "special" Markov chain and the individual processes as distributions. Mixture analysis [Duda & Hart]. Mixtures of Bayesian networks and DAG models [Thiesson et al.]. Mixtures of Gaussians [Dasgupta; Arora & Kannan]. [Abe & Warmuth] on the complexity of learning HMMs. Hierarchical Markov models [Kervrann & Heitz].

The old example. No "HH". No "HSH", but "HEH" occurs. The logic: if E were in a different chain from H, then we should also see "HH". SECHSSECSHEHSECSSHCCESCHCCSESHESSHECSH

A few definitions. T[u]: probability of ……u…… ; T[uv]: probability of ……uv…… ; T[uv]/T[u]: probability of v immediately after u. S(u): stationary probability of u (in its own chain). π_u: mixing probability of the chain containing u. Remark. We have (empirical) approximations to T and S.
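
The quantities T[u] and T[uv] can be estimated directly from the stream by counting. A minimal sketch (the function name and use of adjacent-pair counts are illustrative choices, not from the paper):

```python
from collections import Counter

def estimate_T(stream):
    """Empirical versions of the slide's quantities: T[u] is the fraction
    of positions holding u; T[uv] is the fraction of adjacent pairs uv."""
    n = len(stream)
    Tu = {u: c / n for u, c in Counter(stream).items()}
    Tuv = {p: c / (n - 1) for p, c in Counter(zip(stream, stream[1:])).items()}
    return Tu, Tuv

# the interleaved log from the earlier slide
Tu, Tuv = estimate_T("SECHSSECSHESCSSHCCESCHCCSESHESSHECSHCE")
```

The conditional probability of v right after u is then Tuv[(u, v)] / Tu[u].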

Assumption. Assume that the stream is generated by Markov chains (their number unknown to us) that have disjoint state spaces. Remark. Once we figure out the state spaces, the rest is simple.

Inference Idea 1. Warm-up: T[uv] = 0 implies u and v are in the same chain. Idea: if u, v are in different chains, v will follow u with frequency π_v S(v) = T[v]. Lemma. If T[uv] ≠ T[u]T[v], then u, v are in the same chain. Proof. If u, v are in different chains, T[uv] = T[u] · π_v S(v) = T[u]T[v]. So, in the first phase, we grow components based on this rule.
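
The first phase can be sketched as a union-find pass over state pairs. The fixed tolerance `tol` below is an ad-hoc stand-in for the paper's careful sample-complexity and error analysis:

```python
def grow_components(states, Tu, Tuv, tol=0.02):
    """Idea 1 as a sketch: if T[uv] deviates from T[u]*T[v], then u and v
    must lie in the same chain, so merge their components."""
    parent = {s: s for s in states}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for u in states:
        for v in states:
            # dependence observed (including T[uv] = 0): same chain
            if abs(Tuv.get((u, v), 0.0) - Tu[u] * Tu[v]) > tol:
                union(u, v)

    comps = {}
    for s in states:
        comps.setdefault(find(s), set()).add(s)
    return list(comps.values())
```

On the deterministic E/H chain, T[EH] = 1/2 while T[E]T[H] = 1/4, so E and H are merged; for two independent, uncorrelated states all pairs factorize and no merge happens.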

What do we have after Idea 1? If we have not "resolved" u and v, then T[uv] = T[u]T[v]: either u, v are in different chains, or M_uv = S(v), so that T[uv] = T[u] · π_v M_uv = T[u] · π_v S(v) = T[u]T[v].

End of Phase 1. We have a set of components, but further collapsing is possible. [Diagram: example components over S, C, H, E.]

Inference Idea 2. Consider u, v already in the same component, and z in a separate component. State z is in the same chain if and only if T[uzv] = T[u]T[z]T[v]. Now we can complete collapsing the components.

At the end. Either we will resolve all states into chains, or we have some singleton components such that for each pair u, v: T[u]T[v] = T[uv], equivalently M_uv = S(v). Hence the next-state distribution (from any state) is S.

The Old Example. [Diagram: the {S, C} chain and the {E, H} chain.] The components of S and C will be left unmerged. This is not a bug!

More precisely: if we have two competing hypotheses, then the likelihood of observing the string is exactly equal under both. In other words, we have two competing models which are equivalent.

More General Mixing Processes. Up to now: i.i.d. coin flips for mixing. We can handle the case where the next chain is chosen depending on the last output (i.e., each state has its own "next-chain" distribution). E.g., web logs: at some pages you click sooner; at others, you read before clicking.
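
The more general mixing process differs from the earlier sketch only in the gate: after emitting state s, the next chain is drawn from a per-state distribution rather than a fair coin. All the numbers below (the "linger" probabilities and the chains themselves) are illustrative assumptions:

```python
import random

CHAINS = [
    {"S": {"S": 0.5, "C": 0.5}, "C": {"S": 0.5, "C": 0.5}},
    {"E": {"H": 1.0}, "H": {"E": 1.0}},
]
NEXT_CHAIN = {  # per-state gate distribution over (chain 0, chain 1)
    "S": [0.7, 0.3], "C": [0.7, 0.3],  # "click sooner" pages
    "E": [0.2, 0.8], "H": [0.2, 0.8],  # "read before clicking" pages
}

def generate_state_dependent(n, seed=0):
    """Interleave the chains; the gate depends on the last emitted state."""
    rng = random.Random(seed)
    state, last = ["S", "E"], "S"
    out = []
    for _ in range(n):
        i = rng.choices([0, 1], weights=NEXT_CHAIN[last])[0]
        succ = list(CHAINS[i][state[i]])
        weights = [CHAINS[i][state[i]][v] for v in succ]
        state[i] = rng.choices(succ, weights=weights)[0]
        last = state[i]
        out.append(last)
    return "".join(out)
```

The within-chain invariants from the i.i.d. case still hold here (e.g., "HH" never appears), only the interleaving pattern changes.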

Intersecting State Sets. We need two assumptions: 1. there are two Markov chains; 2. there exists a state w that belongs to exactly one chain such that, for all v, M_wv > S(v) or M_wv = 0. Using analogous inference rules and state w as a reference point, we can infer the underlying Markov chains.

Open Questions. Remove/relax the assumptions for intersecting state spaces. Hardness results? Reduce the stream length? Sampling more frequently loses independence of samples… is there a more sophisticated argument? Some form of "hidden" Markov model, where rather than seeing a stream of states we see a stream of a function of the states? Difficulty: identical labels for states. CAUTION: inferring a single hidden Markov model is already hard.