
1 Inferring Data Inter-Relationships Via Fast Hierarchical Models Lawrence Carin Duke University www.ece.duke.edu/~lcarin

2 Sensor Deployed Previously Across Globe Deploy to New Location. Can Algorithm Infer Which Data from Past is Most Relevant for New Sensing Task? Previous deployments New deployment

3 Semi-Supervised & Active Learning Enormous quantity of unlabeled data -> exploit context via semi-supervised learning Focus the analyst on most-informative data -> active learning

4 Appropriately exploit related data from previous experience over sensor “lifetime” - Transfer learning Place learning with labeled data in the context of unlabeled data, thereby exploiting manifold information - Semi-supervised learning Reduce load on analyst: only request labeled data on subset of data for which label acquisition would be most informative - Active learning Technology Employed & Motivation

5 Bayesian Hierarchical Models: Dirichlet Processes Principled setting for transfer learning Avoids problems with model selection - Number of mixture components - Number of HMM states [iGMM: Rasmussen, 00], [iHMM: Teh et al., 04, 06], [Escobar & West, 95]

6 Data Sharing: Stick-Breaking View of DP – 1/2 The Dirichlet process (DP) is a prior on an unknown distribution, i.e., G(Θ) ~ DP[α, G_0(Θ)]. One draw of G(Θ) from DP[α, G_0(Θ)] has the stick-breaking form G(Θ) = Σ_{k=1}^∞ π_k δ_{Θ_k}(Θ), where π_k = ν_k ∏_{j<k} (1 − ν_j), ν_k ~ Beta(1, α), and Θ_k ~ G_0 [Sethuraman, 94]

7 Data Sharing: Stick-Breaking View of DP – 2/2 As α → 0, Beta(1, α) is more likely to yield large ν_k, implying more sharing: a few larger "sticks", with correspondingly likely parameters. As α → ∞, the sticks become very small and roughly the same size, so G reduces to G_0
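The stick-breaking construction above can be sketched directly in code. The snippet below is a minimal sketch, assuming only numpy; the function name draw_dp_stick_breaking and the truncation level are our own illustration, not part of the talk. It draws a truncated approximation of G ~ DP(α, G_0) and shows how small α concentrates mass on a few sticks while large α spreads it thinly.

```python
import numpy as np

def draw_dp_stick_breaking(alpha, base_sampler, truncation=100, rng=None):
    """Draw one random measure G ~ DP(alpha, G0) via truncated stick-breaking.

    Returns weights pi_k and atoms theta_k so that G = sum_k pi_k * delta(theta_k)."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, alpha, size=truncation)                 # v_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining                                        # pi_k = v_k * prod_{j<k}(1 - v_j)
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return pi, atoms

# Small alpha -> a few dominant sticks (heavy sharing);
# large alpha -> many tiny sticks, G close to G0.
for alpha in (0.5, 50.0):
    pi, _ = draw_dp_stick_breaking(alpha, lambda r: r.normal(0.0, 1.0))
    print(alpha, np.sort(pi)[::-1][:5])
```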

8 Non-Parametric Mixture Models - Data sample d_i drawn from a Gaussian/HMM with associated parameters - Posterior on model parameters indicates which parameters are shared, yielding a Gaussian/HMM mixture model; no model selection on the number of mixture components
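Marginalizing the random measure G gives the Chinese restaurant process view of the same prior: each data sample either reuses an existing parameter (in proportion to how many samples already share it) or opens a new one (in proportion to α), so the number of mixture components is never fixed in advance. The sketch below is our own illustration (hypothetical helper crp_assignments, numpy only), not code from the talk.

```python
import numpy as np

def crp_assignments(n, alpha, rng=None):
    """Sample cluster assignments for n items from a Chinese restaurant process
    with concentration alpha (the clustering induced by a DP prior)."""
    rng = np.random.default_rng() if rng is None else rng
    assignments = [0]                      # first item starts the first cluster
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()               # P(existing cluster k) ~ n_k, P(new) ~ alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)               # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_assignments(20, alpha=1.0))
```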

9 Dirichlet Process as a Shared Prior Cumulative set of data D = {d_1, d_2, …, d_n}, with associated parameters. When parameters are shared, the associated data are also shared; data sharing implies learning from previous/other experiences → Life-long learning. The posterior reflects a balance between the DP-based desire for sharing, constituted by the prior, and the likelihood function that rewards parameters that match the data well: the DP prior favors sharing parameters, the likelihood favors fitting the data, and the posterior balances these objectives

10 Hierarchical Dirichlet Process – 1/2 A DP prior on the parameters of a Gaussian model yields a GMM in which the number of mixture components need not be set a priori (non-parametric). Assume we wish to build N GMMs, each designed using a DP prior. We link the N GMMs via an overarching DP "hyperprior" from which the shared base measure is drawn [Teh et al., 06]

11 Hierarchical Dirichlet Process – 2/2 The HDP yields a set of GMMs, each of which shares the same parameters, corresponding to Gaussian means and covariances, but with distinct probabilities of observation. Coefficients a_{n,k} represent the probability of transitioning from state n to state k. This naturally yields the structure of an HMM; the number of large-amplitude coefficients a_{n,k} implicitly determines the most probable number of states
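One common way to realize this structure is the weak-limit (truncated) approximation of the HDP-HMM: a single global stick-breaking vector β ties all transition-matrix rows to the same set of states, and each row a_{n,·} is drawn from a Dirichlet centered on β. The sketch below is our own illustration under that approximation (parameter names alpha0 and gamma are assumptions, not the talk's exact model); it shows that only a few columns receive large mass, implicitly selecting the number of states.

```python
import numpy as np

def sample_hdp_hmm_transitions(max_states, alpha0, gamma, rng=None):
    """Weak-limit sketch of an HDP prior over HMM transitions: global weights
    beta are built by stick-breaking, and each row a_{n,.} ~ Dirichlet(alpha0*beta)
    shares the same state set."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, gamma, size=max_states)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()                                        # truncation renormalization
    A = rng.dirichlet(alpha0 * beta + 1e-6, size=max_states)  # rows a_{n,k}
    return beta, A

beta, A = sample_hdp_hmm_transitions(max_states=20, alpha0=5.0, gamma=2.0)
print("effective states (beta > 0.05):", np.sum(beta > 0.05))
```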

12 Computational Challenges in Performing Inference We have the general challenge of estimating the posterior. The normalizing denominator requires integrating over the typically high-dimensional model parameters and cannot be computed exactly in reasonable time, so approximations are required. MCMC, the Laplace approximation, and variational Bayes (VB) trade off computational complexity against accuracy [Blei & Jordan, 05]

13 Graphical Model of the nDP-iHMM [Ni, Dunson, Carin; ICML 07]

14 How Do You Convince the Navy That Data Search Works? Validation is not as "simple" as for text search. Consider a special kind of acoustic data: music

15 Multi-Task HMM Learning Assume we have N sequential data sets. Wish to learn an HMM for each of the data sets. Believe that data can be shared between the learning tasks; the tasks are not independent. All N HMMs are learned jointly, with appropriate data sharing. Use of the iHMM avoids the problem of selecting the number of HMM states. Validation on a large music database; VB yields fast inference

16 Demonstration Music Database: 525 Jazz, 975 Classical, 997 Rock [Figures: Jazz, Rock]

17 [Figure: Classical]

18 Inter-Task Similarity Matrix

19 Typical Recommendations from Three Genres: Classical, Jazz, Rock

20

21

22 Applications of Interest to Navy Music search provides a fairly good & objective demonstration of the technology Other than use of acoustic/speech features (MFCCs), nothing in previous analysis specifically tied to music – simply data search Use similar technology for underwater acoustic sensing (MCM) - generative Use related technology for synthetic aperture radar and EO/IR detection and classification – discriminative Technology delivered to NSWC Panama City, and demonstrated independently on mission-relevant MCM data

23 Underwater Mine Counter Measures (MCM)

24 Generative Model - iHMM [Ni & Carin, 07]

25

26

27 Full Posterior on Number of HMM States

28 Anti-Submarine Warfare (ASW)

29 Design HMM for all Targets of Interest Over Sensor Lifetime

30 State Sharing Between ASW Targets

31 Semi-Supervised Multi-Task Learning

32 Semi-Supervised Discriminative Multi-Task Learning Semi-supervised learning implemented via graphical techniques Multi-task learning implemented via DP Exploits all available data-driven context - Data available from previous collections, labeled & unlabeled - Labeled and unlabeled data from current data set

33 Graph representation of partially labeled data manifolds (1/2) Construct the graph G = (X, W) with affinity matrix W, where the (i, j)-th element of W is defined by a Gaussian kernel: W_ij = exp(−||x_i − x_j||² / (2σ²)). Define a Markov random walk on the graph by the transition matrix A, where the (i, j)-th element A_ij = W_ij / Σ_k W_ik gives the probability of walking from x_i to x_j in a single step of the Markov random walk. The one-step Markov random walk provides a local similarity measure between data points. [Lu, Liao, Carin; 07] [Szummer & Jaakkola, 02]

34 Graph representation (2/2) To account for global similarity between data points, we consider a t-step random walk, whose transition matrix is A raised to the power of t, i.e., A^t. It was demonstrated [1] that the t-step Markov random walk aggregates a volume of paths connecting the data points instead of relying on the shortest path, which is susceptible to noise; thus it permits us to incorporate the global manifold structure of the training data set. The t-step neighborhood of x_i is defined as the set of data points x_j reached with non-negligible t-step transition probability (A^t)_ij, and is denoted N_t(x_i). [1] Tishby and Slonim, "Data clustering by Markovian relaxation and the information bottleneck method," NIPS 13, 2000
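The graph construction in the last two slides can be summarized in a few lines of numpy. The sketch below is our own illustration (the function name markov_random_walk and the values of σ and t are assumptions): it builds the Gaussian-kernel affinity matrix W, row-normalizes it into the one-step transition matrix A, and raises A to the power t to obtain the t-step random-walk similarities.

```python
import numpy as np

def markov_random_walk(X, sigma=1.0, t=5):
    """Graph-based similarity for semi-supervised learning (sketch).
    W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))   Gaussian-kernel affinity
    A_ij = W_ij / sum_k W_ik                      one-step transition matrix
    A^t  gives the t-step random-walk similarities used as neighborhoods."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    A = W / W.sum(axis=1, keepdims=True)
    At = np.linalg.matrix_power(A, t)             # t-step transition probabilities
    return W, A, At

X = np.random.default_rng(0).normal(size=(8, 2))  # toy 2-D data
_, _, At = markov_random_walk(X, sigma=0.5, t=4)
print(At[0])   # row 0: probability of reaching each point from x_0 in t steps
```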

35 Semi-Supervised Learning Algorithm (1/2) Neighborhood-based classifier: Define the probability of label y_i given the t-step neighborhood of x_i as p(y_i | N_t(x_i)) = Σ_j (A^t)_ij p(y_i | x_j, θ), where p(y_i | x_j, θ) is the probability of label y_i given a single data point x_j, represented by a standard probabilistic classifier parameterized by θ. The label y_i implicitly propagates over the neighborhood; thus it is possible to learn a classifier with only a few labels present.
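In code, the neighborhood-based classifier is just a matrix-vector product: the t-step transition matrix weights the per-point label probabilities. A minimal sketch follows (the helper name neighborhood_label_prob and the toy A^t and per-point probabilities are ours, purely for illustration).

```python
import numpy as np

def neighborhood_label_prob(At, point_label_probs):
    """p(y_i | N_t(x_i)) = sum_j (A^t)_ij * p(y_i | x_j, theta):
    random-walk-weighted average of per-point label probabilities."""
    return At @ point_label_probs

# toy usage: 4 points, a chain-like t-step transition matrix (rows sum to 1)
At = np.array([[0.6, 0.3, 0.1, 0.0],
               [0.3, 0.4, 0.2, 0.1],
               [0.1, 0.2, 0.4, 0.3],
               [0.0, 0.1, 0.3, 0.6]])
p_point = np.array([0.9, 0.8, 0.2, 0.1])     # per-point p(y=1 | x_j)
print(neighborhood_label_prob(At, p_point))  # labels smoothed over the manifold
```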

36 The Algorithm (2/2) For binary classification problems, we choose the form of p(y_i | x_j, θ) to be a logistic regression classifier: p(y_i = 1 | x_j, θ) = σ(θᵀ x_j), with σ(z) = 1/(1 + e^(−z)). To enforce sparseness, we impose a normal prior with zero mean and a diagonal precision matrix on θ, and each precision hyperparameter has an independent Gamma prior. Important for transfer learning: the semi-supervised algorithm is inductive and parametric. Place a DP prior on the parameters θ, shared among all tasks
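A minimal sketch of the base classifier: MAP logistic regression under a zero-mean normal prior with a diagonal precision matrix. Here the precisions are held fixed for simplicity, whereas the full model places independent Gamma hyperpriors on them and infers them (and ties the classifier parameters across tasks through the DP prior); all names below (map_sparse_logistic, lr, iters) are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_sparse_logistic(X, y, precisions, lr=0.5, iters=1000):
    """MAP logistic regression: gradient ascent on log-likelihood plus the
    log of a zero-mean normal prior with diagonal precision matrix."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - precisions * w   # likelihood + prior gradient
        w += lr * grad / len(y)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (sigmoid(2.0 * X[:, 0]) > rng.uniform(size=200)).astype(float)  # only feature 0 matters
w = map_sparse_logistic(X, y, precisions=np.array([1.0, 1.0, 1.0]))
print(np.round(w, 2))   # weights on irrelevant features shrink toward zero
```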

37 Toy Data for Tasks 1-6

38 Sharing Data Pooling tasks 1-3Pooling tasks 1-6

39

40 Task similarity for MTL tasks 1-6

41 Navy-Relevant Data Synthetic Aperture Radar (SAR) Data Collected At 19 Different Locations Across USA

42 Real Radar Sensor Data Data from 19 "tasks" or geographical regions: 10 of these regions are relatively highly foliated, and 9 are bare earth or desert. The algorithm adaptively and autonomously clusters the task-dependent classifier weights into two basic pools, which agree with truth. Active learning is used to select which labels to acquire for the site under test. Other sites are used as auxiliary data, in a "life-long-learning" setting

43 Supervised MTL: JMLR 07

44

45 Classifier at the new site is placed appropriately within the context of all available previous data. Both labeled and unlabeled data are employed. Found that the algorithm is relatively insensitive to the particular labeled data selected. Validation with a relatively large music database

46 Reconstruction of Random-Bars with hybrid CS. Example (a) is from [3], and (b)-(c) are images modified by us from (a) to represent similar tasks for simultaneous CS inversion. The intensities of all the rectangles are randomly permuted, and the positions of all the rectangles are shifted by distances randomly sampled from a uniform distribution on [-10, 10].

47

