# Some Ideas for Detecting Spurious Observations Based on Mixture Models. Jim Lynch, NISS/SAMSI & University of South Carolina

1 Some Ideas for Detecting Spurious Observations Based on Mixture Models Jim Lynch NISS/SAMSI & University of South Carolina

2 Some Ideas for Detecting Spurious Observations. Work with Dave Dickey and Francisco Vera. Very preliminary ideas, primarily motivated by Dave's American Airlines data and Proschan's (1963) paper on pooling to explain a decreasing failure rate and, to a lesser extent, M. J. Bayarri's talk on multiple testing.

3 Outline

1. Introduction
2. Mixture Models
3. Some Ideas
4. Simulations
5. The American Airlines Data

4 Introduction: Some Motivation – AA Data (Largest Log Vol Removed). Some time series diagnostics suggest that the log volume ratio is an MA(1). Fit an MA(1) to the log volume ratio of the AA data. Look at the residuals.

5 Introduction. Detecting spurious observations is an important area of research and has implications for anomaly detection (AD). The term "spurious observation" is used to distinguish it from an outlier, since outliers are usually extreme observations in the data while a spurious observation need not be.
– E.g., one could imagine that sophisticated intruders into computer systems would make sporadic intrusions and try to mimic normal behavior as closely as possible.

6 Introduction: Goal
– To develop approaches to detect very transient spurious events, where the objectives are (1) to detect when spurious events are present and, if possible, (2) to identify them.

7 Introduction: The Basic Data Analytic Model
– $X_1,\dots,X_n$ iid $\sim f_p = (1-p) f_0 + p f_1$, where $f_0$ is the background model and $f_1$ models the spurious behavior.
– The likelihood is then $L(p) = \prod_{i=1}^n \left[(1-p) f_0(X_i) + p f_1(X_i)\right]$.
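The basic model can be made concrete with a minimal sketch. It assumes, purely for illustration, that $f_0 = N(0,1)$ and $f_1 = N(1,1)$ are known (the components used in the simulation slides) and reuses the $n=10$ sample from slide 20. The EM update for $p$ and the helper names are choices made here, not necessarily what was used in the talk.

```python
import math

# n=10 sample from the simulation slides (5 points N(0,1), 5 points N(1,1)).
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

# Assumed components: background f0 = N(0,1), spurious f1 = N(1,1).
def f0(x):
    return normal_pdf(x, 0.0)

def f1(x):
    return normal_pdf(x, 1.0)

def loglik(p, xs):
    """Log of L(p) = prod_i [(1 - p) f0(X_i) + p f1(X_i)]."""
    return sum(math.log((1.0 - p) * f0(x) + p * f1(x)) for x in xs)

def mle_p(xs, iters=2000):
    """EM updates for the mixing proportion p with f0 and f1 held fixed."""
    p = 0.5
    for _ in range(iters):
        # Posterior probability that each X_i came from f1.
        w = [p * f1(x) / ((1.0 - p) * f0(x) + p * f1(x)) for x in xs]
        # New p is the average posterior probability.
        p = sum(w) / len(xs)
    return p

p_star = mle_p(xs)
print(round(p_star, 3), round(loglik(p_star, xs), 3))
```

Because the log likelihood is concave in $p$ when $f_0$ and $f_1$ are fixed, any one-dimensional optimizer would do equally well here; EM is used only because it needs no extra libraries.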

8 Introduction: A More Realistic Model
– Generate a configuration $C$ with probability $p(C)$.
– Given $C$, the $X_i$ are iid $\sim f_0$ for $i \in C$ and iid $\sim f_1$ for $i \in C^c$.
– $C$ and $C^c$ model a spatial or temporal (e.g., a change-point) pattern.
– You are pooling observations based on the configuration $C$.
– The likelihood is then $L = \sum_C p(C) \prod_{i \in C} f_0(X_i) \prod_{i \in C^c} f_1(X_i)$.

9 Introduction: Some Approaches for Analyzing the More Realistic (MR) Model
Envision that the data are the effects of pooling observations from $f_0$ and $f_1$. Treat the data as if they come from a mixture model and use the mixture model to determine the MLE, $p^*$, of the mixing proportion.
– Use $p^*$ to test $H_0: p=0$ versus $H_1: p>0$. (Under $H_0$ and the mixture model, $n^{1/2} p^*$ converges in distribution to $X$, where $X=0$ with probability .5 and $X=|N(0, I_0^{-1})|$ with probability .5.)
– If $H_0$ is rejected, see if the mixture model can give insights into the configuration $C_j$. E.g., do an empirical Bayes analysis with prior $p(C_j) = (1-p^*)^j (p^*)^{n-j}$ for a configuration $C_j$ with $j$ background observations. Then the posterior probability of $C_j$ is proportional to $p(C_j) \prod_{i \in C_j} f_0(X_i) \prod_{i \notin C_j} f_1(X_i)$.
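Under this empirical Bayes prior, the posterior over configurations factorizes across observations. A short derivation, assuming $C$ denotes the set of background indices as in the more realistic model and writing $f_{p^*} = (1-p^*) f_0 + p^* f_1$:

```latex
\begin{align*}
p(C \mid X_1,\dots,X_n)
  &= \frac{(1-p^*)^{|C|}\,(p^*)^{\,n-|C|}
           \prod_{i\in C} f_0(X_i)\prod_{i\in C^c} f_1(X_i)}
          {\prod_{i=1}^n f_{p^*}(X_i)} \\
  &= \prod_{i\in C}\frac{(1-p^*)\,f_0(X_i)}{f_{p^*}(X_i)}
     \;\prod_{i\in C^c}\frac{p^*\,f_1(X_i)}{f_{p^*}(X_i)} .
\end{align*}
```

The denominator is the sum of the numerator over all $2^n$ configurations, which collapses to the product of the mixture densities. Each observation is therefore flagged as spurious independently, with probability $p^* f_1(X_i)/f_{p^*}(X_i)$.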

10 Introduction: Another Approach
Since $f_1$ models the spurious behavior, $p \approx 0$. This suggests using the locally most powerful (LMP) test statistic for testing $H_0: p=0$ versus $H_1: p>0$ as the basis for discovering whether spurious observations are present. The test statistic is essentially the gradient plot introduced by Lindsay (1983) to determine when a finite mixture MLE is the global mixture MLE in the mixed distribution model.

11 Introduction: Another Approach
The basis of this approach:
– Use the gradient plot to determine if the one-point mixture MLE is the global mixture MLE.
– When it isn't, this suggests that some spurious behavior is present.
One can then use the components in the MLE mixed distribution to calculate assignment probabilities for the data, to indicate which observations might be considered spurious. The examples indicate that detecting the presence of spurious observations seems to be considerably simpler than identifying which ones they are.

12 Introduction: Mining Data Graphs
Data (Maguire, Pearson and Wynn, 1952): time between accidents with 10 or more fatalities. At the right are the gradient plots for the 2- and 3-point mixture MLEs and the assignment function for the 3-point MLE (mixing over exponentials).
The 2- and 3-point mixture MLEs:
– 2-point: support points 592.9, 166.2 with p: .175, .825
– 3-point: support points 595.5, 171.6, 29.1 with p: .171, .806, .023

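13 Mixture Models
– $X_1,\dots,X_n$ iid $\sim f_p = (1-p) f_0 + p f_1$, where $f_0$ is the background model and $f_1$ models the spurious behavior. Since the spurious observations are sporadic/transient, $p \approx 0$.
– Denote the log likelihood by $\ell(\mathbf{f}) = \ell(f(X_1),\dots,f(X_n)) = \sum_i \log f(X_i)$.
– Denote the gradient function of $\ell$ by $\Phi(f_1; f_0) = \left.\frac{d}{dp}\,\ell(\mathbf{f}_p)\right|_{p=0} = \sum_i \frac{f_1(X_i) - f_0(X_i)}{f_0(X_i)}$.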

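14 Mixture Models – LMP
Lemma. The locally most powerful test for testing $H_0: p=0$ versus $H_1: p>0$ is based on $\Phi(f_1; f_0) = \sum_i \frac{f_1(X_i) - f_0(X_i)}{f_0(X_i)}$.
Proof. The LMP test for testing $H_0: p=p_0$ versus $H_1: p>p_0$ is based on the statistic $\left.\frac{d}{dp} \sum_i \log f_p(X_i)\right|_{p=p_0} = \sum_i \frac{f_1(X_i) - f_0(X_i)}{f_{p_0}(X_i)}$. For $p_0=0$ this reduces to $\sum_i \frac{f_1(X_i) - f_0(X_i)}{f_0(X_i)}$.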
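As a hedged illustration of the lemma, the sketch below evaluates the LMP statistic on the $n=10$ sample from slide 20 and checks numerically that it matches the derivative of the mixture log likelihood at $p=0$. The choice $f_0 = N(0,1)$, $f_1 = N(1,1)$ and all function names are assumptions made for this example.

```python
import math

# n=10 sample from the simulation slides.
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

# Assumed components: background f0 = N(0,1), spurious f1 = N(1,1).
def f0(x):
    return normal_pdf(x, 0.0)

def f1(x):
    return normal_pdf(x, 1.0)

def lmp_statistic(xs):
    """Sum over i of (f1(X_i) - f0(X_i)) / f0(X_i)."""
    return sum((f1(x) - f0(x)) / f0(x) for x in xs)

def loglik(p, xs):
    """Log likelihood of the mixture (1 - p) f0 + p f1."""
    return sum(math.log((1.0 - p) * f0(x) + p * f1(x)) for x in xs)

phi = lmp_statistic(xs)

# Forward difference approximation to d/dp loglik at p = 0.
h = 1e-6
finite_diff = (loglik(h, xs) - loglik(0.0, xs)) / h

print(round(phi, 4), round(finite_diff, 4))
```

A positive value points toward $p>0$; deciding whether it is large enough requires the null distribution of the statistic, which this sketch does not attempt.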

15 Mixture Model: The Function $\Phi(f_1; f_0)$
– Plays a prominent role in the analysis of data from mixture models, where it is essentially the gradient function.
– Introduced by Lindsay (1983a, 1983b, 1995) to determine when the MLE for the mixing distribution with a finite number of points is the global mixture MLE.

16 Mixture Model: Framework
– Family of densities $\{f_\theta : \theta \in \Theta\}$.
– $M$ is the set of probability measures on $\Theta$.
– The mixed distribution over the family with mixing distribution $Q$ is given by $f_Q(x) = \int f_\theta(x)\, dQ(\theta)$.
– For $X_1,\dots,X_n$ iid from $f_Q$, the likelihood and log likelihood are given by $L(Q) = \prod_i f_Q(X_i)$ and $\ell(\mathbf{f}_Q) = \sum_i \log f_Q(X_i)$, where $\mathbf{f}_Q = (f_Q(X_1),\dots,f_Q(X_n))$.

17 Mixture Model Framework The Directional Derivative
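A form of the directional derivative consistent with the surrounding slides, following Lindsay (1983a), is sketched below. The symbols $\ell$ for the log likelihood and $\Phi$ for the gradient function are notation assumed here.

```latex
\begin{align*}
D(\theta; Q) &= \Phi(f_\theta; f_Q)
  = \lim_{\epsilon \downarrow 0}
    \frac{\ell\big((1-\epsilon)\,\mathbf{f}_Q + \epsilon\,\mathbf{f}_\theta\big)
          - \ell(\mathbf{f}_Q)}{\epsilon} \\
  &= \sum_{i=1}^n \frac{f_\theta(X_i) - f_Q(X_i)}{f_Q(X_i)} .
\end{align*}
```

With $Q$ the point mass giving $f_Q = f_0$ and $f_\theta = f_1$, this is the LMP statistic $\sum_i (f_1(X_i) - f_0(X_i))/f_0(X_i)$.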

18 Mixture Model: A Diagnostic
Theorem 4.1 of Lindsay (1983a)
– A. The following three conditions are equivalent: (i) $Q^*$ maximizes $L(Q)$; (ii) $Q^*$ minimizes $\sup_\theta D(\theta;Q)$; (iii) $\sup_\theta D(\theta;Q^*) = 0$.
– B. Let $f^* = f_{Q^*}$. The point $(f^*, f^*)$ is a saddle point of $\Phi$, i.e., $\Phi(f_{Q'}; f^*) \le 0 = \Phi(f^*; f^*) \le \Phi(f^*; f_{Q''})$ for $Q', Q'' \in M$.
– C. The support of $Q^*$ is contained in the set of $\theta$ for which $D(\theta;Q^*) = 0$.
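The diagnostic can be tried on the $n=10$ sample from slide 20. The sketch below assumes mixing over $N(\theta, 1)$ densities and takes $D(\theta;Q) = \sum_i [f_\theta(X_i)/f_Q(X_i) - 1]$; the grid and function names are illustrative choices. It compares the one-point MLE (all mass at the sample mean) with the two-point fit reported on slide 21.

```python
import math

# n=10 sample from the simulation slides.
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def mixture_density(x, thetas, ps):
    """f_Q(x) for a mixing distribution Q with finite support."""
    return sum(p * normal_pdf(x, t) for t, p in zip(thetas, ps))

def gradient(theta, thetas, ps, xs):
    """D(theta; Q) = sum_i [f_theta(X_i) / f_Q(X_i) - 1]."""
    return sum(normal_pdf(x, theta) / mixture_density(x, thetas, ps) - 1.0
               for x in xs)

xbar = sum(xs) / len(xs)
one_thetas, one_ps = [xbar], [1.0]                         # one-point MLE
two_thetas, two_ps = [-0.487880, 0.929969], [0.388813, 0.611187]  # slide 21

grid = [-3.0 + 0.05 * k for k in range(141)]  # theta from -3 to 4
sup_one = max(gradient(t, one_thetas, one_ps, xs) for t in grid)
sup_two = max(gradient(t, two_thetas, two_ps, xs) for t in grid)
at_support = [gradient(t, two_thetas, two_ps, xs) for t in two_thetas]

print(round(sup_one, 4), round(sup_two, 4), [round(d, 5) for d in at_support])
```

A positive supremum for the one-point fit means, by part A, that it is not the global mixture MLE. Part C requires the gradient to vanish at the support points of a maximizer, which is what `at_support` checks for the two-point fit; whether that fit is global depends on the printed `sup_two`.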

19 Mixture Model The Assignment/Membership Function
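A minimal sketch of an assignment function, assuming it is the posterior probability that an observation $x$ belongs to each component of a fitted finite mixture, $p_j f_{\theta_j}(x) / f_Q(x)$. The $N(\theta,1)$ components, the use of the slide 21 fit, and the function names are assumptions for illustration.

```python
import math

# n=10 sample from the simulation slides.
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def assignment(x, thetas, ps):
    """Posterior probability that x belongs to each mixture component."""
    weighted = [p * normal_pdf(x, t) for t, p in zip(thetas, ps)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Two-point fit reported on the n=10 simulation slide.
thetas = [-0.487880, 0.929969]
ps = [0.388813, 0.611187]

probs = [assignment(x, thetas, ps) for x in xs]
for x, pr in zip(xs, probs):
    print(f"{x:9.5f}  {pr[0]:.3f}  {pr[1]:.3f}")

# At a maximum likelihood fit, the average membership equals the weight.
avg_upper = sum(pr[1] for pr in probs) / len(xs)
```

Thresholding these probabilities (for example at .5) gives one way to label observations, though with components this close together most points receive intermediate values.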

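20 Simulations: n=10: 5 points N(0,1), 5 points N(1,1)

| Group (0 = N(0,1), 1 = N(1,1)) | X |
|---|---|
| 0 | -0.34964 |
| 0 | -1.77582 |
| 0 | -0.92900 |
| 0 | 0.58061 |
| 0 | -0.36032 |
| 1 | 2.51937 |
| 1 | 0.59549 |
| 1 | 1.16238 |
| 1 | 0.76632 |
| 1 | 1.57752 |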

21 Simulations: n=10: 5 points N(0,1), 5 points N(1,1)

| Support point | p |
|---|---|
| -.487880 | .388813 |
| .929969 | .611187 |
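One way such a two-point fit could be computed is with EM, sketched below for a mixture of two $N(\theta,1)$ densities on the $n=10$ sample. This is an assumed approach with arbitrary starting values, not necessarily the algorithm behind the slide; convergence can be slow when the components overlap this much.

```python
import math

# n=10 sample from the simulation slides.
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def loglik(thetas, ps, xs):
    """Log likelihood of the finite mixture with support thetas, weights ps."""
    return sum(math.log(sum(p * normal_pdf(x, t) for t, p in zip(thetas, ps)))
               for x in xs)

def em_step(thetas, ps, xs):
    """One EM update of the support points and weights."""
    # E-step: membership probability of each X_i in each component.
    w = []
    for x in xs:
        num = [p * normal_pdf(x, t) for t, p in zip(thetas, ps)]
        tot = sum(num)
        w.append([v / tot for v in num])
    # M-step: weights are average memberships, support points weighted means.
    k, n = len(thetas), len(xs)
    col = [sum(w[i][j] for i in range(n)) for j in range(k)]
    new_ps = [c / n for c in col]
    new_thetas = [sum(w[i][j] * xs[i] for i in range(n)) / col[j]
                  for j in range(k)]
    return new_thetas, new_ps

xbar = sum(xs) / len(xs)
thetas, ps = [xbar - 1.0, xbar + 1.0], [0.5, 0.5]  # arbitrary start
history = [loglik(thetas, ps, xs)]
for _ in range(5000):
    thetas, ps = em_step(thetas, ps, xs)
    history.append(loglik(thetas, ps, xs))

print([round(t, 4) for t in thetas], [round(p, 4) for p in ps])
```

Each EM update keeps the mean of the mixing distribution equal to the sample mean, a property the tabulated fit also satisfies: $(-.487880)(.388813) + (.929969)(.611187) \approx .378691$, the mean of the ten observations.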

22 Simulations The Assignment Function

23 Simulations: n=30: 25 points N(0,1), 5 points N(1,1)

| Support point | p |
|---|---|
| -0.05537 | 0.867670 |
| 2.05801 | 0.132330 |

24 Simulations n=30: 25 points N(0,1), 5 points N(1,1)

25 Simulations: Another n=30: 25 points N(0,1), 5 points N(1,1)

| Support point | p |
|---|---|
| 0.78767 | 0.921009 |
| 3.30559 | 0.078991 |

26 Simulations Another n=30: 25 points N(0,1), 5 points N(1,1)

27 AA Data Francisco will discuss this and some other simulations in a moment.

28 Closing Comments
Is there an analogue (or alternative) of these ideas for the SCAN (or for the SCAN framework)?
– As an alternative, view the problem as having several (two) mechanisms creating observations: (i) background and (ii) infectious material is present.
– Just consider that the data are a pooling from all these sites. See if the data are a 2-component mixture. If so, try to assign the sites to these components. (You might use a thresholding of the assignment function to do this, or $p$ in the LMP test statistic.)
– Instead of the assignment function, consider the following based on the LMP test statistic. Define $L_i = (f_1(X_i) - f_0(X_i))/f_0(X_i)$. Let $L_{(1)} < L_{(2)} < \dots < L_{(n)}$ and let $j(i)$ denote the inverse rank, i.e., $L_{(i)} = L_{j(i)}$. For mixture or scanning purposes, consider the sets $C_i = \{j(n),\dots,j(n-i+1)\} = \{k : L_{(n-i+1)} \le L_k\}$. For mixtures with MLE $p^*$, assign $C_i$ to $f_1$ and $C_i^c$ to $f_0$, where $np^* \approx i$. For scanning purposes, look through the increasing sequence of sets $C_i$ for a spatial pattern to emerge.
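The ranking construction in the last comment can be sketched as follows, again on the $n=10$ sample and assuming $f_0 = N(0,1)$, $f_1 = N(1,1)$. The function names are hypothetical, and the set size $i=5$ is taken from the simulated proportion (5 of 10 points) rather than from a fitted $p^*$.

```python
import math

# n=10 sample from the simulation slides; indices 5..9 were drawn from N(1,1).
xs = [-0.34964, -1.77582, -0.92900, 0.58061, -0.36032,
      2.51937, 0.59549, 1.16238, 0.76632, 1.57752]

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def lmp_scores(xs):
    """L_i = (f1(X_i) - f0(X_i)) / f0(X_i) with f0 = N(0,1), f1 = N(1,1)."""
    return [(normal_pdf(x, 1.0) - normal_pdf(x, 0.0)) / normal_pdf(x, 0.0)
            for x in xs]

def top_set(scores, i):
    """C_i: indices of the i largest scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return set(order[:i])

scores = lmp_scores(xs)

# Nested sequence C_1, C_2, ..., C_n to scan for an emerging pattern.
nested = [top_set(scores, i) for i in range(1, len(xs) + 1)]

# Assign C_i to f1 with i close to n * p; here p = 0.5 (simulated proportion).
i = round(len(xs) * 0.5)
spurious = top_set(scores, i)
print(sorted(spurious))
```

For this particular sample the five largest scores happen to coincide with the five N(1,1) draws; with a separation of only one standard deviation that should not be expected in general.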

29 REFERENCES

Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, NY.

Grego, J., Hsi, Hsiu-Li, and Lynch, J. D. (1990). A strategy for analyzing mixed and pooled exponentials. Applied Stochastic Models and Data Analysis, 6, 59-70.

Lindsay, B. G. (1983a). The geometry of mixture likelihoods: a general theory. Ann. Statist., 11, 86-94.

Lindsay, B. G. (1983b). The geometry of mixture likelihoods, Part II: the exponential family. Ann. Statist., 11, 783-792.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry & Applications. NSF-CBMS lecture series, IMS/ASA.

Maguire, B. A., Pearson, E. S., and Wynn, A. H. A. (1952). The time interval between industrial accidents. Biometrika, 39, 168-180.

Proschan, F. (1963). Theoretical explanation of decreasing failure rate. Technometrics, 5, 375-383.
