Published by Carlee Boner. Modified over 10 years ago.
1
The Magnificent EMM
Margaret H. Dunham
Michael Hahsler, Mallik Kotamarti, Charlie Isaksson
CSE Department, Southern Methodist University, Dallas, Texas 75275
lyle.smu.edu/~mhd
mhd@lyle.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.
3/11/10, BYU
2
Objectives/Outline
- EMM Overview
- EMM + Stream Clustering
- EMM + Bioinformatics
3
Objectives/Outline
- EMM Overview
  - Why
  - What
  - How
- EMM + Stream Clustering
- EMM + Bioinformatics
4
Lots of Questions
- Why don't data miners practice what they preach?
- Why is training usually viewed as a one-time thing?
- Why do we usually ignore the temporal aspect of data streams?

Continuous learning
Interleave learning & application
Add time to online clustering
5
MM
A first-order Markov chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process depends solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
- $S = \{N_1, N_2, \ldots, N_m\}$, and
- $A = \{L_{ij} \mid i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, m\}\}$, and
- each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
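The definition of $P_{ij}$ above can be illustrated with a small sketch that estimates transition probabilities from an observed event sequence by counting consecutive pairs. The function name `estimate_transitions` and the example sequence are made up for illustration; they do not come from the slides.

```python
from collections import Counter, defaultdict


def estimate_transitions(events):
    """Estimate P(next | current) from one observed event sequence."""
    pair_counts = defaultdict(Counter)
    # Count every consecutive (current, next) pair in the sequence.
    for current, nxt in zip(events, events[1:]):
        pair_counts[current][nxt] += 1
    # Normalize each row so the outgoing probabilities of a state sum to 1.
    return {
        current: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for current, row in pair_counts.items()
    }


sequence = ["E1", "E2", "E1", "E3", "E1", "E2"]  # hypothetical example data
P = estimate_transitions(sequence)
```

Each row of `P` plays the role of the arc labels leaving one state: only the current state is needed to look up the distribution over the next one.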
6
Problem with Markov Chains
- The required structure of the MC may not be known at model construction time.
- As the real world being modeled by the MC changes, so should the structure of the MC.
- Not scalable: grows linearly with the number of events.
- Our solution:
  - Extensible Markov Model (EMM)
  - Cluster real-world events
  - Allow the Markov chain to grow and shrink dynamically
7
EMM (Extensible Markov Model)
- Time-varying discrete first-order Markov model
- Continuously evolves
- Nodes are clusters of real-world states.
- Learning continues during the prediction phase.
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid of cluster)
  - Nodes are added and removed as data arrives
8
EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node, $N_n$, and algorithms to modify it, where the algorithms include:
- EMMCluster, which defines a technique for matching input data at time t + 1 to existing states in the MC at time t.
- EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering result at time t + 1.
- EMMDecrement, which removes nodes from the EMM when needed.
9
EMM Cluster
- Nearest neighbor
- If none is "close", create a new node
- The label of a cluster is the centroid of its members
- O(n), where n is the number of states
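The nearest-neighbor step above might be sketched as follows. This is a minimal illustration, not the authors' implementation: Euclidean distance, the `threshold` parameter that defines "close", and the function name `emm_cluster` are all assumptions.

```python
import math


def emm_cluster(point, centroids, counts, threshold):
    """Return the index of the matching state, creating a new state if none is close.

    Assumptions (not from the slides): Euclidean distance, a fixed threshold.
    """
    best, best_dist = None, math.inf
    # Linear scan over existing states: O(n) in the number of states.
    for i, centroid in enumerate(centroids):
        d = math.dist(point, centroid)
        if d < best_dist:
            best, best_dist = i, d
    if best is None or best_dist > threshold:
        # Nothing is close enough: the point becomes a new node.
        centroids.append(list(point))
        counts.append(1)
        return len(centroids) - 1
    # Update the node label (centroid) as a running mean of its members.
    counts[best] += 1
    n = counts[best]
    centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], point)]
    return best


centroids, counts = [], []
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]  # hypothetical input data
states = [emm_cluster(p, centroids, counts, threshold=1.0) for p in points]
```

The first two points fall into the same state and the third opens a new one, so the number of states grows only when the data move away from what has already been seen.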
10
EMM Increment
Input vectors, in arrival order:
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
[Figure: the EMM after each input, growing to states N1, N2, N3 with transition probabilities such as 1/3, 2/3, 1/2, 1/1.]
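An incremental update of this kind can be sketched by keeping a count per node and per link and moving a "current node" pointer with each arrival. The class name `TransitionTracker`, its fields, and the state sequence below are hypothetical; the sequence is not the one shown in the figure.

```python
class TransitionTracker:
    """Online transition counts with a designated current node (illustrative sketch)."""

    def __init__(self):
        self.node_count = {}   # times each state was left
        self.link_count = {}   # times each (from, to) transition was taken
        self.current = None

    def increment(self, state):
        # Record the transition from the current node to the newly matched state.
        if self.current is not None:
            self.node_count[self.current] = self.node_count.get(self.current, 0) + 1
            edge = (self.current, state)
            self.link_count[edge] = self.link_count.get(edge, 0) + 1
        self.current = state

    def probability(self, i, j):
        # Transition probability = link count / count of departures from i.
        n = self.node_count.get(i, 0)
        return self.link_count.get((i, j), 0) / n if n else 0.0


emm = TransitionTracker()
for s in ["N1", "N2", "N1", "N3", "N1", "N2"]:  # hypothetical state sequence
    emm.increment(s)
```

Because only counts are stored, each arrival costs a constant number of dictionary updates once the matching state is known.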
11
EMMDecrement
[Figure: an EMM with states N1, N2, N3, N5, N6 before and after deleting N2.]
12
EMM Advantages
- Dynamic
- Adaptable
- Use of clustering
- Learns rare events
- Scalable:
  - Growth of the EMM is not linear in the size of the data.
  - Hierarchical feature of EMM
- Creation/evaluation in quasi-real time
- Distributed / hierarchical extensions
13
EMM Sublinear Growth
Servent Data
14
Growth Rate: Automobile Traffic
Minnesota Traffic Data
15
EMM River Prediction
16
Determining Rare Event
- The Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of that state.
- The Normalized Transition Probability ($NTP_{mn}$) from one state, $S_m$, to another, $S_n$, is a normalized transition count.
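The formulas for these two measures did not survive in this transcript. One reading consistent with the wording "normalized count" is shown below; it is an assumption, not a quotation from the slide. Here $CN_i$ (number of times state $S_i$ has occurred) and $CL_{mn}$ (number of observed transitions from $S_m$ to $S_n$) are symbol names introduced for illustration, and normalizing both by the total state count is likewise assumed.

```latex
OF_i = \frac{CN_i}{\sum_{k} CN_k},
\qquad
NTP_{mn} = \frac{CL_{mn}}{\sum_{k} CN_k}
```

Under this reading, a state or transition with a small value relative to the rest of the model is a candidate rare event.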
17
EMM Rare Event Detection
- Intrusion data: train on DARPA 1999, test on DARPA 2000
- Ozone data: UCI ML, Jaccard similarity, 2536 instances, 73 attributes, 73 ozone days
18
Objectives/Outline
- EMM Overview
- EMM + Stream Clustering
  - Handle evolving clusters
  - Incorporate time in clustering
- EMM + Bioinformatics
19
Stream Data
A growing number of applications generate streams of data:
- Computer network monitoring data
- Call detail records in telecommunications
- Highway transportation traffic data
- Online web purchase log records
- Sensor network data
- Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions
Clustering techniques play a key role in modeling and analyzing this data.
20
Stream Data Format
- Events arrive in a stream.
- At any time t, we can view the state of the problem as represented by a vector of n numeric values: $V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

        V_1    V_2    ...   V_q
S_1     S_11   S_12   ...   S_1q
S_2     S_21   S_22   ...   S_2q
...     ...    ...    ...   ...
S_n     S_n1   S_n2   ...   S_nq

Time runs left to right across the columns.
21
Traditional Clustering
22
TRAC-DS (Temporal Relationship Among Clusters for Data Streams)
23
Motivation
- Temporal ordering is a major feature of stream data.
- Many stream applications depend on this ordering:
  - Prediction of future values
  - Anomaly (rare event) detection
  - Concept drift
24
Stream Clustering Requirements
- Dynamic updating of the clusters
- Completely online
- Identify outliers
- Identify concept drifts
- Barbara [2]:
  - compactness
  - fast
  - incremental processing
25
Data Stream Clustering
- At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far.
- Instead of the whole partitions $C_1, C_2, \ldots, C_k$, only synopses $C^c_1, C^c_2, \ldots, C^c_k$ are available, and k is allowed to change over time.
- The summaries $C^c_i$ with $i = 1, 2, \ldots, k$ typically contain information about the size, distribution, and location of the data points in $C_i$.
26
TRAC-DS NOTE
- TRAC-DS is not:
  - Another stream clustering algorithm
- TRAC-DS is:
  - A new way of looking at clustering
  - Built on top of an existing clustering algorithm
- TRAC-DS may be used with any stream clustering algorithm
27
TRAC-DS Overview
28
TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays ζ with an EMM M in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with $i, j = 1, 2, \ldots, k$.
(3) The EMM M is created online together with the data stream clustering.
29
Stream Clustering Operations *
- $q_{assign\ point}(ζ, x)$: Assigns the new data point x to an existing cluster.
- $q_{new\ cluster}(ζ, x)$: Creates a new cluster.
- $q_{remove\ cluster}(ζ, x)$: Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary $C^c_i$ is removed from ζ and k is decremented by one.
- $q_{merge\ clusters}(ζ, x)$: Merges two clusters.
- $q_{fade\ clusters}(ζ, x)$: Fades the cluster structure.
- $q_{split\ clusters}(ζ, x)$: Splits a cluster.
* Inspired by MONIC [13]
30
TRAC-DS Operations
- $r_{assign\ point}(M, s_c, y)$: Assigns the new data point to the state representing an existing cluster.
- $r_{new\ cluster}(M, s_c, y)$: Creates a state for a new cluster.
- $r_{remove\ cluster}(M, s_c, y)$: Removes a state.
- $r_{merge\ clusters}(M, s_c, y)$: Merges two states.
- $r_{fade\ clusters}(M, s_c, y)$: Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.
- $r_{split\ clusters}(M, s_c, y)$: Splits states.
Y clustering operations.
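The fading step can be illustrated with a short sketch that applies the decay $f(t) = 2^{-\lambda t}$ to stored transition weights. The function name `fade`, the dictionary layout, and the example values are assumptions made for illustration; the slides do not say how the weights are stored.

```python
def fade(link_counts, lam, elapsed=1.0):
    """Scale every transition weight by 2 ** (-lam * elapsed)."""
    factor = 2.0 ** (-lam * elapsed)
    return {edge: weight * factor for edge, weight in link_counts.items()}


counts = {("c1", "c2"): 4.0, ("c1", "c3"): 2.0}   # hypothetical weights
faded = fade(counts, lam=0.5, elapsed=2.0)         # factor = 2 ** -1 = 0.5
```

With this decay, a weight halves every 1/λ time units. Scaling all weights by the same factor leaves the current transition probabilities unchanged, but transitions observed after fading count for more than older ones.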
31
TRAC-DS Example
32
Objectives/Outline
- EMM Overview
- EMM + Stream Clustering
- EMM + Bioinformatics
  - Background
  - Preprocessing
  - Classification
  - Differentiation
33
DNA
- Basic building blocks of organisms
- Located in the nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
34
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: Amino Acid
Source: www.bioalgorithms.info; chapter 6; Gene Prediction
35
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif
36
RNA
- Ribonucleic acid
- Contains A, C, G, but U (uracil) instead of T
- Single stranded
- May fold back on itself
- Needed to create proteins
- Moves around cells and can act like a messenger
- mRNA moves out of the nucleus to other parts of the cell
37
The Magical 16S
- Ribosomal RNA (rRNA) is at the heart of the protein creation process
- 16S rRNA
  - About 1542 nucleotides in length
  - In all living organisms
  - Important in the classification of organisms into phyla and class
- PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.
- OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?
38
Using EMM with RNA Data
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving window:
             A  C  G  T
Pos 0-8      2  3  3  1
Pos 1-9      1  3  3  2
...
Pos 34-42    2  4  2  1
Construct EMM with nodes representing clusters of count vectors
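The moving-window counts on this slide can be reproduced with a few lines. The window width of 9 follows from the slide's position labels (0-8, 1-9, 34-42); the function name `window_counts` is hypothetical.

```python
def window_counts(seq, width=9, alphabet="acgt"):
    """Slide a window over seq and count each nucleotide, in A, C, G, T order."""
    seq = seq.lower()
    return [
        tuple(seq[i:i + width].count(base) for base in alphabet)
        for i in range(len(seq) - width + 1)
    ]


sequence = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"  # sequence from the slide
vectors = window_counts(sequence)
```

The resulting count vectors are the inputs that would then be clustered into EMM nodes, one vector per window position.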
39
EMM for Classification
40
TRAC-DS and Bioinformatics
- Efficient
  - Alignment-free sequence analysis
  - Clustering reduces size of model
- Flexible
  - Any sequence
  - Applicability to metagenomics
- Scoring based on similarity between EMMs or between an EMM and an input sequence
- Applications
  - Classification
  - Differentiation
41
Profile EMMs for Organism Classification
42
Profile EMM: E. coli
43
Differentiating Strains
- Is it possible to identify different species of the same genus?
- Initial test with EMM:
  - Bacillus has 21 species
  - Construct an EMM for each species using the training set (64%)
  - Test by matching unknown strains (36%) and placing each in the closest EMM
  - All unknown strains correctly classified except one: accuracy of 95%
44
Bibliography
1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.
2) D. Barbara, "Requirements for clustering data streams," SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.
3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, and Jim Waddle, "Visualization of DNA/RNA Structure using Temporal CGRs," Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams: Theory and practice," IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.
5) Michael Hahsler and Margaret H. Dunham, "TRACDS: Temporal Relationship Among Clusters for Data Streams," October 2009, submitted to SIAM International Conference on Data Mining.
6) Jie Huang, Yu Meng, and Margaret H. Dunham, "Extensible Markov Model," Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, "Risk Leveling of Network Traffic Anomalies," International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
8) Charlie Isaksson and Margaret H. Dunham, "A Comparative Study of Outlier Detection," July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.
9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, "A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data," Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24, 2009.
10) Yu Meng and Margaret H. Dunham, "Efficient Mining of Emerging Events in a Dynamic Spatiotemporal," Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
11) Yu Meng and Margaret H. Dunham, "Mining Developing Trends of Dynamic Spatiotemporal Data Streams," Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html (2008).
13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, "MONIC: Modeling and monitoring cluster transitions," Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp 706-711, 2006.