
10/31/2012, METU Spatiotemporal Stream Mining using TRACDS Middle East Technical University October 31, 2012 Margaret H Dunham, Michael Hahsler, Yu Su,


1 10/31/2012, METU Spatiotemporal Stream Mining using TRACDS Middle East Technical University October 31, 2012 Margaret H Dunham, Michael Hahsler, Yu Su, Sudheer Chelluboina, and Hadil Shaiba Computer Science and Engineering This work is supported by NSFIIS-0948893

2 10/31/2012, METU IDA@SMU Intelligent Data Analysis Lab Team led by Margaret H. Dunham and Michael Hahsler Mission At IDA@SMU we create novel techniques inspired by knowledge discovery, data mining, machine learning, artificial intelligence and statistical analysis to work with data from various sources. Current Focus –Massive data stream modeling: TRACDS™ –Hurricane intensity prediction –Effective metagenomic classification for the Human Genome Project –Recommender systems: R/Apache Mahout http://www.lyle.smu.edu/IDA

3 10/31/2012, METU Outline Spatiotemporal Stream Data TRACDS Hurricane Intensity Prediction PIIH PIIH online

4 10/31/2012, METU From Sensors to Streams Data captured and sent by a set of sensors is usually referred to as “stream data”: a real-time sequence of encoded signals that contain the desired information. It is a continuous, ordered (implicitly by arrival time, or explicitly by timestamp or geographic coordinates) sequence of items. It may be viewed as arriving in discrete time intervals. Stream data is infinite: the data keeps coming. Examples: weather data, network data (VoIP), traffic data.

5 10/31/2012, METU Stream Data Format Events arriving in a stream. At any time t, we can view the state of the problem as represented by a vector of n numeric values: $V_t = \langle S_{1t}, S_{2t}, \dots, S_{nt} \rangle$. Time runs across the columns:

        V_1    V_2    …    V_q
S_1     S_11   S_12   …    S_1q
S_2     S_21   S_22   …    S_2q
…       …      …      …    …
S_n     S_n1   S_n2   …    S_nq

6 10/31/2012, METU Modeling Stream Data –Summarization (Synopsis) of data –Temporal and Spatial –Dynamic –Continuous (infinite stream) –Concept Drift Learn Forget –Sublinear growth rate - Clustering

7 10/31/2012, METU MM A first-order Markov chain is a finite or countably infinite sequence of events $\{E_1, E_2, \dots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process depends solely on the current state. A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that $S = \{N_1, N_2, \dots, N_m\}$ and $A = \{L_{ij} \mid i \in \{1, 2, \dots, m\},\ j \in \{1, 2, \dots, m\}\}$, and each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
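
The transition probabilities of such a model can be estimated from an observed state sequence by counting consecutive pairs. The following is a minimal sketch, not code from the presentation; the function name `transition_probabilities` and the example sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P_ij = P(N_j | N_i) from consecutive pairs in a state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so the outgoing probabilities sum to 1.
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


# Hypothetical sequence of visited states.
probs = transition_probabilities(["N1", "N2", "N1", "N3", "N1", "N2"])
```

Each row of the result holds the outgoing arcs of one state, so only arcs that were actually observed are stored.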

8 10/31/2012, METU Problem with Markov Chains The required structure of the MC may not be certain at the model construction time. As the real world being modeled by the MC changes, so should the structure of the MC. Not scalable – grows linearly as number of events. Our solution: –Extensible Markov Model (EMM) –Cluster real world events –Allow Markov chain to grow and shrink dynamically

9 10/31/2012, METU EMM (Extensible Markov Model) Time Varying Discrete First Order Markov Model Continuously evolves Nodes are clusters of real world states. Learning continues during application phase. Learning: –Transition probabilities between nodes –Node labels (centroid of cluster) –Nodes are added and removed as data arrives Applications: –Anomaly/Rare Event Detection –Prediction –Classification

10 10/31/2012, METU EMM Definition Extensible Markov Model (EMM): at any time t, EMM consists of an MC with designated current node and algorithms to modify it, where algorithms include: EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1. EMMDecrement algorithm, which removes nodes from the EMM when needed.

11 10/31/2012, METU EMM Cluster Nearest neighbor (or any clustering technique). If no existing node is “close”, create a new node. A cluster is labeled by the centroid of its members or by a clustering feature. Matching is O(n), where n is the number of states.
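
The clustering and increment steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `EMM`, the Euclidean distance, the fixed `threshold` that defines "close", and the running-mean centroid update are all assumptions.

```python
import math
from collections import defaultdict


class EMM:
    """Toy extensible Markov model: nearest-neighbor clustering plus transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold       # assumed distance below which a node is "close"
        self.centroids = []              # node labels (cluster centroids)
        self.sizes = []                  # number of members per node
        self.counts = defaultdict(int)   # (i, j) -> observed transitions from i to j
        self.current = None              # designated current node

    def _cluster(self, x):
        # O(n) scan over the n existing nodes for the nearest centroid.
        best, best_d = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > self.threshold:
            # No node is close enough: extend the model with a new node.
            self.centroids.append(list(x))
            self.sizes.append(1)
            return len(self.centroids) - 1
        # Otherwise absorb the point and update the centroid as a running mean.
        n = self.sizes[best] + 1
        self.centroids[best] = [c + (v - c) / n for c, v in zip(self.centroids[best], x)]
        self.sizes[best] = n
        return best

    def update(self, x):
        """Process one input vector: match it to a node, then count the transition."""
        node = self._cluster(x)
        if self.current is not None:
            self.counts[(self.current, node)] += 1
        self.current = node
        return node

    def transition_probability(self, i, j):
        total = sum(c for (a, _), c in self.counts.items() if a == i)
        return self.counts.get((i, j), 0) / total if total else 0.0


emm = EMM(threshold=1.0)
for point in [(0, 0), (0.1, 0), (5, 5), (0, 0.1), (5.1, 5)]:
    emm.update(point)
```

Because nearby points are absorbed into existing nodes, the number of nodes grows with the number of distinct regions visited rather than with the number of events.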

12 10/31/2012, METU EMM Advantages Dynamic Adaptable Use of clustering Learns rare event Scalable: –Growth of EMM is not linear on size of data. –Hierarchical feature of EMM Creation/evaluation quasi-real time

13 10/31/2012, METU EMM Sublinear Growth Servent Data

14 10/31/2012, METU Growth Rate Automobile Traffic Minnesota Traffic Data

15 10/31/2012, METU EMM Learning Input vectors: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>. [Figure: the EMM after each input, growing from the single node N1 to the nodes N1, N2 and N3, with transition probabilities labeled on the arcs]

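16 10/31/2012, METU EMM Forgetting [Figure: an EMM with nodes N1, N2, N3, N5 and N6, and the same EMM with nodes N1, N3, N5 and N6 after N2 is removed; arcs are labeled with transition probabilities]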

17 10/31/2012, METU Outline Spatiotemporal Stream Data TRACDS Hurricane Intensity Prediction PIIH PIIH online

18 10/31/2012, METU Traditional Stream Clustering Standard Data Stream Clustering ignores temporal aspect of data

19 10/31/2012, METU Stream Clustering Clusters change over time – they move Some techniques use micro clusters/reclustering Reclustering is often off line (batch while stream data comes). STREAM –Partitions stream data into segments –Clusters each segment (k-medians) –Iteratively reclusters the centers of these clusters S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. “Clustering data streams: Theory and practice.” IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.

20 10/31/2012, METU Stream Clustering Requirements Dynamic updating of the clusters Completely online Identify outliers Identify concept drifts Compactness Fast Incremental processing

21 Temporal Relationship Among Clusters in Data Streams

22 10/31/2012, METU TRACDS NOTE TRACDS is not: –Another stream clustering algorithm TRACDS is: –A new way of looking at clustering –Built on top of an existing clustering algorithm TRACDS may be used with any stream clustering algorithm

23 TRAC-DS Overview 10/31/2012, METU

24 TRACDS Definition Given a data stream clustering ζ, a temporal relationship among clusters (TRACDS) overlays ζ with an EMM M in such a way that the following are satisfied: (1) There is a one-to-one correspondence between the clusters in ζ and the states S in M. (2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, …, k. (3) The EMM M is created online together with the data stream clustering.
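
The three conditions can be illustrated with a thin layer that sits on top of any clustering function and records the order in which clusters are visited. This is a minimal sketch, not the TRACDS implementation; `TemporalOverlay` and the toy `grid_clusterer` are hypothetical names standing in for the EMM M and the clustering ζ.

```python
from collections import defaultdict


class TemporalOverlay:
    """Overlay transition counts on an existing stream clustering (one state per cluster)."""

    def __init__(self, assign):
        self.assign = assign  # callable: data point -> cluster id (stands in for ζ)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.current = None

    def observe(self, point):
        """Cluster the point and update the overlay online, in the same pass."""
        cluster = self.assign(point)
        if self.current is not None:
            self.counts[self.current][cluster] += 1
        self.current = cluster
        return cluster

    def a(self, i, j):
        """Estimated probability that a point in cluster i is followed by one in cluster j."""
        row = self.counts.get(i)
        if not row:
            return 0.0
        return row.get(j, 0) / sum(row.values())


def grid_clusterer(point, cell=10.0):
    # Hypothetical stand-in clusterer: the cluster id is the index of a fixed-width cell.
    return int(point // cell)


overlay = TemporalOverlay(grid_clusterer)
for value in [1.0, 3.0, 12.0, 4.0, 15.0, 18.0]:
    overlay.observe(value)
```

The overlay never inspects the data itself, only the cluster ids, which is why it can be paired with any stream clustering algorithm.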

25 TRACDS Clustering Operations

26 10/31/2012, METU TRACDS Example C EMM http://www.lyle.smu.edu/IDA/TRACDS

27 10/31/2012, METU Outline Spatiotemporal Stream Data TRACDS Hurricane Intensity Prediction PIIH PIIH online

28 10/31/2012, METU Lower 9th Ward of New Orleans, Louisiana, Feb 27, 2006 Photographer: Mackenzie Schott

29 Hurricanes The major issues in forecasting hurricanes are predicting their tracks and their intensities. Compared with track prediction, intensity prediction is still relatively inaccurate. Hurricanes are tropical cyclones with sustained winds of at least 64 kt (119 km/h, 74 mph). Time steps: [0h, 12h, 24h, …, 120h]

30 Hurricane Intensity Prediction Hurricane intensity: maximum sustained surface wind, the highest 1-minute average wind speed at 10 m above the surface. Rapid intensification: a 24-h increase in maximum wind speed >= 30 knots. “Maximum Sustained Wind”. Wikipedia. Wikimedia Foundation, 27 August 2011. Web. 4 December 2011. Retrieved from http://en.wikipedia.org/wiki/Maximum_sustained_wind. “Rapid Intensification,” accessed on 10/24/12, http://www.hurrnet.com/tutorial/forecasts/intensity/rapid.htm.

31 10/31/2012, METU Hurricane Saffir–Simpson Hurricane Scale [1]: Category 5: wind speed >= 136 knots; Category 4: wind speed 114-135 knots; Category 3: wind speed 96-113 knots; Category 2: wind speed 83-95 knots; Category 1: wind speed 64-82 knots; Tropical storm: wind speed 35-63 knots; Tropical depression: wind speed 0-34 knots. [1] “Maximum Sustained Wind”. Wikipedia. Wikimedia Foundation, 27 August 2011. Web. 4 December 2011. Retrieved from http://en.wikipedia.org/wiki/Maximum_sustained_wind.
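
The scale on this slide and the 30-knot rapid-intensification criterion from the previous slide translate directly into threshold checks. The sketch below uses the thresholds exactly as listed on the slides; the names `SCALE`, `classify` and `is_rapid_intensification` are illustrative.

```python
# Lower bounds in knots, as listed on the slide, from strongest to weakest.
SCALE = [
    (136, "Category 5"),
    (114, "Category 4"),
    (96, "Category 3"),
    (83, "Category 2"),
    (64, "Category 1"),
    (35, "Tropical storm"),
    (0, "Tropical depression"),
]


def classify(wind_kt):
    """Return the slide's category label for a maximum sustained wind in knots."""
    for lower_bound, label in SCALE:
        if wind_kt >= lower_bound:
            return label
    raise ValueError("wind speed must be non-negative")


def is_rapid_intensification(wind_now_kt, wind_24h_later_kt):
    """True if the 24-hour increase in maximum wind speed is at least 30 knots."""
    return wind_24h_later_kt - wind_now_kt >= 30
```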

32 10/31/2012, METU Predicting Intensity Statistical models predict intensity based on measured stream data: the current state of the storm, the history of this storm, and how similar storms behaved in the past. Regression models are the most popular. NOAA (branch of U.S. Government) –Collects stream data. –Updates its models yearly based on data from the previous year. –Makes predictions in a quasi-real-time manner.

33 10/31/2012, METU Hurricane Intensity Prediction Hurricane Katrina (2005): –Category 5, 175 mph –Damage: estimated $125 billion –Fatalities: >1,800 (“Hurricane Katrina – Most Destructive Hurricane Ever to Strike the U.S.”, August 28, 2005, February 12, 2007, http://www.katrina.noaa.gov/) “Objective: Improve forecast skill to accuracy and confidence levels required for decision-making and risk management” (NOAA’s National Weather Service Strategic Plan 2010-2020). Intensity, and rapid intensification in particular, is very difficult to predict. The National Hurricane Center (NHC) uses –Dynamical models: computationally intensive and slow –Statistical models: Statistical Hurricane Intensity Prediction Scheme (SHIPS) Current storm: SANDY, http://www.nhc.noaa.gov/archive/2012/SANDY_graphics.shtml [Figure: path of Hurricane Katrina (2005); color shows intensity]

34 Remote Sensing Storm features are gathered from observations of the earth using remote sensing. Real-time data are gathered every few hours and stored in large databases. More than 20 years of historical data on the earth's behavior is stored in the database. Methods: satellite, buoy, ship, aircraft.

35 10/31/2012, METU Satellite Images Analogous to how the eye and a camera capture images. Passive: the sun emits light to earth; light hits objects; light is reflected from the objects to the satellite sensors; the image is captured. Each object reflects a different color, which helps analyze the image and understand the actual representation on earth. Active: satellites emit energy toward objects; radiation is reflected to the sensors; the reflection is measured and analyzed.

36 10/31/2012, METU Satellite Images Source:http://maps.unomaha.edu/Peterson/gis/notes/RS2.htm

37 10/31/2012, METU Buoy and Ships Used to gather direct measurements within the sea. Used when the readings gathered by the satellite are not accurate. Buoys: form a network in the ocean and are used to take hourly measurements  Such as: sea surface temperature, wind speed and direction, and humidity. Ships: Observations are taken occasionally Crews ride ships and take measurements using different tools  Such as: anemometers that are used to measure the wind speed

38 10/31/2012, METU Aircraft Reconnaissance Used to gather data by flying above the hurricanes The aircraft includes different tools that are used for measurements They try to find the center of the storm They fly on top of the storm to provide detailed and more accurate information about the storm This could be very dangerous and might cause damage to the aircraft and the crew

39 10/31/2012, METU Hurricane Data

40 10/31/2012, METU Hurricane Data A first-order Markov chain is a sequence of random variables $X_1, X_2, X_3, \dots$ with the Markov property: the current state depends only on the previous state. We assume hurricane states preserve the first-order Markov property. For instance, let $s_t$ denote the current state. Then $s_t$ depends only on the state $s_{t-1}$, where t − 1 indicates the time interval preceding t. [Figure: a hurricane observed at 0 hours, 12 hours, 24 hours, …, 120 hours as a chain of states $s_t$, $s_{t+1}$, $s_{t+2}$, $s_{t+3}$, …, with arrows marking the dependence]

41 10/31/2012, METU Outline Spatiotemporal Stream Data TRACDS Hurricane Intensity Prediction PIIH PIIH online

42 10/31/2012, METU Hurricane Data The data contains 16 predictors. The dataset is formed by time-ordered records at 12-hour intervals and contains the hurricane data from seasons 1982 to 2003. [Figure labels: hurricane 1, hurricane 2, …, hurricane 274, each with records at 0h, 12h, 24h, …; columns marked "16 predictors" and "Intensity"; time axis from 1982 to 2003] Records shown on the slide:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
25,0,1,-5.83,668,0,140,14.9,-53.5,13.25,40.5,23,6.6,27,372.5,19600
25,0,1,-5.83,708,0,140,12.7,-53.45,13.65,37.5,17.5,5.69,4,317.5,19600
30,5,1,-3.58,682,150,135,12.75,-53.35,13.25,34,1.5,5.79,15,382.5,18225
35,5,1,-4.9,674,175,130,14.2,-53.35,13.4,33,-12,6.66,-13,497,16900
50,15,1,0.44,681,750,113.52,17.1,-53.15,13.2,35,-20,8.32,-7,855,12885.79
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
30,0,0.99,-7.02,656,0,124.55,19.05,-52.55,14.75,51,0.5,6.68,45,571.5,15512.49
30,0,0.98,-7.02,675,0,123.75,17.3,-52.6,14.15,54,5,6.63,22,519,15314.28
35,5,0.98,-4.16,722,175,119.55,17.9,-52.6,14.65,58,10,7.43,34,626.5,14292
65,30,0.97,4.09,635,1950,88.77,19.15,-52.1,14.7,54.5,27.5,8.63,33,1244.75,7879.26
75,10,0.97,6.25,724,750,70.08,17.8,-52.15,12.55,54,48.5,8.61,45,1335,4910.92
95,20,0.96,9.17,641,1900,37.59,14.85,-52.9,11.1,56.5,55,7.87,15,1410.75,1413.13
95,0,0.96,7.2,691,0,33.33,15.6,-53.45,9.25,51.5,44.5,8.97,32,1482,1110.98
95,0,0.95,0.82,713,0,35.62,17.9,-53.25,7.85,47,38,10.72,31,1700.5,1268.43
95,0,0.95,2.4,813,0,28.12,20.85,-52.65,7.25,45,45,12.84,63,1980.75,790.65
115,20,0.93,10.65,635,2300,-11.1,24.45,-52.7,4.55,41.5,57.5,15.81,24,2811.75,123.2
110,-5,0.93,14.51,622,-550,-26.24,30.7,-53.55,1.15,40.5,50.5,21.2,28,3377,688.71
90,-20,0.91,18.15,613,-1800,-17.97,37.05,-53.95,0,46,29.5,27.08,42,3334.5,322.99
70,-20,0.91,21.86,668,-1400,1.01,40.3,-53.7,0,52.5,20,30.72,41,2821,1.02
70,0,0.89,26.22,688,0,2.35,45.05,-52.7,0.25,50.5,37.5,35.18,31,3153.5,5.5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
……
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
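
Records in this layout can be grouped into per-storm sequences before model construction. The sketch below assumes that the all-zero rows act as separators between hurricanes; the slide does not state this explicitly, and the helper name `split_storms` is hypothetical.

```python
def split_storms(lines):
    """Group comma-separated records into storms, treating all-zero rows as separators.

    Assumption: an all-zero row marks a boundary between two hurricanes.
    """
    storms, current = [], []
    for line in lines:
        values = [float(v) for v in line.split(",")]
        if all(v == 0 for v in values):
            if current:            # close the storm collected so far
                storms.append(current)
                current = []
        else:
            current.append(values)
    if current:                    # last storm when there is no trailing separator
        storms.append(current)
    return storms


# Three records copied from the slide, framed by separator rows.
sample = [
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
    "25,0,1,-5.83,668,0,140,14.9,-53.5,13.25,40.5,23,6.6,27,372.5,19600",
    "25,0,1,-5.83,708,0,140,12.7,-53.45,13.65,37.5,17.5,5.69,4,317.5,19600",
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
    "30,0,0.99,-7.02,656,0,124.55,19.05,-52.55,14.75,51,0.5,6.68,45,571.5,15512.49",
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
]
storms = split_storms(sample)
```

Each storm is then a time-ordered list of 16-value records at 12-hour steps, which is the form a per-storm state sequence needs.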

43 10/31/2012, METU Construct EMM

44 10/31/2012, METU Use EMM for Prediction

45 10/31/2012, METU EMM, TRACDS and Hurricane Data Approach: Using TRACDS algorithms, construct multiple EMMs. One will be built for each time point into the future for which predictions are to be made: 12 hours, …, 120 hours. NOAA provides 16 different features or predictors (attribute values). Clustering is performed based on a distance calculation from input feature vector to centroid of clusters in EMMs. However the importance of these to intensity prediction is not uniform. How can we determine weight for each feature? Used during clustering.

46 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) WFL-EMM assumes that the different predictors contribute differently to the prediction. In WFL-EMM, a weight vector $u = \langle u_1, u_2, \dots, u_n \rangle$ indicates the weights for the different predictors, where $u_i \in [0, 1]$. $u_i = 1$ means the ith predictor is important and $u_i = 0$ implies that the ith predictor is ignored. [Figure: weights between 0 and 1 for predictors f1 to f7 of the vectors V_1, V_2, …]

47 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) The question is how to find a fit weight vector $u = \langle u_1, u_2, \dots, u_n \rangle$ for hurricane intensity prediction. A genetic algorithm (GA) is introduced in WFL-EMM to find the fittest weight vector, the one that gives the smallest prediction error. [Figure: GA learning process]

48 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) Given a weight vector $u = \langle u_1, u_2, \dots, u_n \rangle$, the data are transformed in two steps. Normalization: normalize all the predictors to the range [0, 1]. First standardize the predictor values by $z_i = (x_i - \bar{x}) / sd(x)$, where $\bar{x}$ and sd(x) are the mean and standard deviation of the ith predictor. Then a non-linear normalization with a damping coefficient maps $z_i$ to the interval [0, 1] (formula not preserved in this transcript). Transformation: assume a normalized record $d = \langle d_1, d_2, \dots, d_n \rangle$. Then the record is transformed into d’ using the weights u (formula not preserved in this transcript).
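
The two-step transformation can be sketched as below. The standardization follows the slide. The non-linear mapping and the weighting step are assumptions, because the transcript does not preserve those formulas: a logistic function with a damping coefficient and an element-wise product with u are used here purely for illustration.

```python
import math
import statistics


def normalize_column(values, damping=1.0):
    """Standardize one predictor, then squash it into (0, 1).

    The logistic squashing with `damping` is an assumed stand-in for the
    non-linear normalization whose formula is missing from the transcript.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed choice)
    out = []
    for x in values:
        z = (x - mean) / sd if sd > 0 else 0.0
        out.append(1.0 / (1.0 + math.exp(-z / damping)))
    return out


def apply_weights(record, weights):
    """Assumed transformation: scale each normalized predictor d_i by its weight u_i."""
    return [u * d for u, d in zip(weights, record)]


normalized = normalize_column([1.0, 2.0, 3.0])
weighted = apply_weights([0.5, 0.2], [1.0, 0.0])
```

With a weight of 0 a predictor contributes nothing to any later distance computation, which matches the slide's statement that such a predictor is ignored.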

49 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM)

50 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) The question is how to find a fit weight vector $u = \langle u_1, u_2, \dots, u_n \rangle$ for hurricane intensity prediction. These weights are used during clustering and are applied to the distance/similarity measure used for clustering. A genetic algorithm (GA) is introduced in WFL-EMM to find the fittest weight vector, the one that gives the smallest prediction error. [Figure: GA learning process]

51 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) GAs try to locate a fit solution in a solution space. The weight vector $u = \langle u_1, u_2, \dots, u_n \rangle$ ranges over the space $[0, 1]^n$, since each $u_i$ is a real value in [0, 1]. [Figure: a solution space with a fit solution marked]

52 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) Genetic algorithm evolution: each time, two chromosomes are selected randomly from the ith population with a probability proportional to their fitness, where a chromosome is a Gray code string of a weight vector u. [Figure: GA learning process, with chromosome 1 and chromosome 2 drawn from population i]

53 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) Genetic algorithm evolution: chromosome 1 and chromosome 2 produce a new chromosome through crossover, mutation and inversion. Mutation: randomly alter one or more bits in the offspring based on a given probability. Inversion: randomly select a break point in a chromosome and then exchange the position of the two pieces. Calculate the fitness of the obtained chromosome and place it into population i+1. [Figure: GA learning process]
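
The operators named on these slides can be sketched on bit strings as follows. This is an illustrative sketch, not the authors' code: the 8 bits per weight (`BITS`), single-point crossover, and all function names are assumptions. Inversion is implemented as the slide describes it, by swapping the two pieces around a break point.

```python
import random

BITS = 8  # assumed number of bits used to encode each weight


def to_gray(n):
    return n ^ (n >> 1)


def from_gray(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode(weights):
    """Encode weights in [0, 1] as one Gray-coded bit string."""
    levels = (1 << BITS) - 1
    return "".join(format(to_gray(round(u * levels)), f"0{BITS}b") for u in weights)


def decode(chromosome):
    """Decode a Gray-coded bit string back into a weight vector."""
    levels = (1 << BITS) - 1
    return [
        from_gray(int(chromosome[i:i + BITS], 2)) / levels
        for i in range(0, len(chromosome), BITS)
    ]


def select(population, fitness, rng):
    """Pick two chromosomes with probability proportional to fitness (higher is better)."""
    return rng.choices(population, weights=fitness, k=2)


def crossover(a, b, rng):
    """Single-point crossover (assumed variant): head of a joined to tail of b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


def invert(chromosome, rng):
    """Pick a break point and exchange the position of the two pieces."""
    point = rng.randrange(1, len(chromosome))
    return chromosome[point:] + chromosome[:point]
```

Since the slides describe fitness in terms of prediction error, a selection step like this one would need a score that grows as the error shrinks, for example an inverse of the error; that mapping is not specified in the transcript.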

54 10/31/2012, METU Weighted Feature Learning -Extensible Markov Model (WFL-EMM) GA learning process: fitness of the chromosome. A chromosome is first decoded into a weight vector u. This u is applied to generate a G EMM by using the training data. Then the fitness is calculated by either mean absolute deviation (MAD) or root mean square error (RMSE) based on the testing data. The fittest weight vector u is located during the evolution of the GA. (The fitness formula is not preserved in this transcript.)
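
The two error measures named on this slide have standard definitions, sketched below. How the presentation turns the error into a fitness score is not preserved, so only the error measures themselves are shown; the function names are illustrative.

```python
import math


def mad(actual, predicted):
    """Mean absolute deviation between observed and predicted intensities."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error between observed and predicted intensities."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )
```

RMSE penalizes large misses more heavily than MAD, which is one reason the two can rank candidate weight vectors differently.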

55 10/31/2012, METU Results Input parameters of the experiments.

56 10/31/2012, METU Results - Experiment 1: Incremental training and testing for the periods from 2001 to 2003 (set RMSE as fitness). The model is trained on the data from 1982 to 2000 and evaluated using the data of 2001. Then the model is trained on the data from 1982 to 2001 and evaluated using the data of 2002 etc. For each time interval in [12h, 24h, …, 120h], the average error is computed based on the errors of 2001, 2002 and 2003.
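
The incremental protocol of Experiment 1 amounts to an expanding training window. The sketch below only generates the season splits and averages per-year errors; the function names and the error values in the usage example are hypothetical, not results from the presentation.

```python
def incremental_splits(first_year=1982, first_test=2001, last_test=2003):
    """Yield (training seasons, test season) pairs with an expanding training window."""
    for test_year in range(first_test, last_test + 1):
        yield list(range(first_year, test_year)), test_year


def average_error(errors_by_year):
    """Average the per-year errors obtained for one forecast horizon."""
    return sum(errors_by_year.values()) / len(errors_by_year)


splits = list(incremental_splits())

# Hypothetical per-year errors for a single horizon, for illustration only.
example_average = average_error({2001: 10.0, 2002: 12.0, 2003: 14.0})
```

Every test season lies strictly after its training seasons, so no future data leaks into the model being evaluated.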

57 10/31/2012, METU Results - Experiment 2: Evaluating WFL-EMM by using k-fold cross validation technique over the dataset from 1982 to 2003 (set MAD as fitness).

58 10/31/2012, METU Results It is interesting to look at the weights of the features because these weights reveal information about what the main drivers of intensity change might be.

59 10/31/2012, METU Learn feature weights using Genetic Algorithm. Weights for features over time.

60 10/31/2012, METU PIIH – Prediction Intensity Interval Model for Hurricanes Historic hurricane data Features –Current wind speed –Various temperatures –Time of the year –Direction of movement –GOES Satellite Data (IR) Currently 23 features from the Statistical Hurricane Intensity Prediction Scheme (SHIPS) TRACDS™: data stream clustering + temporal order model

61 10/31/2012, METU Prediction using PIIH – Irene (2011) Current features of hurricane

62 10/31/2012, METU Prediction using PIIH – Irene (2011) Current features of hurricane Aggregate possible future scenarios into a prediction

63 10/31/2012, METU PIIH Output for Irene (2011) MAD … Mean absolute deviation MSE … Mean squared error * Baseline model

64 10/31/2012, METU PIIH Advantages Real Time Dynamic Machine Learning Confidence Bands By analyzing the 2011 storms through Nate, we observed the following: –96.33% of observations fell within the 95% confidence band –92.8% of observations fell within the 90% confidence band –74.27% of observations fell within the 68% confidence band
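
Coverage figures like the ones above are obtained by checking how often the observed intensity falls inside the predicted band. The sketch below shows that calculation on made-up numbers; the function name `coverage` and all values in the example are hypothetical.

```python
def coverage(observations, lower, upper):
    """Percentage of observations that fall inside their [lower, upper] band."""
    inside = sum(lo <= obs <= hi for obs, lo, hi in zip(observations, lower, upper))
    return 100.0 * inside / len(observations)


# Made-up observed intensities and band limits in knots, for illustration only.
pct = coverage(
    observations=[50, 60, 70, 80],
    lower=[45, 55, 72, 75],
    upper=[55, 65, 80, 85],
)
```

A well-calibrated band has coverage close to its nominal level, so comparing the computed percentage with 95%, 90% or 68% indicates whether the bands are too wide or too narrow.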

65 10/31/2012, METU Outline Spatiotemporal Stream Data TRACDS Hurricane Intensity Prediction PIIH PIIH online

66 10/31/2012, METU http://IDA.lyle.smu.edu/PIIH/

67 10/31/2012, METU Cooperation & Media Coverage James Franklin Branch Chief, Hurricane Specialist Unit, NHC, NOAA Mark Demaria Chief of the NESDIS Regional and Mesoscale Meteorology Branch, CIRA, NOAA

68 10/31/2012, METU Future Work 1. Deploy model with NOAA –Add decay model over land –Evaluate additional features –Predict rapid intensification –Interface with NOAA’s systems 2. Improve the TRACDS™ model –Data stream clustering –Higher-order effects –Improve model selection and outlier handling

69 10/31/2012, METU PIIH Bibliography

70 10/31/2012, METU Thank you! http://www.lyle.smu.edu/IDA

