Trajectory Data Mining and Management Hsiao-Ping Tsai CSIE, YuanZe Uni. 2009.12.04.


1 Trajectory Data Mining and Management Hsiao-Ping Tsai 蔡曉萍 @ CSIE, YuanZe Uni. 2009.12.04

2 Outline Introduction to Data Mining Background of Trajectory Data Mining Part I: Group Movement Patterns Mining Part II: Semantic Data Compression

3 Why Data Mining?
The explosive growth of data, toward petabyte scale
  • Commerce: Web, e-commerce, bank/credit transactions, …
  • Science: remote sensing, bioinformatics, …
  • Many others: news, digital cameras, books, magazines, …
We are drowning in data, but starving for knowledge!
Data mining: automated analysis of massive data sets
Somebody ~ Help~~~~

4 What Is Data Mining?
Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge, e.g., rules, regularities, patterns, and constraints, from huge amounts of data
[Figure: Confluence of Multiple Disciplines. Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, neural networks, graph theory, and other disciplines.]

5 Potential Applications
Data analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Fraud detection and detection of unusual patterns (outliers)
Other applications
  • Text mining and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis
  • …

6 Data Mining Functionalities (1/2)
Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics
Frequent patterns, association, correlation vs. causality
  • Diaper → Beer [0.5%, 75%]
  • Discovering relations between data items
Classification and prediction
  • Construct models that describe and distinguish classes
  • Predict unknown or missing numerical values
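The bracketed pair attached to the Diaper/Beer rule is conventionally read as [support, confidence]. The following is a minimal sketch of how those two numbers can be computed, assuming a toy list of transactions; the basket contents and the function name `support_and_confidence` are hypothetical and not taken from the slides.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing the antecedent items.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing both the antecedent and the consequent items.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Support is the fraction of all transactions that contain both sides of the rule, while confidence is that count divided by the number of transactions containing the left-hand side.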

7 Data Mining Functionalities (2/2)
Cluster analysis
  • Clustering: group data to form classes
  • Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Useful in fraud detection and rare-event (exception) analysis
Trend and evolution analysis
  • Trend and deviation, e.g., regression analysis
  • Sequential pattern mining, periodicity analysis, similarity-based analysis

8 Outline Introduction to Data Mining Background of Trajectory Data Mining Part I: Group Movement Patterns Mining Part II: Semantic Data Compression

9 Trajectory data are everywhere!
The world is becoming more and more mobile
  • Prevalence of mobile devices, e.g., smart phones, car PNDs, NBs, PDAs, …
Satellite, sensor, RFID, and wireless technologies have fostered many applications
  • Tremendous amounts of trajectory data
Market prediction: 25-50% of cellphones in 2010 will have GPS

10 Related Research Projects (1/2)
  • GeoPKDD: Geographic Privacy-aware Knowledge Discovery and Delivery (Pisa Uni., Piraeus Uni., …)
  • MotionEye: Querying and Mining Large Datasets of Moving Objects (UIUC)
  • GeoLife: Building social networks using human location history (Microsoft Research)
  • Reality Mining (MIT Media Lab)
  • Data Mining in Spatio-Temporal Data Sets (Australia's ICT Research Centre of Excellence)
  • Trajectory Enabled Service Support Platform for Mobile Users' Behavior Pattern Mining (IBM China Research Lab)
  • U.S. Army Research Laboratory

11 Related Research Projects (2/2)
  • Mobile Data Management (Prof. 李強 @ CSIE, NCKU)
  • Energy efficient strategies for object tracking in sensor networks: A data mining approach (Prof. 曾新穆 @ CSIE, NCKU)
  • Object tracking and moving pattern mining (Prof. 彭文志 @ CSIE, NCTU)
  • Mining Group Patterns of Mobile Users (Prof. 黃三義 @ CSIE, NSYSU)
  • …

12 Wireless Sensor Networks (1/2)
Technical advances in wireless sensor networks (WSNs) are promising for various applications
  • Object tracking
  • Military surveillance
  • Dwelling security
  • …
These applications generate large amounts of location-related data, and many efforts are devoted to compiling the data to extract useful information
  • Past behavior analysis
  • Future behavior prediction and estimation

13 Wireless Sensor Networks (2/2)
A wireless sensor network (WSN) is composed of a large number of sensor nodes
  • Each node consists of sensing, processing, and communicating components
  • WSNs are data driven
  • Energy conservation is paramount among all design issues
Object tracking is viewed as a killer application of WSNs
  • The task of detecting a moving object's location and reporting the location data to the sink periodically
  • Tracking moving objects is considered the most challenging

14 Part I: Group Movement Patterns Mining Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, “Exploring Group Moving Pattern for Tracking Objects Efficiently,” accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009

15 Motivation
Many applications are more concerned with group relationships and their aggregated movement patterns
  • Movements of creatures have some degree of regularity
  • Many creatures are socially aggregated and migrate together
The application-level semantics can be utilized to track objects efficiently
  • Data aggregation
  • In-network scheduling
  • Data compression

16 Possible Applications Group data aggregation In-network aggregation PST prediction

17 Assumptions
Each object has a globally unique ID
A hierarchical WSN structure, where each sensor within a cluster has a locally unique ID
  • The location of an object is modeled by the ID of a nearby sensor (or cluster)
  • The trajectory of a moving object is thus modeled as a series of observations and expressed by a location sequence

18 Problem Formulation
Similarity
  • Given the similarity measure function sim_p and a minimal threshold sim_min, o_i and o_j are similar if their similarity score sim_p(o_i, o_j) is above sim_min, i.e., sim_p(o_i, o_j) ≥ sim_min
Group
  • A set of objects g is a group if [condition not preserved in this transcript], where so(o_i) denotes the set of objects that are similar to o_i
The moving object clustering (MOC) problem
  • Given a set of moving objects O together with their associated location sequence dataset S and a minimal threshold sim_min, the MOC problem is formulated as partitioning O into non-overlapping groups, denoted by G = {g_1, g_2, ..., g_i}, such that the number of groups is minimized, i.e., |G| is minimal.

19 Challenges of the MOC Problem
How to discover the group relationships?
  • A centralized approach? Compiling all data at a single node is expensive!
  • Compare similarity on entire movement trajectories? Local characteristics might be blurred!
Other issues
  • Heterogeneous data from different tracking configurations
  • Trade-off between resolution and privacy preservation
A distributed mining approach is more desirable

20 The Proposed DGMPMine Algorithm
To resolve the MOC problem, we propose a distributed group movement pattern mining algorithm
  • Provide transmission efficiency
  • Improve discriminability
  • Improve clustering quality
  • Provide flexibility
  • Preserve privacy

21 Definition of a Significant Movement Pattern
A subsequence that occurs more frequently carries more information about the movement of an object
The movement transition distribution characterizes the movements of an object
Definition of a movement pattern
  • A subsequence s of a sequence S is significant if its occurrence probability is higher than a minimal threshold, i.e., P(s) ≥ P_min
  • A significant movement pattern is a significant subsequence s together with its transition distribution P(δ|s), with the constraint that P(δ|s) must differ from P(δ|suf(s)) by a ratio of r or 1/r
VMM

22 Learning of Significant Movement Patterns
Learning movement patterns in the trajectory data set with a Probabilistic Suffix Tree (PST)
  • PST is an implementation of VMM with the least storage requirement
  • The PST building algorithm learns from a location sequence data set and generates a compact tree with O(n) complexity in both computation and space
  • The tree stores significant movement patterns together with their empirical probabilities and conditional empirical probabilities
  • Advantages: useful and efficient for prediction; controllable tree depth (size)
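To make the idea of learning significant patterns and predicting from them concrete, here is a deliberately simplified sketch. It is not the PST building algorithm referred to on the slide: it keeps a flat dictionary instead of a tree, it does not achieve O(n) cost, and it omits the ratio-r pruning test. The names `learn_patterns`, `predict_next`, `l_max`, and `p_min` are assumptions chosen for illustration, and locations are represented as single characters.

```python
from collections import Counter, defaultdict


def learn_patterns(sequence, l_max=2, p_min=0.1):
    """Map each significant context s (len(s) <= l_max) to P(next symbol | s)."""
    n = len(sequence)
    occurrences = Counter()             # how often each subsequence occurs
    next_counts = defaultdict(Counter)  # which symbols follow each subsequence
    for length in range(l_max + 1):
        for start in range(n - length + 1):
            s = sequence[start:start + length]
            occurrences[s] += 1
            if start + length < n:
                next_counts[s][sequence[start + length]] += 1
    model = {}
    for s, count in occurrences.items():
        positions = n - len(s) + 1      # number of places s could occur
        if count / positions >= p_min and next_counts[s]:
            total = sum(next_counts[s].values())
            model[s] = {sym: c / total for sym, c in next_counts[s].items()}
    return model


def predict_next(model, history):
    """Predict the next symbol from the longest suffix of history in the model."""
    for start in range(len(history) + 1):
        suffix = history[start:]
        if suffix in model:
            distribution = model[suffix]
            return max(distribution, key=distribution.get)
    return None


model = learn_patterns("abcabcabc", l_max=2, p_min=0.1)
print(predict_next(model, "ab"))
```

Prediction walks from the longest available suffix of the history down to the empty context, which mirrors how a variable-length Markov model falls back to shorter memory when a long context was not significant.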

23 Example of a location sequence and the generated PST Prediction Complexity:

24 Similarity Comparison
A novel pattern-based similarity measure is proposed to compare the similarity of objects.
  • Measures the similarity of two objects based on their movement patterns
      - Provides better scalability and resilience to outliers
      - Free from sequence alignment and variable-length handling
  • Considers not only the patterns shared by two objects but also their relative importance to the individual objects
      - Provides better discriminability

25 The Novel Similarity Measure sim_p
sim_p computes the similarity of objects o_i and o_j based on their PSTs. [The formula itself is not preserved in this transcript; the slide annotates its terms as follows.]
  • The union of significant patterns on T_1 and T_2
  • L_max: the maximal length of the VMM (or maximal depth of a PST)
  • Σ: the alphabet of symbols (IDs of a cluster of sensors)
  • The predicted value of the occurrence probability of s based on T_i
  • The Euclidean distance of a significant pattern with respect to T_i and T_j
  • A normalization factor

26 Local Grouping Phase: The GMPMine Algorithm
Step 1. Learning movement patterns for each object
Step 2. Computing the pair-wise similarity scores to construct a similarity graph
Step 3. Partitioning the similarity graph into highly connected subgraphs
Step 4. Choosing representative movement patterns for each group

27 Example of GMPMine

28 Global Ensembling Phase
Inconsistency may exist among local grouping results
  • The trajectory of a group may span several clusters
  • Group relationships may vary at different locations
  • A CH may have incomplete statistics
  • …
A consensus function is required to combine multiple local grouping results
  • Remove inconsistency
  • Improve clustering quality
  • Improve stability
[Table: group IDs assigned to objects o_0 through o_11 by the local grouping results G_a, G_b, G_c, and G_d; some cells are empty and the column alignment is not recoverable from this transcript.]

29 Global Ensembling Phase (contd.)
Normalized Mutual Information (NMI) is useful for measuring the information shared between two grouping results
Given K local grouping results, the objective is to find a solution that keeps most of the information in the local grouping results
[Formula with "joint entropy" and "entropy" annotations not preserved in this transcript.]
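Since the slide's NMI formula is lost, the sketch below shows one common way to compute NMI between two label assignments over the same objects. The normalization by the geometric mean of the two entropies is an assumption (it is a widely used definition, but the slides do not confirm which variant the author used), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI = (H(A) + H(B) - H(A, B)) / sqrt(H(A) * H(B)).

    labels_a[k] and labels_b[k] are the group IDs that two grouping
    results assign to the same object k.
    """
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    h_joint = entropy(list(zip(labels_a, labels_b)))  # joint entropy H(A, B)
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-group result shares no information under this definition
    return (h_a + h_b - h_joint) / math.sqrt(h_a * h_b)


# Two groupings that agree up to a renaming of group IDs.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
# Two groupings that are statistically independent.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

NMI depends only on which objects are grouped together, not on the group IDs themselves, which is why it suits comparing local grouping results that number their groups independently.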

30 The CE Algorithm
For a set of similarity thresholds D, we reformulate our objective as [formula not preserved in this transcript]
The CE algorithm includes three steps:
1. Measuring the pair-wise similarity to construct a similarity matrix using the Jaccard coefficient
2. Generating the partitioning results for a set of thresholds based on the similarity matrix
3. Selecting the final ensembling result
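Step 1 can be illustrated with a small sketch of the Jaccard coefficient. The slide does not say which sets the coefficient is applied to; the version below assumes, purely for illustration, that each object is described by the set of (local result, group ID) labels it received. That encoding and the names `jaccard` and `similarity_matrix` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(memberships):
    """Pair-wise Jaccard similarities; memberships[i] is the label set of object i."""
    n = len(memberships)
    return [[jaccard(memberships[i], memberships[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical labels: ("a", 0) means "local result a put this object in group 0".
memberships = [
    {("a", 0), ("b", 2)},
    {("a", 0), ("b", 1)},
    {("a", 1)},
]
for row in similarity_matrix(memberships):
    print(row)
```

A thresholded version of this matrix yields a similarity graph, which is the kind of input the partitioning in step 2 operates on.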

31 Example of CE
(a) Similarity graph (δ=0.1). (b) Highly connected subgraphs (δ=0.1).

δ     G_δ                                    ∑NMI(G_δ, G_i)
0.1   {{0,1,2,3},{5,6,7},{8,9,10,11}}        2.322
0.2   {{0,1,2,3},{5,6,7},{8,9,10,11}}        2.322
0.3   {{0,1,2,3},{4,5,6,7},{8,9,10,11}}      2.636
0.4   {{0,1,2,3},{4,5,6,7},{8,9,10}}         2.401
0.5   {{0,1,2,3},{4,5,6,7},{8,9,10}}         2.401

32 Experiments
An event-driven simulator in C++
Synthetic data with patterns and group relationships
  • Leader object: location-dependent parameterization of a random direction mobility model
  • Other members: uniform distribution within a specified GDR of the leader
  • Outliers: random-walk model
Part I: The distributed mining algorithm
  • Two-layer mesh network, each cluster with 4*4 sensors
  • Evaluation metric: NMI of the mining result and the input workload

33 Part I: Experimental Results (a) Impact of GDR and the number of groups ( n r =25) (b) Impact of the size of SG ( R )

34 Part I: Experimental Results (contd.) (a) Impact of the number of outliers (m=25) (b) Impact of moving distance (d)

35 Part I: Experimental Results (contd.) (a) Impact of PST L_max on grouping quality (b) Impact of PST L_max on prediction hit rate

36 Experiments (Part II)
Part II: Performance of our tracking network with group data aggregation
  • A WSN of 65,536 sensors is modeled by a K-layer mesh network, each cluster with sensors
Evaluation metrics
  • TC: average hop count per update
  • Amount of energy cost: in units of joules

37 Part II: Experimental Results (a) Efficiency of the proposed tracking algorithm and impact of network structure on the transmission cost (b) Impact of network structure on the prediction hit rate

38 Part II: Experimental Results ( contd.) (a) Impact of accuracy on the transmission cost (K=2)

39 Part II: Experimental Results ( contd.) (a) Performance comparison of our IA with a simple coverage scheme and PES

40 Part II: Semantic Data Compression Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, “Exploring Application Level Semantics for Data Compression,” accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009

41 Introduction
Data transmission is one of the most energy-expensive operations in WSNs
A batch-and-send network
  • NAND flash memory
  • Reduces network energy consumption
  • Increases network throughput
Data compression is a paradigm in WSNs
However, few works address application-dependent semantics in data, such as the correlations of a group of moving objects
How to manage the location data for a group of objects?
  • Compress data by general algorithms like Huffman?
  • Compress a group of trajectory sequences simultaneously?

42 Motivation
Redundancy in a group of location sequences comes from two aspects
  • Vertical redundancy: group relationships
  • Horizontal redundancy: statistics of symbols, predictability of symbols

43 What is Predictability of Symbols?
With group movement patterns shared
  • Predict the next location (symbol)
Replacing predictable items with a common symbol helps reduce entropy!

44 Problem Formulation
Assume
  • A batch-based tracking network
  • Group movement patterns are shared between a sender and a receiver
The Group Data Compression (GDC) problem
Given the group movement patterns of a group of objects, the GDC problem is formulated as a merge problem and a hit item replace (HIR) problem to reduce the number of bits required to represent their location sequences.
  • The merge problem is to combine multiple location sequences to reduce the overall sequence length
  • The HIR problem aims to minimize the entropy of a sequence such that the amount of data is reduced with or without loss of information

45 Our Approach
The proposed two-phase and two-dimensional (2P2D) algorithm
  • Sequence merge phase: utilizes the group relationships to merge the location data of a group of objects
  • Entropy reduction phase: utilizes the object movement patterns to reduce the entropy of the merged data horizontally
Compressibility is enhanced with or without information loss
Guarantees the reduction of entropy
[Figure labels: Compressing…, Uncompressing…]

46 Sequence Merge Phase
We propose the Merge algorithm that
  • avoids redundant reporting of the objects' locations by trimming multiple identical symbols into a single symbol
  • chooses a qualified symbol to represent multiple symbols when a tolerance of loss of accuracy is specified
      - The maximal distance between the reported location and the real location is below a specified error bound eb
      - When multiple qualified symbols exist, we choose the symbol that minimizes the average location error
Example: 60 symbols -> 20 symbols
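The two selection rules of the merge phase can be sketched for a single time step as follows. This is an illustrative reading, not the Merge algorithm itself: it assumes that each location symbol maps to planar coordinates, that candidate symbols are limited to those actually reported by the group members, and that the error bound is inclusive. The names `merge_column`, `merge_sequences`, and `coords` are hypothetical.

```python
import math


def merge_column(locations, coords, eb):
    """Pick one symbol to stand for all group members' locations at one time step."""
    distinct = set(locations)
    if len(distinct) == 1:
        return locations[0]  # identical symbols are trimmed into a single symbol
    best, best_avg = None, math.inf
    for candidate in sorted(distinct):
        dists = [math.dist(coords[candidate], coords[loc]) for loc in locations]
        if max(dists) <= eb:  # qualified: every member is within the error bound
            avg = sum(dists) / len(dists)
            if avg < best_avg:  # prefer the smallest average location error
                best, best_avg = candidate, avg
    return best  # None when no single symbol satisfies the error bound


def merge_sequences(sequences, coords, eb):
    """Merge equally long location sequences column by column."""
    return [merge_column(list(column), coords, eb) for column in zip(*sequences)]


# Hypothetical sensors on a line, one unit apart.
coords = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}
print(merge_sequences(["aab", "abb", "abb"], coords, eb=1))
```

How columns with no qualified symbol are handled is not described on the slide, so the sketch simply returns `None` for them.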

47 Entropy Reduction Phase Group movement patterns carry the information about whether an item of a sequence is predictable Since some items are predictable, extra redundancy exists How to remove the redundancy and even increase the compressibility?

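48 Entropy Reduction Phase
According to Shannon's source coding theorem, the entropy bounds the achievable compression: no lossless code can use fewer bits per symbol, on average, than the entropy
  • Definition of entropy: H = -Σ_i p_i log2(p_i), where p_i is the occurrence probability of symbol i
Increasing the skewness of the data can reduce the entropy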
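A short sketch of the relationship between skewness and entropy, using the standard Shannon entropy of the empirical symbol distribution. The two example strings are made up for illustration.

```python
import math
from collections import Counter


def entropy_bits(sequence):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sequence).values())


uniform = "abcdabcd"  # four symbols, equally frequent
skewed = "aaaaaabc"   # same length, one dominant symbol

print(f"uniform: {entropy_bits(uniform):.3f} bits/symbol")
print(f"skewed:  {entropy_bits(skewed):.3f} bits/symbol")
```

The skewed sequence has lower entropy even though both sequences have the same length, so an entropy coder needs fewer bits per symbol for it.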

49 The Hit Item Replacement (HIR) Problem
A simple and intuitive method is to replace all predictable symbols to increase the skewness
However, this simple method cannot guarantee a reduction in entropy
Definition of the Hit Item Replace (HIR) problem: Given a sequence and the information about whether each item is predictable, the HIR problem is to decide whether to replace each of the predicted symbols in the given sequence with a hit symbol so as to minimize the entropy of the sequence.
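The claim that blanket replacement is not always beneficial can be checked on tiny examples. The sketch below implements only the naive "replace every predictable item" strategy, not the Replace algorithm from the slides; the hit symbol `"."`, the example sequences, and the predictability flags are all hypothetical.

```python
import math
from collections import Counter


def entropy_bits(sequence):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sequence).values())


def replace_all_hits(sequence, predictable, hit_symbol="."):
    """Naively replace every predictable item with the hit symbol."""
    return "".join(hit_symbol if hit else sym
                   for sym, hit in zip(sequence, predictable))


# Case 1: every item is predictable, so replacement collapses the alphabet.
helped = replace_all_hits("abcd", [True, True, True, True])
# Case 2: only one occurrence of the dominant symbol is predictable, so
# replacement adds a new symbol and makes the distribution less skewed.
hurt = replace_all_hits("aaab", [True, False, False, False])

print("abcd ->", helped, entropy_bits("abcd"), "->", entropy_bits(helped))
print("aaab ->", hurt, entropy_bits("aaab"), "->", entropy_bits(hurt))
```

In the second case the entropy goes up after replacement, which is why the decision has to be made per symbol rather than applied wholesale.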

50 Three Rules 1. Accumulation rule: 2. Concentration rule: 3. Multi-symbol rule:

51 Example of the Replace Algorithm

52 Segmentation, Alignment, and Packaging

53 Experiments
An event-driven simulator in C++
A WSN of 256 sensors is modeled by a 2-layer mesh network, each cluster with sensors
Synthetic data with patterns and group relationships
  • Leader object: location-dependent parameterization of a random direction mobility model
  • Other members: uniform distribution within a specified GDR of the leader
Evaluation metrics
  • Amount of data: in units of KB
  • Compression ratio:

54 Experimental Results (a) Performance comparison of the batch-based and on-line approaches. (b) Impact of the group size (n)

55 Experimental Results (contd.) (a) Impact of batch period (b) Impact of the required accuracy

56 Experimental Results (contd.) (a) Effectiveness of the Replace Algorithm (b) Prediction Hit Rate vs. GDR

57 ~The End~ Any questions?

