Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.

Presentation transcript:

Online Mining of Frequent Query Trees over XML Data Streams
Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee
Department of Computer Science, National Chiao-Tung University, Hsinchu, Taiwan 300, R.O.C.
*: corresponding author

Outline
- Introduction: Mining of Data Streams, Tree Mining
- Problem Definition: Online Mining of Frequent Query Trees over XML Data Streams
- The Proposed Algorithm: FQT-Stream (Frequent Query Trees of Streams)
- Conclusions and Future Work

Mining of Data Streams: Motivations
- Many applications generate data streams
  - Day-to-day business (credit card, ATM transactions, etc.)
  - Hot Web services (XML data, record and click streams)
  - Telecommunication (call records)
  - Financial market (stock exchange)
  - Surveillance (sensor network, audio/video)
  - System management (network events)
- Application characteristics
  - Massive volumes of data (several terabytes)
  - Records arrive at a rapid rate
  - Data distribution changes on the fly
- What do we want to get from data streams?
  - Real-time query answering, statistics, and pattern discovery

Mining of Data Streams: Computation Model
- Requirements of mining data streams
  - Single pass: each record is examined at most once
  - Bounded storage: limited memory for storing the synopsis
  - Real-time: per-record processing time (to maintain the synopsis) must be low
- Diagram components: Data Streams, Buffer, Stream Mining Processor, Synopsis in Memory, (Approximate) Results

Problem Definition of Frequent Query Tree Mining (1/2)
- XML Query Tree Stream (XQTS)
  - A sequence of query trees (QTs): QT_1, QT_2, …, QT_N
  - N is the tree id of the latest incoming query tree
- Support of a query tree QT_i
  - sup(QT_i): the number of QTs in the XQTS that contain QT_i as a subtree

Problem Definition of Frequent Query Tree Mining (2/2)
- A QT_i is a Frequent Query Tree (FQT) if and only if sup(QT_i) ≥ sN
  - s is a user-defined minimum support threshold in the range [0, 1]
- Our task
  - To mine the set of all frequent query trees (FQTs) in one scan of the XQTS
  - Using as little memory as possible
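The frequency test can be written as a small predicate. This is a minimal sketch, assuming the comparison is "greater than or equal to sN" as the mining slide states; the function name `is_frequent` and its parameter names are hypothetical and do not come from the slides.

```python
def is_frequent(support: int, s: float, n: int) -> bool:
    """Return True when a query tree's support reaches the threshold s * N.

    support: number of query trees seen so far that contain the tree
    s:       user-defined minimum support threshold in [0, 1]
    n:       tree id of the latest incoming query tree (stream length so far)
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return support >= s * n


# With s = 0.5 and N = 6 the threshold is 3 occurrences.
print(is_frequent(3, 0.5, 6))
print(is_frequent(2, 0.5, 6))
```

Because N grows with every incoming query tree, a tree that is frequent now can fall below the threshold later, which is why the check has to be repeated as the stream advances.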

Proposed Algorithm FQT-Stream (Frequent Query Trees of Streams)
- FQT-Stream consists of 5 phases
  1. Read a QT (Query Tree) from the buffer in main memory
  2. Transform the QT into a new NQTS (Normalized Query Tree Sequence) representation
  3. Construct an in-memory summary data structure called FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs
  4. Prune the infrequent query trees from the FQT-forest
  5. Find the set of all FQTs (Frequent Query Trees) from the current FQT-forest
- Since phase 1 is straightforward, we focus on phases 2-5

Phase 2 of FQT-Stream: NQTS Transformation
- NQTS transformation of a QT
  - Performed by a depth-first search (DFS) on the QT
  - Produces a sequence of triples (node-id, level, order)
    - level: the level of the node in the QT
    - order: the sequence order of the node in the NQTS
- For example, see the 5-NQTS in Figure 1
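The transformation can be sketched as a preorder DFS that emits one triple per node. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the function name `to_nqts`, numbering the root level from 0, and numbering order from 1 are all choices made here, since the slides do not fix them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


@dataclass
class Node:
    """Hypothetical query-tree node: an id plus ordered children."""
    node_id: str
    children: List["Node"] = field(default_factory=list)


def to_nqts(root: Node, root_level: int = 0) -> List[Triple]:
    """Walk the tree depth-first (preorder) and emit (node-id, level, order)."""
    triples: List[Triple] = []
    order = 0
    stack = [(root, root_level)]
    while stack:
        node, level = stack.pop()
        order += 1  # assumption: order numbers start at 1
        triples.append((node.node_id, level, order))
        # Push children right-to-left so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, level + 1))
    return triples


# Small example tree: A has children B and C; B has child D.
tree = Node("A", [Node("B", [Node("D")]), Node("C")])
print(to_nqts(tree))
```

An explicit stack is used instead of recursion so that deep query trees do not hit Python's recursion limit.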

Phase 3 of FQT-Stream: FQT-forest Construction (1/4)
- For each NQTS, two steps are performed to construct the FQT-forest
- Step 1: enumerate each NQTS into a set of sub-sequences using the Order-Break (OB) technique
  - OB is a level-wise method

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4)
- For example, take a 5-NQTS
- First, the 5-NQTS is broken into three 4-NQTSs
  - These sequences are 1-OB (One Order Break) sequences
  - A 1-OB sequence has one order break in its sequence order
- The original 5-NQTS is called 0-OB
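One way to read the k-OB label is as the number of gaps in the order field of a sub-sequence. The helper below is a hedged sketch under that assumption: the name `count_order_breaks` is hypothetical, and treating an "order break" as a jump of more than one between consecutive order numbers is an interpretation of the slide, not a definition taken from it.

```python
from typing import Sequence


def count_order_breaks(orders: Sequence[int]) -> int:
    """Count positions where consecutive order numbers are not adjacent.

    Assumption: an order break is a gap in the order field, e.g. the
    sub-sequence with orders 1, 2, 4, 5 skips order 3 and has one break.
    """
    return sum(1 for a, b in zip(orders, orders[1:]) if b != a + 1)


print(count_order_breaks([1, 2, 3, 4, 5]))  # original sequence
print(count_order_breaks([1, 2, 4, 5]))     # one node removed from the middle
print(count_order_breaks([1, 3, 5]))        # two separate gaps
```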

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4)
- After deleting the duplicates:
  - Three 4-NQTSs
  - Two 3-NQTSs with One Order Break
  - Two 3-NQTSs
  - One 2-NQTS
- Finally, the set of 1-OB contains 8 NQTSs

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4)
- The set of 2-OB is generated from the set of 1-OB
  - For example, a 2-OB is generated from a 1-OB
- Repeat this process until no candidate k-OB remains
- Property 1
  - The maximum size of order break is k-3, i.e., (k-3)-OB, if the query tree has k nodes

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3)
- The OBs (0-OB, 1-OB, 2-OB) are projected and inserted into an FQT-forest using the Incremental Projection (IP) technique
- An NQTS with i nodes is projected into i sub-NQTSs (also called node-suffix NQTSs)
- For simplicity, we use the single field node-id to represent the fields (node-id, level, order)

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3)
- Example of IP
  - A 1-OB with four nodes is projected into 4 node-suffix NQTSs
- After projection, a tree structure check is performed
  - If the level of the first node in a node-suffix NQTS is not the smallest level, the node-suffix NQTS is deleted
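Projection followed by the tree structure check can be sketched as taking every suffix of the sequence and filtering on the level of its first node. This is a minimal sketch with hypothetical names (`node_suffixes`, `passes_tree_check`, `project`). It follows the slide literally, so a suffix whose first node ties for the smallest level is kept; whether the original algorithm requires a strictly smaller level is not stated in the slides.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def node_suffixes(nqts: List[Triple]) -> List[List[Triple]]:
    """Project an NQTS with i nodes into its i node-suffix NQTSs."""
    return [nqts[i:] for i in range(len(nqts))]


def passes_tree_check(suffix: List[Triple]) -> bool:
    """Keep a suffix only if its first node has the smallest level in it.

    Literal reading of the slide: ties for the smallest level are kept.
    """
    return suffix[0][1] == min(level for _, level, _ in suffix)


def project(nqts: List[Triple]) -> List[List[Triple]]:
    """Incremental projection followed by the tree structure check."""
    return [s for s in node_suffixes(nqts) if passes_tree_check(s)]


nqts = [("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("C", 1, 4)]
for suffix in project(nqts):
    print([node_id for node_id, _, _ in suffix])
```

In this example the suffix starting at D is dropped because D sits at level 2 while a later node, C, sits at level 1, so D cannot be the root of the remaining nodes.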

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3)
- After the tree structure check
  - The node-suffix NQTSs are inserted into the FQT-forest
  - The supports of the corresponding nodes are updated
- The FQT-forest consists of 2 parts
  - FN-list
    - A list of Frequent Nodes
    - Each node X_i in the FN-list has an NQTS-tree (X_i.NQTS-tree)
  - NQTS-trees (trees of Normalized Query Tree Sequences)
    - A sequence (NQTS) is represented by a path
    - Its appearance frequency is maintained in the last node of the path
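The two-part structure resembles a set of prefix trees keyed by their first node. The sketch below is one possible layout, not the authors' data structure: the class names `PathNode` and `FQTForest`, the methods `insert` and `count_of`, and the use of node-ids alone (as the slides do for simplicity) are assumptions made for illustration.

```python
from typing import Dict, Sequence


class PathNode:
    """One node on a path of an NQTS-tree."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.count = 0  # frequency of the sequence that ends at this node
        self.children: Dict[str, "PathNode"] = {}


class FQTForest:
    """FN-list mapping each first node X_i to the root of X_i.NQTS-tree."""

    def __init__(self) -> None:
        self.fn_list: Dict[str, PathNode] = {}

    def insert(self, suffix: Sequence[str]) -> None:
        """Insert a node-suffix NQTS as a path; count it at the last node."""
        if not suffix:
            return
        node = self.fn_list.setdefault(suffix[0], PathNode(suffix[0]))
        for node_id in suffix[1:]:
            node = node.children.setdefault(node_id, PathNode(node_id))
        node.count += 1

    def count_of(self, sequence: Sequence[str]) -> int:
        """Return the stored frequency of a sequence, or 0 if it is absent."""
        if not sequence:
            return 0
        node = self.fn_list.get(sequence[0])
        for node_id in sequence[1:]:
            if node is None:
                return 0
            node = node.children.get(node_id)
        return node.count if node is not None else 0


forest = FQTForest()
for seq in (["A", "B", "C"], ["A", "B", "C"], ["A", "B"], ["B"]):
    forest.insert(seq)
print(forest.count_of(["A", "B", "C"]), forest.count_of(["A", "B"]))
```

Sharing prefixes between sequences keeps the summary compact, which matters under the bounded-storage requirement of the stream model.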

Phase 4 of FQT-Stream: Infrequent Information Pruning
- To guarantee the limited space requirement, infrequent information is pruned
- Pruning steps
  - Check each node X_i in the FN-list of the FQT-forest
  - If sup(X_i) < sN, delete X_i and its NQTS-tree
  - Check the other NQTS-trees to prune these infrequent nodes
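The pruning steps can be sketched over a plain dictionary layout. Everything here is an assumption made for illustration: the function names `prune` and `_drop`, the `supports` and `trees` dictionaries, and the choice to remove an infrequent node together with the whole subtree below it. That last choice relies on the fact that any sequence containing an infrequent node cannot itself be frequent; the slides do not spell out how the removal is carried out.

```python
from typing import Dict, Set

# Hypothetical layout: each tree node is {"count": int, "children": {id: node}}.
Tree = Dict[str, object]


def _drop(tree: Tree, infrequent: Set[str]) -> None:
    """Remove infrequent nodes (with their subtrees) from one NQTS-tree."""
    children = tree["children"]
    for child_id in list(children):
        if child_id in infrequent:
            del children[child_id]
        else:
            _drop(children[child_id], infrequent)


def prune(supports: Dict[str, int], trees: Dict[str, Tree],
          s: float, n: int) -> Set[str]:
    """Delete every X_i with sup(X_i) < s*N, then clean the other trees."""
    threshold = s * n
    infrequent = {x for x, sup in supports.items() if sup < threshold}
    for x in infrequent:
        del supports[x]
        trees.pop(x, None)  # delete X_i together with X_i.NQTS-tree
    for tree in trees.values():
        _drop(tree, infrequent)
    return infrequent


supports = {"A": 5, "B": 1, "C": 4}
trees = {
    "A": {"count": 5, "children": {
        "B": {"count": 1, "children": {"C": {"count": 1, "children": {}}}},
        "C": {"count": 3, "children": {}},
    }},
    "B": {"count": 1, "children": {}},
    "C": {"count": 4, "children": {}},
}
print(prune(supports, trees, s=0.5, n=6))
print(sorted(trees), sorted(trees["A"]["children"]))
```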

Phase 5 of FQT-Stream: Frequent Query Tree Mining
- Assume that there are k frequent nodes, X_1, X_2, …, X_k, in the FN-list
- FQT-Stream traverses each X_i.NQTS-tree (for all i, i = 1, 2, …, k) in a DFS manner to find the sequences with prefix X_i whose estimated support is greater than or equal to sN
- These frequent query trees are stored in a temporary list, called the FQT-List
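The mining traversal can be sketched as a DFS over every NQTS-tree that collects each path whose stored count reaches the threshold. This is a sketch under assumptions: the name `mine_fqts`, the dictionary layout of a tree node, and the decision to visit every path without early cut-off are choices made here (a count is kept only at the last node of a path, so a prefix can have a smaller stored count than one of its extensions).

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each tree node is {"count": int, "children": {id: node}}.
Tree = Dict[str, object]


def mine_fqts(trees: Dict[str, Tree], s: float,
              n: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Collect (sequence, count) pairs with count >= s*N by DFS."""
    threshold = s * n
    fqt_list: List[Tuple[Tuple[str, ...], int]] = []

    def dfs(node_id: str, tree: Tree, prefix: Tuple[str, ...]) -> None:
        path = prefix + (node_id,)
        if tree["count"] >= threshold:
            fqt_list.append((path, tree["count"]))
        for child_id, child in tree["children"].items():
            dfs(child_id, child, path)

    for x_i, tree in trees.items():  # one traversal per X_i.NQTS-tree
        dfs(x_i, tree, ())
    return fqt_list


trees = {
    "A": {"count": 4, "children": {
        "B": {"count": 3, "children": {"C": {"count": 1, "children": {}}}},
    }},
    "C": {"count": 2, "children": {}},
}
print(mine_fqts(trees, s=0.5, n=6))
```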

Conclusions and Future Work
- We propose an efficient one-pass algorithm, FQT-Stream (Frequent Query Trees of Streams)
  - It finds the set of all frequent query trees over the entire history of online XML data streams
- Future work
  - Online mining of frequent query trees over sliding windows