BRAID: Stream Mining through Group Lag Correlations Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos SIGMOD 2005.

Slides:



Advertisements
Similar presentations
High Performance Discovery from Time Series Streams
Advertisements

Arc-length computation and arc-length parameterization
Chapter 5 Multiple Linear Regression
DETC ASME Computers and Information in Engineering Conference ONE AND TWO DIMENSIONAL DATA ANALYSIS USING BEZIER FUNCTIONS P.Venkataraman Rochester.
Geometric Representation of Regression. ‘Multipurpose’ Dataset from class website Attitude towards job –Higher scores indicate more unfavorable attitude.
Yasuhiro Fujiwara (NTT Cyber Space Labs)
Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU.
Data mining and statistical learning - lecture 6
WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara.
1 Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams Robert Schweller Ashish Gupta Elliot Parsons Yan Chen Computer.
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Optical flow and Tracking CISC 649/849 Spring 2009 University of Delaware.
2008 Chingchun 1 Bootstrap Chingchun Huang ( 黃敬群 ) Vision Lab, NCTU.
Chapter Seven The Correlation Coefficient. Copyright © Houghton Mifflin Company. All rights reserved.Chapter More Statistical Notation Correlational.
1 Chapter 17: Introduction to Regression. 2 Introduction to Linear Regression The Pearson correlation measures the degree to which a set of data points.
Relationships Among Variables
Quakefinder : A Scalable Data Mining System for detecting Earthquakes from Space A paper by Paul Stolorz and Christopher Dean Presented by, Naresh Baliga.
CMPS1371 Introduction to Computing for Engineers NUMERICAL METHODS.
Empirical Modeling Dongsup Kim Department of Biosystems, KAIST Fall, 2004.
Copyright © Cengage Learning. All rights reserved. 3 Exponential and Logarithmic Functions.
CAFE router: A Fast Connectivity Aware Multiple Nets Routing Algorithm for Routing Grid with Obstacles Y. Kohira and A. Takahashi School of Computer Science.
by B. Zadrozny and C. Elkan
Efficient Irradiance Normal Mapping Ralf Habel, Michael Wimmer Institute of Computer Graphics and Algorithms Vienna University of Technology.
Fast and Exact Monitoring of Co-evolving Data Streams Yasuko Matsubara, Yasushi Sakurai (Kumamoto University) Naonori Ueda (NTT) Masatoshi Yoshikawa (Kyoto.
12a - 1 © 2000 Prentice-Hall, Inc. Statistics Multiple Regression and Model Building Chapter 12 part I.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
CSCI 256 Data Structures and Algorithm Analysis Lecture 14 Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some.
?v=cqj5Qvxd5MO Linear and Quadratic Functions and Modeling.
BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos.
Investigating the Relationship between Scores
Splines Vida Movahedi January 2007.
Statistical Sampling-Based Parametric Analysis of Power Grids Dr. Peng Li Presented by Xueqian Zhao EE5970 Seminar.
Efficient Elastic Burst Detection in Data Streams Yunyue Zhu and Dennis Shasha Department of Computer Science Courant Institute of Mathematical Sciences.
Abdullah Mueen Eamonn Keogh University of California, Riverside.
Describing Relationships Using Correlations. 2 More Statistical Notation Correlational analysis requires scores from two variables. X stands for the scores.
Fast Approximate Point Set Matching for Information Retrieval Raphaël Clifford and Benjamin Sach
11/23/2015Slide 1 Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear.
Raquel A. Romano 1 Scientific Computing Seminar May 12, 2004 Projective Geometry for Computer Vision Projective Geometry for Computer Vision Raquel A.
Algorithms 2005 Ramesh Hariharan. Algebraic Methods.
Lecture 22 Numerical Analysis. Chapter 5 Interpolation.
Introduction to Biostatistics and Bioinformatics Regression and Correlation.
Speech Recognition Raymond Sastraputera.  Introduction  Frame/Buffer  Algorithm  Silent Detector  Estimate Pitch ◦ Correlation and Candidate ◦ Optimal.
Stream Monitoring under the Time Warping Distance Yasushi Sakurai (NTT Cyber Space Labs) Christos Faloutsos (Carnegie Mellon Univ.) Masashi Yamamuro (NTT.
Testing Random-Number Generators Andy Wang CIS Computer Systems Performance Analysis.
Fast calculation methods. Addition  Add 137,95 Solution: = (137-5)+100= = 232.
Linear Prediction Correlation can be used to make predictions – Values on X can be used to predict values on Y – Stronger relationships between X and Y.
Copyright © 2013 Pearson Education, Inc. Section 3.2 Linear Equations in Two Variables.
Dynamic Models, Autocorrelation and Forecasting ECON 6002 Econometrics Memorial University of Newfoundland Adapted from Vera Tabakova’s notes.
Dynamic Programming.  Decomposes a problem into a series of sub- problems  Builds up correct solutions to larger and larger sub- problems  Examples.
Curve Fitting Introduction Least-Squares Regression Linear Regression Polynomial Regression Multiple Linear Regression Today’s class Numerical Methods.
D YNA MM O : M INING AND S UMMARIZATION OF C OEVOLVING S EQUENCES WITH M ISSING V ALUES Lei Li joint work with Christos Faloutsos, James McCann, Nancy.
Demand Management and Forecasting Chapter 11 Portions Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
Global predictors of regression fidelity A single number to characterize the overall quality of the surrogate. Equivalence measures –Coefficient of multiple.
Enabling Real Time Alerting through streaming pattern discovery Chengyang Zhang Computer Science Department University of North Texas 11/21/2016 CRI Group.
Dense-Region Based Compact Data Cube
Clustering Data Streams
The Stream Model Sliding Windows Counting 1’s
T. Chernyakova, A. Aberdam, E. Bar-Ilan, Y. C. Eldar
Copyright © 2014, 2010, 2007 Pearson Education, Inc.
The Rectangular Coordinate System
Linear Regression.
Sampling the Fourier Transform
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Curve fit metrics When we fit a curve to data we ask:
Curve fit metrics When we fit a curve to data we ask:
Introduction to Regression
Which sequence is linear? How do you know?
Chapter 4 –Dimension Reduction
Presentation transcript:

BRAID: Stream Mining through Group Lag Correlations Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos SIGMOD 2005

Outline Introduction Introduction Proposed method Proposed method EXPERIMENTS EXPERIMENTS CONCLUSIONS CONCLUSIONS

Introduction Data Stream Data Stream Lag correlations : Lag correlations : For example: For example: Higher amounts of fluoride in water → fewer dental cavities some years later Higher amounts of fluoride in water → fewer dental cavities some years later Goal : Goal : Monitor multiple numerical streams determine the pair correlated with lag and the value Monitor multiple numerical streams determine the pair correlated with lag and the value

Introduction k numerical sequences X 1,…X k, report all pair of X i and X j which X i follow X j with lag l k numerical sequences X 1,…X k, report all pair of X i and X j which X i follow X j with lag l

Introduction

Introduction In this paper, propose BRAID handle data stream In this paper, propose BRAID handle data stream Any time processing, and fast Any time processing, and fast Nimble Nimble Accurate Accurate Small resource consumption Small resource consumption

Proposed method Data stream X : {x 1, …, x t,..., x n }, x n is the most recent value Data stream X : {x 1, …, x t,..., x n }, x n is the most recent value R(0) : X and Y with the same length n and have zero lag R(0) : X and Y with the same length n and have zero lag Pearson ρ Coefficient : Pearson ρ Coefficient :

Proposed method For lag l,consider common part of X and shifted Y For lag l,consider common part of X and shifted Y

Proposed method

R(l) : correlation coefficient, X is delayed by l R(l) : correlation coefficient, X is delayed by l Score at lag l : Score at lag l :

Proposed method R(l) for large value of lag l ≈ n, the original and shifted time sequence have too few overlapping R(l) for large value of lag l ≈ n, the original and shifted time sequence have too few overlapping Restrict maximum lag m to be n/2 Restrict maximum lag m to be n/2

Proposed method Naive solution : Naive solution : At time n, access all value of X and Y, compute R(l) of all value lag l(=0,1, … ) At time n, access all value of X and Y, compute R(l) of all value lag l(=0,1, … ) Choose earliest max score above r, or report no lag Choose earliest max score above r, or report no lag The solution based on three major step The solution based on three major step

Proposed method Need some sufficient statistics for R to computed easily Need some sufficient statistics for R to computed easily Sx(l,n) = : sum of X of length n Sx(l,n) = : sum of X of length n Sxx(l,n) = : sum of square X of length n Sxx(l,n) = : sum of square X of length n Sxy(l) = : sum of square X of length n Sxy(l) = : sum of square X of length n

Proposed method R(l) is obtained : R(l) is obtained :

Proposed method R(l) can estimate at any point time, only need to keep track five sufficient statistics R(l) can estimate at any point time, only need to keep track five sufficient statistics It still needs linear time to compute the cross-correlation function between two sequences It still needs linear time to compute the cross-correlation function between two sequences

Proposed method Propose to keep track of only a geometric progression of the lag value : l= 0,1,2,..2 i,. Propose to keep track of only a geometric progression of the lag value : l= 0,1,2,..2 i,. Only O(logn) number to track of, instead of O(n) that “ Na ï ve solution ” requires Only O(logn) number to track of, instead of O(n) that “ Na ï ve solution ” requires Space required grow linearly with length n Space required grow linearly with length n

Proposed method In order to compute R(l) at any time, keep sliding window of size l, m=n/2 need O(n) space In order to compute R(l) at any time, keep sliding window of size l, m=n/2 need O(n) space Instead of operating on original time sequence, we also compute their smoothed version, by computing the means of non-overlapping windows Instead of operating on original time sequence, we also compute their smoothed version, by computing the means of non-overlapping windows

Proposed method Window size : power of g=2 Window size : power of g=2 X : original time sequence X : original time sequence Ax h : smoothed version with window of length 2 h Ax h : smoothed version with window of length 2 h Ax 0 : original sequence, Ax 1 : consists of n/2 ticks,..etc Ax 0 : original sequence, Ax 1 : consists of n/2 ticks,..etc Ax h ‘s sufficient statistic need compute every 2 h time ticks Ax h ‘s sufficient statistic need compute every 2 h time ticks At time n, need O(log n) level, for each level compute sufficient statistic At time n, need O(log n) level, for each level compute sufficient statistic

Proposed method In contrast with small lags, the larger one are sparse In contrast with small lags, the larger one are sparse Use cubic spline to interpolate the missing correlation coefficient Use cubic spline to interpolate the missing correlation coefficient

Proposed method Ax h (t) : window average at time tick t for level h Ax h (t) : window average at time tick t for level h Ax h (0) ≡ x t Ax h (0) ≡ x t

Proposed method Sufficient statistics: Sufficient statistics:

EXPERIMENTS

Conclusion Proposed BRAID to detection lag correlation on streaming data Proposed BRAID to detection lag correlation on streaming data At any time At any time Low resource consumption Low resource consumption High accuracy High accuracy