18 February 2003Mathias Creutz 1 T-122.102 Seminar: Discovery of frequent episodes in event sequences Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.

Slides:



Advertisements
Similar presentations
Mobile Communication Networks Vahid Mirjalili Department of Mechanical Engineering Department of Biochemistry & Molecular Biology.
Advertisements

Learning Rules from System Call Arguments and Sequences for Anomaly Detection Gaurav Tandon and Philip Chan Department of Computer Sciences Florida Institute.
Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Evaluating Search Engine
Simulation Where real stuff starts. ToC 1.What, transience, stationarity 2.How, discrete event, recurrence 3.Accuracy of output 4.Monte Carlo 5.Random.
Mining Sequential Patterns Dimitrios Gunopulos, UCR.
DAST, Spring © L. Joskowicz 1 Data Structures – LECTURE 1 Introduction Motivation: algorithms and abstract data types Easy problems, hard problems.
1 Exploratory Tools for Follow-up Studies to Microarray Experiments Kaushik Sinha Ruoming Jin Gagan Agrawal Helen Piontkivska Ohio State and Kent State.
CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations.
Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.
Aho-Corasick String Matching An Efficient String Matching.
Graph, Search Algorithms Ka-Lok Ng Department of Bioinformatics Asia University.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
DAST, Spring © L. Joskowicz 1 Data Structures – LECTURE 1 Introduction Motivation: algorithms and abstract data types Easy problems, hard problems.
Department of Computer Science 1 CSS 496 Business Process Re-engineering for BS(CS)
Fundamentals of Python: From First Programs Through Data Structures
FAST FREQUENT FREE TREE MINING IN GRAPH DATABASES Marko Lazić 3335/2011 Department of Computer Engineering and Computer Science,
Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.
Mining Association Rules of Simple Conjunctive Queries Bart Goethals Wim Le Page Heikki Mannila SIAM /8/261.
Fundamentals of Python: First Programs
Genetic Algorithm.
A computational study of protein folding pathways Reducing the computational complexity of the folding process using the building block folding model.
WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.
Querying Structured Text in an XML Database By Xuemei Luo.
Exact methods for ALB ALB problem can be considered as a shortest path problem The complete graph need not be developed since one can stop as soon as in.
3. Counting Permutations Combinations Pigeonhole principle Elements of Probability Recurrence Relations.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Major objective of this course is: Design and analysis of modern algorithms Different variants Accuracy Efficiency Comparing efficiencies Motivation thinking.
1 CSC 321: Data Structures Fall 2013 See online syllabus (also available through BlueLine2): Course goals:  To understand.
1 Murat Ali Bayır Middle East Technical University Department of Computer Engineering Ankara, Turkey A New Reactive Method for Processing Web Usage Data.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
Web- and Multimedia-based Information Systems Lecture 2.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
Protein motif extraction with neuro-fuzzy optimization Bill C. H. Chang and Author : Bill C. H. Chang and Saman K. Halgamuge Saman K. Halgamuge Adviser.
Motivation: Sorting is among the fundamental problems of computer science. Sorting of different datasets is present in most applications, ranging from.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Temporal Database Paper Reading R 資工碩一 馬智釗 Efficient Mining Strategy for Frequent Serial Episodes in Temporal Database, K Huang, C Chang.
Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data.
Chapter 9 Sorting. The efficiency of data handling can often be increased if the data are sorted according to some criteria of order. The first step is.
GO enrichment and GOrilla
Sequence Alignment.
Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach,
1 Mining Episode Rules in STULONG dataset N. Méger 1, C. Leschi 1, N. Lucas 2 & C. Rigotti 1 1 INSA Lyon - LIRIS FRE CNRS Université d’Orsay – LRI.
An Energy-Efficient Approach for Real-Time Tracking of Moving Objects in Multi-Level Sensor Networks Vincent S. Tseng, Eric H. C. Lu, & Kawuu W. Lin Institute.
HANGMAN OPTIMIZATION Kyle Anderson, Sean Barton and Brandyn Deffinbaugh.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
VizTree Huyen Dao and Chris Ackermann. Introducing example
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Mining Complex Data COMP Seminar Spring 2011.
Network Management Lecture 13. MACHINE LEARNING TECHNIQUES 2 Dr. Atiq Ahmed Université de Balouchistan.
Advanced Algorithms Analysis and Design
OPERATING SYSTEMS CS 3502 Fall 2017
Text Based Information Retrieval
Data Mining Jim King.
COSC160: Data Structures Linked Lists
Algorithm Analysis CSE 2011 Winter September 2018.
Objective of This Course
A Fast Algorithm For Finding Frequent Episodes In Event Streams
Overview of Query Evaluation
Discovery of Significant Usage Patterns from Clickstream Data
Presentation transcript:

18 February 2003Mathias Creutz 1 T Seminar: Discovery of frequent episodes in event sequences Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. –Department of Computer Science, Series of Publications C, Report C , University of Helsinki Presented by Mathias Creutz, 18 Feb Motivation –Most data mining techniques process unordered collections of data. –But there are important application areas, where the data consist of sequences of events, e.g. alarms in a telecommunication network user interface actions crimes committed by a person occurrences of recurrent illnesses –Goal: Find relationships between events occurring together.

18 February 2003Mathias Creutz 2 Concepts Event type E : e.g., A, B, C, D, E, F. Event: Pair of an event type and its occurrence time (A, t), e.g., (F, 40). Event sequence: Triple (s, T start, T end ), where s= is an ordered sequence of events, and T start and T end are the starting and ending times, e.g., (, 30, 35). Note: The ending time 35 does not include (A, 35) ! Window: Event sequence with a particular width, which equals T end – T start. Episode: Partially ordered set of events, e.g., whenever A and B occur (in either order), C occurs soon. EDF A BCEF C D BAD C EFC BEAECF

18 February 2003Mathias Creutz 3 Episodes Episodes can be described as directed acyclic graphs: Serial episode: Defined order between events, e.g., . (But other events can intervene!) Parallel episode: No constraints on the relative order of the events, e.g., . An episode is injective if no event type occurs twice in it, e.g., , , . A subepisode contains part of the events of its superepisode, and the same ordering constraints apply, e.g.,   . E F A B A B C   

18 February 2003Mathias Creutz 4 Rules Frequency of an episode: Fraction of windows in which the episode occurs. This is a function of the window width. Frequent episodes: Set of episodes having a frequency over a particular frequency threshold. When the frequent episodes are known, rules can be obtained that describe connections between events, e.g., –If  occurs in 4.2% and  in 4.0% of the windows, then there is a chance of 0.95 that C follows in a window, where A and B have been observed. –We can output the rule    with the confidence fr(  ) / fr(  ) = 4.0 / 4.2 = A B C  

18 February 2003Mathias Creutz 5 WINEPI algorithm Lemma 1: If an episode is frequent, then all its subepisodes are frequent. Generate rules (Algorithm 1) Find all frequent episodes of size l (Algorithm 2) Suggest frequent episode candidates of size l+1 (Algorithm 3) Find frequent parallel episodes (Algorithm 4) Find frequent serial episodes (Algorithm 5)

18 February 2003Mathias Creutz 6 WINEPI implementation details Algorithm 2 (finding frequent episodes): Breadth-first search: Successively increase the size of the episodes (according to Lemma 1). Algorithm 3 (generation of candidate episodes): Sort episodes lexicographically  All episodes that share the same first event types are consecutive in the episode list. Algorithm 4 (finding parallel episodes): Slide a window over the event sequence. For every parallel episode, increase and decrease a counter of how many events of the episode are within the window. When an episode is entirely within a window, increase its frequency count. Algorithm 5 (finding serial episodes): For every serial episode, use a state automaton that accepts that episode and rejects all others. Increase the frequency count of the episode when the accepting state is reached. Remove automata when they go out of the window.

18 February 2003Mathias Creutz 7 General partial orders An arbitrary episode can be reduced to a hierarchical combination of serial and parallel episodes, e.g., Complications: –Sometimes necessary to duplicate event nodes (complicated for non-injective episodes) –Composite events have a duration unlike elementary events. Practical and relatively fast alternative: –Handle all episodes as parallel episodes –Check the correct partial ordering only when all events are within the window. A B C  A B C  ’’ ’’

18 February 2003Mathias Creutz 8 Minimal occurrences Instead of using windows we can look at exact occurrences of episodes. This makes it more easy to find episode rules such as ”if A and B occur within 15 seconds, then C will follow within 30 seconds”   [15]   [30]. Minimal occurrence: Shortest possible interval that contains a particular episode. E.g., the set of minimal occurrences of  : mo(  ) = {[35,38[, [46,48[, [47,58[, [57,60[} Instead of frequency, we now use support, the number of minimal occurrences of an episode in an event sequence. –Support threshold vs. frequency threshold. A B C   EDF A BCEF C D BAD C EFC BEAECF

18 February 2003Mathias Creutz 9 MINEPI algorithm Generate rules (Alg. 1 with support thresh.) Find all frequent episodes of size l (Algorithm 2) Suggest frequent episode candidates of size l+1: Temporal join of two suitable subepisodes (Algorithm 3) Find frequent parallel episodes: Subepisode 1 contains all events but the last. Subepisode 2 contains all events but the first. Find frequent serial episodes Subepisode 1 contains all events but one. Sub- episode 2 contains all events but another one.

18 February 2003Mathias Creutz 10 Experiments Tests: –WINEPI Serial episodes (complex task) vs. injective parallel episodes (”easy” task) Different frequency thresholds Different window widths –MINEPI Serial episodes vs. parallel episodes Different support thresholds Different number of times bounds for rules Different confidence thresholds for rules Evaluation: –Time consumption –Quality of candidate generation –Comparison of WINEPI and MINEPI Differences in frequent episodes found

18 February 2003Mathias Creutz 11 Data used in experiments (1) Telecommunications network fault management database – alarms covering 7 weeks –287 different types of alarms –Average: 1 alarm / minute –Alarms tend to occur in bursts: It is possible to have 40 alarms in one second. WWW server log –Department of Computer Science at the University of Helsinki – events (WWW pages fetched in February and March 1996) –7634 different pages English text 1 –GNU man pages –5415 words (1102 word types) –Each word is indexed consecutively to give it a ”time”. Sentence boundaries cause a gap.

18 February 2003Mathias Creutz 12 Data used in experiments (2) English text 2 –The same GNU man pages, with non- informative words such as articles, prepositions and conjunctions stripped off –2871 words –905 word types Protein sequences –PROSITE database of the ExPASy WWW molecular biology server of the Geneva University Hospital and the University of Geneva –DNA and protein patterns –Target: Family of 7 sequences known to contain the string GFRGEAL. –4941 events –22 event types

18 February 2003Mathias Creutz 13 Results WWW server log –Users navigate through long paths from the homepage of the department to the pages of individual courses (rather than using bookmarks). Text database –Only few rules can be found, e.g., the, value [2]  the, value, of [3] –Window widths from 24 to 50 produce the same amount of episodes. Protein sequences –Found 17 episodes of length 7 or 8 – GFRGEAL is among them, and so are patterns with an 8 th symbol fairly near, e.g., GFRGEAL*S

18 February 2003Mathias Creutz 14 Conclusions Rules of WINEPI have nice interpretations as probabilities concerning randomly chosen windows. Rules of MINEPI usually more informative. WINEPI more efficient in the first phases of the discovery. MINEPI outperforms WINEPI in the later iterations. Methods could be modified for cross-use.