Distributed Process Discovery From Large Event Logs. Sergio Hernández de Mesa

Presentation transcript:

Distributed Process Discovery From Large Event Logs. Sergio Hernández de Mesa. Eindhoven, The Netherlands, 12th March, 2015

Outline
- Distributed Process Discovery
- A Framework for Distributed Computing
- Summary and Future Work


Distributed Process Discovery: Big Data

Distributed Process Discovery: The 3 V’s of Big Data

Distributed Process Discovery: Big Data and process discovery
[Figure: XES logs, CSV files and data streams; offline analysis and real-time analysis; data sizes MB, GB, TB]

Distributed Process Discovery: Actual problem

Distributed Process Discovery: Performance improvement opportunities
- Distribute/parallelize process discovery techniques
  - Inductive Miner
  - Alpha Miner
  - Heuristics Miner
  - …
- Take advantage of HPC infrastructures and parallel programming models
  - Clusters, grids and clouds
  - MapReduce

Distributed Process Discovery: Execution scenarios
- No computing resources: “Classical” ProM
- No computing resources but money: Amazon Elastic MapReduce
- Hadoop Cluster: MapReduce model
- HPC infrastructure: Distributed approach

Distributed Process Discovery: MapReduce and Hadoop
MapReduce
- Programming model for data-oriented applications
- Proposed by Google
- Map: (k1, v1) → list(k2, v2)
- Reduce: (k2, list(v2)) → list(v3)
Hadoop
- Software for reliable, scalable and distributed computing
- Developed by Apache
- Core components: Hadoop Distributed File System (HDFS), Hadoop MapReduce, Hadoop YARN
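The Map and Reduce signatures on this slide can be made concrete with a minimal single-process sketch in Python. It is not Hadoop's API: `run_mapreduce`, `map_trace` and `reduce_counts` are hypothetical names chosen for illustration, and the in-memory grouping only stands in for the shuffle that a real framework performs across machines. The example counts how often each activity occurs in a toy event log.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[str, List[str]]],
    mapper: Callable[[str, List[str]], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Apply mapper to every (k1, v1), group by k2, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):   # Map: (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)       # stand-in for the shuffle phase
    # Reduce: (k2, list(v2)) -> list(v3)
    return {k2: reducer(k2, values) for k2, values in groups.items()}


def map_trace(trace_id: str, activities: List[str]) -> List[Tuple[str, int]]:
    """Emit (activity, 1) for every event in one trace."""
    return [(activity, 1) for activity in activities]


def reduce_counts(activity: str, ones: List[int]) -> List[int]:
    """Sum the partial counts for one activity."""
    return [sum(ones)]


# Toy input: k1 is a case identifier, v1 is the ordered list of activities.
traces = [("case-1", ["register", "check", "pay"]),
          ("case-2", ["register", "pay"])]
activity_counts = run_mapreduce(traces, map_trace, reduce_counts)
```

Because each call to the mapper depends only on its own record, the map phase can be spread over many machines; only the grouping by key requires communication.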

Distributed Process Discovery: Motivational example: Inductive Miner
- Step 1: XES log to Directly-Follows Graph (DFG)
- Step 2: DFG to Process Model
- Optimized version of Step 1
  - Reading data as a stream (SAXParser)
  - HashMaps to efficiently count frequencies
- Example XES log: 100 million traces (40 activities)
  - Size: 218 GB
  - Step 1 (XES to DFG): ~2-3 hours
  - Step 2 (DFG to Process Model): a few seconds
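A rough sketch of the optimized Step 1 follows, written in Python rather than the Java SAXParser and HashMap named on the slide. It is not the author's code: `xml.sax` and `collections.Counter` are stand-ins, `DfgHandler` and `xes_to_dfg` are hypothetical names, and the sketch assumes each event stores its activity in a `string` attribute with key `concept:name`. Only the edge frequencies of the DFG are counted.

```python
import io
import xml.sax
from collections import Counter


class DfgHandler(xml.sax.ContentHandler):
    """Counts directly-follows pairs while the log streams through the parser."""

    def __init__(self):
        super().__init__()
        self.dfg = Counter()      # (activity_a, activity_b) -> frequency
        self.in_event = False
        self.previous = None      # last activity seen in the current trace

    def startElement(self, name, attrs):
        if name == "trace":
            self.previous = None  # pairs never span two traces
        elif name == "event":
            self.in_event = True
        elif (name == "string" and self.in_event
              and attrs.get("key") == "concept:name"):
            activity = attrs.get("value")
            if self.previous is not None:
                self.dfg[(self.previous, activity)] += 1
            self.previous = activity

    def endElement(self, name):
        if name == "event":
            self.in_event = False


def xes_to_dfg(stream):
    """Parse an XES byte stream once and return its edge-frequency Counter."""
    handler = DfgHandler()
    xml.sax.parse(stream, handler)
    return handler.dfg


# Tiny illustrative log; trace-level names are ignored, event names are counted.
SAMPLE_LOG = b"""<log>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event><string key="concept:name" value="A"/></event>
    <event><string key="concept:name" value="B"/></event>
    <event><string key="concept:name" value="C"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case-2"/>
    <event><string key="concept:name" value="A"/></event>
    <event><string key="concept:name" value="B"/></event>
    <event><string key="concept:name" value="B"/></event>
  </trace>
</log>"""

dfg = xes_to_dfg(io.BytesIO(SAMPLE_LOG))
```

The parser never holds the whole log in memory. The only state that grows is the counter, whose size is bounded by the number of distinct activity pairs (at most 40 × 40 for the example log on the slide).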

Distributed Process Discovery: Computing DFG: Hadoop/MapReduce approach
[Diagram: XES logs stored in HDFS (Hadoop Distributed File System) are divided into Block 1 … Block N in the split phase; MAP 1 … MAP N produce DFG 1 … DFG N; REDUCE produces the FINAL DFG]

Distributed Process Discovery: Computing DFG: Distributed/HPC approach
[Diagram: XES logs are divided into XES sublog 1 … XES sublog N; XES2DFG (MAP) turns each sublog into DFG 1 … DFG N; REDUCE_DFGS produces the FINAL DFG]
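The split, per-part DFG and merge steps can be sketched in a few lines of Python. The function names `xes2dfg` and `reduce_dfgs` are borrowed from the labels in the diagram, but their bodies are assumptions, as is the helper `split`. Traces are represented as plain lists of activity names, and a thread pool stands in for the separate cluster nodes. The sketch also assumes that sublogs contain whole traces, since a directly-follows pair never crosses a trace boundary.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def xes2dfg(sublog: List[List[str]]) -> Counter:
    """Map step: count directly-follows pairs within one sublog."""
    dfg = Counter()
    for trace in sublog:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


def reduce_dfgs(partial_dfgs: Iterable[Counter]) -> Counter:
    """Reduce step: add up the edge frequencies of all partial DFGs."""
    final_dfg = Counter()
    for partial in partial_dfgs:
        final_dfg.update(partial)   # Counter.update adds counts per key
    return final_dfg


def split(log: List[List[str]], parts: int) -> List[List[List[str]]]:
    """Divide a list of traces into sublogs without cutting any trace."""
    return [log[i::parts] for i in range(parts)]


log = [["A", "B", "C"], ["A", "C"], ["A", "B", "C"], ["A", "B", "B", "C"]]
sublogs = split(log, 2)

# Each sublog is processed independently; threads stand in for worker nodes.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_dfgs = list(pool.map(xes2dfg, sublogs))

final_dfg = reduce_dfgs(partial_dfgs)
```

Since edge counts are simply added, the merged result equals the DFG computed sequentially over the whole log, whatever the number of sublogs.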


A Framework for Distributed Computing: Scientific computing

A Framework for Distributed Computing: Heterogeneous Execution Environments

A Framework for Distributed Computing: Challenges of scientific computing in HPC
- Strong coupling between applications and execution environments
- Lifecycle management
- Using multiple computing infrastructures

A Framework for Distributed Computing: Framework architecture

A Framework for Distributed Computing: Framework operation
[Diagram components: User application, JSDL, Message, Message bus, HERMES Mediator, Amazon EC2 Mediator, HERMES Meta-scheduler, Fault Management]
- Selecting a computing infrastructure
- Job execution (✘ on failure, ✓ on success)
- Selecting fault handling policy on failure
  - Resubmission
  - Alternative infrastructure
  - Aborting job execution
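The failure path on this slide (job fails, a fault handling policy is selected, then the job is resubmitted, moved or aborted) can be illustrated with a small Python sketch. This is not the HERMES framework's interface: every name below (`FaultPolicy`, `JobAborted`, `execute_with_fault_handling`, the toy infrastructures and the policy function) is hypothetical and exists only to show the control flow.

```python
from enum import Enum
from typing import Callable, List, Tuple


class FaultPolicy(Enum):
    RESUBMISSION = "resubmission"
    ALTERNATIVE_INFRASTRUCTURE = "alternative infrastructure"
    ABORT = "aborting job execution"


class JobAborted(Exception):
    """Raised when the fault handling policy gives up on the job."""


def execute_with_fault_handling(
    job: str,
    infrastructures: List[Tuple[str, Callable[[str], str]]],
    choose_policy: Callable[[str, int], FaultPolicy],
    max_attempts: int = 5,
) -> Tuple[str, List[str]]:
    """Run a job, applying a fault handling policy after each failure.

    Returns the job result and a history of what happened on each attempt.
    """
    history: List[str] = []
    current = 0  # start on the first (preferred) infrastructure
    for attempt in range(1, max_attempts + 1):
        name, submit = infrastructures[current]
        try:
            result = submit(job)
            history.append(f"{name}: ok")
            return result, history
        except RuntimeError:
            policy = choose_policy(name, attempt)
            history.append(f"{name}: failed, {policy.value}")
            if policy is FaultPolicy.ABORT:
                break
            if policy is FaultPolicy.ALTERNATIVE_INFRASTRUCTURE:
                current = (current + 1) % len(infrastructures)
            # RESUBMISSION: keep the same infrastructure and try again
    raise JobAborted(f"{job} did not complete; history: {history}")


def flaky_cluster(job: str) -> str:
    """Toy infrastructure that always fails."""
    raise RuntimeError("node lost")


def cloud(job: str) -> str:
    """Toy infrastructure that always succeeds."""
    return f"{job} finished"


def policy(name: str, attempt: int) -> FaultPolicy:
    """Resubmit once, then move to another infrastructure."""
    if attempt == 1:
        return FaultPolicy.RESUBMISSION
    return FaultPolicy.ALTERNATIVE_INFRASTRUCTURE


result, history = execute_with_fault_handling(
    "xes2dfg", [("cluster", flaky_cluster), ("cloud", cloud)], policy)
```

Keeping the policy as a separate function means the decision rule can change without touching the submission loop.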


Summary and Future Work: Summary
- Inductive Miner
- Alpha Miner
- Heuristics Miner
- …

Summary and Future Work: Execution scenarios
New ProM plugin
- No computing resources: “Classical” ProM
- No computing resources but money: Amazon Elastic MapReduce
- Hadoop Server: MapReduce model
- HPC infrastructure: Distributed approach

Summary and Future Work: Solution approach

Summary and Future Work: Conclusions
- Process discovery from large event logs
  - “Sequential” way: time-consuming
  - Solution approach: MapReduce and distributed computing
- Current state
  - Code developed for distributed computation of DFGs
  - Setting up a Hadoop cluster
- Future work
  - Integration with the distributed computing framework
  - Development of a ProM plugin
