Data Mining Algorithms for Large-Scale Distributed Systems. Presenter: Ran Wolff. Joint work with Assaf Schuster, 2003.

What is Data Mining? The automatic analysis of large databases. The discovery of previously unknown patterns. The generation of a model of the data.

Main Data Mining Problems. Association rules (description): he who does this and that will usually do some other thing too. Classification (fraud, churn): these attributes indicate good behavior, those indicate bad behavior. Clustering (analysis): there are three types of entities.

Examples – Classification. Customers purchase items in a store. Each transaction is described in terms of a vector of features. The owner of the store tries to predict which transactions are fraudulent. Example: young men who buy small electronics during rush hours. Solution: do not accept checks.

Examples – Associations. Amazon tracks user queries and suggests to each user additional books he would usually be interested in. A supermarket finds out that “people who buy diapers also buy beer” and places diapers and beer at opposite sides of the supermarket.
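The diapers-and-beer rule can be made concrete with the two standard measures behind association rules, support and confidence. The sketch below is illustrative only: the basket data and the function names `support` and `confidence` are assumptions made for this example, not part of the original talk.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`.

    Raises ZeroDivisionError if no transaction contains `lhs`.
    """
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

print(support(baskets, {"diapers", "beer"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```

A rule such as diapers ⇒ beer is reported only when both its support and its confidence clear user-chosen thresholds.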

Examples – Clustering. Resource location: find the best location for k distribution centers. Feature selection: find 1000 concepts which summarize a whole dictionary, then extract the meaning out of a document by replacing each word with the appropriate concept (car for auto, etc.).

Why Mine Data of LSD Systems? Data mining is good. It is otherwise difficult to monitor a large-scale distributed (LSD) system: lots of data, spread across the system, impossible to collect. Many interesting phenomena are inherently distributed (e.g., DDoS), so it is not enough to just monitor a few nodes.

An Example. Peers in the Kazaa network reveal to the system which files they have on their disks in exchange for access to the files of their peers. The result is a 2M-peer database of people's recreational preferences. Mining it, you could discover that Matrix fans are also keen on Radiohead songs. Promote Radiohead performances in Matrix Reloaded. Ask Radiohead to write the music for Matrix IV.

What is so special about this problem? Huge systems, huge amounts of data. Dynamic setting: the system changes (nodes join / depart) and the data changes (constant update). Ad-hoc solution. Fast convergence.

Our Work. We developed an association rule mining algorithm that works well in LSD systems. Local and therefore scalable. Asynchronous and therefore fast. Dynamic and therefore robust. Accurate, not approximated. Anytime: you get early results fast.

In a Teaspoon. A distributed data mining algorithm can be described as a series of distributed decisions. Those decisions are reduced to a majority vote. We developed a majority voting protocol which has all those good qualities. The outcome is an LSD association rule mining algorithm (still to come: classification).

Problem Definition – Association Rule Mining (ARM)

Solution to Traditional ARM

Large-Scale Distributed ARM

Solution of LSD-ARM. No termination. Anytime solution. Recall. Precision.
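Because the algorithm never terminates, the quality of a node's interim rule set at any moment is naturally described by recall and precision against the correct rule set. A minimal sketch of these two measures follows; the function name `recall_precision`, the convention for empty sets, and the toy rule labels are assumptions made for illustration.

```python
def recall_precision(interim_rules, correct_rules):
    """Return (recall, precision) of an interim rule set.

    recall    = share of the correct rules already discovered
    precision = share of the reported rules that are actually correct
    Empty denominators are treated as a perfect score (an assumed convention).
    """
    interim, correct = set(interim_rules), set(correct_rules)
    hits = len(interim & correct)
    recall = hits / len(correct) if correct else 1.0
    precision = hits / len(interim) if interim else 1.0
    return recall, precision

# Toy rule sets, invented for illustration: three of four correct rules found,
# plus one false rule "X".
correct = {"A", "B", "C", "D"}
interim = {"A", "B", "C", "X"}
print(recall_precision(interim, correct))
```

In an anytime setting both numbers would be tracked over time; on static data they are expected to approach 1.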

Majority Vote in LSD Systems. An unknown number of nodes vote 0 or 1. Nodes may dynamically change their vote. Edges are dynamically added / removed. An infrastructure detects failure, ensures message integrity, and maintains a communication forest. Each node should decide if the global majority is of 0 or 1.

Majority Vote in LSD Systems – cont. Because of the dynamic settings, the algorithm never terminates. Instead we measure the percent of correct outputs. In static periods that percent ought to converge to 100%. In stationary periods we will show it converges to a different percentage. (Stationary: assume the overall percentage of ones remains the same, but they are constantly switched.)

LSD-Majority Algorithm. Nodes communicate by exchanging messages. Node u maintains: s^u, its vote; c^u, one (for now); the last message it had sent to v; the last message it had received from v.

LSD-Majority – cont. Node u calculates two quantities. The first captures the current knowledge of u. The second captures the current agreement between u and v.
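The formulas on this slide did not survive in the transcript. One plausible reading, consistent with the quantities node u is said to maintain, is sketched below. All notation beyond s^u and c^u is an assumption: ⟨s^{uv}, c^{uv}⟩ stands for the last message u sent to v, ⟨s^{vu}, c^{vu}⟩ for the last message u received from v, N^u for the neighbors of u, and λ for the majority threshold (1/2 for a simple majority).

```latex
% Knowledge of u: its own vote plus everything its neighbors have reported.
\Delta^{u} = \bigl(s^{u} - \lambda c^{u}\bigr)
           + \sum_{v \in N^{u}} \bigl(s^{vu} - \lambda c^{vu}\bigr)

% Agreement between u and v: what the two have told each other.
\Delta^{uv} = \bigl(s^{uv} + s^{vu}\bigr) - \lambda \bigl(c^{uv} + c^{vu}\bigr)
```

Under this reading a positive value means "more ones than the threshold requires", and both endpoints of an edge compute the same agreement value whenever no message is in flight between them.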

LSD-Majority – Rationale. It is OK if the current knowledge of u is more extreme than what it had agreed with v. The opposite is not OK: v might assume u supports its decision more strongly than u actually does. Tie breaking prefers a negative decision.

LSD-Majority – The Protocol

The same decision is applied whenever a message is received, s^u changes, or an edge fails or recovers.
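The protocol slide itself is not preserved in the transcript, so the following is a hedged sketch rather than the authors' algorithm. It assumes that each node keeps, per neighbor, the last (sum, count) pair sent and received; that "knowledge" is the node's own vote plus all received pairs; that "agreement" is sent plus received on one edge; and that a node sends a fresh message to v only when its agreement with v is more extreme than its knowledge. Ties are treated as negative, following the rationale slide. The class names, the threshold `LAMBDA`, and the five-node path are all invented for illustration, and edge failure and recovery are omitted.

```python
from collections import deque
from fractions import Fraction

LAMBDA = Fraction(1, 2)  # assumed majority threshold (simple majority)

class Node:
    def __init__(self, vote):
        self.vote = vote  # s^u; the count c^u is fixed at 1
        self.sent = {}    # neighbor -> last (sum, count) sent
        self.recv = {}    # neighbor -> last (sum, count) received

    def knowledge(self):
        total = self.vote + sum(s for s, _ in self.recv.values())
        count = 1 + sum(c for _, c in self.recv.values())
        return total - LAMBDA * count

    def agreement(self, v):
        total = self.sent[v][0] + self.recv[v][0]
        count = self.sent[v][1] + self.recv[v][1]
        return total - LAMBDA * count

    def output(self):
        return 1 if self.knowledge() > 0 else 0  # ties decide 0

class Network:
    """Message-passing simulation over a tree with a FIFO delivery queue."""

    def __init__(self, votes, edges):
        self.nodes = {name: Node(v) for name, v in votes.items()}
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                self.nodes[a].sent[b] = (0, 0)
                self.nodes[a].recv[b] = (0, 0)
        self.queue = deque()
        for name in self.nodes:
            self._evaluate(name)
        self._run()

    def _evaluate(self, u):
        node = self.nodes[u]
        for v in node.sent:
            d_u, d_uv = node.knowledge(), node.agreement(v)
            # Send only if the agreement overstates what u actually knows.
            if (d_uv <= 0 and d_u > d_uv) or (d_uv > 0 and d_u < d_uv):
                others = [m for w, m in node.recv.items() if w != v]
                msg = (node.vote + sum(s for s, _ in others),
                       1 + sum(c for _, c in others))
                node.sent[v] = msg
                self.queue.append((u, v, msg))

    def _run(self):
        while self.queue:
            u, v, msg = self.queue.popleft()
            self.nodes[v].recv[u] = msg
            self._evaluate(v)  # same decision on every received message

    def set_vote(self, u, vote):
        self.nodes[u].vote = vote
        self._evaluate(u)      # same decision when the vote changes
        self._run()

    def outputs(self):
        return {name: n.output() for name, n in self.nodes.items()}

votes = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
net = Network(votes, edges)
print(net.outputs())
net.set_vote("c", 0)
print(net.outputs())
```

In this toy run three of five nodes vote 1, so every node settles on 1; after node c flips to 0 the majority reverses and every node settles on 0, without all nodes having to exchange messages with all neighbors.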

LSD-Majority – Example

LSD-Majority Results

Proof of Correctness. Will be given in class.

Back from Majority to ARM. To decide whether an itemset is frequent or not.

Back from Majority to ARM. To decide whether a rule is confident or not.
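The details of these two slides are not preserved, so the sketch below shows one natural way to phrase both decisions as weighted votes; it is an assumption, not necessarily the authors' exact reduction. For frequency, every local transaction counts as a voter that votes 1 if it contains the itemset, and the threshold is MinFreq. For confidence, only transactions containing the rule's left-hand side are voters, they vote 1 if they also contain the right-hand side, and the threshold is MinConf. The function names, the toy databases, and the strict comparison on ties are illustrative choices.

```python
from fractions import Fraction

def frequency_vote(transactions, itemset):
    """Local (sum, count) pair for 'is `itemset` frequent?'."""
    itemset = set(itemset)
    return (sum(1 for t in transactions if itemset <= t), len(transactions))

def confidence_vote(transactions, lhs, rhs):
    """Local (sum, count) pair for 'is lhs => rhs confident?'."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    voters = [t for t in transactions if lhs <= t]
    return (sum(1 for t in voters if both <= t), len(voters))

def majority(local_votes, threshold):
    """Centralized reference answer a distributed vote would have to match.

    True when the share of ones strictly exceeds `threshold`
    (ties decide negatively, an assumption taken from the rationale slide).
    """
    total = sum(s for s, _ in local_votes)
    count = sum(c for _, c in local_votes)
    return total - threshold * count > 0

# Two toy local databases, invented for illustration.
node1 = [{"a", "b"}, {"a"}, {"b", "c"}]
node2 = [{"a", "b"}, {"a", "b", "c"}, {"c"}]

freq_votes = [frequency_vote(db, {"a", "b"}) for db in (node1, node2)]
conf_votes = [confidence_vote(db, {"a"}, {"b"}) for db in (node1, node2)]
print(majority(freq_votes, Fraction(2, 5)))   # MinFreq = 0.4 (assumed)
print(majority(conf_votes, Fraction(4, 5)))   # MinConf = 0.8 (assumed)
```

Each node only needs to supply its local pair; the threshold plays the role of the majority ratio, so the same voting machinery can serve both questions.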

Additionally. Create candidates based on the ad-hoc solution. Create rules on-the-fly rather than upon termination. Our algorithm outputs the correct rules without specifying their global frequency and confidence.

Eventual Results. By the time the database has been scanned once, in parallel, the average node has discovered 95% of the rules and has fewer than 10% false rules.