Scalable Mining For Classification Rules in Relational Databases Min Wang, Bala Iyer, Jeffrey Scott Vitter Presented by: נדב גרוסאוג

Abstract Problem: training sets keep growing, and classifiers that need the training set in memory do not scale. MIND (MINing in Databases) is a classifier that can be implemented easily over SQL, whereas other classifiers need O(N) space in memory. MIND scales well in both I/O and the number of processors.

Overview: Introduction; The Algorithm; Database Implementation; Performance; Experimental Results; Conclusions.

Introduction - Classification Problem [Figure: a classifier is built from the DETAIL table as a decision tree; the root tests Age <= 30, an inner node tests salary <= 62K, and the leaves are labeled safe or risky.]

Introduction - Scalability in Classification Importance of scalability: it lets us use a very large training set whose data is not memory resident, and it makes better use of resources as the number of CPUs grows.

Introduction - Scalability in Classification Properties of MIND: scalable in memory, scalable in CPU, uses SQL, easy to implement. Assumptions: attribute values are discrete, and we focus on the growth phase (no pruning).

The Algorithm - Data Structure The data is kept in the DETAIL table: DETAIL(attr_1, attr_2, ..., class, leaf_num), where attr_i is the i-th attribute, class is the class label, and leaf_num is the number of the leaf the example currently belongs to (this value can be calculated from the known tree).
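For concreteness, a minimal sketch of this schema in SQL; the two-attribute layout, the column names, and the integer types are illustrative assumptions, not from the paper:

CREATE TABLE DETAIL (
  attr1    INT,  -- first attribute, e.g. age
  attr2    INT,  -- second attribute, e.g. salary
  class    INT,  -- class label
  leaf_num INT   -- leaf the record belongs to; in MIND this can be
                 -- computed from the known tree instead of stored
);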

The Algorithm - gini index Let S be a data set, C the number of classes, and p_i the relative frequency of class i in S. The gini index is $gini(S) = 1 - \sum_{i=1}^{C} p_i^2$.
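A short worked example with illustrative numbers: a node S with 6 records, 4 of class 1 and 2 of class 2, and the standard weighted score for a candidate split of S into S_1 and S_2:

\[ gini(S) = 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2 = 1 - \tfrac{20}{36} = \tfrac{4}{9} \approx 0.444 \]
\[ gini_{split}(S_1, S_2) = \frac{|S_1|}{|S|}\,gini(S_1) + \frac{|S_2|}{|S|}\,gini(S_2) \]

The split minimizing gini_split is chosen; this weighted sum is the quantity the GINI_VALUE view computes below.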

The Algorithm
GrowTree(DETAIL table):
  initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree for one more level according to the best splits
    mark all small or pure leaves as STOP nodes
  return T

Database Implementation - Dimension Tables For each attribute i and each level of the tree:

INSERT INTO DIM_i
SELECT leaf_num, class, attr_i, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_i

Size of DIM_i = #leaves * #distinct values of attr_i * #classes.
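The schema of each dimension table is implied by the INSERT above; a sketch for attribute 1 (the types are assumptions):

CREATE TABLE DIM1 (
  leaf_num INT,  -- non-STOP leaf of the current tree
  class    INT,  -- class label
  attr1    INT,  -- attribute value
  count    INT   -- number of DETAIL records with this (leaf, class, value)
);

Because each row aggregates many DETAIL records, DIM_i is typically small enough to be joined cheaply in the steps below.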

Database Implementation - Dimension Tables in SQL A single scan of DETAIL feeds all the dimension tables at once:

INSERT INTO DIM1
SELECT leaf_num, class, attr1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2

Database Implementation - UP/DOWN For each attribute we accumulate, for every possible split point, the per-class record counts on one side of the split:

INSERT INTO UP
SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attr_i <= d1.attr_i
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attr_i, d1.class
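The slides show only UP; DOWN, which holds the class counts on the other side of each candidate split, is built symmetrically with the inequality reversed. A sketch under the same schema:

INSERT INTO DOWN
SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attr_i > d1.attr_i
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attr_i, d1.class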

Database Implementation - Class Views Create a view for each class k and attribute i (and analogously Ck_DOWN over the DOWN table):

CREATE VIEW Ck_UP (leaf_num, attr_i, count) AS
SELECT leaf_num, attr_i, count
FROM UP
WHERE class = k

Database Implementation - GINI VALUE Create a view holding the gini value of every candidate split:

CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
SELECT u1.leaf_num, u1.attr_i, f_gini(...)
FROM C1_UP u1, ..., Cc_UP uc, C1_DOWN d1, ..., Cc_DOWN dc
WHERE u1.attr_i = ... = uc.attr_i = ... = dc.attr_i
AND u1.leaf_num = ... = uc.leaf_num = ... = dc.leaf_num

Here f_gini computes the gini index of the split from the per-class counts above (UP) and below (DOWN) the split point; the chained equalities are slide shorthand for pairwise equality predicates.
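For C = 2 classes the shorthand expands into ordinary SQL. A minimal sketch for a fixed attribute i, following the gini_split formula given earlier; the inlined arithmetic is an assumption standing in for the paper's f_gini, and a real implementation must also guard against empty sides (e.g., with NULLIF on the denominators):

CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
SELECT u1.leaf_num, u1.attr_i,
       -- gini_split = (n_up * gini(up) + n_down * gini(down)) / n
       ( (u1.count + u2.count)
         * (1.0 - (u1.count*u1.count + u2.count*u2.count)
                  / (1.0 * (u1.count + u2.count) * (u1.count + u2.count)))
       + (d1.count + d2.count)
         * (1.0 - (d1.count*d1.count + d2.count*d2.count)
                  / (1.0 * (d1.count + d2.count) * (d1.count + d2.count))) )
       / (u1.count + u2.count + d1.count + d2.count)
FROM C1_UP u1, C2_UP u2, C1_DOWN d1, C2_DOWN d2
WHERE u1.attr_i = u2.attr_i AND u2.attr_i = d1.attr_i AND d1.attr_i = d2.attr_i
  AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num
  AND d1.leaf_num = d2.leaf_num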

Database Implementation - MIN GINI VALUE Create a table holding the minimum gini value for attribute i:

INSERT INTO MIN_GINI
SELECT leaf_num, i, attr_i, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)

Database Implementation - BEST SPLIT Create a view over MIN_GINI that yields the overall best split for each leaf:

CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)

Database Implementation - Partitioning Build the new nodes by splitting the old nodes according to the BEST_SPLIT values, then assign each record its correct node. The leaf_num of a record is computed by a function whenever DETAIL is read, so there is no need to UPDATE the data or the database.
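A hypothetical illustration of such a function (not the paper's exact code): leaf_num is derived on the fly by replaying the splits chosen so far, here hard-coded for a two-level tree like the one in the introduction, so DETAIL itself is never rewritten:

CREATE VIEW DETAIL_V (attr1, attr2, class, leaf_num) AS
SELECT attr1, attr2, class,
       -- recompute leaf_num from the current tree instead of storing it
       CASE
         WHEN attr1 <= 30 AND attr2 <= 62000 THEN 3  -- age <= 30, salary <= 62K
         WHEN attr1 <= 30                    THEN 4  -- age <= 30, salary > 62K
         ELSE 2                                      -- age > 30
       END
FROM DETAIL;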

Performance [Equations: I/O cost of MIND vs. I/O cost of SPRINT.]

Experimental Results [Charts: normalized time to finish building the tree; normalized time to build the tree per example.]

Experimental Results [Charts: normalized time to build the tree per number of processors; time to build the tree by training-set size.]

Conclusions MIND works over a DB. MIND works well because:
– MIND rephrases classification as a database problem
– MIND avoids UPDATEs to the DETAIL table
– parallelism and scaling are achieved by the use of the RDBMS
– MIND uses a user function to get the performance gain in the DIM_i creation