Using Category-Based Adherence to Cluster Market-Basket Data
Authors: Ching-Huang Yun, Kun-Ta Chuang, Ming-Syan Chen
Graduate: Chien-Ming Hsiao


Outline
Motivation
Objective
Introduction
Preliminary
Design of Algorithm CBA
Experimental Results
Conclusion
Personal Opinion

Motivation
Market-basket data are known to be high-dimensional and sparse, and to contain massive outliers.

Objective
To devise an efficient algorithm for clustering market-basket data.

Introduction
In market-basket data, the taxonomy of items defines the generalization relationships among concepts at different abstraction levels.

Preliminary
Problem Description
Information Gain Validation Model

Problem description
A database of transactions is denoted by D, where each transaction t_j is represented by a set of items.

Problem description
A clustering U is a partition of the transactions into k clusters, where C_j is a cluster consisting of a set of transactions.
Items in the transactions can be generalized to multiple concept levels of the taxonomy.

Problem description
The leaf nodes are called item nodes.
The internal nodes are called category nodes.
The root node at the highest level is a virtual concept that generalizes all categories.
The occurrence count is used to determine which items or categories are the major features of each cluster.

Problem description
Definition 1: The count of an item i_k in a cluster C_j, denoted by Count(i_k, C_j), is the number of transactions in cluster C_j that contain item i_k. An item i_k in a cluster C_j is called a large item if Count(i_k, C_j) exceeds a predetermined threshold.
Definition 2: The count of a category c_k in a cluster C_j, denoted by Count(c_k, C_j), is the number of transactions in cluster C_j that contain items under category c_k. A category c_k in a cluster C_j is called a large category if Count(c_k, C_j) exceeds a predetermined threshold.
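Definitions 1 and 2 can be illustrated with a short Python sketch. The taxonomy, the item names, and the helper names (`PARENT`, `ancestors`, `counts`, `large_nodes`) are all hypothetical and chosen only for illustration; they do not come from the paper. The sketch assumes every item has a chain of parents ending at a virtual node named "root".

```python
from collections import Counter

# Hypothetical taxonomy (child -> parent); "root" is the virtual top concept.
PARENT = {
    "novel": "fiction", "poetry": "fiction",
    "atlas": "reference", "dictionary": "reference",
    "fiction": "root", "reference": "root",
}

def ancestors(node, parent=PARENT):
    """Category nodes above `node`, excluding the virtual root."""
    out = []
    while node in parent and parent[node] != "root":
        node = parent[node]
        out.append(node)
    return out

def counts(cluster, parent=PARENT):
    """Count(x, C_j): transactions in the cluster that contain item x,
    or that contain at least one item under category x."""
    c = Counter()
    for t in cluster:
        nodes = set(t)
        for item in t:
            nodes.update(ancestors(item, parent))
        c.update(nodes)  # a transaction contributes at most once per node
    return c

def large_nodes(cluster, threshold, parent=PARENT):
    """Items and categories whose count exceeds the threshold."""
    return {n for n, k in counts(cluster, parent).items() if k > threshold}

cluster = [{"novel", "atlas"}, {"novel", "poetry"}, {"poetry"}]
print(counts(cluster))
print(large_nodes(cluster, threshold=1))
```

In this toy cluster the category "fiction" is counted once per transaction even when a transaction holds two fiction items, which matches the wording "number of transactions containing items under this category".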

Example

Problem description
Definition 3: defines the minimum support count S_c(C_j) of a cluster C_j, where |C_j| denotes the number of transactions in cluster C_j.
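The formula for Definition 3 is not preserved in this transcript. One natural reading, assuming the threshold is a fixed fraction of the cluster size, is shown below; the symbol S_p (a minimum support percentage) is an assumption and does not appear in the surviving text.

```latex
S_c(C_j) = S_p \cdot |C_j|, \qquad 0 < S_p \le 1
```

Under this reading, an item or category is large in C_j when its count reaches the support count of that cluster, so the threshold scales with the cluster size.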

Information Gain Validation Model
Purpose: to evaluate the quality of clustering results.
Numeric data are quantitative in nature (e.g., weight or length), whereas categorical data are qualitative (e.g., color or gender).
A validation model based on Information Gain (IG) is proposed to assess the quality of the clustering results.

Information Gain Validation Model
Definition 4: defines the entropy of an attribute J_a in database D, using |D|, the number of transactions in D, and the number of transactions in D whose attribute J_a takes a given value.

Example

Information Gain Validation Model
Definition 5: defines the entropy of an attribute J_a in a cluster C_j, using |C_j|, the number of transactions in C_j, and the number of transactions in C_j whose attribute J_a takes a given value.
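The formulas for Definitions 4 and 5 are not preserved in this transcript. The standard Shannon entropy fits the surrounding description and is sketched below. The symbols I(·,·), v_i, n, and the counts |J_a^i| are assumptions introduced for illustration: |J_a^i|_D is the number of transactions in D whose attribute J_a takes its i-th value v_i, and |J_a^i|_{C_j} is the same count restricted to C_j.

```latex
I(J_a, D)   = -\sum_{i=1}^{n} \frac{|J_a^i|_{D}}{|D|} \log_2 \frac{|J_a^i|_{D}}{|D|}
\qquad
I(J_a, C_j) = -\sum_{i=1}^{n} \frac{|J_a^i|_{C_j}}{|C_j|} \log_2 \frac{|J_a^i|_{C_j}}{|C_j|}
```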

Information Gain Validation Model
Definition 6: for a clustering U containing k clusters, defines the entropy of an attribute J_a in the clustering U.
Definition 7: defines the information gain obtained by separating J_a into the clusters of the clustering U.
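The formulas for Definitions 6 and 7 are likewise missing here. A conventional form, assuming the entropy of a clustering is the size-weighted average of the per-cluster entropies, would be the following; the notation I(·,·) and Gain(·,·) is assumed, not taken from the slides.

```latex
I(J_a, U) = \sum_{j=1}^{k} \frac{|C_j|}{|D|} \, I(J_a, C_j)
\qquad
\mathrm{Gain}(J_a, U) = I(J_a, D) - I(J_a, U)
```

With this form, the gain is zero when every cluster has the same value distribution as the whole database, and it is largest when each cluster is pure in J_a.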

Information Gain Validation Model
Definition 8: defines the information gain of the clustering U, where I is the set of all items purchased across the market-basket data records.
For clustering market-basket data, the larger the IG value, the better the clustering quality.
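A minimal Python sketch of such a validation score follows. It rests on two assumptions that are not confirmed by the transcript: each item in I is treated as a binary attribute (purchased or not), and the score is the sum over all items of the per-item gain, with the clustering entropy taken as the size-weighted average of cluster entropies. The function names are hypothetical.

```python
import math

def entropy(values):
    """Shannon entropy (base 2) of a list of attribute values."""
    n = len(values)
    if n == 0:
        return 0.0
    h = 0.0
    for v in set(values):
        p = values.count(v) / n
        h -= p * math.log2(p)
    return h

def information_gain(clusters):
    """Sum over items of: entropy in D minus size-weighted entropy in the clusters.
    Assumption: each item is a binary attribute (present / absent)."""
    database = [t for c in clusters for t in c]
    items = set().union(*database)
    total = 0.0
    for item in sorted(items):
        in_d = entropy([item in t for t in database])
        in_u = sum(len(c) / len(database) * entropy([item in t for t in c])
                   for c in clusters)
        total += in_d - in_u
    return total

# Two clusterings of the same four transactions.
good = [[{"a", "b"}, {"a", "b"}], [{"c"}, {"c"}]]
mixed = [[{"a", "b"}, {"c"}], [{"a", "b"}, {"c"}]]
print(information_gain(good), information_gain(mixed))
```

The clustering that groups identical baskets together scores higher than the one that mixes them, which is the behaviour the slide describes: a larger IG value indicates better clustering quality.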

Design of Algorithm CBA
Similarity Measurement: Category-Based Adherence
Procedure of Algorithm CBA

Similarity Measurement: Category-Based Adherence
Definition 9 (distance of an item to a cluster): For an item i_k of a transaction, the distance of i_k to a given cluster C_j, denoted by d(i_k, C_j), is the number of links between i_k and the nearest large node of i_k. If i_k is itself a large node in cluster C_j, then d(i_k, C_j) = 0.

Similarity Measurement: Category-Based Adherence
Definition 10 (adherence of a transaction to a cluster): defines the adherence of a transaction t to a given cluster C_j, where d(i_k, C_j) is the distance of item i_k to cluster C_j.
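The adherence formula is not preserved in this transcript. A form consistent with the surrounding text, assuming the adherence is the average distance of the items of t to the cluster, would be the following; the symbol H and the item count p are assumptions.

```latex
H(t, C_j) = \frac{1}{p} \sum_{k=1}^{p} d(i_k, C_j), \qquad t = \{ i_1, i_2, \ldots, i_p \}
```

Under this reading a smaller value means the transaction lies closer to the large nodes of the cluster, which agrees with allocating each transaction to the cluster of minimum adherence.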

Procedure of Algorithm CBA
Step 1: Randomly select k transactions from the database D as the seed transactions of the k clusters.
Step 2: Read each transaction sequentially and allocate it to the cluster with the minimum category-based adherence. For each moved transaction, the counts of its items and their ancestors are increased by one.
Step 3: Repeat Step 2 until no transaction is moved between clusters.
Step 4: Output the taxonomy tree of each cluster as the visual representation of the clustering result.
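Steps 1 to 3 can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation, and it makes several assumptions that the transcript does not confirm: adherence is the average item distance; a node is large when its count exceeds `sp` times the cluster size; the virtual root always counts as large; the nearest large node is searched among the item and its ancestors; ties keep a transaction in its current cluster; and a `max_rounds` guard bounds the loop. All names (`cba`, `sp`, `seeds`, and so on) are hypothetical, and every item is assumed to have a parent chain ending at the root.

```python
import random
from collections import Counter

def path_to_root(node, parent):
    """The node followed by its ancestors, ending at the virtual root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def nodes_of(t, parent):
    """Items of a transaction plus all their ancestors (root excluded)."""
    return {n for i in t for n in path_to_root(i, parent)[:-1]}

def distance(item, count, size, sp, parent):
    """Links from the item up to its nearest large node (root assumed large)."""
    path = path_to_root(item, parent)
    for links, node in enumerate(path[:-1]):
        if size and count[node] > sp * size:
            return links
    return len(path) - 1

def adherence(t, count, size, sp, parent):
    """Average distance of the items of t to the cluster (assumed form)."""
    return sum(distance(i, count, size, sp, parent) for i in t) / len(t)

def cba(transactions, k, parent, sp=0.5, seeds=None, max_rounds=100):
    if seeds is None:
        seeds = random.sample(range(len(transactions)), k)
    assign = [None] * len(transactions)
    counts = [Counter() for _ in range(k)]
    sizes = [0] * k

    def move(idx, dst):
        # Update item and ancestor counts of the source and target clusters.
        src, nodes = assign[idx], nodes_of(transactions[idx], parent)
        if src is not None:
            counts[src].subtract(nodes)
            sizes[src] -= 1
        counts[dst].update(nodes)
        sizes[dst] += 1
        assign[idx] = dst

    for j, idx in enumerate(seeds):              # Step 1: seed transactions
        move(idx, j)
    for _ in range(max_rounds):                  # Step 3: repeat until stable
        moved = False
        for idx, t in enumerate(transactions):   # Step 2: sequential pass
            best = min(range(k), key=lambda j: (
                adherence(t, counts[j], sizes[j], sp, parent), j != assign[idx]))
            if best != assign[idx]:
                move(idx, best)
                moved = True
        if not moved:
            break
    return assign

parent = {"novel": "fiction", "poetry": "fiction", "atlas": "reference",
          "dictionary": "reference", "fiction": "root", "reference": "root"}
transactions = [{"novel"}, {"atlas"}, {"poetry"}, {"dictionary"},
                {"novel", "poetry"}, {"atlas", "dictionary"}]
print(cba(transactions, k=2, parent=parent, seeds=[0, 1]))
```

The seeds are passed explicitly here so that the run is reproducible; omitting `seeds` falls back to a random choice as in Step 1. Step 4 (drawing the per-cluster taxonomy tree) is left out of the sketch.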

Experimental Results
Data generation: real market-basket data from a large bookstore company are used for the performance study.
There are |D| = 100K transactions and N_I items.

Performance Study

Conclusion
Experimental results on both real and synthetic datasets show that, by exploiting the taxonomy information, algorithm CBA devised in this paper significantly outperforms prior work in both execution efficiency and clustering quality for market-basket data.

Personal Opinion
An illustrative example is necessary.
Data in data mining often contain both numeric and categorical values.
This characteristic prevents the use of many existing clustering algorithms.