Concept Description and Data Generalization (based on the slides for the book Data Mining: Concepts and Techniques)


2003/04 Sistemas de Apoio à Decisão (LEIC Tagus)

Two categories of data mining
- Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms.
- Predictive mining: based on data and analysis, constructs models for the database and predicts the trends and properties of unknown data.

What is Concept Description?
Concept description (or class description) generates descriptions for the characterization and comparison of data.
- Characterization: provides a concise and succinct summarization of the given collection of data.
- Class comparison (or discrimination): provides descriptions comparing two or more collections of data.

Data Generalization
A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
[Figure: conceptual levels]
Approaches:
- Data cube approach (OLAP approach)
- Attribute-oriented induction approach

Concept Description vs. OLAP
Similarities:
- Data generalization
- Presentation of data summarization at multiple levels of abstraction
- Interactive drilling, pivoting, slicing, and dicing
Differences (what concept description adds):
- Complex data types of the attributes and their aggregations
- Automated process to find relevant attributes and the generalization degree
- Dimension relevance analysis and ranking when there are many relevant dimensions

Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop).
- Not confined to categorical data or to particular measures.
How is it done?
- Collect the task-relevant data (the initial relation) using a relational database query.
- Perform data generalization by attribute removal or attribute generalization, based on the number of distinct values of each attribute.
- Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
- Present the results interactively to users.

Basic Principles (1)
- Data focusing: select the task-relevant data, including dimensions; the result is the initial (working) relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.

Basic Principles (2)
Two methods to control a generalization process:
- Attribute-threshold control: typically 2-8, user-specified or default. If the number of distinct values in an attribute is greater than the attribute threshold, then removal or generalization applies.
- Generalized-relation threshold control: sets a threshold for the size of the generalized (final) relation or rule. If the number of distinct tuples in the generalized relation is greater than the threshold, then further generalization applies.
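The attribute-threshold rule, combined with the removal and generalization principles from the previous slide, can be sketched in Python roughly as follows. This is only an illustration: the function name `plan_attribute`, the sample column values, and the threshold of 3 are assumptions, not part of the slides.

```python
def plan_attribute(values, threshold, has_hierarchy):
    """Decide what to do with one attribute: 'keep', 'generalize', or 'remove'."""
    # Few distinct values: the attribute is already at a usable level.
    if len(set(values)) <= threshold:
        return "keep"
    # Many distinct values: generalize if an operator (hierarchy) exists,
    # otherwise the attribute cannot be summarized and is removed.
    return "generalize" if has_hierarchy else "remove"


# Hypothetical column values, for illustration only.
names = ["Jim", "Scott", "Laura", "Ana", "Rui"]
majors = ["CS", "Physics", "CS", "Math", "Biology"]
genders = ["M", "M", "F", "F", "M"]

plan = {
    "name": plan_attribute(names, threshold=3, has_hierarchy=False),
    "major": plan_attribute(majors, threshold=3, has_hierarchy=True),
    "gender": plan_attribute(genders, threshold=3, has_hierarchy=False),
}
print(plan)
```

With these sample values, `name` is removed (many distinct values, no hierarchy), `major` is generalized, and `gender` is kept.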

Basic Principles (3)
Accumulate count or other aggregate values to provide statistical information about the data at different levels of abstraction.
Example: the count value for a tuple in the initial relation is 1. When the data is generalized, n tuples of the initial relation that become identical are merged into a single generalized tuple whose count is n.

Basic Algorithm
1. InitialRel: process the query for task-relevant data, deriving the initial relation.
2. PreGen: based on the number of distinct values in each attribute, determine a generalization plan for each attribute: remove it, or how high to generalize it.
3. PrimeGen: based on the PreGen plan, generalize to the right level to derive a "prime generalized relation", accumulating the counts.
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross-tabs, and visualization presentations.
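A minimal Python sketch of the PreGen and PrimeGen steps is given below. It is a simplification under stated assumptions: the concept hierarchies, the sample rows, and the function name are hypothetical, and each attribute is generalized by exactly one level instead of climbing until the threshold is met.

```python
from collections import Counter

# Hypothetical one-level concept hierarchies (value -> parent concept).
HIERARCHIES = {
    "major": {"CS": "Science", "Physics": "Science", "Math": "Science",
              "Finance": "Business", "Marketing": "Business"},
    "birth_place": {"Lisbon": "Portugal", "Porto": "Portugal",
                    "Vancouver": "Canada", "Toronto": "Canada"},
}


def attribute_oriented_induction(rows, attributes, threshold):
    """Return (kept attributes, Counter of generalized tuples -> counts)."""
    # PreGen: plan per attribute from its number of distinct values.
    plan = {}
    for attr in attributes:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan[attr] = "keep"
        elif attr in HIERARCHIES:
            plan[attr] = "generalize"
        else:
            plan[attr] = "remove"
    kept = [a for a in attributes if plan[a] != "remove"]

    # PrimeGen: generalize one level, merge identical tuples, add up counts.
    relation = Counter()
    for row in rows:
        generalized = tuple(
            HIERARCHIES[a].get(row[a], row[a]) if plan[a] == "generalize" else row[a]
            for a in kept
        )
        relation[generalized] += 1  # every initial tuple contributes count 1
    return kept, relation


# Hypothetical initial relation (InitialRel would come from a database query).
rows = [
    {"name": "Jim", "gender": "M", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Scott", "gender": "M", "major": "CS", "birth_place": "Toronto"},
    {"name": "Laura", "gender": "F", "major": "Physics", "birth_place": "Lisbon"},
    {"name": "Ana", "gender": "F", "major": "Finance", "birth_place": "Porto"},
    {"name": "Rui", "gender": "M", "major": "Math", "birth_place": "Toronto"},
]
kept, relation = attribute_oriented_induction(
    rows, ["name", "gender", "major", "birth_place"], threshold=2
)
for generalized_tuple, count in relation.items():
    print(dict(zip(kept, generalized_tuple)), "count =", count)
```

In this toy run `name` is removed, `major` and `birth_place` are generalized, and the five initial tuples collapse into three generalized tuples.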

Class Characterization: Example (1)
Describe the general characteristics of graduate students in the Big-University database (in DMQL):

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Corresponding SQL statement:

select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in ('Msc', 'MBA', 'PhD')

Class Characterization: Example (2)
Initial relation [table not included in the transcript]

Class Characterization: Example (3)
Prime generalized relation and cross-tab [tables not included in the transcript]

Presentation of Generalized Results (1)
- Generalized relation: a relation where some or all attributes are generalized, with counts or other aggregate values accumulated.
- Cross tabulation: mapping the results into cross-tabulation form (similar to contingency tables).
- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.

Presentation: Generalized Relation [table not included in the transcript]

Presentation: Cross-tab [table not included in the transcript]

Presentation of Generalized Results (2)
A generalized relation may also be represented in the form of logic rules.
- C_j: the target class
- q_a: a generalized tuple describing the target class
- t-weight for q_a: the percentage of tuples of the target class from the initial working relation that are covered by q_a; range: [0, 1]
  \[ \text{t\_weight}(q_a) = \frac{\text{count}(q_a)}{\sum_{i=1}^{n} \text{count}(q_i)} \]
  where q_1, ..., q_n are the generalized tuples of the target class.

Presentation of Generalized Results (3)
Quantitative characteristic rules: map the generalized result into characteristic rules with quantitative information associated with them.
- The disjunction of the conditions forms a necessary condition of the target class, i.e., all tuples of the target class must satisfy the condition.
- It is not a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class.
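As a rough illustration, the t-weight of each generalized tuple and a plain-text rendering of a quantitative characteristic rule can be computed as below. The counts, the condition strings, and the rule layout (`target(X) => cond [t: w] or ...`) are assumptions made for this sketch; the slide's own rule was not preserved in the transcript.

```python
def t_weights(generalized_counts):
    """Share of the target class covered by each generalized tuple."""
    total = sum(generalized_counts.values())
    return {cond: count / total for cond, count in generalized_counts.items()}


def characteristic_rule(target, generalized_counts):
    """Render a disjunction of conditions, each annotated with its t-weight."""
    weights = t_weights(generalized_counts)
    parts = [f"{cond} [t: {w:.0%}]" for cond, w in weights.items()]
    return f"{target}(X) => " + " or ".join(parts)


# Hypothetical generalized tuples of the target class with their counts.
counts = {"major = Science": 90, "major = Engineering": 30}
print(characteristic_rule("graduate", counts))
```

Because the t-weights of one class are shares of the same total, they always add up to 1, which reflects the "necessary condition" reading: every tuple of the target class falls under one of the disjuncts.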

Attribute Relevance Analysis (1)
Why?
- Which dimensions should be included?
- How high a level of generalization?
- Automatic vs. interactive
- Reduce the number of attributes; patterns become easier to understand
What?
- A statistical method for preprocessing data
  - filter out irrelevant or weakly relevant attributes
  - retain or rank the relevant attributes
- Relevance is related to dimensions and levels
- Analytical characterization, analytical comparison

Attribute Relevance Analysis (2)
How?
1. Data collection
2. Preliminary relevance analysis using conservative AOI
3. Analytical generalization
   - Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
   - Sort and select the most relevant dimensions and levels.
4. Attribute-oriented induction for class description
   - Use a less conservative threshold for AOI.

Relevance Measures
Quantitative relevance measure: determines the classifying power of an attribute within a set of data.
Methods:
- information gain (ID3)
- gain ratio (C4.5)
- Gini index
- χ² contingency table statistics
- uncertainty coefficient

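Entropy and Information Gain
- S contains s_i tuples of class C_i for i = 1, ..., m, with s = s_1 + ... + s_m.
- Entropy, or expected information, measures the information required to classify an arbitrary tuple:
  \[ I(s_1, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s} \]
- Entropy of attribute A with values {a_1, a_2, ..., a_v}, where s_{ij} is the number of tuples of class C_i with A = a_j:
  \[ E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj}) \]
- Information gained by branching on attribute A:
  \[ \mathrm{Gain}(A) = I(s_1, \dots, s_m) - E(A) \]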
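The three quantities on this slide (expected information, entropy of an attribute, and information gain) can be sketched in Python as follows. The function names and the data layout (one list of per-class counts for each attribute value) are choices made for this example only.

```python
from math import log2


def expected_info(class_counts):
    """I(s_1, ..., s_m): bits needed to classify an arbitrary tuple."""
    total = sum(class_counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def attribute_entropy(partitions):
    """E(A): weighted expected information after splitting on attribute A.

    `partitions` holds, for each value a_j of A, the list of per-class
    counts [s_1j, ..., s_mj].
    """
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * expected_info(p) for p in partitions)


def information_gain(partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    class_totals = [sum(column) for column in zip(*partitions)]
    return expected_info(class_totals) - attribute_entropy(partitions)


# An attribute that separates two classes perfectly, and one that does not.
perfect = [[2, 0], [0, 2]]
useless = [[1, 1], [1, 1]]
print(information_gain(perfect), information_gain(useless))
```

An attribute whose values separate the classes perfectly recovers the full expected information as gain, while an attribute whose partitions mirror the overall class mix gains nothing.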

Example of Analytical Characterization (1)
Task: mine general characteristics describing graduate students using analytical characterization.
Given:
- attributes name, gender, major, birth_place, birth_date, phone#, and gpa
- Gen(a_i) = concept hierarchies on a_i
- U_i = attribute analytical thresholds for a_i
- T_i = attribute generalization thresholds for a_i
- R = attribute relevance threshold

Example of Analytical Characterization (2)
1. Data collection
   - target class: graduate student
   - contrasting class: undergraduate student
2. Analytical generalization using U_i
   - attribute removal: remove name and phone#
   - attribute generalization: generalize major, birth_place, birth_date, and gpa
   - accumulate counts
   - candidate relation: gender, major, birth_country, age_range, and gpa

Example of Analytical Characterization (3)
Candidate relation for the target class: graduate students (Σ = 120) [table not included in the transcript]
Candidate relation for the contrasting class: undergraduate students (Σ = 130) [table not included in the transcript]

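Example of Analytical Characterization (4)
3. Relevance analysis
   - Calculate the expected information required to classify an arbitrary tuple. With 120 graduate and 130 undergraduate students:
     \[ I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} \approx 0.9988 \]
   - Calculate the entropy of each attribute, e.g., major. For major = "Science", s_{11} is the number of graduate students in "Science" and s_{21} is the number of undergraduate students in "Science".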

Example of Analytical Characterization (5)
- Calculate the expected information required to classify a given sample if S is partitioned according to the attribute.
- Calculate the information gain for each attribute.
- Information gain for all attributes [values not included in the transcript]
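The calculation described in this example can be reproduced numerically along the following lines. The class totals (120 graduate and 130 undergraduate students) come from the slides; the per-value counts for `major` are ASSUMED for illustration, chosen only to be consistent with those totals, because the candidate relation tables did not survive in the transcript.

```python
from math import log2


def expected_info(counts):
    """I(s_1, ..., s_m) in bits; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Class totals from the slides: 120 graduate, 130 undergraduate students.
base = expected_info([120, 130])

# ASSUMED (graduate, undergraduate) counts per value of `major`.
major = {"Science": (84, 42), "Engineering": (36, 46), "Business": (0, 42)}

total = sum(g + u for g, u in major.values())
e_major = sum((g + u) / total * expected_info([g, u]) for g, u in major.values())
gain_major = base - e_major

print(f"I(120, 130) = {base:.4f}")
print(f"E(major)    = {e_major:.4f}")
print(f"Gain(major) = {gain_major:.4f}")
```

Repeating the same computation for gender, birth_country, gpa, and age_range would give the full list of gains that the relevance threshold R is then applied to.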

Example of Analytical Characterization (6)
4. Derive the initial working relation (W_0)
   - R = 0.1
   - remove irrelevant or weakly relevant attributes from the candidate relation => drop gender and birth_country
   - remove the contrasting class candidate relation
5. Perform attribute-oriented induction on W_0 using T_i
Initial target class working relation W_0: graduate students [table not included in the transcript]
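Step 4 amounts to ranking the attributes by information gain and keeping those that reach the threshold R. A small sketch follows; the threshold R = 0.1 is from the slide, but the gain values are HYPOTHETICAL placeholders, picked only so that gender and birth_country fall below R as the slide states.

```python
R = 0.1  # attribute relevance threshold, as given on the slide

# HYPOTHETICAL information-gain values (not taken from the transcript).
gains = {
    "gender": 0.00,
    "birth_country": 0.04,
    "major": 0.21,
    "gpa": 0.45,
    "age_range": 0.60,
}

# Rank attributes from most to least relevant, then apply the threshold.
ranked = sorted(gains, key=gains.get, reverse=True)
relevant = [attr for attr in ranked if gains[attr] >= R]
dropped = [attr for attr in ranked if gains[attr] < R]

print("keep:", relevant)
print("drop:", dropped)
```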

Mining Class Comparisons
Comparison: comparing two or more classes.
Method:
- Partition the set of relevant data into the target class and the contrasting class(es).
- Generalize both classes to the same high-level concepts.
- Compare tuples with the same high-level descriptions.
- Present for every tuple its description and two measures:
  - support: distribution within a single class
  - comparison: distribution between classes
- Highlight the tuples with strong discriminant features.
Relevance analysis:
- Find the attributes (features) that best distinguish the different classes.

Quantitative Discriminant Rules
- C_j: the target class
- q_a: a generalized tuple that covers some tuples of the target class, but can also cover some tuples of the contrasting class(es)
- d-weight; range: [0, 1]
  \[ \text{d\_weight} = \frac{\text{count}(q_a \in C_j)}{\sum_{i=1}^{m} \text{count}(q_a \in C_i)} \]
- Quantitative discriminant rule form:
  \[ \forall X,\ \text{target\_class}(X) \Leftarrow \text{condition}(X) \quad [d: \text{d\_weight}] \]

Example (1)
Compare the general properties of the graduate students and the undergraduate students in the Big-University database, given the attributes name, gender, etc. (in DMQL):

use Big_University_DB
mine comparison as "Grad-vs-Undergrad"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student

Example (2)
Count distribution between graduate and undergraduate students for a generalized tuple [table not included in the transcript]
Quantitative discriminant rule [rule not included in the transcript], where the d-weight is 90/(90+210) = 30%.
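The d-weight arithmetic on this slide can be checked with a few lines of Python. The counts 90 and 210 are the ones shown on the slide; the function name and the dictionary layout are assumptions of this sketch.

```python
def d_weight(counts_by_class, target):
    """Share of a generalized tuple's coverage that falls in the target class."""
    return counts_by_class[target] / sum(counts_by_class.values())


# Counts for one generalized tuple, as on the slide: 90 graduate students
# and 210 undergraduate students are covered by it.
counts = {"graduate": 90, "undergraduate": 210}
print(f"d-weight(graduate) = {d_weight(counts, 'graduate'):.0%}")
```

A d-weight close to 1 means the generalized tuple covers almost only target-class tuples; here, 30% indicates that most tuples matching this description are undergraduates.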

Class Description
- Quantitative characteristic rule: necessary
- Quantitative discriminant rule: sufficient
- Quantitative description rule: necessary and sufficient

Example: Quantitative Description Rule
Quantitative description rule for target class Europe [rule not included in the transcript]
Cross-tab showing the associated t-weight and d-weight values and the total number (in thousands) of TVs and computers sold at AllElectronics in 1998 [table not included in the transcript]
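A quantitative description rule attaches both a t-weight and a d-weight to each condition. The sketch below derives both from a cross-tab of counts. The sales figures are HYPOTHETICAL (the slide's cross-tab is missing from the transcript), and the plain-text rule layout is an assumption.

```python
# HYPOTHETICAL cross-tab: units sold (in thousands) per region and item.
crosstab = {
    "Europe": {"TV": 80, "computer": 240},
    "North_America": {"TV": 120, "computer": 560},
}


def description_rule(crosstab, target):
    """Render each condition of the target class with its t- and d-weight."""
    class_total = sum(crosstab[target].values())
    parts = []
    for item, count in crosstab[target].items():
        t = count / class_total                              # within the class
        d = count / sum(row[item] for row in crosstab.values())  # across classes
        parts.append(f'(item = "{item}") [t: {t:.0%}, d: {d:.0%}]')
    return f"{target}(X) <=> " + " or ".join(parts)


print(description_rule(crosstab, "Europe"))
```

The t-weight is computed along the row of the target class and the d-weight down the column of each item, which is why a single cross-tab is enough to produce both.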

Bibliography
- (Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 5 of the 2001 book; Section 3.7 of the draft)
- (Book) Machine Learning, T. Mitchell, McGraw-Hill, 1997 (Section 3.4)

Information-Theoretic Approach
Decision tree:
- each internal node tests an attribute
- each branch corresponds to an attribute value
- each leaf node assigns a classification
ID3 algorithm:
- builds a decision tree from training objects with known class labels, in order to classify testing objects
- ranks attributes with the information gain measure
- aims for minimal height: the least number of tests to classify an object

Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}

Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes
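The PlayTennis tree on this slide can be written down as a nested structure and used to classify an example, as in the sketch below. The tuple-and-dictionary encoding and the function name are choices made for this illustration; the tree is not learned here, only applied.

```python
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
TREE = ("Outlook", {
    "sunny": ("Humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("Wind", {"strong": "no", "weak": "yes"}),
})


def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is hit."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


day = {"Outlook": "sunny", "Temperature": "hot", "Humidity": "high", "Wind": "weak"}
print(classify(TREE, day))
```

Note that Temperature is never tested: an attribute with low information gain does not have to appear in the tree, which keeps the number of tests per object small.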