Knowledge Discovery: Transparencies prepared by Ho Tu Bao [JAIST], ITCS 6162



Slide 2: Clustering
"Are there clusters of similar cells?"
- Light color with 1 nucleus
- Dark color with 2 tails
- 2 nuclei
- 1 nucleus and 1 tail
- Dark color with 1 tail and 2 nuclei

Slide 3: Association Rule Discovery
Task: discovering association rules among items in a transaction database. An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B. In general: A1, A2, … => B.

Slide 4: Association Rule Discovery
"Are there any associations between the characteristics of the cells?"
- If color = light and # nuclei = 1, then # tails = 1 (support = 12.5%; confidence = 50%)
- If # nuclei = 2 and cell = cancerous, then # tails = 2 (support = 25%; confidence = 100%)
- If # tails = 1, then color = light (support = 37.5%; confidence = 75%)

Slide 5: Many Other Data Mining Techniques
Genetic algorithms, statistics, Bayesian networks, rough sets, time series, text mining.

Slide 6: Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD process
3. KDD applications
4. Data mining methods
5. Challenges for KDD

Slide 7: Challenges and Influential Aspects
- Massive data sets, high dimensionality (efficiency, scalability)
- Different sources of data (distributed, heterogeneous databases; noisy, missing, irrelevant data, etc.)
- Handling of different types of data, with different degrees of supervision
- Changing data and knowledge
- Understandability of patterns; various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
- Interactive, visual knowledge discovery

Slide 8: Massive Data Sets and High Dimensionality
High dimensionality increases the size of the space of patterns exponentially: with p attributes, each taking on average d discrete values, the space of possible instances has size d^p.
Example:
- Classes: {Cancerous, Healthy}
- Attributes: cell body {dark, light}, # nuclei {1, 2}, # tails {1, 2}
- # instances = 2^3 = 8
[Figure: example healthy cells (h1, h2) and cancerous cells (C1, C2, C3).]
With 38 attributes of 10 values each, # instances = 10^38.
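The slide's counting argument can be checked directly; this is a minimal sketch (the function name is illustrative, not from the lecture):

```python
# Size of the instance space grows exponentially with the number of
# attributes: with p attributes of d values each, there are d**p instances.

def instance_space_size(d: int, p: int) -> int:
    """Number of distinct instances with p attributes of d values each."""
    return d ** p

# The cell example: 3 binary attributes -> 2**3 = 8 instances.
print(instance_space_size(2, 3))    # 8
# 38 attributes with 10 values each -> 10**38 instances.
print(instance_space_size(10, 38))  # 10**38
```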

Slide 9: Different Types of Data
Attribute types:
- Symbolic (combinatorial search in hypothesis spaces; machine learning):
  - Nominal (categorical), no structure: places, color
  - Ordinal, ordinal structure: rank, resemblance
- Numerical (often matrix-based computation; multivariate data analysis):
  - Measurable, ring structure: age, temperature, taste, income, length

Slide 10: Brief Introduction to Lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Slide 11: Lecture 2: Preparing Data
As much as 80% of the KDD effort is about preparing data; the remaining 20% is about mining.
Content of the lecture:
1. Data cleaning
2. Data transformations
3. Data reduction
4. Software and case studies
Prerequisite: nothing special, but some understanding of statistics is expected.

Slide 12: Data Preparation
The design and organization of data, including the setting of goals and the composition of features, is done by humans. There are two central goals for the preparation of data:
- To organize data into a standard form that is ready for processing by data mining programs.
- To prepare features that lead to the best data mining performance.

Slide 13: Brief Introduction to Lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Slide 14: Lecture 3: Decision Tree Induction
One of the most widely used KDD classification techniques for supervised data.
Content of the lecture:
1. Decision tree representation and framework
2. Attribute selection
3. Pruning trees
4. Software (C4.5, CABRO) and case studies
Prerequisite: nothing special.

Slide 15: Decision Trees
A decision tree is a classifier in the form of a tree structure in which each node is either:
- a leaf node, indicating a class of instances, or
- a decision node, specifying some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf is reached.
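The classification walk described above can be sketched in a few lines of Python. The tree encoding (nested dicts for decision nodes, plain labels for leaves) is an assumption for illustration, not from the lectures:

```python
# A decision node is a dict {"attribute": ..., "branches": {value: subtree}};
# a leaf node is just the class label.

def classify(tree, instance):
    """Walk from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree

# A tree consistent with the flu example used in these slides:
flu_tree = {
    "attribute": "Temperature",
    "branches": {
        "normal": "no",
        "high": {"attribute": "Headache",
                 "branches": {"yes": "yes", "no": "no"}},
        "very high": {"attribute": "Headache",
                      "branches": {"yes": "yes", "no": "no"}},
    },
}
print(classify(flu_tree, {"Temperature": "high", "Headache": "yes"}))  # yes
```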

Slide 16: General Framework of Decision Tree Induction
1. Choose the "best" attribute by a given selection measure
2. Extend the tree by adding a new branch for each value of that attribute
3. Sort the training examples to the leaf nodes
4. If examples are unambiguously classified, then stop; else repeat steps 1-4 for the leaf nodes
5. Prune unstable leaf nodes
Example training set:
  e1: Headache = yes, Temperature = normal, Flu = no
  e2: Headache = yes, Temperature = high, Flu = yes
  e3: Headache = yes, Temperature = very high, Flu = yes
  e4: Headache = no, Temperature = normal, Flu = no
  e5: Headache = no, Temperature = very high, Flu = no
  e6: Headache = no, Temperature = high, Flu = no
[Figure: tree with root Temperature; normal branch {e1, e4} -> no; high branch {e2, e5} and very high branch {e3, e6} each split further on Headache, with yes -> yes and no -> no.]
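Steps 1-4 of the framework can be sketched as a short recursion. This is a minimal illustration, not the lecture's code: the selection measure is passed in as a function, and here a trivial "take the first remaining attribute" measure is used so that, on the flu data, the tree splits on Temperature first as in the slide's figure.

```python
from collections import Counter

def induce(examples, attributes, select):
    """Steps 1-4: choose the 'best' attribute, branch on its values,
    sort examples to the branches, recurse until unambiguous."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # unambiguously classified: stop
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = select(examples, attributes)  # step 1: "best" attribute
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {x[best] for x, _ in examples}:         # step 2: branches
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = induce(subset, rest, select)   # steps 3-4: recurse
    return {"attribute": best, "branches": branches}

flu = [
    ({"Headache": "yes", "Temperature": "normal"}, "no"),
    ({"Headache": "yes", "Temperature": "high"}, "yes"),
    ({"Headache": "yes", "Temperature": "very high"}, "yes"),
    ({"Headache": "no", "Temperature": "normal"}, "no"),
    ({"Headache": "no", "Temperature": "high"}, "no"),
    ({"Headache": "no", "Temperature": "very high"}, "no"),
]
tree = induce(flu, ["Temperature", "Headache"], lambda ex, attrs: attrs[0])
print(tree["attribute"])  # Temperature
```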

Slide 17: Some Attribute Selection Measures
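The transcript does not reproduce the measures on this slide. One widely used selection measure (the one behind ID3/C4.5, given here as an illustration rather than as the lecture's own choice) is information gain: the entropy of the class labels minus the weighted entropy after splitting on an attribute. Note that different measures can rank attributes differently; on the toy flu data below, Headache scores higher than Temperature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy of the labels minus the weighted entropy after splitting
    on `attribute`; examples are (attribute-dict, label) pairs."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

flu = [
    ({"Headache": "yes", "Temperature": "normal"}, "no"),
    ({"Headache": "yes", "Temperature": "high"}, "yes"),
    ({"Headache": "yes", "Temperature": "very high"}, "yes"),
    ({"Headache": "no", "Temperature": "normal"}, "no"),
    ({"Headache": "no", "Temperature": "high"}, "no"),
    ({"Headache": "no", "Temperature": "very high"}, "no"),
]
print(information_gain(flu, "Headache"))     # ~0.459
print(information_gain(flu, "Temperature"))  # ~0.252
```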

Slide 18: Avoiding Overfitting
How can we avoid overfitting?
- Stop growing when a data split is not statistically significant (pre-pruning)
- Grow the full tree, then post-prune (post-pruning)
Two post-pruning techniques: reduced-error pruning and rule post-pruning.

Slide 19: Converting a Tree to Rules
[Figure: tree with root Outlook (sunny, overcast, rain); the sunny branch splits on Humidity (high -> no, normal -> yes), the rain branch splits on Wind (true -> no, false -> yes).]
IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes
...

Slide 20: CABRO: Decision Tree Induction
CABRO is based on the R-measure, a measure of attribute dependency stemming from rough set theory.
[Figure: a discovered decision tree and an unknown case; the path matching the unknown instance gives its class.]

Slide 21: Brief Introduction to Lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Slide 22: Lecture 4: Mining Association Rules
A new and attractive topic. It allows discovering important associations among the items of transactions in a database.
Content of the lecture:
1. Basic definitions
2. Algorithm Apriori
3. The basket analysis program
Prerequisite: nothing special.

Slide 23: Measures of Association Rules
There are two measures of rule strength, where [X] denotes the number of records containing all items of X and N is the number of records in the database:
- Support of (A => B) = [AB] / N. The support of a rule is the proportion of times the rule applies.
- Confidence of (A => B) = [AB] / [A]. The confidence of a rule is the proportion of times the rule is correct.
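The two measures translate directly into code. A minimal sketch, assuming transactions are represented as Python sets of items (an encoding chosen here for illustration):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of `itemset`:
    support(X) = [X] / N."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(A => B) = [AB] / [A] = support(A u B) / support(A)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# The transaction database from the Apriori illustration slide.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(support(D, {"B", "E"}))       # 0.75
print(confidence(D, {"C"}, {"E"}))  # 0.666...
```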

Slide 24: Algorithm Apriori
The task of mining association rules is mainly to discover strong association rules (high confidence and high support) in large databases.
Example database:
  TID   Items
  1000  A, B, C
  2000  A, C
  3000  A, D
  4000  B, E, F
With minimum support S = 40%, the large itemsets are: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%.
Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the itemsets whose transaction support is above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.

Slide 25: Algorithm Apriori: Illustration (minimum support S = 40%)
Database D:
  TID  Items
  100  A, C, D
  200  B, C, E
  300  A, B, C, E
  400  B, E
Scan D to count C1 = {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3; keep L1 = {A}: 2, {B}: 3, {C}: 3, {E}: 3.
Generate C2 and scan D: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2; keep L2 = {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2.
Generate C3 = {B,C,E} and scan D: {B,C,E}: 2; keep L3 = {B,C,E}: 2.
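The level-wise search illustrated above can be sketched compactly. This simplified version joins surviving k-itemsets to form candidates but omits Apriori's subset-pruning of candidates (so a few extra candidates get counted and then discarded by the scan); it is an illustration, not the lecture's implementation.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: count C1 in one scan, keep L1, join surviving
    k-itemsets into C(k+1), rescan, and repeat until no candidates remain.
    Returns {frozenset(itemset): count} for all large itemsets."""
    n = len(transactions)
    large = {}
    current = [frozenset([i]) for t in transactions for i in t]
    current = list(dict.fromkeys(current))           # distinct 1-itemsets
    k = 1
    while current:
        # one "scan" of D counts all candidates of the current level
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        large.update(level)
        # join surviving k-itemsets to build the (k+1)-candidates
        current = list({a | b for a, b in combinations(level, 2)
                        if len(a | b) == k + 1})
        k += 1
    return large

# The database from the illustration, with S = 40%.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, 0.4)
print(result[frozenset({"B", "C", "E"})])  # 2
```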

