DATA MINING
VESIT, M. Vijayalakshmi

OVERVIEW OF DATA MINING
Outline of the Presentation
- Motivation & Introduction
- Data Mining Algorithms
- Teaching Plan
Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused: web data, e-commerce purchases at department/grocery stores, bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management)
Typical Decision Making
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Which of my customers are likely to be the most loyal?
- Which insurance claims are potential frauds?
- Who may not pay back loans?
- Which consistent performers should be bid for in the IPL?
- Who are potential customers for a new toy?
Data mining helps extract such information.
Why Mine Data? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour): remote sensors on satellites, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data
- Traditional techniques are infeasible for such raw data
- Data mining can help scientists classify and segment data and assist in hypothesis formation
Mining Large Data Sets - Motivation
- There is often information "hidden" in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
Data Mining Works with Warehouse Data
- Data warehousing provides the enterprise with a memory.
- Data mining provides the enterprise with intelligence.
What Is Data Mining?
- Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
- Alternative names and their "inside stories": knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining? (Deductive) query processing; expert systems or small ML/statistical programs.
Potential Applications
- Market analysis and management: target marketing, CRM, market basket analysis, cross-selling, market segmentation
- Risk analysis and management: forecasting, customer retention, quality control, competitive analysis
- Fraud detection and management
- Text mining (newsgroups, email, documents) and Web analysis
- Intelligent query answering
Other Applications
- Sports: game statistics mined to gain competitive advantage
- Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Web analysis: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
What makes data mining possible?
Advances in the following areas are making data mining deployable:
- data warehousing
- better and more data (i.e., operational, behavioral, and demographic)
- the emergence of easily deployed data mining tools
- the advent of new data mining techniques
(Gartner Group)
What Is Not Data Mining
Database queries:
- Find all credit applicants with the last name Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
Data mining:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories: object-oriented and object-relational databases, spatial databases, time-series and temporal data, text and multimedia databases, heterogeneous and legacy databases, WWW
Data Mining Models And Tasks
Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structure of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, etc.
Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness (association vs. classification vs. clustering).
- Search for only interesting patterns: either first generate all the patterns and then filter out the uninteresting ones, or generate only the interesting patterns.
Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
- Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.
KDD Process
- Selection: obtain data from various sources.
- Preprocessing: cleanse data.
- Transformation: convert to a common format; transform to a new format.
- Data Mining: obtain desired results.
- Interpretation/Evaluation: present results to the user in a meaningful manner.
Data Mining and Business Intelligence
Increasing potential to support business decisions (bottom layer to top layer):
- Data sources: paper, files, information providers, database systems, OLTP (managed by the DBA)
- Data warehouses / data marts: OLAP, MDA (DBA)
- Data exploration: statistical analysis, querying and reporting (data analyst)
- Data mining: information discovery (data analyst)
- Data presentation: visualization techniques (business analyst)
- Making decisions (end user)
Data Mining Development
Data mining draws on many earlier developments:
- Information retrieval: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, Web search engines
- Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques
- Statistics: Bayes theorem, regression analysis, EM algorithm, k-means clustering, time series analysis
- Algorithms: algorithm design techniques, algorithm analysis, data structures
- Machine learning: neural networks, decision tree algorithms
Data Mining Issues
Human interaction, overfitting, outliers, interpretation, visualization, large datasets, high dimensionality, multimedia data, missing data, irrelevant data, noisy data, changing data, integration, application
Social Implications of DM
- Privacy
- Profiling
- Unauthorized use
Data Mining Metrics
- Usefulness
- Return on Investment (ROI)
- Accuracy
- Space/time
Data Mining Algorithms
- Classification
- Clustering
- Association Mining
- Web Mining
Data Mining Tasks
- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.
Data Mining Algorithms
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Data Mining Algorithms
CLASSIFICATION
Classification
Given old data about customers and their payments, predict a new applicant's loan eligibility. Records of previous customers (age, salary, profession, location, customer type) are used to train a classifier such as a decision tree (with tests like "Salary > 5K" or "Prof. = Exec" leading to good/bad), which is then applied to the new applicant's data.
Classification Problem
- Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class.
- This effectively divides D into equivalence classes.
- Prediction is similar, but may be viewed as having an infinite number of classes.
Supervised vs. Unsupervised Learning
- Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set.
- Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Overview of Naive Bayes
- The goal of Naive Bayes is to work out whether a new example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.
- Bayes rule, where E is the observed evidence: P[H | E] = P[E | H] · P[H] / P[E]
- P[H] is called the prior probability (of the hypothesis). P[H | E] is called the posterior probability (of the hypothesis given the evidence).
Worked Example 1
Take the following training data, from bank loan applicants:

ApplicantID  City   Children  Income  Status
1            Delhi  Few       High    PAYS
2            Delhi  Many      Low     DEFAULTS
3            Delhi  Few       Medium  PAYS
4            Delhi  Many      Medium  DEFAULTS

From this data:
P[City=Delhi | Status=DEFAULTS] = 2/2 = 1
P[City=Delhi | Status=PAYS] = 2/2 = 1
P[Children=Many | Status=DEFAULTS] = 2/2 = 1
P[Children=Few | Status=DEFAULTS] = 0/2 = 0
etc.
Worked Example 1
Summarizing, we have the following probabilities:

Probability of...   given DEFAULTS   given PAYS
City=Delhi          2/2 = 1          2/2 = 1
Children=Few        0/2 = 0          2/2 = 1
Children=Many       2/2 = 1          0/2 = 0
Income=Low          1/2 = 0.5        0/2 = 0
Income=Medium       1/2 = 0.5        1/2 = 0.5
Income=High         0/2 = 0          1/2 = 0.5

and P[Status=DEFAULTS] = 2/4 = 0.5, P[Status=PAYS] = 2/4 = 0.5.

For example, the probability of (Income=Medium) given that the applicant DEFAULTS is the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5.
Worked Example 1
Now assume a new example is presented where City=Delhi, Children=Many, and Income=Medium.
First, we estimate the likelihood that the example is a defaulter, given its attribute values, using P[H1|E] ∝ P[E|H1]·P[H1] (the common denominator P[E] is omitted):
P[Status=DEFAULTS | Delhi, Many, Medium]
= P[Delhi|DEFAULTS] × P[Many|DEFAULTS] × P[Medium|DEFAULTS] × P[DEFAULTS]
= 1 × 1 × 0.5 × 0.5 = 0.25
Then we estimate the likelihood that the example is a payer, given its attributes:
P[Status=PAYS | Delhi, Many, Medium]
= P[Delhi|PAYS] × P[Many|PAYS] × P[Medium|PAYS] × P[PAYS]
= 1 × 0 × 0.5 × 0.5 = 0
As the conditional likelihood of being a defaulter is higher (0.25 > 0), we conclude that the new example is a defaulter.
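The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch (not the lecture's code); the four training rows are the ones reconstructed in the table of Worked Example 1.

```python
from collections import Counter, defaultdict

# Training data reconstructed from the worked example: (City, Children, Income, Status)
rows = [
    ("Delhi", "Few",  "High",   "PAYS"),
    ("Delhi", "Many", "Low",    "DEFAULTS"),
    ("Delhi", "Few",  "Medium", "PAYS"),
    ("Delhi", "Many", "Medium", "DEFAULTS"),
]

# Prior counts per class and conditional counts (attribute position, class) -> value counts
class_counts = Counter(r[-1] for r in rows)
cond = defaultdict(Counter)
for r in rows:
    for i, value in enumerate(r[:-1]):
        cond[(i, r[-1])][value] += 1

def score(example, cls):
    """Naive Bayes numerator P[E|H] * P[H]; the common denominator P[E] is omitted."""
    p = class_counts[cls] / len(rows)
    for i, value in enumerate(example):
        p *= cond[(i, cls)][value] / class_counts[cls]
    return p

new_applicant = ("Delhi", "Many", "Medium")
for cls in class_counts:
    print(cls, score(new_applicant, cls))
# Prints 0.25 for DEFAULTS and 0.0 for PAYS, so the applicant is classified as a defaulter.
```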
Worked Example 1
Now assume a new example is presented where City=Delhi, Children=Many, and Income=High.
First, we estimate the likelihood that the example is a defaulter, given its attribute values:
P[Status=DEFAULTS | Delhi, Many, High]
= P[Delhi|DEFAULTS] × P[Many|DEFAULTS] × P[High|DEFAULTS] × P[DEFAULTS]
= 1 × 1 × 0 × 0.5 = 0
Then we estimate the likelihood that the example is a payer, given its attributes:
P[Status=PAYS | Delhi, Many, High]
= P[Delhi|PAYS] × P[Many|PAYS] × P[High|PAYS] × P[PAYS]
= 1 × 0 × 0.5 × 0.5 = 0
As the conditional likelihood of being a defaulter is the same as that of being a payer (both are 0), we can come to no conclusion for this example.
Weaknesses of Naive Bayes
- Naive Bayes assumes that variables are equally important and that they are independent, which is often not the case in practice.
- Naive Bayes is damaged by the inclusion of redundant (strongly dependent) attributes.
- Sparse data: if some attribute values are not present in the training data, a zero estimate for P[E|H] can result. This forces P[H|E] to zero no matter how strong the evidence from the other attribute values is. Adding small positive values to the estimated probabilities (a Laplace-style correction) is often used to correct this.
Classification Using Decision Trees
- Partitioning based: divide the search space into rectangular regions; a tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built (DT induction).
- Internal nodes are associated with attributes and arcs with values of those attributes.
- Algorithms: ID3, C4.5, CART
DT Issues
- Choosing splitting attributes
- Ordering of splitting attributes
- Splits
- Tree structure
- Stopping criteria
- Training data
- Pruning
DECISION TREES
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path to a leaf node.
Training Set: the weather data used in the following examples.
Example
A decision tree for the weather data:
- Outlook = sunny → test Humidity: high → N, normal → P
- Outlook = overcast → P
- Outlook = rain → test Windy: true → N, false → P
Building a Decision Tree
- Top-down tree construction: at the start, all training examples are at the root; partition the examples recursively by choosing one attribute at a time.
- Bottom-up tree pruning: remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
- Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree.
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root; attributes are categorical.
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed to label the leaf).
- There are no samples left.
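As a practical aside (not part of the original slides), this greedy induction is what libraries implement directly. The sketch below uses scikit-learn's DecisionTreeClassifier with the entropy criterion on a tiny invented weather-style table; the data and column names are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny invented weather-style training set; categorical attributes are one-hot encoded
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy":   [False, True, False, False, True, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})
X = pd.get_dummies(data[["outlook", "windy"]])
y = data["play"]

# criterion="entropy" selects splits by information gain, as in ID3/C4.5
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```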
Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions: information gain (ID3/C4.5), information gain ratio, Gini index.
Which attribute to select?
A criterion for attribute selection
- Which is the best attribute? The one which will result in the smallest tree.
- Heuristic: choose the attribute that produces the "purest" nodes.
- A popular impurity criterion is information gain: information gain increases with the average purity of the subsets that an attribute produces.
- Strategy: choose the attribute that results in the greatest information gain.
Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information Gain in Decision Tree Induction
- Assume that using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}.
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
  E(A) = Σ (i = 1..v) ((pi + ni)/(p + n)) · I(pi, ni)
- The encoding information that would be gained by branching on A is
  Gain(A) = I(p, n) - E(A)
Example: attribute "Outlook"
- "Outlook" = "Sunny": info([2,3]) = 0.971 bits
- "Outlook" = "Overcast": info([4,0]) = 0 bits
- "Outlook" = "Rainy": info([3,2]) = 0.971 bits
- Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits
- Note: log2(0) is normally not defined; the term 0 · log2(0) is taken to be 0.
Computing the information gain
- Information gain = information before splitting - information after splitting
- gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes from the weather data:
  gain("Outlook") = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity") = 0.152 bits
  gain("Windy") = 0.048 bits
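The quoted gains can be reproduced with a short script. The sketch below assumes the classic 14-instance play-tennis ("weather") data set that these slides appear to use; under that assumption it prints roughly 0.247 (outlook), 0.029 (temperature), 0.152 (humidity) and 0.048 (windy).

```python
from collections import Counter
from math import log2

# Classic play-tennis data assumed here: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny","hot","high",False,"no"), ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"), ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"), ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"), ("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"), ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"), ("rainy","mild","high",True,"no"),
]
attrs = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def info(labels):
    """Entropy of a list of class labels; 0*log2(0) is treated as 0."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values() if c)

def gain(rows, attr_idx):
    """Information gain = info before the split - weighted info after splitting."""
    before = info([r[-1] for r in rows])
    after = 0.0
    for value in set(r[attr_idx] for r in rows):
        subset = [r[-1] for r in rows if r[attr_idx] == value]
        after += len(subset) / len(rows) * info(subset)
    return before - after

for name, idx in attrs.items():
    print(f"gain({name}) = {gain(data, idx):.3f}")
```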
Continuing to split: within each branch, the gain computation is repeated on the remaining attributes.
The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
- Splitting stops when the data cannot be split any further.
Avoid Overfitting in Classification
- The generated tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers. The result is poor accuracy on unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the "best pruned tree".
Data Mining Algorithms
Clustering
What is Cluster Analysis?
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised classification: there are no predefined classes.
- Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Clustering vs. Classification
- No prior knowledge: neither the number of clusters nor the meaning of the clusters is known in advance.
- Cluster results are dynamic.
- Unsupervised learning.
Clustering
Unsupervised learning: finds a "natural" grouping of instances given unlabeled data.
Clustering Methods
Many different methods and algorithms exist:
- for numeric and/or symbolic data
- deterministic vs. probabilistic
- exclusive vs. overlapping
- hierarchical vs. flat
- top-down vs. bottom-up
Clustering Issues
- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
Clustering Evaluation
- Manual inspection
- Benchmarking against existing labels
- Cluster quality measures: distance measures; high similarity within a cluster, low similarity across clusters
Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on the application and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Types of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- A popular choice is the Minkowski distance:
  d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance; if q = 2, d is the Euclidean distance.
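A direct transcription of the formula, for illustration only:

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two p-dimensional points.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

print(minkowski((1, 2, 3), (4, 6, 3), q=1))  # 7.0 (Manhattan)
print(minkowski((1, 2, 3), (4, 6, 3), q=2))  # 5.0 (Euclidean)
```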
Clustering Problem
- Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
- A cluster Kj contains precisely those tuples mapped to it.
- Unlike the classification problem, the clusters are not known a priori.
Types of Clustering
- Hierarchical: a nested set of clusters is created.
- Partitional: one set of clusters is created.
- Incremental: each element is handled one at a time.
- Simultaneous: all elements are handled together.
- Overlapping vs. non-overlapping.
Clustering Approaches
- Hierarchical: agglomerative, divisive
- Partitional
- Categorical
- Large DB: sampling, compression
Cluster Parameters
Distance Between Clusters
- Single link: smallest distance between points
- Complete link: largest distance between points
- Average link: average distance between points
- Centroid: distance between centroids
Hierarchical Clustering
- Clusters are created in levels, actually creating sets of clusters at each level.
- Agglomerative: initially each item is in its own cluster; clusters are iteratively merged together (bottom-up).
- Divisive: initially all items are in one cluster; large clusters are successively divided (top-down).
Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
- Agglomerative clustering (AGNES) proceeds step by step, merging singletons such as {a}, {b}, {c}, {d}, {e} into ever larger clusters; divisive clustering (DIANA) works in the reverse direction, splitting {a, b, c, d, e} until each object stands alone.
Dendrogram
- A tree data structure which illustrates hierarchical clustering techniques.
- Each level shows the clusters for that level: the leaves are individual clusters, the root is one cluster.
- A cluster at level i is the union of its children clusters at level i+1.
A Dendrogram Shows How the Clusters are Merged Hierarchically
- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
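As an illustration (not from the slides), SciPy's hierarchy module builds the merge tree and lets you cut the dendrogram at a chosen level; the points and the cut height below are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points; single-link agglomerative clustering (AGNES-style)
points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])
Z = linkage(points, method="single")       # merge history, i.e. the dendrogram

# Cutting the dendrogram at distance 2 gives the flat clusters at that level
labels = fcluster(Z, t=2, criterion="distance")
print(labels)   # three flat clusters: {0, 1}, {2, 3} and {4}
```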
DIANA (Divisive Analysis)
- Implemented in statistical analysis packages, e.g., Splus.
- Inverse order of AGNES.
- Eventually each node forms a cluster on its own.
Partitional Clustering
- Nonhierarchical: creates the clusters in one step, as opposed to several steps.
- Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
- Usually deals with static sets.
K-Means
- An initial set of clusters is randomly chosen.
- Iteratively, items are moved among the sets of clusters until the desired set is reached, giving a high degree of similarity among the elements in a cluster.
- Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
K-Means Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
- Randomly assign means: m1 = 3, m2 = 4
- K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
- K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
- K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
- K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
- Stop, as the clusters produced with these means are the same.
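The iterations above can be reproduced with a short sketch of the basic algorithm (illustrative only; it assumes no cluster ever becomes empty):

```python
def kmeans_1d(points, means, max_iter=100):
    """Plain k-means on 1-D data: assign each point to the nearest mean, recompute means."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in points:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) for c in clusters]   # assumes no empty cluster
        if new_means == means:                            # no change -> converged
            return clusters, means
        means = new_means
    return clusters, means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], means=[3, 4])
print(clusters, means)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]
```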
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute the seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
- Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
The K-Medoids Clustering Method
- Finds representative objects, called medoids, in clusters; each cluster is represented by one item, the medoid.
- An initial set of k medoids is randomly chosen.
- PAM (Partitioning Around Medoids) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering.
- Handles outliers well; the ordering of the input does not impact the results.
- PAM works effectively for small data sets, but does not scale well to large data sets.
PAM (Partitioning Around Medoids)
PAM uses real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
3. For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
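A compact, illustrative sketch of the swap-based search (the 1-D data reuses the earlier k-means example and is not from the PAM slides):

```python
from itertools import product

def total_cost(points, medoids):
    """Sum over all points of the distance to their nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k=2):
    medoids = list(points[:k])                          # arbitrary initial medoids
    while True:
        best = (0, None)
        # Try every swap of a current medoid i with a non-medoid h
        for i, h in product(medoids, (p for p in points if p not in medoids)):
            candidate = [h if m == i else m for m in medoids]
            tc_ih = total_cost(points, candidate) - total_cost(points, medoids)
            if tc_ih < best[0]:
                best = (tc_ih, candidate)
        if best[1] is None:                             # no swap improves the cost: stop
            return medoids
        medoids = best[1]

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))      # the two medoids found
```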
DATA MINING ASSOCIATION RULES
Example: Market Basket Data
- Items frequently purchased together, e.g., Computer → Printer
- Uses: placement, advertising, sales, coupons
- Objective: increase sales and reduce costs
- Also called Market Basket Analysis or Shopping Cart Analysis
Transaction Data: Supermarket Data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
…
tn: {biscuit, jam, milk}
Concepts:
- An item: an item/article in a basket
- I: the set of all items sold in the store
- A transaction: the items purchased in a basket; it may have a TID (transaction ID)
- A transactional dataset: a set of transactions
Transaction Data: A Set of Documents
A text document data set, where each document is treated as a "bag" of keywords:
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
Association Rule Definitions
- Association rule (AR): an implication X → Y where X, Y ⊆ I and X ∩ Y = ∅.
- Support of an AR X → Y (s): the percentage of transactions that contain X ∪ Y.
- Confidence of an AR X → Y (α): the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
Association Rule Problem
- Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the association rule problem is to identify all association rules X → Y with a minimum support and confidence.
- NOTE: the support of X → Y is the same as the support of X ∪ Y.
Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ a minsup threshold and confidence ≥ a minconf threshold.
- Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune the rules that fail the minsup or minconf thresholds. This is computationally prohibitive!
Example
Transaction data:
t1: Butter, Cocoa, Milk
t2: Butter, Cheese
t3: Cheese, Boots
t4: Butter, Cocoa, Cheese
t5: Butter, Cocoa, Clothes, Cheese, Milk
t6: Cocoa, Clothes, Milk
t7: Cocoa, Milk, Clothes
Assume minsup = 30% and minconf = 80%.
An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]
…
Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3]
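The support and confidence figures above can be verified with a few lines of Python (illustrative sketch):

```python
transactions = [
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """confidence(X -> Y) = support(X union Y) / support(X)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Cocoa", "Clothes", "Milk"}))        # 3/7 ~ 0.43
print(confidence({"Clothes"}, {"Milk", "Cocoa"}))   # 3/3 = 1.0
```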
Mining Association Rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database, matching each transaction against every candidate.
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidate itemsets, and w the maximum transaction width. This is expensive, since M = 2^d for d items!
Reducing the Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the anti-monotone property of the support measure: the support of an itemset never exceeds the support of any of its subsets.
Illustrating the Apriori Principle
Once an itemset in the lattice is found to be infrequent, all of its supersets can be pruned.
Illustrating the Apriori Principle
- Items (1-itemsets), then pairs (2-itemsets, with no need to generate candidates involving Coke or Eggs), then triplets (3-itemsets), with minimum support = 3.
- If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
- With support-based pruning, only 13 candidates need to be counted.
Apriori Algorithm
1. Let k = 1.
2. Generate frequent itemsets of length 1.
3. Repeat until no new frequent itemsets are identified:
   - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
   - Prune candidate itemsets containing subsets of length k that are infrequent.
   - Count the support of each candidate by scanning the DB.
   - Eliminate candidates that are infrequent, leaving only those that are frequent.
Example: Finding Frequent Itemsets
Dataset T (minsup = 0.5):
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5
(notation: itemset:count)
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   F1: {1}:2, {2}:3, {3}:3, {5}:3
   C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2
   F3: {2,3,5}
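The level-wise search can be written compactly. The sketch below is illustrative rather than optimized; on the four transactions above it reproduces F1, F2 and F3.

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minsup = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

def frequent(candidates):
    """Keep only the candidate itemsets whose support meets minsup."""
    return {c: support(c) for c in candidates if support(c) >= minsup}

# F1: frequent 1-itemsets
items = {i for t in transactions for i in t}
F = frequent({frozenset([i]) for i in items})
k, all_frequent = 1, dict(F)
while F:
    # Candidate (k+1)-itemsets: unions of frequent k-itemsets, pruned by the
    # Apriori principle (every k-subset must itself be frequent)
    cands = {a | b for a in F for b in F if len(a | b) == k + 1}
    cands = {c for c in cands if all(frozenset(s) in F for s in combinations(c, k))}
    F = frequent(cands)
    all_frequent.update(F)
    k += 1

for itemset, sup in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
```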
Apriori Advantages/Disadvantages
Advantages:
- Uses the large (frequent) itemset property.
- Easily parallelized.
- Easy to implement.
Disadvantages:
- Assumes the transaction database is memory resident.
- Requires up to m database scans.
Step 2: Generating Rules from Frequent Itemsets
- Frequent itemsets are not themselves association rules; one more step is needed to generate the rules.
- For each frequent itemset X and for each proper nonempty subset A of X, let B = X - A. Then A → B is an association rule if confidence(A → B) ≥ minconf, where
  support(A → B) = support(A ∪ B) = support(X)
  confidence(A → B) = support(A ∪ B) / support(A)
Generating Rules: An Example
- Suppose {2,3,4} is frequent, with sup = 50%.
- Its proper nonempty subsets {2,3}, {2,4}, {3,4}, {2}, {3}, {4} have sup = 50%, 50%, 75%, 75%, 75%, 75% respectively.
- These generate the following association rules:
  2,3 → 4, confidence = 100%
  2,4 → 3, confidence = 100%
  3,4 → 2, confidence = 67%
  2 → 3,4, confidence = 67%
  3 → 2,4, confidence = 67%
  4 → 2,3, confidence = 67%
- All the rules have support = 50%.
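Rule generation from a single frequent itemset, as in the example above (illustrative sketch; the supports are taken as given and a minconf of 80% is assumed here):

```python
from itertools import combinations

# Supports observed in the data (from the example above)
support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}

X = frozenset({2, 3, 4})
minconf = 0.8                                  # assumed threshold for this illustration
for r in range(1, len(X)):                     # every proper non-empty subset A of X
    for A in map(frozenset, combinations(X, r)):
        B = X - A
        conf = support[X] / support[A]         # confidence(A -> B) = support(X) / support(A)
        keep = "accepted" if conf >= minconf else "rejected"
        print(f"{set(A)} -> {set(B)}: confidence = {conf:.0%} ({keep})")
```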
Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement.
- If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Generating Rules
- To recap, in order to obtain A → B we need support(A ∪ B) and support(A).
- All the information required for the confidence computation has already been recorded during itemset generation; there is no need to scan the data T again.
- This step is not as time-consuming as frequent itemset generation. (Han and Kamber 2001)
Rule Generation
- How can rules be generated efficiently from frequent itemsets?
- In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
- But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD).
Rule Generation for the Apriori Algorithm
In the lattice of rules, once a low-confidence rule is found, the rules below it (obtained by moving more items into the consequent) are pruned.
Rule Generation for the Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
- Joining CD → AB and BD → AC would produce the candidate rule D → ABC.
- Prune the rule D → ABC if its subset rule AD → BC does not have high confidence.
Apriori - Performance Bottlenecks
The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
- Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation:
- Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
- Multiple scans of the database: it needs (n + 1) scans, where n is the length of the longest pattern.
Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, avoiding costly database scans.
- Develop an efficient, FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only!).
Construct FP-tree From A Transaction DB (min_support = 0.5)

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o}             {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Order the frequent items in frequency-descending order.
3. Scan the DB again and construct the FP-tree by inserting each transaction's ordered frequent items, sharing common prefixes (e.g., the path f:4 → c:3 → a:3 → m:2 → p:2).
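The two scans and the insertion of ordered frequent items can be sketched as follows (illustrative; the tie-breaking order of equally frequent items is hard-coded to match the slide's header table):

```python
from collections import Counter

transactions = [
    ["f","a","c","d","g","i","m","p"],
    ["a","b","c","f","l","m","o"],
    ["b","f","h","j","o"],
    ["b","c","k","s","p"],
    ["a","f","c","e","l","p","m","n"],
]
min_count = 3   # min_support = 0.5 of 5 transactions -> count >= 3

# Scan 1: count items and build the header table (frequency-descending order;
# ties broken as on the slide: f, c, a, b, m, p)
counts = Counter(i for t in transactions for i in t)
header = sorted((i for i, c in counts.items() if c >= min_count),
                key=lambda i: (-counts[i], "fcabmp".index(i)))
rank = {item: r for r, item in enumerate(header)}

# Scan 2: keep only frequent items in each transaction, sort them by header order,
# and insert the resulting path into the tree, incrementing node counts
root = {}                                # child item -> [count, children-dict]
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        child = node.setdefault(item, [0, {}])
        child[0] += 1
        node = child[1]

def show(node, depth=0):
    for item, (count, children) in node.items():
        print("  " * depth + f"{item}:{count}")
        show(children, depth + 1)

print([(i, counts[i]) for i in header])  # header table: f:4 c:4 a:3 b:3 m:3 p:3
show(root)                               # prints the FP-tree, e.g. f:4 -> c:3 -> a:3 -> m:2 -> p:2
```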
Benefits of the FP-tree Structure
- Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.
- Compactness: irrelevant information is reduced (infrequent items are gone); frequency-descending ordering means more frequent items are more likely to be shared; the tree is never larger than the original database (not counting node-links and counts).
Mining Frequent Patterns Using the FP-tree
- General idea (divide and conquer): recursively grow frequent pattern paths using the FP-tree.
- Method: for each item, construct its conditional pattern base and then its conditional FP-tree; repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern).
Major Steps to Mine the FP-tree
1. Construct the conditional pattern base for each node in the FP-tree.
2. Construct the conditional FP-tree from each conditional pattern base.
3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far; if a conditional FP-tree contains a single path, simply enumerate all the patterns.
Step 1: From FP-tree to Conditional Pattern Base
Starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the link of each frequent item and accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases:
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
Step 2: Construct the Conditional FP-tree
For each pattern base, accumulate the count for each item in the base and construct the FP-tree for the frequent items of the pattern base.
Example: the m-conditional pattern base is fca:2, fcab:1; the resulting m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3. All frequent patterns concerning m are: m, fm, cm, am, fcm, fam, cam, fcam.
Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item  Conditional pattern-base       Conditional FP-tree
p     {(fcam:2), (cb:1)}             {(c:3)} | p
m     {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)} | m
b     {(fca:1), (f:1), (c:1)}        Empty
a     {(fc:3)}                       {(f:3, c:3)} | a
c     {(f:3)}                        {(f:3)} | c
f     Empty                          Empty
Step 3: Recursively Mine the Conditional FP-trees
- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- Conditional pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation
- Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.
- Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
Why Is Frequent Pattern Growth Fast?
- Performance studies show that FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection.
- Reasoning: no candidate generation and no candidate tests; use of a compact data structure; elimination of repeated database scans; the basic operations are counting and FP-tree building.