Data Mining and Data Warehousing (CS411a)
Copyright Jiawei Han, modified by Charles Ling for CS411a

Slide 1: Course Outline
- Introduction
- Data warehousing and OLAP
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Association analysis
- Clustering analysis
- Mining complex data and advanced mining techniques
- Trends and research issues

Slide 2: Data Mining and Warehousing, Session 7: Clustering Analysis

Slide 3: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 4: What Is a Cluster?
- A number of similar things growing together, or of things or persons collected or grouped closely together: BUNCH
  - a group of buildings, and especially houses, built close together on a sizable tract in order to preserve open spaces larger than the individual yard for common recreation
  - an aggregation of stars, galaxies, or super galaxies that appear close together in the sky and seem to have common properties (such as distance)

Slide 5: What Is Clustering?
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
  - It may help users understand the natural grouping or structure in a data set.
- Cluster: a collection of data objects that are similar to one another and thus can be treated collectively as one group.
- Clustering is unsupervised classification: there are no predefined classes.
- Clustering is used either as a stand-alone tool to gain insight into the data distribution or as a preprocessing step for other algorithms.

Slide 6: What Is Good Clustering?
- A good clustering method produces high-quality clusters in which:
  - the intra-class (that is, intra-cluster) similarity is high, and
  - the inter-class similarity is low.
- The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Slide 7: Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Interpretability and usability

Slide 8: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 9: Applications of Clustering
- Clustering has wide applications in:
  - Pattern recognition
  - Spatial data analysis:
    - create thematic maps in GIS by clustering feature spaces
    - detect spatial clusters and explain them in spatial data mining
  - Image processing
  - Economic science (especially market research)
  - The WWW:
    - document classification
    - clustering weblog data to discover groups of similar access patterns

Slide 10: Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.

Slide 11: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 12: Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- A popular family is the Minkowski distance:

  $d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance: $d(i, j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$.
- If q = 2, d is the Euclidean distance: $d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$.
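
A minimal sketch of this distance family in Python (the function name and the sample points are illustrative, not from the slides):

    from typing import Sequence

    def minkowski(x: Sequence[float], y: Sequence[float], q: int = 2) -> float:
        """Minkowski distance between two p-dimensional data objects.
        q = 1 gives the Manhattan distance; q = 2 gives the Euclidean distance."""
        if len(x) != len(y):
            raise ValueError("objects must have the same dimensionality p")
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(i, j, q=1))   # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
    print(minkowski(i, j, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5.0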

Slide 13: Measuring Similarity
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Values should be scaled (e.g., normalized to [0, 1]).
- Weights should be associated with different variables based on the application and data semantics.
- It is hard to define "similar enough" or "good enough":
  - the answer is typically highly subjective.

Slide 14: Binary, Nominal, and Continuous Variables
- Binary variable: d = 0 if x = y; d = 1 otherwise.
- Nominal variables: more than 2 states, e.g., red, yellow, blue, green.
  - Simple matching: d = (p - u) / p, where u is the number of matches and p is the total number of variables.
  - Alternatively, one can encode the states using a large number of binary variables.
- Continuous variables: d = |x - y|, after scaling and normalization.
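
A hedged sketch of these per-variable dissimilarities (the helper names and the [lo, hi] scaling parameters are illustrative):

    def d_binary(x: int, y: int) -> float:
        """Binary variable: d = 0 if x = y, d = 1 otherwise."""
        return 0.0 if x == y else 1.0

    def d_nominal(xs, ys) -> float:
        """Nominal variables via simple matching: d = (p - u) / p,
        where u is the number of matches and p the number of variables."""
        p = len(xs)
        u = sum(1 for a, b in zip(xs, ys) if a == b)
        return (p - u) / p

    def d_continuous(x: float, y: float, lo: float, hi: float) -> float:
        """Continuous variable: |x - y|, min-max scaled to [0, 1]."""
        return abs(x - y) / (hi - lo)

    print(d_nominal(["red", "blue"], ["red", "green"]))   # 0.5
    print(d_continuous(30.0, 50.0, lo=0.0, hi=100.0))     # 0.2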

Slide 15: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 16: Five Categories of Clustering Methods
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based methods: based on connectivity and density functions.
- Grid-based methods: based on a multiple-level granularity structure.
- Model-based methods: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the hypothesized models.

Slide 17: Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given k, find the partition of k clusters that optimizes the chosen partitioning criterion.
  - Global optimum: exhaustively enumerate all partitions (intractable in practice).
  - Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
  - k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

Slide 18: The K-Means Clustering Method
- Given k, the k-means algorithm proceeds in four steps:
  1. Partition the objects into k nonempty subsets.
  2. Compute the seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to step 2; stop when the assignments no longer change.
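
The four steps translate directly into code. This is a teaching sketch, assuming numeric tuples and squared Euclidean distance, not a production implementation:

    import random

    def k_means(points, k, max_iters=100):
        """Steps 1-4 of the slide: seed, reassign, recompute centroids, repeat."""
        centroids = random.sample(points, k)          # step 1 (one common seeding)
        clusters = [[] for _ in range(k)]
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for p in points:                          # step 3: nearest seed point
                c = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[c])))
                clusters[c].append(p)
            new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[c]
                   for c, cl in enumerate(clusters)]  # step 2: mean point of each cluster
            if new == centroids:                      # step 4: no change, so stop
                break
            centroids = new
        return centroids, clusters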

Slide 19: Comments on the K-Means Method
- Strength of k-means:
  - Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
- Weaknesses of k-means:
  - It often terminates at a local optimum rather than the global one.
  - It is applicable only when a mean is defined; what about categorical data?
  - The number of clusters k must be specified in advance.
  - It is unable to handle noisy data and outliers.
  - It is not suited to discovering clusters with non-convex shapes.

Slide 20: The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters.
  - To achieve this, only a definition of distance between any two objects is needed.
- PAM (Partitioning Around Medoids, 1987):
  - starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering;
  - works effectively for small data sets, but does not scale well to large data sets.
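
A compact PAM-style sketch, assuming only a user-supplied dist(a, b) function; the initialization and swap order are simplified relative to the original algorithm:

    def total_distance(medoids, points, dist):
        """Total distance of the clustering: each point to its nearest medoid."""
        return sum(min(dist(p, m) for m in medoids) for p in points)

    def pam(points, k, dist):
        medoids = list(points[:k])                  # naive initial medoids
        best = total_distance(medoids, points, dist)
        improved = True
        while improved:                             # iterate until no swap helps
            improved = False
            for m in list(medoids):
                for o in points:
                    if o in medoids:
                        continue
                    trial = [o if x == m else x for x in medoids]
                    cost = total_distance(trial, points, dist)
                    if cost < best:                 # swap medoid m for non-medoid o
                        medoids, best, improved = trial, cost, True
        return medoids

The repeated full-cost evaluation for every candidate swap is why PAM works for small data sets but does not scale.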

Slide 21: Two Types of Hierarchical Clustering Algorithms
- Agglomerative (bottom-up): merge clusters iteratively.
  - Start by placing each object in its own cluster.
  - Merge these atomic clusters into larger and larger clusters,
  - until all objects are in a single cluster.
  - Most hierarchical methods belong to this category; they differ only in their definition of between-cluster similarity.
- Divisive (top-down): split a cluster iteratively.
  - It does the reverse, starting with all objects in one cluster and subdividing them into smaller pieces.
  - Divisive methods are less commonly available and have rarely been applied.

Slide 22: Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: objects a-e clustered bottom-up by AGNES (agglomerative, steps 0 to 4) and top-down by DIANA (divisive, steps 4 to 0).]

Slide 23: More on Hierarchical Clustering Methods
- Between-cluster similarity can be defined by:
  - minimal distance (single link),
  - maximal distance (complete link), or
  - center distance.
- Major weaknesses of agglomerative clustering methods:
  - they do not scale well: time complexity of at least O(n^2), where n is the total number of objects;
  - they can never undo what was done previously.
- Integrating hierarchical clustering with distance-based methods (e.g., BIRCH and CURE, listed on slide 28) addresses the scalability problem.
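
A bottom-up sketch parameterized by the three between-cluster definitions above; the function names are illustrative:

    def agnes(points, dist, linkage="min", k=1):
        """Agglomerative clustering: merge the two closest clusters until
        k clusters remain. The O(n^2) pair scan per merge step is exactly
        why these methods do not scale well."""
        def center(c):
            return tuple(sum(d) / len(c) for d in zip(*c))

        def between(c1, c2):
            if linkage == "min":                     # single link: minimal distance
                return min(dist(a, b) for a in c1 for b in c2)
            if linkage == "max":                     # complete link: maximal distance
                return max(dist(a, b) for a in c1 for b in c2)
            return dist(center(c1), center(c2))      # center distance

        clusters = [[p] for p in points]             # each object starts alone
        while len(clusters) > k:
            i, j = min(((i, j)
                        for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] += clusters.pop(j)           # merge the closest pair
        return clusters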

Slide 24: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 25: What Is Outlier Discovery?
- What are outliers?
  - Objects that are considerably dissimilar from the remainder of the data.
  - Example in sports: Michael Jordan, Wayne Gretzky, ...
- Problem:
  - Given: data points.
  - Find: the top n outlier points.
- Applications:
  - Credit card fraud detection
  - Telecom fraud detection
  - Customer segmentation
  - Medical analysis

Slide 26: Outlier Discovery Methods
- Distance-based vs. statistics-based outlier analysis:
  - Most statistical outlier analyses are univariate (single-variable) and distribution-based (but how do we know the data follow, say, a normal or gamma distribution?).
  - We need multi-dimensional analysis that does not assume a known data distribution.
- Distance-based outliers:
  - An object O in a data set T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O.
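
The DB(p, D) definition is easy to state in code; a naive O(n^2) sketch whose parameter names follow the definition (the sample data are made up):

    def db_outliers(points, p, D, dist):
        """An object O is a DB(p, D)-outlier if at least a fraction p of the
        objects in the data set lie at distance greater than D from O."""
        n = len(points)
        outliers = []
        for o in points:
            far = sum(1 for x in points if dist(o, x) > D)  # dist(o, o) = 0, never > D
            if far / n >= p:
                outliers.append(o)
        return outliers

    pts = [1.0, 1.2, 0.9, 1.1, 9.0]
    print(db_outliers(pts, p=0.75, D=2.0, dist=lambda a, b: abs(a - b)))  # [9.0]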

Slide 27: Clustering Analysis
- What is Clustering Analysis?
- Clustering in Data Mining Applications
- Handling Different Types of Variables
- Major Clustering Techniques
- Outlier Discovery
- Problems and Challenges

Slide 28: Problems and Challenges
- Considerable progress has been made in scalable clustering methods:
  - Partitioning: k-means, k-medoids, CLARANS
  - Hierarchical: BIRCH, CURE
  - Density-based: DBSCAN, CLIQUE, OPTICS
  - Grid-based: STING, WaveCluster
  - Model-based: AutoClass, DENCLUE, COBWEB
- Current clustering techniques do not address all of these requirements adequately.
- Constraint-based clustering analysis: constraints exist in the data space (e.g., bridges and highways) or in user queries.

Slide 29: Data Mining and Data Warehousing
- Introduction
- Data warehousing and OLAP
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Association analysis
- Clustering analysis
- Mining complex data and advanced mining techniques
- Trends and research issues

Slide 30: Data Mining and Warehousing, Session 6: Association Analysis

Slide 31: Session 6: Association Analysis
- What is association analysis?
- Mining single-dimensional Boolean association rules in transactional databases
- Mining multi-level association rules

Slide 32: What Is Association Mining?
- Association rule mining:
  - Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications:
  - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples. Rule form: Body → Head [support, confidence].
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]

Slide 33: Session 6: Association Analysis
- What is association analysis?
- Mining single-dimensional Boolean association rules in transactional databases
- Mining multi-level association rules

Slide 34: What Is an Association Rule?
- Given:
  - a database of customer transactions,
  - where each transaction is a list of items (purchased by a customer in one visit).
- Find all rules that correlate the presence of one set of items with that of another set of items.
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  - Any number of items may appear in the consequent/antecedent of a rule.
  - It is possible to specify constraints on the rules (e.g., find only rules involving home laundry appliances).

Slide 35: Application Examples
- Market basket analysis:
  - Maintenance agreements: what should the store do to boost maintenance agreement sales?
  - Home electronics: what other products should the store stock up on if it has a sale on home electronics?
- Attached mailings in direct marketing.
- Detecting "ping-ponging" of patients:
  - transaction: patient
  - item: doctor/clinic visited by a patient
  - support of a rule: number of common patients

Slide 36: Rule Measures: Support and Confidence
- Find all rules X ∧ Y → Z with minimum confidence and support:
  - support s: the probability that a transaction contains {X, Y, Z};
  - confidence c: the conditional probability that a transaction containing {X, Y} also contains Z.
- With minimum support 50% and minimum confidence 50%, we have:
  - A → C (50%, 66.6%)
  - C → A (50%, 100%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and their overlap, the customers who buy both.]

Slide 37: Mining Association Rules: An Example
- Minimum support 50%; minimum confidence 50%.
- For the rule A → C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must itself be frequent.
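
These numbers, and those on slide 36, can be checked in a few lines. The transaction table on the original slide was an image, so the four transactions below are an assumption chosen to reproduce the slide's figures:

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset):
        """Fraction of transactions containing every item of the itemset."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs):
        """Conditional probability that a transaction with lhs also has rhs."""
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}))        # 0.5    -> A -> C has 50% support
    print(confidence({"A"}, {"C"}))   # 0.666  -> A -> C has 66.6% confidence
    print(confidence({"C"}, {"A"}))   # 1.0    -> C -> A has 100% confidence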

Slide 38: Mining Frequent Itemsets: the Key Step
1. Find the frequent itemsets: the sets of items that have minimum support.
   - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
   - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
2. Use the frequent itemsets to generate the association rules.

Slide 39: The Apriori Algorithm
- C_k: candidate itemsets of size k; L_k: frequent itemsets of size k.

  L_1 = {frequent 1-itemsets};
  for (k = 1; L_k ≠ ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in the database do
          increment the count of all candidates in C_{k+1} that are contained in t;
      L_{k+1} = candidates in C_{k+1} with min_support;
  end
  return ∪_k L_k;
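
A runnable Python version of the pseudocode above; a sketch in which candidate generation is a simple join of frequent k-itemsets plus the Apriori subset-pruning step, and min_support is an absolute count:

    from itertools import combinations

    def apriori(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        def frequent(candidates):
            """One database scan: count each candidate, keep the supported ones."""
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            return {c for c, n in counts.items() if n >= min_support}
        L = frequent({frozenset([i]) for t in transactions for i in t})   # L1
        k, result = 1, set()
        while L:                                     # loop while Lk is non-empty
            result |= L
            joined = {a | b for a in L for b in L if len(a | b) == k + 1}
            # Apriori pruning: every k-subset of a candidate must be frequent.
            C = {c for c in joined
                 if all(frozenset(s) in L for s in combinations(c, k))}
            L, k = frequent(C), k + 1
        return result                                # the union of all Lk

    print(apriori([{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}], 2))
    # itemsets with support >= 2: {A}, {B}, {C}, and {A, C}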

Slide 40: The Apriori Algorithm: An Example

[Figure: a walk-through on a small database D: scan D to count the candidate 1-itemsets C1 and keep the frequent ones as L1; generate C2 from L1 and scan D again to obtain L2; generate C3, scan D, and obtain L3.]

Slide 41: Generating Association Rules
- A naive algorithm:

  for each frequent itemset F do
      for each non-empty proper subset c of F do
          if (support(F) / support(F − c) ≥ minconf) then
              output the rule (F − c) → c,
              with confidence = support(F) / support(F − c)
              and support = support(F)
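
The same algorithm in runnable form, assuming a support map from frozenset itemsets to counts (for instance, one collected while running Apriori):

    from itertools import combinations

    def generate_rules(frequent_itemsets, support, min_conf):
        """Naive rule generation: for each frequent itemset F and each
        non-empty proper subset c, output (F - c) -> c when confident."""
        rules = []
        for F in frequent_itemsets:
            for r in range(1, len(F)):               # all non-empty proper subsets
                for c in map(frozenset, combinations(F, r)):
                    conf = support[F] / support[F - c]
                    if conf >= min_conf:
                        rules.append((F - c, c, conf, support[F]))
        return rules

    # Illustrative use with the itemset {A, C} (support counts out of 4):
    support = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}
    print(generate_rules([frozenset("AC")], support, min_conf=0.5))
    # both {C} -> {A} (confidence 1.0) and {A} -> {C} (confidence 0.67) pass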

Slide 42: Session 6: Association Analysis
- What is association analysis?
- Mining single-dimensional Boolean association rules in transactional databases
- Mining multi-level association rules

Slide 43: Multiple-Level Association Rules
- Items often form a hierarchy.
- Items at a lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels can be quite useful.
- The transaction database can be encoded based on dimensions and levels.
- It is smart to explore shared multi-level mining (Han & Fu, VLDB '95).

[Figure: an item hierarchy with food at the top, milk and bread beneath it; milk specializes into skim and 2% (with brands such as Sunset and Fraser), and bread into white and wheat.]

Slide 44: Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%].
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%].
- Variations on mining multiple-level association rules:
  - Level-crossing association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Slide 45: Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%).
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%).
- Different min_support thresholds across levels lead to different algorithms:
  - If the same min_support is adopted across all levels, then an item t can be tossed as soon as any of t's ancestors is infrequent.
  - If a reduced min_support is adopted at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible.
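
A sketch of the reduced-min_support case (the item names, counts, and parent map below are illustrative, not from the slides):

    def mine_level(counts, parent_of, parent_frequent, min_support):
        """Progressive deepening: examine an item at this level only if its
        ancestor at the level above was frequent, then test its own support."""
        return {item for item, n in counts.items()
                if parent_of[item] in parent_frequent and n >= min_support}

    parent_frequent = {"milk", "bread"}          # mined first at the higher level
    parent_of = {"2% milk": "milk", "skim milk": "milk", "wheat bread": "bread"}
    counts = {"2% milk": 5, "skim milk": 1, "wheat bread": 4}
    print(mine_level(counts, parent_of, parent_frequent, min_support=3))
    # {'2% milk', 'wheat bread'}: skim milk fails its own (reduced) threshold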

