数据挖掘 Introduction to Data Mining

1 数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities. Spring 2018.

2 Course schedule (日程安排)
Lecture 1: Introduction. What is the knowledge discovery process?
Lecture 2: Exploring the data
Lecture 3: Classification (part 1)
Lecture 4: Classification (part 2)
Lecture 5: Association analysis (part 1)
Lecture 6: Association analysis (part 2)
Lecture 7: Clustering
Lecture 8: Anomaly detection and advanced topics
Final exam (date to be announced)

3 Introduction
Last time: Association analysis (part 2); solution of assignment #1; assignment #2.
Important: the QQ group; the PPTs are on the website.

4 Clustering (群集)

5 Introduction Clustering (群集): to automatically group similar objects/instances into clusters (groups). The clusters should capture the natural structure of the data.

6 Clustering
Why do clustering? To summarize the data, to understand the data for decision-making, or as a first step before applying other data mining techniques.
Clustering is a task that humans naturally do in everyday life.
Many applications: grouping similar webpages, grouping customers with similar behavior or preferences, grouping similar movies or songs.

7 What are "good" clusters?
In general, we may want to find clusters that:
maximize the similarity between points within the same category,
minimize the similarity between points of different categories.

8 To reduce the size of datasets
Some data mining techniques, such as PCA, may be slow if a database is large (some techniques have a high computational complexity). A solution is to replace all the points in each cluster by a single data point representing the cluster. This reduces the size of the database and allows data mining algorithms to run faster.

10 Classification (分类)
Classification (分类): predicting the value of a target attribute for some new data. The possible values of the target attribute are called "classes" or categories.
NAME   AGE   INCOME   GENDER   EDUCATION (the target attribute)
John   99    1元      Male     Ph.D.
Lucia  44    20元     Female   Master
Paul   33    25元
Daisy  20    50元              High school
Jack   15    10元
Macy   35                      ?????????
The classes are known in advance: Ph.D., Master, High school…

11 Classification (分类)
Supervised classification (监督分类) requires training data that is already labelled in order to train a classification model. In the table above, the rows whose EDUCATION value (the target attribute) is known form the training data (训练数据).

12 Clustering (群集) Automatically group instances into groups.
No training data is required. No labels or target attribute need to be selected.

14 What is a good clustering?
How many categories? Six? Four? Two?

15 Partitional Clustering (划分聚类)
Each object must belong to exactly one cluster.
(Figure: the original points and a partitional clustering of them)

16 Hierarchical Clustering (层次聚类)
Clusters are created as a hierarchy of clusters.
(Figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram)

17 An example of a dendrogram
http://www.instituteofcaninebiology.org/how-to-read-a-dendrogram

18 Many types of clustering
Exclusive versus non-exclusive: in a non-exclusive clustering, points may belong to multiple clusters, e.g. to represent multiple classes or 'border' points.
(Figures: an exclusive clustering and a non-exclusive clustering)

19 Many types of clustering
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1. Probabilistic clustering has similar characteristics.
(Figures: a non-fuzzy clustering and a fuzzy clustering)

20 Many types of clustering
Partial versus complete: in some cases, we only want to cluster some of the data, e.g. to eliminate the outliers.
(Figures: a complete clustering and a partial clustering)

21 Many types of clustering
Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.
(Figures: a homogeneous (均匀的) clustering and a heterogeneous (各种各样的) clustering, in terms of cluster size)

22 Types of clusters: Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters

23 Types of clusters: Center-Based clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters

24 Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive) A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters

25 Types of Clusters: Density-Based
A cluster is a dense region of points that is separated from other high-density regions by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters

26 The K-Means algorithm

27 Introduction A simple and popular approach Partitional clustering
Each cluster is associated with a centroid (center point) Each point (object) is assigned to the cluster with the closest centroid Number of clusters, K, must be specified by the user.

28 K-Means
Input:
k, the number of clusters to be generated
P, a set of points to be clustered
Output:
k clusters (partitions of the points; some may be empty)
A minimal from-scratch sketch of the algorithm is shown below.
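To make the input/output above concrete, here is a minimal from-scratch sketch in Python (not part of the original slides; it assumes the points P are stored in a NumPy array of shape (n, d)):

```python
import numpy as np

def kmeans(P, k, max_iter=100, seed=0):
    """A minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select k of the points as the initial centroids
    centroids = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid
        dist = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. Recompute each centroid as the average of the points assigned to it
        new_centroids = np.array([
            P[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice, a library implementation such as scikit-learn's KMeans (shown later) would typically be used instead.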

29 Example – iteration 1 Three points are randomly selected to be the centroids

30 Example – iteration 2 Centroids are recalculated as the average of the points in each category, and each point is assigned to the category with the closest centroid.

31 Example – iteration 3 Centroids are recalculated as the average of the points in each category, and each point is assigned to the category with the closest centroid.

32 Example – iteration 4 Centroids are recalculated as the average of the points in each category, and each point is assigned to the category with the closest centroid.

33 Example – iteration 5 Centroids are recalculated as the average of the points in each category, and each point is assigned to the category with the closest centroid.

34 Example – iteration 6 This is the last iteration because after that, the categories do not change.

35 More information about K-Means
Initially, the centroids are randomly selected. Thus, if we run K-Means several times, the result may be different (a common remedy is shown in the sketch below). The similarity or distance between points may be calculated using different distance functions such as the Euclidean distance, correlation, etc. For such measures, K-Means will always converge to a solution (a set of clusters). Usually, the clusters change more during the first iterations, so we can stop K-Means when the result does not change much between two iterations.
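Because the random choice of the initial centroids can change the result, a common remedy is to run K-Means several times and keep the clustering with the smallest SSE. A hedged example with scikit-learn, whose n_init parameter does exactly that (the data X here is only a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                             # toy data; replace with your own points
km = KMeans(n_clusters=3, n_init=10, random_state=0)   # 10 random restarts, keep the best result
labels = km.fit_predict(X)
print(km.inertia_)                                     # inertia_ is the SSE of the returned clustering
```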

36 The choice of the initial centroids can have a huge influence on the final result
(Figures: the data, a clustering that is optimal, and a clustering that is quite bad)

37 In some cases, K-Means can find a good solution despite an initial choice of centroids that appears to not be very good

38 How to evaluate a clustering
Sum of squared errors (SSE): SSE = Σ_{i=1..k} Σ_{x ∈ Ci} dist(mi, x)²
where k is the number of clusters, x is an object of cluster Ci, and mi is the prototype (centroid) of Ci.
The SSE allows us to compare clusterings and choose the best one.
Note: increasing k will decrease the SSE, but a good clustering will still have a small SSE even for a small k value.
(A small sketch of how the SSE can be computed is shown below.)
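A small sketch of how the SSE above can be computed for a given clustering (assuming the arrays P, labels, and centroids produced by the earlier K-Means sketch):

```python
import numpy as np

def sse(P, labels, centroids):
    """Sum, over all clusters, of the squared distances between each point and its centroid."""
    return sum(np.sum((P[labels == i] - c) ** 2) for i, c in enumerate(centroids))
```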

39 Some problems with k-means
It may be difficult to find a perfect clustering if k is large, because it becomes unlikely that a centroid will be chosen in each natural cluster.
K-Means can create some empty clusters.
There are many strategies to fix these problems, e.g. applying the algorithm several times…

40 Limitations of K-Means
K-Means does not work very well for clusters having different sizes, having different densities, or having a non-globular shape. K-Means may also not work very well when the data contains outliers.

41 Limitations of K-means: different sizes
Original Points K-Means (3 clusters)

42 And what if we increase k ?
Original points K-Means (3 clusters)

43 Limitations of K-Means : different densities
Original points K-Means 3 clusters

44 And what if we increase k ?
Original points K-Means 9 clusters

45 Limitations of K-Means: non-globular shapes
Original points K-Means (2 clusters)

46 And what if we increase k ?
Not better…

47 Pre-processing and post-processing
Pre-processing: normalize the data, remove outliers.
Post-processing: remove small clusters that could be outliers; if a cluster has a high SSE, split it into two clusters; merge two clusters that are very similar to each other, if the resulting SSE stays low.
Some of these operations can be integrated into the K-Means algorithm.
(An illustration of the normalization step is shown below.)
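As an illustration of the normalization step, a minimal sketch using scikit-learn's StandardScaler (one possible choice of normalization; it rescales each attribute to zero mean and unit variance so that no attribute dominates the distance computation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])  # toy data with very different scales
X_scaled = StandardScaler().fit_transform(X)                  # each column now has mean 0 and unit variance
```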

48 Density-based clustering (基于密度的聚类): DBSCAN

49 What is density? Density can be defined as the number of points within a circular area of some radius (半径) around a given point. Density is therefore defined with respect to a given point.

50 DBSCAN (1996)
Input:
some data points (objects)
eps: a distance (a positive number)
minPts: a minimum number of points
Output:
clusters created based on the density of points; some points are considered as noise and are not included in any cluster.

51 Definitions
Neighbors: the points at a distance not greater than eps from a given point.
Core point: a point having at least minPts neighbors.
Border point: a point having fewer than minPts neighbors, but having at least one neighbor that is a core point.
Noise point: any other point.
Example: eps = 1, minPts = 4

52 How does DBSCAN work?
Current_label = 1
FOR EACH core point p:
  IF p has no label THEN:
    p.label = Current_label
    FOR EACH point y in the neighborhood of p (transitively, through neighboring core points):
      IF y is a border point or a core point without a label THEN y.label = Current_label
    Current_label = Current_label + 1
(A usage sketch with a library implementation is shown below.)
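A hedged usage sketch with scikit-learn's DBSCAN implementation (eps and min_samples correspond to eps and minPts; points labelled -1 are the noise points; the data X is a placeholder):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)                      # toy data; replace with your own points
labels = DBSCAN(eps=0.1, min_samples=4).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")
```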

53 DBSCAN: Illustration
(Figures: the original points, and the types of points found by DBSCAN (core points, border points, noise) with Eps = 10 and MinPts = 4)

54 Advantages of DBSCAN
Noise-tolerant.
Can discover clusters of various sizes and shapes.
(Figures: the original points and the clusters found)

55 Other examples

56 Limitations of DBSCAN
Clusters of varying densities.
High-dimensional data.
(Figures: the original points, and the DBSCAN results with MinPts=4, Eps=9.75 and with MinPts=4, Eps=9.92)

57 Other examples

58 How to choose the eps and MinPts parameters?
We can look at the distance from each point to its k-th closest neighbor: noise points are farther from their k-th neighbor than points that are not noise. In practice, we choose a value of k (used as MinPts), sort the points by their distance to their k-th nearest neighbor, and read a suitable eps from where this sorted curve starts to rise sharply (see the sketch below).
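A sketch of this heuristic (k plays the role of MinPts; the sorted k-th neighbor distances are plotted and eps is read off near the knee of the curve; the data X is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(300, 2)                       # toy data; replace with your own points
k = 4                                            # same value as MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])               # distance from each point to its k-th nearest neighbor
plt.plot(k_dist)                                 # choose eps near the "knee" where the curve rises sharply
plt.ylabel("distance to k-th nearest neighbor")
plt.show()
```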

59 Density-based clustering
Advantages Clusters of different sizes and shapes Do not need to specify the number of clusters Remove points that are noise Can be quite fast, if the software is using appropriate spatial data structures to search quickly for neighbors. Disadvantages It can be difficult to find good parameter values Results may vary greatly depending on how the parameters are set.

60 Density-peak clustering (Science, 2014)
Clusters correspond to peaks in the density of points.
Can find non-spherical clusters of different densities.
The number of clusters is found automatically.
Can also remove noise.
Simple.
(A sketch of the two quantities the method computes is shown after this list.)
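A minimal sketch (not the authors' code) of the two quantities the method computes for each point: its local density rho (number of neighbors within a cutoff distance dc) and delta, the minimum distance to any point of higher density. Cluster centers are the points where both are large:

```python
import numpy as np

def density_peak_quantities(X, dc):
    """Return (rho, delta) for each point: local density and distance to the nearest denser point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distance matrix (O(n^2))
    rho = np.sum(d < dc, axis=1) - 1                           # neighbors within dc, excluding the point itself
    delta = np.zeros(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]                     # points with a strictly higher density
        delta[i] = d[i, denser].min() if len(denser) else d[i].max()
    return rho, delta
```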

61 (Figure: definition of the local density of a point, i.e. the number of its neighbors within a cutoff distance dc)

62-63 (Figures: decision graphs plotting, for each point, its density against its minimum distance to a point of higher density; cluster centers stand out as points where both values are high)

64 This algorithm solves some problems of DBScan.

65 Clustering evaluation

66 Clustering evaluation
Evaluating a clustering found by an algorithm is very important: to avoid finding clusters in noise, to compare different clustering algorithms, to compare two clusterings generated by the same algorithm, and to compare two clusters.

67 Clusters found in random data
(Figures: random points, and the clusters "found" in them by DBSCAN, K-means, and hierarchical clustering)

68 Issues related to cluster evaluation
Are there really natural categories in the data (or is it just random data)?
Evaluating clusters using external data (e.g. some already known class labels).
Evaluating clusters without using external data (e.g. using the SSE or other measures).
Comparing two clusterings to choose one.
Determining how many categories there are.

69 A method: using a similarity matrix
Order the points by cluster label and calculate the similarity between all pairs of points. If the categories are well separated, square blocks should appear along the diagonal of the matrix.
EXAMPLE 1: K-Means
(A sketch of this visual check is shown below.)
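A sketch of this visual check (assumed setup: a point matrix X and K-Means labels; the pairwise distance matrix is reordered by cluster label and shown as an image, where well-separated clusters appear as low-distance blocks on the diagonal):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

X = np.random.rand(150, 2)                       # toy data; replace with your own points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels)                       # order the points by cluster label
D = squareform(pdist(X))[order][:, order]        # reordered pairwise distance matrix
plt.imshow(D)                                    # well-separated clusters: low-distance blocks on the diagonal
plt.colorbar()
plt.show()
```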

70 EXAMPLE 2: DBSCAN, random data
The diagonal is less well defined.
EXAMPLE 3: K-Means, random data

71 EXAMPLE 4: DBSCAN

72 A method to choose the number of categories
There are various methods. A simple one is to use the sum of squared errors (SSE): for example, plot the SSE with respect to the number of categories for K-Means and look for the point where the curve stops decreasing quickly (the "elbow"), as in the sketch below.
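A sketch of this method with scikit-learn (KMeans exposes the SSE of the final clustering as inertia_; we compute it for several values of k and look for the elbow of the curve):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                       # toy data; replace with your own points
ks = list(range(1, 11))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")                    # the "elbow" of the curve suggests a reasonable k
plt.xlabel("number of clusters k")
plt.ylabel("SSE")
plt.show()
```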

73 Another example:

74 Hierarchical clustering (层次聚类)

75 MIN: the proximity between the two closest points of two categories.
MAX: the proximity between the two farthest points of two categories.
Average: the average proximity between the points of two categories.
(A sketch comparing these three linkage criteria is shown below.)
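These three criteria correspond, for instance, to the "single", "complete", and "average" linkage methods in SciPy. A small sketch comparing them on the same toy points (the data is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(6, 2)                         # six toy points
for i, method in enumerate(["single", "complete", "average"]):  # MIN, MAX, Average
    Z = linkage(X, method=method)
    plt.subplot(1, 3, i + 1)
    dendrogram(Z)
    plt.title(method)
plt.show()
```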

76 Comparison of hierarchical clustering methods
(Figures: the clusters and dendrograms obtained with MIN, MAX, and Average linkage on the same six points)

77 Conclusion
Today, we discussed clustering:
K-Means
DBSCAN
Density-peak clustering
How to evaluate clusters
Next week, we will discuss anomaly detection and some more advanced topics.
Tutorial: how to use K-Means with the SPMF software:

78 References
Tan, Steinbach & Kumar (2006). Introduction to Data Mining. Pearson Education. Chapters 8 and 9 (and the accompanying PPTs).
Han & Kamber (2011). Data Mining: Concepts and Techniques.

