Data Clustering 1 – An introduction

Presentation on theme: "Data Clustering 1 – An introduction"— Presentation transcript:

1 Data Clustering 1 – An introduction
Slide 1

2 The Data Explosion
“If you feel like you are drowning in information, it’s because you are.”
Advance of IT and the Internet
Massive increase in ability to:
- Record: electronic records and forms
- Store: data warehouses (as we have seen)
- Analyse: data mining and visualisation (more later)
Risk of information overload
Data Clustering – An Introduction Slide 2

3 The Aims of Data Mining
- Classification: categorising risk-return of stocks
- Association: identify products that tend to sell together
- Detection: identify profiles of customers
- Prediction: forecasting market performance

4 Database Technology Timeline
- Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s to 2000s: Data mining and data warehousing, multimedia databases, and Web databases

5 From Data to Knowledge
It is common to break down the process of learning from data into three stages: data, information and knowledge.

6 From Data to Knowledge
- Data: raw numbers
- Information: data with context or meaning
- Knowledge: data structures / patterns (knowledge must be useful)

7 Data Mining / Intelligent Data Analysis
“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)

8 Knowledge Discovery
Knowledge Discovery in Databases (KDD)
The process (from Advances in KDD and Data Mining):
[Figure: KDD process diagram with the stages Data, Target Data, Pre-processed Data, Transformed Data, Patterns, Knowledge]

9 Data Mining - Tools
Typical tools:
- Statistical analysis
- Summarisation
- Outlier detection
- Correlation
- Regression
- Clustering
- Association rules
- Time series models
- Decision trees (classification)

10 Data Mining - Applications
Some successful examples of its use:
- Pharmaceutical companies: drug discovery
- Credit card companies: fraud detection
- Transportation companies: routing
- Large consumer packaged goods companies: improving the sales process to retailers
- Hospital organisation: decision analysis

11 Examples of Data Mining Tools
We will now look at some core techniques commonly used for analysing and mining business warehouses:
- Correlation
- Visualisation
- Clustering
- Regression

12 Clustering
An example in biology…
- Animals: things that are brown and run away
- Plants: things that are green and don’t run away
Notes: Clustering is a basic learning algorithm, and we do clustering everywhere in our lives. The example in biology introduces the three concepts in pattern recognition: patterns, features and classes.

13 Clustering
An example in biology…
Kingdom, Phylum, Class, Order, Family, Genus, Species
Hierarchical clustering (more later)

14 Clustering
The process:
- Extract features (colour, movement, sensory organs etc.): more later
- Cluster into categories
- Consolidation

15 Clustering
Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity under some defined distance measure.
It is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Unsupervised: no need for the ‘teacher’ signals, i.e. the desired output.
[Figure: patterns plotted on axes x1 and x2, grouped into Cluster 1 and Cluster 2]
Notes: Clustering extends the concept from living beings to a general one. Key issues of clustering analysis.

16 Supervised and Unsupervised Learning
Unsupervised learning: learning without the desired output (‘teacher’ signals).
Supervised learning: learning with the desired output.
Clustering is one of the widely used unsupervised learning methods.
Other unsupervised learning:
- Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)
- Time series modelling
- Source separation
Supervised learning:
- Classification
- Regression
Notes: Briefly introduce the concepts of supervised and unsupervised learning. A common misunderstanding is that clustering = unsupervised learning; clustering is only one of several unsupervised methods.

17 Patterns, Clusters and Features (1)
- Patterns: physical objects
- Clusters: categories of objects (animals, plants)
- Features: attributes of objects (colour: brown, green, …)
Notes: Before introducing how to perform clustering, cover the basic concepts: patterns, clusters, features.

18 Patterns, Clusters and Features (2)
Feature space: creating vehicles’ clusters
Another example, for vehicle clustering:
- Patterns (objects): individual cars
- Features: weight, top speed
- Clusters: lorries, sports cars, medium market cars
[Figure: feature space with Weight [kg] (Feature-1 values, 500 to 3500) against Top speed [ml/h] (Feature-2 values, 100 to 300), showing the clusters Lorries, Sports cars and Medium market cars]

19 Social networks
- Marketing
- Terror networks
- Allocation of resources in a company / university

20 Gene networks
- Understanding gene interactions
- Identifying important genes linked to disease

21 How to do clustering?
What we know:
- Patterns represented by their feature vectors (examples: animals, cars)
- General case: each feature vector lies in the d-dimensional domain of the feature vectors
What we need to find out:
- The number of clusters
- The clusters, in a form easy for computing
[Figure: patterns plotted on axes x1 and x2, grouped into Cluster 1 and Cluster 2]
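As a minimal sketch of "patterns represented by their feature vectors", the snippet below stores each pattern as a tuple of d numbers. The vehicle names and all numeric values are invented for illustration and are not taken from the slides; `feature_dimension` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

FeatureVector = Tuple[float, ...]

# Hypothetical patterns: (weight in kg, top speed). Values are made up.
patterns: Dict[str, FeatureVector] = {
    "lorry": (3000.0, 110.0),
    "sports_car": (1300.0, 280.0),
    "medium_car": (1500.0, 180.0),
}


def feature_dimension(vectors: Iterable[FeatureVector]) -> int:
    """Return d, after checking that every vector has the same length."""
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError("all feature vectors must share one dimension d")
    return dims.pop()


d = feature_dimension(patterns.values())
print(d)  # 2
```

Here d = 2, so each pattern is a point in a two-dimensional feature space, which is what the x1/x2 plot on the slide depicts.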

22 Pattern Similarity
A key concept in clustering, and in many other pattern recognition techniques, is similarity.
Clusters are formed by similar patterns. In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.
A general definition of distance between patterns A and B (the Minkowski form, with A_i and B_i the i-th feature values):
d(A, B) = (\sum_{i=1}^{d} |A_i - B_i|^b)^{1/b}
- b = 2: Euclidean distance
- b = 1: Manhattan distance
Similarity is inversely related to distance: the shorter the distance, the more similar the two patterns. This sometimes presents problems.
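The general distance with parameter b can be sketched in a few lines. This is an illustrative implementation, assuming the Minkowski form; the function name `minkowski_distance` and the sample points are chosen here, not taken from the slides.

```python
def minkowski_distance(a, b, order):
    """Distance between feature vectors a and b for a given order (b on the slide)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    if order < 1:
        raise ValueError("order must be at least 1")
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1.0 / order)


pattern_a = (0.0, 0.0)
pattern_b = (3.0, 4.0)

print(minkowski_distance(pattern_a, pattern_b, 2))  # Euclidean: 5.0
print(minkowski_distance(pattern_a, pattern_b, 1))  # Manhattan: 7.0
```

The same pair of patterns is 5 apart under the Euclidean distance and 7 apart under the Manhattan distance, so the choice of metric changes which patterns count as "similar".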

23 Pattern Similarity & Distance Metrics
Many methods are designed to work on distance metrics, e.g. K-Means.
They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”.
As a condition on a distance d: d(A, C) <= d(A, B) + d(B, C), with equality allowed in degenerate cases such as collinear points.
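Not every dissimilarity measure satisfies the triangle inequality. The sketch below checks it by brute force on three points on a line: the Euclidean distance passes, while the squared Euclidean distance (used here only as a counter-example) fails. The helper name `satisfies_triangle_inequality` is hypothetical.

```python
from itertools import permutations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def satisfies_triangle_inequality(points, dist, tol=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    for a, b, c in permutations(points, 3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return False
    return True


points = [(0.0,), (1.0,), (2.0,)]
print(satisfies_triangle_inequality(points, euclidean))          # True
print(satisfies_triangle_inequality(points, squared_euclidean))  # False
```

For the squared distance, going from 0 to 2 directly costs 4, while going via 1 costs 1 + 1 = 2, so the direct route is "longer" than the detour.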

24 Pattern Similarity & Distance Metrics
- Euclidean
- Correlation
- Minkowski
- Manhattan
- Mahalanobis
- Relationship metrics
How long is a piece of string? Often application dependent.

25 K-Means Clustering

26 Algorithm 1: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all patterns have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
Notes: This description is rather generic, and many issues are left unspecified: initialisation, assignment and updating. For example, what does the algorithm become if we use (1) Euclidean distance and (2) the average of the patterns?
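The four steps can be sketched as follows, taking the choices raised in the notes: Euclidean distance for assignment and the average of the patterns for the centroid update. Initialising the centroids by sampling K of the patterns is an assumption made here (the slide leaves initialisation open), and the toy data is invented.

```python
import random


def squared_distance(a, b):
    # Squared Euclidean distance gives the same nearest centroid as Euclidean.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))


def k_means(patterns, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: place K centroids (assumed choice: K randomly sampled patterns).
    centroids = rng.sample(patterns, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average of its patterns.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignments


patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, k=2)
print(assignments)
print(sorted(centroids))
```

On this toy data the first three patterns end up in one cluster and the last three in the other; which cluster is labelled 0 or 1 depends on the random initialisation.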

27 K-Means Clustering Interactive Demo:

28 Discussions (1) 1. How to determine k, the number of clusters?

29 Discussions (2) 2. Any alternative ways of choosing the initial cluster centroids?

30 Discussions (3) 3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
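One way to explore this question is to run K-Means from two different initialisations and compare the within-cluster sum of squared distances (SSE). The sketch below does so on four invented points at the corners of a wide rectangle. Keeping the run with the lowest SSE is one common practice, offered here as a possible answer and not as the answer given in the lecture; the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))


def k_means_from(patterns, initial_centroids, max_iter=100):
    """K-Means with explicitly supplied initial centroids; also returns the SSE."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:
                centroids[j] = mean(members)
    sse = sum(squared_distance(p, centroids[a])
              for p, a in zip(patterns, assignments))
    return centroids, assignments, sse


def best_of(patterns, initialisations):
    """Run once per initialisation and keep the result with the lowest SSE."""
    runs = [k_means_from(patterns, init) for init in initialisations]
    return min(runs, key=lambda run: run[2])


corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
init_a = [(0.0, 0.0), (4.0, 0.0)]  # one centroid on each short side
init_b = [(0.0, 0.0), (0.0, 1.0)]  # both centroids on the same short side

_, _, sse_a = k_means_from(corners, init_a)
_, _, sse_b = k_means_from(corners, init_b)
print(sse_a)  # 1.0
print(sse_b)  # 16.0

best_centroids, _, best_sse = best_of(corners, [init_a, init_b])
print(sorted(best_centroids), best_sse)
```

Both runs converge, but to different partitions: the first splits the rectangle into left and right pairs, the second into bottom and top pairs with a much larger SSE.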

31 Reading
- David Hand, “Principles of Data Mining”, MIT Press: Chapter 9, Section 9.3
- Pang-Ning Tan, “Introduction to Data Mining”: Chapter 8
- Anil Jain, “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters

32 Lab
In the lab:
- Examine a piece of Java code for K-Means clustering
- Explore the use of K-Means on some toy datasets
- Visualise the clusterings using an Excel macro

