Nearest Neighbour and Clustering

Clustering and nearest-neighbour prediction are among the oldest techniques used in data mining. In clustering, similar records are grouped, or clustered, together and placed in the same grouping. Nearest neighbour is a prediction technique closely related to clustering: to determine the prediction value for a new record, the user looks for records with similar predictor values in the historical database and uses the prediction value from the record that is nearest to the unknown record. How well the nearest-neighbour prediction algorithm works therefore depends on how "nearness" is measured in the database, which in turn depends on a variety of factors.
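
A minimal sketch of this idea in Python (the table, column names, and values below are hypothetical, not from the text): for a new record, find the historical record whose predictor values are closest and copy its prediction value.

```python
import math

# Hypothetical historical records: predictor values plus a known prediction value.
historical = [
    {"age": 25, "income": 30000, "default": "no"},
    {"age": 47, "income": 85000, "default": "no"},
    {"age": 52, "income": 20000, "default": "yes"},
]

def distance(a, b, predictors):
    """Euclidean distance between two records over the chosen predictor columns."""
    return math.sqrt(sum((a[p] - b[p]) ** 2 for p in predictors))

def nearest_neighbour_predict(new_record, records, predictors, target):
    """Copy the prediction value from the historical record nearest to new_record."""
    nearest = min(records, key=lambda r: distance(new_record, r, predictors))
    return nearest[target]

print(nearest_neighbour_predict({"age": 50, "income": 22000},
                                historical, ["age", "income"], "default"))
# -> "yes"  (the closest historical record is the 52-year-old with income 20000)
```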

Where to use clustering and nearest-neighbour prediction. Clustering and nearest-neighbour prediction are used in a wide variety of applications, such as predicting personal financial problems in the banking industry and computer recognition of a person's handwriting. These methods are also used by ordinary people in everyday life, often without realising that they are clustering; for example, we group certain food items or automobiles together. Clustering for clarity. Clustering groups records of the same kind together in order to provide an easier view of what is going on inside the database. Clustering is sometimes called segmentation, a term that is especially important in marketing.

Examples of clustering applications. Marketing: helping marketers discover distinct groups in their customer bases and then using this knowledge to develop targeted marketing programs. Land use: identifying areas of similar land use in an earth-observation database. Insurance: identifying groups of motor-insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to house type, value, and geographical location. Earthquake studies: observed earthquake epicentres tend to be clustered along continental faults.

Two commercial products that offer clustering are PRIZM from Claritas Corporation and MicroVision from Equifax Corporation. These companies have grouped the population into demographic segments, built from information such as income, age, occupation, housing, and race collected from the US census, and have given each cluster a memorable name. End users can apply this clustering to tag the customers in their own database and get a quick, high-level view of each cluster. Once business users have worked with these clusters for some time, they can anticipate fairly well how each cluster will respond to their marketing offers. Not all of the clusters are useful to a particular business: some may be relevant and some may not, and the same clusters are available to competitors for their own marketing offers. It is therefore important to understand how one's own customer base reacts to the clusters.

Clustering for outlier analysis. Some clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. Records that do not fit well into any cluster are called outliers. The clusters help us analyse these outliers and find out why their characteristics differ from those of the clusters; for example, credit-card records that fall far outside every cluster may warrant further investigation.
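
One simple way to flag such records, as a hedged sketch (the data, the centroid-distance measure, and the 3x threshold are all illustrative choices, not from the text):

```python
import math

# Hypothetical cluster of (age, income) records plus one candidate record.
cluster = [(25, 30000), (27, 32000), (24, 29000), (26, 31000)]

def centroid(points):
    """Mean of each dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

center = centroid(cluster)
# Flag records whose distance from the centroid is far larger than is typical.
typical = sum(distance(p, center) for p in cluster) / len(cluster)

candidate = (26, 95000)          # hypothetical record
if distance(candidate, center) > 3 * typical:
    print("outlier:", candidate)
```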

Nearest neighbour for prediction. One essential element of clustering is the notion that one particular object can be closer to another object than to a third. Most people have an intuitive sense of ordering over a variety of objects; for example, most people would agree that an apple is closer to an orange than it is to a tomato. This sense of ordering is what lets us form clusters. The nearest-neighbour prediction algorithm can be stated simply: objects that are near to each other will have similar prediction values. Thus, if we know the prediction value of one object, we can predict it for its nearest neighbours. One of the classic places where nearest neighbour has been used is text retrieval.
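
A hedged sketch of nearest neighbour in a text-retrieval setting (the documents, query, and bag-of-words representation are illustrative assumptions): the "nearest" document to a query is the one whose word-count vector is most similar, here measured with cosine similarity.

```python
import math
from collections import Counter

documents = [
    "clustering groups similar records together",
    "nearest neighbour predicts from similar records",
    "stock prices rose sharply today",
]

def vectorise(text):
    """Very simple bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = vectorise("similar records")
doc_vectors = [vectorise(d) for d in documents]
nearest = max(range(len(documents)), key=lambda i: cosine_similarity(query, doc_vectors[i]))
print("nearest document:", documents[nearest])
# -> "clustering groups similar records together"
```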

How clustering and nearest-neighbour prediction work. Both techniques view each record as a point in an n-dimensional space, with one dimension per predictor. [Figure: a two-dimensional example with income in dollars on the x-axis and age (0-100 years) on the y-axis; records with similar values lie close together.]

Weighting the dimensions: distance with a purpose. Round clusters are easy to spot visually because of the implicit normalisation of the dimensions. In some cases, however, we need to give extra weight to particular fields when creating the clusters or measuring nearness; we cannot assume that every dimension contributes equally. How much weight to give depends on what you are trying to achieve: the key predictors for determining what is near and what is not should be weighted more heavily.
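
A minimal sketch of a weighted distance measure (the field names, units, and weights are hypothetical): each dimension's squared difference is scaled by a weight before summing, so heavily weighted predictors dominate the notion of nearness.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each dimension's contribution is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical records: (age in years, income in tens of thousands of dollars).
r1, r2 = (30, 4.0), (35, 4.5)

# Equal weights versus treating age as the key predictor.
print(weighted_distance(r1, r2, (1.0, 1.0)))   # about 5.02
print(weighted_distance(r1, r2, (4.0, 0.5)))   # about 10.01
```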

Calculating dimension weights. There are several ways to calculate the importance of different dimensions. In document mining, for instance, there may be many dimensions and all of them may be binary. Each dimension can be weighted by calculating how relevant that particular predictor is for making the prediction. The calculation is based on the predictor and prediction columns, for example on the conditional probability that the prediction takes a certain value given that the predictor takes a certain value. Dimension weights have also been found via algorithmic search.
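
A hedged sketch of one such weighting heuristic (the records and the particular rule are illustrative, not the only possibility): weight each binary predictor by how much the conditional probability of a positive prediction changes between records where the predictor is 1 and records where it is 0.

```python
# Hypothetical binary training data: each row is (predictor values, prediction).
rows = [
    ({"has_pets": 1, "urban": 0}, 1),
    ({"has_pets": 1, "urban": 1}, 1),
    ({"has_pets": 0, "urban": 1}, 0),
    ({"has_pets": 0, "urban": 0}, 0),
    ({"has_pets": 1, "urban": 0}, 1),
    ({"has_pets": 0, "urban": 1}, 1),
]

def conditional_prob(rows, predictor, value):
    """P(prediction = 1 | predictor = value)."""
    matching = [pred for feats, pred in rows if feats[predictor] == value]
    return sum(matching) / len(matching) if matching else 0.0

def dimension_weight(rows, predictor):
    """Weight = how much the predictor shifts the probability of a positive prediction."""
    return abs(conditional_prob(rows, predictor, 1) - conditional_prob(rows, predictor, 0))

for predictor in ["has_pets", "urban"]:
    print(predictor, round(dimension_weight(rows, predictor), 2))
# has_pets is the stronger predictor in this toy data, so it receives the larger weight.
```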

There are two main types of clustering techniques: those that create a hierarchy of clusters and those that do not. Hierarchical clustering techniques create a hierarchy of clusters running from small to big. The main reason is that clustering has no single absolutely correct answer, so depending on the particular application fewer or more clusters may be desired. With the hierarchy of clusters defined, it is possible to choose the number of clusters that is desired, anywhere from the extreme where every record in the database is its own cluster up to a single cluster containing everything. One of the main advantages of hierarchical clustering is that it allows the end user to choose either many clusters or only a few. The hierarchy of clusters is usually viewed as a tree in which the smallest clusters merge together to create the next-higher level of clusters, and those in turn merge to create the level above.
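
As a sketch of choosing the number of clusters from one hierarchy (assuming SciPy is available; the toy records are hypothetical), the same linkage tree can be cut at different levels to yield either a coarse or a fine grouping:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (age, income-in-$10k) records.
X = np.array([[25, 3.0], [27, 3.2], [45, 8.0], [47, 8.5], [62, 2.0], [60, 2.2]])

# Build the full hierarchy once (average-linkage agglomerative clustering).
Z = linkage(X, method="average")

# Cut the same tree at different levels to get however many clusters are desired.
print(fcluster(Z, t=2, criterion="maxclust"))  # two broad clusters
print(fcluster(Z, t=3, criterion="maxclust"))  # three finer clusters
```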

Once the hierarchy is given, it is easy to see whether the right number of clusters has been created and whether they provide adequate information. There are two main types of hierarchical clustering algorithms: Agglomerative, which starts with each record as its own small cluster and then repeatedly merges the closest clusters into larger ones; and Divisive, which takes the opposite approach, splitting the full set of records into smaller pieces and then in turn trying to split those smaller pieces. Agglomerative techniques are the most commonly used for clustering, while non-hierarchical techniques are generally easier to create.
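
A minimal from-scratch sketch of the agglomerative approach (single-link merging on hypothetical points; the stopping point of three clusters is arbitrary): every record starts as its own cluster, and the two closest clusters are merged repeatedly until the desired number remains.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]          # every record starts as its own cluster
    while len(clusters) > target_clusters:
        # Find the pair of clusters that are closest to each other...
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # ...and merge them into one bigger cluster.
        clusters[i] += clusters.pop(j)
    return clusters

points = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]   # hypothetical records
print(agglomerate(points, 3))
# -> [[(1, 1), (1.2, 1.1)], [(5, 5), (5.1, 4.9)], [(9, 1)]]
```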