
1 Nearest Neighbour and Clustering

2 Nearest Neighbour and Clustering Clustering and nearest-neighbour prediction are among the oldest techniques used in data mining. Clustering: records that resemble one another are grouped together and placed in the same cluster. Nearest neighbour is a prediction technique closely related to clustering: to determine the prediction value for a new record, the user looks for records with similar predictor values in the historical database and uses the prediction value from the record that is nearest to the unknown record. The nearest-neighbour prediction algorithm therefore works with a notion of nearness between records in the database, and what counts as "near" depends on a variety of factors.
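
A minimal sketch of the nearest-neighbour prediction described on this slide, using 1-nearest-neighbour over a tiny historical table; the field names (income, age, defaulted) and the records are illustrative assumptions, not data from the slides.

```python
import math

# Historical records with known prediction values (illustrative data).
history = [
    {"income": 35000, "age": 25, "defaulted": False},
    {"income": 22000, "age": 41, "defaulted": True},
    {"income": 78000, "age": 36, "defaulted": False},
]

def distance(a, b):
    """Euclidean distance over the predictor fields income and age."""
    return math.sqrt((a["income"] - b["income"]) ** 2 + (a["age"] - b["age"]) ** 2)

def nearest_neighbour_prediction(new_record):
    """Copy the prediction value from the nearest historical record."""
    nearest = min(history, key=lambda rec: distance(rec, new_record))
    return nearest["defaulted"]

print(nearest_neighbour_prediction({"income": 24000, "age": 39}))  # -> True
```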

3 Where to use clustering and nearest-neighbour prediction Clustering and nearest-neighbour prediction are used in a wide variety of applications, such as predicting personal financial problems in the banking industry and computer recognition of a person's handwriting. These methods are also used by ordinary people in everyday life without realising that they are clustering, e.g. when we group certain food items or automobiles together. Clustering for clarity: clustering is a method in which records of the same kind are grouped together to provide an easy, high-level view of what is going on inside the database. Clustering is sometimes called segmentation, which is especially important in marketing.
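
As a rough illustration of clustering for clarity (segmentation), the sketch below groups customer records into a few segments with k-means; scikit-learn and the two-column customer data are assumptions made for this example, not something the slide prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative customer records: [annual income ($), age (years)].
customers = np.array([
    [22000, 23], [25000, 27], [24000, 25],   # younger, lower income
    [78000, 45], [82000, 50], [75000, 48],   # older, higher income
    [40000, 35], [42000, 33],                # a middle group
])

# Group the records into three segments and print each record's segment label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
for label, record in zip(kmeans.labels_, customers):
    print(label, record)
```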

4 Examples of Clustering Applications Marketing: help marketers discover distinct groups in their customer bases and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth-observation database. Insurance: identifying groups of motor-insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicentres should be clustered along continental faults.

5 Two commercial products that offer clustering are PRIZM from Claritas Corporation and MicroVision from Equifax Corporation. These companies have grouped the population into segments using demographic information. To build these groupings they use data such as income, age, occupation, housing, and race collected from the US census, and they assign a memorable name to each cluster. End users can apply this clustering to tag the customers in their own database, giving the business user a quick, high-level view of each cluster. Once they have worked with these clusters for some time, they can anticipate fairly well how each cluster will respond to the marketing offers of their business. Not all of the clusters are useful to a particular business: some may be relevant and some may not. The same clusters may also be used by competitors for their own marketing offers, so it is important to be aware of how our own customer base reacts to the clusters.

6 Clustering for outlier analysis Some clustering is performed not so much to keep similar records together as to make it easier to see when one record sticks out from the rest. That is, the clustering helps us to spot the records that do not fit into any of the clusters, which are called outliers. The clusters let us analyse these outliers to find out why they differ from the general characteristics of the clusters, e.g. credit-card transactions that fall outside the usual clusters of spending behaviour.
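
One simple way to turn clusters into an outlier check, as the slide describes, is to flag records that lie unusually far from their cluster centre. The single-cluster simplification and the "mean distance plus two standard deviations" threshold below are assumptions chosen for the sketch.

```python
import numpy as np

# Illustrative credit-card transactions: [amount ($), hour of day].
transactions = np.array([
    [20, 9], [35, 12], [25, 14], [30, 10], [28, 13],
    [22, 11], [900, 3],                     # the last record sticks out
])

# Treat all records as one cluster and measure each record's distance to its centre.
centre = transactions.mean(axis=0)
dist = np.linalg.norm(transactions - centre, axis=1)

# Flag records much farther from the centre than is typical.
threshold = dist.mean() + 2 * dist.std()
print(transactions[dist > threshold])        # -> [[900   3]]
```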

7 Nearest Neighbour for prediction One essential element of clustering is that one particular object can be closer to another object than to a third. Most people have this sense of ordering over a variety of objects, e.g. most people will agree that an apple is closer to an orange than to a tomato. This sense of ordering is what lets us form clusters. The nearest-neighbour prediction algorithm can be stated as: "Objects that are near to each other will have similar prediction values." That is, if we know the prediction value of one object, we can predict it for its nearest neighbours. One of the classic places where nearest neighbour has been used is in text retrieval.
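
The text-retrieval use of nearest neighbour mentioned on this slide can be sketched with word-count vectors and cosine similarity; the tiny corpus and the bag-of-words representation are assumptions made for the example.

```python
from collections import Counter
import math

corpus = [
    "clustering groups similar records together",
    "nearest neighbour predicts from similar records",
    "earthquake epicentres cluster along faults",
]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = Counter("similar records".split())
vectors = [Counter(doc.split()) for doc in corpus]

# Retrieve the document that is nearest to the query.
best = max(range(len(corpus)), key=lambda i: cosine(query, vectors[i]))
print(corpus[best])                          # -> "clustering groups similar records together"
```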

8 How clustering and nearest-neighbour prediction work Records are viewed as points in an n-dimensional space. [Figure: scatterplot of records with Income on the x-axis (up to $120,000) and Age on the y-axis (up to 100 years).]

9 Weighting the dimensions: distance with a purpose Round clusters are easy to spot visually because of the implicit normalization of the dimensions. In some cases we need to give extra weight to particular fields when creating the clusters or when measuring nearness; that is, we cannot assume that every dimension contributes equally. The weighting depends on what you are trying to achieve: the key predictors for determining what is near and what is not should be more heavily weighted.
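
A sketch of normalizing and weighting dimensions before measuring distance, as the slide suggests; the scale factors and the weights below are illustrative assumptions rather than values from the slides.

```python
import math

# Two customer records: (income in $, age in years).
a = (60000, 30)
b = (61000, 65)

# Raw Euclidean distance is dominated by income because of its much larger scale.
raw = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Normalize each dimension to a comparable range, then weight age more heavily
# because (in this example) age is assumed to be the better predictor.
scale = (120000, 100)      # assumed maximum income and age
weight = (1.0, 3.0)        # assumed weights: age counts three times as much as income
weighted = math.sqrt(sum(
    (weight[i] * (a[i] - b[i]) / scale[i]) ** 2 for i in range(2)
))

print(raw, weighted)       # income dominates the raw distance; age dominates the weighted one
```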

10 Calculating dimension weights There are several ways of calculating the importance of different dimensions. In data mining, records can have many dimensions, and some of them may be binary. Each dimension can be weighted by calculating how relevant that particular predictor is for making the prediction. The calculation is based on the predictor and prediction columns, for example the conditional probability that the prediction has a certain value given that the predictor has a certain value. Dimension weights have also been calculated via algorithmic search.
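
One way to read the conditional-probability idea on this slide: for a binary predictor, weight the dimension by how much knowing the predictor value shifts the probability of the prediction value. The toy columns and the lift-over-baseline formula are assumptions for illustration, not the slide's prescribed method.

```python
# Toy historical data: one binary predictor column and a binary prediction column.
predictor  = [1, 1, 1, 0, 0, 1, 0, 0]   # e.g. "owns home"
prediction = [1, 1, 0, 0, 0, 1, 0, 1]   # e.g. "responds to offer"

# Conditional probability that the prediction is 1 given that the predictor is 1.
both = sum(1 for x, y in zip(predictor, prediction) if x == 1 and y == 1)
p_given = both / sum(predictor)

# Baseline probability that the prediction is 1 regardless of the predictor.
baseline = sum(prediction) / len(prediction)

# Use the lift over the baseline as this dimension's weight (an illustrative choice).
weight = p_given / baseline
print(p_given, baseline, weight)        # 0.75 0.5 1.5
```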

11 There are two main types of clustering techniques: those that create a hierarchy of clusters and those that do not. Hierarchical clustering techniques create a hierarchy of clusters from small to big. The main reason is that clustering has no single absolutely correct answer, so depending on the particular application fewer or more clusters may be desired. With a hierarchy of clusters defined, it is possible to choose the number of clusters that is desired. At the lowest level of the hierarchy the clusters are simply the individual records in the database, i.e. a clustering that ends up with as many clusters as there are records. One of the main advantages of hierarchical clustering is that it allows the end user to choose either many clusters or only a few. The hierarchy of clusters is usually viewed as a tree, in which the smallest clusters merge together to create the next higher level of clusters, and those in turn merge to create the level above.

12 When the hierarchy is given, we can easily judge whether the right number of clusters has been created and whether they provide adequate information. There are two main types of hierarchical clustering algorithms: Agglomerative – starts with each record as its own small cluster and then merges clusters together until large clusters are formed. Divisive – the opposite approach to agglomerative: it splits the data into smaller pieces and then in turn tries to split those smaller pieces. Agglomerative techniques are the most commonly used for hierarchical clustering. Non-hierarchical clustering is easier to create.
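
A sketch of agglomerative clustering, in which every record starts as its own cluster and clusters are merged into a hierarchy from which any desired number of clusters can be cut; SciPy, the Ward merge criterion, and the toy points are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy records: [income ($1000s), age (years)].
records = np.array([
    [22, 23], [25, 27], [24, 25],    # one natural group
    [78, 45], [82, 50], [75, 48],    # another natural group
])

# Build the full merge hierarchy (each record starts as its own cluster).
tree = linkage(records, method="ward")

# Cut the same tree at different levels: the end user chooses how many clusters to keep.
print(fcluster(tree, t=2, criterion="maxclust"))   # two clusters
print(fcluster(tree, t=3, criterion="maxclust"))   # three clusters
```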

