Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching
Anshul Varma, Faisal Qureshi

Running Times (rounded up)  Image source: Algorithm Design by Jon Kleinberg and Éva Tardos, page 34

Clustering Large Data Sets  A data set can be large in three different ways: ◦There can be a large number of elements in the data set ◦Each element can have many features (high dimensionality) ◦There can be many clusters to discover

Problem Statement  Traditional clustering algorithms are computationally expensive when applied to large data sets. Examples include: ◦Single linkage ◦K-means ◦Etc.

Single Linkage Clustering Algorithm
 Start by placing each point in its own cluster: O(n)
 Calculate and store the distance between each pair of clusters: O(n²)
 While there are more than k clusters (the loop runs O(n) times):
- Let A, B be the two closest clusters: O(n²)
- Add cluster A ∪ B: O(n)
- Remove clusters A and B: O(n)
- Find the distance between A ∪ B and all other clusters: O(n²)
 Time complexity: O(n³); Space complexity: O(n²)
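The steps above can be sketched in Python as follows. This is a minimal illustration of the naive O(n³) procedure, not the paper's implementation; the point representation (tuples) and the Euclidean metric are assumptions for the example.

```python
import math

def single_linkage(points, k):
    """Naive single-linkage agglomerative clustering: repeatedly
    merge the two closest clusters until only k remain."""
    # Start by placing each point in its own cluster: O(n)
    clusters = [[p] for p in points]

    def cluster_dist(ca, cb):
        # Single-link distance: closest pair across the two clusters
        return min(math.dist(p, q) for p in ca for q in cb)

    # While there are more than k clusters: O(n) iterations
    while len(clusters) > k:
        # Find the two closest clusters A, B: O(n^2) distance lookups
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Add cluster A ∪ B, remove clusters A and B
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Recomputing `cluster_dist` inside the loop makes each iteration even more expensive than the stored-matrix version on the slide, but it keeps the sketch short and makes the O(n³)-style cost easy to see.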

Single Linkage Clustering Output

Better Approach?  Can we cluster data points more efficiently?  Can we cluster without computing the expensive distance for every pair of points?  Can we disregard pairs of points that will never lie in the same cluster?

Efficient Clustering Using Canopies  First stage: create canopies ◦Use a quick, cheap distance metric to group points into overlapping subsets (canopies)  Second stage: run a traditional clustering algorithm ◦Compute the expensive distance metric ◦Only for pairs of points that share a canopy
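The first stage can be sketched as below. The parameter names `t1` and `t2` (the loose and tight thresholds) and the `cheap_dist` callback are illustrative assumptions, with `t1 ≥ t2 > 0`:

```python
import random

def make_canopies(points, t1, t2, cheap_dist):
    """Sketch of the canopy-creation stage: pick a random center,
    put every candidate within t1 of it into that canopy, and drop
    candidates within t2 from future picks. Because t1 >= t2,
    canopies can overlap."""
    candidates = list(points)
    canopies = []
    while candidates:
        center = random.choice(candidates)
        canopy = [p for p in candidates if cheap_dist(center, p) < t1]
        canopies.append(canopy)
        # Points tightly bound to this center can no longer be centers
        candidates = [p for p in candidates if cheap_dist(center, p) >= t2]
    return canopies
```

The loop always terminates because each iteration removes at least the chosen center (its distance to itself is 0 < t2). The expensive metric of the second stage is then evaluated only for pairs that co-occur in at least one canopy.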

Canopy Based Clustering Example

End Goal  Implement an efficient clustering algorithm using the concept of canopies: ◦With agglomerative clustering (such as linkage-based methods) ◦With expectation-maximization clustering (such as k-means)  Measure its performance on large data sets ◦Such as text data ◦Reference-matching data  Compare its performance with traditional clustering algorithms
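As one concrete way canopies can speed up an agglomerative method, here is a sketch of threshold-based single linkage where the expensive metric is evaluated only for pairs that share a canopy; `threshold` and `expensive_dist` are hypothetical names, and points are assumed hashable. Union-find collects the resulting connected components as clusters.

```python
from itertools import combinations

def canopy_single_linkage(points, canopies, threshold, expensive_dist):
    """Merge any two points closer than `threshold`, but compute the
    expensive distance only for co-canopy pairs."""
    parent = {p: p for p in points}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Expensive distances are computed only within canopies
    for canopy in canopies:
        for a, b in combinations(canopy, 2):
            if expensive_dist(a, b) < threshold:
                union(a, b)

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())
```

Pairs that never share a canopy are never compared with the expensive metric, which is exactly the saving the two-stage approach is after.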

Thanks!  Any questions?