Taking Raw Data Towards Analysis 1 iCSC2015, Vince Croft, NIKHEF Exploring EDA, Clustering and Data Preprocessing Lecture 2 Taking Raw Data Towards Analysis.

Slides:



Advertisements
Similar presentations
S.Towers TerraFerMA TerraFerMA A Suite of Multivariate Analysis tools Sherry Towers SUNY-SB Version 1.0 has been released! useable by anyone with access.
Advertisements

Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Clustering Basic Concepts and Algorithms
PARTITIONAL CLUSTERING
Clustering Shannon Quinn (with thanks to J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)
K-means clustering Hongning Wang
Metrics, Algorithms & Follow-ups Profile Similarity Measures Cluster combination procedures Hierarchical vs. Non-hierarchical Clustering Statistical follow-up.
Introduction to Bioinformatics
Mutual Information Mathematical Biology Seminar
SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech  Distance Metrics: Measuring similarity using the Euclidean and Correlation.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Dimensional reduction, PCA
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
“Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Computations” By Ravi, Ma, Chiu, & Agrawal Presented.
Face Recognition Jeremy Wyatt.
Cluster Analysis (1).
What is Cluster Analysis?
Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.
K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Clustering and MDS Exploratory Data Analysis. Outline What may be hoped for by clustering What may be hoped for by clustering Representing differences.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Hydrologic Statistics
The Tutorial of Principal Component Analysis, Hierarchical Clustering, and Multidimensional Scaling Wenshan Wang.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Chapter 12 – Discriminant Analysis © Galit Shmueli and Peter Bruce 2010 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
Vladyslav Kolbasin Stable Clustering. Clustering data Clustering is part of exploratory process Standard definition:  Clustering - grouping a set of.
es/by-sa/2.0/. Principal Component Analysis & Clustering Prof:Rui Alves Dept Ciencies Mediques.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Descriptive Statistics vs. Factor Analysis Descriptive statistics will inform on the prevalence of a phenomenon, among a given population, captured by.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Computer Graphics and Image Processing (CIS-601).
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 6. Dimensionality Reduction.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Sociology 5811: Lecture 3: Measures of Central Tendency and Dispersion Copyright © 2005 by Evan Schofer Do not copy or distribute without permission.
Data Science and Big Data Analytics Chap 4: Advanced Analytical Theory and Methods: Clustering Charles Tappert Seidenberg School of CSIS, Pace University.
Introduction to LDA Jinyang Gao. Outline Bayesian Analysis Dirichlet Distribution Evolution of Topic Model Gibbs Sampling Intuition Analysis of Parameter.
Lecture 6 Spring 2010 Dr. Jianjun Hu CSCE883 Machine Learning.
Course Work Project Project title “Data Analysis Methods for Microarray Based Gene Expression Analysis” Sushil Kumar Singh (batch ) IBAB, Bangalore.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L10.1 Lecture 10: Cluster analysis l Uses of cluster analysis.
Clustering Hongfei Yan School of EECS, Peking University 7/8/2009 Refer to Aaron Kimball’s slides.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Clustering (1) Chapter 7. Outline Introduction Clustering Strategies The Curse of Dimensionality Hierarchical k-means.
Cluster Analysis, an Overview Laurie Heyer. Why Cluster? Data reduction – Analyze representative data points, not the whole dataset Hypothesis generation.
Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman University of Oxford ICCV 2003.
Clustering / Scaling. Cluster Analysis Objective: – Partitions observations into meaningful groups with individuals in a group being more “similar” to.
Exploring EDA 1 iCSC2015,Vince Croft, NIKHEF -Nijmegen Exploring EDA, Clustering and Data Preprocessing Lecture 1 Exploring EDA Vincent Croft NIKHEF -
Color Image Segmentation Mentor : Dr. Rajeev Srivastava Students: Achit Kumar Ojha Aseem Kumar Akshay Tyagi.
Clustering Shannon Quinn (with thanks to J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)
Unsupervised Classification
Big Data Infrastructure Week 9: Data Mining (4/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Data statistics and transformation revision Michael J. Watts
Chapter 12 – Discriminant Analysis
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
Lecture 2-2 Data Exploration: Understanding Data
Map Reduce.
Data Clustering Michael J. Watts
Dimension Reduction via PCA (Principal Component Analysis)
KMeans Clustering on Hadoop Fall 2013 Elke A. Rundensteiner
Descriptive Statistics vs. Factor Analysis
Data Mining – Chapter 4 Cluster Analysis Part 2
Presentation transcript:

Taking Raw Data Towards Analysis 1 iCSC2015, Vince Croft, NIKHEF Exploring EDA, Clustering and Data Preprocessing Lecture 2 Taking Raw Data Towards Analysis Vincent Croft NIKHEF - Nijmegen Inverted CERN School of Computing, February 2015

Taking Raw Data Towards Analysis 2 iCSC2015, Vince Croft, NIKHEF The path towards the sunlight…  Our eyes see hundreds of colours, our ears hear thousands of frequencies, our user logs thousands of alphanumeric values… How do we keep ourselves from being overwhelmed.

Taking Raw Data Towards Analysis 3 iCSC2015, Vince Croft, NIKHEF Outline  Mapping  Clustering  Data Reduction  Higher focus on examples  Using real data from internet  Brief introduction to scalable data analysis on big data

Taking Raw Data Towards Analysis 4 iCSC2015, Vince Croft, NIKHEF Worked Examples  All examples will be available online  If you are not here in person or want to see the examples presented for yourself please see the support documentation on my institute web page

Taking Raw Data Towards Analysis 5 iCSC2015, Vince Croft, NIKHEF Mapping – Heat Maps  One last page in R

Taking Raw Data Towards Analysis 6 iCSC2015, Vince Croft, NIKHEF Rotations - Fisher Discriminant  Rotating the axis of a 2d plot.  Used to separate two distributions.  For example signal and background.  0 axis is defined as line best separating two distributions.  This line doesn’t have to be Straight…  Other transformations?

Taking Raw Data Towards Analysis 7 iCSC2015, Vince Croft, NIKHEF Rotations- PCA  Principle Component Analysis  Rotates axis to show maximum variance. This axis is referred to as the principle axis  Other axis are defined in accordance

Taking Raw Data Towards Analysis 8 iCSC2015, Vince Croft, NIKHEF Clustering  The notion of clusters is intuitive. A grouping of objects.  Clusters can be formed from:  Objects close together  Objects with similar properties  Objects that fit a particular distribution  Clustering can include all data points  Automatically characterising groups of data.  Generalizes information for quicker processing  Clustering can highlight regions of interest  Removing data that doesn’t represent some underlying process.  Cleans data.

Taking Raw Data Towards Analysis 9 iCSC2015, Vince Croft, NIKHEF Defining Distance  Euclidean Distance (x,y)  Simple. Intuitive. Easy to visualise  Density  Correlations  Shows similarity between variables  Mahalanobis distance (standardised statistical distance)  Accounts for differences in scales between variables  Ignores effects from highly correlated variables  Ignores effects from variables with high variance  Many others.  E.g. binary distance, like manhattan distance.

Taking Raw Data Towards Analysis 10 iCSC2015, Vince Croft, NIKHEF Hierarchical Clustering  Deterministic  Results are always the same  Shows scale  All points are clustered eventually  Needs stopping condition  Uses various distance metrics  The closest two points are always the closest two, the two highest correlations are the two highest correlations

Taking Raw Data Towards Analysis 11 iCSC2015, Vince Croft, NIKHEF Hierarchical Clustering  First find two closest points  Merge into single cluster  Find next two closest points  Merge  Continue until stop or all points are clustered  Stopping conditions include:  Number of clusters  Max distance  Fit to distribution

Taking Raw Data Towards Analysis 12 iCSC2015, Vince Croft, NIKHEF K-Means Clustering  K is the number of clusters  This must be specified.  The initial properties of each centroid must be provided  Often this must be guessed  Iterates over data until the position of the centroid doesn’t change

Taking Raw Data Towards Analysis 13 iCSC2015, Vince Croft, NIKHEF K-Means Clustering  Pick number of clusters  Guess/assign centroids  Assign points to the closest centroid  Recalculate centroids

Taking Raw Data Towards Analysis 14 iCSC2015, Vince Croft, NIKHEF Dimensional Reduction  Often we don’t need all the information about a topic to characterise the underlying process.  We can transform the data to summarise the data  E.g SVD or PCA  We can cluster the data  E.g. Hierarchical or k-means clustering  This can give us statistical information.  This can also be used for data compression.  (less variables=less data but with the same information)

Taking Raw Data Towards Analysis 15 iCSC2015, Vince Croft, NIKHEF Summary  Data can show us lots of information.  Information can be obtained from the inter-variable relationships. E.g. (PCA)  Information can be obtained from the summaries of multivariate distributions.  In Multivariate analysis adding variables and adding more data sometimes hides information rather than adds to it.  By exploring the correlations, ranks and distributions of our data we can optimise the information contained for analysis.

Taking Raw Data Towards Analysis 16 iCSC2015, Vince Croft, NIKHEF Map Reduce  In MVA each additional variable reduces the density of information and increases processing time exponentially.  MapReduce is a scalable programming model designed for processing very large data sets in a parallel distributed environment  Two steps. (possibly iterated)  Map Data  Filters and sorting  e.g. making clusters for each event  Reduce Data  Makes summary of data  E.g. combines clusters into histograms Use these to redefine clusters

Taking Raw Data Towards Analysis 17 iCSC2015, Vince Croft, NIKHEF Hadoop  Platform for distributed computing and parallelized computation whilst being scalable to meet exponential increases in data and cheap to implement.  Inspired by Google research and Google File System  Key implementation in analysis for Facebook, Yahoo, american express and many more.