Advanced Methods and Analysis for the Learning and Social Sciences

Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 12, 2012

Today’s Class Clustering

Clustering: You have a large number of data points. You want to find what structure there is among them. You don’t know anything a priori about that structure. Clustering tries to find data points that “group together”.

Clustering What types of questions could you study with clustering?

Related Topic: Factor Analysis. Not the same as clustering: factor analysis finds how data features/variables/items group together, while clustering finds how data points/students group together. In many cases, one problem can be transformed into the other, but conceptually they are still not the same thing. Next class!

Trivial Example: Let’s say your data has two variables, Pknow and Time. Clustering works for (and is equally effective in) large feature spaces.

[Scatterplot of the data points: pknow (0 to 1) versus time (-3 to +3)]

k-Means [Scatterplot of the data: pknow versus time]

Not the only clustering algorithm, just the simplest.

How did we get these clusters? First we decided how many clusters we wanted: 5. (How did we do that? More on this in a minute.) We picked starting values for the “centroids” of the clusters… Usually chosen randomly.

How did we get these clusters? First we decided how many clusters we wanted: 5. (How did we do that? More on this in a minute.) We picked starting values for the “centroids” of the clusters… For instance…

[Scatterplot with the randomly chosen starting centroids marked]

Then… We classify every point according to which centroid it is closest to. This defines the clusters, creating a “Voronoi diagram.”

[Scatterplot partitioned into Voronoi regions around the centroids]

Then… We re-fit each centroid as the center (mean) of the points in its cluster.

[Scatterplot with the centroids re-fit to the centers of their clusters]

Then… Repeat until the centroids stop moving
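
A minimal Python sketch of this loop (not from the lecture; the function name, defaults, and use of numpy are illustrative assumptions, and X is assumed to be an (n_points, 2) array of pknow/time values):

```python
import numpy as np

def kmeans(X, k=5, max_iter=100, seed=0):
    """Plain k-means sketch: pick random starting centroids, then alternate
    the assign and re-fit steps until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # starting centroids, chosen randomly
    for _ in range(max_iter):
        # Classify every point by the centroid it is closest to (the Voronoi partition)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-fit each centroid as the center of the points in its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```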

[Sequence of scatterplots showing the assign and re-fit steps repeating until the centroids stop moving]

Questions? Comments?

What happens? What happens if your starting points are in strange places? Not trivial to avoid, considering the full span of possible data distributions

What happens? What happens if your starting points are in strange places? Not trivial to avoid, considering the full span of possible data distributions There is some work on addressing this problem

[Scatterplots showing the poor clusterings that can result from badly placed starting centroids]

Solution: Run the algorithm several times with different starting points (cf. Conati & Amershi, 2009).
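
A minimal scikit-learn sketch of this restart strategy, using synthetic data just to make the snippet runnable (the parameter values are illustrative, not the lecture’s):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # stand-in for the (time, pknow) data

# n_init restarts k-means from several different random starting centroids
# and keeps the solution with the lowest distortion
km = KMeans(n_clusters=5, n_init=20, random_state=0).fit(X)
print(km.cluster_centers_)
```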

Questions? Comments?

How many clusters should you have? Can you use goodness of fit metrics?

Mean Squared Deviation (MSD, also called Distortion): Take each point P. Find the center C of P’s cluster. Find the distance D from C to P. Square D to get D’. Sum all the D’ values to get the MSD.
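
A minimal sketch of that computation in Python (the function and variable names are illustrative; X holds the points, centroids the cluster centers, and labels the index of each point’s cluster):

```python
import numpy as np

def mean_squared_deviation(X, centroids, labels):
    """Distortion / MSD: for each point, take the distance to its own cluster
    center, square it, and sum over all points."""
    diffs = X - centroids[labels]      # vector from each point P to its cluster center C
    return float(np.sum(diffs ** 2))   # sum of the squared distances D'
```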

Any problems with MSD?

Any problems with MSD? More clusters almost always lead to a smaller MSD, because the distance to the nearest cluster center should always be smaller with more clusters.

Questions? Comments?

What about cross-validation? Will that fix the problem?

What about cross-validation? Not necessarily. This is a different problem than classification: you’re not trying to predict specific values, you’re determining whether any center is close to a given point. More clusters cover the space more thoroughly, so the MSD will often be smaller with more clusters even if you cross-validate.

An Example 14 centers, ill-chosen (what you’d get on a cross-validation with too many centers) 2 centers, well-chosen (what you’d get on a cross-validation with not enough centers)

[Scatterplots: one showing the 14 ill-chosen centers, one showing the 2 well-chosen centers]

An Example The ill-chosen 14 centers will achieve a better MSD than the well-chosen 2 centers

Solution Penalize models with more clusters, according to how much extra fit would be expected from the additional clusters

Solution Penalize models with more clusters, according to how much extra fit would be expected from the additional clusters What comes to mind?

Solution Penalize models with more clusters, according to how much extra fit would be expected from the additional clusters Common approach – the Bayesian Information Criterion

Bayesian Information Criterion (Raftery, 1995): Assesses how much fit would be spuriously expected from N random parameters, assesses how much fit you actually had, and finds the difference.

Bayesian Information Criterion (Raftery, 1995) The math is… painful See Raftery, 1995; many statistical packages can also compute this for you

So how many clusters? Try several values of k. Find the “best-fitting” set of clusters for each value of k. Choose the k with the best value of BIC.
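
One way to sketch this in Python: scikit-learn’s KMeans does not report BIC directly, so this example (an assumption on my part, not the lecture’s procedure) fits a Gaussian mixture for each candidate k and uses its built-in BIC, where lower BIC means a better penalized fit:

```python
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values=range(1, 11)):
    """Fit one mixture model per candidate k and keep the k with the lowest BIC."""
    bics = {k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X).bic(X)
            for k in k_values}
    best_k = min(bics, key=bics.get)
    return best_k, bics
```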

Questions? Comments?

Let’s do a set of hands-on exercises: Apply k-means using the following points and centroids. Everyone gets to volunteer!

[Series of hands-on exercise scatterplots (pknow versus time) with points and candidate centroids]

Comments? Questions?

Other clustering algorithms: Gaussian Mixture Models. Can do fun things like overlapping clusters and assigning points to no cluster.

Gaussian Mixture Models: Each cluster has a centroid and a radius. Fit with the same approach (with some subtleties in the process for selecting the radius).

[Scatterplot of Gaussian mixture clusters drawn as centroids with radii]

Subtlety: A GMM still assigns every point to a cluster, but has a threshold on what’s really considered “in the cluster.” This threshold is used during model calculation.
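
A minimal scikit-learn sketch of this idea (the synthetic data, the number of components, and the 0.6 threshold are all illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))   # stand-in for the (time, pknow) data

gm = GaussianMixture(n_components=5, random_state=0).fit(X)
probs = gm.predict_proba(X)            # soft membership of each point in each cluster
nearest = probs.argmax(axis=1)         # the cluster each point is mathematically "in"
threshold = 0.6                        # assumed cutoff for really counting as "in the cluster"
assigned = np.where(probs.max(axis=1) >= threshold, nearest, -1)   # -1 = assigned to no cluster
```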

[Scatterplot: a point that is mathematically in the red cluster, but outside its threshold]

Can you select appropriate Gaussian Mixture Models? Everyone gets to volunteer!

[Exercise scatterplots (pknow versus time) showing candidate Gaussian Mixture Models with 5, 3, 4, and 1 centroids]

Questions? Comments?

Assessment: Can assess with the same approaches as before, plus…

Likelihood (more commonly, log likelihood): The probability of the data occurring, given the model. Assesses each point’s probability, given the set of clusters, and adds it all together.
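
A minimal sketch of computing this with scikit-learn’s Gaussian mixture implementation (the synthetic data and number of components are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))   # stand-in data
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

per_point = gm.score_samples(X)    # log probability of each individual point under the model
log_likelihood = per_point.sum()   # add them all together to get the total log likelihood
```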

For instance… [Scatterplot annotated with very unlikely points, less likely points, and likely points relative to the fitted clusters]

Disadvantages of GMMs Much slower to create

Questions? Comments?

Advanced Clustering

Spectral Clustering

Spectral Clustering [Scatterplot: pknow versus time]

Spectral Clustering: Conducts clustering through dimensionality reduction. Mathematically equivalent to k-means clustering in a non-linear, dimension-reduced space.
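
A minimal scikit-learn sketch (the synthetic data and parameter choices are illustrative): SpectralClustering builds an affinity matrix, embeds the points using the graph Laplacian’s eigenvectors, and then clusters in that reduced space:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(200, 2))   # stand-in data

# Embed via the graph Laplacian of an RBF affinity matrix, then cluster in that space
labels = SpectralClustering(n_clusters=3, affinity="rbf", random_state=0).fit_predict(X)
```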

Hierarchical Clustering: Clusters can contain sub-clusters. [Image from Perera et al. (2009)]

Hierarchical Agglomerative Clustering (HAC): Each data point starts as its own cluster. Two clusters are combined if the resulting fit is better. Continue until no more clusters can be combined.
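
A minimal SciPy sketch of agglomerative clustering (the Ward merge criterion and the synthetic data are illustrative assumptions; the lecture does not specify how "the resulting fit is better" is measured):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))   # stand-in data

Z = linkage(X, method="ward")                       # each point starts as its own cluster;
                                                    # clusters are merged bottom-up
labels = fcluster(Z, t=4, criterion="maxclust")     # cut the tree into (at most) 4 flat clusters
```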

Hierarchical Agglomerative Clustering (HAC): What are the advantages of this method relative to traditional clustering? In what applications might it make more sense?

Applications of Clustering

Conati & Amershi (2009): Split students by their behaviors in an exploratory learning environment, then looked at learning outcomes. Students with lower learning: 2 groups. Students with better learning: 1 group.

Conati & Amershi (2009): Students with lower learning (both groups) tended to move through the environment faster, and did not pause to reflect after moves (Failure to Self-Explain).

Conati & Amershi (2009): The two groups of students with lower learning differed in other aspects: the amount of pausing after back-tracking, and the degree to which students adjusted the graph. But this (apparently) did not affect learning.

Beal, Qu, & Lee (2006) Clustered students by their behaviors in an intelligent tutor

Interestingly… (at minimum to me) One of the clusters was essentially Gaming the System

These Findings: These findings establish that gaming the system and failure to self-explain (shallow learning strategies) are widely occurring and easily distinguishable from other behavior.

Cluster Analysis: In these two cases, the phenomena were already known. But it’s quite possible that this was a faster route to discovering them within a specific learning environment than observational methods, and certainly a fast route to having a model of them within the specific context.

Important points… If you cluster in a well-known domain, you are likely to obtain well-known findings

Because of this… Clustering is relatively popular, but relatively prone to producing uninteresting papers in education research, where usually a lot is already known. So use with caution!

Asgn. 7 Let’s go over 3 solutions Sweet Zak Mike Wixon

Asgn. 8 Questions? Comments?

Next Class: Friday, March 16, SPECIAL SESSION, 3pm-5pm, SSPS Conference Room. Topic: Clustering. Readings: Witten, I.H., Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. Sections 4.8, 6.6. Assignments Due: 7. Clustering.

The End