Lecture 07: Dealing with Big Data October 5, 2015 SDS 235 Visual Analytics Note: slide deck adapted from R. Chang
Announcements Solutions to Assignment 1 posted: Piazza > Resources Assignment 2 posted to course website: due October 14th Please respond to Piazza poll re: preferred lab style
Outline Dealing with high-dimensional data Real World Problem #1: Aggregation and Sampling (Jordan) Dimension reduction / projection (Rajmonda Caceres, MITLL) Real World Problem #2: Ellen Moorhouse, Women’s Fund
Recap: Keim’s Visual Analytics Model Image source: Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.
Putting Keim’s VA Model in Context Knowledge Visual Mapping Visualization Mental Model Data Model Interaction Data
General Concept If the data scale is truly too large, we can find ways to trim it: We can reduce the number of rows Subsampling Clustering Or reduce the number of columns Dimension reduction
Food for Thought How do we cut down the size while maintaining the general “characteristics” of the original data? How much can be trimmed? If we do analysis based on the trimmed data, does it still apply to the original (raw) data?
Reducing Rows: Sub-Sampling Goal: find a subset of the original dataset that exhibits the same (or similar) characteristics as the full dataset One simple approach: use random sampling Simple random sampling Systematic sampling Etc. Key point: each element must have an equal non-zero chance of being selected e.g. Selecting individuals from households
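The two sampling schemes named above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy population are my own, not from the lecture.

```python
import random

def simple_random_sample(data, k, seed=None):
    """Simple random sampling: every element has an equal,
    non-zero chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(data, k)

def systematic_sample(data, k):
    """Systematic sampling: take every (n // k)-th element,
    starting from the first."""
    step = len(data) // k
    return [data[i] for i in range(0, step * k, step)]

population = list(range(100))
print(simple_random_sample(population, 10, seed=0))
print(systematic_sample(population, 10))  # [0, 10, 20, ..., 90]
```

Note that systematic sampling is only "random" if the starting point is chosen randomly and the data has no periodic structure aligned with the step size.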
Caveat Whenever you look only at a subset of the data, you may be introducing sampling error
Measuring Sampling Error If we assume that the population follows a normal distribution Further assume that the variability of the population is known (as measured by standard deviation σ) Then the standard error of the sample mean is given by: σ/√n (where n = sample size)
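The formula above is a one-liner, but it is worth seeing the practical consequence: because the error shrinks with √n, quadrupling the sample size only halves the error. A small sketch (my own example values):

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# With population standard deviation sigma = 2.0:
print(standard_error(2.0, 25))   # 0.4
print(standard_error(2.0, 100))  # 0.2 -- 4x the samples, half the error
```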
Reducing Rows: Clustering Imagine subsampling as “cropping” your data What if instead, we just want to change the “resolution”? Main idea: group similar data items together and represent them as a single entity Note the word similar: clustering always requires a distance function (some way to compare elements)
Clustering Algorithm: K-means There are numerous clustering algorithms out there Here we look at one popular one: K-means Inputs: K: number of clusters distance function: d(xi, xj) Steps: (1) choose K initial cluster centers (2) assign each item to its nearest center (3) recompute each center as the mean of its cluster (4) repeat steps (2)–(3) until assignments no longer change
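The four steps can be sketched directly in Python. This is a minimal, standard-library implementation of Lloyd's algorithm for K-means; the Euclidean distance function and the random initialization strategy are illustrative choices, not prescribed by the slides.

```python
import random

def euclidean(a, b):
    """Distance function d(xi, xj) for points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, d=euclidean, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # (1) choose K initial centers
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # (2) assign to nearest center
            i = min(range(k), key=lambda i: d(p, centroids[i]))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]   # (3) recompute centers as means
        if new == centroids:                       # (4) stop when stable
            break
        centroids = new
    return centroids, clusters
```

For example, clustering two well-separated groups of 2-D points with k=2 recovers the groups regardless of which points are drawn as initial centers (for badly separated data, K-means can converge to a poor local optimum, so multiple restarts are common in practice).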
Reducing Columns: Rank vs. Dimensionality How many dimensions are there in your data? What is its true rank? Example: degrees of freedom
Flashback to Lecture 1: Data Definition A typical dataset in visualization consists of n records: (r1, r2, r3, … , rn) Each record ri consists of (m >= 1) observations or variables: (v1, v2, v3, … , vm) A variable may be either independent or dependent: An independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset) A dependent variable (dv) is affected by a variation in one or more associated independent variables (e.g., temperature in a region) Formal definition: ri = (iv1, iv2, iv3, … , iv_mi, dv1, dv2, dv3, … , dv_md), where m = mi + md
Dimension Reduction Goal: find the smallest set of dimensions that effectively characterize parts of the dataset you care about Two common techniques (Rajmonda): Principal Component Analysis Multi-Dimensional Scaling
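To make the PCA idea concrete: the first principal component is the direction of maximum variance, and projecting onto it reduces the data to one column while preserving as much spread as possible. Below is a minimal sketch for 2-D data using a closed-form eigendecomposition of the 2×2 covariance matrix (a general n-D PCA would instead use SVD, e.g. via NumPy); the function name and example are mine, not from the lecture.

```python
import math

def pca_project_2d(points):
    """Project 2-D points onto their first principal component.
    Toy sketch: works only for 2-D data, via the closed-form
    eigendecomposition of the 2x2 covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]                 # center the data
    ys = [p[1] - my for p in points]
    a = sum(x * x for x in xs) / n                   # var(x)
    b = sum(x * y for x, y in zip(xs, ys)) / n       # cov(x, y)
    c = sum(y * y for y in ys) / n                   # var(y)
    # Largest eigenvalue of the covariance matrix
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding (unit) eigenvector = first principal axis
    if b != 0:
        vx, vy = lam - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [x * vx + y * vy for x, y in zip(xs, ys)]
```

For points lying exactly on a line (e.g. y = x), the projection preserves all pairwise distances, so nothing is lost by dropping from two dimensions to one.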