Lecture 07: Dealing with Big Data


1 Lecture 07: Dealing with Big Data
October 5, 2015
SDS 235: Visual Analytics
Note: slide deck adapted from R. Chang

2 Announcements
Solutions to Assignment 1 posted: Piazza > Resources
Assignment 2 posted to the course website: due October 14th
Please respond to the Piazza poll re: preferred lab style

3 Outline
Dealing with high-dimensional data:
Aggregation and sampling (Jordan)
Dimension reduction / projection (Rajmonda Caceres, MITLL)
Real World Problem #1: Ellen Moorhouse, Women's Fund

4 Recap: Keim’s Visual Analytics Model
Image source: Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.

5 Putting Keim’s VA Model in Context
[Diagram: the model's components in context: Data, Data Model, Visual Mapping, Visualization, Interaction, Mental Model, Knowledge]

6 General Concept
If the data scale is truly too large, we can find ways to trim it: we can reduce the number of rows (subsampling, clustering) or reduce the number of columns (dimension reduction).

7 Food for Thought How do we cut down the size while maintaining the general “characteristics” of the original data? How much can be trimmed? If we do analysis based on the trimmed data, does it still apply to the original (raw) data?

8 Reducing Rows: Sub-Sampling
Goal: find a subset of the original dataset that exhibits the same (or similar) characteristics as the full dataset. One simple approach: random sampling, e.g., simple random sampling, systematic sampling, etc. Key point: each element must have an equal, non-zero chance of being selected (e.g., selecting individuals from households). A minimal sketch of both sampling styles follows.
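As a rough illustration (not from the slides), here is a minimal Python sketch of both sampling styles; the function names and the list-of-rows data layout are assumptions made for the example:

import random

def simple_random_sample(rows, n, seed=None):
    # Simple random sampling: every row has the same non-zero
    # chance of selection (drawn without replacement).
    rng = random.Random(seed)
    return rng.sample(rows, n)

def systematic_sample(rows, step, start=0):
    # Systematic sampling: take every step-th row from a starting offset.
    return rows[start::step]

For instance, simple_random_sample(data, 100) draws 100 rows uniformly at random from a list called data.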

9 Caveat Whenever you look only at a subset of the data, you may be introducing sampling error

10 Measuring Sampling Error
If we assume that the population follows a normal distribution, and further assume that the variability of the population is known (as measured by its standard deviation σ), then the standard error of the sample mean is σ/√n, where n is the sample size.
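As a quick worked example (the numbers are made up for illustration):

import math

def standard_error(sigma, n):
    # Standard error of the sample mean: sigma / sqrt(n).
    return sigma / math.sqrt(n)

# Population standard deviation 15, sample size 100:
print(standard_error(15.0, 100))  # 1.5

Note that quadrupling the sample size only halves the standard error.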

11 Reducing Rows: Clustering
Imagine subsampling as "cropping" your data. What if, instead, we just want to change its "resolution"? Main idea: group similar data items together and represent them as a single entity. Note the word similar: clustering always requires a distance function (some way to compare elements).

12 Clustering Algorithm: K-means
There are numerous clustering algorithms out there; here we look at one popular one: K-means. Inputs: K, the number of clusters, and a distance function d(xi, xj). The algorithm: (1) initialize K cluster centroids; (2) assign each point to its nearest centroid; (3) recompute each centroid as the mean of the points assigned to it; (4) repeat steps 2 and 3 until the assignments stop changing. A minimal sketch follows.
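A minimal Python sketch of those four steps (the tuple representation of points, the helper names, and the exact-equality convergence test are assumptions for illustration, not the lecture's code):

import random

def euclidean(x, y):
    # One possible distance function d(xi, xj).
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def kmeans(points, k, d=euclidean, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # (1) initialize K centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # (2) assign each point to
            j = min(range(k), key=lambda j: d(p, centroids[j]))  # nearest centroid
            clusters[j].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]   # (3) recompute cluster means
        if new == centroids:                       # (4) stop once nothing moves
            break
        centroids = new
    return centroids

Each point is a tuple of coordinates; kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2) returns centroids near (0, 0.5) and (10, 10.5).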

13 Reducing Columns: Rank vs. Dimensionality
How many dimensions are there in your data? What is its true rank? Example: degrees of freedom
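For example, a dataset can have three columns yet only two degrees of freedom if one column is a linear combination of the others. A quick NumPy illustration (the coefficients are made up):

import numpy as np

x = np.random.rand(100)
y = np.random.rand(100)
z = 2 * x - 0.5 * y                   # z adds no new information
data = np.column_stack([x, y, z])     # 3 dimensions (columns)...
print(np.linalg.matrix_rank(data))    # ...but true rank 2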

14 Flashback to Lecture 1: Data Definition
A typical dataset in visualization consists of n records: (r1, r2, r3, …, rn). Each record ri consists of m >= 1 observations or variables: (v1, v2, v3, …, vm). A variable may be either independent or dependent: an independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset), while a dependent variable (dv) is affected by variation in one or more associated independent variables (e.g., temperature in a region). Formal definition: ri = (iv1, iv2, iv3, …, iv_mi, dv1, dv2, dv3, …, dv_md), where m = mi + md.

15 Rank vs. Dimensionality
How many dimensions are there in your data? What is its true rank? Example: degrees of freedom

16 Dimension Reduction
Goal: find the smallest set of dimensions that effectively characterizes the parts of the dataset you care about. Two common techniques (Rajmonda): Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS).
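As a rough sketch of the first technique (assuming NumPy; the function name and interface are inventions for illustration), PCA can be computed from the singular value decomposition of the centered data matrix:

import numpy as np

def pca_project(data, k):
    # Center the data, then project it onto the top-k principal
    # directions (the leading right singular vectors).
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T        # n x k: k columns instead of m

Multi-Dimensional Scaling instead starts from pairwise distances and seeks a low-dimensional layout that preserves them.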

