Lecture 07: Dealing with Big Data


Lecture 07: Dealing with Big Data
October 5, 2015
SDS 235 Visual Analytics
Note: slide deck adapted from R. Chang

Announcements
- Solutions to Assignment 1 posted: Piazza > Resources
- Assignment 2 posted to course website: due October 14th
- Please respond to Piazza poll re: preferred lab style

Outline
- Dealing with high-dimensional data
- Real World Problem #1: Aggregation and Sampling (Jordan)
- Dimension reduction / projection (Rajmonda Caceres, MITLL)
- Real World Problem #2: Ellen Moorhouse, Women's Fund

Recap: Keim’s Visual Analytics Model Image source: Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.

Putting Keim’s VA Model in Context Knowledge Visual Mapping Visualization Mental Model Data Model Interaction Data

General Concept
If the data scale is truly too large, we can find ways to trim it:
- Reduce the number of rows
  - Subsampling
  - Clustering
- Or reduce the number of columns
  - Dimension reduction

Food for Thought
- How do we cut down the size while maintaining the general "characteristics" of the original data?
- How much can be trimmed?
- If we do analysis based on the trimmed data, does it still apply to the original (raw) data?

Reducing Rows: Sub-Sampling
- Goal: find a subset of the original dataset that exhibits the same (or similar) characteristics as the full dataset
- One simple approach: use random sampling
  - Simple random sampling
  - Systematic sampling
  - Etc.
- Key point: each element must have an equal, non-zero chance of being selected (e.g., selecting individuals from households)
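To make the two sampling schemes concrete, here is a minimal Python sketch using NumPy and pandas; the dataset, column names, and sample sizes are invented for the example rather than taken from the lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical dataset: 1,000,000 rows of two measurements
df = pd.DataFrame({
    "x": rng.normal(size=1_000_000),
    "y": rng.exponential(size=1_000_000),
})

# Simple random sampling: every row has an equal chance of selection
simple_sample = df.sample(n=10_000, random_state=42)

# Systematic sampling: take every k-th row after a random start
k = len(df) // 10_000
start = rng.integers(k)
systematic_sample = df.iloc[start::k]

# Compare the sample means against the full-data means
print("full data means:\n", df.mean())
print("simple random sample means:\n", simple_sample.mean())
print("systematic sample means:\n", systematic_sample.mean())
```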

Caveat
Whenever you look only at a subset of the data, you may be introducing sampling error

Measuring Sampling Error
- Assume that the population follows a normal distribution
- Further assume that the variability of the population is known (as measured by the standard deviation σ)
- Then the standard error of the sample mean is given by σ / √n, where n is the sample size
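As an illustrative check of the σ / √n formula (not part of the slides), the following sketch compares the theoretical standard error with the empirical spread of many simulated sample means; the chosen σ and n are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0      # assumed known population standard deviation
n = 100          # sample size

# Theoretical standard error of the sample mean
se_theoretical = sigma / np.sqrt(n)

# Empirical check: draw many samples of size n from a normal population
# and look at the spread of their sample means
sample_means = rng.normal(loc=0.0, scale=sigma, size=(10_000, n)).mean(axis=1)
se_empirical = sample_means.std()

print(f"theoretical SE = {se_theoretical:.4f}, empirical SE = {se_empirical:.4f}")
```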

Reducing Rows: Clustering
- Imagine subsampling as "cropping" your data
- What if instead we just want to change the "resolution"?
- Main idea: group similar data items together and represent them as a single entity
- Note the word "similar": clustering always requires a distance function (some way to compare elements)

Clustering Algorithm: K-means
- There are numerous clustering algorithms out there; here we look at one popular one: K-means
- Inputs:
  - K: the number of clusters
  - A distance function d(xi, xj)
- The algorithm alternates four steps: (1) pick K initial centroids, (2) assign each point to its nearest centroid, (3) recompute each centroid as the mean of its assigned points, (4) repeat steps (2)-(3) until the assignments stop changing
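Here is a minimal NumPy sketch of the four steps above, using Euclidean distance as the distance function; the function signature, variable names, and toy data are illustrative rather than the lecture's own implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means with Euclidean distance.

    X: (n_points, n_features) array; k: number of clusters.
    Returns (centroids, labels).
    """
    rng = np.random.default_rng(seed)

    # (1) Pick K initial centroids: here, K distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # (2) Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # (3) Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # (4) Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Usage on made-up 2D data
X = np.random.default_rng(1).normal(size=(300, 2))
centroids, labels = kmeans(X, k=3)
```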

Reducing Columns: Rank vs. Dimensionality
- How many dimensions are there in your data?
- What is its true rank?
- Example: degrees of freedom
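A tiny illustration of the distinction (invented for this write-up): a dataset can be recorded in three dimensions yet have only two degrees of freedom, which shows up as a matrix rank of 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 recorded dimensions, but only 2 degrees of freedom:
# the third column is a linear combination of the first two
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 2 * b])

print(X.shape)                    # (100, 3) -> 3 dimensions
print(np.linalg.matrix_rank(X))   # 2       -> true rank
```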

Flashback to Lecture 1: Data Definition
- A typical dataset in visualization consists of n records: (r1, r2, r3, …, rn)
- Each record ri consists of m >= 1 observations or variables: (v1, v2, v3, …, vm)
- A variable may be either independent or dependent:
  - An independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset)
  - A dependent variable (dv) is affected by a variation in one or more associated independent variables (e.g., temperature in a region)
- Formal definition: ri = (iv1, iv2, …, iv_mi, dv1, dv2, …, dv_md), where m = mi + md
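To make the iv/dv split concrete, here is a small hypothetical Python record; the weather-style field names are made up purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Independent variables: not affected by other variables in the record
    timestamp: str        # e.g., time in a time-series dataset
    station_id: str

    # Dependent variables: vary with the independent variables
    temperature_c: float
    humidity_pct: float

# Here m = 4 observations per record, with mi = 2 and md = 2
r1 = Record("2015-10-05T09:00", "north-field", 14.2, 71.0)
```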

Rank vs. Dimensionality
- How many dimensions are there in your data?
- What is its true rank?
- Example: degrees of freedom

Dimension Reduction
- Goal: find the smallest set of dimensions that effectively characterizes the parts of the dataset you care about
- Two common techniques (Rajmonda):
  - Principal Component Analysis (PCA)
  - Multi-Dimensional Scaling (MDS)
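As a quick illustration of PCA (the first of the two techniques listed above), here is a minimal scikit-learn sketch; the synthetic data and the choice of two components are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy high-dimensional data: 500 points in 10 dimensions whose variation
# is mostly driven by 2 underlying factors plus a little noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Project onto the 2 principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (500, 2)
print(pca.explained_variance_ratio_)    # fraction of variance per component
```

An analogous sketch for MDS could use sklearn.manifold.MDS, which works from pairwise distances between points rather than from variance directions.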