Quantitative Methods of Data Analysis
Natalia Zakharova, TA
Bill Menke, Instructor

Goals
Make you comfortable with the analysis of numerical data through practice
Teach you a set of widely-applicable data analysis techniques
Provide a setting in which you can creatively apply what you've learned to a project of your own choosing

Syllabus
September 03 (W)  Intro; issues associated with working with data
September 08 (M)  Issues associated with coding; MatLab tutorial
September 10 (W)  Linear algebra review; least squares
September 15 (M)  Probability and uncertainty
September 17 (W)  Variance and other measures of error; bootstraps
September 22 (M)  The principle of maximum likelihood
September 24 (W)  Advanced topics in least squares, Part 1
September 29 (M)  Advanced topics in least squares, Part 2
October 01 (W)  Interpolation and splines
October 06 (M)  Hypothesis testing
October 08 (W)  Linear systems, impulse response & convolutions
October 13 (M)  Filter theory
October 15 (W)  Applications of filters
October 20 (M)  Midterm exam
October 22 (W)  Orthogonal functions; Fourier series
October 27 (M)  Basic properties of Fourier transforms
October 29 (W)  Fourier transforms and convolutions
November 03 (M)  Sampling theory
November 05 (W)  Spectral analysis; power spectra
November 12 (W)  Statistics of spectra; practical considerations
November 17 (M)  Wavelet analysis
November 19 (W)  Empirical orthogonal functions
December 01 (M)  Adjoint methods
December 03 (W)  Class project presentations
December 08 (M)  Review for final

Homework
Assigned on a weekly basis
Due Mondays at the start of class
Due in hardcopy; arrange it so that the numbered problems (typically 1, 2, and 3) can be physically separated from one another
Advice: start early; seek assistance of classmates, the TA, and me (in that order)

Project
Substantial and creative analysis of a dataset of your choice
Chance to apply a wide suite of techniques learned in this class in a realistic setting
Might (or might not) be part of your research; might (or might not) lead to a paper

Project Dates
September 17 (W)  1-page abstract due; then schedule a brief meeting with me
November 05 (W)  Progress report due
December 03 (W)  Brief presentation of results in class
December 08 (M)  Hardcopy of Project Report due at start of class

Grading
Homework  20%
Midterm   15%
Final     15%
Project   50%
You should read my grading policy:

Software
Excel: point-and-click environment; little overhead for quick analysis; hard to automate repetitive tasks; hard to document operations; columns and rows of data; cell-oriented formulas
MatLab: scripting environment; some overhead, so less quick; easy to automate repetitive tasks; easy to document operations; vectors and matrices of data; general programming environment
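
For instance, a few MatLab lines (the file name and variable names are made up for illustration) that load a two-column text file, plot it, and save the figure; re-running the script repeats every step, which is exactly what is hard to do in a point-and-click environment:

% a hypothetical example: load a two-column (time, temperature) text file,
% plot it, and save the figure; re-running the script repeats every step
D = load('temperature.txt');            % assumed plain two-column text file
t = D(:,1);                             % first column: time
T = D(:,2);                             % second column: temperature
plot(t, T, 'k-');                       % temperature vs. time
xlabel('time'); ylabel('temperature');
print('-dpng', 'temperature_plot.png'); % save the figure to a file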

Survey
1. Put an x in this triangle (corners: THEORY, DATA ANALYSIS, LAB & FIELD WORK) that represents your expectation for your career.
2. Have you had a course that included (check all that apply): matrices & linear algebra, probability and statistics, vector calculus, computer programming?
3. Calculate this sum w/o electronic assistance: = ______
4. Plot the function y(x) = 1 + 2x + 3x^2 on this graph.
5. Estimate the number of minutes it would take you to walk from Morningside to Lamont: ______.
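
A minimal MatLab check of question 4 (the x range is my own choice):

% evaluate and plot y(x) = 1 + 2x + 3x^2 over an arbitrary x range
x = -2 : 0.1 : 2;             % densely sampled x values (range is my choice)
y = 1 + 2*x + 3*x.^2;         % element-wise power with .^
plot(x, y, 'k-');
xlabel('x'); ylabel('y(x)'); title('y(x) = 1 + 2x + 3x^2');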

The Nature of Data
Please read Doug Martinson's Part 1: Fundamentals (available under Courseworks).

Key Ideas
How data were estimated is important!
Data are never perfect; they inherently contain error.
You analyze data to learn something specific, not to show virtuosity with some analysis method!
A scientific objective must be clearly articulated; the analysis method must fit the objective.

Data Lingo: Discrete vs. Continuous
Data is always discrete (a series of individual numbers, such as a sequence of readings off of a thermometer), even though the process being observed may be continuous (the temperature of the room, which varies continuously as a function of time and space).

Sequential vs. Non-sequential Data
Data often has some sort of natural organization, the most common of which is sequential, e.g.:
the temperature of this room, measured every fifteen minutes
the temperature right now along the hallway, measured every thirty centimeters
Such data is often called a time-series, even in the case where the organization is not based on time but on distance (or whatever) …

Multivariate Data
While a time-series is the data analog of a function of one variable, e.g. f(t), a multivariate dataset is the data analog of a function of two or more variables, e.g. f(x,y).
My photo, at left, is one such multivariate dataset, because the digital camera that captured the image was measuring light intensity as a function of two independent spatial variables. There are 300,000 individual measurements in this (relatively low resolution) image.

Precision and Dynamic Range
Any measurement is made only to a finite number of decimal places, or precision. It can make a big difference whether the measurement is to one decimal place (1.1) or to seven.
A sequence of values will vary in size. The dynamic range quantifies the ratio of the largest value to the smallest (non-zero) value*. It can make a big difference whether all the data span a narrow range or many orders of magnitude.
* See Doug's notes for the exact definition, which involves a logarithm.
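
A small sketch of the dynamic-range idea in MatLab (the data values are invented, and the log-base-10 form is an assumption; see Doug's notes for the exact definition):

% dynamic range of a data vector d: largest over smallest non-zero
% absolute value, often quoted in orders of magnitude via log10
d = [0.002, 3.1, 45, 0, 1200];          % hypothetical data values
a = abs(d(d ~= 0));                     % ignore exact zeros
ratio  = max(a) / min(a);               % largest / smallest non-zero value
orders = log10(ratio);                  % orders of magnitude spanned
fprintf('ratio = %g, about %.1f orders of magnitude\n', ratio, orders);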

Vectors and Matrices
A list of measurements (d1, d2, d3, …) can be organized very effectively into a vector, d.
A table of data
        site 1  site 2  site 3
time 1   d11     d12     d13
time 2   d21     d22     d23
time 3   d31     d32     d33
can be organized very effectively into a matrix, D.
As we will see during the semester, the algebra of vector and matrix arithmetic can then be used very effectively to implement many different kinds of data analysis.
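
In MatLab this organization is direct (the numbers below are placeholders):

% a list of measurements as a column vector d; d(i) is the i-th measurement
d = [11.2; 10.8; 12.1];

% a time-by-site table as a matrix D; D(i,j) is time i, site j
D = [ 11.2  10.8  12.1 ;    % time 1 at sites 1, 2, 3
      11.5  10.9  12.4 ;    % time 2
      11.9  11.3  12.6 ];   % time 3
siteMeans = mean(D, 1);     % one expression operates on all sites at once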

Precision* vs. Accuracy
Precision: repeatability of the measurement. What is the scatter if you make the measurement many times?
Accuracy: difference between the center of a group of scattered measurements and the true value of what you're trying to measure.
* Note the different sense of the word precision than 3 slides ago.
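
A toy simulation (all numbers invented) that separates the two ideas: the scatter of repeated measurements reflects precision, while the offset of their center from the true value reflects accuracy:

% simulate 1000 repeated measurements of a quantity whose true value is 10
truevalue = 10;
bias      = 0.5;                          % systematic error -> poor accuracy
spread    = 0.2;                          % random error     -> finite precision
m = truevalue + bias + spread*randn(1000,1);
fprintf('precision (scatter, std):  %.3f\n', std(m));
fprintf('accuracy  (mean - true):   %.3f\n', mean(m) - truevalue);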

Signal-to-Noise Ratio
Compare the error in the data to the size of the data.
The size of the error is most meaningful when compared to the size of the data.
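
A sketch of one simple way to quantify this, using made-up signal and noise:

% hypothetical example: data = signal + noise; compare their sizes
t      = (0:999)';
signal = sin(2*pi*t/100);             % a smooth "true" variation
noise  = 0.3*randn(size(t));          % measurement error
d      = signal + noise;              % what actually gets recorded
snr    = std(signal) / std(noise);    % one simple signal-to-noise measure
fprintf('signal-to-noise ratio: %.2f\n', snr);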

Some examples of data … and the techniques used to analyze them

Biggest peak has a period of exactly one year … makes sense, it's the annual cycle in river flow. But what about these smaller peaks?
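
Peaks like these typically come from a power spectrum; here is a minimal sketch of how one might be computed for a daily series (the data below are synthetic, not the river-flow record in the figure):

% synthetic daily series with an annual cycle plus noise (not the real data)
N = 10*365;                              % ten years of daily samples
t = (0:N-1)';                            % time in days
d = 5 + 2*cos(2*pi*t/365.25) + randn(N,1);
d = d - mean(d);                         % remove the mean before transforming
P = abs(fft(d)).^2;                      % raw power spectrum (periodogram)
f = (0:N-1)'/N;                          % frequency in cycles per day
plot(f(2:floor(N/2)), P(2:floor(N/2)));  % positive frequencies only
xlabel('frequency (cycles per day)'); ylabel('power');
% the largest peak sits near f = 1/365.25, i.e. a period of one year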

Daily Temperature at LaGuardia Airport (LGA), New York, NY
I'm suspicious of these 'exactly zero' values. Missing data, defaulted to zero, maybe?

LaGuardia Airport (New York, NY): Temperature vs. Precipitation

Mean precipitation in a given one-degree temperature range
Related to the conditional probability that it will rain, given the temperature?
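
A sketch of the binning computation behind such a plot (variable names, bin edges, and the placeholder data are my own):

% average precipitation within one-degree temperature bins
% (T and P are hypothetical placeholders for the LGA daily values)
T = 30 + 60*rand(1000,1);                 % fake daily temperatures, deg F
P = max(0, randn(1000,1));                % fake daily precipitation amounts
edges = floor(min(T)) : 1 : ceil(max(T)); % one-degree bin edges
meanP = nan(length(edges)-1, 1);
for k = 1:length(edges)-1
    in = (T >= edges(k)) & (T < edges(k+1));   % days falling in bin k
    if any(in)
        meanP(k) = mean(P(in));                % mean precip in that bin
    end
end
plot(edges(1:end-1)+0.5, meanP, 'o-');
xlabel('temperature bin (deg F)'); ylabel('mean precipitation');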

Ground motion here at Lamont in August
Here's an earthquake. But what are these little things?

A simple filtering technique to accentuate the longer, clear periods in the data
Here's an earthquake. Little similar-looking earthquakes, and lots of them!
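
One very simple filter of this general kind is a running mean, which suppresses short-period wiggles; a sketch (not necessarily the filter used for this figure):

% smooth a record d with an L-point running mean (a boxcar filter)
d = cumsum(randn(5000,1));        % fake wandering signal in place of real data
L = 21;                           % filter length, in samples
w = ones(L,1)/L;                  % boxcar weights that sum to one
ds = conv(d, w, 'same');          % running mean via convolution
plot(1:length(d), d, 1:length(d), ds);
legend('raw', 'smoothed');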

A little case study: correlation of δ15N and dust in the Vostok Ice Core

ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/vostok/dustnat.txt
Ice age (GT4)    Dust Conc (ppm)
(522 lines of data)
A little Googling indicates that the dust data are readily available on the web
Given as age vs. dust concentration, assuming the GT4 age model (which relates depth in the ice core to age)
About 1 sample per few hundred years at the top of the core, declining to one sample per few thousand years at the bottom
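
Reading such a two-column text file into MatLab takes only a few lines; the single header line assumed here is a guess about the actual file layout:

% read the (age, dust concentration) table; the file name and the single
% header line are assumptions about the actual file layout
fid = fopen('dustnat.txt');
C   = textscan(fid, '%f %f', 'HeaderLines', 1);
fclose(fid);
age  = C{1};                      % ice age, GT4 chronology
dust = C{2};                      % dust concentration, ppm
plot(age, dust); xlabel('age'); ylabel('dust (ppm)');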

ftp://sidads.colorado.edu/pub/DATASETS/AGDC/bender_nsidc_0107
Vos_O2-N2_isotope_data_all.txt
Core   Depth   δ15N   δ18Oatm   δO2/N2
5G     …       (572 lines of data)
Vostok_EGT20_chronology.txt
Depth   EGT20 ice age   EGT20 gas age
(3200 lines of data)
δ15N data also readily available on the web
Given as depth vs. δ15N
Roughly the same number of lines of data (so presumably similar age sampling)
EGT20 chronology given; presumably different (but by how much?) from GT4
Age of air in ice can be as much as 4000 years younger than age of ice itself

Decision: Compare data at
the same depth in ice (probably not so sensible), or
the same age (probably more sensible)
Need then to convert:
δ15N depth to δ15N (gas) age (we've found the conversion table) (some sort of interpolation probably necessary; how much error will that introduce?)
dust GT4 age to EGT20 age (we need to look for the conversion table)
Need to deal with the likely problem that the sampled ages of the dust will not match the ages of the δ15N (how much error will interpolation introduce?)
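
A sketch of the age-matching step using interp1 (the records below are placeholders; in practice they come from the files above, with the dust ages already converted to the EGT20 chronology):

% placeholder records; in practice these come from the files above,
% with the dust ages already converted from GT4 to EGT20
dustAge = (0:500:200000)';                           % dust sample ages, years
dust    = exp(-dustAge/50000) .* (1 + 0.2*randn(size(dustAge)));
gasAge  = sort(200000*rand(572,1));                  % delta-15N gas ages
% interpolate the dust record onto the delta-15N gas ages for comparison
dustOnGasAge = interp1(dustAge, dust, gasAge, 'linear');
% ages outside the dust range come back as NaN; count them as a sanity check
nMissing = sum(isnan(dustOnGasAge));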

Some Thoughts on Working With Data

Start Simple!
Don't launch into time-consuming analyses until you've spent time …
… gathering background information
… learning enough about a database to have some sense that it really contains the information that you want
… retrieving a small subset of the data and looking them over carefully

Look at your data!
Look at the numerical values (e.g., in spreadsheet format)
Graph them in a variety of ways
You'll pick up on all sorts of useful (and scary) things

Where do I get data?
You collect it yourself through some sort of electronic instrumentation
You compile it manually from written sources
A colleague gives it to you (e.g. e-mails you a file)
You retrieve it from some sort of data archive (e.g. one accessible through the web)

Don't Be Afraid to Ask …
… technicians familiar with the instrumentation
… authors of papers that you have read
… your colleagues
… data professionals at mission agencies

Learn how the data were collected
What was really measured
Hidden assumptions and conversion factors
The number of steps between the measurement and the data as it appears in the database

Who are the people who collected the data?
Who performed the actual measurements?
How many different people, over what period of time?
What kind of quality control was performed?
Are you accessing the original database or somebody's copy?

What are the data's limitations?
How much is nonsense? (Typically 5% of data records in compilations have errors)
What is the measurement accuracy? (Are formal error estimates given?)
Perform sanity checks on both the data and your understanding of it.
Compare similar data from different databases. Identify and understand differences.
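
A few quick MatLab sanity checks on a freshly loaded data vector (the values and the -999 missing-data flag are hypothetical):

% quick sanity checks on a freshly loaded data vector d (placeholder values)
d = [2.1, 3.4, NaN, 0, -999, 5.2];
fprintf('n = %d, min = %g, max = %g\n', numel(d), min(d), max(d));
fprintf('NaNs: %d, exact zeros: %d, -999 flags: %d\n', ...
        sum(isnan(d)), sum(d == 0), sum(d == -999));
hist(d, 20);                        % a histogram reveals outliers quickly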

Data analysis can be messy
Many files, some of them very large
Many steps in the analysis process, including rather tedious data re-formatting
Possibly many versions of an analysis, exploring different choices in the analysis
Lots of output … tables, graphs, etc.

Organization is very important
Well-designed directory (folder) structure
Disciplined use of filenames
Ongoing documentation of the analysis process, and especially of the big picture

Quandary
How much to keep …
How much to delete …

Advice #1
Always keep a copy of the unaltered raw data (and make sure that you identify it as such).

Advice #2
Always keep anything that you type in manually (notes, MatLab scripts, references, etc.), on the theory that you couldn't possibly type fast enough to consume significant storage space.

Advice #3
Whenever possible, design and use a single script that recreates a sensible part of your work. You can use it to recreate anything you've deleted, and it also documents what you've done.
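
For example, a single driver script (all file names hypothetical) that re-does the whole chain from raw data to final figure:

% redo_analysis.m -- recreates the analysis from the raw data
% (all file names here are hypothetical examples)
D  = load('raw_data.txt');              % step 1: read the unaltered raw data
d  = D(:,2) - mean(D(:,2));             % step 2: remove the mean
w  = ones(11,1)/11;
ds = conv(d, w, 'same');                % step 3: smooth with a running mean
plot(D(:,1), ds);                       % step 4: make the figure
print('-dpng', 'final_figure.png');     % step 5: save it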

Advice #4
If you do delete a large chunk of your work, leave the top directory in place and put a note in it to yourself explaining what you've deleted … and why.