Quantitative Methods of Data Analysis Natalia Zakharova, TA Bill Menke, Instructor
Goals Make you comfortable with the analysis of numerical data through practice Teach you a set of widely-applicable data analysis techniques Provide a setting in which you can creatively apply what you’ve learned to a project of your own choosing
SYLLABUS
September 03 (W) Intro; Issues associated with working with data
September 08 (M) Issues associated with coding; MatLab tutorial
September 10 (W) Linear Algebra Review; Least Squares
September 15 (M) Probability and Uncertainty
September 17 (W) Variance and other measures of error, bootstraps
September 22 (M) The principle of maximum likelihood
September 24 (W) Advanced topics in Least-Squares, Part 1
September 29 (M) Advanced topics in Least-Squares, Part 2
October 01 (W) Interpolation and splines
October 06 (M) Hypothesis testing
October 08 (W) Linear systems, impulse response & convolutions
October 13 (M) Filter Theory
October 15 (W) Applications of Filters
October 20 (M) Midterm Exam
October 22 (W) Orthogonal functions; Fourier series
October 27 (M) Basic properties of Fourier transforms
October 29 (W) Fourier transforms and convolutions
November 03 (M) Sampling theory
November 05 (W) Spectral analysis; power spectra
November 12 (W) Statistics of spectra; practical considerations
November 17 (M) Wavelet analysis
November 19 (W) Empirical Orthogonal Functions
December 01 (M) Adjoint methods
December 03 (W) Class project presentations
December 08 (M) Review for Final
Homework
Assigned on a weekly basis
Due Mondays at the start of class
Due in hardcopy; arrange so that the numbered problems (typically 1, 2, and 3) can be physically separated from one another
Advice: start early; seek assistance from classmates, the TA, and me (in that order)
Project
A substantial and creative analysis of a dataset of your choice
A chance to apply a wide suite of techniques learned in this class in a realistic setting
Might (or might not) be part of your research; might (or might not) lead to a paper
Project Dates
September 17 (W) 1-page abstract due; then schedule a brief meeting with me
November 05 (W) Progress report due
December 03 (W) Brief presentation of results in class
December 08 (M) Hardcopy of Project Report due at start of class
Grading
Homework 20%
Midterm 15%
Final 15%
Project 50%
You should read my grading policy.
Software
Excel: point-and-click environment; little overhead for quick analysis; hard to automate repetitive tasks; hard to document operations; columns and rows of data; cell-oriented formulas
MatLab: scripting environment; some overhead, so less quick; easy to automate repetitive tasks; easy to document operations; vectors and matrices of data; general programming environment
Survey
1. Put an x in this triangle that represents your expectation for your career: THEORY / DATA ANALYSIS / LAB & FIELD WORK
2. Have you had a course that included (check all that apply): matrices & linear algebra; probability and statistics; vector calculus; computer programming
3. Calculate this sum w/o electronic assistance: = ______
4. Plot the function y(x) = 1 + 2x + 3x^2 on this graph:
5. Estimate the number of minutes it would take you to walk from Morningside to Lamont: ______
The Nature of Data please read Doug Martinson’s Part 1: Fundamentals (available under Courseworks)
Key Ideas
How data were estimated is important!
Data are never perfect; they inherently contain error.
You analyze data to learn something specific, not to show virtuosity with some analysis method!
A scientific objective must be clearly articulated; the analysis method must fit the objective.
Data Lingo: Discrete vs. Continuous
Data are always discrete (a series of individual numbers, such as a sequence of readings off of a thermometer), even though the process being observed may be continuous (the temperature of the room, which varies continuously as a function of time and space).
Sequential vs. Non-sequential Data
Data often have some sort of natural organization, the most common of which is sequential, e.g.:
temperature of this room, measured every fifteen minutes
temperature right now along the hallway, measured every thirty centimeters
Such data are often called a time-series, even when the organization is based not on time but on distance (or whatever) …
Multivariate Data while a time-series is the data analog of a function of one variable, e.g. f(t) a multivariate dataset is the data analog of a function of two or more variables, e.g. f(x,y) My photo, at left, is one such multivariate dataset, because the digital camera that captured the image was measuring light intensity as a function of two independent spatial variables. There are 300,000 individual measurements in this (relatively low resolution) image.
Precision and Dynamic Range
Any measurement is made only to a finite number of decimal places, or precision. It can make a big difference whether the measurement is to one decimal place (1.1) or to seven.
A sequence of values will vary in size. The dynamic range quantifies the ratio of the largest value to the smallest (non-zero) value*. It can make a big difference whether all the data lie within a narrow range or span many orders of magnitude.
* See Doug's notes for the exact definition, which involves a logarithm
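As a concrete sketch (in Python/NumPy rather than MatLab, and using made-up numbers), one common convention expresses dynamic range as a base-10 logarithm of the ratio of the largest to smallest non-zero absolute value; see Doug's notes for the exact definition used in this course:

```python
import numpy as np

# Hypothetical measurements spanning several orders of magnitude
d = np.array([0.002, 0.15, 3.7, 42.0, 981.0])

nonzero = np.abs(d[d != 0])
# Dynamic range as orders of magnitude between the extreme values
# (one common log-based convention; an assumption, not Doug's exact formula)
dynamic_range = np.log10(nonzero.max() / nonzero.min())
```

Here the data span roughly 5.7 orders of magnitude, which already hints that a plot on a linear axis would hide the small values.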
Vectors and Matrices
A list of measurements (d1, d2, d3, ...) can be organized very effectively into a vector, d. A table of data

         site 1   site 2   site 3
time 1   d11      d12      d13
time 2   d21      d22      d23
time 3   d31      d32      d33

can be organized very effectively into a matrix, D. As we will see during the semester, the algebra of vector and matrix arithmetic can then be used very effectively to implement many different kinds of data analysis.
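A minimal illustration of the idea, in Python/NumPy rather than MatLab and with hypothetical temperature readings: once the table lives in a matrix, per-site and per-time summaries become one-line operations.

```python
import numpy as np

# Hypothetical table: rows are times, columns are sites
D = np.array([[11.2, 13.1, 12.4],   # time 1
              [11.8, 13.5, 12.9],   # time 2
              [12.1, 13.0, 12.6]])  # time 3

# Matrix organization makes summaries trivial:
site_means = D.mean(axis=0)   # average reading at each site
time_means = D.mean(axis=1)   # average reading at each time
```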
Precision* vs. Accuracy
Precision: repeatability of the measurement. What is the scatter if you make the measurement many times?
Accuracy: difference between the center of a group of scattered measurements and the true value of what you're trying to measure.
* Note the different sense of the word precision than three slides ago.
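The distinction is easy to see in a simulation (a Python/NumPy sketch with an invented instrument, not a real dataset): an instrument can scatter very little yet still be systematically wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# Simulated instrument: biased high by 0.5 (poor accuracy)
# but with small random scatter of std 0.1 (good precision)
measurements = true_value + 0.5 + 0.1 * rng.standard_normal(10000)

precision = measurements.std()                      # repeatability (scatter)
accuracy_error = measurements.mean() - true_value   # offset from the truth
```

Repeating the measurement reduces the scatter of the mean but does nothing about the bias.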
Signal-to-Noise Ratio
The size of the error in the data is most meaningful when compared to the size of the data.
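One common way to quantify this (a Python/NumPy sketch on synthetic data; the RMS-ratio convention is an assumption, as other definitions exist) is the ratio of the root-mean-square signal to the root-mean-square noise:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 1000)

signal = 5.0 * np.sin(2 * np.pi * t)        # "data" of amplitude 5
noise = 0.5 * rng.standard_normal(t.size)   # "error" of std 0.5

# Signal-to-noise ratio as RMS(signal) / RMS(noise)
snr = np.sqrt(np.mean(signal**2)) / np.sqrt(np.mean(noise**2))
```

Here the SNR is about 7: the noise is clearly visible but does not swamp the signal.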
some examples of data … and techniques used to analyze it
Biggest peak has a period of exactly one year … makes sense, it’s the annual cycle in river flow But what about these smaller peaks?
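A standard tool for spotting such peaks is the periodogram, which we will develop later in the course. As a preview, here is a minimal Python/NumPy sketch on a synthetic daily "river flow" series (not the actual data): the annual cycle stands out as the dominant spectral peak.

```python
import numpy as np

# Synthetic river flow: annual cycle plus noise, sampled daily for 10 years
rng = np.random.default_rng(2)
n_days = 3650
t = np.arange(n_days)
flow = 100 + 30 * np.sin(2 * np.pi * t / 365.25) + 5 * rng.standard_normal(n_days)

# Periodogram via the FFT; frequencies are in cycles per day
spec = np.abs(np.fft.rfft(flow - flow.mean()))**2
freq = np.fft.rfftfreq(n_days, d=1.0)

peak_period_days = 1.0 / freq[np.argmax(spec)]  # close to 365 days
```

Smaller peaks in a real spectrum may be harmonics of the annual cycle, other physical processes, or artifacts; telling these apart is part of the job.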
Daily Temperature at Laguardia Airport (LGA), New York, NY I’m suspicious of these ‘exactly zero’ values. Missing data, defaulted to zero, maybe?
Laguardia Airport (New York, NY) Temperature vs. Precipitation
Mean precip in a given one-degree temperature range Related to: Conditional probability that it will rain, given the temperature?
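The binning behind such a plot can be sketched in a few lines (Python/NumPy; the temperature and precipitation arrays below are invented, not the LGA data): group the days by one-degree temperature bin and average the precipitation within each bin.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily records: temperature (deg C) and precipitation (mm)
temp = rng.uniform(-5, 35, size=5000)
precip = np.maximum(0.0, rng.standard_normal(5000) + 0.05 * temp)

# Mean precipitation within each one-degree temperature bin
edges = np.arange(-5, 36, 1.0)            # 40 one-degree bins
bin_index = np.digitize(temp, edges) - 1
mean_precip = np.array([precip[bin_index == i].mean()
                        for i in range(len(edges) - 1)])
```

Each bin mean estimates the expected precipitation conditional on temperature falling in that bin, which is how the plot connects to conditional probability.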
Ground motion here at Lamont on August Here’s an earthquake But what are these little things?
A simple filtering technique to accentuate the longer, clear periods in the data Here’s an earthquake Little similar looking earthquakes and lots of them!
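One of the simplest such filters is a running mean (the slide does not say which filter was actually used, so the boxcar below is an assumed stand-in, sketched in Python/NumPy on synthetic data): averaging over a window suppresses short-period wiggles while letting the longer-period signal through.

```python
import numpy as np

def running_mean(x, width):
    """Boxcar smoother: suppresses short-period noise, keeps long periods.
    An assumed, minimal stand-in for the filter used on the slide."""
    window = np.ones(width) / width
    return np.convolve(x, window, mode='same')

# Synthetic "ground motion": long-period signal buried in short-period noise
rng = np.random.default_rng(4)
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 200) + rng.standard_normal(t.size)

smoothed = running_mean(x, 51)   # 51-sample window, shorter than the signal period
```

We will treat filter design properly in the filter-theory lectures; the point here is that even a crude filter can make small events pop out of the noise.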
A little case study: correlation of N-15 and dust in the Vostok Ice Core
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/vostok/dustnat.txt
Columns: Ice age (GT4), Dust Conc (ppm) (522 lines of data)
A little Googling indicates that the dust data are readily available on the web
Given as age vs. dust concentration, assuming the GT4 age model (which relates depth in the ice core to age)
About 1 sample per few hundred years at the top of the core, declining to one sample per few thousand years at the bottom
ftp://sidads.colorado.edu/pub/DATASETS/AGDC/bender_nsidc_0107
Vos_O2-N2_isotope_data_all.txt — columns: Core (e.g. 5G), Depth, N15, O18atm, O2/N2 (572 lines of data)
Vostok_EGT20_chronology.txt — columns: Depth, EGT20 ice age, EGT20 gas age (3200 lines of data)
N15 data also readily available on the web
Given as depth vs. N15
Roughly the same number of lines of data (so presumably similar age sampling)
EGT20 chronology given; presumably different (but by how much?) from GT4
Age of air in ice can be as much as 4000 years younger than the age of the ice itself
Decision: compare data at the same depth in the ice (probably not so sensible) or at the same age (probably more sensible).
Need then to convert N15 depth to N15 (gas) age; we've found the conversion table. Some sort of interpolation is probably necessary: how much error will that introduce?
Need to convert dust GT4 age to EGT20 age; we need to look for the conversion table.
Need to deal with the likely problem that the sampled ages of the dust will not match the ages of the N15 (how much error will interpolation introduce?)
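The interpolation step can be sketched as follows (Python/NumPy; all ages and values below are invented placeholders, not the actual Vostok numbers, and the GT4-to-EGT20 conversion is omitted): resample the dust series onto the N15 gas ages so the two series can be compared point-for-point.

```python
import numpy as np

# Hypothetical series: dust on GT4 ages, N15 on EGT20 gas ages
dust_age = np.array([0., 500., 1200., 2100., 3300.])   # years BP (GT4)
dust = np.array([0.02, 0.05, 0.30, 0.15, 0.08])        # ppm

n15_age = np.array([100., 800., 1700., 2900.])         # years BP (EGT20 gas)
n15 = np.array([0.45, 0.43, 0.48, 0.41])               # per mil

# Linear interpolation of dust onto the N15 ages; real work would also
# need the GT4-to-EGT20 age conversion and an interpolation-error estimate
dust_on_n15_ages = np.interp(n15_age, dust_age, dust)
```

With both series on a common age axis, a correlation coefficient or scatter plot becomes meaningful; the interpolation error then has to be folded into any claim about how strong the correlation is.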
Some Thoughts on Working With Data
Start Simple ! Don’t launch into time-consuming analyses until you’ve spent time… … gathering background information... learning enough about a database to have some sense that it really contains the information that you want … retrieving a small subset of the data and looking them over carefully
Look at your data! Look at the numerical values (e.g., in spreadsheet format); graph it in a variety of ways. You'll pick up on all sorts of useful (and scary) things.
Where do I get data? You collect it yourself through some sort of electronic instrumentation. You compile it manually from written sources. A colleague gives it to you (e.g. e-mails you a file). You retrieve it from some sort of data archive (e.g. accessible through the web).
Don’t Be Afraid to Ask … … technicians familiar with instrumentation … authors of paper that you have read … your colleagues... data professionals at mission agencies
Learn how the data were collected: What was really measured? Hidden assumptions and conversion factors. The number of steps between the measurement and the data as they appear in the database.
Who are the people who collected the data? Who performed the actual measurements? How many different people, over what period of time? What kind of quality control was performed? Are you accessing the original database or somebody's copy?
What are the data’s limitations? How much is nonsense? (Typically 5% of data records in compilations have errors) What is the measurement accuracy? (Are formal error estimates given?) Perform sanity checks on both the data and your understanding of it. Compare similar data from different databases. Identify and understand differences.
data analysis can be messy Many files, some of them very large Many steps in the analysis process including rather tedious data re-formatting Possibly many versions of an analysis exploring different choices in analysis Lots of output … tables, graphs, etc.
Organization very important Well-designed directory (folder) structure Disciplined use of filenames Ongoing documentation of the analysis process, and especially of the big picture
Quandary How much to keep … How much to delete …
Advice #1 Always keep a copy of the unaltered raw data (and make sure that you identify it as such)
Advice #2 Always keep anything that you type in manually (notes, MatLab scripts, references, etc.), on the theory that you couldn't possibly type fast enough to consume significant storage space.
Advice #3 Whenever possible, design and use a single script that recreates a sensible part of your work. You can use it to recreate anything you've deleted, and it also documents what you've done.
Advice #4 If you do delete a large chunk of your work, leave the top directory and put in a note to yourself explaining what you’ve deleted … and why