Environmental Data Analysis with MatLab

Environmental Data Analysis with MatLab
Lecture 2: Looking at Data
Our experience is that many students don't bother to look at data; they merely read it into data-processing software and expect the right answer to fall out. They need to be taught that they, themselves, can make useful inferences by carefully looking at data.

SYLLABUS
Lecture 01 Using MatLab
Lecture 02 Looking At Data
Lecture 03 Probability and Measurement Error
Lecture 04 Multivariate Distributions
Lecture 05 Linear Models
Lecture 06 The Principle of Least Squares
Lecture 07 Prior Information
Lecture 08 Solving Generalized Least Squares Problems
Lecture 09 Fourier Series
Lecture 10 Complex Fourier Series
Lecture 11 Lessons Learned from the Fourier Transform
Lecture 12 Power Spectra
Lecture 13 Filter Theory
Lecture 14 Applications of Filters
Lecture 15 Factor Analysis
Lecture 16 Orthogonal Functions
Lecture 17 Covariance and Autocorrelation
Lecture 18 Cross-correlation
Lecture 19 Smoothing, Correlation and Spectra
Lecture 20 Coherence; Tapering and Spectral Analysis
Lecture 21 Interpolation
Lecture 22 Hypothesis Testing
Lecture 23 Hypothesis Testing continued; F-Tests
Lecture 24 Confidence Limits of Spectra, Bootstraps

Purpose of the lecture: get you started looking critically at data.
You should stress that students need to develop the habit of looking critically at their data before they start to analyze it in any detailed way. Many infamous data analysis mistakes and nonsensical interpretations of the past could have been avoided if people had taken the time to look carefully at the data before rushing into a complicated analysis of it.

Objectives when taking a first look at data:
Understand the general character of the dataset.
Understand the general behavior of individual parameters.
Detect obvious problems with the data.

Tools for looking at data covered in this lecture: reality checks, time plots, histograms, rate information, scatter plots.

Black Rock Forest Temperature
I downloaded the weather station data from the International Research Institute (IRI) for Climate and Society at Lamont-Doherty Earth Observatory, which is the data center used by the Black Rock Forest Consortium for its environmental data. About 20 parameters were available, but I downloaded only hourly averages of temperature. My original file, brf_raw.txt, has time in a format that I thought would be hard to work with, so I wrote a MatLab script, brf_convert.m, that converted it into time in days, and wrote the results into the file that I gave you. Stress the importance of maintaining a narrative of where a dataset came from and what has been done to it.

Format conversion
calendar date/time: 0100-0159, 2 Jan 1997
days from start of first year of data: 1.042
A sequential time variable is needed for data analysis, but format conversions provide an opportunity for error to creep into the dataset.
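
The conversion on this slide can be sketched in a few lines. This is not the brf_convert.m script mentioned earlier, whose contents are not shown; it is a minimal illustration for a single sample, and the variable names (yr0, tstart, tsample, tdays) are assumptions made up for the example.

```matlab
% Hypothetical example: convert one calendar date/time into
% "days from the start of the first year of data".
yr0 = 1997;                               % first year of data
tstart = datenum(yr0, 1, 1, 0, 0, 0);     % 00:00 on 1 Jan of the first year
tsample = datenum(1997, 1, 2, 1, 0, 0);   % 01:00 on 2 Jan 1997
tdays = tsample - tstart;                 % elapsed time in days
fprintf('%.3f\n', tdays);                 % 1 day plus 1 hour, i.e. 1 + 1/24
```

Checking a converted value by hand like this (one day plus one hour should be close to 1.042) is itself a small reality check on the conversion step.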

Reality Checks: properties that your experience tells you the data must have. Check your expectations against the data.
You should ask students to offer reality checks, and make a list of them on the board.

Reality Checks
What do you expect the data to look like?
hourly measurements
thirteen years of data
location in New York (moderate climate)
Emphasize that you usually know something about a dataset even before looking at it.

Take a moment ... to sketch a plot of what you expect the data to look like.

Reality Checks
What do you expect the data to look like?
hourly measurements; thirteen years of data; location in New York (moderate climate)
time increments by 1/24 day per sample
about 24*365*13 = 113880 lines of data
temperatures in the -20 to +35 deg C range
diurnal and seasonal cycles

Does time increment by 1/24 day per sample? (1/24 = 0.0417)
D(1:5,:)
         0   17.2700
    0.0417   17.8500
    0.0833   18.4200
    0.1250   18.9400
    0.1667   19.2900
Yes.

Are there about 24*365*13 = 113880 lines of data?
length(D)
110430
Yes.
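
The first reality checks can also be scripted rather than read off the screen. The sketch below uses a synthetic one-year stand-in for the table D, because the real file is not reproduced here; the tolerances (1e-6 and 5 percent) and the names ok_dt, ok_len and frac_in_range are assumptions chosen for illustration.

```matlab
% Synthetic stand-in for the two-column table D (time in days, temperature).
N = 24*365;                                   % one year of hourly samples
t = (0:N-1)'/24;                              % time, days
d = 10 - 12*cos(2*pi*t/365) + 4*sin(2*pi*t);  % seasonal plus diurnal cycle, deg C
D = [t, d];

% Check 1: does time increment by 1/24 day per sample?
dt = diff(D(:,1));
ok_dt = all(abs(dt - 1/24) < 1e-6);

% Check 2: is the number of rows close to the expected number?
Nexpected = 24*365;
ok_len = abs(size(D,1) - Nexpected)/Nexpected < 0.05;

% Check 3: what fraction of the temperatures lie between -20 and +35 deg C?
frac_in_range = mean(D(:,2) >= -20 & D(:,2) <= 35);
```

For the real dataset, the counts quoted on the slide (110430 rows against an expected 113880) differ by about 3 percent, which is the kind of shortfall this second check is meant to expose.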

Temperatures in the -20 to +35 deg C range? Diurnal and seasonal cycles?
Have the students stare at the plot and try to answer the questions. You should prod them to identify the data drop-outs, too.

Temperatures in the -20 to +35 deg C range? Mostly. Diurnal and seasonal cycles? Certainly seasonal.
[Plot annotations: annual cycle, cold spikes, hot spike, data drop-outs, -20 to +35 range]
Have the students check that the cold parts of the seasonal cycle are in January. You should ask them what they would need to do to examine the diurnal cycle. (Answer: make an enlarged plot.)

Data Drop-outs
Common in datasets: the instrument wasn't working for a while ...
They take two forms: missing rows of the table, or data set to some default value (0, n/a, -999; all common).
Ask the class which default value is the best choice.
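
Both forms of drop-out can be searched for automatically. The following is a minimal sketch on synthetic data, not a procedure from the lecture; the gap threshold of 1.5/24 day and the two flag values tested are assumptions.

```matlab
% Synthetic hourly record containing both kinds of drop-out.
t = (0:239)'/24;                       % ten days of hourly times, in days
d = 15 + 5*sin(2*pi*t);                % smooth diurnal cycle, deg C
d(50:60) = 0;                          % drop-out recorded as the default value 0
d(100:105) = -999;                     % drop-out recorded as the default value -999
t(150:170) = []; d(150:170) = [];      % drop-out recorded as missing rows

% Form 1: missing rows appear as time steps much longer than 1/24 day.
dt = diff(t);
gaps = find(dt > 1.5/24);              % index of the sample just before each gap

% Form 2: default values appear as exact matches to the flag values.
flagged = find(d == 0 | d == -999);
```

An exact 0 can also be a legitimate temperature, so a flagged zero is only a candidate; a long run of consecutive zeros is much more suspicious than an isolated one.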

50 days of data from winter; 50 days of data from summer.
[Plot annotations: diurnal cycle, data drop-out, cold spike]
Mention that the plots use both lines and dots at the data points. The cold spike, which is almost certainly an instrumental glitch, consists of two data points. The data drop-out consists of several days of zero values (as contrasted to missing data). Try to get the students to recognize that the diurnal cycle is more prominent in summer than in winter.

Histograms
determine the range of the majority of data values
quantify the frequency of occurrence of data at different data values
make it easy to spot over-represented and under-represented values

MatLab code for Histogram
    Lh = 100;                                  % number of bins
    dmin = min(d);                             % smallest data value
    dmax = max(d);                             % largest data value
    bins = dmin+(dmax-dmin)*[0:Lh-1]'/(Lh-1);  % bin centers, dmin to dmax
    dhist = hist(d, bins)';                    % counts per bin, as a column
The first 4 lines set up the bins, in this case 100 bins from dmin to dmax. The hist() function returns the number of data in each of the bins.

Histogram of Black Rock Forest temperatures
[Plot axes: counts vs. temperature, ºC]
Ask the class to interpret the histogram. The range of most of the data is -20 to +35 deg C. The zero value is wildly over-represented. The histogram is not useful in spotting outliers (hot and cold spikes).
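
An over-represented value can be located numerically as well as by eye. The sketch below builds a synthetic temperature series with a zero-valued drop-out (the series and the drop-out position are assumptions, not the Black Rock Forest data), computes the histogram as in the lecture, and reports the most populated bin.

```matlab
% Synthetic temperatures with a drop-out recorded as zeros.
t = (0:9999)'/24;                          % time, days
d = 10 - 12*cos(2*pi*t/365);               % seasonal cycle, about -2 to 22 deg C
d(3000:3999) = 0;                          % 1000 samples of drop-out

% Histogram with 100 bins from dmin to dmax.
Lh = 100;
dmin = min(d); dmax = max(d);
bins = dmin + (dmax-dmin)*[0:Lh-1]'/(Lh-1);
dhist = hist(d, bins)';

% Locate the most heavily populated bin.
[cmax, imax] = max(dhist);
fprintf('peak bin centered at %.2f deg C holds %d samples\n', bins(imax), cmax);
```

In this synthetic case the peak bin is the one nearest zero, which mirrors the over-represented zero value seen in the real histogram.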

Alternate ways of displaying a histogram
[Panel B; plot axes: counts vs. temperature, ºC]
The grey-shaded image method of displaying a vector or a matrix is heavily used in the text, and should be introduced here.

Moving-Window Histograms
Series of histograms, each on a relatively short time interval of data.
Advantage: shows the way that the frequency of occurrence of data varies with time.
Disadvantage: each histogram is computed using less data, and so is less accurate.

Moving-Window Histogram of Black Rock Forest temperatures
[Plot axes: time in days; temperature in C, from -60 to 40]
Mention that intervals containing drop-outs are easily detected.

Good use of a FOR loop
    offset = 1000;
    Lw = floor(N/offset)-1;
    Dhist = zeros(Lh, Lw);
    for i = [1:Lw];
        j = 1+(i-1)*offset;
        k = j+offset-1;
        Dhist(:,i) = hist(d(j:k), bins)';
    end
Explain that the matrix, Dhist, will contain a series of histograms, one histogram per column. Each histogram is for a different 1000-sample long section of data, d. Each pass through the loop computes the histogram on one window of data. Note that the matrix, Dhist, is pre-allocated using the zeros() function. Pre-allocation is a good scripting practice for a variety of reasons, foremost of which is that it focuses your attention on what the final size of the matrix ought to be. Note the use of the floor() command, which rounds a number down to the nearest integer.
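
A quick way to check a loop like this is to run it on data of known size and confirm that every window's histogram accounts for every sample in that window. The following is a sketch on a synthetic series; the series itself and the variable colsum are assumptions introduced for the check.

```matlab
% Synthetic data: 10000 hourly samples with seasonal and diurnal cycles.
N = 10000;
t = (0:N-1)'/24;
d = 10 - 12*cos(2*pi*t/365) + 4*sin(2*pi*t);

% Bins shared by all windows.
Lh = 100;
dmin = min(d); dmax = max(d);
bins = dmin + (dmax-dmin)*[0:Lh-1]'/(Lh-1);

% Moving-window histogram, one 1000-sample window per column.
offset = 1000;
Lw = floor(N/offset)-1;
Dhist = zeros(Lh, Lw);
for i = 1:Lw
    j = 1+(i-1)*offset;
    k = j+offset-1;
    Dhist(:,i) = hist(d(j:k), bins)';
end

% Each column should sum to the window length.
colsum = sum(Dhist, 1);
```

With N = 10000 the formula for Lw gives 9 windows, so Dhist is 100 by 9 and the final 1000 samples are not included in any window.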

Rate Information: how fast a parameter is changing with time or with distance.

Finite-difference approximation to the derivative
Mention that the starting point is the standard definition of a derivative.
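
A standard way to write this approximation, using d for the data and t for time as in the rest of the lecture, is given below. The subscript i, indexing successive samples, is notation assumed here for illustration.

```latex
\frac{dd}{dt}
  = \lim_{\Delta t \to 0} \frac{d(t+\Delta t) - d(t)}{\Delta t}
  \approx \frac{d(t_{i+1}) - d(t_i)}{t_{i+1} - t_i}
```

The approximation simply drops the limit and uses the spacing between adjacent samples as the time increment.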

We will use the Neuse River Hydrograph to study rates. Try to get the students to identify the annual cycle and the storm events.

MatLab code for derivative
    N = length(d);
    dddt = (d(2:N)-d(1:N-1))./(t(2:N)-t(1:N-1));
Note that dddt is of length N-1.
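
A small worked case makes the length N-1 result concrete. The numbers below are hypothetical, chosen so the arithmetic can be done by hand, and the midpoint-time vector tmid is an addition for illustration rather than part of the lecture's code.

```matlab
t = [0; 1; 2; 4; 5];            % sample times, days (hypothetical values)
d = [10; 12; 18; 14; 13];       % discharge (hypothetical values)

N = length(d);
dddt = (d(2:N)-d(1:N-1))./(t(2:N)-t(1:N-1));   % finite difference, length N-1
dddt2 = diff(d)./diff(t);                      % same calculation using diff()
tmid = (t(1:N-1)+t(2:N))/2;                    % midpoint time of each interval
```

Here dddt is [2; 6; -2; -1]: five samples give four rates, and each rate is most naturally associated with the midpoint of its interval.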

Hypothetical storm event
[Plot annotations: rain, draining of land]
Note that more time has negative dd/dt.
Show how the slope varies along the length of the discharge vs. time curve (left), and that this pattern corresponds to the derivative dd/dt.

A) is an enlargement of part of the hydrograph. B) is its derivative. C) is a histogram of the derivative. Note that it is not centered around zero, but instead has more negatives than positives.
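
The asymmetry between positive and negative rates can be reproduced with a synthetic storm event. The shape below (a two-day linear rise followed by exponential decay with a five-day time constant) is an assumption made for illustration, not a model fitted to the Neuse River data.

```matlab
% Synthetic storm event: fast rise during rain, slow decay as the land drains.
t = (0:29)';                                  % time, days
d = zeros(30,1);
d(1:3) = 100 + 450*(0:2)';                    % fast rise: 100, 550, 1000
d(4:30) = 100 + 900*exp(-(1:27)'/5);          % slow decay back toward 100

dddt = diff(d)./diff(t);                      % finite-difference derivative
npos = sum(dddt > 0);                         % intervals with rising discharge
nneg = sum(dddt < 0);                         % intervals with falling discharge
```

Only 2 of the 29 intervals have rising discharge, so a histogram of dddt for a record made of such events has far more negative than positive values.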

Hypothesis: rate of change in discharge correlates with amount of discharge.
Logic:
a river is bigger when it has high discharge
a big river flows faster than a small river
a river that flows faster drains away water faster
(might only be true after the rain has stopped)
Discuss with class how rate information might be used to test this hypothesis. What kind of plot might be useful? (Answer: a scatter plot.)

MatLab Script
Purpose: make two separate plots, one for times of increasing discharge, one for times of decreasing discharge.
    pos = find(dddt>0);
    neg = find(dddt<0);
    - - -
    plot(d(pos),dddt(pos),'k.');
    plot(d(neg),dddt(neg),'k.');
Explain that a logical test, like (dddt>0), returns a vector of zeros and ones, depending upon whether the test is false or true. Explain that the find() function returns a list of the indices of all the elements of the vector for which the test is true. Explain that d(pos) and d(neg) are vectors of the discharge at times of increasing and decreasing discharge. Explain that 'k.' is a black dot on the plot.
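
The behavior of the logical test and of find() is easiest to see on a very short vector. The values below are hypothetical, and a unit time step is assumed so that the rate is simply the difference between adjacent samples.

```matlab
d = [10; 12; 18; 14; 13];          % discharge (hypothetical values)
dddt = diff(d);                    % rate of change: [2; 6; -4; -1]

mask = (dddt > 0);                 % logical vector: [1; 1; 0; 0]
pos = find(dddt > 0);              % indices where dddt is positive: [1; 2]
neg = find(dddt < 0);              % indices where dddt is negative: [3; 4]

dpos = d(pos);                     % discharge where it is increasing: [10; 12]
dneg = d(neg);                     % discharge where it is decreasing: [18; 14]
```

Because dddt has one fewer element than d, the index lists pick out the discharge at the start of each interval. Intervals with a rate of exactly zero fall into neither list.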

Have the class interpret these results. Do they support the hypothesis?

Atlantic Rock Dataset
I downloaded rock chemistry data from PetDB's website at www.petdb.org. Their database contains chemical information about ocean floor igneous and metamorphic rocks. I extracted all samples from the Atlantic Ocean that had the following chemical species: SiO2, TiO2, Al2O3, FeOtotal, MgO, CaO, Na2O and K2O. My original file, rocks_raw.txt, included a description of the rock samples, their geographic location and other textual information. However, I deleted everything except the chemical data from the file, rocks.txt, so it would be easy to read into MatLab. The order of the columns is as given above and the units are weight percent.

Using scatter plots to look for correlations among pairs of the eight chemical species: 8! / [2! (8-2)!] = 28 plots. It is marginally possible to examine 28 plots.
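
The count of 28 can be confirmed, and the pairs enumerated, with nchoosek(). In this sketch the species names follow the column order stated for rocks.txt; the variable names (names, pairs, npairs) are assumptions for the example, and no rock data are read.

```matlab
names = {'SiO2','TiO2','Al2O3','FeOtotal','MgO','CaO','Na2O','K2O'};
M = length(names);                                     % 8 chemical species
npairs = factorial(M)/(factorial(2)*factorial(M-2));   % 8!/[2!(8-2)!]
pairs = nchoosek(1:M, 2);                % each row holds one pair of column numbers
for p = 1:3                              % print the first few pairs
    fprintf('%s vs %s\n', names{pairs(p,1)}, names{pairs(p,2)});
end
```

Looping over the rows of pairs is one way to generate all of the scatter plots without writing each one out by hand.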

Four interesting scatter plots
[Panel axes: Al2O3, TiO2, SiO2, K2O, FeO, MgO]
A) Shotgun pattern, with no obvious correlation. B) Data scatter about a line. C) Two populations. D) Something complicated. All from the same dataset! This points out a limitation of scatter plots. You see 2D, but the data are inherently multi-dimensional.