General good advice on data handling
Peter Shaw
Introduction
- We have spent the last 11 weeks picking up technical details about various aspects of data handling and analysis.
- This week I do not intend to introduce any new names or techniques (unless you specifically ask), just to round off with a few unifying thoughts and snippets of good advice.
Project design
- Get it right before you start!
- It is not hard to get a balanced design, though you may well have to make some sacrifices in the number of treatments / sites / replicates etc.
- Check your design with staff - that’s what they’re paid for!
- It can be impossible to fix a bad design: Rothamsted once had to throw away 50 years of meticulously collected data because the faulty experimental design made the data useless.
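One way to check a planned design for balance before any fieldwork starts is to count the replicates in every site x treatment cell. The sketch below is only an illustration: the site names, treatment names and replicate numbers are invented, and "balanced" is taken here to mean that every cell is present with the same number of replicates.

```python
from collections import Counter


def replicate_counts(plots):
    """Count replicates for every (site, treatment) combination."""
    return Counter((site, treatment) for site, treatment in plots)


def is_balanced(plots):
    """True when every site x treatment cell occurs with the same number of replicates."""
    counts = replicate_counts(plots)
    sites = {site for site, _ in plots}
    treatments = {treatment for _, treatment in plots}
    all_cells_present = all((s, t) in counts for s in sites for t in treatments)
    return all_cells_present and len(set(counts.values())) == 1


# Hypothetical plan: 2 sites x 2 treatments x 3 replicates.
planned = [
    (site, treatment)
    for site in ("A", "B")
    for treatment in ("control", "fertilised")
    for _ in range(3)
]

print(is_balanced(planned))       # True
print(is_balanced(planned[:-1]))  # False: one replicate has been dropped
```

Running a check like this on the plan, rather than on the finished dataset, is the point: an unbalanced cell is cheap to fix before the first sample is taken.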
Data Collection
- Keep a notebook, and write things down as you go along (dating each entry).
- This is best done on the spot - by the time you get home you will have forgotten some important details.
- Often you have to fall back on Operational Taxonomic Units (OTUs: Sp1, Sp2, small pink thin species, etc). Fine - this is more honest than trying to shoehorn an unfamiliar specimen into a known species.
- Make sure that you keep such specimens carefully for ID, and that these IDs are recorded in the relevant lab/field notebook. Take it from me - trying to work out how to decode entries like “?blue-brown oddity: 2 specimens” after a year’s absence is playing Russian roulette with your datamatrix!
Once data are written down...
- You need to transcribe them into a PC.
- This procedure is easy to skimp on, as you look forward to the analyses ahead!
- GIGO - Garbage In, Garbage Out! If you allow errors to creep in at this stage, all subsequent analyses will be suspect, if not downright invalidated.
- Entering species data into spreadsheets is particularly tedious due to the predominance of zeroes.
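One possible way to reduce the tedium of typing zeroes, offered here as a suggestion rather than something prescribed in the lecture, is to type in only the non-zero counts as (sample, species, count) records and let a short script expand them into the full table. The function name `build_matrix` and all sample and species labels below are hypothetical.

```python
def build_matrix(records, samples, species):
    """Expand (sample, species, count) records into a full sample x species table.

    Combinations that were not recorded are filled with 0, meaning
    'looked for and not found'. This is not suitable for lost samples,
    which are missing data rather than zeroes.
    """
    matrix = {sample: {sp: 0 for sp in species} for sample in samples}
    seen = set()
    for sample, sp, count in records:
        if sample not in matrix or sp not in matrix[sample]:
            raise ValueError(f"unknown sample or species: {sample!r}, {sp!r}")
        if (sample, sp) in seen:
            # A repeated entry is more likely a typing slip than real data.
            raise ValueError(f"duplicate entry for {sample!r}, {sp!r}")
        seen.add((sample, sp))
        matrix[sample][sp] = count
    return matrix


# Hypothetical records: only the non-zero counts are typed in.
records = [("plot1", "Sp1", 4), ("plot1", "Sp3", 1), ("plot2", "Sp2", 7)]
matrix = build_matrix(records, ["plot1", "plot2"], ["Sp1", "Sp2", "Sp3"])

for sample, row in matrix.items():
    print(sample, row)
```

The checks for unknown names and duplicate entries are there because typing slips at this stage are exactly the "garbage in" that the slide warns about.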
Metadata
- These are data about data: information that sets the actual measurements in context.
- Some forms of metadata are essential for analysis and must be held within the datamatrix: date, depth, sample number, time, observer, plot number etc.
- Others are immaterial for the analyses but crucial for write-ups and replicability: details of methods used, site location etc. The notebooks that hold these data are essential documents in archives.
[Figure: layout of the datamatrix as blocks of columns: metadata (site, date, plot etc); raw site data (pH, elevation etc); raw species data; log-transformed data etc. Approximate column counts marked on the figure: 6ish, 4-10ish, 10-100.]
Debugging and verifying
- Once data are in, go back and check every entry against the notebook.
- I find it helpful to photocopy notebook pages, so I can cross out or highlight entries once validated against the data file.
- Even then, don’t believe the data! Use boxplots to check for outliers.
- What are your units? Often you need to convert raw data into a derived format (densities per unit area, mg 100 g⁻¹ etc). Don’t change source data but create new variables, and ensure that each variable is unambiguously labelled.
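The advice on derived variables can be sketched as follows: the raw counts stay untouched, and the density goes into a new variable whose name carries its units. The column names (`worms_count`, `quadrat_area_m2`, `worms_per_m2`) and the numbers are made up for illustration.

```python
def add_density(rows, count_key, area_key, new_key):
    """Return new rows with a density column added; the input rows are not modified."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the source data stay untouched
        new_row[new_key] = row[count_key] / row[area_key]
        out.append(new_row)
    return out


# Hypothetical raw data: counts per quadrat, with the quadrat area recorded.
raw = [
    {"plot": 1, "worms_count": 12, "quadrat_area_m2": 0.25},
    {"plot": 2, "worms_count": 3, "quadrat_area_m2": 0.25},
]

derived = add_density(raw, "worms_count", "quadrat_area_m2", "worms_per_m2")

print(derived[0]["worms_per_m2"])  # 48.0
print("worms_per_m2" in raw[0])    # False: the source rows are unchanged
```

Putting the unit in the variable name (`_per_m2`, `_m2`) is one simple way to keep each variable unambiguously labelled.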
Outliers 1
- These are datapoints which “clearly” lie outside the range of the rest of the dataset, and show up on boxplots or scattergraphs.
- Always eyeball the data, and check outliers. Usually they result from a typing mistake and are easily remedied.
- Sometimes they are clearly an error in the notebook - how you sort this out depends on your judgement, experience and intuition. If in doubt ask!
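As a complement to eyeballing a boxplot, the points a boxplot would draw separately can be listed numerically. The sketch below assumes the common convention of flagging anything more than 1.5 interquartile ranges beyond the quartiles; statistics packages differ slightly in how they compute quartiles, so borderline points may be flagged differently elsewhere. The pH values are invented, with one entry typed as 53.0 instead of 5.3.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical pH readings; index 5 has its decimal point one place out.
ph = [5.1, 5.4, 5.2, 5.6, 5.3, 53.0, 5.5]

print(flag_outliers(ph))  # [(5, 53.0)]
```

The list only says which entries to check against the notebook; deciding what to do with them is still a matter of judgement.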
Outliers 2
- Then you get the awkward sort! The notebook is adamant and the entry looks plausible, but the datapoint looks odd. Now what?
- It is legitimate to exclude such points from further analysis, although you should record this fact in your methods section. Be careful, as you may be removing the most interesting observation!
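One way to make sure an exclusion is recorded is to keep the excluded points, with the reason, alongside the data actually analysed, instead of deleting them. This is only a sketch; the sample identifiers, values and wording of the reason are hypothetical.

```python
def exclude_points(rows, exclusions):
    """Split rows into kept and excluded, attaching the stated reason to each excluded row.

    `exclusions` maps a sample id to the reason for leaving it out.
    """
    kept, excluded = [], []
    for row in rows:
        reason = exclusions.get(row["sample"])
        if reason is None:
            kept.append(row)
        else:
            excluded.append({**row, "reason": reason})
    return kept, excluded


# Hypothetical data: sample s3 looks odd although the notebook agrees with it.
rows = [
    {"sample": "s1", "ph": 5.2},
    {"sample": "s2", "ph": 5.4},
    {"sample": "s3", "ph": 8.9},
    {"sample": "s4", "ph": 5.3},
]
exclusions = {"s3": "implausibly high pH; notebook entry confirmed"}

kept, excluded = exclude_points(rows, exclusions)
print(len(kept), "kept;", len(excluded), "excluded")  # 3 kept; 1 excluded
for row in excluded:
    print(row["sample"], "-", row["reason"])
```

The `excluded` list can then go straight into the methods section, and the point is still available if it turns out to be the most interesting observation.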
Multivariate techniques...
- ...are especially sensitive to outliers: watch as one data point has its decimal point entered one place out:
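The effect of a single misplaced decimal point can be shown in miniature. The sketch below is not the lecture's own demonstration: it uses a plain Pearson correlation, the building block of methods such as a principal components analysis on a correlation matrix, and the eight data points are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a tight positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.1]

# The same data with the first value typed as 21.0 instead of 2.1.
y_slipped = [21.0] + y[1:]

r_clean = pearson_r(x, y)
r_slipped = pearson_r(x, y_slipped)

print(f"clean data:         r = {r_clean:.2f}")    # strongly positive
print(f"one decimal slip:   r = {r_slipped:.2f}")  # the sign flips
```

With these numbers, one mistyped value out of eight is enough to turn a near-perfect positive correlation into a weak negative one, and any technique built on the correlation matrix inherits that distortion.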
Missing data
- These are sadly common. You knocked the tube on the floor, you lost the sample…
- Don’t put a zero (-1, etc) there! This is tantamount to saying that you actually measured this value.
- SPSS has a specific solution to missing data - enter a “.” (full stop, decimal point etc). That data point will be excluded from analyses.
- Check the options in each technique used to see how missing values are handled. They cause insurmountable difficulties for many analyses, and either the variable or the observation will have to be excluded.
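To see why a zero is not a harmless placeholder, compare a mean that skips the missing value with one where a zero was typed in its place. In this sketch Python's `None` stands in for the SPSS “.” marker, which is an analogy assumed for illustration; the pH values are invented.

```python
import math


def mean_ignoring_missing(values):
    """Mean of the recorded values; None marks a value that was never measured."""
    recorded = [v for v in values if v is not None]
    if not recorded:
        return math.nan  # nothing was measured, so there is no mean to report
    return sum(recorded) / len(recorded)


# Hypothetical pH readings; the third tube was knocked on the floor.
ph = [6.0, 6.5, None, 7.0]
ph_with_zero = [6.0, 6.5, 0.0, 7.0]  # the same data with a zero typed in

print(mean_ignoring_missing(ph))              # 6.5
print(sum(ph_with_zero) / len(ph_with_zero))  # 4.875
```

The zero drags the mean from 6.5 down to 4.875, a value that was never observed in any sample.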