Introduction to Data Handling
A Fast Hour
- Review of data types: scalar, ordinal, nominal
- Decisions regarding encoding data
  - Turning information into analyzable data
  - Dealing with missing data
- The structure of experimental data
  - Getting things into 2-dimensional (or few-dimensional) tables
- Deciding on which software to use
  - Excel
  - Spreadsheet-style analysis packages
  - Scripted analysis
Review of Data Types
- Scalar
  - Continuous
  - Discrete
- Ordinal
- Nominal
Scalar Data
- Continuous data
  - Real numbers used to measure magnitude
  - Unbounded in at least one direction
  - Ex: average Dilantin level: (3.1 + 4.4)/2 = 3.75
- Discrete data
  - Data that can take on only a countable set of values
  - Unbounded in at least one direction
  - Ex: average number of fingers: (5 + 4)/2 = 4.5, a value no one actually has, but which is 'in between' 4 and 5
Scalar Data
- Truly continuous data are theoretical; you don't run into them in the real world
- Because of limitations of measurement (e.g., significant figures), scalar data are actually discrete
- In most real-life applications, discrete data can be handled as if continuous
- Just beware of the '2.3 kids' problem
Ordinal Data
- Data whose attributes are ordered, but for which the numerical differences between adjacent attributes are not necessarily interpreted as equal
- Bounded: the scale has some upper and lower limit
- Classic example: Glasgow Coma Scale
  - A GCS of 4 intuitively ranks lower than a GCS of 5
  - The difference between a GCS of 14 and 15 is not the same as the difference between a GCS of 3 and a GCS of 4
  - GCS of 4 + GCS of 5 ≠ GCS of 9
Nominal Data
- May have an assigned numerical value for analytical reasons, but there is no numerical underpinning for the variables
- Example: Race
  - African American = 1
  - Hispanic = 2
  - Asian = 3
  - 1 + 2 ≠ 3
Turning information into analyzable data
- Discrete data are usually easy
  - Age
  - Vital signs
  - One-dimensional measures (e.g., Hgb, time-to-relapse)
- Ordinal and nominal data get tricky
  - If you're only going to do descriptive statistics, it doesn't matter much
  - If you're going to model (e.g., do regression), it gets involved
Real-Life Example from the Camp Survey
Question 3. On a usual camp day, the person on site with the highest level of health care training is a:
- Physician
- Registered nurse
- Licensed practical nurse
- Licensed paramedic
- Licensed EMT
- Licensed first responder
- First aid provider
Real-Life Example from the Camp Survey
What type of variable would you use?
Real-Life Example from the Camp Survey
One choice: a continuous variable
- On a usual camp day, how many years of training has the senior-most caregiver completed?
- Var_Years
Real-Life Example from the Camp Survey
Another, more likely choice: an ordinal variable (Var_Caregiver)
- Physician = 1
- RN = 2
- LPN = 3
- Paramedic = 4
- EMT = 5
- First responder = 6
- First aid = 7
Real-Life Example from the Camp Survey
A third choice: seven nominal 'dummy variables'
- Var_MD = 1 or 0 (yes or no)
- Var_RN = 1 or 0
- Var_LPN = 1 or 0
- Var_Para = 1 or 0
- Var_EMT = 1 or 0
- Var_Respond = 1 or 0
- Var_FirstAid = 1 or 0
Real-Life Example from the Camp Survey
Who cares? The three encodings look very different in the data table:
- Var_Caregiver: a single column taking the values 1-7
- 7 dummy variables: each caregiver level becomes its own 0/1 column, e.g.

  Var_MD  Var_RN  Var_LPN  Var_Para  Var_EMT  Var_Respond  Var_FirstAid
     1       0       0        0        0          0            0
     0       1       0        0        0          0            0
     0       0       1        0        0          0            0
     0       0       0        1        0          0            0
     0       0       0        0        1          0            0
     0       0       0        0        0          1            0
     0       0       0        0        0          0            1

- Var_Years: real numbers
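As a sketch of how these encodings differ in practice, the snippet below (Python, purely illustrative; the function names are invented, though the Var_ column names come from the survey example) builds the ordinal code and the seven dummies from the same answer:

```python
# Illustrative sketch: encoding one survey answer two ways.
# The seven caregiver levels, in the order used on the slide (1-7).
levels = ["MD", "RN", "LPN", "Para", "EMT", "Respond", "FirstAid"]

def as_ordinal(caregiver):
    """Ordinal choice: a single 1-7 code (Var_Caregiver)."""
    return levels.index(caregiver) + 1

def as_dummies(caregiver):
    """Nominal choice: seven 0/1 dummy variables, exactly one set to 1."""
    return {f"Var_{lvl}": int(lvl == caregiver) for lvl in levels}

print(as_ordinal("RN"))    # 2
print(as_dummies("RN"))    # Var_RN is 1, all other dummies are 0
```

Either function could feed a data table; the modeling consequences of the choice are the subject of the next slides.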
A Basic Modeling Problem
Is there a relationship between the level of on-site caregiver training and the number of deaths per year at camp?
Deaths = f(Caregiver Level)
[Chart: Number of Deaths vs. Var_Caregiver (1-7)]
Deaths = b1·x1 + b0
where
- x1 = Var_Caregiver (1-7)
- b1 = a coefficient
- b0 = the y-intercept
[Chart: Number of Deaths vs. each dummy variable, Var_MD through Var_FirstAid]
Deaths = b1·x1 + b2·x2 + b3·x3 + b4·x4 + b5·x5 + b6·x6 + b7·x7 + b0
where
- x1 = Var_MD, x2 = Var_RN, etc.
- b1-b7 = coefficients for each x
- b0 = the y-intercept
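To make the two model forms concrete, here is a minimal sketch (Python with numpy; the camp data are entirely invented for illustration) fitting both by ordinary least squares. Note one practical wrinkle the slides gloss over: with an intercept, one dummy must be dropped as the reference category, or the design matrix is collinear.

```python
# Illustrative sketch: fitting the ordinal vs. dummy-variable models.
import numpy as np

# Invented data: caregiver code (1-7) and deaths per year at nine camps.
caregiver = np.array([1, 2, 3, 4, 5, 6, 7, 2, 5], dtype=float)
deaths    = np.array([0, 0, 1, 1, 2, 2, 3, 1, 2], dtype=float)

# Model 1 (ordinal): Deaths = b1 * Var_Caregiver + b0
X1 = np.column_stack([caregiver, np.ones_like(caregiver)])
b1, b0 = np.linalg.lstsq(X1, deaths, rcond=None)[0]

# Model 2 (dummies): one 0/1 column per level 2-7; level 1 (MD) is the
# reference category, absorbed by the intercept.
dummies = (caregiver[:, None] == np.arange(2, 8)).astype(float)
X2 = np.column_stack([dummies, np.ones(len(deaths))])
coefs = np.linalg.lstsq(X2, deaths, rcond=None)[0]
```

Model 1 forces deaths to change by the same amount b1 per caregiver step; Model 2 estimates each level's mean separately, at the cost of more coefficients.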
Ordinal model (Deaths vs. Var_Caregiver):
- Pros: easy to compute; easy to understand
- Cons: forces a 'continuous' structure onto Var_Caregiver that may not really exist

Dummy-variable model (Deaths vs. Var_MD through Var_FirstAid):
- Pros: agrees more closely with experimental results; doesn't impose any relationship between different provider levels
- Cons: less easy to understand; 'discards' the knowledge that some caregivers have more training than others
Decisions regarding how to encode data
- Stick as close to the 'raw measurement' as you can
Stick close to the original measurement
When you call an ambulance in an emergency, how long does it take for the ambulance to get to your camp?
- < 5 minutes (Time = 1)
- 5-10 minutes (Time = 2)
- 10-15 minutes (Time = 3)
- 15-20 minutes (Time = 4)
- > 20 minutes (Time = 5)
- Don't know (Time = 6)
Good, bad, indifferent?
Stick close to the original measurement
- Do you know how long it would take an ambulance to respond to a call from your camp? (y/n)
- If so, how many minutes? (some discrete #)
Decisions regarding how to encode data
- Stick as close to the 'raw measurement' as you can
  - Abstraction seems useful, but it distances you from what you were originally looking at
  - Keep continuous data continuous if at all possible
  - Likewise, preserve ordinal and nominal data
  - Later on, you can 'digest' the raw data into categories, etc., as necessary
Decisions regarding how to encode data
Remember: Data can always be made more general during analysis. They cannot be made more specific.
Decisions regarding how to encode data
- Stick as close to the 'raw measurement' as you can
- Avoid bundling more than one idea into a single variable
  - Ex: coding '> 20 minutes' and 'Don't know' on the same scale, as in the ambulance question above
Decisions regarding how to encode data
- Stick as close to the 'raw measurement' as you can
- Avoid bundling more than one idea into a single variable
- Use a specific plan for missing data!
Missing Data
Blank data cells are ambiguous:
- Data not provided/collected?
- Data erroneously omitted?
- Data provided but nonsensical?
Note: many statistical packages will ignore an entire 'observation' if a data point is missing!
Missing Data
- Pick something (other than nothing) to denote a missing data point
- '.' or 'Null' are commonly used
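A small sketch of the idea (Python standard library only; the file contents and field names are invented): missing points are marked explicitly with '.', converted to None on read, and the listwise deletion that many stats packages perform silently is made visible.

```python
# Illustrative sketch: explicit missing-data markers instead of blank cells.
import csv
import io

# '.' marks a known-missing systolic BP for patient 2.
raw = "id,sbp\n1,114\n2,.\n3,93\n"

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    sbp = None if rec["sbp"] == "." else int(rec["sbp"])
    rows.append({"id": int(rec["id"]), "sbp": sbp})

# Complete-case ("listwise") deletion: what many packages do automatically
# when an observation has any missing field.
complete = [r for r in rows if r["sbp"] is not None]
```

Because the marker is explicit, "not collected" can later be distinguished from "erroneously omitted" rather than both appearing as blanks.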
The Structure of Data
- Statistical analysis is based on the idea of 'observations'
- An observation often is a patient (and all of the data you collect about that patient)
- Really it is just an experimental 'unit' or 'trial,' such as one summer camp or one hospital day
- Any analysis of many observations requires you to establish a 'structure' for your observations
The Structure of Data
- You'll need to think about the 'shape' of your experimental data early in your study, preferably during planning
- Fortunately, very many data sets can be structured into a tabular form
- For better or worse, Excel is used very often
The Structure of Data

Obs # | Last Name | Systolic BP | Diastolic BP
1     | Fawcett   | 114         | 54
2     | Smith     | 93          | 42
3     | Jackson   | 78          | 49
4     | Ladd      | 58          | 38
(rows = observations; columns = fields)

Don't confuse a 2-dimensional data table with 2-dimensional data! Ultimately, every observation is a mathematical 'vector' that completely describes that event in an n-dimensional space.
[Chart: each patient (Fawcett, Smith, Jackson, Ladd) plotted as a point in SBP-DBP space]
Your data have as many dimensions as they have data fields!
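The observation-as-vector idea can be shown in a few lines (Python with numpy; the blood-pressure values are the ones from the example table):

```python
# Illustrative sketch: each observation is a vector in field-space.
import numpy as np

fields = ["systolic_bp", "diastolic_bp"]     # 2 fields -> 2 dimensions
fawcett = np.array([114, 54])                # one observation = one point
smith   = np.array([93, 42])

# Once observations are vectors, they can be compared geometrically,
# e.g. by Euclidean distance in SBP-DBP space.
distance = np.linalg.norm(fawcett - smith)
```

Add a third field (say, heart rate) and every patient becomes a point in a 3-dimensional space; the table just grows a column.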
(Unavoidable) Shortcomings of Tabular Data
- Large number of fields or observations: difficult to 'look' at all of the data
- Trouble with repeated measures
Handling Repeated-Measures Data in a Tabular Data Structure
One row per patient, with one set of columns per study day:
Patient ID | Weight Day 1 | BUN Day 1 | Urine Day 1 | Weight Day 2 | BUN Day 2 | Urine Day 2 | Weight Day 3 | BUN Day 3 | Urine Day 3 | Weight Day 4 | BUN Day 4 | Urine Day 4 | Weight Day 5 | BUN Day 5 | Urine Day 5
Handling Repeated Measures in Tabular Data Structures
The 'Day in the Life' strategy: a patient-day becomes the observation. Can be a more compact way of saving data.

Obs # | Last Name | Hospital Day | Systolic BP
1     | Fawcett   | 1            | 84
2     | Fawcett   | 2            | 72
3     | Fawcett   | 3            | 84
4     | Smith     | 1            | 94
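The move from one-row-per-patient to one-row-per-patient-day is a mechanical reshape. A minimal sketch (Python; the field names and values are invented for illustration):

```python
# Illustrative sketch: "wide" (one row per patient) to "long"
# ('Day in the Life': one row per patient-day).
wide = [
    {"patient_id": 1, "sbp_day1": 84, "sbp_day2": 72, "sbp_day3": 84},
    {"patient_id": 2, "sbp_day1": 94, "sbp_day2": 90, "sbp_day3": 88},
]

long_rows = []
for rec in wide:
    for day in (1, 2, 3):
        long_rows.append({
            "patient_id": rec["patient_id"],
            "hospital_day": day,            # day becomes a field of its own
            "sbp": rec[f"sbp_day{day}"],
        })
```

In the long layout, adding a sixth study day adds rows rather than three new columns, which is why it scales better for repeated measures.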
Using Relational Databases for More Complex Data
- Demographic data
- Daily data (for each of 7 study days)
- Bacterial isolate data
- Outcome data
Deciding Which Software to Use
Some useful ground rules:
1. Use software with all of the tools you need
2. Don't make things unnecessarily complicated
3. Know in advance what your statistical collaborators are going to use, and how they like the data to appear
Deciding Which Software to Use
Data-entry-level tools:
- An input method other than just entering fields in an Excel spreadsheet ('forms'-type page)
- Interfaces with other data types
  - Interface with Scantron
  - Interface with analytical instruments
Deciding Which Software to Use
Data-entry-level tools:
- Entry error control
  - Double entry
  - Restricted data fields that must fit a particular format or be rejected
- Merging data sets
  - Doing this by hand is fine for 15 patients, but not for 1,500
Deciding Which Software to Use
Data manipulation needs: do your data need some post-collection modification prior to analysis?
- Transformation (e.g., log-transforming to achieve normal statistical distributions)
- Relabeling missing data fields
- Text or numerical string modification (e.g., changing all dates to MM/DD/YYYY)
- Internal data consistency checks (e.g., is the number of ICU days ≤ the number of hospital days?)
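The list above can be sketched as a few lines of pre-analysis cleanup (Python standard library only; the records and field names are invented for illustration):

```python
# Illustrative sketch: post-collection cleanup prior to analysis.
import math

records = [
    {"hgb": 12.4, "icu_days": 3,  "hosp_days": 10, "admit": "2007-6-1"},
    {"hgb": 9.1,  "icu_days": 12, "hosp_days": 8,  "admit": "2007-6-3"},
]

for r in records:
    # Transformation: log-transform a skewed lab value.
    r["log_hgb"] = math.log(r["hgb"])

    # String modification: normalize all dates to MM/DD/YYYY.
    y, m, d = r["admit"].split("-")
    r["admit"] = f"{int(m):02d}/{int(d):02d}/{y}"

    # Internal consistency check: ICU days cannot exceed hospital days.
    r["consistent"] = r["icu_days"] <= r["hosp_days"]
```

The second invented record deliberately fails the consistency check (12 ICU days in an 8-day stay), the kind of error such checks are meant to surface.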
Deciding Which Software to Use
What analyses are you going to perform?
Easy in Excel:
- Summary statistics (frequencies, means, etc.)
- Simple x-by-y regressions
Not easy in Excel:
- Contingency tables (and χ²)
- ANOVA
Best handled in dedicated stats packages or elsewhere:
- Multivariate modeling
- Logistic modeling
- Nonlinear modeling
Deciding Which Software to Use
Output needs:
- Tabular data that can be dumped into a word processor
  - Text files
  - Cut-and-paste
- Graph preparation and dumping
  - Cut-and-paste
  - Specialized output formats (.tif, .jpg, .svg, MS metafiles)
  - Colors (RGB v. CMYK)
Deciding Which Software to Use
Other needs you might not have thought about but that are really important:
- Interim 'noodling'-type analysis
- Needing to repeat the analysis on multiple data sets, or to 'update' the analysis if new data become available
Deciding Which Software to Use
(in order of increasing organization and increasing front-end time)
- Spreadsheets: Excel
- Spreadsheet and 'pull-down' stats packages: SPSS, Prism (GraphPad), JMP
- Database managers: Access, FoxPro
- Scripted statistical languages: SAS, R, MATLAB
Handling Your Data in Excel
- Few up-front requirements: load your data and you're ready to go
- Many simple stats can be done as 'one-off' analyses
- VERY inflexible: you pay for your choice later on in debugging, rerunning analyses, editing the data set, etc.
Using Spreadsheet/'Pull-Down' Stats Packages
- Offer the most power most users will ever need
- Slightly more up-front time; forced data structures are like eating oatmeal
- Most have integrated graphics utilities
- Some unusual applications are tough to manage (e.g., nonlinear analysis)
Using Scripted Statistical Packages
- When you anticipate running relatively complicated analyses on a series of data sets
- When you can design the analysis plan without having all of the data available
- When you must document exactly how you did your analysis and be able to duplicate it exactly at will
  - Which is arguably every time (!)
An example R script (heatplot() is assumed here to come from the Bioconductor 'made4' package, which must be loaded first):

library(made4)                                 # provides heatplot()
g <- read.csv("expdata2.csv", header = TRUE)   # read the experimental data
gmat <- as.matrix(g)                           # data frame -> numeric matrix
gmati <- gmat * -1                             # flip the sign of every value
heatplot(gmati, Colv = NA)                     # heatmap, no column dendrogram
Back-End Utilities: Graphical Output
- Excel has horrible graphics that can be spotted a mile away in journals
- Most stats packages will do better
- Consider 'post-processing' in dedicated graphics software (e.g., Adobe Illustrator)
Research is a Data Business: Use the Tools at Your Disposal
Data input system → dedicated database manager → statistical package(s) → graphing system → graphics polishing for publication
Other Very Important Resources
- Google
  - Almost everything you need to know
  - Most of it's pretty accurate
- Java applets
  - Many stats applications can be found online that will run on any machine
- Open-source code is on its way
  - R
  - Linux
- CSCAR
  - Sometimes more helpful than others
Who Will Not Be Helpful
- MCIT
Questions?
People and their software:
- Sue Stern: JMP (the SAS 'pull-down' package); repeated-measures analysis of clinical data
- Bonnie Singal: SAS; pretty much any clinical statistical research question
- Matt Trowbridge: stats and GIS packages; merging complex data sets
- John Younger: SAS, Prism, R; kinetics, logistic and nonlinear models of complex behaviors