Cornell University January 2015 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Jeramia.

Slides:



Advertisements
Similar presentations
FDA/Industry Workshop September, 19, 2003 Johnson & Johnson Pharmaceutical Research and Development L.L.C. 1 Uses and Abuses of (Adaptive) Randomization:
Advertisements

Obesity e-Lab Enabling obesity research using the Health Surveys for England: The Obesity e-Lab project Dexter Canoy The University of Manchester
Introduction to ReportSmith and Effective Dated Tables
Organisation Of Data (1) Database Theory
Getting started with GEM-SA Marc Kennedy. This talk  Starting GEM-SA program  Creating input and output files  Explanation of the menus, toolbars,
ICDL Software Applications - Database Concepts. Unit 6 Data and Data Representation Database Concepts –File Structure –Relationships Database Design –Data.
Do files, log files, and workflow in Stata Biostatistics 212 Lecture 2.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Teaching Statistics Using Stata Software Susan Hailpern BSN MPH MS Department of Epidemiology and Population Health Albert Einstein College of Medicine.
Microsoft Excel Working with Excel Lists, Subtotals and Pivot Tables.
John Porter Why this presentation? The forms data take for analysis are often different than the forms data take for archival storage Spreadsheets are.
Important concepts in software engineering The tools to make it easy to apply common sense!
Beginning Data Manipulation HRP Topic 4 Oct 19 th 2011.
Ann Arbor ASA ‘Up and Running’ Series: SPSS Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association, in cooperation with.
Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3.
Introduction to Statistical Computing in Clinical Research Biostatistics 212 Course director: Mark Pletcher Teaching Assistant: Lee Zane.
Spreadsheet design an overview of further issues Research Methods Group Wim Buysse – ICRAF-ILRI Research Methods Group October 2004.
RESEARCH HUB AT THE UNIVERSITY LIBRARIES PENN STATE UNIVERSITY TOUR OF STATISTICAL PACKAGES.
Quantifying Data.
Rebecca Boger Earth and Environmental Sciences Brooklyn College.
Flow Cytometry and Reproducible Analysis Cliburn Chan Department of Biostatistics and Bioinformatics, DUMC.
Biostatistics Analysis Center Center for Clinical Epidemiology and Biostatistics University of Pennsylvania School of Medicine Minimum Documentation Requirements.
Trying to Give Business Students What They Need for Their Future Bob Andrews Virginia Commonwealth University.
STAT 3130 Statistical Methods II Missing Data and Imputation.
Microsoft Excel How to make a SPREADSHEET. Microsoft Excel IT is recommended that you have EXCEL running at the same time. You can try what you are reading.
Sterling Chadee Director of Statistics. The processing of the data from the field enumeration began in July 2011 until September All data processors.
 A databases is a collection of data organized to make it easy to search and easy to retrieve in a useful, usable form.
Institute for Tribal Environmental Professionals Tribal Air Monitoring Support Center Melinda Ronca-Battista Brenda Sakizzie Jarrell Southern Ute Indian.
Introduction to SPSS Edward A. Greenberg, PhD
Highline Class, BI 348 Basic Business Analytics using Excel, Chapter 01 Intro to Business Analytics BI 348, Chapter 01.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Introduction to STATA for Clinical Researchers Jay Bhattacharya August 2007.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Objectives.
1 Click to edit Master title style Demographic Analysis Panel Current and Future State FDA/PhUSE CSS - Working Group 5 - Analysis Standards Script Examples.
Questionnaire Development: SPSS and Reliability Personality Lab October 8, 2010.
Example SPSS Basic Medical Statistics Course October 2010 Wilma Heemsbergen.
11 3 / 12 CHAPTER Databases MIS105 Lec15 Irfan Ahmed Ilyas.
HSC IT Center Training University of Florida Introduction to Database Concepts and Microsoft Access 2007 Health Science Center IT Center – Training
Data and information. Information and data By the end of this, you should be able to state the difference between DATE and INFORMAITON.
Introduction to Statistical Computing in Clinical Research Biostatistics 212.
Accuracy Chapter 5.1 Data Screening. Data Screening So, I’ve got all this data…what now? – Please note this is going to deviate from the book a bit and.
A Powerful Python Library for Data Analysis BY BADRI PRUDHVI BADRI PRUDHVI.
Introduction to R Introductions What is R? RStudio Layout Summary Statistics Your First R Graph 17 September 2014 Sherubtse Training.
Preparing Data for Quantitative Analysis Copyright © 2010 by the McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
KEEP THIS TEXT BOX this slide includes some ESRI fonts. when you save this presentation, use File > Save As > Tools (upper right) > Save Options > Embed.
CSC 1010 Programming for All Lecture 1 Some material courtesy of Python for Informatics: Exploring Information (
LO3 Know the features and functions of information systems.
Data Organization Quality Assurance and Transformations.
R Workshop #2 Basic Data Analysis. What we did last week: Understand the basics of how R works Generated objects (vectors, matrices, etc.) Read in data.
1 PEER Session 02/04/15. 2  Multiple good data management software options exist – quantitative (e.g., SPSS), qualitative (e.g, atlas.ti), mixed (e.g.,
Elementary Analysis Richard LeGates URBS 492. Univariate Analysis Distributions –SPSS Command Statistics | Summarize | Frequencies Presents label, total.
PREPARED BY: PN. SITI HADIJAH BINTI NORSANI. LEARNING OUTCOMES: Upon completion of this course, students should be able to: 1. Understand the structure.
The Data Collection and Statistical Analysis in IB Biology John Gasparini The Munich International School Part II – Basic Stats, Standard Deviation and.
Software Sustainability Institute Data Carpentry Aleksandra Pawlik Software Sustainability Institute Data Science Club, 17 th March.
Cornell University June 2016 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Lynn.
Saving Everyone’s Time and Energy: Practical Tips for Database Design Cynthia Wilson Garvan PhD Statistics, MA Mathematics College of Nursing
Introduction to the SPSS Interface
Utilization of Data Maryland Highway Safety office/Washington College
SPSS: Using statistical software — a primer
DATABASE.
GO! with Microsoft Office 2016
GO! with Microsoft Access 2016
Macrosystems EDDIE: Getting Started + Troubleshooting Tips
Florida State University
A Digital Record-Keeping App for Aviation Technologies’ Students
Analytics: Its More than Just Modeling
Spreadsheets, Modelling & Databases
Introduction to the SPSS Interface
Presentation transcript:

Cornell University January 2015 Sponsored by Cornell Statistical Consulting Unit Instructors Emily Davenport (Cornell University) Erika Mudrak (CSCU) Jeramia Ory (Kings College) Assistants Jay Barry Lynn Johnson Françoise Vermeylen

Community driven effort Data Carpentry board: Karen Cranston (NESCent), Hilmar Lapp (Duke), Aleksandra Pawlik (ELIXIR UK), Karthik Ram (rOpenSci), Tracy Teal (Michigan State), Ethan White (Univ of Florida), Greg Wilson (Software Carpentry) Contributors: 20 people contributing to materials development already 4 workshops taught, 11 instructors, ~20 helpers Open source materials

Goal: Develop and teach workshops to help train the next generation of researchers in good data analysis and management practices to enable individual research progress and open and reproducible research.

I usually manage data in Excel and it's terrible and I want to do it better. I'm organizing GIS data and it's becoming a nightmare. My advisor insists that we store 50,000 barcodes in a spreadsheet, and something must be done about that. I'm having a hard time analyzing microarray, SNP or multivariate data with Excel and Access. I want to use public data. I work with faculty at undergrad institutions and want to teach data practices, but I need to learn it myself first. I'm interested in going in to industry and companies are asking for data analysis experience. I'm trying to reboot my lab's workflow to manage data and analysis in a more sustainable way. I'm re-entering data over and over again by hand and know there's a better way. I have overwhelming amounts of data. I'm tired of feeling out of my depth on computation and want to increase my confidence. Sentiments on data within the NSF BIO Centers (BEACON, SESYNC, NESCent, iPlant, iDigBio)

Reproducible Research Data analysis – Data and analysis can be re-created by anyone Including you in the future! Repeat analysis on updated data Repeat analyses on similar datasets – Scripted data management and analysis Manages and analyzes Provides a record of what was done Easy to edit and re-run

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Updated Raw Data Working Data

Raw Data Cleaned Data Data Cleaning Script Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values

Raw Data Cleaned Data Data Cleaning Script Summarizing Script Subset data for particular project Transform variables Average, min, max by group imputation Working Data

Raw Data Cleaned Data Analysis Results Data Cleaning Script Summarizing Script Analysis Script Linear Models Mixed Models Search for Correlates Loop! General Functions Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Plotting Table making Working Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Paper Writing Script Working Data

Raw Data Cleaned Data Working Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script New Raw Data Cleaned Data Working Data Analysis Results FiguresTables

Raw Data Cleaned Data Summarized Data Analysis Results FiguresTables Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Cleaned Data Working Data Analysis Results FiguresTables Re-use and edit scripts for new projects New Raw Data

Raw Data Cleaned Data Analysis Results FiguresTables Publication Fame Data Cleaning Script Summarizing Script Analysis Script Figure Script Results Formatting Script Working Data Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values Subset data for particular project Transform variables Average, min, max by group imputation Plotting Table making Linear Models Mixed Models Search for Correlates Loop! General Functions

Data Cleaning Script Summarizing Script Analysis Script Results Formatting Script Univariate & Bivariate EDA Find/Replace values Merge grouping labels Re-code variables Fix typos Standardize entries Convert dates Convert variable formats Missing values Subset data for particular project Transform variables Average, min, max by group imputation Linear Models Mixed Models Search for Correlates Loop! General Functions Plotting Table making Excel OpenRefine SQL databases R / RStudio Raw Data R markdown / RStudio commandline shell Thursday morning Thursday morning Friday morning Friday afternoon