NIH fMRI Summer Course fMRI and BIG DATA

NIH fMRI Summer Course: fMRI and Big Data
Daniel Handwerker, August 4, 2014

What is big data?
- An fMRI scan generates 1-10 gigabytes of reconstructed data in 2 hours (a back-of-the-envelope sketch follows)
- A genome sequencer can generate 1 terabyte of data per week (intramural NIAID has 7 of them)
- Google, Microsoft, Amazon, etc. regularly deal with petabytes of data
- The Large Hadron Collider generates 1 petabyte of data per second
LHC number: http://home.web.cern.ch/about/updates/2013/04/animation-shows-lhc-data-processing
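To show roughly where the fMRI figure comes from, here is a minimal back-of-the-envelope sketch; the matrix size, bit depth, and repetition time are illustrative assumptions, not values from the course.

```python
# Rough estimate of raw fMRI data volume for one 2-hour session (illustrative numbers).
voxels_per_volume = 64 * 64 * 36      # assumed acquisition matrix and slice count
bytes_per_voxel = 2                   # assumed 16-bit integers off the scanner
tr_seconds = 2.0                      # assumed repetition time
session_minutes = 120                 # the 2-hour session from the slide

volumes = int(session_minutes * 60 / tr_seconds)
total_bytes = voxels_per_volume * bytes_per_voxel * volumes
print(f"{volumes} volumes ~ {total_bytes / 1e9:.1f} GB of raw EPI data")
# ~1.1 GB; higher-resolution, multi-echo, or multi-run protocols push this toward the 10 GB end.
```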

Big Data
“The minimum amount of data required to make one’s peers feel uncomfortable with the size of one’s data.” – Tal Yarkoni
“If you can solve your alleged big data problems by installing a new hard drive or some more RAM, you don’t really have a big data problem.”
http://www.talyarkoni.org/blog/2014/05/19/big-data-n-a-kind-of-black-magic/
Big data is a fundamental change in how data are collected, stored, and analyzed, not just a larger volume of data.

Rethinking fMRI with big data
- How do we collect, store, and access data?
- Which analysis steps require human interaction?
- Which analysis steps can’t require human interaction?
- What types of automation mistakes are acceptable?
- Why is this important?

Overview
- Why are big fMRI data interesting?
- Why isn’t everyone doing this?

A typical fMRI study without big data
- Design a study
- Collect data from 15-30 people
- A few people in a group look at the raw images, time series, and statistical maps
- The same people (hopefully) write up their findings and publish
- The data are kept on a hard drive somewhere or archived

How does this typical fMRI study limit our understanding of the brain?
- With fMRI, either no one bothers to do a similar study, or a similar study is done and gets a slightly or very different result (not necessarily a bad thing)
- Why does this happen? Data quality? Experimental design decisions? Sample size? Analysis decisions?
- We need to be able to directly compare these typical studies or design studies that don’t have these issues
Murphy, K., Garavan, H., 2004. An empirical investigation into the number of subjects required for an event-related fMRI study. NeuroImage 22, 879–885. doi:10.1016/j.neuroimage.2004.02.005 (at least 20 subjects)
David, S.P., Ware, J.J., Chu, I.M., Loftus, P.D., Fusar-Poli, P., Radua, J., Munafò, M.R., Ioannidis, J.P.A., 2013. Potential Reporting Bias in fMRI Studies of the Brain. PLoS ONE 8, e70104. doi:10.1371/journal.pone.0070104 (at least 45 subjects)

More data means more interesting and useful science
- Consistently identifying subtle fMRI magnitude differences between populations requires a lot of data
- Meta-analyses using shared data can identify things that were ignored in the original studies
David, S.P., Ware, J.J., Chu, I.M., Loftus, P.D., Fusar-Poli, P., Radua, J., Munafò, M.R., Ioannidis, J.P.A., 2013. Potential Reporting Bias in fMRI Studies of the Brain. PLoS ONE 8, e70104. doi:10.1371/journal.pone.0070104

If we want to understand brain disease, we need big data
NIMH Research Domain Criteria (RDoC): we can’t study brain disorder X vs. controls anymore
A proposed example study of fear circuitry:
- ALL patients presenting at an anxiety disorders clinic
- Independent variable: fMRI amygdala response to fearful stimuli
- Dependent variables: symptom measures on fear & distress
- Hypothesize which responses affect severity of symptoms
- Outcome: predictive validity studies leading to improved treatment selection
RDoC is starting to see some genetics findings, but no big fMRI findings yet
http://www.nimh.nih.gov/research-priorities/rdoc/nimh-research-domain-criteria-rdoc.shtml

Where does big fMRI data come from? Large projects with centralized data collection and preliminary analyses
Human Connectome Project:
- 1200 healthy adult subjects, ages 22-35, will be scanned with multiple imaging modalities: resting-state fMRI (R-fMRI), task-evoked fMRI (T-fMRI), and diffusion MRI (dMRI) with tractography analysis
- Subjects include 300 twin pairs and their siblings, allowing robust analysis of the heritability of brain connectivity and associated behavioral traits
- All 1200 subjects scanned on a single Skyra 3T scanner, beginning in Summer 2012
- 200 of these subjects will also have R-fMRI, T-fMRI, and dMRI imaging on a single 7T scanner
- Extensive demographic and behavioral data collected on each subject, as well as blood samples to be used for genotyping in Year 5 of the project
- T-fMRI protocols will maximize coverage of different areas of the brain; overlap of tasks with those in other large studies will facilitate cross-study comparison
- Standardized imaging protocols and processing pipelines for maximal comparability
- MEG/EEG imaging on a subset of 100 subjects, including both resting-state and task-evoked MEG/EEG, using some of the same tasks and timing as in T-fMRI
- All HCP-generated methods will be made available as they are finalized
- HCP data will be freely available through quarterly releases of the ConnectomeDB database, beginning Fall 2012
- Freely available Connectome Workbench analysis and visualization tools
While important, these projects are rare, risky, and expensive

Where does big fMRI data come from? Retrospective Data Sharing http://fcon_1000.projects.nitrc.org/indi/IndiRetro.html

Where does big fMRI data come from? Prospective Data Sharing Also from http://fcon_1000.projects.nitrc.org/indi/IndiRetro.html

What types of data are analyzed?
Full fMRI time series:
- FMRI Data Center
- Openfmri.org
- central.xnat.org
- fcon_1000.projects.nitrc.org
- incf.org/resources/data-space/
Statistical maps or clusters:
- BrainMap.org
- Neurosynth.org
Prospectively shared data tend to be more single-site, and also rarely as large

How are cross-site data collected?
Consortia with common experimental paradigms:
- ADNI (Alzheimer’s Disease)
- Human Connectome Projects
Consortia with common populations:
- NDAR (National Database for Autism Research)
- FITBIR (Traumatic Brain Injury)
- dbGaP (genotypes + phenotypes)
- ABIDE (Autism)
- ADHD-200
- ENIGMA (imaging + genetics)
- FBIRN (standardization across sites)

Benefits of Big Data Sharing
- Much larger sample sizes
- People without resources to collect high-quality fMRI data can make important contributions to the field
- Experts in other areas can offer new insights
- Better ways to develop and compare analysis methods
- Better understanding of replication attempts: if there is inconsistency, you can directly compare two data sets

And the exciting findings are…
Thanks to these fMRI big datasets, what have we learned about the brain and disease? Seriously, what?
(This isn’t to say a lot of interesting stuff, particularly methods development, isn’t happening thanks to these big data sets)
Why is this?
- The field is still young
- Big data is more than just putting data together

The field is still young
Building big data capabilities takes time:
- The USA started funding the public GenBank in 1982
- The National Center for Biotechnology Information at NLM launched & took over GenBank in 1989
- The Basic Local Alignment Search Tool (BLAST) algorithm to rapidly match sequences came out in 1990
- The first human genomes were released in 2003
- Nearly real-time tracking of a dangerous bacterium at the NIH Clinical Center in 2012
Images from: http://www.davelunt.net/evophylo/2012/12/printing-out-genbank-nucleotide-sequences-1984/
Dates are from Wikipedia pages and Ruth Kirschstein’s biography: http://www.nih.gov/about/kirschstein/Always_There.pdf

Types of problems that can’t be solved with a bigger hard drive
Universal big data issues:
- Giving multiple people quick access to data
- Security
Application-specific issues:
- Quality control with less human interaction (see the sketch below)
- Meaningful queries of the data
- Documentation and provenance
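As one example of quality control with less human interaction, here is a minimal sketch of an automated check that flags high-motion runs instead of having someone eyeball every time series; the file layout, rotation units, and thresholds are assumptions, not a prescribed standard.

```python
import numpy as np

def flag_high_motion(motion_params_file, fd_threshold_mm=0.5, max_bad_fraction=0.2):
    """Flag a run if too many volumes exceed a framewise-displacement threshold.

    Assumes a plain-text file with one row per volume and six rigid-body motion
    parameters: three translations in mm followed by three rotations in radians
    (column order and units vary between registration tools).
    """
    params = np.loadtxt(motion_params_file)
    rotations_mm = params[:, 3:] * 50.0                 # approximate rotations as arc length on a 50 mm sphere
    displacement = np.hstack([params[:, :3], rotations_mm])
    fd = np.abs(np.diff(displacement, axis=0)).sum(axis=1)   # framewise displacement per volume pair
    bad_fraction = float(np.mean(fd > fd_threshold_mm))
    return bad_fraction > max_bad_fraction, bad_fraction

# Example: flagged, frac = flag_high_motion("sub-0001_motion.txt")
```

A human still reviews the flagged runs; the point is that nobody has to open the ones that pass.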

The limits for fMRI aren’t technological
- The general technology needed to securely store, share, run computations on, and visualize big fMRI data exists
- This doesn’t mean technology can’t get much better
- Technology isn’t free
- If nothing else, systems like Amazon’s or Microsoft’s cloud computing services are already big, modular, and fast enough for most of what we currently call big data with fMRI
- The really specialized technology built for massive genomics, physics, meteorology, or military projects works even better, and the rate-limiting step is funding for a similar system or space/compute time on existing systems

The failure of the fMRI Data Center
- Scientists weren’t motivated to contribute; there was no social pressure
- It was really hard to enter data and information and submit it
- fMRIDC needed to organize data in whatever format it got (unorganized hard drives people pulled from their computers)
- Removing identifiable information
- Data shared by mail
- Short-term funding disappeared
Van Horn, J.D., Gazzaniga, M.S., 2012. Why share data? Lessons learned from the fMRIDC. NeuroImage

Human Connectome Project example
How do you share 50+ terabytes of fMRI data?
- If you want the first 500 subjects’ data, they’ll mail you five 4-terabyte drives for $1000 (see the comparison below)
- Parts of the data are available online to download or explore
- This doesn’t include any of the genetic or MEG data
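A quick, hedged comparison of downloading versus shipping drives helps explain why the drives get mailed; the link speeds below are illustrative assumptions, not measurements from the HCP.

```python
# Compare downloading ~20 TB (five 4 TB drives) with shipping the drives.
data_tb = 20
data_bits = data_tb * 1e12 * 8

for label, mbps in [("100 Mbps campus link", 100),
                    ("1 Gbps link", 1000),
                    ("10 Gbps science DMZ", 10000)]:
    days = data_bits / (mbps * 1e6) / 86400
    print(f"{label}: ~{days:.1f} days of sustained transfer")
# ~18.5, ~1.9, and ~0.2 days respectively, assuming the link is saturated the whole time,
# versus a couple of days in the mail plus the time to copy the drives.
```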

What are the benefits and weaknesses?
Benefit: people aren’t slamming their local networks downloading this stuff
Weaknesses:
- An expensive bar to data access
- Hard drives break in transit
- Every data correction means a new hard drive
- Someone has to write and mail these hard drives

Downloading from websites
- Download large, compressed files (see the sketch below)
- Access parts of the data through databases (XNAT)
- A science DMZ would speed up transfers: https://fasterdata.es.net/science-dmz/science-dmz-architecture/
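For the large-compressed-file route, here is a minimal sketch of streaming a big archive to disk with the Python requests library rather than holding it in memory; the URL is a placeholder, and real repositories usually sit behind authentication and data use agreements.

```python
import requests

def stream_download(url, out_path, chunk_bytes=8 * 1024 * 1024):
    """Stream a large archive to disk in chunks instead of loading it all into memory."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_bytes):
                f.write(chunk)

# Placeholder URL, for illustration only:
# stream_download("https://example.org/shared_fmri_release.tar.gz", "release.tar.gz")
```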

National Database for Autism Research
- NDAR data live on Amazon Web Services (aws.amazon.com); see the access sketch below
- Data are in one location; users need accounts & access, but don’t need local copies of the data
- Compute use costs money, but you can scale to the computational power you need & you don’t need to buy super-powerful local computers
- As fast or slow as other AWS websites
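Because the data sit in Amazon S3, access can be scripted once an account and credentials are in place. A minimal sketch with boto3 follows; the bucket and object key are hypothetical placeholders, not real NDAR paths, and the actual repository layers its own access controls on top.

```python
import boto3

# Credentials come from the usual AWS configuration (environment variables,
# ~/.aws/credentials, or temporary tokens issued by the data repository).
s3 = boto3.client("s3")

# Hypothetical bucket and object key, for illustration only.
s3.download_file(
    Bucket="example-imaging-bucket",
    Key="study01/sub-0001/func/rest_run1.nii.gz",
    Filename="rest_run1.nii.gz",
)
```

The same credentials let you launch compute next to the data instead of downloading it at all, which is the scaling argument on the slide.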

What to store and how to organize it?
- Statistical maps vs. reconstructed vs. raw fMRI data?
- Basic scanning parameters vs. detailed pulse sequence descriptions?
- The strength/brand of the scanner vs. full hardware specifications?
- Task designs vs. presentation scripts?
- Behavioral and physiological data, and keys to understand what they mean
(A sketch of one possible metadata record follows)
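Whatever the answers, the metadata have to live in a form that people and programs can query. Here is a hedged sketch of a per-scan JSON "sidecar" record; the field names and file name are illustrative assumptions, not any established standard.

```python
import json

# Illustrative per-scan metadata record; field names are assumptions, not a standard.
scan_metadata = {
    "scanner": {"vendor": "Siemens", "model": "Skyra", "field_strength_T": 3.0},
    "sequence": {"type": "EPI", "tr_s": 2.0, "te_ms": 30.0, "voxel_size_mm": [3.0, 3.0, 3.0]},
    "task": {"name": "rest", "stimulus_script": "rest_instructions_v2.py"},
    "physio": {"cardiac": True, "respiration": True, "sampling_rate_hz": 50},
    "data_level": "reconstructed",   # vs. "raw" or "statistical_map"
}

with open("sub-0001_task-rest_bold.json", "w") as f:
    json.dump(scan_metadata, f, indent=2)
```

The trade-off on the slide is exactly how deep such a record should go: scanner brand or full hardware specification, task design or the presentation script itself.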

Documentation & Replication
Discussion of replicating an experiment using superconducting bar magnets to detect gravity waves in the 1970s (not MRI):
“Should the bar be cast from the same batch of metal? Should we buy the piezoelectric crystals from the same manufacturer as the original ones? Should we glue them to the bar using the same adhesive as before, bought from the same manufacturer? Should the amplifiers be identical in every visible respect, or is an amplifier built to certain input and output specifications ‘the same’ as another amplifier built to the same specifications? Should we be making sure that the length and diameter of every wire is the same? Should we be making sure that the color of the insulation on the wires is the same? Clearly, somewhere one has to stop asking such questions and use a commonsense notion of ‘the same.’ The trouble is that in frontier science, tomorrow’s common sense is still being formed.” – Gravity’s Shadow by Harry Collins, p. 123

“To my astonishment, on reading this passage, Gary Sanders told me that in 1988 or 1989 he had been in a laboratory near Tokyo when a Russian physicist, examining a Japanese group’s apparatus, declared their results on tritium beta decay to be invalid because they had used wires with red insulation! Apparently the red dye contains traces of radioactive uranium, which can confound the measurements.”

Variation in replications can be good
The most replicated finding in drug abuse research is: “Rats will intravenously self-administer (IVSA) cocaine.”
The types of factors that can reduce your “effect” to a null effect or change the outcome include:
- Catheter diameter or length
- Cocaine dose available in each infusion
- Rate of infusion / concentration of drug
- Age, sex, strain, or vendor source of the rats
- Time of day in which rats are run (not just light/dark either)
- Food restriction status & last food availability
- Pair vs. single housing & “enrichment”
- Experimenter choice of smelly personal care products
- Dirty/clean lab coat (I kid you not)
- Handling of the rats on arrival from the vendor
- Fire alarms
- Cage-change day
- Minor rat illness
- Location of the operant box in the room (floor vs. ceiling, near the door or away)
- Ambient temperature of the vivarium or test room
- Schedule: weekends off or seven days a week? 1 hr, 2 hr, or 6 hr access sessions?
- Schedule: are reinforcer deliveries contingent upon one lever press? Five? Does the requirement progressively increase with each successive infusion?
- Animal loss from the study for various reasons
We only know this because so many replications were done and documented
http://scientopia.org/blogs/drugmonkey/2014/07/08/the-most-replicated-finding-in-drug-abuse-science/

Even if studies aren’t consistent, documentation and provenance are vital
- Neuroimaging Data Model (NIDM), nidm.nidash.org/: an automated ontology for understanding statistical maps
- NIF (Neuroscience Information Framework): an ontology for all of neuroscience
- Data processing scripts
- AFNI software: real-time output and history flags
Nothing yet exists that is easy to populate, covers most aspects of fMRI data collection & processing, and makes it easy to pull information out of databases (a minimal provenance-logging sketch follows).
Bad note-taking and documentation is an issue for “small” fMRI studies too.
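One low-tech way to keep provenance from being an afterthought is to have the processing script record what it ran. The sketch below assumes the processing steps are shell commands launched from Python; the step names are hypothetical, and this makes no claim to match any particular pipeline or the AFNI history mechanism.

```python
import json
import subprocess
import time

provenance = []

def run_step(command):
    """Run one processing command and record what was run, when, and its exit status."""
    start = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    provenance.append({
        "command": command,
        "started": start,
        "duration_s": round(time.time() - start, 2),
        "return_code": result.returncode,
    })
    return result

# Hypothetical steps for illustration; a real pipeline would list its own tools here.
run_step(["echo", "motion-correct sub-0001"])
run_step(["echo", "smooth sub-0001"])

with open("sub-0001_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```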

Ethics of data sharing
- Institutional review boards require every study to have a purpose; data can’t be used outside the scope of the purpose they were collected for
- If big data is useful, we will learn unexpected things about individuals’ future health
- Anonymity is impossible to maintain by computer security alone
The first point is most significant for older studies. Newer studies can account for this in their IRB approval and consenting processes.

Anonymity doesn’t exist with big data
- Sisters discovered they didn’t share a father
- A neuroscientist who was adopted used 23andMe and Facebook to identify some probable half-siblings (but decided not to contact them)
“23 and You,” by Virginia Hughes (12/4/2013): https://medium.com/matter-archive/23-and-you-66e87553d22c

MRI and anonymity
Facebook post: “Hey, look at this cool picture of my brain I got as thanks for participating in a study”
The data are publicly shared. Someone takes the picture, matches it to a slice in the shared data, and then gets the genome & diagnostic info linked to the study.
Brain slice from: http://en.wikipedia.org/wiki/Neuroimaging#mediaviewer/File:Sagittal_brain_MRI.JPG

How to prevent loss of privacy?
- MRI brain slices and genomes are not considered identifiable information in US health privacy law
- Try to make this type of matching difficult
- Blur faces on high-quality images (see the sketch below)
- Don’t give participants the results
- Lock data behind data use agreements
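As a hedged illustration of removing facial features from a shared anatomical image, here is a crude sketch that simply zeros voxels in front of and below the brain using nibabel; the cut planes and orientation assumption are rough, and dedicated defacing or face-blurring tools do this far more carefully.

```python
import nibabel as nib
import numpy as np

def crude_deface(in_path, out_path, face_fraction=0.35, bottom_fraction=0.25):
    """Zero a rough face region of an anatomical image (illustration only)."""
    img = nib.load(in_path)
    data = np.asarray(img.get_fdata()).copy()
    nx, ny, nz = data.shape[:3]
    # Assumes an RAS-like orientation: anterior is high y, inferior is low z.
    data[:, int(ny * (1 - face_fraction)):, : int(nz * bottom_fraction)] = 0
    nib.save(nib.Nifti1Image(data, img.affine, img.header), out_path)

# crude_deface("sub-0001_T1w.nii.gz", "sub-0001_T1w_defaced.nii.gz")
```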

Washington University - University of Minnesota Consortium of the Human Connectome Project: Data Use Agreement (abridged)
- I will not attempt to establish the identity of or attempt to contact any of the included human subjects.
- I understand that under no circumstances will the code that would link these data to Protected Health Information be given to me, nor will any additional information about individual human subjects be released to me.
- I will comply with all relevant rules and regulations imposed by my institution. This may mean that I need my research to be approved or declared exempt by a committee that oversees research on human subjects, e.g. my IRB or Ethics Committee.
- The released HCP data are not considered de-identified, insofar as certain combinations of HCP Restricted Data (available through a separate process) might allow identification of individuals.
- I will acknowledge the use of WU-Minn HCP data… when publicly presenting any results or algorithms that benefitted from their use. Authors of publications or presentations using WU-Minn HCP data should cite relevant publications describing the methods used by the HCP…
- Failure to abide by these guidelines will result in termination of my privileges to access WU-Minn HCP data.
http://humanconnectome.org/data/data-use-terms/DataUseTerms-HCP-Open-Access-26Apr2013.pdf

Conclusions
- Big data in fMRI is growing very rapidly
- It has the potential to greatly increase our understanding of the brain
- Practical, technological, financial, and ethical barriers remain

Questions? http://xkcd.com/1403/