Presentation on theme: "NIH fMRI Summer Course fMRI and BIG DATA"— Presentation transcript:
1NIH fMRI Summer Course fMRI and BIG DATA Daniel HandwerkerAugust 4, 2014
2What is big data?An fMRI scan generates 1-10 gigabytes of reconstructed data in 2 hours.A genome sequencer can generate 1 terabyte of data per week (Intramural NIAID has 7)Google, Microsoft, Amazon, etc regularly deal with petabytes of dataThe Large Hadron Collider generates 1 petabyte of data per secondLHC #:
3Big Data“The minimum amount of data required to make one’s peers feel uncomfortable with the size of one’s data.” – Tal Yarkonibig-data-n-a-kind-of-black-magic/“If you can solve your alleged big data problems by installing a new hard drive or some more RAM, you don’t really have a big data problem”Big data is a fundamental
4Rethinking fMRI with big data How do we collect, store, and access data?Which analyses steps require human interaction?Which analyses steps can’t require human interaction?What types of automation mistakes are acceptable?Why is this important?
5OverviewWhy are big fMRI data interesting?Why isn’t everyone doing this?
6A typical fMRI study without big data Design a studyCollect data from peopleA few people in a group look at the raw images, time series, and statistical mapsThe same people (hopefully) write up their finding and publishThe data are kept on a hard drive somewhere or archived
7How does this typical fMRI study limit our understanding of the brain? With fMRI, either no one bothers to do a similar study or a similar study is done and gets a slightly or very different result. (Not necessarily a bad thing).Why does this happen?Data quality?Experimental design decisions?Sample size?Analysis decisions?We need to be able to directly compare these typical studies or design studies that don’t have these issuesMurphy, K., Garavan, H., An empirical investigation into the number of subjects required for an event-related fMRI study. NeuroImage 22, 879–885. doi: /j.neuroimage (at least 20 subjects)David, S.P., Ware, J.J., Chu, I.M., Loftus, P.D., Fusar-Poli, P., Radua, J., Munafò, M.R., Ioannidis, J.P.A., Potential Reporting Bias in fMRI Studies of the Brain. PLoS ONE 8, e doi: /journal.pone (at least 45 subjects)
8More data means means more interesting and useful science Consistently identifying subtle fMRI magnitude differences between populations requires a lot of dataMeta-analyses using data can identify things that were ignored in the original studiesDavid, S.P., Ware, J.J., Chu, I.M., Loftus, P.D., Fusar-Poli, P., Radua, J., Munafò, M.R., Ioannidis, J.P.A., Potential Reporting Bias in fMRI Studies of the Brain. PLoS ONE 8, e doi: /journal.pone
9If we want to understand brain disease, we need big data NIMH Research Domain Criteria (RDoC) We can’t study brain disorder X vs controls anymore A proposed example study of fear circuitry ALL patients presenting at an anxiety disorders clinic Independent variable: fMRI amygdala response to fearful stimuli Dependent variables: Symptom measures on fear & distress Hypothesize which responses affect severity of symptoms Outcome: Predictive validity studies leading to improved treatment selectionRDoC is starting to see some genetics findings, but no big fMRI findings yet
10While important, these projects are rare, risky, and expensive Where does big fMRI data come from? Large projects with centralized data collection and preliminary analyses1200 healthy, adult subjects ages will be scanned by multiple imaging modalities:Resting-state fMRI (R-fMRI)Task-evoked fMRI (T-fMRI)Diffusion MRI (dMRI) with tractography analysisSubjects include 300 twin pairs and their siblings, allowing robust analysis of the heritability of brain connectivity and associated behavior traitsAll 1200 subjects scanned on a single Skyra 3T scanner, beginning in Summer 2012200 of these subjects will also have R-fMRI, T-fMRI and dMRI imaging on a single 7T scannerCollecting extensive demographic and behavioral data on each subject, as well as blood samples to be used for genotyping in Year 5 of the projectT-fMRI protocols will maximize coverage of different areas of the brainOverlap of tasks with those in other large studies will facilitate cross-study comparisonStandardized imaging protocols and processing pipelines for maximal comparabilityMEG/EEG imaging on a subset of 100 subjects, including both resting-state MEG/EEG and task-evoked MEG/EEG, using some of the same tasks and timing as will be used in T-fMRIAll HCP-generated methods will made available as they are finalizedHCP data will be freely available through quarterly releases of our ConnectomeDB database, beginning Fall 2012Freely-available Connectome Workbench analysis and visualizationWhile important, these projects are rare, risky, and expensive
11Where does big fMRI data come from? Retrospective Data Sharing
12Where does big fMRI data come from? Prospective Data Sharing Also from
13What types of data are analyzed? Full fMRI time seriesFMRI Data CenterOpenfmri.orgcentral.xnat.orgfcon_1000.projects.nitrc.orgincf.org/resources/data-space/Statistical Maps or clustersBrainMap.orgNeurosynth.orgThe prospective stuff tends to be more single site, but also rarely as large
14How are cross-site data collected? Consortia with common experimental paradigmsADNI (Alzheimer’s Disease)Human Connectome ProjectsConsortia with common populationsNDAR (National Database for Autism Research)FITBAR (Traumatic Brain Injury)dbGaP (Genotypes + phenotypes)ABIDE (Autism)ADHD-200ENIGMA (Imaging + Genetics)FBIRN (Standardization across sites)
15Benefits of Big Data Sharing Much larger sample sizesPeople without resources to collect high quality fMRI data can make important contributions to the fieldExperts in other areas can offer new insightsBetter ways to develop and compare analysis methodsBetter understanding of replication attemptsIf there is inconsistency, you can then directly compare two data sets
16And the exciting findings are… Thanks to these fMRI big datasets what have we learned about the brain and disease?Seriously, what?(This isn’t to say a lot of interesting stuff, particularly methods development, isn’t happening thanks to these big data sets)Why is this?The field is still youngBig data is more than just putting data together
17The field is still young Building big data capabilities takes time USA started funding the public GeneBank in 1982The National Center for Biotechnology Information at NLM launched & look over GeneBank in 1989Basic Local Alignment Search Tool (BLAST) algorithm to rapidly match sequences came out in 1990The first human genomes were released in 2003.Nearly real time tracking of a dangerous bacteria at NIH Clinical center in 2012Images from:Dates are from wikipedia pages and Ruth Kirschstein’s biography:Images from:
18Types of problems that can’t be solved with a bigger hard drive Universal big data issuesGiving multiple people quick access to dataSecurityApplication specific issuesQuality control with less human interactionMeaningful queries of dataDocumentation and provenance
19The limits for fMRI aren’t technological The general technology needed to securely store, share, run computations, and visualize big fMRI data existsThis doesn’t mean technology can’t get much betterTechnology isn’t freeIf nothing else, things like Amazon or Microsoft cloud computer systems are already big, modular, and fast enough for most of what we currently call big data with fMRIThe really specialized technology for massive genomics, physics, meteorology, or military projects work even better & the rate limiting step is funding for a similar system or space/compute time on existing systems.
20The failure of the FMRI Data Center Scientists weren’t motivated to contributeNo social pressureIt was really hard to enter data and information and submit itfMRIDC needed to organize data in whatever format it got (unorganized hard drives people pulled from their computers)Removing identifiable informationData shared by mailShort-term funding disappearedVan Horn, J.D., Gazzaniga, M.S., 2012.Why share data? Lessons learned from the fMRIDC. NeuroImage
21Human Connectome Project example How do you share 50+ Terabytes of fMRI data? If you want the first 500 subject’s of data, they’ll mail you five 4 Terabyte drives for $1000.Parts of the data are available online to download or explore.This doesn’t include any of the genetic or MEG data.
22What are the benefits and weaknesses? People aren’t slamming their local networks downloading this stuffWeaknessesExpensive bar to data accessHard drives break in transferEvery data correction is a new hard driveSomeone has to write and mail these hard drives
23Downloading from websites Download large, compressed filesAccess parts of data through databases (xnat)A science DMZ would speed of transfershttps://fasterdata.es.net/science-dmz/science-dmz-architecture/
24National Database for Autism Research NDAR data lives on the aws.amazon.comData is at one locationUsers need accounts & access, but don’t need local dataComputer use costs money, but you can scale with the computational power you need & you don’t need to buy super-powerful local computersAs fast or slow as other AWS websites
25What to store and how to organize it? Statistical maps vs reconstructed vs raw fMRI data?Basic scanning parameters vs detailed pulse sequence descriptions?The strength/brand of the scanner vs full hardware specifications?Task designs vs presentation scripts?Behavioral and physiological data and keys to understand what it means
26Documentation & Replication Discussion of replicating an experiment using superconducting bar magnets to detect gravity waves in the 1970’s (not MRI) “Should the bar be cast from the same batch of metal? Should we buy the piezoelectric crystals form the same manufacturer as the original ones? Should we glue them to the bar using the same adhesive as before, bought from the same manufacturer? Should the amplifiers be identical in every visible respect, or is an amplifier build to certain input and output specifications “the same” as another amplifier built to the same specifications? Should we be making sure that the length and diameter of every wire is the same? Should we be making sure that the color of the insulation on the wires is the same? Clearly, somewhere one as to stop asking such questions and use a commonsense notion of “the same.” the trouble is that in frontier science, tomorrow’s common sense is still being formed.” Gravity’s Shadow by Harry Collins p 123
27Documentation & Replication Discussion of replicating an experiment using superconducting bar magnets to detect gravity waves in the 1970’s (not MRI) “Should the bar be cast from the same batch of metal? Should we buy the piezoelectric crystals form the same manufacturer as the original ones? Should we glue them to the bar using the same adhesive as before, bought from the same manufacturer? Should the amplifiers be identical in every visible respect, or is an amplifier build to certain input and output specifications “the same” as another amplifier built to the same specifications? Should we be making sure that the length and diameter of every wire is the same? Should we be making sure that the color of the insulation on the wires is the same? Clearly, somewhere one as to stop asking such questions and use a commonsense notion of “the same.” the trouble is that in frontier science, tomorrow’s common sense is still being formed.” Gravity’s Shadow by Harry Collins p 123To my astonishment, on reading this passage, Gary Sanders told me that in 1988 or 1989 he had been in a laboratory near Tokyo when a Russian physicist, examining a Japanese group’s apparatus, declared their results on Tritum Beta decay to be invalid because they had used wires with red insulation! Apparently the red dye contains traces of radioactive uranium, which can confound the measurements.
28Variation in replications can be good The most replicated finding in drug abuse research is, “Rats will intravenously self-administer (IVSA) cocaine.”The types of factors that can reduce your "effect" to a null effect or change the outcome include:Catheter diameter or lengthCocaine dose available in each infusionRate of infusion/concentration of drugAge, Sex, Strain, or vendor source of the ratsTime of day in which rats are run (not just light/dark* either)Food restriction status & last food availabilityPair vs single housing & “Enrichment”Experimenter choice of smelly personal care productsDirty/clean labcoat (I kid you not)Handling of the rats on arrival from vendorFire-alarmCage-change dayMinor rat illnessLocation of operant box in the room (floor vs ceiling, near door or away)Ambient temperature of vivarium or test roomSchedule- weekends off? seven days a week? 1 hr? 2hr? 6 hr? access sessionsSchedule- are reinforcer deliveries contingent upon one lever press? five? does the requirement progressively increase with each successive infusion?Animal loss from the study for various reasonsWe only know this because so many replications were done and documentedthe-most-replicated-finding-in-drug-abuse-science/
29Even if studies aren’t consistent, documentation and provenance is vital Neuroimaging Data Model nidm.nidash.org/Automated ontology for understanding statistical mapsNIF (Neuroscience Information Framework)Ontology for all of neuroscienceData processing scriptsAFNI software realtime output and history flagsNothing exists that is easy to populate, covers most aspects of fMRI data collection & processing, and easy to pull information from databasesBad note-taking and documentation is an issue for “small” fMRI studies too.
30Ethics of data sharingInternal review boards require every study to have a purpose. Data can’t be used outside the scope of it’s purpose of collectionIf big data is useful, we will learn unexpected things about individual’s future healthAnonymity is impossible to maintain by computer security aloneFirst point is most significant for older studies. Newer studies can account for this in their IRB approval and consenting processes
31Anonymity doesn’t exist with big data Sisters discovered they didn’t share a fatherA neuroscientist who was adopted used 23andMe and Facebook to identify some probable half-siblings (but decided not to contact them)23 and You, by Virginia Hughes (12/4/2013) https://medium.com/matter-archive/23-and-you-66e87553d22c
32MRI and anonymityFacebook post:“Hey look at this cool picture of my brain I got as thanks for participating in a study”The data is publicly shared. Someone takes the picture, matches it to a slice in shared data and then get the genome & diagnostic info linked to the study.Brain slice from:Brain slice from:
33How to prevent loss of privacy? MRI brain slices and genomes are not considered identifiable information in US health privacy lawTry to make this type of matching difficultBlur faces on high quality imagesDon’t give participants the resultsLock data behind data use agreements
34Washington University - University of Minnesota Consortium of the Human Connectome Project: Data Use Agreement (abridged)I will not attempt to establish the identity of or attempt to contact any of the included human subjects.I understand that under no circumstances will the code that would link these data to Protected Health Information be given to me, nor will any additional information about individual human subjects be released to me.I will comply with all relevant rules and regulations imposed by my institution. This may mean that I need my research to be approved or declared exempt by a committee that oversees research on human subjects, e.g. my IRB or Ethics Committee. The released HCP data are not considered de-identified, insofar as certain combinations of HCP Restricted Data (available through a separate process) might allow identification of individuals.I will acknowledge the use of WU-Minn HCP data… when publicly presenting any results or algorithms that benefitted from their use.Authors of publications or presentations using WU-Minn HCP data should cite relevant publications describing the methods used by the HCP…Failure to abide by these guidelines will result in termination of my privileges to access WU-Minn HCP data.
35Conclusions Big data in fMRI is growing very rapidly It has the potential to greatly increase our understanding of the brainPractical, technological, financial, and ethical barriers remain.