Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework. Maryann Martone, Ph.D., University of California, San Diego


1 Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework. Maryann Martone, Ph.D., University of California, San Diego

2 “Neural Choreography” “A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling ‘neural choreography’ -- the integrated functioning of neurons into brain circuits ... Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.... However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.” Akil et al., Science, Feb 11, 2011

3 “Data choreography” In that same issue, Science asked peer reviewers from last year about the availability and use of data. About half of those polled store their data only in their laboratories, which is not an ideal long-term solution. Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving. And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used. “...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science, 11 February 2011, Vol. 331, no. 6018, p. 649)

4 Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are. Data types include whole brain data (20 µm microscopic MRI), mosaic LM images (1 GB+), conventional LM images, individual cell morphologies, EM volumes & reconstructions, and solved molecular structures. No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases: a data federation problem.

5 NIF is an initiative of the NIH Blueprint consortium of institutes (http://neuinfo.org). What types of resources (data, tools, materials, services) are available to the neuroscience community? How many are there? What domains do they cover? What domains do they not cover? Where are they: web sites, databases, literature, supplementary material, PDF files, desk drawers? Who uses them? Who creates them? How can we find them? How can we make them better in the future?

6 We need more databases (?) NIF Registry: a catalog of neuroscience-relevant resources. > 5000 currently listed; > 2000 databases. And we are finding more every day.

7 But we have Google! The current web is designed to share documents. Documents are unstructured data. Much of the content of digital resources is part of the “hidden web”. Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

8 NIF must work with the ecosystem as it is today. NIF has developed a production technology platform for researchers to discover, share, access, analyze, and integrate neuroscience-relevant information. Semantically enabled search engine and interface that customizes results for neuroscience. System that searches the “hidden web”, i.e., content not well served by search engines. Data resources are predominantly relational, XML, text, RDF, OWL. Automated data harvesting technologies that produce dynamic indices of data content, including databases, web pages, text, XML, etc. Tools to make products and data available. Designed to be populated rapidly, with a process set up for progressive refinement.

9 NIF accomplishments (UCSD, Yale, Cal Tech, George Mason, Washington Univ). Assembled the largest searchable collation of neuroscience data on the web: data federation; resource registry (materials, data, tools, services); PubMed literature; full text of open access. The largest ontology for neuroscience. NIF search portal: simultaneous search over data, NIF catalog and biomedical literature. Neurolex Wiki: a community wiki serving neuroscience concepts. A unique technology platform. A reservoir of cross-disciplinary biomedical data expertise. NIF is poised to capitalize on the new tools and emphasis on big data and open science.

10 NIF data federation: > 180 sources; 350 M records. NIF was designed to be populated rapidly, with progressive refinement of data. [Charts: percentage of data records per data type. Microarray: 98%. Everything but microarray: labels include connectivity and brain activation foci.]

11 What do you mean by data? Databases come in many shapes and sizes. Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL). Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS). Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task. Registries: metadata; pointers to data sets or materials stored elsewhere. Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede. Single source: data acquired within a single context, e.g., Allen Brain Atlas. Researchers are producing a variety of information artifacts using a multitude of technologies.

12 What types of questions can I ask? We’d like to be able to find what is known: What is the average diameter of a Purkinje neuron? Is GRM1 expressed in cerebral cortex? What are the projections of hippocampus? What genes have been found to be upregulated in chronic drug abuse in adults? Is there a database of fMRI studies? What studies used my polyclonal antibody against GABA in humans? What rat strains have been used most extensively in research during the last 20 years? And what is not known: connections among data; gaps in knowledge. Without some sort of framework, this is very difficult to do.

13 What are the connections of the hippocampus? Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”. Query expansion: synonyms and related concepts; Boolean queries. Data sources are categorized by “data type” and level of nervous system. Common views across multiple sources. Tutorials for using the full resource when getting there from NIF. Link back to the record in the original source.
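The query expansion on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and the slides do not describe how NIF implements expansion internally (presumably the synonyms come from its ontology rather than a hand-written table).

```python
# Hypothetical synonym table for illustration; the entries are the
# synonyms shown on the slide, not an export from NIF.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term):
    """Build a Boolean OR query from a term and its known synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


expanded = expand_query("Hippocampus")
print(expanded)
```

A term with no entry in the table is passed through unchanged, so the expansion never narrows a search.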

14 Results are organized within a common framework. Relationship and field names used by the different sources: Connects to, Synapsed with, Synapsed by, Input region innervates, Axon innervates, Projects to, Cellular contact, Subcellular contact, Source site, Target site. Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn.

15 The scourge of neuroanatomical nomenclature: importance of the NIF semantic framework. NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions: Brain Architecture Management System (rodent); Temporal lobe.com (rodent); Connectome Wiki (human); Brain Maps (various); CoCoMac (primate cortex); UCLA Multimodal database (human fMRI); Avian Brain Connectivity Database (bird). Total: 1800 unique brain terms (excluding Avian). Number of exact terms used in > 1 database: 42. Number of synonym matches: 99. Number of 1st-order partonomy matches: 385.
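As a rough sketch of how a count such as "exact terms used in > 1 database" could be produced, the snippet below tallies terms across sources. The database names and term lists are invented for illustration, and treating case-insensitive equality as an "exact" match is an assumption; the slides do not say how the NIF analysis defined it.

```python
from collections import Counter

# Hypothetical term lists; not the contents of any real connectivity database.
terms_by_db = {
    "db_a": {"hippocampus", "striatum", "CA1"},
    "db_b": {"Hippocampus", "caudate nucleus"},
    "db_c": {"Ammon's horn", "striatum"},
}


def shared_exact_terms(terms_by_db):
    """Return terms that appear (case-insensitively) in more than one database."""
    counts = Counter()
    for terms in terms_by_db.values():
        # Normalize within each database so a term is counted once per source.
        counts.update({t.lower() for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


shared = shared_exact_terms(terms_by_db)
print(shared)
```

Note that "Ammon's horn" is not matched to "hippocampus" here: string comparison alone misses synonyms, which is the gap the synonym and partonomy counts on the slide point to.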

16 NIF’s minimum requirements for effective data sharing. You (and the machine) have to be able to find it: accessible through the web; annotations. You have to be able to use it: data type specified and in a usable form. You have to know what the data mean: some semantics; context (experimental metadata); provenance (where did the data come from?). Reporting neuroscience data within a consistent framework helps enormously.

17 What is an ontology? Branch of philosophy: a theory of what is. Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form, e.g., gene ontologies. [Diagram: Brain, Cerebellum, Purkinje Cell Layer, Purkinje cell, and neuron, linked by “has a” and “is a” relationships.]

18 “Ontology as mathematics, computer science or esperanto” (Andrey Rzhetsky and James A. Evans). You need to use ontology identifiers instead of strings. Blah, blah, ontology blah.

19 What can ontology do for us? Express neuroscience concepts in a way that is machine readable: classes are identified by unique identifiers; synonyms, lexical variants; definitions. Provide means of disambiguation of strings: nucleus part of cell; nucleus part of brain; nucleus part of atom. Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter. Properties. Provide universals for navigating across different data sources: semantic “index”. Perform reasoning. Link data through relationships, not just one-to-one mappings. “Concept-based queries”. “Esperanto!”
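The "nucleus" example can be made concrete with a small sketch: one string, three classes, each with its own identifier. The identifiers (`EX:0001` and so on), the `OntologyClass` type, and the `disambiguate` helper are all made up for illustration; they are not NIFSTD identifiers or part of any NIF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyClass:
    identifier: str  # hypothetical identifier, not a real ontology ID
    label: str
    part_of: str


CLASSES = [
    OntologyClass("EX:0001", "nucleus", "cell"),
    OntologyClass("EX:0002", "nucleus", "brain"),
    OntologyClass("EX:0003", "nucleus", "atom"),
]


def candidates(label):
    """All classes whose label matches the string: the ambiguity."""
    return [c for c in CLASSES if c.label == label.lower()]


def disambiguate(label, context):
    """Pick the single class whose parent matches the context, else None."""
    matches = [c for c in candidates(label) if c.part_of == context]
    return matches[0].identifier if len(matches) == 1 else None


print(len(candidates("nucleus")))
print(disambiguate("nucleus", "brain"))
```

Once data are annotated with the identifier rather than the string, the lookup step disappears and two sources that both use the same identifier can be joined directly.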

20 Power of unique identifiers: Are you the M Martone who... The Gene Wiki: community intelligence applied to human gene annotation. Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9. Ontologies for Neuroscience: What are they and What are they Good for? Larson SD, Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1. Three-dimensional electron microscopy reveals new details of membrane systems for Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst MJ, Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13. Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-Apr;36(2):3. Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900. Some analyses of forgetting of pictorial material in amnesic and demented patients. Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78.

21 I am not a number (but I should be). Full URI (Uniform Resource Identifier): http://orcid.org/1234567. Label: Maryann Elizabeth Martone. Synonym: ME Martone, M Martone, Maryann. Abbreviation: MEM. “Is a” and “has a” relationships: is that entity which has these properties: M Martone; Dept of Psychiatry, UCSD; Nelson Butters; Publications; Boston VA Hospital; Female. Text mining algorithms can discover a lot of things about me. ORCID project: author IDs.

22 NIF Semantic Framework: NIFSTD ontology. NIF covers multiple structural scales and domains of relevance to neuroscience. Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology. Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks for more complex representations. NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure.

23 “We studied the behavior of CA2-binding proteins in Ca2 neurons under high and low Ca2 conditions.” BioGrid; Allen Brain Atlas; Brain Info. NIF queries across 170+ independent databases.

24 But you don’t have what I need! Neurolex (http://neurolex.org; Stephen Larson/INCF). Provide a simple framework for defining the concepts required: cell, part of brain, subcellular structure, molecule. Community based: communities contribute their vocabularies. Reconcile and align concepts used by different domains. Each concept gets its own unique identifier. Creating a computable index for neuroscience data. INCF Demo D03.

25 Concept-based search: search by meaning. Search Google: GABAergic neuron. Search NIF: GABAergic neuron. NIF automatically searches for types of GABAergic neurons.
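Searching "by meaning" amounts to expanding a concept into its subclasses before querying. A minimal sketch, assuming a tiny hand-written “is a” hierarchy (the `IS_A` table and function names are hypothetical and are not taken from NIFSTD):

```python
# Hypothetical "is a" edges (child -> parent), for illustration only.
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "basket cell": "GABAergic neuron",
    "GABAergic neuron": "neuron",
    "pyramidal cell": "neuron",
}


def descendants(term):
    """Collect every class that is a (direct or indirect) subclass of term."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in IS_A.items():
            if parent == current:
                found.append(child)
                frontier.append(child)
    return sorted(found)


def expand_concept(term):
    """The search terms for a concept: the concept itself plus its subclasses."""
    return [term] + descendants(term)


print(expand_concept("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that mention only "Purkinje cell"; the expansion is what lets those records be returned.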

26 Esperanto! “The trouble is that if I make up all of my own URIs, my [data] has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two [data sets] with no URIs in common have no information that can be interrelated.” NIF favors reuse of identifiers rather than mapping. NIF imports many ontologies. Creating ontologies to be used as common building blocks: modularity and low semantic overhead are important. Many community ontologies are available, covering multiple domains. NIFSTD is available via web services and via Bioportal (http://bioportal.bioontology.org/). http://www.rdfabout.com/intro/#Introducing%20RDF

27 NIF Analytics: The Neuroscience Ecosystem. NIF is in a unique position to answer questions about the neuroscience ecosystem. Where are the data? [Figure: brain region by data source; regions shown include striatum, hypothalamus, olfactory bulb, cerebral cortex, and brain. Vadim Astakhov, Kepler Workflow Engine.]

28 Whither neuroscience information? What is easily machine processable and accessible. What is known: literature, images, human knowledge (unstructured; natural language processing, entity recognition, image processing and analysis; communication). What is potentially knowable: ∞.

29 Open world meets closed world. Query for “reference” brain structures and their parts in the NIF Connectivity database. But... NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records.

30 NIF Reports: male vs female; gender bias. NIF can start to answer interesting questions about neuroscience research, not just about neuroscience.

31 What have we learned: grabbing the long tail of small data. Analysis of NIF shows multiple databases with similar scope and content. Many contain partially overlapping data. Data “flows” from one resource to the next. Data is reinterpreted, reanalyzed or added to. Is duplication good or bad?

32 Embracing duplication: data mash-ups. NIF queries across 3 of approximately 10 fMRI databases. ~300 PMIDs were common between Brede and SUMSdb. PMID serves as a unique identifier for an article. Same information; value added. Same data; different aspects.
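Because the PMID identifies the article, a mash-up of two sources reduces to a join on that key. The sketch below uses invented records and placeholder PMIDs; the field names (`foci`, `surface_map`) are assumptions for illustration and do not reflect the actual Brede or SUMSdb schemas.

```python
# Hypothetical records keyed by PMID; keys and fields are placeholders.
source_a = {
    "PMID-A": {"foci": 12},
    "PMID-B": {"foci": 5},
}
source_b = {
    "PMID-B": {"surface_map": True},
    "PMID-C": {"surface_map": False},
}


def mash_up(a, b):
    """Merge the records of articles that appear in both sources."""
    common = a.keys() & b.keys()
    return {pmid: {**a[pmid], **b[pmid]} for pmid in common}


merged = mash_up(source_a, source_b)
print(merged)
```

The merged record carries both aspects of the same article, which is the "same data; different aspects" value the slide describes.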

33 Same data: different analysis. Chronic vs acute morphine in striatum. Gemma: Gene ID + Gene Symbol. DRG: Gene name + Probe ID. Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases. Analysis: 1370 statements from Gemma regarding gene expression as a function of chronic morphine. 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis. Results for 1 gene were opposite in DRG and Gemma. 45 did not have enough information provided in the paper to make a judgment.
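The baseline issue on this slide can be illustrated with a small normalization step: before two statements are compared, both are expressed relative to the same reference condition. This is a sketch under a simplifying assumption (exactly two conditions, so changing the baseline simply flips the direction); the function names are hypothetical and this is not the procedure the NIF analysis actually used.

```python
FLIP = {"up": "down", "down": "up"}


def normalize(direction, baseline, reference_baseline="saline"):
    """Express a direction of change relative to the reference baseline.

    Assumes a two-condition comparison, so a different baseline
    means the reported direction is simply reversed.
    """
    return direction if baseline == reference_baseline else FLIP[direction]


def consistent(gemma_direction, drg_direction):
    """Compare a Gemma statement (baseline: chronic morphine)
    with a DRG statement (baseline: saline)."""
    return normalize(gemma_direction, "chronic morphine") == normalize(drg_direction, "saline")


print(consistent("up", "down"))
print(consistent("up", "up"))
```

Under this convention, a gene reported as "up" in one database and "down" in the other is an agreement, while identical labels signal a genuine conflict.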

34 Taking a global view on data: microculture to ecosystem. Several powerful trends should change the way we think about our data: One → Many. Many data: generation of data is getting easier → shared data; the data space is getting richer, with more -omes every day; but compared to the biological space, still sparse. Many eyes: wisdom of crowds; more than one way to interpret data. Many algorithms: not a single way to analyze data. Many analytics: “signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting. Are you exposing or burying your work?

35 The future of scientific communication. We have learned over the years how to write a scientific paper for other humans to read and for other agents to index. We now have to learn how to write papers for automated agents (and their humans) to mine. We have learned over the years to report data in papers for humans to read. We now have to learn how to publish data in a form and on a suitable platform for automated agents (and their humans) to mine. Reporting neuroscience data within a consistent framework helps enormously. [Images: printing press; linked data cloud; Watson.]

36 Why does it matter? 47/53 major preclinical published cancer studies could not be replicated. “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.” “There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.” (Begley and Ellis, Nature, 29 March 2012, Vol. 483, p. 531.) Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data. Data, not just stories about them!

37 “How do I share my data?” “There is no database for my data.” Register your resource to NIF! NIF is designed to leverage existing investments in resources and infrastructure. [Diagram: steps 1 to 4, from “Community database: beginning” to “Community database: end”; labels include institutional repositories, cloud, INCF: global infrastructure, government, education, industry.]

38 It’s a messy ecosystem (and that’s OK). NIF favors a hybrid, tiered, federated system. Tiers: domain knowledge (ontologies); claims about results (Virtuoso RDF triples); data (data federation); workflows; narrative (full text access). Concepts: neuron, brain part, disease, organism, gene. Example claims: Caudate projects to SNpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS.
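The three example claims on this slide have the subject, predicate, object shape of RDF triples. As a minimal sketch, assuming plain Python tuples in place of a real triple store (the slide names Virtuoso; nothing here uses or imitates its API, and `Triple` and `match` are hypothetical names):

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The example claims from the slide, written as triples.
CLAIMS = [
    Triple("Caudate", "projects to", "SNpc"),
    Triple("Grm1", "is upregulated in", "chronic cocaine"),
    Triple("Betz cells", "degenerate in", "ALS"),
]


def match(claims, subject=None, predicate=None, obj=None):
    """Return the claims matching a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        t for t in claims
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


print(match(CLAIMS, predicate="projects to"))
```

In a real store the strings would be replaced by shared identifiers, so that "Caudate" in one source and its synonyms in another resolve to the same node.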

39 Future of Research Communications and e-Scholarship. FORCE11: http://force11.org. Founded by Phil Bourne, Tim Clark, Ed Hovy, Anita de Waard and Ivan Herman. Bring together stakeholders with an interest in moving scholarly communication beyond reliance on papers and traditional impact metrics. Beyond the PDF 2: Spring 2013.

40 NIF team (past and present): Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer.

41 Why do we create so many overlapping products? “That which I cannot build, I cannot understand.” Don’t trust any data you haven’t generated. Oh, now I see what you are saying. Scientists know the domain, not informatics. Science is incremental; we build on the results of others. It’s ingrained in our culture. “Build a better mousetrap and the world will beat down our doors.” Little credit for making someone else’s product better. Yes, we are planning to do that... We are all time and resource constrained. We extend projects in time. There’s more than one way to skin a cat... We are still mastering the medium. Technology is developing fast.

42 When I talk to resource providers, neuroscientists (and journal editors)... You need to use ontology identifiers instead of strings. Blah, blah, ontology blah.

