Amarnath Gupta Univ. of California San Diego If There is a Data Deluge, Where are the Data?


1 Amarnath Gupta Univ. of California San Diego If There is a Data Deluge, Where are the Data?

2 The Neuroscience Information Framework (NIF) has:
- assembled the largest searchable collation of neuroscience data on the web
- the largest catalog of biomedical resources (data, tools, materials, services) available
- the largest ontology for neuroscience
- the NIF search portal: simultaneous search over data, the NIF catalog, and the biomedical literature
- the NeuroLex wiki: a community wiki serving neuroscience concepts
- a unique technology platform
- cross-neuroscience analytics
- a reservoir of cross-disciplinary biomedical data expertise

3 [Diagram: the data landscape as a spectrum from least shared to most shared; raw data are most useful for deep (re-)analysis, formal knowledge for comprehension and discovery. There is an uneven distribution of data volume, velocity, variability, location and availability.]
- Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
- Data collections and databases: specialized & general PRs, DBs
- Processed data products and processes: DBs, web PRs, publications
- Papers with/without data: publications
- Pub/DB annotations and cross-links
- Extracted/analyzed fact collections
- Formal knowledge/ontologies
- Aggregates and resource hubs: NIF is aware of 761 repositories

4 47/50 major preclinical published cancer studies could not be replicated (Begley and Ellis, Nature 483:531, 29 March 2012).
"The scientific community assumes that the claims in a preclinical study can be taken at face value: that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case."
"There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process."
"There must be more opportunities to present negative data."
Getting data out sooner, in a form where they can be exposed to many eyes and many analyses and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data. This requires significant cross-linking between original papers and supporting/refuting papers/data.
Courtesy: Maryann Martone

5 "Hello All, Thank you for the people who are taking a look at the data in tera15 :-) There are a whole lot of data (about +8TB) that can be looked at and/or removed. If you had assistants, students, or volunteers who assisted you in processing data, please locate those folders and remove any duplicate or unused data. This will help EVERYONE have space to process new data. Any old data that has been sitting in tera15 untouched in more than 4 years will be removed to a different area for deletion. Please take a look carefully!"

6 For every neuroscientist:
- For every experiment he/she runs, and for every data set that leads to positive or negative results:
  - Store the data in a shared or on-demand repository
  - Annotate the data with experimental and other contextual information
  - Perform some analysis and contribute the analysis method to the repository where the data are stored
- For every analysis result:
  - Keep the complete processing provenance of the result
  - Point back to the data sets or data elements that contributed to the analysis, specifically marking positively and negatively contributing data
- If an error is pointed out in some result:
  - Provide an explanation of the error
  - Create a pointer back to the relevant part of the publication and to the part of the data set or data element that produced the error
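The per-experiment bookkeeping on slide 6 can be pictured as a simple record type. This is an illustrative sketch, not an existing system; all class and field names (`ExperimentRecord`, `AnalysisProvenance`, `report_error`) are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnalysisProvenance:
    method: str                 # analysis method contributed to the repository
    inputs: list                # ids of data elements used in the analysis
    positive_inputs: list       # data marked as contributing positively
    negative_inputs: list       # data marked as contributing negatively

@dataclass
class ExperimentRecord:
    dataset_id: str                                 # id in the shared repository
    annotations: dict = field(default_factory=dict) # experimental/contextual metadata
    analyses: list = field(default_factory=list)    # provenance of each analysis
    errata: list = field(default_factory=list)      # error explanations + pointers

    def report_error(self, explanation: str, publication_ref: str, data_ref: str) -> None:
        # Record an explanation plus pointers back to the part of the
        # publication and the data element that produced the error.
        self.errata.append({
            "explanation": explanation,
            "publication": publication_ref,
            "data": data_ref,
            "reported_at": datetime.now(timezone.utc).isoformat(),
        })
```

The point of the sketch is that errata are first-class, linked objects rather than free-text corrections.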

7 For every publication:
- For every result reported, create a pointer back to all data used in that section
- For every experimental object used (e.g., reagents, or auxiliary data from another group), create an appropriate, if needed time-stamped, pointer to the correct version of the data

For every repository/database that holds the data:
- Ensure rapid availability
- Allow scientists to download the data or perform in-place analyses
- Adhere to appropriate data standards
- Keep all data and references consistent
- Permit multiple simultaneous analyses by different users
- Allow searching/browsing/querying over all possible metadata

This implies diverse distributed infrastructures consisting of individual researchers in different institutions, institutional repositories, public data centers, publishers, annotators and aggregators, bioinformaticians, …

8 Scalable, elastic storage and computation. Service expectations:
- Scalable search and query across structured/semi-structured/unstructured data
  - Facts: what neurons do Purkinje cells project to?
  - Resources: what are recent data sets on biomarkers for SMA?
  - Analytical results: what animal models have similar phenotypes to Parkinson's disease?
  - Landscape surveys: who has what data holdings on neurodegenerative diseases?
- Active analyses
  - Combining these data with mine, compute how the connectivity of the human brain differs from that of non-human primates
  - Perform GO-enrichment analysis on all genes upregulated in Alzheimer's across all available data and compare with my results
- Tracebacks
  - What data and processing were used to reach this result in this paper? Which publication refuted the claims in this paper, and how?

9

10 If all neuroscientists wanted to comply with this kind of data sharing today, would the current infrastructure be able to support it?
- Is enough attention being paid to an overarching architecture and interoperation protocol for data sharing?
- Is today's technology properly harnessed to create a holistic data-sharing infrastructure?
What would motivate neuroscientists and other players to really play their parts in data sharing?
Should there be a "monitoring scheme" to ensure proper data-sharing practices are actually happening?

11 The data-sharing ecosystem is a distributed system that can be viewed as an operating system where:
- Each object has a set of unique structured ids (e.g., extended DOIs) that identify:
  - any data set, data object, or interval of a data object
  - the semantic category of the data element
  - any human/software agent
  - any parameter set of a software invocation
- A log is maintained and transmitted for each activity by any agent on any data element:
  - submission, transfer to a repository, pickup by an aggregator, creation of a derived product, being crawled by search services, …
- These logs can be accessed by a central monitoring system covering the ecosystem, using a Twitter Storm-like infrastructure
Think of Facebook maintaining a log of actions such as being present in the system, sending and accepting friend requests, posting comments and photos, starting and ending chat sessions, …
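One way to picture the activity log proposed on slide 11 is as a stream of structured events, one per agent action on a data element. A minimal sketch, assuming JSON log lines and an invented event vocabulary (the field names and event types are not from the talk):

```python
import json
from datetime import datetime, timezone

# Assumed event vocabulary, loosely following the slide's examples
# (submission, transfer, aggregator pickup, derivation, crawling).
EVENT_TYPES = {"submit", "transfer", "aggregate", "derive", "crawl"}

def make_event(agent_id: str, object_id: str, event_type: str, params=None) -> str:
    """Serialize one activity by an agent on a data element as a JSON log line."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "agent": agent_id,      # unique id of a human or software agent
        "object": object_id,    # extended-DOI-style id; may address an interval
        "type": event_type,
        "params": params or {}, # parameter set of the software invocation
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)
```

Structured, uniformly identified events like these are what would let a central Storm-like system consume activity from many repositories at once.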

12

13 
- Update activities on data elements from data centers and repositories
- Resource references from literature and web sites, including opinion sites like blogs and forums
- Citation categories from automated/human-driven annotation systems like DataCite or DOMEO
- Provenance chains from workflow systems like Kepler
- Data derivation changes from rule-based metadata management systems like iRODS

14 
- Frequency and regularity of data creation vis-à-vis submission to the data-sharing ecosystem
- Frequency and regularity of data usage of various kinds: viewing, downloading, replication, uptake by software, …
- Number of derived data products, compounded by cascades of derived data
- Cross-referencing of data and resources in publications, compounded by publication data-citation cascades
- Human and programmatic access to data
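As a toy illustration of how such usage indicators could be tallied from an activity log, assuming each event is a dict with `object` and `type` keys (an assumption, not a defined format):

```python
from collections import Counter

def usage_counts(events):
    """Tally, per data object, how often each kind of usage occurred
    (viewing, downloading, replication, uptake by software, ...)."""
    counts = {}
    for e in events:
        counts.setdefault(e["object"], Counter())[e["type"]] += 1
    return counts

events = [
    {"object": "doi:10.1000/ds1", "type": "download"},
    {"object": "doi:10.1000/ds1", "type": "view"},
    {"object": "doi:10.1000/ds1", "type": "download"},
]
# usage_counts(events)["doi:10.1000/ds1"]["download"] == 2
```

Frequency and regularity measures would then be computed over the timestamps of the same event stream.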

15 Accountability score: a measure of "good data citizenship"
- Of people:
  - increases with contribution of data and analyses
  - decays (slowly) with time
  - increases with references and citations
  - increases with supporting work by others
  - decreases with refutation
  - decreases (rapidly) with paper retraction
- Of publications:
  - increases with the addition of reference-able data
  - increases with data access
  - increases with being kept up to date as the underlying data are updated
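The score dynamics above (gains for contributions and citations, slow decay with time, a sharp penalty for retraction) can be mimicked with an exponentially decayed weighted sum. All weights and the decay constant below are arbitrary illustrations, not values from the talk:

```python
import math

DECAY_PER_DAY = 0.001   # slow decay with time (illustrative)
WEIGHTS = {
    "contribution": +5.0,   # contributing data and analyses
    "citation":     +1.0,   # references and citations
    "support":      +2.0,   # supporting work by others
    "refutation":   -2.0,   # refutation decreases the score
    "retraction":  -20.0,   # retraction penalized heavily ("rapidly")
}

def accountability_score(events, now_day: int) -> float:
    """events: (day, kind) pairs; older events' weights decay exponentially."""
    score = 0.0
    for day, kind in events:
        age = now_day - day
        score += WEIGHTS[kind] * math.exp(-DECAY_PER_DAY * age)
    return score
```

Under this toy model, a single contribution is worth its full weight on day 0 and slowly fades, while one retraction outweighs several contributions, matching the asymmetry the slide describes.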

16 Influence: a classification and measure of the professional engagement one has in terms of data activity
- a longer-term measure compared to the accountability score
- applies to all types of players in the ecosystem, including those who are only users

17

18 These measures do not hold for scientists who do not produce data. The measures are mostly designed for online activities and must be modified to match the dynamics of different scientific communities:
- parameters like decay constants
- the time window for score revision
Global scores should be:
- supplemented by community scores, where a community is defined by the ontological regions in which one's research lies
- computed per activity type rather than as a single overall score

19 
- This is Big Brother for science
- It will create a bias against "non-performers"
- Scientific errors will be penalized more than necessary
- The algorithms can be manipulated to the advantage of some people over others
- Smaller individuals/organizations will be penalized relative to better-funded, higher-throughput organizations
- It will be hard to implement due to opposition from different groups and institutions

20 My speculations:
- If the community decides that it needs data sharing, it will naturally gravitate toward some degree of judgment of those who don't comply
- Technology frameworks similar to the ones discussed here will be adopted within individual e-infrastructures
- As more data become available and data-sharing efforts succeed, third-party watchers will emerge, like credit bureaus, that monitor scientists' products with respect to data
- Such scores will be used for community perception and in-kind incentives before they are adopted for formal evaluations

21 The real question is: how do we promote data sharing? Creating infrastructural elements and reusing today's (and tomorrow's) technological capabilities is not enough. We need a more holistic approach that factors in the human component. Using social activity analysis as a starting point, we should be able to build a monitoring-cum-incentivizing scheme for data sharing.

