If There is a Data Deluge, Where are the Data?
Amarnath Gupta, University of California San Diego

- Assembled the largest searchable collation of neuroscience data on the web
- The largest catalog of biomedical resources (data, tools, materials, services) available
- The largest ontology for neuroscience
- NIF search portal: simultaneous search over data, the NIF catalog, and the biomedical literature
- NeuroLex Wiki: a community wiki serving neuroscience concepts
- A unique technology platform
- Cross-neuroscience analytics
- A reservoir of cross-disciplinary biomedical data expertise

[Figure: the data landscape as a spectrum, running from least shared but useful for deep (re-)analysis to most shared and useful for comprehension and discovery]
- Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
- Data collections and databases: specialized and general PRs, DBs
- Processed data products and processes: DBs, web PRs, publications
- Papers with and without data: publications
- Pub/DB annotations and pub/DB cross-links
- Extracted/analyzed fact collections
- Formal knowledge/ontologies
- Aggregates and resource hubs: NIF is aware of 761 repositories

The distribution of data volume, velocity, variability, location, and availability across this spectrum is uneven.

47 of 50 major published preclinical cancer studies could not be replicated (Begley and Ellis, Nature 483:531, 29 March 2012):

"The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case."

"There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process."

"There must be more opportunities to present negative data."

Getting data out sooner, in a form where they can be exposed to many eyes and many analyses and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data. This requires significant cross-linking between original papers and the papers and data that support or refute them.

Courtesy: Maryann Martone

Hello all,

Thank you to the people who are taking a look at the data in tera15 :-) There is a whole lot of data (about 8+ TB) that can be looked at and/or removed. If you had assistants, students, or volunteers who helped you process data, please locate those folders and remove any duplicate or unused data. This will help EVERYONE have space to process new data. Any data that has been sitting in tera15 untouched for more than 4 years will be moved to a different area for deletion. Please take a careful look!

For every neuroscientist:
- For every experiment he or she runs, and for every data set that leads to positive or negative results:
  - Store the data in a shared or on-demand repository
  - Annotate the data with experimental and other contextual information
  - Perform some analysis, and contribute the analysis method to the repository where the data are stored
- For every analysis result:
  - Keep the complete processing provenance of the result
  - Point back to the data sets or data elements that contributed to the analysis, specifically marking positively and negatively contributing data
- If an error is pointed out in some result:
  - Provide an explanation of the error
  - Create a pointer back to the relevant part of the publication and to the part of the data set or data element that produced the error
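As a rough sketch of the bookkeeping this checklist implies, here are two illustrative record types in Python; the class and field names are assumptions for illustration, not an existing repository schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataSubmission:
    """One data set deposited in a shared repository, with its context."""
    dataset_id: str                 # unique identifier, e.g. an extended DOI
    experiment_id: str              # the experiment that produced the data
    outcome: str                    # "positive" or "negative" result
    annotations: dict = field(default_factory=dict)  # experimental/contextual metadata
    deposited_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class AnalysisResult:
    """An analysis result carrying its complete processing provenance."""
    result_id: str
    method_id: str                  # the method, itself contributed to the repository
    inputs: list = field(default_factory=list)      # (dataset_id, "positive"|"negative") pairs
    provenance: list = field(default_factory=list)  # ordered processing steps

# e.g. (identifiers made up):
sub = DataSubmission("doi:10.9999/ex.123", "exp-042", "negative",
                     annotations={"species": "mouse", "assay": "fMRI"})
```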

For every publication:
- For every result reported:
  - Create a pointer back to all data used in that section
- For every experimental object used (e.g., reagents, or auxiliary data from another group):
  - Create an appropriate, time-stamped if needed, pointer to the correct version of the data

For every repository/database that holds the data:
- Ensure rapid availability
- Allow scientists to download data or perform in-place analyses
- Adhere to appropriate data standards
- Keep all data and references consistent
- Permit multiple simultaneous analyses by different users
- Allow searching/browsing/querying over all possible metadata

All of this implies diverse, distributed infrastructures consisting of individual researchers at different institutions, institutional repositories, public data centers, publishers, annotators and aggregators, bioinformaticians, ...
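A hedged sketch of what such a version-pinned, time-stamped pointer might carry; the identifier and field names below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataPointer:
    """A pointer from a reported result to the exact data behind it."""
    target_id: str      # data set, data object, or interval of a data object
    version: str        # the correct version of the data at the time of use
    timestamp: str      # when the pointer was created (ISO 8601), if needed
    fragment: str = ""  # optional sub-selection within the target

# e.g., the data behind one figure of a paper (identifier is made up):
figure_source = DataPointer(
    target_id="doi:10.9999/example.dataset",
    version="2.1",
    timestamp="2012-03-29T00:00:00Z",
    fragment="subjects=controls")
```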

Service expectations on top of scalable, elastic storage and computation:
- Scalable search and query across structured, semi-structured, and unstructured data
  - Facts: What neurons do Purkinje cells project to?
  - Resources: What are recent data sets on biomarkers for SMA?
  - Analytical results: What animal models have similar phenotypes to Parkinson's disease?
  - Landscape surveys: Who has what data holdings on neurodegenerative diseases?
- Active analyses
  - Combining these data and mine, compute how the connectivity of the human brain differs from that of non-human primates
  - Perform GO-enrichment analysis on all genes upregulated in Alzheimer's across all available data, and compare with my results
- Tracebacks
  - What data and processing were used to reach this result in this paper? Which publication refuted the claims in this paper, and how?
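A minimal sketch of how one question could be fanned out across these index types; the index objects and their search() method are hypothetical placeholders, not a real NIF or repository API:

```python
def federated_search(question, fact_index, resource_catalog, literature_index):
    """Run the same question over facts, resources, and literature,
    tagging each hit with its source so the kinds stay distinguishable."""
    hits = []
    for source, index in (("fact", fact_index),
                          ("resource", resource_catalog),
                          ("literature", literature_index)):
        for hit in index.search(question):
            hits.append({"source": source, "hit": hit})
    return hits

# e.g., federated_search("What neurons do Purkinje cells project to?",
#                        facts, catalog, papers)
```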

- If all neuroscientists wanted to comply with this kind of data sharing today, would the current infrastructure be able to support it?
- Is enough attention being paid to an overarching architecture and interoperation protocol for data sharing? Is today's technology properly harnessed to create a holistic data-sharing infrastructure?
- What would motivate neuroscientists and the other players to really play their parts in data sharing?
- Should there be a "monitoring scheme" to ensure that proper data-sharing practices are actually happening?

The data-sharing ecosystem is a distributed system that can be viewed as an operating system in which:
- Every object has a set of unique structured ids (e.g., extended DOIs) that identify
  - any data set, data object, or interval of a data object
  - the semantic category of the data element
  - any human or software agent
  - any parameter set of a software invocation
- A log is maintained and transmitted for each activity by any agent on any data element
  - submission, transfer to a repository, pickup by an aggregator, creation of a derived product, being crawled by search services, ...
- These logs can be accessed by a central monitoring system covering the ecosystem, built on a Twitter Storm-like streaming infrastructure

Think of Facebook maintaining a log of different actions: being present on the system, sending and accepting friend requests, posting comments and photos, starting and ending chat sessions, ...
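A toy sketch of the per-activity log record and its emission into a monitoring stream; the field and function names are assumptions, and a plain list stands in for the Storm-like stream:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ActivityEvent:
    """One logged action by an agent on a data element."""
    agent_id: str        # human or software agent (unique structured id)
    target_id: str       # data set, data object, or interval of a data object
    action: str          # "submit", "transfer", "aggregate", "derive", "crawl", ...
    params_id: str = ""  # parameter set of a software invocation, if any
    at: str = ""         # filled in at emission time

def emit(event, stream):
    """Timestamp the event and hand it to the monitoring stream."""
    event.at = event.at or datetime.now(timezone.utc).isoformat()
    stream.append(json.dumps(asdict(event)))

# e.g. emit(ActivityEvent("orcid:0000-0000", "doi:10.9999/ex.123", "submit"), log)
```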

The monitoring system would be fed by:
- Update activities on data elements, from data centers and repositories
- Resource references from the literature and from web sites, including opinion sites like blogs and forums
- Citation categories from automated or human-driven annotation systems like DataCite or DOMEO
- Provenance chains from workflow systems like Kepler
- Data derivation changes from rule-based metadata management systems like iRODS
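Since each of these feeds arrives in its own shape, the monitor needs an adapter layer. A speculative sketch, mapping each feed onto one common record shape (plain dicts here, mirroring the event fields sketched above); all incoming field names are invented:

```python
def normalize(source, raw):
    """Map a raw feed record onto the common activity-event shape."""
    if source == "repository":        # update activities on data elements
        return {"agent_id": raw["depositor"], "target_id": raw["dataset"],
                "action": raw["operation"]}
    if source == "workflow":          # e.g. a Kepler provenance record
        return {"agent_id": raw["actor"], "target_id": raw["output"],
                "action": "derive", "params_id": raw.get("parameters", "")}
    if source == "annotation":        # e.g. a DataCite/DOMEO citation record
        return {"agent_id": raw["annotator"], "target_id": raw["cited_data"],
                "action": "cite"}
    raise ValueError(f"unknown source: {source}")
```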

From these logs, one can measure:
- Frequency and regularity of data creation vis-à-vis submission to the data-sharing ecosystem
- Frequency and regularity of data usage of various kinds
  - viewing, downloading, replication, uptake by software, ...
- Number of derived data products
  - compounded by cascades of derived data
- Cross-referencing of data and resources in publications
  - compounded by cascades of publication data citations
- Human and programmatic access to data
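A small sketch of two such measures computed from normalized activity events, assuming the event dictionaries from the earlier sketches:

```python
from collections import Counter
from statistics import pstdev

def usage_counts(events):
    """How often each data element is touched, and by which kind of action,
    from a list of normalized activity events (dicts as sketched above)."""
    per_target = Counter(e["target_id"] for e in events)
    per_action = Counter((e["target_id"], e["action"]) for e in events)
    return per_target, per_action

def regularity(times):
    """Regularity of use: spread of the gaps between successive accesses
    (epoch seconds); lower means more regular. 0.0 if too few events."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return pstdev(gaps) if len(gaps) > 1 else 0.0
```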

Accountability score: a measure of "good data citizenship"
- Of people, the score:
  - increases with contribution of data and analyses
  - decays (slowly) with time
  - increases with references and citations
  - increases with supporting work by others
  - decreases with refutation
  - decreases (rapidly) with paper retraction
- Of publications, the score:
  - increases with the addition of reference-able data
  - increases with access to that data
  - increases when the publication is kept up to date with updates to its data
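A toy sketch of these update rules. The slide fixes only the signs and the relative speeds ("slowly", "rapidly"); the weights and decay constant below are invented parameters, and for simplicity the same decay is applied to every event:

```python
import math

WEIGHTS = {"contribution": +5.0, "citation": +2.0, "support": +3.0,
           "refutation": -3.0, "retraction": -20.0}   # retraction hits hard
DECAY_PER_DAY = 0.001   # slow decay with time, an assumed parameter

def accountability(events, today):
    """events: (day, kind) pairs; older events count for less."""
    return sum(WEIGHTS[kind] * math.exp(-DECAY_PER_DAY * (today - day))
               for day, kind in events)
```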

Influence: a classification and measure of one's professional engagement in terms of data activity
- A longer-term measure than the accountability score
- Applies to all types of players in the ecosystem, including those who are only users

- These measures do not apply to scientists who do not produce data
- The measures are mostly designed for online activities, and must be adapted to the dynamics of different scientific communities
  - parameters like decay constants
  - the time window for score revision
- Global scores should be
  - supplemented by community scores, where a community is defined by the ontological region in which one's research lies
  - computed per activity type, rather than as a single overall score

Anticipated objections:
- This is Big Brother for science
- It will create a bias against "non-performers"
- Scientific errors will be penalized more than necessary
- The algorithms can be manipulated to the advantage of some people over others
- Smaller individuals and organizations will be penalized relative to better-funded, higher-throughput organizations
- It will be hard to implement because of opposition from different groups and institutions

My speculations:
- If the community decides that it needs data sharing, it will naturally gravitate toward some degree of judgment of those who do not comply
- Technology frameworks similar to the one discussed here will be adopted within individual e-infrastructures
- As more data become available and data-sharing efforts succeed, third-party watchers will emerge, akin to credit bureaus, that monitor scientists' products with respect to data
- Such scores will be used for community perception and in-kind incentives before they are adopted for formal evaluations

The real question is: how do we promote data sharing?
- Creating infrastructural elements and reusing today's (and tomorrow's) technological capabilities is not enough
- We need a more holistic approach that factors in the human component
- Using social activity analysis as a starting point, we should be able to build a monitoring-cum-incentivizing scheme for data sharing