Making “Open Data” Work: Challenges for Data Integration in Genomics Research Irene Pasquetto @UCLA_KI @irenepasquetto.


Literature on Open Data (OD) in SCIENCE
There is a lot of literature about: the BENEFITS of OD (making science more trustworthy, reproducible, and efficient), the SOCIOTECHNICAL BARRIERS (lack of funding, credit attribution, trust, provenance, lack of metadata and standards, etc.), and the potential TECHNICAL SOLUTIONS (data repositories, how to link data and software, solutions for publishing data, etc.). However, only a few studies have focused on what is needed to reuse the data once they are made open. “Reuse” of OD is often taken for granted, yet we have no numbers on how data are actually reused once made open, for what purposes, and by whom. Very few empirical studies tell us to what extent data in open repositories are reused.

THE CRANIOFACIAL RESEARCH FIELD
An interdisciplinary domain at the intersection of biomedicine and pure biology research.
GOALS:
Study the genetic causes of facial variation and facial abnormalities.
Study the evolutionary processes involved in craniofacial development.
Develop awareness, prevention, and treatments for common genetic syndromes involving the face, such as cleft palate (half of birth defects involve the face).
Slide image: The Wonders of the East, Beowulf Manuscript, c. 700–1000 AD
This paper is part of our empirical research on the reuse of open data in science. Our case study is in the craniofacial domain, which is… (read slide)

This community had long been characterized by labs working independently on similar research questions using very different methods. In 2010, NIH funded a data-sharing consortium intended to enable collaboration among researchers in the field. Eleven labs were selected to generate novel data and make them open to the larger community on the website facebase.org. FaceBase is a completely OPEN DATABASE.

[Diagram: INFORMATICS HUB with LAB 1 through LAB 10]
DATA INTEGRATION IS NECESSARY TO ALLOW ANALYSIS AND REUSE, BUT DIFFICULT BECAUSE:
Data are collected from four different animal models (chimps, mice, zebrafish, and humans).
Variety of data formats: 3D images, gene expression data, ChIP-seq, RNA-seq, etc.
Data are collected and analyzed with different methods (from single-gene experiments to whole-genome approaches).
Now, the community is experiencing difficulties in reusing these data. The situation is so critical that some participants have described the consortium as a “data dump”. We are trying to understand what it is that makes reuse so difficult. We found that data integration is the key problem. In order to reuse the data, scientists have to be able to analyze them, and in order to analyze them, the data need to be integrated, because scientists need to make comparisons across different datasets.
Data integration is made difficult by two factors. First, it is complicated by the high heterogeneity of the data and the methods involved in the consortium. This may seem obvious. Second, we found that the fulcrum of the problem is the conceptualization of what “data integration” means, who is responsible for it, and what it should be used for. We found that informatics people, wet lab biologists, computational biologists, and bioinformaticians all have different understandings of what “data integration” means. Wet lab people want data to be integrated at a high level, and they need automated tools to conduct integrated data analysis. The informatics engineers are trying to develop such tools, but they lack the domain knowledge to get them right. And then there are the computational biologists and the bioinformaticians, who don’t want to use these tools; they want to download the data and do the analysis by running their own pipelines and algorithms.

What does “data integration” mean?

Conclusions
Data reuse depends on the possibility of conducting integrated data analysis.
Data integration work is complicated by the high heterogeneity of the datasets, methods, and tools.
It is also complicated by the negotiation of the meaning of “data integration” (not just about standards!).
Data integration work is emergent and vital for data reuse, but it is difficult to articulate.
In conclusion, we found that in FaceBase, data reuse depends on the possibility of conducting integrated data analysis. Data integration work is complicated by the high heterogeneity of the datasets, methods, and tools involved in the consortium, but also by the negotiation of what “data integration” means. We found that “data integration work” is necessary for data reuse, but it is emergent and difficult to articulate among different stakeholders.

Thank you! @irenepasquetto @UCLA_KI KI website: https://knowledgeinfrastructures.gseis.ucla.edu/ This analysis is part of the ongoing research work of the Center for Knowledge Infrastructures supported by the Sloan Foundation.