Making “Open Data” Work: Challenges for Data Integration in Genomics Research Irene Pasquetto @UCLA_KI @irenepasquetto.

Literature on OD in SCIENCE
There is a lot of literature about: the BENEFITS of OD (making science more trustworthy, reproducible, efficient), the SOCIOTECHNICAL BARRIERS (lack of funding, credit attribution, trust, provenance, lack of metadata and standards etc.), and the potential TECHNICAL SOLUTIONS (data repositories, how to link data and software, solutions for publishing data etc.). However, only a few studies have focused on what needs to be done to reuse the data once they are made open. “Reuse” of OD is often taken for granted, yet we have no numbers on how data are actually reused once made open, for what purposes, and by whom. Very few empirical studies tell us to what extent data in open repositories are reused.

THE CRANIOFACIAL RESEARCH FIELD
Interdisciplinary domain at the intersection of biomedicine and pure biology research.
GOALS:
Study the genetic causes of facial variation and facial abnormalities.
Study the evolutionary processes involved in craniofacial development.
Develop awareness, prevention, and treatments for common genetic syndromes involving the face, such as cleft palate (half of birth defects involve the face).
[Image: The Wonders of the East, Beowulf Manuscript, c. 700–1000 AD]

This paper is part of our empirical research on the reuse of open data in science. Our case study is in the craniofacial domain, which is…(read slide)

This community had long been characterized by labs working independently on similar research questions using very different methods. In 2010, NIH funded a data-sharing consortium intended to enable collaboration among researchers in the field. 11 labs were selected to generate novel data and make them open to the larger community on the website facebase.org. FaceBase is a completely OPEN DATABASE.

[Diagram: INFORMATICS HUB and LAB 1 through LAB 10]

DATA INTEGRATION IS NECESSARY TO ALLOW ANALYSIS AND REUSE, BUT DIFFICULT BECAUSE:
Data are collected from 4 different animal models (chimps, mice, zebrafish and humans).
Variety of data formats: 3D images, gene expression data, ChIP-seq, RNA-seq etc.
Data are collected and analyzed with different methods (from single-gene experiments to whole-genome approaches).

Now, the community is experiencing difficulties in reusing these data. The situation is so critical that some participants have described the consortium as a “data dump”. We are trying to understand what it is that makes reuse so difficult. We found that data integration is the key problem. To reuse the data, scientists have to be able to analyze them, and to analyze them the data need to be integrated, because scientists need to make comparisons across different datasets.

Two factors make data integration difficult. The first is the high heterogeneity of the data and methods involved in the consortium, which may seem obvious. The second, which we found to be the crux of the problem, is the conceptualization of what “data integration” means, who is responsible for it, and what it should be used for.

We found that informatics people, wet lab biologists, computational biologists and bioinformaticians all have different understandings of what “data integration” means. Wet lab people want data to be integrated at a high level, and they need automated tools to conduct integrated data analysis. The informatics engineers are trying to develop tools to make this happen, but they lack the domain knowledge to get these tools right. And then you have the computational biologists and the bioinformaticians, who don’t want to use these tools and instead want to download the data and run the analysis with their own pipelines and algorithms.
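To make the heterogeneity point concrete, here is a minimal sketch of the most basic integration step: mapping dataset descriptions from different labs onto shared terms so that comparable datasets can be found at all. Everything in it is hypothetical and invented for illustration, including the lab names, field names, and synonym tables; it does not reflect FaceBase's actual schema or tooling, and it addresses only the standards side of the problem, not the negotiation over what “data integration” should mean.

```python
from collections import defaultdict

# Hypothetical dataset descriptions; each lab uses its own field names and terms.
LAB_RECORDS = [
    {"lab": "lab1", "organism": "Mus musculus", "assay": "RNA-seq"},
    {"lab": "lab2", "species": "mouse", "data_type": "rnaseq"},
    {"lab": "lab3", "species": "zebrafish", "data_type": "ChIP-seq"},
]

# Hypothetical synonym tables mapping each lab's vocabulary to shared terms.
SPECIES_SYNONYMS = {
    "mus musculus": "mouse",
    "mouse": "mouse",
    "danio rerio": "zebrafish",
    "zebrafish": "zebrafish",
}
ASSAY_SYNONYMS = {
    "rna-seq": "RNA-seq",
    "rnaseq": "RNA-seq",
    "chip-seq": "ChIP-seq",
}


def harmonize(record):
    """Map one lab-specific record onto the shared (species, assay) terms."""
    species_raw = record.get("organism") or record.get("species")
    assay_raw = record.get("assay") or record.get("data_type")
    return {
        "lab": record["lab"],
        "species": SPECIES_SYNONYMS[species_raw.strip().lower()],
        "assay": ASSAY_SYNONYMS[assay_raw.strip().lower()],
    }


def group_comparable(records):
    """Group labs whose datasets share the same species and assay type."""
    groups = defaultdict(list)
    for rec in map(harmonize, records):
        groups[(rec["species"], rec["assay"])].append(rec["lab"])
    return dict(groups)


if __name__ == "__main__":
    for key, labs in group_comparable(LAB_RECORDS).items():
        print(key, labs)
```

Even in this toy case, someone with domain knowledge has to decide which terms count as equivalent, which is the kind of work the informatics engineers and the biologists have to negotiate.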

What does “data integration” mean?

Conclusions
Data reuse depends on the possibility of conducting integrated data analysis.
Data integration work is complicated by the high heterogeneity of the datasets, methods, and tools.
The meaning of “data integration” has to be negotiated (it is not just about standards!).
Data integration work is emergent and vital for data reuse, but it is difficult to articulate.

In conclusion, we found that in FaceBase data reuse depends on the possibility of conducting integrated data analysis. Data integration work is complicated by the high heterogeneity of the datasets, methods, and tools involved in the consortium, but also by the negotiation of what “data integration” means. We found that “data integration work” is necessary for data reuse, but it is emergent and difficult to articulate among different stakeholders.

Thank you! @irenepasquetto @UCLA_KI
KI website: This analysis is part of the ongoing research work of the Center for Knowledge Infrastructures supported by the Sloan Foundation.
