TWC Why Data Science Matters Xiaogang (Marshall) Ma Tetherless World Constellation Rensselaer Polytechnic Institute

1 TWC Why Data Science Matters
Xiaogang (Marshall) Ma, Tetherless World Constellation, Rensselaer Polytechnic Institute
Twitter: @MarshallXMa
ICSU-WDS Data Stewardship Award Lecture, SciDataCon 2014, New Delhi, India, Nov. 02-05

2 TWC Acknowledgements
Dr. Mustapha Mokrane and Dr. Simon Hodson
Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, AGU/ESSI, ICSU-WDS, RDA, ITC, and more
My mentor Prof. Peter Fox
My family
All of you

3 TWC Outline
Technical trends
–Data management, publication & citation
Methodology
–Interoperability & Provenance
Data management is just a start
–Data analysis
–Semantic eScience

4 TWC Data Management
data work
Image courtesy Randy Glasbergen

5 TWC Data Management Plan
–A formal document that outlines what you will do with your data during and after you complete your research
Resources/tools that help create DMPs:
–NSF Data Management Plan Requirements:
–DCC Data Management Plans:
–DMPTool: https://dmptool.org
–DCC DMPOnline:

6 TWC Data Publication
Data as first class products of research
–e.g., NSF bio-sketches can include data publications
Image from
See:

7 TWC
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”
“…authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications.”
“…authors must make materials, data, and associated protocols available to readers.”
“…it is a condition of publication that authors make available the data and research materials supporting the results in the article.”
“…require authors to make all data underlying the findings described in their manuscript fully available without restriction…”
“Earth and space science data should be widely accessible in multiple formats and long-term preservation of data is an integral responsibility of scientists and sponsoring institutions.”
“…support the principle that research data should be made freely available to all researchers…”
“…recommends depositing data that correspond to journal articles in reliable data repositories…”

8 TWC Ways of data publication
–Data as supplemental material of a paper
–Standalone data
–Data paper: data in a repository + descriptive ‘data paper’
Strasser, GeoData 2014 Workshop Presentation (2014)
Examples:
–Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives, Data in Brief …
–Journals that publish data papers: Earth and Space Science, GigaScience, F1000 Research, Internet Archaeology …

9 TWC An isolated data island?!
Image from

10 TWC Data Citation
Data Citation Index
–Indexes the world's leading data repositories
–Connects datasets to related refereed literature indexed in the Web of Science™
–Efficient access to data across subjects and regions
Image courtesy

11 TWC Data interoperability
Interoperability: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”
Ma et al., Nature Geoscience (2011)
Original image from:

12 TWC Provenance of research
Provenance documentation: “Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them”
Image from Ma et al., Nature Climate Change (2014)
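The linking described in the quotation can be pictured as a small graph of statements. The sketch below is only an illustration under stated assumptions: the relation names are borrowed from the W3C PROV vocabulary (which this slide does not mention), and every identifier such as `finding:1` or `dataset:A` is hypothetical.

```python
from collections import defaultdict

# Hypothetical provenance statements (subject, relation, object).
# Relation names follow W3C PROV terms; all identifiers are invented.
triples = [
    ("finding:1", "wasDerivedFrom", "dataset:A"),
    ("finding:1", "wasGeneratedBy", "activity:analysis"),
    ("activity:analysis", "used", "dataset:A"),
    ("activity:analysis", "used", "model_output:B"),
    ("activity:analysis", "wasAssociatedWith", "person:X"),
    ("person:X", "actedOnBehalfOf", "organization:Y"),
]


def trace(start, triples):
    """Return every node reachable from `start` by following provenance links."""
    graph = defaultdict(list)
    for subject, _relation, obj in triples:
        graph[subject].append(obj)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Everything that supports the finding: data, methods, people, organizations.
supporting = trace("finding:1", triples)
```

Walking the graph from a finding returns the data sets, activities, people and organizations behind it, which is the kind of question provenance documentation is meant to answer.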

13 TWC IPython Notebook: A web-based interactive computational environment
Codes, APIs, datasets, text… PDF document
We extended the IPython Notebook environment to enable automatic provenance capture during a scientific workflow
Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)

14 TWC

15 TWC Semantic eScience
Artificial Intelligence accelerates scientific discovery
–Data search, synthesis and hypothesis representation
–Data analysis: reasoning with models of the data
Gil et al., Science (2014)
Image from
A state-of-the-art example: Hanalyzer (high-throughput analyzer)
–Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist
–Uses Semantic Web technology to integrate assertions from other biomedical sources
–Reasons about the network to find new correlations that suggest new genes to investigate
Leach et al., PLoS Comput Bio (2009)
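The three Hanalyzer steps can be illustrated with a toy example. This is not Hanalyzer's algorithm: it assumes the text-mining step has already reduced each paper to the set of gene names it mentions, and the gene names, the `known_interactions` set and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical output of the text-mining step: genes mentioned per paper.
papers = [
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneC"},
    {"geneC", "geneD"},
]

# Hypothetical assertions integrated from another (curated) source.
known_interactions = {frozenset({"geneA", "geneB"})}


def co_mention_network(papers):
    """Count how often each pair of genes is mentioned in the same paper."""
    counts = Counter()
    for genes in papers:
        for pair in combinations(sorted(genes), 2):
            counts[frozenset(pair)] += 1
    return counts


def candidates(papers, known, min_count=1):
    """Pairs linked in the literature but absent from the curated source."""
    network = co_mention_network(papers)
    return sorted(
        tuple(sorted(pair))
        for pair, count in network.items()
        if count >= min_count and pair not in known
    )


suggestions = candidates(papers, known_interactions)
```

Pairs that co-occur in the literature but are missing from the curated source are the ones flagged for a closer look; a real system would reason over much richer relations than co-mention.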

16 TWC Deep Carbon Virtual Observatory Fox, RDA Fourth Plenary Meeting Presentation (2014) A cyber-enabled platform for linked science

17 TWC Summary
Data as first class products of research
eScience: the digital or electronic facilitation of science
Semantic eScience
–A virtuous circle between science and semantic technologies
–Data driven + Knowledge driven?
Image courtesy @WileyExchanges

18 TWC More information: Marshall X Ma Thank you!
