Data Science and Scientific Discovery: New Approaches to Nature’s Complexity
Dr. John Rumble, President, R&R Data Services, Gaithersburg MD
www.randrdata.com


To understand scientific and technical data today, we must first understand how the information revolution has changed both Science and Data and their relationship 2 DC Data Science May 2012

My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

Why Do We Do Science?
Two primary motivations for advancing science:
First is our insatiable thirst to understand the world, probably dating from when we started thinking
Second is a direct result of the Industrial Revolution: How does the technology we are inventing actually work?

21st Century Science
From the fundamental to the complex
– From determining the laws of nature for a few particles to understanding real systems: cells, the atmosphere, the Earth, ecology
From reductionism to constructionism
– Using our basic knowledge to make models and predict the behavior of real systems, that is, all systems we find in nature or that we can construct

Science Vol. 336, p. 707 (2012)

J. Schmitt et al., Science vol. 336, p. 708 (2012)

Today’s Science and E-Science
The Data Revolution has enabled E-Science through
– Advanced telecommunications and networks
– Computation power and storage
– New algorithms for data management, visualization, analysis, and mathematics
Today, E-Science can be done faster and more powerfully, and scientific communication can occur almost instantly
The real revolution, however, is in the relationship between science and data

When it hits the New York Times, you know it is for real!

Science and Data
To understand scientific and technical data today, we must first understand how the information revolution has changed Science and Data and their relationship
Science today is not about reduction to a few basic laws
Science is about how we understand and control all aspects of nature
How is this done?
– By careful measurement, accurate tests, keen observations, and powerful models and simulations that lead to scientific knowledge
The results are expressed as scientific data!

Scientific Knowledge
What does this really mean?
Recognize a new phenomenon
Analyze its components
Identify the variables that govern it
Isolate the important variables
Demonstrate understanding by control
Change the phenomenon
Scientific knowledge means understanding the independent variables governing a phenomenon and how they influence it

Science Today
A major theme of science today is that we are able to make accurate measurements on a complex world that
– Advance our understanding of nature,
– Improve our ability to harness technology,
And, in spite of many challenges,
– Increase the importance of science to society in the future
Scientific data are at the core of modern science

My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

The Data Revolution in Science
Today, E-Science is real
Computer at every desk
Connectivity: The Internet/WWW explosion
Computerized experiments and observations
Database tools on every computer
Electronic publications
Model and simulation-based R&D
Comprehensive databases
Virtual libraries

Four Ways to Generate Scientific Data
Observations
Experiments
Standardized testing
Modeling and simulation

Observational Science Today
Today we have exciting new capabilities to observe nature in situ better than ever before
– Hubble Space Telescope
– High-sensitivity seismographs
– Bio-macromolecule sequencing instruments
– LTER (Long-Term Ecological Research) platforms
– Earth-observing satellites
– High-power computers to analyze data
Generates huge amounts of quality data

Experimental Science Today
Today we have exciting new capabilities to observe nature in controlled circumstances better than ever before
– Atomic force microscopes
– Micro-electronics and lasers
– High-energy accelerators
– Femtosecond chemical reactors
– High-power computers to analyze data
Generates large amounts of high quality data

Testing Today
Today we have new capabilities to test and analyze materials using standard methods
– Electronic test equipment
– Analytical databases fully integrated into equipment
– Analyzing unknown substances
– Carbon and other techniques for dating objects
– Genomic sequencing
– National and international standard test procedures
– Data analysis tools to generate properties
– Self-calibrating instruments
Generates medium amounts of high quality data

Computation Today
We now also have the ability to create a Virtual World
– Models and simulations of complex systems
– Techniques to do advanced mathematics
– Computers to execute immense calculations
– Visualization tools to examine our virtual world
Uses and generates large amounts of data

Characteristics of Approaches for Generating Scientific Data

The Data Revolution in Science is Real
Observation, experimentation, testing, and calculation all produce, and in some cases use, large amounts of data
E-Science has provided an incredible array of tools, technologies, and methods to collect, store, manage, analyze, exploit, preserve, and disseminate these data
Science today is more fully based on data and data collections than ever before!

My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

Scientific Data and Scientific Databases
Data communicate measurement (experimental and observational) and computational results
“When you can measure what you are speaking about, and express it in numbers, you know something about it” (Lord Kelvin)

Types of Scientific Data
Numbers: 1, 2, 3…
Simple text: ABCs
Complex text: Greek, scripts, symbols
Equations: E = mc²
Graphs
Diagrams
Pictures
Software
Rules

All Data Are Not the Same
Measurement or property: There is a difference!
Measurements are a one-time look at nature
Properties are the inherent characteristics of nature
– They are Nature Itself

Measurements are for Today
Measurements are what you see now
Capture one point of view
Usually a limited number of variables changed
[Image caption] One of 1300 measurements of Diego Giacometti

Properties are Forever
Properties are the real thing
Need many repeated measurements
Far too many substances and systems to determine properties
Will never know the properties of everything
[Image caption] The real Diego Giacometti
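The point that a property needs many repeated measurements can be sketched numerically. This is a minimal illustration, not part of the talk: the function name `estimate_property` and the sample readings are hypothetical, and the summary used (mean with standard error of the mean) is only one common way to turn measurements into a property estimate.

```python
import math
import statistics

def estimate_property(measurements):
    """Summarize repeated measurements as (mean, standard error of the mean)."""
    if len(measurements) < 2:
        # A single measurement gives no handle on uncertainty.
        raise ValueError("need at least two measurements to estimate uncertainty")
    mean = statistics.mean(measurements)
    # Sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(measurements)
    return mean, stdev / math.sqrt(len(measurements))

# Hypothetical repeated readings of one quantity (arbitrary units).
readings = [9.79, 9.82, 9.81, 9.80, 9.83]
mean, sem = estimate_property(readings)
print(f"{mean:.3f} +/- {sem:.3f}")
```

With one reading the function refuses to report an uncertainty, which mirrors the slide: one measurement does not make a property.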

The Classical Paradigm for Science and Data
[Diagram labels: Scientific Knowledge, Theories, Models, Hypotheses, Questions, Data, Measurement]

The true data paradigm has always been this
[Diagram labels: Scientific Knowledge, Theories, Models, Hypotheses, Questions, Data, Measurement, Data Collections]

Scientific Databases in History
Preserved data collections (large and small)
At first, simply data preservation
Data were stored, but not really exploited
1. Accuracy
2. Comprehensiveness
3. Systematizing

Accuracy
Newgrange, Ireland
6000 years old
Aligned to the rising sun at the winter solstice
Depended on careful observational data on the rising sun
One data point!

Volume and Accuracy Improving
Stonehenge
5000 years old
Over 100 stones
Complicated stone alignments
Marks position of the moon and major stars as well as the sun
Storage of several observations

Comprehensive Data Sets
Galen
Greek physician
Experimental physiologist
Arabic copy from 800 AD
Pictorial, descriptive, function describing
Representative of botanical and animal catalogs

Systematizing a Comprehensive Collection
Pliny the Elder
Roman scholar
Natural History (77 AD)
One of the earliest known encyclopedias of the natural world
Systematization of data

My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

Data and Scientific Discovery
The advent of the Baconian Revolution (anchoring scientific understanding to physical observation) led to databases becoming the foundation of scientific discovery
True beginnings of Data Science!

Scientific Databases in History
Preserved data collections (large and small) form the foundation of scientific discovery
Trends in data preservation and discovery
1. Accuracy
2. Comprehensiveness
3. Systematizing
4. Extraction of essence
5. Explanation of the complex
6. Prediction of new phenomena!
7. Physical theory from data!

Extraction of Essence
Tycho Brahe
Late 16th century Danish astronomer
Made precise measurements that led to Kepler’s theories
Led to discovery of simple relationships

Explanation of the Complex
Charles Darwin
Combined his work with that of others in geology, zoology and botany
A wide variety of facts and phenomena recorded
Theory of Evolution had to explain many diverse observations and measurements from different disciplines

Prediction of New Phenomena
Mendeleev and the Chemical Periodic Table
Predicting properties of unknown elements from properties (data) of known elements

Physical Theory from Data
Notes on the Spectral Lines of Hydrogen: Johann Jakob Balmer, Annalen der Physik und Chemie (1885)
– “I gradually arrived at a formula which, at least for these four lines, expresses a law by which their wavelengths can be represented with striking precision… From the formula, we obtained for a fifth hydrogen line … × 10⁻⁷ mm.”
The development of quantum mechanics: Bohr, Schrödinger
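Balmer's empirical law can be written as wavelength = h·m²/(m² − n²) with n = 2. The sketch below evaluates it for m = 3 to 7. The constant h = 3645.6 (in units of 10⁻⁷ mm) is the value usually attributed to Balmer's 1885 paper; it is not stated on the slide and is an assumption here, as is the function name `balmer_wavelength`.

```python
# Balmer's empirical constant h, in units of 10^-7 mm (assumed, not from the slide).
BALMER_CONSTANT = 3645.6

def balmer_wavelength(m, n=2):
    """Wavelength h * m^2 / (m^2 - n^2), in units of 10^-7 mm."""
    if m <= n:
        raise ValueError("m must be greater than n")
    return BALMER_CONSTANT * m**2 / (m**2 - n**2)

# m = 3..6 reproduce the four visible lines the formula was fitted to;
# m = 7 is the kind of prediction the quotation describes.
for m in range(3, 8):
    print(m, round(balmer_wavelength(m), 2))
```

The point of the example is the one the slide makes: a compact formula extracted from a handful of measured values immediately yields testable predictions for lines not used in the fit.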

Brief History of Modern S&T Databases
1950s: Crystal structures (software-generated data)
1960s: Neutron data (modeling weapons)
1970s: Analytical chemistry (identify chemicals); Thermochemistry (properties linked); Environmental and toxicology; Large physics experiments; Space science
1980s: Astronomy; Materials; Earth sciences; Biology; Genomics

Scientific Databases Today
“Preserving” Data is Easy
Database management tools are inexpensive and powerful
Many models for good interfaces exist
Collecting data (data deposition) can be routine
Expertise is easily available from many sources
Building databases today is remarkably easy

Comprehensive Data Collections for 21st Century Science
International Virtual Observatory: All observations for every point in the sky
Structural Genomics: For all living things!
Proteomics: 30,000 or 300,000?
Climate change: Water, earth, atmosphere and all they contain
Historic geologic: Many millennia, the entire planet
Chemistry on demand: 60 elements, 5 at a time, many ratios, 10⁹ – compounds
Biodiversity: 5M species? or 10M? or 50M?
Brain scans: Every person, every thought forever
Very large databases will be found in every scientific discipline
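The "chemistry on demand" estimate can be checked with a short count. The number of ways to choose 5 elements from 60 is exact; the number of composition ratios per element set is not given on the slide, so the value 200 below is purely a hypothetical placeholder to show how a count on the order of 10⁹ could arise.

```python
import math

# Exact: distinct sets of 5 elements drawn from 60.
element_sets = math.comb(60, 5)

# Hypothetical: number of distinct composition ratios tried per element set.
ratios_per_set = 200

candidate_compounds = element_sets * ratios_per_set
print(element_sets)          # 5461512
print(candidate_compounds)   # about 1.1e9 under the assumed ratio count
```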

The Face of 21st Century Science
Complex
Multi-disciplinary
Real systems
Virtual as well as physical
Access to quality data becomes critical
Attention to the problems and challenges of long-term preservation of and access to data becomes more important than ever!

Scientific Discovery and Data Collections: The Paradigm has Changed
Yesterday
– Collections managed by a small number of people
– Collections readable by one scientist
– Collections interpretable by one person
– Discoveries made by thinking, with analysis by one person
Today
– Collections managed by groups
– Collections not readable by any individual
– Collections interpretable only with aid of software
The Future
– Discoveries aided or made by computers, with verification by people?

The Proposition
Scientific databases in the future will be an even more important source for scientific discovery
Data collections are critical for
– New insights
– New scientific principles
– New knowledge
– Understanding complex systems
Let’s look at 3 problems and the challenges they present

My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data

Three Problems in the Data Era
1. Too much data
2. Complex systems
3. Complex science
Science and Data: How the information revolution has changed their relationship
[Diagram labels: Scientific Knowledge, Theory, Models, Hypotheses, Questions, Data, Measurement, Data Collections]

Scientific Data in the Future
Problem 1: Too much data
The Challenges
– How do you look at large volumes of data?
– What does data quality mean for large data collections?
– How do you determine which data are important?

Challenge: Too Much Data
Too much data for any one person to read or understand
Can use
– Visualization
– Data reduction
– Anomalies and outliers
How does anyone read a terabyte of data?
Software must be used to “read” data
Can we allow software to determine what are important data?
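One way software can "read" data on a person's behalf is to flag anomalies automatically. The sketch below uses the modified z-score based on the median absolute deviation (MAD), a common robust rule; it is offered as an illustration of the idea, not as the method the talk has in mind. The function name `flag_outliers`, the sample values, and the conventional 3.5 threshold are assumptions.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return values whose MAD-based modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    # 0.6745 rescales MAD so the score is comparable to an ordinary z-score.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical readings with one anomalous value.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0, 10.1]
print(flag_outliers(readings))
```

Whether a flagged value is an error or a discovery is exactly the question the slide raises; the code only narrows down where a person should look.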

Challenge: Too Much Data
Do we have the technology to handle the overwhelming volume of data from new measurement techniques?
What to capture when we generate too much data too fast?
How to store, represent, manipulate and display data that are too voluminous?
How to find out which data are important?

Challenge: Too Much Data and Data Quality
Evaluating data quality
How can large amounts of data be evaluated?
– In real time?
– As new data are published?
How can large data sets be integrated together correctly?
What does quality mean in a terabyte of data?
– For each data point?
– Each set of points?
– Sub-collections?
– An entire collection?

Challenge: Too Much Data and Data Quality
Evaluating data quality
Bad data quality leads to bad science and bad decisions based on science
One measurement does not make a property
Agreement between theory and experiment does not mean both are correct
In today’s world with terabytes of data, what does quality mean?

Challenge: Modeling and Data Quality
Data Quality
Making accurate virtual measurements on virtual systems
What is the quality in a calculation?
How do you establish uncertainty for a calculation?
Which computational results should be stored, and how can those data be handled?
How do you discover something new in a mass of computational results?
Some models have mechanisms for assessing quality
– HΨ = EΨ (Schrödinger equation)
– The variational principle applies only to energy
– Quality of other properties calculated from the equation is unknown
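For reference, the variational principle mentioned above can be stated in its standard textbook form: for any normalizable trial wavefunction, the energy expectation value is an upper bound on the true ground-state energy E₀. The symbol Ψ_trial is introduced here for illustration.

```latex
E[\Psi_{\text{trial}}]
  = \frac{\langle \Psi_{\text{trial}} \,|\, \hat{H} \,|\, \Psi_{\text{trial}} \rangle}
         {\langle \Psi_{\text{trial}} \,|\, \Psi_{\text{trial}} \rangle}
  \;\ge\; E_0
```

This is why energy has a built-in quality check (a lower computed energy is a better one), while other properties computed from the same trial wavefunction carry no comparable bound.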

Challenge: Modeling and Data Quality
Documenting quality of a calculation
Making accurate virtual measurements on virtual systems
How do you establish uncertainty for each step of a calculational result?
1. Model assumptions (which independent variables are used)
2. Translation into algorithms
3. Coding
4. Input
5. Finite arithmetic
6. Post-processing analysis
Science Vol. 336, pp. … (2012): Software created by public funding must be released, just as with data themselves
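The "finite arithmetic" step is easy to demonstrate. The sketch below adds 0.1 ten times with an ordinary floating-point loop and compares it with `math.fsum`, which tracks the lost low-order bits. It is a small illustration of one uncertainty source in the list above, not a general error analysis; the variable names are arbitrary.

```python
import math

values = [0.1] * 10

# Plain left-to-right accumulation: each addition rounds to the nearest double.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum keeps exact partial sums and rounds only once at the end.
careful_total = math.fsum(values)

print(naive_total == 1.0)    # False on IEEE 754 doubles
print(careful_total == 1.0)  # True
print(abs(naive_total - careful_total))
```

The discrepancy is tiny here, but it accumulates over the immense calculations the talk describes, which is why it belongs in the documented uncertainty of a computed result.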

Challenge: With Too Much Data, What is Important?
When you have a lot of data, what can you do with it?
Abstraction of important features
How can we find what is important when we have too much data? Or not enough of the correct data?
Truly great science is having the insight into what is important
Can we teach software how to do that?
… trillion cells in body
Number of human proteins is estimated to be 30,000 – 70,000
Which are important and why?
At least 150 proteins repair DNA damage

Scientific Data in the Future
Problem 2: Real systems are very complex
The Challenges
– Large number of objects
– Large number of independent variables
– Changing scientific language
– Data integration

Complexity Challenge: Many Objects
There are too many objects to count, observe, measure, or calculate
– Number of stars
– Number of species
– Number of chemicals
– Number of individuals
– Number of rocks
– Number of cells
– Number of thoughts
– Number of ecosystems
You get the point

Complexity Challenge: Independent Variables
How do we use metadata (report the relevant independent variables) to describe what we preserve?
Independent variables are a quantitative mechanism for expressing our knowledge about how and why a phenomenon occurs
Capturing complete knowledge of independent variables requires a large or (perhaps) even an impossible amount of data
One goal of research is to understand which variables are important and why
Our knowledge clearly evolves over time
Example: P = (n/V)RT (ideal gas law)
– Dependent variable: P = pressure
– Independent variables: n/V = number/volume = density; T = temperature
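The ideal gas example can be made concrete: pressure is the dependent variable, and a stored pressure value is only meaningful if the independent variables (molar density n/V and temperature T) are recorded with it. A minimal sketch follows, assuming SI units and the CODATA value of the gas constant; the function name and the sample inputs are illustrative.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)

def pressure(n_mol, volume_m3, temperature_k):
    """Ideal gas law P = (n/V) * R * T, returning pressure in pascals."""
    molar_density = n_mol / volume_m3  # independent variable n/V
    return molar_density * R * temperature_k

# One mole in 22.414 litres at 273.15 K: close to one standard atmosphere.
p = pressure(1.0, 0.022414, 273.15)
print(round(p))

# Doubling T at fixed n/V doubles P; a bare pressure value cannot tell you which T applied.
print(math.isclose(pressure(1.0, 0.022414, 546.30), 2 * p))
```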

Complexity Challenge: Independent Variables
A major challenge in data collections is to capture the evolution of knowledge of independent variables
Must be done in such a way as to preserve data set compatibility
Let’s work through a quick example of the complexity and how knowledge changes with time
Most data are functions of numerous independent variables

Brain Imaging
Recording techniques evolve and improve over time
– X-ray, CT, MRI, PET, next?
Each technology individually evolves, as do the types of signals collected and their association with brain activity and region
Monitoring reactions to stimulus: pain, visual, auditory, tactile, etc.
Details must be defined and recorded

Brain History
If we imagine the details necessary to describe this, the number of independent variables expands rapidly
– Stimuli history
– Physiological history
– Developmental history
– Environmental exposures
– Drugs taken
– More
As with the development of unifying theories of the gross physical world (motion, evolution, chemistry, genetics), the details are necessary to find the dominant factors
What are the most important independent variables for recording brain history?
Still an open question!

Complexity Challenge: Independent Variables
Modern science requires data from many disciplines
If we must aggregate different data sets (e.g., over the Web) to do discovery, how do we know the data are comparable?
How do we integrate data sets with varying numbers of independent variables?
Especially if their names and meanings change over time?
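The integration problem can be sketched with two small record sets that use different names and different numbers of independent variables. Everything here is hypothetical: the alias table, the variable names, and the assumption that both sources already use the same units. It shows one possible approach (map old names to current ones, keep missing variables explicit) rather than a recommended standard.

```python
# Hypothetical alias table: historical variable names -> current canonical names.
ALIASES = {
    "T": "temperature_K",
    "temp": "temperature_K",
    "press": "pressure_Pa",
}

def harmonize(record, aliases=ALIASES):
    """Rename variables to canonical names; unknown names pass through unchanged."""
    return {aliases.get(name, name): value for name, value in record.items()}

def integrate(datasets):
    """Merge record sets, keeping every variable seen in any of them."""
    records = [harmonize(r) for ds in datasets for r in ds]
    all_vars = sorted({name for r in records for name in r})
    # Variables a source never reported stay explicit as None instead of being guessed.
    return [{name: r.get(name) for name in all_vars} for r in records]

older = [{"T": 300.0, "density": 1.2}]
newer = [{"temperature_K": 310.0, "density": 1.1, "humidity": 0.4}]
for row in integrate([older, newer]):
    print(row)
```

The `None` entries make the slide's question visible: the older record was collected before anyone knew humidity mattered, and no amount of renaming can recover it.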

Complexity Challenge: Evolving Language
These are powerful change factors that cannot be ignored in preserving data
How do languages evolve?
– Contractions of words
– Reordering sentences
– Borrowing words
– Dropping and adding beginnings and endings
– Differentiation of concepts
– Evolution of concepts
John McWhorter, The Power of Babel
Ontologies can help

Complexity Challenge: Evolving Language
Time and Scientific Language
Grammar rules appeared only a few hundred years ago
Language change factors ignore authority
Usage wins over regulations every time!
– Are terminology standards actually used?
Are efforts such as that on the right doomed to fail?

Complexity Challenge: Evolving Language
Time and the evolution of Scientific Language
New knowledge requires new language
Data preservation efforts must recognize evolution of scientific language
Not just independent variables and metadata, but the scientific language itself
So if you are going to do “discovery,” you’d better know what you are working with

Complexity Challenge: Data Integration
Developing standards for scientific data and metadata
What is the business case for such standards?
How can you standardize scientific language if it continues to evolve?
How can you determine object equivalency and uniqueness with partial data sets?
How do you persuade scientists to back off the state-of-the-art to agree on standards?

Complexity Challenge: Data Integration
Data Standards: Making exploitation of large data sets possible
What standards are needed for making data sets work together?
How can you trust integrated data sets?
No science is an island by itself
Science today is multi-discipline, international, multi-lingual, ever-changing
Integration can be achieved by standards and clear reporting of measurements
As knowledge of variables increases, integrating old and new data becomes more difficult

Complexity Challenge: Data Integration
Data Ownership
We must differentiate between discovery and adding value
Observing nature should not lead to data ownership
Transforming observations through value-added intellectual effort can create intellectual property rights (IPR)
For scientific data, we must be very careful not to restrict use by others
The same observations led to many different theories of planetary motion, from Aristotle and Ptolemy to Kepler to Newton to today

Complexity Challenge: Data Integration
Maintaining full and open access to the large number of databases required for making new scientific discoveries
What policies are needed for full and open access?
Open access aims to provide everyone with the information and data to advance science
Open is not necessarily free
Long-term preservation does cost money
– Data and literature collections must be supported
How can discoverers profit from their automated discoveries?
How do you get the information industry to understand the new paradigm for discovery?

Complexity Challenge: Data Integration
Data Costs
Nothing is ever really free!
It costs significant money to generate, capture, manage, store, analyze, use, disseminate, and preserve scientific data
Data costs must be integrated into the cost of generation
Policy and practice will vary from discipline to discipline, but nothing is ever free

Complexity Challenge: Data Integration
Data Repositories
Started with crystal structures
Genomics
Other disciplines following
NSF now requiring data management plans
Often required to publish papers
Curation (everything reported correctly) now automated
Model does not translate easily for evolving fields

Complexity Challenge: Data Integration
Progression of Data Collection
– Individual
– Collegial
– Institution or discipline repository
– Evaluated data
– “Property values”
Each step requires more metadata to provide adequate documentation
Very difficult to add metadata after the fact
For a new phenomenon, it is difficult to know which independent variables are necessary

Scientific Discovery in Preserved Data
Problem 3: Real systems are very complex, and complex behavior in systems is difficult to find
The Challenges
– How do we recognize real understanding?
– What is knowledge discovery in the future?

Challenge: Real Understanding
Real systems are very complex
How can you identify the existence of a unifying theory or concept?
Could we have derived quantum mechanics from a complete database of atomic and molecular spectra?
What features does quantum mechanics have beyond these data?

Challenge: Real Understanding
Real systems are very complex
Multiple views of the same phenomena exist
The Simple (?) Laws of Interaction
– String theory
– Quantum theory
– Matrix mechanics
– Maxwell’s theory
– Quantum electrodynamics
– Newton’s laws of motion
Are all views of nature equally discoverable? By computer-aided discovery?

Challenge: Real Understanding
How do we develop real understanding? Real scientific knowledge?
Just because we measure a phenomenon does not mean we understand it
– Do we know how many genes there are?
– Does measuring the mass of the universe make us understand dark matter?
How do data lead to understanding?

Challenge: Real Understanding
Knowledge Discovery
Large amounts of data can help find new discoveries
How to know which data are the most important, the key to discovery?
How to know something is there to be discovered?
Can too much data make discovery more difficult?
Will discovery have to be automated? Can it be?

Data Collections in 21st Century Science
“The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them” (William Bragg)

Some Final Thoughts
Scientific databases in the future will be an even more important source for scientific discovery
Preservation of data is needed for
– New insights
– Scientific principles
– New knowledge
– Understanding complex systems
The problems and challenges I have just outlined are not insurmountable; they are just problems and challenges

Some Final Thoughts
Science has changed and, with that change, our expectations for science have changed.
We now expect science to be a force for shaping the future, not just understanding nature
Scientific databases in the future will be an even more important source for scientific discovery
The Data Revolution has become an enabling force to meet our expectations for 21st Century Science