Presentation transcript:

Current NIST Definition
Big data consists of advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis. Big data is where the data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches, or requires significant horizontal scaling (more nodes) for efficient processing.

Meeker/Wu, May Internet Trends, D11 Conference
Zettabyte = 1 million petabytes. IP traffic was ~1 zettabyte per year in 2015, growing slightly faster than Moore's law.

Faster than Moore's law? Or slower? Either way, this is why we need cost-effective computing. Full personal genomics alone: 3 petabytes per day.

Some data sizes
- Web pages at ~300 kilobytes each = ~10 petabytes
- Earth and polar observation: several petabytes per year
- Large Hadron Collider: 15 PB/year (300K cores, 24 by 7)
- Radiology: 69 PB per year
- Imagery is a common source of big data: medical/scientific instruments, surveillance devices, synchrotron light sources; Facebook uploading 500M images a day
- Square Kilometer Array Telescope will produce 100 terabits/second (~400 exabytes/year); currently astronomy has ~100 TB in a large collection, petabytes in total
- Internet of Things: billions of devices on the Internet by 2020
- Exascale simulation data dumps: terabytes/second
- Deep learning to train a self-driving car: 100 million megapixel images ~ 100 terabytes
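The SKA figure above can be sanity-checked with a few lines of unit arithmetic. This is a quick sketch assuming decimal (SI) units and a 365.25-day year:

```python
# Sanity-check of the data-rate arithmetic quoted above (decimal SI units assumed).
PB = 10**15          # bytes in a petabyte
EB = 10**18          # bytes in an exabyte
ZB = 10**21          # bytes in a zettabyte

# "Zettabyte = 1 million petabytes"
assert ZB / PB == 1e6

# Square Kilometer Array: 100 terabits/second sustained
ska_bytes_per_sec = 100e12 / 8                 # bits -> bytes: 12.5 TB/s
seconds_per_year = 365.25 * 24 * 3600
ska_eb_per_year = ska_bytes_per_sec * seconds_per_year / EB
print(f"SKA: ~{ska_eb_per_year:.0f} EB/year")  # ~394 EB/year, i.e. roughly 400
```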

NIST Big Data Taxonomy Philosophy
"Big Data" and "Data Science" are currently composites of many concepts. To better define these terms, we first addressed the individual concepts needed in this disruptive field, then came back to clarify the two overarching buzzwords and the concepts they encompass. To keep the topic of data and data systems manageable, we tried to restrict our discussions to what is different now that we have "big data". 30 documents studied the issue!

Current NIST Definitions
Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis testing. A Data Scientist is a practitioner with sufficient knowledge of the overlapping regimes of domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific method process through each stage of the big data lifecycle.

McKinsey Institute on Big Data Jobs
There will be a shortage of the talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use big data analysis to make effective decisions.

The 4 Paradigms of Scientific Research
1. Theory
2. Experiment or observation (e.g., Newton observed apples falling to derive his equations)
3. Simulation of theory or model (computational science)
4. The Fourth Paradigm: data-driven or data-intensive scientific discovery (data science); a free book: us/collaboration/fourthparadigm/
Data (not theory) drives the discovery process. This is clear for recommender systems: there are no "Newton's equations" of user rankings of movies or books, but ranking data identifies who is like you and so suggests what you might like. Jobs in data science (category 4) greatly outnumber those in computational science (category 3).
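The recommender-system point, ranking data identifying "who is like you", can be sketched as user-based collaborative filtering with cosine similarity. The users, movies, and ratings below are invented purely for illustration:

```python
# Minimal sketch of user-based collaborative filtering: recommend what
# similar users liked. All ratings here are made-up illustration data.
import math

ratings = {                      # user -> {movie: rating on a 1-5 scale}
    "alice": {"Alien": 5, "Up": 1, "Heat": 4},
    "bob":   {"Alien": 4, "Up": 2, "Heat": 5},
    "carol": {"Alien": 1, "Up": 5},
    "dave":  {"Alien": 5},
}

def cosine(u, v):
    # Cosine similarity over the movies both users rated.
    shared = set(u) & set(v)
    num = sum(u[m] * v[m] for m in shared)
    den = (math.sqrt(sum(r * r for r in u.values()))
           * math.sqrt(sum(r * r for r in v.values())))
    return num / den if den else 0.0

def recommend(user):
    # Rank unseen movies by similarity-weighted ratings from other users.
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("dave"))  # ['Heat', 'Up'] — dave rates like alice/bob, who loved Heat
```

Because dave's one rating matches alice and bob (not carol), their favourite unseen movie is weighted highest; no model of "why", just similarity in the ranking data.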

DIKW Pipeline
(Raw) Data → Information → Knowledge → Wisdom, Decisions, Policy, Benefit ...
Big data refers to all stages of the pipeline, e.g. it includes "Big Benefit". There are analytics at each → step, but especially at the Information → Knowledge transformation. Information as a Service (a cloud-hosted NoSQL or SQL repository) transformed by Software as a Service (image processing, collaborative filtering, ...) gives Knowledge as a Service.
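A toy illustration of the Data → Information → Knowledge → Decision steps, one transformation per line; the click-event data and the "optimise" decision are made up for the example:

```python
# Toy DIKW pipeline on invented event data: each stage adds structure/meaning.
from collections import Counter

raw_data = ["search", "buy", "search", "search", "buy", "browse"]  # raw events

information = Counter(raw_data)               # Data -> Information (aggregation)
knowledge = information.most_common(1)[0][0]  # Information -> Knowledge (insight)
decision = f"optimise the '{knowledge}' path" # Knowledge -> Decision

print(information)   # Counter({'search': 3, 'buy': 2, 'browse': 1})
print(decision)      # optimise the 'search' path
```

In the slide's terms, the `Counter` step is the hosted repository's job and the `most_common` step is where the analytics, and most of the value, sit.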

Use Case Template
26 fields completed for 51 use cases in 9 application areas:
Government Operation: 4; Commercial: 8; Defense: 3; Healthcare and Life Sciences: 10; Deep Learning and Social Media: 6; The Ecosystem for Research: 4; Astronomy and Physics: 5; Earth, Environmental and Polar Science: 10; Energy: 1
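As a quick consistency check, the nine per-area counts listed above do add up to the 51 use cases:

```python
# The nine application areas and their use-case counts, as quoted above.
areas = {
    "Government Operation": 4,
    "Commercial": 8,
    "Defense": 3,
    "Healthcare and Life Sciences": 10,
    "Deep Learning and Social Media": 6,
    "The Ecosystem for Research": 4,
    "Astronomy and Physics": 5,
    "Earth, Environmental and Polar Science": 10,
    "Energy": 1,
}
total = sum(areas.values())
print(total)  # 51
```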

51 Detailed Use Cases: Many TBs to Many PBs
Each records analysis requirements, data volume, velocity, variety, software, analytics, etc.
- Government Operation: National Archives and Records Administration, Census Bureau
- Commercial: finance in cloud, cloud backup, Mendeley (citations), Netflix, web search, digital materials, cargo shipping (as in UPS)
- Defense: sensors, image surveillance, situation assessment
- Healthcare and Life Sciences: medical records, graph and probabilistic analysis, pathology, bioimaging, genomics, epidemiology, people activity models, biodiversity
- Deep Learning and Social Media: self-driving cars, geolocating images, Twitter, crowdsourcing, network science, NIST benchmark datasets
- The Ecosystem for Research: metadata, collaboration, language translation, light source experiments
- Astronomy and Physics: sky surveys compared to simulation, Large Hadron Collider at CERN, Belle II accelerator in Japan
- Earth, Environmental and Polar Science: radar scattering in atmosphere, earthquake, ocean, Earth observation, ice sheet radar scattering, Earth radar mapping, climate simulation datasets, atmospheric turbulence identification, subsurface biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
- Energy: smart grid