Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University.

Slides:



Advertisements
Similar presentations
Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.
Advertisements

Trying to Use Databases for Science Jim Gray Microsoft Research
World Wide Telescope mining the Sky using Web Services Information At Your Fingertips for astronomers Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
Web Services for the Virtual Observatory Alex Szalay, Tamas Budavari, Tanu Malik, Jim Gray, and Ani Thakar SPIE, Hawaii, 2002 (Living in an exponential.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
21 st Century Science and Education for Global Economic Competition William Y.B. Chang Director, NSF Beijing Office NATIONAL SCIENCE FOUNDATION.
Will The 21st Century Need a Library ? Cyberinfrastructure and its Implications.

Discovery and Exploration in the VO Chris Miller NOAO/CTIO La Serena, Chile T HE US N ATIONAL V IRTUAL O BSERVATORY.
What does the science library / informatics professional need to know and be able to do? Ronald L. Larsen, Dean The iSchool at Pitt Oct. 17, 2008.
Discover the world at Leiden University Transformation of the Academic Library Kurt De Belder OCLC Research Distinguished Seminar, 10 May 2013.
Development of China-VO ZHAO Yongheng NAOC, Beijing Nov
Providing collections, tools and services for digital humanities A national library perspective Clément Oury Head of Digital Legal Deposit Bibliothèque.
The Data Lifecycle and the Curation of Laboratory Experimental Data Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
Using Sakai to Support eScience Sakai Conference June 12-14, 2007 Sayeed Choudhury Tim DiLauro, Jim Martino, Elliot Metsger, Mark Patton and David Reynolds.
Lecture 13 Information and History. Objectives Revolution or Paradigms of Information Systems Development of Information Systems in historical context.
Long-Term Preservation of Astronomical Research Results Robert Hanisch US National Virtual Observatory Space Telescope Science Institute Baltimore, MD.
Data-Intensive Computing in the Science Community Alex Szalay, JHU.
Teaching Science with Sloan Digital Sky Survey Data GriPhyN/iVDGL Education and Outreach meeting March 1, 2002 Jordan Raddick The Johns Hopkins University.
Long-Term Preservation of Astronomical Research Results Robert Hanisch US National Virtual Observatory Space Telescope Science Institute Baltimore, MD.
An Introduction to the Open Science Data Cloud Heidi Alvarez Florida International University Robert L. Grossman University of Chicago Open Cloud Consortium.
1 Building National Cyberinfrastructure Alan Blatecky Office of Cyberinfrastructure EPSCoR Meeting May 21,
Computing in Atmospheric Sciences Workshop: 2003 Challenges of Cyberinfrastructure Alan Blatecky Executive Director San Diego Supercomputer Center.
Sky Surveys and the Virtual Observatory Alex Szalay The Johns Hopkins University.
Final lecture: The future of scholarly communication, or what have we learned? Honors NJohn November 30, 2006.
Supported by the National Science Foundation’s Information Technology Research Program under Cooperative Agreement AST with The Johns Hopkins University.
Open Science Grid For CI-Days Internet2: Fall Member Meeting, 2007 John McGee – OSG Engagement Manager Renaissance Computing Institute.
Amdahl Numbers as a Metric for Data Intensive Computing Alex Szalay The Johns Hopkins University.
Big Data in Science (Lessons from astrophysics) Michael Drinkwater, UQ & CAASTRO 1.Preface Contributions by Jim Grey Astronomy data flow 2.Past Glories.
National Center for Supercomputing Applications Observational Astronomy NCSA projects radio astronomy: CARMA & SKA optical astronomy: DES & LSST access:
Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.
NEPTUNE Canada Workshop Oceans 2.0 Project Environment NEPTUNE Canada DMAS Team Victoria, BC February 16, 2009.
An introduction to digital libraries Jens Vigen, 2 nd CERN-UNESCO School on Digital Libraries, Rabat, Morocco, Nov
Alex Szalay Department of Physics and Astronomy The Johns Hopkins University and the SDSS Project The Sloan Digital Sky Survey.
Future Directions of the VO Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research.
Microsoft Research Faculty Summit Division within Microsoft Research focused on partnerships between academia, industry and government to advance.
Public Access to Large Astronomical Datasets Alex Szalay, Johns Hopkins Jim Gray, Microsoft Research.
Large Scientific Databases. Large scientific datasets are those which are systematically collected and organized and which stretch the technical capabilites.
-- Don Preuss NCBI/NLM/NIH
LSST: Preparing for the Data Avalanche through Partitioning, Parallelization, and Provenance Kirk Borne (Perot Systems Corporation / NASA GSFC and George.
Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Reserach Alexander Szalay, JHU Jim Gray, Microsoft Reserach.
Perspectives on Cyberinfrastructure Daniel E. Atkins Professor, University of Michigan School of Information & Dept. of EECS October 2002.
Source: Alex Szalay. Example: Sloan Digital Sky Survey The SDSS telescope array is systematically mapping ¼ of the entire sky Discoveries are made by.
Federated Discovery and Access in Astronomy Robert Hanisch (NIST), Ray Plante (NCSA)
“Big Data” and Data-Intensive Science (eScience) Ed Lazowska Bill & Melinda Gates Chair in Computer Science & Engineering University of Washington July.
NVO Review -- San Diego Jan The VO compared to Other O‘s Jim Gray Microsoft T HE US N ATIONAL V IRTUAL O BSERVATORY.
Environmental Scanning and Library 2.0 Computers in Libraries 2006 Marianne E. Giltrud May 8, 2006.
Cyberinfrastructure What is it? Russ Hobby Internet2 Joint Techs, 18 July 2007.
March 1st, 2006Prospective PNG PNG: Databases - Virtual Observatory.
ESFRI & e-Infrastructure Collaborations, EGEE’09 Krzysztof Wrona September 21 st, 2009 European XFEL.
Data Archives: Migration and Maintenance Douglas J. Mink Telescope Data Center Smithsonian Astrophysical Observatory NSF
EScience: Techniques and Technologies for 21st Century Discovery Ed Lazowska Bill & Melinda Gates Chair in Computer Science & Engineering Computer Science.
Applications and Requirements for Scientific Workflow May NSF Geoffrey Fox Indiana University.
Cyberinfrastructure Overview Russ Hobby, Internet2 ECSU CI Days 4 January 2008.
Cyberinfrastructure: Many Things to Many People Russ Hobby Program Manager Internet2.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Arizona Astronomical Data Hub AAS 227: Dark/Orphaned Data P. Bryan Heidorn ORCID: University of January 2016.
Applications and Requirements for Scientific Workflow May NSF Geoffrey Fox Indiana University.
National Science Foundation Blue Ribbon Panel on Cyberinfrastructure Summary for the OSCER Symposium 13 September 2002.
1 This Changes Everything: Accelerating Scientific Discovery through High Performance Digital Infrastructure CANARIE’s Research Software.
Fedora Commons Overview and Background Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
Amdahl’s Laws and Extreme Data-Intensive Computing Alex Szalay The Johns Hopkins University.
Microsoft Research San Francisco (aka BARC: bay area research center) Jim Gray Researcher Microsoft Research Scalable servers Scalable servers Collaboration.
There is an inherent meaning in everything. “Signs for people who can see.”
Final Data Archiving of the Sloan Digital Sky Survey-an Example
Moving towards the Virtual Observatory Paolo Padovani, ST-ECF/ESO
Long-Term Preservation of Astronomical Research Results
EOSCpilot All Hands Meeting 8 March 2018 Pisa
Bird of Feather Session
Presentation transcript:

Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University

Living in an Exponential World Scientific data doubles every year –caused by successive generations of inexpensive sensors + exponentially faster computing Industrial revolution in generating scientific data Changes the nature of scientific computing Cuts across disciplines (eScience) It becomes increasingly harder to extract knowledge 20% of the worlds servers go into huge data centers by the Big 5 –Google, Microsoft, Yahoo, Amazon, eBay

Publishing Data Exponential growth: –Projects last at least 5 years –Data sent upwards only at the end –Data will never be centralized More responsibility on projects –Becoming Publishers and Curators (session on Data Publishing) Data sets too large to move –Analyses must be close to the data Roles Authors Publishers Curators Consumers Traditional Scientists Journals Libraries Scientists Emerging Collaborations Project www site Bigger Archives Scientists Linear Exponential

Continuing Growth How long does the data growth continue? High end always linear Exponential comes from technology + economics, rapidly changing generations! –like CCDs replacing plates… gene-chips How many new generations of instruments do we have left? Are there new growth areas emerging? Software (collaboration) is becoming an instrument –hierarchical data replication –Value added data/ mashups –data cloning

The Virtual Observatory Premise: most data is (or could be online) The Internet is the worlds best telescope: –It has data on every part of the sky –In every measured spectral band –As deep as the best instruments –The seeing is always great –Its a smart telescope: links objects and data to literature on them Software became the capital expense –Share, standardize, reuse..

Goal Create the most detailed map of the Northern sky The Cosmic Genome Project Two surveys in one Photometric survey in 5 bands Spectroscopic redshift survey Automated data reduction 150 man-years of development High data volume 40 TB of raw data 5 TB processed catalogs Data is public 2.5 Terapixels of images Sloan Digital Sky Survey The University of Chicago Princeton University The Johns Hopkins University The University of Washington New Mexico State University Fermi National Accelerator Laboratory US Naval Observatory The Japanese Participation Group The Institute for Advanced Study Max Planck Inst, Heidelberg Sloan Foundation, NSF, DOE, NASA The University of Chicago Princeton University The Johns Hopkins University The University of Washington New Mexico State University Fermi National Accelerator Laboratory US Naval Observatory The Japanese Participation Group The Institute for Advanced Study Max Planck Inst, Heidelberg Sloan Foundation, NSF, DOE, NASA

Public Use of the SkyServer Prototype in 21 st Century data access –400 million web hits in 6 years –930,000 distinct users vs 10,000 astronomers –Delivered 50,000 hours of lectures to high schools –Delivered 100B rows of data –1700 refereed publications –The worlds most used astro facility –Everything is a power law GalaxyZoo –27 million visual galaxy classifications by the public –Enormous publicity (CNN, Times, Washington Post, BBC) –100,000 people participating, blogs, poems, …. –Now truly amazing original discovery by a schoolteacher

Scholarly Communications SDSS is finished, preparations for long term archiving Very little paper trail –Proposals and papers archived –No Einstein letters today… Large projects communicate through exploders and phonecons –Some technical info on WIKI pages –More instant messaging, especially next generation –Science oriented blogs appearing What can we and what should we capture? –What will science historians do in 50 years?

Collaborative Trends Science is aggregating into ever larger projects VO is inevitable, a new way of doing science Present on every physical scale today, not just astronomy (Earth/Oceans, Biology, MS, HEP) But: there is a natural size for close collaborations Data collection is increasingly separate from analysis Analyzing archival data may be the only way to do 'small science' in 2020

Technology+Sociology+Economics Technology is changing very rapidly –Sensors, Moore's Law –Trend driven by changing generations of technologies –There may not be time for a top-down design Sociology is changing in unpredictable ways –YouTube… vs Google and Yahoo –In general, people will use a new technology if it is Offers something entirely new Or substantially cheaper Or substantially simpler –Build it and they (may) come… Funding is essentially level

Data Sharing in the VO What is the business model (reward/career benefit)? Three tiers (power law!!!) (a) big projects (b) value added, refereed products (c) ad-hoc data, on-line sensors, images, outreach info We have largely done (a), mandated Need Journal for Data to solve (b) Need VO-Flickr (a simple interface) for (c) Google Earth, Virtual Earth mashups Emerging citizen science (GalaxyZoo) Integrated environment for virtual excursions for education (World Wide Telescope, C. Wong/J. Fay)

Summary Data growing exponentially, petabytes/year by 2010 Explosion is coming from inexpensive sensors and value added data products Requires a new model for science –Having more data makes it harder to extract knowledge –New models for scientific collaborations Same thing happening in all sciences –High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, … Science with so much data requires a new paradigm –Computational methods, algorithmic thinking will come just as naturally as mathematics today eScience: an emerging new branch of science We need to reinvent data publishing