Scientific data cloud infrastructure and services in Chinese Academy of Sciences Jianhui Yuanke

Presentation transcript:

Scientific data cloud infrastructure and services in Chinese Academy of Sciences
Jianhui Yuanke Yuanchun
Computer Network Information Center, Chinese Academy of Sciences

Outline
About us
–CAS (Chinese Academy of Sciences)
–CNIC (Computer Network Information Center), CAS
–SDC (Scientific Data Center), CNIC, CAS
About Scientific Data Cloud of CAS
–Data Challenge
–Architecture
–Infrastructure Service
–Middleware Service
–Data Service
Conclusion

CAS is a leading academic institution and comprehensive research and development center in natural science, technological science and high-tech innovation in China. It was founded in Beijing on 1 November 1949 on the basis of the former Academia Sinica (Central Academy of Sciences) and Peiping Academy of Sciences.


CNIC is a public support institution for the consistent construction, operation and servicing of the information infrastructure of CAS. It is a pioneer, promoter and participant in the informatization of domestic scientific research and scientific research management.

Operation and Services in CNIC (provided by 7 business departments respectively)
Scientific Research Network Environment
Scientific Data Environment
Supercomputing Environment
Informatization of Research Management
Internet-based Science Popularization and Education
Internet Fundamental Resource Services

Scientific Data Center
Scientific Data Center (SDC) is the support facility in charge of the construction, management, operation and maintenance of the CAS Informatization Data Application Environment, and has led the implementation of the CAS Scientific Database Project for more than 20 years.
SDC provides storage services, data services and related application technology services for the entire CAS.
SDC hosts the Secretariat of the Committee on Data for Science and Technology (CODATA) and the CAS Secretariat for the World Wide Web Consortium (W3C).
The vision of SDC is to become an important facilitator of the exchange and application of scientific data resources, a key technology supplier across the lifecycle of scientific data, and a leader in transforming scientific data into knowledge services.


Hotter and hotter in data research
Mar. 29, 2012: the Obama Administration announced the "Big Data Research and Development Initiative" ($200 million): improving our ability to extract knowledge and insights from large and complex collections of digital data
Feb. 11, 2011: Science issued a Special Online Collection, "Dealing with Data"
Sep. 2009: Nature issued "Data's shameful neglect": Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.
The Second International Symposium on Dataology & Data Science was held 3 days ago in China
Problems with data today: difficult to discover, difficult to access, being lost

Data Driven Scientific Discovery
Data is regarded as the most valuable thing.
"The impact of Jim Gray's thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science." (Bill Gates)
Scientific discovery based on data-intensive computing is now considered the "fourth paradigm" after theoretical, experimental, and computational science.

Beyond Moore's Law in Data
IDC: data doubles in less than 18 months
Huge volume
Rapid increase
Various types and formats
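
The doubling claim on this slide can be turned into simple arithmetic. The sketch below is illustrative only: it assumes a fixed doubling period of exactly 18 months (the slide gives this as an upper bound), and the function names are hypothetical rather than taken from the presentation.

```python
import math


def projected_volume(initial: float, months: float, doubling_months: float = 18.0) -> float:
    """Return the data volume after `months`, assuming a fixed doubling period."""
    return initial * 2 ** (months / doubling_months)


def months_to_reach(initial: float, target: float, doubling_months: float = 18.0) -> float:
    """Return how many months it takes to grow from `initial` to `target`."""
    return doubling_months * math.log2(target / initial)


if __name__ == "__main__":
    # Hypothetical starting point: 1 unit of data (the unit does not matter).
    for years in (3, 6, 9):
        factor = projected_volume(1.0, years * 12)
        print(f"{years} years: {factor:g} times the initial volume")
```

Under this assumption, three years corresponds to two doublings, so the volume grows fourfold. A shorter doubling period, as the slide suggests, would grow faster still.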

Data Challenge
Scientists are being overwhelmed by exploding volumes of scientific data.
Much scientific research needs data distributed across different locations.
There is a growing gap between the capability of modern scientific instruments and that of scientists.
It has been a great challenge to view, manipulate, store, move, share, and interpret the massive data.

Scientific Data Deluge in CAS
Large scientific facilities produce huge volumes of data
–20+ in operation
–20+ under construction
Long-term field observation stations
–100+ stations covering ecology, environment, space, etc.
Long-term research data need to be archived and shared
–100+ institutes
[Figures: large scientific facilities; field observation stations]

Advanced CI for Data Lifecycle in CAS
[Diagram: data and information stream across the data lifecycle]
Lifecycle stages: Generation & Collection, Transmission, Computing & Analysis, Storage & Curation, Application
Data sources: 1. field observation stations 2. large scientific facilities 3. others
High Speed Network: CSTNET, CSTNET-CNGI, GLORIAD
Supercomputing Grid: computing, analysis, mining, visualization
Data Centers: storage & preservation, curation, sharing and service
Data-intensive e-Science activities and applications

Cloud Computing
It is a mixed evolution of grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, etc.
Its characteristics (large scale, virtualization, high reliability, generality, expandability, on-demand service, very low cost) make it a popular computing paradigm.
It can bridge scientists and massive data.
Chinese Academy of Sciences Scientific Data Cloud (CASSDC) uses cloud technology to give scientists easier ways to use powerful information infrastructure, massive scientific data and rich scientific software.

Services of CASSDC

Scientific Data Infrastructure
Middleware (scientific data grid middleware, internet-based storage service middleware…)
Scientific databases
Massive storage system
Data-intensive computing facility
High speed network
Application-enabling environments and typical e-Science practice
Software and toolkits (scientific data collection, curation and publishing, data analysis and visualization…)

Data Centers Distribution of CASSDC
Scientific Data ~1PB
–Above 60 institutions
–Multiple disciplines
Storage Capacity ~22PB (50PB)
–1 major center
–1 archive center
–12 mid-size centers
Computing Capacity ~5000 (10000) CPU cores
–Dedicated design for DIC (data-intensive computing)

System Architecture of Major Center

Enabling Technology: Infrastructure
Global File System of Cloud Storage

Enabling Technology: Infrastructure
On-the-fly provisioning of a computing cluster

Scientific Databases (SDB)
A long-term mission, started in 1986 and funded by CAS
–many institutes involved
–long-term, large-scale collaboration
–data from research, for research
Collecting multi-discipline research data and promoting data sharing
–More than 350 research databases and 500 datasets from 61 institutes
–Over 200TB of data available for open access and download

Scientific Databases (cont.)
Focusing on data integration and on upgrading research databases into resource databases and even reference databases
Research database
Resource database
Reference database
Application-oriented database

Scientific Databases (cont.)
8 resource databases
–Geoscience
–Biodiversity
–Chemistry
–Astronomy
–Space Science
–Microbiology and virus
–Material science
–Environment
2 reference databases
–China Species
–Compound
4 application-oriented databases
–High Energy (ITER)
–Western Environment Research
–Ecology research
–Qinghai Lake Research

Scientific Databases (cont.)
37 research databases
–Physics & Chemistry, Geosciences, Biosciences, Atmospheric & Ocean Science, Energy Science, Material Science, Astronomy & Space Science

CAS Scientific Data Grid
SDG is
–built upon the Scientific Databases, supporting uniform, convenient, secure and appropriate discovery of and access to large-scale, distributed and heterogeneous scientific data
Building scientific data application grids according to domain requirements
–Integrating distributed data, analysis tools, and storage and computing facilities behind a uniform data service interface
–4 pilot grids: bioscience grid, geoscience grid, chemistry grid, astronomy and space science grid
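
The idea of a uniform data service interface over distributed, heterogeneous sources can be illustrated with a small sketch. This is not the SDG API: every class and method name below is hypothetical, and the two backends are in-memory stand-ins chosen only to show how differently stored data can answer the same query.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical adapter that hides how one institute stores its data."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def search(self, keyword: str) -> list[dict]:
        """Return matching records as dicts with 'source' and 'title' keys."""


class TableSource(DataSource):
    """Source backed by row dictionaries (stands in for a relational table)."""

    def __init__(self, name: str, rows: list[dict]) -> None:
        super().__init__(name)
        self.rows = rows

    def search(self, keyword: str) -> list[dict]:
        key = keyword.lower()
        return [
            {"source": self.name, "title": row["title"]}
            for row in self.rows
            if key in row["title"].lower()
        ]


class FileSource(DataSource):
    """Source backed by file names (stands in for a file archive)."""

    def __init__(self, name: str, filenames: list[str]) -> None:
        super().__init__(name)
        self.filenames = filenames

    def search(self, keyword: str) -> list[dict]:
        key = keyword.lower()
        return [
            {"source": self.name, "title": filename}
            for filename in self.filenames
            if key in filename.lower()
        ]


class UniformDataService:
    """Single entry point that fans one query out to every registered source."""

    def __init__(self, sources: list[DataSource]) -> None:
        self.sources = sources

    def search(self, keyword: str) -> list[dict]:
        results: list[dict] = []
        for source in self.sources:
            results.extend(source.search(keyword))
        return results
```

A caller builds one `UniformDataService` from any mix of sources and issues a single `search`, without knowing which backend holds which record. Adding a new kind of store means writing one more adapter, not changing the caller.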

Scientific Data Grid: Architecture
Organization Architecture of SDG

SDG Platform and Middleware
Platform
–SDGIM: Information Management
–SDGOM: Operation Management
–SDGSA: Storage Service
–SDGMS: Monitoring and Statistics
Middleware
–SDGDD: Data Publishing
–SDGDT: Data Transfer Toolkit
–SDGDC: Data Compression Toolkit
–SDGMM: Metadata Management
–SDGJS: Job Scheduler

Tools for data management and service

An Integrated Case on Geography Supported by CASSDC
Project: High Precision Display of Earth Surface
Data and computing resources are both distributed
The model comes from a CAS scientist
Adopted middleware:
–Data search
–Data transport
–On-the-fly computing provision
–Job scheduler
It handles massive data computation that some commercial geometric software cannot
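
The four middleware steps named on this slide (data search, data transport, on-the-fly computing provision, job scheduling) can be read as a pipeline. The sketch below only illustrates that chaining. All names, the catalog layout, and the sizing rule of two datasets per node are assumptions made for the example, not details of the CASSDC middleware.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of one scheduled computation."""
    datasets: list[str]   # datasets the job will read
    transfers: list[str]  # datasets that must be moved to the compute site first
    nodes: int            # size of the cluster provisioned for the job


def search_data(catalog: dict[str, str], keyword: str) -> list[str]:
    """Data search: find dataset names containing the keyword."""
    return sorted(name for name in catalog if keyword in name)


def plan_transport(catalog: dict[str, str], datasets: list[str], compute_site: str) -> list[str]:
    """Data transport: list the datasets that are not yet at the compute site."""
    return [name for name in datasets if catalog[name] != compute_site]


def provision_nodes(n_datasets: int, datasets_per_node: int = 2) -> int:
    """On-the-fly provision: size a cluster for the workload (at least one node)."""
    return max(1, math.ceil(n_datasets / datasets_per_node))


def schedule(catalog: dict[str, str], keyword: str, compute_site: str) -> Job:
    """Job scheduler: chain the three steps above and return the resulting job."""
    datasets = search_data(catalog, keyword)
    transfers = plan_transport(catalog, datasets, compute_site)
    nodes = provision_nodes(len(datasets))
    return Job(datasets, transfers, nodes)
```

Here `catalog` maps a dataset name to the site that holds it, so a call such as `schedule(catalog, "dem", "cnic")` returns which datasets were found, which of them need to be moved, and how many nodes to request.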

An Integrated Case on Biology Supported by CASSDC
Gene Alignment Project
Data:
–Microbiology Institute
–World Data Center for Microorganisms
–Wuhan Virus Institute
Computing:
–CNIC
–Microbiology Institute
Adopted middleware:
–Data search
–Data transport
–Job scheduler
–User authentication

An Integrated Case on Biology Supported by CASSDC (cont.)

Cooperation
International Organization Membership

Cooperation with Europe
CSTNET provides network support for data transmission between Europe and China
–ITER
–Global Earth Observation System of Systems
–CERN LHC: ATLAS & CMS
–ARGO-Yangbajing

Challenges
On-demand linking of multi-disciplinary data based on semantics
Big Data processing
–Highly scalable, low cost, high throughput
–On-demand flexible data processing
Integrating data, storage, computing, analysis models, etc. into a whole system driven by one specific scientific problem
–Making infrastructure invisible to scientists

Conclusion
Scientific discovery has become increasingly data intensive, which calls for reliable and easily accessible scientific data infrastructure
CAS continues to promote the building of scientific data infrastructure and data-intensive e-Science practices
Seeking potential cooperation in data-intensive e-Science and data cloud

Thank you!