Presentation transcript: "Developing Cyberinfrastructure for Data-Oriented Science and Engineering" — Dr. Francine Berman, Director, San Diego Supercomputer Center
UCSD SAN DIEGO SUPERCOMPUTER CENTER Fran Berman Dr. Francine Berman Director, San Diego Supercomputer Center Professor and High Performance Computing Endowed Chair, UC San Diego Developing Cyberinfrastructure for Data-Oriented Science and Engineering
The Digital World: Commerce, Entertainment, Information, Education
Research, Education, and Data
Astronomy, Physics, Geosciences, Life Sciences, Arts and Humanities, Engineering
NVO – 100+ TB; SCEC – 153 TB; Japanese Art Images – 70.6 GB; TeraBridge – 800 GB; JCSG/SLAC – 15.7 TB; Projected LHC Data – 10 PB/year
Today's Research and Education Applications Cover the Spectrum
From compute-intensive (more FLOPS) to data-intensive (more BYTES): home, lab, campus, and desktop applications; medium, large, and leadership HPC applications; and data-oriented applications. Example applications along the spectrum: World of Warcraft, Quicken, PDB applications, TeraShake, NVO, molecular modeling.
Large-scale data is required as input, intermediate storage, and output for many modern HPC applications.
Applications vary with respect to how well they can perform in distributed mode (grid computing).
Researchers are increasingly dependent on both High Performance Computing (HPC) and Highly Reliable Data (HRD).
SDSC Mission
The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of cyberinfrastructure.
Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + "glue" (integrating software, systems, and organizations).
SDSC Cyberinfrastructure

SDSC HIGH PERFORMANCE COMPUTING SYSTEMS (http://www.sdsc.edu/user_services/)
DataStar: 15.6 TFLOPS POWER4+ system; 7.125 TB total memory; up to 4 GB/s I/O to disk; 115 TB GPFS filesystem
Blue Gene Data: first academic IBM Blue Gene system; 17.1 TFLOPS; 1.5 TB total memory; 3 racks, each with 2,048 PowerPC processors and 128 I/O nodes
TeraGrid Cluster: 524 Itanium2 IA-64 processors; 2 TB total memory; also 16 2-way data I/O nodes

SDSC SCIENCE AND TECHNOLOGY STAFF, SOFTWARE, SERVICES (http://www.sdsc.edu/)
User services; application/community collaborations; education and training; SDSC Synthesis Center; data-oriented community software, toolkits, portals, and codes

SDSC DATA COLLECTIONS, ARCHIVAL AND STORAGE SYSTEMS (http://datacentral.sdsc.edu/)
2.4 PB storage-area network (SAN); 25 PB StorageTek/IBM tape library; HPSS and SAM-QFS archival systems; DB2, Oracle, MySQL; Storage Resource Broker; supporting servers: IBM 32-way p690s, 72-CPU SunFire 15K, etc.
Support for community data collections and databases; data management, mining, analysis, and preservation
Who are SDSC's "customers"? What role does digital data play in their research?
Data and Simulations – TeraShake
How dangerous is the San Andreas Fault?
[Map: major earthquakes on the San Andreas Fault, 1680–present — 1680 M 7.7; 1857; 1906 M 7.8]
Researchers use geological, historical, and environmental data to simulate massive earthquakes. These simulations are critical for understanding seismic movement and assessing potential impact.
Simulation results provide new scientific information enabling better estimation of seismic risk; emergency preparation, response, and planning; and design of the next generation of earthquake-resistant structures.
These results can help save many lives and billions of dollars in economic losses.
TeraShake Visualization
Simulation of a magnitude 7.7 earthquake on the southern (lower) San Andreas Fault
Physics-based dynamic source model: simulation of a mesh of 1.8 billion cubes with spatial resolution of 200 m
Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each
Simulation generates 45+ TB of data
Project leadership: Tom Jordan (SCEC), Bernard Minster (SIO), Reagan Moore (SDSC), Carl Kesselman (ISI)
TeraShake Data Choreography
Resources must support a complicated orchestration of computation and data movement:
47 TB of output data for 1.8 billion grid points
Continuous I/O at 2 GB/s (parallel file system)
240 processors on SDSC DataStar for 5 days, using 1 TB of main memory
Data parking of hundreds of TBs for many months
"Fat nodes" on DataStar with 256 GB of memory for pre-processing and visualization post-processing
10–20 TB of data archived per day
Finer-resolution simulations require even more resources; TeraShake is being scaled to run on petascale architectures.
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." — Bernard Minster, Scripps Institution of Oceanography
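A back-of-envelope check makes the choreography concrete. The numbers below are taken from the slide (decimal TB/GB units assumed); this is an illustrative sketch, not a statement of the actual run schedule.

```python
# Rough consistency check of the TeraShake I/O figures (decimal units assumed).
TB = 1e12  # bytes
GB = 1e9   # bytes

output_bytes = 47 * TB   # total simulation output (from the slide)
io_rate = 2 * GB         # sustained I/O rate, bytes/sec (from the slide)

# Time to stream the full output at the sustained I/O rate:
write_time_hours = output_bytes / io_rate / 3600
print(f"47 TB at 2 GB/s: {write_time_hours:.1f} hours of pure I/O")

# Draining the output to the archive at 10 TB/day takes several days,
# which is why multi-month "data parking" on disk is part of the workflow:
days_at_10tb = output_bytes / (10 * TB)
print(f"Archiving at 10 TB/day: {days_at_10tb:.1f} days")
```

The gap between writing the data (hours) and archiving it (days) is what forces the large disk "parking" tier between the compute and tape systems.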
TeraShake at Petascale – A qualitative difference in prediction accuracy creates even greater data infrastructure demands
Estimated figures for a simulated 240-second period, 100-hour run time:

                            TeraShake                 PetaShake
Domain                      600x300x80 km^3           800x400x100 km^3
Fault system interaction    No                        Yes
Inner scale                 200 m                     25 m
Resolution of terrain grid  1.8 billion mesh points   2.0 trillion mesh points
Magnitude of earthquake     7.7                       8.1
Time steps                  20,000 (0.012 sec/step)   160,000 (0.0015 sec/step)
Surface data                1.1 TB                    1.2 PB
Volume data                 43 TB                     4.9 PB

Estimates courtesy of the Southern California Earthquake Center. A petascale platform will allow much higher resolution for very accurate prediction of ground motion.
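The mesh-point counts in the table follow directly from the domain sizes and inner scales: points scale as domain volume divided by the cube of the grid spacing. A quick sketch verifying that arithmetic:

```python
# Consistency check of the TeraShake -> PetaShake mesh sizes in the table.
# Mesh points ~ domain volume / (inner scale)^3.

tera_domain = 600 * 300 * 80    # km^3
peta_domain = 800 * 400 * 100   # km^3
tera_dx = 0.200                 # inner scale in km (200 m)
peta_dx = 0.025                 # inner scale in km (25 m)

tera_points = tera_domain / tera_dx**3
peta_points = peta_domain / peta_dx**3

print(f"TeraShake mesh: {tera_points:.2e} points")  # matches the table's 1.8 billion
print(f"PetaShake mesh: {peta_points:.2e} points")  # matches the table's ~2.0 trillion
```

The 8x finer spacing alone multiplies the mesh by 512; the larger domain adds another factor of ~2.2, which is why the data volumes jump from TB to PB.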
Data as a driver – the Protein Data Bank: A resource for the global biology community
The Protein Data Bank (PDB):
Largest repository on the planet for structural information about proteins
Provides free worldwide public access 24/7 to accurate protein data
Maintained by the Worldwide PDB and administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman
January 2007 Molecule of the Month: Importins – a complex of 3 proteins which aids in protein synthesis by ferrying molecules back and forth between the inside and the outside of the nucleus through tube-shaped nuclear pores
Growth of yearly/total structures in the PDB: 1976–1990, roughly 500 structures or less per year; 2006, more than 5,000 structures in one year and more than 36,000 total structures
Each structure costs roughly $200K to generate; the 2006 holdings will have cost roughly $80B in research investment.
How Does the PDB Work?
Data are collected, annotated, and validated at one of 3 worldwide PDB sites (Rutgers in the US). Infrastructure required: 20 highly trained personnel and significant computational, storage, and networking capabilities.
[Diagram: PDB architecture — a WWW user interface (SearchLite keyword search, search fields, query/result browsers, Structure Explorer), an FTP tree for download, and a CORBA interface for remote applications, served through a DB integration layer over the core DB, derived data, flat files, and the BMCD; new queries and new tools feed back into the system]
Infrastructure: the PDB portal is served by a cluster in the SDSC machine room. The PDB system is designed with multiple failover capabilities to ensure 24/7 access and 99.99% uptime. The PDB infrastructure requires 20 TB of storage at SDSC. The PDB is accessible over the Internet and serves 10,000 users a day (> 200,000 hits).
H. Berman estimated that in 2005, more than $1B of research funding was spent to generate the data that were collected, curated, and distributed by the PDB.
Using the PDB
The PDB provides a critical building block for research, education, and practice in the biosciences.
[Diagram: disciplinary databases, users, software to access data, and software to federate data span the biological scales — atoms, biopolymers, organelles, cells, organs, organisms — and disciplines from genomics, proteomics (the PDB level), and medicinal chemistry to cell biology, anatomy, and physiology]
PDB tools include: data extraction and preparation; data format conversion; data validation; dictionary and data management; tools supporting the OMG CORBA Standard for Macromolecular Structure Data; etc.
Cyberinfrastructure to Support Data-oriented Research, Education, and Practice at SDSC
Storage of research data in SDSC's archives shows a consistent increase in the need for capacity
Most of the data is supercomputer simulation output, but digital library collections and experimental data are contributing to growth rates.
Consistent exponential growth, with a doubling time of roughly 15 months, drives planning and cost projections.
Technology advancements help, but media cost per byte is not decreasing as fast as storage demand is increasing.
Information courtesy of Richard Moore
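The planning consequence of a fixed doubling time can be sketched in a few lines. The ~15-month figure is from the slide; the starting size and horizons below are hypothetical placeholders.

```python
# Capacity projection under exponential growth with a fixed doubling time.
# The 15-month doubling is from the slide; the 1 PB starting size is hypothetical.

def projected_capacity(current_pb, months_ahead, doubling_months=15):
    """Projected archive size (PB) after months_ahead of steady doubling."""
    return current_pb * 2 ** (months_ahead / doubling_months)

print(f"In 15 months: {projected_capacity(1.0, 15):.1f} PB")  # one doubling -> 2.0 PB
print(f"In 5 years:   {projected_capacity(1.0, 60):.1f} PB")  # four doublings -> 16.0 PB
```

Four doublings in five years is why flat media cost per byte translates into steeply rising storage budgets.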
National Data Repository: SDSC DataCentral
First broad program of its kind to support national research and community data collections and databases:
"Data allocations" provided on SDSC resources
Data collection and database hosting
Batch-oriented access and collection management services
Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.
Web-based portal access
DataCentral Allocated Collections include:
Seismology – 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences – 50 year Downscaling of Global Analysis over California Region
Earth Sciences – NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics – AMANDA data
Biology – AfCS Molecule Pages
Biomedical – Neuroscience BIRN
Networking – Backbone Header Traces
Networking – Backscatter Data
Biology – Bee Behavior
Biology – Biocyc (SRI)
Art – C5 landscape Database
Geology – Chronos
Biology – CKAAPS
Biology – DigEmbryo
Earth Science Education – ERESE
Earth Sciences – UCI ESMF
Earth Sciences – EarthRef.org
Earth Sciences – ERDA
Earth Sciences – ERR
Biology – Encyclopedia of Life
Life Sciences – Protein Data Bank
Geosciences – GEON
Geosciences – GEON-LIDAR
Geochemistry – Kd
Biology – Gene Ontology
Geochemistry – GERM
Networking – HPWREN
Ecology – HyperLter
Networking – IMDC
Biology – Interpro Mirror
Biology – JCSG Data
Government – Library of Congress Data
Geophysics – Magnetics Information Consortium data
Education – UC Merced Japanese Art Collections
Geochemistry – NAVDAT
Earthquake Engineering – NEESIT data
Education – NSDL
Astronomy – NVO
Government – NARA
Anthropology – GAPP
Neurobiology – Salk data
Seismology – SCEC TeraShake
Seismology – SCEC CyberShake
Oceanography – SIO Explorer
Networking – Skitter
Astronomy – Sloan Digital Sky Survey
Geology – Sensitive Species Map Server
Geology – SD and Tijuana Watershed data
Oceanography – Seamount Catalogue
Oceanography – Seamounts Online
Biodiversity – WhyWhere
Ocean Sciences – Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering – TeraBridge
Various – TeraGrid data collections
Biology – Transporter Classification Database
Biology – TreeBase
Art – Tsunami Data
Education – ArtStor
Biology – Yeast regulatory network
Biology – Apoptosis Database
Cosmology – LUSciD
Services, Tools, and Technologies Key for Data-related Capability
Data systems: SAM-QFS, HPSS, GPFS, SRB
Data services: data migration/upload, usage, and support (SRB); database selection and schema design (Oracle, DB2, MySQL); database application tuning and optimization; portal creation and collection publication; data analysis (e.g., Matlab) and mining (e.g., WEKA)
DataCentral data-oriented toolkits and tools: Biology Workbench, Montage (astronomy mosaicking), Kepler (workflow management), Vista volume renderer (visualization), etc.
Increasing Need to Sustain Digital Data for the Foreseeable Future
UCSD Libraries; digital state and federal records; the entertainment industry; the private sector; the public sector; researchers and educators
What data is the most valuable?
Key criteria: irreplaceable; longitudinal; used by many; expensive; needed in the future; culturally or scientifically meaningful; …
Examples: federal records; data needing rescue; irreplaceable data; time-series; reference collections
Key Challenges for Digital Preservation
What should we preserve? What materials must be "rescued"? How do we plan for preservation of materials by design?
How should we preserve it? Formats; storage media
Stewardship – who is responsible, and for how long?
Who should pay for preservation? The content generators? The government? The users?
Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine; digital media is easier to data-mine but requires managing the evolution of media and resource planning over time.
Preservation and Risk

Entity at risk  Size      What can go wrong                                Frequency      Minimum replicas needed to mitigate risk  Administrative support (FTEs)
File            ~2 MB     Corrupted media, disk failure                    1 year         2 copies in a single system               System admins
Tape            ~200 GB   + Simultaneous failure of 2 copies               5 years        3 homogeneous systems                     + Storage admin
System          ~10 TB    + Systemic errors in vendor SW, malicious user,  15 years       3 independent, heterogeneous systems      + Database admin, + Security admin
                          or operator error that deletes multiple copies
Archive         ~1 PB     + Natural disaster, obsolescence of standards    50–100 years   3 distributed, heterogeneous systems      + Network admin, + Data grid admin

Less risk means more replicas, more resources, and more people.
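The intuition behind the replica counts in the table can be sketched with a toy model: with k independent copies, data is lost only if all k fail. The per-copy failure probability below is a hypothetical illustration, and independence is an idealization — the table's systemic errors and natural disasters are precisely the correlated failures that motivate heterogeneous, distributed systems rather than more copies in one place.

```python
# Toy model of replication risk: independent copies, each lost with
# probability p_single over some period; loss requires all copies failing.
# p_single = 0.01 is a hypothetical illustrative value, not a measured rate.

def loss_probability(p_single, replicas):
    """P(all replicas lost) assuming independent failures."""
    return p_single ** replicas

p = 0.01
for k in (1, 2, 3):
    print(f"{k} replica(s): P(loss) = {loss_probability(p, k):.0e}")
```

Each added independent replica multiplies the loss probability by p, which is why 3 copies is the table's working standard, and why the copies must be independent for the arithmetic to hold.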
Chronopolis™: An Integrated Approach to Long-term Digital Preservation
Consortium: SDSC, the UCSD Libraries, NCAR, UMd, NARA, the Library of Congress, and NSF working together on long-term preservation of digital collections
Chronopolis™ provides a comprehensive approach to infrastructure for long-term preservation, integrating:
Collection ingestion
Access and services
Research and development for new functionality and adaptation to evolving technologies
Business model, data policies, and management issues critical to the success of the infrastructure
Chronopolis™ – Replication and Distribution
3 replicas of valuable collections are considered reasonable mitigation for the risk of data loss. The Chronopolis™ Consortium will store 3 copies of preservation collections:
"Bright copy" – Chronopolis™ site supports ingestion, collection management, and user access
"Dim copy" – Chronopolis™ site supports a remote replica of the bright copy and supports user access
"Dark copy" – Chronopolis™ site supports a reference copy that may be used for disaster recovery but no user access
Each site may play different roles for different collections.
[Diagram: Chronopolis™ federation architecture — sites at SDSC, U Md, and NCAR, each holding a mix of bright, dim, and dark copies of collections C1 and C2]
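The role-rotation scheme above can be sketched as a small data layout. The site-to-role assignments below are hypothetical (the slide's diagram shows each site holding a mix of roles, not this exact mapping); the check encodes the one invariant the slide states: every collection gets exactly one bright, one dim, and one dark copy.

```python
# A minimal sketch (hypothetical assignments) of Chronopolis-style replication:
# each collection has exactly one bright, one dim, and one dark copy,
# and a given site plays different roles for different collections.

placements = {
    "C1": {"SDSC": "bright", "UMd": "dim",    "NCAR": "dark"},
    "C2": {"SDSC": "dark",   "UMd": "bright", "NCAR": "dim"},
}

def verify(placements):
    """Check every collection holds exactly one copy of each role."""
    for name, sites in placements.items():
        assert sorted(sites.values()) == ["bright", "dark", "dim"], name
    return True

print(verify(placements))  # True
```

Rotating roles across sites means no single site is the sole access point (or the sole disaster-recovery copy) for every collection.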
SDSC Playing a Leadership Role in Development of a National Digital Data Framework
SDSC storing genetic research data for the City of Hope
SDSC developing data visualizations for UCSD Moores Cancer Center
SDSC storing national collections for the National Archives and Records Administration
SDSC working with the Library of Congress on distributed data stewardship (Prokudin-Gorskii photographs)
[Chart: value of content vs. cost to store]
Community Cyberinfrastructure at SDSC
Allocated HPC resources (via TeraGrid): http://www.sdsc.edu/resources/CompStorage.html
SDSC Summer Institutes, training, outreach: www.sdsc.edu/us/training, http://education.sdsc.edu
Software, visualization, and other services: http://www.sdsc.edu/resources/Resources.html
Community CI-oriented R&D projects: www.sdsc.edu/research/
DataCentral data repository: datacentral.sdsc.edu
Thank You – www.sdsc.edu