Presentation transcript:

1 A Grand Challenge for the Information Age. Dr. Francine Berman, Director, San Diego Supercomputer Center; Professor and High Performance Computing Endowed Chair, UC San Diego

2 The Fundamental Driver of the Information Age is Digital Data: shopping, entertainment, information, business, education, health.

3 Digital Data Critical for Research and Education. Biosciences: data at multiple scales, from atoms and biopolymers to organelles, cells, organs, and organisms, spanning genomics, proteomics, medicinal chemistry, cell biology, anatomy, and physiology; disciplinary databases feed data access, use, and integration. Driving questions: What genes are associated with cancer? What parts of the brain are responsible for Alzheimer's? Geosciences: data from multiple sources (geologic, geochemical, geophysical, geochronologic, and foliation maps) integrated through complex “multiple-worlds” mediation. Driving questions: Where should we drill for oil? What is the impact of global warming? How are the continents shifting?

4 Today’s Presentation: Data Cyberinfrastructure Today – designing and developing infrastructure to enable today’s data-oriented applications; Challenges in Building and Delivering Capable Data Infrastructure; Sustainable Digital Preservation – a grand challenge for the Information Age.

5 Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today’s Data-Oriented Applications

6 Today’s Data-Oriented Applications Span the Spectrum. [Diagram: applications arranged along COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more bandwidth) axes: home/lab/campus/desktop applications, compute-intensive HPC applications, data-intensive applications, data-intensive and compute-intensive HPC applications, grid applications, and data grid applications.] Designing infrastructure for data: Data and High Performance Computing; Data and Grids; Data and Cyberinfrastructure Services.

7 Data and High Performance Computing. For many applications, “balanced systems” are needed to support codes that are both data-intensive and compute-intensive: codes for which Grid platforms are not a strong option because data must be local to the computation, I/O rates exceed WAN capabilities, and continuous, frequent I/O is latency-intolerant. Scalability is key, and these applications need high-bandwidth, large-capacity local parallel file systems and archival storage. [Diagram: compute-intensive, data-intensive, and combined data- and compute-intensive HPC applications placed along COMPUTE (more FLOPS) and DATA (more BYTES) axes.]

8 Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands. Estimated figures for a simulated 240-second period with a 100-hour run time, TeraShake domain (600x300x80 km^3) vs. PetaShake domain (800x400x100 km^3):
Fault system interaction: no vs. yes
Inner scale: 200 m vs. 25 m
Resolution of terrain grid: 1.8 billion mesh points vs. 2.0 trillion mesh points
Magnitude of earthquake: 7.7 vs. 8.1
Time steps: 20,000 (0.012 sec/step) vs. 160,000 (0.0015 sec/step)
Surface data: 1.1 TB vs. 1.2 PB
Volume data: 43 TB vs. 4.9 PB
Information courtesy of the Southern California Earthquake Center
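The slide's figures can be cross-checked with a little arithmetic (a minimal sketch: the 240-second simulated period, 100-hour run time, step sizes, and volume-data sizes come from the table above; everything else is just unit conversion). The jump in average output rate is one way to see why the petascale run is so much more data-intensive:

```python
# Rough sanity check of the figures quoted above, using only numbers from the slide.
simulated_seconds = 240
run_hours = 100

for name, dt_seconds, volume_bytes in [
    ("TeraShake", 0.012,  43e12),   # 43 TB of volume data
    ("PetaShake", 0.0015, 4.9e15),  # 4.9 PB of volume data
]:
    steps = simulated_seconds / dt_seconds               # matches 20,000 and 160,000
    avg_output = volume_bytes / (run_hours * 3600)       # bytes per wall-clock second
    print(f"{name}: {steps:,.0f} time steps, "
          f"~{avg_output / 1e9:.1f} GB/s average volume-data output")
```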

9 Data and HPC: What you see is what you’ve measured. FLOPS alone are not enough; appropriate benchmarks are needed to rank, and bring visibility to, the more balanced machines critical for today’s applications. Example: three systems using the same processor and the same number of processors (64 AMD Opteron processors at 2.2 GHz); the difference is in the way the processors are interconnected: Cray XD1 (custom interconnect), Dalco Linux Cluster (Quadrics interconnect), and Sun Fire Cluster (Gigabit Ethernet interconnect). The HPC Challenge benchmarks measure different machine characteristics: Linpack and matrix multiply are computationally intensive, while PTRANS (matrix transpose), RandomAccess, the bandwidth/latency tests, and other tests begin to reflect stress on the memory system. Information courtesy of Jack Dongarra
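To illustrate the distinction the slide draws (an illustrative sketch only, not the HPC Challenge code itself): a compute-bound kernel such as dense matrix multiply stresses the floating-point units, while scattered random updates to a large table stress the memory system, so the same machine can look very different on the two.

```python
import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Compute-bound kernel: dense matrix multiply (Linpack/DGEMM-like behavior).
t0 = time.perf_counter()
c = a @ b
t_mm = time.perf_counter() - t0
flops = 2 * n ** 3
print(f"matmul: {flops / t_mm / 1e9:.1f} GFLOP/s")

# Memory-bound kernel: scattered read-modify-write updates to a large table
# (RandomAccess-like behavior, not the actual GUPS benchmark).
table = np.zeros(1 << 24, dtype=np.int64)                 # ~128 MB working set
idx = np.random.randint(0, table.size, size=10_000_000)
t0 = time.perf_counter()
np.add.at(table, idx, 1)
t_ra = time.perf_counter() - t0
print(f"random updates: {idx.size / t_ra / 1e6:.1f} Mupdates/s")
```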

10 Data and Grids. Data applications were some of the first applications that required Grid environments, and they could naturally tolerate longer latencies. The Grid model supports key data application profiles: compute at site A with data from site B; store a data collection at site A with copies at sites B and C; operate an instrument at site A and move the data to site B for storage, post-processing, etc. CERN data provides a key driver for grid technologies.

11 Data Services Key for TeraGrid Science Gateways. Science Gateways provide a common application interface for science communities on TeraGrid, and data services are key for Gateway communities: analysis, visualization, management, remote access, etc. Example gateways: LEAD, GridChem, NVO. Information and images courtesy of Nancy Wilkins-Diehr

12 Unifying Data over the Grid – the TeraGrid GPFS-WAN Effort. User wish list: unlimited data capacity (everyone’s aggregate storage almost looks like this); transparent, high-speed access anywhere on the Grid; automatic archiving and retrieval; no latency. The TeraGrid GPFS-WAN effort (SDSC) focuses on providing “infinite” storage over the grid: it looks like local disk to grid sites, uses automatic migration with a large cache to keep files always “online” and accessible, and archives data automatically without user intervention. Information courtesy of Phil Andrews
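A minimal sketch of the cache-plus-migration idea described above (illustrative only, not the GPFS-WAN implementation; the file paths, sizes, and archive object are hypothetical): frequently used files stay in a bounded disk cache, and files evicted from the cache are migrated to the archive rather than deleted, so everything remains retrievable.

```python
from collections import OrderedDict

class MigratingCache:
    """Illustrative LRU disk cache that migrates evicted files to an archive tier."""

    def __init__(self, capacity_bytes, archive):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()   # path -> size, kept in LRU order
        self.archive = archive       # stand-in for the tape/archive tier

    def access(self, path, size=0):
        if path in self.files:
            self.files.move_to_end(path)      # cache hit: refresh recency
            return "hit"
        if path in self.archive:              # cache miss: recall size from archive
            size = self.archive[path]
        self._make_room(size)
        self.files[path] = size
        self.used += size
        return "recalled" if path in self.archive else "new"

    def _make_room(self, size):
        while self.used + size > self.capacity and self.files:
            old_path, old_size = self.files.popitem(last=False)  # evict LRU file
            self.archive[old_path] = old_size                    # migrate, don't delete
            self.used -= old_size

archive = {}
cache = MigratingCache(capacity_bytes=100, archive=archive)
print(cache.access("/gpfs/run1.dat", 60))   # new
print(cache.access("/gpfs/run2.dat", 60))   # new; run1.dat migrates to the archive
print(cache.access("/gpfs/run1.dat"))       # recalled transparently from the archive
```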

13 Data Services – Beyond Storage to Use. What services do users want? What are the trends and what is the noise in my data? How should I display my data? How should I organize my data? How can I make my data accessible to my collaborators? How can I combine my data with my colleague’s data? My data is confidential; how do I make sure that it is seen and used only by the right people? How do I make sure that my data will be there when I want it?

14 Integrated Infrastructure Services: an integrated environment is key to usability. Many data sources (computers, sensor nets, instruments) feed services for data storage, data management, data manipulation, and data access: file systems, database systems, collection management, data integration, simulation, analysis, visualization, and modeling. Services include database selection and schema design; portal creation and collection publication; data analysis and data mining; data hosting; preservation services; domain-specific tools (Biology Workbench, Montage for astronomy mosaicking, Kepler for workflow management); data visualization; data anonymization; etc.

15 Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data. A broad program to support research and community data collections and databases. DataCentral services include: public data collection and database hosting; long-term storage and preservation (tape and disk); remote data management and access (SRB, portals); data analysis, visualization, and data mining; professional, qualified 24/7 support. DataCentral resources include: 1 PB of on-line disk; 25 PB of StorageTek tape library capacity; a 540 TB storage-area network (SAN); DB2, Oracle, and MySQL; the Storage Resource Broker; GPFS-WAN with 700 TB; the PDB (28 TB); web-based portal access.

16 DataCentral Allocated Collections include (discipline: collection):
Seismology: 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences: 50-year Downscaling of Global Analysis over the California Region
Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics: AMANDA data
Biology: AfCS Molecule Pages
Biomedical Neuroscience: BIRN
Networking: Backbone Header Traces
Networking: Backscatter Data
Biology: Bee Behavior
Biology: Biocyc (SRI)
Art: C5 Landscape Database
Geology: Chronos
Biology: CKAAPS
Biology: DigEmbryo
Earth Science Education: ERESE
Earth Sciences: UCI ESMF
Earth Sciences: EarthRef.org
Earth Sciences: ERDA
Earth Sciences: ERR
Biology: Encyclopedia of Life
Life Sciences: Protein Data Bank
Geosciences: GEON
Geosciences: GEON-LIDAR
Geochemistry: Kd
Biology: Gene Ontology
Geochemistry: GERM
Networking: HPWREN
Ecology: HyperLter
Networking: IMDC
Biology: Interpro Mirror
Biology: JCSG Data
Government: Library of Congress Data
Geophysics: Magnetics Information Consortium data
Education: UC Merced Japanese Art Collections
Geochemistry: NAVDAT
Earthquake Engineering: NEESIT data
Education: NSDL
Astronomy: NVO
Government: NARA
Anthropology: GAPP
Neurobiology: Salk data
Seismology: SCEC TeraShake
Seismology: SCEC CyberShake
Oceanography: SIO Explorer
Networking: Skitter
Astronomy: Sloan Digital Sky Survey
Geology: Sensitive Species Map Server
Geology: SD and Tijuana Watershed data
Oceanography: Seamount Catalogue
Oceanography: Seamounts Online
Biodiversity: WhyWhere
Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering: TeraBridge
Various: TeraGrid data collections
Biology: Transporter Classification Database
Biology: TreeBase
Art: Tsunami Data
Education: ArtStor
Biology: Yeast regulatory network
Biology: Apoptosis Database
Cosmology: LUSciD

17 Data Visualization is key. Examples: SCEC earthquake simulations, visualization of cancer tumors, and Prokudin-Gorskii historical images. Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, and the U.S. Library of Congress

18 Building and Delivering Capable Data Cyberinfrastructure

19 Infrastructure Should be Non-memorable. Good infrastructure should be predictable, pervasive, cost-effective, easy to use, reliable, and unsurprising. What’s required to build and provide useful, usable, and capable data cyberinfrastructure?

20 Building Capable Data Cyberinfrastructure: Incorporating the “ilities” – scalability, interoperability, reliability, capability, sustainability, predictability, accessibility, responsibility, accountability, …

21 Reliability: What Can Go Wrong. Entity at risk, what can go wrong, and typical frequency:
File: corrupted media, disk failure (about once a year)
Tape: simultaneous failure of 2 copies (about every 5 years)
System: systemic errors in vendor software, a malicious user, or an operator error that deletes multiple copies (about every 15 years)
Archive: natural disaster, obsolescence of standards (every 50-100 years)
How can we maximize data reliability? Replication, UPS systems, heterogeneity, etc. How can we measure data reliability? Network availability is quoted as 99.999% uptime (“5 nines”); what is the equivalent number of “9’s” for data reliability? Information courtesy of Reagan Moore
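One simple way to put a number of “9’s” on data reliability (an illustrative model only: it assumes each copy is lost independently with some annual probability, which real, correlated failures violate, and the 1% figure below is an assumption, not a measured rate): with annual loss probability p per copy and k independent copies, the chance of losing everything in a year is roughly p to the power k.

```python
# Illustrative model: probability of losing all replicas in a year, assuming
# each copy is lost independently with probability p per year. Correlated
# failures (same vendor bug, same site disaster) make this an optimistic bound,
# which is why the slides stress heterogeneity and geographic replication.

def annual_survival(p_loss_per_copy: float, copies: int) -> float:
    """Probability that at least one copy survives the year."""
    return 1 - p_loss_per_copy ** copies

p = 0.01  # assumed 1% chance per year of losing any single copy
for copies in (1, 2, 3, 4):
    print(f"{copies} copies: survival probability {annual_survival(p, copies):.8f}")
# 1 copy  -> 0.99        ("two nines")
# 3 copies -> 0.999999   ("six nines")
```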

22 Responsibility and Accountability. Who owns the data? Who takes care of the data? Who pays for the data? Who can access the data? What are reasonable expectations between users and repositories? What are reasonable expectations between federated partner repositories? What are appropriate models for evaluating repositories? What incentives promote good stewardship? What should happen if/when the system fails?

23 Good Data Infrastructure Incurs Real Costs. Capacity costs: the most valuable data must be replicated; SDSC research collections have been doubling every 15 months, and SDSC storage is 25 PB and counting, with data from supercomputer simulations, digital library collections, etc. Capability costs: reliability is increased by up-to-date, robust hardware and software for replication (disk, tape, geographic), backups, updates, and syncing, audit trails, and verification through checksums of physical media, network transfers, copies, etc.; data professionals are needed to facilitate infrastructure maintenance, long-term planning, restoration and recovery, access, analysis, preservation, and other services, reporting, documentation, etc. Information courtesy of Richard Moore
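A minimal sketch of the checksum-verification step mentioned above (illustrative only; the file path and the choice of SHA-256 are assumptions, not SDSC’s actual tooling): compute a digest when a file is ingested, record it, and recompute it later to detect silent corruption before it propagates to the replicas.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, recorded_digest: str) -> bool:
    """Recompute the checksum and compare it to the one recorded at ingest time."""
    return sha256_of(path) == recorded_digest

# Example usage with a hypothetical file:
# recorded = sha256_of(Path("collection/image_0001.tif"))   # at ingest
# ...later, or on another replica...
# assert verify(Path("collection/image_0001.tif"), recorded), "corruption detected"
```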

24 Economic Sustainability: Making Infinite Funding Finite. It is difficult to support infrastructure for data preservation as an infinite, increasing mortgage. Creative partnerships help create sustainable economic models: relay funding, consortium support, user fees and recharges, endowments, and hybrid solutions. [Image: Geisel Library at UCSD]

25 Preserving Digital Information Over the Long Term

26 How much Digital Data is there? Prefixes: kilo = 10^3, mega = 10^6, giga = 10^9, tera = 10^12, peta = 10^15, exa = 10^18, zetta = 10^21. Reference points: 1 novel = 1 megabyte; an 80 GB iPod holds up to 20,000 songs; the U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital”; the SDSC HPSS tape archive holds 25+ petabytes. From “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010,” IDC Whitepaper, March 2007: 5 exabytes of digital information were produced in 2003 and 161 exabytes in 2006; 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.) and 75% is replicated (forwarded emails, backed-up transaction records, movies in DVD format); 1 zettabyte of aggregate digital information is projected for 2010.

27 How much Storage is there? 2007 is the “crossover year,” in which the amount of digital information is greater than the amount of available storage. Given the projected rates of growth, we will never again have enough space for all digital information. Source: “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010,” IDC Whitepaper, March 2007

28 Focus for Preservation: the “most valuable” data. What is “valuable”? Community reference data collections (e.g., UniProt, PDB); irreplaceable collections; official collections (e.g., census data, electronic federal records); collections that are very expensive to replicate (e.g., CERN data); longitudinal and historical data; and others. [Chart: value plotted against cost and time.]

29 The Data Pyramid: A Framework for Digital Stewardship. Preservation efforts should focus on collections deemed “most valuable.” Key issues: What do we preserve? How do we guard against data loss? Who is responsible? Who pays? Etc. [Pyramid diagram of digital data collections, from base to apex: personal data collections (local scale); key research and community data collections (“regional” scale); reference, nationally important, and irreplaceable data collections (national and international scale). Corresponding repositories/facilities: private repositories; “regional”-scale libraries and targeted data centers; national and international-scale data repositories, archives, and libraries. Value, trust, stability, risk/responsibility, and required infrastructure all increase toward the apex.]

30 The Data Pyramid: Digital Collections of Community Value. Key techniques for preservation: replication and heterogeneous support across the local, “regional,” and national/international scales of the pyramid.

31 Chronopolis: A Conceptual Model for Preservation Data Grids. The Chronopolis model is a geographically distributed preservation data grid that supports long-term management and stewardship of, and access to, digital collections. It is implemented by developing and deploying a distributed data grid and by supporting its human, policy, and technological infrastructure, and it integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation. [Diagram components: digital information of long-term value; distributed production preservation environment; technology forecasting and migration; administration, policy, and outreach.]

32 Chronopolis Focus Areas and Demonstration Project Partners. Chronopolis R&D, policy, and infrastructure focus areas: assessment of the needs of potential user communities and development of appropriate service models; development of formal roles and responsibilities of providers, partners, and users; assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.; development of appropriate cost and risk models for long-term preservation; development of appropriate success metrics to evaluate the usefulness, reliability, and usability of the infrastructure. Two prototypes: a National Demonstration Project and a Library of Congress Pilot Project. Partners: SDSC/UCSD, U Maryland, UCSD Libraries, NCAR, NARA, Library of Congress, NSF, ICPSR, Internet Archive, NVO. Demonstration Project information courtesy of Robert McDonald

33 National Demonstration Project – Large-scale Replication and Distribution. Focus on supporting multiple, geographically distributed copies of preservation collections: a “bright copy” is held at a Chronopolis site that supports ingestion, collection management, and user access; a “dim copy” is a remote replica of the bright copy that still supports user access; a “dark copy” is a reference copy that may be used for disaster recovery but offers no user access. Each site may play different roles for different collections. [Chronopolis federation architecture diagram: the SDSC, U Maryland, and NCAR sites each hold a mix of bright, dim, and dark copies of collections C1 and C2.] Demonstration collections included: National Virtual Observatory (NVO) data [1 TB Digital Palomar Observatory Sky Survey]; a copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB of web-accessible data]; NCAR observational data [3 TB of observational and re-analysis data].
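A sketch of the role assignment the slide describes (illustrative only; the particular site-to-role mapping below is hypothetical, the slide states only that each site may play different roles for different collections):

```python
# Hypothetical role assignment per (site, collection), in the spirit of the
# federation diagram: "bright" = ingest + collection management + user access,
# "dim" = remote replica with user access, "dark" = disaster-recovery copy only.
ROLES = {
    ("SDSC", "C1"): "bright",
    ("NCAR", "C1"): "dim",
    ("UMD",  "C1"): "dark",
    ("UMD",  "C2"): "bright",
    ("SDSC", "C2"): "dim",
    ("NCAR", "C2"): "dark",
}

def user_access_allowed(site: str, collection: str) -> bool:
    """Bright and dim copies serve users; dark copies do not."""
    return ROLES.get((site, collection)) in ("bright", "dim")

for (site, coll), role in sorted(ROLES.items()):
    print(f"{site}/{coll}: {role:6s} user access: {user_access_allowed(site, coll)}")
```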

34 SDSC/UCSD Libraries Pilot Project with the U.S. Library of Congress. Goal: to “… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress’ requirements.” The collection is the historically important 600 GB Prokudin-Gorskii photograph collection (Library of Congress Prints and Photographs Division, http://www.loc.gov/exhibits/empire/), along with a collection of web crawls from the Internet Archive. The images are over 100 years old, with red, blue, and green components kept as separate digital files. SDSC stores 5 copies, with a dark archival copy at NCAR. The infrastructure must support an idiosyncratic file structure, and special logging and monitoring software was developed so that both SDSC and the Library of Congress could access the information. Library of Congress Pilot Project information courtesy of David Minor

35 The Pilot Projects provided invaluable experience with key issues. Technical issues: how to address integrity, verification, provenance, authentication, etc. Legal/policy issues: who is responsible? Who is liable? Social issues: what formats and standards are acceptable to the community? How do we formalize trust? Infrastructure issues: what kinds of resources (servers, storage, networks) are required, and how should they operate? Evaluation issues: what is reliable? What counts as successful? Cost issues: what is cost-effective, and how can support be sustained over time?

36 It’s Hard to be Successful in the Information Age without reliable, persistent information. An inadequate and unrealistic general solution is “let X do it,” where X is the government, the libraries, the archivists, Google, the private sector, data owners, data generators, etc. Creative partnerships are needed to provide preservation solutions with trusted stewards, feasible costs for users, sustainable costs for infrastructure, very low risk of data loss, etc.

37 Blue Ribbon Task Force to Focus on Economic Sustainability. The international Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF-SDPA) will begin in 2008 to study issues of economic sustainability of digital preservation and access, with support from the National Science Foundation, the Library of Congress, the Mellon Foundation, the Joint Information Systems Committee, the National Archives and Records Administration, and the Council on Library and Information Resources. [Image courtesy of Chris Greer, NSF Office of Cyberinfrastructure, October 31, 2006: the user at the center of state, local, federal, international, non-profit, college/university, and commercial stakeholders.]

38 BRTF-SDPA. Charge to the Task Force: (1) to conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report); (2) to identify and evaluate best practices regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises; (3) to make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report); and (4) to provide a research agenda to organize and motivate future work. How you can be involved: contribute your ideas (oral and written “testimony”); suggest readings (the website will serve as a community bibliography); write an article on the issues for a new community (an important component will be educating decision makers and the public about digital preservation). The website will be launched this fall and will be linked from www.sdsc.edu.

39 Many Thanks: Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others. www.sdsc.edu | berman@sdsc.edu

