Overview of MSR External Research; Earth, Energy, and Environment @ MSR; Environmental Ecosystem Conceptual Model; Projects: Trident, GrayWulf, Dryad and DryadLINQ.

Presentation transcript:

1

2 Overview
- Overview of MSR External Research
- Earth, Energy, and Environment @ MSR
- Environmental Ecosystem Conceptual Model
- Projects: Trident, GrayWulf, Dryad and DryadLINQ

3 Research locations
- Redmond, Washington (Sept 1991)
- San Francisco, California (Jun 1995)
- Cambridge, United Kingdom (July 1997)
- Beijing, China (Nov 1998), MSR Asia
- Silicon Valley, California (July 2001)
- Bangalore, India (Jan 2005), MSR India
- Cambridge, Massachusetts (July 2008), MSR New England

4 Microsoft External Research
- A division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing.
- Supporting groundbreaking research to help advance human potential and the wellbeing of our planet.
- Developing advanced technologies and services to support every stage of the research process.
- Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology.

5
- Core Computer Science
- Earth, Energy & Environment
- Education & Scholarly Communication
- Health & Wellbeing
- Advanced Research Tools and Services
- Community and Geographic Outreach

6 Earth, Energy & Environment
- Visualizing and Experiencing E³ Data + Information: Provide a unique experience to reduce time to insight and knowledge through visualizing data and information.
- Accessible Data: Ensure E³ data (remote and local sensing) is easily accessible and consumable in the scientist's domain.
- Enabling Scientific Collaboration: Look at new ways to enable collaboration in scientific virtual organizations.

7 Diagram labels: Action; Knowledge; Inform

8 Diagram labels: Analysis; Insight; Publish; Data; Action; Knowledge; Communicate; Decide; Implement; Inform

9

10

11

12 Each of these potentially impacts the technology, user interface, and API design
- I want to visualize ocean processes and share my analysis.
- I want to do this more than once and get exactly the same answer.
- I want to do this more than once, but don’t care if I get exactly the same answer.
- I’m only going to do this once and don’t care about keeping the data or the results long term (but I need to remember the inputs).
- I want to store the data in
- I want full provenance to validate a result, OPM compliant.
- I want to use my own provenance management system.
- Each group may wish a different UI (no WF), or authoring tool.
- I only want NCAR, MBARI, etc. data because I trust it.
- I know that Jon really wants my results to drive his model and I want to share my workflow and executables.

13

14

15

16
- Visually program workflows.
- Libraries of activities and workflows, to save and reuse workflows.
- Abstract parallelism for HPC, to test on the desktop and then run on a cluster.
- Automatic provenance capture, for all workflows and data products.
- Integrated data storage and access: researchers can store data in a SQL database, in local files, or in the cloud (Microsoft SDS, Amazon S3).
- Reproducible research.
Screenshot labels: Composition Space; Activity Library; Workflow Library; Data Options & Sharing
http://research.microsoft.com/collaboration/tools/trident
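The idea of automatic provenance capture can be illustrated with a minimal sketch. This is not Trident's API (Trident is built on Windows Workflow Foundation); the `Workflow`, `ProvenanceRecord`, and `digest` names and the record schema are hypothetical, chosen only to show a workflow that logs what each activity consumed and produced without the author asking for it.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ProvenanceRecord:
    """One entry per executed activity (hypothetical schema, not Trident's)."""
    activity: str
    input_digest: str
    output_digest: str


def digest(value: Any) -> str:
    """Short, stable fingerprint of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class Workflow:
    """A linear chain of named activities that records provenance as it runs."""
    activities: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)
    log: List[ProvenanceRecord] = field(default_factory=list)

    def add(self, name: str, func: Callable[[Any], Any]) -> "Workflow":
        self.activities.append((name, func))
        return self

    def run(self, data: Any) -> Any:
        for name, func in self.activities:
            result = func(data)
            # Provenance is captured here, for every activity, automatically.
            self.log.append(ProvenanceRecord(name, digest(data), digest(result)))
            data = result
        return data


wf = (Workflow()
      .add("drop_missing", lambda xs: [x for x in xs if x is not None])
      .add("mean", lambda xs: sum(xs) / len(xs)))
result = wf.run([1.0, None, 3.0])
```

Because each record's output digest equals the next record's input digest, the log can be replayed to check that a rerun reproduces the same intermediate products.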

17

18 Pan-STARRS (Astronomy)
- One of the largest visible-light telescopes
- Four unit telescopes acting as one
- One gigapixel per telescope
- Survey the entire visible universe in 1 week
- Catalog solar system, moving objects/asteroids
- ps1sc.org: Univ. Hawaii, Johns Hopkins, …

19
- 1 PB of raw image data per year
- 2.5 TB image data | 1000 images | 150 M detections per night
- 30 TB of processed data per year
- 5.5 billion celestial objects
- 350 billion detections
- The largest astronomy DB in the world! And the platform to build it upon!

Telescope | Telescope diameter (m) | Effective collecting area A (m²) | Solid angle subtended by field of view D (deg²) | Nominal image quality Q (arcsec) | Survey power AD/Q² | Status
UH 2.2-m/PFCam | 2.2 | 3.5 | 0.25 | 0.7 | 1.8 | 2004+
Palomar/QUEST | 1.2 | 1.1 | 16.6 | 2 | 4.6 | 2003+
CFHT/Megacam | 3.6 | 10 | 1.00 | 0.6 | 28 | Active
Subaru/Suprime-Cam | 8.0 | 45 | 0.25 | 0.6 | 35 | Active
Pan-STARRS | 3.6 | 10 | 7 | 0.5 | 280 | 2007+
DMT/LSST | 8.3 | 54 | 7 | 0.6 | 1050 | 2012+
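The survey power column follows from the formula in its header, AD/Q². As a quick check, the sketch below recomputes it for five of the rows; the function and variable names are illustrative, and the Palomar/QUEST image quality is read as 2 arcsec. The Subaru row is left out because the figures as transcribed (45 × 0.25 / 0.6²) give about 31 rather than the listed 35.

```python
def survey_power(area_m2: float, field_deg2: float, quality_arcsec: float) -> float:
    """Survey power = A * D / Q^2, as defined in the table header."""
    return area_m2 * field_deg2 / quality_arcsec ** 2


# (A in m^2, D in deg^2, Q in arcsec), copied from the slide's table.
rows = {
    "UH 2.2-m/PFCam": (3.5, 0.25, 0.7),
    "Palomar/QUEST": (1.1, 16.6, 2.0),
    "CFHT/Megacam": (10, 1.00, 0.6),
    "Pan-STARRS": (10, 7, 0.5),
    "DMT/LSST": (54, 7, 0.6),
}

powers = {name: survey_power(*values) for name, values in rows.items()}

for name, power in powers.items():
    print(f"{name:16s} {power:8.1f}")
```

The wide field of view is what sets Pan-STARRS apart: with the same collecting area as CFHT/Megacam, its seven-fold larger field and slightly better image quality yield ten times the survey power.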

20 Software & hardware design principles for data-intensive science
- Enhances the Beowulf model with storage co-located with commodity HPC nodes
- Databases for fast queries on index
- High sequential I/O bandwidth for varying query patterns
- Scale out instead of scale up
- The GrayWulf name pays tribute to Jim Gray, who was actively involved in defining these design principles.

21 GrayWulf (diagram labels): Shared Compute Resources; Shared Queryable Data Store; Configuration Management, Health and Performance Monitoring; Operator User Interface; User Interface; Data Valet User Interface; Valet Workflow; User Workflow; User Storage; Data Flow; Control Flow; Data Valet Queryable Data Store; User Queryable Data Store

22
- Cluster - Scheduling & Monitoring: Windows HPC 2008 Cluster
- Database - Shared Domain DBs & User MyDBs: SQL Server 2008
- Trident Workflow Workbench: Windows Workflow Foundation, Composer, Registry, Provenance/Logging
- Common data management library
- Domain-specific user interfaces: Scientists, Data Valets, System Operations

23
- 3000-node cluster
- 12,000 cores (36 × 10^12 cycles/sec)
- 48 terabytes of RAM
- 9 petabytes of persistent storage
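Dividing these aggregate figures by the node and core counts gives a rough picture of a single machine. The sketch below does that arithmetic; it assumes the resources are spread evenly across nodes and uses decimal units (1 TB = 1000 GB, 1 PB = 1000 TB), neither of which the slide states.

```python
# Aggregate figures from the slide.
NODES = 3000
CORES = 12_000
AGGREGATE_CYCLES_PER_SEC = 36e12
RAM_TB = 48
STORAGE_PB = 9

# Per-node and per-core figures, assuming an even spread (an assumption).
cores_per_node = CORES // NODES
clock_ghz = AGGREGATE_CYCLES_PER_SEC / CORES / 1e9
ram_gb_per_node = RAM_TB * 1000 / NODES
storage_tb_per_node = STORAGE_PB * 1000 / NODES

print(f"{cores_per_node} cores/node at {clock_ghz:.1f} GHz, "
      f"{ram_gb_per_node:.0f} GB RAM and {storage_tb_per_node:.0f} TB disk per node")
```

Under those assumptions each node is a commodity machine: 4 cores at 3 GHz, 16 GB of RAM, and 3 TB of local disk.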

24
- Continuously deployed since 2006
- Running on >> 10^4 machines
- Sifting through > 10 PB of data daily
- Runs on clusters > 3000 machines
- Handles jobs with > 10^5 processes each
- Used by >> 100 developers
- Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly

25
- Automatic plan generated by DryadLINQ
- Automatic distributed execution by Dryad
- Programmer writes sequential C#, VB,… code
  - System figures out the data-parallelism
  - Manages execution, traditional parallel-DB tricks

26 A radical approach to programming at scale
- Nodes talk to each other as little as possible (shared nothing).
- The programmer is not allowed to communicate between nodes.
- Data is spread across machines in advance; computation happens where the data is stored.
- The master program divides up tasks based on the location of the data, schedules each task on the machine where its data resides (or at least in the same rack), detects worker failures and restarts tasks, balances load, runs redundant executions, etc.
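The master's placement rule described on this slide (same machine, else same rack, else anywhere) and its restart-on-failure behavior can be sketched in a few lines. This is an illustration only, not Dryad's scheduler: the function names, the slot-counting model, and the example cluster are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_worker(data_host: str, rack_of: Dict[str, str],
                free_slots: Dict[str, int]) -> Optional[str]:
    """Prefer the machine holding the data, then its rack, then any machine."""
    available: List[str] = [w for w, slots in free_slots.items() if slots > 0]
    if data_host in available:
        return data_host
    same_rack = [w for w in available if rack_of[w] == rack_of[data_host]]
    if same_rack:
        return same_rack[0]
    return available[0] if available else None


def schedule(task_data_host: Dict[str, str], rack_of: Dict[str, str],
             free_slots: Dict[str, int]) -> Dict[str, str]:
    """Assign each task to a worker, consuming one slot per task."""
    slots = dict(free_slots)  # copy so the caller's view is untouched
    placement: Dict[str, str] = {}
    for task, host in task_data_host.items():
        worker = pick_worker(host, rack_of, slots)
        if worker is None:
            break  # cluster is full; remaining tasks would wait
        slots[worker] -= 1
        placement[task] = worker
    return placement


def reschedule_after_failure(placement: Dict[str, str], failed: str,
                             task_data_host: Dict[str, str],
                             rack_of: Dict[str, str],
                             free_slots: Dict[str, int]) -> Dict[str, str]:
    """Re-place tasks that were running on a failed worker."""
    surviving = {t: w for t, w in placement.items() if w != failed}
    slots = {w: s for w, s in free_slots.items() if w != failed}
    for worker in surviving.values():
        slots[worker] -= 1  # slots still occupied by healthy tasks
    lost = {t: task_data_host[t] for t, w in placement.items() if w == failed}
    return {**surviving, **schedule(lost, rack_of, slots)}


# Hypothetical three-machine cluster spread over two racks.
rack_of = {"a1": "rackA", "a2": "rackA", "b1": "rackB"}
free_slots = {"a1": 1, "a2": 1, "b1": 2}
task_data_host = {"t1": "a1", "t2": "a1", "t3": "b1"}

placement = schedule(task_data_host, rack_of, free_slots)
recovered = reschedule_after_failure(placement, "a1", task_data_host,
                                     rack_of, free_slots)
```

In this example `t2` cannot run on `a1` because its only slot is taken, so it falls back to `a2` in the same rack; when `a1` later fails, `t1` is restarted on `b1`, the only machine with a free slot.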

27
- The goal of the analysis is to execute a set of analysis functions on a collection of data files produced by high-energy physics experiments.
- Histogramming of events from a large data set (TBs).
- A DryadLINQ program provides an easy way to distribute the computation on the cluster.
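Histogramming distributes well because each file can be binned independently and the partial histograms simply add. The sketch below shows that shape in plain Python; it is an analogy only, since DryadLINQ programs are written in C# or VB against LINQ, and the function names, bin width, and sample values here are made up for illustration. The list comprehension over files is the step a system like DryadLINQ would run in parallel, one task per file.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def histogram_one_file(values: Iterable[float], bin_width: float) -> Counter:
    """Partial histogram for one file: bin index -> event count."""
    return Counter(int(v // bin_width) for v in values)


def merge(a: Counter, b: Counter) -> Counter:
    """Combine two partial histograms by adding counts bin by bin."""
    return a + b


# Stand-ins for per-file event values (made-up numbers).
files = [[0.5, 1.2, 1.7], [1.1, 2.9], [0.1, 2.2, 2.4]]

# Independent per-file step (parallelizable), then an associative merge.
partials = [histogram_one_file(f, bin_width=1.0) for f in files]
total = reduce(merge, partials, Counter())
```

Because `merge` is associative and commutative, the partial results can be combined in any order or in a tree, which is what lets the per-file work run wherever the data already lives.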

28
- Broad academic/research release of Dryad and DryadLINQ (binary for now, source release in planning)
- With tutorials, programming guides, sample code, libraries, and a community site
http://research.microsoft.com/collaboration/tools/dryad.aspx

29

30 © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

