1 Getting Beyond the Filesystem: New Models for Data Intensive Scientific Computing. Douglas Thain, University of Notre Dame. HEC FSIO Workshop, 6 August 2009.

2 The Standard Model

(Diagram: (1) the user uploads data from disk to the file system, (2) submits one million jobs to the batch system, and (3) the jobs access the data through the file system as they run.)

The user has to guess at what the filesystem is good at. The FS has to guess what the user is going to do next.

3 Understand what the user is trying to accomplish. …not what they are struggling to run on this machine right now.

4 The Context: Biometrics Research

Computer Vision Research Lab at Notre Dame.
Kinds of research questions:
– How does human variation affect matching?
– What’s the reliability of 3D versus 2D?
– Can we reliably extract a good iris image from an imperfect video, and use it for matching?
A great context for systems research:
– Committed collaborators, lots of data, problems of unbounded size, some time sensitivity, real impact.
– Acquiring a TB of data per month, sustained 90% CPU utilization.
– Always composing workloads that break something.

5 BXGrid Schema

Scientific Metadata:
Type  Subject  Eye    Color  FileID
Iris  S100     Right  Blue   10486
Iris  S100     Left   Blue   10487
Iris  S203     Right  Brown  24304
Iris  S203     Left   Brown  24305

General Metadata (e.g. for fileid 24305):
fileid = 24305, size = 300K, type = jpg, sum = abc123…

Immutable Replicas (of that file):
replicaid=423 state=ok
replicaid=105 state=ok
replicaid=293 state=creating
replicaid=102 state=deleting
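As a rough illustration of how these three layers fit together, here is a minimal Python sketch, assuming a simple record-per-layer layout; the class and field names are invented for this example and are not the actual BXGrid schema.

    # Illustrative only: one way to model the slide's three layers
    # (scientific metadata, general file metadata, immutable replicas).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Replica:                      # an immutable copy on some storage node
        replicaid: int
        state: str                      # "ok", "creating", or "deleting"

    @dataclass
    class FileRecord:                   # general metadata shared by all data types
        fileid: int
        size: str
        type: str
        checksum: str
        replicas: List[Replica] = field(default_factory=list)

    @dataclass
    class IrisRecord:                   # scientific metadata points at a file
        subject: str
        eye: str
        color: str
        fileid: int

    f = FileRecord(24305, "300K", "jpg", "abc123...",
                   [Replica(423, "ok"), Replica(105, "ok"),
                    Replica(293, "creating"), Replica(102, "deleting")])
    iris = IrisRecord("S203", "Left", "Brown", f.fileid)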

8 BXGrid Abstractions

S = Select( color=“brown” )
B = Transform( S, G )
M = AllPairs( A, B, F )

(Diagram: Select filters the iris records by eye color, Transform applies the feature extractor G to each selected image, and AllPairs compares every pair with F, yielding a ROC curve.)

H. Bui et al, “BXGrid: A Repository and Experimental Abstraction”, Special Issue on e-Science, Journal of Cluster Computing, 2009.

9 An abstraction is a declarative specification of both the data and the computation in a batch workload. (Think SQL for clusters.)
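To make the “think SQL for clusters” idea concrete, here is a hedged Python sketch of the Select / Transform / AllPairs chain from slide 8. The function names mirror the slide, but the signatures are illustrative stand-ins, not the real BXGrid or All-Pairs API.

    # Illustrative stand-ins for the abstractions named on slide 8.
    def Select(records, **criteria):
        """Keep the records whose metadata matches every given criterion."""
        return [r for r in records
                if all(r.get(k) == v for k, v in criteria.items())]

    def Transform(records, G):
        """Apply a per-item function G (e.g. feature extraction) to each record."""
        return [G(r) for r in records]

    def AllPairs(A, B, F):
        """Compare every element of A against every element of B with F."""
        return [[F(a, b) for b in B] for a in A]

    # Usage in the style of the slide (irises would be a list of metadata dicts):
    #   S = Select(irises, color="brown")
    #   B = Transform(S, G)
    #   M = AllPairs(A, B, F)   # A being another set prepared the same way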

12 AllPairs( set A, set B, func F )

(Diagram: F is applied to every pair (Ai, Bj), filling a matrix of comparison scores such as 0.5, 0.2, 0.9, …)

Moretti et al, “All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids”, IEEE TPDS, to appear in 2009.

13 > AllPairs IrisA IrisB MyComp.exe
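For orientation, this is roughly what that command asks for, written as a sequential Python sketch: run the comparison program on every pair of files from the two sets and collect the scores. IrisA, IrisB, and MyComp.exe are placeholders from the slide, and the sketch assumes the comparison program prints a single score; the real tool adds the data distribution and scheduling discussed on the next slide.

    # Sequential reference for the AllPairs command above (illustration only).
    import subprocess
    import sys
    from pathlib import Path

    def all_pairs(set_a_dir, set_b_dir, compare_exe):
        A = sorted(Path(set_a_dir).iterdir())
        B = sorted(Path(set_b_dir).iterdir())
        matrix = []
        for a in A:
            row = []
            for b in B:
                # Assumes the comparison program prints one score on stdout.
                out = subprocess.run([compare_exe, str(a), str(b)],
                                     capture_output=True, text=True, check=True)
                row.append(float(out.stdout.strip()))
            matrix.append(row)
        return matrix

    if __name__ == "__main__":
        scores = all_pairs(sys.argv[1], sys.argv[2], sys.argv[3])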

14 How can All-Pairs do better?

– No demand paging! Distribute the data via spanning tree (see the sketch below).
– Metadata lookup can be cached forever.
– File locations can be cached optimistically.
– Compute the right number of CPUs to use.
– Check for data integrity, but on failure, don’t abort, just migrate.
– On a multicore CPU, use multiple threads and a cache-aware access algorithm.
– Decide at runtime whether to run here on 4 cores, or run on the entire cluster.

(Chart: turnaround time versus number of CPUs, out to 200.)
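As one concrete example of the first point, here is a minimal sketch of spanning-tree distribution, assuming each worker that already holds the data can forward it to two more workers per round; transfer() is a placeholder for the real copy operation.

    # Sketch: cover N workers in O(log N) rounds instead of N server transfers.
    def spanning_tree_rounds(workers):
        have = [workers[0]]           # seed: the first worker pulls from the server
        pending = list(workers[1:])
        rounds = 0
        while pending:
            senders = list(have)
            for sender in senders:
                for _ in range(2):    # binary tree: each holder feeds two newcomers
                    if not pending:
                        break
                    receiver = pending.pop(0)
                    # transfer(sender, receiver)  # real data movement goes here
                    have.append(receiver)
            rounds += 1
        return rounds

    print(spanning_tree_rounds(["node%d" % i for i in range(500)]))  # prints 6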

15 What’s the Upshot?

– The end user gets scalable performance, always at the “sweet spot” of the system.
– It works despite many different failure modes, including uncooperative admins.
– A 60K x 60K All-Pairs ran on 100-500 CPUs over one week, the largest known experiment on public data in the field.
– The All-Pairs abstraction is 4X more efficient than the standard model.

16 AllPairs( set A, set B, func F )

(The All-Pairs diagram from slide 12, repeated here as a lead-in to the other abstractions.)

Moretti et al, “All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids”, IEEE TPDS, to appear in 2009.

17 SomePairs( set A, set B, pairs P, func F )

(Diagram: F is applied only to the index pairs listed in P, here (0,1) (0,2) (1,1) (1,2) (2,0) (2,2) (2,3), producing scores such as 0.5, 0.2, 0.9, …)

Moretti et al, “Scalable Modular Genome Assembly on Campus Grids”, under review in 2009.
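A minimal sketch of the SomePairs semantics, assuming P is a list of (i, j) index pairs; unlike AllPairs, only those listed pairs are evaluated. The names are illustrative, not the tool’s actual interface.

    # Evaluate F only on the explicitly requested index pairs.
    def some_pairs(A, B, P, F):
        return {(i, j): F(A[i], B[j]) for (i, j) in P}

    P = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 0), (2, 2), (2, 3)]
    scores = some_pairs([10, 20, 30], [1, 2, 3, 4], P, lambda a, b: a * b)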

18 Wavefront( R[x,0], R[0,y], F(x,y,d) )

(Diagram: given the boundary values R[x,0] and R[0,y], each interior cell R[x,y] is computed by F from its x, y, and diagonal (d) neighbors, so completed results sweep across the matrix as a wavefront.)

Yu et al, “Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions”, IEEE HPDC 2009.
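A sequential Python sketch of the Wavefront recurrence, assuming the boundary row R[x,0] and column R[0,y] are given and F takes the neighboring x, y, and diagonal (d) values; a real run sweeps the anti-diagonals in parallel across cores and cluster nodes, which this sketch omits.

    # Fill the table from its boundaries; results propagate as a wavefront.
    def wavefront(row0, col0, F):
        n, m = len(col0), len(row0)
        R = [[0] * m for _ in range(n)]
        R[0] = list(row0)                      # given boundary row
        for i in range(n):
            R[i][0] = col0[i]                  # given boundary column
        for i in range(1, n):
            for j in range(1, m):
                R[i][j] = F(R[i - 1][j], R[i][j - 1], R[i - 1][j - 1])
        return R

    # Toy recurrence standing in for a real F (e.g. sequence alignment):
    table = wavefront([0, 1, 2, 3], [0, 1, 2, 3], lambda x, y, d: max(x, y, d) + 1)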

19 Directed Graph

part1 part2 part3: input.data split.py
    ./split.py input.data
out1: part1 mysim.exe
    ./mysim.exe part1 >out1
out2: part2 mysim.exe
    ./mysim.exe part2 >out2
out3: part3 mysim.exe
    ./mysim.exe part3 >out3
result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result

Thain and Moretti, “Abstractions for Cloud Computing with Condor”, in Handbook of Cloud Computing Services, in press.
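One way to read these rules: run any rule whose inputs exist and whose outputs do not, until nothing is left to do. The Python sketch below is an illustrative local driver under that assumption, not the actual tool, which dispatches each command to a batch system rather than running it in place.

    # Naive local driver for the make-style rules above (illustration only).
    import os
    import subprocess

    rules = [
        # (targets, sources, command), mirroring the slide's rules
        (["part1", "part2", "part3"], ["input.data", "split.py"],
         "./split.py input.data"),
        (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 >out1"),
        (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 >out2"),
        (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 >out3"),
        (["result"], ["out1", "out2", "out3", "join.py"],
         "./join.py out1 out2 out3 > result"),
    ]

    progress = True
    while progress:
        progress = False
        for targets, sources, cmd in rules:
            ready = all(os.path.exists(s) for s in sources)
            done = all(os.path.exists(t) for t in targets)
            if ready and not done:
                subprocess.run(cmd, shell=True, check=True)
                progress = True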

20 Contributions

Repository for Scientific Data:
– Provide a better match with end-user goals.
– Simplify scalability, management, recovery…
Abstractions for Large Scale Computing:
– Expose the actual data needs of the computation.
– Simplify scalability, consistency, recovery…
– Enable transparent portability from single CPU to multicore to cluster to grid.

Are there other useful abstractions?

21 HECURA Summary

Impact on real science:
– Production-quality data repository used to produce datasets accepted by NIST and published to the community. “Indispensable.”
– Abstractions used to scale up a student’s daily “scratch” work to 500 CPUs; increased data coverage of published results by two orders of magnitude.
– Planning to adapt the prototype to a Civil Engineering CDI project.
Impact on computer science:
– Open source tools that run on Condor, SGE, and multicore, recently with independent users.
– 1 chapter, 2 journal papers, 4 conference papers.
Broader Impact:
– Integration into courses, used by other projects at ND.
– Undergrad co-authors from ND and elsewhere.

22 Participants

Faculty:
– Douglas Thain and Patrick Flynn
Graduate Students:
– Christopher Moretti (Abstractions)
– Hoang Bui (Repository)
– Li Yu (Multicore)
Undergraduates:
– Jared Bulosan
– Michael Kelly
– Christopher Lyon (North Carolina A&T Univ)
– Mark Pasquier
– Kameron Srimoungchanh (Clemson Univ)
– Rachel Witty

23 For more information:

The Cooperative Computing Lab
– http://www.cse.nd.edu/~ccl
Cloud Computing and Applications
– October 20-21, Chicago IL
– http://www.cca09.org

This work was supported by the National Science Foundation via grant CCF-06-21434.

