EECE 571R: Data-intensive computing systems Matei Ripeanu matei at ece.ubc.ca.


1 EECE 571R: Data-intensive computing systems Matei Ripeanu matei at ece.ubc.ca

2 Matei Ripeanu, UBC EECE571R Data-intensive computing (Spring’07)
Contact Info
• Email: matei at ece.ubc.ca
• Office: KAIS 4033
• Office hours: by appointment (email me)
• Course page:

3 EECE 571R: Course Goals
• Primary
  – Gain deep understanding of fundamental issues that affect design of:
    > Data-intensive systems
    > (more generally) Large-scale distributed systems
  – Survey main current research themes
  – Gain experience with distributed systems research
    > Research on: federated systems, networks
• Secondary
  – By studying a set of outstanding papers, build knowledge of how to do & present research
  – Learn how to read papers & evaluate ideas

4 What I’ll Assume You Know
• Basic Internet architecture
  – IP, TCP, DNS, HTTP
• Basic principles of distributed computing
  – Asynchrony (cannot distinguish between communication failures and latency)
  – Incomplete & inconsistent global state knowledge (cannot know everything correctly)
  – Failures happen (in large systems, even rare failures of individual components aggregate to high failure rates)
• If there are things that don’t make sense, ask!
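The "failures happen" point can be made concrete with a little arithmetic. The sketch below assumes component failures are independent and equally likely, which is a simplification not stated on the slide; the function name and the probability used are illustrative only.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes independent failures with the same probability p_component
    (an illustrative simplification, not a claim from the lecture).
    """
    return 1.0 - (1.0 - p_component) ** n_components


# A failure that is rare for one component becomes near-certain at scale.
for n in (1, 100, 10_000):
    print(n, prob_any_failure(0.001, n))
```

With a per-component failure probability of 0.1%, a system of 10,000 components almost always has at least one failed component, which is why large systems must treat failure as the normal case.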

5 Outline
• Case study (and project ideas):
  – Volunteer computing: SETI@home/BOINC
  – Virtual Data System
  – Batch Aware Distributed File System
• Administrative

6 6 Matei Ripeanu, UBC EECE571R Data-intensive computing (Spring’07)

7 How does it work?
Characteristics:
• Fixed-rate data processing task
• Low bandwidth/computation ratio
• Independent parallelism
• Error tolerance
Master-worker architecture
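A minimal sketch of the master-worker pattern named on this slide, assuming threads stand in for volunteer hosts. `process_work_unit` and `run_master` are hypothetical names, and the "analysis" is a placeholder; the real system distributes work units over the Internet.

```python
from concurrent.futures import ThreadPoolExecutor


def process_work_unit(unit: list[float]) -> float:
    """Placeholder for the analysis a worker runs on one work unit."""
    return max(unit)


def run_master(work_units: list[list[float]], n_workers: int = 4) -> list[float]:
    """Master: hand out independent work units and collect the results.

    Units do not depend on each other, so any worker can take any unit.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_work_unit, work_units))


results = run_master([[1.0, 3.0], [2.0, 0.5], [7.0]])
```

Because the work units are independent, the master never needs to coordinate workers with each other, which is what makes the design tolerant of slow or failed workers.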

8 Operations
[Diagram of system components: data recorder, DLT tapes, splitters, WU storage, data server, screensavers, result queue, acct. queue, science DB, user DB, master DB, redundancy checking, RFI elimination, repeat detection, garbage collector, tape archive/delete, tape backup, web site, CGI program, web page generator]

9 History and Statistics
• Conceived 1995, launched April 1999
• Millions of users, hosts…
• No ET signals yet, but other results
Statistics (as of Wed Feb 23 07:04:51):
                              Total                   Last 24 hours
Users                         5,361,313               4,391
Results received              1,779 million           5 million
Total CPU time                2.2 million years       [value missing] years
Average CPU time/work unit    10 hr 58 min 14.0 sec   6 hr 19 min 30.1 sec

10 Millions of individual contributors! (Problems)
• Server scalability
• Dealing with excess CPU time
• Untrusted environment: bad user behavior
  – Cheating
  – Team recruitment by spam
  – Sale of accounts on eBay
• Malfunctions of individual components
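One common defence against cheating and malfunctioning hosts is redundancy checking: send the same work unit to several hosts and accept a result only when enough of them agree. The sketch below is an assumption about how such a check could look, not the project's actual validator; `validate` and the `quorum` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional


def validate(results: list[str], quorum: int = 2) -> Optional[str]:
    """Return the result reported by at least `quorum` hosts, else None.

    Each entry in `results` is one host's answer for the same work unit.
    """
    counts = Counter(results)
    value, votes = counts.most_common(1)[0] if counts else (None, 0)
    return value if votes >= quorum else None
```

A work unit that returns `None` would simply be reissued to more hosts, trading the excess CPU time that volunteers offer for confidence in the answer.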

11 Summary
• The characteristics of the problem …
  – Massive (“embarrassing”) parallelism
  – Low bandwidth/computation ratio
  – Fixed-rate data processing task
• … make possible a solution that operates in an unfriendly environment
  – Wide area distribution; huge scale
  – High failure rates
  – Untrusted/malicious components
• Solution: Master-worker design
    > Master = central point of control
    > Single point of failure
    > Performance bottleneck

12 Outline
• Case study (and project ideas):
  – Volunteer computing: SETI@home/BOINC
  – Virtual Data System
  – Batch Aware Distributed File System
• Administrative

13 Virtual Data System
• Context: “big science”
• Motivation/goals: support the science process,
  – i.e., track all aspects of data capture, production, transformation, and analysis
• Requirements: ability to define complex workflows, and to reliably & efficiently execute workflows in heterogeneous, multi-domain environments.
• Derived benefits: helps to audit, validate, reproduce, and/or rerun with corrections various data transformations.

14 BIG Science!
CERN, the European Organisation for Nuclear Research, builds particle accelerators for particle physics research.

15 Data Handling and Computation for Physics Analysis (CERN)
[Diagram: data flow among detector, event filter (selection & reconstruction), raw data, reconstruction, event summary data, processed data, event reprocessing, event simulation, batch physics analysis, analysis objects (extracted by physics topic), interactive physics analysis]

16 CMS Grid Hierarchy
[Diagram: tiered data-distribution hierarchy. Labels: Experiment; Online System; Tier 0; CERN Computer Center (> 20 TIPS); Tier 1; USA, France, Italy, and UK Centers; Tier 2; Tier2 Centers; Tier 3; Institutes; Tier 4; workstations, other portals; physics data cache. Link annotations: 100MB~1.5GB/sec; 10 ~ 40 Gbits/sec; other links in Gbits/sec]
• Bunch crossing per 25 ns
• 100 triggers per second
• ~1 MByte per event
• 2500 physicists, 40 countries
• 10s of Petabytes/Yr by 2008

17 Motivations (1)
[Diagram: Transformation, Derivation, and Data linked by the relations product-of, execution-of, consumed-by/generated-by]
• “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
• “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
• “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
• “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”

18 Motivations (2)
• Data track-ability and result audit-ability
• Repair and correction of data
  – Rebuild data products (cf. “make”)
• Workflow management
  – A new, structured paradigm for organizing, locating, specifying, and requesting data products
• Performance optimizations
  – Ability to re-create data rather than move it
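The "make"-style rebuild idea can be sketched as a walk over provenance records: given an input found to be wrong, find every data product derived from it, directly or indirectly. The record format and all product names below are hypothetical; this is not the Virtual Data System's actual catalog schema.

```python
from collections import deque

# Hypothetical provenance records: product -> inputs it was derived from.
DERIVED_FROM = {
    "calibrated": ["raw"],
    "events": ["calibrated"],
    "histogram": ["events"],
    "sky_map": ["other_raw"],
}


def stale_products(changed: str, derived_from: dict[str, list[str]]) -> set[str]:
    """Return every product that depends, transitively, on `changed`."""
    # Invert the records so we can walk from an input to its consumers.
    consumers: dict[str, list[str]] = {}
    for product, inputs in derived_from.items():
        for source in inputs:
            consumers.setdefault(source, []).append(product)

    stale: set[str] = set()
    queue = deque([changed])
    while queue:
        for product in consumers.get(queue.popleft(), []):
            if product not in stale:
                stale.add(product)
                queue.append(product)
    return stale
```

Calling `stale_products("raw", DERIVED_FROM)` answers the calibration-error question from the previous slide: only products downstream of the bad input need to be recomputed.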

19 Requirements
• Express complex multi-step “workflows”
  – Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  – Different formats & access protocols
• Harness many computing resources
  – Parallel computers &/or distributed Grids
• Execute workflows reliably
  – Despite diverse failure conditions
• Enable reuse of data & workflows
  – Discovery & composition
• Support many users, workflows, resources
  – Policy specification & enforcement

20 Virtual Data System
[Diagram: workflow pipeline. Labels: VDL Program; Virtual Data catalog; Virtual Data Workflow Generator; Workflow spec; Abstract workflow; Create Execution Plan; Local planner; Job Planner; Job Cleanup; DAGman DAG; Statically Partitioned DAG; Dynamically Planned DAG; DAGman & Condor-G; Grid Workflow Execution]
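Turning an abstract workflow into an execution plan amounts, at its simplest, to ordering a DAG of tasks so that every task runs after its prerequisites. The sketch below uses Python's standard `graphlib` module as a stand-in for a real planner; the task names are hypothetical and no site selection or data placement is modelled.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: task -> set of tasks it depends on.
WORKFLOW = {
    "simulate": set(),
    "reconstruct": {"simulate"},
    "filter": {"reconstruct"},
    "analyze": {"filter", "reconstruct"},
}


def execution_plan(workflow: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the tasks can be executed."""
    return list(TopologicalSorter(workflow).static_order())


plan = execution_plan(WORKFLOW)
```

A production planner would additionally partition the DAG, pick execution sites, and insert data-transfer and cleanup jobs; the ordering constraint shown here is the part every such planner must respect.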

21 VDS Software Stack
• Express complex multi-step “workflows”
  – Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  – Different formats & access protocols
• Harness many computing resources
  – Parallel computers &/or distributed res.
• Execute workflows reliably & efficiently
  – Despite diverse failure conditions
• Enable reuse of data & workflows
  – Discovery & composition
• Support many users, workflows, resources
  – Policy specification & enforcement
Components labelled on the slide: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD

22 Outline
• Case study (and project ideas):
  – Volunteer computing: SETI@home/BOINC
  – Virtual Data System
  – Batch Aware Distributed File System
• Administrative

23 Batch-aware Distributed File System

24 Motivating question: Are existing distributed file systems adequate for batch computing workloads?
• NO. Internal decisions inappropriate
  – Caching, consistency, replication
• A solution: combine scheduling knowledge with external storage control
  – Detailed information about workload is known
  – Storage layer allows external control
  – External scheduler makes informed storage decisions
• Combining information and control results in
  – Improved performance
  – More robust failure handling
  – Simplified implementation
Reference: Explicit Control in a Batch-Aware Distributed File System. John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Miron Livny. NSDI '04.

25 Outline
• Batch computing
  – Systems
  – Workloads
  – Environment
  – Why not DFS?
• Solution: BAD-FS
  – Design
  – Experimental evaluation

26 Batch computing
[Diagram labels: home storage, Internet]

27 Batch computing
• Not interactive
• Compute loop
  – Users submit jobs
    > Job description languages
  – System itself executes
  – Results are copied back to user system
• Many existing batch systems
  – Condor, LSF, PBS, Sun Grid Engine

28 Batch computing
[Diagram: home storage and job queue connected over the Internet to a scheduler and four compute nodes, each with a CPU manager; steps numbered 1 to 4]

29 Batch workloads
• General properties
  – Large number of processes
  – Process and data dependencies
  – I/O intensive
• Different types of I/O
  – Endpoint
  – Batch
  – Pipeline
• Usage: mainly scientific workloads, but also video production, data mining, electronic design, financial services, graphic rendering
Reference: Pipeline and Batch Sharing in Grid Workloads. Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003.

30 Batch workloads
[Diagram: job pipelines showing endpoint I/O, pipeline I/O, and a shared batch dataset]

31 Cluster-to-cluster (c2c)
• Not quite p2p
  – More organized
  – Less hostile
  – More homogeneity
• Each cluster is autonomous
  – Run and managed by different entities
• An obvious bottleneck is the wide-area network
Q: How to manage the flow of data into, within, and out of these clusters?
[Diagram labels: Internet, home store]

32 Why not a traditional Distributed File System?
• Distributed file system (DFS) would be ideal
  – Easy to use
  – Uniform name space
• But...
  – Designed for wide-area networks
  – Not practical
  – Embedded decisions are wrong
[Diagram labels: Internet, home store]

33 Distributed file systems make ‘bad’ decisions
• Caching
  – Must guess what and how to cache
• Consistency
  – Output: must guess when to commit
  – Input: needs mechanism to invalidate cache
• Replication
  – Must guess what to replicate

34 BAD-FS makes ‘good’ (i.e., informed) decisions
• Removes the guesswork
  – Scheduler has detailed workload knowledge
  – Storage layer designed to allow external control
  – Scheduler makes informed storage decisions
    > Manages data as well as computations
• Retains simplicity of distributed file systems
• Practical and deployable

35 Outline
• Introduction
• Batch computing
  – Systems
  – Workloads
  – Environment
  – Why not DFS?
• One solution: BAD-FS
  – Design
  – Experimental evaluation

36 Solution BAD-FS: Practical and deployable
• User-level; requires no privilege
• Packaged as a modified batch system
• A new batch system which includes BAD-FS
• General: will work on all batch systems
[Diagram: home store connected over the Internet to SGE clusters whose nodes each run BAD-FS]

37 Solution BAD-FS: Components
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) BAD-FS scheduler
[Diagram: home storage and job queue feeding the BAD-FS scheduler, which drives four compute nodes, each with a CPU manager and a storage manager]

38 Information used
• Remote cluster knowledge
  – Storage availability
  – Failure rates
• Workload knowledge
  – Data type (batch, pipeline, or endpoint)
  – Data quantity
  – Job dependencies

39 Control through volumes
• Guaranteed storage allocations
  – Containers for job I/O
• Scheduler
  – Creates volumes to cache input data
    > Subsequent jobs can reuse this data
  – Creates volumes to buffer output data
    > Destroys pipeline, copies endpoint
  – Configures workload to access containers
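The output-handling rule on this slide ("destroys pipeline, copies endpoint") can be sketched as follows. The `Volume` class, `release_output`, and the dictionary standing in for home storage are all hypothetical; BAD-FS's real interfaces are not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """A storage container for job I/O (illustrative model only)."""
    kind: str  # "batch", "pipeline", or "endpoint"
    files: dict[str, bytes] = field(default_factory=dict)


def release_output(volume: Volume, home_storage: dict[str, bytes]) -> None:
    """When a job finishes: copy endpoint output home, discard the rest.

    Pipeline data is only an intermediate between jobs, so it never
    crosses the wide-area network.
    """
    if volume.kind == "endpoint":
        home_storage.update(volume.files)
    volume.files.clear()
```

The point of the design is that the scheduler, not the file system, knows which bytes are worth shipping back to the user.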

40 Knowledge plus control
• Enhanced performance
  – I/O scoping
  – Capacity-aware scheduling
• Improved failure handling
  – Cost-benefit replication
• Simplified implementation
  – No cache consistency protocol
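Cost-benefit replication is only named on the slide. One plausible reading, offered here as an assumption rather than the paper's exact rule, is to replicate intermediate data when the expected cost of regenerating it after a failure exceeds the cost of making a copy. The function and parameter names are hypothetical.

```python
def should_replicate(failure_prob: float,
                     regenerate_cost_s: float,
                     replicate_cost_s: float) -> bool:
    """Replicate when expected regeneration time exceeds the copy time.

    failure_prob: estimated chance the node holding the data fails.
    regenerate_cost_s: seconds needed to recompute the data if lost.
    replicate_cost_s: seconds needed to copy the data elsewhere now.
    """
    return failure_prob * regenerate_cost_s > replicate_cost_s
```

For example, data that takes an hour to regenerate on a node with a 10% failure estimate is worth a one-minute copy, while cheap-to-recompute data on a reliable node is not.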

41 Real workload experience
• Setup
  – 16 jobs
  – 16 compute nodes
  – Emulated wide-area
• Configuration
  – Remote I/O
  – AFS-like with /tmp
  – BAD-FS
• Result: an order-of-magnitude improvement

42 BAD-FS Lessons
• Generic solutions may be inefficient
  – Often designed with specific tradeoffs in mind (e.g., most common workloads)
• Fix:
  – Redesign for new workload
  – Use explicit information available at runtime to optimize the execution of lower layers

43 Course Organization/Syllabus/etc.

44 Administrivia: Course structure
• Lectures
  – About 1/3 of all classes
• Student projects
  – Aim high! Have fun! It’s a class project, not your PhD!
  – Teams of up to 3 students
  – Project presentations at the end of the term
• Paper discussion
  – The other classes

45 Administrivia: Weekly schedule (tentative)
1. Introduction. Overview of current research problems, technologies, and applications.
2. File system semantics, data durability and availability, replication and consistency, fault-tolerance.
3. Data storage technologies. Storage hierarchies. Capacity management.
4. Scientific applications: data access patterns, workload characterization.
5. Integration with compute systems. Grids and Virtual Data.
6. Performance focus: caching, parallel access, striping.
7. Structured overlays. Distributed hash tables. Data systems harnessing structured overlays.
8. Security.
9. Applications I: Experience with deployed systems (NFS, AFS, Google File System).
10. Applications II: Data archival. Cooperative Internet proxy caches. Content distribution networks.
11. Applications III: Peer-to-peer file-sharing (BitTorrent, FreeLoader).
12. Project presentations.

46 Administrivia: Grading
• Paper reviewing: 35%
• Discussion leading: 15%
• Project: 50%

47 Administrivia: Paper Reviewing (1)
• Goals:
  – Think about what you read
  – Expand your knowledge beyond the papers that are assigned
  – Get used to writing paper reviews
• Reviews due by midnight the day before the class
• Be professional in your writing
• Keep an eye on the writing style:
  – Clarity
  – Beware of traps: learn to use them in writing and detect them in reading
  – Detect (and stay away from) trivial claims. E.g., 1st sentence in the Introduction: “The tremendous/unprecedented/phenomenal growth/scale/ubiquity of the Internet…”

48 Administrivia: Paper Reviewing (2)
Follow the form provided when relevant.
• State the main contribution of the paper
• Critique the main contribution:
• Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution).
• Explain your rating in a sentence or two.
Rate how convincing the methodology is.
• Do the claims and conclusions follow from the experiments?
• Are the assumptions realistic?
• Are the experiments well designed?
• Are there different experiments that would be more convincing?
• Are there other alternatives the authors should have considered?
• (And, of course, is the paper free of methodological errors?)

49 Administrivia: Paper Reviewing (3)
• What is the most important limitation of the approach?
• What are the three strongest and/or most interesting ideas in the paper?
• What are the three most striking weaknesses in the paper?
• Name three questions that you would like to ask the authors.
• Detail an interesting extension to the work not mentioned in the future work section.
• Optional comments on the paper that you’d like to see discussed in class.

50 Administrivia: Discussion leading
• Come prepared!
  – Prepare discussion outline
  – Prepare questions:
    > “What if”s
    > Unclear aspects of the solution proposed
    > …
  – Similar ideas in different contexts
  – Initiate short brainstorming sessions
• Leaders do NOT need to submit paper reviews
• Main goals:
  – Keep discussion flowing
  – Keep discussion relevant
  – Engage everybody (I’ll keep an eye on this, too)

51 Administrivia: Projects
• Combine with your research if relevant to the class
• Get approval from all instructors if you overlap final projects:
  – Don’t sell the same piece of work twice
  – You can get more than twice as many results with less than twice as much work
• Aim high!
  – Put in one extra month and get a publication out of it
  – It is doable!
• Try ideas that you postponed out of fear: it’s just a class, not your PhD.

52 Administrivia: Project deadlines (tentative)
• 3rd week (Tue): 1-page project proposal
• 5th week (Tue): 3-page literature survey
  – Know relevant work in your problem area
  – If implementation project, list tools, similar projects
  – Expand proposal
• 7th week (Tue): 5-page midterm project report due
  – Have a clear image of what’s possible/doable
  – Report preliminary results
• First week of exam session: in-class project presentation
  – Demo, if appropriate
• Last week of exam session:
  – 10-page write-up

53 Next Class (Thu, 11/01)
• Note room change: KAIS
• Discussion of some project ideas
• Presentation by Matei
To do:
• Subscribe to mailing list
• Volunteers for discussion leaders for class next week

54 Questions?

