1 Big Data Platforms Mihai Budiu, Oct 6 2014

2 My work: Ph.D. from Carnegie Mellon, 2003 (hardware synthesis; reconfigurable hardware; compilers and computer architecture). Researcher at Microsoft Research Silicon Valley, 2004-2014 (computer security; cloud computing infrastructure: distributed computation platforms, monitoring and debugging, performance analysis; big data analysis and visualization; large scale machine learning).

3 500 Years Ago: Tycho Brahe (1546-1601), Johannes Kepler (1571-1630)

4 The Laws of Planetary Motion: Tycho’s measurements, Kepler’s laws

5 The Large Hadron Collider: 25 PB/year; Worldwide LHC Computing Grid (WLCG): 200K computing cores

6 Genetic Code

7 Astronomy

8 Weather

9 The Webs: Internet, Facebook friends graph

10 Big Data

11 Big Computers

12 Talk Outline: Motivation; Dryad: A distributed runtime; DryadLINQ: A compiler for Dryad; Tools and applications; Sketch: A billion-row spreadsheet

13 Design Space [figure labels: Throughput (batch), Latency (interactive), Internet, Data center, Data-parallel, Shared memory]

14 Dryad: Eurosys 2007. Continuously deployed in Microsoft since 2006. Execution engine of Bing analytics. > 10^5 machines. Many PB of data analyzed daily. [image: Dryad painting by Evelyn de Morgan]

15 Dryad = Execution Layer: Job (application) / Dryad / Cluster ≈ Pipeline / Shell / Machine

16 2-D Piping
Unix Pipes (1-D): grep | sed | sort | awk | perl
Dryad (2-D): grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
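The 2-D notation can be read as "each stage runs as many parallel instances, each on its own partition of the data". The following is a minimal single-process Python sketch of that reading, not Dryad's API: the names `repartition` and `run_2d_pipeline`, the round-robin partitioning, and the tiny stage widths are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Records = List[str]
# (stage name, function applied to one partition, number of instances)
Stage = Tuple[str, Callable[[Records], Records], int]


def repartition(records: Iterable[str], width: int) -> List[Records]:
    """Split records round-robin into `width` partitions."""
    parts: List[Records] = [[] for _ in range(width)]
    for i, rec in enumerate(records):
        parts[i % width].append(rec)
    return parts


def run_2d_pipeline(records: Records, stages: List[Stage]) -> Records:
    """Run each stage as `width` independent instances, one per partition."""
    for _name, fn, width in stages:
        partitions = repartition(records, width)
        # On a cluster each call below would be a separate process;
        # here they simply run one after another.
        outputs = [fn(part) for part in partitions]
        records = [rec for out in outputs for rec in out]
    return records


def grep_error(part: Records) -> Records:
    return [r for r in part if "error" in r]


def to_upper(part: Records) -> Records:
    return [r.upper() for r in part]


def sort_all(part: Records) -> Records:
    return sorted(part)


stages: List[Stage] = [
    ("grep", grep_error, 4),
    ("sed", to_upper, 2),
    ("sort", sort_all, 1),  # a single instance yields a total order
]
lines = ["ok 1", "error b", "ok 2", "error a", "error c"]
result = run_2d_pipeline(lines, stages)
print(result)
```

The final stage uses one instance so that the output is globally sorted; with more instances each partition would only be sorted locally.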

17 Virtualized 2-D Pipelines

18 Virtualized 2-D Pipelines

19 Virtualized 2-D Pipelines

20 Virtualized 2-D Pipelines

21 Virtualized 2-D Pipelines: 2D, DAG, multi-machine, virtualized

22 Dryad Job Structure [figure labels: Input files, Vertices (processes), Channels, Stage, Output files; vertex programs: grep, sed, sort, awk, perl]

23 Dryad System Architecture [figure labels: job manager, cluster, job schedule, control plane, data plane (Files, TCP, FIFO, Network), NS, Sched, RE, V]

24 Staging: 1. Build; 2. Send .exe; 3. Start manager; 4. Query cluster resources; 5. Generate graph; 6. Initialize vertices; 7. Serialize vertices; 8. Monitor vertex execution [figure labels: GM code, vertex code, Name server, Remote execution service]

25 Talk Outline: Motivation; Dryad: A distributed runtime; DryadLINQ: A compiler for Dryad; Tools and applications; Sketch: A billion-row spreadsheet

26 Distributed Collections [figure labels: Partition, Collection, .Net objects]

27 LINQ: Dryad => DryadLINQ

28 LINQ = .Net + Queries
    Collection<T> collection;
    bool IsLegal(Key);
    string Hash(Key);
    var results = from c in collection
                  where IsLegal(c.key)
                  select new { Hash(c.key), c.value };

29 DryadLINQ = LINQ + Dryad
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key);
    var results = from c in collection
                  where IsLegal(c.key)
                  select new { Hash(c.key), c.value };
[figure labels: C# collection, results, C# Vertex code, Query plan (Dryad job), Data]

30 Language Summary: Where, Select, GroupBy, OrderBy, Aggregate, Join

31 Very expressive
    var result = input.SelectMany(r => Mapper(r))
                      .GroupBy(r => Key(r))
                      .Select(g => Reducer(g));
Map-Reduce; Distributed sorting; Iterative machine-learning (EM)
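The SelectMany / GroupBy / Select chain on this slide is the Map-Reduce pattern. As a rough single-machine analogue in Python (not DryadLINQ, and with no distribution), the sketch below does word counting. The function name `map_reduce` and the sample documents are made up for illustration, and the reducer receives the key and the group as two arguments instead of a single LINQ grouping object.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(inputs, mapper, key, reducer):
    """Single-machine analogue of SelectMany -> GroupBy -> Select."""
    # SelectMany: each input record may produce several output records.
    mapped = chain.from_iterable(mapper(r) for r in inputs)
    # GroupBy: collect records that share a key.
    groups = defaultdict(list)
    for item in mapped:
        groups[key(item)].append(item)
    # Select: reduce each group to one result.
    return [reducer(k, g) for k, g in groups.items()]


docs = ["big data", "big computers", "data platforms"]
counts = dict(
    map_reduce(
        docs,
        mapper=lambda doc: doc.split(),          # document -> words
        key=lambda word: word,                   # group identical words
        reducer=lambda word, group: (word, len(group)),
    )
)
print(counts)
```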

32 Talk Outline: Motivation; Dryad: A distributed runtime; DryadLINQ: A compiler for Dryad; Tools and applications; Sketch: A billion-row spreadsheet

33 Debugging DryadLINQ jobs

34 Distributed performance counters

35 Training Kinect [figure labels: Depth map, Body parts, Classifier, Xbox, GPU]

36 Learn from Many Examples [figure labels: Decision Tree Classifier, Machine learning]

37 Talk Outline: Motivation; Dryad: A distributed runtime; DryadLINQ: A compiler for Dryad; Tools and applications; Sketch: A billion-row spreadsheet

38 Bandwidth hierarchy

39 Principles: Visualizations are bounded data displays. All computations are sketches. Sketch is a runtime for (1) running streaming (sketching) algorithms and (2) implementing visualizations with bounded data renderings.

40 Streaming algorithms: Sketches = randomized streaming algorithms. Input = set of size n. Result same independent of the order. Memory = O(log(n)). Multi-pass. Linear input transformations.
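To make "result same independent of the order" and small memory concrete, here is a Python sketch of one well-known mergeable summary, the bottom-k (k minimum values) estimator for the number of distinct items. It is chosen only as an illustration: the slides do not say which sketches the system uses, its memory is O(k) rather than O(log n), and `K`, `unit_hash`, `create`, `combine` and `estimate` are hypothetical names.

```python
import hashlib
import heapq
import random

K = 64  # sketch size (hypothetical parameter, not from the slides)


def unit_hash(item) -> float:
    """Map an item to a reproducible pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def create(items, k=K):
    """One streaming pass keeping only the k smallest distinct hash values."""
    heap, kept = [], set()  # heap holds negated values, so it acts as a max-heap
    for item in items:
        v = unit_hash(item)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # v beats the largest kept value: swap it in.
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    return sorted(kept)


def combine(parts, k=K):
    """Merge partial sketches: the k smallest values of their union."""
    merged = set()
    for part in parts:
        merged.update(part)
    return sorted(merged)[:k]


def estimate(sketch, k=K):
    """Estimate the number of distinct items summarized by the sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct values: exact count
    return (k - 1) / sketch[-1]


data = [i % 10_000 for i in range(30_000)]  # 10,000 distinct values
shuffled = data[:]
random.Random(0).shuffle(shuffled)
chunks = [shuffled[i::3] for i in range(3)]  # pretend these live on 3 machines

whole = create(data)
merged = combine([create(c) for c in chunks])
print(whole == create(shuffled), merged == whole, round(estimate(whole)))
```

Because the summary depends only on the set of hash values seen, processing the input in any order, or in pieces that are combined afterwards, yields the same sketch.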

41 4 billion rows on 155 machines

42 Spreadsheet operations: Browsing/scrolling; Filtering using predicates; Heavy hitters; Sampling; Searching; Sorting; Computing new columns; Set operations (intersection, union, etc.); Charting

43 Histograms

44 Heat Maps

45 Sketch distributed service [figure labels, repeated for each of four machines: data, Sketch service]

46 DataSets = distributed objects [figure labels: Client, Network, Servers, Application, DataSet, T (repeated)]

47 Sketch Spreadsheet architecture [figure labels: DataSet; SQL Server, CSV Files, Column store, Cosmos; Storage layer; Table operations; GUI; Distributed objects; Spreadsheet logic; Spreadsheet display]

48 DataSet API
    interface IDataSet<T> {
        IDataSet<S> Map<S>(Func<T, S> f);
        IDataSet<Pair<T, S>> Zip<S>(IDataSet<S> other);
        R Sketch<R>(ISketch<T, R> sketch);
    }
    interface ISketch<T, R> {
        R Create(T data);
        R Combine(List<R> parts);
    }
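One way to read this API is as a tree of datasets: leaves hold a piece of data of type T, inner nodes fan an operation out to their children, and a sketch is created at the leaves and combined on the way back up. The Python sketch below illustrates that reading under stated assumptions; it runs in one process, the class names `LocalDataSet`, `ParallelDataSet` and `SumSketch` are invented here, and nothing in it is the system's actual implementation.

```python
class SumSketch:
    """A toy sketch: create() summarizes one leaf, combine() merges summaries."""

    def create(self, data):
        return sum(data)

    def combine(self, parts):
        return sum(parts)


class LocalDataSet:
    """Leaf: holds one piece of data (the T in the interface)."""

    def __init__(self, data):
        self.data = data

    def map(self, f):
        return LocalDataSet(f(self.data))

    def zip(self, other):
        return LocalDataSet((self.data, other.data))

    def sketch(self, s):
        return s.create(self.data)


class ParallelDataSet:
    """Inner node: forwards each operation to its children."""

    def __init__(self, children):
        self.children = list(children)

    def map(self, f):
        return ParallelDataSet([c.map(f) for c in self.children])

    def zip(self, other):
        if len(self.children) != len(other.children):
            raise ValueError("datasets must have the same shape to be zipped")
        pairs = zip(self.children, other.children)  # built-in zip
        return ParallelDataSet([a.zip(b) for a, b in pairs])

    def sketch(self, s):
        return s.combine([c.sketch(s) for c in self.children])


# A two-level tree: one leaf plus a nested group of two leaves.
ds = ParallelDataSet([
    LocalDataSet([1, 2, 3]),
    ParallelDataSet([LocalDataSet([4, 5]), LocalDataSet([6])]),
])
doubled = ds.map(lambda part: [2 * x for x in part])
total = doubled.sketch(SumSketch())

zipped = ds.zip(doubled)
pair_sums = zipped.map(lambda pair: [a + b for a, b in zip(*pair)])
zipped_total = pair_sums.sketch(SumSketch())
print(total, zipped_total)
```

Only `sketch` returns data to the caller; `map` and `zip` return new datasets with the same tree shape, so results stay where the data lives until they are summarized.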

49 DataSet Implementations [figure labels: Application, GUI, Client, Network, RMI layer, Proxy, Proxy ref, Parallel, Local, Dataset interface, Core parallelism, Rack aggregation, Cluster parallelism, Server 0, Server 1, Server n, Rack 0, Rack r, Address space, T]

50 Map(f) [figure labels: Proxy, Local, Parallel (before and after); f applied to each T yields an S]

51 Sketch(s) [figure labels: Proxy, Local, Parallel; s.Create applied to each T yields an R; s.Combine merges the R values]
    interface ISketch<T, R> {
        R Create(T data);
        R Combine(List<R> parts);
    }

52 Zip [figure labels: Proxy, Local, Parallel for a dataset of T and a dataset of S; result is a dataset of T,S]

53 Histograms: CDF, 2D histogram

54 Computing a histogram [figure labels: Client, Server 1, Server n; User click, Data range sketch, Histogram 1D + 2D composite sketch, Compute, Render, Display histogram; times t_r, t_h, t_a]
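The figure suggests two rounds: a data range sketch first, then a histogram sketch whose buckets are fixed by that range. The Python sketch below illustrates this for a 1D histogram only, run in a single process. The function names, the sample data, the equal-width buckets, and the requirement that every server holds at least one value are all assumptions made for the example.

```python
def range_create(part):
    """Per-server summary: (min, max) of the local values (part is non-empty)."""
    return (min(part), max(part))


def range_combine(parts):
    """Merge per-server ranges into the global range."""
    return (min(lo for lo, _ in parts), max(hi for _, hi in parts))


def hist_create(part, lo, hi, buckets):
    """Per-server summary: counts in `buckets` equal-width bins over [lo, hi]."""
    counts = [0] * buckets
    width = (hi - lo) / buckets or 1.0  # avoid a zero width when lo == hi
    for x in part:
        i = min(int((x - lo) / width), buckets - 1)  # hi goes in the last bin
        counts[i] += 1
    return counts


def hist_combine(parts):
    """Merge per-server histograms by adding counts bucket by bucket."""
    return [sum(column) for column in zip(*parts)]


# Pretend each inner list lives on a different server.
servers = [[1.0, 2.0, 9.0], [4.0, 5.0], [10.0, 0.0]]

# Round 1: data range sketch.
lo, hi = range_combine([range_create(p) for p in servers])
# Round 2: histogram sketch using the global range.
hist = hist_combine([hist_create(p, lo, hi, 5) for p in servers])
print((lo, hi), hist)
```

Each round sends back a result whose size depends on the number of buckets, not on the number of rows, which is what keeps the display computation bounded.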

55 Some numbers: Windows Server 2012 R2; 8-core 2.1GHz AMD Opteron 2373 EE; > 16GB RAM; 3 x 1TB disks using RAID-0; 155 machines; 5 racks; 1Gbps Ethernet

56 Null Sketch [plot axes: Machines, Time (ms)]

57 Histogram computation: 26M rows/machine, scale-out [plot axes: machines, Time (ms)]

58 Conclusions: Big data is here to stay. Better tools are needed. Quest for high-level abstractions for building distributed systems: execution graphs, distributed collections, higher-order transformations, distributed stateful objects, sketching algorithms.

59

60 Data-Parallel Computation [figure: software stacks by layer (Application, Language, Execution, Storage)]
Language: Sawzall, FlumeJava (Sawzall, Java); DryadLINQ, Scope (LINQ, SQL); Pig, Hive (≈SQL)
Execution: Map-Reduce; Dryad; Hadoop
Storage: GFS, BigTable; Cosmos, Azure, SQL Server; HDFS, S3

