Camdoop: Exploiting In-network Aggregation for Big Data Applications


1 Camdoop: Exploiting In-network Aggregation for Big Data Applications
Paolo Costa, joint work with Austin Donnelly, Antony Rowstron, and Greg O'Shea (MSR Cambridge)
Presented by: Navid Hashemi

2 MapReduce Overview
Map: processes input data and generates (key, value) pairs
Shuffle: distributes the intermediate pairs to the reduce tasks
Reduce: aggregates all values associated with each key
Paolo Costa Camdoop: Exploiting In-network Aggregation for Big Data Applications
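The three phases can be sketched as a minimal single-process word count; `map_task`, `shuffle`, and `reduce_task` are illustrative names, not an actual MapReduce API:

```python
from collections import defaultdict

def map_task(chunk):
    # Map: emit (key, value) pairs from the input records.
    for word in chunk.split():
        yield (word, 1)

def shuffle(all_pairs, num_reducers):
    # Shuffle: route each intermediate pair to a reduce task by key hash.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in all_pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_task(bucket):
    # Reduce: aggregate all values associated with each key.
    return {key: sum(values) for key, values in bucket.items()}

chunks = ["a b a", "b c"]
pairs = [p for chunk in chunks for p in map_task(chunk)]
results = [reduce_task(b) for b in shuffle(pairs, 2)]
merged = {k: v for r in results for k, v in r.items()}
# merged == {'a': 2, 'b': 2, 'c': 1}
```

Note that all the network traffic in a real deployment sits in the shuffle step, which is exactly what the following slides target.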

3 Problem
Shuffle phase is challenging for data center networks
Led to proposals for full-bisection bandwidth

4 Data Reduction
The final results are typically much smaller than the intermediate results
In most Facebook jobs the final size is 5.4% of the intermediate size
In most Yahoo jobs the ratio is 8.2%

5 Data Reduction
The final results are typically much smaller than the intermediate results
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?

6 Background: Combiners
To reduce the data transferred in the shuffle, users can specify a combiner function
− Aggregates the local intermediate pairs
Server-side only => limited aggregation
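A combiner is just a local pre-aggregation pass over a map task's output before it leaves the server; a minimal sketch (names are illustrative):

```python
from collections import defaultdict

def combiner(local_pairs):
    # Aggregate the local intermediate pairs before the shuffle,
    # so each key is sent at most once per map server.
    acc = defaultdict(int)
    for key, value in local_pairs:
        acc[key] += value
    return list(acc.items())

local_pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combiner(local_pairs)
# 4 pairs shrink to 2: [("a", 3), ("b", 1)]
```

Because the combiner only sees one server's pairs, duplicates across servers still cross the network, which is the "limited aggregation" point above.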

7 Distributed Combiners
It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners
− e.g., rack-level aggregation [Yu et al., SOSP'09]

8 Logical and Physical Topology
What happens when we map the tree to a typical data center topology?

9 Logical and Physical Topology
What happens when we map the tree to a typical data center topology?
Only 500 Mbps per child: the server link is the bottleneck
Full-bisection bandwidth does not help here
Mismatch between physical and logical topology: two logical links are mapped onto the same physical link

10 Camdoop Goal
Perform the combiner functions within the network, as opposed to application-level solutions
Reduce shuffle time by aggregating packets on path

11 How Can We Perform In-network Processing?
We exploit CamCube
− Direct-connect topology: 3D torus
− Uses no switches / routers for internal traffic
Nodes have (x,y,z) coordinates
This defines a key-space (=> key-based routing)
Coordinates are locally re-mapped in case of failures

12 How Can We Perform In-network Processing?
In CamCube, servers intercept, forward and process packets
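In a 3D torus every server has exactly six one-hop neighbors (one per directed axis), which is what later allows one disjoint tree per link; a quick sketch of the neighbor set for an n x n x n wrap-around grid:

```python
def neighbors(coord, n=3):
    # The six one-hop neighbors of a server in an n x n x n 3D torus:
    # +/-1 along each of the three axes, wrapping around modulo n.
    x, y, z = coord
    return [((x + dx) % n, (y + dy) % n, (z + dz) % n)
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0),
                               (0, 1, 0), (0, -1, 0),
                               (0, 0, 1), (0, 0, -1)]]

ns = neighbors((0, 0, 0), n=3)
# six distinct neighbors, e.g. (1,0,0) and its wrap-around (2,0,0)
```

The 27-server testbed described later is exactly this grid with n = 3.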

13 How Can We Perform In-network Processing?
Key Property: no distinction between network and computation devices
Servers can perform arbitrary packet processing on-path

14 How Can We Perform In-network Processing?
In CamCube:
− Keys are encoded using a 160-bit identifier
− Each server is responsible for a subset of them
− The most significant k bits are used to generate (x,y,z)
− The remaining (160 - k) bits are used to determine the neighbor
− If a server is unreachable => its identifiers map to its neighbors
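A key-to-coordinate mapping in this spirit might look like the sketch below; the field widths and bit layout are assumptions for illustration, not CamCube's exact encoding:

```python
def key_to_coord(key, n=3, key_bits=160):
    # Hypothetical sketch: carve three 8-bit fields out of the most
    # significant 24 bits of a 160-bit key, then fold each field onto
    # an n x n x n torus. Field widths are an illustrative assumption.
    top = key >> (key_bits - 24)   # 24 most significant bits
    x = (top >> 16) & 0xFF
    y = (top >> 8) & 0xFF
    z = top & 0xFF
    return (x % n, y % n, z % n)

coord = key_to_coord(0x123456 << (160 - 24), n=3)
# coord == (0, 1, 2): 0x12 % 3, 0x34 % 3, 0x56 % 3
```

The point of such a mapping is that routing toward a key reduces to routing toward a coordinate, hop by hop through the torus.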

15 Mapping a tree…
… on a switched topology: the 1 Gbps server link becomes the bottleneck (1/in-degree Gbps per child)
… on CamCube: packets are aggregated on path (=> less traffic), with a 1:1 mapping between logical and physical topology (1 Gbps per link)

16 Camdoop Design Goals
No change in the programming model
Exploit network locality
Good server and link load distribution
Fault-tolerance

17 Design Goal #1: Programming Model
Camdoop adopts the same MapReduce model
− GFS-like distributed file-system
− Each server runs map tasks on local chunks
− We use a spanning tree: combiners aggregate the local map task output and the children's results (if any) and stream the results to the parents
− The root runs the reduce task and generates the final output
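Because every stream is key-ordered, an on-path combiner can merge its children's streams with the local map output and emit aggregated pairs incrementally; a sketch assuming sorted (key, value) streams and a sum-style combiner (names are illustrative):

```python
import heapq
from itertools import groupby

def combine_streams(streams):
    # On-path combiner sketch: merge the key-ordered streams from the
    # children and the local map output, aggregate values per key, and
    # yield the combined stream upward to the parent.
    merged = heapq.merge(*streams)
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

local  = [("a", 1), ("c", 2)]
child1 = [("a", 2), ("b", 1)]
child2 = [("b", 3), ("c", 1)]
up = list(combine_streams([local, child1, child2]))
# up == [("a", 3), ("b", 4), ("c", 3)]
```

Streaming like this is also what lets the root start reducing as soon as it has one packet from every child, as the evaluation recap notes later.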

18 Design Goal #2: Network Locality
How do we map the tree nodes to servers?

19 Design Goal #2: Network Locality
Simple rule: map task outputs are always read from the local disk

20 Design Goal #2: Network Locality
The parent and children are mapped onto physical neighbors

21 Design Goal #2: Network Locality
This ensures maximum locality and optimizes network transfer

22 Network Locality
Logical view vs. physical view (3D torus): one physical link is used by one and only one logical link

23 Design Goal #3: Load Distribution

24 Design Goal #3: Load Distribution
With a single tree, only 1 Gbps is used per server (instead of 6)
Servers have different in-degrees => poor server load distribution

25 Design Goal #3: Load Distribution
Solution: stripe the data across disjoint trees
Different links are used => improves load distribution

26 Design Goal #3: Load Distribution
Solution: stripe the data across 6 disjoint trees
All links are used => (up to) 6 Gbps / server
Good load distribution
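One way to stripe the intermediate pairs across six disjoint trees is to pin each key to a tree by a deterministic hash, so aggregation for a key stays on a single tree while load spreads over all links; the assignment rule below is a hypothetical illustration, not Camdoop's actual scheme:

```python
import hashlib

def tree_for_key(key, num_trees=6):
    # Hypothetical striping rule: pick one of the disjoint trees by a
    # deterministic hash of the key. Every occurrence of a key follows
    # the same tree, so per-key aggregation remains correct.
    digest = hashlib.sha1(key.encode()).digest()
    return digest[0] % num_trees

keys = ["alpha", "beta", "gamma", "delta"]
assignment = {k: tree_for_key(k) for k in keys}
# each key is deterministically pinned to one of trees 0..5
```

Any deterministic, roughly uniform hash works here; what matters is that the mapping is consistent across all servers so a key's pairs always meet on the same tree.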

27 Design Goal #4: Fault-tolerance
The tree is built in the coordinate space
Link failure: using the CamCube key-based routing service, packets are rerouted along the new shortest path
Server failure:
− CamCube remaps the coordinates and the API notifies all servers
− On notification, the parent of vertex v sends a control packet containing the last key it received to the newly mapped server
− Each child of v resends all (key, value) pairs from the last key onwards
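Because the streams are key-ordered, recovery only needs the last key the parent saw; a sketch of the resend step on one child (function name and data shapes are illustrative):

```python
def resend_from(child_pairs, last_key):
    # Recovery sketch: the parent told the newly mapped server the last
    # key it received before the failure; each child replays its
    # key-ordered (key, value) stream from that key onwards.
    return [(k, v) for (k, v) in child_pairs if k >= last_key]

child = [("a", 1), ("c", 2), ("f", 1), ("k", 3)]
resent = resend_from(child, last_key="c")
# resent == [("c", 2), ("f", 1), ("k", 3)]
```

Replaying from the last key (inclusive) trades a small amount of duplicate work for a protocol that needs no per-pair acknowledgements.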

28 Evaluation: Testbed and Simulator
Testbed: 27-server CamCube (3 x 3 x 3)
− Quad-core Intel Xeon, 12 GB RAM
− 6 Intel PRO/1000 PT 1 Gbps ports
Simulator: packet-level simulator (CPU overhead not modelled)
− 512-server (8 x 8 x 8) CamCube

29 Evaluation: Design and Implementation Recap
Camdoop: shuffle & reduce parallelized
− The reduce phase is parallelized with the shuffle phase
− Since all streams are ordered, as soon as the root receives at least one packet from all children, it can start the reduce function
− No need to store the intermediate results to disk on the reduce servers

30 Evaluation: Design and Implementation Recap
Camdoop: shuffle & reduce parallelized
CamCube: six disjoint trees, in-network aggregation

31 Evaluation: Design and Implementation Recap
TCP Camdoop (switch): the 27 CamCube servers attached to a ToR switch; TCP is used to transfer data in the shuffle phase

32 Evaluation: Design and Implementation Recap
Camdoop (no agg): like Camdoop but without in-network aggregation (for non-commutative/associative functions)
Shows the impact of just running on CamCube

33 Validation against Hadoop & Dryad
Benchmarks: Sort (same size intermediate and output data) and WordCount (significant aggregation)
The Camdoop baselines are competitive against Hadoop and Dryad/DryadLINQ
Several reasons: shuffle and reduce parallelized; fine-tuned implementation

34 Validation against Hadoop & Dryad
We consider TCP Camdoop (switch) and Camdoop (no agg) as our baselines

35 Parameter Sweep
Output size / intermediate size (S)
− S = 1 (no aggregation): every key is unique
− S = 1/N ≈ 0 (full aggregation): every key appears in all map task outputs
The workload is distributed across the servers: intermediate data size is 22.2 GB (843 MB/server)
Number of reduce tasks (R)
− R = 1 (all-to-one): e.g., interactive queries, top-K jobs
− R = N (all-to-all): common setup in MapReduce jobs; N output files are generated

36 All-to-one (R=1)
[Plot: shuffle and reduce time (logscale) vs. output size / intermediate size (S), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop]

37 All-to-one (R=1)
The gap between TCP Camdoop (switch) and Camdoop (no agg) shows the impact of running on CamCube
With in-network aggregation (Camdoop), performance is independent of S

38 All-to-one (R=1)
[Same plot, annotated with the Facebook-reported aggregation ratio on the S axis]

39 All-to-all (R=27)
[Plot: shuffle and reduce time (logscale) vs. S, for TCP Camdoop (switch), Camdoop (no agg), Camdoop, and the switch 1 Gbps bound]

40 All-to-all (R=27)
Switch 1 Gbps (bound) marks the maximum theoretical performance over the switch
The plot again separates the impact of running on CamCube from that of in-network aggregation

41 Number of reduce tasks (S=0)
[Plot: time (logscale) vs. # reduce tasks R, from 1 (all-to-one) to 27 (all-to-all), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop; annotated speedups: 10.19x, 4.31x, 1.91x, 1.13x]

42 Number of reduce tasks (S=0)
Plot callouts: "Performance depends on R", "R does not (significantly) impact performance", and "Implementation bottleneck"

43 Number of reduce tasks (S=0)
All resources are used in the aggregation phase even when R = 1
The number of output files just becomes a configuration parameter

44 Behavior at scale (simulated)
N = 512, S = 0
[Plot: time (logscale) vs. # reduce tasks R (logscale), from 1 (all-to-one) to 512 (all-to-all), for switch 1 Gbps (bound) and Camdoop; up to 512x improvement]

45 Behavior at scale (simulated)
The switch 1 Gbps bound assumes full-bisection bandwidth

46 Experiment with failure
Without failure: a single edge in the tree maps to a single one-hop link in CamCube
With failure: a single edge becomes a multi-hop path in CamCube
The graph shows the shuffle and reduce time normalized against T (the time taken to run without failure)

47 Beyond MapReduce
The partition-aggregate model is also common in interactive services
− e.g., Bing Search, Google Dremel
[Diagram: web server and cache, parent servers, leaf servers]
Small-scale data (a good use-case for CamCube)
− 10s to 100s of KB returned per server
Typically, these services use one reduce task (R=1)
− A single result must be returned to the user
Full aggregation is common (S ≈ 0)
− There is a large opportunity for aggregation

48 Small-scale data (R=1, S=0)
[Bar chart: time (ms) for 20 KB and 200 KB of input data per server, comparing TCP Camdoop (switch), Camdoop (no agg), and Camdoop]

49 Small-scale data (R=1, S=0)
In-network aggregation can also benefit interactive services operating on small-scale data

50 Conclusions
Camdoop explores the benefits of in-network processing by running combiners within the network
− No change in the programming model
− Achieves lower shuffle and reduce time
− Decouples performance from the # of output files
A final thought: how would Camdoop run on the AMD SeaMicro, a 512-core cluster for data centers using a 3D torus with a fast interconnect (5 Gbps / link)?

51 Questions?

