Three Topics in Parallel Communications
Thesis presentation by Emin Gabrielyan, 2006-09-29


Slide 1: Three Topics in Parallel Communications
Thesis presentation by Emin Gabrielyan

Slide 2: Parallel communications: bandwidth enhancement or fault-tolerance?
- We do not know whether parallel communications were first used for fault-tolerance or for bandwidth enhancement.
- In 1964 Paul Baran proposed parallel communications for fault-tolerance (inspiring the design of ARPANET and the Internet).
- In 1981 IBM introduced the 8-bit parallel port for faster communication.

Slide 3: Bandwidth enhancement by parallelizing the sources and sinks
- Bandwidth enhancement can be achieved by adding parallel paths.
- But a greater capacity enhancement is achieved if we can replace the senders and destinations with parallel sources and sinks.
- This is possible in parallel I/O (the first topic of the thesis).

Slide 4: Parallel transmissions in coarse-grained networks cause congestion
- In coarse-grained circuit-switched HPC networks, uncoordinated parallel transmissions cause congestion.
- The overall throughput degrades due to access conflicts on shared resources.
- Coordination of parallel transmissions is covered by the second topic of the thesis (liquid scheduling).

Slide 5: Classical backup parallel circuits for fault-tolerance
- Typically the redundant resource remains idle.
- As soon as the primary resource fails, the backup resource replaces it.

Slide 6: Parallelism in living organisms
- Parallelism is observed in almost every living organism.
- Duplication of organs primarily serves fault-tolerance.
- As a secondary purpose, it serves capacity enhancement.

Slide 7: Simultaneous parallelism for fault-tolerance in fine-grained networks
- A challenging bio-inspired solution is to use all available paths simultaneously to achieve fault-tolerance.
- This topic is addressed in the last part of the presentation (capillary routing).

Slide 8: Fine Granularity Parallel I/O for Cluster Computers
SFIO, a Striped File I/O library

Slide 9: Why is parallel I/O required?
- A single I/O gateway for a cluster computer saturates.
- It does not scale with the size of the cluster.

Slide 10: What is parallel I/O for cluster computers?
- Some or all of the cluster's computers can be used for parallel I/O.

Slide 11: Objectives of parallel I/O
- Resistance to concurrent access
- Scalability as the number of I/O nodes increases
- High level of parallelism and load balance for all application patterns and all types of I/O requests

Slide 12: Parallel I/O subsystem: concurrent access by multiple compute nodes
- No concurrent-access overheads
- No performance degradation when the number of compute nodes increases

Slide 13: Scalable throughput of the parallel I/O subsystem
- The overall parallel I/O throughput should increase linearly as the number of I/O nodes increases.
[Chart: throughput of the parallel I/O subsystem vs. number of I/O nodes]

Slide 14: Concurrency and scalability = scalable all-to-all communication
- Concurrency and scalability (as the number of I/O nodes increases) can be represented by a scalable overall throughput when the number of compute and I/O nodes increases.
[Chart: all-to-all throughput vs. number of I/O and compute nodes]

Slide 15: High level of parallelism and load balance
- Balanced distribution across parallel disks must be ensured:
  - for all types of application patterns,
  - using small or large I/O requests,
  - with contiguous or fragmented I/O request patterns.

Slide 16: How is parallelism achieved?
- Split the logical file into stripes.
- Distribute the stripes cyclically across the subfiles.
[Diagram: a logical file striped cyclically across six subfiles, file1 to file6]
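
To make the cyclic distribution concrete, here is a minimal sketch of the stripe-to-subfile mapping (illustrative code, not the SFIO source; it assumes stripe i of the logical file goes to subfile i mod n, as in the diagram above):

    #include <stdio.h>

    /* Map a logical-file offset to (subfile index, offset inside the
       subfile) under cyclic striping. */
    void stripe_map(long offset, long stripe_size, int n_subfiles,
                    int *subfile, long *subfile_offset) {
        long stripe = offset / stripe_size;      /* global stripe index */
        *subfile = (int)(stripe % n_subfiles);   /* round-robin target */
        *subfile_offset = (stripe / n_subfiles) * stripe_size
                        + offset % stripe_size;  /* position in subfile */
    }

    int main(void) {
        int sf; long off;
        /* offset 13 with 5-byte stripes over 2 subfiles (see slide 18) */
        stripe_map(13, 5, 2, &sf, &off);
        printf("subfile %d, offset %ld\n", sf, off); /* subfile 0, offset 8 */
        return 0;
    }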

Slide 17: The POSIX-like interface of Striped File I/O
Using SFIO from MPI: a simple POSIX-like interface.

    #include <mpi.h>
    #include "/usr/local/sfio/mio.h"

    /* MPI startup/shutdown omitted, as on the original slide */
    int main(int argc, char *argv[])
    {
        MFILE *f;
        int r = rank();
        /* Collective open: two subfiles, stripe unit of 5 bytes */
        f = mopen("p1/tmp/a.dat;p2/tmp/a.dat;", 5);
        /* Each process writes 8 to 14 characters at its own position */
        if (r == 0) mwritec(f, 0,  "Good*morning!",  13);
        if (r == 1) mwritec(f, 13, "Bonjour!",        8);
        if (r == 2) mwritec(f, 21, "Buona*mattina!", 14);
        /* Collective close operation */
        mclose(f);
    }

Slide 18: Distribution of the global file data across the subfiles
Example with three compute nodes and two I/O nodes. The writes at offsets 0, 13 and 21 form the global file "Good*morning!Bonjour!Buona*mattina!"; with a 5-byte stripe unit it is cut into the stripes "Good*", "morni", "ng!Bo", "njour", "!Buon", "a*mat", "tina!", distributed cyclically:
- First subfile: Good* ng!Bo !Buon tina!
- Second subfile: morni njour a*mat

Slide 19: Impact of the stripe unit size on the load balance
- When the stripe unit size is large there is no guarantee that an I/O request will be well parallelized.
[Diagram: logical file, subfiles, and an I/O request]

Slide 20: Fine granularity striping with good load balance
- Low granularity ensures good load balance and a high level of parallelism.
- But it results in high network communication and disk access costs.
[Diagram: logical file, subfiles, and an I/O request]

Slide 21: Fine granularity striping is to be maintained
- Most HPC parallel I/O solutions are optimized only for large I/O blocks (on the order of megabytes).
- We focus instead on maintaining fine granularity.
- The problems of network communication and disk access cost are addressed by dedicated optimizations.

Slide 22: Overview of the implemented optimizations
- Disk access request aggregation (sorting, cleaning overlaps and merging)
- Network communication aggregation
- Zero-copy streaming between the network and fragmented memory patterns (MPI derived datatypes)
- A multi-block interface that efficiently optimizes application-related file and memory fragmentation (MPI-I/O)
- Overlapping of network communication with disk access in time (currently for write operations only)

Slide 23: Multi-block I/O request: disk access optimizations
- Sorting, cleaning the overlaps, merging
- Input: striped user I/O requests; output: an optimized set of I/O requests
- No data copy
- Example: 6 I/O access requests on a local subfile are merged into 2 accesses.
[Diagram: three request blocks on a local subfile merged into two contiguous accesses]
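
The aggregation step amounts to a classic sort-and-merge over byte ranges; a compact sketch (illustrative only: the struct and function names are assumptions, and the real library operates on its own request cache):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long off, len; } req_t;   /* one disk access request */

    static int by_offset(const void *a, const void *b) {
        long d = ((const req_t *)a)->off - ((const req_t *)b)->off;
        return (d > 0) - (d < 0);
    }

    /* Sort requests by offset, then merge overlapping or adjacent ones
       in place; returns the reduced number of accesses. */
    int aggregate(req_t *r, int n) {
        int m = 0;
        if (n == 0) return 0;
        qsort(r, n, sizeof *r, by_offset);
        for (int i = 1; i < n; i++) {
            if (r[i].off <= r[m].off + r[m].len) {   /* overlap or adjacent */
                long end = r[i].off + r[i].len;
                if (end > r[m].off + r[m].len)
                    r[m].len = end - r[m].off;       /* extend, drop overlap */
            } else {
                r[++m] = r[i];                       /* start a new access */
            }
        }
        return m + 1;
    }

    int main(void) {
        req_t r[] = {{30,5},{0,5},{5,5},{32,8},{20,5},{25,5}};
        printf("%d accesses\n", aggregate(r, 6));    /* 6 requests -> 2 */
        return 0;
    }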

Slide 24: Network communication aggregation without copying
- Striping across 2 subfiles
- Derived datatypes created on the fly
- Contiguous streaming from the application memory to the remote I/O nodes
[Diagram: fragments of the logical file in application memory gathered into two contiguous streams, one per remote I/O node]
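
In MPI terms, the on-the-fly gather can be expressed with an indexed derived datatype, so the scattered stripes are streamed without an intermediate copy. A minimal sketch (the function name and block layout are invented for illustration; SFIO's own mkbset plays this role):

    #include <mpi.h>

    /* Send all stripes destined to one I/O node as a single message,
       describing the scattered stripes with an MPI indexed datatype
       instead of copying them into a contiguous buffer. */
    void send_stripes(const char *buf, const int *lens, const int *displs,
                      int nblocks, int ionode) {
        MPI_Datatype dt;
        MPI_Type_indexed(nblocks, lens, displs, MPI_CHAR, &dt);
        MPI_Type_commit(&dt);
        MPI_Send(buf, 1, dt, ionode, 0, MPI_COMM_WORLD);
        MPI_Type_free(&dt);
    }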

Slide 25: SFIO library on the compute node: functional architecture
Legend: blue = interface functions; green = striping functionality; red = I/O request optimizations; orange = network communication and the relevant optimizations.
- bkmerge: overlap cleaning and aggregation
- mkbset: creates MPI derived datatypes on the fly
[Diagram: call graph from the interface functions (mread, mwrite, mreadc, mreadb, mwritec, mwriteb) through mrw (cyclic distribution), sfp_rflush/sfp_wflush, sfp_readc/sfp_writec, sfp_rdwrc (request caching), flushcache, sortcache, sfp_read/sfp_write, sfp_readb/sfp_writeb, bkmerge, mkbset and sfp_waitall, down to the commands SFP_CMD_WRITE, SFP_CMD_READ, SFP_CMD_BREAD, SFP_CMD_BWRITE sent via MPI to the I/O listener on the I/O node]

Slide 26: Optimized throughput as a function of the stripe unit size
- 3 I/O nodes, 1 compute node
- Global file size: 660 MB
- TNet network
- About 10 MB/s per disk

Slide 27: All-to-all stress test on the Swiss-Tx cluster supercomputer
- The stress test is carried out on the Swiss-Tx machine.
- 8 full-crossbar 12-port TNet switches
- 64 processors
- Link throughput is about 86 MB/s.

Slide 28: SFIO on the Swiss-Tx cluster supercomputer
- MPI-FCI
- Global file size: up to 32 GB
- Mean of 53 measurements for each number of nodes
- Nearly linear scaling with a 200-byte stripe unit!
- The network becomes a bottleneck above 12 nodes.

Slide 29: Liquid scheduling for low-latency circuit-switched networks
Reaching liquid throughput in HPC wormhole-switching networks and in optical lightpath-routing networks

Slide 30: Upper limit of the network capacity
- Given a set of parallel transmissions and a routing scheme,
- the upper limit of the network's aggregate capacity is its liquid throughput.

Slide 31: Distinction: packet switching versus circuit switching
- Packet switching has been replacing circuit switching since the 1970s (more flexible, manageable, scalable).
- But new circuit-switching networks are emerging (HPC clusters, optical switching).
- In HPC, wormhole routing targets extremely low latency requirements.
- In optical networks, packet switching is not yet possible due to the lack of technology.

Slide 32: Coarse-grained networks
- In circuit switching, large messages are transmitted entirely (coarse-grained switching).
- Low latency: the sink starts receiving the message as soon as the sender starts transmission.
[Diagram: a message source and sink compared under fine-grained packet switching and coarse-grained circuit switching]

Slide 33: Parallel transmissions in coarse-grained networks
- When nodes transmit in parallel across a coarse-grained network in an uncoordinated fashion, congestion may occur.
- The resulting throughput can be far below the expected liquid throughput.

Slide 34: Congestion and blocked paths in wormhole routing
- When a message encounters a busy outgoing port, it waits.
- The previously traversed portion of the path remains occupied.
[Diagram: three source-sink transmissions; one blocked message holds links that block the others]

Slide 35: Hardware solution in Virtual Cut-Through routing
- In VCT, when the outgoing port is busy the switch buffers the entire message.
- This requires much more expensive hardware than wormhole switching.
[Diagram: the blocked message is buffered inside the switch, freeing its path]

Slide 36: Other hardware solutions
- In optical networks, OEO (optical-electrical-optical) conversion can be used.
- This has a significant impact on cost (compared with memory-less wormhole switches and MEMS optical switches).
- It also affects the properties of the network (e.g. latency).

Slide 37: Application-level coordinated liquid scheduling
- Liquid scheduling is a software solution implemented at the application level.
- No investments in network hardware.
- Coordination between the edge nodes is required.
- Knowledge of the network topology is assumed.

Slide 38: Example of a simple traffic pattern
- 5 sending nodes (above), 5 receiving nodes (below)
- 2 switches
- 12 links of equal capacity
- The traffic consists of 25 transfers.

Slide 39: Round-robin schedule of an all-to-all traffic pattern
- First, all nodes simultaneously send a message to the node directly in front.
- Then, simultaneously, to the next node, and so on.

Slide 40: Throughput of the round-robin schedule
- The 3rd and 4th phases each require two timeframes.
- 7 timeframes are needed in total.
- Link throughput = 1 Gbps.
- Overall throughput = 25/7 × 1 Gbps ≈ 3.57 Gbps.

Slide 41: A liquid schedule and its throughput
- 6 timeframes of non-congesting transfers
- Overall throughput = 25/6 × 1 Gbps ≈ 4.17 Gbps.

Slide 42: The problem of liquid scheduling
- Build a liquid schedule for an arbitrary traffic of transfers.
- This is the problem of partitioning the traffic into a minimal number of subsets of mutually non-congesting transfers.
- A timeframe is such a subset of non-congesting transfers.

Slide 43: Definitions of our mathematical model
- A transfer is the set of links lying on the path of the transmission.
- The load of a link is the number of transfers in the traffic using that link.
- The most loaded links are called bottlenecks.
- The duration of the traffic is the load of its bottlenecks.
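
These definitions map directly onto code; a small sketch (my representation: each transfer is given as the list of link indices on its path):

    #define NLINKS 12   /* e.g. the 12-link example network of slide 38 */

    /* Fill per-link loads and return the duration of the traffic,
       i.e. the load of its most loaded (bottleneck) links. */
    int traffic_duration(const int transfers[][NLINKS], const int *path_len,
                         int n_transfers, int load[NLINKS]) {
        int duration = 0;
        for (int l = 0; l < NLINKS; l++) load[l] = 0;
        for (int t = 0; t < n_transfers; t++)
            for (int i = 0; i < path_len[t]; i++) {
                int l = transfers[t][i];            /* a link used by t */
                if (++load[l] > duration) duration = load[l];
            }
        return duration;  /* links with load == duration are bottlenecks */
    }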

Slide 44: Teams
- A team is a set of non-congesting transfers that together use all bottleneck links.
- The shortest possible time to carry out the traffic is the active time of the bottleneck links.
- The schedule must therefore keep the bottleneck links busy all the time.
- Hence the timeframes of a liquid schedule must consist of transfers using all bottlenecks.
[Diagram: two candidate subsets on the example network; one covers every bottleneck (a team), the other does not (not a team)]

Slide 45: Retrieval of teams without repetitions by subdivisions
- Teams can be retrieved without repetition by recursive partitioning.
- Choosing a transfer divides all teams into those using that transfer and those not using it.
- Each half can be similarly subdivided until individual teams are retrieved.
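
A self-contained sketch of this subdivision (my encoding, not the thesis's implementation: each transfer is a bitmask of the links it uses, so two transfers congest exactly when their masks intersect, and a team is a non-congesting subset covering all bottleneck links):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t xfer[32];      /* link mask of each transfer */
    static int n_xfers;
    static uint32_t bottlenecks;   /* mask of the bottleneck links */

    /* Binary subdivision: branch on excluding/including transfer `next`.
       Every team is emitted exactly once. */
    static void subdivide(int next, uint32_t team, uint32_t links_used) {
        if (next == n_xfers) {
            if ((links_used & bottlenecks) == bottlenecks)
                printf("team: %08x\n", team);      /* mask of its transfers */
            return;
        }
        subdivide(next + 1, team, links_used);     /* teams without `next` */
        if ((xfer[next] & links_used) == 0)        /* no congestion */
            subdivide(next + 1, team | (1u << next),
                      links_used | xfer[next]);    /* teams with `next` */
    }

    int main(void) {
        /* toy example: 3 transfers over 4 links, link 0 is the bottleneck */
        xfer[0] = 0x3; xfer[1] = 0x4; xfer[2] = 0x9;
        n_xfers = 3; bottlenecks = 0x1;
        subdivide(0, 0, 0);
        return 0;
    }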

Slide 46: Teams use all bottlenecks: retrieving the teams of the traffic skeleton
- Since teams must contain transfers using the bottleneck links,
- we can first create teams using only such transfers (the traffic skeleton).
[Chart: fraction of the traffic belonging to the skeleton]

Slide 47: Optimization by first retrieving the teams of the skeleton
- The skeleton optimization reduces the search space 9.5 times.
[Chart: speedup obtained by the skeleton optimization]

Slide 48: Assembling a liquid schedule from the retrieved teams
- Relying on efficient retrieval of full teams (subsets of non-congesting transfers using all bottlenecks),
- we assemble a liquid schedule by trying different combinations of teams together,
- until all transfers of the traffic are used.

Slide 49: Liquid schedule assembling optimizations (reduced traffic)
- Proved: if we remove a team from a traffic, new bottlenecks can emerge.
- The new bottlenecks add constraints on the teams of the reduced traffic.
- Proved: a liquid schedule can be assembled using teams of the reduced traffic (instead of constructing teams of the initial traffic from the remaining transfers).
- Proved: a liquid schedule can be assembled by considering only saturated full teams.

Slide 50: Liquid schedule construction speed with our algorithm
- 360 traffic patterns across the Swiss-Tx network
- Up to 32 nodes, up to 1024 transfers
- Our optimized construction algorithm is compared with a MILP method (optimized for discrete optimization problems).

Slide 51: Carrying real traffic patterns according to liquid schedules
- The Swiss-Tx supercomputer cluster network is used for measuring aggregate throughputs.
- Traffic patterns are carried out according to liquid schedules.
- We compare with topology-unaware round-robin or random schedules.

Slide 52: Theoretical liquid and round-robin throughputs of 362 traffic samples
- 362 traffic samples across the Swiss-Tx network, up to 32 nodes
- Traffic carried out according to a round-robin schedule reaches only 1/2 of the potential network capacity.

Slide 53: Throughput of traffic carried out according to liquid schedules
- Traffic carried out according to a liquid schedule practically reaches the theoretical throughput.

Slide 54: Liquid scheduling conclusions: application, optimization, speedup
- In HPC networks, large messages are copied across the network, causing congestion.
- Arbitrarily scheduled transfers yield throughput below the theoretical capacity.
- Liquid scheduling relies on knowledge of the network topology and reaches the theoretical liquid throughput of the network.
- Liquid schedules can be constructed in less than 0.1 s for traffic patterns with 1000 transmissions (about 100 nodes).
- Future work: dynamic traffic patterns and application to OBS (optical burst switching).

Slide 55: Fault-tolerant streaming with capillary routing
Path diversity and forward error correction (FEC) codes at the packet level

Slide 56: Structure of my talk
- The advantages of packet-level FEC in off-line streaming
- Solving the difficulties of real-time streaming with multi-path routing
- Generating multi-path routing patterns of various path diversity
- The level of path diversity and the efficiency of the routing pattern for real-time streaming

Slide 57: Decoding a file with digital fountain codes
- A file is divided into packets.
- A digital fountain code generates numerous checksum packets.
- A sufficient quantity of any checksum packets recovers the file.
- As when filling a cup, only collecting a sufficient amount of drops matters.
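
The principle can be illustrated with an LT-style encoder where every checksum packet is the XOR of a random subset of source packets (a toy sketch; real fountain codes draw the subset size from a carefully designed degree distribution and transmit the subset description, e.g. a PRNG seed, with each packet):

    #include <stdlib.h>
    #include <string.h>

    #define PKT 1024   /* packet size in bytes (illustrative) */

    /* One checksum packet = XOR of a random subset of the k source
       packets; any sufficiently large collection of such packets
       lets the receiver recover the file. */
    void fountain_encode(const unsigned char src[][PKT], int k,
                         unsigned char out[PKT]) {
        memset(out, 0, PKT);
        for (int i = 0; i < k; i++)
            if (rand() & 1)                /* toy subset choice */
                for (int b = 0; b < PKT; b++)
                    out[b] ^= src[i][b];   /* XOR the packet in */
    }

    int main(void) {
        unsigned char src[4][PKT] = {{0}}, out[PKT];
        fountain_encode(src, 4, out);      /* one checksum packet */
        return 0;
    }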

Slide 58: Transmitting large files without feedback across lossy networks using digital fountain codes
- The sender transmits checksum packets instead of the source packets.
- Interruptions cause no problems.
- The file is recovered once a sufficient number of packets is delivered.
- FEC in off-line streaming relies on time stretching.

Slide 59: In real-time streaming the receiver's playback buffering time is limited
- In off-line streaming the data can be held in the receiver buffer indefinitely.
- In real-time streaming the receiver is not permitted to keep data for long in the playback buffer.

Slide 60: Long failures on a single-path route
- If the failures are short, then by transmitting a large number of FEC packets the receiver can constantly hold a sufficient number of checksum packets in time.
- If a failure lasts longer than the playback buffering limit, no FEC can protect the real-time communication.

Slide 61: Applicability of FEC in real-time streaming by using path diversity
- Losses can be recovered by extra packets: received later (in off-line streaming), or received via another path (in real-time streaming).
- Reliable off-line streaming relies on time stretching; reliable real-time streaming is bounded by the playback buffer limit, so path diversity replaces time stretching.
[Diagram: time-stretching axis vs. path-diversity axis, placing off-line and real-time streaming]

Slide 62: Creating an axis of multi-path patterns
- Intuitively we imagine a path diversity axis, from single-path routing to multi-path routing.
- High diversity decreases the impact of individual link failures, but uses many more links, increasing the overall failure probability.
- We must study many multi-path routing patterns of different diversity in order to answer this question.

Slide 63: Capillary routing creates solutions with different levels of path diversity
- As a method for obtaining multi-path routing patterns of various path diversity we rely on the capillary routing algorithm.
- For any given network and pair of nodes, capillary routing produces, layer by layer, routing patterns of increasing path diversity.
- Path diversity = layer of capillary routing.

Slide 64: Capillary routing: introduction
- Capillary routing first offers a simple multi-path routing pattern.
- At each successive layer it recursively spreads out individual sub-flows of the previous layers.
- The path diversity develops as the layer number increases.
- The construction relies on linear programming (LP).

Slide 65: Capillary routing, first layer: reduce the maximal load of all links
- First take the shortest-path flow and minimize the maximal load of all links.
- This splits the flow over a few parallel routes.
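
This min-max step is a small linear program; a plausible formulation (notation mine, not copied from the thesis) for a unit flow from source $s$ to sink $t$:

$$ \min M \quad \text{s.t.} \quad \sum_{l \in \mathrm{out}(v)} f_l \;-\; \sum_{l \in \mathrm{in}(v)} f_l \;=\; \begin{cases} 1 & v = s \\ -1 & v = t \\ 0 & \text{otherwise,} \end{cases} \qquad 0 \le f_l \le M \ \text{ for every link } l, $$

where $f_l$ is the fraction of the unit flow carried by link $l$ and $M$ is the maximal link load being minimized.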

Slide 66: Capillary routing, second layer: reduce the load of the remaining links
- Identify the bottleneck links of the first layer.
- Then minimize the maximal load of the remaining links.
- Continue similarly, until the full routing pattern is discovered layer by layer.

Slide 67: Capillary routing layers
[Diagram: a single network with 4 routing patterns of increasing path diversity]

Slide 68: Application model: evaluating the efficiency of path diversity
To evaluate the efficiency of patterns with different path diversities we rely on an application model where:
- the sender uses a constant amount of FEC checksum packets to combat weak losses, and
- the sender dynamically increases the number of FEC packets in case of serious failures.
[Diagram: an FEC block consisting of source packets followed by redundant packets]

Slide 69: Strong FEC codes are used in case of serious failures
- When the packet loss rate observed at the receiver is below the tolerable limit (e.g. 3%), the sender transmits at its usual rate.
- When the packet loss rate exceeds the tolerable limit (e.g. 30%), the sender adaptively increases the FEC block size by adding more redundant packets.
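
As a back-of-the-envelope check (my arithmetic, not a figure from the slides): for an ideal code delivering $k$ source packets through a channel with packet loss rate $p$, the transmitted block must contain at least

$$ n(p) \ge \frac{k}{1 - p} $$

packets, so a 3% loss rate needs only $n \approx 1.03\,k$, while a 30% loss rate already requires $n \approx 1.43\,k$.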

Slide 70: Redundancy Overall Requirement (ROR)
The overall amount of dynamically transmitted redundant packets during the whole communication time is proportional:
- to the duration of communication and the usual transmission rate,
- to the frequency of single-link failures and their average duration, and
- to a coefficient characterizing the given multi-path routing pattern.

Slide 71: Equation for ROR: it depends only on the routing pattern r(l)
Where:
- FEC_r(l) is the FEC transmission block size in case of the complete failure of link l,
- r(l) is the load of link l for the given routing pattern, and
- FEC_t is the FEC block size of default streaming (tolerating loss rate t).
[Equation not preserved in the transcript]
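
From the legend above, a plausible reconstruction of the ROR equation (my reading of the definitions; the exact form and normalization in the thesis may differ) sums, over the failure of each individual link, the relative growth of the FEC block:

$$ \mathrm{ROR} \;=\; \sum_{l} \left( \frac{\mathrm{FEC}_{r(l)}}{\mathrm{FEC}_t} - 1 \right) $$

Each link $l$ contributes the extra redundancy needed while $l$ is down, which depends only on its load $r(l)$, hence only on the routing pattern.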

Slide 72: The ROR coefficient
- The smaller the ROR coefficient of a multi-path routing pattern, the better that pattern is for real-time streaming.
- By measuring the ROR coefficient of multi-path routing patterns of different path diversity, we can evaluate the advantages (or disadvantages) of diversification.
- Multi-path routing patterns of different diversity are created by the capillary routing algorithm.

Slide 73: ROR as a function of diversity
- ROR is plotted as a function of the capillarization level (layers 1 to 10).
- Each curve is an average over 25 different network samples (obtained from MANET).
- Curves are shown for static streaming tolerances from 3.3% to 7.5% (3.3%, 3.9%, 4.5%, 5.1%, 6.3%, 7.5%).
[Chart: average ROR rating vs. capillary routing layer, one curve per tolerance level]

Slide 74: ROR rating over 200 network samples
- ROR coefficients for 200 network samples; each section is the average over 25 samples.
- Network samples are obtained from a random-walk MANET.
- The path diversity obtained by capillary routing reduces the overall amount of FEC packets.

Slide 75: Conclusions
- Although strong path diversity increases the overall failure rate, it is beneficial for real-time streaming (except in a few pathological cases).
- Capillary routing patterns reduce the overall number of redundant packets required from the sender.
- In single-path real-time streaming, application of FEC at the packet level is almost useless.
- With multi-path routing patterns, real-time applications can gain great advantages from FEC.
- Future work: using an overlay network to achieve multi-path communication flows; considering coding also inside the network, not only at the edges; aiming also at energy saving in MANET.

Slide 76: Thank you!
Presented topics:
- Fine-grained parallel I/O for cluster computers
- Liquid scheduling of parallel transmissions in coarse-grained networks
- Capillary routing: fault-tolerance in fine-grained networks

