Performance Analysis of Daisy-Chained CPUs Based on Modeling. Krzysztof Korcyl, Jagiellonian University, Krakow; Radoslaw Trebacz, Jagiellonian University, Krakow. Second FutureDaq workshop, GSI
The chain [Figure: SOURCE (inter-packet times: Poisson, Uniform, or Fixed) with a FIFO, followed by Node 1, Node 2, ..., Node N-1, Node N, and then the Sink]
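The three source options named in the chain figure can be illustrated with a minimal sketch. The function name `inter_packet_times`, the `mean_us` parameter, and the range chosen for the uniform case (0 to twice the mean) are assumptions made for illustration; the slides do not give the distribution parameters.

```python
import random

def inter_packet_times(kind, mean_us, count, seed=0):
    """Return `count` inter-packet gaps in microseconds with mean `mean_us`."""
    rng = random.Random(seed)
    if kind == "fixed":
        return [mean_us] * count
    if kind == "uniform":
        # Assumed range: 0 to 2 * mean, so that the average gap is mean_us.
        return [rng.uniform(0.0, 2.0 * mean_us) for _ in range(count)]
    if kind == "poisson":
        # A Poisson arrival process has exponentially distributed gaps.
        return [rng.expovariate(1.0 / mean_us) for _ in range(count)]
    raise ValueError(f"unknown distribution: {kind}")

gaps = inter_packet_times("poisson", 12.0, 5)
```

Only the Poisson and Uniform sources produce bursts (gaps shorter than the mean); the Fixed source spaces packets evenly.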
Models of the nodes [Figure: a Selector with GBit In and GBit Out ports and FIFOs, connected over a 100 Gigabit link to a Processor with FIFOs and a Compute Engine (fixed processing time, size reduction for processed data); other labels: Busy, processingCount, 10Gigabit]
Operation Data packets are produced using one of the available distributions of inter-packet times (Poisson, Uniform, or Fixed) and stored in a FIFO. If the source output line is free and there is a packet in the FIFO, the packet is sent into the chain immediately. Between nodes, packets are transferred at one gigabit per second (no Ethernet framing, and no checks for packet loss on the line or for transmission errors). Packets arriving at the selector are stored in the input FIFO if there is space; otherwise they are dropped. If the selector’s local transfer medium is free and there is a packet in the input FIFO, the packet is tested to see whether it holds raw or processed data. A packet with raw data is sent to the local computing resource via the local transfer medium if the resource has credit to absorb it (every packet absorbed by the computing resource decrements its credit). If the computing resource has run out of credits, the packet is sent to the output FIFO. Packets with processed data arriving at a selector are sent to the output FIFO if the selector’s local transfer medium is free.
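The selector's routing rule described above can be sketched as follows. This is an illustrative reconstruction, not the authors' Ptolemy code: the class name `Selector`, the method names, and the string packet tags `"raw"` and `"processed"` are all hypothetical, and timing is left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Selector:
    credits: int                  # packets the local processor can still absorb
    fifo_size: int = 10           # chain input FIFO capacity, in packets
    input_fifo: deque = field(default_factory=deque)
    dropped: int = 0

    def on_arrival(self, packet):
        """Store an arriving packet, or drop it when the input FIFO is full."""
        if len(self.input_fifo) < self.fifo_size:
            self.input_fifo.append(packet)
        else:
            self.dropped += 1

    def dispatch(self):
        """Route the head packet; call only when the local medium is free."""
        if not self.input_fifo:
            return None
        packet = self.input_fifo.popleft()
        if packet == "raw" and self.credits > 0:
            self.credits -= 1     # the processor absorbs the packet
            return "to_processor"
        return "to_output_fifo"   # processed data, or raw data with no credit

sel = Selector(credits=1)
for tag in ["raw", "raw", "processed"]:
    sel.on_arrival(tag)
routes = [sel.dispatch() for _ in range(3)]
```

With a single credit, the first raw packet goes to the processor and the second is forwarded unprocessed down the chain, which is how raw data reaches later nodes.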
Operation (cont.) A packet with raw data arriving at the computer output FIFO decrements the credit count of the resource and is sent to the resource at 100 Gbit/s. The computing resource starts processing the data when the transfer over the link finishes. After the processing time, the raw data is converted into processed data and its size is reduced. Processed data is returned to the selector at 100 Gbit/s and stored in the selector’s computer input FIFO. If the selector’s local transfer medium is free, the processed data packet is sent to the selector’s output FIFO. Transfer of processed data has priority over sending raw data; however, a transfer already in progress is not interrupted. When the processed data arrives at the selector’s output FIFO, the resource’s credit count is incremented. If the output line is free, the packet from the selector’s output FIFO is sent onto the line immediately.
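As a rough worked example of the sequence above, the time a credit stays consumed for one packet can be bounded from below by adding the transfer to the processor, the processing time, and the return transfer. The function name `credit_hold_time_ns` is hypothetical, and the sketch assumes no queueing and ignores the time spent on the selector's local transfer medium, whose speed the slides do not state.

```python
LOCAL_LINK_BITS_PER_NS = 100  # 100 Gbit/s link between selector and processor
REDUCTION = 0.5               # processed size = raw size * 0.5

def credit_hold_time_ns(raw_bytes, processing_us):
    """Lower bound on the time one packet keeps a credit consumed, in ns."""
    raw_bits = raw_bytes * 8
    processed_bits = raw_bits * REDUCTION
    to_processor = raw_bits / LOCAL_LINK_BITS_PER_NS    # raw data out
    back = processed_bits / LOCAL_LINK_BITS_PER_NS      # processed data back
    return to_processor + processing_us * 1000 + back

hold = credit_hold_time_ns(1500, 12)   # 120 ns + 12000 ns + 60 ns
```

For a 1500-byte packet the two link transfers add only 180 ns, so under these assumptions the processing time dominates the credit turnaround.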
Parameters Raw data size: 1500 bytes; processed data size: raw size * 0.5. Processing time: 10, 12, 24, 36, 48, 60, 72, 84, 96, 108 µs. Selector’s FIFO size: 10 packets on chain input and output, 1 packet for computer input and output. Delay time: 12, 24, 32 µs. Ethernet transfer speed: 1 ns/bit, so 1500 B = 12 µs. [Table, labels: Number of nodes; Minimal number of processors; Processing time (columns: 10, 12, 24, 36, ..., 84, 96, 108 µs); Delay time (rows: 12 µs and two further values); cell values not preserved]
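The transfer-time arithmetic in the parameters can be checked with a short sketch. The second function is a back-of-envelope estimate that is not taken from the slides: it assumes the delay time is the spacing between source packets and that each processor handles one packet at a time, so it gives only a lower bound, not the simulated result. Both function names are hypothetical.

```python
NS_PER_BIT = 1  # chain link speed from the parameters: 1 ns/bit

def transfer_time_us(size_bytes):
    """Serialization time of one packet on the chain link, in microseconds."""
    return size_bytes * 8 * NS_PER_BIT / 1000

def min_processors(processing_us, delay_us):
    """Lower bound on processors: ceil(processing time / packet spacing)."""
    return -(-processing_us // delay_us)  # integer ceiling division

raw_us = transfer_time_us(1500)        # 12.0 µs for a raw packet
processed_us = transfer_time_us(750)   # 6.0 µs for a processed packet
bound = min_processors(108, 12)        # at least 9 processors
```

Bursty sources (Poisson or Uniform) would be expected to need more processors than this bound, since packets can arrive closer together than the mean spacing.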
Minimal chain length to process all
Average CPU usage
Minimal chain length to process all
Non-processed data vs chain length
Observations The link between the source and the first selector derandomizes packet arrivals. Lack of proper derandomization results in a longer chain (poorer CPU utilization) being needed to absorb bursts; a smaller message size with fixed processing time allows for bursts.
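The derandomizing effect of a serial link can be sketched as follows: a packet cannot start crossing the link until the previous one has finished, so packets leave at least one transfer time apart however bursty the arrivals are. The function name `link_departures` is hypothetical, and the sketch assumes an unbounded FIFO in front of the link.

```python
def link_departures(arrivals_us, transfer_us=12.0):
    """Times at which each packet finishes crossing a FIFO-fed serial link."""
    done = []
    link_free_at = 0.0
    for arrival in arrivals_us:
        start = max(arrival, link_free_at)  # wait while the link is busy
        link_free_at = start + transfer_us
        done.append(link_free_at)
    return done

# A burst of three packets within 2 µs, then a late packet.
out = link_departures([0.0, 1.0, 2.0, 40.0])
# out == [12.0, 24.0, 36.0, 52.0]
```

With a smaller message the transfer time shrinks (6 µs for 750 bytes at 1 ns/bit), so packets can reach the first selector closer together while the processing time stays fixed, which matches the remark about bursts above.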
Observations The CPUs located far from the source receive mostly processed packets rather than raw data. Instead of processing, they relay processed data from the input port to the output port. The closer a CPU is to the sink, the poorer its utilization.
Possible modifications (for evaluation with modeling) Add more sources and more sinks along the chain. Resend non-processed data into the chain again. Use buffering on the selector nodes more efficiently: keep data on board and delay the decision on forwarding.
About the model Runs with Ptolemy Classic. All nodes (Source, Selector, Processor and Sink) are coded within 200 lines of C++. 100k events are simulated in 3 minutes on a 1.5 GHz Pentium 4.