Ishikawa Laboratory UNIVERSITY OF TOKYO Stage-Distributed Time-Division Permutation.


1 Ishikawa Laboratory, UNIVERSITY OF TOKYO
http://www.k2.t.u-tokyo.ac.jp/ http://www.k2.t.u-tokyo.ac.jp/index-e.html
Stage-Distributed Time-Division Permutation Routing in a Multistage Optically Interconnected Fabric
Alvaro Cassinelli (1), Makoto Naruse (2), Alain Goulet (1), and Masatoshi Ishikawa (1)
(1) University of Tokyo, Dept. Information Physics and Computing, 7-3-1 Hongo Bunkyo-ku, Tokyo 113-0033, Japan.
(2) Communications Research Laboratory, 4-2-1 Nukui-kita, Koganei, Tokyo 184-8795, Japan.
[Figure: multistage optical hypercube; processor arrays; labels X, Y, W, Z]

2 PLAN of the presentation
I. Introduction: space-domain optical switching fabrics
II. Column-Control in Multistage Interconnection Networks (CCMINs)
III. Folded Optical Implementation of a transparent CCMIN
IV. Packet switching in a buffered CCMIN ("new")
V. Conclusion and Further Research
VI. Some References

3 I. Introduction: the problem under study
How to design an efficient optical switching fabric for addressing:
1) the processor-memory bottleneck in supercomputers;
2) the router bottleneck in the Next Generation Optical Internet.
These problems have some similarities: low latency required, synchronization, high bandwidth…
Traffic characteristics change: synchronous/asynchronous, regular/arbitrary request patterns, fixed/variable length of data bursts (granularity).
In fact, the above problems are case studies among a continuum of situations…

4 I. Introduction: optics inside routers
[Figure: scheme of a router: controller, input interface, output interface, switching fabric]
Where optics?
interconnecting router subsystems;
at the (unbuffered) switching fabric (OXC);
at the interfaces and controller ("all-optical routing").
This presentation concerns: SPACE-DOMAIN OPTICAL SWITCHING FABRICS

5 II. Column-Control in Multistage Interconnection Networks
II.1 Multistage Interconnection Networks
II.2 Column-Control in MINs
II.3 Permutation Capacity of CCMIN
II.4 Unbuffered CCMIN for permutation routing

6 II.1 Multistage Interconnection Networks
"Basic" switching fabric: Full-Crossbar (XC)
Pros: wide-sense non-blocking; low latency.
Cons: O(N^2) complexity (using 2x2 switches); simultaneous switching noise; central controller bottleneck; poor modularity.
Circuit switching: good for low-latency memory-processor communications.
Packet switching: maximum throughput of 63% without buffers (uniform traffic).
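The 63% figure can be reproduced with a short calculation. The sketch below assumes the usual textbook model (not stated on the slide): every input issues a request with probability `load`, destinations are uniform, and each output serves one of its contending requests. The function name is illustrative.

```python
import math

def crossbar_acceptance(n_ports: int, load: float = 1.0) -> float:
    """Fraction of requests accepted by an unbuffered N x N crossbar
    under uniform traffic (assumed model: one winner per output)."""
    if load == 0:
        return 1.0
    # An output is idle only if none of the N inputs requests it.
    busy = 1.0 - (1.0 - load / n_ports) ** n_ports
    return busy / load

# At full load the value approaches 1 - 1/e (about 0.632) as N grows.
for n in (2, 8, 64, 1024):
    print(n, round(crossbar_acceptance(n), 4))
print("limit", round(1.0 - math.exp(-1.0), 4))
```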

7 II.1 Multistage Interconnection Networks
Alternative architecture: Multistage Interconnection Network (MIN)
…It still has point-to-point full connectivity (and is "self-routing").
Pros: full point-to-point connectivity; O(N·log2 N) complexity; distributed routing possible; fault tolerance possible (re-routing); easier repair thanks to modularity.
Cons: internal blocking; large optical losses; large crosstalk.

8 II.2 Column-Control in MINs
Column-control simplifies hardware and control: each stage becomes a "stage-global switch" driven by column-control lines…
Nice: the CCMIN is still capable of point-to-point connectivity.
Two-state "global" switches with long-range interconnections are suited for optical implementation (free-space, guided-wave).
Possible physical merge of active switching and passive interconnection.

9 II.3 Permutation Capacity of CCMIN
…if blocking was a problem for a MIN…
…things are much worse for the CCMIN.
[Figure labels: "global-stage" blocking; local blocking]
As a consequence of "global-stage" blocking, the permutation capacity of the CCMIN is extremely reduced.
However…

10 II.3 Permutation Capacity of CCMIN
Requests are serviced by circuit switching (or by on-the-fly packet switching).
Input requests are independent Bernoulli trials (parameter: the input request probability per unit time).
Uniform traffic: equal probability of requesting any output port.
[Plot, 64x64 network: probability of request acceptance (0 to 1) vs. input request probability per unit time (0 to 1); curves: crossbar, standard MIN, CCMIN]
Crossbar: tends to 63% when N → ∞, because of HOL blocking.
Standard MIN and CCMIN: both tend to 0 when N → ∞.
CCMIN cannot be used to service arbitrary requests in a circuit-switched manner!
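The trend for the standard MIN curve can be illustrated with the classic stage-by-stage independence approximation for unbuffered networks of 2x2 switches. This is a textbook model brought in for illustration, not necessarily the one used to draw the slide's curves, and it does not cover the CCMIN; `min_acceptance` is a hypothetical helper name.

```python
def min_acceptance(n_stages: int, load: float) -> float:
    """Approximate acceptance probability of an unbuffered MIN built from
    2x2 switches, with uniform traffic and independent links (assumed)."""
    if load == 0:
        return 1.0
    p = load
    for _ in range(n_stages):
        # A switch output is used unless neither of its two inputs asks for it.
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return p / load

# A 64x64 network has log2(64) = 6 stages; deeper networks accept less.
for stages in (3, 6, 10, 20):
    print(stages, round(min_acceptance(stages, 1.0), 3))
```

Each extra stage lowers the rate of surviving requests, which is why acceptance drifts towards zero as N grows.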

11 II.4 Unbuffered CCMIN for permutation routing
Reduced permutation capacity may still be useful for synchronous "permutation routing" in parallel computers (*).
4-D hypercube-connected multiprocessor…
Synchronous, weakly connected parallel computer (processors use the same permutation per time slot).
[Figure: 16 processors (1, 2, 3, 4, …, 16) of a 4-D hypercube and the equivalent multistage network with stages C1, C2, C3, C4]
(*) issue well studied in the past on "standard" blocking MINs

12 III. Folded Optical Implementation of a transparent CCMIN
III.1 Designing a CCMIN for circuit-switched permutation routing
III.2 "Folded" Optical Implementation
III.3 Experimental Demonstration
III.4 Possible applications

13 III.1 Designing a CCMIN for circuit-switched permutation routing
3-stage CC-"Baseline Network" with stage permutation sets {c3, id}, {c2, id}, {c1, id}.
Number of permutations: 2^n (n=3). These are {c3, id} x {c2, id} x {c1, id}.
These are just the required permutations to implement the (3D) hypercube!
A multistage version of most parallel-computer direct-network topologies (hypercube, cube-connected-cycles, de Bruijn, etc.) can be implemented as a CCMIN with properly designed inter-stage permutation modules.
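A small enumeration makes the 2^n count concrete. The sketch assumes the standard definition of the cube permutation (c_k complements bit k of the line address, bits numbered from 1); the function names are illustrative.

```python
from itertools import product

def cube(k: int, x: int) -> int:
    """Cube permutation c_k: complement bit k (1-indexed) of address x."""
    return x ^ (1 << (k - 1))

def ccmin_permutations(n: int) -> set:
    """All permutations obtained when stage k is set to either c_k or id."""
    perms = set()
    for choice in product((False, True), repeat=n):
        perm = []
        for x in range(1 << n):
            y = x
            for k, use_cube in enumerate(choice, start=1):
                if use_cube:
                    y = cube(k, y)
            perm.append(y)
        perms.add(tuple(perm))
    return perms

perms = ccmin_permutations(3)
print(len(perms))  # 2**3 distinct permutations, each of the form x -> x XOR mask
```

Every hypercube link (x, c_k(x)) appears in the permutation that uses c_k alone, which is why this set is enough to emulate the hypercube one dimension at a time.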

14 III.2 "Folded" Optical Implementation
Multistage Interconnection Network architecture + dense & efficient 3D folded inter-stage optical interconnects: Optical Multistage Architecture Paradigm (fixed interconnections).
Plane implementation: electronic; planar lightwave circuit (PLC).
3D implementation: free space; guided-wave.

15 III.2 "Folded" Optical Implementation
Guided-wave (fiber-based) modules vs. free-space
Wave-guide arrays for fixed, point-to-point and space-variant interconnections are an interesting alternative to free-space optics:
fixed, no broadcast: optical fiber is ok;
better efficiency (and, just like free-space optics, no cross-talk in 3D);
no space-invariance imposed;
precise and robust alignment possible;
theoretically more volume efficient than the free-space counterpart;
"hard" to build? Not fundamentally difficult (can be automated, permutation decomposition possible);
alignment of output and input;
power dissipation: fundamental limit very far off compared with electronics.
[Figure: prototype fiber module (fibers and holders), input and output; "integrated" 2D folded perfect shuffle permutation module]
(slide not shown in main presentation)

16 Prototype (non-integrated) 4x4 fiber module
Two holder prototypes: Zirconium, SiO2
Pitch: 250±5 µm
Multimode graded index fibers: NA=0.21 (core 50 µm, cladding 126 µm)
Transmission loss: 3 dB/km
Input (VCSEL 854±4 nm), Output (CCD)
[Figure labels: input, output]
(slide not shown in main presentation)

17 III.2 Multiple-permutation module
Besides density, reduced crosstalk and optical efficiency, there is another nice feature of the guided-wave approach to plane-to-plane optical interconnections…
Interleaving multiple permutations is possible: a 3D bi-permutation module built by stacking planar lightwave circuits (for instance).
A small mechanical/optical perturbation produces a drastic change of the interconnection pattern.
Multi-permutation modules as CCMIN's "global-stage" switches (*)
[Figure labels: input, output]
(*) and not only CCMIN's stage permutations

18 Cube Permutations for N=2^n
Cube permutation c_k: $\{b_n, \dots, b_{k+1}, b_k, b_{k-1}, \dots, b_2, b_1\} \mapsto \{b_n, \dots, b_{k+1}, \overline{b_k}, b_{k-1}, \dots, b_2, b_1\}$
Unfolded: (example with n=4) [Figure: c1, c2, c3, c4]
Folded: if k ≤ n/2, c_k exchanges only rows; if k > n/2, c_k exchanges only columns. The modules are just the same, rotated.
(slide not shown in main presentation)
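The row/column rule can be checked numerically. The sketch below assumes one particular folding, which is only one of several possible plane mappings: the low n/2 address bits give the row index and the high n/2 bits give the column index. `fold` and `moved_coordinate` are hypothetical helper names.

```python
def cube(k: int, x: int) -> int:
    """Cube permutation c_k: complement bit k (1-indexed) of address x."""
    return x ^ (1 << (k - 1))

def fold(x: int, n: int):
    """Map a linear address onto a 2D grid (assumed folding convention)."""
    half = n // 2
    row = x & ((1 << half) - 1)  # low half of the address bits
    col = x >> half              # high half of the address bits
    return row, col

def moved_coordinate(k: int, n: int) -> str:
    """Report which grid coordinate c_k changes, checked over all addresses."""
    kinds = set()
    for x in range(1 << n):
        row0, _ = fold(x, n)
        row1, _ = fold(cube(k, x), n)
        kinds.add("row" if row0 != row1 else "col")
    assert len(kinds) == 1  # c_k never mixes row and column moves
    return kinds.pop()

for k in range(1, 5):
    print(k, moved_coordinate(k, 4))
```

With this convention c_1 and c_2 swap whole rows and c_3 and c_4 swap whole columns on the 4x4 grid, so the column modules are the row modules rotated by 90 degrees.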

19 III.3 Experimental Demonstration
Prototype implementation using optical fibers.
Unfolded hypercube and identity permutations (lines 1, 2, 3, 4, …, 16) → plane mapping ("folding") (*) → row-column folded bi-permutation module.
(*) not unique!

20 III.3 Experimental Demonstration
"Spanned" 4D hypercube (uses four bi-permutation modules: c1, c2, c3, c4).
Four-dimensional hypercube-connected multiprocessor… …topology is mapped on a plane (processors interconnected through a 2D optical "socket", or lying in a VLSI chip matrix).
(slide not shown in main presentation)

21 III.3 Experimental Demonstration
Validation of simple cascaded architecture: input (VCSEL array) → {c1, id} → {c2, id} → output (CCD camera).
Inter-module coupling efficiency: 1.7 dB (no additional optics, matching oil or antireflection coating).
Alignment tolerance: ±5 µm (half peak power).
Commutation pitch: 125 µm.
[Figure labels: exit of first module; input of second module]
(slide not shown in main presentation)

22 III.3 Experimental Demonstration
Visualization of 2D permutation switching using a pair of modules (one module selects C1 or Id, the other C2 or Id).
Selected permutation products: Id x Id, C1 x C2, Id x C2, C1 x Id.
[Figure: input (VCSEL array) and output (CCD) images for each product]

23 III.3 Demonstration: electromechanical actuator
X-Y electro-magnetic actuated device.
Resonant frequency = 430 Hz (±62.5 µm).
(The module can vibrate in both X and Y directions; in principle, permutation interleaving is possible in both directions.)
(Micro electro-mechanical actuators (MEMS) may also be an interesting alternative when switching latency in the millisecond range is tolerable.)

24 III.3 Demonstration: electromechanical actuator
Resonant-frequency: round-robin permutation scheduling.
[Figure: time axis divided into time slots, cycling through Interconnect 1, Interconnect 2, Interconnect 3, …, Interconnect N]
(slide not shown in main presentation)
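Round-robin scheduling of this kind is easy to express in code. The following is a minimal sketch with hypothetical names; the interconnect labels stand for whatever permutations the resonant module sweeps through.

```python
from itertools import cycle, islice

def round_robin_schedule(interconnects, n_slots):
    """Interconnect active in each of the first n_slots time slots."""
    return list(islice(cycle(interconnects), n_slots))

def slots_until_served(period: int, current_index: int, wanted_index: int) -> int:
    """Time slots a request waits until its interconnect comes round again."""
    return (wanted_index - current_index) % period

print(round_robin_schedule(["I1", "I2", "I3"], 5))
print(slots_until_served(period=4, current_index=3, wanted_index=1))
```

Because the order is fixed in advance, no arbitration is needed; the price is a worst-case wait of one full cycle minus one slot.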

25 III.3 Demonstration: electromechanical actuator
Input: slow row/column scan of VCSEL array.
No electromagnetic actuation: fixed Identity permutation.
Electromagnetic actuation: Identity & Cube 2 permutations alternate at 860 Hz.
(slide not shown in main presentation)

26 III.3 Demonstration: electromechanical actuator
Input: 635 nm laser modulated at 500 MHz. Output: high-speed photodetector.
[Figure: oscilloscope traces of actuator position and photodetector signal; marked interval 200 µs]
Switching latency between interconnections ≈ 0.96 ms (*)
Time slot (3 dB) ≈ 200 µs
With a 10 Gb/s optical link, the burst size is 2 Mbit per channel (every millisecond): average bandwidth of 2 Gb/s per channel.
(*) MEMS routers: ms range.
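The burst figures follow from simple arithmetic. The sketch below assumes a 200 µs time slot, which is the value that reproduces the quoted 2 Mbit burst at 10 Gb/s, and a repetition period of one millisecond as stated on the slide; the helper names are illustrative.

```python
def burst_bits(link_rate_bps: int, slot_us: int) -> int:
    """Bits that fit in one time slot of slot_us microseconds."""
    return link_rate_bps * slot_us // 1_000_000

def average_rate_bps(bits_per_burst: int, period_us: int) -> int:
    """Average rate when one burst is sent every period_us microseconds."""
    return bits_per_burst * 1_000_000 // period_us

LINK_RATE_BPS = 10_000_000_000  # 10 Gb/s optical link (from the slide)
SLOT_US = 200                   # assumed time slot: 200 microseconds
PERIOD_US = 1_000               # one burst "every millisecond"

bits = burst_bits(LINK_RATE_BPS, SLOT_US)
print(bits)                               # 2000000 bits, i.e. 2 Mbit
print(average_rate_bps(bits, PERIOD_US))  # 2000000000 bit/s, i.e. 2 Gb/s
```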

27 III.4 Possible applications of an optical CCMIN
Possible computing applications:
The present system is not usable for typical memory-processor communications, which require low latencies (< 100 ns), unless other switching hardware is used (acousto-optic cells: µs range / electro-optical material: ns range).
If processing time is large (so that slow switching latency is tolerable) and the "burst" of data is large, the electromechanical system may be used (FFT, large database retrieval, ?…).
Communication networks:
Burst switching at the WAN level (ms range reconfiguration times).
Scientific-dedicated, transparent networks with long holding times and high bandwidth (TransLight, GLIF). MEMS switches are currently used (reconfiguration times in the range of a second are ok). An optical GSMIN may be used to regularly provide interconnection configurations.
If switching time is reduced, it can be used to perform cyclic permutation scheduling in a virtual output queued (VOQ) switch, leading to 100% throughput (Stanford "Tiny-Tera Switch").

28 …Slow switching may be okay
[Figure: time axis on which burst interconnects (Interconnect 1, Interconnect 2) alternate with computation]
Computation, one stage (e.g. 1 ms) = interconnection switching interval (e.g. 1 ms).
Burst interconnection within a "short" time slot (e.g. 10 Gbps, 100 ns → 1 kbit).
(slide not shown in main presentation)

29 IV. Packet switching in a buffered CCMIN
IV.1 Buffering in blocking networks
IV.2 FIFO Buffered CCMIN architecture
IV.3 Performance evaluation
IV.4 Delay-line "buffered" architecture

30 IV.1 Buffering for packet switching
Blocking is a serious drawback for circuit switching… …less serious for packet switching.
Unbuffered networks (even wide-sense non-blocking ones) suffer from HOL blocking: buffering is unavoidable.
Buffering is a solution adopted in "usual" MINs…
Input queues, output queues, virtual output queues and internal buffering have been explored in crossbars as well as in MINs. However, an advantage of buffered MINs over buffered crossbars is that the stage-distributed switching marries well with the distribution of buffering (thus avoiding large buffers).
…how much is a CCMIN improved by buffering?

31 IV.2 FIFO Buffered CCMIN architecture
[Figure: input → stages separated by inter-stage FIFO buffers → output]
Why may this architecture compare well with "standard" buffered MINs?
For uniform traffic, at each stage half of the packets wait and half pass: individual switch/buffer control is, presumably, not really required…
What's more: arbitration for configuring the Global Switches may not be necessary at all!

32 IV.3 Performance: global control vs. local control
Seven-stage, 128x128 input/output fabrics (rem: inter-stage transfer with maximum speed-up equal to the size of the buffer).
[Plot: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1), for buffer sizes 0 to 6; curves: crossbar, standard MIN, Global Switched MIN]
GSMIN performance evolves quicker with buffer size.
For buffer size = 3 packets, performance is better than the crossbar.
For buffer size = 5 packets, performance is equivalent.
Performance of the Global Switched MIN compares very well with that of a standard MIN.

33 IV.3 Performance: global control with blind alternation
"Blind" switch alternation of a GSMIN
[Plot: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1), for buffer sizes 0 to 6; curves: crossbar, "fair" switching, "blind" alternation]
As expected, "blind" alternation of switch states gives the same performance as a "fair" switch selection (for uniform traffic).
This is very interesting, because it means that a standard MIN can be operated "blindly" if traffic is uniform enough.
The interconnection scheduling bottleneck (CLOS, etc.) is eliminated by using a Time-Division Permutation Routing strategy.
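Blind alternation can be sketched as a slot-by-slot simulation. The model below is deliberately simplified and rests on several assumptions that are not taken from the slides: stage s offers either the identity or the cube permutation on bit s, the state flips every slot with an arbitrary phase per stage, each FIFO forwards at most one packet per slot (no speed-up), and packets arriving at a full first buffer are rejected. It is not the simulator behind the plotted curves.

```python
import random
from collections import deque

def simulate_blind_gsmin(n_stages=3, buffer_size=2, load=0.5, slots=2000, seed=0):
    """Simplified FIFO-buffered, globally switched MIN with blind alternation.

    Returns (offered, accepted, delivered, misrouted) packet counts.
    """
    rng = random.Random(seed)
    n_lines = 1 << n_stages
    # queues[s][line] holds destinations of packets waiting in front of stage s.
    queues = [[deque() for _ in range(n_lines)] for _ in range(n_stages)]
    offered = accepted = delivered = misrouted = 0
    for t in range(slots):
        # Serve the last stage first so a packet advances one stage per slot.
        for s in reversed(range(n_stages)):
            cross = (t + s) % 2 == 1  # blind alternation, no arbitration
            for line in range(n_lines):
                queue = queues[s][line]
                if not queue:
                    continue
                dest = queue[0]
                needs_cross = bool(((line ^ dest) >> s) & 1)
                if needs_cross != cross:
                    continue  # head packet waits for the other global state
                out_line = (line ^ (1 << s)) if cross else line
                if s == n_stages - 1:
                    queue.popleft()
                    delivered += 1
                    misrouted += int(out_line != dest)
                elif len(queues[s + 1][out_line]) < buffer_size:
                    queue.popleft()
                    queues[s + 1][out_line].append(dest)
        # New arrivals: Bernoulli requests with uniform destinations.
        for line in range(n_lines):
            if rng.random() < load:
                offered += 1
                if len(queues[0][line]) < buffer_size:
                    queues[0][line].append(rng.randrange(n_lines))
                    accepted += 1
    return offered, accepted, delivered, misrouted

offered, accepted, delivered, misrouted = simulate_blind_gsmin()
print(offered, accepted, delivered, misrouted)
```

Since every stage applies a bijection, two head packets never compete for the same output line; the only sources of waiting are a mismatched global state and a full downstream buffer. The ratio `accepted / offered` is the quantity comparable to the acceptance probability on the plots.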

34 IV.4 Delay-line "buffered" architecture
Reliable optical memories are still too difficult to implement...
What about just delaying packets?
[Figure: input → stages with a delay-line "buffer" → output]
(since there are only two states per stage, a single delay-line may already give good performance)

35 [Figure: input → switch with delay-line "buffer" → output]
… we didn't study a "standard" MIN with delay-lines
(slide not shown in main presentation)

36 IV.4 Performance of a delay-line "buffered" architecture
[Plot: probability of packet acceptance (0 to 1) vs. input request probability per unit time (0 to 1), for buffer sizes 0 to 6; curves: crossbar, Global Switched MIN, delay-line]
Using a single selectable delay per channel and per stage, performance lies somewhere in between that of the one-packet and two-packet FIFO buffered architectures.
Blind alternation of global switch states is assumed.
(we didn't study a "standard" MIN with delay-lines)

37 V. Conclusion
V.1 Results
V.2 Further Research

38 V.1 Conclusion
Summarizing:
Column-Control simplifies MIN hardware and control;
a Column-Controlled MIN can be efficiently implemented using dense plane-to-plane optical interconnections;
a Column-Controlled MIN may have enough permutation capacity for specific applications (highly parallel algorithms);
a Column-Controlled MIN can be used for packet switching if buffered, giving roughly the same performance as "standard" MINs;
the path-selection mechanism may be "blind" (i.e. round-robin, time-division permutation routing) without appreciable degradation of performance.

39 V.2 Further Research
On transparent circuit-switched CCMINs:
An arbitrary permutation request may be serviced by multiplexing in time the available set of permutations. This needs input buffers and speed-up (i.e. short switching latency). This has been explored in standard MINs using 2x2 switches…
Design of "active" modules, and multi-function modules (containing more than two permutations, but also other optical functions, e.g. optical delay lines).
On buffered packet-switched CCMINs:
Other models of buffers: in particular, inter-stage virtual output queues (VOQ) may give very good performance in CCMIN (because with a speed-up of only 2, each stage will have 100% throughput). Two parallel delay-line buffers?
How heavily do the studied architectures rely on the URM assumption? Study more realistic traffic models / ways to balance the non-regular traffic.

40 V.2 Fast switching permutation modules
[Figure: stack of PLC layers coupled in the normal direction; cross state and by-pass state]
Reconfiguration time can be of the order of nanoseconds!
Simulation of a crossbar by speed-up (TDM connections for local area networks).
Core of permutation routing switches for inter-processor communications in a parallel computer.
(slide not shown in main presentation)

41 V.2 …"advanced" further research
VOQ and speed-up, plus optimal permutation decomposition, are the basic ingredients of the Birkhoff-von Neumann (BVN) switch with 100% throughput (plus load-balancing to simplify the decomposition => Tiny-Tera switch). Based on this observation, it will be interesting to study:
1) a "constrained" decomposition of a rate matrix onto the set of available CCMIN permutations;
2) a multistage version of the BVN switch, where the permutation decomposition is done:
a) at each stage (using bi-permutation modules, this will probably lead to a simple forced-alternate mode, and reduce the size of the VOQ to only 2, which may be accommodated by simple delay-lines!), or
b) every few stages, so that the available set of permutations will be very reduced, but still larger than 2. This may optimize the design of buffer functions (no need to put them in all stages).
Thank you for your attention
(slide not shown in main presentation)
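One special case of the "constrained" decomposition can be written down directly. The sketch assumes the CCMIN permutations are the 2^n maps x → x XOR mask (products of cube permutations) and treats only the uniform rate matrix; a general rate matrix would need a real decomposition algorithm, which is not attempted here. All names are illustrative.

```python
from fractions import Fraction

def xor_permutation(mask: int, n_lines: int):
    """CCMIN permutation x -> x XOR mask, as a list of output lines."""
    return [x ^ mask for x in range(n_lines)]

def decompose_uniform(n_stages: int):
    """Weights and permutations whose weighted sum is the uniform rate matrix."""
    n_lines = 1 << n_stages
    weight = Fraction(1, n_lines)
    return [(weight, xor_permutation(mask, n_lines)) for mask in range(n_lines)]

def rebuild_rate_matrix(terms, n_lines: int):
    """Sum weight * permutation matrix over all terms."""
    rates = [[Fraction(0)] * n_lines for _ in range(n_lines)]
    for weight, perm in terms:
        for src, dst in enumerate(perm):
            rates[src][dst] += weight
    return rates

terms = decompose_uniform(3)
rates = rebuild_rate_matrix(terms, 8)
print(all(r == Fraction(1, 8) for row in rates for r in row))
```

Each input-output pair is connected by exactly one of the 2^n permutations, so cycling through them with equal time shares serves every pair at rate 1/N, which is the round-robin schedule in matrix form.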

42 VI. Some References
Traffic models: J. Cao et al., "Internet traffic tends toward Poisson and Independent as load Increases", Nonlinear Estimation and Classification, eds. C. Holmes et al., Springer, NY, 2002.
Thermo-optic matrix: [Goh01].
Round-robin (TDM): [Thompson91].
Crosstalk can be solved by decomposing a permutation into semi-permutations, with an increase of the number of network stages: [Qiao].
Y. Li and J. Popelek, "Volume-consumption comparisons of free-space and guided-wave optical interconnections", Appl. Opt., Vol. 39, No. 11, pp. 1815-1825, April 2000.
Study of inter-stage VOQ in MINs: Kolias, "Dual Banyan Switch", [Kolias].
W.J. Dally, "Virtual-Channel Flow Control", IEEE Trans. Parallel and Distr. Systems, Vol. 3, No. 2, Mar. 1992, pp. 194-205. Dally studies "DAMQ" (dynamically allocated multi-queue buffers), which look quite similar to "hop-mode" buffers.
(slide not shown in main presentation)

