University of Michigan Electrical Engineering and Computer Science Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators.

Manjunath Kudlur, Kevin Fan, Michael Chu, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

Motivation
Custom application accelerators (ASICs/ASIPs) require careful data memory system design
  – Large volumes of data are accessed at high bandwidth
Distributed local memories (scratchpads)
  – Achieve high bandwidth through parallel access
  – Achieve low latency by placing data near computation
Custom memory design is complex
  – Multiple considerations: bandwidth, size requirements, data distribution
  – A decentralized datapath adds a further complication

Background – Our System
Synthesis of non-programmable accelerators
  – System similar to PICO (Program-In Chip-Out)
  – Input is a "hot" loop nest expressed in C
Throughput-directed synthesis
  – Required throughput expressed as II (initiation interval)
  – Innermost loop is modulo scheduled
  – Datapath derived directly from the schedule
  – FUs allocated to meet the II
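A minimal sketch of how a target II bounds FU allocation. It assumes fully pipelined FUs that each start one operation per cycle, so one FU can serve at most II operations per iteration. The function name `min_fus` and the operation counts are hypothetical and do not come from the slides; the real allocation in the synthesis system may differ.

```python
import math

def min_fus(op_counts, ii):
    """Lower bound on FUs per type: ceil(ops of that type / II)."""
    if ii < 1:
        raise ValueError("II must be a positive integer")
    return {fu: math.ceil(n / ii) for fu, n in op_counts.items()}

# Hypothetical loop body: 7 adds, 4 multiplies, 3 memory operations.
ops = {"ADD": 7, "MUL": 4, "MEM": 3}
print(min_fus(ops, ii=2))  # {'ADD': 4, 'MUL': 2, 'MEM': 2}
print(min_fus(ops, ii=4))  # {'ADD': 2, 'MUL': 1, 'MEM': 1}
```

A larger II (lower throughput) lets each FU be shared by more operations, which is why the required throughput drives hardware cost.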

Background – Multicluster Datapath
FUs divided into clusters
Intercluster communication through a global bus
Reduced wire lengths and reduced porting on register file structures
Increased compiler complexity
[Figure: a C program mapped to a two-cluster datapath; each cluster contains FUs, register FIFOs, and MEM units with local memories; Cluster 1 and Cluster 2 are joined by an interconnection network]

Background – Local Memories
SRAMs connected to MEM units in clusters
  – Each data structure is assigned to a single SRAM
  – A data structure can be a whole array or part of an array
  – Currently only whole arrays are considered
Multiple arrays can be combined in a single SRAM
[Figure: Cluster 1 with FUs, register FIFOs, and a MEM unit connected to local memories]

Problem Statement and Approach
“Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters and the target II, find an allocation of arrays to SRAMs and allocation of SRAMs to clusters such that overall cost is minimized”
Phase-ordered approach that handles two subproblems separately
  – Memory synthesis
  – Operation partitioning

Combining Arrays
Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
Issues with combining:
  – Consider two arrays with (Bitwidth, Size) = (B1, S1) and (B2, S2)
  – Suppose A1 and A2 are the numbers of static accesses in the loop
  – Number of ports = ⌈(A1 + A2) / II⌉
[Figure: arrays X (bitwidth B1, size S1) and Y (bitwidth B2, size S2) combined into one SRAM of bitwidth MAX(B1, B2) and size S1 + S2]
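The combining rule on this slide can be sketched as a small function. The `Array` class, the `combine` function, and the example numbers are hypothetical illustrations; the port count is rounded up on the assumption that an SRAM cannot have a fractional port.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Array:
    name: str
    bitwidth: int   # B: bits per element
    size: int       # S: number of elements
    accesses: int   # A: static accesses in the loop body

def combine(arrays, ii):
    """Return (width, depth, ports) of one SRAM holding all given arrays."""
    width = max(a.bitwidth for a in arrays)            # MAX(B1, B2, ...)
    depth = sum(a.size for a in arrays)                # S1 + S2 + ...
    ports = math.ceil(sum(a.accesses for a in arrays) / ii)
    return width, depth, ports

x = Array("X", bitwidth=16, size=256, accesses=3)
y = Array("Y", bitwidth=32, size=128, accesses=2)
print(combine([x], ii=2))     # (16, 256, 2)
print(combine([y], ii=2))     # (32, 128, 1)
print(combine([x, y], ii=2))  # (32, 384, 3)
```

In this example the combined SRAM stores X's 16-bit elements in 32-bit words and needs three ports, so the savings in decoders and sense amps have to outweigh the wasted bits and the extra porting.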

Combining Arrays
Multicluster issues
  – Can cause imbalance in operation distribution
      All load/store operations for the combined arrays must be assigned to the same cluster
  – Can increase intercluster traffic
      Address calculations and load uses can cause extra intercluster moves
[Figure: dataflow example with +, LD, and USE operations on registers R1 and R2, showing an intercluster (IC) move]

Solution 1
Formulate the problem as an integer program
  – A binary decision variable X(i,j,k,l) denotes assignment of array i to local memory j with k ports on cluster l
Constraints ensure that the intercluster move bandwidth is not violated
Operation partitioning and modulo scheduling are performed after memory synthesis
[Figure: input arrays A, B, C, D and the target II flow through Memory Synthesis, Operation Partitioning, and Modulo Schedule; the arrays end up distributed across Cluster 1 and Cluster 2]

Experiments
System implemented in the Trimaran framework
Memory costs obtained from ARTISAN SRAM generator scripts
lp_solve used to solve the integer programs
A set of DSP kernels evaluated
  – Loop oriented
  – Many arrays accessed in the loops

Results for Solution 1
[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: target initiation interval (II)]

Achieved II in Solution 1
Solution 1 eagerly combines arrays
  – Potential increase in intercluster moves due to imbalance in the distribution of LD/ST ops
  – Achieved II is poor due to IC moves in recurrence cycles
Best II achieved:
  Benchmark   BW=2  BW=3  BW=4  BW=5
  channel       20    13    10     8
  huffman       28    19    14    12
  LU             5     3     3     2
  lyapunov      10     7     5     4

Solution 2
Phase-ordered approach
  – Two highly intertwined decisions: allocation of local memories and partitioning of operations
Three phases:
  – Pre-partitioning
  – Memory synthesis
  – Operation partitioning

Pre-Partitioning
Performance-oriented operation partitioning
  – Memory operations accessing the same array are bound to the same cluster
  – Consequently, arrays are bound to clusters
[Figure: pre-partitioning of a dataflow graph with operations 1 to 14 across Cluster 1 and Cluster 2, binding arrays A, B, C, D, E to clusters]
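A minimal sketch of the binding rule above: memory operations are grouped by the array they access, and each group is placed on one cluster as a unit, which in turn binds the array to that cluster. The greedy load balancing used here is only a stand-in for the performance-oriented partitioner, which the slides do not describe in detail; `prepartition` and the operation names are hypothetical.

```python
from collections import defaultdict

def prepartition(mem_ops, num_clusters):
    """mem_ops maps op name -> array accessed.

    Returns (array -> cluster, op -> cluster).
    """
    groups = defaultdict(list)
    for op, array in mem_ops.items():
        groups[array].append(op)

    load = [0] * num_clusters   # memory ops placed on each cluster so far
    array_cluster = {}
    # Place the largest groups first; ties are broken by array name.
    for array in sorted(groups, key=lambda a: (-len(groups[a]), a)):
        c = min(range(num_clusters), key=lambda i: load[i])
        array_cluster[array] = c
        load[c] += len(groups[array])

    op_cluster = {op: array_cluster[arr] for op, arr in mem_ops.items()}
    return array_cluster, op_cluster

mem_ops = {"ld1": "A", "ld2": "A", "st1": "A", "ld3": "B",
           "st2": "B", "ld4": "C", "ld5": "D", "st3": "E"}
array_cluster, op_cluster = prepartition(mem_ops, num_clusters=2)
print(array_cluster)  # {'A': 0, 'B': 1, 'C': 1, 'D': 0, 'E': 1}
```

Because whole groups move together, every load and store of an array lands on the cluster that will later hold its SRAM.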

Memory Synthesis
ILP used to optimally combine arrays within clusters
Pre-partitioning effectively disables combining of arrays that would cause operation imbalance
[Figure: the Memory Synthesis step combines arrays A, B, C, D, E into SRAMs within Cluster 1 and Cluster 2]
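To illustrate what "optimally combine arrays within clusters" means, the sketch below enumerates every grouping of one cluster's arrays and keeps the cheapest. This is exhaustive search, not the ILP used in the work, and it is only practical for a handful of arrays. The cost model (a fixed per-SRAM overhead plus bits scaled by port count) is a hypothetical assumption; the slides obtain real costs from SRAM generator scripts.

```python
import math

def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # first in its own group
        for i in range(len(part)):                  # or joined to a group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def sram_cost(group, ii, overhead=2000):
    """Hypothetical cost of one SRAM holding the arrays in group."""
    width = max(b for _, b, _, _ in group)
    depth = sum(s for _, _, s, _ in group)
    ports = math.ceil(sum(a for _, _, _, a in group) / ii)
    return overhead + width * depth * ports

def best_combination(arrays, ii):
    """Cheapest grouping of one cluster's arrays into SRAMs."""
    return min(partitions(arrays),
               key=lambda p: sum(sram_cost(g, ii) for g in p))

# (name, bitwidth, size, static accesses) for the arrays of one cluster
cluster1 = [("A", 16, 64, 1), ("C", 16, 64, 1), ("E", 32, 256, 2)]
best = best_combination(cluster1, ii=2)
print([[a[0] for a in g] for g in best])        # [['A', 'C'], ['E']]
print(sum(sram_cost(g, 2) for g in best))       # 14240
```

Under these assumed numbers, merging A and C saves one SRAM's fixed overhead without adding a port, while merging either with the wider, more heavily accessed E would waste bits and force a second port.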

Results for Solution 2
[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: target initiation interval (II)]

Achieved II for Solution 2
Best II achieved (NONE = without pre-partitioning, PRE = with pre-partitioning):
                BW=2        BW=3        BW=4        BW=5
  Benchmark   NONE  PRE   NONE  PRE   NONE  PRE   NONE  PRE
  channel       20   14     13   10     10    7      8    6
  huffman       28   20     19   14     14   10     12    8
  LU             5    3      3    2      3    2      2    1
  lyapunov      10    5      7    3      5    3      4    2
  Improvement     37%         35%         33%         40%
Cost of synthesized memory is not substantially different
But achieved II is 36% better with pre-partitioning
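The percentages on this slide appear consistent with averaging the relative reduction in achieved II over the four benchmarks, which the short check below reproduces. Two caveats: the BW=4 NONE entries for channel and huffman are missing in the transcript, and the values 10 and 14 used here are assumed from the Solution 1 table; the averaging method itself is an inference, not something the slides state.

```python
# Achieved II per benchmark: (NONE, PRE) pairs for BW = 2, 3, 4, 5.
# The BW=4 NONE values for channel (10) and huffman (14) are assumed
# from the Solution 1 results.
achieved = {
    "channel":  [(20, 14), (13, 10), (10, 7), (8, 6)],
    "huffman":  [(28, 20), (19, 14), (14, 10), (12, 8)],
    "LU":       [(5, 3), (3, 2), (3, 2), (2, 1)],
    "lyapunov": [(10, 5), (7, 3), (5, 3), (4, 2)],
}

def mean_improvement(col):
    """Mean relative reduction in II (percent) for one bandwidth column."""
    pairs = [rows[col] for rows in achieved.values()]
    gains = [(none - pre) / none for none, pre in pairs]
    return 100 * sum(gains) / len(gains)

per_bw = [mean_improvement(c) for c in range(4)]
print([round(p) for p in per_bw])        # [37, 35, 33, 40]
print(round(sum(per_bw) / len(per_bw)))  # 36
```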

Conclusion
An approach for synthesizing custom local memories
  – ILP-based optimal solution
  – Works for clustered datapaths
Pre-partitioning improves achieved throughput with minimal impact on cost
For more information
  – http://cccp.eecs.umich.edu

