The Scale Vector-Thread Processor
Ronny Krashinsky, Christopher Batten, Krste Asanović
Vector-Thread Architecture and Scale Prototype

http://cag.csail.mit.edu/scale
MIT Computer Science and Artificial Intelligence Laboratory

[Poster figures: Vector-Thread Architecture overview; die plot of the Scale prototype chip in TSMC 0.18μm; microarchitectural simulation results for 1-, 2-, 4-, and 8-lane configurations.]

This work was partially supported by a DARPA PAC/C award, an NSF CAREER award, an NSF graduate fellowship, a CMI research grant, and donations from Infineon Technologies and Intel. We acknowledge and thank Albert Ma for designing the VCO and providing extensive help with CAD tools, Mark Hampton for implementing VTorture and the Scale compilation tools, Jaime Quinonez for the baseline datapath tiler implementation, Jared Casper for work on an initial cache design and documentation, and Jeffrey Cohen for initial work on VTorture.

Chip Statistics

  Process Technology:    TSMC 0.18μm
  Metal Layers:          6 aluminum
  Transistors:           7.14 M
  Gates:                 1.41 M
  Standard Cells:        397 K
  Flip-Flops & Latches:  94 K
  Core Area:             16.61 mm²
  Chip Area:             23.14 mm²
  Design Time:           19 months
  Design Effort:         24 person-months

Initial Results

Initial results for running the adpcm.dec benchmark from on-chip RAM with no cache tag checks:

  Voltage (V)   Max Frequency (MHz)   Power (mW)   Energy per Cycle (nJ)
  1.2           157                    156         0.99
  1.5           218                    342         1.57
  1.8           270                    612         2.27
  2.1           304                    955         3.14
  2.4           338                   1404         4.15

Kernel speedup is measured relative to compiling the benchmark and running it on the Scale control processor. Mem-B is the average number of bytes of L1-to-main-memory traffic per cycle. Ld-El and St-El are the numbers of load and store elements transferred to and from the cache per cycle. Loop types include data-parallel loops with no control flow (DP), data-parallel loops with control flow or inner loops (DC), loops with cross-iteration dependencies (XI), and free-running threads (FT).
Memory access types include unit-stride and strided vector memory accesses (VM), segment vector memory accesses (SVM), and individual VP loads and stores (VP).

Clock and Power Distribution

The chip includes a custom-designed voltage-controlled oscillator (VCO) to enable testing at various frequencies. Scale can be clocked either by the on-chip VCO or by an external clock input. The clock tree was automatically synthesized using Encounter, and the maximum trigger-edge skew is 233 ps. Scale uses a fine-grained power distribution grid over the entire chip, as shown in the diagrams above. The standard cells have a height of nine Metal 3/5 tracks and get power and ground from Metal 1 strips that cover two tracks. We route power/ground strips horizontally on Metal 5, directly over the Metal 1 power/ground strips, leaving seven Metal 3/5 tracks unobstructed for signal routing. We route power/ground strips vertically on Metal 6; these cover three Metal 2/4 tracks and are spaced nine tracks apart. The power distribution uses 21% of Metal 6 and 17% of Metal 5.

Chip Test Platform

The chip test infrastructure includes a host computer, a test baseboard, and a daughter card with a socket for Scale. The test baseboard includes a host interface and a memory controller implemented on a Xilinx FPGA, as well as 96MB of SDRAM, configurable power supplies, and a tunable clock generator. The host interface is clocked by the host and uses a low-bandwidth asynchronous protocol to communicate with Scale, while the memory controller and the SDRAM use a synchronous clock generated by Scale. Using this test setup, the host computer can download and run programs on Scale while monitoring power consumption at various voltages and frequencies. To allow us to run real programs which include file I/O and other system calls, a simple proxy kernel marshals up system calls, sends them to the host, and waits for the results before resuming program execution.
[Diagram: test setup. The host PC connects through a PLX interface to the test baseboard, whose FPGA implements the host interface and memory controller and connects to 96MB of SDRAM; the daughter card supplies power and clock to Scale.]

Standard Cell Preplacement

We have developed a C++-based preplacement framework which manipulates standard cells using the OpenAccess libraries. Preplacement has several important advantages, including improved area utilization, decreased congestion, improved timing, and decreased tool run time. The framework allows us to write code that instantiates and places standard cells in a virtual grid and programmatically creates logical nets to connect them together. The framework then processes the virtual grid to determine the absolute position of each cell within the preplaced block. We preplaced 230 thousand cells (58% of all standard cells) in various datapaths, memory arrays, and crossbar buffers and tri-states. After preplacing a block, we export a Verilog netlist and DEF file for use by the rest of the toolflow. Although the preplaced blocks do not need to be synthesized, we still input them into the synthesis tool so that it can correctly optimize logic which interfaces with the preplaced blocks. During place & route we use TCL scripts to flexibly position the blocks around the chip. Although we preplace the standard cells, the datapath routing is done automatically; we have found that automatic routing produces reasonably regular routes for signals within the preplaced blocks.

[Diagram: a single Scale cluster, showing the execute directive queue, adder, and shifter.]

The datapath, register file, and execute directive queue have all been preplaced using our custom framework. Although we were able to use the Artisan memory compiler to generate an SRAM for the AIB cache, there was no suitable memory compiler for the cluster's 2-read/2-write-port register file. With preplacement we were able to make use of special latch standard cells and hierarchical tri-state bit-lines to create a reasonable register file implementation using purely static logic. We used a similar technique for the various CAM arrays in the design.
Implementation and Verification Toolflow

  Golden Verilog RTL, with Artisan SRAMs and standard cells
  C++ preplacement code -> standard cell preplacement framework (gate-level netlists)
  Synopsys Design Compiler: synthesis
  Cadence First Encounter: floorplan / clock / power / place & route
  Mentor Graphics Calibre: DRC / LVS / RCX
  Synopsys Formality: verify RTL vs. gates
  Tenison VTOC: Verilog to C++ (2-state C++ RTL simulator)
  Synopsys VCS: 4-state RTL simulation
  Synopsys Nanosim: transistor-level simulation
  C++ ISA simulator (functional simulation) and C++ microarchitectural models
  Directed and random test programs, with simulators cross-checked by comparing test memory dumps

  Benchmark   Suite       Description                                  Kernel   Ops/   Ld-El/  St-El/  Mem-B/  Loop    Memory
                                                                       Speedup  Cycle  Cycle   Cycle   Cycle   Types   Access Types
  rgbcmyk     EEMBC       RGB to CMYK color conversion                   14.8    6.8    1.2     0.4     3.0    DP      VM,SVM
  rgbyiq      EEMBC       RGB to YIQ color conversion                    39.8    9.3    1.3             3.8    DP      SVM
  hpg         EEMBC       High pass gray-scale filter                    44.6   10.4    2.8     1.0     3.1    DP      VM,VP
  fft         EEMBC       256-pt fixed-point complex FFT                 18.8    3.8    1.7     1.4     0.1    DP      VM,SVM
  viterbi     EEMBC       Soft decision Viterbi decoder                  10.5    5.0    0.5             0.1    DP      VM,SVM
  dither      EEMBC       Floyd-Steinberg gray-scale dithering            6.9    5.0    1.1     0.3            DP,DC   VM,SVM,VP
  lookup      EEMBC       IP route lookup using Patricia Trie             5.8    6.9    0.9     0.0            DC      VM,VP
  pktflow     EEMBC       IP packet processing (2MB dataset)             13.5    3.7    0.7     0.1     4.2    DC,XI   VM,VP
  sha         MiBench     Secure hash algorithm (large dataset)           2.4    1.9    0.3     0.1     0.0    DP,XI   VM,VP
  adpcm.enc   MediaBench  Speech encoding                                 1.9    2.3    0.1     0.0            XI      VM,VP
  adpcm.dec   MediaBench  Speech decoding                                 8.1    6.7    0.6     0.2     0.0    XI      VM,VP
  ptrchase    EEMBC       Pointer chasing, searching linked lists         4.4    2.3    0.3     0.0            FT      VP
  quicksort   MiBench     Quick sort of short strings (small dataset)     3.0    2.0    0.4     0.3     2.2    FT      VP

[Bar chart: kernel speedup per benchmark.]

