1 Many Core Platform For DSP Applications 037661584 048879 Seminar in VLSI Architecture Ophir Erez.

2 2 What will we talk about?
“AsAP - An Asynchronous Array of Simple Processors for DSP Applications,” in IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2006
“A 167-processor 65 nm Computational Platform with Per-Processor DVFS,” in Symp. VLSI Circuits Dig., Jun. 2008

3 3 AsAP: An Asynchronous Array of Simple Processors for DSP Applications

4 4 Agenda
Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

5 5 Target Applications
Computationally intensive DSP and scientific apps
  – Key components in many systems
  – Require high performance
  – Limited power budgets
  – Require innovations in architecture and circuit design
Examples: wireless communications, remote sensing and processing, biomedical applications

6 6 New era of high performance design
Clock frequencies ↑ and numbers of circuits per chip ↑, so:
  – Chip performance is limited by power dissipation rather than circuit constraints
  – High performance design must focus on energy-efficient implementations

7 7 Project Objectives
Programming flexibility
  – With a C-like language
High performance
  – Throughput
  – Latency
High energy efficiency
Suitable for future fabrication technologies
[Figure: performance and energy efficiency versus programming flexibility for ASIC, FPGA, programmable DSP, and AsAP]

8 8 Agenda
Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

9 9 Key Features of the AsAP Processor
Chip multiprocessor → high performance
Small memory and simple processor → high energy efficiency
Globally asynchronous locally synchronous (GALS) clocking (simplifies clock design) and nearest neighbor communication (avoids long global wires) → technology scalability

10 10 High Performance Through Chip Multiprocessor and TLP
The traditional route to high performance is to increase the clock frequency. This has worked well in the past but is becoming more challenging: advanced CPUs already run at several GHz and consume more than 100 W, and increasing the clock frequency further means increasing the power consumption further.
Increasing the clock frequency is challenging: higher clock frequency → deeper pipeline → increased design complexity and increased power consumption
Parallelism is more promising
  – Instruction level parallelism (VLIW, superscalar)
  – Data level parallelism (SIMD)
  – Task level parallelism

11 11 Task Level Parallelism
Well suited for DSP applications and widely available in many of them
Improves performance and potentially reduces memory size
[Figure: Task1, Task2, ... running on one processor with a shared memory holding A, B, C, versus Task1, Task2, Task3 each mapped to its own processor (Proc.1, Proc.2, Proc.3) and passing A, B, C between them]
Example: 802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband transmit path
[Figure: transmit path from in to out with blocks scram, coding, interleave, mod. map, load fft, training, pilots, IFFT, window, up-samp filter, scale, clip]
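
The task-per-processor mapping on this slide can be sketched as a chain of stages connected by bounded FIFOs. This is only an illustrative software model, not AsAP code: the function name `run_pipeline`, the `fifo_depth` parameter, and the toy tasks in the usage note are assumptions introduced here.

```python
from collections import deque

def run_pipeline(samples, tasks, fifo_depth=32):
    """Stream samples through a chain of tasks, one task per simulated processor."""
    if not tasks:
        return list(samples)
    # fifos[i] feeds task i; fifos[-1] collects the final output (unbounded here).
    fifos = [deque() for _ in range(len(tasks) + 1)]
    pending = deque(samples)
    total = len(pending)
    while len(fifos[-1]) < total:
        # Visit processors from last to first so a sample advances one stage per step.
        for i in reversed(range(len(tasks))):
            is_last = i == len(tasks) - 1
            has_room = is_last or len(fifos[i + 1]) < fifo_depth
            if fifos[i] and has_room:
                fifos[i + 1].append(tasks[i](fifos[i].popleft()))
        # The source pushes a new sample whenever the first FIFO has room.
        if pending and len(fifos[0]) < fifo_depth:
            fifos[0].append(pending.popleft())
    return list(fifos[-1])
```

For example, `run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])` pushes five samples through three stages. Each stage only needs the code and data for its own task, which is the memory-size argument the slide makes.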

12 12 Memory Size in Modern Processors
Memory occupies much of the area in modern processors; this reduces the area available for the core and consumes large amounts of power
Area and energy dissipation can be dramatically decreased with smaller memories
[Figure: area breakdown (core, mem, other) for TI_C64x, Itanium, SPARC, and BlGe/L]

13 13 Small Memory Requirements for DSP Tasks
The memory required for common DSP tasks is quite small
Several hundred words of memory are sufficient for many DSP tasks

Task                    IMem (words)   DMem (words)
N-pt FIR                6              2N
8-pt DCT                40             16
8x8 2-D DCT             154            72
Conv. coding (k = 7)    29             14
Huffman encoder         200            330
N-pt convolution        29             2N
64-pt complex FFT       97             192
Bubble sort             20             1
N merge sort            50             N
Square root             62             15
Exponential             108            32

14 14 GALS Clocking Style
The challenges of globally synchronous systems
  – Clock tree design difficulty due to: faster clock speeds; long clock wires caused by large chip sizes; large circuit parameter variations
  – Power consumption: a synchronous clock consumes much power, about 1/3 of the power in modern processors
  – Lack of flexibility to independently control clock frequencies to achieve high energy efficiency

15 15 GALS Clocking Style
How about a totally asynchronous design style?
  – Lack of EDA tool support
  – Design complexity and circuit overhead
The GALS compromise
  – Synchronous blocks, each operating in its own independent clock domain
  – AsAP is the first configurable GALS and the first GALS array processor

16 16 Wires and On Chip Communication
Global wires are a concern
  – Their length doesn’t shrink with technology scaling, assuming the chip size remains the same
  – The ratio of global wire delay to gate delay doubles each generation, so system performance can degrade
Methods to avoid global wires
  – Network on chip (NOC)
  – Local communication
  – Nearest neighbor communication (AsAP uses this method)

17 17 Outline
Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

18 18 AsAP Block Diagram
GALS array of identical processors
  – Each processor is a reduced complexity programmable DSP with small memories
  – Each processor can receive data from any two neighbors and send data to any of its four neighbors
[Figure: processor block diagram with FIFO 0 (32 words), FIFO 1 (32 words), ALU, MAC, Control, IMem (64 words), DMem (128 words), OSC, static config, dynamic config, and output]

19 19 Single Processor Architecture
54 32-bit instructions, among which only Bit-Reverse is algorithm specific
9-stage pipeline

20 20 MAC Architecture
MAC is divided into 3 stages to enable:
  – A high clock rate
  – Capability of issuing MAC and multiply instructions every cycle
First stage generates the partial products of the 16x16 multiplier
Second stage uses carry-save adders to compress the partial products into a single 32-bit carry-save output
Final stage contains a 40-bit adder to add the results from the second stage to the 40-bit accumulator register (ACC)
ACC is normally read infrequently, so only the least-significant 16 bits of the ACC are readable
More significant ACC bits are read by shifting those bits into the 16 LSBs
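
A small behavioral model may help make the accumulator access rule concrete. It is a sketch under stated assumptions, not the AsAP design: the class name `Mac`, signed 16-bit operands, wrap-around on 40-bit overflow, and an arithmetic right shift are all assumptions that the slide does not specify, and the three pipeline stages are not modeled.

```python
MASK40 = (1 << 40) - 1

class Mac:
    """Behavioral model of a 16x16 multiply with a 40-bit accumulator (ACC)."""

    def __init__(self):
        self.acc = 0  # 40-bit ACC held as an unsigned bit pattern

    def mac(self, a, b):
        # ACC += a * b, wrapping to 40 bits (overflow behavior is an assumption).
        self.acc = (self.acc + a * b) & MASK40

    def read_low16(self):
        # Only the 16 least-significant bits of ACC are directly readable.
        return self.acc & 0xFFFF

    def shift_right(self, n):
        # Bring more significant bits down into the readable 16 LSBs.
        # Arithmetic (sign-preserving) shift is assumed here.
        signed = self.acc - (1 << 40) if self.acc >> 39 else self.acc
        self.acc = (signed >> n) & MASK40
```

After `m.mac(300, 300)` the product 90000 does not fit in 16 bits, so software would read the low half with `m.read_low16()`, call `m.shift_right(16)`, and read again to obtain the next 16 bits.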

21 21 Advantages of the AsAP Clocking System
Simplified clock tree design
  – The maximum span is < 1 mm in 0.18 µm
Scalable: easy to add processors
Improved energy efficiency
  – Clock halts in 9 cycles (processor dissipates leakage power only) and restores in < 1 cycle according to work availability
  – 53% and 65% power savings for two applications
  – Independent clock and voltage scaling (individual processor voltage scaling not implemented in this version)

22 22 Programmable Clock Oscillator
Standard cell based
Configurable frequency
  – Delay-tunable stages using 7 parallel tri-state inverters
  – 5 or 9 stage selection
  – 1 to 128 clock divider
SR latch logic enables clean OSC halt

23 23 Programmable Clock Oscillator
stall_fifo is asserted when FIFO access is stalled due to either an empty input or a full output condition.
After a nine clock cycle period, during which the processor’s pipeline is flushed, the signal halt_clk goes high, which halts the clock oscillator.
The signal stall_fifo returns low when the cause of the stall has been resolved (when data is put into the empty input FIFO or data is removed from the full output FIFO); halt_clk then restarts the oscillator at full speed in less than one clock period.
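
The halt and restart sequence described on this slide can be modeled as a small state machine. This is a hedged sketch: `stall_fifo` and `halt_clk` are the signal names from the slide, while the class `ClockHalt`, the `tick` method, and the idea of stepping the model once per reference period are assumptions made for illustration.

```python
FLUSH_CYCLES = 9  # pipeline flush period before the oscillator halts

class ClockHalt:
    """Tracks when halt_clk should be high, given stall_fifo over time."""

    def __init__(self):
        self.stalled_cycles = 0
        self.halt_clk = False

    def tick(self, stall_fifo):
        """Advance one reference period and return the new halt_clk value."""
        if not stall_fifo:
            # Stall resolved: release halt_clk so the oscillator runs again.
            self.stalled_cycles = 0
            self.halt_clk = False
        elif not self.halt_clk:
            # Still stalled: count flush cycles, then halt the oscillator.
            self.stalled_cycles += 1
            if self.stalled_cycles >= FLUSH_CYCLES:
                self.halt_clk = True
        return self.halt_clk
```

Holding `stall_fifo` high for nine consecutive ticks raises `halt_clk` on the ninth; a single tick with `stall_fifo` low clears it, which mirrors the restart in less than one clock period.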

24 24 Programmable Clock Oscillator
Results
  – Frequency range: 1.66 MHz to 702 MHz
  – Max gap: 0.08 MHz (1.66 to 500 MHz)

25 25 Programmable Clock Oscillator

26 26 Inter-processor Communication

27 27 Inter-processor Communication
Each processor contains two dual-clock FIFOs
Rd/Wr in separate clock domains
  – Gray coded Rd/Wr address across clock domains
No FIFO failures in several weeks of testing with multiple procs
[Figure: a processor with north, east, south, and west links feeding FIFO 0 and FIFO 1 ahead of the CPU; each FIFO has write logic on the write side and read logic on the read side around an SRAM, with binary-to-Gray address conversion and synchronization between the two sides]
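
The reason for Gray coding the read and write addresses is that consecutive Gray codes differ in exactly one bit, so a pointer sampled in the other clock domain is off by at most one position. The conversion below is the standard reflected binary Gray code; the function names and the 32-entry example are assumptions for illustration, and the exact pointer width used in AsAP is not stated on the slide.

```python
def bin_to_gray(n):
    """Convert a binary address to its reflected Gray code."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Convert a Gray coded address back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def differs_by_one_bit(a, b):
    """True when a and b differ in exactly one bit position."""
    return bin(a ^ b).count("1") == 1
```

For a 32-entry FIFO, `differs_by_one_bit(bin_to_gray(i), bin_to_gray((i + 1) % 32))` holds for every address `i`, including the wrap from 31 back to 0.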

28 28 Physical Design
0.18 µm TSMC
Standard cell design flow
  – Completely synthesized, except oscillator
  – Macro memory blocks used for 4 main memories
  – Completely auto placed and routed
To simplify physical design, clock gating is not implemented in this version

29 29 Chip Micrograph of the 6 x 6 Array
Transistors: 1 processor 230,000; chip 8.5 million
Max speed: 475 MHz @ 1.8 V
Area: 1 processor 0.66 mm²; chip 32.1 mm²
Power (1 processor @ 1.8 V, 475 MHz): typical application 32 mW; typical 100% active 84 mW; worst case 144 mW
Power (1 processor @ 0.9 V, 116 MHz): typical application 2.4 mW
[Figure: chip micrograph, 5.65 mm x 5.68 mm, with a single 810 µm x 810 µm processor showing OSC, FIFOs, DMem, and IMem]

30 30 Test Environment PC Board & FPGA Board

31 31 Agenda
Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

32 32 Area Evaluation
Most of AsAP’s area is for the core (66%)
Each processor requires a very small area, more than 20x smaller than others
[Figure: area breakdown (core, mem, comm) for TI C64x, RAW, Fujitsu 4-VLIW, BlGe/L, and AsAP]
[Figure: processor area (mm²), all scaled to 0.13 µm, for TI C64x, CELL/SPE, Fujitsu 4-VLIW, RAW, ARM, and AsAP; bar labels 72, 30, 8.3, 7, 0.34, 29] [ISSCC 05, 00; ISCA04; CMPON96]

33 33 Area Evaluation

34 34 Power and Performance
Higher performance ÷ area, and lower estimated energy
[Figure: power / clock frequency / scale* (mW/MHz) and peak performance density (MOPS/mm²)]
*All scaled to 0.13 µm
*Assume 2 ops/cycle for CELL/SPE and 3.3 ops/cycle for TI C64x
Note: word widths and workload not factored
[ISSCC 05, 00; ISCA04; CMPON96]

35 35 JPEG Core Encoder Implementation
9 processors
Fully functional on chip
224 mW @ 300 MHz
~1400 clock cycles for each 8x8 block
Similar performance and ~11x lower energy dissipation than 8-way VLIW TI C62x [MICRO 02, ISCAS 02]
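
The three figures on this slide can be combined into a per-block time and energy estimate. This is back-of-the-envelope arithmetic, assuming the encoder sustains one 8x8 block per ~1400 cycles and that 224 mW is the average power while doing so; the function names are introduced here for illustration.

```python
def block_time_us(cycles_per_block, freq_mhz):
    """Time to process one block in microseconds (cycles / MHz = µs)."""
    return cycles_per_block / freq_mhz

def block_energy_uj(power_mw, cycles_per_block, freq_mhz):
    """Energy per block in microjoules (mW * µs = nJ, then / 1000)."""
    return power_mw * block_time_us(cycles_per_block, freq_mhz) / 1000.0

def blocks_per_second(cycles_per_block, freq_mhz):
    """Sustained block rate implied by the cycle count and clock."""
    return freq_mhz * 1e6 / cycles_per_block
```

With the slide's numbers, `block_time_us(1400, 300)` is about 4.67 µs and `block_energy_uj(224, 1400, 300)` is about 1.05 µJ per 8x8 block, under the assumptions above.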

36 36 802.11a/802.11g Wireless Transmitter Implementation
22 processors
Fully functional on chip
407 mW @ 300 MHz
30% of the full 54 Mb/s rate
Code unscheduled and lightly optimized
~10x performance and 35x – 75x lower energy dissipation than 8-way VLIW TI C62x [SIPS 04; ICC 02]

37 37 802.11a/802.11g Wireless Transmitter - Dataflow

38 38 Comparison of AsAP to TI C62x

39 39 Summary of AsAP
Scalable programmable processing array
  – Many processors on a single chip
  – Reduced complexity processors with small memories
  – GALS clocking style
  – Nearest neighbor communication
Results
  – 0.18 µm, 475 MHz @ 1.8 V: 32 mW application power, 84 mW 100% active power
  – 2.4 mW application power @ 116 MHz and 0.9 V
  – High performance density: 475 MOPS in 0.66 mm²
  – Well suited for future fabrication technologies
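
The summary figures can be turned into energy per clock cycle and performance density with simple arithmetic. The helper names below are assumptions for illustration; the inputs are the numbers on this slide.

```python
def energy_per_cycle_pj(power_mw, freq_mhz):
    """Energy per clock cycle in picojoules (mW / MHz = nJ, times 1000)."""
    return power_mw / freq_mhz * 1000.0

def density_mops_per_mm2(mops, area_mm2):
    """Peak performance density in MOPS per square millimetre."""
    return mops / area_mm2
```

Application power works out to roughly 67.4 pJ per cycle at 1.8 V and 475 MHz (`energy_per_cycle_pj(32, 475)`) and roughly 20.7 pJ per cycle at 0.9 V and 116 MHz (`energy_per_cycle_pj(2.4, 116)`). The density `density_mops_per_mm2(475, 0.66)` is about 720 MOPS/mm².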

40 40 A 167-processor 65 nm Computational Platform with Per-Processor DVFS

41 41 Agenda
The Second Generation AsAP
  – Processors and Shared Memories
  – On-chip Communication
  – DVFS
Analysis and Summary

42 42 New Challenges Addressed
Reduction in the power dissipation of:
  – Lightly-loaded processors (lowering Vdd)
  – Unused processors (leakage)
Achieving very high efficiencies and speed on common demanding tasks such as FFTs, video motion estimation, and Viterbi decoding
Larger, area-efficient on-chip memories
Efficient, low-overhead communication between distant processors

43 43 167-processor Computational Platform
Key features
  – 164 enhanced programmable processors
  – 3 dedicated-purpose processors
  – 3 shared memories
  – Long-distance circuit-switched communication network
  – Dynamic Voltage and Frequency Scaling (DVFS)
[Figure: chip floorplan showing a tile (core, DVFS, Osc, Comm), the Viterbi decoder, FFT, motion estimation, and the 16 KB shared memories]

44 44 Homogeneous Processors
164 in-order, single-issue, 6-stage processors
  – 16-bit datapath with RISC-like instructions
  – 128x16-bit data memory (DMEM)
  – 128x35-bit instruction memory (IMEM)
  – Two 64x16-bit FIFOs for inter-processor communication
Pipeline stages: IF, ID, MemRd, EXE1, EXE2, WB

45 45 Homogeneous Processors
Over 60 basic instructions
  – Add, Sub, Logic, Multiply, MAC, Branch, …
New instructions and features
  – Min/Max, Byte-Sub/Add, Absolute value, Fixed-to-Float conversion assist
  – Jump/Return (function support)
  – Zero Overhead Looping (block repeat)
  – Conditional Execution (predicating)
  – Block Floating Point
Floating point CORDIC square root requires 2.9x fewer cycles compared to first generation AsAP
Preliminary results from one chip: 1.2 GHz, 59 mW, 1.3 V, 100% active MAC/ALU operations

46 46 Configurable Dedicated-Purpose Processors
3 dedicated-purpose processors are placed into the array:
  – FFT: Fast Fourier Transform
  – Motion Estimation: for video encoding
  – Viterbi Decoding: for convolutional code decoding

47 47 Fast Fourier Transform (FFT)
Continuous flow architecture with a single radix-4,2 butterfly
Runtime configurable from 16-pt to 4096-pt transforms, FFT and IFFT
760,000 1024-pt complex FFTs/sec @ 989 MHz, 1.3 V
1.01 mm²
Preliminary measurements: functional at 866 MHz, 34.97 mW @ 1.3 V
[Figure: layout, 1.23 mm x 0.82 mm]
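
The quoted FFT rate can be cross-checked against the butterfly count of a radix-4 transform. The sketch below assumes a pure radix-4 decomposition for the 1024-point case (1024 is a power of 4); the function names are introduced here, and the slide does not state how many cycles each butterfly takes.

```python
def cycles_per_fft(clock_hz, ffts_per_sec):
    """Average clock cycles available per transform at the quoted rate."""
    return clock_hz / ffts_per_sec

def radix4_butterflies(n):
    """Number of radix-4 butterflies in an n-point FFT, n a power of 4."""
    stages, m = 0, n
    while m > 1:
        if m % 4:
            raise ValueError("n must be a power of 4")
        m //= 4
        stages += 1
    return (n // 4) * stages
```

`cycles_per_fft(989e6, 760_000)` gives about 1301 cycles per 1024-point transform, while `radix4_butterflies(1024)` is 1280. The two figures are close, which would be consistent with a single butterfly unit completing roughly one radix-4 butterfly per cycle, though that interpretation is an inference rather than something the slide states.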

48 48 Motion Estimation for Video Encoding
Supports a number of fixed and programmable search patterns
Supports all H.264 specified block sizes within a 48x48 search range
14 billion SADs/sec @ 880 MHz, 1.3 V; supports 1080p HDTV @ 30 fps
0.67 mm²
Preliminary measurements: functional at 938 MHz, 196.17 mW @ 1.3 V
[Figure: layout, 0.82 mm on a side]
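
For readers unfamiliar with the SAD metric quoted here, the following sketch shows a sum of absolute differences and an exhaustive search around a starting position. It is a software illustration only: full search is used because it is the simplest pattern, not because the slide says the hardware uses it, and the names `sad` and `best_match` and their parameters are assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

def best_match(cur, ref, top, left, search):
    """Full search in ref within +/- search pixels of (top, left).

    Returns (sad, dy, dx) for the best candidate, or None if none fit.
    """
    size = len(cur)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if r < 0 or c < 0 or r + size > len(ref) or c + size > len(ref[0]):
                continue
            cand = [row[c:c + size] for row in ref[r:r + size]]
            score = sad(cur, cand)
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best
```

Each candidate position costs one block SAD, which is why a high SAD rate translates directly into how large a search range can be covered per frame.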

49 49 Viterbi Decoder
8 Add-Compare-Select (ACS) units
Highly configurable
  – Up to 32 different rates, including 1/2 and 3/4
  – Decodes codes up to constraint length 10
72 Mbps @ 789 MHz, 1.3 V for rate = 1/2
0.17 mm²
Preliminary measurements: functional at 894 MHz, 17.55 mW @ 1.3 V
[Figure: layout, 0.41 mm on a side]
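
An add-compare-select step is the core operation of Viterbi decoding: for each trellis state it adds a branch metric to each of two predecessor path metrics and keeps the smaller sum. The sketch below is generic, not the decoder's actual datapath; the function names are assumptions, and `acs_passes` assumes a binary-input code and one state update per ACS unit per pass.

```python
def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state.

    Returns (surviving path metric, 0 or 1 for which predecessor won).
    """
    cand_a = pm_a + bm_a
    cand_b = pm_b + bm_b
    return (cand_a, 0) if cand_a <= cand_b else (cand_b, 1)

def acs_passes(constraint_length, acs_units=8):
    """Passes needed per trellis step when states are shared among ACS units."""
    states = 2 ** (constraint_length - 1)
    return -(-states // acs_units)  # ceiling division
```

A constraint length 7 code has 64 states, so `acs_passes(7)` is 8 with the 8 ACS units listed on the slide; constraint length 10 has 512 states and `acs_passes(10)` is 64.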

50 50 Shared Memories
Ports for up to four processors (two connected in this chip) to directly connect to the block, which provides
  – Port priority
  – Port request arbitration
  – Programmable address generation supporting multiple addressing modes
Uses a 16 KByte single-ported SRAM
1.28 GHz operation, 1.3 V
  – One read or write per cycle
  – 20.5 Gbps peak throughput
0.34 mm²
Preliminary measurements: functional at 1.3 GHz, 4.55 mW @ 1.3 V
[Figure: layout, 0.41 mm x 0.82 mm]

51 51 Inter-Processor Communication Network

52 52 Inter-Processor Communication
Circuit-switched source-synchronous communication
  – Each link has a clk, 16-bit data bus, valid, and request
  – Core can write to any combination of the 8 outputs under software control
  – Core can read from any 2 of the 8 inputs using statically configured FIFOs

53 53 Long-Distance Communication
Allows communication across tiles without disturbing cores
  – Long-distance links may be pipelined or not, depending on source clock frequency, distance, and latency
[Figure: a long-distance link from a source tile to a destination tile]

54 54 Internal communication circuits

55 55 How do long-distance links impact allowable source clock frequencies?

56 56 Configurable Local Clock Oscillator

57 57 Configurable Local Clock Oscillator

58 58 Per-Processor DVFS
Each processor tile contains a core that operates at:
  – A fully independent clock frequency: any frequency below maximum; halts, restarts, and changes arbitrarily
  – A dynamically changeable supply voltage: VddHigh or VddLow; disconnected for leakage reduction
VddAlwaysOn powers DVFS and inter-processor communication

59 59 DVFS Controller
Voltage and frequency are set by:
  – Static configuration
  – Software
  – Hardware (controller), based on FIFO “fullness” and processor “stalling frequency”
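
One way a hardware controller could act on FIFO “fullness” is a threshold rule with hysteresis, sketched below. This is a hypothetical policy, not the controller described in the paper: the function `next_vdd`, the `low_mark` and `high_mark` thresholds, and the choice to raise the supply when the input FIFO fills are all assumptions. Only the supply names VddHigh and VddLow and the 64-entry FIFO depth come from the slides.

```python
def next_vdd(current, fifo_fill, depth=64, low_mark=0.25, high_mark=0.75):
    """Pick the supply for the next interval from input FIFO occupancy.

    A filling input FIFO suggests the core is falling behind (raise Vdd);
    a nearly empty one suggests slack (lower Vdd). In between, keep the
    current setting so the supply does not toggle on small fluctuations.
    """
    fill = fifo_fill / depth
    if fill >= high_mark:
        return "VddHigh"
    if fill <= low_mark:
        return "VddLow"
    return current
```

For example, `next_vdd("VddLow", 60)` returns `"VddHigh"`, `next_vdd("VddHigh", 4)` returns `"VddLow"`, and a half-full FIFO leaves the setting unchanged.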

60 60 Agenda
The Second Generation AsAP
  – Processors and Shared Memories
  – On-chip Communication
  – DVFS
Analysis and Summary

61 61 Tile and Core Area Breakdowns
Communication area approximately 7%
DVFS area approximately 8%
Routing complexity results in 27% for gaps and fillers
SRAMs 34%
[Figure: tile area and core area breakdown charts]

62 62 Die Micrograph and Key Data
65 nm STMicroelectronics low-leakage CMOS
55 million transistors, 39.4 mm²
Single tile
  – Transistors: 325,000
  – Area: 0.17 mm²
  – Max. frequency: 1.19 GHz @ 1.3 V
  – Power (100% active): 59 mW @ 1.19 GHz, 1.3 V; 608 μW @ 66 MHz, 0.675 V
  – Power (JPEG): 6.3 mW @ 1.095 GHz, 1.3 V
[Figure: die micrograph with dimension labels 5.939 mm and 5.516 mm, a 410 μm tile, and the FFT, Viterbi, motion estimation, and shared memory blocks marked]

63 63 Complete 802.11a Baseband Receiver
22 processors
  – 32 processors using only nearest-neighbor connections (46% increase)
54 Mbps throughput
75 mW @ 610 MHz, 1.3 V
6x faster than TI C62x, 2x faster than SODA, 4x faster than LART (all scaled to 65 nm technology, 1.3 V and 610 MHz)
[Figure: processor mappings of 802.11a without long-distance links and with long-distance links]

64 64 Summary
65 nm low power STMicroelectronics process
Maintains the basic GALS architecture of AsAP
164 homogeneous processors
  – 1.2 GHz, 59 mW, 100% active @ 1.3 V
Three 16 KB shared memories
Three dedicated-purpose processors
Long-distance circuit-switched communication increases mapping efficiency without overhead
DVFS nets a 48% reduction in energy for JPEG with only 8% performance loss

65 65 References
B. M. Baas, “A parallel programmable energy-efficient architecture for computationally-intensive DSP systems,” in 37th Asilomar Conf. Signals, Systems and Computers, Nov. 2003.
Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, “An asynchronous array of simple processors for DSP applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2006, pp. 428–429.
Z. Yu, M. J. Meeuwsen, R. W. Apperson, O. Sattari, M. Lai, J. W. Webb, E. W. Work, D. Truong, T. Mohsenin, and B. M. Baas, “AsAP: An asynchronous array of simple processors,” IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 695–705, Mar. 2008.
B. Baas, Z. Yu, et al., “AsAP: A fine-grained many-core platform for DSP applications,” IEEE Micro, vol. 27, no. 2, pp. 34–45, Mar. 2007.
D. Truong, W. Cheng, et al., “A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling,” in Symposium on VLSI Circuits, June 2008, p. C3.1.

