Published by Shonda Reynolds. Modified over 9 years ago.
1
CGRA QUIZ
2
Quiz
1. What is the fundamental drawback of fine-grained architectures that led to the exploration of coarse-grained reconfigurable architectures? (Maximum of 5 words!)
2. Give two examples of each coarse-grained architecture type: Mesh, Linear Array, and Crossbar.
3. Indicate whether each of the following architectures supports some form of partial reconfiguration: PipeRench, KressArray, CHESS.
3
COARSE GRAINED RECONFIGURABLE ARCHITECTURES
DAY - 2
04/21/2014
Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku
4
Outline
- Coarse Grained Reconfigurable Architectures
  - RAW
  - CHESS
- Basics of Network-on-Chip (NoC)
- Project Overview
5
Raw Architecture Workstation (RAW)
- Developed at MIT.
- Fully exposes low-level hardware architectural details to the compiler.
- Lacks hardware for register renaming and dynamic instruction issue.
- Seeks to execute pipelined applications (such as signal processing) efficiently.
- Motivation?
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
6
Change Is Around the Corner
- Processor performance is not scaling as before.
- Wire delay and power
  - Old view: the chip looks small to a wire.
  - New view: the chip looks much bigger to a wire; communication is expensive even on chip!
[Figure: chip size compared with the distance a signal can travel in 1 cycle]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
7
Raw Architecture
How do we arrive at this design?
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
8
Problems with Monolithic Designs
- Super-wide general-purpose processors are no longer practical.
- Centralized control with global operand routing.
- Area, power, and frequency concerns.
[Figure: monolithic processor; labels: wide fetch (16 inst), unified load/store queue, PC, RF, ALU, bypass net, control]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
9
[Figure: monolithic processor; labels: wide fetch (16 inst), unified load/store queue, PC, RF, ALU, bypass net, control, and the operations + and >>]
10
Spatial Architectures
[Figure labels: ALU, bypass net, RF]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
11
Spatial Architectures
[Figure labels: ALU, RF, bypass net]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
12
Spatial Architectures
[Figure labels: ALU, RF]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
13
Exploiting Locality
[Figure labels: ALU, RF, and the operations >> and +]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
14
Distribute the Register File
[Figure labels: ALU, RF]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
15
Distribute the Rest
[Figure labels: ALU, RF, control, wide fetch (16 inst), unified load/store queue, PC, and sixteen I$/PC/D$ groups]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
16
Tiled-Processor Architecture
[Figure labels: ALU, RF, and sixteen I$/PC/D$ groups]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
17
Tiled-Processor Architecture
- Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer.
- Tile abstraction is quite powerful (e.g., power: resources used as necessary).
- Easily scalable.
- All signals registered at tile boundaries; no global signals.
- Easier to tune the frequency.
- Easier to do the physical design.
- Easier to verify.
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
18
Raw On-Chip Networks
- 2 static networks
  - Provide low-latency communication between tiles.
  - Routing decisions are made at compile time.
- 2 dynamic networks
  - The header encodes the destination.
  - Transport unpredictable operations such as interrupts and cache misses.
[Figure labels: Computation Resources, Switch Processor]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
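The contrast between the two network types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which routing algorithm is used, so dimension-ordered (X first, then Y) routing is assumed here, and the names `xy_route` and `static_schedule` are hypothetical. A dynamic message carries its destination and is steered hop by hop; a static transfer has its hops worked out ahead of time and stored per switch.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (column, row) position on the tile grid


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Tiles visited from src to dst, moving along X first and then Y.

    Assumption: dimension-ordered routing; the slides do not specify it.
    """
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def static_schedule(transfers: List[Tuple[Tile, Tile]]) -> Dict[Tile, List[Tile]]:
    """Compile-time view: expand known transfers into per-switch next-hop lists."""
    table: Dict[Tile, List[Tile]] = {}
    for src, dst in transfers:
        path = xy_route(src, dst)
        for here, nxt in zip(path, path[1:]):
            table.setdefault(here, []).append(nxt)
    return table


if __name__ == "__main__":
    # Dynamic case: only the destination (the "header") is known at send time.
    print(xy_route((0, 0), (2, 1)))
    # Static case: every transfer is known in advance, so each switch gets a fixed list.
    print(static_schedule([((0, 0), (1, 1)), ((1, 0), (0, 0))]))
```

In the static case nothing is decided at run time: each switch simply replays its precomputed list, which is one way to read the slide's low-latency claim.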
19
Inside the Compute Processor
[Figure: compute-processor pipeline; stage labels: IF, D, RF, E, M1, M2, A, TL, TV, F, P, U, F4, WB; registers r24-r27 shown at the input FIFOs from the static router and at the output FIFOs to the static router; local bypass network]
20
Raw Compiler Example
Source code:
    tmp3 = (seed*6+2)/3
    v2 = (tmp1 - tmp3)*5
    v1 = (tmp1 + tmp2)*3
    v0 = tmp0 - v1
    ….
Instruction listing (appears twice on the slide):
    pval5=seed.0*6.0
    pval4=pval5+2.0
    tmp3.6=pval4/3.0
    tmp3=tmp3.6
    v3.10=tmp3.6-v2.7
    v3=v3.10
    v2.4=v2
    pval3=seed.0*v2.4
    tmp2.5=pval3+2.0
    tmp2=tmp2.5
    pval6=tmp1.3-tmp2.5
    v2.7=pval6*5.0
    v2=v2.7
    seed.0=seed
    pval1=seed.0*3.0
    pval0=pval1+2.0
    tmp0.1=pval0/2.0
    tmp0=tmp0.1
    v1.2=v1
    pval2=seed.0*v1.2
    tmp1.3=pval2+2.0
    tmp1=tmp1.3
    pval7=tmp1.3+tmp2.5
    v1.8=pval7*3.0
    v1=v1.8
    v0.9=tmp0.1-v1.8
    v0=v0.9
- Assign instructions to the tiles, maximizing locality.
- Generate the static router instructions to transfer operands and streams between tiles.
[Figure label: Raw tile]
[Slide Source: Michael B. Taylor]
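To make "maximizing locality" concrete, the sketch below scores a placement by the number of tile-to-tile hops its operands must travel. It is an illustration, not the Raw compiler's algorithm: the dependence table covers only a subset of the values named on the slide, both placements are invented for the example, and `hops` and `communication_cost` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (column, row) of the tile an instruction is assigned to

# value -> operands it reads (a subset of the slide's instruction listing)
DEPS: Dict[str, List[str]] = {
    "pval1": ["seed.0"],
    "pval0": ["pval1"],
    "tmp0.1": ["pval0"],
    "pval7": ["tmp1.3", "tmp2.5"],
    "v1.8": ["pval7"],
    "v0.9": ["tmp0.1", "v1.8"],
}


def hops(a: Tile, b: Tile) -> int:
    """Manhattan distance between two tiles on the grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(deps: Dict[str, List[str]], placement: Dict[str, Tile]) -> int:
    """Total hops travelled by operands from producer tile to consumer tile."""
    return sum(
        hops(placement[operand], placement[value])
        for value, operands in deps.items()
        for operand in operands
    )


# Hypothetical placement that keeps each dependence chain on one tile.
LOCAL: Dict[str, Tile] = {
    "seed.0": (0, 0), "pval1": (0, 0), "pval0": (0, 0), "tmp0.1": (0, 0),
    "tmp1.3": (1, 0), "pval7": (1, 0), "v1.8": (1, 0), "v0.9": (1, 0),
    "tmp2.5": (1, 1),
}

# Hypothetical placement that scatters the same instructions across tiles.
SCATTERED: Dict[str, Tile] = {
    "seed.0": (0, 0), "pval1": (1, 1), "pval0": (0, 0), "tmp0.1": (1, 1),
    "tmp1.3": (0, 1), "tmp2.5": (1, 0), "pval7": (0, 0), "v1.8": (1, 1),
    "v0.9": (0, 0),
}

if __name__ == "__main__":
    print("local:", communication_cost(DEPS, LOCAL))
    print("scattered:", communication_cost(DEPS, SCATTERED))
```

Every operand that crosses a tile boundary also needs a static router instruction on each switch along the way, so a lower hop count means less routing code as well as less latency.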
21
Architectural Comparison
[Figure: RAW, Superscalar, Multiprocessor]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
22
Application Mapping on RAW
[Figure labels: four-way parallelized scalar code; two-way threaded Java program; httpd; Zzzz..; video data stream; frame buffer and screen; custom data path pipeline (by compiler); sleep mode (power saving)]
- Fast inter-tile ALU forwarding: 3 cycles
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
23
RAW - Performance
Taylor, Michael Bedford, et al. "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.
24
CHESS - A Reconfigurable Arithmetic Array for Multimedia Applications
- Designed at Hewlett-Packard Laboratories in 1999.
- Aims to speed up arithmetic operations for multimedia applications and to improve memory density.
- Principal goals of CHESS:
  - Increased arithmetic computational density
  - Increased memory bandwidth
  - Increased capacity of internal memories
  - Enhanced flexibility
  - Rapid reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
25
CHESS - Architecture
- 4-bit ALUs
- 4-bit bus wiring
- Switchboxes
- Chessboard layout
- Embedded block RAMs
- Speed and hierarchical line lengths
- Small configuration memories
- No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
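A natural question about an array of 4-bit ALUs is how it handles wider operands. One plausible approach, sketched below, is to chain nibble-wide ALUs through their carry signals. The slide states only that the ALUs and buses are 4 bits wide, so the carry chain, the function names `alu4_add` and `wide_add`, and the 16-bit default are assumptions made for illustration.

```python
from typing import Tuple


def alu4_add(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Model one 4-bit ALU doing an add: returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def wide_add(a: int, b: int, width_bits: int = 16) -> int:
    """Add two unsigned values by chaining width_bits / 4 nibble ALUs.

    Assumption: wider words are built by passing the carry from one
    4-bit ALU to the next; the slides do not describe this mechanism.
    """
    if width_bits <= 0 or width_bits % 4:
        raise ValueError("width_bits must be a positive multiple of 4")
    result, carry = 0, 0
    for i in range(width_bits // 4):
        shift = 4 * i
        nibble_sum, carry = alu4_add(a >> shift, b >> shift, carry)
        result |= nibble_sum << shift
    # The carry out of the top nibble is dropped, so the sum wraps.
    return result


if __name__ == "__main__":
    print(hex(wide_add(0x1234, 0x0FCD)))
```

A 16-bit add in this model occupies four ALUs, so the operand width of an application directly determines how much of the array each operation uses.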
26
CHESS - Components
[Figures: ALU logic design; switchbox]
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
27
CHESS - Routing Structure
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
28
CHESS - Performance
- High computational density
- Efficient multiplies due to embedded ALU
- Issues:
  - No reported software or application results
  - No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
29
Comparison: CHESS and MATRIX
- Both use a 2D array of ALUs.
- In both, instructions can be generated within the array.
- Both architectures are flexible.
- CHESS is 4-bit, whereas MATRIX is 8-bit.
- CHESS does not support run-time reconfiguration, but it configures very quickly because few configuration bits are required.
- CHESS has high computational density.
- CHESS is aimed at arithmetic operations, whereas MATRIX is more general purpose.
30
Network-on-Chip (NoC)
31
Project Overview
- Implementing a coarse-grained and hybrid reconfigurable architecture
- NoC interconnection between processing elements
- Supports variable block size motion estimation
- Motion estimation algorithms:
  - Full Search
  - Diamond Search
Verma, Ruchika, and Ali Akoglu. "A coarse grained and hybrid reconfigurable architecture with flexible NoC router for variable block size motion estimation." Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
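For readers new to motion estimation, here is a minimal software sketch of Full Search block matching with a variable block size. It is not the project's hardware design: the sum of absolute differences (SAD) is assumed as the matching cost, frames are plain lists of pixel rows, and the names `sad` and `full_search` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of pixel intensities


def sad(cur: Frame, ref: Frame, bx: int, by: int,
        dx: int, dy: int, bw: int, bh: int) -> int:
    """Sum of absolute differences between a block and its displaced candidate."""
    return sum(
        abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
        for r in range(bh)
        for c in range(bw)
    )


def full_search(cur: Frame, ref: Frame, bx: int, by: int,
                bw: int, bh: int, search_range: int) -> Tuple[int, int, int]:
    """Try every displacement in the window; return (dx, dy, best SAD).

    The block at (bx, by) must lie inside the frame, so the zero
    displacement is always a valid candidate.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0, sad(cur, ref, bx, by, 0, 0, bw, bh))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + bw > width or y + bh > height:
                continue  # candidate block would fall outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, bw, bh)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best


if __name__ == "__main__":
    # Reference frame: a simple ramp. Current frame: the same content shifted,
    # so the block at (2, 2) matches the reference at displacement (1, 2).
    ref = [[r * 8 + c for c in range(8)] for r in range(8)]
    cur = [[ref[min(r + 2, 7)][min(c + 1, 7)] for c in range(8)] for r in range(8)]
    print(full_search(cur, ref, bx=2, by=2, bw=4, bh=4, search_range=2))
```

Full Search evaluates every candidate in the window, which is why it is costly. Diamond Search, the other algorithm listed on the slide, examines only a subset of candidates per step and so trades some accuracy for far fewer SAD computations.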
32
[Figure: 4 x 4 grid of CPEs, CPE (1,1) through CPE (4,4), linked by c_d and r_d connections, with PE 2(1) through PE 2(4), PE 3, Main Memory, and a Memory Interface (MI). Labelled signals: data_load_control (16 bits), reference_block_id (5 bits), c_d_(x,y) (32 bits), r_d_(x,y) (32 bits). Other labelled widths: 32 bits, 14 bits, 12 bits.]
33
QUESTIONS?