COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION
03/26/2012
OUTLINE
- Introduction
- Motivation
- Network-on-Chip (NoC)
- ASIC based approaches
- Coarse grain architectures
- Proposed Architecture
- Results
INTRODUCTION
- Goal: an application-specific hybrid coarse-grained reconfigurable architecture built on a NoC
- Purpose: support Variable Block Size Motion Estimation (VBSME)
- First approach of its kind: not offered by ASIC or other coarse-grained reconfigurable architectures
- Differences: use of intelligent NoC routers; support for both full and fast search algorithms
MOTIVATION
- H.264 Motion Estimation
[Equation residue: a complexity expression Θ(f) was not recoverable from the extraction]
MOTION ESTIMATION
[Figure: the current 16x16 block of the current frame is matched against candidate positions inside a search window of the previous frame; the best match defines the motion vector]
- Cost metric: Sum of Absolute Differences (SAD)
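The SAD cost named on this slide is simple enough to sketch directly. The following is an illustrative Python fragment (not part of the original slides); blocks are given as lists of pixel rows:

```python
def sad(current, reference):
    """Sum of Absolute Differences between two equally sized pixel blocks,
    each given as a list of rows of 8-bit pixel values."""
    return sum(abs(c - r)
               for crow, rrow in zip(current, reference)
               for c, r in zip(crow, rrow))

# Toy example: a flat 4x4 current block against one candidate position.
cur = [[100] * 4 for _ in range(4)]
ref = [[97] * 4 for _ in range(4)]
print(sad(cur, ref))  # 16 pixels, each differing by 3 -> 48
```

A full motion search evaluates this cost at every candidate displacement in the search window and keeps the displacement with the smallest SAD.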
SYSTEM-ON-CHIP (SOC)
- Single-chip systems with common components: microprocessor, memory, co-processor, other blocks
- With increased processing power and data-intensive applications, facilitating communication between the individual blocks has become a challenge
TECHNOLOGY ADVANCEMENT
[Chart omitted in the extraction]
DELAY VS. PROCESS TECHNOLOGY
[Chart omitted in the extraction]
NETWORK-ON-CHIP (NOC)
- Efficient communication via transfer protocols
- Must take into account the strict constraints of the SoC environment
- Types of communication structure: bus, point-to-point, network
COMMUNICATION STRUCTURES
[Figure omitted in the extraction: bus, point-to-point, and network topologies]
BUS VS. NETWORK
- Bus (con): every attached unit adds parasitic capacitance | Network (pro): local performance does not degrade with scaling
- Bus (con): bus timing is difficult | Network (pro): network wires can be pipelined
- Bus (con): arbitration can become a bottleneck | Network (pro): routing decisions are distributed
- Bus (con): testability is problematic and slow | Network (pro): locally placed BIST is fast and easy
- Bus (con): bandwidth is limited and shared by all | Network (pro): bandwidth scales with network size
- Bus (pro): latency is wire speed once access is granted | Network (con): contention may cause latency
- Bus (pro): very compatible | Network (con): IPs need smart wrappers
- Bus (pro): simple to understand | Network (con): relatively complicated
EXAMPLE OF NOC
[Figure omitted in the extraction]

ROUTER ARCHITECTURE
[Figure omitted in the extraction]
BACKGROUND: ME ARCHITECTURES
- Prior work spans general-purpose processors, ASICs, FPGAs, and coarse-grained architectures
- Most support only fixed block size ME (FBSME); VBSME only with redundant hardware
- General-purpose processors can exploit parallelism, but are limited by their inherently sequential nature and register-based data access
CONTINUED…
- ASIC: either no support for all H.264 block sizes, or support provided at the cost of high area overhead
- Coarse-grained architectures: overcome the drawbacks of LUT-based FPGAs with coarser-granularity elements and fewer configuration bits, but can suffer from under-utilization of resources
ASIC APPROACHES (topology / SAD accumulation)
- 1D systolic array (partial sum): reference pixels are broadcast and the SAD computation for each 4x4 block is pipelined; each processing element computes a pixel difference, accumulates it into the incoming partial SAD, and sends the result to the next PE. Drawback: a large number of registers to store partial SADs.
- 2D systolic array (parallel sum): all pixel differences of a 4x4 block are computed in parallel and reference pixels are reused; the direction of data transfer depends on the search pattern. Drawbacks: area overhead and high latency.
- Mesh-based architecture: stores partial SADs; area overhead, high latency, and no VBSME support.
OU'S APPROACH
- VBSME processor: 16 SAD modules process 16 4x4 motion vectors; a chain of adders and comparators computes the larger SADs
- SAD module: the PE array is the basic computational element, built as a cascade of four 1D arrays
- 1D array: a 1D systolic array of 4 PEs, each computing a 1-pixel SAD
[Figure: VBSME processor — 16 SAD modules (Module 0..15), each fed current_block_data_i and search_block_data_i and producing SAD_i and MV_i; block strips A/B with strip_sel, read/write address lines, a MUX for the SADs, and the four cascaded 1D arrays of the PE array; 32-bit datapath]
[Figure: 1D array — PEs with accumulators (ACCM) and delay registers on a 32-bit datapath]
PUTTING IT TOGETHER
- Each clock cycle, columns of the current 4x4 sub-block are scheduled through a delay line and two sets of search block columns are broadcast
- 4 block matching operations execute concurrently per SAD module
- 4x4 SADs -> 4x4 motion vectors; a chain of adders and comparators combines 4x4 SADs into 4x8 SADs, and so on up to 16x16 SADs
- Drawbacks: no reuse of search data between modules; resource wastage
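The adder chain that builds the larger-partition SADs from the sixteen 4x4 SADs can be sketched in software. This is a hedged illustration, not the slides' hardware: the function name is invented, block sizes follow the H.264 width x height naming, and the grid is row-major:

```python
def merge_sads(sad4x4):
    """sad4x4: 4x4 grid (row-major) of the sixteen 4x4-block SADs of one
    macroblock. Returns the SADs of every larger H.264 partition."""
    # 4x8 partitions: vertically adjacent 4x4 pairs
    sad4x8 = [[sad4x4[r][c] + sad4x4[r + 1][c] for c in range(4)] for r in (0, 2)]
    # 8x4 partitions: horizontally adjacent 4x4 pairs
    sad8x4 = [[sad4x4[r][c] + sad4x4[r][c + 1] for c in (0, 2)] for r in range(4)]
    # 8x8 partitions: the four quadrants, from vertically stacked 4x8 sums
    sad8x8 = [[sad4x8[r][c] + sad4x8[r][c + 1] for c in (0, 2)] for r in range(2)]
    # 8x16, 16x8, and the full 16x16 macroblock
    sad8x16 = [sad8x8[0][c] + sad8x8[1][c] for c in range(2)]
    sad16x8 = [sad8x8[r][0] + sad8x8[r][1] for r in range(2)]
    sad16x16 = sad8x16[0] + sad8x16[1]
    return sad4x8, sad8x4, sad8x8, sad8x16, sad16x8, sad16x16

# Uniform 4x4 SADs of 1 give a 16x16 SAD of 16.
print(merge_sads([[1] * 4 for _ in range(4)])[5])  # 16
```

In the hardware, the comparator after each adder stage additionally tracks which candidate displacement produced the minimum SAD for each partition size.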
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES
Performance in clock cycles (frame size M x 0.8M):
- ChESS: (M x 0.8M)/256 x 17 x 17
- MATRIX: (M x 0.8M)/256 x 17 x 17
- RaPiD: 272 + 32M + 14.45M^2
Issues: resource utilization; generic interconnect
PROPOSED ARCHITECTURE
- 2D architecture: 16 CPEs, 4 PE2s, 1 PE3, main memory, memory interface
- CPE (Configurable Processing Element): PE1, NoC router, network interface
- Current and reference blocks are delivered from main memory
[Figure: 4x4 grid of CPEs (1,1)..(4,4) connected to PE2(1)..PE2(4) and PE3, with main memory and memory interface (MI); signals include data_load_control (16 bits), reference_block_id (5 bits), and 32-bit c_d/r_d data lines per CPE]
[Figure: PE1 datapath — sixteen 8-bit subtractors with current/reference pixel registers (CPR/RPR), a 10-bit and a 12-bit adder, and a comparator with register producing the 4x4 motion vector; 32-bit c_d/r_d inputs and links to/from the NI and the east/south neighbors]
NETWORK INTERFACE
[Figure: control unit, packetization unit, and depacketization unit; sends reference_block_id and data_load_control to the MI]
NOC ROUTER
- Receives packets from the NI or an adjacent router and stores them in a ring buffer
- XY routing protocol: extracts the direction of data transfer from the header packet
- Updates the number of hops
- Sends packets to the NI or an adjacent router
[Figure: input and output controllers with request/ack signals, header decoder, ring buffer with first/last indices, and ports for PE1, East, West, North, South]
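The XY routing decision named on this slide can be sketched in a few lines. This is an illustrative model, not the slides' hardware: the coordinate convention (y growing southward, as in a row/column mesh) and port names are assumptions:

```python
def xy_route(cur, dst):
    """Return the output port for a packet at router `cur` headed to `dst`
    on a 2D mesh. XY routing: correct the X coordinate first, then Y,
    then deliver to the locally attached PE."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:          # assumption: y grows toward the south
        return "SOUTH"
    if dy < cy:
        return "NORTH"
    return "LOCAL"       # arrived: hand the packet to PE1 via the NI

print(xy_route((1, 1), (3, 2)))  # X not yet corrected -> EAST
```

Because every packet fully corrects X before touching Y, XY routing is deterministic and deadlock-free on a mesh, which keeps the router's header decoder simple.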
[Figure: input/output controllers of Router 1 and Router 2, exchanging 1-bit req/ack signals and a 32-bit packet]
Sending a packet from Router 1 to Router 2:
- Step 1: Router 1 initiates a message to Router 2
- Step 2: Router 1 sends a 1-bit request signal to Router 2
- Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space
- Step 4: Router 2 sends a 1-bit ack if space is available
- Step 5: Router 1 sends the 32-bit packet
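The five-step req/ack flow above can be modeled with two small objects. This is a hedged sketch under assumed details (the buffer capacity and class/method names are illustrative; the slides do not specify buffer depth):

```python
class Router:
    """Receiving side: grants an ack only when it can accept a packet."""
    def __init__(self, capacity=4):
        self.buffer = []
        self.capacity = capacity
        self.busy = False

    def handle_request(self):
        # Steps 3-4: ack only if not busy and buffer space remains.
        return not self.busy and len(self.buffer) < self.capacity

    def receive(self, packet):
        self.buffer.append(packet)

def send(packet, dst):
    """Steps 1-2 and 5: raise req; transfer the packet only after ack."""
    if dst.handle_request():   # req -> ack granted
        dst.receive(packet)    # the 32-bit packet is transferred
        return True
    return False               # no ack: sender stalls and retries later

r2 = Router(capacity=1)
print(send("pkt0", r2))  # True: buffer space available
print(send("pkt1", r2))  # False: buffer full, no ack
```

The handshake gives the design backpressure: a congested router simply withholds the ack, and the stalled packet waits in the upstream ring buffer.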
PE2 AND PE3
- Adders, muxes, de-muxes, comparators, registers
FAST SEARCH ALGORITHM: DIAMOND SEARCH
- 9 candidate search points
- Numbers represent the order in which the reference frames are processed
- Directed edges are labeled with data transmission equations derived from the data dependencies
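The slides do not give pseudocode, so the following is the standard diamond search for reference: the 9-point large diamond is re-centered until its center wins, then one 5-point small-diamond pass refines the result. The cost function here is a stand-in for the SAD the architecture computes:

```python
# Standard large (LDSP, 9 points) and small (SDSP, 5 points) diamond patterns.
LARGE = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
         (-1, -1), (-1, 1), (1, -1), (1, 1)]
SMALL = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def diamond_search(cost, start=(0, 0), max_steps=32):
    """Re-center the large diamond until its center is the minimum,
    then pick the final motion vector from one small-diamond pass."""
    center = start
    for _ in range(max_steps):
        candidates = [(center[0] + dx, center[1] + dy) for dx, dy in LARGE]
        best = min(candidates, key=cost)
        if best == center:
            break
        center = best
    final = [(center[0] + dx, center[1] + dy) for dx, dy in SMALL]
    return min(final, key=cost)

# Toy cost: distance from the "true" motion vector (3, 1).
print(diamond_search(lambda p: abs(p[0] - 3) + abs(p[1] - 1)))  # (3, 1)
```

The "order of processing" on the slide corresponds to how these candidate points are assigned to PEs, and the edge labels describe which reference pixels one PE can forward to another instead of re-fetching them from memory.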
EXAMPLE
[Figure: frame, macroblock, and SAD computation example]
DATA TRANSFER
- Data transfer between PE1(1,1) and PE1(1,3)
- Individual points and intersecting points
[Figure omitted in the extraction]
DATA LOAD SCHEDULE
[Schedule table omitted in the extraction]
OTHER FAST SEARCH ALGORITHMS
- Hexagon
- Big Hexagon
- Spiral
FULL SEARCH
[Figure omitted in the extraction]
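Full search is exhaustive: every candidate displacement in the window is evaluated. A minimal sketch, assuming a +/-8 search range (the slides do not state the window size) and a stand-in cost function in place of the SAD:

```python
def full_search(cost, radius=8):
    """Evaluate every displacement in a (2*radius+1)^2 window; return the
    displacement with the minimum cost (first winner on ties)."""
    best_mv, best_cost = (0, 0), cost((0, 0))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            c = cost((dx, dy))
            if c < best_cost:
                best_mv, best_cost = (dx, dy), c
    return best_mv

# Toy cost: distance from the "true" motion vector (3, -2).
print(full_search(lambda p: abs(p[0] - 3) + abs(p[1] + 2)))  # (3, -2)
```

Unlike diamond search, full search is guaranteed optimal within the window; the architecture's job is to keep all PEs busy across these candidates while reusing overlapping reference pixels.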
RESULTS
[Results tables and charts omitted in the extraction (two slides)]