A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.

1 A Flexible Multi-Core Platform For Multi-Standard Video Applications
Soo-Ik Chae
Center for SoC Design Technology, Seoul National University
MPSoC 2009, Savannah, Georgia, USA

2 Content
- Motivation
- Proposed multi-core platform architecture
- RISC cluster
- Hardware operating system kernel
- Computation coprocessor architecture
- Communication architecture with two separated networks
- Design flow for application mapping
- Experimental result
- H.264/AVC 720p high profile decoder implementations
- Future work

3 High-performance Video Systems
- Huge computation load
  - 60 GOPS to decode 1080p at 30 fps
  - Dedicated H/W blocks (for high-end applications)
- Multiple standards / new standards
  - MPEG2/4, H.264, DivX, VC-1, etc.
  - Software with RISC, DSP, SIMD processors
- Embedded in mobile devices
  - PMPs, smart phones, etc.
  - Area and energy efficiencies are critical
- Large data transfers and memories
  - At least 96MB for 1080p decoders
  - Application-specific optimized communication and memory architectures
- Should satisfy all of these CONFLICTING requirements!!
  - Flexible high-performance platform

4 Proposed Multi-core Platform Architecture
- An array of RISC clusters with coprocessors, connected through two separated networks: control and data
- Each RISC cluster consists of up to 4 cores, shared I$ and D$, a HOSK, and coprocessors.

5 A Multi-threading RISC Cluster
Aspects:
- Multithreading: scheduling, context switching, load balancing
- Communication: synchronization, message passing (channel access)
- Implementation: area (complexity), scalability (# of threads, # of cores), coherent shared memory
Features:
- H/W-based task queue management + {priority+RR}-based task scheduling
- Fast context switching in 4 to 17 cycles
- Dynamic thread allocation + pre-emptive multithreading (priority or round robin)
- Thread migration without compulsory cache misses
- H/W-based mutex/semaphore
- No cache-coherency problem
- Channel access with a single coprocessor instruction
- Thread suspend or wake-up without software intervention
- On-chip/off-chip memory-based context memory
- No system services in each core + shared multiplier unit
- Use larger SRAMs
- No cache fragmentation
- The number of cores in a cluster is limited due to cache sharing.

6 Hardware Operating System Kernel (HOSK)
- Main controller: receives service requests and controls the other blocks
- Context manager: pre-fetches or saves contexts in the background
- Thread manager: schedules tasks and controls semaphores
- Context memory: SDRAM or SRAM
- Context switching order: R15, R14, R13, ...
- Context switching time by bus width:
  - 32-bit bus: 17 cycles
  - 64-bit bus: 9 cycles
  - 544-bit bus: 4 cycles
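
The three cycle counts on this slide are consistent with moving a 544-bit context (17 words of 32 bits) over the bus, with a floor of about 4 cycles. The sketch below is only a fitted model: `CONTEXT_BITS` and `MIN_CYCLES` are assumptions chosen to reproduce the slide's numbers, not values the author states.

```python
import math

CONTEXT_BITS = 544  # assumption: 17 x 32-bit words per thread context
MIN_CYCLES = 4      # assumption: fixed floor, fitted to the 544-bit case


def context_switch_cycles(bus_width_bits: int) -> int:
    """Estimate the cycles needed to move one context over the bus."""
    transfers = math.ceil(CONTEXT_BITS / bus_width_bits)
    return max(transfers, MIN_CYCLES)


for width in (32, 64, 544):
    print(f"{width}-bit bus: {context_switch_cycles(width)} cycles")
```

Under these assumptions the model returns 17, 9, and 4 cycles for the three bus widths listed on the slide.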

7 Computation Coprocessors
- Implemented for the computation-intensive parts of the video algorithms that cannot be run on the RISC cores.
- Local memory is accessed by both the RISC cores and the computation coprocessors
- The coprocessor task manager selects an available hardware thread for an outstanding coprocessor command
- General coprocessor interface
- Command queues to issue nonblocking coprocessor commands
- A pool of software threads
- A pool of hardware threads

8 Communication Network Architecture
- Among RISC clusters
- Two separated communication networks
  - Control network: smaller data sizes and synchronization information
    - based on conventional message passing
    - employs point-to-point hardware FIFOs
    - provides a new path to transfer data
  - Data network: larger data sizes
    - based on remote DMA operations, in a bus-like style
    - employs memory (local or global) and hardware FIFOs
    - handles high-rate data transfers for stream-based applications

9 Control Network: point-to-point FIFO based
- FIFO group
- Fully programmable connectivity
- Two-level distributed identification for FIFOs
- Each control transaction is initiated by a control core with a clusterID and a fifoID
- A control core can issue a command to the communication coprocessor in a single cycle for a control transaction
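
The two-level identification can be pictured as a lookup keyed by (clusterID, fifoID). The following is a toy software model only, not the hardware design: the class name `ControlNetwork`, the FIFO depth, and the `send`/`receive` methods are all hypothetical names introduced for illustration.

```python
from collections import deque


class ControlNetwork:
    """Toy model: each FIFO is addressed by (cluster_id, fifo_id)."""

    def __init__(self, num_clusters, fifos_per_cluster, depth=4):
        self.depth = depth  # hypothetical FIFO depth
        self.fifos = {
            (c, f): deque()
            for c in range(num_clusters)
            for f in range(fifos_per_cluster)
        }

    def send(self, cluster_id, fifo_id, message):
        """Push a control message; return False if the FIFO is full."""
        fifo = self.fifos[(cluster_id, fifo_id)]
        if len(fifo) >= self.depth:
            return False  # a real sender would stall or retry here
        fifo.append(message)
        return True

    def receive(self, cluster_id, fifo_id):
        """Pop the oldest message, or None if the FIFO is empty."""
        fifo = self.fifos[(cluster_id, fifo_id)]
        return fifo.popleft() if fifo else None


net = ControlNetwork(num_clusters=6, fifos_per_cluster=4)
net.send(2, 1, "mb_ready")
print(net.receive(2, 1))
```

In the platform itself a control core issues such a transaction as a single coprocessor command; the Python calls here only illustrate the addressing scheme.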

10 Data Communication Network
- Streaming data is stored in either a local memory or a global memory, depending on the size of the data.
- The platform provides nC2 (n choose 2) local data links
- Local data between two RISC clusters is exchanged through a shared local memory.
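
As a quick check on the link count, the sketch below enumerates one local link per unordered pair of clusters. It assumes that n is the number of RISC clusters; the function name `local_links` is hypothetical.

```python
from itertools import combinations
from math import comb


def local_links(num_clusters):
    """List every unordered cluster pair; each pair gets one local data link."""
    return list(combinations(range(num_clusters), 2))


links = local_links(6)
print(len(links))         # number of links for 6 clusters
print(comb(6, 2))         # same count from the closed form n(n-1)/2
print(links[:3])
```

For the 6-cluster 720p mapping this gives 15 pairwise links, and for the 4-cluster CIF mapping it gives 6.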

11 Global Data Communication with a DMAC
- A centralized DMA controller performs address translation, DMA request queue management, and data arrangement, so that the data cores are free from tasks related to data transfers
- The two global data networks, one for streaming data and one for I/D cache data, can be either unified or separated, depending on the configuration of the memory controllers
- A small buffer is used between the DMA controller and a RISC cluster for DMA operations
[Figure labels: RISC cluster with cores (P), coprocessor, I-$ and D-$; Global Data Network 1; DMA Controller with Multimedia Address Translator (MAT), Request Queue, and Data Recombination Unit (DRU); Memory Controller; Global Memory (Streaming Data); Global Data Network 2; Memory Controller; Global Memory (I/D Cache Data)]

12 Design Flow for Application Mapping
Starting with an application model and a platform model with constraints.
- Inputs and constraints: video specification; area, power; operating frequency; number of clusters; configurable network
- Flow steps: application profiling; function partitioning & clustering; cluster partitioning; communication mapping; TLM modeling & function profiling; HW/SW thread partitioning & mapping; performance estimation; verification
- Other diagram labels: SystemC simulation in TLM; multithreading code generation (for RISC clusters); RTL coding or generation (for coprocessors); core # and cache sizing for each cluster; sizing local memories; FPGA prototyping

13 Partitioning into clusters
- According to the profiling results for a reference software, the application is first partitioned into grouped functions
- Each grouped function is mapped onto a RISC cluster.
- Assumptions:
  - RISC clusters with 4 cores @ 200MHz
  - Utilization rate = 0.7
  - Upper MIPS bound for a 4-core cluster = 560 MIPS
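
The 560 MIPS bound follows from the stated assumptions if each core retires roughly one instruction per cycle. The sketch below makes that arithmetic explicit; the `ipc` parameter and both function names are assumptions introduced here, not part of the slides.

```python
def cluster_mips_bound(cores, freq_mhz, utilization, ipc=1.0):
    """Upper MIPS bound for one cluster (ipc = 1.0 is an assumption)."""
    return cores * freq_mhz * ipc * utilization


def fits_in_cluster(required_mips, bound):
    """True if a grouped function's MIPS demand is within the cluster bound."""
    return required_mips <= bound


bound = cluster_mips_bound(cores=4, freq_mhz=200, utilization=0.7)
print(round(bound))                 # 4 x 200 x 0.7
print(fits_in_cluster(480, bound))  # e.g. the ~480 MIPS intra-prediction load
```

A grouped function whose profiled demand exceeds the bound has to be split across clusters or moved partly into a coprocessor.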

14 Cluster Partitioning
- Example: an H.264/AVC CIF decoder is mapped into 4 RISC clusters
[Figure: H.264 decoder block diagram. Blocks: H.264 bitstream, Entropy Decoding, Inverse Quantization, Intra Prediction, Inter Prediction, Reconstruction, Deblocking Filter, MUX, Neighbor Reference Pixels, Current, 16x Multi Reference Frames, Frame N-1, output. MIPS labels: 231, 259, 113, 1087, 45, 356. Legend: RISC cluster]

15 Cluster Partitioning
- Example: an H.264/AVC 720p decoder is mapped into 6 RISC clusters
[Figure legend: RISC cluster]

16 Communication Mapping
1. Identify control and data flows among the clusters
2. Map each control flow into a specific FIFO in a FIFO group
3. Map a data flow for streaming into a local data network or the global data network according to its bandwidth requirement
4. Map data flows for I/D cache into the global memory
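
The four steps above can be sketched as a small classifier over flows. This is only an illustration: the `Flow` type, the `LOCAL_LIMIT_MB_S` threshold, and the rule that flows above the threshold go to the global network are all assumptions, since the slides give neither the threshold nor the direction of the comparison.

```python
from dataclasses import dataclass

LOCAL_LIMIT_MB_S = 100.0  # hypothetical threshold, not from the slides


@dataclass
class Flow:
    name: str
    kind: str  # "control", "stream", or "cache"
    bandwidth_mb_s: float = 0.0


def map_flows(flows, fifos_per_group=8):
    """Assign each identified flow to a FIFO, a data network, or global memory."""
    mapping = {}
    next_fifo = 0
    for flow in flows:
        if flow.kind == "control":        # step 2: one FIFO per control flow
            if next_fifo >= fifos_per_group:
                raise ValueError("FIFO group exhausted")
            mapping[flow.name] = f"fifo[{next_fifo}]"
            next_fifo += 1
        elif flow.kind == "stream":       # step 3: choose by bandwidth (assumed rule)
            if flow.bandwidth_mb_s <= LOCAL_LIMIT_MB_S:
                mapping[flow.name] = "local data network"
            else:
                mapping[flow.name] = "global data network"
        elif flow.kind == "cache":        # step 4: I/D cache traffic
            mapping[flow.name] = "global memory"
        else:
            raise ValueError(f"unknown flow kind: {flow.kind}")
    return mapping


flows = [  # step 1: flows identified among clusters (made-up examples)
    Flow("mb_done", "control"),
    Flow("residuals", "stream", 21.6),
    Flow("ref_pixels", "stream", 415.63),
    Flow("icache_fill", "cache"),
]
print(map_flows(flows))
```

The flow names are invented; only the bandwidth magnitudes echo figures that appear later in the deck.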

17 Example 1: Control Network Mapping for an H.264/AVC CIF high-profile decoder
- Transaction and size

18 Example 1: Data Network Mapping for an H.264/AVC CIF high-profile decoder
- Transaction and size

19 Example 2: Control Network Mapping for an H.264/AVC 720p high-profile decoder
- Transaction and size

20 Example 2: Data Network Mapping for an H.264/AVC 720p high-profile decoder
- Transaction and size

21 HW/SW Thread Partitioning & Mapping
For each RISC cluster (cores and coprocessors):
1. Profile the required MIPS of each thread from TLM modeling
2. Select # of RISC cores and HW threads in the coprocessor
3. Allocate the threads to the cores or the coprocessor in the cluster
4. Back to step 2 if the result is not good enough
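
The select/allocate/retry loop in steps 2 to 4 can be sketched as follows. This is a simplified illustration under several assumptions: a per-core budget of 140 MIPS (200 MHz at 0.7 utilization), greedy largest-first placement, and "good enough" meaning that no more than `max_hw_threads` threads spill to the coprocessor. None of these rules is stated in the slides, and the thread names and MIPS values in the usage lines are made up.

```python
CORE_MIPS = 140.0  # assumption: 200 MHz x 0.7 utilization per core


def allocate(threads, num_cores):
    """Step 3: place threads largest-first on the least-loaded core."""
    loads = [0.0] * num_cores
    placement = {}
    for name, mips in sorted(threads.items(), key=lambda kv: -kv[1]):
        core = min(range(num_cores), key=loads.__getitem__)
        if loads[core] + mips <= CORE_MIPS:
            loads[core] += mips
            placement[name] = f"core{core}"
        else:
            # does not fit on any core: hand it to a coprocessor HW thread
            placement[name] = "coprocessor"
    return placement, loads


def partition(threads, max_cores=4, max_hw_threads=1):
    """Steps 2 and 4: grow the core count until the allocation is acceptable."""
    for num_cores in range(1, max_cores + 1):
        placement, _ = allocate(threads, num_cores)
        hw = [n for n, where in placement.items() if where == "coprocessor"]
        if len(hw) <= max_hw_threads:
            return num_cores, placement
    return None  # no acceptable configuration within max_cores


profiled = {"a": 120.0, "b": 100.0, "c": 60.0, "d": 30.0}  # step 1 (made-up MIPS)
print(partition(profiled, max_hw_threads=0))
print(partition(profiled, max_hw_threads=1))
```

Allowing one hardware thread lets the same workload fit on fewer cores, which is the kind of trade-off the iteration is meant to expose.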

22 Example: Thread Partitioning & Mapping for Intra-prediction (1)
- ~480 MIPS for intra prediction in the 720p decoder
- Upper bound for a 4-core cluster: 560 MIPS
- Map all threads to SW
- Thread-level parallelism is limited due to dependency among the threads, which limits core utilization

threads           MIPS
control            2.9
luma 4x4         401.8
luma 8x8         298.3
luma 16x16        95.8
chroma cb/cr      78.0
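
The per-thread figures add up to far more than 480 MIPS, so the ~480 MIPS total presumably counts only one luma block size at a time. One reading that reproduces the figure, assuming the three luma block sizes are mutually exclusive for a given macroblock and taking the most expensive one as the worst case, is sketched below; the dictionary keys and function name are hypothetical.

```python
INTRA_MIPS = {
    "control": 2.9,
    "luma_4x4": 401.8,
    "luma_8x8": 298.3,
    "luma_16x16": 95.8,
    "chroma_cb_cr": 78.0,
}


def worst_case_intra_mips(mips):
    """Control + most expensive luma mode + chroma (assumed exclusive luma modes)."""
    luma = max(v for k, v in mips.items() if k.startswith("luma_"))
    return mips["control"] + luma + mips["chroma_cb_cr"]


total = worst_case_intra_mips(INTRA_MIPS)
print(f"{total:.1f} MIPS")  # 2.9 + 401.8 + 78.0
print(total <= 560)         # compare with the 4-core cluster bound
```

Under that reading the worst case is about 482.7 MIPS, which is in line with the ~480 MIPS on the slide and stays below the 560 MIPS cluster bound.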

23 Example: Thread Partitioning & Mapping for Intra-prediction (2)
- Dependency and intra-prediction order in a MB
- 4x4 luma intra prediction for luma samples
- Core utilization: limited because of limited parallelism (2)
- Reducing cores from 4 to 3

Intra-prediction ordering of the 4x4 blocks in a MB:
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15
[Figure: dependency among the 4x4 blocks]
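
The limited parallelism can be illustrated with a simple wavefront model. The sketch assumes that each 4x4 block depends on its left, top-left, top, and top-right neighbours inside the macroblock, which is a simplification of the actual H.264 neighbour availability rules; the function name is hypothetical.

```python
from collections import Counter


def wavefront_schedule(width=4, height=4):
    """Earliest step at which each 4x4 block can start, given its neighbours."""
    step = {}
    for y in range(height):
        for x in range(width):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [step[d] + 1 for d in deps if d in step]
            step[(x, y)] = max(ready, default=0)
    return step


schedule = wavefront_schedule()
per_step = Counter(schedule.values())
print(max(per_step.values()))      # most blocks that can run in the same step
print(max(schedule.values()) + 1)  # number of sequential steps for 16 blocks
```

Under this model at most 2 blocks are ready at any step and the 16 blocks need 10 sequential steps, which appears to match the parallelism of 2 noted on the slide and helps explain why extra cores add little.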

24 Example: Thread Partitioning & Mapping for Inter-prediction
- Inter prediction case in the 720p decoder
- Upper bound for a 4-core cluster: 560 MIPS
- One of several possible SW-HW partitions is selected.

threads                        MIPS
control                         2.9
luma   DMA setup              300.7
luma   Data Recombination    1838.6
luma   Interpolation         4644.9
chroma DMA setup              269.6
chroma Data Recombination     546.1
chroma Interpolation          414.7
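
The inter-prediction threads total roughly 8,000 MIPS, far above the 560 MIPS software bound, so some of them have to move to hardware. The sketch below uses a simple greedy heuristic (offload the heaviest remaining thread until the software share fits). It illustrates one possible partition only and is not the partition the author selected; the dictionary keys and function name are hypothetical.

```python
INTER_MIPS = {
    "control": 2.9,
    "luma_dma_setup": 300.7,
    "luma_data_recombination": 1838.6,
    "luma_interpolation": 4644.9,
    "chroma_dma_setup": 269.6,
    "chroma_data_recombination": 546.1,
    "chroma_interpolation": 414.7,
}
SW_BOUND = 560.0  # upper bound for a 4-core cluster, from the slides


def offload_until_fits(mips, bound=SW_BOUND):
    """Greedy heuristic: move the heaviest thread to HW until SW fits the bound."""
    sw = dict(mips)
    hw = []
    while sw and sum(sw.values()) > bound:
        heaviest = max(sw, key=sw.get)
        hw.append(heaviest)
        del sw[heaviest]
    return sw, hw


sw, hw = offload_until_fits(INTER_MIPS)
print("HW threads:", hw)
print("SW threads:", sorted(sw), f"{sum(sw.values()):.1f} MIPS")
```

Other partitions also satisfy the bound, so the heuristic's result should be read as a starting point for the design space exploration rather than a recommendation.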

25 A Software-Centric Solution For H.264/AVC 720p High-Profile Decoder

26 Complexity of 720p High-profile Decoder
- Logic gate count and memory usage
- Synthesis conditions
  - 0.18-um CMOS technology
  - 200MHz for RISC clusters and 100MHz for others

                     Logic part (unit: K gates)    Memory part (unit: KB)
Component            RISC cluster   Coprocessor    I-cache   Tag     D-cache   Tag
Computation
  ED cluster         186 (2)        35              8.00     0.67     4.00     0.38
  ITQ cluster        226 (3)        30              4.00     0.35     1.00     0.10
  INTRA cluster      226 (3)        0              32.00     2.43     2.00     0.20
  INTER cluster      266 (4)        36              8.00     0.67     2.00     0.20
  RECON cluster      145 (1)        5               1.00     0.10     0.50     0.05
  DF cluster         145 (1)        10              8.00     0.67     0.50     0.05
  Sum                1,194 (14)     116            61.00     4.90    10.00     1.00

Communication        Logic (K gates)   Memory (KB)
  Control network    25                -
  Data network       42                11.50
  Sum                67                11.50

Total (Logic + Memory)   1,377 K gates   88.39 KB

27 Thread Partitioning for 720p (@ 200MHz)

28 Communication Network
[Figure: bandwidth labels 21.6 MB/sec, 415.63 MB/sec, 310.2 MB/sec, 196 MB/sec]

29 Core Utilization (@200MHz)
Per-cluster panels, labelled (thread number, context switching number per MB):
- ED Cluster (3, 2)
- ITQ Cluster (4, 7)
- INTRA Cluster (3, 0)
- INTER Cluster (4, 0)
- RECON Cluster (1, 0)
- DF Cluster (1, 0)

30 Design Space Exploration
- Seven mappings of an H.264 720p decoder
- With the same networks for control and data communication
[Figure labels: software-centric, hardware-centric]

31 Future Work
- More codec implementations
  - H.264/AVC 720p high-profile encoder
  - VC-1 720p advanced-profile decoder
- Flexible coprocessors:
  - Coarse-grained reconfigurable architecture (CGRA)

32 Thank you

