A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.

A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009 Savannah, Georgia, USA

2 Content
- Motivation
- Proposed multi-core platform architecture
- RISC cluster
- Hardware operating system kernel
- Computation coprocessor architecture
- Communication architecture with two separated networks
- Design flow for application mapping
- Experimental result
- H.264/AVC 720p high profile decoder implementations
- Future work

3 High-performance Video Systems
- Huge computation load: 60 GOPS to decode 1080p at 30 fps. Dedicated H/W blocks (for high-end applications)
- Multiple standards / new standards: MPEG2/4, H.264, DivX, VC-1, etc. Software with RISC, DSP, SIMD processors
- Embedded in mobile devices: PMPs, smart phones, etc. Area and energy efficiencies are critical
- Large data transfers and memories: at least 96MB for 1080p decoders. Application-specific optimized communication and memory architectures
- Should satisfy all of these CONFLICTING requirements!!
- Flexible high-performance platform

4 Proposed Multi-core Platform Architecture
- An array of RISC clusters with coprocessors, connected through two separated networks: control and data
- Each RISC cluster consists of up to 4 cores, shared I$ and D$, a HOSK, and coprocessors.

5 A Multi-threading RISC Cluster
Topics: Scheduling; Context Switching; Load Balancing; Multithreading; Synchronization; Message Passing (Channel Access); Communication; Implementation; Area (Complexity); Scalability (# of threads, # of cores); Coherent Shared Memory
- H/W-based task queue management + {priority+RR}-based task scheduling
- Fast context switching in 4 ~ 17 cycles
- Dynamic thread allocation + pre-emptive multithreading (priority or round robin)
- Thread migration without compulsory cache misses
- H/W-based mutex/semaphore
- No cache-coherency problem
- Channel access with a single coprocessor instruction
- Thread suspend or wake-up without software intervention
- On-chip/off-chip memory-based context memory
- No system services in each core + shared multiplier unit
- Use larger SRAMs
- No cache fragmentation
- The number of cores in a cluster is limited due to cache sharing.
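The {priority+RR}-based task scheduling named on this slide can be illustrated in software. The sketch below is a toy Python model, not a description of the HOSK hardware: the class `ReadyQueue`, its method names, the thread names, and the convention that a larger number means a higher priority are all assumptions made for illustration.

```python
from collections import deque


class ReadyQueue:
    """Toy model: strict priority, round-robin among threads of equal priority.

    Hypothetical illustration only; the slides do not describe the
    scheduler's internal data structures. Higher number = higher priority.
    """

    def __init__(self):
        self._queues = {}  # priority -> deque of ready thread ids

    def add(self, thread_id, priority):
        self._queues.setdefault(priority, deque()).append(thread_id)

    def pick_next(self):
        """Return the next thread to run, or None if nothing is ready."""
        ready = [p for p, q in self._queues.items() if q]
        if not ready:
            return None
        queue = self._queues[max(ready)]
        thread_id = queue.popleft()
        queue.append(thread_id)  # rotate so equal-priority threads take turns
        return thread_id

    def remove(self, thread_id):
        """Drop a thread that has finished or been suspended."""
        for queue in self._queues.values():
            if thread_id in queue:
                queue.remove(thread_id)
                return


rq = ReadyQueue()
rq.add("luma", 2)
rq.add("chroma", 2)
rq.add("control", 1)
order = [rq.pick_next() for _ in range(4)]
print(order)
```

In this model the two priority-2 threads alternate, and the priority-1 thread only runs once both have been removed.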

6 Hardware Operating System Kernel (HOSK)
- Context switching time: 32-bit bus: 17 cycles; 64-bit bus: 9 cycles; 544-bit bus: 4 cycles
- Context switching order: R15 → R14 → R13 → …
- Pre-fetch or save contexts in background!
- Main controller: receives service requests and controls other blocks
- Context manager: pre-fetches or saves contexts in background
- Thread manager: schedules tasks and controls semaphores
- Figure labels: Task Scheduling & Semaphore Control; SDRAM or SRAM
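The three cycle counts on this slide are consistent with a simple transfer model, sketched below in Python. Two things here are assumptions rather than statements from the slides: that one context is 17 words of 32 bits (544 bits in total), and that there is a fixed floor of 4 cycles, chosen only because it reproduces the 544-bit figure. The function name is hypothetical.

```python
import math

CONTEXT_WORDS = 17   # assumption: one context is 17 x 32-bit words = 544 bits
WORD_BITS = 32
MIN_CYCLES = 4       # assumption: fixed floor, picked to match the 544-bit case


def context_switch_cycles(bus_bits):
    """Estimate the cycles needed to move one context over a bus of this width."""
    context_bits = CONTEXT_WORDS * WORD_BITS
    transfer_cycles = math.ceil(context_bits / bus_bits)
    return max(transfer_cycles, MIN_CYCLES)


cycles = {width: context_switch_cycles(width) for width in (32, 64, 544)}
print(cycles)
```

Under these assumptions the model gives 17, 9, and 4 cycles for the 32-bit, 64-bit, and 544-bit buses, matching the slide.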

7 Computation Coprocessors
- Implemented for the computation-intensive parts of the video algorithms that cannot be run on the RISC cores.
- Local memory is accessed by both the RISC cores and the computation coprocessors
- The coprocessor task manager selects an available hardware thread for an outstanding coprocessor command
- Command queues are used to issue nonblocking coprocessor commands
- Figure labels: a pool of software threads; general coprocessor interface; a pool of hardware threads

8 Communication Network Architecture
- Among RISC clusters
- Two separated communication networks
- Control network: smaller data size, and synchronization information
  - based on conventional message passing
  - employs point-to-point hardware FIFOs
  - provides a new path to transfer data
- Data network: larger data size
  - based on remote DMA operations, in a bus-like style
  - employs memory (local or global) and hardware FIFOs
  - handles high-rate data transfers for stream-based applications

9 Control Network: point-to-point FIFO based
- FIFO group
- Fully programmable connectivity
- Two-level distributed identification for FIFOs
- Each control transaction is initiated by a control core with clusterID and fifoID
- A control core can issue a command to the communication coprocessor in a single cycle for a control transaction
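The two-level identification described on this slide (a clusterID, then a fifoID within that cluster's FIFO group) can be modelled with a dictionary keyed by both IDs. This is a minimal Python sketch under stated assumptions: the class `ControlNetwork`, its methods, the FIFO depth, and the message strings are hypothetical, and the single-cycle hardware command is reduced to an ordinary method call.

```python
from collections import deque


class ControlNetwork:
    """Toy model of point-to-point control FIFOs addressed by (clusterID, fifoID)."""

    def __init__(self, num_clusters, fifos_per_cluster, depth=4):
        self.depth = depth  # assumption: every FIFO has the same depth
        self.fifos = {
            (cluster_id, fifo_id): deque()
            for cluster_id in range(num_clusters)
            for fifo_id in range(fifos_per_cluster)
        }

    def send(self, cluster_id, fifo_id, message):
        """Push a message; return False if the FIFO is full (sender must retry)."""
        fifo = self.fifos[(cluster_id, fifo_id)]
        if len(fifo) >= self.depth:
            return False
        fifo.append(message)
        return True

    def receive(self, cluster_id, fifo_id):
        """Pop the oldest message, or return None if the FIFO is empty."""
        fifo = self.fifos[(cluster_id, fifo_id)]
        return fifo.popleft() if fifo else None


net = ControlNetwork(num_clusters=2, fifos_per_cluster=2, depth=2)
sent = [net.send(1, 0, msg) for msg in ("mb_ready", "mb_done", "overflow")]
first = net.receive(1, 0)
print(sent, first)
```

The third send is refused because the FIFO addressed by (1, 0) already holds two messages, and messages come back out in the order they were sent.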

10 Data Communication Network
- Streaming data is stored in either a local memory or a global memory, depending on the size of the data.
- The platform provides nC2 = n(n-1)/2 local data links for n RISC clusters
- Local data between two RISC clusters is exchanged through a shared local memory.
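The nC2 link count on this slide is the number of unordered pairs of clusters. The short Python sketch below enumerates those pairs; it assumes n is the number of RISC clusters and that clusters are numbered from 0, and the function name is hypothetical.

```python
from itertools import combinations


def local_links(num_clusters):
    """List every unordered pair of clusters, one pair per local data link."""
    return list(combinations(range(num_clusters), 2))


links_4 = local_links(4)
links_6 = local_links(6)
print(len(links_4), len(links_6))
```

For 4 clusters this gives 6 pairs, and for 6 clusters it gives 15, which agrees with n(n-1)/2.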

11 Global Data Communication with a DMAC
- A centralized DMA controller performs address translation, DMA request queue management, and data arrangement, so that data cores are free from tasks related to data transfers
- The two global data networks, for streaming data and for I/D cache data, can be either unified or separated, depending on the configuration of the memory controllers
- A small buffer is used between the DMA controller and a RISC cluster for DMA operations
- Figure labels: P (four instances); coprocessor; I-$; D-$; DMA Controller; Multimedia Address Translator (MAT); Request Queue; Data Recombination Unit (DRU); Global Data Network 1; Memory Controller; Global Memory (Streaming Data); Global Data Network 2; Memory Controller; Global Memory (I/D Cache Data)

12 Design Flow for Application Mapping
- Starting with an application model and a platform model with constraints
- Figure labels (specification and constraints): video specification; area, power; operating frequency; number of clusters; configurable network
- Figure labels (flow): application profiling; function partitioning & clustering; cluster partitioning; communication mapping; TLM modeling & function profiling; HW/SW thread partitioning & mapping; performance estimation; verification
- Figure labels (other): SystemC simulation in TLM; multithreading code generation (for RISC clusters); RTL coding or generation (for coprocessors); core # and cache sizing for each cluster; sizing local memories; FPGA prototyping

13 Partitioning into clusters
- According to the profiling results for a reference software, the application is first partitioned into grouped functions
- Each grouped function is mapped into a RISC cluster.
- Assumptions: RISC clusters with 4 cores @ 200MHz; utilization rate = 0.7
- Upper MIPS bound for a 4-core cluster = 4 × 200 × 0.7 = 560 MIPS
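The 560 MIPS bound on this slide follows from the stated assumptions by simple multiplication. The Python sketch below reproduces it; it additionally assumes that each core retires at most one instruction per cycle, which the slide does not state explicitly, and the function name is hypothetical.

```python
def cluster_mips_bound(cores, clock_mhz, utilization):
    """Upper MIPS bound for one cluster.

    Assumption: each core retires at most one instruction per cycle,
    so a core at clock_mhz MHz delivers at most clock_mhz MIPS.
    """
    return cores * clock_mhz * utilization


bound = cluster_mips_bound(cores=4, clock_mhz=200, utilization=0.7)
print(round(bound))
```

With 4 cores at 200MHz and a utilization rate of 0.7, the result is 560 MIPS.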

14 Cluster Partitioning
- Example: an H.264/AVC CIF decoder is mapped into 4 RISC clusters
- Figure: decoder block diagram grouped into RISC clusters. Blocks: Entropy Decoding; Inverse Quantization; Intra Prediction; Inter Prediction; Reconstruction; Deblocking Filter. Other labels: H.264 bitstream; Neighbor Reference Pixels; Current; 16x; Multi Reference Frames; Frame N-1; MUX; output. MIPS labels: 231 MIPS; 259 MIPS; 113 MIPS; 1087 MIPS; 45 MIPS; 356 MIPS

15 Cluster Partitioning
- Example: an H.264/AVC 720p decoder is mapped into 6 RISC clusters

16 Communication Mapping
1. Identify control and data flows among the clusters
2. Map each control flow into a specific FIFO in a FIFO group
3. Map a data flow for streaming into a local data network or the global data network according to the size of its bandwidth requirement
4. Map data flows for I/D cache into the global memory
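The four mapping steps above can be sketched as a small classification routine. This Python example is illustrative only: the `Flow` type, the flow names, the threshold `local_limit`, and the rule that flows at or below the threshold go to a local data network while larger ones go to the global data network are all assumptions, since the slide does not give the threshold or its direction.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    kind: str          # "control", "stream", or "cache"
    size: float = 0.0  # size of the bandwidth requirement, arbitrary units


def map_flows(flows, local_limit):
    """Assign each identified flow to a network (steps 2 to 4)."""
    mapping = {}
    next_fifo = 0
    for flow in flows:
        if flow.kind == "control":
            # Step 2: each control flow gets its own FIFO in the FIFO group.
            mapping[flow.name] = f"control FIFO {next_fifo}"
            next_fifo += 1
        elif flow.kind == "stream":
            # Step 3 (assumed rule): small flows local, larger flows global.
            if flow.size <= local_limit:
                mapping[flow.name] = "local data network"
            else:
                mapping[flow.name] = "global data network"
        elif flow.kind == "cache":
            # Step 4: I/D cache traffic goes to the global memory.
            mapping[flow.name] = "global memory"
        else:
            raise ValueError(f"unknown flow kind: {flow.kind}")
    return mapping


flows = [
    Flow("sync_a", "control"),
    Flow("sync_b", "control"),
    Flow("residuals", "stream", size=10.0),
    Flow("ref_frames", "stream", size=200.0),
    Flow("icache_fill", "cache"),
]
mapping = map_flows(flows, local_limit=50.0)
print(mapping)
```

Step 1, identifying the flows, is represented here by the hand-written `flows` list.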

17 Example 1: Control Network Mapping for an H.264/AVC CIF high-profile decoder (transaction and size)

18 Example 1: Data Network Mapping for an H.264/AVC CIF high-profile decoder (transaction and size)

19 Example 2: Control Network Mapping for an H.264/AVC 720p high-profile decoder (transaction and size)

20 Example 2: Data Network Mapping for an H.264/AVC 720p high-profile decoder (transaction and size)

21 HW/SW Thread Partitioning & Mapping
For each RISC cluster (cores, coprocessors):
1. Profile the required MIPS of each thread from TLM modeling
2. Select # of RISC cores and HW threads in the coprocessor
3. Allocate the threads to the cores or the coprocessor in the cluster
4. Back to step 2 if the result is not good enough
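The iterative procedure on this slide can be sketched as a loop over candidate core counts. The Python example below is a simplified illustration, not the authors' tool: the per-core capacity of 140 MIPS (200MHz × 0.7, assuming one instruction per cycle), the rule that a thread too heavy for one core moves to the coprocessor, the greedy heaviest-first placement, and every thread name and MIPS value except the 2.9 MIPS control thread are assumptions.

```python
def partition_threads(thread_mips, max_cores=4, core_capacity=140.0):
    """Return (cores, core_loads, hw_threads) for the smallest core count that fits.

    Assumptions: a thread heavier than one core's capacity is mapped to a
    hardware thread in the coprocessor; software threads are placed greedily,
    heaviest first, on the least-loaded core.
    """
    hw_threads = sorted(t for t, m in thread_mips.items() if m > core_capacity)
    sw_threads = sorted(
        ((m, t) for t, m in thread_mips.items() if m <= core_capacity),
        reverse=True,
    )
    for cores in range(1, max_cores + 1):       # step 2: select a core count
        loads = [0.0] * cores
        for mips, _name in sw_threads:          # step 3: allocate the threads
            least_loaded = loads.index(min(loads))
            loads[least_loaded] += mips
        if max(loads) <= core_capacity:         # step 4: accept, or try again
            return cores, loads, hw_threads
    return None  # nothing fits; more threads would have to move to hardware


# Step 1 stand-in: profiled MIPS per thread (hypothetical except "control").
profile = {"control": 2.9, "thread_a": 100.0, "thread_b": 90.0, "thread_c": 300.0}
result = partition_threads(profile)
print(result)
```

For this made-up profile, `thread_c` is moved to the coprocessor and the remaining threads fit on two cores.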

22 Example: Thread Partitioning & Mapping for Intra-prediction (1)
- ~480 MIPS for intra prediction in the 720p decoder
- Upper bound for a 4-core cluster: 560 MIPS
- Map all threads to SW
- Thread-level parallelism is limited due to dependency among the threads, which limits core utilization
- Table (threads, MIPS): control 2.9; luma 4x x x; chroma cb/cr 78.0

23 Example: Thread Partitioning & Mapping for Intra-prediction (2)
- Dependency and intra-prediction order in a MB
- 4x4 luma intra prediction for luma samples
- Core utilization: limited because of limited parallelism (2)
- Reducing cores from 4 to
- Figure labels: dependency; intra-prediction ordering

24 Example: Thread Partitioning & Mapping for Inter-prediction
- Inter prediction case in the 720p decoder
- Upper bound for a 4-core cluster: 560 MIPS
- One of several possible SW-HW partitions is selected.
- Table (threads, MIPS): control 2.9; luma: DMA setup 300.7, Data Recombination, Interpolation; chroma: DMA setup 269.6, Data Recombination, Interpolation 414.7

25 A Software-Centric Solution For H.264/AVC 720p High-Profile Decoder

26 Complexity of 720p High-profile Decoder
- Logic gate count and memory usage
- Synthesis conditions
  - 0.18-um CMOS technology
  - 200MHz for RISC clusters and 100MHz for others
- Table columns: Logic part (unit: K gates): RISC cluster, Coprocessor. Memory part (unit: KB): I-cache, Tag, D-cache, Tag
- Computation components, RISC cluster column: ED cluster 186 (2); ITQ cluster 226 (3); INTRA cluster 226 (3); INTER cluster 266 (4); RECON cluster 145 (1); DF cluster 145 (1); Sum 1,194 (14)
- Communication components: Control network 25, -; Data network; Sum
- Total (Logic + Memory): 1,

27 Thread Partitioning for 720p (200MHz)

28 Communication Network
- Figure: communication network annotated with link bandwidths in MB/sec; surviving labels: 21.6 MB/sec and 196 MB/sec

29 Core Utilization
- Figure labels, given as (thread number, context switching number per MB): ED Cluster (3, 2); ITQ Cluster (4, 7); INTRA Cluster (3, 0); INTER Cluster (4, 0); RECON Cluster (1, 0); DF Cluster (1, 0)

30 Design Space Exploration
- Seven mappings of an H.264/AVC 720p decoder
- With the same networks for control and data communication
- Figure labels: software-centric; hardware-centric

31 Future Work
- More codec implementations
  - H.264/AVC 720p high-profile encoder
  - VC-1 720p advanced-profile decoder
- Flexible coprocessors
  - Coarse-grained reconfigurable architecture (CGRA)

32 Thank you