University of Michigan Electrical Engineering and Computer Science

Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications
Hyunchul Park (1), Yongjun Park (2), Scott Mahlke (2)
December 12, 2009
(1) Texas Instruments Inc.
(2) University of Michigan, Ann Arbor

Introduction
Multimedia applications have high performance, cost, and energy demands
– High-quality video
– Flash animation
Clear need for application- and domain-specific hardware
[Figure: MPEG-4 decoder frame rate (frames/sec, 24 fps minimum) against cell-phone battery life (hours); axes labeled performance and energy]

Convergence of Functionalities
[Figure: Anatomy of iPhone]
HD TV decoder, video recording, video editing, 3D rendering, 4G wireless, advanced image processing
Convergence of functionalities demands a flexible solution
Applications have different characteristics

ASIC Alternatives
[Figure: general-purpose processors, DSPs, and ASICs placed along two axes: efficiency/performance and flexibility]
Domain specific: efficiency, somewhat programmable
What's the right way to support multimedia applications?

Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput, low cost/power with distributed hardware
High flexibility with dynamic reconfiguration
Examples: Morphosys, SiliconHive, ADRES

Execution Model of CGRAs
[Figure: timeline in which a loop, for ( …… ) { }, moves from the host to the CGRA]
Modulo scheduling exploits loop-level parallelism
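The idea behind modulo scheduling can be illustrated with a small sketch. It is not the scheduler used in this work: it assumes each operation already has an earliest start time (dependences and routing are ignored), and the names `modulo_schedule`, `num_fus`, and `ii` are hypothetical. An operation issued at time t occupies row t mod II of a modulo reservation table, so a new loop iteration can start every II cycles.

```python
def modulo_schedule(ops, num_fus, ii):
    """Greedily place ops into a modulo reservation table (MRT).

    ops: list of (name, earliest_start) pairs; earliest_start is assumed
         to be given (a real scheduler derives it from dependences).
    Returns {name: (time, fu)} or None if the ops do not fit at this II.
    """
    mrt = {}        # (time mod ii, fu) -> op name occupying that slot
    placement = {}
    for name, earliest in ops:
        # Trying ii consecutive cycles covers every row of the MRT once.
        for t in range(earliest, earliest + ii):
            free = [fu for fu in range(num_fus) if (t % ii, fu) not in mrt]
            if free:
                mrt[(t % ii, free[0])] = name
                placement[name] = (t, free[0])
                break
        else:
            return None  # no free slot in any row: II is too small
    return placement


if __name__ == "__main__":
    ops = [("a", 0), ("b", 0), ("c", 1), ("d", 2)]
    print(modulo_schedule(ops, num_fus=1, ii=2))  # four ops cannot fit in 2 slots
    print(modulo_schedule(ops, num_fus=1, ii=4))
    print(modulo_schedule(ops, num_fus=2, ii=2))
```

With one functional unit the four operations need II = 4; doubling the functional units lets the same operations fit at II = 2, which is the resource/throughput trade-off the later slides build on.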

Large Scale CGRA
Need for higher performance
– Higher resolution/more detail video
– Multiple concurrent applications support
Advancing technology makes more resources available
[Figure: Loop 0 to Loop 3 and Task 0 to Task 4 mapped onto the array]

Streaming Execution Model
Streaming property
– Packet of data goes through independent tasks
Partition tasks into stages
– Map each stage onto different hardware
Pipeline parallelism
– Pipeline the outermost loop
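The benefit of pipelining the outermost loop can be estimated with a simple timing model. The sketch below is an illustration under stated assumptions, not a model from the talk: each stage runs on its own hardware, a stage handles one packet at a time, buffering between stages is unlimited, and the per-stage times are made-up numbers. The function names are hypothetical.

```python
def sequential_time(stage_times, num_packets):
    """Every packet runs through all stages before the next one starts."""
    return num_packets * sum(stage_times)


def pipelined_time(stage_times, num_packets):
    """Each stage sits on its own hardware and overlaps with the others.

    A stage starts packet p once the previous stage has finished p and
    the stage itself has finished packet p - 1.
    """
    finish = [0] * len(stage_times)  # finish time of the latest packet per stage
    for _ in range(num_packets):
        ready = 0  # time at which the current packet leaves the previous stage
        for s, cost in enumerate(stage_times):
            finish[s] = max(ready, finish[s]) + cost
            ready = finish[s]
    return finish[-1]


if __name__ == "__main__":
    stages = [2, 4, 1]  # hypothetical cycles per packet for three stages
    print(sequential_time(stages, 3))  # 21
    print(pipelined_time(stages, 3))   # 15
```

In this model the steady-state rate is set by the slowest stage, which is why balancing stages matters once the loop is pipelined.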

Insights
Multimedia applications are rich in both ILP and pipeline parallelism
– Not mutually exclusive, cooperatively enhance performance
Resource requirement varies
– Statically / dynamically
Need a flexible execution model
– Exploiting both types of parallelism
– Resource allocation based on computation requirement
– Dynamically adapt to computation variance

Polymorphic Pipeline Array
Multi-core accelerator: each 2x2 array becomes a processor
Cores can be combined to form a larger logical core
Exploit both coarse-grain and fine-grain pipeline parallelism
No dynamic routing logic: all communications statically generated
[Figure: array of cores, with a core and a logical core marked]

Execution Model
Pipeline outermost loop
[Figure: stages ST 0 to ST 3 mapped onto the array]

Execution Model
Pipeline outermost loop
Compute intensive stage
– Assign more resources
– Modulo scheduling
[Figure: stages ST 0 to ST 3 mapped onto the array]

Execution Model
Pipeline outermost loop
Compute intensive stage
– Assign more resources
– Modulo scheduling
[Figure: stages ST 0 to ST 3 mapped onto the array]
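Assigning more resources to a compute-intensive stage can be sketched as a greedy balancing loop. This is only an illustration, not the allocation policy from the talk: it assumes a stage's time shrinks linearly with the number of cores it receives, which a modulo-scheduled stage will not achieve exactly, and the workload numbers and the name `allocate_cores` are hypothetical.

```python
def allocate_cores(stage_work, total_cores):
    """Give every stage one core, then hand spare cores to the bottleneck.

    stage_work: estimated work per stage (arbitrary units, assumed known).
    Assumes ideal linear speedup: time of stage s = stage_work[s] / cores[s].
    """
    if total_cores < len(stage_work):
        raise ValueError("need at least one core per stage")
    cores = [1] * len(stage_work)
    for _ in range(total_cores - len(stage_work)):
        # The bottleneck is the stage with the largest time per packet.
        bottleneck = max(range(len(stage_work)),
                         key=lambda s: stage_work[s] / cores[s])
        cores[bottleneck] += 1
    return cores


if __name__ == "__main__":
    # Four stages, the second one compute intensive, eight cores in total.
    print(allocate_cores([4, 16, 4, 8], 8))  # [1, 4, 1, 2]
```

After allocation every stage in the example takes 4 units per packet, so no single stage limits the pipeline.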

Partitioning of PPA
Static partitioning
– Schedules can be optimized
– Computation variance leads to low utilization
Dynamic partitioning
– Adjust core assignment at run-time
– Adapt to computation variance, but some overhead
How to support dynamic partitioning
– Multiple schedules: code bloat
– Unified schedule targeting multiple sub-arrays (virtualization)

Virtualized Modulo Scheduling
One binary that can run on multiple targets
– Part of the code migrates to a neighboring core
– No rescheduling
Challenges
– Avoid resource conflict
– Enforce multiple modulo constraints
– Inter-core communication
[Figure: operations A and B of one schedule shown on a single core and split across cores 0 and 1, with II marked]

Multi-level Modulo Constraints
[Figure, built up over four slides: operations 0 to 13 placed on a time axis across functional units F0 to F3, first for Core 0 with II = 4, then for Core 0 and Core 1 with II = 4 and II = 2 marked]

Inter-core Communication
[Figure: schedules for Core 0 and Core 1 on a time axis across functional units F0 to F3, II = 2]
Direct RF connection

VMS Summary
Edge-centric Modulo Scheduling [PACT'08] with virtualization support
Generate a unified schedule
– Schedule for the smallest array, then expanded
Multi-level modulo constraints enforced
– Avoid resource conflict when expanded
– Apply to computation/routing/registers
Register transfer operations for inter-core communications
– Enabled only when expanded
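One way to read "multi-level modulo constraints" is that a placement must be free of resource conflicts under every II the unified schedule may run at (the slides show II = 4 and II = 2). The sketch below checks that property for computation slots only. It is an assumption-laden illustration, not the scheduler described here: the per-configuration placements are supplied by hand, the rule that decides which operations migrate is not modeled, and `find_conflicts` and `satisfies_all_levels` are hypothetical names.

```python
from collections import defaultdict


def find_conflicts(ops, ii):
    """Return groups of ops that collide in the modulo reservation table.

    ops: {name: (core, fu, time)}. Two ops collide when they use the same
    functional unit of the same core in the same row (time mod ii).
    """
    slots = defaultdict(list)
    for name, (core, fu, time) in ops.items():
        slots[(core, fu, time % ii)].append(name)
    return sorted(tuple(names) for names in slots.values() if len(names) > 1)


def satisfies_all_levels(levels):
    """levels: list of (ii, ops) pairs, one per target configuration."""
    return all(not find_conflicts(ops, ii) for ii, ops in levels)


if __name__ == "__main__":
    # Hypothetical single-core placement, conflict-free at II = 4.
    single = {"a": (0, 0, 0), "b": (0, 0, 2)}
    # Same ops after expansion: b is assumed to migrate to core 1.
    expanded = {"a": (0, 0, 0), "b": (1, 0, 2)}

    print(find_conflicts(single, 4))    # []
    print(find_conflicts(single, 2))    # [('a', 'b')]: both land in row 0
    print(satisfies_all_levels([(4, single), (2, expanded)]))  # True
```

The example shows why a schedule that is legal at II = 4 is not automatically legal at II = 2: operations two cycles apart fall into the same row unless one of them moves to another core.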

Evaluation of PPA
Exploiting both types of parallelism in AAC
Dynamic partitioning overhead
– 13% overhead for single-core schedule, runtime overhead

Where PPA stands
[Figure: MPEG-4 decoder frame rate (frames/sec, 24 fps minimum) against cell-phone battery life (hours); axes labeled performance and energy]

Questions?
