1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE ‘09. A Generic.


Presenter: Ming-Shiun Yang
Sah, A., Balakrishnan, M., Panda, P.R. "A Generic Platform for Estimation of Multi-threaded Program Performance on Heterogeneous Multiprocessors." Design, Automation & Test in Europe Conference & Exhibition, DATE '09.

This paper presents a software estimation methodology that enables design space exploration of heterogeneous multiprocessor systems. Starting from a fork-join representation of the application specification, a high-level description of the target multiprocessor architecture, and a mapping of application components onto architecture resource elements, it estimates the performance of the application on the target architecture. To improve the accuracy of the software estimate, the methodology accounts for the effect of basic compiler optimizations and integrates lightweight memory simulation and instruction mapping for complex instructions. To estimate the performance degradation caused by contention for shared resources such as memory and the bus, it employs synthetic access traces coupled with an interval analysis technique. The methodology has been validated on a real heterogeneous platform. The results show that the estimates predict performance with an average error of around 11%.

Many mappings are possible between an application and a hardware architecture.
• How do we know that the mapping we chose is the best one?
◦ We need a performance estimator that estimates the performance of each mapping.
The estimation takes three inputs:
• Application specification
◦ Tasks, data, and communication
• Architecture specification
◦ Processors, memory, and bus
• Mapping description
◦ Maps application components onto architecture components
[Figure: two schedules of tasks A, B, C, D on processors P1 and P2 along a time axis from 0 to 10, with the saved time marked.]

[Figure: estimation flow of this paper. Labels:]
• Application specification: fork-join task graph, which represents the parallel phases of computation
• SUIF compiler
• Software profiling
• Architecture specification: HMDES (High Level Machine Description); the target processor description includes pipeline stages, memory, link, …

The estimation has three parts:
1. Estimation of a mapped task on a processor.
2. Estimation of communication and synchronization delays of multi-threaded tasks.
3. Estimation of contention delays of shared resources.


Introduction
[Figure: a multi-threaded application forks tasks onto Processor #1, Processor #2, …, Processor #n. Caption: example of a multi-threaded application running on multiprocessors.]
Issues: communication, synchronization, resource contention, …

Estimation Input: Application Specification
• Fork-join task graph: a task graph of alternating sequential and parallel phases, in which each parallel phase consists of independent tasks.
◦ Vertex: a task, which is a unit of work in a parallel program.
◦ Edge: a precedence relation between a pair of tasks.
[Figure: examples of fork-join task graphs, with sequential and parallel phases marked.]

[Figure: estimation of a mapped task on a processor, including register allocation.]
Legend: bb = basic block, L = latency, T = total time, F = frequency
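The slide gives only the legend, not the formula that ties the symbols together. One natural reading, and it is an assumption here rather than something the slide states, is that the total time T is the sum over basic blocks of latency L times execution frequency F. The sketch below illustrates that reading; the `BasicBlock` class, the `total_time` function, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    latency: int    # L: estimated cycles for one execution of the block
    frequency: int  # F: how many times the block executes (from profiling)


def total_time(blocks):
    """T: frequency-weighted sum of basic-block latencies, in cycles.

    Assumed relation T = sum(L * F); the slide only lists the symbols.
    """
    return sum(bb.latency * bb.frequency for bb in blocks)


# Made-up numbers for illustration only.
blocks = [
    BasicBlock("bb0", latency=12, frequency=1),
    BasicBlock("bb1", latency=7, frequency=100),
    BasicBlock("bb2", latency=5, frequency=1),
]
print(total_time(blocks))  # 12*1 + 7*100 + 5*1 = 717
```

Under this reading, the per-block latency would come from the processor description and the frequency from software profiling.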

Estimated processors
• Cradle PE
• Leon3 with FPU
• SS-mips
[Figure: estimated and actual execution cycles.]
Average error rate: 14%

Part 2: Estimation of communication and synchronization delays of multi-threaded tasks.

An application may be composed of sequential pre-processing and post-processing phases and nested fork-joins. A fork-join may be iterated many times.
[Figure annotation: the task time includes the shared resource contention delay.]
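A minimal sketch of how such a composition could be evaluated, assuming a simple model that the slides do not spell out: sequential phases add up, a fork-join takes as long as its busiest processor plus a synchronization overhead, and an iterated fork-join repeats that cost. The class names, the `overhead` parameter, and the numbers are hypothetical, and the per-task cycle counts are assumed to already include any contention delay.

```python
from dataclasses import dataclass


@dataclass
class Sequential:
    cycles: int  # estimated cycles of a sequential phase or leaf task


@dataclass
class ForkJoin:
    tasks: list         # (processor name, phase) pairs; a phase may be nested
    iterations: int = 1
    overhead: int = 0   # hypothetical fork + join cost per iteration


def estimate(phase):
    """Estimated cycles of a phase, a fork-join, or a list of phases."""
    if isinstance(phase, Sequential):
        return phase.cycles
    if isinstance(phase, ForkJoin):
        # Tasks mapped to the same processor run one after another.
        per_proc = {}
        for proc, child in phase.tasks:
            per_proc[proc] = per_proc.get(proc, 0) + estimate(child)
        # The join waits for the busiest processor.
        longest = max(per_proc.values(), default=0)
        return phase.iterations * (longest + phase.overhead)
    # Otherwise: a list of phases executed in order.
    return sum(estimate(p) for p in phase)


app = [
    Sequential(100),  # pre-processing
    ForkJoin(
        tasks=[("P1", Sequential(400)),
               ("P2", Sequential(300)),
               ("P2", Sequential(200))],
        iterations=10,
        overhead=20,
    ),
    Sequential(50),   # post-processing
]
print(estimate(app))  # 100 + 10 * (500 + 20) + 50 = 5350
```

Because `estimate` recurses into each task, a fork-join nested inside another fork-join is handled the same way as a top-level one.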

Part 3: Estimation of contention delays of shared resources.

Interval analysis
• Generate access rate data

Spreading of interval
[Figure: access intervals of processors P1, P2, P3, and P4; panel label: at the next time slot.]

[Two figures; the only surviving label is the axis name "Time Interval".]
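The interval analysis slides survive only as figure labels, so the following is a rough illustration of the general idea and not the paper's algorithm. It assumes that time is divided into fixed-length intervals, that each processor's synthetic trace is summarized as an access count per interval, and that the shared resource serves at most `capacity` accesses per interval. Accesses that do not fit are pushed into the next interval, which is one possible reading of "spreading". All names and numbers are hypothetical.

```python
import math


def contention_delay(access_counts, capacity, interval_cycles):
    """Extra cycles needed to drain accesses that did not fit in time.

    access_counts: {processor: [accesses in interval 0, 1, ...]}
    capacity: accesses the shared resource can serve per interval (> 0)
    interval_cycles: length of one interval in cycles
    """
    n = max((len(trace) for trace in access_counts.values()), default=0)
    backlog = 0
    for i in range(n):
        demand = backlog + sum(
            trace[i] for trace in access_counts.values() if i < len(trace)
        )
        # Whatever exceeds the capacity spills into the next interval.
        backlog = max(0, demand - capacity)
    # Accesses still pending after the last interval extend the run.
    extra_intervals = math.ceil(backlog / capacity)
    return extra_intervals * interval_cycles


# Made-up access counts for four processors over three intervals.
traces = {
    "P1": [4, 4, 2],
    "P2": [3, 5, 1],
    "P3": [2, 2, 2],
    "P4": [1, 0, 0],
}
print(contention_delay(traces, capacity=8, interval_cycles=100))   # 100
print(contention_delay(traces, capacity=20, interval_cycles=100))  # 0
```

With a capacity of 8, the backlog after the three intervals is 2, 5, and 2 accesses, so one extra interval is needed. With a capacity of 20, every interval fits and the estimated delay is zero.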

Architecture of the Cradle CT3400 heterogeneous multiprocessor chip
• 4 processors
• 4 DSEs (Digital Signal Engines)

JPEG application
[Figure: parallelism in the JPEG application and the mapping description.]
Legend: P = processor, D = DSE (Digital Signal Engine)

Estimated cycles for 8 mappings of the JPEG application on the Cradle architecture.
• The estimated cycles without contention delay must be lower than the others.

Conclusion
• The paper presents a framework for retargetable performance estimation of multi-threaded applications on heterogeneous multiprocessors.
• The estimated performance includes the shared resource contention delay and the task execution time on a single processor.
Comment
• The mapping of multi-threaded application components onto the hardware architecture is very important for improving performance.
• The error rate is still high: the paper reports an average error of around 11%, and the per-processor comparison shows an average error rate of 14%.