Resource Mapping and Scheduling for Heterogeneous Network Processor Systems Liang Yang, Tushar Gohad, Pavel Ghosh, Devesh Sinha, Arunabha Sen and Andrea.

Presentation transcript:

Resource Mapping and Scheduling for Heterogeneous Network Processor Systems Liang Yang, Tushar Gohad, Pavel Ghosh, Devesh Sinha, Arunabha Sen and Andrea Richa Arizona State University

Agenda: Network Processor (NP) System; Resource Mapping and Scheduling Problem; Heuristic Approach (Linear Programming and Randomized Rounding); Resource Contention Issue (Detection and Elimination); Experimental Results; Summary and Future Work.

Network Processor Systems: Programmable devices designed to process packets at wire speed. Non-homogeneous real-time systems. Comprise a mix of ASICs, programmable processors and on-chip interconnects. Optimized to support multiple applications such as IPv4, Diffserv, etc.

Resource Mapping and Scheduling Problem in NP: Given a set APP = {APP_1, APP_2, …, APP_k} of applications, each specified by a DAG, where each application APP_j has a set of constraints (e.g. timing constraints, area constraints), find the mapping that minimizes the system cost in dollars while satisfying all the design constraints. Assumption: only one application is active at any given time.

System Specification: Possible task-to-resource mappings. Several algorithms may be available for executing a task. Each resource has associated cost and area parameters. There may be multiple instances of a resource.

Integer Linear Programming (ILP) formulation. Objective: find a task-to-resource mapping with minimum cost. Constraints: board area, timing, unique task, exclusive resource, communication delay, task-to-resource mapping, and task dependency. Example design problem with 3 flows: 800 variables and 2000 constraints.
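To make the objective and two of the constraints concrete, here is a minimal sketch that replaces the ILP solver with exhaustive search over a toy instance. The task names, resource names, timings, costs and areas are all hypothetical, the tasks are assumed to form a simple chain, and only the timing and board-area constraints are modeled; this is not the formulation from the slides.

```python
from itertools import product

# Hypothetical toy instance: every name and number here is illustrative.
# OPTIONS[task] lists the (resource, execution_time_ns) choices for that task.
OPTIONS = {
    "parse":    [("asic", 10), ("cpu", 40)],
    "classify": [("asic", 15), ("cpu", 35)],
    "forward":  [("cpu", 20), ("fpga", 12)],
}
COST = {"asic": 50, "cpu": 20, "fpga": 35}  # dollars per resource instance
AREA = {"asic": 4, "cpu": 2, "fpga": 3}     # board area units


def cheapest_mapping(options, cost, area, deadline, max_area):
    """Return (total_cost, mapping) for the cheapest feasible mapping, or None."""
    tasks = list(options)
    best = None
    for choice in product(*(options[t] for t in tasks)):
        used = {res for res, _ in choice}        # one instance per resource type
        latency = sum(t for _, t in choice)      # tasks run back to back (chain)
        total_area = sum(area[r] for r in used)
        if latency > deadline or total_area > max_area:
            continue                             # timing or area constraint violated
        total_cost = sum(cost[r] for r in used)
        if best is None or total_cost < best[0]:
            mapping = {task: res for task, (res, _) in zip(tasks, choice)}
            best = (total_cost, mapping)
    return best
```

With a deadline of 100 ns the search places every task on the cheap CPU (cost 20); tightening the deadline to 60 ns forces the ASIC into the design (cost 70), and a 30 ns deadline has no feasible mapping at all. Exhaustive search only works at this toy scale, which is why the slides turn to an ILP and then to LP-based rounding.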

Heuristic Approach: Randomized Rounding. Based on the Linear Programming solution. Traditional evolutionary algorithms (e.g. Genetic Algorithms, Simulated Annealing) require a set of feasible solutions as a starting point. It is hard to obtain an initial feasible set because of the conflicting constraints (area, time) in the problem.

Randomized Rounding: Relax the integrality constraints of the ILP and solve the resulting LP. The fractional values of the binary variables are used as probabilities for rounding them to either 0 or 1. Variable randomized rounding: randomly select variables from a set of randomly chosen constraints, then round the selected variables. Rounding is repeated iteratively in case of constraint violation.
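The basic rounding step can be sketched as follows. This is a simplified illustration, not the variable-selection or rollback scheme from the slides: it assumes the LP has already been solved, that its fractional values are given per task as a dictionary summing to 1, and that a failed rounding is simply retried from scratch. The names `FRACTIONAL`, `TIME_NS` and `meets_deadline` are hypothetical.

```python
import random

# Hypothetical LP output: FRACTIONAL[task][resource] is the fractional value of
# the binary variable "task runs on resource"; each row sums to 1.
FRACTIONAL = {
    "parse":    {"asic": 0.7, "cpu": 0.3},
    "classify": {"asic": 0.4, "cpu": 0.6},
}
# Hypothetical execution times used by the feasibility check below.
TIME_NS = {
    ("parse", "asic"): 10, ("parse", "cpu"): 40,
    ("classify", "asic"): 15, ("classify", "cpu"): 35,
}


def meets_deadline(mapping, deadline_ns=60):
    """Toy timing constraint: tasks run back to back within the deadline."""
    return sum(TIME_NS[(task, res)] for task, res in mapping.items()) <= deadline_ns


def round_once(fractional, rng):
    """Pick one resource per task, using the LP values as probabilities."""
    mapping = {}
    for task, probs in fractional.items():
        resources = list(probs)
        weights = [probs[r] for r in resources]
        mapping[task] = rng.choices(resources, weights=weights, k=1)[0]
    return mapping


def randomized_rounding(fractional, is_feasible, max_tries=100, seed=0):
    """Round repeatedly until the integral mapping is feasible, or give up."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        mapping = round_once(fractional, rng)
        if is_feasible(mapping):
            return mapping
    return None
```

Calling `randomized_rounding(FRACTIONAL, meets_deadline)` returns a mapping that satisfies the toy deadline; only the all-CPU choice (75 ns) is rejected and re-rolled.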

Randomized Rounding (cont.) Fixing variables: reduce the number of variables to be rounded; fix the variables that take integer values after solving the LP; iteratively re-solve the LP until the number of integer variables no longer increases. Grouping variables: assign priority based on variable group affiliation.

Randomized Rounding (cont.) Rollback point selection: roll back only to the last group where a constraint violation occurred. Rounding step size: round one variable or several at a time?

Randomized Rounding Results Near-optimal solution in a fraction of ILP solution time

Exploration of Solution Space: If the deadline constraint is too strict, the ILP may not have any feasible solution for the existing set of resources. On the other hand, with too relaxed a deadline, a feasible solution will be obtained, but with an increased chance of resource contention. The solution space is explored using binary search to find a least-cost feasible solution without any resource contention.

Improvement of Solution: A relaxed deadline for packet processing helps to reduce the system cost in dollars. Packet latency increases, while the line speed is still satisfied. This approach allows multiple packets to be inside the system simultaneously (packet-level parallelism). There may be resource contention if more than one packet tries to access the same resource at the same instant for two different tasks.

Resource Contention Example: line rate = 10 Gbps, packet size = 64 bytes, no inter-packet gap. A packet arrives approximately every 51 ns (512 bits / 10 Gbps = 51.2 ns).
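The arrival interval in this example is plain arithmetic: the packet size in bits divided by the line rate. A small helper makes it easy to try other packet sizes; the function name and the optional `gap_bytes` parameter are illustrative additions, not part of the slides.

```python
def packet_interarrival_ns(line_rate_bps, packet_bytes, gap_bytes=0):
    """Time between back-to-back packet arrivals, in nanoseconds."""
    bits_on_wire = (packet_bytes + gap_bytes) * 8
    return bits_on_wire / line_rate_bps * 1e9


# 64-byte packets at 10 Gbps with no gap: 512 bits / 10e9 bps = 51.2 ns
interval = packet_interarrival_ns(10e9, 64)
```

Larger packets or a non-zero gap lengthen the interval, which loosens the per-resource time budget discussed in the following slides.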

Resource Contention Detection: Packet Flow Graph (PFG). A visual depiction of the flow of packets through the various resources inside the NP system. G(V, E): V is the set of resources allocated by the ILP, with additional entry and exit nodes s and t, respectively. An edge e = (u, v) ∈ E exists if resources u and v are allocated sequentially. A weight w(e) = (x(e), y(e)) is associated with each edge e, where x(e) is the allocation sequence of the resource and y(e) is the execution time on that sequence.

Resource Contention Detection: Resource Cycle Time (calculated in the PFG). Defined as the maximum time span for which a resource is busy executing the set of tasks for a packet. A resource is not available until it finishes all the tasks scheduled on it for a packet. Maximum Cycle Time: the maximum of all resource cycle times. Resource contention is detected if the maximum cycle time is greater than the packet inter-arrival time. A Gantt chart is used to detect resource contention among multiple paths in a task graph.
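For a single path, the check can be sketched as follows. The sketch assumes one packet's schedule is given as (resource, start, end) triples and reads "time span" as the interval from a resource's first task start to its last task end, since the resource is held until all of its tasks for the packet finish. The schedule format and the resource names are hypothetical, and the multi-path Gantt-chart analysis is not covered.

```python
def resource_cycle_times(schedule):
    """Map each resource to the span it is held by one packet.

    schedule: iterable of (resource, start_ns, end_ns) for a single packet.
    """
    first_start, last_end = {}, {}
    for res, start, end in schedule:
        first_start[res] = min(start, first_start.get(res, start))
        last_end[res] = max(end, last_end.get(res, end))
    return {res: last_end[res] - first_start[res] for res in first_start}


def has_contention(schedule, interarrival_ns):
    """True if some resource is still busy when the next packet arrives."""
    max_cycle_time = max(resource_cycle_times(schedule).values())
    return max_cycle_time > interarrival_ns


# Hypothetical schedules: "me0" is reused in the first, not in the second.
REUSED = [("me0", 0, 20), ("me1", 20, 45), ("me0", 45, 60)]
SPREAD = [("me0", 0, 20), ("me1", 20, 45), ("me2", 45, 60)]
```

With packets arriving every 51.2 ns, `REUSED` shows contention because `me0` is held for 60 ns, while `SPREAD` does not: its longest cycle time is 25 ns, even though the end-to-end latency is the same.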

Resource Contention (Single Path) Example:

Resource Contention (Multiple Paths)

Resource Contention Elimination: A binary search speeds up the iterative exploration of the solution space. The solution found by the ILP is checked for resource contention. If there is none, no more work is needed; otherwise, search iteratively for the least-cost feasible solution.
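One way to picture the binary search is as a search over the deadline. The sketch below assumes integer deadlines, a caller-supplied `solve(deadline)` that returns a solution or `None` when infeasible, and a `contention_free(solution)` check; all three are hypothetical stand-ins for the ILP and the contention analysis. It also assumes that once a deadline is loose enough to cause contention, every looser deadline does too, which the slides do not state.

```python
def largest_contention_free_deadline(solve, contention_free, lo, hi):
    """Binary search over integer deadlines in [lo, hi].

    Returns (deadline, solution) for the loosest deadline whose solution has
    no resource contention, or None if no such deadline is found.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        solution = solve(mid)
        if solution is None:
            lo = mid + 1              # too strict: relax the deadline
        elif contention_free(solution):
            best = (mid, solution)
            lo = mid + 1              # try a looser, potentially cheaper deadline
        else:
            hi = mid - 1              # contention: tighten the deadline
    return best


# Hypothetical stand-ins: infeasible below 40 ns, contention above 75 ns,
# and cost falling as the deadline is relaxed.
def toy_solve(deadline):
    return None if deadline < 40 else {"deadline": deadline, "cost": 200 - deadline}


def toy_contention_free(solution):
    return solution["deadline"] <= 75
```

On the toy stand-ins, searching the range 0 to 100 settles on a deadline of 75 with cost 125, the loosest deadline that is still free of contention. Because relaxing the deadline lowers cost in this model, the loosest contention-free deadline is also the cheapest one.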

Resource Contention Elimination: d is the arrival rate of the packets and l is the maximum diameter of the flow graphs.

Experimental Settings: The codesign method was applied to a Packet Processing System (PPS) similar to the Intel IXP2400 network processor. The resource set was derived from the Intel IXP2400 architecture. The application set was derived from the standard benchmarking applications defined by the Network Processing Forum, for which a mapping is available from Intel. The performance of the mapping generated by our approach was compared with the standard mapping specified by Intel as part of the IXA Application Framework.

Performance Metrics. End-to-end packet latency: the time interval starting when the first bit of a packet enters the input port and ending when the first bit of the packet reaches the output port. Throughput: the number of data bits transferred per unit time, measured at 0% packet loss while varying packet size. Resource utilization: the ratio of the time a resource was active to the total measurement time.
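The three metrics are simple ratios and differences, written out here as a short sketch. The function names, argument names and units are assumptions chosen for illustration; how the timestamps and counters are collected is not described in the slides.

```python
def latency_ns(first_bit_in_ns, first_bit_out_ns):
    """End-to-end latency: first bit at the input port to first bit at the output port."""
    return first_bit_out_ns - first_bit_in_ns


def throughput_bps(data_bits, elapsed_s):
    """Data bits transferred per second (meaningful only at 0% packet loss)."""
    return data_bits / elapsed_s


def utilization(active_time, total_time):
    """Fraction of the measurement window during which the resource was active."""
    return active_time / total_time
```

For instance, a resource that is active for 30 time units in a 120-unit measurement window has a utilization of 0.25.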

Input Task Graphs

Experimental Parameters Input:

Experimental Results Output:

Experimental Results

Conclusion and Future Work: A codesign framework for PPSs that considers multiple flows and real-time constraints. The iterative improvement scheme introduces packet-level parallelism into the system. For the task graphs of the benchmark applications, the method produces a solution in a short time and shows performance metrics comparable to those of existing PPSs. The framework can be extended with: an object-oriented or modeling language for specification; the effects of caching and multithreading; dynamic analysis for workload characterization.

Thank You. Questions?