
A Methodology for Evaluating Runtime Support in Network Processors
Xin Huang and Tilman Wolf
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst

Slide 2: Runtime Support in Network Processors
- Network processor (NP)
  - Multi-core system-on-chip
  - Programmability and high packet processing rates
- Heterogeneous resources
  - Control processors
  - Multiple packet processors
  - Co-processors
  - Memory hierarchy
  - Interconnection
- Runtime support
  - Dynamic task allocation
[Figure: Intel IXP 2800 block diagram]

Slide 3: General Operation of Runtime Support in an NP
- Input
  - Hardware resources
  - Workload
- Mapping method
- Output
  - Task allocation
- Dynamic adaptation
- Different runtime support systems are difficult to compare
[Figure: example mapping of applications AP1, AP2, AP3 onto packet processors]
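To make the inputs and outputs of this mapping step concrete, here is a minimal, hypothetical Python sketch of the runtime-support interface described above. The class and method names (Resources, Workload, RuntimeSystem, map_tasks, adapt) are illustrative assumptions, not part of the original work.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Resources:
    """Hardware resources visible to the runtime system (illustrative)."""
    num_processors: int
    mips_per_processor: float

@dataclass
class Workload:
    """Applications/tasks and the traffic (arrival rates) driving them."""
    tasks: List[str]
    arrival_rates: Dict[str, float]       # packets/s per task
    service_demands: Dict[str, float]     # processing time per packet per task

class RuntimeSystem:
    """Abstract view of a runtime support system: input = resources + workload,
    output = a task allocation; adaptation re-runs the mapping as traffic shifts."""
    def map_tasks(self, res: Resources, wl: Workload) -> Dict[str, List[int]]:
        """Return, for each task, the list of processors it is allocated to."""
        raise NotImplementedError
    def adapt(self, res: Resources, wl: Workload) -> Dict[str, List[int]]:
        """Dynamic adaptation: by default simply recompute the mapping."""
        return self.map_tasks(res, wl)
```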

Slide 4: Contributions
- Evaluation methodology
  - Traffic representation
  - Analytical system model based on queuing networks
  - Results
- Specifically: three example runtime support systems
  - I. Ideal Allocation
  - II. Full Processor Allocation
    R. Kokku, T. Riche, A. Kunze, J. Mudigonda, J. Jason, and H. Vin. A Case for Run-time Adaptation in Packet Processing Systems. In Proc. of the 2nd Workshop on Hot Topics in Networks (HotNets-II), Cambridge, MA, Nov. 2003.
  - III. Partitioned Application Allocation
    T. Wolf, N. Weng, and C.-H. Tai. Design Considerations for Network Processor Operating Systems. In Proc. of ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pages 71-80, Princeton, NJ, Oct. 2005.

Slide 5: Outline
- Introduction
- Evaluation Methodology
  - Dynamic Workload Model
  - Runtime System Model
- Results
- Summary

Slide 6: Workload
- NP workload is characterized by applications and traffic
- How to represent this workload?

Slide 7: Dynamic Workload Model
- Workload graph
  - Application/task: T
  - Traffic (packet arrival rates)
  - Processing requirement per packet
- Example workload graph
- Processing requirements measured with PacketBench:
  R. Ramaswamy and T. Wolf. PacketBench: A Tool for Workload Characterization of Network Processing. In Proc. of IEEE 6th Annual Workshop on Workload Characterization (WWC-6), pages 42-50, Austin, TX, Oct. 2003.
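The workload-graph notation on this slide was shown as images and did not survive transcription. As an illustration only, a dynamic workload of this kind (tasks, time-varying traffic, and per-packet processing requirements such as those reported by PacketBench) might be represented as follows; all names and numbers below are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    name: str
    instructions_per_packet: float   # processing requirement (e.g., from PacketBench)

@dataclass
class DynamicWorkload:
    tasks: Dict[str, Task]
    # traffic: per-task packet arrival rate as a function of time (packets/s)
    arrival_rate: Callable[[str, float], float]

# Illustrative example: IP forwarding and encryption with a shifting traffic mix
tasks = {
    "ipv4": Task("ipv4", instructions_per_packet=500.0),
    "ipsec": Task("ipsec", instructions_per_packet=8000.0),
}
workload = DynamicWorkload(
    tasks=tasks,
    arrival_rate=lambda name, t: 1e5 if name == "ipv4" else 2e4 + 1e4 * (t > 30),
)
```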

Slide 8: Outline
- Introduction
- Evaluation Methodology
  - Dynamic Workload Model
  - Runtime System Model
- Results
- Summary

Slide 9: Runtime System Model
- Unified approach for all runtime systems
  - Queuing networks
- Specific solution for each runtime system
  - Runtime mapping
  - Graph
  - Packet arrival rate
  - Service time
- Metrics for all runtime systems
  - Processor utilization
  - Average number of packets in the system
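The formulas on this slide were equation images and are missing from the transcript. In standard queuing-network notation, the two metrics typically take the following form (the symbols here are assumed, not necessarily the authors' exact notation):

```latex
% Utilization of a station with m processors, packet arrival rate \lambda,
% and mean service time E[S]:
\rho = \frac{\lambda \, E[S]}{m}

% Average number of packets in the system via Little's law,
% where T is the mean time a packet spends in the system:
\bar{N} = \lambda \, T
```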

Slide 10: Three Example Runtime Support Systems
- System I: Ideal Allocation
- System II: Full Processor Allocation
- System III: Partitioned Application Allocation

Slide 11: Example Evaluation Model – System I
- Ideal Allocation
  - All processors can process all packets completely
  - Unrealistic, but provides a baseline
  - Modeled as an M/G/m FCFS single-station queue

Slide 12: M/G/m Single-Station Queuing System
- Cosmetatos approximation for the mean waiting time
- Evaluation metrics
References:
- G. Cosmetatos. Some Approximate Equilibrium Results for the Multi-Server Queue (M/G/r). Operational Research Quarterly, pages 615-620, 1976.
- G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, Inc., New York, NY, August 1998.
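The approximation itself appeared on the slide only as an equation image. As a hedged illustration, here is a short Python sketch of the Cosmetatos approximation in the form given in the Bolch et al. textbook: the M/G/m mean waiting time is interpolated between the M/M/m and (approximated) M/D/m waiting times using the squared coefficient of variation of the service time. The function names and example numbers are assumptions, not the authors' code.

```python
import math

def erlang_c(m: int, rho: float) -> float:
    """Probability that an arriving packet must wait in an M/M/m queue
    with m servers and per-server utilization rho (Erlang C formula)."""
    a = m * rho  # offered load
    head = sum(a**k / math.factorial(k) for k in range(m))
    tail = a**m / (math.factorial(m) * (1.0 - rho))
    return tail / (head + tail)

def wait_mmm(lam: float, es: float, m: int) -> float:
    """Mean waiting time in an M/M/m queue (arrival rate lam, mean service time es)."""
    rho = lam * es / m
    return erlang_c(m, rho) * es / (m * (1.0 - rho))

def wait_mgm_cosmetatos(lam: float, es: float, cs2: float, m: int) -> float:
    """Approximate mean M/G/m waiting time via Cosmetatos: interpolate between
    the M/M/m and (approximated) M/D/m waiting times using the squared
    coefficient of variation cs2 of the service time."""
    rho = lam * es / m
    w_mm = wait_mmm(lam, es, m)
    # Cosmetatos' correction factor for the M/D/m waiting time
    w_md = 0.5 * (1.0 + (1.0 - rho) * (m - 1) *
                  (math.sqrt(4 + 5 * m) - 2) / (16 * rho * m)) * w_mm
    return cs2 * w_mm + (1.0 - cs2) * w_md

# Illustrative example: 16 processors, aggregate arrival rate 12, mean service time 1
lam, es, m, cs2 = 12.0, 1.0, 16, 0.5
w = wait_mgm_cosmetatos(lam, es, cs2, m)
print("processor utilization:", lam * es / m)
print("avg packets in system (Little's law):", lam * (w + es))
```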

Slide 13: Example Evaluation Model – System II
- Full Processor Allocation
  - Entire tasks are allocated to subsets of processors
  - Allocate as few processors as possible to save power
  - Each processor runs one type of task
  - Reallocation is triggered by queue length
  - Modeled as a BCMP M/M/1-FCFS network (Jackson network)

Slide 14: BCMP Network
- BCMP: Baskett, Chandy, Muntz, and Palacios
- Characteristics:
  - Open, closed, and mixed queuing networks
  - Several job classes
  - Four types of nodes: M/M/m-FCFS (class-independent service time), M/G/1-PS, M/G/∞-IS, and M/G/1-LCFS-PR
- Product-form steady-state solution
- Open M/M/1-FCFS BCMP queuing network: evaluation metrics
Reference:
- F. Baskett, K. Chandy, R. Muntz, and F. Palacios. Open, Closed, and Mixed Networks of Queues with Different Classes of Customers. Journal of the ACM, 22(2):248-260, April 1975.
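The product-form solution and the evaluation metrics were shown as equations that did not survive transcription. For an open network of M/M/1-FCFS nodes, as used here, they take the following standard form (the symbols are assumed notation, not necessarily the authors'):

```latex
% Steady-state probability of the node populations k_1, ..., k_N (product form)
\pi(k_1,\dots,k_N) \;=\; \prod_{i=1}^{N} (1-\rho_i)\,\rho_i^{k_i},
\qquad \rho_i = \frac{\lambda_i}{\mu_i} < 1

% Per-node metrics: utilization \rho_i and mean number of packets at node i,
% summed to give the average number of packets in the system
\bar{K}_i = \frac{\rho_i}{1-\rho_i},
\qquad
\bar{K} = \sum_{i=1}^{N} \bar{K}_i
```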

Slide 15: Example Evaluation Model – System III
- Partitioned Application Allocation
  - Tasks are partitioned across multiple processors
  - Synchronized pipelines
  - Tasks are allocated equally across all processors to maximize throughput
  - Reallocation at fixed time intervals
  - Modeled as a BCMP M/M/1-FCFS network (Jackson network); the equations for the evaluation metrics are the same as for System II
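As an illustration of how the System II/III metrics follow from the Jackson-network model above, here is a minimal Python sketch; the function name and example numbers are assumptions.

```python
def jackson_metrics(arrival_rates, service_times):
    """Per-processor utilization and mean packets in the system for a Jackson
    network of M/M/1-FCFS nodes (one node per packet processor).
    arrival_rates[i] and service_times[i] describe the traffic routed to
    processor i after task allocation."""
    utilizations, per_node = [], []
    for lam, es in zip(arrival_rates, service_times):
        rho = lam * es                      # utilization of node i
        assert rho < 1.0, "node would be unstable"
        utilizations.append(rho)
        per_node.append(rho / (1.0 - rho))  # mean number of packets at node i
    return utilizations, sum(per_node)

# Illustrative example: four processors with different per-task loads
util, n_pkts = jackson_metrics([0.3, 0.5, 0.4, 0.6], [1.0, 1.2, 1.5, 1.0])
print("utilizations:", util)
print("avg packets in system:", n_pkts)
```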

Slide 16: Outline
- Introduction
- Evaluation Methodology
  - Dynamic Workload Model
  - Runtime System Model
- Results
- Summary

Slide 17: Setup
- System
  - MIPS processing engines
  - Queue lengths are infinite
- Workload
- Other assumptions
  - Applications are partitioned into 7-15 subtasks

Slide 18: Processor Allocation Over Time
- Ideal: 16 processors
- Full Processor: changes with traffic
- Partitioned Application: 16 processors
[Figure: processor allocation over time in the full processor allocation system]

Slide 19: Processor Utilization Over Time
- Ideal: lowest processor utilization
- Full Processor: highest processor utilization, because it uses fewer processors
- Partitioned Application: low processor utilization
  - Not equal to the ideal case due to unbalanced task allocation and pipeline overhead

Slide 20: Packets in System Over Time
- Ideal: fewest packets
- Full Processor: packets queue up due to its high processor utilization
- Partitioned Application: most packets, due to unbalanced task allocation and pipeline overhead
  - More stable performance because of finer processor allocation granularity

Slide 21: Performance for Different Data Rates
- Ideal: smooth increase
- Full Processor: periodic peaks
- Partitioned Application: smooth increase
- Maximum data rate supported by the systems:
  - Ideal: 100%
  - Full Processor: 79.6%
  - Partitioned Application: 75.1%

Slide 22: Implications of the Results
- Ideal Allocation
  - Provides a baseline
- Full Processor Allocation
  - Allocates as few processors as possible to save power
  - Uses the entire processor as the allocation granularity
  - Good: high processor utilization
  - Bad: high performance variance
- Partitioned Application Allocation
  - Distributes tasks equally across all processors
  - Finer processor allocation granularity
  - Good: stable performance
  - Bad: hard to find an optimal partitioning, which leads to pipeline synchronization overhead

Slide 23: Summary
- Analytical methodology for evaluating different NP runtime support systems
- Dynamic workload model and runtime system model
- Results for three example runtime support systems
  - Quantitative metrics
  - Tradeoffs

Slide 24: Questions?