Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University

Presentation transcript:

Quadrisection-Based Task Mapping on Many-Core Processors for Energy-Efficient On-Chip Communication
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang, Cornell University

Thank you for the introduction. Good afternoon, everyone. In this talk, I am going to present a task mapping algorithm called Quadrisection, which targets energy-efficient on-chip communication. This is joint work with Nithin Michael, Professor Edward Suh and Professor Kevin Tang at Cornell University.

N. Michael, Y. Wang, G. E. Suh & A. Tang, Quadrisection-Based Task Mapping

Task Mapping Problem

Task mapping for many-core processors. Our focus: mapping communicating application tasks to cores to minimize communication power consumption. [Figure: DSP]

With technology scaling, more and more cores are integrated into a single chip. This creates a challenge for resource management and utilization, and task mapping is one part of that challenge. We define the task mapping problem as follows: given a set of tasks and the communication requirements between them, we map the tasks to a many-core processor to achieve a certain goal. Here is one possible mapping to a many-core processor with a mesh network. Different mappings clearly result in different performance and power consumption. Our goal in this work is to find the mapping that minimizes communication power consumption.

Assumptions

Applications of interest consist of multiple tasks communicating via message passing, represented by a communication task graph (CTG). The number of tasks is less than the number of cores. Assume a mesh network and dimension-ordered routing. [Figure: CTG for DSP]

We made several basic assumptions. First, the application consists of multiple tasks that communicate via message passing. The application is represented by a communication task graph, or CTG for short, which describes the set of tasks and the communication rates between them, as you can see in this DSP example. Second, we assume that the number of tasks to be mapped is less than the number of cores, because we are interested in mapping, not in scheduling multiple tasks on a single core. Finally, we assume a mesh network and dimension-ordered routing.

Problem Difficulty

A(i, j): communication rate from task i to task j. B(i, j): Manhattan distance between node i and node j of the NoC. p(i), p(j): nodes to which task i and task j are mapped. Cost = sum over all task pairs (i, j) of A(i, j) × B(p(i), p(j)), which represents the dynamic power consumption. Task mapping is exactly the Quadratic Assignment Problem: NP-hard and difficult to approximate.

Here is how we formulate the problem mathematically. A(i, j) denotes the communication rate from task i to task j, and B(i, j) is the Manhattan distance between node i and node j. Since we are using dimension-ordered routing, B(i, j) is also the number of hops for a communication flow between node i and node j. The cost function is defined as the sum of A(i, j) times B(p(i), p(j)), where p(i) and p(j) are the nodes to which task i and task j are mapped. In other words, we multiply the communication rate between two tasks by the number of hops between the nodes where they are mapped. Intuitively, this represents the communication power consumption, so lower is better. Our goal is to find the mapping function p() that minimizes this cost. However, this is exactly the Quadratic Assignment Problem, which is NP-hard.
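The cost function described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the talk: the names `manhattan`, `mapping_cost` and `brute_force_mapping` are hypothetical, the CTG is stored as a dictionary of directed rates, and mesh nodes are (row, column) tuples. The exhaustive search is included only to show why an optimal reference is feasible for small graphs; it is not one of the algorithms compared in the talk.

```python
from itertools import permutations

def manhattan(a, b):
    """Hop count between two mesh nodes under dimension-ordered routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mapping_cost(A, p):
    """Sum of A(i, j) * B(p(i), p(j)) over all communicating task pairs.

    A: dict {(i, j): rate}  (assumed representation of the CTG)
    p: dict {task: (row, col)}  (the mapping function)
    """
    return sum(rate * manhattan(p[i], p[j]) for (i, j), rate in A.items())

def brute_force_mapping(A, tasks, nodes):
    """Try every injective assignment of tasks to nodes (tiny inputs only)."""
    best_cost, best_map = None, None
    for perm in permutations(nodes, len(tasks)):
        p = dict(zip(tasks, perm))
        c = mapping_cost(A, p)
        if best_cost is None or c < best_cost:
            best_cost, best_map = c, p
    return best_cost, best_map

# Hypothetical 3-task CTG on a 2x2 mesh.
A = {(0, 1): 10, (1, 2): 5, (0, 2): 1}
nodes = [(0, 0), (0, 1), (1, 0), (1, 1)]
p = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
print(mapping_cost(A, p))
print(brute_force_mapping(A, [0, 1, 2], nodes)[0])
```

The number of assignments grows factorially with the number of tasks, which is consistent with the talk's point that exact solutions are only practical for small task graphs.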

State of the Art

NMAP: initial greedy map followed by pairwise swaps. Computation intensive, slow. BMAP: clustering-based approach that merges communicating tasks. Poor quality (high cost value). Bisection: recursively divide the task graph into equal halves to minimize the communication between them.

There are several existing approaches to this task mapping problem. NMAP uses an initial greedy mapping and then performs pairwise swaps to improve it. The quality of the resulting mapping is fair, but the algorithm is quite slow. Another approach, BMAP, clusters communicating tasks together step by step, but the quality of the resulting mapping is quite poor. Bisection approaches the problem by recursively dividing the task graph into halves while trying to minimize the communication between the halves. The operations are shown in this figure. First, it divides the task graph into a left and a right panel. Then each subpanel is further divided into a top and a bottom panel. This is repeated until each subpanel is a single node. Bisection achieves mappings of quite good quality and is much more efficient than NMAP. [1]

[1] K. Srinivasan and K. S. Chatha. A Technique for Low Energy Mapping and Routing in Network-on-Chip Architectures. ISLPED 2005.

Can we improve Bisection?

Key observation: mapping is a 2D problem. Bisection uses vertical and horizontal min-cuts based on the Fiduccia-Mattheyses (FM) algorithm. FM is applied in 1D, so the derived local optimum does not consider the 2D mapping. [Figure: FM applied to successive panels]

So, can we do better than Bisection? Our key observation is that mapping is a two-dimensional problem. Bisection approaches this 2D problem by using vertical and horizontal cuts of the task graph. The cutting is based on the Fiduccia-Mattheyses algorithm, or FM for short. Understanding FM is very important, so I am going to spend a little time explaining how it works. Assume we have a communication task graph that FM divides into a left and a right panel. FM first divides the task graph randomly. It then swaps one node between the left and right panels if the swap reduces the communication cost between the two panels. This swap operation is repeated until no swap can further reduce the communication between the two panels, which means the partition has reached a local optimum. The next step is to apply FM to each of the subpanels, and we do this recursively until the task graph is fully mapped. What is the potential limitation here? In each step, FM is applied only in 1D, dividing a panel into two subpanels, so the derived local optimum does not consider the 2D nature of the task mapping problem.
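The swap loop described in the notes can be sketched as follows. This is a simplified illustration of the procedure as the speaker describes it (random start, then improving swaps until none remains); it is not the full Fiduccia-Mattheyses algorithm with gain buckets and node locking, and the function names, the dictionary-based CTG and the fixed random seed are assumptions made for the example.

```python
import random

def cut_cost(A, side):
    """Total communication rate crossing between the two panels."""
    return sum(rate for (i, j), rate in A.items() if side[i] != side[j])

def swap_bisect(A, tasks, seed=0):
    """Balanced two-way split refined by best-improvement swaps."""
    rng = random.Random(seed)
    order = list(tasks)
    rng.shuffle(order)                      # random initial division
    half = len(order) // 2
    side = {t: (0 if k < half else 1) for k, t in enumerate(order)}
    while True:
        base = cut_cost(A, side)
        best_gain, best_pair = 0, None
        left = [t for t in order if side[t] == 0]
        right = [t for t in order if side[t] == 1]
        for u in left:
            for v in right:
                side[u], side[v] = 1, 0     # tentatively swap u and v
                gain = base - cut_cost(A, side)
                side[u], side[v] = 0, 1     # undo the tentative swap
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:               # local optimum: no swap helps
            return side
        u, v = best_pair
        side[u], side[v] = 1, 0

# Hypothetical CTG: two tightly coupled pairs with weak links between them.
A = {(0, 1): 10, (2, 3): 10, (0, 2): 1, (1, 3): 1}
side = swap_bisect(A, [0, 1, 2, 3])
print(side, cut_cost(A, side))
```

Because every accepted swap strictly lowers the cut cost, the loop terminates. Recursive Bisection would call such a routine again on each panel, alternating vertical and horizontal cuts.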

Quadrisection

Main idea: apply the FM algorithm to the 2D mapping. Explore a different part of the design space. The local optimum considers the 2D mapping. Further optimize subpanel placement by brute-force search. [Figure: FM across four panels; neighboring panels are 1 hop apart, diagonal panels 2 hops]

Based on the previous observation, we came up with Quadrisection. The main idea is to apply the FM algorithm to the 2D mapping. Again, assume we have a task graph. In Quadrisection, the graph is divided into four panels using FM. FM works very much as it does in Bisection: it swaps one node between two of the four panels to reduce the total communication cost among the four panels, and this swap operation is repeated until no swap can further reduce the communication cost. The main difference is how we calculate the communication cost. In Quadrisection, communication across the diagonal is counted as 2 hops, so the communication cost between diagonal panels is multiplied by a weight of 2, while the communication cost between neighboring panels has a weight of 1. This calculation captures the 2D nature of the task mapping problem, and we believe this heuristic explores the design space better than the Bisection algorithm. Furthermore, we optimize the subpanel placement by brute-force search. For example, in the top-left panel, we permute the positions of its four subpanels to find the placement that minimizes the communication cost to the other three panels.
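The weighted cost and swap loop described in the notes can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: the quadrant numbering (0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right), the function names, the round-robin initial assignment and the fixed seed are all hypothetical, and the brute-force subpanel placement step is not shown.

```python
import random

def quad_hops(qa, qb):
    """Weight between panels: 1 for neighbors, 2 across the diagonal."""
    return abs(qa // 2 - qb // 2) + abs(qa % 2 - qb % 2)

def quad_cost(A, quad):
    """Communication cost among the four panels with hop weights."""
    return sum(rate * quad_hops(quad[i], quad[j]) for (i, j), rate in A.items())

def swap_quadrisect(A, tasks, seed=0):
    """Four-way split refined by best-improvement swaps between panels."""
    rng = random.Random(seed)
    order = list(tasks)
    rng.shuffle(order)                           # random initial division
    quad = {t: k % 4 for k, t in enumerate(order)}
    while True:
        base = quad_cost(A, quad)
        best_gain, best_pair = 0, None
        for a in range(len(order)):
            for b in range(a + 1, len(order)):
                u, v = order[a], order[b]
                if quad[u] == quad[v]:
                    continue                     # same panel: nothing to swap
                quad[u], quad[v] = quad[v], quad[u]   # tentative swap
                gain = base - quad_cost(A, quad)
                quad[u], quad[v] = quad[v], quad[u]   # undo
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:                    # local optimum reached
            return quad
        u, v = best_pair
        quad[u], quad[v] = quad[v], quad[u]

# Hypothetical CTG: four tasks communicating in a ring.
A = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}
quad = swap_quadrisect(A, [0, 1, 2, 3])
print(quad, quad_cost(A, quad))
```

In this ring example, a placement that puts two communicating tasks on a diagonal costs 60, while placing the ring around the 2x2 arrangement costs 40. A plain two-way cut count would not distinguish these cases, which is the 2D information the diagonal weight adds.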

Evaluation Methodology

Evaluations were performed numerically and using simulation. Performance was evaluated for mesh networks using synthetically generated random task graphs and multimedia benchmarks. Simulations were performed using a cycle-level simulator, and power consumption was evaluated using Orion2.

We evaluate the effectiveness of Quadrisection using both numerical results and simulation. For the numerical results, we evaluate the cost function of the mappings derived by the different algorithms. The benchmarks include both synthetic task graphs and multimedia benchmarks. For simulation, we use a cycle-level network simulator to simulate the network traffic based on the task mapping and derive the power using Orion2.

Cost: Synthetic Task Graph

100 randomly generated CTGs were mapped to a 4-by-4 mesh network.

Here is the numerical result for synthetic task graphs. We randomly generated 100 task graphs, which were mapped to a 4-by-4 mesh network using four different algorithms. The figure shows the average cost of each algorithm normalized to Quadrisection. As a reminder, lower cost is better. As can be seen, Quadrisection outperforms NMAP and BMAP by 20-30%, and outperforms Bisection by about 7%.

Power: Multimedia Benchmarks

[Figure: communication power]

Next is the result for communication power consumption on the multimedia benchmarks. We collected the network traffic using a cycle-level network simulator and then derived the power consumption using Orion2 power models. Because some task graphs are small enough, we are able to derive the optimal mapping for them, which provides a reference for how well our algorithm performs. Quadrisection achieves about 7% lower power consumption than Bisection and 10-20% lower power consumption than NMAP and BMAP. On average, it is within 2% of the optimal mapping.

Execution Time

Quadrisection's runtime is comparable with previous approaches.

The runtime of an algorithm is also an important metric when dealing with this NP-hard problem. Quadrisection has a runtime similar to BMAP and Bisection, and is orders of magnitude faster than NMAP.

Conclusions

Quadrisection outperforms commonly used mapping approaches by exploiting the 2D structure. The runtime of Quadrisection is comparable to the state of the art. It is a useful addition to the NoC designer's toolkit.

Cost: Multimedia Benchmarks

[Figure: cost]

Next is the result for the multimedia benchmarks. Because some task graphs are small enough, we are able to derive the optimal mapping for them, which provides a reference for how well our algorithm performs. As can be seen, Quadrisection performs well in most cases. On average, it is within 2% of the optimal mapping and performs best of the four algorithms.