An Application-Specific Design Methodology for STbus Crossbar Generation. Authors: Srinivasan Murali, Giovanni De Micheli. Proceedings of DATE’05, pp. 1176-1181, 2005.


1 An Application-Specific Design Methodology for STbus Crossbar Generation
Authors: Srinivasan Murali, Giovanni De Micheli
Proceedings of DATE’05, pp. 1176-1181, 2005
Presenter: Ching-Yuan Lin
Date: 2007/1/22
Seminar book: p. 120

2 Abstract
• As the communication requirements of current and future Multiprocessor Systems on Chips (MPSoCs) continue to increase, scalable communication architectures are needed to support the heavy communication demands of the system. This is reflected in the recent trend that many of the standard bus products, such as STbus, have now introduced the capability of designing a crossbar with multiple buses operating in parallel. The crossbar configuration should be designed to closely match the application traffic characteristics and performance requirements. In this work we address this issue of application-specific design of optimal crossbar (using STbus crossbar architecture), satisfying the performance requirements of the application and optimal binding of cores onto the crossbar resources. We present a simulation based design approach that is based on analysis of actual traffic trace of the application, considering local variations in traffic rates, temporal overlap among traffic streams and criticality of traffic streams. Our methodology is applied to several MPSoC designs and the resulting crossbar platforms are validated for performance by cycle-accurate SystemC simulation of the designs. The experimental case studies show large reduction in packet latencies (up to 7×) and large crossbar component savings (up to 3.5×) compared to traditional design approaches.

3 STbus Crossbar Architecture
• Low-latency, high-bandwidth infrastructure
• Interface components: arbiters and frequency/data-width adapters
[Figure: ※ Full crossbar architecture, shown as an initiator-target crossbar and a target-initiator crossbar, each with initiators I1-I3, arbiters A1-A3, buses 1-3, and targets T1-T3. ※ Partial crossbar architecture, shown as an initiator-target crossbar with initiators I1-I3, arbiters A1-A2, buses 1-2, and targets T1-T3.]

4 What’s the Problem?
• A full crossbar is expensive
  • Lots of wires and gates
• A partial crossbar is a compromise solution
• Optimum partial crossbar
  • Latency close to that of a full crossbar
  • Fewer components and less area
• How do we design the best partial crossbar for an application?

5 Application Traffic Analysis
• Example traffic traces from three targets
• Overlap increases average and peak latency
[Figure: ※ Traffic trace. Traces of T1, T2, and T3 over the simulation period with the overlapping regions marked; a second panel shows T1 and T2 merged.]
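The overlap between two traffic traces can be made concrete with a small sketch. It assumes, purely for illustration, that each trace is a sorted list of non-overlapping half-open `(start, end)` busy intervals measured in cycles; the function name `overlap_cycles` and the sample traces are hypothetical and do not come from the paper.

```python
def overlap_cycles(trace_a, trace_b):
    """Count the cycles in which both traces are busy.

    Assumption: each trace is a sorted list of non-overlapping,
    half-open (start, end) intervals in cycles.
    """
    total = 0
    i = j = 0
    while i < len(trace_a) and j < len(trace_b):
        start = max(trace_a[i][0], trace_b[j][0])
        end = min(trace_a[i][1], trace_b[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval finishes first.
        if trace_a[i][1] <= trace_b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical traces for three targets.
t1 = [(0, 4), (10, 14)]
t2 = [(2, 6), (12, 13)]
t3 = [(6, 10)]

print(overlap_cycles(t1, t2))  # 3: cycles 2-3 and cycle 12
print(overlap_cycles(t1, t3))  # 0: T1 and T3 are never busy together
```

In this example, T1 and T3 could share a bus without contention, whereas T1 and T2 would contend for three cycles.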

6
• Consider the criticality of streams
• Targets with overlapping real-time streams should not share the same bus
[Figure: ※ Traffic trace. Traces of T1, T2, and T3 over the simulation period, with the streams under a real-time constraint marked.]

7 Crossbar Design Approach
• Simulation time windows for analysis
  • Split the simulation period into fixed-size windows
• In each simulation window:
  • Satisfy the bandwidth requirement: the total data received by the cores placed on the same bus must be less than or equal to the window size
  • Minimize overlap among streams
  • Consider the criticality of streams
[Figure: traces of T1, T2, and T3 over the simulation period, divided into Window 1 and Window 2, with the overlap marked.]
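The per-window bandwidth requirement can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: traces are lists of half-open `(start, end)` busy intervals in cycles, one busy cycle stands for one cycle of bus occupancy, and the names `busy_cycles_per_window` and `satisfies_bandwidth` are hypothetical.

```python
def busy_cycles_per_window(trace, window_size, num_windows):
    """Busy cycles of one trace inside each fixed-size window."""
    counts = []
    for w in range(num_windows):
        w_start, w_end = w * window_size, (w + 1) * window_size
        busy = sum(
            max(0, min(end, w_end) - max(start, w_start))
            for start, end in trace
        )
        counts.append(busy)
    return counts


def satisfies_bandwidth(traces, window_size, num_windows):
    """True if the targets sharing one bus fit into every window."""
    totals = [0] * num_windows
    for trace in traces:
        per_window = busy_cycles_per_window(trace, window_size, num_windows)
        for w, busy in enumerate(per_window):
            totals[w] += busy
    return all(total <= window_size for total in totals)


# Hypothetical traces over a 20-cycle simulation period.
t1 = [(0, 6)]
t2 = [(2, 7)]
t3 = [(12, 18)]

print(satisfies_bandwidth([t1, t2], window_size=10, num_windows=2))  # False
print(satisfies_bandwidth([t1, t3], window_size=10, num_windows=2))  # True
print(satisfies_bandwidth([t1, t2], window_size=20, num_windows=1))  # True
```

The last call shows why windows matter: with a single window spanning the whole simulation, T1 and T2 appear to fit on one bus (11 of 20 cycles), but with 10-cycle windows their combined demand in the first window (11 cycles) exceeds the window size.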

8 Design flow for partial crossbar design

9 Phase 1
• Collect traffic on a full crossbar, which provides perfect communication
• Data-collection hardware is added to the arbiters
• Traffic collected in each window:
  • Data rate for each core
  • Overlap among streams
  • Criticality of streams

10 Phase 2: Pre-processing
• Cores that should be placed on different buses:
  • Cores with large overlap (above a threshold) (1)
  • Cores with overlapping critical streams (2)
  • Cores that together do not satisfy the bandwidth requirement (4)
• Maximum number of cores on a bus
  • Bounds the maximum latency (8)
  • Worst case: packets to all the targets on a bus can arrive in the same cycle
[Figure: traces of T1, T2, and T3 over the simulation period, each with one packet (burst) arriving in the same cycle.]
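A minimal sketch of how the forbidden pairs from the first two rules above might be derived from per-window overlap data. The data layout, the function name `forbidden_pairs`, and the rule that the threshold is tested against the worst window are assumptions made for this example; the slides do not spell out these details.

```python
def forbidden_pairs(window_overlap, critical, threshold):
    """Return the pairs of targets that must not share a bus.

    Assumptions (hypothetical, for illustration only):
    - window_overlap maps frozenset({a, b}) to a list holding the
      overlap in cycles for each simulation window;
    - critical is the set of targets that carry critical streams;
    - a pair is forbidden if its overlap exceeds the threshold in any
      window, or if both targets are critical and overlap at all.
    """
    pairs = set()
    for pair, per_window in window_overlap.items():
        worst = max(per_window)
        too_much_overlap = worst > threshold
        critical_clash = pair <= critical and worst > 0
        if too_much_overlap or critical_clash:
            pairs.add(pair)
    return pairs


# Hypothetical data: 10-cycle windows and a threshold of 3 cycles.
window_overlap = {
    frozenset({"T1", "T2"}): [4, 0],
    frozenset({"T1", "T3"}): [1, 0],
    frozenset({"T2", "T3"}): [0, 1],
}
critical = {"T2", "T3"}

result = forbidden_pairs(window_overlap, critical, threshold=3)
print(sorted(sorted(p) for p in result))  # [['T1', 'T2'], ['T2', 'T3']]
```

Here T1 and T2 are separated because their overlap exceeds the threshold in the first window, and T2 and T3 are separated because both carry critical streams that overlap, even though the overlap is small.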

11 Phase 3: Crossbar Design
• Step 1: start with a single bus
• Step 2: check for a feasible solution
  • Satisfy the window bandwidth constraints (4)
  • Place forbidden cores on different buses (1)(2)
  • Keep the number of cores on each bus within the maximum (8)
• Step 3: if no feasible solution exists, repeat step 2 with the number of buses incremented by 1
• Optimal binding:
  • Minimize overlap on each bus (11)
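The loop above can be sketched with a brute-force search. This is only an illustration for tiny inputs: exhaustive enumeration is swapped in for whatever solver the authors use, the binding objective is assumed to be the largest total pairwise overlap on any bus, and all names (`design_crossbar`, `window_load`, and so on) are hypothetical.

```python
from itertools import combinations, product


def bus_overlap(bus, overlap):
    """Total pairwise overlap (cycles) among the targets on one bus."""
    return sum(overlap.get(frozenset(p), 0) for p in combinations(bus, 2))


def bus_is_feasible(bus, window_load, forbidden, window_size, max_per_bus):
    """Check the per-bus constraints: size limit, forbidden pairs, bandwidth."""
    if len(bus) > max_per_bus:
        return False
    if any(frozenset(p) in forbidden for p in combinations(bus, 2)):
        return False
    num_windows = len(next(iter(window_load.values())))
    return all(
        sum(window_load[t][w] for t in bus) <= window_size
        for w in range(num_windows)
    )


def design_crossbar(targets, window_load, overlap, forbidden,
                    window_size, max_per_bus):
    """Smallest bus count with a feasible binding, then least overlap.

    Returns (num_buses, buses, worst_bus_overlap), or None if even one
    target per bus is infeasible. Brute force: tiny examples only.
    """
    for num_buses in range(1, len(targets) + 1):
        best = None
        for assignment in product(range(num_buses), repeat=len(targets)):
            buses = [
                [t for t, b in zip(targets, assignment) if b == k]
                for k in range(num_buses)
            ]
            if not all(
                bus_is_feasible(bus, window_load, forbidden,
                                window_size, max_per_bus)
                for bus in buses
            ):
                continue
            cost = max(bus_overlap(bus, overlap) for bus in buses)
            if best is None or cost < best[0]:
                best = (cost, buses)
        if best is not None:
            return num_buses, best[1], best[0]
    return None


# Hypothetical three-target example with two 10-cycle windows.
targets = ["T1", "T2", "T3"]
window_load = {"T1": [6, 0], "T2": [5, 0], "T3": [0, 6]}
overlap = {frozenset({"T1", "T2"}): 4}
forbidden = {frozenset({"T1", "T2"})}

print(design_crossbar(targets, window_load, overlap, forbidden,
                      window_size=10, max_per_bus=2))
# (2, [['T1', 'T3'], ['T2']], 0)
```

A single bus is rejected because it would hold three targets, so the search moves to two buses and keeps T1 and T2 apart. Enumeration grows exponentially with the number of targets, so a real flow would need an optimization solver or a heuristic instead.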

12 Experimental Results
• Application benchmarks
  • Matrix suite-1 (25 cores)
  • Matrix suite-2 (21 cores)
  • FFT suite (29 cores)
  • Quick sort suite (15 cores)
  • DES encryption system (19 cores)
※ Matrix multiplication benchmark-2 (21 cores)
• Initiator-target full crossbar
  • Needs 12 buses
• Target-initiator full crossbar
  • Needs 9 buses
• Full crossbar (FC) bus count = 12 + 9 = 21

13 Average and Maximum Packet Latencies
• Win: optimal partial crossbar
• Avg: crossbar based on average communication traffic flow, obtained by relaxing the overlap constraints and using a single window
• Latencies of the Avg crossbar are 4× to 7× higher than those of the crossbar designed with the proposed scheme

14 Effect of Window Size Variations
• a) Small window size:
  • Finer control of the performance parameters, so the crossbars have lower latencies
  • Disadvantage: over-design of the network components
• b) Acceptable window size for various burst sizes

15 Overlap Threshold Setting
• From experiments, the threshold value can be set to:
  • 30%-40% of the window size for aggressive designs
  • 10% of the window size for conservative designs

16 Conclusion
• Presented a methodology for STbus crossbar design that considers:
  • Local variations in traffic
  • Overlap of streams
  • Criticality of traffic streams
• Large savings in components with good performance
• The approach can be extended to other bus designs