Paper Review: Area-Performance Trade-offs in Tiled Dataflow Architectures
Ke Peng
Instructor: Chun-Hsi Huang
CSE 340 Computer Architecture, University of Connecticut

2/33 Reference
Steven Swanson, Andrew Putnam, Martha Mercaldi, Ken Michelson, Andrew Petersen, Andrew Schwerin, Mark Oskin, and Susan J. Eggers (Computer Science & Engineering, University of Washington), "Area-Performance Trade-offs in Tiled Dataflow Architectures," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), IEEE, 2006.

3/33 Outline
- Background and Introduction
- Experimental Infrastructure
- WaveScalar Architecture
- Evaluation
- Conclusion

4/33 Background and Introduction
Many issues must be addressed in processor design:
- Wire delay
- Fabrication reliability
- Design complexity
In tiled architectures, processing elements (PEs) are designed once and replicated across the chip. Examples of tiled architectures: RAW, SmartMemories, TRIPS, WaveScalar.

5/33 Background and Introduction
[Figure: The tiled WaveScalar architecture. Image captured online, University of Washington]

6/33 Background and Introduction
Benefits of the replicated-PE design:
- Decreases design and verification time
- Provides robustness against fabrication errors
- Reduces wire delay for data and control signal transmission
Good performance is achievable only if all aspects of the microarchitecture are properly designed. Challenges include:
- Number of tiles vs. tile size
- Many highly utilized tiles vs. fewer, possibly more powerful tiles
- Partitioning and distribution of data memory across the chip
- Tile interconnection
- Etc.

7/33 Background and Introduction
This paper focuses on the WaveScalar processor and explores the area-performance trade-offs encountered when designing a tiled architecture. WaveScalar is a tiled dataflow architecture with:
- PE replication
- Hierarchical data networks
- Distributed hardware data structures, including the caches, store buffers, and specialized dataflow memories (token store)

8/33 Experimental Infrastructure
- Synthesizable RTL WaveScalar model
- TSMC (Taiwan Semiconductor Manufacturing Company) 90 nm technology
- Synopsys DesignWare IP
- Synopsys Design Compiler for front-end synthesis
- Cadence First Encounter for back-end synthesis
- Synopsys VCS for RTL simulation and functional verification
- Hierarchical, single-voltage design that can be extended to a multiple-voltage design

9/33 Experimental Infrastructure
Three workloads evaluate the WaveScalar processor:
- SPEC2000 benchmark suite (ammp, art, equake, gzip, twolf, mcf) for single-threaded performance
- Mediabench (rawdaudio, mpeg2encode, djpeg) for media-processing performance
- Splash-2 benchmarks (fft, lu-contiguous, ocean-noncontiguous, raytrace, water-spatial, radix) for multi-threaded performance

10/33 WaveScalar Architecture
Processing elements (PEs): the PE is the heart of a WaveScalar machine.
[Figure: PE execution resources. Captured from Swanson et al., ISCA'06]

11/33 WaveScalar Architecture
Five pipeline stages of a PE (a firing-logic sketch follows this slide):
- Input: operand messages arrive at the PE, either from the PE itself or from another PE.
- Match: operands enter the matching table, which determines which instructions are ready to fire and issues the table indices of eligible instructions into the instruction scheduling queue.
- Dispatch: selects an instruction from the scheduling queue and reads its operands from the matching table for execution.
- Execute: executes the instruction.
- Output: sends the result to its consumer instructions via the interconnection network.
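To make the Match and Dispatch stages concrete, here is a minimal Python sketch of the dataflow firing rule they implement: an instruction fires only once all of its operands have arrived. The class and method names are hypothetical illustrations and do not come from the paper's RTL.

```python
from collections import deque

class PEMatchStage:
    """Hypothetical model of a PE's Match/Dispatch logic: a dataflow
    instruction becomes eligible to fire only after every one of its
    operands has been received."""

    def __init__(self, arity):
        self.arity = arity            # inst_id -> number of operands needed
        self.matching_table = {}      # inst_id -> operands received so far
        self.sched_queue = deque()    # fire-ready instructions, in order

    def on_operand(self, inst_id, value):
        """Input + Match stages: record the operand and enqueue the
        instruction once its operand set is complete."""
        ops = self.matching_table.setdefault(inst_id, [])
        ops.append(value)
        if len(ops) == self.arity[inst_id]:
            self.sched_queue.append(inst_id)

    def dispatch(self):
        """Dispatch stage: select a ready instruction and read its
        operands out of the matching table for the Execute stage."""
        if not self.sched_queue:
            return None
        inst_id = self.sched_queue.popleft()
        return inst_id, self.matching_table.pop(inst_id)

# A two-input add fires only after both operand messages arrive.
pe = PEMatchStage({"add1": 2})
pe.on_operand("add1", 3)
assert pe.dispatch() is None      # only one operand so far
pe.on_operand("add1", 4)
print(pe.dispatch())              # ('add1', [3, 4])
```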

12/33 WaveScalar Architecture
- Several PEs combine into a pod; PEs in a pod share bypass networks.
- Several pods combine into a domain.
- Several domains combine into a cluster.
These grouping sizes are the parameters that drive WaveScalar's area-performance trade-off (see the sketch after this slide). A 2-PE pod is 15% faster on average than isolated PEs; increasing the number of PEs in each pod would further increase performance but adversely affects cycle time.
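A small sketch of how these grouping parameters compose multiplicatively into a total PE count; the default values below are placeholders chosen for illustration, not the paper's baseline configuration.

```python
from dataclasses import dataclass

@dataclass
class WaveScalarHierarchy:
    """Hierarchy parameters the paper varies. The defaults are
    placeholder values, not the baseline from the paper's table."""
    pes_per_pod: int = 2          # PEs sharing a bypass network
    pods_per_domain: int = 4
    domains_per_cluster: int = 4
    clusters: int = 16

    @property
    def total_pes(self) -> int:
        return (self.pes_per_pod * self.pods_per_domain
                * self.domains_per_cluster * self.clusters)

print(WaveScalarHierarchy().total_pes)  # 512 with the placeholder values
```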

13/33 WaveScalar Architecture
[Table: Configuration of the baseline WaveScalar processor. Captured from Swanson et al., ISCA'06]

14/33 WaveScalar Architecture
[Figure: Hierarchical organization of the WaveScalar microarchitecture. Captured from Swanson et al., ISCA'06]

15/33 WaveScalar Architecture
Four-level hierarchical interconnect network (a routing sketch follows this slide):
- Intra-pod
- Intra-domain: broadcast-based; pseudo-PEs (Mem, Net) serve as gateways to the memory system and to PEs in other domains or clusters; 7% area overhead
- Intra-cluster: small network; negligible area overhead
- Inter-cluster: responsible for all long-distance communication; about 1% of total chip area
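The hierarchy implies a simple routing decision: compare source and destination coordinates and use the lowest network level that can carry the message. The coordinate encoding and function below are illustrative assumptions, not the paper's router logic.

```python
def interconnect_level(src, dst):
    """Pick the network level that carries a message, where src and
    dst are (cluster, domain, pod, pe) coordinate tuples. Purely an
    illustrative classifier, not the actual router."""
    if src[0] != dst[0]:
        return "inter-cluster"   # all long-distance communication
    if src[1] != dst[1]:
        return "intra-cluster"
    if src[2] != dst[2]:
        return "intra-domain"    # broadcast-based domain network
    return "intra-pod"           # includes a PE forwarding to itself

# (cluster, domain, pod, pe)
print(interconnect_level((0, 1, 2, 0), (0, 1, 2, 1)))  # intra-pod
print(interconnect_level((0, 1, 2, 0), (3, 0, 0, 0)))  # inter-cluster
```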

16/33 WaveScalar Architecture
[Figure: Hierarchical cluster interconnects. Captured from Swanson et al., ISCA'06]

17/33 WaveScalar Architecture
Memory subsystem:
- Wave-ordered store buffers: the memory interface that enables WaveScalar to execute programs written in imperative languages (C, C++, Java)
  - Store decoupling processes store-address and store-data messages separately
  - Partial store queues buffer the decoupled store addresses
  - Occupy approximately 6.2% of cluster area
- Conventional memory hierarchy with distributed L1 and L2 caches (a latency sketch follows this slide)
  - L1 data caches are 4-way set associative with 128-byte lines; hits cost 3 cycles
  - L2 hit latency: … cycles
  - Main memory latency is modeled at 200 cycles
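With the 3-cycle L1 hit and 200-cycle main-memory latency quoted on this slide, an expected memory access time falls out of a standard weighted average. The L2 latency and both hit rates in the example call are assumptions (the L2 figure is not given above), so the result is illustrative only.

```python
def avg_access_time(l1_lat, l1_rate, l2_lat, l2_rate, mem_lat):
    """Expected latency when each access is ultimately served by the
    L1, the L2, or main memory at the given total latencies."""
    miss_lat = l2_rate * l2_lat + (1 - l2_rate) * mem_lat
    return l1_rate * l1_lat + (1 - l1_rate) * miss_lat

# 3-cycle L1 hits and 200-cycle memory are from the slide; the
# 20-cycle L2 latency and the 90%/80% hit rates are assumed.
print(avg_access_time(l1_lat=3, l1_rate=0.90,
                      l2_lat=20, l2_rate=0.80, mem_lat=200))  # ~8.3 cycles
```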

18/33 Evaluation
[Figure: Die area breakdown for the baseline design. Captured from Swanson et al., ISCA'06]

19/33 Evaluation
[Table: Configuration of the baseline WaveScalar processor. Captured from Swanson et al., ISCA'06]

20/33 Evaluation
Many parameters affect the area required for WaveScalar designs. This paper considers the 7 parameters with the strongest effect on area requirements and ignores some minor effects, for example by assuming that wiring costs do not decrease with fewer than 4 domains.

21/33 Evaluation
[Table: WaveScalar processor area model. Captured from Swanson et al., ISCA'06]
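The real area model is in the captured table; as a stand-in, the sketch below assumes a simple additive model in which PE area is replicated up the hierarchy and the inter-cluster network is charged at the roughly 1% of die area quoted on slide 15. Every coefficient is a placeholder, not a value from the paper.

```python
def cluster_area_mm2(pes_per_pod, pods_per_domain, domains_per_cluster,
                     pe_area, pseudo_pe_area, store_buffer_area, cache_area):
    """Assumed additive model: replicated PEs, per-domain pseudo-PE
    gateways, plus the cluster's store buffer and cache. All *_area
    arguments are placeholder mm^2 figures."""
    domain = pes_per_pod * pods_per_domain * pe_area + pseudo_pe_area
    return domains_per_cluster * domain + store_buffer_area + cache_area

def chip_area_mm2(clusters, cluster_area, net_fraction=0.01):
    """Scale to the die, charging the inter-cluster interconnect as
    ~1% of total chip area."""
    return clusters * cluster_area / (1 - net_fraction)

c = cluster_area_mm2(2, 4, 4, pe_area=0.5, pseudo_pe_area=0.3,
                     store_buffer_area=1.0, cache_area=2.0)
print(round(chip_area_mm2(16, c), 1))  # ~326.5 mm^2 with these placeholders
```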

22/33 Evaluation
The parameter ranges allow for over 21,000 WaveScalar processor configurations. To select the configurations to evaluate, the authors (a pruning sketch follows this slide):
- Eliminate clearly poor, unbalanced designs
- Bound the die size at 400 mm² in 90 nm technology
- Reduce the number of designs to 201
Results are reported as AIPC (Alpha-equivalent instructions executed per cycle) rather than IPC.
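A sketch of this pruning step under stated assumptions: the parameter ranges and the area model below are toys, so the configuration counts will not reproduce the paper's 21,000 or 201 figures; only the filtering pattern is the point.

```python
from itertools import product

def feasible_designs(area_of, max_area_mm2=400):
    """Sweep hypothetical parameter ranges and keep only the
    configurations whose modeled area fits the 400 mm^2 budget."""
    ranges = {
        "pes_per_pod": (1, 2, 4),
        "pods_per_domain": (2, 4, 8),
        "domains_per_cluster": (2, 4),
        "clusters": (1, 4, 16, 64),
    }
    keys = list(ranges)
    for values in product(*(ranges[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if area_of(cfg) <= max_area_mm2:
            yield cfg

# Toy area model: 0.4 mm^2 per PE plus 8 mm^2 of fixed overhead per cluster.
toy_area = lambda c: (0.4 * c["pes_per_pod"] * c["pods_per_domain"]
                      * c["domains_per_cluster"] + 8.0) * c["clusters"]
print(sum(1 for _ in feasible_designs(toy_area)))
```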

23/33 Evaluation
[Figure: Pareto-optimal WaveScalar designs. Captured from Swanson et al., ISCA'06]

24/33 Evaluation
[Figure: Pareto-optimal configurations for Splash-2. Captured from Swanson et al., ISCA'06]

25/33 Evaluation
Goal of WaveScalar's hierarchical interconnect: isolate as much traffic as possible in the lower levels of the hierarchy, within a PE, a pod, or a domain.
- On average, 40% of network traffic remains within a pod
- 52% of network traffic remains within a domain
- On average, just 1.5% of traffic traverses the inter-cluster interconnect

26/33 Evaluation
[Figure: Pareto-optimal WaveScalar designs. Captured from Swanson et al., ISCA'06]

27/33 Conclusion
- This paper presents the WaveScalar processor architecture in detail and identifies the parameters that most significantly affect area and performance.
- It explores the area/performance trade-offs through simulation and analysis.
- The exploration reveals WaveScalar processors tuned for either area efficiency or maximum performance across a wide range of processor sizes.
- The hierarchical interconnect network is very effective: over 50% of messages stay within a domain, and over 80% of messages stay within a cluster.

28/33 Thank you!