Parallelizing FPGA Placement with Transactional Memory
Steven Birk, Greg Steffan, and Jason Anderson

Presentation transcript:

1 Parallelizing FPGA Placement with Transactional Memory
Steven Birk*, Greg Steffan**, and Jason Anderson**
*CS Department / **ECE Department, University of Toronto

2 Implications of Moore’s Law
[Chart: transistor counts by year for FPGAs and CPUs, with CAD complexity. CPU points: Pentium II 7.5m, PIV 42m, Core 2 Duo 291m, Core i7 Quad 731m. Other plotted values: 70m, 350m, 1.1b, 2.5b.]
=> need for parallel CAD is intensifying

3 Parallelizing CAD Software
The focus of this talk:
– simulated-annealing-based placement
=> key algorithm in FPGA CAD

4 Simulated Annealing Placement: Basic Idea
Algorithm:
1) Start with random placement of blocks
2) Randomly pick a pair of blocks to swap
3) Keep new placement if an improvement
…
[Figure: blocks A, B, C, D connected by nets, shown before and after swapping A and B]
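The three steps on this slide can be sketched in a few lines of C. This is a minimal illustration, not VPR's placer: the four-block netlist, the one-dimensional slot model, the wire-length cost, and the names placement_cost, next_rand and anneal_greedy are all assumptions made for the example. It also implements only the "keep if an improvement" rule as written on the slide; a full annealer would additionally accept some worsening moves with a temperature-dependent probability.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_BLOCKS 4
#define NUM_NETS   3

/* Hypothetical toy netlist: net n connects blocks net_a[n] and net_b[n]. */
static const int net_a[NUM_NETS] = {0, 1, 2};
static const int net_b[NUM_NETS] = {1, 2, 3};

/* Cost = total wire length when block i sits at slot pos[i] on a line. */
static int placement_cost(const int pos[NUM_BLOCKS]) {
    int cost = 0;
    for (int n = 0; n < NUM_NETS; n++)
        cost += abs(pos[net_a[n]] - pos[net_b[n]]);
    return cost;
}

/* Small deterministic pseudo-random generator so runs are repeatable. */
static unsigned next_rand(unsigned *state) {
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Try `moves` random swaps, keeping only improving ones; returns final cost. */
static int anneal_greedy(int pos[NUM_BLOCKS], int moves, unsigned seed) {
    assert(moves >= 0);
    int cost = placement_cost(pos);              /* step 1: given placement */
    for (int m = 0; m < moves; m++) {
        int a = (int)(next_rand(&seed) % NUM_BLOCKS);   /* step 2: pick pair */
        int b = (int)(next_rand(&seed) % NUM_BLOCKS);
        if (a == b)
            continue;
        int tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;    /* try the swap */
        int new_cost = placement_cost(pos);
        if (new_cost < cost) {
            cost = new_cost;                     /* step 3: keep improvement */
        } else {
            tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;    /* undo the swap */
        }
    }
    return cost;
}
```

Calling anneal_greedy on an initial placement such as {2, 0, 3, 1} can only lower or preserve the cost, because every non-improving swap is undone by hand. That hand-written undo is the part a transactional abort can replace.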

5 Potential Parallelism: the Intuition
[Figure, three panels over blocks A, B, C, D: Single-Threaded (Thread 1); Parallel Moves (success), Thread 1 and Thread 2; Parallel Moves (failure), Thread 1 and Thread 2]
=> parallelism when blocks/nets are disjoint
=> nice match to Transactional Memory
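The disjointness condition can be made concrete with a small check. The sketch below assumes a hypothetical representation in which each proposed swap carries a bitmask of the blocks it moves and of the nets attached to those blocks (at most 64 of each); move_footprint, make_footprint and moves_conflict are illustrative names, not VPR or tinySTM identifiers. A TM system detects the same overlap dynamically from memory accesses, without this explicit bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical footprint of one proposed swap: bit i set = item i touched. */
typedef struct {
    uint64_t blocks;
    uint64_t nets;
} move_footprint;

/* nets_of_block[b] is the bitmask of nets attached to block b. */
static move_footprint make_footprint(int block_a, int block_b,
                                     const uint64_t nets_of_block[]) {
    assert(block_a >= 0 && block_a < 64 && block_b >= 0 && block_b < 64);
    move_footprint f;
    f.blocks = (UINT64_C(1) << block_a) | (UINT64_C(1) << block_b);
    f.nets = nets_of_block[block_a] | nets_of_block[block_b];
    return f;
}

/* Two moves can run in parallel only if they share no block and no net. */
static bool moves_conflict(move_footprint x, move_footprint y) {
    return (x.blocks & y.blocks) != 0 || (x.nets & y.nets) != 0;
}
```

With blocks A and B on one net and C and D on another, swapping (A, B) and (C, D) in parallel succeeds. If B is also attached to the second net, the two moves conflict, which corresponds to the failure panel.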

6 Transactional Memory (TM): the Basic Idea
Source code: ... atomic { ... access_shared_data(); ... } ...
Programmer:
– Specifies transactions in source code
TM System:
– Executes transactions optimistically in parallel
1) Checkpoints execution
2) Detects conflicts
3) Commits or aborts and re-executes
=> Exploits available parallelism while maintaining correctness!
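The checkpoint / detect / commit-or-retry cycle can be illustrated on a single shared word using C11 atomics. This is only an analogy, and it is not how tinySTM or any TM system is implemented: a real TM tracks many reads and writes per transaction, whereas this sketch uses one compare-and-swap on one variable. The function name optimistic_add is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Optimistically add `delta` to *shared.
   Returns how many times the attempt had to be aborted and re-executed. */
static int optimistic_add(atomic_int *shared, int delta) {
    assert(shared != 0);
    int aborts = 0;
    for (;;) {
        int snapshot = atomic_load(shared);   /* 1) checkpoint the old value */
        int result = snapshot + delta;        /*    speculative work         */
        /* 2) detect a conflict: the exchange succeeds only if *shared still
              holds the snapshot, i.e. no other thread changed it meanwhile. */
        if (atomic_compare_exchange_strong(shared, &snapshot, result))
            return aborts;                    /* 3a) commit                  */
        aborts++;                             /* 3b) abort and re-execute    */
    }
}
```

When no other thread touches the variable, the first attempt commits and the function returns zero aborts. Under contention, some attempts fail and are repeated, which is the overhead the abort-rate slides later in the talk measure.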

7 TM Implementations
Software TM (STM)
– compiler or library based
– works on current multicores, but high overheads
– Java: DSTM, ASTM
– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM
Hardware TM (HTM)
– more automatic, low overhead, limited transaction size
– commercial systems don’t exist yet
– Stanford’s TCC, Wisconsin’s LogTM, SUN’s ROCK
[Callout on slide: This work]
=> STM has high overhead, no HTMs (yet)

8 Goals of this Work
Parallelize simulated-annealing placement
– using software transactional memory (tinySTM)
– demonstrate the potential for good scaling
– not expecting great speedup due to the overheads of STM
For the FPGA community
– evaluate potential for easier parallelization via TM
– suggest CAD algorithm changes to capitalize on TM
For the systems/TM community
– lessons from a real application
– TM feature wish-list

9 Methodology
CAD SW: Versatile Place and Route (VPR) 5.0
– available at
Benchmark circuits: provided by VPR
– sizes ranging from: blocks, nets
– target architecture: 4 LUTs, cluster size 10
STM: tinySTM
– available at
Platform: 8 CPUs
– 2 x Quad-Core Intel Xeon 2.33 GHz

10 Challenges: Non-Determinism & Measurement
Our initial implementation is non-deterministic
– however, a deterministic version is possible (see paper)
Non-determinism makes measurement difficult
– different numbers of threads -> different work/results
Solution: consider both runtime & quality-of-result (QoR)
– QoR: worst-case critical path delay
=> can trade off runtime and QoR

11 The Parallelization Story

12 First Parallelization Attempt
Fast: one student-month
– includes time to get familiar with tinySTM, VPR code
– very few code changes
– produced correct results very quickly
– no deadlocks or data races
Standard parallelism optimizations:
– reductions: e.g. success_sum += 1
– scheduling: move unnecessary code out of transactions
=> additional effort devoted to improving performance
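A shared counter such as success_sum is a conflict hot spot, because every transaction that increments it touches the same memory location. A reduction sidesteps this by giving each thread a private partial sum and combining the partial sums once at the end. The sketch below shows the idea with an OpenMP reduction clause; the use of OpenMP, the accepted[] array and the function name count_successes are assumptions for illustration, and the slides do not say how the reduction was coded in the tinySTM version.

```c
#include <assert.h>

/* Count accepted swaps; accepted[i] is 1 if move i was accepted, else 0. */
static long count_successes(const int accepted[], int num_moves) {
    assert(num_moves >= 0);
    long success_sum = 0;
    /* Each thread accumulates into its own private copy of success_sum;
       the copies are added together when the loop finishes. If OpenMP is
       not enabled, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:success_sum)
    for (int i = 0; i < num_moves; i++)
        success_sum += accepted[i];
    return success_sum;
}
```

The result is the same with or without threads, since addition is associative. Only the number of writes to shared memory changes.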

13 Performance (avg all benchmark circuits)
[Chart, with an axis labeled "deg."]
=> high QoR degradation (30%), high abort rate (60%)

14 More Optimization: Reduce Aborts
Use feedback to identify causes of aborts
– 80% of aborts caused by accesses to x_lookup[]
   array used to locate 2nd block in a swap
– interesting: not used by “I/O” type blocks
Interesting resulting behavior: “favoritism”
– system favors swapping I/O blocks
   I/O block swaps have much shorter txns, no conflicts
– only one non-I/O block swapping at a time
   others conflict immediately on x_lookup[]
– intuition: causing QoR degradation, ‘false’ speedup
=> solution: privatize x_lookup[]

15 Transactions and Swaps: Terminology
Swaps
– ACCEPTED or REJECTED
Transactions
– COMMIT or ABORT
[Figure: two swap examples on blocks A and B, one per outcome]

16 More Optimization: Leveraging TM
VPR code implements commit/abort
– directly modifies placement data structures
– undoes modifications if swap is rejected
TM implements commit/abort, hence optimize:
– delete VPR code for undoing rejected swaps
– force transaction to abort if swap is rejected
=> requires API for forcing a transaction to abort
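The effect of this optimization can be seen in a toy write-buffering transaction. The toy_tx type and the tx_begin, tx_write, tx_commit and tx_abort functions below are hypothetical stand-ins, not tinySTM's API. The accept flag stands in for the annealer's cost test, which in the real code is computed inside the transaction. Here an abort discards the buffered writes and does not re-execute, which is the "forced abort" behavior the slide asks for.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WRITES 8

/* Hypothetical toy transaction: writes are buffered until commit. */
typedef struct {
    int *addr[MAX_WRITES];
    int  value[MAX_WRITES];
    int  count;
} toy_tx;

static void tx_begin(toy_tx *tx) { tx->count = 0; }

static void tx_write(toy_tx *tx, int *addr, int value) {
    assert(tx->count < MAX_WRITES);
    tx->addr[tx->count] = addr;
    tx->value[tx->count] = value;
    tx->count++;
}

/* Commit: apply every buffered write to memory. */
static void tx_commit(toy_tx *tx) {
    for (int i = 0; i < tx->count; i++)
        *tx->addr[i] = tx->value[i];
    tx->count = 0;
}

/* Forced abort: drop the buffered writes; memory was never modified. */
static void tx_abort(toy_tx *tx) { tx->count = 0; }

/* Swap the positions of blocks a and b inside a transaction.
   A rejected swap aborts, so no hand-written undo code is needed. */
static bool try_swap(int pos[], int a, int b, bool accept) {
    toy_tx tx;
    tx_begin(&tx);
    tx_write(&tx, &pos[a], pos[b]);
    tx_write(&tx, &pos[b], pos[a]);
    if (!accept) {
        tx_abort(&tx);
        return false;
    }
    tx_commit(&tx);
    return true;
}
```

A rejected swap leaves pos[] untouched without any explicit restore step, while an accepted swap becomes visible only at commit. The mapping is ACCEPTED to COMMIT and REJECTED to ABORT.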

17 Impact on Abort Rate
[Charts, two panels: Standard Optimizations; Privatization and Leveraging]
=> significant decrease in abort rate

18 Performance of Privatization and Leveraging TM
[Chart, with an axis labeled "deg."]
=> improved QoR deg: max 35% to 8%, avg 7% to 2%

19 Even More Optimization: Ignoring Large Nets
[Charts, two panels: Privatization and Leveraging; Ignore Large Nets]
=> improves abort rate, little impact on QoR
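The slide does not say how large nets are ignored, so the following is only one plausible reading: the cost evaluation skips any net whose pin count exceeds a cutoff, so that transactions never read the data of high-fanout nets and cannot conflict on them. The cutoff LARGE_NET_PINS, the per-net arrays and the function name cost_ignoring_large_nets are all hypothetical.

```c
#include <assert.h>

#define LARGE_NET_PINS 3   /* hypothetical cutoff; the slides give no value */

/* Sum per-net costs, skipping nets with more pins than the cutoff. */
static int cost_ignoring_large_nets(const int net_cost[], const int net_pins[],
                                    int num_nets) {
    assert(num_nets >= 0);
    int total = 0;
    for (int n = 0; n < num_nets; n++) {
        if (net_pins[n] > LARGE_NET_PINS)
            continue;   /* large net: its cost is not read, so no conflict */
        total += net_cost[n];
    }
    return total;
}
```

The trade-off is that the placer optimizes an approximate cost. The slide reports that this had little impact on QoR for the benchmarks used.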

20 Evaluating Scaling
[Charts, two panels: Relative to Single Thread STM (estimated); Single Thread STM vs. Sequential]

21 Conclusions
Parallel placement via STM
– good algorithmic fit (accept/reject -> commit/abort)
– speedup poor due to overheads, scaling good, need HTM!
FPGA community:
– should pay attention to TM, especially HTM
– TM offers fast & correct parallelization, focus on performance
– algorithms can be modified to better exploit TM (ignoring nets)
Systems/TM community:
– need API for forced abort, ordered transactions