Parallelizing FPGA Placement with Transactional Memory
Steven Birk*, Greg Steffan**, and Jason Anderson**

1 Parallelizing FPGA Placement with Transactional Memory
Steven Birk*, Greg Steffan**, and Jason Anderson**
*CS Department / **ECE Department, University of Toronto

2 Implications of Moore's Law
[Figure: timeline over the years 1995, 2000, 2005, 2010 comparing FPGAs, CPUs, and CAD complexity. CPU labels: 7.5m Pentium II, 42m PIV, 291m Core 2 Duo, 731m Core i7 Quad. Remaining labels: 70m, 350m, 1.1b, 2.5b.]
→ need for parallel CAD is intensifying

3 Parallelizing CAD Software
The focus of this talk:
– simulated-annealing-based placement
→ key algorithm in FPGA CAD

4 Simulated Annealing Placement: Basic Idea
Algorithm:
1) Start with random placement of blocks
2) Randomly pick a pair of blocks to swap
3) Keep new placement if an improvement
…
[Figure: blocks A, B, C, D connected by nets, shown before and after swapping A and B]
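
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VPR's implementation. The cost function (half-perimeter bounding box per net), the cooling schedule, and every name here are hypothetical. The slide elides what follows step 3, so the sketch assumes the standard simulated-annealing rule of occasionally keeping a worse placement with probability exp(-delta / temperature).

```python
import math
import random

def wirelength(placement, nets):
    """Assumed cost: sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(placement, nets, temperature=5.0, cooling=0.95,
           moves_per_temp=50, seed=0):
    """Anneal `placement` in place and return its final cost."""
    rng = random.Random(seed)
    blocks = sorted(placement)
    cost = wirelength(placement, nets)
    while temperature > 0.01:
        for _ in range(moves_per_temp):
            # Step 2: randomly pick a pair of blocks and swap them.
            a, b = rng.sample(blocks, 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = wirelength(placement, nets)
            delta = new_cost - cost
            # Step 3: keep improvements; assumed Metropolis rule otherwise.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
            else:
                # Rejected swap: undo the modification.
                placement[a], placement[b] = placement[b], placement[a]
        temperature *= cooling
    return cost
```

The explicit undo in the rejected branch is the kind of hand-written rollback that slide 16 later proposes to hand over to the TM system.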

5 Potential Parallelism: the Intuition
[Figure: three panels over blocks A, B, C, D. "Single-Threaded": Thread 1 evaluates one move. "Parallel Moves (success)": Thread 1 and Thread 2 each evaluate a move. "Parallel Moves (failure)": the moves of Thread 1 and Thread 2 interfere.]
→ parallelism when blocks/nets are disjoint
→ nice match to Transactional Memory
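
The disjointness condition can be made concrete with a small check. This is an illustrative sketch only. The representation of a move as a pair of block names, the representation of a net as a tuple of block names, and both function names are assumptions made for this example.

```python
def nets_touching(block, nets):
    """Indices of the nets that include the given block."""
    return {i for i, net in enumerate(nets) if block in net}

def moves_conflict(move1, move2, nets):
    """Two swaps conflict if they share a block or touch a common net."""
    blocks1, blocks2 = set(move1), set(move2)
    if blocks1 & blocks2:
        return True
    nets1 = set().union(*(nets_touching(b, nets) for b in blocks1))
    nets2 = set().union(*(nets_touching(b, nets) for b in blocks2))
    return bool(nets1 & nets2)
```

A TM system makes this decision dynamically from the memory locations each transaction actually touches, so the program never has to compute such a check itself.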

6 Transactional Memory (TM): the Basic Idea
Programmer: specifies transactions in source code
Source Code: ... atomic { ... access_shared_data(); ... } ...
TM System: executes transactions optimistically in parallel
1) Checkpoints execution
2) Detects conflicts
3) Commits, or aborts and re-executes
→ Exploits available parallelism while maintaining correctness!
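
To make the checkpoint / detect / commit-or-abort cycle concrete, here is a toy software TM in Python. It is a teaching sketch under simplifying assumptions and is unrelated to how tinySTM or any production STM is built. Writes are buffered, reads record a version number, and a single global lock serializes commits. The names TVar, Transaction, Conflict, and atomic are all hypothetical.

```python
import threading

class TVar:
    """A shared cell with a version number bumped on every committed write."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Conflict(Exception):
    """Signals that the transaction must abort and re-execute."""

_commit_lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed on first read
        self.writes = {}  # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        with _commit_lock:
            value, version = tvar.value, tvar.version
        seen = self.reads.setdefault(tvar, version)
        if seen != version:
            raise Conflict()  # cell changed since we first read it
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value  # buffered until commit

    def commit(self):
        with _commit_lock:
            # Detect conflicts: every cell we read must be unchanged.
            for tvar, version in self.reads.items():
                if tvar.version != version:
                    raise Conflict()
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1

def atomic(body):
    """Run body(tx) as a transaction, re-executing it after each abort."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue  # abort: buffered writes are simply discarded
```

For example, `atomic(lambda tx: tx.write(counter, tx.read(counter) + 1))` increments a shared `TVar` safely from several threads. Two increments that race will both read the same version, and the second to commit will fail validation and retry.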

7 TM Implementations
Software TM (STM)
– compiler or library based
– works on current multicores, but high overheads
– Java: DSTM, ASTM
– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM (this work)
Hardware TM (HTM)
– more automatic, low overhead, limited transaction size
– commercial systems don't exist yet
– Stanford's TCC, Wisconsin's LogTM, SUN's ROCK
→ STM has high overhead, no HTMs (yet)

8 Goals of this Work
Parallelize simulated-annealing placement
– using software transactional memory (tinySTM)
– demonstrate the potential for good scaling
– not expecting great speedup due to the overheads of STM
For the FPGA community
– evaluate potential for easier parallelization via TM
– suggest CAD algorithm changes to capitalize on TM
For the systems/TM community
– lessons from a real application
– TM feature wish-list

9 Methodology
CAD SW: Versatile Place and Route (VPR) 5.0
– available at www.eecg.toronto.edu/vpr
Benchmark circuits: provided by VPR
– sizes ranging from 67 to 6000 blocks and 100 to 60000 nets
– target architecture: 4-LUTs, cluster size 10
STM: tinySTM
– available at www.tinystm.org
Platform: 8 CPUs
– 2 x Quad-Core Intel Xeon E5345 @ 2.33 GHz

10 Challenges: Non-Determinism & Measurement
Our initial implementation is non-deterministic
– however, a deterministic version is possible; see paper
Non-determinism makes measurement difficult
– different numbers of threads -> different work/results
Solution: consider both runtime & quality-of-result (QoR)
– QoR: worst-case critical path delay
→ can trade off runtime and QoR

11 The Parallelization Story

12 First Parallelization Attempt
Fast: one student-month
– includes time to get familiar with tinySTM, VPR code
– very few code changes
– produced correct results very quickly
– no deadlocks or data races
Standard parallelism optimizations:
– reductions: e.g. success_sum += 1
– scheduling: move unnecessary code out of transactions
→ additional effort devoted to improving perf.
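
The reduction optimization can be illustrated as follows. If every thread updated a shared success_sum inside its transaction, all transactions would conflict on that one counter. Giving each thread a private count and summing once at the end avoids this. The sketch below is a hypothetical Python illustration of the pattern, not the authors' code. The `accept` callback stands in for whatever decides whether a swap succeeds.

```python
import threading

def run_workers(num_threads, swaps_per_thread, accept):
    """Count accepted swaps using per-thread counters plus a final reduction."""
    local_counts = [0] * num_threads  # one private slot per thread

    def worker(tid):
        count = 0  # thread-local, so never a source of conflicts
        for i in range(swaps_per_thread):
            if accept(tid, i):
                count += 1
        local_counts[tid] = count  # published once, when the thread is done

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reduction: combine the private counts into the global success_sum.
    return sum(local_counts)
```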

13 Performance (avg all benchmark circuits)
[Figure: performance averaged over all benchmark circuits, with QoR degradation (deg.)]
→ high QoR degradation (30%), high abort rate (60%)

14 More Optimization: Reduce Aborts
Use feedback to identify causes of aborts
– 80% of aborts caused by accesses to x_lookup[], the array used to locate the 2nd block in a swap
– interesting: not used by “I/O” type blocks
Interesting resulting behavior: “favoritism”
– system favors swapping I/O blocks: I/O block swaps have much shorter txns, no conflicts
– only one non-I/O block swapping at a time: others conflict immediately on x_lookup[]
– intuition: causing QoR degradation, ‘false’ speedup
→ solution: privatize x_lookup[]

15 Transactions and Swaps: Terminology
Swaps
– ACCEPTED or REJECTED
Transactions
– COMMIT or ABORT
[Figure: blocks A and B, shown twice]

16 More Optimization: Leveraging TM
VPR code implements commit/abort
– directly modifies placement data structures
– undoes modifications if swap is rejected
TM implements commit/abort, hence optimize:
– delete VPR code for undoing rejected swaps
– force transaction to abort if swap is rejected
→ requires API for forcing a transaction to abort
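
The idea of replacing hand-written undo code with a forced abort can be sketched with a toy write-buffering transaction. Everything here is a hypothetical illustration: PlacementTx, Rejected, try_swap, and the cost_fn callback are names invented for this example, the acceptance test is simplified to "reject if the cost went up", and no real STM API is shown.

```python
class Rejected(Exception):
    """Raised to force the enclosing transaction to abort."""

class PlacementTx:
    """Toy transaction: buffers writes and applies them only on commit."""
    def __init__(self, placement):
        self.placement = placement
        self.writes = {}

    def get(self, block):
        return self.writes.get(block, self.placement[block])

    def set(self, block, position):
        self.writes[block] = position

    def commit(self):
        self.placement.update(self.writes)

def try_swap(placement, a, b, cost_fn):
    """Swap blocks a and b inside a transaction; abort if rejected.

    cost_fn takes a lookup function (block -> position) and returns a cost.
    """
    tx = PlacementTx(placement)
    try:
        old_cost = cost_fn(tx.get)
        pos_a, pos_b = tx.get(a), tx.get(b)
        tx.set(a, pos_b)
        tx.set(b, pos_a)
        if cost_fn(tx.get) > old_cost:
            raise Rejected()  # forced abort replaces explicit undo code
        tx.commit()
        return True
    except Rejected:
        return False  # buffered writes are dropped; placement is untouched
```

Note that try_swap contains no code to restore the old positions. A rejected swap simply never commits, which is the simplification the slide describes.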

17 Impact on Abort Rate
[Figure: abort rate in two panels, "Standard Optimizations" and "Privatization and Leveraging"]
→ significant decrease in abort rate

18 Performance of Privatization and Leveraging TM
[Figure: performance with QoR degradation (deg.)]
→ improved QoR deg: max 35% to 8%, avg 7% to 2%

19 Even More Optimization: Ignoring Large Nets
[Figure: two panels, "Privatization and Leveraging" and "Ignore Large Nets"]
→ improves abort rate, little impact on QoR
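
One way to picture this optimization: a net that connects many blocks is touched by many swaps, so leaving it out of the cost computation removes a frequent source of conflicts. The sketch below is a hypothetical illustration. The half-perimeter cost, the max_net_size parameter, and its value are assumptions, since the slide gives neither the cost function nor a threshold.

```python
def placement_cost(placement, nets, max_net_size=None):
    """Half-perimeter cost, optionally skipping nets above a size threshold."""
    total = 0
    for net in nets:
        if max_net_size is not None and len(net) > max_net_size:
            continue  # ignored: large nets are shared by many swaps
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

Skipping a net means the transaction never reads the positions of that net's other blocks, so fewer concurrent swaps overlap in what they read and write.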

20 Evaluating Scaling
[Figure: two panels, "Relative to Single Thread STM (estimated)" and "Single Thread STM vs. Sequential"]

21 Conclusions
Parallel placement via STM
– good algorithmic fit (accept/reject -> commit/abort)
– speedup poor due to overheads, scaling good, need HTM!
FPGA community:
– should pay attention to TM, especially HTM
– TM offers fast & correct parallelization, focus on performance
– algorithms can be modified to better exploit TM (ignoring nets)
Systems/TM community:
– need API for forced abort, ordered transactions

