May 10-02: Mike Drob, Grant Furgiuele, Ben Winters. Advisor: Dr. Chris Chu. Client: IBM. IBM Contact: Karl Erickson.

Project Overview
- The circuit placement problem is a bottleneck of physical design
- The current implementation is single-core only, with no threads
- Will attempt to parallelize some functions of the FastPlace algorithm using the Linux pthreads library
- Implement the RQL idea (from IBM) in FastPlace

Project Plan
- Start with the existing serial FastPlace algorithm
- Parallelize the FastPlace algorithm to decrease run-time
  - Hope for a speedup as close to N times (N = number of cores) as possible
  - Realistically, expect 0.75N or 0.5N
- End goal is mostly proof of concept
  - IBM uses an in-house algorithm
  - Contains proprietary circuit processing
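The speedup targets above can be written out as follows. The symbols T_1 (serial run-time), T_N (run-time on N cores), S and E are notation introduced here for illustration and are not taken from the slides.

```latex
S(N) = \frac{T_1}{T_N}, \qquad E(N) = \frac{S(N)}{N}, \qquad
0.5 \le E(N) \le 0.75 \;\Longleftrightarrow\; 0.5N \le S(N) \le 0.75N
```

For example, on N = 8 cores the realistic target corresponds to a speedup between 4x and 6x, against an ideal of 8x.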

Project Design
- Written in C
- Runs under Linux using the POSIX thread library
- Consider scalability: 2, 4, 8, etc. cores
- RQL implementation
  - IBM concept
  - Netlist optimization for placement

Implementation – Overall
- Using data parallelism as the scheme
  - Assigning loop iterations to threads
  - Localizing variable usage
  - Where absolutely necessary, using thread synchronization (mutex, etc.)
- To maximize the speed improvement from threads, minimize the total number of tasks for threads to accomplish
  - Have individual threads do as much as possible
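The data-parallel scheme described above might look roughly like the following sketch. It is not FastPlace code: the names `chunk_t`, `sum_chunk` and `parallel_sum`, and the choice of a simple array sum as the loop body, are assumptions made for illustration. Each thread gets one contiguous block of iterations and accumulates into a local variable, so no mutex is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical per-thread work description (not from FastPlace). */
typedef struct {
    const double *data;  /* shared, read-only input */
    size_t begin;        /* first index owned by this thread */
    size_t end;          /* one past the last index owned */
    double partial;      /* per-thread result slot, written by one thread only */
} chunk_t;

static void *sum_chunk(void *arg)
{
    chunk_t *c = arg;
    double local = 0.0;  /* localized variable: no sharing inside the loop */
    for (size_t i = c->begin; i < c->end; i++)
        local += c->data[i];
    c->partial = local;
    return NULL;
}

/* Sum n values using nthreads threads, one contiguous block per thread. */
double parallel_sum(const double *data, size_t n, int nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    chunk_t chunk[MAX_THREADS];
    int started[MAX_THREADS];

    for (int t = 0; t < nthreads; t++) {
        chunk[t].data = data;
        chunk[t].begin = n * (size_t)t / (size_t)nthreads;
        chunk[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        chunk[t].partial = 0.0;
        started[t] = pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]) == 0;
        if (!started[t])
            sum_chunk(&chunk[t]);  /* fall back to running this block serially */
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += chunk[t].partial;  /* combine results after the join */
    }
    return total;
}
```

Calling `parallel_sum(values, count, 4)` splits the loop into four blocks. Giving each thread one large block rather than many small tasks is in line with the slide's advice to have individual threads do as much as possible.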

Implementation – Thread Pool
- Threads are created once at start
- Various benefits:
  - Minimizes overhead from thread creation
  - Increases cache performance
  - Allows core scalability: the number of running threads can equal the number of cores available
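A minimal sketch of a pool whose threads are created once and reused for every parallel step is shown below. The project's actual thread pool is not described in the slides, so `pool_t`, `pool_create`, `pool_run`, `pool_destroy` and the sample job `mark_slot` are all hypothetical names, and error handling is reduced to `abort()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*pool_fn)(int tid, int nthreads, void *arg);

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t start;      /* a new job has been posted */
    pthread_cond_t done;       /* the last worker finished the job */
    pthread_t *threads;
    int nthreads;
    int next_id;               /* hands out worker ids at startup */
    unsigned long generation;  /* incremented once per job */
    int remaining;             /* workers still running the current job */
    int shutdown;
    pool_fn fn;
    void *arg;
} pool_t;

static void *worker(void *p)
{
    pool_t *pool = p;
    unsigned long seen = 0;
    pthread_mutex_lock(&pool->lock);
    int id = pool->next_id++;
    for (;;) {
        while (!pool->shutdown && pool->generation == seen)
            pthread_cond_wait(&pool->start, &pool->lock);
        if (pool->shutdown)
            break;
        seen = pool->generation;
        pool_fn fn = pool->fn;
        void *arg = pool->arg;
        pthread_mutex_unlock(&pool->lock);
        fn(id, pool->nthreads, arg);        /* run the job outside the lock */
        pthread_mutex_lock(&pool->lock);
        if (--pool->remaining == 0)
            pthread_cond_signal(&pool->done);
    }
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

/* Create the worker threads once; they sleep until pool_run posts a job. */
pool_t *pool_create(int nthreads)
{
    assert(nthreads >= 1);
    pool_t *pool = calloc(1, sizeof *pool);
    if (!pool)
        return NULL;
    pool->threads = malloc((size_t)nthreads * sizeof *pool->threads);
    if (!pool->threads) {
        free(pool);
        return NULL;
    }
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->start, NULL);
    pthread_cond_init(&pool->done, NULL);
    pool->nthreads = nthreads;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&pool->threads[i], NULL, worker, pool) != 0)
            abort();  /* sketch only: no recovery from partial creation */
    return pool;
}

/* Run fn once on every worker and block until all of them have finished. */
void pool_run(pool_t *pool, pool_fn fn, void *arg)
{
    pthread_mutex_lock(&pool->lock);
    pool->fn = fn;
    pool->arg = arg;
    pool->remaining = pool->nthreads;
    pool->generation++;
    pthread_cond_broadcast(&pool->start);
    while (pool->remaining > 0)
        pthread_cond_wait(&pool->done, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
}

void pool_destroy(pool_t *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->shutdown = 1;
    pthread_cond_broadcast(&pool->start);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->nthreads; i++)
        pthread_join(pool->threads[i], NULL);
    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->start);
    pthread_cond_destroy(&pool->done);
    free(pool->threads);
    free(pool);
}

/* Sample job: each worker increments only its own slot, so no lock is needed. */
void mark_slot(int tid, int nthreads, void *arg)
{
    (void)nthreads;
    ((int *)arg)[tid] += 1;
}
```

A caller would create the pool once with the number of available cores, call `pool_run` for each parallel phase, and call `pool_destroy` at exit, so thread creation cost is paid only once.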

Implementation - RQL
- Force-vector modulation
- Forces acting upon cells
  - Forces are modeled as a spring potential energy problem
  - The native force in the algorithm tries to reduce wire length by bringing connected cells closer to each other
  - The spreading force tries to move cells into sparse areas within the placement region
  - A balance of the two is needed to meet placement and wire length objectives
- Modulate the spreading forces
  - A high spreading force means the connection belongs to a fan-out net or boundary
  - Therefore, cells with connections in the top 5 percent of spreading forces are skipped in quadratic placement
  - Skipping these leaves the cell's other connections minimized instead of degrading them
  - Results in placing cells in their overall optimal location

Implementation - RQL
- During quadratic placement (the global placement process):
  - Calculate the magnitude of spreading forces for all cells in each iteration
  - Calculate the force on the current cell
  - If the current cell's force is above the 5% threshold, skip its placement
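The thresholding step above could be sketched as follows. This is an illustration, not the RQL or FastPlace implementation: it assumes one precomputed spreading-force magnitude per cell, and the function name `mark_high_force_cells` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Set skip[i] = 1 for cells whose force magnitude falls in the top
 * `top_percent` percent, 0 otherwise. Returns the number of cells marked.
 * Ties at the cutoff are all marked, so the count can exceed n*top_percent/100.
 * On allocation failure no cell is marked and 0 is returned.
 */
size_t mark_high_force_cells(const double *mag, size_t n,
                             unsigned top_percent, unsigned char *skip)
{
    assert(top_percent <= 100);
    memset(skip, 0, n);

    size_t count = n * top_percent / 100;  /* integer math avoids rounding surprises */
    if (count == 0)
        return 0;

    double *sorted = malloc(n * sizeof *sorted);
    if (!sorted)
        return 0;
    memcpy(sorted, mag, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);
    double cutoff = sorted[n - count];     /* smallest magnitude in the top group */
    free(sorted);

    size_t marked = 0;
    for (size_t i = 0; i < n; i++) {
        if (mag[i] >= cutoff) {
            skip[i] = 1;
            marked++;
        }
    }
    return marked;
}
```

With `top_percent` set to 5, the placement loop would consult `skip[i]` in each iteration and leave marked cells where they are. Sorting a copy every iteration costs O(n log n); a selection algorithm would be cheaper, but the sort keeps the sketch short.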

Implementation - Functions
- move_8pt family
  - move_8pt, move_8pt_withMap, move_8pt_mixedMode, move_8pt_mixedMode_withMap, move_8pt_clustering, move_8pt_clustering_withMap
  - Calculates a score based on cell coordinates and bin utilization
  - Does not lend itself well to parallelization
- The fix?
  - If a new cell is within a 3x3 box of the cell currently being calculated, the new cell is skipped
  - Helps remove significant wirelength degradation
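The slides do not say how the 3x3 box is measured. The sketch below assumes it means the 3x3 block of bins centered on the bin of the cell being processed. The function name `in_3x3_box` and the bin-index parameters are hypothetical and do not come from the move_8pt code.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return 1 if the candidate cell's bin lies inside the 3x3 block of bins
 * centered on the current cell's bin, 0 otherwise.
 * Assumption: bins are addressed by non-negative integer (x, y) indices.
 */
int in_3x3_box(int cur_bx, int cur_by, int new_bx, int new_by)
{
    assert(cur_bx >= 0 && cur_by >= 0 && new_bx >= 0 && new_by >= 0);
    return abs(new_bx - cur_bx) <= 1 && abs(new_by - cur_by) <= 1;
}
```

A thread would call this before processing a candidate cell and skip it on a nonzero result, so that two threads do not move neighboring cells using stale bin-utilization scores.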

Implementation - Functions
- swap_move family
  - swap_move_FM, vswap_move, local_order3_FM, flipAllCells
  - Row-based data processing
  - Break up the matrix into segments based on the number of threads
  - Assign each thread X rows
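One way to compute each thread's share of rows is sketched below. The type `row_range_t` and the function `rows_for_thread` are illustrative names, not part of the swap_move code. When the row count is not a multiple of the thread count, the leftover rows go to the lowest-numbered threads.

```c
#include <assert.h>

/* Half-open range of placement rows [first, last) owned by one thread. */
typedef struct {
    int first;
    int last;
} row_range_t;

/* Split nrows as evenly as possible across nthreads; return thread tid's share. */
row_range_t rows_for_thread(int nrows, int nthreads, int tid)
{
    assert(nrows >= 0 && nthreads >= 1 && tid >= 0 && tid < nthreads);
    int base = nrows / nthreads;   /* rows every thread gets */
    int extra = nrows % nthreads;  /* first `extra` threads get one more */
    row_range_t r;
    r.first = tid * base + (tid < extra ? tid : extra);
    r.last = r.first + base + (tid < extra ? 1 : 0);
    return r;
}
```

With 10 rows and 4 threads the ranges are [0, 3), [3, 6), [6, 8) and [8, 10). A thread whose range is empty, which happens when there are more threads than rows, simply has no work.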

Testing
- Profiled the original FastPlace algorithm
  - gprof gives CPU time per function
- Profiling parallel FastPlace
  - Valgrind
  - FastPlace code outputs the actual elapsed time
    - Can be used to compare performance
    - Not 100% consistent
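Elapsed (wall-clock) time is the relevant measure for a threaded program, because CPU time is summed over all threads and does not shrink when work is spread across cores. The following sketch shows one way to measure it. The helper names `wall_seconds` and `speedup` are hypothetical, and how FastPlace itself reports elapsed time is not stated in the slides.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Seconds from a monotonic clock; only differences between calls are meaningful. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Speedup of a parallel run relative to the serial run. */
double speedup(double serial_s, double parallel_s)
{
    assert(parallel_s > 0.0);
    return serial_s / parallel_s;
}
```

Take one reading before and one after the placement call and subtract them. Since run-to-run timings are not fully consistent, averaging several runs before computing the speedup gives a fairer comparison.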

Testing & Results
- Test results for correctness
  - Compare "wire length" results
  - Average total wirelength no worse than 1% greater
  - Thread pool is tested and working
- Test results for speedup
  - Compared actual run-time
  - See slides on next page

Test Results – RQL Implementation
- Wire length results
  - Wire length decreased by at least 0.12% on ISPD98 benchmarks, with an average of 0.98%
  - Wire length decreased by at least 0.11% on ISPD2005 benchmarks, with an average of 1.39%
- Run-time results
  - Some run-time slowdown
  - Average increase of 3.36% on ISPD98
  - Average increase of 4.02% on ISPD2005

Test Results – Global Placement

Test Results – Detailed Placement

Project Impact
- Shows that threads can be used to speed up the placement process
- With the availability of multi-core CPUs and the scalability of the thread implementation, speed improvement could continue
- Reduces a bottleneck in the development process

Questions?