Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors
Marcelo Cintra (University of Edinburgh) and Diego R. Llanos (Universidad de Valladolid)
Symp. on Principles and Practice of Parallel Programming - June

Speculative parallelization on SMP
for(i=0; i<100; i++) { ... = A[L[i]]; A[K[i]] = ... }
Assume no dependences and execute iterations in parallel:
Iteration J: ... = A[4]; A[6] = ...
Iteration J+1: ... = A[2]; A[2] = ...
Iteration J+2: ... = A[5]; A[5] = ...
Accesses to shared data are tracked at runtime; if a RAW violation is detected, the offending threads are squashed.
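The loop above cannot be analyzed at compile time because its subscripts come through the index arrays K and L. A minimal sketch (the helper name and example values are mine, not from the paper) of why the dependences are only known at runtime:

```c
#include <stddef.h>

/* Hypothetical runtime check: a RAW dependence exists between iterations
   i < j when iteration i stores to A[K[i]] and iteration j later loads
   A[L[j]] from the same location. Whether this happens depends entirely
   on the runtime contents of K and L, which the compiler cannot see. */
int has_raw_dependence(const int *K, const int *L, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (K[i] == L[j])
                return 1;   /* iteration j would read what iteration i wrote */
    return 0;
}
```

A speculative scheme runs the iterations in parallel anyway and performs this detection on the fly, element by element, rather than with an up-front inspector pass.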
Hardware vs. Software schemes
Hardware schemes:
+ High performance
– Changes to processor, caches, and coherence controller
Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, contention due to the need for synchronization
Wish List
To reduce software overhead: use efficient speculative data structures and optimized operations
To schedule efficiently: minimize memory overhead while maximizing tolerance to load imbalance and violations
To reduce contention: avoid synchronization as much as possible
To avoid performance degradation: use a squash contention mechanism
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Speculative Access Structures
Use versions of the shared data structure: each thread (e.g., Thread A at iteration J, Thread B at iteration J+1) works on its own version copy of A[0..n].
A speculative access structure holds the state (NA, EL, M, ELM) of each version of each element.
Speculative Access Structure I: Simple Array
Array of access states directly mapped to a shadow copy of the user data array.
NA: not accessed
EL: exposed loaded
M: modified
ELM: exposed loaded and modified
Speculative Access Structure I: Simple Array
Cheap to look up on speculative memory operations: on ... = A[2], the access array entry for element 2 is set to EL.
Expensive to search on commits: the entire access array must be scanned.
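The per-element state transitions can be sketched as follows (the encoding is my own, assuming the four states named on the slide):

```c
/* The four access states of one element's version copy. */
typedef enum { NA, EL, M, ELM } acc_state;

/* A load is "exposed" only if the element has not been written locally
   first (NA -> EL); once the element is locally modified, later loads
   see the local value and the state does not change. */
acc_state after_load(acc_state s)  { return s == NA ? EL : s; }

/* A store marks the element modified; an earlier exposed load is
   remembered by upgrading EL to ELM instead of plain M. */
acc_state after_store(acc_state s) {
    if (s == NA) return M;
    if (s == EL) return ELM;
    return s;                      /* M and ELM stay as they are */
}
```

The EL/ELM distinction matters because only exposed loads can consume stale values and therefore trigger squashes.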
Speculative Access Structure II: Indirection Array
Array of indices that records which elements of the shadow data array were touched.
Speculative Access Structure II: Indirection Array
On ... = A[2], the access array entry for element 2 is set to EL and index 2 is appended to the indirection array.
Cheap to look up on speculative memory operations.
Cheap to search on commits: only the recorded indices are scanned.
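A sketch of how the indirection array keeps commit cost proportional to the number of touched elements rather than to the array size (all names and sizes here are my own illustration):

```c
#define NELEMS 1024
enum { S_NA, S_EL, S_M, S_ELM };

typedef struct {
    int state[NELEMS];     /* access array, all S_NA (== 0) initially */
    int touched[NELEMS];   /* indirection array: indices accessed so far */
    int ntouched;
} spec_struct;

/* Record the index only on the first access, so the commit loop can
   walk just touched[0..ntouched) instead of all NELEMS entries. */
void mark_store(spec_struct *s, int idx) {
    if (s->state[idx] == S_NA) {
        s->touched[s->ntouched++] = idx;   /* first touch: remember index */
        s->state[idx] = S_M;
    } else if (s->state[idx] == S_EL) {
        s->state[idx] = S_ELM;             /* exposed load then store */
    }                                      /* S_M / S_ELM unchanged */
}
```

On commit, only `ntouched` versions need to be copied back to the user array, which is the source of the speedup shown later for sparse access patterns.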
Scheduling Threads
Static: assign a chunk of N/P iterations to each processor
+ Only P active threads: little memory overhead
– Poor tolerance to load imbalance and dependence violations
Dynamic: dynamically assign each of the N iterations
– N active threads: bigger memory structures
+ Better tolerance to load imbalance and dependence violations
Our solution: software version of an aggressive sliding window mechanism†
† Cintra, Martinez and Torrellas; ISCA 2000
Sliding Window
Schedule a window of W out of the N iterations at a time.
Iterations are assigned to threads dynamically inside the window.
When the non-speculative thread finishes, the window is advanced.
Tradeoff between load balancing and size of version structures.
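The window bookkeeping can be sketched as below (an assumed shape for illustration; the actual implementation also tracks per-slot state and performs partial commits):

```c
/* Iterations may start only inside [base, base + W); committing the
   non-speculative iteration (== base) slides the window forward. */
typedef struct { int base, next, W; } window;

/* Dynamic assignment inside the window: returns the next iteration
   to run, or -1 if the window is full and the thread must wait. */
int try_start(window *w) {
    if (w->next >= w->base + w->W)
        return -1;
    return w->next++;
}

/* Called when the non-speculative thread finishes its iteration. */
void commit_nonspec(window *w) { w->base++; }
```

W bounds the number of live version structures (the memory overhead), while still allowing dynamic assignment within the window (the load-balance benefit), which is exactly the tradeoff the slide names.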
Memory operations
Load operation (... = A[K[i]]):
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations
Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected: compiler reordering; use of relaxed memory consistency models.
Race Conditions
Certain interleavings of operations may lead to incorrect execution. Consider a thread executing iteration J performing a store (S1, S2, S3) racing with a thread executing iteration J+K performing a load (L1, L2, L3):
–If the store is reordered so that S2 performs before S1, the load's backward scan (L2) may see the state already marked modified and pick that version, yet obtain (L3) a value the store (S1) has not yet written: incorrect value.
–If the load is reordered so that L2 and L3 perform before L1, the store's forward scan for violations (S3) may run before the element is marked exposed loaded (L1): the violation is not detected.
Conservative Solution
Enclose each operation in a critical section:
Load Operation:
# lock A
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
# unlock A
Store Operation:
# lock A
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations
# unlock A
Drawback: contention
Our Solution: Memory Fences
Load Operation:
L1: Update state of the element to EL
# memory fence
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
Store Operation:
S1: Perform the store of the new version
# memory fence
S2: Update state of the element to M or ELM
# memory fence
S3: Scan forwards access array for violations
All pending operations must be performed before passing a memory fence.
This is the minimum set of memory fences needed.
Critical sections are still necessary to protect structures on thread starts, commits, and squashes.
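With C11 atomics, the fence placement on the store side could be sketched like this (a hypothetical mapping for illustration; the paper targets SPARC, where the fences would be membar instructions, and the S3 scan is elided here):

```c
#include <stdatomic.h>

static _Atomic int version;   /* version copy of one element */
static _Atomic int state;     /* its access state: 0 = NA, 2 = M */

void spec_store(int v) {
    atomic_store_explicit(&version, v, memory_order_relaxed);   /* S1 */
    atomic_thread_fence(memory_order_seq_cst);  /* S1 visible before S2 */
    atomic_store_explicit(&state, 2, memory_order_relaxed);     /* S2 */
    atomic_thread_fence(memory_order_seq_cst);  /* S2 visible before S3 */
    /* S3: scan forward over more speculative threads' states here */
}
```

The fence between S1 and S2 prevents the incorrect-value race (a concurrent load can never see the state marked modified before the value is written); the fence between S2 and S3 prevents the undetected-violation race.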
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Evaluation Environment
Execution of experiments on a real machine: Sun Fire 6800 SMP with 24 UltraSPARC-III processors, OpenMP 2.0.
Study of applications with non-analyzable loops:
TREE, WUPWISE, MDG: no dependences
LUCAS, AP3M: dependences
Speedups of Loops: TREE
Very close to “ideal” DOALL speedup
Speedups of Loops: WUPWISE
Not so close to “ideal” DOALL speedup: huge spec data size
Importance of Indirection Array
Cost of Violation Checks
Systems evaluated:
Baseline: our scheme, with violation checks upon stores
sys2: same as Baseline, but violation checks upon commits
Cost of Violation Checks
May outperform checks at commit on sparse accesses.
Checks upon loads and stores are not too expensive.
Effects of Scheduling Schemes
Systems evaluated:
Baseline: sliding window moved when the non-speculative thread finishes
sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes
P = 4 processors.
A fully dynamic schedule is not always feasible.
Best performance for W = 2*P to 4*P.
Wish List Revisited
To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commit)
To schedule efficiently: aggressive sliding window mechanism
To reduce contention: use of memory fences instead of critical sections
To avoid performance degradation: squash monitor with feedback
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Software-only speculative parallelization schemes
SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
IBM Research (SC 1998): series of tests for various specific behaviors
TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Conclusions
Systematic consideration of the design space and cost/performance issues.
New efficient and robust software-only speculative parallelization scheme:
–Fine-tuned data structures
–Aggressive sliding window
–Reduced synchronization requirements
–Overhead monitors and feedback
Very good performance:
–7 to 25% faster than previous schemes
–71% of the speedup of manual parallelization
Data Structures Implementation
The user array, together with per-thread version copies and access structures holding per-element states (NA, EL, M, ELM).
Squashing Threads
Violations are detected by looking up the speculative access structure:
On every store:
+ Check only the element being accessed
+ Earlier violation detection
± Frequent checks; need some form of synchronization
At commit:
– Check all elements
+ Faster speculative memory operations
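The per-store check (S3) can be sketched as a forward scan over the states that more speculative threads hold for the element just written (the thread/state layout is my own simplification):

```c
enum { T_NA, T_EL, T_M, T_ELM };

/* After thread t stores to an element, scan the access states that
   threads t+1 .. nthreads-1 hold for that same element. An exposed
   load (T_EL or T_ELM) there means a successor already read a stale
   value: a RAW violation, so that thread must be squashed. */
int first_violator(const int *states, int nthreads, int t) {
    for (int s = t + 1; s < nthreads; s++)
        if (states[s] == T_EL || states[s] == T_ELM)
            return s;
    return -1;   /* no violation detected */
}
```

Because only successors can be violated by a store, the scan runs forward, mirroring how the load's version search (L2) runs backward over predecessors.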
Squash contention mechanism
Goal: avoid performance degradation in the presence of dependences.
Implemented with commit and squash monitors.
After a given threshold of squashes, subsequent invocations of the same loop are executed sequentially.
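A minimal sketch of the monitor-with-feedback idea (the counter and threshold names are hypothetical):

```c
/* Once the squash count for a loop crosses a threshold, the runtime
   stops speculating on that loop and runs later invocations serially,
   avoiding repeated squash/restart overhead on dependence-heavy loops. */
typedef struct { int squashes, threshold, run_sequential; } squash_monitor;

void note_squash(squash_monitor *m) {
    if (++m->squashes >= m->threshold)
        m->run_sequential = 1;     /* feedback: fall back to sequential */
}
```

Before each invocation the runtime would consult `run_sequential` to choose between the speculative and the plain serial version of the loop.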
Importance of Squash Monitors
Application Characteristics
Application | Loops
TREE | accel_10
MDG | interf_1000
WUPWISE | muldeo_200’, muldoe_200’
AP3M | shgravll_700
LUCAS | mers_mod_square (line 444)
The table also reports, per loop, the % of sequential time and the speculative data size (KB).
Speedups of Loops: MDG
Very close to “ideal” DOALL speedup
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads
Mixed results: either Baseline or sys4 performs best.
Related Work
Hardware-based speculative parallelization schemes:
–I-ACOMA at University of Illinois
–HYDRA at Stanford
–Multiplex at Purdue
–Multiscalar at Wisconsin
–Clustered Speculative Multithreading at UPC
–TLDS at Carnegie Mellon
Inspector-Executor scheme:
–Leung and Zahorjan (PPoPP 1993)
–Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work
Optimistic Concurrency Control schemes:
–E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
–Only need to enforce that accesses to objects in critical sections do not overlap: no total order required
–Applied to explicitly parallel applications