A New Approach to Parallelising Tracing Algorithms
Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt
Computer Science Department, University of Western Ontario; Computer Laboratory, University of Cambridge



I. Motivation & High Level Goal
We study more scalable algorithms for parallel tracing: memory management is the primary motivation, but we do not claim immediate improvements to state-of-the-art GC.
Tracing is important to computing: the sequential, flat memory model is well understood; the parallel, multi-level memory case is less clear, since processor communication cost grows relative to raw instruction speed x P x ILP.
We present a memory-centric algorithm for copy collection (a general form of tracing) that is free of locks on the mainline path.

I. Abstract Tracing Algorithm
Assume an initialisation phase has already marked and processed some root nodes. The abstract algorithm is:
1. mark and process any unmarked child of a marked node;
2. repeat until no further marking is possible.
Implementing the implicit fix-point via worklists yields:
1. pick a node from a worklist;
2. if unmarked, then mark it, process it, and add its unmarked children to worklists;
3. repeat until all worklists are empty.
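As an illustration only, here is a minimal Java sketch of the worklist formulation, for the sequential single-worklist case; the Node type and its isMarked/mark/process/children operations are hypothetical names, not the paper's API.

import java.util.Deque;

// Sketch only: Node and its operations are illustrative, not the paper's API.
final class WorklistTrace {
    interface Node {
        boolean isMarked();
        void mark();
        void process();
        Iterable<Node> children();
    }

    // Pick a node, mark and process it if unmarked, push its unmarked children,
    // and repeat until the worklist is empty (the implicit fix-point).
    static void trace(Deque<Node> worklist) {
        while (!worklist.isEmpty()) {
            Node n = worklist.pop();
            if (n.isMarked()) continue;
            n.mark();
            n.process();
            for (Node c : n.children())
                if (!c.isMarked()) worklist.push(c);
        }
    }
}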

I. Worklist Semantics: Classical
What should worklists model? The classical approach uses processing semantics: worklist i stores the nodes to be processed by processor i.
(Figure: nodes dispatched to Worklist 1 through Worklist 4, one worklist per processor.)

I. Classic Algorithm
Two layers of synchronisation:
Worklist level – small overhead via deques (Arora et al.) or work stealing (Michael et al.).
Frustrating atomic block – makes the copy idempotent, which is what enables the low-overhead worklist-access solutions above.

while (!worklist.isEmpty()) {
    Object to_obj = worklist.deqRand();
    int ind = 0;
    foreach (Object from_child in to_obj.fields()) {
        Object to_child;
        atomic {
            if (from_child.isForwarded()) {
                to_child = getForwardingPtr(from_child);
            } else {
                to_child = copy(from_child);
                setForwardingPtr(from_child, to_child);
                worklist.enqueue(to_child);
            }
        }
        to_obj.setField(to_child, ind++);
    }
}

I. Related Work
Halstead (MultiLisp) – the first parallel semi-space collector, but it may suffer from load imbalance. Solutions:
Object stealing: Arora et al., Flood et al., Endo et al., ...
Block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...
Lock-free solutions that exploit immutable data: Doligez and Leroy, Huelsbergen and Larus.
Memory-centric solutions – studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.

II. Memory-Centric Tracing (High Level)
L is the memory partition (local) size; it gives the trade-off between locality of reference and load balancing.
Worklist j stores slots: a slot is the to-space address of a field f of the currently copied/scanned object o that still points into from-space, and j = (o.f quo L) rem N.
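Read concretely, the mapping from a from-space field value to a worklist index could look as follows; this is a sketch under the assumption that addresses are manipulated as longs, and all names are illustrative.

// Sketch: j = (address quo L) rem N. The from-space is split into partitions of
// L bytes, and partition number (addr / L) is folded onto one of N worklists.
final class SlotMapping {
    static int worklistIndex(long fromSpaceAddr, long L, int N) {
        return (int) ((fromSpaceAddr / L) % N);
    }
}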

II. Memory-Centric Tracing (High Level)
(Figure: dispatching slots to worklists or forwarding queues; arrow legend: double-ended – copy to to-space, dashed – insert in queue, solid – slots pointing to fields.)
1. Each worklist w is owned by at most one collector c (its owner).
2. The forwarded slots of c are those slots that belong to a partition owned by c but were discovered by another collector.
3. Worklist ownership is acquired eagerly. Initially all roots are placed in worklists; non-empty worklists start out owned.
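The dispatch rule just described might be sketched as below: a slot whose partition's worklist is owned by this collector (or can be acquired eagerly) is pushed without synchronisation; otherwise it is forwarded to the owner. All names (FwdQueue, owner, fwd) are assumptions for illustration, not the paper's implementation, and ownership release is not modelled here.

import java.util.Deque;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch of slot dispatch.
final class SlotDispatch {
    interface FwdQueue { boolean enq(long slot); }

    final int N;                     // number of partitions / worklists
    final long L;                    // partition size in bytes
    final Deque<Long>[] worklists;   // worklists[j]: slots for partition j
    final FwdQueue[][] fwd;          // fwd[i][j]: enqueued by collector i, dequeued by j
    final AtomicIntegerArray owner;  // owner.get(j): collector owning worklist j, -1 if free

    SlotDispatch(int N, long L, Deque<Long>[] worklists,
                 FwdQueue[][] fwd, AtomicIntegerArray owner) {
        this.N = N; this.L = L; this.worklists = worklists;
        this.fwd = fwd; this.owner = owner;
    }

    // 'me' is this collector's id; 'fromAddr' is the from-space address held in the field.
    void dispatch(int me, long slot, long fromAddr) {
        int j = (int) ((fromAddr / L) % N);   // j = (addr quo L) rem N, as above
        int own;
        do {                                  // eager acquisition of an unowned worklist
            own = owner.get(j);
            if (own < 0 && owner.compareAndSet(j, -1, me)) own = me;
        } while (own < 0);
        if (own == me)
            worklists[j].push(slot);          // owned: lock-free mainline path
        else
            fwd[me][own].enq(slot);           // forward to the owning collector
    }
}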

II. Memory-Centric Tracing Implementation
Each collector processes its forwarding queues (of size F).
Empty worklists are released (their ownership is given up).
Each collector then processes F*P*4 items from its owned worklists (4 was chosen empirically – inverse of the forwarding ratio).
No locking when accessing worklists or when copying.
L (the local partition size) gives the locality-of-reference level.
Repeat until there are no owned worklists, all forwarding queues are empty, and all worklists are empty.
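A hedged sketch of one collector's top-level loop, directly following the steps above; every helper is an abstract placeholder rather than the paper's code.

// Sketch of a collector's loop: drain forwarded slots, release empty worklists,
// then scan a bounded amount of owned work; helpers are abstract placeholders.
abstract class Collector {
    final int P;   // number of collectors
    final int F;   // forwarding-queue capacity
    Collector(int P, int F) { this.P = P; this.F = F; }

    abstract Long deqForwarded(int from);     // next slot forwarded by collector 'from', or null
    abstract void releaseEmptyWorklists();    // give up ownership of emptied worklists
    abstract Long nextOwnedSlot();            // next slot from an owned worklist, or null
    abstract void scanSlot(long slot);        // copy/scan the target, dispatching child slots
    abstract boolean terminationDetected();   // no owned worklists, all queues/worklists empty

    void run() {
        do {
            for (int i = 0; i < P; i++)                            // 1. process forwarding queues
                for (Long s = deqForwarded(i); s != null; s = deqForwarded(i))
                    scanSlot(s);
            releaseEmptyWorklists();                               // 2. release empty worklists
            int budget = F * P * 4;                                // 3. bounded owned-work scan
            for (Long s; budget-- > 0 && (s = nextOwnedSlot()) != null; )
                scanSlot(s);
        } while (!terminationDetected());
    }
}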

II. Forwarding Queues on INTEL IA-32
Forwarding queues implement inter-processor communication:
with P collectors there is a PxP matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j;
the IA-32 implementation is wait-free, lock-free and mfence-free.

volatile int tail = 0, head = 0;
volatile Address buff[F];
// next: k -> (k+1) % F

bool enq(Address slot) {
    int new_tl = next(tail);
    if (new_tl == head) return false;   // queue full
    buff[tail] = slot;
    tail = new_tl;
    return true;
}

bool is_empty() { return head == tail; }

Address deq() {
    Address slot = buff[head];
    head = next(head);
    return slot;
}

II. Forwarding Queues on INTEL IA-32
The sequentially inconsistent pattern below does occur, but the algorithm remains safe:
the head & tail interaction reduces to a collector failing to dequeue from a non-empty queue (or to enqueue into a non-full one);
the buff[tail_prev] & head==tail_prev interaction is safe because IA-32 does not reorder writes.

a = b = 0;              // initially
// Proc 1       Proc 2
a = 1;          b = 1;
// mfence;      // mfence;
x = b;          y = a;
// without the fences, x == 0 && y == 0 is possible!

The same pattern arises between the queue operations (two enq || two is_empty; deq):
// Proc i (two enq)      Proc j (two is_empty; deq)
buff[tail] = ...;        head = next(head);
tail = ...;              if (head != tail)
if (new_tl == head)      ... = buff[head];

II. Dynamic Load Balancing
Small partitions (64K) work well under static ownership:
grey objects are effectively randomly distributed among the N partitions,
yet some locality of reference remains (otherwise forwarding would be too expensive).
Larger partitions may need dynamic load balancing, so partition ownership must be transferred:
a starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue (see the sketch below).
Partition stealing, by contrast, would require locking on the mainline path, since without the lock the copy operation is not idempotent (Michael et al.)!
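Purely as an illustration of the ownership-transfer step (not the paper's code), a collector answering a starvation signal might do something like the following; the method reuses the hypothetical fields of the SlotDispatch sketch above and assumes the forwarding queue is not full.

// Sketch: hand one worklist over by releasing ownership and forwarding one of its
// slots, so the starving collector picks up the partition via eager acquisition.
void handOver(int me, int starving) {
    for (int j = 0; j < N; j++) {
        if (owner.get(j) == me && !worklists[j].isEmpty()) {
            Long slot = worklists[j].poll();   // take one slot before releasing the worklist
            if (slot == null) continue;
            owner.set(j, -1);                  // release ownership of worklist j
            fwd[me][starving].enq(slot);       // the slot leads the starving collector to j
            return;
        }
    }
}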

II. Optimisation; Run-Time Adaptation
Inter-collector producer-consumer relations are detected when forwarding queues are found full (out of the F*P*4 items processed per iteration):
ownership is then transferred to the producer collector, which optimises away the forwarding.
Run-time adaptation monitors the forwarding ratio (FR) and the load balancing (LB):
start with a large L; while LB is poor, decrease L;
if FR > FR_MAX or L < L_MIN, switch to the classical algorithm! (A sketch of this policy follows.)
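A sketch of the adaptation policy's shape; the concrete thresholds (FR_MAX, LB_MAX, L_MIN) and the halving step are invented for illustration, and only the structure follows the slide.

// Illustration only: thresholds and step sizes are assumptions, not measured values.
final class Adaptation {
    static final double FR_MAX = 8.0;        // maximum tolerated forwarding ratio (assumed)
    static final double LB_MAX = 1.5;        // maximum tolerated load imbalance (assumed)
    static final long   L_MIN  = 16 * 1024;  // smallest useful partition size (assumed)

    // Returns the next partition size, or -1 to signal a switch to the classical algorithm.
    static long adapt(long L, double forwardingRatio, double loadImbalance) {
        if (loadImbalance > LB_MAX) L /= 2;                       // poor balance: shrink partitions
        if (forwardingRatio > FR_MAX || L < L_MIN) return -1;     // fall back to classical
        return L;
    }
}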

III. Empirical Results – Small Data
A machine with two quad-core AMD Opterons, on applications with small live data sets, against MMTk:
time averaged over Antlr, Bloat, Pmd, Xalan, Fop, Jython, HsqldbS.
Heap Size = M, IFR average = 4.2, L = 64K.

III. Empirical Results – Large Data
The same machine with two quad-core AMD Opterons, on applications with large live data sets, against MMTk:
time averaged over Hsqldb, GCbench, Voronoi, TreeAdd, MST, TSP, Perimeter, BH.
Heap Size > 500M, IFR average = 6.3, L = 128K.

III. Empirical Results – Eclipse
A quad-core Intel machine on Eclipse (large live data set):
Heap Size = 500M, IFR average = (only) 2.6 for L = 512K, otherwise 2.1!

III. Empirical Results – Jython
The two quad-core AMD Opteron machine on Jython:
Heap Size = 200M, IFR average = (only) 3.0!

III. Conclusions
Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.
We show how to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR). L trades off locality against load balancing.
Robust behaviour: the collector scales well with both data size and number of processors.