A New Approach to Parallelising Tracing Algorithms

Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt
Computer Science Department, University of Western Ontario
Computer Laboratory, University of Cambridge

I. Motivation & High-Level Goal
We study more scalable algorithms for parallel tracing:
  - memory management is the primary motivation, but
  - we do not claim immediate improvements to state-of-the-art GC.
Tracing is important to computing:
  - sequential & flat memory model: well understood;
  - parallel & multi-level memory: less clear, since processor communication cost grows w.r.t. raw instruction speed x P x ILP.
Memory-centric algorithm for copy collection (a general form of tracing), free of locks on the mainline path.

I. Abstract Tracing Algorithm
Assume an initialisation phase has already marked and processed some root nodes. The abstract algorithm is:
  1. mark and process any unmarked child of a marked node;
  2. repeat until no further marking is possible.
Implementing the implicit fix-point via worklists yields:
  1. pick a node from a worklist;
  2. if unmarked, then mark it, process it, and add its unmarked children to worklists;
  3. repeat until all worklists are empty.
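The worklist formulation can be sketched sequentially in C. This is only an illustration under stated assumptions: the `Node` layout, the fixed bounds `MAX_NODES` and `MAX_CHILDREN`, and the use of a single stack-like worklist are all hypothetical, and "processing" a node is reduced to counting it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16      /* hypothetical bound on graph size */
#define MAX_CHILDREN 4    /* hypothetical bound on out-degree */

typedef struct {
    int children[MAX_CHILDREN];  /* indices of child nodes */
    int n_children;
    bool marked;
} Node;

/* Traces the graph from `root`; returns how many nodes were marked. */
static int trace(Node *graph, int root) {
    /* Each node is expanded at most once, so at most
       MAX_NODES * MAX_CHILDREN + 1 entries are ever pushed. */
    int worklist[MAX_NODES * MAX_CHILDREN + 1];
    int top = 0, n_marked = 0;
    worklist[top++] = root;
    while (top > 0) {                      /* 3. repeat until empty */
        int n = worklist[--top];           /* 1. pick a node */
        if (graph[n].marked) continue;     /* 2. only unmarked nodes */
        graph[n].marked = true;
        n_marked++;                        /* "process" the node */
        for (int i = 0; i < graph[n].n_children; i++) {
            int c = graph[n].children[i];
            if (!graph[c].marked) {
                assert(top < MAX_NODES * MAX_CHILDREN + 1);
                worklist[top++] = c;       /* add unmarked children */
            }
        }
    }
    return n_marked;
}
```

A node may sit on the worklist more than once; the mark check at step 2 makes the extra entries harmless.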

I. Worklist Semantics: Classical
What should worklists model?
Classical approach: processing semantics.
Worklist i stores nodes to be processed by processor i!
[Figure: Worklist 1, Worklist 2, Worklist 3, Worklist 4]

I. Classic Algorithm
Two layers of synchronisation:
  - Worklist level: small overhead via deque (Arora et al.) or work stealing (Michael et al.).
  - Frustrating atomic block: gives idempotent copy, thus enables the above small-overhead worklist-access solutions.

    while (!worklist.isEmpty()) {
        int ind = 0;
        Object from_child, to_child, to_obj = worklist.deqRand();
        foreach (from_child in to_obj.fields()) {
            ind++;
            atomic {
                if (from_child.isForwarded()) continue;
                to_child = copy(from_child);
                setForwardingPtr(from_child, to_child);
            }
            to_obj.setField(to_child, ind-1);
            queue.enqueue(to_child);
        }
    }
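One common way to realise such an atomic block is a compare-and-swap on the forwarding pointer. The sketch below assumes that approach and is not taken from the slides: the `Obj` layout, `forward_object`, and the use of `malloc` as a stand-in for to-space allocation are all hypothetical. It illustrates why the copy becomes idempotent: whichever caller loses the race discards its copy and adopts the winner's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Obj {
    _Atomic(struct Obj *) forward;  /* NULL until the object is copied */
    int payload;                    /* stand-in for the object's fields */
} Obj;

/* Returns the unique to-space copy of `from`, creating it if needed.
   Returns NULL only if allocation fails. */
static Obj *forward_object(Obj *from) {
    assert(from != NULL);
    Obj *seen = atomic_load(&from->forward);
    if (seen != NULL) return seen;           /* already forwarded */

    Obj *copy = malloc(sizeof *copy);        /* assumption: malloc as to-space */
    if (copy == NULL) return NULL;
    atomic_init(&copy->forward, NULL);
    copy->payload = from->payload;

    Obj *expected = NULL;
    if (atomic_compare_exchange_strong(&from->forward, &expected, copy))
        return copy;                         /* we installed the pointer */

    free(copy);                              /* lost the race: drop our copy */
    return expected;                         /* the winner's copy */
}
```

The compare-and-swap is exactly the per-object synchronisation on the mainline path that the memory-centric approach aims to avoid.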

I. Related Work
Halstead (MultiLisp): first parallel semi-space collector, but may lead to load imbalance. Solutions:
  - Object stealing: Arora et al., Flood et al., Endo et al., ...
  - Block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...
Free-of-locks solutions via exploiting immutable data: Doligez and Leroy, Huelsbergen and Larus.
Memory-centric solutions, studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.

II. Memory-Centric Tracing (High Level)
L == memory partition (local) size; it gives the trade-off between locality of reference and load balancing.
Worklist j stores slots: a slot is the to-space address pointing to a from-space field f of the currently copied/scanned object o, and
    j = (o.f quo L) rem N
where quo is integer division, rem is the remainder, and N is the number of worklists.
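The mapping from a from-space address to a worklist index is plain integer arithmetic. A minimal sketch follows; the function name `worklist_index` and the use of `uintptr_t` for addresses are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* j = (addr quo L) rem N: addresses in the same L-sized partition map to
   the same worklist, and partitions are assigned to worklists round-robin. */
static size_t worklist_index(uintptr_t from_addr, size_t L, size_t N) {
    assert(L > 0 && N > 0);
    return (size_t)((from_addr / L) % N);
}
```

For example, with L = 64K and N = 4, every address in [0, 65536) maps to worklist 0, the next 64K to worklist 1, and the fifth partition wraps around to worklist 0 again.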

II. Memory-Centric Tracing (High Level)
Dispatching Slots to Worklists or Forwarding Queues
[Figure legend] Arrow semantics: double-ended = copy to to-space; dashed = insert in queue; solid = slots pointing to fields.
  1. Each worklist w is owned by at most one collector c (its owner).
  2. Forwarded slots of c: those slots belonging to a partition owned by c, but discovered by another collector.
  3. Eager strategy for acquiring worklist ownership. Initially all roots are placed in worklists; the non-empty worklists are owned.
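The dispatch decision for a newly discovered slot can be sketched as below. This is an assumption-laden illustration, not the authors' code: the `owner` table, `NO_OWNER`, `N_WORKLISTS`, and acquiring ownership with a compare-and-swap are hypothetical choices that model the "eager" strategy.

```c
#include <assert.h>
#include <stdatomic.h>

#define N_WORKLISTS 8            /* hypothetical number of worklists */
enum { NO_OWNER = -1 };

typedef enum { TO_OWN_WORKLIST, TO_FORWARDING_QUEUE } Dispatch;

static _Atomic int owner[N_WORKLISTS];   /* owner[j] = collector id or NO_OWNER */

static void init_owners(void) {
    for (int j = 0; j < N_WORKLISTS; j++) atomic_store(&owner[j], NO_OWNER);
}

/* Decides where collector `self` puts a slot that belongs to worklist j.
   On return, *target is the collector that owns worklist j. */
static Dispatch dispatch(int self, int j, int *target) {
    assert(j >= 0 && j < N_WORKLISTS);
    int expected = NO_OWNER;
    /* Eager acquisition: try to claim an unowned worklist immediately. */
    if (atomic_compare_exchange_strong(&owner[j], &expected, self) ||
        expected == self) {
        *target = self;
        return TO_OWN_WORKLIST;          /* insert locally, no forwarding */
    }
    *target = expected;                  /* owned by another collector */
    return TO_FORWARDING_QUEUE;          /* forward the slot to its owner */
}
```

Only the ownership change is synchronised here; once a worklist is owned, its owner is the sole collector that touches it.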

II. Memory-Centric Tracing Implementation
  - Each collector processes its forwarding queues (size F).
  - Empty worklists are released (ownership).
  - Each collector processes F*P*4 items from its owned worklists (4 chosen empirically; forwarding ratio inverse).
  - No locking when accessing worklists or when copying.
  - L (local partition size) gives the locality-of-reference level.
  - Repeat until no owned worklists && all forwarding queues empty && all worklists empty.
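The per-iteration batch size and the loop's exit test can be written down directly. The sketch below is hypothetical: `CollectorView` and its counters are invented names for a snapshot of the state one collector observes, and obtaining a consistent snapshot across concurrently running collectors is not addressed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what one collector observes. */
typedef struct {
    int owned_worklists;       /* worklists this collector currently owns */
    int nonempty_fwd_queues;   /* forwarding queues that still hold slots */
    int nonempty_worklists;    /* worklists (any owner) that still hold slots */
} CollectorView;

/* Items processed from owned worklists per iteration: F*P*4,
   where F is the forwarding-queue size and P the number of collectors. */
static size_t batch_size(size_t F, size_t P) {
    return F * P * 4;
}

/* Exit test: no owned worklists && all forwarding queues empty
   && all worklists empty. */
static bool may_terminate(const CollectorView *v) {
    assert(v != NULL);
    return v->owned_worklists == 0 &&
           v->nonempty_fwd_queues == 0 &&
           v->nonempty_worklists == 0;
}
```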

II. Forwarding Queues on INTEL IA-32
Implement inter-processor communication:
  - with P collectors, have a PxP matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j;
  - wait-free, lock-free and mfence-free IA-32 implementation.

    volatile int tail = 0, head = 0;
    volatile Address buff[F];
    next : k -> (k+1)%F;

    bool enq(Address slot) {
        int new_tl = next(tail);
        if (new_tl == head) return false;   // queue is full
        buff[tail] = slot;
        tail = new_tl;
        return true;
    }

    bool is_empty() { return head == tail; }

    Address deq() {                         // call only if !is_empty()
        Address slot = buff[head];
        head = next(head);
        return slot;
    }
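For readers who want to run the queue outside IA-32, here is a portable single-producer, single-consumer sketch. It deliberately swaps in C11 acquire/release atomics where the slide relies on plain `volatile` accesses and the IA-32 memory model, so it is not the authors' implementation. `FwdQueue`, the `fq_` prefix, and `F = 8` are hypothetical, and `fq_deq` folds the emptiness check into the dequeue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define F 8                       /* hypothetical queue size; holds F-1 items */
typedef uintptr_t Address;

typedef struct {
    _Atomic unsigned head;        /* written only by the consumer */
    _Atomic unsigned tail;        /* written only by the producer */
    Address buff[F];
} FwdQueue;

static unsigned next_index(unsigned k) { return (k + 1) % F; }

static void fq_init(FwdQueue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false when the queue is full. */
static bool fq_enq(FwdQueue *q, Address slot) {
    unsigned tl = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned new_tl = next_index(tl);
    if (new_tl == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;
    q->buff[tl] = slot;
    /* Release: the buffer write becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, new_tl, memory_order_release);
    return true;
}

static bool fq_is_empty(FwdQueue *q) {
    return atomic_load_explicit(&q->head, memory_order_acquire) ==
           atomic_load_explicit(&q->tail, memory_order_acquire);
}

/* Consumer side: returns false when the queue is empty. */
static bool fq_deq(FwdQueue *q, Address *out) {
    assert(out != NULL);
    unsigned hd = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (hd == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;
    *out = q->buff[hd];
    atomic_store_explicit(&q->head, next_index(hd), memory_order_release);
    return true;
}
```

The PxP matrix would then be a two-dimensional array of such queues, where collector i is the only producer and collector j the only consumer of entry (i,j).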

II. Forwarding Queues on INTEL IA-32
The sequentially inconsistent pattern occurs, but the algorithm is still safe:
  - head & tail interaction: reduces to a collector failing to deq from a non-empty list (and to enq into a non-full list);
  - buff[tail_prev] & head==tail_prev interaction: safe because writes are not re-ordered.

    a = b = 0;          // initially

    // Proc 1           // Proc 2
    a = 1;              b = 1;
    // mfence;          // mfence;
    x = b;              y = a;
    // without the fences, x == 0 & y == 0 is possible!

    // (two enq)  ||  (two is_empty; deq)
    // Proc i                   // Proc j
    buff[tail] = ...;           head = next(head);
    tail = ...;                 if (head != tail)
    if (new_tl == head) ...         ... = buff[head];

II. Dynamic Load Balancing
Small partitions (64K) are OK under static ownership:
  - grey objects are randomly distributed among the N partitions;
  - this still gives some locality of reference (otherwise forwarding would be too expensive).
Larger partitions may need dynamic load balancing:
  - Partition ownership must be transferred: a starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue.
  - Partition stealing requires locking on the mainline path, since the copy operation is not idempotent without it (Michael et al.)!

II. Optimisation; Run-Time Adaptation
Inter-collector producer-consumer relations are detected when forwarding queues are found full (F*P*4 processed items/iteration):
  - transfer ownership to the producer collector to optimise forwarding.
Run-time adaptation: monitor forwarding ratio (FR) & load balancing (LB):
  - start with a large L; while LB is poor, decrease L;
  - if FR > FR_MAX or L < L_MIN, switch to classical!
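The adaptation rule can be read as a small state-update function. The sketch below is a guess at one possible shape: halving L on each step, the `Policy` struct, and passing the thresholds as parameters are assumptions; the slides only say that L decreases while load balancing is poor and that the collector falls back to the classical algorithm past the thresholds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MEMORY_CENTRIC, CLASSICAL } Mode;

typedef struct {
    size_t L;     /* current partition size in bytes */
    Mode mode;
} Policy;

/* One adaptation step. `fr` is the observed forwarding ratio and
   `poor_lb` says whether load balancing was judged poor. */
static Policy adapt(Policy p, double fr, bool poor_lb,
                    double fr_max, size_t l_min) {
    assert(l_min > 0);
    if (p.mode == CLASSICAL) return p;   /* no way back in this sketch */
    if (poor_lb) p.L /= 2;               /* assumption: decrease by halving */
    if (fr > fr_max || p.L < l_min) p.mode = CLASSICAL;
    return p;
}
```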

III. Empirical Results: Small Data
Machine with two quad-core AMD Opterons, on applications with small live data-sets, compared against MMTk.
Time average over: Antlr, Bloat, Pmd, Xalan, Fop, Jython, HsqldbS.
Heap Size = 120-200M, IFR average = 4.2, L = 64K.

III. Empirical Results: Large Data
Machine with two quad-core AMD Opterons, on applications with large live data-sets, compared against MMTk.
Time average over: Hsqldb, GCbench, Voronoi, TreeAdd, MST, TSP, Perimet, BH.
Heap Size > 500M, IFR average = 6.3, L = 128K.

III. Empirical Results: Eclipse
Quad-core Intel machine on Eclipse (large live data-set).
Heap Size = 500M, IFR average = (only) 2.6 for L = 512K, otherwise 2.1!

III. Empirical Results: Jython
Machine with two quad-core AMD processors, on Jython.
Heap Size = 200M, IFR average = (only) 3.0!

III. Conclusions
  - Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.
  - We show how to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR).
  - L trades off locality for load balancing.
  - Robust behaviour: scales well with both data size and number of processors.
