Intel Concurrent Collections (for Haskell)
Ryan Newton*, Chih-Ping Chen*, Simon Marlow+
*Intel  +Microsoft Research
Software and Services Group
Jul 29, 2010

Slide 2: Goals

Push parallel scaling in Haskell:
− Never trivial
− Harder with laziness and a complex runtime

Offer another flavor of parallelism -- explicit, graph-based -- alongside the existing ones:
− a `par` b                       (pure operand parallelism)
− [: x*y | x <- xs | y <- ys :]   (data parallelism)
− forkIO $ do ...                 (threaded parallelism)
− CnC                             (streaming / actors / dataflow)
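As a hedged aside (not from the deck): a minimal, self-contained example of the first flavor, pure operand parallelism with `par` and `pseq` from the standard parallel package; fib and the argument values are illustrative.

    import Control.Parallel (par, pseq)

    -- Spark the evaluation of x in parallel while this thread
    -- evaluates y, then combine the two results.
    parSum :: Int -> Int -> Int
    parSum a b = x `par` (y `pseq` (x + y))
      where
        x = fib a
        y = fib b

    fib :: Int -> Int
    fib n | n < 2     = n
          | otherwise = fib (n - 1) + fib (n - 2)

    main :: IO ()
    main = print (parSum 30 31)

Compile with -threaded and run with +RTS -N so the spark can be picked up by another core.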

Slide 3: Intel Concurrent Collections (CnC)

These models are a bit more domain-specific (than, say, futures), with a payoff in parallelism achieved per unit of effort.

CnC is approximately: stream processing + shared data structures.

Intel CnC is a library for C++; Java and .NET versions also exist, thanks to Rice University (Vivek Sarkar et al.).

Yet CnC requires and rewards pure functions -- Haskell is a natural fit.

Slide 4: Goals (revisited)

Push parallel scaling in Haskell:
− Never trivial
− Harder with laziness and a complex runtime

Offer another flavor of parallelism -- explicit, graph-based -- alongside a `par` b, [: x*y | x <- xs | y <- ys :], and forkIO $ do ...

How would I (typically) do message passing in Haskell?

Slide 5: Message passing in Haskell

IO threads and channels:

    -- sender
    do action ..
       writeChan c msg
       action ..

    -- receiver
    do action ..
       msg <- readChan c
       action ..

This form of message passing sacrifices determinism. Let's try CnC instead…
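A hedged, self-contained sketch (not from the deck) of the nondeterminism being pointed at: two producer threads race on one channel, so what the receiver sees depends on scheduling.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (newChan, readChan, writeChan)
    import Control.Monad (replicateM)

    main :: IO ()
    main = do
      c <- newChan
      -- Two producers racing on the same channel:
      _ <- forkIO (writeChan c "from thread A")
      _ <- forkIO (writeChan c "from thread B")
      -- The arrival order is scheduler-dependent, so the output
      -- of this program is nondeterministic.
      msgs <- replicateM 2 (readChan c)
      mapM_ putStrLn msgs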

Slide 6: CnC Model of Computation

Tasks, not threads:
− Eager computation
− Takes a tag as input
− Runs to completion

A step reads data items and produces data items. A "CnC graph" is a network of steps -- and it is DETERMINISTIC.

[Figure: a step, written λ tag. ..., prescribed by a tag collection and connected to item collections.]

There's a whole separate story about offline analysis of CnC graphs -- it enables scheduling, GC, etc. No time for it today!

Slide 7: How is CnC Purely Functional?

[Figure: a function from a domain to a range.]

Slide 8: How is CnC Purely Functional?

[Figure: MyStep maps a set of tag/item collections (the domain) to updates -- new tags and items (the range).]
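One hedged way to read slides 7 and 8 as code (a model of the semantics for exposition only -- these types are hypothetical, not the library's interface): a step consumes its tag plus the items it gets, and its entire effect is the set of tags and items it puts.

    -- Illustrative model of a CnC step as a pure function.
    type Tag  = Int
    type Item = String

    -- The range: the updates a step is allowed to make.
    data Updates = Updates
      { newTags  :: [Tag]          -- tags the step puts
      , newItems :: [(Tag, Item)]  -- items the step puts
      }

    -- The domain: the step's tag plus read access to items.
    stepModel :: Tag -> (Tag -> Item) -> Updates
    stepModel t getItem = Updates [t + 1] [(t, getItem t ++ "!")]

    main :: IO ()
    main = print (newTags (stepModel 5 show))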

Slide 9: A CnC graph is a function too

[Figure: a complete CnC evaluation maps a set of tag/item collections (the domain) to a set of tag/item collections (the range).]

Slide 10: Message passing in Haskell (revisited)

IO threads and channels (slide 5) sacrifice determinism. Let's try CnC instead: steps, items, and tags.

[Figure: Step1 and Step2 communicating through item and tag collections rather than a channel.]

Slide 11: Haskell / CnC Synergies

CnC step:
− C++ incarnation: a hopefully pure execute method.
− Haskell incarnation: a pure function -- step purity is enforced.

Complete CnC graph execution:
− C++ incarnation: the graph executes concurrently on other threads; blocking calls allow the environment to collect results.
− Haskell incarnation: a pure function -- leverage the fact that CnC plays nicely in a pure framework.

Also, Haskell CnC is a fun experiment. We can try things like idempotent work stealing, which most imperative threading packages don't support. Can we learn things to bring back to the other CnCs?

Slide 12: How is CnC called from Haskell?

Haskell is LAZY. To control parallel evaluation granularity, Haskell CnC is mostly eager: when the result of a CnC graph is needed, the whole graph executes, in parallel, to completion (i.e., forcing the result to WHNF evaluates the whole graph).

[Figure: needing the result forks workers, the graph executes in parallel, and the result is received.]

Slide 13: A complete CnC program in Haskell

    myStep items tag =
      do word1 <- get items "left"
         word2 <- get items "right"
         put items "result" (word1 ++ word2 ++ show tag)

    cncGraph =
      do tags  <- newTagCol
         items <- newItemCol
         prescribe tags (myStep items)
         initialize $ do
           put items "left"  "Hello "
           put items "right" "World "
           putt tags 99
         finalize $ do
           get items "result"

    main = putStrLn (runGraph cncGraph)
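A usage note (standard GHC practice, not something the deck states): a program like this would be compiled with the threaded runtime, e.g. "ghc -threaded Program.hs", and run with "+RTS -N8" (or however many cores are available) so that forcing the result actually executes the graph in parallel.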

Slide 14: Initial Results

Slide 15: I will show you graphs like this:

[Figure: parallel speedup graph, measured on a 4-socket, 8-core Westmere machine.]

Slide 16: Current Implementations

IO-based and pure implementations -- lots of fun with schedulers!
− Lightweight user threads (scheduler 3): a thread per step instance, with a Data.Map of MVars for data.
− Global work queue (schedulers 4, 5, 6, 7, 10): workers pull step instances from a shared pool.
− Continuations + work stealing (schedulers 10, 11): simple per-thread Data.Sequence deques (see the sketch below).
− Haskell "spark" work stealing (schedulers 8, 9): the mechanism used by `par`, with Cilk-ish sync/join on top of IO threads.

[Figure: thread-per-step, global-work-pool, and per-thread work-deque architectures.]
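A minimal sketch of the deque operations behind the work-stealing bullet above, assuming only what the slide says (Data.Sequence as the deque). The real schedulers would wrap one such deque per worker in a mutable reference; that plumbing is omitted here.

    import Data.Sequence (Seq, ViewL (..), ViewR (..), empty, viewl, viewr, (|>))

    -- The owning worker pushes and pops at one end...
    pushBottom :: Seq a -> a -> Seq a
    pushBottom q x = q |> x

    popBottom :: Seq a -> Maybe (a, Seq a)
    popBottom q = case viewr q of
      EmptyR  -> Nothing
      q' :> x -> Just (x, q')

    -- ...while thieves take from the opposite end, keeping
    -- owner and thief from contending on the same items.
    stealTop :: Seq a -> Maybe (a, Seq a)
    stealTop q = case viewl q of
      EmptyL  -> Nothing
      x :< q' -> Just (x, q')

    main :: IO ()
    main = print (popBottom (pushBottom empty (42 :: Int)))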

Slide 17: Back to graphs…

Slides 18-20: Recent Results

[Figures: speedup graphs, not preserved in this transcript.]

Slide 21: One bit of what we're up against

Example: tight non-allocating loops. (GHC can only preempt a thread at heap-allocation points, so a loop that never allocates can stall scheduling and delay parallel GC.)
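A hedged illustration (not from the deck) of the kind of loop in question; the function name and iteration count are made up.

    -- Compiled with -O, this becomes a tight register loop that
    -- never allocates. GHC's runtime can only interrupt a thread
    -- at heap-allocation points, so while spin runs, the thread
    -- cannot be preempted -- delaying the stop-the-world
    -- synchronization that parallel GC needs from every core.
    spin :: Int -> Int
    spin 0 = 0
    spin n = spin (n - 1)

    main :: IO ()
    main = print (spin 100000000)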

Slide 22: Conclusions from the early prototype

− Lightweight user threads (3): simple and predictable, but too much overhead.
− Global work queue (4, 5, 6, 7, 10): better scaling than expected.
− Work stealing (11): will win in the end -- needs good deques!
− Haskell "spark" work stealing (8, 9): a flop (for now).

No clear scheduler winners! The CnC mantra: separate tuning (including scheduler selection) from application semantics.

Slide 23: Conclusions

Push scaling in Haskell:
− Overcame problems (bugs, blackhole issues).
− GHC 6.13 does *much* better than 6.12.
− Greater than 20X speedups.
− A start: GHC will get a scalable GC (eventually), and a fix for non-allocating loops may be warranted.

It's open source (on Hackage) -- feel free to play the "write a scheduler" game! See my website for the draft paper (google Ryan Newton).

Slide 25: Future Work, Haskell CnC Specific

− GHC needs a scalable garbage collector (in progress).
− Much incremental scheduler improvement is left.
− We need more detailed profiling and a better understanding of variance: lazy evaluation and a complex runtime are a challenge!

Slide 26: Moving forward: radical optimization

Today CnC is mostly "WYSIWYG":
− A step becomes a task (though we can control the scheduler).
− The user is responsible for granularity.
− Typical of dynamically scheduled parallelism.

CnC is moving to a point where profiling and graph analysis enable significant transformations:
− Implemented currently: hybrid static/dynamic scheduling -- prune the space of schedule and granularity choices offline.
− Future: efficient selection of data structures for collections (next slide) -- vectors (dense) vs. hashmaps (sparse), and concurrent vs. non-concurrent data structures.

Slide 27: Tag functions: data access analysis

How nice -- a declarative specification of data access!

    (mystep : i) <- [mydata : i-1], [mydata : i], [mydata : i+1]

This is much less painful than traditional loop index analysis. Use it to:
− Check the program for correctness.
− Analyze density and convert data structures accordingly.
− Generate precise garbage-collection strategies.
− At the extreme: a full understanding of data dependencies can replace control dependencies.
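For concreteness, a hedged sketch (not from the deck) of a step whose gets match the tag function above, written against the same API as the slide-13 example; the output collection out and the arithmetic are illustrative.

    -- A stencil step: prescribed on tag i, it reads mydata at
    -- i-1, i, and i+1 -- exactly the accesses the tag function
    -- declares -- and puts one item into an output collection.
    mystep mydata out i =
      do left   <- get mydata (i - 1)
         center <- get mydata i
         right  <- get mydata (i + 1)
         put out i (left + center + right)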