Presentation on theme: "@gfraiteur a deep investigation on how .NET multithreading primitives map to hardware and Windows Kernel You Thought You Understood Multithreading Gaël Fraiteur"— Presentation transcript:

1 @gfraiteur a deep investigation on how .NET multithreading primitives map to hardware and the Windows Kernel You Thought You Understood Multithreading Gaël Fraiteur PostSharp Technologies Founder & Principal Engineer

2 @gfraiteur Hello! My name is GAEL. No, I don’t think my accent is funny. My Twitter: @gfraiteur

3 @gfraiteur Agenda 1. Hardware 2. Windows Kernel 3. Application

4 @gfraiteur Microarchitecture Hardware

5 @gfraiteur There was no RAM in Colossus, the world’s first electronic programmable computing device (1944). http://en.wikipedia.org/wiki/File:Colossus.jpg

6 @gfraiteur Hardware Processor microarchitecture Intel Core i7 980x

7 @gfraiteur Hardware Hardware latency Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra

8 @gfraiteur Hardware Cache Hierarchy Memory is distributed into caches: L1 (per core), L2 (per core), L3 (per die), backed by the main memory store (DRAM). [Diagram: two dies, each with four processor cores and per-core L2 caches sharing an L3 cache, linked by a core interconnect and a processor interconnect to DRAM.] Caches are synchronized through an asynchronous “bus”: request/response, message queues, buffers.

9 @gfraiteur Hardware Memory Ordering Issues due to implementation details: Asynchronous bus Caching & Buffering Out-of-order execution

10 @gfraiteur Hardware Memory Ordering Scenario 1: Processor 0 executes Store A := 1, then Store B := 1; Processor 1 executes Load A, then Load B. Possible values of (A, B) for the loads? Scenario 2: Processor 0 executes Load A, then Store B := 1; Processor 1 executes Load B, then Store A := 1. Possible values of (A, B) for the loads? Scenario 3: Processor 0 executes Store A := 1, then Load B; Processor 1 executes Store B := 1, then Load A. Possible values of (A, B) for the loads?

11 @gfraiteur Hardware Memory Models: Guarantees http://en.wikipedia.org/wiki/Memory_ordering x86, AMD64: strong ordering. ARM: weak ordering.

12 @gfraiteur Hardware Memory Models Compared Example 1: Processor 0 executes Store A := 1, then Store B := 1; Processor 1 executes Load A, then Load B. Example 2: Processor 0 executes Load A, then Store B := 1; Processor 1 executes Load B, then Store A := 1.

13 @gfraiteur Hardware Compiler Memory Ordering The compiler (and the CLR JIT) can: cache memory into a register, reorder, coalesce writes. The volatile keyword disables these compiler optimizations. And that’s all!
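As a hedged illustration of the caching optimization the slide mentions (the VolatileDemo class and its timings are invented for this sketch): without volatile, the JIT may hoist the read of the flag out of the loop and spin forever.

```csharp
using System;
using System.Threading;

class VolatileDemo
{
    // Without 'volatile', the JIT could cache _stop in a register and the
    // worker's loop might never observe the write from the main thread.
    private static volatile bool _stop;

    static bool Run()
    {
        _stop = false;
        var worker = new Thread(() =>
        {
            while (!_stop) { }     // re-reads _stop on every iteration
        });
        worker.Start();
        Thread.Sleep(100);         // let the worker enter the loop
        _stop = true;              // this write becomes visible to the worker
        return worker.Join(5000);  // true if the worker exited the loop
    }

    static void Main() => Console.WriteLine(Run() ? "stopped" : "stuck");
}
```

Note that volatile only constrains the compiler and inserts acquire/release semantics; it is not a full fence, which is what the next slide's Thread.MemoryBarrier() provides.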

14 @gfraiteur Hardware Thread.MemoryBarrier() Processor 0: Store A := 1; Thread.MemoryBarrier(); Load B. Processor 1: Store B := 1; Thread.MemoryBarrier(); Load A. Barriers serialize access to memory.
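The two-processor table above can be sketched as a runnable experiment (the FenceDemo class and the iteration count are invented for illustration). With the full fences in place, at least one thread must observe the other's store, so (0, 0) cannot occur; removing them can let x86 store buffering occasionally produce (0, 0).

```csharp
using System;
using System.Threading;

class FenceDemo
{
    static int A, B;

    // Runs the slide's two-processor experiment once and returns what each
    // thread loaded.
    static (int a, int b) RunOnce()
    {
        A = 0; B = 0;
        int localA = 0, localB = 0;

        var t0 = new Thread(() =>
        {
            A = 1;
            Thread.MemoryBarrier();  // store to A is globally visible before B is loaded
            localB = B;
        });
        var t1 = new Thread(() =>
        {
            B = 1;
            Thread.MemoryBarrier();  // store to B is globally visible before A is loaded
            localA = A;
        });

        t0.Start(); t1.Start();
        t0.Join(); t1.Join();
        return (localA, localB);
    }

    static void Main()
    {
        for (int i = 0; i < 1000; i++)
        {
            var (a, b) = RunOnce();
            if (a == 0 && b == 0) { Console.WriteLine("(0, 0) observed!"); return; }
        }
        Console.WriteLine("(0, 0) never observed");
    }
}
```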

15 @gfraiteur Hardware Atomicity Processor 0: Store A := (long) 0xFFFFFFFF. Processor 1: Load A. Atomicity is guaranteed up to the native word size (IntPtr), for properly aligned fields (attention to struct fields!).

16 @gfraiteur Hardware Interlocked access The only locking mechanism at hardware level. System.Threading.Interlocked: Add(ref x, int a) { x += a; return x; } Increment: Inc(ref x) { x += 1; return x; } Exchange: Exc(ref x, int v) { int t = x; x = v; return t; } CompareExchange: CAS(ref x, int v, int c) { int t = x; if ( t == c ) x = v; return t; } Read(ref long). Implemented by the L3 cache and/or interconnect messages. Implicit memory barrier.
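A minimal sketch of these primitives in use (the InterlockedDemo class and the thread/iteration counts are invented for the example): Increment makes concurrent updates lock-free and lossless, and CompareExchange is the CAS operation the slide's pseudocode describes.

```csharp
using System;
using System.Threading;

class InterlockedDemo
{
    static int _counter;

    static int Run()
    {
        _counter = 0;
        var threads = new Thread[10];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 1000; j++)
                    Interlocked.Increment(ref _counter);  // one locked instruction, no kernel call
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return _counter;
    }

    static void Main()
    {
        Console.WriteLine(Run());  // always 10000; a plain _counter++ could lose updates

        // CompareExchange: the store happens only if the current value
        // equals the comparand; the original value is returned either way.
        int x = 5;
        int original = Interlocked.CompareExchange(ref x, 10, 5);
        Console.WriteLine($"{original} -> {x}");  // 5 -> 10
    }
}
```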

17 @gfraiteur Hardware Cost of interlocked Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.

18 @gfraiteur Hardware Cost of cache coherency

19 @gfraiteur Multitasking No thread, no process at the hardware level. There is no such thing as “wait”. One core never does more than one “thing” at the same time (except with HyperThreading). Task-State Segment: CPU state, permissions (I/O, IRQ), memory mapping. A task runs until interrupted by hardware or the OS.

20 @gfraiteur Lab: Non-Blocking Algorithms Non-blocking stack (4.5 MT/s) Ring buffer (13.2 MT/s)

21 @gfraiteur Windows Kernel Operating System

22 @gfraiteur SideKick (1988). http://en.wikipedia.org/wiki/SideKick

23 @gfraiteur Windows Kernel Processes and Threads Process: virtual address space, resources, at least 1 thread (among other things). Thread: program (sequence of instructions), CPU state, wait dependencies (among other things).

24 @gfraiteur Windows Kernel Processes and Threads 8 Logical Processors (4 hyper-threaded cores) 2384 threads 155 processes

25 @gfraiteur Windows Kernel Dispatching threads to CPUs [Diagram: Ready threads (D, E, F) queue for a CPU; Executing threads (G, H, I, J) run on CPUs 0–3; Waiting threads (A, B, C) block on dispatcher objects: mutex, semaphore, event, timer.]

26 @gfraiteur Windows Kernel Dispatcher Objects (WaitHandle) Kinds of dispatcher objects: Event, Mutex, Semaphore, Timer, Thread. Characteristics: live in the kernel, sharable among processes, expensive!

27 @gfraiteur Windows Kernel Cost of kernel dispatching Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI)

28 @gfraiteur Windows Kernel Wait Methods Thread.Sleep Thread.Yield Thread.Join WaitHandle.Wait (Thread.SpinWait)

29 @gfraiteur Windows Kernel SpinWait: smart busy loops 1. Thread.SpinWait (hardware awareness, e.g. Intel HyperThreading) 2. Thread.Sleep(0) (gives up to threads of same priority) 3. Thread.Yield() (gives up to threads of any priority) 4. Thread.Sleep(1) (sleeps for 1 ms) Processor 0: A = 1; B = 1; Processor 1: SpinWait w = new SpinWait(); int localB; if ( A == 1 ) { while ( (localB = B) == 0 ) w.SpinOnce(); }

30 @gfraiteur .NET Framework Application Programming

31 @gfraiteur Application Programming Design Objectives Ease of use High performance from applications!

32 @gfraiteur Application Programming Instruments from the platform Hardware Interlocked SpinWait Volatile Operating System Thread WaitHandle (mutex, event, semaphore, …) Timer

33 @gfraiteur Application Programming New Instruments Concurrency & Synchronization: Monitor (lock), SpinLock, ReaderWriterLock, Lazy, LazyInitializer, Barrier, CountdownEvent. Data Structures: BlockingCollection, Concurrent(Queue|Stack|Bag|Dictionary), ThreadLocal, ThreadStatic. Thread Dispatching: ThreadPool, Dispatcher, ISynchronizeInvoke, BackgroundWorker. Frameworks: PLINQ, Tasks, Async/Await.

34 @gfraiteur Concurrency & Synchronization Application Programming

35 @gfraiteur Concurrency & Synchronization Monitor: the lock keyword 1. Start with interlocked operations (no contention) 2. Continue with “spin wait” 3. Create kernel event and wait. Good performance if contention is low.
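The escalation above is hidden behind the lock keyword. A minimal sketch (the LockDemo class and the bank-balance scenario are invented for illustration):

```csharp
using System;
using System.Threading;

class LockDemo
{
    private static readonly object _sync = new object();
    private static int _balance;

    static void Withdraw(int amount)
    {
        // 'lock' compiles to Monitor.Enter/Exit in a try/finally. Uncontended,
        // Enter is just an interlocked operation on the object header; the
        // kernel event is created only when a thread actually has to wait.
        lock (_sync)
        {
            if (_balance >= amount)
                _balance -= amount;
        }
    }

    static int Run()
    {
        _balance = 100;
        var t1 = new Thread(() => Withdraw(60));
        var t2 = new Thread(() => Withdraw(60));
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        return _balance;
    }

    static void Main() => Console.WriteLine(Run());  // 40: only one withdrawal fits
}
```

Without the lock, both threads could pass the balance check before either subtracts, leaving an inconsistent balance.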

36 @gfraiteur Concurrency & Synchronization ReaderWriterLock Concurrent reads allowed Writes require exclusive access
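A hedged sketch of this pattern using ReaderWriterLockSlim, the lighter-weight variant (the RwLockDemo class and its cache scenario are invented for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class RwLockDemo
{
    private static readonly ReaderWriterLockSlim _rw = new ReaderWriterLockSlim();
    private static readonly Dictionary<string, int> _cache = new Dictionary<string, int>();

    static int? Read(string key)
    {
        _rw.EnterReadLock();           // many readers may hold this concurrently
        try { return _cache.TryGetValue(key, out var v) ? v : (int?)null; }
        finally { _rw.ExitReadLock(); }
    }

    static void Write(string key, int value)
    {
        _rw.EnterWriteLock();          // exclusive: blocks all readers and writers
        try { _cache[key] = value; }
        finally { _rw.ExitWriteLock(); }
    }

    static void Main()
    {
        Write("answer", 42);
        Console.WriteLine(Read("answer"));  // 42
    }
}
```

This pays off when reads vastly outnumber writes; with frequent writes, a plain lock is often just as fast.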

37 @gfraiteur Concurrency & Synchronization Lab: ReaderWriterLock

38 @gfraiteur Data Structures Application Programming

39 @gfraiteur Data Structures BlockingCollection Use case: pass messages between threads. TryTake: waits for an item and takes it. Add: waits for free capacity (if bounded) and adds the item to the queue. CompleteAdding: no more items.
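A minimal producer/consumer sketch of those three operations (the PipelineDemo class and the bound of 2 are invented for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class PipelineDemo
{
    static void Main()
    {
        // Bounded to 2 items: Add blocks while the queue is full.
        var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var consumer = new Thread(() =>
        {
            // Blocks (like TryTake) until an item is available; the loop ends
            // once CompleteAdding has been called and the queue is drained.
            foreach (int item in queue.GetConsumingEnumerable())
                Console.WriteLine($"consumed {item}");
        });
        consumer.Start();

        for (int i = 0; i < 5; i++)
            queue.Add(i);          // producer

        queue.CompleteAdding();    // signals "no more items"
        consumer.Join();
    }
}
```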

40 @gfraiteur Data Structures Lab: BlockingCollection

41 @gfraiteur Data Structures ConcurrentStack [Diagram: producers A–C and consumers A–C all operate on the top node of a linked list of nodes.] Consumers and producers compete for the top of the stack.

42 @gfraiteur Data Structures Lab: Concurrent Stack

43 @gfraiteur Data Structures ConcurrentQueue [Diagram: a linked list of nodes with producers A–C at the tail and consumers A–C at the head.] Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.
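A small sketch of the lock-free contention pattern described above (the ConcurrentQueueDemo class and the count of 1000 are invented for the example): many producers hit the tail concurrently, then the consumer drains the head.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentQueueDemo
{
    static void Main()
    {
        var queue = new ConcurrentQueue<int>();

        // Many producers enqueue at the tail without taking a lock.
        Parallel.For(0, 1000, i => queue.Enqueue(i));

        // TryDequeue competes for the head and fails only when the queue is empty.
        int count = 0;
        while (queue.TryDequeue(out _))
            count++;

        Console.WriteLine(count);  // 1000
    }
}
```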

44 @gfraiteur Data Structures ConcurrentBag [Diagram: each thread (A, B) has its own producer, consumer, and thread-local storage of nodes within the shared bag.] Threads preferentially consume nodes they produced themselves. No ordering.

45 @gfraiteur Data Structures ConcurrentDictionary TryAdd TryUpdate (CompareExchange) TryRemove AddOrUpdate GetOrAdd
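A hedged sketch of the two composite operations (the ConcurrentDictionaryDemo class and the hit-counter scenario are invented for illustration); as the slide notes, TryUpdate and friends are built on CompareExchange-style retry loops, so concurrent updates are never lost.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentDictionaryDemo
{
    static void Main()
    {
        var counts = new ConcurrentDictionary<string, int>();

        // AddOrUpdate retries internally until its CAS wins, so 1000
        // concurrent increments of the same key all land.
        Parallel.For(0, 1000, _ =>
            counts.AddOrUpdate("hits", 1, (key, old) => old + 1));

        Console.WriteLine(counts["hits"]);                // 1000
        Console.WriteLine(counts.GetOrAdd("misses", 0));  // 0: inserted on first use
    }
}
```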

46 @gfraiteur Thread Dispatching Application Programming

47 @gfraiteur Thread Dispatching ThreadPool Problems solved: execute a task asynchronously (QueueUserWorkItem); wait for a WaitHandle (RegisterWaitForSingleObject); creating a thread is expensive. Good for I/O-intensive operations. Bad for CPU-intensive operations.
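A minimal sketch of QueueUserWorkItem (the ThreadPoolDemo class and the ManualResetEvent handshake are invented for the example): the work runs on a pooled thread, and the main thread waits on a kernel dispatcher object for completion.

```csharp
using System;
using System.Threading;

class ThreadPoolDemo
{
    static void Main()
    {
        var done = new ManualResetEvent(false);

        // Borrows a pooled thread instead of paying thread-creation cost.
        ThreadPool.QueueUserWorkItem(state =>
        {
            Console.WriteLine("on pool thread: " +
                              Thread.CurrentThread.IsThreadPoolThread);  // True
            done.Set();
        });

        done.WaitOne();  // the main thread blocks on the kernel event
    }
}
```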

48 @gfraiteur Thread Dispatching Dispatcher Execute a task in the UI thread Synchronously: Dispatcher.Invoke Asynchronously: Dispatcher.BeginInvoke Under the hood, uses the HWND message loop

49 @gfraiteur Thread Dispatching

50 @gfraiteur Q&A Gael Fraiteur gael@postsharp.net @gfraiteur

51 Summary

52 @gfraiteur http://www.postsharp.net/ gael@postsharp.net

