Joe Hummel, PhD UC-Irvine

Presentation transcript:

1 Joe Hummel, PhD UC-Irvine

2 Going Parallel with C++11

• The new C++ standard has been ratified
◦ "C++0x" ==> "C++11"
• Lots of new features
• This workshop focuses on the concurrency features

3
• Async programming: better responsiveness…
◦ GUIs (desktop, web, mobile) ◦ Cloud ◦ Windows 8
• Parallel programming: better performance…
◦ Financials ◦ Scientific ◦ Big data

4

#include <iostream>
#include <thread>

// A simple function for the thread to do…
void func()
{
  std::cout << "**Inside thread " << std::this_thread::get_id() << "!" << std::endl;
}

int main()
{
  std::thread t;
  t = std::thread( func );  // create and schedule thread…
  t.join();                 // wait for thread to finish…
  return 0;
}

5  Hello world…

6

#include <iostream>
#include <thread>

void func()
{
  std::cout << "**Hello world...\n";
}

int main()
{
  std::thread t;
  t = std::thread( func );
  t.join();
  return 0;
}

(1) The thread function must do its own exception handling; an unhandled exception ==> termination…

void func()
{
  try {
    // computation:
  }
  catch(...) {
    // do something:
  }
}

(2) Must join, otherwise termination… (avoid use of detach( ), which is difficult to use safely)

7
• Old school:
◦ distinct thread functions (what we just saw)
• New school:
◦ lambda expressions (aka anonymous functions)

8  A thread that loops until we tell it to stop…
When the user presses ENTER, we'll tell the thread to stop…

9

#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
using namespace std;

void loopUntil(bool *stop)
{
  auto duration = chrono::seconds(2);
  while (!(*stop)) {
    cout << "**Inside thread...\n";
    this_thread::sleep_for(duration);
  }
}

int main()
{
  bool stop(false);
  thread t(loopUntil, &stop);

  getchar();      // wait for user to press enter:
  stop = true;    // stop thread:

  t.join();
  return 0;
}

10

int main()
{
  bool stop(false);

  thread t( [&]()   // lambda expression; () holds the lambda's arguments
  {
    auto duration = chrono::seconds(2);
    while (!stop) {
      cout << "**Inside thread...\n";
      this_thread::sleep_for(duration);
    }
  } );

  getchar();      // wait for user to press enter:
  stop = true;    // stop thread:

  t.join();
  return 0;
}

Closure semantics: [ ]: capture nothing, [&]: by ref, [=]: by val, …

11
• Lambdas:
◦ Easier and more readable -- code remains inline
◦ Potentially more dangerous ([&] captures everything by ref)
• Functions:
◦ More efficient -- lambdas involve a compiler-generated class and function objects
◦ Potentially safer -- require explicit variable scoping
◦ More cumbersome and less readable

12  Multiple threads looping…
When the user presses ENTER, all threads stop…

13

#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

int main()
{
  cout << "** Main Starting **\n\n";
  bool stop = false;

  vector<thread> workers;
  for (int i = 1; i <= 3; i++)
  {
    workers.push_back( thread([i, &stop]()
    {
      while (!stop) {
        cout << "**Inside thread " << i << "...\n";
        this_thread::sleep_for(chrono::seconds(i));
      }
    }) );
  }

  getchar();      // wait for user to press enter:
  stop = true;    // stop threads:

  // wait for threads to complete:
  for ( thread& t : workers )
    t.join();

  cout << "** Main Done **\n\n";
  return 0;
}

14  Matrix multiply…

15

// 1 thread per core:
numthreads = thread::hardware_concurrency();

int rows  = N / numthreads;
int extra = N % numthreads;
int start = 0;        // each thread does [start..end)
int end   = rows;

vector<thread> workers;

for (int t = 1; t <= numthreads; t++)
{
  if (t == numthreads)   // last thread does extra rows:
    end += extra;

  workers.push_back( thread([start, end, N, &C, &A, &B]()
  {
    for (int i = start; i < end; i++)
      for (int j = 0; j < N; j++)
      {
        C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
          C[i][j] += (A[i][k] * B[k][j]);
      }
  }));

  start = end;
  end   = start + rows;
}

for (thread& t : workers)
  t.join();

16  Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy ─ Contention

• Expose parallelism
• Maximize data locality: network  disk  RAM  cache  core
• Minimize interaction: false sharing, locking, synchronization

17

18  Loop interchange is the first step…

workers.push_back( thread([start, end, N, &C, &A, &B]()
{
  for (int i = start; i < end; i++)
    for (int j = 0; j < N; j++)
      C[i][j] = 0.0;

  // interchanged: k before j, so B is now walked row by row:
  for (int i = start; i < end; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        C[i][j] += (A[i][k] * B[k][j]);
}));

The next step is to block multiply…

19

20  No compiler yet fully implements C++11
• Visual C++ has the best concurrency support
◦ Part of Visual Studio 2012
• gcc 4.7 has the best overall support
• clang 3.1 appears very good as well
◦ I did not test it

21

# makefile

# threading library: one of these should work
# tlib=thread
tlib=pthread

# gcc 4.6:
ver=c++0x
# gcc 4.7:
# ver=c++11

build:
	g++ -std=$(ver) -Wall main.cpp -l$(tlib)

22

Concept         Header                 Summary
Threads         <thread>               Standard, low-level, type-safe; good basis for building HL systems (futures, tasks, …)
Futures         <future>               Via async function; hides threading, better harvesting of return value & exception handling
Locking         <mutex>                Standard, low-level locking primitives
Condition Vars  <condition_variable>   Low-level synchronization primitives
Atomics         <atomic>               Predictable, concurrent access without data race
Memory Model    (language feature)     "Catch fire" semantics; if the program contains a data race, the behavior of memory is undefined
Thread Local    thread_local           Thread-local variables [problematic => avoid]

23  Use a mutex to protect against concurrent access…

#include <mutex>

mutex m;
int sum;

thread t1([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
} );

thread t2([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
} );

24  "Resource Acquisition Is Initialization"
◦ Advocated by B. Stroustrup for resource management
◦ Uses constructor & destructor to properly manage resources (files, threads, locks, …) in the presence of exceptions, etc.

thread t([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
});

should be written as…

thread t([&]()
{
  lock_guard<mutex> lg(m);   // locks m in constructor…
  sum += compute();
} );                         // …unlocks m in destructor

25  Use atomic types to protect shared variables…
◦ Lighter-weight than locking, but much more limited in applicability

#include <atomic>

atomic<int> count;
count = 0;

thread t1([&]() { count++; });

thread t2([&]() { count++; });

thread t3([&]() { count = count + 1; });   // X not safe… (a separate read then
                                           // write, not one atomic increment)

26  Atomics enable safe, lock-free programming
◦ "Safe" is a relative word…

done flag:

int x;
atomic<bool> done;
done = false;

thread t1([&]() {
  x = 42;
  done = true;
});

thread t2([&]() {
  while (!done)
    ;
  assert(x==42);
});

lazy init (both threads run the same code):

int x;
atomic<bool> initd;
initd = false;

thread t1([&]() {
  if (!initd) {
    lock_guard<mutex> _(m);
    x = 42;
    initd = true;
  }
  // …
});

thread t2([&]() {
  if (!initd) {
    lock_guard<mutex> _(m);
    x = 42;
    initd = true;
  }
  // …
});

27  Prime numbers…

28  Futures provide a higher level of abstraction
◦ Start an asynchronous operation on some thread, await the result…

#include <future>
.
future<int> fut = async( []() -> int   // return type…
{
  int result = PerformLongRunningOperation();
  return result;
} );
.
try
{
  int x = fut.get();   // join, harvest result:
  cout << x << endl;
}
catch(exception &e)
{
  cout << "**Exception: " << e.what() << endl;
}

29
• May run on the current thread
• May run on a new thread
• Often better to let the system decide…

// run on the current thread when someone asks for the value ("lazy"):
future<…> fut1 = async( launch::sync, []() -> ... );
future<…> fut2 = async( launch::deferred, []() -> ... );

// run on a new thread:
future<…> fut3 = async( launch::async, []() -> ... );

// let the system decide:
future<…> fut4 = async( launch::any, []() -> ... );
future<…> fut5 = async( []() ... );

30  Netflix data-mining…

Netflix Movie Reviews (.txt)  -->  Netflix Data Mining App  -->  computes the average review for a movie…

31  The C++ committee thought long and hard about the memory model semantics…
◦ "You Don't Know Jack About Shared Variables or Memory Models", Boehm and Adve, CACM, Feb 2012
• Conclusion:
◦ No suitable definition in the presence of race conditions
• Solution:
◦ A predictable memory model *only* in data-race-free code
◦ The computer may "catch fire" in the presence of data races

32

int x, y, r1, r2;
x = y = r1 = r2 = 0;

thread t1([&]() {
  x = 1;
  r1 = y;
} );

thread t2([&]() {
  y = 1;
  r2 = x;
} );

t1.join();
t2.join();

What can we say about r1 and r2?

33 If we think in terms of all possible thread interleavings (aka "sequential consistency"), then we know r1 == 1, r2 == 1, or both.

In C++11? Not only are the values of x, y, r1 and r2 undefined, but the program may crash!

34  A program is data-race-free (DRF) if no sequentially-consistent execution results in a data race. Avoid anything else.

Def: two memory accesses conflict if they
1. access the same scalar object or contiguous sequence of bit fields, and
2. at least one access is a store.

Def: two memory accesses participate in a data race if they
1. conflict, and
2. can occur simultaneously (via independent threads, locks, atomics, …).

35

36  Tasks are a higher-level abstraction
◦ Idea: developers identify the work; the run-time system deals with the execution details

Task: a unit of work; an object denoting an ongoing operation or computation.

37  Microsoft PPL: Parallel Patterns Library

Matrix multiply:

#include <ppl.h>

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    C[i][j] = 0.0;

// for (int i = 0; i < N; i++)
Concurrency::parallel_for(0, N, [&](int i)
{
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);
});

38 [architecture diagram] How PPL executes a parallel_for: the Parallel Patterns Library sits on a Task Scheduler and Resource Manager layered over Windows; a task posted by parallel_for( ... ) goes into a global work queue inside the Windows process, and worker threads from the thread pool pull tasks from that queue.

39

40  Presenter: Joe Hummel
◦ Materials:
• References:
◦ Book: "C++ Concurrency in Action", by Anthony Williams
◦ Talks: Bjarne and friends at MSFT's "Going Native 2012"
◦ Tutorials: a really nice series by Bartosz Milewski
◦ FAQ: Bjarne Stroustrup's extensive C++11 FAQ

