1 Optimizing Game Architectures with Task-based Parallelism
Brad Werth, Intel Senior Software Engineer

2 Parallelism in games is no longer optional
The unending quest for realism in games is causing game content and gameplay to become increasingly complex. More complicated scenes + more complicated behavior = increased computation. CPUs and GPUs are no longer competing on clock speed, but on degree of parallelism. High-end games require threading. You can't go home again.

3 Threaded architectures for games are challenging to design
Techniques for threading individual computations/systems are well-known, but...
• the techniques often have inefficient interactions.
• games rely on middleware to provide some functionality – more potential conflict.
• the moment-to-moment workload can change dramatically.
• the variety of CPU topologies complicates tuning.
Task-based parallelism is a viable way out of this mess. But first, let's gaze into the abyss...

4 A threaded game architecture – full of pain and oversubscription
• Particles array could be partitioned.
• One-off jobs run on "job threads".
• Physics threads are created by middleware.
• Sound mixing is on a dedicated thread.
• Bones/skinning is a Directed Acyclic Graph.
[Diagram: Particles, Jobs, Bones, Physics, Sound; ? ? + ? = ???]

5 Tasks are an efficient method for handling all of this parallel work
With a task scheduler, and with all of this work decomposed into tasks, then...
• one thread pool can process all work.
• oversubscription will be avoided by using the same threads for all parallel work.
• the game will scale well to different core topologies without painful tuning.
Tasks can do it!

6 Task-based parallelism is agile threading
A thread is...
• run on the OS.
• able to be pre-empted.
• expected to wait on events.
• most efficient with some oversubscription.
• optimized for a specific core topology.
A task is...
• run on a thread pool.
• run to completion.
• heavily penalized for blocking.
• efficient by avoiding oversubscription.
• able to adapt to any number of threads/cores.

7 Tasks are the uninterrupted portions of threaded work
[Diagram labels: Texture Lookup, Data Parallelism, Processing, Setup]

8 Tasks can be arranged in a dependency graph
[Diagram labels: Texture Lookup, Data Parallelism, Processing, Setup]

9 Dependency graph can be mapped to a thread pool
Lots of work means lots of tasks which fill in the gaps in the thread pool. The decomposition of tasks and mapping to threads is the job of the task scheduler.

10 Task schedulers have similar ingredients but different flavors
The Cilk scheduler has been extremely influential. Most schedulers have task queues per thread to avoid contention (often multiple queues per thread). Cache-aware distribution of work is a key performance feature. Most prevent direct manipulation of queues.
The APIs vary in some ways:
• Constructive schedulers define tasks a priori.
• Reductive schedulers subdivide tasks in flight.
• Event-driven schedulers trigger off of I/O.
• Computation schedulers are triggered manually.
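One common arrangement for per-thread queues, used by Cilk-style work-stealing schedulers, is for the owning thread to take its newest task while idle threads take the oldest task from the other end. The following is a minimal sketch of that idea in standard C++, assuming a mutex-protected deque rather than the lock-free structures real schedulers use. `WorkQueue`, `popLocal`, `steal`, and `workQueueDemo` are hypothetical names, not TBB API, and a task is reduced to an integer id.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-thread task queue; a task is just an int id in this sketch.
class WorkQueue {
public:
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tasks.push_back(task);
    }

    // Owner thread: take the newest task, whose data is most likely still in cache.
    bool popLocal(int &task) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.empty()) return false;
        task = m_tasks.back();
        m_tasks.pop_back();
        return true;
    }

    // Idle thread: take the oldest task from the opposite end of the queue.
    bool steal(int &task) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.empty()) return false;
        task = m_tasks.front();
        m_tasks.pop_front();
        return true;
    }

private:
    std::mutex m_mutex;
    std::deque<int> m_tasks;
};

// Usage sketch: the owner sees the newest task, a thief sees the oldest.
void workQueueDemo() {
    WorkQueue queue;
    queue.push(1);
    queue.push(2);
    int task = 0;
    assert(queue.popLocal(task) && task == 2);
    assert(queue.steal(task) && task == 1);
}
```

Taking from opposite ends keeps the owner and the thieves away from each other most of the time, which is one way to reduce the contention the slide refers to.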

11 Threading Building Blocks is Intel's Open Source task-based scheduler
TBB is a reductive, computation scheduler designed to...
• be cross-platform (Windows*, OS X*, Linux, Xbox 360*).
• simplify data parallelism coding.
• provide scalability and high performance.
TBB has a high-level API for easy parallelism and a low-level API for control. The API is not so low-level that it exposes threads or queues.
*Other names and brands may be claimed as the property of others.

12 Enough! Let's look at code
This talk shows code solutions to threaded game architecture problems. Common threading patterns in games are decomposed into tasks, using the TBB API. The code is available:

13 Start with the easy stuff – turn independent loops into tasks
The TBB high-level API provides parallel_for(). Behold, the humble for loop:
    for(int i = 0; i < ELEMENT_MAX; ++i) {
        doSomethingWith(element[i]);
    }

14 Using parallel_for() is a 2-step process; step 1 is objectify the loop
    class DoSomethingContext {
    public:
        // parallel_for() calls a const operator() with each sub-range
        void operator()( const tbb::blocked_range<int> &range ) const {
            for(int i = range.begin(); i != range.end(); ++i) {
                doSomethingWith(element[i]);
            }
        }
    };

15 parallel_for() step 2: invoke the objectified loop with a range
    tbb::parallel_for( tbb::blocked_range<int>(0, ELEMENT_MAX), *pDoSomethingContext );
[Diagram: Particles]
For more general task decomposition problems, we need a low-level API...
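The two parallel_for() steps can be imitated with only the C++ standard library, which may help show what the body object and the range are doing. This is a rough sketch under stated assumptions, not how TBB is implemented: it creates one std::thread per chunk instead of reusing a thread pool, and it splits the range once instead of subdividing it in flight. `IndexRange`, `chunkedFor`, and `SquareElements` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<int>: the half-open range [begin, end).
struct IndexRange {
    int begin;
    int end;
};

// Runs body(range) on up to `workers` contiguous sub-ranges of [0, count),
// one std::thread per sub-range, and returns once every sub-range is done.
template <typename Body>
void chunkedFor(int count, int workers, const Body &body) {
    assert(count >= 0 && workers > 0);
    const int chunk = (count + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> threads;
    for (int begin = 0; begin < count; begin += chunk) {
        const IndexRange range = {begin, std::min(begin + chunk, count)};
        threads.emplace_back([&body, range] { body(range); });
    }
    for (std::thread &thread : threads) {
        thread.join();
    }
}

// Step 1, the objectified loop: same shape as DoSomethingContext,
// but the per-element work is squaring each element in place.
class SquareElements {
public:
    explicit SquareElements(std::vector<int> &elements) : m_elements(elements) {}

    void operator()(const IndexRange &range) const {
        for (int i = range.begin; i != range.end; ++i) {
            m_elements[i] *= m_elements[i];
        }
    }

private:
    std::vector<int> &m_elements;
};
```

Step 2 is then a single call such as `chunkedFor(count, 4, SquareElements(elements));`. Each chunk touches a disjoint slice of the vector, so the body needs no locking.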

16 TBB low-level API: work trees with blocking calls and/or continuations
[Diagram: work trees with labels Root, Task, More, Callback, Spawn & Wait, Spawn, Wait]
Blocking calls go down. Continuations go up.

17 Work trees can implement common game threading patterns
The TBB low-level API creates and processes trees of work – each node is a task. Work trees of tasks can be made to process:
• Callbacks
• Promises
• Synchronized callbacks
• Long, low priority operations
• Directed acyclic graph
We'll look at how these patterns can be decomposed into tasks using the TBB low-level API.

18 Callbacks – send it off and never wait
Callbacks are function pointers executed on another thread. Execution begins immediately. No waiting on individual callbacks - can wait in aggregate.
    void doCallback(FunctionPointer fFunc, void *pParam);

19 Code and tree: Callback
    void doCallback(FunctionPointer fCallback, void *pParam) {
        // allocation with "placement new" syntax
        CallbackTask *pCallbackTask = new( s_pCallbackRoot->allocate_additional_child_of( *s_pCallbackRoot ) ) CallbackTask(fCallback, pParam);
        s_pCallbackRoot->spawn(*pCallbackTask);
    }
[Diagram: Root with Task, More, and Callback nodes; Spawn]
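The contract of the Callback pattern (start immediately, never wait on one callback, optionally wait on all of them) can be sketched without TBB. The version below is only an analogue built on std::async, which starts a thread per callback rather than spawning a task into a shared pool. `CallbackRoot`, `waitForAll`, and `countCallback` are hypothetical names, and the `FunctionPointer` typedef is an assumption because the slides do not show its definition.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Assumed signature; the slides use FunctionPointer without showing its typedef.
typedef void (*FunctionPointer)(void *);

// Hypothetical stand-in for s_pCallbackRoot: launches callbacks and
// only ever waits on them in aggregate.
class CallbackRoot {
public:
    // Starts fCallback(pParam) on another thread and returns immediately.
    void doCallback(FunctionPointer fCallback, void *pParam) {
        assert(fCallback != nullptr);
        m_pending.push_back(std::async(std::launch::async, fCallback, pParam));
    }

    // Blocks until every callback launched so far has finished.
    void waitForAll() {
        for (std::future<void> &pending : m_pending) {
            pending.wait();
        }
        m_pending.clear();
    }

private:
    // Not thread-safe: this sketch assumes one thread owns the root.
    std::vector<std::future<void>> m_pending;
};

// Example callback: bumps the atomic counter passed through pParam.
void countCallback(void *pParam) {
    static_cast<std::atomic<int> *>(pParam)->fetch_add(1);
}
```

A caller would issue several `doCallback(countCallback, &counter)` calls and later call `waitForAll()` once, for example at the end of a frame.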

20 Callbacks are simple and powerful, but have limits
No waiting! Callbacks are run on demand.
No waiting? Callback has to report its own completion.
No waiting?! Need special case code to run on 1-core system.
If this is a deal-breaker, there are other options...

21 Promises – come back for it later
Promises are an evolution of Callbacks.
Like Callbacks:
• Promises are function pointers executed on another thread.
• Execution begins immediately.
Unlike Callbacks:
• Promises provide a method for efficient waiting.
    Promise *doPromise(FunctionPointer fFunc, void *pParam);
[Diagram: Root with Task, More, Promise, and Callback nodes; Spawn, Wait]

22 Code and tree: Promise setup
    void doPromise(FunctionPointer fCallback, void *pParam, Promise *pPromise) {
        // allocation with "placement new" syntax
        tbb::task *pParentTask = new( tbb::task::allocate_root() ) tbb::empty_task();
        pPromise->setRoot(pParentTask);
        PromiseTask *pPromiseTask = new( pParentTask->allocate_child() ) PromiseTask(fCallback, pParam, pPromise);
        pParentTask->set_ref_count(2);
        pParentTask->spawn(*pPromiseTask);
    }

23 Code and tree: Promise execution
    void Promise::waitUntilDone() {
        if(m_pRoot != NULL) {
            // the lock needs a name; an unnamed scoped_lock would not hold m_tMutex for this scope
            tbb::spin_mutex::scoped_lock tLock(m_tMutex);
            if(m_pRoot != NULL) {
                m_pRoot->wait_for_all();
                m_pRoot->destroy(*m_pRoot);
                m_pRoot = NULL;
            }
        }
    }

24 Promises seem almost too good to be true
Blocking wait only if result is not available when requested. If wait blocks, the current thread actively contributes to completion. 2 files, 3 classes, ~150 lines of code.
Robust Promise systems can also:
• Cancel jobs in progress
• Get partial progress updates
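The idea that a waiting thread should contribute to completion can be shown in a few lines of standard C++. This is a simplified sketch, not the TBB implementation from the slides: it uses std::call_once so that whichever thread arrives first (the worker or the waiter) runs the job, and any later arrival blocks only until that run finishes. Unlike a task scheduler, a blocked waiter here does no other useful work. The `Promise` layout, `storeAnswer`, and the `FunctionPointer` typedef are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Assumed signature; the slides use FunctionPointer without showing its typedef.
typedef void (*FunctionPointer)(void *);

// Hypothetical Promise: the job starts on a worker thread right away, and
// waitUntilDone() runs the job itself if the worker has not started it yet.
class Promise {
public:
    Promise(FunctionPointer fFunc, void *pParam)
        : m_fFunc(fFunc), m_pParam(pParam), m_worker([this] { run(); }) {}

    ~Promise() { m_worker.join(); }

    // Returns only after the job has completed, on whichever thread ran it.
    void waitUntilDone() { run(); }

private:
    // std::call_once guarantees the job executes exactly once; a second
    // caller blocks until the first caller's execution has finished.
    void run() {
        assert(m_fFunc != nullptr);
        std::call_once(m_once, m_fFunc, m_pParam);
    }

    FunctionPointer m_fFunc;
    void *m_pParam;
    std::once_flag m_once;
    std::thread m_worker;  // declared last so the members above are ready first
};

// Example job: writes a result through pParam.
void storeAnswer(void *pParam) {
    *static_cast<int *>(pParam) = 42;
}
```

After `waitUntilDone()` returns, the result written by the job is safe to read, because std::call_once synchronizes the completed call with every caller that returns from it.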

25 Synchronized Call – wait until all threads call it exactly once
Synchronized Calls can be useful for:
• Initialization of thread-specific data
• Coordination with some middleware
• Instrumentation and profiling
Trivial if you have direct access to threads, but trickier with a task-based system.
    void doSynchronizedCallback( FunctionPointer fFunc, void *pParam);
[Diagram: Root with Task nodes; Spawn + Wait; Callback; Test + Wait]

26 Code and tree: Synchronized Call setup
    void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads) {
        tbb::atomic<int> tAtomicCount;
        tAtomicCount = iThreads;
        tbb::task *pRootTask = new(tbb::task::allocate_root()) tbb::empty_task;
        tbb::task_list tList;
        for(int i = 0; i < iThreads; i++) {
            tbb::task *pSynchronizeTask = new( pRootTask->allocate_child() ) SynchronizeTask(fCallback, pParam, &tAtomicCount);
            tList.push_back(*pSynchronizeTask);
        }
        pRootTask->set_ref_count(iThreads + 1);
        pRootTask->spawn_and_wait_for_all(tList);
        pRootTask->destroy(*pRootTask);
    }

27 Code and tree: Synchronized Call execution
    tbb::task *SynchronizeTask::execute() {
        m_fCallback(m_pParam);
        m_pAtomicCount->fetch_and_decrement();
        while(*m_pAtomicCount > 0) {
            // yield while waiting
            tbb::this_tbb_thread::yield();
        }
        return NULL;
    }
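The rendezvous in execute() is the part that makes each thread run the callback exactly once: a thread that has finished its call is held until the shared count reaches zero, so it cannot pick up a second copy. The sketch below reproduces that counting logic with std::thread and std::atomic. It is the "direct access to threads" case that the talk calls trivial, shown only to make the counter behavior concrete. `ThreadLog` and `recordThread` are hypothetical helpers, and the `FunctionPointer` typedef is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Assumed signature; the slides use FunctionPointer without showing its typedef.
typedef void (*FunctionPointer)(void *);

// Runs fCallback once on each of iThreads threads and returns when all are done.
void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads) {
    assert(fCallback != nullptr && iThreads > 0);
    std::atomic<int> remaining(iThreads);
    std::vector<std::thread> threads;
    for (int i = 0; i < iThreads; ++i) {
        threads.emplace_back([&remaining, fCallback, pParam] {
            fCallback(pParam);
            remaining.fetch_sub(1);
            // Hold this thread until every other thread has made its call too.
            while (remaining.load() > 0) {
                std::this_thread::yield();
            }
        });
    }
    for (std::thread &thread : threads) {
        thread.join();
    }
}

// Hypothetical example state: the set of threads that ran the callback.
struct ThreadLog {
    std::mutex mutex;
    std::set<std::thread::id> ids;
};

// Example callback: records the id of the calling thread.
void recordThread(void *pParam) {
    ThreadLog *pLog = static_cast<ThreadLog *>(pParam);
    std::lock_guard<std::mutex> lock(pLog->mutex);
    pLog->ids.insert(std::this_thread::get_id());
}
```

In the task-based version the same spin keeps pool threads occupied, which is why the pattern is costly when other work is queued.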

28 Synchronized Calls are useful, but not efficient
Don't make Synchronized Calls in the middle of other work. Performance penalty is negated if work queue is empty.

29 Long, low priority operation – hide some time-slicing
Many games have long operations that run in parallel to the main computation:
• Asset loading/decompression
• Sound mixing
• Texture tweaking
• AI pathfinding
It's not necessary to create a new thread to handle these operations! Use the time-honored technique of time-slicing.
[Diagram: Root, Parent, More, and Low-Priority Task nodes; Set, Test + Clear, Spawn]

30 Code and tree: Long, low priority operation
    tbb::task *BaseTask::execute() {
        if(s_tLowPriorityTaskFlag.compare_and_swap(false, true) == true) {
            // allocation with "placement new" syntax
            tbb::task *pLowPriorityTask = new( this->allocate_additional_child_of( *s_pLowPriorityRoot ) ) LowPriorityTask();
            spawn(*pLowPriorityTask);
        }
        // spawn other children...
    }

31 Long, low priority operations are tricky to get right
Task-based schedulers won’t swap out a task that runs a long time. A low-priority task can’t reschedule itself naively, or it will create an infinite loop. Even if a scheduler is designed with priority in mind, priority only matters when a thread runs dry. This approach doesn’t guarantee any minimum frequency of execution.
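The test-and-clear step and the slicing itself can be separated from TBB and sketched in standard C++. The assumptions here are that the long operation can be advanced a bounded number of steps per call, and that something else (for example, code that runs once per frame) sets the request flag. `SlicedOperation`, `g_lowPrioritySliceRequested`, and `maybeRunLowPrioritySlice` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>

// Hypothetical long operation that can be advanced a bounded amount at a time.
class SlicedOperation {
public:
    SlicedOperation(int totalSteps, int stepsPerSlice)
        : m_done(0), m_total(totalSteps), m_perSlice(stepsPerSlice) {
        assert(totalSteps >= 0 && stepsPerSlice > 0);
    }

    // Performs one slice of work, then returns so the thread is free again.
    void runSlice() { m_done = std::min(m_done + m_perSlice, m_total); }

    bool finished() const { return m_done == m_total; }
    int stepsDone() const { return m_done; }

private:
    int m_done;
    int m_total;
    int m_perSlice;
};

// Set by some other part of the program to request one slice (hypothetical).
std::atomic<bool> g_lowPrioritySliceRequested(false);

// Called from ordinary tasks. The compare-exchange is the "test + clear" step:
// only the caller that flips the flag from true to false runs the slice.
bool maybeRunLowPrioritySlice(SlicedOperation &operation) {
    bool expected = true;
    if (!g_lowPrioritySliceRequested.compare_exchange_strong(expected, false)) {
        return false;  // no request pending, or another caller claimed it
    }
    operation.runSlice();
    return true;
}
```

Because a slice runs only when some task happens to check the flag, this sketch shares the limitation noted above: nothing guarantees how often a slice runs. It also assumes the flag is not set again while a slice is still in progress, since `SlicedOperation` itself is not thread-safe.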

32 Directed Acyclic Graph – everyone's favorite paradigm
Directed Acyclic Graphs are popular for executing workflows and kernels in games. Interface varies, but generally construct a DAG and then execute and wait. How can work trees represent a DAG?

33 Tree: Directed Acyclic Graph
[Diagram: several work trees, each with Root and More nodes; Spawn]
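The construct, execute, and wait interface can be illustrated with a dependency-counting sketch in standard C++. This is not the work-tree encoding from the talk's sample code. It is a single-threaded stand-in in which each node counts its unfinished predecessors and becomes ready when that count reaches zero. `Dag`, `DagNode`, `addNode`, `addEdge`, and `execute` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical DAG node: the work to run plus dependency bookkeeping.
struct DagNode {
    std::function<void()> work;
    std::vector<std::size_t> successors;  // nodes that depend on this one
    int pendingPredecessors;              // dependencies that have not run yet
};

class Dag {
public:
    std::size_t addNode(std::function<void()> work) {
        DagNode node;
        node.work = std::move(work);
        node.pendingPredecessors = 0;
        m_nodes.push_back(std::move(node));
        return m_nodes.size() - 1;
    }

    // Declares that node `to` must run after node `from`.
    void addEdge(std::size_t from, std::size_t to) {
        assert(from < m_nodes.size() && to < m_nodes.size());
        m_nodes[from].successors.push_back(to);
        ++m_nodes[to].pendingPredecessors;
    }

    // Runs every node after all of its predecessors. The dependency counts
    // are consumed, so a Dag built this way can only be executed once.
    void execute() {
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < m_nodes.size(); ++i) {
            if (m_nodes[i].pendingPredecessors == 0) ready.push(i);
        }
        while (!ready.empty()) {
            const std::size_t current = ready.front();
            ready.pop();
            m_nodes[current].work();
            for (std::size_t next : m_nodes[current].successors) {
                if (--m_nodes[next].pendingPredecessors == 0) ready.push(next);
            }
        }
    }

private:
    std::vector<DagNode> m_nodes;
};
```

In a task-based version, each push onto the ready queue would instead spawn a task, and the predecessor count would need to be atomic because several predecessors can finish at the same time.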

34 Directed Acyclic Graph gets the job done
The DAGs created by this approach are destroyed by waiting on them. Persistent DAGs are possible, for re-use across several frames. A scheduler could be DAG-based to begin with, making this trivial. Remember, get the code from:

35 Soon, rendering may also be decomposable into tasks
DirectX* 11 is designed for use on multi-core CPUs. Multiple threads can draw to local DirectX contexts ("devices"), and those draw calls are aggregated once per frame. All those draw calls can be done as tasks! All the threads can be initialized with a DirectX context using Synchronized Callbacks! This is an extremely positive development; Intel will produce lots of samples to help promote it to the industry.
*Other names and brands may be claimed as the property of others.

36 Our sample architecture can be handled by tasks top-to-bottom
• Particles partitioning handled by parallel_for().
• One-off jobs using Callbacks or Promises.
• Physics uses job threads via Synchronized Calls.
• Sound mixing is time-sliced as Low-Priority job.
• Bones/skinning DAG uses the job threads, too.
[Diagram: Particles, Jobs, Bones, Physics, Sound; √ √]

37 TBB has other helpful features we didn't cover
Beyond the high-level and low-level threading APIs, TBB has:
• Atomic variables
• Scalable memory allocators
• Efficient thread-safe containers (vector, hash, etc.)
• High-precision time intervals
• Core count detection
• Tunable thread pool sizes
• Hardware thread abstraction

38 Using task parallelism will ensure continued game performance
Task-based parallelism scales performance on varying architectures. Break loops into tasks for the maximum performance benefit. Use tasks to implement a game's preferred threading paradigms.

39 Want more? Here's more

40 Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2009 Intel Corporation.