
1 QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory
Vladimir Gajinov 1,2, Ferad Zyulkyarov 1,2, Osman S. Unsal 1, Adrián Cristal 1, Eduard Ayguadé 1,2, Tim Harris 3, Mateo Valero 1,2
1 Barcelona Supercomputing Center, 2 Universitat Politècnica de Catalunya, 3 Microsoft Research

2 Outline
– Introduction & motivation
– Quake description
– Parallelization
– Results
– Conclusion

3 Introduction
Topic of this work: parallelization of the Quake server.
What is Quake?
– A first-person shooter game.
– A sequential application.
– Close-to-instantaneous control of player actions.
– A high degree of interaction among players in a detailed 3D virtual world.
These requirements make CPU processing the bottleneck of a sequential game server.
Method: OpenMP + Transactional Memory.

4 Background
OpenMP:
– An API for shared-memory parallel programming in C/C++ and Fortran.
– Compiler directives and library routines.
– Fork-join parallelism.
Transactional Memory (TM):
– A concurrency control mechanism.
– A series of reads and writes to shared memory is executed atomically.
– A successful transaction commits; otherwise it aborts.

5 Motivation
Just a few TM applications are available:
– STAMP, the Haskell STM benchmark, RMS-TM …
– There is a clear need for more complex applications.
Contribution: parallelization of a complex sequential application using TM.
Question: is it possible to achieve fine-grained-locking performance with a coarse-grained parallelization effort?
Testing TM programmability:
– Start with a coarse-grained approach.
– Test the performance.
– Determine the problems.
– Compare with a fine-grained approach.

6 Outline
– Introduction & motivation
– Quake description
– Parallelization
– Results
– Conclusion

7 Quake Organization
A typical client–server architecture.
Server:
– Maintains the consistency of the game world.
– Handles the coordination among clients.
Clients:
– Update graphics.
– Implement user-interface operations.

8 The Server
The main server task is computing a new frame.
[Frame execution diagram: a SELECT loop over Rx/Tx — Read requests, Process them, run the Physics Update, and Reply.]
Execution breakdown of the sequential server with 8 connected clients: the stage percentages shown are 2.1%, 87.8% and 3.1%, with request processing accounting for the dominant 87.8% of the frame time.
We concentrate on the request processing stage.

9 Quake Map
The map is a 3D volume in a 3D coordinate space, represented as a binary space partition tree — fine-grained and inefficient for locating game objects.
For that, the server uses the areanode tree (shown in the figure as a top view, levels 1–5):
– A balanced binary tree whose internal nodes split space with division planes.
– Each 3D point in the map must lie either in a leaf areanode or in a division plane.
– Areanodes maintain a list of game objects (entities).

10 Outline
– Introduction & motivation
– Quake description
– Parallelization
– Results
– Conclusion

11 Parallelization
Only the request processing stage is parallelized:
– OpenMP to start parallel execution.
– Transactions for synchronization.
– A coarse-grained approach.
Comparison with the fine-grained implementation of Atomic Quake [PPoPP 2009].
Application characteristics:
– Coarse-grained: 8 TM blocks, big read & write sets, long transactions, 35.3% abort rate.
– Fine-grained: 58 TM blocks, 4.1% abort rate.

12 Shared Data
Three types of shared data structures:
– The areanode tree.
– Game objects.
– Message buffers: a common global state buffer and per-player reply buffers.
The most intensive sharing happens inside the request processing stage.

13 Client Requests
Two types of requests:
Connection-related messages:
– Associated with the connection and disconnection protocols, used when a client wants to join or leave the server game session, and with other facilities that do not affect gameplay.
Gameplay messages:
– The most important type of request.
– Model the player's interaction with the game world.
– The most frequently used: the MOVE command.

14 Pseudocode for the request processing stage

Sequential:

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        // gameplay packets
        for (i = 0; i < MAX_CLIENTS; i++) {
            // Do some checking here
            SV_ExecuteClientMessage ();
        }
    }

Parallel:

    while (NET_GetPacket ()) {
        // Filter packets
        if (connection related packet) {
            SV_ConnectionlessPacket ();
            continue;
        }
        AddPacketToList ();
        CopyBuffer ();
    }
    #pragma intel omp parallel taskq shared(packetlist, ...)
    {
        while (packetlist != NULL) {
            #pragma intel omp task captureprivate(packetlist)
            {
                NET_Message_Init (..);
                // check for packets from connected clients
                for (i = 0, cl = svs.clients; i < MAX_CLIENTS; i++, cl++) {
                    // Do some checking here
                    SV_ExecuteClientMessage (cl);
                }
            }
            packetlist = packetlist->next;
        }
    }

15 The Move Command
Parameters: the player's origin, view angles, motion indicators, time to run.
Execution:
– Construct the bounding box.
– Traverse the areanode tree.
– Find the objects contained in the bounding box and associate them with the command.
– Simulate the move.
– Remove the player from the old position and add him to the new position.

16 Move Command Execution
The move command executes as a pipeline of stages grouped into four transactions (T1–T4):
– ClientPhysics: the client's physics update.
– ClientThink: execute actions registered in previous frames.
– PmoveInit: pmove (player move) structure initialization.
– AddLinksToPmove: determines which entities could be affected by the current move command.
– PlayerMove: constructs a trajectory line and determines the client's final position.
– LinkEntity: re-links the player's entity to the new position in the areanode tree.
– PlayerTouch: models the influence on the other game objects.

17 ReachPoints

    int reachpoints[NumThreads][x*16];

    TM_PURE void PointReached(int check) {
        reachpoints[ThreadId][check]++;
    }

    int main () {
        ...
        TRANSACTION
            PointReached (1);
            statement_1;
            PointReached (2);
        TRANSACTION_END
        ...
    }

Helps to:
– Identify thread-private variables.
– Discover where transactions abort and the causes of the aborts.
– Discover TM false-sharing conflicts (conflict-management granularity).

18 Outline
– Introduction & motivation
– Quake description
– Parallelization
– Results
– Conclusion

19 Evaluation
TraceBot:
– An automatic trace client.
– Its behavior is controlled by a finite state machine.
VideoClient:
– A normal graphical client, used for proving correctness and for trace creation.
The server runs on one machine, the clients on another:
– Server: 8 cores (4 x dual-core 64-bit Intel Xeon).
Frame execution time is the performance measure.
Prototype version 3.0 of the Intel STM C/C++ compiler:
– In-place updates.
– Cache-line-granularity conflict detection.
– Transactions validate the read set at commit time and, if necessary, during read operations.
– Function annotations: tm_callable, tm_pure and tm_unknown.
– Closed nesting (flattening).

20 Results: normalized average frame execution times (coarse-grained)
The baseline is always the average frame execution time of the sequential server for the respective number of clients.
The TM version's overhead is 3.5x–6x: more than 85% of the time is spent in critical sections. The overhead is too high.

21 Results: performance of coarse-grained configurations
– Comparative performance of the parallel configurations.
– Transactional server running with 16 clients (speedup & scalability).

22 Transactional statistics: coarse-grained (TM server running with 8 threads)

    Clients  Transactions   Aborts  Abort rate [%]          Mean [KB]  Max [KB]  Total [MB]
          1        34,754        0             0.0   Reads        3.0       104         105
                                                     Writes       0.6        17          20
          2        95,980    1,970             2.1   Reads        2.8       863         263
                                                     Writes       0.6       164          55
          4       179,241   10,820             6.0   Reads        3.4     1,413         570
                                                     Writes       0.6       269         108
          8       364,305   76,560            21.0   Reads        4.2     1,478       1,207
                                                     Writes       0.8       251         216
         16       524,561  184,992            35.3   Reads        5.1     1,704       1,725
                                                     Writes       0.9       262         296

The abort rate is significant.

23 The Overhead Breakdown (multithreaded execution: 8 threads, 16 clients)

    TM block   Total            Instrumentation time      Abort overhead        Abort rate [%]
               [10^9 cycles]    [10^9 cycles]     [%]     [10^9 cycles]   [%]
    1               13.5             10.3        75.8           3.3      24.2        19.5
    2                9.5              9.0        94.1           0.6       5.9        18.0
    3               17.2             15.1        87.9           2.1      12.1        52.7
    4               11.6             10.9        94.3           0.7       5.7        22.4
    5                5.9              3.2        53.7           2.8      46.3        61.1
    overall         57.9             48.5        83.8           9.4      16.2        35.2

We have limited possibility for profiling; it seems that the TM instrumentation overhead is the more important factor.

24 Results: normalized average frame execution times (fine-grained)
The TM version's overhead is 2.4x–3x.

25 Results: performance of fine-grained configurations
– Comparative performance of the parallel configurations.
– Transactional server running with 16 clients (speedup & scalability).

26 Transactional statistics: fine-grained (TM server running with 8 threads)

    Clients  Transactions   Aborts  Abort rate [%]          Mean [B]   Max [B]  Total [MB]
          1       190,206        0             0.0   Reads      65.1    58,511          12
                                                     Writes      5.2    20,102           1
          2       367,118      826             0.2   Reads      66.0    62,728          25
                                                     Writes      5.7    24,397           2
          4       655,020    4,165             0.6   Reads      83.7    80,275          55
                                                     Writes      8.2    39,726           5
          8     1,439,874   20,593             1.4   Reads     102.5   102,470         145
                                                     Writes      9.6    57,552          14
         16     3,226,759  131,814             4.1   Reads     133.3   231,593         192
                                                     Writes     15.5   211,651          22

27 Outline
– Introduction & motivation
– Quake description
– Parallelization
– Results
– Conclusion

28 QuakeTM Characteristics
– 27,600 lines of code in 49 files.
– Configurable with macros: synchronization, granularity, nesting, TM implementation.
– Coarse-grained setup: 8 critical regions (TM or a global lock).
– Fine-grained setup: 58 critical regions (TM or fine-grained locks).
– Available at www.bscmsrc.eu.

29 Conclusion
The transactional overhead is excessive:
– A 6x slowdown.
– A 35.3% abort rate.
A coarse-grained approach is not a good option for current STM systems.
Significant programmer time investment (10 man-months).
A fine-grained approach may be the only solution.

30 Questions? Thank you!
Download QuakeTM: www.bscmsrc.eu

31 Intel Compiler
– Single-lock atomicity semantics and weak atomicity guarantees — as opposed to strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions.

32 Atomic Quake
The main objective was to evaluate the effort of replacing locks with transactions.
– The lock parallelization is not block-structured, which required code reorganization to adapt to the TM model.
– The second problem was avoiding I/O operations inside transactions, which is not an issue in a lock-based system.
– Finally, a big fraction of the development time was spent understanding how locks are associated with the variables and getting a grip on the locking strategy.

33 Atomic Quake 2
– Thread-private data: calls to get_specific.
– Condition variables: no retry construct.
– I/O in transactions: tm_pure.
– Proposition for error handling: when an error happens, commit the transaction and handle the error outside the atomic block.
– Privatization example: a custom memory manager allocates a block of memory for string operations.
– TM fits well for guarding access to different shared data (separate locks).

