Extending Open64 with Transactional Memory Features
Jiaqi Zhang, Tsinghua University

Contents
– Background
– Design
– Implementation
– Optimization
– Experiment
– Conclusion

Transactional Memory Background
Trend toward concurrent programming
Current solution:
– Locks
– Flaws:
    Association between locks and data
    Deadlock
    Not composable

Transactional Memory Background
Lock-based Account class:
    class Account {
        int balance;
        lock mylock;
        bool credit(int amount);
        bool debit(int amount);
    };
    bool credit(int amount) {
        acquire(mylock);
        balance += amount;
        release(mylock);
    }
    bool debit(int amount) {
        acquire(mylock);
        balance -= amount;
        release(mylock);
    }
Implementing transfer(Account a, Account b, int amount) { }:
– Calling the methods back to back leaves an inconsistent state between the two calls:
    a.credit(amount);
    b.debit(amount);
– Taking both locks in the caller:
    acquire(a.mylock); acquire(b.mylock);
    a.credit(amount); b.debit(amount);
    release(a.mylock); release(b.mylock);
  Problems: poor abstraction of class Account, deadlock, exposed implementation details
– With an atomic block:
    atomic {
        a.credit(amount);
        b.debit(amount);
    }
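
The lock-based transfer problem can be made concrete with POSIX threads. The following is a minimal sketch, not code from the slides: the names account_t, account_init and transfer are illustrative, and acquiring the two locks in address order is one common convention for avoiding deadlock. It shows how much locking detail the caller has to know once two accounts must change together.

```c
#include <pthread.h>
#include <stdint.h>

/* Illustrative account type; the names here are assumptions, not from the slides. */
typedef struct {
    int balance;
    pthread_mutex_t lock;
} account_t;

void account_init(account_t *acc, int balance) {
    acc->balance = balance;
    pthread_mutex_init(&acc->lock, NULL);
}

/* Move `amount` from `from` to `to` with no visible intermediate state.
 * Both locks are held for the whole update and are always taken in
 * address order, so two opposite transfers cannot deadlock. */
void transfer(account_t *from, account_t *to, int amount) {
    if (from == to) {
        return; /* nothing to do, and locking the same mutex twice would hang */
    }
    account_t *first = ((uintptr_t)from < (uintptr_t)to) ? from : to;
    account_t *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```

The lock-ordering rule lives outside the account type, which is the composability problem an atomic block is meant to remove.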

Transactional Memory Background
Current Implementations
– TM libraries: DSTM, DracoSTM, TL2, TinySTM, …
    Function calls:
      TM_INIT() / TM_SHUTDOWN()
      TM_ATOMIC_BEGIN() / TM_ATOMIC_END()
      TM_SHARED_READ() / TM_SHARED_WRITE()
    Explicit transactions
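
To illustrate what "explicit transaction" means for the programmer, here is a hedged sketch of library-style instrumentation. The macro names follow the slide, but their definitions below are stand-ins (a single global lock and plain memory accesses) so that the example compiles on its own; real TM libraries define these macros differently and with their own argument lists.

```c
#include <pthread.h>

/* Stand-in definitions, assumed for illustration only. A real TM library
 * would start and commit a transaction and log each access instead. */
static pthread_mutex_t tm_global_lock = PTHREAD_MUTEX_INITIALIZER;
#define TM_ATOMIC_BEGIN()         pthread_mutex_lock(&tm_global_lock)
#define TM_ATOMIC_END()           pthread_mutex_unlock(&tm_global_lock)
#define TM_SHARED_READ(var)       (var)
#define TM_SHARED_WRITE(var, val) ((var) = (val))

static int balance_a = 100;
static int balance_b = 100;

/* Every shared access inside the transaction is wrapped by hand. */
void transfer_explicit(int amount) {
    TM_ATOMIC_BEGIN();
    TM_SHARED_WRITE(balance_a, TM_SHARED_READ(balance_a) + amount);
    TM_SHARED_WRITE(balance_b, TM_SHARED_READ(balance_b) - amount);
    TM_ATOMIC_END();
}
```

Forgetting one wrapper silently breaks atomicity, which is the motivation for letting a compiler insert the calls.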

Transactional Memory Background
Current Implementations
– Compilers: Intel C++ STM Compiler, Tanger, OpenTM, GCC

Design
Programming Interfaces
#pragma tm atomic [clause]
    structured block
  Clauses:
    readonly
    private(var list)
    shared(var list)
#pragma tm abort
#pragma tm function
    function declaration
#pragma tm waiver
    function declaration
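
A short usage sketch of these pragmas follows. The function and variable names are invented for illustration, and the exact spelling of the clause after "atomic" is assumed from the grammar on the slide. A compiler without this TM extension ignores the unknown pragmas (possibly with a warning), so the code below still builds and runs as ordinary sequential C.

```c
/* Shared statistics; in a real program several threads would update them. */
static int hits = 0;
static int total = 0;

/* Ask the compiler to generate a transactional version of record(). */
#pragma tm function
void record(int value);

void record(int value) {
    hits++;
    total += value;
}

/* Both updates inside record() become visible together. */
void add_sample(int value) {
    #pragma tm atomic
    {
        record(value);
    }
}

/* A transaction that only reads can be declared readonly. */
int read_average(void) {
    int avg = 0;
    #pragma tm atomic readonly
    {
        avg = (hits == 0) ? 0 : total / hits;
    }
    return avg;
}
```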

Design
TM runtime interfaces (TL2)
Interface                                              Description
Thread* TxNewThread()                                  Allocate a new Thread structure to keep logs
TxStart(Thread* Self, jmp_buf* buf, int flags)         Start a new transaction for the current thread
TxCommit(Thread* Self)                                 Commit the current transaction
TxLoad(Thread* Self, void* addr)                       Perform a synchronized load from the given memory address
TxStore(Thread* Self, void* addr, intptr_t val)        Perform a synchronized store to the given memory address
TxStoreLocal(Thread* Self, void* addr, intptr_t val)   Perform a locally logged store to the given memory address
TxAbort(Thread* Self)                                  Abort the current transaction and re-execute

Design
Wrapper functions
– To ease the process of integrating new TM libraries
– Called by programmers:
    tm_init() / tm_finalize()
    tm_thread_start() / tm_thread_end()
– Inserted by the compiler:
    __tm_atomic_begin() / __tm_atomic_end()
    __tm_shared_read() / __tm_shared_read_float()
    __tm_shared_write() / __tm_shared_write_float()
    __tm_local_write() / __tm_local_write_float()
– More wrapper functions are needed for other data types and for additional TM semantics
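
One way to picture the wrapper layer is as a thin set of functions that forward to whichever TM runtime is linked in. The sketch below is an assumption-laden illustration: the wrapper signatures are not given on the slides, the Thread structure and the TxLoad/TxStore/TxStoreLocal bodies are stubs that only count calls and touch memory directly, and a single descriptor replaces the per-thread one a real runtime would keep.

```c
#include <stdint.h>

/* Stub backend standing in for a TL2-style runtime (illustration only). */
typedef struct Thread {
    long loads;
    long stores;
    long local_stores;
} Thread;

static intptr_t TxLoad(Thread *Self, void *addr) {
    Self->loads++;
    return *(intptr_t *)addr;
}

static void TxStore(Thread *Self, void *addr, intptr_t val) {
    Self->stores++;
    *(intptr_t *)addr = val;
}

static void TxStoreLocal(Thread *Self, void *addr, intptr_t val) {
    Self->local_stores++;
    *(intptr_t *)addr = val;
}

/* One descriptor for the sketch; a real runtime keeps one per thread. */
static Thread tm_self;

/* Wrappers the compiler would call; signatures are hypothetical.
 * Swapping the TM library means rewriting only these bodies. */
intptr_t __tm_shared_read(void *addr) {
    return TxLoad(&tm_self, addr);
}

void __tm_shared_write(void *addr, intptr_t val) {
    TxStore(&tm_self, addr, val);
}

void __tm_local_write(void *addr, intptr_t val) {
    TxStoreLocal(&tm_self, addr, val);
}
```

Because generated code only ever names the wrappers, the compiler passes do not need to change when a different library is integrated.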

Design Optimization – Eliminate redundant calls to runtime libraries

Implementation General Transformation

Implementation
General Transformation
– #pragma tm atomic
    setjmp();
    __tm_atomic_begin();
– simple statements
    a = b+c;
      PARM #address of c
      CALL
      LDID
      STID #tm_preg_num_0
      PARM #address of b
      CALL
      LDID
      STID #tm_preg_num_1
      LDID #tm_preg_num_0
      LDID #tm_preg_num_1
      ADD
      PARM
      PARM #address of a
      CALL
– control flow statements (IF, WHILE_DO)
    for(;i<10;i++){ }
      PARM #address of i
      CALL
      LDID
      STID #tm_preg_num_0
      WHILE_DO
      LDID #tm_preg_num_0
      INTCONST 9
      LE
      BODY
      BLOCK
      …
      PARM #address of i
      CALL
      LDID
      STID #tm_preg_num_0
      END_BLOCK
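
The setjmp() call emitted before __tm_atomic_begin() is what lets an aborted transaction restart from the top of the atomic block. The toy below sketches only that control-flow mechanism using standard setjmp/longjmp; the names toy_abort and run_with_retries are hypothetical, and no logging or conflict detection is modelled.

```c
#include <setjmp.h>

/* Saved re-entry point of the (toy) transaction. */
static jmp_buf tx_env;

/* Stand-in for an abort: jump back to the saved entry point. */
static void toy_abort(void) {
    longjmp(tx_env, 1);
}

/* Runs a block that "conflicts" a given number of times before it can
 * finish, and returns how many times the block was executed. */
int run_with_retries(int conflicts_to_simulate) {
    /* volatile: these locals are modified between setjmp and longjmp. */
    volatile int attempts = 0;
    volatile int remaining = conflicts_to_simulate;

    setjmp(tx_env);      /* execution resumes here after every abort */
    attempts++;
    if (remaining > 0) {
        remaining--;
        toy_abort();     /* does not return */
    }
    return attempts;     /* reached only by the attempt that "commits" */
}
```

A run with three simulated conflicts executes the block four times: three aborted attempts and one that completes.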

Implementation
General Transformation
Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }
After transformation:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)) {
        for (TxStore(Self, &j, 0); TxLoad(Self, &j)<10; TxStore(Self, &j, TxLoad(Self, &j)+1)) {
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);

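Implementation
Functions – clone and instrument
Declared transactional by the programmer:
    #pragma tm function
    void calculate(){}
The compiler keeps the original and adds an instrumented clone:
    void calculate()
    __tm_cloned__calculate()    //instrumented
Calls inside atomic blocks are redirected to the clone:
    #pragma tm atomic
    { calculate(); }
  becomes
    #pragma tm atomic
    { __tm_cloned__calculate(); }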
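
The effect of cloning can be sketched by writing both versions out by hand. This is an illustration under assumptions: the function body, the shared variable total, and the stub TxLoad/TxStore (which only count calls) are invented here, and the clone reaches its thread descriptor through a global because the slide shows no extra parameter.

```c
#include <stdint.h>

/* Stub runtime that only counts barrier calls (illustration only). */
typedef struct Thread {
    long loads;
    long stores;
} Thread;

static Thread tm_self;

static intptr_t TxLoad(Thread *Self, intptr_t *addr) {
    Self->loads++;
    return *addr;
}

static void TxStore(Thread *Self, intptr_t *addr, intptr_t val) {
    Self->stores++;
    *addr = val;
}

static intptr_t total;

/* Original function: unchanged, used by callers outside transactions. */
void calculate(void) {
    total = total + 1;
}

/* Clone: same logic, shared accesses routed through the barriers.
 * Calls made inside an atomic block are redirected here. */
void __tm_cloned__calculate(void) {
    TxStore(&tm_self, &total, TxLoad(&tm_self, &total) + 1);
}
```

Keeping the original alongside the clone means non-transactional callers pay no instrumentation cost.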

Implementation
Optimization
Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }
Before optimization:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)) {
        for (TxStore(Self, &j, 0); TxLoad(Self, &j)<10; TxStore(Self, &j, TxLoad(Self, &j)+1)) {
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);
Transaction local variables: detected by the frontend

Implementation
Optimization
Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }
After removing barriers on the transaction local variable j:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)) {
        for (j=0; j<10; j++) {
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);
Barrier free variables: detected according to their storage class

Implementation
Optimization
Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }
After also treating i as barrier free (plain reads, locally logged writes):
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (; i<20; TxStoreLocal(Self, &i, i+1)) {
        for (j=0; j<10; j++) {
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);
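
To get a feel for how many runtime calls the two optimizations remove, the loop nest can be run against a stub runtime that only counts calls. This is a sketch under assumptions: Thread here is a plain counter structure, the TxLoad/TxStore/TxStoreLocal bodies access memory directly, transaction start and commit are omitted, and the variables are widened to intptr_t so that the stubs can store through their addresses safely.

```c
#include <stdint.h>

/* Counting stub in place of the real runtime (illustration only). */
typedef struct Thread {
    long loads;
    long stores;
    long local_stores;
} Thread;

static intptr_t TxLoad(Thread *Self, intptr_t *addr) {
    Self->loads++;
    return *addr;
}

static void TxStore(Thread *Self, intptr_t *addr, intptr_t val) {
    Self->stores++;
    *addr = val;
}

static void TxStoreLocal(Thread *Self, intptr_t *addr, intptr_t val) {
    Self->local_stores++;
    *addr = val;
}

static intptr_t result;

/* Every access to i, j and result goes through a barrier. */
void run_naive(Thread *Self) {
    intptr_t i = 0, j;
    result = 0; /* plain setup write, outside the modelled transaction */
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i) < 20;
         TxStore(Self, &i, TxLoad(Self, &i) + 1)) {
        for (TxStore(Self, &j, 0); TxLoad(Self, &j) < 10;
             TxStore(Self, &j, TxLoad(Self, &j) + 1)) {
            TxStore(Self, &result, TxLoad(Self, &result) + 1);
        }
    }
}

/* j is transaction local (plain accesses); i is barrier free
 * (plain reads, locally logged writes); only result keeps full barriers. */
void run_optimized(Thread *Self) {
    intptr_t i = 0, j;
    result = 0; /* plain setup write, outside the modelled transaction */
    j = 0;
    for (; i < 20; TxStoreLocal(Self, &i, i + 1)) {
        for (j = 0; j < 10; j++) {
            TxStore(Self, &result, TxLoad(Self, &result) + 1);
        }
    }
}
```

With these stubs the naive version issues 661 loads and 442 stores, while the optimized version issues 200 loads, 200 stores and 20 locally logged stores for the same final value of result.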

Implementation
Optimization
– Strategy for detecting optimization opportunities
    Pthread parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: auto variables
    Cloned transactional function
      – transaction local: declared in the function
    OpenMP parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: declared in the micro task, or marked in an OpenMP private clause
    Checking readonly transactions
– Limitations
    Conservative design for pointers
    Needs programmers to participate in optimization
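
The detection rules for a Pthread or OpenMP task can be read as a small decision procedure. The sketch below is hypothetical: var_info_t, its fields and classify_variable are invented for illustration and do not reflect how Open64 represents symbols internally; the rules for cloned functions and readonly transactions are left out.

```c
/* How accesses to a variable inside a transaction are instrumented. */
typedef enum {
    VAR_TRANSACTION_LOCAL, /* plain loads and stores, no runtime calls      */
    VAR_BARRIER_FREE,      /* plain loads, locally logged stores            */
    VAR_SHARED             /* synchronized loads and stores (full barriers) */
} var_class_t;

/* Hypothetical summary of what the compiler knows about a variable. */
typedef struct {
    int declared_in_atomic_scope; /* declared inside the tm atomic block    */
    int is_auto;                  /* automatic storage in the enclosing task */
    int in_omp_private_clause;    /* listed in an OpenMP private clause     */
} var_info_t;

var_class_t classify_variable(const var_info_t *v) {
    if (v->declared_in_atomic_scope) {
        /* Dies with the transaction, so an abort needs no undo. */
        return VAR_TRANSACTION_LOCAL;
    }
    if (v->is_auto || v->in_omp_private_clause) {
        /* Private to the thread but outlives the transaction,
         * so writes are logged for rollback. */
        return VAR_BARRIER_FREE;
    }
    return VAR_SHARED;
}
```

In the loop example used on the optimization slides, j falls in the first class, i in the second, and result in the third.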

Preliminary Experiments
Comparison with a fine-grained lock-based application

Preliminary Experiments
Comparison with a manually instrumented application

Preliminary Experiments
    #pragma tm atomic
    {
        int j;
        (*new_centers_len[index])++;
        for(j=0;j<nfeatures;j++){
            new_centers[index][j]+=feature[i][j];
        }
    }
Clause added by the programmer: private(feature)

Conclusion & Future work
An infrastructure for TM on Open64
– Replaceable TM implementation
– Optimization
More experiments on non-trivial applications are desired
– Nested transactions
    FastDB: 8 out of 75 critical regions contain nested transactions
– Signal processing
    FastDB: 28 out of 75 critical regions contain signal processing
    PARSEC: 20 out of 55 critical regions contain signal processing
– Event handlers
– Indirect calls
– Dealing with legacy code
– …

Thanks
