Transactional Coherence and Consistency. Presenter: Muhammad Mohsin Butt (g201103010). COE-502 Paper Presentation 2.

Outline 1. Introduction 2. Current Hardware 3. TCC in Hardware 4. TCC in Software 5. Performance Evaluation 6. Conclusion

Introduction Transactional Coherence and Consistency (TCC) provides a lock-free transactional model that simplifies both parallel hardware and software. Transactions, defined by the programmer, are the basic unit of parallel work. Memory coherence, communication, and memory consistency are handled implicitly within each transaction.

Current Hardware Shared-memory multiprocessors provide the illusion of a single shared memory to all processors. A problem is divided into parallel tasks that work on shared data held in that memory. Complex cache coherence protocols are required, and memory consistency models are also needed to ensure program correctness. Locks are used to prevent data races and serialize access to shared data, but the overhead of too many locks can degrade performance.

TCC in HARDWARE Processors execute speculative transactions in a continuous cycle. A transaction is a sequence of instructions, marked by software, that is guaranteed to execute and complete atomically. TCC provides an "all transactions, all the time" model, which simplifies both parallel hardware and software.

TCC in HARDWARE When a transaction starts, it buffers its writes locally while the transaction is executing. After the transaction completes, the hardware arbitrates system-wide for permission to commit it. After acquiring permission, the node broadcasts all of the transaction's writes as one single packet. Transmitting them as a single packet reduces the number of inter-processor messages and arbitrations. Other processors snoop on these write packets to detect dependence violations.
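To make this cycle concrete, here is a minimal software model of it. This is my own illustrative sketch, not code from the paper: the type and function names are invented, and an ordinary mutex stands in for the hardware's system-wide commit arbiter.

/* Simplified software model of the TCC commit cycle (illustrative only):
 * speculative writes are buffered locally, then arbitration is won and the
 * whole write set is made visible as one commit packet. */
#include <stddef.h>
#include <pthread.h>

#define MAX_WRITES 64

typedef struct {
    int   *addr[MAX_WRITES];   /* addresses written speculatively  */
    int    value[MAX_WRITES];  /* buffered values, not yet visible */
    size_t count;
} commit_packet_t;

/* A mutex stands in for the system-wide commit arbitration. */
static pthread_mutex_t commit_arbiter = PTHREAD_MUTEX_INITIALIZER;

/* Record a speculative store in the local write buffer. */
static void tx_store(commit_packet_t *wb, int *addr, int value)
{
    wb->addr[wb->count]  = addr;
    wb->value[wb->count] = value;
    wb->count++;
}

/* In the real system every other processor snoops the packet here; this stub
 * only marks where that single broadcast happens. */
static void broadcast_commit_packet(const commit_packet_t *pkt) { (void)pkt; }

/* Commit: win arbitration, broadcast once, then make all writes visible. */
static void tx_commit(commit_packet_t *wb)
{
    pthread_mutex_lock(&commit_arbiter);
    broadcast_commit_packet(wb);               /* one packet, one broadcast */
    for (size_t i = 0; i < wb->count; i++)
        *wb->addr[i] = wb->value[i];           /* writes become visible     */
    wb->count = 0;
    pthread_mutex_unlock(&commit_arbiter);
}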

TCC in HARDWARE (figure slide)

TCC simplifies cache design. Processors hold data in either unmodified or speculatively modified form. During snooping, a line is invalidated if the commit packet contains only its address, and updated if the packet contains both the address and the data. This also protects against data dependence violations: if a processor has read from any address in the commit packet, its current transaction is re-executed.
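As a sketch of that invalidate-or-update choice, assuming a hypothetical commit-packet entry layout of my own (the paper does not specify this exact format):

#include <stdbool.h>

/* Hypothetical commit-packet entry: address only, or address plus data. */
typedef struct {
    int  *addr;
    int   data;
    bool  has_data;
} commit_entry_t;

/* Hypothetical locally cached word. */
typedef struct {
    int  *addr;
    int   value;
    bool  valid;
} local_line_t;

/* Snoop action from the slide: address-only entries invalidate the local
 * copy, address+data entries update it in place. */
static void apply_commit_entry(local_line_t *line, const commit_entry_t *e)
{
    if (!line->valid || line->addr != e->addr)
        return;                    /* not cached locally, nothing to do */
    if (e->has_data)
        line->value = e->data;     /* update in place                   */
    else
        line->valid = false;       /* invalidate the stale copy         */
}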

TCC in HARDWARE Current CMPs need features that provide speculative buffering of memory references and commit arbitration control. A mechanism is required for gathering all modified cache lines from each transaction into a single packet: either a write buffer completely separate from the caches, or an address buffer containing the list of tags for lines holding data to be committed.

TCC in HARDWARE Read bits: set on a speculative read during a transaction. The current transaction is violated and restarted if the snoop protocol sees a commit packet containing the address of a location whose read bit is set. Modified bits: set to 1 by stores during a transaction. On a violation, lines whose modified bit is set to 1 are invalidated.
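A small software model of these two bits follows; it is my own sketch of the behavior described on the slide, not the paper's hardware, and the structure and function names are assumptions.

#include <stdbool.h>
#include <stddef.h>

/* Per-line speculation state as described on the slide. */
typedef struct {
    unsigned long tag;
    bool read_bit;      /* set by a speculative load in this transaction  */
    bool modified_bit;  /* set by a speculative store in this transaction */
    bool valid;
} cache_line_t;

/* Snooping a commit packet: if any committed address matches a line whose
 * read bit is set, the current transaction must be violated and restarted. */
static bool snoop_commit_packet(const cache_line_t *lines, size_t nlines,
                                const unsigned long *commit_tags, size_t ntags)
{
    for (size_t i = 0; i < ntags; i++)
        for (size_t j = 0; j < nlines; j++)
            if (lines[j].valid && lines[j].read_bit &&
                lines[j].tag == commit_tags[i])
                return true;    /* dependence violation detected */
    return false;
}

/* On a violation, discard speculative state: invalidate modified lines and
 * clear both bits before the transaction re-executes. */
static void handle_violation(cache_line_t *lines, size_t nlines)
{
    for (size_t j = 0; j < nlines; j++) {
        if (lines[j].modified_bit)
            lines[j].valid = false;
        lines[j].read_bit = false;
        lines[j].modified_bit = false;
    }
}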

TCC in Software Programming with TCC is a three-step process: (1) divide the program into transactions; (2) specify the transaction order, which can be relaxed when ordering is not required; (3) tune performance, guided by TCC's feedback on where in the program violations occur frequently.

Loop-Based Parallelization Consider a histogram calculation over 1000 integer percentages:

/* input */
int *data = load_data();
int i, buckets[101];
for (i = 0; i < 1000; i++) {
    buckets[data[i]]++;
}
/* output */
print_buckets(buckets);

Loop-Based Parallelization The loop can be parallelized by writing t_for (i = 0; i < 1000; i++). Each loop body becomes a separate transaction. When two parallel iterations try to update the same histogram bucket, the TCC hardware causes the later transaction to violate, forcing it to re-execute. A conventional shared-memory model would require locks to protect the histogram bins. The loop can be further optimized using t_for_unordered().
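For contrast, a conventional lock-based version of the same loop might look like the sketch below; the pthread code and the per-bucket locking choice are mine, not the paper's, with the TCC version repeated in a comment for comparison.

#include <pthread.h>

int buckets[101];
pthread_mutex_t bucket_lock[101];   /* one lock per histogram bin */

void init_bucket_locks(void)
{
    for (int b = 0; b < 101; b++)
        pthread_mutex_init(&bucket_lock[b], NULL);
}

/* Conventional shared-memory version: every update to a shared bucket must be
 * protected by a lock (the iterations would be split across threads). */
void histogram_locked(const int *data, int n)
{
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&bucket_lock[data[i]]);
        buckets[data[i]]++;
        pthread_mutex_unlock(&bucket_lock[data[i]]);
    }
}

/* TCC version from the slide: no locks; a conflicting later transaction is
 * simply violated and re-executed.
 *
 *   t_for (i = 0; i < 1000; i++) {
 *       buckets[data[i]]++;
 *   }
 */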

Fork-Based Parallelization t_fork() forces the parent transaction to commit and creates two completely new transactions: one continues executing the remaining code, the other starts executing the function passed as a parameter. For example:

/* Initial setup */
int PC = INITIAL_PC;
int opcode = i_fetch(PC);
while (opcode != END_CODE) {
    t_fork(execute, &opcode, 1, 1, 1);
    increment_PC(opcode, &PC);
    opcode = i_fetch(PC);
}

Explicit Transaction Commit Ordering Partial ordering is provided by assigning two parameters to each transaction: a sequence number and a phase number. Transactions with the same sequence number commit in an order defined by the programmer; transactions with different sequence numbers are independent. Ordering among transactions with the same sequence number is achieved through the phase number: the transaction with the lowest phase number commits first.
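As a sketch of that rule (my own model of the ordering relation, not code from the paper):

#include <stdbool.h>

/* Ordering parameters assigned to each transaction, per the slide. */
typedef struct {
    int sequence;   /* which ordered group the transaction belongs to */
    int phase;      /* position within that group                     */
} tx_order_t;

/* True if transaction a is required to commit before transaction b. */
static bool must_commit_before(tx_order_t a, tx_order_t b)
{
    if (a.sequence != b.sequence)
        return false;            /* different sequences: independent         */
    return a.phase < b.phase;    /* same sequence: lower phase commits first */
}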

Performance Evaluation

Maximize parallelism: create as many transactions as possible. Minimize violations: keep transactions small to reduce the amount of work lost on a violation. Minimize transaction overhead: do not make transactions too small. Avoid buffer overflow: overflow can result in excessive serialization.

Performance Evaluation Base case: simple parallelization without any optimization. Unordered: finding loops whose iterations can be left unordered. Reduction: finding regions that can exploit reduction operations. Privatization: privatizing, in each transaction, the variables that cause violations. Using t_commit(): breaking large transactions into smaller ones that still execute on the same processor, which reduces the work lost to violations and prevents buffer overflow. Loop adjustments: using the various loop-adjustment optimizations provided by the compiler.
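As an illustration of the privatization and t_commit() ideas applied to the earlier histogram example, here is a sketch of my own: the chunking scheme, the CHUNK size, and the function name are assumptions, and only t_for and t_commit come from the paper's API.

/* Each chunk of iterations accumulates into a private copy of the buckets and
 * merges into the shared array only at the end, so iterations no longer
 * violate each other.  Under TCC the outer loop would be a t_for, with
 * t_commit() separating the private work from the brief shared merge. */
void histogram_privatized(const int *data, int n, int *shared_buckets)
{
    enum { CHUNK = 100 };

    for (int start = 0; start < n; start += CHUNK) {   /* t_for under TCC  */
        int private_buckets[101] = {0};                /* privatized copy  */
        int end = (start + CHUNK < n) ? start + CHUNK : n;

        for (int i = start; i < end; i++)
            private_buckets[data[i]]++;                /* no shared writes */

        /* t_commit() would go here: commit the private work and start a new
         * transaction for the merge, keeping the write set small and
         * avoiding buffer overflow. */
        for (int b = 0; b < 101; b++)
            shared_buckets[b] += private_buckets[b];
    }
}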

Performance Evaluation Privatization and t_commit() improve performance. Inner loops had too many violations; applying the loop adjustment to parallelize the outer loop instead improved the results.

Performance Evaluation CMP performance is close to that of ideal TCC for small numbers of processors.

Conclusions Bandwidth limitation is still a problem for scaling TCC to more processors. There is no support for nested for loops. Dynamic optimization techniques are still required to automate performance tuning on TCC.