Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Hikmet Aras 2006720612
Agenda – Introduction – Performance Impacts of Shared Resources on CMTs – Implementation – Related Work – Conclusion
Part I - Introduction
CMP and MT CMP: Chip Multiprocessing – multiple processor cores on a single chip, allowing more than one thread to be active at a time and improving utilization of the chip. MT: Hardware Multithreading – the processor has multiple sets of registers and interleaves the execution of threads, either by switching between them on each cycle or by executing multiple threads simultaneously (using different functional units).
Examples CMPs – IBM’s POWER4 – Sun’s UltraSPARC IV MTs – Intel’s hyper-threaded Pentium IV – IBM’s RS64 IV
What is CMT? CMTs (multithreaded chip multiprocessors) are a new generation of processors that exploit thread-level parallelism to mask memory latency in modern workloads. CMT = CMP + MT. Studies have demonstrated the performance benefits of CMTs, and vendors are planning to ship their CMTs. It is therefore important to understand how best to take advantage of them.
CMTs share resources... A CMT may be equipped with many simultaneously active thread contexts, so competition for shared resources is intense. It is important to understand the conditions that lead to performance bottlenecks on shared resources, and to avoid performance degradation.
CMT Simulation CMT systems do not exist yet, so we work with a CMT system simulator toolkit (similar to Simics). The simulated CPU core has a simple RISC pipeline with one set of functional units. Each core has a TLB and L1 data and instruction caches. The L2 cache is shared by all CPU cores on the chip.
CMT Simulation A schematic view of the simulated CMT Processor
CMT Simulation The simulator accurately models: – Pipeline contention – L1 and L2 caches – Bandwidth limits on the crossbar connections between the L1 and L2 caches – Bandwidth limits on the path between the L2 cache and memory. Simulated configurations: 1 to 4 cores, each with 4 hardware contexts and 8KB-16KB L1 data and instruction caches, plus the shared L2 cache.
Part II - Performance Impacts of Shared Resources on CMTs
Shared Resources on CMTs We will analyze the potential performance bottlenecks on: – Processor Pipeline – L1 Data Cache – L2 Cache
Processor Pipeline Throughput under four different schedules. Throughput can be improved with a smart scheduling algorithm.
L1 Data Cache Pipeline utilization and L1 cache miss rates as a function of Cache Size.
L1 Data Cache Increasing the size of the L1 data cache does not improve performance on MT processors. Even though a larger cache reduces cache misses, the gain does not justify the cost. The L1 instruction cache is also small, but its hit rates are always high (above 97%), so it does not need to be considered.
L2 Cache The L2 cache is more likely to become a bottleneck as its hit rate decreases. – Latency (on the hyper-threaded Pentium IV): a trip from L1 to L2 takes 18 cycles; a trip from L2 to memory takes 360 cycles. – Bandwidth: the bandwidth between L1 and L2 is typically greater than the bandwidth between L2 and main memory.
L2 Cache Performance is strongly tied to the L2 cache miss ratio, so our scheduling algorithm should aim to reduce L2 cache misses.
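To see why the L2 hit rate matters so much, the latency figures above can be plugged into a simple average-penalty calculation. This is only a rough sketch: the 18-cycle and 360-cycle figures come from the slide, while the additive model (no overlap, no latency hiding by other threads) and the example miss rates are assumptions made for illustration.

```python
# Latency figures quoted on the slide (hyper-threaded Pentium IV).
L1_TO_L2_CYCLES = 18
L2_TO_MEM_CYCLES = 360


def avg_l1_miss_penalty(l2_miss_rate: float) -> float:
    """Average cycles needed to service one L1 miss.

    Assumed model: every L1 miss pays the L1-to-L2 trip, and the fraction
    of those that also miss in L2 pays the L2-to-memory trip on top.
    """
    if not 0.0 <= l2_miss_rate <= 1.0:
        raise ValueError("l2_miss_rate must be between 0 and 1")
    return L1_TO_L2_CYCLES + l2_miss_rate * L2_TO_MEM_CYCLES


if __name__ == "__main__":
    # Hypothetical L2 miss rates, chosen only to show the trend.
    for rate in (0.0, 0.1, 0.5):
        print(f"L2 miss rate {rate:.0%}: {avg_l1_miss_penalty(rate):.0f} cycles")
```

Under this model, even a modest L2 miss rate dominates the penalty, because the memory trip is 20 times longer than the L2 trip.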
Part III - Implementation
A new Scheduling Algorithm Balance-set scheduling (Denning, 1968) Basic idea: – To avoid thrashing, schedule a group of threads whose combined working set fits in the cache. – The working set is the amount of data that a thread touches in its lifetime.
Problem with Working-Set Model Denning’s assumptions: – The working set is accessed uniformly. – Small working sets have good locality, large ones have poor locality. Do these apply to modern workloads?
Problem with Working-Set Model A simple experiment to test it: – Footprint of gzip: 200K – Footprint of crafty: 40K – According to Denning’s assumption, crafty should have better cache hit rates than gzip. – The experiment showed that this assumption does not hold for modern workloads, so we cannot use the working-set model.
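The basic idea can be sketched as follows. This is an illustrative sketch only: the greedy smallest-first order, the function name `pick_balance_set`, and the sizes in the usage example are assumptions, not details taken from Denning's work or from the slides.

```python
from typing import Dict, List


def pick_balance_set(working_sets: Dict[str, int], cache_size: int) -> List[str]:
    """Pick threads whose combined working set fits in the cache.

    working_sets maps a thread name to its working-set size; cache_size
    uses the same unit. Threads are considered smallest-first (an assumed
    heuristic) and added while the running total still fits.
    """
    chosen: List[str] = []
    used = 0
    for name, size in sorted(working_sets.items(), key=lambda kv: kv[1]):
        if used + size <= cache_size:
            chosen.append(name)
            used += size
    return chosen


if __name__ == "__main__":
    # Hypothetical working-set sizes in KB.
    print(pick_balance_set({"a": 40, "b": 200, "c": 100}, cache_size=150))
```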
A better metric of Locality Observation: even though gzip has a larger working set, it has better cache hit rates than crafty. Hypothesis: gzip accesses its data in smaller chunks. Reuse distance: the amount of time that passes between consecutive references to the same memory location.
Balance-Set Scheduling We adapted the balance-set principle to work with the reuse distance model instead of the working set. When a scheduling decision has to be made: – Predict the miss rates of all possible groups of threads. – Schedule the groups whose predicted miss rates are below a threshold.
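A reuse distance histogram can be built from a memory trace along these lines. This is a minimal sketch that assumes "time" is measured as the number of other memory references between two consecutive references to the same location; the function name and the trace in the usage example are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable


def reuse_distance_histogram(trace: Iterable[Hashable]) -> Counter:
    """Map each observed reuse distance to how often it occurred.

    The distance of a reference is the number of references that happened
    since the previous reference to the same location (assumed definition).
    First-time references have no distance and are not counted.
    """
    last_seen = {}      # location -> index of its most recent reference
    histogram = Counter()
    for index, location in enumerate(trace):
        if location in last_seen:
            histogram[index - last_seen[location] - 1] += 1
        last_seen[location] = index
    return histogram


if __name__ == "__main__":
    # Hypothetical trace of memory locations.
    print(reuse_distance_histogram(["a", "b", "a", "a", "c", "b"]))
```

A workload that touches a lot of data in small, tightly reused chunks will pile up counts at short distances, which is the behaviour hypothesized for gzip.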
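The two steps above can be sketched as a filter over all thread groups. The miss-rate predictor is passed in as a function because the slides do not spell out the model; `toy_predictor` and its per-thread rates are placeholders invented for this example, not the reuse-distance-based prediction itself.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Group = Tuple[str, ...]


def candidate_groups(
    threads: Sequence[str],
    group_size: int,
    predict_miss_rate: Callable[[Group], float],
    threshold: float,
) -> List[Group]:
    """Return every group whose predicted miss rate is below the threshold."""
    return [
        group
        for group in combinations(threads, group_size)
        if predict_miss_rate(group) < threshold
    ]


# Placeholder predictor: sums made-up per-thread miss rates.
TOY_RATES = {"t1": 0.02, "t2": 0.10, "t3": 0.30}


def toy_predictor(group: Group) -> float:
    return sum(TOY_RATES[t] for t in group)


if __name__ == "__main__":
    print(candidate_groups(["t1", "t2", "t3"], 2, toy_predictor, threshold=0.2))
```

Enumerating all groups grows combinatorially with the number of threads, so a real scheduler would presumably need to bound or cache this work.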
Balance-Set Scheduling What is the threshold to be used?
Balance-Set Scheduling Two policies for selecting the group to schedule: – PERF: Select the groups with the lowest miss rate (making sure no workload is starved). – FAIR: Each workload receives an equal share of the processor. Keep track of how many times each workload has been selected; in each selection, favor groups that contain the least frequently selected workloads.
Balance-Set Scheduling IPC achieved with the default, PERF and FAIR schedulers – Lowest performance gain: 16%, FAIR when L2 = 384KB – Highest performance gain: 32%, PERF when L2 = 48KB
Balance-Set Scheduling L2 cache miss rates are reduced by 20-40%. The minimum gain is 12%, with the FAIR scheduler at L2 = 384KB; matching it with the default scheduler would require an L2 cache four times larger.
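A rough sketch of the two policies follows. The names `select_perf` and `FairSelector` are hypothetical, the FAIR rule shown (pick the group whose members have the lowest total selection count) is one assumed reading of "favor least frequently selected workloads", and the starvation guard mentioned for PERF is left out because the slides do not describe it.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Group = Tuple[str, ...]


def select_perf(groups: Sequence[Group], miss_rate: Callable[[Group], float]) -> Group:
    """PERF: pick the group with the lowest predicted miss rate."""
    return min(groups, key=miss_rate)


class FairSelector:
    """FAIR: track selections and favor the least frequently chosen workloads."""

    def __init__(self) -> None:
        self.times_selected = Counter()

    def select(self, groups: Sequence[Group]) -> Group:
        # Assumed rule: lowest summed selection count wins; ties go to the
        # earliest group in the list.
        best = min(groups, key=lambda g: sum(self.times_selected[t] for t in g))
        self.times_selected.update(best)
        return best


if __name__ == "__main__":
    groups = [("a", "b"), ("c", "d"), ("a", "c")]
    rates = {("a", "b"): 0.3, ("c", "d"): 0.1, ("a", "c"): 0.2}  # made-up values
    print(select_perf(groups, rates.get))
    fair = FairSelector()
    print([fair.select(groups) for _ in range(3)])
```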
Implementation Cost We talked about the potential benefits, but a useful approach must also be practical to implement. The cost of predicting miss rates from reuse distance histograms was discussed previously. – To adapt the model to an MT environment, we must combine the histogram information of the co-scheduled threads. – The cost is small with the AVG method and higher with the COMB method.
Implementation Cost Cost of collecting the data required to build reuse distance histograms: – Monitor memory locations and record their reuse distances. – A user-level watching tool was implemented, with 20% overhead. – The overhead can be reduced by using multiple watchpoints (possible on UltraSPARC) and by running at kernel level instead of user level.
Implementation Cost The data to be stored for each thread is small. The size of the reuse distance histogram can be fixed. Reuse distance histograms can be compressed: – Reuse distances are aggregated into buckets. – Results stayed accurate even with only a few buckets.
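Aggregating reuse distances into a fixed number of buckets might look like the sketch below. The bucket boundaries, the function name, and the sample histogram are assumptions for illustration; the slides do not say how the buckets were chosen.

```python
from bisect import bisect_left
from typing import List, Mapping, Sequence


def compress_histogram(
    histogram: Mapping[int, int], upper_bounds: Sequence[int]
) -> List[int]:
    """Aggregate a reuse distance histogram into fixed buckets.

    upper_bounds is a sorted list of inclusive upper limits, one per bucket.
    One extra overflow bucket at the end collects all larger distances, so
    the result always has len(upper_bounds) + 1 entries.
    """
    buckets = [0] * (len(upper_bounds) + 1)
    for distance, count in histogram.items():
        # Index of the first bound that is >= distance.
        buckets[bisect_left(upper_bounds, distance)] += count
    return buckets


if __name__ == "__main__":
    # Hypothetical histogram: distance -> number of occurrences.
    print(compress_histogram({0: 1, 1: 1, 3: 1, 100: 5}, upper_bounds=[1, 10]))
```

Because the number of buckets is fixed up front, the per-thread storage stays constant no matter how long the thread runs.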
Part V – Conclusion
Conclusion We investigated the performance effects of shared resources in a CMT system and found that the L2 cache has the greatest effect on performance. Using Balance-Set Scheduling, we reduced L2 cache miss rates by 20-40% and improved performance by 16-32% (the same improvement that would be obtained by making the L2 cache four times larger).
References – A. Fedorova, M. Seltzer, C. Small, D. Nussbaum, “Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design” – A. Fedorova, M. Seltzer, C. Small, D. Nussbaum, “Throughput-Oriented Scheduling on Chip Multithreading Systems” – A. Snavely, D. Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Machine”, 2000 – A. Snavely, D. Tullsen, G. Voelker, “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”, 2002 – S. Parekh, S. Eggers, H. Levy, J. Lo, “Thread-Sensitive Scheduling for SMT Processors” – J. Larus, M. Parkes, “Using Cohort Scheduling to Enhance Server Performance” – R. von Behren, J. Condit, F. Zhou, G. Necula, E. Brewer, “Capriccio: Scalable Threads for Internet Services”, 2003