Presentation on theme: "C-AMAT: Concurrent Average Memory Access Time" — Presentation transcript:

1 C-AMAT: Concurrent Average Memory Access Time
Xian-He Sun, Illinois Institute of Technology, April 2015. With Yuhang Liu and Dawei Wang. C-AMAT stands for concurrent average memory access time; it is a rethinking of memory performance from a data-centric point of view.

2 Outline
Motivation
Memory System and Metrics
C-AMAT: Definition and Contribution
Experimental Design and Verification
Application and Related Work
Conclusion
References:
X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, May 2014.
D. Wang and X.-H. Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory System," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626–1639, 2014.

3 Motivation
Processors are roughly 400x faster than memory, and applications are becoming more data intensive; data access has become THE performance bottleneck of high-end computing.
Many concurrency-based technologies have been developed to improve data access speed, but their impact on final performance is elusive, so they are not fully utilized.
Existing memory optimization strategies are still based primarily on the sequential single-access assumption.
Existing memory metrics (MR, AMP, and AMAT) still measure hits and misses based primarily on sequential single-access activity, and so are inadequate for measuring concurrent cache memory access activity.
Traditional memory performance metrics, such as average memory access time (AMAT), are designed for sequential data accesses and can prove misleading for contemporary cache technologies that increasingly rely on access concurrency. C-AMAT, a new performance metric, accounts for concurrency at both the component and system levels for modern memory design.

4 Memory Wall Problem
[Figure: the classic processor-DRAM performance gap chart. Processor performance ("Moore's Law") grew about 1.52x/yr (2x every 1.5 years) from 1986 to 2005 and about 1.20x/yr since, while DRAM performance grew about 7%/yr (2x every 10 years); the processor-memory performance gap grows roughly 50% per year.]
Growing disparities between processor and memory speeds have resulted in a so-called "memory wall," an ever-widening gap between CPU and memory performance. Although the slope of the processor curve became smaller between 2005 and 2010, the number of on-chip processors also increased, so the gap has not narrowed; in fact, it is increasingly wide. As a metaphor, it is like waiting half a year to get a reply from a friend.
Cache history: in 1980 there was no cache in the microprocessor; by 2010 there were three levels of cache on chip and a fourth level off chip. In 1989 the Intel 486 became the first Intel processor with an on-chip L1 cache (8 KB); in 1995 the Pentium Pro was the first with an on-chip L2 cache (256 KB); in 2003 the Itanium 2 was the first with an on-chip L3 cache (6 MB).
Source: Computer Architecture: A Quantitative Approach.

5 Extremely Unbalanced Operation Latency
[Figure: operation latencies, in cycles, across the memory hierarchy; an I/O access costs roughly 5~15M cycles.]
The performance mismatch among the different layers of the memory hierarchy is increasingly wide. Note that the latencies of main memory and disk are both very large.

6 Data Access Becomes the Performance Bottleneck
[Figure: data-access profiles of GROMACS (a molecular dynamics simulation package), MPQC (the Massively Parallel Quantum Chemistry program), and a multigrid solver (microstructure).]
Applications tend to be data intensive, as these examples show. CFD is computational fluid dynamics simulation, and a multigrid solver is one approach to CFD; NaSt3DGP is a parallel 3D flow solver. The performance lost due to data access accounts for 40 to 70% of CPU time.
N. Hardavellas, I. Pandis, R. Johnson, N. G. Mancheril, A. Ailamaki, and B. Falsafi, "Database servers on chip multiprocessors: Limitations and opportunities," in Proc. of the 3rd Conference on Innovative Data Systems Research, Jan.
S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi, "Spatio-temporal memory streaming," ACM SIGARCH Computer Architecture News, vol. 37, no. 3, 2009.

7 Data Access Becomes the Performance Bottleneck (continued)
[Figure: further data-intensive domains: computational fluid dynamics, adaptive multigrid, computational finance, and data mining.]

8 Solution: Memory Hierarchy
[Figure: the memory hierarchy pyramid; upper levels are faster, lower levels are larger.]
Level / capacity / access time:
Registers: <8 KB, 0.2~0.5 ns
L1 cache: <128 KB, ~1 ns
L2 cache: <50 MB, 1-10 ns (about 10% global miss rate)
Main memory: gigabytes, 50 ns-100 ns
Disk: terabytes, ~5 ms
Data moves between levels in staging/transfer units: instructions and operands (1-8 bytes, managed by the program/compiler), blocks (managed by the L1/L2 cache controllers), and pages (4 KB-4 MB, managed by the OS).

9 Data Access Concurrency Exists
Modern memory systems use a large number of advanced caching technologies to decrease cache latency. Some widely used cache optimization methods, such as non-blocking caches, pipelined caches, multibanked caches, and data prefetching, allow cache accesses generated by the CPU or the prefetcher to overlap with each other. These technologies make the relation between memory access and processor performance even more complicated, since the processor can continue executing instructions or accessing memory even under multiple cache misses. Research to alleviate these performance gaps has focused on improving memory system concurrency, that is, on methods that support multiple requests simultaneously.

10 Solution: Memory Hierarchy & Parallelism
[Figure: parallelism at every layer of the hierarchy. CPU: multi-core, multi-threading, multi-issue, out-of-order execution, speculative execution, runahead execution, pipelining. Cache: multi-banked, multi-level, pipelined, non-blocking, data prefetching, write buffer. Memory: multi-channel, multi-rank, multi-bank. I/O: parallel file system, disks.]
During the last fifteen years, the memory hierarchy has evolved with parallelism and concurrency at each layer, in both spatial and temporal forms.

11 Assumptions of Current Solutions
[Figure: the extremely unbalanced operation-latency chart again; an I/O access costs 5~15M cycles.]
The memory hierarchy assumes locality; concurrency techniques assume particular data access patterns (data streams). Meanwhile, operation latency remains extremely unbalanced and performance varies widely.

12 Existing Memory Metrics
Miss Rate (MR): the number of missed memory accesses divided by the total number of memory accesses.
Misses Per Kilo-Instructions (MPKI): the number of missed memory accesses per thousand committed instructions.
Average Miss Penalty (AMP): the sum of the individual miss latencies divided by the number of missed memory accesses.
Average Memory Access Time (AMAT): AMAT = HitTime + MR × AMP.
Flaws of the existing metrics: each focuses on a single component or a single memory access. MR and MPKI only reflect the proportion of data found in or out of the cache and say nothing about the penalty of a miss; AMP considers only miss information. All are measured on a single-access basis, ignoring the concurrency of accesses, so memory parallelism/concurrency is missing.
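A worked AMAT example with hypothetical numbers: with HitTime = 1 cycle, MR = 0.05, and AMP = 100 cycles, AMAT = 1 + 0.05 × 100 = 6 cycles per access.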

13 Concurrent AMAT (C-AMAT)
H is the hit time; CH is the hit concurrency; CM is the pure-miss concurrency; pMR and pAMP are the pure-miss ratio and the pure-miss penalty. A pure-miss cycle is a miss cycle during which there is no hit. Pure miss is a very important new concept. Compared with AMAT, C-AMAT introduces two new parameters and extends two parameters to account for pure misses.
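The defining equation appeared as an image in the original slide; as given in the IEEE Computer paper cited in the outline, it is C-AMAT = H/CH + pMR × pAMP/CM: the hit time is amortized over concurrent hits, and only the pure-miss portion of the miss penalty, amortized over concurrent pure misses, is charged.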

14 Different Perspectives
Sequential perspective: AMAT. Concurrent perspective: C-AMAT.
AMAT measures from the process side (compute-centric): accesses are counted one by one and then averaged. C-AMAT measures from the time side, over memory-active cycles (data-centric), asking whether each cycle contains a hit.
AMAT, the traditional memory performance metric, is designed for sequential data accesses and can be misleading for contemporary cache technologies that increasingly rely on access concurrency. C-AMAT, from this new perspective, extends AMAT to consider both data locality and data concurrency. While AMAT is widely used as a tool for architecture analysis and design, it does not accurately take into account access concurrency, an increasingly vital factor in memory performance: when memory access improvement results from concurrency, whether through ILP, other advanced cache design techniques, or multi-core technologies, AMAT cannot correctly reflect the change and so provides misleading information. C-AMAT takes such concurrency-driven improvements into account, and extensive simulations confirm its effectiveness for evaluating modern memory system design and architecture configuration.

15 Pure Miss: Miss Is Not Important (Pure Miss Is)
The penalty is due to pure misses. The introduction of the pure miss is based on the fact that not all cache misses cause a processor stall; only pure misses do. A pure miss is the interaction of concurrency and locality. The concept of the pure miss challenges the conventional computer hardware and software design principle that "locality is always good." Pure miss and C-AMAT bring a different angle to designing computer architectures and algorithms.

16 C-AMAT is Recursive
Like AMAT, C-AMAT can be extended to the next level of the memory hierarchy; the default C-AMAT is C-AMAT1, and a recurrence relation connects C-AMAT1 and C-AMAT2 (reconstructed below). Note that we use CM (capital M) to represent concurrent pure misses and Cm (little m) to represent concurrent misses. η1 is a measurable parameter with a physical meaning: the number of misses that occur at L2 relates to Cm, while the number of misses that matter to L1 performance relates to CM; a similar argument applies to pAMP and AMP. The impact of C-AMAT2 on the final C-AMAT1 is trimmed by pMR1 and the new parameter η1; pMR1 × η1 is the concurrency contribution to reducing the average memory access delay at the L1 level. In the same fashion, C-AMAT can be extended to the next layer of the memory hierarchy.
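The recurrence itself was an image in the original slide; reconstructed from the parameters defined here and the cited C-AMAT papers, it reads C-AMAT1 = H1/CH1 + pMR1 × η1 × C-AMAT2, so the L2 penalty enters the L1 expression only through the pure-miss ratio pMR1 and the delay reducer η1.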

17 The Physical Meaning of η1
R1 = pure miss cycles / miss cycles
R2 = pure misses / misses
η1 = R1 / R2
The penalty at L2 is C-AMAT2, but the actual delay impact on L1 is η1 × C-AMAT2, so η1 is the L1 (concurrency) data delay reducer. We have proved that η1 is the ratio of R1 to R2. For both R1 and R2, smaller is better, and we want to keep η1 as small as possible.

18 Architecture Impacts
CH can be contributed by: multi-port caches, multi-banked caches, pipelined cache structures.
CM can be contributed by: non-blocking cache structures, prefetching logic.
Techniques that can increase both CH and CM: out-of-order execution, multiple issue, pipelining, SMT, CMP.

19 Detecting System
C-AMAT and its parameters can be measured on some systems but not on others, so we have developed a detecting system that allows C-AMAT to be measured on all systems. The Hit Concurrency Detector (HCD) counts the total hit cycles and records each hit phase in order to calculate the average hit concurrency; hit cycles are the clock cycles containing at least one hit access activity. The HCD also tells the Miss Concurrency Detector (MCD) whether the current cycle has a hit access. The MCD is a monitoring unit that counts the total number of pure-miss cycles and records each pure-miss phase in order to calculate the average miss concurrency, pure-miss rate, and pure-miss penalty. With the information provided by the HCD, the MCD can tell whether a cycle is a pure-miss cycle and whether a miss is a pure miss. With the full miss information, the pure-miss rate and average pure-miss penalty can be calculated, and finally, with the C-AMAT formula, C-AMAT can be measured at the component level. [Figure: structure for detecting cache hit concurrency and cache miss concurrency using the C-AMAT metric.]
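To make the counting concrete, here is a minimal software sketch of what the HCD/MCD pair computes. It is an illustration under assumptions, not the hardware design: the per-cycle trace format and the function name are hypothetical, while the final identity (memory-active cycles = hit cycles + pure-miss cycles, so C-AMAT = active cycles / accesses) follows from the definitions above.

# Sketch of the HCD/MCD counting logic. Hypothetical input: one
# (hits_in_flight, misses_in_flight) pair per clock cycle.
def measure_c_amat(trace, total_accesses):
    hit_cycles = 0        # HCD: cycles with at least one hit in flight
    hit_overlap = 0       # summed hit concurrency over hit cycles
    pure_miss_cycles = 0  # MCD: miss cycles with no hit in flight
    miss_overlap = 0      # summed pure-miss concurrency
    for hits, misses in trace:
        if hits > 0:                 # a hit cycle (misses may overlap it)
            hit_cycles += 1
            hit_overlap += hits
        elif misses > 0:             # a pure-miss cycle: misses only
            pure_miss_cycles += 1
            miss_overlap += misses
    ch = hit_overlap / hit_cycles if hit_cycles else 0.0
    cm = miss_overlap / pure_miss_cycles if pure_miss_cycles else 0.0
    # Memory-active cycles are exactly the hit cycles plus the pure-miss
    # cycles, so C-AMAT is their sum divided by the number of accesses.
    c_amat = (hit_cycles + pure_miss_cycles) / total_accesses
    return c_amat, ch, cm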

20 Experimental Environment
Simulator: GEM5.
Benchmarks: 29 benchmarks from the SPEC CPU2006 suite. For each benchmark, 10 million instructions were simulated to collect statistics, and average values of the corresponding memory metrics are shown. A good memory metric should match the actual design choices made for modern processors.

21 Default Configuration
[Table: default processor and cache configuration parameters for the simulated testing of C-AMAT.]

22 Experimental Results
[Figure: L1 DCache AMAT and C-AMAT when changing the issue pipeline width.]
AMAT gets worse while C-AMAT gets better as concurrency increases; only the memory performance reflected by C-AMAT consistently matches the performance improvement trend delivered by this ILP technique.

23 Experimental Results
[Figure: L1 DCache AMAT and C-AMAT when changing the MSHR size.]
AMAT gets worse while C-AMAT gets better as concurrency increases.

24 Experimental Results
[Figure: L2 cache AMAT and C-AMAT when changing the MSHR size.]
AMAT gets worse while C-AMAT gets better as concurrency increases. More results can be found in X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, May 2014.

25 Potential of C-AMAT and Data Concurrency
Assume the total running time is T. The data stall time d can be up to 70% of T, i.e., d = 0.7T, leaving compute time t = 0.3T. Data stall time can therefore be up to 0.7/0.3 ≈ 2.3 times the compute time. If layered performance matching can be achieved, so that the overlapping effect of data access concurrency is sufficient, the data stall time becomes only 1% of the compute time. Memory performance can then be improved by about 230 times!
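The arithmetic behind the headline number: the stall shrinks from 0.7T to 1% of 0.3T = 0.003T, an improvement factor of 0.7/0.003 ≈ 233, i.e., roughly 230x.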

26 Improvement Potential Due to Concurrency
[Figure: the potential of concurrency-oriented optimization.]
If the overlapping effect is significant enough, the layered mismatch can be mitigated. Aided by concurrency and locality, memory system performance can be improved up to hundreds of times (230x) at each layer of the memory hierarchy with optimized layered performance matching.

27 How the 230x Improvement Is Achieved
Our recent research finds that we can increase concurrency without hurting locality, which means we can increase performance by increasing concurrency. In the extreme case shown here, we increase the instruction window (IW), reorder buffer (ROB), number of L1 cache ports, and pipeline width. Configuration C is the first scheme found in the architecture exploration that meets the "1%" requirement: its data stall time is 0.979% of CPIexe. At this point the coarse-grained optimization is complete. If the hardware is configurable enough, the optimization continues, and configuration D is found, which also meets the "1%" requirement with an already small data stall cost. As an optional step, we check whether the hardware is over-provisioned: to achieve LPM cost-efficiently, we fine-tune to remove possible hardware over-provision, leading to the final configuration E, which meets the "1%" requirement with minimal hardware cost. A sketch of this search loop follows below. [Figure: increasing data access concurrency to reach a 230x speedup of memory system performance with our LPM algorithm.]
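A schematic of the coarse-then-fine search just described. This is a sketch under assumptions: the knob names and the simulate() oracle (standing in for a GEM5 run reporting data stall time and CPIexe) are hypothetical placeholders, not the actual LPM implementation.

# Sketch of the LPM tuning loop. simulate(config) is assumed to return
# (data_stall, cpi_exe) for a candidate configuration.
KNOBS = ["issue_width", "rob_size", "l1_ports", "pipeline_width"]

def lpm_search(config, simulate, target=0.01):
    # Coarse-grained phase: grow the concurrency knobs until the data
    # stall time is within 1% of CPIexe (configuration C).
    stall, cpi_exe = simulate(config)
    while stall / cpi_exe > target:
        for k in KNOBS:
            config[k] *= 2
        stall, cpi_exe = simulate(config)
    # Fine-grained phase: trim over-provisioned hardware while the 1%
    # requirement still holds (toward configuration E).
    for k in KNOBS:
        while config[k] > 1:
            trial = dict(config, **{k: config[k] // 2})
            stall, cpi_exe = simulate(trial)
            if stall / cpi_exe > target:
                break            # shrinking this knob breaks the match
            config = trial
    return config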

28 Technique Impact Analysis (Original)
The techniques for improving hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access equation as well as the complexity of the memory hierarchy. Figure 2.11 summarizes these techniques and estimates their impact on complexity, with + meaning the technique improves the factor, − meaning it hurts that factor, and blank meaning it has no impact. Generally, no technique helps more than one category. [Table: Figure 2.11 on page 96 of Hennessy & Patterson's latest book.]

29 Technique Impact Analysis (Ours)
Following Hennessy & Patterson's book (5th ed., 2012), the rows represent the different techniques and the columns the various optimization targets. The technique items come from the sixteen memory-system optimization mechanisms introduced by Hennessy and Patterson, and the plus (+) and minus (−) entries come from there as well, confirmed in our simulations. Note that row four of the table, "Hardware techniques: large IW and ROB, runahead," is not in Hennessy and Patterson's original table; large IW and ROB and runahead influence only C-AMAT and do not influence AMAT, and we include them to demonstrate that memory concurrency technologies have not received the attention they deserve. The list is far from complete: large MSHR sizes, multi-banks, and multiple memory channels are other "concurrency only" technologies, for example, and new concurrency-only technologies can be developed once their contribution can be well defined and measured. This table shows the power of C-AMAT in unifying the impact of data locality and data concurrency technologies, in bringing concurrency-only technologies into the spotlight, and in calling for new technologies that utilize data concurrency. Five new columns are added for the four new parameters and C-AMAT; the entries in the newly added columns can be analyzed using the theorems, and actual values can be determined via simulation. Since all thirty-eight entries marked with "cycle" are new entries not listed in Figure 2.11 of Hennessy & Patterson's book, Table III brings thirty-eight new opportunities for memory system optimization. [Table: a new technique summation table with C-AMAT.]

30 The Impact of C-AMAT
A new understanding of memory systems, with a rigorous mathematical description.
Unifies the influence of data locality and data concurrency under one formulation.
A foundation for developing new concurrency-based optimizations and for utilizing existing locality-based optimizations.
A foundation for automatic tuning toward the best configuration, partitioning, scheduling, etc.
As applications become more and more data intensive, memory-level parallelism (concurrency) becomes increasingly vital for amortizing the data access cost. Hardware concurrency in the memory hierarchy is similar to the multiple lanes of a highway; multi-port, multi-bank, and multi-rank are three specific ways of implementing memory concurrency. Architects are willing to provide sufficient memory concurrency at a reasonable cost to reduce data access delay. However, diverse workloads make the requirements change over time and vary with layer and workload, making optimal memory design a challenging task.

31 C-AMAT in Action: Data Stall Time
New C-AMAT model: CPU-time = IC × (CPIexe + fmem × C-AMAT × (1 − overlapRatio_c-m)) × cycle-time.
C-AMAT can be used directly to reduce CPU time. Note that the traditional AMAT model no longer holds when data access concurrency exists; it also carries the factor (1 − overlapRatio_c-m), but with overlapRatio_c-m = 0. Only a pure miss will cause a processor stall, and the data stall penalty is formulated accordingly.
Y.-H. Liu and X.-H. Sun, "Reevaluating data stall time with the consideration of data access concurrency," Journal of Computer Science and Technology, vol. 30, no. 2, pp. 227–245, 2015.

32 C-AMAT in Action: Layered Performance Matching
Recursive C-AMAT can be used to measure and mitigate layered performance mismatch; for instance, the impact of C-AMAT2 is trimmed by pMR1 and η1. The key is to reduce pure misses, not misses, and data concurrency can do exactly that.
In this study, we propose a Layered Performance Matching (LPM) approach to balance memory performance and computing performance. The general concept of LPM is that the performance of each layer of a memory hierarchy should, and can, be optimized to match the requests of the layer directly above it as closely as possible. The LPM model considers memory concurrency and locality simultaneously; it reveals that increasing the effective overlap between hits and misses at a higher layer eases the performance impact of the lower layer. The terms "pure miss" and "pure miss penalty" are introduced to measure the effectiveness of this hit-miss overlapping. By distinguishing between common misses and pure misses, LPM reshapes the conventional computer hardware and software design principle that "locality is always good" and calls for a joint consideration of locality and concurrency. Our evaluation shows that memory system performance can be improved up to 230 times with optimized layered performance matching. Without altering the hardware configuration, simply by scheduling data allocation following the matching principle on heterogeneous multicore systems, we have also achieved more than 20% performance improvement in our case studies. Analysis and experimental results validate that LPM is feasible and provides a novel and efficient way to cope with the severe memory wall problem and the complexity of memory systems, and to optimize the vital memory performance.
Y.-H. Liu and X.-H. Sun, "LPM: Layered Performance Matching in Memory Hierarchy," Illinois Institute of Technology Technical Report (IIT/CS-SCS), 2014.

33 C-AMAT in Action: Online Reconfiguration and Smart Scheduling
A performance optimization tool has been developed based on C-AMAT. It measures C-AMAT on existing computing systems and provides measurements and optimization suggestions, supporting optimization both in hardware reconfiguration and in software task partitioning and scheduling.
Y.-H. Liu and X.-H. Sun, "TuningC: A Concurrency-aware Optimization Tool," Illinois Institute of Technology Technical Report (IIT/CS-SCS), 2015.

34 Related Work: APC Versus C-AMAT
APC is Accesses Per (memory-active) Cycle: APC = A/T. APC is a measurement and a companion of C-AMAT, which is an analysis and optimization tool. APC is very different from the traditional IPC: it is based on memory-active cycles (data-centric, per access) and on the overlapping mode of concurrent data accesses. C-AMAT does not depend on its five parameters for its value, because C-AMAT = 1/APC and can be measured directly.
D. Wang and X.-H. Sun, "Memory Access Cycle and the Measurement of Memory Systems," IEEE Transactions on Computers, vol. 63, no. 7, July 2014.
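A worked example with hypothetical numbers: if A = 1,000 accesses complete over T = 2,500 memory-active cycles, then APC = 1000/2500 = 0.4 accesses per cycle and C-AMAT = 1/APC = 2.5 cycles per access.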

35 Related Work: MLP
Memory-Level Parallelism (MLP): the average number of long-latency outstanding main memory accesses when there is at least one such outstanding access. Assuming each off-chip memory access has a constant latency of, say, m cycles, APC_M = MLP/m; that is, APC_M is directly proportional to MLP, so MLP is a special case of APC and APC is a superset of MLP. C-AMAT is both an analytical tool and a measurement, whereas MLP is only a measurement. MLP does not consider locality, while APC and C-AMAT do: because MLP does not include locality in its formula, the combined optimization of locality and concurrency cannot be conducted via MLP alone.

36 Conclusions
Data access delay is the premier bottleneck of computing. Hardware memory concurrency exists but is underutilized. C-AMAT unifies data concurrency with locality for combined data access optimization, and it can improve AMAT performance by 230 times. This 230x number could be even larger: with multicore technology, CPUs can be built faster, and the question is whether data can be moved up fast enough. All the results presented in this study are based on C-AMAT, a rethinking of memory performance from a data-centric point of view. A tacit fact about C-AMAT is that its cycle is the memory-active cycle, not the CPU cycle; it is not as simple as it first looks. Aided by C-AMAT, we have obtained many exciting research results, which will be introduced in the future.

37 Develop C-AMAT-based technologies to reduce data access time!

38 Thank You & Questions?

