1 Memory Intensive Benchmarks: IRAM vs. Cache-Based Machines
Parry Husbands (LBNL), Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)

2 Motivation
- Observation: current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones)
  - E.g., optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz P4
  - Even worse when parallel efficiency is considered
  - Overall ~10% across application benchmarks
- Is memory bandwidth the problem?
  - Performance is directly related to how well the memory system performs
  - But the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)

3 Solutions?
- Better software
  - ATLAS, FFTW, Sparsity, PHiPAC
- Power and packaging are important too!
  - New buildings and infrastructure needed for many recent/planned installations
- Alternative architectures
  - One idea: tighter integration of processor and memory
    - BlueGene/L (~25 cycles to main memory)
    - VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM

4 VIRAM Overview
[Die photo: 14.5 mm x 20.0 mm]
- MIPS core (200 MHz)
- Main memory system
  - 13 MB of on-chip DRAM
  - Large on-chip bandwidth: 6.4 GB/s peak to the vector unit
- Vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth (see the loop sketch below)
  - Typical power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops
  - 1.6 Gflops (single precision)
- Fabrication by IBM
  - Tape-out in O(1 month)
- Our results use a simulator with Cray's vcc compiler
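To make "fine-grained parallelism" concrete, here is a minimal illustration (the function and names are ours, not from the talk) of the kind of data-parallel C loop that a vectorizing compiler such as Cray's vcc, used for the results here, can map onto vector instructions:

#include <stddef.h>

/* Illustrative sketch only: every iteration is independent, so a
 * vectorizing compiler can execute this loop on the vector unit,
 * streaming x and y from on-chip DRAM at unit stride. */
void saxpy(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}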

5 Our Task
- Evaluate the use of processor-in-memory (PIM) chips as a building block for high performance machines
- For now, focus on serial performance
- Benchmark VIRAM on scientific computing kernels
  - VIRAM was originally designed for multimedia applications
  - Can we use on-chip DRAM for vector processing instead of the conventional SRAM? (DRAM is denser)
- Isolate the performance-limiting features of the architectures
  - More than just memory bandwidth

6 Benchmarks Considered
- Transitive closure (small & large data sets)
- NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  - Fetch-and-increment a stream of "random" addresses
- Sparse matrix-vector product (SPMV):
  - Matrix order 10000, 177820 nonzeros
- Computing a histogram
  - Different algorithms investigated: 64-element sorting kernel; privatization; retry
- 2D unstructured mesh adaptation

             Transitive    GUPS          SPMV    Histogram     Mesh
  Ops/step   2             1             2       1             N/A
  Mem/step   2 ld, 1 st    2 ld, 2 st    3 ld    2 ld, 1 st    N/A

7 The Results
[Performance chart] Comparable performance at a lower clock rate

8 Power Efficiency
- Large power/performance advantage for VIRAM from:
  - PIM technology
  - Data-parallel execution model

9 Ops/Cycle
[Chart]

10 GUPS
- 1 op, 2 loads, 1 store per step
- Mix of indexed and unit-stride operations
- Address generation is key here (only 4 addresses per cycle on VIRAM); see the sketch below
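A minimal sketch of a GUPS-style update loop (function and array names are ours): fetch-and-increment over a stream of "random" addresses, matching the per-step counts above.

#include <stddef.h>
#include <stdint.h>

/* Per step: 1 op (the add), 2 loads (index[i] and table[index[i]]),
 * 1 store (table[index[i]]).  The index stream is read at unit
 * stride; the table accesses are indexed gather/scatter operations,
 * which is where VIRAM's 4-addresses-per-cycle limit bites. */
void gups(uint64_t *table, const uint64_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        table[index[i]] += 1;
}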

11 Histogram
- 1 op, 2 loads, 1 store per step
- Like GUPS, but duplicates restrict the available parallelism and make it more difficult to vectorize (see the sketch below)
- The sort method performs best on VIRAM on real data
- Competitive when the histogram doesn't fit in cache
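Slide 6 lists privatization among the algorithms investigated; below is a minimal sketch of that idea (the bin count, number of copies, and all names are our illustrative assumptions). The naive loop hist[data[i]]++ is hard to vectorize because duplicate values collide on the same bin within one vector; spreading updates over several private histograms removes the collisions at the cost of a final reduction.

#include <stddef.h>
#include <string.h>

#define BINS   256   /* assumed bin count, for illustration */
#define COPIES   8   /* assumed number of private copies, for illustration */

void histogram_private(unsigned hist[BINS], const unsigned char *data, size_t n)
{
    unsigned priv[COPIES][BINS];
    memset(priv, 0, sizeof priv);

    for (size_t i = 0; i < n; i++)
        priv[i % COPIES][data[i]]++;   /* collisions split across copies */

    for (size_t b = 0; b < BINS; b++) {   /* merge the private copies */
        unsigned sum = 0;
        for (size_t c = 0; c < COPIES; c++)
            sum += priv[c][b];
        hist[b] = sum;
    }
}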

12 Which Problems are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation, bank conflicts, and parallelism
  - For Histogram, the limit is lack of parallelism, not memory bandwidth

13 Summary and Future Directions
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
- The performance advantage for VIRAM depends on the application
  - Need fine-grained parallelism to utilize the on-chip bandwidth
- Future steps
  - Validate our work on the real chip!
  - Extend to multi-PIM systems
  - Explore system balance issues:
    - Other memory organizations (banks, bandwidth vs. size of memory)
    - Number of vector units
    - Network performance vs. on-chip memory

14 The Competition

           SPARC IIi      MIPS R10K     P III          P 4         Alpha EV6
  Make     Sun Ultra 10   Origin 2000   Intel Mobile   Dell        Compaq DS10
  Clock    333 MHz        180 MHz       600 MHz        1.5 GHz     466 MHz
  L1       16+16 KB       32+32 KB      32 KB          12+8 KB     64+64 KB
  L2       2 MB           1 MB          256 KB         256 KB      2 MB
  Mem      256 MB         1 GB          128 MB         1 GB        512 MB

15 Transitive Closure (Floyd-Warshall)
- 2 ops, 2 loads, 1 store per step
- Good for vector processors (see the sketch below):
  - Abundant, regular parallelism and unit stride
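A minimal boolean Floyd-Warshall transitive-closure kernel sketch (names are ours): with r[i][k] hoisted out of the inner loop, each step is exactly 2 ops (AND, OR), 2 loads, and 1 store, and the j loop is unit stride with no loop-carried dependence, which is why it vectorizes so well.

#include <stddef.h>

/* r is an n-by-n reachability matrix stored row-major.  Per inner
 * step: 2 ops (&, |), 2 loads (r[i][j], r[k][j]), 1 store (r[i][j]). */
void transitive_closure(unsigned char *r, size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            unsigned char rik = r[i*n + k];   /* loop-invariant load */
            for (size_t j = 0; j < n; j++)    /* unit stride, vectorizable */
                r[i*n + j] |= rik & r[k*n + j];
        }
}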

16 SPMV
- 2 ops, 3 loads per step
- Mix of indexed and unit-stride operations
- Good performance with ELLPACK storage, but only when rows have the same number of non-zeros (see the sketch below)
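A sketch of an ELLPACK-style SPMV (the column-major layout and all names are our assumptions): every row is padded to the same width, so vectorizing over rows gives unit-stride access to val and col plus one gather from x; per step that is 2 ops (multiply, add) and 3 loads, as counted above. The padding is also why performance is only good when rows really do have similar nonzero counts: uneven rows waste bandwidth on zero entries.

#include <stddef.h>

/* val and col are n-by-width arrays with entry j of row i stored at
 * [j*n + i].  Padding entries carry val = 0 and a valid col index,
 * so they contribute nothing.  Per step: 2 ops and 3 loads
 * (val, col, and the gather x[col]). */
void spmv_ellpack(float *y, const float *val, const int *col,
                  const float *x, size_t n, size_t width)
{
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0f;
    for (size_t j = 0; j < width; j++)
        for (size_t i = 0; i < n; i++)               /* vectorizable */
            y[i] += val[j*n + i] * x[col[j*n + i]];
}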

17 Mesh Adaptation
- Single level of refinement of a mesh with 4802 triangular elements, 2500 vertices, and 7301 edges
- Extensive reorganization required to take advantage of vectorization
- Many indexed memory operations (limited again by address generation)

