Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000 Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and.


1 Understanding Application Scaling: NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000
Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler
{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU
Department of Electrical Engineering and Computer Science, Computer Science Division
University of California, Berkeley
June 15th, 1998

2 Introduction
- The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems
- 7 scientific benchmarks that represent the most common computation kernels
- NPB is written on top of the Message Passing Interface (MPI) for portability
- NPB is a Constant Problem Size (CPS) scaling benchmark suite
- This study focuses on understanding NPB scaling on both NOW and the SGI Origin 2000

3 Motivation
- An early study of NPB shows ideal speedup on NOW!
  - Scaling as good as the T3D and better than the SP-2
  - Per-node performance better than the T3D, close to the SP-2
- Submitted results for the Origin 2000 show a spread

4 Presentation Outline
- Hardware Configuration
- Time Breakdown of the Applications
- Communication Performance
- Computation Performance
- Conclusion

5 Hardware Configuration
- SGI Origin 2000 (64 nodes)
  - MIPS R10000 processor, 195 MHz, 32KB/32KB L1
  - 4MB external L2 cache per processor
  - 16GB memory total
  - MPI performance: 13 μsec one-way latency, 150 MB/s peak bandwidth, half-power at 8KB message size
- Network Of Workstations (NOW)
  - UltraSPARC I processor, 167 MHz, 16KB/16KB L1
  - 512KB external L2 cache per processor
  - 128MB memory per processor
  - MPI performance: 22 μsec one-way latency, 27 MB/s peak bandwidth, half-power at 4KB message size
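To make the half-power figures above concrete, here is a minimal sketch of how delivered bandwidth could vary with message size. It assumes a simple saturation-curve model in which bandwidth reaches half of peak exactly at the half-power message size; that model, the function name, and the reading of the peak figures as MB/s are assumptions for illustration, not something the slides state.

```python
def effective_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Delivered bandwidth in MB/s under an assumed saturation-curve model.

    By construction the result is peak/2 when msg_bytes == half_power_bytes.
    """
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)


# Peak and half-power values are taken from the hardware slide;
# the model that connects them is an assumption.
ORIGIN = {"peak_mb_s": 150.0, "half_power_bytes": 8 * 1024}
NOW = {"peak_mb_s": 27.0, "half_power_bytes": 4 * 1024}

if __name__ == "__main__":
    for size in (1024, 4096, 8192, 65536):
        origin_bw = effective_bandwidth(size, **ORIGIN)
        now_bw = effective_bandwidth(size, **NOW)
        print(f"{size:6d} bytes  Origin {origin_bw:6.1f} MB/s  NOW {now_bw:5.1f} MB/s")
```

Under this model a machine with a higher peak but a larger half-power size needs larger messages before its advantage shows, which is one reason micro-benchmark peaks alone may not predict application-level communication performance.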

6 Time Breakdown -- LU
- Black line -- total running time
  - a job that takes one person 10 secs
  - ideally takes 2 people 5 secs
  - total amount of work is still 10 secs
- In practice there is more work, because of the need to communicate
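The accounting behind this breakdown can be sketched as follows: total work is the number of processors times the running time, and anything above the single-node running time is extra work. The function names are hypothetical, and the 6-second two-processor run is a made-up figure for illustration; only the 10-second and 5-second values come from the slide.

```python
def total_work(processors, running_time_s):
    """Total processor-seconds consumed by a run."""
    return processors * running_time_s


def extra_work(processors, running_time_s, single_node_time_s):
    """Processor-seconds spent beyond what the single-node run needed."""
    return total_work(processors, running_time_s) - single_node_time_s


if __name__ == "__main__":
    single = 10.0  # from the slide: one worker needs 10 secs

    # Ideal case from the slide: 2 workers finish in 5 secs, no extra work.
    print(total_work(2, 5.0), extra_work(2, 5.0, single))

    # Hypothetical non-ideal case: 2 workers need 6 secs each,
    # so 2 processor-seconds go to overhead such as communication.
    print(total_work(2, 6.0), extra_work(2, 6.0, single))
```

A flat total-work line therefore corresponds to ideal speedup, and a rising one shows how much overhead scaling has added.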

7 Time Breakdown -- LU

8 Time Breakdown -- SP

9 Communication Performance
- Micro-benchmarks show that the SGI O2000 has better point-to-point (pt2pt) communication performance than NOW

10 Communication Efficiency
- Absolute delivered bandwidths are close
  - SP/32 on NOW -- 215s
  - SP/32 on SGI -- 289s
- Communication on SGI achieved only 30% of the potential bandwidth
- Protocol tradeoffs are pronounced
  - hand-shake vs. bulk-send in pt2pt
  - collective ops

11 Computation Performance
- Relative single-node performance of the benchmarks roughly matches the difference in processor performance
- Both computational CPI and L2 misses change significantly on both platforms when scaled

12 Recap on CPS Scaling
[Figure: axis ticks at 4, 8, 16, 32, 64, 128, 256]

13 LU Working Set
- 4-processor
  - Knee starts at 256KB

14 LU Working Set
- 4-processor
  - Knee starts at 256KB
- 8-processor
  - Knee starts at 128KB

15 LU Working Set
- 4-processor
  - Knee starts at 256KB
- 8-processor
  - Knee starts at 128KB
- 16-processor
  - Knee starts at 64KB

16 LU Working Set
- 4-processor
  - Knee starts at 256KB
- 8-processor
  - Knee starts at 128KB
- 16-processor
  - Knee starts at 64KB
- 32-processor
  - Knee starts at 32KB
- Miss rate drops as the global cache grows from 2MB to 4MB
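The knee positions above halve each time the processor count doubles, which is what constant-problem-size scaling would suggest if each processor's working set is its share of a fixed total. A minimal sketch of that relationship follows; the inverse-proportional form is an assumption fitted to the four reported points, and the names are hypothetical. The global-cache helper assumes the aggregate cache is simply the per-node L2 size times the node count; with the 512KB per-node L2 listed for NOW this gives 2MB at 4 nodes and 4MB at 8 nodes, though the slide does not say which configuration its 2MB and 4MB figures refer to.

```python
KNEE_AT_4_PROCS_KB = 256  # from the slide: 4-processor knee


def knee_kb(processors):
    """Per-processor knee, assuming it scales inversely with processor count."""
    return KNEE_AT_4_PROCS_KB * 4 / processors


def global_cache_kb(processors, l2_per_node_kb):
    """Aggregate L2 capacity, assuming it is just the sum over all nodes."""
    return processors * l2_per_node_kb


if __name__ == "__main__":
    for p in (4, 8, 16, 32):
        print(f"{p:2d} processors: knee {knee_kb(p):5.0f} KB, "
              f"global cache {global_cache_kb(p, 512) / 1024:4.0f} MB")
```

Once the per-processor knee falls below the per-node cache size, the working set fits in cache, which is consistent with the miss-rate drop the slide reports as the global cache grows.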

17 SP Working Set
- Cost under scaling
  - extra work worsens the memory system's performance
  - total memory references on SGI
    - 4-processor run: 64.38 billion memory references
    - 25-processor run: 72.35 billion memory references
    - 12.38% increase
[Figure labels: Cost, Benefit]
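The 12.38% figure follows directly from the two reference counts. A short check, with a hypothetical helper name:

```python
def percent_increase(before, after):
    """Relative growth from before to after, as a percentage."""
    return (after - before) / before * 100


if __name__ == "__main__":
    refs_4_procs = 64.38   # billions of memory references, from the slide
    refs_25_procs = 72.35  # billions of memory references, from the slide
    print(f"{percent_increase(refs_4_procs, refs_25_procs):.2f}% increase")
```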

18 Conclusion
- NPB
  - communication performance is hard to predict from μ-benchmarks
  - increases in global cache effectively reduce computation time
  - sequential node architecture is a dominant factor in NPB performance
- NOW
  - an inexpensive way to go parallel
  - absolute performance is excellent
  - MPI on NOW has good scalability and performance
  - NOW vs. proprietary systems -- ability to instrument in detail
- Speedup cannot tell the whole story; scalability involves:
  - the interplay of program and machine scaling
  - delivered communication performance, not μ-benchmarks
  - complicated memory system performance

