Performance Comparison of Niagara, Xeon, and Itanium2 Daekyeong Moon

Performance Comparison of Niagara, Xeon, and Itanium2 Daekyeong Moon (dkmoon@cs)

Outline
- Evaluated Machines
- SPEC CPU Result
- SPEC WEB Result
- Using Fixed Point Number
- MTTF Issue
- Conclusion
(Pictures from www.sun.com and www.intel.com)

Evaluated Machines
                        Sun T2000          Dell 1850          HP rx1620
CPU                     Niagara 1GHz       Xeon 3GHz          Itanium2 1.3GHz
Chip/Core/Thread        1/8/32             2/2/2              2/2/2
L1 Cache (I/D)          16KB/8KB (/core)   ~12KB/16KB         16KB/16KB
L2 Cache                3MB                1MB                256KB
L3 Cache                N/A                N/A                3MB
FU (int/fp) / chip      8/1                2/1                6/2
Word length             64 bits            32 bits            64 bits
Multithreads            Fine-grain         Deep out-of-order  Static (VLIW)
Max Watt / chip         79 W               111 W              130 W
Memory                  8 GB               3 GB               4 GB
Disks                   1 x 73 GB          2 x 147 GB         2 x 73 GB
OS                      Solaris 10         Linux 2.6.11       Linux 2.6.11
Dimension               2U                 1U                 1U
(Data are from Google search and datasheets from www.intel.com and www.sun.com)

SPEC CPU 2000
- Integer benchmarks (CINT2000)
  - 11 C programs: gzip, vpr, gcc, mcf, crafty, parser, perlbmk, gap, vortex, bzip2, twolf
  - 1 C++ program: eon
- Floating Point benchmarks (CFP2000)
  - 3 C programs: mesa, equake, ammp
  - 6 Fortran 77 programs: wupwise, swim, mgrid, applu, apsi, sixtrack
  - 4 Fortran 90 programs: fma3d, facerec, galgel, lucas
- Speed vs. Throughput
  - The basic run measures speed with a single benchmark instance
  - SPEC rate runs multiple copies of a benchmark simultaneously to measure throughput

SPEC CPU 2000
- 3 different architectures
- 2 different compilers for each architecture
- With and without optimization for each compiler
- Notes
  - Optimization for Itanium2 includes profile-directed optimization
    - Itanium relies on the compiler's static instruction scheduling
  - Floating point optimization is turned off for some applications
    - Optimization that relies on extended precision generates incorrect results
    - e.g. vpr under Xeon-icc, gcc under Itanium2-gcc, applu under Itanium2-gcc
  - FP benchmarks written in Fortran 90 couldn't be evaluated
    - Due to unavailability of compilers (i.e. icc & suncc)
    - i.e. fma3d, facerec, galgel, lucas

SPEC int result (Integer Speed) Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21) Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC fp result (FP Speed) Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21) Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC int rate result (Integer Throughput) Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21) Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC web 2005 Operation Overview
- Components
  - Server Under Test (SUT)
  - Prime Client
  - Workhorse Clients (I used 20)
  - Backend Simulator
- Workload
  - SPECweb_Banking: dynamic pages through SSL
  - SPECweb_Ecommerce: dynamic pages through SSL + non-SSL
  - SPECweb_Support: static pages through non-SSL to test large file transfer
- Metric
  - # of sessions at which the SUT sustains "Tolerable QoS" of more than 99% and "Good QoS" of more than 95%
  - To figure out the number:
    - Initial guess, then trial and error
    - I increased the load by 100 sessions per step
  - SPECweb value: GeoMean of the three workloads
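The metric on this slide can be sketched in a few lines of Python. This is only an illustration of the procedure as described here (QoS thresholds, 100-session steps, geometric mean), not the SPECweb harness itself. The names `meets_qos`, `find_max_sessions`, `specweb_value`, the `run_workload` callback, and the `limit` parameter are all hypothetical.

```python
import math

def meets_qos(tolerable, good):
    """True if a run sustains "Tolerable QoS" > 99% and "Good QoS" > 95%."""
    return tolerable > 0.99 and good > 0.95

def find_max_sessions(run_workload, initial_guess, step=100, limit=100_000):
    """Raise the session count by `step` until the QoS check fails.

    `run_workload(sessions)` is a hypothetical callback that runs one
    workload and returns (tolerable_fraction, good_fraction).
    Returns the last passing session count, or None if the initial
    guess already fails.
    """
    best = None
    sessions = initial_guess
    while sessions <= limit:
        tolerable, good = run_workload(sessions)
        if not meets_qos(tolerable, good):
            break
        best = sessions
        sessions += step
    return best

def specweb_value(banking, ecommerce, support):
    """Geometric mean of the three per-workload session counts."""
    return math.prod([banking, ecommerce, support]) ** (1.0 / 3.0)
```

The geometric mean keeps one strong workload from hiding a weak one: a server that scores well only on static pages still gets a low combined value.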

SPEC WEB network topology (diagram)
- Components: 20 Workhorse Clients, Prime Client, Backend Simulator, SUT
- Labels: InitQuery, Static Data, Dynamic Data
- Static workload files: 50GB per 1500 sessions

SPEC WEB 2005 Tested Web Server
- Zeus + PHP
- cf. Sun's report on Niagara used a Java web server + Java Servlets (14,000 simultaneous sessions!!)
- But Java is especially optimized on SPARC + Solaris => It's NEVER fair!! => USE PHP!!

SPEC web result Reference1: Niagara (1.2GHz CPU, Sun Java Web Server 6.1 + JSP) Reference2: Xeon (2x3.8GHz CPU w/ HT, 2MB L2, Zeus + JSP)

SPEC web result 2

Weak FP with Niagara?
- Slow FPU
  - FADD/FMULd/FDIVd = 26/29/83 cycles
  - cf. Integer: ADD/MULX/UDIVX = 1/11/72 cycles
  - What if we use a faster one?
- 8 cores share one FPU
  - What if each core has one?
- → Fixed-point numbers using 64-bit integer registers
  - Not a precise experiment, but we can see the possibility

Using Fixed Point
- IEEE 754 Floating Point (64-bit)
  - 1 bit for sign (bit 63)
  - 11 bits for exponent (bits 62-52)
  - 52 bits for fraction (bits 51-0)
- Fixed point: 32 bits before the decimal point + 32 bits after the decimal point (bits 63-0)
  - Add(X, Y) = (X) + (Y)
  - Mul(X, Y) = ((X) * (Y)) >> 32
- Note: the tested Xeon has a 32-bit word (IA-32 Xeon)
  - long long is used for 64-bit values
  - This can degrade performance
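The Add and Mul rules above can be tried out with a minimal Python sketch of the 32.32 layout. It is not the code used in the experiment. The helper names (`to_fixed`, `to_float`, `fx_add`, `fx_mul`) are hypothetical. Python integers are unbounded, so the sketch does not model 64-bit register overflow; in C, the intermediate product in Mul would not fit in a 64-bit `long long` for most operands.

```python
FRAC_BITS = 32            # bits after the decimal point
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to 32.32 fixed point (rounded to nearest)."""
    return int(round(x * ONE))

def to_float(f):
    """Convert a 32.32 fixed-point value back to a float."""
    return f / ONE

def fx_add(x, y):
    """Add(X, Y) = X + Y: plain integer addition."""
    return x + y

def fx_mul(x, y):
    """Mul(X, Y) = (X * Y) >> 32: the product carries 64 fraction
    bits, so shift right by 32 to return to the 32.32 format."""
    return (x * y) >> FRAC_BITS

a = to_fixed(1.5)
b = to_fixed(2.25)
total = to_float(fx_add(a, b))
product = to_float(fx_mul(a, b))
```

Both operations map to integer instructions only, which is why the approach sidesteps the shared FPU.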

Fixed Point Result 1-1 (Response Time – What if using faster FPU)

Fixed Point Result 1-2 (Response Time – What if using faster FPU)

Fixed Point Result 2-1 (Throughput – What if each core has a FU?)

Fixed Point Result 2-2 (Throughput – What if each core has a FU?)

Outline
- Evaluated Machines
- SPEC CPU Result
- SPEC WEB Result
- Using Fixed Point
- MTTF Issue
- Conclusion
(Pictures from www.sun.com and www.intel.com)

Putting all eggs in one basket?
- Two scenarios
  - 32 machines cooperate for a service
    - One machine's failure results in a service outage
    - MTTF' = MTTF / 32
    - Niagara can be justified
  - 32 machines are replicas just for load balancing
    - "Partial service is better than service outage"
    - What about Niagara?
      - Not enough redundancy: only two redundant power supplies and fans
      - What if part of the CPU, memory subsystem, or disks fails?
- What if the service outage penalty is much greater than the server management cost? => Usually TRUE
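The two scenarios can be put into numbers with a small sketch. It assumes the 32 machines fail independently at a constant rate, which is what MTTF' = MTTF / 32 implies. The function names and the 32,000-hour figure are hypothetical and not taken from the slides.

```python
def series_mttf(machine_mttf_hours, n_machines):
    """Scenario 1: any single failure takes the service down, so the
    expected time to the first failure is MTTF / n (assuming
    independent machines with a constant failure rate)."""
    return machine_mttf_hours / n_machines

def remaining_capacity(n_machines, n_failed):
    """Scenario 2: replicas behind a load balancer. A failure removes
    one machine's share of capacity instead of the whole service."""
    return (n_machines - n_failed) / n_machines

# Hypothetical figure: each machine has an MTTF of 32,000 hours.
cluster_mttf = series_mttf(32_000.0, 32)
after_one_failure = remaining_capacity(32, 1)
```

With these assumed numbers the cooperating cluster fails every 1,000 hours on average, while the replicated cluster still serves 31/32 of its load after one failure. A single consolidated box has no such partial-failure mode.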

Conclusion
- Evaluated 3 machines by SPEC
- Niagara shows the poorest performance in terms of response time
  - Due to the slow CPU, simple pipeline, and the lack of FUs
- Niagara outperforms the others in terms of throughput
  - 32 threads
- Having an FPU for each core can be beneficial
- I'm skeptical about consolidating too many servers without redundancy
(Pictures from www.sun.com and www.intel.com)
