# Raspberry Pi Performance Benchmarking


Justin Moore Salish Kootenai College

Overview
- Raspberry Pi Cluster Build
- Performance
- Benchmark Tools
- Tuning
- Analysis
- Conclusion

Raspberry Pi Cluster – Model B
- CPU: Broadcom ARM1176JZF-S at 700 MHz; can be overclocked to 1000 MHz
- RAM: 512 MB, split 448 MB/64 MB between CPU and GPU
- Linux OS
- 10/100BASE-T Ethernet port
- Size of a credit card
- Price: \$35

Raspberry Pi Cluster – Setup
- Router: Western Digital N750
- NFS-mounted external hard drive: 500 GB, Buffalo Inc.
- 4-Pi cluster
- USB hub: Manhattan 7-port USB 2.0
- SD cards: Kodak 8 GB
- Total cost: ~\$200

Performance – Floating-Point Operations
- How do we measure performance? A busy CPU? Speed?
- What is a floating-point operation (FLOP)? An arithmetic operation on floating-point values, in single- or double-precision format
- Performance is measured in FLOPS: floating-point operations per second
- How do we measure FLOPS? General Matrix Multiplication (GEMM), which has high computational intensity that grows with matrix size
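The FLOPS arithmetic behind a GEMM benchmark can be sketched in a few lines (the function name and sample numbers are illustrative, not from the talk):

```c
/* An N x N GEMM performs N*N dot products of length N, each costing
 * N multiplies and N adds: 2*N^3 floating-point operations in total.
 * Dividing by the measured wall-clock time gives FLOPS. */
double gemm_flops(int n, double seconds)
{
    return 2.0 * (double)n * n * n / seconds;
}
```

For example, a 1000×1000 GEMM that takes 10 seconds sustains 2×10⁹ / 10 = 200 MFLOPS.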

Performance Benchmarking
Single Raspberry Pi:
- BLAS – Basic Linear Algebra Subprograms
- ATLAS – Automatically Tuned Linear Algebra Software; auto-tunes BLAS for any system

Raspberry Pi Cluster:
- MPI – Message Passing Interface: a standard API for inter-process communication that facilitates parallel programming (MPICH)
- HPL – High Performance LINPACK: tuned MPI combined with ATLAS
- Custom code (MyGEMM): built on ATLAS with added parallel capability, compared against HPL

MyGEMM – Naïve Implementation
C = A * B, where N is the matrix size:

    for i = 1 to N
      for j = 1 to N
        for k = 1 to N
          C(i,j) = C(i,j) + A(i,k) * B(k,j)

Striding through memory this way is bad!
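The loop nest above translates directly to C. A minimal zero-based sketch (the function name `mygemm_naive` is mine; matrices are assumed row-major in flat arrays, with C zeroed by the caller):

```c
/* Naive triple-loop GEMM: C += A * B for n x n row-major matrices. */
void mygemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* B is read down a column: stride-n accesses that
                 * defeat the cache once n is large */
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```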

MyGEMM – Naïve Pitfalls
- The naïve GEMM method is inefficient
- Two whole matrices are loaded into memory
- Cache is not used efficiently: the inner loop strides through the matrix
- If not naïve, then what?

MyGEMM – Software Tuning
- What is block matrix multiplication? The matrix is split into smaller sub-matrices (blocks); the slide shows a 4×4 partition into blocks A11 through A44
- Shrinking the working set lets blocks of both A and B fit into fast memory at the same time
- Reduces striding by fully utilizing the memory hierarchy
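The blocking idea can be sketched as a tiled version of the naive loop nest (a sketch under my own naming; the block size `bs` would be tuned so that tiles of A, B, and C fit in cache together):

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Blocked GEMM: C += A * B computed one bs x bs tile at a time, so the
 * tiles currently in use stay resident in cache instead of being
 * evicted by long strided sweeps. Matrices are n x n, row-major. */
void mygemm_blocked(int n, int bs, const double *A, const double *B,
                    double *C)
{
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                /* multiply one pair of tiles */
                for (int i = ii; i < MIN(ii + bs, n); i++)
                    for (int j = jj; j < MIN(jj + bs, n); j++)
                        for (int k = kk; k < MIN(kk + bs, n); k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

Every (i, j, k) triple is still visited exactly once, so the result matches the naive version; only the visit order changes.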

MyGEMM – Cluster Software Tuning
- How do we distribute the matrix multiplication? MPI is used to distribute blocks to the nodes; the slide shows each block-row (A11–A14 through A41–A44) assigned to one of Nodes 1–4
- Why divide the matrix? Dividing it by the number of nodes in the cluster allows matrix multiplication at larger sizes
- What is cache blocking? Each node's block size is further reduced to fit its cache exactly
- MyGEMM allows experimentation with block size
- Additional tuning: row/column-wise multiplication and loop unrolling
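The block-row distribution above boils down to partitioning n rows across p nodes. A minimal sketch of that bookkeeping, with the MPI machinery left as comments (the function name is mine, and the use of MPI_Scatterv is my assumption about how such a distribution would typically be wired up):

```c
/* Split n matrix rows across p ranks as evenly as possible: the first
 * (n % p) ranks get one extra row. In an MPI program, each rank would
 * receive its rows of A (e.g. via MPI_Scatterv), a full copy of B,
 * compute its slice of C, and gather the slices back on rank 0. */
void rows_for_rank(int n, int p, int rank, int *first, int *count)
{
    int base  = n / p;   /* rows every rank gets        */
    int extra = n % p;   /* leftover rows to spread out */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

With n = 4 rows and p = 4 nodes (the slide's case) every node gets exactly one block-row; with sizes that don't divide evenly, the early ranks absorb the remainder.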

Hardware Tuning
The Raspberry Pi allows CPU overclocking and memory sharing between CPU and GPU:
- Memory: 512 MB total onboard, shared between CPU and GPU; up to 496 MB can be given to the CPU. More memory = larger matrices
- CPU clock: can be raised from 700 MHz up to 1000 MHz
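On the Pi these settings live in the boot configuration file. A sketch of the relevant `/boot/config.txt` entries matching the values above (exact safe settings vary per board and firmware version):

```ini
# /boot/config.txt – overclock and memory-split settings
arm_freq=1000   # CPU clock in MHz (default 700)
gpu_mem=16      # MB reserved for the GPU, leaving ~496 MB for the CPU
```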

Conclusion
Performance is dependent on key factors:
- Matrix size
- Block size
- RAM size
- CPU speed

On the 1993 Top 500 list, the cluster would have tied at #292 with General Motors.

Conclusion – Pi Cluster vs. Yellowstone

|              | Pi Cluster                      | Yellowstone                                 |
|--------------|---------------------------------|---------------------------------------------|
| CPU          | ARM11, 32-bit, double precision | Sandy Bridge Xeon, 64-bit, double precision |
| Power        | 15.6 Watts                      | 1.4 MWatts                                  |
| Performance  | 836 MFLOPS                      | 1.2 PFLOPS                                  |
| Price        | \$250.00                        | \$22,500,000.00                             |
| MFLOPS/W     | 54.8                            | 875.3                                       |
| MFLOPS/\$    | 3.4                             | 57.2                                        |

Thank You
Dr. Richard Loft, Raghu Raj Prasanna Kumar, Amogh Simha, Stephanie Barr

Questions?