LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung 05700512 Wong Chung Hoi 05596742 Supervised by Prof. Michael R. Lyu Department of Computer.

Presentation transcript:

Slide 1: LYU0703 Parallel Distributed Programming on PS3
Huang Hiu Fung, Wong Chung Hoi
Supervised by Prof. Michael R. Lyu
Department of Computer Science and Engineering, CUHK
Final Year Project Presentation (1st term)

Slide 2: Agenda
- Background Information
- Architecture of PlayStation® 3
- Principles of Parallel Programming
- Optimization of the ADVISER program:
  1. Sequential Approach
  2. Parallel Approach
- Conclusion
- Future Works
- Q&A

Slide 3: Background Information
Limitations of single-core processors:
1. Memory access latency
2. Wire delays
3. Power consumption

Slide 4: Power Consumption
P = C × V^2 × F
- P = power
- C = capacitance
- V = voltage
- F = processor frequency (cycles per second)

Slide 5: Development of Multi-Core Processor
Fig. 1.4 Growth of No. of Cores in Processors

Slide 6: Development of Multi-Core Processor
- Reduce power consumption: use multiple cores at a low frequency instead of one core at a high frequency
- Efficient processing of multiple tasks: divide the computation work and execute it among the cores concurrently
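A rough numeric illustration of the power argument, assuming the standard dynamic-power relation P = C × V^2 × F. The capacitance, voltage, and frequency values are made up for illustration and do not come from the slides. The sketch also assumes that the lower-frequency cores can run at a lower voltage, which is where most of the saving comes from.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power under the assumed relation P = C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical values, chosen only for illustration.
C = 1.0e-9  # farads switched per cycle

# One core at a high frequency and a high voltage.
single_core = dynamic_power(C, voltage=1.2, frequency=4.0e9)

# Two cores at half the frequency, assumed to run at a reduced voltage.
dual_core = 2 * dynamic_power(C, voltage=0.9, frequency=2.0e9)

print(f"one fast core : {single_core:.2f} W")
print(f"two slow cores: {dual_core:.2f} W")
```

Both configurations offer the same aggregate cycles per second, but only if the work can actually be divided among the cores.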

Slide 7: Project Objectives
- Parallel programming is needed to optimize computation-intensive applications
- Study the features of parallel programming and compare the sequential and parallel approaches
- Optimize an application to show the improvement gained from parallel programming

Slide 8: Architecture of PlayStation® 3 (PS3)
- A multi-core machine produced by Sony, built on the Cell Broadband Engine
- Strong computation power
- Open platform for other applications and development

Slide 9: Cell Broadband Engine (Cell BE)
- PPE: Power Processor Element
- SPE: Synergistic Processor Element
- EIB: Element Interconnect Bus

Slide 10: Power Processor Element (PPE)
- Based on the 64-bit PowerPC architecture
- General-purpose operation
- Designed for control-intensive work
- Controls I/O of main memory and other devices through the OS
- Controls all 8 SPEs
Fig. 2.5 Design of PPE

Slide 11: Synergistic Processor Element (SPE)
- Designed to provide computation performance
- SPU: performs the allocated task
- LS: the only memory
- MFC: controls data transfer
- 8 SPEs in total in the Cell, of which only 6 are accessible: 1 is reserved for system software and 1 is disabled
Fig. 2.6 Design of an SPE

Slide 12: Element Interconnect Bus (EIB)
- Internal communication bus inside the Cell
- Connects the different elements: PPE, SPEs, memory controller
Fig. 2.7 Data Flow and Program Control

Slide 13: Principles of Parallel Programming
Parallel algorithm | Serial algorithm
multiple processing units | single processing unit
communication overhead | no communication overhead
higher complexity in code | straightforward code
must ensure load balance between PUs | everything is done by the CPU

Slide 14: Concept of Load Balance
- Distribute data evenly
- Total runtime depends on the busiest processing element
- An idling processing element wastes computation time
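A minimal sketch of the load-balance point, using made-up work amounts (in arbitrary time units) for six processing elements. The function names are hypothetical. Both distributions carry the same total work, yet the unbalanced one takes far longer because every element waits for the busiest one.

```python
def parallel_runtime(work_per_pe):
    """Total runtime is set by the busiest processing element."""
    return max(work_per_pe)

def idle_time(work_per_pe):
    """Time the other elements spend waiting for the busiest one."""
    finish = max(work_per_pe)
    return sum(finish - w for w in work_per_pe)

# Hypothetical workloads: both lists add up to 60 units of work.
balanced = [10, 10, 10, 10, 10, 10]
unbalanced = [5, 5, 5, 5, 5, 35]

print(parallel_runtime(balanced), idle_time(balanced))
print(parallel_runtime(unbalanced), idle_time(unbalanced))
```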

Slide 15: Methods of Parallelism
- Data parallelism
- Task parallelism

Slide 16: Parallel Architecture
Flynn's taxonomy:
 | Single Instruction | Multiple Instruction
Single Data | SISD | MISD
Multiple Data | SIMD | MIMD

Slide 17: SISD
- Traditional computer
- von Neumann model

Slide 18: SIMD
- Same instruction on all data
- Data parallelism
- SIMD intrinsic functions

Slide 19: MISD
- No well-known system
- Mentioned for completeness

Slide 20: MIMD
- Different instructions on different data
- Task parallelism
- Further broken down into:
  - Shared Memory System
  - Distributed Memory System

Slide 21: Shared Memory System
- Access to central memory for data
- PS3: achieved by the MFC issuing DMA commands

Slide 22: Distributed Memory System
- Each PE has its own memory
- PS3: each SPE has a 256 KB Local Store
- The PS3 is a hybrid shared-distributed memory system

Slide 23: ADVISER
Comparing 2 video clips:
1. Generate meaningful data (in the form of numbers) for the frames of each video
2. Compare and look for the most similar frames
3. Locate the similar segment, which consists of a series of very similar frames

Slide 24: Input
- 2 folders, “Repository” and “Target”
- hl3 file = vector of 1024 double-precision values
Input | No. of hl3 files
Target directory | 5473
Repository directory | 7547

Slide 25: Processing
- hl3 file = vector of 1024 double-precision values
- A similarity value is computed between File P and File Q
- The smaller the value, the better the match
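The similarity equation itself is not preserved in this transcript. As a sketch only, the function below assumes the measure is the sum of squared element-wise differences, which is what the subtract, multiply, and accumulate loop on the later "Applying SIMD for Loop Counter" slides computes. Whether the original program applies a square root or any scaling afterwards is not stated, and the function name is hypothetical.

```python
def difference(p, q):
    """Sum of squared element-wise differences; smaller means more similar.

    Assumed measure: the exact formula is not given in the transcript.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Identical vectors give 0; the value grows as the vectors diverge.
print(difference([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(difference([0.0, 0.0], [3.0, 4.0]))
```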

Slide 26: Output
- M “Target” files, N “Repository” files
- Complexity: O(M * N)
- Computation time = 633 sec
- Flash demo
target hl3 1 | most match repository A | difference value = ??
target hl3 2 | most match repository B | difference value = ??
target hl3 3 | most match repository C | difference value = ??
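A minimal sketch of the O(M * N) search described on this slide: every target vector is compared against every repository vector and the closest one is kept. The sum-of-squared-differences measure and all function names are assumptions for illustration, not the original ADVISER code.

```python
def difference(p, q):
    """Assumed measure: sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def best_matches(targets, repository):
    """For each target, return (index of closest repository vector, difference)."""
    results = []
    for t in targets:                        # M iterations
        best_index, best_diff = None, float("inf")
        for j, r in enumerate(repository):   # N iterations per target
            d = difference(t, r)
            if d < best_diff:
                best_index, best_diff = j, d
        results.append((best_index, best_diff))
    return results

# Tiny hypothetical data set: 2 targets, 3 repository vectors.
targets = [[0, 0], [5, 5]]
repository = [[1, 1], [4, 4], [9, 9]]
matches = best_matches(targets, repository)
print(matches)

# Number of comparisons for the file counts given on the Input slide.
comparisons = 5473 * 7547
print(comparisons)
```

With 5473 target files and 7547 repository files this is 41,304,731 vector comparisons, each over 1024 values.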

Slide 27: Parallel Version
- Data parallelism
- Split the data evenly among 6 SPEs
- Computation time with 6 SPEs = 330 sec
- Flash demo
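One way to split work evenly among 6 workers, shown as a sketch. The slides do not say which data set is partitioned or how. Here the 5473 target files are assumed to be divided into contiguous chunks, and `split_evenly` is a hypothetical helper.

```python
def split_evenly(items, workers):
    """Split items into `workers` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Assumed workload: indices of the 5473 target files, shared among 6 SPEs.
file_ids = list(range(5473))
chunks = split_evenly(file_ids, 6)
sizes = [len(c) for c in chunks]
print(sizes)
```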

Slide 28: Parallel Version
- Expected speedup: 6X
- Actual speedup: 2X (330 sec on 6 SPEs against 633 sec on the PC)
- The PC CPU, the PPU, and the SPE all run at different speeds
- Computation time with CPU = 633 sec
- Computation time with 1 SPE = 1928 sec
- Computation time with PPU = 3119 sec
- Speed: CPU > SPE > PPU

Slide 29: Time Attack
1. SIMD intrinsic function
2. Changing data type
3. Double buffering
4. Parallel read
5. Distributing job to idling PPE
6. SIMD on loop counter
7. Loop unrolling

Slide 30: SIMD intrinsic function
- Addition, subtraction, multiplication, etc.
- Operates on 128-bit registers
- Data type: double (64 bits)
- Speedup: 2X

Slide 31: Changing Data Type to int
- Precision is not important
- The major speedup comes from SIMD intrinsics
- Data type: int (32 bits)
- Total speedup: 4X
- Computation time = 71 sec
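The 2X and 4X figures follow from how many values fit in one 128-bit register. A small sketch of that arithmetic (the helper name is hypothetical):

```python
REGISTER_BITS = 128  # width of the SIMD registers named on the slides

def lanes(element_bits):
    """Number of values of the given width that fit in one register."""
    return REGISTER_BITS // element_bits

for name, bits in [("double", 64), ("int", 32)]:
    print(f"{name:6s}: {lanes(bits)} values per SIMD operation")
```

The lane count is the ideal speedup of the arithmetic alone; measured speedups also depend on memory transfers and loop overhead.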

Slide 32: Changing Data Type to float
- The SPE is specialized for high-precision computation
- No intrinsic for the int data type at all
- Data type: float (32 bits)
- Saves data conversion time
- Speedup of 30%
- Computation time = 49 sec

Slide 33: Double buffering
- Saves communication time
- Overlaps MFC data transfer with SPU computation
- 2 buffers:
  - Prefetching
  - Processing
- The program is not heavy in communication
- Minor speedup
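A sketch of the double-buffering pattern in plain Python, not Cell SDK code. A background thread stands in for the MFC transfer and the main thread stands in for the SPU. `fetch` and `process` are hypothetical placeholders for the DMA transfer and the computation.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(chunk_id):
    """Stand-in for a data transfer into a local buffer (hypothetical data)."""
    return [chunk_id * 4 + k for k in range(4)]

def process(buffer):
    """Stand-in for the computation on one buffer."""
    return sum(x * x for x in buffer)

def run_double_buffered(num_chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = mover.submit(fetch, 0)              # prefetch the first buffer
        for chunk_id in range(num_chunks):
            current = pending.result()                # wait for the buffer in flight
            if chunk_id + 1 < num_chunks:
                # Start fetching the next buffer before computing on this one.
                pending = mover.submit(fetch, chunk_id + 1)
            results.append(process(current))          # compute while the transfer runs
    return results

print(run_double_buffered(3))
```

The gain is bounded by the transfer time that can be hidden, which is consistent with the minor speedup reported for a program that is not communication-heavy.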

Slide 34: Parallel Reading for All Files
- Read “Target” and “Repository” concurrently
- Share the file-reading job among the SPEs
- No improvement as predicted; it was even slower
- Reason: the hard disk cannot handle concurrent requests
- Failed attempt

Slide 35: Distributing Job to Idling PPE
- Current job of the PPE: read files, distribute files, collect results
- Use its stall time to do some computation
- The PPE has relatively low computation power
- No significant improvement
- Increases program complexity
- This approach was abandoned

Slide 36: Applying SIMD for Loop Counter
Major computation power is consumed in:
1. Initialize i = 0, diff = (0, 0, 0, 0).
2. For i < (number of float numbers in a file) / (number of floats packed in a register):
   A. temp = SIMD subtraction on vector i in the “Target” and “Repository” files.
   B. diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = i + 1.
4. Loop back to 2.
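The loop above can be emulated in plain Python to make the data flow concrete. This is a sketch, not SPU code: the `simd_*` helpers are hypothetical stand-ins for 4-lane SIMD intrinsics, and the final horizontal sum of the four partial sums is an assumption, since the slide stops at the accumulation step.

```python
LANES = 4  # 32-bit floats packed in one 128-bit register

def simd_sub(a, b):
    return [x - y for x, y in zip(a, b)]

def simd_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

def squared_difference(target, repository):
    """Accumulate (target - repository)^2 four lanes at a time."""
    diff = [0.0] * LANES                      # step 1
    i = 0
    while i < len(target) // LANES:           # step 2
        t = target[i * LANES:(i + 1) * LANES]
        r = repository[i * LANES:(i + 1) * LANES]
        temp = simd_sub(t, r)                 # step A
        diff = simd_add(simd_mul(temp, temp), diff)  # step B
        i = i + 1                             # step 3
    return sum(diff)  # assumed final reduction of the 4 partial sums

target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
repository = [0.0] * 8
print(squared_difference(target, repository))
```

The sketch assumes the vector length is a multiple of 4, which holds for the 1024-value files described earlier.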

Slide 37: Applying SIMD for Loop Counter
- Try to optimize step 3 (the loop-counter increment)
- Apply SIMD to the loop counter
- Addition and comparison operations are reduced by a factor of 8

Slide 38: Applying SIMD for Loop Counter
1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. For i[0] < (number of float numbers in a file) / (number of floats packed in a register):
   temp = SIMD subtraction on vector i[0] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[1] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[2] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[3] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[4] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[5] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[6] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[7] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = SIMD addition (i, (8, 8, 8, 8, 8, 8, 8, 8)).
4. Loop back to 2.
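A plain-Python emulation of the unrolled loop above, as a sketch. `accumulate` is a hypothetical helper that folds the subtraction, multiplication, and addition for one vector into a single call, and the final sum over the four lanes is an assumption. The function also returns the iteration count so the reduction in counter updates can be checked.

```python
LANES = 4    # floats per 128-bit register
UNROLL = 8   # vectors handled per loop iteration, as on the slide

def accumulate(diff, target, repository, v):
    """Return diff + (target_v - repository_v)^2 for vector number v, lane by lane."""
    base = v * LANES
    return [d + (target[base + k] - repository[base + k]) ** 2
            for k, d in enumerate(diff)]

def squared_difference_unrolled(target, repository):
    num_vectors = len(target) // LANES
    if num_vectors % UNROLL != 0:
        raise ValueError("vector count must be a multiple of UNROLL")
    diff = [0.0] * LANES
    i = list(range(UNROLL))              # vector loop counter (0, 1, ..., 7)
    iterations = 0
    while i[0] < num_vectors:            # one comparison per 8 vectors
        diff = accumulate(diff, target, repository, i[0])
        diff = accumulate(diff, target, repository, i[1])
        diff = accumulate(diff, target, repository, i[2])
        diff = accumulate(diff, target, repository, i[3])
        diff = accumulate(diff, target, repository, i[4])
        diff = accumulate(diff, target, repository, i[5])
        diff = accumulate(diff, target, repository, i[6])
        diff = accumulate(diff, target, repository, i[7])
        i = [x + UNROLL for x in i]      # one vector addition updates all 8 counters
        iterations += 1
    return sum(diff), iterations

# Hypothetical 1024-value inputs with integer-valued floats, so sums are exact.
target = [float(k % 7) for k in range(1024)]
repository = [float(k % 5) for k in range(1024)]
total, iterations = squared_difference_unrolled(target, repository)
print(total, iterations)
```

For a 1024-value file there are 256 four-float vectors, so the counter update and loop test run 32 times instead of 256.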

Slide 39: Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version
No. of SPU used | Read input time (sec) | Total Elapsed time (sec) | Net Elapsed time (sec)

Slide 40: Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version

Slide 41: Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version
- Little improvement (about 4%)
- Shows the possibility of faster performance through further loop unrolling
- The best performance becomes 47 sec

Slide 42: Loop Unrolling
- The previous result proved that optimizing the loop can improve performance
- Unroll the loop completely
- More obvious speedup

Slide 43: Result of the parallel, with SIMD, float input, loop unrolling PS3 version
No. of SPU used | Read input time (sec) | Total Elapsed time (sec) | Net Elapsed time (sec)

Slide 44: Result of the parallel, with SIMD, float input, loop unrolling PS3 version

Slide 45: Result of the parallel, with SIMD, float input, loop unrolling PS3 version
- 45% faster
- The ultimate best performance becomes 27 sec

Slide 46: Conclusion of Optimization
- PC version: 633 sec
- PS3 with 1 SPU (i.e. sequential version on PS3): 1928 sec
- Final optimized version on PS3: 27 sec
- 23 times faster than the PC version
- 71 times faster than the sequential version on PS3
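The speedup figures can be checked from the timings reported across the slides. The sketch below takes the PC time as 633 sec, the value given on the Output and Parallel Version slides. The dictionary keys are labels chosen here for illustration.

```python
# Timings in seconds, as reported on the slides.
timings = {
    "PC version": 633,
    "PS3, 1 SPE": 1928,
    "PS3, 6 SPEs": 330,
    "SIMD with int": 71,
    "SIMD with float": 49,
    "SIMD loop counter": 47,
    "loop unrolling": 27,
}

def speedup(baseline, optimized):
    """How many times faster the optimized run is than the baseline."""
    return baseline / optimized

final = timings["loop unrolling"]
vs_pc = speedup(timings["PC version"], final)
vs_one_spe = speedup(timings["PS3, 1 SPE"], final)

for stage, seconds in timings.items():
    print(f"{stage:18s} {seconds:5d} sec  "
          f"{speedup(timings['PC version'], seconds):5.2f}x vs PC")
print(f"final vs PC: {vs_pc:.1f}x, final vs 1 SPE: {vs_one_spe:.1f}x")
```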

Slide 47: Conclusion of Optimization

Slide 48: Future Works
- Port the whole ADVISER application to the PlayStation® 3
- Optimize throughout the whole application

Slide 49: Q&A

Slide 50: The End