
1 HIGH PERFORMANCE MULTILAYER PERCEPTRON ON A CUSTOM COMPUTING MACHINE Authors: Nalini K. Ratha, Anil K. Jain. H. GÜL ÇALIKLI (2002700743)

2 INTRODUCTION Artificial Neural Networks (ANNs) attempt to mimic biological neural networks. One of the main features of biological neural networks is the massively parallel interconnections among the neurons.

3 INTRODUCTION Computational model of a biological neural network: simple operations such as inner product computation and thresholding. Design parameters:
1. network topology: (i) number of layers, (ii) number of nodes in a layer
2. connection weights
3. property at a node, e.g. the type of non-linearity to be used.

4 INTRODUCTION Schematic of a Perceptron: a d-dimensional input vector X = (x1, x2, ..., xd) is combined with a weight vector W = (w1, w2, ..., wd); the inner product ∑ wi·xi is passed through a non-linearity to produce the output y, which is used to determine the category of the input.
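A minimal sketch of this perceptron computation in Python (the choice of tanh as the non-linearity and the example weights are illustrative assumptions, not values from the paper):

```python
import math

def perceptron(x, w, nonlinearity=math.tanh):
    """Inner product of input x and weights w, passed through a non-linearity."""
    s = sum(wi * xi for wi, xi in zip(w, x))  # sum of wi * xi
    return nonlinearity(s)                    # output y

# Example: a 3-dimensional input with made-up weights.
y = perceptron([0.5, -1.0, 2.0], [0.1, 0.4, -0.2])
```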

5 INTRODUCTION: Multilayer Perceptrons (MLPs): one of the most popular neural network models for solving pattern classification and image classification problems. An MLP consists of several layers of perceptrons; nodes in the i-th layer are connected to nodes in the (i+1)-th layer through suitable weights, and there are no interconnections among the nodes within a layer.

6 INTRODUCTION: Multilayer Perceptrons. Figure: a Multilayer Perceptron compared with a biological neuron.

7 INTRODUCTION: Training of an MLP:
1. Feedforward Stage: training patterns with known class labels are presented at the input layer; the weight matrix is randomly initialised at the start; the output is computed at the output node.
2. Weight Update Stage: weights are updated in a backward fashion starting with the output layer; weights are changed in proportion to the error between the desired output and the actual output.
3. Repeat stages 1 and 2 until the network converges.
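A compact Python sketch of these three stages for a one-hidden-layer MLP (the layer sizes, learning rate, and tanh activation are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(20, 10))  # input -> hidden, randomly initialised
W2 = rng.normal(scale=0.1, size=(10, 3))   # hidden -> output
lr = 0.01                                  # learning rate (assumed)

def train_step(x, target):
    global W1, W2
    # 1. Feedforward: present a labeled pattern, compute the output.
    h = np.tanh(x @ W1)
    y = np.tanh(h @ W2)
    # 2. Weight update: propagate the error backward from the output layer;
    #    changes are proportional to (desired - actual) output.
    err_out = (target - y) * (1 - y ** 2)
    err_hid = (err_out @ W2.T) * (1 - h ** 2)
    W2 += lr * np.outer(h, err_out)
    W1 += lr * np.outer(x, err_hid)
    return np.sum((target - y) ** 2)

x, t = rng.normal(size=20), np.array([1.0, 0.0, 0.0])
for _ in range(100):   # 3. repeat until convergence (fixed count here)
    train_step(x, t)
```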

8 INTRODUCTION: Figure: a Multilayer Perceptron. The feature vector (x1, x2, x3, ..., xd) enters at the input layer, which feeds a hidden layer and then the output layer.

9 INTRODUCTION: For an n-node MLP, O(n²) interconnections are needed. THUS: mapping an MLP onto a parallel processor is a real challenge. On a uniprocessor, the whole operation proceeds sequentially, one node at a time, so no complex communication is involved. HOWEVER: for a high performance implementation, efficient communication capability must be supported.

10 INTRODUCTION: Typical Pattern Recognition and Computer Vision Applications: such applications have more than 100 input nodes, and a classification process involving complex decision boundaries demands a large number of hidden nodes.

11 INTRODUCTION: Real Time Computer Vision Applications: the network training can be carried out offline. Recall Phase: high input/output bandwidth is required along with fast classification (recall) speeds.

12 INTRODUCTION: For a three layer network (excluding the input layer), let:
m: input nodes
n1: nodes in the first hidden layer
n2: nodes in the second hidden layer
k: output nodes (classes)
Nm: the number of multiplications
Na: the number of additions
Nm = (m*n1) + (n1*n2) + (n2*k)
Na = Nm - (n1 + n2 + k)
(Non-linearity not included.)
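A direct translation of these counts into Python (the example sizes are assumptions for illustration):

```python
def op_counts(m, n1, n2, k):
    """Multiplications and additions for a three-layer MLP (non-linearity excluded)."""
    N_m = m * n1 + n1 * n2 + n2 * k
    N_a = N_m - (n1 + n2 + k)  # each node needs one fewer addition than multiplications
    return N_m, N_a

print(op_counts(20, 20, 20, 3))  # -> (860, 817)
```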

13 INTRODUCTION: Example: a Practical Vision System. Process a 1024 x 1024 image in "real time": 30 frames to be processed per second gives 30 x 1024 x 1024 ≈ 30 x 10⁶ input patterns/sec. THUS, a real time neural network classifier is expected to perform billions of operations per second. Connection weights are floating point numbers, so these are floating point multiplications and additions. Result: throughputs of this kind are difficult to achieve with today's most powerful uniprocessors.
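Worked out explicitly (the per-pattern operation counts reuse the assumed 20-20-20-3 network from the sketch above, so the total is illustrative):

```python
frames_per_sec = 30
pixels_per_frame = 1024 * 1024
patterns_per_sec = frames_per_sec * pixels_per_frame  # ~31.5 million patterns/sec

N_m, N_a = 860, 817                   # from op_counts(20, 20, 20, 3)
ops_per_sec = patterns_per_sec * (N_m + N_a)
print(ops_per_sec)                    # ~5.3e10 operations per second
```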

14 INTRODUCTION: Parallel Architectures for ANNs. Types of Parallelism Available in an MLP:
1. Training session parallelism
2. Training example parallelism
3. Layer and forward/backward parallelism
4. Node parallelism
5. Weight parallelism
6. Bit parallelism
Node parallelism, in particular, is easily mapped onto a parallel architecture.

15 INTRODUCTION: Parallel Architectures for ANNs (cont'd). Complexities involved:
1. computational complexity
2. communication complexity: inner product computation involves a large number of communication steps.
THUS, special purpose neurocomputers have been built using:
1. commercially available chips
2. special purpose VLSIs, which provide the best performance.

16 INTRODUCTION: Parallel Architectures for ANNs (cont'd). The architecture changes dynamically from application to application: the number of nodes and the number of layers vary. It is expensive to design a VLSI architecture for individual applications, so typically architectures with a fixed number of nodes and layers are fabricated.

17 INTRODUCTION: Parallel Architectures for ANNs (cont'd). Special Purpose ANN Implementations in the Literature. Ghosh and Hwang: investigate architectural requirements for simulating ANNs using massively parallel multiprocessors, and propose a model for mapping neural networks onto message passing multicomputers.

18 INTRODUCTION: Liu: presents an efficient implementation of the backpropagation algorithm on the CM-5 that avoids explicit message passing; the results of the CM-5 implementation are compared with those of the Cray-2, Cray X-MP, and Cray Y-MP. Chinn: describes a systolic algorithm for ANNs on the MasPar-1 using a 2-D systolic array-based design.

19 INTRODUCTION: Onuki: presents a parallel implementation using a set of sixteen standard 24-bit DSPs connected in a hypercube. Kirsanov: discusses a new architecture for ANNs using Transputers. Muller: presents a special purpose parallel computer using a large number of Motorola floating point processors for ANN implementation.

20 INTRODUCTION: Parallel Architectures for ANNs (cont'd). Special Purpose VLSI chips designed and fabricated for ANN implementations. Hammerstrom: a high performance and low cost ANN with 64 processing nodes per chip and hardware-based multiply-and-accumulate operators. Barber: used a binary tree adder following parallel multipliers in the SPIN-L architecture.

21 INTRODUCTION: Shinokawa: describes a fast ANN with a billion connections per second using ASIC VLSI chips. Viredaz: describes the MANTRA-I neurocomputer using 2x2 systolic PE blocks. Kotolainen: proposed a tree of connection units with processing units at the leaf nodes for mapping many common ANNs.

22 INTRODUCTION: Asanovic: proposed a VLIW processor with a 128-bit instruction width and a 7-stage pipeline, with 8 processors per chip. Ramacher: describes the architecture of SYNAPSE, a systolic neural signal processor using a 2-D array of systolic elements. Mueller & Hammerstrom: describe the design and implementation of CNAPS.

23 INTRODUCTION: CNAPS: a gate array implementation of ANNs. A single CNAPS chip consists of 64 processing nodes connected in a SIMD fashion using a broadcast interconnect. Each processor has 4K bytes of local memory, a multiplier, an ALU, and dual internal buses.

24 INTRODUCTION: Cox: describes the implementation of GANGLION. GANGLION: caters to a fixed neural architecture of 12 input nodes, 14 hidden nodes, and 4 output nodes; 8x8 multipliers have been built using CLBs, and a lookup table is used for the activation function.

25 INTRODUCTION: Stochastic Neural Architectures: there is no need for a time-consuming and area-costly floating point multiplier, which makes them suitable for VLSI implementations. Examples: Armstrong & Thomas: proposed a variation of the ANN called the Adaptive Logic Network (ALN). ALNs are similar to ANNs, but the costly multiplications are replaced by logical AND operations and the additions by logical OR operations.

26 INTRODUCTION: Masa et al.: describe an ANN with a single output, six hidden layers, and seventy inputs that can operate at a 50 MHz input rate.

27 CUSTOM COMPUTING MACHINES Uniprocessor: the instruction set available to a programmer is fixed; an algorithm is coded using a sequence of instructions; the processor can serve many applications by simply reordering the sequence of instructions. Application Specific Integrated Circuits (ASICs): used for a specific application; provide higher performance compared to a general purpose uniprocessor.

28 CUSTOM COMPUTING MACHINES Custom Computing Machine (CCM): a user can customize the architecture and instructions for a given application by programming at the gate level; in this way, high performance can be achieved. Using a CCM, a designer can tune and match the architectural requirements of the problem.

29 CUSTOM COMPUTING MACHINES A CCM can overcome the limitations of ASICs. Limitations of ASICs: fast but costly; nonreconfigurable; time consuming to fabricate. Advantages of CCMs: cheap, because CCMs use Field Programmable Gate Arrays (FPGAs) as compute elements, and FPGAs are off-the-shelf components and thus relatively inexpensive.

30 CUSTOM COMPUTING MACHINES Advantages of CCMs (cont'd): reconfigurable: since FPGAs are reconfigurable, CCMs are easily reprogrammed. Time saving: CCMs do not need to be fabricated for every new application and are often employed for fast prototyping; THUS they save a considerable amount of time in the design and implementation of algorithms.

31 SPLASH 2 – ARCHITECTURE and PROGRAMMING FLOW Splash 2 is one of the leading FPGA-based custom computing machines, designed and developed by the Supercomputing Research Center.

32 SYSTEM LEVEL VIEW of the SPLASH 2 ARCHITECTURE Interface board: 1. connects Splash 2 to the host; 2. extends the address and data buses. The Sun host can read/write the memories and memory-mapped control registers of Splash 2 via these buses. Splash 2 Processing Board: each Processing Element (PE) has 512 KB of memory, which the host can read/write. The PEs are X1-X16; PE X0 controls the data flow into the processor board.

33 SPLASH 2 – ARCHITECTURE and PROGRAMMING FLOW Figure: a Processing Element in Splash 2 (a Xilinx 4010 FPGA with a 512K x 16-bit external memory, SBus read/write/address/data connections, and links to the left neighbor, right neighbor, and crossbar). The individual memory available with each PE makes it convenient to store temporary results and tables.

34 SPLASH 2 – ARCHITECTURE and PROGRAMMING FLOW Figure: Programming Flow for Splash 2 (VHDL source -> logic synthesis (gate level description) -> partition, place and route (logic placement) -> timing of logic -> simulation -> Splash 2). Logic designed in VHDL is verified. The main concern is to achieve the best placement of logic in an FPGA in order to minimize timing delay. If the logic circuit cannot be mapped to the CLBs and flip-flops available internal to an FPGA, the designer needs to revise the logic in the VHDL code and the process is repeated. Once the logic is mapped to CLBs, the timing for the entire digital logic is obtained; if the timing obtained is not acceptable, the design process is repeated. To program Splash 2, we need to program: 1. each of the PEs; 2. the crossbar; 3. the host interface.

35 SPLASH 2 – ARCHITECTURE and PROGRAMMING FLOW Figure: Steps in Software Development on Splash 2 (design entry in VHDL -> functional verification by simulation -> verified design -> synthesis -> partition, place and route -> delay analysis -> generate control bits -> host interface -> debugging and integration -> Host-Splash 2 executable code).

36 MAPPING an MLP on SPLASH 2 In implementing a neural network classifier on Splash 2, the basic building block is the perceptron implementation. For mapping an MLP to Splash 2, two physical PEs serve as one neuron: the i-th PE (i odd) handles the inner product phase ∑ wij·xi, and the (i+1)-th PE (even) computes the nonlinear function tanh(βx) with β = 0.25.

37 MAPPING an MLP on SPLASH 2 Assume the perceptrons have been trained, so the connection weights are fixed. Thus, an efficient way of handling the multiplication is to employ a lookup table; since a large external memory (512 KB) is available with each PE, the lookup table can be stored there. A pattern vector component xi is presented at every clock cycle. 1. Inner Product Calculation: the i-th (odd) PE looks up the multiplication table to obtain the weighted product.

38 MAPPING an MLP on SPLASH 2 The sum ∑ wij·xi is computed using an accumulator. After all the components of a pattern vector have been examined, the inner product has been computed. 2. Application of the nonlinear function to the inner product: the nonlinearity is again stored as a lookup table in the second PE. On receiving the inner product result from the first PE, the second PE uses the result as the address into the nonlinearity lookup table and produces the output.
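A behavioral sketch of this two-PE pipeline (table sizes, value ranges, and the use of Python lists are modeling assumptions; on Splash 2 both tables live in the PEs' external memories and the accumulator is hardware inside the first PE):

```python
import math

BETA = 0.25

def build_mult_table(weights, levels=256):
    # One table segment per weight: segment j maps input value x to w_j * x.
    return [[w * x for x in range(levels)] for w in weights]

def neuron(pattern, mult_table):
    acc = 0.0                          # accumulator in the first (odd) PE
    for j, x in enumerate(pattern):    # one component per clock cycle
        acc += mult_table[j][x]        # table lookup replaces a multiplier
    return math.tanh(BETA * acc)       # stands in for the second PE's tanh LUT

table = build_mult_table([0.5, -0.25, 1.0])
print(neuron([10, 200, 33], table))
```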

39 MAPPING an MLP on SPLASH 2 3. Thus the output of a neuron is obtained: the output is written back to the external memory of the second PE, starting from a prespecified location. 4. After sending all the pattern vectors, the host can read back the memory contents. A layer in the neural network is simply a collection of neurons working synchronously on the input; on Splash 2 this can be achieved by broadcasting the input to as many physical PEs as desired. The output of a neuron is written into a specified segment of external memory and read back by the host.

40 MAPPING an MLP on SPLASH 2 For every layer in the MLP, stages 1-4 are repeated until the output layer is reached. NOTE: for every layer, there is a different lookup table. Lookup Table Organization: there are m multiplications to be performed per node, corresponding to the m-dimensional weight vector, so the lookup table is divided into m segments.

41 MAPPING an MLP on SPLASH 2 Lookup Table Organization (cont'd): a counter is incremented at every clock, forming the higher-order (block) address for the lookup table; the pattern vector component forms the lower-order address bits. Splash 2 has an 18-bit address bus for the external memory: the higher-order 6 bits select the block (bits 17-12) and the lower-order 12 bits give the offset within the block (bits 11-0), with one block per table segment (Table 1 ... Table m). Note: the offset can also be negative, corresponding to a negative input to the lookup table.
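How such an address could be assembled, as a sketch (the function name and masking details are illustrative, not Splash 2 code):

```python
def lut_address(block_counter, component):
    """Compose an 18-bit table address: 6-bit block | 12-bit offset."""
    offset = component & 0xFFF           # 12 bits; a negative component
                                         # wraps to its 2's complement form
    return ((block_counter & 0x3F) << 12) | offset

print(hex(lut_address(block_counter=5, component=-3)))  # -> 0x5ffd
```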

42 MAPPING an MLP on SPLASH 2 Lookup Table Organization (cont'd): the numbers are represented in 12-bit 2's complement form, hence the resolution of this representation is eleven bits. The accumulator within the PE is 16 bits wide; after accumulation, the accumulator result is scaled down to 12 bits.
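The implied fixed-point bookkeeping, sketched in Python (the right shift by 4 is one plausible 16-to-12-bit scaling; the slides do not spell out the exact scheme):

```python
def to_12bit(value):
    return max(-2048, min(2047, value))  # saturate to the 12-bit 2's complement range

def scale_accumulator(acc16):
    return to_12bit(acc16 >> 4)          # scale the 16-bit accumulator down to 12 bits

print(scale_accumulator(30000))          # -> 1875
```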

43 MAPPING an MLP on SPLASH 2 Figure: Lookup Table Organization.

44 PERFORMANCE EVALUATION The requirements for mapping an MLP to complete a classification process: in terms of PEs, the number of PEs required is twice the number of nodes in each layer (two PEs per neuron). The number of clock cycles required = m*K*l, where: m: number of input layer nodes; K: number of patterns; l: number of layers (one pass through the hardware per layer).

45 PERFORMANCE EVALUATION In the implementation by the authors of the paper: m = 20; K = 1024 x 1024 ≈ 10⁶ (the total number of pixels in the input image); l = 2. THUS: number of clock cycles = 20 * 2 * 10⁶ = 40 million. With a clock rate of 22 MHz, the time taken for 40 million clock ticks is 1.81 secs.
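Checked directly (the numbers are those quoted above):

```python
m = 20               # input layer nodes
K = 10**6            # patterns, roughly the pixels in a 1024 x 1024 image
l = 2                # layers / passes
clock_hz = 22e6      # 22 MHz Splash 2 clock

cycles = m * K * l           # 40 million
print(cycles / clock_hz)     # ~1.81 seconds
```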

46 PERFORMANCE EVALUATION When the number of PEs required is larger than the number of available PEs, either more processor boards need to be added or PEs need to be time-shared. NOTE: neuron outputs are produced independently of other neurons, but the algorithm waits until the computations in each layer are completed.

47 PERFORMANCE EVALUATION An MLP has a communication complexity of O(n²), where n is the number of nodes; as n grows, it becomes difficult to get good timing performance from a single processor system. With a large number of processor boards, the single 36-bit input data bus can cater to multiple input patterns. Note: in a multiboard system, all boards receive the same input. This parallelism allows more data to stream into the system, thus reducing the number of clock cycles.

48 PERFORMANCE EVALUATION Figure: processing time (log scale) versus network size for Splash 2 and a SPARC 20. SCALABILITY: only a single layer is considered, and network size is represented by the number of nodes in that layer; multilayered networks are considered to be linearly scalable in the Splash 2 architecture; the performance measure is processing time, as measured by the number of clock cycles for Splash 2 with a 22 MHz clock.

49 PERFORMANCE EVALUATION Speed Evaluation: with 20 input nodes implemented on a 2-board system, 176 million connections per second (MCPS) is achieved per layer by running the Splash clock at 22 MHz. A 6-board system can deliver more than a billion connections per second. This is comparable to the performance of many high level VLSI-based systems such as SYNAPSE and CNAPS, which perform in the range of 5 GCPS.
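The quoted rate, expressed per clock tick as a rough sanity check:

```python
clock_hz = 22e6
mcps = 176e6              # per layer, 2-board system, as quoted above
print(mcps / clock_hz)    # -> 8.0 connections evaluated per clock, per layer
```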

50 PERFORMANCE EVALUATION Network-based Image Segmentation: image segmentation is the process of partitioning an image into mutually exclusive connected image regions. In an automated image document understanding system, page layout segmentation plays an important role in segmenting text, graphics, and background areas. Jain and Karu proposed an algorithm to learn the texture discrimination masks needed for image segmentation.

51 PERFORMANCE EVALUATION Network-based Image Segmentation (cont'd) The page segmentation algorithm proposed by Jain and Karu has three stages of computation: 1. feature extraction, based on 20 masks; 2. classification, using a multilayer feedforward neural network with 20 input nodes, 20 hidden nodes, and 3 output nodes; 3. postprocessing, which involves removing small noisy regions and placing rectangular blocks around homogeneous regions.

52 PERFORMANCE EVALUATION Network-based Image Segmentation (cont'd) Figure: Schematic of the Page Segmentation Algorithm.

53 PERFORMANCE EVALUATION Figure: Page segmentation, showing the input gray level image, the result of the segmentation algorithm, and the result after postprocessing.

54 CONCLUSIONS: A novel scheme for mapping MLPs onto a Custom Computing Machine has been presented. The scheme is scalable in terms of the number of nodes and the number of layers in the MLP, and provides near-ASIC level speed. The reconfigurability of CCMs has been exploited to map several layers of an MLP onto the same hardware. The performance gains achieved using this mapping have been demonstrated on a network-based image segmentation application.

