A solution to the Von Neumann bottleneck
Presentation on theme: "A solution to the Von Neumann bottleneck" — Presentation transcript:

1 A solution to the Von Neumann bottleneck
In-memory computing: a solution to the Von Neumann bottleneck. Sylvain Eudier, MSCS candidate, Union College, 2004.

2 Seminar - Processing in Memory
Plan: introduction to a new architecture; the different architectures; the C-RAM architecture (implications and applications); performances; conclusion.

3 Introduction
The Von Neumann architecture; the situation (gap evolution); some improvements were made; can we avoid this bottleneck? (graph)
The Von Neumann bottleneck: a von Neumann computer has three parts: a central processing unit (CPU), a store, and a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store). Backus proposed to call this tube the von Neumann bottleneck.
The situation: CPU speed doubles every 18 months, which means a bigger die size; pipelining helps, but it increases latencies (cache access, branch-prediction penalties) and the complexity of processor design. The CPU-memory gap: memory speed increases by only about 7% a year, so the gap keeps widening; caching, prefetching and multithreading only mitigate it. Discuss cache memory (pros and cons).
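The widening gap can be made concrete with a little arithmetic using the two growth rates quoted above; this is a toy projection, not measured data:

```python
# Toy model of the CPU-memory gap, using the growth rates quoted above:
# CPU performance doubles every 18 months, while memory speed improves
# only about 7% per year.

def gap_after(years: float) -> float:
    """Ratio of CPU speedup to memory speedup after `years` years."""
    cpu = 2.0 ** (years / 1.5)   # doubling every 18 months
    mem = 1.07 ** years          # 7% improvement per year
    return cpu / mem

for y in (1, 5, 10):
    print(f"after {y:2d} years the gap has grown {gap_after(y):6.1f}x")
```

At these rates the gap grows by roughly 50x per decade, which is why caching alone cannot close it.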

4 Different Architectures
Architectures and designs: IRAM (design), RAW (design), CRAM (design).
IRAM (Berkeley): taped out in October 2002 by IBM (72 chips on a wafer); testing in progress. IRAM stands for Intelligent RAM (CPU + RAM). Principle: grouping CPU + cache + RAM + networking onto the same chip, giving a smaller, cheaper, lower-power part aimed at massive vector processing. IRAM is designed to be stand-alone. First results are good: 13 MB of DRAM with a 200 MHz processor gives 1.6 GFlops (200 * 8-bit ops per cycle) at 2 W.
RAW (MIT): instead of building one processor on the chip, several processors (tiles) are implemented and connected with a network. A tile is a RISC processor, 128 KB of SRAM, an FPU and a communication processor; the prototype has 16 tiles, with the memory located at the periphery of the tiles. All tiles can access memory via either (1) a static network, with 3 cycles of latency between nearest tiles, +1 per hop, or (2) a dynamic network that sends data packets. The prototype runs at 300 MHz and reaches 3.2 GFlops.
CRAM (University of Alberta, Prof. Elliott; also Carleton University, Ottawa): several prototypes have been developed, and the latest is currently being designed. Processing elements (PEs) are added to the sense amplifiers to take advantage of the memory bandwidth available at that point. The PEs are very simple 1-bit serial elements; computation is based on the truth table of an 8-to-1 multiplexer. The design favors lots of small PEs rather than one big microprocessor, and is meant to be general-purpose: only 5-10% of the chip area goes to the PEs. The PEs communicate through a shift-left/shift-right register.
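The CRAM processing element described above can be sketched in a few lines. This is an illustrative model, not the real circuit: here I assume the three mux select lines are a register bit X, a register bit Y and the memory bit M, and that an 8-bit opcode supplies the truth table of the 8-to-1 multiplexer:

```python
# Toy model of a 1-bit C-RAM processing element. The slide says each PE
# computes via the truth table of an 8-to-1 mux; the wiring assumed here
# (select lines = register X, register Y, memory bit M) is illustrative.

def pe_step(opcode: int, x: int, y: int, m: int) -> int:
    """One PE cycle: select bit number `x y m` of the 8-bit truth table."""
    index = (x << 2) | (y << 1) | m
    return (opcode >> index) & 1

# Opcodes derived from the truth-table encoding above:
AND_X_M = 0b10100000  # output 1 exactly where x=1 and m=1 (indices 5 and 7)
XOR_X_M = 0b01011010  # output 1 where x != m

# All PEs execute the same opcode on the same clock, one per memory column:
memory_row = [0, 1, 1, 0, 1]
x_regs     = [1, 1, 0, 0, 1]
result = [pe_step(AND_X_M, x, 0, m) for x, m in zip(x_regs, memory_row)]
print(result)   # → [0, 1, 0, 0, 1]
```

The point of the mux encoding is that one 8-bit opcode can express any boolean function of the three inputs, which keeps the PE tiny enough to sit next to a sense amplifier.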

5 The C-RAM (Computational RAM) architecture
Applications, performances, implications, new software design, energy consumption.
I chose to focus on the CRAM for its multi-purpose design: it is not intended to perform only in a restricted domain. The architecture is not so new, since the first prototype was taped out in 1996 by D. Elliott at the University of Alberta, Canada.
Applications of the CRAM: it is very good at parallel computing for parallel-reducible algorithms; the greater the parallel degree of a computation, the better. Because of the PEs, a bigger computation does not necessarily mean more time, as long as it fits in memory: more CRAM means both more RAM and more compute power. Even if the problem is not really parallel, it can run faster because of the bandwidth; however, the job has to be big enough, otherwise moving it into memory takes longer than the actual computation.
These applications and performance tests were chosen for the different fields they represent (to demonstrate the general-purpose nature of the CRAM), for their different computational complexities and models, and because they are all based on practical problems.
Implications: the CRAM implies a different way of writing programs, and new interfaces. Because all the elements (memory, CPU, bus) are integrated on-chip, energy consumption is very low.
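The claim that a bigger computation need not take more time, as long as it fits in memory, can be sketched with a toy cycle-count model; the cycle numbers below are made up purely for illustration:

```python
# Toy cycle-count model of the slide's claim: with one PE per element,
# an elementwise job costs the same whatever its size, as long as it
# fits in memory. The constant below is illustrative, not measured.

CYCLES_PER_ELEMENT = 300          # assumed cost of one bit-serial op

def serial_cycles(n: int) -> int:
    return n * CYCLES_PER_ELEMENT   # one CPU touches every element

def cram_cycles(n: int, n_pes: int) -> int:
    # ceil(n / n_pes) passes, each costing one element's worth of work
    passes = -(-n // n_pes)
    return passes * CYCLES_PER_ELEMENT

for n in (1_000, 65_536, 1_000_000):
    print(n, serial_cycles(n), cram_cycles(n, 65_536))
```

Until the data outgrows the number of PEs, the CRAM time is flat while the serial time grows linearly, which is exactly the "more CRAM = more RAM and more power" argument.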

6 CRAM Applications
Image processing: low-level adjustments (brightness, average filter…). Database searches: equivalence, extremes, between limits… Multimedia compression: MPEG motion estimation.
Image processing: brightness is very fast thanks to the CRAM design, being pure computation; the average filter is slower because of the communication between the PEs. It depends on the problem and its degree of parallelism.
Database searches: on randomly generated lists, all the searches run at about the same speed, and the uniprocessor/CRAM difference is not as high: the running time is linear.
Multimedia compression: this algorithm requires the shift registers (which slows the process down), but the problem suits parallel processing particularly well: lots of redundant computation on groups of pixels.
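The contrast between the two image operations can be shown in a few lines: brightness is purely elementwise (each PE touches only its own pixel), while an average filter needs each pixel's neighbours, which on the CRAM means shift-register traffic between PEs. A sketch under those assumptions, in plain Python:

```python
# Why brightness is fast and the average filter slower on C-RAM:
# brightness is elementwise (no PE talks to another), while a 1-D
# average filter needs left/right neighbours, i.e. shift-left /
# shift-right steps between PEs before the arithmetic.

def brighten(row, delta):
    # elementwise: maps onto one PE per pixel, zero communication
    return [min(255, max(0, p + delta)) for p in row]

def average3(row):
    # each output needs both neighbours: on C-RAM these two shifted
    # copies cost inter-PE shift steps (edge pixels are replicated)
    shifted_l = row[1:] + [row[-1]]
    shifted_r = [row[0]] + row[:-1]
    return [(l + c + r) // 3 for l, c, r in zip(shifted_l, row, shifted_r)]

row = [10, 20, 30, 40, 250]
print(brighten(row, 10))   # → [20, 30, 40, 50, 255]
print(average3(row))       # → [13, 20, 30, 106, 180]
```

The extra shifted copies are what the slide means by "communication between the PEs": the arithmetic itself is just as parallel, but the data movement adds cycles.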

7 Performances - Configs
CRAM: 200 MHz, 32 MB, 64K PEs, on a Pentium 133 MHz host (simulated)
Pentium 133 MHz with 32 MB RAM
Sun SPARCstation, 167 MHz CPU, with 64 MB
The simulator has been validated with tests and results from the prototype; the results are very close.

8 Performances – Basic ops
Because only a small number of prototypes is available, the results are based on a simulator; but the tests proved that it is very precise, very close to the real results. Timing depends on the precision required. The performance of this CRAM is about 0.01 GFlops, which is good considering it runs at only 200 MHz with 32 MB: at least 200 times faster than the equivalent PC. Operations complexity.

9 Performances - Comparison
The first case assumes the CRAM is used as the main memory or video memory, so transfer time does not matter. The second case is when the CRAM is used as an extension card, as an accelerator for massively parallel computation. The overhead case takes into account the transfer of data from the host to the CRAM and the overhead created during the transfer.
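The two usage cases above can be captured in a tiny cost model; every constant here is a placeholder, not a measured value. As main memory the transfer term vanishes; as an accelerator card, the CRAM only pays off once the computation amortizes the host-to-card transfer:

```python
# Toy model of the two C-RAM usage cases on this slide.
# All per-byte costs are placeholders, not measurements.

TRANSFER_NS_PER_BYTE = 10     # assumed host-to-card transfer cost
COMPUTE_NS_PER_BYTE  = 2      # assumed on-CRAM compute cost
HOST_NS_PER_BYTE     = 5      # assumed cost of doing the work on the host

def as_main_memory(n_bytes: int) -> int:
    # case 1: CRAM is the main/video memory, no transfer at all
    return n_bytes * COMPUTE_NS_PER_BYTE

def as_accelerator(n_bytes: int, passes: int) -> int:
    # case 2 (overhead case): pay the transfer once, compute `passes` times
    return n_bytes * TRANSFER_NS_PER_BYTE + passes * as_main_memory(n_bytes)

def on_host(n_bytes: int, passes: int) -> int:
    return passes * n_bytes * HOST_NS_PER_BYTE

for passes in (1, 2, 10):
    n = 1024
    print(passes, "CRAM card wins:", as_accelerator(n, passes) < on_host(n, passes))
```

With these assumed numbers a single pass loses to the host (transfer dominates), while ten passes win comfortably, which is the slide's point about the overhead case.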

10 New Software Design (Step 1)
Think Parallel (pseudo code)
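The slide's pseudocode is not reproduced in the transcript; as a stand-in, here is an assumed illustration of the "think parallel" step: the same computation phrased first as a serial loop, then as one whole-array operation, the form a CRAM (one PE per element) executes in a constant number of passes:

```python
# Assumed illustration of the "think parallel" step (the original
# slide's pseudocode is not in the transcript). Serial thinking walks
# the data one element at a time; parallel thinking expresses the job
# as a single operation over all elements at once.

data = [3, 1, 4, 1, 5, 9, 2, 6]

# serial thinking: one element per iteration
out_serial = []
for v in data:
    out_serial.append(v * 2 + 1)

# parallel thinking: one operation applied to every element at once,
# i.e. one opcode sequence broadcast to all PEs
out_parallel = [v * 2 + 1 for v in data]

assert out_serial == out_parallel
print(out_parallel)   # → [7, 3, 9, 3, 11, 19, 5, 13]
```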

11 New Software Design (Step 2)
Use a different language (modified C++)
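The transcript does not show the modified C++ itself, so the sketch below only illustrates the concept such a language adds: a "parallel variable" type whose every operation applies across all PEs at once. The class name and semantics are assumptions for illustration, not the actual CRAM extensions:

```python
# Illustration of the "parallel variable" concept a CRAM-extended
# language exposes (class name and semantics are assumptions; the
# actual modified C++ is not shown in the transcript).

class Pvar:
    """One value per PE; arithmetic applies to all elements at once."""

    def __init__(self, values):
        self.values = list(values)

    def _zip(self, other, op):
        other = other.values if isinstance(other, Pvar) else [other] * len(self.values)
        return Pvar(op(a, b) for a, b in zip(self.values, other))

    def __add__(self, other):
        return self._zip(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._zip(other, lambda a, b: a * b)

    def __repr__(self):
        return f"Pvar({self.values})"

a = Pvar([1, 2, 3, 4])
b = Pvar([10, 20, 30, 40])
print(a * 2 + b)   # one expression, executed element-wise on every "PE"
```

The appeal of this style is that ordinary-looking expressions compile down to opcode broadcasts over all PEs, with no explicit loop for the programmer to write.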

12 New Software Design (Step 3)
Possibly coding in assembly to optimize, or when you want to switch from host to CRAM computing. The comprehensive compiler is not yet finished: the instructions defined for the CRAM are translated into the corresponding host MOVEs.

13 CRAM Energy Consumption
We avoid the use of a bus, we have direct access to memory, and there is no communication overhead. In the end, the CRAM uses 20 times less energy, and therefore produces less heat.

14 The future…
Which architecture will be chosen? The end of today's architecture? A PetaOps is feasible with CRAM. Blue Gene/P aims at the petaflops (view).
Which architecture will survive? Consider the new software design and the performance figures, and compare with the RAMBUS case; it will probably be a matter of sponsors and money.
End of today's architecture: some scientists believe that sooner or later we will have to move to these techniques, because of the computing power available and the reduction in power consumption, and therefore in heat.
A PetaOps: a study by four scientists from Carleton University in Canada (D. Elliott among them) shows that a PetaOps is feasible with 500 MHz SRAM and a PE for every 512 bytes: you just need 1 TB of CRAM.
Blue Gene: IBM's supercomputer, built from arrays of PIMs; thanks to the PIM techniques, it will be air-cooled. Blue Gene/C, released last year for protein folding, ranked 73rd in the Top500 supercomputers at 2 TeraFlops. Blue Gene/L (2005) aims at 200/360 TFlops, faster than the total computing power of today's Top500 supercomputers. Blue Gene/P aims at a PetaFlops: 1000 times faster than Deep Blue; compared to today's supercomputers, 6 times faster, 1/15th of the power, and 10 times more compact (just half a tennis court); 128 times bigger than Blue Gene/C, and about 2 million times faster than a desktop computer. Expected for 2007.
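The PetaOps claim can be checked with back-of-the-envelope arithmetic from the numbers on the slide (500 MHz, one PE per 512 bytes, 1 TB of CRAM); the cycles-per-word-operation figure below is an assumption used to convert raw bit-ops into full word operations:

```python
# Back-of-the-envelope check of the slide's PetaOps claim:
# 1 TB of C-RAM, one PE per 512 bytes, PEs clocked at 500 MHz.

TB = 2 ** 40
n_pes = TB // 512                  # 2**31 processing elements
clock_hz = 500e6
bit_ops_per_s = n_pes * clock_hz   # every PE does one bit-op per cycle

# Assumed bit-serial cost of a full word operation (illustrative):
CYCLES_PER_WORD_OP = 1000
word_ops_per_s = bit_ops_per_s / CYCLES_PER_WORD_OP

print(f"{n_pes:,} PEs, {bit_ops_per_s:.2e} bit-ops/s, "
      f"{word_ops_per_s:.2e} word ops/s")
print("PetaOps reached:", word_ops_per_s >= 1e15)
```

Even charging a thousand cycles per word operation, the aggregate still lands at about 10^15 operations per second, which is what makes the slide's "just need 1 TB of CRAM" plausible.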

15 Questions?

16 Memory bandwidth in a computer
Based on 256 MBytes of 16 Mb, 50 ns DRAM chips, and a 100 MHz CPU with a 64-bit bus. (Log scale.) Back

17 IRAM Design
Back

18 RAW Design
Back

19 CRAM Design
Back

20 Operations Complexity for CRAM
Back

21 Blue Gene/P Scale

22 Computing power scale
(Log scale.) Back

23 Memory-processor gap evolution
Back

