Memory-Driven Computing

Presentation on theme: "Memory-Driven Computing"— Presentation transcript:

1 Memory-Driven Computing
The future of computing
Ian Phillips, ESEP, Principal Systems Architect, Lockheed Martin EBS
Presentation to the Orlando chapter of INCOSE, August 16th, 2017

2 Agenda
Motivation for Memory-Driven Computing
Challenges with memory in today's computer systems
Introducing Gen-Z and architecture concepts
The Machine update ([--TB_Third_PARTY--]) Third Party Proprietary Information

3 What is the motivation?

4 The New Normal: Compute is not keeping up
Data (zettabytes) nearly doubles every two years (exponential growth). The left-hand chart plots, on a log scale (100 through 107, each gridline a 10x improvement): transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts), and number of cores for microprocessors.
The point here is that hardware stopped getting better a while ago and cannot keep up with a data explosion that nearly doubles every couple of years. The IEEE Computer Society, the International Technology Roadmap for Semiconductors (ITRS), and R&D investments and capital expenses through government consortiums come together to shape the development of modern semiconductor fabs and computer architecture, and they have been saying the same thing. The chart on the left shows the different aspects of microprocessor performance improvement since the dawn of Moore's Law; each step up the graph is a 10x improvement. The orange line is transistors per unit area, and it will start to flatten out when Moore's Law ends (about 4-5 years out if things don't change). Related is Dennard scaling, the observation that as transistors get smaller their power density stays constant, so power use stays in proportion to area. You can see that single-thread performance, frequency, and typical power have already flattened out; the flattening started approximately a decade ago, which in turn led to the rise of multicore processors. Since one processor could no longer be run faster, successive generations of Moore's Law transistors were consumed by making copies of the processor core. This increased the demands on memory systems, because there were more voracious consumers of data as well as multiple independent access patterns to memory, which defeated decades' worth of hierarchy-hiding techniques. Additionally, memory is closer to the end of scaling than processors, due to the physical characteristics of semiconductor memory devices.
There are a few more generations left to processor designers, but as they eke out those last few generations, the increases in performance are stagnating and the operational and capital costs associated with these penultimate systems are rising. (FYI: "penultimate" means next to last, as in 7nm is the penultimate silicon process step while 5nm is the ultimate, or terminal, improvement; neither offers the step-over-step improvement of prior years.) The only thing that is not flattening out is the data-growth curve on the right. We generate zettabytes of data, nearly doubling every two years according to IDC. There is an exponential growth curve in rich data that conventional technology, with its linear responses, just cannot keep up with. But it's not just about the sheer size of the data; you've all heard that story a million times. It's about new use cases that use data in a whole different way. [Next: systems of record, engagement and action]
Chart annotations: start of CPU degradation trend, end of CPU degradation trend, end of Moore's Law.
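The doubling claim above is easy to sanity-check. This is an illustrative sketch of the exponential-growth arithmetic only; the 10 ZB starting volume is an assumed figure, not an IDC number.

```python
# Illustrative sketch of the "data nearly doubles every two years" trend.
# The 10 ZB starting volume below is an assumption for the example.

def data_volume(base_zb, years, doubling_period=2.0):
    """Data volume (ZB) after `years`, doubling every `doubling_period` years."""
    return base_zb * 2 ** (years / doubling_period)

# Ten years of doubling every two years is five doublings: a 32x increase.
print(data_volume(10, 10))  # 320.0 ZB from a nominal 10 ZB
```

No linear improvement in compute can track a curve like this for long, which is the slide's point.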

5 Memory

6 Definitions
Memory
The physical media used to store the information required by a computer to execute programs. Directly addressable by the processor. All instructions and data operated on by a computer reside in, or pass through, memory (i.e., the von Neumann architecture).
Volatile memory
Memory that loses its information when power is removed.
Nonvolatile memory
Memory that retains its information when power is removed, whether the power loss is accidental, malicious, or purposeful.

7 Computer Processor Architecture
A classic, contemporary computer processor architecture: memory (DDR) is directly addressable by the processor. SoC = System on Chip; think of iPhone chips, which now package CPU, GPU, networking, etc. on a single chip. HPE brought this technology to the datacenter with the Moonshot offering and to the edge with the Edgeline IoT product offering.

8 Extending Computer Processors using standard integrations.
Motherboard: computer processors are integrated with a variety of media adapters using standard interconnects such as ISA, VME, SATA, USB, SCSI, PCI, and many others. These media adapters let CPUs integrate with a wide variety of peripheral devices, including data storage devices, hardware integration adapters, networks, and user interface devices. These interfaces are all extremely inefficient at handling I/O when compared to CPU memory I/O. Most media peripherals also operate with their own CPU/compute, adding to the latency problem.
Diagram labels: DDR, NVDIMM, NVDIMM, DDR, ISA slot, SoC, VME slot, SATA slot, USB slot, SCSI slot, PCIe slot, DDR, DDR
WWAS 2017 – Confidential

9 Convergence of Memory and Storage: Storage Class Memory
Access latency by tier (each step-up is the multiple over the previous tier):

Tier        | Latency            | Step-up
Processor   | 1 ns               | -
SRAM        | 5 ns               | 5x
DRAM        | 100 ns             | 20x
Flash       | 200,000 ns         | 2,000x
Hard drive  | 10,000,000 ns      | 50x
Tape        | 40,000,000,000 ns  | 4,000x

In-memory operations span processor through DRAM; SCM operations sit between DRAM and Flash; block operations cover Flash through tape.
SCM properties: byte addressability; high bandwidth/low latency; simple and efficient load/store operations; data protection methods (RAID, erasure coding); device management (hot-pluggability); rich data services (i.e., deduplication, replication, thin provisioning, etc.).
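The latency ladder above can be expressed directly. A minimal sketch using the slide's figures, showing why block operations dwarf in-memory ones:

```python
# Access latencies from the slide, in nanoseconds.
LATENCY_NS = {
    "Processor": 1,
    "SRAM": 5,
    "DRAM": 100,
    "Flash": 200_000,
    "Hard Drive": 10_000_000,
    "Tape": 40_000_000_000,
}

def slowdown(medium, baseline="DRAM"):
    """How many times slower `medium` is than `baseline`."""
    return LATENCY_NS[medium] / LATENCY_NS[baseline]

print(slowdown("Flash"))       # 2000.0: flash is ~2000x slower than DRAM
print(slowdown("Hard Drive"))  # 100000.0: five orders of magnitude
```

SCM's role is to fill the 2,000x gap between DRAM and Flash with a byte-addressable, non-volatile tier.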

10 The “memory” hierarchy
Level     | Amount (bytes) | Access time (approximate, in processor cycles)
Register  | O(2^6)         | 1
L1 cache  | O(2^15)        | 2
L2 cache  | O(2^18)        | 10
L3 cache  | O(2^22)        | 60
Memory    | O(2^37)        | 300
SSDs      | O(2^39)        | 250,000
Disk      | O(2^41)        | 7,500,000

Operation                          | Energy (pJ)
64-bit integer operation           | 1
64-bit floating-point operation    | 20
256-bit on-die SRAM access         | 50
256-bit bus transfer (short)       | 26
256-bit bus transfer (1/2 die)     | 256
Off-die link (efficient)           | 500
256-bit bus transfer (across die)  | 1,000
DRAM read/write (512 bits)         | 16,000
HDD read/write (32k bits)          | O(10^6)
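The energy table makes the data-movement argument concrete. A short sketch, using a subset of the slide's figures, comparing the cost of computing on data versus fetching it from DRAM:

```python
# Energy figures (picojoules) taken from the slide's table.
ENERGY_PJ = {
    "int64_op": 1,       # 64-bit integer operation
    "fp64_op": 20,       # 64-bit floating-point operation
    "sram_256b": 50,     # 256-bit on-die SRAM access
    "dram_512b": 16_000, # DRAM read/write of 512 bits
}

# One 512-bit DRAM access costs as much energy as 800 double-precision ops:
print(ENERGY_PJ["dram_512b"] / ENERGY_PJ["fp64_op"])  # 800.0
```

Moving data, not computing on it, dominates the energy budget, which is why Memory-Driven Computing argues for bringing compute to the data.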

11 Memory/Storage Convergence: The Media Revolution
Today: memory is DRAM, and internal block storage is disk/SSD. Converged: on-package memory (OPM), SCM, and disk/SSD.
OPM packages memory and CPU on the same chip, in either a 2D or 3D (stacked) fashion. Placing memory closer to the CPU reduces memory latency and relieves pin pressure on the processor, and hence package size; my understanding is that about 50% of the pins on a processor are used for grounding and connecting to DRAM. Certain applications will still require low memory latency close to the processor, and what we are now seeing is a move to packaging DRAM with the CPU on the same chip. SCM will initially sit further from the processor (between DRAM and NVMe storage) from a latency point of view, but it is byte addressable just like DRAM. Unlike DRAM, which is volatile, SCM is non-volatile. (SRAM is typically used for caching by the processor; it is non-byte-addressable and volatile.) We believe that over time SCM will evolve into true non-volatile memory with performance similar to DRAM: non-volatile, cheaper, very power efficient, dense, etc. At some point in the future, non-volatile memory will be integrated as OPM in the SoC.
SCM = Storage Class Memory; OPM = On-Package Memory
Memory semantics will be pervasive in volatile AND non-volatile storage as these technologies continue to converge.

12 What are the Solution Elements?
Eliminate architectural bottlenecks by simplifying HW/SW interfaces
Speak the common language of CPU data access: extend memory semantics to all devices
Design for the highest data bandwidth and lowest latency to all devices
A flexible architecture that uses scalable technology to keep up with the data-growth trend
Reduce acquisition and operational costs

13 Ultra-fast HPE Persistent Memory Today
Current Gen10 Persistent Memory products:
HPE Scalable Persistent Memory: large (terabytes), with 1 TB in a 2-socket system. Workloads: large in-memory compute, checkpoints and restores, HTAP/real-time analytics, large databases, vSAN and Storage Spaces Direct, big data, service providers, performance tier, and virtualization.
HPE NVDIMMs (e.g., HPE 16GB NVDIMMs): small (100s of GB). Workloads: database storage bottlenecks, software licensing reduction, caching.
Industry-leading persistent memory technology delivers the performance, resiliency, scale, and efficiency required by data-intensive applications.

14 Memory Semantic Fabric
& the Gen-Z Consortium

15 Memory Semantic Fabric
Communication at the speed of memory. What is a memory semantic fabric? A communication protocol that speaks the same language the CPU speaks: load/store, put/get, and read/write. Memory-address-based connectivity that extends beyond the CPU to the server, to the rack, and to the data center.
Diagram: CPUs (SoC), accelerators (SoC, FPGA, GPU), memory, I/O, network, storage, and pooled memory, all attached to the memory semantic fabric.
I do not have exact latency details for read and write operations, though for the memory semantic fabric I have been told the latency will be in the hundreds of nanoseconds. The latency for SoC-private memory is similar to current computer systems with DRAM, in the tens of nanoseconds. The latencies for reads and writes will be similar. However, when you factor in system latency, the MDC system will beat conventional systems hands down.
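Memory semantics means software load/stores bytes at addresses rather than issuing block I/O calls. As a rough stand-in for that programming model, `mmap` gives ordinary code the same byte-addressable feel; the file path and pool size here are illustrative, and no real fabric is involved.

```python
# Sketch of memory-semantic access: byte-addressed stores and loads against
# a pool. An mmap'd file stands in for fabric-attached memory (illustrative).
import mmap

POOL = "/tmp/fabric_pool.bin"          # hypothetical backing file

with open(POOL, "wb") as f:
    f.write(b"\x00" * 4096)            # a tiny 4 KB stand-in for the pool

with open(POOL, "r+b") as f:
    pool = mmap.mmap(f.fileno(), 4096)
    pool[128:133] = b"hello"           # "store": write bytes at an address
    assert pool[128:133] == b"hello"   # "load": read them straight back
    pool.flush()
    pool.close()
```

Contrast this with a SATA or NVMe path, where the same five bytes would cost a full block read, a modification, and a block write through a driver stack.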

16 The Gen-Z Consortium: Broad Industry & Device Support
Consortium members span every component category: GPUs, FPGAs, special-purpose processors, IBM Power CPUs, storage controllers, and switches/gateways; volatile memory and storage class memory (HMC, PCM, Memristor, 3D XPoint, ReRAM, STT-RAM); plus components, intellectual property, connectors, subsystems, systems, and software.

17 A high speed Gen-Z memory semantic fabric
Diagram: CPU SoCs and accelerator modules (GPU or FPGA), each with Gen-Z logic, connect through Gen-Z switches and the Gen-Z fabric to media modules, where a media controller with Gen-Z logic fronts the media (SCM or DRAM). The media can be DRAM, Flash, Memristor, PCRAM, MRAM, 3D XPoint, and more. As a universal interconnect, Gen-Z decouples CPU and memory design, enabling independent innovation across multiple resources.

18 Memory-Driven Computing
These are the key attributes of Memory-Driven Computing. [Next: MDC is Powerful; from processor-centric computing to Memory-Driven Computing]

19 Memory Driven Computing (MDC)
Memory is the scarce compute resource. Conventional computers move data to a processor; MDC flips that around and brings processing to the data. Processors handle individual tasks and can be changed to suit the compute task. With this architecture, we can process massive datasets and save energy (multiple orders of magnitude).
[This is why MDC is "Powerful" (addressing the first pillar). Note: this slide contains the single most important concept in the MDC architecture.] The challenge has always been building enough memory to keep up with compute; memory has always been the scarce resource (never enough volume/resources). Traditional computers chop up your information, the data, to match the limitations of the processor, with the processor as gatekeeper. We flip that around and put the data first, bringing processing to the data; the processor is almost irrelevant and can be swapped out to suit the task. We call this Memory-Driven Computing. SoC, universal memory, and photonics are the key parts of the architecture of the future. With this architecture, we can ingest, store, and manipulate truly massive datasets while simultaneously achieving multiple orders of magnitude less energy per bit. Q: What is HPE doing here that is truly different? A: The new technologies are not substitutional; we are re-architecting. [Next: MDC concept build-out] Memory-Driven Computing technologies are not substitutional; they require re-architecting the fundamentals of computer systems.
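The "flip" can be illustrated in a few lines: instead of each consumer copying data to itself before computing, compute operates directly on one shared pool. A toy contrast, with a `bytearray` standing in for fabric-attached memory:

```python
# Processor-centric vs. memory-driven access, illustrated on a toy pool.

pool = bytearray(range(8))       # the shared memory pool: bytes 0..7

# Processor-centric: a consumer takes its own copy before computing.
copy = bytes(pool)               # data moves; bandwidth and memory are spent
total_from_copy = sum(copy)

# Memory-driven: compute operates on the pool in place, no copy made.
view = memoryview(pool)          # zero-copy window onto the same bytes
total_in_place = sum(view)

assert total_from_copy == total_in_place == 28   # 0+1+...+7
```

Both answers match, but only the second scales when the pool is hundreds of terabytes and many processors share it.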

20 Memory Driven Computing (MDC)
Diagram: from processor-centric computing, where each cluster of SoCs owns its attached memory, to Memory-Driven Computing, where every SoC shares one large memory pool over a fabric (memory + fabric).

21 Gen-Z & HPE’s “The Machine”
What we are working on is a memory fabric as part of The Machine project: the ability to load/store data in a byte-addressable fashion versus TCP/IP packets. Photonics is already used in the datacenter as fiber optics; the issue is cost compared to, say, copper. The fiber we see in the datacenter usually carries one wavelength, or lambda, of light. The work we have done increases the number of lambdas we can transmit per fiber-optic cable to four. Each lambda transmits at 25 Gb/s, for an aggregate bandwidth of 100 Gb/s per fiber. Twelve of these optical cables are bundled together for an aggregate bandwidth of 1.2 Tb/s. To drive 1.2 Tb/s we developed the lasers, photodetectors, the chip that converts digital to optical signaling (we call it the X1 chip), and the packaging for the datacenter, bringing the cost down to copper's price points. This is what we refer to as the X1 module.
The Machine: photonics, general/special-purpose cores, massive common memory pool.
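The bandwidth figures above follow directly from the lambda counts. A quick arithmetic check using the numbers stated in the text:

```python
# Bandwidth arithmetic from the X1 module description: 4 wavelengths
# (lambdas) per fiber at 25 Gb/s each, 12 fibers bundled together.
LAMBDAS_PER_FIBER = 4
GBPS_PER_LAMBDA = 25
FIBERS_PER_BUNDLE = 12

per_fiber = LAMBDAS_PER_FIBER * GBPS_PER_LAMBDA   # 100 Gb/s per fiber
bundle = per_fiber * FIBERS_PER_BUNDLE            # 1200 Gb/s = 1.2 Tb/s
print(per_fiber, bundle)  # 100 1200
```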

22 Communications and memory fabric
MDC in context: conventional servers share nothing (each physical server has its own SoC, local DRAM, local NVM, and network); a mainframe shares something (SoCs over a coherent interconnect with local DRAM); MDC shares everything (SoCs on a communications and memory fabric with an NVM memory pool).
All of our fabric-attached resources are potentially shareable, and rather than burden every access with rule-checking, we leave synchronization explicitly to the application writers. Those that need the protection of synchronization are afforded high-performance mechanisms to do so, while applications that can tolerate relaxed or eventual consistency can run full tilt.

23 Memory Semantic Fabric
SoC memory access over the memory semantic fabric: each node's SoC has 2-4 TB of "private" memory behind a bridge, plus Universal Memory visible to all nodes. This slide has produced an "ah-hah" moment for several audiences. It illustrates how The Machine architecture allows a compute node to access any part of the Universal Memory pool. When we show a conceptual design with blades and enclosures in a rack, it looks just like the old world, with memory attached to compute nodes; thanks to the bridge, it is actually very different. SoC memory is mapped across nodes but still "private," and all nodes share the Universal Memory.

24 HPE’s “The Machine” with 160 TB Memory

25 HPE’s “The Machine” Architecture

26 Installing a Node

27 HPE’s The Machine: Memory fabric testbed Machine Board
Powerful: a quantum leap in performance, beyond what you can imagine. Open: an open architecture designed to foster a vibrant innovation ecosystem.
Why Labs calls this the "Memory Fabric Testbed": The Machine research program sits between the extremes of a product prototype (we are not doing a product, so we don't have a prototype of what we're not doing) and an experiment (where we hope only to gain verification or nullification of a hypothesis). Important to stress: this prototype hardware is a software development vehicle used by researchers to develop this software, not the final state of The Machine prototype; it is the first design concept. As shown earlier in how MDC works, each node board has up to 4 TB of fabric-attached memory (limited by available DRAM DIMMs). That may not sound like a lot, but now that all the pieces are working, it is a matter of working through the supply chain and the turn-on process to ramp from our initial tens of TB to our self-imposed limit, balancing risk against knowledge to be gained, of hundreds of TB on the rack-scale prototype. HPE will continue to drive research through The Machine program, starting with prototype scalability and expanding the capacity of the proof-of-concept prototype with more nodes and memory, from the initial tens of terabytes to hundreds of terabytes, giving algorithm and application teams an unprecedented platform for exploration. Remember, HPE's Memory-Driven Computing architecture is incredibly scalable, from tiny IoT devices to the exascale, making it an ideal foundation for a wide range of emerging high-performance compute and data-intensive workloads, including big data analytics.
Exascale is a developing area of High Performance Computing (HPC) that aims to create single systems with more computational power than the current top 500 supercomputers combined, and The Machine research project is the basis by which HPE will push the limits of engineering and materials science to achieve it. In researching the requirements for HPC exascale, we will develop practical systems for petascale commercial and enterprise applications, democratizing the availability of applications capable of analyzing petabytes of data in real time.
MFT hardware description: the first Machine hardware prototype leverages the Apollo 6000 enclosure, but we "stretched" it to include more compute and memory. Each enclosure has 10 nodes, and each node consists of a compute board and a fabric-attached memory board with up to 4 TB per node. In the prototype, the memory is standard DDR4; in the future it would be non-volatile memory. As long as the system stays powered on, the memory looks non-volatile to the software. The MFT node consists of two boards connected by a high-speed electrical interconnect: the fabric-attached memory board on the left, the compute board on the right. [Next: Innovation Zone board]
Trusted: always safe, always recoverable; all the benefits without asking for sacrifice. Simple: structurally simple, manageable, and automatic, so that "it just works".

28 “The Machine” prototype specification and architecture specs
The Machine prototype: 160 TB of memory. Each enclosure has 10 nodes, and the prototype has 4 enclosures, for a total of 40 nodes; each node has a fabric-attached memory board with 4 TB and 1 SoC. HPE's Memory-Driven Computing is the foundation for high-performance compute for data-intensive workloads, including big data analytics. The exascale High Performance Computing (HPC) initiative aims to create single systems with more computational power than the current top 500 supercomputers combined. Petascale systems for commercial enterprise applications are capable of analyzing petabytes of data in memory; six "The Machine" prototypes would provide a petascale system. Exascale systems for supercomputing applications are capable of analyzing exabytes of data in memory. The architecture scales from tiny IoT devices to exascale computers. A petabyte is a million gigabytes of memory; an exabyte is a billion gigabytes. Typical server: 1 TB RAM; x1000 gives a petascale system (1 PB memory); x1000 again gives an exascale system (1 EB memory).
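The "six prototypes make a petascale system" claim follows from the slide's decimal units. A quick check of the arithmetic:

```python
# Scale arithmetic in decimal units, per the slide: a petabyte is a million
# gigabytes, an exabyte a billion.
TB_IN_GB = 1_000
PB_IN_GB = 1_000_000
EB_IN_GB = 1_000_000_000

machine_gb = 160 * TB_IN_GB            # one 160 TB prototype, in GB
prototypes_for_pb = PB_IN_GB / machine_gb
print(prototypes_for_pb)  # 6.25, i.e. "six prototypes" as the round figure
```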

29 Photonics

30 Photonics: Light inside “The Machine”
"The Machine" prototype moves fiber-optic cables inside the computer for internal data transmission. Copper currently runs at 5 gigabits per second (Gb/s) and can only go to about 9.6 Gb/s; optical connections will go to at least 20 Gb/s, and perhaps as high as 6 terabytes per second. Photonic data access will address any one byte of data in a 160-petabyte online data store in under 250 nanoseconds (a nanosecond is one billionth of a second). Optical connections also use much less power; specific savings ratios are configuration dependent.
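Addressing any single byte in 160 PB implies a fairly modest address width. A small sketch of that arithmetic, assuming decimal petabytes (10^15 bytes):

```python
import math

# How many address bits does it take to reach any byte in a 160 PB store?
bytes_total = 160 * 10**15          # 160 PB, decimal units assumed
bits = math.ceil(math.log2(bytes_total))
print(bits)  # 58: a 58-bit byte address spans the whole 160 PB pool
```

This fits comfortably inside the 64-bit address space of today's processors, which is part of why a single flat memory pool at this scale is architecturally plausible.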

31 Memristor
Contains Third Party Proprietary Information

32 Memristor Memory Technology
v-I-q relationships: non-volatile, fast, dense, low energy, stable. The Machine uses HPE and Western Digital memristor technology. In a 1971 paper, "Memristor: the missing circuit element," Leon Chua theorized that in addition to the three known basic circuit elements (resistor, capacitor, and inductor) there should be a fourth that relates flux to charge; Chua called this the memristor. It was discovered in 2008 at HP Labs by R. Stanley Williams: two layers of titanium oxide sandwiched between titanium and platinum electrodes. One layer has a slight depletion of oxygen atoms; the oxygen vacancies act as charge carriers and hence decrease resistance. By applying electrical charge across the electrodes, the vacancies can be moved, so the width of the lower-resistance layer can be increased or decreased; thus the resistance of the film as a whole depends on how much charge has been passed through it, and in which direction, in the past. A physical memristor is a two-terminal device whose resistance depends on the magnitude, polarity, and duration of the voltage applied to it. When the voltage is turned off, the resistance remains as it was just before; this makes the memristor a non-linear, non-volatile memory device. Because the power required to write to a memristor is low, and there is no need to periodically refresh as there is for DRAM, the power consumption of the memristor is tiny: the power required to write is low, no power is needed after the write, and once programmed the state is absolutely stable.
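The charge-history dependence described above can be captured in a toy version of the linear-drift model published by the HP Labs team. Parameter values (R_ON, R_OFF, the mobility factor k, the bias voltage) are illustrative assumptions, not data from any real device.

```python
# Toy simulation of a linear-drift memristor: resistance depends on how much
# charge has flowed and in which direction, and holds when power is removed.
# All parameter values are illustrative, not measured device figures.

R_ON, R_OFF = 100.0, 16_000.0    # resistance bounds of the film, ohms
w = 0.1                          # normalized width of the doped layer, 0..1

def resistance(w):
    """Film resistance as a weighted mix of the two titanium-oxide layers."""
    return R_ON * w + R_OFF * (1 - w)

def apply_voltage(w, volts, dt=1e-3, k=10.0):
    """Drift the state by the charge that flowed (i * dt), clamped to [0, 1]."""
    i = volts / resistance(w)    # Ohm's law gives the instantaneous current
    return min(1.0, max(0.0, w + k * i * dt))

start_r = resistance(w)
for _ in range(1000):            # sustained positive bias widens the doped
    w = apply_voltage(w, 5.0)    # layer and lowers the film's resistance
low_r = resistance(w)

# "Remove power": nothing further changes w, so the written state persists.
assert low_r < start_r and resistance(w) == low_r
```

Reversing the polarity of the applied voltage would drift `w` back down and raise the resistance, which is what makes the device a rewritable, non-volatile memory cell.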

33 Hardware + software stack
Insight: analytics and visualization; exabyte-scale algorithms; security built on hardware foundations; manageability; operating systems and programming models; ultra-efficient hardware; data. Hardware alone, no matter how revolutionary, is useless without the software stack. We are stripping Linux down while keeping the POSIX interface, and we are adapting Android to be optimized for non-volatile memory; that is why the first version of The Machine will support today's applications. Linux for The Machine will be open source: HPE is all in on open source, agile development, and the DevOps model, with OpenStack (HPE is the #1 committer) and Trafodion as proof points. How do you manage a million nodes? We are building the tools now; HPE's Helion includes a management tool called Loom that lets you manage 40,000 VMs today. Next-gen core enabling algorithms: HAVEN on steroids. New visualization technology to harness our amazing ability to spot patterns and anomalies. [Next: MDC drives new computer science]

34 Applications

35 Transform performance with Memory-Driven programming
New algorithms; modify existing frameworks; completely rethink. In-memory analytics: 15x faster. Large-scale graph inference: 100x faster. Similarity search: 300x faster. Financial models: 8,000x faster.
This is what you can achieve if you throw away 60 years of software assumptions and legacy. Spark is one of the leading open-source tools for in-memory analytics on a cluster of servers. As part of another project, we wanted to see what we could do with Spark adapted to a large-memory system, and the results were astounding: 15x faster compared to "vanilla" Spark, and we can also run 20x bigger datasets that won't run at all normally. [This harnesses the "well-connected" vector: not big memory, just many processing elements on the pool of memory. The other three harness abundance: they need all the memory the same "distance" away.] Graphs increasingly represent our connected world, and graph inference is how you make predictions using a small known set of data; the comparison is with GraphLab, the state of the art today. Similarity search is used for things like image search and genomics, and is outpacing supercomputer development today; the comparison is with standard disk-based scale-out Map/Reduce, on a 20x bigger dataset. Financial modeling here means Monte Carlo simulations used to predict things like derivatives pricing and portfolio risk; the comparison is with the open-source QuantLib package.

36 Graph analytics time machine
Massive memory and fast fabrics to ingest all data. "Are there any emerging new behaviors in my network?" The Machine will let me build fast graph databases (100x faster than existing technologies like OLTP and OLAP). The other new thing is that we can store the full history: every state the graph has been in, which enables us to look at things that change over time. It's like going from a photo of a bird in flight to a video of the same bird; suddenly you can tell how fast the bird is going and in what direction, whole new dimensions of insight to be gained. You can detect what is normal, what is changing, and what is abnormal: a pattern of new activity might indicate an unfolding security breach in a company's network, allowing action to be taken before any sensitive data can be compromised. You can find the most important, or "critical," nodes; you can find the optimal path to achieve a stated goal; and you can do it in real time, as the graph is changing (offline analysis is the standard today). Why is this hard for conventional computers? Randomness and sheer size (the "cache miss" problem); this is why such systems are not widespread today. The Machine's ability to rapidly access any byte in hundreds of petabytes makes this possible. The DNS/security machine is the example: today we throw away nearly all the data; with massive memory and fast fabrics we ingest all of it, then use graph analytics to find normal versus abnormal behavior, fast graph databases, and the ability to look at things that change over time.
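Finding a "critical node" can be illustrated at toy scale. This sketch uses the simplest possible criterion, highest degree, on a made-up five-node graph; real MDC graph analytics would run far richer measures (betweenness, inference) over billions of in-memory edges.

```python
# Toy critical-node detection: the highest-degree node in a small graph.
# The edge list is invented for illustration.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e"), ("b", "e")]

adj = defaultdict(set)           # undirected adjacency sets
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

critical = max(adj, key=lambda n: len(adj[n]))
print(critical, len(adj[critical]))  # b 4: "b" touches every other node
```

With full history stored, the same query could be run against every past snapshot of the graph, revealing how a node's criticality changes over time.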

