Nat Duca, Jonathan Cohen (Johns Hopkins University); Peter Kirchner (IBM Research). Stream Caching: Mechanisms for General Purpose Stream Processing.

Talk Outline
● Objective: reconcile current practices of CPU design with stream processing theory
● Part 1: Streaming ideas in current architectures
  – Latency and die space
  – Processor types and tricks
● Part 2: Insights about stream caches
  – Could window-based streaming be the next step in computer architecture?

Streaming Architectures
● Graphics processors
● Signal processors
● Network processors
● Scalar/superscalar processors
● Data stream processors?
● Software architectures?

What is a Streaming Computer?
● Two [overlapping] ideas
  – A system that executes strict-streaming algorithms [unbounded N, small M]
  – A general-purpose system that is geared toward general computation, but is best for the streaming case
● Big motivator: ALU-bound computation!
● To what extent do present computer architectures serve these two views of a streaming computer?

[Super]scalar Architectures
● Goal: keep memory latency from limiting computation speed
● Solutions:
  – Caches
  – Pipelining
  – Prefetching
  – Eager execution / branch prediction [the "super" in superscalar]
● These are heuristics for locating streaming patterns in unstructured program behavior
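As a concrete illustration of one of the heuristics above, here is a minimal sketch of software prefetching during a pointer-chasing traversal, where the hardware cannot predict the access pattern. The lookahead distance of 4 and the list-sum workload are illustrative choices, not from the talk; `__builtin_prefetch` is a GCC/Clang intrinsic.

```c
/* Sketch: hide linked-list miss latency by prefetching a node a few
 * hops ahead of the one currently being summed. */
struct node { int value; struct node *next; };

long sum_list_prefetched(struct node *head) {
    long s = 0;
    struct node *ahead = head;
    for (int i = 0; i < 4 && ahead; i++)     /* start 4 nodes ahead */
        ahead = ahead->next;
    for (struct node *n = head; n; n = n->next) {
        if (ahead) {
            __builtin_prefetch(ahead, 0, 1); /* read access, low temporal locality */
            ahead = ahead->next;
        }
        s += n->value;                       /* useful work overlaps the miss */
    }
    return s;
}
```

The result is identical to a plain list sum; the prefetch only changes when the cache misses are paid.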

By the Numbers, Data
● Optimized using caches, pipelines, and eager execution
  – Random: 182 MB/s
  – Sequential: 315 MB/s
● Optimized with prefetching
  – Random: 490 MB/s
  – Sequential: 516 MB/s
● Theoretical maximum: 533 MB/s
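The MB/s figures above come from the authors' measurements; the sketch below only shows the two access patterns being compared (the measurement harness, array sizes, and timing are omitted). The function names and the permutation-based random walk are illustrative.

```c
/* Sequential walk: the pattern hardware prefetchers recognize. */
long sum_sequential(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Random walk via a precomputed permutation: same work, but each
 * access is a likely cache miss, so throughput drops sharply. */
long sum_permuted(const int *a, const int *perm, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[perm[i]];
    return s;
}
```

Both walks touch every element exactly once, so they compute the same sum; only the memory system sees a difference.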

By the Numbers, Observations
● Achieving full throughput on a scalar CPU requires either
  – (a) prefetching [requires advance knowledge], or
  – (b) sequential access [no advance knowledge required]
● Vector architectures hide latency in their instruction set using implicit prefetching
● Dataflow machines solve latency using automatic prefetching
● Rule 1: Sequential I/O simplifies control and access to memory, etc.

Superscalar (e.g. P4)
[Diagram: local memory hierarchy, cache, prefetch]

Superscalar (e.g. P4)
[Diagram: local memory hierarchy, cache, prefetch]
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating-point ALU.

Pure Streaming (e.g. Imagine)
[Diagram: in streams --> out streams]

Can We Build This Machine?
[Diagram: local memory hierarchy; in streams --> out streams]
● Rule 2: A small memory footprint allows more room for ALU --> more throughput

Part II: Chromium
● Pure stream processing model
● Deals with the OpenGL command stream
  – Begin(Triangles); Vertex, Vertex, Vertex; End;
● Record splits are supported; joins are not
● You perform useful computation in Chromium by joining Stream Processors into a DAG
  – Note: the DAG is constructed across multiple processors (unlike dataflow)
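A toy model of the record-split idea above: a "record" is one Begin..End run of commands, and a splitter routes each whole record round-robin to one of K downstream sinks (joins are omitted, matching stock Chromium). The command encoding and the `split_records` interface are invented for illustration, not Chromium's actual API.

```c
/* Hypothetical command encoding for an OpenGL-like stream. */
enum cmd_op { CMD_BEGIN, CMD_VERTEX, CMD_END };
struct cmd { enum cmd_op op; };

/* Assign each command to a sink index in sink_of[]; a record never
 * straddles two sinks. Returns the number of complete records seen. */
int split_records(const struct cmd *stream, int n, int sink_of[], int k) {
    int records = 0, cur = 0;
    for (int i = 0; i < n; i++) {
        sink_of[i] = cur;               /* whole record goes to one sink */
        if (stream[i].op == CMD_END) {  /* record boundary: rotate sinks */
            records++;
            cur = (cur + 1) % k;
        }
    }
    return records;
}
```

Splitting at record boundaries is what keeps each downstream processor's input a well-formed command stream.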

Chromium w/ Stream Caches
● We added join capability to Chromium for the purpose of collapsing multiple records into one
  – Incidentally, this allows windowed computations
● Thought: there seems to be a direct connection between streaming joins and sliding windows
● Because we're in software, the windows can become quite big without too much hassle
● What if we move to hardware?

Windowed Streaming
[Diagram: in streams --> window buffer --> out streams]
Uses for a window buffer of size M:
● Store program structures of up to size M
● Cache M input records, where M << N
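The "cache M input records, where M << N" use can be sketched as a ring buffer holding the last M records of an unbounded stream; here it maintains a sliding sum as each record arrives. The record type (int), the window size, and the sliding-sum workload are illustrative assumptions.

```c
#define M 4  /* window size; tiny here, 128 KB..2 MB on chip per the talk */

struct window {
    int buf[M];
    int count;  /* total records seen so far (N may be unbounded) */
    long sum;   /* running sum of the records currently in the window */
};

/* Push one record; returns the sum over the last min(count, M) records.
 * Only O(M) state is kept no matter how large the stream grows. */
long window_push(struct window *w, int rec) {
    int slot = w->count % M;
    if (w->count >= M)
        w->sum -= w->buf[slot];  /* evict the record leaving the window */
    w->buf[slot] = rec;
    w->sum += rec;
    w->count++;
    return w->sum;
}
```

The constant-size state is the whole point: the ALU can stream through N records while the buffer never exceeds M.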

Windowed Streaming
[Diagram: in streams --> window buffer --> out streams]
Realistic values of M if you stay exclusively on chip: 128 KB ... 2 MB [DRAM-on-chip technology is promising]

Impact of Window Size
[Diagram: in streams --> window buffer --> out streams]
● Insight: as M increases, this starts to resemble a superscalar computer

The Continuum Architecture
[Diagram: memory hierarchy; in streams --> out streams]
● For too large a value of M:
  – Non-sequential I/O --> caches
  – Caches --> less room for ALU (etc.)

Windowed Streaming
[Diagram: in streams --> window buffer --> out streams, with loopback streams]
● Thought: can we augment the window-buffer limit with a loopback feature?

Windowed Streaming
[Diagram: in streams --> window buffer --> out streams, with loopback streams through memory]
● Thought: what do we gain by allowing a finite delay in the loopback stream?
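One way to picture the delayed loopback: records needing another pass are re-enqueued behind the records already waiting, rather than reprocessed immediately, which models the buffered loopback path through memory. The FIFO model and the per-record operation (halve until below a limit) are placeholders of my own, not from the talk.

```c
#define QCAP 64  /* illustrative loopback queue capacity */

struct fifo { int buf[QCAP]; int head, tail; };

static void fifo_push(struct fifo *q, int v) { q->buf[q->tail++ % QCAP] = v; }
static int  fifo_pop(struct fifo *q)         { return q->buf[q->head++ % QCAP]; }
static int  fifo_empty(const struct fifo *q) { return q->head == q->tail; }

/* Process n input records; finished records go to out[]. A record that
 * needs more work loops back to the tail of the queue, i.e. after a
 * finite delay equal to the current queue occupancy. Returns the
 * number of records emitted downstream. */
int run_with_loopback(const int *in, int n, int *out, int limit) {
    struct fifo q = {0};
    int emitted = 0;
    for (int i = 0; i < n; i++) fifo_push(&q, in[i]);
    while (!fifo_empty(&q)) {
        int r = fifo_pop(&q);
        if (r < limit) out[emitted++] = r;  /* done: emit downstream */
        else fifo_push(&q, r / 2);          /* loop back for another pass */
    }
    return emitted;
}
```

The gain suggested by the slide is exactly this: iterative algorithms fit a streaming machine once a record may revisit the processor after a bounded delay.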

Streaming Networks: Primitive

Streaming Networks: 1:N [Hanrahan model]

Streaming Networks: N:1

Streaming Networks: The Ugly

Versatility of Streaming Networks?
● Question: what algorithms can we support here? How?
  – From both a theoretical and a practical view
● We have experimented with graphics problems only:
  – Stream compression, visibility & culling, level of detail

New Concepts with Streaming Networks
● An individual processor's cost is small
● Highly flexible: uses high-level ideas from dataflow
  – Multiple streams in and out
  – Interleaved or non-interleaved
  – Scalable window size
● Open to entirely new concepts
  – E.g., how do you add more memory in this system?

Summary
● Systems are easily built on the basis of streaming I/O and memory models
● By design, such a system makes maximum use of hardware: very efficient
● Continuum of architectures: pure streaming to superscalar
● Stream processors are trivially chained, even in cycles
● Such a chained architecture may be highly flexible:
  – Experimental evidence & systems work
  – Dataflow literature
  – Streaming literature