Convey Computer Status Steve Wallach swallach”at”conveycomputer.com.

Presentation transcript:

Convey Computer Status Steve Wallach swallach”at”conveycomputer.com

swallach - April HPC Users Forum 2
Company Background
Started in June 2007
  – 28 people
Raised $15.1 million, series A
  – Intel, Xilinx, Centerpoint, Interwest, Rho
Located in Richardson, Texas
Announced at SC’08
  – Markoff article in the New York Times
Convey  Convex++
  – No plans for Convez

The Convey Hybrid-Core Computer
Extends x86 ISA with performance of a hardware-based architecture
Adapts to application workloads
Programmed in ANSI standard C/C++ and Fortran
Leverages x86 ecosystem

Product
Reconfigurable coprocessor to Intel x86-64
Shared 64-bit virtual and physical memory (cache coherent)
Coprocessor executes instructions that are viewed as extensions to the x86 ISA
Convey-developed compilers (C/C++ and Fortran, based on Open64)
  – Automatic vectorization/parallelization (SIMD, multi-threading)
  – Generates both x86 and coprocessor instructions

[Diagram: the Convey ISA shown alongside the x86 ISA. Labels, in slide order: VECTOR (64-bit float), Finite Element, VECTOR (32-bit float), Signal/Imaging, Bit/Logical, Data Mining, Sorting/Tree Traversal, Systolic, Bio-Informatics, Finance (Float)]

Inside the Coprocessor
[Diagram blocks: Host Interface, Instruction Fetch/Decode, Scalar Processing, Application Engines, crossbar, memory controllers]
Application Engines
  – Personalities dynamically loaded into AEs implement application-specific instructions
Memory controllers
  – 16 DDR2 memory channels
  – Standard or Scatter-Gather DIMMs
  – 80 GB/sec throughput
Coprocessor infrastructure
  – System interface and memory management implemented by coprocessor infrastructure
  – Direct I/O interface
Crossbar
  – Non-blocking
  – Virtual output queuing
  – Round-robin arbitration

Convey Scatter-Gather DIMMs
Standard DIMMs are optimized for cache line transfers
  – performance drops dramatically when access pattern is strided or random
Convey Scatter-Gather DIMMs are optimized for 8-byte transfers
  – deliver high bandwidth for random or strided 64-bit accesses
  – prime number (31) interleave maintains performance for power-of-two strides
  – support both SIMD and parallel multi-threading compute models
  – out-of-order loads and stores
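The point about the prime-number interleave can be checked with a small model. This is only a sketch under stated assumptions: the slide does not describe the actual address-to-bank mapping, so a plain "word address modulo bank count" mapping is assumed here, and the function names are hypothetical.

```python
from collections import Counter

def bank_histogram(stride_words, num_banks, accesses=1024):
    """Count how many strided 8-byte word accesses land in each bank.

    Assumption: bank = word address modulo num_banks (the real mapping
    used by the Scatter-Gather DIMMs is not given in the slides).
    """
    return Counter((i * stride_words) % num_banks for i in range(accesses))

def busiest_bank_share(stride_words, num_banks, accesses=1024):
    """Fraction of all accesses that hit the most heavily loaded bank."""
    hist = bank_histogram(stride_words, num_banks, accesses)
    return max(hist.values()) / accesses

if __name__ == "__main__":
    # Compare a power-of-two bank count (32) with the prime count (31).
    for stride in (1, 2, 8, 32, 64):
        print(stride,
              busiest_bank_share(stride, 32),
              busiest_bank_share(stride, 31))
```

In this model a stride of 32 words sends every access to the same bank when there are 32 banks, while with 31 banks the same stride still touches all of them, because 31 shares no factor with any power of two.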

Personalities
A personality implements a set of extended instructions
  – multiple personalities may be installed on the system
  – one is active on coprocessor at any one time
  – reloaded dynamically by the operating system as needed
Vector personalities
  – implement a load/store vector accumulator architecture with multiple function pipes
  – Convey vectorizing compilers automatically identify loops that can be executed with vector instructions
  – can operate on floating point, integer, or bit data
“Procedural” personalities
  – implement an entire routine or algorithm in logic
  – invoked by one or more instructions
  – called as procedures or functions
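The "one active at a time, reloaded as needed" rule can be pictured with a toy model. Everything here is hypothetical: the class, the personality names, and the idea of counting reloads are illustrative assumptions, not Convey's operating-system interface.

```python
class CoprocessorModel:
    """Toy model: several personalities installed, exactly one active."""

    def __init__(self, installed):
        self.installed = set(installed)
        self.active = None   # nothing loaded until first use
        self.reloads = 0     # number of times a personality was (re)loaded

    def dispatch(self, personality):
        """Run work that needs `personality`, reloading only if required."""
        if personality not in self.installed:
            raise ValueError(f"personality not installed: {personality}")
        if personality != self.active:
            self.active = personality
            self.reloads += 1
        return self.active

def count_reloads(installed, workload):
    """Return how many loads a sequence of personality requests triggers."""
    coprocessor = CoprocessorModel(installed)
    for personality in workload:
        coprocessor.dispatch(personality)
    return coprocessor.reloads

if __name__ == "__main__":
    installed = ["sp_vector", "financial", "proteomics"]  # hypothetical names
    interleaved = ["sp_vector", "financial"] * 3
    batched = sorted(interleaved)
    print(count_reloads(installed, interleaved))
    print(count_reloads(installed, batched))
```

In this model, alternating between two personalities forces a load on every request, while grouping requests by personality needs only one load per personality.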

SPvector Personality
[Diagram: crossbar feeding 32 function pipes; vector elements distributed across function pipes; each pipe contains a vector register file and functional units labeled fma, integer, store, load, logical, rcp/divide, misc, add; results return to crossbar]
Same instructions sent to all function pipes
Each function pipe supports:
  − multiple functional units
  − out-of-order execution
  − register renaming
A load-store vector architecture with modern latency-hiding features
Optimized for signal processing (e.g., oil & gas) applications
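A rough picture of "same instruction to all pipes, elements distributed across pipes" is sketched below. The slide does not say how elements are assigned to pipes, so round-robin assignment (element i goes to pipe i mod 32) is an assumption, and the function names are hypothetical.

```python
NUM_PIPES = 32  # from the slide: 32 function pipes

def split_across_pipes(vector, num_pipes=NUM_PIPES):
    """Assumed round-robin layout: pipe p holds elements p, p+32, p+64, ..."""
    return [vector[p::num_pipes] for p in range(num_pipes)]

def vector_fma(a, b, c, num_pipes=NUM_PIPES):
    """Apply one multiply-add 'instruction' to every element, pipe by pipe.

    Note: Python evaluates a*b + c with two roundings; a hardware fused
    multiply-add rounds once, so low-order bits can differ.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("operand vectors must have the same length")
    result = [0.0] * len(a)
    for pipe in range(num_pipes):
        # Each pipe processes only its own slice of the element indices.
        for i in range(pipe, len(a), num_pipes):
            result[i] = a[i] * b[i] + c[i]
    return result

if __name__ == "__main__":
    n = 64
    a = [float(i) for i in range(n)]
    b = [2.0] * n
    c = [1.0] * n
    print(vector_fma(a, b, c)[:4])
    print([len(chunk) for chunk in split_across_pipes(a)][:4])
```

The loops run sequentially here; in the hardware described on the slide the pipes would work on their elements concurrently.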

Financial Vector Personality
[Diagram: crossbar feeding 32 function pipes; vector elements distributed across function pipes; each pipe contains a vector register file and functional units labeled fma, integer, store, load, logical, rcp, misc, exp/log/CND, add, Parallel RNG; results return to crossbar]
Same overall structure and datapaths as the SPvector personality
Pairs of single precision functional units replaced by double precision units
Adds functional units for common functions such as log, exp, random number generation
Supported by the compiler as vector intrinsics
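To see why log, exp, and CND (cumulative normal distribution) units matter for finance codes, consider a standard Black-Scholes call price, which uses all three per option. Black-Scholes is chosen here only as a familiar illustration; the slides do not name a specific workload, and the function names are hypothetical.

```python
import math

def cnd(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, years):
    """European call price; uses log, exp, and CND once or twice per option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * cnd(d1) - strike * math.exp(-rate * years) * cnd(d2)

def price_calls(spots, strike, rate, vol, years):
    """Price a whole vector of spots: the kind of loop a vectorizer targets."""
    return [black_scholes_call(s, strike, rate, vol, years) for s in spots]

if __name__ == "__main__":
    print(price_calls([90.0, 100.0, 110.0], 100.0, 0.05, 0.2, 1.0))
```

Each loop iteration is independent of the others, so the transcendental functions dominate the cost, which is the case the dedicated functional units on the slide are meant to address.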

Inspect Proteomics Procedural Personality
[Diagram: pipe 0, pipe 1, pipe 2, …, pipe 31. Labels, in slide order: Substring Fetch, Protein Fetch, Peptide Mass Memory, PRM Scores Memory, Score, Save Match, Temp Match Memory, Store Matches, length, ProteinLen, Score To Beat, Temp Matches, mbuf, Match, Score To Beat, Protein Database, Update Score To Beat]
Entire numerical routine implemented as function pipe
Scalar unit (in HC-1) performs setup
Multiple function pipes for data parallelism
Operates on main memory using virtual addresses

Development Tools
[Diagram: C/C++ and Fortran95 front ends feed a Common Optimizer, which feeds the Intel® 64 Optimizer & Code Generator and the Convey Vectorizer & Code Generator; the Linker combines their output with other objects and the Procedural Personality Interface into one executable containing Intel® 64 code and coprocessor code]
Program in ANSI standard C/C++ and Fortran
Unified compiler generates x86 & coprocessor instructions
Seamless debugging environment for Intel & coprocessor code
Executable can run on x86_64 nodes or on Convey Hybrid-Core nodes

Where we are
Shipping beta
  – Bioinformatics, seismic, speech processing, architectural simulation, etc.
35 people
Production Summer 2009
Expanding sales, service, manufacturing
