Lessons From Building Spiral: The C Of My Dreams
Franz Franchetti, Carnegie Mellon University

Presentation transcript:

Lessons From Building Spiral: The C Of My Dreams
Franz Franchetti, Carnegie Mellon University

C Compilers Got Pretty Good…
- Numerical Recipes: textbook FFT implementation (ANSI C + auto-vectorization)
- Spiral-generated FFT: plain ANSI C
- Spiral-generated FFT: auto-vectorized, using C99 + #pragma
- Spiral-generated FFT: using intrinsics
- Gap: 30%-50%
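
For concreteness, a minimal sketch of the three code flavors being compared, on a trivial SAXPY-style loop rather than the actual Spiral-generated FFT (SSE intrinsics and the Intel pragma spelling are assumed):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* 1) Plain ANSI C: vectorization is left entirely to the compiler. */
    void saxpy_ansi(float *y, const float *x, float a, int n) {
        int i;
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* 2) C99 + pragma: restrict plus a vendor hint (Intel spelling shown)
          to help auto-vectorization. */
    void saxpy_hinted(float *restrict y, const float *restrict x, float a, int n) {
        #pragma ivdep
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* 3) Explicit SSE intrinsics: 4-way SIMD; assumes n is a multiple of 4. */
    void saxpy_sse(float *y, const float *x, float a, int n) {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
        }
    }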

Spiral vs. C Compiler
- Toolchain: problem specification → algorithm → C code → fast executable, via Algorithm Generation, Algorithm Optimization, Implementation, and Code Optimization; search drives the Spiral stages, the C compiler handles only the final code optimization
- Spiral does all high-level optimization: algorithm choice, program transformations, parallelization, vectorization, memory layout
- C compiler = "glorified assembler": it gets very simple (pre-digested) code (sketch below), provides access to machine details, should behave predictably, and must produce fast code
- We are after the fastest possible code
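
To illustrate what "pre-digested" code means here, a hand-written sketch in the style of generator output (not actual Spiral output): fully unrolled, scalarized, single-assignment temporaries, no loops or branches left for the compiler to analyze.

    /* Illustrative only: a size-2 DFT butterfly on interleaved complex data,
       fully unrolled and scalarized in the style of generator output. */
    void dft2(double *Y, const double *X) {
        double t0, t1, t2, t3;
        t0 = X[0] + X[2];   /* Re(X0 + X1) */
        t1 = X[1] + X[3];   /* Im(X0 + X1) */
        t2 = X[0] - X[2];   /* Re(X0 - X1) */
        t3 = X[1] - X[3];   /* Im(X0 - X1) */
        Y[0] = t0;  Y[1] = t1;
        Y[2] = t2;  Y[3] = t3;
    }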

Cross-Platform Portability in Spiral
- SIMD vector extensions: SSE – SSE 4.2, AVX, LRBni, AltiVec, VMX, Cell, BlueGene/L, CUDA warps, …
- Threading and messaging interfaces: Pthreads, OpenMP, Windows threads, CUDA, MPI, Cell DMA
- Compilers: Intel C/C++, Intel Fortran, IBM XL C, Gnu C, PGI, MS Visual C, Vector C, …
- Languages: K&R C, ANSI C, C99, C++, Intel/GNU/IBM/MS extensions, Fortran 77, Fortran 90, Java, CUDA, Verilog, x86 assembly
- Hardware: FPGA, ASIC, CPU + instruction in FPGA
- Caveat: retargeting the unparser is the easy part (see the sketch below); the hard part is in the higher abstraction levels.
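
As a small example of the "easy part", the same abstract vector operation can be emitted for different SIMD ISAs behind one interface. This is an illustrative sketch, not actual Spiral unparser output, and assumes the compiler-defined macros __SSE__ and __ALTIVEC__:

    /* One abstract 4-way vector add, unparsed either to SSE or to AltiVec/VMX. */
    #if defined(__SSE__)
      #include <xmmintrin.h>
      typedef __m128 vec4f;
      static inline vec4f vadd(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
    #elif defined(__ALTIVEC__)
      #include <altivec.h>
      typedef vector float vec4f;
      static inline vec4f vadd(vec4f a, vec4f b) { return vec_add(a, b); }
    #endif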

Proposal: Influence the C Standard
- C is extensible enough for our needs: intrinsic functions, pragmas, preprocessor, bit-fields, pointers, struct/union, vector data types, inline assembly, inline opcodes (examples below)
- A C compiler is available on any machine: the OS and the C compiler itself are built with it…
- High-quality compilers are available: Intel C, IBM XL C, Gnu C, PGI C, MS Visual C
- It works for us (somehow): most library generators target C and find ways to co-opt the compilers
- Only standards get widely adopted: it will take some time, and if we try to have our own language and compiler, we will fail
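
A few of these extension hooks in action, as a sketch in GCC/Clang and SSE2 spelling; other compilers use different syntax for the same mechanisms:

    #include <emmintrin.h>                    /* intrinsic functions (SSE2)      */

    typedef float v4sf __attribute__((vector_size(16)));  /* vector data type   */

    float dot4(v4sf a, v4sf b) {
        v4sf p = a * b;                       /* element-wise vector multiply    */
        return p[0] + p[1] + p[2] + p[3];     /* vector subscripting (GCC/Clang) */
    }

    __m128d add2(__m128d x, __m128d y) {
        return _mm_add_pd(x, y);              /* intrinsic maps 1:1 to an ISA op */
    }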

Everybody Extends C (At Will)
- Industrial C compilers: Intel C/C++, IBM XL C, MS Visual C, PGI C, Vector C, …
- Gnu C, LLVM C, Open64 C: the fall-back for everybody without their own C compiler
- OpenMP, CUDA, UPC: provide parallelism through #pragma or language extensions (example below)
- Hardware vendors: map ISAs etc. into intrinsic functions and data types
- Provide a standard on how to extend C "nicely" for us: specify what a C compiler should do
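
For instance, OpenMP adds parallelism purely through a #pragma, so the code still compiles as sequential ANSI C when the pragma is ignored; a minimal sketch:

    /* Parallel loop expressed only through a #pragma (OpenMP); without an
       OpenMP-enabled build the pragma is ignored and the loop runs serially. */
    void scale(float *x, float a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }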

Let's Get To Work
- Survey what we do to get performance: threading commands, SIMD intrinsics, memory attributes, …
- Survey what we want the compilers to do (but they don't): translate our wish list into pragmas, attributes, etc.
- Collect horror stories and black-belt programming tricks: what are the problems, and how do we fight the compiler?
- How do we want C interpreted? The "register" keyword, SSA order, array writes = spills
- What about assembly tricks that can't be expressed in C? Side effects of instructions, software pipelining, IA32 memory operands (see the inline-assembly sketch below)
- Goal: a small C extension to be included in C1X/C2X
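
One concrete example of such a trick, as a sketch assuming GCC extended-asm syntax and a 16-byte-aligned pointer: an SSE add with an IA32 memory operand, which folds the load into the arithmetic instruction in a way neither plain C nor the intrinsics express directly.

    /* Sketch (GCC extended-asm syntax assumed): ADDPS with a memory operand.
       The load is folded into the add; p must be 16-byte aligned. */
    #include <xmmintrin.h>

    static inline __m128 add_from_mem(__m128 acc, const float *p)
    {
        __asm__("addps %1, %0"
                : "+x"(acc)                     /* acc stays in an SSE register */
                : "m"(*(const __m128 *)p));     /* memory operand, no separate load */
        return acc;
    }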

C for Autotuning: Some Ideas
- Standardize intrinsic interface and attributes: clean up SSE vs. VMX, vector constants, Intel/Gnu/XL syntax, alignment, memory placement, register, function-call ABI, …
- Expose compiler optimizations and algorithms: #pragma for Belady register allocation, array scalarization, DAG ordering, …
- Autotuning subset/extension of OpenMP: thread pinning, fast synchronization, worker threads, SMT/SIMT
- Manage multiple address spaces, caches, local stores: messaging, DMA, scratchpad loads/stores, non-temporal loads/stores, cache lines, memory layout, address translation
- Strict adherence to semantics, attributes and pragmas: register means register, SSA code defines order, … (hypothetical sketch below)
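
A hypothetical sketch of what such annotations could look like; none of these "autotune" pragmas exist in any compiler today, they only illustrate the kind of standardized knobs proposed above (the aligned attribute on t is real GCC/Clang syntax):

    #pragma autotune regalloc(belady)            /* hypothetical: expose register allocator */
    #pragma autotune scalarize(t)                /* hypothetical: request array scalarization */
    void copy_scaled(float *restrict Y, const float *restrict X)
    {
        float t[8] __attribute__((aligned(64)));
        for (int i = 0; i < 8; i++)
            t[i] = 2.0f * X[i];
        #pragma autotune store(nontemporal)      /* hypothetical: non-temporal store hint */
        for (int i = 0; i < 8; i++)
            Y[i] = t[i];
    }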

Summary
- Make C for Autotuning part of the next C standard: a small, well-developed C extension plus C semantics clarifications
- Don't build our own compiler: impossible to keep up with industry and hardware changes
- Collect and formalize community knowledge: an autotuning and program-generation community effort
- Convince hardware and compiler vendors to join: broad support is needed to make it happen
- Convince the C standard committee: hard, but crucial to influence the C standard