High Throughput Compression of Double-Precision Floating-Point Data. Martin Burtscher and Paruj Ratanaworabhan, School of Electrical and Computer Engineering, Cornell University.

Presentation transcript:

High Throughput Compression of Double-Precision Floating-Point Data Martin Burtscher and Paruj Ratanaworabhan School of Electrical and Computer Engineering Cornell University

Fast Floating-Point Compression, March 2007

Introduction
 Scientific programs
  Produce and transfer lots of 64-bit FP data
  Exchange 100s of MB/s, generate 1 TB/day of new data
 Large amounts of data
  Are expensive to store and transfer
  Take a long time to transfer
 Data compression
  Can reduce the amount of data
  Can speed up transfers

IEEE 754 Double-Precision Values
 Goal
  Compress linear streams of FP data fast and well
  Online operation and lossless compression
 Challenges
  Floating-point data are hard to compress
  FP codes may generate over 90% unique values
 Related work on lossless FP compression
  Focuses on 32-bit single-precision values
  Relies on smoothness of data or known geometry

Floating-Point Data Compression
 Our approach
  Predict FP data with value prediction algorithms and encode the difference
  Format:
 Value predictors
  Hardware devices to speed up processors
  Predict instruction results by extrapolating sequences of previously computed results
  Employ very fast and simple algorithms
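A table-based value predictor of the kind described above can be sketched as a finite-context-method (fcm) predictor. This is an illustrative sketch only: the table size, hash mixing, and field names below are assumptions, not the parameters used in FPC.

```c
#include <stdint.h>

/* Minimal sketch of a finite-context-method (fcm) value predictor.
   Table size and hash mixing are illustrative choices, not the paper's
   parameters. For each hashed context of recent values, the predictor
   remembers the value that followed last time and predicts it again. */

#define FCM_SIZE 65536u                  /* power-of-two table size */

typedef struct {
    uint64_t table[FCM_SIZE];            /* last value seen per context */
    uint64_t hash;                       /* hash of recent value history */
} fcm_t;

/* Predict the next value: whatever followed this context before. */
uint64_t fcm_predict(const fcm_t *p) {
    return p->table[p->hash % FCM_SIZE];
}

/* After the true value is known, remember it and advance the context. */
void fcm_update(fcm_t *p, uint64_t true_value) {
    p->table[p->hash % FCM_SIZE] = true_value;
    p->hash = (p->hash << 6) ^ (true_value >> 48);   /* fold in top bits */
}
```

A dfcm variant stores strides instead of absolute values; either way, each prediction costs only a shift, an XOR, and a table lookup, which is what makes these algorithms fast enough for compression.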

FPC Algorithm
 Make two predictions
 Select closer value
 XOR with true value
 Count leading zeros
 Encode value
 Update predictors
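The steps above can be sketched as one compression step per value. For brevity, two simple predictors (last value and stride) stand in for the fcm/dfcm pair FPC actually uses, and the code fields are returned unpacked rather than emitted as a bit stream; all names here are illustrative.

```c
#include <stdint.h>

/* Sketch of one FPC-style compression step: make two predictions, pick
   the one that matches the true value in more leading bytes, XOR, and
   record the leading-zero-byte count plus the residual. Last-value and
   stride predictors stand in for the paper's fcm/dfcm pair. */

typedef struct {
    uint64_t last;        /* predictor 0: previous value repeats       */
    uint64_t stride;      /* predictor 1: previous value + last stride */
    uint64_t prev;        /* previous value, for stride computation    */
} state_t;

typedef struct {
    int      selector;    /* which predictor won (0 or 1)   */
    int      zero_bytes;  /* leading zero bytes of residual */
    uint64_t residual;    /* value XOR chosen prediction    */
} code_t;

static int leading_zero_bytes(uint64_t x) {
    int n = 0;
    while (n < 8 && ((x >> (56 - 8 * n)) & 0xFF) == 0)
        n++;
    return n;
}

code_t compress_one(state_t *s, uint64_t v) {
    uint64_t r0 = v ^ s->last;            /* residual vs. predictor 0 */
    uint64_t r1 = v ^ s->stride;          /* residual vs. predictor 1 */
    int z0 = leading_zero_bytes(r0);
    int z1 = leading_zero_bytes(r1);

    code_t c;
    if (z0 >= z1) { c.selector = 0; c.zero_bytes = z0; c.residual = r0; }
    else          { c.selector = 1; c.zero_bytes = z1; c.residual = r1; }

    /* update both predictors with the true value */
    s->stride = v + (v - s->prev);
    s->prev   = v;
    s->last   = v;
    return c;
}
```

The decompressor runs the same predictors on the reconstructed stream and recovers each value as prediction XOR residual, so both sides stay in sync without transmitting any predictor state.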

Algorithm/Implementation Co-Design
 Inner loop (about 50 and 70 C statements)
  Compresses or decompresses one block of data
  Accounts for over 90% of execution time
 Loop body optimizations
  Loop body is used to hide memory latency
  No FP, integer multiply, or integer divide instructions
  No branches (only conditional moves)
  Single basic block (>100 machine instructions)
  Average IPC of 5.4 and 5.1 on Itanium 2
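The "no branches, only conditional moves" point applies to operations such as the leading-zero count. One branch-free formulation (an assumption for illustration, using the GCC/Clang intrinsic __builtin_clzll, which is not something the talk specifies):

```c
#include <stdint.h>

/* Leading-zero-byte count without a data-dependent branch. The ternary
   guards __builtin_clzll's undefined behavior for zero input; compilers
   typically lower it to a conditional move rather than a branch.
   __builtin_clzll is a GCC/Clang intrinsic, not part of standard C. */
static inline int lzb64(uint64_t x) {
    return x ? (__builtin_clzll(x) >> 3) : 8;
}
```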

Evaluation Method
 System
  1.6 GHz Itanium 2, Intel C Itanium Compiler 9.1
  Red Hat Enterprise Linux AS4
 Scientific datasets
  Linear streams of 64-bit FP data (18–277 MB)
  4 observations: spitzer, temp, error, info
  4 simulations: comet, plasma, brain, control
  5 messages: bt, lu, sp, sppm, sweep3d

Compression Throughput

Decompression Throughput

Summary and Conclusions
 FPC algorithm
  Highest throughput and mean compression ratio
  1.02 – absolute compression ratio
  840 and 680 MB/s throughput on a 1.6 GHz Itanium 2 (= 2 and 2.5 machine cycles per byte)
 Conclusions
  Value predictors are fast & accurate data models
  Algorithm/implementation co-design is essential