VPC3: A Fast and Effective Trace-Compression Algorithm (Martin Burtscher)


Introduction
- Traces of PC values compress well with existing algorithms, e.g., Sequitur.
- Extended traces add extended data (ED) to the PC values: ED = {load/store values, load/store addresses, ...}.
- ED exhibits lower repeatability and spans a wider range than PC values, and no good compression algorithms exist for it.
- Value predictors can compress extended traces directly, or preprocess them for a post-compression stage.

Compression using value predictors
- Assume a set of predictors, each with a 1-byte ID, and 8-byte trace entries.
- Compression algorithm:
  - Compare the trace entry with the predicted values.
  - If one matches, write the ID of that predictor.
  - If none matches, write a special code followed by the trace entry value.
  - Update the predictors with the trace entry value.
- Decompression is analogous.
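The matching scheme above can be sketched as follows. This is a minimal illustration, not VPC's actual encoding: the escape byte value, the little-endian literal format, and the trivial last-value predictor are all assumptions.

```python
ESCAPE = 0xFF  # assumed special code: next 8 bytes are a literal trace entry

class LastValuePredictor:
    """Trivial example predictor: always predicts the last value seen."""
    def __init__(self):
        self.last = 0
    def predict(self):
        return self.last
    def update(self, value):
        self.last = value

def compress(entries, predictors):
    """Replace each 8-byte entry by a 1-byte predictor ID when some
    predictor guessed it, or by an escape code plus the literal value."""
    out = bytearray()
    for value in entries:
        for pid, p in enumerate(predictors):
            if p.predict() == value:
                out.append(pid)                     # 1-byte predictor ID
                break
        else:
            out.append(ESCAPE)                      # no predictor matched
            out += value.to_bytes(8, 'little')      # emit literal entry
        for p in predictors:
            p.update(value)                         # all predictors see the stream
    return bytes(out)
```

On a run of identical values, only the first entry costs 9 bytes; every subsequent entry shrinks to a single predictor-ID byte, which is the source of the compression.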

VPC3 compression
- Lossless, single-pass algorithm
- Excellent compression rate
- Fixed memory requirement
- Fast compression and decompression speed

Evaluated compression algorithms
- Gzip: 2.3 MB memory usage; fast (de)compression; poor compression rate.
- Bzip2: 10 MB memory usage; slower (de)compression; better compression rate.
- Sequitur (modified): source-code optimizations; splits PC and ED into separate streams for compression; bzip2 post-compression; impressive compression rate for traces; 951 MB memory usage.

Value predictors (1): Last-n-value predictor (LnV)
- Retains the n most recent values; all n are offered as predictions, so it can be viewed as n independent predictors.
- Can predict sequences of repeating or alternating values of length <= n (typically n <= 4).
- Used for ED values but not PC values.
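A minimal LnV sketch, with the class name and interface chosen for illustration:

```python
from collections import deque

class LnVPredictor:
    """Last-n-value predictor: keeps the n most recent values and
    offers all of them as candidate predictions (n components)."""
    def __init__(self, n=4):
        self.values = deque([0] * n, maxlen=n)
    def predictions(self):
        # Most recent value first; each slot acts as its own predictor.
        return list(self.values)
    def update(self, value):
        self.values.appendleft(value)
```

Because the n slots hold the last n distinct occurrences, any stream that cycles through at most n values (e.g., an alternating pair) always has its next value among the predictions once the window has filled.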

Value predictors (2): Finite-context-method predictor (FCM)
- A hash of the n most recent values is used as an index into a hash table, both for predicting and for inserting new values.
- FCMs are described by their order n, e.g., FCM3.
- The most recent (a) and second-most-recent (b) values are retained and predicted for each hash-table line; e.g., FCM3b is an order-3 FCM that predicts the second-most-recent value.
- Can predict long, arbitrary sequences of values.
- Used for both PC and ED values.

Value predictors (3): Differential-finite-context-method predictor (DFCM)
- Similar to the FCM, but predicts and stores strides instead of absolute values.
- The final prediction is formed by adding the predicted stride to the most recently seen value.
- Can predict never-before-seen values.
- Improves prediction of ED values over the FCM; no improvement for PC values.
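An order-1 DFCM sketch (the order and interface are illustrative): the context is the last stride, the table stores the stride that followed it, and the prediction adds that stride to the last value.

```python
class DFCMPredictor:
    """Order-1 differential FCM sketch: stores strides (deltas) in the
    context table and adds the predicted stride to the last value, so
    it can predict values that have never appeared in the stream."""
    def __init__(self):
        self.last = 0
        self.last_stride = 0
        self.table = {}          # last stride -> next stride
    def predict(self):
        return self.last + self.table.get(self.last_stride, 0)
    def update(self, value):
        stride = value - self.last
        self.table[self.last_stride] = stride
        self.last_stride = stride
        self.last = value
```

Fed the sequence 10, 20, 30, it predicts 40 even though 40 never occurred, which is exactly why strides help on address-like ED values.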

Extended trace format
- 32-bit PC field, 64-bit ED field
- Record layout: PC0(32), ED0(64), PC1(32), ED1(64), ...
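A minimal reader for this record layout might look as follows; the little-endian byte order is an assumption, since the slides do not specify endianness.

```python
import struct

# One record = 4-byte PC followed by 8-byte ED; '<' assumes
# little-endian with no padding between fields.
REC = struct.Struct('<IQ')

def read_entries(buf):
    """Yield (pc, ed) pairs from a flat extended-trace byte buffer."""
    for off in range(0, len(buf), REC.size):
        pc, ed = REC.unpack_from(buf, off)
        yield pc, ed
```

Note that each uncompressed record occupies 12 bytes (96 bits), the PC/ED pair size the later slides use when quoting compression-rate limits.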

VPC0
- Compressed only the ED values, using 27 predictors.
- Fixed-width predictor encoding (5 bits).
- Compression rate limited to 2.6x: a 96-bit PC/ED pair compresses to no fewer than 37 bits.

VPC1
- Also compresses PCs, using 10 predictors.
- Optimized predictor encoding using a dynamic Huffman encoder; the predictor with the shortest Huffman code is used.
- Compresses unpredictable values: PC values are encoded with log2(max PC value) bits; for ED values, the code of the closest predictor and the difference are stored.
- Many other optimizations to enhance the compression rate.
- Compression rate limited to 48x: a 96-bit PC/ED pair compresses to no fewer than 2 bits.

VPC2
- VPC1-compressed traces are themselves highly compressible, so VPC2 = VPC1 + gzip post-compression.
- Improved compression rate over VPC1: 2x the geometric-mean compression rate of Sequitur.
- But 3.5 times slower decompression speed.

VPC3
- Exposes and enhances patterns in the trace for bzip2 post-compression.
- Simpler than VPC2: 14 predictors, down from 37 in VPC2.
- Fixed one-byte encoding for predictor codes.
- Unpredictable values are not compressed.
- Eliminated the VPC1 optimizations that hurt the post-compression rate.

VPC3 (continued)
- Value predictors tuned using a gcc load-value trace:
  - L4V for ED values
  - FCM{1a,1b,3a,3b} for PC values
  - FCM{1a,1b} and DFCM{1a,1b,3a,3b} for ED values
- The trace is converted into four streams:
  - PC predictor codes
  - Unpredicted PC values
  - ED predictor codes
  - Unpredicted ED values
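The four-stream split above can be sketched as follows. The predictor interface, the escape-code convention, and the trivial last-value predictor are simplified assumptions; in VPC3 the code and value streams are then handed to bzip2.

```python
class LastValuePredictor:
    """Stand-in predictor for illustration only."""
    def __init__(self):
        self.last = 0
    def predict(self):
        return self.last
    def update(self, value):
        self.last = value

def split_streams(entries, pc_preds, ed_preds):
    """Split (pc, ed) records into four streams: predictor codes and
    unpredicted literal values, separately for PCs and EDs."""
    pc_codes, pc_lits = bytearray(), []
    ed_codes, ed_lits = bytearray(), []
    for pc, ed in entries:
        for value, preds, codes, lits in (
                (pc, pc_preds, pc_codes, pc_lits),
                (ed, ed_preds, ed_codes, ed_lits)):
            for i, p in enumerate(preds):
                if p.predict() == value:
                    codes.append(i)       # 1-byte predictor code
                    break
            else:
                codes.append(len(preds))  # escape code: value follows
                lits.append(value)        # literal, stored uncompressed
            for p in preds:
                p.update(value)
    return pc_codes, pc_lits, ed_codes, ed_lits
```

Keeping the highly regular code streams apart from the literal value streams is what exposes the patterns that the bzip2 post-compression stage then exploits.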

Evaluation
- 64-bit CS20 system with dual 833 MHz 21264B Alpha CPUs (only one CPU used).
- SPEC2000 integer and floating-point programs.
- Generated traces (up to 12 GB each):
  - PC + effective address of store instructions
  - PC + effective address of misses in a simulated L1 cache
  - PC + load value
- All compressors compiled with the same compiler and compile flags.

Evaluation: VPC3 predictor configuration
- Extensive state sharing among the predictors.
- 5 MB for the PC predictors, 21 MB for the ED predictors; 27 MB total memory used by VPC3.

Compression Rate

Decompression Speed

Compression Speed

Predictor Usage

Conclusion
- Questions, insights, and discussion