LZ77 Compression Using Altera OpenCL

LZ77 Compression Using Altera OpenCL Mohamed Abdelfattah

LZ77 Compression in OpenCL
Goal: Demonstrate that a high-performance (2 GB/s) compression algorithm can be implemented efficiently using the OpenCL compiler.
Outline:
- OpenCL single-threaded flow
- LZ77 overview
- Implementation details
- Optimizations & results

OpenCL Single-threaded Code
- Basically C-code
- OpenCL compiler extracts parallelism automatically (pipeline parallelism)
- Kernels can communicate directly through "channels"
- One or more custom kernels on the FPGA

OpenCL Single-threaded Code
    kernel void simple(global int *input, int size, global int *output) {
        // input must hold elements 1 .. size + 1
        for (int i = 1; i <= size; i++) {
            int x = input[i];      // Load x
            int y = input[i + 1];  // Load y
            int z = x + y;
            output[i] = z;         // Store z
        }
    }
Pipeline stages on the FPGA: Load x, Load y, Store z. Iterations 1, 2, 3, ... enter the pipeline back to back.
Can start a new loop iteration every cycle → initiation interval II = 1.
No loop-carried dependencies.

OpenCL Single-threaded Code
    kernel void complex(global int *input, int size, global int *output) {
        int loop_carried = 1;  // initial value is not shown on the slide; 1 is assumed here
        for (int i = 1; i <= size; i++) {
            int x = input[i];      // Load x
            int y = input[i + 1];  // Load y
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;     // loop-carried computation
            output[i] = z;         // Store z
        }
    }
Loop-carried computation: need data from iteration x for iteration x+1.

OpenCL Single-threaded Code: Simple vs. Complex
[Animation: iterations 1, 2, 3, ... advance through the Load x, Load y and Store z stages of both kernels.]
- Simple: a new iteration enters the pipeline every cycle (II = 1).
- Complex: the loop-carried computation takes 2 cycles to compute, so the pipeline stalls every other cycle and a bubble follows each iteration (II = 2).
- Optimize the loop-carried computation: going from II = 2 to II = 1 doubles the throughput.
A new iteration of the loop starts every "II" cycles.
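
To make the effect of II concrete, the sketch below uses the usual cycle-count model for a pipelined loop: the first iteration needs the full pipeline depth, and each later iteration starts II cycles after the previous one. The function name and the depth of 3 stages used in the comment are illustrative assumptions, not values from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles needed to run n iterations through a pipeline of `depth` stages
 * when a new iteration can start every `ii` cycles (simple model). */
uint64_t pipelined_loop_cycles(uint64_t n, uint64_t depth, uint64_t ii) {
    assert(depth >= 1 && ii >= 1);
    if (n == 0)
        return 0;
    /* First iteration takes `depth` cycles; each of the other n - 1
     * iterations finishes `ii` cycles after its predecessor. */
    return depth + (n - 1) * ii;
}
```

With an assumed depth of 3, 1000 iterations take 1002 cycles at II = 1 and 2001 cycles at II = 2, which is the "double the throughput" difference for long loops.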

LZ77 Compression Example
- Scan file byte by byte
- Look for matches (match length, match offset)
- Replace with a reference to previous occurrence (marker, length, offset)
Original:   This sentence is an easy sentence to compress.
Compressed: This sentence is an easy @(8,20) to compress.
The second "sentence" matches the first one: match length = 8, match offset = 20 bytes. The 8 bytes are replaced by a 3-byte reference. Saved 5 bytes!
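
The match search in this example can be sketched in plain C as a brute-force scan over all earlier positions. This is only an illustration of the idea, not the hardware algorithm from the talk (which uses hashed dictionaries); the names `lz_match` and `longest_match` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t length;  /* number of matching bytes */
    size_t offset;  /* distance back to the earlier occurrence */
} lz_match;

/* Find the longest earlier occurrence of the text that starts at `pos`. */
lz_match longest_match(const char *buf, size_t len, size_t pos) {
    assert(buf != NULL && pos <= len);
    lz_match best = {0, 0};
    for (size_t start = 0; start < pos; start++) {
        size_t n = 0;
        /* Extend the match while the bytes agree and input remains. */
        while (pos + n < len && buf[start + n] == buf[pos + n])
            n++;
        if (n > best.length) {
            best.length = n;
            best.offset = pos - start;
        }
    }
    return best;
}
```

On the example sentence, calling it at the second "sentence" (byte 25) reports offset 20. Because the search is greedy it also absorbs the space that follows the word, so it reports length 9 rather than the 8 shown on the slide, which counts the word alone.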

Overview
- Single-threaded OpenCL flow
- Single kernel: fully pipelined → II = 1
- Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2 * 10^9 bytes/s ≈ 3051 MB/s (counting 1 MB as 2^20 bytes)
Pipeline steps:
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output

Comparison against CPU/Verilog
- CPU: best implementation of Gzip on CPU, by Intel Corporation, on an Intel Core i5 (32 nm) processor, 2013. Compression speed: 338 MB/s. Compression ratio: 2.18X.
- ASIC: best implementation on ASICs, by AHA Products Group, coming up Q2 2014. Compression speed: 2.5 GB/s.
- FPGA (Verilog): best implementation on FPGAs, by IBM Corporation, ICCAD, Nov. 2013, on an Altera Stratix-V A7. Compression speed: 3 GB/s.
- FPGA (OpenCL): OpenCL design example on an Altera Stratix-V A7, developed in 1 month. Compression speed: 2.7 GB/s.
Summary chart: Verilog 3 GB/s, OpenCL 2.7 GB/s, ASIC 2.5 GB/s, CPU 0.3 GB/s.
Comparison against CPU: same compression ratio, 12X better performance/Watt.
Comparison against Verilog: 10% slower, 12% more resources, much lower design effort and design time.

Implementation Overview
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output

1. Shift In New Data
- Input comes from DDR memory, e.g. the text "sample_text". We use text in our example, but the data can be anything.
- Each cycle, VEC new bytes are shifted into the current window and the oldest VEC bytes are shifted out (VEC = 4 in this example). The cycle boundary sits in the middle of the window.
- Example: the current window "old_text" becomes "textsamp" after one cycle, and "le_text" is still waiting at the input.
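
The shift step can be modelled in a few lines of C. This is a minimal sketch assuming a window of 2 * VEC bytes, as in the slide example; the function name `shift_in` is an assumption.

```c
#include <assert.h>
#include <string.h>

#define VEC 4  /* bytes consumed per cycle in the slide example */

/* Drop the oldest VEC bytes of the window and append VEC new input bytes. */
void shift_in(unsigned char window[2 * VEC], const unsigned char *input) {
    assert(window != NULL && input != NULL);
    memmove(window, window + VEC, VEC);  /* newer half becomes the older half */
    memcpy(window + VEC, input, VEC);    /* fresh bytes fill the newer half */
}
```

Starting from the window "old_text" and shifting in the first four bytes of "sample_text" yields "textsamp", matching the slide.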

2. Dictionary Lookup/Update
Dictionaries buffer the text that we have already processed.
1. Compute a hash for each substring of the current window.
2. Look for a match in the 4 dictionaries.
3. Update the dictionaries.
Example (VEC = 4): the current window "textsamp" yields the substrings "text", "exts", "xtsa" and "tsam". The hash of each substring addresses every dictionary, and the entries read back are the possible matches from history (dictionaries), e.g. "tan_" and "texl" for the substring "text".
Ports: each of the 4 dictionaries has 1 write port (W0 to W3) and 4 read ports (RD00 to RD33). Exactly the number of read/write ports that we need is generated.
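
A software model of the lookup/update step might look like the sketch below. It assumes that substring i is written to dictionary i and that every substring reads one candidate from each dictionary, which is what the port names W0..W3 and RD00..RD33 suggest. The entry size `LEN`, the depth `DEPTH`, and all identifiers are assumptions; the hash is the first-byte hash that the talk later calls the original one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC 4      /* substrings processed per cycle, as in the slide example */
#define LEN 4      /* bytes stored per dictionary entry (assumed) */
#define DEPTH 256  /* entries per dictionary, one per hash value (assumed) */

typedef struct { uint8_t text[DEPTH][LEN]; } dictionary;

/* First-byte hash: Hash[i] = curr_window[i]. */
static unsigned hash_first_byte(const uint8_t *s) { return s[0]; }

void lookup_and_update(dictionary dict[VEC], const uint8_t window[2 * VEC],
                       uint8_t candidates[VEC][VEC][LEN]) {
    assert(dict != NULL && window != NULL && candidates != NULL);
    unsigned h[VEC];
    for (int i = 0; i < VEC; i++)
        h[i] = hash_first_byte(&window[i]);
    /* Lookup: substring i reads one candidate from every dictionary j. */
    for (int i = 0; i < VEC; i++)
        for (int j = 0; j < VEC; j++)
            memcpy(candidates[i][j], dict[j].text[h[i]], LEN);
    /* Update: substring i is written into dictionary i. */
    for (int i = 0; i < VEC; i++)
        memcpy(dict[i].text[h[i]], &window[i], LEN);
}
```

All loops have fixed bounds, so each iteration maps to its own read or write port when unrolled in hardware.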

3. Match Search & Filtering
- The current windows are the incoming substrings. Each one has a set of candidate matches: its comparison windows.
- Compare each current window against each of its 4 comparison windows, byte by byte, using comparators. This gives one match length per comparison window.
- Match reduction keeps the best length.
Example: for the current window "text", the comparison windows "ten…", "texl", "text" and "tan_" give match lengths 2, 3, 4 and 1, so the best length is 4. We have another 3 sets of comparators for the other current windows.

3. Match Search & Filtering
Typical C-code with fixed loop bounds, so the compiler can unroll the loops.
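
The code shown on these slides did not survive in the transcript. The sketch below is a plain-C reconstruction of the idea only, under stated assumptions: VEC = 4 comparison windows of LEN = 4 bytes stored back to back, and the hypothetical names `match_length` and `best_length`.

```c
#include <assert.h>
#include <stdint.h>

#define VEC 4  /* comparison windows per current window */
#define LEN 4  /* bytes compared per window */

/* Count how many leading bytes of the two windows agree. */
int match_length(const uint8_t *current, const uint8_t *compare) {
    assert(current != 0 && compare != 0);
    int length = 0, done = 0;
    for (int k = 0; k < LEN; k++) {  /* fixed bound: can be fully unrolled */
        if (current[k] == compare[k] && !done)
            length++;
        else
            done = 1;
    }
    return length;
}

/* Match reduction: best length over the VEC comparison windows. */
int best_length(const uint8_t *current, const uint8_t *compare_windows) {
    int best = 0;
    for (int j = 0; j < VEC; j++) {  /* fixed bound: can be fully unrolled */
        int length = match_length(current, compare_windows + j * LEN);
        if (length > best)
            best = length;
    }
    return best;
}
```

Because every loop bound is a compile-time constant, a compiler can turn both loops into parallel comparators followed by a reduction.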

3. Match Search & Filtering
One bestlength is associated with each current_window, e.g. (VEC = 4) best lengths 3, 1, 3, 4 for the four substrings of "textsamp".
Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than the original.
2. Remove matches covered by the previous step (positions before the "first valid position").
3. From the remaining set, select the best ones (heuristic for bin-packing) → last-fit.
4. Compute the "first valid position" for the next step.
Example: with best lengths 3, 1, 3, 4, the second match is too short and the third overlaps, so last-fit leaves best lengths 3, -1, -1, 4 (-1 marks a removed match). The last match reaches 3 bytes past the cycle boundary, so First_valid_pos = 3 for the next cycle.
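
The four filtering steps can be sketched in C as follows. The minimum useful match length `MIN_LEN` of 3 is an assumption (the slide example only shows that a length of 1 is dropped and a length of 3 is kept), as are the function name and the 0-based positions.

```c
#include <assert.h>

#define VEC 4
#define MIN_LEN 3  /* assumed: shorter matches are longer when encoded */

/* Filters bestlength[] in place (-1 = removed) and returns the first valid
 * position for the next cycle. Positions are 0-based within the cycle. */
int filter_matches(int bestlength[VEC], int first_valid_pos) {
    assert(first_valid_pos >= 0 && first_valid_pos < VEC);
    /* 1. Remove matches that are longer when encoded than the original. */
    for (int i = 0; i < VEC; i++)
        if (bestlength[i] < MIN_LEN)
            bestlength[i] = -1;
    /* 2. Remove matches covered by the previous cycle's last match. */
    for (int i = 0; i < first_valid_pos; i++)
        bestlength[i] = -1;
    /* 3. Last-fit: walk backwards, keep a match only if it ends before
     *    the next kept match starts. */
    int next_start = 2 * VEC;  /* nothing kept yet, any reach is fine */
    for (int i = VEC - 1; i >= 0; i--) {
        if (bestlength[i] < 0)
            continue;
        if (i + bestlength[i] <= next_start)
            next_start = i;
        else
            bestlength[i] = -1;
    }
    /* 4. First valid position: how far the kept matches reach past the
     *    cycle boundary. */
    int reach = 0;
    for (int i = 0; i < VEC; i++)
        if (bestlength[i] > 0 && i + bestlength[i] > reach)
            reach = i + bestlength[i];
    return reach > VEC ? reach - VEC : 0;
}
```

For the slide example, best lengths 3, 1, 3, 4 with nothing covered from the previous cycle become 3, -1, -1, 4 and the function returns 3.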

4. Writing to Output
Each match is written as: marker, length, offset.
- Length is limited by VEC (= 16 in our case), so it fits in 4 bits.
- Offset is limited by 0x40000 (it doesn't make sense to be more), so it fits in 21 bits.
Use either 3 or 4 bytes for this:
- Offset < 2048: 3 bytes (MARKER, LENGTH, OFFSET)
- Offset = 2048 .. 262144: 4 bytes (MARKER, LENGTH, OFFSET)
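
The slides give the field sizes but not the exact bit layout, so the encoder below is only one possible packing. Everything about the layout is an assumption made for illustration: the marker byte 0x40 ('@', as in the LZ77 example), the length stored as length - 1 in the high nibble, and one flag bit that selects the 3-byte or 4-byte form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER 0x40u  /* assumed marker byte ('@') */

/* Encode one match into out[]; returns the number of bytes written (3 or 4).
 * Assumed layout: byte 1 = [length-1 : 4 bits][long flag : 1 bit][offset high : 3 bits],
 * followed by 1 or 2 bytes holding the low bits of the offset. */
size_t encode_match(unsigned length, uint32_t offset, uint8_t out[4]) {
    assert(length >= 1 && length <= 16);
    assert(offset >= 1 && offset <= 0x40000u);
    uint8_t len_bits = (uint8_t)((length - 1) << 4);
    out[0] = (uint8_t)MARKER;
    if (offset < 2048) {
        out[1] = (uint8_t)(len_bits | (offset >> 8));  /* flag bit stays 0 */
        out[2] = (uint8_t)(offset & 0xFFu);
        return 3;
    }
    out[1] = (uint8_t)(len_bits | 0x08u | (offset >> 16));  /* flag bit set */
    out[2] = (uint8_t)((offset >> 8) & 0xFFu);
    out[3] = (uint8_t)(offset & 0xFFu);
    return 4;
}
```

With this layout the match from the example (length 8, offset 20) takes 3 bytes, while an offset of 262144 needs the 4-byte form.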

LZ77 Compression in OpenCL
Outline:
- OpenCL single-threaded flow
- LZ77 overview
- Implementation details
- Optimizations & results
  - Area optimizations
  - Compression ratio
  - Results

Area Optimizations
- By choosing the right (hardware) architecture, you are already most of the way there.
- The last ~5% (of area optimizations) requires some tinkering and advanced knowledge.
Example:

Match Search & Filtering
- The original code computes each length under a condition, which generates a long vine of logic: a chain of dependent "compute length" stages. This causes longer latency in the pipeline → increases area.
- Balance the computation: get rid of the dependency on "length". A balanced tree has a shallower pipeline depth → less area.

Modified Code
- Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111).
- The OR operator creates a balanced tree (no condition).
- The OR operator is cheaper than an adder.
- Result: 4% smaller area.
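
The bit-array idea can be illustrated in C: each comparison produces a thermometer code in which bit k is set only if bytes 0..k all match, and the best match is simply the OR of all codes. The modified code itself is not in the transcript, so this is a sketch with assumed names and with VEC = 4 windows of LEN = 4 bytes stored back to back.

```c
#include <assert.h>
#include <stdint.h>

#define VEC 4
#define LEN 4

/* Thermometer code of the match: length 2 -> 0011, 3 -> 0111, 4 -> 1111. */
unsigned match_bits(const uint8_t *current, const uint8_t *compare) {
    assert(current != 0 && compare != 0);
    unsigned bits = 0, still_matching = 1;
    for (int k = 0; k < LEN; k++) {
        still_matching &= (unsigned)(current[k] == compare[k]);
        bits |= still_matching << k;
    }
    return bits;
}

/* OR-reduce the codes of all comparison windows. For thermometer codes
 * the OR equals the code of the longest match, so no comparison of
 * lengths is needed. */
unsigned best_match_bits(const uint8_t *current, const uint8_t *compare_windows) {
    unsigned best = 0;
    for (int j = 0; j < VEC; j++)
        best |= match_bits(current, compare_windows + j * LEN);
    return best;
}
```

OR is associative and has no data-dependent condition, which is why the reduction can be built as a balanced tree.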

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
Initial results: 78.3% (compressed size relative to the original) or 1.28X
Want to improve results! Two optimizations:
1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic
We use the "last-fit" heuristic. Reason: we have a loop-carried variable, "first_valid_position".
Original order of the steps:
1. Filter bestlength (length): remove matches that are longer when encoded than the original
2. Filter bestlength (covered): remove matches covered by the previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
4. Compute first_valid_pos: compute the "first valid position" for the next step
The dependency causes a stall in the kernel pipeline: we cannot start a new iteration each cycle, II = 6 (from the Optimization Report in 14.0).
Because we always use the last match (which determines first_valid_position), last-fit bin-packing doesn't affect "first_valid_position". The last two steps can therefore be swapped:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)
This gives a tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1.
Constraint: the bin-pack step cannot change first_valid_position.

1. Bin-packing heuristic
Constraint: the match selection heuristic cannot change "first_valid_position".
But: last-fit is very inefficient. E.g. best lengths 4, 3, 2 at three consecutive positions all reach the same byte, and last-fit keeps the last and shortest one.
Fix: add a step to eliminate matches that have the same reach but smaller value. The example becomes 4, -1, -1. Much better, and it doesn't affect first_valid_position.
Result: 8% better ratio.
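
The extra elimination step could be written as below. This is a sketch assuming 0-based positions and -1 for removed matches; the function name is made up. A match is dropped when an earlier match ends at exactly the same byte, since the earlier one is necessarily longer.

```c
#include <assert.h>

#define VEC 4

/* Remove every match that has the same reach (position + length) as an
 * earlier, and therefore longer, match. The largest reach is never
 * removed, so first_valid_position is unchanged. */
void drop_same_reach(int bestlength[VEC]) {
    assert(bestlength != 0);
    for (int i = 1; i < VEC; i++) {
        if (bestlength[i] <= 0)
            continue;
        for (int j = 0; j < i; j++) {
            if (bestlength[j] > 0 && j + bestlength[j] == i + bestlength[i]) {
                bestlength[i] = -1;  /* dominated by the longer match at j */
                break;
            }
        }
    }
}
```

Running this before last-fit turns best lengths 4, 3, 2 into 4, -1, -1, so the longest match is the one that survives.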

2. Hash Function
Original: Hash[i] = curr_window[i]
- E.g. Hash[text] = 't'
XOR2: Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = 't' xor 'e'
- Aliasing: 't' xor 'e' = 'e' xor 't'
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)
- 3.1% better ratio
XOR3: Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- The hash contains information about the first 3 bytes plus a sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits → utilizes BRAM depth = 1024
- 7.1% better ratio
Compared to Verilog, it is much easier to try and verify new algorithms (emulator in 13.1). It is exactly like trying out new C-code.
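
The three hash functions translate directly into C. The function names are assumptions; the formulas are the ones given on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Original: first byte only, 8-bit address (256 dictionary words). */
unsigned hash_original(const uint8_t *w) {
    assert(w != 0);
    return w[0];
}

/* XOR2: first two bytes, still 8 bits; symmetric, so "te" and "et" alias. */
unsigned hash_xor2(const uint8_t *w) {
    assert(w != 0);
    return (unsigned)(w[0] ^ w[1]);
}

/* XOR3: first three bytes with different shifts, so ordering matters.
 * The result is at most 10 bits wide (addresses 0..1023). */
unsigned hash_xor3(const uint8_t *w) {
    assert(w != 0);
    return ((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ (unsigned)w[2];
}
```

For instance, XOR2 gives the same address for "te" and "et", whereas XOR3 separates "tex" from "etx".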

Work in progress
Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
Initial results: 78.3% or 1.28X
After optimizations: 60.2% or 1.67X
With (simple) Huffman encoding (currently on the host): 47.8% or 2.10X

Huffman portion of Gzip: 16-way parallel variable-bit-width encoding/alignment

Huffman encoding
- Huffman symbols are defined at runtime
- Variable number of bits (≤16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly
3 compute phases:
1. Compute code bit-offsets and the start offset of the next iteration
2. Assembly of the codes in the current iteration
3. Build fixed-length segments across multiple iterations
[Figure: each code is shifted (<<) by a bit offset derived from the code lengths len_i, then stored.]

Compute offsets
- Tight dependency on the offset carried across iterations
- Be careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- The decision whether to write to memory is based on accumulating a full segment
[Figure: pos[0], pos[1], ..., pos[n] are computed from basepos and the code lengths len_i.]

Bit-level shift
- Each code shifts to an arbitrary bit-offset within the entire range
- 2 shift stages
- 16 bit barrel shifters
- OR reduction tree for final assembly
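
The first two phases (offset computation, then shift-and-OR assembly) can be sketched in C. This is a software illustration only: the type `huff_code`, the function names, and the LSB-first bit order in the output buffer are assumptions, and the third phase (building fixed-length segments across iterations) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t bits;  /* code value, lowest `len` bits are used */
    uint8_t len;    /* code length in bits, at most 16 */
} huff_code;

/* Phase 1: pos[i] is the bit offset of code i (running sum of lengths).
 * Returns the start offset for the next iteration. */
unsigned compute_offsets(const huff_code *codes, int n, unsigned basepos,
                         unsigned *pos) {
    for (int i = 0; i < n; i++) {
        assert(codes[i].len <= 16);
        pos[i] = basepos;
        basepos += codes[i].len;
    }
    return basepos;
}

/* Phase 2: shift every code to its bit offset and OR it into the output.
 * `out` must be zero-initialised and large enough for all codes. */
void assemble(const huff_code *codes, const unsigned *pos, int n,
              uint8_t *out, size_t out_bytes) {
    for (int i = 0; i < n; i++) {
        assert((pos[i] + codes[i].len + 7u) / 8u <= out_bytes);
        uint32_t masked = codes[i].bits & ((1u << codes[i].len) - 1u);
        uint32_t shifted = masked << (pos[i] % 8u);  /* at most 23 bits */
        size_t byte = pos[i] / 8u;
        for (size_t b = 0; b < 3; b++)
            if (byte + b < out_bytes)
                out[byte + b] |= (uint8_t)(shifted >> (8u * b));
    }
}
```

Keeping the offsets separate from the assembly means only the small running sum is carried from one iteration to the next, while the shifts and ORs are independent per code.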