LZ77 Compression Using Altera OpenCL

Name: LZ77 Compression Using Altera OpenCL
Uploaded: 2017-08-18T09:26:14+00:00
Duration: PTM57S40
Channel: Tyrone Searls
Description: LZ77 Compression Using Altera OpenCL

LZ77 Compression Using Altera OpenCL
Mohamed Abdelfattah

LZ77 Compression in OpenCL
Goal: Demonstrate that a compression algorithm can be implemented efficiently and at high performance (2 GB/s) using the OpenCL compiler
Outline:
OpenCL single-threaded flow
LZ77 overview
Implementation details
Optimizations & results

OpenCL Single-threaded Code
Basically C code
OpenCL compiler extracts parallelism automatically
  Pipeline parallelism
Kernels can communicate directly through "channels"
One or more custom kernels on the FPGA

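    kernel void simple(global const int *input, int size, global int *output) {
        /* Indexed from 1 to size as on the slide; iteration i also reads input[i + 1]. */
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z = x + y;
            output[i] = z;
        }
    }
Pipeline stages on the FPGA: Load x, Load y → Store z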

Pipeline of the simple kernel: Load x and Load y, then Store z. Successive iterations (1, 2, 3, ...) enter the pipeline one cycle apart.
Can start a new loop iteration every cycle! → Initiation interval II = 1
No loop-carried dependencies

    kernel void complex(global const int *input, int size, global int *output) {
        int loop_carried = 1;  /* initial value is not given on the slide; 1 is assumed */
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;  /* consumed by the next iteration */
            output[i] = z;
        }
    }
Pipeline stages: Load x, Load y → loop-carried computation → Store z
Need data from iteration x for iteration x+1

[Animation: the simple and complex kernels side by side, each a pipeline of Load x, Load y, Store z, with iterations 1, 2, 3, ... advancing through the stages.]
Simple: a new iteration enters the pipeline every cycle, II = 1
Complex: the loop-carried computation takes 2 cycles to compute, so the next iteration stalls and leaves a pipeline bubble, II = 2
A new iteration of the loop starts every "II" cycles
Optimize the loop-carried computation: going from II = 2 to II = 1 doubles the throughput
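The effect of II on run time can be estimated with the usual model for a pipelined loop: the first iteration needs the full pipeline depth, and every later iteration starts II cycles after the previous one. The sketch below is only that back-of-the-envelope model; the function name and the depth of 3 stages used in the usage note are illustrative assumptions, not figures from the slides.

```c
/* Estimated cycles to run n iterations of a pipelined loop.
 * depth: latency of one iteration in cycles (assumed value, not from the slides)
 * ii:    initiation interval, i.e. cycles between the starts of two iterations */
unsigned long pipelined_loop_cycles(unsigned long n, unsigned long depth,
                                    unsigned long ii)
{
    if (n == 0) {
        return 0;
    }
    return depth + (n - 1) * ii;
}
```

With a depth of 3, 1000 iterations take about 1002 cycles at II = 1 and about 2001 cycles at II = 2, which is the "double the throughput" effect described above.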

LZ77 Compression Example
Input: "This sentence is an easy sentence to compress."
Scan file byte by byte
Look for matches
  The second "sentence" matches the first one
  Match length = 8
  Match offset = 20 bytes
Replace with a reference to previous occurrence
  The 8 matched bytes are replaced by: marker, length, offset
  Saved 5 bytes!
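To see that the reference (length = 8, offset = 20) really reproduces the original text, here is a minimal decoder sketch in C. It copies `length` bytes from `offset` bytes back in the output already produced. The function names and the way the literal parts are passed in are assumptions made for this illustration; the slides do not show a decoder.

```c
#include <stddef.h>
#include <string.h>

/* Append `length` bytes copied from `offset` bytes back in the output.
 * Returns the new output length. */
static size_t copy_match(char *out, size_t out_len, size_t length, size_t offset)
{
    for (size_t k = 0; k < length; k++) {
        out[out_len + k] = out[out_len + k - offset];
    }
    return out_len + length;
}

/* Rebuild the example sentence from: literals, one reference, literals. */
size_t decode_example(char *out)
{
    const char *head = "This sentence is an easy ";
    const char *tail = " to compress.";
    size_t n = strlen(head);

    memcpy(out, head, n);
    n = copy_match(out, n, 8, 20);   /* match length 8, match offset 20 */
    memcpy(out + n, tail, strlen(tail));
    n += strlen(tail);
    out[n] = '\0';
    return n;
}
```

Calling `decode_example` on a buffer of at least 47 bytes yields the original 46-byte sentence. With a 3-byte reference standing in for 8 bytes of text, that is the 5-byte saving from the slide.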

Overview
Single-threaded OpenCL flow
Single kernel: fully pipelined → II = 1
Throughput estimate = 16 bytes/cycle * 200 MHz = 3051 MB/s (counting 1 MB as 2^20 bytes)
Kernel stages:
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output
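The throughput estimate is plain arithmetic: 16 bytes per cycle at 200 MHz is 3.2 * 10^9 bytes per second, and dividing by 2^20 gives roughly 3051. The helper below only reproduces that calculation; its name is made up for this sketch.

```c
/* Throughput in units of 2^20 bytes per second for a fully pipelined
 * kernel (II = 1) that consumes bytes_per_cycle bytes every clock cycle. */
double throughput_mb_per_s(double bytes_per_cycle, double clock_hz)
{
    return bytes_per_cycle * clock_hz / (1024.0 * 1024.0);
}
```

`throughput_mb_per_s(16.0, 200e6)` evaluates to about 3051.76, matching the figure on the slide.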

Comparison against CPU/Verilog

Best implementation of Gzip on CPU: by Intel Corporation, on an Intel Core i5 (32 nm) processor, 2013. Compression speed: 338 MB/s. Compression ratio: 2.18X.

Best implementation on ASICs: AHA Products Group, coming up Q2 2014. Compression speed: 2.5 GB/s.

Best implementation on FPGAs: Verilog, by IBM Corporation, Nov ICCAD, on an Altera Stratix-V A7. Compression speed: 3 GB/s.

OpenCL design example: Altera Stratix-V A7, developed in 1 month. Compression speed? Compression ratio?

[Chart: compression speed. FPGA (Verilog) 3 GB/s, OpenCL 2.7 GB/s, ASIC 2.5 GB/s, CPU 0.3 GB/s]

Comparison against CPU
Same compression ratio
12X better performance/Watt

Comparison against Verilog
10% slower
12% more resources
Much lower design effort and design time

Implementation Overview
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output

1. Shift In New Data
Input comes from DDR memory, VEC bytes per cycle (VEC = 4 in this example)
We use text in our example, but the data can be anything
Current window: o l d _ t e x t, with incoming data sample_text
Each cycle the oldest VEC bytes (o l d _) are shifted out and the next VEC bytes (s a m p) are shifted in
Current window after the shift: t e x t s a m p, with le_text still to come
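The shift step can be sketched in a few lines of C. This is an illustration under assumptions: the window is taken to be 2 * VEC bytes wide because the example window holds 8 characters with VEC = 4, and the function name is invented here.

```c
#include <string.h>

#define VEC 4               /* bytes consumed per cycle, as in the example */
#define WINDOW (2 * VEC)    /* assumed window width (8 bytes in the example) */

/* Drop the oldest VEC bytes and append VEC new bytes at the end. */
void shift_in(char window[WINDOW], const char incoming[VEC])
{
    memmove(window, window + VEC, WINDOW - VEC);
    memcpy(window + WINDOW - VEC, incoming, VEC);
}
```

Starting from the window `old_text` and shifting in `samp` leaves `textsamp`, the window used in the following steps.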

2. Dictionary Lookup/Update
Current window: t e x t s a m p
Substrings starting at each of the VEC = 4 positions: t e x t, e x t s, x t s a, t s a m
1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries
Dictionaries buffer the text that we have already processed
Hashing each substring and reading all 4 dictionaries returns possible matches from history (dictionaries), e.g. for t e x t: t e n, t e x l, t e x t, t a n _
Each dictionary has one write port and four read ports:
  Dictionary 0: W0, RD00 RD01 RD02 RD03
  Dictionary 1: W1, RD10 RD11 RD12 RD13
  Dictionary 2: W2, RD20 RD21 RD22 RD23
  Dictionary 3: W3, RD30 RD31 RD32 RD33
Generate exactly the number of read/write ports that we need
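The lookup/update step might look roughly like the following in C. Several details are assumptions made for this sketch, not taken from the slides: the hash is simply the first byte, each dictionary has 256 entries of 4 bytes, and substring i is written into dictionary i (which would explain one write port and VEC read ports per dictionary).

```c
#include <stdint.h>
#include <string.h>

#define VEC   4      /* substrings per cycle and number of dictionaries */
#define LEN   4      /* bytes stored per dictionary entry (assumed) */
#define DEPTH 256    /* entries per dictionary for an 8-bit hash (assumed) */

typedef struct {
    char entry[DEPTH][LEN];
} dictionary_t;

/* Assumed hash: the first byte of the substring. */
static uint8_t hash_first_byte(const char *s)
{
    return (uint8_t)s[0];
}

/* window must hold at least VEC + LEN - 1 bytes.
 * candidates[i][d] receives what dictionary d stores for substring i. */
void lookup_and_update(dictionary_t dict[VEC], const char *window,
                       char candidates[VEC][VEC][LEN])
{
    /* Lookup: VEC reads from each dictionary (ports RDd0..RDd3). */
    for (int i = 0; i < VEC; i++) {
        uint8_t h = hash_first_byte(window + i);
        for (int d = 0; d < VEC; d++) {
            memcpy(candidates[i][d], dict[d].entry[h], LEN);
        }
    }
    /* Update: one write per dictionary (port Wd); substring i goes to
     * dictionary i (assumed policy). */
    for (int i = 0; i < VEC; i++) {
        uint8_t h = hash_first_byte(window + i);
        memcpy(dict[i].entry[h], window + i, LEN);
    }
}
```

Because the loop bounds are compile-time constants, a compiler can unroll both loops, which is what lets the hardware perform all reads and writes in the same cycle.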

3. Match Search & Filtering
A set of candidate matches (comparison windows) for each incoming substring (current window)
Compare each current window against each of its 4 comparison windows:
  t e x t: t e n, t e x l, t e x t, t a n _
  e x t s: e n t, e p s, e a r s, e a t
  x t s a: x i r t, x e l y, x y l o, x a n t
  t s a m: t e n, t e a l, t a m e, t a n _
Comparators: compare each byte
  For t e x t the match lengths are 2, 3, 4, 1
  We have another 3 sets of comparators for the other current windows
Match reduction: best length = 4

Typical C code
Fixed loop bounds: the compiler can unroll the loop
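The code shown on this slide is not part of the transcript. As a stand-in, here is a sketch of what a fixed-bound comparison and reduction loop could look like in C; the names, the window length of 4 and the number of candidates are assumptions based on the example above.

```c
#define LEN 4              /* bytes compared per window (assumed) */
#define NUM_CANDIDATES 4   /* one candidate per dictionary */

/* Number of leading bytes on which cur and cmp agree. */
int match_length(const char *cur, const char *cmp)
{
    int length = 0;
    int still_matching = 1;
    for (int k = 0; k < LEN; k++) {            /* fixed bound: unrollable */
        still_matching = still_matching && (cur[k] == cmp[k]);
        length += still_matching;
    }
    return length;
}

/* Match reduction: the longest match over all comparison windows. */
int best_length(const char *cur, const char cmp[NUM_CANDIDATES][LEN])
{
    int best = 0;
    for (int d = 0; d < NUM_CANDIDATES; d++) { /* fixed bound: unrollable */
        int l = match_length(cur, cmp[d]);
        if (l > best) {
            best = l;
        }
    }
    return best;
}
```

For the current window `text` and the comparison windows `ten`, `texl`, `text`, `tan_` this gives match lengths 2, 3, 4, 1 and a best length of 4, as on the slide.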

One bestlength associated with each current_window
Example: the current windows t e x t, e x t s, x t s a, t s a m (positions 0 to 3 before the cycle boundary) have best lengths 3 1 3 4

Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than original
   e.g. the match of length 1 is too short
2. Remove matches covered by previous step
   i.e. matches that start before the first valid position
3. From the remaining set, select the best ones (heuristic for bin-packing) → last-fit
   e.g. best lengths 3 1 3 4: the first 3 and the 4 are kept (last-fit), the 1 is too short, the second 3 overlaps the kept match after it
   Result: 3 -1 -1 4
4. Compute "first valid position" for next step
   e.g. best lengths 3 -1 -1 4: the match at position 3 with length 4 reaches position 7, which is 3 bytes past the cycle boundary (VEC = 4)
   first_valid_pos = 3 → first valid position in the next cycle
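The four filtering steps can be sketched as follows. This is one plausible reading of the slides, not the presenter's code: the minimum useful match length of 3 and the exact last-fit rule (walk from the last position backwards, always keep the last valid match, then keep earlier matches only if they end before the next kept match starts) are assumptions chosen so that the example 3 1 3 4 gives 3 -1 -1 4.

```c
#define VEC 4         /* positions per cycle, as in the example */
#define MIN_MATCH 3   /* assumed shortest match worth encoding */

/* Filters len[] in place (-1 marks a removed match) and returns the
 * first valid position for the next cycle. */
int filter_matches(int len[VEC], int first_valid_pos)
{
    /* 1. Remove matches that are longer when encoded than original. */
    for (int i = 0; i < VEC; i++) {
        if (len[i] < MIN_MATCH) {
            len[i] = -1;
        }
    }
    /* 2. Remove matches covered by the previous step. */
    for (int i = 0; i < VEC && i < first_valid_pos; i++) {
        len[i] = -1;
    }
    /* 3. Last-fit selection, walking from the last position backwards. */
    int next_start = -1;          /* start of the most recently kept match */
    int next_first_valid = 0;
    for (int i = VEC - 1; i >= 0; i--) {
        if (len[i] < 0) {
            continue;
        }
        if (next_start < 0) {
            /* 4. The last valid match fixes the next first valid position. */
            next_first_valid = i + len[i] - VEC;
            if (next_first_valid < 0) {
                next_first_valid = 0;
            }
            next_start = i;
        } else if (i + len[i] <= next_start) {
            next_start = i;       /* ends before the kept match: keep it */
        } else {
            len[i] = -1;          /* overlaps the kept match: remove it */
        }
    }
    return next_first_valid;
}
```

With best lengths {3, 1, 3, 4} and a first valid position of 0, the array becomes {3, -1, -1, 4} and the function returns 3.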

4. Writing to Output
Each match is written as: marker, length, offset
Length is limited by VEC (= 16 in our case), so it fits in 4 bits
Offset is limited by 0x40000 (doesn't make sense to be more), so it fits in 21 bits
Use either 3 or 4 bytes for this:
  Offset < 2048: 3 bytes (MARKER, LENGTH, OFFSET)
  Otherwise: 4 bytes (MARKER, LENGTH, OFFSET, OFFSET, OFFSET)
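The size rule above is enough to decide whether a match is worth encoding. The sketch below only models the byte counts (3 bytes when the offset is below 2048, 4 bytes otherwise); it does not attempt the actual bit layout, which the transcript does not preserve. The function names are invented for this illustration.

```c
/* Bytes needed to encode one match reference. */
int encoded_bytes(unsigned offset)
{
    return offset < 2048 ? 3 : 4;
}

/* Bytes saved by replacing `length` literal bytes with a reference.
 * A result of zero or less means the match is not worth encoding. */
int bytes_saved(unsigned length, unsigned offset)
{
    return (int)length - encoded_bytes(offset);
}
```

For the earlier example (length 8, offset 20) `bytes_saved` returns 5, the same saving as on the LZ77 example slide; a match of length 3 at an offset of 5000 would cost one byte more than the literal text.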

Outline:
OpenCL single-threaded flow
LZ77 overview
Implementation details
Optimizations & results
  Area optimizations
  Compression ratio
  Results

Area Optimizations
By choosing the right (hardware) architecture, you are already most of the way there
The last ~5% (of area optimizations) requires some tinkering and advanced knowledge
Example: Match Search & Filtering

The original code generates a long vine of logic:
  A chain of "compute length" steps, each conditional on the length computed so far
  Causes longer latency in the pipeline → increases area

Balance the computation:
  Get rid of the dependency on "length"
  A balanced tree has shallower pipeline depth → less area

Modified Code
Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
The OR operator creates a balanced tree (no condition)
The OR operator is cheaper than an adder
4% smaller area
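The idea of replacing a length variable with an array of bits can be sketched like this: each comparison produces a mask in which bit k is set if the first k + 1 bytes match, so lengths 2, 3, 4 become 0011, 0111, 1111, and taking the maximum length turns into a bitwise OR. The modified code itself is not in the transcript, so the function names and the window length of 4 are assumptions.

```c
#define LEN 4   /* bytes compared per window (assumed) */

/* Bit k of the result is set if bytes 0..k all match. */
unsigned match_bits(const char *cur, const char *cmp)
{
    unsigned bits = 0;
    unsigned prefix = 1;
    for (int k = 0; k < LEN; k++) {
        prefix &= (unsigned)(cur[k] == cmp[k]);
        bits |= prefix << k;
    }
    return bits;
}

/* Best match over n comparison windows: OR replaces the max over lengths. */
unsigned best_match_bits(const char *cur, const char cmp[][LEN], int n)
{
    unsigned best = 0;
    for (int d = 0; d < n; d++) {
        best |= match_bits(cur, cmp[d]);
    }
    return best;
}
```

Because every mask is a run of ones starting at bit 0, the OR of several masks is exactly the mask of the longest match, and an OR reduction has no dependency on a running length value.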

Evaluate compression ratio on widely-used compression benchmarks:
Calgary – Canterbury – Large – Silesia corpora
Text, images, binary, databases – mix of everything
Geomean results over all benchmarks
Initial results: 78.3% or 1.28X
Want to improve results!
1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic
We use the "last-fit" heuristic
Reason: we have a loop-carried variable "first_valid_position"

Original order of the filtering steps:
1. Filter bestlength (length): remove matches that are longer when encoded than original
2. Filter bestlength (covered): remove matches covered by previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
4. Compute first_valid_pos: compute "first valid position" for next step
Optimization report in 14.0: the dependency causes a stall in the kernel pipeline
Cannot start a new iteration each cycle: II = 6

Because we always use the last match (which determines first_valid_position), last-fit bin-packing doesn't affect "first_valid_position" (e.g. best lengths 3 1 3 4)
So the steps can be reordered:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)
Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1
Constraint: the match selection heuristic cannot change "first_valid_position" in this step

But: last-fit is very inefficient
Add a step to eliminate matches that have the same reach but smaller value
Doesn't affect first_valid_position
Much better! 8% better ratio
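The added elimination step can be sketched as below. "Reach" is taken here to mean the position just past the end of a match (start position plus length); that definition, the value of VEC and the function name are assumptions. When two matches have the same reach, the one that starts earlier is longer, so the later one is dropped. The largest reach is unchanged, which is why first_valid_position is not affected.

```c
#define VEC 4   /* positions per cycle (assumed, as in the earlier example) */

/* Remove every match that ends at the same place as an earlier, longer
 * match. len[i] < 0 marks "no match at position i". */
void drop_same_reach(int len[VEC])
{
    for (int j = 1; j < VEC; j++) {
        if (len[j] < 0) {
            continue;
        }
        for (int i = 0; i < j; i++) {
            if (len[i] >= 0 && i + len[i] == j + len[j]) {
                len[j] = -1;   /* same reach, smaller value */
                break;
            }
        }
    }
}
```

For best lengths {-1, 4, 3, -1} both matches reach position 5; the step keeps the length-4 match and removes the length-3 one, so last-fit afterwards selects the longer match.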

2. Hash Function
Original: Hash[i] = curr_window[i]
  E.g. Hash[text] = 't'
XOR2: Hash[i] = curr_window[i] xor curr_window[i+1]
  E.g. Hash[text] = 't' xor 'e'
  Aliasing: 't' xor 'e' = 'e' xor 't'
  Not utilizing depth efficiently (256 words but BRAMs go up to 1024)
  3.1% better ratio
XOR3: Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
  (the first shift amount is missing in the transcript; 2 is inferred from the 10-bit hash width)
  Match contains information about first 3 bytes + sense of their ordering
  More likely that our compare windows will have a match
  Hash (BRAM address) is 10 bits → utilizes BRAM depth = 1024
  7.1% better ratio
Emulator in 13.1: compared to Verilog, it is much easier to try & verify new algorithms
It is exactly like trying out new C code
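The three hash variants are small enough to write out in C. The first shift amount in XOR3 is not legible in the transcript; the sketch assumes a shift of 2, which is the value that makes the result exactly 10 bits wide for 8-bit inputs. The function names are invented for this illustration.

```c
#include <stdint.h>

/* Original: the first byte only (8-bit hash, 256 dictionary entries). */
uint16_t hash_original(const uint8_t *w)
{
    return w[0];
}

/* XOR2: still 8 bits, and symmetric in its two inputs (aliasing). */
uint16_t hash_xor2(const uint8_t *w)
{
    return (uint16_t)(w[0] ^ w[1]);
}

/* XOR3: 10 bits; the shifts make the result depend on byte order.
 * The shift of 2 on the first byte is an assumption (see lead-in). */
uint16_t hash_xor3(const uint8_t *w)
{
    return (uint16_t)((w[0] << 2) ^ (w[1] << 1) ^ w[2]);
}
```

With these definitions `tex` and `etx` collide under XOR2 but not under XOR3, and every XOR3 value stays below 1024, so it can address a BRAM of depth 1024 directly.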

Evaluate compression ratio on widely-used compression benchmarks (work in progress):
Calgary – Canterbury – Large – Silesia corpora
Text, images, binary, databases – mix of everything
Geomean results over all benchmarks
Initial results: 78.3% or 1.28X
After optimizations: 60.2% or 1.67X
With (simple) Huffman encoding (currently on the host): 47.8% or 2.10X

Huffman portion of Gzip
16-way parallel variable-bit-width encoding/alignment

Huffman encoding
Huffman symbols are defined at runtime
Variable number of bits (≤16)
Concatenate codes to form a contiguous output stream
Separate offset computation from the actual assembly
3 compute phases:
1. Compute code bit-offsets and start offset of next iteration
2. Assembly of the codes in the current iteration
3. Build fixed-length segments across multiple iterations
[Diagram: each code is shifted (<<) by an offset derived from the code lengths len_i, then stored]

Compute offsets
Tight dependency on offset carried across iterations
Careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
Decision whether to write to memory is based on accumulating a full segment
[Diagram: pos[0], pos[1], ..., pos[n] are computed from basepos and the code lengths len_i]

Bit-level shift
Each code shifts to an arbitrary bit-offset within the entire range
2 shift stages
16-bit barrel shifters
OR reduction tree for final assembly
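The separation of offset computation from assembly can be sketched in software as two functions: a running sum that assigns each code its bit position (and yields the start offset for the next iteration), and an assembly pass that ORs each shifted code into the output. This is only an illustration. The names are invented, the per-bit loop stands in for the barrel shifters, and the LSB-first bit order is an assumption made for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Phase 1: bit offset of every code; returns the base position for the
 * next iteration (the loop-carried value). */
unsigned compute_offsets(const unsigned *len, size_t n, unsigned basepos,
                         unsigned *pos)
{
    unsigned acc = basepos;
    for (size_t i = 0; i < n; i++) {
        pos[i] = acc;
        acc += len[i];
    }
    return acc;
}

/* Phase 2: OR every code into the output at its bit offset.
 * out must be zero-initialised and large enough for all bits. */
void assemble(const uint16_t *code, const unsigned *len, const unsigned *pos,
              size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        for (unsigned b = 0; b < len[i]; b++) {
            if ((code[i] >> b) & 1u) {
                unsigned bit = pos[i] + b;
                out[bit / 8] |= (uint8_t)(1u << (bit % 8));
            }
        }
    }
}
```

Only `compute_offsets` carries a value from one iteration to the next; once the positions are known, the codes can be placed independently of each other, which is what makes the 16-way parallel assembly possible.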
