LZ77 Compression Using Altera OpenCL
Mohamed Abdelfattah

LZ77 Compression in OpenCL
- Goal: demonstrate that a high-performance (2 GB/s) compression algorithm can be implemented efficiently using the OpenCL compiler
- Outline:
  - OpenCL single-threaded flow
  - LZ77 overview
  - Implementation details
  - Optimizations & results

OpenCL Single-threaded Code
- Basically C code
- The OpenCL compiler extracts parallelism automatically (pipeline parallelism)
- One or more custom kernels run on the FPGA
- Kernels can communicate directly through "channels"

OpenCL Single-threaded Code

    kernel void simple(global const int *input, int size, global int *output)
    {
        // input must hold size + 1 elements: iteration i also reads input[i + 1]
        for (int i = 0; i < size; i++) {
            int x = input[i];       // Load x
            int y = input[i + 1];   // Load y
            int z = x + y;
            output[i] = z;          // Store z
        }
    }

- Pipeline stages: Load x, Load y, Store z
- Iterations 1, 2, 3, ... enter the pipeline on consecutive cycles; while iteration 3 is storing z, iterations 4 and 5 are already in flight
- Can start a new loop iteration every cycle: initiation interval II = 1
- No loop-carried dependencies

OpenCL Single-threaded Code

    kernel void complex(global const int *input, int size, global int *output)
    {
        int loop_carried = 1;   // initial value is not shown on the slide; 1 is an assumption
        for (int i = 0; i < size; i++) {
            int x = input[i];       // Load x
            int y = input[i + 1];   // Load y
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;      // loop-carried computation
            output[i] = z;          // Store z
        }
    }

- Pipeline stages: Load x, Load y, loop-carried computation, Store z
- Loop-carried dependency: iteration i + 1 needs the value of loop_carried computed in iteration i

OpenCL Single-threaded Code: Simple vs. Complex
- Simple kernel: a new iteration enters the pipeline every cycle (II = 1)
- Complex kernel: the loop-carried computation takes 2 cycles to compute, so the next iteration stalls and a pipeline bubble follows; a new iteration starts only every 2 cycles (II = 2)
- In the animation, by the time the simple pipeline has started iteration 11, the complex pipeline has only started iteration 7
- A new iteration of the loop starts every "II" cycles
- Optimize the loop-carried computation: going from II = 2 to II = 1 doubles the throughput
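
The effect of II on run time can be estimated with a simple first-order model: the first iteration needs the full pipeline latency, and every further iteration starts II cycles after the previous one. The sketch below is only that model, not something the compiler reports; the function name and the `depth` parameter are assumptions for illustration.

```c
/* Estimated cycles to push n iterations through a pipelined loop.
 * depth: latency of one iteration through the pipeline, in cycles (assumed)
 * ii:    initiation interval, cycles between consecutive iteration starts */
static unsigned long pipelined_loop_cycles(unsigned long n, unsigned depth, unsigned ii)
{
    if (n == 0)
        return 0;
    /* First iteration takes `depth` cycles; each later one adds `ii` cycles. */
    return (unsigned long)depth + (unsigned long)ii * (n - 1);
}
```

For a long loop the `ii * (n - 1)` term dominates, which is why halving II roughly doubles the throughput.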

LZ77 Compression Example
- Input: This sentence is an easy sentence to compress.
- Scan the file byte by byte
- Look for matches with text seen earlier: the second "sentence" matches the first one
  - Match length = 8
  - Match offset = 20 bytes
- Replace the match with a reference to the previous occurrence: marker, length, offset
- Output: This sentence is an easy @(8,20) to compress.
- Saved 5 bytes! (8 bytes of text replaced by a 3-byte reference)
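
To make the meaning of a (length, offset) reference concrete, here is a minimal decoder sketch in plain C. It is not part of the presented design (which is a compressor); the function names are made up for illustration, and the caller is assumed to provide a large enough output buffer.

```c
#include <stddef.h>
#include <string.h>

/* Append `len` literal bytes to the output; returns the new output length. */
static size_t emit_literals(char *out, size_t out_len, const char *lit, size_t len)
{
    memcpy(out + out_len, lit, len);
    return out_len + len;
}

/* Expand a reference: copy `length` bytes that start `offset` bytes back
 * in the output produced so far. Copying byte by byte keeps the result
 * well defined even if source and destination overlap. */
static size_t emit_match(char *out, size_t out_len, size_t length, size_t offset)
{
    for (size_t k = 0; k < length; k++)
        out[out_len + k] = out[out_len + k - offset];
    return out_len + length;
}
```

Decoding the literals "This sentence is an easy ", then the reference (8, 20), then the literals " to compress." reproduces the original sentence.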

Overview
- Single-threaded OpenCL flow
- Single kernel: fully pipelined, II = 1
- Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2e9 bytes/s = 3051 MB/s (with 1 MB = 2^20 bytes)
- Pipeline steps:
  1. Shift In New Data
  2. Dictionary Lookup/Update
  3. Match Search & Filtering
  4. Write to output
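
The throughput estimate is plain arithmetic: bytes per cycle times clock frequency, converted to MB/s. The helper below is an illustrative sketch (the function name is an assumption) that uses 1 MB = 2^20 bytes, which is the convention that yields the 3051 MB/s figure.

```c
/* Throughput in MB/s (1 MB = 2^20 bytes) for a fully pipelined kernel
 * (II = 1) that consumes bytes_per_cycle bytes on every clock cycle. */
static double throughput_mb_per_s(unsigned bytes_per_cycle, double clock_hz)
{
    return bytes_per_cycle * clock_hz / (1024.0 * 1024.0);
}
```

With 16 bytes per cycle at 200 MHz this evaluates to about 3051.8 MB/s.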

Comparison against CPU/Verilog
- Best implementation of Gzip on CPU: by Intel Corporation, on an Intel Core i5 (32 nm) processor, 2013
  - Compression speed: 338 MB/s (0.3 GB/s)
  - Compression ratio: 2.18X
- Best implementation on ASICs: AHA Products Group, coming up Q2 2014
  - Compression speed: 2.5 GB/s
- Best implementation on FPGAs: Verilog, IBM Corporation, ICCAD, Nov. 2013, on an Altera Stratix-V A7
  - Compression speed: 3 GB/s
- OpenCL design example: Altera Stratix-V A7, developed in 1 month
  - Compression speed: 2.7 GB/s

Comparison against CPU
- Same compression ratio
- 12X better performance/Watt

Comparison against Verilog
- 10% slower
- 12% more resources
- Much lower design effort and design time

1. Shift In New Data
- Input arrives from DDR memory, VEC bytes per cycle (VEC = 4 in this example)
- We use text in our example, but the data can be anything
- Example: the current window holds "old_text" and the next input is "sample_text"
  - Before the shift: current window = o l d _ t e x t
  - After shifting in "samp": current window = t e x t | s a m p (| marks the cycle boundary); remaining input = "le_text"
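
The shift step can be sketched in plain C as follows. This is an illustration only: it assumes the current window is 2 * VEC bytes wide, as the example above suggests, and the function name is made up.

```c
#include <string.h>

#define VEC 4  /* bytes shifted in per cycle, as in the slides' example */

/* Drop the oldest VEC bytes of the window and append VEC new bytes. */
static void shift_in(unsigned char window[2 * VEC], const unsigned char new_data[VEC])
{
    memmove(window, window + VEC, VEC);   /* newer half becomes the older half */
    memcpy(window + VEC, new_data, VEC);  /* new data fills the newer half */
}
```

Starting from "old_text" and shifting in "samp" leaves "textsamp" in the window.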

2. Dictionary Lookup/Update
- Dictionaries buffer the text that we have already processed
- Current window: t e x t s a m p, giving one substring per byte position: text, exts, xtsa, tsam
- Steps:
  1. Compute a hash of each substring
  2. Look for a match in each of the 4 dictionaries (Dictionary 0 to Dictionary 3)
  3. Update the dictionaries
- The lookups return possible matches from history (the dictionaries), e.g. for "text": tan_ (Dictionary 0), text (Dictionary 1), texl (Dictionary 2), ten (Dictionary 3)
- Each dictionary i has one write port (Wi) and four read ports (RDi0 to RDi3)
- Generate exactly the number of read/write ports that we need
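
A software model of the lookup/update step might look like the sketch below. It rests on several assumptions that the slides do not spell out: substring i is written to dictionary i (which would match the one-write-port, four-read-port layout), the hash is simply the first byte of the substring, and each dictionary has 256 entries. All names are hypothetical.

```c
#include <string.h>

#define VEC   4    /* substrings per cycle and bytes per substring */
#define DEPTH 256  /* entries per dictionary: one per 8-bit hash value (assumed) */

/* entry[d][h] is the last substring with hash h that was written to dictionary d. */
typedef struct {
    unsigned char entry[VEC][DEPTH][VEC];
} dictionaries_t;

/* Simplest possible hash: the first byte of the substring. */
static unsigned hash_first_byte(const unsigned char *s)
{
    return s[0];
}

/* For each of the VEC substrings in the window, read one candidate from
 * every dictionary (VEC reads per dictionary), then write substring i into
 * dictionary i (one write per dictionary). Reads happen before writes. */
static void lookup_and_update(dictionaries_t *d,
                              const unsigned char window[2 * VEC],
                              unsigned char candidates[VEC][VEC][VEC])
{
    for (int i = 0; i < VEC; i++) {
        unsigned h = hash_first_byte(window + i);
        for (int j = 0; j < VEC; j++)
            memcpy(candidates[i][j], d->entry[j][h], VEC);
    }
    for (int i = 0; i < VEC; i++) {
        unsigned h = hash_first_byte(window + i);
        memcpy(d->entry[i][h], window + i, VEC);
    }
}
```

In this model, `candidates[i][j]` plays the role of the comparison window that substring i reads from dictionary j.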

3. Match Search & Filtering
- Current windows (the substrings): text, exts, xtsa, tsam
- Comparison windows: a set of candidate matches for each incoming substring
  - text: ten, texl, text, tan_
  - exts: ent, eps, ears, eat
  - xtsa: xirt, xely, xylo, xant
  - tsam: ten, teal, tame, tan_
- Compare each current window against each of its 4 comparison windows, byte by byte
  - e.g. "text" against ten, texl, text, tan_ gives match lengths 2, 3, 4, 1
  - We have another 3 of those comparator sets, one per substring
- Match reduction keeps the longest match: best length = 4

3. Match Search & Filtering
- Typical C code
- Fixed loop bounds: the compiler can unroll the loops
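
The code shown on the original slides did not survive extraction. As a stand-in, the following sketch shows what a match search with fixed loop bounds could look like in plain C; it is not the author's code, and the function names are assumptions.

```c
#define VEC 4  /* substring length and number of comparison windows */

/* Number of leading bytes that cur and cmp have in common.
 * The loop bound is a compile-time constant, so it can be fully unrolled. */
static int match_length(const unsigned char cur[VEC], const unsigned char cmp[VEC])
{
    int length = 0;
    int still_matching = 1;
    for (int k = 0; k < VEC; k++) {
        still_matching = still_matching && (cur[k] == cmp[k]);
        length += still_matching;
    }
    return length;
}

/* Match reduction: the longest match over all VEC comparison windows. */
static int best_match_length(const unsigned char cur[VEC],
                             const unsigned char cmp[VEC][VEC])
{
    int best = 0;
    for (int j = 0; j < VEC; j++) {
        int len = match_length(cur, cmp[j]);
        if (len > best)
            best = len;
    }
    return best;
}
```

For "text" compared against ten, texl, text and tan_ this yields lengths 2, 3, 4 and 1, and a best length of 4.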

3. Match Search & Filtering
- One bestlength is associated with each current_window
- e.g. for the window t e x t s a m p: text = 3, exts = 1, xtsa = 3, tsam = 4

3. Match Search & Filtering
Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than the original
2. Remove matches covered by the previous step
3. From the remaining set, select the best ones (last-fit heuristic for bin-packing)
4. Compute the "first valid position" for the next step
- Example: window t e x t | s a m p (| marks the cycle boundary), best lengths 3 1 3 4
  - First match, length 3: selected (last-fit)
  - Second match, length 1: too short
  - Third match, length 3: removed (overlap)
  - Fourth match, length 4: selected (last-fit)
  - The fourth match reaches past the cycle boundary, so the first valid position in the next cycle is 3

3. Match Search & Filtering
1. Remove matches that are longer when encoded than the original
  - e.g. best lengths 3 1 3 4 become 3 -1 3 4 (-1 marks a removed match)
2. Remove matches covered by the previous step
  - A match that starts before the first valid position is already covered, so its best length is set to -1

3. Match Search & Filtering
3. From the remaining set, select the best ones (last-fit bin-packing)
  - e.g. best lengths 3 -1 3 4 become 3 -1 -1 4
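
3. Match Search & Filtering
4. Compute the "first valid position" for the next step
  - e.g. best lengths 3 -1 -1 4 in the window t e x t | s a m p: the last match starts at position 3 and has length 4, so the first byte after it is position 7 of the window, which is position 3 of the next cycle (VEC = 4)
  - first_valid_pos = 3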
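
The four filtering steps can be modelled in plain C as below. This is a sketch under assumptions, not the kernel code: a match is taken to be "longer when encoded" when it is shorter than 3 bytes (the smaller encoded size), positions are counted from 0, and last-fit is implemented by scanning from the last position backwards and keeping every match that ends before the next kept match starts.

```c
#define VEC       4  /* match positions per cycle, as in the slides' example */
#define MIN_MATCH 3  /* assumed threshold: shorter matches do not pay off */

/* Filters best[] in place (-1 marks a removed match) and returns the
 * first valid position for the next cycle. */
static int filter_matches(int best[VEC], int first_valid_pos)
{
    /* 1. Remove matches that are longer when encoded than the original. */
    for (int i = 0; i < VEC; i++)
        if (best[i] < MIN_MATCH)
            best[i] = -1;

    /* 2. Remove matches covered by the previous step. */
    for (int i = 0; i < first_valid_pos && i < VEC; i++)
        best[i] = -1;

    /* 3. Last-fit: walk backwards, keep a match only if it ends at or
     *    before the start of the match kept after it. The last match may
     *    extend into the next cycle, so the initial limit is 2 * VEC. */
    int limit = 2 * VEC;
    for (int i = VEC - 1; i >= 0; i--) {
        if (best[i] < 0)
            continue;
        if (i + best[i] <= limit)
            limit = i;
        else
            best[i] = -1;
    }

    /* 4. First valid position for the next cycle: how far the kept
     *    matches reach past the cycle boundary. */
    int next = 0;
    for (int i = 0; i < VEC; i++)
        if (best[i] >= 0 && i + best[i] - VEC > next)
            next = i + best[i] - VEC;
    return next;
}
```

With best lengths 3 1 3 4 and a first valid position of 0, this reproduces the example: the result is 3 -1 -1 4 and the next first valid position is 3.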

4. Writing to Output
- Each match is written as: marker, length, offset
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 = 262144 (it doesn't make sense to be more): fits in 21 bits
- Use either 3 or 4 bytes for this (fields: MARKER, LENGTH, OFFSET):
  - Offset < 2048: 3 bytes
  - Offset = 2048 .. 262144: 4 bytes
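
The size rule above is enough to decide whether a match is worth encoding. The sketch below only computes sizes; it deliberately does not pack bits, because the exact bit layout of the marker, length and offset fields is not given on the slide. The function names are hypothetical.

```c
#define SHORT_OFFSET_LIMIT 2048u  /* offsets below this use the 3-byte form */

/* Encoded size of a (marker, length, offset) reference in bytes. */
static int encoded_size(unsigned offset)
{
    return offset < SHORT_OFFSET_LIMIT ? 3 : 4;
}

/* Bytes saved by replacing `length` literal bytes with a reference.
 * A result of zero or less means the match is not worth encoding. */
static int bytes_saved(int length, unsigned offset)
{
    return length - encoded_size(offset);
}
```

For the earlier example, a match of length 8 at offset 20 saves 8 - 3 = 5 bytes.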

LZ77 Compression in OpenCL
- Outline:
  - OpenCL single-threaded flow
  - LZ77 overview
  - Implementation details
  - Optimizations & results
    - Area optimizations
    - Compression ratio
    - Results

Area Optimizations
- By choosing the right (hardware) architecture, you are already most of the way there
- The last ~5% (of area optimizations) requires some tinkering and advanced knowledge
- Example: Match Search & Filtering

Match Search & Filtering
- The original code generates a long vine of logic: a condition followed by a chain of "compute length" steps, each depending on the previous value of "length"
- This causes longer latency in the pipeline, which increases area
- Balance the computation: get rid of the dependency on "length"
- A balanced tree has a shallower pipeline depth and uses less area

Modified Code
- Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
- The OR operator creates a balanced tree (no condition)
- The OR operator is cheaper than an adder
- 4% smaller area
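
The modified code itself is not reproduced in the extracted slides, so the following is only a sketch of the idea in plain C: each comparison produces a bit pattern instead of a length, the patterns are combined with OR, and the length is recovered at the end. Because the patterns are nested (0001, 0011, 0111, 1111), OR-ing them gives the pattern of the longest match. Function names are assumptions.

```c
#define VEC 4  /* substring length and number of comparison windows */

/* Bit k is set when bytes 0..k all match, so a match of length 2 gives
 * 0011, length 3 gives 0111 and length 4 gives 1111. */
static unsigned match_bits(const unsigned char cur[VEC], const unsigned char cmp[VEC])
{
    unsigned bits = 0;
    unsigned prefix_ok = 1;
    for (int k = 0; k < VEC; k++) {
        prefix_ok &= (unsigned)(cur[k] == cmp[k]);
        bits |= prefix_ok << k;
    }
    return bits;
}

/* Match reduction with OR instead of compare-and-select on a length. */
static unsigned best_match_bits(const unsigned char cur[VEC],
                                const unsigned char cmp[VEC][VEC])
{
    unsigned bits = 0;
    for (int j = 0; j < VEC; j++)
        bits |= match_bits(cur, cmp[j]);
    return bits;
}

/* Convert the bit pattern back to a length by counting the set bits. */
static int bits_to_length(unsigned bits)
{
    int length = 0;
    for (int k = 0; k < VEC; k++)
        length += (bits >> k) & 1u;
    return length;
}
```

The OR reduction has no data-dependent condition, which is what allows it to be arranged as a balanced tree.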

Compression Ratio
- Evaluate compression ratio on widely-used compression benchmarks:
  - Calgary, Canterbury, Large, and Silesia corpora
  - Text, images, binary, databases: a mix of everything
  - Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X
- Want to improve results!
  1. Bin-packing heuristic
  2. Hash function

1. Bin-packing heuristic
- We use the "last-fit" heuristic
- Reason: we have a loop-carried variable, "first_valid_position"
- Original order of the filtering steps:
  1. Filter bestlength (length): remove matches that are longer when encoded than the original
  2. Filter bestlength (covered): remove matches covered by the previous step
  3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
  4. Compute first_valid_pos: compute the "first valid position" for the next step
- The dependency causes a stall in the kernel pipeline: we cannot start a new iteration each cycle, II = 6 (Optimization Report in 14.0)
- Last-fit bin-packing doesn't affect "first_valid_position", because we always use the last match (which determines first_valid_position); e.g. best lengths 3 1 3 4
- Reordered steps:
  1. Filter bestlength (length)
  2. Filter bestlength (covered)
  3. Compute first_valid_pos
  4. Filter bestlength (bin-pack)
- Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1
- Constraint: the bin-pack step cannot change first_valid_position

1. Bin-packing heuristic
- Constraint: the match selection heuristic cannot change "first_valid_position"
- But: last-fit is very inefficient
  - e.g. in the window t e x t | s a m p, best lengths 4 3 2 at consecutive positions all reach the same position, and last-fit keeps one of the later, shorter matches
- Add a step to eliminate matches that have the same reach but smaller value
  - The length-4 match is kept and the others are set to -1: much better!
  - Doesn't affect first_valid_position
- 8% better ratio
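
The extra elimination step can be sketched in plain C as follows. It is an illustration under assumptions (0-based positions, -1 for removed matches, a hypothetical function name): a match is dropped when an earlier match reaches exactly the same end position, because the earlier match must then be longer.

```c
#define VEC 4  /* match positions per cycle, as in the slides' example */

/* Remove matches that have the same reach (start + length) as an earlier
 * match. The earlier match is always the longer one, and the furthest
 * reach is unchanged, so first_valid_position is not affected. */
static void drop_dominated_matches(int best[VEC])
{
    for (int i = 1; i < VEC; i++) {
        if (best[i] < 0)
            continue;
        for (int j = 0; j < i; j++) {
            if (best[j] >= 0 && j + best[j] == i + best[i]) {
                best[i] = -1;  /* same reach, smaller value */
                break;
            }
        }
    }
}
```

With best lengths 4 3 2 at positions 0, 1 and 2, only the length-4 match survives, while a set such as 3 -1 3 4 (reaches 3, 5 and 7) is left unchanged.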

2. Hash Function
- Original: Hash[i] = curr_window[i]
  - e.g. Hash[text] = 't'
- XOR2: Hash[i] = curr_window[i] xor curr_window[i+1]
  - e.g. Hash[text] = 't' xor 'e'
  - Aliasing: 't' xor 'e' = 'e' xor 't'
  - Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)
  - 3.1% better ratio
- XOR3: Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
  - Hash contains information about the first 3 bytes plus a sense of their ordering
  - More likely that our compare windows will have a match
  - Hash (BRAM address) is 10 bits: utilizes BRAM depth = 1024
  - 7.1% better ratio
- Emulator in 13.1: compared to Verilog, it is much easier to try and verify new algorithms; it is exactly like trying out new C code
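
The three hash functions translate directly into C. The sketch below follows the formulas on the slide; only the function names are made up. The shifts in XOR3 are what break the symmetry of plain XOR and widen the result from 8 to 10 bits.

```c
/* Original: the first byte of the substring (8-bit hash, 256 entries). */
static unsigned hash_original(const unsigned char *w)
{
    return w[0];
}

/* XOR2: still 8 bits, and symmetric in its two inputs (aliasing). */
static unsigned hash_xor2(const unsigned char *w)
{
    return (unsigned)w[0] ^ (unsigned)w[1];
}

/* XOR3: shifting each byte by a different amount makes the hash depend on
 * byte order and produces a 10-bit value (0..1023). */
static unsigned hash_xor3(const unsigned char *w)
{
    return ((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ (unsigned)w[2];
}
```

For example, XOR2 gives the same hash for "te" and "et", while XOR3 distinguishes "tex" from "etx".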

Results
- Evaluate compression ratio on widely-used compression benchmarks:
  - Calgary, Canterbury, Large, and Silesia corpora
  - Text, images, binary, databases: a mix of everything
  - Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host; work in progress): 47.8% or 2.10X

Huffman portion of Gzip
- 16-way parallel variable-bit-width encoding/alignment

Huffman encoding
- Huffman symbols are defined at runtime
- Variable number of bits (≤ 16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly
- 3 compute phases:
  1. Compute code bit-offsets and the start offset of the next iteration
  2. Assemble the codes in the current iteration
  3. Build fixed-length segments across multiple iterations
- Figure: each code of length len_i is shifted (<<) to its bit-offset before the STORE

Compute offsets
- Tight dependency on the offset carried across iterations
- Be careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- The decision whether to write to memory is based on accumulating a full segment
- Figure: bit-offsets pos[0], pos[1], ..., pos[n] computed from basepos and the code lengths len_i
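
One way to read the offset computation is as a running sum of code lengths that starts at basepos. The sketch below is an assumption about what the figure shows, written in plain C with a hypothetical function name and 4 codes per iteration instead of 16 to keep it short.

```c
#define N_CODES 4  /* codes per iteration; the design uses 16 */

/* pos[i] is the bit-offset at which code i starts. The return value is
 * the start offset (basepos) for the next iteration: the only value that
 * is carried across iterations. */
static unsigned compute_offsets(const unsigned len[N_CODES], unsigned basepos,
                                unsigned pos[N_CODES])
{
    unsigned running = basepos;
    for (int i = 0; i < N_CODES; i++) {
        pos[i] = running;
        running += len[i];
    }
    return running;
}
```

Written this way, the loop-carried value depends only on basepos and the sum of the lengths, which is the addition order the slide warns about preserving.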

Bit-level shift
- Each code shifts to an arbitrary bit-offset within the entire range
- 2 shift stages
- 16-bit barrel shifters
- OR reduction tree for final assembly
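
The assembly step can be illustrated with a small software model: every code is shifted to its bit-offset and the results are OR-ed together. This sketch makes several simplifying assumptions that are not in the slides: 4 codes per iteration, a single 64-bit output word, codes packed starting at the least significant bit, and codes that contain no bits above their stated length.

```c
#include <stdint.h>

#define N_CODES 4  /* codes per iteration; the design uses 16 */

/* Shift each code to its bit-offset and merge with OR. Because the codes
 * occupy disjoint bit ranges, the OR-s can be evaluated in any order,
 * for example as a reduction tree. Requires pos[i] + length of code i <= 64. */
static uint64_t assemble_codes(const uint16_t code[N_CODES], const unsigned pos[N_CODES])
{
    uint64_t word = 0;
    for (int i = 0; i < N_CODES; i++)
        word |= (uint64_t)code[i] << pos[i];
    return word;
}
```

For codes 0x5 (3 bits), 0x1F (5 bits), 0x2 (2 bits) and 0x7F (7 bits) at offsets 0, 3, 8 and 10, the assembled word is 0x1FEFD.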