LZ77 Compression Using Altera OpenCL


1 LZ77 Compression Using Altera OpenCL
Mohamed Abdelfattah

2 LZ77 Compression in OpenCL
Goal: Demonstrate that a high-performance (2 GB/s) compression algorithm can be implemented efficiently using the OpenCL compiler.
Outline: OpenCL single-threaded flow; LZ77 overview; implementation details; optimizations & results.

3 OpenCL Single-threaded Code
Basically C code.
The OpenCL compiler extracts parallelism automatically (pipeline parallelism).
One or more custom kernels run on the FPGA.
Kernels can communicate directly through “channels”.

4-10 OpenCL Single-threaded Code
    kernel void simple(global const int *input, int size, global int *output)
    {
        /* Each iteration adds two neighbouring inputs.
           Reads input[1] .. input[size + 1], so input must be that long. */
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z = x + y;
            output[i] = z;
        }
    }
Pipeline on the FPGA: Load x and Load y feed the adder, followed by Store z.
(Animation: iterations 1, 2, 3, 4, 5 enter the pipeline on successive cycles, each one stage behind the previous one.)
Can start a new loop iteration every cycle: initiation interval II = 1.
No loop-carried dependencies.

11-13 OpenCL Single-threaded Code
    kernel void complex(global const int *input, int size, global int *output)
    {
        int loop_carried = 2;  /* initial value is not shown on the slide; assumed here */
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;  /* loop-carried computation */
            output[i] = z;
        }
    }
Pipeline on the FPGA: Load x and Load y, then the loop-carried computation, then Store z.
The loop-carried computation needs data from iteration x for iteration x+1.

14-25 OpenCL Single-threaded Code
Simple vs. Complex pipeline (animation). Both pipelines have the stages Load x, Load y and Store z.
Simple: a new iteration enters the pipeline every cycle (II = 1).
Complex: the loop-carried computation takes 2 cycles to compute, so the next iteration stalls until it finishes, and every stall leaves a pipeline bubble behind it (II = 2).
After 11 cycles the simple pipeline has started iteration 11, while the complex pipeline has only started iteration 7.
The simple pipeline (II = 1) has double the throughput of the complex one (II = 2): optimize the loop-carried computation.
A new iteration of the loop starts every “II” cycles.
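As a rough illustration of what the initiation interval means for run time, the cycle count of a pipelined loop can be estimated as the pipeline depth plus II cycles for every further iteration. The sketch below is plain C and is not taken from the slides; the function name and the 3-stage depth used in the note are assumptions for illustration.

```c
#include <stdint.h>

/* Estimated cycles for a pipelined loop (hypothetical helper):
 * the first iteration needs `depth` cycles to leave the pipeline,
 * and each further iteration starts `ii` cycles after the previous one. */
uint64_t pipelined_loop_cycles(uint64_t iterations, uint64_t ii, uint64_t depth)
{
    if (iterations == 0)
        return 0;
    return depth + ii * (iterations - 1);
}
```

With 1000 iterations and an assumed depth of 3 stages, the estimate is 1002 cycles at II = 1 and 2001 cycles at II = 2, which is the roughly 2x throughput gap between the simple and the complex kernel.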

26 LZ77 Compression in OpenCL
Outline: OpenCL single-threaded flow; LZ77 overview; implementation details; optimizations & results.

27-39 LZ77 Compression Example
Input: This sentence is an easy sentence to compress.
1. Scan the file byte by byte.
2. Look for matches. The second "sentence" matches the first one: the match grows byte by byte (match length = 2, 3, ...) up to match length = 8, at match offset = 20 bytes.
3. Replace the match with a reference to the previous occurrence: marker, length, offset.
Output: This sentence is an easy [marker, length = 8, offset = 20] to compress.
Saved 5 bytes!
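To make the (length, offset) reference concrete, here is a minimal sketch of how a decompressor could expand it. It is plain C written for illustration, not code from the talk; the function name and buffer handling are assumptions.

```c
#include <stddef.h>

/* Expands one (length, offset) reference during decompression.
 * out    : output buffer that already holds `pos` decoded bytes
 * offset : distance back to the earlier occurrence (must be 1..pos)
 * Returns the new number of decoded bytes. The caller must make sure
 * the buffer has room for pos + length bytes. */
size_t lz77_copy_match(char *out, size_t pos, size_t length, size_t offset)
{
    for (size_t k = 0; k < length; k++) {
        /* Byte-by-byte copy, so overlapping references also work. */
        out[pos] = out[pos - offset];
        pos++;
    }
    return pos;
}
```

For the example sentence, calling this with 25 bytes already decoded, length 8 and offset 20 re-creates the second "sentence" from the first one.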

40 LZ77 Compression in OpenCL
Outline: OpenCL single-threaded flow; LZ77 overview; implementation details; optimizations & results.

41 Overview
Single-threaded OpenCL flow
Single kernel: fully pipelined, II = 1
Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2 * 10^9 bytes/s = 3051 MB/s (with 1 MB = 2^20 bytes)
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output
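The throughput estimate is simple arithmetic, and it shows directly why II = 1 matters: the same kernel at a larger II would deliver proportionally less. A small C sketch (the function name is an assumption for illustration; 1 MB is taken as 2^20 bytes, which is what reproduces the 3051 MB/s figure):

```c
/* Estimated throughput in MB/s (1 MB = 2^20 bytes) of a pipelined kernel
 * that consumes `bytes_per_cycle` bytes every `ii` clock cycles. */
double throughput_mb_per_s(unsigned bytes_per_cycle, double clock_hz, unsigned ii)
{
    return (double)bytes_per_cycle * clock_hz / (double)ii / (1024.0 * 1024.0);
}
```

With 16 bytes per cycle at 200 MHz this gives about 3051 MB/s for II = 1, and about half of that for II = 2.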

42-43 Comparison against CPU/Verilog
Best implementation of Gzip on CPU
By Intel Corporation
On Intel Core i5 (32 nm) processor, 2013
Compression speed: 338 MB/s
Compression ratio: 2.18X

44 Comparison against CPU/Verilog
Best implementation on ASICs
AHA Products Group
Coming up Q2 2014
Compression speed: 2.5 GB/s

45 Comparison against CPU/Verilog
Best implementation on FPGAs
Verilog, IBM Corporation
Nov ICCAD
Altera Stratix-V A7
Compression speed: 3 GB/s

46 Comparison against CPU/Verilog
OpenCL design example
Altera Stratix-V A7
Developed in 1 month
Compression speed? Compression ratio?

47 Comparison against CPU/Verilog
FPGA, Verilog (IBM): 3 GB/s
FPGA, OpenCL (this design): 2.7 GB/s
ASIC (AHA): 2.5 GB/s
CPU (Intel): 0.3 GB/s

48 Comparison against CPU
Same compression ratio
12X better performance/Watt

49 Comparison against Verilog
10% slower
12% more resources
Much lower design effort and design time

50 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

51-55 1. Shift In New Data
Input arrives from DDR memory. We use text in our example, but the data can be anything.
Each cycle, VEC new bytes are shifted into the current window (VEC = 4 in this example).
e.g. incoming data "sample_text":
Current window before the shift: o l d _ t e x t
The oldest VEC bytes ("old_") are shifted out across the cycle boundary.
Current window after the shift: t e x t s a m p ("le_text" is still to come)
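The shift step can be sketched in a few lines of plain C. This is an illustration rather than the kernel code from the talk: the function name is hypothetical, and the window size of 2 * VEC bytes is an assumption read off the 8-character window drawn on the slides.

```c
#include <string.h>

#define VEC 4  /* bytes consumed per cycle in the slides' example */

/* Drops the oldest VEC bytes of the window and appends VEC new ones. */
void shift_in(unsigned char window[2 * VEC], const unsigned char incoming[VEC])
{
    memmove(window, window + VEC, VEC);   /* keep the newer half  */
    memcpy(window + VEC, incoming, VEC);  /* append the new bytes */
}
```

Starting from "old_text" and shifting in "samp" leaves "textsamp" in the window. Because VEC is a compile-time constant, both copies have fixed bounds.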

56 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

57-63 2. Dictionary Lookup/Update
Current window: t e x t s a m p
Substrings, one per byte position: text, exts, xtsa, tsam
1. Compute hash (one hash per substring)
2. Look for match in 4 dictionaries
3. Update dictionaries
Dictionaries buffer the text that we have already processed, e.g. entries such as "tan_", "texl", "ears", "xylo", "tame", "teal".
Each lookup returns possible matches from history (the dictionaries), one candidate per dictionary for every substring.
Ports: each dictionary n has one write port (Wn) and four read ports (RDn0 to RDn3).
Generate exactly the number of read/write ports that we need.
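The lookup/update step can be sketched in plain C as follows. This is an interpretation for illustration, not the kernel from the talk: the names are hypothetical, the hash is the single-byte "original" hash mentioned later in the slides, and the rule that substring i writes into dictionary i is an assumption based on the one-write-port-per-dictionary diagram.

```c
#include <string.h>

#define VEC   4    /* bytes processed per cycle, as in the slides' example */
#define LEN   4    /* substring length (assumed equal to VEC)              */
#define DEPTH 256  /* entries per dictionary for an 8-bit hash             */

static unsigned char dict[VEC][DEPTH][LEN];

/* window holds VEC + LEN bytes; candidates[i][d] receives the entry that
 * dictionary d holds for the hash of substring i. */
void lookup_and_update(const unsigned char window[VEC + LEN],
                       unsigned char candidates[VEC][VEC][LEN])
{
    unsigned hash[VEC];

    /* 1. Compute hash: here simply the first byte of each substring. */
    for (int i = 0; i < VEC; i++)
        hash[i] = window[i];

    /* 2. Every substring reads one candidate from every dictionary. */
    for (int i = 0; i < VEC; i++)
        for (int d = 0; d < VEC; d++)
            memcpy(candidates[i][d], dict[d][hash[i]], LEN);

    /* 3. Substring i overwrites its entry in dictionary i (assumed mapping). */
    for (int i = 0; i < VEC; i++)
        memcpy(dict[i][hash[i]], window + i, LEN);
}
```

Each dictionary sees VEC reads and one write per call, which matches the port counts listed above. All loop bounds are constants, so the loops can be fully unrolled.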

64 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

65 3. Match Search & Filtering
Current windows: the substrings text, exts, xtsa, tsam.
Comparison windows: a set of candidate matches for each incoming substring.
Compare each current window against each of its 4 comparison windows.

66-67 3. Match Search & Filtering
Comparators: compare each byte of the current window against a comparison window.
e.g. for the current window "text", the four comparison windows (beginning "ten", "texl", "text", "tan_") give match lengths 2, 3, 4, 1.
We have another 3 of those comparator sets, one for each remaining substring.
Match reduction: best length = 4.

68-71 3. Match Search & Filtering
Typical C code.
Fixed loop bounds: the compiler can unroll the loop.

72 3. Match Search & Filtering
One bestlength is associated with each current_window, e.g. text: 3, exts: 1, xtsa: 3, tsam: 4.
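The code shown on these slides did not survive in the transcript. The following plain C sketch shows what a comparator and match reduction with fixed loop bounds could look like; the function names and the constants VEC = LEN = 4 are assumptions for illustration.

```c
#define VEC 4  /* comparison windows per substring */
#define LEN 4  /* bytes compared per window        */

/* Number of leading bytes on which the two windows agree. */
int match_length(const unsigned char cur[LEN], const unsigned char cmp[LEN])
{
    int length = 0;
    int still_matching = 1;
    for (int k = 0; k < LEN; k++) {  /* fixed bound: can be unrolled */
        still_matching = still_matching && (cur[k] == cmp[k]);
        length += still_matching;
    }
    return length;
}

/* Match reduction: the longest match over all comparison windows. */
int best_length(const unsigned char cur[LEN], const unsigned char cmp[VEC][LEN])
{
    int best = 0;
    for (int d = 0; d < VEC; d++) {  /* fixed bound: can be unrolled */
        int length = match_length(cur, cmp[d]);
        if (length > best)
            best = length;
    }
    return best;
}
```

For the slide's example, the current window "text" compared against candidates with match lengths 2, 3, 4 and 1 reduces to a best length of 4.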

73-81 3. Match Search & Filtering
Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than original.
2. Remove matches covered by the previous step.
3. From the remaining set, select the best ones (heuristic for bin-packing): last-fit.
4. Compute “first valid position” for the next step.

e.g. window t e x t s a m p, positions 0 1 2 3, best lengths 3 1 3 4:
position 0 (length 3): kept by last-fit
position 1 (length 1): too short
position 2 (length 3): overlaps the match at position 3
position 3 (length 4): kept by last-fit
Best lengths after filtering: 3 -1 -1 4 (removed matches are marked -1).
The match at position 3 ends at position 7, past the cycle boundary, so with VEC = 4 the first valid position in the next cycle is first_valid_pos = 3.
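The four filtering steps can be sketched in plain C as below. This is one possible reading of the slides, not the author's code: the function name is hypothetical, MIN_MATCH = 3 is an assumption chosen so that the length-3 match is kept and the length-1 match is dropped as in the example, and "last-fit" is interpreted as walking backwards from the last position and keeping every match that ends before the next kept match starts.

```c
#include <limits.h>

#define VEC 4
#define MIN_MATCH 3  /* assumed threshold below which a match does not pay off */

/* Filters best_length[] in place (removed matches become -1) and returns
 * the first valid position for the next cycle. */
int filter_matches(int best_length[VEC], int first_valid_pos)
{
    /* 1. Too short to be worth encoding. 2. Covered by the previous cycle. */
    for (int i = 0; i < VEC; i++) {
        if (best_length[i] < MIN_MATCH || i < first_valid_pos)
            best_length[i] = -1;
    }

    /* 3. Last-fit: keep the last match, then every earlier match that
     *    ends at or before the start of the most recently kept match. */
    int limit = INT_MAX;  /* start position of the most recently kept match */
    int reach = 0;        /* end position of the last kept match            */
    for (int i = VEC - 1; i >= 0; i--) {
        if (best_length[i] < 0)
            continue;
        if (i + best_length[i] <= limit) {
            if (limit == INT_MAX)
                reach = i + best_length[i];
            limit = i;
        } else {
            best_length[i] = -1;  /* overlaps a kept match */
        }
    }

    /* 4. Bytes of the next window already covered by the last match. */
    return reach > VEC ? reach - VEC : 0;
}
```

On the example above (best lengths 3 1 3 4, first valid position 0) this yields 3 -1 -1 4 and returns 3.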

82 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

83 4. Writing to Output
Each match is written as: marker, length, offset.
Length is limited by VEC (= 16 in our case): fits in 4 bits.
Offset is limited by 0x40000 (it doesn't make sense to be more): fits in 21 bits.
Use either 3 or 4 bytes for this: 3 bytes when offset < 2048, 4 bytes for larger offsets.
Fields: MARKER, LENGTH, OFFSET.

84 Results
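The size rule ties back to the first filtering step (dropping matches that are longer when encoded than the original). A minimal C sketch of that check follows; the function names are hypothetical, and the exact bit layout of the marker, length and offset fields is not recoverable from the transcript, so only the byte counts are modelled.

```c
#include <stddef.h>

/* Bytes needed to encode one match reference: 3 for near offsets,
 * 4 otherwise (assumed to cover every offset of 2048 and above). */
size_t encoded_match_bytes(unsigned offset)
{
    return offset < 2048 ? 3 : 4;
}

/* Bytes saved by replacing `length` literal bytes with a reference;
 * zero or negative means the match is not worth encoding. */
long bytes_saved(unsigned length, unsigned offset)
{
    return (long)length - (long)encoded_match_bytes(offset);
}
```

For the earlier sentence example (length 8, offset 20) the reference takes 3 bytes, which gives the 5 saved bytes quoted on the LZ77 example slide.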

85 LZ77 Compression in OpenCL
Outline: OpenCL single-threaded flow; LZ77 overview; implementation details; optimizations & results (area optimizations, compression ratio, results).

86 Area Optimizations
By choosing the right (hardware) architecture, you are already most of the way there.
The last ~5% (of area optimizations) requires some tinkering and advanced knowledge.
Example:

87 Match Search & Filtering
The original code generates a long vine of logic: a chain of "compute length" steps in which each step is conditional on the previous one.
This causes longer latency in the pipeline, which increases area.

88 Balance the computation
Get rid of the dependency on “length”.
A balanced tree has a shallower pipeline depth, hence less area.

89 Modified Code
Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111).
The OR operator creates a balanced tree (no condition).
An OR operator is cheaper than an adder.
Result: 4% smaller area.
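One way to picture the bit-array idea in plain C: each comparison produces a prefix mask instead of a count, and the reduction over candidates becomes a bitwise OR. This is a sketch under assumptions (hypothetical names, VEC = LEN = 4), not the modified kernel code itself, which is not in the transcript.

```c
#define VEC 4
#define LEN 4

/* Bit k is set when bytes 0..k all match, so a match of length 2, 3 or 4
 * is represented as 0011, 0111 or 1111. */
unsigned match_mask(const unsigned char cur[LEN], const unsigned char cmp[LEN])
{
    unsigned mask = 0;
    unsigned still_matching = 1;
    for (int k = 0; k < LEN; k++) {
        still_matching &= (unsigned)(cur[k] == cmp[k]);
        mask |= still_matching << k;
    }
    return mask;
}

/* Prefix masks are nested, so OR-ing them yields the mask of the longest
 * match without any comparison or addition on lengths. */
unsigned best_mask(const unsigned char cur[LEN], const unsigned char cmp[VEC][LEN])
{
    unsigned best = 0;
    for (int d = 0; d < VEC; d++)
        best |= match_mask(cur, cmp[d]);
    return best;
}
```

Because OR is associative and needs no condition, a compiler is free to arrange the reduction as a balanced tree, which is the effect the slide describes.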

90 Evaluate compression ratio on widely-used compression benchmarks:
Calgary, Canterbury, Large and Silesia corpora
Text, images, binary, databases: a mix of everything
Geomean results over all benchmarks
Initial results: 78.3% or 1.28X
Want to improve results!
1. Bin-packing heuristic
2. Hash function

91-97 1. Bin-packing heuristic
We use the “last-fit” heuristic. Reason: we have a loop-carried variable, “first_valid_position”.

Original order of the steps:
1. Filter bestlength (length): remove matches that are longer when encoded than original
2. Filter bestlength (covered): remove matches covered by the previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
4. Compute first_valid_pos: compute “first valid position” for the next step
The dependency causes a stall in the kernel pipeline: an iteration cannot start until the previous one has computed first_valid_pos.
Cannot start a new iteration each cycle: II = 6 (Optimization Report in 14.0).

Because we always use the last match (which determines first_valid_position), last-fit bin-packing doesn't affect “first_valid_position”.

Reordered steps:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)
Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1.
Constraint: the bin-packing step cannot change first_valid_position.
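The reordering works because the loop-carried value can be derived from the last surviving match alone, before any bin-packing happens. A minimal plain C sketch of that shortened dependency follows; the function name is hypothetical and MIN_MATCH = 3 is an assumed threshold for the "too short" filter.

```c
#define VEC 4
#define MIN_MATCH 3  /* assumed threshold for the "too short" filter */

/* Computes the next first_valid_pos from the unfiltered best lengths.
 * Only the two cheap filters (length, covered) are applied; the last
 * match that survives them determines the result, so the bin-packing
 * step is no longer on the loop-carried path. */
int next_first_valid_pos(const int best_length[VEC], int first_valid_pos)
{
    int reach = 0;
    for (int i = 0; i < VEC; i++) {
        int valid = best_length[i] >= MIN_MATCH && i >= first_valid_pos;
        if (valid)
            reach = i + best_length[i];  /* later positions overwrite earlier ones */
    }
    return reach > VEC ? reach - VEC : 0;
}
```

For best lengths 3 1 3 4 with first valid position 0, the last surviving match starts at position 3 with length 4, so the function returns 3 without looking at the bin-packing result.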

98 1. Bin-packing heuristic
Constraint: the match selection heuristic cannot change “first_valid_position”.
But: last-fit is very inefficient.
e.g. best lengths 4, 3, 2 at consecutive positions all end at the same byte. Last-fit keeps only the last, length-2 match and discards the others.
Add a step to eliminate matches that have the same reach but smaller value: the length-4 match is kept instead. Much better!
This doesn't affect first_valid_position.
Result: 8% better ratio.
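The extra elimination step can be sketched in plain C as follows. The function name and the use of -1 for removed matches follow the conventions visible on the slides, but the code itself is an illustration, not the author's implementation.

```c
#define VEC 4

/* Among matches that end at the same byte (same reach), keep only the
 * earliest one, which is also the longest. Removed matches become -1. */
void drop_same_reach(int best_length[VEC])
{
    for (int i = 0; i < VEC; i++) {
        if (best_length[i] < 0)
            continue;
        for (int j = i + 1; j < VEC; j++) {
            if (best_length[j] >= 0 &&
                j + best_length[j] == i + best_length[i])
                best_length[j] = -1;  /* same reach, smaller value */
        }
    }
}
```

With best lengths -1 4 3 2 this leaves -1 4 -1 -1. The surviving match still ends at position 5, so the first valid position for the next cycle is unchanged while 4 bytes are covered instead of 2.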

99 2. Hash Function
Original: Hash[i] = curr_window[i]
E.g. Hash[text] = ‘t’

XOR2: Hash[i] = curr_window[i] xor curr_window[i+1]
E.g. Hash[text] = ‘t’ xor ‘e’
Aliasing: ‘t’ xor ‘e’ = ‘e’ xor ‘t’
Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)
3.1% better ratio

XOR3: Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
The hash contains information about the first 3 bytes plus a sense of their ordering.
More likely that our compare windows will have a match.
Hash (BRAM address) is 10 bits: utilizes BRAM depth = 1024
7.1% better ratio

Emulator in 13.1: compared to Verilog, it is much easier to try and verify new algorithms. It is exactly like trying out new C code.
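The three hash variants are small enough to write out in plain C. The function names are hypothetical, and the shift amount of 2 in the XOR3 variant is inferred from the stated 10-bit result, since the digit is missing in the transcript.

```c
/* Original: the first byte only (8-bit hash, 256 dictionary entries). */
unsigned hash_original(const unsigned char *w)
{
    return w[0];
}

/* XOR2: two bytes, still 8 bits, and symmetric in its inputs. */
unsigned hash_xor2(const unsigned char *w)
{
    return (unsigned)(w[0] ^ w[1]);
}

/* XOR3: three bytes with different shifts, giving a 10-bit hash that
 * depends on byte order (shift of 2 assumed, see lead-in). */
unsigned hash_xor3(const unsigned char *w)
{
    return ((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ (unsigned)w[2];
}
```

hash_xor2 gives the same value for "te" and "et", which is the aliasing noted on the slide, whereas hash_xor3 separates "tex" from "etx" and always stays below 1024.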

100 Evaluate compression ratio on widely-used compression benchmarks:
Work in progress
Calgary, Canterbury, Large and Silesia corpora
Text, images, binary, databases: a mix of everything
Geomean results over all benchmarks
Initial results: 78.3% or 1.28X
After optimizations: 60.2% or 1.67X
With (simple) Huffman encoding (currently on the host): 47.8% or 2.10X

101 Huffman portion of Gzip
16-way parallel variable-bit-width encoding/alignment

102 Huffman encoding
Huffman symbols are defined at runtime.
Variable number of bits (≤ 16).
Concatenate codes to form a contiguous output stream.
Separate offset computation from the actual assembly.
3 compute phases:
1. Compute code bit-offsets and start offset of next iteration
2. Assembly of the codes in the current iteration
3. Build fixed-length segments across multiple iterations
(Diagram: each code is shifted (<<) to its bit offset, derived from the code lengths len_i, and the assembled segment is stored.)

103 Compute offsets
Tight dependency on the offset carried across iterations.
Be careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations.
The decision whether to write to memory is based on accumulating a full segment.
(Diagram: pos[0], pos[1], ..., pos[n] are computed from basepos and the code lengths len_i.)

104 Bit-level shift
Each code shifts to an arbitrary bit-offset within the entire range.
2 shift stages
16-bit barrel shifters
OR reduction tree for final assembly
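The separation of offset computation from assembly can be illustrated with a small plain C sketch. It is not the kernel from the talk: the function name and N_CODES = 4 are assumptions (the slides mention 16-way parallelism), codes are packed starting from the least significant bit as an arbitrary choice, and the segment handling of the third phase is left out.

```c
#include <stdint.h>

#define N_CODES 4  /* codes per iteration in this sketch */

/* Packs N_CODES variable-length codes into *bits, starting at bit offset
 * basepos, and returns the start offset for the next iteration.
 * Precondition: each len[i] <= 16 and basepos + sum(len) <= 64. */
uint32_t pack_codes(const uint16_t code[N_CODES], const uint8_t len[N_CODES],
                    uint32_t basepos, uint64_t *bits)
{
    /* Phase 1: bit offsets, a running sum of the code lengths. */
    uint32_t pos[N_CODES];
    uint32_t offset = basepos;
    for (int i = 0; i < N_CODES; i++) {
        pos[i] = offset;
        offset += len[i];
    }

    /* Phase 2: shift every code to its offset and OR the results. */
    uint64_t assembled = 0;
    for (int i = 0; i < N_CODES; i++) {
        uint64_t masked = code[i] & ((1u << len[i]) - 1u);
        assembled |= masked << pos[i];
    }

    *bits |= assembled;
    return offset;  /* the value carried into the next iteration */
}
```

Only the running sum in phase 1 feeds the next iteration, so the shifts and the OR reduction in phase 2 stay off the loop-carried path.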
