LZ77 Compression Using Altera OpenCL Mohamed Abdelfattah.


1 LZ77 Compression Using Altera OpenCL Mohamed Abdelfattah

2 LZ77 Compression in OpenCL
Goal: Demonstrate that a high-performance compression algorithm (2 GB/s) can be implemented efficiently using the OpenCL compiler
Outline: 1. OpenCL single-threaded flow 2. LZ77 overview 3. Implementation details 4. Optimizations & results

3 OpenCL Single-threaded Code
Basically C-code → OpenCL compiler extracts parallelism automatically
→ Pipeline parallelism
FPGA: One or more custom kernels
Kernels can communicate directly through “channels”

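4 OpenCL Single-threaded Code
kernel void simple(global const int *input, int size, global int *output) {
    // The slide's pseudocode range i = 1..size is written out explicitly.
    // input must be valid up to index size + 1 because iteration i reads input[i + 1].
    for (int i = 1; i <= size; i++) {
        int x = input[i];       // Load x
        int y = input[i + 1];   // Load y
        int z = x + y;
        output[i] = z;          // Store z
    }
}
FPGA pipeline: Load x, Load y → Store z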

5-10 OpenCL Single-threaded Code
[Animation of the same kernel: iterations 1, 2, 3, ... enter the Load x / Load y → Store z pipeline one after another, one per cycle.]
Can start new loop iteration every cycle! → Initiation interval II = 1
No loop-carried dependencies

11-13 OpenCL Single-threaded Code
kernel void complex(global const int *input, int size, global int *output) {
    int loop_carried = 1;   // initial value is not shown on the slide; assumed here
    for (int i = 1; i <= size; i++) {
        int x = input[i];       // Load x
        int y = input[i + 1];   // Load y
        int z;
        if (loop_carried / 2 == 1) z = x + y;
        else z = x - y;
        loop_carried *= z;      // the next iteration needs this value
        output[i] = z;          // Store z
    }
}
FPGA pipeline: Load x, Load y → Store z
Loop-carried computation: need data from iteration x for iteration x+1

14-25 OpenCL Single-threaded Code
[Animation: the Simple and Complex pipelines side by side, each with Load x / Load y → Store z stages, as iterations 1, 2, 3, ... flow through.]
Simple: a new iteration enters every cycle → II = 1
Complex: the loop-carried computation takes 2 cycles to compute, so the pipeline stalls and a bubble is inserted every other cycle → II = 2
A new iteration of the loop starts every “II” cycles
Optimize loop-carried computation: going from II = 2 to II = 1 doubles the throughput
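
The effect of the initiation interval can be checked with a small cycle-count model. This is a sketch in plain C rather than OpenCL, and it is not from the slides: the function names and the 3-stage pipeline depth are assumptions for illustration. It uses the usual idealized estimate that a loop of n iterations, starting one iteration every II cycles, finishes after depth + II * (n - 1) cycles.

```c
#include <stdio.h>

/* Cycles needed to run n loop iterations through a pipeline that is
 * `depth` stages deep and accepts a new iteration every `ii` cycles.
 * Idealized model: no memory stalls. */
unsigned long pipeline_cycles(unsigned long n, unsigned depth, unsigned ii)
{
    if (n == 0) return 0;
    return depth + (unsigned long)ii * (n - 1);
}

/* Print the II = 1 and II = 2 cycle counts for a given iteration count,
 * assuming a 3-stage pipeline. */
void print_comparison(unsigned long n)
{
    printf("II=1: %lu cycles, II=2: %lu cycles\n",
           pipeline_cycles(n, 3, 1), pipeline_cycles(n, 3, 2));
}
```

For long loops the II = 2 count approaches twice the II = 1 count, which is the "double the throughput" claim on the slide.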

26 LZ77 Compression in OpenCL
Outline: 1. OpenCL single-threaded flow 2. LZ77 overview 3. Implementation details 4. Optimizations & results

27-32 LZ77 Compression Example
This sentence is an easy sentence to compress.
1. Scan file byte by byte
2. Look for matches
3. Replace with a reference to previous occurrence

33-39 LZ77 Compression Example
Original: This sentence is an easy sentence to compress.
1. Scan file byte by byte
2. Look for matches (the match grows as bytes are scanned: length 2, 3, ... 8)
   Match length = 8
   Match offset = 20 bytes
3. Replace with a reference to previous occurrence: marker, length, offset
Compressed: This sentence is an easy @(8,20) to compress.
Saved 5 bytes!
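
To make the meaning of a (length, offset) reference concrete, here is a minimal decoder sketch in plain C. It is not the presentation's implementation: the `lz_token` struct and both function names are hypothetical, and tokens are kept as structs rather than packed bytes. It shows that copying 8 bytes from 20 bytes back reproduces the original sentence.

```c
#include <stddef.h>
#include <string.h>

/* One decoded unit: a literal byte or a back-reference (length, offset).
 * Illustrative layout only. */
typedef struct {
    int is_match;        /* 0 = literal, 1 = back-reference */
    unsigned char lit;   /* literal byte (if !is_match) */
    size_t length;       /* bytes to copy (if is_match) */
    size_t offset;       /* distance back from the current output position */
} lz_token;

/* Expand tokens into out; returns bytes written, or 0 on error. */
size_t lz_decode(const lz_token *tok, size_t ntok, char *out, size_t cap)
{
    size_t n = 0;
    for (size_t t = 0; t < ntok; t++) {
        if (!tok[t].is_match) {
            if (n >= cap) return 0;
            out[n++] = (char)tok[t].lit;
        } else {
            if (tok[t].offset == 0 || tok[t].offset > n) return 0;
            if (tok[t].length > cap - n) return 0;
            /* Byte-by-byte copy so overlapping references also work. */
            for (size_t k = 0; k < tok[t].length; k++, n++)
                out[n] = out[n - tok[t].offset];
        }
    }
    return n;
}

/* Helper: decode "prefix @(length,offset) suffix" into out. */
size_t lz_decode_example(const char *prefix, size_t length, size_t offset,
                         const char *suffix, char *out, size_t cap)
{
    lz_token tok[128];
    size_t nt = 0;
    for (const char *p = prefix; *p && nt < 127; p++)
        tok[nt++] = (lz_token){0, (unsigned char)*p, 0, 0};
    tok[nt++] = (lz_token){1, 0, length, offset};
    for (const char *p = suffix; *p && nt < 128; p++)
        tok[nt++] = (lz_token){0, (unsigned char)*p, 0, 0};
    return lz_decode(tok, nt, out, cap);
}
```

Calling `lz_decode_example("This sentence is an easy ", 8, 20, " to compress.", out, sizeof out)` rebuilds the 46-byte sentence; with a 3-byte reference the compressed form is 41 bytes, matching the 5 bytes saved on the slide.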

40 LZ77 Compression in OpenCL
Outline: 1. OpenCL single-threaded flow 2. LZ77 overview 3. Implementation details 4. Optimizations & results

41 Overview
Single-threaded OpenCL flow
Single kernel: fully pipelined → II = 1
Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2e9 bytes/s = 3051 MB/s (with 1 MB = 2^20 bytes)
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

42-43 Comparison against CPU/Verilog
Best implementation of Gzip on CPU
By Intel Corporation, 2013
On Intel Core i5 (32nm) processor
Compression speed: 338 MB/s
Compression ratio: 2.18X

44 Comparison against CPU/Verilog
Best implementation on ASICs
AHA products group
Coming up Q2 2014
Compression speed: 2.5 GB/s

45 Comparison against CPU/Verilog
Best implementation on FPGAs
Verilog, IBM Corporation
Nov. 2013, ICCAD
Altera Stratix-V A7
Compression speed: 3 GB/s

46 Comparison against CPU/Verilog
OpenCL design example
Altera Stratix-V A7
Developed in 1 month
Compression speed? Compression ratio?

47 Comparison against CPU/Verilog
Compression speed: OpenCL (this work) 2.7 GB/s, Verilog on FPGA (IBM) 3 GB/s, ASIC (AHA) 2.5 GB/s, CPU (Intel) 0.3 GB/s

48 Comparison against CPU
Same compression ratio
12X better performance/Watt

49 Comparison against Verilog
12% more resources
10% slower
Much lower design effort and design time

50 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

51-55 1. Shift In New Data
Input from DDR memory is shifted into the current window, VEC bytes per cycle
VEC = 4 in this example. We use text in our example, but the data can be anything
e.g. input sample_text, with | marking the cycle boundary:
  Current window before: o l d _ | t e x t
  Current window after:  t e x t | s a m p   (remaining input: le_text)
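
The shift step can be sketched in plain C as follows. This is an illustration, not the kernel from the talk: the function name is hypothetical, and the window size of 2 * VEC bytes is an assumption read off the slide figure (8 bytes shown with VEC = 4).

```c
#include <string.h>

#define VEC 4  /* bytes consumed per cycle; 4 matches the slide example */

/* Drop the oldest VEC bytes of a 2*VEC-byte window and append VEC new
 * bytes from the input stream. */
void shift_in(unsigned char window[2 * VEC], const unsigned char in[VEC])
{
    memmove(window, window + VEC, VEC);   /* newer half becomes the older half */
    memcpy(window + VEC, in, VEC);        /* fresh input fills the newer half */
}
```

Starting from a window holding `old_text` and shifting in `samp` leaves `textsamp`, as in the slide example.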

56 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

57-62 2. Dictionary Lookup/Update
Current window: t e x t s a m p
Substrings (one per starting position): text, exts, xtsa, tsam
1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries
Dictionaries buffer the text that we have already processed.
Possible matches from history (dictionaries), e.g.:
  text → tan_, text, texl, teen
  exts → eate, ears, eeps, ente
  xtsa → xant, xylo, xely, xirt
  tsam → teen, teal, tan_, tame

63 2. Dictionary Lookup/Update
Each dictionary i has one write port (Wi) and four read ports (RDi0 to RDi3)
Generate exactly the number of read/write ports that we need
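
The lookup/update step might look roughly like this in plain C. This is a sketch under stated assumptions, not the kernel from the talk: the `dictionaries` struct, the function names, the depth of 256 entries and the rule that substring i is written to dictionary i are all assumptions chosen to match the port counts on the slide (four reads and one write per dictionary per cycle). The hash is the simplest one mentioned later in the deck, the first byte.

```c
#include <string.h>

#define VEC   4      /* substrings per cycle, as in the slide example */
#define DEPTH 256    /* entries per dictionary (assumed) */

/* VEC dictionaries, each mapping a hash to one VEC-byte string. */
typedef struct {
    unsigned char entry[VEC][DEPTH][VEC];
} dictionaries;

/* Simplest hash: the first byte of the substring. */
static unsigned hash_first_byte(const unsigned char *s) { return s[0]; }

/* For each of the VEC substrings in the 2*VEC-byte window: read one
 * candidate from every dictionary (VEC reads per dictionary in total),
 * then store substring i into dictionary i (one write per dictionary). */
void lookup_update(dictionaries *d, const unsigned char window[2 * VEC],
                   unsigned char compare[VEC][VEC][VEC])
{
    unsigned h[VEC];
    for (int i = 0; i < VEC; i++)
        h[i] = hash_first_byte(window + i) % DEPTH;

    /* Lookup: compare[i][j] = candidate for substring i from dictionary j. */
    for (int i = 0; i < VEC; i++)
        for (int j = 0; j < VEC; j++)
            memcpy(compare[i][j], d->entry[j][h[i]], VEC);

    /* Update: all reads above happen before any write. */
    for (int i = 0; i < VEC; i++)
        memcpy(d->entry[i][h[i]], window + i, VEC);
}
```

The `compare` array holds the VEC × VEC candidate strings that the next stage compares against the current substrings.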

64 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

65 3. Match Search & Filtering
Current windows (the substrings): text, exts, xtsa, tsam
Comparison windows (a set of candidate matches for each incoming substring):
  text → tan_, text, texl, teen
  exts → eate, ears, eeps, ente
  xtsa → xant, xylo, xely, xirt
  tsam → teen, teal, tan_, tame
Compare current window against each of its 4 compare windows

66-67 3. Match Search & Filtering
Current window: text
Comparison windows: tan_, text, texl, teen
Comparators: compare each byte → match lengths 1, 4, 3, 2
Match reduction → best length: 4
We have another 3 of those (one per current window)
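
The comparator and reduction stages for one current window can be sketched in plain C like this. The function names are hypothetical and the code is only an illustration of the idea on the slide: count the leading bytes that agree, then keep the largest count.

```c
#define VEC 4  /* substring length and number of compare windows */

/* Number of leading bytes of `cand` that equal `cur` (0..VEC).
 * In hardware this corresponds to VEC byte comparators per compare window. */
int match_length(const unsigned char *cur, const unsigned char *cand)
{
    int len = 0;
    while (len < VEC && cur[len] == cand[len])
        len++;
    return len;
}

/* Reduce the VEC candidate lengths to the best one; *which receives the
 * index of the winning compare window. */
int best_match(const unsigned char *cur, const unsigned char cand[VEC][VEC],
               int *which)
{
    int best = 0;
    *which = 0;
    for (int j = 0; j < VEC; j++) {
        int len = match_length(cur, cand[j]);
        if (len > best) { best = len; *which = j; }
    }
    return best;
}
```

With the current window `text` and the candidates `tan_`, `text`, `texl`, `teen`, the lengths are 1, 4, 3, 2 and the best length is 4, as on the slide. Both loops have fixed bounds, so a compiler can unroll them.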

68-71 3. Match Search & Filtering
[Code figure not captured in the transcript.]
Typical C-code
Fixed loop bounds: compiler can unroll loop

72 3. Match Search & Filtering
One bestlength associated with each current_window
[Figure: the window t e x t s a m p with a best length shown for each substring.]

73-76 3. Match Search & Filtering
Select the best combination of matches from the set of candidate matches
1. Remove matches that are longer when encoded than original
2. Remove matches covered by previous step
3. From the remaining set, select the best ones (heuristic for bin-packing) → last-fit
4. Compute “first valid position” for next step
e.g. window t e x t | s a m p (| = cycle boundary), best lengths 3, 1, 3, 4 at positions 0 to 3:
  Position 1 (length 1): too short, removed
  Position 3 (length 4): kept (last-fit)
  Position 2 (length 3): overlaps the match at position 3, removed
  Position 0 (length 3): kept (last-fit)
  → First valid position next cycle = 3

77-78 3. Match Search & Filtering
1. Remove matches that are longer when encoded than original
   e.g. best lengths 3, 1, 3, 4 → the length-1 match is removed
2. Remove matches covered by previous step
   [Figure: matches that start before the first valid position are removed.]

79-80 3. Match Search & Filtering
3. From the remaining set, select the best ones → last-fit bin-packing
   e.g. best lengths 3, 0, 3, 4 → 3, 0, 0, 4

81 3. Match Search & Filtering
4. Compute “first valid position” for next step
   e.g. best lengths 3 (position 0) and 4 (position 3) in window t e x t | s a m p: the matches reach positions 3 and 7
   First_valid_pos = 3
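
The four filtering steps can be sketched in plain C as below. This is one interpretation of the slides, not the author's code: the function name, the minimum useful match length of 3, and the exact last-fit rule (walk backwards and keep a match only if it ends at or before the start of the match kept after it) are assumptions, chosen because they reproduce the example 3, 1, 3, 4 → 3, 0, 0, 4 with first valid position 3.

```c
#define VEC       4  /* positions per cycle, as in the slide example */
#define MIN_MATCH 3  /* assumed: shorter matches cost more than literals */

/* Filters bestlength[] in place (0 = no match at that position) and
 * returns the first valid position for the next cycle. */
int filter_matches(int bestlength[VEC], int first_valid_pos)
{
    /* 1. Drop matches that would be longer encoded than as literals. */
    for (int i = 0; i < VEC; i++)
        if (bestlength[i] < MIN_MATCH) bestlength[i] = 0;

    /* 2. Drop matches starting inside bytes already covered by a match
     *    from the previous cycle. */
    for (int i = 0; i < VEC && i < first_valid_pos; i++)
        bestlength[i] = 0;

    /* 3. Last-fit: walk backwards, keeping a match only if it ends at or
     *    before the start of the previously kept (later) match. */
    int limit = 2 * VEC;           /* nothing kept yet: no constraint */
    int next_first_valid = 0;
    for (int i = VEC - 1; i >= 0; i--) {
        if (bestlength[i] == 0) continue;
        if (i + bestlength[i] <= limit) {
            /* 4. The last kept match decides how far we reach into the
             *    next cycle's bytes. */
            if (limit == 2 * VEC && i + bestlength[i] > VEC)
                next_first_valid = i + bestlength[i] - VEC;
            limit = i;
        } else {
            bestlength[i] = 0;     /* overlaps a later match */
        }
    }
    return next_first_valid;
}
```

The return value is carried into the next call as `first_valid_pos`, which is the loop-carried dependency discussed later in the deck.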

82 Implementation Overview
1. Shift In New Data 2. Dictionary Lookup/Update 3. Match Search & Filtering 4. Write to output

83 4. Writing to Output
Marker, length, offset
→ Length is limited by VEC (= 16 in our case): fits in 4 bits
→ Offset is limited by 0x40000 (doesn’t make sense to be more): fits in 21 bits
Use either 3 or 4 bytes for this:
→ Offset < 2048: 3 bytes (MARKER, LENGTH, OFFSET)
→ Offset = 2048..262144: 4 bytes (MARKER, LENGTH, OFFSET, OFFSET)
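
A possible packing of marker, length and offset into 3 or 4 bytes is sketched below in plain C. The slides give only the field names and sizes, so everything concrete here is an assumption: the marker value, the function names, storing length - 1 in 4 bits, and the flag bit that selects the short (11-bit) or long (19-bit) offset form.

```c
#include <stdint.h>
#include <stddef.h>

#define MARKER 0x40u  /* '@'; the actual marker byte is not given */

/* Encode a match; returns bytes written (3 or 4), or 0 if out of range.
 * Assumed bit layout after the marker byte:
 *   4 bits (length - 1), 1 flag bit (1 = long offset),
 *   then 11 offset bits (short form) or 19 offset bits (long form). */
size_t encode_match(unsigned length, uint32_t offset, uint8_t out[4])
{
    if (length < 1 || length > 16 || offset < 1 || offset > 0x40000u)
        return 0;
    uint32_t len4 = length - 1;
    out[0] = MARKER;
    if (offset < 2048) {
        uint32_t bits = (len4 << 12) | offset;           /* flag bit = 0 */
        out[1] = (uint8_t)(bits >> 8);
        out[2] = (uint8_t)bits;
        return 3;
    }
    uint32_t bits = (len4 << 20) | (1u << 19) | offset;  /* flag bit = 1 */
    out[1] = (uint8_t)(bits >> 16);
    out[2] = (uint8_t)(bits >> 8);
    out[3] = (uint8_t)bits;
    return 4;
}

/* Inverse of encode_match; returns bytes consumed (3 or 4), 0 on error. */
size_t decode_match(const uint8_t *in, unsigned *length, uint32_t *offset)
{
    if (in[0] != MARKER) return 0;
    *length = (in[1] >> 4) + 1u;
    if ((in[1] & 0x08u) == 0) {                           /* short form */
        *offset = ((uint32_t)(in[1] & 0x07u) << 8) | in[2];
        return 3;
    }
    *offset = ((uint32_t)(in[1] & 0x07u) << 16) | ((uint32_t)in[2] << 8) | in[3];
    return 4;
}
```

Under this layout the match (8, 20) from the earlier example takes 3 bytes, and any offset of 2048 or more takes 4.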

84 Results

85 LZ77 Compression in OpenCL
Outline: 1. OpenCL single-threaded flow 2. LZ77 overview 3. Implementation details 4. Optimizations & results
→ Area optimizations
→ Compression ratio
→ Results

86 Area Optimizations
By choosing the right (hardware) architecture, you are already most of the way there
The last ~5% (of area optimizations) requires some tinkering and advanced knowledge
Example:

87-88 Match Search & Filtering
[Code figure not captured in the transcript: compute length, with a condition on the running length.]
Generates a long vine of logic
Causes longer latency in the pipeline → increases area
Balance the computation: balanced tree has shallower pipeline depth → less area
Get rid of the dependency on “length”

89 Modified Code
Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
OR operator is cheaper than adder
OR operator creates a balanced tree (no condition)
4% smaller area
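
The bit-array representation can be illustrated in plain C as follows. The modified code itself is not in the transcript, so this is only a sketch of the idea with hypothetical function names: a match of length L is stored as L low bits set, and the longest of several matches is then the bitwise OR of their masks, with no adder or comparison on a running length.

```c
#define VEC 4  /* bytes compared per window, as in the slide example */

/* Bit j of the result is set when bytes 0..j of the two windows all
 * match, so a match of length 2, 3, 4 gives 0011, 0111, 1111. */
unsigned match_mask(const unsigned char *cur, const unsigned char *cand)
{
    unsigned mask = 0, all_equal = 1;
    for (int j = 0; j < VEC; j++) {
        all_equal &= (unsigned)(cur[j] == cand[j]);
        mask |= all_equal << j;
    }
    return mask;
}

/* The longest of several matches is the OR of their masks. */
unsigned best_mask(const unsigned *masks, int n)
{
    unsigned best = 0;
    for (int i = 0; i < n; i++)
        best |= masks[i];
    return best;
}

/* Convert a mask back to a length by counting its set bits. */
int mask_to_length(unsigned mask)
{
    int len = 0;
    while (mask) { len += (int)(mask & 1u); mask >>= 1; }
    return len;
}
```

Because every mask is a run of low bits, OR-ing two masks always yields the longer run, which is why the reduction can be a plain OR tree.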

90 Compression Ratio
Evaluate compression ratio on widely-used compression benchmarks:
→ Calgary, Canterbury, Large and Silesia corpora
Text, images, binary, databases: mix of everything
Geomean results over all benchmarks
→ Initial results: 78.3% or 1.28X
Want to improve results!
1. Bin-packing heuristic
2. Hash function

91-94 1. Bin-packing heuristic
We use the “last-fit” heuristic → Reason: we have a loop-carried variable “first_valid_position”
Original order of the steps:
1. Filter bestlength (length): remove matches that are longer when encoded than original
2. Filter bestlength (covered): remove matches covered by previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones (heuristic for bin-packing)
4. Compute first_valid_pos: compute “first valid position” for next step
Dependency causes a stall in the kernel pipeline → cannot start a new iteration each cycle → II = 6 (Optimization Report in 14.0)

95-97 1. Bin-packing heuristic
Last-fit bin-packing doesn’t affect “first_valid_position”, because we always use the last match (which determines first_valid_position). Steps 3 and 4 can therefore be swapped:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack). Constraint: cannot change the first_valid_position in this step
Tighter computation for loop-carried variable → start new iteration each cycle → II = 1

98 1. Bin-packing heuristic
Constraint: match selection heuristic cannot change “first_valid_position”
But: last-fit is very inefficient
e.g. window t e x t | s a m p with best lengths 4, 3, 2, 0: all three matches end at the same position, so last-fit keeps only the last one (length 2). Keeping the length-4 match instead is much better and doesn’t affect first_valid_position
Add a step to eliminate matches that have the same reach but smaller value
8% better ratio
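
One way to read the added step is sketched below in plain C. The function name and the exact rule are assumptions, not the author's code: "reach" is taken to mean position + length, and among matches with the same reach only the longest (the one that starts earliest) survives.

```c
#define VEC 4  /* positions per cycle, as in the slide example */

/* Among matches that end at the same position (pos + length), keep only
 * the longest, i.e. the one that starts earliest. */
void drop_same_reach(int bestlength[VEC])
{
    for (int i = 0; i < VEC; i++) {
        if (bestlength[i] == 0) continue;
        for (int j = i + 1; j < VEC; j++)
            if (bestlength[j] != 0 && j + bestlength[j] == i + bestlength[i])
                bestlength[j] = 0;  /* same reach, shorter match */
    }
}
```

Applied to best lengths 4, 3, 2, 0 this leaves 4, 0, 0, 0. The furthest reach is unchanged, so the first valid position for the next cycle is unchanged as well.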

99 2. Hash Function
Original:
→ Hash[i] = curr_window[i]
→ E.g. Hash[text] = ‘t’
XOR2 (3.1% better ratio):
→ Hash[i] = curr_window[i] xor curr_window[i+1]
→ E.g. Hash[text] = ‘t’ xor ‘e’
→ Aliasing: ‘t’ xor ‘e’ = ‘e’ xor ‘t’
→ Not utilizing depth efficiently (256 words but BRAMs go up to 1024)
XOR3 (7.1% better ratio):
→ Hash[i] = curr_window[i] << 2 xor curr_window[i+1] << 1 xor curr_window[i+2]
→ Match contains information about first 3 bytes + sense of their ordering
→ More likely that our compare windows will have a match
→ Hash (BRAM address) is 10 bits → utilizes BRAM depth = 1024
Compared to Verilog, it is much easier to try & verify new algorithms
It is exactly like trying out new C-code (Emulator in 13.1)
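
The three hash variants translate directly into plain C. The function names below are made up for illustration; the expressions follow the formulas on the slide, where each function receives a pointer to curr_window[i].

```c
/* Original: first byte only (8-bit hash, 256 dictionary entries). */
unsigned hash_orig(const unsigned char *w)
{
    return w[0];
}

/* XOR2: first two bytes; symmetric, so "te" and "et" collide. */
unsigned hash_xor2(const unsigned char *w)
{
    return (unsigned)w[0] ^ (unsigned)w[1];
}

/* XOR3: three bytes with shifts, so byte order matters; the result is
 * always below 1024 (10 bits). */
unsigned hash_xor3(const unsigned char *w)
{
    return ((unsigned)w[0] << 2) ^ ((unsigned)w[1] << 1) ^ (unsigned)w[2];
}
```

Swapping the first two bytes leaves `hash_xor2` unchanged but changes `hash_xor3`, which is the aliasing point made on the slide.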

100 Compression Ratio
Evaluate compression ratio on widely-used compression benchmarks:
→ Calgary, Canterbury, Large and Silesia corpora
Text, images, binary, databases: mix of everything
Geomean results over all benchmarks
→ Initial results: 78.3% or 1.28X
→ After optimizations: 60.2% or 1.67X
→ With (simple) Huffman encoding (currently on the host, work in progress): 47.8% or 2.10X

101 Huffman portion of Gzip 16-way parallel variable-bit-width encoding/alignment

102 Huffman encoding
Huffman symbols are defined at runtime
Variable number of bits (≤16)
Concatenate codes to form a contiguous output stream
→ Separate offset computation from the actual assembly
3 compute phases
→ Compute code bit-offsets and start offset of next iteration
→ Assembly of the codes in the current iteration
→ Build fixed-length segments across multiple iterations
[Figure: codes pass through shifters (<<) and are then stored.]

103 Compute offsets
Tight dependency on offset carried across iterations
→ Careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
→ Decision whether to write to memory is based on accumulating a full segment
[Figure: basepos feeds pos[0], pos[1], ..., pos[n].]

104 Bit-level shift
Each code shifts to an arbitrary bit-offset within the entire range
2 shift stages
→ 16 bit barrel shifters
→ OR reduction tree for final assembly
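
The first two compute phases (offsets, then assembly by shift and OR) can be sketched in plain C. This is an illustration under assumptions, not the kernel from the talk: the function names, the use of 4 codes per iteration instead of 16, the single 64-bit output word and the LSB-first bit order are all choices made to keep the example short. The third phase, building fixed-length segments across iterations, is left out.

```c
#include <stdint.h>

#define NCODES 4  /* codes per iteration here; the slides use 16 */

/* Phase 1: the bit offset of each code is a running sum of the preceding
 * code lengths, starting from basepos. Returns the start offset of the
 * next iteration (pos[n] in the slide). */
unsigned compute_offsets(const unsigned len[NCODES], unsigned basepos,
                         unsigned pos[NCODES])
{
    unsigned p = basepos;
    for (int i = 0; i < NCODES; i++) {
        pos[i] = p;
        p += len[i];
    }
    return p;
}

/* Phase 2: shift each code to its offset and OR the results together.
 * Bits are packed LSB-first into a 64-bit word (an arbitrary choice);
 * the caller must keep pos[i] + len[i] <= 64. */
uint64_t assemble(const uint16_t code[NCODES], const unsigned len[NCODES],
                  const unsigned pos[NCODES])
{
    uint64_t word = 0;
    for (int i = 0; i < NCODES; i++) {
        uint64_t mask = (len[i] >= 16) ? 0xFFFFu : ((1u << len[i]) - 1u);
        word |= ((uint64_t)code[i] & mask) << pos[i];
    }
    return word;
}
```

Only the running sum in `compute_offsets` is carried from one iteration to the next; the shifts and the OR in `assemble` depend on it but feed nothing back, which is the separation the slides describe.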

105 Thank You

