MISTY1 Block Cipher Undergrad Team U8 – JK FlipFlop Clark Cianfarini and Garrett Smith.

What is MISTY1?
Cryptographic block cipher
Developed by Mitsubishi Electric
Created in 1995
Developed primarily for encryption on mobile phones and other mobile devices
Stands for: Mitsubishi Improved Security TechnologY

Technical Specs
Feistel Network
64-bit block size
128-bit key
Rounds in multiples of 4 (4, 8, 12, 16, …)
RFC 2994
Picture from: http://web.archive.org/web/20000823133547/http://www.mitsubishi.com/ghp_japan/misty/misty_e_b.pdf
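The Feistel structure named above can be sketched in a few lines. This is a generic illustration, not the MISTY1 round: `toy_f`, `feistel_encrypt`, and `feistel_decrypt` are hypothetical names, and the round function is a placeholder rather than MISTY1's FO. It only shows why a Feistel network is invertible whatever round function is used.

```c
#include <stdint.h>

/* Toy round function: NOT MISTY1's FO, just a placeholder for illustration. */
static uint32_t toy_f(uint32_t half, uint32_t subkey)
{
    uint32_t x = half ^ subkey;
    return (x << 7 | x >> 25) + subkey;   /* rotate left by 7, then add */
}

/* Encrypt a 64-bit block held as two 32-bit halves.
 * Each round maps (L, R) to (R, L ^ F(R, k)). */
static void feistel_encrypt(uint32_t *left, uint32_t *right,
                            const uint32_t *subkeys, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        uint32_t tmp = *right;
        *right = *left ^ toy_f(*right, subkeys[i]);
        *left = tmp;
    }
}

/* Decrypt by undoing the rounds in reverse order with the same F. */
static void feistel_decrypt(uint32_t *left, uint32_t *right,
                            const uint32_t *subkeys, int rounds)
{
    for (int i = rounds - 1; i >= 0; i--) {
        uint32_t tmp = *left;
        *left = *right ^ toy_f(*left, subkeys[i]);
        *right = tmp;
    }
}
```

Calling `feistel_encrypt` and then `feistel_decrypt` with the same subkeys and round count returns the original halves, even though `toy_f` itself is not invertible.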

Our Original Implementation
8 rounds (the standard)
128-bit key and 64-bit data as hexadecimal inputs (command line arguments)
Encrypt and decrypt functionality both implemented (as well as performing both consecutively for benchmarking)

Original (Unoptimized) Design
Designed for code size and clarity
Written in C
Only standard libraries used
Inefficiencies in: loops, multiplies and divides, function calls, parameter passing
Usage: ./misty e|d|b K M [I]
  'e' to encrypt, 'd' to decrypt, 'b' to test both
  K is a required 32-digit hex string (128 bits)
  M is a required 16-digit hex string (64 bits)
  I is an optional number of iterations for benchmarking
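As a rough sketch of how hexadecimal arguments like K and M could be turned into bytes: a 128-bit key is 32 hex digits (16 bytes) and a 64-bit block is 16 hex digits (8 bytes). The helpers `hex_digit` and `hex_to_bytes` are hypothetical names for this illustration and are not taken from the team's source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse exactly 2*len hex digits into out[0..len-1].
 * Returns 0 on success, -1 on a wrong length or a non-hex character. */
static int hex_to_bytes(const char *hex, uint8_t *out, size_t len)
{
    if (strlen(hex) != 2 * len)
        return -1;
    for (size_t i = 0; i < len; i++) {
        int hi = hex_digit(hex[2 * i]);
        int lo = hex_digit(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)(hi << 4 | lo);
    }
    return 0;
}
```

With this sketch, `hex_to_bytes(argv[2], key, 16)` would accept a 32-digit key and `hex_to_bytes(argv[3], data, 8)` a 16-digit block, rejecting anything shorter, longer, or containing non-hex characters.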

Original Design GPROF Profile
  %    cumulative    self                  self     total
 time    seconds    seconds      calls   us/call  us/call  name
45.57      10.65      10.65  560000000      0.02     0.02  fi
19.80      15.28       4.63  160000000      0.03     0.09  fo
 7.63      17.06       1.78  100000000      0.02     0.02  fl
 6.94      18.69       1.62  100000000      0.02     0.02  flinv
 5.25      19.91       1.23   10000000      0.12     0.27  key_schedule
 3.13      20.65       0.73   10000000      0.07     1.01  decrypt_block
 3.06      21.36       0.72   10000000      0.07     1.03  encrypt_block
 2.44      21.93       0.57   20000000      0.03     0.03  unpack_data
 1.54      22.29       0.36   50000000      0.01     0.04  decrypt_round_even
 1.33      22.60       0.31   40000000      0.01     0.13  encrypt_round_even
 1.03      22.85       0.24   40000000      0.01     0.18  decrypt_round_odd
 0.96      23.07       0.23                                __gmon_start__
 0.86      23.27       0.20   40000000      0.01     0.09  encrypt_round_odd
 0.34      23.35       0.08   10000000      0.01     0.04  encrypt_final
 0.21      23.40       0.05                                main
 0.00      23.40       0.00         48      0.00     0.00  xtoi
 0.00      23.40       0.00          4      0.00     0.00  print_hex_data
 0.00      23.40       0.00          2      0.00     0.00  parse_hex_arg
80% of the time spent in FO/FI/FL/FLINV
Compiled with gcc-4.3.4
Benchmarked on 64-bit Core2 @ 2.4 GHz, linux-2.6.33

Unoptimized Execution Time
gcc misty_slow.c -o slow
time ./slow b 00112233445566778899aabbccddeeff 0123456789abcdef 10000000
real 0m23.093s
user 0m22.886s
sys 0m0.031s
10 million iterations, 2.31 µs per iteration (~1.15 µs for each encryption or decryption)

Revised Software Design
Designed for optimal performance
Loops unrolled (rounds, d0/d1 pack)
Power-of-2 multiply, divide, and modulo replaced with shift and bitwise AND
Functions inlined
Reduced parameter passing (key)
Compiler optimization levels enabled
Compiler architecture-specific options enabled
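The power-of-two replacements in the list above can be illustrated with a small sketch. The function names are hypothetical and the constant 8 is chosen only for the example; the equivalences hold for unsigned operands, which is assumed here.

```c
#include <stdint.h>

/* Straightforward forms using multiply, divide, and modulo by 8. */
static uint32_t mul8_slow(uint32_t x) { return x * 8; }
static uint32_t div8_slow(uint32_t x) { return x / 8; }
static uint32_t mod8_slow(uint32_t x) { return x % 8; }

/* Equivalent forms for unsigned values: 8 == 1 << 3, and 8 - 1 == 7. */
static uint32_t mul8_fast(uint32_t x) { return x << 3; }
static uint32_t div8_fast(uint32_t x) { return x >> 3; }
static uint32_t mod8_fast(uint32_t x) { return x & 7; }
```

An optimizing compiler typically performs this strength reduction on its own for constant powers of two, which may be part of why the manual change gives a modest gain compared with enabling -O1.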

Rounds: Before Unrolling
    for (i = 0; i < NUM_ROUNDS; i++)
    {
        if (i == (NUM_ROUNDS - 1))
            encrypt_final(i, &d0, &d1, ek);
        else if ((i % 2) == 0)
            encrypt_round_even(i, &d0, &d1, ek);
        else
            encrypt_round_odd(i, &d0, &d1, ek);
    }

Rounds: After Unrolling
    // round 0
    d0 = fl(d0, 0);
    d1 = fl(d1, 1);
    d1 = d1 ^ fo(d0, 0);
    // round 1
    d0 = d0 ^ fo(d1, 1);
    // round 2
    d0 = fl(d0, 2);
    d1 = fl(d1, 3);
    d1 = d1 ^ fo(d0, 2);
    // round 3
    d0 = d0 ^ fo(d1, 3);
    // round 4
    d0 = fl(d0, 4);
    d1 = fl(d1, 5);
    d1 = d1 ^ fo(d0, 4);
    // round 5
    d0 = d0 ^ fo(d1, 5);
    // round 6
    d0 = fl(d0, 6);
    d1 = fl(d1, 7);
    d1 = d1 ^ fo(d0, 6);
    // round 7
    d0 = d0 ^ fo(d1, 7);
    // finalize
    d0 = fl(d0, 8);
    d1 = fl(d1, 9);

Execution Time and Speedup
Description           Time        Speedup
Slow / Initial        0m23.093s   1.00000
Unroll Rounds         0m21.573s   1.07046
Unroll D0/D1 Init     0m20.750s   1.11292
Shift and AND         0m18.978s   1.21683
Unroll Packing        0m18.135s   1.27339
Make EK Global        0m17.902s   1.28997
Inline FO/FI/FL       0m15.921s   1.45047
Enable O1             0m4.308s    5.36049
Enable O2             0m4.276s    5.40061
Enable O3             0m4.155s    5.55654
Architecture Flags    0m4.128s    5.59423
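The speedup figures are consistent with dividing the baseline wall-clock time by each step's time. A minimal sketch of that arithmetic, with hypothetical function names, assuming times are given in seconds:

```c
/* Hypothetical helper: speedup relative to the unoptimized baseline. */
static double speedup(double baseline_s, double optimized_s)
{
    return baseline_s / optimized_s;
}

/* Hypothetical helper: average time per iteration in nanoseconds. */
static double ns_per_iteration(double total_s, double iterations)
{
    return total_s * 1e9 / iterations;
}
```

For example, `speedup(23.093, 15.921)` gives the figure for the last manual step and `speedup(23.093, 4.128)` the figure for the final build, while `ns_per_iteration(4.128, 1e7)` gives the per-iteration time of the fastest run.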

Building and Testing the Optimized Implementation
gcc misty_fast.c -o fast
gcc misty_fast.c -o fast -O1
gcc misty_fast.c -o fast -O2
gcc misty_fast.c -o fast -O3
gcc misty_fast.c -o fast -O3 -march=core2
Fastest execution time:
real 0m4.128s
user 0m4.117s
sys 0m0.007s
10 million iterations, 413 ns per iteration

Final Design GPROF Profile
  %    cumulative    self                 self     total
 time    seconds    seconds     calls   ns/call  ns/call  name
42.99       2.26       2.26  10000000    226.15   226.15  decrypt_block
41.57       4.45       2.19  10000000    218.65   218.65  encrypt_block
15.41       5.26       0.81                               main
 0.00       5.26       0.00         4      0.00     0.00  print_hex_data
 0.00       5.26       0.00         2      0.00     0.00  parse_hex_arg
Most function calls inlined; only decrypt_block and encrypt_block remain

What was Learned?
The original implementation was not all that bad (~1.5x speedup from manual optimizations)
Larger benefit from instruction-level optimization by the compiler (gcc)
Profile first, then optimize in places where it actually matters
Bitwise AND has lower precedence than modulus (and lower than addition), so replacing % with & can change how an expression parses:
  x % y + z → (x % y) + z
  x & y + z → x & (y + z)
All optimizations add up to a significant amount of savings
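The precedence pitfall can be made concrete with a short sketch. The function names and the constant 8 are hypothetical; the wrong variant is written with explicit parentheses to show how the compiler groups the unparenthesized expression `x & 7 + z`.

```c
#include <stdint.h>

/* Original form: % binds tighter than +, so this is (x % 8) + z. */
static uint32_t index_mod(uint32_t x, uint32_t z)
{
    return x % 8 + z;
}

/* Naive replacement "x & 7 + z": + binds tighter than &,
 * so the compiler evaluates it as written here. */
static uint32_t index_and_wrong(uint32_t x, uint32_t z)
{
    return x & (7 + z);
}

/* Correct replacement: parenthesize the AND explicitly. */
static uint32_t index_and_right(uint32_t x, uint32_t z)
{
    return (x & 7) + z;
}
```

With x = 13 and z = 1, the modulus form and the parenthesized AND both give 6, while the naive replacement gives 13 & 8 = 8.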

Future Work
Use of SSE vector instructions for parallel operations
Data types such as uint8_t/uint16_t converted to natural integer size for better memory alignment and access performance
Use of a union to replace packing and unpacking of data from array to D0/D1
Written directly in optimized assembly
Dedicated hardware implementation (ASIC/FPGA) for MISTY1 (originally designed to be implemented in hardware)
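The union idea might look roughly like the sketch below. The names `pack_be32` and `union block64` are hypothetical, and the byte order of the explicit packing is an assumption for illustration. One caveat: the union view exposes the bytes in the host's byte order, so on a little-endian machine it does not produce the same values as most-significant-byte-first packing and the two are not drop-in replacements for each other.

```c
#include <stdint.h>

/* Explicit packing: four bytes into one 32-bit half,
 * most significant byte first (assumed order for this example). */
static uint32_t pack_be32(const uint8_t *b)
{
    return (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
           (uint32_t)b[2] << 8  | (uint32_t)b[3];
}

/* Union view: the same 8 bytes seen either as a byte array or as two
 * 32-bit halves. half[0] plays the role of D0 and half[1] of D1,
 * but in host byte order. */
union block64 {
    uint8_t  bytes[8];
    uint32_t half[2];
};
```

The union removes the shift-and-OR work entirely, at the cost of tying the in-memory representation to the platform's endianness.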

Questions?
