Software performance enhancement using multithreading and architectural considerations Prepared by: Andrey Sloutsman Evgeny Gokhfeld 06/2006.


1 Software performance enhancement using multithreading and architectural considerations
Prepared by: Andrey Sloutsman, Evgeny Gokhfeld
06/2006

2 Chosen application
Archer Smart Password Recovery Tool
Download page: http://sourceforge.net/projects/sfprj06

3 Archer
The Archer application is intended to recover a lost password that was used to garble an ARJ archive. The ARJ archiver compresses files with a Huffman-table search-and-substitute algorithm, which is expected to shrink the file being archived.

4 Algorithm Description
The program operates as follows:
- The input file is read.
- The smallest garbled file inside the archive is selected.
- Passwords are tried iteratively until the CRC32 computed with the tried password matches the CRC32 of the stored file.

5 Algorithm Description (cont.)
The ARJ archive program uses the following technique to garble the produced archive:
- Compress the file(s) as usual.
- XOR the resulting contents with the password, which is chained (repeated) as necessary to match the length of the compressed data.
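The XOR step above can be sketched in a few lines of C. This is a minimal illustration of the chaining idea only, not Archer's or ARJ's actual code: the function name `xor_chained` and its parameters are hypothetical, and any detail of real ARJ garbling beyond the description above is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from Archer): XOR `len` bytes of `data` in place
 * with `password`, repeating the password as often as needed.
 * XOR is its own inverse, so the same call garbles and ungarbles. */
void xor_chained(unsigned char *data, size_t len,
                 const unsigned char *password, size_t password_length)
{
    assert(password_length > 0);
    for (size_t i = 0; i < len; i++) {
        data[i] ^= password[i % password_length];
    }
}
```

Because the operation is symmetric, applying `xor_chained` twice with the same password restores the original bytes, which is why a trial password can be checked simply by ungarbling and verifying the CRC32.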

6 Methods 1, 2, 3 vs. 4
Methods 1, 2, 3 differ only by the dynamic dictionary:
- Different maximal depth influences compression
- The same decompression procedure
- Possibility to employ sanity-check heuristics
- Passwords can be skipped at high speed
4th method:
- Fast compression with a fixed dictionary size
- No shortcuts or sanity checks
- Each trial leads to a CRC32 calculation over the data
- Slow password rate for large files

7 Optimization Steps: 32-bit variables
- The original code used 16-bit variables, which slows down the Pentium
- The majority of variables were converted to 32 bits
- Some variables and buffers remained 16-bit: those that inherently must be so for algorithmic reasons (overflow, shifts, etc.)

8 Optimization Steps: Power buffer unwinding
- The original code dynamically created buffers for constant data (certain combinations of powers of 2)
- Those were hard-coded in the program
- Several procedure parameters were eliminated
- One procedure was rewritten and split in two

9 Optimization Steps: Threading
- The original code was single-threaded
- Virtually no dependence between password trials
- As many workers as possible can be launched
- The only interaction point is password incrementing
- Every worker has its own local storage
- The shared data is global

10 Threading (cont.)
Our original threading scheme:
- Main thread: initializes the worker and Show Progress (SP) threads, then waits for the workers to finish
- Worker №1, Worker №2: each goes through "Increment password"
- Show Progress thread: shows the global data every 1 sec., then goes back to sleep
- The only critical section (CS) is "Increment password"
- Some false data races were reported by the Thread Checker
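A scheme in which "Increment password" is the only critical section can be sketched with POSIX threads as follows. The deck does not say which threading API was used, so pthreads is an assumption here, as are all names (`increment_password`, `worker`, `run_workers`) and the toy search space of 1000 candidates.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 2          /* assumed worker count */
#define TOTAL_CANDIDATES 1000  /* assumed size of the toy search space */

static pthread_mutex_t password_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_candidate = 0;   /* shared: index of next password */

/* The only critical section: hand out the next candidate index.
 * Returns 0 when the search space is exhausted. */
static int increment_password(unsigned long *out)
{
    int ok = 0;
    pthread_mutex_lock(&password_lock);
    if (next_candidate < TOTAL_CANDIDATES) {
        *out = next_candidate++;
        ok = 1;
    }
    pthread_mutex_unlock(&password_lock);
    return ok;
}

/* Worker: trial state is local; only increment_password touches shared data. */
static void *worker(void *arg)
{
    unsigned long *tried = arg;   /* per-worker trial counter */
    unsigned long candidate;
    while (increment_password(&candidate)) {
        /* a real worker would ungarble and check the CRC32 here */
        (*tried)++;
    }
    return NULL;
}

/* Launch the workers, wait for them, and return the total number of trials. */
unsigned long run_workers(void)
{
    pthread_t threads[NUM_WORKERS];
    unsigned long tried[NUM_WORKERS] = {0};
    unsigned long total = 0;

    next_candidate = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, &tried[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        total += tried[i];
    }
    return total;
}
```

Every candidate is handed out exactly once, so the per-worker counts always sum to the size of the search space, but each trial pays for one lock acquisition.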

11 Threading (cont.)
Threading scheme, revisited:
- Main thread: initializes the worker and Show Progress (SP) threads, then waits for the workers to finish
- Worker №1, Worker №2: each increments its password by the workers count and continues independently
- Show Progress thread: shows the global data every 1 sec., then goes back to sleep
- Best suited for methods 1, 2, 3
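The revisited scheme amounts to a strided walk over the password space: worker i starts at candidate i and then steps by the number of workers, so no lock is needed. The sketch below shows only that partitioning logic, without creating threads. Mapping a candidate index to a fixed-length password over an alphabet is an assumption made for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write the index-th password of fixed length `len`
 * over `alphabet` into out[0..len-1], least significant symbol first.
 * `out` must have room for len + 1 bytes (terminating NUL included). */
void index_to_password(unsigned long index, const char *alphabet,
                       size_t alphabet_size, char *out, size_t len)
{
    assert(alphabet_size > 0);
    for (size_t i = 0; i < len; i++) {
        out[i] = alphabet[index % alphabet_size];
        index /= alphabet_size;
    }
    out[len] = '\0';
}

/* Worker `id` of `workers` visits id, id + workers, id + 2 * workers, ...
 * The index sets of different workers are disjoint, so nothing is shared.
 * seen[i] is incremented for each visited index (for illustration only). */
void worker_scan(unsigned id, unsigned workers, unsigned long total,
                 unsigned char *seen)
{
    assert(workers > 0 && id < workers);
    for (unsigned long index = id; index < total; index += workers) {
        /* a real worker would build and try the password here */
        seen[index]++;
    }
}
```

Running `worker_scan` once per worker id covers every index exactly once, which is the property that lets each worker run in its own thread without synchronizing on the password counter.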

12 Optimization Steps: Optimizing CRC32
- In practice, this affects only the 4th method
- Rewritten using 4 pre-generated polynomial value tables
- The calculation is done in buckets of 4 bytes
- Instead of iteratively updating the CRC32 with each byte, the bucket values are combined
- The performance of the CRC32 algorithm improves by a factor of approximately 2

13 Optimizing CRC32 (cont.)
Original pseudo-code:

    void calccrc(BYTE *buf, int count)
    {
        /* crc32 and crctbl are globals; one table lookup per input byte */
        while (count--) {
            crc32 = (crc32 >> 8) ^ crctbl[(BYTE)crc32 ^ *buf];
            buf++;
        }
    }

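14 Optimizing CRC32 (cont.)
Optimized pseudo-code:

    /* One step consumes 4 input bytes with 4 independent table lookups */
    #define DO4 c ^= *buf4++; \
        c = crc_table[3][c & 0xff] ^ crc_table[2][(c >> 8) & 0xff] ^ \
            crc_table[1][(c >> 16) & 0xff] ^ crc_table[0][c >> 24]
    #define DO32 DO4; DO4; DO4; DO4; DO4; DO4; DO4; DO4

    void calccrc(BYTE *buf, int count)
    {
        /* assumes a 32-bit unsigned long and little-endian byte order */
        const unsigned long *buf4 = (const unsigned long *)buf;
        unsigned long c = crc32;       /* crc32 is the global running CRC */
        while (count >= 32) {
            DO32;
            count -= 32;
        }
        // Process the remainder
        crc32 = c;
    }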
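For reference, a self-contained version of the four-table technique is sketched below. It is not Archer's code: the table generation, the function names, and the choice of the common reflected CRC-32 polynomial 0xEDB88320 are assumptions. The 4-byte step uses the same table indexing as the DO4 macro, and the remainder is handled byte by byte as in the original loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[4][256];

/* Build the four lookup tables for the reflected polynomial 0xEDB88320. */
void crc32_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[0][n] = c;
    }
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = crc_table[0][n];
        for (int t = 1; t < 4; t++) {
            c = crc_table[0][c & 0xff] ^ (c >> 8);
            crc_table[t][n] = c;
        }
    }
}

/* Update a running CRC (already pre-inverted by the caller) with len bytes. */
uint32_t crc32_update(uint32_t c, const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len >= 4) {
        /* assemble 4 bytes little-endian, independent of host byte order */
        c ^= (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
             (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
        c = crc_table[3][c & 0xff] ^ crc_table[2][(c >> 8) & 0xff] ^
            crc_table[1][(c >> 16) & 0xff] ^ crc_table[0][c >> 24];
        buf += 4;
        len -= 4;
    }
    while (len--)   /* remainder: classic byte-at-a-time step */
        c = (c >> 8) ^ crc_table[0][(c ^ *buf++) & 0xff];
    return c;
}

/* Convenience wrapper: CRC-32 of a whole buffer. */
uint32_t crc32_buffer(const unsigned char *buf, size_t len)
{
    return crc32_update(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu;
}
```

After calling `crc32_init()` once, `crc32_buffer` on the nine ASCII bytes "123456789" yields 0xCBF43926, the standard check value for this polynomial. Assembling the word from individual bytes trades a little speed for independence from host byte order and alignment.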

15 Optimization Steps: Using SIMD instructions for decrypting
[Diagram: 16 bytes are LOADed from the source file; 16 bytes of data, 16 bytes of chained password and 16 bytes of constant are held in registers XMM1, XMM2 and XMM3 and combined with byte-wise "+" and XOR; the resulting 16 bytes are STOREd to the decrypted file.]
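One way to read the diagram is sketched below with SSE2 intrinsics. It assumes that the "+" marks denote a byte-wise addition of the constant to the chained password before the XOR, which the transcript does not state explicitly. The function name `decrypt_block16` is hypothetical, and a scalar fallback is included for targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Decrypt one 16-byte block: dst = src XOR (password + constant), byte-wise.
 * The byte-wise addition is an assumption about the "+" in the diagram. */
void decrypt_block16(unsigned char *dst, const unsigned char *src,
                     const unsigned char *chained_password,
                     const unsigned char *constant)
{
    assert(dst != NULL && src != NULL);
    assert(chained_password != NULL && constant != NULL);
#if defined(__SSE2__)
    __m128i data = _mm_loadu_si128((const __m128i *)src);            /* LOAD */
    __m128i pw   = _mm_loadu_si128((const __m128i *)chained_password);
    __m128i k    = _mm_loadu_si128((const __m128i *)constant);
    __m128i key  = _mm_add_epi8(pw, k);      /* 16 wrapping byte additions */
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(data, key));      /* STORE */
#else
    /* portable fallback with the same result */
    for (size_t i = 0; i < 16; i++)
        dst[i] = src[i] ^ (unsigned char)(chained_password[i] + constant[i]);
#endif
}
```

The unaligned load and store variants are used so the sketch works on any buffer; code that guarantees 16-byte alignment could use the aligned forms instead.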

16 Optimization Steps: Using SIMD instructions for password maintenance
XMM1 holds the current 16 bytes of chained password (bytes listed lowest first), e.g. for the password abc:
XMM1 = abcabcabcabcabca
1) Shift XMM1 right by 16 % password_length bytes: XMM1 = bcabcabcabcabca0
2) Copy the shifted password string to another register: XMM2 = bcabcabcabcabca0
3) Shift the copy left by 16 - (16 % password_length) bytes: XMM2 = 000000000000000b
4) XMM1 = XMM1 OR XMM2: XMM1 = bcabcabcabcabcab
Now XMM1 contains the chained password for the next 16 bytes of data
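The four register steps can be modelled in scalar C on a 16-byte array, which makes it easy to check that the result really is the next 16 bytes of the repeated password. This is a model of the data movement only, not SIMD code. The function name is hypothetical, the shifts are taken to be whole-byte shifts, and the password is assumed to be between 1 and 16 bytes long.

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the four register steps; reg[0] is the lowest byte.
 * Updates reg (the current 16 bytes of chained password) to the next 16. */
void next_chained_block(unsigned char reg[16], unsigned password_length)
{
    assert(password_length >= 1 && password_length <= 16);
    unsigned k = 16 % password_length;      /* number of bytes to shift by */
    unsigned char xmm1[16] = {0}, xmm2[16] = {0};

    /* 1) shift right by k bytes: byte i takes the value of byte i + k */
    memcpy(xmm1, reg + k, 16 - k);
    /* 2) and 3) copy the shifted value, then shift the copy left by
     * 16 - k bytes: its k lowest bytes land in the k highest positions */
    memcpy(xmm2 + (16 - k), xmm1, k);
    /* 4) OR the two registers together */
    for (unsigned i = 0; i < 16; i++)
        reg[i] = xmm1[i] | xmm2[i];
}
```

With SSE2, whole-register byte shifts are available as `_mm_srli_si128` and `_mm_slli_si128`, but their shift count must be a compile-time constant, so a real implementation would need one code path per possible shift amount or a different instruction.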

17 Optimization Steps: Limited buffers
- Allocating memory for the whole file causes cache misses
- Small buffers incur an overhead penalty
- Large buffers incur a cache-miss penalty
- The golden value is 128K
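Processing a file through one reusable fixed-size buffer, rather than loading it whole, might look like the sketch below. Only the 128K figure comes from the slide. The callback interface and the names `process_in_chunks` and `count_bytes` are hypothetical, and the best buffer size on other hardware would have to be measured.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (128 * 1024)   /* the 128K value quoted above */

/* Hypothetical per-chunk callback, e.g. ungarble-and-CRC in a real tool. */
typedef void (*chunk_fn)(const unsigned char *chunk, size_t len, void *ctx);

/* Read `in` through one reusable CHUNK_SIZE buffer instead of loading the
 * whole file. Returns the number of chunks processed, or -1 on failure. */
long process_in_chunks(FILE *in, chunk_fn fn, void *ctx)
{
    assert(in != NULL && fn != NULL);
    unsigned char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1;

    long chunks = 0;
    size_t got;
    while ((got = fread(buf, 1, CHUNK_SIZE, in)) > 0) {
        fn(buf, got, ctx);
        chunks++;
    }
    int failed = ferror(in);
    free(buf);
    return failed ? -1 : chunks;
}

/* Example callback: accumulate the total byte count into *ctx. */
void count_bytes(const unsigned char *chunk, size_t len, void *ctx)
{
    (void)chunk;
    *(size_t *)ctx += len;
}
```

Any state that spans chunk boundaries, such as the running CRC32 or the current position in the chained password, would be carried in `ctx` between callback invocations.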

18 Optimization Steps: Compilation by the Intel Compiler
- Methods 1, 2, 3: penalty of 9.26%
- Method 4: boost of 18.44%

19 Results (Times)

20 Results (Boost)

