Data Reuse in Embedded Processors Peter Trenkle CPE631 Project Presentation.

Problem Statement
Multi-media embedded applications have many recurring, time-consuming, long-latency instructions
–Floating point operations
–Time-consuming instructions (multiplies and divides), which can cause 15-30 cycle delays in embedded processors

Problem Statement
–Due to the demand for more portable computing power, power consumption is a major design constraint in embedded systems; decreasing the clock speed is important
–Long-latency instructions have the potential to cause data hazards, thus decreasing performance

Goals
Develop a methodology to increase the performance of embedded applications
–Decrease the need to go through a complete multiply or divide instruction, which creates opportunities for program speedup
–Decrease the embedded system’s clock frequency, reducing power consumption
–Decrease the number of data hazards due to long latencies

Applications of Solution
Image processing
–Low local entropy of processed data sets
Speech encoding
–Human speech characteristics
High-speed signal processing
–Values may change very little over a short run, so reusing results avoids repeating the same instructions

Solution = Data Reuse
Establish a memo table of a set length on an ARM processor that holds the operands and results of past multiply and/or divide instructions
Send the operands to both the memo table and the multiply/division unit; on a hit in the memo table, the multi-cycle instruction completes in one clock cycle
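
The benefit of a one-cycle hit can be estimated with a simple average-latency formula. This is a sketch, not a result from the presentation: h (the hit rate) and L (the full multiply or divide latency in cycles) are symbols introduced here, and the example values are assumed for illustration only (L = 15 is the low end of the 15-30 cycle range quoted earlier; h = 0.4 is hypothetical). A miss costs the full L cycles because the lookup runs in parallel with the calculation.

```latex
\bar{T} = h \cdot 1 + (1 - h) \cdot L

% Worked example with assumed values h = 0.4 and L = 15:
\bar{T} = 0.4 \cdot 1 + 0.6 \cdot 15 = 9.4 \text{ cycles}
```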

Diagram of Memo Table
[Figure: Operand 1 and Operand 2 feed both the Multiply/Division Unit and the Memo Table; labeled signals are “Operation Complete”, “Hit/Miss”, and Result]

Definition of the Memo Table
The memo table is set up as a lookup table that holds the most recently used entries
Each entry consists of a long tag, made up of the two operands, and the result
Lookup and calculation are done in parallel to avoid adding latency
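
A minimal software model of such a table might look like the following C sketch. It is not the presentation's implementation: the names (MemoEntry, MemoTable, memo_lookup, memo_insert, memo_mul), the table length of 8, the integer operand types, and the round-robin fill are all assumptions made for illustration. In hardware the lookup and the multiply would proceed in parallel; the software model simply performs them one after the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEMO_ENTRIES 8  /* hypothetical table length */

/* One entry: the tag is the operand pair, the payload is the result. */
typedef struct {
    bool    valid;
    int32_t op1;
    int32_t op2;
    int64_t result;
} MemoEntry;

typedef struct {
    MemoEntry entries[MEMO_ENTRIES];
    size_t    next;  /* next slot to overwrite (round-robin fill) */
} MemoTable;

/* Returns true on a hit and writes the stored result to *result. */
static bool memo_lookup(const MemoTable *t, int32_t op1, int32_t op2,
                        int64_t *result)
{
    for (size_t i = 0; i < MEMO_ENTRIES; i++) {
        const MemoEntry *e = &t->entries[i];
        if (e->valid && e->op1 == op1 && e->op2 == op2) {
            *result = e->result;
            return true;
        }
    }
    return false;
}

/* Stores a new operand pair and its result, overwriting the oldest slot. */
static void memo_insert(MemoTable *t, int32_t op1, int32_t op2,
                        int64_t result)
{
    assert(t != NULL);
    t->entries[t->next] = (MemoEntry){ true, op1, op2, result };
    t->next = (t->next + 1) % MEMO_ENTRIES;
}

/* Multiplies through the table and reports whether the lookup hit. */
static int64_t memo_mul(MemoTable *t, int32_t op1, int32_t op2, bool *hit)
{
    int64_t result;
    *hit = memo_lookup(t, op1, op2, &result);
    if (!*hit) {
        result = (int64_t)op1 * (int64_t)op2;  /* full-latency path */
        memo_insert(t, op1, op2, result);
    }
    return result;
}
```

Starting from a zero-initialized MemoTable, the first call to memo_mul with a given operand pair misses and fills an entry; repeating the same call reports a hit and returns the stored product.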

Constraints of the Memo Table
Trivial calculations, such as multiplying by 0, are not logged in the table
Trivial calculations can be handled by the execution unit
If an operand differs from the one stored in the table only by sign (it is the negative of the stored operand), the lookup still results in a hit
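
The two constraints above can be sketched in C as a filter for trivial multiplies plus a sign-independent tag. This is an illustrative sketch under stated assumptions: the names (MulKey, is_trivial_mul, magnitude, make_key, apply_sign) are hypothetical, operands are taken to be 32-bit integers, and treating a multiply by 1 as trivial is an assumption, since the slides name only 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-independent tag: operand magnitudes plus the sign of the result. */
typedef struct {
    uint32_t mag1;
    uint32_t mag2;
    bool     negate;  /* true when exactly one operand is negative */
} MulKey;

/* Trivial multiplies are left to the execution unit and never logged.
 * Treating 1 as trivial is an assumption; the slides only name 0. */
static bool is_trivial_mul(int32_t a, int32_t b)
{
    return a == 0 || b == 0 || a == 1 || b == 1;
}

/* Absolute value as an unsigned number, so INT32_MIN does not overflow. */
static uint32_t magnitude(int32_t x)
{
    return x < 0 ? (uint32_t)0 - (uint32_t)x : (uint32_t)x;
}

/* Builds the tag used for lookup; 3*5 and -3*5 share the same magnitudes. */
static MulKey make_key(int32_t a, int32_t b)
{
    MulKey k = { magnitude(a), magnitude(b), (a < 0) != (b < 0) };
    return k;
}

/* Rebuilds the signed product from a stored magnitude product. */
static int64_t apply_sign(uint64_t mag_product, bool negate)
{
    assert(mag_product <= (uint64_t)INT64_MAX);
    return negate ? -(int64_t)mag_product : (int64_t)mag_product;
}
```

With this tag, a table that stored the magnitude product for 3 and 5 can serve a later multiply of -3 and 5 by matching on the magnitudes and negating the stored value.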

Current Implementations
One paper deals with this concept: “Accelerating Multi-Media Processing by Implementing Memoing in Multiplication and Division Units” (Citron, Feitelson, and Rudolph)
This paper dealt with the Pentium Pro, Alpha 21164, UltraSPARC-II, and MIPS R10000, leading microprocessors at the time

Experiment Configuration
A modified sim-safe application saves all instructions to a file (safet)
A C program was created to read in the data and simulate the hit rates of selected instructions as if they were loaded into the memo table (insomnia)
Floating-point-intensive MiBench benchmarks were used (rsynth, lame)

Configuration: Safet
Modified version of sim-safe (which performs functional simulation, checking for correct memory references); the command line allows specifying a log file and the number of instructions for data retrieval
Creates 300 MB to 4 GB of opcode and operand data; results are discarded
Shows most instructions run by the benchmark

Configuration: Insomnia
Insomnia allows specification of the log file, number of instructions per log file, replacement policy, number of entries in the memo table, number of log files, and the opcode to be observed
Insomnia returns the number of times the opcode was called, the number of memo table hits, the number of zero operands, and the number of negative-operand hits
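
The kind of trace-driven counting described above could be sketched in C as follows. This is not the Insomnia source: TraceRecord, MemoStats, and simulate are hypothetical names, the trace is held in memory rather than read from log files, only FIFO replacement is modeled, and the counting rules are assumptions (a record with a zero operand is counted once and never logged, and a sign-only match is counted separately from an exact hit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 256  /* largest table length listed in the experiments */

typedef struct { int opcode; int32_t op1; int32_t op2; } TraceRecord;

typedef struct {
    unsigned long calls;          /* times the observed opcode appeared */
    unsigned long hits;           /* exact operand-pair matches */
    unsigned long zero_operands;  /* records with at least one zero operand */
    unsigned long negative_hits;  /* matches found only by ignoring sign */
} MemoStats;

/* Absolute value as an unsigned number, safe for INT32_MIN. */
static uint32_t magnitude(int32_t x)
{
    return x < 0 ? (uint32_t)0 - (uint32_t)x : (uint32_t)x;
}

/* Replays a trace against a FIFO memo table and gathers the counters. */
MemoStats simulate(const TraceRecord *trace, size_t n, int opcode,
                   size_t entries)
{
    assert(entries > 0 && entries <= MAX_ENTRIES);
    int32_t tag1[MAX_ENTRIES], tag2[MAX_ENTRIES];
    size_t used = 0, next = 0;
    MemoStats s = {0};

    for (size_t i = 0; i < n; i++) {
        if (trace[i].opcode != opcode)
            continue;
        s.calls++;
        int32_t a = trace[i].op1, b = trace[i].op2;
        if (a == 0 || b == 0) {       /* trivial: counted, not logged */
            s.zero_operands++;
            continue;
        }
        bool exact = false, by_sign = false;
        for (size_t j = 0; j < used; j++) {
            if (tag1[j] == a && tag2[j] == b) {
                exact = true;
                break;
            }
            if (magnitude(tag1[j]) == magnitude(a) &&
                magnitude(tag2[j]) == magnitude(b))
                by_sign = true;
        }
        if (exact) {
            s.hits++;
        } else if (by_sign) {
            s.negative_hits++;
        } else {                      /* miss: overwrite the oldest slot */
            tag1[next] = a;
            tag2[next] = b;
            next = (next + 1) % entries;
            if (used < entries)
                used++;
        }
    }
    return s;
}
```

Dividing hits (optionally plus negative_hits) by calls gives a hit rate for one opcode and one table length; repeating the call for each table length would give one curve per replacement policy.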

Configuration: Benchmarks
Uses MiBench ARM processor benchmarks
–Rsynth – text-to-speech encoder; the program executes 82 million instructions to encode to speech a review of “Apocalypse Now”
–Lame – WAV to MP3 encoder
In both benchmarks, floating-point instructions make up over 20% of the total, making them prime candidates for a memo table implementation

Experiments Run
The opcode chosen for the experiments was 102 (MUL)
It was run with the following table lengths: 4, 8, 16, 32, 64, 128, 256
Three different replacement policies were run: FIFO, LRU, and Random
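
The three replacement policies differ only in how they pick the entry to evict when the table is full. The C sketch below shows one way to express that choice; Policy, EntryAge, and choose_victim are hypothetical names, the per-entry logical timestamps are an assumed bookkeeping scheme, and rand() stands in for whatever random source the original program used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { POLICY_FIFO, POLICY_LRU, POLICY_RANDOM } Policy;

/* Per-entry bookkeeping, expressed as logical timestamps. */
typedef struct {
    unsigned long inserted_at;  /* when the entry was filled */
    unsigned long last_used;    /* when it was last filled or hit */
} EntryAge;

/* Picks the index to overwrite in a full table of `entries` slots. */
size_t choose_victim(Policy policy, const EntryAge *ages, size_t entries)
{
    assert(entries > 0);
    size_t victim = 0;

    switch (policy) {
    case POLICY_FIFO:    /* evict the entry that was inserted earliest */
        for (size_t i = 1; i < entries; i++)
            if (ages[i].inserted_at < ages[victim].inserted_at)
                victim = i;
        break;
    case POLICY_LRU:     /* evict the entry that was used least recently */
        for (size_t i = 1; i < entries; i++)
            if (ages[i].last_used < ages[victim].last_used)
                victim = i;
        break;
    case POLICY_RANDOM:  /* evict any entry with (roughly) equal chance */
        victim = (size_t)rand() % entries;
        break;
    }
    return victim;
}
```

FIFO ignores hits entirely, while LRU refreshes last_used on every hit, so an operand pair that recurs often survives longer under LRU than under FIFO.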

Results
Opcode 102 (MUL) from rsynth has been tested
Rsynth has over 82 million instructions
Opcode 102 has only 134,000 entries

Results from LRU Replacement

Results from FIFO Replacement

Results from Random Replacement

Analysis of Results
The order of the multiplications helped the hit rates of the smaller memo tables
Example:
–102 1 5
–…..
–102 1 3
–…..
With this operand ordering, a single-entry memo table would have a significant hit rate

Analysis of Results
For better results, other benchmarks with more representative operand ordering should be tested
MUL makes up less than 1% of the total operations, while FP ADD makes up close to 20%; a memo table could possibly be used to optimize FP ADD instead

Conclusions
For future tests, the number of operands present in the code should also be analyzed to determine the best instruction to memoize
The chance for better performance exists, but many different applications are needed to verify it completely
