Nadathur R Satish and Pierre-Yves Droz EECS Department, University of California Berkeley.

I Introduction
II Coding/decoding scheme
III Decoder
IV Encoder
V Experimental results
VI Conclusion

Traditional ASIP design flow
- Design: micro-architecture, control path, ISA, assembler, simulator
- Programming: write assembly in terms of the ISA (manual scheduling and register allocation)

Tipi design flow
- Design: micro-architecture (data path, control path), generate HDL, horizontal microcode description, horizontal microcode code generator (automatic scheduling and register allocation), simulator (cycle accurate)
- Programming: write the computational DAG (intermediate representation), then use the generated HM code generator to get the trace

Where does this fit in? How does it help?
Figure: object microcode, encoder, memory, decoder, object microcode.
- Reduce memory size
- Reduce memory bandwidth
- Can use any scheme we want as long as we get back the correct microcode
Figure blocks: HLL, compiler, traces, encoder, program store, decoder, horizontal microcode control buffer, data path.

Compression/decompression scheme
- We use trace caches, which can be thought of as an L0 cache.
- The trace caches are filled by the decoder, which is itself a processor.
- The encoder is the compiler that generates instructions for this processor.
- The set of instructions that the encoder generates is what is stored in the memory.
- The two main instructions are WRITE and SEQUENCE.
- The WRITE instruction fills an entry of a cache with a given value.
- The SEQUENCE instruction gives the order in which cache entries must be accessed to give the correct microcode output.
Figure: Trace Cache 1 and Trace Cache 2, each with its own sequence manager. Sequences shown: 1|2|3|1|2|1, 2|3|1|4|2|4, 1|3|1|2|1|1, 2|1|4|3|2|4. Microcode bits shown: 10xx11x10001x0001x00100x0x010x 000x10x0xxx00xxx.

Compression/decompression scheme: an example with a single cache
Microcode:
101111001011
011000100001
100110001110
011000100001
Encoder output:
WRITE line0, 011000100001
WRITE line1, 101111001011
WRITE line2, 100110001110
SEQUENCE line0,line1, line2,line0
Note: no WRITE is required for the last microcode.
How do we get compression?
- Trace cache hits: no need to WRITE
- Lots of don't cares in the microcode: low entropy
- Other methods: COPY instruction (delta coding)
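The single-cache example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' encoder: the cache is treated as unbounded (no eviction), cache lines are numbered in order of first appearance (so the line numbers can differ from the slide), and the function names `encode` and `decode` are hypothetical.

```python
def encode(microcode):
    """Emit a WRITE the first time a word is seen, then one SEQUENCE.

    Assumption: the cache never evicts, so every repeated word is a hit.
    """
    line_of = {}      # microcode word -> cache line index
    program = []      # list of instruction tuples
    order = []        # cache lines in the order they must be read
    for word in microcode:
        if word not in line_of:
            line_of[word] = len(line_of)
            program.append(("WRITE", line_of[word], word))
        order.append(line_of[word])
    program.append(("SEQUENCE", order))
    return program


def decode(program):
    """Replay WRITE and SEQUENCE instructions to rebuild the microcode."""
    cache = {}
    output = []
    for instruction in program:
        if instruction[0] == "WRITE":
            _, line, value = instruction
            cache[line] = value
        elif instruction[0] == "SEQUENCE":
            output.extend(cache[line] for line in instruction[1])
    return output


microcode = ["101111001011", "011000100001", "100110001110", "011000100001"]
program = encode(microcode)
for instruction in program:
    print(instruction)
```

Four microcode words need only three WRITE instructions here, because the last word is already in the cache.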

Instruction set
- Variable instruction size
- 8 different opcodes

Instruction   Opcode   Operands
NOP           000
WRITE         001      Cache, Index, Data
SEQUENCE      010      Cache, Sequence
START         011
JUMP          100      Address, Offset
SEQLENGTH     101      Length
STOP          110
COPY          111      Cache, Destination, Source, Bit changes
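The slides do not spell out how the COPY operands are interpreted. As a hedged sketch of delta coding, the Python below assumes that "Bit changes" is a list of bit positions to flip after copying the source entry to the destination entry. The helper names `bit_changes_between` and `apply_copy` are hypothetical.

```python
def bit_changes_between(old, new):
    """Positions where two equal-length bit strings differ."""
    if len(old) != len(new):
        raise ValueError("words must have the same length")
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_copy(cache, destination, source, bit_changes):
    """Assumed COPY semantics: copy the source entry, then flip the listed bits."""
    word = list(cache[source])
    for position in bit_changes:
        word[position] = "1" if word[position] == "0" else "0"
    cache[destination] = "".join(word)


cache = {0: "011000100001"}
delta = bit_changes_between("011000100001", "011100100001")
apply_copy(cache, destination=1, source=0, bit_changes=delta)
print(delta, cache[1])
```

When the new word differs from a cached one in only a few positions, the list of positions is shorter than the full data field of a WRITE, which is where the saving would come from.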

Architecture

Figure: fetch/decode pipeline timing by cycle. Fetch and Decode alternate between Stream 1 and Stream 2. Labels: instruction, instruction size, "Size is not ready!".

Unpacking example (issue width = 3)
Figure: memory words labeled 0x0000 through 0x0015, starting with the initial JUMPs, read by Fetch 1, Fetch 2 and Fetch 3 according to cycle parity.

Encoder stage 1
Figure blocks: input assembly file, parameter file, architecture description in XML, original microcode, cache simulation, mapped microcode, sequencer, linear assembly file.

Encoder stage 2
Figure blocks: linear assembly file, packer, packed assembly file.

Original microcode
Trace cache contents at a particular point:
101111001011
011000100001
100110001110
001001001001
Original microcode:
01xxxx100001
100xxxx01xxx
0xxxxxxxxxx1
10111xxxx011
- Has a lot of don't cares
- Can be exploited by the trace cache to avoid WRITE instructions

Mapped microcode
Trace cache contents:
101111001011
011000100001
100110001110
001001001001
Mapped microcode:
011100100001
100110001110
011100100001
101111001011
- The assembly produced is much smaller if we use this mapping. Why? Because we avoid costly WRITE instructions.
- What happens on a cache miss? We replace all don't cares by 0's. Motivation: the microcode has only a few 1's, and we expect the hit rate in the trace cache to increase.
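The mapping step can be sketched as pattern matching against the cache. The Python below is a minimal illustration, assuming don't cares are written as `x` and that the first matching cache entry wins (the slides do not state a selection policy). The names `matches` and `map_microcode` are hypothetical.

```python
def matches(pattern, entry):
    """True if entry agrees with pattern on every bit that is not a don't care."""
    return len(pattern) == len(entry) and all(
        p == "x" or p == e for p, e in zip(pattern, entry)
    )


def map_microcode(pattern, cache_entries):
    """Return (mapped_word, hit).

    On a hit, reuse the cached word so no WRITE is needed.
    On a miss, replace every don't care by 0, as described on the slide.
    """
    for entry in cache_entries:
        if matches(pattern, entry):
            return entry, True
    return pattern.replace("x", "0"), False


cache_entries = ["101111001011", "011000100001", "100110001110", "001001001001"]
for pattern in ["01xxxx100001", "100xxxx01xxx", "10111xxxx011", "11xxxxxxxxx0"]:
    print(pattern, map_microcode(pattern, cache_entries))
```

The last pattern matches nothing in the cache, so it falls back to zero-filling and would require a WRITE.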

Simulation environment
Tools: GCC cross-compiler for DLX, Tipi microcode generator, encoder (Java application), decoder (C++ low-level cycle-accurate simulation), statistics collector.
Files exchanged: C code, assembly code, microcode, packed file, microcode, report file.
3 test architectures:
- RSA is a hardware RSA coder/decoder.
- CC is a convolution coder.
- DLX is a DLX ISA processor.
Metrics:
- Compression ratio
- Number of stalls in the main architecture introduced by the decoder
References:
- Dictionary-based compression methods (Huffman)
- Hand-made DLX decoder (simplescalar)

Figure panels: Compression ratio / Cache size; Compression ratio / Cache number; Compression ratio / Sequence length; Number of stalls / Sequence length.
Benchmarks: des_branch, des_unrolled, fft, gsm_decode, gsm_encode, idct, median.

Comparison with other schemes
Columns: Architecture, Number of bits in one microcode line, Caches, Length of the sequence, Word size, Issue width
RSA 91 cache of size 85322
CC 3314 caches of size 165642
DLX 1061 cache of size 64201281

Contribution of instructions in the total size

Benefit from the COPY instruction

Benchmark      COPY     NO COPY   Improvement
idct           19.85%   41.17%    -51.79%
des_branch      8.70%   11.86%    -26.64%
fft             8.38%   12.29%    -31.81%
gsm_decode      8.80%   21.39%    -58.86%
gsm_encode      8.24%   14.37%    -42.66%
des_unrolled    6.47%    7.12%     -9.13%
Average        10.07%   18.03%    -36.82%
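The Improvement column is consistent with the relative change (COPY - NO COPY) / NO COPY, and the Average row with the mean of each column. The short Python check below reproduces it from the COPY and NO COPY figures. The function name `improvement` is hypothetical, and the formula is inferred from the numbers rather than stated on the slide.

```python
# (COPY %, NO COPY %) per benchmark, copied from the table above.
results = {
    "idct": (19.85, 41.17),
    "des_branch": (8.70, 11.86),
    "fft": (8.38, 12.29),
    "gsm_decode": (8.80, 21.39),
    "gsm_encode": (8.24, 14.37),
    "des_unrolled": (6.47, 7.12),
}


def improvement(copy_pct, no_copy_pct):
    """Relative change, in percent, of the COPY figure against NO COPY."""
    return (copy_pct - no_copy_pct) / no_copy_pct * 100


per_benchmark = {name: improvement(c, n) for name, (c, n) in results.items()}
average = sum(per_benchmark.values()) / len(per_benchmark)

for name, value in per_benchmark.items():
    print(f"{name:13s} {value:7.2f}%")
print(f"{'Average':13s} {average:7.2f}%")
```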

Conclusion
- Our approach allows for high compression ratios and is a valid alternative to a hand-made ISA.
- It is easily scalable: a wide range of parameters can be explored by the architecture designer.

Future work
- A better branching mechanism could be created by making the sequence managers more clever. It would allow prefetching and would not create any performance loss for high-granularity branches.
- As the SEQUENCE instruction represents the biggest part of the compressed file after the introduction of COPY, different ways of compressing this instruction should be explored.
- The parameters could be automatically generated by a program that carefully analyzes the main architecture.

Questions?

Figure: Tipi tool flow. Blocks: application architect, uArch architect, HLL IDE, compiler, Asm, object code (to memory), operation extractor (ops), actor library (actors), editor, uArch, RTL extractor, RTL, simulator, logic, hardware library, physical.
Problem: microcode memory size and bandwidth requirements are too high!
Solution: use a compression/decompression scheme.

Packing example (issue width of 3)
Figure: a linear instruction stream packed into memory words labeled 0x0000 through 0x0015, starting with the initial JUMPs.

Percentages in number of instructions

Stalls as a function of the issue width

Influence of the size of the microcode

Compression/decompression platform-based design
- Tipi (RSA, DES, GSM, …; RSA, CC, DLX): exports execution cycles
- Tipi + compression (RSA, DES, GSM, …; RSA, CC, DLX): exports execution cycles, memory size and bandwidth
