Octavian Cret, Kalman Pusztai Cristian Vancea, Balint Szente Technical University of Cluj-Napoca, Romania CREC: A Novel Reconfigurable Computing Design Methodology
2 Introduction CREC: low-cost general-purpose reconfigurable computer; Dynamically generated architecture; Built in a Hardware/Software CoDesign manner; Based on FPGA devices, on VHDL language and high level language (Java); No need for integration in a dedicated VLSI chip.
3 CREC’s Main Features Reconfigurable RISC computer; Parallel computer: each register has an associated Execution Unit (EU); All the EUs have an identical structure, and each one is able to execute any kind of instruction from the CREC Instruction Set; Having a greater number of EUs has the advantage of introducing Instruction Level Parallelism.
4 CREC Design Flow
5 The Parallel Compiler (I.) Parses the CREC-RISC source code; Makes the key decisions about the execution system that will be generated; Divides a sequentially written program into portions of code that execute at the same time; Determines the minimal number of program slices; Determines which instructions will be executed in parallel in each slice.
6 The Parallel Compiler (II.) Uses a set of rules; An example: each slice can contain at most one Load, Store or Jump instruction; Reads the application source code (in CREC assembly language) and generates a file in a specific format, giving a description of the tailored CREC; The resulting CREC architecture contains only the hardware needed to execute the subset of instructions used in the program.
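The slicing described on these two slides can be sketched as a greedy scheduler. This is only an illustration, not the CREC compiler itself: the instruction tuples, the `schedule` function, the `JMP` mnemonic and the conservative dependency rules are assumptions. Only the rule "at most one Load, Store or Jump per slice" and the one-EU-per-register model come from the slides.

```python
# Hypothetical instruction model: (opcode, destination register or None, source registers).
RESTRICTED = {"LOAD", "STORE", "JMP", "JNZ"}  # at most one of these per slice
JUMPS = {"JMP", "JNZ"}


def schedule(program):
    """Greedily pack a sequential program into parallel slices."""
    slices = []
    last_write = {}       # register -> latest slice that writes it
    last_read = {}        # register -> latest slice that reads it
    last_restricted = -1  # latest slice holding a Load, Store or Jump
    barrier = 0           # first slice allowed after the latest jump

    for op, dest, srcs in program:
        s = barrier
        for reg in srcs:  # read-after-write: wait for the producer
            s = max(s, last_write.get(reg, -1) + 1)
        if dest is not None:
            # One instruction per EU per slice, and (conservatively) never
            # overwrite a register in or before a slice that still reads it.
            s = max(s, last_write.get(dest, -1) + 1, last_read.get(dest, -1) + 1)
        if op in RESTRICTED:
            s = max(s, last_restricted + 1)
        if op in JUMPS:
            s = max(s, len(slices) - 1)  # a jump may not precede earlier code

        while len(slices) <= s:
            slices.append([])
        slices[s].append((op, dest, srcs))

        for reg in srcs:
            last_read[reg] = max(last_read.get(reg, -1), s)
        if dest is not None:
            last_write[dest] = s
        if op in RESTRICTED:
            last_restricted = s
        if op in JUMPS:
            barrier = s + 1
    return slices


# Straight-line instructions in the style of the CREC assembly (immediates have no register sources).
program = [
    ("MOV", "R1", ()),            # MOV R1,2
    ("MOV", "R2", ()),            # MOV R2,3
    ("MOV", "R3", ()),            # MOV R3,3
    ("ADD", "R1", ("R1", "R2")),  # ADD R1,R2
    ("DEC", "R3", ("R3",)),       # DEC R3
]
slices = schedule(program)
```

Under these assumptions the five instructions fit into two slices: the three MOVs run together on three EUs, then ADD and DEC run together.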
8 Results of the Parallel Compiler The size of the various functional parts; The subset of instructions involved; The number of execution units (N); The sequence of instructions making up the program; The resulting CREC architecture contains only the hardware needed to execute the subset of instructions used in the program.
9 Slices The instructions assigned to the EUs for execution at the same moment make up a program slice; The whole program is divided into slices; The slice size depends on the number of execution units designed for program execution.
10 Program Example Classical, non-optimal multiplication of two integers without overflow check using three EUs. Program sequence and instruction scheduling: MOV R1,2; MOV R2,3; MOV R3,3; ADD R1,R2; DEC R3; JNZ R3, ; MOV STORB,R1; STORE
11 VHDL Source Code Generator The VHDL files contain pre-written source code in which the main architecture parameters are given as generics and constants; The following components can be tailored: The number of EUs; The register width in all the EUs; The size of the Instructions Memory and Operands Memory for each EU; The size of the Data Stack and Slice Stack Memory; The slice-mapping block, containing instructions.
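As a rough illustration of the idea of feeding tailored parameters into pre-written VHDL, the sketch below emits a small VHDL package of constants. The function name, the package name `crec_params` and every constant name are hypothetical; the slides do not show the actual generator or its file layout.

```python
def generate_crec_package(num_eus, reg_width, instr_mem_depth,
                          data_stack_depth, slice_stack_depth):
    """Return the text of a VHDL package holding the tailored parameters.

    All identifiers in the generated VHDL are assumptions made for this sketch.
    """
    params = {
        "NUM_EUS": num_eus,
        "REG_WIDTH": reg_width,
        "INSTR_MEM_DEPTH": instr_mem_depth,
        "DATA_STACK_DEPTH": data_stack_depth,
        "SLICE_STACK_DEPTH": slice_stack_depth,
    }
    # Reject sizes that could not describe real hardware.
    for name, value in params.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")

    lines = ["package crec_params is"]
    lines += [f"  constant {name} : positive := {value};"
              for name, value in params.items()]
    lines.append("end package crec_params;")
    return "\n".join(lines) + "\n"


vhdl_text = generate_crec_package(num_eus=3, reg_width=8, instr_mem_depth=64,
                                  data_stack_depth=16, slice_stack_depth=16)
```

The pre-written design files would then reference such a package, so only this one generated file changes from one application to the next.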
12 CREC General Architecture
13 The Hardware Architecture The N Execution Units; Instruction Memories; Data Stack Memory (for Push and Pop); Slice Stack Memory (for Call and Return); A Slice Program Counter; A Slice-mapping Memory; Store Buffer and Load Buffer; Data Memory (external or internal); Operand Memories.
14 The Instruction Set Relatively large instruction set, containing more instructions than typical microcontrollers have; Every instruction operates only on unsigned integers; Each EU is potentially able to execute any kind of instruction from the CREC Instruction Set.
15 Addition with or without Carry; Subtraction with or without Borrow and compare; Logical functions: And, Or, Xor, Not and Bit Test; Shift arithmetic and logic to left/right; Rotate and rotate through Carry to left/right; Increment/Decrement and 2’s Complement. Data Manipulation Instructions
16 Instruction Format and Example “G” defines the Instruction Group (Data Manipulation); “Code” is the operation code (e.g. Add, Sub); “Type” specifies the operation type (e.g. with/without Carry); “Load” contains the load signals for the register and for the Carry and Zero flags; “D” is the Register/Data selection for the second operand.
17 Program Control Instructions Slice counter manipulation: Jump, Call and Return; Data movement: Move; Stack manipulation: Push and Pop; Input from and Output to port: In and Out; Load from and Store to external memory; For greater flexibility, every instruction also exists in a conditional form: C (Carry), Z (Zero), E (Equal), A (Above), AE (Above or Equal), B (Below), BE (Below or Equal), each also available negated.
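The listed conditions can all be derived from the two flags, Carry and Zero. The sketch below assumes the common unsigned-compare convention in which a compare of a with b sets CF when a borrow occurs (a < b) and ZF when the operands are equal; the slides do not state CREC's exact flag semantics, and the `N` prefix used here for negated conditions is a hypothetical notation.

```python
def compare(a, b, width=8):
    """Return (CF, ZF) after an unsigned compare, assuming CF means borrow."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    return a < b, a == b


def condition_holds(cond, cf, zf):
    """Evaluate a condition name such as 'A', 'BE' or the negated 'NZ'."""
    negate = cond.startswith("N")  # hypothetical prefix for the negated forms
    base = cond[1:] if negate else cond
    table = {
        "C": cf,
        "Z": zf,
        "E": zf,                     # Equal: same test as Zero
        "A": (not cf) and (not zf),  # Above: no borrow and not equal
        "AE": not cf,                # Above or Equal: no borrow
        "B": cf,                     # Below: borrow
        "BE": cf or zf,              # Below or Equal
    }
    result = table[base]
    return (not result) if negate else result


cf, zf = compare(3, 5)
below = condition_holds("B", cf, zf)
above_or_equal = condition_holds("AE", cf, zf)
```

With these assumptions, comparing 3 with 5 satisfies B and BE but not A, AE or E.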
18 Instruction Format and Example “G” defines the Instruction Group (Program Control); “Code” is the operation code (e.g. Jump, Call); “Conditions” field contains the code for validating the execution of a given instruction; “R” is the load signal for the Register (e.g. Move); “D” is the Register/Data selection for the second operand.
19 The Execution Unit Decoding Unit – decodes the instruction code; Control Unit – generates the control signals for the Program Control Instruction group; Multiplexer Unit – selects the second operand of the binary instructions; Operating Unit – performs the data manipulation operations; Accumulator Unit – stores the instruction result; Flag Unit – contains the two flag bits: the Carry Flag (CF) and the Zero Flag (ZF).
21 The Optimized Operating Unit Symmetrical organization: on the right side are the binary instruction blocks, and on the left side are the unary operation blocks (performing operations only on the accumulator); The blocks use only one level of FPGA slices; All four subunits use the same number of slices; Takes advantage of the Fast Carry Lines; The size of the Operating Unit grows linearly with the word length.
22 Virtex Optimized Arithmetic Unit The basic 2-bit ADD/SUB cell using the Fast Carry Lines consumes only one Xilinx VirtexE slice.
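A behavioural model can show how a word-wide adder/subtractor is assembled from 2-bit cells linked by a carry chain, so that a w-bit unit needs w/2 cells. The sketch below is not the actual slice-level design; the function names are hypothetical, and the convention of reporting the carry flag as a borrow after subtraction is an assumption.

```python
def add_sub_cell(a2, b2, carry_in, subtract):
    """One 2-bit cell: returns (2-bit result, carry out)."""
    if subtract:
        b2 ^= 0b11  # subtraction adds the bitwise complement of b
    total = a2 + b2 + carry_in
    return total & 0b11, total >> 2


def add_sub(a, b, width=8, subtract=False, carry_in=0):
    """Chain width/2 cells; returns (result, carry_or_borrow, zero)."""
    if width % 2:
        raise ValueError("width must be a multiple of 2")
    # For subtraction the chain starts with carry 1 (two's complement),
    # or 0 when an incoming borrow must also be subtracted.
    carry = (1 - carry_in) if subtract else carry_in
    result = 0
    for cell in range(width // 2):
        a2 = (a >> (2 * cell)) & 0b11
        b2 = (b >> (2 * cell)) & 0b11
        bits, carry = add_sub_cell(a2, b2, carry, subtract)
        result |= bits << (2 * cell)
    flag = (1 - carry) if subtract else carry  # assumed: borrow after SUB
    return result, flag, result == 0


sum_result = add_sub(200, 100)               # 8-bit ADD with overflow
diff_result = add_sub(5, 7, subtract=True)   # 8-bit SUB with borrow
```

The loop runs once per cell, which matches the linear growth of the Operating Unit with the word length.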
23 Arithmetic and Logic Opcodes Opcodes of the arithmetic unit; Opcodes of the logic unit; L is the “Not Load” signal and S is the “Subtract” signal.
24 Virtex Optimized Shift Left Unit The basic 2-bit SHL/ROL/NEG/INC/DEC cell using the Fast Carry Lines consumes only one slice.
25 Virtex Optimized Shift Right Unit The basic 2-bit SHR/ROR/NOT cell using the Fast Carry Lines consumes only one Xilinx VirtexE slice.
26 Shift Left and Right Opcodes Opcodes of the shift left unit, where S is the “Shift” and D is the “Decrement” signal; Opcodes of the shift right unit, where S is the “Shift” and N is the “Not” signal.
27 Shift and Rotate Operations SHL – Shift Left; SAL – Shift Arithmetic Left; ROL – Rotate Left; RCL – Rotate through Carry Left. SHR – Shift Right; SAR – Shift Arithmetic Right; ROR – Rotate Right; RCR – Rotate through Carry Right.
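The eight operations can be modelled on a fixed-width word as follows. This sketch uses the conventional definitions of these mnemonics (shifts and rotates by one position, with the bit that leaves the word going to the carry flag); whether CREC updates the carry on plain rotates is an assumption, and the function name is hypothetical.

```python
def shift_rotate(op, x, cf=0, width=8):
    """Apply a one-bit shift or rotate; returns (result, carry out)."""
    mask = (1 << width) - 1
    x &= mask
    msb = x >> (width - 1)  # bit leaving the word on left operations
    lsb = x & 1             # bit leaving the word on right operations

    if op in ("SHL", "SAL"):  # arithmetic and logical left shifts coincide
        return (x << 1) & mask, msb
    if op == "ROL":           # the leaving bit re-enters on the right
        return ((x << 1) & mask) | msb, msb
    if op == "RCL":           # the old carry enters on the right
        return ((x << 1) & mask) | cf, msb
    if op == "SHR":           # zero enters on the left
        return x >> 1, lsb
    if op == "SAR":           # the top bit is preserved
        return (x >> 1) | (msb << (width - 1)), lsb
    if op == "ROR":           # the leaving bit re-enters on the left
        return (x >> 1) | (lsb << (width - 1)), lsb
    if op == "RCR":           # the old carry enters on the left
        return (x >> 1) | (cf << (width - 1)), lsb
    raise ValueError(f"unknown operation: {op}")


shl = shift_rotate("SHL", 0b10000001)
sar = shift_rotate("SAR", 0b10000010)
rcl = shift_rotate("RCL", 0b10000000, cf=1)
```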
28 Execution Unit Resources A complete Execution Unit (with all the subunits generated) having 8-bit wide accumulator consumes 20 CLBs, that is approximately 0.6% of a Xilinx Virtex600E FPGA chip; An Execution Unit with 16-bit wide register consumes 35 CLBs, that is approximately 1% of the available CLBs.
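The quoted percentages can be cross-checked with a short calculation. The total CLB count used here is not given on the slide; it assumes the 48 x 72 CLB array listed for the XCV600E in the Virtex-E family datasheet.

```python
# Assumption: XCV600E CLB array of 48 rows x 72 columns (from the datasheet).
TOTAL_CLBS = 48 * 72


def clb_share(clbs_used):
    """Percentage of the device's CLBs taken by a block of the given size."""
    return 100.0 * clbs_used / TOTAL_CLBS


share_8_bit = round(clb_share(20), 1)   # complete EU, 8-bit accumulator
share_16_bit = round(clb_share(35), 1)  # EU with 16-bit register
```

Under that assumption both figures round to the values on the slide, about 0.6% and 1% respectively.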
29 Experimental Results Functional Parallel compiler; Execution Units optimized for Xilinx VirtexE device; Slice Memory and Stack Memory under test; A CREC architecture having 4 EUs with 4-bit wide registers occupies 4% of the CLBs and 5% of the BlockRAMs in the Virtex600E device; A CREC architecture having 4 EUs with 16-bit wide registers occupies 18% of the CLBs and 20% of the BlockRAMs in the Virtex600E device; The operating clock frequency is 100 MHz.
30 Performance evaluation The performance indexes show how many times faster a given algorithm executes on an optimised CREC system than with the classical execution flow.
31 Conclusions and Further Work Creating the possibility of writing high-level programs for CREC; Extend the functionalities of the Parallel Compiler, then create a C or PASCAL compiler for CREC applications; Several variants of CREC architectures; Hardware distributed computing, using the FPGA configuration over the Internet.