ATHENa & FPGA Embedded Resources


ECE 545 Lecture 11: ATHENa & FPGA Embedded Resources

Resources
ATHENa website: http://cryptography.gmu.edu/athena
FPGA Embedded Resources web page: available from the course web page

ATHENa – Automated Tool for Hardware EvaluatioN Supported in part by the National Institute of Standards & Technology (NIST)

ATHENa Team
Venkata “Vinny”, MS CpE student
Ekawat “Ice”, PhD CpE student
Marcin, PhD ECE student
John, MS CpE student
Rajesh, PhD ECE student
Michal, PhD exchange student from Slovakia

ATHENa: Automated Tool for Hardware EvaluatioN
http://cryptography.gmu.edu/athena
An open-source benchmarking tool, written in Perl, aimed at AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms.
Currently under development at George Mason University.

Why Athena?
"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess of Wisdom was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest."
from "Athena, Greek Goddess of Wisdom and Craftsmanship"

Basic Dataflow of ATHENa
[Figure: numbered data flow (steps 1 to 8) among the ATHENa Server, the User, and the Designer. Labels: HDL + scripts + configuration files; HDL + FPGA Tools; FPGA Synthesis and Implementation; Result Summary + Database Entries; Database Entries; Database query; Ranking of designs; Download scripts and configuration files; Interfaces + Testbenches]

[Figure labels: synthesizable source files, constraint files, configuration files, testbench; database entries (machine-friendly), result summary (user-friendly)]

ATHENa Major Features (1)
synthesis, implementation, and timing analysis in batch mode
support for devices and tools of multiple FPGA vendors
generation of results for multiple families of FPGAs of a given vendor
automated choice of a best-matching device within a given family

ATHENa Major Features (2)
automated verification of designs through simulation in batch mode
support for multi-core processing
automated extraction and tabulation of results
several optimization strategies aimed at finding
  optimum options of tools
  best target clock frequency
  best starting point of placement

Generation of Results Facilitated by ATHENa
batch mode of FPGA tools
ease of extraction and tabulation of results
  Text Reports, Excel, CSV (Comma-Separated Values)
optimized choice of tool options
  GMU_optimization_1 strategy vs.

Relative Improvement of Results from Using ATHENa Virtex 5, 256-bit Variants of Hash Functions Ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools

How To Start Working With ATHENa?
One-Time Tasks
  Download and unzip ATHENa: http://cryptography.gmu.edu/athena/
  Read the Tutorial!
  Install the Required Tools (see Tutorial - Part 1 - Tools Installation)
  Run ATHENa_setup

How To Start Working With ATHENa?
Repetitive Tasks
  Prepare or modify your source files & source_list.txt
  Modify design.config.txt + possibly other configuration files
  Run ATHENa

design.config.txt: Your Design
# directory containing synthesizable source files for the project
SOURCE_DIR = <examples/sha256_rs>
# A file list containing list of files in the order suitable for synthesis and implementation
# low level modules first, top level entity last
SOURCE_LIST_FILE = source_list.txt
# project name
# it will be used in the names of result directories
PROJECT_NAME = SHA256
# name of top level entity
TOP_LEVEL_ENTITY = sha256
# name of top level architecture
TOP_LEVEL_ARCH = rs_arch
# name of clock net
CLOCK_NET = clk

design.config.txt: Timing Formulas
#formula for latency
LATENCY = TCLK*65
#formula for throughput
THROUGHPUT = 512/(TCLK*65)
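As a rough illustration of how these two formulas behave, the sketch below evaluates them for one clock period. It assumes that TCLK is expressed in ns and that the throughput is scaled to Mbit/s (the units listed in the timing report); the function names are hypothetical and are not part of ATHENa.

```python
CYCLES_PER_BLOCK = 65   # clock cycles per message block, from LATENCY = TCLK*65
BLOCK_SIZE_BITS = 512   # bits per message block, from THROUGHPUT = 512/(TCLK*65)

def latency_ns(tclk_ns):
    """Latency of one block in ns for a clock period given in ns."""
    return tclk_ns * CYCLES_PER_BLOCK

def throughput_mbps(tclk_ns):
    """Throughput in Mbit/s: bits per ns equals Gbit/s, so scale by 1000."""
    return BLOCK_SIZE_BITS / (tclk_ns * CYCLES_PER_BLOCK) * 1000

tclk = 10.0  # example clock period in ns (100 MHz), chosen only for illustration
print("latency [ns]:", latency_ns(tclk))
print("throughput [Mbit/s]:", round(throughput_mbps(tclk), 3))
```

Halving the clock period halves the latency and doubles the throughput, which is why both quantities are expressed in terms of TCLK rather than as fixed numbers.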

design.config.txt: Application & Optimization Target
# OPTIMIZATION_TARGET = speed | area | balanced
OPTIMIZATION_TARGET = speed
# OPTIONS = default | user
OPTIONS = default
# APPLICATION = single_run | exhaustive_search | placement_search | frequency_search |
#               GMU_Optimization_1 | GMU_Xilinx_optimization_1
APPLICATION = single_run
# TRIM_MODE = off | zip | delete
TRIM_MODE = zip

design.config.txt: FPGA Families (Xilinx)
# commenting the next line removes all families of Xilinx
FPGA_VENDOR = xilinx
#commenting the next line removes a given family
FPGA_FAMILY = spartan3
# FPGA_DEVICES = <list of devices> | best_match | all
FPGA_DEVICES = best_match
SYN_CONSTRAINT_FILE = default
IMP_CONSTRAINT_FILE = default
REQ_SYN_FREQ = 120
REQ_IMP_FREQ = 100
MAX_SLICE_UTILIZATION = 0.8
MAX_BRAM_UTILIZATION = 0.8
MAX_MUL_UTILIZATION = 1
MAX_PIN_UTILIZATION = 0.9
END FAMILY
END VENDOR

design.config.txt: FPGA Families (Altera)
# commenting the next line removes all families of Altera
FPGA_VENDOR = altera
#commenting the next line removes a given family
FPGA_FAMILY = Stratix III
# FPGA_DEVICES = <list of devices> | best_match | all
FPGA_DEVICES = best_match
SYN_CONSTRAINT_FILE = default
IMP_CONSTRAINT_FILE = default
REQ_IMP_FREQ = 120
MAX_LOGIC_UTILIZATION = 0.8
MAX_MEMORY_UTILIZATION = 0.8
MAX_DSP_UTILIZATION = 0
MAX_MUL_UTILIZATION = 0
MAX_PIN_UTILIZATION = 0.8
END FAMILY
END VENDOR

Library Files
Files created during ATHENa setup:
  device_lib/xilinx_device_lib.txt
  device_lib/altera_device_lib.txt
These files characterize the FPGA families and devices available in the versions of the Xilinx and Altera tools installed on your computer.
Currently supported tool versions:
  Xilinx WebPACK 9.1, 9.2, 10.1, 11.1, 11.5, 12.1, 12.2, 12.3
  Xilinx Design Suite 11.1, 12.1, 12.2, 12.3
  Altera Quartus II Web Edition 8.1, 8.2, 9.0, 9.1, 10.0
  Altera Quartus II Subscription Edition 9.1, 10.0
If a library for a given version is not available yet, use a library from the closest available version.

Library Files: device_lib/xilinx_device_lib.txt
VENDOR = Xilinx
#Device, Total Slices, Block RAMs, DSP, Dedicated Multipliers, Maximum User I/O Pins
ITEM_ORDER = SLICE, BRAM, DSP, MULT, IO
FAMILY = spartan3
xc3s50pq208-5, 768, 4, 0, 4, 124
xc3s200ft256-5, 1920, 12, 0, 12, 173
xc3s400fg456-5, 3584, 16, 0, 16, 264
xc3s1000fg676-5, 7680, 24, 0, 24, 391
xc3s1500fg676-5, 13312, 32, 0, 32, 487
END_FAMILY
FAMILY = virtex5
xc5vlx30ff676-3, 4800, 32, 32, 0, 400
xc5vfx30tff665-3, 5120, 68, 64, 0, 360
xc5vlx30tff665-3, 4800, 36, 32, 0, 360
xc5vlx50ff1153-3, 7200, 48, 48, 0, 560
xc5vlx50tff1136-3, 7200, 60, 48, 0, 480
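One plausible reading of FPGA_DEVICES = best_match is "the first device in the library whose resources, scaled by the MAX_*_UTILIZATION limits, cover what the design needs". The sketch below implements that reading only; it is an assumption about the selection rule, not ATHENa's actual code, and the function and dictionary names are hypothetical. The device rows and limits are copied from the slides, and the resource needs are the spartan3 figures from report_resource_utilization.txt.

```python
# Rows copied from the spartan3 family of xilinx_device_lib.txt:
# (device, slices, block RAMs, DSP units, multipliers, user I/O pins)
SPARTAN3 = [
    ("xc3s50pq208-5", 768, 4, 0, 4, 124),
    ("xc3s200ft256-5", 1920, 12, 0, 12, 173),
    ("xc3s400fg456-5", 3584, 16, 0, 16, 264),
    ("xc3s1000fg676-5", 7680, 24, 0, 24, 391),
    ("xc3s1500fg676-5", 13312, 32, 0, 32, 487),
]

# Limits from the Xilinx section of design.config.txt
LIMITS = {"slices": 0.8, "brams": 0.8, "mults": 1.0, "ios": 0.9}

# Resource needs from the spartan3 row of report_resource_utilization.txt
NEED = {"slices": 74, "brams": 4, "mults": 7, "ios": 20}

def best_match(devices, need, limits):
    """Return the first listed device that fits within the utilization limits."""
    for name, slices, brams, _dsps, mults, ios in devices:
        capacity = {"slices": slices, "brams": brams, "mults": mults, "ios": ios}
        if all(need[key] <= limits[key] * capacity[key] for key in need):
            return name
    return None  # no device in this family is large enough

print(best_match(SPARTAN3, NEED, LIMITS))
```

Under these assumptions the smallest device is rejected because 4 block RAMs exceed 80% of its 4 available blocks, so the sketch moves on to the next entry in the list.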

Result Files: report_resource_utilization.txt

xilinx : spartan3
+---------+-----------------+-----+------+---+--------+---+-------+----+-------+----+------+---+----+----+
| GENERIC | DEVICE          | RUN | LUTs | % | SLICES | % | BRAMs | %  | MULTs | %  | DSPs | % | IO | %  |
| default | xc3s200ft256-5* | 1   | 142  | 3 | 74     | 3 | 4     | 33 | 7     | 58 | 0    | 0 | 20 | 11 |

xilinx : spartan6
+---------+------------------+-----+------+---+--------+---+-------+---+-------+---+------+----+----+----+
| GENERIC | DEVICE           | RUN | LUTs | % | SLICES | % | BRAMs | % | MULTs | % | DSPs | %  | IO | %  |
| default | xc6slx9csg324-3* | 1   | 41   | 1 | 22     | 1 | 4     | 6 | 0     | 0 | 9    | 56 | 20 | 10 |

xilinx : virtex5
+---------+-------------------+-----+------+---+--------+---+-------+----+-------+---+------+----+----+----+
| GENERIC | DEVICE            | RUN | LUTs | % | SLICES | % | BRAMs | %  | MULTs | % | DSPs | %  | IO | %  |
| default | xc5vlx20tff323-2* | 1   | 101  | 1 | 56     | 1 | 4     | 15 | 0     | 0 | 9    | 37 | 20 | 11 |

xilinx : virtex6
+---------+-------------------+-----+------+---+--------+---+-------+---+-------+---+------+---+----+---+
| GENERIC | DEVICE            | RUN | LUTs | % | SLICES | % | BRAMs | % | MULTs | % | DSPs | % | IO | % |
| default | xc6vlx75tff784-3* | 1   | 44   | 1 | 21     | 1 | 4     | 1 | 0     | 0 | 9    | 3 | 20 | 5 |

Result Files: report_timing.txt

REQ SYN FREQ - Requested synthesis clk. freq.
SYN FREQ - Achieved synthesis clk. freq.
REQ SYN TCLK - Requested synthesis clk. period
SYN TCLK - Achieved synthesis clk. period
REQ IMP FREQ - Requested implementation clk. freq.
IMP FREQ - Achieved implementation clk. freq.
REQ IMP TCLK - Requested implementation clk. period
IMP TCLK - Achieved implementation clk. period
LATENCY - Latency [ns]
THROUGHPUT - Throughput [Mbits/s]
TP/Area - Throughput/Area [(Mbits/s)/CLB slices]
Latency*Area - Latency*Area [ns*CLB slices]

xilinx : spartan3
+---------+-----------------+-----+--------------+----------+--------------+----------+--------------+----------+--------------+----------+---------+------------+------------+--------------+
| GENERIC | DEVICE          | RUN | REQ SYN FREQ | SYN FREQ | REQ SYN TCLK | SYN TCLK | REQ IMP FREQ | IMP FREQ | REQ IMP TCLK | IMP TCLK | LATENCY | THROUGHPUT | TP/Area    | Latency*Area |
| default | xc3s200ft256-5* | 1   | default      | 207.370  | default      | 4.822    | default      | 112.448  | default      | 8.893    | 17.786  | 449.792    | 6.078      | 1316.164     |

xilinx : spartan6
+---------+------------------+-----+--------------+----------+--------------+----------+--------------+----------+--------------+----------+---------+------------+------------+--------------+
| GENERIC | DEVICE           | RUN | REQ SYN FREQ | SYN FREQ | REQ SYN TCLK | SYN TCLK | REQ IMP FREQ | IMP FREQ | REQ IMP TCLK | IMP TCLK | LATENCY | THROUGHPUT | TP/Area    | Latency*Area |
| default | xc6slx9csg324-3* | 1   | default      | 75.751   | default      | 13.201   | default      | 78.119   | default      | 12.801   | 25.602  | 312.476    | 14.203     | 563.244      |

xilinx : virtex5
+---------+-------------------+-----+--------------+----------+--------------+----------+--------------+----------+--------------+----------+---------+------------+------------+--------------+
| GENERIC | DEVICE            | RUN | REQ SYN FREQ | SYN FREQ | REQ SYN TCLK | SYN TCLK | REQ IMP FREQ | IMP FREQ | REQ IMP TCLK | IMP TCLK | LATENCY | THROUGHPUT | TP/Area    | Latency*Area |
| default | xc5vlx20tff323-2* | 1   | default      | 156.347  | default      | 6.396    | default      | 126.952  | default      | 7.877    | 15.754  | 507.808    | 9.068      | 882.224      |

xilinx : virtex6
+---------+-------------------+-----+--------------+----------+--------------+----------+--------------+----------+--------------+----------+---------+------------+------------+--------------+
| GENERIC | DEVICE            | RUN | REQ SYN FREQ | SYN FREQ | REQ SYN TCLK | SYN TCLK | REQ IMP FREQ | IMP FREQ | REQ IMP TCLK | IMP TCLK | LATENCY | THROUGHPUT | TP/Area    | Latency*Area |
| default | xc6vlx75tff784-3* | 1   | default      | 158.053  | default      | 6.327    | default      | 135.410  | default      | 7.385    | 14.770  | 541.638    | 25.792     | 310.170      |
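The derived columns of this report can be cross-checked against the raw ones. The sketch below recomputes IMP FREQ, TP/Area and Latency*Area from IMP TCLK, LATENCY, THROUGHPUT and the slice counts in report_resource_utilization.txt. The relationships used (frequency in MHz = 1000 / period in ns, area = number of slices) are inferred from the column legend and the numbers themselves; the function name is hypothetical.

```python
# (IMP TCLK [ns], LATENCY [ns], THROUGHPUT [Mbit/s], SLICES) per family,
# copied from report_timing.txt and report_resource_utilization.txt
RESULTS = {
    "spartan3": (8.893, 17.786, 449.792, 74),
    "spartan6": (12.801, 25.602, 312.476, 22),
    "virtex5": (7.877, 15.754, 507.808, 56),
    "virtex6": (7.385, 14.770, 541.638, 21),
}

def derived_metrics(tclk_ns, latency_ns, throughput_mbps, slices):
    """Return (frequency [MHz], throughput/area, latency*area)."""
    freq_mhz = 1000.0 / tclk_ns
    return freq_mhz, throughput_mbps / slices, latency_ns * slices

for family, row in RESULTS.items():
    freq, tp_area, lat_area = derived_metrics(*row)
    print(f"{family}: IMP FREQ={freq:.3f} TP/Area={tp_area:.3f} Latency*Area={lat_area:.3f}")
```

Throughput/area rewards designs that are both fast and small, while latency*area penalizes designs that are slow or large, so the two columns can rank the same set of implementations differently.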

Result Files: report_options.txt

COST TABLE - parameter determining the starting point of placement
Synthesis Options - options of the synthesis tool
Map Options - options of the mapping tool
PAR Options - options of the place & route tool

xilinx : spartan3
+---------+-----------------+-----+------------+------------------------------+-------------------------+--------------+
| GENERIC | DEVICE          | RUN | COST TABLE | Synthesis Options            | Map Options             | PAR Options  |
| default | xc3s200ft256-5* | 1   | 1          | -opt_level 1 -opt_mode speed | -c 100 -pr b -cm speed  | -w -ol std   |

xilinx : spartan6
+---------+------------------+-----+------------+------------------------------+---------------+--------------+
| GENERIC | DEVICE           | RUN | COST TABLE | Synthesis Options            | Map Options   | PAR Options  |
| default | xc6slx9csg324-3* | 1   | 1          | -opt_level 1 -opt_mode speed | -c 100 -pr b  | -w -ol std   |

xilinx : virtex5
+---------+-------------------+-----+------------+------------------------------+-------------------------+--------------+
| GENERIC | DEVICE            | RUN | COST TABLE | Synthesis Options            | Map Options             | PAR Options  |
| default | xc5vlx20tff323-2* | 1   | 1          | -opt_level 1 -opt_mode speed | -c 100 -pr b -cm speed  | -w -ol std   |

xilinx : virtex6
+---------+-------------------+-----+------------+------------------------------+---------------+--------------+
| GENERIC | DEVICE            | RUN | COST TABLE | Synthesis Options            | Map Options   | PAR Options  |
| default | xc6vlx75tff784-3* | 1   | 1          | -opt_level 1 -opt_mode speed | -c 100 -pr b  | -w -ol std   |

Result Files: report_execution_time.txt

Synthesis Time - time of synthesis
Implementation Time - time of implementation
Elapsed Time - total time

xilinx : spartan3
+---------+-----------------+-----+----------------+---------------------+--------------+
| GENERIC | DEVICE          | RUN | Synthesis Time | Implementation Time | Elapsed Time |
| default | xc3s200ft256-5* | 1   | 0d 0h:0m:12s   | 0d 0h:0m:36s        | 0d 0h:0m:48s |

xilinx : spartan6
+---------+------------------+-----+----------------+---------------------+--------------+
| GENERIC | DEVICE           | RUN | Synthesis Time | Implementation Time | Elapsed Time |
| default | xc6slx9csg324-3* | 1   | 0d 0h:0m:21s   | 0d 0h:1m:13s        | 0d 0h:1m:34s |

xilinx : virtex5
+---------+-------------------+-----+----------------+---------------------+--------------+
| GENERIC | DEVICE            | RUN | Synthesis Time | Implementation Time | Elapsed Time |
| default | xc5vlx20tff323-2* | 1   | 0d 0h:0m:39s   | 0d 0h:1m:50s        | 0d 0h:2m:29s |

xilinx : virtex6
+---------+-------------------+-----+----------------+---------------------+--------------+
| GENERIC | DEVICE            | RUN | Synthesis Time | Implementation Time | Elapsed Time |
| default | xc6vlx75tff784-3* | 1   | 0d 0h:0m:22s   | 0d 0h:3m:22s        | 0d 0h:3m:44s |

design.config.txt: Functional Simulation (1)
# FUNCTIONAL_VERIFICATION_MODE = <on | off>
FUNCTIONAL_VERIFICATION_MODE = <off>
# directory containing source files of the testbench
VERIFICATION_DIR = <examples/sha256_rs/tb>
# A file containing a list of testbench files in the order suitable for compilation;
# low level modules first, top level entity last.
# Test vector files should be located in the same directory and listed
# in the same file, unless a fixed path is used. Please refer to the tutorial for more detail.
VERIFICATION_LIST_FILE = <tb_srcs.txt>
# name of testbench's top level entity
TB_TOP_LEVEL_ENTITY = <sha_tb>
# name of testbench's top level architecture
TB_TOP_LEVEL_ARCH = <behavior>

design.config.txt: Functional Simulation (2)
# MAX_TIME_FUNCTIONAL_VERIFICATION = <$time $unit>
# supported units are: ps, ns, us, and ms
# if blank, simulation will run until it finishes,
# i.e., until there are no changes in signals (clock is stopped and no more inputs are coming in).
MAX_TIME_FUNCTIONAL_VERIFICATION = <>
# Perform only verification (synthesis and implementation parameters are ignored)
# VERIFICATION_ONLY = <ON | OFF>
VERIFICATION_ONLY = <off>

test_circuit: ATHENa Example including embedded FPGA resources

design.config.txt: Global Generics
GLOBAL_GENERICS_BEGIN
# Number of stages
# n is currently set to the default value, i.e., n = 16;
# for other values of n, modify the formulas for Latency and Throughput accordingly
n = 16
# Memory type: 0 = MEM_DISTRIBUTED, 1 = MEM_EMBEDDED
mem_type = 0, 1
# Adder type: 0 = ADD_SCCA_BASED (Simple Carry Chain Adder, "+" in VHDL),
#             1 = ADD_DSP_BASED
# Multiplier type: 0 = MUL_LOGIC_BASED (multiplier based on configurable logic),
#                  1 = MUL_DEDICATED
# Allowed combinations of adder and multiplier types
(adder_type, multiplier_type) = (0, 0), (1, 1)
GLOBAL_GENERICS_END
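To see how many design variants this block describes, the sketch below enumerates the generic settings, assuming that independently listed generics are combined with each other while the tupled generics (adder_type, multiplier_type) are restricted to the listed pairs. That expansion rule is an assumption made for illustration; the slide does not state how ATHENa iterates over the values.

```python
from itertools import product

n_values = [16]                       # n = 16
mem_types = [0, 1]                    # 0 = MEM_DISTRIBUTED, 1 = MEM_EMBEDDED
adder_mult_pairs = [(0, 0), (1, 1)]   # allowed (adder_type, multiplier_type) pairs

# Cross product of the independent lists; the tuple list keeps the
# adder and multiplier types tied together.
combinations = [
    {"n": n, "mem_type": mem, "adder_type": add, "multiplier_type": mul}
    for n, mem, (add, mul) in product(n_values, mem_types, adder_mult_pairs)
]

for combo in combinations:
    print(combo)
```

Listing the adder and multiplier types as pairs rules out mixed variants such as a DSP-based adder with a logic-based multiplier, which halves the number of runs compared with a full cross product.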

design.config.txt: FPGA Family Specific Generics
FPGA_FAMILY = Cyclone II
GENERICS_BEGIN
# FPGA vendor: 0 = XILINX, 1 = ALTERA
vendor = 1
# Memory block size: 0 = M512, 1 = M4K, 2 = M9K, 3 = M20K,
#                    4 = MLAB, 5 = MRAM, 6 = M144K
mem_block_size = 1
GENERICS_END
FPGA_DEVICES = best_match
REQ_IMP_FREQ = 120
MAX_LOGIC_UTILIZATION = 0.8
MAX_MEMORY_UTILIZATION = 0.8
MAX_DSP_UTILIZATION = 1
MAX_MUL_UTILIZATION = 1
MAX_PIN_UTILIZATION = 0.8
END FAMILY

Use of Embedded FPGA Resources in SHA-3 Candidates ECE 448 – FPGA and ASIC Design with VHDL

Xilinx FPGA Devices
| Technology | Low-cost  | High-performance |
| 120/150 nm |           | Virtex 2, 2 Pro  |
| 90 nm      | Spartan 3 | Virtex 4         |
| 65 nm      |           | Virtex 5         |
| 45 nm      | Spartan 6 |                  |
| 40 nm      |           | Virtex 6         |

Altera FPGA Devices
| Technology | Low-cost    | Mid-range | High-performance |
| 130 nm     | Cyclone     |           | Stratix          |
| 90 nm      | Cyclone II  |           | Stratix II       |
| 65 nm      | Cyclone III | Arria I   | Stratix III      |
| 40 nm      | Cyclone IV  | Arria II  | Stratix IV       |

Basic Operations of 14 SHA-3 Candidates
NTT: Number Theoretic Transform
GF MUL: Galois Field multiplication
MUL: integer multiplication
mADDn: multioperand addition with n operands

[Table: columns Hash Algorithm, DSP Adders, DSP Multipliers, Block Memories; rows BLAKE (Yes, -), BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, SHA-2, Shabal, SHAvite-3, SIMD, Skein]

DSP ADDERS & MULTIPLIERS

DSP Adders
SHA-2, BLAKE: 32-bit or 64-bit addition
BMW: 32-bit or 64-bit multioperand addition
CubeHash: 32-bit addition

DSP Adders
Shabal: 32-bit addition
SIMD: 32-bit multioperand addition
Skein: 64-bit addition

BLOCK MEMORIES

Block Memories used to implement ROM and Round Constants
Hamsi: ROM in message expansion
  8 x 4 x 256 x 32 = 256 kbit in Hamsi-256
  16 x 8 x 256 x 32 = 1 Mbit in Hamsi-512
Keccak, JH, SHA-2: round constants only
BLAKE: permutation
SIMD: twiddle factors
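The two Hamsi ROM sizes can be checked with a few lines of arithmetic. The sketch below also estimates a lower bound on the number of block RAMs needed, assuming 16 kbit of data per block (the Spartan-3 figure without parity bits); the real count depends on how the tables are mapped to port widths, so this is a capacity-only estimate with hypothetical helper names.

```python
from math import ceil, prod

BRAM_DATA_BITS = 16 * 1024  # assumed data capacity of one block RAM, parity excluded

def rom_bits(*dims):
    """Total ROM size in bits for the given table dimensions."""
    return prod(dims)

def min_block_rams(bits, block_bits=BRAM_DATA_BITS):
    """Capacity-only lower bound on the number of memory blocks."""
    return ceil(bits / block_bits)

hamsi256 = rom_bits(8, 4, 256, 32)
hamsi512 = rom_bits(16, 8, 256, 32)

print("Hamsi-256:", hamsi256 // 1024, "kbit, at least", min_block_rams(hamsi256), "blocks")
print("Hamsi-512:", hamsi512 // 1024, "kbit, at least", min_block_rams(hamsi512), "blocks")
```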

PRELIMINARY RESULTS

DSP Adders & Multipliers
[Results table; surviving marks, in order: ✔ ✗ ✗ ✔ ✗]
✔ - Throughput increases
✗ - Throughput decreases (most likely as a result of design error)

Block Memory & Adders
[Results table; surviving marks, in order: ✔ ✗ ✔ ✗ ✔]
✔ - Throughput increases
✗ - Throughput decreases (most likely as a result of design error)

Block Memories used to implement T-boxes/S-boxes
ECHO, SHAvite-3: AES S-boxes (8x8), AES T-boxes (8x32)
Fugue: Fugue T-boxes (8x128)
Groestl: Groestl T-boxes (8x64)

AES: Input, internal state, and output
128 bits = 16 bytes, taken column by column:
  column 0: a0,0 a1,0 a2,0 a3,0
  column 1: a0,1 a1,1 a2,1 a3,1
  column 2: a0,2 a1,2 a2,2 a3,2
  column 3: a0,3 a1,3 a2,3 a3,3
and arranged as a 4 x 4 array:
  a0,0 a0,1 a0,2 a0,3
  a1,0 a1,1 a1,2 a1,3
  a2,0 a2,1 a2,2 a2,3
  a3,0 a3,1 a3,2 a3,3

AES Round

SubBytes
[Figure: each byte ai,j of the state is replaced by bi,j = S-box(ai,j)]

S-box and Inversion in GF(2^8): Hardware
ROM: 2^8 x 8 bits (2^8 words), 8-bit address, 8-bit output
Direct logic: 8 inputs and 8 outputs (x1 ... x8, y1 ... y8)

SubBytes Look-up Table
word8 S[256] = {
     99, 124, 119, 123, 242, 107, 111, 197,  48,   1, 103,  43, 254, 215, 171, 118,
    202, 130, 201, 125, 250,  89,  71, 240, 173, 212, 162, 175, 156, 164, 114, 192,
    183, 253, 147,  38,  54,  63, 247, 204,  52, 165, 229, 241, 113, 216,  49,  21,
      4, 199,  35, 195,  24, 150,   5, 154,   7,  18, 128, 226, 235,  39, 178, 117,
      9, 131,  44,  26,  27, 110,  90, 160,  82,  59, 214, 179,  41, 227,  47, 132,
     83, 209,   0, 237,  32, 252, 177,  91, 106, 203, 190,  57,  74,  76,  88, 207,
    208, 239, 170, 251,  67,  77,  51, 133,  69, 249,   2, 127,  80,  60, 159, 168,
     81, 163,  64, 143, 146, 157,  56, 245, 188, 182, 218,  33,  16, 255, 243, 210,
    205,  12,  19, 236,  95, 151,  68,  23, 196, 167, 126,  61, 100,  93,  25, 115,
     96, 129,  79, 220,  34,  42, 144, 136,  70, 238, 184,  20, 222,  94,  11, 219,
    224,  50,  58,  10,  73,   6,  36,  92, 194, 211, 172,  98, 145, 149, 228, 121,
    231, 200,  55, 109, 141, 213,  78, 169, 108,  86, 244, 234, 101, 122, 174,   8,
    186, 120,  37,  46,  28, 166, 180, 198, 232, 221, 116,  31,  75, 189, 139, 138,
    112,  62, 181, 102,  72,   3, 246,  14,  97,  53,  87, 185, 134, 193,  29, 158,
    225, 248, 152,  17, 105, 217, 142, 148, 155,  30, 135, 233, 206,  85,  40, 223,
    140, 161, 137,  13, 191, 230,  66, 104,  65, 153,  45,  15, 176,  84, 187,  22,
};
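The table does not have to be typed in by hand: the AES S-box is defined as inversion in GF(2^8) followed by a fixed affine transformation, so a ROM initialization file can be generated. The sketch below is one straightforward (not optimized) way to do that; the function names are illustrative, and the inversion uses repeated multiplication rather than anything hardware-oriented.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse as a^254; the input 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def affine(x):
    """AES affine transformation applied after the inversion (constant 0x63)."""
    y = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8)) \
            ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        y |= (bit & 1) << i
    return y

SBOX = [affine(gf_inv(a)) for a in range(256)]
print(SBOX[:16])  # first row of the look-up table
```

The same two steps are what the "direct logic" implementation computes combinationally, whereas the ROM implementation simply stores the 256 results.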

AES SubBytes

ShiftRows
no shift:                a b c d  ->  a b c d
cyclic shift left by 1:  e f g h  ->  f g h e
cyclic shift left by 2:  i j k l  ->  k l i j
cyclic shift left by 3:  m n o p  ->  p m n o

MixColumns
Each column j of the state is multiplied by a fixed matrix over GF(2^8):
\[
\begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix}
=
\begin{bmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{bmatrix}
\begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix}
\]
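As a small worked example of the column transformation, the sketch below applies the standard AES MixColumns matrix (rows 2 3 1 1, 1 2 3 1, 1 1 2 3, 3 1 1 2) to a single column. Multiplication by 2 and 3 is done in GF(2^8) modulo 0x11B, and addition is XOR. The function names are illustrative.

```python
MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]

def xtime(a):
    """Multiply a byte by 2 in GF(2^8) modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

def mul_small(coeff, a):
    """Multiply a byte by a matrix coefficient (1, 2 or 3)."""
    if coeff == 1:
        return a
    if coeff == 2:
        return xtime(a)
    return xtime(a) ^ a  # 3*a = 2*a + a

def mix_column(column):
    """Transform one 4-byte column a0..a3 into b0..b3."""
    result = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= mul_small(coeff, byte)
        result.append(acc)
    return result

print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Because the only coefficients are 1, 2 and 3, the whole operation reduces to shifts and XORs, which is why the slides map MixColumns to plain logic rather than to memory.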

AES MixColumns

AddRoundKey
+ = simple bitwise addition (xor) of round keys

S-box Based Implementation of AES Round

S-box Based Basic Iterative Architecture
[Figure: 128-bit Input -> SubBytes (memory) -> ShiftRows (routing) -> MixColumns (logic) -> AddRoundKey (with Round Key) -> 128-bit Output; all internal data paths are 128 bits wide]

T-box Based Implementation of AES Round

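Fast implementation of the entire round (1)
Column j of the output:
\[
\begin{bmatrix} e_{0,j} \\ e_{1,j} \\ e_{2,j} \\ e_{3,j} \end{bmatrix}
= T_0[a_{0,j}] \oplus T_1[a_{1,(j+1) \bmod 4}] \oplus T_2[a_{2,(j+2) \bmod 4}] \oplus T_3[a_{3,(j+3) \bmod 4}] \oplus
\begin{bmatrix} k_{0,j} \\ k_{1,j} \\ k_{2,j} \\ k_{3,j} \end{bmatrix}
\]
Each Ti table can be implemented using a 256 x 32 bit ROM.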

Table-lookup implementation x3,2 x2,2 x1,2 x0,2 = b2

Look-up Tables T
static const u32 T0[256] = {
    0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
    0xfff2f20dU, 0xd66b6bbdU, 0xde6f6fb1U, 0x91c5c554U,
    0x60303050U, 0x02010103U, 0xce6767a9U, 0x562b2b7dU,
    0xe7fefe19U, 0xb5d7d762U, 0x4dababe6U, 0xec76769aU,
    0x8fcaca45U, 0x1f82829dU, 0x89c9c940U, 0xfa7d7d87U,
    0xeffafa15U, 0xb25959ebU, 0x8e4747c9U, 0xfbf0f00bU,
    . . . . . . . . . . . . .
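Each T0 entry packs one S-box output together with its MixColumns multiples: the bytes of T0[a] are 2·S[a], S[a], S[a], 3·S[a], from most to least significant. The sketch below rebuilds the first eight entries from the first eight S-box values of the SubBytes table; the helper names are illustrative, and only this prefix of the table is generated.

```python
# First eight S-box values, copied from the SubBytes look-up table
SBOX_PREFIX = [99, 124, 119, 123, 242, 107, 111, 197]

def xtime(a):
    """Multiply a byte by 2 in GF(2^8) modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

def t0_entry(s):
    """Pack (2*s, s, s, 3*s) into one 32-bit word, most significant byte first."""
    s2 = xtime(s)
    s3 = s2 ^ s
    return (s2 << 24) | (s << 16) | (s << 8) | s3

T0_PREFIX = [t0_entry(s) for s in SBOX_PREFIX]
print([f"0x{word:08x}" for word in T0_PREFIX])
```

Storing the premultiplied values is what lets SubBytes and MixColumns collapse into a single memory read per byte, at the cost of a 32-bit wide table instead of an 8-bit one.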

Implementing AES Round Using T-box Tables
[Figure: the 128-bit input is split into sixteen bytes ai,j (i = 0..3, j = 0..3); each byte addresses a T table and yields a 32-bit word Ti[ai,j]; an XOR network (encryption) combines these words with the 32-bit round key words Kj (j = 0..3) into four 32-bit words ej (j = 0..3), which form the 128-bit output]

Test Circuit Example

test_circuit: ATHENa Example including embedded FPGA resources

Generic Multiplier (1)
entity mult is
    generic (
        vendor          : integer := XILINX;         -- vendor: XILINX=0, ALTERA=1
        multiplier_type : integer := MUL_DEDICATED;  -- multiplier_type: MUL_LOGIC_BASED=0, MUL_DEDICATED=1
        WIDTH           : integer := 8               -- width: fixed width for inputs and output
    );
    port (
        a : in  std_logic_vector (WIDTH-1 downto 0);
        b : in  std_logic_vector (WIDTH-1 downto 0);
        s : out std_logic_vector (WIDTH-1 downto 0)
    );
end mult;

Generic Multiplier (2)
architecture mult of mult is
begin
    -- Exactly one of the four generate blocks is active for a given
    -- combination of vendor and multiplier_type.
    xil_dsp_mult_gen : if (multiplier_type = MUL_DEDICATED and vendor = XILINX) generate
        mult_xil : entity work.mult(xilinx_dsp)
            generic map ( WIDTH => WIDTH )
            port map ( a => a, b => b, s => s );
    end generate;
    xil_logic_mult_gen : if (multiplier_type = MUL_LOGIC_BASED and vendor = XILINX) generate
        mult_xil : entity work.mult(xilinx_logic)
            generic map ( WIDTH => WIDTH )
            port map ( a => a, b => b, s => s );
    end generate;
    alt_dsp_mult_gen : if (multiplier_type = MUL_DEDICATED and vendor = ALTERA) generate
        mult_alt : entity work.mult(altera_dsp)
            generic map ( WIDTH => WIDTH )
            port map ( a => a, b => b, s => s );
    end generate;
    alt_logic_mult_gen : if (multiplier_type = MUL_LOGIC_BASED and vendor = ALTERA) generate
        mult_alt : entity work.mult(altera_logic)
            generic map ( WIDTH => WIDTH )
            port map ( a => a, b => b, s => s );
    end generate;
end mult;

Generic Multiplier (3)
architecture xilinx_logic of mult is
    signal temp1 : std_logic_vector(2*WIDTH-1 downto 0);
    attribute mult_style : string;
    attribute mult_style of temp1 : signal is "lut";    -- LUT-based multiplier
begin
    temp1 <= STD_LOGIC_VECTOR(unsigned(a) * unsigned(b));
    s <= temp1(WIDTH-1 downto 0);                       -- keep the lower WIDTH bits
end xilinx_logic;

architecture xilinx_dsp of mult is
    signal temp2 : std_logic_vector(2*WIDTH-1 downto 0);
    attribute mult_style : string;
    attribute mult_style of temp2 : signal is "block";  -- dedicated multiplier
begin
    temp2 <= STD_LOGIC_VECTOR(unsigned(a) * unsigned(b));
    s <= temp2(WIDTH-1 downto 0);
end xilinx_dsp;

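Generic Multiplier (4)
architecture altera_logic of mult is
    signal temp : std_logic_vector(2*WIDTH-1 downto 0);
    attribute multstyle : string;
    attribute multstyle of altera_logic : architecture is "logic";  -- logic-based multiplier
begin
    temp <= STD_LOGIC_VECTOR(unsigned(a) * unsigned(b));
    s <= temp(WIDTH-1 downto 0);
end altera_logic;

architecture altera_dsp of mult is
    signal temp : std_logic_vector(2*WIDTH-1 downto 0);
    attribute multstyle : string;
    attribute multstyle of altera_dsp : architecture is "dsp";      -- dedicated multiplier
begin
    temp <= STD_LOGIC_VECTOR(unsigned(a) * unsigned(b));
    s <= temp(WIDTH-1 downto 0);
end altera_dsp;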

FPGA Embedded Resources

Embedded Multipliers

Multipliers in Spartan 3
The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

Number of Multipliers per Spartan 3 Device

Combinational and Registered Multiplier

Dedicated Multiplier Block

Interface of a Dedicated Multiplier

Cyclone II

Embedded Multiplier Block Overview
Each Cyclone II device has one to three columns of embedded multipliers.
Each embedded multiplier can be configured to support
  one 18 x 18 multiplier, or
  two 9 x 9 multipliers

Number of Embedded Multipliers

Multiplier Block Architecture

Two Multiplier Types
Two 9x9 multipliers
One 18x18 multiplier

Multiplier Stage
Signals signa and signb identify whether the corresponding inputs are signed or unsigned.

Three (3) ways to use dedicated (embedded) hardware
Inference
Instantiation
CoreGen in Xilinx / MegaWizard Plug-In Manager in Altera

Inferred Multiplier
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult18x18 is
    generic (
        word_size   : natural := 18;
        signed_mult : boolean := true);
    port (
        clk : in  std_logic;
        a   : in  std_logic_vector(word_size-1 downto 0);
        b   : in  std_logic_vector(word_size-1 downto 0);
        c   : out std_logic_vector(2*word_size-1 downto 0));
end entity mult18x18;

architecture infer of mult18x18 is
begin
    -- Registered multiplier: the product is captured on the rising clock edge
    process(clk)
    begin
        if rising_edge(clk) then
            if signed_mult then
                c <= std_logic_vector(signed(a) * signed(b));
            else
                c <= std_logic_vector(unsigned(a) * unsigned(b));
            end if;
        end if;
    end process;
end architecture infer;
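The effect of the signed_mult generic is easiest to see on concrete bit patterns. The sketch below is a small software model (not a replacement for simulation) of the multiplication performed in the process above: both operands are word_size bits wide and the result is truncated to 2*word_size bits. The function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret a bit pattern of the given width as a two's complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mult_model(a, b, word_size=18, signed_mult=True):
    """Return the 2*word_size-bit product pattern for two word_size-bit patterns."""
    mask_in = (1 << word_size) - 1
    a &= mask_in
    b &= mask_in
    if signed_mult:
        a, b = to_signed(a, word_size), to_signed(b, word_size)
    return (a * b) & ((1 << (2 * word_size)) - 1)

all_ones = 0x3FFFF  # 18 bits set: -1 if signed, 262143 if unsigned
print(hex(mult_model(all_ones, all_ones, signed_mult=True)))
print(hex(mult_model(all_ones, all_ones, signed_mult=False)))
```

The same input pattern yields very different products in the two modes, which is why the interpretation has to be fixed either in the HDL (signed versus unsigned types) or through the sign control inputs of the hardware multiplier.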

Forcing a particular implementation in VHDL
Synthesis tool: Xilinx XST
attribute MULT_STYLE : string;
attribute MULT_STYLE of c : signal is "block";
Allowed values of the attribute:
  block - dedicated multiplier
  lut - LUT-based multiplier
  pipe_block - pipelined dedicated multiplier
  pipe_lut - pipelined LUT-based multiplier
  auto - automatic choice by the synthesis tool

Instantiation for Spartan 3 FPGAs

DSP Units

Xilinx XtremeDSP
Starting with the Virtex 4 family, Xilinx introduced the DSP48 block for high-speed DSP on FPGAs.
Essentially a multiply-accumulate core with many other features.
Now also in Spartan-3A, Spartan 6, Virtex 5, and Virtex 6.

DSP48 Slice: Virtex 4

Simplified Form of DSP48 Adder Out = (Z ± (X + Y + CIN))

Choosing Inputs to DSP Adder P = Adder Out = (Z ± (X + Y + CIN))
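A behavioral model makes the adder equation concrete. The sketch below evaluates Out = Z ± (X + Y + CIN) with wraparound at 48 bits, the adder width implied by the DSP48 name; the selection of what drives X, Y and Z is left out, and the function and parameter names are illustrative.

```python
WIDTH = 48
MASK = (1 << WIDTH) - 1

def dsp48_adder(z, x, y, cin=0, subtract=False):
    """Model of Out = Z +/- (X + Y + CIN), truncated to 48 bits."""
    inner = x + y + cin
    out = z - inner if subtract else z + inner
    return out & MASK  # wrap around like a fixed-width hardware adder

print(dsp48_adder(10, 3, 4, cin=1))                  # 10 + (3 + 4 + 1)
print(dsp48_adder(10, 3, 4, cin=1, subtract=True))   # 10 - (3 + 4 + 1)
print(hex(dsp48_adder(0, 1, 0, subtract=True)))      # 0 - 1 wraps to all ones
```

Feeding the previous output back as Z turns the same adder into an accumulator, which is how the block serves both plain additions and multiply-accumulate operations.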

DSP48E Slice : Virtex5

New in Virtex 5 Compared to Virtex 4

Stratix III DSP Unit

Embedded Memories

Memory Types
Memory: RAM | ROM
Memory: Single port | Dual port
Memory: With asynchronous read | With synchronous read

Memory Types in Xilinx
Memory: Distributed (MLUT-based) | Block RAM-based (BRAM-based)
Memory: Inferred | Instantiated (Manually | Using Core Generator)

Memory Types in Altera
Memory: Distributed (ALUT-based, Stratix III onwards) | Memory block-based
  Memory block-based: Small size (512) | Medium size (4K, 9K, 20K) | Large size (144K, 512K)
Memory: Inferred | Instantiated (Manually | Using MegaWizard Plug-In Manager)

Inference vs. Instantiation

Block RAM (Spartan-3, dual-port: Port A, Port B)
Most efficient memory implementation
  Dedicated blocks of memory
Ideal for most memory requirements
  4 to 104 memory blocks
  18 kbits = 18,432 bits per block (16 kbits without parity bits)
  Use multiple blocks for larger memories
Builds both single and true dual-port RAMs
Synchronous write and read (different from distributed RAM)
Notes: The block RAM is true dual-port, which means it has two independent read and write ports, and these ports can be read and/or written simultaneously, independently of each other. All control logic is implemented within the RAM, so no additional CLB logic is required to implement a dual-port configuration. The Altera 10KE and ACEX 1K families have only 2-port RAM; to emulate dual-port capability, they would need twice the number of memory blocks, at half the performance.

Block RAM can have various configurations (port aspect ratios)
16k x 1 (addresses 0 to 16,383)
8k x 2 (addresses 0 to 8,191)
4k x 4 (addresses 0 to 4,095)
2k x (8+1) (addresses 0 to 2,047)
1024 x (16+2) (addresses 0 to 1,023)

Block RAM Port Aspect Ratios

Single-Port Block RAM DO[w-p-1:0] DI[w-p-1:0]

Dual-Port Block RAM
DOA[wA-pA-1:0] DIA[wA-pA-1:0] DOB[wB-pB-1:0] DIB[wB-pB-1:0]

Block RAM library components
| Component  | Data Cells Depth | Data Cells Width | Parity Cells | Address Bus | Data Bus |
| RAMB16_S1  | 16384            | 1                | -            | (13:0)      | (0:0)    |
| RAMB16_S2  | 8192             | 2                |              | (12:0)      | (1:0)    |
| RAMB16_S4  | 4096             | 4                |              | (11:0)      | (3:0)    |
| RAMB16_S9  | 2048             | 8                |              | (10:0)      | (7:0)    |
| RAMB16_S18 | 1024             | 16               |              | (9:0)       | (15:0)   |
| RAMB16_S36 | 512              | 32               |              | (8:0)       | (31:0)   |
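The depth, width and bus ranges in this table follow one rule: every configuration stores the same 16 kbit of data, so doubling the data width halves the depth and removes one address bit. The sketch below checks that rule for the six listed components; the dictionary and function names are illustrative, and parity bits are not modeled.

```python
# (depth, data width) of the data cells, copied from the table
COMPONENTS = {
    "RAMB16_S1": (16384, 1),
    "RAMB16_S2": (8192, 2),
    "RAMB16_S4": (4096, 4),
    "RAMB16_S9": (2048, 8),
    "RAMB16_S18": (1024, 16),
    "RAMB16_S36": (512, 32),
}

DATA_BITS = 16 * 1024  # data capacity of one block, parity excluded

def address_bits(depth):
    """Number of address lines needed for a power-of-two depth."""
    return depth.bit_length() - 1

for name, (depth, width) in COMPONENTS.items():
    assert depth * width == DATA_BITS
    print(f"{name}: address bus ({address_bits(depth) - 1}:0), data bus ({width - 1}:0)")
```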

Cyclone II Memory Blocks The embedded memory structure consists of columns of M4K memory blocks that can be configured as RAM, first-in first-out (FIFO) buffers, and ROM

Memory Modes
The M4K memory blocks support the following modes:
  Single-port RAM (RAM:1-Port)
  Simple dual-port RAM (RAM: 2-Port)
  True dual-port RAM (RAM:2-Port)
  Tri-port RAM (RAM:3-Port)
  Single-port ROM (ROM:1-Port)
  Dual-port ROM (ROM:2-Port)

Single-Port ROM
The address lines of the ROM are registered.
The outputs can be registered or unregistered.
A .mif file is used to initialize the ROM contents.

Stratix II TriMatrix Memory

Stratix III & Stratix IV TriMatrix Memory

Stratix II & III Shift-Register Memory Configuration