AN ASYNCHRONOUS BUS BRIDGE FOR PARTITIONED MULTI-SOC ARCHITECTURES ON FPGAS REPORTER: HSUAN-JU LI 2014/04/09 Field Programmable Logic and Applications.

AN ASYNCHRONOUS BUS BRIDGE FOR PARTITIONED MULTI-SOC ARCHITECTURES ON FPGAS
Daniel Kliem, Sven-Ole Voigt. Field Programmable Logic and Applications (FPL), 23rd International Conference on, Sept. 2013.
REPORTER: HSUAN-JU LI 2014/04/09

Outline
Introduction; Architecture And Related Work; Asynchronous Secure Bridge; Multi-System FPGA Designs; Benchmarks Results And Discussion; Conclusion

Introduction
Software applications of safety- and security-critical embedded systems are often divided into several self-contained functions. Segregation between individual system partitions and functions is used to confine error propagation. Soft processors are one order of magnitude slower than hard-wired devices in terms of operating frequency.

Introduction (cont.)
Current FPGA families provide wide and fast memory attachments, mostly implemented as hard macros that are faster than configurable logic. This leaves a performance gap between soft processors and the memory attachment. The proposed architecture combines the specific needs of partitioned software with the flexibility of reconfigurable hardware.

Introduction (cont.)
The architecture places multiple self-contained systems on a single platform FPGA and shares the available memory bandwidth among them in a predictable and scalable way. Its main building blocks are secure bus bridges, which are used to form a segregated hierarchy of memory busses.

Introduction (cont.)
With secure bus bridges, it is possible to use soft processors for safety- and security-critical functions and to reach high assurance levels with far less effort.

Architecture And Related Work
The architecture achieves domain segregation with Secure Bus Bridges (SECBRGs). On-chip resources are local and replicated; off-chip resources are shared for cost reduction.

Architecture And Related Work (cont.)
Multi-Processor SoCs (MPSoCs) often comprise shared caches. CPUs and I/O units that are assigned to one particular software function influence the execution times of other functions on different cores, so the Worst-Case Execution Time (WCET) of real-time applications on MPSoCs is less predictable. MPSoCs with Multi-Port Memory Controllers (MPMCs), which connect the local systems to individual ports, would be scalable in terms of performance.

Architecture And Related Work (cont.)
The backbone-based approach is modular and uses individually verifiable SECBRGs. It keeps segregation and memory attachment in different places.

Architecture And Related Work (cont.)
Local resources are used exclusively. Neither interrupts nor interface overload propagate into neighboring systems.

Architecture And Related Work (cont.)
The necessary spatial segregation is achieved by address translation into non-overlapping memory partitions. Caching can therefore be implemented without coherency support between different bridges.

Architecture And Related Work (cont.)
If the backbone bandwidth were larger than the accumulated local bus bandwidths, the multi-system architecture would scale perfectly. The asynchronous SECBRGs can connect the local busses to wider and faster backbones, so temporal conflicts between memory accesses occur at a lower rate and scalability for multiple local systems is ensured.
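The scaling condition above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes peak bandwidth is bus width times clock frequency (one transfer per cycle), and the bus widths and frequencies are hypothetical values, not figures from the paper.

```python
def peak_bandwidth(width_bits: int, freq_mhz: float) -> float:
    """Peak bandwidth in Mbit/s, assuming one transfer per clock cycle."""
    return width_bits * freq_mhz


def backbone_headroom(local_buses, backbone) -> float:
    """Backbone peak bandwidth divided by the accumulated local peak bandwidth.

    A value above 1.0 means the backbone could, in the ideal case, serve
    all local buses at full rate at the same time.
    """
    accumulated = sum(peak_bandwidth(w, f) for w, f in local_buses)
    return peak_bandwidth(*backbone) / accumulated


# Hypothetical configuration: four 32-bit local buses at 50 MHz
# behind one 128-bit backbone at 100 MHz.
local_buses = [(32, 50.0)] * 4
backbone = (128, 100.0)
ratio = backbone_headroom(local_buses, backbone)
print(f"backbone / accumulated local bandwidth = {ratio}")
```

With these assumed numbers the backbone has twice the accumulated local bandwidth. Adding more local systems lowers the ratio, and access conflicts become more likely as it approaches 1.0.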

Asynchronous Secure Bridge

Asynchronous Secure Bridge (cont.): Command Queue and Result Queue
The Command Queue (CQ) and the Result Queue (RQ) are implemented as FPGA-specific hard macro blocks (BRAMs).

Asynchronous Secure Bridge (cont.): Address Translation unit
The Address Translation unit maps each local system to an exclusive, non-overlapping main memory area (spatial segregation).
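A minimal sketch of spatial segregation by address translation follows. It assumes a simple base-plus-offset mapping with a bounds check; the class and function names and the partition sizes are hypothetical, and the slides do not specify how the hardware unit actually translates addresses.

```python
class AddressTranslationError(Exception):
    """Raised when a local address falls outside its partition."""


class AddressTranslator:
    """Maps a local address range onto one exclusive main memory partition."""

    def __init__(self, base: int, size: int):
        self.base = base  # start of the partition in main memory
        self.size = size  # partition size in bytes

    def translate(self, local_addr: int) -> int:
        # Reject any access that would leave the partition.
        if not 0 <= local_addr < self.size:
            raise AddressTranslationError(hex(local_addr))
        return self.base + local_addr


def partitions_overlap(translators) -> bool:
    """Return True if any two partitions share at least one address."""
    spans = sorted((t.base, t.base + t.size) for t in translators)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))


# Two hypothetical local systems with 256 MiB each.
sys_a = AddressTranslator(base=0x0000_0000, size=0x1000_0000)
sys_b = AddressTranslator(base=0x1000_0000, size=0x1000_0000)
```

The same local address lands in different main memory locations for each system, so neither system can reach the other's partition.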

Asynchronous Secure Bridge (cont.): Bridge Slave and Cache Controller (BSCC)
The BSCC uses a copy-back caching strategy with valid and dirty flags. On a cache hit, local accesses are serviced directly by the BSCC. On a cache miss, fetch commands (address and size of the data) are placed into the command queue (CQ).
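The hit and miss handling can be sketched as a small behavioral model. This assumes a direct-mapped organization with hypothetical line and cache sizes; the slides do not give the cache organization. The write-back of a dirty victim line is standard copy-back behavior rather than a detail taken from this slide.

```python
from collections import deque
from dataclasses import dataclass

LINE_BYTES = 32  # hypothetical cache line size
NUM_LINES = 4    # hypothetical number of lines (direct-mapped)


@dataclass
class CacheLine:
    valid: bool = False
    dirty: bool = False
    tag: int = 0


class BridgeSlaveCache:
    """Behavioral model of a copy-back cache that queues commands on a miss."""

    def __init__(self):
        self.lines = [CacheLine() for _ in range(NUM_LINES)]
        self.command_queue = deque()  # stands in for the CQ

    def access(self, addr: int, write: bool = False) -> bool:
        """Return True on a hit, False on a miss."""
        line_no = addr // LINE_BYTES
        index = line_no % NUM_LINES
        tag = line_no // NUM_LINES
        line = self.lines[index]
        hit = line.valid and line.tag == tag
        if not hit:
            if line.valid and line.dirty:
                # Copy-back: a modified victim line must reach memory.
                victim_addr = (line.tag * NUM_LINES + index) * LINE_BYTES
                self.command_queue.append(("write", victim_addr, LINE_BYTES))
            # A fetch command carries the address and size of the data.
            self.command_queue.append(("fetch", line_no * LINE_BYTES, LINE_BYTES))
            line.valid, line.dirty, line.tag = True, False, tag
        if write:
            line.dirty = True  # memory is updated only on replacement
        return hit
```

In this model only misses generate backbone traffic, which is why caching reduces backbone utilization.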

Asynchronous Secure Bridge (cont.): Burst Unit (BU)
The BU expands the fetch commands from the CQ into multiple sequential read operations (bursts) and provides them to the Bridge Master Controller (BMC).
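The expansion of a fetch command into bursts can be illustrated as follows. The beat width and maximum burst length are hypothetical parameters, and the function name is made up for this sketch; the slides do not state the actual bus parameters.

```python
def expand_fetch(addr: int, size: int, beat_bytes: int = 4, max_beats: int = 8):
    """Expand one fetch command into bursts of sequential read addresses.

    Assumes addr and size are multiples of beat_bytes.
    """
    if addr % beat_bytes or size % beat_bytes:
        raise ValueError("address and size must be beat-aligned")
    beat_addrs = list(range(addr, addr + size, beat_bytes))
    # Split the sequential beats into bursts of at most max_beats reads.
    return [beat_addrs[i:i + max_beats]
            for i in range(0, len(beat_addrs), max_beats)]


# A 32-byte fetch on a 4-byte bus with bursts of up to 4 beats.
bursts = expand_fetch(0x100, 32, beat_bytes=4, max_beats=4)
for burst in bursts:
    print([hex(a) for a in burst])
```

Here one queued command becomes two bursts of four sequential reads each, so the queue entry stays compact while the master still issues efficient burst transfers.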

Asynchronous Secure Bridge (cont.): Cache cell replacement (to service a cache miss)
Fetch and write commands are finally issued out of order by the BMC. The BSCC can proceed with the replacement operation of a cache line while the BMC services the previous fetch.

Asynchronous Secure Bridge (cont.): Cache cell replacement (to service a cache miss)
The BSCC blocks the waiting local master until the fetched data arrives in the RQ, but it does not wait until all write commands have been serviced by the BMC. With this scheme, queue latency does not block the cache.
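One way to picture out-of-order issuing is a policy that lets a fetch overtake queued writes, so the blocked local master gets its data sooner. The slides do not describe the BMC's actual policy. The rule below, including the same-address hazard check, is an assumption made for illustration, and `issue_order` is a hypothetical name.

```python
def issue_order(commands):
    """Return commands in the order a fetch-first master would issue them.

    Assumed policy: the oldest fetch may overtake earlier writes unless one
    of those writes targets the same address, because the fetch would then
    read stale data. Otherwise the oldest command is issued.
    """
    pending = list(commands)
    issued = []
    while pending:
        chosen = 0  # default: oldest command
        for i, (kind, addr) in enumerate(pending):
            if kind == "fetch":
                earlier_writes = {a for k, a in pending[:i] if k == "write"}
                if addr not in earlier_writes:
                    chosen = i
                break  # only the oldest fetch is considered
        issued.append(pending.pop(chosen))
    return issued


# Two write-backs are queued ahead of the fetch the local master waits for.
queue = [("write", 0x100), ("write", 0x140), ("fetch", 0x200)]
print(issue_order(queue))
```

Under this assumed policy the fetch is issued first and the writes drain afterwards, which matches the idea that the cache waits for fetched data but not for pending writes.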

Asynchronous Secure Bridge (cont.)
Caching, bursting, out-of-order prefetching, and different clock ratios are used to reduce backbone utilization and to achieve scalable performance. Because exclusive access to memory locations is ensured, copy-back caching and out-of-order issuing are possible. Complex cache coherence mechanisms are avoided, which leads to simpler and more scalable system designs.

Multi-System FPGA Designs
Asynchronous secure bridges improve place-and-route without relaxing the timing constraints, even when the same frequency is used in all domains. Local busses and the backbone can be driven by independent clocks.

Multi-System FPGA Designs (cont.)
The prototype implementations are based on the open source VHDL IP-core library GRLIB by Aeroflex Gaisler and target a Xilinx ML605 board (Virtex 6 LX240T). One- to eight-fold GRLIB-based prototype designs reach clock frequencies as high as 133 MHz.

Benchmarks Results And Discussion
Two metrics are reported: the Relative Execution Time (RET) and the Backbone Idle Ratio (BIR). RET is the quotient of the measured local clock ticks spent by a benchmark program and a reference tick value. BIR relates the idle cycles of the backbone to the total number of cycles at backbone frequency.
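Both metrics are plain ratios, as the following sketch shows. The function names and the sample counter values are hypothetical and are not measurements from the paper.

```python
def relative_execution_time(local_ticks: int, reference_ticks: int) -> float:
    """RET: measured local clock ticks divided by a reference tick value."""
    return local_ticks / reference_ticks


def backbone_idle_ratio(idle_cycles: int, total_cycles: int) -> float:
    """BIR: idle backbone cycles divided by all cycles at backbone frequency."""
    return idle_cycles / total_cycles


# Hypothetical counter values for one benchmark run.
ret = relative_execution_time(local_ticks=1500, reference_ticks=1000)
bir = backbone_idle_ratio(idle_cycles=750, total_cycles=1000)
print(f"RET = {ret}, BIR = {bir}")
```

A RET above 1.0 means the benchmark took more local ticks than the reference run, and a BIR close to 1.0 means the backbone was mostly idle and has capacity for further local systems.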

Conclusion
The paper presents a novel design of an asynchronous secure bus bridge that partitions multi-SoC FPGA designs into multiple clock domains and reliably shares the available bandwidth among multiple soft processor systems. Individual clock domains ensure repeatable place-and-route results that scale well with increasing numbers of local systems. The proposed architecture is able to overcome the bandwidth gap between soft processors and the memory attachment and to put soft processors to practical use for safety- and security-critical applications.

THANK YOU