Taeweon Suh § Hsien-Hsin S. Lee § Shih-Lien Lu † John Shen †

Initial Observations of Hardware/Software Co-simulation using FPGA in Architecture Research
Taeweon Suh §, Hsien-Hsin S. Lee §, Shih-Lien Lu †, John Shen †
February 12, 2006
§ Georgia Institute of Technology, † Intel Corporation

Hardware/Software Co-simulation
Software simulation
- Advantages: flexible, observable, easy to implement
- Disadvantage: intolerable simulation time
Hardware emulation
- Advantages: significant speedup, concurrent execution
- Disadvantages: much less flexible and observable; low-level design takes longer to implement and validate
Hardware/software co-simulation
- Tries to retain the advantages of both approaches
- Basic idea: implement time-consuming software functions in an FPGA, while the rest of the simulator interacts with the FPGA
Georgia Tech, Intel - WARFP 2006

Experiment Equipment
- Intel server system
- ACE FPGA board
- UART
- Logic analyzer
- Pentium-III
- Host PC

Communication Method
Communication between the Pentium-III and the FPGA
- The FSB is used as the communication medium
- One page of memory is allocated for communication
- Sending data to the FPGA: the page uses write-through cache mode, so each store appears on the FSB as a “write” bus transaction
- Receiving data from the FPGA: on a “read” bus transaction to the page, the FPGA supplies the data through a cache-to-cache transfer
[Diagram: Pentium-III (MESI), a memory controller with 2GB SDRAM, and the FPGA (Virtex-II) on the front-side bus (FSB), with labels for the “write” bus transaction, the “read” bus transaction, the “cache-to-cache transfer”, and a cache line “FLUSH”.]
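A toy host-side model makes the protocol concrete, in particular why a cached result line has to be flushed before the FPGA can see the next read. This is a Python sketch under stated assumptions, not the authors' driver or simulator code: the class `WriteThroughPage` and its methods are hypothetical, stores are modeled as write-through without allocation, and loads allocate the line, so only a miss puts a read on the FSB for the FPGA to answer.

```python
LINE_SIZE = 32  # bytes per Pentium-III cache line, as stated in the slides


class WriteThroughPage:
    """Toy host-side view of the shared communication page (illustrative only).

    Assumed behavior: stores are write-through and do not allocate, so every
    store reaches the FSB as a write; loads allocate the line, so a repeated
    load hits in the cache and the FPGA never sees it until the line is flushed.
    """

    def __init__(self):
        self.cached_lines = set()  # line addresses currently held in the host cache
        self.bus_log = []          # (transaction type, line address) visible on the FSB

    @staticmethod
    def line_of(addr):
        return addr - addr % LINE_SIZE

    def store(self, addr):
        # Write-through: the store always goes out on the front-side bus.
        self.bus_log.append(("write", self.line_of(addr)))

    def load(self, addr):
        line = self.line_of(addr)
        if line not in self.cached_lines:
            # Miss: a bus read that the FPGA can answer with a cache-to-cache transfer.
            self.bus_log.append(("read", line))
            self.cached_lines.add(line)

    def flush(self, addr):
        # Evict the line so the next load misses and goes back to the bus.
        self.cached_lines.discard(self.line_of(addr))


if __name__ == "__main__":
    page = WriteThroughPage()
    page.store(0x40)  # send arguments
    page.load(0x40)   # fetch the result: bus read
    page.load(0x40)   # cache hit: invisible to the FPGA
    page.flush(0x40)
    page.load(0x40)   # bus read again after the flush
    print(page.bus_log)  # [('write', 64), ('read', 64), ('read', 64)]
```

Without the flush, the second request would be served from the host cache and the FPGA would never be asked for a new result.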

Hardware/Software Implementation
Hardware (FPGA) implementation
- State machines
  - Monitor bus transactions on the FSB
  - Check the bus transaction type (read or write)
  - Manage cache-to-cache transfers
- Software functions implemented in the FPGA
- Debugging logic and statistics counters
Software implementation
- Linux device driver
  - The FPGA needs to know when to respond to FSB transactions, so communication requires a specific physical address
  - The driver allocates one page of memory for FPGA access
- Simulator modified to access the FPGA
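The FPGA-side state machines can be sketched behaviorally in Python. Everything here is an illustrative assumption: `FpgaResponder`, its single result register, and the statistics dictionary are hypothetical stand-ins for synthesized logic on the Virtex-II, and the argument is taken from the page offset of the snooped address, following the encoding described in the backup slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

PAGE_SIZE = 4096  # one 4 KB communication page allocated by the driver


@dataclass
class FpgaResponder:
    """Behavioral sketch of the FPGA state machines (not the actual HDL).

    Snoops FSB transactions, ignores anything outside the communication page,
    runs the offloaded function on writes, and returns the stored result on
    reads, which models supplying the line via a cache-to-cache transfer.
    """

    page_base: int                    # physical address handed out by the driver
    function: Callable[[int], int]    # the software function moved into the FPGA
    result: Optional[int] = None      # last computed result
    stats: dict = field(default_factory=lambda: {"writes": 0, "reads": 0, "ignored": 0})

    def snoop(self, kind: str, address: int) -> Optional[int]:
        """Return data to drive on the bus, or None to let memory respond."""
        if address // PAGE_SIZE != self.page_base // PAGE_SIZE:
            self.stats["ignored"] += 1   # not the communication page
            return None
        offset = address % PAGE_SIZE     # argument encoded in the page offset
        if kind == "write":
            self.stats["writes"] += 1
            self.result = self.function(offset)
            return None
        if kind == "read":
            self.stats["reads"] += 1
            return self.result
        raise ValueError(f"unknown transaction type: {kind!r}")


if __name__ == "__main__":
    fpga = FpgaResponder(page_base=0x1234000, function=lambda arg: arg * 2)
    fpga.snoop("write", 0x1234000 + 21)   # argument 21 arrives in the address
    print(fpga.snoop("read", 0x1234000))  # 42
    print(fpga.snoop("read", 0x9999000))  # None: another page
    print(fpga.stats)
```

The statistics counters mirror the debugging counters listed above; in hardware they would be registers readable over a separate path such as the UART.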

Example: SimpleScalar Co-simulation
- Preliminary experiment to check correctness
- A simple function (mem_access_latency) was implemented in the FPGA

Co-simulation results:

Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
mcf         2:18:38            2:20:50                 +0:02:12
bzip2       3:03:58            3:06:50                 +0:02:52
crafty      2:56:38            2:59:28                 +0:02:50
eon-cook    2:43:52            2:45:45                 +0:01:53
gcc-166     3:45:30            3:48:56                 +0:03:26
parser      3:34:57            3:37:27                 +0:02:30
perl        2:42:30            2:45:50                 +0:03:20
twolf       2:43:30            2:45:28                 +0:01:58
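The reported differences, and the relative slowdown they imply, can be recomputed with a few lines of Python. The pairing of times to benchmarks follows the order in which the slide lists them; the helper names are only for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:m:s' string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# (baseline, co-simulation) wall-clock times, in the order listed on the slide
RESULTS = {
    "mcf": ("2:18:38", "2:20:50"),
    "bzip2": ("3:03:58", "3:06:50"),
    "crafty": ("2:56:38", "2:59:28"),
    "eon-cook": ("2:43:52", "2:45:45"),
    "gcc-166": ("3:45:30", "3:48:56"),
    "parser": ("3:34:57", "3:37:27"),
    "perl": ("2:42:30", "2:45:50"),
    "twolf": ("2:43:30", "2:45:28"),
}


def slowdown(baseline: str, cosim: str) -> tuple:
    """Return (extra seconds, extra time as a percentage of the baseline)."""
    base, co = to_seconds(baseline), to_seconds(cosim)
    return co - base, 100.0 * (co - base) / base


if __name__ == "__main__":
    for name, (base, co) in RESULTS.items():
        extra, pct = slowdown(base, co)
        print(f"{name:10s} +{extra:4d} s  ({pct:.2f}% slower)")
```

Every run is slower under co-simulation, by roughly 1.1% to 2.1% of the baseline, which fits the analysis on the next slide: offloading a function this small does not pay for the FSB round trip.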

Co-simulation Results Analysis
- FSB access is expensive
  - About 20 FSB cycles (≈ 160 CPU cycles) per transfer
  - A full cache line (32 bytes) must be transferred for each cache-to-cache transfer
  - The P-III's MESI protocol also updates main memory on every cache-to-cache transfer
- The mem_access_latency function is too simple
  - Even in software it takes at most a few dozen CPU cycles
- Device driver overhead
  - The device driver adds system overhead
  - The communication page occupies one TLB entry that the simulation would otherwise use
- To benefit from a hardware implementation, the offloaded routines must be time-consuming and the FPGA must be accessed infrequently enough to amortize the bus cost
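A back-of-the-envelope cost model shows why this experiment could not speed anything up. The sketch reuses the slide's figures (about 20 FSB cycles, or about 160 CPU cycles, per transfer, which implies a CPU-to-FSB clock ratio of about 8) and assumes two transfers per offloaded call: one write to send the arguments and one read to collect the result. Driver and TLB overheads are left out, so the real break-even point is higher. `mem_access_latency` is a Python rendering of the SimpleScalar sim-outorder routine with its default `-mem:lat 18 2` and `-mem:width 8` settings, included only to show how little work it does; the other function names are hypothetical.

```python
FSB_CYCLES_PER_TRANSFER = 20   # "~20 FSB cycles" per transfer (slide)
CPU_PER_FSB_CYCLE = 8          # implied by "20 FSB cycles (~160 CPU cycles)"
TRANSFERS_PER_CALL = 2         # assumption: one write out, one read back


def mem_access_latency(blk_sz, mem_lat=(18, 2), mem_bus_width=8):
    """Python rendering of SimpleScalar's mem_access_latency (sim-outorder).

    Returns the *simulated* memory latency; computing it on the host takes
    only a handful of integer operations.
    """
    chunks = (blk_sz + (mem_bus_width - 1)) // mem_bus_width
    assert chunks > 0
    return mem_lat[0] + mem_lat[1] * (chunks - 1)


def offload_cost_cpu_cycles(fpga_compute_cpu_cycles=0):
    """Host CPU cycles per offloaded call, ignoring driver and TLB overhead.

    fpga_compute_cpu_cycles is the FPGA's compute time expressed in host cycles.
    """
    bus = TRANSFERS_PER_CALL * FSB_CYCLES_PER_TRANSFER * CPU_PER_FSB_CYCLE
    return bus + fpga_compute_cpu_cycles


def worth_offloading(software_cpu_cycles, fpga_compute_cpu_cycles=0):
    """True only if the routine costs more in software than the round trip."""
    return software_cpu_cycles > offload_cost_cpu_cycles(fpga_compute_cpu_cycles)


if __name__ == "__main__":
    print(mem_access_latency(32))      # simulated latency of a 32-byte block: 24
    print(offload_cost_cpu_cycles())   # 320 CPU cycles of bus traffic per call
    print(worth_offloading(40))        # a few dozen cycles in software: False
```

Under these assumptions each call spends roughly 320 CPU cycles on the bus before the FPGA does any work, about an order of magnitude more than the few dozen cycles `mem_access_latency` costs in software.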

On-going Work
- SoftSDV co-simulation for multi-core research
- Implement distributed lowest-level caches and an interconnection network, such as a ring or mesh, in the FPGA
[Diagram: eight CPUs (CPU0 to CPU7) with L1/L2 caches, connected through ring interfaces (Ring I/F) to the L3, with the L3 and the ring placed in the FPGA.]
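One way to start modeling the proposed multi-core setup is a latency function for a distributed L3 on a ring. This is a hypothetical sketch: the slide fixes only the eight CPUs and the ring, so the line-interleaved slice mapping, shortest-direction routing on a bidirectional ring, and the `slice_latency` and `hop_latency` values are assumptions chosen for illustration.

```python
NUM_NODES = 8   # CPU0..CPU7, as drawn on the slide
LINE_SIZE = 32  # assumed interleaving granularity (bytes)


def home_slice(address, num_nodes=NUM_NODES, line_size=LINE_SIZE):
    """Assumed static mapping: consecutive cache lines go to consecutive L3 slices."""
    return (address // line_size) % num_nodes


def ring_hops(src, dst, num_nodes=NUM_NODES):
    """Hop count on a bidirectional ring, taking the shorter direction."""
    d = abs(src - dst) % num_nodes
    return min(d, num_nodes - d)


def l3_latency(cpu, address, slice_latency=20, hop_latency=2):
    """Hypothetical L3 hit latency: request and reply both cross the ring."""
    hops = ring_hops(cpu, home_slice(address))
    return slice_latency + 2 * hops * hop_latency


if __name__ == "__main__":
    for cpu in range(NUM_NODES):
        print(cpu, l3_latency(cpu, 0x80))  # line 0x80 maps to slice 4
```

Per-access logic like this is cheap in software, but simulating every core's traffic through the ring concurrently is the kind of workload the slides argue would benefit from hardware modeling.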

Conclusions
- Proposed a new co-simulation methodology
- A preliminary co-simulation using SimpleScalar demonstrates the functional correctness of the methodology
- Hardware/software implementation
  - Communication between the P-III and the FPGA via the FSB
  - Linux driver
- Co-simulation results indicate that
  - Bus (FSB) access is expensive
  - Linux driver overhead also needs to be overcome
  - Time-consuming blocks are the ones worth emulating in hardware
- Multi-core co-simulation would benefit from the FPGA
  - Distributed low-level caches and the interconnection network are complex enough to benefit from hardware modeling

Questions, comments?
Thanks for your attention!

Backup Slides

Communication Details
- All FSB signals are mapped to FPGA pins
- In the SimpleScalar example, software function arguments are encoded in the FSB address
  - The 4KB communication page is set to write-through mode
  - The lower 12 bits of the FSB address (the page offset) are free to use
  - The upper 24 bits (of the 36-bit physical address) come from TLB translation
[Diagram: Pentium-III (MESI) and Xilinx Virtex-II on the front-side bus (FSB).]
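The address-encoding trick works because TLB translation replaces only the page number and leaves the 12-bit page offset untouched, so an argument placed in the offset of a virtual address reaches the FSB unchanged. A minimal Python sketch, with a dictionary standing in for the page table; `encode_request`, `translate`, and `decode_request` are hypothetical names, and arguments are assumed to fit in 12 bits.

```python
PAGE_SHIFT = 12                       # 4 KB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1   # the 12 low bits left untouched by translation


def encode_request(page_vaddr, arg):
    """Place a small function argument in the page offset of the mapped page."""
    if page_vaddr & OFFSET_MASK:
        raise ValueError("communication page must be 4 KB aligned")
    if not 0 <= arg <= OFFSET_MASK:
        raise ValueError("argument does not fit in 12 offset bits")
    return page_vaddr | arg


def translate(vaddr, page_table):
    """Model the TLB: replace the virtual page number, keep the offset."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK
    return (page_table[vpn] << PAGE_SHIFT) | offset


def decode_request(fsb_address):
    """FPGA side: split the snooped address into page base and argument."""
    return fsb_address & ~OFFSET_MASK, fsb_address & OFFSET_MASK


if __name__ == "__main__":
    page_table = {0x12345: 0xABCDE}          # hypothetical VPN -> PFN mapping
    vaddr = encode_request(0x12345000, 32)   # e.g. blk_sz = 32
    paddr = translate(vaddr, page_table)
    base, arg = decode_request(paddr)
    print(hex(paddr), hex(base), arg)        # 0xabcde020 0xabcde000 32
```

The FPGA only has to compare the upper bits against the page the driver allocated and can read the argument straight off the low address lines, with no data phase needed to deliver it.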