A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside.

Presentation transcript:

A Fast On-Chip Profiler Memory
Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation

2 Outline
–Introduction
–Problem Definition
–Profiling Techniques
–Pipelined Binary Search Tree
–ProMem
–Conclusions

3 Introduction
Goal: Determine # of Times Each Target Pattern Appears on the Bus
Our Solution: Add On-Chip Profiler Memory to Monitored Bus
–Accepts 1 pattern/cycle
–Keeps Exact Counts
[Figure: Processor with I$ and D$, Mem, Bridge, and Per.; ProMem monitors the embedded bus]

4 Introduction
[Figure: system with Mem, Processor, Per., and FPGA. Profile information charts instructions executed for Loop A … Loop N; Loop A has the most instructions executed. Flow: Profile, Move Loop A to HW, Synthesis, Configure FPGA]
prog.c:
void compute() {
  // small Loop A
  for(i=0;…;…) …
  // small Loop B
  for(x=0;…;…)
}
…

5 Introduction
Profiling Can Be Used to Solve Many Problems
–Optimization of frequently executed subroutines
–Mapping frequently executed code and data to non-interfering cache regions
–Synthesis of optimized hardware for common cases
–Identifying frequent loops to map to a small low-power loop cache
–Many Others!

6 Problem Definition
Objective
–Count number of times each target pattern appears on bus B
Requirements
–Accept input patterns on every clock cycle
–Monitoring any bus, e.g., deeply embedded buses in SOCs
–Non-intrusive
–Exact target pattern count
Definitions
–Target Patterns TP = {tp_1, …, tp_m}
–Target Pattern Counts CTP = {ctp_1, …, ctp_m}
–Input Patterns P = {p_1, …, p_m}
[Figure: table pairing each target pattern tp_1 … tp_m with its count ctp_1 … ctp_m; input patterns p_1, p_2, …, p_m appear on bus B, which connects Mem, Processor, and Per.]
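The objective above can be restated as a small software reference model. This is only a sketch for checking expected counts, not the hardware described in the talk. The function name, the 32-bit pattern width, and the separate input length n are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model (hypothetical helper): for every input pattern p[k],
 * increment the count of the matching target pattern, if there is one.
 * tp  : m target patterns (assumed distinct)
 * ctp : m counters, overwritten with the exact counts
 * p   : n input patterns observed on the bus, in arrival order */
void count_target_patterns(const uint32_t *tp, uint64_t *ctp, size_t m,
                           const uint32_t *p, size_t n)
{
    assert(m == 0 || (tp != NULL && ctp != NULL));
    assert(n == 0 || p != NULL);

    for (size_t j = 0; j < m; j++)
        ctp[j] = 0;

    for (size_t k = 0; k < n; k++) {
        for (size_t j = 0; j < m; j++) {
            if (p[k] == tp[j]) {
                ctp[j]++;
                break; /* targets are distinct, so at most one match */
            }
        }
    }
}
```

Input patterns that match no target are simply ignored, which mirrors the requirement that only target patterns are counted.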

7 Profiling Techniques - Software
Instrumenting Software
–Adding code to count frequencies of desired code regions
Problems
–Incurs runtime overhead
–Possibly changes program behavior
–Increase in code size
prog.c: for( … ){ … ctp_m++; }
[Figure: instrumented prog.c runs on the Processor; patterns p_1, p_2, …, p_m on the bus connecting Mem, Processor, and Per.]
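A minimal sketch of what instrumentation looks like in practice, assuming a loop that sums an array. The counter name ctp_loop_a and both function names are hypothetical. The increment inside the loop is the added code, and it is also the source of the runtime and code-size overhead listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile counter for the instrumented loop. */
static uint64_t ctp_loop_a;

/* Sums n values. The counter increment is the only instrumentation. */
int32_t sum_instrumented(const int32_t *data, int n)
{
    assert(n >= 0);
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];
        ctp_loop_a++; /* added code: runs on every iteration */
    }
    return sum;
}

/* Returns how many loop iterations have been counted so far. */
uint64_t loop_a_count(void)
{
    return ctp_loop_a;
}
```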

8 Profiling Techniques - Software
Periodic Sampling
–Interrupt processor at periodic interval
–Read program counter and other internal registers
Problems
–Disruption of runtime behavior during interrupt
–Inaccurate
prog.c: // ISR period = 10ms
ISR{ //update profile info }
[Figure: prog.c runs on the Processor; patterns p_1, p_2, …, p_m on the bus connecting Mem, Processor, and Per.]
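The inaccuracy of sampling can be illustrated with a toy model. The sketch below assumes an execution trace given as an array of code-region identifiers, one per time step. That representation and both function names are assumptions for illustration only. A region that runs only between sample points is never observed.

```c
#include <assert.h>
#include <stddef.h>

/* Exact number of time steps spent in region `target`. */
size_t exact_hits(const int *trace, size_t n, int target)
{
    size_t hits = 0;
    for (size_t k = 0; k < n; k++)
        if (trace[k] == target)
            hits++;
    return hits;
}

/* Number of times region `target` is seen when only every
 * `period`-th time step is sampled, starting at step 0. */
size_t sampled_hits(const int *trace, size_t n, size_t period, int target)
{
    assert(period > 0);
    size_t hits = 0;
    for (size_t k = 0; k < n; k += period)
        if (trace[k] == target)
            hits++;
    return hits;
}
```

For the trace {0,1,0,0,0,1,0,0,0,1,0,0} and a period of 4, region 1 executes three times but is sampled zero times.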

9 Profiling Techniques - Software
Simulation
–Execute application on instruction set simulator
–Simulator keeps track of profile information
Problems
–Difficult to model external environment, which leads to inaccuracy
–Extremely slow
[Figure: prog.c runs on the ISS, which produces profile information]

10 Profiling Techniques - Hardware
Logic Analyzer
–Probes placed directly on bus to be monitored
Problems
–Cannot monitor embedded buses
[Figure: patterns p_1, p_2, …, p_m on the bus connecting Mem, Processor, and Per.]

11 Profiling Techniques - Hardware
Processor Support
–Mainly event counters
–Monitored events include cache misses, pipeline stalls, etc.
Problems
–Few registers available
–Reconfiguration needed to obtain a complete profile
–Leads to inaccuracy
[Figure: patterns p_1, p_2, …, p_m on the bus connecting Mem, Processor, and Per.]

12 Profiling Techniques - Hardware
Content-addressable memories (CAMs)
–Fast search for a key in a large data set
–Returns the address at which the key resides in a memory
Types
–Fully Associative
–RAM coupled with a smart controller
[Figure: CAM attached to the bus connecting Mem, Processor, and Per.; patterns p_1, p_2, …, p_m on the bus]

13 Profiling Techniques - Hardware
Fully Associative CAMs
–Simultaneously compares every location with the key
Problems
–Does not scale well to larger memories
–Increased access time as CAM size grows
–Large Power Consumption
[Figure: CAM holding tp_1, tp_2, tp_3, …, tp_m, with one comparator (=) per entry, attached to the bus connecting Mem, Processor, and Per.]

14 Profiling Techniques - Hardware
RAM coupled with a smart controller
–Efficient lookup data structure in memory, such as a binary tree or Patricia trie
Problems
–Multiple cycle lookup
[Figure: CAM built from a controller (Ctrl) and SRAM, attached to the bus connecting Mem, Processor, and Per.]
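To make the multiple-cycle lookup concrete, the sketch below stands in for the controller with a plain binary search over a sorted table and counts memory reads. The slide names a binary tree or Patricia trie, so the sorted-array search is a substitute chosen for brevity. Treating one read as roughly one cycle is an assumption, as are the function name and the 32-bit key width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary search over a sorted table of m keys.
 * Returns the index of `key`, or -1 if it is absent.
 * *reads receives the number of table reads performed, which is used
 * here as a rough stand-in for lookup cycles. */
long ram_lookup(const uint32_t *sorted, size_t m, uint32_t key,
                unsigned *reads)
{
    assert(reads != NULL);
    size_t lo = 0, hi = m;
    *reads = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t value = sorted[mid]; /* one memory read */
        (*reads)++;
        if (value == key)
            return (long)mid;
        if (value < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

With seven entries a lookup can take up to three reads, so a non-pipelined controller cannot accept a new pattern on every cycle.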

15 Observations
–Not necessary to have 1 cycle lookup
–Only need to accept one input pattern every cycle

16 Queueing
Hold input patterns in queue until we are able to process them
Problems
–Does not work with patterns arriving every clock cycle
[Figure: FIFO placed between bus B and the CAM (Ctrl + SRAM)]
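A short simulation shows why a queue alone cannot keep up. The sketch assumes one pattern arrives every cycle and a non-pipelined lookup unit that needs a fixed number of cycles per pattern. The fixed latency, the function name, and the parameters are illustrative assumptions.

```c
#include <assert.h>

/* Returns the first cycle at which an arriving pattern finds the FIFO
 * full, or -1 if no overflow occurs within max_cycles.
 * depth   : FIFO capacity in patterns
 * latency : cycles the lookup unit needs per pattern (>= 1) */
long first_overflow_cycle(unsigned depth, unsigned latency, long max_cycles)
{
    assert(latency >= 1);
    unsigned occupancy = 0; /* patterns waiting in the FIFO */
    unsigned busy = 0;      /* cycles left on the current lookup */

    for (long cycle = 0; cycle < max_cycles; cycle++) {
        if (occupancy == depth)
            return cycle;   /* no room for this cycle's pattern */
        occupancy++;        /* one pattern arrives every cycle */

        if (busy == 0) {    /* lookup unit is free: start next lookup */
            occupancy--;
            busy = latency;
        }
        busy--;
    }
    return -1;
}
```

With a latency of 1 the queue never fills. With any latency above 1 the occupancy grows without bound, so a FIFO of any finite depth eventually overflows.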

17 Pipelining
–Implemented in processors such that instructions can be executed every cycle
–Can we use pipelining to solve our problem?

18 Pipelined CAM
Large CAMs require long access times
Partition large CAM into several smaller CAMs
–Requires pipelining to reduce access time
–Provides solution to access time problem
–Requires Large Area
–Large Power Consumption
[Figure: smaller CAMs separated by pipeline registers]

19 Pipelined CAM
Entries can be stored in a CAM in any order
–Requires sequential lookup in pipelined CAM approach
Is there a benefit to sorting the entries?
–Not necessary to search all entries
–Leads to faster lookup time
Tree structure provides an inherently sorted structure
–Search time remains a problem
–Can we pipeline the structure?

20 Pipelined Tree
Solves access time problem
–One memory access per level
Solves area problem
–Single comparator per level
–Each level grows by factor of two
–For large memories, comparators are negligible
[Figure: tree levels, each with a single comparator (=)]
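The area argument can be checked with a little arithmetic. The sketch below assumes a complete binary tree in which level s holds 2^s entries, so n levels hold 2^n - 1 target patterns and need n comparators, while a fully associative CAM needs one comparator per entry. The function names are hypothetical.

```c
#include <assert.h>

/* Entries stored at tree level s (level 0 is the root). */
unsigned long stage_entries(unsigned s)
{
    assert(s < 31);
    return 1UL << s;
}

/* Smallest number of levels n such that 2^n - 1 >= m target patterns.
 * This is also the comparator count, one per level. */
unsigned tree_stages(unsigned long m)
{
    assert(m <= 0x7fffffffUL);
    unsigned n = 0;
    while (((1UL << n) - 1) < m)
        n++;
    return n;
}

/* A fully associative CAM compares every entry: m comparators. */
unsigned long cam_comparators(unsigned long m)
{
    return m;
}
```

For 1023 target patterns this gives 10 comparators for the tree against 1023 for the fully associative CAM.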

21 Pipelined Binary Search Tree
–Root Node
–Each node has at most two children
–Left child > Parent
–Right child < Parent
[Figure: binary search tree with root h; second level d, j; third level b, f, i, k; fourth level a, c, e, g]

22 Pipelined Binary Search Tree
Searching for Input Pattern: f
–Stage 0: f < h, go right
–Stage 1: f > d, go left
–Stage 2: f = f, Found!
[Figure: the tree from the previous slide, one level per stage (Stage 0 to Stage 3)]

23 Pipelined Binary Search Tree
Searching for Input Pattern: e
–Stage 0: e < h, append 0 to address (address 0)
–Stage 1: e > d, append 1 to address (address 01)
–Stage 2: e < f, append 0 to address (address 010)
–Stage 3: e = e, Found!
[Figure: the tree, one level per stage (Stage 0 to Stage 3)]
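The address-building search can be sketched in C as follows. Only the path h, d, f, e and its addresses come from the slides. The placement of the remaining letters, the use of 0 to mark an unused slot, and the function name are assumptions. Each stage is an array indexed by the address built so far, with 0 appended for "less than" and 1 for "greater than".

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 4

/* One array per stage; stage s holds 2^s slots. 0 marks an unused slot. */
static const char stage0[1] = {'h'};
static const char stage1[2] = {'d', 'j'};
static const char stage2[4] = {'b', 'f', 'i', 'k'};
static const char stage3[8] = {'a', 'c', 'e', 'g', 0, 0, 0, 0};
static const char *const tree[STAGES] = {stage0, stage1, stage2, stage3};

/* Searches for pattern p. On success returns the stage where it was
 * found and stores its address within that stage in *addr.
 * Returns -1 if p is not a target pattern. */
int search(char p, unsigned *addr)
{
    assert(addr != NULL);
    unsigned a = 0; /* address built one bit per stage */
    for (int s = 0; s < STAGES; s++) {
        char node = tree[s][a];
        if (node == 0)
            return -1;          /* ran into an unused slot */
        if (p == node) {
            *addr = a;
            return s;
        }
        /* append 0 if p is smaller than the node, 1 if larger */
        a = (a << 1) | (p > node ? 1u : 0u);
    }
    return -1;
}
```

Searching for e visits h, d, and f, then finds e in stage 3 at address 010 (decimal 2), matching the walk on the slide.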

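24 Pipelined Binary Search Tree
Searching for Input Pattern: e, f
Pattern e
–Stage 0: e < h, append 0 to address (address 0)
–Stage 1: e > d, append 1 to address (address 01)
–Stage 2: e < f, append 0 to address (address 010)
–Stage 3: e = e, Found!
Pattern f (follows e through the pipeline)
–Stage 0: f < h, append 0 to address (address 0)
–Stage 1: f > d, append 1 to address (address 01)
–Stage 2: f = f, Found!
[Figure: the tree, one level per stage (Stage 0 to Stage 3)]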
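The overlap of several searches can be modelled cycle by cycle. In this sketch each stage has an input register, and a new pattern enters stage 0 on every cycle while earlier patterns move down the tree. The register layout, the memory contents beyond the h, d, f, e path, the use of 0 for unused slots, and all names are assumptions, and input patterns are assumed to be nonzero.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 4

typedef struct {
    int cen;       /* 1 if this register holds a live search */
    char p;        /* input pattern being searched */
    unsigned addr; /* address into this stage's memory */
} PipeReg;

/* Target pattern memories, one per stage (0 marks an unused slot). */
static const char tpm0[1] = {'h'};
static const char tpm1[2] = {'d', 'j'};
static const char tpm2[4] = {'b', 'f', 'i', 'k'};
static const char tpm3[8] = {'a', 'c', 'e', 'g', 0, 0, 0, 0};
static const char *const tpm[STAGES] = {tpm0, tpm1, tpm2, tpm3};

/* Count memories, one per stage, zero-initialised. */
static unsigned cm0[1], cm1[2], cm2[4], cm3[8];
static unsigned *const cm[STAGES] = {cm0, cm1, cm2, cm3};

/* Feeds n patterns, one per cycle, then runs until the pipeline drains. */
void run_pipeline(const char *patterns, size_t n)
{
    PipeReg reg[STAGES] = {{0, 0, 0}}; /* reg[s] is stage s's input */
    size_t fed = 0;

    for (size_t cycle = 0; cycle < n + STAGES; cycle++) {
        /* Evaluate back to front so each stage sees last cycle's input. */
        for (int s = STAGES - 1; s >= 0; s--) {
            PipeReg in = reg[s];
            PipeReg out = {0, 0, 0};
            if (in.cen) {
                char target = tpm[s][in.addr];
                if (target == in.p) {
                    cm[s][in.addr]++;          /* found: update count */
                } else if (target != 0) {
                    out.cen = 1;               /* not found: pass on */
                    out.p = in.p;
                    out.addr = (in.addr << 1) | (in.p > target ? 1u : 0u);
                }
            }
            if (s + 1 < STAGES)
                reg[s + 1] = out;
        }
        /* Load the next input pattern, if any, into stage 0. */
        reg[0].cen = fed < n;
        reg[0].p = fed < n ? patterns[fed++] : 0;
        reg[0].addr = 0;
    }
}

/* Reads one counter; addr must be valid for the given stage. */
unsigned count_at(int stage, unsigned addr)
{
    assert(stage >= 0 && stage < STAGES && addr < (1u << stage));
    return cm[stage][addr];
}
```

Feeding the sequence e, f, e leaves a count of 2 for e (stage 3, address 2) and 1 for f (stage 2, address 1), with no stalls between patterns.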

25 Pipelined Binary Search Tree
[Figure: tree levels mapped to Stage 0 through Stage 3, each stage implemented with standard memories]

26 ProMem – Module Design
ProMem stage s
–Inputs: Input Pattern (p_s_i), Search Address (A_s_i), Enable (cen_i)
–Pipeline regs: p_s, A_s, cen
–Outputs: Input Pattern (p_s_o), Search Address (Next Stage) (A_s+1_o), Enable (Next Stage) (cen_o)

27 ProMem – Module Design
ProMem stage s
–Target Pattern Memory TPM_s (2^s × w), ports: addr, rd, dout
[Figure: stage s with pipeline regs p_s, A_s, cen; inputs p_s_i, A_s_i, cen_i; outputs p_s_o, A_s+1_o, cen_o]

28 ProMem – Module Design
ProMem stage s
–Search for Target Pattern in TPM_s (2^s × w), ports: addr, rd, dout
–Compare (>, =)
–Target Pattern Found
–Target Pattern Not Found – Enable Next Stage
[Figure: stage s with pipeline regs p_s, A_s, cen; inputs p_s_i, A_s_i, cen_i; outputs p_s_o, A_s+1_o, cen_o]

29 ProMem – Module Design
ProMem stage s
–Target Pattern Count Memory CM_s (2^s × c), ports: wr, rd, addr, dout
[Figure: stage s with pipeline regs p_s, A_s, cen; TPM_s (2^s × w) with ports addr, rd, dout; Compare (>, =); inputs p_s_i, A_s_i, cen_i; outputs p_s_o, A_s+1_o, cen_o]

30 ProMem – Module Design
ProMem stage s
–When Target Pattern Found - Update Count Value (+1)
[Figure: stage s with pipeline regs p_s, A_s, cen; TPM_s (2^s × w); CM_s (2^s × c) with a +1 incrementer; Compare (>, =); inputs p_s_i, A_s_i, cen_i; outputs p_s_o, A_s+1_o, cen_o]

31 ProMem – Module Design
ProMem stage s consists of
–Pipeline Register
–Memories
–Module Controller
[Figure: stage s with pipeline regs p_s, A_s, cen; TPM_s (2^s × w); CM_s (2^s × c) with a +1 incrementer; Compare (>, =); inputs p_s_i, A_s_i, cen_i; outputs p_s_o, A_s+1_o, cen_o]
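The behaviour of a single module can be approximated in software as a function from the stage's input ports to its output ports. This is a behavioural sketch under assumptions, not the authors' RTL. The struct and function names are hypothetical, patterns and counts are modelled as 32-bit words, and every slot of the target pattern memory is assumed to hold a valid pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values carried between stages (the pipeline register contents). */
typedef struct {
    uint32_t p;    /* input pattern */
    uint32_t addr; /* search address for the receiving stage */
    int cen;       /* enable: 1 if the search is still live */
} StagePorts;

/* The two memories of stage s, each with 2^s words. */
typedef struct {
    const uint32_t *tpm; /* target pattern memory */
    uint32_t *cm;        /* target pattern count memory */
} Stage;

/* One step of stage s: read the target pattern at the search address,
 * compare, then either update the count (found) or enable the next
 * stage with the extended address (not found). */
StagePorts promem_stage(Stage *st, StagePorts in)
{
    assert(st != NULL && st->tpm != NULL && st->cm != NULL);
    StagePorts out;
    out.p = in.p;
    out.addr = 0;
    out.cen = 0;

    if (!in.cen)
        return out;                       /* stage disabled this cycle */

    uint32_t target = st->tpm[in.addr];
    if (in.p == target) {
        st->cm[in.addr] += 1;             /* found: count value + 1 */
    } else {
        out.cen = 1;                      /* not found: enable next stage */
        out.addr = (in.addr << 1) | (in.p > target ? 1u : 0u);
    }
    return out;
}
```

Chaining such stages, each with memories twice the size of the previous one, gives the pipelined tree described on the earlier slides.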

32 ProMem - Interface
Simple Interface
–Internal interface
  Enable signal
  Connection to monitored bus
–External interface
  Read enable
  Write enable
  Connection to ProMem pattern input bus
[Figure: ProMem attached to the bus connecting Mem, Processor, and Per.; signals ren, wen, addr, cen, clk]

33 ProMem - Layout
Efficient Layout
–Achieved by simply abutting each module with the next
–Results in very short bus wires between each module
[Figure: ProMem stage s with pipeline regs, TPM_s (2^s × w), CM_s (2^s × c), Compare (>, =), and +1 incrementer]

34 ProMem Results – Area*
Module overhead only 1%
*Area obtained using UMC 0.18 µm technology library provided by Artisan Components

35 ProMem Results – vs. CAM
CAM design is 46% larger than ProMem

36 ProMem Results – Timing vs. CAM
–CAM access time grows with CAM size
–ProMem access time remains constant (Due to Pipelining)

37 Conclusions
–Introduced a new memory structure specifically for fast on-chip profiling
–One pattern per cycle throughput
–Simple interface to monitored bus
–Efficient design is very scalable