Shobana Padmanabhan Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain,


1 Liquid Architecture: Extracting & Improving Micro-architecture Performance on Reconfigurable Architectures. Shobana Padmanabhan, Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain, Jason Fritts. Washington University in St. Louis. Funded by NSF under grant Sep 22

2 Application Performance: Architecture, Compiler, Algorithm

3 Customization cost/performance tradeoff: Generic, FPGA, Custom. Generic processor - cheap but application-agnostic; compilers exist; compiler optimization is the key. Reconfigurable logic - subject of our study; architecture and compiler research are the key. Customized logic - ideal for an application but expensive; logic/architecture research is key.

4 Liquid architecture combines the best of all options Standard Architecture Standardized ISA, existing compilers Liquid Architecture on FPGA ISA + extras, can use modified open-source tools Custom Architecture on Integrated Circuit × One-of-a-kind, nonstandard

5 Liquid architecture combines the best of all options Standard Architecture Standardized ISA, existing compilers × Not optimized for any specific application Liquid Architecture on FPGA ISA + extras, can use modified open-source tools Hardware can be optimized for specific application Custom Architecture on Integrated Circuit × One-of-a-kind, nonstandard Optimized for specific application

6 Liquid architecture combines the best of all options Standard Architecture Standardized ISA, existing compilers × Not optimized for any specific application × Fixed instructions and hardware Liquid Architecture on FPGA ISA + extras, can use modified open-source tools Hardware can be optimized for specific application Reconfigurable ISA; ~100 µs to 100 ms; person-hours, not $millions Custom Architecture on Integrated Circuit × One-of-a-kind, nonstandard Optimized for specific application × Fixed instructions and hardware

7 Liquid architecture combines the best of all options Standard Architecture Standardized ISA, existing compilers × Not optimized for any specific application × Fixed instructions and hardware ~ $200 - $500 Liquid Architecture on FPGA ISA + extras, can use modified open-source tools Hardware can be optimized for specific application Reconfigurable ISA; ~100 µs to 100 ms; person-hours, not $millions ~ $200 - $2000 Custom Architecture on Integrated Circuit × One-of-a-kind, nonstandard Optimized for specific application × Fixed instructions and hardware × ~ $500, ,000,000+

8 Hardware platform overview FPGA Standard ISA SPARC 8 Instrumentation and variations FPX Interface support modules (VHDL) Memory, Network interface chip, … Internet Development Workstation FPX research was supported by NSF: ANI and Xilinx Corp.

9 Hardware platform details FPX FPGA

10 Hardware platform details FPX Core I-CACHE D-CACHE Cache Controller LEON - SPARC8 compatible & Open soft core LEON

11 Hardware platform details FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM LEON - SPARC8 compatible & Open soft core LEON

12 Application execution FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation program gcc BLASTN DNA Sequence Comparison

13 Application runtime FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Results & Timing Slow! Where is time spent?

14 Software approach to profiling “time” Start with the program Introduce timers Run the instrumented program Execution Timings Timers must account for their own overhead Instrumented program will run slower Instrumentation skews runtime as it affects system behavior such as cache, …
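
The software approach on this slide can be sketched in a few lines. This is a minimal illustration rather than the authors' tooling: `find_match` is a hypothetical stand-in workload, and the calibration strategy (timing an empty region and taking the minimum) is an assumption. It shows why the timers must account for their own overhead, and it cannot correct for the second problem the slide names, namely that the inserted code also perturbs cache and pipeline behavior.

```python
import time


def measure_timer_overhead(samples=10_000):
    """Estimate the cost of one start/stop timer pair by timing an empty region."""
    best = float("inf")
    for _ in range(samples):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        best = min(best, stop - start)
    return best


def timed_call(func, overhead_ns, *args):
    """Run func(*args); return (result, elapsed_ns) with timer overhead subtracted."""
    start = time.perf_counter_ns()
    result = func(*args)
    stop = time.perf_counter_ns()
    return result, max(0, stop - start - overhead_ns)


def find_match(n):
    # Hypothetical stand-in for a profiled method; not the BLASTN routine.
    return sum(i * i for i in range(n))


overhead = measure_timer_overhead()
result, elapsed = timed_call(find_match, overhead, 100_000)
print(f"timer overhead ~{overhead} ns, find_match took ~{elapsed} ns")
```

Even with the subtraction, the instrumented program still runs slower and behaves differently from the original, which is the motivation for moving the measurement into hardware.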

15 Profiling is free with liquid architecture!

16 Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation pc Statistics Module Event monitor bus Request Timings

17 Method Time / Cycles .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd Choose methods to profile from the user interface Liquid architecture: cycle-accurate profiling for free

18 Method Address Range .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd 0x400003EF Liquid architecture: cycle-accurate profiling for free Hi 0x C Lo

19 Method .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd 0x400003EF Hi 0x C Lo 0x A Stats Module PC CLK Event Monitor Bus Liquid architecture: cycle-accurate profiling for free

20 Function .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd 0x400003EF Hi 0x C Lo 0x A ≤≤ Counter Stats Module PC CLK Event Monitor Bus Liquid architecture: cycle-accurate profiling for free INCR

21 Function .text main addQuery findMatch computeKey computeBase coreLoop fillQuery Rnd 0x400003EF Hi 0x C Lo 0x A ≤≤ Counter PC CLK 0x F Hi 0x400005D8 Lo 0x A ≤≤ Counter Stats Module Event Monitor Bus Liquid architecture: cycle-accurate profiling for free INCR

22 0x400003EF Hi 0x C Lo 0x A ≤≤ Counter PC CLK 0x F Hi 0x400005D8 Lo 0x A ≤≤ Counter Stats Module Event Monitor Bus Liquid architecture: cycle-accurate profiling for free To Command Controller INCR

23 Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation pc Statistics Module Event monitor bus Request Timings findMatch 500ms coreLoop 300ms
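
The counting scheme in the preceding slides (a Lo/Hi address pair per function, a `Lo ≤ PC ≤ Hi` comparison each clock, and a counter that increments on a match) can be modeled in software as follows. This is a behavioral sketch only: the real statistics module is hardware on the FPGA, and the address ranges, the PC trace, and the names `RangeCounter` and `profile` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RangeCounter:
    """One comparator pair plus counter, as drawn in the Stats Module slides."""
    name: str
    lo: int
    hi: int
    cycles: int = 0

    def clock(self, pc):
        # INCR when Lo <= PC <= Hi on this clock edge.
        if self.lo <= pc <= self.hi:
            self.cycles += 1


def profile(pc_trace, counters):
    """Feed one PC value per clock cycle to every counter in parallel."""
    for pc in pc_trace:
        for counter in counters:
            counter.clock(pc)
    return {c.name: c.cycles for c in counters}


# Hypothetical address ranges; the real ones come from the program's .text layout.
counters = [
    RangeCounter("findMatch", 0x40000100, 0x400001FF),
    RangeCounter("coreLoop", 0x40000200, 0x400002FF),
]

# Hypothetical trace: a repeated PC models a cycle in which the pipeline stalled.
pc_trace = [0x40000100, 0x40000104, 0x40000104, 0x40000200, 0x40000204, 0x40000300]

totals = profile(pc_trace, counters)
print(totals)
```

Because the comparison happens beside the processor rather than inside the program, the profiled application executes exactly the same instructions as the unprofiled one.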

24 “Where time was spent” for BLASTN…

25 Cycle-accurate profiling No application overhead Hence, at full speed

26 Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus pc Is cache the problem?

27 Software approach to profiling cache Not possible to profile by coding!! Simulate cache behavior Cache Simulator Timings Slow !!

28 Software approach to profiling “cache” Scale down the program Simulate cache behavior Cache Simulator Timings Cannot afford to simulate the entire program Not possible to profile by coding!!
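
For comparison, a software cache simulator of the kind these slides refer to might look like the sketch below. It is a deliberately tiny direct-mapped model, not the simulator used in the study: the 16-byte line size and the synthetic address trace are assumptions, and only the 1 KB capacity is taken from the deck. A realistic simulator has to replay every memory reference of the program, which is why it is too slow for the whole application.

```python
def simulate_direct_mapped(addresses, cache_bytes=1024, line_bytes=16):
    """Replay an address trace through a direct-mapped cache; return (hits, misses)."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one tag per cache line, None = invalid
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block the address falls in
        index = block % num_lines      # the single line that block can occupy
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses


# Synthetic trace (an assumption): walk a 2 KB array of 4-byte words twice.
trace = [4 * i for _ in range(2) for i in range(512)]

small = simulate_direct_mapped(trace, cache_bytes=1024)
large = simulate_direct_mapped(trace, cache_bytes=4096)
print("1 KB:", small, "4 KB:", large)
```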

29 How do we detect and report cache behavior using Liquid Architecture?

30 Interface extends to include cache behavior options… Liquid architecture: cache behavior for free Function Time / Cycles .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd

31 Function Time / Cycles .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd Cache Hits / Misses Read Write

32 Cache profiling FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus pc

33 Cache behavior Hits and misses in LEON

34 Cache behavior These signals are fed into the Event Monitoring Bus

35 Cache behavior Statistics Module

36 Cache behavior Statistics Module Statistics Module counts events

37 Cache profiling FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus Reads hits misses Writes hits misses pc

38 % Cache hit rate for D-cache: 1KB Function-wise cache profiling, in reasonable time
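
The function-wise hit rate can be derived from the four event counts the statistics module reports (read hits, read misses, write hits, write misses). The sketch below assumes each event arrives tagged with the function whose address range contained the PC at that moment; the event tuples and the helper names `count_events` and `hit_rate` are hypothetical.

```python
from collections import Counter


def count_events(events):
    """Tally (function, operation, outcome) events, as the Statistics Module counts them."""
    counts = Counter()
    for function, op, outcome in events:
        counts[(function, op, outcome)] += 1
    return counts


def hit_rate(counts, function, op):
    """Percent hit rate for one function and one operation; None if no accesses were seen."""
    hits = counts[(function, op, "hit")]
    misses = counts[(function, op, "miss")]
    total = hits + misses
    return 100.0 * hits / total if total else None


# Hypothetical event stream, not measured data.
events = [
    ("findMatch", "read", "hit"),
    ("findMatch", "read", "hit"),
    ("findMatch", "read", "miss"),
    ("findMatch", "read", "hit"),
    ("coreLoop", "write", "hit"),
]

counts = count_events(events)
print(hit_rate(counts, "findMatch", "read"))
```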

39 Liquid architecture enables fast, accurate results Seconds: fast, but no cache performance data available

40 Liquid architecture enables fast, accurate results Days: so slow you wouldn’t do this on the whole program

41 Liquid architecture enables fast, accurate results ½ hour: Practical, reasonably fast, totally accurate

42 Function Time / Cycles Cache Hits / Misses Read Write .text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd Pipeline Stalls Branch Predict Can profile all other aspects of micro-architecture too…

43 How do we use the profiling info to improve application performance?

44 Reconfigure micro-architecture

45

46

47 Reconfiguration FPGA Control S/W Interface Command Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Statistics Module Event monitor bus FPX program gcc Workstation Core I-CACHE D-CACHE Cache Controller I-CACHE D-CACHE Cache Controller

48 Cache hits after D-cache reconfiguration

49

50 Conclusion for “large” run: D-cache doesn’t make much difference. Hit rate is already very high

51 Cache hits after D-cache reconfiguration

52 Conclusion for “small” run: Larger cache helps… Increased Associativity does not help as much
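
The kind of design-space sweep behind these conclusions can be illustrated with a small set-associative model with LRU replacement. Everything concrete here is an assumption: the 16-byte line size, the LRU policy, and the synthetic trace. The numbers it prints describe that synthetic trace only and are not the BLASTN measurements; the point is simply that capacity and associativity are independent knobs whose benefit depends on the access pattern.

```python
from collections import OrderedDict


def hit_ratio(addresses, cache_bytes, ways, line_bytes=16):
    """Replay a trace through a set-associative LRU cache; return the hit ratio."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]   # per-set LRU order, oldest first
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)                  # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)             # evict the least recently used
            lines[block] = True
    return hits / len(addresses)


# Synthetic trace (an assumption): two sequential passes over a 2 KB array.
trace = [4 * i for _ in range(2) for i in range(512)]

for size, ways in [(1024, 1), (1024, 2), (4096, 1)]:
    print(f"{size} B, {ways}-way: {hit_ratio(trace, size, ways):.3f}")
```

On the FPGA the same sweep is done by resynthesizing the LEON core with different cache parameters and rerunning the application at full speed.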

53 App runtime after I-cache reconfiguration

54 Larger I-cache doubles application performance for both “small” and “large” runs

55 What have we learned about BLASTN?

56 ½ execution time in two methods

57 What have we learned about BLASTN? ½ execution time in two methods D-cache size not an influence on performance

58 What have we learned about BLASTN? ½ execution time in two methods D-cache size not an influence on performance Large I-cache doubles the performance

59 What have we learned about BLASTN? ½ execution time in two methods D-cache size not an influence on performance Large I-cache doubles the performance Area better spent on I-cache not D-cache for this application

60 What can we do next?

61 Most execution spent on hash functions findMatch(String) Access array Hash → array index
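
The hash-then-index pattern on this slide can be sketched as below. The function names echo the profiled methods (`computeKey`, `addQuery`, `findMatch`), but the bodies are illustrative guesses and not the BLASTN source: the 2-bit nucleotide encoding, the modulo hash, the word length of 4, and the table size of 64 are all assumptions.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding


def compute_key(word):
    """Pack a DNA word into an integer, two bits per base."""
    key = 0
    for base in word:
        key = (key << 2) | BASE_CODE[base]
    return key


def hash_index(key, table_size):
    """Map a key to an array index; a candidate for a custom hash instruction."""
    return key % table_size


def add_query(table, word, position):
    """Record where a query word occurs."""
    table[hash_index(compute_key(word), len(table))].append((word, position))


def find_match(table, word):
    """Hash the word, access the array, and return matching query positions."""
    bucket = table[hash_index(compute_key(word), len(table))]
    return [pos for stored, pos in bucket if stored == word]


table = [[] for _ in range(64)]
query = "ACGTACGA"
word_len = 4
for i in range(len(query) - word_len + 1):
    add_query(table, query[i:i + word_len], i)

print(find_match(table, "ACGT"), find_match(table, "TTTT"))
```

Since `compute_key` and `hash_index` are short, fixed sequences of shifts and masks executed on every lookup, they are the sort of code a dedicated instruction could replace.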

62 FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation program gcc Reconfigure ISA + hash instruction

63 FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation program gcc Reconfigure ISA hash instruction

64 Our development environment

65 To avoid reloading programs during re-run – loaded an embedded operating system: uClinux kernel (~200K)

66 Our development environment To avoid reloading programs during re-run – loaded an embedded operating system: uClinux kernel (~200K) UART serial port

67 Our development environment To avoid reloading programs during re-run – loaded an embedded operating system: uClinux kernel (~200K) UART serial port

68 Our development environment To avoid reloading programs during re-run – loaded an embedded operating system: uClinux kernel (~200K) UART serial port Ethernet device driver to mount NFS file systems

69 Operating system call profiling Just select them in the interface…

70 Function Time / Cycles Cache Hits / Misses Read Write .text main findMatch addQuery computeKey computeBase coreLoop fillQuery read Pipeline Stalls Branch Predict

71 Recap

72 Recap - Extracting & Improving Performance on Reconfigurable Architectures

73 Platform –Standard ISA, to leverage existing compilers –FPGAs, to instrument and reconfigure

74 Recap - Extracting & Improving Performance on Reconfigurable Architectures Platform –Standard ISA, to leverage existing compilers –FPGAs, to instrument and reconfigure Profiling –Cycle-accurate –Non-intrusive –At full speed

75 Recap - Extracting & Improving Performance on Reconfigurable Architectures Platform –Standard ISA, to leverage existing compilers –FPGAs, to instrument and reconfigure Profiling –Cycle-accurate –Non-intrusive –At full speed Reconfiguration –Reconfigure micro-architecture to improve performance

76 Recap - Extracting & Improving Performance on Reconfigurable Architectures Platform –Standard ISA, to leverage existing compilers –FPGAs, to instrument and reconfigure Profiling –Cycle-accurate –Non-intrusive –At full speed Reconfiguration –Reconfigure micro-architecture to improve performance Currently –Reconfigure ISA and modify compiler –Automate –Profile operating system calls

77 Questions? FPX Hardware Module built At WashU Serial port Gigabit Ethernet FPGA device with LEON core

78 Hardware development flow Interface support mod VHDL Compile Simulate (Modelsim) Synthesize (Synplicity) Place n’ Route (Virtex 2000E) Verify LEON VHDL

79 Modular Design Flow (our contribution) Place and Route with constraints (Xilinx) Synthesize Logic to gates & flops (Synplicity Pro) Front End: Specify Regular Expression (Web, PHP) Install and deploy modules over Internet to remote scanners (NCHARGE) Set Boundary I/O & Routing Constraints (DHP) Back End (2): Generate Finite State Machines in VHDL Generate bitstream (Xilinx) In-System, Data Scanning on FPX Platform Back End (1): Extract Search terms from SQL database New, 2 Million-gate Packet Scanner: 9 Minutes

80 Function-wise profiling

81 Next steps - Automate configuration Application Trace Analyzer Architecture Generator Synthesis Compiler FPX Platform Reconfiguration Server Reconfiguration Cache Dynamic Adaptation Analysis + Architecture Generation Configuration Archive Simulation

82 Next steps - Automate (re)configuration FPGA Control S/W Interface LEON Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Statistics Module Event monitor bus FPX program gcc Workstation Config Controller LEON-v1.0 I-CACHE D-CACHE Cache Controller LEON-v2.0 I-CACHE D-CACHE Cache Controller LEON-v3.0 I-CACHE D-CACHE Cache Controller

