
1 Lecture 18: Case Study of SoC Design
ECE 412: Microcomputer Laboratory

2 Outline
Web server example
MP3 example

3 Example: Embedded web server application
A basic web server capable of responding to simple HTTP requests
Simple CGI requests for dynamic HTML
A timer peripheral is read before, during, and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated web page
A simple read-only file system was implemented in flash memory to store static web pages and JPEG images

4 Throughput calculations
Transmission throughput
–Measures the latency from the start of sending the first TCP packet containing the HTTP response until the file is completely sent
–Could theoretically reach a maximum of 10 Mbps
–Reflects the raw network speed that the CPU and TCP/IP stack are capable of sustaining
HTTP server throughput
–Takes into account all delay between the incoming HTTP connection request and file-send completion
–Includes the transmission latency above
–Also measures the time the HTTP server took to open a TCP connection to the host
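As a minimal sketch of how such measurements fall out of timer snapshots (the timer access function, clock rate, and names here are assumptions for illustration, not the actual Nios SDK API):

#include <stdint.h>

/* Hypothetical timer access: assumes a free-running 32-bit tick counter
 * clocked at TIMER_HZ. Names are illustrative, not the real SDK API. */
#define TIMER_HZ 33000000u            /* 33 MHz system clock (assumed)  */
extern uint32_t timer_read(void);     /* snapshot of the tick counter   */

/* Throughput in bits per second for a transfer of 'bytes' bytes that
 * started at tick 'start' and finished at tick 'end'. */
static uint32_t throughput_bps(uint32_t start, uint32_t end, uint32_t bytes)
{
    uint32_t ticks = end - start;             /* wraps correctly for u32 */
    uint64_t bits  = (uint64_t)bytes * 8u;
    return (uint32_t)((bits * TIMER_HZ) / ticks);
}

/* Usage around a file send (hypothetical driver call):
 *   uint32_t t0 = timer_read();
 *   send_http_response(conn, file, len);
 *   uint32_t t1 = timer_read();
 *   uint32_t bps = throughput_bps(t0, t1, len);
 */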

5 Baseline system
The web server was put to the test serving JPEG images of varying sizes across the LAN to a host PC
During each transfer, several snapshots of the timer peripheral were taken

6 Baseline system dataflow
[Diagram: Nios CPU (instruction master, data master) on the Avalon bus, connected to the UART, I/O, timer, SRAM, flash, and Ethernet MAC]
The Nios CPU's data master port is used to read data memory (SRAM) and write to the Ethernet MAC. This occurs for each packet transmitted in the baseline system.
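In software terms, the baseline transmit path amounts to a programmed-I/O copy loop. A sketch under assumed names and addresses (the real Avalon slave map will differ):

#include <stdint.h>

/* Baseline (no DMA): the CPU's data master moves every packet itself.
 * ETH_TX_FIFO is a hypothetical write-only register on the Ethernet MAC. */
#define ETH_TX_FIFO (*(volatile uint32_t *)0x80000000u)  /* assumed address */

static void send_packet_pio(const uint32_t *payload, unsigned words)
{
    /* One bus read (SRAM) plus one bus write (MAC) per 32-bit word,
     * with the CPU tied up on the Avalon bus for the whole transfer. */
    for (unsigned i = 0; i < words; i++)
        ETH_TX_FIFO = payload[i];
}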

7 Performance optimization
Use a DMA controller to transfer data from incoming packets into memory without intervention by the microprocessor
Use a custom peripheral to do the checksum calculation
Combine the two
Optimize the slave-arbitration priority for the memories to provide maximum data throughput

8 Dataflow enhancement with DMA
Use DMA to transfer packets between the Ethernet MAC and data memory
The CPU gets higher priority for any conflicts with the DMA
During a DMA transfer, the CPU is free to access other peripherals
For access to the shared SRAM, arbitration is performed
[Diagram: Nios CPU (instruction master, data master) and DMA controller (read master, write master) share the Avalon bus; an arbiter resolves conflicting accesses to the shared SRAM, with flash, the Ethernet MAC, and the other peripherals also on the bus]
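A minimal sketch of handing a packet transfer to the DMA controller; the register layout below is an assumption for illustration (Altera's actual DMA core has its own register map and would normally be driven through its HAL):

#include <stdint.h>

typedef struct {
    volatile uint32_t status;    /* bit 0: DONE (assumed)                */
    volatile uint32_t readaddr;  /* source: Ethernet MAC RX (assumed)    */
    volatile uint32_t writeaddr; /* destination: SRAM buffer             */
    volatile uint32_t length;    /* bytes to move                        */
    volatile uint32_t control;   /* bit 3: GO (assumed)                  */
} dma_regs_t;

#define DMA ((dma_regs_t *)0x80001000u)   /* assumed base address */

static void dma_rx_packet(uint32_t src, uint32_t dst, uint32_t len)
{
    DMA->readaddr  = src;
    DMA->writeaddr = dst;
    DMA->length    = len;
    DMA->control   = (1u << 3);           /* GO: start the transfer */
    /* The CPU is now free to service other peripherals; poll (or take
     * an interrupt) for completion. */
    while ((DMA->status & 1u) == 0)
        ;                                 /* wait for DONE */
}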

9 Performance improvement
Transmission throughput is doubled compared to the baseline
Overall HTTP server throughput is about 2.5x that of the baseline
Logic resource usage increases 36% over the baseline (3,600 logic elements)

10 TCP checksum
Checksum calculation can be regarded as a necessary evil in dataflow-sensitive applications
–For a 1300-byte payload, it takes 33,000 clock cycles in software
–At a 33 MHz clock speed, that is 1 ms of latency per maximum-size packet
In the benchmark, the largest file (60 KB) breaks down into 46 maximum-size packets
–46 ms out of the 156 ms transmission latency in the baseline
The inner loop of the TCP/IP stack's checksum performs a 16-bit one's complement checksum calculation
–Repeatedly adding up data is a simple task for hardware, so a Verilog implementation can be designed
–The checksum peripheral reads the payload contents directly out of data memory, performs the checksum calculation, and stores the result in a CPU-addressable register
–It now takes 386 clock cycles: a speedup of 90x over the software version
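For reference, the software inner loop being replaced has the shape of the classic 16-bit one's complement (Internet) checksum; a sketch of the standard algorithm, not the exact code from the Nios TCP/IP stack:

#include <stdint.h>
#include <stddef.h>

static uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                    /* sum 16-bit big-endian words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)                             /* odd trailing byte */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;               /* one's complement of the sum */
}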

11 Checksum peripheral
Again, arbitration is performed for access to the shared SRAM
[Diagram: Nios CPU and the checksum peripheral (read master) share the Avalon bus; an arbiter resolves accesses to the shared SRAM, with flash and the other peripherals also on the bus]
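From the CPU's side, using the peripheral reduces to pointing it at the payload, starting it, and reading back the result. The register names and offsets below are assumptions for illustration:

#include <stdint.h>

typedef struct {
    volatile uint32_t addr;    /* payload address in SRAM               */
    volatile uint32_t length;  /* payload length in bytes               */
    volatile uint32_t go;      /* write 1 to start (assumed)            */
    volatile uint32_t status;  /* bit 0: DONE (assumed)                 */
    volatile uint32_t result;  /* 16-bit one's complement sum           */
} cksum_regs_t;

#define CKSUM ((cksum_regs_t *)0x80002000u)  /* assumed base address */

static uint16_t hw_checksum(uint32_t payload_addr, uint32_t len)
{
    CKSUM->addr   = payload_addr;
    CKSUM->length = len;
    CKSUM->go     = 1;                   /* the peripheral's read master
                                            now streams payload from SRAM */
    while ((CKSUM->status & 1u) == 0)
        ;                                /* ~386 cycles for a max packet */
    return (uint16_t)CKSUM->result;
}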

12 Performance boost
Transmission latency decreased by 44 ms
Average transmission throughput increased 40%, and average HTTP throughput increased 25%, over the baseline
Resource usage is 22% higher than the baseline (3,250 logic elements)

13 Putting it all together

14 Embedded microprocessor systems in Xilinx FPGAs
A traditional embedded microprocessor system as implemented on a platform FPGA
A co-processor architecture with multiple hardware accelerators
1. Start by developing for the first architecture
2. Automatically generate the second architecture under the control of the user

15 Profiling results
DCT32 and IMDCT36 perform the discrete cosine transform and inverse discrete cosine transform, respectively. The other functions are multiply-accumulate functions of various sizes. These functions account for over 90% of the total application execution time on the host.
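The hot multiply-accumulate functions all share roughly this shape: fixed-point dot products of various lengths. An illustrative kernel, not the actual MP3 decoder source:

#include <stdint.h>

static int32_t mac_n(const int16_t *coef, const int16_t *sample, int n)
{
    int64_t acc = 0;                     /* wide accumulator avoids overflow */
    for (int i = 0; i < n; i++)
        acc += (int32_t)coef[i] * sample[i];
    return (int32_t)(acc >> 16);         /* assumed Q16 rescale */
}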

16 Design automation
Implement co-processor accelerators to meet the performance requirements
Use the tagging facilities in the Xilinx design environment to mark the functions for hardware acceleration (illustrated below)
'Compile for target'
–The tool chain creates an implementation that includes a MicroBlaze processor and the same interfaces as before
–Augmented with three hardware accelerators that implement the multiplications, the DCT, and the inverse DCT
The creation of the hardware accelerator blocks is done automatically:
–An advanced C-to-hardware compiler optimized for platform FPGAs
–'Stitching' of the accelerators into the new co-processing architecture
–Handling the movement of the appropriate data to and from the accelerators
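The exact tagging mechanism is tool-specific (often a GUI selection or project setting rather than source annotation); purely as an illustration of the idea, marking whole functions as acceleration candidates might look like:

/* Hypothetical annotation, not actual Xilinx syntax: the point is only
 * that entire C functions are tagged as hardware candidates. */
#pragma accelerate
void DCT32(const int *in, int *out);

#pragma accelerate
void IMDCT36(const int *in, int *out);

/* After 'compile for target', calls to DCT32/IMDCT36 in the MicroBlaze
 * software are replaced by generated driver stubs that move operands to
 * the accelerators and collect results. */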

17 New architecture

18 Final results
Enables the MP3 application to run in real time at a system clock rate of 67.5 MHz

19 A simple summary
Platform-based design involves hardware/software codesign
The right design decisions can provide a significant performance improvement
Careful tradeoffs are needed among performance, resource usage, cost, and design time
Platform FPGAs are a convenient, low-cost platform for such a task

20 Overview of the rest of the semester
This is the last formal lecture
–If we haven't covered it already, we can't really expect you to use it on your projects
Quiz 2: next Thursday. No class next Tuesday.
Final project proposals are on 4/13 and 4/15
–Two teams each day; each team has 20 minutes
–Proposal presentations can be sent to me by email before class or brought in on a flash drive
Initial report due 4/20 (new due date)
–Three pages (four at most)
–May contain: introduction, background, motivation, impact, block diagram, and workload partition among team members
–Goal: give us enough information that we can provide feedback about project complexity and suggestions
From now on, I'll hold office hours during class meeting times to discuss final-project issues
Final project presentation: 5/12
Final project report/demo: due 5/14
For details, refer to Lecture 14

21 Next time
Quiz 2 (next Thursday, 4/8)

