Presentation on theme: "Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli (NVIDIA): An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision"— Presentation transcript:

1  http://www.c2s2.org
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli (NVIDIA)
An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision

Applications of Multi-resolution Processing
- Panorama stitching
- HDR detail enhancement
- Optical flow

Background: Linear Pipeline and Segment Pipeline
- Linear pipeline: duplicates processing elements (PEs) for each pyramid level; all PEs work on all pyramid levels in parallel.
  Pro: low demand on off-chip memory bandwidth.
  Cons: inefficient use of the PE resources; area and power overhead.
- Segment pipeline: a recirculating design that uses one processing element to generate all pyramid levels, one level after another.
  Pro: saves computational resources.
  Con: requires very high memory bandwidth.

Our Approach: Time-Sharing Pipeline Architecture
- The same PE works on all pyramid levels in parallel, in a time-sharing pipeline manner.
- Each work-cycle computes 1 pixel for G2 (the coarsest level), 4 pixels for G1, and 16 pixels for G0 (the finest level); the next cycle starts back at G2, and so forth (a software sketch of this schedule follows this list).
- A single PE runs at full speed, as in the segment pipeline.
- Memory traffic is as low as in the linear pipeline.
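The following is a minimal software sketch of that work-cycle schedule for a 3-level pyramid. It models only the ordering and the output counts per time slot, not the authors' RTL; the function name work_cycle_schedule is illustrative. The 4:1 output ratio between adjacent levels is what keeps the single PE's time slots fully occupied.

```python
# Minimal sketch of the time-sharing work-cycle schedule for a 3-level
# pyramid (G0 finest, G2 coarsest).  Per work-cycle the single PE emits
# 1 G2 output, 4 G1 outputs, and 16 G0 outputs; the 4:1 decimation
# between levels keeps every PE time slot busy.
def work_cycle_schedule(levels=3, cycles=2):
    """Yield (cycle, level, n_outputs) tuples in PE time-slot order."""
    for cycle in range(cycles):
        # Coarsest level first, then progressively finer levels,
        # each producing 4x as many outputs as the level above it.
        for level in reversed(range(levels)):
            n_outputs = 4 ** (levels - 1 - level)
            yield cycle, level, n_outputs

if __name__ == "__main__":
    for cycle, level, n in work_cycle_schedule():
        print(f"cycle {cycle}: G{level} -> {n} pixel(s)")
    # Each cycle the PE produces 1 + 4 + 16 = 21 outputs with no idle
    # slots, whereas a linear pipeline would need one PE per level.
```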
A Combined Approach: Time-Sharing Pipeline for the Gaussian and Laplacian Pyramids
- A single PE, a line-buffer pyramid, and a timing MUX produce the Gaussian pyramid levels (G0, G1, G2) and the corresponding Laplacian pyramid together (a software reference model of the pyramid math appears after this transcript).
- The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system.

Time-Sharing Pipeline in Optical Flow Estimation (Hierarchical Lucas-Kanade)
- Three time-sharing pipelines work simultaneously: two construct the Gaussian pyramids of the two frames (fine to coarse scale), and one performs motion estimation (coarse to fine scale); a coarse-to-fine software sketch also follows this transcript.
- The engine only needs to read the two source images from main memory and write the resulting motion vectors back to memory.

Line Buffer, Sliding Window Registers, and Block-Linear Layout
- For sliding-window operations, pixels stream into an on-chip line buffer for temporary storage.
- The line buffer size is proportional to the image width, which makes the line-buffer cost huge for high-resolution images.
- Inspired by the GPU block-linear texture memory layout, the image is processed in column blocks, which significantly reduces the line-buffer size: the line-buffer width equals the block width, at the cost of refetching data at block boundaries (a first-order cost model follows this transcript).

Hardware Synthesis in 32 nm CMOS
- A Genesis-based chip generator encapsulates all the parameters (e.g., window size, pyramid levels) and allows automated generation of synthesizable HDL for design-space exploration.
- [Figure: block diagram of a convolution-based time-sharing pyramid engine, e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window; hardware chip generator GUI.]

Area Evaluation (design points running at 500 MHz in 32 nm CMOS)
- Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP): TP consumes much less PE area because the same PE is time-shared among the pyramid levels; the cost of the extra shift registers and control logic for the time-sharing configuration is negligible compared with the reduction in PE cost.
- Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP): TP consumes increasingly more area than SP as the number of pyramid levels grows, but the overhead of TP over SP is fairly small for designs with small windows.

Memory Bandwidth Evaluation
- DRAM traffic is an order of magnitude less than with SP, which saves energy.
- TP only reads the source images from DRAM and writes the resulting motion vectors back to DRAM; all other intermediate memory traffic is completely eliminated.

Overall Performance and Energy Evaluation
- Energy consumption is dominated by DRAM accesses.
- vs. SP: 10x saving on DRAM accesses (log scale); similar on-chip memory access and logic processing cost.
- vs. LP: similar DRAM access cost, but lower energy cost for on-chip logic processing.
- TP is almost 2x faster than SP.
- TP is only slightly slower than LP while eliminating LP's extra logic cost.

Block-Linear Design Evaluation
- P(N) = parallel degree; B(N) = number of blocks.
- Increasing the number of blocks reduces the line-buffer area while keeping the same throughput; the chart demonstrates the various design trade-offs.

Simulation Result (Hierarchical Lucas-Kanade Optical Flow)
- Optical flow (velocity) on a benchmark image with a left-to-right movement.
- The proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach.
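As a functional reference for the pyramid math that the single-PE engine computes (a software model, not the hardware), the sketch below builds a Gaussian pyramid with a 3x3 binomial kernel and the Laplacian levels as differences against the expanded next-coarser level. The kernel choice and the use of scipy.ndimage are assumptions for illustration; the poster's engine uses a configurable convolution window.

```python
# Software reference model (not the RTL) of the pyramid computation:
# Gaussian levels by 3x3 blur + 2x decimation, Laplacian levels as the
# difference against the expanded next-coarser Gaussian level.
import numpy as np
from scipy.ndimage import convolve

# 3x3 binomial kernel, a common stand-in for a small Gaussian window.
KERNEL = np.outer([1, 2, 1], [1, 2, 1]) / 16.0

def reduce_level(img):
    """One Gaussian-pyramid step: blur, then drop every other row/column."""
    return convolve(img, KERNEL, mode="nearest")[::2, ::2]

def expand_level(img, shape):
    """Upsample by 2 (zero-insert + blur) back to `shape` (even dims assumed)."""
    up = np.zeros(shape)
    up[::2, ::2] = img
    # Scale by 4 to compensate for the zeros inserted before blurring.
    return convolve(up, 4.0 * KERNEL, mode="nearest")

def build_pyramids(img, levels=3):
    gaussian = [img.astype(float)]
    for _ in range(levels - 1):
        gaussian.append(reduce_level(gaussian[-1]))
    laplacian = [g - expand_level(gn, g.shape)
                 for g, gn in zip(gaussian[:-1], gaussian[1:])]
    return gaussian, laplacian

if __name__ == "__main__":
    g, l = build_pyramids(np.random.rand(64, 64))
    print([lvl.shape for lvl in g], [lvl.shape for lvl in l])
```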

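The skeleton below sketches the coarse-to-fine dataflow of hierarchical Lucas-Kanade referenced in the optical-flow section: Gaussian pyramids are built fine-to-coarse for both frames, then flow is estimated at the coarsest level and upsampled and refined downward. It is a dense, single-iteration software approximation under assumed choices (uniform windowed sums, bilinear warping), not the poster's three-pipeline hardware.

```python
# Coarse-to-fine dataflow sketch of hierarchical Lucas-Kanade:
# build pyramids fine-to-coarse, estimate motion coarse-to-fine.
import numpy as np
from scipy.ndimage import uniform_filter, map_coordinates

def downsample(img):
    """One pyramid step: small blur, then 2x decimation."""
    return uniform_filter(img, size=3)[::2, ::2]

def lk_dense(im0, im1, win=5, eps=1e-6):
    """Single-iteration dense Lucas-Kanade at one pyramid level."""
    iy, ix = np.gradient(im0)
    it = im1 - im0
    s = lambda a: uniform_filter(a, size=win)          # windowed averages
    a11, a12, a22 = s(ix * ix), s(ix * iy), s(iy * iy)
    b1, b2 = -s(ix * it), -s(iy * it)
    det = a11 * a22 - a12 * a12 + eps                  # 2x2 normal equations
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def warp(img, u, v):
    """Sample img at (x + u, y + v) with bilinear interpolation."""
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    return map_coordinates(img, [yy + v, xx + u], order=1, mode="nearest")

def hierarchical_lk(im0, im1, levels=3):
    # Build Gaussian pyramids fine-to-coarse for both frames.
    pyr0, pyr1 = [im0.astype(float)], [im1.astype(float)]
    for _ in range(levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))
    u = np.zeros(pyr0[-1].shape)
    v = np.zeros(pyr0[-1].shape)
    # Estimate motion coarse-to-fine, upsampling and refining the flow.
    for lvl in reversed(range(levels)):
        if u.shape != pyr0[lvl].shape:
            h, w = pyr0[lvl].shape
            u = 2.0 * np.kron(u, np.ones((2, 2)))[:h, :w]
            v = 2.0 * np.kron(v, np.ones((2, 2)))[:h, :w]
        du, dv = lk_dense(pyr0[lvl], warp(pyr1[lvl], u, v))
        u, v = u + du, v + dv
    return u, v
```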
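To make the block-linear trade-off concrete, the model below compares first-order line-buffer storage and boundary-refetch costs for a row-linear versus a block-linear streaming order. The formulas (line-buffer width equal to the image width versus the block width, with a (window-1)-column halo re-read at each internal block boundary) are assumptions based on the description, not figures from the poster.

```python
# Assumed first-order model (not from the poster) of the block-linear
# trade-off: splitting a W-wide image into B-wide column blocks shrinks
# the line buffer from W*(K-1) to B*(K-1) pixels, at the cost of
# refetching a (K-1)-column halo at every internal block boundary.
def linebuffer_pixels(width, window):
    """Line-buffer pixels needed to hold (window - 1) rows of `width`."""
    return width * (window - 1)

def blocklinear_model(width, height, window, n_blocks):
    block_w = -(-width // n_blocks)                 # ceiling division
    buf = linebuffer_pixels(block_w, window)        # per-block line buffer
    halo_cols = (window - 1) * (n_blocks - 1)       # columns refetched
    refetch = halo_cols * height                    # extra pixels re-read
    return buf, refetch

if __name__ == "__main__":
    W, H, K = 1920, 1080, 3
    print("row-linear  linebuffer:", linebuffer_pixels(W, K), "px")
    for n in (2, 4, 8):
        buf, extra = blocklinear_model(W, H, K, n)
        print(f"{n} blocks: linebuffer {buf} px, boundary refetch {extra} px")
```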
