© Copyright 2013 Xilinx. Design Consideration of Using Zynq & HLS to Implement Face Detection Algorithm.

Adaboost for face detection

Adaboost (adaptive boosting)
- Machine learning method to train cascade classifiers
- Builds a strong classifier from many weak classifiers

Cascade classifier in face detection
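To make the "strong classifier from weak classifiers" idea concrete, here is a minimal sketch of how a boosted cascade might evaluate one detection window. The struct names, the decision-stump form of the weak classifier, and all thresholds are illustrative assumptions, not taken from the deck or from a trained OpenCV model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical weak classifier: a decision stump on a single feature value.
struct WeakClassifier {
    std::size_t feature;  // index into the window's feature vector
    double threshold;     // stump threshold
    double alpha;         // vote weight assigned by boosting
};

// One cascade stage: a strong classifier (weighted vote) with its own threshold.
struct Stage {
    std::vector<WeakClassifier> weak;
    double stage_threshold;
};

// Strong classifier: sum the weights of the weak classifiers that fire.
inline bool stage_passes(const Stage& s, const std::vector<double>& features) {
    double score = 0.0;
    for (const WeakClassifier& w : s.weak) {
        assert(w.feature < features.size());
        if (features[w.feature] >= w.threshold) score += w.alpha;
    }
    return score >= s.stage_threshold;
}

// Cascade: a window is accepted only if every stage accepts it.
// The first rejecting stage ends the evaluation early.
// stages_run (optional) reports how many stages were actually evaluated.
inline bool cascade_accepts(const std::vector<Stage>& cascade,
                            const std::vector<double>& features,
                            std::size_t* stages_run = nullptr) {
    std::size_t n = 0;
    bool accepted = true;
    for (const Stage& s : cascade) {
        ++n;
        if (!stage_passes(s, features)) {
            accepted = false;
            break;
        }
    }
    if (stages_run) *stages_run = n;
    return accepted;
}
```

The early exit is what makes a cascade cheap on non-face windows: most windows are rejected after the first stage or two.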

OpenCV design acceleration flow

1) Develop the OpenCV application on a desktop
2) Run the OpenCV application on the ARM cores without modification
3) Abstract the FPGA portion using I/O functions
4) Replace OpenCV function calls with synthesizable code
5) Run HLS to generate the FPGA accelerator
6) Replace the call to synthesizable code with a call to the FPGA accelerator

Pure OpenCV face detection code snippet (opencv_top.cpp)

    // Wrap the input and output frames, then pre-filter the input.
    cv::Mat src_mat(src);
    cv::Mat dst_mat(dst);
    cv::Mat gray_mat, eq_mat;
    cv::cvtColor(src_mat, gray_mat, CV_BGR2GRAY);
    cv::equalizeHist(gray_mat, eq_mat);

    // Load the trained cascade only once.
    if (!cascade_loaded) {
        cv::FileStorage fs(cascade_file_name, cv::FileStorage::READ);
        if (!fs.isOpened()) {
            printf("error: failed to open cascade file.\n");
            return;
        }
        if (!unmaskedCascade.read(fs.getFirstTopLevelNode())) {
            printf("error: failed to get first top level node.\n");
            return;
        }
        cascade_loaded = 1;
    }

    // Scale factor 1.1, at least 3 neighbors, minimum face size 50x50.
    std::vector<cv::Rect> faces;
    unmaskedCascade.detectMultiScale(eq_mat, faces, 1.1, 3, 0, cv::Size(50, 50));

    // Copy the frame and draw a green rectangle around each detection.
    std::vector<cv::Rect>::const_iterator i;
    src_mat.copyTo(dst_mat);
    for (i = faces.begin(); i != faces.end(); ++i) {
        cv::rectangle(dst_mat,
                      cv::Point(i->x, i->y),
                      cv::Point(i->x + i->width, i->y + i->height),
                      CV_RGB(0, 255, 0), 2);
        printf("width: %d, height: %d\n", i->width, i->height);
    }
    printf("***** Face detected in opencv: %d *****\n", (int)faces.size());
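The two prefiltering calls in this snippet (cvtColor and equalizeHist) are the kind of code that step 4 of the flow replaces with plain loops. The following is a minimal sketch in standard C++, assuming interleaved 8-bit BGR input; the function names are hypothetical, the equalization is the textbook CDF remapping, and bit-exact agreement with OpenCV is not claimed. HLS pragmas and streaming interfaces are deliberately left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved 8-bit BGR pixels to 8-bit grayscale with the
// BT.601 luma weights (0.299 R + 0.587 G + 0.114 B) in fixed point.
inline std::vector<std::uint8_t> bgr_to_gray(const std::vector<std::uint8_t>& bgr) {
    assert(bgr.size() % 3 == 0);
    std::vector<std::uint8_t> gray(bgr.size() / 3);
    for (std::size_t p = 0; p < gray.size(); ++p) {
        const unsigned b = bgr[3 * p];
        const unsigned g = bgr[3 * p + 1];
        const unsigned r = bgr[3 * p + 2];
        // Weights are scaled by 2^14 and sum to 16384; +8192 rounds to nearest.
        gray[p] = static_cast<std::uint8_t>(
            (1868u * b + 9617u * g + 4899u * r + 8192u) >> 14);
    }
    return gray;
}

// Histogram equalization: remap gray levels through the normalized CDF.
inline std::vector<std::uint8_t> equalize_hist(const std::vector<std::uint8_t>& gray) {
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t v : gray) ++hist[v];

    // Find the first occupied bin; its count is the smallest nonzero CDF value.
    std::size_t first = 0;
    while (first < 256 && hist[first] == 0) ++first;
    // Empty or constant image: nothing to spread out.
    if (first == 256 || hist[first] == gray.size()) return gray;

    const double scale = 255.0 / static_cast<double>(gray.size() - hist[first]);
    std::array<std::uint8_t, 256> lut{};
    std::size_t cdf = 0;
    for (std::size_t v = first; v < 256; ++v) {
        cdf += hist[v];
        lut[v] = static_cast<std::uint8_t>((cdf - hist[first]) * scale + 0.5);
    }

    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) out[i] = lut[gray[i]];
    return out;
}
```

Both loops touch each pixel a fixed number of times and use a fixed-size table, which is the shape of code that tends to map well to a pipelined hardware filter.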

Time profiling of OpenCV application on ARM

Test condition
- OpenCV trained cascade classifier: lbpcascade_frontalface.xml
- HDMI video input: theinternship-tlr2_h1080p.mov (a movie trailer with lots of faces)

- OpenCV prefiltering, cvtColor: ~800 ms (*)
- OpenCV prefiltering, equalizeHist: <100 ms (*)
- OpenCV unmasked face detection: ~4000 ms (**)
- OpenCV face marking and frame copy: ~200 ms

With HW prefiltering, we can save the time spent on (*).
With HW classification, we can further save the time spent on (**).
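The savings can be tallied with a few lines of arithmetic. This sketch assumes the four figures are per-frame times, takes 100 ms as an upper bound for equalizeHist, and treats an offloaded step as costing the ARM nothing, which is optimistic. The struct and function names are made up for illustration.

```cpp
// Profile figures from the slide, in milliseconds (assumed to be per frame).
struct FrameProfile {
    double cvt_color_ms;      // prefiltering: cvtColor (*)
    double equalize_hist_ms;  // prefiltering: equalizeHist (*)
    double detect_ms;         // unmasked face detection (**)
    double mark_copy_ms;      // face marking and frame copy
};

// Everything runs in software on the ARM.
inline double total_ms(const FrameProfile& p) {
    return p.cvt_color_ms + p.equalize_hist_ms + p.detect_ms + p.mark_copy_ms;
}

// Prefiltering (*) moved to hardware; its software time drops out.
inline double ms_with_hw_prefilter(const FrameProfile& p) {
    return p.detect_ms + p.mark_copy_ms;
}

// Classification (**) moved to hardware as well; only marking and copy remain.
inline double ms_with_hw_classification(const FrameProfile& p) {
    return p.mark_copy_ms;
}

// Frames per second for a given software time per frame.
inline double frames_per_second(double ms_per_frame) {
    return 1000.0 / ms_per_frame;
}
```

With the slide's numbers this gives about 5100 ms in pure software, 4200 ms with HW prefiltering, and 200 ms (5 frames/s) if the whole classification could be offloaded, so classification is where almost all of the gain lies.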

Ideal system partition

[Block diagram. Recoverable labels: Pre-processing, Filter engine, Post-processing, TPG_VDMA, S2MM, FILTER_VDMA, MM2S, S2MM, logiCVC, MM2S, SOURCE_LAYER, DISPLAY_LAYER, source, sink, ARM, Merge and draw rectangle, Cascade classifier, Face marker, MASK_LAYER, fsync_out, fsync, video file reader, FPGA, Park mode, Circular mode, Mux sel, GPIO: ext_fsync]

What is a mask image?
- An image marked with face locations and sizes
- Output of the PL
- Input of the PS

HW/SW communication and synchronization

- UIO device
- Interrupt
- Frame sync
- Triple buffer

Interrupt handler: does nothing but clean up and restart the HLS image filter, which runs without auto-restart.

[Block diagram. Recoverable labels: Pre-processing, Filter engine, Post-processing, TPG_VDMA, S2MM, FILTER_VDMA, MM2S, S2MM, logiCVC, MM2S, SOURCE_LAYER, DISPLAY_LAYER, source, sink, ARM, MASK_LAYER, fsync_out, fsync, video file reader, FPGA, Circular mode]
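The slide names triple buffering without showing how the buffers rotate. As a generic illustration only, the index bookkeeping might look like the sketch below: one buffer is being written, one holds the newest complete frame, and one is being read. The class and its methods are hypothetical, it is not thread-safe, and it does not model the VDMA registers or the actual driver used in the design.

```cpp
#include <cassert>
#include <utility>

// Index bookkeeping for three frame buffers (0, 1, 2).
class TripleBuffer {
public:
    // Buffer the producer should fill next.
    int write_index() const { return write_; }

    // Producer finished a frame: it becomes the newest complete frame,
    // and the previous "ready" buffer is recycled for writing.
    void publish() {
        std::swap(write_, ready_);
        fresh_ = true;
    }

    // Consumer asks for the newest complete frame. If nothing new was
    // published since the last call, the same buffer is returned again.
    int acquire() {
        if (fresh_) {
            std::swap(read_, ready_);
            fresh_ = false;
        }
        return read_;
    }

    bool has_fresh_frame() const { return fresh_; }

private:
    int write_ = 0;
    int ready_ = 1;
    int read_ = 2;
    bool fresh_ = false;
};
```

Because the writer and the reader never hold the same index, a slow consumer only ever skips frames; it never sees a frame that is half written.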

Masked detection system partition 1

- Single window size, 2-stage Haar-like feature accelerator
- LBP masked detection in SW (why not Haar? See the first con below)

Pros:
- Many candidate pixels are eliminated by the partial cascade classification

Cons:
- Haar-like trained classifiers are in the old format and cannot be masked. Even with a classifier trained in the new format, SW Haar detection is quite slow (single frame, full cascade: ~12000 ms)
- Single window size, so faces can be missed
- High computational burden, which limits scaling of the window size

Masked detection system partition 2

- Pre-filtering in HW: the mask image has two channels (grayscale/edge) and needs no additional VDMA
- LBP masked detection in SW

Pros:
- Many candidate pixels are eliminated by the HW pre-filtering
- Saves logic resources

Cons:
- Faces could be lost. Use a Canny-like edge-detection filter? It is a trade-off

Masked detection system partition 3

- Partial LBP cascade in HW
- Remaining LBP cascade stages: masked detection in SW

Pros:
- Part of the cascade is accelerated in the PL and the remaining stages run in the PS, so the two halves connect well. The clock can be lowered to get more resource reuse, which allows more stages
- Many candidate pixels are eliminated by the partial cascade classification
- Prevents face loss

Cons:
- Needs input image pyramids for partial detection at every window size
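The split in partition 3 can be sketched in software: the first few stages run at every window position and produce a mask (the PL side), and the remaining stages run only where the mask is set (the PS side). Everything in the sketch is a stand-in: stages are arbitrary callbacks rather than real LBP stages, a single window size is used, and the names split_cascade and SplitResult are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// A stage decides whether the window anchored at (x, y) may still contain a face.
using StageFn = std::function<bool(int x, int y)>;

struct SplitResult {
    std::vector<std::uint8_t> mask;    // 1 where the early ("HW") stages kept the window
    std::vector<std::uint8_t> faces;   // 1 where every stage accepted the window
    std::size_t late_stage_calls = 0;  // work left for the "SW" side
};

// Count the nonzero entries of a mask.
inline std::size_t count_set(const std::vector<std::uint8_t>& m) {
    std::size_t n = 0;
    for (std::uint8_t v : m) n += (v != 0);
    return n;
}

inline SplitResult split_cascade(int width, int height,
                                 const std::vector<StageFn>& stages,
                                 std::size_t hw_stages) {
    assert(width > 0 && height > 0 && hw_stages <= stages.size());
    const std::size_t n = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    SplitResult r;
    r.mask.assign(n, 0);
    r.faces.assign(n, 0);

    // "PL" side: run the first hw_stages at every position and emit the mask.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            bool keep = true;
            for (std::size_t s = 0; s < hw_stages && keep; ++s) keep = stages[s](x, y);
            r.mask[static_cast<std::size_t>(y) * width + x] = keep ? 1 : 0;
        }
    }

    // "PS" side: run the remaining stages only where the mask is set.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            if (!r.mask[i]) continue;
            bool face = true;
            for (std::size_t s = hw_stages; s < stages.size() && face; ++s) {
                ++r.late_stage_calls;
                face = stages[s](x, y);
            }
            r.faces[i] = face ? 1 : 0;
        }
    }
    return r;
}
```

Because the mask is produced by the same cascade that the software continues, no window that the full cascade would accept is dropped, which is the "prevents face loss" point on the slide.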

Algorithm Understanding

- For each original image, 10 images are generated in single-precision floating point.
- For each pixel, pattern matching takes 16 calculations on average.
- For a given pixel, the 8 single-precision source data locations needed for a pattern-matching calculation are known in advance from the mask.

Algorithm bandwidth analysis

One frame (write only)
- Given a 720p video, the pixel count per frame is 1280*720 ~= 1M pixels = 256K * 32 bits (that is, 8 bits per pixel)
- To generate 10 images (size equivalent) in single-precision floating point, the bandwidth to write (only) to DDR3 memory is 10 images * 1M * 32 bits = 10M * 32 bits
- Each hardware acceleration is estimated to need 8 floating-point data, with 16 iterations on average. The bandwidth required for this module is 8 (single-precision data) * 16 (iterations) * 1M (pixels) * 1/4 (interval extract) * 32 bits ~= 32M * 32 bits

Real time (30 frames/s)
- (256K write + 256K read + 10M + 32M) * 32 bits * 30 frames = 42.512M * 30 * 32 bits ~= 1275M * 32 bits per second
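The estimate can be checked by coding the terms directly. The sketch keeps the slide's rounded figures (1M pixels, 256K words per frame) rather than the exact 921,600 pixels of a 720p frame; the struct and field names are made up for illustration, and all quantities are in millions of 32-bit words.

```cpp
// Inputs of the bandwidth estimate, using the slide's rounded numbers.
struct BandwidthModel {
    double frame_mwords = 0.256;    // one 720p frame as 32-bit words (256K)
    int generated_images = 10;      // single-precision images written per frame
    double pixels_m = 1.0;          // pixels per image, in millions (rounded)
    int operands = 8;               // single-precision values per calculation
    int iterations = 16;            // average calculations per pixel
    double extract_ratio = 0.25;    // "interval extract" factor
    int frames_per_second = 30;
};

// Words written to DDR3 for the generated images, per frame.
inline double image_write_mwords(const BandwidthModel& m) {
    return m.generated_images * m.pixels_m;
}

// Words fetched by the pattern-matching module, per frame.
inline double matching_mwords(const BandwidthModel& m) {
    return m.operands * m.iterations * m.pixels_m * m.extract_ratio;
}

// Frame write + frame read + generated images + pattern matching, per frame.
inline double per_frame_mwords(const BandwidthModel& m) {
    return 2.0 * m.frame_mwords + image_write_mwords(m) + matching_mwords(m);
}

// Real-time requirement in millions of 32-bit words per second.
inline double per_second_mwords(const BandwidthModel& m) {
    return per_frame_mwords(m) * m.frames_per_second;
}

// The same requirement in MB/s (4 bytes per 32-bit word).
inline double per_second_mbytes(const BandwidthModel& m) {
    return per_second_mwords(m) * 4.0;
}
```

With the default values the per-frame total is 42.512M words, which gives roughly 1275M words/s or about 5100 MB/s. The pattern-matching fetches account for about three quarters of that.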

Algorithm bandwidth analysis (cont.)

Conclusion:
- Based on the data flow, the algorithm needs about 5100 MB/s for real-time performance (1275M * 32 bits per second = 1275M * 4 bytes per second).
- A single 32-bit DDR3 memory cannot meet this requirement: 1066 Mbps per pin * 0.6 (DDR3 efficiency) * 32 pins / 8 ~= 2558 MB/s

Proposal:
- Use one PS DDR3 controller and one PL DDR3 controller to double the bandwidth
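The comparison between the required and the available bandwidth can be written as a small helper. The 0.6 efficiency factor is the slide's own estimate. The function names are hypothetical, and the "doubled" case assumes the second controller delivers the same figure as the first.

```cpp
// Effective DDR bandwidth in MB/s:
// per-pin data rate (Mbps) * efficiency * bus width (bits) / 8 bits per byte.
inline double ddr_effective_mbytes_per_s(double mbps_per_pin,
                                         double efficiency,
                                         int bus_bits) {
    return mbps_per_pin * efficiency * bus_bits / 8.0;
}

// True when the memory system can sustain the required bandwidth.
inline bool meets_requirement(double available_mbytes_per_s,
                              double required_mbytes_per_s) {
    return available_mbytes_per_s >= required_mbytes_per_s;
}
```

For one 32-bit DDR3 interface at 1066 Mbps per pin this gives about 2558 MB/s, roughly half of the 5100 MB/s requirement. Two such interfaces give about 5117 MB/s, which clears the requirement with very little margin. The same function can be applied to a second controller running at a different data rate by passing that rate instead.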

Two DDR3 memory controllers double the bandwidth

[Block diagram. Recoverable labels: Processor System (PS) with two ARM A9 cores (32 KB I-Cache and 32 KB D-Cache each), Central Interconnect, DDR Memory Controller (DDR3 32-bit), GP_M# x 2 (General Purpose 32-bit AXI Master), HP# x 4 (AXI Data 32/64-bit Slave); PL with Camera Input, Pre processor, two AXI Interconnects, DDR Memory Controller (DDR3 32-bit); DDR3 32b @ 1066 MHz; DDR3 32b @ 800 MHz; 32b and 64b links]

Summary

- Calculations for 8 pixels can run per clock cycle (interval = 9000 + 324 + latency of the calculation function).
- Run time per calculation for one calculation function = (9411/9000) * 5 ns ~= 5.23 ns.
- With LUT utilization up to 64%, 8 calculation functions run in parallel, so the run time per calculation = 5.23 ns / 8 ~= 0.6535 ns.
- That is equivalent to a calculation rate of about 1500 MHz (1 / 0.6535 ns ~= 1530 million calculations per second).
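The throughput figures follow from three divisions. The sketch assumes that 5 ns is the clock period and that 9411 is the total cycle count for 9000 calculations (9000 + 324 + a latency of 87 cycles). Neither reading is stated explicitly on the slide, and the function names are made up.

```cpp
// Average time per calculation for one calculation function.
inline double ns_per_calculation(double total_cycles,
                                 double calculations,
                                 double clock_period_ns) {
    return total_cycles / calculations * clock_period_ns;
}

// Effective time per calculation when several identical functions run in parallel.
inline double parallel_ns_per_calculation(double ns_single, int parallel_units) {
    return ns_single / parallel_units;
}

// Calculation rate in millions of calculations per second.
inline double mcalcs_per_second(double ns_per_calc) {
    return 1000.0 / ns_per_calc;
}
```

With 9411 cycles, 9000 calculations, a 5 ns clock, and 8 parallel units, this gives about 5.23 ns, 0.6535 ns, and roughly 1530 million calculations per second.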
