1
GPU-Accelerated Nick Local Image Thresholding Algorithm
M. Hassan Najafi, Anirudh Murali, David J. Lilja, and John Sartori
ICPADS 2015
2
Overview
Introduction
  Why image thresholding?
  Different thresholding algorithms
  Nick image thresholding method: algorithm, flow
Goals and contributions
Implementations
  CPU sequential and CUDA GPU parallel implementations
  GPU considerations and optimizations
Methodology of experiments
Experimental results
  Effect of block size, image size, local window size
  GPU execution overheads
Summary and conclusion
3
Introduction: Why image thresholding?
Document binarization: classifying pixels into background/foreground
Looking for a threshold value
  Pixel intensity > threshold : 1 (background)
  Pixel intensity <= threshold : 0 (foreground, text)
Thresholding algorithms: global methods, local methods
Document binarization, as the first step in Optical Character Recognition (OCR) systems, has been an active research area for many years. Binarization for OCR can be regarded as a segmentation problem that classifies background and foreground (text) information in document images. The simplest way to accomplish this binarization is thresholding: a threshold value (pixel intensity) is selected, all pixel intensities above this threshold are set to 1 (background), and values at or below it are set to 0 (foreground).
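As a minimal sketch of this thresholding rule (assuming an 8-bit grayscale image stored row-major; the function and variable names are illustrative, not from the paper):

```cuda
// Minimal sketch of the basic thresholding rule described above.
// Assumes an 8-bit grayscale image stored row-major; names are illustrative.
void binarize_with_threshold(const unsigned char* in, unsigned char* out,
                             int num_pixels, unsigned char threshold) {
    for (int i = 0; i < num_pixels; ++i) {
        // Intensities above the threshold map to 1 (background),
        // intensities at or below it map to 0 (foreground/text).
        out[i] = (in[i] > threshold) ? 1 : 0;
    }
}
```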
4
Introduction: Global and local methods
Global methods
  A single threshold for the whole image
  Otsu, Kapur, Abutaleb, Monte, Don
  Often very fast
  But weak performance when the illumination over the document is not uniform
Local (adaptive) methods
  A different threshold for each pixel
  Nick, Bernsen, Niblack, Sauvola
  Good results even on degraded documents
  But too slow
Global: a single threshold value is selected for the whole image; gives good results for typical scanned documents. Local: a different threshold is selected for each pixel of the image, according to its neighboring pixels in a local region. The major barrier of these local methods is their high computational cost, which makes them too slow.
5
Introduction: Looking for the best local thresholding algorithm [Gatos et al. 2006]
Nick method: good performance even on severely degraded document images
Local methods have been evaluated in [10], [11], [12], and [13] for different types of documents. The Nick method [7] improves on the Niblack algorithm by shifting the binarization threshold down, which improves binarization of "white" and light page images. It has shown good performance even for images where the gray values of text and non-text pixels are very similar.
Figure 1. (Left) Original input images; (right) outputs of binarization using the Nick method.
6
Nick method
A window-based algorithm: a moving window defines the local region
Threshold value for each pixel:
  Threshold = m + K · sqrt( (Σ Pᵢ² − m²) / NP )
  m = mean value of the pixels in the local window
  K = a factor in the range [-0.2, -0.1]
  Pᵢ = gray-level value of the i-th pixel in the window
  NP = total number of pixels in the window
Note that even with a fixed local window size, NP is not constant across all pixels of the image. Marginal pixels have a smaller NP because their local window contains fewer pixels; for example, the pixel at the top-left corner of the image has no neighbors above it or to its left.
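A minimal sketch of this per-pixel threshold computation, assuming the window sums have already been accumulated (the helper and its names are illustrative, not the paper's code):

```cuda
#include <math.h>

// Illustrative sketch of the Nick per-pixel threshold from the formula above.
// `sum` is the sum of pixel intensities in the local window, `sum_sq` the sum
// of their squares, `np` the number of pixels actually inside the window
// (smaller for marginal pixels), and `k` the factor in [-0.2, -0.1].
__host__ __device__ inline float nick_threshold(float sum, float sum_sq,
                                                int np, float k) {
    float m = sum / np;                       // local mean
    return m + k * sqrtf((sum_sq - m * m) / np);
}
```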
7
Nick method flowchart
Assume a 9×9 local window. For each pixel:
  Extract the 81 neighbor pixels
  Compute the threshold value
  Compare with the input pixel value
Figure 2. A 9×9 local window
Figure 3. Nick method flowchart
8
Goals and contributions
Exploit Graphics Processing Units (GPUs) to solve the long latency problem of the Nick method
We develop three work-efficient CUDA kernels; they differ in how they load and access image pixels
We show how changing block size, window size, and image size affects the maximum achievable speedup
We study the performance scalability of the developed CUDA kernels as the GPU architecture scales up
We develop several linear regression models to predict the total binarization time
Fortunately, the rapid development of Graphics Processing Units (GPUs) brings a new opportunity to remove the computational bottlenecks of many image processing algorithms. The main difference between the three CUDA kernels is the way they load and access image pixel intensity data. We also develop several linear regression models to predict the total binarization time of a given input image using the developed kernels, including the time required to transfer data to the GPU's memory.
9
Implementation: CPU single-thread
We need a reference: an optimized single-thread implementation of the Nick method
The main step in calculating t(x, y) is computing the total sum of the pixel intensities and the total sum of the squares of the pixel intensities in the local window. Since the local window of a marginal pixel is smaller than that of a central pixel, the size of the local window must be computed for each pixel separately. Based on the extracted local window sizes, producing m(x, y) and t(x, y) for the image pixels is straightforward.
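A hedged sketch of such a sequential reference, assuming row-major 8-bit input and clamping the window at the image borders so that NP varies for marginal pixels (not the authors' actual code):

```cuda
#include <algorithm>
#include <cmath>

// Sequential reference sketch: for every pixel, clamp the local window at the
// image borders, accumulate the sums, and apply the Nick threshold.
void nick_binarize_cpu(const unsigned char* in, unsigned char* out,
                       int width, int height, int window, float k) {
    int half = window / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int x0 = std::max(x - half, 0), x1 = std::min(x + half, width - 1);
            int y0 = std::max(y - half, 0), y1 = std::min(y + half, height - 1);
            float sum = 0.0f, sum_sq = 0.0f;
            for (int j = y0; j <= y1; ++j)
                for (int i = x0; i <= x1; ++i) {
                    float p = in[j * width + i];
                    sum += p;
                    sum_sq += p * p;
                }
            int np = (x1 - x0 + 1) * (y1 - y0 + 1);  // window size at this pixel
            float m = sum / np;
            float t = m + k * std::sqrt((sum_sq - m * m) / np);
            out[y * width + x] = (in[y * width + x] > t) ? 1 : 0;
        }
    }
}
```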
10
Implementation: CPU multi-thread
A multi-threaded program could have much better performance than the serial single-thread version: at most an N-times speedup using N threads
But the number of cores in a typical CPU is much smaller than the number of cores in a GPU
GPUs can achieve massive parallelism for applications without data dependencies
In the Nick method, the computation for each pixel is independent of the computation for the other pixels, so it is well suited to GPU computing
Although multicore/multithreaded CPUs could be helpful, a typical CPU offers far fewer cores than a GPU.
11
Implementation: GPU
Three work-efficient CUDA kernels
Difference: the way they load and access image pixel intensities
The first kernel (Global)
  No shared memory; loads all pixels from global memory
  Simple implementation
The second kernel (Global-Shared)
  Exploits both SM shared memory and global memory
  Uses shared memory to reuse data
The third kernel (Shared)
  Relies only on shared memory
  More data reuse
First: relies completely on the GPU SM's L1 cache, does not use shared memory, and loads all pixel intensities directly from global memory. Second: loads most of the pixels from shared memory, but still loads marginal pixels from the long-latency global memory. Third: relies completely on shared memory; all memory accesses go to the on-chip, short-latency shared memory (see the tile-loading sketch below). Hence the motivations for developing them: 1) a simple implementation that exploits spatial and temporal locality; 2) exploiting shared memory to reuse data; 3) even more data reuse through greater use of shared memory.
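As an illustration of the shared-memory idea behind the second and third kernels, the sketch below shows one common way a block can cooperatively stage its tile plus the surrounding halo into shared memory. It is a simplification under stated assumptions (fixed BLOCK_SIZE and WINDOW_SIZE, zero-padding outside the image instead of the per-pixel NP handling), not the paper's kernel:

```cuda
// Assumed compile-time sizes; the paper's constraint BLOCK_SIZE + WINDOW_SIZE <= 32
// is respected here (16 + 9 = 25).
#define BLOCK_SIZE 16
#define WINDOW_SIZE 9
#define TILE (BLOCK_SIZE + WINDOW_SIZE - 1)

__global__ void load_tile_sketch(const unsigned char* in, int width, int height) {
    __shared__ unsigned char tile[TILE][TILE];
    int half = WINDOW_SIZE / 2;
    // Top-left corner (in image coordinates) of the halo region this block covers.
    int base_x = blockIdx.x * BLOCK_SIZE - half;
    int base_y = blockIdx.y * BLOCK_SIZE - half;
    // Each thread loads one or more tile elements in a strided fashion.
    for (int ty = threadIdx.y; ty < TILE; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE; tx += blockDim.x) {
            int gx = base_x + tx, gy = base_y + ty;
            bool inside = (gx >= 0 && gx < width && gy >= 0 && gy < height);
            tile[ty][tx] = inside ? in[gy * width + gx] : 0;  // simplification: zero-pad
        }
    __syncthreads();
    // ... the per-pixel Nick threshold would now read only from `tile` ...
}
```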
12
Implementation: GPU
Each thread is responsible for processing one pixel of the image, in the following steps:
  1. Map the indices of the thread to one pixel
  2. Determine the start and end of the local window
  3. Compute the size of the local window (marginal pixels have fewer pixels in their local neighborhood)
  4. Compute the total sum of the local pixels and the total sum of their squares
  5. Calculate the mean value of the local window
  6. Calculate the threshold value using the main Nick equation
  7. Generate and store the output binary value by comparing the computed threshold with the corresponding pixel intensity
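A hedged sketch of a kernel following these steps in the style of the first (global-memory) variant, with one thread per pixel and all reads from global memory (illustrative, not the authors' code):

```cuda
__global__ void nick_kernel_global(const unsigned char* in, unsigned char* out,
                                   int width, int height, int window, float k) {
    // Step 1: map thread indices to one pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // image size not a multiple of block size

    // Steps 2-3: local window bounds and size (smaller for marginal pixels).
    int half = window / 2;
    int x0 = max(x - half, 0), x1 = min(x + half, width - 1);
    int y0 = max(y - half, 0), y1 = min(y + half, height - 1);
    int np = (x1 - x0 + 1) * (y1 - y0 + 1);

    // Step 4: sums of intensities and squared intensities over the window.
    float sum = 0.0f, sum_sq = 0.0f;
    for (int j = y0; j <= y1; ++j)
        for (int i = x0; i <= x1; ++i) {
            float p = in[j * width + i];
            sum += p;
            sum_sq += p * p;
        }

    // Steps 5-6: mean and Nick threshold.
    float m = sum / np;
    float t = m + k * sqrtf((sum_sq - m * m) / np);

    // Step 7: output binary value.
    out[y * width + x] = (in[y * width + x] > t) ? 1 : 0;
}
```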
13
GPU Considerations and Optimizations
GPU constant memory: only useful for very small images
Coalescing: all developed kernels are coalesced
Divergence: only when the image size is not a multiple of the block size
--use_fast_math: use the hardware-accelerated versions of the floating-point math functions
GPU pinned memory: faster execution of the cudaMemcpy calls
GPU constant memory: could only help for very small images that fit completely inside the constant memory; dividing large images into smaller tiles and calling the kernel multiple times would just add extra overhead.
Coalescing: happens when consecutive threads (in the same warp) access consecutive data elements (in the same burst). Moving across the threads in each warp shifts the consecutive portion of image pixel intensities being accessed, so all parallel versions are coalesced.
Divergence: occurs when some threads within a warp follow different execution paths. In the implemented kernels each thread binarizes only one pixel, so all threads in the same warp do the same work. The only exception is when the size of the input image is not a multiple of the thread block size, so some threads in the last warps have no pixels to process. Since the total number of pixels in the input image is usually much larger than the total number of threads per block, even divergence in the last warps has no significant influence on kernel execution.
Compiler flag: a CUDA compiler optimization that improves the speed of floating-point operations by forcing the device to use the hardware-accelerated versions of the math functions, which are significantly faster than the software versions at the cost of a small loss in precision.
Pinned memory: copying data between the host and device memories is one of the main overheads of running kernels on GPUs. cudaMemcpy, the main call for copying data between the host and device memories, works much faster if the host memory is allocated as pinned memory.
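A small host-side sketch of the pinned-memory idea (buffer names and sizes are illustrative); the fast-math behavior comes from compiling with nvcc's --use_fast_math option rather than from code:

```cuda
// Compile with: nvcc --use_fast_math ...  (hardware-accelerated FP math functions)
#include <cuda_runtime.h>

void copy_with_pinned_memory(size_t bytes) {
    unsigned char *h_in = nullptr, *d_in = nullptr;
    cudaMallocHost((void**)&h_in, bytes);   // page-locked (pinned) host buffer
    cudaMalloc((void**)&d_in, bytes);
    // ... fill h_in with image pixel intensities ...
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // faster from pinned memory
    // ... launch kernel, copy results back ...
    cudaFree(d_in);
    cudaFreeHost(h_in);
}
```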
14
Methodology
Nine different real input images, from 75×80 to 2500×4000 pixels
Effect of increasing the size of the local window: 9×9, 15×15, and 33×33
Number of threads per block when calling the GPU kernels: 8×8, 16×16, and 32×32
A CPU for the sequential version
The nine images are used to evaluate the performance of the serial and parallel implementations. The different local window sizes show the effect of increasing the window size: 33×33 can give the best possible output quality, and each size implies a different quality and a different computational effort per pixel.
15
Methodology
GPUs for the parallel versions
  GeForce GTX 480 (Fermi architecture)
  GeForce GTX 780 (Kepler architecture)
These were selected to perform our performance evaluation experiments and to show the performance scalability of the developed kernels as the architecture scales up. The GTX 780 is a newer GPU with twice the global memory of the GTX 480, many more cores, faster memory with wider bandwidth, and a larger L2 cache.
16
Experimental Results: Kernel execution speedup
Kernel execution speedup for binarization of the largest image sample, using the three implemented CUDA GPU kernels with 9×9, 15×15, and 33×33 window sizes, as the block size changes from 8×8 to 16×16 to 32×32, on both selected GPU devices, the GTX 480 and the GTX 780.
Figure 4. Kernel execution speedup for binarization of the largest image sample: (left) GTX 480, (right) GTX 780.
17
Experimental Results: Block Size
Best block size for the GTX 480 (max block size: 1024 threads, max threads per SM: 1536, max blocks per SM: 8)
  8×8 : 8 blocks (8×64 = 512 threads) per SM => 33% GPU occupancy
  16×16 : 6 blocks (6×256 = 1536 threads) per SM => 100% GPU occupancy
  32×32 : 1 block (1×1024 threads) per SM => 67% GPU occupancy
The higher the occupancy of a kernel, the better the performance it achieves, if we do not consider the limits on register and shared-memory usage.
A 16×16 thread block size is the best choice for the GTX 480.
On the GTX 480, when the block size is 8×8 only 512 threads in total are assigned to each SM for scheduling, GPU occupancy is 33%, and thread-level parallelism is lost because there are not enough threads in the SM to schedule. With 32×32, each SM can schedule only one thread block (limited by the maximum of 1536 threads per SM). The 16×16 thread block size has shown the best performance of the CUDA kernels on the GTX 480.
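As a worked form of the occupancy numbers quoted above (using the GTX 480 limit of 1536 threads per SM):

```latex
\text{occupancy} = \frac{\text{resident threads per SM}}{\text{max threads per SM}}
\qquad
\frac{8 \times 64}{1536} \approx 33\%,\quad
\frac{6 \times 256}{1536} = 100\%,\quad
\frac{1 \times 1024}{1536} \approx 67\%
```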
18
Experimental Results: Block Size
Best block size for the GTX 780 (max block size: 1024 threads, max threads per SM: 2048, max blocks per SM: 8)
  8×8 : 8 blocks (8×64 = 512 threads) per SM => 25% GPU occupancy
  16×16 : 8 blocks (8×256 = 2048 threads) per SM => 100% GPU occupancy
  32×32 : 2 blocks (2×1024 threads) per SM => 100% GPU occupancy
So both 16×16 and 32×32 can give 100% occupancy: for the first and the second kernel, 32×32 has shown the better performance; for the third kernel, 16×16 has.
This is mainly because of the limit on the BLOCK_SIZE variable in the third kernel: BLOCK_SIZE + WINDOW_SIZE must not exceed 32, so to work with 9×9 or 15×15 window sizes we can use at most BLOCK_SIZE = 16.
19
Experimental Results: Image Size
Increasing the size of the image: more benefit from parallel processing
A larger image means many more pixels to process, and therefore more benefit from parallel processing. The achieved speedup for binarization of the image samples on the GTX 480 increased from 19x for image 1 to 137x for image 6 and 142x for image 9, the largest image, using the first kernel. Based on this figure, the third kernel is clearly the superior choice for binarization of images of all sizes if only one of the three proposed kernels is to be chosen for the 16×16 block size and 9×9 local window size configuration.
Figure 5. The speedups gained from binarization of the nine sample images when the block size and window size are fixed to 16×16 and 9×9 (on the GTX 480).
20
Experimental Results: Window Size
Increasing the size of the local window improves the quality, but costs more execution time
The kernels do not follow the same pattern when the size of the local window increases
Table 1. Speedups of binarization using the first and the third kernel on the GTX 480 with a 16×16 block size as the window size changes.
In both kernels, for both window sizes, enlarging the input image increased the speedup. However, the kernels did not follow the same pattern: in the first kernel, increasing the local window size from 9×9 to 15×15 gave a better speedup for all input images; in the third kernel, increasing the local window size reduced the benefit of parallel processing, so the smaller window size gave the better speedup. The reason: the first kernel relies on caching, so increasing the size of the local window means exploiting more locality, and more of the accesses hit elements already in the cache. The third kernel, on the other hand, exploits SM shared memory; increasing the local window size while keeping a fixed number of threads per block creates more overhead relative to the number of threads, because a larger number of local elements must be loaded into shared memory, and so the speedup cannot scale with the input image size as well as it does for a smaller local window.
21
GPU Execution Overheads
The main overheads of executing kernels on the GPU: copying the data from the host memory into the device global memory, and copying the results back to the host memory
To reduce these overheads, we allocated a specific amount of pinned memory for cudaMemcpy
Table 2. The effect of using pinned memory on the GTX 480 when executing the third kernel with a 15×15 window size and a 16×16 block size.
Pinned memory reduced the execution overheads by a factor of about 2, increasing the total speedup from 83x to 118x for the largest image.
When measuring speedup so far, we considered only the kernel execution time relative to the execution time of the optimized sequential C version, but execution of the GPU kernels always comes with some overheads. To reduce the overhead of transferring data between the host and the device, we allocated a specific amount of pinned memory on the host for the cudaMemcpy calls, which reduces their execution time. Note that although exploiting pinned memory reduced the CUDA kernel execution overheads significantly, these overheads are still relatively large, approximately equal to the corresponding kernel execution time. Thus, including the GPU execution overhead in the final speedup calculation is unavoidable if we want an accurate, fair GPU-to-CPU speedup.
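One way such transfer overheads can be measured is with CUDA events; the sketch below is illustrative and not necessarily how the authors timed their runs:

```cuda
#include <cuda_runtime.h>

// Measures the elapsed time (in milliseconds) of one host<->device copy
// using CUDA events on the default stream.
float time_memcpy_ms(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);     // the transfer being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```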
22
GPU Execution Overheads
For the largest image sample, comparing the two panels shows that the best achievable speedup for the first kernel drops from 166x to 144x on the GTX 480, and the best measured speedup for the third kernel with a 15×15 local window size drops from 239x to 118x on the GTX 480, once the GPU kernel execution overheads are included in the speedup calculations.
Figure 6. GTX 480 GPU-to-CPU speedups: (left) before considering GPU overheads, (right) after including GPU execution overheads.
23
Developing Regression Models
We develop four linear regression models to predict the total execution time of binarization using the first and the third developed CUDA kernels and using the optimized sequential version, as the number of pixels in the input image scales.
Table 3. Linear regression models for the total execution time as a function of the number of pixels in the input image. P is the total number of pixels in the input image.
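The slide does not list the fitted coefficients; the general form of such a model, with coefficients β₀ and β₁ left as placeholders, is:

```latex
T_{\text{total}}(P) \;=\; \beta_0 + \beta_1 \cdot P
```

where P is the total number of pixels in the input image and T_total is the predicted total binarization time, including the data-transfer overhead.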
24
Summary and Conclusion
We developed three CUDA kernels for the computation-intensive Nick local image thresholding algorithm, solving the long latency problem of this method by dividing the work among the GPU's threads.
The first CUDA kernel loads all pixels from global memory and accelerates the total binarization time for the 33×33 local window size by
  144 times on the GTX 480
  161 times on the GTX 780
The second CUDA kernel exploits both global and block shared memory, with speedups of
  66x on the GTX 480
  132x on the GTX 780
25
Summary and Conclusion
The third CUDA kernel loads all pixels into shared memory and has shown the best performance for the 15×15 local window size; including the GPU overheads, it achieves
  a 118x improvement on the GTX 480
  a 147x improvement on the GTX 780
Our experimental results show that:
  The GTX 780 (Kepler architecture) gains much better speedups than the GTX 480 (Fermi architecture)
  Increasing the image size gives more speedup
  Increasing the window size gives better output quality from the Nick method, more speedup in the first kernel, and less speedup in the third
26
References
[1] B. Gatos, I. Pratikakis, and S. J. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition, vol. 39, no. 3, pp. 317–327, 2006.
[2] F. Shafait, D. Keysers, and T. M. Breuel, "Efficient implementation of local adaptive thresholding techniques using integral images," Document Recognition and Retrieval XV, 2008.
[3] K. Khurshid, I. Siddiqi, C. Faure, and N. Vincent, "Comparison of Niblack inspired binarization methods for ancient documents," Proc. SPIE, vol. 7247, pp. 72470U-1–72470U-9, 2009.
[4] E. Zemouri, Y. Chibani, and Y. Brik, "Enhancement of historical document images by combining global and local binarization technique," Int. J. Inf. Electron. Eng., vol. 4, no. 1, 2014.
27
Thank you! Questions?