Performance Tuning on Multicore Systems for Feature Matching within Image Collections
Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung Leung and Minyi Guo*
Department of Computer Science, University of Otago, New Zealand
* Department of Computer Science, Shanghai Jiao Tong University, China

Contents
–Motivation
–Our work
–Evaluation
–Conclusion

Similarity Search
Definition:
–To preprocess a database of N objects so that, given a query object, one can effectively determine its nearest neighbors in the database.
Applications:
–Pattern recognition, chemical similarity analysis, statistical classification, etc.

The problem – KNN Search
K Nearest Neighbor Search:
–Feature: an array of D elements, $f = [e_1, e_2, \ldots, e_D]$
–Feature space: a set of features, $F_s = \{f_1, f_2, \ldots, f_N\}$
–Feature similarity: Euclidean distance, $d(f_i, f_j) = \sqrt{\sum_{m=1}^{D} (f_i^m - f_j^m)^2}$
–Search: given a query feature $f_q$, find the k features in $F_s$ that have the shortest distances to $f_q$.
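The search defined on this slide can be sketched as a brute-force scan. This is a minimal illustration in plain Python, not the presenters' implementation (their environment is OpenCV + OpenMP); the function names `euclidean` and `knn_search` and the toy 2-D data are assumptions made for the example.

```python
import heapq
import math


def euclidean(f_i, f_j):
    """Euclidean distance between two features with the same number of elements."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def knn_search(feature_space, f_q, k):
    """Return the k (distance, index) pairs closest to the query feature f_q.

    Brute force: every feature in the space is compared with the query.
    """
    scored = ((euclidean(f, f_q), idx) for idx, f in enumerate(feature_space))
    return heapq.nsmallest(k, scored)


# Hypothetical toy feature space with D = 2 elements per feature.
F_s = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]
nearest = knn_search(F_s, [0.0, 0.0], k=2)
```

Each query costs one distance computation per stored feature, which is why the later slides look at parallelism and cache behaviour.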

Our Case Study
Feature Matching: a fundamental problem in many computer vision tasks
–Use the SIFT algorithm to generate features for each image;
–Use a k-Nearest Neighbors (k-NN) algorithm to find similar features between images.

Challenges
Very time-consuming:
–Datasets are becoming larger: hundreds or thousands of images;
–Image resolution is increasing: 2300×1500 pixels or higher.
New platforms:
–HPC is moving into the multi-/many-core age: AMD 16-core and 64-core machines.

Motivation
Performance evaluation:
–Find the common problems that may limit the performance of feature matching on multi-/many-core platforms.
Performance tuning:
–Find general methods to solve the identified problems.

Data Distribution

Data Size

Problems
Unbalanced workload:
–Levels of parallelism;
–Scheduling policy.
Poor last-level cache utilization:
–Memory architecture.

Levels of parallelism
[Figure: levels of parallelism (Level_1 to Level_4, with Level_1&2 as a combined case) across reference images, query images and features; search algorithms shown: Linear, KD-tree, Kmeans, LSH, others]

Scheduling policy
OpenMP scheduling policy:
–Static: the scheduler assigns an equal number of tasks to each thread in advance (not used);
–Dynamic: when a thread finishes its current task, it takes a new task from the global task queue;
–Guided: like dynamic, but the chunk size is adjusted as tasks are requested from the task queue, shrinking as less work remains.
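The effect of the three policies on an unbalanced workload can be sketched with a small simulation. This is plain Python rather than OpenMP, and everything in it is an assumption made for illustration: the task costs, the helper names, and the guided rule (chunk = remaining tasks divided by thread count, rounded up), which is a common choice but not something every OpenMP runtime is required to follow.

```python
import heapq
import math


def static_chunks(n_tasks, n_threads):
    """Static: one contiguous, near-equal block of tasks per thread."""
    base, extra = divmod(n_tasks, n_threads)
    sizes = [base + (1 if t < extra else 0) for t in range(n_threads)]
    return [s for s in sizes if s > 0]


def dynamic_chunks(n_tasks, chunk=1):
    """Dynamic: fixed-size chunks handed out from a shared queue."""
    sizes = [chunk] * (n_tasks // chunk)
    if n_tasks % chunk:
        sizes.append(n_tasks % chunk)
    return sizes


def guided_chunks(n_tasks, n_threads, min_chunk=1):
    """Guided (assumed rule): chunk size shrinks with the remaining work."""
    sizes, remaining = [], n_tasks
    while remaining > 0:
        size = min(remaining, max(min_chunk, math.ceil(remaining / n_threads)))
        sizes.append(size)
        remaining -= size
    return sizes


def chunk_costs(costs, sizes):
    """Total cost of each consecutive chunk of tasks."""
    out, start = [], 0
    for s in sizes:
        out.append(sum(costs[start:start + s]))
        start += s
    return out


def static_makespan(costs, n_threads):
    """Finish time when each thread owns exactly one pre-assigned block."""
    return max(chunk_costs(costs, static_chunks(len(costs), n_threads)))


def queue_makespan(costs, sizes, n_threads):
    """Finish time when idle threads pull the next chunk from a shared queue."""
    finish = [0] * n_threads
    heapq.heapify(finish)
    for c in chunk_costs(costs, sizes):
        earliest = heapq.heappop(finish)   # the thread that becomes idle first
        heapq.heappush(finish, earliest + c)
    return max(finish)


# Hypothetical unbalanced workload: four expensive tasks, then twelve cheap ones.
costs = [8] * 4 + [1] * 12
t_static = static_makespan(costs, 4)
t_dynamic = queue_makespan(costs, dynamic_chunks(len(costs)), 4)
t_guided = queue_makespan(costs, guided_chunks(len(costs), 4), 4)
```

In this toy case the four expensive tasks fall into a single block under static and into the first, largest chunk under guided, so both finish at 32 time units, while dynamic with chunk size 1 finishes at 11. A different task ordering would change the ranking, which is why the policy needs to be evaluated per workload.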

Memory architecture
More cores share the memory and last-level cache:
–Memory bandwidth: AMD 16-core 12.8 GB/s; AMD 64-core 25.6 GB/s
–Last-level cache: AMD 16-core 6 MB; AMD 64-core 16 MB
Large images may not fit in cache and cause many memory accesses, so execution hits the memory wall.
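A rough back-of-the-envelope calculation shows why these figures matter. The bandwidth and cache sizes are the ones on this slide; the descriptor layout (128 elements per SIFT descriptor stored as 4-byte floats) and the reading of "MB" as MiB are assumptions made for the sketch.

```python
SIFT_DIMS = 128          # elements per SIFT descriptor
BYTES_PER_ELEMENT = 4    # assumption: descriptors stored as 32-bit floats
MIB = 1024 * 1024

# Figures from the slide: (cores, memory bandwidth in GB/s, last-level cache in bytes).
machines = {
    "AMD16": (16, 12.8, 6 * MIB),
    "AMD64": (64, 25.6, 16 * MIB),
}


def features_that_fit(cache_bytes):
    """Number of whole descriptors that fit in a cache of the given size."""
    return cache_bytes // (SIFT_DIMS * BYTES_PER_ELEMENT)


summary = {}
for name, (cores, bandwidth_gbs, llc_bytes) in machines.items():
    summary[name] = {
        "bandwidth_per_core_gbs": bandwidth_gbs / cores,
        "features_in_llc": features_that_fit(llc_bytes),
    }
```

Under these assumptions the 64-core machine has four times the cores but only twice the bandwidth, so the bandwidth available per core halves (0.8 GB/s down to 0.4 GB/s), and only a few tens of thousands of descriptors fit in the last-level cache at once.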

Divide-and-Merge
We propose Divide-and-Merge:
–The whole feature space is split into several smaller sub-spaces;
–Each sub-space is searched independently;
–The results are merged.
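The three steps can be sketched for the exact (brute-force) case as follows. This is an illustrative Python outline under stated assumptions, not the presenters' code: the sub-spaces are equal-sized contiguous slices, they are searched one after another rather than in parallel, and the names `search_subspace` and `divide_and_merge` are hypothetical.

```python
import heapq
import math


def euclidean(f_i, f_j):
    """Euclidean distance between two features of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def search_subspace(sub_space, offset, f_q, k):
    """k nearest (distance, global index) pairs within one sub-space."""
    scored = ((euclidean(f, f_q), offset + i) for i, f in enumerate(sub_space))
    return heapq.nsmallest(k, scored)


def divide_and_merge(feature_space, f_q, k, n_parts):
    """Divide the space, search each part independently, merge the results."""
    size = max(1, math.ceil(len(feature_space) / n_parts))
    partial = []
    for start in range(0, len(feature_space), size):
        sub_space = feature_space[start:start + size]            # divide
        partial.extend(search_subspace(sub_space, start, f_q, k))  # search
    return heapq.nsmallest(k, partial)                           # merge


# Hypothetical toy data: 10 two-element features on a line.
F_s = [[float(i), 0.0] for i in range(10)]
result = divide_and_merge(F_s, [4.2, 0.0], k=3, n_parts=4)
```

For exact search the merged result matches a single scan over the whole space, because the global k nearest features are always among the per-sub-space candidates. The potential benefit is that each sub-space can be sized to fit in the last-level cache.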

Divide-and-Merge

Time complexity
Accurate algorithms:
–Brute force:
–Apply DM:
Approximate algorithms:
–Randomized KD-Tree:
–Apply DM:

Hardware and Software configuration
AMD 16-core (AMD16):
–CPU: AMD Opteron Processor [...], [...] cores × 2.5 GHz
–Cache: L1: 128 KB, L2: 512 KB, L3: 6144 KB
–Memory: 16 GiB, DDR2 800 MHz, 12.8 GB/s
–OS: Ubuntu [...]
–Compiler: g[...]
AMD 64-core (AMD64):
–CPU: AMD Opteron Processor [...], [...] cores × 2.3 GHz
–Cache: L1: 48 KB, L2: 1000 KB, L3: [...] KB
–Memory: 64 GiB, DDR[...] [...] MHz, [...] GB/s
–OS: Ubuntu [...]
–Compiler: g[...]
Environment: OpenCV + OpenMP, one of the setups most frequently used by computer vision researchers to utilize parallel platforms.

Levels of parallelism

Scheduling policy(on level_1&2)

Scheduling policy(on level_3)

Memory architecture
1. Original Execution
2. Apply Divide-and-Merge

Evaluation on Manawatu Dataset

Conclusion
–We have shown that performance tuning is demanding on modern multicore systems.
–We have comprehensively evaluated the impact of the three factors that influence large-scale image feature matching.
–We have proposed a Divide-and-Merge algorithm that can greatly improve the speedup and scalability of feature matching algorithms on multicore machines.