Accelerating a Two-Stage Query by Singing/Humming System Using GPUs (加速以 GPU 為運算核心的二階段哼唱選歌系統) Student: Andy Chuang (莊詠翔)


Accelerating a Two-Stage Query by Singing/Humming System Using GPUs
Student: Andy Chuang (莊詠翔)
Advisors: Prof. Jyh-Shing Roger Jang (張智星), Prof. Jason S. Chang (張俊盛)
Department of Computer Science, National Tsing Hua University

2 Outline
‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work

3 Introduction
‧ QBSH (Query by Singing and Humming)
  ‧ Description: the user sings or hums, and the system returns the most similar song from the database
  ‧ Problem: the system usually takes too long to respond when the database is huge
‧ Strategies
  ‧ Implement a new linear scaling that matches the properties of the GPU
  ‧ Reduce the database to avoid unnecessary comparisons
  ‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation

4 Related Work (1/2)
‧ MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
  ‧ Jang, Chen, and Kao, “MIRACLE: A Music Information Retrieval System with Clustered Computing Engines”, ISMIR
‧ Linear scaling (LS)
  ‧ Wang, Chen, Kuo, Chiu, and Jang, “Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment”, ICASSP 2012
  ‧ Lin, “Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling”, National Tsing Hua Univ. 2012

5 Related Work (2/2)
‧ Dynamic time warping (DTW)
  ‧ Ferraro, Hanna, Imbert, and Izard, “Accelerating Query-by-Humming on GPU”, ISMIR 2009
  ‧ Kuo, “Accelerating Query By Singing/Humming on GPU”, National Tsing Hua Univ.
‧ Hybrid LS+DTW
  ‧ Zou, “Query By Singing/Humming Using Combination of Classifiers”, National Tsing Hua Univ.
  ‧ Kao, “A Two-Stage Query by Singing/Humming System on GPU”, National Tsing Hua Univ. 2013

6 System Flow
[Figure: system flowchart with User, CPU, and GPU lanes. Boxes: Sing/hum the song; Detect endpoints and preprocess audio; Convert to frame-based data; Load database; Perform linear scaling; Reserve different amounts of candidate songs; Perform dynamic time warping; Post-process the ranking; Show top-N song info.]

7 Linear Scaling
‧ Example: 十年, 陳奕迅
‧ Perform key transposition before using linear scaling
[Figure: linear scaling example; horizontal axis: Time]

8 LS Implementation Detail (1/3)
‧ Our research focuses on the “Compute distance” step
‧ Steps:
  1. Scale the input pitch vector to 31 versions of different sizes
  2. Put the input pitch vectors into constant memory
  3. Compute distance
  4. Sort the results and return
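The scaling and distance steps can be sketched on the CPU as follows. This is a minimal Python illustration, not the presented CUDA kernel. The function names, the linear interpolation used for resampling, the mean-absolute-difference distance, and the choice to compare only against the start of the song are all assumptions made for the example. The 31 scaling factors spanning 80% to 170% follow the spec quoted on the Borda count slides.

```python
from statistics import mean

def resample(pitch, new_len):
    """Linearly interpolate a pitch vector to new_len points (new_len >= 2)."""
    step = (len(pitch) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        x = k * step
        i = int(x)
        frac = x - i
        nxt = pitch[min(i + 1, len(pitch) - 1)]
        out.append(pitch[i] * (1 - frac) + nxt * frac)
    return out

def scaled_versions(query, lo=0.8, hi=1.7, count=31):
    """Stretch or compress the query to `count` lengths between lo and hi."""
    factors = [lo + (hi - lo) * k / (count - 1) for k in range(count)]
    return [resample(query, max(2, round(len(query) * f))) for f in factors]

def ls_distance(query, song):
    """Smallest mean absolute pitch difference over all scaled versions."""
    best = float("inf")
    for version in scaled_versions(query):
        n = len(version)
        if n > len(song):
            continue
        segment = song[:n]
        # One-shot key transposition (assumed form): shift so the means match.
        shift = mean(segment) - mean(version)
        dist = sum(abs(a + shift - b) for a, b in zip(version, segment)) / n
        best = min(best, dist)
    return best
```

A query that is a transposed and time-stretched copy of the song start yields a distance close to zero, which is the property linear scaling relies on.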

9 LS Implementation Detail (2/3)
‧ Each block computes one song; each thread in a block computes different segments of the song
‧ An example of a single block
  ‧ Block dimension = 64
  ‧ Segment size = 375
  ‧ Frame rate = [value missing in transcript]
[Figure: a pitch vector divided into segments, with thread ids 0 to 63 assigned to the segments; the value 2408 appears as a label]

10 LS Implementation Detail (3/3)
[Figure: database pitch vectors for Song 0, Song 1, ..., Song 999 mapped to Block 0, Block 1, ..., Block 999, each block holding Thread 0, Thread 1, ..., Thread 63; the values 0, 46, and 2034 appear as position labels]
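The block and thread mapping described on these slides (one block per song, each thread handling several segments) can be pictured with a small CPU-side sketch. The round-robin assignment, the hop of one frame between segment starts, and the function name are assumptions for illustration; the slides do not state how start positions are distributed. The length 2408 is borrowed from a figure label, and treating it as a song length is also an assumption.

```python
def assign_segments(song_len, segment_size=375, block_dim=64, hop=1):
    """Map each thread id to the segment start positions it would process."""
    starts = range(0, song_len - segment_size + 1, hop)
    work = {tid: [] for tid in range(block_dim)}
    for k, start in enumerate(starts):
        # Round-robin: thread 0 takes starts 0, 64, 128, ... (assumed layout).
        work[k % block_dim].append(start)
    return work

work = assign_segments(2408)
print(sum(len(v) for v in work.values()))  # total number of segment starts
```

With these numbers there are 2408 - 375 + 1 = 2034 start positions, so every thread handles 31 or 32 segments.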

11 Implementation of LS - method 1
‧ Each thread copies a part of the database pitch vector from global memory to its local memory, then reads pitches from local memory while computing
[Figure: global memory feeding the local memory of each thread in Block 0]

12 Implementation of LS - method 2
‧ Threads in the same block compute the same song, so the pitch vector of that song is copied from global memory to shared memory once; each thread then reads pitches from shared memory when needed
[Figure: global memory feeding the shared pitch vector in the shared memory of Block 0, read by the threads of the block]

13 Comparison of the Two LS Methods
Method | Advantage | Drawback
Method 1 (using local memory) | Intuitive; easy to implement | Local memory is slower (off-chip); the same data must be copied from global memory several times
Method 2 (using shared memory) | Shared memory is faster (on-chip); the same data is copied from global memory only once | Bank conflicts

14 Bank Conflict (1/2)
‧ Bank conflict: threads in the same half-warp (shown in the same color in the figure) simultaneously access the same bank of shared memory at different addresses
[Figure: successive 4-byte data words with their bank ids and the ids of the threads accessing them, contrasting a case with no bank conflict and a case with a bank conflict]

15 Bank Conflict (2/2)
‧ Bank conflict between thread 2 and thread 5
[Figure: pitch vector with bank ids and thread ids, marking the conflicting accesses]
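The definition of a bank conflict can be checked with a short sketch. Shared memory is organized in 4-byte words, and the bank of a word is its word index modulo the number of banks. The helper below is hypothetical; its defaults of 32 banks and groups of 32 threads correspond to compute capability 2.0 devices, and passing 16 for both models the half-warp case described on the slide. Threads reading the same word are not counted as a conflict, since that access is served as a broadcast.

```python
from collections import defaultdict

def bank_conflicts(word_indices, num_banks=32, group=32):
    """word_indices[t] is the 4-byte word index read by thread t.

    Returns a list of (bank, [thread ids]) for every bank that is hit at
    two or more different word addresses within one group of threads.
    """
    conflicts = []
    for base in range(0, len(word_indices), group):
        by_bank = defaultdict(dict)
        for t in range(base, min(base + group, len(word_indices))):
            word = word_indices[t]
            by_bank[word % num_banks].setdefault(word, []).append(t)
        for bank, words in by_bank.items():
            if len(words) > 1:  # same bank, different addresses
                threads = sorted(t for ts in words.values() for t in ts)
                conflicts.append((bank, threads))
    return conflicts

# Thread 2 reads word 2 and thread 5 reads word 34: both map to bank 2.
print(bank_conflicts([0, 1, 2, 3, 4, 34]))
```

The example indices are chosen so that threads 2 and 5 collide, mirroring the pair named on the slide; the actual addresses in the figure are not recoverable.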

16 Implementation of LS - method 3
‧ Append pitches after some notes to shift banks
‧ Trade-off between performance and recognition rate
[Figure: padded pitch vector with bank ids and thread ids, showing no bank conflict]
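Why extra elements can remove a conflict is easiest to see with the generic padding trick. The sketch below is not the method 3 scheme, which appends pitches after certain notes. It only shows, under the assumption of 32 banks and a hypothetical helper name, how changing the distance between the words that consecutive threads read changes the banks they hit.

```python
def banks_touched(num_threads, stride, num_banks=32):
    """Bank hit by each thread when thread t reads word t * stride."""
    return [(t * stride) % num_banks for t in range(num_threads)]

conflicting = banks_touched(32, stride=32)  # every thread lands in bank 0
padded = banks_touched(32, stride=33)       # one padding word per row
print(len(set(conflicting)), len(set(padded)))
```

With a stride of 32 words all 32 threads share one bank, while a stride of 33 spreads them over 32 distinct banks. The cost is the extra storage, and in method 3 the appended pitches also alter the data being matched, which is the performance versus recognition-rate trade-off noted on the slide.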

17 Database Reduce (1/3)
‧ Remove unnecessary comparisons caused by duplicate songs
‧ Saves 6.2% of computing time (1265 of 20395 songs removed)
‧ First stage
  ‧ Song length difference is less than 5 seconds
  ‧ Pitch difference (after one-shot key transposition) is less than 2 semitones per frame-based data point (pitch)
‧ Do the two songs have the same song name and singer?
  ‧ Yes: second stage A, definitely the same song; 832 songs must be removed
  ‧ No: second stage B, several cases; 433 songs must be removed

18 Database Reduce (2/3)
‧ Cases in the first-stage result with a different song name or singer
‧ Deal with the “Song name error”, “Singer error”, and “Abbreviation” cases
Case | Example 1 | Example 2
Song name error | A Day In The Life | Day In The Life
Singer error | Procal Harum | Procol Harum
Abbreviation | 1st Of 5th | First Of Fifth
Cover | 想你想斷腸 | 補破網

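19 Database Reduce (3/3)
‧ Remove unnecessary comparisons caused by repeating patterns
‧ Reduces the database size by 22.01%
‧ Uses the correlative matrix T, defined by one rule for the first row and another rule for all other rows, where T is the correlative matrix, S is the pitch vector, and i, j are the matrix indices
[Equations missing in transcript]
[Figure: example correlative matrix for a short pitch sequence; entries not recoverable]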
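The two rules of the correlative matrix did not survive in the transcript. The sketch below therefore uses the definition commonly given for correlative matrices in repeating-pattern detection, as an assumption: in the first row an entry is 1 when the two pitches match, and in other rows a match extends the diagonal count by one. Only the upper triangle is filled. The function names are hypothetical.

```python
def correlative_matrix(s):
    """Upper-triangular matrix t where t[i][j] is the length of the
    repeated run ending at positions i and j (assumed definition)."""
    n = len(s)
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                t[i][j] = 1 if i == 0 else t[i - 1][j - 1] + 1
    return t

def longest_repeat(s):
    """Return the longest pattern that occurs twice, read off the matrix."""
    t = correlative_matrix(s)
    best, end_i = 0, -1
    for i, row in enumerate(t):
        for value in row:
            if value > best:
                best, end_i = value, i
    return s[end_i - best + 1:end_i + 1] if best else []

print(longest_repeat([60, 62, 64, 60, 62, 64, 65]))
```

Once a repeated pattern is located this way, one copy can be dropped from the database so that the same melody is not compared twice.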

20 Multiple Device
‧ Each device is assigned 1/n of the database and computes its similarity to the query input, where n is the number of devices
[Figure: system flow]
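The 1/n split can be sketched as a partition step followed by a merge of the per-device results. This is a CPU-side illustration with hypothetical function names; the contiguous chunking and the (distance, song) result tuples are assumptions, and the real system would launch the comparison on each GPU rather than in Python.

```python
import heapq

def partition(songs, n_devices):
    """Split the song list into n_devices contiguous, near-equal chunks."""
    size, extra = divmod(len(songs), n_devices)
    chunks, start = [], 0
    for d in range(n_devices):
        end = start + size + (1 if d < extra else 0)
        chunks.append(songs[start:end])
        start = end
    return chunks

def merge_top_n(per_device_results, n):
    """Merge per-device (distance, song) lists and keep the n smallest."""
    merged = (r for results in per_device_results for r in results)
    return heapq.nsmallest(n, merged)

print(partition(list(range(10)), 3))
print(merge_top_n([[(0.5, "a"), (0.9, "b")], [(0.1, "c")]], 2))
```

Because each device scores its chunk independently, only the small result lists need to be combined on the host.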

21 Experimental Environment (1/2)
Experimental environment (NCHC Formosa 5 Cluster)
OS: CentOS 6 x86_64
RAM: 96 GB DDR3 ECC
CPU: Intel Xeon X5670, six cores, 2.93 GHz, x2
GPU: NVIDIA Tesla M2070 x3 (448 cores)
Language version: C/C++
CUDA version: 5.5
Compute capability: 2.0

22 Experimental Data
Corpus information: NTHU & NTU students' recordings in 2013
Database songs (after reduction): [value missing in transcript]
Corpus: CHT Pop Song
Format: WAV, 16 kHz, 16 bits, mono
File amount: 2183 (1614 from NTU, 318 from NTHU, and 251 from CHT)
WAV file size: 8-9 seconds

23 Experiment 1 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2
‧ The computation time is shortest, 1.427 sec., when the pitch vector size in shared memory is 2000 and the block dimension is 128
[Figure: Computation Time (sec.) versus Shared Pitch Size (K)]

24 Experiment 2 – LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3
‧ The computation time is shortest, 1.213 sec., when the pitch vector size in shared memory is 5000 and the block dimension is 512
[Figure: Computation Time (sec.) versus Shared Pitch Size (K)]

25 Experiment 3 – LS with Different Numbers of Threads Using the Three Methods
‧ For each block dimension, the best case of LS methods 2 and 3 (with a certain shared pitch size) is faster than method 1
[Figure: Computation Time per Song (sec.) versus LS Block Dimension (# of threads per block)]

26 Experiment 4 – LSDTW with Different Numbers of Devices
‧ With more devices, the computation time decreases, since each GPU needs to search only about 1/n of the database
[Figure: Computation Time per Song (sec.) versus Number of devices]

27 Conclusions and Future Work (to be completed)
‧ Conclusions
  ‧ Computation time: LS method 2 is faster than method 1, even though bank conflicts exist
  ‧ The computation time is almost 1/n when using n devices (GPUs)
‧ Future work
  ‧ Advanced database purification to remove bad songs
    ‧ Abnormal melody (e.g., instrumental only)
    ‧ Wrong melody or song name
  ‧ Improve LS method 2 to reduce bank conflicts for the Kepler architecture
    ‧ Its definition of a bank conflict differs from that of the Tesla and Fermi architectures
    ‧ Use a different method for appending pitches

28 Thank you! & Demo

29 Methods: Dynamic Time Warping
[Figure: DTW grid with indices i, j and points t(i-1), t(i), r(j-1), r(j)]
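The grid on this slide corresponds to the dynamic-programming table of DTW, in which cell (i, j) is computed from neighboring cells. A minimal sketch follows, assuming the textbook recurrence with three predecessors and an absolute pitch difference as the local cost. The slide does not state the local path constraints or weights used in the presented system, so these choices are illustrative only.

```python
def dtw(t, r):
    """Textbook DTW distance between pitch vectors t and r (assumed form)."""
    inf = float("inf")
    n, m = len(t), len(r)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(t[i - 1] - r[j - 1])
            # Extend the cheapest of the three neighboring alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw([60, 62, 64], [60, 62, 62, 64]))  # a held note costs nothing
```

Unlike linear scaling, the alignment may stretch some notes and not others, which is why DTW handles non-uniform tempo variation at a higher computational cost.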

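30 Note & Pitch Data Charts
‧ Average number of notes: [value missing in transcript], σ: [value missing in transcript]
‧ Average number of pitch data: [value missing in transcript], σ: [value missing in transcript]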

31 Note & Pitch Data Charts (After Repeating Pattern Removal)
‧ Average number of notes: 694.7, σ: 386
‧ Average number of pitch data: [value missing in transcript], σ: [value missing in transcript]

32 Detail of Second-stage B
Song name | Singer | Handling
Different | Same | Manually check
Different | Unknown | 1. Compute the edit distance between the two song names 2. Check the songs whose edit distance is smaller than the threshold
Same | Different | Manually check
Same | Unknown | Remove data with unknown singer
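The edit-distance check for song names can be sketched with the standard Levenshtein distance. The slide does not say which edit-distance variant or threshold value was used, so the unit costs and the `threshold` parameter below are assumptions, and the function names are hypothetical. The example strings come from the "Song name error" and "Singer error" cases shown earlier.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def needs_check(name_a, name_b, threshold=3):
    """Flag a pair whose names differ by fewer than `threshold` edits
    (the threshold value is a placeholder, not taken from the slides)."""
    return edit_distance(name_a, name_b) < threshold

print(edit_distance("A Day In The Life", "Day In The Life"))
```

Pairs flagged this way are candidates for the same song entered with a slightly different title, which matches the cases listed for the first-stage result.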

33 System Flow with Multiple Devices (to be completed)
[Figure: system flowchart with User, CPU, and GPU lanes. Boxes: Sing/hum the song; Detect endpoints and preprocess audio; Convert to frame-based data; Load database; Perform linear scaling; Reserve different amounts of candidate songs; Perform dynamic time warping; Post-process the ranking; Show top-N song info.]

34 Experimental Environment (2/2)
Specification per device
GPU: NVIDIA Tesla M2070
Cores: 448
CUDA version: 5.5
Compute capability: 2.0
Global memory: 5.25 GB
Shared memory per block: 48 KB
Registers per block: 32 KB
Constant memory: 64 KB
Clock rate: 1.093 GHz

35 The following are backup slides from 高瑋

36 DTW Implementation Detail
‧ Dynamic time warping: each song is computed by one block, and the threads in the same block communicate through shared memory
[Figure: parallel DTW computation over the grid of database pitch vector versus input pitch vector, with cells labeled tid1 to tid6 (tid: thread id)]

37 GPU Implementation Detail

38 Dynamic Time Warping
[Figure: axes: Semitones versus time]

39 The Improvement of Borda Count
Query song name | LS rank / score | LSDTW rank / score | Final score
隱形的翅膀 | 1/500 | 492/… | … (Top-1)
笨小孩 | 3/498 | 26/… | … (Top-5)
海闊天空 | 126/490 | 1/… | … (Top-1)
寶貝對不起 | 70/490 | 4/… | … (Top-7)
Query song name | LS rank / score | LSDTW rank / score | Final score
隱形的翅膀 | 1/500 | 492/8 | 508 (Top-1)
海闊天空 | 126/374 | 1/… | … (Top-5)
Spec: LSDTW result with T = 10, scaling factor: 80%-170%, resolution: 31 times, step: 3%, key transposition 6 times, and 500 candidate songs
Table label on the slide: Without using threshold

40 System Flow
‧ Perform linear scaling at the first stage
‧ Choose the better candidate songs as the second-stage input
‧ Perform dynamic time warping at the second stage
‧ Combine the two results after the two-stage computation
[Figure: flowchart with boxes: Sing a song; Perform linear scaling; Reserve different amounts of candidate songs; Perform dynamic time warping; Combine results]

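41 Method: Borda Count
‧ Borda count, with the total score consistent with the example table: $D_k = \sum_{i=1}^{M} (R - r_{ik})$
  ‧ R: number of candidate songs
  ‧ M: number of melody recognition methods
  ‧ r_ik: the rank of the k-th result in the i-th melody recognition method
‧ Borda count example
Song | Rank 1: ABCD (score) | Rank 2: DCAB (score) | D_k (total score) | Rank
A | 3 | 1 | 4 | 1
B | 2 | 0 | 2 | 4
C | 1 | 2 | 3 | 2
D | 0 | 3 | 3 | 2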
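The example table on this slide can be reproduced with a few lines of Python. The scoring rule used here, R minus the 1-based rank, is inferred from the example scores (rank 1 of 4 earns 3 points) rather than quoted from a surviving formula, and the function name is hypothetical.

```python
def borda_count(rankings):
    """Sum, over all rankings, R minus the 1-based rank of each song."""
    r = len(rankings[0])  # R: number of candidate songs
    totals = {}
    for ranking in rankings:
        for rank, song in enumerate(ranking, start=1):
            totals[song] = totals.get(song, 0) + (r - rank)
    return totals

totals = borda_count([["A", "B", "C", "D"], ["D", "C", "A", "B"]])
print(sorted(totals.items(), key=lambda kv: -kv[1]))
```

The totals match the table: A scores 4, C and D tie at 3, and B scores 2.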

42 Methods: Comparison of Two Methods
‧ The proposed system combines linear scaling with dynamic time warping to accelerate computation
Type | Linear scaling (LS) | Dynamic time warping (DTW)
Computation time | Faster | Slower
Tempo variation | Deals with uniform tempo variation | Deals with non-uniform tempo variation
Key transposition | One-shot | Heuristic search

43 Modified System Flow
[Figure: two flowcharts, labeled Before Using Borda Count and After Using Borda Count. Boxes: Perform Linear Scaling; Get Candidate Song and Rank; Perform Dynamic Time Warping; Produce Rank; Combine Ranks Using Borda Count]

44 Methods: Key Transposition in DTW
‧ Test case: key transposition range = ±3.0
‧ k: input pitch mean
[Figure: candidate means k-3, k-1.5, k, k+1.5, k+3 around the input pitch mean k, and k'-0.75, k', k'+0.75 around k']

45 The Improvement of Borda Count
Query song name | LS rank / score | LSDTW rank / score | Final score
擱淺 | 2/499 | 395/… | … (Top-4)
你的微笑 | 239/490 | 2/… | … (Top-2)
Query song name | LS rank / score | LSDTW rank / score | Final score
擱淺 | 2/498 | 395/… | … (Top-167)
你的微笑 | 239/261 | 2/… | … (Top-82)
Spec: LSDTW result with T = 10, scaling factor: 80%-170%, resolution: 31 times, step: 3%, key transposition 6 times, and 4000 candidate songs
Table labels on the slide: Not use threshold; Use threshold

46 The Change of Dynamic Time Warping Weight
‧ An input pitch vector that contains noise affects the DTW computation
[Figure: two DTW grids with axes DB Pitch Vector and Input Pitch Vector and labels t(i-1), t(i), r(j-1), r(j), i, j; one grid for DTW with C1 = 1, C2 = 2 and one for DTW with C1 = 1, C2 = 1]