Accelerating a Two-Stage Query by Singing/Humming System Using GPUs. Student: Andy Chuang (莊詠翔)


1 Accelerating a Two-Stage Query by Singing/Humming System Using GPUs
Student: Andy Chuang (莊詠翔), andy.chuang@mirlab.org
Advisors: Prof. Jyh-Shing Roger Jang (張智星), Prof. Jason S. Chang (張俊盛)
Department of Computer Science, National Tsing Hua University

2 Outline
‧ Introduction
‧ Related work
‧ System flow
‧ Methods
‧ Experimental results
‧ Conclusions and future work

3 Introduction
QBSH (Query by Singing and Humming)
‧ Description: the user sings or hums, and the system returns the most similar song from the database
‧ Problem: the system usually takes too long to respond when the database is huge
Strategies
‧ Implement a new linear scaling method that matches the properties of the GPU
‧ Reduce the database to avoid unnecessary comparisons
‧ Combine linear scaling with dynamic time warping on multiple devices (GPUs) rather than one to speed up the computation

4 Related Work (1/2)
MIRACLE (Music Information Retrieval Acoustically with Clustered and paralleL Engines)
‧ Jang, Chen, and Kao, "MIRACLE: A Music Information Retrieval System with Clustered Computing Engines", ISMIR 2001.
Linear scaling (LS)
‧ Wang, Chen, Kuo, Chiu, and Jang, "Accelerating Query by Singing/Humming on GPU: Optimization for Web Deployment", ICASSP 2012.
‧ Lin, "Speeding Up Query-by-Singing/Humming Systems Based on Linear Scaling", National Tsing Hua Univ., 2012.

5 Related Work (2/2)
Dynamic time warping (DTW)
‧ Ferraro, Hanna, Imbert, and Izart, "Accelerating Query-by-Humming on GPU", ISMIR 2009.
‧ Kuo, "Accelerating Query By Singing/Humming on GPU", National Tsing Hua Univ., 2013.
Hybrid LS+DTW
‧ Zou, "Query By Singing/Humming Using Combination of Classifiers", National Tsing Hua Univ., 2008.
‧ Kao, "A Two-Stage Query by Singing/Humming System on GPU", National Tsing Hua Univ., 2013.

6 System Flow
[Flowchart with User, CPU, and GPU lanes. Steps shown: sing/hum the song; detect endpoints and preprocess audio; convert to frame-based data; load database; perform linear scaling; reserve a varying number of candidate songs; perform dynamic time warping; post-process the ranking; show top-N song info.]

7 Linear Scaling
Perform key transposition before applying linear scaling.
[Figure: pitch plotted over time. Example: 十年 by 陳奕迅]

8 LS Implementation Detail (1/3)
Steps:
‧ Scale the input pitch vector to 31 versions of different sizes
‧ Put the input pitch vector into constant memory
‧ Compute distance
‧ Sort the results and return
Our research focuses on the "Compute distance" step.
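The scaling and distance steps above can be sketched on the CPU as follows. This is only an illustration, not the GPU kernel from the slides: the scaling range (80% to 170% in 3% steps, which gives 31 versions) comes from the spec line on slide 39, while the mean absolute difference, the mean-shift key transposition, and the names `resample`, `SCALING_FACTORS`, and `ls_distance` are assumptions.

```python
# Assumed scaling factors: 80% to 170% in 3% steps, i.e. 31 versions.
SCALING_FACTORS = [(80 + 3 * k) / 100 for k in range(31)]


def resample(pitch, length):
    """Linearly interpolate `pitch` (at least 2 points) to `length` points."""
    if length == 1:
        return [float(pitch[0])]
    step = (len(pitch) - 1) / (length - 1)
    out = []
    for k in range(length):
        pos = k * step
        i = min(int(pos), len(pitch) - 2)
        frac = pos - i
        out.append(pitch[i] * (1 - frac) + pitch[i + 1] * frac)
    return out


def ls_distance(query, segment):
    """Smallest mean absolute pitch difference over all scaled query versions."""
    best = float("inf")
    for factor in SCALING_FACTORS:
        length = round(len(query) * factor)
        if length < 2 or length > len(segment):
            continue
        scaled = resample(query, length)
        target = segment[:length]
        # Assumed one-shot key transposition: align the two mean pitches.
        shift = sum(target) / length - sum(scaled) / length
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, target)) / length
        best = min(best, dist)
    return best
```

A query that matches the start of a segment up to a constant key shift gets distance 0, because the mean shift cancels the transposition before the frames are compared.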

9 LS Implementation Detail (2/3)
Each block computes one song; each thread in a block computes a different segment of that song.
Example of a single block:
‧ Block dimension = 64
‧ Segment size = 375
‧ Frame rate = 31.25
[Figure: pitch vector with indices 0 to 2408. Thread 0 covers segment 0-374, thread 1 covers 46-420, thread 2 covers 157-531, ..., thread 63 covers 2034-2408.]

10 LS Implementation Detail (3/3)
[Figure: the database pitch vector holds Song 0, Song 1, ..., Song 999 back to back. Block 0 processes Song 0, Block 1 processes Song 1, ..., Block 999 processes Song 999. Within each block, threads 0 to 63 each start at a different segment offset of that song.]

11 Implementation of LS - method 1
Each thread copies a part of the database pitch vector from global memory to its local memory, then reads pitches from local memory while computing.
[Figure: Block 0. Thread 0 copies indices 0-374, thread 1 copies 46-420, ..., thread 63 copies 2034-2408 from global memory into its own local memory.]

12 Implementation of LS - method 2
Threads in the same block compute the same song, so the pitch vector of that song is copied from global memory to shared memory; each thread then reads pitches from shared memory when needed.
[Figure: Block 0. The shared pitch vector is copied once from global memory to shared memory; thread 0 reads from index 0, thread 1 from index 46, ..., thread 63 from index 2034.]

13 Comparison of the Two LS Methods
Method 1 (using local memory)
‧ Advantages: intuitive; easy to implement
‧ Drawbacks: local memory is slower (off-chip); the same data must be copied from global memory several times
Method 2 (using shared memory)
‧ Advantages: shared memory is faster (on-chip); the same data is copied from global memory only once
‧ Drawback: bank conflicts

14 Bank Conflict (1/2)
Bank conflict: threads in the same half-warp (shown in the same color on the slide) access the same bank, but different addresses, of shared memory simultaneously.
[Figure: successive 4-byte data words mapped to bank ids 0 to 31 and wrapping around, with thread ids below; one access pattern without a bank conflict and one with a bank conflict.]

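15 Bank Conflict (2/2)
Bank conflict between thread 2 and thread 5:
‧ Thread 0: pitch vector index 0, bank 0
‧ Thread 1: pitch vector index 46, bank 14
‧ Thread 2: pitch vector index 157, bank 29
‧ Thread 3: pitch vector index 315, bank 27
‧ Thread 4: pitch vector index 482, bank 2
‧ Thread 5: pitch vector index 573, bank 29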

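16 Implementation of LS - method 3
Append pitches after some notes to shift banks.
Trade-off between performance and recognition rate.
No bank conflict after appending:
‧ Thread 0: pitch vector index 0, bank 0
‧ Thread 1: pitch vector index 46, bank 14
‧ Thread 2: pitch vector index 157, bank 29
‧ Thread 3: pitch vector index 315, bank 27
‧ Thread 4: pitch vector index 482, bank 2
‧ Thread 5: pitch vector index 574 (was 573), bank 30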
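The bank ids on slides 15 and 16 are consistent with a simple modulo mapping, which the sketch below reproduces. It assumes 32 banks of 4-byte words and one pitch per word; the padding routine is a hypothetical greedy illustration of the "append pitches" idea, not the author's actual procedure, and the function names are made up for this example.

```python
NUM_BANKS = 32  # assumption: 32 four-byte banks, one pitch value per word


def bank_of(index):
    """Bank that holds the pitch stored at `index` in shared memory."""
    return index % NUM_BANKS


def conflicts(starts):
    """Pairs of thread ids that hit the same bank at different addresses."""
    pairs = []
    for a in range(len(starts)):
        for b in range(a + 1, len(starts)):
            if starts[a] != starts[b] and bank_of(starts[a]) == bank_of(starts[b]):
                pairs.append((a, b))
    return pairs


def pad_starts(starts):
    """Greedily append pitches so every segment start lands in an unused bank."""
    if len(starts) > NUM_BANKS:
        raise ValueError("only one group of at most NUM_BANKS threads is modeled")
    used, padded, offset = set(), [], 0
    for start in starts:
        pos = start + offset
        while bank_of(pos) in used:
            pos += 1      # one appended pitch shifts this start by one bank
            offset += 1   # and shifts every later start as well
        used.add(bank_of(pos))
        padded.append(pos)
    return padded
```

With the segment starts from slide 15, `conflicts([0, 46, 157, 315, 482, 573])` reports the pair (2, 5), and `pad_starts` moves the last start to 574, as on slide 16. Because every thread then advances through its segment one pitch at a time, the bank pattern at the start positions repeats at every step.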

17 Database Reduce (1/3)
Remove unnecessary comparisons caused by duplicates of the same song.
Saves 6.2% of computing time (1265 of 20395 songs removed).
First stage
‧ Song length difference is less than 5 sec.
‧ Pitch difference (after one-shot key transposition) is less than 2 semitones per frame-based datum (pitch)
Have same song name & singer?
‧ Yes: Second-stage A. Definitely the same; 832 songs must be removed
‧ No: Second-stage B. Several cases; 433 songs must be removed

18 Database Reduce (2/3)
Cases where the first-stage result has a different song name or singer.
We deal with the "Song name error", "Singer error", and "Abbreviation" cases.
Cases (Example 1 / Example 2):
‧ Song name error: A Day In The Life / Day In The Life
‧ Singer error: Procal Harum / Procol Harum
‧ Abbreviation: 1st Of 5th / First Of Fifth
‧ Cover: 想你想斷腸 / 補破網

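19 Database Reduce (3/3)
Remove unnecessary comparisons caused by repeating patterns.
Saves 22.01% of the database size.
Using correlative matrix T:
‧ For the first row: T(1, j) = 1 if S(1) = S(j), otherwise 0
‧ Other rows: T(i, j) = T(i-1, j-1) + 1 if S(i) = S(j), otherwise 0
where T is the correlative matrix, S is the pitch vector, and i, j are the indices of the matrix.
Example for S = 72 68 72 68 72 60 (upper triangle, X marks the diagonal):
     72 68 72 68 72 60
72    X  0  1  0  1  0
68       X  0  2  0  0
72          X  0  3  0
68             X  0  0
72                X  0
60                   X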
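A small sketch of building the correlative matrix for the example on this slide. The recurrence used here (a match extends the diagonal run by one, a mismatch resets to zero) is the usual correlative-matrix rule and reproduces the example values; treating it as the slide's exact definition is an assumption, and `correlative_matrix` is a name chosen for this illustration.

```python
def correlative_matrix(s):
    """Upper-triangular matrix whose entry (i, j) is the length of the
    repeated run ending at positions i and j (0 when s[i] != s[j])."""
    n = len(s)
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                # First row starts a run; other rows extend the diagonal run.
                t[i][j] = 1 if i == 0 else t[i - 1][j - 1] + 1
    return t


pitches = [72, 68, 72, 68, 72, 60]
matrix = correlative_matrix(pitches)
longest_repeat = max(max(row) for row in matrix)
```

For this input the largest entry is 3, at row 2 and column 4 (0-based), meaning the three-pitch pattern 72 68 72 occurs twice. A reduction step could then drop one occurrence of long repeats, although the slides do not say exactly how repeats are chosen for removal.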

20 Multiple Devices
Each device receives 1/n of the database and computes the similarity between its share and the query input, where n is the number of devices.
[Figure: system flow with multiple devices]
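The 1/n distribution can be sketched as below. The interleaved split, the (distance, song id) result format, and the function names are assumptions for illustration; the slides do not say how the songs are assigned to devices or how the per-device results are merged.

```python
import heapq


def split_database(song_ids, n_devices):
    """Give device d every n-th song, so each device holds about 1/n of the data."""
    return [song_ids[d::n_devices] for d in range(n_devices)]


def merge_top_n(per_device_results, top_n):
    """Merge (distance, song_id) lists from all devices, keeping the closest songs."""
    merged = [item for results in per_device_results for item in results]
    return heapq.nsmallest(top_n, merged)


# 19130 songs (the reduced database) spread over 3 devices.
parts = split_database(list(range(19130)), 3)
sizes = [len(p) for p in parts]
```

With three devices the shares are 6377, 6377, and 6376 songs, so the per-device workload is nearly equal.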

21 Experimental Environment (1/2)
Experimental environment (NCHC Formosa 5 Cluster)
‧ OS: CentOS 6 x86_64
‧ RAM: 96 GB DDR3 ECC
‧ CPU: Intel Xeon X5670, six cores, 2.93 GHz, x2
‧ GPU: NVIDIA Tesla M2070 x3 (448 cores)
‧ Language version: C/C++
‧ CUDA version: 5.5
‧ Compute capability: 2.0

22 Experimental Data
‧ Corpus information: NTHU & NTU student recordings in 2013
‧ Database: 19130 songs (after reduction)
‧ Corpus: CHT Pop Song
‧ Format: WAV, 16 kHz, 16 bits, mono
‧ File amount: 2183 (1614 from NTU, 318 from NTHU, and 251 from CHT)
‧ WAV file length: 8-9 seconds

23 Experiment 1: LS with Different Shared Pitch Sizes & Block Dimensions Using Method 2
When the pitch vector size in shared memory is 2000 and the block dimension is 128, the computation time is the shortest, 1.427 sec.
[Figure: computation time (sec.) versus shared pitch size (K)]

24 Experiment 2: LS with Different Shared Pitch Sizes & Block Dimensions Using Method 3
When the pitch vector size in shared memory is 5000 and the block dimension is 512, the computation time is the shortest, 1.213 sec.
[Figure: computation time (sec.) versus shared pitch size (K)]

25 Experiment 3: LS with Different Numbers of Threads Using the Three Methods
For each block dimension, the best case of LS methods 2 and 3 (with a certain shared pitch size) is faster than method 1.
[Figure: computation time per song (sec.) versus LS block dimension (# of threads per block)]

26 Experiment 4: LSDTW with Different Numbers of Devices
With more devices the computation time drops, since each GPU needs to search only about 1/n of the database.
[Figure: computation time per song (sec.) versus number of devices]

27 Conclusions and Future Work (to be completed)
Conclusions
‧ Computation time: LS method 2 is faster than method 1, even though bank conflicts exist
‧ The computation time drops to almost 1/n when using n devices (GPUs)
Future work
‧ Advanced database purification to remove bad songs
  ‧ Abnormal melody (e.g., instrumental only)
  ‧ Wrong melody or song name
‧ Improve LS method 2 to reduce bank conflicts on the Kepler architecture
  ‧ Its definition of bank conflict differs from that of the Tesla & Fermi architectures
  ‧ Use a different method for appending pitches

28 Thank you!! & DEMO (http://miracle.mirlab.org:8080/miracle)

29 Methods: Dynamic Time Warping
[Figure: DTW grid with indices i and j, marking t(i-1), t(i), r(j-1), and r(j)]

30 Note & Pitch Data Charts
Average number of notes: 913.32, σ: 450.99
Average number of pitch data: 5849.94, σ: 1878.42

31 Note & Pitch Data Charts (After Repeating Pattern Removal)
Average number of notes: 694.7, σ: 386
Average number of pitch data: 4562.54, σ: 1773.6

32 Detail of Second-stage B
Handling by case (song name / singer):
‧ Different / Same: manually check
‧ Different / Unknown: 1. Compute the edit distance between the two song names. 2. Check the songs whose edit distance is smaller than a threshold
‧ Same / Different: manually check
‧ Same / Unknown: remove the data with unknown singer
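The edit-distance check for the "Different / Unknown" case can be illustrated with the standard Levenshtein distance, applied to the example pairs from slide 18. The threshold value is not given on the slides, so `THRESHOLD = 3` below is a purely hypothetical choice, as are the function and variable names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


THRESHOLD = 3  # hypothetical value; the slides do not state the threshold

pairs = [("A Day In The Life", "Day In The Life"),
         ("Procal Harum", "Procol Harum"),
         ("1st Of 5th", "First Of Fifth")]
to_check = [p for p in pairs if edit_distance(*p) < THRESHOLD]
```

Under this assumed threshold the first two pairs (distances 2 and 1) would be flagged for checking, while the abbreviation pair is too far apart and would need separate handling.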

33 System Flow with Multiple Devices (to be completed)
[Flowchart with User, CPU, and GPU lanes. Steps shown: sing/hum the song; detect endpoints and preprocess audio; convert to frame-based data; load database; perform linear scaling; reserve a varying number of candidate songs; perform dynamic time warping; post-process the ranking; show top-N song info.]

34 Experimental Environment (2/2)
Specification per device
‧ GPU: NVIDIA Tesla M2070
‧ Cores: 448
‧ CUDA version: 5.5
‧ Compute capability: 2.0
‧ Global memory: 5.25 GB
‧ Shared memory per block: 48 KB
‧ Registers per block: 32 KB
‧ Constant memory: 64 KB
‧ Clock rate: 1.093 GHz

35 The following are 高瑋's backup slides

36 DTW Implementation Detail
Dynamic time warping: each song is computed by one block, and the threads in the same block communicate through shared memory.
[Figure: parallel computation of DTW between the database pitch vector and the input pitch vector, with cells assigned to thread ids (tid) 1 to 6]
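For reference, a minimal sequential DTW sketch is given below. It uses the textbook recurrence with three predecessor cells and an absolute pitch difference as local cost; the slides do not spell out the exact local path constraints or weights used in the GPU version, so this should be read as an assumption rather than the author's kernel.

```python
def dtw_distance(query, reference):
    """Textbook DTW: d[i][j] is the cheapest alignment cost of the first
    i query frames with the first j reference frames."""
    inf = float("inf")
    n, m = len(query), len(reference)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            # Each cell depends only on its left, lower, and diagonal neighbors.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because a cell only needs its three neighbors, all cells on one anti-diagonal are independent of each other, which is the general property that lets several threads of a block fill the table in parallel and exchange values through shared memory.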

37 GPU Implementation Detail

38 Dynamic Time Warping
[Figure: pitch in semitones plotted over time]

39 The Improvement of Borda Count
Spec: LSDTW result with T = 10, scaling factor: 80%-170%, resolution: 31 times, step: 3%, key transposition 6 times, and 500 candidate songs
Query song name: LS rank / score, LSDTW rank / score, final score
‧ 隱形的翅膀: 1 / 500, 492 / 490, 990 (Top-1)
‧ 笨小孩: 3 / 498, 26 / 490, 988 (Top-5)
‧ 海闊天空: 126 / 490, 1 / 500, 990 (Top-1)
‧ 寶貝對不起: 70 / 490, 4 / 497, 987 (Top-7)
Without using threshold
‧ 隱形的翅膀: 1 / 500, 492 / 8, 508 (Top-1)
‧ 海闊天空: 126 / 374, 1 / 500, 874 (Top-5)

40 System Flow
‧ Perform linear scaling at the first stage
‧ Choose the better candidate songs as the second-stage input
‧ Perform dynamic time warping at the second stage
‧ Combine the two results after the two-stage computation
[Flowchart: sing a song; perform linear scaling; reserve a varying number of candidate songs; perform dynamic time warping; combine results]

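41 Method: Borda Count
Borda count
‧ R: # of candidate songs
‧ M: # of melody recognition methods
‧ r_ik: the rank of the k-th result in the i-th melody recognition method
‧ D_k: total score
Borda count example (ranking 1 is A B C D, ranking 2 is D C A B):
‧ Song A: score 3 in ranking 1, score 1 in ranking 2, total score 4, rank 1
‧ Song B: score 2 in ranking 1, score 0 in ranking 2, total score 2, rank 4
‧ Song C: score 1 in ranking 1, score 2 in ranking 2, total score 3, rank 2
‧ Song D: score 0 in ranking 1, score 3 in ranking 2, total score 3, rank 2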
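The A/B/C/D example above can be reproduced with a few lines. The scoring rule used here, score = R minus rank summed over the M methods, is inferred from the example values (rank 1 of 4 scores 3); the original formula is not preserved in the transcript, so treat the rule and the name `borda_count` as assumptions.

```python
def borda_count(rankings):
    """Sum, over all rankings, of (R - rank) for each candidate song."""
    candidates = rankings[0]
    r = len(candidates)  # R: number of candidate songs
    totals = {song: 0 for song in candidates}
    for ranking in rankings:  # one ranking per melody recognition method
        for rank, song in enumerate(ranking, start=1):
            totals[song] += r - rank
    return totals


totals = borda_count([["A", "B", "C", "D"], ["D", "C", "A", "B"]])
final_order = sorted(totals, key=lambda song: -totals[song])
```

Song A ends up first with a total of 4, C and D tie at 3, and B is last with 2, matching the example.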

42 Methods: Comparison of Two Methods
The proposed system combines linear scaling with dynamic time warping to accelerate computation.
‧ Computation time: linear scaling (LS) is faster; dynamic time warping (DTW) is slower
‧ Tempo variation: LS deals with uniform tempo variation; DTW deals with non-uniform tempo variation
‧ Key transposition: LS uses one-shot transposition; DTW uses heuristic search

43 Modified System Flow
[Flowchart, before using Borda count: perform linear scaling; get candidate songs and rank; perform dynamic time warping; produce rank.]
[Flowchart, after using Borda count: perform linear scaling; get candidate songs and rank; perform dynamic time warping; get candidate songs and rank; combine ranks using Borda count.]

44 Methods: Key Transposition in DTW
k: input pitch mean
Test case, key transposition range = ±3.0
[Figure: heuristic search over shifted means, with points labeled k-3, k-1.5, k, k+1.5, k+3 and then k'-0.75, k', k'+0.75]

45 The Improvement of Borda Count
Spec: LSDTW result with T = 10, scaling factor: 80%-170%, resolution: 31 times, step: 3%, key transposition 6 times, and 4000 candidate songs
Query song name: LS rank / score, LSDTW rank / score, final score
Using threshold
‧ 擱淺: 2 / 499, 395 / 490, 989 (Top-4)
‧ 你的微笑: 239 / 490, 2 / 499, 989 (Top-2)
Not using threshold
‧ 擱淺: 2 / 498, 395 / 105, 603 (Top-167)
‧ 你的微笑: 239 / 261, 2 / 498, 759 (Top-82)

46 The Change of Dynamic Time Warping Weight
An input pitch vector that contains noise affects the DTW computation.
[Figure: two DTW grids over the DB pitch vector and the input pitch vector, with indices i, j and points t(i-1), t(i), r(j-1), r(j); one with C1 = 1, C2 = 2 and one with C1 = 1, C2 = 1]

