Fudan University Chen, Yaoliang 1. TTS System A Chinese Text-To-Speech system SafeDB Bug backlog SMemoHelper A small tool that helps learn English.


1 Work @ Fudan University. Chen, Yaoliang

2 TTS System: a Chinese text-to-speech system
SafeDB: bug backlog
SMemoHelper: a small tool that helps learn English words
Fraud detection: time-series techniques

3 CGAP-align: a high-performance DNA short-read alignment tool
▫ Coauthored with BCM. Bioinformatics, in progress
▫ NDBC demo
On Encoding Shortest Paths in Large Graphs
▫ Coauthored with Jian Pei. VLDB, in progress
▫ Coauthored with Haixun Wang. SIGMOD, in progress
▫ NDBC
Other Projects

4 Baylor College of Medicine
Sequence alignment and its significance
▫ Reference & reads
Reference: ACTAGCGATATAACCCTTTCCCTTTCCCTTT
Read: CACGAT
Given a number z, a reference X, and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z.
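The problem statement on this slide can be illustrated with a small dynamic-programming sketch. It uses Sellers' classic approximate-matching recurrence, which is a swapped-in textbook technique and not the method used by BWA or CGAP-align. The function name `best_approximate_match` is hypothetical.

```python
def best_approximate_match(reference, read, z):
    """Return (end, distance) of the best substring match, or None.

    `end` is the 1-based position in `reference` where the matching
    substring W' ends; `distance` is EditDistance(read, W') <= z.
    Hypothetical helper for illustration only.
    """
    m = len(read)
    # prev[i] = cheapest way to align read[:i] against a substring of the
    # reference that ends at the previous reference position.
    prev = list(range(m + 1))
    best = (0, prev[m]) if prev[m] <= z else None
    for end, ref_char in enumerate(reference, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (read[i - 1] != ref_char)
            skip_ref_char = prev[i] + 1
            skip_read_char = cur[i - 1] + 1
            cur[i] = min(substitution, skip_ref_char, skip_read_char)
        if cur[m] <= z and (best is None or cur[m] < best[1]):
            best = (end, cur[m])
        prev = cur
    return best
```

This runs in O(|X| · |W|) time, which is far too slow for a whole genome; that cost is what motivates the index-based search on the following slides.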

5 DNA sequences in GenBank
A human genome sequence
▫ 2000: € 1,000,000,000 in ~10 years
▫ 2008: € 50 - 100,000 in ~4 months
▫ 2010: € 5 - 10,000 in ~2 weeks
▫ ...2015: € 1,000 in ~1 day
▫ ...2020: € 10 in ~1 hour to minutes

6 Burrows-Wheeler Alignment Tool (BWA)
▫ A popular tool for aligning gene fragments against a large reference sequence
Optimization of BWA
▫ Code level
▫ Algorithm level
BWA performance: T = N × T_aln
▫ N: the number of modified reads produced by enumerating all mismatches and gaps of the read
▫ T_aln: time to locate a modified read in the reference during the alignment stage

7 Optimizing T_aln: efficiency for matching
▫ Suffix Tarray
Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-array calculation

8 Comparison
▫ Suffix tree
▫ Suffix array
▫ Based on BWT (FM-index)

9 Rotations of T = mississippi# (from Yuval Rikover)
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F is the first column, L is the last column):
F  middle      L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
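The construction on this slide can be reproduced with a naive sketch that sorts all rotations. The function name `bwt_by_sorting` is hypothetical, and the sketch assumes the end marker `#` sorts before every other character, which holds for ASCII letters. Real tools build the transform from a suffix array instead of materialising the rotations.

```python
def bwt_by_sorting(text):
    """Return the F and L columns of the sorted rotation matrix of `text`.

    Naive O(n^2 log n) construction, for illustration only. `text` is
    assumed to end with a unique end marker such as '#'.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first_column = "".join(row[0] for row in rotations)
    last_column = "".join(row[-1] for row in rotations)
    return first_column, last_column


f_column, l_column = bwt_by_sorting("mississippi#")
print(f_column)  # #iiiimppssss
print(l_column)  # ipssm#pissii
```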

10 Reminder: Recovering T from L
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
▫ The i's are in the same order in L and F
▫ As are the rest of the chars
6. i is followed by s: mis
7. And so on…
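A minimal sketch of the recovery, assuming T ends with a unique end marker that sorts before every other character. It relies on the property stated in step 5: equal characters appear in the same relative order in L and F, so a stable sort of L yields the L-to-F mapping. Unlike the slide, which reads T left to right, this sketch walks T from its end towards its start; the function name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(last_column, end_marker="#"):
    """Rebuild T from the L column of its sorted rotation matrix."""
    n = len(last_column)
    # Python's sort is stable, so the k-th occurrence of a character in L
    # lands on the k-th occurrence of that character in F.
    f_to_l = sorted(range(n), key=lambda i: last_column[i])
    l_to_f = [0] * n
    for f_row, l_row in enumerate(f_to_l):
        l_to_f[l_row] = f_row

    row = 0  # row 0 starts with the end marker (assumed smallest)
    collected = []
    for _ in range(n - 1):
        # L[row] is the character that precedes F[row] in T.
        collected.append(last_column[row])
        row = l_to_f[row]
    return "".join(reversed(collected)) + end_marker
```

For example, `inverse_bwt("ipssm#pissii")` rebuilds `mississippi#`.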

11 Backward-search algorithm
Uses only L (output of BWT)
Relies on 2 structures:
▫ C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
▫ Occ(c,q): number of occurrences of char c in prefix L[1,q]
Example for T = mississippi#
▫ C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8
▫ Occ(s,5) = 2, Occ(s,12) = 4
▫ Occ (Rank)
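The two structures can be sketched directly from their definitions. The helper names `build_c` and `occ` are hypothetical, and `occ` scans the prefix on every call, whereas a real FM-index answers it from sampled counts.

```python
from collections import Counter


def build_c(last_column):
    """C[c] = number of characters in T that are smaller than c."""
    counts = Counter(last_column)
    c_table, running_total = {}, 0
    for char in sorted(counts):
        c_table[char] = running_total
        running_total += counts[char]
    return c_table


def occ(last_column, char, q):
    """Occ(c, q) = occurrences of `char` in the prefix L[1, q] (1-based)."""
    return last_column[:q].count(char)


L_COLUMN = "ipssm#pissii"  # L for T = mississippi#
print(build_c(L_COLUMN))
print(occ(L_COLUMN, "s", 5), occ(L_COLUMN, "s", 12))
```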

12 Substring search in T (count the pattern occurrences)
P = si
Available info: L and C
▫ L = ipssm#pissii
▫ C: # 1, i 2, m 7, p 8, S 10
Sorted rows of T = mississippi# (last column omitted):
#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m
First step: fr and lr bound the rows prefixed by char "i"
Inductive step: given fr, lr for P[j+1,p]
1. Take c = P[j]
2. Find the first c in L[fr, lr]
3. Find the last c in L[fr, lr]
Result: occ = 2 [lr-fr+1]
The row contents are unknown; the Occ() oracle is enough
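A minimal sketch of the counting procedure, using 1-based row numbers fr and lr as on the slide. The function name `count_occurrences` is hypothetical, C is taken as the number of characters smaller than c, and the Occ oracle is a plain prefix scan rather than a sampled table.

```python
from collections import Counter


def count_occurrences(last_column, pattern):
    """Count occurrences of `pattern` in T using only L, C and Occ."""
    counts = Counter(last_column)
    c_table, running_total = {}, 0
    for char in sorted(counts):
        c_table[char] = running_total
        running_total += counts[char]

    def occ(char, q):
        return last_column[:q].count(char)

    fr, lr = 1, len(last_column)  # every row matches the empty pattern
    for char in reversed(pattern):  # process P from right to left
        if char not in c_table:
            return 0
        fr = c_table[char] + occ(char, fr - 1) + 1
        lr = c_table[char] + occ(char, lr)
        if fr > lr:
            return 0
    return lr - fr + 1
```

With L = `ipssm#pissii` and P = `si`, the first step gives rows 2 to 5 (prefixed by `i`), and the inductive step narrows them to rows 9 to 10, so the count is 2.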

13 Backward search: store the "First" and "Last" (k and l) values

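14 Backward search on the FM-index, P = CAA
▫ The slide tracks i, c, First and Last while the pattern is processed from right to left ('A', 'A', then 'C')
▫ After the two 'A' steps: First = First(AA), Last = Last(AA)
▫ For c = 'C': First = C['T'] + Occ('C', First(AA)) + 1, Last = C['T'] + Occ('C', Last(AA))
[Figure: FM-index shown as a tree from the Root, following the A branches]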

15 Optimizing T_aln: efficiency for matching
▫ Suffix Tarray
Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-array calculation

16 e(W)
▫ The minimal number of edit operations needed to make W align exactly onto the reference X
D-array
▫ D[i]: lower bound of e(W[0…i])

17 Given a string W and an arbitrary split of W into consecutive substrings W = w_1 w_2 … w_k, we have e(W) ≥ e(w_1) + e(w_2) + … + e(w_k)
D-array in BWA
▫ Split W into several small strings W = w_1 w_2 … w_k with e(w_i) = 1 for all i
The correctness of the algorithm depends on the inequality e(W) ≥ e(w_1) + e(w_2) + … + e(w_k).

18 Example
Reference X = "AACGTATCGACG"
▫ W: AGTCAA
▫ D
A better segmentation: consider e(·) = 2
▫ W: AGTCAA
▫ D
▫ Calculating e(·) costs exponential time
▫ Needs pre-computation
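The BWA-style D-array from the previous slides can be sketched as a greedy scan: a piece is closed as soon as it no longer occurs in the reference, so each closed piece needs at least one edit. This is a simplified illustration under stated assumptions: the function name `bwa_style_d_array` is hypothetical, Python's `in` operator stands in for the index lookup that BWA performs, and AGTCAA is assumed to be the read shown on the slide.

```python
def bwa_style_d_array(reference, read):
    """D[i] = lower bound on the edits needed to align read[0..i].

    Greedy segmentation: extend the current piece while it still occurs
    exactly in the reference; once it does not, count one edit and start
    a new piece at the next position.
    """
    d_array = []
    edits = 0        # number of pieces closed so far
    piece_start = 0  # start index of the current piece
    for i in range(len(read)):
        if read[piece_start:i + 1] not in reference:
            edits += 1
            piece_start = i + 1
        d_array.append(edits)
    return d_array


print(bwa_style_d_array("AACGTATCGACG", "AGTCAA"))  # [0, 1, 1, 1, 2, 2]
```

The pieces do not overlap, so the number of closed pieces never exceeds e(W[0…i]); a segmentation into pieces with e(·) = 2 would raise the bound faster, at the cost of the pre-computation noted on the slide.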

19 Data-Conscious
Fasta file F containing training reads
▫ Should be similar to the reads in practice
Pipeline: Train Reads → Frequent Patterns → Trie → DFA
Mining Frequent Patterns (FPs)
▫ State-of-the-art methods
▫ Our solution: a simple DFS on FM-index
▫ Count = Last - First + 1
Generate the prefix trie T for the FPs with e(w) = 2.
Refine T to a DFA G_T
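The "simple DFS on FM-index" can be sketched as follows: each DFS step prepends one character with a backward-search update, and Count = Last - First + 1 decides whether to keep descending. This is an assumption-laden illustration, not the authors' implementation: the function name and parameters are hypothetical, a single training string stands in for the Fasta file, the BWT is built naively, and the filter on e(w) is omitted.

```python
def mine_frequent_patterns(text, alphabet, min_count, max_len, end_marker="#"):
    """Return {pattern: count} for patterns occurring >= min_count times."""
    marked = text + end_marker  # end_marker must sort before the alphabet
    rotations = sorted(marked[i:] + marked[:i] for i in range(len(marked)))
    last_column = "".join(row[-1] for row in rotations)
    # smaller[c] plays the role of C[c]: characters in T smaller than c.
    smaller = {c: sum(1 for x in last_column if x < c) for c in alphabet}

    def occ(char, q):
        return last_column[:q].count(char)

    found = {}

    def dfs(pattern, first, last):
        for char in alphabet:
            new_first = smaller[char] + occ(char, first - 1) + 1
            new_last = smaller[char] + occ(char, last)
            count = new_last - new_first + 1
            if count >= min_count:
                longer = char + pattern
                found[longer] = count
                if len(longer) < max_len:
                    dfs(longer, new_first, new_last)

    dfs("", 1, len(last_column))
    return found
```

Because every suffix of a frequent pattern is also frequent, pruning a branch as soon as its count drops below `min_count` does not lose any frequent pattern.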

20 Why Trie → DFA?
▫ During online alignment, we need to find all the FPs contained in a read
▫ This operation should cost no more than O(|W|)

21 Offline Index: Construction
String set (FP set): AA, C, G, T, AC, AG
The prefix trie is done. We start to construct the DFA.
[Figure: prefix trie T with root R and nodes 1 to 7]

22 DFS order: minimize the average hop between each jump (7% up)
[Figure: trie nodes 2 to 7 renumbered in DFS order]

23 Online Query
String set (FP set): AA, AC, AG, C, G, T
W = "CACAT"
[Figure: DFA traversal from root R, with labels L_C, L_AC and L_T]
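The online query can be illustrated with an Aho-Corasick automaton, a standard way to turn a pattern trie into a matcher that reports every pattern occurrence in one pass over the read. This is a swapped-in textbook technique with failure links; the DFA built in the talk may differ in its details, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        out[state].append(pattern)
    queue = deque(goto[0].values())  # depth-1 states keep fail = root
    while queue:
        state = queue.popleft()
        for char, child in goto[state].items():
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(char, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_patterns(read, automaton):
    """Return (start_index, pattern) for every pattern occurrence in read."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, char in enumerate(read):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


fp_automaton = build_automaton(["AA", "AC", "AG", "C", "G", "T"])
print(find_patterns("CACAT", fp_automaton))
```

For the slide's example this reports C at index 0, AC at index 1, C at index 2 and T at index 4. The scan is linear in |W| plus the number of reported matches; precomputing the failure transitions into a full transition table gives a DFA with one table lookup per character.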

24 Optimizing T_aln: efficiency for matching
▫ Suffix Tarray (20% up)
Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-array calculation (0-200% up)

25 Background
Consider a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges.
FH-Partition

26 Example: path 7 → 10
FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10
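The example reads as repeated lookups: FH(u, v) gives the next vertex after u on a shortest path to v, and following it until v is reached spells out the whole path. The sketch below assumes FH stands for "first hop" and stores it in a plain dictionary; the function name `follow_first_hops` is hypothetical, and the table holds only the three entries shown on the slide.

```python
def follow_first_hops(first_hop, source, target):
    """Rebuild the path source -> target from a first-hop table.

    `first_hop[(u, v)]` is the vertex that follows u on the path to v.
    """
    path = [source]
    while path[-1] != target:
        if len(path) > len(first_hop) + 1:
            raise ValueError("first-hop table contains a cycle")
        path.append(first_hop[(path[-1], target)])
    return path


slide_table = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
print(follow_first_hops(slide_table, 7, 10))  # [7, 9, 2, 10]
```

Storing a first hop for every vertex pair takes quadratic space, which is the cost that partitioning and encoding the FH values aims to reduce.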

27 Numbering Function

29 Stages: Compute FH-Partitions; Get Numbering Function(s); Encoding FH-Partitions
▫ Compute a naïve numbering function
▫ Store the FH-partitions
▫ Reduce to TSP
▫ Region tree
▫ Multi numbering functions
▫ Further compression
▫ Answering query efficiently

31 Thank you!

