29.3.2008, ETH Zurich. A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving). A report on the paper by Nicola Ranieri.


1 29.3.2008, ETH Zurich. A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving). A report on the paper by Nicola Ranieri, ranierin@ethz.ch

2 Biological Genome Basics
[Figure labels: genome, chromosome, deoxyribonucleic acid (DNA), gene, nucleotide, base pair. Example nucleotide sequence: ...ACCTGAATTCG...]

3 Overview
➢ Introduction
➢ Definitions
➢ Naive Algorithm
➢ Analysis
➢ Improvements
➢ New Algorithm

4 Motivation

5 Definition: Suffix Tree
A tree is a suffix tree for a string S if:
(a) the paths from the root to the leaves have a one-to-one relationship with the suffixes of S
(b) edges are labelled with non-empty strings
(c) every internal node (except perhaps the root node) has more than one child
[Figure: suffix tree for BANANA$ (character positions 0 to 6), leaves numbered 0 to 5]

6 Prefix Problem
Problem: If a suffix t1 is a prefix of another suffix t2, no leaf represents t1: the path for t2 continues through the node where t1 ends, making that node an internal node.
Solution: Append to S a terminating character that does not occur in S. This makes the suffixes of S prefix-free.
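The prefix problem can be checked directly on the BANANA example. The following is a minimal sketch; the function names `suffixes` and `prefix_pairs` are hypothetical and not taken from the slides.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, ordered by start position."""
    return [s[i:] for i in range(len(s))]


def prefix_pairs(s):
    """Return pairs (t1, t2) where suffix t1 is a proper prefix of suffix t2."""
    sufs = suffixes(s)
    return [(t1, t2) for t1 in sufs for t2 in sufs
            if t1 != t2 and t2.startswith(t1)]


# Without a terminator, some suffixes are prefixes of others.
print(prefix_pairs("BANANA"))
# With the terminator appended, the suffixes are prefix-free.
print(prefix_pairs("BANANA$"))
```

Without the terminator, suffixes such as A and NA end inside the paths of longer suffixes; with `$` appended, the list of offending pairs is empty.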

7 Suffix Tree: Use (1)
[Figure: suffix tree for BANANA$ (character positions 0 to 6, leaves 0 to 5); visible label text: A, N]

8 Suffix Tree: Use (2)
[Figure: suffix tree for BANANA$ (leaves 0 to 5); visible label text: NANN]

9 Definition: Trie
A trie, or prefix tree, is a tree in which every edge is labelled with a single character and every node represents the string formed by concatenating the characters on the path from the root to that node.
[Figure: trie with nodes J, R; JA, RA, RO; JAV, RAD, RAN, RAU, ROS; JAVA, RAND, RAUM, ROSE]
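A trie like the one in the figure can be sketched with nested dictionaries. The word list below is an assumption read off the node labels in the figure, and `build_trie` and `node_strings` are hypothetical helper names.

```python
def build_trie(words):
    """Build a trie as nested dicts; each key is a single-character edge label."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Follow the edge labelled ch, creating it if it does not exist yet.
            node = node.setdefault(ch, {})
    return root


def node_strings(trie, prefix=""):
    """List the string represented by every non-root node."""
    result = []
    for ch in sorted(trie):
        result.append(prefix + ch)
        result.extend(node_strings(trie[ch], prefix + ch))
    return result


trie = build_trie(["JAVA", "RAD", "RAND", "RAUM", "ROSE"])
print(node_strings(trie))
```

Each printed string corresponds to one node, that is, to the concatenation of the edge characters on its path from the root.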

10 Naive Construction
[Figure: BANANA$ with character positions 0 to 6; suffix 0 inserted below the root]

11 Naive Construction
[Figure: suffixes 0 and 1 inserted below the root]

12 Naive Construction
[Figure: suffixes 0 to 2 inserted below the root]

13 Naive Construction
[Figure: suffixes 0 to 3 inserted below the root]

14 Naive Construction
[Figure: suffixes 0 to 4 inserted below the root]

15 Naive Construction
[Figure: suffixes 0 to 5 inserted below the root]

16 Naive Construction
[Figure: root with suffixes 0 to 5 inserted]

17 Naive Construction
[Figure: resulting suffix tree for BANANA$ with leaves 0 to 5]
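One way to read the naive construction is: insert every suffix character by character, then merge chains of single-child nodes into edges labelled with strings. The sketch below follows that reading; the dictionary layout and the names `build_suffix_trie` and `merge_chains` are assumptions, and unlike the figures it also inserts suffix 6, which consists of the terminator alone.

```python
def build_suffix_trie(s):
    """Insert every suffix of s one character at a time; leaves store the start index."""
    root = {"children": {}, "leaf": None}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "leaf": None})
        node["leaf"] = i
    return root


def merge_chains(node):
    """Collapse chains of single-child nodes so that edges carry strings."""
    merged = {}
    for label, child in node["children"].items():
        # Extend the label while the child is a non-leaf with exactly one child.
        while len(child["children"]) == 1 and child["leaf"] is None:
            (next_label, next_child), = child["children"].items()
            label += next_label
            child = next_child
        merge_chains(child)
        merged[label] = child
    node["children"] = merged
    return node


tree = merge_chains(build_suffix_trie("BANANA$"))
print(sorted(tree["children"]))              # edge labels leaving the root
print(sorted(tree["children"]["A"]["children"]))
```

After merging, every internal node below the root has more than one child, which matches condition (c) of the suffix tree definition.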

18 Analysis
Insert n + (n-1) + (n-2) + ... + 2 + 1 nodes
Merge n + (n-1) + (n-2) + ... + 2 + 1 edges
Worst case time complexity: O(n²)
Worst case space complexity: O(n²)
Worst-case example: S = abc..xyz
[Figure: root with leaves 0, 1, ..., 24, 25]
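The node count for the worst case can be verified numerically. This is a small sketch with a hypothetical helper `count_trie_nodes`; it counts the nodes of the uncompressed trie built by character-wise insertion and compares the result with n(n+1)/2.

```python
import string


def count_trie_nodes(s):
    """Count the non-root nodes created when all suffixes of s are inserted."""
    root = {}
    count = 0
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


s = string.ascii_lowercase          # S = abc..xyz, all characters distinct
n = len(s)
print(count_trie_nodes(s), n * (n + 1) // 2)
print(count_trie_nodes("aaaa"))     # suffixes share one path, so only 4 nodes
```

When all characters are distinct, no two suffixes share a prefix, so every character of every suffix creates its own node and the sum n + (n-1) + ... + 1 is reached exactly.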

19 Improvement: Direct Merging (1)
[Figure: tree under construction for BANANA$ (character positions 0 to 6); visible label text below the root: NAAN, A, $]

20 Improvement: Direct Merging (2)
[Figure: tree under construction for BANANA$ (character positions 0 to 6); visible label text below the root: NAAN, A, $]

21 Improvement: Space Requirement (1)
[Figure: suffix tree for BANANA$ (character positions 0 to 6, leaves 0 to 5) with edge labels given as index pairs into the string: 0-6, 1-1, 6-6, 2-3, 6-6, 2-3, 4-6]

22 Improvement: Space Requirement (2)
[Figure: suffix tree for BANANA$ (character positions 0 to 6, leaves 0 to 5) with edge labels given as index pairs into the string: 0-6, 1-1, 6-6, 2-3, 6-6, 2-3, 4-6]
➢ The right index of an edge to an inner node is equal to the smallest index in the child-edges minus one
➢ The right index of an edge to a leaf is equal to the length of S
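The two rules mean that only the left index of each edge needs to be stored. The sketch below illustrates this for the subtree below the edge A of BANANA$; the `Node` class and helper names are hypothetical, and the string passed in already includes the terminator, so the leaf rule becomes `len(s) - 1`.

```python
S = "BANANA$"


class Node:
    def __init__(self, left, children=()):
        self.left = left                  # left index of the incoming edge label in S
        self.children = list(children)


def right_index(node, s=S):
    """Derive the right index of the incoming edge instead of storing it."""
    if not node.children:
        # Edge to a leaf: ends at the terminator, whose index equals
        # the length of the string without the terminator.
        return len(s) - 1
    # Edge to an inner node: smallest left index among the child edges, minus one.
    return min(child.left for child in node.children) - 1


def label(node, s=S):
    """Recover the edge label from the stored left and derived right index."""
    return s[node.left:right_index(node, s) + 1]


# Subtree below the root edge "A" (index pair 1-1 in the figure).
na = Node(2, [Node(4), Node(6)])          # edge 2-3 with child edges 4-6 and 6-6
a = Node(1, [na, Node(6)])
print(label(a), label(na), [label(c) for c in na.children])
```

Storing one integer per edge instead of two is the saving the slide points at; the derived labels match the index pairs 1-1, 2-3, 4-6 and 6-6.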

23 Definition: Suffix Link
A suffix link is a pointer from an inner non-root node representing the substring xα to another inner node representing α (the root node, if α is empty), where x is a single character and α is a string.
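For a small string, the suffix links can be enumerated by brute force. The sketch below is an illustration only, not an efficient construction: it uses the fact that a substring is an inner node exactly when it is followed by at least two different characters, and the function names are hypothetical.

```python
from collections import defaultdict


def internal_node_strings(s):
    """Substrings of s followed by at least two distinct characters (inner nodes)."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers[s[i:j]].add(s[j])
    return {sub for sub, nxt in followers.items() if len(nxt) > 1}


def suffix_links(s):
    """Map each inner node xα to α; the empty string stands for the root."""
    nodes = internal_node_strings(s)
    links = {}
    for node in nodes:
        target = node[1:]                 # drop the single leading character x
        assert target == "" or target in nodes
        links[node] = target
    return links


print(suffix_links("BANANA$"))
```

For BANANA$ the inner nodes are A, NA and ANA, and the links form the chain ANA to NA to A to the root.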

24 A New Algorithm: Observations
➢ Suffixes with different prefixes fall into different subtrees
➢ Suffix links are the only connection between two subtrees
Therefore, when no suffix links are used, each subtree can be built independently of the other subtrees.

25 A New Algorithm: Partitioning (1)
Number of partitions: p = floor( )
Assumption: DNA has a pseudo-random nature and hence the tree is balanced
Character encoding: A = 00, C = 01, G = 10, T = 11
[Figure: example sequence with 2-bit codes: 00 A, 01 C, C, 11 T, ...]
Each suffix now has a prefix code P_i, which defines a lexicographic order. Divide the range of all P_i into p sub-ranges, each representing one partition.

26 A New Algorithm: Partitioning (1)
Prefix code:
Partition range:
Check for partition j:
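The formulas for this slide are not preserved in the transcript, so the following is only a sketch under stated assumptions: the prefix code is taken to be the integer formed by the 2-bit codes of the first w characters of a suffix, short suffixes are padded with A, and the code range is divided into p sub-ranges of equal width. The parameter w and all function names are hypothetical.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}   # encoding from the slides


def prefix_code(s, i, w):
    """Integer code of the first w characters of suffix i (assumed definition)."""
    code = 0
    for k in range(w):
        # Assumption: suffixes shorter than w are padded with A (00).
        ch = s[i + k] if i + k < len(s) else "A"
        code = (code << 2) | CODE[ch]
    return code


def partition_range(j, p, w):
    """Half-open code range [lo, hi) of partition j (assumed equal-width split)."""
    total = 4 ** w
    return j * total // p, (j + 1) * total // p


def in_partition(s, i, j, p, w):
    """Check whether suffix i falls into partition j."""
    lo, hi = partition_range(j, p, w)
    return lo <= prefix_code(s, i, w) < hi


s = "ACCTGAATTCG"
print([prefix_code(s, i, 2) for i in range(len(s))])
print([j for j in range(4) if in_partition(s, 3, j, 4, 2)])
```

Because the 2-bit codes preserve the order A < C < G < T, comparing prefix codes as integers agrees with the lexicographic order of the prefixes, which is what makes a range check sufficient.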

27 A New Algorithm: Construction
for j in partitions do
    initialize_tree t
    for i in 0..length_of_S do
        if suffix i is in partition j
            insert i into t
        endif
    endfor
    save_to_disk t
endfor
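The loop structure above can be sketched in Python. This is a toy version under simplifying assumptions, not the paper's implementation: a suffix belongs to partition j if it starts with the j-th letter of the alphabet, each tree is an uncompressed nested-dict trie, and "saving to disk" means writing one JSON file per partition. The key `"$"` marks the start index of a suffix ending at a node.

```python
import json
import os
import tempfile


def insert_suffix(tree, s, i):
    """Insert suffix i of s into a nested-dict trie; "$" stores the start index."""
    node = tree
    for ch in s[i:]:
        node = node.setdefault(ch, {})
    node["$"] = i


def build_partitioned_index(s, out_dir, alphabet="ACGT"):
    """Build one tree per partition with a full pass over s, then write it out."""
    paths = []
    for j, first in enumerate(alphabet):       # for j in partitions
        tree = {}                               # initialize_tree t
        for i in range(len(s)):                 # for i in 0..length_of_S
            if s[i] == first:                   # suffix i is in partition j
                insert_suffix(tree, s, i)
        path = os.path.join(out_dir, f"partition_{j}.json")
        with open(path, "w") as f:              # save_to_disk t
            json.dump(tree, f)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    paths = build_partitioned_index("ACCTGAATTCG", out_dir)
    print([os.path.basename(p) for p in paths])
```

Only one partition's tree is held in memory at a time, at the cost of scanning the whole sequence once per partition.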

28 A New Algorithm: Experimental Results

29 Questions?

