ETH Zurich. A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving). A report on the paper, by Nicola Ranieri
2 Biological Genome Basics Genome Chromosome Deoxyribonucleic acid (DNA) Gene Nucleotide Base Pair Nucleotides...ACCTGAATTCG...
3 Overview ➢ Introduction ➢ Definitions ➢ Naive Algorithm ➢ Analysis ➢ Improvements ➢ New Algorithm
4 Motivation
5 Definition: Suffix Tree
A tree is a suffix tree for a string S if:
(a) the paths from the root to the leaves have a one-to-one relationship with the suffixes of S
(b) edges are labelled with non-empty strings
(c) every internal node (except perhaps the root node) has more than one child
Example string: BANANA$
6 Prefix Problem
Problem: If a suffix t1 is a prefix of another suffix t2, there is no leaf representing t1: the remainder of t2 is appended below the node representing t1, which makes it an internal node.
Solution: Append to S a terminating character that does not occur in S. This makes the suffixes of S prefix-free.
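The prefix problem can be checked directly on the example string. The following is a minimal sketch; the helper names `suffixes` and `prefix_pairs` are made up for illustration and do not come from the paper.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def prefix_pairs(s):
    """Return all (t1, t2) where suffix t1 is a proper prefix of suffix t2."""
    sufs = suffixes(s)
    return [(t1, t2) for t1 in sufs for t2 in sufs
            if t1 != t2 and t2.startswith(t1)]


# Without a terminator, some suffixes are prefixes of others (e.g. "A" of "ANA").
without_terminator = prefix_pairs("BANANA")
# With the terminator "$", which does not occur in "BANANA", no such pair remains.
with_terminator = prefix_pairs("BANANA$")
```

Here `without_terminator` is non-empty while `with_terminator` is empty, which is exactly the property the terminating character is meant to guarantee.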
7 Suffix Tree: Use (1) A N BANANA$ BANANA$
8 Suffix Tree: Use (2) NANNNANN
9 Definition: Trie
A trie or prefix tree is a tree in which every edge is labelled with a single character and every node represents the string of the concatenated characters on the root path.
Example nodes: J, R, JA, RA, RO, JAV, RAD, RAN, RAU, ROS, JAVA, RAND, RAUM, ROSE
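A trie can be sketched with nested dictionaries, one dictionary per node and one key per single-character edge. The word list below is inferred from the node labels on the slide, and the function names are illustrative assumptions.

```python
def make_trie(words):
    """Build a trie as nested dicts: each key is a single-character edge label."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Follow the edge labelled ch, creating the child node if needed.
            node = node.setdefault(ch, {})
    return root


def node_strings(node, prefix=""):
    """List the string represented by every non-root node (its root path)."""
    out = []
    for ch in sorted(node):
        out.append(prefix + ch)
        out.extend(node_strings(node[ch], prefix + ch))
    return out


trie = make_trie(["JAVA", "RAD", "RAND", "RAUM", "ROSE"])
labels = node_strings(trie)
```

Each entry of `labels` is the concatenation of the edge characters on the path from the root to one node, as in the definition.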
10-17 Naive Construction
[Figure sequence: step-by-step naive construction of the tree for BANANA$, starting from the root.]
18 Analysis
Insert n + (n-1) + (n-2) + ... + 1 nodes
Merge n + (n-1) + (n-2) + ... + 1 edges
Worst case time complexity: O(n²)
Worst case space complexity: O(n²)
Worst-case example: S = abc..xyz (all characters distinct)
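The quadratic node count can be reproduced with a small sketch of the naive approach: insert every suffix character by character, then merge chains of single-child nodes into one edge. This is an illustration under the assumption that the naive algorithm works in these two phases; the function names are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s character by character; return (root, node_count)."""
    root = {}
    count = 0
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node:
                node[ch] = {}
                count += 1  # one new node per character that is not shared
            node = node[ch]
    return root, count


def merge_chains(node):
    """Collapse every chain of single-child nodes into one edge with a string label."""
    merged = {}
    for label, child in node.items():
        while len(child) == 1:
            ch, grandchild = next(iter(child.items()))
            label += ch
            child = grandchild
        merged[label] = merge_chains(child)
    return merged


# All characters distinct: nothing is shared, so n + (n-1) + ... + 1 nodes.
_, worst = build_suffix_trie("abcdef")
trie, banana_nodes = build_suffix_trie("BANANA$")
tree = merge_chains(trie)
```

For `"abcdef"` (n = 6) the trie holds 6 + 5 + 4 + 3 + 2 + 1 = 21 nodes, while `"BANANA$"` needs fewer nodes than the worst case because repeated substrings share paths.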
19 Improvement: Direct Merging (1) Root NAAN A $ BANANA$ BANANA$
20 Improvement: Direct Merging (2) Root NAAN A $ BANANA$ BANANA$
21 Improvement: Space Requirement (1) BANANA$ BANANA$
22 Improvement: Space Requirement (2)
Example string: BANANA$
➢ The right index of an edge to an inner node is equal to the smallest index in the child-edges minus one
➢ The right index of an edge to a leaf is equal to the length of S
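The two rules above can be tried out on a tree whose edges store only a left index. The sketch below assumes 1-based indices, a terminated string, and suffixes inserted in order of increasing start position (this ordering is what makes the derived right index valid here); the data layout and names are illustrative, not the paper's.

```python
def compress(node):
    """Merge single-child chains; keep only the left index of each merged edge."""
    edges = []
    for child in node["kids"].values():
        left = child["pos"]
        while len(child["kids"]) == 1:
            child = next(iter(child["kids"].values()))
        edges.append({"left": left, "kids": compress(child)})
    return edges


def build_indexed_tree(s):
    """Naively build a suffix tree whose edges store a 1-based left index into s."""
    root = {"pos": 0, "kids": {}}
    for i in range(len(s)):
        node = root
        for k in range(i, len(s)):
            # pos records where this character was first read in s (1-based).
            node = node["kids"].setdefault(s[k], {"pos": k + 1, "kids": {}})
    return compress(root)


def right_index(edge, n):
    """Derive the right index instead of storing it."""
    if not edge["kids"]:
        return n  # edge to a leaf: ends at the length of S
    return min(kid["left"] for kid in edge["kids"]) - 1  # edge to an inner node


def label(edge, s):
    """Recover the edge label from the left index and the derived right index."""
    return s[edge["left"] - 1:right_index(edge, len(s))]


S = "BANANA$"
root_edges = build_indexed_tree(S)
root_labels = sorted(label(e, S) for e in root_edges)
```

Only one integer per edge is stored; `label` reconstructs the full edge string on demand from S.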
23 Definition: Suffix Link
A suffix link is a pointer from an inner non-root node representing the substring xα to another inner node representing α (the root node, if α is empty), where x is a single character and α is a string.
24 A New Algorithm: Observations
➢ Suffixes with different prefixes fall in different subtrees
➢ Suffix links are the only connection between two subtrees
Therefore each subtree can be built independently of the other subtrees when no suffix links are used.
25 A New Algorithm: Partitioning (1)
Number of partitions: p = floor( )
Assumption: DNA has a pseudo-random nature and hence the tree is balanced
Character Encoding: A = 00, C = 01, G = 10, T = 11
Example: ACCT... is encoded as 00 01 01 11...
Each suffix now has a prefix code P_i, which defines a lexicographic order. Divide the range of all P_i into p sub-ranges, each representing one partition.
26 A New Algorithm: Partitioning (2) Prefix code: Partition Range: Check for partition j:
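A prefix code and a partition check could look like the sketch below. The formulas on the slide are not recoverable, so everything beyond the 2-bit character encoding is an assumption: the prefix length `w`, the equal-width sub-ranges, and the treatment of the terminator and of short suffixes as the lowest code.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def prefix_code(s, i, w):
    """Integer code of the first w characters of the suffix starting at index i.

    Assumption: characters outside A/C/G/T (e.g. a terminator) and positions
    past the end of s count as the lowest code, 00.
    """
    code = 0
    for k in range(i, i + w):
        bits = CODE.get(s[k], 0) if k < len(s) else 0
        code = (code << 2) | bits
    return code


def partition_of(code, w, p):
    """Map a code in [0, 4**w) to one of p equal-width sub-ranges (assumed split)."""
    return code * p // 4 ** w


def in_partition(s, i, w, p, j):
    """Check whether suffix i belongs to partition j."""
    return partition_of(prefix_code(s, i, w), w, p) == j


acct = prefix_code("ACCT", 0, 4)   # bits 00 01 01 11
```

Because the codes preserve lexicographic order of the prefixes, consecutive sub-ranges correspond to consecutive groups of subtrees.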
27 A New Algorithm: Construction
for j in partitions do
    initialize_tree t
    for i in 0..length_of_S do
        if suffix i is in partition j
            insert i into t
        endif
    endfor
    save_to_disk t
endfor
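The loop structure of the pseudocode can be made concrete as follows. This is a rough sketch, not the paper's implementation: a character trie stands in for the real tree, `pickle` files stand in for the on-disk format, and the partition rule (equal-width ranges over a prefix of length `w`) is an assumption.

```python
import pickle
import tempfile
from pathlib import Path

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def partition_of(s, i, w, p):
    """Partition of suffix i, based on the 2-bit code of its first w characters."""
    code = 0
    for k in range(i, i + w):
        # Assumption: the terminator and missing characters count as code 0.
        code = code * 4 + (CODE.get(s[k], 0) if k < len(s) else 0)
    return code * p // 4 ** w


def insert_suffix(tree, s, i):
    """Insert suffix i into a nested-dict trie and mark its leaf with i."""
    node = tree
    for ch in s[i:]:
        node = node.setdefault(ch, {})
    node["leaf"] = i


def build_partitions(s, w, p, out_dir):
    """One pass over S per partition; each finished tree is written to disk."""
    paths = []
    for j in range(p):
        tree = {}
        for i in range(len(s)):
            if partition_of(s, i, w, p) == j:
                insert_suffix(tree, s, i)
        path = Path(out_dir) / f"partition_{j}.pickle"
        path.write_bytes(pickle.dumps(tree))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    files = build_partitions("GATTACA$", w=1, p=4, out_dir=tmp)
    trees = [pickle.loads(f.read_bytes()) for f in files]
```

Only one partition's tree is held in memory at a time, at the cost of scanning S once per partition.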
28 A New Algorithm: Experimental Results
29 Questions?