GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.


1

2 Find Gene Structures in DNA
[Figure: the sequence GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA annotated as exon, intron, and intergene, with the states Intergene State, First Exon State, and Intron State]

3 Hidden Markov Model for Gene Finding
- Intron, Exon, and Intergenic states
- Exon frame is encoded in the architecture by defining more states
- Exon states have an explicit duration density
- Intron states have a geometric duration
- Parameters are trained separately for different levels of GC content (which correlates with gene density and with the lengths of exons and introns)
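As a short worked note on what "geometric duration" means: if a state returns to itself with probability p at every step (the symbol p is an assumption introduced here, not taken from the slide), the number of steps D spent in that state follows a geometric distribution.

```latex
P(D = d) = p^{\,d-1}(1 - p), \qquad d = 1, 2, \dots
\qquad\Longrightarrow\qquad
\mathbb{E}[D] = \sum_{d \ge 1} d\, p^{\,d-1}(1 - p) = \frac{1}{1 - p}
```

This distribution always decreases with d, so a plain self-loop cannot represent a length distribution that peaks at some typical length; that is presumably why the exon states are given an explicit duration density instead.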

4 Comparison-based Methods

5 Cross-species gene finding
[Figure: a gene drawn 5' to 3' with Exon1, Intron1, Exon2, Intron2, Exon3]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
          |     ||||| ||||| |||   ||||||||||||| |||||    |  |
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

6 Comparison of 1196 orthologous genes (Makalowski et al., 1996)
Sequence identity between genes in human/mouse:
- exons: 84.6%
- protein: 85.4%
- introns: 35%
- 5' UTRs: 67%
- 3' UTRs: 69%
27 proteins were 100% identical.

7 Human-mouse homology
[Figure: Human and Mouse]

8

9 Not always: HoxA human-mouse

10 Twinscan
Twinscan is an augmented version of the Genscan HMM.
[Figure: states E and I, labeled with transitions, duration, and emissions, above the sequence ACUAUACAGACAUAUAUCAU]

11 Twinscan Algorithm
1. Align the two sequences (e.g., from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
New "alphabet": 4 x 3 = 12 letters
Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

12 Twinscan Algorithm
3. Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| }
Note: the emission distributions e_k(b) are estimated from real human/mouse genes
- e_I(x|) < e_E(x|): matches are favored in exons
- e_I(x-) > e_E(x-): gaps (and mismatches) are favored in introns
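A minimal sketch of the Viterbi step on the 12-letter alphabet, reduced to two states (E for exon, I for intron). Every number below is hypothetical and was chosen only so that matches favor exons and gaps or mismatches favor introns; real Twinscan parameters are estimated from data and its state space is much larger. As a further simplification, the emission here depends only on the conservation mark, with the base contributing a uniform factor of 1/4.

```python
import math

STATES = ("E", "I")
# Hypothetical parameters for illustration only.
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
MARK_EMIT = {
    "E": {"|": 0.80, ":": 0.15, "-": 0.05},   # matches favored in exons
    "I": {"|": 0.40, ":": 0.30, "-": 0.30},   # gaps and mismatches favored in introns
}


def emit(state, symbol):
    """Log emission of a symbol such as 'A|'; the base itself is treated as uniform."""
    return math.log(MARK_EMIT[state][symbol[1]] / 4)


def viterbi(symbols):
    """Return the most probable E/I state path for a list of symbols like 'A|', 'G:'."""
    score = {s: math.log(START[s]) + emit(s, symbols[0]) for s in STATES}
    backpointers = []
    for symbol in symbols[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + emit(s, symbol)
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi(["A|"] * 6 + ["A-"] * 6))
```

With these made-up numbers, a run of matches followed by a run of gaps is decoded as exon followed by intron, which is the qualitative behavior the emission inequalities describe.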

13 Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall: e_E(A|) > e_I(A|) and e_E(A-) < e_I(A-)
Likely exon
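The conversion from an alignment to the HMM input can be sketched as below. The function name `conservation_symbols` is made up for this example, and skipping alignment columns where the human row itself has a gap is an assumption (the slides only say that each human base gets a mark).

```python
def conservation_symbols(target, informant):
    """Label each base of the target row with '|', ':' or '-' from a gapped alignment."""
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue          # assumption: no target base in this column, so nothing to label
        if q == "-":
            mark = "-"        # target base aligned to a gap
        elif t == q:
            mark = "|"        # match
        else:
            mark = ":"        # mismatch
        symbols.append(t + mark)
    return symbols


human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
print(" ".join(conservation_symbols(human, mouse)))
# prints: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
```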

14 HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

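15 A Pair HMM for alignments
States and emissions: M emits P(x_i, y_j), I emits P(x_i), J emits P(y_j)
[Figure: state diagram with BEGIN, END, M, I, and J; the transition-probability labels did not survive in this transcript]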

16 Generalized Pair HMMs

17 Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate an alignment of length d + e.

18 Cross-species gene finding
[Figure: human and mouse genes drawn 5' to 3' with Exon1, Intron1, Exon2, Intron2, Exon3, and a CNS]

19 The SLAM hidden Markov model

20 Computational complexity
[Figure: complexity expressed in terms of the number of states, the maximum duration, the length of seq1, and the length of seq2]

21 Approximate alignment
Reduces the TU factor to hT

22 Measuring Performance

23 Example: HoxA2 and HoxA3
[Figure tracks: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]

24 Suffix Trees (a short break from biology)

25 Suffix Trees
Suffix trees are a method to find all maximal matches between two strings (and much more).
Example: x = dabdac
[Figure: the suffix tree of x, with leaves numbered 1 to 6]

26 Definition of a Suffix Tree
Definition: for a string x = x_1…x_m, a suffix tree is:
- A rooted tree with m leaves, where leaf i corresponds to the suffix x_i…x_m
- Each edge is labeled with a substring of x
- No two edges out of a node start with the same letter
It follows that every substring of x corresponds to an initial part of a path from the root to a leaf.

27 Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert the special symbol $ at the end of x
3. For j = 1 to m:
   - Find the longest match of x_j…x_m to T, starting from r
   - Split the edge where the match stops: new node w
   - Create edge (w, j), and label it with the unmatched portion of x_j…x_m
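The steps above can be sketched in Python as follows. This is a quadratic-time illustration, not one of the linear-time constructions. The names `Node`, `build_suffix_tree`, `contains`, and `count_leaves` are invented for this example, edges are stored as index pairs into the text rather than as copied strings, and the input is assumed not to contain `$`.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # label of the edge into this node: text[start:end]
        self.children = {}                 # first letter of an outgoing edge -> child Node
        self.leaf = leaf                   # 1-based start of the suffix, for leaves only


def build_suffix_tree(x):
    """Naive construction: insert every suffix of x + '$', splitting edges as needed."""
    text = x + "$"
    root = Node()
    for j in range(len(text)):
        node, i = root, j
        while True:
            child = node.children.get(text[i])
            if child is None:              # nothing matches: hang a new leaf here
                node.children[text[i]] = Node(i, len(text), leaf=j + 1)
                break
            k, length = 0, child.end - child.start
            while k < length and text[child.start + k] == text[i + k]:
                k += 1                     # walk along the edge while it agrees
            if k == length:                # whole edge matched: continue below it
                node, i = child, i + k
                continue
            # Match stops inside the edge: split it at a new internal node w.
            w = Node(child.start, child.start + k)
            node.children[text[child.start]] = w
            child.start += k
            w.children[text[child.start]] = child
            w.children[text[i + k]] = Node(i + k, len(text), leaf=j + 1)
            break
    return root, text


def contains(root, text, pattern):
    """True if pattern labels an initial part of some root-to-leaf path."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        label = text[child.start:child.end]
        chunk = pattern[i:i + len(label)]
        if label[:len(chunk)] != chunk:
            return False
        node, i = child, i + len(chunk)
    return True


def count_leaves(node):
    return 1 if not node.children else sum(count_leaves(c) for c in node.children.values())


root, text = build_suffix_tree("dabda")
print(count_leaves(root), contains(root, text, "abd"), contains(root, text, "dd"))
```

Because the terminal `$` occurs only once, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.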

28 Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion, with leaves numbered 1 to 6]

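29 Memory to Store Suffix Tree
Can store in O(N) memory, where N = |x|!
- Every edge is labeled with a pair (i, j), which denotes x_i…x_j
- The tree has O(N) nodes
Proof sketch:
1. Every internal node has at least two children, so # internal nodes ≤ # leaves - 1
2. # leaves = |x| (one leaf per suffix)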

30 Faster Construction
- Several algorithms run in O(N) time and O(N) memory, with a big constant (~15 bytes/char)
- Technical but not deep; outside the scope of this course
- Optional: Gusfield, chapter 6

31 Application: find all matches between x, y
1. Build a suffix tree for x, and mark its nodes with x
2. Insert y into the suffix tree, and mark all nodes that y passes through with y
The path label of every node marked with both x and y is a common substring.
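The marking idea can be illustrated with a deliberately simplified structure. The sketch below uses an uncompressed suffix trie (one node per character, quadratic memory) instead of a true suffix tree, which keeps the code short while preserving the logic: every node reached while inserting suffixes of x is marked "x", every node reached while inserting suffixes of y is marked "y", and a node carrying both marks spells a common substring. All function names are hypothetical.

```python
def build_marked_trie(x, y):
    """Insert all suffixes of x and y into one trie, marking nodes by source string."""
    root = {"marks": set(), "next": {}}
    for name, s in (("x", x), ("y", y)):
        for j in range(len(s)):
            node = root
            for ch in s[j:]:
                node = node["next"].setdefault(ch, {"marks": set(), "next": {}})
                node["marks"].add(name)
    return root


def common_substrings(x, y):
    """Return, in sorted order, every non-empty string that occurs in both x and y."""
    root = build_marked_trie(x, y)
    found, stack = [], [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["next"].items():
            if child["marks"] == {"x", "y"}:   # marked by both strings
                found.append(label + ch)
                stack.append((child, label + ch))
    return sorted(found)


def longest_common_substring(x, y):
    return max(common_substrings(x, y), key=len, default="")


print(common_substrings("dabda", "abada"))
# prints: ['a', 'ab', 'b', 'd', 'da']
```

A compressed suffix tree gives the same answer in linear space; only the node representation changes, not the marking rule.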

32 Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
[Figure: the tree for x after each suffix of y is inserted, with nodes marked x and/or y]

33 Application: common substrings of k strings
To find the longest common substring of s_1, s_2, …, s_n:
1. Build a suffix tree for s_1, …, s_n
2. All nodes labeled {s_i1, …, s_ik} represent a match between s_i1, …, s_ik

34 Suffix Arrays
Example: ABRACADABRA$
11 $
10 A$
 7 ABRA$
 0 ABRACADABRA$
 3 ACADABRA$
 5 ADABRA$
 8 BRA$
 1 BRACADABRA$
 4 CADABRA$
 6 DABRA$
 9 RA$
 2 RACADABRA$
- Fast search: O(log n) suffix comparisons to locate any query string
- Used for data compression, such as bzip2
- Can be built in O(n) time by first building the suffix tree and then reading off the ordered suffixes by in-order traversal
  - Too much memory: ~15n bytes
  - Difficult to implement
- Theoretical build in O(n log n) using O(n / sqrt(log n)) extra memory
- How to build one fast in practice is a hot topic
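A small sketch of the suffix array for the example string, together with the binary search it enables. The construction here simply sorts the suffixes directly, which is far slower than the methods mentioned above and is used only to keep the example short; the function names are made up for illustration.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in sorted order (naive: sorts whole suffixes)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, pattern):
    """All start positions of pattern in s, found by two binary searches over sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "ABRACADABRA$"
sa = suffix_array(text)
print(sa)                                   # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(text, sa, "ABRA"))   # [0, 7]
```

All suffixes that begin with the query occupy one contiguous block of the array, which is why two binary searches are enough to find every occurrence.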

