Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson

Similar presentations


Presentation on theme: "CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson"— Presentation transcript:

1 CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Graph Algorithms CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson

2 Sequencing by Hybridization
DNA Array gives all strings of length l How do we find the order? Spectrum(s,l) String s of length n Spectrum is multiset of n-l+1 l-mers in s

3 Sequencing by Hybridization
s = TATGGTGC l = 3 Spectrum(s,l) = {TAT,ATG,TGG,GGT,GTG,TGC} Problem: Input: Set S of all l-mers from s Output: String s s.t. Spectrum(s,l) = S

4 Hybridization on DNA Array

5 Sequencing by Hybridization
Special case of Shortest Superstring Problem SBH is linear-time SSP (NP-Complete) is more general In SSP, no guaranteed overlap In SBH, we know the length of the target sequence

6 Sequencing by Hybridization
There is a problem with DNA Arrays No good way to distinguish a match from a highly stable mismatch Mismatch could give strong hybridization signal Need longer probes to deal with mutations

7 SBH: Hamiltonian Path Approach
Two l-mers overlap if overlap(p,q) = l – 1 Last l-1 letters of p are same as first l-1 of q Make each l-mer in Spectrum(s,l) a node Construct directed graph(s) that connect every p and q with a directed edge 1 to 1 correspondence between paths that visit each vertex exactly once (Hamiltonian Paths) and DNA fragments with Spectrum(s,l)

8 SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG } H ATG AGG TGC TCC GTC GGT GCA CAG ATG C A G G T C C Path visited every VERTEX once

9 SBH: Hamiltonian Path Approach
A more complicated graph: S = { ATG TGG TGC GTG GGC GCA GCG CGT }

10 SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1: ATGCGTGGCA Path 2: ATGGCGTGCA

11 SBH: Hamiltonian Path Approach
Problem is that there is no efficient algorithm As overlap graph gets larger, this is not a useful technique since the Hamiltonian Path problem is NP-Complete

12 SBH: Eulerian Path Approach
This leads to simple linear-time algorithm for sequence reconstruction Construct graph whose edges correspond to l-mers Find path(s) that visit each edge exactly once

13 SBH: Eulerian Path Approach
S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT } Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT, CA, CG } Edges correspond to l – mers from S AT GT CG CA GC TG GG Path visited every EDGE once

14 SBH: Eulerian Path Approach
S = { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths: GT CG GT CG AT TG GC AT TG GC CA CA GG GG ATGGCGTGCA ATGCGTGGCA

15 SBH: Eulerian Path Approach
If for every vertex the number of incoming edges is equal to the number of outgoing edges, the graph is balanced Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced

16 Some Difficulties with SBH
Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and 1 or 2 mismatches Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology. Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques

17 Fragment Assembly Now that we have our reads sequenced, we need to assemble them into the entire DNA sequence

18 Fragment Assembly We have some problems: Errors in reads (1% to 3%)
Which strand did the read come from? Did the read come from the target DNA sequence or its Watson-Crick complement? Repeats in DNA (this is the major problem) See page 278 for puzzle example

19 Fragment Assembly Very difficult to put it all together if repeats are longer than read length Could solve this by increasing read length, but the technology isn’t there yet

20 Fragment Assembly One approach is to break the sequence into about 30,000 Bacterial Artificial Chromosomes Sequence each BAC individually Put them all together Used and shown effective (if cumbersome) by the Human Genome Project

21 Fragment Assembly Another option (used in mouse genome assembly) is the Weber-Meyers approach Pairs reads that are separated by a fixed-size gap Gap size L is chosen to be longer than most repeats Unlikely both reads lie in large repeat Read that is in unique portion of DNA tells us which copy of a repeat the mate is in

22 Fragment Assembly Most algorithms consist of these steps: Overlap
Find potentially overlapping reads Layout: Find order of reads along DNA Consensus: Derive DNA sequence from layout

23 Overlap Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring

24 Overlapping Reads Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer Extend to full alignment – throw away if not >95% similar TACA TAGT || TAGATTACACAGATTAC T GA TAGA | || ||||||||||||||||| TAGATTACACAGATTAC

25 Overlapping Reads and Repeats
A k-mer that appears N times, initiates N2 comparisons For an Alu that appears 106 times  1012 comparisons – too much Solution: Discard all k-mers that appear more than t  Coverage, (t ~ 10)

26 Finding Overlapping Reads
Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

27 Layout Repeats are a major challenge
Do two aligned fragments really overlap, or are they from two copies of a repeat? Solution: repeat masking – hide the repeats!!! Masking results in high rate of misassembly (up to 20%) Misassembly means alot more work at the finishing step

28 Consensus A consensus sequence is derived from a profile of the assembled fragments A sufficient number of reads is required to ensure a statistically significant consensus Reading errors are corrected

29 Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments Derive each consensus base by weighted voting

30 Protein Sequencing and Identification
Protein can be digested into peptides by proteases (such as trypsin) Can then sequence the fragments individually and re-assemble Mass spectrometry allows us to find proteins involved in cell death, for example

31 Protein Sequencing and Identification
Tandem mass spectrometer breaks peptides into smaller fragments These fragments have electrical charge Fragments are spun around in an magnetic field until they hit a detector Larger masses are harder to spin than smaller ones, so mass can be determined by the amount of energy required to fling fragments around

32 Protein Sequencing and Identification
The problem we encounter is how to reconstruct the amino acid sequence of the peptide from the masses of the broken pieces

33 References Generated from:
An Introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A. Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London, England, 2004 Slides 4, 8-10, 13, 14, 16, from


Download ppt "CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson"

Similar presentations


Ads by Google