Previous Lecture: Probability


1 Previous Lecture: Probability

2 This Lecture Introduction to Biostatistics and Bioinformatics
Sequence Alignment Concepts

3 Sequence Alignment Stuart M. Brown, Ph.D.
Center for Health Informatics and Bioinformatics NYU School of Medicine Slides/images/text/examples borrowed liberally from: Torgeir R. Hvidsten, Michael Schatz, Bill Pearson, Fourie Joubert, others …

4 Learning Objectives
- Identity, similarity, homology
- Analyze sequence similarity by dotplots: window/stringency
- Alignment of text strings by edit distance
- Scoring of aligned amino acids
- Gap penalties
- Global vs. local alignment
- Dynamic Programming (Smith-Waterman)
- FASTA method

5 Why Compare Sequences?
- Identify sequences found in lab experiments: What is this thing I just found?
- Compare new genes to known ones
- Compare genes from different species: information about evolution
- Guess functions for entire genomes full of new gene sequences
- Map sequence reads to a Reference Genome (ChIP-seq, RNA-seq, etc.)

6 Are there other sequences like this one?
1) Huge public databases: GenBank, Swissprot, etc.
2) “Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes” -R. Pearson
3) Similarity searching is based on alignment
4) BLAST and FASTA provide rapid similarity searching
   a. rapid = approximate (heuristic)
   b. false positive and false negative scores

7 Similarity ≠ Homology
1) 25% similarity over ≥ 100 AAs is strong evidence for homology
2) Homology is an evolutionary statement which means “descent from a common ancestor”
   - common 3D structure
   - usually common function
   - homology is all or nothing; you cannot say "50% homologous"

8 Similarity is Based on Dot Plots
1) two sequences on vertical and horizontal axes of graph
2) put dots wherever there is a match
3) diagonal line is region of identity (local alignment)
4) apply a window filter: look at a group of bases, which must meet a % identity threshold to get a dot
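The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular dot plot program: the function names `dot_plot` and `render` and the convention that `stringency` counts identical positions inside the window are assumptions made for this example. With `window=4, stringency=3` it corresponds to a 4-base window at 75% identity.

```python
def dot_plot(seq_x, seq_y, window=1, stringency=1):
    """Return the set of (i, j) window start positions that earn a dot.

    A dot is placed at (i, j) when the window of length `window` starting
    at seq_x[i] and seq_y[j] contains at least `stringency` identical
    positions. With window=1 and stringency=1 this is the unfiltered plot.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x runs left to right, seq_y top to bottom."""
    rows = []
    for j in range(len(seq_y)):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(len(seq_x))))
    return "\n".join(rows)


if __name__ == "__main__":
    x = "GATTACAGATTACA"
    y = "GATTTCAGATTACA"
    print(render(x, y, dot_plot(x, y, window=4, stringency=3)))
```

Raising the window and stringency removes isolated chance matches and leaves the diagonals, at the cost of missing short or weakly conserved regions.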

9 Simple Dot Plot

10 Window / Stringency Filtering
Window = 12, Stringency = 9

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 11

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 7

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

11 Dot plot filtered with 4 base window and 75% identity

12 Dot plot of real data

13 Dotplot (Window = 130 / Stringency = 9)
Hemoglobin -chain Hemoglobin -chain

14 Dotplot (Window = 18 / Stringency = 10)
Hemoglobin -chain Hemoglobin -chain

15 How to Align Sequences?
- Manually line them up and count? An alignment program can do it for you, or just use a text editor
- Dot Plot: shows regions of similarity as diagonals
GATGCCATAGAGCTGTAGTCGTACCCT <-- --> CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC

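16 Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G – A G
A C G T G – G C A G

One mismatch (position 3) and two indels (positions 6 and 8): 7 of the 10 aligned positions match, so the sequences are 70% identical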
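The 70% figure can be checked with a short function. This is a sketch under one stated assumption: identity is counted over the full alignment length, gap columns included. Other conventions (for example, dividing by the length of the shorter sequence) give different numbers. The name `percent_identity` and the ASCII `-` gap character are choices made for this example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumes the two strings are already aligned (equal length, gaps
    written as `gap`). Gap columns count toward the length but never
    count as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("Alignment is empty")
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # The alignment from the slide, with '-' for each indel.
    print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```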

17 Hamming Distance
The minimum number of base changes that will convert one (ungapped) sequence into another.
The Hamming distance is named after Richard Hamming, who introduced it in his fundamental paper on Hamming codes: “Error detecting and error correcting codes” (1950) Bell System Technical Journal 29 (2): 147–160.
Python function hamming_distance:

    def hamming_distance(s1, s2):
        # Return the Hamming distance between equal-length sequences
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

18 Hamming Dist can be unrealistic
v: ATATATAT
w: TATATATA
Hamming Dist = 8 (no gaps, no shifts)

Levenshtein (1966) introduced edit distance:
v = _ATATATAT
w = TATATATA_
edit distance: d(v, w) = 2

Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707–710.
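The edit distance in this example can be computed with the standard dynamic programming recurrence. The sketch below assumes unit cost for every insertion, deletion and substitution; the function name `edit_distance` is chosen for this example.

```python
def edit_distance(v, w):
    """Levenshtein distance between strings v and w with unit costs.

    Only two rows of the dynamic programming table are kept at a time:
    `prev` is the row for the first i-1 characters of v, `curr` the row
    for the first i characters.
    """
    prev = list(range(len(w) + 1))  # converting "" into w[:j] takes j insertions
    for i, a in enumerate(v, start=1):
        curr = [i]                  # converting v[:i] into "" takes i deletions
        for j, b in enumerate(w, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a from v
                curr[j - 1] + 1,          # insert b into v
                prev[j - 1] + (a != b),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ATATATAT", "TATATATA"))
```

For the two sequences on the slide, one deletion at the start and one insertion at the end are enough, which is why the result is far smaller than the Hamming distance.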

19 Affine Gap Penalties

20 Gap Penalties
- With unlimited gaps (no penalty), unrelated sequences can align (especially DNA)
- Gap should cost much more than a mismatch
- Multi-base gap should cost only a little bit more than a single base gap
- Adding an additional gap near another gap should cost more (not implemented in most algorithms)
- Score for a gap of length x is: -(p + σx), where p is the gap open penalty and σ is the gap extend penalty
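The gap score formula -(p + σx) is easy to tabulate. In the sketch below the default values `gap_open=10` and `gap_extend=1` are illustrative assumptions, not values given in the lecture, and a gap of length 0 is treated as "no gap" with score 0.

```python
def affine_gap_score(length, gap_open=10, gap_extend=1):
    """Score for a gap of `length` positions: -(gap_open + gap_extend * length).

    gap_open corresponds to p and gap_extend to sigma in the formula.
    The default penalties are hypothetical, chosen only for illustration.
    """
    if length < 0:
        raise ValueError("Gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty (an assumption of this sketch)
    return -(gap_open + gap_extend * length)


if __name__ == "__main__":
    for x in (1, 2, 5, 10):
        print(x, affine_gap_score(x))
```

With these numbers a single-base gap costs 11 and a five-base gap costs 15, so one long gap is much cheaper than five separate single-base gaps (55), which is the behaviour the bullet points ask for.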

21 Global vs Local similarity
1) Global similarity uses complete aligned sequences
   - total % matches
   - Needleman & Wunsch algorithm
2) Local similarity looks for best internal matching region between 2 sequences
   - find a diagonal region on the dotplot
   - Smith-Waterman algorithm
   - BLAST and FASTA
3) dynamic programming
   - optimal computer solution, not approximate

22 Global vs. Local Alignments

23

24 Smith-Waterman Method
[Essentially finding the diagonals on the dotplot]
Basic principles of dynamic programming:
- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)

25 Creation of an alignment path matrix
Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences
- Construct matrix F indexed by i and j (one index for each sequence)
- F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y
- Build F(i,j) recursively beginning with F(0,0) = 0

26 Michael Schatz

27 Creation of an alignment path matrix
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
Three possibilities:
- xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi,yj)
- xi is aligned to a gap, F(i,j) = F(i-1,j) - d
- yj is aligned to a gap, F(i,j) = F(i,j-1) - d
The best score up to (i,j) will be the largest of the three options
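The recursion can be written out directly. The sketch below fills the matrix F for a global alignment and takes the maximum of the three candidates at each cell, since higher scores mean better alignments. Several things are assumptions of this example rather than part of the slides: the scoring values (`match=1`, `mismatch=-1`, linear gap penalty `d=2`), the boundary rows F(i,0) = -i·d and F(0,j) = -j·d, and the function name.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, d=2):
    """Score of the best global alignment of x and y with a linear gap penalty d.

    F[i][j] holds the best score for aligning x[:i] with y[:j].
    Scoring values are hypothetical defaults chosen for illustration.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # xi aligned to yj
                F[i - 1][j] - d,      # xi aligned to a gap
                F[i][j - 1] - d,      # yj aligned to a gap
            )
    return F[n][m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Recovering the alignment itself, rather than only its score, requires the backtracking step: remember which of the three options produced each F(i,j) and walk back from F(n,m) to F(0,0).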

28 Michael Schatz

29 Michael Schatz

30 Smith-Waterman is OPTIMAL but computationally slow
- SW search requires computing a matrix of scores at every possible alignment position with every possible gap.
- Compute task increases with the product of the lengths of the two sequences to be compared (N × M, or N² for two sequences of length N)
- Difficult for comparison of one small sequence to a much larger one, very difficult for two large sequences, essentially impossible to search very large databases.
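A minimal sketch of the Smith-Waterman score calculation makes the cost visible: the two nested loops visit every one of the len(x) × len(y) cells. The scoring values (`match=2`, `mismatch=-1`, linear gap penalty `d=2`) and the function name are assumptions for this example, and only the best local score is returned, without backtracking.

```python
def smith_waterman_score(x, y, match=2, mismatch=-1, d=2):
    """Best local alignment score between x and y (linear gap penalty d).

    The only difference from the global recursion is the extra option 0,
    which lets an alignment start fresh at any cell. Scoring values are
    hypothetical defaults chosen for illustration.
    """
    best = 0
    prev = [0] * (len(y) + 1)        # row i-1 of the score matrix
    for i in range(1, len(x) + 1):   # len(x) rows ...
        curr = [0]
        for j in range(1, len(y) + 1):   # ... times len(y) columns
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(
                0,                    # start a new local alignment here
                prev[j - 1] + s,      # extend along the diagonal
                prev[j] - d,          # gap in y
                curr[j - 1] - d,      # gap in x
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

Comparing a 1,000-base query against a database of a billion bases would fill on the order of 10^12 cells, which is the motivation for heuristic methods such as FASTA and BLAST.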

31 Scoring Similarity
1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) modified scoring for gaps: single vs. multiple base gaps (gap extension)
4) Protein AAs have varying degrees of similarity
   a. # of mutations to convert one to another
   b. chemical similarity
   c. observed mutation frequencies
5) PAM matrix calculated from observed mutations in protein families

32 Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive]

33 The PAM 250 scoring matrix

34 PAM Vs. BLOSUM
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
(moving down the list: more distant sequences)

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

35 Search with Protein, not DNA Sequences
1) 4 DNA bases vs. 20 amino acids: less chance similarity
2) can have varying degrees of similarity between different AAs: # of mutations, chemical similarity, PAM matrix
3) protein databanks are much smaller than DNA databanks

36 FASTA
1) A faster method to find similarities and make alignments, capable of searching many sequences (an entire database)
2) Only searches near the diagonal of the alignment matrix
3) Produces a statistic for each alignment (more on this in the next lecture)

37 FASTA
1) Derived from logic of the dot plot
   - compute best diagonals from all frames of alignment
2) Word method looks for exact matches between words in query and test sequence
   - hash tables (fast computer technique)
   - Only matches exactly identical words
   - DNA words are usually 6 bases
   - protein words are 1 or 2 amino acids
   - only searches for diagonals in region of word matches = faster searching
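The word method can be illustrated with a hash table of words and a count of hits per diagonal. This is a simplified sketch of the idea only, not the FASTA program's actual implementation; the function names and the use of `target position - query position` as the diagonal label are assumptions for this example.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Hash table mapping each k-letter word in seq to its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=6):
    """Count exact word matches on each diagonal of the dot plot.

    A word found at position q in the query and t in the target lies on
    diagonal t - q. Diagonals with many hits mark candidate regions that
    are worth aligning in detail.
    """
    index = word_index(query, k)
    hits = Counter()
    for t_pos in range(len(target) - k + 1):
        for q_pos in index.get(target[t_pos:t_pos + k], ()):
            hits[t_pos - q_pos] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("ACGTTGCA", "TTACGTTGCA", k=4)
    print(hits.most_common(3))
```

Each word lookup is a single hash table access, so the target is scanned once instead of being compared against every query position.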

38 FASTA Format
simple format used by almost all programs

>header line with a [return] at end
Sequence (no specific requirements for line length, characters, etc)

>URO1 uro1.seq Length: November 9, :50 Type: N Check:
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
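Because the format is so simple, a reader fits in a few lines. The sketch below is a minimal hand-written parser for illustration; the function name `parse_fasta` and the example record names are hypothetical, and production code would more likely rely on an established library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    A record starts at a line beginning with '>'; every following line up
    to the next '>' is sequence and is joined into one string, regardless
    of how the sequence was wrapped.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:                # close the last record
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical two-record example.
    example = ">seq1 demo record\nACGTACGT\nTTAA\n>seq2\nGGCC\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```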

39 FASTA Algorithm

40 Makes Longest Diagonal
3) after all diagonals found, tries to join diagonals by adding gaps 4) computes alignments in regions of best diagonals

41 FASTA Alignments

42 FASTA Results - Alignment
SCORES Init1: Initn: Opt: z-score: E(): 2.3e-58
>>GB_IN3:DMU (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957: )

u39412.gb_pr  CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                            || |||  | ||||| |    ||| |||||
DMU           AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC

u39412.gb_pr  GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
              |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU           GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC

u39412.gb_pr  TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
                |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU           AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC

u39412.gb_pr  AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
              ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU           AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT

43 Summary
- Identity, similarity, homology
- Analyze sequence similarity by dotplots: window/stringency
- Alignment of text strings by edit distance
- Scoring of aligned amino acids
- Gap penalties
- Global vs. local alignment
- Dynamic Programming (Smith-Waterman)
- FASTA method

44 Next Lecture: Searching Sequence Databases

