Download presentation
Presentation is loading. Please wait.
Published byProsper Shelton Modified over 9 years ago
1
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: 123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP Position within the alignment = columns
2
Sequence Alignment The alignment is a hypothesis: The positions with identical nt/AA were present in the common ancestor Differences represent the nt/AA that have diverged since the common ancestor Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
3
From extant sequences to evolution
4
Constructing and evaluating alignments how to identify regions of sequence similarity between two sequences? How to evaluate the degree of similarity? What is the biological significance of the alignment?
5
Dot Plots: visualization of an alignment 1)Take two English words: 2)place the two sequences on vertical and horizontal axes of graph 3)put dots wherever there is a match 4)diagonal line is the region of identity –local alignment THISSEQUENCE and THATSEQUENCE
6
Alignments reveal insertions and deletions a gap in Seq1 accounts for the insertion of ISA into Seq2 seq1 THIS---SEQUENCE seq2 THISISASEQUENCE
7
THISSEQUENCE THATSEQUENCE –How many substitutions? –Are all substitutions equal? –If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry? Would two unrelated sequence share this level of identity? Alignments reveal substitutions
8
Need a methods for evaluating the likelihood of - A to V (alanine to valine) - R to F (Arginine to Phenylalanine) Substitutions
9
Scoring schemes to assess similarity Percent identity = number of identical amino acids Percent similarity (biochemical equivalence) Substitution matrices –value assigned based on the probability of substitution –score the alignment 123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
10
Substitution matrices How might one construct a scoring matrix? –what types of sequence events should we consider? DNA level? Transition vs. transversion Amino acid level? Biochemical equivalence Using known proteins? –comparing protein homologs –post hoc determination of probabilities –should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago? Probability of substitutions increases over time Probability that multiple substitutions occurred in a single position
11
Substitution matrix Each cell represents the likelihood of substitution of each possible pair of amino acids Sum up the score = 52 THISSEQUENCE THATSEQUENCE 581145505695
12
PAM and BLOSUM matrices for AA sequences Most protein alignment matrices are empirically derived: PAM Scoring Matrices –compared full length of closely related proteins –Measured the frequency of all possible substitution pairs BLOSUM Scoring Matrices –compared highly conserved regions of proteins –blocks
13
How to score gaps? THISISASEQUENCE vs THATSEQUENCE? THISISASEQUENCE TH----ATSEQUENCE THA---TSEQUENCE TH---ATSEQUENCE TH-A-T-SEQUENCE Scoring the alignment must take into account 1) Substitutions 2) Gaps Gap penalties: 1) start a new gap (-4) 2) extend an existing gap (-1) Score all, choose highest score More than one possible alignment
14
Alignments Finding regions of sequence identity or similarity Inserting gaps to reflect indels Scoring the possible alignments to find the optimal alignment by
15
Common tools that produce alignments BLAST to identify similar sequences, given a query sequence ClustalW to align two or more sequences across their entire length
16
the blast algorithm Uses PAM or BLOSUM matrix divides query sequence into short strings, called words searches through the database to find subject sequences that contain similar words When finds similar words, it extends and scores the alignment Output consists of all subject sequences that align to the query at or above a threshold score If no words are similar, then no alignment
17
BLAST Algorithm divide entire length into words (segments of X length)
18
Extend hits one base at a time S is the alignment score: If it falls below a threshold, the extension processes ends
19
HSPs are Aligned Regions High scoring segment pairs = the original word match plus the extension –high scoring = score of the alignment above threshold –segment = the region of the query sequence aligned to the subject –pair = alignment between two sequences (query and subject) BLAST often produces several short HSPs rather than a single aligned region
20
BLAST Results report local alignments >gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans] Score = 283 bits (723), Expect = 8e-75 Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%) Query: 48 KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101 K+ RV+LP+SV+EYQVGQL+SVAEASK P++ +G+ KGQYT Sbjct: 70 KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129 Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160 HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Sbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189 Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN H L+ + E V I+IA+ + L S D + P+KF+S KTGRGPL N Sbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249 Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279 WK + P MCAYKLVTV FKW+G Q VEN+ H Q RLF+ FHR++FCW+DKW Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304 Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309 LTM DIR +E + +++L+E R+ V+GM Sbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334 Query was the entire protein sequence (position 1 to 749) Score,E-value, Identities, Positives, Gaps
21
BLAST Statistics E-value is equivalent to a P value smaller numbers are more significant –1e -4 = 1 x 10 -4 = 0.0004 –1e -50 = 1 x 10 -50 E-value is calculated from the alignment score (S) how many alignments of that score would likely occur by chance if you query a database of that size? –if GenBank contains 10 million sequences, there is a good probability that the sequence “MAGAV” will occur multiple times in sequences that are NOT evolutionarily related The E-value represents the likelihood that the observed alignment is due to chance alone
22
Interpretation of output very low E-values (e-100) represent sequences that are very close to being identical moderate E-values are related genes (homologs) long list of gradually declining of E-values indicates a large gene family you must examine the results when e-value is in the 10 -4 to -5 range –examine sequences a few AA matches in a long sequence? many AA matches in a very short sequence?
23
Evaluating Blast results Alignment: –colored bar alignments (region of alignment, score along length) –sequence alignments (region of alignment, AA information) Exploring potential function from significant blast hits –use accession link to go to the record page for each hit published papers full sequence information annotation Blast is linked to a protein domain tool
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.