1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT.


1 1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC CTGGCTCACAAGTACCATTGA MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… Motivation

2 2 Lesson 2 Aligning sequences and searching databases

3 3 Homology and sequence alignment

4 4 Homology
Homology = similarity between objects due to common ancestry.

5 5 Sequence homology
Similarity between sequences as a result of common ancestry.
VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

6 6 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

7 7 Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

8 8 Insertions, deletions, and substitutions

9 9 Evolutionary changes in sequences
Three types of changes:
1. Substitution: a replacement of one (or more) sequence letter by another.
2. Insertion: an insertion of a letter or several letters into the sequence.
3. Deletion: deleting a letter (or more) from the sequence.
Insertion + Deletion → Indel
[Figure: short example sequences illustrating each type of change]

10 10 Sequence alignment
If two sequences share a common ancestor (for example, human and armadillo hemoglobin), we can represent their evolutionary relationship using a tree.
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
[Figure: tree with leaves VLSPAV-WAKV and VLSEAVLWAKV]

11 11 Perfect match VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case). VLSEAVLWAKV

12 12 A substitution VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred). VLSEAVLWAKV VLSPAVLWAKV

13 13 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV * Option 1: The ancestor had L and it was lost here *. In such a case, the event was a deletion. VLSEAVLWAKV *

14 14 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAV WAKV * Option 2: The ancestor was shorter and the L was inserted here *. In such a case, the event was an insertion. VLSEAVLWAKV L *

15 15 Indel
VLSPAV-WAKV
VLSEAVLWAKV
Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.
Deletion? Insertion?

16 16 Indels in protein-coding genes
Indels in protein-coding genes are often 3 bp, 6 bp, 9 bp, etc. long, because an indel whose length is a multiple of three does not shift the reading frame.
Gene search: in fact, searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for open reading frames (ORFs).

17 17 Global and Local pairwise alignments

18 18 Global vs. Local
Global alignment: finds the best alignment across the entire two sequences.
Local alignment: finds regions of similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment will return only regions of good alignment:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

19 19 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

20 20 Proteins are composed of domains
[Figure: domain diagram of human PTK2; labels: Domain B, Protein tyrosine kinase domain, Domain A]

21 21 Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed.
[Figure: domain diagram; labels: Domain X, Protein tyrosine kinase domain, Domain A]

22 22 The sequence similarity is restricted to a single domain
[Figure: domain diagrams of Leukocyte TK and PTK2; labels: Domain X, Protein tyrosine kinase domain, Domain B, Protein tyrosine kinase domain, Domain A]

23 23 Global alignment of PTK and LTK

24 24 Local alignment of PTK and LTK

25 25 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.

26 26 How are alignment scores computed?

27 27 Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

28 28
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes: 2 mismatches, 4 indels (gaps), and 10 perfect matches.

29 29 Choosing an alignment for a pair of sequences
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Many different alignments are possible for 2 sequences:
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?

30 30 Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
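The naïve scoring on this slide can be checked with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `score_alignment` is made up for illustration, and it assumes both strings are already aligned (same length, `-` for a gap).

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Score two already-aligned strings column by column (naive scheme)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# The two candidate alignments from the slide.
print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Changing the three default values is an easy way to see that a different scoring system can rank the same two alignments differently.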

31 31 Scoring systems

32 32 Scoring system
In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary.
Different scoring systems → different alignments.
We want a good scoring system…

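33 33 Scoring matrix
Representing the scoring system as an n × n table or matrix (n is the number of letters the alphabet contains: n = 4 for nucleotides, n = 20 for amino acids). The matrix is symmetric.
[Table: 4 × 4 nucleotide matrix over A, G, C, T, with 2 on the diagonal and -6 as the visible off-diagonal entry]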

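34 34 DNA scoring matrices
Uniform substitutions between all nucleotides (rows: from; columns: to; the matrix is symmetric, so only the lower triangle is shown):
     A   G   C   T
A    2
G   -6   2
C   -6  -6   2
T   -6  -6  -6   2
Match: 2 (diagonal). Mismatch: -6 (off-diagonal).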
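A uniform matrix like this one is easy to hold in a dictionary keyed by letter pairs. The sketch below is only an illustration: the name `uniform_matrix` is hypothetical, and the values 2 and -6 are the ones readable on the slide.

```python
NUCLEOTIDES = "AGCT"


def uniform_matrix(match=2, mismatch=-6):
    """Build an n x n scoring matrix with one match and one mismatch score."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in NUCLEOTIDES
        for b in NUCLEOTIDES
    }


S = uniform_matrix()
print(S[("A", "A")])  # 2
print(S[("A", "G")])  # -6

# Symmetry: substituting a by b scores the same as b by a.
print(all(S[(a, b)] == S[(b, a)] for a in NUCLEOTIDES for b in NUCLEOTIDES))  # True
```

A non-uniform matrix (for example one that treats transitions and transversions differently) would use the same lookup and differ only in the stored values.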

35 35 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion

36 36 Amino-acid scoring matrices
Take into account physicochemical properties.

37 37 Amino-acid substitution matrices
Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM

38 38 Protein matrices – actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other.
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
In the 4th column, D/E is found in 7/8 of the cases (compared with 5/8 for D/Q and E/Q).

39 39 BLOSUM: BLOcks SUbstitution Matrix
Based on the BLOCKS database:
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
Blocks are short conserved patterns of 3-60 aa without gaps.
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

40 40 BLOSUM Each block represents a sequence alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

41 41 BLOSUM Matrices BLOSUMn is based on sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45

42 42 Example : Blosum62 Derived from blocks where the sequences share at least 62% identity

43 43 Scoring gaps
In advanced algorithms, two gaps of one amino acid each (X-Y-) are given a different score than one gap of two amino acids (X--Y). This is done by giving different penalties for “opening” a gap and for extending a gap:
Gap extension penalty < Gap opening penalty
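A small sketch can make the open/extend distinction concrete. The penalty values (-10 to open, -1 to extend) and the function names are assumptions chosen for illustration, not values from the lecture. Conventions also differ between programs on whether the opening penalty already covers the first gap position; here it does.

```python
import re


def gap_cost(length, gap_open=-10, gap_extend=-1):
    """Affine cost of one gap: gap_open for the first position,
    gap_extend for each additional position (illustrative values)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_cost(aligned, gap_open=-10, gap_extend=-1):
    """Sum the affine cost over every run of '-' in one aligned sequence."""
    return sum(
        gap_cost(len(run), gap_open, gap_extend)
        for run in re.findall(r"-+", aligned)
    )


print(total_gap_cost("X-Y-"))  # two gaps of length 1: -20
print(total_gap_cost("X--Y"))  # one gap of length 2:  -11
```

With these numbers a single longer gap is penalized less than two separate short ones, which matches the idea that one indel event can remove or add several residues at once.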

44 44 Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment.
3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.

45 45 Computational aspects

46 46 Many possible alignments AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCT-GAATT-C-GAA A-GGCT-CATTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- AAG-CTGAATT-C-GAA AGGCT-CATTT-CTGA- It is not trivial (for most people) to figure out how to go over all possible pairwise alignments and find the one with the highest score.

47 47 Optimal alignment algorithms
Needleman-Wunsch (global) [1970]
Smith-Waterman (local) [1981]
The complexity of both algorithms is O(mn) (m: length of sequence 1, n: length of sequence 2). Informally: doubling the length of one sequence doubles the computation time; doubling both quadruples it.
For proteins of lengths < 1000 it takes much less than a second to compute the alignments.

48 48 Dynamic programming
Solving a problem with many overlapping sub-problems.
Example: Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, …
F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)

49 49 Dynamic programming
F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)
Naïvely solving F(7):
F(7) = F(6) + F(5)
= F(5) + F(4) + F(4) + F(3)
= F(4) + F(3) + F(3) + F(2) + F(3) + F(2) + F(2) + F(1)
= F(3) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1)
= F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1)
= 13

50 50 Dynamic programming F(7) using Dynamic programming: F(3) = F(2) + F(1) = 1 + 1 = 2 F(4) = F(3) + F(2) = 2 + 1 = 3 F(5) = F(4) + F(3) = 3 + 2 = 5 F(6) = F(5) + F(4) = 5 + 3 = 8 F(7) = F(6) + F(5) = 8 + 5 = 13
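The difference between the two approaches shows up clearly when the work is counted. The sketch below is an illustration only (the function names are made up): it computes F(7) both ways and counts how many calls the naïve recursion makes.

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_dp(n):
    """Dynamic programming: fill a table once, from small n upward."""
    table = [0, 1, 1]  # table[k] holds F(k); index 0 is unused padding
    for k in range(3, n + 1):
        table.append(table[k - 1] + table[k - 2])
    return table[n]


calls = [0]
print(fib_naive(7, calls))  # 13
print(calls[0])             # 25 recursive calls
print(fib_dp(7))            # 13, using only 5 additions
```

The naïve version recomputes the same sub-problems many times, while the table version solves each sub-problem exactly once. Pairwise alignment has the same overlapping structure, which is why the same trick applies.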

51 51 Needleman Wunsch (1970)
Base case:
$$F(i, 0) = i \times \text{gap penalty}, \qquad F(0, j) = j \times \text{gap penalty}$$
Recursion rule:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + \text{gap penalty} \\ F(i, j-1) + \text{gap penalty} \end{cases}$$
Finds the best alignment for the first i characters of seq1 with the first j of seq2.

52 52 Needleman Wunsch (1970)
Base case:
$$F(i, 0) = i \times \text{gap penalty}, \qquad F(0, j) = j \times \text{gap penalty}$$
Recursion rule:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + \text{gap penalty} \\ F(i, j-1) + \text{gap penalty} \end{cases}$$
Cool alignment applet: http://baba.sourceforge.net/
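The recursion can be turned into a short table-filling routine. The following is a minimal sketch rather than the lecture's own code: it returns only the optimal global score (no traceback), uses a single linear gap penalty, and borrows the naïve +1 / -2 / -1 scores from the earlier slides as default values. The gap penalty is a negative number that is added, which is one reading of the garbled formula.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-2, gap=-1):
    """Optimal global alignment score of x and y with a linear gap penalty."""
    m, n = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]

    # Base case: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    # Recursion rule: best of diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    return F[m][n]


print(needleman_wunsch_score("ACG", "AG"))  # 1  (A-G over ACG: +1 -1 +1)
print(needleman_wunsch_score("AAGCTGAATTCGAA", "AGGCTCATTTCTGA"))
```

The two nested loops visit each of the (m+1)×(n+1) cells once, which is where the O(mn) running time comes from. For the slide's pair of sequences the returned score is at least 2, since an alignment scoring 2 was shown earlier.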

53 53 Searching databases

54 54 Searching a sequence database Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database

55 55 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous

56 56 Protein or DNA search

57 57 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?

58 58 Protein is better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser TTGAGT = Leu-Ser
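The slide's example can be verified directly: the two DNA strings share very few positions, yet they encode the same dipeptide. The sketch below is illustrative only. `CODON_TABLE` contains just the four codons needed here (their assignments are from the standard genetic code), and the helper names are made up.

```python
# Only the four codons used in the slide's example (standard genetic code).
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the mini table above."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1, dna2 = "CTTTCA", "TTGAGT"
print(translate(dna1))                        # ['Leu', 'Ser']
print(translate(dna2))                        # ['Leu', 'Ser']
print(identity(dna1, dna2))                   # 1 of 6 positions match
print(identity(translate(dna1), translate(dna2)))  # 1.0
```

At the DNA level the two sequences look unrelated, while at the protein level they are identical, which is the point of the slide.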

59 59 Query type
Nucleotides: 4-letter alphabet. Amino acids: 20-letter alphabet.
Two random DNA sequences will, on average, have 25% identity (1/4, assuming equal letter frequencies).
Two random protein sequences will, on average, have 5% identity (1/20, under the same assumption).

60 60 Conclusion The amino-acid sequence is often preferable for homology search

61 61 Computation time

62 62 How do we search a database?
If each pairwise alignment takes 1/10 of a second, and the database contains 10^7 sequences, one search will take 10^6 seconds ≈ 11.6 days.
150,000 searches (at least!) are performed per day.
>82,000,000 sequence records in GenBank.

63 63 Conclusion
Using the exact pairwise alignment algorithm to compare the query against all DB entries is too slow.

64 64 Heuristic
Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) while greatly reducing the running time compared with the exact solution.

65 65 BLAST

66 66 BLAST
BLAST: Basic Local Alignment Search Tool.
A heuristic for searching a database for similar sequences.
The heuristic is based on restricting the similarity search (such as using ungapped word matching instead of single-character matching).
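The word-matching idea can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual implementation: it looks only for exact shared words of length k, whereas BLAST's protein search also accepts similar "neighborhood" words and then extends each hit. The function names and k = 3 are choices made for this example.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-word is shared."""
    index = index_words(subject, k)
    return [
        (q, s)
        for q in range(len(query) - k + 1)
        for s in index.get(query[q:q + k], [])
    ]


print(find_seeds("VLSPAVKWAKV", "VLSEAVLWAKV"))  # [(0, 0), (7, 7), (8, 8)]
```

Looking up words in a prebuilt index is far cheaper than filling a full dynamic programming table for every database sequence, and only sequences with at least one seed need any further work. The price is that a true homolog with no shared word can be missed, which is what makes this a heuristic.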

67 67 DNA or Protein
Query: DNA or protein. Database: DNA or protein.
All types of searches are possible:
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database

68 68 E-value
The number of times we would theoretically expect to find an alignment with a score ≥ Y when searching a random sequence against a random database.
Theoretically, we could trust any result with an E-value ≤ 1.
In practice, BLAST uses estimations:
– E-values of 10^-4 and lower indicate a significant homology.
– E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
– E-values between 10^-2 and 1 do not indicate a good homology.
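These rules of thumb translate into a small helper for triaging hits. The sketch below simply encodes the thresholds from this slide; the function name and the label strings are made up for illustration, and the cutoffs are guidelines rather than fixed rules.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the slide's rule-of-thumb ranges."""
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if evalue <= 1:
        return "not a good indication of homology"
    return "above the ranges discussed on the slide"


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_evalue(e))
```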

69 69 Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high scores of alignment, BUT – this does not indicate homology

70 70 Solution
In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query).
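To show what masking does to a query, here is a toy masker. It is only a stand-in for illustration: BLAST itself relies on dedicated filters (SEG for proteins, DUST for nucleotides), whereas this sketch just replaces any window made of a single repeated letter with X. The function name, window size, and `max_distinct` parameter are assumptions.

```python
def mask_low_complexity(seq, window=6, max_distinct=1, mask_char="X"):
    """Replace every window that uses at most max_distinct different
    letters with mask_char (toy criterion, not SEG or DUST)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if len(set(seq[start:start + window])) <= max_distinct:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKVLPPPPPPPPAGHW"))  # MKVLXXXXXXXXAGHW
print(mask_low_complexity("MKVLAGHW"))          # unchanged: MKVLAGHW
```

After masking, the proline run can no longer produce high-scoring but meaningless matches, while the informative residues on either side are still searched.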

