Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.


1 Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA
(Lecture follows chapter pretty closely)
This lecture is designed to introduce you to the theory and practice of performing the variety of sequence similarity searches available via the NCBI BLAST service.

2 Searching primary databases using sequence similarity
Basic Local Alignment Search Tool (BLAST)
Microsatellites can give false positive hits.
BLAST is a computer algorithm that returns the sequences in the database whose local alignments to the query score best, i.e. roughly those with the highest percentage of bases in common with the query sequence.

3 Sequence alignments
Sequence alignments are the first line of evidence that two sequences are related because of a shared evolutionary history.
If two sequences (DNA or protein) are related by descent, they will be related by sequence.
[Figure: tree relating sequences A, B and C]
The “evolutionary distance” between A and B is smaller, and therefore, under an assumption of Brownian motion (no selection), a given stretch of their DNA or protein will share more nucleotides or amino acids.

4 All search algorithms will produce results when queried!
The trick is to be able to (i) trust and evaluate the result, and (ii) quantify this evaluation.

5 Reasons for performing BLAST searches
There can be many reasons, but common ones are:
Human or computational annotation: for non-model systems, for which little bench work has been done compared to a model system, sequence alignments with known, experimentally verified genes can aid in the assignment of function.
Evolutionary: discovering similar sequences in different organisms allows one to ask whether and how sequence-level changes result in functional changes. This can be done for coding or non-coding (i.e. regulatory) regions.
Multiple sequence alignments can help identify conserved regions of coding sequences, which might have functional significance, or help clarify evolutionary relationships among difficult-to-classify organisms.
Multiple sequence alignments can also help with the development of primers in order to easily clone a cDNA from an organism for which genome sequencing has not been done.

6 Sequence similarity can only be ascertained by aligning two sequences
ACGGCATCCGACGCTTAGCGGACTATCGATCTGA
ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG
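To make the point of this slide concrete, here is a minimal sketch that counts identical positions between the two sequences above at a chosen ungapped offset. The function names are illustrative, not part of BLAST, and this is only a brute-force comparison, not how BLAST searches. Written one under the other, the sequences share few positions, but sliding the second one seven places to the left lines up the shared stretch CGCTTAGCGGACT.

```python
SEQ_1 = "ACGGCATCCGACGCTTAGCGGACTATCGATCTGA"
SEQ_2 = "ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG"


def count_matches(a, b, offset):
    """Count identical positions when b is slid `offset` places to the right of a.

    A negative offset slides b to the left. No gaps are introduced.
    """
    matches = 0
    for i, base in enumerate(a):
        j = i - offset  # position in b that sits under position i of a
        if 0 <= j < len(b) and base == b[j]:
            matches += 1
    return matches


def best_ungapped_offset(a, b):
    """Try every offset with at least one overlapping position; return the best."""
    offsets = range(-(len(b) - 1), len(a))
    return max(offsets, key=lambda off: count_matches(a, b, off))


# Compare the slide's sequences as written, then with the second slid left by 7.
print(count_matches(SEQ_1, SEQ_2, 0))
print(count_matches(SEQ_1, SEQ_2, -7))
```

The similarity you measure depends entirely on how the sequences are placed against each other, which is why an alignment step has to come first.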

7 Some basic concepts: Sequence Similarity (Data) vs. Homology (Inference)
Sequence similarity (data):
Percent similarity of base pairs between any two sequences, over a given length of sequence.
Continuous quantity: any real number (0-100%).
Homology (inference):
Similar because of common descent from an ancestor that contained that sequence.
Categorical quantity: two sequences are either homologous or not*.
*Because of a variety of genomic rearrangement phenomena, two sequences that code for non-homologous proteins can still contain sub-sequences that are indeed homologous. This is actually a source of false positive hits during BLAST searches.

8 Two kinds of Homology: Paralogy vs. Orthology
Orthologues are two sequences that are related because that sequence existed in a common ancestor and diverged by speciation.
Paralogues are two or more sequences that are related because of a gene duplication event.
[Figure labels, in extracted order:]
AGCCTATGGCAA ACGCTAGGGCTT
Paralogous sequences
ACGCTATGGCAA ACGCTAGGGGCAA
Orthologous sequences
Ancestral sequence
ACGCTACGGGCAA

9 Global versus local alignment
Local alignment:
Best alignment of two sequences, based on regions of sub-sequence (the local part) with the highest similarity.
Better at finding weak similarities and functional domains within coding sequences.
Most common in database searches.
Global alignment:
Best alignment of two sequences along their entire length. Used with highly similar sequences.
Better for multiple sequence alignments.
Not particularly useful for database searches.

10 DNA versus protein alignment
DNA alignments:
Can be much longer.
Because of the wobble effect, are less sensitive and tend to miss sequence relationships among distantly related species.
Protein alignments:
Only as long as the longest coding sequence (<5K a.a.).
Are more sensitive because amino acids tend to be conserved along functional domains, so even weak total similarities can still be detected.

11 Dotplot: for comparing two sequences
A dotplot is a simple technique that yields a graphical, but not statistical, representation of sequence similarity (see Fig 11.1).
[Figure: example dotplot with nucleotide sequences on both axes]
Use Dotlet for proteins:
On PubMed, type in the gene of interest.
Display settings: FASTA.
Input at
And compute the dot plot.

12 Dot plots
Good:
Easy visualization of syntenic regions of long regions of genomic sequence.
Easy visualization of exon/intron boundaries.
Easy visualization and enumeration of tandemly repeated sequence elements.
Bad:
Can only compare two sequences at a time.
No numerical evaluation of the degree of similarity (we examine this next).
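A minimal text-mode sketch of the dotplot idea follows. It is not the Dotlet implementation; the function name, the `window` parameter, and the example sequences are assumptions chosen for illustration. A cell is marked when the windows starting at that row and column are identical, so runs of similarity show up as diagonals.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return the plot as a list of text rows.

    Row i, column j holds '*' when the `window`-long substrings starting at
    seq_y[i] and seq_x[j] are identical, and '.' otherwise.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# A sequence plotted against itself gives an unbroken main diagonal;
# a tandem repeat adds parallel off-diagonal lines.
for line in dotplot("ACGTACGT", "ACGTACGT", window=2):
    print(line)
```

Raising `window` suppresses isolated chance matches, at the cost of missing short or imperfect stretches. As the slide says, the picture still gives no numerical score.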

13 Scoring matrices
When searching a complex database full of billions of nucleotides worth of sequences, we must not only identify related sequences, but develop the ability to score how “good” the match is, and then rank these “hits” in a list or a table.
Assign a prior probability that two sites are similar to each other; we can’t just observe similarity.
A scoring matrix is: an empirical weighting scheme used in all sequence comparisons.
Example here?

14 DNA Scoring matrices
[Figure: the four bases A, G, C, T]
Pyrimidines: Cytosine and Thymine.
Purines: Adenine and Guanine.
A weighted scoring scheme is based on the likelihood of a change occurring.
Some substitutions are more likely than others.
Are all transitions and transversions equally likely? Different scoring matrices make different assumptions about this, but it should be clear that the wobble effect, numerical probability and molecular mechanism matter here.
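One way to picture such a weighting scheme is a small matrix that penalizes transversions (purine to pyrimidine or back) more than transitions (within the same class). The sketch below is illustrative only: the scores +1, -1 and -2 are assumed values, not those of any published matrix, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of bases (all three weights are assumed values)."""
    if x == y:
        return match
    # A transition keeps the base within its class: A<->G or C<->T.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return transition
    # Anything else swaps a purine for a pyrimidine: a transversion.
    return transversion


def build_matrix(**weights):
    """Build the full 4 x 4 matrix as a dict keyed by (base, base)."""
    bases = "AGCT"
    return {(x, y): substitution_score(x, y, **weights)
            for x in bases for y in bases}


matrix = build_matrix()
print("   " + "  ".join("AGCT"))
for x in "AGCT":
    print(x, " ".join(f"{matrix[(x, y)]:2d}" for y in "AGCT"))
```

Setting `transition` equal to `transversion` recovers a scheme that treats all mismatches alike, which is the alternative assumption the slide asks about.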

15 Amino Acid Scoring Matrices
A bit more complicated, because:
There are 20 possible amino acids at any particular site, so 19 possible substitutions for each residue.
Some substitutions are more constrained by function than others. In other words, we need to distinguish between absolute conservation (dark blue) and functional conservation (light blue).
Some amino acids are rarer than others, among all proteins or within a protein, so changes in these amino acids must be given higher weight. For example, Met is rare (other than close to the beginning of a sequence), so it has to be given a different weight.
The matrix simply assigns a probability that one aa would be transformed into another.
Depending upon evolutionary distance, some amino acid changes are more likely than others.

16 The log odds ratio
A scoring matrix is a probability matrix, which is an attempt to understand the probability of all pairwise substitutions, given how often they are actually observed to change in known sequences.
See the Wikipedia article for this: this is how you generate a scoring matrix.
Each score expresses whether a substitution is observed more often or less often than expected by random chance; substitutions seen less often than chance get negative numbers.
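The log-odds idea can be sketched in a few lines: divide the observed frequency of an aligned pair by the frequency expected if the two residues paired at random, take the logarithm, then scale and round. The frequencies below are made-up numbers, and the scale factor of 2 (half-bit units) is an assumption for illustration; different matrix families use different scalings.

```python
import math


def log_odds(observed_pair_freq, freq_i, freq_j, scale=2.0):
    """Integer log-odds score for one residue pair.

    observed_pair_freq: how often the pair is seen aligned in trusted alignments
    freq_i, freq_j:     background frequencies of the two residues
    scale:              assumed scaling factor applied before rounding
    """
    expected_by_chance = freq_i * freq_j
    return round(scale * math.log2(observed_pair_freq / expected_by_chance))


# Hypothetical frequencies, with both residues at 10% background frequency.
print(log_odds(0.02, 0.1, 0.1))   # seen twice as often as chance: positive
print(log_odds(0.01, 0.1, 0.1))   # seen exactly as often as chance: zero
print(log_odds(0.005, 0.1, 0.1))  # seen half as often as chance: negative
```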

17 PAM250 Matrix
Take the log of these probabilities (scaled and rounded) to get an integer number.
In the book the scoring matrix is a square; it is a mirror image across the diagonal, so you only need one side.
The matrix also takes into account the frequency with which each aa occurs in proteins.
Sequences were compared to get the frequency of the change from one aa to another.

18 PAM amino acid scoring matrix
The Point Accepted Mutation (PAM) matrix considers only mutations at the single site level.
The original matrices were built using sequences with more than 85% similarity, which means they are very closely related.
The term acceptance refers to conservation of protein function, even if the sequence changes: the aa sequence can change while the function doesn’t.
PAM treats all sites as though they were different from (independent of) each other (Chp 11).
You should know which scoring matrix to use and why: BLAST similarity depends on which scoring matrix you have used.

19 Amino acids can be grouped according to their “chemistries”
Because some amino acids are very similar in their chemistries, we should not score substitutions between them the same as substitutions between two amino acids that are very different in their chemistries.
Remember this?

20 Assumptions of the PAM matrices
Substitutions at a given site are (i) independent of previous changes at that site and (ii) independent of changes to adjacent sites.
One PAM unit corresponds to 1 a.a. change / 100 a.a., or 1% divergence in sequence.
PAM160 = 160 (total) changes / 100 a.a.
This can be problematic because it represents an extrapolation of probabilities calculated for closely related sequences, so any error will simply be multiplied.

21 BLOSUM matrices
BLOSUM considered the fact that BLOcks of sequence corresponding to secondary structure (i.e. functional domains like catalytic sites, DNA binding regions etc.) are likely to display different SUbstitution probabilities.
BLOSUM matrices also considered substitution probabilities across several evolutionary distances, and so are more accurate than PAM for weaker sequence relationships.
They are an attempt to say that the assumptions of the PAM matrices are not fully true: a neighboring aa is likely to be affected by a change next to it, so the probability of substitution differs within and outside of similar sequence blocks.
E.g. the BLOSUM62 matrix means that sequences with no more than 62% sequence similarity were used to calculate substitution probabilities. BLOSUM90 was made using sequences with up to 90% similarity; a higher number means the matrix is optimized for more similar sequences.
It is better to use a lower-numbered matrix if you have no clue how similar your sequences are, but you can look up which matrix to use (BLOSUM62 works the best over the widest range).

22 GAPS and penalties
Other types of mutations involve insertions and deletions of sequence, collectively called indels.
ACGATCGTCATCGATCGA
ACGATCTCGATCGA
These two sequences only align well across <half of their sequences. If we introduce four gaps, it is obvious that these sequences are more related than we thought.
ACGATCGTCATCGATCGA
ACGATC----TCGATCGA
Indels make us underestimate similarity because there is a break in the sequence similarity (e.g. the sequences now misalign by a few nucleotides), so we are allowed to add gaps.
But there must be a penalty for doing this (i.e. lowering the overall score), because you are modifying the sequence to make it match up more: the alignment will show higher similarity, but the overall score will actually be lower due to the penalties.
So how do you choose what the gap penalty will be? Next slide.

23 Two kinds of Gap scoring methods
Affine gap penalty: G + Ln, where G is a penalty for introducing a gap, L is the penalty for lengthening the range of the gapped region, and n is the length of the gap (G > Ln).
There is a cost for introducing a gap and a cost for lengthening it, so the lengthening penalty is multiplied by the number of gapped positions.
Non-affine gap penalty: Ln. There is no penalty for opening a gap, and L is a fixed penalty for every gapped position.
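The two gap schemes can be compared on the gapped alignment from the previous slide with a small scoring sketch. The function name and the numeric values (match +1, mismatch -1, G = 5, L = 1) are assumptions picked for illustration, not BLAST defaults. Setting G to 0 gives the non-affine Ln scheme.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=5, gap_extend=1):
    """Score two aligned rows of equal length, with '-' marking gapped positions.

    Each run of gaps costs gap_open + gap_extend * n (the slide's G + Ln).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row was gapped in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            if gap_row != previous_gap_row:
                score -= gap_open  # G: charged once when a gap is introduced
            score -= gap_extend    # L: charged for every gapped position
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


top = "ACGATCGTCATCGATCGA"
bottom = "ACGATC----TCGATCGA"
print(score_alignment(top, bottom))              # affine: 14 matches - (5 + 4*1)
print(score_alignment(top, bottom, gap_open=0))  # non-affine: 14 matches - 4*1
```

With an affine penalty, one gap of length four is cheaper than four separate gaps of length one, which favors alignments that explain the difference with a single indel event.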

24 So, How does BLAST work?
Words, Neighbourhoods, and High Scoring Segment Pairs, Oh My!
The Word is the minimum length of sequence that is used to start a search, usually three amino acids (e.g. RDQ from the query RDQYPQW).
Neighbourhoods are words similar to the query word (e.g. RDQ vs RBQ vs RDE). These are subject to the scoring matrix.
BLAST doesn’t compare the whole sequence all at once: it starts with a small word of about three aa, casts a wide net, and then goes through a process of weeding out false positives.
The query is decomposed into all possible three-letter words.
BLAST then tries to lengthen each word match and get the highest possible score.
A High scoring Segment Pair is the region of sequence over which the highest scoring Word can be extended the most as matching sequence.
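The word and neighbourhood steps can be sketched as follows. This is a toy illustration, not BLAST's code: the scoring function is a stand-in for a real matrix such as BLOSUM62 (identical residues +5, a handful of assumed "similar" pairs +2, everything else -2), and the threshold of 11 is likewise an assumed value.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy stand-in for a scoring matrix: a few pairs treated as chemically similar.
SIMILAR_PAIRS = {frozenset(p) for p in ("DE", "KR", "EQ", "IL", "IV", "FY")}


def toy_score(x, y):
    """Assumed scores: identity +5, similar pair +2, anything else -2."""
    if x == y:
        return 5
    if frozenset((x, y)) in SIMILAR_PAIRS:
        return 2
    return -2


def words(query, k=3):
    """Decompose the query into all overlapping words of length k."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighbourhood(word, threshold=11):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


print(words("RDQYPQW"))
print(neighbourhood("RDQ"))
```

Lowering the threshold widens the net (more neighbourhood words, more seeds, more false positives to weed out later); raising it makes the search faster but less sensitive.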

25 Detection of high scoring segments.
The extension backs up to the optimum and reports the sequence score at its maximum.
After a BLAST search you can investigate all of the high scoring sequences; BLAST looks for the longest sequence with the highest score based on the matrix and reports that.
The output shows you the accession number for each matching sequence and shows you the highest match.
The Expect threshold reflects the likelihood that a returned sequence match is due to chance alone; the lower the number is set, the smaller the chance of getting sequences due to chance.
Lexa M et al. Bioinformatics 2011;27: © The Author. Published by Oxford University Press. All rights reserved.
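The "extend, then back up to the optimum" step can be sketched as an ungapped extension in one direction. The scores (+2 match, -3 mismatch) and the drop-off limit are assumed values, and the function name is hypothetical; a real search extends a seed in both directions and uses the chosen scoring matrix.

```python
def extend_right(query, subject, q_pos, s_pos, match=2, mismatch=-3, x_drop=5):
    """Extend an ungapped alignment rightwards from (q_pos, s_pos).

    The running score is tracked position by position. Extension stops once
    the score has fallen more than `x_drop` below the best score seen, and
    the result is backed up to that best point.
    Returns (best_score, length_at_best_score).
    """
    score = 0
    best_score = 0
    best_length = 0
    length = 0
    while q_pos + length < len(query) and s_pos + length < len(subject):
        same = query[q_pos + length] == subject[s_pos + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            break  # too far below the optimum: stop extending
    return best_score, best_length


# Eight matching bases followed by unrelated sequence: extension overshoots
# into the mismatches, then reports the score at the optimum.
print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
# A single mismatch is tolerated when later matches recover the score.
print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0))
```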

26 Let’s Blast!!

