1
Database searching

2
Purposes of similarity search
- Function prediction by homology (in silico annotation)
- Search for an identified gene in other organisms
- Identifying regulatory elements
- Assisting in sequence assembly

Problems
- Similar sequences can have different functions
- Non-homologous sequences can have identical function
- Feature space ≠ sequence space

3
Some databases
- nr (GenBank nucleotide and protein)
- month: monthly update
- swissprot (protein)
- EST
- pdb (proteins with 3D structures)
- Various genome databases (human, mouse, etc.)

4
Main tools
- FASTA
- BLAST = Basic Local Alignment Search Tool

Procedure
1. Choose a scoring matrix
2. Find the best local alignments using the scoring matrix
3. Determine the statistical significance of the result
- List hits in decreasing order of significance

5
BLOSUM substitution matrix
Log-odds scores: $s = 2\log_2(\text{proportion observed} / \text{proportion expected})$, rounded to the nearest integer (half-bit units, as in BLOSUM62)
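The log-odds arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes half-bit scaling; the frequencies are hypothetical and are not taken from any real BLOSUM table.

```python
import math

def log_odds_score(observed: float, expected: float) -> int:
    """Half-bit log-odds score: round(2 * log2(observed / expected))."""
    return round(2 * math.log2(observed / expected))

# Hypothetical pair frequencies, chosen only to illustrate the arithmetic.
print(log_odds_score(0.02, 0.005))   # pair seen 4x more often than expected by chance
print(log_odds_score(0.005, 0.005))  # pair seen exactly as often as expected
print(log_odds_score(0.001, 0.004))  # pair seen 4x less often than expected
```

Pairs that occur more often than chance get positive scores, and pairs that occur less often get negative scores.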

6
FASTA
Step 1: Find hot-spots, i.e. pairs of words of length k that match exactly (hashing).
Step 2: Locate the best "diagonal runs" (sequences of consecutive hot-spots on a diagonal).
Step 3: Combine sub-alignments from diagonal runs into a longer alignment.

7
Exercise (hashing tables of FASTA)
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
Prepare a table of offset values (= matching diagonals).

8
Solution

offset = 0 (matches: C, S, Q)
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD

offset = -3 (matches: G, C)
sequence 1: ACNGTSCHQE---
sequence 2: ---GCHCLSAGQD

offset = -5 (matches: CH)
sequence 1: ACNGTSCHQE-----
sequence 2: -----GCHCLSAGQD
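The offset table for this exercise can be reproduced with a small hashing sketch. It assumes word length k = 1 and defines the offset as the position in sequence 2 minus the position in sequence 1, which matches the signs used in the solution. The function name `offset_table` is illustrative only.

```python
from collections import defaultdict

def offset_table(seq1: str, seq2: str, k: int = 1) -> dict[int, list[str]]:
    """Map each offset (position in seq2 minus position in seq1) to the
    k-letter words that match exactly on that diagonal."""
    # Hash every word of seq1 to the positions where it occurs.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)
    # Look up every word of seq2 and record the diagonal of each hit.
    table = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for i in positions.get(word, []):
            table[j - i].append(word)
    return dict(table)

table = offset_table("ACNGTSCHQE", "GCHCLSAGQD")
# Print diagonals with the most matches first.
for offset in sorted(table, key=lambda o: -len(table[o])):
    print(offset, table[offset])
```

Diagonals with many hits are the candidates for the "diagonal runs" of step 2.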

9
The main steps of gapped BLAST
1. Specify word length (3 for proteins, 11 for nucleotides)
2. Filter for complexity
3. Make a list of words to search for
4. Exact search
5. Join matches, and extend the ungapped alignment
6. Calculate E-values
7. Join high-scoring pairs
8. Perform Smith-Waterman on the best matches

10
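Filtering sequences
Regions of low complexity K are replaced with X. For a window of length L over an alphabet of N letters, where letter i occurs $n_i$ times:
$K = \frac{1}{L}\log_N\left(\frac{L!}{\prod_i n_i!}\right)$

Find K for sequence GGGG and for sequence ATCG.

GGGG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = 4$, $n_C = n_T = n_A = 0$; $\prod_i n_i! = 4! \cdot 0! \cdot 0! \cdot 0! = 24$; $K = \frac{1}{4}\log_4(24/24) = 0$

ATCG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = n_C = n_T = n_A = 1$; $\prod_i n_i! = 1! \cdot 1! \cdot 1! \cdot 1! = 1$; $K = \frac{1}{4}\log_4(24/1) = 0.573$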
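The two worked values can be checked with a short Python sketch. It assumes the complexity measure K = (1/L) log_N(L! / ∏ n_i!) with alphabet size N = 4 for DNA, which is the form the worked numbers imply. The function name `complexity` is illustrative.

```python
import math
from collections import Counter

def complexity(seq: str, alphabet_size: int = 4) -> float:
    """K = (1/L) * log_N(L! / prod(n_i!)) for a window of length L."""
    length = len(seq)
    # Letters that do not occur contribute 0! = 1, so they can be skipped.
    denominator = math.prod(math.factorial(n) for n in Counter(seq).values())
    return math.log(math.factorial(length) / denominator, alphabet_size) / length

print(complexity("GGGG"))            # homopolymer: lowest complexity
print(round(complexity("ATCG"), 3))  # all four letters once: highest for L = 4
```

A filter would slide such a window along the sequence and mask the windows whose K falls below a chosen cutoff.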

11
The BLAST algorithm
- Break the search sequence into words
  - W = 3 for proteins, W = 11 for DNA
- Include in the search all words that score above a certain value (T) for any search word

Example: MCGPFILGTYC
Words: MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
Neighborhood of MCG: MCG, MCT, MCN, …
Neighborhood of CGP: CGP, MGP, CTP, …
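A minimal sketch of the word list and the neighborhood expansion follows. The scoring function `toy_score` (+5 for an identity, -1 for a mismatch) is a hypothetical stand-in for a real substitution matrix, and the sketch keeps words that score at least T; both choices are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def query_words(query: str, w: int = 3) -> list[str]:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring: +5 per identical residue, -1 per mismatch."""
    return sum(5 if x == y else -1 for x, y in zip(a, b))

def neighborhood(word: str, threshold: int) -> list[str]:
    """All words of the same length scoring at least `threshold` against `word`."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if toy_score(word, c) >= threshold]

words = query_words("MCGPFILGTYC")
print(words)
print(len(neighborhood("MCG", threshold=9)))  # the word itself plus all single substitutions
```

With a real matrix such as BLOSUM62 the neighborhood is not symmetric across substitutions, so conservative replacements are kept and unlikely ones are dropped.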

12
The BLAST search algorithm

13
Searching the database
- Search for the words in the database
- Word locations can be precomputed and indexed
- Searching for a short string in a long string
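Precomputing word locations can be sketched as a dictionary from each word to the places where it occurs. The two-sequence database and the name `build_word_index` are hypothetical; a real index would be built once and stored on disk.

```python
from collections import defaultdict

def build_word_index(database: dict[str, str], w: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Precompute word -> list of (sequence id, position) for every database word."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index

# Hypothetical database, for illustration only.
database = {"seq_a": "MCGPFIL", "seq_b": "TTMCGAA"}
index = build_word_index(database)
print(index["MCG"])  # every location where the query word occurs
```

Each lookup is then a single dictionary access instead of a scan over the whole database.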

14
Search significance scores
- A search will always return some hits.
- How can we determine how "unusual" a particular alignment score is?
- Assumptions

15
Assessing significance requires a distribution
- I have an apple of diameter 5". Is that unusual?
[Figure: histogram of apple diameters; x-axis: Diameter (cm), y-axis: Frequency]

16
Is a match significant?
- Match scores for aligning my sequence with random sequences.
- Depends on:
  - Scoring system
  - Database
  - Sequence to search for
    - Length
    - Composition
- How do we determine the random sequences?
[Figure: histogram of match scores; x-axis: Match score, y-axis: Frequency]

17
Generating "random" sequences
- Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
  - Doesn't reflect nature
- Use sequences from a database
  - Might have genuine homology
  - We want unrelated sequences
- Random shuffling of sequences
  - Preserves composition
  - Removes true homology
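The shuffling option can be sketched with the standard library. The helper name `shuffled` and the fixed seed are assumptions made so the example is repeatable.

```python
import random
from collections import Counter

def shuffled(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq: same residues, new order."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "ACNGTSCHQE"
randomized = shuffled(original, rng)
# The composition check prints True: shuffling never changes residue counts.
print(randomized, Counter(randomized) == Counter(original))
```

Scoring the query against many such shuffled sequences gives an empirical background distribution of match scores.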

18
What distribution do we expect to see?
- The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
- Example: Throw n dice and compute the mean.
[Figure: distribution of means for n = 2 and n = 1000]
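The dice example can be simulated directly. This is a rough sketch: the number of repetitions (500) and the seed are arbitrary choices, not values from the slides.

```python
import random
import statistics

def mean_of_dice(n: int, rng: random.Random) -> float:
    """Throw n fair dice and return the mean of the faces."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = {n: [mean_of_dice(n, rng) for _ in range(500)] for n in (2, 1000)}
for n, means in samples.items():
    # Report the centre and the spread of the simulated means.
    print(n, round(statistics.fmean(means), 2), round(statistics.stdev(means), 3))
```

For larger n the means cluster far more tightly around 3.5, and a histogram of them looks increasingly bell-shaped.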

19
Determining significance of match
- The score of an ungapped alignment is $S = \sum_i s(x_i, y_i)$.
- The scores of individual sites are independent.
- The sum of a large number of independent, identically distributed random variables is approximately normally distributed (central limit theorem).

20
Determining significance of match However, we don't select scores randomly. We take the maximum extension of the initial word (HSP). The distribution of the maximum score of a large number N of i.i.d. random variables is called the extreme value distribution.
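The difference between summing scores and taking the maximum can be seen in a small simulation. It assumes standard normal scores purely for illustration; N, the number of trials, and the seed are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 200        # number of i.i.d. scores per trial (arbitrary)
TRIALS = 2000  # number of trials (arbitrary)

sums, maxima = [], []
for _ in range(TRIALS):
    scores = [rng.gauss(0.0, 1.0) for _ in range(N)]
    sums.append(sum(scores))    # central limit theorem applies here
    maxima.append(max(scores))  # extreme value statistics apply here

print("mean of sums:  ", round(statistics.fmean(sums), 2))
print("mean of maxima:", round(statistics.fmean(maxima), 2))
```

The sums scatter symmetrically around zero, while the maxima sit well above zero, so judging a maximum against a Gaussian centred on the typical score would overstate its significance.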

21
Comparing distributions: extreme value vs. Gaussian
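For reference, the standard textbook densities of the two distributions are given here. The symbols are assumptions, since the slide's own notation did not survive extraction: $\mu$ and $\beta$ are the location and scale of the extreme value (Gumbel) distribution, and $\mu$ and $\sigma$ are the mean and standard deviation of the Gaussian.

```latex
f_{\mathrm{EVD}}(x) = \frac{1}{\beta}
  \exp\!\left(-\frac{x-\mu}{\beta} - e^{-(x-\mu)/\beta}\right),
\qquad
f_{\mathrm{Gauss}}(x) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The extreme value density is asymmetric, with a longer tail towards high scores than a Gaussian of comparable width.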

22
Determining P-values
- If we can estimate the parameters of the extreme value distribution, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.
- For sequence matches, a scoring system and database can be parameterized by two parameters, K and λ, which are related to those distribution parameters.
- It would be nice if we could compare hit significance without regard to the database and scoring system used!

23
P-values
$P(S > x) = 1 - \exp\left(-K m' n' e^{-\lambda x}\right)$
This is the probability of observing a score S > x. m' and n' are the effective query and database sequence lengths; K and λ are substitution matrix parameters.

24
Determining significance of match
- E-value = expected number of hits scoring S or above by chance in the given database
- Low E-values => significant matches
- When E < 0.01, P-values and E-values are nearly identical
- Bit score: the raw alignment score S normalized by the scoring-system parameters, $S' = (\lambda S - \ln K) / \ln 2$, so that scores from different scoring systems can be compared
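The relationship between E-values, P-values, and bit scores can be sketched numerically. The sketch assumes the standard forms E = K m' n' e^(-λS), P = 1 - e^(-E), and S' = (λS - ln K) / ln 2. The values of K and λ and the effective lengths are illustrative assumptions, not values from the slides.

```python
import math

def e_value(score: float, m_eff: int, n_eff: int, K: float, lam: float) -> float:
    """E = K * m' * n' * exp(-lambda * S)."""
    return K * m_eff * n_eff * math.exp(-lam * score)

def p_value(score: float, m_eff: int, n_eff: int, K: float, lam: float) -> float:
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value(score, m_eff, n_eff, K, lam))

def bit_score(score: float, K: float, lam: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)

# Illustrative values only: K, lam, and both effective lengths are assumed.
K, lam = 0.134, 0.3176
m_eff, n_eff = 250, 5_000_000
for s in (60, 80, 100):
    e = e_value(s, m_eff, n_eff, K, lam)
    p = p_value(s, m_eff, n_eff, K, lam)
    print(s, f"E={e:.3g}", f"P={p:.3g}", f"bits={bit_score(s, K, lam):.1f}")
```

Once E drops below about 0.01 the two numbers agree closely, because 1 - e^(-E) ≈ E for small E. In bit-score form the E-value reduces to m' n' 2^(-S'), which no longer mentions K or λ.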

25
Smith-Waterman local alignment
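A minimal sketch of the Smith-Waterman recurrence, returning only the best local score. The scoring values (match +2, mismatch -1) and the linear gap penalty of -2 are assumptions for illustration; gapped BLAST itself uses a substitution matrix and affine gap costs.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # row i-1 of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: this is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared core "ACG"
```

Keeping only two rows is enough for the score; recovering the alignment itself requires the full matrix and a traceback from the best cell.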

26
BLAST parameters
- Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the result set.
- Raising the segment extension cutoff (X) returns longer extensions for each hit.
- Changing the E-value cutoff (the maximum E-value reported) changes the threshold for reporting a hit.

27
BLAST flavours
Basic flavours
- BLASTP (proteins to protein database)
- BLASTN (nucleotides to nucleotide database)
- BLASTX (translated nucleotides to protein database)
- TBLASTN (protein to translated database)
- TBLASTX (translated nucleotides to translated database) - SLOW

28
Example
Cloned sequence from Lotus japonicus

Amino-acid level (BlastP):
LLANGNFVLRESGNKDQDGLVWQSFDFPTDTLLPQMKLGWDRKTGLNKI LRSWKSPSDPSSGYYSYKLEFQGLPEYFLNNRDSPTHRSGPWDGIRFSGIPEK

Nucleotide level (BlastN):
cttctcgcta atggcaattt cgtgctaaga gagtctggca acaaagatca agatgggtta gtgtggcaga gtttcgattt tcccactgac actttactcc cgcagatgaa actgggatgg gatcgcaaaa cagggcttaa caaaatcctc agatcctgga aaagcccaag tgatccgtca agtgggtatt actcgtataa actcgaattt caagggctcc ctgagtattt tttaaacaac agagactcgc caactcaccg gagcggtccg tgggatggta tccgatttag tggtattcca
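The two levels of the example are related by translation, which is also what the translated BLAST flavours do internally. This sketch assumes the standard genetic code and reading frame 1, and translates only the first 30 nucleotides of the example.

```python
BASES = "tcag"
# Standard genetic code, codons ordered ttt, ttc, tta, ttg, tct, ... ggg.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; spaces and case are ignored."""
    dna = dna.replace(" ", "").lower()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(translate("cttctcgcta atggcaattt cgtgctaaga"))  # LLANGNFVLR
```

The output matches the first ten residues of the amino-acid sequence above. BLASTX would repeat this for all six reading frames before searching.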

35
Matrix parameters

36
Gap parameters

37
Hits

38
Synteny between the rat, mouse and human genomes (Nature 2004)

39
Iterated searches
- Advanced family searches
- PSI-BLAST (Position Specific Iterated BLAST)

40
PSI-BLAST
Search with BLAST using the given query.
while (there are new significant hits)
    combine all significant hits into a profile
    search with BLAST using the profile
end
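The loop can be sketched in Python with the search and profile steps passed in as functions. Here `search`, `build_profile`, and the `max_iterations` safety limit are hypothetical placeholders, not part of any real BLAST API; the toy functions below only mimic the control flow.

```python
def psi_blast(query, search, build_profile, max_iterations=5):
    """Iterate search -> profile -> search until no new significant hits appear.

    `search` and `build_profile` are hypothetical callables supplied by the
    caller; they stand in for a real BLAST search and profile construction.
    """
    hits = set(search(query))
    for _ in range(max_iterations):
        profile = build_profile(query, hits)
        new_hits = set(search(profile)) - hits
        if not new_hits:
            break  # converged: the profile finds nothing new
        hits |= new_hits
    return hits

# Toy stand-ins: "q" detects "a", and a profile containing "a" also detects "b".
RELATED = {"q": {"a"}, "a": {"b"}, "b": set()}

def toy_search(profile):
    members = {profile} if isinstance(profile, str) else profile
    return set().union(*(RELATED[m] for m in members))

def toy_profile(query, hits):
    return {query} | hits

print(sorted(psi_blast("q", toy_search, toy_profile)))
```

In the toy run the sequence "b" is found only through the profile that includes "a", which is the kind of distant relative an iterated search is meant to recover.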

41
PSI-BLAST Greedy algorithm
