Pairwise alignments.




1 Pairwise alignments

2 BLAST: a heuristic search method that seeks words of length W (default 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.

3 BLAST Algorithm - Input Parameters
W - the length of the words for which we look for almost exact matches (default W = 11 for nucleotide searches, 3 for protein searches).
Expect - the number of different alignments with scores of at least S that are expected to occur in a database search by chance (default E = 10).
The score distribution follows the extreme value distribution: $E = K m n e^{-\lambda S}$. $K$ and $\lambda$ are scales for the search space size and the scoring system, respectively. $n$ is the length of the query sequence; $m$ is the size of the database (all sequences concatenated).
Intuition: doubling $m$ or $n$ doubles $E$; raising the score $S$ decreases $E$ exponentially. Lower EXPECT thresholds are more stringent, leading to fewer matches reported.
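
The E-value formula above can be tried out numerically. The following is a minimal sketch only: the function name `expected_hits` and the default values of `K` and `lam` are illustrative assumptions, not parameters taken from the slides or from any particular BLAST run.

```python
import math

def expected_hits(score, query_len, db_len, K=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S).

    K and lam are placeholder values chosen for illustration; real values
    depend on the scoring matrix and gap penalties in use.
    """
    return K * db_len * query_len * math.exp(-lam * score)

# Doubling the database size doubles E; raising S shrinks E exponentially.
e_small_db = expected_hits(score=50, query_len=200, db_len=1_000_000)
e_large_db = expected_hits(score=50, query_len=200, db_len=2_000_000)
e_high_score = expected_hits(score=60, query_len=200, db_len=1_000_000)
print(e_small_db, e_large_db, e_high_score)
```

Comparing the three printed values shows the linear dependence on the search space and the exponential dependence on the score.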

4 Gap Models
Gap: any maximal consecutive stretch of spaces in a single sequence in a given alignment.
Example alignment (4 gaps):
S = ATTC--GA-TGGACC
T = A--CGTGATT---CC
Motivation: indels create gaps; cDNA matching involves gaps.
Gap penalty types, examples: Constant - the cost is independent of the number of spaces. Affine - a cost for opening a gap combined with a cost for each extra space within the gap.
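
As a rough illustration of the two penalty types, the sketch below counts the gaps in the example alignment and prices them both ways. The cost values (5 per gap, 5 to open, 1 per extra space) and the function names are assumptions made for the example. Note that some tools charge the extension cost for every space including the first; here the opening cost covers the first space, following the wording on the slide.

```python
import re

def gap_lengths(aligned):
    """Lengths of each maximal run of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]

def constant_penalty(aligned, cost=5):
    """Constant model: every gap costs the same, whatever its length."""
    return cost * len(gap_lengths(aligned))

def affine_penalty(aligned, gap_open=5, gap_extend=1):
    """Affine model: opening cost plus a cost for each extra space in the gap."""
    return sum(gap_open + gap_extend * (n - 1) for n in gap_lengths(aligned))

S = "ATTC--GA-TGGACC"
T = "A--CGTGATT---CC"

total_gaps = len(gap_lengths(S)) + len(gap_lengths(T))
constant_total = constant_penalty(S) + constant_penalty(T)
affine_total = affine_penalty(S) + affine_penalty(T)
print(total_gaps, constant_total, affine_total)
```

With these assumed costs, the affine model charges more than the constant model only because some gaps are longer than one space.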

5 PROTEOMICS The Study of Proteins. Pairwise Alignments.

6 Jellyfish green fluorescent protein
Spider webs, firefly light, rhino horn, cobra venom. Also: feathers, porcupine quills, fingernails, wool, scales, tortoise shells, etc.

7 What are Proteins?
Proteins are abundant molecules, found in all organisms, and form the very basis of life. Proteins are polypeptides, made of amino acid chains. There are 20 amino acids (building blocks). The amino acids are linked by peptide bonds. The amino acids differ in their side chains. The genetic code: each amino acid is coded by 3 nucleotides, called a codon.

8 The Genetic Code
Each amino acid is coded by 3 nucleotides, called a codon. Code redundancy: most amino acids are coded by several codons. The 64 triplets code for 20 amino acids and 3 stop codons.
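
The codon counts can be checked by enumerating all triplets. This sketch uses the standard genetic code written as a 64-character string in T, C, A, G base order (a common compact encoding, with `*` for stop); the variable names are chosen for the example only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (or stop).
codons_per_residue = Counter(CODON_TABLE.values())
print(len(CODON_TABLE), codons_per_residue["*"], codons_per_residue["L"])
```

The counter makes the redundancy visible: leucine, for instance, has several codons, while methionine and tryptophan have one each.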

9 Amino Acids - the building blocks of proteins:
From: The structure of life. (NIH and National Institute of General Medical Sciences) Side chains Glycine (hydrophilic) Asparagine (amides) Phenylalanine (aromatic) Methionine (hydrophobic)

10 Chemical Similarities Between Amino Acids:
Acids & Amides: DENQ (Asp, Glu, Asn, Gln)
Basic: HKR (His, Lys, Arg)
Aromatic: FYW (Phe, Tyr, Trp)
Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic: ILMV (Ile, Leu, Met, Val)

11 Allowable Amino Acid Substitution Groups

12 Protein Pairwise Sequence Similarity
The alignment tools are similar to the DNA alignment tools: BLASTP, FASTA, PSI-BLAST. Main difference: instead of scoring match (+1) and mismatch (-2), we have similarity scores: g(a,b) is high (> 0) if amino acids a and b have similar properties; g(a,b) is low (≤ 0) otherwise.

13

14 identity similarity

15 Scoring Matrices A matrix of 20x20 entries
Entry (i,j) is the score of aligning amino acid i against amino acid j. Entry (i,j) is equal to entry (j,i): scoring matrices are symmetric. Entry (i,i) is greater than any entry (i,j), j ≠ i.

16 Log-odds
Scoring matrices in general can be written as $S_{ij} = \log \frac{q_{ij}}{p_i p_j}$, where:
$q_{ij}$ - target frequency; the sum over all i, j of $q_{ij}$ is 1.
$p_i$ - background frequencies.
Score and frequency of substitution:
> 0: more frequent than expected
= 0: as expected
< 0: less frequent than expected
Background frequency is easy to compute. Target frequency differs between methods. Most common scoring matrices: PAM and BLOSUM.
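
To make the sign convention concrete, here is a small sketch of a log-odds score on a made-up two-letter alphabet. The frequencies are invented for illustration, the log base 2 is an assumption, and `log_odds` is a hypothetical helper name.

```python
import math

def log_odds(q, p):
    """Score each ordered pair as log2(target / (background_a * background_b))."""
    return {(a, b): math.log2(q_ab / (p[a] * p[b])) for (a, b), q_ab in q.items()}

# Toy background frequencies and joint target frequencies (sum to 1).
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

scores = log_odds(q, p)
for pair, s in sorted(scores.items()):
    print(pair, round(s, 3))
```

Identical pairs are observed more often than chance predicts here, so they score above zero; the mixed pairs are rarer than chance and score below zero.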

17 PAM - Point (Percent) Accepted Mutations
Developed by Margaret Dayhoff, 1978. A model for protein evolution. Analyzed very similar protein sequences: the proteins are evolutionarily close, so alignment is easy. Considered point mutations, mainly substitutions, that were accepted by natural selection. Found that the common substitutions involved chemically similar amino acids.

18 PAM Distance and Matrix
A measure of the likelihood of amino acid replacement, developed by counting the number of substitutions of each amino acid pair. 1 PAM unit = an average change in 1% of all amino acid positions. PAM1 matrix: the likelihood of replacement during 1 PAM unit. PAMn can be derived from PAM1 by treating it as a Markov chain: in step 1, amino acid a changes to b using PAM1(a,b); in step 2, amino acid b changes to c using PAM1(b,c); and so on for n steps.
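
The Markov-chain derivation amounts to multiplying the PAM1 mutation-probability matrix by itself n times. The sketch below shows this on an invented three-letter alphabet in which each letter changes with probability 1% per step; `TOY_PAM1` is not Dayhoff's data, and the helper names are assumptions for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Chain n one-step transitions: the n-th power of the PAM1 matrix."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result

# Toy transition probabilities: row a, column b = P(a changes to b in 1 PAM unit).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

pam2 = pam_n(TOY_PAM1, 2)
pam250 = pam_n(TOY_PAM1, 250)
print(pam2[0][0], pam250[0][0])
```

The diagonal entry shrinks as n grows, reflecting that a residue is less likely to be unchanged after many steps, while each row still sums to 1.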

19 PAM or Dayhoff Family of Matrices.
(The log odds matrix for PAM 250) Similar amino acids are close to each other. Regions define conserved substitutions. Correspond to sequences that are about 20% identical.

20 PAM - Rules of Thumb
When there is no information about evolutionary distance, three matrices are recommended for sequence comparison: PAM 40, PAM 120 and PAM 250. The PAM matrix for aligning two sequences should match their estimated evolutionary distance. [Table: PAM number against percent similarity; only the entry for sequences that are 20% similar survives, and the PAM numbers and other percentages are missing.] Low PAM numbers: short sequences, strong local similarities. High PAM numbers: long sequences, weak similarities.

21 BLOSUM - Blocks Substitution Matrix
Developed by Henikoff & Henikoff, 1992. Examined multiple alignments of distantly related protein regions directly (not extrapolating from closely related sequences). Based on the BLOCKS database: families of proteins whose members have identical biochemical functions. Aligned the members and found common motifs (common blocks of local alignment). Counted the amino acid replacements within the blocks.

22 BLOSUM - Blocks Substitution Matrix
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
First column: AABACA. Pairs count: 6 AA, 4 AB, 4 AC, 1 BC, 0 BB, 0 CC; 15 total.
$q_{i,j}$ = number of ij pairs / total number of pairs ($q_{A,B} = 4/15$).
$p_i$ = probability of occurrence of i: $p_i = q_{i,i} + \sum_{j \neq i} q_{i,j}/2$.
$e_{i,j}$ = expected probability of pair ij: $e_{i,j} = 2 p_i p_j$ for $i \neq j$; $e_{i,i} = p_i p_i$.
The matrix values are log(observed / expected): $\log_2(q_{i,j} / e_{i,j})$.
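
The pair counting for the column AABACA can be reproduced in a few lines. This is a sketch for a single column only, with function names invented for the example; an actual BLOSUM matrix pools counts over all columns of all blocks, and pairs that never occur (BB and CC here) are skipped because their log is undefined.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(column):
    """Count unordered residue pairs in one alignment column."""
    return Counter(tuple(sorted(pair)) for pair in combinations(column, 2))

def blosum_terms(column):
    """Return (q, p, scores) for one column, following the formulas above."""
    counts = pair_counts(column)
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    letters = sorted(set(column))
    # p_i = q_ii + sum over j != i of q_ij / 2
    p = {
        a: q.get((a, a), 0.0)
        + sum(q.get(tuple(sorted((a, b))), 0.0) for b in letters if b != a) / 2
        for a in letters
    }
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(q_ab / expected)
    return q, p, scores

counts = pair_counts("AABACA")
q, p, scores = blosum_terms("AABACA")
print(dict(counts))
print({pair: round(s, 3) for pair, s in scores.items()})
```

The counts match the slide (6 AA, 4 AB, 4 AC, 1 BC out of 15), and the residue probabilities derived from them sum to 1.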

23 THE BLOSUM Family of Matrices.
BLOSUMN is based on sequences that are at most N percent identical. (The log odds matrix for BLOSUM 45)

24 PAM Versus BLOSUM
PAM is based on an evolutionary model; BLOSUM is based on protein families. PAM is based on global alignment; BLOSUM is based on local alignment. PAM is for tracking the evolutionary origin of proteins; BLOSUM is designed to find their conserved regions.

25 Other Scoring Matrices
Scoring matrices for sequence alignment can be based on the following criteria: genetic code changes (the number of changes required to transform one codon into another); similarity of chemical properties (volume, polarity, etc.); structurally similar protein sequences; matrices specific to a protein family, e.g., trans-membrane proteins; matrices that take neighboring amino acids into account.

26

27 Principles for Protein Similarity Search:
Use BLOSUM 62 or PAM 120 and default gap penalties. If no significant results, use BLOSUM 30 or PAM 250 and lower gap penalties. Examine results between EXP and 10 for significance. PSI-BLAST for protein families.

28 Position Specific Iterated BLAST
Finds more distantly related sequences than FASTA or BLAST. Upon aligning a group of sequences, the vector of characters in a certain column is called a profile. Conserved regions: regions that are very similar (have profiles with little variance). Example alignment:
SAGSTGH
TAGSTAA
TCGSTCC
GST is a conserved region.
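
A column profile and its conserved positions can be computed directly from the three example sequences. The sketch below treats a column as conserved only when every sequence has the same residue there, which is a simplifying assumption; the function names are chosen for illustration.

```python
from collections import Counter

def column_profile(seqs):
    """One Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*seqs)]

def conserved_columns(seqs):
    """Indices of columns in which all sequences agree."""
    return [i for i, counts in enumerate(column_profile(seqs)) if len(counts) == 1]

seqs = ["SAGSTGH", "TAGSTAA", "TCGSTCC"]
cols = conserved_columns(seqs)
region = "".join(seqs[0][i] for i in cols)
print(cols, region)
```

A softer criterion, such as a threshold on the most common residue's share of each column, would also admit nearly conserved positions.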

29 PSI-BLAST Contd. A protein family contains conserved regions. These define the structure and function typical for this family. We would like the alignment score to consider how conserved a column is. PSI BLAST gives high scores to matches within conserved regions.

30 Profile Scoring

31 PSI-BLAST - (Position Specific Iterated BLAST)
An iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. Why use PSI-BLAST? It is an important tool for predicting both biochemical activity and function. It identifies weak homologies (distant relatives of a protein that are not found by FASTA or BLAST).

32 How Does PSI-BLAST Work ?
1. Compare the query sequence to the database (gapped BLAST).
2. Construct a profile from the significant alignments. Note: a highly conserved position will receive a high score, and weakly conserved positions receive scores near zero.
3. Compare the profile to the database.
4. Repeat steps 2 & 3 ("iterations") until no new significant sequences are found ("convergence").
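
The iterate-until-convergence loop can be sketched on a toy scale. Everything here is a simplifying assumption rather than how PSI-BLAST is implemented: sequences are ungapped and of equal length, the profile is a log-odds score against a uniform background with a pseudocount, and a raw score threshold stands in for the E-value cutoff. The names `build_profile`, `profile_score` and `iterate_search` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=1.0):
    """Per-column log2 odds of each residue against a uniform background."""
    background = 1 / len(ALPHABET)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return profile

def profile_score(profile, seq):
    """Sum of the position-specific scores along an ungapped sequence."""
    return sum(column[a] for column, a in zip(profile, seq))

def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = [query]
    for _ in range(max_rounds):
        profile = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if profile_score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # convergence: no new significant sequences
        hits = new_hits
    return hits

database = ["TAGSTAA", "TCGSTCC", "WWWWWWW"]
one_round = iterate_search("SAGSTGH", database, threshold=3.0, max_rounds=1)
converged = iterate_search("SAGSTGH", database, threshold=3.0)
print(one_round)
print(converged)
```

In this toy run the more distant sequence falls below the threshold against the query alone, but is picked up once the profile includes the first hit, which is the effect the iteration is meant to achieve.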

33 PSI-BLAST Search
Hits that are better than the E-value threshold are listed first. These hits are used in forming the profile that will be used in subsequent PSI-BLAST iterations. Hits with E-values worse than the threshold but better than 10 (default; selected on the query page) are listed further down the page. Any of the sequences in the list of "Sequences with E-value worse than threshold" (>0.005) can be manually added (by clicking) to the sequences used for generating the PSI-BLAST profile.

34

35 Databank of protein sequences, for both existing and putative proteins. Hbb human

36

37

38 SPECIAL BLAST PAGES

39 TaxBLAST: Organism Report
Columns: common name, BLAST name, scientific name. BLAST hits are sorted according to the species of the target sequence, so all the hits from the same organism appear together. Within each species, the BLAST hits are sorted by score.

40 Lineage Report and Taxonomy Report
How closely are the organisms in the BLAST hit list related to the query sequence?

41 Other BLAST Options
RPS-BLAST - a program that compares a protein sequence against the Conserved Domain Database (Smart and Pfam); it may provide functional identifications. PHI-BLAST (Pattern Hit Initiated BLAST) - can locate other protein sequences that contain the regular-expression pattern and are homologous to the query protein sequence.

42 Function - Structure Relationship
Protein function depends on the protein's 3D structure (example: zinc-finger proteins). Protein structure provides insight into protein function. How does a protein fold into its native structure?

43 Sequence - Structure Relationship
Early renaturation experiments have shown that the sequence of the protein is sufficient to determine its structure (Anfinsen, 1973). A major challenge in bio-informatics - Prediction of protein structure from its sequence.

