Pairwise alignments.


Pairwise alignments

A heuristic search method; seeks words of length W (default 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.
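The word-seeding step described above can be sketched in a few lines. This is a minimal illustration, not BLAST's implementation: `toy_score` (+5 for a match, -1 for a mismatch) is a hypothetical stand-in for a real substitution matrix, and the threshold T = 9 is chosen only so that the example stays small.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix entry.
    return 5 if a == b else -1

def neighborhood(word, threshold):
    """All words of the same length that score >= threshold against `word`."""
    hits = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        s = sum(toy_score(a, b) for a, b in zip(word, cand))
        if s >= threshold:
            hits.append("".join(cand))
    return hits

def seed_words(query, w=3, threshold=9):
    """Map each query position to the words of length w that score >= T there."""
    return {i: neighborhood(query[i:i + w], threshold)
            for i in range(len(query) - w + 1)}

seeds = seed_words("MKVLA")
print(len(seeds), len(seeds[0]))
```

With this toy scoring, each query word keeps itself plus every word that differs from it at exactly one position. Database words found in these lists would then be the starting points for extension into HSPs.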

BLAST Algorithm - Input Parameters
W - the length of the words for which we look for almost exact matches (default W = 11 for nucleotide searches, 3 for protein searches).
Expect (E) - the number of different alignments with scores of at least S that are expected to occur in a database search by chance (default E = 10).
The scores follow the extreme value distribution: $E = K m n e^{-\lambda S}$, where K and λ are scaling constants for the search space size and the scoring system, respectively, n is the length of the query sequence, and m is the size of the database (all sequences concatenated).
Intuition: doubling m or n doubles E; increasing the score S decreases E exponentially.
Lower Expect thresholds are more stringent, leading to fewer matches reported.
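As a rough numerical check of the intuition above, the sketch below evaluates the E-value formula directly. The values of `K` and `lam` are illustrative placeholders, not parameters taken from these slides, and the query and database sizes are made up.

```python
import math

def expected_hits(score, query_len, db_len, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S); K and lam are assumed placeholder values."""
    return K * db_len * query_len * math.exp(-lam * score)

e1 = expected_hits(score=60, query_len=250, db_len=1_000_000)
e2 = expected_hits(score=60, query_len=250, db_len=2_000_000)  # database doubled
e3 = expected_hits(score=70, query_len=250, db_len=1_000_000)  # score raised by 10

print(e2 / e1)  # effect of doubling the database size
print(e3 / e1)  # effect of raising the score
```

The first ratio is 2 (linear in the search space); the second is `exp(-lam * 10)`, which shows why a modest increase in score makes a chance match much less likely.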

Gap Models
Motivation: indels create gaps; cDNA matching involves gaps.
Gap: any maximal consecutive stretch of spaces in a single sequence in a given alignment.
Example alignment (4 gaps):
S = ATTC--GA-TGGACC
T = A--CGTGATT---CC
Gap penalty types - examples:
Constant - the cost is independent of the number of spaces.
Affine - a cost for opening a gap combined with a cost for each extra space within the gap.
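The two penalty types can be compared on the example alignment with a short sketch. The cost values (5 per gap, 5 to open, 1 per extra space) are assumptions chosen for illustration only.

```python
import re

def gap_lengths(aligned):
    """Lengths of the maximal runs of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned)]

def count_gaps(aligned):
    return len(gap_lengths(aligned))

def constant_penalty(aligned, cost=5):
    # Constant model: one fixed cost per gap, regardless of its length.
    return cost * count_gaps(aligned)

def affine_penalty(aligned, gap_open=5, gap_extend=1):
    # Affine model: opening cost once per gap, plus a cost per extra space.
    return sum(gap_open + gap_extend * (n - 1) for n in gap_lengths(aligned))

S = "ATTC--GA-TGGACC"
T = "A--CGTGATT---CC"

total_gaps = count_gaps(S) + count_gaps(T)
constant_total = constant_penalty(S) + constant_penalty(T)
affine_total = affine_penalty(S) + affine_penalty(T)
print(total_gaps, constant_total, affine_total)
```

Under the affine model the long gap in T costs more than the single-space gap in S, while the constant model charges both the same.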

PROTEOMICS The Study of Proteins. Pairwise Alignments.

Jellyfish green fluorescent protein Spider webs Fireflies light Rhino horn Cobra’s venom Also: feathers, porcupine quills, fingernails, wool, scales tortoise shells etc.

What are Proteins? Proteins are abundant molecules, found in all organisms, that form the very basis of life. Proteins are polypeptides: chains of amino acids. There are 20 amino acids (building blocks). The amino acids are linked by peptide bonds and differ in their side chains. The genetic code - each amino acid is coded by 3 nucleotides, called a codon.

The Genetic Code
The genetic code - each amino acid is coded by 3 nucleotides, called a codon.
Code redundancy - most amino acids are coded by several codons: of the 64 triplets, 61 code for the 20 amino acids and 3 are stop codons.
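The redundancy of the code can be checked by building the standard codon table and counting codons per amino acid. This sketch uses DNA letters (T instead of U) and the standard genetic code; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
codons_per_residue = Counter(CODON_TABLE.values())

n_stop = codons_per_residue["*"]
n_amino_acids = len(codons_per_residue) - 1   # exclude the stop symbol
print(len(CODON_TABLE), n_amino_acids, n_stop)
print(codons_per_residue["L"], codons_per_residue["M"], codons_per_residue["W"])
```

Leucine has six codons, while methionine and tryptophan have one each, which is what "most amino acids are coded by several codons" means in practice.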

Amino Acids - the building blocks of proteins: From: The structure of life. (NIH and National Institute of General Medical Sciences) www.nigms.nih.gov Side chains Glycine (hydrophilic) Asparagine (amides) Phenylalanine (aromatic) Methionine (hydrophobic)

Chemical Similarities Between Amino Acids:
Acids & Amides - DENQ (Asp, Glu, Asn, Gln)
Basic - HKR (His, Lys, Arg)
Aromatic - FYW (Phe, Tyr, Trp)
Hydrophilic - ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic - ILMV (Ile, Leu, Met, Val)

Allowable Amino Acid Substitution Groups http://www.imb-jena.de/IMAGE_AA.html

Protein Pairwise Sequence Similarity
The alignment tools are similar to the DNA alignment tools: BLASTP, FASTA, PSI-BLAST.
Main difference: instead of scoring match (+1) and mismatch (-2), we use similarity scores: g(a,b) is high (> 0) if amino acids a and b have similar properties, and low (≤ 0) otherwise.
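A toy version of such a similarity score can be built from the chemical groups listed earlier (DENQ, HKR, FYW, ACGPST, ILMV). The score values 2, 1 and -1 are hypothetical; real tools use matrices such as PAM or BLOSUM instead.

```python
GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def g(a, b, identical=2, similar=1, dissimilar=-1):
    """Toy similarity score; the three score values are assumptions."""
    if a == b:
        return identical
    if GROUP_OF[a] == GROUP_OF[b]:
        return similar
    return dissimilar

def score_ungapped(x, y):
    """Sum of pairwise scores over two aligned, equal-length sequences."""
    return sum(g(a, b) for a, b in zip(x, y))

total = score_ungapped("DKFAL", "EKYGW")
print(total)
```

In this pair only K/K is an identity, but D/E, F/Y and A/G still score positively because they fall in the same group; L/W is the only penalized position.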

identity similarity

Scoring Matrices
A matrix of 20x20 entries.
Entry (i,j) is the score of aligning amino acid i against amino acid j.
Entry (i,j) is equal to entry (j,i): scoring matrices are symmetric.
Entry (i,i) is greater than any entry (i,j), j ≠ i.

Log-odds
Scoring matrices in general can be written as: $S_{ij} = \log \frac{q_{ij}}{p_i p_j}$, where:
$q_{ij}$ - target frequency. The sum of $q_{ij}$ over all pairs (i, j) is 1.
$p_i$ - background frequencies.
Score and frequency of substitution:
> 0 - more frequent than expected
= 0 - as expected
< 0 - less frequent than expected
Background frequency is easy to compute. Target frequency differs between methods.
Most common scoring matrices - PAM and BLOSUM.

PAM - Point (Percent) Accepted Mutations
Developed by Margaret Dayhoff, 1978.
A model for protein evolution:
Analyzed very similar protein sequences: the proteins are evolutionarily close, so alignment is easy.
Point mutations - mainly substitutions.
Accepted mutations - accepted by natural selection.
Found that the common substitutions involve chemically similar amino acids.

PAM Distance and Matrix
A measure of the likelihood of amino acid replacement, developed by counting the number of substitutions of each amino acid pair.
1 PAM unit = an average change in 1% of all amino acid positions.
PAM1 matrix - the likelihood of replacement during 1 PAM unit.
PAMn can be derived from PAM1 (Markov chain):
in step 1, amino acid a changes to b using PAM1(a,b)
in step 2, amino acid b changes to c using PAM1(b,c)
…
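The Markov-chain derivation amounts to multiplying the PAM1 transition matrix by itself n times. The sketch below uses a hypothetical 3-letter alphabet with made-up probabilities (each row keeps 99% of positions unchanged); these are not Dayhoff's values.

```python
def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Multiply the transition matrix m by itself n times (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy PAM1-like matrix: entry [a][b] is the probability that a becomes b in one step.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

toy_pam2 = mat_pow(TOY_PAM1, 2)
toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam2[0][2], toy_pam250[0][0])
```

Each entry of the two-step matrix sums over every intermediate letter b, which is exactly the "a changes to b, then b changes to c" path on the slide. After 250 steps the probability of remaining unchanged has dropped well below its one-step value.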

PAM or Dayhoff Family of Matrices. (The log odds matrix for PAM 250) Similar amino acids are close to each other. Regions define conserved substitutions. Correspond to sequences that are about 20% identical.

PAM - Rules of Thumb
When there is no information about evolutionary distance, three matrices are recommended for sequence comparison: PAM 40, PAM 120 and PAM 250.
The PAM matrix for aligning two sequences should match their estimated evolutionary distance:
PAM 250 - sequences that are 20% similar
PAM 120 - 40% similar
PAM 80 - 50% similar
PAM 60 - 60% similar
Low PAM numbers: short sequences, strong local similarities.
High PAM numbers: long sequences, weak similarities.

BLOSUM - Blocks Substitution Matrix
Developed by Henikoff & Henikoff, 1992.
Examined multiple alignments of distantly related protein regions directly (rather than extrapolating from closely related sequences).
Based on the BLOCKS database (www.blocks.fhcrc.org):
Families of proteins - family members have identical biochemical functions.
Aligned the members and found common motifs: blocks of local alignment.
Counted the amino acid replacements within the blocks.

BLOSUM - Blocks Substitution Matrix
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
First column: AABACA
Pairs count: 6 AA, 4 AB, 4 AC, 1 BC, 0 BB, 0 CC; 15 total.
$q_{ij}$ = number of ij pairs / total number of pairs ($q_{AB} = 4/15$).
$p_i$ = probability of i appearing: $p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2$
$e_{ij}$ = expected probability of pair ij: $e_{ij} = 2 p_i p_j$ for $i \ne j$; $e_{ii} = p_i p_i$
The matrix values are log(observed / expected): $\log_2(q_{ij} / e_{ij})$
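The counting for the single column AABACA can be reproduced with exact fractions. This is a sketch of the arithmetic on this slide only; a real BLOSUM matrix pools counts over all columns of all blocks and clusters similar sequences first.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import log2

column = "AABACA"

# Count unordered residue pairs over all pairs of rows in the column.
pair_counts = Counter("".join(sorted(pair)) for pair in combinations(column, 2))
total = sum(pair_counts.values())
letters = sorted(set(column))

def q_of(a, b):
    """Observed frequency of the unordered pair ab."""
    return Fraction(pair_counts.get("".join(sorted(a + b)), 0), total)

# p_i = q_ii + sum over j != i of q_ij / 2
p = {a: q_of(a, a) + sum(q_of(a, b) for b in letters if b != a) / 2
     for a in letters}

def expected(a, b):
    return p[a] * p[b] if a == b else 2 * p[a] * p[b]

def log_odds(a, b):
    # Undefined (log of zero) for pairs that were never observed, such as BB here.
    return log2(float(q_of(a, b) / expected(a, b)))

print(dict(pair_counts), total)
print(p["A"], q_of("A", "B"), expected("A", "B"), log_odds("A", "B"))
```

For the pair AB the observed frequency 4/15 is slightly above the expected 2/9, so its log-odds score is a small positive number; AA is observed slightly less often than expected and scores slightly below zero.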

THE BLOSUM Family of Matrices. BLOSUMN is based on sequences that are at most N percent identical. (The log odds matrix for BLOSUM 45)

PAM Versus BLOSUM:
PAM is based on an evolutionary model; BLOSUM is based on protein families.
PAM is based on global alignment; BLOSUM is based on local alignment.
PAM is for tracking the evolutionary origin of proteins; BLOSUM is designed to find their conserved regions.

Other Scoring Matrices
Scoring matrices for sequence alignment can be based on the following criteria:
Genetic code changes - the number of changes required to transform one codon into another.
Chemical property similarity - volume, polarity, ...
Structurally similar protein sequences.
Specific protein family matrices, e.g., trans-membrane proteins.
Matrices that employ neighboring amino acids.

http://www-bioeng.ucsd.edu/research/research_groups/compbio/workshop/

Principles for Protein Similarity Search:
Use BLOSUM 62 or PAM 120 and default gap penalties.
If there are no significant results, use BLOSUM 30 or PAM 250 and lower gap penalties.
Examine results with Expect (E) values between 0.05 and 10 for significance.
Use PSI-BLAST for protein families.

PSI-BLAST - Position Specific Iterated BLAST
Finds more distantly related sequences than FASTA or BLAST.
Upon aligning a group of sequences, the vector of characters in a certain column is called a profile.
Conserved regions - regions that are very similar (have profiles with little variance).
1 2 3 4 5 6 7
S A G S T G H
T A G S T A A
T C G S T C C
GST (columns 3-5) is a conserved region.
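Column profiles for the three example sequences can be computed as simple residue counts. In this sketch "conserved" is taken to mean that every sequence has the same residue in the column, which is a stricter criterion than "little variance" and is an assumption made to keep the example short.

```python
from collections import Counter

alignment = ["SAGSTGH", "TAGSTAA", "TCGSTCC"]

def column_profiles(rows):
    """One Counter of residues per alignment column."""
    return [Counter(col) for col in zip(*rows)]

def conserved_columns(rows):
    """1-based indices of columns in which all sequences share one residue."""
    return [i for i, profile in enumerate(column_profiles(rows), start=1)
            if len(profile) == 1]

cols = conserved_columns(alignment)
region = "".join(alignment[0][i - 1] for i in cols)
print(cols, region)
```

Columns 3 to 5 are the only ones with a single residue, so they form the conserved region of this alignment.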

PSI-BLAST Contd. A protein family contains conserved regions. These define the structure and function typical for this family. We would like the alignment score to consider how conserved a column is. PSI BLAST gives high scores to matches within conserved regions.

Profile Scoring

PSI-BLAST - (Position Specific Iterated BLAST)
An iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching.
Why use PSI-BLAST?
An important tool for predicting both biochemical activity and function.
Identifies weak homologies (distant relatives of a protein that are not found by FASTA or BLAST).
Information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html

How Does PSI-BLAST Work?
1. Compare the query sequence to the database (gapped BLAST).
2. Construct a profile from the significant alignments. Note: a highly conserved position will receive a high score, and weakly conserved positions receive scores near zero.
3. Compare the profile to the database.
4. Repeat steps 2 & 3 ("iterations") until no new significant sequences are found ("convergence").
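The iterate-until-convergence loop can be illustrated with a heavily simplified toy. This is not NCBI's implementation: it assumes ungapped sequences of equal length, a uniform background, a pseudocount of 1, and a raw-score threshold in place of an E-value; all function names and sequences are hypothetical.

```python
from collections import Counter
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (toy profile)."""
    profile = []
    for col in zip(*seqs):
        counts = Counter(col)
        denom = len(col) + pseudocount * len(ALPHABET)
        profile.append({aa: log2((counts[aa] + pseudocount) / denom * len(ALPHABET))
                        for aa in ALPHABET})
    return profile

def score(profile, seq):
    return sum(col[aa] for col, aa in zip(profile, seq))

def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until no new hits appear."""
    hits = {query}
    for _ in range(max_rounds):
        profile = build_profile(sorted(hits))
        new_hits = {s for s in database if score(profile, s) >= threshold} | hits
        if new_hits == hits:
            break  # convergence: the hit set stopped growing
        hits = new_hits
    return hits

database = ["ACDEFK", "ACDEWK", "WYWYWY"]
first_round = score(build_profile(["ACDEFG"]), "ACDEWK")
found = iterate_search("ACDEFG", database, threshold=4.0)
print(first_round, sorted(found))
```

In the first round ACDEWK scores below the threshold against the query alone. Once ACDEFK has been added to the profile, the conserved columns carry more weight and ACDEWK is picked up in the next round, which is the behaviour the steps above describe.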

PSI-BLAST Search:
Hits that are better than the E-value threshold are listed first. These hits are used in forming the profile for subsequent PSI-BLAST iterations.
Hits with E-values worse than the threshold (> 0.005) but better than 10 (default; selected on the query page) are listed further down the page.
Any of the sequences in the list of "Sequences with E-value worse than threshold" can be manually added (by clicking) to the sequences used for generating the PSI-BLAST profile.

www.expasy.org

www.expasy.ch/sprot Databank of protein sequences, for both existing and putative proteins. Hbb human

SPECIAL BLAST PAGES

TaxBLAST: Organism Report Common name Blast name Scientific name BLAST hits are sorted according to the species of the target sequence. All the hits of the same organism will appear together. Within each species, the BLAST hits are sorted by score.

Lineage Report and Taxonomy Report
How closely are the organisms in the BLAST hit list related to the query sequence?
Taxonomy Report: http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/

Other BLAST Options:
RPS-BLAST - a program that compares a protein sequence against the Conserved Domain Database (Smart and Pfam); may provide functional identification.
PHI-BLAST (Pattern Hit Initiated BLAST) - can locate other protein sequences that contain a specified pattern and are homologous to the query protein sequence.

Function - Structure Relationship Protein function depends on the protein 3D structure example: zinc-finger proteins. Protein structure provides insight into protein function. How does a protein fold into its native structure?

Sequence - Structure Relationship Early renaturation experiments have shown that the sequence of the protein is sufficient to determine its structure (Anfinsen, 1973). A major challenge in bio-informatics - Prediction of protein structure from its sequence.