Large-Scale Genomic Surveys

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Measuring the degree of similarity: PAM and blosum Matrix
Sequence Alignment.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Searching Sequence Databases
Heuristic alignment algorithms and cost matrices
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Bioinformatics in Biosophy
An Introduction to Bioinformatics
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu BIOINFORMATICS Structures Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a (last edit.
Bioinformatics Methods for Inheriting Structural and Functional annotations for Gene Sequences Ÿif a related sequence has a known function can you inherit.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Protein Sequence Alignment Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
What is Bioinformatics?
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Bioinformatics Methods for Inheriting Structural and Functional annotations for Gene Sequences if a related sequence has a known function can you inherit.
Sequence Based Analysis Tutorial
Sequence Based Analysis Tutorial
BIOINFORMATICS Sequence Comparison
Alignment IV BLOSUM Matrices
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Searching Sequence Databases
Presentation transcript:

Large-Scale Genomic Surveys Bioinformatics Subtopics Secondary Structure Prediction Fold Recognition Homology Modeling Docking & Drug Design E-literature Function Class- ification Sequence Alignment Expression Clustering Protein Geometry Database Design Gene Prediction Large-Scale Genomic Surveys Protein Flexibility Structure Classification Genome Annotation

Some Specific “Informatics” tools of Bioinformatics Databases NCBI GenBank- Protein and DNA sequence NCBI Human Map - Human Genome Viewer NCBI Ensembl - Genome browsers for human, mouse, zebra fish, mosquito TIGR - The Institute for Genome Research SwissProt - Protein Sequence and Function ProDom - Protein Domains Pfam - Protein domain families ProSite - Protein Sequence Motifs Protein Data Base (PDB) - Coordinates for Protein 3D structures SCOP Database- Domain structures organized into evolutionary families HSSP - Domain database using Dali FlyBase WormBase PubMed / MedLine Sequence Alignment Tools BLAST Clustal MSAs FASTA PSI-Blast Hidden Markov Models 3D Structure Alignments / Classifications Dali VAST PRISM CATH SCOP

Multiple Sequence Alignment (MSA)

Aligning Text Strings and Gaps Raw Data ??? T C A T G C A T T G 2 matches, 0 gaps T C A T G | | C A T T G 3 matches (2 end gaps) T C A T G . | | | . C A T T G 4 matches, 1 insertion T C A - T G | | | | . C A T T G T C A T - G | | | | . C A T T G

Sequence Alignment E-value: Expect Value Each sequence alignment has a “bit score” or “similarity score” (S), a measure of the similarity between the hit and the query; normalized for “effective length" The E-value of the hit is the number of alignments in the database you are searching with similarity score ≥ S that you expect to find by chance; likelihood of the match relative to a pair of random sequences with the same amino acid composition E = 10-50. Much more likely than a random occurrence E = 10-5. This could be an accidental event E = 1. It is easy to find another hit in the database that is as The lower the Expect Value (E_val), the more significant the “hit”

E-value Expect Value Depends on: Similarity Score (Bit Score): Higher similarity score (e.g., high % seq id) corresponds to smaller E-value Length of the query: Since a particular Similarity Score is more easily obtained by chance with a longer query sequence, longer queries correspond to larger E-values Size of the database: Since a larger database makes a particular Similarity Score easier to obtain, a larger database results in larger E-values

Calculating E-val: Raw Score: calculated by counting the number of identities, mismatches, gaps, etc in the alignment Bit Score: Normalizes the “raw score” to provide a measure of sequence similarity that is independent of the scoring system E-value: E = mn2-S where m - “effective length of the query” (accounts for the fact that ends may not line up); n - length of the database (number of residues or bases) S - Bit score

Simple Score S = Total Score S(i,j) = similarity matrix score for aligning residues i and j Sum is carried out over all aligned i and j residues n = number of gaps G = gap penalty Simplest score - for “identity match matrix” S(i,j) = 1 if matches S(I,j) = 0 otherwise

Aligning Text Strings Raw Data ??? T C A T G C A T T G 2 matches, 0 gaps T C A T G | | C A T T G 3 matches (2 end gaps) T C A T G . | | | . C A T T G 4 matches, 1 insertion T C A - T G | | | | . C A T T G T C A T - G | | | | . C A T T G

Dynamic Programming Needleman-Wunsch (1970) provided first automatic method for sequence alignment Dynamic Programming to Find “Best” Global Alignment Test Data ABCNYRQCLCRPM AYCYNRCKCRBP

Step 1 -- Make a Similarity Matrix (Match Scores Determined by Identity Matrix) Put 1's where characters are identical.

Step 2 -- Start Computing the Sum Matrix new_value_cell(R,C) <= cell(R,C) { Old value, either 1 or 0 } + Max[ cell (R+1, C+1), { Diagonally Down, no gaps } cells(R+1, C+2 to C_max),{ Down a row, making col. gap } cells(R+2 to R_max, C+1) { Down a col., making row gap } ]

Step 3 -- Keep Going

Step 4 -- Sum Matrix All Done Alignment Score is 8 matches.

Step 5 -- Traceback Find Best Score (8) and Trace Back A B C N Y - R Q C L C R - P M A Y C - Y N R - C K C R B P

Step 6 -- Alternate Tracebacks A B C - N Y R Q C L C R - P M A Y C Y N - R - C K C R B P

Gap Penalties GAP = a + bN The score at a position can also factor in a penalty for introducing gaps (i. e., not going from i, j to i- 1, j- 1). Gap penalties are often of linear form: GAP = a + bN GAP is the gap penalty a = cost of opening a gap b = cost of extending the gap by one N = length of the gap (e.g. assume b=0, a=1/2, so GAP = 1/2 regardless of length.)

Step 2 -- Computing the Sum Matrix with Gaps new_value_cell(R,C) <= cell(R,C) { Old value, either 1 or 0 } + Max[ cell (R+1, C+1), { Diagonally Down, no gaps } cells(R+1, C+2 to C_max) - GAP ,{ Down a row, making col. gap } cells(R+2 to R_max, C+1) - GAP { Down a col., making row gap } ] GAP =1/2 1.5

Key Idea in Dynamic Programming The best alignment that ends at a given pair of positions (i and j) in the 2 sequences is the score of the best alignment previous to this position PLUS the score for aligning those two positions. An Example Below Aligning R to K does not affect alignment of previous N-terminal residues. Once this is done it is fixed. Then go on to align D to E. How could this be violated? Aligning R to K changes best alignment in box. ACSQRP--LRV-SH RSENCV A-SNKPQLVKLMTH VKDFCV ACSQRP--LRV-SH -R SENCV A-SNKPQLVKLMTH VK DFCV

Substitution Matrices Count number of amino acid identities (or non-identities) Count the minimum number of mutations in the DNA needed to account for the non-identical pairs Measure of similarity based on frequency of mutations observed in homologous protein sequences Measure similarity based on physical properties

Similarity (Substitution) Matrix A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 7 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 6 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 10 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 6 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 Identity Matrix Match L with L => 1 Match L with D => 0 Match L with V => 0?? S(aa-1,aa-2) Match L with L => 1 Match L with D => 0 Match L with V => .5 Number of Common Ones PAM Blossum Gonnet

Where do matrices come from? + —> More likely than random 0 —> At random base rate - —> Less likely than random Where do matrices come from? Manually align protein structures (or, more risky, sequences) 2 Look at frequency of a.a. substitutions at structurally constant sites. -- i.e. pair i - j exchanges 3 Compute log-odds S(aa-1,aa-2) = log2 ( freq(O) / freq(E) ) O = observed exchanges, E = expected exchanges odds = freq(observed) / freq(expected) Sij = log odds freq(expected) = f(i)*f(j) = is the chance of getting amino acid i in a column and then having it change to j e.g. A-R pair observed only a tenth as often as expected 90% AAVLL… AAVQI… AVVQL… ASVLL… 45%

Amino Acid Frequencies of Occurrence          1978        1991 L       0.085       0.091 A       0.087       0.077 G       0.089       0.074 S       0.070       0.069 V       0.065       0.066 E       0.050       0.062 T       0.058       0.059 K       0.081       0.059 I       0.037       0.053 D       0.047       0.052 R       0.041       0.051 P       0.051       0.051 N       0.040       0.043 Q       0.038       0.041 F       0.040       0.040 Y       0.030       0.032 M       0.015       0.024 H       0.034       0.023 C       0.033       0.020 W       0.010       0.014 Amino Acid Frequencies of Occurrence

Different Matrices are Appropriate at Different Evolutionary Distances (Adapted from D Brutlag, Stanford)

Change in Matrix with Ev. Dist. PAM-250 (distant) Change in Matrix with Ev. Dist. PAM-78 (Adapted from D Brutlag, Stanford)

The BLOSUM Matrices BLOSUM62 is the BLAST default Are the evolutionary rates uniform over the whole of the protein sequence? (No.)  The BLOSUM matrices: Henikoff & Henikoff (Henikoff, S. & Henikoff J.G. (1992) PNAS 89:10915-10919) . Use blocks of  sequence fragments from different protein families which can be aligned without the introduction of gaps. Amino acid pair frequencies can be compiled from these blocks Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold of positions are clustered. More sequences are added to the cluster if they are identical to any sequence already in the cluster at the same level. All sequences within a cluster are then simply averaged. (A consequence of this clustering is that the contribution of closely related sequences to the frequency table is reduced, if the identity requirement is reduced. ) This leads to a series of matrices, analogous to the PAM series of matrices. BLOSUM80: derived at the 80% identity level. The BLOSUM Matrices BLOSUM62 is the BLAST default

Blast against Structural Genomic Target Registry (TargetDB) or against latest PDB or PDB-on-hold listings