

Thursday and Friday: Dr Michael Carton (formerly VO’F group, now National Disease Surveillance Centre (NDSC)). Wed (tomorrow) 10am: this suite booked for BLAST searches.

TODAY /bioinformaticsnode/home.html Lots of definitions - don’t worry!! But, later on, look stuff up on Google or Scirus

Remember: Homology: sequences are homologous if they are related by divergence from a common ancestor.

Sequence alignment In order to detect sequence homology we must first align sequences. An alignment is a hypothesis of positional homology between nucleotides/amino acids.

Alignment example Take the case of a hypothetical ancestral sequence (GAATTCGC). Over time mutation may lead to two different forms of this sequence, GAATTCGC and GATTGGC.

Example continued
Alignment without gaps:
  GAATTCGC
  GATTGGC
  ** *
Alignments with gaps:
  GAATTCGC      GAATTC–GC
  GA–TTGGC  or  GA–TT–GGC
  ** ** **      ** **  **

Types of alignment Local: local alignment finds short regions of similarity between a pair of sequences. Global: global alignment attempts to find the optimal alignment over the entire length of the sequences.

Local alignment Finds domains and short regions of similarity between a pair of sequences. The two sequences under comparison do not necessarily need to have high levels of similarity over their entire length in order to receive locally high similarity scores.

Local alignment This feature of local similarity searches gives them the advantage of being useful when looking for domains within proteins or for regions of genomic DNA that contain introns. Local similarity searches do not have the constraint that similarity between two sequences needs to be observed over the entire length of each gene.

Global alignment Finds the optimal alignment over the entire length of the two sequences under comparison. Algorithms of this nature are not particularly suited to the identification of genes that have evolved by recombination or insertion of unrelated regions of DNA. In instances such as this, a global similarity score will be greatly reduced. Where the genes being aligned are of comparable length and are homologous over their entire length (descended from a common ancestor), global alignment works well.
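The difference between the two types of alignment can be sketched with the standard dynamic-programming recurrences (usually credited to Needleman-Wunsch for global and Smith-Waterman for local alignment; the slides do not name them). This is a minimal sketch, not the method used by any particular program: the function name `align_score`, the scores (match +1, mismatch -1, gap -1) and the second pair of test sequences are assumptions chosen for illustration.

```python
def align_score(s, t, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of s and t with a linear gap penalty.

    local=False: global alignment, every base of both sequences is used.
    local=True:  local alignment, the best-scoring pair of subsequences.
    All scoring values are illustrative assumptions.
    """
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are charged along the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,  # align two bases
                       table[i - 1][j] + gap,       # gap in t
                       table[i][j - 1] + gap)       # gap in s
            if local:
                cell = max(cell, 0)  # local: a bad prefix can be dropped
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# The example pair from the slides: similar over the whole length.
print(align_score("GAATTCGC", "GATTGGC"))

# Hypothetical pair sharing only GAATTC, surrounded by unrelated bases.
print(align_score("AAAAGAATTC", "GAATTCTTTT"))              # global
print(align_score("AAAAGAATTC", "GAATTCTTTT", local=True))  # local
```

For the second pair the global score is dragged down by the unrelated flanking bases, while the local score reflects only the shared GAATTC region.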

PROGRAMS USED Local: Blast, Fasta3. Global: Clustalw, Clustalx.

Terminology Exact (Exhaustive): This is a method of looking at all possibilities for a particular problem and then choosing the best one. It is the most rigorous method. Heuristic: This class of methods takes short-cuts and attempts to arrive at a good solution by making educated guesses; it is faster, but the result is not guaranteed to be optimal.

Matrices
Write one sequence horizontally and the other sequence vertically to form a grid:
      T A T T G
    T
    A
    A
    T
    G

Calculating an Alignment Score An alignment’s score is calculated using a scoring matrix, a gap opening penalty and a gap extension penalty.

Scoring an alignment
       A  C  T  G
    A  1
    C  0  1
    T  0  0  1
    G  0  0  0  1

Previous Example
Alignment without gaps:
  GAATTCGC
  GATTGGC
  ** *
Alignments with gaps:
  GAATTCGC      GAATTC–GC
  GA–TTGGC  or  GA–TT–GGC
  ** ** **      ** **  **
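As a rough illustration of how a scoring matrix and the two gap penalties combine, the sketch below scores the two gapped alignments from the example. The identity scores (match 1, mismatch 0) follow the matrix on the previous slide. The gap opening penalty of -2, the gap extension penalty of -1 and the function name `score_alignment` are assumptions, since the slides give no values. Gaps are written as ASCII `-` rather than the dash used on the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-2, gap_extend=-1):
    """Score a finished pairwise alignment column by column.

    Convention assumed here: the first position of a gap costs gap_open,
    every further position of the same gap costs gap_extend.
    """
    score = 0
    gap_row = None  # which row the current run of gaps is in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(score_alignment("GAATTCGC", "GA-TTGGC"))    # 6 matches, 1 gap  -> 4
print(score_alignment("GAATTC-GC", "GA-TT-GGC"))  # 6 matches, 3 gaps -> 0
```

Both alignments contain six matches, so under these assumed penalties the alignment with the single gap scores higher.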

Dotplot Matrix I

Dotplot Matrix II

Noise is caused by matches that have occurred by chance without any homology present. A filter can be used to reduce the noise, e.g. only place a dot when a specified proportion of a small group of successive bases match: a window of 10 is only highlighted if 6 of the 10 bases match. Figure: chimpanzee haemoglobin intergenic DNA plotted against itself, c. 400 bases.

8 out of 10, even less noise
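The window filter described above can be sketched as follows. This is a simplified illustration rather than the code of any real dotplot program: the function name `dotplot`, its parameters and the short test sequence are made up for the example.

```python
def dotplot(seq_x, seq_y, window=10, min_matches=6):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when at least min_matches of the window
    bases starting at seq_x[i] and seq_y[j] are identical.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_x[i + k] == seq_y[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Hypothetical sequence plotted against itself, as on the slides.
seq = "GATTACAGGCCTTAAGGATTACACCGGTTAACC"
loose = dotplot(seq, seq, window=10, min_matches=6)
strict = dotplot(seq, seq, window=10, min_matches=8)
print(len(loose), len(strict))
```

Raising the threshold from 6 to 8 out of 10 can only remove dots, so the stricter plot is always a subset of the looser one, and the main diagonal of a self-comparison always survives.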

IDENTITY DOT PLOT Identity blocks: looks for blocks of perfect identity, which reduces the time required. Figure: chimp and spider monkey DNA, but c. 4,000 bases this time.

Scoring matrix In reality, we know that certain mutations are more likely to have occurred than others. Conservation of the secondary structure of proteins is an important consideration. The mutation of the third base in a codon often results in no change in the amino acid coded for. Observations of alignments of amino acid sequences have been used to calculate the probability of certain substitutions.

Scoring Matrices Scoring matrices tell how similar amino acids are. There are two main sets of scoring matrices: PAM and BLOSUM. PAM is based on evolutionary distances. BLOSUM is based on structure/function similarities.

AA Matrices Assigning a score to each of the 210 possible amino acid pairings (20 identities plus 190 distinct substitutions) has been done by several authors, but two are especially noteworthy. Dayhoff et al. (1978) used amino acid alignments of sequences that were at least 85% identical as a basis for the PAM mutation data matrices.

AA Matrices Henikoff and Henikoff (1992) used several different alignments to produce the BLOSUM matrices. The BLOSUM 62 matrix is derived from aligned blocks in which sequences that are at least 62% identical are clustered and counted as one. It is possibly the most used amino acid substitution matrix and is the default matrix in several applications.

Scoring matrices These have been empirically determined and have been calculated by the direct comparison of related protein sequences. In general, amino acid substitutions that are seen to occur very rarely are given a negative value. Conservative substitutions (e.g., isoleucine for leucine) are given a positive value. Identical matches are also given a positive value.

The bottom line on PAM Frequencies of alignment divided by frequencies of occurrence: the probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.
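Written as a formula, the ratio above is usually turned into a score by taking its logarithm (a log-odds score). The symbols are assumptions introduced here, not taken from the slides: q_ij is the frequency with which i and j are aligned in related sequences, and p_i and p_j are the overall frequencies of occurrence of each amino acid.

```latex
S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}
```

A pair seen more often than chance predicts gets a positive score and a pair seen less often gets a negative one. In published matrices the values are scaled and rounded to integers.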

BLOSUM Matrices BLOSUM is built from distantly related sequences whereas PAM is built from closely related sequences. BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database.

PAM and BLOSUM Running searches with different matrices will help find different sorts of hits. PAM30 will preferentially find homologues that are evolutionarily close. PAM250 will tend to find the long, weak, diffuse matches typical of distantly related proteins. BLOSUM62 is derived from aligned blocks in which sequences at least 62% identical are clustered and counted as one.

Evolutionary Basis of Sequence Alignment 1. Similarity: Quantity that relates to how alike two sequences are. 2. Identity: Quantity that describes how alike two sequences are in the strictest terms. 3. Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.

Evolutionary Basis of Sequence Alignment (Cont. 1) 1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity. 2. Underlined residues are identical. Asterisks and diamond represent those residues that participate in catalysis. Five gaps are placed to optimize the alignment.
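A figure such as "41% identity" can be computed from an alignment along these lines. This is a sketch under one stated assumption: identity is taken as identical columns divided by the total number of alignment columns (other conventions divide by the length of the shorter sequence). The function name `percent_identity` is hypothetical, and the DNA example from earlier slides stands in for the trypsin alignment, which is not reproduced in this transcript.

```python
def percent_identity(top, bottom):
    """Percentage of alignment columns holding the same residue in both rows.

    Both rows must be the same length; gaps are written as '-' and
    never count as identical.
    """
    if len(top) != len(bottom) or not top:
        raise ValueError("rows must be non-empty and of equal length")
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * identical / len(top)


print(percent_identity("GAATTCGC", "GA-TTGGC"))  # 6 of 8 columns -> 75.0
```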

Evolutionary Basis of Sequence Alignment (Cont. 2) Why are there regions of identity? 1) Conserved function: residues participate in the reaction. 2) Structural: residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage). 3) Historical: residues that are conserved solely due to a common ancestor gene.

Sequence Homology Searching Find related sequences in the database

Original BLAST Segment pair: a pair of subsequences of the same length that form an ungapped alignment. BLAST searches for all segment pairs between the query sequence and all of the sequences in the database that score above a certain threshold. HSP: high-scoring segment pair.

Original BLAST HSPs are derived by first finding the pairs that satisfy the threshold (T) condition. The alignment is then extended in both directions until the quality of the alignment drops off dramatically or falls to zero. The HSPs are then sorted according to their score.
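The seed-and-extend idea can be sketched as below. This is a deliberately simplified illustration, not BLAST itself: seeds are exact shared words (real BLAST also accepts similar words scoring above T), and "drops off dramatically" is modelled as the score falling `x_drop` below the best seen so far. The function names, the word length of 3, the scores and the two sequences are all assumptions.

```python
def find_seeds(query, subject, word=3):
    """Return (query_pos, subject_pos) for every exactly shared word."""
    index = {}
    for qi in range(len(query) - word + 1):
        index.setdefault(query[qi:qi + word], []).append(qi)
    return [(qi, si)
            for si in range(len(subject) - word + 1)
            for qi in index.get(subject[si:si + word], [])]


def extend_hit(query, subject, qi, si, word=3,
               match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps; return (score, q_start, q_end)."""
    score = best = word * match
    # Extend to the right of the seed.
    q_end, s_end = qi + word, si + word
    best_q_end = q_end
    while q_end < len(query) and s_end < len(subject):
        score += match if query[q_end] == subject[s_end] else mismatch
        q_end += 1
        s_end += 1
        if score > best:
            best, best_q_end = score, q_end
        elif best - score >= x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    q_start, s_start = qi, si
    best_q_start = qi
    while q_start > 0 and s_start > 0:
        q_start -= 1
        s_start -= 1
        score += match if query[q_start] == subject[s_start] else mismatch
        if score > best:
            best, best_q_start = score, q_start
        elif best - score >= x_drop:
            break
    return best, best_q_start, best_q_end


query, subject = "GAATTCGC", "TTGAATTCAA"  # hypothetical sequences
hsps = sorted({extend_hit(query, subject, qi, si)
               for qi, si in find_seeds(query, subject)}, reverse=True)
print(hsps)  # [(6, 0, 6)]: all four seeds extend to the same HSP, GAATTC
```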

Gapped BLAST The original BLAST suffered from the limitation of not being able to introduce gaps into the alignment. Gapped BLAST is an effort to circumvent this shortcoming. Experience shows that often several ungapped non-overlapping alignments result from a match to a single database entry.

Two-Hit method Find 2 HSPs within a distance m of each other on the same diagonal. Do not attempt an HSP extension unless you find two regions that meet this criterion. Attempt to generate a single gapped alignment in this region.
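A minimal sketch of the two-hit test follows. Its inputs are treated as short word hits given as (query position, subject position) pairs. The interpretation of "distance" as the difference between start positions, the requirement that the two hits do not overlap, and the names `two_hit_pairs`, `word` and `max_dist` (standing in for m) are assumptions made for illustration.

```python
from collections import defaultdict


def two_hit_pairs(hits, word=3, max_dist=10):
    """Return (diagonal, first_pos, second_pos) for qualifying hit pairs.

    Two hits qualify when they lie on the same diagonal
    (subject_pos - query_pos), do not overlap, and start no more
    than max_dist apart.
    """
    by_diagonal = defaultdict(list)
    for qi, si in hits:
        by_diagonal[si - qi].append(qi)
    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for idx, second in enumerate(positions):
            for first in positions[:idx]:
                if word <= second - first <= max_dist:
                    pairs.append((diagonal, first, second))
    return pairs


# Hypothetical hits: four on diagonal 2, one isolated hit on diagonal 20.
hits = [(0, 2), (1, 3), (2, 4), (3, 5), (10, 30)]
print(two_hit_pairs(hits))  # [(2, 0, 3)]
```

Only regions that produce such a pair would go on to the more expensive gapped extension, which is where the saving in time comes from.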

FastA algorithm Is the alignment significant? Could we see an alignment like this purely by chance? What are the statistics involved?

ktups Sequence X: GAATTCGCATC. This 11 base sequence can be divided into six overlapping 6-long segments of DNA: GAATTC AATTCG ATTCGC TTCGCA TCGCAT CGCATC. These are known as ‘ktuples’ (the ktup parameter in Fasta). Sequences in databases are stored in this form.
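The division into ktuples is a sliding window of length k moved along the sequence one base at a time, which gives (length - k + 1) overlapping words. A small sketch, with the function names chosen here for illustration:

```python
def ktuples(seq, k=6):
    """Return every overlapping k-long word in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ktuple_index(seq, k=6):
    """Map each k-long word to the positions where it starts in seq."""
    index = {}
    for position, word in enumerate(ktuples(seq, k)):
        index.setdefault(word, []).append(position)
    return index


print(ktuples("GAATTCGCATC"))
# ['GAATTC', 'AATTCG', 'ATTCGC', 'TTCGCA', 'TCGCAT', 'CGCATC']
```

A lookup table like the one built by `ktuple_index` lets a search program jump straight to the positions where a query word occurs instead of scanning the whole sequence.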

Global Alignment vs. Local Alignment Global alignment is used when the overall gene sequence is similar to another sequence; it is often used in multiple sequence alignment, e.g. the Clustal W algorithm. Local alignment is used when only a small portion of one gene is similar to a small portion of another gene, e.g. BLAST and FASTA.

Different forms of BLAST and FASTA You have a nucleotide sequence and want to compare it with other nucleotide sequences: Blastn, Fasta3.

Different forms of BLAST and FASTA To compare the 6-frame conceptual translation of the nucleotide sequence against a protein database: Blastx, Fastx3, Fasty3.

Different forms of BLAST and FASTA If we translate our nucleotide sequence, we can compare it to the translation of a nucleotide database: tBlastn, tFasty3.
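The "6-frame conceptual translation" mentioned above means reading the DNA in three frames on the forward strand and three on the reverse complement. The sketch below assumes the standard genetic code and uses hypothetical function names. It shows only the idea, not how Blastx or Fastx3 implement it.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("GAATTCGCATC").items():
    print(name, protein)
```

Each of the six protein strings can then be compared with a protein database, which is why a translated search does roughly six times the work of a single protein search.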

Homology Search Tools BLAST (Basic Local Alignment Search Tool) by Stephen Altschul. FASTA by William Pearson. Open a new Word file and 3 web browser windows.