Introduction to bioinformatics


Introduction to bioinformatics: Sequence Alignment, Part 3

What's today? More BLAST: similarity scores for protein sequences, gaps, statistical significance (E-value).

Protein Sequence Alignment. Rule of thumb: proteins are likely homologous if they are at least 25% identical over an alignment longer than 100 residues; DNA sequences are likely homologous if they are at least 70% identical.

Protein Pairwise Sequence Alignment. The alignment tools are similar to the DNA alignment tools: BLASTN for nucleotides, BLASTP for proteins. Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores. Score s(i,j) > 0 if amino acids i and j have similar properties; score s(i,j) ≤ 0 otherwise. How should we score s(i,j)?

The 20 Amino Acids

Chemical Similarities Between Amino Acids. Acids and amides: DENQ (Asp, Glu, Asn, Gln). Basic: HKR (His, Lys, Arg). Aromatic: FYW (Phe, Tyr, Trp). Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr). Hydrophobic: ILMV (Ile, Leu, Met, Val).

Sequence Alignment based on AA similarity

TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

Match line symbols (drawn between the two rows of each block): | = identity, + = similarity
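
The identity and similarity markers in such an alignment can be regenerated from the two aligned rows. The sketch below is only an approximation: it marks `+` when two residues fall in the same chemical group from the "Chemical Similarities" slide, whereas BLAST marks `+` when the substitution matrix score is positive, so the two marker lines can differ. The names `GROUPS`, `GROUP_OF`, and `match_line` are made up for this illustration.

```python
# Chemical groups copied from the "Chemical Similarities" slide.
# Assumption: "same group" stands in for "similar"; BLAST itself uses
# a positive substitution-matrix score instead.
GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def match_line(seq_a, seq_b):
    """Return the marker line for two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            marks.append(" ")   # gap column: no marker
        elif a == b:
            marks.append("|")   # identity
        elif a in GROUP_OF and GROUP_OF.get(a) == GROUP_OF.get(b):
            marks.append("+")   # same chemical group
        else:
            marks.append(" ")
    return "".join(marks)


top = "TQSPSSLSASVGDTVTITCR"
bottom = "TQGKKVVLGKKGDTVELTCT"
print(top)
print(match_line(top, bottom))
print(bottom)
```

Printing the three rows in a monospaced font reproduces the usual query / match line / subject layout.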

Amino Acid Substitution Matrices. When scoring protein sequence alignments, it is common to use a 20 × 20 matrix representing all pairwise comparisons: the substitution matrix.

Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other. Example alignment: M G Y D E / M G Y E E / M G Y Q E. In this column E & D are found 7/8

Amino Acid Matrices. Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i). Entry (i,i) is greater than any entry (i,j), j ≠ i. Entry (i,j): the score of aligning amino acid i against amino acid j.
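
The two matrix properties can be checked mechanically. The sketch below stores only one triangle of a tiny three-letter matrix and looks scores up symmetrically. The numbers in `TOY_SCORES` are illustrative values chosen for this example, not a real 20 × 20 matrix, and the helper names are hypothetical.

```python
# Toy scores for three amino acids (illustrative values only).
# Only one triangle is stored; symmetry comes from the lookup.
TOY_SCORES = {
    ("D", "D"): 4, ("D", "E"): 3, ("D", "W"): -7,
    ("E", "E"): 4, ("E", "W"): -7,
    ("W", "W"): 17,
}


def score(i, j):
    """Symmetric lookup: score(i, j) == score(j, i)."""
    return TOY_SCORES[tuple(sorted((i, j)))]


def diagonal_dominates(letters):
    """Check that entry (i,i) exceeds every entry (i,j) with j != i."""
    return all(
        score(i, i) > score(i, j)
        for i in letters
        for j in letters
        if i != j
    )


print(score("E", "D"), score("D", "E"))
print(diagonal_dominates("DEW"))
```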

PAM - Point Accepted Mutations. Developed by Margaret Dayhoff, 1978. Analyzed very similar protein sequences: the proteins are evolutionarily close, so alignment is easy. Point mutations: mainly substitutions. Accepted mutations: accepted by natural selection. Used global alignment. Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j). Found that the common substitutions involved chemically similar amino acids.

PAM 250 Similar amino acids are close to each other. Regions define conserved substitutions.

Example: Asp & Glu. Score = 3. [Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]

Selecting a PAM Matrix. Low PAM numbers: short sequences, strong local similarities. High PAM numbers: long sequences, weak similarities. PAM120 recommended for general use (40% identity). PAM60 for close relations (60% identity). PAM250 for distant relations (20% identity). If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended.

BLOSUM: Blocks Substitution Matrix. Steven and Jorja G. Henikoff (1992). Based on the BLOCKS database (www.blocks.fhcrc.org). Families of proteins with identical function. Highly conserved protein domains. Ungapped local alignment to identify motifs. Each motif is a block of local alignment. Counts amino acids observed in the same column. Symmetrical model of substitution. [Figure: example blocks of ungapped aligned sequences]
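
To make the column-counting idea concrete, here is a minimal sketch that counts residue pairs in each column of one ungapped block and turns them into log-odds scores (observed pair frequency over the frequency expected by chance, in half-bit units). It deliberately leaves out steps of the real BLOSUM procedure, such as clustering sequences above the identity threshold and rounding to integers. The toy block and the function names are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Half-bit log-odds score for every residue pair seen in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] * background[b]
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


toy_block = ["MGYDE", "MGYEE", "MGYQE"]  # hypothetical three-sequence block
for pair, value in sorted(log_odds(toy_block).items()):
    print(pair, round(value, 2))
```

Pairs that co-occur in a column more often than chance predicts get positive scores; pairs never observed get no entry here, whereas a real matrix built from many blocks assigns them negative scores.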

BLOSUM Matrices. Different BLOSUMn matrices are calculated independently from BLOCKS. BLOSUMn is based on sequences that are at most n percent identical.

Selecting a BLOSUM Matrix. For BLOSUMn, a higher n is suitable for sequences that are more similar. BLOSUM62 recommended for general use. BLOSUM80 for close relations. BLOSUM45 for distant relations.

Summary: BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of sequences, without gaps. PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

Gap Scores. The example showed a score of -1 per indel, so the gap cost is proportional to its length. Biologically, indels occur in groups, and we want our gap score to reflect this. Standard solution: the affine gap model, with a one-off cost for opening a gap and a lower cost for extending it. This requires changes to the alignment algorithm.
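
The difference between the two gap models can be sketched in a few lines. The convention assumed here is that the first gap position pays the opening cost and each further position pays the extension cost; some tools instead charge open + length × extend, so check the convention of the program you use. The function names and the example values (open = 14, extend = 2) are illustrative.

```python
def linear_gap_cost(length, per_position=1):
    """Linear model: every gap position costs the same."""
    return per_position * length


def affine_gap_cost(length, gap_open, gap_extend):
    """Affine model: the first position pays gap_open, the rest pay gap_extend."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One indel of length 4 versus four separate indels of length 1.
one_long_gap = affine_gap_cost(4, gap_open=14, gap_extend=2)
four_short_gaps = 4 * affine_gap_cost(1, gap_open=14, gap_extend=2)
print(one_long_gap, four_short_gaps)
```

Under the affine model a single long gap is much cheaper than the same number of gap positions scattered across the alignment, which is the grouping behaviour the slide asks for.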

Scoring system = Substitution Matrix + Gap Penalty

Gap penalty. We expect to penalize gaps, with separate scores for gap opening and for gap extension. Insertions and deletions are rare in evolution, but once they are created, they are easy to extend, so gap-extension penalty < gap-open penalty. Default gap parameters are given for each matrix: PAM30: open=9, extension=1; PAM250: open=14, extension=2.
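
Putting the two parts of the scoring system together, the score of a finished alignment is the sum of the substitution scores minus the gap penalties. The sketch below assumes that the first position of a gap costs `gap_open` and each following position costs `gap_extend`. The `identity_sub` function (+2 / -1) is a stand-in for a real substitution matrix, and all names are hypothetical.

```python
def alignment_score(seq_a, seq_b, sub, gap_open, gap_extend):
    """Score two aligned, equal-length sequences; '-' marks a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_in = None  # which sequence the currently open gap is in, if any
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score -= gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            score += sub(a, b)
            gap_in = None
    return score


def identity_sub(a, b):
    """Placeholder substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(alignment_score("MGYD--E", "MGYEQQE", identity_sub, gap_open=9, gap_extend=1))
```

Swapping `identity_sub` for a lookup into a PAM or BLOSUM matrix gives the protein scoring system described on the slides.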

Low Complexity Sequences. Examples: AAAAAAAAAAA, ATATATATATATA, CAGCAGCAGCAG. Low-complexity sequences can produce significant hits that are not true homologues. How does BLAST deal with low-complexity sequences? By default, low-complexity regions are filtered out and replaced by XXXXX.
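
The idea of filtering can be illustrated with a sliding-window entropy mask. This is not the filter BLAST actually runs (BLAST uses dedicated programs such as SEG for proteins and DUST for nucleotides); it is a simplified stand-in. The window length, the threshold of 2.2 bits (chosen with a 20-letter protein alphabet in mind), and the function names are all assumptions of this sketch.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy of the letter composition of a window, in bits."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("CAGCAGCAGCAG"))
print(mask_low_complexity("MVHLTPEEKTAV"))
```

Sequences shorter than the window are returned unchanged, and a nucleotide sequence would need a lower threshold because four letters can carry at most 2 bits per position.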

Statistical significance

E-value. The number of hits (with at least the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size. The lower bound is normally 0 (we want to find the best). A higher E-value means lower similarity. “Sequences with E-value of less than 0.01 are almost always found to be homologous.”

Expectation Values. The E-value increases linearly with the length of the query sequence, decreases exponentially with the score of the alignment, and increases linearly with the length of the database.
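
These three dependencies correspond to the Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length, and S the raw alignment score. The sketch below evaluates that formula. The values used for `lam` and `k` are illustrative placeholders (the real ones depend on the scoring matrix and gap costs), and BLAST additionally corrects m and n to "effective" lengths, which is omitted here.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only; not tied to any particular matrix.
LAM, K = 0.267, 0.041

base = expect_value(100, query_len=250, db_len=1_000_000, lam=LAM, k=K)
longer_query = expect_value(100, query_len=500, db_len=1_000_000, lam=LAM, k=K)
higher_score = expect_value(110, query_len=250, db_len=1_000_000, lam=LAM, k=K)

print(longer_query / base)   # doubling the query length doubles E
print(higher_score / base)   # a higher score shrinks E exponentially
```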

Bit score (S): similar to the alignment score, but normalized; higher means more significant. E value: number of hits of score ≥ S expected by chance; based on a random database of similar size; lower means more significant; used to assess the statistical significance of the alignment.
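
The normalization behind the bit score can be written out as S' = (λS - ln K) / ln 2, after which the E value depends only on the bit score and the search-space size: E = m · n · 2^(-S'). The sketch below shows both steps. The parameter values are illustrative placeholders and the function names are made up for this example.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """E = m * n * 2^(-S'), independent of the scoring system used."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only.
bits = bit_score(100, lam=0.267, k=0.041)
print(round(bits, 1))
print(e_value_from_bits(bits, query_len=250, db_len=1_000_000))
```

Because λ and K are folded into the bit score, bit scores from searches with different matrices can be compared directly, which raw scores cannot.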

Remote homologues: PSI-BLAST. Sometimes BLAST isn’t enough: for a large protein family, BLAST only gives close members, and we want more distant members. Solution: PSI-BLAST.

PSI-BLAST: Position-Specific Iterated BLAST. Steps: regular BLAST; construct profile from BLAST results; BLAST profile search; final results.
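
The "construct profile" step can be sketched as building a position-specific scoring matrix from the aligned hits of the previous round. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length, a uniform background of 1/20 per amino acid, and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and matrix-derived pseudocounts. All names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """One log-odds column per alignment position (uniform background assumed)."""
    background = 1 / len(ALPHABET)
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(aa for aa in column if aa in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def profile_score(profile, segment):
    """Score a segment (standard amino acids only) against the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


hits = ["MGYDE", "MGYEE", "MGYQE"]  # toy hits from a first search round
profile = build_profile(hits)
print(round(profile_score(profile, "MGYEE"), 2))
print(round(profile_score(profile, "WWWWW"), 2))
```

In each iteration the profile replaces the single query sequence, so positions that are conserved across the family weigh more than variable ones.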

PSI-BLAST. Advantage: PSI-BLAST looks for sequences that are close to ours and learns from them to extend the circle of friends. Disadvantage: if we pick up a WRONG sequence, we will get to unrelated sequences, and this gets worse with each iteration.