Introduction to Bioinformatics: From Pairwise to Multiple Alignment


Outline
Advances in BLAST
Multiple Sequence Alignment: CLUSTAL

Scoring system for BLAST: Substitution Matrix + Gap Penalty

Substitution Matrix
BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of sequences, aligned without gaps.
PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

Gap penalty
The earlier example scored -1 per indel
–So the cost of a gap is proportional to its length
Biologically, indels occur in groups
–We want our gap score to reflect this
Standard solution: affine gap model
–Once-off cost for opening a gap
–Lower cost for extending the gap
–Requires changes to the alignment algorithm
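To make the difference between the two gap models concrete, here is a minimal sketch. The penalty values (`gap_open=-5`, `gap_extend=-1`) are illustrative assumptions, not values from the slides, and the convention that the opening cost already covers the first gap position is also an assumption (some tools charge open + extend for the first position).

```python
def linear_gap_cost(length, per_indel=-1):
    """Linear model: every gap position costs the same (the -1 per indel example)."""
    return per_indel * length


def affine_gap_cost(length, gap_open=-5, gap_extend=-1):
    """Affine model: a once-off opening cost, then a lower cost per extra position.

    gap_open and gap_extend are hypothetical values chosen for illustration.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of length 4 versus four separate gaps of length 1.
one_long_gap = affine_gap_cost(4)          # -5 + 3 * -1 = -8
four_short_gaps = 4 * affine_gap_cost(1)   # 4 * -5 = -20
```

Under the affine model a single long gap is penalised less than the same number of scattered indels, which is the grouping behaviour the slide asks for; under the linear model both cases cost the same.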

Statistical significance

E-value
The number of hits with a similarity score at least this good that one can "expect" to see just by chance when searching with the given query in a database of a particular size.
A higher E-value indicates a less significant similarity
–"Sequences with E-value of less than 0.01 are almost always found to be homologous"
The lower bound is 0; the closer to 0, the better the hit

Expectation Values
Increase linearly with the length of the query sequence
Increase linearly with the length of the database
Decrease exponentially with the score of the alignment
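These three dependencies are what the Karlin-Altschul formula E = K * m * n * exp(-lambda * S) expresses, where m is the query length, n the database length and S the alignment score. The sketch below only illustrates the shape of that relationship; the default `k` and `lam` values are placeholders (assumptions), since the real constants depend on the scoring system.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S).

    k and lam are placeholder constants for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=100, db_len=1_000_000)
longer_query = expected_hits(score=50, query_len=200, db_len=1_000_000)
bigger_db = expected_hits(score=50, query_len=100, db_len=2_000_000)
higher_score = expected_hits(score=60, query_len=100, db_len=1_000_000)
```

Doubling the query length or the database size doubles the E-value, while raising the score by 10 shrinks it by a factor of exp(10 * lam).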

Remote homologues
Sometimes BLAST isn't enough: for a large protein family, BLAST returns only the close members.
We want the more distant members as well: PSI-BLAST

Position Specific Iterated BLAST
Regular BLAST search
Construct a profile from the BLAST results
Search the database again with the profile (repeat as needed)
Final results

PSI-BLAST
Advantage: PSI-BLAST looks for sequences that are close to ours and learns from them to extend the circle of friends.
Disadvantage: if a WRONG sequence is picked up, the search drifts to unrelated sequences, and this gets worse with each iteration.

Multiple Sequence Alignment (MSA)

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment, BUT compares n sequences instead of 2
Rows represent individual sequences
Columns represent the 'same' position
There may be gaps in some sequences

Why multiple alignments?
BLAST usually returns many sequences that are significantly similar to the query sequence.
In practice, comparing each sequence to every other may be impractical when the number of sequences is large.
Solution: generate a profile
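One simple way to picture a profile is as a table of residue frequencies per alignment column. The sketch below assumes that definition (real profiles, such as the position-specific matrices PSI-BLAST builds, add weighting and pseudocounts that are not shown here) and uses four of the aligned immunoglobulin fragments from the MSA slide as input. The function name `build_profile` is hypothetical.

```python
from collections import Counter

aligned = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
]


def build_profile(rows):
    """Return one dict per column mapping each symbol to its relative frequency."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(rows) for sym, n in counts.items()})
    return profile


profile = build_profile(aligned)
fully_conserved = [i for i, col in enumerate(profile) if len(col) == 1 and "-" not in col]
```

A new sequence can then be compared against this single profile instead of against every family member separately.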

MSA
MSA can give you a better picture of functional sites on proteins and nucleic acids, as well as of the forces that shape evolution!
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
Important amino acids or nucleotides are not allowed to mutate
Less important positions change more easily

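Alignment Example
Column score, by how many of the 4 sequences share the most common residue: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
Sequences stacked without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 2* *0.5 Score=8
Sequences aligned with gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 11*0.75 2*0.5 Score=13.25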
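The column scoring rule can be checked with a few lines of code. This is a sketch under two assumptions that are not spelled out on the slide: gap characters are not counted as a residue, and a column whose most common residue appears only once scores 0 (the "1/4 = 0" case). With those assumptions the gapped alignment reproduces the score of 13.25.

```python
from collections import Counter

gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]


def column_score(column):
    """Fraction of rows sharing the most common residue; 0 if no residue repeats."""
    residues = [c for c in column if c != "-"]  # assumption: gaps do not count
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return top / len(column) if top > 1 else 0.0


def alignment_score(rows):
    """Sum of column scores over an alignment whose rows have equal length."""
    return sum(column_score(col) for col in zip(*rows))


score = alignment_score(gapped)  # 4*1 + 11*0.75 + 2*0.5 = 13.25
```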

Example of 3 sequences:

Dynamic Programming
Pairwise A–B alignment table
–Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
–Complexity: length of A × length of B
3-way A–B–C alignment table
–Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
–Complexity: length A × length B × length C
Example: protein family alignment
–100 proteins, 1000 amino acids each
–Complexity: 1000^100 = 10^300 table cells
–Calculation time: beyond the big bang!
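The pairwise table can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions and the gap model is the simple linear one; the helper `naive_table_cells` is a hypothetical name used only to show how quickly the table grows when the same idea is extended to many sequences.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the pairwise table: cell (i, j) holds the best score for a[:i] vs b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap          # a[:i] aligned against nothing
    for j in range(1, cols):
        table[0][j] = j * gap          # b[:j] aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


def naive_table_cells(num_sequences, length):
    """Approximate cell count of a full n-dimensional table (one axis per sequence)."""
    return length ** num_sequences


pair_cells = naive_table_cells(2, 1000)       # 1,000,000 cells: easy
family_cells = naive_table_cells(100, 1000)   # 10**300 cells: hopeless
```

The pairwise case stays at a million cells for two 1000-residue proteins, whereas the 100-protein example needs 10^300, which is why the heuristics on the following slides are used instead.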

Feasible Approach
Based on pairwise alignment scores
–Build n by n table of pairwise scores
Align similar sequences first
–After alignment, consider as single sequence
–Continue aligning with further sequences

–For n sequences, there are n × (n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

Similar sequences aligned in pairs:
1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC
Pairwise alignments joined:
1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C

CLUSTAL method
Higgins and Sharp 1988
–ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
Progressive Sequence Alignment

First step:
Compute the pairwise alignments for all against all
The similarities are stored in a table
(Table: pairwise similarity scores for sequences A, B, C, D)
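A minimal sketch of this first step, assuming a simple global alignment score (match +1, mismatch -1, gap -1) as the similarity measure; CLUSTAL's actual pairwise scoring is not reproduced here. The four short sequences and the names A to D are toy inputs chosen for illustration.

```python
from itertools import combinations


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only the previous table row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_table(seqs):
    """Map every unordered pair of names to the alignment score of its sequences."""
    return {
        (x, y): alignment_score(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }


toy = {"A": "NYLS", "B": "NKYLS", "C": "NFS", "D": "NFLS"}  # hypothetical input
table = pairwise_table(toy)
```

For n sequences the table has n × (n-1)/2 entries, so the four toy sequences give six pairs.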

Second step:
Cluster the sequences to create a tree (the guide tree)
–Represents the order in which pairs of sequences are to be aligned
–Similar sequences are neighbors in the tree
–Distant sequences are distant from each other in the tree
(Figure: tree built from the similarity table, with leaves A, D, C, B)
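The clustering step can be sketched with a simple greedy average-linkage procedure: repeatedly merge the two clusters with the highest average similarity. This is a stand-in chosen for brevity, not necessarily the tree-building method CLUSTAL uses, and the similarity values below are hypothetical numbers, not the ones from the slide's table.

```python
def guide_tree(names, similarity):
    """Greedy average-linkage clustering; returns the tree as nested tuples."""

    def sim(x, y):
        return similarity[(x, y)] if (x, y) in similarity else similarity[(y, x)]

    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                left, right = clusters[i][1], clusters[j][1]
                average = sum(sim(x, y) for x in left for y in right) / (len(left) * len(right))
                if best is None or average > best[0]:
                    best = (average, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical similarity scores for four sequences.
scores = {
    ("A", "B"): 9, ("A", "C"): 2, ("A", "D"): 1,
    ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 7,
}
tree = guide_tree(["A", "B", "C", "D"], scores)  # (("A", "B"), ("C", "D"))
```

Reading the nested tuples from the inside out gives the alignment order: A with B, C with D, then the two resulting alignments with each other.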

Join alignments
N Y L S and N K Y L S give N K/- Y L S
N F S and N F L S give N F L/- S
N K/- Y L S and N F L/- S give N K/- Y/F L/- S
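The "K/-" notation summarises which symbols occur in each column of an alignment. The sketch below reproduces that notation from already-gapped rows; the placement of the gaps in the rows (for example `N-YLS`) is an assumption consistent with the summaries on the slide, and `column_summary` is a hypothetical helper name.

```python
def column_summary(rows):
    """Summarise each column as its distinct residues joined by '/', gap last."""
    parts = []
    for column in zip(*rows):
        seen = []
        for symbol in column:
            if symbol != "-" and symbol not in seen:
                seen.append(symbol)
        if "-" in column:
            seen.append("-")
        parts.append("/".join(seen))
    return " ".join(parts)


first_pair = column_summary(["N-YLS", "NKYLS"])    # "N K/- Y L S"
second_pair = column_summary(["NF-S", "NFLS"])     # "N F L/- S"
joined = column_summary(["N-YLS", "NKYLS", "N-F-S", "N-FLS"])  # "N K/- Y/F L/- S"
```

Joining the two pairwise alignments forces a gap column into the second pair so that both groups share the same columns.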

Treating Gaps in ClustalW
Penalty for opening a gap and an additional penalty for extending it
Gaps found in the initial alignments remain fixed
New gaps are introduced as more sequences are added (decreased penalty if a gap already exists)
Gap penalties are decreased within stretches of hydrophilic residues

MSA Approaches
Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign
Statistical methods: Hidden Markov Models. SAM2K
Genetic algorithm: SAGA