MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.


Outline
- Pairwise alignment
  - Global/local, scoring
  - BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
- Multiple sequence alignment
  - ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
- Phylogenetic analysis and tree construction
  - BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
- HMM
  - Protein family profiles

Alignment
- Insert spaces (gaps) at arbitrary locations so that both sequences have the same length and no two spaces fall in the same position.
- Find the arrangement of the two sequences that identifies regions of similarity.

Alignment methods: Dot plots
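
A dot plot can be sketched in a few lines of Python. This is only an illustration, not the tool used in the workshop: the function name `dot_plot`, the `window` parameter and the example sequences are assumptions made here. A cell is marked when the windows starting at that pair of positions are identical, so shared regions show up as diagonal runs.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return one text row per position of seq_b; '*' marks an exact window match.

    Hypothetical helper for illustration: columns follow seq_a, rows follow seq_b.
    """
    rows = []
    for i in range(len(seq_b) - window + 1):
        row = ""
        for j in range(len(seq_a) - window + 1):
            same = seq_b[i:i + window] == seq_a[j:j + window]
            row += "*" if same else " "
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up sequences; the shared stretch "VILS" appears as a diagonal line.
    for line in dot_plot("TRIVILSPE", "AVILSTR"):
        print("|" + line + "|")
```

Raising `window` above 1 removes isolated single-residue matches and leaves only longer diagonals.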

Global vs local alignment
- Global alignment: an alignment that assumes the two sequences are basically similar over their entire length
- Local alignment: an alignment that searches for segments of the two sequences that match well
- It may seem that one should always use local alignments. However, each has its applications.
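
The difference between the two modes can be seen in a small dynamic-programming sketch. It follows the usual Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty; the function name `alignment_score`, the scoring values (match +1, mismatch -1, gap -2) and the test sequences are assumptions chosen for illustration, not values from the slides.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b (scores only, no traceback).

    local=False: global alignment, every residue of both sequences is used.
    local=True:  local alignment, cells are floored at 0 and the best cell wins.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    short, long_ = "AGTA", "CCCCAGTACCCC"
    print("global:", alignment_score(short, long_))             # pays for 8 gaps
    print("local: ", alignment_score(short, long_, local=True))  # only the shared segment
```

With these made-up sequences the global score is dominated by gap penalties, while the local score reflects only the shared segment, which is why each mode has its own use.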

Substitution matrices

Scoring an alignment

Global alignment
S1 = HGSAQVKGHG
S2 = KTEAEMKASEDLKKHGT

KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
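
The two aligned rows above can be scored column by column. The sketch below is a simplification: it uses a flat match/mismatch/gap scheme (+1, -1, -2, all assumed here) instead of a substitution matrix, and the function name `score_alignment` is hypothetical. It also checks the two conditions from the definition of an alignment: equal row length and no column with two gaps.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score a finished pairwise alignment, one column at a time."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column may not contain two gaps")
        if x == "-" or y == "-":
            score += gap          # residue aligned to a gap
        elif x == y:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score


if __name__ == "__main__":
    # Rows copied from the slide: 4 matches, 6 mismatches, 8 gap columns.
    print(score_alignment("KTEAEMKAESEDLKKHGT", "--HG--SA--Q-VKGHG-"))  # -18
```

Replacing the flat match/mismatch values with a lookup in a substitution matrix would give the kind of score real tools report.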

Local Alignment

How BLAST works
- BLAST uses pre-indexed databases
  - It remembers the location of every 'word' of each database entry
- Identify High-scoring Segment Pairs (HSPs)
  - Default word lengths: 11 bp (nucleotide) or 3 aa (protein)
  - When two non-overlapping words within a certain distance of each other in the query are matched against a database entry, the region of the two sequences is called a segment pair.
  - Slide the query and target sequences across each other until the maximum number of HSPs for that target is found
  - Each segment pair is extended until the score drops by X below its maximum value
- Score the alignment
  - A scoring matrix is used
  - Gaps introduced between HSPs during sliding get a negative score
  - A match gets a positive score
  - The total alignment score is subjected to statistical analysis to estimate how likely such a score is to arise by chance
- Repeat for every sequence in the database
- Return total results

How BLAST works (example)

Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG

Common 3mer, then extend: GCQSQCGG

HSP
Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
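
The seed-and-extend idea can be sketched with the query and subject from the example above. This is a heavily simplified illustration, not BLAST itself: it indexes exact 3-letter words only, extends without gaps, and scores with an assumed +2/-1 match/mismatch scheme instead of a substitution matrix. The function names and the `x_drop` value are hypothetical.

```python
from collections import defaultdict

QUERY = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
SUBJECT = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"


def index_words(seq, k=3):
    """Pre-index a database entry: word -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend(query, subject, qpos, spos, k, match=2, mismatch=-1, x_drop=5):
    """Ungapped extension of one seed in both directions.

    Each direction stops once the running score is x_drop below its maximum.
    Returns (score, query_start, query_end_exclusive, subject_start).
    """
    def walk(qi, si, step):
        score = best = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score >= x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(qpos + k, spos + k, +1)
    left_score, left_len = walk(qpos - 1, spos - 1, -1)
    score = k * match + left_score + right_score
    return score, qpos - left_len, qpos + k + right_len, spos - left_len


def best_hsp(query, subject, k=3):
    """Look up every query word in the subject index and keep the best extension."""
    index = index_words(subject, k)
    best = None
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hsp = extend(query, subject, qpos, spos, k)
            if best is None or hsp[0] > best[0]:
                best = hsp
    return best


if __name__ == "__main__":
    score, q_start, q_end, s_start = best_hsp(QUERY, SUBJECT)
    length = q_end - q_start
    print(score)
    print(QUERY[q_start:q_end])
    print(SUBJECT[s_start:s_start + length])
```

Because the scoring here is much cruder than a real substitution matrix, the segment boundaries will not necessarily match the HSP reported by BLAST for the same pair.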

Types of BLAST

Nucleic acid sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                                Database
blastn    nucleic acid                         nucleic acid sequences
blastp    protein                              protein sequences
blastx    nucleic acid (6-frame translation)   protein sequences
tblastn   protein                              nucleic acid sequences (6-frame translation)
tblastx   nucleic acid (6-frame translation)   nucleic acid sequences (6-frame translation)
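
The 6-frame translation used by the translated searches can be sketched as follows. The code assumes the standard genetic code, built from the conventional TCAG-ordered amino acid string; the function names are hypothetical, and incomplete trailing codons are simply dropped. The example input is the nucleic acid sequence from the slide.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', '*' is a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("atcgatatatatagactgactgact").items():
        print(name, protein)
```

A translated search then compares each of these six protein strings, rather than the nucleotide sequence, against the other side of the search.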


Exact multiple alignment by dynamic programming
- Complexity = O(N^S · 2^S · S^2)
  - N: length of sequences
  - S: number of sequences
- Only feasible for 4-5 sequences max.
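
A quick calculation shows why exact multiple alignment stops being practical so early. The sketch assumes the cost grows as N^S · 2^S · S^2 (N = sequence length, S = number of sequences), which is how the complexity on this slide is read here; the function name and the sequence length of 300 are made up for illustration.

```python
def dp_operations(n, s):
    """Rough operation count for exact multiple alignment: n^s * 2^s * s^2."""
    return n ** s * 2 ** s * s ** 2


if __name__ == "__main__":
    # Hypothetical protein length of 300 residues.
    for s in range(2, 7):
        print(f"{s} sequences: {dp_operations(300, s):.2e} operations")
```

Each additional sequence multiplies the work by roughly 2N, so the count grows by several orders of magnitude per sequence.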


Neighbor Joining

Unrooted NJ tree

Comparison of Multiple sequence alignment programs

Primary sequence changes:

Profiles
Sequence CGGSV:
P = 0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
ln P = ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
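
The profile calculation above can be reproduced in a few lines. Only the five probabilities that appear on the slide are included; the data structure (one dictionary per column), the variable name `PROFILE` and the function names are assumptions made for this sketch, and a real profile would hold a probability for every residue in every column.

```python
import math

# One dict per profile column; only the values shown on the slide are filled in.
PROFILE = [{"C": 0.8}, {"G": 0.4}, {"G": 0.8}, {"S": 0.6}, {"V": 0.2}]


def profile_probability(seq, profile):
    """Product of the per-column probabilities of each residue."""
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column.get(residue, 0.0)  # residues not listed get probability 0 here
    return p


def profile_log_score(seq, profile):
    """Sum of natural logs; avoids underflow when profiles get long."""
    return sum(math.log(column[residue]) for residue, column in zip(seq, profile))


if __name__ == "__main__":
    print(round(profile_probability("CGGSV", PROFILE), 3))  # 0.031
    print(round(profile_log_score("CGGSV", PROFILE), 2))    # -3.48
```

Working in log space turns the product into a sum, which is why the slide gives both forms.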

Hidden Markov Models
Assumptions:
- Observations are ordered
- The random process can be represented by a stochastic finite state machine with emitting states

Probabilistic parameters of a Hidden Markov Model:
- x: states
- y: possible observations
- a: state transition probabilities
- b: output/emission probabilities

HMM estimation, usage & applications
Training/Estimation
- Feed an architecture (given in advance) a set of observation sequences
- The training process iteratively alters the model's parameters to fit the training set
- The trained model assigns high probabilities to the training sequences
Usage
- Evaluate the probability of an observation sequence given the model (Forward)
- Find the most likely path through the model for a given observation sequence (Viterbi)
Applications
- Gene finding
- Protein family modeling
- …
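
The two usage modes can be sketched on a toy model. Everything numeric below is made up for illustration: the two states, the start, transition (a) and emission (b) probabilities, and the three-letter alphabet are assumptions, not a trained model. `forward` sums over all state paths, while `viterbi` keeps only the best one.

```python
STATES = ("match", "background")  # hypothetical hidden states (x)
START = {"match": 0.5, "background": 0.5}
TRANS = {  # a: state transition probabilities
    "match": {"match": 0.8, "background": 0.2},
    "background": {"match": 0.3, "background": 0.7},
}
EMIT = {  # b: emission probabilities over the observations (y)
    "match": {"C": 0.7, "G": 0.2, "A": 0.1},
    "background": {"C": 0.2, "G": 0.3, "A": 0.5},
}


def forward(obs):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for y in obs[1:]:
        alpha = {
            s: EMIT[s][y] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most likely state path and its joint probability with the observations."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for y in obs[1:]:
        new = {}
        for s in STATES:
            # Best predecessor state for s at this step.
            p, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda t: t[0],
            )
            new[s] = (p * EMIT[s][y], path + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])


if __name__ == "__main__":
    observations = "CCGAA"
    print("P(observations):", forward(observations))
    prob, path = viterbi(observations)
    print("best path:", path, "with probability", prob)
```

For long sequences both functions would normally work with log probabilities or scaling to avoid underflow; that is left out here to keep the recurrences visible.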

Profile HMMs
- Families of functional biological sequences
  - Primary sequences have diverged due to evolution, while maintaining structure/function.
- Questions:
  - Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  - Given a set of sequences, find more sequences of the same family


Trade-offs

Advantages        Disadvantages
Statistics        State independence
Modularity        Over-fitting
Transparency      Local maxima
Prior knowledge   Speed

Questions?