Tutorial 2: Some problems in bioinformatics

Presentation transcript:

Tutorial 2: Some problems in bioinformatics
1. Alignment
   Pairs of sequences
   Database searching for sequences
   Multiple sequence alignment
   Protein classification
2. Phylogeny prediction (tree construction)
Sources: 1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount, Cold Spring Harbor Press; 2) NCBI tutorial; 3) Brian Fristensky, Univ. of Manitoba

Alignment: pairs of sequences
DNA: A, G, C, T
Protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
    KQTGKG      TCGCA
    |  |||      || ||
    KSAGKG      TC-CA

DNA to RNA to protein to phenotype

Alignment: pairs of sequences
Concepts: similarity, identity, homology, orthology, paralogy
    KQTGKGV
    |  |||:
    KSAGKGL
4/7 identical, 5/7 similar
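The identity and similarity counts on this slide can be reproduced with a short sketch. The slide does not say which residue pairs count as "similar", so the groups below are an assumption made only for illustration; a real analysis would usually take similarity from a substitution matrix.

```python
# Hypothetical conservative-substitution groups (an assumption, not from the slide).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("NQ"), set("ST")]


def is_similar(a, b):
    """True if a and b are identical or fall in the same group above."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(seq1, seq2):
    """Count identical and similar columns in two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    # Columns containing a gap count as neither identical nor similar.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return identical, similar, len(seq1)


identical, similar, length = identity_and_similarity("KQTGKGV", "KSAGKGL")
print(f"{identical}/{length} identical, {similar}/{length} similar")
```

With these groups only the final V/L column is similar without being identical, which gives the 4/7 and 5/7 figures shown above.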

Homology is based on evolutionary history

Figure 45 Lineage-specific expansions of domains and architectures of transcription factors. Top, specific families of transcription factors that have been expanded in each of the proteomes. Approximate numbers of domains identified in each of the (nearly) complete proteomes representing the lineages are shown next to the domains, and some of the most common architectures are shown. Some are shared by different animal lineages; others are lineage-specific.

A partial alignment of globin sequences. Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.

Homology, orthology and paralogy
Orthologs diverged at a speciation event.
Paralogs diverged at a gene duplication event.
- Fitch, W.M. Homology: A personal view of some of the problems. Trends Genet. 16:

Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
    GKG-RRWDAKR
    |||  ||
    GKGAKRWESAP
What is the best way to evaluate the contribution of each?
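As a rough illustration of the scoring scheme, the sketch below counts matches, mismatches and gap positions in the alignment from the slide and combines them with weights. The function name and the weights are assumptions; the slide leaves the relative contribution of each term open.

```python
def alignment_score(seq1, seq2, match=1, mismatch=1, gap=1):
    """Score = match*(#matches) - mismatch*(#mismatches) - gap*(#gap columns)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    n_match = n_mismatch = n_gap = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            n_gap += 1
        elif a == b:
            n_match += 1
        else:
            n_mismatch += 1
    score = match * n_match - mismatch * n_mismatch - gap * n_gap
    return score, (n_match, n_mismatch, n_gap)


# Unit weights, then an arbitrary heavier weighting (both are illustrative choices).
print(alignment_score("GKG-RRWDAKR", "GKGAKRWESAP"))
print(alignment_score("GKG-RRWDAKR", "GKGAKRWESAP", match=5, mismatch=4, gap=7))
```

The alignment has 5 matches, 5 mismatches and 1 gap column, so changing the weights changes the score and can change which alignment is ranked best.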

A partial alignment of globin sequences from Pfam. Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.

Alignment: pairs of sequences Global vs. local alignment. (end gaps are ignored in local alignment)
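To make the local case concrete, here is a minimal sketch of a local alignment score in the style of Smith-Waterman, with a linear gap penalty. The scoring values are assumptions. Flooring every cell at zero is what lets the alignment start and stop anywhere, so unaligned ends cost nothing.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 option restarts the alignment instead of carrying a negative score.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The short sequence matches exactly inside the long one; the flanking
# residues are not penalized in a local alignment.
print(local_alignment_score("AAAGKGTTT", "GKG"))
```

A global alignment of the same pair would have to pay for six end-gap positions, which is the difference the slide points to.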

Dynamic programming (Brian Fristensky, Univ. of Manitoba)
    TCGCA
    || ||
    TC-CA
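The TCGCA / TC-CA example can be worked through with a small global alignment sketch in the style of Needleman-Wunsch. The match, mismatch and gap values are assumptions chosen for illustration, and this is not necessarily the exact variant shown in the original figures.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("TCGCA", "TCCA")
print(score)
print(top)
print(bottom)
```

With these scores the best alignment has four matches and one gap, which recovers the alignment shown on the slide.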

Dynamic programming Brian Fristensky. Univ. of Manitoba


Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
    GKG-RRWDAKR
    |||  ||
    GKGAKRWESAP
"The dynamic programming algorithm was improved in performance by Gotoh (1982) by using the linear relationship for a gap weight $w_x = g + rx$, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm." - D. W. Mount
    KQTGKG-RRWDAKR
    |  |||     |||
    KSAGKG-----AKR
VS.
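The effect of the gap weight $w_x = g + rx$ can be sketched numerically. The values g = 11 and r = 1 are assumptions chosen for illustration, and the second aligned string below is made up to contrast one long gap with several short ones.

```python
import re


def gap_weight(x, g=11, r=1):
    """w_x = g + r*x for a gap of length x; no penalty when there is no gap."""
    return 0 if x == 0 else g + r * x


def total_gap_penalty(aligned, g=11, r=1):
    """Sum w_x over every run of '-' in one aligned sequence."""
    return sum(gap_weight(len(run), g, r) for run in re.findall(r"-+", aligned))


# One gap of length 5, as in the slide's second alignment.
print(total_gap_penalty("KSAGKG-----AKR"))
# Hypothetical alternative: the same five gap positions as five separate gaps.
print(total_gap_penalty("K-S-A-G-K-GAKR"))
```

Because the opening penalty is paid once per gap, a single long gap is much cheaper than the same number of gap positions scattered through the alignment.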

Alignment: amino acid substitution matrices
Scoring schemes
"Any [scoring] matrix has an implicit amino acid pair frequency distribution that characterizes the alignments it is optimized for finding. More precisely, let $p_i$ be the frequency with which amino acid i occurs in protein sequences and let $q_{ij}$ be the frequency with which amino acids i and j are aligned within the class of alignments sought. Then, the scores that best distinguish these alignments from chance are given by the formula: $S_{ij} = \log\left(q_{ij} / (p_i p_j)\right)$. The base of the logarithm is arbitrary, affecting only the scale of the scores. Any set of scores useful for local alignment can be written in this form, so a choice of substitution matrices can be viewed as an implicit choice of 'target frequencies'" - Altschul et al (Nature Genetics 6:119)
Those frequencies are characteristic of the sequences being aligned, and are primarily a function of their degree of divergence.
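The log-odds formula can be tried out with a small sketch. The frequencies used here are invented for illustration and do not come from any real matrix; the `base` parameter reflects the quote's remark that the base of the logarithm only rescales the scores.

```python
import math


def log_odds_score(q_ij, p_i, p_j, base=2.0):
    """S_ij = log(q_ij / (p_i * p_j)) in the given logarithm base."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical background frequencies of 5% for both residues.
p_i = p_j = 0.05
# Pair aligned four times more often than chance gives a positive score.
print(log_odds_score(0.01, p_i, p_j))
# Pair aligned half as often as chance gives a negative score.
print(log_odds_score(0.00125, p_i, p_j))
```

Pairs that are aligned more often than expected by chance score above zero, and pairs that are aligned less often score below zero.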

Alignment: amino acid substitution matrices
Substitution matrices -- BLOSUM 62
Henikoff and Henikoff. Amino acid substitution matrices from protein blocks. PNAS 89:

Alignment: amino acid substitution matrices Substitution matrices -- BLOSUM 62

Alignment: implementations
FASTA: uses perfect k-tuple (word) matches to seed longer alignments.
BLAST (Basic Local Alignment Search Tool): initiates an alignment locally and then extends that alignment.
    GKG        GKG-RRW
    |||   ->   |||  ||
    GKG        GKGAKRW
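The seeding step shared by both tools can be sketched as a lookup of exact k-tuple matches. This is a simplified illustration with made-up names, not the actual FASTA or BLAST code: BLAST, for instance, also accepts high-scoring neighboring words, which is omitted here.

```python
from collections import defaultdict


def ktuple_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact k-tuple shared by both sequences."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Scan the subject and report each position where an indexed word occurs.
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


# The two sequences from the alignment slides, with the gap removed from the query.
for q_pos, s_pos in ktuple_hits("GKGRRWDAKR", "GKGAKRWESAP"):
    print(q_pos, s_pos, "diagonal", q_pos - s_pos)
```

Each hit is a candidate seed. Hits on the same diagonal can be joined without gaps, and an extension step then grows the seed into a longer alignment.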

Alignment: Searching databases for sequences

There are many modifications of BLAST for specific purposes.

The NCBI BLAST interface

Extreme value distribution: the expected distribution of the maximum of many independent random variables, generally $Y = \exp\left[-x - e^{-x}\right]$.
K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequencies of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition [12].
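A short sketch shows how K and lambda enter the statistics. The density is the formula from the slide. The expected number of chance hits uses the standard Karlin-Altschul form E = K m n e^{-lambda S}, which the slide does not state explicitly. The default K and lambda values are placeholders chosen for illustration, not values taken from the slide.

```python
import math


def extreme_value_density(x):
    """Standard extreme value density Y = exp(-x - e^(-x))."""
    return math.exp(-x - math.exp(-x))


def expected_hits(score, m, n, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S) for query length m and database length n."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit with that score: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: a 250-residue query against a database of 1e6 residues.
for s in (40, 60, 80):
    e = expected_hits(s, m=250, n=1_000_000)
    print(s, e, p_value(e))
```

The expected number of chance hits falls exponentially as the score rises, which is why a modest increase in score can turn an uninteresting hit into a significant one.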

FASTA can be run at EMBL. The software is also available for download.

Alignment: Multiple sequence alignment

Alignment: Protein classification

Phylogeny prediction (tree construction)

root

Phylogeny prediction (tree construction)
Character-based methods
    Parsimony
    Maximum likelihood: the tree that maximizes the likelihood of seeing the data
    Bayesian analysis: the trees with the greatest posterior probability given the data
Distance methods
    Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
    Neighbor joining
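Of the distance methods, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to the rest, weighted by cluster size. The function name, the input format and the four-taxon distance table are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    Returns nested tuples (left, right, height); leaves are plain labels.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining one: size-weighted mean.
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        # The new node sits at half the distance between the merged clusters.
        clusters[next_id] = ((ti, tj, dij / 2), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances between four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree = upgma(["A", "B", "C", "D"], {frozenset(p): v for p, v in pairs.items()})
print(tree)
```

UPGMA assumes a constant rate of change along all lineages. Neighbor joining drops that assumption, which is one reason it is often preferred in practice.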

Within- and between-species variation along a single chromosome. a, The interspecies relationships of five chromosome regions to corresponding DNA sequences in a chimpanzee and a gorilla. Most regions show humans to be most closely related to chimpanzees (red) whereas a few regions show other relationships (green and blue). b, The among-human relationships of the same regions are illustrated schematically for five individual chromosomes.

Tutorial III: Open problems in bioinformatics
Tentatively:
Detection of subtle signals
    promoter elements
    exon splicing enhancers
    noncoding RNAs
    weak protein similarities
Microarrays
Protein folding and homology modeling
Thursday, June 10, 2:00 - 3:45

Microarray expression data
Statistical analysis -- what has changed
Clustering -- which genes change together
Clustering -- promoter recognition
Clustering -- database integration
Phenotype determination (e.g. cancer prognosis)

Tutorial 2: Some problems in bioinformatics
1. Alignment
   Pairs of sequences
   Multiple sequence alignment
   Database searching for sequences
   Protein classification
2. Phylogeny prediction (tree construction)
3. Microarray expression data
4. Protein structure
   Protein folding
   Structure prediction
   Homology modeling
Sources: 1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount, Cold Spring Harbor Press; 2) NCBI tutorial; 3) Cold Spring Harbor course in Computational Genomics (1999), Pearson