Pairwise Sequence Alignment


Pairwise alignment is useful in structural, functional, and evolutionary analyses of sequences.
Sequence homology: two sequences are descended from a common evolutionary origin.
Sequence similarity: the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity.

For nucleotide sequences, sequence similarity and sequence identity are synonymous. For protein sequences they differ: sequence identity is the percentage of aligned positions that hold the same amino acid residue, whereas sequence similarity is the percentage of aligned residues that have similar physicochemical characteristics.
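
The two percentages can be computed directly from a finished alignment. The following is a minimal sketch, not a standard tool: the residue groups in SIMILAR_GROUPS are a hypothetical simplification (real programs usually count positions with a positive substitution-matrix score), and dividing by the full alignment length is only one of several conventions.

```python
# Hypothetical physicochemical groups, chosen for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("STNQ"), set("AG")]

def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two gapped, equal-length rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column counts as neither identical nor similar
        if a == b:
            identical += 1
            similar += 1  # identical residues are also similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)  # assumption: normalise by alignment length
    return 100.0 * identical / length, 100.0 * similar / length

identity, similarity = identity_and_similarity("KIVL-E", "RIVMAD")
print(round(identity, 1), round(similarity, 1))
```

For a protein pair the similarity is never lower than the identity, which matches the distinction drawn above.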

The goal of pairwise alignment is to find the best pairing of two sequences, such that there is maximum correspondence among residues. There are two strategies: global alignment and local alignment.
Global: the two sequences to be aligned are assumed to be generally similar over their entire length. The alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across their entire length.
Local: does not assume that the two sequences in question are similar over their entire length; it looks only for the most similar regions.
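
The difference between the two strategies shows up clearly in the dynamic-programming recurrences. Below is a minimal sketch that computes only the optimal score (no traceback), using the Needleman-Wunsch recurrence for global alignment and the Smith-Waterman recurrence for local alignment. The match, mismatch, and linear gap values are illustrative assumptions, not values taken from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score with a linear gap penalty (scores only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "AGT"))                         # global score
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))    # local score
```

In the second call the sequences share only the short region ACG, so the local score reflects that region alone, while a global alignment would also be charged for the unrelated flanks.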

Gap penalties
Performing an optimal alignment between sequences often involves inserting gaps that represent insertions and deletions.
If the penalty values are set too low, gaps can become so numerous that even unrelated sequences can be matched up with high similarity.
If the penalty values are set too high, gaps become too difficult to introduce and a reasonable alignment cannot be achieved, which is also unrealistic.

Scoring matrix
The alignment procedure has to make use of a scoring system. The scoring system is called a substitution matrix and is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences.
Scoring matrices for nucleotide sequences are relatively simple: a positive value (high score) is given for a match and a negative value (low score) for a mismatch.
Scoring matrices for amino acids are more complicated because the scoring has to reflect the physicochemical properties of the amino acid residues.
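
As a rough illustration of both ideas, the sketch below builds a simple match/mismatch matrix for nucleotides and computes a log-odds score of the kind substitution matrices are commonly built from: the observed frequency of an aligned residue pair divided by the frequency expected by chance. The scores and the frequencies in the example call are hypothetical values, not data from the slides.

```python
import math

def nucleotide_matrix(match=1, mismatch=-1):
    """Simple nucleotide scoring matrix: one score for matches, one for mismatches."""
    bases = "ACGT"
    return {(x, y): (match if x == y else mismatch) for x in bases for y in bases}

def log_odds(pair_frequency, frequency_a, frequency_b):
    """Log-odds score in bits: observed pair frequency versus chance expectation."""
    return math.log2(pair_frequency / (frequency_a * frequency_b))

matrix = nucleotide_matrix()
print(matrix[("A", "A")], matrix[("A", "G")])

# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
print(log_odds(0.02, 0.1, 0.1))
```

A pair that is aligned more often than chance gets a positive score and a pair aligned less often than chance gets a negative one, which is how a matrix derived this way can reflect the physicochemical relatedness of residues.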

Amino acid scoring matrices
These are amino acid substitution matrices, which are 20 x 20 matrices.
PAM matrices (Dayhoff PAM matrices) were compiled from alignments of seventy-one groups of very closely related protein sequences.
BLOSUM matrices are based only on observed residue substitutions. The numbers in their names are the actual percentage identity values of the sequences selected for construction of the matrices.
PAM matrices are derived from an evolutionary model, whereas BLOSUM matrices consist entirely of direct observations.

Database Similarity Searching
Database similarity searching is pairwise alignment on a large scale.
Algorithms implemented for sequence database searching: BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All).
It is an essential first step in the functional characterization of novel gene or protein sequences.
The major issues in database searching are sensitivity, selectivity, and speed.
The summary statistics for the significance of database matches are: bit score, E-value, percentage identity, similarity (positives), and gaps.
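
The bit score and the E-value are linked by the Karlin-Altschul statistics used for local alignment searches. The sketch below shows that relationship under simplifying assumptions: lam and k stand for the statistical parameters of the scoring system (the values a user would pass are hypothetical here), and the raw query and database lengths are used instead of the edge-corrected effective lengths that real search programs apply.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score into bits using the parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_length, database_length):
    """Expected number of chance hits scoring at least `bits` in the search space."""
    return query_length * database_length * 2.0 ** (-bits)

# Each extra bit halves the E-value; a larger database raises it.
print(e_value(40, query_length=300, database_length=10**9))
print(e_value(41, query_length=300, database_length=10**9))
```

This is why the same bit score can be significant against a small database and insignificant against a very large one.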

Multiple Sequence Alignment
An extension of pairwise alignment is multiple sequence alignment, which aligns multiple related sequences to achieve optimal matching of the sequences.
Multiple sequence alignment has a unique advantage: it reveals more biological information than many pairwise alignments can.
It allows the identification of conserved sequence patterns and motifs in the whole sequence family.
Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment.
Multiple sequence alignment is an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures.
It also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences.

Protein Motif and Domain Prediction
Consensus sequence patterns:
A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA. A motif is associated with a distinct structural site performing a particular function (such as the Zn-finger motif, 10-20 amino acids long).
A domain is a conserved sequence pattern, defined as an independent functional and structural unit. A domain consists of more than 40 residues and up to 700 residues.
Motifs and domains are evolutionarily more conserved than other regions of a protein and tend to evolve as units.
Their prediction is an important aspect of the classification of protein sequences and of functional annotation.

Identification of motifs and domains
In multiple sequence alignments
Using regular expressions
Using statistical models

Motifs and domains in multiple alignments
Motifs and domains are first constructed from a multiple alignment of related sequences.
Based on the multiple alignment, commonly conserved regions can be identified.
The regions considered motifs and domains then serve as diagnostic features for a protein family.
The consensus sequence information of motifs and domains can be stored in a database for later searches for the presence of similar sequence patterns in unknown sequences.
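
Finding commonly conserved regions can be sketched as a scan over alignment columns. The function below is an illustrative assumption rather than any published method: it reports the columns in which the most frequent residue reaches a chosen fraction of the sequences, and the alignment rows and the threshold are made up for the example.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return indices of columns whose most common residue reaches `threshold`."""
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        # Columns dominated by gaps are never reported as conserved.
        if residue != "-" and count / len(column) >= threshold:
            conserved.append(index)
    return conserved

# Hypothetical three-sequence alignment.
rows = ["MKSAKLV", "MRSGKLV", "MKSPKIV"]
print(conserved_columns(rows))                  # strictly conserved columns
print(conserved_columns(rows, threshold=0.6))   # majority-conserved columns
```

Runs of adjacent conserved columns are the natural candidates for the motifs and domains described above.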

Regular Expressions
A regular expression is a concise way of representing a sequence family by a string of characters. Basic rules are used to describe a sequence pattern.
For example, a protein phosphorylation motif can be expressed as [ST]-X-[RK], which can be interpreted as an S or T residue, followed by one unspecified residue (X), followed by an R or K residue.
A motif written as E-X(2)-[FHM]-X(4)-{P}-L can be interpreted as an E residue, followed by two unspecified residues, followed by an F, H, or M residue, followed by another four unspecified residues, followed by a non-P residue and a final L.
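
The notation above maps almost one-to-one onto the regular expressions of a programming language. The converter below is a minimal sketch that covers only the elements used in the two examples (single residues, [..] alternatives, {..} exclusions, X, and fixed repeats such as X(2)); the function name is hypothetical, and features such as variable-length repeats or terminal anchors are not handled.

```python
import re

def pattern_to_regex(pattern):
    """Translate a motif such as E-X(2)-[FHM]-X(4)-{P}-L into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # Split an element into its core and an optional fixed repeat count "(n)".
        core, count = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element).groups()
        if core in ("X", "x"):
            regex = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            regex = core                       # single residue or [..] alternatives
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)

print(pattern_to_regex("[ST]-X-[RK]"))
print(pattern_to_regex("E-X(2)-[FHM]-X(4)-{P}-L"))

# Exact matching of the phosphorylation motif against a made-up sequence.
hit = re.search(pattern_to_regex("[ST]-X-[RK]"), "MKSAKL")
print(hit.group(), hit.start())
```

This performs exact matching only: a sequence either contains the pattern or it does not.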

There are two mechanisms for matching regular expressions with a query sequence: exact matching and fuzzy matching.
Exact matching: there must be a strict match of sequence patterns. Searching a motif database using this approach results in either a match or a nonmatch.
Fuzzy matching: also called approximate matching, it is more permissive because it allows more flexible matching of residues with similar biochemical properties. For example, if the expression has phenylalanine at a particular position, fuzzy matching allows other aromatic residues in a sequence to match the expression.
Motif databases have been used to classify proteins, provide functional assignments, and identify structural and evolutionary relationships.
PROSITE is a sequence pattern database.
Emotif is a motif database that uses multiple sequence alignments.

Statistical Models
The major limitation of regular expressions is that the method does not take into account sequence probability information from the multiple alignment on which the expression is modeled.
When a regular expression is derived from an incomplete sequence set, it has less predictive power because many more sequences with the same type of motif are not represented.
Statistical models preserve the sequence information from a multiple sequence alignment and express it with probabilistic models.
Statistical models have stronger predictive power than regular expressions. Using such a scoring system can enhance the sensitivity of motif discovery and detect more divergent but truly related sequences.
Web sites:
PRINTS is a protein fingerprint database.
BLOCKS is a database that uses multiple alignments derived from the most conserved, ungapped regions of homologous protein sequences.
ProDom -

Protein Family Databases
Some databases classify proteins based on the presence of motifs and domains.
Another approach is based on near full-length sequence comparison: clustering of proteins based on overall sequence similarities.
Other protein family databases are based on phylogenetic classification.

Summary
Sequence motifs and domains represent conserved, functionally important regions of proteins.
Domains correspond to contiguous regions in protein three-dimensional structure and serve as units of evolution.
Motifs are highly conserved segments in multiple protein alignments that may be associated with particular biological functions.
Databases for motifs and domains can be constructed based on multiple sequence alignments of related sequences.

Gene Prediction Lecture Nov 25, 2008

Gene Prediction
One of the most difficult problems in the field of pattern recognition.
Genes do not have conserved motifs.
There are significant differences in the gene structures of prokaryotes and eukaryotes.

Categories of gene prediction programs
Gene prediction methods are either ab initio-based or homology-based.
The ab initio-based approach predicts genes from the given sequence alone, relying on two features:
Gene signals: the existence of start and stop codons, intron splice signals, transcription factor binding sites, ribosomal binding sites, and polyadenylation (poly-A) sites.
Gene content: a statistical description of coding regions. Nucleotide composition and statistical patterns of coding regions tend to vary significantly from those of noncoding regions, which makes it possible to distinguish coding from noncoding regions.

The homology-based method makes predictions based on significant matches of the query sequence with sequences of known genes (as in consensus-based approaches).

Gene Prediction in Prokaryotes
Prokaryotes include bacteria and Archaea.
Relatively small genomes, with sizes ranging from 0.5 to 10 Mbp.
Gene density is high: more than 90% of a genome sequence contains coding sequence.
Very few repetitive sequences.
Each ORF codes for a single protein.
The start codon is ATG (methionine); GTG and TTG are used as alternative start codons.
Gene structure consists of: transcription start, RBS (consensus motif AGGAGGT), translation start, coding region, stop, transcription terminator.

Prediction of Open Reading Frames
An ORF is delimited by a start and a stop codon and can be translated into a protein sequence, which is then used to search against a protein database.
Codon bias (GC bias): this concerns the nucleotide composition of the third position of a codon. It has been observed that this position has a preference for G or C over A or T.
By plotting the GC composition, regions with values significantly above the random level can be identified as ORFs.
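
Both ideas can be sketched in a few lines. The scanner below is a simplified illustration, not a gene finder: it searches only the three forward reading frames, accepts only ATG as a start codon unless told otherwise, and uses a hypothetical minimum length; the second function measures the fraction of G or C at third codon positions. The test sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, starts=("ATG",), min_codons=3):
    """Return (start, end) index pairs of ORFs on the forward strand, stop codon included."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs

def gc3(orf_sequence):
    """Fraction of G or C at the third position of each codon."""
    third_positions = orf_sequence[2::3]
    return sum(base in "GC" for base in third_positions) / len(third_positions)

dna = "CCATGGCGTGCTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end], gc3(dna[start:end]))
```

Passing starts=("ATG", "GTG", "TTG") would also accept alternative start codons; a fuller version would scan the reverse complement as well.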

Gene Prediction Using Markov Models and Hidden Markov Models
A Markov model gives a statistical description of a gene.
It describes the probability of the distribution of nucleotides in a DNA sequence.
Oligonucleotide distributions in coding regions differ from those in noncoding regions.
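
The idea of separating coding from noncoding sequence by their nucleotide statistics can be sketched with two first-order Markov chains and a log-odds comparison. This is a toy illustration under stated assumptions: the training sequences are invented, the model order is one (practical gene finders typically use higher orders), and a pseudocount of 1 is added to every transition.

```python
import math

def train_markov(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate first-order transition probabilities P(current | previous)."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for sequence in sequences:
        for previous, current in zip(sequence, sequence[1:]):
            counts[previous][current] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def coding_log_odds(sequence, coding_model, noncoding_model):
    """Positive values favour the coding model, negative values the noncoding model."""
    return sum(
        math.log2(coding_model[p][c] / noncoding_model[p][c])
        for p, c in zip(sequence, sequence[1:])
    )

# Hypothetical training data: GC-rich "coding" and AT-rich "noncoding" examples.
coding = train_markov(["GCGCGCGC"])
noncoding = train_markov(["ATATATAT"])
print(coding_log_odds("GCGC", coding, noncoding))
print(coding_log_odds("ATAT", coding, noncoding))
```

A hidden Markov model extends this by letting the hidden state (for example coding or noncoding) change along the sequence, so that region boundaries are inferred rather than given.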

Gene Prediction in Eukaryotes
Genome sizes range from 10 Mbp to 670 Gbp.
Very low gene density: only about 3% of the genome codes for genes (in humans), about 1 gene per 100 kbp on average.
The space between genes is very large and rich in repetitive sequences and transposable elements.
A gene is split into pieces: exons and introns.
Gene structure: transcription start, start codon, exon, intron ... exon ... intron, exon, stop codon, poly-A signal.

Gene Prediction Programs
Ab initio-based programs: the goal is to discriminate exons from noncoding sequences and subsequently join the exons together in the correct order. Approaches include prediction using neural networks, discriminant analysis, and HMMs.
Homology-based programs: rely on the fact that exon structures and exon sequences of related species are highly conserved.
Consensus-based programs: built on consensus-based algorithms.

Summary
Computational prediction of genes is one of the most important steps of genome analysis.
Prediction of prokaryotic genes is easier than prediction of eukaryotic genes.
Gene prediction algorithms based on HMMs (hidden Markov models) have good accuracy.

Phylogenetic Analysis

How to Construct a Tree
Step 1: Constructing a multiple sequence alignment
Step 2: Determining the substitution model
Step 3: Tree building
Step 4: Tree evaluation

STEP 1: Alignment and alignment editing
Phylogenetic sequence data consist of a multiple sequence alignment.
The aligned base positions are called sites; these sites are the characters.
Character state: the actual base (or gap) occupying a site.
It is not uncommon to edit the alignment.

STEP 2: Deciding on a data model
Models of substitution rates between bases: weight matrices.
Models of substitution rates between amino acids: PAM and BLOSUM matrices.
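
To make the role of a substitution model concrete, the sketch below uses the Jukes-Cantor model, which is not named in the slides and is chosen here only because it is the simplest nucleotide model: it assumes equal base frequencies and equal rates for all substitutions, and corrects the observed proportion of differing sites for multiple substitutions at the same site. The two sequences in the example are invented.

```python
import math

def p_distance(aligned_a, aligned_b):
    """Observed proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3), valid for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences diverge.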

STEP 3: Tree building
Distance-based methods use the amount of dissimilarity (the distance) between two aligned sequences to derive trees. Examples: UPGMA (unweighted pair group method with arithmetic mean) and neighbor joining (NJ).
UPGMA is a clustering method; it joins tree branches on the criterion of greatest similarity among pairs.
Character-based methods assess the reliability of each base position in the alignment on the basis of all other base positions. Examples: maximum parsimony (MP) and maximum likelihood (ML).
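
The UPGMA clustering step can be sketched as repeated merging of the closest pair of clusters, with distances to the merged cluster averaged by cluster size. The version below is a minimal illustration that returns only the tree topology as nested tuples (no branch lengths); the taxon names and distances are hypothetical, and labels are assumed to be unique strings.

```python
from itertools import combinations

def upgma(names, distances):
    """Cluster taxa by UPGMA; `distances` maps frozenset({a, b}) to a distance."""
    clusters = {name: (name, 1) for name in names}   # label -> (subtree, leaf count)
    d = {frozenset(pair): distances[frozenset(pair)] for pair in combinations(names, 2)}
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance (greatest similarity).
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = a + "+" + b
        for other in clusters:
            # Size-weighted average keeps every original leaf equally weighted.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    return next(iter(clusters.values()))[0]

# Hypothetical pairwise distances between four taxa.
pairwise = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
print(upgma(["A", "B", "C", "D"], pairwise))
```

Here A and B are joined first, C is added next, and D attaches last, mirroring the order of increasing distance.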

STEP 4: Tree evaluation
Bootstrapping is a resampling method for tree evaluation; its result is reported as a bootstrap value.
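
The resampling behind a bootstrap value can be sketched as follows: alignment columns are drawn with replacement to build pseudo-alignments of the original length, a tree is rebuilt from each, and the percentage of replicates that recover a given grouping is reported. In this sketch the tree-building step is left to the caller through the hypothetical callback clade_found, and the replicate count and seed are arbitrary choices.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(sequence[c] for c in columns) for sequence in alignment]

def bootstrap_support(alignment, clade_found, replicates=100, seed=0):
    """Percentage of replicates in which `clade_found(pseudo_alignment)` is true.

    `clade_found` is a caller-supplied function (an assumption of this sketch) that
    builds a tree from the pseudo-alignment and reports whether it contains the
    grouping of interest.
    """
    rng = random.Random(seed)
    hits = sum(
        bool(clade_found(bootstrap_alignment(alignment, rng)))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates

alignment = ["ACGT", "ACGA", "TCGA"]
print(bootstrap_alignment(alignment, random.Random(1)))
```

Because whole columns are resampled, every sequence keeps its residues in register with the others in each replicate.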