Pairwise Sequence Alignment

1 Pairwise Sequence Alignment
Pairwise alignment is useful in structural, functional and evolutionary analyses of sequences. Sequence homology: two sequences are descended from a common evolutionary origin. Sequence similarity: the percentage of aligned residues that are similar in physicochemical properties such as size, charge and hydrophobicity.

2 Sequence similarity and sequence identity are synonymous for nucleotide sequences.
For protein sequences the two terms differ. Sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences. Sequence similarity refers to the percentage of aligned residues that have similar physicochemical characteristics.
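
The difference between identity and similarity can be made concrete with a small sketch. The residue groups below are a hypothetical simplification (real tools count "positives" from a substitution matrix), and dividing by the full alignment length is only one of several conventions in use.

```python
# Hypothetical grouping of amino acids by physicochemical similarity.
# Real programs decide similarity from a substitution matrix instead.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"),
                  set("ST"), set("NQ"), set("AG")]


def identity_and_similarity(a, b):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Both values are divided by the alignment length (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    n = len(a)
    return 100.0 * identical / n, 100.0 * similar / n


identity, similarity = identity_and_similarity("KLVE-T", "RLIDGS")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

In this toy alignment only one column is identical, but five of the six columns pair residues from the same group, so similarity is much higher than identity.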

3 The goal of pairwise alignment is to find the best pairing of two sequences, such that there is maximum correspondence among residues. There are two strategies: global alignment and local alignment. Global: the two sequences to be aligned are assumed to be generally similar over their entire length. The alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across their entire length. Local: does not assume that the two sequences in question are similar over their entire length; it looks only for the local regions with the highest similarity.
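
A minimal sketch of the two strategies, using the classic dynamic programming formulations (Needleman-Wunsch for global, Smith-Waterman for local). The slide does not name these algorithms, and the scoring values are illustrative assumptions.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False gives a global (Needleman-Wunsch style) score,
    local=True gives a local (Smith-Waterman style) score.
    The default match/mismatch/gap values are assumed for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + sub,   # align two residues
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# The two sequences share only the segment ACGT.
print(alignment_score("TTTTACGT", "ACGTCCCC", local=True))
print(alignment_score("TTTTACGT", "ACGTCCCC", local=False))
```

The local score rewards the shared segment alone, while the global score is dragged down by the unrelated flanking regions.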

4 Gap penalties
Performing optimal alignment between sequences often involves introducing gaps that represent insertions and deletions. If the penalty values are set too low, gaps can become so numerous that even unrelated sequences can be matched up with high similarity. If the penalty values are set too high, gaps become too difficult to introduce and a reasonable alignment cannot be achieved, which is also unrealistic.
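
The effect of the gap penalty can be sketched with a small global aligner that also recovers the alignment. The scoring values and the linear (per-gap) penalty are assumptions for illustration; practical programs usually separate gap opening from gap extension.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns both aligned strings."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = i * gap
    for j in range(1, cols):
        s[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


# Two unrelated toy sequences aligned with free gaps and with costly gaps.
for penalty in (0, -5):
    top, bottom = global_align("ACGT", "TGCA", gap=penalty)
    print(f"gap={penalty}: {top} / {bottom}")
```

With free gaps every mismatch is replaced by a pair of gaps, so the only aligned residue pair is a match and the unrelated sequences look perfectly similar. With a heavy penalty no gap is opened at all.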

5 Scoring matrix
The alignment procedure has to make use of a scoring system. The scoring system is called a substitution matrix and is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences. Scoring matrices for nucleotide sequences are relatively simple: a positive value or high score is given for a match and a negative value or low score for a mismatch. Scoring matrices for amino acids are more complicated because the scoring has to reflect the physicochemical properties of the amino acid residues.
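
A simple nucleotide scoring matrix of the kind described here can be written down in a few lines. The values +5 and -4 are assumed for illustration only, and the scoring function handles ungapped alignments only.

```python
from itertools import product


def nucleotide_matrix(match=5, mismatch=-4):
    """Build a 4 x 4 match/mismatch matrix as a dict keyed by base pairs.

    The default scores are illustrative assumptions.
    """
    return {(x, y): (match if x == y else mismatch)
            for x, y in product("ACGT", repeat=2)}


def ungapped_score(a, b, matrix):
    """Sum the matrix scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[x, y] for x, y in zip(a, b))


matrix = nucleotide_matrix()
print(ungapped_score("ACGT", "ACCT", matrix))  # three matches, one mismatch
```

An amino acid matrix has the same shape (a lookup keyed by residue pairs) but 20 x 20 entries with values that differ from pair to pair.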

6 Amino acid scoring matrices: amino acid substitution matrices, which are 20 x 20 matrices.
PAM matrices (Dayhoff PAM matrices): compiled from alignments of seventy-one groups of very closely related protein sequences. BLOSUM matrices: based only on direct observations of residue substitutions; the numbers in the matrix names are the actual percentage identity values of the sequences selected for construction of the matrices. PAM matrices are derived from an evolutionary model, whereas the BLOSUM matrices consist entirely of direct observations.

7 Database Similarity Searching
Database similarity searching is pairwise alignment on a large scale. Algorithms implemented for sequence database searching: BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All). It is an essential first step in the functional characterization of novel gene or protein sequences. The major issues in database searching are sensitivity, selectivity and speed. The summary statistics for the significance of database matches are: bit score, E-value, and the percentages of identity, similarity (positives) and gaps.
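
The relationship between raw score, bit score and E-value can be sketched with the Karlin-Altschul formulas used by BLAST: S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The lambda and K values below are illustrative placeholders; real values depend on the scoring system and are reported by the search program.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


# Placeholder statistical parameters and search-space sizes (assumptions).
LAMBDA, K = 0.267, 0.041
for raw in (40, 60, 80):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1), e_value(bits, 300, 1_000_000))
```

Because the E-value falls exponentially with the bit score, a modest increase in score turns a chance-level hit into a highly significant one.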

8 Multiple Sequence Alignment
Multiple sequence alignment is an extension of pairwise alignment: multiple related sequences are aligned to achieve optimal matching of the sequences. Its unique advantage is that it reveals more biological information than many pairwise alignments can. It allows the identification of conserved sequence patterns and motifs in the whole sequence family. Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment. Multiple sequence alignment is an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures. It also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences.
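
Once a multiple alignment exists, conserved positions can be read off column by column. The sketch below is a simplified illustration with a made-up alignment; it scores each column by the share of sequences carrying the most common residue.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, the fraction of sequences sharing the most common residue.

    Gaps ("-") never count as a conserved residue.
    """
    n = len(alignment)
    fractions = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        fractions.append(top / n)
    return fractions


def conserved_columns(alignment, threshold=1.0):
    """Indices of columns whose conservation reaches the threshold."""
    return [i for i, f in enumerate(column_conservation(alignment))
            if f >= threshold]


toy_alignment = ["MKVLA", "MRVLG", "MKILA"]  # hypothetical aligned sequences
print(conserved_columns(toy_alignment))
```

Columns that stay invariant across the family are the natural candidates for the functionally critical residues mentioned on the slide.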

9 Protein Motif and Domain Prediction
Consensus sequence patterns. A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA. A motif is often associated with a distinct structural site performing a particular function (for example a Zn-finger motif, which is a short stretch of amino acids). A domain is a conserved sequence pattern, defined as an independent functional and structural unit. A domain consists of more than 40 residues and up to 700 residues. Motifs and domains are evolutionarily more conserved than other regions of a protein and tend to evolve as units. Identifying them is an important aspect of the classification of protein sequences and of functional annotation.

10 Identification of motifs and domains
In multiple sequence alignments. Using regular expressions. Using statistical models.

11 Motifs and domains in multiple alignments
Motifs and domains are first constructed from a multiple alignment of related sequences. Based on the multiple alignment, commonly conserved regions can be identified. The regions considered motifs and domains then serve as diagnostic features for a protein family. The consensus sequence information of motifs and domains can be stored in a database for later searches for similar sequence patterns in unknown sequences.

12 Regular Expressions
A regular expression is a concise way of representing a sequence family by a string of characters. Basic rules are used to describe a sequence pattern. For example, a protein phosphorylation motif can be expressed as [ST]-X-[RK], which is interpreted as an S or T residue, followed by one unspecified residue (X), followed by an R or K residue. A motif written as E-X(2)-[FHM]-X(4)-{P}-L is interpreted as an E residue followed by two unspecified residues, followed by an F, H or M residue, followed by another four unspecified residues, followed by a non-P residue and a final L.
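
The two patterns on this slide map directly onto ordinary regular expressions. The converter below is a minimal sketch that handles only the notation shown here (single residues, X, [..], {..} and a fixed repeat count); the function name is made up for illustration.

```python
import re

# One pattern element: [ABC], {ABC}, or a single letter, optionally followed by (n).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?")


def motif_to_regex(pattern):
    """Translate a motif such as E-X(2)-[FHM]-X(4)-{P}-L into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = m.groups()
        if core in ("X", "x"):
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        parts.append(core + ("{%s}" % count if count else ""))
    return "".join(parts)


print(motif_to_regex("[ST]-X-[RK]"))
print(motif_to_regex("E-X(2)-[FHM]-X(4)-{P}-L"))
print(bool(re.search(motif_to_regex("[ST]-X-[RK]"), "AASGRAA")))
```

The converted pattern can then be searched against any protein sequence with `re.search` or `re.finditer`.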

13 There are two mechanisms for matching regular expressions with a query sequence: exact matching and fuzzy matching. Exact matching: there must be a strict match of sequence patterns, so searching a motif database with this approach results in either a match or a nonmatch. Fuzzy matching: also called approximate matching, it is more permissive because it allows residues with similar biochemical properties to match. For example, if phenylalanine is specified at a particular position, fuzzy matching allows other aromatic residues in a sequence to match the expression. Motif databases have been used to classify proteins, provide functional assignments, and identify structural and evolutionary relationships. PROSITE is a sequence pattern database. Emotif is a motif database that uses multiple sequence alignments.
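
The contrast between exact and fuzzy matching can be sketched by widening each specified residue to a group of biochemically similar residues. The groups and the compact motif notation (one letter per position, X for any residue) are assumptions made for this illustration, not the scheme of any particular database.

```python
import re

# Hypothetical groups of residues with similar biochemical properties.
GROUPS = ["FWY", "KRH", "DE", "ILVM", "ST", "NQ"]


def motif_regex(motif, fuzzy=False):
    """Turn a motif like 'FXK' into a regex; X matches any residue.

    With fuzzy=True each residue also matches the other members of its group.
    """
    parts = []
    for residue in motif:
        if residue == "X":
            parts.append(".")
        elif fuzzy:
            group = next((g for g in GROUPS if residue in g), residue)
            parts.append("[" + group + "]")
        else:
            parts.append(residue)
    return "".join(parts)


sequence = "AAYGKAA"  # carries tyrosine where the motif asks for phenylalanine
print(re.search(motif_regex("FXK"), sequence))              # exact: no match
print(re.search(motif_regex("FXK", fuzzy=True), sequence))  # fuzzy: match
```

Fuzzy matching finds more divergent family members, at the cost of more chance hits.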

14 Statistical Models
The major limitation of regular expressions is that this method does not take into account sequence probability information about the multiple alignment from which it is modeled. If a regular expression is derived from an incomplete sequence set, it has less predictive power because many more sequences with the same type of motif are not represented. Statistical models preserve the sequence information from a multiple sequence alignment and express it with probabilistic models. Statistical models have stronger predictive power than regular expressions. Using such a scoring system can enhance the sensitivity of motif discovery and detect more divergent but truly related sequences. Web sites: PRINTS is a protein fingerprint database. BLOCKS is a database that uses multiple alignments derived from the most conserved, ungapped regions of homologous protein sequences. ProDom -
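
One simple statistical model of this kind is a position-specific scoring matrix built from an ungapped alignment block. The sketch below assumes a uniform background frequency and a pseudocount of 1, and the four-sequence block is made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Log-odds scores per column of an ungapped alignment block.

    Assumes a uniform background (1/20) and simple additive pseudocounts.
    """
    n = len(block)
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*block):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum of per-position log-odds scores for one sequence window."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(pssm)
    return max((score_window(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


block = ["SGR", "TAK", "SLR", "SPK"]  # hypothetical motif instances
pssm = build_pssm(block)
print(best_hit(pssm, "AAASGRAAA"))
```

Unlike a regular expression, the matrix gives every window a graded score, so a residue that was rare in the training block is penalized rather than rejected outright.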

15 Protein Family Databases
One approach classifies proteins based on the presence of motifs and domains. Another approach is based on near full-length sequence comparison, clustering proteins by overall sequence similarity. A further type of protein family database is based on phylogenetic classification.

16 Summary
Sequence motifs and domains represent conserved, functionally important portions of proteins. Domains correspond to contiguous regions in protein three-dimensional structure and serve as units of evolution. Motifs are highly conserved segments in multiple protein alignments that may be associated with particular biological functions. Databases for motifs and domains can be constructed based on multiple sequence alignments of related sequences.

17 Gene Prediction Lecture Nov 25, 2008

18 Gene Prediction
Gene prediction is one of the most difficult problems in the field of pattern recognition. Coding regions generally do not have conserved motifs. There are significant differences between the gene structures of prokaryotes and eukaryotes.

19 Categories of gene prediction programs: gene prediction methods
Ab initio-based and homology-based. The ab initio-based approach predicts genes based on the given sequence alone, using two kinds of features. Gene signals: start and stop codons, intron splice signals, transcription factor binding sites, ribosomal binding sites and polyadenylation (poly-A) sites. Gene content: a statistical description of coding regions (the nucleotide composition and statistical patterns of coding regions tend to vary significantly from those of noncoding regions), which is used to distinguish coding from noncoding regions.

20 The homology-based method makes predictions based on significant matches of the query sequence with the sequences of known genes (as in consensus-based methods).

21 Gene Prediction in Prokaryotes
Prokaryotes include bacteria and Archaea. They have relatively small genomes, with sizes ranging from 0.5 to 10 Mbp. Gene density is high, with more than 90% of a genome sequence containing coding sequence. There are very few repetitive sequences. Each ORF codes for a single protein. The start codon is ATG (methionine); GTG and TTG are used as alternative start codons. The gene structure consists of a transcription start, RBS (consensus motif AGGAGGT), translation start, coding region, stop codon and transcription terminator.

22 Prediction of Open Reading Frames
An ORF is delimited by a start and a stop codon and can be translated into a protein sequence, which is then used to search against a protein database. Codon bias (GC bias): the nucleotide composition of the third position of a codon. It has been observed that this position has a preference for G or C over A or T. By plotting the GC composition at this position, regions with values significantly above the random level can be identified as ORFs.
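
Both ideas on this slide, scanning for start and stop codons and measuring GC content at the third codon position, can be sketched briefly. This simplified version scans the forward strand only and uses an arbitrary minimum length; a real ORF finder would also scan the reverse complement.

```python
STARTS = {"ATG", "GTG", "TTG"}   # ATG plus the alternative start codons
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs in all three frames.

    end is exclusive and includes the stop codon; min_codons counts the
    codons before the stop and is an assumed cutoff.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


def gc3(orf):
    """Fraction of G or C at the third position of each codon."""
    third_positions = orf[2::3]
    return sum(base in "GC" for base in third_positions) / len(third_positions)


dna = "CCATGGCCGGCTAACC"  # made-up sequence with one short ORF
for start, end in find_orfs(dna):
    print(start, end, dna[start:end], gc3(dna[start:end]))
```

Computing `gc3` in a sliding window along a genome gives the kind of plot the slide describes, in which stretches well above the random level suggest coding regions.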

23 Gene Prediction Using Markov Models and Hidden Markov Models
A Markov model is a statistical description of a gene: it describes the probability of the distribution of nucleotides in a DNA sequence. The oligonucleotide distributions in coding regions are different from those in noncoding regions.
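
The idea that coding and noncoding regions have different oligonucleotide distributions can be sketched with two first-order Markov chains and a log-odds score. The training sequences are toy data, and a first-order model is used only to keep the example short; gene finders commonly use higher-order models.

```python
import math
from itertools import product


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base).

    Returns a dict keyed by dinucleotide; only A, C, G and T are supported.
    """
    counts = {a + b: pseudocount for a, b in product("ACGT", repeat=2)}
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    probs = {}
    for a in "ACGT":
        total = sum(counts[a + b] for b in "ACGT")
        for b in "ACGT":
            probs[a + b] = counts[a + b] / total
    return probs


def coding_log_odds(seq, coding, noncoding):
    """Positive values mean the sequence looks more like the coding model."""
    return sum(math.log2(coding[seq[i:i + 2]] / noncoding[seq[i:i + 2]])
               for i in range(len(seq) - 1))


# Toy training sets standing in for known coding and noncoding regions.
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])
print(coding_log_odds("GCGC", coding_model, noncoding_model))
print(coding_log_odds("ATAT", coding_model, noncoding_model))
```

A hidden Markov model extends this by letting the model switch between coding and noncoding states along the sequence instead of scoring a fixed window.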

24 Gene Prediction in Eukaryotes
Genome sizes range from 10 Mbp to 670 Gbp. Gene density is very low: only 3% of the genome codes for genes (in humans), about 1 gene per 100 kbp on average. The space between genes is very large and rich in repetitive sequences and transposable elements. A gene is split into pieces: exons and introns. Gene structure: transcription start, start codon, exon, intron ... exon ... intron, exon, stop codon, poly-A signal.

25 Gene Prediction Programs
Ab initio-based programs: the goal is to discriminate exons from noncoding sequences and subsequently join the exons together in the correct order. Approaches include prediction using neural networks, discriminant analysis and HMMs. Homology-based programs: rely on the fact that exon structures and exon sequences of related species are highly conserved. Consensus-based programs: built on consensus-based algorithms that combine the predictions of several programs.

26 Summary
Computational prediction of genes is one of the most important steps of genome analysis. Prediction of genes is easier for prokaryotic genomes than for eukaryotic genomes. Gene prediction algorithms based on HMMs (hidden Markov models) have good accuracy.

27 Phylogenetic Analysis

28 How to Construct a Tree
Constructing a multiple sequence alignment. Determining the substitution model. Tree building. Tree evaluation.

29 STEP 1 Alignment and alignment editing:
Phylogenetic sequence data consist of a multiple sequence alignment. The aligned base positions are called sites, and these sites are the characters. Character state: the actual base (or gap) occupying a site. It is not uncommon to edit the alignment.

30 STEP 2 Deciding on a data model:
Models of substitution rates between bases: weight matrices. Models of substitution rates between amino acids: PAM and BLOSUM matrices.

31 STEP 3 Tree Building: distance-based methods and character-based methods
Distance-based methods use the amount of dissimilarity (the distance) between two aligned sequences to derive trees. Examples: UPGMA (unweighted pair group method with arithmetic mean) and neighbor joining (NJ). UPGMA is a clustering method that joins tree branches on the criterion of greatest similarity among pairs. Character-based methods assess the reliability of each base position in the alignment on the basis of all other base positions. Examples: maximum parsimony and maximum likelihood (ML).
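
UPGMA clustering can be sketched in a few lines: repeatedly join the closest pair of clusters and average their distances to everything else, weighted by cluster size. The distance values are made up, and this simplified version returns only the tree topology as nested tuples, without branch lengths.

```python
def upgma(names, distances):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    distances maps frozenset({a, b}) to the distance between taxa a and b.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of taxa in it
    d = dict(distances)
    while len(sizes) > 1:
        # Join the pair of clusters with the smallest distance.
        closest = min(d, key=d.get)
        a, b = sorted(closest, key=str)   # fixed order for a stable result
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to every other cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        # Drop all distances that involve the two clusters just merged.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances between four taxa.
toy_distances = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], toy_distances))
```

Neighbor joining follows the same join-and-update loop but corrects the distances for unequal rates before choosing the pair to join.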

32 STEP 4 Tree Evaluation: bootstrapping is a resampling method for tree evaluation; the support it assigns to a branch is the bootstrap value.
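
The resampling step of bootstrapping can be sketched as follows: alignment columns are drawn with replacement to build pseudo-alignments, and the bootstrap value is the percentage of replicates that still support a grouping. The `supports_clade` callback and the toy clade test are hypothetical stand-ins for rebuilding a full tree from each replicate.

```python
import random


def resample_columns(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(alignment, supports_clade, replicates=100, seed=0):
    """Percentage of bootstrap replicates for which supports_clade is true."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replicates)
               if supports_clade(resample_columns(alignment, rng)))
    return 100.0 * hits / replicates


def first_two_closest(alignment):
    """Hypothetical clade test: are sequences 0 and 1 the closest pair?"""
    def differences(x, y):
        return sum(p != q for p, q in zip(x, y))
    a, b, c = alignment
    return differences(a, b) < min(differences(a, c), differences(b, c))


toy_alignment = ["ACGTACGT", "ACGTACGA", "TCGAAGGT"]  # made-up sequences
print(bootstrap_support(toy_alignment, first_two_closest))
```

In a real analysis the callback would build a tree from each pseudo-alignment with the same method used for the original tree and check whether the branch of interest is present.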
