Centre for Integrative Bioinformatics VU, Master Course Sequence Alignment, Lecture 9: Database searching (3)



Dot-plots: a simple way to visualise sequence similarity. Can be a bit messy, though... Filter: 6/10 residues have to match...

Dot-plots, what about...
Insertions/deletions -- DNA and proteins
Duplications (e.g. tandem repeats) -- DNA and proteins
Inversions -- DNA
Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix; typically the central cell holds the window score (e.g. sum, average).
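The windowed dot-plot calculation can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own program: the function name `dot_plot` and its parameters are assumptions, residues are compared by identity only (no substitution matrix), and the window score is a simple count of matches written to the cell at the window centre. The defaults mirror the "6/10 residues have to match" filter.

```python
def dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Return the set of (i, j) cells that receive a dot.

    A diagonal window of length `window` is slid over every pair of start
    positions; the number of identical residues in the window is the score,
    and the cell at the window centre gets a dot if the score passes the
    filter. Names and defaults are illustrative assumptions.
    """
    half = window // 2
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues along the diagonal starting at (i, j).
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i + half, j + half))
    return dots


if __name__ == "__main__":
    # Self-comparison of a short made-up sequence with a 3-residue window.
    for cell in sorted(dot_plot("ABCXXABC", "ABCXXABC", window=3, min_matches=3)):
        print(cell)
```

Raising `min_matches` removes noise at the cost of missing weakly similar diagonals, which is the trade-off the filter on the slide refers to.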

Dot-plots, self-comparison Direct repeat Tandem repeat Inverted repeat

The amount of genetic information in organisms
Table columns: Name, Genome size (Mb), # genes
Organisms listed: Escherichia coli, Homo sapiens, Zea mays, Mycoplasma genitalium, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans

charge

(cysteine bridge)

Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence):
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)

Globin fold: α protein, myoglobin, PDB: 1MBN. Helices are labelled 'A' (blue) to 'H' (red). The D helix can be missing in some globins: what happens with the alignment?

β sandwich: β protein, immunoglobulin, PDB: 7FAB

TIM barrel: α/β protein, Triose phosphate IsoMerase, PDB: 1TIM

Pyruvate kinase (phosphotransferase): β barrel regulatory domain, α/β barrel catalytic substrate binding domain, α/β nucleotide binding domain

What does this mean for alignment? Alignments need to be able to skip secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence). Depending on gap penalties chosen, the algorithm might have difficulty with making such long gaps (for example when using high affine gap penalties), resulting in incorrect alignment.
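To see why long gaps become expensive, it helps to compute the penalty directly. The sketch below assumes one common affine convention (the opening penalty covers the first gap position, each further position costs the extension penalty); other programs define it slightly differently, and the penalty values and the 25-residue element are made-up illustrations.

```python
def affine_gap_penalty(gap_length, gap_open, gap_extend):
    """Penalty for a single gap under an affine scheme.

    Assumed convention: gap_open is charged for the first gapped position,
    gap_extend for every additional position.
    """
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend


if __name__ == "__main__":
    # Hypothetical case: a 25-residue helix is absent from the shorter sequence.
    missing_element = 25
    for gap_open, gap_extend in [(10, 1), (12, 4)]:
        cost = affine_gap_penalty(missing_element, gap_open, gap_extend)
        print(f"open={gap_open}, extend={gap_extend}: penalty = {cost}")
```

With the higher extension penalty the single long gap costs roughly three times as much, so an alignment algorithm may prefer several short gaps and mismatches instead, which misaligns the flanking elements.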

What does this mean for homology searching? Database searching algorithms just need to decide whether the alignment score is good enough for inferring homology. Sometimes alignments can be incorrect, but the score can be close enough for the database searching method to correctly identify the DB sequence as a homolog (or not). However, for distant hits alignments become crucial.

Sequence Analysis/Database Searching Finding relationships between genes and gene products of different species, including those at large evolutionary distances

Compared to the preceding plot, RMSD is better able to pin-point relationships between more divergent sequences (RMSD stays relatively small for a longer time as compared to PAM distance) – Structure more conserved than sequence. Note that the spread around RMSD is larger

Structural superpositioning RMSD: how far are equivalenced Cα atoms separated on average?
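As a worked illustration of the RMSD measure, the following sketch computes the root-mean-square deviation over pairs of equivalenced Cα coordinates. It assumes the two structures have already been superposed and that the equivalences are given as two equally long coordinate lists; the function name and the toy coordinates are illustrative, not taken from the lecture.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between equivalenced atoms.

    coords_a[k] and coords_b[k] are (x, y, z) tuples of the k-th pair of
    equivalenced C-alpha atoms. The structures are assumed to be superposed
    already; no optimal rotation or translation is searched for here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates in Angstrom: the second atom pair is 2 A apart.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(a, b), 3))
```

Because the deviations are squared before averaging, a few badly superposed atom pairs raise the RMSD more than many small deviations do.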

C5 anaphylatoxin: human (PDB code 1kjs) and pig (1c5a) proteins are superposed. Two superposed protein structures with two well-superposed helices. Red: well superposed. Blue: low match quality.

How to assess homology search methods
We need an annotated database, so we know which sequences belong to which homologous (super)families.
Examples of databases of homologous families are PFAM, Homstrad or Astral.
The idea is to take a protein sequence from a given homologous family, run the search method, and then assess how well the method has carried out the search.
This should be repeated for many query sequences, after which the overall performance can be measured.
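The bookkeeping for one query can be sketched as follows. This is an assumed, simplified setup: the search result is a dictionary of database identifiers and E-values, the family annotation is a set of identifiers, and a hit is accepted when its E-value is at or below a chosen cut-off. All names, identifiers and the cut-off are hypothetical.

```python
def count_outcomes(hits, family_members, database_ids, max_evalue=0.001):
    """Count TP, FP, FN and TN for one query.

    hits           : dict mapping database id -> E-value reported by the search
    family_members : set of ids annotated as homologous to the query
    database_ids   : set of all ids in the searched database (query excluded)
    Assumes family_members and the keys of hits are subsets of database_ids.
    """
    predicted = {seq_id for seq_id, evalue in hits.items() if evalue <= max_evalue}
    tp = len(predicted & family_members)   # homologs that were found
    fp = len(predicted - family_members)   # non-homologs that were accepted
    fn = len(family_members - predicted)   # homologs that were missed
    tn = len(database_ids) - tp - fp - fn  # non-homologs correctly rejected
    return tp, fp, fn, tn


if __name__ == "__main__":
    database = {"a", "b", "c", "d", "e"}
    family = {"a", "b", "c"}
    search_hits = {"a": 1e-30, "b": 1e-5, "d": 1e-4, "c": 0.5}
    print(count_outcomes(search_hits, family, database))
```

Repeating this for many queries and summing the counts gives the overall performance figures for a method.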

Example
C; family: zinc finger -- CCHH-type
C; class: small
C; reordered by kitschorder 1.0a
C; reordered by kitschorder 1.0a
C; last update 7/9/98
>P1;1zaa1
structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:
RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK*
>P1;1zaa2
structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:
PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK*
>P1;1zaa3
structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:
PFACDI--CGRKFARSDERKRHT-KI-HLR--*
>P1;1ard
structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:
RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK*
>P1;1znf
structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:
YKCGL--CERSFVEKSALSRHQ-RV-HKN--*
>P1;2drp2
structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:
NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---*
>P1;3znf
structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:
RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-*
>P1;5znf
structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:
KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*
You can also look at superposed structures.

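Sequence searching
[Figure: a QUERY is searched against a DATABASE; a threshold T splits the database sequences into POSITIVES and NEGATIVES, each of which is either true or false (True Positive, False Positive, True Negative, False Negative)]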

So what have we got
Observed P, Predicted P: TP
Observed N, Predicted P: FP
Observed P, Predicted N: FN
Observed N, Predicted N: TN

Sensitivity and Specificity – medical world

            With disease                Without disease              Total
Test +      True Positive (TP) 9,990    False Positive (FP) 990      All with positive test, TP+FP = 10,980
Test -      False Negative (FN) 10      True Negative (TN) 989,010   All with negative test, FN+TN = 989,020
Total       All with disease 10,000     All without disease 990,000  Everyone = TP+FP+FN+TN = 1,000,000

Positive Predictive Value = TP/(TP+FP) = 9,990/(9,990+990) = 91%
Negative Predictive Value = TN/(FN+TN) = 989,010/(10+989,010) = 99.999%
Sensitivity = TP/(TP+FN) = 9,990/(9,990+10) = 99.9%
Specificity = TN/(FP+TN) = 989,010/(989,010+990) = 99.9%
Pre-Test Probability = (TP+FN)/(TP+FP+FN+TN) = 10,000/1,000,000 = 1% (in this case = prevalence)
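The arithmetic of this example can be checked with a short script. It is a sketch using only the formulas given on the slide; the function name is an assumption, and the counts are TP = 9,990, FN = 10, TN = 989,010 and FP = 990, where the FP count is the value that reproduces the quoted 91% positive predictive value.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return the test statistics defined on the slide as fractions (0..1)."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (fn + tn),
        "pre_test_probability": (tp + fn) / total,
    }


if __name__ == "__main__":
    metrics = diagnostic_metrics(tp=9990, fp=990, fn=10, tn=989010)
    for name, value in metrics.items():
        print(f"{name}: {100 * value:.3f}%")
```

Even with sensitivity and specificity both at 99.9%, about one in eleven positive tests is false, because the condition is rare. The same effect applies to database searches, where true homologs are a small fraction of the database.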

Receiver Operating Characteristic (ROC) curve
Plot Sensitivity (TP/(TP+FN)) on the vertical axis against 1 - Specificity (1 - TN/(FP+TN)) on the horizontal axis, where the latter is called error: Error = 1 - specificity.
Sensitivity is also called Coverage.
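A ROC curve for a ranked hit list can be built by walking down the list and recording (error, sensitivity) after each hit. The sketch below assumes that higher scores are better, that each hit is labelled as a true homolog or not, that the list contains at least one of each, and that tied scores need no special handling; the function name and the example scores are hypothetical.

```python
def roc_points(scored_hits):
    """Return ROC points as (error, sensitivity) pairs.

    scored_hits: list of (score, is_homolog) tuples, higher score = better.
    error = FP / (FP + TN) = 1 - specificity; sensitivity = TP / (TP + FN).
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    positives = sum(1 for _, is_homolog in ranked if is_homolog)
    negatives = len(ranked) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above the best score: nothing accepted
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    hits = [(90, True), (80, True), (70, False), (60, True), (50, False)]
    for error, sensitivity in roc_points(hits):
        print(f"error={error:.2f}  sensitivity={sensitivity:.2f}")
```

A method that ranks all homologs above all non-homologs climbs straight up the sensitivity axis before moving right, so the closer the curve hugs the top-left corner, the better the method.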

Database Search Algorithms: Sensitivity, Selectivity
Sensitivity: the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected. Sensitivity (or Coverage) = TP / (TP + FN)
Selectivity: the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not. Selectivity (or Positive Prediction Value) = TP / (TP + FP)
Specificity also describes the ability of the method to select proper hits. Specificity = TN / (TN + FP)
Courtesy of Gary Benson (ISSCB 2003)

COG – Clusters of Orthologous Groups
Orthologues found using bi-directional best hit searching with PSI-BLAST
All COG family members are supposed to have the same function
Searching with an unknown sequence only needs to hit a single member of a COG family; annotation can then be transferred
COG2813
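The bi-directional best hit criterion can be sketched as follows. The sketch does not run PSI-BLAST: it assumes the search scores between two genomes A and B are already available as nested dictionaries, and all gene names and scores are made up for illustration.

```python
def best_hit(scores):
    """Return the id with the highest score, or None if there are no hits."""
    return max(scores, key=scores.get) if scores else None


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit.

    a_vs_b: dict gene in genome A -> dict of genome B hits -> score
    b_vs_a: the same for searches started from genome B
    """
    pairs = []
    for gene_a, scores in a_vs_b.items():
        gene_b = best_hit(scores)
        # Accept the pair only if the reverse search points back to gene_a.
        if gene_b is not None and best_hit(b_vs_a.get(gene_b, {})) == gene_a:
            pairs.append((gene_a, gene_b))
    return pairs


if __name__ == "__main__":
    a_vs_b = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 120}}
    b_vs_a = {"b1": {"a1": 210, "a2": 110}, "b2": {"a1": 40}}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here `a2` also hits `b1`, but `b1` prefers `a1`, so only the reciprocal pair is kept; requiring the hit in both directions is what filters out many paralogs.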

Structure-based function prediction: SCOP is a protein structure classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities.

Structure-based function prediction SCOP hierarchy – the top level: 11 classes

Structure-based function prediction: All-alpha protein, Coiled-coil protein, All-beta protein, Alpha-beta protein, Membrane protein

Structure-based function prediction SCOP hierarchy – the second level: 800 folds

Structure-based function prediction SCOP hierarchy - third level: 1294 superfamilies

Structure-based function prediction SCOP hierarchy - fourth level: 2327 families

Structure-based function prediction
Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold:
Proteins predicted to be in the same SCOP family are orthologous
Proteins predicted to be in the same SCOP superfamily are homologous
Proteins predicted to be in the same SCOP fold are structurally analogous

Profile wander
[Figure: diagram of sequences built from domains A, B, C and D]

Multi-domain Proteins (cont.) A common conserved protein domain such as the tyrosine kinase domain can obscure weak but relevant matches to other domain types (e.g. only appearing after 5000 kinase hits) Sequences containing low-complexity regions, such as coiled coils and transmembrane regions, can cause an explosion of the search rather than convergence because of the absence of any strong sequence signals. Conversely, some searches may lead to premature convergence; this occurs when the PSSM is too strict only allowing matches to very similar proteins, i.e., sequences with the same domain organization as the query are detected but no homologues with different domain combinations.

Multi-domain Proteins - DOMAINATION
George R.A. and Heringa J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST, Proteins: Struct. Func. Gen. 48.
Iterate PSI-BLAST searches and domain delineation
DOMAINATION uses sequence signals to identify domain boundaries

Multi-domain Proteins – DOMAINATION method
Strategy: combine C- and N-termini of local alignments to the query to delineate domain boundaries
Count starts and stops of alignments to estimate P(boundary)

DOMAINATION: Identifying domain boundaries
Sum N- and C-termini of gapped local alignments
True N- and C-termini are counted twice (within 10 residues)
Boundaries are smoothed using two windows (15 residues long)
Combine scores using biased protocol:
if N_i x C_i = 0 then S_i = N_i + C_i
else S_i = N_i + C_i + (N_i x C_i)/(N_i + C_i)
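The score combination rule can be written out directly. The sketch below covers only this last step: it assumes the per-position counts of alignment N-termini and C-termini have already been summed and smoothed, and it does not reproduce the double counting or the window smoothing. The function name and the example counts are hypothetical.

```python
def combine_termini_scores(n_counts, c_counts):
    """Combine per-position N-terminus and C-terminus counts into S_i.

    Implements the biased protocol from the slide:
      S_i = N_i + C_i                                  if N_i * C_i == 0
      S_i = N_i + C_i + (N_i * C_i) / (N_i + C_i)      otherwise
    """
    scores = []
    for n_i, c_i in zip(n_counts, c_counts):
        if n_i * c_i == 0:
            scores.append(n_i + c_i)
        else:
            # Positions where alignments both start and stop get a bonus.
            scores.append(n_i + c_i + (n_i * c_i) / (n_i + c_i))
    return scores


if __name__ == "__main__":
    n_termini = [0, 3, 4]
    c_termini = [2, 0, 4]
    print(combine_termini_scores(n_termini, c_termini))
```

The extra term rewards positions where alignment starts and alignment ends coincide, which is the signature expected at a boundary between two domains.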

DOMAINATION: identifying domain deletions
Deletions in the query (or insertions in the DB sequences) are identified when two adjacent segments in the query align to the same DB sequences (>70% overlap), and those DB sequences have a region of >35 residues not aligned to the query (remove N- and C-termini).

DOMAINATION: identifying domain permutations
A domain shuffling event is declared when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order (e.g. a-b in the query versus b-a in the DB sequence).

DOMAINATION: identifying continuous and discontinuous domains Each segment is assigned an independence score (In). If In>10% the segment is assigned as a continuous domain. An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.

Low Complexity segments
A sequence of $L$ residues of $N$ types can have $L! / \prod_{a=1}^{N} n_a!$ different sequences of that same composition, where the composition vector is $(n_1, \ldots, n_a, \ldots, n_N)$ and $\prod_{a=1}^{N} n_a! = n_1! \cdot n_2! \cdots n_N!$.
If $r_c$ is the number of residue types that occur exactly $c$ times in the sequence (e.g. there are 5 amino acid types with abundance 0, 3 amino acid types with abundance 1, etc.), then the total number of distinct sequences corresponding to a particular complexity state-vector is
$$\frac{L!}{\prod_{a=1}^{N} n_a!} \cdot \frac{N!}{\prod_{c=0}^{L} r_c!}, \qquad \text{where } \prod_{c=0}^{L} r_c! = r_0! \cdot r_1! \cdots r_{L-1}! \cdot r_L!$$
Based on this, the final complexity score calculated by the SEG program is
$$P_{SEG} = \frac{1}{N^L} \cdot \frac{L!}{\prod_{a=1}^{N} n_a!} \cdot \frac{N!}{\prod_{c=0}^{L} r_c!}$$
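The complexity score can be evaluated for a short window with exact integer arithmetic. This is a minimal sketch of the formula as stated here, not the SEG program itself: the function name is an assumption, the alphabet size defaults to 20 amino acid types, and the window is assumed to contain no more distinct residue types than the alphabet allows.

```python
from collections import Counter
from math import factorial, prod


def seg_probability(window, alphabet_size=20):
    """P_SEG = (1/N^L) * (L! / prod n_a!) * (N! / prod r_c!)."""
    length = len(window)
    counts = Counter(window)
    # Composition vector n_1..n_N, including zeros for absent residue types.
    composition = list(counts.values()) + [0] * (alphabet_size - len(counts))
    # Number of sequences sharing this exact composition.
    n_sequences = factorial(length) // prod(factorial(n) for n in composition)
    # r_c: how many residue types occur exactly c times.
    r = Counter(composition)
    n_compositions = factorial(alphabet_size) // prod(factorial(v) for v in r.values())
    return n_sequences * n_compositions / alphabet_size ** length


if __name__ == "__main__":
    for window in ["AAAAAAAAAAAA", "AAAAAAGGGGGG", "ACDEFGHIKLMN"]:
        print(window, seg_probability(window))
```

A homopolymer window gets a far lower probability than a window of twelve different residues, so a low `P_SEG` flags a low-complexity segment.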

DOMAINATION: Post-processing low complexity regions in database sequences Remove local fragments with > 15% LC

Conserved hypotheticals >P00001 Conserved hypothetical A substantial fraction of genes in sequenced genomes encodes 'conserved hypothetical' proteins, i.e. those that are found in organisms from several phylogenetic lineages but have not been functionally characterized.

Profile wander (or matrix migration)
Permissive iterative searching using higher E-value thresholds can lead to incorrect hits (false positives) that become included in the profile. More incorrect hits can then be added in subsequent iterations, and true homologues can be lost. Also, the search can explode, leading to large numbers of spurious hits.
A further loss of information can be incurred with PSI-BLAST, because PSI-BLAST PSSMs are trimmed to only use the highest scoring region in a search, ignoring less conserved regions.

Sequence identity scoring zones
>25-30%: homology zone
15-25%: twilight zone
<15%: midnight zone (Rost, 1999)
Is the midnight zone properly definable?