Matching Problems in Bioinformatics Charles Yan Fall 2008.

2 Matching Problem
Given a string P (pattern) and a long string T (text), find all occurrences, if any, of P in T.
Example
T: Given a string P (pattern) and a long string T (text), find all occurrences, if any, of P in T.
P: any
Exact matching: does not allow any mismatch.
Inexact matching: allows up to k mismatches.
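The two variants of the problem can be sketched in a few lines of Python. This is only a naive illustration of the definitions, not the algorithm used in the course; the function names find_exact and find_with_mismatches are hypothetical, and "mismatch" is assumed to mean a substituted character (no insertions or deletions).

```python
def find_exact(pattern: str, text: str) -> list[int]:
    """Return every start index where pattern occurs in text (overlaps included)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        # Resume one character later so overlapping occurrences are found too.
        start = text.find(pattern, start + 1)
    return hits


def find_with_mismatches(pattern: str, text: str, k: int) -> list[int]:
    """Return every start index where pattern aligns to text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches, give up on this window
        else:
            # The inner loop finished without exceeding k mismatches.
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(find_exact("any", "if any, of many"))        # [3, 12]
    print(find_with_mismatches("ACG", "ACGTAGG", 1))   # [0, 4]
```

With k = 0 the inexact search reduces to exact matching. Both functions take time proportional to len(text) * len(pattern) in the worst case, which is the cost that faster matching algorithms aim to avoid on genome-sized texts.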

3 Matching Problem
Unix: grep
MS Word: find
Genbank:
Human genome:
Given “TTGTTCCGGTTAAAGATGGTGAAAATTTTT”, does it appear in the human genome? Where?
How about “ACCCCCAGGCGAGCATCTGACAGCCTGGAGCAGCACACACAACCCCAGGCGAG”?

4 Motifs
A motif is a conserved element corresponding to a certain function (or structure).
The occurrence of a motif in a protein is likely to indicate that the protein has the corresponding function.
Motifs are usually represented using an alignment or a regular expression.

6 Protein function prediction using motifs
Each protein function is characterized by a single motif or by multiple motifs.
If a protein contains the motif(s), it probably has the function that the motif(s) correspond to.
A pertinent analogy is the use of fingerprints by the police for identification purposes. A fingerprint is generally sufficient to identify a given individual. Similarly, motif(s) can be used to formulate hypotheses about the function of a newly discovered protein.

7 PROSITE
PROSITE is a database of protein families and domains, started in 1988.
PROSITE currently contains patterns (motifs) and profiles specific for more than a thousand protein families or domains.
Release 20.36, of 22-Jul-2006 (contains 1528 documentation entries).
Each of these signatures comes with documentation providing background information on the structure and function of these proteins.

12 PROSITE
Steps in the development of a new motif
Select a set of sequences that belong to a functional family.
Make a multiple alignment.
Find a short (not more than four or five residues long) conserved sequence (core motif) which is part of a region known to be important or which includes biologically significant residue(s).

13 PROSITE
Steps in the development of a new motif (cont.)
The most recent version of the Swiss-Prot knowledgebase is then scanned with the core motif(s).
If a core motif detects all the proteins in the family and none (or very few) of the other proteins, we can stop at this stage.
In most cases we are not so lucky and we pick up a lot of extra sequences which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the motif, is then necessary.
In some cases we never manage to find a good motif.
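The scan-and-refine loop can be illustrated with a small sketch. Everything here is an assumption made for illustration: the sequences are invented toy data (not Swiss-Prot entries), the motifs are written as Python regular expressions rather than PROSITE patterns, and evaluate_pattern is a hypothetical helper name. It reports which family members a candidate motif detects, which it misses, and which unrelated sequences it picks up.

```python
import re

# Invented toy data: two "family" sequences and two unrelated sequences.
FAMILY = {"fam1": "MKCGHCALV", "fam2": "TTCAHCWLE"}
OTHERS = {"bg1": "PPCDHCKKR", "bg2": "MNNQQRRST"}


def evaluate_pattern(regex, family, others):
    """Scan both sets with one candidate motif.

    Returns (found, missed, extra): family members detected, family members
    missed, and non-family sequences that match anyway.
    """
    compiled = re.compile(regex)
    found = [name for name, seq in family.items() if compiled.search(seq)]
    missed = [name for name in family if name not in found]
    extra = [name for name, seq in others.items() if compiled.search(seq)]
    return found, missed, extra


if __name__ == "__main__":
    # The short core motif detects the whole family but also an unrelated sequence.
    print(evaluate_pattern("C.HC", FAMILY, OTHERS))
    # A longer motif keeps the family and drops the extra hit.
    print(evaluate_pattern("C.HC[AW]L", FAMILY, OTHERS))
```

In this toy case one extension step is enough. On real data the extension is repeated, and as the slide notes, it may never converge on a motif that separates the family cleanly.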

14 PROSITE
Motifs are described using the following conventions:
The standard IUPAC one-letter codes for the amino acids are used.
The symbol 'x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a '-'.

15 PROSITE
Motifs are described using the following conventions (cont.):
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a '<' symbol or ends with a '>' symbol, respectively.
In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' are considered.
A period ends the pattern.
Example: [AC]-x-V-x(4)-{ED}. This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
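Because these conventions map closely onto regular-expression syntax, a pattern can be translated mechanically. The sketch below is one possible translation into Python's re syntax, covering only the conventions listed on these two slides; prosite_to_regex is a hypothetical helper name, not a function from PROSITE or from any library.

```python
import re

# One pattern element: [..] (may contain '>'), {..}, or a single letter,
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z>]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    body = pattern.strip().rstrip(".")          # a period ends the pattern
    pieces = []
    for element in body.split("-"):             # elements are separated by '-'
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):             # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):                  # any amino acid
            regex = "."
        elif core.startswith("["):              # accepted residues
            letters = core[1:-1].replace(">", "")
            regex = f"[{letters}]"
            if ">" in core:                     # '[G>]': G or end of sequence
                regex = f"(?:{regex}|$)"
        elif core.startswith("{"):              # residues that are not accepted
            regex = f"[^{core[1:-1]}]"
        else:                                   # a single literal residue
            regex = core.upper()
        if high is not None:
            regex += f"{{{low},{high}}}"
        elif low is not None:
            regex += f"{{{low}}}"
        pieces.append(prefix + regex + suffix)
    return "".join(pieces)


if __name__ == "__main__":
    rx = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
    print(rx)                                   # [AC].V.{4}[^ED]
    print(re.search(rx, "MKAQVLLLLKR"))         # matches AQVLLLLK
```

Once a motif is in this form, motif scanning becomes an ordinary regular-expression search over the protein sequence.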

18 PROSITE
A profile or weight matrix is a table of position-specific amino acid weights and gap costs.
These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence.
An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.
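A much simplified version of profile scanning can be sketched as follows. It assumes an ungapped alignment, so the gap costs mentioned above are left out, and the weights in PROFILE are invented for illustration rather than taken from PROSITE. The names scan_profile, PROFILE and the unknown parameter are hypothetical.

```python
# Invented three-column profile: one dict of residue weights per position.
PROFILE = [
    {"C": 5, "S": 1},
    {"G": 2, "A": 2},
    {"H": 6},
]


def scan_profile(profile, sequence, cutoff, unknown=-1):
    """Slide the profile along the sequence without gaps.

    Each window is scored by summing the position-specific weight of its
    residues (residues absent from a column get the `unknown` weight).
    Windows scoring at least `cutoff` are reported as (start, score).
    """
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, unknown) for col, res in zip(profile, window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    print(scan_profile(PROFILE, "MKCGHSAH", cutoff=9))    # [(2, 13), (5, 9)]
    print(scan_profile(PROFILE, "MKCGHSAH", cutoff=10))   # [(2, 13)]
```

Unlike a pattern, which either matches or does not, the profile assigns every window a score, so raising or lowering the cut-off trades missed occurrences against extra ones.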

20 Motifs and Matching
Motif Finding: Given a set of protein sequences, find the motif(s) shared by these proteins.
Motif Scanning: Given a motif and a protein sequence, find the occurrences (not necessarily identical) of the motif in the protein sequence. This is the matching problem!

21 From Single Motif to Multiple Motifs
A single motif is often not sufficient to predict a protein function.
Multiple motifs have stronger predictive power.

22 Multiple Motifs
Protein function prediction using multiple motifs
Each protein function is characterized by a set of motifs (instead of a single one).
If a protein contains a set of motifs, it probably has the function that the set of motifs corresponds to.
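The rule on this slide can be written down directly. In the sketch below the motif sets, the function labels and the test sequence are all invented, the motifs are plain Python regular expressions, and a function is predicted only when every motif in its set is found somewhere in the sequence. Requiring all motifs with no partial matches is an assumption of this sketch, not something the slide specifies.

```python
import re

# Invented motif sets: each hypothetical function is described by several motifs.
MOTIF_SETS = {
    "function A": ["G.G..G", "HRD"],
    "function B": ["C..C", "H...H"],
}


def has_all_motifs(sequence, motifs):
    """True when every motif in the set occurs somewhere in the sequence."""
    return all(re.search(motif, sequence) for motif in motifs)


def predict_functions(sequence, motif_sets):
    """Return the functions whose complete motif set is present in the sequence."""
    return [
        function
        for function, motifs in motif_sets.items()
        if has_all_motifs(sequence, motifs)
    ]


if __name__ == "__main__":
    # Contains both motifs of "function A" but only one motif of "function B".
    print(predict_functions("MGAGKTGLLHRDLKCAAC", MOTIF_SETS))   # ['function A']
```

The example sequence matches one motif of "function B" but is not assigned that function, which shows how a set of motifs is more selective than any single member.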

23 PRINTS
PRINTS is a database of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family.
FTP: ftp.bioinf.man.ac.uk/pub/prints
PRINTS is now maintained at the University of Manchester.
PRINTS version 38.1 (25 May 2007): 1904 fingerprints, encoding 11,451 single motifs.

24 PRINTS
Two types of fingerprint are represented in the database: simple or composite, depending on their complexity. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs. The bulk of the database entries are of the latter type because discrimination power is greater for multi-component searches.
Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D space.
Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.

27 PRINTS a) General field

28 PRINTS
FPScan: submit a protein sequence to find the closest matching PRINTS fingerprint(s).

33 Related Projects
InterPro - Integrated Resources of Proteins Domains and Functional Sites
BLOCKS - BLOCKS db
Pfam - Protein families db (HMM derived) [Mirror at St. Louis (USA)]
PRINTS - Protein Motif fingerprint db
ProDom - Protein domain db (Automatically generated)
PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
SBASE - SBASE domain db
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families db

34 Motifs and Matching
Motif Finding: Given a set of protein sequences, find the motif(s) shared by these proteins.
Motif Scanning: Given a motif and a protein sequence, find the occurrences (not necessarily identical) of the motif in the protein sequence. This is the matching problem!