Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Presentation transcript:

Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence Evolution changes sequences

Spring 2002 Christophe Roos - 5/6 Profiles, motifs, structures

Proteins share similar domains
By comparing several related sequences to each other, one can distinguish segments with a higher level of conservation.
These segments usually have a key role in the function of the protein.
BLAST identifies related sequences fast, but only roughly.

Refine the comparison
A multiple sequence alignment of the best-scoring sequences found by BLAST (or in some other way) is done with a more sensitive algorithm.
Example: the eyeless gene of the fruit fly is also found in several other species: birds, mammals, reptiles, fish, other invertebrates. There it is called PAX6.

Visualise the relationship
Once a multiple sequence alignment is done, it can also be used for finding relationships (evolutionary distances).
The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all of the 'present-day' sequences used.
Then a path including all sequences is computed.
Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
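
As a rough illustration of turning an alignment into distances, the sketch below computes the proportion of differing positions (the p-distance) for every pair of aligned sequences. This is a simpler stand-in for the mutation-counting distance described above, not the method used in the slides; the function names and the toy alignment are invented for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned, gap-free positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Skip columns where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy alignment (invented residues, not real PAX6 data).
toy_alignment = {
    "seq1": "KRPCAVS",
    "seq2": "KRPCAVS",
    "seq3": "RRPC-VT",
}
distances = distance_matrix(toy_alignment)
closest_pair = min(distances, key=distances.get)
```

Distance-based tree methods such as neighbour joining start from a matrix like this one, whereas parsimony and likelihood methods work on the alignment columns directly.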

Visualise the output of aligned domains
First all sequence pairs are aligned and scored, then in a second round a multiple sequence alignment is built up.
In this case (PAX6 proteins from vertebrates and fruit fly), two domains are more conserved than the rest of the sequence.
The most conserved areas have been highlighted by the use of black or gray background and white text.
Only part of the alignment is shown.
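
The first round, in which all sequence pairs are aligned and scored, can be sketched as follows. This is a minimal example that assumes a plain Needleman-Wunsch global alignment with made-up match, mismatch and gap values; the slides do not say which algorithm or scoring scheme was actually used, and the names `global_alignment_score` and `rank_pairs` are hypothetical.

```python
from itertools import combinations


def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment (score only)."""
    # prev[j] holds the best score for seq_a[:i-1] against seq_b[:j].
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        curr = [i * gap]
        for j in range(1, len(seq_b) + 1):
            diagonal = prev[j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_pairs(sequences):
    """Score every pair and return (score, name_a, name_b), best pair first."""
    scored = [
        (global_alignment_score(sequences[a], sequences[b]), a, b)
        for a, b in combinations(sorted(sequences), 2)
    ]
    return sorted(scored, reverse=True)


# Toy sequences (invented, for illustration only).
toy_sequences = {"a": "ACDE", "b": "ACDE", "c": "WWWW"}
ranked = rank_pairs(toy_sequences)
```

In a progressive scheme, the best-scoring pairs from such a ranking would typically be merged first when the multiple alignment is built up in the second round.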

Profiles and motifs
A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences.
The term motif refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or a family membership.
Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs.
Motifs can be represented for computational purposes as
– Flexible patterns: [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database)
– Position-specific scoring matrices (PSSM, see next page)
– Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
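
A flexible pattern of this kind maps directly onto a regular expression. The sketch below converts the notation used on this slide (including the comma inside the brackets) into a Python regex and scans a sequence with it. The function names and the test sequence are invented for illustration, and only the pattern elements that appear here, plus `{...}` exclusions, are handled.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as [K,R]-R-P-C-x(11)-C-V-S into a regex string."""
    regex_parts = []
    for element in pattern.split("-"):
        repeat = ""
        if "(" in element:
            # x(11) means eleven positions; x(2,4) would mean two to four.
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."  # any residue
        elif element.startswith("["):
            core = "[" + element[1:-1].replace(",", "") + "]"  # allowed residues
        elif element.startswith("{"):
            core = "[^" + element[1:-1].replace(",", "") + "]"  # excluded residues
        else:
            core = element  # a single fixed residue
        regex_parts.append(core + repeat)
    return "".join(regex_parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for every non-overlapping match."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


paired_box_regex = pattern_to_regex("[K,R]-R-P-C-x(11)-C-V-S")

# Invented test sequence: the motif is embedded after two leading residues.
toy_sequence = "MA" + "KRPC" + "A" * 11 + "CVS" + "GG"
hits = find_motif("[K,R]-R-P-C-x(11)-C-V-S", toy_sequence)
```

Such a pattern either matches or does not; it carries no weights, which is the limitation that scoring matrices and profile HMMs address.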

Position specific scoring matrix
This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S
[Matrix figure; columns: A B C D E F G H I K L M N P Q R S T V W X Y Z *]
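
To make the idea of a position-specific scoring matrix concrete, the sketch below builds one from a few ungapped aligned segments and uses it to score windows of a longer sequence. It assumes log-odds scores against a uniform background with a simple pseudocount; real tools use more careful background frequencies and weighting. The aligned segments, the function names and the parameter values are all invented for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_segments, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for position in range(len(aligned_segments[0])):
        column = [segment[position] for segment in aligned_segments]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for residue in AMINO_ACIDS:
            frequency = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum the per-column scores of a window as wide as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )


# Invented ungapped segments loosely modelled on the start of the pattern.
toy_segments = ["KRPC", "RRPC", "KRPC", "RRPS"]
toy_pssm = build_pssm(toy_segments)
best_score, best_start = best_hit(toy_pssm, "GGKRPCGG")
```

Unlike the flexible pattern, the matrix gives a graded score, so a window with one unusual residue can still rank above unrelated windows.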

Motif and databases – mode of use
Motifs can be used to search sequence databases
– take a family of related sequences
– align and define motifs
– use the motifs to search a database of sequences to find novel family members
– motifs can also be generated from unaligned sequences (e.g. MEME, see next page)
Motif databases can be searched with sequences
– take one sequence and ask what known motifs it contains
– deduce its function using knowledge about those motifs in other sequences
DBs
– Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
– COG, clusters of orthologous groups, NCBI (21 complete genomes)
– Pfam, Sanger Center (gapped profiles, curated)
– Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
– Prosite, Univ. Geneva (consensus patterns, expert-curated)
– SMART, EMBL-Heidelberg
– InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. [2 pages forward]

Motif discovery tools and PSSM creators
The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters such as
– Min-max length
– Number per sequence
– Number per set
MEME also generates PSSMs for the domains it finds.
MAST is a tool for searching databases with PSSMs.

The InterPro database of motifs at EBI (Nov 2001) was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom , SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data.
This release of InterPro contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.
[Diagram of source databases: Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, SWISS-PROT + TrEMBL]

Scan the InterPro database - example
The InterPro database was scanned with the PAX6 sequence from the fruit fly.

Protein 3D structure
3D is better than linear strings of letters...
Protein folding is critical for function
Protein folding is ordered
Structures consist of folds
3D structure can be measured, but computational ab initio structure prediction is a tough task and nearly impossible above a certain protein size (CPU and rule limits)

Protein 3D structure building blocks
Primary structure: the linear array of amino acids
Secondary structures
– Alpha helix
– Beta-strand
Tertiary structures
[Figure: DNA-binding protein (DNA helix, white; helices, pink; sheets of beta-strands, ochre)]