EBI web resources II: Ensembl and InterPro
Yanbin Yin, Fall 2014



Homework 3
Go to and finish the second online training course “Introduction to protein classification at the EBI” and then answer the following questions:
– What is the difference between a protein family and a protein domain?
– Can a protein belong to multiple families or contain multiple domains?
– What are protein sequence features? Examples?
– What is a protein signature? What is it used for?
– What are the major signature types?
– Is PROSITE a sequence pattern database or a profile database? What about Pfam?
– What is the definition of “annotation”?
In your report, answer these questions and also include a screenshot of the page(s) that support each answer.
Due on 10/7 (send by )
Office hours: Tue, Thu and Fri 2-4pm, MO325A Or

Outline
– Intro to genome annotation
– Protein family/domain databases: InterPro, Pfam, Superfamily etc.
– Genome browser: Ensembl
– Hands-on practice

Genome annotation
Predict genes (where are the genes?)
– protein coding
– RNA coding
Function annotation (what are these genes?)
– Search against UniProt or NCBI-nr (GenPept)
– Search against protein family/domain databases
– Search against pathway databases
Function vocabularies are defined in Gene Ontology.
Proteins can be classified into groups according to sequence or structural similarity. These groups often contain well-characterized proteins whose function is known. Thus, when a novel protein is identified, its functional properties can be proposed based on the group to which it is predicted to belong.

PDB, SCOP, CATH, Superfamily, Gene3D

InterPro components:
1. CATH/Gene3D: University College, London, UK
2. PANTHER: University of Southern California, CA, USA
3. PIRSF: Protein Information Resource, Georgetown University, USA
4. Pfam: Wellcome Trust Sanger Institute, Hinxton, UK
5. PRINTS: University of Manchester, UK
6. ProDom: PRABI Villeurbanne, France
7. PROSITE: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
8. SMART: EMBL, Heidelberg, Germany
9. SUPERFAMILY: University of Bristol, UK
10. TIGRFAMs: J. Craig Venter Institute, Rockville, MD, US
11. HAMAP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
CDD components: Pfam, SMART, TIGRFAM, COG, KOG, PRK, CD, LOAD

Most UniProt proteins are annotated with at least one InterPro signature.

Protein families are often arranged into hierarchies, with proteins that share a common ancestor subdivided into smaller, more closely related groups. The terms superfamily (describing a large group of distantly related proteins) and subfamily (describing a small group of closely related proteins) are sometimes used in this context.

Protein Classification
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below.
Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% or greater.
Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
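The 30% identity guideline for families can be made concrete with a small calculation. The following is a minimal Python sketch, assuming the two sequences have already been aligned (gaps written as "-") and that identity is counted only over columns where neither sequence has a gap. The function names and the choice of denominator are illustrative assumptions, not part of the lecture, and other tools define percent identity differently.

```python
# Assumed guideline from the slide: family members generally share >= 30% identity.
FAMILY_IDENTITY_THRESHOLD = 30.0


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (one possible convention)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_same_family(aligned_a: str, aligned_b: str,
                       threshold: float = FAMILY_IDENTITY_THRESHOLD) -> bool:
    """Apply the rule of thumb; real classification also uses structure and function."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

A pair that falls below the threshold may still belong to the same superfamily, which is why structural and functional evidence is needed at that level.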

Figure: Structure, Protein Sequence, Function (literature) and Evolution, linked to Pfam, SMART, ProSite, PDB, SCOP, CATH, UniProt, GenPept, Superfamily and Gene3D

fold ~ class – superfamily ~ clan – family – subfamily – domain sequence

Family- and domain-based classifications are not always straightforward and can overlap, since proteins are sometimes assigned to families by virtue of the domain(s) they contain. An example of this kind of complexity is outlined below.
Figure: Domain composition of phospholipase D1, which is an enzyme that breaks down phosphatidylcholine. The protein contains a PX (phox) domain that is involved in binding phosphatidylinositol, a PH (pleckstrin homology) domain that has a role in targeting the enzyme to particular locations within the cell, and two PLD (phospholipase D) domains responsible for the protein’s catalytic activity.

Sequence features are groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function. Such features include:
– active sites, which contain amino acids involved in catalytic activity.
– binding sites, containing amino acids that are directly involved in binding molecules or ions.
– post-translational modification (PTM) sites, which contain residues known to be chemically modified (phosphorylated, palmitoylated, acetylated, etc.) after the process of protein translation.
– repeats, which are typically short amino acid sequences that are repeated within a protein, and may confer binding or structural properties upon it.
Sequence features differ from domains in that they are usually quite small (often only a few amino acids long), whereas domains represent entire structural or functional units of the protein (see Figure). Sequence features are often nested within domains: a protein kinase domain, for example, usually contains a protein kinase active site.
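Short sequence features such as PTM sites are often described as patterns. As a rough illustration, the Python sketch below converts a PROSITE-style pattern into a regular expression and reports where it matches. It uses the well-known N-glycosylation pattern N-{P}-[ST]-{P}; the converter handles only the basic syntax (x, [..], {..}, repeat counts and the < > anchors), and the test sequence is made up for this example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:                          # x(2) or x(2,4) style repeats
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                   # single residue or [..] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(sequence: str, pattern: str):
    """Return (1-based start, matched residues) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"   # PROSITE N-glycosylation site pattern
TOY_SEQUENCE = "MKNGSAANPTQ"         # made-up sequence, for illustration only
```

Calling find_sites(TOY_SEQUENCE, N_GLYCOSYLATION) reports the site starting at position 3 and skips the second asparagine, because it is followed by a proline. Short patterns like this match by chance quite often, so a hit is only a suggestion that a site may be present.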

Hands-on exercise 1: search against protein family databases

– Put the first sequence in the search box.
– Hit Search; it takes about 1 min.
– Read more about InterPro.

– Click to link to the InterPro page of this domain.
– These are individual family/domain matches not integrated in InterPro.
– Click to link to the individual database website.

– This is linked from the previous page: the InterPro page to describe IPR
– Scientific literature for this IPR family
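The same entry pages can also be retrieved programmatically. The sketch below is an assumption-laden illustration rather than part of the exercise: it assumes the current InterPro REST API at https://www.ebi.ac.uk/interpro/api (which postdates this lecture), and the helper names are hypothetical. Any accession passed in is the reader's own choice, since the slide does not give the full IPR number.

```python
import json
import re
from urllib.request import urlopen

# Assumed base URL of the InterPro REST API.
INTERPRO_API = "https://www.ebi.ac.uk/interpro/api"


def interpro_entry_url(accession: str) -> str:
    """Build the URL for one InterPro entry (accessions look like IPR + 6 digits)."""
    accession = accession.strip().upper()
    if not re.fullmatch(r"IPR\d{6}", accession):
        raise ValueError(f"not an InterPro accession: {accession!r}")
    return f"{INTERPRO_API}/entry/interpro/{accession}"


def fetch_interpro_entry(accession: str, timeout: float = 30.0) -> dict:
    """Download the entry as JSON; needs network access, so it is not called here."""
    with urlopen(interpro_entry_url(accession), timeout=timeout) as response:
        return json.load(response)
```

Only the URL builder runs without a network connection. The layout of the returned JSON is not described here and should be inspected before relying on specific fields.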

NCBI’s Conserved Domain Database (CDD): the NCBI equivalent of EBI’s InterPro; much faster, but it integrates fewer member databases.

Genome browser: ENSEMBL

The Ensembl project aims to automatically annotate genome sequences, integrate these data with other biological information and to make the results freely available to geneticists, molecular biologists, bioinformaticians and the wider research community. Ensembl is jointly headed by Dr Stephen Searle at the Wellcome Trust Sanger Institute and Dr Paul Flicek at the European Bioinformatics Institute (EBI).

What do we need in genome browsers?
To make the bare DNA sequence, its properties, and the associated annotations more accessible through a graphical interface.
Genome browsers provide access to large amounts of sequence data via a graphical user interface. They use a visual, high-level overview of complex data in a form that can be grasped at a glance and provide the means to explore the data in increasing resolution from megabase scales down to the level of individual elements of the DNA sequence.

Short tutorial videos introducing ENSEMBL

Nature 491 (1 November 2012)

Nature 458 (9 April 2009); Nature 464 (15 April 2010)

While a user may start browsing for a particular gene, the user interface will display the area of the genome containing the gene, along with a broader context of other information available in the region of the chromosome occupied by the gene. This information is shown in “tracks,” with each track showing either the genomic sequence from a particular species or a particular kind of annotation on the gene. The tracks are aligned so that the information about a particular base in the sequence is lined up and can be viewed easily. In modern browsers, the abundance of contextual information linked to a genomic region not only helps to satisfy the most directed search, but also makes available a depth of content that facilitates integration of knowledge about genes, gene expression, regulatory sequences, sequence conservation between species, and many other classes of data.
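The idea of tracks aligned on a shared coordinate system can be sketched in a few lines of Python. This is only a toy model, assuming 1-based inclusive coordinates on a single chromosome. The track names, feature labels and positions are invented for illustration and do not come from any real genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    label: str


def features_in_region(tracks: dict, start: int, end: int) -> dict:
    """For each track, keep the features that overlap the window start..end."""
    view = {}
    for name, features in tracks.items():
        view[name] = [f for f in features if f.start <= end and f.end >= start]
    return view


# Hypothetical tracks sharing one coordinate system (all values are made up).
TOY_TRACKS = {
    "genes": [Feature(100, 900, "geneA"), Feature(1500, 2200, "geneB")],
    "exons": [Feature(100, 250, "geneA exon 1"), Feature(700, 900, "geneA exon 2")],
    "conservation": [Feature(650, 950, "conserved block")],
}
```

Zooming in a browser corresponds to calling features_in_region with a narrower window: every track is filtered by the same coordinates, which is what keeps the annotations lined up base by base.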

Each of the public browsers uses a centralized model, where the web site provides access to a large public database of genome data for many species and also integrates specialized tools, such as BLAST at NCBI and Ensembl and BLAT at UCSC. The public browsers provide a valuable service to the research community by providing tools for free access to whole genome data and by supporting the complex and robust informatics infrastructure required to make the data accessible.
– Ensembl Genome Browsers:
– NCBI Map Viewer:
– UCSC Genome Browser:

Hands-on exercise 2: Ensembl gene search

– Click to link to the human page.

– Put “liver cancer” in the search box and hit Go.

– This keyword search gives everything that contains “liver cancer”.
– Click on Table to have a table view.

– This column tells the category of the entry.
– Click on the numbers to only show gene entries.

– This is the list of genes.
– Click here to show the list, then select Location and Score to show the chromosome location and the score, respectively.
– The score is calculated from the query: how similar the annotation description is to the search keyword (liver cancer).
– The first two entries on this page are ncRNA genes. Let’s try the 2nd one.

– Now it’s showing the Gene tab; there are also other tabs.
– This is the ENSEMBL Gene ID.
– This is the ENSEMBL Transcript ID.
– This is a long intergenic non-coding RNA gene.
– Link to NCBI.
– Here is the graphical representation of the gene.
– Many things can be explored.

Let’s try a protein-coding gene: LAT1, also known as SLC7A5.

– Click here.

– The three transcripts.
– Different names of the gene.
– Click to view the sequence page.
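The gene summary shown on this page can also be requested from the Ensembl REST service. The sketch below is an illustration under assumptions rather than part of the exercise: it assumes the public server at https://rest.ensembl.org and its lookup/symbol endpoint, and the helper names are hypothetical. The download function needs network access and is therefore only defined, not called.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def gene_lookup_url(symbol: str, species: str = "homo_sapiens",
                    expand: bool = False) -> str:
    """Build the lookup URL for a gene symbol; expand=True also asks for transcripts."""
    url = f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
    return (url + "?expand=1") if expand else url


def fetch_gene(symbol: str, species: str = "homo_sapiens",
               expand: bool = False, timeout: float = 30.0) -> dict:
    """Download the gene record as JSON (requires network access)."""
    request = Request(gene_lookup_url(symbol, species, expand),
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For the gene in this exercise, gene_lookup_url("SLC7A5", expand=True) gives the address to query. The returned record would be expected to carry the Ensembl Gene ID and the transcripts seen on the web page, but the exact fields should be checked against the live response.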

– Click to open a help page that explains what these highlights mean.
– Now check the expression.

– A long list; go further down to find liver and click “View in location”.

– This is where the gene is located in the whole chromosome view.
– Zoomed-in view.
– Further zoomed-in view.
– A long page below.
– The RNA-seq read stack corresponding to exons.
– Links to other genome browsers.

– This is the same region in the UCSC browser.
– PS: UCSC is much faster and easier to use and understand than ENSEMBL (does ENSEMBL have richer info?).

– From the Gene tab, clicking on Genome alignment will get you this page.
– Select 7 primates EPO and hit Go to see the whole-genome alignment of 7 primates at this gene region.

– Hit here.

– See how conserved this gene is across different primates.
– Some exons are missing in early primates.
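What "conserved" means in such an alignment can be illustrated with a simple per-column score. The Python sketch below is one possible measure, assumed here for illustration: the fraction of sequences that share the most common residue in each column, with gaps counting against conservation. The species names and sequences are invented and are not taken from the Ensembl alignment.

```python
from collections import Counter


def column_conservation(alignment: dict) -> list:
    """Per column: fraction of all sequences sharing the most common non-gap residue."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment)
    scores = []
    for column in zip(*alignment.values()):
        residues = [base for base in column if base != "-"]
        if not residues:
            scores.append(0.0)            # column is all gaps
        else:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / total)
    return scores


# Made-up toy alignment; "-" marks positions missing from one species.
TOY_ALIGNMENT = {
    "species_1": "ACGTAC",
    "species_2": "ACGTAC",
    "species_3": "AC--AC",
    "species_4": "ATGTAC",
}
```

For the toy alignment the scores are 1.0, 0.75, 0.75, 0.75, 1.0 and 1.0. A run of gaps in one species, like the missing exons noted above, lowers the score over that stretch without affecting the flanking columns.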

Next lecture: ExPASy and DTU tools