Published by Louisa Eilander. Modified over 7 years ago.
1
Note: several slides of this PowerPoint have notes attached to them. You can see these notes via, for example, ‘View’ > ‘Notes page’.
Genomics van Ziekte
2
Genomics van Ziekte
3
Information Management Public Biological Databases Examples
Prof. dr. A.H.C. van Kampen (Antoine), Bioinformatics Laboratory, Academic Medical Centre (AMC)
4
Accompanying papers
Two examples of public biological databases that demonstrate several aspects of public biological databases in general are discussed in the following papers.
- GenBank (nucleotide sequences): Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res Jan;40(Database issue):D
- UniProtKB (protein sequences): Magrane M, Consortium U. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) Mar 29;2011:bar
These papers can be considered background information.
5
Aims of this lecture
This lecture aims to
- Introduce you to the GenBank database
- Show that many aspects of GenBank are applicable to other public biological databases
- Introduce you briefly to several other databases
- Teach you how to find the most important databases
This lecture will make you aware of the large and heterogeneous amounts of information that can be retrieved from public resources, and it will give you some idea of how to use them in your own projects. Some of these databases will come back in the computer exercises or later lectures.
6
OMICS data and other information end up in public databases
7
Content
First examples
- PubMed / MEDLINE: literature
- GenBank: a nucleotide database (DNA, RNA)
How to find public biological databases?
A few other examples
- Genome browsers
- GEO: gene expression
- UniProtKB: proteins
- PDB: proteins
- Reactome: biological pathways
- STRING: protein interactions
- OMIM: Online Mendelian Inheritance in Man
- GeneCards
There are many public databases. We will first focus on GenBank as a specific example. However, many characteristics of GenBank also apply to other databases.
8
Literature databases
Not a biological database, but a comprehensive resource of biomedical information.
- MEDLINE contains journal citations and abstracts for biomedical literature from around the world.
- PubMed provides free access to MEDLINE, links to full-text articles when possible, and links to public biological databases (and vice versa).
[Diagram: biological database <-> PubMed/MEDLINE]
9
But not all GenBank records are connected to PubMed
In general it is important that information in a public database (e.g., GenBank) is linked to the literature, because this allows you to retrieve additional information about a specific piece of data, such as the (setup of the) experiment that produced it, the cells or tissues in which the measurements were made, or additional experiments whose results were not deposited in GenBank.
Miller H, Norton CN, Sarkar IN. GenBank and PubMed: How connected are they? BMC Res Notes. 2009, 2:101.
10
hemoglobin gene
- link to full text
- link to public HSDB database (HSDB = Hazardous Substances Data Bank)
11
One of the resources of the NCBI is the Bookshelf, which provides access to many online books, for example “Molecular Cell Biology”.
12
GeneReviews
Expert-authored, peer-reviewed disease descriptions that apply genetic testing to the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions.
One of the online books is GeneReviews. Published exclusively online, each GeneReviews entry is:
- peer reviewed for accuracy by (a) editorial staff experts in clinical genetics, laboratory genetics, and genetic counseling and by (b) acknowledged international subject experts;
- updated by the author(s) in a formal comprehensive process every two to three years or as needed; and
- revised by the author(s) or editorial staff whenever significant changes in clinically relevant information occur.
GeneReviews are part of the GeneTests Web site, which also includes:
- a Laboratory Directory of US and international laboratories offering molecular genetic testing, specialized cytogenetic testing, and biochemical testing for inherited disorders;
- a Clinic Directory of US and international genetics clinics providing genetic evaluation and genetic counseling;
- Resources that are consumer health-oriented organizations and disease registries.
GeneReviews is linked to other public databases such as GenBank and OMIM.
13
GenBank: Genetic Sequence Data Bank
GenBank is built and distributed by the National Center for Biotechnology Information (NCBI).
I will use GenBank as an example. Many aspects of GenBank are also valid for the other public databases that I will briefly show later!
14
GenBank: Genetic Sequence Data Bank
Daily exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage.
There are three major nucleotide databases: GenBank (at NCBI), ENA, and DDBJ. These databases contain the same information because they are synchronized daily.
Benson et al (2012) Nucleic Acids Research, 40, Database issue, D48
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
15
GenBank: Genetic Sequence Data Bank
Release / December 2011
Content
- loci (sort of sequence definition)
- bases
- reported sequences
- > formally described species
Required storage
- 568 gigabytes
Access
- Data formats: XML, ASN.1, GenBank format
- Website:
- Download: FTP
- Web services: E-utilities, SOAP
Why do we need several types of access?
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
16
Exponential growth of GenBank
Updated daily! What is the consequence of this?
The consequence is that information you may require for your project may enter the database tomorrow.
17
GenBank is organized in ‘divisions’
- Large increase in transcriptome data (e.g., RNA-seq to measure gene transcription)
- Large section with complete genomes
It is possible to download specific divisions to your own computer if you don’t require the others for your project. This saves a lot of disk space.
18
Top sequenced organisms in GenBank
Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Zea mays, Sus scrofa, Danio rerio, Strongylocentrotus purpuratus, Oryza sativa Japonica Group, Nicotiana tabacum, Xenopus (Silurana) tropicalis, Arabidopsis thaliana, Drosophila melanogaster, Pan troglodytes, Vitis vinifera, Canis lupus familiaris, Glycine max, Gallus gallus, Macaca mulatta, Solanum lycopersicum
Mammals, model organisms, fish, plants, microbes, etc.
GenBank includes many organisms, and new organisms are added very regularly (e.g., microbes). This data allows you to do comparative genomics (i.e., comparison of nucleotide sequences, genes, etc.) of different organisms.
19
GenBank is linked to other public databases and integrated with software tools
The integration with software tools is very convenient because it allows you to do some basic data analysis without downloading the data to your own computer.
20
GenBank access through the Entrez retrieval system
HBA
21
GenBank access through BLAST
Allows you to find sequences similar to an input sequence
22
An example record from GenBank
Definition
23
An example record from GenBank
Locus (576 bp mRNA)
24
An example record from GenBank
Organism (HS; including taxonomy)
25
An example record from GenBank
Accession code NM_000558; unique identifier
The primary accession number is a unique, unchanging identifier assigned to each GenBank sequence record. This identifier is used when citing information from GenBank.
26
Publication policies of scientific journals
The requirement of scientific journals that experimental data described in your paper be deposited in a public database has contributed greatly to the growth of the public databases, and also to the value of the data, because now everyone can use your data (for other purposes, or to check your conclusions).
27
Just one example of a paper in the Proceedings of the National Academy of Sciences of the USA that includes a reference (accession codes) to GenBank, in this case to cDNA sequences.
28
An example record from GenBank
Literature references (including PubMed identifier)
29
An example record from GenBank
Customize view & Tools
With this menu you can tune the information that is presented to you. Note that the user interface of GenBank (and of other databases) is regularly updated, improved, and/or changed.
30
An example record from GenBank
LinkOut
Links to many other resources/databases. This information is not part of the GenBank record itself.
31
An example record from GenBank
Related information
Links to other NCBI resources. Not all of this information is part of the GenBank record itself.
32
An example record from GenBank
Comment
- Reviewed RefSeq (curation) (this sequence was used as reference standard)
- Completeness: full length
33
An example record from GenBank
Sequence features
- Type
- Description
- Location
- Links to other databases
34
An example record from GenBank
Sequence features
- Coding sequence
- Location
- Including translation
If the open reading frame (coding sequence; CDS) can be determined, it can be automatically translated to a protein sequence, which is also included in GenBank.
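As a rough illustration of how a coding sequence can be translated automatically, the sketch below applies the standard genetic code to a CDS. The function name `translate_cds` and the short input sequence are assumptions made for this example; they are not taken from the GenBank record on the slide, and a real pipeline would more likely rely on an established library.

```python
from itertools import product

# Standard genetic code; codons are ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence up to (not including) the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical short CDS: ten codons followed by a stop codon.
print(translate_cds("ATGGTGCTGTCTCCTGCCGACAAGACCAACTAA"))
```

The function assumes the input starts at the first base of the start codon; locating the CDS within the full mRNA is the job of the ‘Location’ field in the record.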
35
An example record from GenBank
- Sequence features
- ORIGIN (sequence) (full-length mRNA)
The polyA (polyadenylation) signal is one of the signals in the DNA that determine the end of the transcription process. In this sequence we see an AATAAA polyA signal.
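Locating such a signal in a downloaded sequence is a simple string search. The following sketch assumes a plain sequence string as input; the function name and the toy sequence are made up for illustration.

```python
def find_polya_signals(sequence: str, signal: str = "AATAAA") -> list[int]:
    """Return the 1-based start position of every occurrence of the signal."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(signal)
    while start != -1:
        positions.append(start + 1)  # GenBank numbers bases starting from 1
        start = sequence.find(signal, start + 1)
    return positions


# Toy sequence (not a real GenBank entry) with two signals.
print(find_polya_signals("ccgaataaagtctgAATAAAcc"))
```

Note that finding the hexamer only shows where a candidate signal sits; whether it is actually used is a biological question the search cannot answer.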
36
An example record from GenBank
Click on ‘exon’ to highlight the exon in the sequence
37
Display ‘fasta’ format (instead of GenBank format)
- Very convenient format
- Used as input for many analysis tools
- Header includes definition / accession code
- The sequence itself contains no spaces or numbers
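Because the format is so simple (a header line starting with ‘>’ followed by sequence lines), it can be read with a few lines of code. The parser below is a minimal sketch; the record name `example_1` is hypothetical and only the sequence string is borrowed from a later slide.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_1 hypothetical record
CATTGCAATC
AATGGA
"""
print(parse_fasta(example))
```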
38
Or show graphical representation
Displays all annotated sequence features
39
Individual record, sequence, or features can be downloaded
40
I need a volunteer
The longest human gene is 2,220,223 nucleotides long. It has 79 exons with a total of only 11,058 nucleotides, which specify the sequence of the 3,685 amino acids of the protein dystrophin. Dystrophin is part of a protein complex located in the cell membrane, which transfers the force generated by the actin-myosin structure inside the muscle fiber to the entire fiber.
Get a graphical display of this gene in GenBank....
Homo sapiens dystrophin (DMD) on chromosome X: accession code NG_
41
Use FTP to download complete GenBank database or individual sections (e.g., primates)
ftp://bio-mirror.net/biomirror/genbank
Use a web browser or (preferably) an FTP client (e.g., FileZilla) to download the files.
42
Advanced: use e-Utilities to access data from your in-house developed program
43
Entrez Programming Utilities (E-utilities)
What is it?
- Set of eight server-side programs (e.g., esearch)
- Interface into the Entrez query and database system at NCBI
- Uses a URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data
- Currently includes 38 databases, e.g., nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature
How does it work?
A piece of software first posts an E-utility URL to NCBI, then retrieves the results of this posting, after which it processes the data as required. The software can thus use any computer language that can send a URL to the E-utilities server and interpret the XML response; examples of such languages are Perl, Python, Java, and C++. Combining E-utilities components to form customized data pipelines within these applications is a powerful approach to data manipulation.
44
E-utilities: example
Use the following URL in a web browser
E-utility request
- Server-side program: esearch.fcgi = text search
- GenBank database: nuccore
- Search term: HBA
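The three parts of the request listed above can be assembled into a URL programmatically. The sketch below assumes the current public E-utilities base address (`https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`), which may differ from the URL shown on the original slide; the helper name `build_esearch_url` and the default `retmax` value are choices made for this example.

```python
from urllib.parse import urlencode

# Assumed base address of the E-utilities server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch request: which database to search and for which term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}esearch.fcgi?{query}"


url = build_esearch_url("nuccore", "HBA")
print(url)
```

`urlencode` also takes care of escaping spaces and special characters, which matters for search terms longer than a single gene symbol.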
45
XML output (not for humans; for further processing by computer)
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification produced by the W3C and several other related specifications, all gratis open standards. The design goals of XML emphasize simplicity, generality, and usability over the Internet. It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services. Many application programming interfaces (APIs) have been developed for software developers to use to process XML data. Hundreds of XML-based languages have been developed, including web feeds (RSS, Atom), Simple Object Access Protocol (SOAP), and XHTML. XML-based formats have become the default for many office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork.
Source: Wikipedia
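To give an idea of what "further processing by computer" looks like, the sketch below extracts record identifiers from a small XML document shaped like an esearch response. The sample document is hand-written for this example and its identifiers (111, 222) are invented; a real response contains more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an esearch result; the IDs are invented.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""


def extract_ids(xml_text: str) -> list[str]:
    """Return the text of every <Id> element inside <IdList>."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


print(extract_ids(SAMPLE_XML))
```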
46
Use of E-utilities in software
[Diagram: the user gives an accession code as input to your software program; the program uses the E-utilities to retrieve the sequence from GenBank at NCBI (e.g., CATTGCAATCAATGGA), formats the sequence to show all cytosines in red, and displays the formatted sequence as output.]
In this example, you have developed a software program that displays nucleotide sequences with every ‘C’ depicted in red. Your program will ask its user to provide an accession code (unique identifier for a nucleotide sequence). Subsequently, your program will use one of the E-utilities programs to retrieve the corresponding sequence from the GenBank database, which is located at the NCBI (USA). Once the sequence is retrieved, your program can format and display the sequence. Alternatively, you could have developed a program that does not use the E-utilities but asks the user to provide the sequence as input. Or you could download the GenBank database and then use your own software to retrieve the sequence from your local copy of the database.
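The program described on this slide could look roughly like the sketch below. It assumes the efetch E-utility at its current public address, FASTA as the retrieval format, and ANSI escape codes for colouring terminal output; all function names are invented for this example, and error handling is left out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of the efetch E-utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
RED, RESET = "\033[31m", "\033[0m"  # ANSI colour codes for terminals


def fetch_fasta(accession: str) -> str:
    """Retrieve a nucleotide record as FASTA text (requires network access)."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    with urlopen(f"{EFETCH_URL}?{query}") as response:
        return response.read().decode("utf-8")


def sequence_from_fasta(fasta: str) -> str:
    """Drop the header line and join the sequence lines."""
    return "".join(line.strip() for line in fasta.splitlines()
                   if line and not line.startswith(">"))


def highlight_cytosines(sequence: str) -> str:
    """Wrap every C in colour codes so that it is printed in red."""
    return "".join(f"{RED}{base}{RESET}" if base.upper() == "C" else base
                   for base in sequence)


def run() -> None:
    """Ask for an accession code, fetch the sequence and display it."""
    accession = input("Accession code: ")
    print(highlight_cytosines(sequence_from_fasta(fetch_fasta(accession))))
```

Calling `run()` and entering an accession code such as NM_000558 would print the sequence with its cytosines in red. Only `fetch_fasta` touches the network, so the formatting step can be reused unchanged with a sequence typed in by the user or read from a local copy of GenBank.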
47
Summary retrieving information from GenBank
Web browser
- Different sequence formats
- Graphical representation
- Links to other resources and tools
- Download individual records
- Keyword or BLAST search
FTP (File Transfer Protocol)
- Download complete GenBank, or
- Download large GenBank sections
- Query/analyze with your own software
E-utilities
- Access GenBank from within your own software without the need to download GenBank
48
Why do we need different ways for accessing GenBank?
Can you give specific examples for each of these access modes? That is, in what situation would you use a web browser, FTP, or the E-utilities?
- 5 minutes
- Consult your neighbors
- I will ask 3 students to step forward and explain their example
49
Submitting information to GenBank
- Web browser: BankIt
- Stand-alone: Sequin
50
Verification of submission
As part of the standard submission process, GenBank staff review submissions for biological accuracy and assist authors in providing accurate annotations. If GenBank staff are unable to verify the accuracy of the submitted sequences and/or annotations, they may add a comment to the record stating that the sequence is unverified. Until the submitter is able to resolve the issues, such sequences will have the word ‘UNVERIFIED:’ at the beginning of their definition lines and will not be included in BLAST databases.
51
How to find a public biological database?
52
OMICS data and other information end up in public databases
53
Databases: molecules to systems
- Literature and ontologies: CitExplore, GO
- Genomes: Ensembl, Ensembl Genomes, EGA
- Nucleotide sequence: EMBL-Bank
- Proteomes: UniProt, PRIDE
- Gene expression: ArrayExpress
- Protein structure: PDBe
- Protein families, motifs and domains: InterPro
- Chemical entities: ChEBI, ChEMBL
- Protein interactions: IntAct
- Pathways: Reactome
- Systems: BioModels
Many of these databases can be found at the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB). In addition, many other sites exist where databases are located.
54
How to find a public database
- Large database providers, e.g., National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB)
- Nucleic Acids Research (NAR): scientific journal; January: database issue
- GeneCards
- (Google)
- (Wikipedia)
Nucleic Acids Research (NAR) publishes the results of leading-edge research into physical, chemical, biochemical and biological aspects of nucleic acids and proteins involved in nucleic acid metabolism and/or interactions.
55
NAR Molecular Biology Database Collection 2012
Currently 1380 databases!! Latest issue: 179 papers describing 96 new databases and 84 status updates. Coverage is far from exhaustive. It is estimated that there are about 3000 databases.
56
Molecular Biology Database Collection 2012
Criteria for inclusion:
- Thoroughly curated
- Of interest to a wide variety of biologists (primarily bench scientists)
- Comprehensiveness of coverage
- Degree of added value (e.g., manual curation)
- Likely to be maintained for a long period of time
57
NAR: major categories
58
NAR: Nucleotide Sequence databases
GenBank
59
NAR: Human Genes and Disease
Gene Wiki
60
Edit your favorite gene
Think about this: what are the advantages and disadvantages of using a wiki for gene annotation?
A wiki is a website whose users can add, modify, or delete its content via a web browser using a simplified markup language or a rich-text editor. Wikis are typically powered by wiki software and are often created collaboratively by multiple users. Examples include community websites, corporate intranets, knowledge management systems, and notetaking. Wikis may serve many different purposes. Some permit control over different functions (levels of access). For example, editing rights may permit changing, adding or removing material. Others may permit access without enforcing access control. Other rules may also be imposed for organizing content. Ward Cunningham, the developer of the first wiki software, WikiWikiWeb, originally described it as "the simplest online database that could possibly work". "Wiki" is a Hawaiian word meaning "fast" or "quick".
61
Primary and secondary databases
Definitions
Primary databases
- Databases consisting of data derived experimentally
- E.g., nucleotide sequences (GenBank), protein sequences (UniProtKB), three-dimensional protein structures (PDB)
Secondary databases / composite databases
- Databases whose data are derived from the analysis or treatment of primary data
- E.g., protein families and domains (PROSITE), metabolic pathways (Reactome), genome browser (Ensembl), GeneCards
62
List of 325 (!) biological pathway databases www.pathguide.org/
63
A few examples of other databases
- Hub to other databases (GeneCards)
- Human genome browser (Ensembl)
- Gene expression (Gene Expression Omnibus)
- Protein sequences (UniProtKB)
- 3D protein structures (PDB)
- Small compounds (ChEBI)
- Biological pathways (Reactome)
- Protein interactions (STRING)
- Online Mendelian Inheritance in Man (OMIM)
64
Public biological databases have been developed for all levels of the cell. In addition, there are databases that integrate data or have a more holistic (pathway) view.
- Biological pathways (Reactome)
- Protein interactions (STRING)
- Online Mendelian Inheritance in Man (OMIM)
- Human genome browser (Ensembl)
- Gene expression (Gene Expression Omnibus)
- Protein sequences (UniProtKB)
- 3D protein structures (PDB)
- Small compounds (ChEBI)
65
GeneCards: HBA1
GeneCards is a hub to other databases. You can easily retrieve information about specific genes and proteins.
GeneCards is a searchable, integrated database of human genes that provides concise genomics-related information on all known and predicted human genes. The GeneCards human gene database extracts and integrates a carefully selected subset of gene-related transcriptomic, genetic, proteomic, functional and disease information from dozens of relevant sources. It provides robust, user-friendly access to up-to-date knowledge. GeneCards overcomes barriers of data format heterogeneity, and uses standard nomenclature and approved gene symbols.
66
GeneCards: HBA1
67
Ensembl Human Genome Browser; EBI
Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes. Many genome browsers can be configured to display a large number of different tracks (e.g., repeats, GC content, genes, orthology, etc.).
68
Microarrays
69
Gene Expression Omnibus (GEO); NCBI
The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to data storage, a collection of web-based interfaces and applications are available to help users query and download the studies and gene expression patterns stored in GEO.
70
Gene Expression Omnibus (GEO); NCBI
This figure shows the relations between GPL, GSM, GSE and GDS records.
- Platform: GPLxxx
- Samples: GSMxxx
- Series: GSExxx
- DataSet: GDSxxx
- Profile
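Since the accession prefix identifies the record type, software that handles GEO accessions can dispatch on it. The sketch below is an illustration only; the function name and the validation rule (three-letter prefix followed by digits) are assumptions of this example.

```python
# Record types named on this slide, keyed by accession prefix.
GEO_PREFIXES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}


def geo_record_type(accession: str) -> str:
    """Return the GEO record type for an accession such as 'GDS3715'."""
    accession = accession.strip().upper()
    prefix, digits = accession[:3], accession[3:]
    if prefix not in GEO_PREFIXES or not digits.isdigit():
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_PREFIXES[prefix]


print(geo_record_type("GDS3715"), geo_record_type("GPL91"))
```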
71
Gene Expression Omnibus (GEO); NCBI
Insulin effect on skeletal muscle, Expression profiling by array, transformed count (110 samples) (GDS3715) Example: Affymetrix Human Genome U95A Array (GPL91) Expression of NR4A1 in 110 samples
72
GEO encourages submitters to supply MIAME compliant data
Functional Genomics Data Society
Why is this important?
It is important to describe the minimum set of information about an experiment because otherwise it will be very difficult to reproduce the experiment. Moreover, it is also important to have this information in a proper format (although MIAME is not about the format) because otherwise computer programs cannot retrieve and/or use it. Such standards are also (being) developed for other types of experiments.
73
GEO encourages submitters to supply MIAME compliant data
In our experience, the availability of the sample annotation is still a problem....
74
UniProt: expertly curated database
- UniProtKB: UniProt Knowledgebase
- UniProtKB/Swiss-Prot: curated and annotated protein sequences
- UniProtKB/TrEMBL: computationally generated records enhanced by automatic classification and annotation
- UniRef: clustered sets of sequences from the UniProt Knowledgebase (including splice variants)
- UniParc: UniProt Archive
- UniMES: Metagenomic and Environmental Sequences
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). UniProt is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). Across the three institutes around 90 people are involved through different tasks such as database curation, software development and support.
UniParc is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. Proteins may exist in different source databases and in multiple copies in the same database. UniParc avoids such redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI), making it possible to identify the same protein from different source databases. A UPI is never removed, changed or reassigned. UniParc contains only protein sequences.
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. UniMES clusters are provided in order to obtain complete coverage of sequence space at different resolutions.
75
UniRef
The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view. Unlike in UniParc, sequence fragments are merged in UniRef. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence (UniRef seed sequence).
Note: the UniGene database of the NCBI provides a clustering of nucleotide sequences.
76
UniProtKB/Swiss-Prot
Release 2012_01 of 25-Jan-12 of UniProtKB/Swiss-Prot contains sequence entries comprising amino acids abstracted from references.
77
UniProtKB / Swiss-Prot
78
Expertly curated database: Biocuration
- Function
- Structure
- Subcellular location
- Interactions with other proteins
- Domain composition
- Sequence features, e.g., active sites, post-translational modifications
- Etc.
79
Expertly curated database: Biocuration
Involves interpretation and integration of information relevant to biology into a database that enables integration of the scientific literature as well as large data sets.
Primary goals:
- accurate and comprehensive representation of biological knowledge
- easy access to this data
- basis for computational analysis
Manual curation and automatic annotation. UniProtKB consists of two sections:
- UniProtKB/Swiss-Prot: manually reviewed records with annotation extracted from the literature and curator-evaluated computational analysis
- UniProtKB/TrEMBL: computationally generated records enhanced by automatic classification and annotation
80
Biocuration: manual and automatic curation
81
Biocuration: manual and automatic curation
82
Biocuration: manual and automatic curation
83
Biocuration: manual and automatic curation
84
Example record: HBA1 protein (UniProtKB)
85
Example record: HBA1 protein (UniProtKB)
86
Example record: HBA1 protein (UniProtKB)
87
Example record: HBA1 protein (UniProtKB)
88
Example record: HBA1 protein (UniProtKB)
89
Example record: HBA1 protein (UniProtKB)
GO: Gene Ontology
The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms. For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.
Link to GO molecular function
90
Example record: HBA1 protein (UniProtKB)
91
Protein databank (PDB): 3D structures
The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works. This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.
92
Chemical Entities of Biological Interest (ChEBI)
Dictionary of molecular entities focused on ‘small’ chemical compounds.
In addition to the name of a compound: one-dimensional (1D) strings or a 2D sketch.
- Simplified Molecular Input Line Entry System (SMILES)
- IUPAC International Chemical Identifier (InChI)
Example: D-Glucose 6-sulfate
InChI=1S/C6H12O9S/c7-1-3(8)5(10)6(11)4(9) (12,13)14/h1,3-6,8-11H,2H2,(H,12,13,14)/t3-,4+,5+,6+/m0/s1
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds. The term ‘molecular entity’ refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI incorporates an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified.
ChEBI uses nomenclature, symbolism and terminology endorsed by the following international scientific bodies:
- International Union of Pure and Applied Chemistry (IUPAC)
- Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB)
Molecules directly encoded by the genome (e.g. nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI.
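An InChI string is built from layers separated by ‘/’: a version prefix, the chemical formula, and then connectivity and further layers. The sketch below pulls out the formula layer and counts its atoms, using only the first two layers of the D-Glucose 6-sulfate identifier on this slide. The function names are invented for this example, and the simple counting rule assumes a single-component formula in which each element appears once.

```python
import re


def inchi_formula(inchi: str) -> str:
    """Return the chemical-formula layer of an InChI string."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    layers = inchi[len("InChI="):].split("/")
    return layers[1]  # layers[0] is the version prefix, e.g. '1S'


def element_counts(formula: str) -> dict[str, int]:
    """Count atoms per element; a missing number means one atom."""
    return {element: int(count) if count else 1
            for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}


formula = inchi_formula("InChI=1S/C6H12O9S")
print(formula, element_counts(formula))
```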
93
Chemical Entities of Biological Interest (ChEBI)
Due to the different naming of small compounds in the literature and in databases, it is virtually impossible to integrate different resources. ChEBI aims to solve this problem.
Primary motivation: to provide a high-quality, thoroughly annotated controlled vocabulary to promote the correct and consistent use of unambiguous biochemical terminology throughout the molecular biology databases at the EBI.

Controlled vocabularies provide a way to organize knowledge (in the case of ChEBI, this is knowledge about metabolites) for subsequent retrieval. Controlled vocabularies are, for example, used in thesauri, taxonomies and other forms of knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.
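To make the idea of a controlled vocabulary concrete, here is a minimal sketch that maps free-text compound names onto one authorised term. The identifiers (EX:0001, EX:0002), the dictionary layout and the function names are made up for illustration; they are not ChEBI identifiers or a ChEBI API.

```python
# Hypothetical controlled vocabulary: the identifiers are invented.
VOCABULARY = {
    "EX:0001": {"name": "D-glucose", "synonyms": {"dextrose", "grape sugar"}},
    "EX:0002": {"name": "ethanol", "synonyms": {"ethyl alcohol", "EtOH"}},
}


def build_index(vocabulary: dict) -> dict:
    """Map every authorised name and synonym (lower-cased) to its term id."""
    index = {}
    for term_id, term in vocabulary.items():
        for label in {term["name"], *term["synonyms"]}:
            index[label.lower()] = term_id
    return index


INDEX = build_index(VOCABULARY)


def normalize(label: str):
    """Return the controlled identifier for a free-text name, or None."""
    return INDEX.get(label.strip().lower())


print(normalize("Dextrose"))      # same term as "D-glucose"
print(normalize("unknown name"))  # not in the vocabulary
```

Two resources that both store the identifier instead of a free-text name can be joined directly, which is the integration problem described above.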
94
Chemical Entities of Biological Interest (ChEBI)
> molecular entities, groups and classes.
Molecular entity: any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc.
Group: a defined linked collection of atoms, or a single atom, within a molecular entity.
Classes: e.g. ‘alkanes’, ‘alkyl groups’.
Scope: ‘biochemical compounds’, pharmaceuticals, agrochemicals, laboratory reagents, isotopes, subatomic particles.
95
Chemical Entities of Biological Interest (ChEBI)
D-Glucose 6-sulfate
96
Reactome: manually curated and peer-reviewed pathway database
Reactome is an open-source, open access, manually curated and peer-reviewed pathway database. Pathway annotations are authored by expert biologists, in collaboration with Reactome editorial staff, and cross-referenced to many bioinformatics databases. The rationale behind Reactome is to convey the rich information in the visual representations of biological pathways familiar from textbooks and articles in a detailed, computationally accessible format.
The core unit of the Reactome data model is the reaction. Entities (nucleic acids, proteins, complexes and small molecules) participating in reactions form a network of biological interactions and are grouped into pathways. Examples of biological pathways in Reactome include signaling, innate and acquired immune function, transcriptional regulation, translation, apoptosis and classical intermediary metabolism.
Reactome provides an intuitive website to navigate pathway knowledge and a suite of data analysis tools to support the pathway-based analysis of complex experimental and computational data sets. Visualisation of Reactome data is facilitated by the Pathway Browser, a Systems Biology Graphical Notation (SBGN)-based interface that supports zooming, scrolling and event highlighting. All data and software are freely available for download. Interaction, reaction and pathway data are provided as downloadable flat, MySQL, BioPAX, SBML and PSI-MITAB files and are also accessible through the Reactome Web Services APIs.
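The data model described here (reactions as the core unit, entities that participate in them, and pathways that group reactions) can be sketched in a few lines. The class and method names below are hypothetical and much simpler than Reactome's actual schema; the two reactions are the first steps of glycolysis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """Core unit: a named conversion of input entities into output entities."""
    name: str
    inputs: frozenset
    outputs: frozenset


@dataclass
class Pathway:
    """A pathway groups reactions; the network follows from shared entities."""
    name: str
    reactions: list = field(default_factory=list)

    def entities(self) -> set:
        found = set()
        for reaction in self.reactions:
            found |= reaction.inputs | reaction.outputs
        return found

    def network(self) -> set:
        """Directed edges (input entity, output entity) over all reactions."""
        return {
            (source, target)
            for reaction in self.reactions
            for source in reaction.inputs
            for target in reaction.outputs
        }


glycolysis = Pathway("glycolysis (first two steps)", [
    Reaction("hexokinase reaction",
             frozenset({"glucose", "ATP"}),
             frozenset({"glucose 6-phosphate", "ADP"})),
    Reaction("isomerase reaction",
             frozenset({"glucose 6-phosphate"}),
             frozenset({"fructose 6-phosphate"})),
])
print(sorted(glycolysis.entities()))
```

Glucose 6-phosphate is an output of the first reaction and an input of the second, and that shared entity is what links individual reactions into a network.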
97
Reactome: part of TCA cycle
98
We need standardization for the figures
For example, what do the arrows mean in this figure? Most biologists probably understand this figure even though it does not explicitly state what each arrow implies. The function of each arrow can also be explained in the figure caption or the accompanying text. However, because the arrows (and other symbols) are not explicitly defined, they are very hard for computer programs to use.
99
Reactome: visualization
Visualisation of Reactome data is facilitated by the Pathway Browser, which is based on the Systems Biology Graphical Notation (SBGN). SBGN aims to standardize the graphical notation used in maps of biological processes.
Example shown: glycolysis
100
Reactome: available in Biological Pathway Exchange (BioPax) format
BioPAX is a standard language that aims to enable integration, exchange, visualization and analysis of biological pathway data. Specifically, BioPAX supports data exchange between pathway data groups and thus reduces the complexity of interchange between data formats by providing an accepted standard format for pathway data. By offering a standard with well-defined semantics for pathway representation, BioPAX allows pathway databases and software to interact more efficiently.
In addition, BioPAX enables the development of pathway visualization from databases and facilitates analysis of experimentally generated data through combination with prior knowledge. The BioPAX effort is coordinated closely with that of other pathway-related standards initiatives, namely PSI-MI, SBML, CellML and SBGN, in order to deliver a compatible standard in the areas where they overlap.
101
Biological Pathway Exchange (BioPax)
102
Biological Pathway Exchange (BioPax)
103
Biological Pathway Exchange (BioPax)
104
Biological Pathway Exchange (BioPax)
105
Biological Pathway Exchange (BioPax)
BioPax description of ‘Biochemical Reaction’
106
Biological Pathway Exchange (BioPax)
BioPax description of ‘Catalysis’
107
Biological Pathway Exchange (BioPax)
Information is represented internally as XML
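To give a feel for why an XML representation is convenient for programs, the sketch below parses a small pathway fragment with Python's standard library. The fragment is a deliberately simplified, BioPAX-inspired example written for this illustration: real BioPAX files are OWL/RDF documents with namespaces and many more properties, so the element and attribute names here are assumptions, not the actual schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment (not valid BioPAX).
XML = """
<pathway name="glycolysis (excerpt)">
  <biochemicalReaction id="r1" name="hexokinase reaction">
    <left>glucose</left>
    <left>ATP</left>
    <right>glucose 6-phosphate</right>
    <right>ADP</right>
  </biochemicalReaction>
  <catalysis controller="HK1" controlled="r1"/>
</pathway>
"""

root = ET.fromstring(XML)

# Collect every reaction with its substrates (left) and products (right).
reactions = {}
for node in root.findall("biochemicalReaction"):
    reactions[node.get("id")] = {
        "name": node.get("name"),
        "left": [child.text for child in node.findall("left")],
        "right": [child.text for child in node.findall("right")],
        "catalysts": [],
    }

# Attach each catalysis element to the reaction it controls.
for node in root.findall("catalysis"):
    reactions[node.get("controlled")]["catalysts"].append(node.get("controller"))

print(reactions["r1"])
```

Because every participant has an explicit role (left, right, controller), a program does not have to guess what an arrow in a figure means.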
108
SBML: another exchange language
SBML includes mathematical models of pathways
109
Other exchange formats: PSI-MI and SBML
The problem is that we have many ‘standards’ for representing biological pathways. They partially overlap and complement each other. It would be great if there were only one standard that could be used in every situation.
110
STRING: known and predicted protein-protein interactions
STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 5,214,234 proteins from 1133 organisms.
111
Glucose metabolism genes
112
Hexokinase: action view (HK1)
Blue: binding
Grey: co-occurrence in literature / public databases

In this example we have retrieved information about the hexokinase (HK1) protein. HK1 converts glucose to glucose-6-phosphate. In this “action view” we see different types of relations with other proteins: blue relations and grey relations. The blue relations indicate binding between two proteins (e.g., HK1 and VDAC1). The grey relations indicate that there is only support from the literature and/or public databases, but no other evidence such as experimental information. For each relation, the table summarizes its confidence through a score.
113
Hexokinase: action view
Includes experimental evidence.
Again the action view. STRING provides additional information about specific relations, in this case the binding between HK1 and VDAC1. This relation is supported by experimental/biochemical data and by the literature (the proteins co-occur in PubMed abstracts).
114
Hexokinase: action view
No experimental evidence.
Again the action view. STRING shows that for the relation between HK1 and SORD only limited evidence is available: this relation is mentioned in a public database (KEGG; see next slide) and the proteins are co-mentioned in PubMed abstracts (see next slide).
115
Hexokinase: action view
116
Hexokinase: action view
117
Confidence view (HK1)
The STRING ‘confidence view’ summarizes the scores that reflect the evidence that exists for each of the relations.
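How several pieces of evidence might be summarized in one confidence score can be illustrated with a simple rule: treat each evidence channel as an independent probability that the relation is real, and compute the probability that at least one channel is right. This is an assumption made for illustration only; the slides do not describe STRING's actual scoring procedure, and the function name and example scores are hypothetical.

```python
from math import prod


def combined_score(channel_scores) -> float:
    """Combine per-channel scores in [0, 1], assuming independent channels.

    The result is 1 minus the probability that every channel is wrong.
    """
    scores = list(channel_scores)
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - score for score in scores)


# Hypothetical scores for one protein pair: experiments and text mining.
print(combined_score([0.5, 0.5]))
```

Under this rule, two moderately convincing channels yield a higher combined score than either alone, and a relation without any evidence scores 0.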
118
Hexokinase: evidence view
E.g., text mining, neighborhood, coexpression, etc.
The evidence view shows the individual pieces of evidence that support the relation between two proteins.
119
STITCH
STITCH is a resource to explore known and predicted interactions of chemicals and proteins. Chemicals are linked to other chemicals and proteins by evidence derived from experiments, databases and the literature. STITCH contains interactions between 300,000 small molecules and 2.6 million proteins from 1133 organisms.
120
OMIM: Online Mendelian Inheritance in Man
Knowledgebase of human genes and phenotypes.
Originally published as a book in 1966.
Content of OMIM is derived exclusively from the published biomedical literature.
Statistics (2009):
full text entries
2,239 genes have mutations causing disease
3,770 diseases have a molecular basis
70 new entries added per month
700 entries updated per month
Also includes complex traits and descriptions of the consequences of gene copy number variation and recurrent deletions and duplications.

OMIM, Online Mendelian Inheritance in Man, is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.
This database was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of mendelian traits and disorders, entitled Mendelian Inheritance in Man (MIM).
OMIM is intended for use primarily by physicians and other professionals concerned with genetic disorders, by genetics researchers, and by advanced students in science and medicine. While the OMIM database is open to the public, users seeking information about a personal medical or genetic condition are urged to consult with a qualified physician for diagnosis and for answers to personal questions.
121
OMIM: Online Mendelian Inheritance in Man
122
OMIM: Online Mendelian Inheritance in Man
Phenotype Gene
123
Example record: HBB (sickle cell anemia)
124
Example record: HBB (sickle cell anemia)
125
Programming: manipulating databases
Example: the Perl language.
Perl can be installed on your Windows machine. In this example I will use Perl installed in a Linux environment.
Regular expressions are very powerful and can be used to extract and manipulate information from public databases that you have installed on your computer. In this example, I won’t use a database but a single nucleotide sequence in FASTA format.
For an overview of regular expressions see for example
126
127
Examples of regular expressions
128
Example of substitution 1
We will use a regular expression to convert a nucleotide sequence into triplets of nucleotides (codons).
Regular expression: s/(\w{3})/$1-/g
Execute the regular expression on a file that contains the nucleotide sequence: perl -pe 's/(\w{3})/$1-/g' seq.fasta
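For readers who prefer Python, the same substitution can be sketched with the standard re module. The function names are hypothetical. One addition compared with the Perl one-liner is labeled in the code: header lines starting with ">" are passed through unchanged, since the one-liner applies the pattern to every line of the file, including the FASTA header.

```python
import re


def to_codons(sequence: str) -> str:
    # Same pattern as the Perl substitution: every run of three word
    # characters is followed by a dash.
    return re.sub(r"(\w{3})", r"\1-", sequence)


def codons_in_fasta(lines):
    """Apply to_codons to sequence lines of a FASTA file.

    Assumption (not in the Perl one-liner): header lines are left untouched.
    """
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else to_codons(line)


print(to_codons("ATGGCCATTGTA"))
print(list(codons_in_fasta([">seq1\n", "ATGGCC\n"])))
```

Triplets are counted per line, so for a sequence wrapped over several lines the reading frame restarts on each line unless the lines are joined first.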
129
Perl: try it yourself
Write a regular expression to remove the first 3 and last 4 nucleotides of our sequence.
130
Answer: regular expression to remove the first 3 and last 4 nucleotides
Regular expression: s/^\w{3}([CATG]*)\w{4}$/$1/
Execute the regular expression on a file that contains the nucleotide sequence: perl -pe 's/^\w{3}([CATG]*)\w{4}$/$1/' seq.fasta
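The same answer can be sketched in Python to see how the pattern behaves. The function name is hypothetical. Note that the middle group only accepts the letters C, A, T and G, so a line containing any other character (for example N, or lower-case letters) does not match and is returned unchanged.

```python
import re


def trim_ends(sequence: str) -> str:
    # Drop the first 3 and last 4 characters, keeping a middle part that
    # consists only of C, A, T and G (same pattern as the Perl answer).
    return re.sub(r"^\w{3}([CATG]*)\w{4}$", r"\1", sequence)


print(trim_ends("AAACCCGGGTTTT"))  # first 3 (AAA) and last 4 (TTTT) removed
print(trim_ends("AAANCCGGGTTTT"))  # contains N: no match, unchanged
```

When no validation of the middle part is needed, plain slicing with sequence[3:-4] gives the same trimming without a regular expression.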
131
Aspects of public biological databases
Many different databases are available, which cover all levels of the central dogma
Primary (e.g., GenBank) and secondary (e.g., Reactome) databases
Databases are inter-linked and linked to literature
Integrated with various database-specific tools
Updated regularly (e.g., daily)
Fast (exponential) growth of information
Accessible in multiple ways (web-browser, ftp, etc)
Downloadable in different formats (text, xml, rdf, etc)
Minimum information specifications (e.g., MIAME)
Curation and annotation (e.g., biocuration UniProt)
Entries have unique identifiers (e.g., accession code in GenBank) which can be referenced in literature or other databases
Exchange formats facilitate integration between databases and databases/software (e.g., BioPax, SBML)
Not so much public clinical data available (for obvious reasons)
132
Summary: many databases are available that you can use in your own projects
Think carefully:
Which database do you need?
Which tools do you need?
Consider overlapping/complementary databases
Be aware of data standards
Be aware of errors in data and/or annotation
The use of public biological databases comes with several pitfalls.