Metadata and Annotation with Bioconductor. Static vs. Dynamic Annotation Static Annotation: Bioconductor packages containing annotation information that.

Slides:



Advertisements
Similar presentations
“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the.
Advertisements

1 / 30 Data Mining with BioMart
NCBI data, sliding window programs and dot plots Sept. 25, 2012 Learning objectives-Become familiar with OMIM and PubMed. Understand the difference between.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
Introduction to BioConductor Friday 23th nov 2007 Ståle Nygård Statistical methods and bioinformatics for the analysis of microarray.
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
Provenance in a Collaborative Bio-database RAASWiki Donald Dunbar & Jon Manning Queen’s Medical Research Institute University of Edinburgh Use Cases for.
The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
Working with gene lists: Finding data using GEO & BioMart June 5, 2014.
EBI Proteomics Services Team – Standards, Data, and Tools for Proteomics Henning Hermjakob European Bioinformatics Institute SME forum 2009 Vienna.
Annotation, Databases, GO, Pathways,and all those things.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Gene Ontology John Pinney
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Archives and Information Retrieval
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Biological Databases Notes adapted from lecture notes of Dr. Larry Hunter at the University of Colorado.
Data Mining in Ensembl with EnsMart. 2 of 24 All genes from a candidate region Genes with a particular protein domain Members of a protein family Genes.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Data retrieval BioMart Data sets on ftp site MySQL queries of databases Perl API access to databases Export View.
Genome database & information system for Daphnia Don Gilbert, October 2002 Talk doc at
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Genome Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics.
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Bioinformatics for biomedicine
Biological Annotation in R Manchester R, 13th Nov, 2013 Nick Burgoyne Bioinformatician, fiosgenomics
BioC 2009 Database mining with biomaRt Steffen Durinck Illumina Inc.
NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015.
BioMart A Federated Query Architecture Arek Kasprzyk European Bioinformatics Institute 26 April 2004.
GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.
Biological Databases By : Lim Yun Ping E mail :
Copyright OpenHelix. No use or reproduction without express written consent1.
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006.
DAY 1c: Accessing Completed Genomes 1. UCSC Genome Bioinformatics 2. Ensembl 3. NCBI Genomic Biology.
Organizing information in the post-genomic era The rise of bioinformatics.
1 of 38 Data Mining in Ensembl with BioMart. 2 of 38 Simple Text-based Search Engine.
BIOINFORMATIK I UEBUNG 2 mRNA processing.
Professional Development Course 1 – Molecular Medicine Genome Biology June 12, 2012 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services.
Data Mining in Ensembl with BioMart Nov,
Introduction to the GO: a user’s guide Iowa State Workshop 11 June 2009.
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
Online queries to biomart webservices through the biomaRt package Steffen Durinck 1, Wolfgang Huber 2 1. NCI/NIH, Gaithersburg, Maryland, USA 2. EBI, Hinxton-Cambridge,
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
Other biological databases and ontologies. Biological systems Taxonomic data Literature Protein folding and 3D structure Small molecules Pathways and.
EnsMart: A Generic System for Fast and Flexible Access to Biological Data Arek Kasprzyk et al (2004) 14: , Genome research EBI, Wellcome Trust.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
Data Mining in Ensembl with BioMart Giulietta Spudich.
Bioinformatics and Computational Biology
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Copyright OpenHelix. No use or reproduction without express written consent1.
BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.
Copyright OpenHelix. No use or reproduction without express written consent1 1.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
Designing, Executing and Sharing Workflows with Taverna 2.4 Different Service Types Katy Wolstencroft Helen Hulme myGrid University of Manchester.
Introduction to Genes and Genomes with Ensembl
The Ensembl Database Steven Jones August 18, 2004
Data Mining with BioMart
Archives and Information Retrieval
Mangaldai College, Mangaldai
Ensembl Genome Repository.
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Metadata and Annotation with Bioconductor

Static vs. Dynamic Annotation Static Annotation: Bioconductor packages containing annotation information that are installed locally on a computer well-defined structure reproducible analyses no need for network connection Dynamic Annotation: stored in a remote database more frequent updates  possibly different result when repeating analyses more information one needs to know about the structure of the database, the API of the webservice etc.

EntrezGene is a catalog of genetic loci that connects curated sequence information to official nomenclature. It replaced LocusLink. UniGene defines sequence clusters. UniGene focuses on protein-coding genes of the nuclear genome (excluding rRNA and mitochondrial sequences). RefSeq is a non-redundant set of transcripts and proteins of known genes for many species, including human, mouse and rat. Enzyme Commission (EC) numbers are assigned to different enzymes and linked to genes through EntrezGene. Available Metadata

Gene Ontology (GO) is a structured vocabulary of terms describing gene products according to molecular function, biological process, or cellular component PubMed is a service of the U.S. National Library of Medicine. PubMed provides a rich resource of data and tools for papers in journals related to medicine and health. While large, the data source is not comprehensive, and not all papers have been abstracted Available Metadata

OMIM Online Mendelian Inheritance in Man is a catalog of human genes and genetic disorders. NetAffx Affymetrix’ NetAffx Analysis Center provides annotation resources for Affymetrix GeneChip technology. KEGG Kyoto Encyclopedia of Genes and Genomes; a collection of data resources including a rich collection of pathway data. IntAct Protein Interaction data, mainly derived from experiments. Pfam Pfam is a large collection of multiple sequence alignments and hidden Markov models covering manycommon protein domains and families. Available Metadata

Chromosomal Location Genes are identified with chromosomes, and where appropriate with strand. Data Archives The NCBI coordinates the Gene Expression Omnibus (GEO); TIGR provides the Resourcerer database, and the EBI runs ArrayExpress. Available Metadata

An early design decision was to provide metadata on a per chip-type basis (e.g. hgu133a, hgu95av2 ) Each annotation package contains objects that provide mappings between identifiers (genes, probes, …) and different types of annotation data One can list the content of a package: > library("hgu133a") > ls("package:hgu133a") [1] "hgu133a" "hgu133aACCNUM" [3] "hgu133aCHR" "hgu133aCHRLENGTHS" [5] "hgu133aCHRLOC" "hgu133aENTREZID" [7] "hgu133aENZYME" "hgu133aENZYME2PROBE" [9] "hgu133aGENENAME" "hgu133aGO" [11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE" [13] "hgu133aLOCUSID" "hgu133aMAP" [15] "hgu133aMAPCOUNTS" "hgu133aOMIM" [17] "hgu133aORGANISM" "hgu133aPATH" [19] "hgu133aPATH2PROBE" "hgu133aPFAM" [21] "hgu133aPMID" "hgu133aPMID2PROBE" [23] "hgu133aPROSITE" "hgu133aQC" [25] "hgu133aREFSEQ" "hgu133aSUMFUNC_DEPRECATED" [27] "hgu133aSYMBOL" "hgu133aUNIGENE" Annotation Packages

A little bit of history... (the pre-SQL era) before: hgu95av2now: hgu95av2.db

Objects in annotation packages used to be environments, hash tables for mapping  now things are stored in SQLite DB Mapping only from one identifier to another, hard to reverse quite unflexible The user interface still supports many of the old environment- specific interactions: You can access the data directly using any of the standard subsetting or extraction tools for environments: get, mget, $ and [[. > get("201473_at", hgu133aSYMBOL) [1] "JUNB" > mget(c("201473_at","201476_s_at"), hgu133aSYMBOL) $`201473_at` [1] "JUNB" $`201476_s_at` [1] "RRM1" > hgu133aSYMBOL$"201473_at" [1] "JUNB" > hgu133aSYMBOL[["201473_at"]] [1] "JUNB" Annotation Packages

Suppose we are interested in the gene BAD. > gsyms <- unlist(as.list(hgu133aSYMBOL)) > whBAD <- grep("^BAD$", gsyms) > gsyms[whBAD] 1861_at _at "BAD" > hgu133aGENENAME$"1861_at" [1] "BCL2-antagonist of cell death" Working with Metadata

Find the pathways that BAD is associated with. > BADpath <- hgu133aPATH$"1861_at" > kegg <- mget(BADpath, KEGGPATHID2NAME) > unlist(kegg) "Neurodegenerative Disorders" "ErbB signaling pathway" "Apoptosis" … "Colorectal cancer" "Pancreatic cancer" "Endometrial cancer" Working with Metadata

We can get the GeneChip probes and the unique EntrezGene loci in each of these pathways. First, we obtain the Affymetrix IDs > allProbes <- mget(BADpath, hgu133aPATH2PROBE) > length(allProbes) [1] 15 > allProbes[[1]][1:10] [1] "206679_at" "209462_at" "203381_s_at" "203382_s_at" [5] "212874_at" "212883_at" "212884_x_at" "200602_at" [9] "211277_x_at" "214953_s_at" > sapply(allProbes, length) Working with Metadata

And then we can map these to their Entrez Gene values. > getEG = function(x) unique(unlist(mget(x, hgu133aENTREZID))) > allEG = sapply(allProbes, getEG) > sapply(allEG, length) Working with Metadata

Data in the new.db annotation packages is stored in SQLite databases  much more efficient and flexible old environment-style access provided by objects of class Bimap (package AnnotationDbi) left object right object left object right object left object right object.db Packages

Data in the new.db annotation packages is stored in SQLite databases  much more efficient and flexible old environment-style access provided by objects of class Bimap (package AnnotationDbi) left object right object left object right object left object right object  bipartite graph name attr1 = value1 attr2 0 value2.db Packages

collection of classes and methods for database interaction they abstract the particular implementations of common standard operations on different types of databases resultSet: operations are performed on the database, the user controls how much information is returned dbSendQuery create result set dbGetQuery get all results dbGetQuery(connection, sql query) DBI

Notice that there are a few more entries here. They give you access to a connection to the database. > library("hgu133a.db") > ls("package:hgu133a.db") [1] "hgu133aACCNUM" "hgu133aALIAS2PROBE" [3] "hgu133aCHR" "hgu133aCHRLENGTHS" [5] "hgu133aCHRLOC" "hgu133aENTREZID" [7] "hgu133aENZYME" "hgu133aENZYME2PROBE" [9] "hgu133aGENENAME" "hgu133aGO" [11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE" [13] "hgu133aMAP" "hgu133aMAPCOUNTS" [15] "hgu133aOMIM" "hgu133aORGANISM" [17] "hgu133aPATH" "hgu133aPATH2PROBE" [19] "hgu133aPFAM" "hgu133aPMID" [21] "hgu133aPMID2PROBE" "hgu133aPROSITE" [23] "hgu133aREFSEQ" "hgu133aSYMBOL" [25] "hgu133aUNIGENE" "hgu133a_dbInfo" [27] "hgu133a_dbconn" "hgu133a_dbfile" [29] "hgu133a_dbschema".db Packages

> con <- hgu133a_dbconn() > q1 <- "select symbol from gene_info“ > head(dbGetQuery(con,q1)) symbol 1 A2M 2 NAT1 3 NAT2 4 SERPINA3 > toTable(hgu133aSYMBOL)[1:3,] probe_id symbol _at A2M _at NAT _at NAT2 extract information from a database table as data.frame reverse mapping > revmap(hgu133aSYMBOL)$BAD [1] "1861_at" "209364_at"

Lkeys, Rkeys : Get left and right keys of a Bimap object > head(Lkeys(hgu133aSYMBOL)) [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at" > head(Rkeys(hgu133aSYMBOL)) [1] "A2M" "NAT1" "NAT2" "SERPINA3" "AADAC" "AAMP" > table(nhit(revmap(hgu133aSYMBOL))) nhit : number of hits for every left key in a Bimap object

_dbschema() database schemata of the package e.g. hgu133a_dbschema() () summary of tables, number of mapped elements, etc. e.g. hgu133a() _dbInfo() meta information about origin of the data, chip type, etc e.g. hgu133a_dbInfo() Metadata about Metadata

> hgu133a() Quality control information for hgu133a: This package has the following mappings: hgu133aACCNUM has mapped keys (of keys) hgu133aALIAS2PROBE has mapped keys (of keys) … hgu133aSYMBOL has mapped keys (of keys) hgu133aUNIGENE has mapped keys (of keys) Additional Information about this package: DB schema: HUMANCHIP_DB DB schema version: 1.0 Organism: Homo sapiens Date for NCBI data: 2008-Apr2 Date for GO data: Date for KEGG data: 2008-Apr1 Date for Golden Path data: 2006-Apr14 Date for IPI data: 2008-Mar19 Date for Ensembl data: 2007-Oct24

Bioconductor also provides some comprehensive annotations for whole genomes (e.g. S. cerevisae). They follow a naming convention like: org.Hs.eg.db. Currently we are trying to support all widely used model organisms. These packages are like the chip annotation packages, except a different set of primary keys is used (e.g. for yeast we use the systematic names such as YBL088C) > library("YEAST.db") > ls("package:YEAST.db")[1:12] [1] "YEAST" "YEASTALIAS" [3] "YEASTCHR" "YEASTCHRLENGTHS" [5] "YEASTCHRLOC" "YEASTCOMMON2SYSTEMATIC" [7] "YEASTDESCRIPTION" "YEASTENZYME" [9] "YEASTENZYME2PROBE" "YEASTGENENAME" [11] "YEASTGO" "YEASTGO2ALLPROBES " Annotating a Genome

„old-style“ vs SQL example from GO: number of terms in the three different ontologies BP CC MF old style: > system.time(goCats <- unlist(eapply(GOTERM, Ontology))) User System Ellapsed > gCnums <- table(goCats)[c("BP","CC", "MF")] SQL: > system.time(goCats <- dbGetQuery(GO_dbconn(), "select ontology from go_term")) User System Ellapsed

KEGG provides mappings from genes to pathways We provide these in the package KEGG.db, you can also query the site directly using KEGGSOAP or other software. One problem with the KEGG is that the data is not in a form that is amenable to computation. KEGG

Data in KEGG.db package KEGGEXTID2PATHID provides mapping from either EntrezGene (for human, mouse and rat) or Open Reading Frame (yeast) to KEGG pathway ID. KEGGPATHID2EXTID contains the mapping in the other direction. KEGGPATHID2NAME provides mapping from KEGG pathway ID to a textual description of the pathway. Only the numeric part of the KEGG pathway identifiers is used (not the three letter species codes) KEGG

Consider pathway > KEGGPATHID2NAME$"00362" [1] "Benzoate degradation via hydroxylation„ Species specific mapping from pathway to genes is indicated by glueing together three letter species code, e. g. texttthsa, and numeric pathway code. > KEGGPATHID2EXTID$hsa00362 [1] "10449" "30" "3032" "59344" "83875" > KEGGPATHID2EXTID$sce00362 [1] "YIL160C" "YKR009C" Exploring KEGG

PAK1 has EntrezGene ID 5058 in humans > KEGGEXTID2PATHID$"5058" [1] "hsa04010" "hsa04012" "hsa04360" "hsa04510" "hsa04650" [6] "hsa04660" "hsa04810" "hsa05120" "hsa05211" > KEGGPATHID2NAME$"04010" [1] "MAPK signaling pathway „ We find that it is involved in 9 pathways. For mice, the MAPK signaling pathway contains > mm <- KEGGPATHID2EXTID$mmu04010 > length(mm) [1] 253 > mm[1:10] [1] "102626" "109689" "109880" "109905" "110157" "110651" [7] "114713" "11479" "11651" "11652" Exploring KEGG

The annotate package functions for harvesting of curated persistent data sources functions for simple HTTP queries to web service providers interface code that provides common calling sequences for the assay based metadata packages such as getSEQ perform web queries to NCBI to extract the nucleotide sequence corresponding to a GenBank accession number. > gsq <- getSEQ("M22490") > substring(gsq,1,40) [1] "GGCAGAGGAGGAGGGAGGGAGGGAAGGAGCGCGGAGCCCG" M22490: mapped to locus HUMBMP2B; Human bone morphogenetic protein-2B (BMP-2B) mRNA. Dynamic Annotation

other interface functions include getGO, getSYMBOL, getPMID, and getLL functions whose names start with pm work with lists of PubMed identifiers for journal articles. > hgu133aSYMBOL$"209905_at" [1] "HOXA9" > pm.getabst("209905_at", "hgu133a") $`209905_at` $`209905_at`[[1]] An object of class 'pubMedAbst': Title: Vertebrate homeobox gene nomenclature. PMID: Authors: MP Scott Journal: Cell Date: Nov 1992 The annotate Package

BioMart Generic data management system, collaboration between EBI and CSHL Several query interfaces and administration tools Conduct fast and powerful queries using:  website  webservice  graphical or text-oriented applications  software libraries written in Perl and Java.

Ensembl Joint project between EMBL-EBI and the Sanger Institute Produces and maintains automatic annotation on selected eukaryotic genomes.

Ensembl martview

VEGA The Vertebrate Genome Annotation (VEGA) database is a central repository for high quality, frequently updated, manual annotation of vertebrate finished genome sequence. Current release: Human Mouse Zebrafish Dog

WormBase WormBase is the repository of mapping, sequencing and phenotypic information for C. elegans (and some other nematodes).

WormMart

GrameneMart Gramene is a curated, open-source, Web-accessible data resource for comparative genome analysis in the grasses. Gramene: A Comparative Mapping Resource for Grains

Other databases with BioMart interfaces dbSNP (via Ensembl)‏ HapMap Sequence Mart: Ensembl genome sequences

BioMart user interfaces

MartShell MartShell is a command line BioMart user interface based on a structured query language: Mart Query Language (MQL)‏

BioMart user interfaces Martview Web based user interface for BioMart, provides functionality for remote users to query all databases hosted by the EBI's public BioMart server. MartExplorer Perl and Java libraries biomaRt interface to R/Bioconductor

The biomaRt package Developed by Steffen Durinck (started Feb 2005) Two main sets of functions: 1. Tailored towards Ensembl, shortcuts for FAQs (frequently asked queries): getGene, getGO, getOMIM Generic queries, modeled after MQL (Mart query language), can be used with any BioMart dataset Two communication protocols 1. Direct MySQL queries to BioMart database servers 2. HTTP queries to BioMart webservices more stable (across database releases); self-reflective; less firewall problems

Getting started > library(biomaRt)‏ > listMarts()‏ $biomart [1] "dicty" "ensembl" "snp" "vega" "uniprot" "msd" "wormbase" $version [1] "DICTYBASE (NORTHWESTERN)" "ENSEMBL 38 (SANGER)" [3] "SNP 38 (SANGER)" "VEGA 38 (SANGER)" [5] "UNIPROT 4-5 (EBI)" "MSD 4 (EBI)" [7] "WORMBASE CURRENT (CSHL)" $host [1] " " " [4] " " " [7] " $path [1] "" "/biomart/martservice" "/biomart/martservice" [4] "/biomart/martservice" "/biomart/martservice" "/biomart/martservice" [7] "/biomart/martservice"

Gene annotation The function getGene allows you to get gene annotation for many types of identifiers Supported identifiers are:  Affymetrix Genechip Probeset ID  RefSeq  Entrez-Gene  EMBL  HUGO  Ensembl  soon Agilent identifiers will also be available

getGene > mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")‏ > myProbes <- c("210708_x_at", "202763_at", "211464_x_at")‏ > z <- getGene(id = myProbes, array = "affy_hg_u133_plus_2", mart = mart)‏ ID symbol _at CASP _x_at CASP _x_at CASP6 description 1 Caspase-3 precursor (EC ) (CASP-3) (Apopain)... 2 Caspase-10 precursor (EC ) (CASP-10) (ICE-like apoptotic pro.. 7 Caspase-6 precursor (EC ) (CASP-6) (Apoptotic protease Mch-2)... chromosome band strand chromosome_start chromosome_end ensembl_gene_id 1 4 q ENSG q ENSG q ENSG ensembl_transcript_id 1 ENST ENST ENST

Gene annotation Note: Ensembl does an independent mapping of affy probe sequences to genomes. If there is no clear match then that probe is not assigned to a gene.

Gene annotation getGene returns a dataframe  Gene symbol  Description  Chromosome name  Band  Start position  End position  BioMartID

getGene > getGene(id = 100, type = "entrezgene", mart = mart)‏ ID symbol ADA description 1 Adenosine deaminase (EC ) (Adenosine aminohydrolase). [Source:Uniprot/SWISSPROT;Acc:P00813] chromosome band strand chromosome_start chromosome_end ensembl_gene_id 1 20 q ENSG ensembl_transcript_id 1 ENST

Other functions getGO: GO id, GO term, evidence code getOMIM (Online Mendelian Inheritance in Man, a catalogue of human genes and genetic disorders): OMIM id, Disease, BioMart id getINTERPRO (an integrated resource of protein families, domains and functional sites): Interpro id, description getSequence getSNP getHomolog

getSequence > seq <- getSequence(species="hsapiens", chromosome = 19, start = , end = , mart = mart) chromosome [1] "19" start [1] end [1] sequence "AGTCCCAGCTCAGAGCCGCAACCTGCACAGCCATGCCCGGGCAAGAACTCAGGACGGTGAATGGCTCTCAG ATGCTCCTGGTGTTGCTGGTGCTCTCGTGGCTGCCGCATGGGGGCGCCCTGTCTCTGGCCGAGGCGAGCCGC GCAAGTTTCCCGGGACCCTCAGAGTTGCACTCCGAAGACTCCAGATTCCGAGAGTTGCGGAAACGCTACGAG GACCTGCTAACCAGGCTGCGGGCCAACCAGAGCTGGGAAGATTCGAACACCGACCTCGTCCCGGCCCCTGCA GTCCGGATACTCACGCCAGAAGGTAAGTGAAATCTTAGAGATCCCCTCCCACCCCCCAAGCAGCCCCCATAT CTAATCAGGGATTCCTCATCTTGAAAAGCCCAGACCTACCTGCGTATCTCTCGGGCCGCCCTTCCCGAGGGG CTCCCCGAGGCCTCCCGCCTTCACCGGGCTCTGTTCCGGCTGTCCCCGACGGCGTCAAGGTCGTGGGACGTG ACACGACCGCTGCGGCGTCAGCTCAGCCTTGCAAGACCCCAGGCGCCCGCGCTGCACCTGCGACTGTCGCCG CCGCCGTCGCAGTCGGACCAACTGCTGGCAGAATCTTCGTCCGCACGGCCCCAGCTGGAGTTGCACTTGCGG CCGCAAGCCGCCAGGGGGCGCCGCAGAGCGCGTGCGCGCAACGGGGACCACTGTCCGCTCGGGCCCGGGCGT TGCTGCCGTCTGCACACGGTCCGCGCGTCGCTGGAAGACCTGGGCTGGGCCGATTGGGTGCTGTCGCCACGG GAGGTGCAAGTGACCATGTGCATCGGCGCGTGCCCGAGCCAGTTCCGGGCGGCAAACATG....

SNP Single Nucleotide Polymorphisms (SNPs) are common DNA sequence variations among individuals. e.g. AAGGCTAA and ATGGCTAA biomaRt uses the SNP mart of Ensembl which is obtained from dbSNP

getSNP > getSNP(chromosome = 8, start = , end = , mart = mart)‏ tsc refsnp_id allele chrom_start chrom_strand 1 TSC rs C/A TSC rs C/A TSC rs A/G rs13291 C/T TSC rs C/T rs C/T rs C/T rs T/G TSC rs G/A rs G/A TSC rs G/A rs C/T rs G/A TSC rs C/T rs G/A

Homology mapping The getHomolog function enables mapping of many types of identifiers from one species to the same or another type of identifier in another species.

getHomolog > from.mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")‏ > to.mart = useMart("ensembl", dataset = "mmusculus_gene_ensembl")‏ > getHomolog(id = 2, from.type = "entrezgene", to.type = "refseq", + from.mart = from.mart, to.mart = to.mart)‏ V1 V2 V3 1 ENSMUSG ENSMUST NM_ ENSMUSG ENSMUST NM_ ENSMUSG ENSMUST NM_ ENSMUSG ENSMUST NM_ ENSMUSG ENSMUST ENSMUSG ENSMUST NM_007376

Find (microarray) probes of interest getFeature function Filter on: gene location symbol OMIM GO

getFeature > getFeature(symbol = "BRCA2", array = "affy_hg_u133_plus_2", mart = mart)‏ hgnc_symbol affy_hg_u133_plus_2 1 BRCA _s_at > getFeature(chromosome = 1, start = , end = , type = "entrezgene", + mart = mart)‏ ensembl_transcript_id chromosome_name start_position end_position entrezgene 1 ENST ENST ENST ENST ENST ENST ENST ENST ENST ENST NA 11 ENST NA

getFeature Select all RefSeq id’s involved in diabetes mellitus: >getFeature(OMIM="diabetes mellitus", type="refseq", species="hsapiens", mart=mart) ‏

Ensembl Cross-references Powerful function to map between all possible cross-references in Ensembl Can for example be used to map between different Affymetrix arrays

Ensembl Cross-references getPossibleXrefs  Retrieves all possible cross-references > xref <- getPossibleXrefs(mart = mart)‏ > xref[1:10, ] species xref [1,] "agambiae" "embl" [2,] "agambiae" "pdb" [3,] "agambiae" "prediction_sptrembl" [4,] "agambiae" "protein_id" [5,] "agambiae" "uniprot_accession" [6,] "agambiae" "uniprot_id"

Ensembl Cross-references >xref = getXref(id="1939_at", from.species="hsapiens", to.species = "mmusculus", from.xref = "affy_hg_u95av2", to.xref = "affy_mouse430_2", mart=mart)‏

The generic interface: the getBM function

useDataset > library(biomaRt)‏ > mart <- useMart("ensembl")‏ > listDatasets(mart)‏ dataset version 1 rnorvegicus_gene_ensembl RGSC3.4 2 scerevisiae_gene_ensembl SGD1 3 celegans_gene_ensembl CEL150 4 cintestinalis_gene_ensembl JGI2 5 ptroglodytes_gene_ensembl CHIMP1A 6 frubripes_gene_ensembl FUGU4 7 agambiae_gene_ensembl AgamP3 8 hsapiens_gene_ensembl NCBI36 9 ggallus_gene_ensembl WASHUC1 10 xtropicalis_gene_ensembl JGI drerio_gene_ensembl ZFISH5....(more)... > mart <- useDataset(dataset = "hsapiens_gene_ensembl", mart = mart)‏

getBM > getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol"), filter = "affy_hg_u95av2", values = c("1939_at", "1000_at"), mart = mart)‏ affy_hg_u95av2 hgnc_symbol _at MAPK _at TP53 mart – an object describing the database connection and the dataset attributes – the name of the data you want to obtain filter – the name of the data by which you want to filter from the dataset values – values to filter on

Locally installed BioMarts Main use case currently is to use biomaRt to query public BioMart servers over the internet But you can also install BioMart server locally, populated with a copy of a public dataset (particular version), or populated with your own data Versioning is supported by naming convention

Installation bioMart depends on R packages Rcurl, XML, which require additional system libraries (libcurl, libxml2)‏ RMySQL package is optional Platforms on which biomaRt has been installed: Linux Mac OS X Windows

Discussion Using biomaRt to query public webservices gets you started quickly, is easy and gives you access to a large body of metadata in a uniform way Need to be online Online metadata can change behind your back; although there is possibility of connecting to a particular, immutable version of a dataset Watch this space – implementation of Bioconductor metadata packages is changing and improving! using the familiar packaging and versioning system