Databases in Biomart format: Ensembl, HapMap, HTGT, HGNC, Dictybase, Wormbase, Gramene, Europhenome, UniProt, Rat Genome Database, DroSpeGe, ArrayExpress.

1 Databases in Biomart format: Ensembl, HapMap, HTGT, HGNC, Dictybase, Wormbase, Gramene, Europhenome, UniProt, Rat Genome Database, DroSpeGe, ArrayExpress DW, Eurexpress, GermOnLine, PRIDE, PepSeeker, VectorBase, Pancreatic Expression Database, Reactome, EU Rat Mart, Paramecium DB
BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI).
Open Source – LGPL
* Perl API: Web Interface, Web Services Interface, REST API
* Java API: Mart Explorer GUI, MartShell
* 3rd Party Software: Bioclipse, biomaRt-BioConductor, Cytoscape, Galaxy, Taverna, WebLab

2 A Mart is a collection of datasets (roughly equivalent to a database). Marts are optimised for querying. A Dataset has a main table, with an entry (and primary key) for each of the items of interest in that dataset (e.g. Mouse Transcripts). Related pieces of information about these items hang off the main table in dimension tables (e.g. the Affy IDs corresponding to this gene). More Info:
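
The main-table/dimension-table layout can be pictured with two small data frames in base R. This is only a sketch: the table names, column names and IDs below are invented for illustration and are not the real BioMart schema.

```r
# Hypothetical miniature "dataset": a main table with one row per transcript
# and a dimension table holding zero or more Affy probe IDs per transcript.
main <- data.frame(
  transcript_key = 1:3,
  transcript_id  = c("tx_A", "tx_B", "tx_C"),
  stringsAsFactors = FALSE
)
affy_dim <- data.frame(
  transcript_key = c(1L, 1L, 3L),
  affy_id        = c("probe_1", "probe_2", "probe_3"),
  stringsAsFactors = FALSE
)

# A left join keeps every main-table entry, repeats it once per matching
# dimension row, and fills in NA where nothing matches.
joined <- merge(main, affy_dim, by = "transcript_key", all.x = TRUE)
print(joined)
```

Because one main-table entry can match several dimension rows, a query that asks for a dimension attribute may return more rows than there are items in the main table.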

3 Ensembl annotates everything at the transcript level.
[Diagram: Ensembl_transcript_1, Ensembl_transcript_2 and Ensembl_transcript_3, listed with their AffyID (1939_at, ...), Ensembl transcript ID (ENST...) and HUGO Symbol (TP53)]
Affy IDs are mapped by Ensembl. If there is no clear match then that probe is not assigned to a gene.

4 Web Interface: Choose a Database (mart) to query (e.g. Ensembl). Choose a Dataset from that mart to query (e.g. Mus musculus Genes).

5 Filters: Use filters to select the members of the dataset in which you're interested, e.g. limit to miRNA genes from Chr1.

6 Attributes: Use attributes to define which pieces of information you want to retrieve about the members of the dataset, e.g. Gene ID, Transcript ID, Start, End and Status.
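
As a rough analogy in base R, filters behave like a row selection and attributes like a column selection on a table. The data frame below is a made-up stand-in for a gene dataset. Its column names and values are assumptions for illustration, not real Ensembl records.

```r
# Made-up stand-in for a gene dataset; values are illustrative only.
genes <- data.frame(
  gene_id    = c("g1", "g2", "g3", "g4"),
  chromosome = c("1", "1", "2", "1"),
  biotype    = c("miRNA", "protein_coding", "miRNA", "miRNA"),
  start      = c(100L, 500L, 50L, 900L),
  end        = c(180L, 2500L, 120L, 990L),
  stringsAsFactors = FALSE
)

# Filters choose rows: miRNA genes on chromosome 1.
keep <- genes$chromosome == "1" & genes$biotype == "miRNA"

# Attributes choose columns: which fields to report for those rows.
wanted <- c("gene_id", "start", "end")

result <- genes[keep, wanted]
print(result)
```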

7 Results:


9 Bioconductor setup:
    source("")
    # Default package set
    biocLite()
    # OR
    biocLite(someBiocPkg)
    # OR
    biocLite(groupName=pkgGroupName)
Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.

10 Core Packages: affy, affydata, affyPLM, annaffy, annotate, Biobase, Biostrings, DynDoc, gcrma, genefilter, geneplotter, hgu95av2.db, limma, marray, matchprobes, multtest, ROC, vsn, xtable, affyQCReport.
Alternative Package Groups: lite, affy, graph, all
Full Package Listing (software)
Full Package Listing (annotation)

11 Querying biomart from R:
    # Install library
    source("")
    biocLite('biomaRt')
    # Load library
    library(biomaRt)
    listMarts()
    # result is just a data.frame, so you can subset it:
    listMarts()[1:5,]
    # or search it:
    grep('ensembl', listMarts()[,1], value=TRUE)

12 Selecting a mart and a dataset:
    # Select a mart
    mart <- useMart('ensembl')
    # List the available datasets (returns data.frame)
    listDatasets(mart)
    # Select a dataset
    mart <- useDataset('mmusculus_gene_ensembl', mart=mart)
    # Both in one:
    mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')

13 Filters, attributes and a simple query:
    # Available Filters (returns data.frame)
    listFilters(mart)
    # Available Attributes (returns data.frame)
    listAttributes(mart)
    # A Simple Query
    getBM(filters=c('ensembl_gene_id'),
          values=c('ENSMUSG ','ENSMUSG '),
          attributes=c('ensembl_gene_id', 'ensembl_transcript_id',
                       'transcript_start', 'transcript_end'),
          mart=mart)
Result columns: ensembl_gene_id ensembl_transcript_id transcript_start transcript_end (one row per ENSMUSG/ENSMUST pair)

14 Multiple filters:
    # If using multiple filters, values should be a list
    # If the chromosome_name, start and end filters are used, they are
    # automatically interpreted as 'search within this region'
    getBM(filters=c('chromosome_name', 'start', 'end'),
          values=list(10, , ),
          attributes=c('ensembl_gene_id', 'start_position', 'end_position'),
          mart=mart)
Result columns: ensembl_gene_id start_position end_position (one row per ENSMUSG gene)
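
Why a list rather than a plain vector? A short base R sketch of the difference follows. The helper `check_query` is hypothetical (it is not part of biomaRt), and the filter names and values are only illustrative.

```r
filters <- c("chromosome_name", "biotype")

# c() flattens everything into one vector, so the grouping of values per
# filter is lost: nothing says that "1" and "2" both belong to the first filter.
flat <- c(c("1", "2"), "miRNA")
length(flat)

# list() keeps one element per filter, and each element may hold several values.
values <- list(c("1", "2"), "miRNA")

# Hypothetical sanity check to run before passing filters/values to getBM().
check_query <- function(filters, values) {
  if (!is.list(values)) {
    stop("with several filters, 'values' must be a list")
  }
  if (length(filters) != length(values)) {
    stop("need exactly one element of 'values' per filter")
  }
  setNames(values, filters)
}

query <- check_query(filters, values)
str(query)
```

A list also avoids type coercion: `c(10, "miRNA")` silently turns the number into a string, whereas `list(10, "miRNA")` keeps each value's type.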

15 Filter types and categories:
    # Filters can be either numeric, string or boolean.
    # Boolean filters need a TRUE or FALSE value
    # Determine type of filter with:
    filterType('with_unigene', mart)
    # Attributes and filters are organised into categories
    # To get a list of the categories:
    attributeSummary(mart)
    filterSummary(mart)
    # You can then list attributes and filters limited to a
    # specified category:
    listAttributes(mart, category='Variations')

16 Archived versions:
    # Older versions of Ensembl are archived, which is useful if you've
    # got genome positions relative to a previous build
    old.mart <- useMart('ensembl_mart_46',
                        dataset='mmusculus_gene_ensembl',
                        archive=TRUE)

17 Retrieving Sequences:
    # This can get complicated with getBM. Use the getSequence wrapper
    # Genome sequences are always 5'-3' but...
    # Web-Services mode (default): strand is context dependent
    # MySQL mode: always top strand
    # e.g.
    # BRCA1 peptide sequence from gene symbol
    getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart=mart)
    # REST transcript 20 bases upstream
    getSequence(id='ENSMUST ', type='ensembl_transcript_id',
                seqType='transcript_flank', upstream=20, mart=mart)
    # Chromosome 4 100,000, ,000,010
    getSequence(chromosome=4, start= , end= , mart=mart,
                seqType="gene_exon", type="ensembl_gene_id")
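
If a sequence comes back on the top strand and you need the other strand, the conversion is a reverse complement. A minimal base R sketch follows. The function name `reverse_complement` and the example sequence are invented for illustration. It handles only the letters A, C, G and T, and passes anything else (such as N) through unchanged.

```r
# Reverse-complement a DNA string using base R only.
reverse_complement <- function(dna) {
  # Swap each base for its complement, preserving case.
  complemented <- chartr("ACGTacgt", "TGCAtgca", dna)
  # Reverse the order of the characters and glue them back together.
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

top_strand    <- "ATGGCCTTA"   # made-up sequence, read 5'->3' on the top strand
bottom_strand <- reverse_complement(top_strand)
print(bottom_strand)
```

Applying the function twice returns the original string, which makes a convenient sanity check.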

18 seqTypes: Note that any of the _flank types need an 'upstream' or 'downstream' argument to determine the size of the flanking region. At the moment, you can't specify both.

19 Exporting Sequences:
    # The exportFASTA function provides a quick way of saving
    # sequences in FASTA format:
    res <- getSequence(id="BRCA1", type="mgi_symbol",
                       seqType="peptide", mart=mart)
    exportFASTA(res, file='sequence.fa')
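
For readers unfamiliar with the format, FASTA is a ">identifier" header line followed by the sequence. The base R sketch below writes such a file from a data frame. It is not how exportFASTA is implemented. The helper `write_fasta`, the column layout and the sequences are all assumptions made for illustration.

```r
# Hypothetical stand-in for a sequence query result: one column of sequences
# and one column of identifiers (values invented for illustration).
seqs <- data.frame(
  peptide    = c("MKTAYIAKQR", "MSDNELLQ"),
  mgi_symbol = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Write one ">id" header line followed by the sequence for each row.
write_fasta <- function(df, id_col, seq_col, file) {
  # rbind() stacks headers over sequences; as.vector() then reads the matrix
  # column by column, which interleaves header 1, sequence 1, header 2, ...
  lines <- as.vector(rbind(paste0(">", df[[id_col]]), df[[seq_col]]))
  writeLines(lines, file)
  invisible(file)
}

out <- tempfile(fileext = ".fa")
write_fasta(seqs, "mgi_symbol", "peptide", out)
cat(readLines(out), sep = "\n")
```

This sketch puts each sequence on a single line and does not wrap long sequences.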

20 Linking Datasets...
    # Make mart connections for each of the datasets:
    mouse.mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
    people.mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
    # In Ensembl, datasets are made of transcripts
    # from a single species.
    # Linking datasets amounts to homology
    # e.g. get the position of the mouse homolog of the human 'TP53' gene
    getLDS(attributes=c("hgnc_symbol", "chromosome_name", "start_position"),
           filters="hgnc_symbol", values="TP53", mart=people.mart,
           attributesL=c("chromosome_name", "start_position"),
           martL=mouse.mart)
Result: a single row with columns V1 to V5 (the human symbol, chromosome and start position, followed by the mouse homolog's chromosome and start position)

21 Pretty HTML Output:
    library(annotate)
    # Provides the htmlpage function. Salient args are:
    # genelist – a list or dataframe of IDs to be made into links
    # filename
    # title – for the table
    # othernames – a list of other things to add to the table as is
    # table.head – a character vector of col headers for the table.
    # repository – a list of repositories to use for creating links
    ids <- c('ENSMUSG ','ENSMUSG ')
    genelist <- getBM(attributes=c('uniprot_swissprot_accession', 'entrezgene'),
                      filters='ensembl_gene_id', values=ids,
                      output='list', na.value=' ', mart=mart)
    othernames <- getBM(attributes=c('ensembl_gene_id', 'mgi_symbol', 'description'),
                        filters='ensembl_gene_id', values=ids,
                        output='list', na.value='&nbsp;', mart=mart)
    htmlpage(genelist=genelist, othernames=othernames, title='Some Genes',
             table.head=c('Uniprot', 'Entrezgene', 'Ensembl', 'Name', 'Description'),
             repository=list('sp', 'en'), filename='genes.html')
    # Note that all the lists are expected to be in the right order
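
One way to get results into "the right order" is to re-align them to the order of the IDs you asked for, since a query result is not guaranteed to come back in input order. A base R sketch using match() follows, with invented IDs and symbols. It assumes every requested ID appears exactly once in the result.

```r
ids <- c("id_B", "id_A", "id_C")   # the order you asked for

# Stand-in for a query result that came back in a different order.
res <- data.frame(
  gene_id = c("id_A", "id_C", "id_B"),
  symbol  = c("alpha", "gamma", "beta"),
  stringsAsFactors = FALSE
)

# match() gives, for each requested ID, its row position in the result.
aligned <- res[match(ids, res$gene_id), ]
rownames(aligned) <- NULL

# Fail loudly if the alignment did not work, e.g. because an ID is missing.
stopifnot(identical(aligned$gene_id, ids))
print(aligned)
```

If an ID is absent from the result, match() yields NA for it and the check stops with an error, which is safer than silently building a misaligned table.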


23 More Info...
Bioconductor Mailing List:
biomaRt Users' Guide: vignette('biomaRt')
Biomart Website
Slides & examples:
