Biological Annotation in R Manchester R, 13th Nov, 2013 Nick Burgoyne Bioinformatician, fiosgenomics

Slides:



Advertisements
Similar presentations
“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the.
Advertisements

1 / 30 Data Mining with BioMart
Oncomine Database Lauren Smalls-Mantey Georgia Institute of Technology June 19, 2006 Note: This presentation contains animation.
Genomic Innovations- Orthology Paralogy. Genomic innovation.
Working with gene lists: Finding data using GEO & BioMart June 5, 2014.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Gene Ontology John Pinney
Welcome to mini-symposium on ontologies for biological sample description EMBL-EBI Wellcome Trust Genome Campus Deceber 5, 2001.
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
Bioinformatics Needs for the post-genomic era Dr. Erik Bongcam-Rudloff The Linnaeus Centre for Bioinformatics.
Genome Browsers Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Biological Databases Notes adapted from lecture notes of Dr. Larry Hunter at the University of Colorado.
Genome Related Biological Databases. Content DNA Sequence databases Protein databases Gene prediction Accession numbers NCBI website Ensembl website.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Workshop in Bioinformatics 2010 Class # Class 8 March 2010.
Genomic Database - Ensembl Ka-Lok Ng Department of Bioinformatics Asia University.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Data retrieval BioMart Data sets on ftp site MySQL queries of databases Perl API access to databases Export View.
Human Genome Project Seminal achievement. Scientific milestone. Scientific implications. Social implications.
Presented by Karen Xu. Introduction Cancer is commonly referred to as the “disease of the genes” Cancer may be favored by genetic predisposition, but.
Title: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes By Peter F. Hallin, Hans-Henrik Stærfeldt, Eva Rotenberg, Tim T. Binnewies,
BioC 2009 Database mining with biomaRt Steffen Durinck Illumina Inc.
1 Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Intralab Workshop - Reactome CMAP Chang-Feng Quo June 29 th, 2006.
Copyright OpenHelix. No use or reproduction without express written consent1.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Managing Data Modeling GO Workshop 3-6 August 2010.
Taverna Workflow. A suite of tools for bioinformatics Fully featured, extensible and scalable scientific workflow management system – Workbench, server,
DAY 1c: Accessing Completed Genomes 1. UCSC Genome Bioinformatics 2. Ensembl 3. NCBI Genomic Biology.
ARK-Genomics: Centre for Comparative and Functional Genomics in Farm Animals Richard Talbot Roslin Institute and R(D)SVS University of Edinburgh Microarrays.
1 of 38 Data Mining in Ensembl with BioMart. 2 of 38 Simple Text-based Search Engine.
Browsing the Genome Using Genome Browsers to Visualize and Mine Data.
Strategies for functional modeling TAMU GO Workshop 17 May 2010.
Data Mining in Ensembl with BioMart Nov,
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Pfam, DAS and the future Rob Finn DAS Workshop 2009.
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
1 Genomics The field of biology based on studying the entire DNA sequence of an organism - its “genome”. Genomics tools don’t replace classical genetics.
Biological Networks & Systems Anne R. Haake Rhys Price Jones.
Data Mining in Ensembl with BioMart Giulietta Spudich.
GVS: Genome Variation Server Materials prepared by: Warren C. Lathe, PhD Updated: Q Version 2.
ID Mapping to accessions from different databases. COST Functional Modeling Workshop April, Helsinki.
Bioinformatics and Computational Biology
EBI is an Outstation of the European Molecular Biology Laboratory. Gautier Koscielny VectorBase Meeting 08 Feburary 2012, EBI VectorBase Text Search Engine.
ArrayExpress - a Public Repository for Microarray Based Gene Expression Data European Bioinformatics Institute - EMBL outstation and German Cancer Research.
BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
Genomes at NCBI. Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools lists 57 databases.
Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
CCLE Cancer Cell Line Encyclopedia Alexey Erohskin.
GEO (Gene Expression Omnibus) Deepak Sambhara Georgia Institute of Technology 21 June, 2006.
Designing, Executing and Sharing Workflows with Taverna 2.4 Different Service Types Katy Wolstencroft Helen Hulme myGrid University of Manchester.
Getting GO annotation for your dataset
Data Mining with BioMart
Optimizing Biological Data Integration
Access to Sequence Data and Related Information
ID Mapping tools: Converting Accessions between Databases
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Ensembl Genome Repository.
Gene Safari (Biological Databases)
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Cancer Cell Line Encyclopedia
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Biological Annotation in R Manchester R, 13th Nov, 2013 Nick Burgoyne Bioinformatician, fiosgenomics

Bioinformatics Information management in biology Application of statistics to experimental data Creation of databases to store it Post-genomic era ▫Sequence of a human genome exists ▫Sequence of many different species exist ▫New human genomes can generated in 2 weeks

Typical investigations Sequence data ▫What mutations are relevant for a sample? ▫Which differences correlate with a trait? Microarray ▫Which genes (when expressed) can be used as a classifier Generally ▫What other information is known, what is the structure of the gene in question?

Annotation Databases European bioinformatics institute ▫ebi.ac.uk National Centre for Biological information ▫ncbi.nlm.nih.govncbi.nlm.nih.gov Ensembl ▫ensembl.orgensembl.org Catalogue of Somatic Mutations in Cancer ▫cancer.sanger.ac.uk Mouse Genome Informatics ▫informatics.jax.orginformatics.jax.org FlyBase ▫flybase.org

AnnotationDbi Microarrays ▫10-100k features (genes, parts of genes) ▫Hope for ~10% genes of interest  Only care about a few  Sets of genes have known function together ▫Mapping them to information  Names, Symbols, Homologs  Functions, pathways they act in, function

Arrays and AnnotationDbi General database interface for annotation data #A definition of species specific data >require(org.Hs.eg.db) #A set of data that is relevant to your microarray >require(hgu95av2.db) #A vector of probes of relevance >probes [1] "31882_at" "38780_at" "37033_s_at" "1702_at" "31610_at”

Arrays and AnnotationDbi #Access information from these probes, symbols >probes <- c(“31882_at”, “38780_at”, “37033_s_at”, “1702_at”, “31610_a”) >unlist(mget(probes, hgu95av2SYMBOL)) 31882_at 38780_at 37033_s_at 1702_at 31610_at "RRP9" "AKR1A1" "GPX1" "IL2RA" "PDZK1IP #Access other aspects >ls("package:hgu95av2.db ") …[5] “hgu95av2CHR“ #The chromosome …[7] "hgu95av2CHRLOC“ #The location on the chromosome …[15] “hgu95av2GO” #The functions of this probe

AnnotationDbi and BioBase etc Set of tools built around AnnotationDbi Allows for the annotation and analysis of function simply and easily Most array types are catered for Species specific data also exist (most model species) Even if the database doesn’t exist your species, but is present in the ncbi repositories >library(AnnotationForge)

What if I’m interested in other data? Other questions, such as the: ▫Is my mutant associated with Cancer? ▫Does it correspond to a known disorder? ▫What is the sequence of this region of the DNA? Hope that the data source you are using is on the biomart network! ▫ ▫46 databases, common interface, defined structure ▫Bioconductor: biomaRt

biomaRt >library(biomaRt) #What databases are available? >head(listMarts()) biomart version 1 ensembl ENSEMBL 52 GENES (SANGER UK) 2 snp ENSEMBL 52 VARIATION (SANGER UK) 3 vega VEGA 33 (SANGER UK) 4 msd MSD PROTOTYPE (EBI UK) 5 uniprot UNIPROT PROTOTYPE (EBI UK) 6 htgt HIGH THROUGHPUT GENE TARGETING AND TRAPPING (SANGER UK) #Choose a database >mart = useMart("CosmicMart")

biomaRt #What datasets are avaliable for each mart? >head(listDatasets(mart) dataset description version 1 COSMIC65 COSMIC65 2 COSMIC66 COSMIC66 3 COSMIC64 COSMIC64 cos65 = useDataset("COSMIC65", mart = mart)

biomaRt #How to filter the data > head(listFilters(cos65)) name description 1 id_sample Sample ID 2 sample_name Sample Name 3 sample_source Sample Source 4 samp_gene_mutated Mutated Sample 5 tumour_source Tumour Source 6 id_gene Gene ID

biomaRt #List the data that is available head(listAttributes(cos65)) name description 1 id_sample COSMIC Sample ID 2 sample_name Sample Name 3 sample_source Sample Source 4 tumour_source Tumour Source 5 id_gene Gene ID 6 gene_name Gene Name

biomaRt #A set of gene names genes <- c("KRAS", "BRAF") #Get a list of cancers where these are mutated mutations <- getBM(attributes = c("id_gene", "gene_name", + "site_primary"), filters = "gene_name", value = genes, mart = + cos65) > head(mutations) id_gene gene_name site_primary 1 4 KRAS endometrium 2 4 KRAS pancreas 3 4 KRAS biliary_tract 4 2 BRAF large_intestine 5 2 BRAF thyroid 6 4 KRAS large_intestine

biomaRt genes <- "KRAS" sites <- "prostate" #Mutation data associated with KRAS and prostate cancer mutations <- getBM(attributes = c("gene_name", "site_primary", "cds_mut_syntax", + "aa_mut_syntax", "tumour_source"), filters = c("gene_name", "site_primary"), value = + list(genes, sites), mart = cos66) gene_name site_primary cds_mut_syntax aa_mut_syntax tumour_source 1 KRAS prostate c.182A>G p.Q61R primary 2 KRAS prostate c.34G>T p.G12C NS 3 KRAS prostate c.38G>A p.G13D NS 4 KRAS prostate c.34G>A p.G12S NS 5 KRAS prostate c.181C>A p.Q61K NS 6 KRAS prostate c.37G>A p.G13S NS 7 KRAS prostate c.35G>A p.G12D NS 8 KRAS prostate c.35G>A p.G12D primary 9 KRAS prostate c.35G>T p.G12V NS 10 KRAS prostate c.35G>T p.G12V primary 11 KRAS prostate c.35G>T p.G12V metastasis

biomaRt Always request the filter value as an attribute ▫Not all values you search for exist ▫Results will be in random orders ▫The search parameters are totally excluded Don’t expect the most up-to date answers ▫Often different (older) data than available directly ▫COSMIC is actually very good The connection can be very flaky ▫Only partial data sets may be returned