BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

1 BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005

2 BioMart
A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
Aim
– To develop a simple and scalable data management system capable of integrating distributed data sources.

3 Challenges
Data sources
– Large
– Distributed
– Different data

4 Requirements
User
– All data accessible through a single set of interfaces
– Suitable for power biologists and bioinformaticians
Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
Architecture
– Distributed
– Domain agnostic
– Platform independent

5 Query Engine Federated architecture

6 BioMart Data mart User interfaces Data sources

7 Data mart and dataset Dataset

8 Data mart, dataset and schema Schema

9 Dataset Configuration XML

10 BioMart abstractions
Dataset
– A subset of data organized into 1 or more tables
Attribute
– A single data point
– e.g. gene name
Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’

11 Datasets, Attributes and Filters
GENE table: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
Diagram labels: Mart, Dataset, Attribute, Filter
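
The three abstractions can be sketched in a few lines of Python. This is an illustrative model only, not the BioMart API: the `query` helper and the sample rows are hypothetical, and only the column names are taken from the GENE table on the slide.

```python
# Illustrative model of Dataset / Attribute / Filter; not the BioMart API.
# Sample rows are made up; column names follow the GENE table on the slide.
GENE = [
    {"gene_stable_id": "G1", "chromosome": "1", "gene_start": 100},
    {"gene_stable_id": "G2", "chromosome": "2", "gene_start": 250},
    {"gene_stable_id": "G3", "chromosome": "1", "gene_start": 900},
]


def query(dataset, attributes, filters):
    """Return the requested attributes for rows that satisfy every filter.

    dataset    -- list of row dicts (the dataset)
    attributes -- column names to return (the attributes)
    filters    -- {column: required value}; equality filters only
    """
    return [
        tuple(row[a] for a in attributes)
        for row in dataset
        if all(row[col] == value for col, value in filters.items())
    ]


# 'Chromosome = 1' as a filter, gene id and start position as attributes.
result = query(GENE, ["gene_stable_id", "gene_start"], {"chromosome": "1"})
```

Here `result` holds one tuple per matching gene, in dataset order.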

12 Examples
– Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
– Name, chromosome position and description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes

13 Data model (diagram: tables linked by PK and FK)

14 Data model (diagram: tables linked by PK and FK)

15 Data model (diagram: tables linked by PK and FK)

16 Data model - ‘reversed star’ (diagram: main tables main1 and main2 with primary keys PK1 and PK2; dimension (dm) tables holding foreign keys FK1 and/or FK2)
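
A minimal sketch of the ‘reversed star’ layout, assuming (as the diagram labels suggest) that each dimension table carries a foreign key pointing at the primary key of a main table. It uses Python's built-in `sqlite3` module. The table names, column names and rows are hypothetical and are not taken from any real mart.

```python
import sqlite3

# In-memory database with one main table and one dimension (dm) table.
# All names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_main (
        gene_key  INTEGER PRIMARY KEY,   -- PK of the main table
        gene_name TEXT
    );
    CREATE TABLE xref_dm (
        gene_key  INTEGER REFERENCES gene_main (gene_key),  -- FK to main
        accession TEXT
    );
    INSERT INTO gene_main VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO xref_dm VALUES (1, 'ACC1'), (1, 'ACC2'), (2, 'ACC3');
    """
)

# A query touches the main table and joins only the dimension it needs.
rows = conn.execute(
    """
    SELECT m.gene_name, d.accession
    FROM gene_main AS m
    JOIN xref_dm AS d ON d.gene_key = m.gene_key
    WHERE m.gene_name = 'geneA'
    ORDER BY d.accession
    """
).fetchall()
```

The main table has one row per gene, while the dimension table can hold many rows per gene, so a query joins only the dimension tables it actually uses.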

17 Dataset Fixed schema transformation (diagram labels: A, B, C, T_A, T_B)

18 BioMart abstractions
Link
– ‘Common currency’ between two datasets
– e.g. accession
Exportable
– Potential links to export
Importable
– Potential links to import

19 Exportables, Importables and Links Dataset 1 Dataset 2 Links

20 Exportables, Importables and Links
Dataset 1, Dataset 2, Links
Exportable (Dataset 1): name = uniprot_id; attributes = uniprot_ac
Importable (Dataset 2): name = uniprot_id; filters = uniprot_ac

21 Exportables, Importables and Links
Dataset 1, Dataset 2, Links
Exportable (Dataset 1): name = genomic_region; attributes = chr_name, chr_start, chr_end
Importable (Dataset 2): name = genomic_region; filters = chr_name (=), chr_start (>=), chr_end (<=)
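
The linking idea can be sketched as follows, using the uniprot_id example: the exportable's attributes from one dataset supply the values for the importable's filters on the other. This is an illustrative model, not BioMart code. The `link` function, the dictionary layout and the sample rows are all hypothetical.

```python
# Hypothetical in-memory datasets; every identifier and value is made up.
dataset1 = [
    {"gene": "geneA", "uniprot_ac": "P00001"},
    {"gene": "geneB", "uniprot_ac": "P00002"},
]
dataset2 = [
    {"uniprot_ac": "P00001", "structure": "S1"},
    {"uniprot_ac": "P00003", "structure": "S2"},
]

# Exportable and importable share a name; that shared name forms the link.
exportable = {"name": "uniprot_id", "attributes": ["uniprot_ac"]}
importable = {"name": "uniprot_id", "filters": ["uniprot_ac"]}


def link(source, exportable, target, importable):
    """Keep the rows of `target` whose filter values were exported by `source`."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    exported = {
        tuple(row[a] for a in exportable["attributes"]) for row in source
    }
    return [
        row
        for row in target
        if tuple(row[f] for f in importable["filters"]) in exported
    ]


linked = link(dataset1, exportable, dataset2, importable)
```

This sketch handles equality links only. A genomic_region link would also need the range comparisons (>=, <=) listed for its filters.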

22 Building BioMart databases (diagram: source databases; transformation with MartBuilder; mart; configuration XML edited with MartEditor)

23

24 Table naming convention
Naïve configuration
Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
Data tables
– Main: __main
– Dimension: __dm
Columns
– Key: _key
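
A small sketch of how a tool might apply this convention, assuming data table names always have exactly three parts separated by double underscores. The helper names and the example table name `hsapiens__gene__main` are hypothetical.

```python
def parse_table_name(name):
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"expected dataset__content__type, got {name!r}")
    dataset, content, table_type = parts
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type {table_type!r}")
    return {"dataset": dataset, "content": content, "type": table_type}


def is_key_column(column):
    """Key columns are recognised by their _key suffix."""
    return column.endswith("_key")


# Hypothetical table and column names used only to exercise the helpers.
parsed = parse_table_name("hsapiens__gene__main")
key_flags = [is_key_column(c) for c in ("gene_id_key", "description")]
```

With names this regular, a configuration tool can discover main and dimension tables, and the keys that join them, from the schema alone.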

25 BioMart architecture
Databases: public data (local or remote): SNP, Vega, Ensembl, UniProt, MSD; myDatabase, myMart
Schema transformation: MartBuilder
Configuration: XML, MartEditor
Retrieval: BioMart API (Java, Perl); MartExplorer, MartShell, MartView

26 MartView

27 MartExplorer

28 MartShell
Using = dataset
Get = attribute
Where = filter

29 Mart Query Language (MQL)
● MQL syntax: using <dataset> get <attribute> where <filter>
● Can join datasets together:
    using Dataset1 get Attribute1 where Filter1=var1 as q;
    using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
    martshell.sh -E MQLscript.mql > results.txt
    martshell.sh -E MQLscript.mql | wc
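
The statement shape on this slide can be composed programmatically. The sketch below only builds MQL text. It does not run MartShell, and the `mql` helper is hypothetical. Joining several attributes with commas is an assumption, since the slide shows a single attribute per statement.

```python
def mql(dataset, attributes, filters=None, name=None):
    """Compose one 'using ... get ... where ...' statement as a string.

    attributes -- list of attribute names (comma-joined: an assumption)
    filters    -- list of filter expressions, combined with 'and'
    name       -- optional label, appended as 'as <name>'
    """
    statement = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        statement += " where " + " and ".join(filters)
    if name:
        statement += f" as {name}"
    return statement


# Rebuild the two-dataset join from the slide: the first query is named q,
# and the second query filters on its results with 'filter3 in q'.
first = mql("Dataset1", ["Attribute1"], ["Filter1=var1"], name="q")
second = mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
script = first + "; " + second
```

The resulting `script` string could be written to a file and passed to `martshell.sh -E`, as in the slide's scripting example.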

30 Third party software
Bioconductor (biomaRt)
– BioMart schema
Taverna
– BioMart Java library
DAS ProServer
– BioMart Perl library

31 biomaRt

32 Taverna

33 ProServer
No programming
DAS requests and responses are defined by Exportables and Importables and configured by MartEditor
DAS1

34 BioMart deployers
– Large scale data federation (EBI)
– Optimising access to a large database (Ensembl, WormBase)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

35 Hinxton example (diagram: EBI: UniProt, MSD; Sanger: Ensembl, SNP, Vega, Sequence; WWW)

36 BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
– Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

37 WormBase

38 Ensembl

39 ArrayExpress

40 BioMart deployers
– Large scale data federation (Hinxton)
– Optimising access to a large database (Ensembl, WormBase)
– Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

41 GMIA_SNP_mart_database
Sources: dbSNP, HapMap, Ensembl, RefSeq, AceView, Vega
Queries
– Give me frequency data from dbSNP
– Give me genotype and frequency data from HapMap
– Give me SNP locations on gene/transcript
– Give me frequency, genotype and location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega
Interfaces: Java graphical user interface; WWW web browser
Sample output
SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
…
Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

42 … what next ?

43 BioMart model
Already applied
– Ensembl
– Vega
– SNP
– UniProt
– MSD
– ArrayExpress
– WormBase
– Variety of ‘in house’ projects
In development
– HapMap

44 Summary
BioMart interface
– Batch queries
– ‘Data mining’
– Large annotation
BioMart software
– Set up your own database
– Make your database scalable and responsive
– Federate with other data

45 Where are we?
0.2 released in February
0.3 to be released in June
– Platforms: MySQL, Oracle, Postgres

46 Acknowledgments
BioMart
– Damian Smedley (EBI)
– Darin London (EBI)
– Will Spooner (CSHL)
Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (UniProt)
– Paul Donlon (Unilever)

