BioMart Query Network Arek Kasprzyk European Bioinformatics Institute 8 January 2005.

1 BioMart Query Network Arek Kasprzyk European Bioinformatics Institute 8 January 2005

2 Biological databases Distributed Different format Different focus Different release schedule Scalability factor


4 BioMart

5 (architecture diagram)
Databases: public data (local or remote): SNP, Vega, Ensembl, UniProt, MSD; myDatabase
Schema transformation: MartBuilder
Configuration: MartEditor, XML
myMart
BioMart API: Java, Perl
Retrieval: MartExplorer, MartShell, MartView

6 MartView


8 MartShell

9 MartExplorer

10 Database

11 Schema (diagram: tables linked by PK/FK keys)

12 Schema (diagram: tables linked by PK/FK keys)

13 Schema (diagram: tables linked by PK/FK keys)

14 Schema - reversed star (diagram: main tables with primary keys PK1 and PK2, and dimension (dm) tables referencing them via FK1 and FK2)

15 Fixed schema transformation (diagram labels: A, B, TA, TB, C)

16 Schema transformation
Central table
– Longest n:1, 1:1 path
Dimension table
– Central transformation around 1:n table.
– Link tables are decomposed into a set of 1:n first

17 MartBuilder
Input
– central object
– database meta data
– cardinalities
Output
– Set of SQL statements: create table as select …
Transformations
– represented as asymmetric tree

18 MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 as
  SELECT gene.gene_id,gene.type,gene.analysis_id,gene.seq_region_id,gene.seq_region_start,gene.seq_region_end,gene.seq_region_strand,gene.display_xref_id,gene_description.gene_id AS gene_id_TEMP0,gene_description.description
  FROM gene, gene_description
  WHERE gene_description.gene_id = gene.gene_id;
CREATE TABLE hsapiens_gene_ensembl__gene__MAIN as
  SELECT TEMP0.gene_id,TEMP0.type,TEMP0.analysis_id,TEMP0.seq_region_id,TEMP0.seq_region_start,TEMP0.seq_region_end,TEMP0.seq_region_strand,TEMP0.display_xref_id,TEMP0.gene_id_TEMP0,TEMP0.description,gene_stable_id.gene_id AS gene_id_TEMP1,gene_stable_id.stable_id,gene_stable_id.version
  FROM TEMP0, gene_stable_id
  WHERE gene_stable_id.gene_id = TEMP0.gene_id;
drop table TEMP0;
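The chain of statements on this slide (join one table at a time into a temporary table, write the last join to the final table, then drop the temporaries) can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the output, not MartBuilder's actual code: the function name build_main_table_sql and its parameters are hypothetical, and it assumes every joined table shares the same key column.

```python
def build_main_table_sql(central, central_cols, joined, key, final_name):
    """Return SQL statements that fold each joined table into the central one.

    central      -- name of the central table (e.g. "gene")
    central_cols -- columns of the central table to carry over
    joined       -- list of (table, columns) pairs with 1:1 or n:1 cardinality
    key          -- join column shared by all tables (an assumption of this sketch)
    final_name   -- name of the resulting main table

    Non-key column names are assumed to be unique across tables.
    """
    if not joined:
        raise ValueError("at least one table to join is required")

    statements, temps = [], []
    source, carried = central, list(central_cols)

    for i, (table, cols) in enumerate(joined):
        is_last = i == len(joined) - 1
        target = final_name if is_last else f"TEMP{i}"

        # Columns accumulated so far come from the previous step's table.
        select = [f"{source}.{c}" for c in carried]
        for col in cols:
            if col == key:
                # Alias the duplicated key so column names stay unique.
                alias = f"{key}_TEMP{i}"
                select.append(f"{table}.{col} AS {alias}")
                carried.append(alias)
            else:
                select.append(f"{table}.{col}")
                carried.append(col)

        statements.append(
            f"CREATE TABLE {target} AS SELECT {', '.join(select)} "
            f"FROM {source}, {table} "
            f"WHERE {table}.{key} = {source}.{key};"
        )
        if not is_last:
            temps.append(target)
        source = target

    statements.extend(f"DROP TABLE {t};" for t in temps)
    return statements
```

Called with the gene, gene_description and gene_stable_id tables, it yields two CREATE TABLE statements and one DROP TABLE TEMP0, mirroring the shape of the session output.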

19 Transformation configuration
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
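A configuration in this shape is easy to read back programmatically. The sketch below assumes each record has five whitespace-separated fields: dataset, table type (M or D), table, referenced table and cardinality code. That reading is inferred from the MartBuilder prompts and is not stated on the slide. The names TransformationRule, parse_rules and tables_to_join are hypothetical, and codes such as n1r are kept verbatim because the slides do not explain them.

```python
from typing import NamedTuple


class TransformationRule(NamedTuple):
    dataset: str       # e.g. "satellog_repeats"
    table_type: str    # "M" (main) or "D" (dimension)
    table: str         # table being built
    referenced: str    # candidate table to join
    cardinality: str   # "11", "n1", "0n", "1n", "S" (skip), or as written


def parse_rules(text):
    """Parse one rule per non-empty line; reject lines without five fields."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        rules.append(TransformationRule(*fields))
    return rules


def tables_to_join(rules, table):
    """Referenced tables that are not skipped ("S") for the given table."""
    return [r.referenced for r in rules
            if r.table == table and r.cardinality != "S"]
```

For the records on this slide, tables_to_join(rules, "repeats") would return disease, gc and rep_class, the three entries not marked S.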

20 Data access

21 Dataset – Key Abstraction
Dataset
– Organised into a single schema
– BioMart database contains one or more dataset(s)
– Attribute
– Filter
– Exportable/Importable (Links)
Dataset - an equivalent of relational table
– Exportable/Importable = PK/FK

22 Key Abstractions: Mart, Dataset, Attribute, Filter (diagram: table GENE CENTRAL with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description)

23 Exportables, Importables and Links
Exportable = ordered list of attributes
Importable = ordered list of filters
– WHERE filt1=value1
– WHERE filt1=value1 or filt1=value2
– WHERE filt1>value1 and filt2
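One way to picture a link: the values produced by one dataset's exportable attributes are fed, position by position, into the importable filters of another dataset. The sketch below builds a parameterised WHERE clause from such values. It is an assumption-laden illustration, not the BioMart API: link_where_clause is a hypothetical name, only equality filters are covered, and the ? placeholder style is just one common convention.

```python
def link_where_clause(importable_filters, exported_rows):
    """Build a WHERE clause matching any of the exported value tuples.

    importable_filters -- ordered filter (column) names of the target dataset
    exported_rows      -- tuples of values from the source dataset's exportable,
                          in the same order as the filters
    Returns (clause, params) suitable for a parameterised query.
    """
    if not exported_rows:
        raise ValueError("no exported values to link on")
    for row in exported_rows:
        if len(row) != len(importable_filters):
            raise ValueError("exportable and importable differ in length")

    # One AND-group per exported row, OR-ed together.
    group = "(" + " AND ".join(f"{f} = ?" for f in importable_filters) + ")"
    clause = "WHERE " + " OR ".join([group] * len(exported_rows))
    params = [value for row in exported_rows for value in row]
    return clause, params
```

With a single filter filt1 and two exported values, this produces WHERE (filt1 = ?) OR (filt1 = ?), the parameterised form of the second example on the slide.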

24 MartView

25 Dataset Configuration Dataset configuration Attributes Filters Trees, Groups, Collections Links Semantics Relational mapping User interface Linking datasets XML-based

26 Dataset Configuration XML

27 Table naming convention
Naïve configuration
Tables
– Meta tables meta_content
– Data tables dataset__content__type
Data tables
– Main __main
– Dimension __dm
Columns
– Key _key
– Boolean filter _bool
– List filter _list
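Because the convention encodes meaning in names, a naïve configuration can be derived by parsing them. The following sketch splits a data table name on the double underscore and classifies columns by suffix. The function names are hypothetical, and two choices are assumptions: the type is matched case-insensitively because the MartBuilder example ends in __MAIN, and any column without a listed suffix is treated as a plain attribute.

```python
def parse_table_name(name):
    """Split 'dataset__content__type' into its parts, or return None.

    Only 'main' and 'dm' are accepted as the type.
    """
    parts = name.split("__")
    if len(parts) != 3:
        return None
    dataset, content, kind = parts
    kind = kind.lower()
    if kind not in ("main", "dm"):
        return None
    return {"dataset": dataset, "content": content, "type": kind}


def classify_column(name):
    """Map a column name to its role using the suffix convention."""
    suffixes = (
        ("_key", "key"),
        ("_bool", "boolean filter"),
        ("_list", "list filter"),
    )
    for suffix, role in suffixes:
        if name.endswith(suffix):
            return role
    return "attribute"  # assumption: unsuffixed columns are plain attributes
```

For example, parse_table_name("hsapiens_gene_ensembl__gene__MAIN") recovers the dataset hsapiens_gene_ensembl, the content gene and the type main.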

28 MartEditor

29 Naïve configuration Updates Links Automatic discovery of new tables

30 Class diagram - configuration

31 Class diagram - querying

32 Information flow
Read connections
Register individual datasets and create linked datasets
Get input from the user, split queries to individual datasets.
Find the shortest path between datasets (Dijkstra)
Compile SQL
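The shortest-path step can be illustrated with a textbook Dijkstra search over a graph whose nodes are datasets and whose edges are links. This is a generic sketch, not the BioMart library's code: the function name shortest_path, the undirected treatment of links and the edge weights are all assumptions, and the dataset names in the usage note are only examples.

```python
import heapq


def shortest_path(links, start, goal):
    """Return the cheapest list of datasets from start to goal, or None.

    links -- iterable of (dataset_a, dataset_b, cost) tuples, treated as
             undirected edges with non-negative cost
    """
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    dist = {start: 0}
    prev = {}
    visited = set()
    heap = [(0, start)]

    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == goal:
            break
        for neighbour, cost in graph.get(node, []):
            candidate = d + cost
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))

    if goal not in dist:
        return None

    # Walk predecessors back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

Given links ensembl-uniprot and uniprot-msd of cost 1 and a direct ensembl-msd link of cost 5, the search returns the two-hop route through uniprot; the resulting path would then determine which datasets are joined when the SQL is compiled.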

33 Summary

34 BioMart
Domain independent
Platform independent
– MySQL 4
– Oracle 9i
Plugin architecture

35 BioMart model
Already applied
– Ensembl
– Vega
– dbSNP
– UniProt
– MSD
– Variety of small projects
In development
– ArrayExpress
– WormBase
– RGD

36 Future work
BioMart v0.2 to be released later in January
Java library to be upgraded to the new architecture over the coming months
BioMart has been integrated with Taverna
MartBuilder - to be properly implemented

37 BioMart Open source (LGPL) Public MySQL server ftp

38 Acknowledgments
BioMart
– Damian Smedley
– Darin London
Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (Uniprot)
– Paul Donlon (Unilever)
– Will Spooner (CSHL)


