EBI is an Outstation of the European Molecular Biology Laboratory. Gautier Koscielny, VectorBase Meeting, 08 February 2012, EBI. VectorBase Text Search Engine.

1 EBI is an Outstation of the European Molecular Biology Laboratory. Gautier Koscielny, VectorBase Meeting, 08 February 2012, EBI. VectorBase Text Search Engine

2 History of text search
Gautier Koscielny - VectorBase Search Engine, Wednesday, 8 February 2012
Up to 2009:
  Notre Dame University maintained the main site text search
  At the time, there was no text search module available in the installed version of Ensembl
In 2010:
  The Ensembl installation was updated to reflect the latest Ensembl Genomes installation
  Text search technology available
  At the time, Ensembl search was based on the EB-EYE indices

3 Challenges in 2010
How to integrate the new Lucene EB-EYE indices in the main site?
Multiple sources of indexing
  VectorBase (expression, community annotations, etc.)
Relied on goodwill from external services to update the EB-EYE indices from VectorBase core databases
  Relied on an XML dump of the core database
  Time-consuming task
Difficult to index new datatypes or resources

4 Requirements
Framework to generate indices at any time
  Can reflect new community annotations (CAP)
  Ontology information
  New datasources: literature
Search to serve Lucene indices from different providers:
  Gene annotation, x-refs, comparative genomics data (EBI)
  Microarray and gene expression data (Imperial)
  CAP (Notre Dame)
Indexing must be fast, easy to use and maintain
Search can be plugged into different tools:
  Main VectorBase website
  Ensembl genome browser

5 Architecture
[Diagram. Labels: Data sources (Ensembl, FuncGen, CAP); providers (EBI, Imperial, Notre Dame); Lucene indices (one index file per source); VectorBase Search Service Layer; SOAP; Clients]

6 What is being searched?
Genomic information (Ensembl databases)
  Gene models
  Variation
  Probes
  Orthologs
Expression data (Imperial)
CAP
Ontologies (idomal, miro, anatomy)
Population genomics (Imperial)

7 Generating Ensembl indices at the EBI
Based on a direct connection to the database(s)
Uses a configuration file containing the description of objects and their types:
  Database connection (staging-1, …)
  Database type (core, funcgen, variation)
  Genome (aedes_aegypti)
  Homologies
Each object in the configuration file is represented by a Java class
The configuration loader automatically creates an instance of each type using the class loader
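The class-loader step can be illustrated with a minimal sketch. The real configuration format and class names are not given in the slides, so `ConfigObject`, `DatabaseType`, `Genome` and the map-based configuration below are hypothetical stand-ins; only the reflection calls are standard Java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a configuration loader that instantiates one object per configured class name. */
final class ConfigLoaderSketch {

    /** Hypothetical common interface for everything a configuration file can describe. */
    interface ConfigObject {
        void configure(String value);
        String describe();
    }

    /** Hypothetical class for a "database type" entry (core, funcgen, variation). */
    public static final class DatabaseType implements ConfigObject {
        private String value;
        public DatabaseType() { }
        public void configure(String value) { this.value = value; }
        public String describe() { return "database type: " + value; }
    }

    /** Hypothetical class for a "genome" entry. */
    public static final class Genome implements ConfigObject {
        private String value;
        public Genome() { }
        public void configure(String value) { this.value = value; }
        public String describe() { return "genome: " + value; }
    }

    /** Looks each class up by name, calls its no-argument constructor, then configures it. */
    static List<ConfigObject> load(Map<String, String> classNameToValue) {
        List<ConfigObject> objects = new ArrayList<>();
        for (Map.Entry<String, String> entry : classNameToValue.entrySet()) {
            try {
                Class<?> type = Class.forName(entry.getKey());
                ConfigObject object = (ConfigObject) type.getDeclaredConstructor().newInstance();
                object.configure(entry.getValue());
                objects.add(object);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + entry.getKey(), e);
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        // Assumed in-memory stand-in for the parsed configuration file.
        Map<String, String> config = new LinkedHashMap<>();
        config.put(DatabaseType.class.getName(), "core");
        config.put(Genome.class.getName(), "aedes_aegypti");
        for (ConfigObject object : load(config)) {
            System.out.println(object.describe());
        }
    }
}
```

Loading by name means a new datatype only needs a new class and a new configuration entry; the loader itself does not change.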

8 Example of configuration file

9 Procedure (for Ensembl indices)
[Diagram. Labels: core, funcgen, variation, compara]
1. If compara is defined, get all homologies
2. For each genome in turn:
   Get all gene, transcript, exon, protein and xref information from core
   Get all reporters from funcgen and their mapping to gene models
   Get all variations and their relation to gene models
   Associate all existing homologies with the genes
   Create a Lucene Document for each gene
The indices are copied to Notre Dame University
The Tomcat instance is restarted
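The two numbered steps can be sketched as a loop. This is an illustration under stated assumptions, not the framework's code: `GenomeSource` is a hypothetical stand-in for the database access layer, the gene identifiers are invented, and a plain `Map` stands in for a Lucene Document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexingProcedureSketch {

    /** Hypothetical read-only view of the source databases; not an Ensembl API. */
    interface GenomeSource {
        List<String> geneIds(String genome);    // gene models from core
        List<String> reporters(String geneId);  // reporters mapped to the gene, from funcgen
        List<String> variations(String geneId); // variations related to the gene
    }

    /** Builds one document per gene; pass null for compara when it is not defined. */
    static List<Map<String, Object>> buildDocuments(List<String> genomes, GenomeSource source,
                                                    Map<String, List<String>> compara) {
        // Step 1: if compara is defined, all homologies are available up front.
        Map<String, List<String>> homologies =
                compara == null ? Collections.<String, List<String>>emptyMap() : compara;
        List<Map<String, Object>> documents = new ArrayList<>();
        // Step 2: process each genome in turn.
        for (String genome : genomes) {
            for (String geneId : source.geneIds(genome)) {
                Map<String, Object> document = new LinkedHashMap<>();
                document.put("species", genome);
                document.put("gene_id", geneId);
                document.put("funcgen_xrefs", source.reporters(geneId));
                document.put("variation_xrefs", source.variations(geneId));
                document.put("homologs",
                        homologies.getOrDefault(geneId, Collections.<String>emptyList()));
                documents.add(document);
            }
        }
        return documents;
    }

    /** Invented in-memory data so the sketch runs without a database. */
    static GenomeSource inMemorySource() {
        return new GenomeSource() {
            public List<String> geneIds(String genome) {
                return Arrays.asList("GENE_A", "GENE_B");
            }
            public List<String> reporters(String geneId) {
                return Collections.singletonList("reporter-for-" + geneId);
            }
            public List<String> variations(String geneId) {
                return Collections.emptyList();
            }
        };
    }

    public static void main(String[] args) {
        Map<String, List<String>> compara = new LinkedHashMap<>();
        compara.put("GENE_A", Arrays.asList("GENE_X"));
        for (Map<String, Object> document
                : buildDocuments(Arrays.asList("aedes_aegypti"), inMemorySource(), compara)) {
            System.out.println(document);
        }
    }
}
```

Fetching homologies once before the genome loop mirrors the ordering on the slide: compara spans genomes, so it is read first and then joined to each gene as its document is built.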

10 Ensembl object mapping in Java
Ensembl concepts are mapped to equivalent Java data access objects (DAO)
All Ensembl concepts are stored in memory and removed when a Lucene Document is created
[Class diagram. Labels: EnsemblFeature; Gene; Transcripts, translations, exons; Homology; Xref; relations: extends, contains]

11 Creating a Lucene document
A document is a container for the index
Each document defines one or several fields
The framework creates a document per gene
Each field can store its value (or not)
Each field can be indexed (or not)
The text stored can be compressed

12 Gene Document
Fields:
  Gene id, name, description
  Coordinates: seq region name, start, end
  Species, feature type (gene), source (biotype), genomic unit
  Transcript count, transcript stable ids
  Exon count, exon stable ids
  Peptide count, peptide stable ids, domains
  Core xrefs
  Variation xrefs (if available)
  Funcgen xrefs (if available)
  Compara homologs (if available)
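The "stored (or not)" and "indexed (or not)" switches are independent per field, which a small model makes concrete. The sketch below is plain Java, not the Lucene API, and the choice of which gene fields are stored or indexed is an assumption made for illustration; the slides do not say how the real index sets them.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a document whose fields carry independent stored/indexed flags. */
final class GeneDocumentSketch {

    /** One named value plus the two per-field switches. */
    static final class Field {
        final String name;
        final String value;
        final boolean stored;  // value can be returned with a hit
        final boolean indexed; // value can be matched by a query

        Field(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    private final List<Field> fields = new ArrayList<>();

    GeneDocumentSketch add(String name, String value, boolean stored, boolean indexed) {
        fields.add(new Field(name, value, stored, indexed));
        return this;
    }

    /** Names of the fields whose values could be displayed in a result page. */
    List<String> storedFieldNames() {
        List<String> names = new ArrayList<>();
        for (Field field : fields) {
            if (field.stored) names.add(field.name);
        }
        return names;
    }

    /** Names of the fields a query could match. */
    List<String> searchableFieldNames() {
        List<String> names = new ArrayList<>();
        for (Field field : fields) {
            if (field.indexed) names.add(field.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Flag choices and values are assumptions for illustration only.
        GeneDocumentSketch document = new GeneDocumentSketch()
                .add("id", "GENE_A", true, true)                      // shown and searchable
                .add("description", "hypothetical text", false, true) // searchable only
                .add("seq_region_name", "1", true, false);            // shown only
        System.out.println("stored: " + document.storedFieldNames());
        System.out.println("searchable: " + document.searchableFieldNames());
    }
}
```

Indexing a long field without storing it keeps it searchable while keeping the index smaller, which is one reason the two flags are separate.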

13 CAP indices
A GFF parser extracts gene and transcript models
Name, description, submitter and chromosome location are indexed
Very fast
Could be updated overnight if required
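A minimal sketch of the extraction step, assuming the CAP files follow the standard nine-column, tab-separated GFF3 layout with `key=value` attributes in the last column. The sample lines are invented, and how submitter and description are encoded in the real CAP files is not stated in the slides, so the sketch simply keeps whatever attributes are present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GffModelSketch {

    /** Returns one key/value map per gene or mRNA feature found in the GFF3 lines. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> models = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue; // blank lines, directives, comments
            String[] columns = line.split("\t");
            if (columns.length != 9) continue;                    // not a feature line
            String type = columns[2];
            if (!type.equals("gene") && !type.equals("mRNA")) continue;
            Map<String, String> model = new LinkedHashMap<>();
            // Column 9: semicolon-separated key=value pairs (URL-escapes are not decoded here).
            for (String pair : columns[8].split(";")) {
                int equals = pair.indexOf('=');
                if (equals > 0) {
                    model.put(pair.substring(0, equals).trim(), pair.substring(equals + 1).trim());
                }
            }
            // Fixed columns are written last so an attribute with the same key cannot override them.
            model.put("type", type);
            model.put("seq_region", columns[0]);
            model.put("start", columns[3]);
            model.put("end", columns[4]);
            models.add(model);
        }
        return models;
    }

    public static void main(String[] args) {
        // Invented sample data in GFF3 layout.
        List<String> lines = Arrays.asList(
                "##gff-version 3",
                "chr1\tCAP\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=hypothetical gene",
                "chr1\tCAP\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
                "chr1\tCAP\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1");
        for (Map<String, String> model : parse(lines)) {
            System.out.println(model);
        }
    }
}
```

A single pass over the file with no database access is consistent with the slide's remark that this index is very fast to build.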

14 Expression data/Population genomics
Constructed by Bob McCallum (Imperial)

15 Ontologies
Ontology terms are indexed
An OBO parser extracts each term in turn
Accession, name and description are parsed by default
Extra fields are parsed depending on the completeness of each term
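Term-by-term extraction can be sketched as follows, assuming the OBO flat-file layout in which each term is a `[Term]` stanza of `tag: value` lines (`id`, `name`, `def`, and optional tags such as `synonym` or `is_a`). The sample term is invented, and values are kept raw (quotes and trailing `!` comments are not stripped).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OboTermSketch {

    /** Returns one tag-to-values map per [Term] stanza; repeated tags keep all their values. */
    static List<Map<String, List<String>>> parseTerms(List<String> lines) {
        List<Map<String, List<String>>> terms = new ArrayList<>();
        Map<String, List<String>> current = null;
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.startsWith("[")) {
                // A new stanza begins; only [Term] stanzas are collected.
                current = null;
                if (line.equals("[Term]")) {
                    current = new LinkedHashMap<>();
                    terms.add(current);
                }
                continue;
            }
            int colon = line.indexOf(':');
            // Skip header lines (before any stanza), blank lines, comments and malformed lines.
            if (current == null || line.isEmpty() || line.startsWith("!") || colon <= 0) continue;
            String tag = line.substring(0, colon);
            String value = line.substring(colon + 1).trim();
            current.computeIfAbsent(tag, key -> new ArrayList<>()).add(value);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Invented sample in OBO layout.
        List<String> lines = Arrays.asList(
                "format-version: 1.2",
                "",
                "[Term]",
                "id: EX:0000001",
                "name: example term",
                "def: \"A hypothetical definition.\" []",
                "synonym: \"sample term\" EXACT []",
                "",
                "[Typedef]",
                "id: part_of");
        for (Map<String, List<String>> term : parseTerms(lines)) {
            System.out.println(term);
        }
    }
}
```

Because every tag present in a stanza is kept, a sparsely annotated term yields only `id` and `name`, while a fuller one also contributes its definition and synonyms, which matches the "depending on completeness" behaviour described on the slide.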

16 SOAP interface
2 procedures: getNbOfResults, getResults (see wiki)
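The slide names the two procedures but not their signatures, which are documented on the wiki. The sketch below therefore assumes parameters (a domain name, a query string, and a start/size window for paging) purely for illustration, and backs the interface with an in-memory substring match rather than a SOAP endpoint or a Lucene index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SearchServiceSketch {

    /** Procedure names come from the slide; every parameter and return type is an assumption. */
    interface SearchService {
        int getNbOfResults(String domain, String query);
        List<String> getResults(String domain, String query, int start, int size);
    }

    /** In-memory stand-in: a hit is any entry containing the query, ignoring case. */
    static final class InMemorySearchService implements SearchService {
        private final Map<String, List<String>> entriesByDomain;

        InMemorySearchService(Map<String, List<String>> entriesByDomain) {
            this.entriesByDomain = entriesByDomain;
        }

        private List<String> matches(String domain, String query) {
            String needle = query.toLowerCase(Locale.ROOT);
            List<String> hits = new ArrayList<>();
            for (String entry : entriesByDomain.getOrDefault(domain, Collections.<String>emptyList())) {
                if (entry.toLowerCase(Locale.ROOT).contains(needle)) hits.add(entry);
            }
            return hits;
        }

        public int getNbOfResults(String domain, String query) {
            return matches(domain, query).size();
        }

        public List<String> getResults(String domain, String query, int start, int size) {
            List<String> all = matches(domain, query);
            // Clamp the requested window to the available hits.
            int from = Math.min(Math.max(start, 0), all.size());
            int to = (int) Math.min((long) from + Math.max(size, 0), all.size());
            return new ArrayList<>(all.subList(from, to));
        }
    }

    public static void main(String[] args) {
        // Invented entries; domain names are illustrative.
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put("genome", Arrays.asList("GENE_A kinase", "GENE_B kinase", "GENE_C transporter"));
        SearchService service = new InMemorySearchService(entries);
        System.out.println(service.getNbOfResults("genome", "kinase"));
        System.out.println(service.getResults("genome", "kinase", 1, 10));
    }
}
```

Separating the count from the page of results lets a client show totals cheaply and fetch hits only for the domain the user opens.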

17 To do list
Front-end:
  All domains should be queried to produce an 'Entrez'-like page
  So, search all by default and display count per domain
  Could be a very simple result page (see next slide for mock-up)
Updates:
  We could update some of the domains more frequently
  CAP is a good candidate
Other technologies:
  Other technologies can be used
  Auto-completion
  SOLR
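The "search all by default and display count per domain" idea can be sketched as one count call per domain, collected into a summary line. The counters here are hypothetical functions with fixed, invented return values; in practice each would wrap a call such as getNbOfResults for its domain.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

final class DomainCountSketch {

    /** Runs the same query against every domain counter and keeps the domain order. */
    static Map<String, Integer> countPerDomain(Map<String, ToIntFunction<String>> counters,
                                               String query) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, ToIntFunction<String>> counter : counters.entrySet()) {
            counts.put(counter.getKey(), counter.getValue().applyAsInt(query));
        }
        return counts;
    }

    /** Formats the counts as "Domain (n) Domain (n) ...". */
    static String summaryLine(Map<String, Integer> counts) {
        StringBuilder line = new StringBuilder();
        for (Map.Entry<String, Integer> count : counts.entrySet()) {
            if (line.length() > 0) line.append(' ');
            line.append(count.getKey()).append(" (").append(count.getValue()).append(')');
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical counters with invented values, standing in for per-domain index lookups.
        Map<String, ToIntFunction<String>> counters = new LinkedHashMap<>();
        counters.put("Genome", query -> 2);
        counters.put("Expression", query -> 5);
        counters.put("Ontology", query -> 0);
        System.out.println(summaryLine(countPerDomain(counters, "kinase")));
    }
}
```

Since only counts are requested up front, the landing page stays simple and the full result list is fetched only for the domain a user selects.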

18 Result page
[Mock-up: Genome (1693), Expression (3693), Ontology (70), Population (30)]

