EBI is an Outstation of the European Molecular Biology Laboratory. Anatomy ontology ArrayExpress Helen Parkinson, PhD
Content: ArrayExpress use cases; Fuzzy matching of ontology terms; Data driven ontology building; Wish list
ArrayExpress: Overview Submit Hybs Experiment queries Public/Private ATLAS Summarize Public Only Re-annotate Gene queries Genes Cross expt/ species queries
Fuzzy matching of ontology terms – why? Clean up ArrayExpress OE and synonym tables. OE-based integration. Constrain OEs on data entry/validation. Improved searches in repository/DW web interface. Data integration across species, experiments and experimental designs. Automated mapping of free text to ontology terms for data import.
Phonetic Matching: Precompute phonetic encodings of all terms in the ontology. Match each target term by comparing these encodings. Soundex: Robert Russell and Margaret Odell (1918), famously described by Donald Knuth. Double Metaphone: Lawrence Philips (2000). Metaphone: Lawrence Philips. Most matches are single. Highest success rate.
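The two steps on this slide (precompute the encodings, then compare) can be sketched with Soundex, the simplest of the three algorithms named. This is a minimal illustration, not the ArrayExpress implementation: the function names `soundex`, `build_index` and `match`, the word-by-word encoding of multi-word terms, and the sample terms are all assumptions made for the example.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def encode_term(term):
    """Encode a (possibly multi-word) term word by word (an assumption)."""
    return tuple(soundex(w) for w in term.split())


def build_index(ontology_terms):
    """Precompute the phonetic encoding of every ontology term."""
    index = defaultdict(list)
    for term in ontology_terms:
        index[encode_term(term)].append(term)
    return index


def match(target, index):
    """Return all ontology terms whose encoding equals the target's."""
    return index.get(encode_term(target), [])


# Hypothetical sample terms, for illustration only.
index = build_index(["hippocampus", "cerebral cortex", "liver"])
print(match("hipocampus", index))       # misspelling still matches
print(match("cerebrel cortex", index))
```

Because the index is built once, each lookup is a single dictionary access; a term returning more than one candidate would need manual review or a second, stricter comparison.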
Algorithm comparisons
Percent matches using automated mapping
Failures to match: Species (or Kingdom)-specific terms (e.g. plant anatomy). Conflated terms (e.g. diseased cell types). Compound terms (e.g. "cerebral cortex and hypothalamus"). Genuinely missing terms. Esoteric terms less of a priority. Most trivial misspellings, however, were matched. Dirty input data.
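One of the failure classes above, compound terms, could in principle be reduced by splitting the free text before matching. The sketch below is only a possible pre-processing step, not something the slides describe; the function name `split_compound` and the choice of separators (comma, semicolon, the word "and") are assumptions.

```python
import re

# Separators assumed for illustration: comma, semicolon, or the word "and".
_SEPARATORS = re.compile(r"\s*(?:,|;|\band\b)\s*", flags=re.IGNORECASE)


def split_compound(term):
    """Split a compound annotation into candidate single terms."""
    parts = _SEPARATORS.split(term.strip())
    return [p for p in parts if p]  # drop empty fragments


print(split_compound("cerebral cortex and hypothalamus"))
```

A split like this would wrongly break any ontology term that legitimately contains "and", so the whole string should probably be tried first and split only when it fails to match.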
Implications: Need more terms in some commonly-used ontologies. Synonyms are important, generating less noise and better coverage. Choice of ontology can limit expressivity; this will be frustrating to biologists.
Why? Clean up ArrayExpress OE and synonym tables. Add accessions/DB links to these tables. Constrain OEs on data entry/validation. Improved searches in repository/DW web interface. Generate suggestions for new OE terms. Evaluate domain coverage by a given ontology.
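Two of these goals, evaluating domain coverage and generating suggestions for new terms, amount to counting which annotations map to an ontology and listing those that do not. The sketch below assumes a deliberately simple case-insensitive exact match against labels and synonyms; the function name `coverage`, the dictionary layout and the sample data are hypothetical.

```python
def coverage(annotations, ontology):
    """Report the percentage of annotations found in an ontology.

    ontology maps each preferred label to a list of its synonyms.
    Returns (percent matched, {annotation: label}, unmatched annotations).
    """
    lookup = {}
    for label, synonyms in ontology.items():
        for name in [label, *synonyms]:
            lookup[name.strip().lower()] = label

    mapped, unmatched, hits = {}, [], 0
    for text in annotations:
        label = lookup.get(text.strip().lower())
        if label is None:
            unmatched.append(text)  # candidate for a new term or synonym
        else:
            mapped[text] = label
            hits += 1

    percent = 100.0 * hits / len(annotations) if annotations else 0.0
    return percent, mapped, unmatched


# Hypothetical data, for illustration only.
ontology = {"liver": ["hepar"], "kidney": []}
percent, mapped, unmatched = coverage(["Liver", "hepar", "kidney", "leaf"], ontology)
print(percent, unmatched)
```

Running the same annotations against several ontologies gives a rough comparison of their coverage, and the unmatched list is a starting point for term requests.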
ArrayExpress Ontology Development and Future Directions: Developing the Ontology. Define Scope: ArrayExpress already has some useful structure given the current database, plus a rich source of use cases and competency questions. Build: Ontology Capture: identify key concepts and relationships within our domain and give explicit definitions to these features. Middle-out approach: specify a core of basic terms, then specialise and generalise as required. Mappings: text mining approach to do initial semi-automated mappings to external resources for rapid coverage. Manual mapping for data warehouse data and selected data sets.
ArrayExpress Ontology Development and Future Directions Capture to Code: Definitions and Hierarchy
ArrayExpress Ontology Development and Future Directions: Semantic Roadmap. Position of the ArrayExpress Experimental Factor Ontology in the ‘bigger picture’: AE Ontology; Disease Ontology; Common Anatomy Reference Ontology; Cell Type Ontology; Chemical Entities of Biological Interest (ChEBI); NCI; Various Species Anatomy Ontologies. Key is orthogonal coverage, reuse of existing resources and shared frameworks.
Wish list: NOT to build our own anatomy ontology; CARO extension; CARO evaluation; Mapping CARO to relevant multi-species ontologies; Application of CARO to ArrayExpress data; Use of CARO in ArrayExpress tools
Acknowledgments: Anna Farne, Ele Holloway, James Malone, Margus Lukk, ArrayExpress Production Team, Helen Parkinson, Tim Rayner, Faisal Rezwan, Eleanor Williams, Mengyao Zhao, Holly Zheng