GEM: The GAAIN Entity Mapper
Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga
USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC
July 9th, 2015, at the 11th Data Integration in Life Sciences Conference (DILS) 2015, Marina del Rey

Introduction: GAAIN
- GAAIN: Global Alzheimer’s Association Interactive Network
- Current
  - Data integrated from 30+ sources
  - Over 250,000 research subjects
- Access: http://www.gaain.org

Data Integration in GAAIN
- Data
  - Subject research data
  - Well structured
  - (Mostly) relational
- Data harmonization
  - Common data model
  - Map datasets to the common model
- Data ownership sensitivity

Data Mapping

The Data Mapping Problem
- Resource intensive
  “On average, converting a database to the OMOP CDM, including mapping terminologies, required the equivalent of four full-time employees for 6 months and significant computational resources for each distributed research partner. Each partner utilized a number of people with a wide range of expertise and skills to complete the project, including project managers, medical informaticists, epidemiologists, database administrators, database developers, system analysts/programmers, research assistants, statisticians, and hardware technicians. Knowledge of clinical medicine was critical to correctly map data to the proper OMOP CDM tables.”
- Complexity of data harmonization
  - Several thousand data elements per dataset
  - Multiple datasets
- Data elements
  - Complex scientific concepts
  - Cryptic names
  - Domain expertise needed to interpret

Observations
- Rich element information in documentation: data dictionaries!
- Element information
  - Descriptions
  - Metadata
- Need better approaches to matching element names, e.g. MOMDEMYR1, PTGNDR

Data Dictionaries Rich element details

Approach
- Extract element description and metadata details from data dictionaries
- Determine element matches based on the above
  - Block improbable match candidates based on metadata
  - Determine element similarity (and thus match likelihood) based on name and description similarity
- The initial version of the system was knowledge-driven; machine-learning classification was added later

GEM: A Software Assistant for Data Mapping

GEM Architecture

Element Extraction
- Extract and segregate element information

Metadata Detail Extraction
- Element categories: four categories
  (i) Special
  (ii) Coded
    - Binary
    - Other coded
  (iii) Numerical
  (iv) Text
- Classifier: heuristic based
- Other metadata details
  - Cardinality
  - Range (min, max)
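As an illustration of the heuristic classification step, here is a minimal sketch. The rules, the function name `classify_element`, the category labels and the `code=label` legend format are all assumptions made for this example, not GEM's actual heuristics. The "Special" category is left out because the slides do not define it.

```python
import re

def classify_element(legend, sample_values):
    """Assign a coarse category from a dictionary legend and sample values.

    Hypothetical rules for illustration only.
    """
    # A legend such as "1=Male; 2=Female" suggests a coded element.
    codes = re.findall(r"(-?\d+)\s*=\s*([^;,]+)", legend)
    if codes:
        return "coded-binary" if len(codes) == 2 else "coded-other"

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    # No legend: call it numerical if every sample value parses as a number.
    if sample_values and all(is_number(v) for v in sample_values):
        return "numerical"
    return "text"

print(classify_element("1=Male; 2=Female", []))
print(classify_element("", ["72.5", "80"]))
```

A real data dictionary would likely need more rules (dates, identifiers, free-text legends), which is presumably where a "Special" category comes in.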

MDB: The Metadata Database
Extracted detailed metadata per element:
- Source
- Name
- Description
- Legend
- Cardinality
- Range
- Category
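The per-element record could be stored in a single table. The sketch below uses an in-memory SQLite database. The storage engine, the table name `element_metadata`, the column types and the sample row (description, legend and values) are assumptions for illustration, since the slides list only the field names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE element_metadata (
        source      TEXT NOT NULL,   -- dataset the element comes from
        name        TEXT NOT NULL,   -- element (column) name
        description TEXT,            -- free-text description from the dictionary
        legend      TEXT,            -- code legend, if any
        cardinality INTEGER,         -- number of distinct values
        range_min   REAL,            -- range split into min and max
        range_max   REAL,
        category    TEXT,            -- special / coded / numerical / text
        PRIMARY KEY (source, name)
    )
    """
)

# Hypothetical sample row; the values are invented for the example.
conn.execute(
    "INSERT INTO element_metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("ADNI", "PTGENDER", "Participant gender", "1=Male; 2=Female", 2, 1, 2, "coded-binary"),
)

rows = conn.execute(
    "SELECT name, category FROM element_metadata WHERE source = ?", ("ADNI",)
).fetchall()
print(rows)
```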

Matching: Metadata-Based “Blocking”
- Elimination of candidates: eliminate candidates from the second source that are incompatible
- Incompatibility criteria
  - Category mismatch
  - Cardinality mismatch
    - For coded elements
    - Assume normal distribution with SD of 1
  - Range mismatch
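The blocking step can be sketched as a compatibility filter over candidate elements. The `Element` class, the function names and the thresholds are hypothetical. In particular, the "normal distribution with SD of 1" criterion is interpreted loosely here as a fixed tolerance on the cardinality difference, and "range mismatch" as non-overlapping ranges. Both readings are assumptions, not the documented GEM rules.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Element:
    name: str
    category: str
    cardinality: Optional[int] = None
    value_range: Optional[Tuple[float, float]] = None

def is_compatible(a, b, card_tolerance=1):
    """Return False if metadata alone rules out a match (assumed criteria)."""
    # Category mismatch: e.g. a coded element never matches a numerical one.
    if a.category != b.category:
        return False
    # Cardinality mismatch, checked only for coded elements.
    if a.category.startswith("coded") and None not in (a.cardinality, b.cardinality):
        if abs(a.cardinality - b.cardinality) > card_tolerance:
            return False
    # Range mismatch: here, ranges that do not overlap at all.
    if a.value_range and b.value_range:
        low = max(a.value_range[0], b.value_range[0])
        high = min(a.value_range[1], b.value_range[1])
        if low > high:
            return False
    return True

def block(source_element, candidates):
    """Keep only the candidates that survive the metadata checks."""
    return [c for c in candidates if is_compatible(source_element, c)]

gender = Element("PTGENDER", "coded", cardinality=2)
candidates = [
    Element("PATGNDR", "coded", cardinality=2),
    Element("AGE", "numerical", value_range=(50, 100)),
    Element("DX", "coded", cardinality=6),
]
print([c.name for c in block(gender, candidates)])
```

Only the surviving candidates would then be passed on to the more expensive name and description matching.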

Matching Text Descriptions
- Employ a regular TF-IDF cosine distance on bag-of-words
- Based on unsupervised topic modeling (LDA)
  - Treat element descriptions as ‘documents’
  - Topic model over these documents
  - Each element (description) has a probability distribution over topics
  - Element similarity (or distance) is based on how similar (or not) the associated topic distributions are
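Both text-matching options can be sketched in a few lines of plain Python. The first part computes TF-IDF cosine similarity over bag-of-words vectors. The second compares two topic distributions that are assumed to have been inferred already by an LDA model, which is not trained here. The smoothed IDF formula, the Hellinger distance and the sample descriptions are choices made for this example; the slides do not say which weighting or distribution distance GEM uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms in every document.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, c in term_freq.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def hellinger(p, q):
    """Distance in [0, 1] between two topic distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

# Hypothetical element descriptions.
docs = [
    "gender of the participant",
    "participant gender",
    "year of onset of dementia in mother",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))

# Hypothetical topic distributions for two elements over three topics.
print(hellinger([0.8, 0.1, 0.1], [0.7, 0.2, 0.1]))
```

With topic distributions, two descriptions that share no words can still come out close if they load on the same topics, which is one plausible reason a topic model might help when descriptions are worded differently.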

Element Name Matching
- Composite element names
- Examples: PTGENDER, PATGNDR, MOMDEM, FHQDEMYR1
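One simple way to compare cryptic, abbreviated element names is sketched below. It treats a name as a likely abbreviation when its letters appear in order inside the other name, and otherwise falls back to a character-level similarity ratio from Python's `difflib`. This is an illustrative stand-in, not the name-matching method used in GEM. The function names and the minimum-length guard are assumptions.

```python
from difflib import SequenceMatcher

def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)

def name_similarity(a, b, min_abbrev_len=3):
    """Score in [0, 1]; 1.0 when one name looks like an abbreviation of the other."""
    a, b = a.upper(), b.upper()
    short, long = sorted((a, b), key=len)
    # The length guard avoids treating one- or two-letter names as abbreviations.
    if len(short) >= min_abbrev_len and is_subsequence(short, long):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

print(name_similarity("PTGNDR", "PTGENDER"))
print(name_similarity("PATGNDR", "PTGENDER"))
print(name_similarity("PATGNDR", "MOMDEM"))
```

Composite names would additionally need to be split into components before comparison, which this sketch does not attempt.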

Table Correspondence
- Elements generally do match across ‘corresponding’ tables
- Literal table names are not scalable as a feature
- Determine table correspondence heuristically, based on knowledge-driven match likelihood
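A possible reading of the heuristic is a voting scheme: every likely element match casts a vote for its pair of tables, and each source table is paired with the target table that collects the most votes. The sketch below follows that reading, which is an assumption. The function name and the table and element names in the example are hypothetical.

```python
from collections import Counter

def table_correspondence(likely_matches, source_table_of, target_table_of):
    """Map each source table to the target table sharing the most likely matches.

    likely_matches: iterable of (source_element, target_element) pairs
    source_table_of / target_table_of: dicts from element name to table name
    """
    votes = Counter(
        (source_table_of[s], target_table_of[t]) for s, t in likely_matches
    )
    best = {}
    # most_common() yields pairs by descending vote count, so the first
    # target table seen for a source table is the one with the most votes.
    for (src_table, tgt_table), _count in votes.most_common():
        best.setdefault(src_table, tgt_table)
    return best

# Hypothetical tables and elements.
source_table_of = {"PTGENDER": "PTDEMOG", "PTDOB": "PTDEMOG"}
target_table_of = {"PATGNDR": "DEMOGRAPHICS", "BIRTHYR": "DEMOGRAPHICS", "SEX_V": "VISITS"}
matches = [("PTGENDER", "PATGNDR"), ("PTDOB", "BIRTHYR"), ("PTGENDER", "SEX_V")]
print(table_correspondence(matches, source_table_of, target_table_of))
```

The resulting correspondence can then be used as a feature: a candidate pair whose tables correspond is more plausible than one whose tables do not.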

Experimental Results: Setup
- Various data dictionaries: ADNI, NACC, DIAN, LAADC, INDD
- Mapping pairs
  - Pairs of datasets: ADNI-NACC, ADNI-INDD, ADNI-LAADC, …
  - Dataset to GAAIN Common Model (GCM): ADNI-GCM, NACC-GCM, …
- Experiments
  - Mapping accuracy
  - Effectiveness of individual components: topic modeling (text description) match and filtering
  - Comparison with related systems
  - System parameters

Related Systems
1) Coma++ (http://dbs.uni-leipzig.de/Research/coma.html)
   - More suited for ‘semantic’, ontology integration tasks
   - Based on XML (nested structure) similarity
   - No support for incorporating element descriptions
2) Harmony (http://openii.sourceforge.net)
   - System targets exactly the same mapping problem as ours
   - Utilizes element name similarity and also element descriptions in matching

What Was Evaluated
- Mappings taken pairwise
  - Dataset pairs: ADNI-NACC, ADNI-INDD and ADNI-LAADC
  - Gold sets: ~150 element pairs (created manually)
- To GAAIN Common Model
  - ADNI-GAAIN Common Model
  - 24 GAAIN Common Model elements
- Report
  - Accuracy in terms of F-measure (precision and recall)
  - Against N, the size of result alternatives per match
- Matching algorithms
  (i) Harmony
  (ii) TF-IDF
  (iii) Topic modeling for text match
  (iv) Topic modeling + metadata filtering
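To make the metric concrete, here is a sketch of F-measure computed against N. The slides do not spell out the exact definitions, so this assumes that a source element counts as correct when its gold target appears among the top N suggestions, that precision is taken over elements with at least one suggestion, and that recall is taken over all gold pairs. The function name and the toy data are hypothetical.

```python
def f_measure_at_n(ranked, gold, n):
    """Return (precision, recall, F) for top-n suggestions.

    ranked: dict mapping source element -> ranked list of candidate targets
    gold:   dict mapping source element -> the correct target
    """
    attempted = [s for s in gold if ranked.get(s)]
    correct = sum(1 for s in attempted if gold[s] in ranked[s][:n])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy gold set of four pairs; the system returns suggestions for three of them.
gold = {"A": "x", "B": "y", "C": "z", "D": "w"}
ranked = {"A": ["x", "q"], "B": ["q", "y"], "C": ["q", "r"]}

for n in (1, 2):
    print(n, f_measure_at_n(ranked, gold, n))
```

Under these definitions the score can only stay the same or rise as N grows, since a larger N gives each gold target more chances to appear in the suggestion list.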

Results ADNI to NACC

Results ADNI to LAADC

Results ADNI to INDD

Results ADNI to GAAIN Common Model

Training Topic Model

Comparison

Common Model Mapping

Conclusions from Evaluation
- As a medical dataset mapping tool
  - High mapping accuracy (90% and above) is possible for datasets in this domain
  - Significantly higher mapping accuracy compared to available schema mapping systems like Coma++ and Harmony
- From a matching approach perspective
  - No universally superior approach for text similarity matching
  - Topic-modeling-based text matching provides significantly higher mapping accuracy than TF-IDF when the descriptions are not exactly the same
  - TF-IDF outperforms topic modeling when the descriptions are exactly the same
  - Metadata-based blocking is beneficial
- Internal system
  - Mapping accuracy is sensitive to topic model parameters: hyperparameters in the underlying ‘LDA’ topic model
  - Filter first, then match: better than match, then eliminate

Data Understanding: Model Discovery Using GEM
- Identifying data elements for a common data model over a collection of multiple, disparate datasets
- Common data model design is a complex problem
- GEM helps significantly in the bottom-up design of a common data model
- For each column of a source, corresponding matches from all destination sources are given

Current Work
- Machine-learning classification: text similarity, name similarity, table correspondence, …
- Active learning for training
- Data dictionary ingestion
Links
1) http://www.gaain.org
2) http://www-hsc.usc.edu/~ashish/ADT.htm
Thank you! nashish@loni.usc.edu
