Knowledge Graph: Connecting Big Data Semantics

1 Knowledge Graph: Connecting Big Data Semantics
Ying Ding Indiana University

2 Entity in Big Data
Entity: things, not strings
Relationship matters: connecting entities
Change in searching: string → entity → relation → subgraph

3 entity relations

4 Entities
Entities in social web: person, location, organization, book, music (freebase.com: Metformin)
Entities in translational medicine: gene, drug, disease, protein, side effect (conceptwiki: Disease Lafora)
Data: scientific papers (PubMed, PubMed Central) and experimental data (Swiss-Prot, KEGG, DrugBank)

5 Challenges Knowledge Graph – Entity Graph
Schema graph (small size) vs. instance graph (large size)
Graph mining (e.g., shortest path, depth-first/breadth-first search, PageRank)
Neo4j, NoSQL graph database
Graph pattern search (SPARQL)
Triple store, Virtuoso (openlinksw)
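As an illustration of the graph mining operations named on this slide, here is a minimal sketch of a breadth-first shortest-path search over a tiny entity graph. The graph contents and the function name `shortest_path` are assumptions made for the example; they do not come from the presentation.

```python
from collections import deque

# Hypothetical toy entity graph (undirected adjacency lists), for illustration only.
GRAPH = {
    "Metformin": ["Type 2 diabetes", "Lactic acidosis"],
    "Type 2 diabetes": ["Metformin", "INS gene"],
    "Lactic acidosis": ["Metformin"],
    "INS gene": ["Type 2 diabetes", "Insulin"],
    "Insulin": ["INS gene"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes on one shortest path, or None."""
    if start == goal:
        return [start]
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk parent links back to the start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

print(shortest_path(GRAPH, "Metformin", "Insulin"))
```

On an instance graph of realistic size, a graph database or triple store would normally do this traversal; the sketch only shows the shape of the operation.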

6 Use Case: Individualized Cohort in EHR
An EHR-based individualized cohort can provide a better solution than standard guidelines because the cohort is drawn from a patient population with the same geolocation, demographics, and socio-economic group as the given patient.
EHRs are organized around the patient, not by concepts (diseases, lab results, medications, etc.)

7 Use Case: Individualized Cohort in EHR
EHR data contains controlled vocabularies (e.g., demographics, diagnostic codes, medications, procedures) and continuous values (e.g., lab tests, medication doses).
Category hierarchy (parent, siblings, subtrees): search for patients like a given diagnosis “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22 (with chronic kidney disease) → ICD10:E11 (diabetes in general)
Continuous values: serum glucose = 120 mg/dL (many continuous values may not have a natural aggregate binning)
Queries for searching patients are rarely exact (fasting serum glucose = 126 → serum glucose between 120 and 130), or serum glucose in the 80th percentile at this time
A patient can have ,000 property values, which contain 100 controlled vocabulary values and 1000 continuous values. Most values are time based.
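A minimal sketch of such an inexact search, combining a widened diagnosis category with a glucose range. The patient records are invented, and deriving the parent category by cutting the code at the dot is a simplifying assumption for this example, not a rule stated in the presentation.

```python
# Hypothetical records: (patient_id, ICD-10 code, serum glucose in mg/dL).
PATIENTS = [
    ("p1", "E11.21", 126),
    ("p2", "E11.22", 131),
    ("p3", "E11.9", 122),
    ("p4", "E10.21", 125),
]

def parent_code(code):
    """Assumption: the parent category is the part before the dot (E11.21 -> E11)."""
    return code.split(".")[0]

def similar_patients(patients, code, glucose_low, glucose_high):
    """Patients in the same parent category whose glucose lies in [low, high]."""
    parent = parent_code(code)
    return [
        pid
        for pid, dx, glucose in patients
        if parent_code(dx) == parent and glucose_low <= glucose <= glucose_high
    ]

# Widen "E11.21 with glucose = 126" to "any E11 code with glucose between 120 and 130".
print(similar_patients(PATIENTS, "E11.21", 120, 130))
```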

8 Challenges Searching challenges
Category hierarchy (parent, siblings, subtrees): search for patients like a given diagnosis “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22 (with chronic kidney disease) → ICD10:E11 (diabetes in general)
Continuous values: serum glucose = 120 mg/dL (many continuous values may not have a natural aggregate binning)
Queries for searching patients are rarely exact (fasting serum glucose = 126 → serum glucose between 120 and 130), or serum glucose in the 80th percentile at this time
Map the changes in value to changes in time: search for a patient with a 60th to 90th percentile transition between two serum glucose values over a 6-month time frame. If we have N glucose values, then for any two patients we have to make N*(N-1)/2 time-based glucose-value comparisons. How to scale it up?
Find common patterns in a set of individualized cohort patients. This means comparing combinations of subsets of millions of differentials for each patient in the cohort.
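To make the N*(N-1)/2 figure concrete, the sketch below enumerates every (earlier, later) pair in one patient's glucose series and keeps the pairs that rise from at or below the 60th percentile to at or above the 90th within roughly six months. The series, the precomputed percentiles, and the 180-day window are all assumptions for illustration.

```python
from itertools import combinations

# Hypothetical series for one patient: (day, glucose percentile in a reference population).
SERIES = [(0, 55), (60, 62), (150, 91), (400, 93)]

def pair_count(n):
    """Number of unordered pairs among n values: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def transitions(series, low_max, high_min, max_days):
    """All (earlier, later) pairs rising from <= low_max to >= high_min within max_days."""
    ordered = sorted(series)  # sort by day so each pair is (earlier, later)
    return [
        (a, b)
        for a, b in combinations(ordered, 2)
        if a[1] <= low_max and b[1] >= high_min and b[0] - a[0] <= max_days
    ]

print(pair_count(len(SERIES)))           # pairs examined for this patient
print(transitions(SERIES, 60, 90, 180))  # pairs matching the 60th -> 90th pattern
```

The pair count grows quadratically with N, which is the scaling concern the slide raises.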

9 Relational Database Semantic Graph
Paradigm shift from relational row-column lookup to semantic graph traversal Relational Database is less efficient in joins, Big indexing overhead (need to indexing every column)

10 EHR RDF Graph
Patient EHR data in semantic graph representation. EHR timelines for Patients A and B are shown as RDF graphs. Property values of each patient (demographics, labs, diagnosis, etc.) are connected to their respective ontologies, enabling searches for patterns across different patients.

11 EHR RDF Graph
Application of continuous value classes will enrich the patients retrieved from the database.
2A. Property values as literal nodes will not link “like” patients together without a “relational” query.
2B. By using controlled vocabulary (CV)-ontology edges, we will be able to link patients through CV-value nodes.
2C. By adding “nearby” classes to continuous value nodes, we will link additional patients. Different strategies will create different “nearby” links.
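The contrast between literal nodes and "nearby" classes can be sketched as follows. Fixed-width 10 mg/dL bins are just one possible strategy, chosen here as an assumption; the values and the class naming scheme are hypothetical.

```python
from collections import defaultdict

# Hypothetical serum glucose values (mg/dL) for three patients.
GLUCOSE = {"A": 121, "B": 126, "C": 138}

def literal_node(value):
    """Literal node: only patients with the identical value share it."""
    return value

def bin_node(value, width=10):
    """One possible "nearby" class: a fixed-width bin (hypothetical naming scheme)."""
    low = (value // width) * width
    return f"glucose_{low}_{low + width}"

def link_by_node(values, node_of):
    """Group patients by the node their value attaches to; keep nodes shared by 2+ patients."""
    groups = defaultdict(set)
    for patient, value in values.items():
        groups[node_of(value)].add(patient)
    return {node: members for node, members in groups.items() if len(members) > 1}

print(link_by_node(GLUCOSE, literal_node))  # no shared literal values, so no links
print(link_by_node(GLUCOSE, bin_node))      # patients A and B share one bin node
```

A different strategy, such as overlapping bins or percentile bands, would produce a different set of links, as the slide notes.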

12 Challenges: Semantic Graph Mining
Graph indexing
gIndex: indexing frequent subgraphs, using subgraphs as features
Graph classification, clustering
Path-based clustering and top-k similarity problems in heterogeneous information network
Path-based graph mining
Complex dependencies within heterogeneous network
Conventional supervised classification methods assume that the objects are independent
Sequential matching vs. snapshot matching as EHR records have a time dimension.

13 Linked Open Data

14 Challenges for Semantic Web
How to handle ontology graph + instance graph
How to handle inferred triples and existing triples (reasoning)
Graph pattern search vs. graph mining
Datatype properties vs. object properties
Different levels of semantics: ontology (schema), categorized values (terminology), continuous values (binning?), literal
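The distinction between existing and inferred triples can be illustrated with a small forward-chaining sketch. The identifiers, the predicate names, and the two rules are assumptions chosen for the example; a real triple store would apply its own reasoning profile.

```python
# Asserted (existing) triples, written as plain strings; identifiers are hypothetical.
ASSERTED = {
    ("E11.21", "subClassOf", "E11"),
    ("E11", "subClassOf", "Diabetes"),
    ("patientA", "hasDiagnosis", "E11.21"),
}

def infer(triples):
    """Forward-chain two illustrative rules until nothing new appears:
    1. subClassOf is transitive;
    2. a diagnosis also holds for every superclass of the diagnosed class.
    Returns only the inferred triples, not the asserted ones."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            if p not in ("subClassOf", "hasDiagnosis"):
                continue
            for s2, p2, o2 in closed:
                if p2 == "subClassOf" and s2 == o:
                    new.add((s, p, o2))
        if new <= closed:  # fixed point reached
            return closed - set(triples)
        closed |= new

for triple in sorted(infer(ASSERTED)):
    print(triple)
```

Whether inferred triples are materialized and stored next to the asserted ones, or derived at query time, is one of the trade-offs behind the question on this slide.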

