
1 Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA
Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm
Database Research Group, University of Leipzig
9th Intl. Conf. on Data Integration in the Life Sciences (DILS), Montreal, July 2013

2 Ontologies
- Multiple interrelated ontologies in a domain, e.g., anatomy: MeSH, GALEN, UMLS, SNOMED, NCI Thesaurus, Mouse Anatomy, FMA
- Identify overlapping information between ontologies for information exchange, data integration, reuse, ...
- Need to create mappings between ontologies

3 Matching Example
- Two 'small' anatomy ontologies O and O': concepts with attributes (name, synonym)
- Possible match strategy in GOMMA* (sketched below)
  - Compare the name/synonym values of concepts with a string similarity function, e.g., n-gram or edit distance
  - Two concepts match if one value pair is highly similar
- Example (5 x 4 = 20 similarity computations):
  O:  c0 = {head}, c1 = {trunk, torso}, c2 = {limb, extremity}
  O': c0' = {head}, c1' = {torso, truncus}, c2' = {limbs}
  M(O,O') = {(c0,c0'), (c1,c1'), (c2,c2')}
* Kirsten, Groß, Hartung, Rahm: GOMMA: A Component-based Infrastructure for Managing and Analyzing Life Science Ontologies and their Evolution. Journal of Biomedical Semantics, 2011
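A minimal Java sketch of this per-concept match strategy, assuming a pluggable string similarity function; the class, interface, and method names (SimpleMatcher, StringSim, match) are illustrative and not GOMMA's actual API.

// Illustrative sketch only (not GOMMA's actual API): a concept pair matches if at
// least one attribute-value pair is highly similar according to a pluggable string
// similarity function such as n-gram or edit distance.
import java.util.List;

public class SimpleMatcher {

    // hypothetical interface for a string similarity function
    interface StringSim {
        double getSim(String a, String b);
    }

    // returns true if any value pair of the two concepts reaches the threshold
    static boolean match(List<String> valuesOfC, List<String> valuesOfCPrime,
                         StringSim sim, double threshold) {
        for (String a : valuesOfC) {
            for (String b : valuesOfCPrime) {
                if (sim.getSim(a, b) >= threshold) {
                    return true;  // one highly similar value pair is enough
                }
            }
        }
        return false;
    }
}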

4 Problems
- Evaluation of the Cartesian product O x O'
  - Especially costly for large life science ontologies
  - Different strategies: pruning, blocking, mapping reuse, ...
- Excessive usage of similarity functions
  - Applied O(|O| · |O'|) times during matching
  - How efficiently (runtime, space) does a similarity function work?
- Experiences from GOMMA
  1. Optimized implementation of the n-gram similarity function
  2. Application on massively parallel hardware: Graphics Processing Units (GPU), multi-core CPUs

5 Trigram (n=3) Similarity with Dice
Trigram similarity
- Input: two strings A and B to be compared
- Output: similarity sim(A,B) ∈ [0,1] between A and B
Approach
1. Split A and B into tokens of length 3
2. Compute the intersect (overlap) of both token sets
3. Calculate the Dice metric based on the sizes of the intersect and the token sets
- (Optional) Add pre-/postfixes of length 2 (e.g., ##, $$) to A and B before tokenization
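A minimal Java sketch of this trigram/Dice computation, assuming no pre-/postfix padding; class and method names are illustrative, not GOMMA's implementation.

// Minimal sketch of trigram similarity with the Dice coefficient; no pre-/postfix
// padding is applied, and names are illustrative rather than GOMMA's code.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TrigramSim {

    // step 1: split a string into overlapping tokens of length 3
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            tokens.add(s.substring(i, i + 3));
        }
        return tokens;
    }

    // sim(A,B) = 2 * |intersect| / (|tokensA| + |tokensB|)
    static double getSim(String a, String b) {
        List<String> aTokens = tokenize(a);
        List<String> bTokens = tokenize(b);
        Set<String> intersect = new HashSet<>(aTokens);
        intersect.retainAll(bTokens);                 // step 2: overlap of both token sets
        return 2.0 * intersect.size() / (aTokens.size() + bTokens.size());  // step 3: Dice
    }

    public static void main(String[] args) {
        System.out.println(getSim("TRUNK", "TRUNCUS"));  // 0.5, as in the example on the next slide
    }
}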

6 Trigram Similarity - Example
sim('TRUNK', 'TRUNCUS')
1. Token sets: {TRU, RUN, UNK} and {TRU, RUN, UNC, NCU, CUS}
2. Intersect: {TRU, RUN}
3. Dice metric: 2 · 2 / (3 + 5) = 4/8 = 0.5

7 Naïve Solution
Method
- Tokenization (trivial): result are two token arrays aTokens and bTokens
- Intersect computation with nested loop (main part): for each token in aTokens, look for the same token in bTokens; on a hit, increase the overlap counter and go on with the next token in aTokens
- Final similarity (trivial): 2 · overlap / (|aTokens| + |bTokens|)
Example: aTokens = {TRU, RUN, UNK}, bTokens = {TRU, RUN, UNC, NCU, CUS}; overlap grows 0 → 1 → 2 after 8 string comparisons
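A small Java sketch of this naïve nested-loop intersect; every probe is a string comparison (String.equals), and the break mirrors "go on with the next token in aTokens".

// Sketch of the naïve nested-loop intersect over two token arrays.
public class NaiveTrigramSim {

    static double getSim(String[] aTokens, String[] bTokens) {
        int overlap = 0;
        for (String a : aTokens) {
            for (String b : bTokens) {
                if (a.equals(b)) {   // expensive string comparison
                    overlap++;
                    break;           // continue with the next token of aTokens
                }
            }
        }
        return 2.0 * overlap / (aTokens.length + bTokens.length);
    }

    public static void main(String[] args) {
        String[] a = {"TRU", "RUN", "UNK"};
        String[] b = {"TRU", "RUN", "UNC", "NCU", "CUS"};
        System.out.println(getSim(a, b));   // 0.5, after 8 string comparisons
    }
}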

8 "Sort-Merge"-like Solution
Optimization ideas
1. Avoid string comparisons
   - String comparisons are expensive, especially for equal-length strings (e.g., "equals" of the String class in Java)
   - Dictionary-based transformation of tokens into unique integer values: instead of comparing 'UNC' = 'UNK'?, compare integers (e.g., 3 = 8?)
2. Avoid nested-loop complexity
   - O(m · n) comparisons to determine the intersect of the token sets
   - A-priori sorting of the token arrays: exploit ordered tokens during comparison (O(m+n), see sort-merge join)
   - The sorting cost amortizes because token sets are used multiple times for comparison

9 "Sort-Merge"-like Solution - Example
sim(TRUNK, TRUNCUS)
- Tokenization → integer conversion → sorting
  TRUNK → {TRU, RUN, UNK} → {1, 2, 3}
  TRUNCUS → {TRU, RUN, UNC, NCU, CUS} → {1, 2, 4, 5, 6}
- Intersect with interleaved linear scans
  aTokens = {1, 2, 3}, bTokens = {1, 2, 4, 5, 6}; overlap grows 0 → 1 → 2 after only 3 integer comparisons
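A Java sketch of this "sort-merge"-like variant under the same assumptions as before: tokens are mapped to unique integers via a shared dictionary, each id array is sorted once (the cost amortizes over many comparisons), and the overlap is computed with interleaved linear scans. For TRUNK/TRUNCUS it yields the same 0.5 with only three integer comparisons, as in the example above.

// Sketch of the optimized variant: dictionary encoding, a-priori sorting,
// and a merge-style intersect in O(m+n) integer comparisons.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortMergeTrigramSim {

    private final Map<String, Integer> dict = new HashMap<>();

    // dictionary-based transformation of a token list into a sorted int array
    int[] toSortedIds(List<String> tokens) {
        int[] ids = new int[tokens.size()];
        for (int i = 0; i < ids.length; i++) {
            Integer id = dict.get(tokens.get(i));
            if (id == null) {                     // assign the next free integer id
                id = dict.size() + 1;
                dict.put(tokens.get(i), id);
            }
            ids[i] = id;
        }
        Arrays.sort(ids);                         // a-priori sorting, done once per value
        return ids;
    }

    // merge-style intersect of two sorted id arrays
    static double getSim(int[] a, int[] b) {
        int i = 0, j = 0, overlap = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j])     { overlap++; i++; j++; }
            else if (a[i] < b[j]) { i++; }
            else                  { j++; }
        }
        return 2.0 * overlap / (a.length + b.length);
    }
}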

10 GPU as Execution Environment
Design goals
- Scalability to hundreds of cores and thousands of threads
- Focus on parallel algorithms
- Example: CUDA programming model
CUDA kernels and threads
- Kernel: function that runs on a device (GPU, CPU)
- Many CUDA threads can execute each kernel
- CUDA vs. CPU threads: CUDA threads are extremely lightweight (little creation overhead, instant switching, thousands of threads used to achieve efficiency), whereas multi-core CPUs can only run a few threads efficiently
Drawbacks
- A-priori memory allocation, only basic data structures

11 Bringing n-gram to the GPU
Problems to be solved
1. Which data structures are possible for input/output?
2. How to cope with fixed/limited memory?
3. How can n-gram matching be parallelized on the GPU?

12 Input-/Output Data Structure
- Three-layered index structure for each ontology
  - ci: concept index representing all concepts
  - ai: attribute index representing all attributes
  - gi: gram (token) index containing all tokens
- Two arrays to represent the top-k matches per concept
  - A-priori memory allocation possible (array size of k · |O|)
  - short (2 bytes) instead of float (4 bytes) data type for similarities to reduce memory consumption
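A hedged Java sketch of these flat arrays; the names ci, ai, gi, corrs, and sims follow the slides, while the class name and the exact layout of the top-k arrays are assumptions, not GOMMA's actual data structures.

// Sketch of the three-layered index plus the pre-allocated top-k result arrays.
public class OntologyIndex {
    int[] ci;       // concept index: ci[c] = offset of concept c's first attribute in ai
    int[] ai;       // attribute index: ai[a] = offset of attribute a's first gram in gi
    int[] gi;       // gram index: dictionary-encoded trigram ids of all attribute values

    // result arrays for the top-k matches per concept, allocated a priori (k * |O| entries)
    int[]   corrs;  // corrs[c * k + r] = r-th best matching concept of the other ontology
    short[] sims;   // sims[c * k + r]  = similarity scaled to a short, e.g., 0.8 -> 800
}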

13 Input-/Output Data Structure - Example
Input: both ontologies with the token sets of their attribute values (trigrams encoded as dictionary integers), stored in the ci/ai/gi index arrays
- O:  c0 = {head}, c1 = {trunk, torso}, c2 = {limb, extremity}
- O': c0' = {head}, c1' = {torso, truncus}, c2' = {limbs}
- e.g., for O': ci = [0, 1, 3, 4], ai = [0, 2, 5, 10, 13], gi = [1,2, 6,7,8, 3,4,18,19,20, 9,10,21]
Output (top-k = 2, sim > 0.7): corrs = [c0-c0', c1-c1', c2-c2'] with the similarities stored as shorts in sims (e.g., 1000, 800)

14 Limited Memory / Parallelization
- Ontologies and mapping too large for the GPU
  - Size-based ontology partitioning* into partitions Pi and Qj
  - Ideal case: one ontology fits completely into GPU memory
- Each kernel computes n-gram similarities between one concept of Pi and all concepts in Qj
  - Re-use of an already stored partition on the GPU (e.g., keep the ci/ai/gi arrays of P0 in global memory, replace Q3 with Q4, read back M(P0,Q4) from corrs/sims)
- Hybrid execution on GPU and CPU possible: the GPU thread and additional CPU thread(s) process different match tasks (Pi, Qj) in parallel (see the sketch below)
* Groß, Hartung, Kirsten, Rahm: On Matching Large Life Science Ontologies in Parallel. Proc. DILS, 2010
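A rough Java sketch of this partition-wise hybrid execution; matchOnGpu, matchOnCpu, and the scheduling decision are hypothetical placeholders rather than GOMMA's interface, and OntologyIndex is the sketch from slide 12.

// Partition-wise matching: stream Q_j partitions past a P_i kept on the GPU,
// letting CPU threads take over tasks when the GPU is busy.
import java.util.ArrayList;
import java.util.List;

public class HybridMatcher {

    static List<short[]> matchAll(List<OntologyIndex> pParts, List<OntologyIndex> qParts) {
        List<short[]> results = new ArrayList<>();
        for (OntologyIndex p : pParts) {            // keep the current P_i on the GPU ...
            for (OntologyIndex q : qParts) {        // ... and stream the Q_j partitions through
                boolean gpuAvailable = true;        // placeholder for a real scheduling decision
                short[] sims = gpuAvailable
                        ? matchOnGpu(p, q)          // one kernel per concept of P_i
                        : matchOnCpu(p, q);         // idle CPU threads handle other match tasks
                results.add(sims);
            }
        }
        return results;
    }

    static short[] matchOnGpu(OntologyIndex p, OntologyIndex q) { /* native CUDA call */ return new short[0]; }
    static short[] matchOnCpu(OntologyIndex p, OntologyIndex q) { /* sort-merge trigram on the CPU */ return new short[0]; }
}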

15 Evaluation
- FMA-NCI match problem from the OAEI 2012 Large Bio Task
  - Three sub tasks (small, large, whole)
  - With blocking step to be comparable with the OAEI 2012 results*
- Hardware
  - CPU: Intel i5-2500 (4 x 3.30 GHz, 8 GB)
  - GPU: Asus GTX660 (960 CUDA cores, 2 GB)

Match subtask | FMA #concepts | FMA #attValues | NCI #concepts | NCI #attValues | #comparisons
small         | 3,720         | 8,502          | 6,551         | 13,308         | 0.12·10^9
large         | 28,885        | 50,728         | 6,895         | 13,125         | 0.66·10^9
whole         | 79,042        | 125,304        | 11,675        | 21,070         | 2.64·10^9

* Groß, Hartung, Kirsten, Rahm: GOMMA Results for OAEI 2012. Proc. 7th Intl. Ontology Matching Workshop, 2012

16 Results for one CPU or GPU
- Different implementations of the trigram similarity: naïve nested loop, hash-set lookup, sort-merge
- The sort-merge solution performs significantly better
- The GPU further reduces execution times (to ~20% of the CPU time)

17 Results for hybrid CPU/GPU usage
- No GPU vs. GPU with an increasing number of CPU threads
- Good scalability for multiple CPU threads (speedup of 3.6)
- Slightly better execution time with hybrid CPU/GPU; one thread is required to communicate with the GPU

18 Summary and Future Work
Experiences from optimizing GOMMA's ontology matching workflows
- Tuning of the n-gram similarity function: preprocessing (integer conversion, sorting) for more efficient similarity computation
- Execution on the GPU: overcoming GPU drawbacks (fixed memory, a-priori allocation)
- Significant reduction of execution times: 104 min → 99 sec for the FMA-NCI (whole) match task
Future work
- Execution of other similarity functions on the GPU
- Flexible ontology matching / entity resolution workflows: choice between CPU, GPU, cluster, cloud infrastructure, ...

19 Thank You for Your Attention! Questions?

