Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA
Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm
Database Research Group, University of Leipzig
9th Intl. Conf. on Data Integration in the Life Sciences (DILS), Montreal, July 2013

2 Ontologies
- Multiple interrelated ontologies in a domain; example: anatomy (MeSH, GALEN, UMLS, SNOMED, NCI Thesaurus, Mouse Anatomy, FMA)
- Identify overlapping information between ontologies: information exchange, data integration purposes, reuse, ...
- Need to create mappings between ontologies

3 Matching Example
- Two 'small' anatomy ontologies O and O': concepts with attributes (name, synonym)
- Possible match strategy in GOMMA*: compare name/synonym values of concepts with a string similarity function, e.g., n-gram or edit distance; two concepts match if one value pair is highly similar
- Example: c0 (head), c1 (trunk, torso), c2 (limb, extremity) in O; c0' (head), c1' (torso, truncus), c2' (limbs) in O'; 5 x 4 = 20 similarity computations yield M_{O,O'} = {(c0,c0'), (c1,c1'), (c2,c2')}

* Kirsten, Groß, Hartung, Rahm: GOMMA: A Component-based Infrastructure for Managing and Analyzing Life Science Ontologies and their Evolution. Journal of Biomedical Semantics, 2011

4 Problems
- Evaluation of the Cartesian product O x O': especially expensive for large life science ontologies; different strategies exist: pruning, blocking, mapping reuse, ...
- Excessive usage of similarity functions: applied O(|O| * |O'|) times during matching; how efficiently (runtime, space) does the similarity function itself work?
- Experiences from GOMMA:
  1. Optimized implementation of the n-gram similarity function
  2. Application on massively parallel hardware: Graphics Processing Units (GPU), multi-core CPUs

5 Trigram (n=3) Similarity with Dice
- Trigram similarity: input: two strings A and B to be compared; output: similarity sim(A,B) ∈ [0,1] between A and B
- Approach:
  1. Split A and B into tokens of length 3
  2. Compute the intersection (overlap) of both token sets
  3. Calculate the Dice metric from the sizes of the intersection and the token sets
- (Optional) Assign pre-/postfixes of length 2 (e.g., ##, $$) to A and B before tokenization
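In symbols, with T_A and T_B denoting the trigram sets of A and B, the Dice similarity computed in step 3 is

    sim(A,B) = 2 * |T_A ∩ T_B| / (|T_A| + |T_B|)

which always lies in [0,1]; the next slide works through a concrete instance.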

6 Trigram Similarity - Example
sim('TRUNK', 'TRUNCUS')
1. Token sets: {TRU, RUN, UNK} and {TRU, RUN, UNC, NCU, CUS}
2. Intersection: {TRU, RUN}
3. Dice metric: 2 * 2 / (3 + 5) = 4/8 = 0.5

7 Naïve Solution
- Tokenization (trivial): result: two token arrays aTokens and bTokens
- Intersection computation with a nested loop (main part): for each token in aTokens, look for the same token in bTokens; if found, increase the overlap counter and go on with the next token in aTokens
- Final similarity (trivial): 2 * overlap / (|aTokens| + |bTokens|)
- Example: aTokens = {TRU, RUN, UNK}, bTokens = {TRU, RUN, UNC, NCU, CUS}; overlap: 2; #string comparisons: 8
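A minimal CUDA C++ sketch of this naïve variant (function and variable names are illustrative, not GOMMA's actual code; the optional pre-/postfix padding is omitted):

    #include <string>
    #include <vector>

    // Split a string into overlapping trigrams ("TRUNK" -> TRU, RUN, UNK).
    std::vector<std::string> tokenize(const std::string& s) {
        std::vector<std::string> tokens;
        for (size_t i = 0; i + 3 <= s.size(); ++i)
            tokens.push_back(s.substr(i, 3));
        return tokens;
    }

    // Naïve trigram-Dice similarity: nested-loop intersection with
    // O(m*n) string comparisons, as described on this slide.
    double trigramSimNaive(const std::string& a, const std::string& b) {
        std::vector<std::string> aTokens = tokenize(a);
        std::vector<std::string> bTokens = tokenize(b);
        if (aTokens.empty() && bTokens.empty()) return 0.0;  // guard
        int overlap = 0;
        for (const std::string& t : aTokens)
            for (const std::string& u : bTokens)
                if (t == u) { ++overlap; break; }  // next token in aTokens
        return 2.0 * overlap / (aTokens.size() + bTokens.size());
    }

For sim("TRUNK", "TRUNCUS") this returns 2 * 2 / (3 + 5) = 0.5, exactly as in the example above.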

8 "Sort-Merge"-like Solution
Optimization ideas:
1. Avoid string comparisons: string comparisons are expensive, especially for equal-length strings (e.g., equals of the String class in Java); a dictionary-based transformation maps tokens to unique integer values, so 'UNC' = 'UNK'? becomes the cheap integer comparison 3 = 8?
2. Avoid nested-loop complexity: the nested loop needs O(m * n) comparisons to determine the intersection of the token sets; a-priori sorting of the token arrays lets the comparison exploit the token order (O(m+n), cf. sort-merge join), and the sorting cost amortizes because token sets are used multiple times across comparisons

9 "Sort-Merge"-like Solution - Example
sim(TRUNK, TRUNCUS)
- Tokenization -> integer conversion -> sorting:
  TRUNK -> {TRU, RUN, UNK} -> {1, 2, 3}
  TRUNCUS -> {TRU, RUN, UNC, NCU, CUS} -> {1, 2, 4, 5, 6}
- Intersection with interleaved linear scans: aTokens = {1, 2, 3}, bTokens = {1, 2, 4, 5, 6}; overlap: 2; #integer comparisons: 3
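A CUDA C++ sketch of the merge step, assuming both token arrays have already been dictionary-encoded to integers and sorted as above (names are illustrative):

    #include <vector>

    // Intersection size of two sorted integer token arrays via
    // interleaved linear scans in O(m+n), cf. a sort-merge join.
    int overlapSorted(const std::vector<int>& aTokens,
                      const std::vector<int>& bTokens) {
        int overlap = 0;
        size_t i = 0, j = 0;
        while (i < aTokens.size() && j < bTokens.size()) {
            if (aTokens[i] == bTokens[j])     { ++overlap; ++i; ++j; }
            else if (aTokens[i] < bTokens[j]) { ++i; }
            else                              { ++j; }
        }
        return overlap;
    }

    // Example: overlapSorted({1,2,3}, {1,2,4,5,6}) == 2,
    // giving sim = 2*2 / (3+5) = 0.5 as before.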

10 GPU as Execution Environment
- Design goals: scalability with 100's of cores and 1000's of threads; focus on parallel algorithms; example: the CUDA programming model
- CUDA kernels and threads: a kernel is a function that runs on a device (GPU, CPU); many CUDA threads can execute each kernel
- CUDA vs. CPU threads: CUDA threads are extremely lightweight (little creation overhead, instant switching; 1000's of threads are used to achieve efficiency), whereas multi-core CPUs can only consume a few threads
- Drawbacks: a-priori memory allocation, only basic data structures
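For illustration, a minimal CUDA kernel and launch (a generic example, not GOMMA code): each of the many lightweight threads handles one element, which it selects via its block and thread index.

    #include <cuda_runtime.h>

    // Each thread processes one array element; thousands of threads
    // execute this same kernel function in parallel on the GPU.
    __global__ void scaleKernel(float* data, float factor, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= factor;
    }

    void scaleOnGpu(float* dData, float factor, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(dData, factor, n);
        cudaDeviceSynchronize();  // wait until all threads have finished
    }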

11 Bringing n-gram to the GPU
Problems to be solved:
1. Which data structures are possible for input/output?
2. How to cope with fixed/limited memory?
3. How can n-gram be parallelized on the GPU?

12 Input/Output Data Structure
- Three-layered index structure for each ontology: ci: concept index representing all concepts; ai: attribute index representing all attributes; gi: gram (token) index containing all tokens
- Two arrays to represent the top-k matches per concept: a-priori memory allocation possible (array size of k * |O|); short (2 bytes) instead of float (4 bytes) data type for similarities to reduce memory consumption

13 Input/Output Data Structure - Example
- Input: the token sets of O and O' after dictionary encoding, stored in the ci/ai/gi index arrays:
  O: c0: [1,2] (head); c1: [3,4,5] (trunk), [6,7,8] (torso); c2: [9,10] (limb), [11,12,13,14,15,16,17] (extremity)
  O': c0': [1,2] (head); c1': [6,7,8] (torso), [3,4,18,19,20] (truncus); c2': [9,10,21] (limbs)
- Output (top-k = 2 and sim > 0.7): M_{O,O'} as the two arrays corrs and sims, holding per concept the correspondences c0-c0', c1-c1', c2-c2' with their similarity values
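A CUDA C++ sketch of how such flat, pre-allocated arrays might look (the field names ci/ai/gi follow the slide, but the concrete layout, the trailing-sentinel convention, and the similarity scaling are assumptions; GOMMA's actual representation may differ):

    #include <cstdint>
    #include <vector>

    // Three-layered index over one ontology: ci points into ai,
    // ai points into gi, gi holds the dictionary-encoded trigrams.
    // Assumption: ci and ai each store one trailing sentinel offset,
    // so ci[p+1] and ai[a+1] are always valid range ends.
    struct OntologyIndex {
        std::vector<int> ci;  // per concept: offset of its first attribute in ai
        std::vector<int> ai;  // per attribute: offset of its first token in gi
        std::vector<int> gi;  // sorted integer token ids of all attribute values
    };

    // Top-k match result per concept, allocated a priori (k * |O| slots).
    // Similarities are stored as 2-byte values (e.g., sim * 10000)
    // instead of 4-byte floats to reduce memory consumption.
    struct TopKMatches {
        std::vector<int>      corrs;  // matching concept ids from O'
        std::vector<uint16_t> sims;   // scaled similarity values
    };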

14 Limited Memory / Parallelization
- Ontologies and mapping too large for the GPU: size-based ontology partitioning*; ideal case: one ontology fits completely into GPU memory
- Each kernel computes the n-gram similarities between one concept of P_i and all concepts of Q_j (kernels 0 ... |P_0|-1); the partition already stored in GPU global memory is re-used, e.g., keep P_0 resident, replace Q_3 with Q_4, and read back M_{P_0,Q_3}
- Hybrid execution on GPU/CPU possible: the match tasks of the partition-pair matrix (P_0..P_3 x Q_0..Q_4) are distributed over the GPU thread and the CPU thread(s)

* Groß, Hartung, Kirsten, Rahm: On Matching Large Life Science Ontologies in Parallel. Proc. DILS, 2010
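A hedged CUDA sketch of such a per-concept kernel: thread p computes the similarities between concept p of the resident partition P and all concepts of partition Q, using the ci/ai/gi layout with trailing sentinels sketched earlier. The flat sims output, the similarity scaling, and all names are assumptions; GOMMA's actual kernel, top-k maintenance, and threshold handling will differ.

    #include <cstdint>

    // One thread per concept of P: compare all its attribute token sets
    // against every concept of the GPU-resident partition Q.
    __global__ void matchKernel(const int* pCi, const int* pAi, const int* pGi,
                                int numP,
                                const int* qCi, const int* qAi, const int* qGi,
                                int numQ,
                                uint16_t* sims /* numP x numQ, scaled by 10000 */) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numP) return;
        for (int q = 0; q < numQ; ++q) {
            uint16_t best = 0;
            // compare every attribute value pair of concepts p and q
            for (int a = pCi[p]; a < pCi[p + 1]; ++a) {
                for (int b = qCi[q]; b < qCi[q + 1]; ++b) {
                    // sorted-merge intersection of the two token ranges
                    int i = pAi[a], j = qAi[b], overlap = 0;
                    while (i < pAi[a + 1] && j < qAi[b + 1]) {
                        if      (pGi[i] == qGi[j]) { ++overlap; ++i; ++j; }
                        else if (pGi[i] <  qGi[j]) { ++i; }
                        else                       { ++j; }
                    }
                    int lenSum = (pAi[a + 1] - pAi[a]) + (qAi[b + 1] - qAi[b]);
                    uint16_t s = (uint16_t)(2 * overlap * 10000 / lenSum);
                    if (s > best) best = s;  // two concepts match if ONE value
                }                            // pair is highly similar -> keep max
            }
            sims[p * numQ + q] = best;
        }
    }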

15 Evaluation
- FMA-NCI match problem from the OAEI 2012 Large Bio task: three sub-tasks (small, large, whole); with blocking step to be comparable with the OAEI 2012 results*
- Match sub-tasks:
  small: FMA 3,720 concepts / 8,502 attribute values; NCI 6,551 concepts / 13,... attribute values; ...·10^9 comparisons
  large: FMA 28,885 concepts / 50,728 attribute values; NCI 6,895 concepts / 13,... attribute values; ...·10^9 comparisons
  whole: FMA 79,042 concepts / 125,304 attribute values; NCI 11,675 concepts / 21,... attribute values; ...·10^9 comparisons
- Hardware: CPU: Intel i (4x 3.30 GHz, 8 GB); GPU: Asus GTX 660 (960 CUDA cores, 2 GB)

* Groß, Hartung, Kirsten, Rahm: GOMMA Results for OAEI 2012. Proc. 7th Intl. Ontology Matching Workshop, 2012

16 Results for one CPU or GPU
- Different implementations of the trigram function: naïve nested loop, hash-set lookup, sort-merge
- The sort-merge solution performs significantly better
- The GPU further reduces execution times (to ~20% of the CPU time)

17 Results for hybrid CPU/GPU usage
- NoGPU vs. GPU with an increasing number of CPU threads
- Good scalability for multiple CPU threads (speed-up of 3.6)
- Slightly better execution time with hybrid CPU/GPU usage; one thread is required to communicate with the GPU

18 Summary and Future Work
- Experiences from optimizing GOMMA's ontology matching workflows:
  - Tuning of the n-gram similarity function: preprocessing (integer conversion, sorting) for more efficient similarity computation
  - Execution on the GPU: overcoming GPU drawbacks (fixed memory, a-priori allocation)
  - Significant reduction of execution times: 104 min -> 99 sec for the FMA-NCI (whole) match task
- Future work:
  - Execution of other similarity functions on the GPU
  - Flexible ontology matching / entity resolution workflows: choice between CPU, GPU, cluster, cloud infrastructure, ...

19 Thank You for Your Attention! Questions?