OMOP CDM on Hadoop Reference Architecture


1 OMOP CDM on Hadoop Reference Architecture
Target audience: technical or IT staff currently using, or hoping to use, the OHDSI OMOP CDM.
Goals:
Show a design for the CDM using 'big data' tools that are Apache open source, as an alternative to Oracle or PostgreSQL.
Show ingestion paths and approaches for various data types and data sources.
Map the currently (Nov 2016) most commonly used tools to various business purposes, based on market share but not exclusive to Cloudera.
Glossary of acronyms:
OHDSI: Observational Health Data Sciences and Informatics
OMOP: Observational Medical Outcomes Partnership
CDM: Common Data Model
HDFS: Hadoop Distributed File System
RDBMS: Relational Database Management System
SQL: Structured Query Language
Revision history: Original: November 29, 2016. Send questions to sdolley at cloudera dot com. Original technical authors: Derek Kane, Tom White.
N.B. This is offered as one opinion on the currently most selected set of choices to fill technical capabilities, and is not inclusive of all approaches.

2 OMOP CDM on Hadoop Reference Architecture: Big Data Superset Architecture for a Data Lake
Technical capabilities and the tools most commonly implemented to fill them:
BATCH PROCESSING: Spark, Hive, Sqoop, MapReduce, Pig
ANALYTIC SQL: Impala
SEARCH ENGINE: Solr
MACHINE LEARNING: Spark
REAL-TIME PROCESSING: Spark Streaming, Kafka, Flume
UPDATEABLE, ANALYTIC STORAGE: Kudu
WORKLOAD MANAGEMENT: YARN
WORKFLOW MANAGEMENT: Oozie
SECURITY: Sentry, RecordService, HDFS Encryption, TLS/SSL, Kerberos
FILE STORAGE: Parquet
FILESYSTEM: HDFS
ONLINE NOSQL: HBase
Data flows in from multiple sources into HDFS.
Legend: Box titles in black bold such as "BATCH PROCESSING" are the technical capabilities required to meet business requirement(s). Tool names in dark gray such as "SPARK" or "Kerberos" are software/Apache projects that can be implemented to meet the technical capability need. Boxes near the bottom of the architecture tend toward storing data; middle boxes tend toward managing or organizing the data; boxes at the top tend toward enabling users to access the data.
NB: alternate, less common tool options can be found later in this document.

3 Minimum Software to Stand Up CDM in Hadoop
BATCH PROCESSING: Spark, Hive, Sqoop, MapReduce, Pig
ANALYTIC SQL: Impala
SEARCH ENGINE: Solr
MACHINE LEARNING: Spark
REAL-TIME PROCESSING: Spark Streaming, Kafka, Flume
UPDATEABLE, ANALYTIC STORAGE: Kudu
WORKLOAD MANAGEMENT: YARN
WORKFLOW MANAGEMENT: Oozie
SECURITY: Sentry, RecordService, HDFS Encryption, TLS/SSL, Kerberos
FILE STORAGE: Parquet
FILESYSTEM: HDFS
ONLINE NOSQL: HBase
Data flows in from multiple sources into HDFS.
Legend: as on the previous slide.

4 Hadoop Ingestion Paths
Ingestion method varies by data source and by the complexity of the transformations needed. Three approaches; select one or many:
Flat file ingestion (CSV/TSV/TXT): file → Hive
RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS → Sqoop → Hive
Complex ingestion (JSON, XML, nested sources, custom, etc.): source file → Spark → Hive
Note: all data is most often physically stored in HDFS. This graphic shows the most commonly used tools to manage getting data into HDFS.
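As a rough sketch of the three ingestion paths, the commands below show typical invocations against a running Hadoop cluster. The host names, credentials, paths, and table names (`hiveserver`, `dbhost`, `etl_user`, `omop.person`, `org.example.PersonJsonEtl`, etc.) are placeholders for illustration, and exact flags can vary by distribution and version.

```shell
# Path 1 - flat file ingestion: land the CSV in HDFS, then expose it via a Hive table.
hdfs dfs -put person.csv /staging/
beeline -u jdbc:hive2://hiveserver:10000 -e "
  CREATE TABLE IF NOT EXISTS omop.person_staging
    (person_id BIGINT, gender_concept_id INT, year_of_birth INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  LOAD DATA INPATH '/staging/person.csv' INTO TABLE omop.person_staging;"

# Path 2 - RDBMS ingestion: Sqoop pulls a relational table straight into Hive.
sqoop import \
  --connect jdbc:postgresql://dbhost/omop \
  --username etl_user -P \
  --table person \
  --hive-import --hive-table omop.person

# Path 3 - complex ingestion: a Spark job parses nested JSON/XML and writes a Hive table.
spark-submit --class org.example.PersonJsonEtl person-etl.jar /raw/person.json omop.person
```

These commands require a live cluster and are sketches of the flow rather than a tested pipeline; in practice credentials would come from a secured store rather than interactive `-P` prompts.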

5 Hadoop Ingest/Egress, Hive & Impala
GETTING DATA IN → DATA STORED & MANAGED → GETTING DATA OUT
Ingestion method varies by data source and complexity of transformations needed; three approaches, select one or many:
Flat file ingestion (CSV/TSV/TXT): file → Hive
RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS → Sqoop → Hive
Complex ingestion (JSON, XML, nested sources, custom, etc.): source file → Spark → Hive
In every path, data lands in HDFS and is registered in the Hive catalog. Getting data out: Impala serves SQL generated by SqlRender for the OHDSI apps, and optionally other SQL tools and other apps.

6 SQL on Hadoop
Hive and Impala share the same metadata repository (data dictionary). Data ingested into HDFS and made available in Hive (for ETL and batch processing) is also available in Impala (for fast, in-memory analytics, via ODBC/JDBC tools), through the shared metadata in the Hive catalog.
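A minimal illustration of the shared metastore, assuming a running cluster (the host names `hiveserver` and `impalad-host` and the table `omop.care_site` are placeholders): a table created through Hive becomes visible to Impala once Impala refreshes its metadata cache.

```shell
# Create a table through Hive (the ETL / batch-processing side).
beeline -u jdbc:hive2://hiveserver:10000 -e "
  CREATE TABLE omop.care_site (care_site_id BIGINT, care_site_name STRING)
  STORED AS PARQUET;"

# Impala reads the same Hive catalog; tell it to pick up the new
# metadata, then query the table interactively.
impala-shell -i impalad-host -q "
  INVALIDATE METADATA omop.care_site;
  SELECT COUNT(*) FROM omop.care_site;"
```

No data is copied between the two engines; both resolve `omop.care_site` through the one Hive catalog, which is the point of the slide.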

7 Cloudera Options
Batch data ingest: Sqoop, Flume
Programming framework: Spark, MapReduce
Streaming: Kafka, Spark Streaming
Workflow engine: Oozie
Security: Sentry, Kerberos, RecordService, HDFS Encryption
Interactive database (SQL): Impala
Search: SOLR / Cloudera Search
Other: Pig, Hive on Spark, Hive on MapReduce
Data storage file format: Parquet, Avro
Storing data (key value): HBase
Storing data (columnar): Kudu
Storing data (files): Hadoop File System, AWS S3, AWS EBS, EMC Isilon OneFS

8 With Other Options Listed
Batch data ingest: Sqoop, Flume
Programming framework: Spark, MapReduce
Streaming: Kafka, Spark Streaming, Storm
Workflow engine: Oozie, NiFi (aka Hadoop Data Flow)
Security: Sentry, Kerberos, RecordService, HDFS Encryption, Ranger, Knox
Interactive database (SQL): Impala, Hive on Spark, Hive on MapReduce, Hive on Tez, HAWQ, Presto
Search: SOLR / Cloudera Search
Other: Pig
Data storage file format: Parquet, Avro
Storing data (key value): HBase, Cassandra, Accumulo
Storing data (columnar): Kudu
Storing data (files): Hadoop File System, EMC Isilon OneFS, S3
NB: due to the authors' lack of familiarity with many of the alternative tools listed here, we cannot guarantee that the tools shown can provide the technical capability shown.

9 Minimum projects/products needed for ingesting syndicated data into the OMOP CDM for analysis
Interactive database (SQL), for running queries: Impala, Hive
Storing data (files): Hadoop File System (HDFS)
Batch data ingest and/or programming framework, for ingesting data into the CDM: Sqoop and/or Spark
NB: this is one approach; multiple tools exist as alternatives, and adding tools to this architecture can make your process faster, more organized, more secure, or bring other benefits.
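An end-to-end sketch of this minimum stack, assuming a PostgreSQL source system and placeholder connection details (`dbhost`, `source_db`, `etl_user`): Sqoop lands a source table in HDFS as a Hive table, and Impala then serves interactive queries against it.

```shell
# Ingest one source table into the CDM schema (Sqoop path shown;
# a Spark job could perform this step instead).
sqoop import \
  --connect jdbc:postgresql://dbhost/source_db \
  --username etl_user -P \
  --table condition_occurrence \
  --hive-import --hive-table omop.condition_occurrence

# Query it through the interactive SQL layer.
impala-shell -q "
  INVALIDATE METADATA omop.condition_occurrence;
  SELECT condition_concept_id, COUNT(*) AS n
  FROM omop.condition_occurrence
  GROUP BY condition_concept_id
  ORDER BY n DESC LIMIT 10;"
```

This is one possible wiring of the four minimum components, not a tested deployment; real ETL into the CDM also involves vocabulary mapping that these two commands do not cover.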

