Hadoop File Formats and Data Ingestion


Hadoop File Formats and Data Ingestion
Prasanth Kothuri, CERN

File Formats – not just CSV
The file format is a key factor in Big Data processing and query performance. Points to consider:
- Schema evolution
- Compression and splittability
- Data processing pattern: write performance, partial read, full read

Available File Formats
- Text / CSV
- JSON
- SequenceFile: binary key/value pair format
- RC File: row columnar format
- Avro
- Parquet
- ORC: optimized row columnar format

AVRO
- Language-neutral data serialization system: write a file in Python and read it in C
- Avro data is described using a language-independent schema
- Avro schemas are usually written in JSON and the data is encoded in a binary format
- Supports schema evolution: producers and consumers can be at different versions of the schema
- Avro files support compression and are splittable

Avro – File structure and example
An Avro container file starts with a header that carries the schema and file metadata, followed by blocks of binary-encoded records separated by sync markers. Sample Avro schema in JSON format:

{
  "type" : "record",
  "name" : "tweets",
  "fields" : [
    { "name" : "username",  "type" : "string" },
    { "name" : "tweet",     "type" : "string" },
    { "name" : "timestamp", "type" : "long" }
  ],
  "doc" : "schema for storing tweets"
}
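As a minimal sketch of the "write in Python, read elsewhere" idea, the snippet below uses the avro Python package with the tweets schema above saved as tweets.avsc; the file names and sample values are made up for illustration, and depending on the avro release the parser is avro.schema.parse or avro.schema.Parse.

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Load the schema shown above (hypothetical file name).
schema = avro.schema.parse(open("tweets.avsc").read())

# Write a container file: the schema travels in the file header and the
# records are appended as binary-encoded blocks.
with open("tweets.avro", "wb") as out:
    writer = DataFileWriter(out, DatumWriter(), schema)
    writer.append({"username": "alice", "tweet": "hello hadoop", "timestamp": 1430000000})
    writer.append({"username": "bob", "tweet": "avro files are splittable", "timestamp": 1430000060})
    writer.close()

# Read it back; any Avro implementation (C, Java, ...) can read the same file
# because the schema is embedded in it.
with open("tweets.avro", "rb") as inp:
    reader = DataFileReader(inp, DatumReader())
    for tweet in reader:
        print(tweet)
    reader.close()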

Parquet
- Columnar storage format
- Key strength is the ability to store nested data in a truly columnar format using definition and repetition levels (1)
- A row layout stores whole records one after another (x1 y1 z1, x2 y2 z2, ...); Parquet's columnar layout stores each column's values together as encoded chunks (x1 x2 ... x5, then y1 y2 ... y5, then z1 z2 ... z5)

(1) Dremel made simple with Parquet - https://blog.twitter.com/2013/dremel-made-simple-with-parquet
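As a toy illustration of the row-versus-columnar layouts sketched above (plain Python, not Parquet's actual on-disk representation):

# Row format: whole records stored one after another.
rows = [
    {"x": "x1", "y": "y1", "z": "z1"},
    {"x": "x2", "y": "y2", "z": "z2"},
    {"x": "x3", "y": "y3", "z": "z3"},
]

# Columnar format: all values of a column stored (and encoded) together,
# so a query that touches only column x never has to read y or z.
columns = {
    "x": ["x1", "x2", "x3"],
    "y": ["y1", "y2", "y3"],
    "z": ["z1", "z2", "z3"],
}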

Optimizations – CPU and I/O
- Statistics for filtering and query optimization
- Projection push down and predicate push down: read only the data you need
- Minimizes CPU cache misses (cache misses cost CPU cycles and slow down scans)
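A minimal sketch of what projection and predicate push down look like from a client, assuming the pyarrow library and a hypothetical ratings.parquet file (not part of the original slides):

import pyarrow.parquet as pq

# Projection push down: only the "rating" column chunks are read from disk.
# Predicate push down: row groups whose min/max statistics cannot satisfy the
# filter are skipped without being decoded.
table = pq.read_table(
    "ratings.parquet",
    columns=["rating"],
    filters=[("rating", ">=", 4.0)],
)
print(table.num_rows)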

Encoding
- Delta encoding: e.g. a timestamp column can be encoded by storing the first value and the deltas between subsequent values, which tend to be small because consecutive values are close in time
- Prefix encoding: delta encoding for strings
- Dictionary encoding: for columns with a small set of distinct values, e.g. post codes, IP addresses
- Run-length encoding: for repeating data
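A toy Python sketch of the delta and run-length ideas (illustration only; Parquet's real encoders work at the bit level):

# Delta encoding: store the first value plus small differences.
timestamps = [1430000000, 1430000005, 1430000009, 1430000012]
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
# deltas == [1430000000, 5, 4, 3] -> small numbers need far fewer bits

# Run-length encoding: store (value, repeat count) pairs for repeating data.
from itertools import groupby
flags = ["Y", "Y", "Y", "N", "N", "Y"]
rle = [(value, len(list(group))) for value, group in groupby(flags)]
# rle == [("Y", 3), ("N", 2), ("Y", 1)]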

Parquet file structure & configuration
Internally a Parquet file is divided into row groups (blocks); each row group holds a column chunk per column, and column chunks are further divided into pages. Configurable Parquet parameters:

- parquet.block.size (default 128 MB): the size in bytes of a block (row group)
- parquet.page.size (default 1 MB): the size in bytes of a page
- parquet.dictionary.page.size: the maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page
- parquet.enable.dictionary (default true): whether to use dictionary encoding
- parquet.compression (default UNCOMPRESSED): the type of compression: UNCOMPRESSED, SNAPPY, GZIP or LZO

In summary, Parquet is a state-of-the-art, open-source columnar format that supports most Hadoop processing frameworks and is optimized for high compression and high scan efficiency.
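The property names above are the Java (parquet-mr) configuration keys; as an illustrative sketch, the same knobs (row group size, dictionary encoding, compression) are exposed by other writers as well, e.g. pyarrow (not covered in the original slides; the data, file name and values below are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "userId": [1, 1, 2],
    "movieId": [31, 1029, 64],
    "rating": [2.5, 3.0, 4.0],
})

pq.write_table(
    table,
    "ratings.parquet",
    row_group_size=1000000,      # max rows per row group (parquet.block.size is the byte-based analogue)
    data_page_size=1024 * 1024,  # target page size in bytes, cf. parquet.page.size
    use_dictionary=True,         # cf. parquet.enable.dictionary
    compression="snappy",        # cf. parquet.compression
)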


Flume
Flume is for high-volume ingestion of event-based data into Hadoop, e.g. collecting logfiles from a bank of web servers and then moving the log events from those files into HDFS (clickstream). An example configuration, with a spooling-directory source feeding an HDFS sink through a memory channel, is shown on the next slide.

Flume configuration

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /var/spooldir
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hostname:8020/user/flume/logs
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100

Start the agent (the properties above go into a file passed with --conf-file):

flume-ng agent --name agent1 --conf $FLUME_HOME/conf

One use case can be copying DB audit logs to HDFS.

Flume components by category:
- Sources: Avro, Exec, HTTP, JMS, Netcat, Sequence generator, Spooling directory, Syslog, Thrift, Twitter
- Sinks: Elasticsearch, File roll, HBase, HDFS, IRC, Logger, Morphline (Solr), Null
- Channels: File, JDBC, Memory

Sqoop – SQL to Hadoop
Open-source tool to extract data from a structured data store into Hadoop. Architecturally, Sqoop runs its imports and exports as MapReduce jobs that connect to the source database over JDBC.

Sqoop – contd.
- Sqoop schedules MapReduce jobs to effect imports and exports
- Sqoop always requires the connector and the JDBC driver
- Sqoop requires the JDBC driver for the specific database server; it should be copied to /usr/lib/sqoop/lib
- The command line has the following structure:

sqoop TOOL PROPERTY_ARGS SQOOP_ARGS

- TOOL indicates the operation you want to perform, e.g. import, export
- PROPERTY_ARGS are parameters entered as Java properties in the format -Dname=value
- SQOOP_ARGS are all the other Sqoop parameters

Sqoop – how to run Sqoop
Example:

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username pdbmon \
  -P \
  --num-mappers 1 \
  --target-dir visitcount_rfidlog \
  --table VISITCOUNT.RFIDLOG

Sqoop – how to parallelize
- --table table_name
- --query 'select * from table_name where $CONDITIONS'
- --split-by <primary key column>
- --num-mappers n
- --boundary-query: a query that returns the minimum and maximum values used to create the splits (e.g. a select ... from dual)

Hands On – 1
Use the Kite SDK to demonstrate copying various file formats to Hadoop.

Step 1) Download the MovieLens dataset

curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o movies.zip
unzip movies.zip
cd ml-latest-small/

Step 2) Load the dataset into Hadoop in Avro format

-- infer the schema
kite-dataset csv-schema ratings.csv --record-name ratings -o ratings.avsc
cat ratings.avsc
-- create the dataset from the schema
kite-dataset create ratings --schema ratings.avsc
-- load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratings

Hands On – 1 contd.
Step 3) Load the dataset into Hadoop in Parquet format

kite-dataset csv-schema ratings.csv --record-name ratingsp -o ratingsp.avsc
cat ratingsp.avsc
kite-dataset create ratingsp --schema ratingsp.avsc --format parquet
kite-dataset csv-import ratings.csv --delimiter ',' ratingsp

Step 4) Run a sample query to compare the elapsed time between Avro & Parquet

hive
select sum(rating)/count(1) from ratings;
select sum(rating)/count(1) from ratingsp;

Hands On – 2
Use Sqoop to copy an Oracle table to Hadoop.

sudo su -
cd /var/lib/sqoop
curl -L https://www.dropbox.com/s/z726v1mzbysvq98/ojdbc6.jar?dl=0 -o ojdbc6.jar

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username pdbmon \
  -P \
  --num-mappers 1 \
  --target-dir visitcount_rfidlog \
  --table VISITCOUNT.RFIDLOG

Check the size and number of files:

hdfs dfs -ls visitcount_rfidlog/

Hands On – 3
Use Sqoop to copy an Oracle table to Hadoop with multiple mappers.

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username pdbmon \
  -P \
  --num-mappers 2 \
  --split-by alarm_id \
  --target-dir lemontest_alarms \
  --table LEMONTEST.ALARMS \
  --as-parquetfile

Check the size and number of files:

hdfs dfs -ls lemontest_alarms/

Hands On – 4
Use Sqoop to make an incremental copy of an Oracle table to Hadoop.

sqoop job \
  --create alarms \
  -- \
  import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username pdbmon \
  -P \
  --num-mappers 1 \
  --target-dir lemontest_alarms_i \
  --table LEMONTEST.ALARMS \
  --incremental append \
  --check-column alarm_id \
  --last-value 0

sqoop job --exec alarms

Hands On – 4 contd.
A subsequent incremental import continues from the last value already loaded:

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username pdbmon \
  -P \
  --num-mappers 1 \
  --table LEMONTEST.ALARMS \
  --target-dir lemontest_alarms_i \
  --incremental append \
  --check-column alarm_id \
  --last-value 47179

hdfs dfs -ls lemontest_alarms_i/

Q & A
E-mail: Prasanth.Kothuri@cern.ch
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/