Getting Data into Hadoop


1 Getting Data into Hadoop
September 18, 2017
Kyung Eun Park, D.Sc.

2 Contents
Data lake vs. data store / data warehouse
Overview of the main tools for data ingestion into Hadoop
  1.1 Spark
  1.2 Sqoop
  1.3 Flume
Basic methods for importing CSV data into HDFS and Hive tables

3 Hadoop: Setting up a Single Node Cluster
Set up and configure a single-node Hadoop installation
Required software for Linux (Ubuntu x64 LTS):
  Java
  ssh: $ sudo apt-get install ssh
Installing
  Download and unpack a Hadoop release
  Edit etc/hadoop/hadoop-env.sh:
    # set to the root of your Java installation
    export JAVA_HOME=/usr/lib/jvm/default-java
  Set JAVA_HOME in your .bashrc shell file:
    JAVA_HOME=/usr/lib/jvm/default-java
    export JAVA_HOME
    PATH=$PATH:$JAVA_HOME/bin
    export PATH
    export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
  $ source .bashrc
  $ echo $JAVA_HOME
Try the following command: $ bin/hadoop
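The single-node instructions also assume you can ssh to localhost without a passphrase; if that is not already the case, a minimal sketch using standard openssh commands (the key path is the usual default) is:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa      # generate a key with an empty passphrase
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
  $ ssh localhost                                 # should now connect without prompting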

4 Hadoop as a Data Lake
With a traditional database or data warehouse approach:
  Adding data to the database requires ETL (extract, transform, and load)
  Data are transformed into a pre-determined schema before loading
  Data usage must be decided during the ETL step; later changes are costly
  Data are discarded in the ETL step when they do not fit the schema or exceed capacity (only the data judged necessary are kept)
Hadoop approach: a central storage space for all data in HDFS
  Inexpensive and redundant storage of large datasets
  Lower cost than traditional systems

5 Standalone Operation
Copy the unpacked conf directory to use as input
Find and display every match of the given regular expression
Output is written to the given output directory
  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-<version>.jar grep input output 'dfs[a-z.]+'
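In standalone mode the job runs locally and writes its results to the output directory on the local file system, so the matches found by the grep example can be checked directly:

  $ ls output          # part-r-00000 plus a _SUCCESS marker
  $ cat output/*       # each line: count followed by the matched string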

6 MapReduce
MapReduce
  A software framework for writing applications that process vast amounts of data in parallel on large clusters of commodity hardware
  The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks
Schema on read
  Lets programmers and users impose a structure that suits their needs when they access the data
  c.f. the schema-on-write of the traditional data warehouse approach, which requires upfront design and assumptions about how the data will be used
MapReduce application (a minimal example follows)
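As a small illustration of the map and reduce phases, Hadoop Streaming lets ordinary executables act as the mapper and reducer. The sketch below, adapted from the stock streaming example, passes each input line through cat in the map phase and counts lines, words, and characters with wc in the reduce phase; the jar path and the <version> placeholder are assumptions about a standard Hadoop layout.

  $ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-<version>.jar \
      -input input -output streaming-out \
      -mapper /bin/cat \
      -reducer /usr/bin/wc
  $ cat streaming-out/*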

7 Why Raw Format?
For data science purposes, keeping all data in raw format is beneficial, because it is not clear in advance which data items may be valuable to a given data science goal
A Hadoop application applies a schema to the data as it reads them from the lake
Advantages of the data lake approach over a traditional approach:
  All data are available: no need for assumptions about future data use
  All data are sharable: no technical hurdle to data sharing
  All access methods are available: any processing engine (MapReduce, Spark, etc.) or application (Hive, Spark SQL, Pig) can be used to examine and process the data

8 Data Warehouses vs. Hadoop Data Lake
Hadoop as a complement to data warehouses
The growth of new data from disparate sources can quickly fill the data lake:
  Social media
  Click streams
  Sensor data, moving objects, etc.
Traditional ETL stages may not keep up with the rate at which data enter the lake
Both approaches support access to the data; in the Hadoop case, however, access can happen as soon as the data are available in the lake

9 ETL Process vs. Data Lake
[Diagram] Left: sources A, B, and C enter an ETL process (schema on write); data usage is decided up front, data that do not fit are discarded, and the result lands in a relational data warehouse. Right: the same sources enter the Hadoop data lake in raw format, and the user applies a schema on read at access time.

10 The Hadoop Distributed File System (HDFS)
All Hadoop applications operate on data stored in HDFS
HDFS is not a general-purpose file system but a specialized streaming file system; data must be explicitly copied to and from HDFS
Optimized for reading and writing large files
Writing data to HDFS
  Files are sliced into many small sub-units (blocks, shards)
  Blocks are replicated across the servers in a Hadoop cluster to avoid data loss (reliability)
  Blocks are transparently written to the cluster nodes
Processing
  Slices are processed in parallel at the same time
Exporting (transferring files out of HDFS)
  Slices are assembled and written as one file on the host file system
Single instance of HDFS: no file slicing or replication
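On a running cluster, the slicing and replication of a particular file can be inspected with hdfs fsck, and the replication factor can be changed with setrep (the path below is just an example):

  $ hdfs fsck /user/hduser/test -files -blocks -locations   # show blocks and their replica locations
  $ hdfs dfs -setrep 2 /user/hduser/test                    # change the replication factor to 2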

11 Direct File Transfer to Hadoop HDFS
Using native HDFS commands
Copy a file (test) to HDFS: use the put command
  $ hdfs dfs -put test
View files in HDFS: use the ls command (output resembles ls -la)
  $ hdfs dfs -ls
Copy a file from HDFS to the local file system: use the get command
  $ hdfs dfs -get another-test
More commands: refer to Appendix B (a few further examples below)
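A few more commonly used HDFS commands, with example paths that assume the default /user/<username> home directory:

  $ hdfs dfs -mkdir -p data     # create a directory under the HDFS home directory
  $ hdfs dfs -put test data/    # copy a local file into that directory
  $ hdfs dfs -cat data/test     # print the file's contents
  $ hdfs dfs -rm data/test      # delete the file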

12 Importing Data from Files into Hive Tables
Hive: an SQL-like tool for analyzing data in HDFS, useful for feature generation
Importing data into Hive tables
  Existing text-based files exported from spreadsheets or databases:
    Tab-separated values (TSV)
    Comma-separated values (CSV)
    Raw text
    JSON, etc.
Two types of Hive table (see the sketch below)
  Internal (managed) table: fully managed by Hive and stored in an optimized format such as ORC
  External table: not managed by Hive; only a metadata description is used to access the data in its raw form, and dropping the table deletes only the definition (the metadata), not the data
After importing, the data can be processed with a variety of tools, including Hive's SQL query processing, Pig, or Spark
Hive tables as virtual tables: used when the data reside outside of Hive
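A sketch of the two table types in HiveQL; the table names, columns, and HDFS location are hypothetical. Dropping events_ext removes only the table definition, while dropping events_orc removes the data as well.

  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events_ext (id INT, name STRING)
      >   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      >   STORED AS TEXTFILE
      >   LOCATION '/user/hduser/game';
  hive> CREATE TABLE IF NOT EXISTS events_orc STORED AS ORC
      >   AS SELECT * FROM events_ext;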

13 CSV Files into Hive Tables
A comma-delimited text file (CSV file) is imported into a Hive table
Hive installation and configuration
  Install Hive 1.2.2
    $ tar -xzvf apache-hive-1.2.2-bin.tar.gz
Create a directory in HDFS to hold the file
  $ bin/hdfs dfs -mkdir game
Put the file in the directory
  $ bin/hdfs dfs -put 4days*.csv game
First load the data as an external Hive table (a completed statement is sketched below)
Start a Hive shell
  $ hive
  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events (ID INT, NAME STRING, ...)
      > ...
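A possible completion of that statement; the column names and the /user/hduser path are hypothetical. The ROW FORMAT and LOCATION clauses tell Hive how to parse the CSV and where the files already sit in HDFS, and the skip.header.line.count property is needed only if the CSV has a header row.

  hive> CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, name STRING, score INT)
      >   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      >   STORED AS TEXTFILE
      >   LOCATION '/user/hduser/game'
      >   TBLPROPERTIES ("skip.header.line.count"="1");
  hive> SELECT * FROM events LIMIT 10;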

14 Hive Interactive Shell Commands
All commands end with ;
  quit or exit: leave the interactive shell
  add FILE <file>: add a file (or jar, archive) to the list of resources in the distributed cache
  list FILE: list the resources that have been added
  delete FILE <file>: remove a resource from the list
  set <key>=<value>: set the value of a configuration variable
  !<cmd>: execute a shell command from the hive shell
  <query>: executes a hive query and prints results to standard out
  source FILE <file>: used to execute a script file inside the CLI
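A short example session: set a configuration variable, run a shell command, run an HDFS command, run an ordinary query (the events table is the one defined on the previous slide), and leave the shell.

  hive> set hive.cli.print.header=true;
  hive> !date;
  hive> dfs -ls game;
  hive> SELECT COUNT(*) FROM events;
  hive> quit;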

15 Importing Data into Hive Tables Using Spark
Apache Spark: a modern processing engine focused on in-memory processing
Data are abstracted as an immutable distributed collection of items called a resilient distributed dataset (RDD)
  RDDs are created from Hadoop data (e.g. HDFS files) or by transforming other RDDs
  Each dataset in an RDD is divided into logical partitions and computed on different nodes of the cluster transparently
Spark's DataFrame: built on top of an RDD, but the data are organized into named columns like an RDBMS table, similar to a data frame in R
  Can be created from different data sources: existing RDDs, structured data files, JSON datasets, Hive tables, external databases (a sketch follows)
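A sketch of the CSV-to-Hive-table path from the PySpark shell; the file path and table name are hypothetical, and a Spark build with Hive support is assumed.

  $ pyspark
  >>> df = spark.read.csv("hdfs:///user/hduser/game/4days.csv",
  ...                     header=True, inferSchema=True)
  >>> df.printSchema()
  >>> df.write.mode("overwrite").saveAsTable("game_events")
  >>> spark.sql("SELECT COUNT(*) FROM game_events").show()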

16 Next Class: Hadoop Tutorial
Please try to install Hadoop, Hive, and Spark
Next week's lab: importing data into HDFS and Hive, then processing the data using the MapReduce and Spark engines

