Hadoop and its Real-world Applications Xiaoxiao Shi, Guan Wang Experience: work at Yahoo! in 2010 summer, on developing hadoop-based machine learning models.


Contents Motivation of Hadoop History of Hadoop The current applications of Hadoop Programming examples Research with Hadoop Conclusions

Motivation of Hadoop How do you scale up applications? – Run jobs processing 100s of terabytes of data – That takes 11 days to read on 1 computer You need lots of cheap computers – That fixes the speed problem (15 minutes on 1000 computers), but… – it introduces reliability problems: in large clusters, computers fail every day, and cluster size is not fixed You need common infrastructure – It must be efficient and reliable
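The arithmetic behind those numbers can be checked with a back-of-the-envelope sketch; the 100 TB data size and ~100 MB/s per-disk scan rate below are illustrative assumptions, not figures from the slides:

```java
// Rough check of the "11 days on 1 machine, minutes on 1000" claim.
// Assumptions (illustrative): 100 TB of data, one disk streaming ~100 MB/s.
public class ScanTime {
    public static void main(String[] args) {
        double bytes = 100e12;      // 100 TB
        double bytesPerSec = 100e6; // ~100 MB/s sequential read
        double seconds = bytes / bytesPerSec; // time on a single machine
        System.out.printf("1 machine: %.1f days%n", seconds / 86400);
        // With 1000 machines each scanning its own shard in parallel:
        System.out.printf("1000 machines: %.1f minutes%n", seconds / 1000 / 60);
    }
}
```

Running it prints roughly 11.6 days versus 16.7 minutes — the same order of magnitude as the slide's figures.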

Motivation of Hadoop An open-source Apache project. Hadoop Core includes: – Distributed File System - distributes the data – Map/Reduce - distributes the application Written in Java. Runs on – Linux, Mac OS/X, Windows, and Solaris – commodity hardware

Fun Fact of Hadoop "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term." ---- Doug Cutting, Hadoop project creator

History of Hadoop Doug Cutting, while working on Apache Nutch, reads Google’s “Map-reduce” paper in 2004: “It is an important technique!” He extends Nutch with it and joins Yahoo! in 2006. The great journey begins…

History of Hadoop Yahoo! became the primary contributor in 2006

History of Hadoop Yahoo! deployed large-scale science clusters, and tons of Yahoo! Research papers emerged: – WWW – CIKM – SIGIR – VLDB – …… Yahoo! then began running major production jobs. Nowadays…

Nowadays… Ads Optimization Content Optimization Search Index Content Feed Processing Machine Learning (e.g. spam filters) When you visit Yahoo!, you are interacting with data processed with Hadoop!

Nowadays… Yahoo! has ~20,000 machines running Hadoop The largest clusters are currently 2000 nodes Several petabytes of user data (compressed, unreplicated) Yahoo! runs hundreds of thousands of jobs every month

Nowadays… Who uses Hadoop? Amazon/A9 AOL Facebook Fox Interactive Media Google IBM New York Times Powerset (now Microsoft) Quantcast Rackspace/Mailtrust Veoh Yahoo! More at

Nowadays (job market on Nov 15th)… Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop / data analytics a plus Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data processing system, big data analytics... multiple technologies, including Hadoop

It is important. Details…

Nowadays… Hadoop Core – Distributed File System – MapReduce Framework Pig (initiated by Yahoo!) – parallel programming language and runtime HBase (initiated by Powerset) – table storage for semi-structured data ZooKeeper (initiated by Yahoo!) – coordinating distributed systems Hive (initiated by Facebook) – SQL-like query language and metastore

HDFS Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time. Hadoop Distributed File System – Goals: Store large data sets Cope with hardware failure Emphasize streaming data access
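The block size and replication factor mentioned above are ordinarily given cluster-wide defaults in hdfs-site.xml (and can still be overridden per file at creation time). A minimal sketch — the values are illustrative, and older Hadoop releases spelled the first property dfs.block.size:

```xml
<!-- hdfs-site.xml: illustrative cluster-wide defaults -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB blocks -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block stored on 3 datanodes -->
  </property>
</configuration>
```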

Typical Hadoop Structure Commodity hardware – Linux PCs with 4 local disks Typically a 2-level architecture – 40 nodes/rack – Uplink from each rack is 8 gigabit – Rack-internal is 1 gigabit all-to-all

Hadoop structure Single namespace for the entire cluster – Managed by a single namenode – Files are single-writer and append-only – Optimized for streaming reads of large files Files are broken into large blocks – Typically 128 MB – Replicated to several datanodes for reliability Clients talk to both the namenode and the datanodes – Data is not sent through the namenode – Throughput of the file system scales nearly linearly with the number of nodes Access from Java, C, or the command line.

Hadoop Structure Java and C++ APIs – The Java API works with objects, the C++ API with bytes Each task can process data sets larger than RAM Automatic re-execution on failure – In a large cluster, some nodes are always slow or flaky – The framework re-executes failed tasks Locality optimizations – Map-Reduce queries HDFS for the locations of the input data – Map tasks are scheduled close to their inputs when possible

Example of Hadoop Programming Word Count: “I like parallel computing. I also took courses on parallel computing… …” – Parallel: 2 – Computing: 2 – I: 2 – Like: 1 – ……

Example of Hadoop Programming Intuition: design Assume each node will process a paragraph… Map: – What is the key? – What is the value? Reduce: – What to collect? – What to reduce?
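The answers to the questions above: the map key is a word, the map value is the count 1; the shuffle groups equal keys together; reduce sums the grouped values. This data flow can be sketched in plain Java, with no Hadoop involved (WordCountSketch is a hypothetical illustration class, and punctuation is stripped for simplicity):

```java
import java.util.*;

// Plain-Java sketch of the word-count data flow:
// map emits (word, 1); shuffle groups by word; reduce sums.
public class WordCountSketch {
    // "map" phase: emit a (token, 1) pair for every token in the input
    static List<Map.Entry<String, Integer>> map(String paragraph) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String tok : paragraph.split("\\s+"))
            pairs.add(new AbstractMap.SimpleEntry<>(tok, 1));
        return pairs;
    }

    // "shuffle + reduce" phase: group pairs by key and sum their values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String text = "I like parallel computing I also took courses on parallel computing";
        System.out.println(reduce(map(text)));
    }
}
```

In real Hadoop each node runs map over its own split of the input, and the framework performs the grouping between the map and reduce phases.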

Word Count Example

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    // Tokenize the line and emit (word, 1) for every token
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

Word Count Example

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    // Sum all the 1s emitted for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}

Word Count Example

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class); // the reducer doubles as a combiner
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  conf.setOutputKeyClass(Text.class);          // output keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // output values are counts
  JobClient.runJob(conf);
}

Hadoop in Yahoo! The database for Search Assist™ is built using Hadoop, from 3 years of log data with a 20-step map-reduce pipeline. – Time: 26 days before Hadoop, 20 minutes after – Language: C++ before, Python after – Development time: 2-3 weeks before, 2-3 days after

Related research of Hadoop Conference tutorials: – KDD Tutorial: “Modeling with Hadoop”, KDD 2011 (top conference in data mining) – Strata Tutorial: “How to Develop Big Data Applications for Hadoop” – OSCON Tutorial: “Introduction to Hadoop” Papers: – Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011 – Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011 – Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011 – Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011 – Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011 – Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011 – Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/ (2011) – Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011 – Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011 – …… All just this year, 2011!

For more information: – – Who uses Hadoop?: –