Naïve Bayes (HW1): Tips and Tricks Shannon Quinn.

Slides:



Advertisements
Similar presentations
Beyond Mapper and Reducer
Advertisements

The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce.
MapReduce.
Beyond map/reduce functions partitioner, combiner and parameter configuration Gang Luo Sept. 9, 2010.
Mapreduce and Hadoop Introduce Mapreduce and Hadoop
MapReduce in Action Team 306 Led by Chen Lin College of Information Science and Technology.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Developing a MapReduce Application – packet dissection.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
O’Reilly – Hadoop: The Definitive Guide Ch.5 Developing a MapReduce Application 2 July 2010 Taewhi Lee.
Spark: Cluster Computing with Working Sets
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Ivory : Ivory : Pairwise Document Similarity in Large Collection with MapReduce Tamer Elsayed, Jimmy Lin, and Doug Oard Laboratory for Computational Linguistics.
Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative.
** MapReduce Debugging with Jumbune. * Agenda * Debugging Challenges Debugging MapReduce Jumbune’s Debugger Zero Tolerance in Production.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Application Development On AWS MOULIKRISHNA KOPPOLU CHANDAN SINGH RANA.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
CS506/606: Problem Solving with Large Clusters Zak Shafran, Richard Sproat Spring 2011 Introduction URL:
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Zois Vasileios Α. Μ :4183 University of Patras Department of Computer Engineering & Informatics Diploma Thesis.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VI: 2014/04/14.
HAMS Technologies 1
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
HAMS Technologies 1
Distributed Systems Fall 2014 Zubair Amjad. Outline Motivation What is Sqoop? How Sqoop works? Sqoop Architecture Import Export Sqoop Connectors Sqoop.
CSE 548 Advanced Computer Network Security Document Search in MobiCloud using Hadoop Framework Sayan Cole Jaya Chakladar Group No: 1.
Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 作者 :Rutvik Karve , Devendra Dahiphale , Amit Chhajer 報告 : 饒展榕.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
MapReduce.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
An Introduction to HDInsight June 27 th,
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Bi-Hadoop: Extending Hadoop To Improve Support For Binary-Input Applications Xiao Yu and Bo Hong School of Electrical and Computer Engineering Georgia.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
MapReduce. What is MapReduce? (1) A programing model for parallel processing of a distributed data on a cluster It is an ideal solution for processing.
CSE 548 Advanced Computer Network Security Trust in MobiCloud using Hadoop Framework Updates Sayan Cole Jaya Chakladar Group No: 1.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  Concept of the Project  System architecture  Implementation – HDFS  Implementation – System.
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Hadoop and MapReduce Shannon Quinn (with thanks from William Cohen of CMU)
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
CSE 548 Advanced Computer Network Security Trust in MobiCloud using Hadoop Framework Updates Sayan Kole Jaya Chakladar Group No: 1.
Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
By: Joel Dominic and Carroll Wongchote 4/18/2012.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Ch 8 and Ch 9: MapReduce Types, Formats and Features
Hadoop MapReduce Framework
MapReduce Types, Formats and Features
Big Data Analytics: HW#3
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
Map Reduce Workshop Monday November 12th, 2012
Cloud Computing: Project Tutorial Hadoop Map-Reduce Programming
Lecture 16 (Intro to MapReduce and Hadoop)
MAPREDUCE TYPES, FORMATS AND FEATURES
5/7/2019 Map Reduce Map reduce.
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Presentation transcript:

Naïve Bayes (HW1): Tips and Tricks Shannon Quinn

Git Basics – [Advanced] Cheat sheet – cheat-sheet cheat-sheet

Some useful Hadoop-isms Testing and setup – Excellent Hadoop 2.x setup and testing tutorial: p-installation-tutorial-hadoop-2-x/ p-installation-tutorial-hadoop-2-x/ You won’t need to worry about setup! – On GACRC, I’m setting up the VMs – On AWS, Amazon handles it But the testing tips near the bottom are excellent – Putting files into HDFS – Reading existing content/results in HDFS – Submitting Hadoop jobs on the command line If you use AWS GUI, you won’t need to worry about any of this

Some useful Hadoop-isms Counters – Initialize in main() / run() – Increment in mapper / reducer – Read in main() / run() – Example: example-of-hadoop-mapreduce-counter/ example-of-hadoop-mapreduce-counter/

Some useful Hadoop-isms Joins – Join values together that have the same key – Map-side Faster and more efficient Harder to implement—requires custom Partitioner and Comparator – Reduce-side Easy to implement—shuffle step does the work for you! Less efficient as data is pushed to the network MultipleInputs – Specify a specific mapper class for a specific input path – hadoop/mapreduce/lib/input/MultipleInputs.html hadoop/mapreduce/lib/input/MultipleInputs.html

Some useful Hadoop-isms setup() – Optional method override in Mapper / Reducer subclass – Executed before map() / reduce() – Useful for initializing variables…

Some useful Hadoop-isms DistributedCache – Read-only cache of information accessible by each node in the cluster – Very useful for broadcasting small amounts of read-only information – Tricky to implement – /hadoop-distributedcache-is- deprecated-what-is-the-preferred-api /hadoop-distributedcache-is- deprecated-what-is-the-preferred-api

WARNING!!! Hadoop vs Hadoop WARNING!!! Hadoop vs Hadoop 2.2.0

A little MapReduce: NB Variables 1.|V| 2.|L| 3.Y=* 4.Y= y 5.Y= y, W=* 6.Y= y, W= w 1.Size of vocabulary (unique words) 2.Size of label space (unique labels) 3.Number of documents 4.Number of documents with label y 5.Number of words in a document with label y 6.Number of times w appears in a document with label y