Experience with HiBench From Micro-Benchmarks toward End-to-End Pipelines WBDB 2013 Workshop Presentation Lan Yi Senior Software Engineer.

Presentation transcript:

Experience with HiBench: From Micro-Benchmarks toward End-to-End Pipelines
WBDB 2013 Workshop Presentation
Lan Yi, Senior Software Engineer
Intel China Software Center

HiBench
Micro Benchmarks
– Sort
– WordCount
– TeraSort
Web Search
– Nutch Indexing
– Page Rank
Machine Learning
– Bayesian Classification
– K-Means Clustering
HDFS
– Enhanced DFSIO
See our paper “The HiBench Suite: Characterization of the MapReduce-Based Data Analysis” in ICDE’10 workshops (WISS’10)
1. Different from GridMix, SWIM?
2. Micro Benchmark?
3. Isolated components?
4. End-2-end Benchmark?
5. We need ETL-Recommendation Pipeline

ETL-Recommendation (hammer)
[Diagram: pipeline on a HIVE-Hadoop Cluster (Data Warehouse). Stage labels: ETL, Pref, CF, Test.
Data labels: Sales tables, Sales updates (TPC-DS); log table h1, h2, ..., h24 (ip, agent, Retcode, cookies, WP); Cookies updates; Sales preferences; Browsing preferences; User-item preferences; Item-item similarity matrix; Test data; Statistics & Measurements.
Task labels: ETL-sales, ETL-logs, Pref-sales, Pref-logs, Pref-comb, Item based Collaborative Filtering (Mahout), Offline test.]
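
The diagram names an item-item similarity matrix as the output of the item-based collaborative filtering step. The following is a minimal sketch of how such a matrix can be derived from user-item preferences. It is not Mahout's implementation: the cosine measure, the `PREFS` data, and the function name `item_similarity` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical user-item preferences (user -> {item: preference score}).
# These values are made up for illustration; they are not from the slides.
PREFS = {
    "u1": {"A": 1.0, "B": 1.0},
    "u2": {"A": 1.0, "B": 1.0, "C": 1.0},
    "u3": {"C": 1.0},
}


def item_similarity(prefs):
    """Return cosine similarity for every item pair rated by a common user.

    Cosine similarity is an assumed choice of measure; the slides do not
    say which measure the pipeline uses.
    """
    dot = defaultdict(float)      # (item_i, item_j) -> dot product
    norm_sq = defaultdict(float)  # item -> squared vector length
    for scores in prefs.values():
        for i, si in scores.items():
            norm_sq[i] += si * si
            for j, sj in scores.items():
                if i < j:  # store each unordered pair once
                    dot[(i, j)] += si * sj
    return {
        pair: value / math.sqrt(norm_sq[pair[0]] * norm_sq[pair[1]])
        for pair, value in dot.items()
    }


SIM = item_similarity(PREFS)
```

Items A and B are preferred by exactly the same users here, so their similarity is the highest possible; C shares only one user with each of them and scores lower.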

ETL-Recommendation (hammer) Task Dependences
[Diagram: dependency graph over the tasks ETL-sales, ETL-logs, Pref-sales, Pref-logs, Pref-comb, Item based Collaborative Filtering, Offline test.]
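
A dependency graph like this one determines a valid run order for the pipeline. The sketch below derives such an order with Kahn's algorithm. The edges in `DEPENDS_ON` are an assumption inferred from the task names (the transcript preserves only the node labels, not the arrows), and `execution_order` is a hypothetical helper, not part of HiBench.

```python
from collections import deque

# ASSUMED edges: each task maps to the tasks it depends on.
# The slide's arrows did not survive in the transcript.
DEPENDS_ON = {
    "ETL-sales": [],
    "ETL-logs": [],
    "Pref-sales": ["ETL-sales"],
    "Pref-logs": ["ETL-logs"],
    "Pref-comb": ["Pref-sales", "Pref-logs"],
    "Item-based CF": ["Pref-comb"],
    "Offline test": ["Item-based CF"],
}


def execution_order(depends_on):
    """Return one valid run order using Kahn's algorithm."""
    # Number of unfinished prerequisites per task.
    remaining = {task: len(deps) for task, deps in depends_on.items()}
    # Reverse index: prerequisite -> tasks waiting on it.
    dependents = {task: [] for task in depends_on}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(task)

    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for waiting in dependents[task]:
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(depends_on):
        raise ValueError("cycle in task dependencies")
    return order


ORDER = execution_order(DEPENDS_ON)
```

Under these assumed edges the two ETL tasks have no prerequisites and could run in parallel, while Pref-comb must wait for both preference tasks.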

Empirical Data (hammer)
– Intel Xeon 2.2 GHz, Sandy Bridge
– 2 x 8 x HT = 32 cores
– 192G Mem, WD x12x4=14.4T
– 1000M net, 300M~400M/s
– 4-node cluster, RHL6.2, cdh4.1.2
– HiBench etl-recomm branch, HiTune-0.9
– Sales ~14G (TPC-DS scale 100), logs ~105G

Empirical Data (hammer)

LinkBench
Benchmark for Social Graph Service
Originally Developed by Facebook on Top of MySQL
– Simulate social graph workloads similar to Facebook’s online service
– Key workload properties match Facebook’s real production workload
Different from Analytical Workloads
Our Work
– Port LinkBench to HBase
– On top of Phoenix (SQL support over HBase)

Resources
HiBench –
HiBench ETL-Recomm Branch –
LinkBench –
HiTune –
Phoenix –