MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG. 10/10/2013 (c) Tilmann Rabl - msrg.org.

Presentation transcript:

MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG

MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG 7/19/2013 (c) Tilmann Rabl - msrg.org 4

[Figure: data model. Data categories: Unstructured Data, Semi-Structured Data, Structured Data. Entities: Sales, Customer, Item, Marketprice, Web Page, Web Log, Reviews. Legend: Adapted TPC-DS, BigBench Specific.]

[Figure: review text generation. Phases: Online Data Generation, Offline Preprocessing. Stages: Categorization, Real Reviews, Tokenization, Generalization, Markov Chain Input, Generated Reviews, Product Customization, Text Generation, Parameter Generation (PDGF).]
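The slide names a Markov chain as the input for generating review text. The following is a minimal sketch of that general idea only, assuming a word-level chain in which each word is followed by a word that was observed after it in the source reviews. It is not the PDGF implementation; the function names `build_chain` and `generate`, the whitespace tokenization, and the sample reviews are all hypothetical.

```python
import random
from collections import defaultdict


def build_chain(reviews):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for review in reviews:
        tokens = review.split()          # whitespace tokenization (assumption)
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, picking each next word at random."""
    words = [start]
    while len(words) < max_words:
        candidates = chain.get(words[-1])
        if not candidates:               # dead end: no observed successor
            break
        words.append(rng.choice(candidates))
    return " ".join(words)


if __name__ == "__main__":
    sample_reviews = [                   # made-up input, not benchmark data
        "great product works well",
        "great price works fine",
    ]
    chain = build_chain(sample_reviews)
    print(generate(chain, "great", 4, random.Random(7)))
```

Passing an explicit `random.Random` instance keeps the output repeatable for a given seed, which matters when generated data has to be reproducible across runs.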

Data Sources          Number of Queries   Percentage
Structured            18                  60%
Semi-structured       7                   23%
Un-structured         5                   17%

Analytic techniques   Number of Queries   Percentage
Statistics analysis   6                   20%
Data mining           17                  57%
Reporting             8                   27%
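The percentages on this slide can be checked with a short calculation. The sketch below assumes they are taken over the 30 queries implied by the data-source counts (18 + 7 + 5) and rounded to whole numbers; the helper name `percentages` is hypothetical.

```python
def percentages(counts, total):
    """Return each count as a whole-number percentage of `total`."""
    return {name: round(100 * n / total) for name, n in counts.items()}


data_sources = {"Structured": 18, "Semi-structured": 7, "Un-structured": 5}
techniques = {"Statistics analysis": 6, "Data mining": 17, "Reporting": 8}

total_queries = sum(data_sources.values())   # 30, inferred from the slide

print(percentages(data_sources, total_queries))
print(percentages(techniques, total_queries))
```

Under that assumption both tables reproduce the percentages shown. The analytic technique counts add up to 31 rather than 30; the slide does not explain the difference.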

Query Types   Number of Queries   Percentage
Pure Hive     14                  47%
Mahout        5                   17%
OpenNLP       4                   13%
Custom MR     7                   23%

    ADD FILE q1_mapper.py;
    ADD FILE q1_reducer.py;

    -- Find the most frequent ones
    SELECT pid1, pid2, COUNT (*) AS cnt
    FROM (
      -- Make items basket
      FROM (
        -- Joining two tables
        FROM (
          SELECT s.ss_ticket_number AS oid, s.ss_item_sk AS pid
          FROM store_sales s
          INNER JOIN item i ON s.ss_item_sk = i.i_item_sk
          WHERE i.i_category_id in (1,4,6)
            and s.ss_store_sk in (10, 20, 33, 40, 50)
        ) temp_join
        MAP temp_join.oid, temp_join.pid
        USING 'python q1_mapper.py' AS oid, pid
        CLUSTER BY oid
      ) map_output
      REDUCE map_output.oid, map_output.pid
      USING 'python q1_reducer.py' AS (pid1 BIGINT, pid2 BIGINT)
    ) temp_basket
    GROUP BY pid1, pid2
    HAVING COUNT (pid1) > 49
    ORDER BY pid1, cnt, pid2;

Mapper

    import sys

    if __name__ == "__main__":
        # Identity mapper: pass each (order id, item id) pair through unchanged.
        for line in sys.stdin:
            key, val = line.strip().split("\t")
            print("%s\t%s" % (key, val))

Reducer

    import sys

    def print_permutations(vals):
        # Emit every unordered pair of items in one basket.
        l = len(vals)
        # The bounds check was garbled in extraction ("if l 500"). The upper
        # bound "> 500" survives; the lower bound "< 2" is an assumed repair
        # (a basket with fewer than two items yields no pairs either way).
        if l < 2 or l > 500:
            return
        vals.sort()
        for i in range(l-1):
            for j in range(i+1, l):
                print("%s\t%s" % (vals[i], vals[j]))

    if __name__ == "__main__":
        current_key = ''
        vals = []
        for line in sys.stdin:
            key, val = line.strip().split("\t")
            if current_key == '':
                current_key = key
                vals.append(val)
            elif current_key == key:
                vals.append(val)
            elif current_key != key:
                # Key changed: flush the finished basket and start a new one.
                print_permutations(vals)
                vals = []
                current_key = key
                vals.append(val)
        # Flush the last basket.
        print_permutations(vals)
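To see what the query, mapper, and reducer on this slide compute together, the whole pipeline can be approximated in one local function. This is a sketch under stated assumptions, not the benchmark's code: `frequent_pairs`, its parameters, and the sample rows are hypothetical, the rows stand in for the joined `store_sales`/`item` result, and sorting in memory replaces `CLUSTER BY oid`.

```python
from collections import Counter
from itertools import combinations, groupby


def frequent_pairs(rows, min_count, max_basket=500):
    """Count item pairs that share a basket at least `min_count` times.

    `rows` is an iterable of (order_id, item_id) tuples.
    """
    counts = Counter()
    # Sorting by order id groups each basket together, like CLUSTER BY oid.
    for _, group in groupby(sorted(rows), key=lambda row: row[0]):
        items = sorted(item for _, item in group)
        if len(items) > max_basket:      # skip very large baskets
            continue
        # Every unordered pair within the basket, like the reducer's nested loops.
        counts.update(combinations(items, 2))
    # Equivalent of GROUP BY pid1, pid2 with a HAVING threshold.
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    rows = [                             # made-up sample data
        (1, "a"), (1, "b"),
        (2, "a"), (2, "b"), (2, "c"),
        (3, "a"),
    ]
    print(frequent_pairs(rows, min_count=2))   # {('a', 'b'): 2}
```

The slide's `HAVING COUNT (pid1) > 49` corresponds to `min_count=50` here. The quadratic number of pairs per basket is the reason for capping basket size before pairing.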
