Homework 4 Code for word count com/content/repositories/releases/com.cloudera.hadoop/hadoop-examples/0.20.2-


Homework 4
Code for word count: com/content/repositories/releases/com.cloudera.hadoop/hadoop-examples/ /org/apache/hadoop/examples/WordCount.java#WordCount
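For orientation, the word count idea can be sketched in a few lines of plain Python. This is not the Hadoop `WordCount.java` referenced above; it is a minimal single-process illustration that assumes whitespace tokenization, and the function names (`map_words`, `reduce_counts`, `word_count`) are made up for this sketch.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    # Shuffle: group mapped values by key, as the framework would do
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            groups[word].append(one)
    return dict(reduce_counts(w, c) for w, c in groups.items())
```

In a real MapReduce job the grouping step is done by the framework across machines; here a dictionary stands in for it.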

Data Bases in Cloud Environments
Based on: Md. Ashfakul Islam, Department of Computer Science, The University of Alabama

Data Today
Data sizes are increasing exponentially every day.
Key difficulties in processing large-scale data:
– acquiring the required amount of on-demand resources
– auto-scaling up and down based on dynamic workloads
– distributing and coordinating a large-scale job across several servers
– replication
– update consistency maintenance
A cloud platform can solve most of the above.

Large Scale Data Management
Large-scale data management is attracting attention.
Many organizations produce data at the petabyte (PB) level.
Managing that much data requires huge resources.
The ubiquity of huge data sets inspires researchers to think in new ways.

Issues to Consider
Distributed or centralized application?
How can ACID guarantees be maintained?
CAP theorem: Consistency, Availability, Partition tolerance
– Data availability and reliability (even under a network partition) are achieved by compromising consistency
– Traditional consistency techniques become obsolete
Consistency becomes the bottleneck of data management deployment in the cloud
– Costly to maintain

Evaluation Criteria for Data Management
Evaluation criteria:
– Elasticity: scalable, distribute new resources, offload unused resources, parallelizable, low coupling
– Security: untrusted host, moving off premises, new rules/regulations
– Replication: available, durable, fault tolerant, replication across the globe

Evaluation of Analytical DB
An analytical DB handles historical data with few or no updates, so ACID properties are not needed.
Elasticity
– Easier, since ACID is not needed (e.g. no updates, so no locking)
– A number of commercial products support elasticity
Security
– requires sensitive and detailed data
– a third-party vendor stores the data
– potential risk of data leakage and privacy violation
Replication
– A recent snapshot of the DB serves the purpose
– Strong consistency isn’t required

Analytical DBs - Data Warehousing
Data warehousing (DW) is a popular application of Hadoop
Typically a DW is relational (OLAP)
– but it can also hold semi-structured and unstructured data
Can also be parallel DBs (Teradata)
– column oriented
– Expensive, $10K per TB of data
Hadoop for DW
– Facebook abandoned Oracle for Hadoop (Hive)
– Also Pig, for semi-structured data

Evaluation of Transactional DM
Elasticity
– data is partitioned over sites
– locking and commit protocols become complex and time-consuming
– huge distributed data processing overhead
Security
– requires sensitive and detailed data
– a third-party vendor stores the data
– potential risk of data leakage and privacy violation

Evaluation of Transactional DM
Replication
– data is replicated in the cloud
– CAP theorem: of Consistency, Availability and Partition tolerance, only two are achievable at once
– between consistency and availability, one must be chosen
– availability is the main goal of the cloud
– consistency is sacrificed
– ACID violation

Transactional Data Management

Transactional Data Management
Needed because:
– it is the heart of the database industry
– almost all financial transactions are conducted through it
– it relies on ACID guarantees
ACID properties are the main challenge in deploying transactional DM in the cloud.

Scalable Transactions for Web Applications in the Cloud
Two important properties of Web applications:
– all transactions are short-lived
– a data request can be answered with a small set of well-identified data items
Scalable database services like Amazon SimpleDB and Google BigTable allow data to be queried only by primary key.
These database services maintain eventual data consistency.

Relational Joins
Hadoop is not a DB
Debate between parallel DBs and MR for OLAP
– DeWitt/Stonebraker call MR a “step backwards”
– Parallel DBs are faster because they can create indexes

Relational Joins - Example
Given 2 data sets S and T:
– (k1, (s1, S1)): k1 is the join attribute, s1 is the tuple ID, S1 is the rest of the attributes
– (k2, (s2, S2))
– (k1, (t1, T1)): info for T
– (k2, (t2, T2))
S could be user profiles: k is the PK, the tuple holds info about age, gender, etc.
T could be logs of online activity: the tuple is a particular URL, k is the FK

Reduce side Join 1:1
Map over both datasets, emit (join key, tuple)
All tuples are grouped by join key, which is what the join needs
What type of join is this?
– Parallel sort-merge join
If the join is one-to-one, at most 1 tuple from S and 1 tuple from T match
If a key has 2 values, one must be from S and the other from T (we don’t know which, since there is no order); join them
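The one-to-one case can be sketched in plain Python. This is a single-process simulation rather than Hadoop code: a dictionary stands in for the shuffle, the function names are hypothetical, and each joined pair is returned as a `frozenset` to reflect that the reducer does not know which value came from S and which from T.

```python
from collections import defaultdict


def map_emit(dataset):
    """Map step: emit (join key, tuple) for every record."""
    for join_key, tup in dataset:
        yield join_key, tup


def reduce_side_join_1to1(s, t):
    # Shuffle: group all tuples from both datasets by join key.
    groups = defaultdict(list)
    for join_key, tup in list(map_emit(s)) + list(map_emit(t)):
        groups[join_key].append(tup)

    joined = {}
    for join_key, tuples in groups.items():
        # Exactly two values means one came from S and one from T.
        # A key with a single value has no partner and is dropped.
        if len(tuples) == 2:
            joined[join_key] = frozenset(tuples)
    return joined
```

For example, with `s = [("k1", ("s1", "age=30"))]` and `t = [("k1", ("t1", "url=/home"))]`, key `k1` maps to the unordered pair of those two tuples.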

Reduce side Join 1:N
If one-to-many
– If S is the “one” side (based on PK), the same approach as 1:1 will work
– But which value is from S? (no ordering)
– Solution: buffer all values for the key in memory, pick out the tuple from S, and perform the join
– Scalability problem: uses memory
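A minimal sketch of the buffering approach, in plain Python. It assumes the mapper tags each tuple with its source (`"S"` or `"T"`) so the reducer can tell them apart; that tag and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def reduce_one_to_many(values):
    """Join one S tuple with every T tuple for a single join key.

    values: iterable of (source, tuple) pairs in arbitrary order.
    """
    # The whole group is held in memory because the S tuple may arrive
    # anywhere in the stream. This is the scalability cost.
    buffered = list(values)
    s_tuples = [tup for source, tup in buffered if source == "S"]
    if len(s_tuples) != 1:
        return []  # no S tuple for this key (or not a 1:N join)
    s_tuple = s_tuples[0]
    return [(s_tuple, tup) for source, tup in buffered if source == "T"]


def join_one_to_many(s, t):
    # Map + shuffle: tag each tuple with its source, group by join key.
    groups = defaultdict(list)
    for join_key, tup in s:
        groups[join_key].append(("S", tup))
    for join_key, tup in t:
        groups[join_key].append(("T", tup))
    return {key: reduce_one_to_many(vals) for key, vals in groups.items()}
```

The buffer grows with the number of T tuples per key, so a very popular key can exhaust the reducer's memory.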

Reduce side Join 1:N
Use value-to-key conversion
– Create a composite key: join key and tuple ID
– Define the sort order so that records sort by join key first, then IDs from S before IDs from T
– Define the partitioner to use only the join key, so all composite keys with the same join key go to the same reducer

Reduce side Join 1:N
The join key and tuple ID can be removed from the value to save space
Whenever the reducer finds a new join key, the first tuple will be from S and not T
– put it into memory (only the S one)
– join it with the other tuples until the next new join key
– No more memory bottleneck
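The composite-key approach can be sketched as follows, again as a single-process Python simulation rather than Hadoop code. The composite key here is `(join key, source, tuple ID)`; adding an explicit source tag so that S sorts before T is an assumption of this sketch, as are all function names.

```python
from collections import defaultdict

SOURCE_ORDER = {"S": 0, "T": 1}  # within a join key, S sorts before T


def sort_key(record):
    (join_key, source, tuple_id), _payload = record
    return (join_key, SOURCE_ORDER[source], tuple_id)


def partition(composite_key, num_reducers):
    # Partition on the join key only, so every composite key sharing a
    # join key reaches the same reducer.
    return hash(composite_key[0]) % num_reducers


def reduce_stream(sorted_records):
    """Stream through sorted records, holding only one S payload."""
    current_key, s_payload, out = None, None, []
    for (join_key, source, _tuple_id), payload in sorted_records:
        if join_key != current_key:
            current_key, s_payload = join_key, None
        if source == "S":
            s_payload = payload  # the only tuple kept in memory
        elif s_payload is not None:
            out.append((join_key, s_payload, payload))
    return out


def value_to_key_join(s, t, num_reducers=2):
    # Inputs are (join key, (tuple ID, payload)) records.
    records = [((k, "S", tid), p) for k, (tid, p) in s]
    records += [((k, "T", tid), p) for k, (tid, p) in t]
    reducers = defaultdict(list)
    for record in records:
        reducers[partition(record[0], num_reducers)].append(record)
    out = []
    for part in reducers.values():
        out.extend(reduce_stream(sorted(part, key=sort_key)))
    return sorted(out)
```

Because the S tuple always arrives first for its join key, the reducer never buffers the T tuples; it joins each one as it streams past.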

Consistency in Clouds

Transactional DM
A transaction is a sequence of read & write operations.
Guarantee ACID properties of transactions:
– Atomicity - either all operations execute or none.
– Consistency - the DB remains consistent after each transaction execution.
– Isolation - the impact of a transaction can’t be altered by another one.
– Durability - the impact of a committed transaction is guaranteed to persist.

ACID Properties
Atomicity is maintained by two-phase commit (2PC).
Eventual consistency is maintained.
Isolation is maintained by decomposing transactions.
Timestamp ordering is introduced to order conflicting transactions.
Durability is maintained by replicating data items across several LTMs.
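As a rough illustration of how two-phase commit gives atomicity, here is a minimal Python sketch. The slides only name 2PC; the `Participant` class, its states, and the `two_phase_commit` function are hypothetical, and the sketch leaves out timeouts, logging, and crash recovery, which any real protocol needs.

```python
class Participant:
    """A hypothetical transaction participant that votes in phase 1."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and become prepared) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    # Phase 1: collect votes from all participants.
    votes = [p.prepare() for p in participants]
    # Phase 2: broadcast the global decision.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction everywhere, which is the all-or-nothing behaviour that atomicity requires.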

Consistency in Clouds
A consistent database must remain consistent after the execution of successful operations.
Inconsistency may cause huge damage.
Consistency is often sacrificed to achieve availability and scalability.
Maintaining strong consistency in the cloud is very costly.

Traditional DM is becoming obsolete.
Thin portable devices and concentrated computing power show a new way.
ACID guarantees become the main challenge.
Some solutions have been proposed to overcome the challenge.
Consistency remains the bottleneck.
Our goal is to provide low-cost solutions to ensure data consistency in the cloud.

Current DB Market Status
MS SQL doesn’t support auto scaling and load.
MySQL is recommended for “lower traffic”
New products advertise: replace MySQL with us
Oracle recently released on-demand resource allocation
IBM DB2 can auto-scale with dynamic workload.
Azure Relational DB: great performance