
1 "The Little Flying Elephant in the Cloud" Report Series, Part 2 (Cloud Group)

2 Hadoop in SIGMOD 2011

3 Outline
 Introduction
 Nova: Continuous Pig/Hadoop Workflows
 Apache Hadoop Goes Realtime at Facebook
 Emerging Trends in the Enterprise Data Analytics
 A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

4 Industrial Sessions in SIGMOD 2011 (paper counts in parentheses)
 Data Management for Feeds and Streams (2)
 Dynamic Optimization and Unstructured Content (4)
 Business Analytics (2)
 Support for Business Analytics and Warehousing (4)
 Applying Hadoop (4)


6 Nova: Continuous Pig/Hadoop Workflows By Yahoo!

7 Nova Overview
 Scenarios
   Ingesting and analyzing user behavior logs
   Building and updating a search index from a stream of crawled web pages
   Processing semi-structured data
 Two-layer programming model (Nova over Pig)
   Continuous processing
   Independent scheduling
   Cross-module optimization
   Manageability features

8 Workflow Model
 Workflow
   Two kinds of vertices: tasks (processing steps) and channels (data containers)
   Edges connect tasks to channels and channels to tasks (see the sketch below)
 Four common patterns of processing
   Non-incremental (template detection)
   Stateless incremental (shingling)
   Stateless incremental with lookup table (template tagging)
   Stateful incremental (de-duping)
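Because edges only run task-to-channel or channel-to-task, a Nova workflow is a bipartite DAG. A minimal Java sketch of that structure; all class and member names here are hypothetical, since the paper does not define an API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Nova's workflow graph: a bipartite DAG whose
// vertices are tasks (processing steps) and channels (data containers).
class Channel {
    final String name;
    Channel(String name) { this.name = name; }
}

class Task {
    // The four common processing patterns from the slide above.
    enum Pattern { NON_INCREMENTAL, STATELESS_INCREMENTAL,
                   STATELESS_INCREMENTAL_WITH_LOOKUP, STATEFUL_INCREMENTAL }

    final String name;
    final Pattern pattern;
    final List<Channel> inputs  = new ArrayList<>();  // channel -> task edges
    final List<Channel> outputs = new ArrayList<>();  // task -> channel edges

    Task(String name, Pattern pattern) { this.name = name; this.pattern = pattern; }
}

public class Workflow {
    public static void main(String[] args) {
        Channel crawledPages = new Channel("crawled-pages");
        Channel taggedPages  = new Channel("tagged-pages");

        // Template tagging is stateless incremental with a lookup table.
        Task templateTagging =
            new Task("template-tagging", Task.Pattern.STATELESS_INCREMENTAL_WITH_LOOKUP);
        templateTagging.inputs.add(crawledPages);
        templateTagging.outputs.add(taggedPages);
    }
}
```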

9 Workflow Model (Cont.)
 Data and Update Model
   Blocks: a channel's data is divided into blocks
   Base block: contains a complete snapshot of the data on a channel as of some point in time; base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn)
   Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block

10 Workflow Model (Cont.)
 Task/Data Interface
   Consumption mode: all (read the channel's full contents) or new (read only blocks not yet consumed)
   Production mode: B (emit a new base block) or Δ (emit a delta block); see the sketch below
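The two switches compose into the four patterns of slide 8: a non-incremental task consumes all and produces a base B, while a stateless incremental task consumes only new blocks and produces a Δ. A hedged encoding, with all names invented:

```java
// Hypothetical encoding of Nova's task/data interface.
enum ConsumptionMode { ALL, NEW }       // whole channel vs. only unseen blocks
enum ProductionMode  { BASE, DELTA }    // full snapshot B vs. delta block

record TaskInterface(ConsumptionMode consume, ProductionMode produce) {}

class Examples {
    // Template detection rereads everything and rewrites the snapshot.
    static final TaskInterface NON_INCREMENTAL =
        new TaskInterface(ConsumptionMode.ALL, ProductionMode.BASE);
    // Shingling processes only newly arrived data and appends a delta.
    static final TaskInterface STATELESS_INCREMENTAL =
        new TaskInterface(ConsumptionMode.NEW, ProductionMode.DELTA);
}
```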

11 Workflow Model (Cont.)
 Workflow Programming and Scheduling
   Data-based triggers
   Time-based triggers
   Cascade triggers
 Data Compaction and Garbage Collection
   If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes B3 and adds it to the channel
   After compaction has added B3 to the channel, if the current cursor is at sequence number 2, then B0, Δ0→1, Δ1→2 can be garbage-collected (worked example below)
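A minimal sketch of the compaction arithmetic, assuming the simplest possible delta semantics (a delta is just records to append); Nova's real deltas are more general:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical compaction: fold deltas into a base block to get a new base.
class Compaction {
    // Assume append-only deltas: each delta is a list of records to add.
    static List<String> compact(List<String> base, List<List<String>> deltas) {
        List<String> newBase = new ArrayList<>(base);
        for (List<String> delta : deltas) newBase.addAll(delta);
        return newBase;
    }

    public static void main(String[] args) {
        List<String> b0 = List.of("a", "b");
        // B3 = B0 + Δ(0→1) + Δ(1→2) + Δ(2→3)
        List<String> b3 = compact(b0, List.of(List.of("c"), List.of("d"), List.of("e")));
        System.out.println(b3);   // [a, b, c, d, e]
        // With the consumer cursor at sequence 2, only Δ(2→3) and B3 are
        // still needed, so B0, Δ(0→1), Δ(1→2) can be garbage-collected.
    }
}
```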

12 Nova System Architecture

13 Apache Hadoop Goes Realtime at Facebook By Facebook

14 Workload Types
 Facebook Messaging
   High Write Throughput
   Large Tables
   Data Migration
 Facebook Insights
   Realtime Analytics
   High Throughput Increments
 Facebook Metrics System (ODS)
   Automatic Sharding
   Fast Reads of Recent Data and Table Scans

15 Why Hadoop & HBase
 Elasticity
 High write throughput
 Efficient and low-latency strong consistency semantics within a data center
 Efficient random reads from disk
 High Availability and Disaster Recovery
 Fault Isolation
 Atomic read-modify-write primitives (see the sketch below)
 Range Scans
 Tolerance of network partitions within a single data center
 Zero Downtime in case of individual data center failure
 Active-active serving capability across different data centers
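As one concrete example, the atomic read-modify-write primitive is what high-throughput realtime counters build on. A minimal client sketch against the HBase API of that era (HTable and incrementColumnValue, since deprecated); the table and column names are invented:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: atomic read-modify-write counter, the kind of operation
// realtime analytics increments rely on. Assumes an existing table
// "metrics" with column family "c".
public class CounterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");
        // Atomically adds 1 server-side and returns the new count;
        // no client-side read-then-write race.
        long views = table.incrementColumnValue(
            Bytes.toBytes("page#42"),   // row key
            Bytes.toBytes("c"),         // column family
            Bytes.toBytes("views"),     // qualifier
            1L);
        System.out.println("views = " + views);
        table.close();
    }
}
```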

16 Realtime HDFS
 High Availability: AvatarNode (an active/standby pair of NameNodes with failover, so the NameNode is no longer a single point of failure)

17 Realtime HDFS (Cont.)
 Hadoop RPC compatibility
 Block Availability: a pluggable block placement policy

18 Realtime HDFS (Cont.)
 Performance Improvements for a Realtime Workload
   RPC Timeout
   Reads from Local Replicas
 New Features
   HDFS sync (see the sketch below)
   Concurrent Readers
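HDFS sync is what lets HBase's write-ahead log survive a crash: once the flush returns, the bytes are durable in the datanode pipeline and visible to concurrent readers tailing the open file. A minimal sketch against the current FileSystem API (the Facebook-era call was named sync(), later renamed hflush()); the path and record format are invented:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: durable log writes on HDFS. After hflush() returns, the bytes
// are pushed to every datanode in the pipeline and are visible to new
// readers even though the file is still open for write.
public class WalWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/logs/wal.00001"))) {
            out.writeBytes("put row=page#42 col=c:views val=1\n");
            out.hflush();   // flush to all datanodes in the pipeline
        }
    }
}
```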

19 Production HBase
 ACID Compliance (RWCC: Read-Write Consistency Control)
   Atomicity (WALEdit)
   Consistency
 Availability Improvements
   HBase Master rewrite: region assignment state moved from the master's memory into ZooKeeper
   Online Upgrades
   Distributed Log Splitting
 Performance Improvements
   Compaction (minor and major)
   Read Optimizations

20 Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse By IBM

21 Motivation
 1. Increasing volumes of data
 2. Hadoop-based solutions in conjunction with data warehouses


23 A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses By Teradata

24 Motivation
 ETL (Extraction, Transformation, Loading) is a critical part of data warehousing
 While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node, which makes that node a bottleneck

25 Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
 More disk space can be easily added
 Use as intermediate storage
 MapReduce for transformation (see the sketch below)
 Load data in parallel
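As a sketch of "MapReduce for transformation": a map-only Hadoop job that normalizes raw rows into the delimited form a warehouse load utility expects. The field layout and delimiters are invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: the "T" in ETL as a map-only Hadoop job. Each raw tab-separated
// log line is cleaned and re-emitted as a pipe-delimited load record.
public class TransformMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split("\t");
        if (f.length < 3) return;                  // drop malformed rows
        String row = f[0].trim() + "|" + f[1].toLowerCase() + "|" + f[2];
        ctx.write(NullWritable.get(), new Text(row));
    }
}
```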

26 Block Assignment Problem
 An HDFS file F is stored on a cluster of P nodes, each uniquely identified by an integer i, 1 ≤ i ≤ P
 The problem is defined by assignment(X, Y, n, m, k, r):
   X = {1, ..., n} is the set of n blocks of F
   Y ⊆ {1, ..., P} is the set of m nodes running the PDBMS (the PDBMS nodes)
   k is the number of replicas of each block of F
   r records the replica locations of each block: r(i) returns the set of nodes holding a copy of block i

27 Block Assignment Problem (Cont.)
 An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j.
 An even assignment g is an assignment such that ∀i ∈ Y, ∀j ∈ Y: | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1, i.e., per-node block counts differ by at most one.
 The cost of an assignment g is cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, the number of blocks assigned to remote nodes (a greedy sketch follows below).
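A greedy heuristic makes the definitions concrete (only illustrative; the paper develops its own exact and approximate algorithms): give each node an even quota of ⌈n/m⌉ or ⌊n/m⌋ blocks, place each block on a quota-free node that already holds a replica when possible, and fall back to any quota-free node otherwise, paying cost 1 per remote placement:

```java
import java.util.*;

// Illustrative greedy heuristic for assignment(X, Y, n, m, k, r):
// assign each of n blocks to one of m PDBMS nodes so per-node counts
// differ by at most 1, preferring nodes that already hold a replica
// (g(i) ∈ r(i) is a local, zero-cost assignment).
public class BlockAssignment {
    static int[] assign(int n, List<Integer> nodes, Map<Integer, Set<Integer>> r) {
        int m = nodes.size();
        // Even quota: the first (n % m) nodes may take one extra block.
        Map<Integer, Integer> quota = new HashMap<>();
        for (int j = 0; j < m; j++) quota.put(nodes.get(j), n / m + (j < n % m ? 1 : 0));

        int[] g = new int[n + 1];           // g[i] = node assigned to block i
        List<Integer> unplaced = new ArrayList<>();
        for (int i = 1; i <= n; i++) {      // first pass: try local placement
            Integer local = r.getOrDefault(i, Set.of()).stream()
                .filter(j -> quota.getOrDefault(j, 0) > 0)
                .findFirst().orElse(null);
            if (local != null) { g[i] = local; quota.merge(local, -1, Integer::sum); }
            else unplaced.add(i);
        }
        for (int i : unplaced)              // second pass: remote, costs 1 each
            for (int j : nodes)
                if (quota.get(j) > 0) { g[i] = j; quota.merge(j, -1, Integer::sum); break; }
        return g;
    }

    public static void main(String[] args) {
        // 4 blocks, 2 PDBMS nodes {1,2}; r(i) = nodes holding replicas of block i.
        Map<Integer, Set<Integer>> r = Map.of(
            1, Set.of(1), 2, Set.of(1), 3, Set.of(1), 4, Set.of(2));
        int[] g = assign(4, List.of(1, 2), r);
        int cost = 0;
        for (int i = 1; i <= 4; i++) if (!r.get(i).contains(g[i])) cost++;
        System.out.println("g = " + Arrays.toString(g) + ", cost = " + cost);
    }
}
```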

28 Thank You!

