
1 LIBRA: Lightweight Data Skew Mitigation in MapReduce Qi Chen, Jinyu Yao, and Zhen Xiao Nov 2014 To appear in IEEE Transactions on Parallel and Distributed Systems

2 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

3 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

4 Introduction
 The new era of Big Data is coming!
 – 20 PB per day (2008)
 – 30 TB per day (2009)
 – 60 TB per day (2010)
 – petabytes per day
 What does big data mean? Important user information and significant business value.

5 MapReduce
 What is MapReduce? The most popular parallel computing model, proposed by Google.
 Applications:
 – Database operations: Select, Join, Group
 – Search engines: PageRank, Inverted Index, Log analysis
 – Machine learning: Clustering, Machine translation, Recommendation
 – Cryptanalysis
 – Scientific computation
 – …

6 Data skew in MapReduce
 The imbalance in the amount of data assigned to each task.
 Fundamental reasons:
 – Real-world datasets are often skewed (physical properties, hot spots)
 – The data distribution is not known beforehand
 It cannot be solved by speculative execution: Mantri observed coefficients of variation in data size across tasks of 0.34 and 3.1 at the 50th and 90th percentiles in a Microsoft production cluster.
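The skew metric quoted above, the coefficient of variation, is just the standard deviation of per-task data sizes divided by their mean. A minimal sketch (the task sizes below are invented for illustration):

```python
import statistics

def coefficient_of_variation(task_sizes):
    """CV = population standard deviation / mean of per-task data sizes.
    A CV of 0 means perfectly balanced tasks; larger values mean more skew."""
    return statistics.pstdev(task_sizes) / statistics.mean(task_sizes)

# Perfectly balanced tasks: no skew.
print(coefficient_of_variation([64, 64, 64, 64]))  # 0.0

# One straggler task with far more data than the rest: CV well above 1.
print(coefficient_of_variation([64, 64, 64, 640]))
```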

7 0 Outlines 2. Background3. Previous work4. System Design1. Introduction5. Evaluation6. Conclusion

8 Architecture
 Input files are divided into splits (Split 1 … Split M); the master assigns map and reduce tasks to workers.
 Map stage: [(K1, V1)] → map → [(K2, V2)] → combine → [(K2, [V2])], emitted as partitions (Part 1, Part 2, …) per map task.
 Reduce stage: [(K2, [V2])] → copy → sort → reduce → [(K3, V3)], producing the output files (Output 1, Output 2, …).
 Intermediate data are divided according to a user-defined partitioner.
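The map/combine/partition/sort/reduce pipeline above can be sketched as a toy single-process simulation (this is generic MapReduce, not LIBRA; all names here are illustrative):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_partitions, partitioner):
    """Toy single-process MapReduce: map -> partition -> sort -> reduce."""
    # Map stage: each split yields intermediate (K2, V2) pairs, routed to a
    # partition by the user-defined partitioner.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for split in splits:
        for k2, v2 in map_fn(split):
            partitions[partitioner(k2, num_partitions)][k2].append(v2)
    # Reduce stage: sort keys within each partition, then reduce [V2] -> V3.
    return [{k2: reduce_fn(k2, vs) for k2, vs in sorted(part.items())}
            for part in partitions]

# Word count as the classic example.
word_map = lambda line: [(w, 1) for w in line.split()]
word_reduce = lambda k, vs: sum(vs)
hash_part = lambda key, n: hash(key) % n

out = run_mapreduce(["a b a", "b c"], word_map, word_reduce, 2, hash_part)
counts = {k: v for part in out for k, v in part.items()}
print(dict(sorted(counts.items())))  # {'a': 2, 'b': 2, 'c': 1}
```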

9 Challenges in solving data skew
 Many real-world applications exhibit data skew: Sort, Grep, Join, Group, Aggregation, PageRank, Inverted Index, etc.
 The data distribution cannot be determined ahead of time.
 The computing environment can be heterogeneous: diversity of hardware, resource competition in cloud environments.

10 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

11 Previous work
 In the parallel database area: limited to join, group, and aggregate operations.
 Pre-run sampling jobs:
 – Adding two pre-run sampling and counting jobs for theta-join (SIGMOD'11)
 – Pre-processing extraction and sampling procedures for spatial feature extraction (SOCC'11)
 – Drawbacks: significant overhead; applicable only to certain applications
 Collecting data information during job execution:
 – Collecting key frequencies on each node and aggregating them on the master after all maps finish (CloudCom'10)
 – Partitioning intermediate data into many partitions and greedily bin-packing them after all maps finish (CLOSER'11, ICDE'12)
 – Drawbacks: brings a barrier between the map and reduce phases; bin-packing cannot support total order
 SkewTune (SIGMOD'12):
 – Splits skewed tasks when detected; reconstructs the output by concatenating the results
 – Drawbacks: needs more task slots; cannot detect large keys; cannot split in the copy and sort phases

12 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

13 LIBRA – Solving data skew
 1. The master issues sample map tasks first
 2. Sample map tasks read input splits from HDFS and sample the data
 3. The master calculates the partitions
 4. The master asks workers to partition the map output; reduce tasks then write results back to HDFS

14 Sampling and partitioning
 Sampling strategies: random, TopCluster (ICDE'12); LIBRA samples the p largest keys plus q random keys.
 Estimating the intermediate data distribution: each large key represents only itself; each random key represents a small range of keys.
 Partitioning strategies: hash, bin-packing, range; LIBRA uses range partitioning.
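LIBRA's sampling strategy (the p largest keys plus q random keys) can be sketched roughly as follows; the function name, the per-split key counts, and the seed are all made up for illustration:

```python
import heapq
import random

def sample_split(key_counts, p, q, rng):
    """From one sample map task's key -> count statistics, keep the p keys
    with the largest counts plus q random keys from the rest (LIBRA-style)."""
    largest = heapq.nlargest(p, key_counts, key=key_counts.get)
    rest = [k for k in key_counts if k not in set(largest)]
    random_keys = rng.sample(rest, min(q, len(rest)))
    # Large keys each represent only themselves; random keys each stand in
    # for a small range of unsampled keys around them.
    return ({k: key_counts[k] for k in largest},
            {k: key_counts[k] for k in random_keys})

rng = random.Random(42)
counts = {"the": 900, "of": 700, "zebra": 3, "apple": 5, "quark": 2, "kite": 4}
large, rand = sample_split(counts, p=2, q=2, rng=rng)
print(sorted(large))  # ['of', 'the']
```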

15 Heterogeneity consideration
 Example: three nodes with performance 1.5, 1, and 0.5 share 300 tuples of intermediate data. Reducers 1, 2, and 3 receive 150, 100, and 50 tuples in proportion to node performance, so all reducers start and finish processing at the same time.
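Balancing processing time rather than data volume amounts to giving each reducer a quota proportional to its node's performance. A minimal sketch of that arithmetic (the helper name is an assumption):

```python
def proportional_quotas(total_tuples, performance):
    """Assign each reducer a share of the intermediate data proportional to
    its node's measured performance, so all reducers finish together."""
    total_perf = sum(performance)
    quotas = [round(total_tuples * p / total_perf) for p in performance]
    # Fix any rounding drift so the quotas still sum to the total.
    quotas[-1] += total_tuples - sum(quotas)
    return quotas

# Slide 15's example: 300 tuples, node performance 1.5 : 1 : 0.5.
print(proportional_quotas(300, [1.5, 1.0, 0.5]))  # [150, 100, 50]
```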

16 Problem Statement

17 Distribution estimation
 The sorted sample L = K1, K2, …, K|L| is divided by "marked keys" into intervals; interval i is estimated to contain Pi keys and Qi tuples (a marked key's interval has Pi = 1).
 Steps: (a) sum up the samples, (b) pick the marked keys, (c) estimate the distribution.
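One way to read the three steps above: between consecutive marked keys, scale the sampled key and tuple counts up by the inverse sampling rate. A guessed sketch (the function, the sample format, and the rate are assumptions, not the paper's actual estimator):

```python
def estimate_intervals(sampled, marked, sample_rate):
    """Walk the sorted sample; each marked (large) key forms its own interval
    with P_i = 1, while runs of random keys between marked keys are scaled
    up by 1 / sample_rate to estimate P_i keys and Q_i tuples per interval."""
    intervals = []
    run_keys, run_tuples = 0, 0
    for key, cnt in sampled:  # sampled is sorted by key
        if key in marked:
            if run_keys:  # close the run of random keys before the marked key
                intervals.append((run_keys / sample_rate, run_tuples / sample_rate))
                run_keys, run_tuples = 0, 0
            intervals.append((1, cnt / sample_rate))  # large key stands alone
        else:
            run_keys += 1
            run_tuples += cnt
    if run_keys:
        intervals.append((run_keys / sample_rate, run_tuples / sample_rate))
    return intervals

sampled = [("a", 2), ("c", 3), ("hot", 50), ("m", 1), ("q", 4)]
print(estimate_intervals(sampled, marked={"hot"}, sample_rate=0.25))
# [(8.0, 20.0), (1, 200.0), (8.0, 20.0)]
```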

18 Sparse index to speed up partitioning
 The sorted intermediate data are stored in chunks; a sparse index keeps one entry (Kb, Offset, L, Checksum) per chunk, recording its first key, byte offset, length, and checksum. Locating partition boundaries through the index, instead of scanning the intermediate data, decreases the partition time by an order of magnitude.
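A sparse index of this shape can be searched with plain binary search instead of a scan; a hypothetical sketch (the entry layout and values are invented):

```python
import bisect

# Each entry summarizes one chunk of sorted map output:
# (first key in chunk, byte offset, length, checksum).
index = [
    ("apple", 0, 4096, 0xA1),
    ("kiwi", 4096, 4096, 0xB2),
    ("mango", 8192, 4096, 0xC3),
]

def locate_chunk(index, boundary_key):
    """Binary-search the sparse index for the chunk that may contain
    boundary_key, instead of scanning all intermediate (k, v) pairs."""
    first_keys = [entry[0] for entry in index]
    # Rightmost chunk whose first key is <= boundary_key.
    i = bisect.bisect_right(first_keys, boundary_key) - 1
    return index[max(i, 0)]

print(locate_chunk(index, "lemon"))  # ('kiwi', 4096, 4096, 178)
```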

19 Large cluster splitting
 When the reduce phase treats each intermediate (k, v) pair independently (e.g., sort, grep, join), a large key cluster can be split across reducers.
 Example: clusters A (cnt = 100), B (cnt = 10), C (cnt = 10) with two reducers.
 – Split not allowed: Reducer 1 gets A (100); Reducer 2 gets B and C (20). The result is skewed.
 – Split allowed: A is split into 60 + 40, so Reducer 1 gets A (60) and Reducer 2 gets A (40), B (10), C (10).
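The split shown above can be produced by a simple range-partitioning pass that overflows a large cluster into the next reducer; a sketch under the assumption that cluster counts are known exactly (the helper name is invented):

```python
def place_with_split(clusters, num_reducers):
    """Range-partition key clusters across reducers, splitting any cluster
    whose count overflows the per-reducer target (allowed when each (k, v)
    pair can be reduced independently, e.g. sort/grep/join)."""
    total = sum(cnt for _, cnt in clusters)
    target = total / num_reducers
    assignments = [[] for _ in range(num_reducers)]
    r, load = 0, 0
    for key, cnt in clusters:  # clusters are sorted by key
        while cnt > 0:
            room = target - load
            # The last reducer absorbs everything that remains.
            take = cnt if (cnt <= room or r == num_reducers - 1) else int(room)
            if take <= 0:  # no room left on this reducer; move to the next
                r, load = r + 1, 0
                continue
            assignments[r].append((key, take))
            cnt -= take
            load += take
            if load >= target and r < num_reducers - 1:
                r, load = r + 1, 0
    return assignments

# Slide 19's example: A is a large key next to small clusters B and C.
print(place_with_split([("A", 100), ("B", 10), ("C", 10)], 2))
# [[('A', 60)], [('A', 40), ('B', 10), ('C', 10)]]
```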

20 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

21 Experiment environment
 Cluster: 30 virtual machines on 15 physical machines.
 Each physical machine: dual processors (2.4 GHz Xeon E5620), 24 GB of RAM, two 150 GB disks, connected by 1 Gbps Ethernet.
 Each virtual machine: 2 virtual cores, 4 GB of RAM, and 40 GB of disk space.
 Benchmarks: Sort, Grep, Inverted Index, Join.

22 Evaluation – Accuracy of the sampling method

23 Evaluation – LIBRA execution (sort): 80% faster than Hadoop hash partitioning; 167% faster than Hadoop range partitioning.

24 Evaluation – Degree of skew (sort): the overhead of LIBRA is minimal.

25 Evaluation – Different applications: Grep – searching for different words in the full English Wikipedia archive, with a total data size of 31 GB.

26 Evaluation – Different applications: Inverted Index, on the full English Wikipedia archive.

27 Evaluation – Different applications: Join.

28 Evaluation – Heterogeneous environments (sort): 30% faster than without the heterogeneity consideration.

29 Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion

30 Conclusion
 We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
 – A new sampling method for general user-defined programs: the p largest keys plus q random keys
 – An approach to balance the load among the reduce tasks, with support for splitting large keys
 – Consideration of heterogeneous environments: balance the processing time instead of just the amount of data
 Performance evaluation demonstrates that the improvement is significant (up to 4× faster) and that the overhead is negligible even in the absence of skew.

31 Thank You!

