Jiaheng Lu Renmin University of China


1 Jiaheng Lu Renmin University of China
Big Data Management – Challenges and Opportunities – an Incomplete Survey Tutorial on HotDB Jiaheng Lu Renmin University of China Joint work with Yu Liu

2 Tutorial objectives
Big data challenges. New principles of big data management. Big data management research: indexes, transactions, architecture, applications, benchmarks.

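3 Big data challenge
Big data: science data, finance data, streaming data, Internet data.
Data volumes are growing explosively and content is increasingly varied and complex: by 2009 the total had reached 281 EB and was still growing at 30% per year. An estimated 80% of data cannot be managed by existing database systems, which makes it harder for people to obtain information. In response, Science recently published a special issue on dealing with data (Science, 11 February 2011, Vol. 331, No. 6018). Its 14 papers, from genomics, astronomy, ecology, and clinical medicine to high-energy physics, all focus on data and state plainly that "science is data, and data is science" and that "data drives the development of science". The issue's introduction, "Challenges and Opportunities", reports a Science survey on data use: of 1,700 responses from leading international research groups across disciplines, about 20% regularly use or analyze data sets larger than 100 GB, and 7% use data sets larger than 1 TB. These are typical problems of discovering new scientific knowledge by processing massive multi-source data, and of addressing key societal issues such as health, natural resource management, and the response to climate change.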

4 Big data management challenge
The growth in database transactions and volumes has a large impact on response times Source:

5 Many techniques have evolved
Master/slave replication, cluster computing, table partitioning, federated tables.

6 Four new principles in big data management

7 New principle in big data management (1)
Partition everything and use key-value storage ("divide all things in order to govern them"). First normal form cannot be satisfied. Note: GFS is built from hundreds or thousands of machines assembled from inexpensive commodity parts and is accessed by a comparable number of client machines.
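The partitioning principle on this slide can be illustrated with a minimal sketch, assuming plain hash partitioning over in-memory dictionaries. The names `partition_for` and `PartitionedKVStore` are hypothetical and do not come from the tutorial or from GFS; real systems add rebalancing, replication, and failure handling.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedKVStore:
    """Toy key-value store: each partition stands in for one server."""

    def __init__(self, num_partitions: int):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition(self, key: str) -> dict:
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key: str, value) -> None:
        self._partition(key)[key] = value

    def get(self, key: str, default=None):
        return self._partition(key).get(key, default)


store = PartitionedKVStore(num_partitions=4)
store.put("user:1", "alice")
print(store.get("user:1"))
```

Because a lookup touches only the partition that owns the key, adding partitions spreads both data and load, which is the point of "partition everything".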

8 New principle in big data management (2)
Embrace inconsistency ("tolerating differences is what makes great harmony possible"). ACID properties are not satisfied.

9 New principle in big data management (3)
Back up everything with three copies ("a wily hare has three burrows and so can rest easy"). Guarantee % safety

10 New principle in big data management (4)
Scalable and high performance ("plan across the vast ocean, with capacity for all of it").

11 Big data management
Partition everything ("divide all things in order to govern them"). Embrace inconsistency ("tolerating differences is what makes great harmony possible"). Back up data with three copies ("a wily hare has three burrows and so can rest easy"). Scalable and high performance ("plan across the vast ocean, with capacity for all of it").

12 Big Data Management Indexes on Big Data Transaction on Big Data
Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

13 Related Papers

14 Related Papers

15 Big data papers (incomplete data)
Indexes on Big Data: ~4 papers; Transaction on Big Data: 4~5 papers; Processing Architecture on Big Data: 6~7 papers; Applications in MapReduce Parallel Processing; Benchmark of Big Data Management System: 3~4 papers

16 Big Data Management Indexes on Big Data Transaction on Big Data
Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

17 Indexes on Big Data
Construct indexes that can be maintained incrementally. Avoid bottlenecks in tree-like structures so that reads and writes can proceed concurrently.

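18 Indexes on Big Data
Distributed B-Tree. Goal: perform consistent concurrent updates while allowing highly concurrent reads. M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008
The index and data are distributed over n servers. Each server stores a subset of the index nodes plus the version numbers of all inner nodes (assuming they fit). Clients also cache the version numbers of some inner nodes, although these may be stale; stale or missing inner-node information is refreshed when the client contacts a server. Clients cache version numbers and other metadata so that the servers holding the root (and upper-level nodes) do not become a bottleneck. Every server stores all of these version numbers so that a client's cached metadata can be checked for staleness at whichever server the client contacts (keeping the number of client-server round trips low is essential for efficient queries).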

19 Indexes on Big Data
Distributed B-Tree, 3 techniques: transactions with optimistic concurrency control; lazy replication of version numbers at clients; eager replication of version numbers at servers. M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008
Sinfonia mainly provides fault tolerance and a lightweight primitive for executing transactions. Transactions here mean the same as in traditional databases (ACID). Sinfonia's minitransactions ensure that locks are held only briefly. Two-phase commit is used; read sets and write sets are maintained, and version numbers are used to validate consistency (an approach used by many distributed transaction modules). Eager replication at servers guarantees transactional consistency; lazy replication at clients is for efficiency.
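The version-number validation described in the notes can be sketched in a few lines. This is a single-process illustration of optimistic concurrency control only, not Sinfonia's minitransaction protocol or the paper's implementation; `VersionedStore` and its methods are hypothetical names.

```python
class VersionedStore:
    """Toy store where every key carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        """Return (value, version); unknown keys have version 0."""
        return self._data.get(key, (None, 0))

    def commit(self, read_set, write_set):
        """Validate, then apply.

        read_set:  key -> version the transaction observed when it read the key
        write_set: key -> new value
        Returns False (abort) if any observed version is stale.
        """
        for key, seen_version in read_set.items():
            if self.read(key)[1] != seen_version:
                return False  # someone else committed in between: abort
        for key, value in write_set.items():
            _, version = self.read(key)
            self._data[key] = (value, version + 1)
        return True


store = VersionedStore()
store.commit({}, {"node42": "leaf-v1"})

# Two transactions read the same version, then both try to update.
seen_by_a = {"node42": store.read("node42")[1]}
seen_by_b = {"node42": store.read("node42")[1]}
print(store.commit(seen_by_a, {"node42": "leaf-v2"}))  # first writer wins
print(store.commit(seen_by_b, {"node42": "leaf-v3"}))  # stale read set: aborted
```

No locks are held while a transaction works; conflicts are detected only at commit time, which is why cached (possibly stale) version numbers at clients are safe to use.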

20 Indexes on Big Data
Use a BATON overlay to support range queries. Local B+-tree index and Cloud Global (CG) index. Publish only part of each local index to the global index to achieve high throughput and concurrency. Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010

21 Indexes on Big Data
BATON overlay. Steps to retrieve data:
1. Search in the BATON tree (lookup()).
2. For all overlapping nodes in the global index, find the corresponding nodes (and local indexes).
3. Search in the local B+-tree index to retrieve the data.
Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
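The retrieval steps on this slide follow a two-level pattern: a global index routes a range query to the nodes whose key ranges overlap it, and each node then searches its local index. The sketch below assumes a flat list standing in for the BATON overlay and a sorted key list standing in for the local B+-tree; `LocalNode` and `range_query` are hypothetical names, not the paper's API.

```python
from bisect import bisect_left, bisect_right


class LocalNode:
    """One storage node with a sorted local index over its own records."""

    def __init__(self, records):
        self.records = dict(records)
        self.keys = sorted(self.records)  # stand-in for a local B+-tree

    def range_search(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi."""
        start = bisect_left(self.keys, lo)
        end = bisect_right(self.keys, hi)
        return [(k, self.records[k]) for k in self.keys[start:end]]


def range_query(global_index, lo, hi):
    """global_index: list of (min_key, max_key, node) entries published by nodes."""
    results = []
    for min_key, max_key, node in global_index:
        if min_key <= hi and lo <= max_key:  # published range overlaps the query
            results.extend(node.range_search(lo, hi))
    return sorted(results)


node_a = LocalNode({1: "a", 4: "d", 7: "g"})
node_b = LocalNode({10: "j", 12: "l", 20: "t"})
global_index = [(1, 7, node_a), (10, 20, node_b)]
print(range_query(global_index, 5, 12))
```

Only the published range summaries live in the global index, so most of the index maintenance stays local to each node.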

22 Big Data Management Transaction on Big Data Indexes on Big Data
Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

23 The CAP Theorem Availability Consistency Partition tolerance
PODC: ACM Symposium on Principles of Distributed Computing

24 The CAP Theorem
Consistency: once a writer has written, all readers will see that write.

25 The CAP Theorem
Availability: the system is available during software and hardware upgrades and node failures.

26 The CAP Theorem
Partition tolerance: the system can continue to operate in the presence of network partitions.

27 The CAP Theorem
Theorem: you can have at most two of these properties (consistency, availability, partition tolerance) for any shared-data system.

28 Consistency
Two kinds of consistency: strong consistency, ACID (Atomicity, Consistency, Isolation, Durability); weak consistency, BASE (Basically Available, Soft-state, Eventual consistency).
At least one replica in the cluster is up to date; the other replicas eventually converge to it.

29 A tailor RDBMS LOCK ACID SAFETY TRANSACTION 3NF

30 Transaction on Big Data
“Not all data need to be treated at the same level of consistency.” Goal: minimize the overall cost of operations in the cloud. Consistency Rationing: define consistency guarantees on the data instead of at the transaction level; switch consistency guarantees automatically at runtime; 3 categories. T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009

31 Transaction on Big Data
Category C: Session consistency. (Temporary) inconsistency is acceptable; read-your-own-writes monotonicity; replicas converge and achieve eventual consistency at some interval.
Category A: Serializable. A consistency violation results in large penalty costs.
Category B: Adaptive. Trade-off between cost per operation and consistency level; switches between session consistency and serializability at runtime.
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
Category C can, for example, use vector clocks and similar methods to reach eventual consistency (failures may occur). Category B provides a way to switch dynamically between A and C.

32 Transaction on Big Data
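Category B: trade-off between cost per operation and consistency level.
General Policy: “higher consistency levels need to be provided when the conflict (update) rate is high.”
Time Policy: as the “deadline” approaches, there are more commits.
Fixed Threshold Policy (for numeric types).
Dynamic Policy (for numeric types). Y: sum of update values.
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
The General Policy assumes that inconsistency can arise when many updates happen at the same time. The formula gives the probability that more than one update occurs (within a given time period). In a distributed setting, a sliding window is used to estimate the number of updates. The Time Policy assumes that transactions commit or synchronize only at the end, so the consistency level is raised in the final period. The remaining three policies apply to numeric data; they assume that amounts are subtracted from a positive value V while keeping V > 0, as when selling goods, where the estimated profit should be maximized. The Fixed Threshold Policy requires V to stay above a threshold T and switches to category A when V falls below T. The Demarcation Policy gives each server a share that it can spend on its own; once a server has used up its share it must draw on the global value and enters A. The Dynamic Policy estimates the value of Y in the same way as the General Policy; if T is smaller than Y, it enters A. Because all the policies, and the paper's idea as a whole, rest on probabilistic guarantees, consistency cannot be guaranteed at all times; what is guaranteed is minimal cost.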
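The Fixed Threshold Policy in the notes reduces to a single comparison, which a short sketch makes concrete. The function names, the level labels, and the stock example are illustrative assumptions, not taken from the paper.

```python
def choose_consistency_level(value, threshold):
    """Fixed Threshold Policy: below the threshold T, treat the record as
    category A (serializable); otherwise use the cheaper session consistency."""
    return "serializable" if value < threshold else "session"


def sell(stock, quantity, threshold):
    """Decrement a numeric value V and report which level the update ran under."""
    if quantity > stock:
        raise ValueError("cannot sell more than is in stock (V must stay >= 0)")
    level = choose_consistency_level(stock, threshold)
    return stock - quantity, level


stock = 5
for _ in range(4):
    stock, level = sell(stock, 1, threshold=3)
    print(stock, level)
```

While plenty of stock remains, an occasional stale read is harmless and cheap; near zero, a wrong decision oversells, so the update pays for serializability.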

33 Datalog and coordination complexity: theoretical results from a PODS perspective
(PODS keynote 2011 Joseph M. Hellerstein, UC Berkeley)

34 Datalog Main expressive advantage: recursive queries.
More convenient for analysis: papers look better. Without recursion but with negation it is equivalent in power to relational algebra Has affected real practice: (e.g., recursion in SQL3, magic sets transformations).

35 Datalog Example
Datalog program:
parent(bill,mary).
parent(mary,john).
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z),ancestor(Z,Y).
?- ancestor(bill,X)
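To show what the recursive rule computes, here is a minimal sketch that evaluates the same program by naive bottom-up fixpoint iteration in Python. It is one possible evaluation strategy chosen for illustration, not how a Datalog engine is necessarily implemented; the function name `ancestors` is hypothetical.

```python
def ancestors(parent):
    """Compute ancestor/2 from parent/2 by iterating the rules to a fixpoint."""
    # ancestor(X,Y) :- parent(X,Y).
    ancestor = set(parent)
    while True:
        # ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
        derived = {(x, y)
                   for (x, z) in parent
                   for (z2, y) in ancestor
                   if z == z2}
        if derived <= ancestor:  # nothing new: fixpoint reached
            return ancestor
        ancestor |= derived


parent = {("bill", "mary"), ("mary", "john")}

# ?- ancestor(bill, X)
print(sorted(y for (x, y) in ancestors(parent) if x == "bill"))  # ['john', 'mary']
```

The set of derived facts only ever grows, which is the monotonicity property the later conjectures on this topic rely on.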

36 Joseph’s Conjecture(1)
CONJECTURE 1. Consistency And Logical Monotonicity (CALM). A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.

37 Joseph’s Conjecture (2)
CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON). Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.

38 Joseph’s Conjecture (3)
CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program’s Coordination Complexity.

39 Joseph’s Conjecture (4)
CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally-minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.

40 Circumstance has presented a rare opportunity—call it an imperative—for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish. ------Joseph M. Hellerstein (UC Berkeley)

41 Big Data Management Processing Architecture on Big Data
Indexes on Big Data Transaction on Big Data Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

42 Processing Architecture on Big Data
Make MapReduce more powerful, especially on complicated analysis Merge cloud computing systems and PDBMSs

43 Mapreduce online testing platform
Cloudcomputing.ruc.edu.cn Automatic evaluation of Hadoop Mapreduce codes Theoretical questions

44 Open MapReduce testing platform: cloudcomputing.ruc.edu.cn

45 Processing Architecture on Big Data
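“Sort-merge implementation in Hadoop poses fundamental barrier to incremental one-pass analysis”. New hash-based platform.
The main idea: a sort-merge implementation is a poor fit for stream analysis and other one-pass analytics, either because there is too much data to record in full or because it is too slow, so a new platform is needed that supports pipelined rather than blocking processing. The paper argues that sorting and related steps in the MapReduce implementation are too expensive, and it uses a hash-based implementation instead. The basic idea of MR-hash is that every map task hashes its output, so that records with the same key arrive at the same reducer. Data is kept in memory as far as possible (see the figure) and flushed to disk when it does not fit. Buckets that are too large can be hashed again, somewhat like the multi-pass algorithms of traditional databases. Important data (for example, what the user is waiting for) can be kept in memory as bucket 1 and processed promptly. Each bucket's data is handed to a reduce task. In effect, hashing implements the group-by operation.
For cases where memory is insufficient, INC-hash and Dynamic INC-hash are used, because many operations, such as count and sum, can maintain a running result. For each key, a state value is kept if it still fits in memory; what does not fit is written to disk. Important (high-frequency) keys are kept in memory together with a count: when a key that is already in the table is processed, its count is incremented; otherwise all counts are decremented by one. Some replacement policy is applied.
B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011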
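The counting scheme in the notes (increment the count of a key that is in the table, otherwise decrement all counts) resembles the classic Misra-Gries frequent-items algorithm. The sketch below implements that classic algorithm as an assumption about what the notes describe; it is not the paper's Dynamic INC-hash, and the function name is hypothetical. In the real platform, tuples whose keys are not resident would be written to disk rather than dropped.

```python
def frequent_key_counts(stream, capacity):
    """Keep at most `capacity` keys in memory, favouring high-frequency keys.

    The returned counts are lower bounds on the true frequencies, not exact
    totals, because counts are decremented whenever the table is full.
    """
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1          # key is resident: update its state
        elif len(counts) < capacity:
            counts[key] = 1           # room left: start tracking the key
        else:
            # Table full: decrement every resident key and evict those at zero.
            for resident in list(counts):
                counts[resident] -= 1
                if counts[resident] == 0:
                    del counts[resident]
    return counts


print(frequent_key_counts("aabacad", capacity=2))  # {'a': 3, 'd': 1}
```

Rare keys are evicted quickly while hot keys survive, so most updates hit in-memory state and only the long tail needs disk.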

46 Processing Architecture on Big Data
Fast join processing in data warehouses. Partition data into vertical groups dynamically. Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011

47 Processing Architecture on Big Data
Fast join processing in data warehouses. Partition data into vertical groups dynamically. Concurrent join. More map-side joins. Basic patterns: star pattern and chain pattern.

48 Processing Architecture on Big Data
Make MapReduce more powerful, especially on complicated analysis Merge cloud computing systems and PDBMSs

49 Processing Architecture on Big Data
HadoopDB: combination of parallel DBMS (performance) and MapReduce (scalability, fault tolerance). Communication layer: MapReduce. Nodes: single-node DBMS instances. SMS Planner: SQL → MapReduce job → SQL

50 Big Data Management Applications in MapReduce Parallel Processing
Indexes on Big Data Transaction on Big Data Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

51 Applications in MapReduce Parallel Processing
A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011. Discusses algorithms for theta-joins (inequality joins).
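A theta-join cannot be routed by hashing on a join key, because any pair of tuples might satisfy the condition. One way around this, sketched below in the spirit of randomized join-matrix partitioning, is to arrange the reducers in a grid, send each tuple of one input to a random row and each tuple of the other input to a random column, and let each reducer join what it receives. This is a simplified single-process simulation under stated assumptions, not the paper's algorithm in full; the function name and grid parameters are hypothetical.

```python
import random
from collections import defaultdict


def theta_join_mapreduce(s_rows, t_rows, theta, grid_rows=2, grid_cols=2, seed=0):
    """Simulate a theta-join over a grid of grid_rows x grid_cols reducers."""
    rng = random.Random(seed)
    reducers = defaultdict(lambda: ([], []))  # (row, col) -> (S part, T part)

    # Map phase: an S tuple goes to every reducer in one random row,
    # a T tuple goes to every reducer in one random column.
    for s in s_rows:
        row = rng.randrange(grid_rows)
        for col in range(grid_cols):
            reducers[(row, col)][0].append(s)
    for t in t_rows:
        col = rng.randrange(grid_cols)
        for row in range(grid_rows):
            reducers[(row, col)][1].append(t)

    # Reduce phase: each reducer evaluates the condition on its local pairs.
    # A pair (s, t) meets only at reducer (row of s, column of t), so every
    # pair is checked exactly once and no result is duplicated.
    output = []
    for s_part, t_part in reducers.values():
        output.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return output


pairs = theta_join_mapreduce([1, 5, 9], [2, 6, 10], lambda s, t: s < t)
print(sorted(pairs))
```

Random placement balances the work across reducers regardless of the data distribution, at the price of replicating each tuple along one row or column.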

52 Applications in MapReduce Parallel Processing
R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010. Uses the MapReduce framework to perform set-similarity joins: given two files (or one), find all pairs of records (a, b) such that a and b are similar (sim(a, b) > t). Gives algorithms that cope with large amounts of data, together with an experimental evaluation.
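The problem statement can be made concrete with a small self-join sketch. It assumes Jaccard similarity over whitespace-separated tokens as sim(a, b) and uses a simple "share at least one token" candidate filter; the paper's algorithms use more refined filtering, so this only illustrates the map and reduce roles. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def set_similarity_self_join(records, threshold):
    """Return sorted id pairs (a, b) with sim(a, b) > threshold.

    records: dict mapping record id -> text.
    """
    token_sets = {rid: set(text.split()) for rid, text in records.items()}

    # Map phase: emit (token, record id) so records sharing a token are grouped.
    by_token = defaultdict(list)
    for rid, tokens in token_sets.items():
        for token in tokens:
            by_token[token].append(rid)

    # Reduce phase: records grouped under the same token become candidates.
    candidates = set()
    for rids in by_token.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verification: compute the exact similarity for each candidate pair.
    return sorted((a, b) for a, b in candidates
                  if jaccard(token_sets[a], token_sets[b]) > threshold)


records = {1: "a b c", 2: "a b d", 3: "x y"}
print(set_similarity_self_join(records, threshold=0.4))  # [(1, 2)]
```

Pairs with no token in common have similarity 0 and can never exceed a non-negative threshold, so grouping by token loses no results while avoiding the full cross product.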

53 Big Data Management Applications in MapReduce Parallel Processing
Indexes on Big Data Transaction on Big Data Processing Architecture on Big Data Applications in MapReduce Parallel Processing Benchmark of Big Data Management System

54 Benchmark of Big Data Management System
Comparison of the performance of the MapReduce paradigm and parallel DBMSs. Performance: PDBMSs >> MR systems (except data loading). Comparison dimensions: schema support, indexing, programming model, data distribution, execution strategy, flexibility, fault tolerance. A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009

55 Benchmark of Big Data Management System
Comparison of the performance of the MapReduce paradigm and parallel DBMSs. Performance: PDBMSs >> MR systems (except data loading). Comparison dimensions: schema support, indexing, programming model, data distribution, execution strategy, flexibility, fault tolerance. A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009

56 Benchmark of Big Data Management System
How do architectures affect the performance of database applications in the cloud, especially for OLTP? D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010

57 Benchmark of Big Data Management System
How do architectures affect the performance of database applications in the cloud, especially for OLTP? D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010 WIPS: web interactions per second (valid requests per second)

58 Benchmark of Big Data Management System
How do architectures affect the performance of database applications in the cloud, especially for OLTP? D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010 EB: emulated browsers

59 Conclusion Big Data Management: HOT DB topic Research topics:
Indexing, transaction, join, architecture, application, benchmark

60 References
Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
David Chiu, A. Shetty, et al. Evaluating and Optimizing Indexing Schemes for a Cloud-based Elastic Key-Value Store. In th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
J. Wang, S. Wu, et al. Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010
D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
H. T. Vo, C. Chen, et al. Towards Elastic Transactional Cloud Storage with Range Query Support. VLDB 2010
H. Kllapi, E. Sitaridi, et al. Schedule Optimization for Data Processing Flows on the Cloud. SIGMOD 2011
M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008

61 References
E. Friedman, P. Pawlowski, et al. SQL/MapReduce: A Practical Approach to Self-describing, Polymorphic, and Parallelizable User-defined Functions. VLDB 2009
R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010
S. Blanas, J. M. Patel, et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010
D. Logothetis, K. Yocum. Ad-Hoc Data Processing in the Cloud. VLDB 2008
B. Panda, J. S. Herbach, et al. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009
A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011
K. Morton, M. Balazinska, et al. ParaTimer: A Progress Indicator for MapReduce DAGs. SIGMOD 2010
Y. Cao, C. Chen, et al. ES2: A Cloud Data Storage System for Supporting Both OLTP and OLAP. ICDE 2011
K. Morton, A. Friesen, et al. Estimating the Progress of MapReduce Pipelines. ICDE 2010

62 References
W. Lang, J. M. Patel. Energy Management for MapReduce Clusters. VLDB 2010
T. Nykiel, M. Potamias, et al. MRShare: Sharing Across Multiple Queries in MapReduce. VLDB 2010
C. Olston, G. Chiou, et al. Nova: Continuous Pig/Hadoop Workflows. SIGMOD 2011
Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011
B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011
D. G. Campbell, G. Kakivaya, et al. Extreme Scale with Full SQL Language Support in Microsoft SQL Azure. SIGMOD 2010
A. Abouzeid, K. B-Pawlikowski, et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
Y. Xu, P. Kostamaa, et al. Integrating Hadoop and Parallel DBMS. SIGMOD 2010
J. A. Q-Ruiz, C. Pinkel, et al. RAFT at Work: Speeding-Up MapReduce Applications under Task and Node Failures. SIGMOD 2011
A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009

