Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB.


1 Big Data Management – Challenges and Opportunities – an Incomplete Survey Jiaheng Lu Renmin University of China Joint work with Yu Liu Tutorial on HotDB

2 Tutorial objectives Big data challenges Big data management new principles Big data management research – Indexes – Transaction – Architecture – Application – Benchmark

3 Big data challenge Big data – Science data – Finance data – Streaming data – Internet data

4 Big data management challenge The growth in database transactions and volumes has a large impact on response times Source:

5 Many techniques have evolved: Master/Slave, Cluster Computing, Table Partitioning, Federated Tables

6 Four new principles in big data management

7 New principle in big data management (1) Partition everything and use key-value storage ("divide all things in order to govern them"). 1st normal form cannot be satisfied
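
The idea of partitioning everything into key-value storage can be illustrated with a minimal sketch. It assumes a fixed number of partitions and a stable hash; the names `partition_for` and `PartitionedKV` are hypothetical and do not come from the tutorial.

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

class PartitionedKV:
    """Toy key-value store: each partition is an independent dict."""

    def __init__(self, num_partitions=4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _bucket(self, key):
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key, value):
        self._bucket(key)[key] = value

    def get(self, key, default=None):
        return self._bucket(key).get(key, default)

store = PartitionedKV(4)
# The value is a nested, non-atomic structure, which is why
# first normal form no longer holds in this style of storage.
store.put("user:42", {"name": "Ann", "tags": ["a", "b"]})
```

Because every key lands in exactly one partition, reads and writes for different keys can be served by different machines without coordination.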

8 New principle in big data management (2) Embrace inconsistency ("tolerating differences brings great harmony"). ACID properties are not satisfied

9 New principle in big data management (3) Back up everything with three copies ("a wily hare keeps three burrows and so sleeps soundly"). Guarantee % safety
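
A minimal sketch of three-copy placement follows. The placement rule (hash the key, then take the next nodes in a fixed list) and the names `replica_nodes` and `first_live_replica` are assumptions made for illustration; the tutorial does not specify a placement scheme.

```python
import hashlib

def replica_nodes(key, nodes, copies=3):
    """Pick `copies` distinct nodes for a key (hypothetical placement rule)."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

def first_live_replica(placement, alive):
    """Return the first replica that is still alive, or None if all are down."""
    for node in placement:
        if node in alive:
            return node
    return None

nodes = ["n0", "n1", "n2", "n3", "n4"]
placement = replica_nodes("user:42", nodes)
```

With three copies on distinct nodes, a read still succeeds after any two of the three replicas fail.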

10 New principle in big data management (4) Scalable and high performance ("plan across the vast sea, with capacity to hold it all")

11 Big data management – Partition everything – Embrace inconsistency – Back up data with three copies – Scalable and high performance

12 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

13 Related Papers

15 Big data papers (incomplete data) – Indexes on Big Data: ~4 papers – Transaction on Big Data: 4~5 papers – Processing Architecture on Big Data: 6~7 papers – Applications in MapReduce Parallel Processing: 6~7 papers – Benchmark of Big Data Management System: 3~4 papers

16 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

17 Indexes on Big Data – Construct indexes that can be maintained incrementally. – Avoid bottlenecks in tree-like structures so that reads and writes can proceed concurrently.

18 Distributed B-Tree Goal: perform consistent concurrent updates while allowing highly concurrent reads M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008 Indexes on Big Data

19 Distributed B-Tree 3 techniques: – Transactions with optimistic concurrency control – Lazy replication of version numbers at clients – Eager replication of version numbers at servers M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008 Indexes on Big Data
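
Optimistic concurrency control with version numbers can be sketched as below. This is a single-process illustration under stated assumptions, not the protocol of Aguilera et al.: the class `VersionedStore` and its `commit` method are hypothetical, and a real system would have to make validation and write-back atomic across servers.

```python
class VersionedStore:
    """Toy store where every key carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unseen keys start at version 0."""
        return self._data.get(key, (0, None))

    def commit(self, read_versions, writes):
        """Validate the versions a transaction read, then apply its writes.

        read_versions: key -> version observed when the key was read
        writes: key -> new value
        Returns False (abort) if any observed version is stale.
        """
        for key, seen in read_versions.items():
            if self.read(key)[0] != seen:
                return False
        for key, value in writes.items():
            version, _ = self.read(key)
            self._data[key] = (version + 1, value)
        return True

store = VersionedStore()
v1, _ = store.read("node:root")   # transaction 1 reads
v2, _ = store.read("node:root")   # transaction 2 reads the same version
ok1 = store.commit({"node:root": v1}, {"node:root": "split-by-t1"})
ok2 = store.commit({"node:root": v2}, {"node:root": "split-by-t2"})
```

The first transaction commits and bumps the version; the second fails validation because the version it read is no longer current, so it must retry.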

20 – Use the BATON overlay to support range queries – Local B+-tree index & Cloud Global (CG) index – Publish only a few local index nodes to the global index to get high throughput and concurrency Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010 Indexes on Big Data

21 BATON overlay Steps to retrieve data: 1. Search in the BATON tree (lookup()); 2. For all overlapping nodes in the global index, find the corresponding nodes (and local indexes); 3. Search in the local B+-tree index to retrieve data Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010 Indexes on Big Data
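
The three retrieval steps can be sketched with a two-level lookup. As a loud simplification, a flat list of key ranges stands in for the BATON overlay and sorted Python lists stand in for the local B+-trees; the data and the name `range_query` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Local indexes: one sorted list of (key, value) pairs per node.
local_indexes = {
    "node1": [(1, "a"), (5, "b"), (9, "c")],
    "node2": [(12, "d"), (17, "e")],
    "node3": [(21, "f"), (30, "g")],
}

# Global index: (low key, high key, node) entries published by each node.
global_index = [(1, 9, "node1"), (12, 17, "node2"), (21, 30, "node3")]

def range_query(lo, hi):
    """Return all (key, value) pairs with lo <= key <= hi."""
    results = []
    # Steps 1-2: find global entries whose range overlaps [lo, hi].
    for g_lo, g_hi, node in global_index:
        if g_lo <= hi and lo <= g_hi:
            # Step 3: search the local sorted index on that node.
            entries = local_indexes[node]
            keys = [k for k, _ in entries]
            results.extend(entries[bisect_left(keys, lo):bisect_right(keys, hi)])
    return results

hits = range_query(5, 15)
```

Only nodes whose published range overlaps the query are contacted, which is what keeps the global index small and the lookup cheap.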

22 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

23 The CAP Theorem Consistency Partition tolerance Availability

24 The CAP Theorem Once a writer has written, all readers will see that write Consistency Partition tolerance Availability

25 The CAP Theorem System is available during software and hardware upgrades and node failures. Consistency Partition tolerance Availability

26 The CAP Theorem A system can continue to operate in the presence of network partitions. Consistency Partition tolerance Availability

27 The CAP Theorem Theorem: You can have at most two of these properties for any shared-data system Consistency Partition tolerance Availability

28 Consistency Two kinds of consistency: – strong consistency: ACID (Atomicity, Consistency, Isolation, Durability) – weak consistency: BASE (Basically Available, Soft state, Eventual consistency)

29 A tailor 3NF TRANSACTION LOCK ACID SAFETY RDBMS

30 Consistency Rationing – “Not all data need to be treated at the same level of consistency.” – Goal: minimize the overall cost of operations in the cloud – Define consistency guarantees on the data instead of at the transaction level – Switch consistency guarantees automatically at runtime – 3 categories T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009 Transaction on Big Data

31 Category A: Serializable. A consistency violation results in large penalty costs. Category B: Adaptive. Trade-off between cost per operation and consistency level; switches between session consistency and serializability at runtime. Category C: Session consistency. (Temporary) inconsistency is acceptable; provides read-your-own-writes monotonicity; replicas converge and achieve eventual consistency at some interval. T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009

32 Category B: trade-off between cost per operation & consistency level – General Policy: “a higher consistency level needs to be provided when the number of conflicts (updates) is high.” – Time Policy: as the “deadline” approaches, more commits. – Fixed Threshold Policy (for numeric types) – Dynamic Policy (for numeric types); Y: sum of update values T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009 Transaction on Big Data
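
A fixed threshold policy for a numeric value could look roughly like the sketch below. The exact rule is an assumption: here an update is handled with serializability once the projected value falls to or below a fixed threshold, and with cheaper session consistency otherwise. The function name `choose_consistency` and the stock example are hypothetical.

```python
def choose_consistency(current_value, update_delta, threshold):
    """Pick a consistency level for one update to a numeric value.

    Assumed rule: when the value after the update would be at or below
    `threshold`, a conflict is costly (e.g. overselling stock), so use
    serializability; otherwise session consistency is considered enough.
    """
    projected = current_value + update_delta
    return "serializable" if projected <= threshold else "session"

# Plenty of stock left: the cheap level is acceptable.
plenty = choose_consistency(current_value=100, update_delta=-1, threshold=10)
# Stock is about to run low: pay for the strong level.
scarce = choose_consistency(current_value=11, update_delta=-1, threshold=10)
```

The design choice is to pay for strong consistency only in the region of the value range where an inconsistency would actually cause a penalty.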

33 Datalog and coordination complexity: theoretical results from the PODS perspective (PODS 2011 keynote, Joseph M. Hellerstein, UC Berkeley)

34 Datalog Main expressive advantage: recursive queries. More convenient for analysis: papers look better. Without recursion but with negation it is equivalent in power to relational algebra Has affected real practice: (e.g., recursion in SQL3, magic sets transformations).

35 Datalog Example Datalog program: parent(bill,mary). parent(mary,john). ancestor(X,Y) :- parent(X,Y). ancestor(X,Y) :- parent(X,Z),ancestor(Z,Y). ?- ancestor(bill,X)
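
To show what the recursive `ancestor` rules compute, here is a minimal sketch of naive bottom-up evaluation written in Python. It hard-codes the two rules from the example rather than parsing Datalog; the function name `ancestors` is hypothetical.

```python
# Facts: parent(bill, mary). parent(mary, john).
parent = {("bill", "mary"), ("mary", "john")}

def ancestors(parent_facts):
    """Compute the ancestor relation as a least fixpoint."""
    # Rule 1: ancestor(X, Y) :- parent(X, Y).
    ancestor = set(parent_facts)
    while True:
        # Rule 2: ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
        derived = {
            (x, y)
            for (x, z) in parent_facts
            for (z2, y) in ancestor
            if z == z2
        }
        if derived <= ancestor:  # nothing new: fixpoint reached
            return ancestor
        ancestor |= derived

# Query: ?- ancestor(bill, X)
answers = sorted(y for (x, y) in ancestors(parent) if x == "bill")
print(answers)  # ['john', 'mary']
```

The loop only ever adds facts and never retracts them, which is the monotonic behaviour that the conjectures on the following slides refer to.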

36 Joseph’s Conjecture(1) CONJECTURE 1. Consistency And Logical Monotonicity (CALM). A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.

37 Joseph’s Conjecture (2) CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON). Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.

38 Joseph’s Conjecture (3) CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program’s Coordination Complexity.

39 Joseph’s Conjecture (4) CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally- minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.

40 Circumstance has presented a rare opportunity—call it an imperative—for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish Joseph M. Hellerstein (UC Berkeley)

41 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

42 Processing Architecture on Big Data Make MapReduce more powerful, especially on complicated analysis Merge cloud computing systems and PDBMSs

43 MapReduce online testing platform Cloudcomputing.ruc.edu.cn – Automatic evaluation of Hadoop MapReduce code – Theoretical questions

44 Open MapReduce testing platform cloudcomputing.ruc.edu.cn

45 New Hash-Based Platform – “Sort-merge implementation in Hadoop poses fundamental barrier to incremental one-pass analysis” B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011 Processing Architecture on Big Data
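
Why hashing suits one-pass analysis can be seen in a small sketch. This is not the platform of Li et al.; it is a generic incremental group-by count, with the hypothetical name `incremental_counts`, showing that a hash table can emit an up-to-date answer after every record, whereas a sort-merge step must see all input before producing any group.

```python
from collections import defaultdict

def incremental_counts(stream):
    """Yield (key, running count) after each record, in a single pass."""
    counts = defaultdict(int)  # hash table holding one state per key
    for key in stream:
        counts[key] += 1
        yield key, counts[key]

clicks = ["page_a", "page_b", "page_a", "page_a"]
updates = list(incremental_counts(clicks))
```

Each record is touched once and the per-key state is updated in place, so early answers are available while the input is still arriving.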

46 Fast Join Processing in Data Warehouse – Partitioning data into vertical groups dynamically Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011 Processing Architecture on Big Data

47 Fast Join Processing in Data Warehouse – Partitioning data into vertical groups dynamically – Concurrent join: more map-side joins – Basic patterns: star pattern & chain pattern Processing Architecture on Big Data

48 Make MapReduce more powerful, especially on complicated analysis Merge cloud computing systems and PDBMSs

49 HadoopDB – Combination of parallel DBMS (performance) and MapReduce (scalability, fault tolerance) – Communication layer: MapReduce; nodes: single-node DBMS instances – SMS Planner: SQL → MapReduce job → SQL Processing Architecture on Big Data

50 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

51 A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011 – Discusses algorithms for theta-joins (inequality joins) Applications in MapReduce Parallel Processing
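
One way to run a theta-join in a MapReduce style is to partition the join matrix into regions and assign tuples to regions at random. The sketch below simulates that in a single process; treating it as representative of the algorithms in the cited paper is an assumption, and the name `theta_join_mapreduce` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def theta_join_mapreduce(R, S, theta, rows=2, cols=2, seed=0):
    """Simulate a randomized matrix-partitioned theta-join.

    The |R| x |S| join matrix is cut into rows x cols regions, one per
    reducer. Each R tuple picks a random row band and is sent to every
    region in it; each S tuple picks a random column band likewise.
    Any (r, s) pair therefore meets in exactly one region.
    """
    rng = random.Random(seed)
    regions = defaultdict(lambda: ([], []))
    # Map phase: replicate tuples to the regions they may join in.
    for r in R:
        band = rng.randrange(rows)
        for c in range(cols):
            regions[(band, c)][0].append(r)
    for s in S:
        band = rng.randrange(cols)
        for row in range(rows):
            regions[(row, band)][1].append(s)
    # Reduce phase: each region runs a local nested-loop join.
    out = []
    for r_list, s_list in regions.values():
        out.extend((r, s) for r in r_list for s in s_list if theta(r, s))
    return out

R = [1, 5, 9]
S = [2, 6, 7]
pairs = sorted(theta_join_mapreduce(R, S, lambda r, s: r < s))
```

Because the assignment ignores the join condition, the same scheme works for any predicate, including inequalities that a hash partitioning on the join key cannot handle.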

52 R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010 – Uses the MapReduce framework to perform set-similarity joins, i.e. given two files (or one), find all pairs of records (a, b) such that a and b are similar (sim(a, b) > t) – Gives algorithms that cope with large amounts of data, together with an experimental evaluation. Applications in MapReduce Parallel Processing
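
The join condition sim(a, b) > t can be made concrete with a small single-process sketch of a self-join. It assumes Jaccard similarity over token sets and generates candidates from every shared token, which is a simplification and not the filtering used in the cited paper; the names `jaccard` and `similarity_self_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def similarity_self_join(records, t):
    """Return sorted id pairs (a, b) with jaccard(a, b) > t.

    records: dict mapping record id -> set of tokens
    """
    # "Map": emit (token, record id) so records sharing a token meet.
    by_token = defaultdict(set)
    for rid, tokens in records.items():
        for tok in tokens:
            by_token[tok].add(rid)
    # "Reduce": every pair of ids under one token is a candidate.
    candidates = set()
    for rids in by_token.values():
        candidates.update(combinations(sorted(rids), 2))
    # Verify candidates against the actual similarity threshold.
    return sorted(
        (a, b) for a, b in candidates if jaccard(records[a], records[b]) > t
    )

records = {
    1: {"big", "data", "index"},
    2: {"big", "data", "join"},
    3: {"cloud", "storage"},
}
similar = similarity_self_join(records, t=0.4)
```

Records 1 and 2 share two of four distinct tokens (similarity 0.5), so they pass the 0.4 threshold; record 3 shares no token with the others and is never even compared.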

53 Big Data Management – Indexes on Big Data – Transaction on Big Data – Processing Architecture on Big Data – Applications in MapReduce Parallel Processing – Benchmark of Big Data Management System

54 Benchmark of Big Data Management System Comparison of performance between the MapReduce paradigm and parallel DBMSs. Performance: PDBMSs >> MR systems (except data loading). Comparison dimensions: Schema Support, Indexing, Programming Model, Data Distribution, Execution Strategy, Flexibility, Fault Tolerance A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009

56 How do architectures affect cloud computing performance for database applications, especially OLTP? D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010 Benchmark of Big Data Management System

59 Conclusion Big Data Management: HOT DB topic Research topics: Indexing, transaction, join, architecture, application, benchmark

60 References
Sai Wu, Dawei Jiang, et al. Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010
David Chiu, A. Shetty, et al. Evaluating and Optimizing Indexing Schemes for a Cloud-based Elastic Key-Value Store. In th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
J. Wang, S. Wu, et al. Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010
D. Kossmann, T. Kraska, et al. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. SIGMOD 2010
T. Kraska, M. Hentschel, et al. Consistency Rationing in the Cloud: Pay only when it matters. VLDB 2009
H. T. Vo, C. Chen, et al. Towards Elastic Transactional Cloud Storage with Range Query Support. VLDB 2010
H. Kllapi, E. Sitaridi, et al. Schedule Optimization for Data Processing Flows on the Cloud. SIGMOD 2011
M. K. Aguilera, W. Golab, et al. A Practical Scalable Distributed B-Tree. VLDB 2008

61 References
E. Friedman, P. Pawlowski, et al. SQL/MapReduce: A Practical approach to self-describing, polymorphic, and parallelizable user-defined functions. VLDB 2009
R. Vernica, M. J. Carey, et al. Efficient Set-Similarity Joins Using MapReduce. SIGMOD 2010
S. Blanas, J. M. Patel, et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010
D. Logothetis, K. Yocum. Ad-Hoc Data Processing in the Cloud. VLDB 2008
B. Panda, J. S. Herbach, et al. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009
A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. SIGMOD 2011
K. Morton, M. Balazinska, et al. ParaTimer: A Progress Indicator for MapReduce DAGs. SIGMOD 2010
Y. Cao, C. Chen, et al. ES²: A Cloud Data Storage System for Supporting Both OLTP and OLAP. ICDE 2011
K. Morton, A. Friesen, et al. Estimating the Progress of MapReduce Pipelines. ICDE 2010

62 References
W. Lang, J. M. Patel. Energy Management for MapReduce Clusters. VLDB 2010
T. Nykiel, M. Potamias, et al. MRShare: Sharing Across Multiple Queries in MapReduce. VLDB 2010
C. Olston, G. Chiou, et al. Nova: Continuous Pig/Hadoop Workflows. SIGMOD 2011
Y. Lin, D. Agrawal, et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. SIGMOD 2011
B. Li, E. Mazur, et al. A Platform for Scalable One-Pass Analytics using MapReduce. SIGMOD 2011
D. G. Campbell, G. Kakivaya, et al. Extreme Scale with Full SQL Language Support in Microsoft SQL Azure. SIGMOD 2010
A. Abouzeid, K. B-Pawlikowski, et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
Y. Xu, P. Kostamaa, et al. Integrating Hadoop and Parallel DBMS. SIGMOD 2010
J. A. Q-Ruiz, C. Pinkel, et al. RAFT at Work: Speeding-Up MapReduce Applications under Task and Node Failures. SIGMOD 2011
A. Pavlo, E. Paulson, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009

