Can Parallel Database Systems Help Big Data Analytics?


1 Can Parallel Database Systems Help Big Data Analytics?
Carlos Ordonez, University of Houston, USA

2 Talk Outline
1. Big Data Analytics
   Cubes: DSS queries, basic statistics
   Models: machine learning & statistics
   Graphs
2. Processing alternatives
   Inside a DBMS: SQL queries, UDFs
   Outside a DBMS: HDFS, MapReduce, Spark, C++
3. Benchmarking & optimizations
   Benchmarking
   Algorithmic optimizations: parallel, general, across systems
   Specific optimizations: system-dependent

3 Acknowledgments
Mike Stonebraker, MIT (DBMS-centric)
Divesh Srivastava, AT&T (theory)

4 Big Data
The four V's: the classic three (volume, velocity, variety) plus veracity
Finer granularity than transactions
No schema required
In general, big data cannot be analyzed directly: pre-processing is needed
Diverse data sources: non-relational, beyond alphanumeric tables
  logs: user interactions, social networks, streams, sensors

5 Big Data Systems
NoSQL: no SQL, really?
Some ACID guarantees, but not OLTP
Web-scale data is not universal: many practical problems are smaller
Data(base) integration and cleaning remain harder than DB querying
SQL remains the de facto query language

6 Big Data Analytics: DWH + Machine Learning + Text
Data preparation: copy, load; data exploration, data quality
Prep: data integration, cleaning, preprocessing
Analytics: cubes; machine learning & statistics models; graphs
Highly iterative process
Deployment: scoring, tuning, lifecycle maintenance

7 Ignored and misunderstood aspects
Preparing data takes a lot of time: SQL helps but requires expertise; alternatives are statistical programming in R or SAS, spreadsheet macros, custom applications in C++/Java
A lot of data, especially tabular data, originates in a DBMS
Emphasis on large n, but many problems involve many "small" files
The analytic data set is generally much smaller than the raw data
Model computation is the most researched topic; deployment is rarely considered

8 Analytics
Simple: cubes: BI, OLAP, MOLAP; they capture descriptive statistics (histograms, means, quantiles, plots, statistical tests)
Complex: mathematical models (ML, statistics, optimization); graphs (paths, connectivity, cliques)

9 Cubes
Multidimensional cube; star/snowflake schema
Queries: aggregations + joins
Query optimization: cost-based optimization, heuristics, avoiding worst plans, relational algebra
Requirement: ER and physical model
Best platform: still the DBMS
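A typical cube query is an aggregation over a star-schema join; a minimal sketch, where the fact table sales and dimension tables product and store are hypothetical:

    SELECT p.category, s.region, SUM(f.amount) AS total
    FROM sales f
    JOIN product p ON f.productId = p.productId
    JOIN store s ON f.storeId = s.storeId
    GROUP BY p.category, s.region;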

10 Machine Learning Models
Large n or d: cannot fit in RAM; minimize I/O
Multidimensional: 10s to 1000s of dimensions d; feature selection and dimensionality reduction
Represented & computed with matrices & vectors:
  data set: mixed attributes
  model: numeric = matrices, discrete = histograms
  intermediate computations: matrices, histograms, equations

11 Machine Learning Algorithms
Behavior with respect to the data set X:
  few: one pass or a fixed number of passes (SVD, regression, Naive Bayes)
  most: iterative until convergence (k-means, C5.0, SVM)
Time complexity: may increase when data spills to disk
Research issues: parallel processing; main memory coming back; incremental and online learning
DBMS: vanilla models on tabular data

12 Graphs
G=(V,E), n=|V|; large n, but G sparse
Typical problems (polynomial, combinatorial, NP-complete):
  transitive closure, reachability
  shortest/critical path
  neighborhood analysis
  clique detection
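Several of these problems map to recursive queries. A minimal sketch of reachability from a source vertex using standard WITH RECURSIVE (supported by PostgreSQL among others), assuming a hypothetical edge table E(src, dst); UNION rather than UNION ALL makes the recursion terminate on cyclic graphs:

    WITH RECURSIVE reach(node) AS (
      SELECT dst FROM E WHERE src = 1    -- hypothetical start vertex
      UNION
      SELECT E.dst FROM E JOIN reach r ON E.src = r.node
    )
    SELECT node FROM reach;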

13 Graph algorithms
Combinatorial versus matrix-based
Complexity: from P to NP-complete, NP-hard, and exponential problems
DBMS: indeed good for many polynomial graph problems, but certainly not for the harder ones

14 Lessons from DB research

15 Why analytics inside the DBMS?
Huge data volumes: potentially better results with more data; less processing time
Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security
Caveats: SQL is not as powerful as C++; limited mathematical functionality; complex DBMS architecture to extend at the source-code level

16 2. Processing
2.1 Parallel processing
2.2 Inside DBMS: SQL, UDFs, internal C code
2.3 Outside DBMS: MapReduce, Spark, C++/Java
2.4 Contrasting technologies

17 Parallel Processing: Scaleout vs Scaleup

18 Parallel architecture

19 Parallel processing issues
Contention for shared resources
Load balancing: skew in data values; startup and availability overhead
Data shuffling: redistribution, transfer, read/write to disk
Scale-up is better for DBMSs
Scale-out is harder for a parallel DBMS than for Hadoop

20 2.2 Inside DBMS
Assumption:
  data records are already in the DBMS; exporting them is slow
  the data set or graph is stored as a table
Programming alternatives:
  SQL queries and UDFs: SQL code generation (JDBC), precompiled UDFs; also stored procedures, embedded SQL, cursors
  internal C code (direct access to the file system and RAM): hard, non-portable
DBMS advantages:
  important: storage, queries, security
  secondary: recovery, concurrency control, integrity, OLTP

21 In-DBMS analytic architecture

22 Programming with queries
Pros:
  isolation from DBMS internals
  declarative, elegant, abstract
Cons:
  difficult to program (numerical methods)
  no direct binding with arrays/matrices in SQL
  slow processing when joins are involved (even hash joins)
  internal data conversion generally necessary
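As an example of expressing a model as queries despite the cons above, simple linear regression fits in one aggregation pass: compute sufficient statistics, then derive the coefficients. A minimal sketch; the input table XY(x, y) and its real-valued columns are hypothetical:

    SELECT slope, ybar - slope * xbar AS intercept
    FROM (SELECT (sxy - sx * sy / n) / (sxx - sx * sx / n) AS slope,
                 sx / n AS xbar, sy / n AS ybar
          FROM (SELECT COUNT(*) AS n, SUM(x) AS sx, SUM(y) AS sy,
                       SUM(x*x) AS sxx, SUM(x*y) AS sxy
                FROM XY) s) t;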

23 Inside DBMS: accelerate processing with UDFs
Classification of User-Defined Functions (UDFs):
  scalar UDF
  aggregate UDF
  table UDF
Programming:
  called in a SELECT statement
  C code or a similar language; API provided by the DBMS, generally C/C++
  data type mapping
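The three UDF kinds differ in how they are called from a SELECT. A sketch with hypothetical UDF names (sigmoid, corr_udf, pca_udf); table-UDF syntax varies by DBMS, and the last line follows Oracle-style syntax:

    SELECT sigmoid(x) FROM X;                          -- scalar UDF: one value per row
    SELECT cid, corr_udf(x, y) FROM X GROUP BY cid;    -- aggregate UDF: one value per group
    SELECT * FROM TABLE(pca_udf(CURSOR(SELECT * FROM X)));  -- table UDF: returns a table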

24 Inside DBMS: UDF pros and cons
Advantages:
  arrays and flow control
  flexibility in code writing, no side effects
  no need to modify DBMS internal code
  in general, simple data types
Limitations:
  OS- and DBMS-architecture dependent, not portable
  no I/O capability, no side effects
  null handling and fixed memory allocation
  memory leaks with arrays (matrices): fenced/protected mode
  debugging harder than in Java, Scala, R

25 Inside DBMS: aggregate UDF (scalar UDF skipped) [JM1998,SIGMOD]
Table scan; memory allocation of variables in the heap
GROUP BY extends their computing expressibility, but requires handling nulls
Advantage: parallel & multithreaded processing
Drawback: returns a single value, not a table
Supported by practically all DBMSs: Vertica, SQL Server, PostgreSQL, SciDB, Teradata, Oracle, DB2, ...
Useful for mathematical model computation

26 Inside DBMS: internal C code (if source code is available); not popular [LTWZ2005,SIGMOD] [MYC2005,VLDB] [SD2001,CIKM]
Advantages:
  access to the file system (table record blocks); leverage physical operators (scan, join, sort, search)
  main memory, data structures, libraries
  hardware: multithreading, multicore CPUs, RAM, L1/L2 caching
  math libraries: LAPACK
Disadvantages:
  requires careful integration with subsystems
  not available to end users and practitioners
  may require exposing calls through a DM language or SQL

27 Inside DBMS: query/UDF evaluation, physical operators [DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]
Serial DBMS (1 CPU, RAID/shared disk):
  table scan
  join: hash, sort-merge, indexed nested loop
  external merge sort
Parallel DBMS (shared-nothing):
  even row distribution, hashing
  parallel table scan
  parallel joins: large/large (sort-merge, hash); large/small (replicate the small table)
  parallel/distributed sort: aggregation

28 Outside DBMS
Plain files: C++, Java, R, SAS, ...
Shared-nothing (DBMS, Hadoop):
  file systems: POSIX | HDFS
  platforms: DBMS, MapReduce, Spark, graph engines
Shared memory (HPC, 1000s of cores): MPI, OpenMP

29 Outside DBMS: plain files, sequential processing; packages, libraries, Java/C++ [ZHY2009,CIDR] [ZZY2010,ICDE]
Statistical and DM packages (e.g. R, SAS): exported flat files; proprietary file formats
Memory-based (processing data records, models, internal data structures)
Programming languages (arrays, control statements, libraries)
Limitation: large numbers of records
Packages: R, SAS, SPSS, Matlab, WEKA

30 Outside DBMS: MapReduce, Spark [DG2008,CACM]
Parallel processing; simple; shared-nothing
Commodity, diverse hardware (big clusters)
Functions programmed in a high-level language (e.g. Java, Python, Scala); flexible
<key,value> pairs processed in two phases:
  map(): computation distributed and evaluated in parallel; independent mappers
  reduce(): partial results combined/summarized
Can be inside or outside the DBMS, depending on the level of integration
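For intuition, the canonical MapReduce word-count job, where map() emits <word,1> pairs and reduce() sums them per key, corresponds to a single grouped aggregation in SQL; a sketch over a hypothetical table words(word):

    SELECT word, COUNT(*) AS cnt
    FROM words
    GROUP BY word;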

31 Outside DBMS: MapReduce files and processing
File types:
  text files: portable storage (e.g. CSV)
  SequenceFiles: efficient processing as blocks
Processing:
  pairs are sorted by key before being sent to reducers
  small files are merged to get larger blocks
Con: intermediate results are stored in the file system (I/O)

32 Contrasting: DBMS vs MapReduce/Spark

DBMS                                                              | MapReduce/Spark
------------------------------------------------------------------|----------------
Schema required                                                   | No schema
Text processing difficult                                         | Text processing easy
Simple parsing of files                                           | Customized parsers needed
Built-in indexing                                                 | Manually added indexes
Programming: declarative                                          | Programming: imperative
More rigid                                                        | Much more flexible
Shorter programs via SQL queries, aggregate UDFs                  | Longer programs in Java/MR; Scala/Spark shorter map()/reduce() (Spark in RAM)
Automatic optimization: data distribution for parallel processing | Manual optimization: the programmer optimizes access
Fault tolerance: in transactions, not queries                     | Fault tolerance in long jobs, per node

33 3. Benchmarking & optimizations
Benchmarking: DBMS vs Spark
Data set layout
Optimizations:
  algorithmic
  DBMS-specific: SQL queries, UDFs
  MapReduce: optimize map() and reduce()
  Spark: optimized MapReduce, higher abstraction than Java

34 Benchmark: models (linear regression)

35 Benchmark: graphs

36 Storage: plans are similar, but physical operators and costs change
Classical: row store
New: column store (same relational algebra, different cost)
Array store (non-relational, linear algebra, different cost)

37 Optimization: data set storage layout, horizontal vs vertical

Horizontal                               | Vertical
-----------------------------------------|---------
Limitation with high d (max columns)     | No problems with high d
Default layout for most algorithms       | Requires a clustered index
SQL arithmetic expressions and UDFs      | SQL aggregations, joins, UDFs
Easy to interpret                        | Difficult to interpret
Suitable for dense matrices              | Suitable for sparse matrices, text, web
Complete record processing               | UDF: detect point boundaries
n rows, d columns                        | |X| <= dn rows, few (3-5) columns
Fast block-based I/O                     | Slower I/O for low d
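In the vertical layout a matrix is stored as (i, j, v) entries, and linear algebra turns into joins plus aggregations. A sketch of sparse matrix multiplication AB over hypothetical tables A(i, j, v) and B(i, j, v), where omitted entries are zeros:

    SELECT A.i, B.j, SUM(A.v * B.v) AS v
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j;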

38 Optimizations: algorithmic & systems
Algorithmic (90% of research, many efficient algorithms):
  accelerate/reduce computations or convergence
  database-systems focus: reduce I/O
  approximate solutions
  parallel
Systems (SQL, MapReduce/Spark):
  hardware: cluster of computers vs multicore CPUs
  programming: SQL/C++ versus Java/Scala

39 Algorithmic: models [ZRL1996,SIGMOD]
Programming: data set available as a flat file; a binary file is required for random access
May require data structures that work in main memory and on disk
Programming not in SQL: C/C++ are the preferred languages
Assumption d << n: n has received more attention
Issue: d > n causes numerical problems and a large covariance/correlation matrix (larger than X)

40 Algorithmic optimizations: models [STA1998,SIGMOD] [ZRL1996,SIGMOD] [O2007,SIGMOD]
Exact model computation:
  data summaries: sufficient statistics (Gaussian pdf), histograms, discretization
  accelerate convergence, reduce iterations
  faster matrix operations: * +
  parallel
Approximate model computation:
  sampling: efficient in time, O(s)
  incremental: escape local optima, reseed, gradient descent
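A worked form of the Gaussian sufficient statistics (standard identities, with n, L, Q for the count, linear sum, and quadratic sum of the points x_i):

    n = |X|, \qquad L = \sum_{i=1}^{n} x_i, \qquad Q = \sum_{i=1}^{n} x_i x_i^T

    \mu = \frac{L}{n}, \qquad \Sigma = \frac{Q}{n} - \frac{1}{n^2} L L^T

One pass over X computes n, L, Q; the model parameters then follow without rescanning the data.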

41 DBMS query optimizations
Split queries; the query optimizer falls short
Join:
  denormalized storage: model, intermediate tables
  favor hash joins over sort-merge for the data set
  secondary indexing for joins: sort-merge join
Aggregation (creating statistical variables):
  push GROUP BY before the join: watch out for nulls and high-cardinality columns
  outer joins
Scans: synchronized scans (shared I/O); sampling with O(1) access (truly random is difficult; approximation error must be handled)
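A sketch of pushing the GROUP BY below the join: pre-aggregate the large fact table, then join the much smaller result to the dimension table (sales and product are hypothetical; as noted above, watch for null keys and high-cardinality grouping columns):

    SELECT p.category, SUM(t.total) AS total
    FROM (SELECT productId, SUM(amount) AS total
          FROM sales
          GROUP BY productId) t
    JOIN product p ON t.productId = p.productId
    GROUP BY p.category;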

42 DBMS SQL optimizations
SQL queries:
  mathematical equations as queries: joins + aggregations
  Turing complete: SQL code generation + an imperative programming language
UDFs as a query-optimization substitute:
  difficult/slow math computations: summaries and gradient descent
  eliminate joins
  RAM: data structures, push processing, but no I/O

43 DBMS programming optimizations [HLS2005,TODS] [O2007,TKDE]
UDFs can substitute for slow SQL query code: complex matrix computations
Scalar UDFs: vector operations, math functions
Aggregate UDFs (UDAs): compute data set summaries in parallel, update state in RAM, but blocking behavior
Table UDFs: streaming model; external temporary file; get close to array functionality => a patch solution

44 Skip: MapReduce optimizations [ABASR2009,VLDB] [CDDHW2009,VLDB] [SADMPPR2010,CACM]
Data set:
  keys as input; partition the data set
  text versus sequential files
  loading into the file system may be required
Parallel processing:
  high-cardinality keys: handle skewed distributions
  reduce row redistribution in map()
Main memory processing

45 Skip: MapReduce common optimizations [DG2008,CACM] [FPC2009,PVLDB] [PHBB2009,PVLDB]
Modify block size; disable block replication
Delay reduce(); chain map()
Tune M and R (memory allocation and number)
Several M use the same R: M > R
Avoid full scans by using subfiles (requires a file-naming convention)
combine() in map() to shrink intermediate files
Manage intermediate files as SequenceFiles for efficiency
SequenceFiles as input with custom data types

46 Skip: MapReduce issues
Loading files is fast, but converting to binary may be necessary
Not every analytic task is computed efficiently with MapReduce
Input key: generally OK if high cardinality
Skewed map key distribution: a bottleneck
Key redistribution (a lot of message passing and I/O)

47 Spark: based on MapReduce, but faster and more expressive (Scala, Java)
RDDs: caching data
map(): avoids intermediate map files
reduce(): in RAM
Pipelining, like query optimizers
Scala versus Java: higher abstraction (comparable to SQL?)
Fault tolerance compromised; better in MR
MLlib and GraphX: growing

48 Systems optimizations: SQL vs MR/Spark (optimized versions, same hardware)
Speed to compute a model (rank): UDF = 1, SQL = 2, MR/Spark = 3
Other criteria compared: speed to score the data set; programming flexibility; processing non-tabular data; loading/copy speed; ability to add optimizations; manipulating the data key distribution; immediate processing (push = SQL, pull = MR/Spark)

49 DBMS vs MapReduce/Spark

Criterion                           | SQL/UDF/AFL | MR/Spark
------------------------------------|-------------|---------
1-node sequential open source       | y           |
Parallel open source                | y?          |
Fault tolerant on long jobs         | n           |
Libraries                           | limited     | many
Arrays and matrices                 | good        |
Massive parallelism, large # nodes  | ok          |

50 Conclusions: DBMS future
Parallel DBMSs: expensive, but reliable
Fast data mining algorithms solved? Yes, but not considering that data sets are stored in a DBMS: in-DBMS analytics of relational tabular data
Main-memory query processing; streams
Faster/parallel load/transfer between the DBMS and Hadoop
General speed/programming tradeoffs: horizontal layout (relational, data frames) vs vertical layout (web, RDF, sparse matrices)
Algorithms: one pass versus parallel processing; reduce passes/iterations

51 Hadoop future
New analytic languages; what comes after Spark?
MapReduce is being forgotten
Parallel DBMSs reprogrammed on HDFS: Impala, Shark (open-source DWH)
More processing nodes
Java+Scala provide great programming flexibility
The gap with DBMSs will keep shrinking

52 Extra slides: more technical details

53 Machine learning models
Unsupervised:
  math: simpler
  tasks: clustering, dimensionality reduction, time series
  models: k-means, EM, PCA/SVD, FA, HMM
Supervised:
  math: more tuning/validation than unsupervised
  tasks: classification, regression, time series
  models: decision trees, Naive Bayes, Bayesian classifiers, linear/logistic regression, SVM, neural nets
Good fit for a DBMS: ?

54 Data set: machine learning
Data set with n records (points)
Each record has attributes: numeric, discrete, or both (mixed)
Commonly d dimensions; high d makes the problem mathematically more difficult
Supervised: an additional column G/Y
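In symbols (a standard formalization matching the slide's n, d, G/Y notation):

    X = \{ x_1, \ldots, x_n \}, \qquad x_i \in \mathbb{R}^d

    \text{supervised: each } x_i \text{ carries a label, a discrete class } G \text{ or a numeric target } Y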

55 Why is it hard? Matrices and numerical methods

