Parallel Database Systems

1 Parallel Database Systems
Carlos Ordonez Research Overview 2016

2 Cubes: Statistical Tests
Goal: isolate factors that cause significant changes in a measured value. Example: an increase in age causes an increase in heart disease risk. Combines OLAP with a parametric means-comparison test, used to pair similar groups and determine whether they differ significantly; we want to reject the hypothesis that the two groups have the same mean. Developed a GUI that makes this exploration easy for the user.
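As a rough illustration (not on the slide), the per-group statistics behind such a test can be gathered in one SQL pass; the table heart(age_group, risk) and its columns are hypothetical, and VAR_SAMP is used where the DBMS supports it:

SELECT age_group,
       COUNT(*)       AS n,          -- group size
       AVG(risk)      AS mean_risk,  -- group mean
       VAR_SAMP(risk) AS var_risk    -- group variance
FROM heart
GROUP BY age_group;

With two groups, the test statistic is the difference of means divided by sqrt(var1/n1 + var2/n2), and its p-value decides whether the groups differ significantly.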

3 Cube Statistical Tests
Association rules: a technique used to detect patterns among items of a dataset, e.g. High Age, High Cholesterol => Heart Disease. Comparing results from both techniques: the OLAP statistical test discovered more rules than association rules; the p-value is more reliable than confidence (it considers the pdf); OLAP is affected less by the data distribution than AR. AR is better when performance is the priority and the data is skewed; the OLAP statistical test is better when the data is more evenly distributed.

4 Cubes Statistical Test versus Association Rules
The blue and red lines mark the averages of the two groups; the averages are clearly different from one another, yet confidence suggests the two groups are similar. Many blue points and many red points lie above 50, so confidence is low.

5 Cubes: Exploration with UDF
Zhibo Chen Advisor: Dr. Carlos Ordonez. On-Line Analytical Processing (OLAP): a set of techniques allowing users to explore various aggregations of a dataset. Example: given a dataset with day, month, year, sales, what were average sales for Sundays? Solve by grouping on day and then extracting Sunday. Normally done outside the database or with OLAP servers; we study how to perform the same techniques inside the DBMS (SQL or UDFs). We found that users can efficiently perform OLAP exploration using UDFs.
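A minimal SQL sketch of the Sunday-sales example; the table and column names (sales_fact, day_name, sales) are hypothetical:

-- average sales for the Sunday group of the day dimension
SELECT day_name, AVG(sales) AS avg_sales
FROM sales_fact
WHERE day_name = 'Sunday'
GROUP BY day_name;

Dropping the WHERE clause explores all days at once, which is the cube-style aggregation the slide refers to.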

6 Digital Libraries in a DBMS
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez. Information retrieval techniques have traditionally been exploited outside relational database systems due to storage overhead, the complexity of fitting them into a relational model, and slower performance of SQL implementations. Searching and querying documents under information retrieval models in relational database systems can be performed with optimized SQL. We explore three phases: document preprocessing, document storage, and document retrieval (VSM, OPM, DPLM).
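A hedged sketch of vector-space-model (VSM) retrieval in SQL; the tables tfidf(doc_id, term, w), holding precomputed normalized document term weights, and qterm(term, w), holding the query vector, are assumptions for illustration:

-- dot-product (cosine, given normalized weights) ranking of documents against the query
SELECT t.doc_id, SUM(t.w * q.w) AS score
FROM tfidf t
JOIN qterm q ON t.term = q.term
GROUP BY t.doc_id
ORDER BY score DESC;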

7 Keyword Search Across Documents and Databases
Sometimes the meaning and structure of a database is unknown, but external semi-structured sources can help to describe it. We found that we can link these two worlds to identify relationships between the structured data and the semi-structured data, and we believe the right approach is to do it inside the database. We implemented a prototype entirely in SQL.

8 New trends: Gamma, Graphs

9 Data Summarization with Gamma
Descriptive statistics, correlation, covariance; linear models; Bayesian statistics. Good for any parallel system: parallel DBMS (best for an array DBMS), R, Spark, MapReduce, ScaLAPACK with MPI.

10 New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
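A sketch of the resulting Gamma matrix, assuming Z stacks a row of ones, X and Y (the exact block layout is our reading of the notation on the slide):

\Gamma = Z Z^T =
\begin{bmatrix}
 n & L^T & \sum_i y_i \\
 L & Q & X Y^T \\
 \sum_i y_i & Y X^T & Y Y^T
\end{bmatrix},
\qquad
L = \sum_{i=1}^{n} x_i, \quad
Q = X X^T = \sum_{i=1}^{n} x_i x_i^T .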

11 2-phase algorithm Phase 1: Compute Gamma
Phase 2: iterate exploiting Gamma in intermediate matrix computations
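A minimal sketch of Phase 1 in SQL for d = 2, assuming a hypothetical input table X(x1, x2, y); Phase 2 then iterates in main memory on these aggregates without rescanning the data:

-- all entries of Gamma for two input columns and one output column
SELECT COUNT(*)   AS n,
       SUM(x1)    AS l1,  SUM(x2)    AS l2,
       SUM(x1*x1) AS q11, SUM(x1*x2) AS q12, SUM(x2*x2) AS q22,
       SUM(y)     AS sy,
       SUM(x1*y)  AS q1y, SUM(x2*y)  AS q2y, SUM(y*y)   AS qyy
FROM X;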

12 Focus: array DBMS SciDB: Large matrices beyond RAM size
Storage by row or column is not good enough. Matrices are natural in statistics, engineering and science; multidimensional arrays map to matrices, but they are not the same thing. Parallel shared-nothing is best for big data analytics; closer to DBMS technology, but with some similarity to Hadoop. Feasible to create array operators taking matrices as input and a matrix as output. Combine processing with the R package and LAPACK.

14 Pros: Algorithm evaluation with physical array operators
Since x_i fits in one chunk, joins are avoided (at least 2X I/O with a hash or merge join). Since x_i * x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i. No need to store X twice (X and X^T): half the I/O, half the RAM space. No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments. The operator runs as compiled C++ code: fast; each vector is accessed once; direct assignment (bypassing C++ function calls).

15 System issues and limitations
Gamma is not efficiently computable in AQL or AFL, hence an operator is required. Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double). Points must be stored completely inside a chunk (wide rectangular chunks), which may not be I/O optimal. Slow: arrays must be pre-processed into SciDB load format, loaded into a 1D array and re-dimensioned => optimize the load. Multiple SciDB instances per node improve I/O speed (interleaving CPU). Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins. Dense (alpha) and sparse (beta) versions.

16 Bayesian Statistics
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez. The latest trend in advanced statistics; very demanding: CPU-intensive and large data sets. Applied to microarray data in the DBMS; the problem involves high-dimensional data with few samples. Variable selection is the first issue we have been trying to solve: looking for the best model is computationally expensive (2^d models, where d is the number of dimensions). Applying SQL optimizations and data layout modifications, we obtain selections over > 1 M dimensions in less than 3 seconds, but that is still not enough. Current work: Gibbs sampler variable selection.

17 PCA
Mario Navas Advisor: Dr. Carlos Ordonez. A black-box rotation of the input space that makes the representative components evident: no covariance between attributes, with variance represented by the eigenvalues. Deals with high dimensionality.

18 DB Implementation
Summary matrices n, L, Q; correlation matrix; eigenvalue decomposition problem.
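For reference, the correlation matrix follows from n, L, Q under the standard definitions (this step is implicit on the slide), and its eigenvalue decomposition then yields the principal components:

\rho_{ab} = \frac{n\,Q_{ab} - L_a L_b}{\sqrt{n\,Q_{aa} - L_a^{2}}\;\sqrt{n\,Q_{bb} - L_b^{2}}},
\qquad
L_a = \sum_{i=1}^{n} x_{ia}, \quad
Q_{ab} = \sum_{i=1}^{n} x_{ia} x_{ib} .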

19 Outlier detection in microarray data
Deals with high dimensionality; redundancy is minimized; find distance-based outliers in a reduced space. [Figure panels: PCA-based outliers (2D) versus distance-based outliers (7D); matching top 10.]

20 Bayesian Classification Based On Decomposition via Clustering
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez. An extension of Naïve Bayes: class decomposition of the Gaussians using clustering (K-means and EM). Scalability: query optimizations for computationally and memory-intensive computations. Incremental learning of the classifier.

21 Computing Distance & Sufficient Statistics Using SQL & UDFs
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez. Five different SQL optimizations and one User Defined Function (UDF) to compute Euclidean distance in K-means. Sufficient statistics (count, linear sum and quadratic sum) for multiple clusters and multiple classes computed in a single data set scan using SQL or a UDF.
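A minimal sketch of the distance computation under a hypothetical vertical layout, with points in XV(i, l, val) and centroids in CV(j, l, val), where l indexes the dimension; the five optimizations and the UDF variant are more elaborate than this:

-- squared Euclidean distance between every point i and every centroid j
SELECT x.i, c.j, SUM((x.val - c.val) * (x.val - c.val)) AS dist
FROM XV x
JOIN CV c ON x.l = c.l
GROUP BY x.i, c.j;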

22 Fast Bayesian Classifier Based on FREM
The Algorithm. Initialization: randomly initialize k clusters per class from the data set. E-step: compute the Mahalanobis distance, find the nearest cluster and then compute sufficient statistics. M-step: recompute the means, variances and weights of the clusters per class; the mixture parameters are updated in this step. SplitClusters: split heavy clusters to reach higher-quality solutions and reseed low-weight clusters. The E-step and M-step are iterated until the model converges.

23 Graphs
R = E*E*E…: transitive closure, triangle enumeration. S = S*E: shortest path from a single source, topological sort, connected components, PageRank. Non-linear recursion: clique detection.
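As a rough illustration of the S = S*E pattern, reachability from a single source can be written as a recursive query; the edge table E(i, j, v) and source vertex 1 are assumptions, and shortest paths would instead apply the (min, +) aggregation at each step:

WITH RECURSIVE S(j) AS (
  SELECT 1                               -- hypothetical source vertex
  UNION                                  -- UNION (not UNION ALL) removes duplicates and ensures termination on cyclic graphs
  SELECT E.j FROM S JOIN E ON S.j = E.i  -- one S = S*E step over the boolean semiring
)
SELECT j FROM S;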

24 Frequent Subgraph Mining
Kai Zhao Advisor: Dr. Carlos Ordonez. A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold. [Figure: example graphs (A), (B), (C) and the frequent patterns for minimum support 2.]

25 Stonebraker: One size does not fit all!
Storage in a DBMS: row: OLTP, point queries, cubes; column: cube queries, ad-hoc queries; array: math, science. Other: stream: one pass, in-RAM; MMDB: OLTP; Hadoop/NoSQL: yawn (but evolving).

26 DBMS storage elevator story row | column | array
Row: old, single file, block, B-trees/hash, hash horizontal partitioning Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!, hash-segment Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM; still hash but on chunk

27 Join: hash versus sort-merge Goal: O(N)
Main computation: the join. Join optimization: column: projection = {unordered, ordered values}; row: unordered, ordered versus index; array: default = {ordered, indexed}, choice = {sparse, dense}.

28 Projection: duplicate elimination, aggregation
Reachability (binary edges); shortest/longest path; count of # paths; length versus weight/cost.

29 Selection: reduce |Rd|, preserving correctness

30 Example: Directed Graph
[Figure: example directed graph with weighted edges.]

31 Graph Algorithms
Main idea: these algorithms can be expressed as a sequence of vector-matrix multiplications. How can they work in a relational database?

32 Graph algorithms over a semi-ring:

33 Algorithm Pattern:

34 Algorithm Pattern:

35 Example: Vector-Matrix Multiplication with SQL queries
Vector-matrix multiplication, (+, *) semiring:
SELECT E.j, SUM(S.v * E.v) FROM Sd-1 AS S JOIN E ON S.j = E.i GROUP BY E.j
Vector-matrix multiplication, (min, +) semiring:
SELECT E.j, MIN(S.v + E.v) FROM Sd-1 AS S JOIN E ON S.j = E.i GROUP BY E.j
In general:
SELECT E.j, g(S.v ⊕ E.v) FROM Sd-1 AS S JOIN E ON S.j = E.i GROUP BY E.j
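For reference, a sketch of the table layouts these queries assume (the exact schemas are an assumption): E holds one weighted edge per row and Sd-1 holds the current vertex vector, so each query produces the next vector Sd.

CREATE TABLE E  (i INT, j INT, v DOUBLE PRECISION);  -- weighted edges i -> j
CREATE TABLE S0 (j INT, v DOUBLE PRECISION);         -- initial vector, e.g. the source vertex only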

36 Algorithm Pattern:

37 Unified Algorithm
Input: E, S0, R0, f(), g(), ⨂, ε, unionFlag. Optional input: s. Output: Rd.

