
1 Database Systems Group Research Overview 2010

2 OLAP Statistical Tests
Goal: Isolate factors that cause significant changes in a measured value
– Ex: Increase in age causes increase in risk for heart disease
Combined OLAP with Means Comparison Parametric Test
– Used to pair similar groups and determine if they are significantly different
– Want to reject the hypothesis that the two groups have the same mean
Developed a GUI for easy user interaction
Zhibo Chen Advisor: Dr. Carlos Ordonez
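A minimal sketch of a means comparison between two groups, assuming a large-sample z approximation (the slides do not name the exact parametric test). The function name and the toy risk scores are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_z_test(group_a, group_b):
    """Return (z, two-sided p-value) for H0: both groups have the same mean."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference of means (variances may differ).
    se = math.sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    z = (mean(group_a) - mean(group_b)) / se
    # Two-sided tail probability under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical risk scores for two groups paired by an OLAP aggregation.
low_age = [10, 12, 11, 13, 12, 11, 10, 12]
high_age = [15, 17, 16, 18, 16, 17, 15, 16]

z, p = two_sample_z_test(low_age, high_age)
reject_same_mean = p < 0.05  # significance level chosen for illustration
```

A small p-value is evidence against the hypothesis that the two groups share the same mean; with small samples a t distribution would be the more appropriate reference.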

3 OLAP Statistical Tests
Association Rules
– Technique used to detect patterns within items of a dataset
– High Age, High Cholesterol => Heart Disease
Compare results from both techniques
OLAP Statistical Test discovered more rules than Association Rules
– p-value is more reliable than confidence (considers pdf)
– OLAP affected less by distribution than AR
AR better when performance is priority and data is skewed
OLAP Statistical Test better when data is distributed
Zhibo Chen Advisor: Dr. Carlos Ordonez
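For reference, the support and confidence of an association rule can be computed as in the sketch below. The patient records and the function name are made up for illustration and are not taken from the presentation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Records containing every antecedent item.
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    # Records containing antecedent and consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical patient records encoded as item sets.
patients = [
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol"},
    {"HighAge"},
    {"HeartDisease"},
]

support, confidence = rule_metrics(
    patients, {"HighAge", "HighCholesterol"}, {"HeartDisease"}
)
```

Confidence is only a ratio of counts, which is why the slide contrasts it with a p-value that takes the distribution of the measured value into account.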

4 OLAP Statistical Test versus Association Rules
Blue and red lines represent location of the averages of the two groups
– Averages are fairly different from one another
Confidence says that the two groups are similar
– Many blue points above 50
– Many red points above 50
– Confidence is low
Zhibo Chen Advisor: Dr. Carlos Ordonez

5 OLAP Exploration with UDF
On-Line Analytical Processing (OLAP)
– Set of techniques allowing users to explore various aggregations of a dataset
– Ex: dataset with day, month, year, sales
What were average sales for Sundays?
Solve by grouping on day and then extracting Sunday
Normally done outside the database or with OLAP servers
– We want to study how to perform the same techniques inside the DBMS (SQL or UDF)
– Found that users can efficiently perform OLAP exploration using UDFs
Zhibo Chen Advisor: Dr. Carlos Ordonez
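The Sunday example can be sketched inside a database with a plain GROUP BY. This uses Python's built-in sqlite3 module as a stand-in DBMS; the table name, column names, and sales figures are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (day TEXT, month INTEGER, year INTEGER, amount REAL)"
)
# Hypothetical sales rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Sunday", 1, 2010, 100.0),
        ("Sunday", 2, 2010, 140.0),
        ("Monday", 1, 2010, 80.0),
        ("Monday", 2, 2010, 60.0),
    ],
)

# Group on day, then extract the Sunday aggregate.
avg_by_day = dict(
    conn.execute("SELECT day, AVG(amount) FROM sales GROUP BY day")
)
sunday_avg = avg_by_day["Sunday"]
```

Grouping on other column combinations (month, year, or day and month together) gives the other aggregations a user would explore.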

6 Digital Libraries in a DBMS
Information retrieval techniques have been traditionally exploited outside relational database systems due to storage overhead, the complexity of fitting them to a relational model, and slower performance in SQL implementations.
Searching and querying documents under information retrieval models in relational database systems can be performed with optimized SQL.
We explore three phases:
– Document preprocessing
– Document storage
– Document retrieval (VSM, OPM, DPLM)
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez
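As an illustration of retrieval under the vector space model (VSM), the sketch below ranks documents by TF-IDF cosine similarity. It is written in plain Python rather than the optimized SQL the slide refers to, and the smoothed IDF formula, the function name, and the sample documents are assumptions.

```python
import math
from collections import Counter


def rank_documents(docs, query):
    """Return (score, doc_index) pairs sorted by descending cosine similarity."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency of each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one of several common variants).
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}

    def to_vector(tokens):
        tf = Counter(tokens)
        # Terms unseen in the collection are ignored.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = to_vector(query.lower().split())
    scores = [
        (cosine(q_vec, to_vector(tokens)), i)
        for i, tokens in enumerate(tokenized)
    ]
    return sorted(scores, reverse=True)


# Hypothetical document collection.
docs = [
    "sql query optimization in databases",
    "heart disease risk factors",
    "document retrieval with sql",
]
ranking = rank_documents(docs, "sql retrieval")
best_doc = ranking[0][1]
```

In a relational setting the term weights would typically live in a table keyed by document and term, so that the dot product becomes a join followed by a SUM.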

7 Keyword Search Across Documents and Databases
Sometimes the meaning and structure of a database is unknown.
There are external semi-structured sources that can help to describe it.
We found that we can link these two worlds to identify relationships between the structured data and the semi-structured data.
We believe that the right approach is to do it inside the database.
We implemented a prototype entirely in SQL.
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez

8 Bayesian Statistics
Latest trend in advanced statistics; very demanding: CPU and large data sets
Applied to microarray data in the DBMS. The problem involves high-dimensional data with few samples.
Variable selection is the first issue that we have been trying to solve. Looking for the best model is computationally expensive (2^d candidate models, where d is the number of dimensions).
Applying SQL optimizations and data layout modifications, we obtain selections over > 1 M dimensions in less than 3 seconds, but this is still not enough.
Current work: Gibbs Sampler Variable Selection.
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez

9 PCA
Black-box
Rotation of the input space
Make the representative components evident
No covariance between attributes
Variance represented by the eigenvalues
Deal with high dimensionality
Mario Navas Advisor: Dr. Carlos Ordonez

10 DB Implementation
Summary matrices n, L, Q
Correlation matrix
Eigenvalue decomposition problem
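A sketch of how a correlation matrix can be derived from the summary matrices, assuming n is the row count, L the vector of per-attribute sums, and Q the d x d matrix of sums of cross-products (the slide does not spell out these definitions). The function names and data are illustrative.

```python
import math


def summary_matrices(rows):
    """Compute n, L, Q in a single pass over the rows."""
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def correlation_matrix(n, L, Q):
    """Pearson correlations from n, L, Q (assumes no constant attribute)."""
    d = len(L)
    rho = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            num = n * Q[a][b] - L[a] * L[b]
            den = math.sqrt(n * Q[a][a] - L[a] ** 2) * math.sqrt(
                n * Q[b][b] - L[b] ** 2
            )
            rho[a][b] = num / den
    return rho


# Hypothetical two-attribute data set with a near-linear relationship.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
n, L, Q = summary_matrices(rows)
rho = correlation_matrix(n, L, Q)
```

Once the correlation matrix is available, PCA reduces to an eigenvalue decomposition of a d x d matrix, which no longer depends on the number of rows.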

11 Outlier Detection in Microarray Data
Deal with high dimensionality
Redundancy minimized
Find distance-based outliers in a reduced space
Figure panels: PCA-based Outliers [2D], Distance-based Outliers [7D]; PCA-based Outliers [2D], Distance-based Outliers [126]
Matching top 10

12 Bayesian Classification Based On Decomposition via Clustering
An Extension Of Naïve Bayes.
Class Decomposition of the Gaussians Using Clustering
Using K-Means and E-M
Scalability - Query Optimizations for Computationally and Memory Intensive Computations
Incremental Learning of the Classifier
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez

13 Computing Distance & Sufficient Statistics Using SQL & UDFs
Five different SQL optimizations and one User Defined Function (UDF) to compute Euclidean distance in K-Means
Sufficient Statistics
– Count, Linear Sum and Quadratic Sum for multiple clusters and multiple classes computed in a single data set scan
Using SQL (or) UDF.
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez
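The single-scan computation of count, linear sum, and quadratic sum per class and cluster might look like the following aggregate query. This is a sketch using Python's built-in sqlite3 module; the table layout, column names, and rows are assumptions, and it presumes each point has already been assigned to a cluster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assigned (label TEXT, cluster_id INTEGER, x1 REAL, x2 REAL)"
)
# Hypothetical points, each already assigned to a cluster within its class.
conn.executemany(
    "INSERT INTO assigned VALUES (?, ?, ?, ?)",
    [
        ("pos", 0, 0.0, 0.0),
        ("pos", 0, 0.0, 2.0),
        ("pos", 1, 10.0, 10.0),
        ("neg", 0, 5.0, 5.0),
    ],
)

# One scan yields N (count), L (linear sums) and Q (quadratic sums).
query = """
SELECT label, cluster_id,
       COUNT(*)     AS n,
       SUM(x1)      AS l1, SUM(x2)      AS l2,
       SUM(x1 * x1) AS q1, SUM(x2 * x2) AS q2
FROM assigned
GROUP BY label, cluster_id
ORDER BY label, cluster_id
"""
stats = conn.execute(query).fetchall()
```

Per-cluster means follow as L / N and per-dimension variances as Q / N - (L / N)^2, so the data set does not need to be read again to update the model.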

14 Fast Bayesian Classifier Based on FREM
The Algorithm
– Initialization: Randomly initialize k clusters per class from the data set.
– E-step: Compute Mahalanobis distance, find the nearest cluster and then compute sufficient statistics.
– M-step: Recompute the means, variances and weights of the clusters per class. Mixture parameters are updated in this step.
– SplitClusters: Split heavy clusters to reach higher quality solutions and reseed low-weight clusters.
– The E-step and M-step are iterated until the model converges.
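One E-step/M-step pass for a single class can be sketched as below. This is a simplified illustration, not the FREM implementation: it assumes diagonal covariances and hard assignment to the nearest cluster, omits initialization and SplitClusters, and all names and data are hypothetical.

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def em_iteration(points, means, variances, min_var=1e-6):
    """Run one E-step and one M-step; return (means, variances, weights)."""
    k, d = len(means), len(means[0])
    N = [0] * k
    L = [[0.0] * d for _ in range(k)]
    Q = [[0.0] * d for _ in range(k)]
    # E-step: assign each point to its nearest cluster and
    # accumulate the sufficient statistics N, L, Q.
    for x in points:
        j = min(
            range(k),
            key=lambda c: mahalanobis_sq(x, means[c], variances[c]),
        )
        N[j] += 1
        for a in range(d):
            L[j][a] += x[a]
            Q[j][a] += x[a] * x[a]
    # M-step: recompute means, variances and weights from N, L, Q.
    new_means, new_vars, weights = [], [], []
    for j in range(k):
        if N[j] == 0:
            # Empty cluster: keep old parameters (a candidate for reseeding).
            new_means.append(list(means[j]))
            new_vars.append(list(variances[j]))
            weights.append(0.0)
            continue
        m = [L[j][a] / N[j] for a in range(d)]
        # Floor the variance so later distance computations stay defined.
        v = [max(Q[j][a] / N[j] - m[a] ** 2, min_var) for a in range(d)]
        new_means.append(m)
        new_vars.append(v)
        weights.append(N[j] / len(points))
    return new_means, new_vars, weights


# Hypothetical points of one class and two initial clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
means, variances, weights = em_iteration(
    points, [[1.0, 1.0], [9.0, 9.0]], [[1.0, 1.0], [1.0, 1.0]]
)
```

Repeating `em_iteration` until the means stop changing corresponds to the convergence loop in the last bullet.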

15 Constrained Association Rules in SQL
Association rules are a data mining technique used to discover frequent patterns in a data set.
Real-world applications of this technique are broad and include fields such as medicine and commerce.
We can automatically generate efficient SQL queries for discovering association rules
Kai Zhao Advisor: Dr. Carlos Ordonez
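To illustrate the idea of generating SQL automatically, the sketch below builds a self-join query that counts the support of k-itemsets. It is not the query generator from the presentation: the `baskets(tid, item)` schema, the function name, and the data are assumptions, (tid, item) pairs are assumed unique, and rule constraints are not covered.

```python
import sqlite3


def frequent_itemset_sql(k, table="baskets"):
    """Build a query returning k-itemsets with support >= a bound parameter.

    The table name is interpolated directly, so it must be a trusted identifier.
    """
    aliases = [f"t{i}" for i in range(1, k + 1)]
    select_cols = ", ".join(
        f"{a}.item AS item{i}" for i, a in enumerate(aliases, 1)
    )
    from_clause = ", ".join(f"{table} {a}" for a in aliases)
    conditions = []
    for prev, cur in zip(aliases, aliases[1:]):
        conditions.append(f"{prev}.tid = {cur}.tid")    # same transaction
        conditions.append(f"{prev}.item < {cur}.item")  # avoid permutations
    where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
    group_cols = ", ".join(f"{a}.item" for a in aliases)
    return (
        f"SELECT {select_cols}, COUNT(*) AS support "
        f"FROM {from_clause}{where} "
        f"GROUP BY {group_cols} HAVING COUNT(*) >= ? "
        f"ORDER BY {group_cols}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
# Hypothetical transactions: 1={A,B,C}, 2={A,B}, 3={A,C}, 4={B}.
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"),
     (3, "A"), (3, "C"), (4, "B")],
)

min_support = 2
frequent_pairs = conn.execute(
    frequent_itemset_sql(2), (min_support,)
).fetchall()
```

Rules are then derived from the frequent itemsets by dividing the support of an itemset by the support of its antecedent.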

16 Comparison between CAR and DT (Decision Trees)
CAR perform an exhaustive combinatorial search whereas DT recursively partition the input attribute space.
CAR aim to find all rules above the given thresholds whereas DT find regions in space where most records belong to the same class.
CAR analyze item combinations whereas DT select only one input attribute at a time.
Kai Zhao Advisor: Dr. Carlos Ordonez

17 Frequent Subgraph Mining
Frequent subgraph
– A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Figure: Frequent patterns (min support is 2), panels (A), (B), (C)
Kai Zhao Advisor: Dr. Carlos Ordonez
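The support definition can be made concrete with a brute-force sketch for very small graphs. It assumes undirected graphs with labeled vertices, counts support as the number of dataset graphs that contain the pattern, and tries every vertex mapping, so it illustrates the definition only and is not a practical mining algorithm. The graph encoding and all names are hypothetical.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if the pattern occurs as a subgraph of graph (label-preserving)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    g_edge_set = {frozenset(e) for e in g_edges}
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(list(g_labels), len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(
            frozenset((m[u], m[v])) in g_edge_set for u, v in p_edges
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(1 for g in dataset if contains(g, pattern))


# Each graph is (vertex -> label, list of undirected edges).
dataset = [
    ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    ({1: "A", 2: "B"}, [(1, 2)]),
    ({1: "A", 2: "C"}, [(1, 2)]),
]
pattern_ab = ({0: "A", 1: "B"}, [(0, 1)])
pattern_bc = ({0: "B", 1: "C"}, [(0, 1)])

min_support = 2
ab_is_frequent = support(dataset, pattern_ab) >= min_support
bc_is_frequent = support(dataset, pattern_bc) >= min_support
```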

