Nguyen Ngoc Tuan – Le Nguyen Duy Vu, 11/24/2010
1. Introduction to Data Warehouses and OLAP systems
2. Security problem description and related work
3. Classify Security Threats & Identify Security Requirements
4. Solution: Three-tier Security Architecture
5. Conclusion
Introduction to Data Warehouses and OLAP systems
A data warehouse (DW) is a decision support database that is maintained separately from the organization’s operational database. “A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Subject-oriented: the DW is organized around major subjects such as customer, supplier, product, and sales.
Integrated: the DW is usually constructed by integrating multiple heterogeneous sources such as relational databases, flat files, and on-line transaction records.
Time-variant: data is stored to provide information from a historical perspective (e.g., the past 5-10 years).
Nonvolatile: the DW is always a physically separate store of data transformed from the operational data.
Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical processing: supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting.
Data mining
Data cubes (a.k.a. hypercubes or OLAP cubes) are multidimensional matrices. They are the data model of data warehouses and OLAP, and they let analysts query the data from different perspectives.
Example of a 3-dimensional data cube model, with the three dimensions Product, Location, and Year.
Two types of tables: dimension tables and fact tables.
Three types of schema:
Star schema.
Snowflake schema: the dimension tables of a star schema are organized into a hierarchy by normalization.
Fact constellation: a set of fact tables that share some dimension tables.
Star schema (figure)
Snowflake schema (figure)
Fact constellation (figure)
OLAP (On-Line Analytical Processing): a decision support system that enables analysts to construct a mental image of the underlying data (collected from the data warehouse) by exploring it from different perspectives, at different levels of generalization, and in an interactive manner.
OLAP provides a user-friendly environment for interactive data analysis.
Roll-up (a.k.a. drill-up): performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Drill-down (the reverse of roll-up): navigates from less detailed data to more detailed data.
Slice and dice: slice performs a selection on one dimension of the given cube, resulting in a subcube; dice performs a selection on two or more dimensions.
Pivot (rotate): a visualization operation that rotates the data axes to provide an alternative presentation of the data.
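The operations above can be sketched on a tiny in-memory cube. This is a minimal illustration, not how an OLAP engine is implemented: the cube contents, the dimension names, and the helper names (`roll_up`, `slice_cube`, `dice`, `pivot`) are all assumptions made for the example. Roll-up is shown only in its dimension-reduction form.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, location, year) -> units sold.
DIMS = ("product", "location", "year")
cube = {
    ("TV", "Hanoi", 2009): 10,
    ("TV", "Hue", 2009): 4,
    ("TV", "Hanoi", 2010): 12,
    ("PC", "Hanoi", 2009): 7,
    ("PC", "Hue", 2010): 5,
}

def roll_up(cube, dims, drop):
    """Roll-up by dimension reduction: sum out the dimension named `drop`."""
    i = dims.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def slice_cube(cube, dims, dim, value):
    """Slice: select a single value on one dimension."""
    i = dims.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, dims, selections):
    """Dice: select allowed values on two or more dimensions."""
    wanted = {dims.index(d): allowed for d, allowed in selections.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in wanted.items())}

def pivot(cube, dims, new_order):
    """Pivot: reorder the axes without changing any value."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}
```

Drill-down has no function of its own here because it needs the more detailed data to be available; in this sketch it simply means going back from `roll_up(cube, DIMS, "year")` to `cube`.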
Security problem description and related work
The threat comes from insiders who have legitimate access to data through OLAP queries.
Access control techniques are not directly applicable because of the difference in data models.
Protected data can be inferred indirectly.
Inference control is absent in most commercial OLAP systems.
Restriction-based methods:
Cell suppression: hide cells that contain small COUNT values, then detect possible inferences related to these cells and remove them using linear programming.
Partitioning: define a partition on the sensitive data and restrict queries to aggregate only complete blocks of the partition.
Micro-aggregation: replace clusters of sensitive data with their averages.
Perturbation-based techniques: add random noise to the data.
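Two of the methods listed above are simple enough to sketch. The code below is an illustration under stated assumptions, not the algorithms from the literature: `suppress_small_counts` performs only the first step of cell suppression (hiding small COUNT cells) and omits the linear-programming step, and `micro_aggregate` uses fixed-size clusters of sorted neighbours, which is just one possible clustering. Both function names and the threshold parameters are hypothetical.

```python
def suppress_small_counts(counts, threshold):
    """Hide (set to None) every cell whose COUNT is below `threshold`.

    Only the primary suppression step; detecting and removing the
    remaining inferences is not attempted here.
    """
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

def micro_aggregate(values, k):
    """Replace each cluster of at least k sorted neighbours with its mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        # Fold a short tail into the last cluster so no cluster is smaller than k.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result
```

For example, `micro_aggregate([10, 12, 50, 52, 54], 2)` groups the two smallest values and the three largest values, so individual values are no longer visible but the overall sum is preserved.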
Classify Security Threats & Identify Security Requirements
In OLAP systems, sensitive data can be inferred from the answers to legitimate queries. There are two kinds of inference:
One-dimensional inference (1-d inference): a cell is inferred using a single one of its descendants.
Multi-dimensional inference (m-d inference): a cell is inferred using two or more of its descendants, none of which causes a 1-d inference on its own.
Examples follow.
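1-d inference:
The adversary is prohibited from accessing a cuboid but is allowed to access one of its descendants.
Suppose the adversary knows which cells are empty and knows that Bob and Alice take the same amount of commission in Q3.
The adversary then infers that Bob's and Alice's Q3 commissions are each 5500, half of the Q3 total answered by the descendant.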
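M-d inference with SUM:
The adversary is prohibited from accessing a cuboid but is allowed to access its descendants, and is assumed to know which cells are empty.
A protected cell is then inferred as (a + b) – (c + d), where a, b, c, and d stand for SUM answers taken from the accessible descendants.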
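The shape of a SUM-based m-d inference can be illustrated with made-up numbers. Everything in the sketch below is hypothetical (the employees, the quarters, the commission values, and the pattern of empty cells); it is not the table from the original slides. The adversary is assumed to see only per-employee and per-quarter SUMs and to know which cells are empty.

```python
# Hypothetical core cuboid; any (employee, quarter) pair not listed is an
# empty cell, and the adversary knows which cells are empty.
core = {
    ("Bob", "Q1"): 1500, ("Bob", "Q2"): 2000,
    ("Alice", "Q1"): 1000, ("Alice", "Q2"): 2500, ("Alice", "Q3"): 3000,
    ("Jim", "Q3"): 1200, ("Jim", "Q4"): 1800,
    ("Mallory", "Q3"): 900, ("Mallory", "Q4"): 2100,
}

def sum_by_employee(employee):
    """SUM answer from the accessible per-employee aggregate."""
    return sum(v for (e, _), v in core.items() if e == employee)

def sum_by_quarter(quarter):
    """SUM answer from the accessible per-quarter aggregate."""
    return sum(v for (_, q), v in core.items() if q == quarter)

def nonempty_per_aggregate():
    """Number of non-empty cells under each accessible aggregate."""
    counts = {}
    for employee, quarter in core:
        counts[employee] = counts.get(employee, 0) + 1
        counts[quarter] = counts.get(quarter, 0) + 1
    return counts

# Every aggregate covers at least two non-empty cells, so no single
# aggregate reveals a cell directly (no 1-d inference).
no_1d_inference = all(n >= 2 for n in nonempty_per_aggregate().values())

# Bob and Alice only have cells in Q1..Q3, and Q1 and Q2 only contain cells
# of Bob and Alice, so the difference isolates Alice's Q3 commission.
inferred_alice_q3 = (sum_by_employee("Bob") + sum_by_employee("Alice")) \
    - (sum_by_quarter("Q1") + sum_by_quarter("Q2"))
```

With these numbers the difference is (3500 + 6500) – (2500 + 4500) = 3000, which is exactly the protected cell for Alice in Q3, even though no single aggregate discloses it.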
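M-d inference with MAX:
The adversary is prohibited from accessing a cuboid but is allowed to access its descendants.
The adversary knows that one aggregate has MAX = 6400 and another has MAX = 6000, so a cell covered by the second aggregate cannot equal 6400.
Similarly, further cells are ruled out as not equal to 6400.
Conclusion: the one remaining cell must equal 6400.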
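M-d inference with SUM, MAX & MIN:
The assumptions are the same as in the examples above, and the adversary can ask queries using SUM, MAX, and MIN.
Having obtained one cell equal to 6400, the adversary combines MAX = 6400, MIN = 6000, and the SUM of the same aggregate to determine that three further cells hold the values {6000, 6000, 0}.
Continuing with MAX, MIN, and SUM on other aggregates yields further cells.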
A security solution for OLAP systems should combine access control and inference control, and achieve a balance among the following objectives:
Security: protection from both unauthorized access and malicious inferences.
Applicability: coverage of a wide range of scenarios without the need for significant modifications.
Efficiency.
Availability.
Practicality.
Solution: Three-tier Security Architecture
Statistical databases use a two-tier architecture (sensitive data, aggregation queries). Applying this architecture to OLAP has some drawbacks:
Unacceptable delay in query processing.
Inference control methods cannot take advantage of the special characteristics of an OLAP application.
Three tiers: query tier, aggregation tier, and data tier
The aggregation tier must satisfy three properties:
It is secure with respect to the data tier, which is enforced by inference control.
Its size is comparable with that of the data tier.
The problem of inference control can be partitioned into blocks in the data tier and the aggregation tier, so that security only needs to be ensured between each corresponding pair of blocks in the two tiers.
The three-tier architecture reduces the performance overhead of inference control:
The aggregation tier can be pre-computed, so the computation-intensive part of inference control is shifted to offline processing.
The inputs to inference control algorithms are smaller, which reduces complexity.
Inference control tasks are localized to each block of data, so a failure in one block does not affect other blocks.
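The offline/online split can be sketched as follows. This is only an illustration of the data flow between the three tiers: the block layout, the function names, and in particular the `at_least_two_cells` check are assumptions made for the example. The check is a placeholder standing in for a real inference control method.

```python
def at_least_two_cells(values):
    """Placeholder inference check (hypothetical): reject single-cell blocks."""
    return len(values) >= 2

def build_aggregation_tier(data_tier, blocks, is_safe):
    """Offline step: pre-compute one SUM per block that passes the check."""
    tier = {}
    for name, cells in blocks.items():
        values = [data_tier[c] for c in cells if c in data_tier]
        # The check runs per block, so rejecting one block affects no other.
        if is_safe(values):
            tier[name] = sum(values)
    return tier

def answer_query(aggregation_tier, block_name):
    """Online step: the query tier reads only pre-computed aggregates."""
    return aggregation_tier.get(block_name)  # None means the query is refused

# Hypothetical data tier and block partition.
data_tier = {("Bob", "Q1"): 100, ("Alice", "Q1"): 200, ("Bob", "Q2"): 300}
blocks = {
    "Q1": [("Bob", "Q1"), ("Alice", "Q1")],
    "Q2": [("Bob", "Q2"), ("Alice", "Q2")],  # Alice's Q2 cell is empty
}
aggregation_tier = build_aggregation_tier(data_tier, blocks, at_least_two_cells)
```

Here `answer_query(aggregation_tier, "Q1")` is served from the pre-computed tier, while the `"Q2"` block was rejected offline because it would expose a single cell, so the online step never touches the data tier.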
Cardinality-based method
Detects inferences based on the number of answered queries.
We consider a one-level dimension hierarchy, in which each dimension has only two attributes; the data cube then consists of the core cuboid and its descendants.
Cardinality-based method: existence of 1-d inferences and the number of empty cells
$k$ is the number of dimensions and $d_{max}$ is the greatest domain size among all dimensions.
Figure: an axis showing the number of empty cells, marked at 0 and $2^{k-1} \cdot d_{max}$, with one region labeled "Free of 1-d inference" and one labeled "Always have 1-d inference".
Cardinality-based method: existence of m-d inferences and the number of empty cells
A cuboid with no empty cells is free of m-d inferences.
Theorem: let $C_c$ be the core cuboid and $C_{all}$ the collection of all aggregation cuboids. The $i$-th attribute of $C_c$ has $d_i$ values, $d_u$ and $d_v$ are the two smallest among the $d_i$, and $w$ is the number of empty cells of $C_c$. Then:
$C_c$ is free of m-d inferences if $w < 2(d_u - 4) + 2(d_v - 4) - 1$, with $d_i \ge 4$ for all $1 \le i \le k$.
$C_c$ has m-d inferences if $w \ge 2(d_u - 4) + 2(d_v - 4)$.
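The sufficient condition in the theorem is cheap to evaluate, which is what makes the cardinality-based method attractive. The sketch below simply encodes the bound as stated on this slide, together with the statement that a cuboid with no empty cells is free of m-d inferences; the function names are hypothetical, and a `False` result only means the condition does not apply, not that an inference exists.

```python
def md_free_bound(domain_sizes):
    """Return 2(d_u - 4) + 2(d_v - 4) - 1 for the two smallest domain sizes."""
    if len(domain_sizes) < 2:
        raise ValueError("need at least two dimensions")
    if min(domain_sizes) < 4:
        raise ValueError("the theorem assumes d_i >= 4 for every dimension")
    d_u, d_v = sorted(domain_sizes)[:2]
    return 2 * (d_u - 4) + 2 * (d_v - 4) - 1

def guaranteed_md_free(domain_sizes, empty_cells):
    """True if the slide's conditions guarantee freedom from m-d inferences."""
    if empty_cells < 0:
        raise ValueError("empty_cells must be non-negative")
    # No empty cells at all is stated separately as a safe case.
    return empty_cells == 0 or empty_cells < md_free_bound(domain_sizes)
```

For a cube with domain sizes 5, 6, and 7 the two smallest are 5 and 6, so the bound is 2(1) + 2(2) - 1 = 5, and up to four empty cells are covered by the guarantee.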
Parity-based method
Based on the simple fact that the even numbers are closed under addition and subtraction.
The nature of an m-d inference is to keep adding (or subtracting) sets of cells until the result yields a single cell.
We consider multi-dimensional range (MDR) queries. Combining MDR queries is an operation of addition (or subtraction) on the sets of cells they cover.
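We write an MDR query as the sum of the cells it covers, for example:
q*( , ) = x1 + x2 + x3 + x4 + x5 + x6
q*( , ) = x1 + x2
…
If MDR queries are restricted to include only an even number of cells, inferences become hard to obtain (though perhaps not impossible).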
Parity-based method: inference from even MDR queries
q*( , ) = x1 + x2
q*( , ) = x4 + x5
q*( , ) = x5 + x6
q*( , ) = x3 + x5
q*( , ) = x1 + x2 + x3 + x4 + x5 + x6 = 6500
Adding the first four answers and subtracting the fifth leaves x5 + x5 = 1000, so x5 = 500.
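The inference above can be replayed numerically. The individual cell values below are hypothetical; they are chosen only so that the total is 6500 and x5 is 500, which is the arithmetic visible on the slide. The point is that every query covers an even number of cells, yet their combination still isolates one cell.

```python
# Hypothetical cell values (sum 6500, x5 = 500); the adversary never sees these.
cells = {"x1": 1000, "x2": 1500, "x3": 1200, "x4": 800, "x5": 500, "x6": 1500}

def q(*names):
    """Answer to a SUM query over the named cells."""
    return sum(cells[n] for n in names)

# Each query covers an even number of cells (2, 2, 2, 2, and 6).
even_queries = [("x1", "x2"), ("x4", "x5"), ("x5", "x6"), ("x3", "x5"),
                ("x1", "x2", "x3", "x4", "x5", "x6")]
all_even = all(len(names) % 2 == 0 for names in even_queries)

pair_answers = [q(*names) for names in even_queries[:4]]
total = q(*even_queries[4])

# The four pair queries cover every cell once, except x5, which is covered
# three times, so subtracting the total leaves 2 * x5.
twice_x5 = sum(pair_answers) - total
inferred_x5 = twice_x5 // 2
```

So restricting queries to even sizes is not sufficient by itself; the collection of answered queries still has to be checked for inferences.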
Parity-based method
Derivability: a set of queries Q1 is derivable from another set Q2 if the answers to Q1 can be computed from the answers to Q2. In that case, Q1 is free of inferences if Q2 is.
Find another collection of even MDR queries Q_p that is equivalent to Q* and whose inferences are easier to detect.
Then represent Q_p as an undirected simple graph G(C_c, Q_p).
Finally, check whether G is a bipartite graph (a graph with no cycle composed of an odd number of edges).
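The final bipartiteness check is a standard graph routine. Below is a minimal sketch using breadth-first two-coloring; treating cells as vertices and two-cell queries as edges is an assumption about how G(C_c, Q_p) is built, and the function name is hypothetical.

```python
from collections import deque

def is_bipartite(vertices, edges):
    """Return True if the undirected graph has no odd-length cycle."""
    adjacency = {v: [] for v in vertices}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    color = {}
    for start in vertices:
        if start in color:
            continue  # already reached from an earlier component
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    return False  # same color on both ends: odd cycle
    return True
```

As an intuition for why odd cycles matter: with the three pair queries x1 + x2, x2 + x3, and x1 + x3 (a triangle), x1 follows as half of (x1 + x2) + (x1 + x3) - (x2 + x3), whereas a 4-cycle of pair queries admits no such combination.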
Approach: detecting inferences caused by queries involving both MAX and SUM is intractable. The method therefore does not detect inferences directly; instead it first prevents m-d inferences and then removes 1-d inferences.
Access control: define two functions.
Below() partitions the data cube along the dependency lattice.
Slice() partitions the data cube along the dimensions.
An object is the intersection of the two partitions above.
Access control example: an employee's yearly or more detailed commission is sensitive, and this requirement applies only to first-year data. It is specified as Object(L, S), where L selects the sensitive cuboids and S includes all cells in the first four quarters.
Lattice-based inference control
Given two sets of cells S and T, for any cell c in S we say:
c is redundant with respect to T if S includes both c and c's ancestors.
c is non-comparable to T if T contains no cell c' such that c is an ancestor or a descendant of c'.
Reducible inference: for such a cell c, it suffices to check whether S – {c} causes any inference to T.
Example: we want to protect Object(L, S), where S is the complete cuboid (that is, no slice) and L is a set of two cuboids.
Lattice-based inference control
More generally, as long as a cuboid c_r satisfies that all of its ancestors are included in T (under the LOWER curve), the descendant closure of c_r is the maximal result for preventing m-d inferences.
After m-d inferences are prevented, remove the 1-d inferences, then control m-d inferences with respect to the resulting new object.
Repeat these two steps until all 1-d inferences are removed.
The final result is a set of cells that is guaranteed to be free of inferences to the object.
Implementing the lattice-based inference control method in the three-tier architecture:
The authorization object computed through the iterative process above comprises the data tier.
The complement of the object is the aggregation tier, since it does not cause any inferences to the data tier.
Conclusion
The most challenging security threat in data warehouse and OLAP systems is that data stored in the data warehouse may be disclosed through seemingly innocent OLAP queries.
Two main inference threats should be considered: 1-d inference and m-d inference.
We presented three methods to prevent or remove inferences: the cardinality-based method, the parity-based method, and lattice-based inference control.
All of these methods are applicable to the three-tier inference control architecture, which especially suits OLAP systems.
Reference: Lingyu Wang and Sushil Jajodia. Security in Data Warehouses and OLAP Systems.