Presentation is loading. Please wait.

Presentation is loading. Please wait.

Implementação do DW. SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações.

Similar presentations


Presentation on theme: "Implementação do DW. SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações."— Presentation transcript:

1 Implementação do DW

2 SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações eficientes Materialização de vistas para informação agregada: Materialização de vistas para informação agregada: Identificar, explorar e actualizar de forma eficiente Índices: Índices: Index intersection, bit map indices, join indices Index intersection, bit map indices, join indices

3 SAD Tagus 2004/05 H. Galhardas Escolha dos agregados Objectivo: Minimizar o tempo de resposta Pergunta: Quais os agregados a materializar? Factores: Espaço de armazenamento – custo baixo! Custo de actualização – custo alto! Compromisso: Tempo de resposta vs custo de actualização

4 SAD Tagus 2004/05 H. Galhardas Cube Operation Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) () 2^n cuboids (item)(city) () (year) (city, item)(city, year)(item, year) (city, item, year)

5 SAD Tagus 2004/05 H. Galhardas Goal: efficient computation of aggregations across many sets of dimensions Goal: efficient computation of aggregations across many sets of dimensions Take into account amount of main memory available and computation time Take into account amount of main memory available and computation time Data cube can be viewed as a lattice of cuboids Data cube can be viewed as a lattice of cuboids Materialization of data cube Materialization of data cube Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) Efficient Data Cube Computation

6 SAD Tagus 2004/05 H. Galhardas Partial materialization of cuboids Three factors must be considered: 1. Identify the subset of cuboids to materialize Take into account queries in the workload, their frequencies and accessing costs; cost of incremental updates, total storage requirements Popular approaches: use heuristics or materialize cuboids based on other frequently referenced cuboids 2. Exploit the materialized cuboids during query processing How to use indexes and how to transform OLAP operations onto the selected cuboids 3. Efficiently update the materialized cuboids during load and refresh Explore parallelism and incremental update

7 SAD Tagus 2004/05 H. Galhardas Classification of agg. facts wrt the agg. functions Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. Ex: count(), sum(), min(), max(). Algebraic: if it can be computed by an algebraic function with M arguments (where M is a constant), each of which is obtained by applying a distributive aggregate function. Ex: avg(), min_N(), standard_deviation(). Holistic: if there is no algebraic function with M arguments that characteriezes the computation. Ex: median(), mode(), rank().

8 SAD Tagus 2004/05 H. Galhardas Arquitecturas de servidor OLAP Relational OLAP (ROLAP) Usa SGBDs relacionais ou relacional extendido para armazenar e gerir os dados do datawarehouse e usa middleware OLAP para suportar funcinalidades específicas do OLAP. Inclui optimização suportada pelo SGBDR, implementa lógica de navegação de agregação e serviços/ferramentas adicionais Maior escalabilidade Multidimensional OLAP (MOLAP) Motor de armazenamento multidimensional baseado em arrays (sparse matrix techniques) Indexação rápida de dados sumarizados pré-calculados Hybrid OLAP (HOLAP) Flexibilidade: baixo nível: relacional, alto nível: array Specialized SQL servers Suporte especializado para interrogações SQL sobre esquemas em estrela e floco de neve

9 SAD Tagus 2004/05 H. Galhardas Some efficient cube computation methods ROLAP-based cubing algorithms (Agarwal et al’96) Array-based cubing algorithm (Zhao et al’97) Bottom-up computation method (Bayer & Ramarkrishnan’99) And others...

10 SAD Tagus 2004/05 H. Galhardas ROLAP-Based Cubing Algorithm Value-based addressing: dimension values accessed by key-based addressing strategies Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples Grouping is performed on some subaggregates as a “partial grouping step” Aggregates may be computed from previously computed aggregates, rather than from the base fact table

11 SAD Tagus 2004/05 H. Galhardas Iceberg Cube Computing only the cuboid cells whose count or other aggregates satisfying the condition like Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup Motivation Motivation Only a small portion of cube cells may be “above the water’’ in a sparse cube Only calculate “interesting” cells—data above certain threshold Avoid explosive growth of the cube Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2?

12 SAD Tagus 2004/05 H. Galhardas Multi-way Array Aggregation for Cube Computation Direct array addressing: dimension values accessed via the index of their corresponding array locations Direct array addressing: dimension values accessed via the index of their corresponding array locations Partition arrays into chunks (a small subcube which fits in memory). Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, thereby reducing memory access and storage cost. Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, thereby reducing memory access and storage cost. What is the best traversing order to do multi-way aggregation? A B 29303132 1234 5 9 13141516 64636261 48474645 a1a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2a3 C B 44 28 56 40 24 52 36 20 60

13 SAD Tagus 2004/05 H. Galhardas Cuboids ABC – base cuboid – already computed ABC – base cuboid – already computed AB, AC, BC – 2-D cuboids AB, AC, BC – 2-D cuboids A, B, C – 1-D cuboid A, B, C – 1-D cuboid All – 0-D cuboid All – 0-D cuboid Many possible orderings w/ which chunks can be read into memory Many possible orderings w/ which chunks can be read into memory

14 SAD Tagus 2004/05 H. Galhardas Example (1) A B 29303132 1234 5 9 13141516 64636261 48474645 a1a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2a3 C 44 28 56 40 24 52 36 20 60 B

15 SAD Tagus 2004/05 H. Galhardas Example (2) A B 29303132 1234 5 9 13141516 64636261 48474645 a1a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2a3 C 44 28 56 40 24 52 36 20 60 B

16 SAD Tagus 2004/05 H. Galhardas Multi-Way Array Aggregation for Cube Computation (Cont.) Method: the planes should be sorted and computed according to their size in ascending order. See the details of Example 2.12 (pp. 75-78) Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored

17 SAD Tagus 2004/05 H. Galhardas Indexing OLAP Data: Bitmap Index Index on a particular column Index on a particular column Each value in the column has a bit vector: bit-op is fast Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column The i-th bit is set if the i-th row of the base table has the value for the indexed column Not suitable for high cardinality domains Not suitable for high cardinality domains Base table Index on RegionIndex on Type CustRegionType C1AsiaRetail C2EuropeDealer C3AsiaDealer C4AmericaRetail C5EuropeDealer

18 SAD Tagus 2004/05 H. Galhardas Indexing OLAP Data: Join Indices Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) Traditional indices map the values to a list of record ids Traditional indices map the values to a list of record ids It materializes relational join in JI file and speeds up relational join — a rather costly operation In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table. In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table. Ex: fact table: Sales and two dimensions city and product A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city

19 SAD Tagus 2004/05 H. Galhardas Example of join indexes

20 SAD Tagus 2004/05 H. Galhardas Efficient Processing of OLAP Queries 1. Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection 2. Determine to which materialized cuboid(s) the relevant operations should be applied. 3. Exploring indexing structures and compressed vs. dense array structures in MOLAP

21 SAD Tagus 2004/05 H. Galhardas Bibliografia (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts) (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts) (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997 (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997 (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996 (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996


Download ppt "Implementação do DW. SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações."

Similar presentations


Ads by Google