
1 Data Warehouse Operational Issues: Potential Research Directions

2 Query processing: General issues
Query processing for OLAP queries has been addressed extensively in the past, including standard OLAP operations, ad-hoc queries, etc.
Research challenges:
1. There is a need for a "standardized" query language for OLAP, i.e., a cube algebra.
2. For several queries it would be sufficient to provide approximate answers. We need to identify approximation "measures" and formal techniques to evaluate them.
3. Integrate "what-if" analysis facilities into current OLAP tools, with particular emphasis on improved performance, so that more than one tool no longer has to be used for this purpose (as currently happens).
4. Support "holistic" aggregation (i.e., aggregation that cannot exploit differentials) efficiently. Examples include the computation of top-K or median values in the presence of streaming insertions and deletions; a sketch of such a computation follows after this list.
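The following is a minimal sketch, not taken from the slides and with made-up names, of a holistic aggregate in Python: top-K and median of a measure value maintained under streaming insertions and deletions, which need the full multiset of current values rather than a running differential.

```python
import heapq
from collections import Counter

class HolisticAggregate:
    """Keeps the multiset of measure values currently valid in the window."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def insert(self, value):
        self.counts[value] += 1
        self.n += 1

    def delete(self, value):
        if self.counts[value] > 0:
            self.counts[value] -= 1
            self.n -= 1

    def top_k(self, k):
        # The K largest distinct values still present; cannot be derived
        # from a running total, the whole multiset is needed.
        present = (v for v, c in self.counts.items() if c > 0)
        return heapq.nlargest(k, present)

    def median(self):
        # Walk the sorted value distribution until half the tuples are covered.
        seen = 0
        for value in sorted(v for v, c in self.counts.items() if c > 0):
            seen += self.counts[value]
            if 2 * seen >= self.n:
                return value
        return None

agg = HolisticAggregate()
for v in [10, 20, 20, 30, 40]:
    agg.insert(v)
agg.delete(40)
print(agg.top_k(2), agg.median())  # [30, 20] 20
```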

3 Stream data
Streams are sequences of data continuously flowing from a data source, with the particular characteristic that, due to their volume, each tuple is available for querying only within a limited time window (a minimal windowing sketch follows after the examples). Stream examples include:
- Stock rates extracted from the web
- Packets going through a router
- Clickstreams from a web site
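As an illustration of the limited-time-window property, here is a minimal sketch in Python (the tuple format is an assumption): tuples are appended to a sliding window and silently expire, i.e., stop being queryable, once they are older than the window.

```python
import time
from collections import deque

class SlidingWindow:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.buffer = deque()  # (timestamp, tuple) pairs, oldest first

    def append(self, tup, now=None):
        now = time.time() if now is None else now
        self.buffer.append((now, tup))
        self._expire(now)

    def _expire(self, now):
        # Tuples older than the window are dropped and can no longer be queried.
        while self.buffer and now - self.buffer[0][0] > self.window:
            self.buffer.popleft()

    def current(self):
        return [tup for _, tup in self.buffer]

# Example: a clickstream window covering the last 60 seconds
w = SlidingWindow(60)
w.append({"page": "/home", "user": 42})
print(w.current())
```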

4 Query processing: OLAP aggregation and stream data
Example: the standard OLAP queries we already know (cube operator, etc.), now posed over streaming data.
Research challenges:
1. Issues related to processing OLAP queries over streams; streams are eventually stored in the DW (updated consistently), and an interesting issue is defining consistency between the stream and the DW.
2. How to split query processing between on-line queries on stream data and queries on the contents of the warehouse; it might also be necessary to store some of the results back in the DW to allow for future queries. A sketch of such a split follows after this list.
3. Synchronization of multiple incoming streams with different arrival rates, e.g., when they need to be combined with some join operation and aggregation in the presence of consistency constraints.
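A minimal sketch of challenge 2, with hypothetical table and column names (sales_fact, product, amount): the historical part of an aggregate is answered from the warehouse, the most recent part from the in-memory window of stream tuples not yet loaded, and the two are combined.

```python
import sqlite3

def total_sales(product, dw_conn, window_tuples):
    # Historical part: answered against the warehouse fact table
    row = dw_conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales_fact WHERE product = ?",
        (product,),
    ).fetchone()
    historical = row[0]
    # Live part: answered against the in-memory window of recent stream tuples
    recent = sum(t["amount"] for t in window_tuples if t["product"] == product)
    return historical + recent

# Tiny demo with an in-memory "warehouse" and two not-yet-loaded stream tuples
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (product TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?)", [("tv", 100.0), ("tv", 50.0)])
window = [{"product": "tv", "amount": 25.0}, {"product": "radio", "amount": 10.0}]
print(total_sales("tv", dw, window))  # 175.0
```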

5 Query processing: Subscription queries
Example: a set of indicators defines whether or not business is going well, depending on whether stream values exceed some threshold.
Subscription queries are defined in order to continuously monitor the DW. One way to do that is by defining ECA rules on the data warehouse: evaluation of the triggers requires access to the DW, and when the event takes place, the associated queries are executed against the warehouse (a sketch follows after the list below).
Research challenges:
1. Modeling of subscription queries (is a new language needed, or are standard trigger languages enough?)
2. What kind of extensions are needed at the physical level to support this?
3. How to optimize trigger condition checking (difficult in light of heavy aggregation).
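A minimal sketch of an ECA-style subscription rule in Python; the rule class, the sales_fact table and the threshold are assumptions used only to show the event/condition/action split.

```python
class ECARule:
    def __init__(self, condition, action):
        self.condition = condition
        self.action = action

    def on_event(self, event, dw_conn):
        # Event arrives -> evaluate condition against the DW -> run action
        if self.condition(event, dw_conn):
            self.action(event, dw_conn)

# Condition: the daily sales indicator has dropped below a threshold
def low_sales(event, dw_conn):
    (total,) = dw_conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales_fact WHERE day = ?",
        (event["day"],),
    ).fetchone()
    return total < 1000.0

# Action: notify the subscriber (here, simply print an alert)
def notify(event, dw_conn):
    print("ALERT: low sales on", event["day"])

rule = ECARule(low_sales, notify)
# rule.on_event({"day": "2005-08-01"}, dw)  # dw: a connection to the warehouse
```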

6 Query processing: Cached queries
Example: compute the sum per country and per month and save it, because the sum per continent and per year can be computed from it (a roll-up sketch follows after the list below).
This is an important component that should exist in any data warehousing architecture.
Research challenges:
1. How, and under what conditions, can query rewriting over cached results be done? [A cube algebra would be useful here.]
2. Work on all "caching" issues: updating/refreshing, deciding whether or not to cache, replacement policies in the case of bounded space, etc.
3. Work on the same problem for streams (fast-arriving data) and for particular types of queries that are less sensitive to temporal changes or that compute approximate answers (otherwise caching is of no interest, as the data changes too fast).
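The example can be illustrated with a minimal sketch in Python, using made-up dimension mappings and figures: the cached sum per (country, month) is rolled up to answer the coarser sum per (continent, year) without touching the base data.

```python
from collections import defaultdict

country_to_continent = {"Germany": "Europe", "Italy": "Europe", "Japan": "Asia"}

# Cached result of the finer query: {(country, "YYYY-MM"): sum_of_sales}
cached = {
    ("Germany", "2005-07"): 120.0,
    ("Germany", "2005-08"): 80.0,
    ("Japan", "2005-07"): 200.0,
}

def rollup_to_continent_year(cached_cells):
    result = defaultdict(float)
    for (country, month), total in cached_cells.items():
        year = month[:4]  # "YYYY-MM" -> "YYYY"
        result[(country_to_continent[country], year)] += total
    return dict(result)

print(rollup_to_continent_year(cached))
# {('Europe', '2005'): 200.0, ('Asia', '2005'): 200.0}
```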

7 Query processing: Personalization
Example: a profile (query) set by the user states that she is interested in sales between 2k and 4k in August in Germany and does not care about sales between 4k and 5k in July in Italy. She wishes to see highlighted in the results the regions of the cube that satisfy her profile (a matching sketch follows after the list below).
Research challenges:
1. Can personalization be used in the DW context? Does it make sense in the context of aggregation, and what kinds of predicates based on aggregations are interesting?
2. How to express profiles?
3. How to process "profiles"? Particular emphasis should be put on the scalability of profile management in the presence of multiple users.
4. How could profiles be used in the design of the DW so as to avoid unnecessary processing?
5. Is this topic really new research, or re-engineering of existing solutions for simple queries?
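A minimal sketch, under an assumed cell and profile representation, of marking which cube cells satisfy the user's profile so that they can be highlighted (or ignored) in the result.

```python
profile = [
    # interesting: sales between 2000 and 4000, August, Germany
    {"month": "2005-08", "country": "Germany", "lo": 2000, "hi": 4000, "interest": "highlight"},
    # uninteresting: sales between 4000 and 5000, July, Italy
    {"month": "2005-07", "country": "Italy", "lo": 4000, "hi": 5000, "interest": "ignore"},
]

def annotate(cells, profile):
    # cells: list of dicts with "month", "country", "sales"
    for cell in cells:
        cell["interest"] = "neutral"
        for rule in profile:
            if (cell["month"] == rule["month"]
                    and cell["country"] == rule["country"]
                    and rule["lo"] <= cell["sales"] <= rule["hi"]):
                cell["interest"] = rule["interest"]
    return cells

cells = [{"month": "2005-08", "country": "Germany", "sales": 3500},
         {"month": "2005-07", "country": "Italy", "sales": 4500}]
print(annotate(cells, profile))
```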

8 Query processing: Data mining
Example: a user specifies an ECA rule that involves the identification of potentially interesting opportunities, based on the values of a cube. The condition part concerns the identification of a pattern occurring more than 5 times within the last 2 days; the pattern is computed by an online data mining algorithm (a sketch of the condition check follows below).
Research challenges: the integration of data mining algorithms into OLAP tools, especially in the presence of novel types of data and data streams. Particular emphasis is put on the efficiency of the algorithms as well as their smooth integration within the overall OLAP environment.
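A minimal sketch of the condition check in the example; the online mining algorithm itself is abstracted away, and only the "more than 5 occurrences within the last 2 days" test over recorded pattern occurrences is shown.

```python
import time
from collections import deque

class PatternCondition:
    def __init__(self, threshold=5, horizon_seconds=2 * 24 * 3600):
        self.threshold = threshold
        self.horizon = horizon_seconds
        self.occurrences = deque()  # timestamps of detected pattern occurrences

    def record_occurrence(self, ts=None):
        # Called by the (abstracted) online mining algorithm on each detection
        self.occurrences.append(time.time() if ts is None else ts)

    def holds(self, now=None):
        now = time.time() if now is None else now
        # Drop occurrences older than the 2-day horizon, then test the count
        while self.occurrences and now - self.occurrences[0] > self.horizon:
            self.occurrences.popleft()
        return len(self.occurrences) > self.threshold

cond = PatternCondition()
for _ in range(6):
    cond.record_occurrence()
print(cond.holds())  # True: more than 5 occurrences within the window
```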

9 ETL: Traditional ETL
Definition: traditional ETL processes are responsible for the extraction of data from several sources, their cleansing, customization and insertion into a DW. These tasks are repeated on a regular basis and, in most cases, they are asynchronous.
Example: integration of data extracted from OLTP and legacy systems into a DW (a sketch of such a process follows below).
State of the art: so far, the research community has only partially dealt with the problem of designing and managing ETL processes. Typically, research approaches concern (a) stand-alone problems (e.g., the problem of duplicate detection) in an isolated setting, and (b) problems mostly related to web data. Recently, research on data streams has brought up the possibility of taking an alternative look at the problem of ETL. Nevertheless, for the moment, research in data streaming has focused on different topics, such as on-the-fly computation of queries.
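A minimal sketch of one traditional ETL step, with hypothetical source and target schemas and an assumed fixed currency rate, showing only the extract, cleanse/transform and load phases that would be repeated on a regular schedule.

```python
import sqlite3

def extract(source_conn):
    # Pull raw tuples from an OLTP or legacy source
    return source_conn.execute("SELECT customer, amount, currency FROM orders").fetchall()

def transform(rows):
    cleaned = []
    for customer, amount, currency in rows:
        if amount is None:                 # cleansing: drop incomplete tuples
            continue
        if currency == "USD":              # customization: convert to a single currency
            amount = amount * 0.8          # assumed fixed rate, for illustration only
        cleaned.append((customer.strip().upper(), round(amount, 2)))
    return cleaned

def load(dw_conn, rows):
    dw_conn.executemany(
        "INSERT INTO sales_fact (customer, amount_eur) VALUES (?, ?)", rows)
    dw_conn.commit()

# extract -> transform -> load would be wired together and run on a schedule.
```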

10 ETL: Traditional ETL (cont.)
Research challenges:
1. A formal description of ETL processes, with particular emphasis on an algebra (for optimization purposes) and a formal declarative language.
2. Optimization of ETL processes at the logical and physical level. A challenge is the optimization either of the whole ETL process or of any individual transformation. Parallel processing of ETL processes is of particular importance.
3. Propagation of changes back to the sources. Potential quality problems observed at the end-user level can lead to clean data being propagated back to the sources, in order to avoid the repetition of several tasks in future applications of the ETL process.
4. Standards-based metadata for ETL processes. There is no common model for the metadata of ETL processes.
5. Integration of ETL with XML wrappers, EAI (Enterprise Application Integration) tools, and data quality tools.
6. Extension of the ETL mechanisms to non-traditional data, like XML/HTML, spatial and biomedical data.
7. Security issues in ETL; both source data and data in transit pose security risks.

11 ETL: Stream ETL
Definition: stream ETL is an ETL process involving the possible filtering, value conversion and transformation of the incoming information into a desirable format. Although streams in their entirety cannot always be stored (and we also do not want to go over them again and again during ETL processing), some patterns or snapshot aggregates of them can be stored for subsequent querying.
Research challenges: one issue that arises concerns correctness guarantees. We also need cost models for the tuning of the incoming stream within a specified time window. Another issue is auditing the incoming stream data against several constraints or business rules, also with respect to stored data (e.g., primary key violations); a sketch of such an audit follows below.
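A minimal sketch of such an audit, under an assumed tuple layout: incoming stream tuples are filtered and value-converted, checked for primary-key violations against keys already stored, and reduced to a per-minute snapshot aggregate instead of being stored in their entirety.

```python
from collections import defaultdict

def stream_etl(incoming, known_keys):
    per_minute = defaultdict(float)  # snapshot aggregate instead of raw tuples
    violations = []
    for t in incoming:
        if t.get("amount") is None:  # filtering
            continue
        # value conversion (assumed fixed rate, for illustration only)
        amount_eur = t["amount"] * 0.8 if t["currency"] == "USD" else t["amount"]
        if t["order_id"] in known_keys:  # audit: primary-key violation w.r.t. stored data
            violations.append(t["order_id"])
            continue
        known_keys.add(t["order_id"])
        per_minute[t["timestamp"][:16]] += amount_eur  # key is "YYYY-MM-DD HH:MM"
    return dict(per_minute), violations

stream = [{"order_id": 1, "amount": 10.0, "currency": "EUR", "timestamp": "2005-08-01 10:15:30"},
          {"order_id": 1, "amount": 12.0, "currency": "USD", "timestamp": "2005-08-01 10:15:45"}]
print(stream_etl(stream, known_keys={7}))
# ({'2005-08-01 10:15': 10.0}, [1])
```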

12 ETL: On-Demand ETL
Example: some users request data to be brought in from the web. The administrator/programmer is assigned the task of constructing an ETL process that extracts the data from the specified sites, transforms them and ultimately stores them in the warehouse. Any time the user needs this data, this on-demand ETL process brings in the relevant information (a sketch follows below).
Definition: an ETL process executed sporadically, manually initiated by some user demand. The process is responsible for retrieving external data and loading them into the DW after the appropriate transformations.
Research challenges: since this process is mostly focused on web data, there is a need for appropriate operators. Additional challenges involve:
- Minimum effort/time/resources for the construction of the process
- Easy adaptability to changes of the external data
- Efficient algorithms
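A minimal sketch of an on-demand ETL process; the URL, the JSON layout of the external data and the stock_rates target table are all hypothetical.

```python
import json
import urllib.request

def on_demand_etl(url, dw_conn):
    # Extract: fetch the external web data only when the user requests it
    with urllib.request.urlopen(url) as resp:
        records = json.load(resp)
    # Transform: keep only the fields the warehouse schema expects
    rows = [(r["symbol"], float(r["rate"]), r["date"]) for r in records]
    # Load into the warehouse
    dw_conn.executemany(
        "INSERT INTO stock_rates (symbol, rate, day) VALUES (?, ?, ?)", rows)
    dw_conn.commit()
    return len(rows)

# on_demand_etl("https://example.org/rates.json", dw)  # run only on user demand
```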

