Data Warehousing Lifecycle

Presentation transcript:

Data Warehousing Lifecycle
- Conceptual modeling: system requirements, data sources, and warehousing activities.
- Logical design: data flow from sources to the DW, composition and semantics of activities.
- DW construction: schema implementation, data population, and warehouse tuning.
- Application development: DW interfaces, OLAP, and data mining tools.

On-Line Analytical Processing (OLAP)
[Figure: a data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (day: M, T, W, Th, F, S, S), shown next to the same cube after roll-up of Time to week.]
Dimensions: Time, Product, Store
Hierarchies:
- Day → Week → Quarter (roll-up to week)
- Product → Brand → … (roll-up to brand)
- Store → Region → Country (roll-up to region)
Operators: roll-up, drill-down, slice, and dice.
Uses: business data analysis, e.g., market-driven trend analysis.
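As a rough illustration of the roll-up operator, the sketch below aggregates daily unit sales to weeks and then stores to regions. The sales figures, the `STORE_REGION` mapping, and the function name `roll_up` are invented for the example; only the dimension names and store and product labels come from the slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (store, product, day_index, units). Values are invented.
SALES = [
    ("NY", "Juice", 0, 10),
    ("NY", "Juice", 1, 5),
    ("NY", "Milk", 8, 7),
    ("SF", "Juice", 2, 3),
    ("LA", "Juice", 9, 4),
]

# Assumed store -> region mapping (not given on the slide).
STORE_REGION = {"NY": "East", "SF": "West", "LA": "West"}


def roll_up(rows, store_level=lambda s: s, time_level=lambda d: d):
    """Sum units after mapping store and time keys to a coarser level."""
    totals = defaultdict(int)
    for store, product, day, units in rows:
        totals[(store_level(store), product, time_level(day))] += units
    return dict(totals)


# Roll-up on Time: day -> week (7 days per week).
by_week = roll_up(SALES, time_level=lambda d: d // 7)
# Roll-up on Store as well: store -> region, still at week granularity.
by_region_week = roll_up(SALES, store_level=STORE_REGION.get, time_level=lambda d: d // 7)
```

Drill-down is the reverse direction: returning from `by_region_week` to `by_week` or to the raw rows, which is only possible if the finer-grained data is still stored.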

CSE601: Cube Aggregates Lattice
Lattice of group-bys, from finest to coarsest:
- city, product, date
- city, product | city, date | product, date
- city | product | date
- all
Use a greedy algorithm to decide what to materialize.
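The slide does not say which greedy algorithm is meant. A minimal sketch, assuming the usual benefit-based heuristic (repeatedly materialize the view that saves the most query cost, given what is already materialized), could look like the following. The row counts in `SIZES` and the names `lattice` and `greedy_select` are hypothetical.

```python
from itertools import combinations

DIMS = ("city", "product", "date")

# Hypothetical row counts per group-by (invented for illustration).
SIZES = {
    ("city", "product", "date"): 1000,
    ("city", "product"): 300,
    ("city", "date"): 800,
    ("product", "date"): 600,
    ("city",): 10,
    ("product",): 50,
    ("date",): 100,
    (): 1,  # the "all" node
}


def lattice(dims):
    """All group-bys of the cube, from the full key set down to 'all' (empty tuple)."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def greedy_select(sizes, k):
    """Pick k extra views, each time taking the one with the largest total saving."""
    top = max(sizes, key=len)  # the base cuboid is assumed to be always materialized
    chosen = [top]
    for _ in range(k):
        def cost(w):
            # Cheapest already-chosen view that can answer w (any superset of w).
            return min(sizes[v] for v in chosen if set(w).issubset(v))

        def benefit(v):
            # Total saving over every view w that v could answer.
            return sum(max(0, cost(w) - sizes[v]) for w in sizes if set(w).issubset(v))

        candidates = [v for v in sizes if v not in chosen]
        chosen.append(max(candidates, key=benefit))
    return chosen[1:]


picked = greedy_select(SIZES, 2)
```

With these made-up sizes the first pick is `("city", "product")`, because it cheaply answers four nodes of the lattice; the second pick depends on what the first one already covers.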

Dimension Hierarchies
all → state → city (coarsest to finest)

Dimension Hierarchies
Lattice nodes once the state level is added (not all arcs shown):
- city, product, date | state, product, date
- city, product | city, date | product, date | state, product | state, date
- city | state | product | date
- all
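One way to see where the twelve nodes come from: each dimension contributes a list of levels, and the lattice is the cross product of those lists. The sketch below assumes that reading; the dimension name `location` and the function name are illustrative only.

```python
from itertools import product

# Levels per dimension, finest first; "all" means the dimension is aggregated away.
LEVELS = {
    "location": ["city", "state", "all"],
    "product": ["product", "all"],
    "date": ["date", "all"],
}


def hierarchical_lattice(levels):
    """One node per combination of levels, written as the tuple of kept attributes."""
    nodes = []
    for combo in product(*levels.values()):
        kept = tuple(level for level in combo if level != "all")
        nodes.append(kept)
    return nodes


NODES = hierarchical_lattice(LEVELS)
```

The empty tuple stands for the "all" node; 3 x 2 x 2 level choices give the 12 nodes listed on the slide.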

Logical Data Modeling: A Star Schema Example
Fact table:
- Sales(time_key, branch_key, location_key, product_key, num_units, amount_usd)
Dimension tables:
- Time(time_key, day, month, year)
- Product(product_key, name, brand, type)
- Location(location_key, city, state, country)
- Branch(branch_key, name, type)
- Supplier(supplier_key, name, type), whose link to the schema is marked "???" on the slide
Observations:
- Relationships between the fact and the dimensions are one-to-many (1:n).
- The fact-dimension relationships are certain.
- Dimensions in star models are often tightly coupled.
- The star schema does not appear to be very extensible.
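The schema above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table and column names follow the slide, while the column types, the sample rows, and the query are invented, and Supplier is left out because the slide does not say how it connects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE Product  (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
CREATE TABLE Location (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE Branch   (branch_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE Sales (
    time_key     INTEGER REFERENCES Time(time_key),
    branch_key   INTEGER REFERENCES Branch(branch_key),
    location_key INTEGER REFERENCES Location(location_key),
    product_key  INTEGER REFERENCES Product(product_key),
    num_units    INTEGER,
    amount_usd   REAL
);
-- Invented sample rows.
INSERT INTO Time VALUES (1, 1, 1, 2008), (2, 2, 1, 2008);
INSERT INTO Product VALUES (1, 'Juice', 'BrandA', 'drink'), (2, 'Soap', 'BrandB', 'care');
INSERT INTO Location VALUES (1, 'Buffalo', 'NY', 'USA');
INSERT INTO Branch VALUES (1, 'Main', 'retail');
INSERT INTO Sales VALUES (1, 1, 1, 1, 10, 25.0), (2, 1, 1, 1, 4, 10.0), (2, 1, 1, 2, 1, 3.5);
""")

# Typical star join: aggregate the measures grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.brand, t.year, SUM(s.num_units), SUM(s.amount_usd)
    FROM Sales s
    JOIN Product p ON p.product_key = s.product_key
    JOIN Time t    ON t.time_key = s.time_key
    GROUP BY p.brand, t.year
    ORDER BY p.brand
""").fetchall()
```

Every analytical query has the same shape: join the fact table to the dimensions it needs, then group by dimension attributes.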

Biomedical Data Resources
- Static data: data on genotypes and biological entities such as nucleic acids and proteins, and on the relationships between these entities.
- Dynamic data: data on phenotypes and the dynamics of biological processes.
- Data on analysis tools: data on biological and computer science methods that can be used to identify the entities and relationships.
- References and annotations: links to scientific papers and textual explanations.

Biomedical Data Modeling
- Flat file collections: databases were built up as indexed ASCII text files.
- Relational databases: many biology databases were implemented using Oracle, Sybase, or MySQL.
- Object-oriented databases: data are modeled as objects that are organized in classes.
- Multidimensional databases: data are organized in star-like schemas.

Using Star Schema in Gene Expression Data Management
Source: “Applying Data Warehouse Concepts to Gene Expression Data Management”, by V. Markowitz and T. Topaloglou.
Three modeling data spaces:
- Sample data space
- Gene annotation data space
- Gene expression data space

Gene Expression Data Space
- Expression(Gene_id, Experiment_id, Analysis_id, Expression_call)
- Analysis(Analysis_id, Algorithm, version)
- Gene(Gene_id, Gene_name, Gene_symbol)
- Experiment(Experiment_id, Exp_name, Exp_date, Exp_file)
Other entities shown in the diagram: Sample, Clinical Sample

Sample Data Space
[Diagram labels: Biological Sample Pathways Study Donor Demographics Donor Clinical]

Gene Annotation Data Space
[Diagram labels: Gene Fragments Sequence Pathways Sequence Cluster Known gene Microarray Design Chromosome]

OLAP Operations
- Sample selection: extract sets of samples with a certain profile on the sample data space, e.g., a sample set of male colon samples with adenocarcinoma for donors in a particular age group.
- Classification on organ: total number of samples classified by liver, brain, …

OLAP Operations
- Gene selection: extract sets of genes with certain properties over the gene annotation data space, e.g., the set of genes on chromosome 22 …
- Aggregates: gene summarization on the sample dimension, sample summarization on the gene dimension, etc.
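Sample selection and classification on organ can be sketched as a filter and a count over sample records. Everything concrete here is an assumption made for the example: the record fields, the values, the function names, and the age range 50 to 70, which stands in for the range that is missing from the slide text.

```python
from collections import Counter

# Hypothetical sample records; the field names and values are invented.
SAMPLES = [
    {"id": "S1", "organ": "colon", "sex": "M", "age": 62, "diagnosis": "adenocarcinoma"},
    {"id": "S2", "organ": "colon", "sex": "F", "age": 58, "diagnosis": "adenocarcinoma"},
    {"id": "S3", "organ": "liver", "sex": "M", "age": 45, "diagnosis": "normal"},
    {"id": "S4", "organ": "brain", "sex": "M", "age": 70, "diagnosis": "glioblastoma"},
    {"id": "S5", "organ": "colon", "sex": "M", "age": 35, "diagnosis": "adenocarcinoma"},
]


def select_samples(samples, organ, sex, diagnosis, min_age, max_age):
    """Sample selection: ids of samples matching a profile on the sample data space."""
    ages = range(min_age, max_age + 1)  # inclusive age group
    return [
        s["id"] for s in samples
        if s["organ"] == organ and s["sex"] == sex
        and s["diagnosis"] == diagnosis and s["age"] in ages
    ]


def classify_by(samples, attribute):
    """Classification: number of samples per value of one attribute."""
    return Counter(s[attribute] for s in samples)


selected = select_samples(SAMPLES, "colon", "M", "adenocarcinoma", 50, 70)
per_organ = classify_by(SAMPLES, "organ")
```

Gene selection follows the same pattern over gene annotation records, for example filtering on a chromosome attribute.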

Clinical Data Space
[ER diagram. Entities: Clinical Sample, Medical Image, Followup, Drug, Demographics, Clinical Test, Physiology, Patient, Disease. Relationships are labeled 1:n or n:n.]

Sample Data Space
[ER diagram. Entities: Protein Expression, mRNA Expression, Anatomy Ontology, Biochemical Assay, Genetic Screening, Clinical Sample, Patient. Relationships are labeled 1:n or n:n.]

Microarray Data Space
[ER diagram. Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample.]

Proteomic Data Space
[ER diagram. Entities: Protein Expression, Experiment, Measurement Unit, Gene Sequence, Clinical Sample.]

Experiment Data Space
[ER diagram. Entities: Project, Experiment, Publication, Normalization, Protocol, Person, Platform. Relationships are labeled 1:1, 1:n, or n:n.]

Gene Data Space
[ER diagram. Entities: Protein Expression, Gene Sequence, Promoter, Gene Ontology, Protein Domain, Protein-Protein Interaction, Gene Cluster, mRNA Expression, Array Probe. Relationships are labeled 1:n or n:n.]

Explicit Definition of Concept Hierarchies
[ER diagram. Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample, Anatomy Ontology, Patient, Disease, Project, Platform, Normalization, Gene Ontology, Gene Cluster. Relationships are labeled 1:n or n:n.]

Characteristics of Clinical and Genomic Data

Clinical and Genomic Data | Business Data
Complex data structure with many potential dimensions | Easy-to-understand data structure with few dimensions
Often many-to-many relationships between facts and dimensions | Many-to-one relationships between facts and dimensions
Uncertain relationships between fact and dimension objects | Certain relationships between fact and dimension objects
Some measures require advanced temporal support for time validity | Historical data, no advanced temporal support needed
Incomplete and/or imprecise data very common | Little incomplete and/or imprecise data

Large Number of Dimensions and Evolution of Dimensions
- If a star schema is used and the number of dimensions is large, the fact table becomes huge, because every row carries a combination of foreign keys, one per dimension.
- Adding a new dimension to a star schema requires recomputing every entry in the fact table.
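A back-of-the-envelope sketch of the growth: the number of distinct foreign-key combinations a single fact table can hold is bounded by the product of the dimension cardinalities. The cardinalities below are invented, and real fact tables are usually far sparser than this bound.

```python
from math import prod

# Hypothetical dimension cardinalities (invented numbers).
cardinalities = {"patient": 1_000, "drug": 200, "disease": 150, "test": 80}


def max_fact_rows(cards):
    """Upper bound on distinct foreign-key combinations in a single fact table."""
    return prod(cards.values())


before = max_fact_rows(cardinalities)
# One more dimension multiplies the bound and adds a key column to every existing row.
after = max_fact_rows({**cardinalities, "sample": 50})
```

The second call also hints at the evolution problem: a new dimension changes the key of every row already stored.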

Many-to-Many Relationships
Many-to-many relationships cannot be modeled easily with a star schema, which was originally designed to handle many-to-one relationships between a business fact and a dimension.

Incompleteness of Data
Clinical data may be incomplete. In a star schema this can leave many foreign keys in the fact table null, which leads to inconsistency.

Star Schema
- Fact(DimKey1, DimKey2, DimKey3, DimKey4, Measure1, Measure2, Measure3, Measure4)
- Dim1(DimKey1, ...), Dim2(DimKey2, ...), Dim3(DimKey3, ...), Dim4(DimKey4, ...)

BioStar Schema
- Fact(FactKey, ...)
- MTable1(DimKey1, FactKey, Measure1), MTable2(DimKey2, FactKey, Measure2), MTable3(DimKey3, FactKey, Measure3), MTable4(DimKey4, FactKey, Measure4)
- Dim1(DimKey1, ...), Dim2(DimKey2, ...), Dim3(DimKey3, ...), Dim4(DimKey4, ...)

BioStar Schema for Part of the Clinical Data Space
Fact table:
- Patient(PatientID, SSN, Name, Gender, DOB)
Measure tables:
- DrugUse(DrugID, PatientID, Dosage, ValidFrom, ValidTo)
- TestResult(TestID, PatientID, Result, DateTested)
- ClinicalSample(SampleID, PatientID, Source, Amount, DateTaken)
- Diagnosis(DiseaseID, PatientID, Symptom, ValidFrom, ValidTo)
Dimension tables:
- Drug(DrugID, DrugName, DrugType, Description)
- Disease(DiseaseID, Name, Type, Description)
- ClinicalTest(TestID, TestName, TestType, TestSetting)
Benefits: extensibility and flexibility.
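To make the measure-table idea concrete, here is a minimal `sqlite3` sketch of the Patient, Drug, and DrugUse part of this schema, reading Patient as the central fact table. The column types, the sample rows, and the helper `drugs_on` are assumptions for illustration; only the table and column names come from the slide, and some columns are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Central fact table and one dimension, with a subset of the slide's columns.
CREATE TABLE Patient (PatientID INTEGER PRIMARY KEY, Name TEXT, Gender TEXT, DOB TEXT);
CREATE TABLE Drug    (DrugID INTEGER PRIMARY KEY, DrugName TEXT, DrugType TEXT);
-- Measure table: one row per (patient, drug) pairing, so many-to-many is natural.
CREATE TABLE DrugUse (
    DrugID    INTEGER REFERENCES Drug(DrugID),
    PatientID INTEGER REFERENCES Patient(PatientID),
    Dosage    TEXT,
    ValidFrom TEXT,
    ValidTo   TEXT
);
-- Invented rows. Patient 3 has no drug record and simply has no DrugUse row.
INSERT INTO Patient VALUES (1, 'P1', 'F', '1970-01-01'), (2, 'P2', 'M', '1980-05-05'),
                           (3, 'P3', 'F', '1990-09-09');
INSERT INTO Drug VALUES (10, 'DrugA', 'typeX'), (11, 'DrugB', 'typeY');
INSERT INTO DrugUse VALUES
    (10, 1, '5 mg',  '2007-01-01', '2007-06-30'),
    (11, 1, '10 mg', '2007-03-01', '2007-12-31'),
    (10, 2, '5 mg',  '2007-05-01', '2007-05-31');
""")


def drugs_on(date):
    """Drugs each patient was taking on an ISO date, using the validity interval."""
    return conn.execute("""
        SELECT p.Name, d.DrugName
        FROM DrugUse u
        JOIN Patient p ON p.PatientID = u.PatientID
        JOIN Drug d    ON d.DrugID = u.DrugID
        WHERE ? BETWEEN u.ValidFrom AND u.ValidTo
        ORDER BY p.Name, d.DrugName
    """, (date,)).fetchall()


mid_may = drugs_on("2007-05-15")
```

A patient can take several drugs and a drug can be taken by several patients, yet no table needs a null foreign key: missing information shows up as an absent row in the measure table. The `ValidFrom` and `ValidTo` columns carry the time validity mentioned earlier.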

BioStar Schema for the Sample Data Space
- ClinicalSample(SampleID, PatientID, Source, Amount, DateTaken)
- mRNAExpression(SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression)
- AssayResult(AssayID, SampleID, Result, Comment, DateTested)
- SampleAnatomy(TermID, SampleID, Description)
- GeneticScreen(MarkerID, SampleID, Result, RawData, Comment, DateTested)
- AnatomyTerm(TermID, TermType, TermName, Definition)
- BiochemAssay(AssayID, AssayName, AssayType, AssaySetting, Description)
- GeneticMarker(MarkerID, MarkerName, MarkerType, GeneticLocus, Description)

BioStar Schema for Part of the Gene Data Space
- GeneSequence(UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status)
- GOAnnotation(GOID, UID, Evidence)
- Promoter(PromoterID, UID, PromoterType, PromoterSeq, Length, Description)
- ProteinInteract(UID1, UID2, Evidence, Description)
- GeneCluster(ClusterID, UID)
- ArrayProbe(ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC)
- GeneDomain(DomainID, UID, Alignment, SeqFrom, SeqTo, DomainFrom, DomainTo, EValue, BitScore)
- GOTerm(GOID, Accession, TermType, TermName, Definition)
- Cluster(ClusterID, NumOfGenes, ExprPattern, ClusteringTool, ToolSetting, Description)
- DomainModel(DomainID, ModelType, SourceDB, Accession, Title, Length, Description)

Star Schema for the Microarray Data Space
Fact table:
- mRNAExpression(SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression)
Dimension tables:
- Experiment(ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
- ArrayProbe(ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC)
- MeasurementUnit(MeasureUnitID, MeasureUnitName, MeasureUnitType, Description)
- GeneSequence(UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status)
- ClinicalSample(SampleID, PatientID, Source, Amount, DateTaken)

Star Schema for the Proteomic Data Space
Fact table:
- ProteinExpression(SampleID, UID, ExperimentID, MeasureUnitID, Expression)
Dimension tables:
- Experiment(ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
- MeasurementUnit(MeasureUnitID, MeasureUnitName, MeasureUnitType, Description)
- GeneSequence(UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status)
- ClinicalSample(SampleID, PatientID, Source, Amount, DateTaken)

Star Schema for the Experiment Data Space
Fact table:
- Experiment(ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
Dimension tables:
- Project(ProjectID, ProjectName, Investigator, Description)
- Protocol(ProtocolID, ProtocolName, ProtocolText, CreatedBy)
- Publication(PublicationID, PubMedID, Title, Authors, Abstract, PubDate, Citation)
- Platform(PlatformID, Hardware, Software, Settings, Description)
- Person(PersonID, PersonName, LabName, Contact)
- Normalization(NormalizationID, NormType, Software, Parameters, Description)

BioStar Is Not a Fact Constellation
You may view measure tables as small “fact” tables, but fact tables in a constellation usually share multiple dimension tables.
[Diagram of a fact table and its dimension tables.]

Extensibility of BioStar
Example: add a protein structure information dimension to the gene data space.
- Existing table: GeneSequence(UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status)
- New measure table: (UID, PDBID, …..)
- New dimension table: (PDBID, …..)
The two new tables are labeled ProteinStructure and ProteinSequence on the slide. Populating them does not affect other tables.

Flexibility of BioStar
Separate tables for fact measures solve the many-to-many relationship problem: a dimension table and its associated measure table can be populated independently, which avoids null values.

Sample Classification Hierarchy
[Tree diagram, top to bottom:]
- All_sample
- Normal | Tumor
- Brain, Blood, Colon, Breast | CNS_tumor, Leukemia, Colon tumor, Breast tumor, ...
- Adenocarcinoma, ... | Glioblastoma, ... | ALL, AML
- (Patients)
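Rolling counts up such a hierarchy can be sketched with a child-to-parent map. The tree shape below is an assumption reconstructed from the flattened slide (for instance, placing Adenocarcinoma under Colon tumor), the Normal subtree is collapsed for brevity, and the patient counts are invented.

```python
from collections import Counter

# Child -> parent links; the exact tree shape is an assumption, not taken from the slide.
PARENT = {
    "Normal": "All_sample", "Tumor": "All_sample",
    "CNS_tumor": "Tumor", "Leukemia": "Tumor",
    "Colon tumor": "Tumor", "Breast tumor": "Tumor",
    "Glioblastoma": "CNS_tumor", "ALL": "Leukemia", "AML": "Leukemia",
    "Adenocarcinoma": "Colon tumor",
}

# Hypothetical patient counts at the most specific class.
LEAF_COUNTS = {"Normal": 5, "Glioblastoma": 3, "ALL": 4, "AML": 2, "Adenocarcinoma": 6}


def roll_up_counts(leaf_counts, parent):
    """Propagate each count to every ancestor so each class holds its subtree total."""
    totals = Counter()
    for node, count in leaf_counts.items():
        while node is not None:
            totals[node] += count
            node = parent.get(node)  # None once the root has been counted
    return totals


TOTALS = roll_up_counts(LEAF_COUNTS, PARENT)
```

Each level of the tree is one granularity of the sample dimension, so `TOTALS["Tumor"]` is the roll-up of all tumor subclasses.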

OLAP for Microarray Data Exploration
[Figure: a data cube with axes Sample (patient), Gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518), and Measurement Unit (PA, Val). Arrows mark roll-up to disease types, roll-up to GO terms, and roll-up to expression.]
Dimensions: Sample, Gene, Measurement Unit
Operators: roll-up, drill-down, slice, dice, t-test, p-select
Application: exploration of gene expression data
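The slide lists t-test as a cube operator without saying which variant is used. As a hedged sketch, the code below applies Welch's t statistic gene by gene across two slices of the sample dimension. The expression values and group names are invented; the two gene identifiers are borrowed from the figure. A p-select step would additionally need a t-distribution CDF, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical expression values per gene for two sample groups; numbers are invented.
EXPRESSION = {
    "D13626": {"tumor": [1.0, 2.0, 3.0], "normal": [2.0, 3.0, 4.0]},
    "J04605": {"tumor": [5.0, 6.0, 7.0], "normal": [5.0, 6.0, 7.0]},
}


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_test(cube, group_a, group_b):
    """Apply the t statistic gene by gene across two slices of the sample dimension."""
    return {gene: welch_t(v[group_a], v[group_b]) for gene, v in cube.items()}


T_STATS = t_test(EXPRESSION, "tumor", "normal")
```

Genes whose statistic is far from zero are candidates for differential expression between the two groups.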

Biomedical Data Warehouse System Architecture
Data Sources:
- Clinical data and sample annotations
- Gene functional annotations
- Microarray mRNA expression
- Proteomics protein expression
- Promoter sequences and motifs
- Protein domains & interactome
Data Warehouse (data integration):
- Data extraction, transformation, cleaning & loading
- Metadata capturing & integration
- Data quality control
- Refreshment
Unified Access:
- A standard interface for application tools
- Object-oriented
- Defining basic operators for data access
Application tools: data mining, ad hoc queries, OLAP, cluster analysis, mining gene regulatory networks, interactome prediction, pathway analysis