
VERTICAL DATA INTEGRATION FOR CLINICAL GENOMICS PhD Thesis, cycle XXIII Andrea Calabria CNR - Institute for Biomedical Technologies Università degli Studi.


1 VERTICAL DATA INTEGRATION FOR CLINICAL GENOMICS
PhD Thesis, cycle XXIII
Andrea Calabria
CNR - Institute for Biomedical Technologies
Università degli Studi Milano Bicocca – DISCO
Saturday, January 30, 2016

2 Project Context and Domain
- Genetic Studies
  - Genome Wide Association Studies
  - Family Based and Population Based
- Domain
  - Complex Diseases, focus on Brain dysfunctions and NeuroPathologies: Alzheimer (medium stage), Schizophrenia
- Data Types
  - Personal Data
  - Phenotypes: clinical data, functional magnetic resonance
  - Genotypes: using SNPs (Single Nucleotide Polymorphism)
A.Calabria PhD Thesis XXIII cycle

3 Motivation and Objective
- Data Mining and Genetics Studies (linkage analysis, CNV, etc.) on brain diseases need Data Integration and High Performance Infrastructures in a distributed environment
  - Data Integration must be both Vertical and Horizontal
  - security and privacy policies are required for experimental data
  - a Grid environment addresses both security and privacy issues and distributed computing
- Project's Objective
  - to design Vertical Integration of experimental data in a Grid environment, for genetics studies and data mining analyses

4 Why Grid Environment

Application Layer: Data Mining and Genetic Studies
- Requirements: computational resources and parallelization; web-services oriented
- Properties: Reliability, Availability, Robustness, Scalability
- Objectives: genetics analyses and discovery of brain dependencies (domain related)
- Grid-enabled infrastructure layer: parallelization is problem specific; the Grid natively ensures Reliability, Robustness and Scalability; Availability depends on sites
- To do: adapt algorithms to distributed environments (e.g. linkage analysis); web-services oriented

Data Layer: Horizontal Integration
- Requirements: Security, Privacy, Replication (space)
- Properties: Flexibility, Scalability, Consistency and Quality
- Objectives: to integrate experimental data in a global view (filtering, quality, standard schema); OLTP
- Grid-enabled infrastructure layer: Security and Privacy are granted natively; Consistency and Quality are site related
- To do: global schema study; grid database adaptation (AMGA); replica, quality and distribution management

Data Layer: Vertical Integration
- Requirements: computational resources; web-services oriented
- Properties: Scalability, Updating
- Objectives: to integrate gene knowledge data (DW)
- Grid-enabled infrastructure layer: not necessary; ensures Scalability; updating to be configured
- To do: quality control; gene data fusion (conflicts study)

5 Application Layer – Genetics Studies
- Domain-related algorithms (population genetics studies)
  - linkage analysis: the problem is computationally intensive (NP-hard). Limits are related to the number of markers (<40).
  - Our problem: a chip of 1M SNPs (markers); linkage analysis must be computed for a population with 1M SNPs
  - Solution: a heuristic for distributing linkage analysis
  - Preliminary results on a cluster: 70% average time improvement with respect to a single CPU
- Work in progress
  - porting the algorithm to the grid and comparative performance tests
  - specific monitoring and job control system
- Next steps
  - release linkage analysis on the grid as web services
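
The slide does not say which heuristic is used to distribute linkage analysis. Purely as an illustration, one plausible approach is to cut the marker map into overlapping windows that each stay under the marker limit and to hand the windows to a pool of workers. In the minimal Python sketch below, the names `marker_windows`, `analyse_window` and `run_distributed`, the overlap size and the placeholder computation are all assumptions, not the thesis' method.

```python
from concurrent.futures import ThreadPoolExecutor


def marker_windows(n_markers, window=39, overlap=5):
    """Split marker indices 0..n_markers-1 into overlapping half-open windows.

    window=39 keeps each job under the "<40 markers" limit quoted on the
    slide; the overlap value is an arbitrary assumption.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    windows = []
    start = 0
    while start < n_markers:
        end = min(start + window, n_markers)
        windows.append((start, end))
        if end == n_markers:
            break  # the last window already reaches the final marker
        start += step
    return windows


def analyse_window(bounds):
    """Hypothetical stand-in for the per-window linkage computation."""
    start, end = bounds
    return {"start": start, "end": end, "n_markers": end - start}


def run_distributed(n_markers, window=39, overlap=5, workers=4):
    """Run the placeholder analysis on every window using a worker pool."""
    windows = marker_windows(n_markers, window, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyse_window, windows))
```

On a cluster or grid the thread pool would be replaced by the site's job submission system; the windowing step stays the same.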

6 Data Layer – Horizontal Integration
- Database of Genotypes
  - design and creation of the genotypic database
  - analysis of the HL7 standard
- Work in progress
  - application of HL7 to the global database schema
  - porting the database to the EGEE grid with AMGA
- Next steps
  - study of grid database problems related to distribution and federation, and adaptation of the hub-and-spoke paradigm for extension to a biobank approach
  - testing of data integration

7 Data Layer – Vertical Integration
- Objective
  - integrate gene knowledge with a data fusion approach
- Quality control of gene knowledge
  - predicted genes can present conflicts among the main databases (NCBI, EnsEMBL, UCSC)
  - conflicts could affect analyses
  - the impact of conflicts across the genome needs to be evaluated
- Work in progress and Next Steps
  - data extraction (API, Web Services, DB access, parsing)
  - data integration
  - data fusion: conflict analysis and evaluation
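
The conflict analysis step can be illustrated with a small Python sketch. It assumes that each source provides, for a gene symbol, a chromosome and an inclusive start/end coordinate, and it flags a conflict whenever two sources disagree on the interval. The record layout, the Jaccard-style overlap score and the `min_overlap` threshold are assumptions made for the example; the slides do not define how conflicts are measured.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneRecord(NamedTuple):
    source: str  # e.g. "NCBI", "EnsEMBL", "UCSC"
    symbol: str
    chrom: str
    start: int   # inclusive
    end: int     # inclusive


def overlap_fraction(a, b):
    """Shared bases divided by covered bases; 0.0 if the intervals are disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    union = max(a.end, b.end) - min(a.start, b.start) + 1
    return inter / union


def find_conflicts(records, min_overlap=1.0):
    """Return (symbol, source_a, source_b, score) for every disagreeing pair."""
    by_symbol = defaultdict(list)
    for rec in records:
        by_symbol[rec.symbol].append(rec)
    conflicts = []
    for symbol, recs in sorted(by_symbol.items()):
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                score = overlap_fraction(recs[i], recs[j])
                if score < min_overlap:
                    conflicts.append((symbol, recs[i].source, recs[j].source, score))
    return conflicts


# Made-up coordinates for a placeholder gene, for illustration only.
example = [
    GeneRecord("NCBI", "GENE_A", "chr1", 100, 199),
    GeneRecord("EnsEMBL", "GENE_A", "chr1", 100, 199),
    GeneRecord("UCSC", "GENE_A", "chr1", 150, 249),
]
```

Calling `find_conflicts(example)` reports the two pairs that involve the UCSC record, since the NCBI and EnsEMBL intervals agree exactly. Lowering `min_overlap` would tolerate small boundary differences.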

8 Project Plan
- Linkage algorithm Grid enabling (May-September)
  - grid porting
  - application testing and performance measurements
- Gene-oriented Data quality (September-November)
  - data extraction
  - gene knowledge integration
  - conflicts evaluation
- Database Design for Grid porting (November-March)
  - HL7 schema design
  - AMGA database creation
  - query and data management issues, data import and testing

9 References
- Bibliography
- Publications

