Systems analysis of innate immune mechanisms in infection – a role for HPC Peter Ghazal.

What is Pathway Biology?
Pathway biology is a systems biology approach for understanding a biological process:
- empirically, by functional association of multiple gene products and metabolites
- computationally, by defining networks of cause-effect relationships
Pathway models link the molecular, cellular and whole-organism levels.
FORMAL MODELS allow predicting the outcome of costly or intractable experiments.

Focus and outline of talk
- High-throughput approaches to mapping and understanding the host response to infection
- Targeting the host, NOT the bug, as an anti-infective strategy
- Making HPC more accessible: SPRINT, a new framework for high-dimensional biostatistical computation
- The story starts at the bedside

Differentially expressed genes in neonates: control vs infected (FDR p < 1x10^-5, FC ±4)

Dealing with HTP data: Impact of data variability Model for introducing biological and technical variation:

Modelling patient variability and biomarkers for classification (Mizanur Khondoker)
Machine learning methods:
- Random Forest (RF)
- Support Vector Machine (SVM)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbour (K-NN)
How do different data characteristics affect the misclassification error?
Factors investigated:
- Data variability (biological and technical variation)
- Training set size
- Number of replications
- Correlation between RNA biomarkers

Error rate vs. (number of biomarkers, total variation) An example of a simulation model to quantify number of biomarkers and level of patient variability
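A minimal sketch of this kind of simulation, in base R. It is not the model used in the talk: the biomarkers are simulated as independent, a nearest-centroid classifier stands in for RF, SVM, LDA and K-NN, and every function name and number below (`simulate_group`, `error_rate`, the shift of 1, the grid values) is a hypothetical choice made for illustration.

```r
# Hypothetical simulation: test error of a nearest-centroid classifier as a
# function of the number of biomarkers and the total variation.
set.seed(1)

simulate_group <- function(n, m, shift, sd_bio, sd_tech, n_rep) {
  # Biological variation: one draw per patient and biomarker.
  bio <- matrix(rnorm(n * m, mean = shift, sd = sd_bio), nrow = n)
  # Technical variation: the mean of n_rep replicate measurements has
  # standard deviation sd_tech / sqrt(n_rep).
  tech <- matrix(rnorm(n * m, sd = sd_tech / sqrt(n_rep)), nrow = n)
  bio + tech
}

error_rate <- function(m, sd_bio, sd_tech, shift = 1, n_train = 20,
                       n_test = 200, n_rep = 1) {
  train0 <- simulate_group(n_train, m, 0, sd_bio, sd_tech, n_rep)
  train1 <- simulate_group(n_train, m, shift, sd_bio, sd_tech, n_rep)
  test0 <- simulate_group(n_test, m, 0, sd_bio, sd_tech, n_rep)
  test1 <- simulate_group(n_test, m, shift, sd_bio, sd_tech, n_rep)
  c0 <- colMeans(train0)
  c1 <- colMeans(train1)
  # Assign each test sample to the class with the nearer training centroid.
  predict_class <- function(x) {
    d0 <- rowSums(sweep(x, 2, c0)^2)
    d1 <- rowSums(sweep(x, 2, c1)^2)
    as.integer(d1 < d0)
  }
  mean(c(predict_class(test0) != 0, predict_class(test1) != 1))
}

grid <- expand.grid(n_markers = c(1, 5, 20), sd_total = c(0.5, 1, 2))
grid$error <- mapply(function(m, s) {
  # Split the total variance equally between the two sources.
  error_rate(m, sd_bio = s / sqrt(2), sd_tech = s / sqrt(2))
}, grid$n_markers, grid$sd_total)
print(grid)
```

Extending the grid over training set size, replicate number and (with correlated draws) biomarker correlation would cover the other factors listed for the simulations.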

Conclusions from simulations
There is increased predictive value in using multiple markers, although no magic number can be recommended as optimal in all situations. The optimal number depends greatly on the data under study.
The important determining factors for the optimal number of biomarkers are:
- The degree of differential expression (fold-change, p-values etc.)
- The amount of biological and technical variation in the data
- The size of the training set upon which the classifier is to be built
- The number of replicates for each biomarker
- The degree of correlation between biomarkers
It is now possible to predict the optimal number through simulation.

Rule of five: criteria for pathogenesis-based biomarkers
- Readily accessible
- Multiple markers
- Appropriately powered statistical association
- Physiological relevance
- Causally linked to phenotype
The key challenge is mapping biomarkers into biological context and understanding. This requires an experimental model system.

[Figure: macrophage lineage and activation. Compartments: Bone Marrow; Blood; Tissue. Labels: Pluripotent Stem Cell; Myeloid Stem Cell; Promonocyte; Monocyte; Resident Macrophage (immature); Primed Macrophage; Activated Cytolytic Macrophage; Activated T-Lymphocyte; Lymphokines; Inflammation; (Primary Signal); IFN-gamma; (Secondary Signal); Endotoxin, IFN-gamma]

Transcriptional profile of MΦ activated by Ifng

How do we tackle this? PATHWAY BIOLOGY
A sub-system study of cause-effect relationships with a defined start (input) and end (output).
- Literature, data-mining
- Modelling, network analysis
- Experimentation: genetic screens, microarrays, Y2H, mechanism-based studies

Mapping new nodes. PATHWAY BIOLOGY: Literature, Data-mining, Experimentation

Transcriptional profile of MΦ infected with CMV

Hypothesis generation Blue zone vs red zone

Down regulation of sterol pathway

BUT… recorded changes are small. Do they have any effect? Next step: modelling.
PATHWAY BIOLOGY: pure and applied modelling; network inference analysis; experimental data

Workflow
[Flowchart labels: Literature-derived model; ODE model; Known parameters; Unknown parameters; Order-of-magnitude estimation; Vary parameters by an order of magnitude; Ensemble of ODE models; Results; Ensemble average]

Modelling
Where available, parameters were obtained from the BRENDA enzyme database: http://www.brenda-enzymes.info/
Cholesterol synthesis ODE model with Michaelis-Menten interactions:
- 57 parameters
- 25 known parameters
- 32 unknown parameters
Algorithm:
- Using the first three time points, calculate an equilibrium state
- Release the model from equilibrium and simulate using enzyme data
- For each unknown, consider this model across 3 orders of magnitude, holding the other unknown parameters fixed
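The parameter scan can be sketched on a toy pathway. This is an illustration only, not the 57-parameter cholesterol model: the two-step Michaelis-Menten chain, all parameter values and the names `rhs`, `simulate`, `known` and `unknown` are assumptions, and the equilibrium and enzyme-data steps of the algorithm are left out. Each "unknown" is scaled across three orders of magnitude around an order-of-magnitude guess while the other is held fixed, and the resulting ensemble is averaged.

```r
# Toy pathway (hypothetical, NOT the sterol model):
#   influx -> S --(Vmax1, Km1)--> X --(Vmax2, Km2)--> C --(kdeg)--> out
rhs <- function(y, p) {
  v1 <- p[["Vmax1"]] * y[["S"]] / (p[["Km1"]] + y[["S"]])
  v2 <- p[["Vmax2"]] * y[["X"]] / (p[["Km2"]] + y[["X"]])
  c(S = p[["kin"]] - v1, X = v1 - v2, C = v2 - p[["kdeg"]] * y[["C"]])
}

# Fixed-step fourth-order Runge-Kutta integrator; returns the final state.
simulate <- function(p, y0 = c(S = 1, X = 0, C = 0), t_end = 40, dt = 0.02) {
  y <- y0
  for (i in seq_len(round(t_end / dt))) {
    k1 <- rhs(y, p)
    k2 <- rhs(y + dt / 2 * k1, p)
    k3 <- rhs(y + dt / 2 * k2, p)
    k4 <- rhs(y + dt * k3, p)
    y <- y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  }
  y
}

known <- c(kin = 0.2, Vmax1 = 1, Km1 = 0.5, kdeg = 0.1)  # treated as known
unknown <- c(Vmax2 = 1, Km2 = 1)        # order-of-magnitude guesses
factors <- 10^seq(-1.5, 1.5, by = 0.5)  # spans three orders of magnitude

# One ensemble member per (unknown parameter, scaling factor) pair; the
# other unknown stays at its guessed value.
ensemble <- do.call(rbind, lapply(names(unknown), function(nm) {
  data.frame(parameter = nm, factor = factors,
             C_final = sapply(factors, function(f) {
               u <- unknown
               u[[nm]] <- u[[nm]] * f
               simulate(c(known, u))[["C"]]
             }))
}))
ensemble_mean <- mean(ensemble$C_final)
print(ensemble)
print(ensemble_mean)
```

The spread of `C_final` within each parameter's rows gives a rough indication of how sensitive the pathway output is to that unknown.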

Cholesterol (output of sterol pathway): results from simulation and experiments
[Figure panels: Predictions; Experiments. Plotted quantities: cholesterol rate/flux; cholesterol levels]

Lipidomic – mass spec results

Infection down-regulates the cholesterol biosynthesis pathway and free intracellular cholesterol. Can now predict the behaviour of the pathway. But? Just as good as UK (Met Office) weather predictions… because…

Scalability issues related to increased complexity Increasing complexity and size of biological data Solution: High Performance Computing (HPC)? HPC for High Throughput Post-Genomic Data

Problems with large biological data sets
- Volume of data: many research groups can now routinely generate high volumes of data
- Memory (RAM) handling: input data size is too big; algorithms cause linear, exponential or other growth in data volume
- CPU performance: routine analyses take too long

Limitation examples: clustering
Gene clustering using R on a high-spec workstation:
- 16,000 genes, k=12 gene clusters: runs for ~30 min
- 16,000 genes, k=40 gene clusters: runs for ~10 hrs
[Figure: Partitioning-Around-Medoids, n genes, k=12 clusters requested; memory fail limit marked]

Outcome: adverse effect on research
- Arbitrary size reduction of input data
- Batch processing of data
- Analyses in smaller steps
- Avoidance of some algorithms
- Failure to analyse

Solution: High Performance Computing
- HPC takes many forms: clusters, networks, supercomputers, grid, GPUs, cloud, ...
- Provides more computational power
- HPC is technically accessible for most: department-owned machines, Eddie, HECToR, ...
However!

HPC Access Hurdles
- Cost of access
- Time to adapt
- Complex, requires specialist skills
- Consultancy (e.g. EPCC) is only feasible on an ad-hoc basis, not routinely

HPC Access Hurdles
HPC is (currently) optimal for:
- Specific problems that can be tackled as a project
- Individuals who are familiar with parallelisation and system architectures
HPC is not optimal for:
- Routine/casual analyses of high-throughput data
- Ad-hoc and ever-changing analysis algorithms
- Data analysts without the time or knowledge to sidestep into parallelisation software/hardware

Need a step change (up!) to broaden HPC access to all biologists
The challenge is twofold:
- Provide a generic solution
- Make it easy to use

SPRINT (DPM & EPCC): a solution for analyses using R
- Post-genomic data -> R -> biological results
- Very large post-genomic data -> R with SPRINT, on HPC (Eddie) -> biological results

SPRINT
SPRINT has 2 components:
1. HPC harness: manages access to HPC
2. Library of parallel R functions, e.g. cor (correlation), pam (clustering), maxt (permutation)
Allows non-specialists to make use of HPC resources, with analysis functions parallelised by us or the R community.

Code comparison
Serial R (multtest):
    library(multtest)  # provides the golub data and mt.maxT
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")
Parallel R (SPRINT):
    library(multtest)  # provides the golub data
    library("sprint")
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()
    quit(save="no")

Permutation Benchmark
Input array size    Permutation count    Estimated maxT on 1 CPU    pmaxT on 256 CPUs (s)
36,612 x 76         500,000              6 hrs                      73.18
36,612 x 76         1,000,000            12 hrs                     46.64
73,224 x 76         500,000              10 hrs                     148.46
100,000 x 320       1,000,000            20 hrs                     294.61
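To show why permutation testing is so expensive, here is a simplified single-step maxT sketch in base R. It is not the multtest or SPRINT implementation (`mt.maxT` uses a step-down procedure). The data are simulated, and `row_t`, the group sizes and the permutation count are hypothetical. The cost is one full set of gene-wise statistics per permutation, and the permutations are independent of one another, which makes this kind of work natural to divide among processes.

```r
# Simulated expression matrix: 200 genes x 20 samples, two classes of 10.
set.seed(42)
n_genes <- 200
n_per_group <- 10
x <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
classlabel <- rep(c(0, 1), each = n_per_group)
# Shift five genes in class 1 so that there is something to detect.
x[1:5, classlabel == 1] <- x[1:5, classlabel == 1] + 3

# Row-wise Welch t-statistics for a 0/1 class label vector.
row_t <- function(x, cl) {
  a <- x[, cl == 0, drop = FALSE]
  b <- x[, cl == 1, drop = FALSE]
  va <- apply(a, 1, var)
  vb <- apply(b, 1, var)
  (rowMeans(b) - rowMeans(a)) / sqrt(va / ncol(a) + vb / ncol(b))
}

t_obs <- abs(row_t(x, classlabel))
n_perm <- 500
# One maximum absolute statistic per random permutation of the labels.
max_null <- replicate(n_perm, max(abs(row_t(x, sample(classlabel)))))
# Single-step adjusted p-value: share of permutations whose maximum reaches
# the observed statistic (the +1 counts the observed labelling itself).
p_adj <- sapply(t_obs, function(t) (sum(max_null >= t) + 1) / (n_perm + 1))
print(head(sort(p_adj), 10))
```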

Correlation Benchmark
Input array size           Output array size    pcor() on 256 CPUs (s)
11,000 x 320 (27 MB)       923 MB               4.76
22,000 x 320 (54 MB)       3.6 GB               13.87
35,000 x 320 (85 MB)       9.1 GB               36.64
45,000 x 320 (110 MB)      15 GB                42.18
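The sizes in the correlation benchmark are consistent with a simple estimate: an input of n rows by 320 columns, and an n x n output of pairwise row correlations, both stored as 8-byte doubles and reported in binary units. Those storage details are assumptions, not something the slides state. A short check in R (the name `cor_bytes` is hypothetical):

```r
# Bytes needed for a full n x n matrix of 8-byte doubles.
cor_bytes <- function(n_rows) n_rows^2 * 8

n_rows <- c(11000, 22000, 35000, 45000)
n_cols <- 320
sizes <- data.frame(
  rows = n_rows,
  input_MiB = n_rows * n_cols * 8 / 2^20,
  output_GiB = cor_bytes(n_rows) / 2^30
)
print(round(sizes, 2))
```

The output grows with the square of the number of rows while the input grows only linearly, so memory becomes the limiting factor long before the input file looks large.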

Clustering Benchmarks

Future
- Cloud (confidentiality issues)
- GPU (limitation is data size)

New therapeutic and diagnostic opportunities
[Diagram labels: Viral Interaction Networks; Host Interaction Networks; Virus; Antiviral; Host; Systemic Therapeutic]
Bed, bench, models, almost back to bed

THANK YOU & Acknowledgments to our sponsors

Mathieu Blanc, Steven Watterson, Mizanur Khondoker, Paul Dickinson, Thorsten Forster, Muriel Mewissen, Terry Sloan, Jon Hill, Michal Piotrowski, Arthur Trew
Acknowledgement: EPCC
