Infinite Mixture Model-Based Clustering of DNA Microarray Data Using OpenMP
DNA microarrays
Clustering MA data (not computers)
[Figure labels: Conditions, Clustering, Genes]
Why do I need sooo much computing power?
- Goal: determine the posterior pairwise probabilities that two genes (or two samples) belong to the same cluster
- Infinite mixture models (IMMs) cannot be solved analytically
- Use a sampling method to approximate the posterior probabilities
  - Typically 10,000 iterations
  - Several thousand genes
  - 10 … 200 samples
- Compute O(genes^2 x samples^2) probabilities per iteration
- Some overhead for cluster reassignment and other model parameters
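The sampling idea above can be sketched as follows. This is a minimal illustration, not the presented implementation: it assumes the sampler has already stored one cluster label per gene for every iteration, and the function name `pairwiseCoClustering` and the storage layout are hypothetical. The pairwise posterior probability is approximated by the fraction of iterations in which two genes share a label.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): sampledLabels[t][g] is the
// cluster label drawn for gene g in sampler iteration t.
std::vector<std::vector<double>> pairwiseCoClustering(
    const std::vector<std::vector<int>>& sampledLabels, std::size_t nGenes) {
    for (const auto& labels : sampledLabels) assert(labels.size() == nGenes);

    std::vector<std::vector<double>> prob(nGenes,
                                          std::vector<double>(nGenes, 0.0));
    if (sampledLabels.empty()) return prob;

    const long n = static_cast<long>(nGenes);
    // Each row of the result is independent of the others, so the rows can
    // be divided among threads. Without OpenMP the pragma is ignored and
    // the loop runs sequentially with the same result.
    #pragma omp parallel for
    for (long a = 0; a < n; ++a) {
        for (long b = 0; b < n; ++b) {
            int together = 0;
            for (const auto& labels : sampledLabels) {
                if (labels[a] == labels[b]) ++together;
            }
            // Fraction of iterations in which genes a and b co-cluster.
            prob[a][b] = static_cast<double>(together) / sampledLabels.size();
        }
    }
    return prob;
}
```

Calling `pairwiseCoClustering(samples, 3)` on four stored iterations of three genes returns a 3 x 3 matrix whose diagonal is 1, since every gene always shares a cluster with itself.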
Open Multi-Processing (OpenMP)
- Facilitates parallelization of C++ and Fortran code for shared-memory environments, e.g. multi-processor machines
- Set of compiler directives, environment variables, and library functions
- Platform-independent
- Website: http://www.openmp.org/
- Parallelize sequential code by using compiler directives
  - Relatively small programming effort
  - Reduced risk of programming errors
- Use of shared memory
  - Reduces the communication overhead required to synchronize multiple threads
  - But cannot run threads on multiple nodes
“Hello World” in OpenMP
    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int id, nthreads;
        /* Every thread gets its own copy of id. */
        #pragma omp parallel private(id)
        {
            id = omp_get_thread_num();
            printf("Hello World from thread %d\n", id);
            /* Wait until all threads have printed their greeting. */
            #pragma omp barrier
            if (id == 0) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return 0;
    }
OpenMP examples
Header
    #include <omp.h>
Library functions
    omp_set_num_threads(4);
    /* Note: omp_get_num_threads() returns 1 when called outside a parallel region. */
    printf("number of threads %d\n", omp_get_num_threads());
Compiler directives
    #pragma omp for schedule(dynamic, 1)
    for (j = 0; j <= Q; j++) {
        clusterProbabilities[j] = getProbCsMissing2(i, j, Contexts);
    }
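As a self-contained sketch of the dynamic schedule shown above: `unevenWork` is a hypothetical stand-in for a per-cluster computation such as `getProbCsMissing2`, whose cost differs between iterations. The sketch uses the combined `parallel for` form so that it does not depend on an enclosing parallel region, and it uses no `omp.h` library calls, so it also compiles and gives the same result without OpenMP.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an expensive computation whose cost grows with j
// (here simply the sum 1 + 2 + ... + j).
double unevenWork(int j) {
    double s = 0.0;
    for (int k = 1; k <= j; ++k) s += k;
    return s;
}

std::vector<double> computeAll(int Q) {
    assert(Q >= 0);
    std::vector<double> result(Q + 1);
    // schedule(dynamic, 1): a thread fetches one iteration at a time and
    // asks for the next one when it finishes, which balances the load when
    // iterations take different amounts of time.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int j = 0; j <= Q; ++j) {
        result[j] = unevenWork(j);
    }
    return result;
}
```

With a static schedule, the thread that receives the largest values of `j` would finish last; handing out iterations one by one trades a little scheduling overhead for a more even distribution.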
OpenMP examples
More compiler directives
    #pragma omp for
    for (j = 0; j <= Q; j++) {
        …
        #pragma omp critical
        sigmas[i][j] = 1.0 / gengam(beta[i] * v[i] / 2.0, beta[i] / 2.0);
    }
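One common reason for a critical section like the one above is a routine that keeps hidden shared state, such as a random number generator. The sketch below assumes that situation: `nextDraw` and `sharedState` are hypothetical stand-ins (a toy linear congruential generator), not the `gengam` routine from the slide.

```cpp
#include <cassert>
#include <vector>

// Hypothetical generator with hidden shared state, standing in for a
// routine that is not safe to call from several threads at once.
static unsigned long long sharedState = 1;

unsigned long long nextDraw() {
    sharedState = (sharedState * 1103515245ULL + 12345ULL) % 2147483648ULL;
    return sharedState;
}

std::vector<unsigned long long> drawAll(int n) {
    assert(n >= 0);
    std::vector<unsigned long long> draws(n);
    #pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        unsigned long long value;
        // Only one thread at a time may advance the generator state.
        #pragma omp critical
        value = nextDraw();
        // Writing to distinct elements needs no protection.
        draws[j] = value;
    }
    return draws;
}
```

The generator still produces the same sequence of values as in a sequential run, but which loop index receives which value depends on the order in which threads enter the critical section.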
OpenMP examples
More compiler directives
    …
… same as …
    int i;
    #pragma omp parallel for private(i, pos)
    for (i = 0; i < T; i++) {
        …
    }
… same as …
    #pragma omp parallel for private(pos)
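The equivalence above can be checked with a small sketch. The function names and the loop body are hypothetical; the first version opens a parallel region and places a work-sharing `for` inside it, the second uses the combined directive and relies on the loop variable being private automatically.

```cpp
#include <cassert>
#include <vector>

// Separate parallel region with a work-sharing loop inside it.
std::vector<int> squaresSplit(int T) {
    assert(T >= 0);
    std::vector<int> out(T);
    int i, pos;
    #pragma omp parallel private(i, pos)
    {
        #pragma omp for
        for (i = 0; i < T; ++i) {
            pos = i * i;   // pos is a per-thread scratch variable
            out[i] = pos;
        }
    }
    return out;
}

// Combined directive: the loop variable i is private without being listed.
std::vector<int> squaresCombined(int T) {
    assert(T >= 0);
    std::vector<int> out(T);
    int i, pos;
    #pragma omp parallel for private(pos)
    for (i = 0; i < T; ++i) {
        pos = i * i;
        out[i] = pos;
    }
    return out;
}
```

Both functions return the same vector; leaving `pos` out of the `private` clause would make it shared and introduce a race between threads.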
Some more compiler directives
Reduction
    #pragma omp for reduction(+:sum)
    Combines the per-thread partial results into the shared variable "sum" (in Fortran the directive is !$omp do)
Parallel region
    #pragma omp parallel
    {
        …
    }
Sections
    #pragma omp sections
    {
        #pragma omp section
        Code block 1
        #pragma omp section
        Code block 2
    }
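A short sketch of the reduction and sections directives listed above. The function names are hypothetical and the tasks (summing a vector, finding its minimum and maximum) are chosen only for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Reduction: every thread accumulates into its own private copy of sum;
// the copies are added into the shared variable when the loop ends.
double parallelSum(const std::vector<double>& x) {
    double sum = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}

// Sections: two independent code blocks that may run on different threads.
std::pair<double, double> minAndMax(const std::vector<double>& x) {
    assert(!x.empty());
    double lo = x[0];
    double hi = x[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (double v : x) if (v < lo) lo = v;   // code block 1
        }
        #pragma omp section
        {
            for (double v : x) if (v > hi) hi = v;   // code block 2
        }
    }
    return std::make_pair(lo, hi);
}
```

Because floating-point addition is not associative, a parallel reduction can differ from the sequential sum in the last digits when the inputs are not exactly representable; the two sections are safe to run concurrently because each writes to a different variable.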
The making-of
Start an interactive session
    jfreuden@fructose:~> qsub -I -l nodes=1:opteron
Intel compiler
    jfreuden@bmi-opt2-01:~> module load openmpi-intel
    jfreuden@bmi-opt2-01:~> icpc -w -openmp
g++ compiler
    jfreuden@bmi-opt2-01:~> module load gcc-4.2.3
    jfreuden@bmi-opt2-01:~> g++ -fopenmp
The batch file
    #PBS -S /bin/csh
    #PBS -l nodes=1:opteron:ppn=2
    #PBS -l walltime=18:00:00
    #PBS -e /users/jfreuden/runGimm/stderr.txt
    #PBS -o /users/jfreuden/runGimm/stdout.txt
    setenv OMP_NUM_THREADS `cat $PBS_NODEFILE | grep $HOST | wc -l`
    module load intel
    cd /users/jfreuden/runGimm/
    R CMD BATCH runGimm.R
Simulation study: Non-informative samples
- 4 gene clusters of sizes 20, 20, 80, and 80
- 3 sample clusters of size 5
- Additional samples m+ = 5, 10, 20, 50, 100
  - No change in expression
  - Same noise level
- 100 repeats for each level
Questions? Comments?
Additional Slides
Clustering D’haeseleer (2005)
Example for Gibbs Sampling: BUGS
Simulation study: Simple Case
Simulation study: ‘Time course 1’
Simulation study: ‘Time course 2’
Simulation study: Non-informative samples