Servers, R and Wild Mice Robert William Davies Feb 5, 2014.


1 Servers, R and Wild Mice Robert William Davies Feb 5, 2014

2 Overview
1 - How to be a good server citizen
2 – Some useful tricks in R (including ESS)
3 – Data processing – my wildmice pipeline

3 1 - How to be a good server citizen
Three basic things
- CPU usage: cat /proc/cpuinfo, top, htop
- RAM (memory) usage: top, htop
- Disk IO and space: iostat, df -h

4 cat /proc/cpuinfo
rwdavies@dense:~$ cat /proc/cpuinfo | head
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD Opteron(tm) Processor 6344
stepping : 0
microcode : 0x600081c
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 0
rwdavies@dense:~$ cat /proc/cpuinfo | grep processor | wc -l
48

5 htop and top
- 48 cores
- Load average: averaged over the last 1, 5 and 15 minutes
- RAM: 512 GB total, 142 GB in use (rest free)
- Memory can be in RAM or in swap
- Swap is BAD. Ideal use is 0 (in this case it is probably residual)
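The swap figure that htop displays can also be read straight from /proc/meminfo. A minimal sketch, assuming a Linux server; the function name `swap_used_kb` is made up for illustration, and the optional file argument exists only so the function can be tried on a saved copy of the file:

```shell
# swap_used_kb [FILE]: print swap in use (kB) as SwapTotal - SwapFree.
# FILE defaults to /proc/meminfo (Linux only); the function name is illustrative.
swap_used_kb() {
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { free = $2 }
    END { print total - free }
  ' "${1:-/proc/meminfo}"
}

# Report current swap use if this machine exposes /proc/meminfo.
if [ -r /proc/meminfo ]; then
  echo "swap in use: $(swap_used_kb) kB (ideal is 0)"
fi
```

A non-zero value is not automatically a problem (it can be residual, as on the slide), but a value that keeps growing while your job runs is a sign the job does not fit in RAM.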

6 Disk use
rwdavies@dense:~$ iostat -m -x 2
- High sequential reading (fast!)
- Relatively unused
- Also note process state: D = uninterruptible sleep (usually blocked waiting on IO)
- There are also ways to optimize disk use for different IO requirements on a server: ask Winni

7 Disk usage
- Get sizes of directories
- Get available disk space
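The commands on this slide were shown as screenshots, so the following is only a sketch of one common way to get both numbers, assuming standard `du` and `df`. The function names `dir_sizes` and `free_kb` are made up for illustration:

```shell
# dir_sizes DIR: size in kB of each entry directly under DIR, smallest first.
# (On GNU systems, du -sh piped to sort -h gives human-readable sizes instead.)
dir_sizes() {
  du -sk -- "$1"/* | sort -n
}

# free_kb DIR: kB still available on the filesystem that holds DIR.
free_kb() {
  df -Pk -- "$1" | awk 'NR == 2 { print $4 }'
}

# Example: dir_sizes /data/myproject | tail -n 5   (hypothetical path, five largest entries)
echo "free space here: $(free_kb .) kB"
```

Running `du` over a very large tree is itself an IO-heavy job, so it is worth pointing it at a specific project directory rather than a whole disk.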

8 How to be a good server citizen – Take away
CPU usage
- Different servers, different philosophies
- At minimum, try for load <= number of cores
RAM
- High memory jobs can take down a server very easily and will make others mad at you. Best to avoid
Disk IO
- For IO bound jobs you often get better combined throughput from running one or a few jobs than many in parallel
- Also don’t forget to try and avoid clogging up disks
P.S. A good server uptime is 100%
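The "load <= number of cores" rule can be checked before launching a job. A minimal sketch, assuming a Linux server with /proc/loadavg; the function name `load_ok` is made up for illustration, and which load average to compare (1, 5 or 15 minutes) is a choice, not something the slides specify:

```shell
# load_ok LOAD CORES: succeed when LOAD <= CORES.
load_ok() {
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 <= cores + 0) }'
}

# Current values, where available: 1-minute load from /proc/loadavg (Linux),
# core count from nproc (GNU coreutils) or getconf as a fallback.
if [ -r /proc/loadavg ]; then
  read -r load1 _ < /proc/loadavg
  cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  if load_ok "$load1" "$cores"; then
    echo "load $load1 <= $cores cores: room to run jobs"
  else
    echo "load $load1 > $cores cores: server is oversubscribed, wait"
  fi
fi
```

For example, `load_ok 40 48` succeeds, which matches the 48-core server at a load of 40 mentioned later in the talk.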

9 2 – Some useful tricks in R (including ESS)
R is a commonly used programming language / statistical environment
Pros
- (Almost) everyone uses it
- Very easy to use
Cons
- It’s slow
- It can’t do X
But! R can be faster, and it might be able to do X! Here I’ll show a few tricks

10 ESS (Emacs Speaks Statistics)
- Emacs is a general purpose text editor
- Lots of programs exist for using R
- There exists an extension to Emacs called ESS that lets you use R within Emacs
- This allows you to analyze data on a server very nicely, using a split screen environment and keyboard shortcuts to run your code

11 I have code on the left, an R terminal on the right
- Running a line of code: ctrl-c ctrl-j
- Running a paragraph of code: ctrl-c ctrl-p
- Switching windows: ctrl-x o

12 Google: ESS cheat sheet
http://ess.r-project.org/refcard.pdf
C- = ctrl
M- = meta (the option key on a Mac)

13 R – mclapply
- lapply: apply a function to members of a list
- mclapply: do it multicore!
- Note there exists a spawning cost depending on memory of current R job
- Not 19X faster due to chromosome size differences
- Also I ran this on a 48 core server with a load of 40

14 /dev/shm
- (This might be an Ubuntu Linux thing?)
- Not really an R thing but can be quite useful for R multicore
- /dev/shm uses RAM but operates like disk for many input / output things
- Example: You loop on 2,000 elements, each of which generates an object of size 10 MB. You can pass that all back at once to R (very slow) or write to /dev/shm and read the files back in (faster)
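/dev/shm is a RAM-backed tmpfs mount found on most Linux distributions, not only Ubuntu. The write-then-read-back pattern is not specific to R, so here is a rough shell sketch of it with four toy "workers". The helper name `scratch_base`, the `job.XXXXXX` directory name, and the fallback to a disk temp directory are assumptions for illustration:

```shell
# scratch_base: print /dev/shm when it is a writable directory, else a disk temp dir.
scratch_base() {
  if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    echo /dev/shm
  else
    echo "${TMPDIR:-/tmp}"
  fi
}

# Each "worker" writes its result to its own file; the parent reads them back.
work=$(mktemp -d "$(scratch_base)/job.XXXXXX")
for i in 1 2 3 4; do
  ( echo "result from element $i" > "$work/out.$i" ) &   # one background job per element
done
wait                                   # block until every worker has finished
cat "$work"/out.*                      # aggregate step: read the files back in
n_results=$(ls "$work" | wc -l)
rm -rf "$work"                         # /dev/shm is RAM: always clean up
```

Files left behind in /dev/shm keep occupying memory until they are deleted or the machine reboots, so the cleanup step matters for the other users of the server.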

15 ff
- Using save, load is an easy way to interact with data most of the time
- ff allows you to use variables like a pointer to RAM
- Bonus: you can use mclapply on different entries without collisions! (same entries = collisions)

16 Rcpp
- Some things R is not very good at, like for loops
- Simple example of C++ use in R: for a reference genome (say 60,000,000), coded as integer 0 to 3, determine the number of each possible K-mer of size K (ignoring converses for now)
- I write the C++ code as a text vector
- I then “compile” it here (takes ~1-3 seconds for simple (<1000 line) things)
- I often just pass in lists, pass out lists
- You can call fancy R things from C++
- (Note that upon making this slide I realized there is an easy way to do this using vectors: table(4^0 * ref[seq(1,refT-K)] + 4^1 * ref[seq(2,refT-K+1)] + … ) and adjusting for NAs)
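The vectorized expression on this slide works by turning every window of K bases into one base-4 integer, with the first base as the least significant digit, and then tabulating those integers. This is not the author's Rcpp code; it is a small awk sketch of the same encoding on a toy input, with the function name `kmer_counts` made up for illustration and NA handling left out:

```shell
# kmer_counts K: read whitespace-separated integers in 0..3 from stdin and
# print "code count" for every K-mer code that occurs, sorted by code.
# code = 4^0 * ref[i] + 4^1 * ref[i+1] + ... + 4^(K-1) * ref[i+K-1]
kmer_counts() {
  tr -s ' \t\n' '\n' | awk -v K="$1" '
    NF { ref[++n] = $1 }
    END {
      for (i = 1; i + K - 1 <= n; i++) {
        code = 0
        for (j = 0; j < K; j++) code += (4 ^ j) * ref[i + j]
        count[code]++
      }
      for (code in count) print code, count[code]
    }
  ' | sort -n
}

# Toy reference 0 1 0 1 0 with K=2: windows (0,1) (1,0) (0,1) (1,0)
# encode to 4, 1, 4, 1, so this prints "1 2" then "4 2".
echo "0 1 0 1 0" | kmer_counts 2
```

On a real 60,000,000-base reference an interpreted double loop like this is the slow pattern the slide is warning about, which is why the talk moves it into C++ or into a single vectorized expression.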

17 Complicated example I actually use
Want to do an EM HMM with 2000 subjects and up to 500,000 SNPs with 30 rounds of updating
Using R: pretty fast, fairly memory efficient
- Use Rcpp to make C++ forward backward code in R
- For iteration from 1 to 30
  - mclapply on 2000 subjects: run using Rcpp code, write output to /dev/shm/
  - Aggregate results from /dev/shm to get new parameters
- Write output (huge) to ff for easy downstream analysis

18 2 – Some useful tricks in R (including ESS) – Take away
A lot of people complain about R being slow but it’s really not that slow
- Many packages such as Rcpp, ff, multicore, etc, let your code run much faster
- Also, vector or matrix based R is pretty much as fast as anything
If 1 month after starting to use R you are still copying and pasting, stop what you’re doing and take a day to learn ESS or something similar
- If you don’t use R often you can probably ignore this
- (P.S. I copied and pasted for 2 or 3 years before using ESS)

19 3 – Data processing – my wildmice pipeline
We have data on 69 mice
Primary goals of this study
- Recombination
  - Build rate maps for different subspecies
  - Find motifs
- Population genetics
  - Relatedness, history, variation
  - Admixture

20 [Figure: samples grouped by subspecies (M. m. domesticus, M. m. castaneus, M. m. musculus)]
Group labels: N=1, 40X, WDIS (three groups); N=20, 10X, wild (two groups); N=10, 30X, wild; N=13, 40X, lab strains
WDIS = wild derived inbred strain

21 6 pops: 20 French, 20 Taiwan, 10 Indian, 17 Strains, 1 Fam, 1 Caroli
- bwa aln -q 10
- Stampy --bamkeepgoodreads
- Add read group info
- Merge into library level BAM using Picard MergeSamFiles
- Picard MarkDuplicates
- Merge into sample level BAM
- Use GATK RealignerTargetCreator on each population
- Realign using GATK IndelRealigner per BAM
- Use GATK UnifiedGenotyper on each population to create a list of putative variant sites
- GATK BaseRecalibrator to generate recalibration tables per mouse
- GATK PrintReads to apply recalibration
- 69 analysis ready BAMs!

22 Example for 1 Mus caroli (~2.5 GB genome, ~50X coverage)
- Downloaded 95 GB of gzipped .sra (15 files)
- Turned back into FQs (relatively fast) (30 files)
- bwa: about 2 days at 40 AMD cores (86 GB output, 30 files)
- Merged 30 -> 15 files (215 GB)
- stampy: cluster 3, about 2-3 days, 1500 jobs (293 GB output, 1500 files)
- Merge stampy jobs together, turn into BAMs (220 GB, 15 files)
- Merge library BAMs together, then remove duplicates per library, then merge and sort into final BAM (1 output, took about 2 days, 1 AMD). Output: 1 BAM, 170 GB
- Indel realignment, find intervals: 16 Intel cores, fast (30 mins)
- Apply realignment: 1 Intel core, slower. Output: 1 BAM, 170 GB
- BQSR, call putative set of variants: 16 Intel cores (<2 hours)
- BQSR, generate recalibration tables: 16 Intel cores, 10.2 hours (note: used relatively new GATK which allows multi-threading for this)
- BQSR, output: 1 Intel core, 37.6 hours. Output: 1 BAM, 231 GB
NOTE: GATK also has scatter-gather for cluster work. Probably worthwhile to investigate if you’re working on a project with 10T+ data

23 Wildmice – calling variants
We made two sets of callsets
- 3 population specific (Indian, French, Taiwanese), principally for estimating recombination rate
  - FP susceptible, so prioritize low error at the expense of sensitivity
- Combined, for pop gen
We used the GATK to call variants and VQSR to filter

24 What is the VQSR? (Variant Quality Score Recalibrator)
- Take raw callset
- Split into known and novel (array, dbSNP, etc)
- Fit a Gaussian Mixture Model on QC parameters on known
- Keep the novel that’s close to the GMM, remove if far away
Ti/Tv -> Expect ~2.15 genome wide, higher in genic regions

25 We chose a dataset for recombination rate estimation with low error rate but still a good number of SNPs
Population | Training | Sensitivity | HetsInHomE | chrXHetE | nSNPs | TiTv | arrayCon | arraySen
French | Array Filtered | 95 | 0.64 | 1.97 | 12,957,830 | 2.20 | 99.08 | 94.02
French | Array Filtered | 97 | 0.72 | 2.28 | 14,606,149 | 2.19 | 99.07 | 96.01
French | Array Filtered | 99 | 1.12 | 3.62 | 17,353,264 | 2.16 | 99.06 | 98.09
French | Array Not Filt | 95 | 2.06 | 5.82 | 18,071,593 | 2.14 | 99.07 | 96.58
French | Array Not Filt | 97 | 2.97 | 8.24 | 19,369,816 | 2.10 | 99.07 | 98.01
French | Array Not Filt | 99 | 6.11 | 15.73 | 22,008,978 | 2.01 | 99.06 | 99.20
French | 17 Strains | 95 | 1.29 | 3.89 | 16,805,717 | 2.14 | 99.07 | 93.49
French | 17 Strains | 97 | 2.20 | 6.52 | 18,547,713 | 2.11 | 99.07 | 96.49
French | 17 Strains | 99 | 4.19 | 11.63 | 20,843,679 | 2.04 | 99.06 | 98.62
French | Hard Filters | NA | 5.36 | 16.37 | 19,805,592 | 2.06 | 99.09 | 96.96
Columns
- Sensitivity: you set this. How much of your training set do you want to recover
- HetsInHomE: look at homozygous regions in the mouse. How many hets do you see
- chrXHetE: look at chromosome X in males. How many hets do you see
- nSNPs: number of SNPs
- TiTv: transition transversion ratio. Expect ~2.15 for real, 0.5 for FP
- arrayCon: concordance with array genotypes
- arraySen: sensitivity for polymorphic array sites
Notes
- VQSR sensitivity not always “calibrated”
- It’s a good idea to benchmark your callsets and decide on the one with the parameters that suit the needs of your project (like sensitivity (finding everything) vs specificity (being right))
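The 0.5 expected for false positives follows from counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions, so errors that hit bases at random give about 4/8. A minimal sketch of the ratio itself, assuming input with one SNP per line as "REF ALT"; the function name `titv` and that input layout are made up for illustration and are not the format the GATK tools use:

```shell
# titv: read "REF ALT" pairs on stdin and print transition/transversion counts
# and their ratio. Transitions are A<->G and C<->T; any other base change
# between A, C, G, T is a transversion. Other lines are ignored.
titv() {
  awk '
    { ref = toupper($1); alt = toupper($2); pair = ref alt }
    pair ~ /^(AG|GA|CT|TC)$/ { ti++; next }
    pair ~ /^[ACGT][ACGT]$/ && ref != alt { tv++ }
    END {
      if (tv > 0) printf "ti=%d tv=%d titv=%.2f\n", ti, tv, ti / tv
      else        printf "ti=%d tv=%d titv=NA\n", ti, tv
    }
  '
}

# Four transitions and two transversions: prints "ti=4 tv=2 titv=2.00"
printf 'A G\nC T\nG A\nT C\nA C\nG T\n' | titv
```

In the table above, TiTv drifting down from 2.20 towards 2.01 as sensitivity goes from 95 to 99 is the pattern you would expect if the extra calls contain a larger share of false positives.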

26 Some of the datasets are extremely big
Population | Training | Sensitivity | HetsInHomE | chrXHetE | nSNPs | TiTv | arrayCon | arraySen
Taiwan | Array Not Filt | 95 | 2.05 | 11.20 | 36,344,063 | 2.12 | NA | NA
Taiwan | Array Not Filt | 97 | 2.87 | 14.67 | 39,183,932 | 2.10 | NA | NA
Taiwan | Array Not Filt | 99 | 6.34 | 25.57 | 42,864,322 | 2.05 | NA | NA
Taiwan | 17 Strains | 95 | 1.83 | 10.32 | 29,748,456 | 2.11 | NA | NA
Taiwan | 17 Strains | 97 | 2.16 | 11.20 | 34,112,325 | 2.11 | NA | NA
Taiwan | 17 Strains | 99 | 3.66 | 15.80 | 39,549,666 | 2.08 | NA | NA
Taiwan | Hard Filters | NA | 6.11 | 19.44 | 33,692,857 | 2.04 | NA | NA
Indian | Array Not Filt | 95 | 1.11 | 1.80 | 66,190,390 | 2.18 | NA | NA
Indian | Array Not Filt | 97 | 1.59 | 2.57 | 71,134,757 | 2.16 | NA | NA
Indian | Array Not Filt | 99 | 3.70 | 5.56 | 78,220,348 | 2.11 | NA | NA
Indian | 17 Strains | 95 | 0.67 | 1.16 | 57,674,209 | 2.18 | NA | NA
Indian | 17 Strains | 97 | 1.09 | 1.63 | 65,981,654 | 2.17 | NA | NA
Indian | 17 Strains | 99 | 2.63 | 3.31 | 75,103,886 | 2.13 | NA | NA
Indian | Hard Filters | NA | 5.41 | 72.61 | 78,487,616 | 2.10 | NA | NA
All | Array Not Filt | 95 | 1.90 | 8.95 | 140,827,810 | 2.04 | 99.07 | 96.74
All | Array Not Filt | 97 | 2.38 | 13.99 | 160,447,255 | 2.03 | 99.07 | 98.20
All | Array Not Filt | 99 | 4.52 | 22.73 | 184,977,157 | 1.99 | 99.06 | 99.36
Combined datasets allow us to better evaluate differences between populations
Notes
- VQSR sensitivity not always “calibrated”
- Be VERY skeptical of the work of others wrt sensitivity and specificity that depends on NGS. Different filtering on different datasets can often explain a lot

27 Huge Taiwan and French bottleneck, India OK
[Figure panels: Taiwan, France, India; homozygosity = red]
French and Taiwanese very inbred, not so for the Indian mice

28 Admixture / introgression common
Recent admixture is visible in French and Taiwanese populations

29 French hotspots are cold in Taiwan and vice-versa
- Our Domesticus hotspots are enriched in an already known Domesticus motif
- Broad scale correlation is conserved between subspecies, like in humans vs chimps

30 Conclusions
1 – Don’t crash the server
2 – There are tricks to make R faster
3 – Sequencing data is big, slow and unwieldy. But it can tell us a lot

31 Acknowledgements
- Simon Myers: supervisor
- Jonathan Flint, Richard Mott: collaborators
- Oliver Venn: recombination work for wild mice
- Kiran Garimella: GATK guru
- Cai Na: pre-processing pipeline
- Winni Kretzschmar: ESS, and many other things he does I copy
- Amelie Baud, Binnaz Yalcin, Xiangchao Gan and many others for the wild mice

