Presentation on theme: "Bioinformatics Tips NGS data processing and pipeline writing"— Presentation transcript:
1 Bioinformatics Tips NGS data processing and pipeline writing
Na Cai, 3rd year DPhil in Clinical Medicine. Supervisor: Jonathan Flint
2 Example projects
CONVERGE: 1.7x whole genome sequencing in 12,000 Han Chinese women; 6000 cases of MD, 6000 controls; detailed questionnaire; 45T of sequencing data.
Commercial Outbred Mice: 0.1x whole genome sequencing in 2,000 mice; known breeding history; extensive phenotyping; 2T of sequencing data.

Hi everyone, I am Na, a 3rd year DPhil student in the Flint group working on two low coverage whole genome sequencing projects, and I have been involved in processing both sets of sequencing data. Like many other projects you may encounter in Oxford, these are big projects that collect a large amount of data from a large sample, and hence involve a lot of people, time, money, logistics and data, and finally data analysis. For the CONVERGE project, after 5 years of genetic and phenotypic data collection from 30 cities and 60 hospitals in China, we have 1x whole genome sequencing data from 12,000 Chinese women, half of whom are patients with major depression and half of whom are matched controls, amounting to about 45T of data. For the Commercial Outbred Mice project, we have 0.1x whole genome sequencing and a wide range of phenotypic data for 2,000 mice which were bought and bred in house (actually by collaborators in Harwell) and taken through a standardised phenotyping pipeline until culling and dissection, again amounting to 2T of data. These are large amounts of data, and the moment I started processing them I realised there are quite a lot of practical things to think about, because any processing step requires a large amount of computing resource, extra storage space, and time.
3 NGS data processing
This is the Best Practices recommendation from the Genome Analysis Toolkit (GATK), one of the most commonly used and well referenced software packages for processing next generation sequencing data; I will come back to its steps in detail later.

Usually we are interested in identifying the variants in a sample, and either using them as markers for studying population genetics or as candidate sites for association with certain phenotypes, so what we want to do with the sequencing data is call variants as accurately as we can. We start off at the data preprocessing stage. Assuming you already have bam files (Richard covered read mapping yesterday, right?) that are already mapped to a reference and cleaned of duplicates, there are still two more steps you can take to make the bam files "better" for variant discovery, namely realignment of reads around insertions and deletions, and recalibration of base qualities. Then we get to the actual variant calling, which gives us the raw callset on which we do further recalibration in an attempt to separate the true positives from the false positives. This is especially important for low coverage sequencing, where a true variant and a sequencing error may be indistinguishable.

There are also some features of this workflow that affect how we would implement it on any dataset, given its size and the computing resources available. Firstly, there are many steps involved, almost every one of which generates an output that the next step will use. This fits very nicely into a pipeline of sequential steps, where jobs in the first step have to finish running before those of the second are run, and timestamps of all outputs are checked as well to make sure they came in sequential order. Secondly, at each step we need to read through the entire dataset, write out a better version of it, or both.
We therefore need to think about ways to make both reading in and writing out efficient by partitioning the data into sensible chunks, such that a) the same process can be run on different parts of the data in parallel, and b) each small process does not have a huge overhead. Thirdly, this pipeline is fairly standard, so it can be applied to many different datasets; but for any one dataset it is unlikely to be used more than a few times (or just once, if everything is done right the first time round). This means one can either quickly write a bash script with all the file names hardcoded, use it once and never bother with it again, or invest more time in a pipeline with no hardcoded file names but user friendly options, such that the code can be directly used or easily adapted for other projects. I will now show you examples of both these approaches. Taken from:
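To make the partitioning idea concrete, here is a minimal Python sketch of cutting a genome into fixed-size windows so that each window can become an independent, parallel job (the chromosome name and length are just illustrative values, not taken from either project):

```python
def windows(chrom_lengths, size):
    """Yield (chrom, start, end) chunks so each can be processed in parallel."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, size):
            yield chrom, start, min(start + size, length)

# e.g. one ~61 Mb chromosome cut into 10 Mb jobs
chunks = list(windows({"chr19": 61_431_566}, 10_000_000))
print(len(chunks))  # 7
```

Each chunk is small enough to keep per-job memory and runtime modest, while the number of chunks is large enough to keep a cluster busy.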
4 One step at a time processing
Make new directories as you go along. Make flag files to indicate successful completion of the previous command. Parallelize using make. This is good for step by step troubleshooting.

This is a small piece of bash code from the outbred mice project that I used to create the make file that I eventually used to re-map the bam files from one reference (mm9) to the updated reference (mm10). The first step in the re-mapping pipeline is to take the bam files mapped to mm9 and turn them back into raw fastq files using Picardtools (which I will introduce later). As you can see, it is a simple for loop that loops through all of the bam files mapped to mm9, grabs the animal id part of each filename, makes a new output directory for each animal, then writes out to a text file the command that turns the bam file into one or more pairs of fastq files, depending on how many read groups constitute the reads of that particular animal. I made a directory for each animal to 1) not mix up read groups from different animals, and 2) avoid having all of the fastqs of all animals in one huge directory with tens of thousands of files, which could trip up downstream processing even when everything is unambiguously named (it's way too troublesome; just create separate directories). I then have a small helper python script that turns the text files into a make file, after which I run make with parallelisation. I have many of these small scripts for the remapping effort because 1) they are very quick to write, 2) they are very easy to troubleshoot, 3) apart from the input and output files, there aren't any other files or optional parameters this step needs, and 4) I would probably only use each one once.
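The original bash loop and helper script are not reproduced on the slide, so here is a hedged Python sketch of the same idea: one make rule per animal, with a directory per animal and a flag file touched on success so make can skip finished work. The file names, the `picard.jar SamToFastq` invocation, and the `done.flag` convention are all illustrative, not the project's actual code.

```python
import glob
import os

def write_makefile(bam_dir, out_dir, makefile="bam2fastq.mk"):
    """Emit one rule per bam: make the animal's directory, run SamToFastq,
    then touch a flag file so make knows the step completed."""
    rules, flags = [], []
    for bam in sorted(glob.glob(os.path.join(bam_dir, "*.bam"))):
        animal = os.path.splitext(os.path.basename(bam))[0]
        animal_dir = os.path.join(out_dir, animal)
        flag = os.path.join(animal_dir, "done.flag")
        flags.append(flag)
        rules.append(
            f"{flag}: {bam}\n"
            f"\tmkdir -p {animal_dir}\n"
            f"\tjava -jar picard.jar SamToFastq I={bam} "
            f"FASTQ={animal_dir}/{animal}_1.fq "
            f"SECOND_END_FASTQ={animal_dir}/{animal}_2.fq\n"
            f"\ttouch {flag}\n"
        )
    with open(makefile, "w") as fh:
        fh.write("all: " + " ".join(flags) + "\n\n" + "\n".join(rules))
```

Running `make -j 20 -f bam2fastq.mk` then parallelises across animals, and rerunning after a failure only redoes the animals whose flag files are missing.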
5 Pipeline writing – Ruffus
If I had wanted to write a pipeline with many more steps, that could be reused, and that involved a lot more options than just the input and output files, I would have used Ruffus, a Python module for running computational pipelines. I have put the link to the Python website on the powerpoint, and very conveniently the person who developed Ruffus is right here in our building, so you can bug him any time for a detailed tutorial on Ruffus. Ruffus has also changed a lot since that website was last updated, and I hear it is now easier to write and at the same time more versatile, so it would be really good to talk to Leo when you need to use Ruffus to get a proper introduction to all of its new functionality. Here I will just show you a small sample of what it can do.
6 Setting up Ruffus
When I started using Ruffus about 2 years ago, writing a Ruffus pipeline was a very involved process: you had to import all the right module dependencies and Python libraries, put in all the options with their descriptions and default values, make the log files, actually parse in all the data, and finally write the scripts for each step of the pipeline. Of course you build up your pipeline over time, and because pipeline managing modules keep track of file dependencies and timestamps of outputs, adding a step to the pipeline usually doesn't affect the parts of the pipeline you may have already finished running.
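The timestamp bookkeeping described above is the heart of what Ruffus (like make) automates. As a toy, stdlib-only illustration of that behaviour, not actual Ruffus code, a step only reruns when its output is missing or older than any of its inputs:

```python
import os

def out_of_date(target, sources):
    """A step must rerun if its output is missing or older than any input."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(s) > t for s in sources)

def run_pipeline(steps):
    """steps: list of (sources, target, action), run strictly in order,
    so upstream outputs exist before downstream steps check them."""
    for sources, target, action in steps:
        if out_of_date(target, sources):
            action(sources, target)
```

Rerunning the whole pipeline after adding a new step therefore skips every step whose output is already up to date, which is exactly why adding a step doesn't disturb the finished parts.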
7 Once Ruffus is set up – Help
Once you have written everything, you will have created a user friendly programme that can be used by anyone (assuming your descriptions of the options are helpful), run on any machine (assuming it has ruffus and all the relevant libraries installed), on any data (assuming it has the same structure).
8 Once Ruffus is set up – just print
You can also use the -n (just print) option to check which parts of the pipeline have or have not been run, and increasing the verbosity of the just print option lets you look at more details of the jobs to be run.
9 NGS data processing
Now we come back to this data processing workflow. Assuming you already have bam files with reads mapped to a reference genome, there are a number of things to consider, and perhaps to do to the bam files, before they are pushed into the pre-processing steps. Taken from:
10 Processing a raw BAM file
Things to consider:
How many samples one is processing
Coverage per sample
Ploidy of subjects
Size of genome
Source of DNA and possible contamination
Server/cluster usage: how the jobs can be parallelized
11 Processing a raw BAM file
Some manipulations of bam files:
Converting between bams and fastqs
Indexing
Coordinate sorting
Splitting or merging
Filtering out reads
Masking entire regions
12 Tools of the Trade – Picardtools
For performing these basic manipulations of bam files, Picardtools is a really good suite of tools to use.
13 Tools of the Trade – Picardtools
Commonly used Picard tools:
ValidateSamFile
SamToFastq
MergeSamFiles
ReplaceSamHeader
Cool Picard options:
SORT_ORDER <default=null>
CREATE_INDEX <default=false>
CREATE_MD5_FILE <default=false>
VALIDATION_STRINGENCY <default=STRICT>

Some commonly used tools include ValidateSamFile, SamToFastq, and so on, and for all of these tools there are Picard options to sort the bam files (usually by coordinate along the reference genome), create an index (to enable random access), and create an md5sum hash. The md5sum is especially useful when you need to share the files with collaborators through some sort of file transfer that may cause corruption: upon receiving the files, your collaborators can create md5sums from the files they received and check them against the md5sums you sent them. Any change to a bam will almost certainly change its md5sum hash, so your collaborator will very easily know whether the files arrived with integrity. The validation stringency option is important for the ValidateSamFile tool, which you may use after receiving bam files from collaborators. Some bam processing tools or variant callers don't complain about minor bam file faults, but others do! So you may be far along in your processing before you even realise some parts of your bam files are corrupt, which is a pain. Checking md5sums doesn't catch the problem when the bam files were already corrupt before your collaborators sent them; running ValidateSamFile (usually with the STRICT option) before pushing the bam files into any further processing will hopefully catch such errors early.
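The md5 check on the receiving end can be done with a few lines of stdlib Python; this is a generic sketch (the `.md5` companion-file convention shown in the comment is an assumption about how the sender shipped the hashes, not a Picard requirement):

```python
import hashlib

def md5sum(path, chunk=1 << 20):
    """Stream the file in 1 MB blocks so multi-gigabyte bams
    never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# after transfer, compare against the hash the sender produced, e.g.:
# assert md5sum("sample1.bam") == open("sample1.bam.md5").read().split()[0]
```

Remember this only proves the file survived the transfer intact; it says nothing about whether the bam was valid before it was sent, which is what ValidateSamFile is for.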
14 NGS data processing
Okay, so now we're finally ready to start on the pre-processing steps of the NGS processing workflow. Taken from:
15 Indel Realignment
The first pre-processing step recommended by GATK is indel realignment. Image from:
16 Why Realign Around Indels?
Insertions and deletions at the ends of reads can trick the mapper into aligning the reads to another part of the genome, especially if the mapper is working with parameters that prefer mismatches to gaps. Mappers also get really confused by homopolymer runs in the reference sequence, especially when there is an insertion or deletion of a few bases in the sample's reads. This is shown in the IGV display of the reads over a part of the genome where there is a 7bp T-homopolymer run in the reference: there are clusters of consecutive "SNPs" both to the right and left of the homopolymer when the reads are aligned assuming there is no insertion or deletion, when in fact… Image from:
17 Why Realign Around Indels?
If we allow for a one base deletion in the homopolymer run and realign the reads, the "SNPs" all disappear. Image from:
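The effect is easy to reproduce with toy sequences. Below, a read carrying a 1 bp deletion in a 7 bp T-homopolymer is compared to the reference: aligned without a gap it shows a run of false "SNPs" downstream of the homopolymer, but allowing the deletion makes every mismatch vanish. The sequences are invented for illustration.

```python
def count_mismatches(read, ref):
    """Mismatches between two sequences aligned position by position."""
    return sum(a != b for a, b in zip(read, ref))

ref  = "ACG" + "T" * 7 + "ACGTACG"   # reference with a 7 bp T-homopolymer
read = "ACG" + "T" * 6 + "ACGTACG"   # same sequence, one T deleted

# ungapped alignment: everything after the homopolymer is shifted by one,
# producing a cluster of consecutive false "SNPs"
naive = count_mismatches(read, ref)

# realigned with a 1 bp deletion inside the homopolymer: skip one ref base
gapped = count_mismatches(read[:9], ref[:9]) + count_mismatches(read[9:], ref[10:])

print(naive, gapped)  # 7 0
```

This is exactly the pattern in the IGV screenshot: the mismatches are an artefact of the alignment, not real variants.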
18 How does it work?
Identified intervals come from:
Known indels
Indels discovered in the original alignments (in the CIGAR strings of reads in the BAM files)
Reads where there is evidence of possible misalignment
Image from:
20 Implementing the Indel Realignment
[Figure: grid of reads for samples 1–7 across sites 1–8]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there, so we need to parse in data for all samples at the same time.
22 Implementing the Indel Realignment
[Figure: grid of reads for samples 1–7 across sites 1–8]
Once the intervals are identified, reads from any single sample can be realigned individually based on that sample's own insertion/deletion lengths, so we only need to parse in one sample's data at a time.
27 Implementing the BQSR
[Figure: grid of reads for samples 1–7 across sites 1–8]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset, so we need to parse in all of the data of each sample.
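The core of what a recalibration table captures can be sketched in a few lines: for each reported quality bin, count how often bases actually mismatch the reference at sites known not to be variant, and convert that empirical error rate back to a Phred score. This is a simplified toy (real BQSR also conditions on machine cycle, sequence context, and more, and the `max(errs, 1)` smoothing here is an ad hoc assumption):

```python
import math

def empirical_quality(observations):
    """observations: {reported_Q: (n_bases, n_mismatches_at_known_sites)}.
    Returns the Phred-scaled empirical quality per reported-quality bin."""
    table = {}
    for q, (n, errs) in observations.items():
        rate = max(errs, 1) / n      # crude guard against log10(0)
        table[q] = round(-10 * math.log10(rate), 1)
    return table

# toy counts: bases the machine reported at Q30 actually err ~1 in 300,
# so they were overconfident and get recalibrated down to ~Q24.8
print(empirical_quality({30: (30000, 100), 20: (10000, 100)}))
```

Because these counts must be pooled over every unmasked site in a sample, this is the step where the entire dataset has to be read.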
31 Implementing Variant Calling
[Figure: grid of reads for samples 1–7 across sites 1–8]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at that site, so we need to parse in data for all samples at a particular site at the same time.
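The access patterns on these four slides boil down to two ways of scattering jobs, which can be sketched as job lists (sample and region names are illustrative): steps that compare samples at a site (RealignerTargetCreator, UnifiedGenotyper) scatter by genomic region, with every sample's bam open in each job, while per-sample steps (IndelRealigner, BaseRecalibrator application) scatter by sample.

```python
samples = [f"sample{i}" for i in range(1, 8)]
regions = [f"chr1:{s}-{s + 999_999}" for s in range(1, 3_000_000, 1_000_000)]

# joint steps: one job per region, each job reads all samples in that region
joint_jobs = [(region, samples) for region in regions]

# per-sample steps: one job per sample, each job reads that sample end to end
per_sample_jobs = [(sample,) for sample in samples]

print(len(joint_jobs), len(per_sample_jobs))  # 3 7
```

Choosing the right axis to scatter on is what keeps each job's memory footprint small while still letting the whole dataset be processed in parallel.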