Capture / Resequencing Data Handling and Analysis

MGL Users Group Capture / Resequencing Data Handling and Analysis

Designing and ordering a targeted exome probe set
We use Agilent SureSelect hybridization probes: 120-base biotinylated RNA probes.
Design process: choose your genes of interest and submit them to the SureDesign website.
Some considerations:
Price breaks occur at target sizes of 0.5, 3, 6, 12, and 24 Mb (see next slide).
For Karel, the target set is <500 kb, which is ~1% of the whole exome and ~0.015% of the whole genome.

Example of scaling of costs for SureSelect probes
These are costs per sample. For example, for 96 samples and ~130 genes: 96 x $260 = $24,960.


Example of SureDesign report

Targeted vs whole exome sequencing (TES vs WES)
Cost of WES is ~$120 for pulldown probes.
Can run many more samples per lane for TES.
WES uses an off-the-shelf probe kit, so ordering time is shorter.
Less “extraneous” data with TES = more “free” data with WES.

Process of hybridization and library preparation
We use the Agilent SureSelectXT Target Enrichment kit.
Need 5 µg of high-quality genomic DNA to start.
Probes are RNA, so be sure the DNA is RNase-free.
Shear the DNA, size select, ligate adaptors, amplify the library.
Hybridize to custom probes and pull down.
Add barcodes and pool samples for sequencing.

Sequencing ABI SOLiD 5500xl
Optimum density is 160 million beads per lane (one DNA fragment per bead). Nominally 110 bases read per fragment = 16.2 billion bases per lane. Significant losses due to filtering and off-target reads.

Understanding Data from the Sequencer
Each fragment can produce one or two reads, from the forward and/or reverse ends. For re-sequencing projects we usually want to maximize both coverage and call reliability, so paired-end reads of the longest length the sequencer can produce are desirable. The data consist of individual calls, each with an associated quality value. Multiple filtering steps are desirable to reduce possible artifacts.

Colorspace Compared to FASTQ
Colorspace is similar to FASTQ, but a layer of encoding makes it not immediately interpretable. Both have calls and qualities. Because each color call samples two adjacent bases, the call error rate actually goes down in colorspace data, making it a bit more reliable for re-sequencing. A tradeoff is that reads are a bit shorter, so more independent fragments must be read to achieve similar coverage.
[Figure: colorspace encoding table, 1st base vs 2nd base]
A csqual file carries the associated call qualities. XSQ is a compressed binary format combining calls and qualities.

You WILL have variants
The human reference genome (hg19) is assembled from 13 people, and any given portion represents only a fraction of those individuals. Builds prior to the most recent one (which is not yet adopted by the vast majority of tools) contain many rare alleles. dbSNP (build 141) reports 62 million common variants (from 260 million submissions), 29.9 million of which occur within genes. These are mainly synonymous and ‘non-impactful’ mutations. The goal of many re-sequencing projects is to distill meaningful mutations from all of this common genetic variation.

Considerations with Capture data
Exome or targeted capture is an excellent tool for reducing the amount of ‘irrelevant’ data for a study, but it does introduce some caveats.
Capture is never 100% enrichment. Both in our hands and in data evaluated from NISC, exome capture yields ~50% on-target bases, with targets explicitly defined by the capture design (exons +/- 10 bp). Product literature usually extends the capture regions a further 100 bp to pad that figure.
Because capture depends on complex hybridization, there is a LOT of variability in how well some sequences are captured vs others. Some regions may have low or no coverage while others are heavily covered.
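To show how the choice of padding changes the reported on-target fraction, here is a small Python sketch. The intervals and aligned blocks are made-up numbers on a single chromosome, the function name is hypothetical, and the code assumes the padded targets do not overlap one another (overlapping targets would need merging first).

```python
def on_target_fraction(aligned_blocks, targets, pad=0):
    """Fraction of aligned bases that fall inside (padded) target intervals.

    aligned_blocks, targets: lists of half-open (start, end) intervals on one
    chromosome. Assumes padded targets do not overlap each other.
    """
    padded = [(max(0, s - pad), e + pad) for s, e in targets]
    total = on = 0
    for bs, be in aligned_blocks:
        total += be - bs
        for ts, te in padded:
            # length of the overlap between this block and this target
            on += max(0, min(be, te) - max(bs, ts))
    return on / total

# Hypothetical targets and aligned read blocks
targets = [(100, 200), (500, 600)]
blocks = [(90, 140), (150, 250), (300, 350), (580, 630)]

strict = on_target_fraction(blocks, targets, pad=0)
padded = on_target_fraction(blocks, targets, pad=10)
print(strict, padded)
```

The same reads score higher once the targets are padded, which is why the definition of "on target" should be stated alongside any enrichment figure.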

Distribution of Coverage in Capture
“Average” coverage is 228x overall for capture bases, but note the range, and the presence of a terribly captured fraction!

Falloff of coverage in targeted regions
[Figure: coverage falloff curves, marked at 80%, 50%, and 20% of bases]
We can track what fraction of bases is covered at or above a given depth. This can be adjusted by how much sequencing is done.
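Tracking the fraction of bases covered at a given depth can be sketched in a few lines of Python. The per-base depths below are invented for illustration, and the function name is hypothetical; in practice the depths would come from a coverage tool run over the target intervals.

```python
def fraction_covered(depths, thresholds):
    """Return {threshold: fraction of target bases with depth >= threshold}."""
    if not depths:
        raise ValueError("no target bases supplied")
    total = len(depths)
    return {t: sum(d >= t for d in depths) / total for t in thresholds}

# Hypothetical per-base depths for ten target bases
depths = [0, 3, 12, 25, 40, 80, 150, 228, 300, 510]
result = fraction_covered(depths, [1, 10, 20, 100])
print(result)
```

Reporting several thresholds side by side makes a poorly captured fraction visible even when the mean depth looks high.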

Capture coverage scales fairly linearly with input, but low coverage bases do not scale well!
High-coverage and low-coverage bases scale differently with input, as a function of how well each region hybridizes.

Pre-filtering of data
Reads are evaluated and trimmed based on their contents BEFORE any form of mapping. This is important because “bad” reads may map and result in variant calls! It matters for any kind of project, not just resequencing, but is especially critical here.
A variety of tools exist to perform this. I prefer Trimmomatic for this task.
Two main tasks for Trimmomatic:
Remove adapter or problematic sequences (poly-A, etc.).
Clip or trim read sequences at low-quality positions, discarding reads that fall below a minimum length threshold.
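The quality-trimming idea can be sketched as follows. This is a simplified Python illustration loosely modeled on a sliding-window trim followed by a minimum-length filter; it is not Trimmomatic's actual implementation, and the function name, default parameters, and example read are assumptions made for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15, min_len=20):
    """Cut the read at the first window whose mean quality is below min_avg_q.

    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read is
    shorter than min_len and should be discarded.
    """
    cut = len(seq)
    for start in range(len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_q:
            cut = start  # drop everything from the failing window onward
            break
    if cut < min_len:
        return None
    return seq[:cut], quals[:cut]

# Hypothetical 10-base read whose quality collapses toward the 3' end
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 5, 5, 5]

trimmed = sliding_window_trim(seq, quals, window=4, min_avg_q=15, min_len=5)
dropped = sliding_window_trim(seq, quals, window=4, min_avg_q=15, min_len=8)
print(trimmed, dropped)
```

With a minimum length of 5 the read survives as its first five bases; with a minimum length of 8 the same read is discarded entirely.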

Alignment of Data
This is actually a critical choice. Which aligner you use will determine the reliability of your downstream results! The best alignment algorithm may change depending on the task or project.
Generally there are three types of aligners:
Seed & Extend
Reference Indexing
Prefix/Suffix matching (Burrows-Wheeler Transform)
Computational time and accuracy vary among them.
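For readers unfamiliar with the Burrows-Wheeler Transform, here is a toy Python sketch of the transform and its inverse using sorted rotations. It only illustrates the idea on a short string; real aligners build the transform from suffix arrays and pair it with an index for fast matching, which this sketch does not attempt.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short strings)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str) -> str:
    """Rebuild the original string by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]  # strip the sentinel

transformed = bwt("GATTACA")
restored = inverse_bwt(transformed)
print(transformed, restored)
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable reference indexes possible.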

Benchmarking of Common Aligners
For Illumina and some colorspace mapping I prefer to use Novocraft. It’s less commonly used because it’s not free.
[Figure: benchmark on simulated data run through actual aligners. Oliver GR. F1000Research 2012]

Benchmarking Indel Detection
Indels are a bit trickier to detect, particularly for some alignment strategies. (Oliver GR. F1000Research 2012)

Post alignment Workflow
The GATK Best Practices are continually updated tools and recommendations from the Broad Institute for handling sequencing data.
Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics. 43:

Final portable data format
VCF (Variant Call Format): tab-delimited text.
Each line represents the position of a variant, then describes the genotype, underlying data, and reliability for each sample.
Extendable with annotations and additional information.
Common and readable by many current third-party tools.
##fileformat=VCFv4.1
##fileDate=
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length= ,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA NA NA00003
rs G A PASS NS=3;DP=14;AF=0.5;DB;H GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
T A q10 NS=3;DP=11;AF= GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
rs A G,T PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
T PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
microsat1 GTC G,GTCT PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35: /2:17: /1:40:3
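Because VCF is plain tab-delimited text, a single record can be pulled apart with a few lines of Python. The sketch below is a minimal illustration, not a full parser: the function name is hypothetical, and the example record (chromosome, position, ID, quality, and sample names) is an illustrative, well-formed line in the style of the VCF specification example rather than data taken from the slide. For real work a dedicated VCF library is the safer choice.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO entries, and samples."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info, fmt = fields[:9]

    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True  # flags carry no value

    keys = fmt.split(":")  # FORMAT names the per-sample subfields
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": filt,
        "info": info_dict, "samples": samples,
    }

# Illustrative record; all values here are assumptions for the example
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
rec = parse_vcf_record(line, ["NA00001", "NA00002", "NA00003"])
print(rec["pos"], rec["info"]["AF"], rec["samples"]["NA00003"]["GT"])
```

The FORMAT column defines the order of the colon-separated values in every sample column, which is why the sketch zips the two together.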

Additional handling
Varies significantly by project & goals.
Association testing with disease phenotypes: modifiers.
Identification of mutations segregating with disease among families: causative mutation(s).
Copy Number Variation (CNV).
The amount of data needed for these tests and analyses will vary depending on the characterization and type of study.
Filtering, visualization, and manipulation can be done with many third-party tools: VarSifter, Golden Helix, IGV, Galaxy, and MANY more.
