Rosie Coates-Brown Final year Bioinformatics trainee


1 A bilateral approach to reducing the burden of Sanger sequencing confirmations
Rosie Coates-Brown, Final year Bioinformatics trainee, University of Manchester NHS Foundation Trust

2 Benefits of reducing Sanger confirmations for NGS data
Reduce workflow complexity
Reduce turnaround times
Reduce costs (predominantly expertise)

3 Key challenges
Ensuring only true positive variants are reported
Ensuring that variants have been attributed to the correct patient

4 Key challenges
Ensuring only true positive variants are reported
Machine learning classification workflow to assess the quality signatures associated with high quality, true positive variants
Possible thresholds for high quality variants

5 Machine learning Workflow
Define question and categories for classification
What quality metrics are important in classifying a variant as confirmed or not confirmed?

6 Machine learning Workflow
Define question and categories for classification
Gather SNP dataset
Disease Unconfirmed Confirmed Total
Ophthalmic 130 260
Cardiac 165 330

7 Machine learning Workflow
Define question and categories for classification
Gather dataset
Select input variables
QUAL, Depth, Reference Depth, Ratio, Fisher Strand, Haplotype Score, Strand Bias
12 variables originally considered, including mapping quality

8 Machine learning Workflow
Define question and categories for classification
Gather dataset
Select input variables
Carry out RandomForest to rank input variables

9 NextSeq Variable importance
AUC = 0.79
OOB error = 25.93% (15.25% for confirmed variants)
The model is better at classifying confirmed variants: the error for that class is 15.25%, against 25.93% overall.
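As a rough illustration of what the quoted AUC measures, the sketch below computes it as the fraction of (confirmed, unconfirmed) pairs in which the confirmed variant receives the higher classifier score. The function name and the scores are hypothetical and are not taken from the presentation's data.

```python
def auc(confirmed_scores, unconfirmed_scores):
    """Fraction of (confirmed, unconfirmed) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for c in confirmed_scores:
        for u in unconfirmed_scores:
            if c > u:
                wins += 1.0
            elif c == u:
                wins += 0.5
    return wins / (len(confirmed_scores) * len(unconfirmed_scores))


# Hypothetical classifier scores, e.g. the fraction of trees voting "confirmed".
confirmed = [0.9, 0.8, 0.6, 0.4]
unconfirmed = [0.7, 0.3, 0.2, 0.1]
print(auc(confirmed, unconfirmed))
```

An AUC of 0.5 would mean the scores carry no ranking information, and 1.0 would mean every confirmed variant outranks every unconfirmed one.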

10 Machine learning Workflow
Define question and categories for classification
Gather dataset
Select input variables
Carry out RandomForest to rank input variables
Define threshold values for important variables to classify variants as high quality

11 NextSeq threshold: Summary
Artifacts (%) Confirmed (%) Other (%)
QUAL > 3000 1 97.6 86
QUAL > 2000 7 98.3 88
QUAL > 3000 & SB < -649 96.9 87
QUAL > 2000 97.7 75
Based on QUAL >2000 and SB < % of variants would require confirmation
Based on QUAL >2000 alone 1.7% of variants would require confirmation
Based on QUAL >3000 alone 2.3% of variants would require confirmation
Based on SB alone 2.4% of variants would require confirmation
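A minimal sketch of how such thresholds might be applied to decide which variants still go for Sanger confirmation. The `Variant` class, the `needs_sanger` function and the example values are assumptions made for illustration; only the QUAL cut-offs of 2000 and 3000 and the SB cut-off of -649 come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str          # hypothetical identifier
    qual: float        # QUAL value from the variant call
    strand_bias: float # SB annotation


def needs_sanger(variant, qual_min=2000.0, sb_max=None):
    """Return True when the variant fails the high-quality thresholds."""
    if variant.qual <= qual_min:
        return True
    # The strand bias filter is optional; when set, SB must be below sb_max.
    if sb_max is not None and variant.strand_bias >= sb_max:
        return True
    return False


# Hypothetical variants, not taken from the presentation's dataset.
variants = [
    Variant("v1", 3500, -800),
    Variant("v2", 2500, -700),
    Variant("v3", 1500, -900),
    Variant("v4", 3200, -100),
]

qual_only = [v.name for v in variants if needs_sanger(v)]
qual_and_sb = [v.name for v in variants if needs_sanger(v, qual_min=3000, sb_max=-649)]
print(qual_only, qual_and_sb)
```

Keeping the strand bias filter optional makes it easy to compare a QUAL-only rule with a combined rule on the same set of calls.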

12 NextSeq threshold: Validation
Artifacts (%) Confirmed (%) Other (%)
QUAL > 3000 86 64
Validation of the thresholds in an unrelated dataset from a different gene panel on the same bioinformatics workflow

13 Take home message
Indicates the feasibility of using QUAL thresholds to identify high quality variants
Currently applied to targeted NextSeq workflows for SNP variants
Validation of the thresholds in an unrelated dataset from a different gene panel on the same bioinformatics workflow

14 Key challenges
Ensuring only true positive variants are reported
Ensuring that variants have been attributed to the correct patient

15 Key challenges
Ensuring only true positive variants are reported
Ensuring that variants have been attributed to the correct patient
Synthetic oligonucleotide spike-in protocol

16 SasiSeq fragment
Diagram labels: 11 bp barcodes; 2 barcodes per patient; virus vector (PhiX); fragment lengths 568 bp, 397 bp and 214 bp

17 Patient identity confirmation
Pre-prepared tube with dried spike-in of barcoded PhiX fragment at 0.1% concentration
Patient DNA extracted, eluted into tube
Enrichment, library preparation and sequencing
Bioinformatics workflow to extract barcodes
Key consideration: workflow should not impact sample turnaround times (TATs)

18 Bioinformatics workflow
Align the patient data to the PhiX reference
Retrieve BAM sequences for phage-aligned reads within the fragment coordinates
Assess the presence of each barcode, allowing up to 2 mismatches
Panel of 213 suitable barcodes from the 384 barcode panel
About 22,500 combinations of 2 barcodes (213 choose 2 = 22,578), so it should be possible to bioinformatically differentiate between any 2 samples in the lab at any given time
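The barcode check can be sketched as a Hamming-distance comparison between sequences read from the fragment and each barcode in the panel. The barcode sequences, names and the `find_barcodes` function below are hypothetical; only the 11 bp length, the 2-mismatch tolerance and the panel size of 213 come from the slides.

```python
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_barcodes(observed_seqs, panel, max_mismatches=2):
    """Count, per panel barcode, the observed sequences within max_mismatches."""
    counts = {}
    for observed in observed_seqs:
        for name, seq in panel.items():
            if len(observed) == len(seq) and hamming(observed, seq) <= max_mismatches:
                counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical 11 bp barcodes; a real panel would hold 213 of them.
panel = {
    "bc1": "ACGTACGTACG",
    "bc2": "TTGCAAGGCTA",
    "bc3": "GGATCCTTAAC",
}
observed = [
    "ACGTACGTACG",  # exact match to bc1
    "ACGTACGAACC",  # 2 mismatches from bc1
    "TTGCAAGGCTA",  # exact match to bc2
    "ACGTTTTTACG",  # 3 mismatches from bc1, so not counted
]
found = find_barcodes(observed, panel)
print(found)

# Number of unordered pairs that can be drawn from a 213-barcode panel.
print(comb(213, 2))
```

A sample would then be accepted only if the two barcodes found are the pair assigned to that patient and no other barcode from the panel is detected.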

19 Nextera, TruSeq, SureSelect
Nextera: LR PCR
TruSeq: short amplicon based
SureSelect: hybridisation based enrichment

20 Take home message
Ability to extract ONLY expected sequences
Possible to differentiate patients based on barcode
22,500 combinations of 2 barcodes from panel
Ability to bioinformatically differentiate between 2 samples in the lab at any given time

21 Further work
Investigate feasibility of thresholds for indel variants
Perform cross contamination assessments
Optimise PhiX fragment size for Nextera
Quail et al. state a cross contamination sensitivity of 1%; however, this may require an increase in the initial spike-in concentration

22 Acknowledgements
Support at MCGM: Sanjeev Bhaskar, Laura Dutton, Simon Ramsden, Helene Schlect
Congenica: Katie Tate, Daniel Bunford-Jones
University of Manchester: Angela Davis, Andy Brass

23 Carry out PCA

24 NextSeq threshold

25 Random Forest takes a random sample of the data and builds a forest of decision trees that try to separate the data into the two classes
It randomly removes a variable; the worse the resulting separation of the classes, the more important that variable is in defining a class
One tree by itself is poor support; 100 or 1000 trees is pretty good support
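The idea can be sketched with a toy forest. This is not the RandomForest implementation used in the presentation: it assumes one-split trees (decision stumps) trained on bootstrap samples, and it measures importance by shuffling one variable at a time and recording the drop in accuracy. The data are synthetic, with a hypothetical QUAL-like variable that separates the classes and a second variable that is pure noise.

```python
import random


def train_stump(rows, labels):
    """Pick the (feature, threshold, polarity) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            # Polarity 1 predicts class 1 when the value exceeds the threshold.
            errors = sum((r[f] > t) != bool(y) for r, y in zip(rows, labels))
            # Polarity 0 predicts the opposite, so its error count is the complement.
            for polarity, err in ((1, errors), (0, len(rows) - errors)):
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, t, polarity = stump
    above = row[f] > t
    return int(above) if polarity == 1 else int(not above)


def train_forest(rows, labels, n_trees, rng):
    forest = []
    n = len(rows)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    votes = sum(predict_stump(s, row) for s in forest)
    return int(votes * 2 > len(forest))  # majority vote


def accuracy(forest, rows, labels):
    return sum(predict_forest(forest, r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(forest, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature column."""
    base = accuracy(forest, rows, labels)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return base - accuracy(forest, shuffled, labels)


rng = random.Random(0)
# Synthetic data: feature 0 is QUAL-like and informative, feature 1 is noise.
rows = [[rng.uniform(100, 1000), rng.random()] for _ in range(20)]
rows += [[rng.uniform(2000, 9000), rng.random()] for _ in range(20)]
labels = [0] * 20 + [1] * 20

forest = train_forest(rows, labels, n_trees=25, rng=rng)
importance = {f: permutation_importance(forest, rows, labels, f, rng) for f in (0, 1)}
print(importance)
```

Shuffling the informative variable destroys the separation and accuracy falls, while shuffling the noise variable changes nothing, which is the ranking behaviour the slide describes.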

26 Workflow diagram comparing options A and B
A: Patient sample; current sample processing; current bioinformatic analysis with additional bioinformatic SNP profile of polymorphic region; additional lab process (real time SNP assay of polymorphic region) giving SNP assay results; concomitant analysis of information (BCS or GCS?); final report with validated patient identification
B: Patient sample; current sample processing; current bioinformatic analysis with additional retrieval of barcodes; final report with validated patient identification

27 Indel Variable importance
AUC: SNP 0.79, indel 0.85
Gini node purity of QUAL is 40 for indels; it is around 60 for SNPs, so QUAL is not as strong for indels
The plot shows the ranked importance of the quality metrics in discriminating between confirmed and unconfirmed variants
QUAL was shown by the Random Forest model to be important in discriminating between confirmed and unconfirmed variants
The statistical support for this metric (mean decrease in Gini) is comparable to the statistical support for the importance of QUAL in discriminating between confirmed and unconfirmed SNPs
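For readers unfamiliar with the Gini measure, the sketch below shows the quantity that a mean decrease in Gini is built from: the drop in class impurity achieved by splitting on a threshold. A Random Forest aggregates such drops over every split that uses a variable, across all trees; the function names and QUAL values here are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the weighted impurity of the two children."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical QUAL values; 0 = unconfirmed, 1 = confirmed.
quals = [500, 700, 900, 2500, 4000, 6000]
classes = [0, 0, 0, 1, 1, 1]

print(gini_decrease(quals, classes, 2000))  # split that separates the classes
print(gini_decrease(quals, classes, 600))   # split that barely helps
```

A variable whose splits consistently give large decreases ends up ranked highly, which is how QUAL comes to the top of the importance plot.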

28 Comparison of SNP and Indel QUAL distributions between the three groups of variants
Indel class   Mean QUAL   Median QUAL   Range
Artifacts     705         510           186 : 2377
Confirmed     6649        4954          1634 : 23188
Other         4964        4024          155 : 21073

SNP class     Mean QUAL   Median QUAL   Range
Artifacts     739         567           103 : 3049
Confirmed     4176        3837          342 : 9195
Other         4790        3867          108 : 9578

