1
Johns Hopkins University
ASHG Workshop Classifying and Interpreting Germline and Somatic Variants in Your Large Cohort Study with the CRAVAT Tool Suite Karchin Lab Department of Biomedical Engineering Institute of Computational Medicine Johns Hopkins University WiFi SSID: ASHGWORKSHOP Password: ASHGWORKSHOP
2
Wi-Fi Information: Hilton Ballrooms
Select the SSID: ASHGWORKSHOP Enter password: ASHGWORKSHOP Password is case-sensitive
3
Disclosure for: ASHG Interactive Workshop: Classifying and Interpreting Germline and Somatic Variants in Your Large Cohort Study with the CRAVAT Tool Suite No Relevant Conflicts to Disclose: Rachel Karchin Michael Ryan
4
CRAVAT/MuPIT Workshop
Overview CRAVAT Login, Job Submission, Results Viewer Study Overview Variant Analysis Impact What change is a variant introducing? Importance How to sort/filter to identify variants of interest? Investigate What annotations can be used to further investigate variant relevance? MuPIT – 3D Structure analysis of Variants
5
Variant Studies High volume sequencing delivers hundreds of called variants across hundreds of samples. Many genomic annotation packages (e.g. ANNOVAR) map variants onto transcripts and proteins, and provide annotations and scores. But: Identifying interesting mutations and exploring their impact remains challenging and often requires bioinformatics specialists. CRAVAT / MuPIT were developed to meet this need: they use machine learning and a highly visual, dynamic interface to serve users without bioinformatics/biostatistics experience.
6
The number of variants detected is large and is getting larger
70M SNVs from WGS 638 patients 967 controls >70M exonic germline variants! Dr. Nicholas Roberts Department of Pathology Johns Hopkins Medicine
7
CRAVAT: Cancer-Related Analysis of VAriants Toolkit
CRAVAT: Masica DL, Douville C, Tokheim C, Bhattacharya R, Kim R, Moad K, Ryan MC, Karchin R (2017). CRAVAT 4: Cancer-Related Analysis of Variants Toolkit. Cancer Research. [in press].
CHASM: Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res Aug 15;69(16):
SNVBox: Wong WC, Kim D, Carter H, Diekhans M, Ryan M, Karchin R (2011). CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics, 27(15):
VEST: Carter H, Douville C, Stenson P, Cooper D, Karchin R (2013). Identifying Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genomics, 14(Suppl 3):S3.
8
CRAVAT is designed to filter and prioritize variants
Funnel for automated variant analysis. Mention Mike’s demo time and place.
9
CRAVAT http://www.cravat.us
CRAVAT is a free, web-based tool for high-throughput analysis of genomic variants. Genomic Features Machine Learning Algorithms Parallel Processing Server Spreadsheet Results Annotations Interactive Results
10
CRAVAT Analysis – Setup
Click here to create an account
11
CRAVAT Analysis – Sample Job
Click here to see jobs.
12
CRAVAT Analysis – Sample Job
Click here for Interactive Results Viewer
13
CRAVAT Analysis – Sample Job Interactive Results
Tabs for different lists Click header to sort Select row to see details Type to Filter Add/remove columns
14
CRAVAT Input Format
Format 1 – Simple CRAVAT Format (tab separated)
ID Chr Position Strand RefBase AltBase
Single Base Substitution: Mut1 chr C T
Insert: Mut2 chr G
Delete: Mut3 chr G -
Multi-base Insert: TR4 chr AGG
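A minimal sketch of how rows in this six-column layout might be generated from a script. The function name `format_cravat_line` and every coordinate below are hypothetical, and the sketch assumes that an insertion uses `-` as the reference base in the same way the deletion row uses `-` as the alternate base; CRAVAT Help remains the authority on the format.

```python
def format_cravat_line(uid, chrom, pos, strand, ref, alt):
    """Build one tab-separated row: ID, Chr, Position, Strand, RefBase, AltBase."""
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        # Assumption: chromosomes are written with a 'chr' prefix, as on the slide.
        chrom = f"chr{chrom}"
    return "\t".join([uid, chrom, str(int(pos)), strand, ref, alt])


# Hypothetical coordinates, not taken from the workshop files.
rows = [
    format_cravat_line("Mut1", "chr1", 1000, "+", "C", "T"),  # single base substitution
    format_cravat_line("Mut2", "chr1", 2000, "+", "-", "G"),  # insertion (assumed '-' reference)
    format_cravat_line("Mut3", "chr1", 3000, "+", "G", "-"),  # deletion
]
print("\n".join(rows))
```

Generating rows this way keeps the column count and separator consistent, which is the kind of error Exercise 1 asks you to find by hand.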
15
CRAVAT Input Format http://projects.insilico.us.com/CRAVATClass
Exercise 1: Use the link on the site above to get Exercise1.txt. Correct the format errors (use CRAVAT Help). Copy and paste into the input screen and submit. In the results, can you find all 9 variants that were input?
16
CRAVAT – Input Format Exercise 1: Can you find all 9 variants that were input? Tip: In the Columns / Variant Info add Input Line and ID to help find particular variants.
17
CRAVAT Input Format
The optional sample ID column can be used when providing study data for a cohort:
Mut1 chr T C Patient1
Mut2 chr T C Patient1
Mut5 chr T G Patient1
…
Mut234 chr T G Patient5
Mut789 chr T G Patient11
Mut1036 chr T G Patient21
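As a quick sanity check before submitting a cohort file, the sample ID column can be tallied locally. The sketch below assumes the sample ID is the seventh tab-separated field; the function name `variants_per_sample` and the coordinates are hypothetical.

```python
from collections import Counter


def variants_per_sample(lines):
    """Count rows per sample ID (7th column); rows without one are counted under None."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns: {line!r}")
        sample = fields[6] if len(fields) > 6 else None
        counts[sample] += 1
    return counts


# Hypothetical rows in the layout shown on the slide.
cohort = [
    "Mut1\tchr1\t1000\t+\tT\tC\tPatient1",
    "Mut2\tchr1\t2000\t+\tT\tC\tPatient1",
    "Mut234\tchr2\t3000\t+\tT\tG\tPatient5",
]
counts = variants_per_sample(cohort)
print(counts)
```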
18
CRAVAT Input Format – VCF File Exercise 2:
Download Exercise2.vcf from the class webpage. Submit the file to CRAVAT as a new job. How many input variants are in the file? How many input results did you receive from CRAVAT? Add the VCF-specific “Mutation Call Quality” columns. What additional information do these fields provide?
19
CRAVAT – Input Format VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA NA NA00003
TR1 A T 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
GGAAGAAGAA G,GGAA,GGAAGAA 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ:AD 0|1:48:4:51,51:1,1,1,1 1|2:21:6:23,27:1,2,1,2 3|1:48:8:51,51:0,3,3,2
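One plausible reason the number of results can differ from the number of lines in a VCF file is that a single record may list several ALT alleles, as the second record above does. The sketch below counts both quantities; the function name `count_vcf_variants`, the coordinates, and the ID `TR2` are hypothetical, and how CRAVAT itself expands such records is not stated in these slides.

```python
def count_vcf_variants(lines):
    """Return (records, alt_alleles) for the data lines of a VCF file."""
    records = 0
    alt_alleles = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; multiple alleles are comma separated.
        alts = [a for a in fields[4].split(",") if a != "."]
        records += 1
        alt_alleles += len(alts)
    return records, alt_alleles


# Hypothetical two-record file modelled on the slide (sample columns omitted).
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tTR1\tA\tT\t29\tPASS\tNS=3;DP=14;AF=0.5",
    "chr1\t2000\tTR2\tGGAAGAAGAA\tG,GGAA,GGAAGAA\t50\tPASS\tNS=3;DP=13",
]
print(count_vcf_variants(vcf))
```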
20
CRAVAT – Input VCF
21
CRAVAT – Results Overview
Exercise 3: Open BRCA Study Results – random cohort of TCGA somatic breast cancer variants. Link on class page: Use the Summary Tab: What percentage of study variants are coding? What gene function GO category is most heavily mutated? What percentage of study variants are missense mutations if you discard synonymous variants? Which chromosome has a high density of inactivating mutations? Which genes occur in those locations? Which genes are pervasively mutated in study participants? Which patients had the most / least variants? Can you graph stopgain mutations by patient? Which gene pathways are enriched for likely cancer driver variants as predicted by CHASM score? Can you visualize the pathway and see the mutated genes?
22
CRAVAT – Results Overview
23
CRAVAT Analysis – Impact Analysis
CRAVAT will determine the impact of a variant (sequence ontology), identify affected genes, and determine codon/protein coding changes. CRAVAT will evaluate variant impacts on all the transcripts of a gene.
Code  Meaning
SY    Synonymous Variant
SL    Stop Lost
SG    Stop Gained
MS    Missense Variant
II    In Frame Insertion
FI    Frameshift Insertion
ID    In Frame Deletion
FD    Frameshift Deletion
SS    Splice Site
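When working with an exported results spreadsheet, the two-letter codes can be expanded and tallied with a small lookup table. This is only an illustrative sketch: the code-to-meaning pairs come from the slide, while the names `SO_CODES`, `describe_impact`, and `missense_fraction` and the sample list of codes are hypothetical.

```python
SO_CODES = {
    "SY": "Synonymous Variant",
    "SL": "Stop Lost",
    "SG": "Stop Gained",
    "MS": "Missense Variant",
    "II": "In Frame Insertion",
    "FI": "Frameshift Insertion",
    "ID": "In Frame Deletion",
    "FD": "Frameshift Deletion",
    "SS": "Splice Site",
}


def describe_impact(code):
    """Expand a two-letter sequence ontology code; unknown codes are reported as such."""
    return SO_CODES.get(code.upper(), "Unknown code")


def missense_fraction(codes):
    """Fraction of missense variants after discarding synonymous (SY) ones."""
    non_silent = [c for c in codes if c != "SY"]
    if not non_silent:
        return 0.0
    return sum(1 for c in non_silent if c == "MS") / len(non_silent)


# Hypothetical list of codes, one per variant.
codes = ["MS", "SY", "MS", "SG", "FD"]
print(describe_impact("sg"), missense_fraction(codes))
```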
24
CRAVAT – Variant Impact
Exercise 4: Using the same results: Go to Variant Tab With Columns panel, Add ID to your display What is the variant impact: gene, sequence ontology, protein change caused by the variants with the IDs: r r01448 r r07405 r r02152 r09656
25
CRAVAT – Variant Impact
26
CRAVAT – Variant Scoring
Scoring methods are needed to help prioritize investigation of mutations. CRAVAT provides two scoring methods. VEST – Machine learning algorithm trained to identify pathogenic variants. Covers single base variants, indels (frameshift and in frame), splice site variants, and nonsense variants (stop codon changes). CHASM – Machine learning algorithm trained to identify cancer drivers. Missense only. A CRAVAT analysis can include VEST scores, CHASM scores, or both.
27
VEST and CHASM bioinformatic variant scores
Random forest machine learning uses an ensemble of simple decision trees
28
VEST and CHASM bioinformatic variant scores
Supervised machine learning: Training. “Bad” variants and harmless variants pass through feature encoding into a Random Forest. 85 features: sequence-based; evolutionary conservation in protein and DNA multiple alignments; predicted protein structure; positional and regional protein annotations.
29
VEST and CHASM bioinformatic variant scores
Supervised machine learning: Prediction. A new variant passes through feature encoding into the Random Forest, which returns a score (0.8 in the slide’s example). 85 features: sequence-based; evolutionary conservation in protein and DNA multiple alignments; predicted protein structure; positional and regional protein annotations.
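To make the idea of a tree vote score concrete, here is a toy sketch in which each "tree" is a single threshold test and the score is the fraction of trees that vote "bad". It is not the VEST or CHASM implementation: the feature names, thresholds, and helper names (`make_stump`, `tree_vote_score`) are all invented for illustration, and real forests use deeper trees learned from training data.

```python
def make_stump(feature, threshold):
    """Return a one-split 'tree' that votes True ('bad') when the feature exceeds the threshold."""
    return lambda features: features[feature] > threshold


def tree_vote_score(trees, features):
    """Fraction of trees in the ensemble that vote 'bad' for this variant."""
    votes = sum(1 for tree in trees if tree(features))
    return votes / len(trees)


# Hypothetical ensemble of five stumps over three made-up features.
trees = [
    make_stump("conservation", 0.5),
    make_stump("conservation", 0.7),
    make_stump("structure_burial", 0.4),
    make_stump("structure_burial", 0.9),
    make_stump("annotation_hits", 0),
]

# Hypothetical encoded features for one new variant.
new_variant = {"conservation": 0.8, "structure_burial": 0.6, "annotation_hits": 2}
score = tree_vote_score(trees, new_variant)
print(score)
```

Four of the five stumps vote "bad" for this variant, so the score is the fraction of agreeing trees rather than a probability calibrated against any null.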
30
VEST and CHASM use a statistical model to threshold variant scores
Statistical significance of tree vote score? From scores to P-values: variants are sorted by score and compared against null tree vote scores. The null hypothesis is that there are no “bad” variants present. This allows comparison of scores for different mutation consequence types.
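One common way to turn a score into a P-value is to compare it with an empirical null distribution. The slides do not give the exact formula CRAVAT uses, so the sketch below is an assumption: it reports the fraction of null scores at least as high as the observed score, with a +1 correction so the P-value is never exactly zero. The null scores are made up.

```python
def empirical_p_value(score, null_scores):
    """P-value as the (smoothed) fraction of null scores >= the observed score."""
    at_least = sum(1 for s in null_scores if s >= score)
    return (at_least + 1) / (len(null_scores) + 1)


# Hypothetical tree vote scores obtained under the null hypothesis.
null_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.62]

p_high = empirical_p_value(0.8, null_scores)   # no null score reaches 0.8
p_mid = empirical_p_value(0.25, null_scores)   # five null scores are >= 0.25
print(p_high, p_mid)
```

Because each consequence type would be compared with its own null, the resulting P-values share a scale even when the raw scores do not.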
31
CHASM performance CHASM remains one of the top-performing predictors of driver missense mutations Benchmarking done by an independent group
32
VEST is one of the top performing predictors of missense pathogenicity
Benchmarking done by an independent group VEST3.0 VEST3.0
33
VEST provides integrated prioritization of all non-silent variants
VEST classifiers have been built for missense, frameshift, nonsense, and splice site variants. Scores are not directly comparable across different variant types, but P-values may be compared directly across types, making it easier to sort all variants by predicted pathogenic impact. VEST scores have been precomputed for all possible missense mutations, so they can be returned quickly for single variant lookups and are included in other annotation databases.
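Outside the interactive viewer, the same ordering can be reproduced on exported results. The sketch below sorts a list of variants by P-value across all types, and then by variant type first and P-value second; the records and the field names `so` and `vest_p` are hypothetical.

```python
# Hypothetical exported rows: sequence ontology code and VEST P-value per variant.
variants = [
    {"id": "v1", "so": "MS", "vest_p": 0.030},
    {"id": "v2", "so": "FD", "vest_p": 0.200},
    {"id": "v3", "so": "FD", "vest_p": 0.004},
    {"id": "v4", "so": "SG", "vest_p": 0.010},
]

# Sort every variant by P-value, regardless of type (lowest first).
by_p = sorted(variants, key=lambda v: v["vest_p"])

# Sort by variant type first, then by P-value within each type.
by_type_then_p = sorted(variants, key=lambda v: (v["so"], v["vest_p"]))

print([v["id"] for v in by_p])
print([v["id"] for v in by_type_then_p])
```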
34
CRAVAT – Variant Scoring
Exercise 5: Using the same results, variant tab: Sort by CHASM p-value (lowest at the top of the list). What type of variants are identified as potential cancer drivers? Why? What genes have the variants at the top of the list? Are they familiar? Are there unexpected genes? Sort by VEST p-value (lowest at the top of the list). What type of variants are at the top of the list? Why? If you wanted to see the most pathogenic frameshift deletions, can you sort first by variant type and then by VEST p-value? Are VEST scores the same for all transcripts of a gene?
35
CRAVAT Analysis – Variant Scoring
36
CRAVAT – Annotations Population Statistics / Gene Information
37
CRAVAT – Annotations Cancer / Disease annotations
CGL – Oncogene / Tumor Suppressor
38
CRAVAT – Annotation Exercise 6 Using the same results, variant tab: Variant r09138 has a good CHASM p-value. Use the annotation data to decide if you would further pursue it as a driver variant. Variant r07683 also has a good CHASM score. Is there annotation data to support the prediction that it is acting as a cancer driver? Assume the variants in analysis were from a breast cancer study. What does the gene affected by the variant do? Splice Site mutation r01466 has a good VEST score. Is there supporting evidence that this variant is pathogenic?
39
CRAVAT – Annotation
40
CRAVAT Gene Level Tab
41
Very large number of variants
Variant calls that pass QC
Variant consequence types of interest
Below allele frequency of interest
Bioinformatics variant impact prediction
Annotations
(Splice not yet supported; in progress.)
Tractable list of the most interesting variants
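The funnel can be thought of as a chain of filters, each one narrowing the list left by the previous stage. The sketch below shows that structure only; the field names, thresholds, and records are hypothetical and are not CRAVAT defaults.

```python
def run_funnel(variants, consequences, max_af, max_p):
    """Apply the funnel stages in order; return the survivors and the count after each stage."""
    stages = [
        ("pass QC", lambda v: v["qc_pass"]),
        ("consequence of interest", lambda v: v["so"] in consequences),
        ("below allele frequency", lambda v: v["af"] < max_af),
        ("impact prediction", lambda v: v["p_value"] <= max_p),
    ]
    kept = list(variants)
    counts = [("input", len(kept))]
    for name, keep in stages:
        kept = [v for v in kept if keep(v)]
        counts.append((name, len(kept)))
    return kept, counts


# Hypothetical variants, each failing at a different stage except the first.
variants = [
    {"id": "a", "qc_pass": True, "so": "MS", "af": 0.001, "p_value": 0.01},
    {"id": "b", "qc_pass": False, "so": "MS", "af": 0.001, "p_value": 0.01},
    {"id": "c", "qc_pass": True, "so": "SY", "af": 0.001, "p_value": 0.01},
    {"id": "d", "qc_pass": True, "so": "SG", "af": 0.2, "p_value": 0.01},
    {"id": "e", "qc_pass": True, "so": "FD", "af": 0.0005, "p_value": 0.4},
]
kept, counts = run_funnel(variants, {"MS", "SG", "FD"}, max_af=0.01, max_p=0.05)
print([v["id"] for v in kept], counts)
```

Recording the count after each stage makes it easy to see which filter removes the most variants in a given study.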
42
Example of funnel developed for familial pancreatic cancer (FPC) WGS study
well-known FPC genes Genetic basis of FPC unknown for 80-90% of patients Analysis of 70M SNVs identifies new genes not previously implicated in FPC susceptibility
43
CRAVAT Analysis Miscellaneous CRAVAT Tips:
Spreadsheet and interactive results are provided. The spreadsheet has multiple tabs.
The gene level tab shows information aggregated by gene (e.g. how many mutations in the study).
CRAVAT uses GRCh38 coordinates. If your data are in hg19 coordinates, use the checkbox to ‘lift over’ the coordinates, or use the older CRAVAT at hg19.cravat.us.
If you run more than 100,000 coding mutations, result filtering will be needed to see results.
If you process more than 60,000 mutations, results will be in a text file rather than a spreadsheet.
Results are not kept on the CRAVAT server indefinitely. Get results before 30 days have elapsed.
FDR will not be calculated unless at least 10 variants are scored.
Export can be used to send the on-screen variant list to a spreadsheet.
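To illustrate why a minimum number of scored variants matters for FDR, here is a sketch of a false discovery rate calculation that declines to run on fewer than 10 P-values. The slides do not say which FDR procedure CRAVAT applies; the Benjamini-Hochberg procedure is assumed here as a common choice, and the function name and P-values are hypothetical.

```python
def benjamini_hochberg(p_values, min_count=10):
    """Return BH-adjusted values in input order, or None if too few P-values were given."""
    n = len(p_values)
    if n < min_count:
        return None  # mirrors the tip: no FDR unless at least 10 variants are scored
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for ten scored variants.
p_values = [0.01 * k for k in range(1, 11)]
fdr = benjamini_hochberg(p_values)
print(fdr)
print(benjamini_hochberg([0.01, 0.02, 0.03]))
```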
44
CRAVAT Advanced Interfaces
Web Service Fast Single Mutation Lookup Program / Web Server RESTful Web Service Submit Full Asynchronous Jobs Program CRAVAT Web Server CRAVAT in Galaxy Run CRAVAT in your own Docker Container
45
CRAVAT Coming Soon Non-coding annotations
5’/3’ UTR, intron, upstream/downstream
Non-coding RNA
Pseudogene
LINE, SINE, repeats
GWAS
Regulator and chromatin modification regions
Additional mutation databases for Lollipop and MuPIT
Genome Aggregation Database (gnomAD)
Modular CRAVAT: install locally with just the annotation modules you want. Develop your own annotation modules and make them available to other CRAVAT users.
46
MuPIT Visualization 3D Structure Viewer with mutation mapping capabilities. Niknafs N, Kim D, Kim R, Diekhans M, Ryan M, Stenson PD, Cooper DN, Karchin R. MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum Genet Nov;132(11): doi: /s Epub 2013 Jun 23. Identify 3D mutation clusters. Combine mutation with annotation to identify proximity of mutation to critical protein structures.
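The notion of proximity in 3D can be illustrated with a simple distance check between residue coordinates. This is a toy sketch and not how MuPIT computes clusters: the coordinates (in ångströms), residue numbers, cutoff, and the function name `residues_near` are all hypothetical.

```python
import math


def residues_near(coords, query, cutoff):
    """Return residue numbers whose coordinates lie within `cutoff` of the query residue."""
    origin = coords[query]
    return sorted(
        residue
        for residue, xyz in coords.items()
        if residue != query and math.dist(origin, xyz) <= cutoff
    )


# Hypothetical residue number -> (x, y, z) position of a representative atom.
coords = {
    10: (0.0, 0.0, 0.0),
    55: (3.0, 4.0, 0.0),    # 5.0 from residue 10
    120: (0.0, 0.0, 6.0),   # 6.0 from residue 10
    300: (30.0, 0.0, 0.0),  # far away
}
print(residues_near(coords, query=10, cutoff=8.0))
```

Residues 55 and 120 are far apart in sequence but close in space, which is the kind of relationship a linear view of the protein cannot show.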
47
MuPIT Visualization Exercise 7: Using the same results, variant tab: On r08293 in the detail panel, click on the MuPIT button to see the variant on STK3’s 3D structure. Click and drag with your mouse to rotate the structure. Use wheel to zoom in / out. Change the color of your mutation to red. On the Chains section, set the color of each protein chain to a different color to better understand the structure. Use the style menus in the Protein tab to see different renderings of the protein. Use the small molecule buttons on the Protein tab and active/binding buttons on the annotation tab to discover potential impacts of the variant.
48
MuPIT Visualization
49
MuPIT Visualization Exercise 8: Using the same results, variant tab: On r04947 in the detail panel, click on the MuPIT button to see the variant on a 3D structure. Use the Results button to switch to the homology model of NP_ _9. Turn off your mutation. Try the TCGA variant buttons for BRCA to see other breast cancer variants for MAP3K1. Turn off the variants but leave the hot region (right click on the BRCA button). Turn your variant back on. What is the relationship between your study mutation and the hot region?
50
MuPIT Visualization
51
MuPIT from Gene Panel
52
MuPIT Stand-Alone
54
Principal Collaborators and Funding for the Project
Johns Hopkins University: Rachel Karchin, David Masica, Collin Tokheim, Chris Douville, Violeta Beleva, Lily Zheng
In Silico Solutions: Michael Ryan, RyangGuk Kim, Kyle Moad
Funding for CRAVAT/MuPIT: Information Technologies for Cancer Research, NCI (grants U01-CA180956, U24CA204817)