BIG Data: Knowledge for Improving Vaccine Virus Selection Richard H. Scheuermann, Ph.D. Director of Informatics JCVI.

Big Data BIG DATA

Big Data Volumes

Big Data in Biology

Big Data 3 V’s

Biological data types and analysis objectives
Genomics
– Data: nucleotide genome sequences, metagenomic sequences
– Analysis: gene finding, functional annotation, sequence alignment, homology determination, comparative analysis, phylogenetic inferencing, association analysis, mutation functional prediction, species distribution analysis
Transcriptomics
– Data: RNA expression levels, transcription factor binding, chromatin structure information
– Analysis: differential expression, clustering, functional enrichment, transcriptional regulation/causal reasoning
Proteomics
– Data: protein levels, protein structures, protein interactions
– Analysis: protein identification, protein functional predictions, structural predictions, structural comparison, molecular dynamic simulation, mutation functional prediction, docking predictions, network analysis
Metabolomics
– Data: metabolite/small molecule levels
– Analysis: pathway/network analysis
Imaging
– Data: microscopy images, MRI images, CT scans
– Analysis: feature extraction, high content screening
Cytometry
– Data: cell levels, cell phenotypes
– Analysis: cell population clustering, cell biomarker discovery
Systems biology
– Data: all of the above
– Analysis: network analysis, causal reasoning, reverse causal reasoning, drug target prediction, regulatory network analysis, information flow, population dynamics, modeling and simulation

Variety

No Variety

Big Data
Volume + Variety = Value
Variety = Metadata

DMID Genomics Courtesy of Alison Yao, DMID

Bioinformatics Resource Centers (BRCs)
www.viprbrc.org
www.fludb.org
www.patricbrc.org
www.eupathdb.org
www.vectorbase.org

IRD Home Page (www.fludb.org)
– Comprehensive collection of flu-related data and analysis tools
– Free use without restrictions
– Standardization and integration

IRD Data Summary: Data in IRD
Protein Structures
– 412 structure files
– 379 Influenza A
– By protein: 9 PB2, 6 PB1, 1 PB1-F2, 25 PA, 162 HA, 20 NP, 110 NA, 6 M1, 15 M2, 27 NS1, 2 NS2
Host Factor Data
– 55 experiments: 35 transcriptomics, 16 proteomics, 4 lipidomics
– 2968 experiment samples
– 544 host factor biosets
– 497378 host factor responses
Sequence Features
– 4794 Sequence Features: 321 Structural, 176 Functional, 122 Sequence alterations, 4175 Epitopes
– 888406 Variant Types

GSC-BRC Metadata Working Group
Collaboration between U.S. Genome Sequence Centers for Infectious Diseases and Bioinformatics Resource Centers
– What kind of data should be collected for a sequencing specimen?
– How should the information be represented?
– Decisions driven by usage

Specimen Isolation (metadata model diagram)
Entities: organism, organism part, environmental material, environment, equipment, person, specimen, specimen isolation procedure, isolation protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time), geographic location, affiliation, hypothesis, IRB/IACUC approval, organism pathogenic disposition
Roles: specimen source role, specimen capture role, specimen collector role
Identifiers and types: name, ID, specimen type, specimen isolation procedure type
Qualities: gender, age, health status
Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, instance_of, is_about, has_authorization, has_quality, has_affiliation, has_disposition
Field labels: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18

Metadata Processes (diagram)
Processes within an Investigation: Specimen Isolation, Material Processing, Sequencing Assay, Quality Assessment, Data Processing; each is located_in a temporal-spatial region
Specimen Isolation: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen
Material Processing: sample processing, microorganism enriched NA sample, microorganism genomic NA
Sequencing Assay: input sample, reagents, technician, equipment (type, ID, qualities), primary data
Quality Assessment: quality assessment assay
Data Processing: data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID
Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of
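The specimen-level entities named in the two metadata diagrams could be captured in a simple record type. The sketch below is only an illustration: the class name `SpecimenRecord`, the field names, and the choice of which fields are required are assumptions made here, not the working group's actual standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class SpecimenRecord:
    """Hypothetical specimen metadata record; field names are illustrative."""
    specimen_id: str                       # "ID denotes specimen"
    specimen_type: str                     # "specimen type"
    source_organism: str                   # organism playing the specimen source role
    collection_datetime: str               # "date/time", e.g. an ISO 8601 string
    gps_location: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    geographic_location: Optional[str] = None
    collector_name: Optional[str] = None
    collector_affiliation: Optional[str] = None
    host_gender: Optional[str] = None
    host_age: Optional[float] = None
    host_health_status: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of optional fields that were left empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


record = SpecimenRecord(
    specimen_id="SPEC-0001",               # made-up identifier
    specimen_type="nasal swab",
    source_organism="Homo sapiens",
    collection_datetime="2009-04-15",
)
print(record.missing_fields())
```

A check like `missing_fields` is one way to act on "decisions driven by usage": a submission pipeline could report which fields still need to be supplied before a sequence record is archived.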

Data Standards Dugan V, et al. PLOS One 2014, submitted.

Genetic Drift and Escape from Protective Immunity
Can we monitor influenza genetic drift and predict when a new variant has escaped protective immunity?

Evolutionary drivers
Viruses experience two main drivers of evolution:
– Functional constraint (purifying selection): selection against deleterious amino acid substitutions in order to maintain important structural and functional elements
– Immune pressure (diversifying selection): selection for amino acid mutations that result in viruses that evade pre-existing immunity, and for other characteristics of enhanced fitness

Selective Pressures on HA
Hemagglutinin (HA) protein is:
– Responsible for virus attachment and entry into the host cell
– A major antigenic component of the virus
If we can determine which regions of HA are targets of protective immunity, we can monitor genetic drift in those regions to predict escape.
– Regions undergoing diversifying selection as HA naturally evolves would correspond to the relevant epitopes for protective immunity
– This information could be used to help predict when new vaccine strains are warranted

Approach
1. Map all experimentally defined immune epitopes on the H1 HA protein
2. Identify sites that have experienced diversifying selection in pre-pandemic H1N1 strains and use them to select immune epitopes likely to be targets of protective immunity
3. Determine whether these regions are being targeted for mutation during the ongoing evolution of the pandemic H1N1 lineage
Figure labels: Pre-pandemic HA, Pandemic HA

B-cell Epitopes from Immune Epitope Database (IEDB)

Identifying Sites Experiencing Diversifying Selection
Selection pressure estimated using Fast Unconstrained Bayesian Approximation (FUBAR): Murrell B, et al. (2013) Mol. Biol. Evol. 30(5):1196–1205
– dN: rate of non-synonymous substitutions
– dS: rate of synonymous substitutions
– Non-synonymous substitution: CTA (Leu) → CCA (Pro)
– Synonymous substitution: CTA (Leu) → CTG (Leu)
The non-synonymous and synonymous rates are estimated for each site by calculating the posterior probability Prob(dN_site, dS_site | Data_site, Tree, Codon Substitution Rate, Codon Freq).
Sites are considered to be under diversifying selection if (dN/dS) observed > (dN/dS) expected with a Bayesian score > 0.9.
Calculated using all H1 NA sequences prior to the 2009 pandemic (pre-pandemic): 2105 full length HA protein sequences
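The distinction between synonymous and non-synonymous substitutions can be made concrete with a short sketch. This is not FUBAR, which fits a Bayesian model over a phylogeny; it only classifies a single codon change using the standard genetic code. The function name `classify_substitution` is a name chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# (third position varying fastest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Label a codon change by whether the encoded amino acid changes."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


# The two examples from the slide:
print(classify_substitution("CTA", "CCA"))  # Leu -> Pro
print(classify_substitution("CTA", "CTG"))  # Leu -> Leu
```

dN and dS are rates built from counts of these two kinds of change, normalized by the number of opportunities for each, so a site with dN well above dS is accumulating amino acid changes faster than silent ones.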

Sites Experiencing Diversifying Selection
Found 7 sites experiencing diversifying selection in pre-pandemic H1 HA: 172, 177, 179, 203, 204, 278, 468
Threshold: Bayesian score = 0.9

B-cell Epitopes with Diversified Sites
Sites: 172, 177, 179, 203, 204, 278, 468
p = 0.02

Relevant B-cell Epitopes
Sites: 172, 177, 179, 203, 204, 278, 468
Epitopes: Sa, Sb (Caton et al. 1982)
– 5/7 diversifying sites correspond to two well characterized B cell/antibody epitopes that may be targets of protective immunity
– 2/7 sites do not correspond to any previously characterized B cell/antibody epitope
– Highlight “evolutionary regions of interest”

Test Predictions on Pandemic Drift
Meta-CATS (Pickett BE, et al. (2013) Virology, 447:45-51) is a statistical tool that determines whether nucleotide or amino acid residues at each position in a multiple sequence alignment are significantly different between groups of sequences, using a chi-squared statistic.
Group 1 (Early Pandemic Isolates):
– Original outbreak sequences (21 earliest 2009 pandemic North American sequences)
Group 2 (Late Pandemic Isolates):
– California 12-13 and 13-14 season (15 sequences)
– Florida 12-13 and 13-14 season (21 sequences)
– New York 12-13 and 13-14 season (13 sequences)
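The per-position comparison described above can be sketched as a Pearson chi-squared test on a two-row table of residue counts. This is a simplified illustration of the idea, not the Meta-CATS implementation; the function names and the toy alignment are made up, and the p-value lookup is left out.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def chi_squared_by_group(group1: Sequence[str],
                         group2: Sequence[str]) -> Tuple[float, int]:
    """Pearson chi-squared statistic and degrees of freedom for the
    residues seen at one alignment position in two sequence groups."""
    if not group1 or not group2:
        raise ValueError("both groups need at least one sequence")
    counts = [Counter(group1), Counter(group2)]
    residues = sorted(set(group1) | set(group2))
    row_totals = [len(group1), len(group2)]
    total = sum(row_totals)
    statistic = 0.0
    for residue in residues:
        col_total = counts[0][residue] + counts[1][residue]
        for row in range(2):
            expected = row_totals[row] * col_total / total
            statistic += (counts[row][residue] - expected) ** 2 / expected
    # (rows - 1) * (columns - 1) with two rows
    return statistic, len(residues) - 1


def scan_alignment(seqs1: List[str],
                   seqs2: List[str]) -> List[Tuple[int, float, int]]:
    """Apply the test to every column of two equal-length aligned groups."""
    columns = zip(zip(*seqs1), zip(*seqs2))
    return [(pos, *chi_squared_by_group(col1, col2))
            for pos, (col1, col2) in enumerate(columns, start=1)]


# Toy alignment (invented): position 1 differs between groups, position 2 does not.
early = ["SA", "SA", "SA"]
late = ["TA", "TA", "SA"]
for pos, stat, dof in scan_alignment(early, late):
    print(pos, round(stat, 3), dof)
```

A position with a single shared residue has zero degrees of freedom and a statistic of zero, so it can be skipped before any p-value is computed.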

Meta-CATS Results (California) Group 1: Early pandemic Group 2: Late CA pandemic (season 12-13 and 13-14)

Results
Columns: Site | Diversifying Sites from Pre-Pandemic | Diversifying Sites from Pandemic | Meta-CATS (CA season 12-13, 13-14) | Meta-CATS (FL season 12-13, 13-14) | Meta-CATS (NY season 12-13, 13-14) | Diversified Epitopes (# epitopes) | T-cell Epitope (# epitopes)
52 ++(6)
101 ++(5)
114 ++++(6)
172 ++ (8)+ (6)
177 ++ (4)+ (5)
179 +++ (4)+ (5)
180 ++++ (5)+ (4)
202 ++++ (2)
203 ++ (6)+ (5)
204 ++ (3)+ (6)
214 ++ (2)
220 ++++ (2)
239 ++ (1)+(4)
240 ++ (1)+(5)
251 +++(5)
266 ++ (1)
273 ++++ (4)
278 ++ (6)
300 ++++(3)
389 +(5)
391 +++++(5)
468 +++++ (4)
516 ++++(1)
544 ++(4)
Legend: Sa, Sb, T-cell

Test Relevant B-cell Epitopes 172 177 179 180 202 203 204 273 278 468 Sa Sb 114 220 251 300 391

Tree Analysis (sites 273, 180, 300, 516, 114)
Flu seasons: 08-09, 09-10, 10-11, 11-12, 12-13, 13-14
Legend: dominant residue in outbreak strains; dominant residue in late pandemic strains; remaining amino acids

Tree Analysis (sites 202, 468, 391, 220)
Flu seasons: 08-09, 09-10, 10-11, 11-12, 12-13, 13-14
Legend: dominant residue in outbreak strains; dominant residue in late pandemic strains; remaining amino acids

Tree Analysis Summary
Flu seasons: 08-09, 09-10, 10-11, 11-12, 12-13, 13-14
Substitutions: S220T, E391K, S468N, S202T, D114N, E516K, K300E, K180Q, A273T
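The substitution labels in this summary follow the usual "original residue, site, new residue" notation, so they can be parsed and compared against the seven pre-pandemic diversifying sites reported earlier in the talk. A minimal sketch, with function names chosen here for illustration:

```python
import re
from typing import Iterable, List, Set, Tuple

# Both lists are taken from the slides.
PRE_PANDEMIC_DIVERSIFYING_SITES = {172, 177, 179, 203, 204, 278, 468}
SUBSTITUTIONS = ["S220T", "E391K", "S468N", "S202T", "D114N",
                 "E516K", "K300E", "K180Q", "A273T"]


def parse_substitution(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'S220T' into (original, site, replacement)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def substitutions_at_sites(labels: Iterable[str], sites: Set[int]) -> List[str]:
    """Keep only the substitutions whose site is in the given set."""
    return [label for label in labels if parse_substitution(label)[1] in sites]


print(substitutions_at_sites(SUBSTITUTIONS, PRE_PANDEMIC_DIVERSIFYING_SITES))
```

Of the nine substitutions listed, only S468N falls exactly on a pre-pandemic diversifying site. A strict site-by-site match is the narrowest possible comparison, and comparing at the level of epitope regions would need the epitope coordinates, which are not given in the transcript.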

Big Data to Knowledge
Volume + Variety = Value
Variety = Metadata
Data + Metadata + Integration + Interpretation = Knowledge

Big Data for Vaccine Selection
Large scale statistical genomic analysis can identify sites experiencing diversifying selection
– Help determine how much sequence data is needed
When integrated with immune epitope data, could pinpoint those regions important for protective immunity and predict relevant antigenic drift
– Natural experiment to identify correlates of protective immunity
Monitoring genetic drift in these regions could augment approaches like antigenic cartography/landscape analysis to determine when vaccine candidates should be adjusted

Acknowledgments
U.T. Southwestern/JCVI: Richard Scheuermann (PI), Burke Squires, Jyothi Noronha, Alex Lee, Brian Aevermann, Brett Pickett, Yun Zhang
MSSM: Adolfo Garcia-Sastre, Eric Bortz, Gina Conenello, Peter Palese
Vecna: Chris Larsen, Al Ramsey
LANL: Catherine Macken, Mira Dimitrijevic
U.C. Davis: Nicole Baumgarth
Northrop Grumman: Ed Klem, Mike Atassi, Kevin Biersack, Jon Dietrich, Wenjie Hua, Wei Jen, Sanjeev Kumar, Xiaomei Li, Zaigang Liu, Jason Lucas, Michelle Lu, Bruce Quesenberry, Barbara Rotchford, Hongbo Su, Bryan Walters, Jianjun Wang, Sam Zaremba, Liwei Zhou, Zhiping Gu
IRD SWG: Gillian Air (OMRF), Carol Cardona (Univ. Minnesota), Adolfo Garcia-Sastre (Mt Sinai), Elodie Ghedin (Univ. Pittsburgh), Martha Nelson (Fogarty), Daniel Perez (Univ. Maryland), Gavin Smith (Duke Singapore), David Spiro (JCVI), Dave Stallknecht (Univ. Georgia), David Topham (Rochester), Richard Webby (St Jude)
USDA: David Suarez
Sage Analytica: Robert Taylor, Lone Simonsen
CEIRS Centers
N01AI40041, HHSN272201200005C
