Toward a unified view of human genetic variation Gabor Marth Boston College Biology Department on behalf of the International 1000 Genomes Project.



GOALS

The 1000 Genomes Project goals
– Discover population-level human genetic variations of all types (95% of variation > 1% frequency)
– Define haplotype structure in the human genome
– Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects

HOW FAR HAVE WE COME IN THE PAST YEAR?

Finalized project design
– Based on the results of the pilot project, we decided to collect data on 2,500 samples from 5 continental groupings
  – Whole-genome low coverage data (>4x)
  – Full exome data at deep coverage (>50x)
  – High-density genotyping at subsets of sites
– Moved from the Pilot into Phase 1 of the project
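A rough way to see why sample size matters for the project's frequency goals is to ask how likely a variant is to be present at all among the sampled chromosomes. The sketch below assumes independent sampling of diploid individuals. The function name `prob_sampled` and the smaller sample sizes are illustrative and do not come from the project. Presence in the sample is only an upper bound on discovery, since low coverage data will miss some alleles that are present.

```python
def prob_sampled(freq: float, n_samples: int) -> float:
    """Probability that an allele at population frequency `freq` appears
    at least once among 2 * n_samples independently drawn chromosomes."""
    n_chromosomes = 2 * n_samples  # diploid samples
    return 1.0 - (1.0 - freq) ** n_chromosomes


# 2,500 is the planned sample count from the slide; 100 and 500 are
# illustrative sizes chosen only for comparison.
for n in (100, 500, 2500):
    for freq in (0.001, 0.01):
        print(f"n={n:>5}  freq={freq:.3f}  P(present)={prob_sampled(freq, n):.4f}")
```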

New data from new populations
Data type              Pilot               Phase 1 (now)
Deep genomes           6                   -
Low coverage genomes   179                 1,094
Deep exonic            697 (1,000 genes)   977 (full exomes)
Chip genotypes         -                   1,542 (OMNI2.5)
Sample origin          Pilot               Phase 1 (now)
Africa                 YRI                 LWK, ASW
Asia                   JPT, CHB            CHS
Europe                 CEU                 GBR, FIN, IBS, TSI
Americas (admixed)     -                   MXL, PUR, CLM

Detected new variants
Variant        Pilot    Phase 1 (now)
Total SNP      15.2M    38.9M
Known SNP      6.8M     8.5M
Novel SNP      8.4M     30.4M
Short INDELs   1.3M     4.7M**
ftp://ftp.1000genomes.ebi.ac.uk
**Estimated from chromosome 20. Credit: Gerton Lunter
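The SNP counts quoted on this slide can be checked for internal consistency (known plus novel should equal the total) and turned into a novel fraction. This is a small arithmetic sketch. The dictionary layout and the helper name `novel_fraction` are illustrative choices, and the numbers are copied from the slide in millions of sites.

```python
# SNP counts in millions of sites, copied from the slide.
counts = {
    "Pilot": {"total": 15.2, "known": 6.8, "novel": 8.4},
    "Phase 1": {"total": 38.9, "known": 8.5, "novel": 30.4},
}


def novel_fraction(row: dict) -> float:
    """Fraction of all SNP sites in a call set that are novel."""
    return row["novel"] / row["total"]


for name, row in counts.items():
    # Known + novel should add up to the total, within rounding of the slide.
    assert abs(row["known"] + row["novel"] - row["total"]) < 0.05
    print(f"{name}: {novel_fraction(row):.0%} of SNP sites are novel")
```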

Improved completeness and accuracy
Call set   Samples   Sensitivity (HapMap3.3)   Sensitivity (OMNI polymorphic sites)   FDR (OMNI monomorphic sites)
Pilot      […]       […]%                      98.49%                                 73.02%**
ASHG’      […]       […]%                      97.55%                                 5.41%
Phase 1    1,[…]     […]%                      98.41%                                 2.11%
**Fraction of the 59,721 sites on the OMNI2.5 chip, designed based on early Pilot data variant call sets, that turned out to be monomorphic
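The sensitivity and FDR columns on this slide compare a call set against genotyping chip sites. The sketch below shows one plausible way to compute such figures from sets of sites. It is an assumption, not the project's actual evaluation pipeline: sensitivity is taken as the fraction of chip-polymorphic sites that were called, and the FDR estimate as the fraction of chip-assayed calls that the chip found monomorphic. All function names and the toy sites are hypothetical.

```python
def sensitivity(called: set, polymorphic: set) -> float:
    """Fraction of sites known to be polymorphic that appear in the call set."""
    return len(called & polymorphic) / len(polymorphic)


def fdr_estimate(called: set, polymorphic: set, monomorphic: set) -> float:
    """Among called sites that were assayed on the chip, the fraction that
    the chip reports as monomorphic (assumed definition, see lead-in)."""
    assayed_calls = called & (polymorphic | monomorphic)
    return len(assayed_calls & monomorphic) / len(assayed_calls)


# Toy (chromosome, position) sites, invented purely for illustration.
called = {("20", 100), ("20", 200), ("20", 300), ("20", 400)}
chip_polymorphic = {("20", 100), ("20", 200), ("20", 500)}
chip_monomorphic = {("20", 300), ("20", 600)}

print(f"sensitivity  = {sensitivity(called, chip_polymorphic):.2%}")
print(f"FDR estimate = {fdr_estimate(called, chip_polymorphic, chip_monomorphic):.2%}")
```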

Exome sequencing data
[Figure: exome data volume [TB] over time. Credit: Paul Flicek]

Exome variants
– ~30 Mb aggregate exon target length
– +/-50 bp beyond exon boundaries analyzed
– Based on ~half the data analyzed (458 samples)
– ~400,000 SNPs
– ~15,000 INDELs
Credit: Alistair Ward, Kiran Garimella, Fuli Yu

Sensitivity of low coverage whole genome data measured against exomes
[Figure: number of sites vs. count of alternate allele in exomes (in 688 shared samples). Series: number of sites in exome data; number of sites also found in low coverage whole genome data. Marked threshold: AF > 0.5%. Credit: Erik Garrison]

Site concordance is very high above 1% allele frequency
[Figure: number of sites vs. count of alternate allele in low coverage data (in 688 shared samples). Series: number of sites in low coverage data; number of sites also found in exome data. Marked threshold: AF > 0.5%. Credit: Erik Garrison]
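The two comparisons between exome and low coverage call sets bin sites by alternate allele count in the 688 shared samples and ask what fraction of each bin is also found in the other data set. A minimal sketch of that bookkeeping follows. It also converts the frequency thresholds mentioned on the slides (0.5% and 1%) into allele counts for 688 diploid samples. The function names and the toy site identifiers are hypothetical.

```python
import math
from collections import Counter

N_SHARED_SAMPLES = 688  # shared samples, from the slides


def min_count_for_af(af: float, n_samples: int = N_SHARED_SAMPLES) -> int:
    """Smallest alternate allele count whose frequency exceeds `af`
    among 2 * n_samples chromosomes."""
    return math.floor(af * 2 * n_samples) + 1


def overlap_by_count(ref_counts: dict, other_sites: set) -> dict:
    """For each alternate allele count in the reference call set, the
    fraction of its sites that are also present in the other call set."""
    total = Counter(ref_counts.values())
    shared = Counter(c for site, c in ref_counts.items() if site in other_sites)
    return {c: shared[c] / total[c] for c in sorted(total)}


# Toy data: site id -> alternate allele count in the reference call set.
exome_counts = {"a": 1, "b": 1, "c": 7, "d": 7, "e": 20}
low_coverage_sites = {"b", "c", "d", "e"}

print(min_count_for_af(0.005), min_count_for_af(0.01))
print(overlap_by_count(exome_counts, low_coverage_sites))
```

Swapping the roles of the two call sets gives the reverse comparison.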

Genotypes are accurate
– Average low coverage depth is ~5x
– We obtain genotypes by sharing data between samples (using imputation-related methods)
             HomRef   Het     HomAlt   Overall
Error rate   0.16%    0.76%   0.39%    0.37%
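Per-class genotype error rates like those on this slide can be computed by comparing called genotypes with a trusted set at the same sites. The sketch below assumes genotypes are coded as alternate allele counts (0, 1, 2) and that errors are stratified by the trusted genotype. The slide does not say which stratification was used, so that choice, the function name `error_rates`, and the toy data are all assumptions.

```python
from collections import defaultdict

LABELS = {0: "HomRef", 1: "Het", 2: "HomAlt"}


def error_rates(truth: list, called: list) -> dict:
    """Discordance rate per trusted genotype class, plus the overall rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, c in zip(truth, called):
        totals[LABELS[t]] += 1
        errors[LABELS[t]] += int(t != c)
    rates = {label: errors[label] / totals[label] for label in totals}
    rates["Overall"] = sum(errors.values()) / sum(totals.values())
    return rates


# Toy genotypes (alternate allele counts), invented for illustration.
truth_gt = [0, 0, 0, 0, 1, 1, 2, 2]
called_gt = [0, 0, 0, 1, 1, 0, 2, 2]

for label, rate in error_rates(truth_gt, called_gt).items():
    print(f"{label}: {rate:.2%}")
```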

Newly discovered SNPs are enriched for functional variants
[Figure: number of sites (0 to 12M) vs. frequency of alternate allele. Credit: Ryan Poplin]
splice-disrupting   621
stop-gain           1,654
non-synonymous      84,358
synonymous          61,155
Credit: Daniel MacArthur, Suganti Balasubramaniam

NON-SNP VARIANTS

Short INDEL variants

Finding structural variants
– Discovery with a number of different methods
– Several types (e.g. deletions, tandem duplications, mobile element insertions) now detectable with high accuracy
– We are pulling in new types for the Phase I data (inversions, de novo insertions, translocations)

Finding Mobile Element Insertions
Credit: Chip Stewart

Detection of non-reference mobile element insertion (MEI) events
Credit: Chip Stewart

MEI allele frequency behavior
– Segregation properties of MEIs are very similar to SNPs
Credit: Chip Stewart

CURRENT AIM: INTEGRATING DATASETS AND VARIANT TYPES

Datasets & variant types
SNP:     GCGTGCTGAG  / GCGTGATGAG
MNP:     GCGTGCCTGAG / GCGTGAGTGAG
INDEL:   GCGTGCCTGAG / GCGTG--TGAG
Also shown: SV, SNP array data
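The short variant types on this slide can be told apart mechanically from a pair of aligned alleles: a gap means an INDEL, one mismatch a SNP, and a run of adjacent mismatches an MNP. The sketch below illustrates that rule. It assumes the sequence fragments on the slide pair up as reference and alternate alleles in the order SNP, MNP, INDEL, which is a reading of the slide layout. The function name `classify_aligned` is hypothetical, and structural variants are out of scope for this representation.

```python
def classify_aligned(ref: str, alt: str) -> str:
    """Classify a pair of equal-length aligned alleles ('-' marks a gap)."""
    if len(ref) != len(alt):
        raise ValueError("aligned alleles must have equal length")
    if "-" in ref or "-" in alt:
        return "INDEL"
    mismatches = [i for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    if not mismatches:
        return "identical"
    if len(mismatches) == 1:
        return "SNP"
    # All mismatching positions are adjacent: one multi-nucleotide change.
    if mismatches[-1] - mismatches[0] == len(mismatches) - 1:
        return "MNP"
    return "multiple variants"


# Allele pairs as read from the slide (pairing is an assumption).
pairs = [
    ("GCGTGCTGAG", "GCGTGATGAG"),
    ("GCGTGCCTGAG", "GCGTGAGTGAG"),
    ("GCGTGCCTGAG", "GCGTG--TGAG"),
]
for ref, alt in pairs:
    print(ref, alt, classify_aligned(ref, alt))
```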

Reconstruct haplotypes including all variant types, using all datasets
[Figure: haplotypes combining deletions, SNPs (from LC, EX, OMNI), and indels. Credit: Goncalo Abecasis]

ADDITIONAL POPULATIONS

Continental & admixed populations

Local ancestry deconvolution
[Figure: local ancestry for Colombian child 1 and Colombian child 2. Credit: Simon Gravel]

WHAT ARE WE DELIVERING?

Data and resources
– Comprehensive catalog of human variants
  – SNPs, short INDELs
  – MNPs, structural variations
– Sites and allele frequency estimates in “normal” genomes that can be used in interpreting rare and common variants in medical sequencing projects
– Imputation panels to help accurate genotype calling in medical sequencing projects
– Genotyping chips based on new variants

Data delivery
– Bulk downloads
– Browser
  – Currently based on August 2010 data (to be updated)
  – Allows retrieval of data “slices” (both VCF and BAM)

The 1000GP is a driver for method and tool development
– New data formats (BAM, VCF) developed by the 1000GP are now adopted by the entire genomics community
– Tools (read mappers such as BWA and MOSAIK; variant callers, including those for SVs)
– Data processing protocols (BQ recalibration, duplicate removal, etc.)
– Imputation and haplotype phasing methods

Fraction of variant sites present in an individual that are NOT already represented in dbSNP
Date              Fraction not in dbSNP
February, […]     […]%
February, […]     […]%
April, […]        […]%
February, 2011    2%
May 2011 (now)    1%
Credit: Ryan Poplin, David Altshuler

[Figure: sample collection timeline, April 2009 to April 2012, with milestones Phase I (1,150), Phase II (1,721), and Phase III (2,500)]
Populations shown, with targets and DNA source:
– MAB (target – 100T); DNA from LCL
– AJM (target – 80T); DNA from blood
– FIN (100S); DNA from LCL
– PUR (70T); DNA from blood
– CHS (100T); DNA from LCL
– CLM (70T); DNA from LCL
– IBS (84/100T); DNA from LCL
– PEL (70T); DNA from blood
– CDX (100S); DNA: 17 from blood, 83 from LCL
– Sierra Leone (target – 100T); DNA from LCL
– GBR (96/100S); DNA from LCL
– KHV (82/100) – 15 trios; DNA from blood
– ACB (28/79T) – 14 trios; DNA from blood
– PJL (target – 100T); DNA from blood
– GWD (target – 100T); DNA from LCL
– Nigeria (target – 100T); DNA from LCL
– Bengalee (target – 100T)
– Sri Lankan (target – 100T)
– Tamil (target – 100T)
– GIH vs. Sindhi (target – 100T)

Credits
★ 1000G Tutorial at ICHG 2011
★ Community Meeting in Spring 2012