Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015.



Last Lecture
Genome-wide association studies (GWAS) have identified thousands of disease-associated loci
Large consortia perform meta-analyses to further increase the sample size (power) and detect additional loci
GWAS is limited by the chip design, and rare variants are rarely explored

Genetic Spectrum of Complex Diseases (figure; labels: GWAS, Sequencing, Linkage)

Outline
Background
From sequence data to genotype
Rare Variant Tests

Human Genome and Single Nucleotide Polymorphisms (SNPs)
23 chromosome pairs
3 billion bases
A single nucleotide change between pairs of chromosomes
– E.g. A/A or G/G: homozygote; A/G: heterozygote
Haplotype1: AAGGGATCCAC
Haplotype2: AAGGAATCCAC

Association Study in Case Control Samples (figure: five sequences with the variable sites SNP1 to SNP5 marked, shown against disease status)
CAGATCGCTGGATGAATCGCATC
CGGATTGCTGCATGGATCGCATC
CAGATCGCTGGATGAATCGCATC
CAGATCGCTGGATGAATCCCATC
CGGATTGCTGCATGGATCCCATC

– Only a subset of functional elements includes common variants
– Rare variants are more numerous and thus will point to additional loci

History of DNA Sequencing

Sequencing Cost

A Road to Discover Human Genome

Current Genome Scale Approaches
Deep whole genome sequencing
– Expensive; can currently be applied only to a limited number of samples
– Most complete ascertainment of all variation
Low coverage whole genome sequencing
– Modest cost, typically samples
– Complete ascertainment of common variation
– Less complete ascertainment of rare variants
Exome capture and targeted region sequencing
– Modest cost, high coverage
– Most interesting part of the genome

Next Generation Sequencing
Commercial platforms produce gigabases of sequence rapidly and inexpensively
– ABI SOLiD, Illumina Solexa, Roche 454, Complete Genomics, and others…
Sequence data consist of thousands or millions of short sequence reads with moderate accuracy
– Error rates of 0.5 – 1.0% per base may be typical
High-throughput but hard to assemble

A Typical Pipeline
Shotgun Sequencing Reads → Read Alignment Software → Mapped Reads → Single Marker Caller → Polymorphic Sites → Haplotype-based Caller → Individual Genotypes

Short read alignment (figure labels: Human source, Sequencer, Sequencing machine)
Reads from new sequencing machines are short: bp
And you get MILLIONS of them
Need to map them back to the human reference

Alignment
Reference sequence: actgtagattagccgagtagctagctagtcgat
Read: ccgagaagctag
Find the best match for each read in a reference sequence
– Hashing is time and memory consuming for millions of reads and a billion-base long reference
– Errors in reads
– Each read may be mapped to multiple positions
– Individual polymorphisms
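The matching task on this slide can be sketched with a brute-force scan. This is only an illustration of "find the best match", assuming mismatches are counted by Hamming distance and ignoring indels; it is not how the aligners named in these slides work internally. The function names `hamming` and `best_match` are made up for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(reference, read):
    """Slide the read along the reference and return (position, mismatches).

    Ties are resolved in favour of the leftmost position.
    """
    candidates = (
        (hamming(reference[pos:pos + len(read)], read), pos)
        for pos in range(len(reference) - len(read) + 1)
    )
    mismatches, pos = min(candidates)
    return pos, mismatches


# The reference and read shown on the slide.
reference = "actgtagattagccgagtagctagctagtcgat"
read = "ccgagaagctag"
position, mismatches = best_match(reference, read)
print(position, mismatches, reference[position:position + len(read)])
```

The scan costs on the order of (reference length × read length) per read, which is why the slide points to hashing and other indexing strategies for millions of reads against a billion-base reference.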

Existing Alignment by Category Hashing reference genome – SOAP1, MOSAIK, PASS, BFAST, … Hashing short reads – Eland, MAQ, SHRiMP, … Merge-sorting reference together with reads – Slider Based on Burrows-Wheeler Transform – BWA, SOAP2, Bowtie, … Li and Durbin (2009), Bioinformatics 25 (14):

After Alignment
Each read is mapped to the reference genome with a tolerated number of mismatches
– Mismatches allow us to discover individual variation
Each site of the reference genome is covered by multiple, unevenly distributed reads
– Some sites might not be covered

Coverage (High vs Low): which one has more power to detect variations? (figure labels: Genome, Genome 1, Genome 2, Genome 3, Genome 4, Reads)
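One rough way to reason about the coverage question is to assume that the number of reads covering a site is Poisson distributed with mean equal to the average depth, and that each read at a heterozygous site carries either allele with probability 1/2. Both are simplifying assumptions for this sketch, and the function names and the `min_alt_reads` threshold are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k events under a Poisson(mean) model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_uncovered(mean_depth):
    """Probability that a site receives no reads at all."""
    return poisson_pmf(0, mean_depth)


def prob_het_detected(mean_depth, min_alt_reads=2):
    """Probability of seeing at least min_alt_reads reads with the non-reference allele.

    At a heterozygous site, reads carrying one particular allele follow
    Poisson(mean_depth / 2) under the assumptions stated above.
    """
    alt_mean = mean_depth / 2.0
    return 1.0 - sum(poisson_pmf(k, alt_mean) for k in range(min_alt_reads))


for depth in (2, 4, 10, 30):
    print(depth, round(prob_uncovered(depth), 4), round(prob_het_detected(depth), 4))
```

Under these assumptions, high coverage detects an individual's variants more reliably, while low coverage spreads the same sequencing effort over more individuals.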

Genotype Calling from Sequence Data
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
ATAGCTAG A TAGCTGATGAGCCCGATCGCTGCTAGCTC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
Observed Data: 2 A and 3 C
Predicted Genotype: A/C or A/A or C/C

A Simple Model At one site, Na reads carry A, Nb reads carry B
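The formula for this model is not present in the transcript. One common way to write it, assuming independent reads, a per-base error rate $e$ (a symbol introduced here for illustration) and a heterozygote that yields either allele with probability 1/2, is a binomial likelihood for each genotype:

```latex
\begin{aligned}
P(N_a, N_b \mid A/A) &= \binom{N_a + N_b}{N_a}\,(1-e)^{N_a}\, e^{N_b} \\
P(N_a, N_b \mid A/B) &= \binom{N_a + N_b}{N_a}\,\left(\tfrac{1}{2}\right)^{N_a + N_b} \\
P(N_a, N_b \mid B/B) &= \binom{N_a + N_b}{N_a}\, e^{N_a}\,(1-e)^{N_b}
\end{aligned}
```

The binomial coefficient is the same for all three genotypes, so it cancels when the genotypes are compared and is often omitted.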

Inference with no reads
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads: none
Possible Genotypes:
P(reads|A/A) = 1.0
P(reads|A/C) = 1.0
P(reads|C/C) = 1.0

Inference with short read data
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
Possible Genotypes:
P(reads|A/A) = P(C observed, read maps | A/A)
P(reads|A/C) = P(C observed, read maps | A/C)
P(reads|C/C) = P(C observed, read maps | C/C)

Inference assuming error of 1%
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
Possible Genotypes:
P(reads|A/A) = 0.01
P(reads|A/C) = 0.50
P(reads|C/C) = 0.99

As data accumulate …
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
Possible Genotypes:
P(reads|A/A) =
P(reads|A/C) = 0.25
P(reads|C/C) = 0.98

As data accumulate …
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
Possible Genotypes:
P(reads|A/A) =
P(reads|A/C) =
P(reads|C/C) = 0.97

As data accumulate …
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
Possible Genotypes:
P(reads|A/A) =
P(reads|A/C) =
P(reads|C/C) =

In the “end”
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
ATAGCTAG A TAGCTGATGAGCCCGATCGCTGCTAGCTC
P(reads|A/A) =
P(reads|A/C) =
P(reads|C/C) =
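The way the likelihoods change as reads accumulate can be reproduced with a short calculation. This is a minimal sketch assuming independent reads, a 1% per-base error rate as on the earlier slide, and that an erroneous read always shows the other allele; the function name `genotype_likelihoods` is made up for this example.

```python
def genotype_likelihoods(bases, error=0.01, alleles=("A", "C")):
    """P(reads | genotype) for the three genotypes at one biallelic site.

    Assumes independent reads and a symmetric per-base error rate, with an
    erroneous read always showing the other allele (a simplification).
    """
    a, c = alleles
    likelihood = {f"{a}/{a}": 1.0, f"{a}/{c}": 1.0, f"{c}/{c}": 1.0}
    for base in bases:
        p_if_a_hom = 1 - error if base == a else error
        p_if_c_hom = 1 - error if base == c else error
        likelihood[f"{a}/{a}"] *= p_if_a_hom
        # A heterozygote emits each allele with probability 1/2.
        likelihood[f"{a}/{c}"] *= 0.5 * p_if_a_hom + 0.5 * p_if_c_hom
        likelihood[f"{c}/{c}"] *= p_if_c_hom
    return likelihood


# Bases observed at the site, in the order the reads are added on the slides.
observed = ["C", "C", "C", "A", "A"]
for n in range(len(observed) + 1):
    print(n, genotype_likelihoods(observed[:n]))
```

With no reads every genotype has likelihood 1.0, and one C read gives 0.01, 0.5 and 0.99 under these assumptions, in line with the values shown on the earlier slides.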

Not the “end” yet
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
ATAGCTAG A TAGCTGATGAGCCCGATCGCTGCTAGCTC
P(reads|A/A) =
P(reads|A/C) =
P(reads|C/C) =
Making a genotype call requires combining sequence data with prior information

Not the “end” yet
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
ATAGCTAG A TAGCTGATGAGCCCGATCGCTGCTAGCTC
P(reads|A/A) =    Prior(A/A) =    P(A/A|reads) < 0.01
P(reads|A/C) =    Prior(A/C) =    P(A/C|reads) =
P(reads|C/C) =    Prior(C/C) =    P(C/C|reads) =
Base Prior: every site has 1/1000 probability of varying

Population Based Prior
Reference Genome:
5’-ACTGGTCGATGCTAGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
Sequence Reads:
GCTAGCTGATAGCTAG C TAGCTGATGAGCCCGA
AGCTGATAGCTAG C TAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAG C TAGCTGATGAGCC
TAGCTGATAGCTAG A TAGCTGATGAGCCCGAT
ATAGCTAG A TAGCTGATGAGCCCGATCGCTGCTAGCTC
P(reads|A/A) =    Prior(A/A) = 0.04    P(A/A|reads) <.001
P(reads|A/C) =    Prior(A/C) = 0.32    P(A/C|reads) =
P(reads|C/C) =    Prior(C/C) = 0.64    P(C/C|reads) = <.001
Population Based Prior: Use frequency information from examining others at the same site. E.g. P(A) = 0.2
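Combining the read likelihoods with a population based prior is an application of Bayes' rule. The sketch below assumes Hardy-Weinberg proportions for the prior (which gives the slide's 0.04, 0.32 and 0.64 for P(A) = 0.2) and a simple 1% error model for the five reads; the function names are hypothetical, and the exact posterior values depend on the error model, so they need not reproduce the slide's rounded figures.

```python
def hardy_weinberg_prior(freq_a):
    """Genotype priors from an allele frequency, assuming Hardy-Weinberg proportions."""
    freq_c = 1.0 - freq_a
    return {"A/A": freq_a ** 2, "A/C": 2 * freq_a * freq_c, "C/C": freq_c ** 2}


def posterior(likelihood, prior):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    joint = {g: likelihood[g] * prior[g] for g in likelihood}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Likelihoods for 2 A reads and 3 C reads under an assumed 1% error rate.
error = 0.01
likelihood = {
    "A/A": (1 - error) ** 2 * error ** 3,
    "A/C": 0.5 ** 5,
    "C/C": error ** 2 * (1 - error) ** 3,
}

prior = hardy_weinberg_prior(0.2)
post = posterior(likelihood, prior)
call = max(post, key=post.get)
print(prior)
print(post)
print("called genotype:", call)
```

The same `posterior` function works with an individual based prior or an imputation based prior; only the `prior` dictionary changes.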

Prior Information
Individual based prior
– Equal probability of showing polymorphism
– 1/1000 bases different from reference
– Error Free and Poisson distribution
– Single sample, single site
Population based prior
– Estimate frequency from many individuals
– Multiple samples, single site
Haplotype/Imputation based prior
– Jointly model flanking SNPs, use haplotype information
– Important for low coverage sequence data
– Multiple samples, multiple sites

Comparisons of Different Genotype Calling Methods

Rare Variant Tests
Genotype calling is the first step of the journey
Identify SNPs/genes associated with phenotype
Sequencing provides a more comprehensive way to study the genome
– Discover more rare variants

– Only a subset of functional elements includes common variants
– Rare variants are more numerous and thus will point to additional loci

Genetic Spectrum of Complex Diseases (figure; labels: GWAS, Sequencing). McCarthy MI et al. Nat Rev Genet. 2008

Several Approaches to Study Rare Variants
Deep whole genome sequencing
– Can only be applied to limited numbers of samples
– Most complete ascertainment of variation
Exome capture and targeted sequencing
– Can be applied to moderate numbers of samples
– SNPs and indels in the most interesting 1% of the genome
Low coverage whole genome sequencing
– Can be applied to moderate numbers of samples
– Very complete ascertainment of shared variation
New Genotyping Arrays and/or Genotype Imputation
– Examine low frequency coding variants in 100,000s of samples
– Current catalogs include 97-98% of sites detectable by sequencing an individual

Single SNP Test for Rare Variant
Rare variants are hard to detect
Power/sample size depends on both frequency and effect size
Rare causal SNPs are hard to identify even with large effect size

Single SNP Test for Rare Variant
Disease prevalence ~10%
Type I error 5 × 10^-6
To achieve 80% power
Equal number of cases and controls
Minor Allele Frequency (MAF) = 0.1, 0.01,
Required sample size = 486, 3545, 34322,

Alternatives to Single Variant Test
Collapsing Method (Burden Test)
Group rare variants in the same gene/region
Score each individual
– Presence or absence of rare copy
– Weight each variant
Use individual score as a new “genotype”
Test in a regression framework
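The collapsing steps above can be sketched as follows. This is a minimal illustration rather than any specific published burden test: the MAF threshold of 0.01, the toy genotype matrix, the phenotype values and the function names are all assumptions made for the example, and the regression step is reduced to a least-squares slope without covariates or a p-value.

```python
def burden_scores(genotypes, maf, maf_threshold=0.01, weights=None):
    """Collapse rare variants in one gene into per-individual scores.

    genotypes: one row per individual, one column per variant (0, 1 or 2 copies).
    Returns (carrier, weighted): presence or absence of any rare copy, and a
    weighted count of rare copies.
    """
    rare = [j for j, freq in enumerate(maf) if freq < maf_threshold]
    if weights is None:
        weights = [1.0] * len(maf)
    carrier = [1 if any(row[j] > 0 for j in rare) else 0 for row in genotypes]
    weighted = [sum(weights[j] * row[j] for j in rare) for row in genotypes]
    return carrier, weighted


def regression_slope(x, y):
    """Least-squares slope of y on x, treating the score as a new 'genotype'."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Toy data: 6 individuals, 4 variants; the third variant is common and is excluded.
genotypes = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
maf = [0.005, 0.002, 0.2, 0.008]
phenotype = [1.0, 2.1, 0.9, 3.2, 1.1, 2.0]

carrier, weighted = burden_scores(genotypes, maf)
print(carrier, weighted, regression_slope(weighted, phenotype))
```

Passing a `weights` list, for example one that up-weights rarer variants, changes only the weighted score; the carrier indicator ignores weights.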

Challenges
Disease is caused by multiple rare variants in an additive manner
It is hard to separate causal and null SNPs
– Including all rare variants will dilute the true signals
The effect size of each rare variant varies

Power of Burden Test
Power tabulated in collections of simulated data
Combining variants can greatly increase power
Currently, appropriately combining variants is expected to be a key feature of rare variant studies.

Impact of Null Variants Including non-disease variants reduces power Power loss is manageable, combined test remains preferable to single marker tests

Impact of Missing Disease Alleles Missing disease alleles loses power Still better than single variant test

Sequence Kernel Association Test (SKAT)
1. y_i: quantitative or binary phenotype;
2. α'X_i: fixed effects of covariates;
3. β'G_i: genetic effects of the SNPs in one gene;
4. ε_i: random error.
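The model equation itself is missing from the transcript. Assembled from the four terms listed on the slide, and assuming any intercept is absorbed into the covariate vector, a form consistent with them for a quantitative trait would be:

```latex
y_i = \alpha' X_i + \beta' G_i + \varepsilon_i
```

For a binary phenotype the same linear predictor would usually enter through a logistic link, $\operatorname{logit} P(y_i = 1) = \alpha' X_i + \beta' G_i$, which is stated here as the standard convention rather than something shown on the slide.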

Regression based method Score statistic Kernel
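The formulas behind these three labels are not in the transcript. As described by Wu et al. (2011), the SKAT score statistic has the form Q = (y − μ̂)' K (y − μ̂) with a weighted linear kernel K = G W G'. The sketch below computes Q only, for a quantitative trait under an intercept-only null model (so μ̂ is the sample mean), with Beta(1, 25) weights on the minor allele frequency. The function names and toy data are made up, and the p-value calculation is omitted.

```python
def beta_1_25_density(maf):
    """Density of a Beta(1, 25) distribution at maf, which equals 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


def skat_q(genotypes, phenotype, maf):
    """SKAT-style score statistic Q = sum_j w_j * (G_j' (y - mean(y)))**2.

    Assumes an intercept-only null model; w_j is the squared Beta(1, 25)
    density at the MAF of variant j.
    """
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residual = [y - mean_y for y in phenotype]
    q = 0.0
    for j, freq in enumerate(maf):
        weight = beta_1_25_density(freq) ** 2
        score_j = sum(genotypes[i][j] * residual[i] for i in range(n))
        q += weight * score_j ** 2
    return q


# Toy data: two rare variants with opposite directions of effect.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
phenotype = [2.0, 2.2, -0.1, 0.1, 1.0, 1.1]
maf = [0.005, 0.005]
print(skat_q(genotypes, phenotype, maf))
```

Because each per-variant score is squared before summing, variants with opposite directions of effect add to Q instead of cancelling, which is one way this statistic differs from a simple burden score.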

Maximizing the Power
Power depends on summed frequency
– Choose the threshold for defining rare carefully
Enriched functional variants in cases increase power
– Focus on loss of function variants only
Use a more efficient design
– For quantitative traits, focus on individuals with extreme trait values
– For binary traits, focus on individuals with family history of disease

Discussion
Analysis of rare variants is an active research area
The weight for each SNP is key
What to do if the samples are related?
Most tests rely on permutation
– Computationally intensive
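The permutation idea, and why it is computationally intensive, can be sketched as follows. The example assumes a case-control design with one burden score per individual and uses the difference in mean score as the test statistic; both choices, along with the function names and toy data, are illustrative assumptions.

```python
import random


def mean_difference(scores, is_case):
    """Mean score in cases minus mean score in controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(scores, is_case, n_permutations=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(scores, is_case))
    labels = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if abs(mean_difference(scores, labels)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: all rare-variant carriers are cases.
scores = [2, 1, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0]
is_case = [1] * 6 + [0] * 6
print(permutation_p_value(scores, is_case))
```

The smallest p-value this estimator can report is 1 / (n_permutations + 1), so reaching a threshold such as 5 × 10^-6 needs hundreds of thousands of permutations per gene.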

Reference
The 1000 Genomes Project (2010) A map of human genome variation from population-scale sequencing. Nature 467:
Nielsen R, Paul JS et al. (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet
Li Y, Chen W et al. (2012) Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data. Statistics in Biosciences.
Li Y et al. (2011) Low-coverage sequencing: Implication for design of complex trait association studies. Genome Research 21:
Chen W, Li B et al. (2013) Genotype calling and haplotyping in parent-offspring trios. Genome Research.

Reference
Raychaudhuri S. Mapping rare and common causal alleles for complex human diseases. Cell Sep 30;147(1):
Li and Leal (2008) Am J Hum Genet 83:
Madsen BE, Browning SR (2009) A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genet 5(2)
Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zöllner S (2010) Am J Hum Genet 87:
Wu M, Lee S, et al. (2011) Am J Hum Genet