Analysis and Applications of next-gen sequencing Gerton Lunter Wellcome Trust Centre for Human Genetics GMS lecture, Feb 2015.


1 Analysis and Applications of next-gen sequencing Gerton Lunter Wellcome Trust Centre for Human Genetics GMS lecture, Feb 2015

2 Today A brief history of DNA sequencing Emerging clinical applications Current problems, questions, opportunities Methods in sequence analysis (an idiosyncratic tour) Practical

3 A BRIEF HISTORY OF DNA SEQUENCING

4 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 nt)

5 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 nt) 1992: Craig Venter establishes first sequencing factory ABI 373A (picture courtesy Ebay)

6 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 nt) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established

7 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 nt) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort

8 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 nt) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter)

9 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project Hutchison 2007, PMID

10 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project 2006: Illumina launches GA-II 50 bp reads, $500 / Gb ($48,000 / human genome)

11 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project 2006: Illumina launches GA-II 2008: First individual genome

12 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project 2006: Illumina launches GA-II 2008: First individual genome 2012: 1000 Genomes Project

13 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project 2006: Illumina launches GA-II 2008: First individual genome 2012: 1000 Genomes Project 2014: Illumina launches HiSeq X Ten HiSeq X Ten: $1000 per human genome 6 genomes / day / machine

14 Reading the genome 1977: Fred Sanger sequenced ΦX174 (5386 bp) 1992: Craig Venter establishes first sequencing factory 1993: Sanger Center established 1996: Yeast (12 Mb): first eukaryotic genome, consortium effort 1996: ABI 3700 introduced; Celera founded (Venter) ‘ : Human Genome Project 2006: Illumina launches GA-II 2008: First individual genome 2012: 1000 Genomes Project 2014: Illumina launches HiSeq X Ten 2015: Genomics England’s 100,000 Genomes Project (pilot phase)

15 Genetic testing today Sanger sequencing – Still considered gold standard – Cheap; low throughput (~1000 bp, 96/384 samples at a time) SNP-chip – Cheap, especially for cohorts (GWAS) – Only interrogates (pre-specified) common variants Amplicon sequencing – Cheap; up to ~100 genes at a time – Requires development of specific primer pool Exome sequencing – Still cheaper than whole genome (but not for long) – All exons, UTRs, some intergenic material of interest Whole genome sequencing – $1000 only available to large centers – Turnaround time weeks; complexity in handling data

16 The future of genetic testing: Oxford Nanopore? Hand-held sequencing device Single molecule, long reads Cheap (device and reagents) High error rate (~10-15%) Low throughput (100s Mb/day) Not yet commercially available Promise: – Sequencing at the GP (or at home?) – DNA-based diagnosis of infections – Sequencing in remote locations

17 EMERGING CLINICAL APPLICATIONS

18 Sequencing of suspected novel mutations Sequence parents + child to identify novel mutations Done as part of WGS500 project at WTCHG

19 Fetal screening Cell-free fetal DNA (cfDNA) is present in a pregnant mother’s blood Can be used to identify trisomies, or full fetal genome including de novo mutations Lower false-positive rates and higher positive predictive value for detecting chr. 18 and 21 trisomies than existing tests. PMID: Identified essentially complete fetal genome at 98.1% accuracy. Identified 39 of 44 de novo mutations. Currently still limited specificity. PMID:

20 Tracking an outbreak of MRSA Sequencing bacteria and identifying single mutations allows tracking of the outbreak (“who infects whom”) “Whole-genome sequencing data were used to propose and confirm that MRSA carriage by a staff member had allowed the outbreak to persist during periods without known infection on the SCBU and after a deep clean.” (PMID: )

21 Existing NICE guidelines: – Only test BRCA1/2, only in patients with family history With Illumina sequencing – Offer genetic testing to all breast-cancer patients And family members in case mutation is identified – Tests 100 cancer predisposition genes at once – Significantly lower cost than existing test – Faster turnaround time Challenges – Accurate analysis of data (SNPs, indels, whole exon deletions) – Set up infrastructure to process tests (in centers around UK) – Have genetic testing done by non-geneticists (oncologists) – Added burden to clinicians (15 minute time budget per patient)

22 Cancer sequencing - diagnosis PMID:

23 Cancer sequencing - treatment PMID:

24 PROBLEMS / QUESTIONS / OPPORTUNITIES

25 1. Variant calling In large part under control, but still patchy – SNVs: Germline: Good De novo: Pass Somatic: Weak pass – Indels: Pass / weak pass – Structural variants: pass to fail, depending on type – Challenging loci (HLA, KIR): fail

26 Jan 2015

27 2. B cell repertoire sequencing

28

29 3. Interpretation of variants “Results: The sensitivity of SIFT and PolyPhen was reasonably high (69% and 68% respectively), but their specificity was low (13% and 16%).” PMID: ; ;

30 4. Genetic architecture of disease (& other phenotypes) Height: GWAS hits explain 5% of phenotypic variance; heritability, as estimated by twin studies, explains 80% of phenotypic variance (5% << 80%). The gap between the two is the missing heritability. Eichler et al., Nature Reviews Genetics 2010, PMID Manolio et al., Nature 2009, PMID

31 4. Genetic architecture of disease (& other phenotypes) Height: GWAS hits explain 5% of phenotypic variance; all SNPs explain 45% of phenotypic variance (Yang et al, Nature Genetics PMID); heritability, as estimated by twin studies, explains 80% of phenotypic variance (5% < 45% < 80%). The gap between 5% and 45% is the “hidden” heritability; the gap between 45% and 80% is the missing heritability.

32 Some explanations for missing heritability Many variants of weak effect (‘hidden’ heritability) Interacting loci (epistasis) – Can lead to overestimates of total (additive) heritability Zuk et al., PNAS 2012, PMID Gene-environment interactions Parent-of-origin effects – Direction of effect can depend on whether it was paternal/maternally inherited Kong et al., Nature 2009, PMID Rare variation

33 Rare variation is common! Congenital birth defects affect ~3% of all births (USA) Multiple causes: – environmental effects (e.g. folic acid, alcohol) – recessive mutations (common, ~1%) – rare (familial) or de-novo mutations (including e.g. Down’s syndrome) 8% of population carries (rare) variant >500 kb Itsara et al., AJHG 2009, PMID Each individual carries ~60 novel variants (not inherited from parents) – 1 expected protein-coding mutation per generation

34 55.1% of variants found in 1 sample Rare (young) variants often deleterious

35 Large-scale sequencing projects will help to better understand the disease burden of rare variation [figure not reproduced]

36 Other challenges Logistics – NHS (or most healthcare systems worldwide) not geared up to systematically, electronically collect health records – Combining medical health records from multiple sources will be challenging – Storing and accessing very high data volumes Ethical – Anonymity, gaining public trust and support – Access for drugs companies: beneficial or problematic? – Reporting of incidental findings – Fetal screening; “designer babies”

37 (A SELECTIVE TOUR OF) METHODS IN SEQUENCE ANALYSIS

38 Overview (Bayesian) modeling Hidden Markov models – A few simple examples Formal definition, and algorithms Examples – With a brief primer in population genetics

39 Bayesian inference Model: P(D, y, θ) D = data θ = parameters of interest y = nuisance parameters (but essential to relate D and θ ) Goal: estimate θ Examples: D = sequences y=phylogeny θ=alignment: multiple alignment D = sequences y=alignment θ=phylogeny: phylogenetic inference

40 Bayesian inference Model: P(D, y, θ) D = data θ = parameters of interest y = nuisance parameters (but essential to relate D and θ ) D = sequences y=phylogeny θ=alignment: multiple alignment D = sequences y=alignment θ=phylogeny: phylogenetic inference PMID:

41 Bayesian inference Model: P(D, y, θ) Goal: estimate θ Approach 1: Maximum Likelihood – As estimate θ_ML, use the θ that attains max_{θ, y} P(D, y, θ) Advantage: relatively simple Disadvantage: ignores uncertainty in y and θ [Figure: P plotted against y and θ]

42 Bayesian inference Model: P(D, y, θ) Goal: estimate θ Approach 2: Bayesian inference Find posterior: P(θ | D) = const. × ∫ P(D, y, θ) dy Estimate: maximum a-posteriori (MAP), θ_MAP = argmax_θ P(θ | D), or use posterior average, θ_est = ∫ θ P(θ | D) dθ Advantage: accounts for uncertainty in y and θ Disadvantage: integration over y or θ can be hard (analytical integration, MCMC, HMMs, …) [Figure: P plotted against y and θ]
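
The two approaches can give different answers. The sketch below is a minimal illustration on a discrete toy model, assuming the data D are already fixed so that only θ and y vary; the probability table `JOINT` and the values 0.2 and 0.6 for θ are invented for this example, and sums stand in for the integrals.

```python
from collections import defaultdict

# Hypothetical joint probabilities P(D, y, theta) for one observed data set D.
# Keys are (theta, y); theta is the parameter of interest, y the nuisance parameter.
JOINT = {
    (0.2, "y1"): 0.30, (0.2, "y2"): 0.02, (0.2, "y3"): 0.03,
    (0.6, "y1"): 0.15, (0.6, "y2"): 0.15, (0.6, "y3"): 0.15,
}

def joint_maximum(joint):
    """Approach 1: maximise P(D, y, theta) jointly over theta and y, keep theta."""
    theta, _y = max(joint, key=joint.get)
    return theta

def posterior(joint):
    """Approach 2: sum out y, then normalise to obtain P(theta | D)."""
    marginal = defaultdict(float)
    for (theta, _y), p in joint.items():
        marginal[theta] += p
    total = sum(marginal.values())
    return {theta: p / total for theta, p in marginal.items()}

def map_estimate(post):
    """Maximum a-posteriori estimate: the theta with the largest posterior."""
    return max(post, key=post.get)

def posterior_mean(post):
    """Posterior average of theta."""
    return sum(theta * p for theta, p in post.items())

post = posterior(JOINT)
theta_ml = joint_maximum(JOINT)      # picks the single highest (theta, y) cell
theta_map = map_estimate(post)       # picks the theta with most total mass
theta_mean = posterior_mean(post)
```

In this table the single largest cell belongs to θ = 0.2, but θ = 0.6 carries more probability once y is summed out, which is the uncertainty that the joint maximum ignores.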

43 The fundamental problem in mathematical modeling: Model as many informative features as possible – Biological, statistical intuition BUT: Keep inference (integration / maximization) feasible – Bag of technical tricks Bayesian inference

44 Hidden Markov models A particular type of probabilistic model Generalization of Markov chains Very well suited for sequential data – Think: (evolutionary) time, or DNA sequence position – First developed in context of speech recognition, 50s Has very efficient inference algorithms

45 Some notation P(X,Y,Z): probability of X, Y and Z occurring P(X,Y): probability of X and Y occurring. P(X,Y|Z): probability of X and Y occurring, while it is given that Z occurs also (“conditional on Z”, or “given Z”) Σ X,Y P(X,Y) = 1 Σ X,Y,Z P(X,Y,Z) = 1 (definition) P(X,Y) = Σ Z P(X,Y,Z) P(X) = Σ Y P(X,Y) (definition) P(X|Y) = P(X,Y) / P(Y) (definition) (1) P(Y|X) = P(X|Y) P(Y) / P(X) (Bayes’ rule, follows from (1))
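
These identities can be checked numerically. The following is a small sketch on an invented joint table `P_XY` over two binary variables; the function names are chosen for this example only.

```python
from itertools import product

# Hypothetical joint distribution P(X, Y) over two binary variables.
P_XY = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(x):
    """P(X) = sum over Y of P(X, Y)."""
    return sum(p for (xi, _yi), p in P_XY.items() if xi == x)

def marginal_y(y):
    """P(Y) = sum over X of P(X, Y)."""
    return sum(p for (_xi, yi), p in P_XY.items() if yi == y)

def x_given_y(x, y):
    """P(X | Y) = P(X, Y) / P(Y)."""
    return P_XY[(x, y)] / marginal_y(y)

def y_given_x_direct(y, x):
    """P(Y | X) from the definition, P(X, Y) / P(X)."""
    return P_XY[(x, y)] / marginal_x(x)

def y_given_x_bayes(y, x):
    """P(Y | X) from Bayes' rule, P(X | Y) P(Y) / P(X)."""
    return x_given_y(x, y) * marginal_y(y) / marginal_x(x)

# Largest disagreement between the two routes to P(Y | X); should be ~0.
max_gap = max(abs(y_given_x_bayes(y, x) - y_given_x_direct(y, x))
              for x, y in product((0, 1), repeat=2))
```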

46 Markov model A particular kind of probabilistic model All variables are observed Good for modeling dependencies within sequential data Graphical model notation: X → Y means “Y depends on X” P(S_n | S_1, S_2, …, S_{n-1}) = P(S_n | S_{n-1}) (Markov or forgetting property) P(S_1, S_2, S_3, …, S_n) = P(S_1) P(S_2 | S_1) … P(S_n | S_{n-1}) (follows) [Figure: chain S_1 → S_2 → S_3 → … → S_8 → …]

47 Markov model States: letters in English words Transitions: which letter follows which [Figure: chain S_1 → S_2 → … → S_8 → …] Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL …. S_1 = M, S_2 = R, S_3 = (space), S_4 = S, S_5 = H, …. P(S_n = y | S_{n-1} = x) = P(S_{n-1} S_n = xy) / P(S_{n-1} = x) ≈ (frequency of xy) / (frequency of x) Generated text: UNOWANGED HE RULID THAND TROPONE AS ORTIUTORVE OD T HASOUT TIVE IS MSHO CE BURKES HEST MASO TELEM TS OME SSTALE MISSTISE S TEWHERO

48 Markov model States: triplets of letters Transitions: which (overlapping) triplet follows which [Figure: chain S_1 → S_2 → … → S_8 → …] Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL …. S_1 = “MR ”, S_2 = “R S”, S_3 = “ SH”, S_4 = “SHE”, S_5 = “HER”, …. P(S_n = xyz | S_{n-1} = wxy) = P(wxyz) / P(wxy) ≈ (frequency of wxyz) / (frequency of wxy) Generated text: THERE THE YOU SOME OF FEELING WILL PREOCCUPATIENCE CREASON LITTLED MASTIFF HENRY MALIGNATIVE LL HAVE MAY UPON IMPRESENT WARNESTLY
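
The letter and triplet models differ only in how many preceding letters make up the state. The sketch below is one possible way to estimate the transition frequencies and generate text, assuming the short excerpt quoted on the slides as the only training data; `fit`, `transition_probs` and `generate` are names invented here, and the frequency of a state is counted only where a following letter exists.

```python
import random
from collections import Counter, defaultdict

TEXT = ("MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS "
        "SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL")

def fit(text, order):
    """Count, for every `order`-letter state, which letter follows it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def transition_probs(counts, state):
    """P(next letter | state) = frequency(state + letter) / frequency(state)."""
    followers = counts.get(state, {})
    total = sum(followers.values())
    return {letter: n / total for letter, n in followers.items()}

def generate(counts, order, length, rng):
    """Random walk over states; stops early if a state was never continued."""
    state = rng.choice(sorted(counts))
    out = state
    while len(out) < length and state in counts:
        letters, weights = zip(*sorted(counts[state].items()))
        out += rng.choices(letters, weights=weights)[0]
        state = out[-order:]          # overlapping states: slide by one letter
    return out

rng = random.Random(0)
letter_model = fit(TEXT, order=1)     # states are single letters
triplet_model = fit(TEXT, order=3)    # states are overlapping triplets
sample = generate(triplet_model, 3, 40, rng)
```

With so little training text the triplet model mostly reproduces fragments of the input; the slides' generated examples presumably came from a much longer text.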

49 Markov model States: word pairs Text from: Then churls their thoughts (although their eyes were kind) To thy fair appearance lies To side this title is impanelled A quest of thoughts all tenants to the sober west As those gold candles fixed in heaven's air Let them say more that like of hearsay well I will drink Potions of eisel 'gainst my strong infection No bitterness that I was false of heart Though absence seemed my flame to qualify As easy might I not free When thou thy sins enclose! That tongue that tells the story of thy love Ay fill it full with feasting on your sight Book both my wilfulness and errors down And on just proof surmise accumulate Bring me within the level of your eyes And in mine own when I of you beauteous and lovely youth When that churl death my bones with dust shall cover And shalt by fortune once more re-survey These poor rude lines of life thou art forced to break a twofold truth Hers by thy deeds

50 Hidden Markov model Another special kind of probabilistic model (Bayesian network) The S_i form a Markov chain as before, but are unobserved (hidden) Instead, y_i (dependent on S_i) are observed Generative viewpoint: state S_i “emits” symbol y_i The y_i do not form a Markov chain (= do not satisfy the Markov property). [Figure: hidden chain S_1 → S_2 → … → S_8 → …, with each S_i emitting y_i]
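
Because the states are hidden, the probability of the observed symbols is a sum over all possible state paths. The sketch below compares that brute-force sum with the forward recursion, which gives the same number in time linear in the sequence length. The two-state model and all of its probabilities are invented for illustration and are not taken from the lecture.

```python
from itertools import product

# Hypothetical two-state HMM emitting DNA letters.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.2, "intron": 0.8}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(path, symbols):
    """P(hidden path, observed symbols) as a product of start, transition
    and emission probabilities."""
    p = START[path[0]] * EMIT[path[0]][symbols[0]]
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][sym]
    return p

def likelihood_brute_force(symbols):
    """Sum the joint probability over every possible hidden path."""
    return sum(joint_probability(path, symbols)
               for path in product(STATES, repeat=len(symbols)))

def likelihood_forward(symbols):
    """Forward algorithm: f[s] = P(symbols so far, current state = s)."""
    f = {s: START[s] * EMIT[s][symbols[0]] for s in STATES}
    for sym in symbols[1:]:
        f = {s: EMIT[s][sym] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())
```

The brute-force sum visits 2^n paths for n symbols and is included only as a check on the recursion.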

51 HMMs vs. Markov models

52 Toy splice finder From: Sean R Eddy, Nature Biotech 22(10) 2004

53 Profile HMMs

54 Chris Burge’s ab-initio gene finder from Burge and Karlin, 1997 Predicts genes directly from DNA Modern gene finders also exploit: – Evolutionary conservation – mRNA sequences – Databases of known proteins

55 HMMs for sequence alignment

56 Complexity of inference algorithms: O( L 1 L 2 ) Quadratic in sequence length Can get very expensive! Image credit: https://compbio.soe.ucsc.edu/papers/samspace/
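
The quadratic cost comes from filling one table cell for every pair of positions in the two sequences. As a stand-in for a pair HMM, the sketch below uses plain score-based global alignment (Needleman-Wunsch style dynamic programming), which fills a table of the same shape; the scoring values are arbitrary defaults chosen for this example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best global alignment score, number of table cells filled)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        table[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag,
                              table[i - 1][j] + gap,   # gap in b
                              table[i][j - 1] + gap)   # gap in a
    return table[-1][-1], rows * cols

score, cells = global_alignment_score("GATTACA", "GATTACA")
```

Doubling both sequence lengths roughly quadruples the number of cells, which is why long genomic alignments are usually restricted to a band or anchored segments rather than computed over the full table.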

57 HMMs - Formal definition definitions details Probability of realization (x,q):
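
The slide's formula is not preserved in this transcript. As a hedged sketch using the standard HMM definition, the probability of a realization (x, q), with x the observed symbols and q the state path, is the start probability times one emission per position and one transition per step. The model below ("conserved" versus "neutral" alignment columns, emitting match `m` or mismatch `x`) and its numbers are invented for illustration; `viterbi` then finds the path q that maximizes this probability.

```python
import math
from itertools import product

STATES = ("conserved", "neutral")
START = {"conserved": 0.3, "neutral": 0.7}
TRANS = {"conserved": {"conserved": 0.9, "neutral": 0.1},
         "neutral": {"conserved": 0.05, "neutral": 0.95}}
EMIT = {"conserved": {"m": 0.95, "x": 0.05},
        "neutral": {"m": 0.6, "x": 0.4}}

def log_joint(x, q):
    """log P(x, q): start, then (transition, emission) for each later position."""
    logp = math.log(START[q[0]]) + math.log(EMIT[q[0]][x[0]])
    for prev, cur, sym in zip(q, q[1:], x[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][sym])
    return logp

def viterbi(x):
    """Most probable state path for x, with its log-probability."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][x[0]]) for s in STATES}
    pointers = []
    for sym in x[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            new_v[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][sym])
        pointers.append(ptr)
        v = new_v
    state = max(STATES, key=v.get)
    best_logp = v[state]
    path = [state]
    for ptr in reversed(pointers):    # follow back-pointers to the start
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_logp

def best_logp_brute_force(x):
    """Check: maximise log P(x, q) over all paths by enumeration."""
    return max(log_joint(x, q) for q in product(STATES, repeat=len(x)))

path, logp = viterbi("mmmmxmxx")
```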

58 Inference algorithms for HMMs

59 Further reading

60 phastCons Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Siepel A. et al. Genome Res Aug;15(8):

61 Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Lunter G, et al. Genome Res Feb;18(2): Statistical alignment

62 Intermezzo: Population genetics Wright-Fisher model – Models genealogies in a population – Model moves forward in time (down in picture) – Assumptions: Population size 2N haploid genomes Discrete generations No population structure – At each step: each offspring chooses a parent randomly. [Figure: genealogy with generations running from past (top) to present (bottom)]
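
The model is simple enough to simulate directly. The sketch below is a minimal version, assuming a constant population of `n_genomes` haploid genomes and uniform parent choice as stated above; the function names are invented for this example. It also traces a present-day sample back through the simulated generations to show lineages merging.

```python
import random

def simulate_parents(n_genomes, n_generations, rng):
    """Forward simulation: parents[g][i] is the parent (an index into
    generation g) of genome i in generation g + 1."""
    return [[rng.randrange(n_genomes) for _ in range(n_genomes)]
            for _ in range(n_generations)]

def ancestors_back_in_time(parents, sample):
    """Trace a sample from the final generation backwards.
    Returns the set of distinct ancestors at each step into the past."""
    lineages = set(sample)
    history = [lineages]
    for generation in reversed(parents):
        # Lineages that chose the same parent merge into one.
        lineages = {generation[i] for i in lineages}
        history.append(lineages)
    return history

rng = random.Random(1)
parents = simulate_parents(n_genomes=20, n_generations=200, rng=rng)
history = ancestors_back_in_time(parents, range(20))
lineage_counts = [len(h) for h in history]
```

The number of distinct ancestors can only stay the same or shrink going back in time, since lineages merge but never split.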

63 Intermezzo: Population genetics Sample individuals from the present Trace their history backwards in time Probability for two individuals to choose the same parent is 1/2N Take continuous-time limit (N → infinity, generation time → 0) When tracking k individuals, the rate at which two choose the same parent (they coalesce) is k(k−1)/2 × 1/2N per generation (a chance of 1/2N for each of the k(k−1)/2 pairs) This is Kingman’s coalescent model [Figure: genealogy traced from the present back into the past]
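
Since each of the k(k−1)/2 pairs coalesces with probability 1/2N per generation, the expected waiting time while k lineages remain is the reciprocal of that total rate. The sketch below adds these waiting times to get the expected time to the most recent common ancestor; it is a worked calculation under the constant-size assumption, with `two_n` standing for 2N and function names chosen here.

```python
from fractions import Fraction

def coalescence_rate(k, two_n):
    """Rate per generation at which some pair among k lineages coalesces."""
    return Fraction(k * (k - 1), 2) / two_n

def expected_tmrca(k, two_n):
    """Expected generations until k sampled lineages reach a single ancestor:
    the sum of 1/rate while k, k-1, ..., 2 lineages remain."""
    return sum(1 / coalescence_rate(j, two_n) for j in range(2, k + 1))

pair_time = expected_tmrca(2, 20000)       # two lineages only
sample_time = expected_tmrca(100, 20000)   # a sample of 100 lineages
```

The sum telescopes to 2 × 2N × (1 − 1/k), so the final two lineages account for more than half of the expected total however large the sample is.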

64 Intermezzo: Population genetics The coalescent model – Describes the history of a present-day population sample – Traces history backwards in time – Provides a prior on the shape of genealogies Short branches near the present Long branches near the root High variance in time to most recent common ancestor (TMRCA) – Relates (instantaneous) population size to coalescence rate and tree shapes – Recombination can be added to the model

65 Li and Stephens’ copying model h_i = i-th haplotype; ρ = recombination rate Write full likelihood as product of conditional likelihoods Approximate the full conditional likelihoods by an HMM HMM creates an imperfect mosaic of the existing haplotypes. Switching rate depends on ρ and i as suggested by the coalescent model Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. N Li and M Stephens, Genetics 165: 2213–2233 (2003)
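
The mosaic idea can be sketched as an HMM whose hidden state at each site is the panel haplotype currently being copied. The version below is a simplification and not the parameterization of the paper: `switch` (the per-site probability of re-choosing the copied haplotype uniformly) and `error` (the probability of a miscopied allele) are fixed, hypothetical constants, whereas in the copying model the switching rate depends on ρ and on the number of haplotypes. Alleles are assumed to be binary.

```python
from itertools import product

def copying_likelihood(new_hap, panel, switch, error):
    """P(new_hap | panel) under a simplified copying HMM, via the forward
    algorithm. new_hap and each panel entry are equal-length strings of 0/1."""
    k = len(panel)

    def emit(h, site):
        # Copy the allele faithfully with probability 1 - error.
        return 1 - error if panel[h][site] == new_hap[site] else error

    # f[h] = P(alleles so far, currently copying haplotype h)
    f = [emit(h, 0) / k for h in range(k)]
    for site in range(1, len(new_hap)):
        total = sum(f)
        f = [emit(h, site) * ((1 - switch) * f[h] + switch * total / k)
             for h in range(k)]
    return sum(f)

PANEL = ["0011", "0101", "1111"]
in_panel = copying_likelihood("0011", PANEL, switch=0.1, error=0.01)
far_from_panel = copying_likelihood("1000", PANEL, switch=0.1, error=0.01)
```

A haplotype that can be copied from the panel with few switches and no miscopied alleles receives a much higher conditional likelihood than one that needs several of both.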

66 Inferring speciation time and ancestral population size Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet Feb 23;3(2):e7. [See also: PMID: , PMID: ]

67 Li and Durbin single-individual model Inference of human population history from individual whole-genome sequences H Li and R Durbin, Nature 2011, doi: /nature10231

68 PRACTICAL

69 Practical I: HMMs and population genetics

70 PMID: and What is the difference between phylogeny and genealogy? What is incomplete lineage sorting? The model operates on multiple sequences. Is it a linear HMM, a pair HMM, or something else? What do the states represent? Can you think of ways to improve the model? What patterns in the data is the model looking for? How does the method scale to more species?

71 Practical II: HMMs and population genetics

72 Practical II: HMMs and population genetics PMID: What do the states of the HMM represent? What do transitions represent? What approximations have been used in the model? What limits the resolution of the inferred population sizes for recent and ancient times? What would happen if there had been (or is) structure in the population? For instance, population split followed by merger.

73 Practical III: HMMs and alignment

74 PMID: Name as many causes of inaccuracies in alignments as you can. A priori you would think that a more accurate model of sequence evolution would improve alignments. Is this true? Can the impact of model (in)accuracy on alignment accuracy be quantified? Is there a limit (in terms of evolutionary distance) on pairwise alignment? Is this a practical, or a fundamental limit? Would multiple alignment allow more divergent species to be aligned? How does the complexity scale for multiple alignment using HMMs, in a naïve implementation? Can you think of ways to improve on this? What is posterior decoding and how does it work? In what way does it improve alignments, compared to Viterbi? Why is this?

75 Practical IV: HMMs and conservation: phastCons

76 PMID: What is the difference between a phyloHMM and a “standard” HMM? How does the model identify conserved regions? How is the model helped by the use of multiple species? How is the model parameterized? The paper uses the model to estimate the fraction of the human genome that is conserved. How could this estimate be criticized? Look at a few protein-coding genes, and their conservation across mammalian species, using the UCSC genome browser. Is it always true that (protein-coding) exons are well conserved? Can you see regions of conservation outside of protein-coding exons? Do these observations suggest that the model is inaccurate? Read PMID: Summarize the differences of approaches of the new methods and the “old” phyloHMM.

77 Practical V Choose any paper referenced in the slides, and discuss / criticize / place into context.

