Discovery tools for human genetic variations


1 Discovery tools for human genetic variations
Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA 02467

2 Sequence variations
The Human Genome Project produced a reference genome sequence that is 99.9% common to each human being.
Sequence variations make our genetic makeup unique.
Single-nucleotide polymorphisms (SNPs) are the most abundant, but other types of variation exist and are important.

3 How do we find variations?
Comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage).
Diverse sequence resources can be used: EST, WGS, BAC.

4 Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection

5 Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources.
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors.
[Figure labels: sequencing error; true polymorphism]
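The second idea can be illustrated with a small sketch. It assumes the base quality values are Phred-scaled (Q = -10 log10 of the error probability), which the slide does not state explicitly; the function names are hypothetical and are not part of PolyBayes.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality into an error probability.

    Assumption: qualities follow the Phred convention Q = -10 * log10(P_error).
    """
    return 10 ** (-q / 10)


def prob_both_calls_correct(q1, q2):
    """Probability that two independent base calls are both correct.

    When two aligned bases disagree, a high value here suggests the mismatch
    is a true difference rather than a sequencing error.
    """
    return (1 - phred_to_error_prob(q1)) * (1 - phred_to_error_prob(q2))


if __name__ == "__main__":
    # A mismatch between two Q40 bases is far more credible than one
    # involving a Q5 base.
    print(prob_both_calls_correct(40, 40))
    print(prob_both_calls_correct(40, 5))
```

A mismatch supported by two high-quality calls is more likely to be a true difference; the same mismatch involving a low-quality base is more plausibly a sequencing error.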

6 SNP discovery with PolyBayes
[Figure label: genome reference sequence]
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection

7 Sequence clustering
Clustering simplifies to a search against a sequence database to recruit relevant sequences.
Clusters = groups of overlapping sequence fragments matching the genome reference.
[Figure labels: genome reference; fragments; cluster 1, cluster 2, cluster 3]
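Once fragments have been placed on the genome reference, grouping them into clusters of overlapping fragments reduces to merging intervals. The sketch below assumes each database hit has already been reduced to a hypothetical `(name, start, end)` tuple with half-open reference coordinates; it is an illustration of the idea, not the PolyBayes implementation.

```python
def cluster_fragments(hits):
    """Group fragments whose reference intervals overlap.

    hits: iterable of (name, start, end) tuples, half-open [start, end)
    coordinates on the genome reference (an assumed input format).
    Returns a list of clusters, each a list of fragment names.
    """
    clusters = []
    current = []
    current_end = None
    # Sort by start so that overlapping fragments become neighbours.
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # Gap on the reference: close the cluster and start a new one.
            if current:
                clusters.append(current)
            current = [name]
            current_end = end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    hits = [("f1", 0, 100), ("f2", 50, 150), ("f3", 300, 400),
            ("f4", 350, 420), ("f5", 1000, 1100)]
    print(cluster_fragments(hits))
```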

8 (Anchored) multiple alignment
The genomic reference sequence serves as an anchor.
Fragments are aligned pair-wise to the genomic sequence.
Insertions are propagated (“sequence padding”).
Advantages:
efficient -- only involves pair-wise comparisons
accurate -- correctly aligns alternatively spliced ESTs
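Sequence padding can be sketched as follows. The example assumes each fragment arrives as a gapped pair-wise alignment `(ref_aln, frag_aln)` spanning the whole reference, with `-` as the gap character; these input conventions and function names are assumptions made for illustration. An insertion found in any one fragment is propagated as padding into the reference row and every other row.

```python
def split_columns(ref_aln, frag_aln):
    """Split one pair-wise alignment into per-reference-base pieces.

    Returns (aligned, inserts): aligned[i] is the fragment character opposite
    reference base i; inserts[i] is the fragment string inserted before
    reference base i (the last entry holds any trailing insertion).
    """
    aligned = []
    inserts = [""]
    for r, f in zip(ref_aln, frag_aln):
        if r == "-":
            inserts[-1] += f      # fragment base with no reference partner
        else:
            aligned.append(f)
            inserts.append("")
    return aligned, inserts


def anchored_multiple_alignment(reference, pairwise):
    """Merge pair-wise alignments into one multiple alignment by padding."""
    parsed = [split_columns(r, f) for r, f in pairwise]
    n = len(reference)
    # Widest insertion seen at each slot decides how much padding is needed.
    pad = [max((len(ins[i]) for _, ins in parsed), default=0)
           for i in range(n + 1)]

    def build(aligned, inserts):
        row = []
        for i in range(n):
            row.append(inserts[i].ljust(pad[i], "-"))
            row.append(aligned[i])
        row.append(inserts[n].ljust(pad[n], "-"))
        return "".join(row)

    ref_row = build(list(reference), [""] * (n + 1))
    return [ref_row] + [build(a, ins) for a, ins in parsed]


if __name__ == "__main__":
    rows = anchored_multiple_alignment(
        "ACGT", [("AC-GT", "ACTGT"), ("ACGT", "A-GT")])
    print("\n".join(rows))
```

Only pair-wise alignments are ever computed; the multiple alignment falls out of the shared reference coordinates, which is the efficiency advantage the slide mentions.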

9 Paralog filtering
The “paralog problem”:
Unrecognized paralogs give rise to spurious SNP predictions.
SNPs in duplicated regions may be useless for genotyping.
The challenge is to differentiate between sequencing errors and paralogous differences.
[Figure labels: sequencing errors; paralogous difference]

10 Paralog filtering
Pair-wise comparison between fragment and genomic sequence.
Model of expected discrepancies:
Orthologous: sequencing error + polymorphisms
Paralogous: sequencing error + paralogous sequence difference
Bayesian discrimination algorithm.
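The discrimination step can be sketched with a simple two-hypothesis Bayesian model. This is a simplified stand-in, not the published algorithm: it assumes mismatches are independent with one fixed per-base rate under each model, and the default rates and prior below are illustrative values, not figures from the slides.

```python
from math import exp, log


def paralog_posterior(mismatches, length, err_rate=0.01, poly_rate=0.001,
                      paralog_div=0.02, prior_paralog=0.5):
    """Posterior probability that a fragment is paralogous to the reference.

    Orthologous model: discrepancy rate = sequencing error + polymorphism.
    Paralogous model:  discrepancy rate = sequencing error + paralogous divergence.
    All rate defaults are hypothetical values chosen for illustration.
    """
    def log_likelihood(rate):
        # Binomial log-likelihood; the binomial coefficient cancels in the ratio.
        return mismatches * log(rate) + (length - mismatches) * log(1 - rate)

    log_para = log_likelihood(err_rate + paralog_div) + log(prior_paralog)
    log_orth = log_likelihood(err_rate + poly_rate) + log(1 - prior_paralog)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_para, log_orth)
    p, o = exp(log_para - m), exp(log_orth - m)
    return p / (p + o)


if __name__ == "__main__":
    print(paralog_posterior(0, 500))   # clean match: ortholog favoured
    print(paralog_posterior(15, 500))  # many discrepancies: paralog favoured
```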

11 Paralog filtering

12 SNP detection
Goal: to discern true variation from sequencing error.
[Figure label: polymorphism]

13 Bayesian-statistical SNP detection
[Equation figure: Bayesian posterior probability, with terms over polymorphic and monomorphic permutations of A, C, T, G. Labeled inputs: base call + base quality; expected polymorphism rate; depth of coverage; base composition]

14 Priors
Polymorphism rate in the population -- e.g. 1 / 300 bp
Distribution of SNPs according to minor allele frequency
Distribution of SNPs according to specific variation
Sample size (alignment depth)
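To make the roles of base quality and the prior polymorphism rate concrete, here is a brute-force sketch for a single alignment column. It is a simplification and not the published PolyBayes formula: it assumes Phred-scaled qualities, spreads the polymorphic prior uniformly over all non-identical assignments of true bases, splits the monomorphic prior equally among the four bases, and ignores the allele-frequency and specific-variation priors listed above. It enumerates 4^N assignments, so it is only practical for shallow columns.

```python
from itertools import product

BASES = "ACGT"


def snp_posterior(calls, quals, poly_rate=1 / 300):
    """Posterior probability that an alignment column is polymorphic.

    calls: string of called bases, one per read covering the site.
    quals: Phred-scaled qualities (assumption), same length as calls.
    poly_rate: prior probability that a site is polymorphic.
    """
    errs = [10 ** (-q / 10) for q in quals]
    n = len(calls)
    n_poly = 4 ** n - 4  # assignments of true bases that are not all identical
    poly = mono = 0.0
    for truth in product(BASES, repeat=n):
        # Likelihood of the observed calls given this assignment of true bases.
        like = 1.0
        for t, c, e in zip(truth, calls, errs):
            like *= (1 - e) if t == c else e / 3
        if len(set(truth)) == 1:
            mono += like * (1 - poly_rate) / 4
        else:
            poly += like * poly_rate / n_poly
    return poly / (poly + mono)


if __name__ == "__main__":
    print(snp_posterior("AAAA", [40, 40, 40, 40]))  # no evidence of variation
    print(snp_posterior("AAGG", [40, 40, 40, 40]))  # well-supported variant
    print(snp_posterior("AAAG", [40, 40, 40, 5]))   # variant seen once, low quality
```

Even in this reduced form, a minor allele seen in more reads or with higher base quality yields a higher posterior.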

15 SNP score
[Figure labels: polymorphism; specific variation]

16 Validation – pooled sequencing
[Figure labels: African, Asian, Caucasian, Hispanic, CHM 1]

17 Validation -- resequencing

18 Properties of SNP detection algorithm
Frequent alleles are easier to detect.
High-quality alleles are easier to detect.

19 The PolyBayes software
First statistically rigorous SNP discovery tool.
Correctly analyzes alternative cDNA splice forms.
Available for use (~70 licenses).
Marth et al., Nature Genetics, 1999

20 SNP mining: genome BAC overlaps
[Figure labels: overlap detection; SNP analysis; candidate SNP predictions; inter- & intra-chromosomal duplications; known human repeats; fragmentary nature of draft data]

21 BAC overlap mining results
~30,000 clones
25,901 clones (7,122 finished, 18,779 draft with base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
507,152 high-quality candidate SNPs (validation rate 83-96%)
Marth et al., Nature Genetics 2001
[Figure: FASTA-style clone entries (>CloneX, >CloneY) and an aligned pair of overlapping sequences, ACCTAGGAGACTGAACTTACTG and ACCTAGGAGACCGAACTTACTG]
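The two 22-base sequences shown on this slide differ at a single position. A minimal sketch of locating such a candidate in an already-aligned, ungapped overlap follows; the function name is hypothetical, and a real pipeline would also weigh base quality before calling the difference a SNP.

```python
def candidate_mismatches(seq_a, seq_b):
    """List (position, base_a, base_b) for every mismatch in an ungapped overlap.

    Positions are 0-based. Both sequences must already be aligned and of
    equal length (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


if __name__ == "__main__":
    clone_x = "ACCTAGGAGACTGAACTTACTG"
    clone_y = "ACCTAGGAGACCGAACTTACTG"
    print(candidate_mismatches(clone_x, clone_y))
```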

22 SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps. Weber et al., AJHG 2002
2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries. Sachidanandam et al., Nature 2001

23 Genotyping by sequence
SNP discovery usually deals with single-stranded (clonal) sequences.
It is often necessary to determine the allele state of individuals at known polymorphic locations.
Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists: there is no unique underlying nucleotide and no meaningful base quality value, hence the statistical methods of SNP discovery do not apply.

24 Genotyping
[Figure labels: homozygous peak; heterozygous peak]

