Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.

Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College marth@bc.edu BI420 – Introduction to Bioinformatics

Sequence variations: The Human Genome Project produced a reference genome sequence that is 99.9% common to each human being. Sequence variations make our genetic makeup unique. Single-nucleotide polymorphisms (SNPs) are the most abundant, but other types of variations exist and are important.

Why do we care about variations? Phenotypic differences; inherited diseases; demographic history.

Where do variations come from? Sequence variations are the result of mutation events. Mutations are propagated down through generations. Variation patterns permit reconstruction of phylogeny. [Figure: genealogy of TAAAAAT and TAACAAT sequences descending from their most recent common ancestor (MRCA).]

SNP discovery: comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage). Diverse sequence resources can be used: EST, WGS, BAC.

Steps of SNP discovery: sequence clustering; cluster refinement; multiple alignment; SNP detection.

Computational SNP mining – PolyBayes. Two innovative ideas: 1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources. 2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors. [Figure: sequencing error vs. true polymorphism.]
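
The second idea rests on turning a base quality value into an error probability. A minimal sketch, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability); the function names are illustrative and not part of PolyBayes:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Under this scale a mismatch at a Q40 base is far more likely to be a real difference than a mismatch at a Q10 base, which is the distinction the slide describes.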

Computational SNP mining – PolyBayes: sequence clustering simplifies to a database search with the genome reference; paralog filtering by counting mismatches weighted by quality values; multiple alignment by anchoring fragments to the genome reference; SNP detection by differentiating true polymorphism from sequencing error using quality values.

SNP discovery with PolyBayes: 1. Fragment recruitment (database search). 2. Anchored alignment. 3. Paralog identification. 4. SNP detection. [Figure: the four steps shown against the genome reference sequence.]

Sequence clustering: clustering simplifies to a search against a sequence database to recruit relevant sequences. Clusters = groups of overlapping sequence fragments matching the genome reference. [Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference.]
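
Once each fragment has a hit on the reference, grouping overlapping fragments is an interval-merging problem. A minimal sketch, assuming the database search has already produced half-open (start, end) reference coordinates for each fragment; the `Hit` tuple and `cluster_hits` are hypothetical names, not PolyBayes functions:

```python
from typing import List, Tuple

# (fragment name, start, end) on the reference; end is exclusive.
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap, directly or in a chain."""
    clusters: List[List[str]] = []
    cluster_end = None  # right edge of the cluster currently being built
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if cluster_end is None or start >= cluster_end:
            clusters.append([name])          # no overlap: open a new cluster
            cluster_end = end
        else:
            clusters[-1].append(name)        # overlaps the current cluster
            cluster_end = max(cluster_end, end)
    return clusters


if __name__ == "__main__":
    hits = [("est1", 0, 100), ("est2", 50, 150), ("est3", 300, 400),
            ("est4", 390, 500), ("est5", 1000, 1100)]
    print(cluster_hits(hits))
```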

(Anchored) multiple alignment: the genomic reference sequence serves as an anchor; fragments are pair-wise aligned to the genomic sequence; insertions are propagated (“sequence padding”). Advantages: efficient (only involves pair-wise comparisons); accurate (correctly aligns alternatively spliced ESTs).
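
Sequence padding can be illustrated with a small example: wherever one fragment carries an insertion relative to the reference, every other row receives pad characters at that position so that all rows keep the same columns. This is a sketch under assumed inputs, not the PolyBayes implementation: each fragment is given as a hypothetical tuple (start on reference, bases aligned one per reference position, insertions keyed by the reference index they follow), and `*` is used as the pad symbol.

```python
from typing import Dict, Tuple

# (start on reference, aligned bases, {reference index: bases inserted after it})
Fragment = Tuple[int, str, Dict[int, str]]


def padded_alignment(reference: str, fragments: Dict[str, Fragment]) -> Dict[str, str]:
    """Merge pair-wise alignments to one reference into equal-length padded rows."""
    # Longest insertion seen after each reference position, across all fragments.
    pad: Dict[int, int] = {}
    for _start, _bases, insertions in fragments.values():
        for pos, seq in insertions.items():
            pad[pos] = max(pad.get(pos, 0), len(seq))

    def render(start: int, bases: str, insertions: Dict[int, str]) -> str:
        row = []
        for pos in range(len(reference)):
            i = pos - start
            covered = 0 <= i < len(bases)
            row.append(bases[i] if covered else " ")
            width = pad.get(pos, 0)
            if width:
                if covered:
                    row.append(insertions.get(pos, "").ljust(width, "*"))
                else:
                    row.append(" " * width)
        return "".join(row)

    rows = {"reference": render(0, reference, {})}
    for name, (start, bases, insertions) in fragments.items():
        rows[name] = render(start, bases, insertions)
    return rows


if __name__ == "__main__":
    frags = {"f1": (0, "ACGTAC", {2: "TT"}), "f2": (2, "GTACGT", {})}
    for name, row in padded_alignment("ACGTACGT", frags).items():
        print(f"{name:>9} {row}")
```

Only pair-wise alignments to the reference are needed as input, which is the efficiency advantage the slide names.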

Paralog filtering -- idea. The “paralog problem”: unrecognized paralogs give rise to spurious SNP predictions; SNPs in duplicated regions may be useless for genotyping. The challenge is to differentiate between sequencing errors and paralogous differences. [Figure: paralogous difference vs. sequencing errors.]

Paralog filtering -- probabilities. Pair-wise comparison between EST and genomic sequence. Model of expected discrepancies: Native = sequencing error + polymorphisms; Paralog = sequencing error + paralogous sequence difference. Bayesian discrimination algorithm.
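
The discrimination step can be sketched as a two-hypothesis Bayesian comparison on the number of mismatches between a fragment and the reference. This is a simplified stand-in, not the PolyBayes algorithm itself: it uses a single average error rate and a binomial mismatch model, whereas the slides describe weighting by per-base quality values. The polymorphism rate of 0.001 follows the 99.9% figure given earlier; the 2% paralogous difference rate and the equal priors are assumptions made for illustration.

```python
import math


def binom_logpmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k mismatches in n aligned bases."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_native(mismatches: int, length: int, error_rate: float,
                poly_rate: float = 0.001,      # from the 99.9% identity figure
                paralog_rate: float = 0.02,    # assumed paralogous difference rate
                prior_native: float = 0.5) -> float:
    """Posterior probability that a fragment is native rather than paralogous."""
    log_native = math.log(prior_native) + binom_logpmf(
        mismatches, length, error_rate + poly_rate)
    log_paralog = math.log(1.0 - prior_native) + binom_logpmf(
        mismatches, length, error_rate + paralog_rate)
    top = max(log_native, log_paralog)         # subtract the max for stability
    native = math.exp(log_native - top)
    paralog = math.exp(log_paralog - top)
    return native / (native + paralog)


if __name__ == "__main__":
    for d in (0, 1, 3, 6, 12):
        print(d, round(prob_native(d, 500, error_rate=0.001), 4))
```

A fragment would then be kept or discarded by comparing this posterior with a probability cutoff.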

Paralog filtering -- paralogs

Paralog filtering -- selectivity. [Figure: 375 paralogous ESTs; 1,579 native ESTs; probability cutoff.]

SNP detection. Goal: to discern true variation from sequencing error. [Figure: sequencing error vs. polymorphism.]

Bayesian-statistical SNP detection. Bayesian posterior probability computed from: base call + base quality; expected polymorphism rate; base composition; depth of coverage. [Figure: AAAAAAAAAA, CCCCCCCCCC, TTTTTTTTTT, GGGGGGGGGG; polymorphic permutation; monomorphic permutation.]
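
To make the idea concrete, the sketch below scores one alignment column by comparing monomorphic hypotheses (every sequence truly carries the same base) with two-allele polymorphic hypotheses. It is a deliberately simplified model and not the PolyBayes formula: it assumes Phred-scaled qualities, uniform base composition, errors spread evenly over the three other bases, and a small hypothetical grid of minor allele frequencies. The default prior of 1/300 is the example rate quoted in these slides.

```python
from itertools import permutations

BASES = "ACGT"


def call_likelihood(call: str, true_base: str, q: float) -> float:
    """P(observed call | true base), errors spread evenly over the other bases."""
    e = 10.0 ** (-q / 10.0)
    return 1.0 - e if call == true_base else e / 3.0


def snp_posterior(calls, quals, prior_poly=1 / 300,
                  minor_freqs=(0.1, 0.2, 0.3, 0.4, 0.5)) -> float:
    """Posterior probability that the column is polymorphic (simplified model)."""
    # Monomorphic: one true base shared by every sequence, uniform base prior.
    mono = 0.0
    for base in BASES:
        like = 1.0
        for c, q in zip(calls, quals):
            like *= call_likelihood(c, base, q)
        mono += (1.0 - prior_poly) / len(BASES) * like

    # Polymorphic: each sequence carries the minor allele with probability f.
    pairs = list(permutations(BASES, 2))       # ordered (major, minor) pairs
    weight = prior_poly / (len(pairs) * len(minor_freqs))
    poly = 0.0
    for major, minor in pairs:
        for f in minor_freqs:
            like = 1.0
            for c, q in zip(calls, quals):
                like *= ((1.0 - f) * call_likelihood(c, major, q)
                         + f * call_likelihood(c, minor, q))
            poly += weight * like
    return poly / (poly + mono)


if __name__ == "__main__":
    print(snp_posterior("AAAAAA", [40] * 6))              # all calls agree
    print(snp_posterior("AAACCC", [40] * 6))              # two well-supported alleles
    print(snp_posterior("AAAAAC", [40, 40, 40, 40, 40, 5]))  # low-quality mismatch
```

The third call shows the role of base quality: the same mismatch counts for much less when its quality value is low.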

The SNP score: polymorphism; specific variation.

SNP priors: polymorphism rate in population (e.g. 1 / 300 bp); sample size (alignment depth); distribution of SNPs according to minor allele frequency; distribution of SNPs according to specific variation.
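
One way to see why the prior depends on alignment depth: under the standard neutral model, the expected fraction of sites that are polymorphic among n sampled sequences grows roughly as theta times the sum 1 + 1/2 + ... + 1/(n-1). The sketch below applies that textbook relationship. The slide does not say which sample size the 1/300 figure refers to, so treating it as the two-sequence value `theta` is purely an assumption for illustration, as is the function name.

```python
def prior_polymorphic(depth: int, theta: float = 1 / 300) -> float:
    """Approximate prior that a site is polymorphic among `depth` sequences.

    Neutral-model approximation: theta * sum_{i=1}^{depth-1} 1/i,
    capped at 1. `theta` is assumed to be the two-sequence rate.
    """
    return min(1.0, theta * sum(1.0 / i for i in range(1, depth)))


if __name__ == "__main__":
    for depth in (1, 2, 4, 10, 50):
        print(depth, round(prior_polymorphic(depth), 5))
```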

Selectivity of detection. [Figure labels: 76,844; SNP probability threshold.]

Validation by pooled sequencing: African, Asian, Caucasian, Hispanic, CHM 1.

Validation by re-sequencing

Rare alleles are hard to detect; frequent alleles are easier to detect; high-quality alleles are easier to detect.
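
The frequency effect follows from simple sampling arithmetic: a SNP can only be seen if both alleles are present among the sampled chromosomes. A minimal sketch, assuming n chromosomes drawn independently from the population (an assumption made here, not a statement from the slides) and ignoring sequencing error:

```python
def prob_both_alleles_sampled(minor_freq: float, n: int) -> float:
    """Chance that n independently sampled chromosomes include both alleles."""
    return 1.0 - minor_freq ** n - (1.0 - minor_freq) ** n


if __name__ == "__main__":
    for f in (0.01, 0.05, 0.2, 0.5):
        print(f, round(prob_both_alleles_sampled(f, 4), 4))
```

With four chromosomes, an allele at 1% frequency is sampled together with the common allele only about 4% of the time, while an allele at 50% frequency is captured in most samples.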

The PolyBayes software (http://genome.wustl.edu/gsc/polybayes): Marth et al., Nature Genetics, 1999. First statistically rigorous SNP discovery tool. Correctly analyzes alternative cDNA splice forms. Available for use (~70 licenses).

INDEL discovery: there is no “base quality” value for “deleted” nucleotide(s); sequencing chemistry is context-dependent; there is no reliable prior expectation for INDEL rates of various classes.

INDEL discovery: Q(insertion flank) >= 35; Q(deletion flank) >= 35; Q(deletion) = average of Q(deletion flank). [Figure: an insertion with its insertion flanks, aligned against a deletion with its deletion flanks.]
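
The quality rules on this slide translate into a short filter. A minimal sketch: the threshold of 35 and the averaging rule come from the slide, while the function names and the reading that every flanking base must individually meet the threshold are assumptions.

```python
from typing import Sequence

MIN_FLANK_Q = 35  # threshold stated on the slide


def deletion_quality(deletion_flank_quals: Sequence[int]) -> float:
    """Quality assigned to a deletion: the average of its flanking qualities."""
    return sum(deletion_flank_quals) / len(deletion_flank_quals)


def passes_indel_filter(insertion_flank_quals: Sequence[int],
                        deletion_flank_quals: Sequence[int],
                        min_q: int = MIN_FLANK_Q) -> bool:
    """Keep a candidate INDEL only if all flanking bases reach min_q (assumed reading)."""
    return (all(q >= min_q for q in insertion_flank_quals)
            and all(q >= min_q for q in deletion_flank_quals))


if __name__ == "__main__":
    print(deletion_quality([40, 36]))
    print(passes_indel_filter([38, 41], [40, 36]))
    print(passes_indel_filter([38, 41], [40, 30]))
```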

INDEL discovery: 123,035 candidate INDELs (~25% of substitutions). Majority 1-4 bp insertion length (1 bp: 68%, 2 bp: 13%). Validation rate steeply increases with insertion length (14.3%, 60.8%, 61.7%).

SNP discovery in diploid traces: usually, PCR products are sequenced from multiple individuals; sequence is guaranteed to originate from a single location, so there is no alignment problem; sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence.

SNP discovery in diploid traces. [Figure: homozygous trace peak vs. heterozygous trace peak.]
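
A heterozygous position shows two overlapping peaks of comparable height where a homozygous position shows one. The toy heuristic below captures only that contrast; it is a hypothetical sketch, not a method described in these slides. The input format (peak height per base at one position) and the 0.3 ratio threshold are assumptions, and real trace analysis relies on richer evidence than a single ratio.

```python
from typing import Dict


def call_genotype(peak_heights: Dict[str, float], min_ratio: float = 0.3) -> str:
    """Return a two-letter genotype from the four peak heights at one trace position."""
    ranked = sorted(peak_heights.items(), key=lambda kv: kv[1], reverse=True)
    (base1, height1), (base2, height2) = ranked[0], ranked[1]
    if height1 > 0 and height2 / height1 >= min_ratio:
        return "".join(sorted((base1, base2)))   # heterozygous, e.g. "AG"
    return base1 * 2                             # homozygous, e.g. "AA"


if __name__ == "__main__":
    print(call_genotype({"A": 1000, "C": 20, "G": 15, "T": 10}))
    print(call_genotype({"A": 520, "C": 12, "G": 480, "T": 8}))
```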

SNP mining: genome BAC overlaps. Overlap detection (inter- & intra-chromosomal duplications; known human repeats; fragmentary nature of draft data); SNP analysis; candidate SNP predictions.

BAC overlap mining results (Marth et al., Nature Genetics 2001): ~30,000 clones; 25,901 clones (7,122 finished, 18,779 draft with base quality values); 21,020 clone overlaps (124,356 fragment overlaps); 507,152 high-quality candidate SNPs (validation rate 83-96%). [Figure: FASTA records >CloneX and >CloneY (ACGTTGCAACGT GTCAATGCTGCA), and an overlap differing at one base: ACCTAGGAGACTGAACTTACTG vs. ACCTAGGAGACCGAACTTACTG.]
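
The overlap comparison can be sketched with the two sequences shown on this slide. The example assumes the overlapping regions are already aligned without gaps and have equal length; the function name, the uniform quality values, and the minimum quality of 35 are illustrative assumptions rather than the parameters used in the study.

```python
from typing import List, Sequence, Tuple


def candidate_snps(seq_x: str, seq_y: str,
                   qual_x: Sequence[int], qual_y: Sequence[int],
                   min_q: int = 35) -> List[Tuple[int, str, str]]:
    """Report 0-based positions where two aligned overlaps differ at high quality."""
    candidates = []
    for i, (a, b) in enumerate(zip(seq_x, seq_y)):
        if a == b or "N" in (a, b) or "-" in (a, b):
            continue                      # identical, ambiguous, or gapped column
        if qual_x[i] >= min_q and qual_y[i] >= min_q:
            candidates.append((i, a, b))
    return candidates


if __name__ == "__main__":
    clone_x = "ACCTAGGAGACTGAACTTACTG"
    clone_y = "ACCTAGGAGACCGAACTTACTG"
    quals = [40] * len(clone_x)           # made-up quality values
    print(candidate_snps(clone_x, clone_y, quals, quals))
```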

SNP mining projects: 1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002). 2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001).

The current variation resource: the current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers. 1. How are these SNPs structured within the genome? 2. What can we learn about the processes that shape human variability?
