Discovery tools for human genetic variations


Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA 02467

Sequence variations
The Human Genome Project produced a reference genome sequence that is 99.9% common to all human beings. Sequence variations make each person's genetic makeup unique. Single-nucleotide polymorphisms (SNPs) are the most abundant type, but other types of variation exist and are important.

How do we find variations?
By comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage). Diverse sequence resources can be used: ESTs, whole-genome shotgun (WGS) reads, and BAC sequences.

Steps of SNP discovery
1. Sequence clustering
2. Cluster refinement
3. Multiple alignment
4. SNP detection

Computational SNP mining: PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources.
2. Use sequence quality information (base quality values) to distinguish true mismatches (true polymorphisms) from sequencing errors.
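How a base quality value helps separate a sequencing error from a true mismatch can be sketched as follows. This is a minimal illustration, assuming quality values on the Phred scale (Q = -10 log10 of the error probability), which the slide does not state explicitly; the function names and the example column are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def mismatch_error_probs(calls, quals, ref_base):
    """List (called base, error probability) for every read base that
    differs from the reference base at one alignment position."""
    return [
        (call, phred_to_error_prob(q))
        for call, q in zip(calls, quals)
        if call != ref_base
    ]


if __name__ == "__main__":
    # Hypothetical alignment column: four reads over a reference 'A'.
    for call, p_err in mismatch_error_probs("AAGG", [40, 40, 30, 10], "A"):
        print(f"mismatch {call}: error probability {p_err:.4f}")
```

A mismatch backed by a high quality value has a small error probability and is a better polymorphism candidate than one backed by a low quality value.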

SNP discovery with PolyBayes
Starting from the genome reference sequence:
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection

Sequence clustering
With a genome reference available, clustering simplifies to a search against a sequence database to recruit relevant sequences. Clusters are groups of overlapping sequence fragments that match the genome reference.
[Figure: fragments aligned to the genome reference, forming cluster 1, cluster 2, and cluster 3]
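Grouping recruited fragments into clusters can be sketched as an interval merge. This is only an illustration, assuming each database hit has already been reduced to a fragment name plus half-open start and end coordinates on the reference; the hit format, names, and coordinates are hypothetical and not taken from PolyBayes.

```python
def cluster_hits(hits):
    """Group fragments whose reference intervals overlap.

    hits: iterable of (name, start, end) with half-open [start, end)
    coordinates on the genome reference.
    Returns a list of clusters, each a list of fragment names.
    """
    clusters = []
    current, current_end = [], None
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # Gap on the reference: close the cluster and start a new one.
            if current:
                clusters.append(current)
            current, current_end = [name], end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    hits = [
        ("est1", 0, 100), ("est2", 50, 150),
        ("wgs1", 300, 400), ("bac1", 350, 500),
        ("est3", 1000, 1100),
    ]
    print(cluster_hits(hits))
```

Sorting by start coordinate means each fragment only has to be compared against the end of the running cluster, so the sweep is a single pass after the sort.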

(Anchored) multiple alignment
The genomic reference sequence serves as an anchor: fragments are aligned pair-wise to the genomic sequence, and insertions are propagated to all sequences (“sequence padding”).
Advantages:
efficient: only involves pair-wise comparisons
accurate: correctly aligns alternatively spliced ESTs
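The padding step can be sketched in a few lines. This is a minimal illustration rather than the PolyBayes implementation: it assumes each pair-wise alignment is given as two equal-length gapped strings that span the whole reference, with '-' marking gaps and uncovered positions. The function names and the toy sequences are hypothetical.

```python
def split_pairwise(ref_aln, frag_aln):
    """Split one pair-wise alignment into the fragment bases aligned to
    each reference base and the bases inserted before each reference base."""
    inserts = [""]   # inserts[k] holds bases inserted before reference base k
    aligned = []     # aligned[k] is the fragment base (or '-') at reference base k
    for r, f in zip(ref_aln, frag_aln):
        if r == "-":
            inserts[-1] += f
        else:
            aligned.append(f)
            inserts.append("")
    return aligned, inserts


def anchored_msa(reference, pairwise):
    """Merge pair-wise alignments against one reference into a multiple
    alignment by propagating every insertion to all rows (padding)."""
    for ref_aln, frag_aln in pairwise:
        if len(ref_aln) != len(frag_aln) or ref_aln.replace("-", "") != reference:
            raise ValueError("each alignment must span the whole reference")
    parsed = [split_pairwise(r, f) for r, f in pairwise]
    n = len(reference)
    # Widest insertion seen at each slot determines how much padding is needed.
    width = [max((len(ins[k]) for _, ins in parsed), default=0) for k in range(n + 1)]

    def build(aligned, inserts):
        row = []
        for k in range(n):
            row.append(inserts[k].ljust(width[k], "-"))
            row.append(aligned[k])
        row.append(inserts[n].ljust(width[n], "-"))
        return "".join(row)

    rows = [build(list(reference), [""] * (n + 1))]
    rows.extend(build(aligned, inserts) for aligned, inserts in parsed)
    return rows


if __name__ == "__main__":
    for row in anchored_msa("ACGT", [("AC-GT", "ACTGT"), ("ACGT", "AC-T")]):
        print(row)
```

Because every fragment is only ever compared with the reference, the cost grows linearly with the number of fragments, which is the efficiency advantage the slide refers to.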

Paralog filtering
The “paralog problem”: unrecognized paralogs give rise to spurious SNP predictions, and SNPs in duplicated regions may be useless for genotyping. The challenge is to differentiate between sequencing errors and paralogous differences.
[Figure: sequencing errors versus paralogous difference]

Paralog filtering
Pair-wise comparison between fragment and genomic sequence, using a model of expected discrepancies:
Orthologous: sequencing error + polymorphisms
Paralogous: sequencing error + paralogous sequence difference
A Bayesian discrimination algorithm decides between the two.
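The discrimination idea can be illustrated with a deliberately simplified model. The sketch below assumes the number of discrepancies in a pair-wise alignment is binomially distributed, with a per-base rate of error + polymorphism under the orthologous model and error + paralogous difference under the paralogous model. The binomial form, all rate values, and the prior are assumptions made for illustration; they are not the published PolyBayes model.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k discrepancies in n aligned bases."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )


def paralog_posterior(discrepancies, length, err_rate, poly_rate,
                      paralog_diff_rate, prior_paralog=0.5):
    """Posterior probability that a fragment is paralogous, given the
    number of discrepancies against the genomic sequence."""
    p_orth = err_rate + poly_rate          # expected rate if orthologous
    p_para = err_rate + paralog_diff_rate  # expected rate if paralogous
    log_orth = log(1 - prior_paralog) + log_binom_pmf(discrepancies, length, p_orth)
    log_para = log(prior_paralog) + log_binom_pmf(discrepancies, length, p_para)
    top = max(log_orth, log_para)          # subtract the maximum for stability
    w_orth, w_para = exp(log_orth - top), exp(log_para - top)
    return w_para / (w_orth + w_para)


if __name__ == "__main__":
    # Hypothetical rates: 1% error, 0.1% polymorphism, 2% paralogous difference.
    for d in (5, 15):
        p = paralog_posterior(d, 500, 0.01, 0.001, 0.02)
        print(f"{d} discrepancies in 500 bp: P(paralog) = {p:.3f}")
```

Fragments whose posterior exceeds a chosen cutoff would be excluded before SNP detection; the cutoff itself is a tuning choice not specified here.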

Paralog filtering

SNP detection
Goal: to discern true variation (polymorphism) from sequencing error.

Bayesian-statistical SNP detection
Inputs: base call + base quality, expected polymorphism rate, depth of coverage, base composition.
Output: Bayesian posterior probability.
[Figure: polymorphic and monomorphic permutations of the nucleotides A, C, T, G]

Priors
Polymorphism rate in the population, e.g. 1 / 300 bp
Distribution of SNPs according to minor allele frequency
Distribution of SNPs according to specific variation
Sample size (alignment depth)
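How base calls, base qualities, and a prior polymorphism rate combine into a posterior probability can be sketched by brute force for a single shallow alignment column. The sketch enumerates every assignment ("permutation") of true nucleotides to the reads and compares the weight of the polymorphic ones with the total. It makes several simplifying assumptions that are not from the slides: Phred-scaled qualities, errors spread evenly over the three other bases, uniform base composition, and a prior that is uniform within the monomorphic and within the polymorphic permutations rather than shaped by allele frequency or specific variation. It is not the PolyBayes implementation.

```python
from itertools import product

BASES = "ACGT"


def snp_posterior(calls, quals, poly_rate=1 / 300):
    """Posterior probability that an alignment column is polymorphic.

    calls: base calls of the reads covering the column, e.g. "AAGG"
    quals: Phred-scaled base quality values, one per call (assumption)
    poly_rate: prior probability that a site is polymorphic
    """
    n = len(calls)
    if n != len(quals) or not 2 <= n <= 10:
        raise ValueError("need 2 to 10 calls with one quality value each")
    errs = [10 ** (-q / 10) for q in quals]
    prior_mono = (1 - poly_rate) / 4       # shared by the 4 monomorphic permutations
    prior_poly = poly_rate / (4 ** n - 4)  # shared by all polymorphic permutations
    base_prior = 0.25                      # uniform base composition (assumption)

    total = 0.0
    polymorphic = 0.0
    for perm in product(BASES, repeat=n):
        weight = 1.0
        for true_base, call, err in zip(perm, calls, errs):
            p_base = (1 - err) if true_base == call else err / 3
            weight *= p_base / base_prior
        is_poly = len(set(perm)) > 1
        weight *= prior_poly if is_poly else prior_mono
        total += weight
        if is_poly:
            polymorphic += weight
    return polymorphic / total


if __name__ == "__main__":
    print(snp_posterior("AAGG", [40, 40, 40, 40]))  # two well-supported alleles
    print(snp_posterior("AAAG", [40, 40, 40, 5]))   # one low-quality mismatch
```

The enumeration grows as 4^N with depth N, so this form is only practical for shallow columns; it is meant to show the structure of the calculation, not an efficient algorithm.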

SNP score
[Figure labels: polymorphism, specific variation]

Validation: pooled sequencing
[Figure labels: African, Asian, Caucasian, Hispanic, CHM 1]

Validation: resequencing

Properties of the SNP detection algorithm
Frequent alleles are easier to detect.
High-quality alleles are easier to detect.

The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
First statistically rigorous SNP discovery tool
Correctly analyzes alternative cDNA splice forms
Available for use (~70 licenses)
Marth et al., Nature Genetics, 1999

SNP mining: genome BAC overlaps
Pipeline: overlap detection, SNP analysis, candidate SNP predictions.
Complicating factors: inter- and intra-chromosomal duplications, known human repeats, fragmentary nature of draft data.

BAC overlap mining results
~30,000 clones
25,901 clones (7,122 finished, 18,779 draft with base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
507,152 high-quality candidate SNPs (validation rate 83-96%)
Marth et al., Nature Genetics 2001
[Figure: FASTA records >CloneX and >CloneY; overlap alignment ACCTAGGAGACTGAACTTACTG / ACCTAGGAGACCGAACTTACTG showing a T/C difference]
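Finding candidate SNPs in a clone overlap comes down to scanning the aligned overlap for mismatches where both bases are trustworthy. The sketch below uses the two overlap sequences shown on the slide; the function name, the quality values, and the threshold of 20 are hypothetical, and the sequences are assumed to be already aligned without gaps.

```python
def candidate_snps(seq_x, seq_y, qual_x, qual_y, min_qual=20):
    """Report (position, base_x, base_y) for every mismatch between two
    aligned overlap sequences where both bases reach the quality threshold."""
    if not (len(seq_x) == len(seq_y) == len(qual_x) == len(qual_y)):
        raise ValueError("sequences and quality lists must have equal length")
    return [
        (pos, bx, by)
        for pos, (bx, by, qx, qy) in enumerate(zip(seq_x, seq_y, qual_x, qual_y))
        if bx != by and qx >= min_qual and qy >= min_qual
    ]


if __name__ == "__main__":
    clone_x = "ACCTAGGAGACTGAACTTACTG"
    clone_y = "ACCTAGGAGACCGAACTTACTG"
    quals = [30] * len(clone_x)  # hypothetical uniform quality values
    print(candidate_snps(clone_x, clone_y, quals, quals))
```

Positions are reported 0-based. A hard quality threshold is the simplest possible filter; a statistical treatment would weigh the quality values rather than cut on them.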

SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)
2. The SNP Consortium (TSC): polymorphism discovery in random shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

Genotyping by sequence
SNP discovery usually deals with single-stranded (clonal) sequences. It is often necessary to determine the allele state of individuals at known polymorphic locations. Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists: there is no unique underlying nucleotide and no meaningful base quality value, hence the statistical methods of SNP discovery do not apply.

Genotyping
[Figure labels: homozygous peak, heterozygous peak]