Genotyping of James Watson’s genome from Low-coverage Sequencing Data Sanjiv Dinakar and Yözen Hernández.



Genetic Variation Single-base variations in the genome are called Single Nucleotide Polymorphisms (SNPs). Most of the genetic material is the same between individuals. SNPs make up < 1% of the human genome.

Genotypes The possible values for a SNP are called alleles. We consider only “bi-allelic” SNPs (SNPs with exactly two alleles in the human population). Each person carries two copies of each SNP, and therefore two alleles. A “genotype” is said to be “homozygous” if the same allele is present in both copies and “heterozygous” if both alleles are present. [Figure: a heterozygous genotype]

Shotgun Sequencing “Shotgun sequencing” is the most popular method for sequencing genomes. In shotgun sequencing, the DNA molecule is broken into many small, random fragments and the DNA sequence of each fragment is determined. The sequenced fragments are called “reads”. Each read is then searched for in the reference genome to find the position from which it most likely came. Since the positions of SNPs within the reference genome are known, the alleles can be extracted from the reads.

James Watson's Genome ~74.4 million reads from James Watson's genome are freely available on the internet (generated by Wheeler et al.). These reads were generated using technology from 454 Life Sciences. Wheeler et al. also used “DNA microarrays” from Affymetrix to genotype James Watson, but not all known SNPs were found. Microarrays are expensive and can quickly become obsolete as the reference genome changes. We used different methods to determine the genotypes from the reads, and compared our results to those of Wheeler et al.

Our method
We need:
- an already sequenced genome (reference genome) to compare this individual against;
- a number of fragments derived from the individual's genome;
- to be able to locate where the fragments belong in the genome;
- to know the location and alleles of the SNPs in the genome;
- to determine whether both copies of the SNP are covered.
The HapMap project maintains a database of the alleles and positions for most known SNPs. The Human Genome Project has created a reference genome based on multiple volunteers. If we are sure that both copies are covered, then determining the genotype is easy.

Problems The same copy may be sequenced much more often than the other, or reads may be mapped to the wrong position in the genome. Sequencing and mapping accuracy both affect genotype determination. Using older methods, the genotype cannot be determined with high confidence if the coverage is low: it is estimated that about 13x coverage is needed for accurate genotyping. Can we determine genotypes with high accuracy from low-coverage data? How?

Read Mapping
Nucmer was used to map the reads to the reference genome. Nucmer maps a read to the genome as follows:
1. Find maximal unique matches (MUMs) between the read and the reference genome, using suffix trees.
2. Cluster matches into closely grouped sets, dropping inconsistent matches.
3. Align the regions between MUMs in each cluster.

Read mapping
After using Nucmer to map all the reads, we filtered Nucmer's output. We removed:
- all reads which were mapped to more than one position in the genome;
- all reads which had more than 10 errors (substitutions, insertions, deletions) and less than 90% of the read covered by a cluster.
Using simulated reads (reads generated randomly from the reference genome, using ReadSim), this method achieved a 0.37% false positive rate. Our accuracy was further improved by removing reads which gave an invalid allele.
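The two filtering rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the `Alignment` record and its field names are hypothetical, and the second rule is read literally (a read is dropped only when it has more than 10 errors and is also less than 90% covered).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record for one reported mapping of a read.
    read_id: str
    ref_start: int
    errors: int              # substitutions + insertions + deletions
    covered_fraction: float  # fraction of the read covered by a cluster


def filter_alignments(alignments, max_errors=10, min_covered=0.90):
    """Keep only uniquely mapped reads that pass the error/coverage rule."""
    hits = Counter(a.read_id for a in alignments)
    kept = []
    for a in alignments:
        if hits[a.read_id] > 1:
            continue  # read mapped to more than one position
        if a.errors > max_errors and a.covered_fraction < min_covered:
            continue  # too many errors and too little of the read covered
        kept.append(a)
    return kept


alignments = [
    Alignment("r1", 100, 2, 0.98),   # kept
    Alignment("r2", 500, 1, 0.99),   # dropped: r2 maps twice
    Alignment("r2", 900, 1, 0.99),   # dropped: r2 maps twice
    Alignment("r3", 700, 14, 0.80),  # dropped: >10 errors and <90% covered
]
kept_ids = [a.read_id for a in filter_alignments(alignments)]
```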

SNP Genotyping Once we have mapped the reads, we know which SNPs each read contains, but how do we find a SNP's genotype? Binomial distribution: there are only two alleles per SNP, so the number of reads showing each allele at a heterozygous SNP follows a binomial distribution. Using this distribution, we can infer the possible genotypes.
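As a small illustration of the binomial model, the sketch below computes the probability of seeing a given number of reads with allele 1 at a heterozygous SNP. It assumes error-free reads and that each read samples either copy with probability 1/2; the function name is made up for this example.

```python
from math import comb


def het_allele_count_prob(k, n, p=0.5):
    """P(k of n reads show allele 1) at a heterozygous SNP.

    Assumes each read samples one of the two copies independently with
    probability p = 1/2 and ignores sequencing errors.
    """
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two reads with allele 1 out of six reads covering the SNP.
prob_two_of_six = het_allele_count_prob(2, 6)

# Chance that six reads all come from the same copy, so the SNP looks homozygous.
prob_one_copy_only = het_allele_count_prob(0, 6) + het_allele_count_prob(6, 6)
```

With six reads, the chance that only one copy is sampled is 2 * (1/2)^6 = 1/32, which is why low coverage makes heterozygous SNPs easy to miss.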

Locating and identifying SNPs There is a tremendous number of SNPs to search through, and this search needs to be done for each read. A binary search over the SNPs, sorted by position, is used to find the first SNP within the read, that is, the SNP with the smallest position that the read contains. The consecutive SNPs which follow the first one are included as long as they are also contained in the read. The base found at each SNP position is extracted, and counted if it is a valid allele. Otherwise, we assume there was an error in the read and the entire read is thrown out (because it indicates a possibly mismapped read). Every time an allele is found, we update the heterozygous and corresponding homozygous genotype probabilities, and move on to the next SNP.
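The search described above can be sketched with Python's standard `bisect` module. This is a minimal sketch under stated assumptions, not the authors' implementation: the read is assumed to align to the reference without gaps, and `snp_positions` (sorted) and `snp_alleles` are hypothetical inputs.

```python
from bisect import bisect_left


def alleles_from_read(snp_positions, snp_alleles, read_start, read_seq):
    """Return [(position, base), ...] for the SNPs inside one mapped read.

    snp_positions: sorted list of SNP coordinates on the reference.
    snp_alleles:   dict mapping position -> string of its two valid alleles.
    Returns None if any SNP shows an invalid base, so the caller can
    discard the whole read as possibly mismapped.
    """
    read_end = read_start + len(read_seq) - 1
    # Binary search for the first SNP at or after the start of the read.
    i = bisect_left(snp_positions, read_start)
    calls = []
    # Walk through consecutive SNPs while they still fall inside the read.
    while i < len(snp_positions) and snp_positions[i] <= read_end:
        pos = snp_positions[i]
        base = read_seq[pos - read_start]
        if base not in snp_alleles[pos]:
            return None
        calls.append((pos, base))
        i += 1
    return calls


snp_positions = [100, 205, 210, 400]
snp_alleles = {100: "AG", 205: "CT", 210: "AC", 400: "GT"}
calls = alleles_from_read(snp_positions, snp_alleles, 200, "AAAAACAAAAAGG")
```

Each read then costs one O(log n) search plus a short linear walk, rather than a scan of the full SNP list.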

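Short binary search example [Figure: a sorted list of SNPs is searched using the read's start and end positions. The search looks in the upper half of the list, finds the first SNP in the read, and then looks at consecutive SNPs to see if any more are in the read.]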

“Calling” Genotypes
There may be many reads which contain the same SNP. We count the number of times each allele appears, and use known population allele frequencies to calculate the genotype probability. After all the reads have been processed, we calculate the posterior probabilities using Bayes' Theorem:
P(G|R) = P(G) * P(R|G) / P(R)
where P(G) is given by Hardy-Weinberg proportions for known population allele frequencies f0 and f1 (i.e., P(G={0,0}) = f0^2, P(G={0,1}) = 2*f0*f1 and P(G={1,1}) = f1^2). R is the set of reads covering the SNP, and R(0) and R(1) are the reads showing allele 0 and allele 1 respectively.
P(R|G={0,1}) = (1/2)^|R|
P(R|G={0,0}) = P(error)^|R(1)| * (1-P(error))^|R(0)|
P(R|G={1,1}) = P(error)^|R(0)| * (1-P(error))^|R(1)|
P(R) = P(R|G={0,1})*P(G={0,1}) + P(R|G={0,0})*P(G={0,0}) + P(R|G={1,1})*P(G={1,1})
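The formulas above translate almost directly into code. The following is a minimal sketch rather than the authors' implementation; the function name and the genotype labels "0/0", "0/1", "1/1" are chosen for this example only.

```python
def genotype_posteriors(n0, n1, f0, error_rate):
    """Posterior probability of each genotype at one bi-allelic SNP.

    n0, n1:     number of reads showing allele 0 and allele 1.
    f0:         population frequency of allele 0 (f1 = 1 - f0).
    error_rate: assumed uniform per-read error probability.
    """
    f1 = 1.0 - f0
    # Hardy-Weinberg priors P(G).
    prior = {"0/0": f0 * f0, "0/1": 2 * f0 * f1, "1/1": f1 * f1}
    # Likelihoods P(R|G) as given on the slide.
    likelihood = {
        "0/0": error_rate**n1 * (1 - error_rate) ** n0,
        "0/1": 0.5 ** (n0 + n1),
        "1/1": error_rate**n0 * (1 - error_rate) ** n1,
    }
    # P(R): total probability of the reads over all three genotypes.
    p_reads = sum(prior[g] * likelihood[g] for g in prior)
    return {g: prior[g] * likelihood[g] / p_reads for g in prior}


# Four reads with allele 0, two with allele 1, f0 = 0.75, 1% error rate.
posteriors = genotype_posteriors(4, 2, 0.75, 0.01)
best_call = max(posteriors, key=posteriors.get)
```

For these inputs the heterozygous genotype gets a posterior of about 0.99. For SNPs covered by very many reads, the products could be computed in log space to avoid floating-point underflow.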

Example
A SNP is overlapped by six reads. Four of the reads have the 0 allele, two of the reads have the 1 allele. In the population, the 0 allele occurs 75% of the time and the 1 allele occurs 25% of the time. Assume errors occur uniformly with a rate of 1%.
|R|=6, |R(1)|=2, |R(0)|=4
P(G={0,1}) = 2*0.75*0.25 = 0.375
P(G={0,0}) = 0.75*0.75 = 0.5625
P(G={1,1}) = 0.25*0.25 = 0.0625
P(R|G={0,1}) = 0.5^6 = 0.015625
P(R|G={0,0}) = 0.01^2*0.99^4 ≈ 9.606e-5
P(R|G={1,1}) = 0.01^4*0.99^2 ≈ 9.801e-9
P(R) = 0.015625*0.375 + 9.606e-5*0.5625 + 9.801e-9*0.0625 ≈ 5.913e-3
P(G={0,1}|R) = 0.375*0.015625 / 5.913e-3 ≈ 0.9909
P(G={0,0}|R) = 0.5625*9.606e-5 / 5.913e-3 ≈ 0.0091
P(G={1,1}|R) = 0.0625*9.801e-9 / 5.913e-3 ≈ 1.04e-7

Results We compared our results with those of the team which sequenced Watson's genome. Our binomial distribution calculations gave better results, calling more SNPs and giving better accuracy than Wheeler et al. did. There are probabilistic models which are far superior to the simple binomial distribution, and which are more realistic biologically. We are currently working on implementing Hidden Markov Models to solve this problem.

Questions??