Detecting copy number variations using paired-end sequence data Nick Furlotte CS224 May 29, 2009.

SNPs are old news
- SNPs have been the main way of quantifying genetic variation
- Attention is now switching to other, harder-to-detect variations
- Structural variations possibly account for a large portion of genetic variance
- Structural variations include
  - Insertions
  - Deletions
  - Inversions
  - Translocations
  - Copy number variations (CNVs)

Copy Number Variation: What does a CNV look like?
[Figure: donor sequence compared with reference sequence]

How do we find CNVs?
- De novo sequence assembly is hard
- Resequencing is now an option with low-cost next gen sequencing
- This project aims to find CNVs using next gen sequencing reads.

Paired-end sequencing
[Figure: paired-end sequencing; source: Illumina website]

Paired-end sequencing
- Read lengths are a known size
- Insert length has a distribution
- Output:

The Computational Problem
Given:
- A set of paired-end reads
- The mapping positions in the reference
- Read and mapping quality
Output:
1. A set of CNVs
2. An estimate of the boundaries of each CNV
3. For each CNV, an associated probability for the number of copies

Proposed Method
1. Use discordant read pairs as CNV signal
2. Cluster discordant read pairs that explain the same CNV
3. Estimate CNV boundaries based on clustered reads
4. For each cluster, calculate the probability of the number of copies = 1, 2, 3, …

Discordant read pair
[Figure: donor and reference sequences with a concordant read pair]

Discordant read pair
[Figure: donor and reference sequences with a discordant read pair]
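A minimal sketch of how a mapped pair might be labeled concordant or discordant. The slides show this only as a figure, so the `ReadPair` layout, the `max_insert` threshold, and the orientation rule below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions of one paired-end read (hypothetical layout)."""
    fwd_pos: int   # leftmost reference coordinate of the forward read
    rev_pos: int   # leftmost reference coordinate of the reverse read
    read_len: int  # length of each read


def is_discordant(pair: ReadPair, max_insert: int) -> bool:
    """Return True when the pair does not fit the expected library layout.

    Assumed rule: a concordant pair has the forward read upstream of the
    reverse read, and the span from the start of the forward read to the
    end of the reverse read is at most 2 * read_len + max_insert.
    """
    if pair.rev_pos < pair.fwd_pos:
        # Reverse read maps upstream of the forward read.
        return True
    span = pair.rev_pos + pair.read_len - pair.fwd_pos
    return span > 2 * pair.read_len + max_insert
```

Under these assumptions, a pair whose reverse read maps upstream of its forward read is flagged, which is the pattern expected when the forward read falls in the first copy and the reverse read in the second.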

Clustering Discordant Read Pairs
- Use a greedy approach
  1. Pick any discordant read
  2. Compare with all other discordant reads and group any that are within a given distance
  3. Do this until no reads can be clustered together
- This sounds problematic but with the right assumptions it works
  - Assume the maximum insert length is known
  - Assume that the reverse read maps into the second copy and the forward read maps into the first copy
  - Assume that CNVs are far apart
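The greedy procedure above can be sketched as follows. This is an illustration, not the author's code: the `(forward_pos, reverse_pos)` tuple layout, the `max_dist` parameter, and the rule that both ends must lie within `max_dist` of the seed pair are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (forward_pos, reverse_pos) of one discordant pair


def greedy_cluster(pairs: List[Pair], max_dist: int) -> List[List[Pair]]:
    """Group discordant pairs around seed pairs, one seed at a time."""
    clusters: List[List[Pair]] = []
    remaining = list(pairs)
    while remaining:
        seed = remaining.pop(0)          # step 1: pick any discordant pair
        cluster = [seed]
        rest: List[Pair] = []
        for p in remaining:              # step 2: compare with all others
            close_fwd = abs(p[0] - seed[0]) <= max_dist
            close_rev = abs(p[1] - seed[1]) <= max_dist
            if close_fwd and close_rev:
                cluster.append(p)
            else:
                rest.append(p)
        remaining = rest                 # step 3: repeat on what is left
        clusters.append(cluster)
    return clusters
```

Because each pair is compared only against the seed, the result depends on which pair is picked first. The assumption that CNVs are far apart is what keeps that order dependence from mattering much in practice.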

Read Pair Cluster Distance
[Figure: read pair geometry with labels $x_{\min}$, $R + M_I$, and $N$]
$N = x_{\min} + R + M_I$

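Read Pair Cluster Distance
[Figure: read pair geometry with labels $x_{\max}$, $R$, and $N$]
$N = x_{\min} + R + M_I$
$N = x_{\max} + R$
$x_{\max} - x_{\min} = M_I$
[Equation: piecewise definition with "if" and "otherwise" cases; the expressions are not recoverable]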

Estimating Boundaries
- Have a set of clusters now
- Simple boundary estimation:
  - Left boundary =
  - Right boundary =

Estimating the number of copies
- Utilize coverage
  - Position coverage ~ Poisson(coverage)
- For each cluster:
  - Perform a goodness-of-fit test for each coverage level (i.e. number of copies = 2 => coverage' = 2 * coverage)
  - The coverage level that gives the best fit is the most likely
- Sadly this did not work :(
- So I resorted to estimating by looking at the ratio of the mean coverage to the expected coverage.
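A rough sketch of both ideas on this slide. `copy_ratio` is the mean-to-expected ratio the author fell back on. `best_copy_number` scores each candidate copy number with a Poisson log-likelihood, which is swapped in here as a stand-in because the slide does not say which goodness-of-fit test was used. The function names and the `max_copies` limit are hypothetical.

```python
import math
from typing import Sequence


def copy_ratio(coverage: Sequence[int], expected: float) -> float:
    """Ratio of the mean per-position coverage in a region to the expected coverage."""
    return (sum(coverage) / len(coverage)) / expected


def poisson_log_likelihood(coverage: Sequence[int], rate: float) -> float:
    """Log-likelihood of per-position coverage under Poisson(rate)."""
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in coverage)


def best_copy_number(coverage: Sequence[int], expected: float,
                     max_copies: int = 5) -> int:
    """Pick the copy number c whose rate c * expected fits the coverage best."""
    return max(range(1, max_copies + 1),
               key=lambda c: poisson_log_likelihood(coverage, c * expected))
```

For example, with an expected coverage of 40 and observed per-position coverage of `[78, 82, 80, 85, 75]`, the mean is 80, so `copy_ratio` returns 2.0.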

Simulation
- Wrote simulator tool in C
  - Simulates FASTQ paired-end reads
  - Reads and writes MAQ files
  - Computes coverage
- Generated 10MB random genome using mouse chr 19
- Inserted 5 CNVs spread across the genome
- Generated reads at 40x coverage
- Mapped to fake reference using MAQ
- Applied the previously mentioned method
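The author's simulator was written in C and is not shown. The Python sketch below only illustrates the general idea of such a simulation under several assumptions: the genome is uniformly random rather than derived from mouse chr 19, the CNV is a tandem duplication, insert lengths are drawn from a normal distribution, and both reads are returned on the forward strand with no sequencing errors. All function names are hypothetical.

```python
import random
from typing import List, Tuple


def random_genome(length: int, rng: random.Random) -> str:
    """Uniformly random sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def insert_tandem_copy(genome: str, start: int, length: int,
                       copies: int = 2) -> str:
    """Donor genome in which genome[start:start+length] occurs `copies` times in a row."""
    segment = genome[start:start + length]
    return genome[:start] + segment * copies + genome[start + length:]


def sample_pairs(donor: str, n_pairs: int, read_len: int,
                 mean_insert: float, sd_insert: float,
                 rng: random.Random) -> List[Tuple[str, str]]:
    """Draw error-free read pairs from random fragments of the donor."""
    if 2 * read_len > len(donor):
        raise ValueError("donor is shorter than two reads")
    pairs: List[Tuple[str, str]] = []
    while len(pairs) < n_pairs:
        insert = max(0, round(rng.gauss(mean_insert, sd_insert)))
        frag_len = 2 * read_len + insert
        if frag_len > len(donor):
            continue  # fragment does not fit; draw another insert length
        start = rng.randrange(len(donor) - frag_len + 1)
        frag = donor[start:start + frag_len]
        pairs.append((frag[:read_len], frag[-read_len:]))
    return pairs
```

To approximate a target depth such as 40x, the number of pairs would be chosen so that `n_pairs * 2 * read_len` is about 40 times the donor length.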

Results
- Found all of the CNVs
- No false positives
- Worked exactly as predicted, because reads were perfect.

[Table: CNV | Position | Length | Min | Max | Predicted length | Copy Ratio (CNV) | Copy Ratio (Ref); most cell values are not recoverable, surviving fragments follow]
CNV 1:
CNV 2: 10, ,00010,
CNV 3: 20, ,99921,
CNV 4: 24,000 (200,300) ,15425,
CNV 5: 1,000,000 (2000) | 10,000 | 1,002,000 | 1,009,968 | 7,

Results
- Applied method to real mouse sequence data
- 40x coverage on chromosome 17 for CAST mouse strain
- Found 1456 possible CNVs
- Most of them were crazy looking
- Zeroed in on 5 that look interesting

[Table: CNV | Min | Max | Predicted length | Copy Ratio (CNV); five rows, cell values not recoverable]

- Obviously the method needs some work

Future Work
- Table from Lee et al. shows that the number of perfectly mapped reads that are discordant is very small
- Their method considers all possible mapping positions for each pair
- My method needs to consider this and toss MAQ

Future Work
- Replace the ad-hociness with a more formal probabilistic framework
- Consider all high-quality mapping positions (take into account low quality mappings)
- Consider the problem of repeated sequences