Detecting copy number variations using paired-end sequence data Nick Furlotte CS224 May 29, 2009.


SNPs are old news
SNPs have been the main way of quantifying genetic variation.
Attention is now switching to other, harder-to-detect variations.
Structural variations possibly account for a large portion of genetic variance.
Structural variations include:
– Insertions
– Deletions
– Inversions
– Translocations
– Copy number variations (CNVs)

Copy Number Variation: What does a CNV look like?
[Figure; labels: Donor, Reference]

How do we find CNVs?
De novo sequence assembly is hard.
Resequencing is now an option with low-cost next-gen sequencing.
This project aims to find CNVs using next-gen sequencing reads.

Paired-end sequencing
[Figure; source: Illumina website]

Paired-end sequencing
Read lengths are a known size.
Insert length has a distribution.
Output:

The Computational Problem
Given:
– A set of paired-end reads
– The mapping positions in the reference
Output:
1. A set of CNVs
2. An estimate of the boundaries of each CNV
3. For each CNV, an associated probability for the number of copies
Read and mapping quality

Proposed Method
1. Use discordant read pairs as the CNV signal.
2. Cluster discordant read pairs that explain the same CNV.
3. Estimate CNV boundaries based on the clustered reads.
4. For each cluster, calculate the probability that the number of copies = 1, 2, 3, …

Discordant read pair
[Figure: a concordant read pair; labels: Donor, Reference]

Discordant read pair
[Figure: a discordant read pair; labels: Donor, Reference]
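
The slides show the concordant and discordant cases only as figures, so the rule below is an assumption and not the talk's definition. It is a minimal sketch in which a pair counts as concordant when the forward read maps upstream of the reverse read and the outer span of the pair does not exceed a known maximum insert length. The function name, the parameters, and the choice to measure the span from the start of the forward read to the end of the reverse read are all hypothetical.

```python
def is_discordant(forward_pos: int, reverse_pos: int,
                  read_len: int, max_insert: int) -> bool:
    """Return True if a mapped read pair looks discordant.

    Assumptions (not taken from the slides):
    - positions are leftmost mapping coordinates on the reference;
    - the insert is measured as the outer span of the pair;
    - any pair whose reverse read maps upstream of its forward read
      is treated as discordant.
    """
    if reverse_pos < forward_pos:
        return True  # order of the two reads is flipped on the reference
    span = reverse_pos + read_len - forward_pos
    return span > max_insert  # reads map too far apart


if __name__ == "__main__":
    print(is_discordant(1000, 1200, 35, 300))  # pair within the insert limit
    print(is_discordant(1000, 5000, 35, 300))  # pair mapped too far apart
```

Real mappers report orientation and mapping quality as well; a fuller version would probably take those into account.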

Clustering Discordant Read Pairs
Use a greedy approach:
1. Pick any discordant read pair.
2. Compare it with all other discordant read pairs and group any that are within a given distance.
3. Repeat until no more read pairs can be clustered together.
This sounds problematic, but with the right assumptions it works:
– Assume the maximum insert length is known.
– Assume that the reverse read maps into the second copy and the forward read maps into the first copy.
– Assume that CNVs are far apart.
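
The greedy procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the talk's implementation: a read pair is represented as a hypothetical `(forward_pos, reverse_pos)` tuple, and "within a given distance" is taken to mean that both ends lie within `max_dist` of the seed pair's ends.

```python
from typing import List, Tuple

# Hypothetical representation: (forward_pos, reverse_pos) on the reference.
ReadPair = Tuple[int, int]


def greedy_cluster(pairs: List[ReadPair], max_dist: int) -> List[List[ReadPair]]:
    """Greedily group discordant read pairs.

    Each round takes one unclustered pair as the seed and absorbs every
    remaining pair whose two ends are both within max_dist of the seed's
    ends (an assumed distance rule). Rounds repeat until nothing is left.
    """
    remaining = sorted(pairs)
    clusters: List[List[ReadPair]] = []
    while remaining:
        seed = remaining.pop(0)
        cluster = [seed]
        leftover = []
        for pair in remaining:
            close_fwd = abs(pair[0] - seed[0]) <= max_dist
            close_rev = abs(pair[1] - seed[1]) <= max_dist
            if close_fwd and close_rev:
                cluster.append(pair)
            else:
                leftover.append(pair)
        remaining = leftover
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    demo = [(100, 1100), (5000, 7000), (120, 1090), (5010, 6990)]
    for c in greedy_cluster(demo, max_dist=200):
        print(c)
```

Because every pair is compared only against the seed, the result depends on which pair is picked first; the assumption that CNVs are far apart is what keeps this from mattering much. Singleton clusters are kept here, and a real pipeline would likely filter them out.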

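Read Pair Cluster Distance
[Figure; labels: $x_{\min}$, $R$, $M_I$, $N$]
$N = x_{\min} + R + M_I$

Read Pair Cluster Distance
[Figure; labels: $x_{\max}$, $R$, $N$]
$N = x_{\min} + R + M_I$
$N = x_{\max} + R$
$x_{\max} - x_{\min} = M_I$
[Piecewise definition with "if" / "otherwise" cases not shown]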

Estimating Boundaries
We now have a set of clusters.
Simple boundary estimation:
Left boundary =
Right boundary =

Estimating the number of copies
Utilize coverage:
– Position coverage ~ Poisson(coverage)
For each cluster:
– Perform a goodness-of-fit test for each coverage level (i.e., number of copies = 2 => coverage' = 2 × coverage).
– The coverage level that gives the best fit is the most likely.
Sadly, this did not work :(
So I resorted to estimating the number of copies from the ratio of the mean coverage to the expected coverage.
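
Both ideas on this slide can be sketched in a few lines. `copy_ratio` is the fallback actually used in the talk (mean coverage in the region divided by the expected coverage). `most_likely_copies` is a stand-in for the Poisson step: it compares Poisson log-likelihoods across candidate copy numbers rather than running a goodness-of-fit test, and it treats positions as independent, which is an assumption. All names and the `max_copies` parameter are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def copy_ratio(region_coverage: Sequence[int], expected_coverage: float) -> float:
    """Ratio of mean observed coverage to the expected single-copy coverage."""
    if expected_coverage <= 0:
        raise ValueError("expected_coverage must be positive")
    return mean(region_coverage) / expected_coverage


def poisson_loglik(region_coverage: Sequence[int], lam: float) -> float:
    """Sum of log Poisson(k; lam) over positions, assumed independent."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1)
               for k in region_coverage)


def most_likely_copies(region_coverage: Sequence[int],
                       expected_coverage: float,
                       max_copies: int = 4) -> int:
    """Pick the copy number c whose rate c * expected_coverage fits best."""
    return max(range(1, max_copies + 1),
               key=lambda c: poisson_loglik(region_coverage,
                                            c * expected_coverage))


if __name__ == "__main__":
    cov = [78, 82, 80, 85, 75]  # made-up per-position coverage values
    print(copy_ratio(cov, 40))
    print(most_likely_copies(cov, 40))
```

The coverage values in the usage example are invented for illustration. The talk reports that the Poisson-based approach did not work on its data, so the ratio is the more faithful of the two.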

Simulation
Wrote a simulator tool in C:
– Simulates FASTQ paired-end reads
– Reads and writes MAQ files
– Computes coverage
Generated a 10MB random genome using mouse chr 19.
Inserted 5 CNVs spread across the genome.
Generated reads at 40x coverage.
Mapped to the fake reference using MAQ.
Applied the previously mentioned method.
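
The slides do not say how the simulator computes coverage. One common way to do it, shown here as a hedged Python sketch and not as the C tool's actual code, is a difference array over read start positions. The function name and the 0-based coordinate convention are assumptions.

```python
from typing import List, Sequence


def per_position_coverage(read_starts: Sequence[int],
                          read_len: int,
                          genome_len: int) -> List[int]:
    """Count how many reads overlap each reference position.

    Assumes 0-based start coordinates and a fixed read length. Reads that
    run past the end of the genome are clipped; reads starting outside
    the genome are ignored.
    """
    diff = [0] * (genome_len + 1)
    for start in read_starts:
        if 0 <= start < genome_len:
            diff[start] += 1                               # read begins here
            diff[min(start + read_len, genome_len)] -= 1   # read ends before here
    coverage = []
    running = 0
    for i in range(genome_len):
        running += diff[i]
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    print(per_position_coverage([0, 2], read_len=3, genome_len=6))
```

The expected genome-wide coverage is then roughly (number of reads × read length) / genome length, which is the quantity the 40x figure refers to.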

Results
Found all of the CNVs.
No false positives.
Worked exactly as predicted, because reads were perfect.

CNV | Position | Length | Min | Max | Predicted length | Copy Ratio (CNV) | Copy Ratio (Ref)
CNV 1 | 1000 | | 998 | 1968 | 970 | 2.03 | 1.003
CNV 2 | 10,000 | 1000 | 10,000 | 10,967 | 967 | 1.91 | 0.97
CNV 3 | 20,000 | 2000 | 19,999 | 21,967 | 1967 | 2.09 | 1.09
CNV 4 | 24,000 (200,300) | 2000 | 24,154 | 25,967 | 1813 | 1.95 | 1.04
CNV 5 | 1,000,000 (2000) | 10,000 | 1,002,000 | 1,009,968 | 7,968 | 1.92 | 0.99

Results
Applied the method to real mouse sequence data.
40x coverage on chromosome 17 for the CAST mouse strain.
Found 1456 possible CNVs.
Most of them were crazy looking.
Zeroed in on 5 that look interesting.

CNV | Min | Max | Predicted length | Copy Ratio (CNV)
CNV 1 | 27343569 | 27471598 | 128029 | 2.94
CNV 2 | 36324769 | 36348359 | 23590 | 0.85
CNV 3 | 34102149 | 36292926 | 2190777 | 0.77
CNV 4 | 19711195 | 19907779 | 196584 | 0.71
CNV 5 | 31390566 | 31446686 | 56120 | 2.4

Obviously the method needs some work.

Future Work
Table from Lee, et al. shows that the number of perfectly mapped reads that are discordant is very small.
Their method considers all possible mapping positions for each pair.
My method needs to consider this and toss MAQ.

Future Work
Replace the ad-hociness with a more formal probabilistic framework.
Consider all high-quality mapping positions (take into account low-quality mappings).
Consider the problem of repeated sequences.
