
1 Next-generation sequencing: from basics to future diagnostics. PART II: NGS analysis to find variants
Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine

2 Overview
PART I: NGS technologies and standard workflow
  Next generation sequencing
  History and technology
  Data and its meaning; process workflow
  Discussion
PART II: NGS analysis to find variants
  Single nucleotide variants (SNVs)
  Copy number variations (CNVs)
  Structural variations (SVs)
PART III: NGS application to diagnostics
  NGS in genomic medicine
  Potential application to forensic science

3 From previous session
Conventional variant calling
Variant calling in minor subgroups

4 Next-generation sequencing
Massively parallel sequencing (a.k.a. next-generation sequencing) via spatially separated, clonally amplified DNA templates or single DNA molecules (Metzker et al, Nat Rev Genet, 2010)
Platforms shown: Illumina HiSeq2500, 5500 SOLiD system, Ion Torrent PGM

5 The human genome project
Began in 1990.
Consortium comprising the U.S., U.K., France, Australia, Japan, etc.
“Rough draft” in 2000.
“Complete genome” published in 2003.
13 years, $3 billion.

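6 FASTQ format (NGS raw data)
A format for NGS reads: each read is stored as its sequence together with its per-base quality.
[Figure: one read in FASTQ format, with the sequence line and the quality line labeled.]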
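As a minimal sketch of what one FASTQ read looks like to a program, the block below parses a four-line record and decodes its quality string. The Phred+33 offset, the function names, and the toy record are assumptions for illustration; they are not taken from the slides.

```python
def parse_fastq_record(lines, phred_offset=33):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores).

    Assumes the common Phred+33 quality encoding (an assumption, not from the slides).
    """
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes a Phred score Q = ord(char) - offset.
    phred_scores = [ord(ch) - phred_offset for ch in quality]
    return header[1:], sequence, phred_scores


def error_probability(q):
    """Convert a Phred score Q into a base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical toy record: identifier, bases, separator, qualities.
record = ["@read_1", "GATTACA", "+", "IIIII5!"]
read_id, seq, quals = parse_fastq_record(record)
print(read_id, seq, quals)
```

A Phred score of 20 corresponds to a 1% chance that the base call is wrong, which is the accuracy level used later in the deck when discussing false SNPs.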

7 NGS analysis workflow
[Figure: workflow from data generation to validation. Kim S and Paik S, in preparation.]
A. Data generation: control sequencing, quality control, raw reads (FASTQ files), short read alignment (BAM files)
B. Variant finding: germ-line mutation, somatic mutation, copy number variation (CNV), structural variation (SV), xenogeneic sequence
C. Variant analysis: recurrence analysis, variant impact prediction, mutation filtration/selection, tumor heterogeneity inference
D. Validation and functional assessment: variant confirmation, pathway analysis, functional study
Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows inspection of all genomic areas and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome) at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnosis kit), a set of targeted genes is amplified and fed into sequencing (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.

8 Short Read Alignment Data preprocessing

9 Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!

10 Brute force way
Find “GATTCAAA” in the human genome
[Figure: your query is slid along the reference genome (chr1, start), one position at a time, and compared base by base. The reference is very long (3 billion bases).]
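The brute force comparison can be sketched in a few lines. The function name and the toy reference string are assumptions for illustration; on a real 3-billion-base reference this loop is exactly what makes naive matching too slow.

```python
def naive_find(reference, query):
    """Return every 0-based position where query occurs exactly in reference."""
    hits = []
    n, m = len(reference), len(query)
    for start in range(n - m + 1):       # slide the query along the reference
        for offset in range(m):          # compare character by character
            if reference[start + offset] != query[offset]:
                break                    # mismatch: move on to the next start
        else:                            # loop finished without a mismatch
            hits.append(start)
    return hits


# Hypothetical toy reference; the query is the one named on the slide.
toy_reference = "TGACGATCGATTCAAAGT"
print(naive_find(toy_reference, "GATTCAAA"))
```

In the worst case every start position costs up to one comparison per query base, so the work grows with (reference length) x (query length) for every single read.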

11 How fast should it be?
                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
Based on 200bp read length, 80x single-end WGS

12 Searching with index
Assume you’re searching for “genome” in an English dictionary.
You don’t search every line on every page:
  First, find the page range of “g” in the dictionary
  Within that range (of ‘g’), find the page range of “ge”
  Within that range (of ‘ge’), find the page range of “gen”
  ... until you find “genome”
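The dictionary analogy can be sketched as repeated range narrowing on a sorted list. This is only an illustration of the idea: the word list and the helper name `prefix_range` are hypothetical, and a genome index works on sorted rotations rather than on dictionary words.

```python
from bisect import bisect_left


def prefix_range(sorted_words, prefix, lo=0, hi=None):
    """Return the half-open range [start, end) of entries in sorted_words[lo:hi]
    that begin with prefix, using binary search instead of a linear scan."""
    if hi is None:
        hi = len(sorted_words)
    start = bisect_left(sorted_words, prefix, lo, hi)
    # chr(0x10FFFF) is the largest code point, so prefix + it sorts after
    # every ordinary word that begins with prefix.
    end = bisect_left(sorted_words, prefix + chr(0x10FFFF), lo, hi)
    return start, end


# Hypothetical toy dictionary.
dictionary = sorted(["apple", "gel", "gene", "genetic", "genome", "genus", "giraffe", "zebra"])

word = "genome"
lo, hi = 0, len(dictionary)
for k in range(1, len(word) + 1):
    # Narrow the previous range using one more character of the word.
    lo, hi = prefix_range(dictionary, word[:k], lo, hi)
    print(word[:k], dictionary[lo:hi])
```

Each step only looks inside the range found by the previous step, which is the property the genome index needs to reproduce.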

13 How can we build an index for genome?

14 Burrows-Wheeler Transform

15 Burrows-Wheeler Transformation
BANANA

16 Burrows-Wheeler Transformation
BANANA$
The end marker “$” is appended to the string and is treated as the lexicographically smallest symbol.

17 Burrows-Wheeler Transformation
BANANA$ ANANA$B

18 Burrows-Wheeler Transformation
BANANA$ ANANA$B NANA$BA

19 Burrows-Wheeler Transformation
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

20 Burrows-Wheeler Transformation
0 BANANA$ 1 ANANA$B 2 NANA$BA 3 ANA$BAN 4 NA$BANA 5 A$BANAN 6 $BANANA

24 Burrows-Wheeler Transformation
All rotations of BANANA$, in original order (index, rotation):
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
After sorting (row, original index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
Last column of the sorted rotations: ANNB$AA
BWT(“BANANA$”) = “ANNB$AA”
BWT just changes the order of the string
BWT tends to collect similar characters together
With only the transformed string, we can easily get the original string
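The construction on the slides, and the claim that the original string can be recovered from the transformed string alone, can be checked with a small sketch. It sorts all rotations explicitly, which is fine for BANANA but far too slow and memory-hungry for a genome; the function names are assumptions for illustration.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform by explicitly sorting all rotations."""
    if end_marker in text:
        raise ValueError("text must not contain the end marker")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, end_marker="$"):
    """Recover the original text from its transform (naive table rebuild)."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the transformed column to every row, then re-sort.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the end marker is the original string.
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


print(bwt("BANANA"))
print(inverse_bwt(bwt("BANANA")))
```

The sort relies on “$” comparing lower than the letters, which holds for ASCII uppercase and lowercase characters.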

25 LF Search
Question: Find “NAN” from BANANA
Sorted rotations of BANANA$ (row, original rotation index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

26 LF Search
The query is matched from its last character backward: N, then AN, then NAN.

28 LF Search
The range of strings that start with “N” can be calculated from:
  the number of symbols that are lexicographically less than ‘N’, to determine the start point = 5
  the number of ‘N’, to determine the end point = 2
Start = row 5, end = row 6.

31 LF Search
The range of strings that start with “AN” can be calculated from:
  the number of symbols that are lexicographically less than ‘A’, to determine the start point = 1
  the number of ‘A’, to determine the end point = 3
This is a range for ‘A’ (rows 1 to 3), not ‘AN’!!

34 LF Search
“Ax” is not “AN” and is less than “AN”.
Count of ‘A’ in the last column before the start point of the “N” range = 1
The range of strings that start with “AN” can be calculated from:
  the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ before the start point, to determine the start point = 1 + 1 = 2
  the number of ‘A’ before the end point, to determine the end point = 3
Start = row 2, end = row 3.

35 LF Search
The range of strings that start with “NAN” can be calculated from:
  the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ before the start point, to determine the start point = 5 + 1 = 6
  the number of ‘N’ before the end point, to determine the end point = 2
Start = end = row 6.

36 LF Search
Row 6 (NANA$BA) is the 2nd row of the original permutation, i.e. the original string rotated 2 times, so “NAN” exists at the 3rd position of “BANANA”.
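The backward search walked through above can be sketched as follows. This is a toy version under stated assumptions: the suffix array is built by plain sorting, occurrence counts are recomputed by scanning the last column rather than read from a precomputed table, and ranges are half-open [start, end), whereas the slides use an inclusive end row. All function names are hypothetical.

```python
def build_fm_index(text, end_marker="$"):
    """Build a toy index: suffix array, last column, and first-row offsets."""
    s = text + end_marker
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # Last column of the sorted rotations: the character preceding each suffix.
    last_column = [s[i - 1] for i in suffix_array]
    # first_row[c] = number of symbols in s lexicographically less than c.
    first_row = {}
    for rank, i in enumerate(suffix_array):
        first_row.setdefault(s[i], rank)
    return suffix_array, last_column, first_row


def backward_search(query, suffix_array, last_column, first_row):
    """Return the sorted 0-based positions of query, matching it backward."""
    start, end = 0, len(last_column)          # half-open row range
    for symbol in reversed(query):
        if symbol not in first_row:
            return []
        # symbols less than `symbol` + occurrences of `symbol` before the row
        start = first_row[symbol] + last_column[:start].count(symbol)
        end = first_row[symbol] + last_column[:end].count(symbol)
        if start >= end:
            return []
    return sorted(suffix_array[start:end])


index = build_fm_index("BANANA")
print(backward_search("NAN", *index))   # 0-based position of NAN in BANANA
print(backward_search("ANA", *index))
```

For “NAN” the row ranges visited are [5, 7), [2, 4) and [6, 7), matching rows 5 to 6, 2 to 3 and 6 on the slides; position 2 (0-based) is the 3rd position of BANANA.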

37 Genome query
[Figure: genome query, imported from Mike Schatz’s slides.]
Genome Informatics I (2015 Spring)

44 Inexact matching
When an exact match does not exist: continue with the other possible candidates (G -> A, C, T) and increase the mismatch count.
If another mismatch occurs, branch out again.
So the allowed edit distance is critical to alignment speed.
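A minimal sketch of this branching, assuming substitutions only (no insertions or deletions) and a toy index built by plain sorting. The function names, the alphabet argument, and the toy reference are hypothetical; real aligners add pruning and indel handling that are not shown here.

```python
def build_fm_index(text, end_marker="$"):
    """Toy index: suffix array, last column, and first-row offsets."""
    s = text + end_marker
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    last_column = [s[i - 1] for i in suffix_array]
    first_row = {}
    for rank, i in enumerate(suffix_array):
        first_row.setdefault(s[i], rank)
    return suffix_array, last_column, first_row


def inexact_search(query, text, max_mismatches, alphabet="ACGT"):
    """Return sorted (position, mismatches) pairs with at most max_mismatches
    substitutions, by branching over every candidate base at each step."""
    suffix_array, last_column, first_row = build_fm_index(text)
    hits = set()

    def extend(pos, start, end, mismatches):
        if pos < 0:                               # whole query consumed
            for row in range(start, end):
                hits.add((suffix_array[row], mismatches))
            return
        for symbol in alphabet:                   # the query base and its alternatives
            cost = mismatches + (symbol != query[pos])
            if cost > max_mismatches or symbol not in first_row:
                continue                          # mismatch budget used up
            new_start = first_row[symbol] + last_column[:start].count(symbol)
            new_end = first_row[symbol] + last_column[:end].count(symbol)
            if new_start < new_end:               # branch only while rows remain
                extend(pos - 1, new_start, new_end, cost)

    extend(len(query) - 1, 0, len(last_column), 0)
    return sorted(hits)


toy_reference = "TGACGATCGATTCAAAGT"               # hypothetical toy reference
print(inexact_search("GATTGAAA", toy_reference, max_mismatches=1))
```

Every extra allowed mismatch multiplies the number of branches that have to be explored, which is why the mismatch budget dominates running time.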

45 Goal achieved
                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days

46 Variant calling – SNV calling

47 Detailed View
A genome region
One read = one DNA fragment aligned to a specific genomic region = one observation of our sample in this region

48 Detailed View
A certain genomic position (in bp)

49 Detailed View
A certain genomic position (in bp)
[Figure: a column of bases at this position, showing the reference allele and the observations of our sample at this position from read 1, read 2, ..., read 10.]

50 Why multiple observations?
Observations contain errors
  errors from the machine: basecall error
  errors from mapping: mapping error
  errors from others: library prep error
With an accuracy of 99%, a 1% error over the whole region leads to
  ~30 million false SNPs for the whole genome
  ~500k false SNPs for the whole exome
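The arithmetic behind these numbers, and the reason several observations help, can be sketched as below. The 3-billion-site genome is the figure used earlier in the deck; the 50-million-site exome is an assumption chosen to match the ~500k figure, and treating read errors as independent is a simplification.

```python
from math import comb


def expected_false_calls(sites, error_rate):
    """Expected number of wrong calls when each site is observed once."""
    return sites * error_rate


def prob_at_least_k_errors(depth, k, error_rate):
    """Probability that k or more of `depth` independent reads are erroneous."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


genome_sites = 3_000_000_000   # from the deck: the genome is ~3 billion bases
exome_sites = 50_000_000       # assumption: an exome of ~50 Mb
error_rate = 0.01              # 99% accuracy

print(expected_false_calls(genome_sites, error_rate))
print(expected_false_calls(exome_sites, error_rate))
# With 10 reads per site, how likely is it that at least half are wrong?
print(prob_at_least_k_errors(10, 5, error_rate))
```

A single observation is wrong 1% of the time, while five or more wrong reads out of ten is a far rarer event, so a call supported by many reads is much harder to explain by error alone.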

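51 Human diploid genome
[Figure: reads observed at one site of a diploid genome, in four cases.]
  Homozygous reference: all reads show G
  Homozygous reference with sequencing error / mapping error: mostly G, with an occasional A
  Heterozygous alternative / somatic mutations: a mix of G and A
  Homozygous alternative: all reads show A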

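52 Allele fraction distribution (binomial)
Normal approximation of B(100, 0.5):
\[ \mu = np = 100 \times 0.5 = 50 \]
\[ \sigma = \sqrt{npq} = 5 \]
\[ \Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973 \]
\[ \Pr(35 \le x \le 65) \approx 0.9973 \]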
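The normal approximation can be compared with the exact binomial sum using only the standard library. This is a small check under the slide's own setting (100 reads, allele fraction 0.5); the function name is an assumption.

```python
from math import comb, erf, sqrt


def binomial_interval_probability(n, p, low, high):
    """Exact Pr(low <= X <= high) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(low, high + 1))


n, p = 100, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Pr(|Z| <= 3) for a standard normal variable.
normal_3sigma = erf(3 / sqrt(2))
# Exact binomial probability of the same interval, 35 to 65 reads.
exact = binomial_interval_probability(n, p, int(mu - 3 * sigma), int(mu + 3 * sigma))

print(mu, sigma)
print(round(normal_3sigma, 4))
print(exact)
```

At a truly heterozygous site covered by 100 reads, almost all allele counts therefore fall between 35 and 65; counts far outside that window call for another explanation.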

53 Allele fraction distribution (binomial)
[Figure: reads showing G and A at a site, illustrating the allele fraction.]

54 Inferring mutations
Reference allele: G
Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “G” at the site of “G” (genotype notation: A = reference allele, B = alternative allele):
  True genotype = “AA” and no sequencing error: \( P(1-e) \)
  True genotype = “AB” and
    the read was generated from the ‘A’ allele and no sequencing error: \( \frac{1}{2} \cdot P(1-e) \)
    the read was generated from the ‘B’ allele, a sequencing error occurred, and ‘A’ was generated by chance: \( \frac{1}{2} \cdot \frac{1}{4} \cdot P(e) \)
  True genotype = “BB” and sequencing error: \( \frac{1}{4} \cdot P(e) \)

55 Inferring mutations
Reference allele: G
Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “A” at the site of “G” (genotype notation: A = reference allele, B = alternative allele):
  True genotype = “AA” and sequencing error: \( \frac{1}{4} \cdot P(e) \)
  True genotype = “AB” and
    the read was generated from the ‘A’ allele, a sequencing error occurred, and ‘B’ was generated by chance: \( \frac{1}{2} \cdot \frac{1}{4} \cdot P(e) \)
    the read was generated from the ‘B’ allele and no sequencing error: \( \frac{1}{2} \cdot P(1-e) \)
  True genotype = “BB” and no sequencing error: \( P(1-e) \)

56 Genotype determination
Likelihood that the genotype is wild-type given the observation!
\[ L(g=AA \mid D) = \prod_{i=1}^{d} P(D_i \mid g=AA) \]
Likelihood that the genotype is mutant given the observation!
\[ L(g=AB \mid D) = \prod_{i=1}^{d} P(D_i \mid g=AB) \]
\[ L(g=BB \mid D) = \prod_{i=1}^{d} P(D_i \mid g=BB) \]
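A minimal sketch of these likelihoods, using the per-read probabilities from the two preceding slides (an erroneous read shows a given base with probability 1/4). The error rate of 0.01, the function names, and the use of log-likelihoods to avoid tiny products are assumptions for illustration; production callers use per-base qualities and priors that are not modelled here.

```python
from math import log


def observation_probability(base, genotype, ref, alt, e):
    """P(D_i | g) for one observed base under the slides' simple error model."""
    p_if_from_ref_allele = (1 - e) if base == ref else e / 4
    p_if_from_alt_allele = (1 - e) if base == alt else e / 4
    if genotype == "AA":                  # both chromosomes carry the reference
        return p_if_from_ref_allele
    if genotype == "BB":                  # both chromosomes carry the alternative
        return p_if_from_alt_allele
    # "AB": the read comes from either allele with probability 1/2.
    return 0.5 * p_if_from_ref_allele + 0.5 * p_if_from_alt_allele


def genotype_log_likelihoods(pileup, ref, alt, e=0.01):
    """log L(g | D) = sum over reads of log P(D_i | g), for g in AA, AB, BB."""
    return {
        g: sum(log(observation_probability(base, g, ref, alt, e)) for base in pileup)
        for g in ("AA", "AB", "BB")
    }


pileup = "AGAGGGGGAAAGAGA"                # observations from the slides
scores = genotype_log_likelihoods(pileup, ref="G", alt="A")
best = max(scores, key=scores.get)
print(scores)
print(best)
```

With 8 G and 7 A observations, the heterozygous genotype is by far the most likely of the three, because both homozygous genotypes would need many sequencing errors to explain the data.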

57 Tools

58 somatic mutations

59 Germline vs. Somatic mutation
[Figure: a sample from a non-disease site and a sample from a disease site are compared with the reference sequence (e.g. hg19) and with each other.]
Tools shown: UnifiedGenotyper, VarScan2, SomaticSniper

60 Easy way to find somatic mutations
Sample from non-disease site: GN = AA
Sample from disease site: GT = AB

61 Joint Probabilities

62 Joint Probabilities
P(GT=AB|GN=AA) ≠ P(GT=AB|GN=AB) ≠ P(GT=AB|GN=BB)
Tumor genotype is dependent on normal genotype!!!
G: Joint Genotype Matrix

63 when sample is not pure

64 Heterogeneous Sample
[Figure: reads from normal cells show only G at the site, while reads from tumor cells show a mix of G and A.]

68 Causes of low-frequency mutations
Sample contamination (e.g. stromal cells)
Tumor heterogeneity
Extreme environments
Somatic mosaicism

70 Heterogeneous Sample
Observed reads at the site: G G G G G G G G G G G A A G G (2 of 15 reads show ‘A’)
“2/15: No mutation. Two ‘A’s are from sequencing errors...”
VS
“2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!”
“How do we know this?”
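One way to put numbers on the two readings of “2/15” is to compare how likely the observation is under each. The sketch below does this with a plain binomial model; the 1% error rate and the 15% expected alternative-read fraction (for example, a heterozygous mutation present in roughly 30% of the cells) are hypothetical values, not figures from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


alt_reads, depth = 2, 15
error_rate = 0.01      # assumption: per-base error rate
alt_fraction = 0.15    # assumption: expected alternative-read fraction if mutated

likelihood_error_only = binomial_pmf(alt_reads, depth, error_rate)
likelihood_mutation = binomial_pmf(alt_reads, depth, alt_fraction)

print(likelihood_error_only)
print(likelihood_mutation)
print(likelihood_mutation / likelihood_error_only)
```

The comparison depends entirely on the assumed alternative-read fraction, which is why the fraction of normal cells in the sample has to be estimated first.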

71 Estimating Cellularity
It is “easy” only if we already know where to look (the disease genotype is AB or BB)
But how do we know the genotype (even without knowing α)?
  Use SNP array: ONCOSNP (Yau et al, Genome Biol, 2009), Absolute (Carter et al, Nature Biotech, 2012)
  Use SNP calling: Snyder et al, PNAS, 2010; PurityEst (Su et al, Bioinformatics, 2012)

72 Accurate inference in Virmid
Estimate global within-individual contamination for accurate detection of somatic mutations

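73 Bias 1 - Loss of Reads (Virmid)
[Figure: the reference (ref) carries allele A; haplotype g1 carries A and yields read r1; haplotype g2 carries B and yields read r2.]
\( x_a = P(\text{a read that passes } g_1 \text{ being unmapped}) = P(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites}) \)
\( x_b = P(\text{a read that passes } g_2 \text{ being unmapped}) = P(r_2 \text{ has } d \text{ or more variants in the remaining sites}) \)
\[ x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^{i} (1-p)^{l-1-i} \]
\[ x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^{i} (1-p)^{l-1-i} \]
where \( d \) = maximum edit distance, \( l \) = read length, and \( p \) = frequency of variation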
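The two formulas can be evaluated directly. The sketch below is a plain transcription of them into Python, not Virmid's code; the parameter values in the example (edit distance 2, read length 100, variation frequency 0.001) are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Pr(X <= k) for X ~ B(n, p); returns 0 when k is negative."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def unmapped_probabilities(d, l, p):
    """Return (x_a, x_b) as defined on the slide.

    d: maximum edit distance allowed by the aligner
    l: read length
    p: frequency of variation at each of the remaining l - 1 sites
    """
    x_a = 1 - binomial_cdf(d, l - 1, p)       # r1 needs d+1 or more other variants
    x_b = 1 - binomial_cdf(d - 1, l - 1, p)   # r2 already carries one variant
    return x_a, x_b


x_a, x_b = unmapped_probabilities(d=2, l=100, p=0.001)
print(x_a, x_b)
```

Because a read carrying the B allele has already spent one unit of its edit-distance budget, x_b is never smaller than x_a, so B-allele reads are lost preferentially.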

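74 Bias 2 - Loss of variants (Virmid)
A fraction α of the reads comes from normal cells and 1-α from disease cells.
[Figure: effect of lost variants on the B-allele frequency (BAF) and on the estimate of α; labels: B-allele, overestimate, BAF, underestimate, α.]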

75 Estimated α
[Figure: estimated α, with regions labeled “underestimated α” and “overestimated α”.]

76 Calling low-fraction somatic mutations in Virmid
Kim S et al, Genome Biology 2013

77 Low-frequency mutations in disease
Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly (Lee J et al, Nature Genetics, 2012)

78 Low-frequency mutations in disease
Identification of MTOR driver mutations in focal cortical dysplasia (Lim J et al, Nature Medicine, 2015)

79 Copy number variation (CNV)

80 Copy Number Variation
Changes in the copy number of a large DNA segment
  usually in terms of genes
  e.g. HER2 amplification
Types of CNVs
  Copy number gain (CN > 2): increase of copy number due to genomic rearrangement such as insertion/duplication
  Copy number loss (CN < 2): decrease of copy number due to deleterious genomic rearrangements
Copy number aberration (CNA) refers to CNV particularly when the events are associated with a disease phenotype

81 Comparative Genome Hybridization (CGH)
500kb-1500kb fragment for optimal hybridization

82 Array CGH

83 Resolution

84 Benefits of NGS-based CNV detection
High resolution (< 50 bp) in size
Data reuse (multi-purpose)
  One NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
Can be improved with additional NGS information
  Discordant reads in paired-end sequencing

85 Inferring CNVs from NGS
Principle: samples with copy number gain (or loss) will generate more (or fewer) reads in the region
[Figure: reads generated from a gene present in 3 copies (gain), 2 copies (normal), and 1 copy (loss).]

86 The signal
[Figure: reads from regions with 3 copies (gain), 2 copies (normal), and 1 copy (loss), mapped to the reference.]

87 The signal
Catching these needs a systematic approach!

88 Catching the signal
Problems: read depth is not uniform even without copy number changes
  GC bias
  Mapping bias in repeat regions
  Natural variance (Poisson distribution)
Poisson distribution: the probability of a given number of events occurring in a fixed interval of time and/or space.
Examples:
  You have 120 phone calls a day; what is the best way to describe the number of phone calls in an hour?
  Similarly, you generated 100,000,000 NGS reads from the whole genome; what is the number of reads generated within chr1: ?
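The phone-call example can be worked through with a few lines of Python. The rate of 5 calls per hour follows from 120 calls a day; for the read example, the slide's region coordinates are not given, so the 1,000,000 bp window used below is a hypothetical stand-in, as is the 3-billion-base genome length taken from earlier in the deck.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam**k * exp(-lam) / factorial(k)


# 120 phone calls a day -> expected calls in one hour.
calls_per_hour = 120 / 24
for k in (0, 5, 10):
    print(k, poisson_pmf(k, calls_per_hour))

# Expected number of reads starting in a window, if reads fall uniformly.
total_reads = 100_000_000
genome_length = 3_000_000_000   # from the deck: ~3 billion bases
window_length = 1_000_000       # assumption: hypothetical 1 Mb region
expected_reads_in_window = total_reads * window_length / genome_length
print(expected_reads_in_window)
```

In both cases the observed count scatters around the expected value with a variance equal to that expected value, which is the "natural variance" a CNV caller has to see past.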

89 Significantly deviated read-depth
Null hypothesis (H0): the copy number of a given region is unchanged
  We assume the read-depth follows a Poisson distribution
Alternative hypothesis (Ha): the copy number of a given region is changed
If H0 is right: the read-depth (calculated from the number of reads) within a specific genomic region does not deviate significantly from the Poisson distribution
If the read-depth deviates too much to be explained by natural variance (Poisson distribution): the copy number has changed
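The test described above can be sketched as a two-sided Poisson tail probability. This is a bare illustration under H0 only: it ignores GC and mapping bias, the expected depth of 100 reads is a hypothetical value, and the simple summation is meant for modest expected counts rather than genome-scale numbers.

```python
from math import exp


def poisson_cdf(k, lam):
    """Pr(X <= k) for X ~ Poisson(lam); intended for modest lam (below ~700)."""
    if k < 0:
        return 0.0
    term = exp(-lam)            # Pr(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i         # Pr(X = i) from Pr(X = i - 1)
        total += term
    return min(total, 1.0)


def depth_p_value(observed, expected):
    """Two-sided p-value for an observed read count under H0 (copy number unchanged)."""
    lower_tail = poisson_cdf(observed, expected)              # Pr(X <= observed)
    upper_tail = 1.0 - poisson_cdf(observed - 1, expected)    # Pr(X >= observed)
    return min(1.0, 2 * min(lower_tail, upper_tail))


expected_depth = 100            # assumption: expected reads in the window under H0
for observed in (100, 110, 150, 50):
    print(observed, depth_p_value(observed, expected_depth))
```

A count of 150 or 50 where 100 is expected gives a very small p-value, so H0 is rejected and a copy number gain or loss is suspected; 110 is within ordinary Poisson scatter.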

90 Practically, we should consider
Bias correction from sequence context (GC-bias, etc.)
Event detection method: does the significant rise (or drop) of read-depth look like an event?
  mean-shift technique (CNVnator, Abyzov et al 2013)
  event-wise testing (Yoon et al, 2009)
  paired-end signal (CNVer, Medvedev et al 2010)

91 CNVnator

92 Structural variation (SV)

93 Beyond the SNVs

95 Beyond the SNVs
TFE3-KHSRP translocation in renal cell carcinoma

96 Structural Variations (SVs)
Genomic rearrangements that affect >50bp of sequence Alkan et al, Nat. Rev. Genetics 12, , 2011

97 List of structural variations

99 Paired-end sequencing

100 Paired end reads for SV finding
[Figure: paired-end reads sequenced from the donor and mapped to the reference, shown for two cases. Source: Bix Seminar, UCSD.]

101 Methods for SV detection
Read depth
  Assume a random distribution in mapping depth
  Significantly higher depth for duplicated regions
  Significantly reduced depth for deleted regions
Read pair
  Assess the span and orientation of paired end reads
Split read
  Define breakpoints of SVs using the split-sequence-read signature (broken alignment)
Assembly
  Assemble and reconstruct the whole genome of the sample DNA
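The read-pair method can be illustrated by labeling one mapped pair from its span and orientation. This is a simplified sketch: it assumes a library in which concordant pairs map forward-then-reverse, a known mean and standard deviation of the insert size, and a 3-standard-deviation cutoff; the function name, labels, and numbers are all hypothetical.

```python
def classify_read_pair(span, forward_then_reverse, mean_insert, sd_insert, n_sd=3.0):
    """Label one mapped read pair by its orientation and its span on the reference.

    span: distance on the reference from the start of the left read to the
          end of the right read
    forward_then_reverse: True if the pair has the expected orientation
    """
    if not forward_then_reverse:
        return "discordant orientation (e.g. inversion candidate)"
    if span > mean_insert + n_sd * sd_insert:
        # The donor fragment was normal-sized, so the reference has extra
        # sequence between the reads: the donor likely carries a deletion.
        return "span too long (deletion candidate)"
    if span < mean_insert - n_sd * sd_insert:
        # The reads map closer than expected: the donor likely carries an insertion.
        return "span too short (insertion candidate)"
    return "concordant"


# Hypothetical library: mean insert 400 bp, standard deviation 30 bp.
for span, orientation_ok in [(410, True), (900, True), (200, True), (400, False)]:
    print(span, orientation_ok, classify_read_pair(span, orientation_ok, 400, 30))
```

A single discordant pair proves little; callers usually require several pairs supporting the same event before reporting it.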

109 Methods for Deletion Detection

115 Problems 1. Judgment of discordance

117 Problem 2. Size of insertion

118 Problem 2. Large indels
Novel sequence insertion

119 Problem 2. Large indels
Existing sequence insertion

120 Problem 3. Nonspecific Mappings

122 Discussion

123 Thank you

