High-Throughput Sequencing


1 High-Throughput Sequencing
Advanced Microarray Analysis BIOS , 2008 Dr. Mark Reimers, VCU

2 Quantitative HTS - Outline
Technology. Preprocessing. Quantitative analysis. Applications: ChIP-Seq, RNA-Seq, Methyl-Seq.

3 The Technology
Most sequencing proceeds by addition of fluor-labeled bases. Do this in parallel on a flat surface. Capture each stage with a good camera. Align images.

4 Roche - 454 Parallel Pyrosequencing on beads

5 Mardis, Trends in Genetics

6 454 Sequencing Operation

7 Illumina - Solexa

8 ABI SOLiD Resequencing each fragment with different primers
Reconstruct each fragment separately

9

10 Paired-End Reads

11 Issues
Pre-processing: base calling, mapping reads, QA. Quantitative analysis: variation and noise, biases, models, accuracy and validation.

12 Pre-processing – Base Calling
Not all steps are completed properly: sequence can lag behind or skip ahead. Hence most light spots are a mixture of different colors. Simple rule: use the brightest signal.
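The "brightest signal" rule can be sketched in a few lines. This is an illustrative sketch only: the channel order `ACGT`, the function names, and the purity score (brightest divided by total) are assumptions made for the example, not the behaviour of any vendor's base caller.

```python
# Hypothetical channel order; a real instrument defines its own.
CHANNELS = "ACGT"

def call_base(intensities):
    """Return (base, purity) for one cluster in one cycle.

    intensities: four non-negative numbers in CHANNELS order.
    purity = brightest / total, a crude indicator of how mixed the spot is.
    """
    total = sum(intensities)
    if total <= 0:
        return "N", 0.0  # no signal: emit an ambiguous base
    best = max(range(4), key=lambda i: intensities[i])
    return CHANNELS[best], intensities[best] / total

def call_read(cycles):
    """Call one base per cycle and join the calls into a read."""
    return "".join(call_base(c)[0] for c in cycles)
```

A low purity flags the mixed-colour spots the slide describes, where lagging or leading strands contaminate the signal.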

13 Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased
Courtesy Thierry-Mieg

14 Typical Errors in Base-Calling

15 Position of single mismatch in uniquely mapped tags
Courtesy Thierry-Mieg

16 Improving Base-Calling with SVM

17 Pre-processing – Mapping Reads
Huge numbers of reads (10M – 70M). BLAT (2002 high-speed method). Eland (proprietary, Illumina). Other new methods: MAQ, SOAP.

18 Quality Assessment
Fraction of reads mapping to targets: typically 5-10M reads per lane, and 60-80% map to targets. Some repetitive sequence.

19 Comparing Samples - A Simple Normalization
Different lanes give different numbers of counts. Divide counts in a region of interest (a genomic region, a gene or an exon) by all counts: total per million reads (TPM). For comparing genomic regions of different lengths, divide also by length of region: TPKM (total per kilobase per million).
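The two normalizations amount to simple arithmetic. A minimal sketch, with function and parameter names chosen for this example:

```python
def per_million(count, total_reads):
    """Counts in a region scaled to a library of one million reads (TPM on this slide)."""
    return count * 1e6 / total_reads

def per_kilobase_per_million(count, total_reads, region_length_bp):
    """Per-million value further divided by region length in kilobases (TPKM on this slide)."""
    return per_million(count, total_reads) * 1000.0 / region_length_bp
```

For instance, 500 reads in a 2 kb gene from a lane of 10M reads give 50 per million and 25 per kilobase per million.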

20 Quant. Analysis - Variation
Poisson model often used for random variation. Most HTS data are ‘over-dispersed’ relative to Poisson. Negative binomial often used instead, with its dispersion parameter fitted.
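Over-dispersion can be checked crudely from replicate counts. The sketch below assumes the common negative binomial parameterization var = mu + phi * mu^2 and estimates phi by the method of moments; real tools fit the dispersion more carefully, and the function name is hypothetical.

```python
from statistics import mean, variance

def dispersion_mom(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2.

    counts: replicate counts for one region (at least two values).
    Returns 0.0 when the data are no more variable than Poisson.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance, n - 1 denominator
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / (m * m))
```

A Poisson model corresponds to phi = 0; values well above zero indicate that replicate variation exceeds counting noise.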

21 Quantitative Analysis - Biases
Not all regions are represented equally. GC-rich regions are represented more. Independent of GC, some chromosome regions are represented more. Euchromatin bias. Sequence initiation site biases. ‘Mapability’ biases: some regions won’t have any uniquely mapped tags.

22 GC Bias Density of reads depends strongly on GC content of regions
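One way to look for this dependence is to bin genomic windows by GC fraction and compare mean read counts per bin. A minimal sketch; the window representation (sequence, read count) and the function names are assumptions for illustration.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def mean_count_by_gc(windows, n_bins=10):
    """windows: iterable of (sequence, read_count) pairs.

    Returns {bin_index: mean read count}, where bin 0 holds the lowest GC.
    """
    sums, sizes = {}, {}
    for seq, count in windows:
        b = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        sums[b] = sums.get(b, 0) + count
        sizes[b] = sizes.get(b, 0) + 1
    return {b: sums[b] / sizes[b] for b in sorted(sums)}
```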

23 Genomic Position Biases
Count tags from randomly sheared DNA in red with GC content in blue

24 Start Position Bias

25 Consistent Start Position Bias
Counts per start site in lane 1 vs lane 2

26 RNA-Seq

27 RNA-Seq Data Gene Model Kidney Reads Liver Reads
From Marioni et al 2008

28 Accuracy of Illumina RNA-Seq

29 Comparing RNA-Seq & Affy
Issues How replicable is RNA-Seq? How consistent are the two technologies? Which is better? Marioni et al, Genome Research, 2008

30 Comparing Fold-Changes
Red: DE by ILM, >250. Green: DE by ILM, <250. Black: not DE by ILM.

31 Model for Variation
Poisson counts; hypergeometric comparison. Make uniform p-values by adding a random term. Use lower tails only.
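One reading of this slide, sketched under stated assumptions: for a gene with counts x1 and x2 in two lanes with totals n1 and n2, condition on x1 + x2 and compare x1 to a hypergeometric distribution. Because counts are discrete, the lower-tail p-value is randomized as P(X < x1) + U * P(X = x1), which is uniform under the null. The function names and the log-gamma implementation are choices made for this example, not taken from the source.

```python
import math
import random

def log_comb(n, k):
    """log of the binomial coefficient, stable for lane-sized totals."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_pmf(x, n1, n2, k):
    """P(X = x) when k reads are drawn from n1 + n2 and X counts those from lane 1."""
    if x < max(0, k - n2) or x > min(k, n1):
        return 0.0
    return math.exp(log_comb(n1, x) + log_comb(n2, k - x) - log_comb(n1 + n2, k))

def randomized_lower_p(x1, x2, n1, n2, rng=random):
    """Lower-tail p-value with a uniform random term to smooth out discreteness."""
    k = x1 + x2
    below = sum(hypergeom_pmf(i, n1, n2, k) for i in range(x1))
    return below + rng.random() * hypergeom_pmf(x1, n1, n2, k)
```

Passing a seeded `random.Random` instance as `rng` makes the randomized p-values reproducible.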

32 False Positive Rates QQ-plots of p-values between tech. reps

33 Different Concentrations are NOT Comparable!
QQ-plots of p-values between 3pM and 1.5 pM

34 Normalization of RNA-Seq
Robinson et al noticed that most genes appeared less expressed in liver. Fig 1 from Robinson & Oshlack, Genome Biology 2010

35 A Better Normalization for RNA-Seq - TMM
Drop extremes of ratios. Drop very high count genes. Compute trimmed means of samples. Center log-ratios between samples.
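The steps on this slide can be sketched as a simplified, unweighted trimmed mean of log-ratios. This is not the full published TMM procedure, which also weights genes by precision; the trim fractions (30% on log-ratios, 5% on average log-abundance at each end) and the function name are assumptions for the example.

```python
import math

def tmm_factor(sample, ref, m_trim=0.3, a_trim=0.05):
    """Simplified TMM-style scaling factor of `sample` relative to `ref`.

    sample, ref: per-gene counts in the same gene order.
    Genes with a zero count in either sample are ignored.
    """
    n_s, n_r = sum(sample), sum(ref)
    stats = []  # (M, A) = (log-ratio, mean log-abundance) per usable gene
    for s, r in zip(sample, ref):
        if s > 0 and r > 0:
            ps, pr = s / n_s, r / n_r
            stats.append((math.log2(ps / pr), 0.5 * math.log2(ps * pr)))
    n = len(stats)
    if n == 0:
        return 1.0
    by_m = sorted(range(n), key=lambda i: stats[i][0])
    by_a = sorted(range(n), key=lambda i: stats[i][1])
    cut_m, cut_a = int(n * m_trim), int(n * a_trim)
    # Keep genes that survive both the ratio trim and the abundance trim.
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        return 1.0
    mean_m = sum(stats[i][0] for i in keep) / len(keep)
    return 2.0 ** mean_m
```

When a few very highly expressed genes dominate one sample, the trimmed mean follows the bulk of the genes rather than the outliers, which is the effect the slide attributes to TMM.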

36 New Things to do with RNA-Seq
Allele-specific expression. Splice variation: between tissues, in disease. Alternate initiation sites: select 5’ capped RNA fragments. Alternate termination.

37 Allelic Comparison
It is possible to compare allele-specific expression counts. Sample from VCU; replicate samples. P-values for binomial tests of equality: about half show differential expression!
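A binomial test of equality for two allele counts can be written with the standard library alone. A minimal sketch, assuming a null of equal expression (p = 0.5) and a two-sided test; the function name is hypothetical.

```python
from math import comb

def allelic_binomial_p(a, b):
    """Two-sided exact binomial p-value for allele counts a and b under p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value
    is twice the smaller tail, capped at 1.
    """
    n = a + b
    if n == 0:
        return 1.0
    k = min(a, b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```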

38 Detecting Splice Variation
Deep sequencing shows up clear variation in exon usage Wang et al Nature 2008

39 Tissue Map of Splice Variation
From Wang et al. Brain is most distinctive. Individuals seem to differ. Cell lines seem to have distinct splice patterns.

40 Splicing is Complex
Many different splice operations exist. Only some of these can be characterized by counting exon reads.

41 Issues in Detecting Splice Variants
Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance. Reads that bridge splice junctions would be definitive, but mapping is very dubious with short (<40 base) reads. Not all possible splice junctions are known. Hard even to search through the known ones.

42 Methodology for Splice Variants
Count reads mapped to exons and compare ratios across samples (Wang et al, and most others). Count reads that cross splice junctions.

43 Methodology for Finding Junctions

44 ChIP-Seq

45 Chromatin Immuno-precipitation

46 ChIP-Seq Workflow
Cross-link proteins to DNA. Fragment DNA. Extract with antibody. Reverse cross-links. Sequence fragments. DO CONTROLS!

47 ChIP-Seq Data From Rozowsky et al, Nature Biotech 2009

48 ChIP-Seq vs ChIP-chip

49 Peak-Finding - Simple
Extend tags and count overlap. How much to extend?
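The simple approach can be sketched as follows: extend each tag in its strand direction by a fixed length, accumulate coverage, and report runs above a threshold. The tag representation (start position, strand), the extension length, and the threshold are all assumptions for illustration; choosing the extension is exactly the open question on this slide.

```python
def coverage(tags, extend, length):
    """Per-base coverage on a region of `length` bases.

    tags: iterable of (start, strand); '+' tags extend rightwards,
    '-' tags extend leftwards, each covering `extend` bases.
    """
    diff = [0] * (length + 1)  # difference array: +1 at start, -1 past end
    for pos, strand in tags:
        if strand == "+":
            lo, hi = pos, pos + extend
        else:
            lo, hi = pos - extend + 1, pos + 1
        lo, hi = max(lo, 0), min(hi, length)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    cov, running = [], 0
    for d in diff[:length]:
        running += d
        cov.append(running)
    return cov

def call_peaks(cov, threshold):
    """Half-open (start, end) intervals where coverage >= threshold."""
    peaks, start = [], None
    for i, c in enumerate(cov):
        if c >= threshold and start is None:
            start = i
        elif c < threshold and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(cov)))
    return peaks
```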

50 Peak Finding – Better
Tags starting on opposite strands are likely to start at opposite ends of a fragment. Identifying the cross-over point leads to improved accuracy.
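A crude way to locate the cross-over point within one candidate peak is to take the midpoint between the typical plus-strand start and the typical minus-strand start. The use of medians and the function name are assumptions for this sketch; published peak callers use more elaborate strand models.

```python
from statistics import median

def crossover_point(plus_starts, minus_starts):
    """Estimate the binding position and apparent fragment length in one peak.

    plus_starts / minus_starts: non-empty lists of tag start positions
    on each strand. Returns (estimated_site, strand_separation).
    """
    p = median(plus_starts)
    m = median(minus_starts)
    return (p + m) / 2, m - p
```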

51 The Value of Controls: ChIP vs. Control Reads
Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation

52 Cause of Variation in Read Density
In a study of FoxA1 binding, even control reads were enriched near the FoxA1 binding site! Probably due to open chromatin near the FoxA1 binding site. Figure: density of control channel reads around the FoxA1 site. Courtesy Shirley Liu

53 ChIP-Seq – MACS Key Ideas
Smart peak imputation estimate: uses read directions; empirical estimate of fragment length. Local frequency estimate: using control, if available; using wide estimate, otherwise. Not using sequence.
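The local frequency idea can be illustrated as a Poisson test against the largest of several background rates. This sketch does not reproduce the MACS code: the function names are hypothetical, the rates are assumed to be already scaled to the test window, and the direct summation loses precision for extremely small p-values.

```python
import math

def local_lambda(genome_rate, local_rates):
    """Conservative expected count: the largest of the genome-wide rate
    and any locally estimated rates (from control or wide ChIP windows)."""
    return max([genome_rate, *local_rates])

def poisson_upper_p(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the lower tail."""
    if k <= 0:
        return 1.0
    cdf, term = 0.0, math.exp(-lam)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)
```

A window with 12 tags would then be scored as `poisson_upper_p(12, local_lambda(genome_rate, local_rates))`, so a locally elevated background raises the bar for calling a peak.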

54

55 Read Lengths and Directions
Some clear clusters, even before stats. Reads on opposite sides of a peak map to opposite strands. Hence fragments have opposite directions. Can estimate apparent fragment length.
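One generic way to estimate the apparent fragment length is to scan for the shift that best aligns plus-strand starts with minus-strand starts. This is a strand cross-correlation sketch, not the MACS procedure; the function name and the exact-match scoring are assumptions for illustration.

```python
from collections import Counter

def best_strand_shift(plus_starts, minus_starts, max_shift):
    """Shift d in [0, max_shift] maximizing the number of plus-strand starts p
    for which a minus-strand start exists at p + d. Ties go to the smallest d."""
    minus = Counter(minus_starts)
    best, best_score = 0, -1
    for d in range(max_shift + 1):
        # Counter returns 0 for positions with no minus-strand tag.
        score = sum(minus[p + d] for p in plus_starts)
        if score > best_score:
            best, best_score = d, score
    return best
```

With real data the score would usually be computed on binned or smoothed counts, since exact position matches are sparse.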

56 Fragment Lengths
Puzzle: fragments from sonication are expected to be between 200 and 500 bp, but estimated fragment size is ~100 bp. Shirley Liu’s explanation: preferential cutting near to TF??

57 Comparison to ChIP-chip
Broad correlation. Not a dramatic improvement in precision!

58 Methyl-Seq

59 Methylation Assays
Affinity purification, e.g. MeDIP-Seq (methylated DNA immunoprecipitation). Methylation-specific cleavage by endonucleases, e.g. Methyl-Seq: cleaves with HpaII to identify unmethylated CCGG sites. Bisulphite conversion: WGBS (Whole-Genome Bisulphite Sequencing); RRBS (Reduced Representation Bisulphite Sequencing), which cleaves with MspI to reduce complexity.

60 Affinity: MeDIP-Seq & MBD-Seq

61 Issues with Affinity Methods
Analysis essentially like ChIP-Seq, BUT: sequence count reflects both density of CpGs and proportions of methylation; no individual CpG-level information. Advantage: no conversion, so sequence tags are easily mappable.

62 Methyl-Seq
Use HpaII to cleave only at unmethylated CCGG sites. Size-select fragments (50-300). Sequence fragment ends, always starting at a CCGG. Easy to map: few possible loci (<1M). Paired ends give actual fragment.

63 Schematic Here

64 Issues for Methyl-Seq
Computational problem to re-assemble actual proportions of methylation at each locus from counts. Prone to false positives because of incomplete digestion (for reasons other than methylation of CCGG site), e.g. insufficient time … rates vary by 50-fold depending on sequence context.

65 WGBS
Bisulphite conversion, fragmentation and shotgun sequencing. Requires very many reads! Use of capture arrays reduces work… BUT different sequences have different capture efficiencies!

66 WGBS Data (from capture array)
Top: CHP-SKN-1; bottom: MDA-MB-231. NB: inconsistent tag numbers.

67 Issues with WGBS
Lose many C’s. Hard to map to genome: strategy depends on less penalty for mapping T to C. Too many loci!
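The reduced penalty can be illustrated with an asymmetric mismatch count in which a T in the read over a C in the reference is free. This sketch assumes ungapped, forward-strand alignment of equal-length strings and uses a zero penalty for the T-over-C case; the function name is hypothetical and real bisulphite aligners handle both strands.

```python
def bisulfite_mismatches(read, ref):
    """Count mismatches, treating a read T aligned to a reference C as a match
    (an unmethylated C converted by bisulphite). The reverse, a read C over a
    reference T, still counts as a mismatch."""
    if len(read) != len(ref):
        raise ValueError("read and reference segment must have equal length")
    return sum(
        1
        for r, g in zip(read.upper(), ref.upper())
        if r != g and not (r == "T" and g == "C")
    )
```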

68 RRBS
Too many methylation sites in genome. Cleave with MspI and size-select in order to reduce number of fragments. Convert C to T with bisulphite (not mC). Then sequence fragments: 1.4 M fragments.
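The cleave-and-size-select step can be mimicked in silico. MspI recognizes CCGG and cuts after the first C; the size window is left as a required argument because the slide gives no range for RRBS, and the function name is hypothetical.

```python
def mspi_fragments(seq, min_len, max_len):
    """Half-open (start, end) coordinates of MspI fragments within a size window.

    Only fragments bounded by two CCGG sites are reported; the pieces
    running off either end of `seq` are ignored.
    """
    seq = seq.upper()
    cuts = []
    i = seq.find("CCGG")
    while i != -1:
        cuts.append(i + 1)  # cut between the first C and CGG
        i = seq.find("CCGG", i + 1)
    return [
        (a, b)
        for a, b in zip(cuts, cuts[1:])
        if min_len <= b - a <= max_len
    ]
```

Counting the fragments returned for a whole genome gives a rough idea of how strongly the size selection reduces complexity.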

69 Issues with RRBS
Fairly broad but not complete coverage of ‘interesting’ regions of genome. Bisulphite conversion of limited regions means mapping is fairly easy. Bisulphite conversion not always complete.

70 Meta-Genomics

71 What is Meta-genomics?
Sequencing random fragments of DNA from all microbial denizens of a community (and traces of a few others). Sometimes broadly used for surveys of microbial diversity based on sequencing all 16S rRNA genes present.

72 Kinds of Questions
What is out there? Most microbial species are not known. What metabolic fluxes in any environment? What microbes are associated with specific conditions, including disease or health? Human Microbiome Project.

73 Environmental Meta-Genomics

74 Human Microbiome Project

75 Data Analysis Issues – 16S rRNA
Identification of microbes: most are unknown and un-culturable. Distinguishing errors in sequencing from novel microbes. Biases in sequencing.

76 Data Analysis Issues - Metagenomics
Mapping and characterizing unknown protein sequences: usually assume conservation. Full coverage allows assembly of genomes. Counting: biases probably smaller (Bork).

