4 What is sequencing?
Finding the sequence of a DNA/RNA molecule.
What can we sequence?
5 Sanger sequencing
- Up to 1,000 bases per molecule
- One molecule at a time
- Widely used from
- First human genome draft was based on Sanger sequencing
- Still in use for single molecules
6 High Throughput Sequencing
Also called Next Generation Sequencing (NGS) or massively parallel sequencing.
- Sequencing millions of molecules in parallel
- No prior knowledge of what you are sequencing is needed

Platform | Read length | No. of reads per run
454 sequencing | Up to 1,000 bp | ~1 M (1 million)
SOLiD | 50-75 bp | ~1 G (1 billion)
HiSeq | bp | ~0.5 G

We will discuss Illumina’s platform only.
7 Sequencing Workflow
1. Extract tissue cells
2. Extract DNA/RNA from cells
3. Sample preparation for sequencing
4. Sequencing
5. Bioinformatics analysis
Why is it important to understand the “wet lab” part?
8 Sample Prep
1. Random shearing of the DNA
2. Adding adaptors and barcodes
3. Size selection
4. Amplification
5. Sequencing
15 DNA sequencing
- Resequencing – sequencing the genome of an organism with a known genome
- Exome sequencing / Targeted sequencing – sequencing only selected regions of the genome
- De-novo sequencing – sequencing the genome of an organism with an unknown genome
16 RNA-Seq
Sequencing of mRNA extracted from the cell to estimate the expression levels of genes.
17 RNA-Seq vs. DNA-sequencing
Counting vs. Reading
18 ChIP-Seq
Sequencing the regions in the genome to which a protein binds.
19 Basic concepts
- Insert – the DNA fragment that is used for sequencing.
- Read – the part of the insert that is sequenced.
- Single Read (SR) – a sequencing procedure by which the insert is sequenced from one end only.
- Paired End (PE) – a sequencing procedure by which the insert is sequenced from both ends.
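The relationship between insert and read can be sketched in a few lines of Python. This is an illustration only: the function names and the example insert are hypothetical, and it assumes the common paired-end convention in which the second read is reported as the reverse complement of the far end of the insert.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def single_read(insert: str, read_length: int) -> str:
    """Single Read (SR): sequence the insert from one end only."""
    return insert[:read_length]


def paired_end(insert: str, read_length: int) -> tuple[str, str]:
    """Paired End (PE): sequence the insert from both ends.

    Assumption: read 2 comes from the opposite strand, so it is the
    reverse complement of the last `read_length` bases of the insert.
    """
    read1 = insert[:read_length]
    read2 = reverse_complement(insert)[:read_length]
    return read1, read2


# Hypothetical 18-base insert, far shorter than a real one.
example_insert = "ACTTCGTCGAAAGGTTCA"
sr = single_read(example_insert, 5)
pe = paired_end(example_insert, 5)
print(sr, pe)
```

In both cases the read covers only part of the insert; with paired-end sequencing the unsequenced middle of the insert lies between the two reads.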
22 Mapping parameters affect the rest of the analysis
Pipeline steps shown: Demultiplexing, Mapping
Inputs to mapping: Sample, Reference Genome
Example of mapping parameters:
- Number of mismatches per read
- Scores for mismatch or gaps
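To see why the number of mismatches allowed per read matters downstream, consider the toy mapper below. It is a deliberately naive sketch (ungapped, brute-force comparison against every position), not how real aligners work, and the function names and the example read are hypothetical.

```python
def count_mismatches(read: str, window: str) -> int:
    """Count positions at which the read differs from a reference window."""
    return sum(1 for a, b in zip(read, window) if a != b)


def map_read(read: str, reference: str, max_mismatches: int) -> list[int]:
    """Return all 0-based reference positions where the read aligns
    (without gaps) with at most `max_mismatches` mismatches."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if count_mismatches(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


reference = "ACTTCGTCGAAAGG"
read = "TCGAAT"  # hypothetical read differing from the reference in one base

strict = map_read(read, reference, max_mismatches=0)
default = map_read(read, reference, max_mismatches=1)
loose = map_read(read, reference, max_mismatches=3)
print(strict, default, loose)
```

With no mismatches allowed the read is lost, with one mismatch it maps uniquely, and with three it maps to two positions and becomes a non-unique mapping. Each outcome changes what the later steps of the pipeline see.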
25 Removing duplicates and non-unique mappings
Pipeline so far:
1. Demultiplexing
2. Mapping
3. Removing duplicates and non-unique mappings
4. Coverage profile and variant calling
[Figure: reads aligned to the reference genome …ACTTCGTCGAAAGG…, with a variant base G.]
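A minimal sketch of this filtering step follows. It assumes each alignment record carries the number of locations the read mapped to (the hypothetical `n_hits` field) and treats reads with the same chromosome, position and strand as duplicates. That is one common definition, not necessarily the one used in this pipeline.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_id: str
    chrom: str
    pos: int
    strand: str
    n_hits: int  # hypothetical field: how many locations the read mapped to


def remove_duplicates_and_non_unique(alignments: list[Alignment]) -> list[Alignment]:
    """Keep uniquely mapped reads, one per (chrom, pos, strand)."""
    seen = set()
    kept = []
    for aln in alignments:
        if aln.n_hits != 1:
            continue  # non-unique mapping: the read fits several locations
        key = (aln.chrom, aln.pos, aln.strand)
        if key in seen:
            continue  # duplicate: another read already occupies this position
        seen.add(key)
        kept.append(aln)
    return kept


alignments = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),  # duplicate of r1
    Alignment("r3", "chr1", 250, "-", 3),  # maps to three locations
    Alignment("r4", "chr2", 100, "+", 1),
]
kept_ids = [aln.read_id for aln in remove_duplicates_and_non_unique(alignments)]
print(kept_ids)
```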
26 Removing duplicates and non-unique mappings
Pipeline so far:
1. Demultiplexing
2. Mapping
3. Removing duplicates and non-unique mappings
4. Coverage profile and variant calling
5. Variant filtering
Example filtering thresholds:
- Coverage >= 5
- Frequency >= 20%
[Figure: reads aligned to the reference genome …ACTTCGTCGAAAGG….]
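The two thresholds on this slide (Coverage >= 5, Frequency >= 20%) can be applied as a simple filter. The sketch below is illustrative: the function and parameter names are hypothetical, and frequency is taken to mean the fraction of reads covering the position that carry the variant.

```python
def passes_variant_filter(
    coverage: int,
    variant_reads: int,
    min_coverage: int = 5,
    min_frequency: float = 0.20,
) -> bool:
    """Return True if a called variant passes both thresholds.

    coverage      -- number of reads covering the position
    variant_reads -- number of those reads that carry the variant
    """
    if coverage == 0 or coverage < min_coverage:
        return False  # too few reads to trust the call
    frequency = variant_reads / coverage
    return frequency >= min_frequency


# Hypothetical (coverage, variant_reads) pairs.
calls = [(10, 2), (4, 4), (10, 1), (30, 15)]
results = [passes_variant_filter(cov, var) for cov, var in calls]
print(results)
```

The second call fails on coverage even though every read carries the variant, and the third fails on frequency despite adequate coverage.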
27 Removing duplicates and non-unique mappings
Pipeline so far:
1. Demultiplexing
2. Mapping
3. Removing duplicates and non-unique mappings
4. Variant calling
5. Variant filtering
6. Genes and known variants
[Figure: two reference genome regions, …ACTTCGTCGAAATG… and …GTCCCGTGATACTCCGT…, with variant bases G and A, annotated with the known variant rs230985 and Gene X.]
29 Example for further analysis
Recessive disease:
- Variant not in known databases
- Homozygous variant shared by all affected individuals
- Same variant appears in healthy parents in a heterozygous state
- Healthy siblings can be heterozygous for the same variant
Dominant disease:
- Variant not in known databases
- Heterozygous variant shared by all affected individuals
- The variant doesn’t appear in healthy individuals
Pipeline:
1. Demultiplexing
2. Mapping
3. Removing duplicates and non-unique mappings
4. Coverage profile and variant calling
5. Variant filtering
6. Genes and known variants (1 table per sample)
7. Finding suspicious variants (1 table per project)
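The recessive and dominant criteria above translate directly into two predicates. The sketch below assumes a hypothetical per-variant record with an `in_known_db` flag and a genotype per sample, encoded as "hom", "het" or "ref". It also adds one inference not stated on the slide: healthy siblings in the recessive case may be heterozygous or carry no variant, but not homozygous.

```python
HOM, HET, REF = "hom", "het", "ref"


def is_recessive_candidate(variant, affected, healthy_parents, healthy_siblings):
    """Novel variant, homozygous in all affected, heterozygous in the parents."""
    if variant["in_known_db"]:
        return False
    gt = variant["genotypes"]
    return (
        all(gt[s] == HOM for s in affected)
        and all(gt[s] == HET for s in healthy_parents)
        # Assumption: a healthy sibling must not be homozygous.
        and all(gt[s] in (HET, REF) for s in healthy_siblings)
    )


def is_dominant_candidate(variant, affected, healthy):
    """Novel variant, heterozygous in all affected, absent from healthy people."""
    if variant["in_known_db"]:
        return False
    gt = variant["genotypes"]
    return all(gt[s] == HET for s in affected) and all(gt[s] == REF for s in healthy)


# Hypothetical family: two affected children, two healthy parents, one healthy sibling.
variant = {
    "in_known_db": False,
    "genotypes": {"child1": HOM, "child2": HOM, "mother": HET, "father": HET, "sib": REF},
}
recessive = is_recessive_candidate(
    variant, ["child1", "child2"], ["mother", "father"], ["sib"]
)
dominant = is_dominant_candidate(
    variant, ["child1", "child2"], ["mother", "father", "sib"]
)
print(recessive, dominant)
```

Running such predicates over the per-sample tables is one way to produce the single per-project table of suspicious variants.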
30 Quality control steps in the pipeline
1. Demultiplexing (followed by QC)
2. Mapping (followed by QC)
3. Removing duplicates and non-unique mappings (followed by QC)
4. Coverage profile and variant calling (followed by QC)
5. Variant filtering (followed by QC)
6. Genes and known variants
7. Finding suspicious variants
31 How is de-novo assembly different from resequencing analysis?
36 Coverage
Coverage – resequencing
The average number of sequenced bases covering each base in the genome.
$$\text{average coverage} = \frac{\text{read length} \cdot \text{number of reads} \cdot \text{fraction of uniquely mapped reads}}{\text{genome size}}$$
“Coverage” – RNA-Seq
- Depends on the expression profile of each sample.
- Highly expressed genes can be detected with less “coverage” (sequencing depth) than lowly expressed genes.
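As a worked example of the resequencing coverage formula, the sketch below plugs in illustrative numbers. All values are hypothetical and not taken from a real run, and the slide's "% uniquely mapped reads" is expressed as a fraction between 0 and 1.

```python
def average_coverage(
    read_length: int,
    number_of_reads: int,
    uniquely_mapped_fraction: float,
    genome_size: int,
) -> float:
    """Average number of sequenced bases covering each base of the genome."""
    return read_length * number_of_reads * uniquely_mapped_fraction / genome_size


# Hypothetical run: 100 bp reads, 600 million reads, 80% uniquely mapped,
# against a genome of 3 billion bases.
coverage = average_coverage(
    read_length=100,
    number_of_reads=600_000_000,
    uniquely_mapped_fraction=0.8,
    genome_size=3_000_000_000,
)
print(f"average coverage: {coverage:.1f}x")
```

Under these assumptions each base is covered about 16 times on average. Halving the number of reads or the uniquely mapped fraction halves the coverage.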
37 What should you ask yourself when planning the experiment, before sequencing?
- Reference genome:
  - What is my reference genome?
  - Does it have updated annotations?
  - What annotations are known?
  - Are my samples closely related to the reference genome?
- Do I expect contamination in my sample?
- Do I have validations from other technologies? (RT-PCR, SNP chip…)
- Do I have controls and replicates?
- RNA-Seq: am I interested in alternative splicing?
- Resequencing: what kind of mutations do I expect to find?