Lecture 2. Topics in High-Throughput Sequencing (Basics)

1 Lecture 2. Topics in High-Throughput Sequencing (Basics)
The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

2 Lecture outline
Sequencing and high-throughput sequencing
Standard data processing
  Data preprocessing
  Sequence alignment
  Sequence assembly
Applications
Last update: 6-Sep-2015 CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

3 Part 1 Sequencing and High-Throughput Sequencing

4 DNA sequencing
The process of determining the order of nucleotides in a DNA sequence
Input: a sample containing some biological DNA sequences
Output: a mathematical representation of the DNA sequences in the form of strings of ACGTs
[Figure: a DNA sample goes through sequencing and comes out as a string such as ACCGCGCTCTAGCAC...]

5 DNA sequencing
Remarks:
RNA sequencing can be done by converting RNA into cDNA; protein sequencing is something different.
Actual output may cover both strands.
The sample usually contains DNA from multiple cells.
  If the DNA sequences in different cells are different (how?), the output will be their combinations.
  Single cell sequencing is now possible [project]
If the amount of DNA is small, one may need to make more copies by an experimental procedure called “amplification”, which could introduce biases.
  Especially problematic when the quantity is important

6 Sequencing experiments: Basic ideas
Use one strand as template, grow the other strand
Different ways to detect which nucleotide is added. For example,
  Give a different color for each type of nucleotide added
  Supply only one type of nucleotides at a time, and see if any signals (e.g., light) can be detected
  Stop whenever a certain nucleotide is added. Then deduce the nucleotide by DNA lengths (Sanger sequencing)
Can only handle up to a certain length of DNA
  Need to break down a DNA sequence into small fragments if it is too long
[Figure: template GCGAACGCT; synthesized strand CGCT + TGCGA, where CGCT is the primer]

7 Sanger sequencing
Low-throughput, but accurate, and can handle up to 1000bp
Still standard for small-scale laboratory use
Components:
  DNA to be sequenced
  Primer
  Free nucleotides that allow further extension (dNTP, circles): N=A, C, G or T, all four types are present
  Free nucleotides that terminate extension (ddNTP, rhombuses): N=A, C, G or T, only one type is present
  DNA polymerase
See these videos for animations:
Image credit: the-scientist.com

8 Next-generation sequencing (NGS)
“Next-generation” refers to the generation of high-throughput methods, as compared to the first generation of low-throughput, Sanger-like methods
Also called “second generation sequencing”, “deep sequencing” or “massively parallel sequencing”
Motivated by the large size of the human genome (i.e., the set of all chromosomes, about 3 billion base pairs in a haploid copy) and the high sequencing cost
  Driven by the Human Genome Project
Key idea: parallelization

9 Some NGS methods
Naming convention:
  Roche: Company
  454: Sequencing method
  GS FLX Titanium: Machine type
See these videos for details:
  Pyrosequencing:
  Solexa:
  SOLiD:
Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

10 More updated comparisons
Image credit: Liu et al., Journal of Biomedicine and Biotechnology 2012:251364, (2012)

11 More updated comparisons
Image credit: Lee et al., Translational Cancer Research 2:1 (2013)

12 Some NGS methods
Platform for library/template preparation
  Droplet
  Solid-phase
  (single molecule, no amplification) Immobilization
    Primer
    Template
    Polymerase
Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

13 Some NGS methods
Chemistry for identification of nucleotide:
  Reversible dye-terminators: terminating base → fluorescence → removal of terminating group
  Pyrosequencing: each nucleotide incorporation releases pyrophosphate, which fuels a reaction to give out visible light signals
Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010)

14 Sequencing long DNA
Cut down a long DNA sequence into shorter ones, by one of
  Restriction enzymes that recognize specific sequences
  Mechanical shearing
  Acoustic waves
Sequence one or both ends of each fragment
Determine the original DNA from the sequenced fragments

15 Illustration
[Figure: multiple copies of an unknown biological DNA sequence (TACCAGCGGACCGCTGAC) are broken down into fragments and sequenced, giving text sequences of fragments: TACCAG, GGACCG, TACCAGC, CGGAC, CGCTGAC, CTGAC, CGCT, GAC]
Possible to deduce original genome?

16 Shotgun sequencing
Difficult to keep track of the order of fragments
Shotgun: random fragmentation
See
Whole genome shotgun
Hierarchical approach: slightly easier to get back the original sequence
Image credit: Jennifier et al., Biological Procedures Online 11(1):52-78, (2009)

17 Major milestones in sequencing
Genome | Type | Size | Completed year | Time needed | Cost (USD)
Bacteriophage MS2 | Virus (RNA) | 3,569nt | 1976 | ? |
Bacteriophage ΦX174 | Virus (DNA) | 5,386bp | 1977 | |
Haemophilus influenzae | Bacteria | 1.8Mb | 1995 | |
Saccharomyces cerevisiae | Fungus (yeast) | 12.1Mb | 1996 | |
Caenorhabditis elegans | Nematode (worm) | 100Mb | 1998 | |
Arabidopsis thaliana | Plant | 157Mb | 2000 | |
Homo sapiens | Mammal (human) | 3.2Gb | 2003 | 15 years | 3B
Craig Venter | | 2.8Gb | 2007 | 5 years | 100M
James Watson | | 6Gb (diploid) | 2008 | 4 months | 1.5M
YanHuang 1 (Chinese) | | ~3Gb | | 2 months | 0.5M
Neanderthal | Mammal | | 2010 | 4 years | 6.4M
Anyone | | | 2011 | 1 week | 10K
 | | | 2014 | Few days | 1K

18 Cost and number of human genomes sequenced
Estimates only
Image source:

19 New bottleneck: bioinformatics
Image credit: Sboner et al., Genome Biology 12(8):125, (2011)

20 Third generation sequencing [project]
Characteristics:
  Longer reads
  Higher error rate (currently)
  Higher cost (currently)
  Single-cell sequencing
Example: Pacific Biosciences’ Single Molecule Real-Time (SMRT) sequencing
  Several hundred base pairs or more
  >10 times higher error rate than NGS

21 Third generation sequencing technologies
How third-generation DNA-sequencing technologies work. Third-generation DNA-sequencing technologies are distinguished by direct inspection of single molecules with methods that do not require wash steps during DNA synthesis.
(A) Pacific Biosciences technology for direct observation of DNA synthesis on single DNA molecules in real time. A DNA polymerase is confined in a zero-mode waveguide and base additions are measured with fluorescence detection of gamma-labeled phosphonucleotides.
(B) Several companies seek to sequence DNA by direct inspection using electron microscopy similar to the Reveo technology pictured here, in which an ssDNA molecule is first stretched and then examined by STM.
(C) Oxford Nanopore technology for measuring translocation of nucleotides cleaved from a DNA molecule across a pore, driven by the force of differential ion concentrations across the membrane.
(D) IBM's DNA transistor technology reads individual bases of ssDNA molecules as they pass through a narrow aperture based on the unique electronic signature of each individual nucleotide. Gold bands represent metal and gray bands dielectric layers of the transistor.
Image credit: Schadt et al., Human Molecular Genetics 19(R2), (2010)

22 Part 2 Standard Data Processing

23 From images to formatted data
Image credit: Geospiza

24 “Raw data” for processing
Sequencing reads:
  Single-end sequencing: one sequencing read (i.e., a short string) per fragment
  Paired-end sequencing: two sequencing reads per fragment
Quality score:
  How reliable each sequenced base is
  While sequencing is quite reliable, errors do occur
[Figure labels: fragment length, read length, mate pair, insert size]
Note: Some define insert size as the same as fragment length

25 Quality score
High-throughput sequencing has a relatively high error rate
Based on the sequencing signal, the sequencing machine can estimate an error probability p for each base call
A corresponding quality score can be defined. One commonly used quality score is the Phred quality q:
  q = -10 log_10(p)
  q can take values from 0 (p = 1) to infinity (p = 0)
  Higher q → better base quality
  Practically, a Phred score of 30 (p = 0.001) or more indicates good quality
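As a rough illustration of the Phred formula above, the following Python sketch converts between error probability and quality score. The function names are made up for this example and do not come from any sequencing toolkit.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Phred quality q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


# Each step of 10 in q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q))
```

Note that p = 0 is rejected here because its score would be infinite.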

26 The FASTQ file format
See for a list of commonly used file formats in genomics
FASTQ: read sequences and quality scores
  This is the “raw data” bioinformaticians deal with. We seldom need to work on the raw images directly
Another famous file format for genomics is the FASTA format, mainly for sequences. FASTQ is like FASTA + quality

27 FASTQ
Each sequence occupies four lines:
  @SEQ_ID
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1: @, followed by sequence ID and descriptions
Line 2: the sequence
Line 3: +, optionally followed by sequence ID and descriptions
Line 4: quality scores (mapped to ASCII characters)
  Standard: score = (ASCII number of character) - 33
  E.g., ‘5’ has an ASCII number of 53, which means the last base has a quality score of 53 - 33 = 20, i.e., the error probability is 10^(-20/10) = 0.01
  Illumina has a different standard (with different versions)
Example source: Wikipedia
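A minimal sketch of reading one FASTQ record and decoding its quality line, assuming the standard offset of 33 described above. `parse_fastq_record` is a hypothetical helper written for this note, and it assumes each record occupies exactly four lines.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (id, sequence, quality scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Standard encoding: score = ASCII code of the character minus 33.
    scores = [ord(c) - 33 for c in qual]
    return header[1:], seq, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, scores = parse_fastq_record(record)
print(name, scores[0], scores[-1])
```

The first quality character `!` decodes to 0 and the last character `5` decodes to 20, matching the worked example on the slide.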

28 Data preprocessing
Read level:
  Removing adapter sequences
  Removing poly-A tails (for RNA data)
  Filtering reads of low quality
  Trimming reads with low quality bases
  ...
Global level:
  Checking fraction of reads that pass quality thresholds
  Comparing distribution of read lengths with expectation
  Checking distribution of nucleotides
  Removing duplicate reads
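Two of the read-level steps above (trimming low-quality bases and filtering low-quality reads) can be sketched as follows. This is a deliberately simple illustration, not the algorithm of any particular trimming tool; the function names and the default thresholds of 20 are assumptions chosen for the example.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Drop bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_mean_quality(quals, min_mean=20.0):
    """Keep a read only if it is non-empty and its mean quality is high enough."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean


seq, quals = trim_low_quality_tail("ACGTAC", [30, 30, 25, 20, 10, 5])
print(seq, quals, passes_mean_quality(quals))
```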

29 Quality checking reports
Base quality
  Usually lower at first and last bases even for good cases
[Figure panels: a good case; a bad case]
Image credit: FastQC

30 Quality checking reports
Duplication (reads with exactly the same sequences)
  Suggestive of amplification bias or insufficient starting materials
[Figure panels: a good case; a bad case]
Image credit: FastQC

31 Meaning of “sequenced genome”
How to get the actual order of nucleotides of the whole DNA sequence from short sequencing reads?
Two main approaches:
  Sequence assembly: assemble the original sequences using the short reads
    Also called “de novo assembly” [project]
  Sequence alignment/mapping: using a reference sequence, find out the position of each read and identify differences between the current DNA sequence and the reference
    This kind of study is called “re-sequencing” [project]
Only a relatively small number of human genomes have been assembled. Most sequenced human genomes were only mapped to a reference

32 Sequence assembly
Main idea: If two reads overlap substantially, there is a high chance that they come from adjacent positions in the original sequence
Example:
  Original sequence: ACCGGGTCTACGTTCCAT
  Read 1: ACCGGGT
  Read 2: CGGGTCT
  Alignment:
    ACCGGGT__
    __CGGGTCT
  Partial assembly: ACCGGGTCT
Now, suppose Read 3 is GTTCCAT
  It can be aligned with Read 1 as follows:
    ACCGGGT_____
    _____GTTCCAT
  Which results in a wrong partial assembly: ACCGGGTTCCAT
In general, longer overlaps are more likely to be correct, because a longer sequence is less likely to occur multiple times in the genome
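The overlap example above can be reproduced with a short Python sketch. It uses a naive suffix/prefix comparison, which is fine for a handful of reads but far too slow for real data; `overlap_length` and `merge` are illustrative names, and the `min_overlap` threshold is an assumption added to show how short overlaps can be rejected.

```python
def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_overlap=1):
    """Merge two reads on their longest overlap, or return None if too short."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


print(merge("ACCGGGT", "CGGGTCT"))                 # overlap of 5 bases
print(merge("ACCGGGT", "GTTCCAT"))                 # overlap of only 2 bases
print(merge("ACCGGGT", "GTTCCAT", min_overlap=3))  # rejected
```

Requiring at least 3 overlapping bases is enough to reject the wrong merge of Read 1 and Read 3 in this toy case.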

33 Sequence assembly
Main challenges:
  Find reads with substantial overlaps efficiently
  Determine the size of overlap necessary to eliminate false hits
  Handling repeats (consider the following example):
    Original sequence: ACCTCCTCCTCCTG
    Suppose we get all length-4 reads: ACCT, CCTC, CTCC, TCCT, CCTG
    They can also be produced from this sequence: ACCTCCTG
    Need to check read count (number of copies for each type of reads) to determine number of CCT occurrences
    Demonstrating why both read length and read count matter
  Handling sequencing errors
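The repeat example can be checked directly: both sequences give the same set of length-4 reads, and only the read counts tell them apart. The sketch below assumes error-free reads covering every position exactly once; `kmer_counts` is a name chosen for this illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (read) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


long_counts = kmer_counts("ACCTCCTCCTCCTG", 4)
short_counts = kmer_counts("ACCTCCTG", 4)

print(set(long_counts) == set(short_counts))     # same set of distinct reads
print(long_counts["CCTC"], short_counts["CCTC"])  # but different copy numbers
```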

34 Sequence assembly
In order to have overlaps between reads, a base needs to be covered by more than one read
Suppose
  The whole DNA has length N
  Each read has length n
  There are m reads
What is the probability that a base is not covered, if the reads are independently and uniformly sampled?
  Ignoring boundary effects (i.e., some reads are at the ends of the DNA),
    m=1: (N-n) / N
    In general: [(N-n) / N]^m
Can use similar calculations to estimate the number of reads needed to provide good coverage of all bases
The average number of times each base is covered is called the “read depth”: 30x, 60x, etc.
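The coverage calculation above can be turned into a small calculator. This sketch uses the same simplifying assumptions as the slide (independent, uniformly placed reads, no boundary effects); the function names and the example numbers are illustrative only.

```python
import math


def prob_base_uncovered(N, n, m):
    """Probability that a given base is missed by all m reads: ((N - n) / N) ** m."""
    return ((N - n) / N) ** m


def reads_needed(N, n, max_uncovered):
    """Smallest m such that the uncovered probability is at most max_uncovered."""
    return math.ceil(math.log(max_uncovered) / math.log((N - n) / N))


def read_depth(N, n, m):
    """Average number of reads covering each base."""
    return n * m / N


# Example: a 3 Gb genome, 100 bp reads, 30x read depth.
N, n = 3_000_000_000, 100
m = 30 * N // n
print(m, read_depth(N, n, m), prob_base_uncovered(N, n, m))
```

When n is much smaller than N, the uncovered probability is close to e^(-depth), so a 30x read depth leaves a vanishingly small fraction of bases uncovered under this idealized model.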

35 One formulation – de Bruijn graphs
Treat each read (suppose all are of length k) as a directed edge (i.e., an arrow) from the node for its first k-1 bases to the node for its last k-1 bases
Two reads are thus consecutive when the last k-1 bases of the former are exactly the same as the first k-1 bases of the latter
In the ideal case, the goal is to start from a node and traverse all edges exactly once
Image credit: Compeau et al., Nature Biotechnology 29(11), (2011)
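A small sketch of this formulation, assuming the usual de Bruijn construction in which each length-k read is an edge joining its (k-1)-base prefix to its (k-1)-base suffix, and assuming error-free reads that cover every position once. The traversal uses Hierholzer's algorithm, a standard way to find a path that uses every edge exactly once; it is not necessarily what any real assembler does, and all function names are illustrative.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a path using every edge once exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


sequence, k = "ATGGCGTGCA", 3
kmers = kmers_of(sequence, k)
assembled = spell(eulerian_path(de_bruijn_edges(kmers)))
print(assembled)
```

The result always uses exactly the same k-mers as the input, but when the graph has more than one valid traversal (as it does for this toy sequence because of repeated 2-mers) the spelled sequence need not equal the original, which is one reason repeats make assembly hard.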

36 Quality of assembly
Key terms:
  Contig: A partially assembled sequence from some reads
  Scaffold: An arrangement of the contigs with specified order and orientations
  Usually the final output does not contain a single sequence, but just some scaffolds
Some descriptive statistics of assembly outputs:
  Length of longest contig
  Average length of contigs
  Total length of contigs
  N50: Length of the contig such that contigs of this length or longer amount to 50% or more of the total length of all contigs
    If the lengths are (in an arbitrary unit) 10, 8, 6, 5, 3, 3, 2, 1, 1, 1, then the N50 value would be 6, since (10+8+6) = 24, which is at least 50% of the sum (10+8+6+5+3+3+2+1+1+1) = 40, while (10+8) = 18 is not
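The N50 computation can be sketched in a few lines, assuming the common definition in which contigs are taken from longest to shortest until at least half of the total length is reached. The function name `n50` is chosen for this example.

```python
def n50(lengths):
    """Length of the contig at which the running total, from longest to
    shortest, first reaches at least half of the total contig length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


print(n50([10, 8, 6, 5, 3, 3, 2, 1, 1, 1]))
```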

37 Sequence alignment/mapping
Sequence assembly is difficult for long sequences
If the sequenced DNA is similar to an assembled one (the reference), it would be much easier to simply find out the location of each read in the reference, and identify the differences
Major challenges:
  Find out the locations of many reads in the reference efficiently
  Handle mismatches
  Distinguish between genetic variants and sequencing errors
  Handle insertions, deletions, duplications, and other types of genomic variations
Most current algorithms use hash tables, suffix trees or the Burrows-Wheeler Transform (BWT)
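To make the hash-table idea concrete, here is a toy seed-and-extend mapper: it indexes every k-mer of the reference in a dictionary, looks up the first k bases of a read, and then counts mismatches over the whole read. This is a teaching sketch under strong assumptions (forward strand only, no insertions or deletions, no mismatches inside the seed), not how any production aligner is implemented; all names and the default `max_mismatches=1` are made up for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table from each k-mer to the 0-based positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each candidate location of the read."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACCGGGTCTACGTTCCAT"
index = build_kmer_index(reference, k=4)
print(map_read("GTCTACG", reference, index, k=4))  # exact match
print(map_read("GTCTAGG", reference, index, k=4))  # one mismatch
```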

38 The SAM and BAM formats
SAM: a text-based file format for storing sequence alignments
BAM: a compressed binary version of SAM

39 SAM
Conceptual alignment:
  Coor     12345678901234  5678901234567890123456789012345
  ref      AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
  +r001/1        TTAGATAAAGGATA*CTG
  +r002         aaaAGATAA*GGATA
  +r003       gcctaAGCTAA
  +r004                     ATAGCT..............TCAGC
  -r003                            ttagctTAGGC
  -r001/2                                        CAGCGCCAT
SAM format:
  @HD VN:1.3 SO:coordinate
  @SQ SN:ref LN:45
  r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
  r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
  r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
  r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
  r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0
  r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
CIGAR string: M=alignment match (sequence match or mismatch); I=insertion; D=deletion; S=soft clipping, etc.
Example source:
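CIGAR strings are easiest to understand by parsing one. The sketch below assumes the operation letters defined in the SAM specification (M, I, D, N, S, H, P, =, X) and uses a CIGAR of the form `8M2I4M1D3M`, as in the specification's example alignment; the helper names are invented for this note.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of read bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("8M2I4M1D3M"))
print(reference_span("8M2I4M1D3M"), query_length("8M2I4M1D3M"))
```

For `8M2I4M1D3M` the read contributes 17 bases (the length of TTAGATAAAGGATACTG) while spanning 16 reference positions, because the 2-base insertion exists only in the read and the 1-base deletion only in the reference.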

40 Part 3 Applications

41 Applications
“DNA sequencing” can be used to study many things, not limited to DNA
  Genetic variants [lecture], haplotypes [project]
  Gene expression (RNA-seq, CAGE, ...), isoforms [project]
  Protein-DNA binding (ChIP-seq, ChIP-exo, ...)
  Protein-RNA binding (CLIP-Seq, HITS-CLIP, PAR-CLIP, RIP-seq, ...) [project]
  DNA methylation (bisulfite sequencing, RRBS, MeDIP-seq, MBDCap-seq, ...) [project]
  Histone modifications (ChIP-seq)
  Open chromatin (DNase-seq, FAIRE-seq, ...)
  DNA long-range interactions (ChIA-PET, Hi-C, TCC, ...) [project]
  ...
Common idea: Sequence only the parts of DNA of interest

42 RNA-seq
Can selectively sequence a subset of RNA based on:
  Presence or absence of poly-A tail
  Length
  Residing cell compartment
  Form (linear or circular) [project]
Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)

43 ChIP-seq
Chromatin immunoprecipitation followed by sequencing
Use antibody to “pull down” target DNA, such as DNA bound by a certain protein or with a certain chemical modification
Image credit: Mardis, Nature Methods 4, (2007)

44 HITS-CLIP
High-throughput sequencing together with UV-crosslinking and immunoprecipitation
For studying protein-RNA interactions
Image credit: Zhang and Darnell, Nature Biotechnology 29(7), (2011)

45 Bisulfite sequencing
To find out cytosines methylated at the carbon-5 position
  Usually occurring at CpG, CpHpG and CpHpH nucleotide patterns
Bisulfite sequencing: Use bisulfite treatment to turn unmethylated cytosines (C) into uracils (U), which are sequenced as thymines (T)
Determining methylated locations: Mapping sequencing reads to both original and C→T transformed references
Image source: Wikipedia
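The C→T idea can be illustrated on a toy example. The sketch below follows one common strategy, which is an assumption here rather than something the slide spells out: convert both the read and the reference in silico so that they can be matched regardless of methylation, then compare the read with the original reference at the matched position. It only handles reads from the forward strand with no mismatches or gaps, and the function names are hypothetical.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion: every C becomes T."""
    return seq.replace("C", "T")


def call_methylation(reference, read, pos):
    """Classify each reference C covered by a read aligned at 0-based `pos`."""
    calls = {}
    for offset, base in enumerate(read):
        ref_pos = pos + offset
        if reference[ref_pos] != "C":
            continue
        if base == "C":
            calls[ref_pos] = "methylated"     # C survived bisulfite treatment
        elif base == "T":
            calls[ref_pos] = "unmethylated"   # C was converted and read as T
    return calls


reference = "ACGTCCGA"
read = "ACGTTCGA"  # the C at position 4 was converted, the others were not
pos = c_to_t(reference).find(c_to_t(read))
calls = call_methylation(reference, read, pos)
print(pos, calls)
```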

46 ChIA-PET
Chromatin interaction analysis with paired-end tag sequencing
For studying DNA long-range interactions (involving something recognized by an antibody)

47 Some more specific tasks
General:
  Comparing with controls
  Handling replicates
RNA-seq:
  Measuring expression levels
  Determining differentially expressed genes
  Determining isoforms [project]
  Detecting gene fusion events [discussion paper]
ChIP-seq:
  Detecting signal peaks

48 Comparing with controls
Observed data contain both desired signals and unwanted stuff
For example:
  In DNA sequencing, the sample may contain contaminating DNA (e.g., normal cells contaminating a cancer sample)
  In ChIP-seq, some regions with no protein binding may also be pulled down
By comparing with a control, we get information for removing background noise and biases
  Cancer: Normal cells (tumor-adjacent/ normal tissue/ blood)
  ChIP-seq: Same DNA, but without ChIP

49 Handling replicates
Random noise can be filtered by having replicates
  Biological replicates (different samples)
  Technical replicates (different experiments)
Main idea: If a signal is consistently observed in multiple experiments, it is more likely to be real
Usual steps:
  Computing probability of consistency
  Filtering inconsistent signals
  Combining unfiltered data from replicates

50 Comparing signal ranks
Suppose we have two ChIP-seq datasets. For each dataset, we have ranked each region by the signal strength (number of reads)
If the real signals have high and consistent ranks, while the noise has low and random ranks, we would get a curve like this:
[Figure: fraction of regions within the top t ranks in both datasets]
Image credit: Li et al., arXiv:

51 Measuring expression levels
How to compute an expression level from a distribution of read counts?
  Calculate the average
  Based on a statistical model
Normalization: needed if expression levels of different genes, or of the same gene in different datasets, are to be compared
  Longer genes are expected to get more reads
  For a dataset with more reads, each gene gets more reads on average
  RPKM: Reads per Kilobase per Million reads
Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)
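RPKM follows directly from its name: divide the read count by the gene length in kilobases and by the number of reads in the dataset in millions. The sketch below assumes the denominator is the total number of mapped reads, which is the usual convention; the function and parameter names are illustrative.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions


# A 2 kb gene with 500 reads in a dataset of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))
# Same expression level: twice the gene length and twice the reads.
print(rpkm(1_000, 4_000, 10_000_000))
```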

52 Differential expression
Suppose we have measured gene expression in two samples (e.g., tumor vs. normal), how to identify the list of genes with differential expression?
  May not want to consider genes with very low expression (where random fluctuation has a large influence)
  Consider genes with a statistically significant difference (compare 1 vs. 2 and 100 vs. 200)
  Consider genes with a large fold change (with large counts, even a small relative difference can easily reach statistical significance)

53 Isoforms
Alternative isoforms: same gene producing multiple types of transcript
Image credit: Wang et al., Nature 456(7221), (2008)

54 Reconstructing isoforms
Paired-end reads and split reads could help connect neighboring exons
Still not easy to determine isoforms
  If there are reads that connect exon 1 and exon 2, and reads that connect exon 2 and exon 3, do we have a transcript with all 3 exons?
  Also need to consider read counts
    E1:50, E2:60, E3:10
In general, need to make some assumptions, construct a statistical model, and do predictions

55 Signal peaks
Protein binding sites are usually short (~10bp) but the DNA pulled down can be much longer
With a large number of reads from random positions around the binding site, a distribution will be formed
Image credit: Rozowsky et al., Nature Biotechnology 27(1):66-75, (2009)

56 Calling signal peaks
Things to consider:
  Signals in control (e.g., due to open chromatin)
  Height of peaks
  Fluctuations
  Local bias
  Read distribution in the two strands
Image credit: Kharchenko et al., Nature Biotechnology 26(12), (2008)

57 Some other common file formats
WIG: For storing signals at base resolution
BED: For storing intervals

58 Wiggle Track Format (WIG)
Text-based format for storing signals for individual bases
Fixed step: fixedStep chrom=chr1 start=14051 step=
  Chromosome | Position | Value
  1 | 14051 | 18.6
    | 14151 | 2.4
    | 14251 | 44.7
Variable step: variableStep chrom=chr1 span=
  Chromosome | Position | Value
  1 | 143001 | 12.5
    | 143002 |
    | 143003 |
There is also a binary format called BigWig with more efficient data access
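A fixed-step WIG block can be expanded into explicit (chromosome, position, value) rows with a few lines of Python. The step value is not legible in the slide text, so the example below assumes `step=100`, which is consistent with the positions 14051, 14151 and 14251 in the table; the parser is a simplified sketch that handles only `fixedStep` declarations and ignores `span`.

```python
def parse_fixed_step(lines):
    """Expand fixedStep WIG lines into (chrom, position, value) tuples."""
    records = []
    chrom = position = step = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            position = int(fields["start"])
            step = int(fields.get("step", 1))  # WIG defaults to a step of 1
        else:
            if chrom is None:
                raise ValueError("data line before fixedStep declaration")
            records.append((chrom, position, float(line)))
            position += step
    return records


wig_lines = [
    "fixedStep chrom=chr1 start=14051 step=100",  # step=100 is assumed
    "18.6",
    "2.4",
    "44.7",
]
for row in parse_fixed_step(wig_lines):
    print(row)
```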

59 BED format
Text-based, tab-delimited format for storing signals for intervals
3 required fields: chrom, chromStart, chromEnd
9 optional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts (last ones for visualization)
Example:
  chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
  chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
There is also a binary format called BigBed with more efficient data access
Many variations, such as the commonly-used bedGraph format with only 4 fields: chrom, chromStart, chromEnd, dataValue
Example source: UCSC Genome Browser

60 Formatting traps
When you work with genomic data files, be careful of the following:
  Whether genomic locations start with 0 or start with 1
  Whether the first position of a region is included
  Whether the last position of a region is included
For example, for the bedGraph format:
  First position is counted as 0.
  First specified position is included.
  Last specified position is not included.
  Therefore, “chr1 2 4” means the third and fourth positions of chromosome 1.
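The bedGraph convention above (0-based start, end not included) is a frequent source of off-by-one errors, so it can help to write the conversion down once. The helpers below are illustrative and assume whitespace-separated bedGraph lines with four fields.

```python
def parse_bedgraph_line(line):
    """Split a bedGraph line into (chrom, start, end, value)."""
    chrom, start, end, value = line.split()
    return chrom, int(start), int(end), float(value)


def covered_positions(start, end):
    """1-based positions covered by the 0-based, half-open interval [start, end)."""
    return list(range(start + 1, end + 1))


chrom, start, end, value = parse_bedgraph_line("chr1 2 4 0.5")
print(chrom, covered_positions(start, end))  # third and fourth positions
print(end - start)                           # interval length is simply end - start
```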

61 Summary
Characteristics of next-generation sequencing:
  High-throughput
  Relatively low cost
  Short reads
Standard data processing
  Data preprocessing
  Sequence alignment
  Sequence assembly
Applications (X-seq)
  Many based on cross-linking and immuno-precipitation
  Specific data processing and analysis

