1
GE3M25: Data Handling – ChIP-Seq
TCD, 31/10/2017 Karsten Hokamp, PhD Genetics
2
GE3M25 sub-parts Statistics Data Handling Python Programming
ChIP-Seq analysis Statistics Data Handling
3
NGS 1 Introduction Week 10 Week 11 NGS 2 QC, Trimming Week 12 NGS 3
GE3M11 Exam Week 10 Week 11 Python 1 Introduction Python 2 Strings and Files NGS 2 QC, Trimming Week 12 Python 3 File I/O, Branching Python 4 Lists, Tuples NGS 3 Mapping Week 13 Python 5 Dictionaries Python 6 Regex, System NGS 4 Peak Calling Week 14
4
ChIP-Seq project report
NGS 5 Gene Lists NGS 6 / Python 7 Pipelines NGS 7 / Python 8 Revision Week 15 Python Exam Week 16 ChIP-Seq project report January 2018:
5
Class 1: ChIP-seq data analysis in a nutshell
What is Next-Generation Sequencing? What kind of data are produced? Quality assessment Read mapping Peak detection Visualisation
7
Class 1: ChIP-seq data analysis in a nutshell
8
Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences. Massive impact on genomics.
Massively parallel signature sequencing (MPSS)
Polony sequencing
454 pyrosequencing
Illumina (Solexa) sequencing
SOLiD sequencing
Ion Torrent semiconductor sequencing
DNA nanoball sequencing
Heliscope single molecule sequencing
Single molecule real time (SMRT) sequencing
NGS increased output from 84 Kilobase (Kb) per run to 1 Gigabase (Gb) per run!
9
Illumina Sequencing
Flow cell is coated with a lawn of oligos complementary to adapter sequence B.
The Illumina sequencing method can be broken down into 4 steps. First, the library is prepared: the DNA or cDNA sample is randomly fragmented, followed by 5' and 3' adapter ligation. The next step is cluster amplification: the library is loaded into a flow cell, which is coated with a lawn of oligos complementary to the adapter sequence so that the DNA fragments can hybridise to it. For sequencing to work, the DNA must be amplified, so each fragment is PCR-amplified into distinct, clonal clusters through bridge amplification. When cluster generation is complete, the DNA is denatured and the reverse strands are cleaved and washed off, leaving only the forward strands ready for sequencing.
10
Flow Cell Illumina Sequencing 8 lanes with 120 tiles each (GAIIx)
Illumina SBS technology uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into DNA template strands. All 4 reversible, terminator-bound dNTPs are present during each sequencing cycle. Sequencing begins with the extension of the first sequencing primer to produce the first read. As soon as a nucleotide is added, light of a characteristic wavelength and signal intensity is emitted; this is captured by the detectors and a base call is made. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. Each cluster corresponds to a different section of the DNA sequence. The number of cycles determines the length of the read; normally 100 bp reads are generated.
8 lanes with 120 tiles each (GAIIx), ~10 million reads per tile
11
YouTube Videos Illumina Solexa Sequencing (< 2 minutes):
Illumina (5 minutes): See 'Links' section on course web page
12
Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):
13
Frank Wellmer's slide – as a reminder...
14
Frank Wellmer's slide – let's try this!
15
Hfq bound to transcripts instead of genomic DNA!
Example data set Related information GEO DataSets 14 Samples: D%20gsm[ETYP] Test sample: Hfq coIP OD2+6h, SRA SRX155645 Hfq bound to transcripts instead of genomic DNA!
16
Example data set Download from local server:
Download from local server:
17
Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access and process them on the command line.
18
Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock
19
Working on the Command Line – Terminal
Prompt Cursor Title bar
20
Working on the Command Line – the Prompt
host user directory symbol
21
Working on the Command Line – Orientation
output pwd = print working directory
22
File Hierarchy Application Desktop Library Documents root / Users
kahokamp Library bin Movies tmp Music
23
Working on the Command Line – Orientation
directory root separator ~ = short-cut for home-directory
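As a small illustration of the orientation commands above, the following sketch prints the working directory and shows that '~' stands for the home directory. The variable names are made up for this example.

```shell
# Print the current working directory (an absolute path starting at the root '/').
pwd

# '~' is a short-cut that the shell expands to the home directory.
home_via_tilde=~
echo "Home directory: $home_via_tilde"

# Absolute paths always begin with the root directory '/'.
first_char=$(pwd | cut -c1)
echo "First character of the working directory: $first_char"
```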
24
Working on the Command Line – File Listing
ls -- list directory contents
25
Working on the Command Line – File Listing
ls -l long directory listing
26
Working on the Command Line – File Listing
Columns of a long listing, left to right: type, permissions, link count, owner, group, size, last modification date, name
27
Working on the Command Line – File Listing
Permissions in triplets for owner, group, everyone else: rwxr-xr-x
r = read permission
w = write permission
x = execution permission
- = forbidden
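To see the triplets in practice, here is a minimal sketch that creates a throwaway file, sets the permissions from the slide (rwxr-xr-x, written as 755 in octal) and reads them back from a long listing. The file name 'script.sh' and the temporary directory are assumptions for illustration only.

```shell
# Work in a throwaway directory so nothing in the real file system is touched.
workdir=$(mktemp -d)
touch "$workdir/script.sh"

# 755 = rwx for the owner, r-x for the group, r-x for everyone else.
chmod 755 "$workdir/script.sh"

# The first 10 characters of a long listing are the type flag plus the three triplets.
perms=$(ls -l "$workdir/script.sh" | cut -c1-10)
echo "$perms"

# Clean up the practice directory.
rm -r "$workdir"
```

The leading '-' in the output marks a regular file; a directory would show 'd' in that position.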
28
Working on the Command Line – File Listing
Parameter(s) Argument(s)
29
Working on the Command Line – Manual
man = manual page for a program
30
Working on the Command Line – Manual
space for next page, h for help, q for quit
31
Working on the Command Line – Moving Around
Some examples of using cd (change directory):
cd Downloads
cd               change into home directory
cd ~/Downloads
cd ..            change into upper directory
cd -             change into previous directory
Try and combine with 'pwd' to get your bearings!
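The cd variants can be tried safely in a practice tree. The sketch below builds a temporary 'Downloads/chip-seq' hierarchy (a stand-in, not the real Downloads folder) and moves around in it; all variable names are hypothetical.

```shell
# Build a small practice tree in a temporary location.
base=$(mktemp -d)
mkdir -p "$base/Downloads/chip-seq"

cd "$base/Downloads/chip-seq"   # change into a directory
cd ..                           # move one level up, into Downloads
parent=$(basename "$(pwd)")
echo "Now in: $parent"

cd - > /dev/null                # jump back to the previous directory (chip-seq)
previous=$(basename "$(pwd)")
echo "Back in: $previous"

cd /                            # leave the practice tree before deleting it
rm -r "$base"
```

'cd -' normally prints the directory it switches to; the redirect to /dev/null only keeps the output tidy here.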
32
Working on the Command Line – Short-cuts
Automatic extension with <TAB> key:
cd D<TAB><TAB>   shows possible extensions
cd Dow<TAB>      extends to 'Downloads'
<TAB> shortens work and prevents typos!
33
change into Downloads directory
and do a long listing; the downloaded file should be there as a directory ('chip-seq')
34
If you see a file called 'chip-seq.zip' instead,
do the following: unzip chip-seq.zip
35
list its content
36
Content of zip archive
General Feature Format (annotation)
Compressed FastQ file
Reference Genome in Fasta format
37
Peek into the file:
'gunzip' uncompresses a file
'-c' option directs output to the terminal
pipe symbol ('|') connects two commands
'head' displays the first 10 lines of input
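The same pipe can be tried on a small stand-in file. This sketch compresses 40 numbered lines into a hypothetical 'example.txt.gz' and peeks at it; the real FastQ file in the chip-seq directory would be used in exactly the same way.

```shell
workdir=$(mktemp -d)

# Create a small compressed text file with 40 numbered lines (stand-in for a FastQ file).
seq 1 40 | gzip > "$workdir/example.txt.gz"

# Decompress to standard output and show only the first 10 lines;
# the .gz file itself stays unchanged on disk.
gunzip -c "$workdir/example.txt.gz" | head

# Count how many lines 'head' lets through by default.
shown=$(gunzip -c "$workdir/example.txt.gz" | head | wc -l | tr -d ' ')
echo "Lines displayed: $shown"

rm -r "$workdir"
```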
38
similar to Fasta but includes Quality information
The FastQ Format Fasta: FastQ: similar to Fasta but includes Quality information
39
The FastQ Format
Line 1: @ sign, header
Line 2: sequence string
Line 3: + sign, header (optional)
Line 4: quality string
40
The FastQ Format
Header line: @ sign, SRA ID, sequencer ID, lane, tile, x and y coordinates
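Because every read occupies four lines, simple line arithmetic is enough to inspect a FastQ file. The sketch below writes two made-up records (the IDs, header fields and sequences are invented for illustration), counts the reads and checks that sequence and quality strings have equal length.

```shell
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"

# Two made-up records; real headers and sequences will differ.
cat > "$fastq" <<'EOF'
@SRR0000000.1 EXAMPLE:1:1:100:200
ACGTN
+
IIII#
@SRR0000000.2 EXAMPLE:1:1:101:250
TTGCA
+
IIIHH
EOF

# Each read occupies exactly four lines, so reads = lines / 4.
lines=$(wc -l < "$fastq" | tr -d ' ')
reads=$((lines / 4))
echo "Reads: $reads"

# Print only the sequence lines (line 2 of every four-line record).
awk 'NR % 4 == 2' "$fastq"

# Sequence and quality strings of a record must have the same length.
mismatches=$(awk 'NR % 4 == 2 { n = length($0) }
                  NR % 4 == 0 && length($0) != n { bad++ }
                  END { print bad + 0 }' "$fastq")
echo "Records with mismatched lengths: $mismatches"

rm -r "$workdir"
```

For a compressed file, the same awk commands can read from 'gunzip -c file | awk ...' instead of a file name.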
41
Quality Information
Indication of the probability that a base call is correct
Base call   Quality
C           I
T           I
…
A           #
N           #
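The quality characters can be turned into numbers. The sketch below assumes the common Phred+33 encoding (score = ASCII code minus 33), which the slides do not state explicitly, and converts the two characters from the example, 'I' and '#', into scores and error probabilities. The function name 'phred' is made up.

```shell
# Convert a quality character to a Phred score, ASSUMING the Phred+33 encoding.
phred() {
    # printf with a leading quote prints the character's ASCII code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

q_high=$(phred 'I')   # 'I' is ASCII 73
q_low=$(phred '#')    # '#' is ASCII 35
echo "I -> Q$q_high, # -> Q$q_low"

# Error probability P = 10^(-Q/10), computed with awk.
p_high=$(awk -v q="$q_high" 'BEGIN { printf "%.4f", 10 ^ (-q / 10) }')
p_low=$(awk -v q="$q_low" 'BEGIN { printf "%.4f", 10 ^ (-q / 10) }')
echo "Error probability: I = $p_high, # = $p_low"
```

Under this assumption 'I' corresponds to Q40 (an error chance of 1 in 10,000) and '#' to Q2 (an error chance of about 63%), which fits the slide's pairing of '#' with the uncalled base N.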
42
use FastQC for quality assessment
43
display report in Web browser
45
Working on the Command Line – Short-cuts
Make browser bigger (green dot in title bar)! Switch between applications: <cmd>-<tab>
46
Mapping 312k reads to Salmonella genome
1 read per line 80 reads per page ~4000 pages 3 books with genome sequence
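The page estimate can be checked with shell arithmetic. This sketch takes "312k" as a round 312,000 reads, which is an assumption about the exact count.

```shell
reads=312000          # "312k reads" from the slide, taken as a round number
reads_per_page=80     # one read per line, 80 lines per page

# Integer division gives the number of full pages.
pages=$((reads / reads_per_page))
echo "Pages needed: $pages"   # close to the slide's estimate of ~4000
```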
47
prepare index for read mapping
48
prefix for output files
tools/bowtie2-build chrSL1344.fa SL1344
program: tools/bowtie2-build, input file: chrSL1344.fa, prefix for output files: SL1344
50
map reads against indexed genome
51
increase % of mapped reads through soft-trimming
52
use Gem to detect peaks and motifs
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped.sam --f SAM --out peaks --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt
53
display report in Web browser
55
Frank Wellmer's slide – looks familiar?
56
Do some checks, visualisations and comparisons with other tools!
End of story? No! Do some checks, visualisations and comparisons with other tools!
57
Use Excel to display list of peaks:
File > Open > Downloads > chip-seq > peaks > peaks_GEM_events.txt
58
sorted by probability score
Next: Visualise read coverage at top positions
59
compress mapped reads and create an index
60
(Integrative Genomics Viewer)
start IGV (Integrative Genomics Viewer)
62
load 'chrSL1344.fa' from chip-seq directory as genome
63
load from file: annotation (SL1344.gff) mapped reads (mapped.bam)
64
click and drag region to zoom in
66
coverage individual reads
67
top peak notice data range
68
Frank Wellmer's slide - do our peaks look like this?
69
duplicate reads from PCR amplification?
70
Fine-tuning Remove duplicates:
tools/samtools rmdup -s mapped.bam mapped_rmdup.bam
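The idea behind duplicate removal for single-end reads can be sketched with a toy table: reads that map to the same reference, position and strand are treated as copies of one fragment and only one is kept. This is a simplified illustration with made-up alignments, not the actual samtools implementation.

```shell
workdir=$(mktemp -d)
toy="$workdir/toy_alignments.tsv"

# Made-up alignments: read name, SAM flag (0 = forward, 16 = reverse), reference, position.
printf 'r1\t0\tchr\t100\n'  >  "$toy"
printf 'r2\t0\tchr\t100\n'  >> "$toy"
printf 'r3\t16\tchr\t100\n' >> "$toy"
printf 'r4\t0\tchr\t250\n'  >> "$toy"
printf 'r5\t0\tchr\t100\n'  >> "$toy"

# Keep the first read seen for each (reference, position, strand) combination.
awk -F '\t' '!seen[$3 FS $4 FS $2]++' "$toy" > "$workdir/dedup.tsv"

total=$(wc -l < "$toy" | tr -d ' ')
kept=$(wc -l < "$workdir/dedup.tsv" | tr -d ' ')
echo "Kept $kept of $total reads"

rm -r "$workdir"
```

Here r2 and r5 are dropped as duplicates of r1, while r3 survives because it lies on the opposite strand.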
71
Run Gem again using new input file:
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped_rmdup.bam --f BAM --out peaks_rmdup --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt
72
display report in Web browser
73
index new BAM file for IGV browser
74
with duplicates w/o duplicates
75
peak coincides with 3' end of small RNA (SalCom Browser)
76
Further work
Use control data to remove noise
Use biological replicates (variation)
Apply hard-trimming
Try different mappers, peak callers
Downstream analysis…
77
Achievements so far
Learnt about NGS, ChIP-Seq
Worked on the UNIX command line
Carried out QC on sequence file
Mapped reads to reference genome
Detected peaks and motifs
Visualised data
78
Voluntary Exercise Carry out the Software Carpentry Course on the Unix Shell:
79
Don't forget to log out!