GE3M25: Data Handling – ChIP-Seq

TCD, 31/10/2017 Karsten Hokamp, PhD Genetics

GE3M25 sub-parts
Statistics
Data Handling: Python Programming, ChIP-Seq analysis

Timetable, Weeks 10 to 16
GE3M11 Exam
NGS classes: 1 Introduction; 2 QC, Trimming; 3 Mapping; 4 Peak Calling; 5 Gene Lists; 6 Pipelines (with Python 7); 7 Revision (with Python 8)
Python classes: 1 Introduction; 2 Strings and Files; 3 File I/O, Branching; 4 Lists, Tuples; 5 Dictionaries; 6 Regex, System; 7 Pipelines (with NGS 6); 8 Revision (with NGS 7)
Python Exam
January 2018: ChIP-Seq project report

Class 1: ChIP-seq data analysis in a nutshell
What is Next-Generation Sequencing?
What kind of data are produced?
Quality assessment
Read mapping
Peak detection
Visualisation


Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences → massive impact on Genomics
Massively parallel signature sequencing (MPSS)
Polony sequencing
454 pyrosequencing
Illumina (Solexa) sequencing
SOLiD sequencing
Ion Torrent semiconductor sequencing
DNA nanoball sequencing
Heliscope single molecule sequencing
Single molecule real time (SMRT) sequencing
NGS increased output from 84 Kilobase (Kb) per run to 1 Gigabase (Gb) per run!

Illumina Sequencing
Flow cell is coated with a lawn of oligos complementary to the adapter sequence
The Illumina sequencing method can be broken down into 4 steps. First, the library is prepared by random fragmentation of the DNA or cDNA sample, followed by 5’ and 3’ adapter ligation. The next step is cluster amplification: the library is loaded into a flow cell, which is coated with a lawn of oligos complementary to the adapter sequence, so that the DNA fragments can hybridise to it. For sequencing to work, the DNA must be amplified, so each fragment is PCR-amplified into a distinct, clonal cluster through bridge amplification. When cluster generation is complete, the DNA is denatured and the reverse strands are cleaved and washed off, leaving only the forward strands ready for sequencing.

Illumina Sequencing
Illumina SBS technology uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into DNA template strands. All 4 reversible, terminator-bound dNTPs are present during each sequencing cycle. Sequencing begins with the extension of the first sequencing primer to produce the first read. As soon as a nucleotide is added, light of a characteristic wavelength and signal intensity is emitted; this is captured by the detectors and a base call is made. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process, and each cluster corresponds to a different section of the DNA sequence. The number of cycles determines the length of the read; normally 100 bp reads are generated.
Flow Cell: 8 lanes with 120 tiles each (GAIIx), ~ 10 million reads per tile

YouTube Videos
Illumina Solexa Sequencing (< 2 minutes):
Illumina (5 minutes):
→ See 'Links' section on course web page

Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):

Frank Wellmer's slide – as a reminder...

Frank Wellmer's slide – let's try this!

Example data set
Related information → GEO DataSets
14 Samples: D%20gsm[ETYP]
Test sample: Hfq coIP OD2+6h, SRA SRX155645
Hfq bound to transcripts instead of genomic DNA!

Example data set
Download from local server:

Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access and process them on the command line.

Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock

Working on the Command Line – Terminal
Prompt Cursor Title bar

Working on the Command Line – the Prompt
host user directory symbol

Working on the Command Line – Orientation
pwd = print working directory; the line printed after the command is its output

File Hierarchy
root (/): Applications, Library, Users, bin, tmp
/Users/kahokamp: Desktop, Documents, Library, Movies, Music

Working on the Command Line – Orientation
/ at the start of a path = root directory; / between directory names = separator
~ = short-cut for home directory

Working on the Command Line – File Listing
ls -- list directory contents

ls -l → long directory listing

Columns of the long listing: type, permissions, link count, owner, group, size, last modification date, name

Permissions in triplets for owner, group, everyone else: rwxr-xr-x
r = read permission
w = write permission
x = execution permission
- = forbidden
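
To see the triplets in practice, here is a minimal sketch that sets the permissions from the slide on a throw-away file and reads them back. The file name script.sh and the temporary directory are assumptions for illustration only; the numeric mode 755 is the usual shorthand for rwxr-xr-x.

```shell
# Work in a throw-away directory so nothing important is touched
workdir=$(mktemp -d)
touch "$workdir/script.sh"

# 755 = rwx for the owner, r-x for the group, r-x for everyone else
chmod 755 "$workdir/script.sh"

# The first 10 characters of a long listing are the type plus the three triplets
perms=$(ls -l "$workdir/script.sh" | cut -c1-10)
echo "$perms"    # -rwxr-xr-x
```

The leading '-' is the type column (a regular file); a directory would show 'd' there instead.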

Parameter(s) Argument(s)

Working on the Command Line – Manual
man = manual page for a program

Working on the Command Line – Manual
space for next page
h for help
q for quit

Working on the Command Line – Moving Around
Some examples of using cd (change directory):
cd Downloads
cd → change into home directory
cd ~/Downloads
cd .. → change into parent directory
cd - → change into previous directory
Try and combine with 'pwd' to get your bearings!
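
The cd variants listed above can be tried safely in a scratch area. This is a sketch only: the practice tree is created in a temporary directory, and its directory names simply mirror the ones used in class.

```shell
# Build a small practice tree in a temporary location
base=$(mktemp -d)
mkdir -p "$base/Downloads/chip-seq"

cd "$base/Downloads/chip-seq"
pwd                               # path ends in Downloads/chip-seq

cd ..                             # up one level
after_up=$(basename "$(pwd)")
echo "$after_up"                  # Downloads

cd - > /dev/null                  # back to the previous directory (cd - also prints it)
after_back=$(basename "$(pwd)")
echo "$after_back"                # chip-seq
```

Running pwd after every cd, as done here, is the quickest way to keep track of where you are.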

Working on the Command Line – Short-cuts
Automatic extension with <TAB> key:
cd
cd D<TAB><TAB> → shows possible extensions
cd Dow<TAB> → extends to 'Downloads'
<TAB> shortens work and prevents typos!

Change into the Downloads directory and do a long listing.
The downloaded file should be there as a directory ('chip-seq').

If you see a file called 'chip-seq.zip' instead,
do the following: unzip chip-seq.zip

list its content

Content of zip archive
General Feature Format (annotation)
Compressed FastQ file
Reference Genome in Fasta format

Peek into the file:
'gunzip' uncompresses a file
'-c' option directs output to terminal
pipe symbol ('|') connects two commands
'head' displays first 10 lines of input
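
The same combination can be practised on a small file before trying it on the real data. In this sketch the file numbers.txt.gz is a made-up stand-in for the compressed FastQ file, created in a temporary directory.

```shell
# Create a small gzip-compressed practice file (hypothetical name and content)
workdir=$(mktemp -d)
seq 1 50 | gzip > "$workdir/numbers.txt.gz"

# Uncompress to the terminal and keep only the first 10 lines
first_lines=$(gunzip -c "$workdir/numbers.txt.gz" | head)
echo "$first_lines"

# Because of -c, the compressed file itself stays unchanged on disk
ls "$workdir"
```

Without '-c', gunzip would replace the .gz file with its uncompressed version instead of printing the content.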

The FastQ Format
Fasta:
FastQ: similar to Fasta but includes Quality information

The FastQ Format
Line 1: @ sign, header
Line 2: sequence string
Line 3: + sign, header (optional)
Line 4: quality string

The FastQ Format
Header line: @ sign, SRA ID, Sequencer ID, lane, tile, x, y coordinates
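
Because every read occupies four lines, simple line arithmetic is enough to count reads or pull out the sequences. The sketch below uses a made-up two-read file (toy.fastq, with simplified headers) and assumes the common layout in which sequence and quality strings are not wrapped over several lines.

```shell
workdir=$(mktemp -d)
# Toy FastQ file with two reads; names and bases are invented for illustration
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGTN
+
IIII#
@read2
TTGCA
+
IIIII
EOF

# Each record spans exactly 4 lines, so reads = lines / 4
lines=$(wc -l < "$workdir/toy.fastq" | tr -d ' ')
reads=$((lines / 4))
echo "$reads reads"

# Print only the sequence strings (line 2 of every 4-line record)
sequences=$(awk 'NR % 4 == 2' "$workdir/toy.fastq")
echo "$sequences"
```

For a compressed file, the same commands can be fed from 'gunzip -c' through a pipe.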

Quality Information
Indication of the probability that a base call is correct
Base call   Quality
C           I
T           I
…
A           #
N           #
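
The quality characters can be turned into numbers. The sketch below assumes the widely used Phred+33 encoding (the slides do not state which encoding the file uses): the score Q is the character's ASCII code minus 33, and the probability that the base call is wrong is 10^(-Q/10). The function names are illustrative.

```shell
# Convert one quality character to a Phred score, assuming Phred+33 encoding
phred_score() {
    # printf with a leading quote yields the character's numeric code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Error probability for a Phred score Q is 10^(-Q/10)
error_probability() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

q_high=$(phred_score 'I')
q_low=$(phred_score '#')
echo "I -> Q$q_high, error probability $(error_probability "$q_high")"
echo "# -> Q$q_low, error probability $(error_probability "$q_low")"
```

Under this assumption 'I' corresponds to Q40 (a very reliable call) and '#' to Q2 (essentially unreliable), which fits the N base call paired with '#' in the table.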

use FastQC for quality assessment

display report in Web browser

Working on the Command Line – Short-cuts
Make browser bigger (green dot in title bar)! Switch between applications: <cmd>-<tab>

Mapping 312k reads to Salmonella genome
1 read per line
80 reads per page
~4000 pages
3 books with genome sequence
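
The page estimate follows from a one-line calculation. This sketch takes "312k" as 312000 reads, which is an approximation of the real read count.

```shell
reads=312000          # "312k reads" from the slide, rounded
reads_per_page=80     # one read per line, 80 lines per page

pages=$((reads / reads_per_page))
echo "$pages pages"   # 3900, i.e. roughly 4000
```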

prepare index for read mapping

tools/bowtie2-build chrSL1344.fa SL1344
program: tools/bowtie2-build
input file: chrSL1344.fa
prefix for output files: SL1344

map reads against indexed genome

increase % of mapped reads through soft-trimming

use Gem to detect peaks and motifs
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped.sam --f SAM --out peaks --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

Frank Wellmer's slide – looks familiar?

End of story? No!
Do some checks, visualisations and comparisons with other tools!

Use Excel to display list of peaks:
File → Open → Downloads → chip-seq → peaks → peaks_GEM_events.txt

sorted by probability score
Next: Visualise read coverage at top positions
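
Sorting a peak table by score does not need Excel. The following is a sketch on a made-up, tab-separated table: the column names, positions and scores are invented and do not reproduce the actual layout of peaks_GEM_events.txt.

```shell
workdir=$(mktemp -d)
# Toy tab-separated table; column names and values are made up for illustration
printf 'Position\tScore\n'  > "$workdir/toy_peaks.txt"
printf 'chr:100\t12.5\n'   >> "$workdir/toy_peaks.txt"
printf 'chr:900\t87.0\n'   >> "$workdir/toy_peaks.txt"
printf 'chr:400\t45.2\n'   >> "$workdir/toy_peaks.txt"

# Keep the header, then sort the remaining rows by column 2, numerically, highest first
sorted=$(
    head -n 1 "$workdir/toy_peaks.txt"
    tail -n +2 "$workdir/toy_peaks.txt" | LC_ALL=C sort -k2,2nr
)
echo "$sorted"

# Position of the top-scoring row (second line of the sorted table)
top_position=$(echo "$sorted" | sed -n '2p' | cut -f1)
echo "$top_position"
```

For a real table, the column number after -k has to be adjusted to wherever the score of interest sits.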

compress mapped reads and create an index

start IGV (Integrative Genomics Viewer)

load 'chrSL1344.fa' from chip-seq directory as genome

load from file: annotation (SL1344.gff) mapped reads (mapped.bam)

→ click and drag region to zoom in

→ coverage
→ individual reads

→ top peak
→ notice data range

Frank Wellmer's slide - do our peaks look like this?

→ duplicate reads from PCR amplification?

Fine-tuning
Remove duplicates:
tools/samtools rmdup -s mapped.bam mapped_rmdup.bam
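
The idea behind duplicate removal can be illustrated on a toy table: reads that map to the same position on the same strand are treated as copies, and only one is kept. This is a conceptual sketch with invented data in a plain text file; it is not how samtools works internally, and it does not model which of the duplicate reads samtools chooses to keep.

```shell
workdir=$(mktemp -d)
# Toy read table: name, chromosome, start, strand (all values made up)
cat > "$workdir/toy_reads.txt" <<'EOF'
r1 chr 100 +
r2 chr 100 +
r3 chr 100 -
r4 chr 250 +
r5 chr 250 +
EOF

# Keep the first read seen for each (chromosome, start, strand) combination
awk '!seen[$2, $3, $4]++' "$workdir/toy_reads.txt" > "$workdir/toy_dedup.txt"

before=$(wc -l < "$workdir/toy_reads.txt" | tr -d ' ')
after=$(wc -l < "$workdir/toy_dedup.txt" | tr -d ' ')
echo "$before reads before, $after after removing duplicates"
```

Here r2 and r5 are dropped, while r3 survives because it lies on the opposite strand.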

Run Gem again using new input file:
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped_rmdup.bam --f BAM --out peaks_rmdup --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

index new BAM file for IGV browser

→ with duplicates
→ w/o duplicates

peak coincides with 3' end of small RNA (SalCom Browser)

Further work
Use control data to remove noise
Use biological replicates (variation)
Apply hard-trimming
Try different mappers, peak callers
Downstream analysis…

Achievements so far
Learnt about NGS, ChIP-Seq
Worked on the UNIX command line
Carried out QC on sequence file
Mapped reads to reference genome
Detected peaks and motifs
Visualised data

Voluntary Exercise
Carry out the Software Carpentry Course on the Unix Shell:

Don't forget to log out!
