GE3M25: Data Handling – ChIP-Seq

TCD, 31/10/2017 Karsten Hokamp, PhD Genetics

GE3M25 sub-parts: Statistics, Data Handling, ChIP-Seq analysis, Python Programming

Class 1: ChIP-seq data analysis in a nutshell
What is Next-Generation Sequencing?
What kind of data are produced?
Quality assessment
Read mapping
Peak detection
Visualisation


Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences → massive impact on genomics
Massively parallel signature sequencing (MPSS)
Polony sequencing
454 pyrosequencing
Illumina (Solexa) sequencing
SOLiD sequencing
Ion Torrent semiconductor sequencing
DNA nanoball sequencing
Heliscope single molecule sequencing
Single molecule real time (SMRT) sequencing
NGS took sequencing runs from 84 kilobase (kb) per run to 1 gigabase (Gb) per run!

Illumina Sequencing
Flow cell is coated with a lawn of oligos complementary to adapter sequence B.
The Illumina sequencing method can be broken down into 4 steps. First, the library is prepared: the sequencing library is made by random fragmentation of the DNA or cDNA sample, followed by 5’ and 3’ adapter ligation. The next step is cluster amplification: the library is loaded into a flow cell, which is coated with a lawn of oligos that are complementary to the adapter sequence so that the DNA fragments can hybridise to it. For sequencing to work the DNA must be amplified, so each fragment is PCR-amplified into a distinct, clonal cluster through bridge amplification. When cluster generation is complete the DNA is denatured, and the reverse strands are cleaved and washed off, leaving only the forward strands ready for sequencing.

Illumina Sequencing Flow Cell
8 lanes with 120 tiles each (GAIIx)
Illumina SBS technology uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into DNA template strands. All 4 reversible, terminator-bound dNTPs are present during each sequencing cycle. Sequencing begins with the extension of the first sequencing primer to produce the first read. As soon as a nucleotide is added, light of a characteristic wavelength is emitted at a particular signal intensity; this is captured by the detectors and a base call is made. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process, each cluster corresponding to a different section of the DNA sequence. The number of cycles determines the length of the read; normally 100 bp reads are generated.

YouTube Videos Illumina Solexa Sequencing (< 2 minutes):
Illumina (5 minutes):

Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):

Frank Wellmer's slide – as a reminder...

Frank Wellmer's slide – let's try this!

Example data set
Related information → GEO DataSets
14 Samples: D%20gsm[ETYP]
Test sample: Hfq coIP OD2+6h, SRA SRX155645
Hfq bound to transcripts instead of genomic DNA!

Example data set
Download from local server:

Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access and process them on the command line.

Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock

Working on the Command Line – Terminal
Screenshot labels: prompt, cursor, title bar

Working on the Command Line – the Prompt
Parts of the prompt: host, user, directory, symbol

Working on the Command Line – Orientation
pwd = print working directory (output: the current directory)

File Hierarchy
(Diagram of the directory tree. Labels: root /, Application, Library, Users, bin, tmp, kahokamp, Desktop, Documents, Library, Movies, Music)

Working on the Command Line – Orientation
Labels on the path: directory, root, separator
~ = short-cut for home directory

Working on the Command Line – File Listing
ls -- list directory contents

ls -l → long directory listing

Columns of the long listing: type, permissions, link count, owner, group, size, last modification date, name

Permissions in triplets for owner, group, everyone else
rwxr-xr-x
r = read permission
w = write permission
x = execution permission
- = forbidden
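To make the triplet layout concrete, here is a minimal Python sketch that splits a permission string such as 'rwxr-xr-x' into its three groups. The function name describe_permissions is made up for this illustration; stat.filemode is part of the Python standard library.

```python
import stat

def describe_permissions(mode_string):
    """Split a 9-character permission string into owner/group/others triplets."""
    if len(mode_string) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    names = {"r": "read", "w": "write", "x": "execute"}
    result = {}
    for who, start in (("owner", 0), ("group", 3), ("others", 6)):
        triplet = mode_string[start:start + 3]
        # A '-' means the permission at that position is not granted.
        result[who] = [names[c] for c in triplet if c != "-"]
    return result

perms = describe_permissions("rwxr-xr-x")
print(perms)

# The standard library renders a numeric mode the way 'ls -l' does;
# 0o100755 is a regular file with permissions 755.
print(stat.filemode(0o100755))  # -rwxr-xr-x
```

In this example the owner may read, write and execute, while group and everyone else may only read and execute.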

Parameter(s) Argument(s)

Working on the Command Line – Manual
man = manual page for a program

Working on the Command Line – Manual
space for next page
h for help
q for quit

Working on the Command Line – Moving Around
Some examples of using cd (change directory):
cd Downloads
cd → change into home directory
cd ~/Downloads
cd .. → change into upper directory
cd - → change into previous directory
Try and combine with 'pwd' to get your bearings!
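For comparison, the same moves can be sketched in Python with the standard os module. This is only an illustration, not part of the practical: it works in a throw-away temporary directory, and the 'Downloads' folder it creates there is a stand-in for the real one.

```python
import os
import tempfile

start = os.getcwd()                          # like 'pwd'

# Build a throw-away directory tree to move around in (assumption for the demo).
base = os.path.realpath(tempfile.mkdtemp())
os.mkdir(os.path.join(base, "Downloads"))

os.chdir(base)
os.chdir("Downloads")                        # like 'cd Downloads'
inside = os.getcwd()

os.chdir("..")                               # like 'cd ..'
parent = os.getcwd()

home = os.path.expanduser("~")               # the directory that '~' stands for

os.chdir(start)                              # go back to where we started
print(inside)
print(parent)
print(home)
```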

Working on the Command Line – Short-cuts
Automatic extension with <TAB> key:
cd
cd D<TAB><TAB> → shows possible extensions
cd Dow<TAB> → extends to 'Downloads'
<TAB> shortens work and saves from typos!

Change into the Downloads directory and do a long listing.
The downloaded file should be there as a directory ('chip-seq').

If you see a file called 'chip-seq.zip' instead,
do the following: unzip chip-seq.zip

list its content

Peek into the file:
'gunzip' uncompresses a file
'-c' option directs output to the terminal (standard output) instead of writing a file
pipe symbol ('|') connects two commands
'head' displays first 10 lines of input
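The same peek can be sketched in Python with the standard gzip module, which decompresses while reading so that only the requested lines are processed. The function name peek and the demo file are made up for this illustration; in practice you would pass the path of the downloaded compressed file.

```python
import gzip
import itertools
import os
import tempfile

def peek(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]

# Demo on a small throw-away file with made-up content.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wt") as out:
    for i in range(25):
        out.write(f"line {i + 1}\n")

first_lines = peek(demo_path)
print(first_lines)
```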

The FastQ Format
Similar to Fasta but includes quality information

The FastQ Format
Line 1: @ sign, header
Line 2: sequence string
Line 3: + sign, header (optional)
Line 4: quality string
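A minimal Python sketch of reading this layout is given here. It assumes every record takes exactly four lines (no wrapped sequences); the names FastqRecord and read_fastq and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple

class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str

def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FastQ file handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FastQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)

# Two made-up reads; the second repeats its header after the '+' sign.
example = io.StringIO("@read1\nCTAN\n+\nII##\n@read2\nGGCA\n+read2\nIIII\n")
records = list(read_fastq(example))
for record in records:
    print(record.header, record.sequence, record.quality)
```

The length check reflects the one-to-one pairing of bases and quality characters.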

The FastQ Format: header line
@ sign, SRA ID, sequencer ID, lane, tile, x, y coordinates

Quality Information
Indication of the probability that a base call is correct

Base call   Quality
C           I
T           I
…
A           #
N           #
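How a quality character maps to a probability can be sketched as follows. This assumes the common Phred+33 encoding (score = ASCII code minus 33), which is an assumption about this data set and not stated on the slide; older Illumina pipelines used an offset of 64. The function names are made up for illustration.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to Phred scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

scores = phred_scores("II##")
print(scores)  # [40, 40, 2, 2]

for q in sorted(set(scores), reverse=True):
    print(q, round(error_probability(q), 4))
```

Under this assumption 'I' stands for a score of 40, about one wrong call in 10,000, while '#' stands for a score of 2, where the call is more likely wrong than right (error probability about 0.63).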

use FastQC for quality assessment

display report in Web browser

Working on the Command Line – Short-cuts
Make browser bigger (green dot in title bar)! Switch between applications: <cmd>-<tab>

Mapping 312k reads to Salmonella genome
Reads: 1 read per line, 80 reads per page, ~4000 pages
Genome: 3 books with genome sequence
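The page count in this analogy can be checked with a quick calculation, taking "312k" as 312,000 reads (a rounded figure):

```python
import math

reads = 312_000          # "312k" taken as a round number
reads_per_page = 80      # one read per line, 80 lines per page

pages = math.ceil(reads / reads_per_page)
print(pages)  # 3900
```

That is 3,900 pages, in line with the "~4000 pages" on the slide.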

prepare index for read mapping

map reads against indexed genome

increase % of mapped reads through soft-trimming

use Gem to detect peaks and motifs
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped.sam --f SAM --out peaks --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

display report in Web browser

Frank Wellmer's slide – looks familiar?

End of story? No!
Do some checks, visualisations and comparisons with other tools!

Use Excel to display list of peaks:
File → Open → Downloads → chip-seq → peaks → peaks_GEM_events.txt

sorted by probability score
Next: Visualise read coverage at top positions

compress mapped reads and create an index

start IGV (Integrative Genomics Viewer)

load 'chrSL1344.fa' from chip-seq directory as genome

load from file: annotation (SL1344.gff) and mapped reads (mapped.bam)

→ click and drag region to zoom in

→ coverage
→ individual reads
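The relationship between the two tracks can be sketched in a few lines of Python: coverage at a position is simply the number of individual reads that overlap it. This is a toy illustration, not how the genome browser computes its track; the function name, the interval convention (0-based start, end not included) and the reads are all made up.

```python
def coverage(reads, genome_length):
    """Per-base read depth from (start, end) intervals; end is not included."""
    # Record +1 where a read starts and -1 where it ends,
    # then take a running sum along the genome.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for change in diff[:genome_length]:
        running += change
        depth.append(running)
    return depth

toy_reads = [(0, 4), (2, 6), (2, 6), (5, 8)]
depth = coverage(toy_reads, 10)
print(depth)  # [1, 1, 3, 3, 2, 3, 1, 1, 0, 0]
```

Note that the two identical reads at (2, 6) both count towards the depth, which is how duplicated reads can inflate a peak.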

→ top peak
→ notice data range

Frank Wellmer's slide - do our peaks look like this?

→ duplicate reads from PCR amplification?

Fine-tuning Remove duplicates:
tools/samtools rmdup -s mapped.bam mapped_rmdup.bam
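The idea behind duplicate removal can be sketched in Python as follows. This is a simplified illustration and not the algorithm samtools uses (samtools has its own rules for deciding which read of a duplicate set to keep): here reads count as duplicates when they share reference, start position and strand, and the first one seen is kept. The read tuples are made up.

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (reference, position, strand)."""
    seen = set()
    kept = []
    for name, ref, pos, strand in reads:
        key = (ref, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept

toy_reads = [
    ("read1", "chr", 100, "+"),
    ("read2", "chr", 100, "+"),   # same start and strand as read1: duplicate
    ("read3", "chr", 100, "-"),   # same start but other strand: kept
    ("read4", "chr", 250, "+"),
]
kept = remove_duplicates(toy_reads)
print(kept)  # ['read1', 'read3', 'read4']
```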

Run Gem again using new input file:
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped_rmdup.bam --f BAM --out peaks_rmdup --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

→ with duplicates
→ w/o duplicates

peak coincides with 3' end of small RNA (SalCom Browser)

Further work
Use control data to remove noise
Use biological replicates (variation)
Apply hard-trimming
Try different mappers, peak callers
Downstream analysis…

Achievements so far
Learnt about NGS, ChIP-Seq
Worked on the UNIX command line
Carried out QC on sequence file
Mapped reads to reference genome
Detected peaks and motifs
Visualised data

Don't forget to log out!
