GE3M25: Data Handling – ChIP-Seq

1 GE3M25: Data Handling – ChIP-Seq
TCD, 31/10/2017 Karsten Hokamp, PhD Genetics

2 GE3M25 sub-parts
Statistics, Data Handling, ChIP-Seq analysis, Python Programming

3 Class 1: ChIP-seq data analysis in a nutshell
What is Next-Generation Sequencing?
What kind of data are produced?
Quality assessment
Read mapping
Peak detection
Visualisation

4 Class 1: ChIP-seq data analysis in a nutshell

5 Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences → massive impact on Genomics
Massively parallel signature sequencing (MPSS)
Polony sequencing
454 pyrosequencing
Illumina (Solexa) sequencing
SOLiD sequencing
Ion Torrent semiconductor sequencing
DNA nanoball sequencing
Heliscope single molecule sequencing
Single molecule real time (SMRT) sequencing
NGS took sequencing runs from 84 kilobase (kb) per run to 1 gigabase (Gb) per run!

6 Illumina Sequencing
Flow cell is coated with a lawn of oligos complementary to the adapter sequence.
The Illumina sequencing method can be broken down into 4 steps. First, the library is prepared: the DNA or cDNA sample is randomly fragmented, followed by 5' and 3' adapter ligation. The next step is cluster amplification: the library is loaded into a flow cell, which is coated with a lawn of oligos complementary to the adapter sequence so that the DNA fragments can hybridise to it. For sequencing to work, the DNA must be amplified, so each fragment is PCR-amplified into distinct, clonal clusters through bridge amplification. When cluster generation is complete, the DNA is denatured and the reverse strands are cleaved and washed off, leaving only the forward strands ready for sequencing.

7 Illumina Sequencing Flow Cell
8 lanes with 120 tiles each (GAIIx)
Illumina SBS technology uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into DNA template strands. All 4 reversible, terminator-bound dNTPs are present during each sequencing cycle. Sequencing begins with the extension of the first sequencing primer to produce the first read. As soon as a nucleotide is added, light of a characteristic wavelength and signal intensity is emitted; this is captured by the detectors and a base call is made. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. Each cluster corresponds to a different section of the DNA sequence. The number of cycles determines the length of the read. Normally 100 bp reads are generated.

8 YouTube Videos Illumina Solexa Sequencing (< 2 minutes):
Illumina (5 minutes):

9 Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):

10 Frank Wellmer's slide – as a reminder...

11 Frank Wellmer's slide – let's try this!

12 Example data set
Related information → GEO DataSets
14 Samples: D%20gsm[ETYP]
Test sample: Hfq coIP OD2+6h, SRA SRX155645
Hfq bound to transcripts instead of genomic DNA!

13 Example data set
Download from local server:

14 Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access and process them on the command line.

15 Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock

16 Working on the Command Line – Terminal
Prompt Cursor Title bar

17 Working on the Command Line – the Prompt
host user directory symbol

18 Working on the Command Line – Orientation
output pwd = print working directory

19 File Hierarchy
[Diagram of a directory tree. Labels: root /, Application, Library, bin, tmp, Users, kahokamp, Desktop, Documents, Library, Movies, Music]

20 Working on the Command Line – Orientation
directory root separator ~ = short-cut for home-directory

21 Working on the Command Line – File Listing
ls -- list directory contents

22 Working on the Command Line – File Listing
ls -l → long directory listing

23 Working on the Command Line – File Listing
last modification date permissions owner group size name type link count

24 Working on the Command Line – File Listing
Permissions in triplets for owner, group, everyone else: rwxr-xr-x
r = read permission
w = write permission
x = execution permission
- = forbidden
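
The triplets can be tried out on a scratch file. A minimal sketch, assuming a POSIX shell; the file name 'demo.sh' and the temporary directory are made up for illustration and are not part of the course data:

```shell
# Create a scratch file in a temporary directory (hypothetical name 'demo.sh')
tmpdir=$(mktemp -d)
touch "$tmpdir/demo.sh"

# 755 = rwx for owner (4+2+1), r-x for group (4+1), r-x for everyone else (4+1)
chmod 755 "$tmpdir/demo.sh"

# First 10 characters of the long listing: file type plus the three triplets
perms=$(ls -l "$tmpdir/demo.sh" | cut -c1-10)
echo "$perms"

# Clean up
rm -r "$tmpdir"
```

The numeric mode 755 is just the three triplets read as binary digits (rwx = 7, r-x = 5).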

25 Working on the Command Line – File Listing
Parameter(s) Argument(s)

26 Working on the Command Line – Manual
man = manual page for a program

27 Working on the Command Line – Manual
space for next page h for help q for quit

28 Working on the Command Line – Moving Around
Some examples of using cd (change directory):
cd Downloads
cd → change into home directory
cd ~/Downloads
cd .. → change into parent directory
cd - → change into previous directory
Try and combine with 'pwd' to get your bearings!
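
To practise without touching real files, the moves can be replayed in a throw-away directory. A small sketch, assuming a POSIX shell; the practice tree below is hypothetical and only mimics the Downloads/chip-seq layout used later:

```shell
# Build a small practice tree in a temporary directory (hypothetical layout)
start=$(mktemp -d)
mkdir -p "$start/Downloads/chip-seq"

cd "$start/Downloads/chip-seq"   # go down two levels in one step
cd ..                            # up one level, into Downloads
here=$(basename "$(pwd)")
echo "after 'cd ..': $here"

cd - > /dev/null                 # back to the previous directory (cd - also prints it)
back=$(basename "$(pwd)")
echo "after 'cd -': $back"

# Clean up the practice tree
cd /
rm -r "$start"
```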

29 Working on the Command Line – Short-cuts
Automatic extension with <TAB> key:
cd D<TAB><TAB> → shows possible extensions
cd Dow<TAB> → extends to 'Downloads'
<TAB> shortens work and saves from typos!

30 change into Downloads directory and do a long listing
the downloaded file should be there as a directory ('chip-seq')

31 If you see a file called 'chip-seq.zip' instead,
do the following: unzip chip-seq.zip

32 list its content

33 Peek into the file:
'gunzip' uncompresses a file
'-c' option directs output to terminal
pipe symbol ('|') connects two commands
'head' displays first 10 lines of input
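
The same pipe can be tested on a tiny file before running it on real data. A sketch only: the file 'reads.fastq.gz' and its three records are invented here, and 'head -n 4' is used instead of the default 10 lines so that exactly one record is shown:

```shell
tmpdir=$(mktemp -d)

# Three made-up FastQ records (12 lines); real files hold hundreds of thousands
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n@read3\nGGCN\n+\nIII#\n' \
    | gzip > "$tmpdir/reads.fastq.gz"

# Uncompress to the terminal and keep only the first record (4 lines)
first=$(gunzip -c "$tmpdir/reads.fastq.gz" | head -n 4)
echo "$first"

# Count reads: each record takes 4 lines, so divide the line count by 4
nreads=$(( $(gunzip -c "$tmpdir/reads.fastq.gz" | wc -l) / 4 ))
echo "$nreads reads"

rm -r "$tmpdir"
```

Because of '-c', the compressed file itself stays untouched on disk.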

34 The FastQ Format
similar to Fasta but includes Quality information

35 The FastQ Format
Line 1: @ sign, header
Line 2: sequence string
Line 3: + sign, header (optional)
Line 4: quality string

36 The FastQ Format
Header line: @ sign, SRA ID, Sequencer ID, lane, tile, x, y coordinates
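
The header fields can be pulled apart with 'cut'. A sketch under assumptions: the header below is made up (it is not taken from the example data set), and it assumes the SRA ID is separated from the sequencer part by a space and the remaining fields by colons:

```shell
# Hypothetical header line, invented for illustration
header='@SRR000000.1 HWUSI-EAS000:6:73:941:1973'

# Strip the leading @, then split on the space: SRA ID vs. sequencer part
sra_id=$(printf '%s\n' "$header" | cut -c2- | cut -d' ' -f1)
illumina=$(printf '%s\n' "$header" | cut -d' ' -f2)

# The sequencer part is colon-separated: ID, lane, tile, x, y
sequencer=$(printf '%s\n' "$illumina" | cut -d: -f1)
lane=$(printf '%s\n' "$illumina" | cut -d: -f2)
tile=$(printf '%s\n' "$illumina" | cut -d: -f3)
x=$(printf '%s\n' "$illumina" | cut -d: -f4)
y=$(printf '%s\n' "$illumina" | cut -d: -f5)

echo "SRA ID: $sra_id  sequencer: $sequencer  lane: $lane  tile: $tile  x: $x  y: $y"
```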

37 Quality Information
Indication of the probability that a base call is correct
Base call   Quality
C           I
T           I
A           #
N           #
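
How a quality character turns into a probability can be worked out on the command line. A sketch, assuming the common Phred+33 encoding (score = ASCII code minus 33, error probability = 10^(-score/10)); the slides do not state which encoding the file uses, and the function names 'phred' and 'error_prob' are made up here:

```shell
# Convert one quality character to a Phred score, assuming Phred+33 encoding
phred() {
    # printf with a leading quote yields the character's ASCII code
    code=$(printf '%d' "'$1")
    echo $(( code - 33 ))
}

# Error probability for a Phred score Q is 10^(-Q/10)
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

q_I=$(phred 'I')       # 'I' has ASCII code 73
q_hash=$(phred '#')    # '#' has ASCII code 35

echo "I -> Q$q_I, error probability $(error_prob "$q_I")"
echo "# -> Q$q_hash, error probability $(error_prob "$q_hash")"
```

Under this assumption 'I' is a high-confidence call (Q40) while '#' (Q2) marks a base that should not be trusted, which fits its appearance next to N.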

38 use FastQC for quality assessment

39 display report in Web browser

40

41 Working on the Command Line – Short-cuts
Make browser bigger (green dot in title bar)! Switch between applications: <cmd>-<tab>

42 Mapping 312k reads to Salmonella genome
1 read per line
80 reads per page
~4000 pages
3 books with genome sequence
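
The page estimate is a simple division, shown here with shell arithmetic (312k is taken as 312000; the variable names are made up):

```shell
reads=312000        # 312k reads, as on the slide
reads_per_page=80   # one read per line, 80 lines per page

# Integer division gives the number of printed pages
pages=$(( reads / reads_per_page ))
echo "$pages pages"
```

The result, 3900, is what the slide rounds to ~4000 pages.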

43 prepare index for read mapping

44

45 map reads against indexed genome

46 increase % of mapped reads through soft-trimming

47 use Gem to detect peaks and motifs
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped.sam --f SAM --out peaks --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

48 display report in Web browser

49

50 Frank Wellmer's slide – looks familiar?

51 End of story?
No! Do some checks, visualisations and comparisons with other tools!

52 Use Excel to display list of peaks:
File → Open → Downloads → chip-seq → peaks → peaks_GEM_events.txt

53 sorted by probability score
Next: Visualise read coverage at top positions

54 compress mapped reads and create an index

55 start IGV (Integrative Genomics Viewer)

56

57 load 'chrSL1344.fa' from chip-seq directory as genome

58 load from file: annotation (SL1344.gff) mapped reads (mapped.bam)

59 → click and drag region to zoom in

60

61
→ coverage
→ individual reads

62
→ top peak
→ notice data range

63 Frank Wellmer's slide - do our peaks look like this?

64 → duplicate reads from PCR amplification?

65 Fine-tuning Remove duplicates:
tools/samtools rmdup -s mapped.bam mapped_rmdup.bam
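
What "removing duplicates" means can be shown on a toy table. This is only a conceptual sketch, not what samtools does internally: the read names and positions are invented, and the toy version simply keeps the first read seen per strand and start position, whereas the real tool works on the BAM file and may choose the surviving copy differently:

```shell
# Toy table of mapped reads: name, strand, start position (made-up values)
reads='r1 + 100
r2 + 100
r3 - 100
r4 + 250
r5 + 100'

# Keep only the first read seen for each (strand, position) pair
dedup=$(printf '%s\n' "$reads" | awk '!seen[$2 FS $3]++')
echo "$dedup"

# Number of reads left after collapsing duplicates
kept=$(( $(printf '%s\n' "$dedup" | wc -l) ))
echo "$kept of 5 reads kept"
```

Here r2 and r5 are dropped as copies of r1, while r3 survives because it maps to the opposite strand.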

66 Run Gem again using new input file:
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped_rmdup.bam --f BAM --out peaks_rmdup --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

67
→ with duplicates
→ w/o duplicates

68 peak coincides with 3' end of small RNA (SalCom Browser)

69 Further work
Use control data to remove noise
Use biological replicates (variation)
Apply hard-trimming
Try different mappers, peak callers
Downstream analysis…

70 Achievements so far
Learnt about NGS, ChIP-Seq
Worked on the UNIX command line
Carried out QC on sequence file
Mapped reads to reference genome
Detected peaks and motifs
Visualised data

71 Don't forget to log out!

