GE3M25: Data Handling – ChIP-Seq


1 GE3M25: Data Handling – ChIP-Seq
TCD, 31/10/2017 Karsten Hokamp, PhD Genetics

2 GE3M25 sub-parts
Statistics
Data Handling: Python Programming, ChIP-Seq analysis

3 Timetable, Weeks 10 to 14 (GE3M11 Exam also shown)
Python: 1 Introduction; 2 Strings and Files; 3 File I/O, Branching; 4 Lists, Tuples; 5 Dictionaries; 6 Regex, System
NGS: 1 Introduction; 2 QC, Trimming; 3 Mapping; 4 Peak Calling

4 Timetable, continued
NGS 5 Gene Lists; NGS 6 / Python 7 Pipelines; NGS 7 / Python 8 Revision
Python Exam (Week 15/16)
January 2018: ChIP-Seq project report

5 Class 1: ChIP-seq data analysis in a nutshell
What is Next-Generation Sequencing?
What kind of data are produced?
Quality assessment
Read mapping
Peak detection
Visualisation

6

7 Class 1: ChIP-seq data analysis in a nutshell

8 Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences → massive impact on genomics
Massively parallel signature sequencing (MPSS)
Polony sequencing
454 pyrosequencing
Illumina (Solexa) sequencing
SOLiD sequencing
Ion Torrent semiconductor sequencing
DNA nanoball sequencing
Heliscope single molecule sequencing
Single molecule real time (SMRT) sequencing
NGS increased output from 84 Kilobase (Kb) per run to 1 Gigabase (Gb) per run!

9 Illumina Sequencing
Flow cell is coated with a lawn of oligos complementary to the adapter sequence.
The Illumina sequencing method can be broken down into 4 steps. First, the library is prepared by random fragmentation of the DNA or cDNA sample, followed by 5' and 3' adapter ligation. The next step is cluster amplification: the library is loaded into a flow cell, which is coated with a lawn of oligos complementary to the adapter sequence so that the DNA fragments can hybridise to it. For sequencing to work, the DNA must be amplified, so each fragment is PCR-amplified into a distinct, clonal cluster through bridge amplification. When cluster generation is complete, the DNA is denatured and the reverse strands are cleaved and washed off, leaving only the forward strands ready for sequencing.

10 Illumina Sequencing: Flow Cell
Illumina SBS (sequencing by synthesis) technology uses a proprietary reversible-terminator-based method that detects single bases as they are incorporated into DNA template strands. All 4 reversible, terminator-bound dNTPs are present during each sequencing cycle. Sequencing begins with the extension of the first sequencing primer to produce the first read. As soon as a nucleotide is added, light of a characteristic wavelength and intensity is emitted; the detectors capture this signal and a base call is made. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process, and each cluster corresponds to a different section of the DNA sequence. The number of cycles determines the length of the read; normally 100 bp reads are generated.
Flow cell: 8 lanes with 120 tiles each (GAIIx), ~10 million reads per tile

11 YouTube Videos
Illumina Solexa Sequencing (< 2 minutes):
Illumina (5 minutes):
→ See 'Links' section on course web page

12 Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):

13 Frank Wellmer's slide – as a reminder...

14 Frank Wellmer's slide – let's try this!

15 Example data set
Related information → GEO DataSets
14 Samples: D%20gsm[ETYP]
Test sample: Hfq coIP OD2+6h, SRA SRX155645
Hfq bound to transcripts instead of genomic DNA!

16 Example data set
Download from local server:

17 Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access and process them on the command line.

18 Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock

19 Working on the Command Line – Terminal
Labelled parts: prompt, cursor, title bar

20 Working on the Command Line – the Prompt
Labelled parts of the prompt: host, user, directory, symbol

21 Working on the Command Line – Orientation
pwd = print working directory (the output shows where you currently are)

22 File Hierarchy
/ (root): Application, Library, Users, bin, tmp
/Users: kahokamp
/Users/kahokamp: Desktop, Documents, Library, Movies, Music

23 Working on the Command Line – Orientation
Labelled parts of the path: directory, root, separator
~ = short-cut for home directory

24 Working on the Command Line – File Listing
ls -- list directory contents

25 Working on the Command Line – File Listing
ls -l → long directory listing

26 Working on the Command Line – File Listing
Columns of the long listing: type, permissions, link count, owner, group, size, last modification date, name

27 Working on the Command Line – File Listing
Permissions in triplets for owner, group, everyone else, e.g. rwxr-xr-x
r = read permission
w = write permission
x = execution permission
- = forbidden
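
The triplet layout can be checked with a few lines of Python. This is only a sketch: the helper name `describe_permissions` is made up for illustration, while `stat.filemode` comes from the Python standard library.

```python
import stat

def describe_permissions(perm_string):
    """Split a 9-character permission string into its three triplets."""
    if len(perm_string) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    names = {"r": "read", "w": "write", "x": "execute"}
    result = {}
    for who, start in (("owner", 0), ("group", 3), ("everyone else", 6)):
        triplet = perm_string[start:start + 3]
        # A '-' means the permission at that position is not granted.
        result[who] = [names[c] for c in triplet if c != "-"]
    return result

permissions = describe_permissions("rwxr-xr-x")
for who, allowed in permissions.items():
    print(who, "may:", ", ".join(allowed))

# The same string can be derived from a numeric mode; the leading '-'
# is the file type column (regular file) shown by 'ls -l'.
mode_string = stat.filemode(0o100755)
print(mode_string)
```

Here the owner may read, write and execute, while group and everyone else may only read and execute.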

28 Working on the Command Line – File Listing
Parameter(s) Argument(s)

29 Working on the Command Line – Manual
man = manual page for a program

30 Working on the Command Line – Manual
space for next page, h for help, q for quit

31 Working on the Command Line – Moving Around
Some examples of using cd (change directory):
cd Downloads
cd → change into home directory
cd ~/Downloads
cd .. → change into upper directory
cd - → change into previous directory
Try and combine with 'pwd' to get your bearings!
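
The same path logic can be explored in Python without changing directory at all. A minimal sketch, assuming a home directory of /Users/kahokamp purely for illustration:

```python
import os
from pathlib import PurePosixPath

# Assumed home directory, used only for illustration.
home = PurePosixPath("/Users/kahokamp")

downloads = home / "Downloads"            # like: cd ~/Downloads
upper = downloads.parent                  # like: cd ..
root = PurePosixPath(downloads.anchor)    # the root directory '/'

print(downloads)
print(upper)
print(root)

# expanduser replaces '~' with the home directory of the current user.
print(os.path.expanduser("~"))
```

`PurePosixPath` only manipulates path strings, so the sketch runs on any machine, whether or not these directories exist.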

32 Working on the Command Line – Short-cuts
Automatic extension with <TAB> key:
cd
cd D<TAB><TAB> → shows possible extensions
cd Dow<TAB> → extends to 'Downloads'
<TAB> shortens work and prevents typos!

33 Change into the Downloads directory and do a long listing
The downloaded file should be there as a directory ('chip-seq')

34 If you see a file called 'chip-seq.zip' instead,
do the following: unzip chip-seq.zip

35 list its content

36 Content of zip archive
General Feature Format (annotation)
Compressed FastQ file
Reference genome in Fasta format

37 Peek into the file:
'gunzip' uncompresses a file
'-c' option directs output to terminal
pipe symbol ('|') connects two commands
'head' displays first 10 lines of input
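
For comparison, the same peek can be sketched in Python with the standard `gzip` module. The function name `head_gzip` and the demo file are made up for illustration, and the sketch writes its own small compressed file so that it runs anywhere.

```python
import gzip
import itertools
import os
import tempfile

def head_gzip(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]

# Build a small made-up compressed file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt.gz")
with gzip.open(demo_path, "wt") as out:
    for i in range(1, 101):
        out.write(f"line {i}\n")

first_lines = head_gzip(demo_path)
print("\n".join(first_lines))
```

As with the pipe into 'head', only the beginning of the file is decompressed, which matters for large NGS files.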

38 The FastQ Format
Fasta:
FastQ: similar to Fasta but includes quality information

39 The FastQ Format
Line 1: @ sign, header
Line 2: sequence string
Line 3: + sign, header (optional)
Line 4: quality string

40 The FastQ Format
Header line: @ sign, SRA ID, sequencer ID, lane, tile, x, y coordinates
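
A four-line FastQ record and its header fields could be taken apart as sketched below. The record is made up, and the header layout 'SRA_ID SEQUENCER:LANE:TILE:X:Y' is an assumption based on the fields named on the slide; real files may separate the fields differently.

```python
from collections import namedtuple

FastqRecord = namedtuple("FastqRecord", "header sequence plus quality")

def parse_record(lines):
    """Turn four FastQ lines into a record, checking the marker characters."""
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FastQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return FastqRecord(header[1:], sequence, plus[1:], quality)

def parse_header(header):
    """Split a header of the ASSUMED form 'SRA_ID SEQUENCER:LANE:TILE:X:Y'."""
    sra_id, location = header.split(" ", 1)
    sequencer, lane, tile, x, y = location.split(":")
    return {"sra_id": sra_id, "sequencer": sequencer, "lane": int(lane),
            "tile": int(tile), "x": int(x), "y": int(y)}

# Made-up record for illustration only.
example = [
    "@SRR000000.1 SEQ-01:3:12:1045:877\n",
    "CTAN\n",
    "+\n",
    "II##\n",
]
record = parse_record(example)
fields = parse_header(record.header)
print(record.sequence, record.quality)
print(fields)
```

The length check reflects the one-to-one pairing of bases and quality characters.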

41 Quality Information
Indication of the probability that a base call is correct
Base call  Quality
C  I
T  I
A  #
N  #
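
How a quality character turns into a probability can be sketched as follows, assuming the common Phred+33 encoding (the slide does not name the encoding, but the symbols 'I' and '#' are consistent with it). The Phred score Q is the character's ASCII code minus 33, and the error probability is 10^(-Q/10).

```python
def phred_score(symbol, offset=33):
    """Convert one quality character into its Phred score (Phred+33 assumed)."""
    return ord(symbol) - offset

def error_probability(score):
    """Probability that the base call is wrong: 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)

# Base calls and quality symbols from the slide.
for base, symbol in zip("CTAN", "II##"):
    q = phred_score(symbol)
    p_wrong = error_probability(q)
    print(f"{base} {symbol} Q={q} P(correct)={1 - p_wrong:.4f}")
```

Under this assumption 'I' stands for Q=40 (a 1 in 10,000 chance of error), while '#' stands for Q=2, where the call is more likely wrong than right.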

42 use FastQC for quality assessment

43 display report in Web browser

44

45 Working on the Command Line – Short-cuts
Make browser bigger (green dot in title bar)! Switch between applications: <cmd>-<tab>

46 Mapping 312k reads to Salmonella genome
1 read per line, 80 reads per page → ~4000 pages
3 books with genome sequence
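
The page estimate is simple arithmetic, sketched here with the figures from the slide (312k is taken as 312,000):

```python
import math

reads = 312_000          # "312k reads" from the slide
reads_per_page = 80      # one read per line, 80 lines per page

# Round up, because a partly filled last page still counts as a page.
pages = math.ceil(reads / reads_per_page)
print(pages)
```

The result, 3900, is the "~4000 pages" quoted above.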

47 prepare index for read mapping

48 tools/bowtie2-build chrSL1344.fa SL1344
program: tools/bowtie2-build
input file: chrSL1344.fa
prefix for output files: SL1344
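
Once commands like this are chained into a pipeline, they can be launched from Python. The sketch below assembles the same three parts (program, input file, prefix) and runs them only if the files are present; the helper name `index_command` is made up for illustration.

```python
import os
import subprocess

def index_command(fasta, prefix, program="tools/bowtie2-build"):
    """Return the argument list: program, input file, prefix for output files."""
    return [program, fasta, prefix]

command = index_command("chrSL1344.fa", "SL1344")
print(" ".join(command))

# Only run the program when it and the input file are really present.
if os.path.isfile(command[0]) and os.path.isfile(command[1]):
    subprocess.run(command, check=True)
else:
    print("bowtie2-build or input file not found; command not executed")
```

Passing the command as a list avoids shell quoting problems, and `check=True` makes Python raise an error if the program fails.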

49

50 map reads against indexed genome

51 increase % of mapped reads through soft-trimming

52 use Gem to detect peaks and motifs
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped.sam --f SAM --out peaks --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt
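
The content of SL1344.chrom.size is not shown on the slide. Assuming it lists each sequence name with its length, separated by a tab, such a table could be derived from a Fasta file as sketched below; the function name `fasta_lengths` and the tiny demo sequence are made up.

```python
import io

def fasta_lengths(handle):
    """Return {sequence name: length} for a Fasta file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # The name is the first word after '>'.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

# Tiny made-up Fasta content standing in for a real genome file.
demo_fasta = io.StringIO(">chrDemo example sequence\nACGTACGT\nACG\n")
sizes = fasta_lengths(demo_fasta)
for name, length in sizes.items():
    print(f"{name}\t{length}")
```

For a real genome, `open("chrSL1344.fa")` would take the place of the `io.StringIO` object.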

53 display report in Web browser

54

55 Frank Wellmer's slide – looks familiar?

56 End of story?
No! Do some checks, visualisations and comparisons with other tools!

57 Use Excel to display list of peaks:
File → open → Downloads → chip-seq → peaks → peaks_GEM_events.txt

58 sorted by probability score
Next: Visualise read coverage at top positions

59 compress mapped reads and create an index

60 Start IGV (Integrative Genomics Viewer)

61

62 load 'chrSL1344.fa' from chip-seq directory as genome

63 load from file: annotation (SL1344.gff) and mapped reads (mapped.bam)

64 → click and drag region to zoom in

65

66 → coverage → individual reads

67 → top peak → notice data range

68 Frank Wellmer's slide - do our peaks look like this?

69 → duplicate reads from PCR amplification?

70 Fine-tuning
Remove duplicates:
tools/samtools rmdup -s mapped.bam mapped_rmdup.bam
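
The idea behind duplicate removal can be illustrated with a deliberately simplified sketch: reads that map to the same chromosome, start position and strand are treated as PCR duplicates and only the first one is kept. This is not the samtools implementation, which applies its own rules for deciding which read to retain, and the reads below are made up.

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (chromosome, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

# Made-up alignments: three reads pile up at the same position and strand.
mapped = [
    {"name": "r1", "chrom": "chrDemo", "start": 100, "strand": "+"},
    {"name": "r2", "chrom": "chrDemo", "start": 100, "strand": "+"},
    {"name": "r3", "chrom": "chrDemo", "start": 100, "strand": "+"},
    {"name": "r4", "chrom": "chrDemo", "start": 100, "strand": "-"},
    {"name": "r5", "chrom": "chrDemo", "start": 250, "strand": "+"},
]
unique = remove_duplicates(mapped)
print([read["name"] for read in unique])
```

Reads r2 and r3 are dropped, while r4 survives because it maps to the opposite strand.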

71 Run Gem again using new input file:
java -jar tools/gem/gem.jar --g SL1344.chrom.size --genome . --s --expt mapped_rmdup.bam --f BAM --out peaks_rmdup --k_min 6 --k_max 13 --d tools/gem/Read_Distribution_default.txt

72 display report in Web browser

73 index new BAM file for IGV browser

74 → with duplicates → w/o duplicates

75 peak coincides with 3' end of small RNA (SalCom Browser)

76 Further work
Use control data to remove noise
Use biological replicates (variation)
Apply hard-trimming
Try different mappers, peak callers
Downstream analysis…

77 Achievements so far
Learnt about NGS, ChIP-Seq
Worked on the UNIX command line
Carried out QC on sequence file
Mapped reads to reference genome
Detected peaks and motifs
Visualised data

78 Voluntary Exercise
Carry out the Software Carpentry Course on the Unix Shell:

79 Don't forget to log out!

