GE3M25: Data Analysis, Class 4 TCD, 30/11/2017 http://bioinf.gen.tcd.ie/GE3M25/ngs Karsten Hokamp, PhD Genetics
Python 6 Functions, Regex NGS 1 Introduction GE3M11 Exam Week 10 Week 11 Python 1 Introduction Python 2 Strings and Files NGS 2 QC, Trimming Week 12 Python 3 File I/O, Branching Python 4 Modules, Lists, Sets NGS 3 Mapping Week 13 Python 5 Dictionaries Python 6 Functions, Regex NGS 4 Peak Calling Week 14
ChIP-Seq project report NGS 5 Gene Lists, Tuning NGS 6 / Python 7 Pipelines NGS 7 / Python 8 Revision Week 15 Python Exam Week 16 ChIP-Seq project report January 2018:
Marks for GE3M25
Python exam: 50%
  2/3 data handling
  1/3 statistics
ChIP-Seq report: 50%
Python exam
Date: Mon, 11th Dec, 11am – 12.45
Venue: Mac Lab
Structure:
  10 multiple-choice questions (20 points)
  4 programming tasks:
    2 short ones (30 points each)
    2 more involved ones (50 points each)
Submission:
  multiple-choice test (1 sheet print-out)
  1 – 2 Python scripts with execution output (file upload)
Python exam
Material:
  Anything from the course website
  Official Python documentation
  Python books
Content:
  Material covered during classes
Note:
  Add comments!
  Include a copy of the output (Terminal/Idle)
  Include your student ID in the script and in the file name
  Submit frequently – only the last version counts
  Even scripts that don't work can receive points
Class 4: Project overview Visualisation Peak detection Motif detection http://bioinf.gen.tcd.ie/GE3M25/project
ChIP-Seq
Different sets of genes are expressed under different conditions
Expression is regulated through transcription factors that bind to promoters
Binding can be captured by ChIP
Enriched regions are revealed through NGS
Class 1: ChIP-seq data analysis in a nutshell
ChIP-Seq Analysis Goal
Recap – From Reads to Peaks (Visualisation)
Reference (Fasta format) -> bowtie2-build -> Index files (*.bt2)
NGS data (FastQ format) + Index files (*.bt2) -> bowtie2 -> Mapped reads (SAM format)
Mapped reads (SAM format) -> samtools -> Sorting/indexing (*.bam, *.bai) -> IGV
Recap – From Reads to Peaks (Visualisation)
Reference (Fasta format) -> bowtie2-build -> Index files (*.bt2)
NGS data (FastQ format) + Index files (*.bt2) -> bowtie2 -> Mapped reads (SAM format)
Mapped reads (SAM format) -> samtools -> Sorting/indexing (*.bam, *.bai) -> BigWig file -> IGV
Recap – From Reads to Peaks (Calling)
Reference (Fasta format) -> bowtie2-build -> Index files (*.bt2)
NGS data (FastQ format) + Index files (*.bt2) -> bowtie2 -> Mapped reads (SAM format)
Mapped reads (SAM format) -> samtools -> Sorting/indexing (*.bam, *.bai) -> Gem -> Peak list, motifs
Project Data http://bioinf.gen.tcd.ie/GE3M25/project Antimicrob. Agents Chemother. (2014)
Project Data Three strains: Wild type TAP-Pdr1 Pdr1-k.o.
Project Data Three strains, two antibodies Wild type TAP-Pdr1 Pdr1-k.o. Pdr1 antibody TAP antibody
Project Data Paul et al. Figure 2A
Project Data Potential consensus for the C. glabrata PDR1 binding site Paul et al. Figure 2B
GE3M25 Project
Previous steps:
  Download FastQ data set (ChIP-Seq of TF in yeast) ✔
  Quality assessment with FastQC ✔
  Read mapping (Bowtie2) ✔
  Generate indexed and sorted BAM file ✔
  Visualisation in IGV ✔
  Store BAM and index files ✔
GE3M25 Project Data Download: Start here: bioinf.gen.tcd.ie/GE3M25
GE3M25 Project Data Download: NGS page: bioinf.gen.tcd.ie/GE3M25/ngs
GE3M25 Project Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data Main data files (Fastq format)
GE3M25 Project
Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data/fastq
  Control data files
  ChIP data files
Download the files that carry your student ID.
Preparations – new tools folder
1. Rename previous directory (in Terminal):
   mv tools tools.prev
   If you see
   mv: rename tools to tools.prev: No such file or directory
   then there was no tools directory – that's ok!
GE3M25 Project Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data additional files in tools.zip
Preparations
Tools
1. Rename previous directory (in Terminal)
2. Download 'tools.zip' from webpage
3. Unpack archive (if not done by browser):
   unzip tools.zip
   If you see
   unzip: cannot find or open tools.zip, tools.zip.zip
   then it was already unpacked during download
Preparations
Tools
1. Rename previous directory (in Terminal)
2. Download 'tools.zip' from webpage
3. Unpack archive (if not done by browser)
4. Check content of the folder:
   ls -lh tools
Preparations
Preparations Download tools.zip (class 4) again if this is missing!
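The check on this slide can also be done from Python. The sketch below is only an illustration: the slide text does not say which item must be present, so the list of expected paths is an assumption compiled from the commands used elsewhere in these slides, and the function name missing_tools is made up for this example.

```python
from pathlib import Path

# Assumption: these are the paths that commands in the slides refer to.
EXPECTED_PATHS = [
    "tools/bowtie2",
    "tools/bowtie2-build",
    "tools/samtools",
    "tools/gem/gem.jar",
    "tools/gem/Read_Distribution_default.txt",
]


def missing_tools(workdir=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist below workdir."""
    workdir = Path(workdir)
    return [rel for rel in expected if not (workdir / rel).exists()]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing, download tools.zip again:", ", ".join(missing))
    else:
        print("All expected tools found")
```

An empty list means everything the later commands need is in place.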
Data Processing Indexing Mapping Compressing Sorting Visualisation
Data Processing Indexing Mapping Compressing Sorting BigWig generation Visualisation Peak/Motif detection
Data Processing Indexing Mapping Compressing Sorting BigWig generation Visualisation Peak/Motif detection can be combined
Data Processing Indexing Mapping | Compressing | Sorting BigWig generation Visualisation Peak/Motif detection
GE3M25 Project – Read Mapping
Build an index of the genome:
  Syntax: bowtie2-build fasta_file index_name
  e.g. tools/bowtie2-build ASM254v2.fa C_glabrata
The index name (here C_glabrata) is used again in the mapping step!
GE3M25 Project – Read Mapping
Bowtie2 mapping:
1. Single-end data:
   bowtie2 -U 11111111_exp_1_fastq.bz2 -x C_glabrata -p 4 > exp1.sam
2. Paired-end data:
   bowtie2 -1 file1 -2 file2 -x C_glabrata -p 4 > exp.sam
   e.g.:
   bowtie2 -1 11111111_exp_1_fastq.bz2 -2 11111111_exp_2_fastq.bz2 -x C_glabrata -p 4 > exp.sam
GE3M25 Project – Sorting and Indexing
1. Change SAM to BAM format:
   tools/samtools view -b exp.sam > exp.bam
2. Sorting with 4 threads for speed-up:
   tools/samtools sort -@ 4 exp.bam > exp_sorted.bam
exp.sam and exp.bam are intermediates; exp_sorted.bam is the results file.
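The difference between the single-end and paired-end calls comes down to which read options are passed. A minimal Python sketch of that choice follows; the function name bowtie2_command and its parameters are hypothetical, and only the options shown on the slide (-U, -1, -2, -x, -p) are used.

```python
def bowtie2_command(index, reads, threads=4, program="tools/bowtie2"):
    """Build the bowtie2 argument list for one (single-end) or two (paired-end) read files."""
    reads = list(reads)
    if len(reads) == 1:
        read_args = ["-U", reads[0]]                   # single-end data
    elif len(reads) == 2:
        read_args = ["-1", reads[0], "-2", reads[1]]   # paired-end data
    else:
        raise ValueError("expected one or two read files")
    return [program, *read_args, "-x", index, "-p", str(threads)]


# Paired-end example with the file names from the slide
command = bowtie2_command(
    "C_glabrata",
    ["11111111_exp_1_fastq.bz2", "11111111_exp_2_fastq.bz2"],
)
print(" ".join(command))

# To run it and write the SAM output to a file (requires bowtie2 to be installed):
#   import subprocess
#   with open("exp.sam", "w") as sam:
#       subprocess.run(command, stdout=sam, check=True)
```

Passing the command as a list avoids quoting problems with file names; the redirect (>) of the Terminal version becomes the stdout argument.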
Data Processing
Mapping | Compressing | Sorting
  tools/bowtie2 -1 file1 -2 file2 -x index | tools/samtools view -b - | tools/samtools sort - > out.bam
Notes:
  all on one line
  output from the left of a pipe is used as input on the right
  file names are replaced with '-'
  '>' redirects the output into a file
Data Processing
Indexing
Mapping | Compressing | Sorting, e.g.:
  tools/bowtie2 -x C_glabrata -p 4 -1 11111111_exp_1_fastq.bz2 -2 11111111_exp_2_fastq.bz2 | tools/samtools view -b - | tools/samtools sort - > exp.sorted.bam
Make the output name descriptive.
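How a pipe hands data from one program to the next can be reproduced in Python with the subprocess module. The sketch below is an illustration under assumptions: run_pipeline is a hypothetical helper, and the demo uses two tiny Python one-liners as stand-ins for bowtie2 and samtools so that it runs without those tools installed.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands like 'cmd1 | cmd2 | ...' and return the final output as bytes."""
    if not commands:
        raise ValueError("need at least one command")
    procs = []
    upstream = subprocess.DEVNULL  # the first command gets no input
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # Close our copy of the previous output; the new process now owns it
            procs[-1].stdout.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    return output


# Stand-in stages: the first prints two lines, the second upper-cases its input
demo = [
    [sys.executable, "-c", "print('read1'); print('read2')"],
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
]
result = run_pipeline(demo)
print(result.decode())
```

With the real tools the list would hold the three commands from the slide (bowtie2, samtools view -b -, samtools sort -), and the returned bytes would be written to the .bam file in binary mode.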
Data Processing
The bigWig format is useful for dense, continuous data that will be displayed in the Genome Browser as a graph.
file that lists BAM files
GE3M25 Project
Kill stuck IGV via Activity Monitor
GE3M25 Project New file with .bw ending: Load .bam and .bw files into IGV
BigWig track visible across whole genome!
GE3M25 Project Data formats: Fastq SAM BAM BAM index BigWig
GE3M25 Project
Peak calling with GEM
Required input parameters:
  BAM file
  Fasta file with reference sequence
  File with chromosome size(s)
  Genome size
  Read distribution
  Output directory
GE3M25 Project
Peak calling with GEM
  java -jar tools/gem/gem.jar --expt exp.sorted.bam --f BAM --genome . --g chrom.sizes.txt --s 12000000 --d tools/gem/Read_Distribution_default.txt --out peaks
  --expt    BAM file
  --genome  Directory with fasta file(s)
  --g       File with chromosome size(s)
  --s       Genome size
  --d       Read distribution
  --out     Output directory
GE3M25 Project Download these two files
GE3M25 Project Running Gem:
GE3M25 Project Output produced by GEM:
GE3M25 Project Check out top peaks: head peaks/peaks_GPS_events.txt
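The same first look at the peak list can be taken from Python, which makes it easier to pick out single columns afterwards. This is a sketch under an assumption: it treats the file as tab-separated text with one header line, and the column names in the comments are whatever the header provides, not names guaranteed by GEM. The helper name top_rows is hypothetical.

```python
import csv
from itertools import islice


def top_rows(path, n=10):
    """Return the first n data rows of a tab-separated file as dictionaries."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")  # header line gives the keys
        return list(islice(reader, n))


if __name__ == "__main__":
    # File name as on the slide; each row maps column name -> value (as text)
    for row in top_rows("peaks/peaks_GPS_events.txt", n=5):
        print(row)
```

All values come back as strings, so numeric columns need float() or int() before they can be sorted or compared.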
GE3M25 Project Peak calling with GEM
GE3M25 Project Peak calling with GEM Add parameters to initiate motif finding: --k_min 6 --k_max 13
GE3M25 Project Output produced by GEM: open peaks/peaks_result.htm
GE3M25 Project Peak calling with GEM Add control file to remove noise: --ctrl ctrl.sorted.bam Check how detected peaks/motif differ!
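Since the GEM call grows with each step (basic run, motif finding, control file), it can help to assemble it in one place. The following is a sketch, not GEM's own interface: gem_command and its parameter names are hypothetical, the defaults are the values from the slides, and only options that appear in the slides are used.

```python
def gem_command(expt_bam, out_dir="peaks", genome_dir=".",
                chrom_sizes="chrom.sizes.txt", genome_size=12000000,
                read_dist="tools/gem/Read_Distribution_default.txt",
                k_range=None, ctrl_bam=None, jar="tools/gem/gem.jar"):
    """Build the argument list for a GEM run with optional motif finding and control."""
    cmd = [
        "java", "-jar", jar,
        "--expt", expt_bam, "--f", "BAM",
        "--genome", genome_dir,
        "--g", chrom_sizes,
        "--s", str(genome_size),
        "--d", read_dist,
        "--out", out_dir,
    ]
    if k_range is not None:
        # Motif finding: minimum and maximum k-mer length
        k_min, k_max = k_range
        cmd += ["--k_min", str(k_min), "--k_max", str(k_max)]
    if ctrl_bam is not None:
        # Control sample to remove noise
        cmd += ["--ctrl", ctrl_bam]
    return cmd


basic = gem_command("exp.sorted.bam")
full = gem_command("exp.sorted.bam", k_range=(6, 13), ctrl_bam="ctrl.sorted.bam")
print(" ".join(full))
```

Giving each variant its own out_dir keeps the results of the runs side by side, which makes it easier to check how the detected peaks and motifs differ.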
GE3M25 Project
Calculate chromosome sizes:
  tools/samtools idxstats exp_sorted.bam | cut -f 1,2 > chrom.sizes.txt
Use the same file name that is passed to GEM with --g.
GE3M25 Project
Storage of results files
Upload .bam, .bam.bai, .bw etc. through bioinf.gen.tcd.ie/GE3M25/project
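What the idxstats/cut combination extracts can be sketched in Python. samtools idxstats prints one tab-separated line per reference sequence (name, length, mapped reads, unmapped reads) and ends with a line named '*' for reads without a reference; the sketch keeps the first two columns and skips that last line. The function name and the example chromosome names are made up for illustration.

```python
def chrom_sizes_from_idxstats(idxstats_text):
    """Extract {sequence name: length} from the text printed by 'samtools idxstats'."""
    sizes = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if name == "*":
            continue  # summary line for reads without a reference sequence
        sizes[name] = length
    return sizes


# Made-up example in the idxstats layout: name, length, mapped, unmapped
example = "chrA\t1000\t50\t0\nchrB\t2500\t120\t3\n*\t0\t0\t7\n"
sizes = chrom_sizes_from_idxstats(example)
for name, length in sizes.items():
    print(f"{name}\t{length}")
print("total length:", sum(sizes.values()))
```

The total of all lengths gives an upper bound to compare against the genome size passed to GEM with --s.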
Don't forget to log out!