GE3M25: Data Analysis, Class 4


GE3M25: Data Analysis, Class 4. Karsten Hokamp, PhD, Genetics, TCD, 30/11/2017. http://bioinf.gen.tcd.ie/GE3M25/ngs

Python 6 Functions, Regex; NGS 1 Introduction; GE3M11 Exam; Week 10; Week 11; Python 1 Introduction; Python 2 Strings and Files; NGS 2 QC, Trimming; Week 12; Python 3 File I/O, Branching; Python 4 Modules, Lists, Sets; NGS 3 Mapping; Week 13; Python 5 Dictionaries; Python 6 Functions, Regex; NGS 4 Peak Calling; Week 14

NGS 5 Gene Lists, Tuning; NGS 6 / Python 7 Pipelines; NGS 7 / Python 8 Revision; Week 15; Python Exam; Week 16; January 2018: ChIP-Seq project report

Marks for GE3M25. Python exam: 50%. 2/3 data handling, 1/3 statistics. ChIP-Seq report: 50%.

Python exam
Date: Mon, 11th Dec, 11am – 12.45
Venue: Mac Lab
Structure: 10 multiple-choice questions (20 points); 4 programming tasks: 2 short ones (30 points each) and 2 more involved ones (50 points each)
Submission: multiple-choice test (1 sheet print-out); 1 – 2 Python scripts with execution output (file upload)

Python exam
Material: Anything from the course Website; Official Python documentation; Python Books
Content: Material covered during classes
Note: Add comments! Include a copy of the output (Terminal/IDLE). Include your Student ID in the script and file name. Submit frequently; only the last version counts. Even scripts that don't work can receive points.

Class 4: Project overview; Visualisation; Peak detection; Motif detection. http://bioinf.gen.tcd.ie/GE3M25/project

ChIP-Seq: Different sets of genes are expressed under different conditions. Expression is regulated through transcription factors that bind to promoters. This binding can be captured by ChIP. The enriched regions are revealed through NGS.

Class 1: ChIP-seq data analysis in a nutshell

ChIP-Seq Analysis Goal

Recap – From Reads to Peaks (Visualisation)
Reference (Fasta format) → bowtie2-build → Index files (*.bt2)
NGS data (FastQ format) + Index files → bowtie2 → Mapped reads (SAM format) → samtools → Sorting/indexing (*.bam, *.bai) → IGV

Recap – From Reads to Peaks (Visualisation)
Reference (Fasta format) → bowtie2-build → Index files (*.bt2)
NGS data (FastQ format) + Index files → bowtie2 → Mapped reads (SAM format) → samtools → Sorting/indexing (*.bam, *.bai) → BigWig file → IGV

Recap – From Reads to Peaks (Calling)
Reference (Fasta format) → bowtie2-build → Index files (*.bt2)
NGS data (FastQ format) + Index files → bowtie2 → Mapped reads (SAM format) → samtools → Sorting/indexing (*.bam, *.bai) → GEM → Peak list, motifs

Project Data http://bioinf.gen.tcd.ie/GE3M25/project Antimicrob. Agents Chemother. (2014)

Project Data: Three strains: Wild type, TAP-Pdr1, Pdr1-k.o.

Project Data: Three strains (Wild type, TAP-Pdr1, Pdr1-k.o.), two antibodies (Pdr1 antibody, TAP antibody)

Project Data Paul et al. Figure 2A

Project Data Potential consensus for the C. glabrata PDR1 binding site Paul et al. Figure 2B

GE3M25 Project, previous steps: Download FastQ data set (ChIP-Seq of TF in yeast) ✔; Quality assessment with FastQC ✔; Read mapping (Bowtie2) ✔; Generate indexed and sorted BAM file ✔; Visualisation in IGV ✔; Store BAM and index files ✔

GE3M25 Project Data Download: Start here: bioinf.gen.tcd.ie/GE3M25

GE3M25 Project Data Download: NGS page: bioinf.gen.tcd.ie/GE3M25/ngs

GE3M25 Project Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data Main data files (Fastq format)

GE3M25 Project Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data/fastq. Control data files and ChIP data files: download the files that have your student ID.

Preparations – new tools folder
1. Rename previous directory (in Terminal): mv tools tools.prev
If mv reports "No such file or directory", then there was no tools directory – that's ok!

GE3M25 Project Data Download: bioinf.gen.tcd.ie/GE3M25/ngs/data additional files in tools.zip

Preparations – Tools
1. Rename previous directory (in Terminal)
2. Download 'tools.zip' from webpage
3. Unpack archive (if not done by browser): unzip tools.zip
If you see "unzip: cannot find or open tools.zip, tools.zip.zip", then it was already unpacked during download.

Preparations – Tools
1. Rename previous directory (in Terminal)
2. Download 'tools.zip' from webpage
3. Unpack archive (if not done by browser)
4. Check content of the folder: ls -lh tools
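The preparation steps above can also be scripted. The following is a minimal Python sketch, not part of the course material. It assumes that tools.zip sits in the working directory and unpacks into a folder called tools. The function name prepare_tools and its workdir parameter are made up for illustration.

```python
import zipfile
from pathlib import Path


def prepare_tools(workdir="."):
    """Rename an existing tools folder, unpack tools.zip, list the result.

    Returns the sorted entry names found in the new tools folder.
    """
    workdir = Path(workdir)
    tools = workdir / "tools"
    previous = workdir / "tools.prev"
    archive = workdir / "tools.zip"

    # Step 1: keep the old folder, mirroring "mv tools tools.prev".
    # A missing tools folder is fine; an existing tools.prev is left alone.
    if tools.exists() and not previous.exists():
        tools.rename(previous)

    # Steps 2 and 3: unpack only if the archive is still there
    # (the browser may already have unpacked and removed it).
    if archive.exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(workdir)

    # Step 4: list the content, similar to "ls -lh tools".
    if not tools.is_dir():
        return []
    return sorted(p.name for p in tools.iterdir())
```

One caveat of this approach: Python's zipfile module does not restore Unix execute permissions, so programs unpacked this way may need chmod +x before they run. Unpacking with unzip in the Terminal, as on the slide, avoids that.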

Preparations

Preparations Download tools.zip (class 4) again if this is missing!

Data Processing: Indexing, Mapping, Compressing, Sorting, Visualisation

Data Processing: Indexing, Mapping, Compressing, Sorting, BigWig generation, Visualisation, Peak/Motif detection

Data Processing: Indexing, Mapping, Compressing, Sorting, BigWig generation, Visualisation, Peak/Motif detection. Mapping, Compressing and Sorting can be combined.

Data Processing: Indexing, Mapping | Compressing | Sorting, BigWig generation, Visualisation, Peak/Motif detection

GE3M25 Project – Read Mapping
Build an index of the genome.
Syntax: bowtie2-build fasta_file index_name
e.g. tools/bowtie2-build ASM254v2.fa C_glabrata
The index name (here C_glabrata) is to be used in the mapping step!
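As a hedged illustration of the indexing step, the Python sketch below builds the bowtie2-build call and checks whether an index is already in place. It relies on the fact that bowtie2-build writes six *.bt2 files for a standard (small) index. The helper names and the tools_dir parameter are assumptions for this example, not part of the course scripts.

```python
from pathlib import Path

# File suffixes that bowtie2-build writes for a standard (small) index.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def bowtie2_build_command(fasta_file, index_name, tools_dir="tools"):
    """Return the argument list for: bowtie2-build fasta_file index_name."""
    return [f"{tools_dir}/bowtie2-build", fasta_file, index_name]


def missing_index_files(index_name, directory="."):
    """List the expected index files that are not present in directory."""
    folder = Path(directory)
    return [
        index_name + suffix
        for suffix in BT2_SUFFIXES
        if not (folder / (index_name + suffix)).exists()
    ]


if __name__ == "__main__":
    if missing_index_files("C_glabrata"):
        print("Index incomplete, run:")
        print(" ".join(bowtie2_build_command("ASM254v2.fa", "C_glabrata")))
    else:
        print("Index C_glabrata is already in place.")
```

The argument list can be passed to subprocess.run(cmd, check=True) once the bowtie2-build program is available in the tools folder.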

GE3M25 Project – Read Mapping
Bowtie2 mapping:
1. Single-end data: bowtie2 -U 11111111_exp_1_fastq.bz2 -x C_glabrata -p 4 > exp1.sam
2. Paired-end data: bowtie2 -1 file1 -2 file2 -x C_glabrata -p 4 > exp.sam
e.g.: bowtie2 -1 11111111_exp_1_fastq.bz2 -2 11111111_exp_2_fastq.bz2 -x C_glabrata -p 4 > exp.sam
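The difference between the single-end and paired-end calls can be captured in a small helper. This is a sketch under assumptions: the function names are invented here, and the default path tools/bowtie2 assumes the program sits in the tools folder. Only the options shown on the slide (-x, -p, -U, -1, -2) are used.

```python
import subprocess


def bowtie2_command(index_name, reads, threads=4, bowtie2="tools/bowtie2"):
    """Build a bowtie2 call for single-end or paired-end reads.

    reads is either one file name (single-end, -U) or a pair of
    file names (paired-end, -1 and -2).
    """
    cmd = [bowtie2, "-x", index_name, "-p", str(threads)]
    if isinstance(reads, str):
        cmd += ["-U", reads]
    elif len(reads) == 2:
        cmd += ["-1", reads[0], "-2", reads[1]]
    else:
        raise ValueError("expected one file name or a pair of file names")
    return cmd


def run_to_file(cmd, out_path):
    """Run cmd and write its standard output to out_path, like 'cmd > out_path'."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


single = bowtie2_command("C_glabrata", "11111111_exp_1_fastq.bz2")
paired = bowtie2_command(
    "C_glabrata", ("11111111_exp_1_fastq.bz2", "11111111_exp_2_fastq.bz2")
)
print(" ".join(single))
print(" ".join(paired))
# run_to_file(paired, "exp.sam")  # requires bowtie2 and the index files
```

Passing the command as a list avoids shell quoting problems with unusual file names; the redirect into exp.sam is handled by run_to_file instead of the shell.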

GE3M25 Project – Sorting and Indexing
1. Change SAM to BAM format: tools/samtools view -b exp.sam > exp.bam
2. Sorting with 4 threads for speed-up: tools/samtools sort -@ 4 exp.bam > exp_sorted.bam
exp.sam and exp.bam are intermediates; exp_sorted.bam is the results file.

Data Processing: Mapping | Compressing | Sorting
tools/bowtie2 -1 file1 -2 file2 -x index | tools/samtools view -b - | tools/samtools sort - > out.bam
All on one line. The output from the left of a pipe is used as input on the right. File names are replaced with '-'. The final '>' redirects the output into a file.

Data Processing: Indexing, Mapping | Compressing | Sorting, e.g.:
tools/bowtie2 -x C_glabrata -p 4 -1 11111111_exp_1_fastq.bz2 -2 11111111_exp_2_fastq.bz2 | tools/samtools view -b - | tools/samtools sort - > exp.sorted.bam
Make the output name descriptive.
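The same idea of piping mapping, compressing and sorting can be expressed in Python with the standard subprocess module. The sketch below is one possible way to do it, not the method used in class. The function name run_pipeline is made up; the three commands are copied from the slide.

```python
import subprocess


def run_pipeline(commands, out_path):
    """Run commands connected by pipes; write the last stdout to out_path.

    commands is a list of argument lists. Returns one exit code per command.
    """
    processes = []
    previous_stdout = None
    with open(out_path, "wb") as out:
        for i, cmd in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=previous_stdout,
                stdout=out if is_last else subprocess.PIPE,
            )
            # Close our copy of the upstream pipe so that the upstream
            # process gets SIGPIPE if this process exits early.
            if previous_stdout is not None:
                previous_stdout.close()
            previous_stdout = proc.stdout
            processes.append(proc)
        return [p.wait() for p in processes]


mapping_steps = [
    ["tools/bowtie2", "-x", "C_glabrata", "-p", "4",
     "-1", "11111111_exp_1_fastq.bz2", "-2", "11111111_exp_2_fastq.bz2"],
    ["tools/samtools", "view", "-b", "-"],
    ["tools/samtools", "sort", "-"],
]
# run_pipeline(mapping_steps, "exp.sorted.bam")  # needs the programs in tools/
```

As with the shell version, no intermediate SAM or unsorted BAM file is written to disk; each step reads directly from the previous one.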

Data Processing: Indexing, Mapping | Compressing | Sorting, BigWig generation, Visualisation, Peak/Motif detection

Data Processing: The bigWig format is useful for dense, continuous data that will be displayed in the Genome Browser as a graph. File that lists BAM files.
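To make "dense, continuous data" concrete: a coverage track stores, for each stretch of the genome, how many reads overlap it. The Python sketch below computes such stretches from read intervals. It only illustrates the kind of data a bigWig file holds; it is not the tool used in class and it does not write the binary bigWig format. The function name and the 0-based, end-exclusive coordinates are assumptions for this example.

```python
def coverage_runs(chrom_length, reads):
    """Collapse read intervals into runs of equal coverage.

    reads: iterable of (start, end) pairs, 0-based, end exclusive.
    Returns a list of (start, end, depth) tuples spanning the chromosome.
    """
    # Difference array: +1 where a read starts, -1 where it ends.
    delta = [0] * (chrom_length + 1)
    for start, end in reads:
        start = min(max(start, 0), chrom_length)
        end = min(max(end, 0), chrom_length)
        if end > start:
            delta[start] += 1
            delta[end] -= 1

    runs = []
    depth = 0
    run_start = 0
    for pos in range(chrom_length):
        new_depth = depth + delta[pos]
        # Close the current run whenever the depth changes.
        if pos > 0 and new_depth != depth:
            runs.append((run_start, pos, depth))
            run_start = pos
        depth = new_depth
    if chrom_length > 0:
        runs.append((run_start, chrom_length, depth))
    return runs


# Two overlapping reads on a 10 bp toy chromosome.
for start, end, depth in coverage_runs(10, [(2, 6), (4, 8)]):
    print(start, end, depth)
```

A region bound by the transcription factor shows up as a run with high depth, which is why the track makes peaks visible across the whole genome at a glance.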

GE3M25 Project

Kill stuck IGV via Activity Monitor

GE3M25 Project: New file with .bw ending. Load the .bam and .bw files into IGV.

BigWig track visible across whole genome!

GE3M25 Project, data formats: Fastq, SAM, BAM, BAM index, BigWig
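Of these formats, FastQ is the simplest to read by hand: each read takes four lines (identifier, sequence, separator, quality string). The Python sketch below parses such records. It assumes unwrapped four-line records and Phred+33 quality encoding; the function name parse_fastq is made up for illustration.

```python
from itertools import islice


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from FastQ lines.

    handle is any iterable of text lines, e.g. an open file.
    """
    lines = iter(handle)
    while True:
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if not any(record):
            return  # end of input (or only blank lines left)
        if (
            len(record) < 4
            or not record[0].startswith("@")
            or not record[2].startswith("+")
        ):
            raise ValueError("malformed FastQ record")
        header, sequence, _, quality = record
        # Phred+33: the quality score is the character code minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)
```

Compressed input can be handled by passing a text-mode handle from gzip.open(path, "rt") or bz2.open(path, "rt") instead of a plain file.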

GE3M25 Project – Peak calling with GEM
Required input parameters: BAM file; Fasta file with reference sequence; File with chromosome size(s); Genome size; Read distribution; Output directory

GE3M25 Project – Peak calling with GEM
java -jar tools/gem/gem.jar --expt exp.sorted.bam --f BAM --genome . --g chrom.sizes.txt --s 12000000 --d tools/gem/Read_Distribution_default.txt --out peaks
--expt: BAM file
--genome: Directory with fasta file(s)
--g: File with chromosome size(s)
--s: Genome size
--d: Read distribution
--out: Output directory

GE3M25 Project Download these two files

GE3M25 Project Running Gem:

GE3M25 Project Output produced by GEM:

GE3M25 Project Check out top peaks: head peaks/peaks_GPS_events.txt

GE3M25 Project Peak calling with GEM

GE3M25 Project Peak calling with GEM Add parameters to initiate motif finding: --k_min 6 --k_max 13

GE3M25 Project Output produced by GEM: open peaks/peaks_result.htm

GE3M25 Project – Peak calling with GEM
Add a control file to remove noise: --ctrl ctrl.sorted.bam
Check how the detected peaks and motifs differ!
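The GEM call grows in three stages: the basic command, the motif parameters, and the control file. A small Python helper can assemble all three variants, which makes it easier to compare runs. This is a sketch: the function name gem_command and its parameter names are invented here, and only the options that appear on the slides are used.

```python
def gem_command(expt_bam, genome_dir, chrom_sizes, genome_size,
                read_distribution, out_name,
                k_min=None, k_max=None, ctrl_bam=None,
                gem_jar="tools/gem/gem.jar"):
    """Assemble the GEM command line as a list of arguments."""
    cmd = [
        "java", "-jar", gem_jar,
        "--expt", expt_bam, "--f", "BAM",
        "--genome", genome_dir,
        "--g", chrom_sizes,
        "--s", str(genome_size),
        "--d", read_distribution,
        "--out", out_name,
    ]
    # Motif finding is switched on by giving both k-mer length limits.
    if k_min is not None and k_max is not None:
        cmd += ["--k_min", str(k_min), "--k_max", str(k_max)]
    # A control sample is optional.
    if ctrl_bam is not None:
        cmd += ["--ctrl", ctrl_bam]
    return cmd


basic = gem_command(
    "exp.sorted.bam", ".", "chrom.sizes.txt", 12000000,
    "tools/gem/Read_Distribution_default.txt", "peaks",
)
with_motifs_and_control = gem_command(
    "exp.sorted.bam", ".", "chrom.sizes.txt", 12000000,
    "tools/gem/Read_Distribution_default.txt", "peaks",
    k_min=6, k_max=13, ctrl_bam="ctrl.sorted.bam",
)
print(" ".join(basic))
print(" ".join(with_motifs_and_control))
```

Giving each variant its own out_name (for example a hypothetical peaks_ctrl for the run with control) keeps the results apart so that the peaks and motifs can be compared afterwards.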

GE3M25 Project – Calculate chromosome sizes
tools/samtools idxstats exp_sorted.bam | cut -f 1,2 > chrom.sizes
(The GEM command shown earlier reads this file under the name chrom.sizes.txt.)
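The output of samtools idxstats has four tab-separated columns (reference name, sequence length, mapped reads, unmapped reads) and ends with a row named '*' for reads without a reference. The Python sketch below extracts the first two columns and also sums the lengths, which gives a value for the genome size parameter. The function names and the example input are made up; unlike the cut command above, this sketch drops the '*' row.

```python
def chrom_sizes_from_idxstats(idxstats_text):
    """Extract {reference name: length} from 'samtools idxstats' output."""
    sizes = {}
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or fields[0] == "*":
            # Skip blank lines and the '*' row for reads without a reference.
            continue
        sizes[fields[0]] = int(fields[1])
    return sizes


def format_chrom_sizes(sizes):
    """Format the sizes as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())


# Made-up example input; names and numbers are placeholders.
example = "chrA\t1000\t50\t0\nchrB\t2500\t120\t0\n*\t0\t0\t7\n"
sizes = chrom_sizes_from_idxstats(example)
print(format_chrom_sizes(sizes), end="")
print("genome size:", sum(sizes.values()))
```

For real data the input text can be obtained with subprocess.run(["tools/samtools", "idxstats", "exp_sorted.bam"], capture_output=True, text=True).stdout, provided the BAM file has been indexed.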

GE3M25 Project – Storage of results files
Upload .bam, .bam.bai, .bw etc. through bioinf.gen.tcd.ie/GE3M25/project

Don't forget to log out!