NGS File formats Raw data from various vendors => various formats

NGS File formats
Raw data from various vendors => various formats
Different quality metrics (some more stringent than others)
As data analysis proceeds, we end up with even more formats:
- GenBank formats (SRA)
- Alignments in SAM/BAM
- Genome browser formats (wig, bed, gff, etc.)
- Variants in VCF files (SNPs, indels, etc.)

What if there is no Galaxy?
Look for sequence read data in the Sequence Read Archive (SRA) at http://www.ncbi.nlm.nih.gov/sra or in the European Nucleotide Archive (ENA) at http://www.ebi.ac.uk/ena

We’ve got read data in SRA format. Now what?
We need to convert it to FASTQ format to use TopHat, STAR, etc. on Pegasus2.

SRA toolkit http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software

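SRA toolkit
Print the first five spots of the run to standard output:
    fastq-dump -X 5 -Z SRR390728
Write the reads to FASTQ files, one file per read of a pair, with the read ID appended to each read name:
    fastq-dump -I --split-files SRR390728
Write FASTA instead of FASTQ, with 60 bases per line:
    fastq-dump --split-files --fasta 60 SRR390728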

FASTQ file format
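
For reference, each FASTQ record spans four lines: an identifier line starting with @, the base sequence, a separator line starting with +, and a quality string with one character per base. The following is a minimal sketch, not part of the original slides, that writes a tiny FASTQ file and counts its records. The read names and sequences are invented for illustration.

```shell
#!/bin/bash
# Write a two-read FASTQ file (read names and sequences are invented for illustration).
fastq_file="$(mktemp)"
cat > "$fastq_file" <<'EOF'
@read1
GATTACA
+
IIIIIII
@read2
ACGTAC
+
!!!III
EOF

# Each record is 4 lines, so the number of reads is the line count divided by 4.
n_lines=$(wc -l < "$fastq_file")
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"

# Sanity check: the sequence (line 2 of a record) and the quality string (line 4)
# must have the same length.
bad_records=$(awk 'NR % 4 == 2 { len = length($0) }
                   NR % 4 == 0 && length($0) != len { bad++ }
                   END { print bad + 0 }' "$fastq_file")
echo "records with mismatched lengths: $bad_records"
```

The same two checks can be pointed at a real file by replacing "$fastq_file" with its path.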

Working on Pegasus2
We’ve got our FASTQ file(s). To align the reads to the reference genome we will use TopHat on Pegasus2.
Transfer the data with scp, either into your home directory:
    scp yourfile.fastq <user>@pegasus2-gw.ccs.miami.edu:~/.
or into your scratch directory:
    scp yourfile.fastq <user>@pegasus2-gw.ccs.miami.edu:/scratch/<user>

Preparing to use TopHat on Pegasus2
TopHat is built on top of the non-splice-aware aligner Bowtie, so to use TopHat you must also have Bowtie available. Both are installed as modules on Pegasus2.
To load TopHat and Bowtie, type:
    module load bowtie2/2.2.2
    module load tophat/2.0.11
To see a complete list of all available modules, type:
    module avail
To confirm that the modules have been loaded properly, type:
    which tophat
    which bowtie2

Preparing data for TopHat use
We need to give TopHat information about the genome to which we want to align our data. Assuming we are logged in to our accounts on Pegasus2:
1.) First, we need to download the genome sequence as a FASTA file. For human: ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    wget -O GRCh38.fa.gz ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
The file is gzip-compressed, so decompress it to obtain GRCh38.fa:
    gunzip GRCh38.fa.gz
2.) Second, we need to download (or construct) an indexed version of the genome for Bowtie2 to work with. Pre-built indexes for Bowtie2 can be found at http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Preparing data for TopHat use
However, Bowtie2 also has a command to build this index from the genome sequence file:
    bowtie2-build -f GRCh38.fa GRCh38
Here GRCh38.fa contains the genome sequences in FASTA format, and GRCh38 is the base name of the index. This will create a series of files called GRCh38.1.bt2, GRCh38.2.bt2, GRCh38.rev.1.bt2, etc.
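
Before submitting a long alignment job it can be worth checking that the index build finished. The sketch below assumes a standard (small) Bowtie2 index, which consists of six files ending in .1.bt2 to .4.bt2, .rev.1.bt2 and .rev.2.bt2. The function name index_complete is a hypothetical helper, not part of Bowtie2.

```shell
#!/bin/bash
# index_complete BASENAME: succeed only if all six Bowtie2 index files
# exist and are non-empty for the given base name.
# (index_complete is a hypothetical helper, not a Bowtie2 command.)
index_complete() {
    local base="$1" suffix
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "${base}.${suffix}" ]; then
            echo "missing or empty: ${base}.${suffix}" >&2
            return 1
        fi
    done
    return 0
}

# Example usage with the base name GRCh38 in the current directory.
if index_complete GRCh38 2>/dev/null; then
    echo "Bowtie2 index for GRCh38 looks complete"
else
    echo "Bowtie2 index for GRCh38 is incomplete; rerun bowtie2-build"
fi
```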

Preparing data for TopHat use
3.) Third, we need to download an annotation file containing all the known genes and transcripts for this genome. This third step is technically optional, but it helps to improve the accuracy of splice junction calling and is generally recommended. We are also going to need this file when we quantify the transcripts present, so we might as well use it now too.
Genome annotation for human: ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
    wget -O GRCh38.gtf.gz ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
    gunzip GRCh38.gtf.gz
We are ready to run TopHat!
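
A GTF file is tab-separated text with nine columns: the sequence name is in column 1 and the feature type (gene, transcript, exon, ...) in column 3. As a rough sketch of how to take a first look at such a file, the example below counts features by type. The three annotation lines and their IDs are invented stand-ins, not taken from the Ensembl file.

```shell
#!/bin/bash
# A three-line stand-in for a GTF file (tab-separated, 9 columns); the IDs are invented.
gtf_file="$(mktemp)"
printf '1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' > "$gtf_file"
printf '1\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf_file"
printf '1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf_file"

# Count features by type (column 3), skipping '#' header lines.
n_genes=$(awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$gtf_file")
n_exons=$(awk -F'\t' '!/^#/ && $3 == "exon" { n++ } END { print n + 0 }' "$gtf_file")
echo "genes: $n_genes, exons: $n_exons"

# List the sequence names used in column 1.
seq_names=$(awk -F'\t' '!/^#/ { print $1 }' "$gtf_file" | sort -u)
echo "sequence names: $seq_names"
```

The sequence names in column 1 should match the sequence names in the genome FASTA file used to build the index, so the last command is a quick way to spot a mismatch.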

Running TopHat on Pegasus2
To run a job on Pegasus2 in the background we need to create a shell script (http://ccs.miami.edu/hpc/support/faq/):
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    # Your actual commands for the job are going to be placed here.
The header lines mean:
- #!/bin/bash: which shell to use
- -J: job name
- -e: file for stderr
- -o: file for stdout
- -n: number of cores
- -q: queue (more info: http://ccs.miami.edu/hpc/doc/pegasus2-queues/)
- -W: time allocation (hours:minutes)
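
When several samples need the same header, the script can be generated rather than copied by hand. The following is only a sketch: write_lsf_script is a hypothetical helper (not part of LSF or Pegasus2), and the queue name general is carried over from the slide above.

```shell
#!/bin/bash
# write_lsf_script NAME CORES HOURS: write NAME.sh with a #BSUB header.
# (write_lsf_script is a hypothetical helper, not part of LSF or Pegasus2.)
write_lsf_script() {
    local name="$1" cores="$2" hours="$3"
    cat > "${name}.sh" <<EOF
#!/bin/bash
#BSUB -J ${name}
#BSUB -e ${name}.err
#BSUB -o ${name}.out
#BSUB -n ${cores}
#BSUB -q general
#BSUB -W ${hours}:00
# Your actual commands for the job go here.
EOF
}

# Example: a 4-core, 72-hour job called Tophat_job1, written into a temporary directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
write_lsf_script Tophat_job1 4 72
cat Tophat_job1.sh
```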

Running TopHat
The complete list of TopHat parameters and their official descriptions is at http://ccb.jhu.edu/software/tophat/manual.shtml
Optional parameters:
    -G … specify a genomic annotation to use
    -o … set the output directory
Required parameters:
    the 'base name' of the genome sequence/index data, so that TopHat can find it
    the FASTQ files to use

Running TopHat
Simplest example: one FASTQ file. The last two arguments are the 'base name' of the index and the FASTQ file.
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-Seq/Sample1.R1.fastq

Running TopHat
Paired-end reads: give the R1 file and then the R2 file, separated by a space.
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-Seq/Sample1.R1.fastq ~/RNA-Seq/Sample1.R2.fastq

Running TopHat
Multiple FASTQ files: use the header below with one of the two tophat commands that follow. Both commands write to the same output directory, so they are alternatives and should not be run in the same job.
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
Multiple FASTQ files (comma-separated, no spaces):
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq
Multiple FASTQ files with paired-end reads (the comma-separated R1 list, a space, then the R2 list in the same lane order):
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq ~/FastQFiles/Sample1.Lane1.R2.fastq,~/FastQFiles/Sample1.Lane2.R2.fastq,~/FastQFiles/Sample1.Lane3.R2.fastq
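
Typing long comma-separated lists by hand is error-prone, so the lists can be assembled from the file names instead. The sketch below uses empty placeholder files in a temporary directory (the directory and the Sample1.LaneN naming are assumptions for illustration) and only prints the resulting tophat command so it can be checked before use.

```shell
#!/bin/bash
# Create empty placeholder files standing in for three lanes of paired-end reads.
fastq_dir="$(mktemp -d)"
for lane in 1 2 3; do
    touch "$fastq_dir/Sample1.Lane${lane}.R1.fastq" "$fastq_dir/Sample1.Lane${lane}.R2.fastq"
done

# Join all R1 files, then all R2 files, with commas and no spaces.
# The shell expands each glob in sorted order, so both lists have the same lane order.
r1_list=$(printf '%s\n' "$fastq_dir"/Sample1.Lane*.R1.fastq | paste -sd, -)
r2_list=$(printf '%s\n' "$fastq_dir"/Sample1.Lane*.R2.fastq | paste -sd, -)

# Print the command instead of running it, so it can be inspected first.
echo tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 "$r1_list" "$r2_list"
```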

Finally, submit your TopHat jobs
Submit a job with:
    bsub < script.sh
Other useful commands:
    bjobs            returns the status of current jobs
    bkill <jobid>    kills the job with <jobid>
A successful run of TopHat will return the following files:
    accepted_hits.bam (holds our results)
    junctions.bed
    insertions.bed
    deletions.bed
To view a BAM file you need samtools:
    module load samtools/0.1.19
    samtools view accepted_hits.bam
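
The output of samtools view is tab-separated SAM text, with the reference sequence name in column 3, so it can be piped into standard tools for a quick summary. The following sketch counts alignments per reference sequence. The helper count_by_reference is hypothetical, and three invented alignment lines stand in for real samtools output so that the example runs without samtools.

```shell
#!/bin/bash
# count_by_reference: read SAM records on stdin and print "<reference> <count>"
# for each reference name (column 3), skipping '@' header lines.
# (count_by_reference is a hypothetical helper, not a samtools command.)
count_by_reference() {
    awk -F'\t' '!/^@/ { n[$3]++ } END { for (r in n) print r, n[r] }' | sort
}

# Three invented alignment lines stand in for `samtools view accepted_hits.bam`.
sam_file="$(mktemp)"
printf 'r1\t0\t1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' > "$sam_file"
printf 'r2\t16\t1\t200\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$sam_file"
printf 'r3\t0\t2\t300\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$sam_file"

counts=$(count_by_reference < "$sam_file")
echo "$counts"

# On the cluster, the same function could be fed real data:
#   samtools view accepted_hits.bam | count_by_reference
```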