NGS File formats Raw data from various vendors => various formats

Presentation transcript:

NGS file formats
Raw data from various vendors => various formats, with different quality metrics (some more stringent than others).
As data analysis proceeds, we end up with even more formats:
    GenBank formats (SRA)
    Alignments in SAM/BAM
    Genome browser formats (wig, bed, gff, etc.)
    Variants in VCF files (SNPs, indels, etc.)

What if there is no Galaxy …? Look for sequence read data in the Sequence Read Archive (SRA), http://www.ncbi.nlm.nih.gov/sra, or in the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena

We’ve got read data in sra format. Now what? We need to convert to FASTQ format to use TopHat, STAR, etc. on Pegasus2.

SRA toolkit http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software

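SRA toolkit
    fastq-dump -X 5 -Z SRR390728
prints the first five spots (-X 5) to standard output (-Z)
    fastq-dump -I --split-files SRR390728
writes one FASTQ file per read of a pair (--split-files) and appends the read number to each read name (-I)
    fastq-dump --split-files --fasta 60 SRR390728
writes split FASTA files (--fasta) with 60 bases per line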

FASTQ file format
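A FASTQ record spans four lines: a header starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below writes a tiny two-read file and counts its reads. It assumes an uncompressed file with unwrapped four-line records; the file name `example.fastq`, the read data, and the helper `count_reads` are made up for illustration.

```shell
#!/bin/bash
# Write a tiny two-read FASTQ file (illustrative data, not taken from SRR390728).
cat > example.fastq <<'EOF'
@read1
GATTACA
+
IIIIIII
@read2
ACGTACGT
+
IIIIHHHH
EOF

# Count reads, assuming every record occupies exactly four lines.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

count_reads example.fastq   # prints 2
```

A read count that is not a whole number is a quick hint that the file is truncated or has wrapped lines.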

Working on Pegasus2
We've got our FASTQ file(s). To align the reads to the reference genome we will use TopHat on Pegasus2.
Transfer the data with scp:
    scp yourfile.fastq <user>@pegasus2-gw.ccs.miami.edu:~/.
copies the file into your home directory
    scp yourfile.fastq <user>@pegasus2-gw.ccs.miami.edu:/scratch/<user>
copies the file into your scratch directory

Preparing to use TopHat on Pegasus2
TopHat was built on top of the non-splice-aware aligner Bowtie, so in order to use TopHat you must also have Bowtie available. TopHat and Bowtie are both available as modules on Pegasus2.
To load TopHat and Bowtie, type the following commands:
    module load bowtie2/2.2.2
    module load tophat/2.0.11
To see a complete list of all available modules, type:
    module avail
To confirm that the modules have been loaded properly, type:
    which tophat
    which bowtie2

Preparing data for TopHat use
We need to give TopHat information about the genome to which we want to align our data. Assuming we are in our accounts on Pegasus:
1.) First, we need to download the genome sequence as a FASTA file. For human: ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    wget -O GRCh38.fa.gz ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
2.) Second, we need to download (or construct) an indexed version of the genome for Bowtie2 to work with. Pre-built indexes for Bowtie2 can be found at http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Preparing data for TopHat use
However, Bowtie2 has a command to build this index from the genome sequence file:
    bowtie2-build -f GRCh38.fa GRCh38
This assumes GRCh38.fa contains the genome sequences in uncompressed FASTA format, so decompress the download first (for example with gunzip GRCh38.fa.gz). The command creates a series of files called GRCh38.1.bt2, GRCh38.rev.1.bt2, GRCh38.2.bt2, etc.

Preparing data for TopHat use
3.) Third, we need to download an annotation file containing all the known genes and transcripts for this genome. This third step is technically optional, but it helps to improve the accuracy of splice junction calling and is generally recommended. We are also going to need this file when we quantify the transcripts present, so we might as well use it now too.
Genome annotation, human: ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
    wget -O GRCh38.gtf.gz ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
    gunzip GRCh38.gtf.gz
The download is gzip-compressed, so it is saved with a .gz name and decompressed to GRCh38.gtf before use.
We are ready to run TopHat!

Running TopHat on Pegasus2
To run a job on Pegasus2 in the background we need to create a shell script (http://ccs.miami.edu/hpc/support/faq/):
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    Your actual commands for the job are going to be placed here.
Line by line:
    #!/bin/bash    which shell?
    -J    job name
    -e    file for stderr
    -o    file for stdout
    -n    number of cores
    -q    queue, more info at http://ccs.miami.edu/hpc/doc/pegasus2-queues/
    -W    time allocation (hours:minutes)

Running TopHat
The complete list of TopHat parameters and their official descriptions is at http://ccb.jhu.edu/software/tophat/manual.shtml
Optional parameters:
    -G … specify a genomic annotation to use
    -o … specify the output directory
Required parameters:
    - the 'base name' of the genome sequence/index data, so that TopHat can find it
    - the FASTQ files to use

Running TopHat
Simplest example: 'base name' and one FASTQ file
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-seq/Sample1.R1.fastq
In the tophat command, ~/RNA-Seq/hg38/hg38 is the 'base name' and ~/RNA-seq/Sample1.R1.fastq is the one FASTQ file.

Running TopHat
Paired-end reads
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-seq/Sample1.R1.fastq ~/RNA-seq/Sample1.R2.fastq
The two FASTQ files, separated by a space, hold the first and second reads of each pair.

Running TopHat
Multiple FASTQ files
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
Multiple FASTQ files (single-end): separate the files with commas, without spaces
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq
Multiple FASTQ files with paired-end reads: one comma-separated list for R1, a space, then one comma-separated list for R2
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq ~/FastQFiles/Sample1.Lane1.R2.fastq,~/FastQFiles/Sample1.Lane2.R2.fastq,~/FastQFiles/Sample1.Lane3.R2.fastq
These are two alternative commands; use the one that matches your data.
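Typing long comma-separated file lists by hand is error-prone, so here is a minimal sketch of building them in the shell. The helper `join_by_comma` and the output directory name `tophat_out` are hypothetical, and the lane file names simply mirror the example above. The sketch only prints the resulting tophat command so it can be inspected before it goes into a job script. The R1 and R2 lists should name the lanes in the same order.

```shell
#!/bin/bash
# Join all arguments with commas (runs in a subshell so IFS is not changed globally).
join_by_comma() (
    IFS=,
    echo "$*"
)

# Hypothetical lane files for one sample; replace with your own paths.
r1_files=$(join_by_comma Sample1.Lane1.R1.fastq Sample1.Lane2.R1.fastq Sample1.Lane3.R1.fastq)
r2_files=$(join_by_comma Sample1.Lane1.R2.fastq Sample1.Lane2.R2.fastq Sample1.Lane3.R2.fastq)

# Print the command instead of running it, so the lists can be checked first.
echo tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o tophat_out ~/RNA-Seq/hg38/hg38 "$r1_files" "$r2_files"
```

Once the printed command looks right, removing the leading `echo` turns it into the real call.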

Finally, submit your TopHat jobs
    bsub < script.sh    submits a job
    bjobs               returns the status of current jobs
    bkill <jobid>       kills the job with <jobid>
A successful run of TopHat will return the following files:
    accepted_hits.bam   holds our results
    junctions.bed
    insertions.bed
    deletions.bed
To view a BAM file you need:
    module load samtools/0.1.19
    samtools view accepted_hits.bam
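Before moving on, it can help to confirm that a finished job actually produced the four files listed above. The following is a minimal sketch, not part of TopHat itself: `check_tophat_output` is a hypothetical helper name, and the check only tests that each file exists, not that its contents are valid.

```shell
#!/bin/bash
# Report whether a TopHat output directory contains the expected result files.
# The file names come from the list above; the function name is made up.
check_tophat_output() {
    dir=$1
    status=0
    for f in accepted_hits.bam junctions.bed insertions.bed deletions.bed; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# The example jobs used "-o .", so check the current directory.
if check_tophat_output .; then
    echo "all TopHat output files found"
else
    echo "TopHat output incomplete; inspect Tophat_job1.err"
fi
```

If files are missing, the stderr file named with `#BSUB -e` in the job script is the first place to look.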