NGS File formats Raw data from various vendors => various formats

Different quality metrics (some more stringent than others)
As data analysis proceeds, we end up with even more formats:
    GenBank formats (SRA)
    Alignments are in SAM/BAM
    Genome Browser formats (wig, bed, gff, etc.)
    Variants in VCF files (SNPs, indels, etc.)

What if there is no Galaxy …?
Looking for sequence read data (SRA), for example:

We’ve got read data in SRA format.
Now what? We need to convert it to FASTQ format before we can use TopHat, STAR, etc. on Pegasus2.

SRA toolkit

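fastq-dump -X 5 -Z SRR390728
    prints the first five spots of the run to standard output
fastq-dump -I --split-files SRR390728
    writes FASTQ files, one file per read of each spot, with the read number appended to the read names
fastq-dump --split-files --fasta 60 SRR390728
    writes FASTA instead of FASTQ, wrapped at 60 bases per line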
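As a rough sketch of how the fastq-dump call above could be wrapped for more than one run: the snippet below only prints the commands unless DRY_RUN is set to 0. The names dump_fastq, ACCESSIONS and DRY_RUN are illustrative assumptions, not part of the SRA toolkit.

```shell
#!/bin/bash
# Sketch: convert a list of SRA runs to FASTQ with fastq-dump.
# ACCESSIONS and DRY_RUN are illustrative names, not SRA toolkit settings.
ACCESSIONS="SRR390728"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

dump_fastq() {
    # -I appends the read number to each read name;
    # --split-files writes each read of a spot to its own file.
    if [ "$DRY_RUN" = "1" ]; then
        echo "fastq-dump -I --split-files $1"
    else
        fastq-dump -I --split-files "$1"
    fi
}

for acc in $ACCESSIONS; do
    dump_fastq "$acc"
done
```

More accessions can be added to ACCESSIONS, separated by spaces.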

FASTQ file format
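A minimal illustration of the format, assuming the common layout of four lines per read (name, sequence, separator, quality). The two records below are made up for the example and are not real sequencing data; count_reads is a hypothetical helper name.

```shell
#!/bin/bash
# Toy FASTQ file with two made-up records (not real sequencing data).
TOY_DIR=$(mktemp -d)
TOY_FASTQ="$TOY_DIR/toy.fastq"
cat > "$TOY_FASTQ" <<'EOF'
@read1
GATTACAGATTACA
+
IIIIIIIIIIIIII
@read2
ACGTACGTACGTAC
+
IIIIHHHHGGGGFF
EOF

# A record is four lines: @name, sequence, '+' separator, quality string
# (one quality character per base).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

count_reads "$TOY_FASTQ"
```

The count is only valid for files in which sequences are not wrapped over several lines.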

Working on Pegasus2
We’ve got our fastq file(s). To align the reads with the reference genome we will use TopHat on Pegasus2.
Transfer the data with scp:
    copy yourfile.fastq into your home directory, or
    copy yourfile.fastq into your scratch directory

Preparing to use TopHat on Pegasus2
TopHat was built on top of the non-splice-aware aligner Bowtie, so in order to use TopHat you must also have Bowtie available. TopHat and Bowtie are both available as modules on Pegasus2.
To load TopHat and Bowtie for use on Pegasus2, type the following commands:
    module load bowtie2/2.2.2
    module load tophat/2.0.11
To see a complete list of all available modules, type:
    module avail
To confirm that the modules have been loaded properly, type:
    which tophat
    which bowtie2
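The check that the tools are on the PATH can also be scripted, for example at the top of a job script. This is only a sketch: missing_tools is a hypothetical helper, not part of the module system.

```shell
#!/bin/bash
# Sketch: report which of the named programs are missing from PATH.
# missing_tools is a hypothetical helper, not part of the module system.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# On the cluster, after the module load commands, the call would be:
#   missing_tools tophat bowtie2
# Here `sh` stands in for a program that is certainly installed.
missing_tools sh
```

An empty result means every named program was found.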

Preparing data for TopHat use
We need to give TopHat information about the genome to which we want to align our data. Assuming we are logged in to our accounts on Pegasus2:
1.) First, we need to download the genome sequence as a FASTA file. For human:
    ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    wget -O GRCh38.fa.gz ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
2.) Second, we need to download (or construct) an indexed version of the genome for bowtie2 to work with. Pre-built indexes for Bowtie2 can be found on the Bowtie2 website.

However, bowtie2 has a command to build this index from the genome sequence file:
    bowtie2-build -f GRCh38.fa GRCh38
This assumes GRCh38.fa contains the genome sequences in FASTA format, so decompress the download first (gunzip GRCh38.fa.gz). The command creates a series of files called GRCh38.1.bt2, GRCh38.2.bt2, GRCh38.rev.1.bt2, etc.
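Before starting a long alignment job it can be worth confirming that the index build finished. The sketch below assumes a regular ("small") Bowtie2 index, which consists of six .bt2 files. Very large genomes get a "large" index with a .bt2l suffix, which this sketch does not handle. index_is_complete is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: check that all six files of a small Bowtie2 index exist.
# index_is_complete is a hypothetical helper name.
index_is_complete() {
    base="$1"
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$base.$suffix.bt2" ]; then
            echo "missing: $base.$suffix.bt2" >&2
            return 1
        fi
    done
    return 0
}

# Example call for the index built above:
#   index_is_complete GRCh38 && echo "index ready"
```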

3.) Third, we need to download an annotation file containing all the known genes and transcripts for this genome. This third step is technically optional, but it helps to improve the accuracy of splice junction calling and is generally recommended. We are also going to need this file when we quantify the transcripts present, so we might as well use it now too.
Genome annotation for human:
    ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
    wget -O GRCh38.gtf.gz ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz
We are ready to run TopHat!
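TopHat expects the sequence names in the first column of the annotation to match the names in the genome index. Both files above come from the same Ensembl release, so they should agree, but the check becomes useful when sources are mixed (for example "1" versus "chr1"). A sketch with made-up toy files; names_missing_from_fasta is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list sequence names used in a GTF but absent from a FASTA file.
# The toy inputs are made up; on the cluster the real files would be used.
WORK=$(mktemp -d)
printf '>1 toy chromosome\nACGT\n>2 toy chromosome\nACGT\n' > "$WORK/toy.fa"
printf '1\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchr2\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$WORK/toy.gtf"

names_missing_from_fasta() {
    # FASTA names: text after '>' up to the first whitespace.
    # GTF names: first column of the non-comment lines.
    awk '
        NR == FNR {
            if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 }
            next
        }
        /^#/ { next }
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

names_missing_from_fasta "$WORK/toy.fa" "$WORK/toy.gtf"
```

For the toy files this reports chr2, the one annotation name that has no matching sequence.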

Running TopHat on Pegasus2
To run a job on Pegasus2 in the background we need to create a shell script:
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    # Your actual commands for the job are going to be placed here.
Line by line:
    #!/bin/bash    which shell?
    -J             job name
    -e             file for stderr
    -o             file for stdout
    -n             number of cores
    -q             queue
    -W             time allocation (hours:minutes)
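When many samples need the same header, the job script can be generated instead of typed. The following is a sketch only: write_job_script and its arguments are illustrative and not an LSF tool, and the queue and time limit are simply copied from the example above.

```shell
#!/bin/bash
# Sketch: write an LSF job script with the #BSUB header from the example.
# write_job_script and its arguments are illustrative, not an LSF tool.
write_job_script() {
    job_name="$1"; cores="$2"; script_file="$3"
    cat > "$script_file" <<EOF
#!/bin/bash
#BSUB -J $job_name
#BSUB -e $job_name.err
#BSUB -o $job_name.out
#BSUB -n $cores
#BSUB -q general
#BSUB -W 72:00
# Your actual commands for the job are going to be placed here.
EOF
}

JOB_DIR=$(mktemp -d)
write_job_script Tophat_job1 4 "$JOB_DIR/script.sh"
cat "$JOB_DIR/script.sh"
```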

Running TopHat
The complete list of TopHat parameters and their official descriptions is in the TopHat manual.
Optional parameters:
    -G …    specify a genomic annotation to use
    -o …    specify the output directory
Required parameters:
    the 'base name' of the genome sequence/index data, so that TopHat can find it
    the fastq files to use

Running TopHat
Simplest example: 'base name' and one fastq file
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-Seq/Sample1.R1.fastq
In the tophat command, ~/RNA-Seq/hg38/hg38 is the 'base name' of the index and ~/RNA-Seq/Sample1.R1.fastq is the one fastq file.

Running TopHat
Paired-end reads
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/RNA-Seq/Sample1.R1.fastq ~/RNA-Seq/Sample1.R2.fastq
The R1 and R2 files hold the two mates of each read pair and are given as two separate arguments.
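The single-end and paired-end calls differ only in the optional second reads argument. A sketch that makes this explicit by assembling the command line; tophat_cmd is a hypothetical helper that prints the command rather than running it, and the file names are placeholders.

```shell
#!/bin/bash
# Sketch: assemble a tophat command line for single-end or paired-end input.
# tophat_cmd is a hypothetical helper; it prints the command, it does not run it.
tophat_cmd() {
    gtf="$1"; index_base="$2"; reads1="$3"; reads2="$4"
    cmd="tophat -G $gtf -p 4 -o . $index_base $reads1"
    # A second reads argument makes TopHat treat the input as paired-end.
    if [ -n "$reads2" ]; then
        cmd="$cmd $reads2"
    fi
    echo "$cmd"
}

tophat_cmd hg38.gtf hg38 Sample1.R1.fastq Sample1.R2.fastq
```

Leaving out the fourth argument yields the single-end form of the command.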

Running TopHat
Multiple fastq files
    #!/bin/bash
    #BSUB -J Tophat_job1
    #BSUB -e Tophat_job1.err
    #BSUB -o Tophat_job1.out
    #BSUB -n 4
    #BSUB -q general
    #BSUB -W 72:00
    #
Multiple fastq files (separated by commas, no spaces):
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq
Multiple fastq files with paired-end reads (one comma-separated list per mate):
    tophat -G ~/RNA-Seq/hg38/hg38.gtf -p 4 -o . ~/RNA-Seq/hg38/hg38 ~/FastQFiles/Sample1.Lane1.R1.fastq,~/FastQFiles/Sample1.Lane2.R1.fastq,~/FastQFiles/Sample1.Lane3.R1.fastq ~/FastQFiles/Sample1.Lane1.R2.fastq,~/FastQFiles/Sample1.Lane2.R2.fastq,~/FastQFiles/Sample1.Lane3.R2.fastq
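Typing long comma-separated lists by hand is error-prone. The sketch below builds them from individual file names; join_with_commas is a hypothetical helper name and the file names are placeholders. The final command is printed, not executed.

```shell
#!/bin/bash
# Sketch: join the per-lane FASTQ files of one sample with commas.
# join_with_commas is a hypothetical helper name.
join_with_commas() {
    joined=""
    for f in "$@"; do
        if [ -z "$joined" ]; then joined="$f"; else joined="$joined,$f"; fi
    done
    echo "$joined"
}

R1=$(join_with_commas Sample1.Lane1.R1.fastq Sample1.Lane2.R1.fastq Sample1.Lane3.R1.fastq)
R2=$(join_with_commas Sample1.Lane1.R2.fastq Sample1.Lane2.R2.fastq Sample1.Lane3.R2.fastq)
# R1 and R2 must list the lanes in the same order so that mates stay paired.
echo "tophat -G hg38.gtf -p 4 -o . hg38 $R1 $R2"
```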

Finally submit your TopHat jobs
Submit a job with:
    bsub < script.sh
Other useful commands:
    bjobs            returns the status of current jobs
    bkill <jobid>    kills the job with <jobid>
A successful run of TopHat will return the following files:
    accepted_hits.bam    holds our results
    junctions.bed
    insertions.bed
    deletions.bed
To view a BAM file you need samtools:
    module load samtools/0.1.19
    samtools view accepted_hits.bam
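Once bjobs no longer lists the job, a quick look at the output directory shows whether the run produced the files listed above. A sketch, demonstrated on a temporary directory with empty placeholder files; missing_outputs is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list expected TopHat output files that are absent from a directory.
# missing_outputs is a hypothetical helper name.
missing_outputs() {
    out_dir="$1"
    for f in accepted_hits.bam junctions.bed insertions.bed deletions.bed; do
        [ -f "$out_dir/$f" ] || echo "$f"
    done
}

# Demonstration on a temporary directory with two empty placeholder files.
DEMO_OUT=$(mktemp -d)
touch "$DEMO_OUT/accepted_hits.bam" "$DEMO_OUT/junctions.bed"
missing_outputs "$DEMO_OUT"
```

For a real run with -o . the call would be missing_outputs . from the directory the job ran in. If files are missing, Tophat_job1.err is the first place to look.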
