Presentation is loading. Please wait.

Presentation is loading. Please wait.

NGS Analysis Using Galaxy

Similar presentations

Presentation on theme: "NGS Analysis Using Galaxy"— Presentation transcript:

1 NGS Analysis Using Galaxy
Sequences and Alignment Format Galaxy overview and Interface Getting Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises (

2 SNPSeq Analysis dataset
Data source: SRR sample from from experiment published by Kaufman et al (2012, GSE20176) FastqFile: TAIR10 Genome: The API transcription factor is a key regulator of Arabidopsis flower development. To understand the target molecular mechanisms underlying AP1 function, target genes were identified during floral initiation using a combination of gene expression profiling and genome-wide binding studies. Many of its targets encode transcriptional regulators, including floral repressors.

3 SNP-Seq pipeline Sequence (fastq file) Reference genome (fasta file)
Upload data and QC Sequence (fastq file) Reference genome (fasta file) Filter reads based on quality scores Alignment with BWA Align with BWA SAM to BAM conversion Variant calling Mpileup Bcftools view

4 Upload data Go to ”Get Data”, click open ”Upload File from your computer”. Then specify the following list of URLs in URL/Text box Next, we need to upload files from URL. Click on Upload file. Uploading file on galaxy can be done in a number of ways. You could simply upload the file from your local desktop/computer or you could upload the data from any URL or ftp website. Another option could be to upload the data from popular browsers or databases like UCSC Genome browser or BioMart. There are also a number of file formats that you could choose from here. Next, we will copy and paste the URLs given here in the URL/Text box and click on Execute. For this exercise, we are uploading a part of Arabidopsis genome and the

5 Convert Fastq file to sanger sequences
Select NGS: QC and manipulation and Fastq groomer The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools. Next, we need to convert these FASTQ files into sanger format. Select NGS: QC and manipulation and Fastq groomer The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools. The output of the FASTQ Groomer is sanger formatted fastq reads, which is acceptable by all downstream, analysis tools.

6 FASTQ Summary Statistics
To understand the quality properties of the reads, one can run the FASTQ Summary Statistics tool from NGS: QC and manipulation.

7 FASTQ Quality control To understand the quality properties of the reads, one can also run the FASTQC: Read QC reports from NGS: QC and manipulation.

8 Quality control output

9 Quality control reports
Per base sequence quality Per base sequence content Per base sequence quality The central red line is the median value The yellow box represents the inter-quartile range (25-75%) The upper and lower whiskers represent the 10% and 90% points The blue line represents the mean quality The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read. Per base sequence content Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called. In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other. If you see strong biases which change in different bases then this usually indicates an overrepresented sequence which is contaminating your library. A bias which is consistent across all bases either indicates that the original library was sequence biased, or that there was a systematic problem during the sequencing of the library. Per sequence GC content This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the the GC content of the genome the modal GC content is calculated from the observed data and used to build a reference distribution. An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be. Per base GC content

10 Quality filter This tool filters reads based on quality scores. NGS: QC and manipulation -> Generic FASTQ manipulation->Filter FASTQ reads by quality score and length

11 Alignment with BWA BWA is a fast and accurate short read aligner that allows mismatches and indels Go to ”NGS: Mapping” and click on ”Map with BWA”.

12 SAM to BAM format conversion
Produce an indexed BAM file based on a sorted input SAM file. Go to ”NGS: SAM Tools”, then click open ”SAM-to-BAM”.

13 Variants calling with Mpileup
SNP and INDEL caller to Generate BCF (Binary Variant Format) for one or multiple BAM files Go to ”NGS: SAM Tools”, then click open ”MPileup SNP and indel caller”.

14 Bcftools view Converts BCF format to VCF format.
Go to ”NGS: SAM Tools”, then click open ”bcftools view”

15 Exercise 2: RNASeq Analysis
Data source: RNA-seq experiment SRA023501 Four Samples: Samples Factors Fastq AP3_f14 AP3 T1_f14 TRL

16 RNASeq Analysis workflow
Input data Upload the fastq files, reference genome and GTF file Use fastq groomer to groom the fastq sequences Alignment TopHat for alignment Results in insertions, deletions, splice junctions and accepted hits Differential expression testing Use Cuffdiff for differential expression testing Finds significant changes in gene expression, splicing and promoter use

17 Upload Data Upload four fastq files with URL
Upload tair10chr.fa with URL ( Upload TAIR10.GTF with URL ,specify the format ”gtf” (

18 Fastq Groomer Select NGS: QC and manipulation and Fastq groomer
Run Fastq Groomer for all the 4 fastq sequences.

19 Alignment with TopHat TopHat is a fast splice junction mapper for RNA-Seq reads, it can identify splice junctions between exons. Go to ”NGS RNA analysis”, click open ”Tophat for illumina” Similarly repeat this process for all the 4 fastq groomed sequences

20 Find Significant Changes
Cuffdiff find significant changes in transcript expression. Go to ”NGS RNA analysis”, click open ”Cuffdiff”

21 Cuffdiff Output TSS... files report on Transcription Start Sites
splicing... report on splicing CDS... track coding region expression transcript... track transcripts gene... rolls up the transcripts into their genes gene/transcript FPKM tracking: gives information about the gene/transcript (length, nearest ref id, TSS, etc) and the confidence intervals for FPKM for each condition. gene/transcript differential expression testing: gives the expression change between groups, a status of whether there was enough data for that value to be accurate (OK is good, FAIL and NOTEST are bad. LOWDATA is somewhere in between). Finally, it gives a p-value. see more details… Link

22 Find Significant Changes
Cuffdiff find significant changes in transcript expression

23 Output File from Galaxy
SNPseq Save Bam file BWA generated Save .bai file (index of BAM) BWA generated Save vcf file Samtools mpileup generated Already saved them at RNASeq Save four Bam files and four .bai files

24 Downloading bam index file (.bai)

25 Outline What is Galaxy Galaxy for Bioinformaticians
Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV

26 Why IGV IGV is an integrated visualization tool of large data types
View large dataset easily Faster navigation on browsing Run it locally on your computer Easy to use interface

27 IGV Interface Tool bar Chromosome ideogram Ruler Track data Features
Track names Attributes See more details (Link)

28 IGV download

29 Load data Select genome: Click the genome drop-down list in the toolbar and select the genome Select chromosome: Click the chromosome drop-down box and select chromosome

30 Load data files Load from URL, file, server, DAS (Distributed Annotation System): UCSC DAS Sources Import genome

31 Toolbar Genome drop-down box: loads a genome
Chromosome drop-down box: zooms to a chromosome Search box: Displays the chromosome location being shown. To scroll to a different location, enter the gene name, locus or track name and click Go. Whole genome view: Zooms to whole genome view. Define a region: Defines a region of interest on the chromosome. Zoom slider: Zooms in and out on a chromosome.

32 Change Display Options
IGV offers several display options for tracks Zoom in and Zoom out Modify Track Height Sort the Tracks Filter the Tracks Group the Tracks Sort Tracks based on Region of Interest

33 Variants Visualization in IGV
Load TAIR10 genome to IGV Load BAM file “SRR bam” to IGV with “Load from URL” Load VCF file “var.raw.vcf” to IGV with “Load from URL”

34 Zoom In Screen Zoom in to : Chr5:57,073-57,142

35 Zoom in position (chr5:6,435-6,475)

36 RNAseq Results Visualization
Load four BAM files to IGV Load gene differential expression GFF3 file “expression_diff.gff3” to IGV

37 Exercise2: RNAseq Result Visualization
Zoom in to Chr1:41,351-51,208

Download ppt "NGS Analysis Using Galaxy"

Similar presentations

Ads by Google