Presentation on theme: "NGS Analysis Using Galaxy"— Presentation transcript:
1 NGS Analysis Using Galaxy Sequences and Alignment FormatGalaxy overview and InterfaceGetting Data in GalaxyAnalyzing Data in GalaxyQuality ControlMapping DataHistory and workflowGalaxy Exercises (
2 SNPSeq Analysis dataset Data source: SRR sample from from experiment published by Kaufman et al (2012, GSE20176)FastqFile:TAIR10 Genome:The API transcription factor is a key regulator of Arabidopsis flower development. To understand the target molecular mechanisms underlying AP1 function, target genes were identified during floral initiation using a combination of gene expression profiling and genome-wide binding studies. Many of its targets encode transcriptional regulators, including floral repressors.
3 SNP-Seq pipeline Sequence (fastq file) Reference genome (fasta file) Upload data and QCSequence (fastq file)Reference genome (fasta file)Filter reads based on quality scoresAlignment with BWAAlign with BWASAM to BAM conversionVariant callingMpileupBcftools view
4 Upload dataGo to ”Get Data”, click open ”Upload File from your computer”. Then specify the following list of URLs in URL/Text boxNext, we need to upload files from URL. Click on Upload file. Uploading file on galaxy can be done in a number of ways. You could simply upload the file from your local desktop/computer or you could upload the data from any URL or ftp website. Another option could be to upload the data from popular browsers or databases like UCSC Genome browser or BioMart. There are also a number of file formats that you could choose from here. Next, we will copy and paste the URLs given here in the URL/Text box and click on Execute. For this exercise, we are uploading a part of Arabidopsis genome and the
5 Convert Fastq file to sanger sequences Select NGS: QC and manipulation and Fastq groomerThe FASTQ Groomer tool is used to verify and convert between the known FASTQ variants.After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools.Next, we need to convert these FASTQ files into sanger format. Select NGS: QC and manipulation and Fastq groomerThe FASTQ Groomer tool is used to verify and convert between the known FASTQ variants.After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools. The output of the FASTQ Groomer is sanger formatted fastq reads, which is acceptable by all downstream, analysis tools.
6 FASTQ Summary Statistics To understand the quality properties of the reads, one can run the FASTQ Summary Statistics tool from NGS: QC and manipulation.
7 FASTQ Quality controlTo understand the quality properties of the reads, one can also run the FASTQC: Read QC reports from NGS: QC and manipulation.
9 Quality control reports Per base sequence qualityPer base sequence contentPer base sequence qualityThe central red line is the median valueThe yellow box represents the inter-quartile range (25-75%)The upper and lower whiskers represent the 10% and 90% pointsThe blue line represents the mean qualityThe y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.Per base sequence contentPer Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.If you see strong biases which change in different bases then this usually indicates an overrepresented sequence which is contaminating your library. A bias which is consistent across all bases either indicates that the original library was sequence biased, or that there was a systematic problem during the sequencing of the library.Per sequence GC contentThis module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the the GC content of the genome the modal GC content is calculated from the observed data and used to build a reference distribution.An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.Per base GC content
10 Quality filterThis tool filters reads based on quality scores. NGS: QC and manipulation -> Generic FASTQ manipulation->Filter FASTQ reads by quality score and length
11 Alignment with BWABWA is a fast and accurate short read aligner that allows mismatches and indelsGo to ”NGS: Mapping” and click on ”Map with BWA”.
12 SAM to BAM format conversion Produce an indexed BAM file based on a sorted input SAM file.Go to ”NGS: SAM Tools”, then click open ”SAM-to-BAM”.
13 Variants calling with Mpileup SNP and INDEL caller to Generate BCF (Binary Variant Format) for one or multiple BAM filesGo to ”NGS: SAM Tools”, then click open ”MPileup SNP and indel caller”.
14 Bcftools view Converts BCF format to VCF format. Go to ”NGS: SAM Tools”, then click open ”bcftools view”
16 RNASeq Analysis workflow Input dataUpload the fastq files, reference genome and GTF fileUse fastq groomer to groom the fastq sequencesAlignmentTopHat for alignmentResults in insertions, deletions, splice junctions and accepted hitsDifferential expression testingUse Cuffdiff for differential expression testingFinds significant changes in gene expression, splicing and promoter use
17 Upload Data Upload four fastq files with URL Upload tair10chr.fa with URL (Upload TAIR10.GTF with URL ,specify the format ”gtf” (
18 Fastq Groomer Select NGS: QC and manipulation and Fastq groomer Run Fastq Groomer for all the 4 fastq sequences.
19 Alignment with TopHatTopHat is a fast splice junction mapper for RNA-Seq reads, it can identify splice junctions between exons.Go to ”NGS RNA analysis”, click open ”Tophat for illumina”Similarly repeat this process for all the 4 fastq groomed sequences
20 Find Significant Changes Cuffdiff find significant changes in transcript expression.Go to ”NGS RNA analysis”, click open ”Cuffdiff”
21 Cuffdiff Output TSS... files report on Transcription Start Sites splicing... report on splicingCDS... track coding region expressiontranscript... track transcriptsgene... rolls up the transcripts into their genesgene/transcript FPKM tracking: gives information about the gene/transcript (length, nearest ref id, TSS, etc) and the confidence intervals for FPKM for each condition.gene/transcript differential expression testing: gives the expression change between groups, a status of whether there was enough data for that value to be accurate (OK is good, FAIL and NOTEST are bad. LOWDATA is somewhere in between). Finally, it gives a p-value.see more details… Link
29 Load dataSelect genome: Click the genome drop-down list in the toolbar and select the genomeSelect chromosome: Click the chromosome drop-down box and select chromosome
30 Load data filesLoad from URL, file, server, DAS (Distributed Annotation System): UCSC DAS SourcesImport genome
31 Toolbar Genome drop-down box: loads a genome Chromosome drop-down box: zooms to a chromosomeSearch box: Displays the chromosome location being shown. To scroll to a different location, enter the gene name, locus or track name and click Go.Whole genome view: Zooms to whole genome view.Define a region: Defines a region of interest on the chromosome.Zoom slider: Zooms in and out on a chromosome.
32 Change Display Options IGV offers several display options for tracksZoom in and Zoom outModify Track HeightSort the TracksFilter the TracksGroup the TracksSort Tracks based on Region of Interest
33 Variants Visualization in IGV Load TAIR10 genome to IGVLoad BAM file “SRR bam” to IGV with “Load from URL”Load VCF file “var.raw.vcf” to IGV with “Load from URL”