RNA-seq data analysis Project


RNA-seq data analysis Project QI LIU

From reads to differential expression

Raw sequence data: FASTQ files
  QC by FastQC
Reads mapping
  Unspliced mapping: BWA, Bowtie
  Spliced mapping: TopHat, MapSplice
Mapped reads: SAM/BAM files
Expression quantification
  Summarize read counts
  FPKM/RPKM: Cufflinks
  QC by RNA-SeQC
DE testing
  DESeq, edgeR, etc.
  Cuffdiff
List of DE genes
Functional interpretation
  Function enrichment
  Infer networks
  Integrate with other data
Biological insights & hypotheses

Tools

Read alignment:
  TopHat2 (http://ccb.jhu.edu/software/tophat/index.shtml)
  Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
Tools for manipulating SAM files:
  SAMTOOLS (http://samtools.sourceforge.net/samtools.shtml)
Counting reads at the gene level:
  htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
Downstream analysis:
  The R statistical computing environment (http://www.r-project.org/)
  R package: edgeR (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
Visualization:
  IGV (http://www.broadinstitute.org/software/igv/download)

/home/igptest/path.txt

  setpkgs -a python
  BOWTIE2=/scratch/liuq6/software/bowtie2
  TOPHAT2=/scratch/liuq6/software/tophat2
  export PYTHONPATH=/scratch/liuq6/software/htseqlib:$PYTHONPATH
  export PATH=/home/igptest/exomesequencing/software/:$BOWTIE2:$TOPHAT2:$PATH
  transcriptomeindex=/scratch/liuq6/reference/gtfindex/Homo_sapiens.GRCh37.75
  genomeindex=/scratch/liuq6/reference/bowtie2_index/hg19
  gtffile=/scratch/liuq6/reference/Homo_sapiens.GRCh37.75_chr1-22-X-Y-M.gtf
  reference=/home/igptest/exomesequencing/reference/hg19/hg19_chr.fa
  VarScan=/home/igptest/exomesequencing/software/VarScan.v2.2.10.jar

Environment Variables

PATH is an environment variable. It is a colon-delimited list of directories that your shell searches, in order, when you enter a command. On Linux and Unix-like operating systems, executables are kept in many different directories, so a tool is only found by name if its directory is listed in PATH.

PYTHONPATH is an environment variable that you can set to add directories where Python will look for modules and packages.
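To make the PATH lookup concrete, here is a minimal sketch. The temporary directory and the command name hello_tool are hypothetical and exist only for this demonstration; they are not part of the course setup.

```shell
# Hypothetical demo: create a throwaway directory holding one executable.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from demo tool"\n' > "$demo_dir/hello_tool"
chmod +x "$demo_dir/hello_tool"

# Before: the shell cannot find the command, because its directory is not in PATH.
if command -v hello_tool >/dev/null 2>&1; then
    found_before=yes
else
    found_before=no
fi

# Prepend the directory. Entries are colon-separated and searched left to right.
export PATH="$demo_dir:$PATH"

# After: the same command name now resolves to the file in demo_dir.
found_after=$(command -v hello_tool)

echo "before: $found_before"
echo "after:  $found_after"
```

The export line has the same shape as the PATH line in path.txt: new directories go in front of the existing $PATH so that they are searched first.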

Two ways to load the settings

1. source path.txt
   source is a bash shell built-in command that executes the contents of the file passed as its argument in the current shell.
2. Put the commands in .bashrc
   .bashrc is a file that bash reads and executes automatically when a new interactive shell starts (on most setups this includes logging in). It is located at /home/yourusername/.bashrc
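The reason for sourcing the file rather than running it can be sketched as follows. The settings file and the variable demo_gtf are hypothetical stand-ins for path.txt and its variables. The sketch uses `.`, the POSIX spelling of bash's `source`, so that it also runs under plain sh.

```shell
# Hypothetical settings file standing in for path.txt.
settings=$(mktemp)
echo 'demo_gtf=/hypothetical/annotation.gtf' > "$settings"

# Running the file as a separate process: the variable is set in a child
# shell and disappears when that process exits.
sh "$settings"
after_run=${demo_gtf:-unset}

# Sourcing executes the lines in the current shell, so the variable persists.
# In bash, `source "$settings"` does the same thing.
. "$settings"
after_source=${demo_gtf:-unset}

echo "after running:  $after_run"
echo "after sourcing: $after_source"
```

This is why variables such as $gtffile are only available in a session after path.txt has been sourced in it.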

Practice

Run these before and after you load the environment:
  echo $SHELL    (the name of the current shell)
  env            (the existing environment variables)
  echo $HOME
  echo $PATH
  echo $gtffile

Reads alignment

TopHat

1. Un-spliced mapping to the transcriptome and then the genome (Bowtie).
2. "Contiguously unmappable" reads are used to predict possible splice junctions.

Tophat2

Usage:
  tophat2 [options]* <genome_index_base> PE_reads_1.fq.gz PE_reads_2.fq.gz

<genome_index_base> is the basename of the genome index to be searched.

Options:
  -o/--output-dir <string>
      Sets the name of the directory in which TopHat will write all of its output. The default is "./tophat_out".
  -G/--GTF <GTF/GFF3 file>
      Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  --transcriptome-index <dir/prefix>
      Use the previously built transcriptome index files.

TopHat output

accepted_hits.bam: a list of read alignments in BAM format (the compressed binary form of SAM).
junctions.bed: a UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
insertions.bed and deletions.bed: UCSC BED tracks of insertions and deletions reported by TopHat.

Practice

  tophat2 --transcriptome-index=$transcriptomeindex -o ctrl1 $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq

Add --transcriptome-only to save time (reads are then aligned to the transcriptome only):

  tophat2 --transcriptome-index=$transcriptomeindex -o ctrl1 --transcriptome-only $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq

SAMTOOLS

  samtools view
  samtools sort
  samtools index

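Practice

View the header section of the BAM file (-H prints the header only; -h prints the header followed by all alignments):
  samtools view -H accepted_hits.bam

Extract the alignments with MAPQ >= 10 (-q 10 skips alignments with MAPQ below 10):
  samtools view -q 10 accepted_hits.bam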
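The MAPQ threshold can be illustrated without a BAM file. The sketch below assumes three made-up SAM records and a hypothetical helper, keep_mapq_at_least, that mimics the filter on SAM text by testing column 5 (MAPQ) with awk. It is not how samtools is implemented, only a way to see which records pass.

```shell
# keep_mapq_at_least MIN: pass through header lines and any record whose
# MAPQ (column 5 of a SAM line) is greater than or equal to MIN.
keep_mapq_at_least() {
    awk -F '\t' -v min="$1" '/^@/ || $5 >= min'
}

# Three made-up, tab-separated SAM records with MAPQ 50, 10 and 3.
sample_sam=$(printf 'r1\t0\t8\t128748400\t50\t50M\t*\t0\t0\t*\t*\nr2\t0\t8\t128748500\t10\t50M\t*\t0\t0\t*\t*\nr3\t0\t8\t128748600\t3\t50M\t*\t0\t0\t*\t*\n')

# Names of the reads that survive a threshold of 10.
kept=$(printf '%s\n' "$sample_sam" | keep_mapq_at_least 10 | cut -f 1 | paste -s -d ' ' -)
echo "kept: $kept"
```

The record with MAPQ exactly 10 is kept, which is the boundary case worth remembering when choosing a threshold.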

Practice

Extract the alignments mapped to the reverse strand (-f 16 keeps records whose FLAG has bit 16, i.e. 0x10, set):
  samtools view -f 16 accepted_hits.bam
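The FLAG column is a bit field, so -f 16 also matches values such as 83 or 147 that have bit 16 set alongside other bits. A minimal sketch of that bit test, using a hypothetical helper named is_reverse and a few common FLAG values:

```shell
# is_reverse FLAG: succeeds when bit 16 (0x10, read on the reverse strand)
# is set in the given SAM FLAG value.
is_reverse() {
    [ $(( $1 & 16 )) -ne 0 ]
}

for flag in 0 16 83 99 163; do
    if is_reverse "$flag"; then
        echo "$flag reverse"
    else
        echo "$flag forward"
    fi
done
```

For example, 83 = 64 + 16 + 2 + 1 contains bit 16, while 99 = 64 + 32 + 2 + 1 does not (bit 32 marks the mate, not the read itself, as reverse).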

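Practice

Index the alignment file (creates the .bai file):
  samtools index accepted_hits.bam

Use samtools to get all the reads mapped to TP53 and MYC. How many reads are there? Are any of them junction reads?
  TP53, 17:7,571,720-7,590,868
  MYC, 8:128,748,315-128,753,680

Chromosome names in the region must match those in the BAM header (for example 17 versus chr17).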
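One way to answer the junction question is to look at the CIGAR string (column 6 of a SAM record): a spliced alignment contains an N operation for the skipped intron. The sketch below assumes made-up SAM records and a hypothetical helper, count_junction_reads. In practice the input would be the text produced by samtools view for a region.

```shell
# count_junction_reads: read SAM records on stdin and count those whose
# CIGAR string (column 6) contains an N (skipped region, i.e. a splice).
count_junction_reads() {
    awk -F '\t' '$1 !~ /^@/ && $6 ~ /N/ { n++ } END { print n + 0 }'
}

# Three made-up, tab-separated SAM records; only r2 spans a junction.
sample_sam=$(printf 'r1\t0\t17\t7572000\t50\t50M\t*\t0\t0\t*\t*\nr2\t0\t17\t7573000\t50\t20M1500N30M\t*\t0\t0\t*\t*\nr3\t16\t17\t7574000\t3\t50M\t*\t0\t0\t*\t*\n')

junction_reads=$(printf '%s\n' "$sample_sam" | count_junction_reads)
echo "junction reads: $junction_reads"
```

With a real, indexed BAM file the same function could be fed from a pipe, for example `samtools view accepted_hits.bam <region> | count_junction_reads`, with the region written the way the chromosomes are named in that file.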

Homework

Install IGV: https://www.broadinstitute.org/igv/download
Use IGV to view the BAM file (you need both the BAM file and its index file).
Take a look at the reads mapped to the genes MYC and TP53.

Homework

Align ER1.fastq.
How many reads mapped to MYC and TP53? Are there any junction reads?
Visualize the results in IGV.
Extract the alignments mapped to MYC and generate a new BAM file.
Extract the alignments with MAPQ >= 10, generate a new BAM file, and index it.

Ht-seq

Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature. The counting tool must also decide how to deal with reads that overlap more than one feature.

Ht-seq Options

Ht-seq

  htseq-count [options] <alignment_file> <gff_file>

Output: a table with counts for each feature, followed by the special counters, which count reads that were not counted for any feature for various reasons. The names of the special counters all start with a double underscore, to facilitate filtering.

__no_feature: reads (or read pairs) which could not be assigned to any feature (the set S of features overlapping the read was empty).
__ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence were not counted for any of these (set S had more than one element).
__too_low_aQual: reads (or read pairs) which were skipped due to the -a option.
__not_aligned: reads (or read pairs) in the SAM file without alignment.
__alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times, unless they get filtered out due to the -a option.)
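Because the special counters all start with a double underscore, they are easy to separate from the gene rows. The sketch below uses a made-up count table (the gene names are real, the numbers are hypothetical) to show the filtering and a quick total of assigned reads.

```shell
# Made-up htseq-count style output: feature name, tab, count.
counts=$(printf 'MYC\t120\nTP53\t85\n__no_feature\t300\n__ambiguous\t12\n__alignment_not_unique\t40\n')

# Keep only the feature rows by dropping lines that start with "__".
gene_rows=$(printf '%s\n' "$counts" | grep -v '^__')

# Sum column 2 over the feature rows to get the number of assigned reads.
assigned=$(printf '%s\n' "$gene_rows" | awk -F '\t' '{ s += $2 } END { print s + 0 }')

printf '%s\n' "$gene_rows"
echo "assigned reads: $assigned"
```

The same grep can be applied to a real count file before loading it into R, so that the special counters are not treated as genes.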

Practice

  htseq-count -s no -f bam -i gene_name ctrl1/accepted_hits.bam $gtffile > ctrl1.count

Homework

Summarize the read counts at the gene level for ER1.
Summarize the read counts at the exon level for ctrl1.

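edgeR

Install R on your laptop, then install the edgeR package:
  source("http://bioconductor.org/biocLite.R")
  biocLite("edgeR")

On current Bioconductor releases (3.8 and later), biocLite has been replaced by BiocManager:
  install.packages("BiocManager")
  BiocManager::install("edgeR")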

edgeR

  Raw counts
  Small sample size
  Complicated experimental design

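Negative binomial distribution

Here $y_{gi}$ is the read count for gene $g$ in sample $i$, $\mu_{gi}$ is its mean, and $\phi_g$ is the dispersion of gene $g$.

Technical replicates (Poisson distribution):
$$\mathrm{var}(y_{gi}) = \mu_{gi}$$

Biological variance:
$$\mathrm{var}(y_{gi}) = \mu_{gi} + \phi_g \mu_{gi}^2$$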
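A quick numeric check shows how much the dispersion term matters. The values mu = 100 and phi = 0.1 below are assumptions chosen for illustration, not numbers from the slides.

```shell
# Assumed illustrative values: mean count 100, dispersion 0.1.
mu=100
phi=0.1

# Poisson (technical replicates): variance equals the mean.
var_poisson=$(awk -v mu="$mu" 'BEGIN { printf "%.0f", mu }')

# Negative binomial (biological replicates): mean + dispersion * mean^2.
var_nb=$(awk -v mu="$mu" -v phi="$phi" 'BEGIN { printf "%.0f", mu + phi * mu * mu }')

echo "Poisson variance:           $var_poisson"
echo "Negative binomial variance: $var_nb"
```

With these values the variance grows from 100 to 1100. Because the extra term scales with the square of the mean, the gap between the two models widens for highly expressed genes.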

Normalization

TMM (trimmed mean of M-values): chooses scaling factors that minimize the log-fold changes between the samples for most genes.

Normalization

RNA-seq measures the relative abundance of each gene in each RNA sample, but not the total RNA output per cell.

Example table (only the g1 row was recoverable from the slide):

  Gene      N     T
  g1        1     1000
  g2-g10    (values not recovered)

FPKM: g2-g10
TMM: g1

Differential analysis

Read data: DGEList
Preprocess
Normalization: calcNormFactors
Visualize: MDS, PCA, heatmap
Estimate the dispersion: estimateGLMCommonDisp, estimateGLMTrendedDisp, estimateGLMTagwiseDisp
Differential expression: glmFit, glmLRT