Alignment of Next-Generation Sequencing Data

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Advertisements

RNAseq.
SCHOOL OF COMPUTING ANDREW MAXWELL 9/11/2013 SEQUENCE ALIGNMENT AND COMPARISON BETWEEN BLAST AND BWA-MEM.
High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.
Introduction to Short Read Sequencing Analysis
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Institute for Quantitative & Computational Biosciences Workshop4: NGS- study design and short read mapping.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
Introduction to RNA-Seq and Transcriptome Analysis
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Introduction to Short Read Sequencing Analysis
MES Genome Informatics I - Lecture V. Short Read Alignment
File formats Wrapping your data in the right package Deanna M. Church
RNAseq analyses -- methods
DAY 1. GENERAL ASPECTS FOR GENETIC MAP CONSTRUCTION SANGREA SHIM.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
Transcriptome Analysis
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA RNA-seq CHIP-seq DNAse I-seq FAIRE-seq Peaks Transcripts Gene models Binding sites RIP/CLIP-seq.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.
Introduction to RNAseq
Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015
No reference available
Short Read Workshop Day 5: Mapping and Visualization
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA Seq Analysis Aaron Odell June 17 th Mapping Strategy A few questions you’ll want to ask about your data… - What organism is the data from? -
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
RNAseq: a Closer Look at Read Mapping and Quantitation
Introductory RNA-seq Transcriptome Profiling
NGS File formats Raw data from various vendors => various formats
Day 5 Mapping and Visualization
GCC Workshop 9 RNA-Seq with Galaxy
Lesson: Sequence processing
An Introduction to RNA-Seq Data and Differential Expression Tools in R
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
RNA Sequencing Day 7 Wooohoooo!
VCF format: variants c.f. S. Brown NYU
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
How to store and visualize RNA-seq data
Pairwise and NGS read alignment
Jin Zhang, Jiayin Wang and Yufeng Wu
CSC2431 February 3rd 2010 Alecia Fowler
Next-generation sequencing - Mapping short reads
Learning to count: quantifying signal
Maximize read usage through mapping strategies
Next-generation sequencing - Mapping short reads
CS 6293 Advanced Topics: Translational Bioinformatics
Canadian Bioinformatics Workshops
Sequence Analysis - RNA-Seq 2
BF528 - Sequence Analysis Fundamentals
Transcriptomics – towards RNASeq – part III
Introduction to RNA-Seq & Transcriptome Analysis

RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Alignment of Next-Generation Sequencing Data Gene Expression Analyses Alignment of Next-Generation Sequencing Data Nadia Lanman February 26, 2019

What is sequence alignment? A way of arranging sequences of DNA, RNA, or protein to identify regions of similarity Similarity may be a consequence of functional, structural, or evolutionary relationships between sequences In the case of NextGen sequencing, alignment identifies where fragments which were sequenced are derived from (e.g. which gene or transcript) Two types of alignment: local and global http://www-personal.umich.edu/~lpt/fgf/fgfrseq.htm

Global vs Local Alignment Global aligners try to align all provided sequence end to end Local aligners try to find regions of similarity within each provided sequence (match your query with a substring of your subject/target) gap mismatch

Alignment Example Raw sequences: A G A T G and G A T T G 2 matches, 0 gaps A G A T G G A T T G 4 matches, 1 insertion A G A - T G . . G A T T G 4 matches, 1 insertion A G A T - G . . G A T T G 3 matches, 2 end gaps A G A T G . . G A T T G

NGS read alignment Allows us to determine where sequence fragments (“reads”) came from Quantification allows us to address relevant questions How do samples differ from the reference genome Which genes or isoforms are differentially expressed Haas et al, 2010, Nature.

Standard Differential Expression Analysis Check data quality Trim & filter reads, remove adapters Align reads to reference genome Count reads aligning to each gene Unsupervised Clustering Differential expression analysis GO enrichment analysis Pathway analysis

Challenges of NGS Read Alignment Alternative splicing Often must align billions of reads Reads will have some mismatches = an approximate matching problem Sequencing errors True biological variation Eukaryotic genome are rich in repetitive regions` New methods had to be developed which were specific to NGS sequence alignment Multi-mapped reads

More than 90 aligners exist Bowtie2 BWA STAR Tophat2 Pseudo Aligners for RNA- seq quantification Kalliso Salmon Sailfish https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software

BWA Burrows-Wheeler transform algorithm with FM-index using suffix arrays. Need to create a genome index BWA can map low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack (Illumina sequence reads up to 100bp) BWA-SW (more sensitive when alignment gaps are frequent) BWA-MEM (maximum exact matches) BWA SW and MEM can map longer sequences (70bp to Mbp) and share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.

Bowtie2 Bowtie (now Bowtie2) is part of a suite of tools for analyzing RNA-seq data (Bowtie, Tophat, Cufflinks, CummeRbund) http://bowtie-bio.sourceforge.net Bowtie2 is a Burrows-Wheeler Transform (BWT) aligner and handles reads longer than 50 nt. Need to prepare a genome index Given a reference and a set of reads, this method reports at least one good local alignment for each read if one exists. Bowtie (now Bowtie2) is faster than BWA for some alignments but is sometimes less sensitive than BWA Can view all sorts of arguments for one or the other on SeqAnswers.com Langmead and Salzberg, 2012, Nat. Methods

STAR Splicing Transcripts Alignment to a Reference Two steps: Seed searching and clustering/stitching/scoring (find MMP -maximal mappable prefix using Suffix Arrays) Fast splice aware aligner, high memory (RAM) footprint Can detect chimeric transcripts Generate indices using a reference genome fasta, and annotation gtf or gff from Ensembl/UCSC. Dobin et al., 2013 Bioinformatics

Alignment Concepts and Terminology Edit distance Mapping quality: the confidence that the read is correctly mapped to the genomic coordinates Probability mapping is incorrect = 10-MQ/10 Proper pairs not not

Alignment Concepts and Terminology Read 1 Adapter Read 1 Read 2 Read 2 Adapter Insert size = the sequence between two adapters Fragment length = insert size + adapters Read 1 Adapter Read 1 Read 2 Read 2 Adapter Inner distance = the length of the inner part of DNA, not shown by reads

Insert size Find insert size and mean and median bamtools stats -i foo.bam –insert Or samtools stats --insert-size foo.bam bbmap.sh ref=reference.fasta in1=read1.fq in2=read2.fq ihist=ihist_mapping.txt out=mapped.sam maxindel=200000 Or one of the many other options available (just Google)…..

Alignment Concepts and Terminology Multimapped reads: reads that align equally well to more than one reference location -How multimapped reads are handled depends on parameter settings selected, program, application, and how many time reads map to multiple places Duplicate reads: reads that arise from the same library fragment -How duplicates are handled also depends on parameter settings selected, program, application -Can arise during library prep (PCR duplicates) or during colony formation (optical duplicates)

Mappability Not all of the genome is ‘available’ for mapping when reads are aligned to the unmasked genome. Uniqueness: This is a direct measure of sequence uniqueness throughout the reference genome. Rozowsky, (2009)

FASTQ Format Text files containing header, sequence, and quality information Quality information is in ACII format Q = -10 log10p https://www.drive5.com/usearch/manual/fastq_files.html

BAM and SAM format SAM file is a tab-delimited text file that contains sequence alignment information BAM files are simply the binary version (compressed and indexed version )of SAM files  they are smaller Example: Header lines (begin with “@”) Alignment section

Understanding SAM flags https://broadinstitute.github.io/picard/explain-flags.html

Processing SAM/BAM files SAMtools is a software suite which provides various utilities for manipulating alignments in SAM format Sorting, merging, indexing, and generating alignments in a per-position format import: SAM-to-BAM conversion view: BAM-to-SAM conversion and sub alignment retrieval sort: sorting alignment merge: merging multiple sorted alignments index: indexing sorted alignment faidx: FASTA indexing and subsequence retrieval tview: text alignment viewer pileup: generating position-based output and consensus/indel calling RSamTools package in Bioconductor allows similar functionality in R.

Observe Data in a Genome Browser known genes forward reverse

Observe Data in a Genome Browser differential expression alternative splicing known genes forward reverse noise or regulation? unannotated gene

Quantifying Genes Now that alignment has been performed, reads can be counted and a matrix created, quantifying genes and transcripts Many techniques exist HTSeq-Count FeatureCounts RSEM Pseudoaligners

Demonstration Download public dataset from a repository STAR alignment of RNA-seq data SAMtools

Setup your workspace #login to Scholar and go to scratch space ssh username@scholar.rcac.purdue.edu cd $RCAC_SCRATCH #copy folder cp –r /scratch/scholar/natallah/BioinfLecture_2019 ./ #make your own workspace mkdir myWork cd myWork mkdir genome mkdir indices mkdir fastqFiles mkdir annotation mkdir STAR_align

Download Genome Sequence Reference genomes can be downloaded from UCSC, Ensembl or NCBI genome resources. Here’s the UCSC genome browser url for the human reference GRCh38: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ cd genome wget ftp://ftp.ensembl.org/pub/release- 77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz cd ../

Download Annotation if your reference genome is from Ensembl get GTF file from Ensembl otherwise get from UCSC table browser cd annotation wget ftp://ftp.ensembl.org/pub/release- 90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz gunzip Homo_sapiens.GRCh38.90.gtf.gz head Homo_sapiens.GRCh38.90.gtf cd ../

Copy Data cd fastqFiles cp ../../BioinfLecture_2019/fastqFiles/* ./ cd ../

STAR alignment of RNA-seq data Generate genome indices: cd indices cp ../../BioinfLecture_2019/indices/star_build.sub ./ edit paths in star_build.sub file (for example change /scratch/scholar/natallah/BioinfLecture_2019/indices to /scratch/scholar/username/myWork/indices) qsub star_build.sub qstat | grep username qdel jobID.scholar-adm cd ../ mkdir indices2 cp ../BioinfLecture_2019/indices/* indices2

STAR alignment of RNA-seq data Align reads to genome cd STAR_align cp ../../BioinfLecture_2019/STAR_align/star.sub ./ edit path names (make sure your genome folder directory is indices2 and change BioinfLecture_2019 to myWork) qsub star.sub qstat | grep username cat star.sub.*

SAMtools # look at the first 10 lines of your SAM file #convert to BAM module use /apps/group/bioinformatics/modules/ module load samtools head severe10Aligned.out.sam #convert to BAM samtools view -bT ../genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa severe10Aligned.out.sam > severe10Aligned.out.bam Sort BAM file samtools sort severe10Aligned.out.bam -o severe10Aligned_sorted.bam