NGS data analyses with BioUML Fedor Kolpakov Biosoft.Ru, Ltd. Institute of Systems Biology, Ltd. Novosibirsk, Russia.


Agenda
- BioUML overview
- NGS tools: quality control, alignment tools, annotation tools, workflows
- Genome browser
- Archakov’s genome
- Ribosome profiling
- Live demonstration

BioUML overview

BioUML platform
BioUML is an open source integrated platform for systems biology. It covers a comprehensive range of capabilities, including access to databases with experimental data and tools for formalized description, visual modeling, and analysis of complex biological systems. Support for scripts (R, JavaScript) and workflows makes it well suited to the analysis of high-throughput data. The plug-in based architecture (the Eclipse runtime from IBM is used) allows new functionality to be added as plug-ins. The BioUML platform consists of 3 parts:
- BioUML server: provides access to biological databases;
- BioUML workbench: standalone application;
- BioUML web edition: web interface based on AJAX technology.

Main platforms for bioinformatics and BioUML
- Taverna: standalone application, powerful workflows
- Galaxy: workflows, web interface, collaborative research, genome browser
- R/Bioconductor: scripts, statistics, plots
- BioClipse: Eclipse plug-in based architecture, chemoinformatics
- BioUML platform: standalone application, powerful workflows; web interface, collaborative research, genome browser; scripts, statistics, plots; Eclipse plug-in based architecture, chemoinformatics; plus systems biology (visual modelling, simulation, parameter fitting, …) and chat for on-line consultations

Market and platform
- Android market: Android
- AppStore: MacOS, iPod, iPhone
- Biostore: BioUML

BioUML ecosystem: Biostore and the BioUML platform
- Developers (provide tools and databases): plug-ins (methods, visualization, etc.), databases
- Users (use): subscriptions, collaborative & reproducible research
- Experts (provide services): services for data analysis, on-line consultations

NGS
- methods integrated into BioUML (Bowtie, MACS, ChIPHorde, ChIPMunk, …)
- programs integrated into Galaxy
- R packages
- annotation of detected peaks (SNPs, sites, etc.)
- visualization
- workflows: ChIP-seq, RNA-seq, human genome assembly and annotation (in progress)
- support for parallelizing external programs as part of a workflow
- GTRD database (built from ChIP-seq data)
- dedicated servers: Amazon EC2 (on demand); Biodatomics (64 cores, 256 GB of memory)

Galaxy – analyses methods

Galaxy - workflow

Raw data preprocessing
- Track statistics: gathers various statistics about a track or FASTQ file.
- Preprocess raw reads: removes reads that fail simple quality tests, removes adapters, and trims low-quality bases from read ends.
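The three preprocessing operations named on this slide can be illustrated with a minimal Python sketch. It is not the BioUML implementation: the function names, the Phred+33 encoding, the exact-match adapter search, and the thresholds (`min_q`, `min_len`, `min_mean_q`) are all assumptions chosen for illustration.

```python
def phred33(qual: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual]


def trim_read(seq: str, qual: str, adapter: str = "", min_q: int = 20) -> tuple[str, str]:
    """Cut the read at the first exact adapter match, then trim low-quality ends."""
    if adapter:
        idx = seq.find(adapter)
        if idx != -1:
            seq, qual = seq[:idx], qual[:idx]
    scores = phred33(qual)
    start, end = 0, len(scores)
    # Drop bases below min_q from the 5' end, then from the 3' end.
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def passes_filter(seq: str, qual: str, min_len: int = 20, min_mean_q: float = 20.0) -> bool:
    """Simple quality test: the read must be long enough and good on average."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred33(qual)
    return sum(scores) / len(scores) >= min_mean_q
```

A read would typically go through `trim_read` first and be kept only if `passes_filter` accepts the trimmed result.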

Short read alignment:
- Bowtie: fast; no indels; used for ChIP-seq
- Novoalign: single-end and paired-end reads; nucleotide and color space; handles indels; finds globally optimal alignments using the full Needleman-Wunsch algorithm
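For readers unfamiliar with the Needleman-Wunsch algorithm mentioned above, here is a small textbook-style sketch of global alignment with a linear gap penalty. It is not Novoalign's implementation; the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # (mis)match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the whole dynamic-programming matrix is filled, the result is a global optimum under the chosen scores, and gaps in the output correspond to indels.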

RNA-seq with TopHat and Cuff* tools

ChIP-seq
- Bowtie for alignment
- MACS for peak calling
- ChIPMunk, IPS, MEME for motif discovery

Popular NGS toolboxes available: GATK, Picard, SAMtools

An example: workflow for analyses of ChIP-Seq data

An example: RNA-seq workflow

NGS data quality control
2 examples:
- RNA-seq data (rat, IPS)
- genome data: Archakov’s genome

Track statistics (FastQC)
- Estimates the quality of raw or aligned reads, as the FastQC program does
- All original FastQC processors are supported
- Works faster than FastQC
- Additional processor: Overrepresented prefixes
- The Overrepresented K-mers processor is more precise (it does not skip 80% of sequences)
- Along with the HTML report, separate statistics tables are generated and are accessible for further analysis
- Several reports can be merged into a composite report
- Like any BioUML analysis, it can become part of a workflow, script, etc.
- Tested on Archakov AP3 (raw reads: 5.9Gb csfasta + 12.7Gb qual), analysis time: 36 min (all processors)
- Tested on Zakian db50 (raw reads: 6.5Gb fastq), analysis time: 7 min (all processors)
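The idea behind an overrepresented k-mers processor can be sketched as follows. This is an illustrative Python example, not the BioUML or FastQC code: it counts k-mers over every read (no subsampling) and compares each count with the count expected if all k-mers were equally likely. The uniform expectation and the `min_ratio` threshold are simplifying assumptions.

```python
from collections import Counter


def overrepresented_kmers(reads, k: int = 5, min_ratio: float = 3.0):
    """Return (kmer, count, observed/expected) tuples, most enriched first."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:          # skip k-mers containing uncalled bases
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    expected = total / 4 ** k            # assumes a uniform k-mer distribution
    hits = [(kmer, n, n / expected)
            for kmer, n in counts.items() if n / expected >= min_ratio]
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

The resulting list is a plain table of rows, which matches the slide's point that statistics should remain available for further analysis rather than only appearing in an HTML report.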

Track statistics launch
- Input data: BAM, FASTQ and SOLiD (colorspace) data are supported
- Choose whether reads should be aligned by the left or right side
- Switch off individual processors to save time

Track statistics results (Archakov AP3): Quality per base

Track statistics results (Archakov AP3): Quality per sequence

Track statistics results (Archakov AP3): Nucleotide content per base

Track statistics results (Archakov AP3): GC content per base

Track statistics results (Archakov AP3): GC content per sequence

Track statistics results (Archakov AP3): N content per base

Track statistics results (Archakov AP3): Duplicate sequences

Track statistics results (Archakov AP3): Overrepresented sequences and 5-mers

Track statistics results (Archakov AP3): Overrepresented prefixes

Track statistics results (Zakian db50): Quality per base

Track statistics results (Zakian db50): Quality per sequence

Track statistics results (Zakian db50): Nucleotide content per base

Track statistics results (Zakian db50): GC content per base

Track statistics results (Zakian db50): GC content per sequence

Track statistics results (Zakian db50): N content per base

Track statistics results (Zakian db50): Duplicate sequences

Track statistics results (Zakian db50): Overrepresented sequences and 5-mers

Genome browser

Genome browser: main features
- uses AJAX and HTML5 technologies
- interactive: dragging, semantic zoom
- tracks support: Ensembl, DAS servers, user-loaded BED/GFF/Wiggle files

DAS
The Distributed Annotation System (DAS) defines a communication protocol used to exchange annotations on genomic or protein sequences. It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites. Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients. DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view. DAS is heavily used in the genome bioinformatics community. In recent years it has also seen growing acceptance in the protein sequence and structure communities.
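The client side of this model, one client collating annotations from several servers, might look like the Python sketch below. The URL layout and the XML element names follow the DAS 1.x `features` command as commonly documented, but treat them as assumptions to check against the specification; the server names and the sample responses are invented, and no network request is made.

```python
import xml.etree.ElementTree as ET


def das_features_url(server: str, source: str, segment: str, start: int, stop: int) -> str:
    """Build a DAS 'features' request URL for one genomic segment."""
    return f"{server.rstrip('/')}/das/{source}/features?segment={segment}:{start},{stop}"


def parse_features(xml_text: str, server_name: str) -> list[dict]:
    """Extract features from one DASGFF response (element names assumed)."""
    root = ET.fromstring(xml_text)
    features = []
    for seg in root.iter("SEGMENT"):
        for feat in seg.iter("FEATURE"):
            features.append({
                "server": server_name,
                "segment": seg.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
            })
    return features


def collate(responses: dict[str, str]) -> list[dict]:
    """Merge features from several servers into one position-sorted view."""
    merged = []
    for name, xml_text in responses.items():
        merged.extend(parse_features(xml_text, name))
    return sorted(merged, key=lambda f: (f["segment"], f["start"]))


def sample_response(feature_id: str, ftype: str, start: int, end: int) -> str:
    """Hypothetical server response used only for this illustration."""
    return (f'<DASGFF><GFF version="1.0"><SEGMENT id="1" start="1000" stop="2000">'
            f'<FEATURE id="{feature_id}"><TYPE id="{ftype}">{ftype}</TYPE>'
            f'<START>{start}</START><END>{end}</END></FEATURE>'
            f'</SEGMENT></GFF></DASGFF>')


if __name__ == "__main__":
    print(das_features_url("http://example.org", "demo", "1", 1000, 2000))
    view = collate({"serverA": sample_response("snp7", "SNP", 1500, 1500),
                    "serverB": sample_response("exon1", "exon", 1200, 1350)})
    for f in view:
        print(f["server"], f["type"], f["start"], f["end"])
```

Keeping `parse_features` separate from `collate` mirrors the separation the protocol makes between data distribution (servers) and integration for display (the client).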

Genome browser
Two BAM tracks are compared with each other (example view on Human NCBI37 Chr.1). A profile showing the coverage is visible.

Genome browser
Upon zooming in, individual reads become visible. All information associated with the selected read is displayed in the Info box.

Genome browser
At the detailed scale, a graph of phred qualities is displayed along with the nucleotides that differ between the read and the reference sequence.

NGS data: Archakov’s genome

Preprocessing
1. Remove duplicates
The purpose is to mitigate the effects of PCR amplification bias introduced during library construction. Two read pairs are considered duplicates if they align to the same genomic position. >60% were removed as duplicates. Alignments after this step:
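The duplicate criterion can be made concrete with a short Python sketch. It is a simplification, not the tool used for Archakov's genome: the `ReadPair` record is hypothetical, "same genomic position" is taken to mean same chromosome, same positions of both mates and same strand, and keeping the pair with the highest summed base quality is an assumption modeled on common practice.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    pos: int        # leftmost aligned position of the first mate
    mate_pos: int   # leftmost aligned position of the second mate
    strand: str     # '+' or '-'
    qual_sum: int   # sum of base qualities over both mates


def remove_duplicates(pairs):
    """Keep one representative per group of pairs aligned to the same position."""
    groups = defaultdict(list)
    for p in pairs:
        groups[(p.chrom, p.pos, p.mate_pos, p.strand)].append(p)
    # Within each group, keep the pair with the best total base quality.
    kept = [max(group, key=lambda p: p.qual_sum) for group in groups.values()]
    return sorted(kept, key=lambda p: (p.chrom, p.pos))
```

Grouping by a position key makes the pass linear in the number of pairs, which matters when more than half of the input turns out to be duplicated.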

Preprocessing
2. Local realignment
Read mapping algorithms operate on each read independently. Local realignment re-aligns the reads in a region together so that the number of mismatching bases is minimized across all the reads.

Preprocessing
3. Remove duplicates after realignment
Realignment may change the genomic positions of read pairs, so additional duplicates can be identified after this step. 712 reads were removed (< %)

Preprocessing
4. Recalibration of base quality values
For each base in each read, various covariates are calculated (such as reported quality score, cycle, dinucleotide, GC content). These values are used to build a model that predicts sequencing errors. The model is then applied to calculate an empirical base quality score, which overwrites the phred quality score currently in the read.
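A heavily simplified Python sketch of this idea follows. It uses only two covariates (reported quality and cycle), counts mismatches against the reference per covariate bin, and converts the smoothed error rate to a phred score with Q = -10·log10(error rate). The function names, the two-covariate model and the +1/+2 smoothing are assumptions for illustration; this is not the GATK implementation.

```python
import math
from collections import defaultdict


def empirical_quality(mismatches: int, total: int) -> int:
    """Phred-scaled empirical quality with +1/+2 smoothing to avoid log(0)."""
    error_rate = (mismatches + 1) / (total + 2)
    return round(-10 * math.log10(error_rate))


def build_model(observations):
    """observations: iterable of (reported_q, cycle, is_mismatch) per base."""
    counts = defaultdict(lambda: [0, 0])      # bin -> [mismatches, total bases]
    for reported_q, cycle, is_mismatch in observations:
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {key: empirical_quality(mm, n) for key, (mm, n) in counts.items()}


def recalibrate(reported_q: int, cycle: int, model: dict) -> int:
    """Replace the reported score with the empirical one when the bin is known."""
    return model.get((reported_q, cycle), reported_q)
```

For example, if bases reported as Q30 at a given cycle mismatch the reference about one time in ten, the model maps them to roughly Q10, which is the score a downstream variant caller should actually trust.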

Genotyping
1. Call SNVs with the GATK 'Unified Genotyper'.
2. Assign a well-calibrated probability to each variant call: estimate the probability that an SNV is a true genetic variant versus a sequencing or data processing artifact, given the SNP call annotations provided by 'Unified Genotyper' (for example DepthOfCoverage, StrandBias, HaplotypeScore, ReadPosRankSumTest).
   o Variant Annotator: creates the set of "true variants" from the dbSNP, HapMap and 1000 Genomes databases.
   o Variant Recalibrator: creates a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set ("true variants").
   o Apply Variant Recalibration: applies the model parameters to each variant identified by Unified Genotyper, calculating the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
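The log-odds calculation in the last step can be sketched in Python as below. This is an illustration under stated assumptions, not GATK code: it uses one mixture for true variants and a second for artifacts, diagonal covariances, and made-up parameters over two hypothetical annotations; in practice the parameters would come from training on the "true variants" set.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_mixture(x, components):
    """Log density of a mixture; components are (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian(x, mean, var) for w, mean, var in components]
    top = max(terms)                      # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_odds(x, true_model, false_model):
    """Positive values favour a true variant, negative values an artifact."""
    return log_mixture(x, true_model) - log_mixture(x, false_model)


# Hypothetical one-component models over (depth of coverage, strand bias).
TRUE_MODEL = [(1.0, [30.0, 0.0], [25.0, 1.0])]
FALSE_MODEL = [(1.0, [5.0, 4.0], [25.0, 1.0])]

if __name__ == "__main__":
    print(log_odds([30.0, 0.0], TRUE_MODEL, FALSE_MODEL))   # well-supported call
    print(log_odds([5.0, 4.0], TRUE_MODEL, FALSE_MODEL))    # artifact-like call
```

Variants can then be ranked by this score and filtered at a chosen threshold, which is what makes the assigned probability useful for the later filtering step.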

Genotyping
3. Call indels with the GATK 'Unified Genotyper'.
4. Assign a well-calibrated probability to each indel. Similar to SNV calling, but only indels from 1000 Genomes are used as "true variants".

Genotyping
5. Filter out low-quality variant calls (SNVs, indels).
6. Annotate identified variants relative to genes.

Genotyping
Affected genes: #de=data/Collaboration/Dr.Archakov/Data/alignment/ Ap1.bam-CleanedAlignment/Genotyping2/tmp/Raw-affected-annotated

Genotyping: potential loss of function
118 genes have mutations that potentially affect function.
Mutation in the exon of MAP4K3

Gene ontology classification Full table

Genome browser
Example of how a deletion and an insertion are presented in the genome browser

Ribosome profiling

Live demonstration