High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.

Slides:



Advertisements
Similar presentations
Facilitator: Richard Bruskiewich
Advertisements

Chapter 5 Operating Systems. 5 The Operating System When working with multimedia, the operating system is perhaps the most important, the most complex,
NCBI resources III: GEO and expression data analysis Yanbin Yin Fall
Systems Software Operating Systems.
Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Accessing the Internet with Anonymous FTP Transferring Files from Remote Computers.
An Introduction to RNA-Seq Transcriptome Profiling with iPlant
Eucalyptus Virtual Machines Running Maven, Tomcat, and Mysql.
Li and Dewey BMC Bioinformatics 2011, 12:323
Customized cloud platform for computing on your terms !
4 1 Operating System Activities  An operating system is a type of system software that acts as the master controller for all activities that take place.
MRNA protein DNA Activation Repression Translation Localization Stability Pol II 3’UTR Transcriptional and post-transcriptional regulation of gene expression.
Logging in to the Maine Innovation Cloud (and some other stuff) BMB550.
Genomics Virtual Lab: analyze your data with a mouse click Igor Makunin School of Agriculture and Food Sciences, UQ, April 8, 2015.
Dedan Githae, BecA-ILRI Hub Introduction to Linux / UNIX OS MARI eBioKit Workshop; Nov , 2014.
Cluster Computing Applications for Bioinformatics Thurs., Aug. 9, 2007 Introduction to cluster computing Working with Linux operating systems Overview.
CISC105 General Computer Science Class 1 – 6/5/2006.
CHAPTER FOUR COMPUTER SOFTWARE.
Introduction to Short Read Sequencing Analysis
Introduction to Interactive Media Interactive Media Tools: Software.
Generic substitution matrix -based sequence similarity evaluation Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W.
Computing Fundamenatls CMSC 201 Computer Science I Penny Rheingans University of Maryland Baltimore County (with inspiration from previous 201 instructors.
HTML Hyper Text Markup Language A simple introduction.
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
NGS data analysis CCM Seminar series Michael Liang:
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Chip – Seq Peak Calling in Galaxy Lisa Stubbs Chip-Seq Peak Calling in Galaxy | Lisa Stubbs | PowerPoint by Casey Hanson.
Operating System What is an Operating System? A program that acts as an intermediary between a user of a computer and the computer hardware. An operating.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
CS4710 Why Progam?. Why learn to program? Utility of programming skills: understand tools modify tools create your own automate repetitive tasks automate.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Data Workflow Overview Genomics High- Throughput Facility Genome Analyzer IIx Institute for Genomics and Bioinformatics Computation Resources Storage Capacity.
How do we represent the position specific preference ? BID_MOUSE I A R H L A Q I G D E M BAD_MOUSE Y G R E L R R M S D E F BAK_MOUSE V G R Q L A L I G.
Application of Bioinformatics in Genetic Research Instructors: Dr. Henry Baker Dr. Luciano Brocchieri Dr. Michele Tennant Dr. Lei Zhou
Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015
Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.
Trinity College Dublin, The University of Dublin Data download: bioinf.gen.tcd.ie/GE3M25/project Get.fastq.gz file associated with your student ID
Remote Access Usages. Remote Desktop Remote desktop technology makes it possible to view another computer's desktop on your computer. This means you can.
Unix Servers Used in This Class  Two Unix servers set up in CS department will be used for some programming projects  Machine name: eustis.eecs.ucf.edu.
….. The cloud The cluster…... What is “the cloud”? 1.Many computers “in the sky” 2.A service “in the sky” 3.Sometimes #1 and #2.
Chip – Seq Peak Calling in Galaxy Lisa Stubbs Lisa Stubbs | Chip-Seq Peak Calling in Galaxy1.
+ Vieques and Your Computer Dan Malmer & Joey Azofeifa.
User-friendly Galaxy interface and analysis workflows for deep sequencing data Oskari Timonen and Petri Pölönen.
+ Introduction to Unix Joey Azofeifa Dowell Lab Short Read Class Day 2 (Slides inspired by David Knox)
Short Read Workshop Day 1 - Experimental Design Example 1: How to log in to vieques.
IGV Demo Slides:/g/funcgen/trainings/visualization/Demos/IGV_demo.ppt Galaxy Dev: 0.
نظام المحاضرات الالكترونينظام المحاضرات الالكتروني Computer Software.
Intro to GNU/Linux See, Stallman? I said GNU. Are you happy now?
Canadian Bioinformatics Workshops
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Practice:submit the ChIP_Streamline.pbs 1.Replace with your 2.Make sure the.fastq files are in your GMS6014 directory.
Short Read Workshop Day 1 - Experimental Design Example 1: How to log in to vieques.
July LJM Introduction to Bioinformatics Lisa Mullan, HGMP-RC.
QuasR: Quantify and Annotate Short Reads in R Anita Lerch, Dimos Gaidatzis, Florian Hahne and Michael Stadler Friedrich Miescher Institute for Biomedical.
Chapter 5 Operating Systems.
Canadian Bioinformatics Workshops
Welcome to Indiana University Clusters
Computing challenges in working with genomics-scale data
Stubbs Lab Bioinformatics - 2 Retrieving sequence data files and Linux commands Nov 17, 2016 Joe Troy.
Computing Fundamenatls CMSC 201 Computer Science I Penny Rheingans University of Maryland Baltimore County (with inspiration from previous 201 instructors.
Lab 1 introduction, debrief
Workshop on Microbiome and Health
University of Pittsburgh
MapView: visualization of short reads alignment on a desktop computer
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Information Technology Ms. Abeer Helwa
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Presentation transcript:

High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering genomic patterns from HTS data.

Mac user, type in terminal: $ ssh If you do not have an HPC acct: $ ssh Windows, Open in Putty: gator.hpc.ufl.edu or Practice: log into UFHPC / Linux server.

Connect and log into the system with Putty. Switch to scratch: “$> cd /scratch/lfs/username” Make a directory for the course: “$> mkdir GMS6014” Type “ls” or “ls -l” to verify the folder. Enter into the folder by typing “$> cd GMS6014” Practice: navigate HPC

Practice: Decompress achieve files. Download the two HTS archieve files with “$> wget –c URL” First load the “sra” module, “$> module load sra” Decompress both.sra files in the directory: “$> fastq- dump.2 *.sra” Practice: Retrieve HTS files from GEO

Large Data Set Analysis. Hardware considerations: 1.) Data storage.  FASTA record of a protein (1,000 aa) ~ 1 KB.  Human proteome, or Chromosome 21 ~ 50 MB  Human genome ~ 1.5 GB  HTS transcriptome analysis (4 40 million reads each) original and derived data sets ~ 200 GB

Large Data Set Analysis. Hardware considerations: 2.) Processors and RAM.  Comparison: tbalstn of 5 protein sequences against 1.2GB genome, ~15 sec CPU time. Map a single 10 M reads illumina run to human genome ~15,000 CPU sec (> 4 hours).  When RAM < data size, the computer will come to a crawl.

Large Data Set Analysis. Hardware considerations: 3.) Operating system determines the availability of tools.  Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.  Easy control and automation.  Most tools are portable to Mac OSX, but often requires recompiling the source code.

Navigating the Linux command line environment: User rights ~ Program can not run unless you have the rights to read/write/execute the file. Basic commands to survive: cd, ls -l, cp, mv, pwd, chmod, etc.

High Throughput DNA-Sequencing (HTS) data analysis 1.Sources and representation of HTS data. 2.Visualization of HTS data. 3.Discovering genomic pattern from HTS data. 4.Integrated data analysis and hypothesis- generating exploration.

 Your own ( sequencing service ).  Public databases, such as NCBI/GEO.  Major genomic /epigenomic projects, such as ENCODE (ENCylopedia Of DNA Elements); the Cancer Genome Project, etc.  Other internet sources. Source of HTS data

Retrieving HTS data  Retrieving HTS data from the web using wget.  Loading to and unloading data from UFHPC (check with HPC instructions).

Recoding sequence information – sequence file format FASTA format– suitable for single gene or genomic region, pre-genomic era. > Gene_name or accession, (other info) ACTGGGTTTATGACGTGTCATGCATGCA ATGTAGCTAGATGCTAGCTAGATGCTAG CTAGATGCTA…. Defined format is necessary for computers to identify and process the information.

Recording sequence reads from the machine – FASTQ FASTA: >My_sequence AATTACGCGCGATACGAT AATTACGCGCGATACGAT +My_sequence quality efcfffffcfeeYBBsdf Recording of quality assessment allows filtering based on sequence quality.

Paint the sequence reads to the genome HTS AATTACGCGCGATACGAT + ACCGAGGCGCGTATGTCT + efcfffffcfeeYBBsea Corresponding location on the genome ELAND (Illumina) Bowtie, etc. ChIP-Seq; RNA-Seq De novo assembly of genomes, chromatin conformation, genomic abnormality, etc…

Recording sequence and quality information FASTQ format = FASTA + TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTT +HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1 !"#$%&'()*+,-./ :; Two identification lines +) for each sequence. Identification line format depends on specific sequencing platform. Quality line using characters representing integer values.

HTS data file  Sequence and quality information are recorded as multi-FASTQ files.  For efficient storage and transmission, they are transformed into SRA (Sequence Read Archives) format. Observe: transform the SRA file to fastq. “$ fastq-dump.2 path_to_sra_file”

Representation of (HTS) data – BED (Browser Extensible Data) file chr U00+ chr U10- chr U20+ chr U10- chr U20+ Chrom.Start EndnameScorStrand With the completion of the genome, there is no need to record the base pair identity (if it is the same as the reference genome). Detailed description of genomic data formats:

HTS data – map to genome  “bwa” or “bowtie” are the two most popular software that implement a similar strategy (Burrows-Wheeler Transform).  Can benefit from multi-processor. map the reads to hg19. bowtie2 -x hg19 -U SRR fastq -p 2 -S Input.sam bowtie2 -x hg19 -U SRR fastq -p 2 -S P53ChIP.sam

ChIP-Seq – identifying TF binding sites  MACS- Model-based Analysis of ChIP-Seq Practice: Identifying peacks macs2 callpeak -t P53ChIP.sam -c Input.sam -f SAM -g dm -n P53_GM B