Facilitator: Richard Bruskiewich

Presentation transcript:

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. March 7th, 2012, IRMACS, SFU. Facilitator: Richard Bruskiewich, Adjunct Professor, MBB.
Speaker notes: Welcome (title slide, 2 minutes), followed by advance thank-yous to Jim Mattson, Felix Breden, the IRMACS team, the WestGrid team and Fiona Brinkman (details on the acknowledgments slide).

Today’s Agenda – Part 1
Welcome and Acknowledgments
Some administrative details…
Introductions: Facilitator, Participants
10 minute break
Speaker notes: Introduce myself (5 minutes, 1 slide bio). Administrative details of the workshop (5 minutes, 1 slide: schedule, rooms, dates, times, payment; material prerequisites such as a computer). Invite participants (round the table) to give a ~1 minute talk each (~30 minutes in total): your name, department, lab and “port of origin”; your research focus; how bioinformatics (NGS) can support that research; what NGS data of your own you have to analyse *now*; expectations for the workshop. Survey results (Part I): most of you have not yet taken a bioinformatics course; 1 – 2 of you took significant MBB courses. Overview of expectations from the survey. 10 minute break.

Advance Acknowledgments
Jim Mattson: for championing the workshop idea
Felix Breden: for championing the idea of IRMACS bioinformatics support and endorsing this workshop
IRMACS team: Pam Borghardt, IRMACS Managing Director (sponsorship); Brian Corrie, Technical Director (workshop infrastructure)
WestGrid team: Ata Roudgar and Martin Siegert (workshop HPC infrastructure)
Fiona Brinkman: for her kind permission to adapt a number of her MBB introductory bioinformatics course slides for portions of the workshop

Workshop Schedule
Lectures: Wednesdays, 12:30 – 14:30. Demo/Labs: Thursdays, 9:30 – 11:30.

Topic                                                                                      | Lecture    | Demo/Lab
Bioinformatics Overview (roughly equivalent to core MBB 441/741 topics)
Workshop Overview and Practical Informatics Considerations                                 | March 7th  | March 8th
Sequence Formats, Databases and Visualization Tools                                        | March 14th | March 15th
Sequence Alignment and Searching                                                           | March 21st | March 22nd
Principles of Structural Genomics and Overview of Next Generation Sequencing Technologies  | March 28th | March 29th
Sequence Assembly Algorithms                                                               | April 4th  | April 5th
Specific Applications
Sequence Assembly of Transcriptomes                                                        | May 2nd    | May 3rd
Sequence Assembly of Whole Genomes                                                         | May 9th    | May 10th
Annotation of de novo Assembled Sequences                                                  | May 16th   | May 17th
Identification and Analysis of Sequence Variation                                          | May 23rd   | May 24th
Comparative Genomic Analysis and Visualization                                             | May 30th   | May 31st
Meta-Analysis of Newly Annotated Sequence Data                                             | June 6th   | June 7th

Venue The workshop lectures and demo/labs will generally take place here, in the IRMACS Centre, Room 10900 (top floor, Applied Sciences Building) with the exception of the March 14th and May 9th lectures, plus the May 10th lab/demo for which there is a meeting conflict in IRMACS. These particular sessions will instead be convened in BioSci room B9242. The lab/demo sessions on March 8th, 15th and 29th will end earlier, at 11 am, to accommodate the next scheduled event in IRMACS 10900.

Workshop Fee Sign-up list to Barbara Sherman… will contact PI for billing(?)

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. Section: Introductions.
Speaker notes: Roadmap of the workshop (10 minutes, 3 slides: program revisited, plus tech structure/flow diagram(?)).

Facilitator Richard: A Brief Bio
Professional Experience
2009 – present: Adjunct Professor, MBB, SFU
2000 – 2011: Research Scientist, Computational and Systems Biology, Bioinformatics, International Rice Research Institute (IRRI; irri.org)
1999 – 2000: Postdoc, Human Analysis Team, Sanger Centre, Cambridge, UK
Academic Background
1999: PhD (Medical Genetics), UBC
1992: B.Sc. (Biochemistry, Molecular Biology & Genetics), UBC
1987: B.A. (Minor Computing), SFU
Personal
Originally from Edmonton; moved to GVRD in late teens and resided here for over 2 decades before travelling abroad to work.
Wife is Filipina-Canadian (hence the job in the Philippines); 3 teenage kids (son in his late teens has just started in the SIAT program at SFU Surrey).
Returned last June to reside in Port Moody, at the foot of Burnaby Mountain.

Participants: “Around the table”
Your name, department, lab (PI)
(Optional) your “port of origin”
What is your research focus?
How can bioinformatics (NGS) support that research?
What NGS data of your own do you have to analyse *now*?
Expectations for the workshop…

10 minute break…

Today’s Agenda – Part 2 What is Bioinformatics and why is it needed? What is “Next Generation Sequencing” Coping with the NGS bioinformatics challenge The Workshop Road Map Looking ahead…

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. Section: What is bioinformatics?

Bioinformatics is… The development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes; The development of methods for the management and analysis of biological information arising from genomics and high-throughput biological experiments.

Why is there Bioinformatics? (Source: Fiona Brinkman, Bioinfo Course, Summer 2002.) Huge datasets: lots of new sequences being added (automated sequencers, genome projects, metagenomics), plus RNA sequencing, microarray studies, proteomics, … Patterns in datasets that can be analyzed using computers.

Need for informatics in biology: origins (1). Early amino acid sequences: gramicidin S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951). 1961: tRNA fragments. Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three-base code and mediates the synthesis of proteins (Crick et al., 1961. General nature of the genetic code for proteins. Nature 192: 1227-1232). First codon assignment: UUU/Phe (Nirenberg and Matthaei, 1961). Source: Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press, 1999, p. 384.

Need for informatics in biology: origins (2). The key to the whole field of nucleic acid-based identification of microorganisms was the introduction of molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling (Zuckerkandl, E., and L. Pauling. 1965. "Molecules as Documents of Evolutionary History." Journal of Theoretical Biology 8:357-366). Another landmark: nucleic acid sequencing (Sanger and Coulson, 1975).

Need for informatics in biology: origins (3). First genomes sequenced: the 3.5 kb RNA bacteriophage MS2 (Fiers et al., 1976) and the 5.4 kb bacteriophage φX174 (Sanger et al., 1977). First complete genome sequence of a free-living organism: the 1.83 Mb genome of Haemophilus influenzae KW20 (Fleischmann et al., 1995). First multicellular organism to be sequenced: C. elegans (C. elegans Sequencing Consortium, 1998). Early databases: Dayhoff, 1972; Erdmann, 1978. Early programs (restriction enzyme sites, promoters, etc.): circa 1978. 1978 – 1993: Nucleic Acids Research published supplemental information.

GenBank and associated resources double faster than Moore’s Law! (Doubling time: less than every 18 months.) See http://en.wikipedia.org/wiki/Moore’s_law (growth figure from the National Center for Biotechnology Information).
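
To put the doubling claim into numbers, here is a minimal Python sketch. It assumes idealized exponential growth with a fixed doubling time; the 18-month value is the upper bound quoted on the slide, and the function name `growth_factor` is purely illustrative.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Fold increase after months_elapsed, assuming a constant doubling time."""
    return 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # 18 months is the upper bound on the doubling time quoted on the slide.
    for years in (3, 6, 9):
        fold = growth_factor(years * 12, 18.0)
        print(f"After {years} years: at least {fold:.0f}-fold more data")
```

With an 18-month doubling time, a database grows at least 64-fold in nine years, which is why storage and search methods have to scale along with it.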

Today: So many genomes…
Genome projects in progress as of mid-August 2010, according to the GOLD GenomesOnline database (counts from 5 years earlier in parentheses):
Eukaryotic (genome and ESTs): 1548 (517)
Prokaryotic: 5006 (740)
Metagenome: 133 (zero)
Total: 6687 projects (as of Sept 2011: >10,000)

Information sources: (Rhesus macaque) Robert F. Service. Science 311: 5767. 1544-1546 (2006). 454 press release, May 31, 2007. http://www.454.com/about-454/news/index.asp?display=detail&id=68 Wellcome Trust Sanger Institute press release, July 2, 2008. http://www.sanger.ac.uk/Info/Press/2008/080702.shtml Complete Genomics article in Bio-IT World: http://www.bio-itworld.com/BioIT_Article.aspx?id=82058 Applied Biosystems press release, October 1, 2008. http://phx.corporate-ir.net/phoenix.zhtml?c=61498&p=irol-abiNewsArticle&ID=1207598&highlight=

The Human Genome: the genome sequence is complete - almost! Approximately 3.2 billion base pairs.

Work ongoing to locate all genes and regulatory regions and describe their functions… …bioinformatics plays a critical role

Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals

Bioinformatics helps with… Sequence Similarity Searching/Comparison. (Source: Fiona Brinkman, Bioinfo Course, Summer 2002.) What is similar to my sequence? Searching gets harder as the databases get bigger, and quality changes. Tools: BLAST and FASTA = early time-saving heuristics (approximate methods). Need better methods for SNP analysis! Statistics + informed judgment of the biologist.
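
The "time-saving heuristic" idea can be sketched in a few lines of Python. This is only a toy illustration of word (k-mer) seeding, not the actual BLAST or FASTA implementation: the database is indexed once by short words, and a query is then matched by word lookup instead of a full alignment against every sequence. The sequence names, the toy sequences and the choice of k = 4 are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def seed_hits(query: str, index: dict[str, list[tuple[str, int]]], k: int) -> dict[str, int]:
    """Count how many query words hit each database sequence (candidate matches)."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _position in index.get(query[i:i + k], ()):
            counts[name] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical toy database; real databases hold millions of sequences.
    database = {"seqA": "ATGGCGTACGTTAGC", "seqB": "TTTTCCCCGGGGAAAA"}
    index = build_kmer_index(database, k=4)
    print(seed_hits("GCGTACGT", index, k=4))
```

Only sequences that share words with the query need a closer (and slower) alignment step, which is where the time saving comes from; the price is that matches without any exact shared word can be missed.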

Bioinformatics helps with… Structure-Function Relationships. (Source: Fiona Brinkman, Bioinfo Course, Summer 2002.) Can we predict the function of protein molecules from their sequence? Sequence > structure > function. Prediction of some simple 3-D structures is possible (α-helix, β-sheet, membrane spanning, etc.).

Bioinformatics helps with… Phylogenetics. (Source: Fiona Brinkman, Bioinfo Course, Summer 2002.) Can we define evolutionary relationships between organisms by comparing DNA sequences? Lots of methods and software; what is the best analysis approach?

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. Section: What is Next Generation Sequencing (NGS)?

Sanger (“dideoxy” or “chain termination”) Sequencing. Single-stranded DNA from the sample* is extended by polymerase from a primer, then randomly terminated by incorporation of a dideoxynucleotide (ddNTP). The resulting variable-length DNA fragments are detected by radiolabelling or by fluorescently labelled ddNTPs. *Sample derived from amplified cDNA, genomic clones or whole genome shotgun.
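
The logic of reading a sequence from chain-terminated fragments can be sketched as follows. This is an idealized Python simulation, assuming that termination occurs at every position of the newly synthesized strand and that fragment lengths are measured perfectly; the function names and the example sequence are illustrative only.

```python
def chain_termination_fragments(synthesized: str) -> dict[str, list[int]]:
    """For each ddNTP reaction (A, C, G, T), list the lengths of fragments ending in that base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Rebuild the sequence by ordering all fragments from shortest to longest."""
    base_by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_by_length[n] for n in sorted(base_by_length))


if __name__ == "__main__":
    lanes = chain_termination_fragments("TAATGTACG")
    print(lanes)            # fragment lengths per ddNTP reaction
    print(read_gel(lanes))  # sequence recovered from the fragment lengths
```

Each fragment length identifies one position, and the ddNTP that ended the fragment identifies the base at that position, so sorting fragments by size spells out the sequence.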

Sanger Pros & Cons. Advantages: relatively accurate; relatively long reads (500 – 1500 bp). Disadvantages: relatively costly in terms of reagents; relatively low throughput.

Next Generation Sequencing (NGS) platforms: Polonator; Roche 454; Life Tech. Ion Torrent; HeliScope; Illumina HiSeq; Life Tech SOLiD; Oxford Nanopore “GridION”; Pacific Biosciences SMRT Cell. (Central diagram label: Sequence Assembly on HPC.)

(General) NGS Pro’s & Con’s Advantages Very high throughput Very cheap data production Disadvantages Relatively short reads Relatively higher error rates Bioinformatics of assembly is much more challenging

General NGS Workflow Template preparation Sequencing & imaging Genome alignment/assembly

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. Section: Coping with the NGS bioinformatics challenge.

Challenge. Assembling “next generation sequence” (NGS) data requires a great deal of computing power and many gigabytes of memory. Software can often execute in parallel on all available central processing unit (CPU) cores. Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power.

“High Performance Computing” and “Cloud Computing” Computer Nodes Network Storage Your local workstation/ laptop

What is Cloud Computing? Pooled resources: shared with many users (remotely accessed). Virtualization: high utilization of hardware resources (no idling). Elasticity: dynamic scaling without capital expenditure and time delay. Automation: build, deploy, configure, provision, and move without manual intervention. Metered billing: “pay-as-you-go”, only for what you use.

Cloud Bioinformatics Module Input Job Message Queue Output Job Message Queue Task- Specialized Server Start-up (w/parameters) Job Status Notification Customized Machine Image Raw Data/ Results/ Snapshots
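
The queue-based module in this diagram can be mimicked in a single process. The Python sketch below uses the standard `queue` and `threading` modules as stand-ins for a cloud message queue service and a task-specialized server; the function names, the `(job_id, payload)` message shape and the status strings are hypothetical choices for illustration, not part of any real cloud API.

```python
import queue
import threading
from typing import Any, Callable


def worker(input_queue: queue.Queue, output_queue: queue.Queue,
           task: Callable[[Any], Any]) -> None:
    """Task-specialized server: take jobs from the input queue, post status to the output queue."""
    while True:
        job = input_queue.get()
        if job is None:  # sentinel message: shut the server down
            break
        job_id, payload = job
        try:
            output_queue.put((job_id, "done", task(payload)))
        except Exception as exc:
            output_queue.put((job_id, "failed", str(exc)))


def run_jobs(jobs: list[tuple[int, Any]], task: Callable[[Any], Any]) -> list[tuple]:
    """Submit jobs, wait for the server to finish, and collect the status notifications."""
    input_queue, output_queue = queue.Queue(), queue.Queue()
    server = threading.Thread(target=worker, args=(input_queue, output_queue, task))
    server.start()
    for job in jobs:
        input_queue.put(job)
    input_queue.put(None)
    server.join()
    results = []
    while not output_queue.empty():
        results.append(output_queue.get())
    return results


if __name__ == "__main__":
    print(run_jobs([(1, "acgt"), (2, "ttaa")], str.upper))
```

Because submitters and servers only share the two queues, more servers can be started (or stopped) without changing the code that submits jobs, which is the property that makes the design elastic.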

A More Complete Picture… Web Portal Project Relational Database Database Loader Raw Data + Results

Case Study in Bioinformatics on the Cloud Used Amazon Web Services http://aws.amazon.com Assembled ~99 raw NGS transcriptome sequence datasets from 83 species, on 16 Amazon EC2 instances with 8 CPU cores, 68 GB of RAM, ~200 hours of computer time, total run in less than one working day. Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and incur significant operating costs overhead (machine room space, power supplies, networking, air conditioning, staff salaries, etc.) The above run could be started up in a few minutes and cost ~ $500 to complete. Once done, no machines left idling and unused…
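
The cost comparison in this case study can be worked through explicitly. The Python sketch below uses only the figures quoted on the slide and assumes that the ~200 hours are aggregate instance-hours across all 16 machines; the derived hourly rate and purchase total are back-of-the-envelope values, not published prices.

```python
# Figures quoted on the slide (all approximate).
INSTANCES = 16
TOTAL_INSTANCE_HOURS = 200       # assumed to be the sum over all instances
CLOUD_COST_USD = 500
PURCHASE_PRICE_PER_MACHINE_USD = 10_000


def cost_per_instance_hour(total_cost: float, instance_hours: float) -> float:
    """Average cloud cost of one machine for one hour."""
    return total_cost / instance_hours


def purchase_cost(machines: int, price_per_machine: float) -> float:
    """Capital cost of buying equivalent machines, before any operating costs."""
    return machines * price_per_machine


if __name__ == "__main__":
    hourly = cost_per_instance_hour(CLOUD_COST_USD, TOTAL_INSTANCE_HOURS)
    capital = purchase_cost(INSTANCES, PURCHASE_PRICE_PER_MACHINE_USD)
    print(f"Cloud: ~${hourly:.2f} per instance-hour")
    print(f"Purchase: ~${capital:,.0f} up front, {capital / CLOUD_COST_USD:.0f}x the cloud run")
```

On these numbers, buying the hardware only breaks even after several hundred comparable runs, and that is before operating costs are counted.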

Software for (NGS) Bioinformatics Bundled with sequencing machines: e.g. Newbler assembler with Roche 454 3rd party commercial: DNA Star (www.dnastar.com) Geneious (http://www.geneious.com/) GeneWiz (http://www.genewiz.com) And others… Open Source: Lots (selected examples to be covered in this workshop)

What do I need to run bioinformatics software locally? Some common bioinformatics software is platform independent, hence will run equally under Windows and UNIX (Linux, OSX) Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?) is to install some version of Linux (suggest “Ubuntu”) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g. http://www.vmware.com/products/player/

But, what are *we* going to use here?

WestGrid @ SFU / IRMACS. WestGrid is a consortium member of “Compute Canada” (https://computecanada.org/). “bugaboo” capability cluster, 4328 cores total: 1280 cores (8 cores/node, 16 GB/node, x86_64, IB) plus 3048 cores (12 cores/node, 24 GB/node, x86_64, IB). 40 Core Years. Access to other WestGrid resources through LAN and WAN. More details from Brian Corrie tomorrow…

Galaxy Genomics Workbench http://galaxy.psu.edu/ (also http://main.g2.bx.psu.edu/)

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. Section: The Workshop Roadmap.
Speaker notes: Roadmap of the workshop (10 minutes, 3 slides: program revisited, plus tech structure/flow diagram(?)).

Road Map (diagram labels): What is Bioinformatics?; Visualization of Sequence & Annotation; NGS; Annotation; Sequence Assembly; Sequences (Formats); Sequence Databases; Search & Alignments.

Specific Applications Sequence Assembly of Transcriptomes Sequence Assembly of Whole Genomes Annotation of de novo Assembled Sequences Identification and Analysis of Sequence Variation Comparative Genomic Analysis and Visualization Meta-Analysis of Annotated Sequence Data

Survey: Workshop Expectations I How to find significance in the huge amount of data that Next Gen sequencing, but also microarrays etc. generate. A basic understanding of how to analyse next generation sequencing data. Learn some hands-on computer experience learning to use software for analysing sequence data; what can be done and how to do it. genome assembly + meta-analysis

Survey: Workshop Expectations II The basics of alignment and SNP calling with next-gen sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming so any info provided through the workshop would be very helpful - thanks) The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts, significance of the adjustable parameters behind the various algorithms used in the workflow.

Survey: Workshop Expectations III I expect to learn the basic bioinformatics tools. Learn different sequence alignment software/technologies (i.e. BWA, Abyss, etc.). Learn more about the complexities of NGS sequencing Next generation sequencing, data analysis etc. Parameters regulating assembly of contigs. How to take raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs How to compare expression profiles using RNA transcriptomes. Want to learn new things

Survey: Operating System Being Used Microsoft Windows on Intel/AMD – 14 (86.7%) Most running Windows 7 (some XP & Vista) One uses Linux through Westgrid and the IRMACS cluster Some of you also thinking of running Linux Apple OS X – 2 (13.3%) Snow Leopard Release Apple Lion, running Windows 7 using Parallels Linux on Intel - 2 (13.3%)

Looking Ahead… What will you need for this workshop? Mainly, just a laptop running a web browser. (Optional) access to Linux/Unix locally (VM Player). Reading list: will give review citations for future lectures. For next week, suggest that you surf to http://www.ncbi.nlm.nih.gov/