Presentation on theme: "Facilitator: Richard Bruskiewich"— Presentation transcript:
1 NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. March 7th, 2012, IRMACS, SFU. Facilitator: Richard Bruskiewich, Adjunct Professor, MBB. Notes: Welcome (title slide, 2 minutes). Advance thank-yous: Jim Mattson; Felix Breden; IRMACS team: Pam Borghardt, IRMACS Centre, Managing Director; Brian, Technical Director; WestGrid team: Ata Roudgar, Martin Siegert; Fiona Brinkman, for the kind permission to adapt a significant number of her introductory bioinformatics course MBB slides for portions of the workshop.
2 Today’s Agenda – Part 1: Welcome and Acknowledgments; Some administrative details…; Introductions: Facilitator, Participants; 10 minute break. Notes: Introduce myself (5 minutes, 1 slide bio). Administrative details of the workshop (5 minutes, 1 slide: schedule/rooms/dates/times/payment; material prerequisites (computer)). Invite participants (round the table) to give a ~1 minute talk (~30 minutes): your name, department, lab (your “port of origin”); What is your research focus? How can bioinformatics (NGS) support that research? What NGS data of your own do you have to analyse *now*? Expectations for the workshop… Survey Results (Part I): most of you have not yet taken a bioinformatics course; 1 – 2 of you took significant MBB courses. Overview of expectations from survey. 10 minute break.
3 Advance Acknowledgments. Jim Mattson: for championing the workshop idea. Felix Breden: for championing the idea of IRMACS bioinformatics support and endorsing this workshop. IRMACS team: Pam Borghardt, IRMACS Managing Director (sponsorship); Brian, Technical Director (workshop infrastructure). WestGrid team: Ata Roudgar, Martin Siegert (workshop HPC infrastructure). Fiona Brinkman: for her kind permission to adapt a number of her MBB introductory bioinformatics course slides for portions of the workshop.
4 Topic | Lecture (12:30 – 14:30, Wednesdays) | Demo/Lab (9:30 – 11:30, Thursdays)
Bioinformatics Overview (roughly equivalent to core MBB 441/741 topics)
Workshop Overview and Practical Informatics Considerations | March 7th | March 8th
Sequence Formats, Databases and Visualization Tools | March 14th | March 15th
Sequence Alignment and Searching | March 21st | March 22nd
Principles of Structural Genomics and Overview of Next Generation Sequencing Technologies | March 28th | March 29th
Sequence Assembly Algorithms | April 4th | April 5th
Specific Applications
Sequence Assembly of Transcriptomes | May 2nd | May 3rd
Sequence Assembly of Whole Genomes | May 9th | May 10th
Annotation of de novo Assembled Sequences | May 16th | May 17th
Identification and Analysis of Sequence Variation | May 23rd | May 24th
Comparative Genomic Analysis and Visualization | May 30th | May 31st
Meta-Analysis of Newly Annotated Sequence Data | June 6th | June 7th
5 Venue: The workshop lectures and demo/labs will generally take place here, in the IRMACS Centre, Room (top floor, Applied Sciences Building), with the exception of the March 14th and May 9th lectures, plus the May 10th lab/demo, for which there is a meeting conflict in IRMACS. These particular sessions will instead be convened in BioSci room B9242. The lab/demo sessions on March 8th, 15th and 29th will end earlier, at 11 am, to accommodate the next scheduled event in IRMACS.
6 Workshop Fee: Sign-up list to Barbara Sherman… will contact PI for billing(?)
7 Introductions
8 Facilitator Richard: A Brief Bio. Professional Experience: 2009 – present, Adjunct Professor, MBB, SFU; Research Scientist, Computational and Systems Biology, Bioinformatics, International Rice Research Institute (IRRI; irri.org); Postdoc, Human Analysis Team, Sanger Centre, Cambridge, UK. Academic Background: 1999, PhD (Medical Genetics), UBC; 1992, B.Sc. (Biochemistry, Molecular Biology & Genetics), UBC; 1987, B.A. (Minor Computing), SFU. Personal: Originally from Edmonton; moved to the GVRD in late teens and resided here for over 2 decades before travelling abroad to work. Wife is Filipina-Canadian (hence the job in the Philippines); 3 teenage kids (son in his late teens has just started in the SIAT program at SFU Surrey). Returned last June to reside in Port Moody, at the foot of Burnaby Mountain.
9 Participants “Around the table”: Your name, department, lab (PI); (optional) your “Port of Origin”; What is your research focus? How can bioinformatics (NGS) support that research? What NGS data of your own do you have to analyse *now*? Expectations for the workshop…
11 Today’s Agenda – Part 2: What is Bioinformatics and why is it needed? What is “Next Generation Sequencing”? Coping with the NGS bioinformatics challenge; The Workshop Road Map; Looking ahead…
12 What is bioinformatics?
13 Bioinformatics is… (1) the development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes; (2) the development of methods for the management and analysis of biological information arising from genomics and high-throughput biological experiments.
14 Why is there Bioinformatics? (Fiona Brinkman, Bioinfo Course, Summer 2002.) Huge datasets. Lots of new sequences being added: automated sequencers; genome projects; metagenomics; RNA sequencing, microarray studies, proteomics,… Patterns in datasets that can be analyzed using computers.
15 Need for informatics in biology: origins. Gramicidin S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951). 1961: tRNA fragments. Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three-base code and mediates the synthesis of proteins (Crick et al., 1961, General nature of the genetic code for proteins. Nature 192:). In Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press, 1999, p. 384. First codon assignment: UUU/Phe (Nirenberg and Matthaei, 1961).
16 Need for informatics in biology: origins. The key to the whole field of nucleic acid-based identification of microorganisms… …the introduction of molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling. Zuckerkandl, E., and L. Pauling. "Molecules as Documents of Evolutionary History." Journal of Theoretical Biology 8:. Another landmark: nucleic acid sequencing (Sanger and Coulson, 1975).
17 Need for informatics in biology: origins. First genomes sequenced: 3.5 kb RNA bacteriophage MS2 (Fiers et al., 1976); 5.4 kb bacteriophage φX174 (Sanger et al., 1977); 1.83 Mb Haemophilus influenzae KW20, the first complete genome sequence of a free-living organism (Fleischmann et al., 1995). First multicellular organism to be sequenced: C. elegans (C. elegans Sequencing Consortium, 1998). Early databases: Dayhoff, 1972; Erdmann, 1978. Early programs: restriction enzyme sites, promoters, etc… circa 1978. 1978 – 1993: Nucleic Acids Research published supplemental information.
18 GenBank and associated resources double faster than Moore’s Law! (< every 18 months) (from the National Center for Biotechnology Information)
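To put the "doubles in under 18 months" claim in perspective, here is a small illustrative calculation. It assumes idealized exponential growth with a fixed doubling time; the function name `growth_factor` and the time spans are chosen for this sketch and do not come from the slides or from measured GenBank data.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Fold increase after `years`, assuming steady exponential growth
    with a fixed doubling time (an idealization, not measured data)."""
    doublings = (years * 12.0) / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Illustrative time spans only.
    for years in (3, 6, 9):
        print(f"{years} years at an 18-month doubling time: "
              f"{growth_factor(years):.0f}-fold")
```

Any doubling time shorter than 18 months gives a larger factor over the same period, which is the point of the comparison with Moore's Law.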
19 Today: So many genomes… As of mid-August 2010, according to the GOLD GenomesOnline database… Eukaryotic genome projects in progress? (Genome and ESTs) ( years ago). Prokaryote genome projects in progress? 5006 ( years ago). Metagenome projects in progress? 133 (Zero - 5 years ago). TOTAL: 6687 projects (as of Sept 2011: >10,000).
20 Information sources: (Rhesus macaque) Robert F. Service. Science 311: (2006) | 454 press release, May 31, | Wellcome Trust Sanger Institute press release, July 2, | Complete Genomics article in Bio-IT World: | Applied Biosystems press release, October 1,
25 The Human Genome: The genome sequence is complete - almost! Approximately 3.2 billion base pairs.
26 Work ongoing to locate all genes and regulatory regions and describe their functions… …bioinformatics plays a critical role
27 Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals
28 Bioinformatics helps with… Sequence Similarity Searching/Comparison (Fiona Brinkman, Bioinfo Course, Summer 2002). What is similar to my sequence? Searching gets harder as the databases get bigger, and quality changes. Tools: BLAST and FASTA = early time-saving heuristics (approximate methods). Need better methods for SNP analysis! Statistics + informed judgment of the biologist.
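As a rough illustration of what "heuristic (approximate)" searching means, the sketch below indexes short fixed-length words from a tiny database and counts how many words a query shares with each entry. This is a toy version of the word-seeding idea only, not the BLAST or FASTA algorithm; the function names, the word length `k`, and the sequences in `TOY_DB` are all invented for this example.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every length-k word to the names of the sequences containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_word_counts(query: str, index: dict[str, set[str]], k: int) -> dict[str, int]:
    """Count, per database sequence, how many query word positions hit it."""
    counts: dict[str, int] = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            counts[name] += 1
    return dict(counts)


# Hypothetical toy database; names and sequences are invented for illustration.
TOY_DB = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
    "seqC": "GGCGTACGTAAAC",
}

if __name__ == "__main__":
    idx = build_word_index(TOY_DB, k=5)
    hits = shared_word_counts("GCGTACGT", idx, k=5)
    # Sequences sharing more words with the query are better candidates
    # for a full (slower) alignment.
    for name, n in sorted(hits.items(), key=lambda kv: -kv[1]):
        print(name, n)
```

The design point is that only sequences sharing at least one word with the query are ever examined further, which saves time on large databases at the cost of possibly missing weak similarities.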
29 Bioinformatics helps with… Structure-Function Relationships (Fiona Brinkman, Bioinfo Course, Summer 2002). Can we predict the function of protein molecules from their sequence? Sequence > structure > function. Prediction of some simple 3-D structures possible (α-helix, β-sheet, membrane spanning, etc.).
30 Bioinformatics helps with… Phylogenetics (Fiona Brinkman, Bioinfo Course, Summer 2002). Can we define evolutionary relationships between organisms by comparing DNA sequences? Lots of methods and software; what is the best analysis approach?
31 What is Next Generation Sequencing (NGS)?
32 Sanger (“dideoxy” or “chain termination”) Sequencing: Single-stranded DNA from the sample* is extended by polymerase from a primer, then randomly terminated by a dideoxy nucleotide (ddNTP). The resulting variable-length DNA fragments are detected by radiolabel or by fluorescently labelled ddNTP. *Sample derived from amplified cDNA, genomic clones or whole genome shotgun.
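The read-out logic of chain termination can be sketched in a few lines: every fragment ends in a terminating base, so sorting fragments by length and reading each terminal base recovers the synthesized strand. This is a simplified model that assumes one fragment of every possible length and a known terminal base for each; the function names and the example strand are hypothetical.

```python
import random


def terminated_fragments(new_strand: str) -> list[str]:
    """All possible chain-terminated products: one prefix per position,
    each ending in the (labelled) terminating base."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Order fragments by length (as a gel or capillary does) and read the
    terminal base of each to recover the synthesized strand."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


if __name__ == "__main__":
    strand = "ATGCGTAC"  # hypothetical newly synthesized strand
    frags = terminated_fragments(strand)
    random.shuffle(frags)  # fragments start out mixed in one reaction
    print(read_sequence(frags))
```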
33 Sanger Pros & Cons. Advantages: relatively accurate; relatively long reads (500 – 1500 bp). Disadvantage: relatively costly in terms of reagents and relatively low throughput.
34 Next Generation Sequencing (NGS) (diagram: Polonator; Roche 454; Life Tech. Ion Torrent; HeliScope; Illumina HiSeq; Life Tech SOLiD; Oxford Nanopore “GridION”; Pacific Biosciences SMRT Cell; Sequence Assembly on HPC)
35 (General) NGS Pros & Cons. Advantages: very high throughput; very cheap data production. Disadvantages: relatively short reads; relatively higher error rates; bioinformatics of assembly is much more challenging.
37 Coping with the NGS bioinformatics challenge
38 Challenge: Assembling “next generation sequence” (NGS) data requires a great deal of computing power and many gigabytes of memory. Software can often execute in parallel on all available central processing unit (CPU) cores. Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power.
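A minimal sketch of what "executing in parallel on all available CPU cores" looks like in practice: the work is split into one chunk per core and each chunk is handled by a separate worker process. Counting G/C bases stands in for real per-read work such as alignment; all function names here are hypothetical, and real NGS tools implement their own parallelism.

```python
import os
from multiprocessing import Pool
from typing import Optional


def gc_count(read: str) -> int:
    """Number of G or C bases in one read (a stand-in for real per-read work)."""
    return sum(1 for base in read.upper() if base in "GC")


def split_into_chunks(items: list, n_chunks: int) -> list[list]:
    """Split `items` into at most `n_chunks` nearly equal, contiguous chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def chunk_gc(chunk: list[str]) -> int:
    """Total GC count for one chunk; runs inside a worker process."""
    return sum(gc_count(read) for read in chunk)


def parallel_gc(reads: list[str], workers: Optional[int] = None) -> int:
    """Sum GC counts over all reads, using one worker per CPU core by default."""
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        return sum(pool.map(chunk_gc, split_into_chunks(reads, workers)))
```

When run as a script, `parallel_gc(reads)` should be called from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter.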
39 “High Performance Computing” and “Cloud Computing” (diagram: Computer Nodes; Network Storage; Your local workstation/laptop)
40 What is Cloud Computing? Pooled resources: shared with many users (remotely accessed). Virtualization: high utilization of hardware resources (no idling). Elasticity: dynamic scaling without capital expenditure and time delay. Automation: build, deploy, configure, provision, and move without manual intervention. Metered billing: “pay-as-you-go”, only for what you use.
42 A More Complete Picture… (diagram: Web Portal; Project Relational Database; Database Loader; Raw Data + Results)
43 Case Study in Bioinformatics on the Cloud. Used Amazon Web Services. Assembled ~99 raw NGS transcriptome sequence datasets from 83 species on 16 Amazon EC2 instances, each with 8 CPU cores and 68 GB of RAM; ~200 hours of computer time; total run in less than one working day. Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and would incur significant operating cost overhead (machine room space, power supplies, networking, air conditioning, staff salaries, etc.). The above run could be started up in a few minutes and cost ~$500 to complete. Once done, no machines were left idling and unused…
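The cost comparison in this case study can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the approximate figures quoted on the slide and assumes that the ~200 hours are total instance-hours across all machines; the constant and function names are ours, and operating costs for purchased machines are left out.

```python
INSTANCES = 16               # EC2 instances used (from the case study)
COMPUTE_HOURS = 200          # approximate total; assumed to be instance-hours
CLOUD_COST_USD = 500         # approximate cost of the whole run
PURCHASE_COST_USD = 10_000   # quoted lower bound per comparable machine


def cost_per_compute_hour(total_cost: float, hours: float) -> float:
    """Average cloud cost per hour of compute time."""
    return total_cost / hours


def purchase_outlay(machines: int, unit_cost: float) -> float:
    """Up-front cost of buying the machines instead (hardware only)."""
    return machines * unit_cost


if __name__ == "__main__":
    hourly = cost_per_compute_hour(CLOUD_COST_USD, COMPUTE_HOURS)
    outlay = purchase_outlay(INSTANCES, PURCHASE_COST_USD)
    print(f"Cloud: about ${hourly:.2f} per compute-hour")
    print(f"Buying {INSTANCES} machines: at least ${outlay:,.0f} up front")
    print(f"Up-front purchase vs. one cloud run: {outlay / CLOUD_COST_USD:.0f}x")
```

The comparison favours the cloud most strongly for occasional large runs; machines that would be kept busy continuously change the balance.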
44 Software for (NGS) Bioinformatics. Bundled with sequencing machines: e.g. Newbler assembler with Roche 454. 3rd party commercial: DNA Star (www.dnastar.com); Geneious (http://www.geneious.com/); GeneWiz (http://www.genewiz.com); and others… Open Source: lots (selected examples to be covered in this workshop).
45 What do I need to run bioinformatics software locally? Some common bioinformatics software is platform independent, hence will run equally well under Windows and UNIX (Linux, OS X). Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?) is to install some version of Linux (suggest “Ubuntu”) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g.
47 SFU / IRMACS: WestGrid is a consortium member of “Compute Canada” (https://computecanada.org/). “bugaboo” cluster: 4328 cores total: 1280 cores, 8 cores/node, 16 GB/node, x86_64, IB; plus 3048 cores, 12 cores/node, 24 GB/node, x86_64, IB. Capability cluster, 40 core years. Access to other WestGrid resources through LAN and WAN. More details from Brian Corrie tomorrow…
49 The Workshop Roadmap. Notes: Roadmap of the workshop (10 minutes, 3 slides: program revisited + tech structure/flow diagram(?))
50 Road Map (diagram: What is Bioinformatics?; Visualization of Sequence & Annotation; NGS; Annotation; Sequence Assembly; Sequences (Formats); Sequence Databases; Search & Alignments)
51 Specific Applications: Sequence Assembly of Transcriptomes; Sequence Assembly of Whole Genomes; Annotation of de novo Assembled Sequences; Identification and Analysis of Sequence Variation; Comparative Genomic Analysis and Visualization; Meta-Analysis of Annotated Sequence Data
52 Survey: Workshop Expectations I: “How to find significance in the huge amount of data that Next Gen sequencing, but also microarrays etc. generate.” “A basic understanding of how to analyse next generation sequencing data.” “Learn some hands-on computer experience” “learning to use software for analysing sequence data; what can be done and how to do it.” “genome assembly + meta-analysis”
53 Survey: Workshop Expectations II: “The basics of alignment and SNP calling with next-gen sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming so any info provided through the workshop would be very helpful - thanks)” “The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts, significance of the adjustable parameters behind the various algorithms used in the workflow.”
54 Survey: Workshop Expectations III: “I expect to learn the basic bioinformatics tools.” “Learn different sequence alignment software/technologies (i.e. BWA, Abyss, etc.). Learn more about the complexities of NGS sequencing” “Next generation sequencing, data analysis etc.” “Parameters regulating assembly of contigs. How to take raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs” “How to compare expression profiles using RNA transcriptomes.” “Want to learn new things”
55 Survey: Operating System Being Used. Microsoft Windows on Intel/AMD – 14 (86.7%): most running Windows 7 (some XP & Vista); one uses Linux through WestGrid and the IRMACS cluster; some of you also thinking of running Linux. Apple OS X – 2 (13.3%): Snow Leopard release; Apple Lion, running Windows 7 using Parallels. Linux on Intel – 2 (13.3%).
56 Looking Ahead… What will you need for this workshop? Mainly, just a laptop running a web browser; (optional) access to Linux/Unix locally (VM Player). Reading list: will give review citations for future lectures; for next week, suggest that you surf to