Assembly & Annotation at iPlant

Presentation transcript:

Assembly & Annotation at iPlant
Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona), Michael Schatz (CSHL), Roger Barthelson (CSHL)
Cantarel et al. 2008. Genome Research 18:188
Holt & Yandell. 2011. BMC Bioinformatics 12:491

Maize Genome Project
3-yr NSF funded project -- $30 M, 9 PIs
Genome: 2500 Mb, 10 chromosomes, 50,000 genes
Strategy: BAC-by-BAC, 17,000 clones, finish genic regions
Mapping (U. Arizona): FPC map, min. tiling path, BAC selection
Sequencing (Washington U.): 6X shotgun, auto finish, manual finishing, GenBank
Annotation (CSHL): repeat analysis, gene prediction, database, browser (Maizesequence.org)

Technology Lowering Barriers


Complexity of Genomes
Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534.

Assembling a Genome
1. Shear & Sequence DNA
2. Construct assembly graph from overlapping reads
   …AGCCTAGGGATGCGCGACACGT
           GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
                                           CAACCTCGGACGGACCTCAGCGAA…
3. Simplify assembly graph
4. Detangle graph with long reads, mates, and other links
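Step 2 above can be illustrated with a minimal Python sketch, using the three read fragments shown on the slide. The function names (`overlap`, `build_overlap_graph`, `merge_linear_path`) and the 10 bp minimum overlap are assumptions made for illustration. Production assemblers use indexed data structures and error-tolerant overlaps rather than this all-against-all exact comparison.

```python
def overlap(a, b, min_len=10):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len=10):
    """Return directed edges {(i, j): overlap_length} between read indices."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


def merge_linear_path(reads, edges):
    """Merge reads along a single unbranched, acyclic path.

    Assumes each read has at most one successor; real graphs need
    simplification (step 3) before they look like this.
    """
    succ = {i: (j, k) for (i, j), k in edges.items()}
    has_pred = {j for (_, j) in edges}
    start = next(i for i in range(len(reads)) if i not in has_pred)
    contig, cur = reads[start], start
    while cur in succ:
        cur, k = succ[cur]
        contig += reads[cur][k:]  # append only the non-overlapping tail
    return contig


reads = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]
edges = build_overlap_graph(reads)
contig = merge_linear_path(reads, edges)
print(edges)
print(contig)
```

Repeats longer than the reads would add extra edges to this graph, which is why steps 3 and 4 are needed in practice.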

Ingredients for a good assembly
Coverage: High coverage is required
- Oversample the genome to ensure every base is sequenced with long overlaps between reads
- Biased coverage will also fragment the assembly
[Figure: expected contig length vs. read coverage]
Read Length: Reads & mates must be longer than the repeats
- Short reads will have false overlaps, forming hairball assembly graphs
- With long enough reads, entire chromosomes can be assembled into contigs
Quality: Errors obscure overlaps
- Reads are assembled by finding k-mers shared in pairs of reads
- A high error rate requires very short seeds, increasing complexity and forming assembly hairballs
The amount of oversampling depends on read length and genome complexity.
Current challenges in de novo plant genome sequencing and assembly. Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology 13:243
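As a rough illustration of the coverage point, the sketch below computes average coverage and the expected fraction of bases left unsequenced. It assumes reads are sampled uniformly at random (the standard Poisson idealization), which real libraries violate through the coverage bias noted above. The function names are hypothetical.

```python
import math


def coverage(num_reads, read_len, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_len / genome_size


def fraction_uncovered(cov):
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-cov)


def reads_needed(target_cov, read_len, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_cov * genome_size / read_len)


# Example: 100 bp reads on a 1 Mbp genome at 30x target coverage.
n = reads_needed(30, 100, 1_000_000)
print(n, coverage(n, 100, 1_000_000), fraction_uncovered(30))
```

Under this model, even modest oversampling leaves few bases uncovered, but assembly needs much more than bare coverage because adjacent reads must also overlap by enough bases to be joined reliably.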

N50 size
Def: 50% of the genome is in contigs at least as large as the N50 value
Example: 1 Mbp genome with contigs (kbp) of 300, 100, 45, 45, 30, 20, 15, 15, 10, ...
N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500 kbp)
Note: N50 values are only meaningful to compare when base genome size is the same in all cases
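The definition above translates into a few lines of Python. This is a sketch: the `genome_size` parameter is an assumption added so that the slide's example, which lists only the largest contigs of a 1 Mbp genome, can be reproduced without inventing the unlisted contigs. When it is omitted, the sum of the contig lengths is used instead.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 contig length.

    Walk contigs from largest to smallest and return the length at which
    the running total first reaches half of `genome_size` (or half of the
    summed contig lengths when `genome_size` is not given).
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # contigs cover less than half of genome_size


# Contigs listed on the slide, in kbp, against a 1000 kbp (1 Mbp) genome.
slide_contigs = [300, 100, 45, 45, 30, 20, 15, 15, 10]
print(n50(slide_contigs, genome_size=1000))
```

Passing the same `genome_size` for every assembly being compared is one way to respect the note that N50 values are only comparable against a common base size.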

Attempt to answer the question: “What makes a good assembly?”
Organized by UC Davis and UC Santa Cruz (“good framing problem”)
- Organizers provided sequence data to assembly experts around the world
- Assemblathon 1: ~100 Mbp simulated genome
- Assemblathon 2: 3 vertebrate genomes, each ~1 Gbp
- Results demonstrate the trade-offs assemblers must make
Assemblathon 1: A competitive assessment of de novo short read assembly methods. Earl DA, et al. (2011) Genome Research. doi: 10.1101/gr.126599.111
Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. Bradnam KR, et al. (2013) GigaScience 2:10. doi: 10.1186/2047-217X-2-10

Final Rankings
- ALLPATHS and SOAPdenovo came out neck-and-neck, followed closely by Celera Assembler, SGA, and ABySS
- My recommendation for “typical” short read assembly is to use ALLPATHS
- Single molecule sequencing is becoming extremely attractive if you have access

Apps in Discovery Environment
Genome Assembly
- Allpaths-LG
- Soapdenovo2
- ABySS
- Velvet
- Newbler
- Ray
Contig analysis tools
- With or without reference sequence for comparison

Assembly Workflow
An unfamiliar problem with familiar data
1. Upload Reads: Minutes to Months
2. Quality Assessment: Minutes to Hours
3. De novo Assembly: Hours to Days
4. Assembly Assessment: Minutes to Hours

Apps in Discovery Environment (for sequencing studies)
Sequence Quality Control
- FastQC
- Fastx Toolkit
- Suffixerator/Tallymer/mkindex
- Sabre, Scythe, Sickle (paired end trimming)
- SGA cleanup (paired end quality trimming)
Future plans
- Sequence induction, assessment, and trimming pipeline
- Mira contaminant detection and removal

QC: FastQC

QC: Read Coverage
[Figure: reads aligned to a reference, illustrating errors, coverage, and repeats]

Wheat Genome (A. tauschii / CSHL)

QC: Mer counts
Inputs: Frag1.fq, Frag2.fq
Apps: FASTX_fastq-to-fasta, Suffixerator, Suffixerator-Tallymer-mkindex
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. Kurtz S, Narechania A, Stein JC, Ware D. (2008) BMC Genomics 9:517
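To show what a mer count is, here is a minimal in-memory sketch in Python. It is not the Tallymer method, which builds an enhanced suffix array to scale to large genomes. This version uses a plain hash-based counter and only suits small inputs. The function names and the choice to skip k-mers containing `N` are assumptions.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer across the given sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers spanning ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_histogram(counts):
    """Map occurrence count -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTACGT"], k=4)
hist = frequency_histogram(counts)
print(counts)
print(hist)
```

In a real read set, the histogram is the useful QC view: a peak near the sequencing depth reflects unique genome sequence, very low counts are typically errors, and very high counts point to repeats.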

Using Allpaths LG
You must have at least 2 libraries:
- One overlapping fragment library, e.g. 100 bp reads with 180 bp spacing
- One jumping mate-pair library, e.g. 3000 bp spacing

How ALLPATHS-LG works
[Diagram labels: reads, corrected reads, doubled reads, localized data, local graph assemblies, global graph assembly, unipaths, assembly]
Oversimplified: actually fifty modules developed over the past six years, and a bit cluttered. Laden with opportunities for improvement!
See YouTube: https://www.youtube.com/watch?v=USlTWhmw0oQ&index=3&list=PL-0S9LiUi0viEhYTP_EQtKpYkcYAVW6IH
Sante Gnerre et al (2010) PNAS 1513–1518, doi: 10.1073/pnas.1017351108

ALLPATHS-LG in DE: Where is the sample data?
Libraries: 180 bp, 3500 bp
Data Source: GAGE Project

ALLPATHS-LG in DE: Where is the Allpaths LG App?

ALLPATHS-LG in DE: Fragment Reads

ALLPATHS-LG in DE: Jumping Reads

ALLPATHS-LG in DE: Run Settings

Running ALLPATHS-LG

Post-QC: CEGMA
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.

Resources
iPlant: http://www.iplantcollaborative.org/
Assembly Competitions:
- Assemblathon: http://assemblathon.org/
- GAGE: http://gage.cbcb.umd.edu/
Assembler Websites:
- ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
- SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
- Celera Assembler: http://wgs-assembler.sf.net
Tools:
- FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Tallymer: http://www.zbh.uni-hamburg.de/?id=211
- CEGMA: http://korflab.ucdavis.edu/datasets/cegma/

What Are Annotations?
Annotations are descriptions of features of the genome
- Structural: exons, introns, UTRs, splice forms, etc.
- Coding & non-coding genes
- Expression, repeats, transposons
Annotations should include an evidence trail
- Assists in quality control of genome annotations
Examples of evidence supporting a structural annotation:
- Ab initio gene predictions
- ESTs
- Protein homology
Notes: It is especially important that every genome annotation carries an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations. Many of you are likely already familiar with this, but I will explain anyway.

Secondary Annotation
Protein Domains
- InterProScan: combines many HMM databases
GO and other ontologies
Pathway mapping
- E.g. BioCyc Pathway Tools

Challenges in Plant Genome Annotation
- Genomes are BIG
- Highly repetitive
- Many pseudogenes
- Assembly contamination
- Incomplete evidence
- No method is 100% accurate

Options for Protein-coding Gene Annotation Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174

Typical Annotation Pipeline
1. Contamination screening
2. Repeat/TE masking
3. Ab initio prediction
4. Evidence alignment (cDNA, EST, RNA-seq, protein)
5. Evidence-driven prediction
6. Chooser/combiner
7. Evaluation/filtering
8. Manual curation

MAKER-P Automated Pipeline
Collaboration with Yandell Lab
MPI-enabled to allow parallel operation on large compute clusters
[Diagram labels: Repeat Library, Ab initio prediction, Evidence]

What is a GFF File? Generic Feature Format
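For readers who have not seen the format: a GFF3 record is one tab-separated line with nine columns (seqid, source, type, start, end, score, strand, phase, attributes), where the last column holds `key=value` pairs separated by semicolons. The Python sketch below parses a single record. The `Feature` type, the function name, and the example record are made up for illustration, and percent-encoded characters in attributes are not decoded.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = dict(
        item.split("=", 1) for item in cols[8].split(";") if "=" in item
    )
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


# Hypothetical record, not taken from any real annotation.
example = "Chr1\tmaker\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=example"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```

Child features such as mRNAs and exons point back to their gene through a `Parent` attribute, which is how the hierarchy of a gene model is expressed in a flat file.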

MAKER-P provides automated means for QC
Figure 1. Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED). [Figure: cumulative AED distributions in left and right panels; the AED axis runs from better to worse quality]

MAKER-P provides methods for automatic management and quality control of genome annotations, using metrics developed by the Sequence Ontology project. One of these metrics is AED. AED is calculated in the same manner as sensitivity (SN) and specificity (SP), but in place of a reference gene model, the coordinates of the union of the aligned evidence are used. AED = 1 – AC, where AC = (SN + SP)/2. An AED of 0 indicates that the annotation is in perfect agreement with its evidence, whereas an AED of 1 indicates a complete lack of evidence support for the annotation.

The left panel of the figure illustrates hypothetical cumulative AED distributions for 3 different annotated genomes. In a very well annotated genome, for example, 95% of the annotations have an AED of less than 0.5. This is true, for example, of the human genome annotations.

In the current release of the Arabidopsis annotations (TAIR10), 88% of the annotations have an AED of less than 0.5 (navy line, right panel); thus the TAIR10 annotations are already quite good, but could be further improved. This value increases to 98% when only TAIR10’s 4- and 5-star rated transcripts are considered in the analysis (blue line, right panel). When all TAIR10 gene models are passed to MAKER-P and processed using its update functionality to automatically revise them to better fit the evidence, AEDs drop (green line vs. navy line), indicating improvements in quality. De novo annotation with MAKER produced an annotation set in which 97% of the annotations have an AED of less than 0.5 (red line).
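The AED formula can be made concrete with a small nucleotide-level sketch in Python. It assumes that SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by evidence. The function names and the closed-interval representation are illustrative, and MAKER-P's own implementation may differ in detail.

```python
def covered_bases(intervals):
    """Set of positions covered by 1-based, inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_exons, evidence_intervals):
    """Annotation Edit Distance: 1 - (SN + SP) / 2 at the nucleotide level.

    The union of the aligned evidence stands in for a reference gene model.
    """
    annotation = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_intervals)
    if not annotation or not evidence:
        return 1.0  # no annotation or no evidence: no support at all
    shared = len(annotation & evidence)
    sn = shared / len(evidence)    # evidence bases recovered by the annotation
    sp = shared / len(annotation)  # annotation bases backed by evidence
    return 1.0 - (sn + sp) / 2.0


print(aed([(1, 100)], [(1, 100)]))     # identical
print(aed([(1, 100)], [(51, 150)]))    # half overlapping
print(aed([(1, 100)], [(201, 300)]))   # disjoint
```

Storing every base in a set is wasteful for long genes; interval arithmetic would do the same job in less memory, but the set version keeps the link to the formula easy to see.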

MAKER-P at iPlant
TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

Genome                 Assembly    Size (Mb)   CPU    Run Time
Arabidopsis thaliana   TAIR10      120         600    2:44
Arabidopsis thaliana   TAIR10      120         1500   1:27
Zea mays               RefGen_v2   2067        2172   2:53

Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144

PAG 2014:
- W559 - Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn): 20.15 Gb assembly, split into 40 jobs, 216 CPU/job (8640 CPU total), 17 hours
- P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species: 10 rice species (each w/ 12 chromosome pseudomolecules), 96 CPU per chromosome (1152 CPU total), ~2 hr per genome

MAKER-P at iPlant
Atmosphere: MAKER_2.28 (emi-F13821D0)
- Virtual image MPI-enabled for parallel computing
- Check out with up to 16 CPU
- Tested with 4 CPU instance
- Completed rice chr 1 in 8 hr 45 min

MAKER-P Tutorial https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

Annotation Post-Analysis
- AED threshold
- InterProScan
- Comparative analysis, e.g. BLAST vs RefSeq proteins
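Applying an AED threshold can be sketched as a filter over GFF3 records. The Python example below assumes the AED score is stored on mRNA features under an `_AED` attribute, which is how MAKER output is commonly described; check your own files before relying on that key. The function names, the 0.5 cutoff, and the example records are illustrative.

```python
def get_aed(attributes, key="_AED"):
    """Extract a numeric AED value from a GFF3 attribute column, or None."""
    for item in attributes.split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return float(value)
    return None


def filter_by_aed(gff_lines, max_aed=0.5, feature_type="mRNA"):
    """Keep records of `feature_type` whose AED is below `max_aed`."""
    kept = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # skip comments, malformed lines, and other feature types
        score = get_aed(cols[8])
        if score is not None and score < max_aed:
            kept.append(line)
    return kept


# Hypothetical records for illustration only.
records = [
    "Chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "Chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1;_AED=0.12",
    "Chr1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=mRNA2;Parent=gene2;_AED=0.80",
]
kept = filter_by_aed(records)
print(kept)
```

A complete filter would also carry along the parent gene and child exon/CDS records of each retained transcript, which this sketch leaves out.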

Annotation Post-Analysis InterProScan


Additional MAKER-P Resources
- MAKER-P: http://www.yandell-lab.org/software/maker-p.html
- Repeat Library construction: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Advanced
- Pseudogene identification: http://shiulab.plantbiology.msu.edu/wiki/index.php/Protocol:Pseudogene