Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.

1 Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011

2 Outline GEP web framework updates –GEP web site –Gene Record Finder –Gene Model Checker –Small Exon Finder Tools under development –modENCODE mRNA-Seq data –Designing and managing your own projects Discussions on needed improvements

3 Graded Web Browser Support GEP web framework aims to provide support for the following web browsers: –Based on graded browser support policy from Yahoo!
Platforms (table columns): Win XP, Win 7 / Vista, Mac OS 10.6
Web Browser                Grade
Safari 5                   A-grade
Chrome (latest stable)     A-grade
Firefox (latest stable)    A-grade
Firefox 3.6                A-grade
IE 9.0                     A-grade
IE 8.0                     A-grade
IE 7.0                     A-grade
IE 6.0                     A-grade
* Other configurations may work but may not be tested (X-grade)

4 Goals for GEP Web Site Update More easily find and discover materials Search engine optimizations and site search Added Quick Start and FAQ sections Search for related documents using tags Standardize layout and file download links New section for contributions from GEP members Maintain backward compatibility Improve support for modern web browsers

5 GEP Web Site Demo

6 GEP Web Site Questions GEP glossary –Currently listed under Introducing Students to DNA Sequencing and Genomic Analysis GEP photos –Community section –Facebook groups –Flickr groups

7 GEP Wiki and Forum Bulletin board software upgraded to phpBB3 –Allow upload of images and other attachments –Automatic image thumbnails –More powerful full text search –Better cross-browser support Plan to migrate GEP Wikis (both private and public) to a newer version of the MediaWiki software in Fall 2011

8 Gene Record Finder Update Two FlyBase updates –Releases 5.29, 5.32 –Release 5.39 for Fall 2011 Start and end columns now refer to the 5’ start and 3’ end coordinates –For features on the minus strand, start coordinates are larger than the end coordinates Added new section for D. melanogaster genes with non-canonical splice sites
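The coordinate convention described on this slide can be illustrated with a minimal sketch. It assumes the input follows the usual GFF convention (start <= end, with a separate strand column); the function name `to_five_three` is hypothetical and is not part of the Gene Record Finder.

```python
def to_five_three(gff_start, gff_end, strand):
    """Return (start, end) ordered from the 5' end to the 3' end.

    GFF files store start <= end regardless of strand, so for a
    minus-strand feature the 5' coordinate is the larger number.
    """
    if gff_start > gff_end:
        raise ValueError("GFF coordinates must have start <= end")
    if strand == "+":
        return gff_start, gff_end
    if strand == "-":
        return gff_end, gff_start
    raise ValueError(f"unknown strand: {strand!r}")


print(to_five_three(1000, 1500, "+"))  # (1000, 1500)
print(to_five_three(1000, 1500, "-"))  # (1500, 1000)
```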

9 Gene Record Finder Demo

10 Potential Issues with Gene Record Finder Phases for coding exons were based on GFF files provided by FlyBase Since Release 5.33, the phase entries for CDS features may be incorrect –In older releases, the phase column in the FlyBase GFF file represented the reading frame –Issue has not yet been resolved as of Release 5.39 Instead of relying on the FlyBase entries, phase and CDS translations are now calculated separately
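One way to calculate phase independently of the GFF entries is sketched below. It assumes the GFF3 definition of phase (the number of bases to skip at the start of a CDS segment to reach the first complete codon) and takes CDS segment lengths in 5'-to-3' order. The function name is hypothetical, and this is not necessarily how the Gene Record Finder performs the calculation.

```python
def cds_phases(cds_lengths):
    """GFF3-style phase for each CDS segment, listed in 5'-to-3' order."""
    phases = []
    consumed = 0  # coding bases contributed by earlier segments
    for length in cds_lengths:
        # Bases left over from the previous segment start a codon that
        # this segment must complete before a new codon can begin.
        phases.append((3 - consumed % 3) % 3)
        consumed += length
    return phases


print(cds_phases([100, 50, 30]))  # [0, 2, 0]
```

In the example, the first segment ends one base into a codon, so the second segment starts with the two bases that finish it (phase 2).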

11 Keeping up with FlyBase Releases Release 6 assembly may be released in September New modENCODE RNA-Seq data has led to many updates to the D. melanogaster gene annotations Graveley BR, et al. The developmental transcriptome of Drosophila melanogaster. Nature (471) 473-479 More up-to-date Gene Record Finder available at: http://gander.wustl.edu/~wilson/dmelgenerecord_current/index.html

12 Gene Model Checker User Interface Improvements Form values (except the sequence file) will persist when you refresh the web page Added support for sequence files in rich text format Improved detection of overlapping coordinates –Overlap among exon coordinates –Overlap between exon coordinates and stop codons
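The overlap check can be sketched as follows. This is an illustration only, assuming exon coordinates are inclusive (start, end) pairs that may be listed in either orientation. The function name is hypothetical and this is not the Gene Model Checker's actual code.

```python
def find_overlaps(exons):
    """Return adjacent pairs of inclusive (start, end) intervals that overlap."""
    # Normalize so start <= end (minus-strand exons may be given as end, start).
    normalized = sorted((min(s, e), max(s, e)) for s, e in exons)
    overlaps = []
    for prev, curr in zip(normalized, normalized[1:]):
        # After sorting by start, any overlap shows up between neighbors.
        if curr[0] <= prev[1]:
            overlaps.append((prev, curr))
    return overlaps


print(find_overlaps([(100, 200), (150, 300), (400, 500)]))
# [((100, 200), (150, 300))]
```

A stop codon can be checked the same way by adding its three-base interval to the list.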

13 Gene Model Checker Updates New “Warn” level in the checklist –Non-canonical splice donor site (GC) –Number of coding exons in submitted model differs from the D. melanogaster ortholog –Cannot find the putative D. melanogaster ortholog Global alignment between submitted model and the D. melanogaster ortholog Color dot plot for complete gene models

14 Gene Model Checker Demo

15 Annotation Files Merger Combine files generated by the Gene Model Checker for different gene models into a single file –Use this tool to reduce submission errors Added link to view combined GFF file as a custom track on the UCSC Genome Browser mirror Updated documentation for annotation submission shows how to use this tool to prepare files for project submission

16 Annotation Files Merger Demo

17 Small Exon Finder Search for small open reading frames that cannot be identified through sequence alignments Search for small exons that satisfy a set of biological constraints: –CDS type (initial, internal, terminal) –CDS size –Donor and acceptor phase Documentation available in the Small Exon Finder User Guide (under Help -> Documentations)
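The constraint-based search can be sketched for the simplest case: internal exons on the forward strand. This is not the Small Exon Finder's implementation. It assumes canonical AG acceptor and GT donor sites, 0-based half-open coordinates, and that `acceptor_phase` means the number of leading exon bases that complete a codon begun in the previous exon. The function and parameter names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_exon_candidates(seq, min_len, max_len, acceptor_phase):
    """Return (start, end) spans of candidate internal exons.

    A candidate is preceded by AG, followed by GT, has a length within
    [min_len, max_len], and contains no in-frame stop codon.
    """
    seq = seq.upper()
    hits = []
    for start in range(2, len(seq)):
        if seq[start - 2:start] != "AG":
            continue
        last_end = min(start + max_len, len(seq) - 2)
        for end in range(start + min_len, last_end + 1):
            if seq[end:end + 2] != "GT":
                continue
            # Walk complete codons that lie entirely inside the exon.
            codons = (seq[i:i + 3]
                      for i in range(start + acceptor_phase, end - 2, 3))
            if not any(codon in STOP_CODONS for codon in codons):
                hits.append((start, end))
    return hits


print(internal_exon_candidates("CCAGATGAAAGTCC", 3, 10, 0))  # [(4, 10)]
```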

18 Small Exon Finder Demo

19 UCSC Genome Browser Mirror Update Updated genome browser software to release 238 –New navigation features Search for blastx hits, drag and zoom, re-order tracks –Improved support for second-generation sequence data Added initial set of mRNA-Seq and TopHat junction predictions to the D. mojavensis dot chromosome assembly

20 Finishing Updates GEP LiveCD –Re-master image with new kernel and software updates –Consed updated to version 20 –Created new VirtualBox appliance with support for both SATA and IDE drives –Improve support for VirtualBox 4 –Updated documentation for storing user data on USB and local VirtualBox disk image Finishing Packages –New naming conventions for fosmid end traces –Updated configuration files (e.g. for autofinish, digests)

21 Tools Under Development http://www.flickr.com/photos/gullevek/155604654/sizes/z/

22 modENCODE Drosophila mRNA-Seq Data modENCODE project (Brian Oliver) has generated mRNA-Seq data for multiple Drosophila species –mRNA-Seq of head tissues from D. mojavensis Data tracks added to GEP Genome Browsers in Fall 2010 Unpaired, Illumina Genome Analyzer and Genome Analyzer II –mRNA-Seq of whole flies from Drosophila Paired-end, Illumina Genome Analyzer II and HiSeq 2000 mRNAs from males and females of multiple Drosophila species –Verify D. melanogaster gene models –Examine differences in gene expression

23 modENCODE Drosophila RNA-Seq Data Species with mRNA-Seq data that are also in the GEP annotation pipeline –D. ananassae –D. mojavensis [Figure legend: Reference, Published, In Progress, RNA-Seq]

24 mRNA-Seq Overview Wang et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics (10) 57-63

25 Types of mRNA-Seq Data Tracks Mapping mRNA-Seq reads onto contig sequences –Read coverage and alignment summary Splice junction predictions –TopHat predictions, spliced reads alignments Transcriptome assembly –Velvet and Oases –Cufflinks Reads unmapped by TopHat

26 mRNA-Seq Alignment Summary Track Because of the high read coverage, displaying every read could overload the browser, so individual reads are not shown Composite multi-wiggle track captures the number of high quality reads aligned at each position
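A per-position read count of the kind a wiggle track displays can be computed as in the sketch below. It assumes alignment blocks are given as 0-based half-open (start, end) intervals on the contig and that low quality reads have already been filtered out. The function name is hypothetical.

```python
def coverage(contig_length, alignments):
    """Per-base read depth from 0-based half-open (start, end) blocks."""
    # Difference array: +1 where a block starts, -1 just past its end.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:contig_length]:
        running += delta
        depth.append(running)
    return depth


print(coverage(8, [(0, 4), (2, 6), (3, 5)]))  # [1, 1, 2, 3, 2, 1, 0, 0]
```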

27 Identify Splice Junctions with mRNA-Seq
Two additional tracks can be used to identify splice junctions
–TopHat junctions
    For reads >= 75bp, search for GT-AG, GC-AG, and AT-AC intron junctions
    Search for joins between neighboring coverage islands
    Use mate pair information to estimate intron sizes
    –Average mate pair distance for this library is 150 bp
–Spliced mRNA-Seq
    Subset of read mate pairs mapped by TopHat with at least three alignment blocks
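The three intron motifs named on this slide can be checked with a short sketch. This illustrates the motif test only and is not TopHat's code. It assumes the intron is on the forward strand and is given as a 0-based half-open span of the contig. The function name is hypothetical.

```python
KNOWN_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_motif(contig, intron_start, intron_end):
    """Return a label such as 'GT-AG' if the intron ends match a known motif."""
    intron = contig[intron_start:intron_end].upper()
    if len(intron) < 4:
        return None  # too short to hold a donor and an acceptor dinucleotide
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) in KNOWN_MOTIFS:
        return f"{donor}-{acceptor}"
    return None


print(junction_motif("AAAGTCCCCCAGTTT", 3, 12))  # GT-AG
```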

28 mRNA-Seq Transcriptome Assembly
Two basic approaches to transcriptome assembly
–Assemble reads first then map the assembled transcripts back to the genome
    Trinity, ABySS, Oases
–Map reads onto the reference genome first and then merge overlapping reads to create transcripts
    Cufflinks, Scripture
Because of limited computational resources:
–Map reads against each contig with TopHat
–Extract mapped reads and assemble with Oases
–Align assembled transcripts back to the contig with BLAT

29 Number of Unmapped mRNA-Seq Reads Against the D. mojavensis Assembly ~300 million reads unmapped by TopHat
SRR Accession    # Missing Reads
SRR166832_1           19,005,424
SRR166832_2           19,005,425
SRR166833_1           54,724,955
SRR166833_2           54,724,955
SRR166834_1           22,019,750
SRR166834_2           22,019,751
SRR166835_1           62,179,296
SRR166835_2           62,179,296
Total                315,858,852

30 Number of Reads Removed because of Low Quality or Unknown Bases Only about 1% of the reads contain unknown bases once the reads are trimmed from 101 to 75 bases
SRR Accession    # Removed    Total # Missing    % Removed
SRR166832_1        834,883         19,005,424       4.39 %
SRR166832_2        541,113         19,005,425       2.85 %
SRR166833_1        505,054         54,724,955       0.92 %
SRR166833_2        646,392         54,724,955       1.18 %
SRR166834_1        227,753         22,019,750       1.03 %
SRR166834_2        165,507         22,019,751       0.75 %
SRR166835_1        411,722         62,179,296       0.66 %
SRR166835_2        566,942         62,179,296       0.91 %
Total            3,899,366        315,858,852       1.23 %
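The trimming and filtering step can be sketched as follows. The sketch assumes reads are trimmed from the 3' end and that unknown bases are written as N. It does not model the low quality filter, and the function names are hypothetical.

```python
def trim_and_filter(reads, keep=75):
    """Trim each read to `keep` bases, then drop reads containing N."""
    kept = []
    removed = 0
    for read in reads:
        trimmed = read[:keep]  # assumption: trimming is from the 3' end
        if "N" in trimmed.upper():
            removed += 1
        else:
            kept.append(trimmed)
    return kept, removed


def percent_removed(removed, total):
    """Percentage of reads removed, rounded to two decimal places."""
    return round(100 * removed / total, 2)


# Totals from the table above.
print(percent_removed(3_899_366, 315_858_852))  # 1.23
```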

31 Potential Problems with TopHat Alignments Bowtie is optimized for ungapped alignment TopHat subdivides each read into 25bp segments –Each segment is mapped independently –Alignment blocks are then merged back together TopHat could fail to map reads that are derived from multiple exons [Figure: coding exons and read alignment]
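The sketch below shows why a read derived from multiple exons can be hard to place. It splits a read into 25 bp segments and reports the segments that contain a splice junction, since an ungapped aligner cannot place those segments in one piece. It is an illustration of the idea only, not TopHat's algorithm. Junction offsets are assumed to be the number of read bases before each junction, and the function names are hypothetical.

```python
def segments(read_length, segment_length=25):
    """0-based half-open (start, end) segments covering the read."""
    return [(start, min(start + segment_length, read_length))
            for start in range(0, read_length, segment_length)]


def segments_spanning(read_length, junction_offsets, segment_length=25):
    """Segments with a splice junction strictly inside them."""
    return [(start, end)
            for start, end in segments(read_length, segment_length)
            if any(start < offset < end for offset in junction_offsets)]


print(segments(75))                 # [(0, 25), (25, 50), (50, 75)]
print(segments_spanning(75, [30]))  # [(25, 50)]
```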

32 Mapping unaligned reads with BLAT

33 Intron Sizes Distribution in D. melanogaster Comeron JM and Kreitman M. The Correlation Between Intron Length and Recombination in Drosophila: Dynamic Equilibrium Between Mutational and Selective Forces. Genetics (156): 1175-1190

34 Filter the Unaligned Reads using Minimum Intron Size Minimum intron size in Drosophila is ~40 bases Only keep alignments that consist of 3 blocks where the distance between adjacent blocks is at least 30 bases [Figure: coding exons and alignment blocks, gaps >= 30]
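The filter described on this slide can be sketched as below. It assumes each alignment is a list of 0-based half-open (start, end) blocks on the contig. The function name and default parameters are hypothetical and simply mirror the numbers on the slide.

```python
def passes_intron_filter(blocks, min_gap=30, required_blocks=3):
    """Keep alignments with exactly three blocks separated by >= min_gap bases."""
    if len(blocks) != required_blocks:
        return False
    ordered = sorted(blocks)
    # Gap = start of the next block minus end of the previous block.
    return all(nxt[0] - prev[1] >= min_gap
               for prev, nxt in zip(ordered, ordered[1:]))


print(passes_intron_filter([(100, 125), (160, 185), (250, 275)]))  # True
print(passes_intron_filter([(100, 125), (140, 165), (250, 275)]))  # False
```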

35 Example of Unaligned Reads that Span Multiple Exons

36 Using mRNA-Seq Tracks in Annotation

37 Plans to Incorporate mRNA-Seq Data into the GEP Annotation Pipeline Develop and continue to improve programs used to manipulate and process mRNA-Seq datasets New Homework #2 from Dr. Buhler on how to use mRNA-Seq data for annotation Test and validate the new mRNA-Seq tracks –Han and William have used the mRNA-Seq data when they checked the annotation submissions this summer –Also received feedback from Bio 4342 this year

38 Cross-species mRNA-Seq How can we incorporate the mRNA-Seq data into the D. erecta and D. grimshawi annotation projects? [Figure legend: Reference, Published, In Progress, RNA-Seq]

39 Map D. yakuba mRNA-Seq Reads onto D. erecta Contigs Use D. yakuba reads to generate TopHat junctions and coverage data tracks for D. erecta Cannot generate Cufflinks and Oases transcripts directly from D. erecta alignments –Reads from less conserved regions may not be mapped Build transcriptome library from whole genome alignments to D. yakuba –Map assembled transcripts against the D. erecta contigs with BLAT

40 Incorporating mRNA-Seq data into D. grimshawi projects D. virilis and D. mojavensis are the two species that are most closely related to D. grimshawi –Cannot reliably detect conserved nucleotide sequences with BLASTN, BLAT, or ClustalW Build transcriptome library based on whole genome alignments to D. virilis and D. mojavensis –Run Cufflinks and Oases in a sliding window –Align the assembled transcripts against the D. grimshawi contigs with translated BLAT and TBLASTX

41 Cross-species mRNA-Seq

42 Designing and Managing Your Own Annotation Projects Three major components: –Workflow system for creating custom genome browsers Command-line based workflow is now operational –Galaxy modules for performing statistical analysis and data mining (with modENCODE data) –Create and manage your own annotation projects using the Project Management System

43 Project Management System Demo

44 Conclusions During the past year, we have made substantial improvements to the GEP web framework New mRNA-Seq data should help resolve many ambiguous cases and speed up annotation –mRNA-Seq evidence tracks now available on gander Continue to work on system that will allow you to create and manage your own projects

45 Questions and Group Discussion

47 rRNA in D. mojavensis mRNA-Seq

