Acknowledgements and References

Slides:



Advertisements
Similar presentations
KompoZer. This is what KompoZer will look like with a blank document open. As you can see, there are a lot of icons for beginning users. But don't be.
Advertisements

1 3. genome analysis. 2 The first DNA-based genome to be sequenced in its entirety was that of bacteriophage Φ-X174; (5,368 bp), sequenced by Frederick.
National Microbial Pathogen Data Resource About us NMPDR is a Bioinformatics Resource Center dedicated to the thorough understanding of core.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Subsystem Approach to Genome Annotation National Microbial Pathogen Data Resource Claudia Reich NCSA, University of Illinois, Urbana.
Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)
1 Unity of Invention: Biotech Examples TC1600 Special Program Examiner Julie Burke (571)
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Mouse Genome Sequencing
Title: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes By Peter F. Hallin, Hans-Henrik Stærfeldt, Eva Rotenberg, Tim T. Binnewies,
1 The Genome Browser allows you to –Browse the Rice-Japonica, Maize and Arabidopsis genomes. –View the location of a particular feature on the rice genome.
Galaxy for Bioinformatics Analysis An Introduction TCD Bioinformatics Support Team Fiona Roche, PhD Date: 31/08/15.
A new way of seeing genomes Combining sequence- and signal-based genome analyses Maik Friedel, Thomas Wilhelm, Jürgen Sühnel FLI Introduction: So far,
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS, LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES. Alexander Kozik and Richard W. Michelmore.
Sequence-based Similarity Module (BLAST & CDD only ) & Horizontal Gene Transfer Module (Ortholog Neighborhood & GC content only)
Statistical Tool for Identifying Sequence Variations that Correlate with Virus Phenotypic Characteristics in the Virus Pathogen Resource (ViPR) Brett E.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
B Nameeta Shah 1, Michael Teplitsky 2, Len A. Pennacchio, 2,3, Philip Hugenholtz 3, Bernd Hamann 1, 2, and Inna Dubchak 2, 3 1 Institute for Data Analysis.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
Sackler Medical School
Integration of Host Factor Data into the Virus Pathogen Database and Analysis Resource (ViPR) and the Influenza Research Database (IRD) Brett E. Pickett.
SRI International Bioinformatics 1 Genome Browser Markus Krummenacker Bioinformatics Research Group SRI, International Q
Orthology & Paralogy Alignment & Assembly Alastair Kerr Ph.D. [many slides borrowed from various sources]
Lettuce/Sunflower EST CGPDB project. Data analysis, assembly visualization and validation. Alexander Kozik, Brian Chan, Richard Michelmore. Department.
Generic Database. What should a genome database do? Search Browse Collect Download results Multiple format Genome Browser Information Genomic Proteomic.
SRI International Bioinformatics 1 Genome Browser Tomer Altman Bioinformatics Research Group SRI, International August 19th, 2009.
Copyright OpenHelix. No use or reproduction without express written consent1.
How do we represent the position specific preference ? BID_MOUSE I A R H L A Q I G D E M BAD_MOUSE Y G R E L R R M S D E F BAK_MOUSE V G R Q L A L I G.
Copyright OpenHelix. No use or reproduction without express written consent1.
Application of Bioinformatics in Genetic Research Instructors: Dr. Henry Baker Dr. Luciano Brocchieri Dr. Michele Tennant Dr. Lei Zhou
Orthology & Paralogy Alignment & Assembly Alastair Kerr Ph.D. WTCCB Bioinformatics Core [many slides borrowed from various sources]
Locating and sequencing genes
Student Submissions Integrity Diagnosis System (SSID) Min-Yen Kan.
Johnson - The Living World: 3rd Ed. - All Rights Reserved - McGraw Hill Companies Genomics Chapter 10 Copyright © McGraw-Hill Companies Permission required.
Copyright OpenHelix. No use or reproduction without express written consent1.
UCSC Genome Browser Zeevik Melamed & Dror Hollander Gil Ast Lab Sackler Medical School.
Accessing and visualizing genomics data
Copyright OpenHelix. No use or reproduction without express written consent1.
GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor.
Welcome to the combined BLAST and Genome Browser Tutorial.
Web Apollo/JBrowse • JBrowse is a web based genome browser
Lesson: Sequence processing
3. genome analysis.
Bioinformatics Research Group
Figure 20.0 DNA sequencers DNA Technology.
From: RTCGD: retroviral tagged cancer gene database
Figure 1. Number of CCDS IDs and genes represented in the human (A) and mouse (B) CCDS releases. The X-axis indicates the year in which a CCDS dataset.
Figure 1. Exploring and comparing context-dependent mutational profiles in various cancer types. (A) Mutational profiles of pan-cancer somatic mutations,
Lettuce/Sunflower EST CGPDB project.
Introduction: RFQ 2.0 QTE Type
Chapter 14 Bioinformatics—the study of a genome
N. Capp, E. Krome, I. Obeid and J. Picone
Scientists use several techniques to manipulate DNA.
Visualization of genomic data
Today… Review a few items from last class
There are four levels of structure in proteins
“Coding” for the building blocks of our bodies Edited
Identification and Characterization of pre-miRNA Candidates in the C
Ensembl Genome Repository.
Bacterial genomics: The controlled chaos of shifty pathogens
Enrique Garcia-Assad, Indresh Singh, Pratap Venepally, Jason Inman
Basic Local Alignment Search Tool
Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes
From Mendel to Genomics
A. McNally, F. Alhashash, M. Collins, A. Alqasim, K. Paszckiewicz, V
Welcome - webinar instructions
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Figure 1. The overlap between Ensembl/GENCODE, RefSeq and UniProtKB genes. The number of genes classified as coding in ... Figure 1. The overlap between.
Thomas J Cradick, Peng Qiu, Ciaran M Lee, Eli J Fine, Gang Bao 
Presentation transcript:

Acknowledgements and References PanGenome Visualization with JBrowse Namita Dongre, Anika Verma, Indresh Singh Informatics Department, J. Craig Venter Institute, Rockville, MD 20850 Abstract The objective of this project was to visually compare genome variations to a single consensus using a genome browser.  Our biological data came from the Enterobacter cloacae, strain 34977 and was based on the pan-genomic analysis done by pan-genome ortholog clustering tool (PanOCT) and three other commonly used ortholog-detection methods. Our consensus track contained the 70% common clusters present in the various genomes of E. cloacae. We created GFF-file format tracks for the consensus and individual genomes with Python and imported them into JBrowse, a JavaScript-based genome browser. We also prepared a reference sequence FASTA file which contained sequence data for the consensus track. Our JBrowse customizations allow users to easily view and compare clusters, flexible genome island (fGI) inserts, and other special features present in the consensus and genome variants. Methods Preparing Reference FASTA file Sequence data was taken from the centroids.fasta file which listed core clusters and their corresponding genes and sequences. Sequences of core clusters present on the same contig were concatenated and listed under that contig on the FASTA file. Sequences for insertions and non-core clusters were substituted with X’s. Features not represented by centroid.fasta were placed in a “Missing” contig. Python was used to prepare the reference sequence FASTA file. The following script from the JCVI Araport project was used to process the file: Preparing Consensus Track file Consensus track data such as coordinates, contigs, clusters, and annotations were taken from the Core.att_fgi file. The consensus track contains two features: fGI insertions (fgi_INS) and core clusters (CL). Python was used to prepare the consensus GFF track file. The following script from the JCVI Araport project was used to process the file: Preparing Genome Track file Individual genome track data came from the combined_att.file, panoct.results, and fgi_stats.details. The combined_att.file file provided the genes present in each genome. The panoct.results file mapped the clusters associated with each gene per genome. The fgi_stats.details file provided the fGI insertion data for each genome. The individual genome tracks contained four features, U_CORE, STOP_CORE, Gene, and Break, and two subfeatures to display the fGI insertions, fgi_u and fgi_s. Python was used to prepare the individual genome GFF tracks. The following script from the JCVI Araport project was used to process the file: Customizing Tracks Track customization was done by editing the trackList.json file for the JBrowse tracks. The feature tracks were all designed with HTMLFeatures for best representation. Background Pan-genome analysis is the process of comparing protein content in order to identify the differences between strains. However to find the differences between the strains, one must find the identical proteins throughout them. The easiest way to find these proteins is to find the orthologous clusters. We used pan-genomic analysis for this project to identify the orthologous clusters within the bacterium Enterobacter cloacae. Figure 2. The topmost track with the red, clear, and green clusters are part of the E00001 Genome Track. U_CORE is represented in the light red, STOP_CORE is represented in green, and Genes are colorless. The same data is represented in the E00002 Track below. The track below that is the Consensus track, which contains CL’s and fGI inserts. The CL’s are represented in blue, and the fGI inserts are represnted in purple. Below that is the reference sequence. Introduction The goal of this project was to visualize Pan-Genome Analysis using a genome browser. Ten Enterobacter cloacae strains were used as a pilot project. Genome browser will allow researcher to compare, interact and probe different fGIs, genes, features or regions of interest. Challenges While the cluster coordinates provided in the data were with respect to the nucleotide level, the reference sequence data were with respect to the peptide level. To account for this discrepancy, we divided the coordinates by 3 to match the sequence with the clusters. The reference sequence represented the core clusters present in the consensus. Genomes contained clusters that are not present in consensus core genome, non consensus clusters were listed in “missing” consensus contig. Ambiguous Amino Acid code “X” was used to represent inserts and non core clusters. Some of the U_CORE features existed across multiple contigs due to the circular nature of the E. cloacae strain. Since JBrowse only supports a linear view, we listed the other contigs IDs in track details. Results With JBrowse, we created a clear visual way to compare clusters, insertions, and other special features of the genomes and consensus. Funding This project has been funded in whole or part with funds from J. Craig Venter Institute, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under Award Number U19AI110819. Future Directions In the future, we hope to add more customization to feature tracks such as icons for insertions and make it easier for the user to play around with the track colors. Additionally, we hope to take care of the data discrepancies mentioned above such as more accurate nucleotide to peptide-level coordinate conversion and perhaps finding a genome browser to retain the circular properties of the DNA of E. cloacae and other bacteria. We also hope to add functionality to allow users to sort genome tracks by various criteria. Acknowledgements and References Fouts, D. E., Brinkac, L., Beck, E., Inman, J., & Sutton, G. (2012). PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Research, 240(22). Thank you to Granger Sutton, Derrick Fouts, and Maria Kim for your assistance in this project. https://github.com/jcvi-software/PanGenomeVisualization Figure 1. A screenshot of our code in the consensusTrackCreate.py file. This creates the consensus (in gff format) file.