1 MICROBIAL GENOME ANNOTATION Loren Hauser Miriam Land Yun-Juan Chang Frank Larimer Doug Hyatt Cynthia Jeffries
NEB Educational Support 2 http://www.neb.com/nebecomm/course_support.asp?
Why study Computational Biology and Bioinformatics? DNA sequencing output is growing faster than Moore’s law! 1 Illumina sequencing machine = 0.5 Tbp/week There are hundreds of these and thousands of other sequencing machines around the world. New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day! 3
Why study Medical Bioinformatics? In the near future, most cancer diagnostics will involve DNA or RNA sequencing! In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments! Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections. 4
DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics 5 http://www.jgi.doe.gov/education
6 Why Study Microbial Genomes? Large biological mass (50% of total); photosynthetic (Prochlorococcus); fix N₂ gas to NH₃ (Rhodopseudomonas); NH₃ to NO₂ (Nitrosomonas); bioremediation (Shewanella, Burkholderia); pathogens, BW (Yersinia pestis - plague); food production (Lactobacillus); CH₄ production (Methanosarcina); H₂ production (Rhodopseudomonas)
Example of Current Microbial Genome Projects UC Davis: FDA-funded project to sequence 100K bacterial genomes associated with food. 100K genomes / 5 years = 20K per year; 20K per year / 200 days/year = 100 genomes/day! 7
8 Web Resources and Contact Information http://genome.ornl.gov/microbial/ http://www.jgi.doe.gov/ http://genome.jgi-psf.org/ http://www.jcvi.org/ http://www.ncbi.nlm.nih.gov/ http://www.sanger.ac.uk/ http://www.ebi.ac.uk/ ftp://ftp.lsd.ornl.gov/pub/JGI Artemis-ready files for each scaffold (feature table plus FASTA sequence file) Contact: firstname.lastname@example.org; email@example.com firstname.lastname@example.org
13 Basic Annotation Impacts: Design of oligonucleotide arrays; Design & prioritize protein expression constructs; Design & prioritize gene knockouts; Assessment of overall metabolic capacity; Database for proteomics; Allows visualization of whole genome
14 Additional Analysis Impacts: Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile; Regulatory motif discovery; Operon and regulon discovery; Regulatory and protein association network discovery
15 Microbial Annotation Genome Pipeline (flowchart; box labels only): Scaffolds or contigs; Prodigal; Model correction; Final Gene List; InterPro; COGs; Web Pages; Blast; Complex Repeats; Simple repeats; GC Content, GC skew; PRIAM; Function call; tRNAs; rRNA, Misc_RNAs; Feature table; TMHMM; SignalP
16 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) Unsupervised: Automatically learns the statistical properties of the genome. Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism. Draft: Prodigal can train on multiple sequences then analyze individual draft sequences. Open Source: Prodigal is freely available under the GPL. Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)
17 G+C Frame Plot Training Takes all ORFs above a specified length in the genome. Examines the G+C bias in each frame position of these ORFs. Runs a dynamic programming algorithm, using G+C frame bias as its coding scoring function, to predict genes. Takes those predicted genes and gathers dicodon usage statistics.
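The two statistics named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Prodigal's actual code: the function names `gc_frame_bias` and `dicodon_counts` are hypothetical, and the input is assumed to be a list of in-frame ORF or gene sequences given as plain strings.

```python
from collections import Counter


def gc_frame_bias(orfs):
    """Return the G+C fraction at codon positions 1, 2 and 3 over all ORFs.

    Hypothetical helper: each ORF is assumed to start in frame, so the
    base at index i sits at codon position i % 3.
    """
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        seq = orf.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(usable):
            pos = i % 3
            total[pos] += 1
            if seq[i] in "GC":
                gc[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]


def dicodon_counts(genes):
    """Count in-frame dicodons (hexamers stepped one codon at a time)."""
    counts = Counter()
    for gene in genes:
        seq = gene.upper()
        for i in range(0, len(seq) - 5, 3):
            counts[seq[i:i + 6]] += 1
    return counts


if __name__ == "__main__":
    toy_orfs = ["ATGGCCTAA"]  # toy input, not real data
    print(gc_frame_bias(toy_orfs))
    print(dicodon_counts(toy_orfs))
```

In a real genome the per-position G+C fractions would be computed over many long ORFs, and the position with the strongest bias would feed the first-pass coding score; how Prodigal weights these values is not described on the slide.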
18 Gene Prediction Dicodon usage coding score. Length factor added to coding score (GC-content-dependent). Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores get penalized by the difference). Dynamic programming to put genes together. Bonuses for operon distances, larger bonus for -1/-4 overlaps. Same-strand overlap allowed (up to 60 bases). Opposite-strand overlap -->3'r 5'f<- allowed (up to 250 bases).
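The "dynamic programming to put genes together" step can be pictured as choosing the highest-scoring chain of candidate genes under the overlap limits quoted on the slide. The sketch below is an assumption-laden simplification, not Prodigal's implementation: the `Gene` tuple, `compatible` and `best_gene_chain` are hypothetical names, scores are taken as given, operon bonuses are omitted, and the 250-base limit is applied to any opposite-strand pair rather than to one specific orientation.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    score: float  # coding score, assumed to be precomputed

# Overlap limits taken from the slide.
SAME_STRAND_MAX_OVERLAP = 60
OPP_STRAND_MAX_OVERLAP = 250


def compatible(prev: Gene, nxt: Gene) -> bool:
    """True if nxt may directly follow prev in a chain."""
    if prev.start >= nxt.start:
        return False  # require left-to-right order, no nesting
    overlap = max(0, prev.end - nxt.start + 1)
    same = prev.strand == nxt.strand
    limit = SAME_STRAND_MAX_OVERLAP if same else OPP_STRAND_MAX_OVERLAP
    return overlap <= limit


def best_gene_chain(candidates: List[Gene]) -> List[Gene]:
    """Return the maximum-total-score chain of compatible genes (O(n^2))."""
    genes = sorted(candidates, key=lambda g: (g.end, g.start))
    if not genes:
        return []
    best = [g.score for g in genes]  # best chain score ending at gene i
    back = [-1] * len(genes)         # predecessor index for traceback
    for i, gene in enumerate(genes):
        for j in range(i):
            if compatible(genes[j], gene) and best[j] + gene.score > best[i]:
                best[i] = best[j] + gene.score
                back[i] = j
    i = max(range(len(genes)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(genes[i])
        i = back[i]
    return chain[::-1]


if __name__ == "__main__":
    a = Gene(1, 300, "+", 10.0)
    b = Gene(250, 600, "+", 8.0)   # overlaps a by 51 bases: allowed
    c = Gene(100, 500, "+", 12.0)  # overlaps a and b by more than 60 bases
    print(best_gene_chain([a, b, c]))
```

Only adjacent genes in the chain are checked against each other, which keeps the recurrence simple; the toy coordinates and scores are made up for illustration.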
19 Start Site Scoring: Shine-Dalgarno Motif Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency). Moves starts based on these discoveries. Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations). RBS motifs are based on the AGGAGG sequence: 3-6 base motifs, with one mismatch allowed in motifs of 5 bases or longer (e.g. GGTGG or AGCAG). Runs a final dynamic programming pass with the start scoring function.
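The motif rule on this slide (3-6 base pieces of AGGAGG, one mismatch tolerated in pieces of 5 bases or more) can be illustrated with a small matcher. This is a sketch under stated assumptions, not Prodigal's scoring code: `find_rbs` and its helpers are hypothetical names, the input is assumed to be the sequence immediately upstream of a candidate start, and the ranking (longer motif first, then fewer mismatches) is a stand-in for the learned weights the slide mentions.

```python
SD_CORE = "AGGAGG"  # Shine-Dalgarno consensus named on the slide


def sd_submotifs():
    """All distinct 3- to 6-base substrings of AGGAGG, longest first."""
    motifs = {
        SD_CORE[i:i + k]
        for k in range(3, len(SD_CORE) + 1)
        for i in range(len(SD_CORE) - k + 1)
    }
    return sorted(motifs, key=lambda m: (-len(m), m))


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_rbs(upstream):
    """Return (offset, window, motif, mismatches) for the best hit, or None.

    One mismatch is allowed only for motifs of 5 bases or longer.
    """
    seq = upstream.upper()
    best_key, best_hit = None, None
    for motif in sd_submotifs():
        allowed = 1 if len(motif) >= 5 else 0
        for i in range(len(seq) - len(motif) + 1):
            window = seq[i:i + len(motif)]
            mm = mismatches(window, motif)
            if mm > allowed:
                continue
            key = (len(motif), -mm)  # prefer longer, then cleaner matches
            if best_key is None or key > best_key:
                best_key, best_hit = key, (i, window, motif, mm)
    return best_hit


if __name__ == "__main__":
    # The two one-mismatch examples quoted on the slide, in toy context.
    print(find_rbs("TTGGTGGTT"))
    print(find_rbs("TTAGCAGTT"))
```

A fuller version would also score the spacer distance between the motif and the start codon, which the slide does not detail.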
20 Start Site Scoring: Other Motifs If Shine-Dalgarno scoring is strong, use it; this accounts for ~85% of genomes. If Shine-Dalgarno scoring is weak, look for other motifs. If a strong-scoring motif is found, use it (example: GGTG in A. pernix). If no strong-scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).
44 Transporter Gene Loss in Yersinia pestis: 36 genes involved in transport from YPSE are nonfunctional in YPES; 13 lost due to frameshifts; 11 lost due to deletions; 6 lost due to IS element insertions; 4 (2 pairs) lost due to recombination causing deletions and frameshifts; 2 lost due to premature stop codons
Long Term Vision Develop TPing SOPs, and an automated analysis pipeline. Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes. Incorporate the TP analysis pipeline into ORNL’s automated microbial annotation pipeline, and eventually into IMG and GenBank files. Add additional experimental methods to improve the GRN determinations.