1 MICROBIAL GENOME ANNOTATION Loren Hauser Miriam Land Yun-Juan Chang Frank Larimer Doug Hyatt Cynthia Jeffries.


1 MICROBIAL GENOME ANNOTATION Loren Hauser, Miriam Land, Yun-Juan Chang, Frank Larimer, Doug Hyatt, Cynthia Jeffries

2 NEB Educational Support

3 Why study Computational Biology and Bioinformatics?
- DNA sequencing output is growing faster than Moore’s law!
- 1 Illumina sequencing machine = 0.5 Tbp/week
- There are hundreds of these and thousands of other sequencing machines around the world.
- New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!

4 Why study Medical Bioinformatics?
- In the near future, most cancer diagnostics will involve DNA or RNA sequencing!
- In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor’s ability to use that information are the only real impediments!
- Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.

5 DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics

6 Why Study Microbial Genomes?
- Large biological mass (50% of total)
- photosynthetic (Prochlorococcus)
- fix N2 gas to NH3 (Rhodopseudomonas)
- NH3 to NO2 (Nitrosomonas)
- bioremediation (Shewanella, Burkholderia)
- pathogens, BW (Yersinia pestis - plague)
- food production (Lactobacillus)
- CH4 production (Methanosarcina)
- H2 production (Rhodopseudomonas)

7 Example of Current Microbial Genome Projects
- UC Davis: FDA-funded 100K bacterial genomes project associated with food.
- 100K genomes / 5 years = 20K per year; 20K per year / 200 days/year = 100 genomes/day!
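The throughput arithmetic on this slide can be checked with a few lines of Python. This is only a worked restatement of the slide's numbers; the function name and parameter names are illustrative, not part of the original deck.

```python
def genomes_per_day(total_genomes, years, working_days_per_year):
    """Average daily throughput needed to finish a sequencing project on time."""
    per_year = total_genomes / years           # 100,000 / 5 = 20,000 per year
    return per_year / working_days_per_year    # 20,000 / 200 = 100 per day


# Numbers taken from the slide: 100K genomes, 5 years, 200 days/year.
print(genomes_per_day(100_000, 5, 200))  # 100.0
```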

8 Web Resources and Contact Information
- ftp://ftp.lsd.ornl.gov/pub/JGI
- Artemis-ready files for each scaffold (feature table plus FASTA sequence file)
- Contact:

10 Evolution of Sequencing Throughput

11 Sequenced Microbial Genomes
- ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS
- BACTERIAL GENOMES: 3363 FINISHED; IN PROGRESS
- ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST)
- as of Sept 6, 2012

12 Published Genomes
- Nitrosomonas europaea - J. Bac. 185(9): (2003)
- Prochlorococcus MED4 & MIT - Nature 424: (2003)
- Synechococcus WH - Nature 424: (2003)
- Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)
- Yersinia pseudotuberculosis - PNAS 101(22): (2004)
- Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3): (2006)
- Nitrosococcus oceani - Appl. Envir. Micro. 72(9): (2006)
- Burkholderia xenovorans - PNAS 103(42): (2006)
- Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)
- Nitrosomonas eutropha C91 - Env. Micro. 9(12): (2007)
- Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4): (2008)
- Nitrosospira multiformis - Appl. Envir. Micro. 74(11): (2008)
- Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9): (2008)
- Saccharophagus degradans - PLoS Genetics 4(5):e (2008)
- R. palustris, 5 strain comparison - PNAS 105(47): (2008)
- L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)

13 Basic Annotation Impacts
- Design of oligonucleotide arrays
- Design & prioritize protein expression constructs
- Design & prioritize gene knockouts
- Assessment of overall metabolic capacity
- Database for proteomics
- Allows visualization of whole genome

14 Additional Analysis Impacts
- Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile
- Regulatory motif discovery
- Operon and regulon discovery
- Regulatory and protein association network discovery

15 Microbial Genome Annotation Pipeline
[Pipeline diagram; boxes as labeled: Scaffolds or contigs; Prodigal; Model correction; Final Gene List; InterPro; COGs; Web Pages; Blast; Complex Repeats; Simple repeats; GC Content, GC skew; PRIAM; Function call; tRNAs; rRNA, Misc_RNAs; Feature table; TMHMM; SignalP]

16 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
- Unsupervised: Automatically learns the statistical properties of the genome.
- Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism.
- Draft: Prodigal can train on multiple sequences then analyze individual draft sequences.
- Open Source: Prodigal is freely available under the GPL.
- Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics Mar 8;11(1):119. (Highly Accessed)

17 G+C Frame Plot Training
- Takes all ORFs above a specified length in the genome.
- Examines the G+C bias in each frame position of these ORFs.
- Runs a dynamic programming algorithm, using G+C frame bias as its coding scoring function, to predict genes.
- Takes those predicted genes and gathers dicodon usage statistics.
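The first two training steps above can be sketched in Python. This is a minimal illustration, not Prodigal's actual code: it scans only the forward strand, treats any stop-free codon run as an "ORF", and the names `long_orfs`, `gc_by_codon_position`, and the default `min_len` are assumptions made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def long_orfs(seq, min_len=90):
    """Yield stop-free codon runs (forward strand only) of at least min_len bases."""
    seq = seq.upper()
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - start >= min_len:
                    yield seq[start:i]
                start = i + 3
        # Open run at the end of the sequence, trimmed to whole codons.
        end = start + ((len(seq) - start) // 3) * 3
        if end - start >= min_len:
            yield seq[start:end]


def gc_by_codon_position(orfs):
    """Return the G+C fraction at codon positions 1, 2 and 3 across all ORFs."""
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        for i, base in enumerate(orf):
            pos = i % 3
            total[pos] += 1
            if base in "GC":
                gc[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]
```

Calling `gc_by_codon_position(long_orfs(genome_sequence))` on a real genome gives three fractions; a strong difference between them is the frame bias that a coding score can exploit. Handling the reverse strand and the later dicodon statistics are left out of this sketch.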

18 Gene Prediction
- Dicodon usage coding score
- Length factor added to coding score (GC-content-dependent)
- Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores get penalized by the difference).
- Dynamic programming to put genes together.
- Bonuses for operon distances, larger bonus for -1/-4 overlaps.
- Same strand overlap allowed (up to 60 bases).
- Opposite strand -->3'r 5'f<- allowed (up to 250 bases)
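The "dynamic programming to put genes together" step can be illustrated with a small chaining sketch. It is not Prodigal's implementation: the `Gene` class, the quadratic search, and the reading of the opposite-strand rule as "a forward gene followed by a reverse gene may overlap at their 3' ends" are assumptions made for this example, and operon-distance bonuses are omitted. The overlap limits (60 and 250 bases) are the ones stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int    # leftmost coordinate, 1-based
    end: int      # rightmost coordinate, inclusive
    strand: str   # "+" or "-"
    score: float  # coding score plus start score, however it was computed


MAX_SAME_STRAND_OVERLAP = 60       # from the slide
MAX_OPPOSITE_3PRIME_OVERLAP = 250  # from the slide


def compatible(left, right):
    """Can `right` follow `left` in one gene chain?"""
    if not (left.start < right.start and left.end < right.end):
        return False
    overlap = left.end - right.start + 1
    if overlap <= 0:
        return True
    if left.strand == right.strand:
        return overlap <= MAX_SAME_STRAND_OVERLAP
    # Assumed reading of the slide: only converging 3' ends may overlap.
    if left.strand == "+" and right.strand == "-":
        return overlap <= MAX_OPPOSITE_3PRIME_OVERLAP
    return False


def best_gene_chain(candidates):
    """Return the highest-scoring chain of mutually adjacent-compatible genes."""
    genes = sorted(candidates, key=lambda g: (g.end, g.start))
    if not genes:
        return []
    best = [g.score for g in genes]  # best chain score ending at gene i
    prev = [None] * len(genes)       # back-pointer for traceback
    for i, right in enumerate(genes):
        for j in range(i):
            if compatible(genes[j], right) and best[j] + right.score > best[i]:
                best[i] = best[j] + right.score
                prev[i] = j
    i = max(range(len(genes)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(genes[i])
        i = prev[i]
    return chain[::-1]
```

The design is the usual weighted-interval chaining: each candidate keeps the best score of any chain ending at it, and the traceback recovers the chosen genes. Only neighboring genes in the chain are checked for compatibility, which is a simplification.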

19 Start Site Scoring: Shine-Dalgarno Motif
- Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency)
- Moves starts based on these discoveries.
- Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).
- RBS motifs based on AGGAGG sequence, 3-6 base motifs, with one mismatch allowed in 5 base or longer motifs (e.g. GGTGG, or AGCAG).
- Runs a final dynamic programming pass with the start scoring function.
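The RBS motif rule above (sub-motifs of AGGAGG, 3 to 6 bases, one mismatch tolerated in motifs of 5 or more bases) can be sketched as follows. This is an illustrative matcher only: the function names are hypothetical, restricting the mismatch to interior positions is an assumption (consistent with the slide's GGTGG and AGCAG examples), and spacer distance to the start codon and the iterative re-scoring are not modeled.

```python
SD_CONSENSUS = "AGGAGG"


def sd_submotifs(min_len=3, max_len=6):
    """All distinct substrings of the consensus with length min_len..max_len."""
    motifs = set()
    for length in range(min_len, max_len + 1):
        for i in range(len(SD_CONSENSUS) - length + 1):
            motifs.add(SD_CONSENSUS[i:i + length])
    return sorted(motifs, key=lambda m: (len(m), m))


def matches(motif, window):
    """Exact match, or one interior mismatch when the motif has 5+ bases."""
    mismatches = [i for i, (m, w) in enumerate(zip(motif, window)) if m != w]
    if not mismatches:
        return True
    if len(motif) >= 5 and len(mismatches) == 1:
        return 0 < mismatches[0] < len(motif) - 1
    return False


def best_rbs_motif(upstream):
    """Return (matched_sequence, offset) of the longest hit upstream, or None."""
    upstream = upstream.upper()
    best = None
    for motif in sd_submotifs():
        for i in range(len(upstream) - len(motif) + 1):
            window = upstream[i:i + len(motif)]
            if matches(motif, window):
                key = (len(motif), window == motif)  # longer first, then exact
                if best is None or key > best[0]:
                    best = (key, window, i)
    if best is None:
        return None
    return best[1], best[2]
```

For example, `best_rbs_motif("TTTGGTGGTTTT")` returns `("GGTGG", 3)`, the one-mismatch variant of GGAGG cited on the slide.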

20 Start Site Scoring: Other Motifs
- If Shine-Dalgarno scoring is strong, use it; this accounts for ~85% of genomes.
- If Shine-Dalgarno scoring is weak, look for other motifs.
- If a strong scoring motif is found, use it (example: GGTG in A. pernix).
- If no strong scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).

21 Annotated Gene Prediction

22 Prodigal Scoring

23 Gene Prediction Problems – Pseudogenes

24 Pseudogenes – Internal deletion

25 Pseudogenes – Premature stop codon

26 Pseudogenes – N-terminal deletion

27 Pseudogenes – Transposon insertion

28 Pseudogenes – Multiple frameshifts

29 Pseudogenes – Premature Stop and Frameshift

30 Pseudogenes – Dead Start Codon

32 GENE PAGE

36 ORGANISM’S (PSYC) COGS LIST

37 Taxonomic Distribution of Top KEGG BLAST Hits

38 Frequency distance distributions (Salgado et al. PNAS (2000) 97:6652, Fig. 2)

39 Frequency distance distributions (Salgado et al. PNAS (2000) 97:6652, Fig. 3b)

40 Branched Chain Amino Acid Transporter family

41 Probable Ancient Gene (Liv Operon)

42 Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris

43 Example of Lateral Transfer

44 Transporter Gene Loss in Yersinia pestis
- 36 genes involved in transport from YPSE are nonfunctional in YPES
- 13 lost due to frameshifts
- 11 lost due to deletions
- 6 lost due to IS element insertions
- 4 (2 pair) lost due to recombination causing deletions and frameshifts
- 2 lost due to premature stop codons

46 Nostoc punctiforme Signal Transduction Histidine Kinases

47 Nostoc punctiforme Signal Transduction Histidine Kinases

48 Nostoc punctiforme Signal Transduction Histidine Kinases

49 Nostoc punctiforme Signal Transduction Histidine Kinases

50 Nostoc punctiforme Regulatory Proteins

51 Burkholderia xenovorans Regulatory Proteins

52 Regulatory Protein Identification Scheme

53 Summary of automated transporter annotation – Zymomonas

54 Zymomonas transporters complete listing

55 Transcriptome Analysis Pipeline: RNA sequences to GRN (GRN = genetic regulatory network)
[Flowchart; steps as labeled]
- Collect RNAseq data
- Map reads to genomes
- Calculate reads/bp
- Display frequency plot
- Determine operons from frequency plot
- Compare operon determinations (genome coordinates)
- Predict operons in silico
- Improve algorithm
- Determine orthologous operons
- Determine orthologs with OrthoMCL
- Align orthologous promoters
- Determine TFBS from alignments
- Determine TISs with 5’ RACE
- Cluster analysis from gene expression arrays
- Predict TFBS in silico
- Cluster analysis of gene expression changes
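The "calculate reads/bp" and "determine operons from frequency plot" steps can be sketched in Python. The slides do not say how operons are read off the plot, so the rule used here (adjacent same-strand genes are joined when read depth never drops below a threshold between them) is an assumption for illustration, as are the function names, the `min_depth` parameter, and the use of unstranded coverage.

```python
def coverage_per_base(genome_length, reads):
    """Per-base read depth. reads: (start, end) pairs, 0-based, end exclusive."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # depth rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    cov, depth = [], 0
    for d in diff[:genome_length]:
        depth += d
        cov.append(depth)
    return cov


def call_operons(genes, cov, min_depth=1):
    """Group genes into operons.

    genes: (name, start, end, strand) tuples sorted by start,
    0-based coordinates with end exclusive.
    """
    operons = []
    for gene in genes:
        name, start, end, strand = gene
        if operons:
            prev = operons[-1][-1]
            gap = cov[prev[2]:start]  # empty when the genes touch or overlap
            continuous = all(d >= min_depth for d in gap)
            if prev[3] == strand and continuous:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]
```

A short usage example with made-up coordinates:

```python
cov = coverage_per_base(100, [(0, 50), (10, 60), (80, 95)])
genes = [("a", 5, 20, "+"), ("b", 25, 55, "+"), ("c", 80, 95, "+")]
print(call_operons(genes, cov))  # [['a', 'b'], ['c']]
```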

56 Dynamic range and sensitivity

57 New gene, wrong start, riboswitch

58 Small Regulatory RNA ???

59 Differential gene expression

60 Operon with Internal Promoter

61 Long Term Vision
- Develop TPing SOPs, and an automated analysis pipeline.
- Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes.
- Incorporate the TP analysis pipeline into ORNL’s automated microbial annotation pipeline, and eventually into IMG and GenBank files.
- Add additional experimental methods to improve the GRN determinations.

