Genome analysis and annotation


1 Genome analysis and annotation

2 Genome Annotation
- Which sequences code for proteins and structural RNAs?
- What is the function of the predicted gene products?
- Can we link genotype to phenotype? (e.g., which genes are turned on, and when? Why do two strains of the same pathogen vary in their pathogenicity?)
- Can we trace the evolutionary history of an organism from its genomic sequence and genome organization? The evolutionary history of a pathway?

3 Gene finding
Begins with the prediction of gene models through:
1) Identification of Open Reading Frames (ORFs)
2) Examination of base composition differences between coding and non-coding regions
3) Computational gene recognition (exons, introns, exon-intron boundaries) using a variety of gene-finding algorithms (GLIMMER, GRAIL, FGENEH, GENSCAN, GLIMMER-HMM, etc.)
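Steps 1 and 2 of the gene finding slide can be illustrated with a minimal Python sketch. It scans only the forward strand, uses the standard ATG start and TAA/TAG/TGA stop codons, and reports GC content as one simple base composition measure. The function names and the `min_codons` cutoff are assumptions made for illustration; the gene finders named on the slide use far richer statistical models.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for forward-strand ORFs.

    start is the 0-based index of the ATG; end is the index just past
    the stop codon. min_codons counts the stop codon (illustrative cutoff).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def gc_content(seq: str) -> float:
    """Fraction of G and C bases, a crude base composition signal."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"
    print(find_orfs(toy, min_codons=3))
    print(gc_content(toy))
```

A fuller version would also scan the reverse complement to cover all six reading frames.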

4 Gene finding (cont'd)
Another gene finding/confirmation approach is based on experimental evidence, using homology:
1) Alignment of Expressed Sequence Tags (ESTs) and full-length cDNA sequences with genomic DNA (gDNA)
Advantages: gene discovery, proof of expression, training data for gene finders
Disadvantages: disproportionate representation of transcripts (highly expressed genes are over-sampled)
2) Examination of protein translation profiles: peptide sequencing, mass spectrometry, etc.

5 Gene finding (cont'd)
The difficulty of the gene finding task varies between organisms.
Relatively easy in bacterial and archaeal genomes, mostly due to:
1) High gene density (about 1 kb per gene on average)
2) Short intergenic regions
3) Lack of introns
Much more difficult in eukaryotic genomes, where it can become a major focus of activity in the annotation phase:
1) Low gene density (1-200 kb per gene)
2) Presence of repeats
3) Most eukaryotic genes have introns and exons, and many undergo alternative splicing
Inaccurate predictions and false positives are common.

6 Repeats complicate genome assembly and gene finding (Example: Schistosoma mansoni genome)
[Figure: repeat annotation of a S. mansoni genomic region. Labels, in transcript order: 53% id.; Sm SR2 sub-family A non-LTR retrotransposon (SmR2A); 94% id.; Sm SR2 sub-family B non-LTR retrotransposon; unknown repeat; SmR2A (95% id.); unknown repeat; SmR2A (91% id.); SmR2A (89% id.); SmR2A (92% id.); SjR2-like (85% id.); SR2A (90% id.)]

7 Comparing genomes can help with gene finding
[Figure: nucleotide sequence conservation between S. japonicum and S. mansoni, plotted using mVISTA]

8 Sequence homology at exons
S. mansoni as reference. Conclusion: the S. japonicum sequence can be used to find exons in S. mansoni.
S. japonicum as reference. Conclusion: the S. mansoni sequence can be used to find exons in S. japonicum.
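The idea behind the conservation comparison in the two preceding slides can be sketched as a sliding-window percent identity over a pairwise alignment: windows that overlap exons are expected to score higher than intronic or intergenic windows. The Python below is a minimal illustration, not the mVISTA implementation; the function name, window size, and step are assumptions, and the input is assumed to be two already-aligned strings of equal length with `-` for gaps.

```python
from typing import List, Tuple


def window_identity(aln_a: str, aln_b: str,
                    window: int = 100, step: int = 25) -> List[Tuple[int, float]]:
    """Percent identity per window over two aligned sequences.

    Returns (window_start, percent_identity) pairs. Gap columns ('-')
    never count as matches. Window and step sizes are illustrative.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if not aln_a:
        return []
    results = []
    last_start = max(len(aln_a) - window, 0)
    for start in range(0, last_start + 1, step):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / len(a)))
    return results


if __name__ == "__main__":
    # Toy alignment: the first window is conserved, the second is not.
    for start, pct in window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4):
        print(start, pct)
```

Windows above a chosen identity threshold would then be candidate conserved exons in the reference genome.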

9 THE INSTITUTE FOR GENOMIC RESEARCH (TIGR)
Case study: Gene finding in the eukaryotic parasite Schistosoma mansoni

10 The TIGR Gene Modeling Pipeline
Pipeline stages: Repeat Masking; Ab-initio Gene Prediction; Sequence Homology Searching; Combining Evidence; Final Gene Structures
- Prior to gene discovery efforts, repeats must be identified and masked.
- Repeats tend to confuse ab-initio gene finders.
- Fragments of transposons are often mistaken for protein-coding exons of genes.
- By masking repeats, we increase the signal-to-noise ratio.
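The masking step can be sketched as follows. Given repeat coordinates (for example, hits against a repeat library), each repeat base is replaced with `N` (hard masking) or lowercased (soft masking) so that downstream gene finders ignore or down-weight it. The function name, the 0-based half-open interval convention, and the `soft` flag are assumptions for this illustration, not the interface of any particular masking tool.

```python
from typing import Iterable, Tuple


def mask_repeats(seq: str, intervals: Iterable[Tuple[int, int]],
                 soft: bool = False) -> str:
    """Mask repeat intervals in a sequence.

    intervals are 0-based, half-open (start, end) pairs (an assumed
    convention). Hard masking writes 'N'; soft masking lowercases.
    """
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range hits do not raise.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower() if soft else "N"
    return "".join(masked)


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    repeats = [(2, 5)]  # hypothetical repeat hit
    print(mask_repeats(genome, repeats))             # hard-masked
    print(mask_repeats(genome, repeats, soft=True))  # soft-masked
```

Hard masking removes the repeat sequence from view entirely, which is why 'N' counts after masking can be used to estimate the masked fraction of a genome.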

11 Construction of a S. mansoni Repeat Library
- Catalog known Schistosoma transposable elements (TEs), particularly retrotransposons: SR1, SR2, Sinbad, fugitive, salmonid, boudicca, saci, cercyon
- De novo construction of a repeat library using RepeatScout (Price et al., 2005): 1125 repeat families found

12 Genome Masking Statistics
Total number of basepairs: 381,816,328
'N's found in gaps: 6,171,089
'N's found after masking: 187,957,396
Adjusted totals, accounting for N-gaps:
Total number of basepairs: 375,645,239
Masked bps: 181,786,307
Percentage of the genome repeat masked: 48.3%
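The adjusted totals on the masking statistics slide follow from subtracting the gap 'N's from both the genome size and the post-masking 'N' count. The short Python check below reproduces that arithmetic from the three raw counts; the variable names are chosen here for illustration. The ratio evaluates to roughly 0.4839, which matches the slide's 48.3% if that figure was truncated rather than rounded.

```python
# Raw counts taken from the masking statistics slide.
total_bp = 381_816_328          # all basepairs in the assembly
gap_n = 6_171_089               # 'N's that were already present as gaps
n_after_masking = 187_957_396   # 'N's present after repeat masking

# Gap 'N's are not real sequence, so remove them from both figures.
adjusted_total = total_bp - gap_n
masked_bp = n_after_masking - gap_n
masked_fraction = masked_bp / adjusted_total

print(f"adjusted total: {adjusted_total:,}")
print(f"masked bps:     {masked_bp:,}")
print(f"masked share:   {masked_fraction:.4f}")
```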

13 The TIGR Gene Modeling Pipeline
Pipeline stages: Repeat Masking; Ab-initio Gene Prediction; Sequence Homology Searching; Combining Evidence; Final Gene Structures
Ab-initio gene prediction:
- augustus: provided by Mario Stanke; predicted 9,208 genes
- glimmerHMM: provided by Ela Pertea; predicted 25,890 genes

14 The TIGR Gene Modeling Pipeline
Pipeline stages: Repeat Masking; Ab-initio Gene Prediction; Sequence Homology Searching; Combining Evidence; Final Gene Structures
Spliced protein alignments using AAT (Huang, 1997)
- Searched:
  - TIGR's internal non-redundant protein db
  - Custom protein databases: Caenorhabditis elegans and C. briggsae; Brugia malayi
- Genewise predictions for best protein alignments
Spliced transcript alignments
- Alignments (blat, sim4) of S. mansoni ESTs and cDNAs, followed by alignment assembly using Program to Assemble Spliced Alignments (PASA)
- AAT alignments of S. japonicum ESTs

15 The TIGR Gene Modeling Pipeline
Pipeline stages: Repeat Masking; Ab-initio Gene Prediction; Sequence Homology Searching; Combining Evidence; Final Gene Structures
EVidenceModeler (EVM): combines predicted exons and alignments into weighted consensus gene structures.
Weighted evidence types: PASA transcript alignment assemblies; Genewise protein alignments; gene predictions and AAT alignments.
[Figure: evidence tracks between Start and End, annotated with numeric weights; the individual values are not recoverable from the transcript]
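The notion of a weighted consensus can be illustrated with a toy Python sketch: each piece of evidence votes for the bases it covers with a weight tied to its source, and runs of bases whose summed weight reaches a threshold are reported as consensus segments. This is not EVM's actual algorithm, and the weights, threshold, and source names below are invented for the example.

```python
from typing import Dict, List, Tuple

# (source, start, end) with 0-based, half-open coordinates (assumed convention).
Evidence = Tuple[str, int, int]


def weighted_consensus(length: int, evidence: List[Evidence],
                       weights: Dict[str, float],
                       threshold: float) -> List[Tuple[int, int]]:
    """Return segments whose summed evidence weight is >= threshold."""
    score = [0.0] * length
    for source, start, end in evidence:
        w = weights.get(source, 0.0)  # unknown sources contribute nothing
        for i in range(max(start, 0), min(end, length)):
            score[i] += w

    segments = []
    seg_start = None
    for i, s in enumerate(score):
        if s >= threshold and seg_start is None:
            seg_start = i
        elif s < threshold and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, length))
    return segments


if __name__ == "__main__":
    # Hypothetical weights: transcript evidence trusted most, ab-initio least.
    weights = {"pasa": 10.0, "genewise": 5.0, "abinitio": 1.0}
    evidence = [
        ("pasa", 10, 20), ("abinitio", 5, 25),
        ("genewise", 30, 40), ("abinitio", 50, 60),
    ]
    print(weighted_consensus(70, evidence, weights, threshold=5.0))
```

In this toy run, the region supported only by an ab-initio prediction falls below the threshold, while regions backed by transcript or protein alignments are retained.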

16 Evidence View
[Figure: evidence tracks. S. mansoni PASA assemblies; S. japonicum EST alignments; Genewise alignments (predictions); nr protein alignments; Caenorhabditis sp. protein alignments; Brugia malayi protein alignments]

