Gene prediction in flies ● Background ● Gene prediction pipeline ● Resources.

1 Gene prediction in flies ● Background ● Gene prediction pipeline ● Resources

2 Background

3 Genome quality

4 Genes in Drosophila melanogaster ● high gene density ● at least 20% with alternative transcripts ● can be nested  on the same strand  on different strands ● can be di-cistronic ● can involve trans-splicing  exons from a different strand

5 Gene prediction pipeline ● Gene prediction by homology  no ab-initio predictions  not using genomic alignments ● TBLASTN/Genewise process  quick genome scan to find putative gene containing regions  aligning peptide sequence to genomic fragment using a gene model ● cds ● introns ● splice-sites

6

7 Sensitivity – Selectivity – Speed ● Genome scan  strict trade-off between ● sensitivity versus memory/time ● Transcript prediction  t = O(MN) ● N: length of peptide sequence = quite short ● M: length of DNA sequence = large  you want to minimize ● the length of the genomic sequence to search ● the number of fragments you align
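The O(MN) cost above suggests why the genome scan matters: aligning a peptide only against padded, merged hit regions keeps M small. The following is a minimal sketch of that idea, not the pipeline's actual code. The function names, the half-open interval convention, and the `padding` parameter are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a half-open genomic interval (start, end).
Interval = Tuple[int, int]


def merge_hits(hits: List[Interval], padding: int) -> List[Interval]:
    """Pad each scan hit and merge overlapping hits into candidate regions."""
    padded = sorted((max(0, start - padding), end + padding) for start, end in hits)
    regions: List[Interval] = []
    for start, end in padded:
        if regions and start <= regions[-1][1]:
            # Overlaps (or touches) the previous region: extend it.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
        else:
            regions.append((start, end))
    return regions


def alignment_cells(regions: List[Interval], peptide_length: int) -> int:
    """Rough O(MN) cost: DNA length M of each region times peptide length N."""
    return sum((end - start) * peptide_length for start, end in regions)


hits = [(1000, 1100), (1150, 1300), (50000, 50200)]
regions = merge_hits(hits, padding=100)
restricted = alignment_cells(regions, peptide_length=300)
whole_contig = alignment_cells([(0, 100000)], peptide_length=300)
```

In this toy case the two nearby hits collapse into one region, so both the total DNA length and the number of fragments to align go down compared with aligning against the whole contig.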

8 Solutions ● ENSEMBL: Minigenes  cut out putative introns ● My pipeline:  priority lists  gene structure conservation

9 Difficulties ● Terminal exons  short and thus alignment signal is weak ● Spindly genes  there is no length penalty on introns

10 Concepts ● Predict in three passes 1)Predict clear cut cases 2)Predict dubious cases  only if they don't overlap with a previous prediction 3)Predict alternative transcripts ● Iteratively search for duplications ● Accept a prediction with conserved exon boundaries
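The first two passes above can be sketched as a simple selection rule: accept clear-cut predictions first, then accept a dubious one only if it does not overlap anything already accepted. This is an illustrative sketch only. The `Prediction` class, the `clear_cut` flag, and the choice to test overlap per contig and strand are assumptions, and the third pass (alternative transcripts) is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one predicted gene (half-open coordinates)."""
    name: str
    contig: str
    strand: str
    start: int
    end: int
    clear_cut: bool


def overlaps(a: Prediction, b: Prediction) -> bool:
    # Assumption: only predictions on the same contig and strand conflict.
    return (a.contig == b.contig and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def select_predictions(candidates: List[Prediction]) -> List[Prediction]:
    # Pass 1: accept every clear-cut case.
    accepted = [p for p in candidates if p.clear_cut]
    # Pass 2: accept dubious cases only if they overlap nothing accepted so far.
    for p in candidates:
        if not p.clear_cut and not any(overlaps(p, q) for q in accepted):
            accepted.append(p)
    return accepted


candidates = [
    Prediction("A", "chr2L", "+", 100, 500, clear_cut=True),
    Prediction("B", "chr2L", "+", 400, 700, clear_cut=False),
    Prediction("C", "chr2L", "+", 800, 900, clear_cut=False),
    Prediction("D", "chr2L", "+", 850, 950, clear_cut=False),
]
selected = [p.name for p in select_predictions(candidates)]
```

Here B is dropped because it overlaps the clear-cut prediction A, and D is dropped because C was accepted before it.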

11 Conservation of gene structure Query Prediction Conserved Query Prediction Partially conserved Query Prediction Single exon Query Prediction Retrotransposed Query Prediction Unconserved (exon boundaries of query/prediction mapped on query protein)
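One way to read the five categories above is as a comparison of exon boundary positions once both gene models are mapped onto the query protein. The sketch below encodes that reading; the exact rules are an assumption and not taken from the slides. In particular, treating an intronless prediction of an intron-containing query as "retrotransposed", and the `tolerance` parameter, are illustrative choices.

```python
from typing import Sequence


def classify_structure(query_boundaries: Sequence[int],
                       prediction_boundaries: Sequence[int],
                       tolerance: int = 0) -> str:
    """Classify gene structure conservation.

    Boundaries are exon boundary positions in query-protein residue
    coordinates (assumed representation). An empty sequence means the
    gene model has a single exon.
    """
    if not prediction_boundaries:
        # Prediction has no introns.
        return "single exon" if not query_boundaries else "retrotransposed"
    if not query_boundaries:
        # Query has no introns but the prediction does: nothing to conserve.
        return "unconserved"
    matched = sum(
        1 for q in query_boundaries
        if any(abs(q - p) <= tolerance for p in prediction_boundaries)
    )
    if matched == len(query_boundaries) == len(prediction_boundaries):
        return "conserved"
    if matched > 0:
        return "partially conserved"
    return "unconserved"


labels = [
    classify_structure([50, 120], [50, 120]),
    classify_structure([50, 120], [50, 200]),
    classify_structure([], []),
    classify_structure([50, 120], []),
    classify_structure([50, 120], [70, 200]),
]
```

A non-zero `tolerance` would allow for small boundary shifts between query and prediction; whether the pipeline permits such shifts is not stated in the slides.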

12 Quality control ● Classify predictions into categories  Full length or fragment  Gene or pseudogene  Conserved or not conserved gene structure ● Heuristically remove predictions  that are redundant  that are in conflict ● nested genes ● good predictions take precedence over bad predictions

13 Results ● http://wwwfgu.anat.ox.ac.uk:8080/cgi-bin/gbrowse

14 Number of predicted genes

15 Orthology assignments Genes in D. melanogaster with orthologs

16 Technical details ● Hardware:  28 dual CPU nodes with 2 GB memory  Sun Grid Engine (SGE) ● Pipeline logic  gmake ● Tasks  Python scripts (and Perl scripts)  Bash/awk scripts ● Database  Postgres

17 Downstream analysis ● Pairwise orthology assignment  PhyOP Pipeline (Leo Goodstadt (2006)) ● Multiple orthology assignment  My own concoction based on graph clustering with some consistency criteria ● Multiple alignment of cds  Dialign (<50 sequences)  Muscle (<500 sequences)

18 Phylogenetic analysis ● 14,000 GBlocks cleaned multiple alignments ● Calculation of ka and ks with PAML ● Phylogenetic trees  Genome trees  Gene trees  built with Fitch/Kitsch

19 Odds and bits ● Mapping of Pdb -> Uniprot -> dmel proteins ● Mapping of Interpro domains onto predictions  not up-to-date ● Codon bias analysis  ENC, CAI, information theoretic measures  GC3, GC3_4D
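Of the codon bias measures listed above, GC3 is the simplest to state: the fraction of codons whose third position is G or C. The following is a minimal sketch under that definition. The function name is hypothetical, stop codons are not treated specially, and the GC3_4D variant is not covered.

```python
def gc3(cds: str) -> float:
    """Fraction of codons in a coding sequence with G or C at the third position."""
    seq = cds.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    third_positions = seq[2::3]  # every third base, starting at codon position 3
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)


# Codons: ATG GCC AAA TTC -> third positions G, C, A, C
example = gc3("ATGGCCAAATTC")
```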

20 Comparison of measures Experimental CAI Computational CAI ENC GC3 Encoding | bias Encoding | unbiased Encoding | uniform Ribosomal CAI

21 Other groups ● see http://rana.lbl.gov/drosophila/wiki/index.php/Main_Page ● Gene predictions by others  Don Gilbert: SNAP  Lior Pachter: GeneMapper (genomic alignments)  Eisen Lab : TBLastN + Genewise/Exonerate, GeneMapper  Batzoglou Lab: CONTRAST  Brent Lab: N-Scan  Guigo: geneid and SGP2

22 http://insects.eugenes.org/species/news/genome-summaries/genepredictions.html

23 Consensus predictions ● Gbrowser comparison of all gene predictions  http://rana.lbl.gov/drosophila/gbrowse/cgi-bin/gbrowse ● Mike Eisen's group: GLEAN consensus set ● Don Gilbert: http://insects.eugenes.org/species/ ● Other resources  tRNA predictions  genome alignments

