Survey of Misannotations and Pseudogenes in the Arabidopsis Genome
Tanmay Prakash

Objectives
Find possible misannotations
Find possible pseudogenes

Why
Misannotation can hinder research
Pseudogenes can be used to study natural selection

Misannotations
[Figure: gene model with CDS, intron, and UTR regions labeled]
Many misannotations arise when a gene prediction program labels a region as an intron because it contains a stop codon.

Pseudogenes
Pseudogenes are DNA sequences that no longer function but resemble the functional genes they once were. There are two types:
Processed: formed by retrotransposition; these comprise most of the pseudogenes in mammals.
Non-processed: products of duplication of the entirety or a portion of a segment of genes, followed by mutations. Because polyploidization (the process of gaining extra sets of chromosomes) is common in plants, the majority of pseudogenes in plants are non-processed.

Common Properties of Pseudogenes
Stop codons
Frameshift mutations
Lack of selective pressure

Lack of selective pressure is measured using Ka/Ks, where Ka is the rate of nonsynonymous substitutions and Ks is the rate of synonymous substitutions. Functional genes are under selection against amino acid changes, so their substitutions are mostly synonymous and Ka/Ks is significantly less than one. Pseudogenes are no longer under that constraint, so their Ka/Ks is closer to one.

Because pseudogenes have these stop codons and frameshift mutations, gene prediction programs often misannotate them. For example, deleting a single base shifts the reading frame and introduces stop codons (shown as dots):
agtacatgcataggactcgatcgactc    STCIGLDRL
agtacatgataggactcgatcgactc     ST..DSID
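The two translations on this slide can be reproduced with a short Python sketch. It assumes the standard genetic code and reading frame 1, and it prints stop codons as `*` where the slide uses dots. The function name `translate` is only for illustration.

```python
# Standard genetic code, built from the conventional t, c, a, g ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate reading frame 1 of a DNA string; '*' marks a stop codon."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


intact = "agtacatgcataggactcgatcgactc"
deleted = "agtacatgataggactcgatcgactc"  # the same sequence with one 'c' removed

print(translate(intact))   # STCIGLDRL
print(translate(deleted))  # ST**DSID
```

Everything after the deletion is read in a shifted frame, which is why the second peptide has stop codons and different residues downstream.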

Pipeline
[Flowchart]
Query: protein domains
BLAST search against Arabidopsis introns -> genes matching in introns
HMMER search against Arabidopsis CDS -> genes matching in CDS
Genes matching in both -> possibly misannotated
Check for stop codons, check for frameshifts, check Ka/Ks -> possible pseudogenes

[Diagram: query protein domains -> BLAST search against subject Arabidopsis introns -> genes matching in introns]
[Diagram: query protein domains -> HMMER search against subject Arabidopsis CDS -> genes matching in exons]

Each of the 8296 protein domain families is searched with BLAST against the introns of the genes in the Arabidopsis genome. This finds any introns that match a protein domain.
The same is done for the coding sequence of the Arabidopsis genome, but using a HMMER search, which finds matches to any domain in the coding sequence. HMMER would also have been used for the intron search, but it would have taken far too long.
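As a rough illustration of how hits from the intron search could be collected per gene, here is a minimal Python sketch. It assumes the BLAST results were saved as 12-column tab-separated output with the domain as the query. The subject ID format `<gene>|intron<N>`, the E-value cutoff, and the function name are hypothetical; the slides do not state them. The two example lines are made up for the demonstration.

```python
from collections import defaultdict


def domains_by_gene(tabular_lines, max_evalue=1e-5):
    """Map gene ID -> set of domain IDs that hit one of its introns.

    Assumes 12-column tab-separated BLAST output (query ID in column 1,
    subject ID in column 2, E-value in column 11) and subject IDs of the
    hypothetical form '<gene>|intron<N>'. The cutoff is an assumption.
    """
    hits = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        domain_id, subject_id = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            gene_id = subject_id.split("|")[0]
            hits[gene_id].add(domain_id)
    return dict(hits)


# Made-up example lines, not real search results.
example = [
    "DOMAIN_A\tGENE1|intron2\t45.0\t60\t33\t0\t1\t60\t10\t189\t1e-12\t70.1",
    "DOMAIN_A\tGENE2|intron1\t30.0\t40\t28\t0\t1\t40\t5\t124\t0.5\t25.0",
]
intron_matches = domains_by_gene(example)
print(intron_matches)
```

The same gene-to-domains mapping could be built from the HMMER results for the coding sequence, which makes the two searches easy to compare afterwards.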

[Diagram: genes matching in both -> possibly misannotated genes]
Genes that do not have matches to the same domain in both the introns and the coding sequence are filtered out. The remaining genes are possibly misannotated. They were then filtered further, keeping only the genes with matches to the same domain in an intron and its flanking exons. These introns will be checked for stop codons and frameshift mutations, and the Ka/Ks value will also be checked. This information will be used to identify pseudogenes.
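The two filtering steps can be sketched in Python as set operations. The data layout, the function names, and the gene and domain IDs are hypothetical. The sketch also takes one reading of "an intron and its flanking exons": the same domain must match the intron and both neighbouring exons. The slides do not say whether one flanking exon would be enough.

```python
def possibly_misannotated(intron_hits, cds_hits):
    """Keep genes with a match to the same domain in introns and in CDS.

    Both arguments map gene ID -> set of domain IDs (assumed layout).
    """
    shared = {}
    for gene, intron_domains in intron_hits.items():
        common = intron_domains & cds_hits.get(gene, set())
        if common:
            shared[gene] = common
    return shared


def intron_with_flanking_exons(intron_hits, exon_hits):
    """Find (gene, intron number, domain) where the domain also matches
    both flanking exons.

    Assumed layout: intron_hits[gene][k] and exon_hits[gene][k] are sets of
    domain IDs, and intron k lies between exon k and exon k + 1.
    """
    found = []
    for gene, introns in intron_hits.items():
        exons = exon_hits.get(gene, {})
        for k, domains in sorted(introns.items()):
            flanking = exons.get(k, set()) & exons.get(k + 1, set())
            for domain in sorted(domains & flanking):
                found.append((gene, k, domain))
    return found


# Made-up example data.
intron_domains = {"GENE1": {"DOMAIN_A"}, "GENE2": {"DOMAIN_B"}}
cds_domains = {"GENE1": {"DOMAIN_A", "DOMAIN_C"}, "GENE2": {"DOMAIN_C"}}
candidates = possibly_misannotated(intron_domains, cds_domains)

per_intron = {"GENE1": {2: {"DOMAIN_A"}}}
per_exon = {"GENE1": {2: {"DOMAIN_A"}, 3: {"DOMAIN_A"}}}
flanked = intron_with_flanking_exons(per_intron, per_exon)

print(candidates)
print(flanked)
```

Replacing the intersection of the two flanking exon sets with their union would give the looser reading, where a match in either neighbouring exon is enough.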

Results
There were 346 genes (different models not included) that had matches to the same domain in the introns and exons.
There were 299 genes (different models not included) that had matches to the same domain in an intron and its flanking exons. These are most likely misannotations.

[Figure: the 4 domains with the most possible misannotations]

Future Research
Identify pseudogenes by looking for stop codons and frameshift mutations in the introns and by checking the Ka/Ks value
Use a more recent database of domains
Follow the same process for the rice genome

Acknowledgement
Dr. Shin-Han Shiu
Dr. Kosuke Hanada
Dr. Melissa Lehti-Shiu
Dr. Gail Richmond
HSHSP
