Evaluating classifiers for disease gene discovery

Lon Turnbull and Kino Coursey
University of North Texas

Introduction

Determining that a gene is probably involved in a genetic disease is an important bioinformatics task. In this poster we extend the work of the PROSPECTR project on defining classifiers that estimate how likely a candidate region is to cause a genetic disease. Whereas that work focused only on limited-depth decision trees, we examine the other classifiers in the Weka machine learning tool set and determine the quality and accuracy of each. Improving the accuracy of this classification task allows high-probability sites to be given priority when searching for disease genes.

Classifiers

1. ADTree: Alternating decision tree, optimized for two-class problems.
2. J48: A variant of the C4.5 decision tree induction algorithm.
3. Logistic: Linear logistic regression, a form of regression for classification.
4. SMO: Sequential Minimal Optimization algorithm for training a support vector classifier.
5. Naïve Bayes: Standard probabilistic Naïve Bayes classifier.
6. IBk: K-nearest neighbor classifier (k=5). The class is chosen as the majority class of the 5 nearest training instances under a distance metric.
7. PART: Obtains rules from partial decision trees built using C4.5 heuristics.

Discussion

Classifier 2 (J48) is clearly the winner, as it correctly classified 88.75% of the genes. In most cases, the reduced feature set produced fewer correct classifications. This might mean either that there is a methodological problem or that the apparently optimal features stand out only because of an anomalous match in the data set. The problem is underlined by the fact that the best classifier showed the largest change when the reduced features were analyzed.

Classifications analysis of correct responses

There are two sets of data because we also performed an analysis using a reduced set of features, restricted to the more prominent ones identified by the PROSPECTR results.
The bar graphs show the number of genes placed in each classification. The red column counts disease genes correctly classified as disease genes, and the purple column counts non-disease genes correctly classified as non-disease genes. The two middle columns are the mismatches: blue is disease genes classified as non-disease, and green is non-disease genes misclassified as disease. The least desired cases are the green columns.

There are two obvious methods of picking a good classifier:

1. Since there are equal numbers of genes of each type in the samples, ideally the red and purple columns ought to be the same height. The greater the difference, the worse the classifier.
2. The classifier with the fewest mismatches can also be considered a better choice.

By these criteria, for the full feature set, classifier 2 is the best classifier for the first two data sets, and classifier 1 is a little better than classifier 2 for the oligogenic set. Unexpectedly, a different set of classifiers predominates with the reduced feature set: classifier 4 is best for the first two data sets and classifier 3 is better for the last data set.

Note well: in most cases the number of mismatches is relatively large. There was no improvement with the reduced feature set, which means that, despite their statistical significance, the features that showed the largest differences were most likely a statistical anomaly.

Data used for testing

Training set: 1,084 genes known to be associated with a disease and 1,084 genes not known to be associated with disease.
HGMD (independent test set 1): 675 disease genes listed in the Human Gene Mutation Database (HGMD) and 675 genes not known to be involved in disease.
Oligogenic (independent test set 2): A set based on oligogenic disorders, containing 54 genes known to be associated with an oligogenic disorder and 54 genes not known to be associated with genetic disease.
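With balanced test sets like those described above, the two selection criteria can be computed directly from the four column heights of a bar graph. The following is a minimal Python sketch, not the procedure used for the poster: the `Counts` class, the `rank` function, and the choice to order by mismatches first and column imbalance second are assumptions made for illustration, and the example counts are invented for a 675 + 675 test set rather than taken from the results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Counts:
    """Four bar heights for one classifier on one test set."""
    disease_as_disease: int  # red column: disease genes classified correctly
    disease_as_non: int      # blue column: disease genes missed
    non_as_disease: int      # green column: non-disease genes flagged as disease
    non_as_non: int          # purple column: non-disease genes classified correctly

    def imbalance(self) -> int:
        # Criterion 1: on a balanced set, red and purple should be equal.
        return abs(self.disease_as_disease - self.non_as_non)

    def mismatches(self) -> int:
        # Criterion 2: total of the two middle (blue and green) columns.
        return self.disease_as_non + self.non_as_disease

    def percent_correct(self) -> float:
        total = (self.disease_as_disease + self.disease_as_non
                 + self.non_as_disease + self.non_as_non)
        return 100.0 * (self.disease_as_disease + self.non_as_non) / total


def rank(results: Dict[str, Counts]) -> List[str]:
    """Order classifiers by fewest mismatches, then by smallest imbalance.

    Combining the criteria this way is an assumption; the poster lists
    them as two separate methods without saying how to combine them.
    """
    return sorted(
        results,
        key=lambda name: (results[name].mismatches(), results[name].imbalance()),
    )


# Hypothetical counts for a balanced 675 + 675 test set (NOT the poster's data).
example = {
    "ADTree": Counts(480, 195, 210, 465),
    "J48": Counts(500, 175, 180, 495),
    "SMO": Counts(430, 245, 160, 515),
}

if __name__ == "__main__":
    for name in rank(example):
        c = example[name]
        print(f"{name}: mismatches={c.mismatches()}, "
              f"imbalance={c.imbalance()}, correct={c.percent_correct():.2f}%")
```

Here `rank(example)[0]` is the classifier preferred under this particular ordering; swapping the two elements of the sort key would give priority to criterion 1 instead.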
Hypothesis

It has been suggested that genes with some relationship to hereditary disease might share common variations in their DNA sequence structure. A research group [1] used the alternating decision tree algorithm from Weka to test this hypothesis. They used 24 distinct features to test about 18,000 genes not known to be involved in disease and the 1,084 Ensembl [2] genes also listed in OMIM. On average, 70% of the disease genes were correctly identified by their automatic classifier, which they called PROSPECTR. Can we do better with other methods of classification?

We expected that focusing on a set of features that were highly predictive of disease genes would tend to enhance the results, that is, decrease the number of misclassified genes. This did not happen.

[Figure: Results for all features]

[Figure: Results for the reduced set of features]

Conclusions

We have shown that classifier 2 performs better than classifier 1, the one chosen by the PROSPECTR method. Either the reduced features are not a well-chosen subset for these analysis methods, or using these machine learning methods to classify disease genes is not very productive.

PROSPECTR results

The PROSPECTR authors [1] tested a number of DNA features in an attempt to find differences between disease genes and non-disease genes. The table gives, for the 9 of the 24 features that had statistically significant differences, the ratio of the median in a disease set to the median in a control set. The larger the ratio, the greater the dependence.

Gene encodes signal peptide
Gene length
5' CpG islands
Protein length
Exon number
cDNA length
Distance to neighboring gene
3' UTR length
Protein identity with BRH in mouse 1.09

What are disease genes?

Disease genes are genes that have been mutated so that the body, or some part of the body, no longer functions correctly. Most of the more than 100 known genetic disorders are the direct result of a mutation in one gene. It is much more difficult to find the basis of diseases with a complex pattern of inheritance, where more than one gene must be mutated before a susceptibility to the disease is expressed.

References

[1] Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics 2005, 6:55.
[2] Hammond MP, Birney E: Genome information resources - developments at Ensembl. Trends in Genetics 2004, 20:

Contact

For more information contact: Armin R. Mikler, University of North Texas
Web:

Biocomputing Fall 2005 CSCD /CSCE
