TOWARDS PRACTICAL GENRE CLASSIFICATION FOR THE WEB
George Ferizis and Peter Bailey
CSIRO ICT Centre
Authors: George Ferizis (firstname.lastname@example.org), Peter Bailey (email@example.com)

Introduction

Many classification methods apply statistical techniques to a set of features obtained from the data to derive a function that can differentiate between classes. Genre classification usually follows this approach, using either term frequency features or features obtained by Part-Of-Speech (POS) tagging the documents. While features obtained from POS tagging have resulted in good accuracy, the speed of POS tagging systems is unsuitable for time-critical applications.

Stage                        Time (s)   Percentage of time (%)
Extraction of variables      13         2.8
POS tagging                  450        97.2
Application of classifier    0.1        0.02

Table 1: The time spent in each phase during the classification of 1000 documents.

Table 1 shows the amount of time spent in each phase during the classification of 1000 documents. 97% of the time spent classifying can be attributed to the POS tagger. The results in table 1 also show that, at roughly 463 seconds per 1000 documents, it would take over 5 days to classify a corpus containing 1,000,000 documents.

Method

Because POS tagging is such a slow process, some POS features that are critical to the performance of the classifier are approximated using heuristics. These features are:
  - Adverbs
  - Present participles
  - Personal pronouns
  - A restricted set of determiners

The classifier also uses other simple features that can be determined quickly from the text of the document, such as average word length, the number of long words in the document and average sentence length.

Results

Algorithm        Throughput (documents/second)
POS features     2.1
Term frequency   145
Approximating    238

Table 2: The number of documents classified per second by each method, including the time each method requires to generate and analyse the necessary features. Two orders of magnitude of improvement can be gained by approximating POS features.
This reduces the time required to classify a corpus of 1,000,000 documents from over 5 days to under 2 hours.

A comparison of the number of documents that each method classifies per second shows that the term frequency and approximation approaches are two orders of magnitude quicker than the POS approach (table 2). Features derived from approximating POS tags give similar accuracy to the POS tags themselves, and are more accurate than features from a term frequency approach (figure 1).

Figure 1: The number of documents classified per second by each method. This includes the time each method requires to generate and analyse the necessary features. Two orders of magnitude of improvement can be gained by approximating POS features.

Experiment

These features were compared to two other sets of features for the genre classification problem:
  - POS features
  - Term frequency features

Two experiments were run to compare these features:
  - A comparison of the throughput of each method
  - A comparison of the classification accuracy of each method

The genres that were used during classification were:
  - Newspaper editorial
  - Newspaper reportage
  - Scientific articles
  - Speeches

             Editorial     Reportage     Scientific    Speeches
Editorial    95.0 / 87.8   4.8 / 10.8    0 / 0         0.2 / 1.4
Reportage    3.9 / 5.1     96.1 / 94.9   0 / 0         0 / 0
Scientific   0 / 0         0 / 0         95.2 / 89.6   4.8 / 10.4
Speeches     6.0 / 4.2     0.8 / 4.9     8.1 / 7.9     85.1 / 83.0

Table 3: The confusion matrices for the POS feature approach (first value in each cell; darker triangular cells in the original) and the approximating approach (second value; lighter triangular cells in the original). The matrices show that both methods confused documents between genres in a similar way, although with different magnitudes of confusion.

A comparison of the confusion matrices for the POS features and the approximated POS features shows that both methods confuse similar genres with each other (table 3). The confusion matrix shows the percentage of documents of genre A (corresponding to the row) that are classified as genre B (corresponding to the column).
The values in each row add up to 100%.

Conclusions

  - POS tagging is too slow for collections with millions of documents.
  - Approximating some POS tags reduces the time required to extract classification features from a corpus by two orders of magnitude.
  - Approximating the POS tags that are used as features results in a loss of 1-2% in classification accuracy.
  - The accuracy of classification when using approximated POS tags as features is still higher than when using term frequency features.
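As a rough illustration of the Method section, the sketch below computes the simple surface features and approximates the four POS-derived counts without running a tagger. The poster does not say which heuristics were used, so the suffix rules (-ly, -ing), the word lists, the long-word threshold and the function name extract_features are all assumptions made for this example, not the authors' implementation.

```python
import re

# Hypothetical word lists: the poster does not list the words it used.
PERSONAL_PRONOUNS = {"i", "me", "we", "us", "you", "he", "him",
                     "she", "her", "it", "they", "them"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
LONG_WORD_MIN_LENGTH = 7  # assumed threshold for a "long" word

FEATURE_NAMES = ("avg_word_length", "long_words", "avg_sentence_length",
                 "adverbs", "present_participles", "personal_pronouns",
                 "determiners")


def extract_features(text):
    """Return cheap surface features for one document as a dict."""
    # Crude sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    n_words = len(words)
    if n_words == 0:
        return {name: 0 for name in FEATURE_NAMES}
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "long_words": sum(1 for w in words if len(w) >= LONG_WORD_MIN_LENGTH),
        "avg_sentence_length": n_words / len(sentences),
        # Assumed heuristic: words ending in "ly" are counted as adverbs.
        "adverbs": sum(1 for w in words if w.endswith("ly") and len(w) > 3),
        # Assumed heuristic: words ending in "ing" are counted as participles.
        "present_participles": sum(
            1 for w in words if w.endswith("ing") and len(w) > 4),
        "personal_pronouns": sum(1 for w in words if w in PERSONAL_PRONOUNS),
        "determiners": sum(1 for w in words if w in DETERMINERS),
    }


if __name__ == "__main__":
    sample = "She was quickly running to the station. They really like it."
    for name, value in extract_features(sample).items():
        print(name, value)
```

Every feature here comes from a single pass over the tokens, so a function of this kind skips the tagging stage that dominates the timings in Table 1. Suffix rules will miscount some words (for example "family" or "ring"), which is one plausible source of the small accuracy loss reported for approximated features.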