
1 Towards Separating Trigram-generated and Real Sentences with SVM. Jerry Zhu, CALD KDD Lab, 2001/4/20

2 Domain: Speech Recognition
A large portion of errors is due to over-generation by trigram language models. If we can detect trigram-generated sentences, we can improve accuracy.
Example trigram-generated text: "when the director of the many of your father and so the the monster and here is obviously a very very profitable business his views on today's seek level thanks very much for being with us"
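
The over-generation illustrated above comes from sampling a sentence word by word from trigram statistics: each three-word window looks fluent, but the sentence as a whole drifts. A minimal sketch of such a sampler, assuming a hypothetical `trigram_probs` table mapping a bigram history to next-word probabilities (not something given in the slides):

```python
import random

def sample_sentence(trigram_probs, max_len=25, bos="<s>", eos="</s>"):
    """Sample one sentence by repeatedly drawing w ~ P(w | w1, w2)."""
    history = (bos, bos)
    words = []
    for _ in range(max_len):
        dist = trigram_probs.get(history)   # {next_word: probability}
        if not dist:
            break
        next_word = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_word == eos:
            break
        words.append(next_word)
        history = (history[1], next_word)   # slide the bigram history forward
    return " ".join(words)
```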

3 A two-class classification problem
'fake' (trigram-generated) or real sentence?
Data: 100k fake and 100k real long (> 7 words) sentences.
'fake' sentences don't look right (bad syntax) and don't make sense (bad semantics).
Boils down to finding good features. Semantic coherence has been explored [Eneva et al.], but syntactic features, and their combination with semantic ones, have not.
SVM margin for probabilities.
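
As a rough illustration of the two-class setup, here is a minimal sketch using scikit-learn's LinearSVC; the library is an assumption for illustration only (it postdates this work), and `X_fake` / `X_real` stand for feature matrices built from the 100k fake and 100k real sentences:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_fake_vs_real(X_fake, X_real):
    """Train a linear SVM to label sentences as fake (0) or real (1)."""
    X = np.vstack([X_fake, X_real])
    y = np.array([0] * len(X_fake) + [1] * len(X_real))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LinearSVC().fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf
```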

4 Previous work: semantic features
Around 70 semantic features, most interestingly:
Content word co-occurrence statistics
Content word repetition
Decision tree + boosting: around 80% accuracy.
We hope that adding syntactic features will significantly improve accuracy.
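
For concreteness, the content-word repetition feature named above could be computed along these lines; this is a sketch under assumptions, since the slides give neither the stopword list nor the exact definition:

```python
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}  # stand-in list

def content_word_repetition(tokens):
    """Fraction of content-word tokens that repeat an earlier content word."""
    content = [t.lower() for t in tokens if t.lower() not in FUNCTION_WORDS]
    if not content:
        return 0.0
    return (len(content) - len(set(content))) / len(content)
```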

5 Exploring syntactic features
Bag-of-words features (raw counts, frequency, or binary; linear or polynomial kernel): 57%
Tag with part of speech (39 POS tags): when/WRB the/DT director/NN of/IN the/DT many/NN of/IN your/PRP$ father/NN
Bag-of-POS: 56%
Sparse sequences of POS: any k POS tags occurring in that order, weighted by the span they cover; 39^k possible features, e.g. WRB-IN-DT (see the sketch below).
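
A hedged sketch of the sparse-sequence feature: every ordered k-tuple of POS tags in a sentence is a feature, down-weighted by the span it covers. The exponential decay `decay ** span` is an assumption; the slide only says the features are weighted by the span:

```python
from collections import defaultdict
from itertools import combinations

def sparse_pos_features(pos_tags, k=3, max_span=8, decay=0.5):
    """Map each ordered k-tuple of POS tags (e.g. 'WRB-IN-DT') to a span-weighted count."""
    feats = defaultdict(float)
    for idx in combinations(range(len(pos_tags)), k):
        span = idx[-1] - idx[0] + 1          # number of tokens covered, inclusive
        if span > max_span:
            continue
        feats["-".join(pos_tags[i] for i in idx)] += decay ** span
    return feats
```

For the tagged example above, sparse_pos_features(["WRB", "DT", "NN", "IN", "DT", "NN", "IN", "PRP$", "NN"]) produces a WRB-IN-DT entry among many others; with 39 tags and k = 3 the full space has 39^3 ≈ 59k possible features.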

6 Exploring syntactic features (cont.)
Sparse sequences work on letters for text categorization, but on POS tags: 58% (k = 3, max span = 8)
Keep stopwords as words and replace the rest with their POS tags: WRB the NN of the many of your NN (see the sketch below)
Sparse sequences on stopwords & POS: 57%
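
The mixed representation above keeps stopwords as literal words and backs every other word off to its POS tag. A minimal sketch, with a stand-in stopword list chosen only so that the slide's example comes out right:

```python
STOPWORDS = {"the", "of", "many", "your"}   # stand-in; the actual list is not given

def stopword_pos_tokens(tagged):
    """tagged: list of (word, POS) pairs -> mixed stopword/POS token stream."""
    return [w if w.lower() in STOPWORDS else pos for w, pos in tagged]

tagged = [("when", "WRB"), ("the", "DT"), ("director", "NN"), ("of", "IN"),
          ("the", "DT"), ("many", "NN"), ("of", "IN"), ("your", "PRP$"),
          ("father", "NN")]
print(" ".join(stopword_pos_tokens(tagged)))   # WRB the NN of the many of your NN
```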

7 Exploring syntactic features (cont.)
Stopwords & POS 4-grams: novelty rate; count-distribution likelihood ratio; min, max, median, and mean counts
These combined with the semantic features: 75%
Semantic features alone: 77%
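
A sketch of how the 4-gram features could be computed against a background table of stopword & POS 4-gram counts; the exact definitions are assumptions based on the slide's wording, and the likelihood-ratio feature is omitted here:

```python
import statistics

def four_gram_features(tokens, background_counts):
    """Novelty rate and count statistics of a sentence's 4-grams."""
    grams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
    if not grams:
        return {}
    counts = [background_counts.get(g, 0) for g in grams]
    return {
        "novelty_rate": sum(c == 0 for c in counts) / len(counts),  # share never seen in background
        "min_count": min(counts),
        "max_count": max(counts),
        "median_count": statistics.median(counts),
        "mean_count": statistics.mean(counts),
    }
```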

8 SVM margin
Empirically, the margins have a 'good shape'.
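
Slide 3's "SVM margin for probabilities" presumably refers to mapping margins to class probabilities. One standard way to do this (an assumption here, not something the slides spell out) is Platt scaling, i.e. fitting a sigmoid to margins on held-out data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(margins, labels):
    """Fit P(real | margin) = sigmoid(a * margin + b) on held-out data."""
    lr = LogisticRegression()
    lr.fit(np.asarray(margins).reshape(-1, 1), labels)
    return lr   # lr.predict_proba(m.reshape(-1, 1))[:, 1] gives P(real)
```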

9 Summary
Now we know these features don't work…
SVM wasn't a wise choice given the large amount of data and the high level of noise…

10 Future?
Parsing
Logistic regression instead of SVM

