
1 Towards Separating Trigram-generated and Real Sentences with SVM. Jerry Zhu, CALD KDD Lab, 2001/4/20

2 Domain: Speech Recognition
A large portion of errors is due to over-generation by trigram language models. If we can detect trigram-generated sentences, we can improve accuracy.
Example trigram-generated text: "when the director of the many of your father and so the the monster and here is obviously a very very profitable business his views on today's seek level thanks very much for being with us"
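
The over-generation illustrated above comes from sampling a sentence word by word from trigram statistics: each three-word window looks fluent, but the sentence as a whole drifts. A minimal sketch of such a sampler, assuming a hypothetical `trigram_probs` table mapping a bigram history to next-word probabilities (not something given in the slides):

```python
import random

def sample_sentence(trigram_probs, max_len=25, bos="<s>", eos="</s>"):
    """Sample one sentence by repeatedly drawing w ~ P(w | w1, w2)."""
    history = (bos, bos)
    words = []
    for _ in range(max_len):
        dist = trigram_probs.get(history)   # {next_word: probability}
        if not dist:
            break
        next_word = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_word == eos:
            break
        words.append(next_word)
        history = (history[1], next_word)   # slide the bigram history forward
    return " ".join(words)
```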

3 A two-class classification problem
'fake' (trigram-generated) or real sentence?
Data: 100k fake and 100k real long (> 7 words) sentences.
'fake' sentences don't look right (bad syntax) and don't make sense (bad semantics).
Boils down to finding good features. Semantic coherence has been explored [Eneva et al.], but syntactic features, and their combination with semantic ones, have not.
SVM margin for probabilities.
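
As a rough illustration of the two-class setup, here is a minimal sketch using scikit-learn's LinearSVC; the library is an assumption for illustration only (it postdates this work), and `X_fake` / `X_real` stand for feature matrices built from the 100k fake and 100k real sentences:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_fake_vs_real(X_fake, X_real):
    """Train a linear SVM to label sentences as fake (0) or real (1)."""
    X = np.vstack([X_fake, X_real])
    y = np.array([0] * len(X_fake) + [1] * len(X_real))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LinearSVC().fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf
```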

4 Previous work: semantic features
Around 70 semantic features, most interestingly:
Content word co-occurrence statistics
Content word repetition
Decision tree + boosting: around 80% accuracy.
We hope that adding syntactic features will significantly improve accuracy.
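
For concreteness, the content-word repetition feature named above could be computed along these lines; this is a sketch under assumptions, since the slides give neither the stopword list nor the exact definition:

```python
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}  # stand-in list

def content_word_repetition(tokens):
    """Fraction of content-word tokens that repeat an earlier content word."""
    content = [t.lower() for t in tokens if t.lower() not in FUNCTION_WORDS]
    if not content:
        return 0.0
    return (len(content) - len(set(content))) / len(content)
```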

5 Exploring syntactic features
Bag-of-words features (raw counts, frequency, or binary; linear or polynomial kernel): 57%
Tag with part of speech (39 POS tags): when/WRB the/DT director/NN of/IN the/DT many/NN of/IN your/PRP$ father/NN
Bag-of-POS: 56%
Sparse sequences of POS: any k POS tags occurring in that order, weighted by the span they cover; 39^k possible features, e.g. WRB-IN-DT (see the sketch below).
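
A hedged sketch of the sparse-sequence feature: every ordered k-tuple of POS tags in a sentence is a feature, down-weighted by the span it covers. The exponential decay `decay ** span` is an assumption; the slide only says the features are weighted by the span:

```python
from collections import defaultdict
from itertools import combinations

def sparse_pos_features(pos_tags, k=3, max_span=8, decay=0.5):
    """Map each ordered k-tuple of POS tags (e.g. 'WRB-IN-DT') to a span-weighted count."""
    feats = defaultdict(float)
    for idx in combinations(range(len(pos_tags)), k):
        span = idx[-1] - idx[0] + 1          # number of tokens covered, inclusive
        if span > max_span:
            continue
        feats["-".join(pos_tags[i] for i in idx)] += decay ** span
    return feats
```

For the tagged example above, sparse_pos_features(["WRB", "DT", "NN", "IN", "DT", "NN", "IN", "PRP$", "NN"]) produces a WRB-IN-DT entry among many others; with 39 tags and k = 3 the full space has 39^3 ≈ 59k possible features.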

6 Exploring syntactic features (cont.)
Sparse sequences work on letters for text categorization, but on POS tags: 58% (k = 3, max span = 8)
Keep stopwords as words and replace the rest with their POS tags: WRB the NN of the many of your NN (see the sketch below)
Sparse sequences on stopwords & POS: 57%
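
The mixed representation above keeps stopwords as literal words and backs every other word off to its POS tag. A minimal sketch, with a stand-in stopword list chosen only so that the slide's example comes out right:

```python
STOPWORDS = {"the", "of", "many", "your"}   # stand-in; the actual list is not given

def stopword_pos_tokens(tagged):
    """tagged: list of (word, POS) pairs -> mixed stopword/POS token stream."""
    return [w if w.lower() in STOPWORDS else pos for w, pos in tagged]

tagged = [("when", "WRB"), ("the", "DT"), ("director", "NN"), ("of", "IN"),
          ("the", "DT"), ("many", "NN"), ("of", "IN"), ("your", "PRP$"),
          ("father", "NN")]
print(" ".join(stopword_pos_tokens(tagged)))   # WRB the NN of the many of your NN
```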

7 Exploring syntactic features (cont.)
Stopwords & POS 4-grams: novelty rate; count-distribution likelihood ratio; min, max, median, and mean counts
These combined with the semantic features: 75%
Semantic features alone: 77%
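
A sketch of how the 4-gram features could be computed against a background table of stopword & POS 4-gram counts; the exact definitions are assumptions based on the slide's wording, and the likelihood-ratio feature is omitted here:

```python
import statistics

def four_gram_features(tokens, background_counts):
    """Novelty rate and count statistics of a sentence's 4-grams."""
    grams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
    if not grams:
        return {}
    counts = [background_counts.get(g, 0) for g in grams]
    return {
        "novelty_rate": sum(c == 0 for c in counts) / len(counts),  # share never seen in background
        "min_count": min(counts),
        "max_count": max(counts),
        "median_count": statistics.median(counts),
        "mean_count": statistics.mean(counts),
    }
```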

8 SVM margin
Empirically, the margins have a 'good shape'.
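
Slide 3's "SVM margin for probabilities" presumably refers to mapping margins to class probabilities. One standard way to do this (an assumption here, not something the slides spell out) is Platt scaling, i.e. fitting a sigmoid to margins on held-out data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(margins, labels):
    """Fit P(real | margin) = sigmoid(a * margin + b) on held-out data."""
    lr = LogisticRegression()
    lr.fit(np.asarray(margins).reshape(-1, 1), labels)
    return lr   # lr.predict_proba(m.reshape(-1, 1))[:, 1] gives P(real)
```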

9 Summary
Now we know these features don't work…
SVM wasn't a wise choice given the large amount of data and the high level of noise…

10 Future?
Parsing
Logistic regression instead of SVM

