1 Statistical n-gram 4-12-2017 David Ling

2 Contents
- System update (speed and Google ngram)
- Next available corpora
- Papers related to using n-gram corpora:
  - Google Books N-gram Corpus used as a Grammar Checker (EACL 2012)
  - N-gram based Statistical Grammar Checker for Bangla and English (ICCIT 2006)
  - Correcting Serial Grammatical Errors based on N-grams and Syntax (CLCLP 2013)

3 System updates
- Sped up by using SQL (Wikipedia dump only)
- Added the Google Books ngram corpus

Corpus: Wikipedia dump 2007 (4 GB txt)
- Version 1 SQL: 11 GB disk space
- Version 2 SQL: 9 GB disk space

Corpus: Google ngram via online API
- Source: US or GB English books (1-5 grams)
- Far larger than the Wikipedia dump
- N-grams included only if frequency >= 40
- Searching online is slow, so the next step is to make it offline as a SQL database (a minimal sketch follows this slide)
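A minimal sketch of what such an offline store could look like, using Python's built-in sqlite3. The "gram<TAB>count" input format and the file name are assumptions for illustration, not the actual pipeline:

```python
# Sketch of an offline n-gram store, assuming a plain-text dump of
# "gram<TAB>count" lines (hypothetical format and file name).
import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ngrams (
        gram  TEXT PRIMARY KEY,   -- the space-joined n-gram
        count INTEGER NOT NULL    -- corpus frequency (rows with count >= 40)
    )
""")

def load_counts(path: str) -> None:
    """Bulk-insert tab-separated 'gram<TAB>count' lines into the table."""
    with open(path, encoding="utf-8") as fh:
        rows = ((g, int(c)) for g, c in
                (line.rstrip("\n").split("\t") for line in fh))
        conn.executemany("INSERT OR REPLACE INTO ngrams VALUES (?, ?)", rows)
    conn.commit()

def frequency(gram: str) -> int:
    """Return the stored frequency of a gram, 0 if unseen."""
    row = conn.execute("SELECT count FROM ngrams WHERE gram = ?",
                       (gram,)).fetchone()
    return row[0] if row else 0

# Example lookup once loaded: frequency("conducted in English")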

4 Student script 1 [detection results shown as images: Wikipedia 2007 vs. Google Books ngram]

5 Student script 2 [detection results shown as images: Wikipedia 2007 vs. Google Books ngram]

6 Student script 3 [detection results shown as images: Wikipedia 2007 vs. Google Books ngram]

7 Newspaper article [detection results shown as images: Wikipedia 2007 vs. Google Books ngram]
Still a lot of false positives. Two directions:
- Implement more corpora: the Common Crawl corpus
- Develop a language model for unseen n-grams

8 Next available corpora
- British National Corpus (FREE): only 500 MB in XML, so the corpus is very small; works written by native, expert speakers; needs some time to extract
- Web 1T 5-gram (LDC members): 1-5 gram counts from approximately 1 trillion word tokens of publicly accessible web pages (as of 2006)
- Common Crawl (FREE): web page archives over years; extremely large, several TB (as of 2017)

9 Google Books N-gram Corpus used as a Grammar Checker (Proceedings of the EACL 2012 Workshop)
- On Spanish
- 2 main tasks: grammatical error detection; toy grammar exercises (multiple choice, fill in the blanks, etc.)
- Method: break a sentence into a sequence of bigrams; flag an error if a bigram is not in the Google N-gram corpus and at least one of its words is not an unseen word
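A hedged sketch of that bigram rule (not the authors' code); `known_bigrams` and `vocabulary` are hypothetical stand-ins for the corpus lookups:

```python
# Flag a bigram as an error when it is absent from the n-gram corpus AND
# at least one of its words is known, so the miss cannot be blamed on an
# out-of-vocabulary word.
def detect_errors(sentence, known_bigrams, vocabulary):
    words = sentence.lower().split()
    flagged = []
    for w1, w2 in zip(words, words[1:]):
        if (w1, w2) not in known_bigrams and (w1 in vocabulary or w2 in vocabulary):
            flagged.append((w1, w2))
    return flagged

# Example with toy lookups:
# detect_errors("he have the book",
#               known_bigrams={("he", "has"), ("the", "book")},
#               vocabulary={"he", "have", "has", "the", "book"})
# -> [("he", "have"), ("have", "the")]
```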

10 Google Books N-gram Corpus used as a Grammar Checker (Proceedings of the EACL 2012 Workshop)
Experiment on error detection: 65 sentences, each containing one error (from non-native speakers). tp: true positive, fn: false negative, fp: false positive. [Result table shown as an image in the slides.]

11 N-gram based Statistical Grammar Checker for Bangla and English (Ninth International Conference on Computer and Information Technology, 2006)
- Uses 3-grams of POS tags from the Brown Corpus (1,014,312 words)
- If a POS trigram is not in the dictionary, the sentence is flagged as an error, e.g., "He have the book I want"
- The performance reporting is incomplete: 321 out of 866 correct sentences were flagged as incorrect (37% false positives), and performance on actual error detection was not shown
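A minimal sketch of the same idea using NLTK (my reconstruction, not the paper's implementation). Both the Brown corpus and the tagger are mapped to the coarse "universal" tagset so their tags are comparable:

```python
# Requires one-time downloads: nltk.download() for "brown", "punkt",
# "averaged_perceptron_tagger", and "universal_tagset".
import nltk
from nltk.corpus import brown

# Build the dictionary of POS trigrams attested in the Brown corpus.
seen_trigrams = set()
for sent in brown.tagged_sents(tagset="universal"):
    tags = [tag for _, tag in sent]
    seen_trigrams.update(zip(tags, tags[1:], tags[2:]))

def is_suspicious(sentence: str) -> bool:
    """Flag the sentence if any of its POS trigrams is unattested in Brown."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal")
    tags = [tag for _, tag in tagged]
    return any(tri not in seen_trigrams
               for tri in zip(tags, tags[1:], tags[2:]))

# Example: is_suspicious("He have the book I want")
```

Note that coarse POS trigrams are weak evidence: "He have" and "He has" receive the same PRON-VERB tags, which is one reason the paper's false-positive rate tells only part of the story.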

12 Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing 2013)
- Uses the Web 1T 5-gram corpus (1- to 5-grams)
- Corrections on prepositions only: noun-prep-verb (NPV), verb-prep-verb (VPV), adjective-prep-verb (APV), adverb-prep-verb (RPV)
- 2 translation models: TM_main (for seen trigrams) and TM_back-off (for unseen trigrams)

13 Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing 2013)
TM_main (all 3 words are seen). Example: correct to "accused of being" (V-P-V).
- Extract and count 3-grams from the corpus matching accuse (inflected) + any preposition + be (inflected), using lemmatized n-gram frequencies
- Non-majority trigrams are regarded as problematic
- The probability assigned to "accused of being" is its count divided by the total over all candidate prepositions ≈ 0.93 (count table shown as an image; a combined sketch of both models follows slide 14)

14 Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing 2013)
TM_back-off (for unseen trigrams). Example: correct to "accused of murdering" (V-P-V), where murder/murdered/murdering is unseen in the corpus.
- Extract and count 3-grams from the corpus matching accuse (inflected) + any preposition + any verb, using lemmatized n-gram frequencies
- For the input text "accused to murder", the probability assigned to "accused of murdering" under these back-off counts ≈ 0.55
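A hedged sketch of how the two scores could be computed from lemmatized trigram counts. The `counts` table is hypothetical (its numbers are chosen only so that P("of") ≈ 0.93, matching slide 13), and keys are (verb lemma, preposition, verb lemma):

```python
from collections import defaultdict

counts = {
    ("accuse", "of", "be"): 535,    # illustrative numbers only
    ("accuse", "for", "be"): 30,
    ("accuse", "with", "be"): 10,
}

def tm_main(v1: str, v2: str) -> dict:
    """P(prep | v1 _ v2) when all three words are seen in the corpus."""
    by_prep = {p: c for (a, p, b), c in counts.items() if a == v1 and b == v2}
    total = sum(by_prep.values())
    return {p: c / total for p, c in by_prep.items()} if total else {}

def tm_backoff(v1: str) -> dict:
    """Back-off for an unseen second verb: pool counts over ANY second verb."""
    by_prep = defaultdict(int)
    for (a, p, _), c in counts.items():
        if a == v1:
            by_prep[p] += c
    total = sum(by_prep.values())
    return {p: c / total for p, c in by_prep.items()} if total else {}

# tm_main("accuse", "be")  -> {"of": ~0.93, "for": ~0.05, "with": ~0.02}
# tm_backoff("accuse")     -> distribution over prepositions for accuse + P + V
```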

15 Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing 2013)
Experiment:
- CLC First Certificate Exam dataset (LRN): 118 sentences with the targeted error types; other errors were pre-corrected (incorrect samples)
- British National Corpus (BNC): 1000 random sentences (correct samples)
[Accuracy and false-positive figures, with and without back-off, were shown as images in the slides.]

16 Conclusions
At this stage, we focus first on detection and analysis of students' scripts:
- Collect and implement various corpora
- 3-gram preposition correction (for analysis)
- Develop a language model
So far we have found no general way to address all English errors (except the seq2seq translation approach).

17 Language model proposed to try
Target: approximate the frequency of unseen n-grams.
Proposed method: word2vec representations fed into a neural network.
Suppose the 3-gram "conducted in English" occurs in the corpus 10 times, while "conducted in Chinese", "conducted in Mandarin", and "conducted in French" do not. Since Chinese, Mandarin, and French have similar vector representations, I hypothesize that the outputs of the neural network function f are close (f is continuous): if x_3 ≈ x_4, then f(x_1, x_2, x_3) ≈ f(x_1, x_2, x_4).
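A minimal sketch of what this could look like in PyTorch: a small network mapping the three word vectors of a trigram to a log-frequency estimate. The architecture, dimensionality, and training pairs are all assumptions; the slide only proposes the idea:

```python
import torch
import torch.nn as nn

DIM = 100  # assumed word2vec dimensionality (hypothetical)

model = nn.Sequential(
    nn.Linear(3 * DIM, 128),  # concatenated vectors (x1, x2, x3) of the trigram
    nn.ReLU(),
    nn.Linear(128, 1),        # predicted log(1 + frequency)
)

def predict_log_freq(x1, x2, x3):
    """f(x1, x2, x3): continuous in its inputs, so nearby word vectors
    (e.g. 'Chinese' vs. 'French') give nearby predicted frequencies."""
    return model(torch.cat([x1, x2, x3], dim=-1))

# Training would regress on observed counts, e.g. for "conducted in English" = 10:
# loss = nn.functional.mse_loss(predict_log_freq(x1, x2, x3),
#                               torch.log1p(torch.tensor([10.0])))
```

The continuity argument is exactly what the regression buys: an unseen trigram like "conducted in Mandarin" inherits an estimate from seen neighbors in embedding space rather than getting frequency zero.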

