Statistical n-gram, 4-12-2017, David Ling



Contents
- System update (speed and Google n-gram)
- Next available corpora
- Papers related to using n-gram corpora:
  - Google Books N-gram Corpus used as a Grammar Checker (EACL 2012)
  - N-gram based Statistical Grammar Checker for Bangla and English (ICCIT 2006)
  - Correcting Serial Grammatical Errors based on N-grams and Syntax (CLCLP 2013)

System updates
- Sped up by using SQL (Wikipedia dump only)
- Added the Google Books n-gram corpus

Corpus details:
- Wikipedia dump 2007 (4 GB of text)
  - Version 1 SQL: 11 GB of disk space
  - Version 2 SQL: 9 GB of disk space
- Google n-gram via the online API
  - Source: US or GB English books (1- to 5-grams)
  - Far larger than the Wikipedia dump
  - N-grams included only if frequency >= 40
  - Searching online is slow, so the next step is to make it offline as a SQL database
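A minimal sketch of what the offline SQL store could look like, assuming a single table keyed by the n-gram string; the table and function names are illustrative, not the actual system's schema.

```python
# Toy offline n-gram frequency store using SQLite.
# ngrams(gram TEXT PRIMARY KEY, freq INTEGER) is an assumed schema.
import sqlite3

def build_store(path, counts):
    """counts: iterable of (ngram_string, frequency) pairs."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS ngrams (gram TEXT PRIMARY KEY, freq INTEGER)"
    )
    con.executemany("INSERT OR REPLACE INTO ngrams VALUES (?, ?)", counts)
    con.commit()
    return con

def freq(con, gram):
    """Return the stored frequency of an n-gram, or 0 if unseen."""
    row = con.execute(
        "SELECT freq FROM ngrams WHERE gram = ?", (gram,)
    ).fetchone()
    return row[0] if row else 0

con = build_store(":memory:", [("conducted in english", 10)])
print(freq(con, "conducted in english"))  # 10
print(freq(con, "conducted in klingon"))  # 0
```

An indexed primary-key lookup like this is what makes the offline version much faster than querying the online API per n-gram.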

Student script 1: Wikipedia 2007 vs. Google Books n-gram

Student script 2: Wikipedia 2007 vs. Google Books n-gram

Student script 3: Wikipedia 2007 vs. Google Books n-gram

Newspaper article: Wikipedia 2007 vs. Google Books n-gram
Still a lot of false positives. Two directions:
- Implement more corpora: the Common Crawl corpus
- Develop a language model for unseen n-grams

Next available corpora
- British National Corpus (free)
  - Only 500 MB in XML (so the size is very small)
  - Works written by native, expert speakers
  - Needs some time to extract
- Web 1T 5-gram (LDC members)
  - 1- to 5-gram counts from approximately 1 trillion word tokens of publicly accessible Web pages (as of 2006)
- Common Crawl (free)
  - Web-page archives over the years
  - Extremely large: several TB (as of 2017)

Google Books N-gram Corpus used as a Grammar Checker (Proceedings of the EACL 2012 Workshop)
- Evaluated on Spanish
- Two main tasks:
  - Grammatical error detection
  - Toy grammar exercises (multiple choice, fill in the blanks, etc.)
- Method: break a sentence into a sequence of bigrams; flag an error if a bigram is not in the Google N-gram corpus and at least one of its words is a known (seen) word
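The bigram check above can be sketched in a few lines; the sets standing in for the Google N-gram corpus and its vocabulary are toy data, not the real resource.

```python
# Sketch of the bigram-based error detector: a word pair is flagged when it
# is absent from the reference bigram set and at least one of its words is
# known, so the miss is not explained by an out-of-vocabulary word.
def detect_errors(sentence, seen_bigrams, vocab):
    words = sentence.lower().split()
    return [
        (w1, w2)
        for w1, w2 in zip(words, words[1:])
        if (w1, w2) not in seen_bigrams and (w1 in vocab or w2 in vocab)
    ]

vocab = {"he", "has", "have", "the", "book"}
seen = {("he", "has"), ("has", "the"), ("the", "book")}
print(detect_errors("He have the book", seen, vocab))
# [('he', 'have'), ('have', 'the')]
```

Note how a single wrong word ("have") taints both bigrams it participates in, which is one source of the over-flagging such methods exhibit.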

Google Books N-gram Corpus used as a Grammar Checker (Proceedings of the EACL 2012 Workshop)
Experiment on error detection: 65 sentences, each with one error (from non-native speakers)
(tp: true positive, fn: false negative, fp: false positive)

N-gram based Statistical Grammar Checker for Bangla and English (Ninth International Conference on Computer and Information Technology, 2006)
- Uses 3-grams of POS tags from the Brown Corpus (1,014,312 words)
- If a POS trigram is not in the dictionary, the sentence is flagged as an error; e.g., "He have the book I want"
- The performance section is incomplete:
  - Flagged 321 out of 866 correct sentences as incorrect (37% false positives)
  - Did not report performance on actual error detection
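The core membership test is simple enough to sketch; tagging is left to any POS tagger, and the small trigram set below stands in for the dictionary extracted from the Brown Corpus.

```python
# Sketch of the POS-trigram check: a sentence (given as a POS-tag sequence)
# is suspect if any of its tag trigrams never occurred in the tagged corpus.
def unseen_trigrams(tag_sequence, seen_tag_trigrams):
    """Return the unseen tag trigrams; a non-empty result flags the sentence."""
    trigrams = zip(tag_sequence, tag_sequence[1:], tag_sequence[2:])
    return [t for t in trigrams if t not in seen_tag_trigrams]

seen = {("PRP", "VBZ", "DT"), ("VBZ", "DT", "NN")}
print(unseen_trigrams(["PRP", "VBZ", "DT", "NN"], seen))  # [] -> looks fine
print(unseen_trigrams(["PRP", "VBP", "DT", "NN"], seen))  # two unseen trigrams
```

The 37% false-positive rate reported above follows naturally from this design: any correct but rare tag pattern missing from the one-million-word corpus is flagged.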

Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing, 2013)
- Uses the Web 1T 5-gram corpus (1- to 5-grams)
- Corrections on prepositions only:
  - noun-prep-verb (NPV), verb-prep-verb (VPV), adj-prep-verb (APV), adverb-prep-verb (RPV)
- Two translation models:
  - TM-main (for seen trigrams)
  - TM-back-off (for unseen trigrams)

Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing, 2013)
TM-main (all three words seen)
- Example: correct to "accused of being" (V-P-V)
- Extract and count 3-grams from the corpus matching accuse (inflected) + any preposition + be (inflected), giving lemmatized trigram frequencies
- Non-majority trigrams are regarded as problematic
- Assigned probability:
  P("accused of being") = 230600 / (230600 + 10200 + 2841 + … + 535) = 0.93
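The TM-main probability reduces to a majority share over lemmatized trigram counts; a minimal sketch, where the counts are the slide's example figures for "accuse + preposition + be" (the preposition labels other than "of" are assumptions, and the elided tail of smaller counts is dropped).

```python
# Sketch of TM-main scoring: for one verb pair, the majority preposition's
# share of all lemmatized V-P-V trigram counts is its correction probability.
def tm_main_prob(counts):
    """counts: dict mapping preposition -> lemmatized trigram frequency."""
    best = max(counts, key=counts.get)
    return best, counts[best] / sum(counts.values())

prep, p = tm_main_prob({"of": 230600, "for": 10200, "with": 2841})
print(prep, round(p, 2))  # "of" dominates with probability > 0.9
```

With the full tail of small counts included, the share comes out to the 0.93 on the slide.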

Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing, 2013)
TM-back-off (for unseen trigrams)
- Example: correct to "accused of murdering" (V-P-V), where murder/murdered/murdering is unseen in the corpus
- Extract and count 3-grams from the corpus matching accuse (inflected) + any preposition + any verb, giving lemmatized trigram frequencies
- Input text: "accused to murder"
- P("accused of murdering") = 870600 / (870600 + 50200 + 20200) = 0.55

Correcting Serial Grammatical Errors based on N-grams and Syntax (Computational Linguistics and Chinese Language Processing, 2013)
Experiment
- CLC First Certificate Exam Dataset (LRN): 118 sentences with the targeted error types; other errors were corrected (incorrect samples)
- British National Corpus (BNC): 1000 random sentences (correct samples)
(Results table: correction accuracy with back-off and false-positive rates.)

Conclusions
- At this stage, we focus first on detecting and analyzing errors in students' scripts:
  - Collect and implement various corpora
  - 3-gram preposition correction (for analysis)
  - Develop a language model
- So far we have found no general way to address all English errors (except the seq2seq translation approach)

Language model proposed to try
- Target: approximate unseen n-gram frequencies
- Proposed method: word2vec representations fed into a neural network
- Example: suppose the 3-gram "conducted in English" occurs 10 times in the corpus, while "conducted in Chinese", "conducted in Mandarin", and "conducted in French" do not occur at all
- Since Chinese, Mandarin, and French are similar in vector representation, I hypothesize that the outputs of the neural network function f will be close (since f is continuous):
  If x3 ≈ x4, then f(x1, x2, x3) ≈ f(x1, x2, x4)
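The intuition behind the proposal can be illustrated without a trained network: an unseen trigram borrows frequency from a vector-close seen trigram. The two-dimensional embeddings and the similarity-weighted estimate below are toy assumptions standing in for word2vec and for the eventual learned function f.

```python
# Toy illustration of the smoothness hypothesis: if the last words of two
# trigrams are close in vector space, their frequencies should be close too.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

vec = {
    "english": (0.9, 0.1),   # hand-made toy embeddings, not real word2vec
    "french":  (0.85, 0.15), # close to "english"
    "banana":  (0.1, 0.9),   # far from "english"
}
seen = {("conducted", "in", "english"): 10}

def estimate(trigram):
    """Back off an unseen trigram to the most similar seen one (last word only)."""
    if trigram in seen:
        return seen[trigram]
    best = max(seen, key=lambda s: cos(vec[trigram[-1]], vec[s[-1]]))
    return seen[best] * cos(vec[trigram[-1]], vec[best[-1]])

print(estimate(("conducted", "in", "french")))  # close to 10
print(estimate(("conducted", "in", "banana")))  # much lower
```

A real implementation would replace the similarity-weighted lookup with a neural network trained on (x1, x2, x3) -> frequency, but the continuity argument it relies on is the same.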