A method for unsupervised broad-coverage lexical error detection and correction 4th Workshop on Innovative Uses of NLP for Building Educational Applications.


1 A method for unsupervised broad-coverage lexical error detection and correction
4th Workshop on Innovative Uses of NLP for Building Educational Applications, NAACL, June 5, 2009
Nai-Lung Tsao and David Wible, National Central University, Taiwan

2 The Research Context: IWiLL Online Writing Platform
Since 2000, under the support of the MOE & Taipei Bureau of Education, IWiLL has been used in Taiwan by:
- 455 schools
- 2,804 teachers
- 161,493 students
- 22,791 independent learners
Teachers have authored 9,429 web-based lessons with the system's authoring tool.
The learner corpus (English TLC) has archived:
- over 32,000 English essays
- 5 million words of machine-readable running text written by Taiwan's learners using the IWiLL writing platform
- 100,000 tokens of teacher comments on these student texts

3 Second Language Learners' Error Detection and Correction
Lexical and lexico-grammatical errors:
- an open-ended class
- driving teachers crazy
- either no rules involved, or rules of very limited productivity

4 Two components to our system
INPUT: user-produced string, e.g. 'on my opinion'
1. Target Language Knowledgebase: hybrid n-grams extracted from the BNC
2. Edit Distance Algorithm: compares the user's string against the hybrid n-grams
OUTPUT: error detection/correction

5 The Knowledgebase of Hybrid N-grams: What, Why, and How
What is a hybrid n-gram? An n-gram that admits items of different levels of representation:
- Traditional n-gram: 'in my opinion'
- Hybrid n-gram: 'in [dps] opinion'
Why use hybrid n-grams?
- Traditional n-grams and error precision: 'enjoy to canoe' is unattested, so it is marked as an error (true positive); but 'enjoy canoeing' is also unattested, so it too is marked as an error (false positive).
- POS n-grams and recall: from attested strings like 'enjoy hiking' or 'like watching' we could extract the POS gram V + VVg, but this would also accept 'hope exploring'.
How hybrid n-grams are extracted for the knowledgebase:

6 How the hybrid n-grams are extracted
Four categories of info for each item in an n-gram, e.g. for 'enjoyed hiking':
- word form: enjoyed, hiking
- lexeme: enjoy, hike
- detailed POS: [VVd], [VVg]
- rough POS: {V}, {V}
Some potential hybrid n-grams for the string 'enjoyed hiking': enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.
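The four-level scheme above amounts to taking one of the four representations independently per slot. A minimal sketch (the function name and tuple layout are my own, not from the talk):

```python
from itertools import product

def hybrid_ngrams(tokens):
    """Enumerate the hybrid n-grams for a tagged token sequence.

    Each token is a 4-tuple (word form, lexeme, detailed POS, rough POS),
    e.g. ('enjoyed', 'enjoy', 'VVd', 'V').  A hybrid n-gram picks one of
    the four representation levels independently for each slot.
    """
    return [tuple(choice) for choice in product(*tokens)]

grams = hybrid_ngrams([('enjoyed', 'enjoy', 'VVd', 'V'),
                       ('hiking', 'hike', 'VVg', 'V')])
# 4 levels per slot -> 4 * 4 = 16 hybrid bigrams, including the ones
# listed on the slide: enjoyed + V, enjoy + VVg, VVd + VVg, ...
```
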

7 Two components:
INPUT: user-produced string, e.g. 'on my opinion'
1. Target Language Knowledgebase: hybrid n-grams extracted from the BNC
2. Edit Distance Algorithm: compares the user's string against the hybrid n-grams
OUTPUT: error detection/correction

8 Edit Distance Component: Steps in measuring edit distance
1. Generate all hybrid n-grams from the learner input string (Set C).
2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S). We limit edit distance to 'substitution', so we limit the search to n-grams of the same length as the learner's input string.
   b. Prune Set S using filter factor, or coverage.
3. Rank candidates by weighted edit distance between members of C and S.

9 Edit Distance Component: Step 1
Generate all hybrid n-grams from the learner input string (Set C).
Input from learner: 'enjoyed hiking'
Set C = hybrid n-grams generated from the learner string: enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.

10 Edit Distance Component: Steps in measuring edit distance
1. Generate all hybrid n-grams from the learner input string (Set C).
2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S). We limit edit distance to 'substitution', so we limit the search to n-grams of the same length as the learner's input string.
   b. Prune Set S using filter factor, or coverage.
   c. Eliminate n-grams under a frequency threshold.
3. Calculate weighted edit distance between members of C and S.
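A rough sketch of how steps 2-3 could fit together, assuming the knowledgebase is a mapping from hybrid n-grams to corpus frequencies (the function name, the dict representation, and the frequency threshold value are illustrative assumptions, not the paper's implementation):

```python
def candidate_corrections(set_c, knowledgebase, min_freq=5):
    """Restrict the knowledgebase to n-grams of the same length as the
    learner string (edit distance is limited to substitution), drop
    entries below a frequency threshold, and keep those within one
    substitution of some hybrid n-gram in Set C.

    set_c: hybrid n-grams generated from the learner string (tuples).
    knowledgebase: dict mapping hybrid n-gram (tuple) -> corpus frequency.
    """
    n = len(next(iter(set_c)))
    set_s = {g for g, freq in knowledgebase.items()
             if len(g) == n and freq >= min_freq}
    hits = set()
    for s in set_s:
        for c in set_c:
            # substitution-only distance: count differing slots
            if sum(a != b for a, b in zip(c, s)) <= 1:
                hits.add(s)
                break
    return hits

kb = {('enjoy', 'VVg'): 80,        # same length, frequent: searched
      ('hope', 'to', 'find'): 40,  # wrong length: excluded
      ('enjoy', 'hiking'): 2}      # below threshold: excluded
set_c = {('enjoyed', 'hiking'), ('enjoy', 'hiking'), ('enjoy', 'VVg')}
candidate_corrections(set_c, kb)
```
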

11 Edit Distance Component: Step 2
a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S).
b. Prune Set S using filter factor, or coverage.
c. Eliminate n-grams under a frequency threshold.

12 Edit Distance Component: Step 2
a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S).
b. Prune Set S using filter factor, or coverage.
c. Eliminate n-grams under a frequency threshold.

13 Edit Distance Component: Step 2a
Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S): look up the content lexemes 'enjoy' and 'hike' in the target knowledgebase.
Set C = hybrid n-grams generated from the learner string 'enjoyed hiking': enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.

14 Edit Distance Component: Step 2a (continued)
The knowledgebase entries carry all four levels per slot (enjoyed / enjoy / [VVd] / {V} and hiking / hike / [VVg] / {V}); the hybrid n-grams derivable from 'enjoy' and 'hike' form Set S.
Set C = hybrid n-grams generated from the learner string 'enjoyed hiking': enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.

15 Edit Distance Component: Step 2a (continued)
Target knowledgebase hybrid n-grams for 'enjoyed hiking' (enjoyed / enjoy / [VVd] / {V}; hiking / hike / [VVg] / {V}) make up Set S.

16 Edit Distance Component: Step 2a (continued)
Retrieval via the lexeme 'hike': knowledgebase hybrid n-grams whose second slot is hike, VVg, V, or hiking are added to Set S.

17 Edit Distance Component: Step 2a (continued)
Retrieval via the lexeme 'enjoy': knowledgebase hybrid n-grams whose first slot is enjoy, VVd, V, or enjoyed are added to Set S.

18 Pruning Set S of Candidates
enjoy + V: 100 tokens
enjoy + VVg: 80 tokens
We prune the subsuming hybrid n-gram (here 'enjoy + V') in cases where a subsumed one (here 'enjoy + VVg') accounts for 80% or more of the subsuming set.

19 Pruning Set S of Candidates
enjoy + VVg: 80 tokens (retained)
We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.
Pruning of the knowledgebase will affect error recall.
The remaining Set S is filtered for frequency of member hybrid n-grams.
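The 80% coverage filter on the two slides above could be sketched as follows. The explicit (general, specific) pair list is an assumption for illustration; in practice the subsumption relation would come from the word-form/lexeme/POS hierarchy itself:

```python
def prune_by_coverage(counts, subsumption, threshold=0.8):
    """Prune subsuming hybrid n-grams by coverage.

    counts: dict mapping hybrid n-gram (tuple) -> token frequency.
    subsumption: list of (general, specific) pairs where `specific` is a
        subsumed variant of `general` (e.g. 'enjoy + VVg' under 'enjoy + V').
    The subsuming n-gram is dropped when one subsumed n-gram accounts
    for `threshold` (80%) or more of its tokens.
    """
    pruned = dict(counts)
    for general, specific in subsumption:
        if general in pruned and counts.get(specific, 0) >= threshold * counts[general]:
            del pruned[general]
    return pruned

kb = {('enjoy', 'V'): 100, ('enjoy', 'VVg'): 80}
kb = prune_by_coverage(kb, [(('enjoy', 'V'), ('enjoy', 'VVg'))])
# 'enjoy + VVg' covers 80 of the 100 tokens of 'enjoy + V',
# so the more general 'enjoy + V' is pruned
```
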

20 Edit Distance Component: Steps in measuring edit distance
1. Generate all hybrid n-grams from the learner input string (Set C).
2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S). We limit edit distance to 'substitution', so we limit the search to n-grams of the same length as the learner's input string.
   b. Prune Set S using filter factor, or coverage.
3. Rank candidates by weighted edit distance between members of C and S.

21 Weighting of Edit Distance
Learner string: 'enjoyed to hike'
Generate Set C of hybrid n-grams from the learner string: enjoyed to hike; enjoy VVt; enjoy V; V to hike; VVd to hike; etc.
Generate Set S of hybrid n-grams from the knowledgebase: enjoyed hiking; enjoyed hike; enjoy VVg; VVd hiking; V hiking; VVd hike; enjoy VVg; enjoy learning; etc.
Distance = 1: string c and string s are identical but for one slot.
Correction candidates are those with a distance of 1 or lower.
Ranking of candidates with distance = 1 from the learner string:
- A differing element with the same lexeme but a different word form is closer than one with a different lexeme.
- A differing element with the same rough POS but a different detailed POS is closer than one with a different rough POS.
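The two ranking principles above can be sketched as slot-substitution costs. This is a simplified illustration, not the paper's implementation: the numeric weights (0.2, 0.5, 1.0) are placeholders (the Future Work slide leaves learning the weightings to machine learning), slots are compared as word forms only, and `lexeme`/`rough_pos` stand in for lemmatizer/tagger lookups:

```python
def slot_cost(c, s, lexeme, rough_pos):
    """Cost of substituting slot element c (learner) by s (candidate).
    Smaller cost = closer match.  Weights are illustrative only."""
    if c == s:
        return 0.0
    if lexeme(c) == lexeme(s):        # same lexeme, different word form
        return 0.2
    if rough_pos(c) == rough_pos(s):  # same rough POS, diff. detailed POS
        return 0.5
    return 1.0                        # different rough POS

def weighted_distance(c_gram, s_gram, lexeme, rough_pos):
    """Weighted substitution-only edit distance between two same-length
    hybrid n-grams."""
    return sum(slot_cost(c, s, lexeme, rough_pos)
               for c, s in zip(c_gram, s_gram))
```

Candidates in Set S can then be ranked by sorting on `weighted_distance` from the learner's n-gram, so that 'enjoyed hiking' outranks 'enjoy learning' as a correction for 'enjoyed to hike'.
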

22 Examples 1: C-selection
Enjoy to swim > enjoy swimming
Enjoy to shop > enjoy shopping
Enjoy to canoe > enjoy canoeing
Enjoy to learn > *need to learn; ?want to learn; enjoy learning
Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
Hope finding > hope to find
Let us to know > let us know
Get used to say > *get used to; *have used to say
Collocation with C-selection:
Spend time to fix > spend time fixing; take time to fix
Take time fixing > take time to fix
Take time recuperating > take time to recuperate
Spend time to recuperate > spend time recuperating; take time to recuperate

23 Examples 2: Preposition
Fixed expressions:
On the outset > At the outset
In different reasons > For different reasons
In that time > at that time; by that time
On that time > at that time; by that time
On my opinion > in my opinion
In my point of view > from my point of view
I am interested of > I am interested in
She is interested of > she is interested in
I am interesting in > I am interested in
She is interesting in > She is interested in
Just on the time when > just at the time when; *just to the time when

24 Examples 3: Preposition/Particle
Verb + preposition (particle):
Discuss to each other > *discussing to each other (should be discuss WITH each other)
Discuss this to them > discuss this with them
Waited to her > waited for her
Waited to them > waited for them
Noun + preposition:
His admiration to > his admiration for
His accomplishment on > *no suggestion
The opposite side to > the opposite side of
A crisis on > a crisis of; a crisis in
A crisis on his work > a crisis of his work (*a crisis on his work)

25 Examples 4: Content Word Choice
Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
Frame of mood > ??change of mood; frame of mind; *frame of reference

26 Examples 5: Morpho-syntactic
She will ran > She will run
She will runs > She will run
Pronoun case:
What made she change > *what made she change (no correction; should be made HER change)
Noun countability or number errors:
In modern time > in modern times
Number agreement in head noun and determiner:
Too much people > too many people
So much things > so many things
So many thing > so many things
One of the man > one of the men
One of the problem > one of the problems
In my opinions > in my opinion
A lot of problem > a lot of problems
Complementizer selection:
I wonder that > I wonder if; I wonder whether

27 Future Work
- Improving POS tagging using a 2nd-order model
- Machine learning of weightings for the various features determining edit distance
- Incorporation of this into our IWiLL online writing environment
- Incorporation of MI for the knowledgebase's hybrid n-grams

28 Thank you

