The Detection and Correction of Real-word Errors in Dyslexic Text Jenny Pedler School of Computer Science & Information Systems Birkbeck College Research.

Presentation transcript:

The Detection and Correction of Real-word Errors in Dyslexic Text Jenny Pedler School of Computer Science & Information Systems Birkbeck College Research Presentation July 2002

Real-word errors
– However I gave (have) no idea what it represents.
– Now go to the macros button as shown bellow (below).
– Fred is away form (from) the and this leaves us vary (very) exposed.
– there is no evidence on Bill Gates that I have herd (heard) of

Work completed to date
– Dyslexic error corpus
– Investigation of possible approaches
  – Syntactic anomaly
  – Confusion set
– Part-of-speech tag collocation experiment
– Dictionary update

The Dyslexic Error Corpus
Sentences               1395
Words                  21524
Non-word errors         1681
Word-boundary errors     152
Real-word errors         842
Total errors            2675

Possible Approaches
– Syntactic anomaly: I haven't do (done) any in a long time.
– Confusion set: {their, there, they're} {form, from} {weather, whether} {were, where, we're} {collage, college} {loose, lose}

Confusion Sets
{their, there, they're} {form, from} {weather, whether} {were, where, we're} {loose, lose} {collage, college}
{their, there} {form, from} {weather, whether} {were, where} {loose, lose}
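
A minimal sketch of how confusion sets like those listed above might be stored for lookup. The names `CONFUSION_SETS` and `build_index` are illustrative assumptions and do not come from the presentation; the sets themselves are the reduced list shown on the slide.

```python
# Hypothetical representation: each confusion set is a frozenset of words.
CONFUSION_SETS = [
    frozenset({"their", "there"}),
    frozenset({"form", "from"}),
    frozenset({"weather", "whether"}),
    frozenset({"were", "where"}),
    frozenset({"loose", "lose"}),
]


def build_index(confusion_sets):
    """Map every member word to the confusion set that contains it."""
    index = {}
    for members in confusion_sets:
        for word in members:
            index[word] = members
    return index


index = build_index(CONFUSION_SETS)

# A word in the text is only checked if it belongs to some confusion set.
print(sorted(index["form"]))
# A word outside every set is never considered as a correction.
print("loss" in index)
```

With this layout, a word such as "loss" that is absent from every set can never be proposed as a correction, which is one limitation of any confusion-set approach.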

Tag collocation experiment
– Syntactic approach
– Minimal information
  – part-of-speech tag of preceding and succeeding word
– Provide baseline for comparison with future approaches

Calculating word|tag probabilities
– Count occurrences of each (tag, word) pair
– Calculate probability for immediately preceding and succeeding tags: P(tp|w), P(ts|w)
– Use Bayes rule to calculate probability of word occurring given the tag: P(w|tp), P(w|ts)
– Store for use at run-time
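
The counting and Bayes steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the presenter's implementation: the function names, the toy tagged corpus, and the choice to estimate P(w) from the same collocation counts are all assumptions made here.

```python
from collections import Counter

# Toy tagged corpus, invented for illustration: lists of (word, tag) pairs.
TOY_CORPUS = [
    [("in", "PRP"), ("their", "DPS"), ("teens", "NN2")],
    [("with", "PRP"), ("their", "DPS"), ("old", "AJ0")],
    [("over", "PRP"), ("there", "AV0"), ("is", "VBZ")],
    [("is", "VBZ"), ("there", "EX0"), ("any", "DT0")],
]


def count_pairs(tagged_sentences, confusion_set):
    """Count (tag, word) pairs for the tags just before and just after
    each occurrence of a confusion-set word."""
    preceding, succeeding = Counter(), Counter()
    for sentence in tagged_sentences:
        for i, (word, _tag) in enumerate(sentence):
            w = word.lower()
            if w not in confusion_set:
                continue
            if i > 0:
                preceding[(sentence[i - 1][1], w)] += 1
            if i + 1 < len(sentence):
                succeeding[(sentence[i + 1][1], w)] += 1
    return preceding, succeeding


def word_given_tag(pair_counts):
    """Turn (tag, word) counts into P(word | tag) via Bayes rule."""
    word_totals = Counter()
    for (_tag, word), n in pair_counts.items():
        word_totals[word] += n
    grand_total = sum(word_totals.values())

    # P(tag | word) and P(word), both estimated from the same counts (assumption).
    p_tag_given_word = {
        (t, w): n / word_totals[w] for (t, w), n in pair_counts.items()
    }
    p_word = {w: n / grand_total for w, n in word_totals.items()}

    # P(tag) = sum over words of P(tag | word) * P(word).
    p_tag = Counter()
    for (t, w), p in p_tag_given_word.items():
        p_tag[t] += p * p_word[w]

    return {
        (t, w): p * p_word[w] / p_tag[t] for (t, w), p in p_tag_given_word.items()
    }


prev_counts, next_counts = count_pairs(TOY_CORPUS, {"their", "there"})
p_w_tp = word_given_tag(prev_counts)   # P(w | tp), stored for run-time
p_w_ts = word_given_tag(next_counts)   # P(w | ts), stored for run-time
```

The two resulting dictionaries play the role of the stored tables that the run-time step looks up.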

Using Bayes Rule
– $\{w_1, \ldots, w_n\}$: set of words
– $\{t_1, \ldots, t_m\}$: set of tags
– $|t_j, w_i|$: the number of occurrences of word $w_i$ collocating with tag $t_j$
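
The equation from this slide is not preserved in the transcript. Using only the definitions above, one consistent way to write the calculation is sketched below. It is a reconstruction rather than the slide's own formula, and it assumes that the word prior $P(w_i)$ is estimated from the same collocation counts.

```latex
\begin{align*}
P(t_j \mid w_i) &= \frac{|t_j, w_i|}{\sum_{l=1}^{m} |t_l, w_i|},
&
P(w_i) &= \frac{\sum_{l=1}^{m} |t_l, w_i|}{\sum_{k=1}^{n} \sum_{l=1}^{m} |t_l, w_k|},
\\[4pt]
P(w_i \mid t_j) &= \frac{P(t_j \mid w_i)\, P(w_i)}{\sum_{k=1}^{n} P(t_j \mid w_k)\, P(w_k)}
&&= \frac{|t_j, w_i|}{\sum_{k=1}^{n} |t_j, w_k|}.
\end{align*}
```

Under that assumption the Bayes step reduces to the share of tag $t_j$'s collocations that involve word $w_i$.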

Run-time
– Retrieve part-of-speech tags for preceding word
– Assign highest P(w|tp) value to each member of confusion set
– Retrieve part-of-speech tags for succeeding word
– Assign highest P(w|ts) value to each member of confusion set
– P(w|tp) * P(w|ts) gives final value to each member
– Select member with highest value
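
The run-time steps above might look like the following sketch. The function name `choose_member` and all probability values are invented for illustration and are not results from the presentation. The tags for the neighbouring words are assumed to come from a dictionary that can list more than one tag per word.

```python
def choose_member(confusion_set, prev_tags, next_tags, p_w_tp, p_w_ts):
    """Score each confusion-set member and return the best one with all scores.

    p_w_tp and p_w_ts map (tag, word) to P(word | preceding tag) and
    P(word | succeeding tag); missing pairs are treated as probability 0.
    """
    scores = {}
    for word in confusion_set:
        # Highest P(w|tp) over the possible tags of the preceding word.
        best_prev = max((p_w_tp.get((t, word), 0.0) for t in prev_tags), default=0.0)
        # Highest P(w|ts) over the possible tags of the succeeding word.
        best_next = max((p_w_ts.get((t, word), 0.0) for t in next_tags), default=0.0)
        scores[word] = best_prev * best_next
    best = max(scores, key=scores.get)
    return best, scores


# Invented toy tables for the phrase "unlike their adult".
toy_p_w_tp = {
    ("AJ0", "there"): 0.7, ("AJ0", "their"): 0.3,
    ("PRP", "there"): 0.4, ("PRP", "their"): 0.6,
}
toy_p_w_ts = {
    ("NN1", "there"): 0.1, ("NN1", "their"): 0.9,
    ("AJ0", "there"): 0.2, ("AJ0", "their"): 0.8,
}

best, scores = choose_member(
    ["there", "their"],
    prev_tags=["AJ0", "PRP"],   # possible tags of "unlike"
    next_tags=["NN1", "AJ0"],   # possible tags of "adult"
    p_w_tp=toy_p_w_tp,
    p_w_ts=toy_p_w_ts,
)
print(best, scores)
```

Taking the maximum over a neighbour's possible tags means one rare tag with a high probability can dominate the score.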

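Example: "unlike their adult"
[Table: P(w|tp) for there and their given preceding tags AJ0 and PRP; P(w|ts) for there and their given succeeding tags NN1 and AJ0; P(w|tp) * P(w|ts) for there and their. Values not preserved in the transcript.]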

Initial Results

Modifications Reduced tagset Combined probabilities

Results using reduced tagset

Results using combined tag probabilities

Problems
– Target not in confusion set
  – the lose (loss) of {loose, lose}
– Errors in the immediate context
  – grauate form (from) harved (graduate from Harvard)
  – in their teems (in their teens)
– Probabilities based on rare uses of a word

Dictionary Update
– CUV2
  – 70,000+ entries
– More precise word-frequency information
– Part-of-speech tags corresponding to BNC
– Additional entries
  – words occurring frequently in BNC but not in CUV2

Further work
– Word collocation
  – weather: hot, wet, dry, warm, severe, heavy, adverse, warmer, windy, better
  – collage: paper, sticking, colourful, sound, brand, blue, postmodern, hessian, marble, cloth
– Increase the number of confusion sets
– Final testing