Noisy Text Analytics: An Exercise in Futility?
Lopresti, January 2007. AND Workshop on Analytics for Noisy Unstructured Text Data.


1 Noisy Text Analytics: An Exercise in Futility?
Workshop on Analytics for Noisy Unstructured Text Data, Panel Session, January 8, 2007
Sreeram Balakrishnan (IBM Research)
Hwee Tou Ng (National Univ. of Singapore)
Rohini Srihari (Janya Inc.)
Daniel Lopresti (Lehigh Univ.), moderator

2 Panel Session
Each panelist will make a brief presentation. Please think of questions to discuss. What's your opinion?
The AND workshop has attracted researchers working on noisy text analytics from a variety of perspectives.
Some reports of success, other promising first steps.
Still, there are many remaining unsolved issues.
Major hurdles include the inherent complexities of human language and the wide range of sources of noise.
Panel session: is the challenge never-ending and pointless? Or are there reasons to be hopeful?

3 What's this? Mwentxeth International JJomtA Conf r n | | t fl la! | lnt II: encie | 9 | | | | Ll 9 | | 6 | 9 | |^~R= | | | | | j | | | |^~R- | ||^~R. | | |^~R| | | IJCAI-2007 'Wurksho | on Analytics far Noisy Unstructured Text Data sig? Hyderabad, India - January 8, 2007 Noisy unstructured text data is found in informal settings such as online chat, SMS. emeils. I Home message boards. newsgroups, blogs, wilds and web pages Also, text produced by processing News spontaneaus; speech, printed text. handwritten text contains processing noise. Text produced | | under such circumstances is typically highly noisy containing spelling errors, abbreviations` W tch this S ace for | P d. non-standard words. fates starts, repetitions. missing punctuetions, missimg case information, a I t t P | | pause filling words such as ^um` and huh." Such text can be seen in ierge amounts in contact a as news' 3|: Can for pagers centers` online chat rooms, OCRed text documents, SMS gorpug etcffhe theme 'of the IJCAI. 01/08 Worksho 200`? Conference as "A! and its benetits to socaety." In keepmg with thss theme, this workshop | Proceedings online` tm ortantlgates proposes to look at text enemies of highly noasy text that Is produced tn such everyday | applications in society. more | peogle On 07 Jan O | The goal of the workshop is to focus on the problems encountered in anaiyzing such | | noisy documents coming from various sources, The nature of the text warrants moving | Mendanae beyond traditional text anamics techniques` We hope that the workshop will allow researchers announched- | to present current research and development in addressing this challenge. We aiso betieve that | | by | Contact as a result ofthis workshop mere Ma be snaring of real Me noisy data sets and Wm resux: in their Gerald Detwngd becoming aveilebfe to a wider research community. 
gotential dagasets | | Decisions 1 t A emalled to authorsp |

4 AND Homepage, Printed & Scanned
[Figure: scanned page image with two enlarged views, labeled Zoom 1 and Zoom 2]

5 Not fair? Binarized | liiil i El| LJ Fm Eltr'lq liIi1il7J r'E lid tEii=Y=iI lid lE1ilE i El FFm | El El IE lgg | [Iil lijil IE r'lij El _ Fm |El lgg rilijil Lm pil El _ El pil liiil r*mtlE Fl | liiil LJ El El riil | | | Fl, rjil Fi | lid 11 LJ Fm lid | F El Lm liii Fl liii i Fliji LJ FFm eEl1ilE Fm li;i | El i El 1 Fm liiil FIY eEltlE Fm lid IE rilij |liiil r'lij El I flE I El | eEl1ilE ri pil la LJ El | fi I I i Fl s;] |liiil r'lij El El Lm liji Fl IE El " LJ F liji | | | El I liiil Fm? I i Fm | liii Fm lE1i r'lijil liiil FFI El. liijiil Iii | mijim | | liijii liiil rFmfE r'E Fm | | i El | | | I IE Fm lid i tE pil r'liiil riil liiil El | El tliiil I liiil liiil I-lii lat tEii=Y=iI IE Fm IE IE pil pil I i liii lE1ii liiil Fm El i Fm El liiil liji i | i FFm liiil F| | | | tl1e \!\i=::=FI-i Cl4gl4:LlrI1er1tS: 4:C>l11iI'lg f~F [Iil |lijil Fm lid tr'la lid i ti lijil Fm IE I tEii=Y=iI IE Fm IE I'3.-ti liii E tliiil pil r'E El | Fit liii LJ r' r'E Fit r'E El | | | Fm IE Fm lid IE El IE r'E El LJ It liiilf 1iFm i El |liiil r' I-lii El Fm liiil riil 1iFl | r bil | liii liiil FFI i FI s;] lE|lE i I IE bil I | tliiil IE -l_.--._l-i lid | | r' OCR Result Zoom

6 What's going on here?
Document image analysis is still a research topic.
Complex layouts (e.g., multicolumn) are hard.
Noisy inputs problematic for character recognition.
Segmentation (character / word) often error-prone.
Recognition sensitive to skew, font, resolution.
Handwriting is even more difficult.
By exploiting redundancy, some analytics tasks (e.g., IR, text categorization) still feasible.
Other tasks likely to remain unsolved for a long time.
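
The redundancy point can be illustrated with a small sketch. It assumes character trigram counts compared by cosine similarity, which is one common way to make retrieval tolerant of OCR errors and is not necessarily what the panel had in mind. The clean and noisy strings are fragments of the Moby-Dick sample used later in the deck; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


clean = ("Call me Ishmael. Some years ago--never mind how long precisely--"
         "having little or no money in my purse")
noisy = ("call me IshMael. soMe ye_s ago--never mind how long, p,ec;sely__"
         "hav;ng _;tle or no _oney in my purse")
unrelated = "Document image analysis is still a research topic. Complex layouts are hard."

sim_noisy = cosine(char_ngrams(clean), char_ngrams(noisy))
sim_unrelated = cosine(char_ngrams(clean), char_ngrams(unrelated))
print(f"clean vs. noisy OCR: {sim_noisy:.2f}")
print(f"clean vs. unrelated: {sim_unrelated:.2f}")
```

Because many trigrams survive the OCR damage, the noisy passage still scores closer to its clean source than unrelated text does, which is the sense in which coarse tasks such as retrieval or categorization can remain feasible.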

7 The big question
Will noisy text like this ever be tractable?
Note: it is very easy to say "Just make the OCR better," but this has proven to be a very hard problem and is likely to take a very long time.

8 Noisy Text Analytics: An Exercise in Futility?
Workshop on Analytics for Noisy Unstructured Text Data, Panel Session, January 8, 2007
Sreeram Balakrishnan (IBM Research)
Hwee Tou Ng (National Univ. of Singapore)
Rohini Srihari (Janya Inc.)
Daniel Lopresti (Lehigh Univ.), moderator

9 Text Processing Stages: Functions
Processing Stage | Intended Function
Optical character recognition | Transcribe input bitmap into encoded text (hopefully accurately).
Sentence boundary detection | Break input into sentence-sized units, one per text line.
Tokenization | Break each sentence into word (or word-like) tokens delimited by white space.
Part-of-speech tagging | Take tokenized text and attach a label to each token indicating its part of speech.
Source: Daniel Lopresti, "Performance Evaluation for Text Processing of Noisy Inputs," Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), March 2005, Santa Fe, NM, pp. 759-763.

10 Text Processing Stages: Problems
Processing Stage | Potential Problem(s)
Optical character recognition | Current OCR is brittle; errors made early on propagate to later stages.
Sentence boundary detection | Missing or spurious sentence boundaries due to OCR errors on punctuation.
Tokenization | Missing or spurious tokens due to OCR errors on whitespace and punctuation.
Part-of-speech tagging | Bad PoS tags due to failed tokenization or OCR errors that alter orthographies.

11 Problems 1
Sentence boundary detection results for clean input:
CHAPTER 1 Loomings. Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Results for noisy input (light photocopy):
' cH__' R l '. _omings., call me IshMael. soMe ye_s ago--never mind how long, p,ec;sely__hav;ng _;tle or no _oney in my purse, and nothing p_;,u__ to ;,terest Me on shore, I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world.
Note: 3 sentences vs. 4 sentences.
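
How fragile sentence boundary cues are can be seen with a deliberately naive splitter. This is a sketch under a stated assumption: the rule below (break after ., ! or ? when followed by whitespace and a capital letter) is hypothetical and is not MXTERMINATOR, so its sentence counts differ from the 3 vs. 4 reported on the slide. The two input strings are the clean and noisy passages from the slide.

```python
import re


def split_sentences(text):
    """Naive rule (an assumption, not MXTERMINATOR): break after . ! or ?
    when the next non-space character is an uppercase ASCII letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


clean = ("CHAPTER 1 Loomings. Call me Ishmael. Some years ago--never mind how "
         "long precisely--having little or no money in my purse, and nothing "
         "particular to interest me on shore, I thought I would sail about a "
         "little and see the watery part of the world.")
noisy = ("' cH__' R l '. _omings., call me IshMael. soMe ye_s ago--never mind how "
         "long, p,ec;sely__hav;ng _;tle or no _oney in my purse, and nothing "
         "p_;,u__ to ;,terest Me on shore, I thoug_t I would sail _boUt a "
         "_;tt1e and see _e watery p_ or the world.")

clean_sentences = split_sentences(clean)
noisy_sentences = split_sentences(noisy)
print(len(clean_sentences), "sentences found in clean input")
print(len(noisy_sentences), "sentences found in noisy input")
```

With this rule the clean passage splits into three sentences, while the noisy passage stays in one piece because the capital letters that signal a sentence start were lost or altered by the OCR. A different detector fails differently (the slide shows a spurious extra boundary), but the cause is the same: the cues it relies on are exactly what the noise corrupts.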

12 Problems 2
Tokenization results for clean input:
CHAPTER 1 Loomings. Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Results for noisy input (light photocopy):
' cH__ ' R l '. _omings., call me IshMael. soMe ye_s ago -- never mind how long, p, ec ; sely__hav ; ng _ ; tle or no _oney in my purse, and nothing p_ ;, u__ to ;, terest Me on shore, I thoug_t I would sail _boUt a _ ; tt1e and see _e watery p_ or the world.
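
The effect of spurious punctuation on tokenization can be reproduced with a simplified tokenizer. The regular expression below is an assumption made for illustration; it is not the Penn Treebank tokenizer.sed script, although on this fragment it happens to split the noisy text in the same places as the slide.

```python
import re


def tokenize(text):
    """Simplified tokenizer (an assumption): a double hyphen, a run of word
    characters (letters, digits, underscore), or any single other symbol."""
    return re.findall(r"--|\w+|[^\w\s]", text)


clean_tokens = tokenize("never mind how long precisely--having little")
noisy_tokens = tokenize("never mind how long, p,ec;sely__hav;ng _;tle")

print(len(clean_tokens), clean_tokens)
print(len(noisy_tokens), noisy_tokens)
```

The clean fragment yields 8 tokens and the noisy one 15: each comma or semicolon hallucinated by the OCR cuts a word into pieces, so a single damaged word such as "precisely" becomes several spurious tokens.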

13 Problems 3
Part-of-speech tagging results for clean input:
CHAPTER_NNP 1_CD Loomings_NNS ._. Call_VB me_PRP Ishmael_NNP ._. Some_DT years_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ precisely_RB --_: having_VBG little_JJ or_CC no_DT money_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN particular_JJ to_TO interest_VB me_PRP on_IN shore_NN ,_, I_PRP thought_VBD I_PRP would_MD sail_VB about_IN a_DT little_JJ and_CC see_VB the_DT watery_JJ part_NN of_IN the_DT world_NN ._.
Results for noisy input (light photocopy):
'_POS cH___NNS '_POS R_`` l_NNS '_'' ._. _omings_NNS ._. ,_, call_VBP me_PRP IshMael_NNP ._. soMe_JJ ye_s_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ ,_, p_NNP ,_, ec_NNP ;_: sely__hav_NNP ;_: ng_NNP __NNP ;_: tle_NNP or_CC no_DT _oney_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN p__NN ;_: ,_, u___JJ to_TO ;_: ,_, terest_NN Me_NN on_IN shore_NN ,_, I_PRP thoug_t_VBP I_PRP would_MD sail_VB _boUt_VBN a_DT __NN ;_: tt1e_JJ and_CC see_VBP _e_JJ watery_NN p__, or_CC the_DT world_NN ._.
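
Tagger output in this token_TAG form can be read back with a few lines of code. The sketch below assumes items are separated by whitespace and splits each item at its last underscore, which matters here because the noisy tokens themselves contain underscores. The two fragments are taken from the tagged output on the slide; the helper name is hypothetical.

```python
from collections import Counter


def parse_tagged(text):
    """Split whitespace-separated token_TAG items at the last underscore."""
    pairs = []
    for item in text.split():
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


clean = "how_WRB long_JJ precisely_RB --_: having_VBG little_JJ"
noisy = ("how_WRB long_JJ ,_, p_NNP ,_, ec_NNP ;_: sely__hav_NNP ;_: "
         "ng_NNP __NNP ;_: tle_NNP")

clean_tags = Counter(tag for _, tag in parse_tagged(clean))
noisy_tags = Counter(tag for _, tag in parse_tagged(noisy))
print("clean:", dict(clean_tags))
print("noisy:", dict(noisy_tags))
```

In the noisy fragment the word fragments left behind by tokenization are all tagged NNP (proper noun), a tag that does not occur at all in the clean fragment. This is one concrete way in which earlier-stage errors show up as bad PoS tags.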

14 Test Conditions
Optical character recognition | Open-source gocr package. http://jocr.sourceforge.net/index.html (Joerg Schulenburg et al.)
Sentence boundary detection | MXTERMINATOR. "A Maximum Entropy Approach to Identifying Sentence Boundaries," J. C. Reynar and A. Ratnaparkhi, Proc. 5th Conf. on Applied Natural Language Processing, 1997.
Tokenization | Penn Treebank tokenizer. http://www.cis.upenn.edu/~treebank/tokenizer.sed (Robert MacIntyre)
Part-of-speech tagging | MXPOST. "A Maximum Entropy Part-Of-Speech Tagger," A. Ratnaparkhi, Proc. Empirical Methods in Natural Language Processing Conference, 1996.
Corpus | 10 pages of Project Gutenberg Moby-Dick. http://www.gutenberg.net (Michael Hart et al.)

15 Average OCR Performance
[Figure: three charts, labeled All Symbols, Punctuation, and Whitespace]
Notes:
Baseline high on clean inputs, deteriorates rapidly on noisy inputs.
Punctuation especially badly impacted: many false alarms.
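
One way to obtain per-class figures like these is to align the OCR output with the ground truth and score each character class separately. The sketch below is an assumption about how such numbers could be computed, not the evaluation procedure of the cited paper: it uses Python's difflib.SequenceMatcher for the alignment and reports precision (how much of the OCR output is correct) and recall (how much of the truth was recovered). Note that the underscore, which appears in the noisy output on the slides, counts as punctuation here.

```python
import string
from difflib import SequenceMatcher


def char_class(c):
    """Classify a character as whitespace, punctuation, or other."""
    if c.isspace():
        return "whitespace"
    if c in string.punctuation:
        return "punctuation"
    return "other"


def per_class_scores(truth, ocr):
    """Return {class: (precision, recall)} based on characters that
    SequenceMatcher aligns as identical between truth and OCR output."""
    matched = {"all": 0, "punctuation": 0, "whitespace": 0}
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for block in matcher.get_matching_blocks():
        for c in truth[block.a:block.a + block.size]:
            matched["all"] += 1
            k = char_class(c)
            if k in matched:
                matched[k] += 1

    def total(s, k):
        return len(s) if k == "all" else sum(char_class(c) == k for c in s)

    scores = {}
    for k, m in matched.items():
        n_ocr, n_truth = total(ocr, k), total(truth, k)
        scores[k] = (m / n_ocr if n_ocr else 1.0, m / n_truth if n_truth else 1.0)
    return scores


truth = "in my purse, and nothing particular to interest me on shore, I thought"
ocr = "in my purse, and nothing p_;,u__ to ;,terest Me on shore, I thoug_t"
scores = per_class_scores(truth, ocr)
for name, (precision, recall) in scores.items():
    print(f"{name:12s} precision={precision:.2f} recall={recall:.2f}")
```

On this fragment the punctuation precision is far below the overall precision, because the OCR output contains ten punctuation symbols where the truth has two. Low precision of this kind is what "many false alarms" means in the notes.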

16 Sample Alignment
Applying hierarchical string matching paradigm, we can recover correct correspondence between noisy output and original input.
A straightforward example found by algorithm:
[Figure: aligned noisy and original text, annotated with "Token-level segmentation error", "Substitution errors", and "Substitution error"]
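
The basic building block of such an alignment can be sketched with ordinary edit distance. This is plain character-level Levenshtein alignment with a backtrace, offered as an illustration only; it is not the hierarchical string matching method of the cited paper, which additionally recovers token-level and sentence-level correspondences.

```python
def align(truth, ocr):
    """Levenshtein alignment with unit costs.
    Returns (distance, ops), where each op is (kind, truth_char, ocr_char)."""
    n, m = len(truth), len(ocr)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # truth character missing from OCR
                          d[i][j - 1] + 1)         # extra character in OCR
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                kind = "match" if cost == 0 else "substitute"
                ops.append((kind, truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", truth[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", ocr[j - 1]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


distance, ops = align("thought", "thoug_t")
errors = [op for op in ops if op[0] != "match"]
print(distance, errors)
```

For the word pair "thought" / "thoug_t" from the sample text, the alignment reports a single substitution error (h read as _). Running the same procedure over tokens instead of characters is one way to expose segmentation errors, where one original token corresponds to several noisy ones.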

17 Text Processing Performance
[Figure: three charts, labeled Sentence Boundaries, Tokenization, and PoS Tagging]
Notes:
Clean input processed at > 95%; many false alarms in noisy inputs.
Performance degrades with each successive stage.

18 Lehigh University
Key facts about Lehigh:
A research university founded in 1865.
Four colleges: Engineering, Arts & Sciences, Business, Education.
Faculty = 441 full-time. Graduate students = 2,064. Undergraduates = 4,577.
Three campuses spread over 1,600 acres (mountain side, wooded).
Located in northeastern U.S. (about 1.5 hours from New York and Philadelphia, 3 hours from Washington, DC).
Engineering College ranked in top 20% of Ph.D.-granting schools in U.S.
University ranked in top 15% of U.S. national universities.
[Photo: Packard Lab, home of Computer Science & Engineering]

19 Lehigh University
[Map: Lehigh University; New York 120 km; Philadelphia 80 km]

