1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell.

1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell

2 Introduction People use. ? and ! End-of-sentence marks are overloaded.

3 Introduction Period - most ambiguous. Decimals, e-mail addresses, abbreviations, initials in names, honorific titles. For example: U.S. Dist. Judge Charles L. Powell denied all motions made by defense attorneys Monday in Portland's insurance fraud trial. Of the handful of painters that Austria has produced in the 20th century, only one, Oskar Kokoschka, is widely known in U.S. This state of unawareness may not last much longer.

4 Introduction Sentence boundary detection by humans is tedious, slow, error-prone, and extremely difficult to codify. Algorithmic syntactic sentence boundary detection is a necessity.

5 Five Applications I. Part-of-speech tagging –Examples of part-of-speech include nouns, verbs, adverbs, prepositions, conjunctions, and interjections. –John [noun] Smith [noun], the [determiner] president [noun] of [preposition] IBM [noun] announced [verb] his [pronoun] resignation [noun] yesterday [noun].

6 Five Applications II. Natural language parsing –Identify the hierarchical constituent structure in a sentence. S NP S NP NP NP PP VBD NP NP IN NP NNP NNP DT NN NNP PRP$ NN NN JohnSmiththe presidentof IBM announced his resignation yesterday

7 Five Applications III. Reading level of a document –The Bormuth Grade Level, the Flesch Reading Ease use information on the sentences in the documents.

8 Five Applications IV. Text editors –The command to move to the end of a sentence. V. Plagiarism detection

9 Related Work As of 1997: “identifying sentences has not received as much attention as it deserves.” [Reynar and Ratnaparkhi1997] “Although sentence boundary disambiguation is essential..., it is rarely addressed in the literature and there are few public-domain programs for performing the segmentation task.” [Palmer and Hearst1997] Two approaches –Rule based approach –Machine-learning-based approach

10 Related Work I. Rule based –Regular expressions [Cutting1991] Mark Wasson converted grammar into a finite automata with 1419 states and 18002 transitions. –Lexical endings of words [Müller1980] uses a large word list.

11 Related Work II. Machine-learning-based approach –[Riley1989] uses regression trees. –[Palmer and Hearst1997] uses decision trees or neural network.

12 Our Approach Punctuation rules to disambiguate end-of- sentence punctuation. Punctuation rule-based model is simple in design, and is easy to modify.

13 Our Reference Corpus A “sentence” reference corpus is a corpus with each sentence put on its own line. We manipulated the Brown Corpus to create a 51590 sentence reference corpus. Two sections - training text and final run text.

14 High Level Architecture Reference Corpus Text Document Sentence Recognizer Sentences Analysis Module Rules Sentenizer Module

15 Our Sentenizer Module Our sentenizer module has two parts: –A set of end-of-sentence punctuation rules. –An engine to apply the rules.

16 Our Analysis Module Sentenizer Analysis Module Reference Corpus diff.txt rules_summary

17 Our Analysis Module.txt The Japanese want to increase exports to the U.S. |||| While they have been curbing shipments, they have watched Hong Kong step in and capture an expanding share of the big U.S. market. The Hartsfield home is at 637 E. Pelham Rd. @@PE@@ NE. But what came in was piling up. |||| @@PE@@ The nearest undisrupted end of track from Boston was at Concord, N. H.

18 Overview of Experiment Results | | Percentage of Run | Key description | corrected marked Number | | sentences -------|-----------------------------------------|----------------- * Run 1 |All marks | 84.35% * Run 2 |Mark at token end | 89.03% Run 3 |Correction of text | 88.31% * Run 4 |Double punctuation endings | 89.01% * Run 5 |Check next word capitalization | 89.53% Run 6 |Correction of text | 89.55% Run 7 |Modify capitalization function | 91.41% Run 8 |Correction of text | 91.35% Run 9 |Modify capitalization function | 91.35% Run 10 |Correction of text | 90.40% Run 11 |Correction of text | 90.58% * Run 12 |Add abbreviation list | 95.60% Run 13 |Check single initials | 98.94% * Run 14 |Form black chunk of token | 99.12% Run 15 |Check numbering lists and double initials| 99.85% * Run 16 |Reduce abbreviation list | 98.90% Run 17 |Check special abbreviations | 99.83% Run 18 |Confidence ratings | 99.83% * Run 19 |Check sentences with ellipsis points | 99.83% * Run 20 |Check sentences with parenthesis marks | 99.83%

19 Evaluation on Testing Corpus Testing corpus - 26647 sentences Sentenizer - 26613 sentences Total 120 errors –43 false positives –77 false negatives 99.84% accuracy

20 Contributions Highly accurate –99.8% accuracy rate –Comparable to or better than existing systems Highly efficient –About 50 double spaced papers per second –About 1000 sentences per second Easily modifiable –A rule-based model

1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell.

Similar presentations

Presentation on theme: "1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell.

Similar presentations

Presentation on theme: "1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell."— Presentation transcript:

Similar presentations

About project

Feedback