Speech and Language Processing Ch8. WORD CLASSES AND PART-OF-SPEECH TAGGING

임성신 (sslim@pusan.ac.kr)

Artificial Intelligence Laboratory
Agenda
- What are they?
- Distribution
- Tagsets
- Tagging
  - Rules
  - Probabilities
  - Transformation-Based (Brill)

Parts of Speech
- Start with eight basic categories
  - Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
- These categories are based on morphological and distributional properties (not semantics)
- Some cases are easy, others are murky

Parts of Speech
- Two kinds of category
  - Closed class: prepositions, articles, conjunctions, pronouns
  - Open class: nouns, verbs, adjectives, adverbs

Figure 8.1 Prepositions (and particles) of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Figure 8.2 English single-word particles from Quirk et al. (1985).

Figure 8.3 Coordinating and subordinating conjunctions of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Figure 8.4 Pronouns of English from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Figure 8.5 English modal verbs from the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 million word corpus.

Sets of Parts of Speech: Tagsets
- There are various standard tagsets to choose from; some have many more tags than others
- The choice of tagset depends on the application
- Accurate tagging is possible even with large tagsets

Figure 8.6 Penn Treebank part-of-speech tags (including punctuation).

Tagging
- Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence …
- Assume we have
  - A tagset
  - A dictionary that gives you the possible set of tags for each entry
  - A text to be tagged
  - A reason?
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
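The word/TAG notation in the example sentence can be read mechanically. The following is a minimal Python sketch, not part of the original slides: the names `parse_tagged` and `build_dictionary` are illustrative assumptions, and the "dictionary" is built from this single sentence only, so it stands in for a real lexicon rather than replacing one.

```python
from collections import defaultdict

# The example sentence from the slide, in word/TAG notation.
TAGGED = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
          "number/NN of/IN other/JJ topics/NNS ./.")


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so './.' yields ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def build_dictionary(pairs):
    """Map each lowercased word to the set of tags it was seen with."""
    lexicon = defaultdict(set)
    for word, tag in pairs:
        lexicon[word.lower()].add(tag)
    return dict(lexicon)


pairs = parse_tagged(TAGGED)
lexicon = build_dictionary(pairs)
print(pairs[:3])
print(sorted(lexicon["number"]))
```

With a larger tagged corpus, the same dictionary would list several tags for ambiguous words, which is exactly the case a tagger has to resolve.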

Figure 8.7 The number of word types in the Brown corpus by degree of ambiguity (after DeRose (1988)).

Tagging - Rules
- Hand-crafted rules for ambiguous words that test the context to make appropriate choices
- Early attempts were fairly error-prone
- Extremely labor-intensive

Figure 8.8 Sample lexical entries from the ENGTWOL lexicon described in Voutilainen (1995) and Heikkilä (1995).

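Tagging - Probabilities
- Advantages
  - Given a sufficiently large tagged corpus, the statistics needed for tagging are easy to extract, so the approach scales well, applies broadly, and achieves relatively high overall accuracy
- Disadvantages
  - Dependent on the corpus
  - Extracting meaningful statistics requires a tagged corpus above a certain size
  - Building the corpus takes a great deal of time and effort
  - When the corpus is skewed or too small, data sparseness lowers reliability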

Tagging - Probabilities
- We want the best set of tags for a sequence of words (a sentence)
  - W is a sequence of words
  - T is a sequence of tags
- The probability of the word sequence, P(W), is the same for every tag sequence, so it does not affect which tag sequence is best
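The equations on this slide did not survive extraction. As a hedged reconstruction from the surviving text, the standard Bayes' rule argument is sketched below; the symbol \hat{T} for the chosen tag sequence is an assumption, not notation confirmed by the slide.

```latex
% Requires amsmath. W = word sequence, T = tag sequence (as defined on the slide).
\begin{align*}
\hat{T} &= \operatorname*{argmax}_{T} P(T \mid W) \\
        &= \operatorname*{argmax}_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
           && \text{(Bayes' rule)} \\
        &= \operatorname*{argmax}_{T} P(W \mid T)\, P(T)
           && \text{($P(W)$ is the same for every $T$)}
\end{align*}
```

The last step is the point made in the final bullet: dividing every candidate by the same P(W) cannot change which candidate scores highest.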

Tagging - Transformation-Based (Brill tagging)
- Combine rules and statistics …
- TBL (Transformation-Based Learning) is based on rules
- Rules are automatically induced from the data (ML)

Brill tagging - Examples
- Race
  - "race" as NN: .98
  - "race" as VB: .02
- So you'll be wrong 2% of the time, which really isn't bad
- Patch the cases where you know it has to be a verb
  - Change NN to VB when previous tag is TO
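The two steps on this slide (tag every word with its most likely tag, then patch with a contextual rule) can be sketched in Python. This is an illustration under stated assumptions, not Brill's implementation: the tiny `MOST_LIKELY_TAG` table and the function names are hypothetical, and unknown words default to NN purely for the sake of the example.

```python
# Hypothetical most-likely-tag table; only "race" -> NN comes from the slide.
MOST_LIKELY_TAG = {"the": "DT", "to": "TO", "race": "NN", "expected": "VBN"}


def baseline_tags(words):
    """Tag each word with its single most likely tag (assumed default: NN)."""
    return [MOST_LIKELY_TAG.get(w.lower(), "NN") for w in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        # Context is read from the input tags, not from already-patched output.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


verb_use = baseline_tags(["expected", "to", "race"])
noun_use = baseline_tags(["the", "race"])

# The slide's rule: change NN to VB when the previous tag is TO.
print(apply_rule(verb_use, "NN", "VB", "TO"))
print(apply_rule(noun_use, "NN", "VB", "TO"))
```

The rule fires only after TO, so "the race" keeps its NN tag while "to race" is corrected to VB.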

Brill tagging - Rules
- Where did that transformational rule come from?
- Define a hypothesis space of rules that might help decrease the error rate
- Search that space (exhaustively?) to find the rules that most reduce the error rate
- Continue to add rules until some stopping criterion is reached

Figure 8.9 Brill's (1995) templates. Each begins with "Change tag a to tag b when: …". The variables a, b, z and w range over parts of speech.
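The search described on this slide can be sketched as a greedy loop. The Python below is a toy illustration under several assumptions: it uses a single template ("change tag a to tag b when the previous tag is z"), searches that space exhaustively, treats the corpus as one flat tag sequence, and stops when no rule reduces the error count or `max_rules` is reached. The function names and the five-token data are hypothetical.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply (a, b, z): change a to b when the previous tag is z."""
    a, b, z = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == z:
            out[i] = b
    return out


def errors(tags, gold):
    """Count positions where the current tags disagree with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(current, gold, tagset, max_rules=10):
    """Greedily pick the rule that removes the most errors, then repeat."""
    current = list(current)
    learned = []
    while len(learned) < max_rules:
        best_rule, best_gain = None, 0
        for a, b, z in product(tagset, repeat=3):  # the hypothesis space
            if a == b:
                continue
            candidate = apply_rule(current, (a, b, z))
            gain = errors(current, gold) - errors(candidate, gold)
            if gain > best_gain:
                best_rule, best_gain = (a, b, z), gain
        if best_rule is None:  # stopping criterion: nothing helps any more
            break
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned, current


# Toy data: baseline tags for "expected to race the race" and the gold tags.
baseline = ["VBN", "TO", "NN", "DT", "NN"]
gold = ["VBN", "TO", "VB", "DT", "NN"]
tagset = ["DT", "NN", "TO", "VB", "VBN"]

rules, final = learn_rules(baseline, gold, tagset)
print(rules)
print(final)
```

On this toy input the only rule with a positive gain is the one from the previous slide, so the loop learns it and then stops. A real learner would use all of the templates in Figure 8.9 and respect sentence boundaries.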
