
1 Natural Language Processing. Zhao Hai 赵海, Department of Computer Science and Engineering, Shanghai Jiao Tong University. zhaohai@cs.sjtu.edu.cn

2 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

3 What is an HMM? A graphical model: circles indicate states, arrows indicate probabilistic dependencies between states.

4 What is an HMM? Green circles are hidden states, each dependent only on the previous state: "the past is independent of the future given the present."

5 What is an HMM? Purple nodes are observed states, each dependent only on its corresponding hidden state.

6 HMM Formalism. An HMM is specified by (S, K, Π, A, B). S = {s_1, …, s_N} are the values of the hidden states; K = {k_1, …, k_M} are the values of the observations. [Trellis diagram: a chain of hidden states S, each emitting an observation K.]

7 HMM Formalism. Π = {π_i} are the initial state probabilities, A = {a_ij} are the state transition probabilities, and B = {b_ik} are the observation (emission) probabilities.

8 Inference in an HMM. Probability estimation: compute the probability of a given observation sequence. Decoding: given an observation sequence, compute the most likely hidden state sequence. Parameter estimation: given an observation sequence, find a model that most closely fits it.

9 Probability Estimation. Given an observation sequence and a model, compute the probability of the observation sequence. [Trellis diagram: observations o_1, …, o_T.]

10–14 Probability Estimation (cont'd). [Trellis diagrams of hidden states x_1, …, x_T emitting observations o_1, …, o_T. These slides develop the direct evaluation P(O) = Σ_X π_{x_1} b_{x_1 o_1} Π_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}, which sums over all N^T possible state sequences.]

15 Forward Procedure. [Trellis diagram.] The special structure gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences. Define α_i(t) = P(o_1 … o_t, x_t = i).

16–23 Forward Procedure (cont'd). [Trellis diagrams developing the recursion.] Initialization: α_i(1) = π_i b_{i o_1}. Induction: α_j(t+1) = Σ_i α_i(t) a_ij b_{j o_{t+1}}. Termination: P(O) = Σ_i α_i(T). This takes O(N^2 T) operations instead of enumerating all N^T state sequences.
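To make the forward recursion concrete, here is a minimal Python sketch (not from the original slides). It assumes the model is given as an initial distribution pi, a transition matrix A with A[i, j] = a_ij, and an emission matrix B with B[i, k] = b_ik, all as NumPy arrays; a toy usage example follows the function.

import numpy as np

def forward(pi, A, B, obs):
    """Forward variables; row t (0-based) holds the slides' alpha_i(t+1).

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities a_ij
    B   : (N, M) emission probabilities b_ik
    obs : list of 0-based observation indices o_1 .. o_T
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction
    return alpha, alpha[-1].sum()                      # termination: P(O | model)

# Toy 2-state, 2-symbol model (numbers are illustrative only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
alpha, prob = forward(pi, A, B, [0, 1, 1])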

24 Backward Procedure. Define β_i(t) = P(o_{t+1} … o_T | x_t = i), the probability of the remaining observations given the state at time t. It is computed by the backward recursion β_i(T) = 1 and β_i(t) = Σ_j a_ij b_{j o_{t+1}} β_j(t+1).

25 The Solution to Probability Estimation. Forward procedure: P(O) = Σ_i α_i(T). Backward procedure: P(O) = Σ_i π_i b_{i o_1} β_i(1). Combination: P(O) = Σ_i α_i(t) β_i(t), for any t.

26 Decoding: Best State Sequence. Find the state sequence that best explains the observations: the Viterbi algorithm.

27 Viterbi Algorithm. Define δ_j(t), the probability of the state sequence which maximizes the probability of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t: δ_j(t) = max over x_1 … x_{t-1} of P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t).

28 Viterbi Algorithm: Recursive Computation. δ_j(t+1) = max_i δ_i(t) a_ij b_{j o_{t+1}}, with backpointer ψ_j(t+1) = argmax_i δ_i(t) a_ij b_{j o_{t+1}}.

29 Viterbi Algorithm. Compute the most likely state sequence by working backwards: x̂_T = argmax_i δ_i(T), then x̂_t = ψ_{x̂_{t+1}}(t+1).
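A matching Python sketch of the Viterbi recursion and backtracking, under the same assumed (pi, A, B) parameterization as the forward sketch above; it is an illustration, not code from the slides.

import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for 0-based observation indices obs."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_i(t) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), delta[-1].max()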

30 Parameter Estimation. Given an observation sequence, find the model that is most likely to produce that sequence. There is no analytic method, so an EM algorithm (Baum-Welch) is used: given a model and an observation sequence, update the model parameters to better fit the observations.

31 Parameter Estimation. Using the forward and backward variables, define p_t(i, j) = α_i(t) a_ij b_{j o_{t+1}} β_j(t+1) / P(O), the probability of traversing the arc from state i to state j at time t, and γ_i(t) = Σ_j p_t(i, j), the probability of being in state i at time t.

32 Parameter Estimation. Now we can compute the new estimates of the model parameters: π̂_i = γ_i(1), â_ij = Σ_t p_t(i, j) / Σ_t γ_i(t), and b̂_ik = Σ_{t : o_t = k} γ_i(t) / Σ_t γ_i(t).
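A hedged sketch of one Baum-Welch re-estimation step for a single observation sequence, combining the forward and backward recursions above; the (pi, A, B) parameterization is assumed as before, and numerical scaling (needed for long sequences) is omitted for brevity.

import numpy as np

def forward(pi, A, B, obs):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    beta = np.zeros((len(obs), A.shape[0]))
    beta[-1] = 1.0
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM update of (pi, A, B) from a single observation sequence."""
    obs = np.asarray(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    prob = alpha[-1].sum()                   # P(O | current model)
    gamma = alpha * beta / prob              # gamma[t, i] = P(x_t = i | O)
    # xi[t, i, j] = p_t(i, j): probability of traversing the arc i -> j at time t
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B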

33 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

34 Limitations of HMM. "US official questions regulatory scrutiny of Apple." Problem 1: HMMs use only word identity; they cannot use richer representations (e.g. that "Apple" is capitalized). MEMM solution: use more descriptive features (b0: is-capitalized, b1: is-in-plural, b2: has-wordnet-antonym, b3: is-"the", etc.); real-valued features can also be handled. Here features are pairs <b, s>, where b is a feature of the observation and s is the destination state, e.g. the feature function f("Apple", Company) = 1.
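As a small illustration of the feature functions described above (an editorial sketch: the predicate names come from the slide, everything else is assumed), a binary observation predicate b paired with a destination state s yields a feature f_<b,s> that fires when both hold.

def observation_features(word):
    """Binary predicates b(o) over the observation, as on the slide."""
    return {
        "is-capitalized": word[:1].isupper(),
        "is-in-plural": word.endswith("s"),   # crude illustration only
        "is-the": word.lower() == "the",
    }

def feature_function(b_name, state):
    """An MEMM feature f_<b,s>: fires when predicate b holds for the
    observation and the destination state equals s."""
    def f(word, dest_state):
        fires = observation_features(word).get(b_name, False) and dest_state == state
        return 1.0 if fires else 0.0
    return f

f_cap_company = feature_function("is-capitalized", "Company")
print(f_cap_company("Apple", "Company"))   # 1.0, as in the slide's example
print(f_cap_company("apple", "Company"))   # 0.0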

35 HMMs vs. MEMMs (I). [Side-by-side graphical models of an HMM and an MEMM.]

36 HMMs vs. MEMMs (II). In an HMM, α_t(s) is the probability of producing o_1, …, o_t and being in s at time t; in an MEMM, α_t(s) is the probability of being in s at time t given o_1, …, o_t. In an HMM, δ_t(s) is the probability of the best path for producing o_1, …, o_t and being in s at time t; in an MEMM, δ_t(s) is the probability of the best path that reaches s at time t given o_1, …, o_t.

37 Maximum Entropy. Problem 2: HMMs are trained to maximize the likelihood of the training set (a generative, joint distribution), but they solve conditional problems, where the observations are given. MEMM solution: maximum entropy. Idea: use the least biased hypothesis, subject to what is known. Constraints: the expectation E_i of feature i under the learned distribution should equal its mean F_i on the training set, i.e. for every previous state s_0 and feature i, E_i = F_i.

38 More on MEMMs. It turns out that the maximum entropy distribution is unique and has an exponential form: P(s | s', o) = (1/Z(o, s')) exp(Σ_i λ_i f_i(o, s)). We can estimate the λ_i with Generalized Iterative Scaling (GIS): adding a correction feature so that the feature values sum to a constant C does not affect the solution; compute F_i from the training set; initialize the λ_i; then repeatedly compute the current expectation E_i of feature i under the model and update λ_i ← λ_i + (1/C) log(F_i / E_i).

39 Extensions. We can train even when the labels are not known, using EM: E step, determine the most probable state sequence and compute F_i; M step, run GIS. We can reduce the number of parameters to estimate by moving the previous state into the features: "subject-is-female", "previous-was-question", "is-verb-and-no-noun-yet". We can even add features regarding actions in a reinforcement learning setting: "slow-vehicle-encountered-and-steer-left". We can also mitigate data sparseness problems by simplifying the model.

40 Decoding of MEMM. An MEMM is trained just like a maximum entropy model; the only difference is in decoding: ME is a classifier, whereas an MEMM is a structured prediction tool, so the Viterbi algorithm is applied instead.

41 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

42 CRFs as a Sequence Labeling Tool. Conditional random fields (CRFs) are a statistical sequence modeling framework first introduced to natural language processing (NLP) to overcome the label bias problem. John Lafferty, Andrew McCallum and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282–289.

43 Sequence Segmenting and Labeling. Goal: mark up sequences with content tags. Applications in computational biology: DNA and protein sequence alignment, sequence homolog search in databases, protein secondary structure prediction, RNA secondary structure analysis. Applications in computational linguistics and computer science: text and speech processing (including topic segmentation and part-of-speech (POS) tagging), information extraction, syntactic disambiguation.

44 HMMs as Generative Models. Hidden Markov models (HMMs) assign a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of the training examples.

45 HMMs as Generative Models (cont'd). Difficulties and disadvantages: they need to enumerate all possible observation sequences; it is not practical to represent multiple interacting features or long-range dependencies of the observations; they impose very strict independence assumptions on the observations.

46 Conditional Models. Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x): specify the probability of possible label sequences given an observation sequence. This allows arbitrary, non-independent features of the observation sequence x; the probability of a transition between labels may depend on past and future observations. It relaxes the strong independence assumptions of generative models.

47 Discriminative Models: Maximum Entropy Markov Models (MEMMs). An exponential model. Given a training set X with label sequences Y: train a model θ that maximizes P(Y | X, θ); for a new data sequence x, the predicted label sequence y maximizes P(y | x, θ). Notice the per-state normalization.

48 MEMMs (cont'd). MEMMs have all the advantages of conditional models. Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass"). They are subject to the label bias problem: a bias toward states with fewer outgoing transitions.

49 Label Bias Problem. Consider this MEMM: P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r), and P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r). In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1 and hence P(2 | 1 and x) = 1 for all x. Therefore P(1 and 2 | ro) = P(1 and 2 | ri). However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro); per-state normalization does not allow the required expectation. http://wing.comp.nus.edu.sg/pipermail/graphreading/2005-September/000032.html http://hi.baidu.com/%BB%F0%D1%BF_ayouh/blog/item/338f13510d38e8441038c250.html

50 Solving the Label Bias Problem. Change the state-transition structure of the model (not always practical to change the set of states). Start with a fully-connected model and let the training procedure figure out a good structure; however, this precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction).

51 Random Field. [Figure illustrating a random field.]

52 Conditional Random Fields (CRFs). CRFs have all the advantages of MEMMs without the label bias problem: an MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state, whereas a CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence. The model is an undirected acyclic graph, and it allows some transitions to "vote" more strongly than others depending on the corresponding observations.

53 Definition of CRFs. X is a random variable over data sequences to be labeled; Y is a random variable over corresponding label sequences. (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to a graph G = (V, E): P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v), where w ~ v means w and v are neighbors in G.

54 Example of CRFs. [Figure: an example CRF over a label sequence Y conditioned on an observation sequence X.]

55 Graphical Comparison among HMMs, MEMMs and CRFs. [Figure: the graphical structures of an HMM, an MEMM, and a CRF.]

56 Conditional Distribution. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y given X = x is, by the fundamental theorem of random fields: p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) ). Here x is a data sequence, y a label sequence, v a vertex from the vertex set V (the set of label random variables), and e an edge from the edge set E over V. The f_k and g_k are given and fixed: g_k is a Boolean vertex feature and f_k is a Boolean edge feature, with k ranging over the features. The λ_k and μ_k are the parameters to be estimated. y|_e is the set of components of y defined by edge e, and y|_v is the set of components of y defined by vertex v.

57 Conditional Distribution (cont'd). CRFs use an observation-dependent normalization Z(x) for the conditional distributions: p_θ(y | x) = (1/Z(x)) exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) ), where Z(x) is a normalization constant computed over the data sequence x.
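For a linear-chain CRF, Z(x) can be computed with a forward-style dynamic program. The following Python sketch is illustrative (not from the slides): it assumes the vertex and edge feature sums have already been collapsed into per-position log potentials log_unary and a single transition matrix log_trans.

import numpy as np
from scipy.special import logsumexp

def crf_log_z(log_unary, log_trans):
    """Log partition function Z(x) of a linear-chain CRF.

    log_unary : (T, L) array, log_unary[t, y] = sum_k mu_k g_k(t, y, x)
    log_trans : (L, L) array, log_trans[y_prev, y] = sum_k lambda_k f_k(y_prev, y, x)
                (assumed constant over positions here, for brevity)
    """
    T, L = log_unary.shape
    log_alpha = log_unary[0].copy()
    for t in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + log_trans, axis=0) + log_unary[t]
    return logsumexp(log_alpha)

def crf_log_prob(y, log_unary, log_trans):
    """log p(y | x) = score(y, x) - log Z(x), for a label index sequence y."""
    score = log_unary[0, y[0]]
    for t in range(1, len(y)):
        score += log_trans[y[t - 1], y[t]] + log_unary[t, y[t]]
    return score - crf_log_z(log_unary, log_trans)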

58 Decoding: To Label an Unseen Sequence. We compute the most likely labeling y* = argmax_y p_θ(y | x) by dynamic programming, for efficient computation: the Viterbi algorithm.

59 Complexity Estimation. The time complexity of one iteration of L-BFGS parameter estimation is O(L^2 NMF), where L and N are, respectively, the numbers of labels and sequences (sentences), M is the average length of the sequences, and F is the average number of activated features per labeled clique.

60 CRF++: a CRFs Package. CRF++ is a simple, customizable, open-source implementation of conditional random fields (CRFs) for segmenting/labeling sequential data. http://crfpp.sourceforge.net/ Requirements: a C++ compiler (gcc 3.0 or higher). How to build: % ./configure ; % make ; % su ; # make install

61 CRF++: feature template representation and input file format.
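The slide showed the template table and input format as an image; as a hedged illustration based on the CRF++ documentation conventions (not on the slide itself), the training file has one token per line with the label in the last column and a blank line between sentences, and the template file expands macros %x[row,col] relative to the current token:

# train_file (character, label), using the segmentation example from slide 65:
她 S
来 S
自 S
苏 B
格 M
兰 E

# template_file: U* lines define unigram (per-state) features,
# B adds label-bigram features
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
B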

62 CRF++ training: % crf_learn -f 3 -c 1.5 template_file train_file model_file. CRF++ testing: % crf_test -m model_file test_files

63 Summary. Per-state normalized discriminative models such as MEMMs are prone to the label bias problem. CRFs provide the benefits of discriminative models and solve the label bias problem well, and they demonstrate good performance, but they are computationally expensive.

64 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

65 What is Chinese Word Segmentation? A special case of tokenization in natural language processing (NLP) for the many languages that have no explicit word delimiters such as spaces. Original (unsegmented): 她来自苏格兰 ("She comes from SU GE LAN"): meaningless! Segmented: 她 / 来 / 自 / 苏格兰 ("She comes from Scotland."): meaningful!

66 Learning from a Lexicon: the maximal matching algorithm for word segmentation. Input: a pre-defined lexicon and an unsegmented sequence. The algorithm (a Python sketch follows): ① starting from the current position, find the longest word in the lexicon that matches; ② set the character after the matched word as the new start position; ③ if the end of the sequence has been reached, stop; ④ otherwise, go to ①.
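A minimal sketch of forward maximal matching in Python; the lexicon, the maximum word length of 4, and the single-character fallback are assumptions for illustration, not part of the slide.

def maximal_match(text, lexicon, max_len=4):
    """Forward maximal matching: at each position take the longest lexicon
    word; fall back to a single character if nothing matches."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in lexicon:
                words.append(cand)
                i += l
                break
    return words

lexicon = {"苏格兰", "来自"}
print(maximal_match("她来自苏格兰", lexicon))   # ['她', '来自', '苏格兰']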

67 Learning from a Segmented Corpus: Word Segmentation as Labeling. 自然科学的研究不断深入 is segmented as 自然科学 / 的 / 研究 / 不断 / 深入 (natural science / of / research / uninterruptedly / deepen), with character labels BMME / S / BE / BE / BE, where B = beginning of a word, M = middle, E = end, and S = a single-character word. CRFs are used as the learning model; a small conversion sketch follows.
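A small editorial sketch (not from the slides) of turning a segmented sentence into the per-character B/M/E/S labels used above.

def words_to_tags(words):
    """Map each word to character tags: S for single-character words,
    otherwise B (first), M (middle), E (last)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

words = ["自然科学", "的", "研究", "不断", "深入"]
print(list(zip("".join(words), words_to_tags(words))))
# 自/B 然/M 科/M 学/E 的/S 研/B 究/E 不/B 断/E 深/B 入/E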

68 CWS as Character-based Tagging: from the beginning to the latest. Nianwen Xue, 2003. Chinese Word Segmentation as Character Tagging. CLCLP, Vol. 8(1), 2003. Xiaoqiang Luo, 2003. A Maximum Entropy Chinese Character-based Parser. EMNLP 2003. Hwee Tou Ng and Jin Kiat Low, 2004. Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-based or Character-based? EMNLP 2004. Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo, 2005. A Maximum Entropy Approach to Chinese Word Segmentation. The 4th SIGHAN Workshop on CLP, 2005. Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning, 2005. A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005. The 4th SIGHAN Workshop on CLP, 2005.

69 Label Sets. 2-tag set: tags B, E; multi-character words labeled B, BE, BEE, …; used mostly for CRF. 4-tag set: tags B, M, E, S; words labeled S, BE, BME, BMME, …; Xue / Low (MaxEnt). 6-tag set: tags B, M, E, S, B2, B3; words labeled S, BE, BB2E, BB2B3E, BB2B3ME, …; Zhao (CRF). More labels give better performance for CWS.

70 Feature Template Set. C-1, C0, C1, C-1C0, C0C1, C-1C1, where C-1, C0 and C1 are the previous, current and next characters respectively. A small extraction sketch follows.
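A sketch of extracting this template's features for one character position (illustrative; the boundary padding symbols are assumptions).

def char_features(chars, i):
    """Unigram and bigram character features for position i;
    '<s>'/'</s>' pad the sentence boundaries."""
    c_prev = chars[i - 1] if i > 0 else "<s>"
    c_cur = chars[i]
    c_next = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {
        "C-1": c_prev, "C0": c_cur, "C1": c_next,
        "C-1C0": c_prev + c_cur, "C0C1": c_cur + c_next, "C-1C1": c_prev + c_next,
    }

print(char_features(list("自然科学的研究"), 4))   # features for the character 的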

71 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

72 Parts of Speech. Generally speaking, the "grammatical type" of a word: verb, noun, adjective, adverb, article, … We can also include inflection: verbs (tense, number, …), nouns (number, proper/common, …), adjectives (comparative, superlative, …), and so on. The most commonly used POS tag sets for English have 50–80 different tags.

73 BNC Parts of Speech. Nouns: NN0 common noun, neutral for number (e.g. aircraft); NN1 singular common noun (e.g. pencil, goose, time); NN2 plural common noun (e.g. pencils, geese, times); NP0 proper noun (e.g. London, Michael, Mars, IBM). Pronouns: PNI indefinite pronoun (e.g. none, everything, one); PNP personal pronoun (e.g. I, you, them, ours); PNQ wh-pronoun (e.g. who, whoever, whom); PNX reflexive pronoun (e.g. myself, itself, ourselves).

74 Verbs: VVB finite base form of lexical verbs (e.g. forget, send, live, return); VVD past tense form of lexical verbs (e.g. forgot, sent, lived); VVG -ing form of lexical verbs (e.g. forgetting, sending, living); VVI infinitive form of lexical verbs (e.g. forget, send, live, return); VVN past participle form of lexical verbs (e.g. forgotten, sent, lived); VVZ -s form of lexical verbs (e.g. forgets, sends, lives, returns); VBB present tense of BE, except for is (and so on: VBD, VBG, VBI, VBN, VBZ); VDB finite base form of DO: do (and so on: VDD, VDG, VDI, VDN, VDZ); VHB finite base form of HAVE: have, 've (and so on: VHD, VHG, VHI, VHN, VHZ); VM0 modal auxiliary verb (e.g. will, would, can, could, 'll, 'd).

75 Articles: AT0 article (e.g. the, a, an, no); DPS possessive determiner (e.g. your, their, his); DT0 general determiner (this, that); DTQ wh-determiner (e.g. which, what, whose, whichever); EX0 existential there, i.e. occurring in "there is…" or "there are…". Adjectives: AJ0 adjective, general or positive (e.g. good, old, beautiful); AJC comparative adjective (e.g. better, older); AJS superlative adjective (e.g. best, oldest). Adverbs: AV0 general adverb (e.g. often, well, longer (adv.), furthest); AVP adverb particle (e.g. up, off, out); AVQ wh-adverb (e.g. when, where, how, why, wherever).

76 Miscellaneous: CJC coordinating conjunction (e.g. and, or, but); CJS subordinating conjunction (e.g. although, when); CJT the subordinating conjunction that; CRD cardinal number (e.g. one, 3, fifty-five, 3609); ORD ordinal numeral (e.g. first, sixth, 77th, last); ITJ interjection or other isolate (e.g. oh, yes, mhm, wow); POS the possessive or genitive marker 's or '; TO0 infinitive marker to; PUL punctuation: left bracket, i.e. ( or [; PUN punctuation: general separating mark, i.e. . , ! : ; - or ?; PUQ punctuation: quotation mark, i.e. ' or "; PUR punctuation: right bracket, i.e. ) or ]; XX0 the negative particle not or n't; ZZ0 alphabetical symbols (e.g. A, a, B, b, c, d).

77 Task: Part-of-Speech Tagging. Goal: assign the correct part of speech to each word (and punctuation mark) in a text. Learn a local model of POS dependencies, usually from pre-tagged data; no parsing. Example: Two/CRD old/AJ0 men/NN2 bet/VVD on/PP0 the/AT0 game/NN1 ./PUN

78 Hidden Markov Models. Assume the POS (state) sequence is generated by a time-invariant random process, and each POS randomly generates a word (output symbol). [Figure: an example HMM with tag states AT0, AJ0, NN1, NN2, transition probabilities (0.5, 0.3, 0.2, …), and emission probabilities for words such as "the", "a", "cat", "bet", "cats", "men".]

79 Definition of an HMM for Tagging. Set of states: all possible tags. Output alphabet: all words in the language. State/tag transition probabilities. Initial state probabilities: the probability of beginning a sentence with tag t, P(t_0 → t). Output probabilities: the probability of producing word w at state t. Output sequence: the observed word sequence. State sequence: the underlying tag sequence.

80 HMMs for Tagging. First-order (bigram) Markov assumptions. Limited horizon: a tag depends only on the previous tag, P(t_{i+1} = t^k | t_1 = t^{j_1}, …, t_i = t^{j_i}) = P(t_{i+1} = t^k | t_i = t^{j_i}). Time invariance: no change over time, P(t_{i+1} = t^k | t_i = t^j) = P(t_2 = t^k | t_1 = t^j) = P(t^j → t^k). Output probabilities: the probability of getting word w^k for tag t^j is P(w^k | t^j); assumption: it does not depend on other tags or words.

81 Combining Probabilities. Probability of a tag sequence: P(t_1 t_2 … t_n) = P(t_1) P(t_1 → t_2) P(t_2 → t_3) … P(t_{n-1} → t_n). Assuming a starting tag t_0, this equals P(t_0 → t_1) P(t_1 → t_2) P(t_2 → t_3) … P(t_{n-1} → t_n). Probability of a word sequence and tag sequence: P(W, T) = Π_i P(t_{i-1} → t_i) P(w_i | t_i).

82 Training from a Labeled Corpus. Labeled training data: each word has a POS tag. Thus: P_MLE(t^j) = C(t^j) / N; P_MLE(t^j → t^k) = C(t^j, t^k) / C(t^j); P_MLE(w^k | t^j) = C(t^j : w^k) / C(t^j). Smoothing can be applied. A counting sketch follows.
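A minimal counting sketch of these maximum-likelihood estimates (illustrative; smoothing and the handling of unknown words are omitted).

from collections import Counter

def train_hmm(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...]; returns MLE estimates
    of the transition and emission probabilities, with a <START> tag."""
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<START>"
        tag_count[prev] += 1
        for word, tag in sent:
            tag_count[tag] += 1
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            prev = tag
    p_trans = {tt: c / tag_count[tt[0]] for tt, c in trans_count.items()}
    p_emit = {tw: c / tag_count[tw[0]] for tw, c in emit_count.items()}
    return p_trans, p_emit

corpus = [[("Two", "CRD"), ("old", "AJ0"), ("men", "NN2"), ("bet", "VVD")]]
p_trans, p_emit = train_hmm(corpus)
print(p_trans[("<START>", "CRD")], p_emit[("NN2", "men")])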

83 Viterbi Tagging. The most probable tag sequence given the text: T* = argmax_T P_m(T | W) = argmax_T P_m(W | T) P_m(T) / P_m(W) (Bayes' theorem) = argmax_T P_m(W | T) P_m(T) (P_m(W) is constant for all T) = argmax_T Π_i [ m(t_{i-1} → t_i) m(w_i | t_i) ] = argmax_T Σ_i log[ m(t_{i-1} → t_i) m(w_i | t_i) ]. There is an exponential number of possible tag sequences, so dynamic programming is used for efficient computation.

84 [Worked example: tables of -log m transition and emission values for three tags t_1, t_2, t_3 and three words w_1, w_2, w_3, and the resulting trellis showing the accumulated -log scores of the best paths.]

85 Viterbi Algorithm. 1. D(0, START) = 0. 2. For each tag t ≠ START: D(0, t) = -∞. 3. For i = 1 to N, for each tag t_j: D(i, t_j) = max_k [ D(i-1, t_k) + lm(t_k → t_j) ] + lm(w_i | t_j), recording best(i, j) = k which yielded the max. 4. log P(W, T) = max_j D(N, t_j). 5. Reconstruct the path from the maximizing j backwards via best(i, j). Here lm(.) = log m(.), and D(i, t_j) is the maximum joint log-probability of the state and word sequences up to position i, ending at t_j. Complexity: O(N_t^2 N), where N_t is the number of tags and N the sentence length.

86 Overview. Models: HMM (hidden Markov model), maximum entropy Markov model (MEMM), CRFs (conditional random fields). Tasks: Chinese word segmentation, part-of-speech tagging, named entity recognition.

87 Named Entity Recognition and Classification. The problem of NE tagging: let W be a sequence of words, W = w_1, w_2, …, w_n, and let T be the corresponding NE tag sequence, T = t_1, t_2, …, t_n. Task: find the T which maximizes P(T | W), i.e. T' = argmax_T P(T | W).

88 Supervised NERC Systems (ME, CRF and SVM). Limitations of HMMs: using only local features may not work well; simple HMM models do not work well when large data sets are not available to estimate the model parameters; incorporating a diverse set of features into an HMM-based NE tagger is difficult and complicates smoothing. Solution: a maximum entropy (ME) model, conditional random field (CRF) or support vector machine (SVM), all of which can make use of rich feature information. ME model: a very flexible method of statistical modeling; a combination of several features can be easily incorporated; careful feature selection plays a crucial role; it does not provide a method for automatic selection of useful features; features are selected using heuristics; adding arbitrary features may result in overfitting.

89 Supervised NERC Systems (ME, CRF and SVM). CRF: does not require careful feature selection in order to avoid overfitting; freedom to include arbitrary features; ability of feature induction to automatically construct the most useful feature combinations; supports conjunctions of features, although it is infeasible to incorporate all possible conjunction features due to memory overflow; good at handling different types of data. SVM: predicts the classes depending on the labeled word examples only; predicts the NEs based on feature information of words collected in a predefined window only; cannot handle NEs outside these tokens; achieves high generalization even with training data of very high dimension; can handle non-linear feature spaces with kernel functions.

90 Named Entity Features. Language-independent features: can be applied to NERC in any language. Language-dependent features: generated from language-specific resources such as gazetteers and POS taggers; Indian languages are resource-constrained; creating gazetteers in a resource-constrained environment requires a priori knowledge of the language; POS information depends on language-specific phenomena such as person, number, tense, gender, etc.; the POS tagger (Ekbal and Bandyopadhyay, 2008d) makes use of several language-specific resources such as a lexicon, an inflection list and a NERC system to improve its performance. Language-dependent features improve system performance.

91 Language-Independent Features. Context words: preceding and succeeding words. Word suffix: not necessarily linguistic suffixes; fixed-length character strings stripped from the endings of words; variable-length suffix as a binary-valued feature. Word prefix: fixed-length character strings stripped from the beginnings of words. Named entity information: dynamic NE tag(s) of the previous word(s). First word (binary-valued feature): whether the current token is the first word of the sentence.

92 Language-Independent Features (cont'd). Length (binary-valued): whether the length of the current word is less than three (shorter words are rarely NEs). Position (binary-valued): the position of the word in the sentence. Infrequent (binary-valued): infrequent words in the training corpus are most probably NEs. Digit features (binary-valued): the presence and/or the exact number of digits in a token; CntDgt: the token contains digits; FourDgt: the token consists of four digits; TwoDgt: the token consists of two digits; CnsDgt: the token consists of digits only.

93 Language-Independent Features (cont'd). Combinations of digits and punctuation symbols: CntDgtCma, the token consists of digits and commas; CntDgtPrd, the token consists of digits and periods. Combinations of digits and symbols: CntDgtSlsh, digits and slashes; CntDgtHph, digits and hyphens; CntDgtPrctg, digits and percentage signs. Combinations of digits and special symbols: CntDgtSpl, digits and a special symbol such as $ or #. A small extraction sketch follows.
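A sketch of these digit features as binary predicates (illustrative; the regular expressions are assumptions approximating the slide's descriptions).

import re

def digit_features(token):
    """Binary orthographic features over digits, per the slide's inventory."""
    return {
        "CntDgt": bool(re.search(r"\d", token)),          # contains a digit
        "CnsDgt": bool(re.fullmatch(r"\d+", token)),      # digits only
        "FourDgt": bool(re.fullmatch(r"\d{4}", token)),   # exactly four digits
        "TwoDgt": bool(re.fullmatch(r"\d{2}", token)),    # exactly two digits
        "CntDgtCma": bool(re.fullmatch(r"[\d,]*\d[\d,]*", token)) and "," in token,
        "CntDgtPrd": bool(re.fullmatch(r"[\d.]*\d[\d.]*", token)) and "." in token,
        "CntDgtSlsh": bool(re.fullmatch(r"[\d/]*\d[\d/]*", token)) and "/" in token,
        "CntDgtHph": bool(re.fullmatch(r"[\d-]*\d[\d-]*", token)) and "-" in token,
        "CntDgtPrctg": bool(re.fullmatch(r"\d+%", token)),
    }

print(digit_features("1947"))    # CntDgt, CnsDgt and FourDgt fire
print(digit_features("26/01"))   # CntDgt and CntDgtSlsh fire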

94 CRF-based NERC System: Feature Templates. Features are represented in terms of feature templates. [Table: the feature templates used in the experiment.]

95 Best Feature Sets for ME, CRF and SVM. ME: word, context (preceding one and following one word), prefixes and suffixes of length up to three characters of the current word only, dynamic NE tag of the previous word, first word of the sentence, infrequent word, length of the word, digit features. CRF: word, context (preceding two and following two words), prefixes and suffixes of length up to three characters of the current word only, dynamic NE tag of the previous word, first word of the sentence, infrequent word, length of the word, digit features. SVM-F: word, context (preceding three and following two words), prefixes and suffixes of length up to three characters of the current word only, dynamic NE tags of the previous two words, first word of the sentence, infrequent word, length of the word, digit features. SVM-B: the same feature set as SVM-F. Best feature set selection: trained with language-independent features and tested on the development set.

96 Language-Dependent Evaluation (ME, CRF and SVM). Observations: classifiers were trained with the best set of language-independent as well as language-dependent features; POS information of the words is very effective; a coarse-grained POS tagger (Nominal, PREP and Other) was used for ME and CRF, and a fine-grained POS tagger (developed with 27 POS tags) for the SVM-based systems. Best performance of ME: POS information of the current word only (an improvement of 2.02% F-score). Best performance of CRF: POS information of the current, previous and next words (an improvement of 3.04% F-score). Best performance of SVM: POS information of the current, previous and next words (an improvement of 2.37% F-score in SVM-F and 2.32% in SVM-B). NE suffixes, organization suffix words, person prefix words, designations and common location words are more effective than other gazetteers.

97 References. HMM: http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt. MEMM: www.cs.cornell.edu/courses/cs778/2006fa/lectures/05-memm.pdf. CRFs: web.engr.oregonstate.edu/~tgd/classes/539/slides/Shen-CRF.ppt. POS tagging: cs.haifa.ac.il/~shuly/teaching/04/statnlp/pos-tagging.ppt. NER: www.cl.uni-heidelberg.de/colloquium/docs/ekbal_abstract.pdf

