NAME TAGGING Heng Ji September 23, 2014.


1 NAME TAGGING Heng Ji September 23, 2014

2 Introduction
IE Overview
Supervised Name Tagging: Features; Models
Advanced Techniques and Trends: Re-ranking and Global Features (Ji and Grishman, 2005; 2006); Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

3 Using IE to Produce "Food" for Watson
(In this talk) Information Extraction (IE) = identifying instances of facts (names/entities, relations and events) in semi-structured or unstructured text, and converting them into structured representations (e.g., databases)
Example: "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment."
Trigger: quit (a "Personnel/End-Position" event)
Arguments:
Role = Person: Barry Diller
Role = Organization: Vivendi Universal Entertainment
Role = Position: chief
Role = Time-within: Wednesday

4 Supervised Learning based IE
'Pipeline' style IE:
Split the task into several components
Prepare data annotation for each component
Apply supervised machine learning methods to address each component separately
Most state-of-the-art ACE IE systems were developed in this way
Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
Substantial progress has been achieved on some of these components, such as name tagging and relation extraction

5 Major IE Components
Name/Nominal Extraction: "Barry Diller", "chief"
Entity Coreference Resolution: "Barry Diller" = "chief"
Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
Event Mention Extraction and Event Coreference Resolution: "Barry Diller" is the person of the end-position event triggered by "quit"
Time Identification and Normalization: Wednesday

6 Introduction
IE Overview
Supervised Name Tagging: Features; Models
Advanced Techniques and Trends: Re-ranking and Global Features (Ji and Grishman, 2005; 2006); Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

7 Name Tagging
Name Tagging = Recognition + Classification: "Name Identification and Classification"
NER as: a tool or component of IE and IR; an input module for a robust shallow parsing engine
Component technology for other areas: Question Answering (QA), Summarization, Automatic translation, Document indexing, Text data mining, Genetics, ...

8 Name Tagging: NE Hierarchies
Person, Organization, Location
But also: Artifact, Facility, Geopolitical entity, Vehicle, Weapon, etc.
SEKINE & NOBATA (2004): 150 types, domain-dependent
Abstract Meaning Representation (amr.isi.edu): 200+ types

9 Name Tagging
Handcrafted systems: knowledge (rule) based; patterns; gazetteers
Automatic systems: statistical; machine learning; unsupervised; analyze char type, POS, lexical info, dictionaries
Hybrid systems

10 Name Tagging: Handcrafted systems
LTG: highest F-measure in MUC-7 (the best system); ltquery, XML internal representation; tokenizer, POS-tagger, SGML transducer
Nominator (1997), IBM: heavy heuristics; cross-document co-reference resolution; used later in IBM Intelligent Miner

11 Name Tagging: Handcrafted systems
LaSIE (Large Scale Information Extraction): MUC-6 (LaSIE II in MUC-7); Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering); JAPE language
FACILE (1998): NEA language (Named Entity Analysis); context-sensitive rules
NetOwl (MUC-7): commercial product; C++ engine, extraction rules

12 Automatic approaches
Learning of statistical models or symbolic rules
Use of an annotated text corpus: manually annotated or automatically annotated
"BIO" tagging. Tags: Begin, Inside, Outside an NE
Probabilities: simple: P(tag_i | token_i); with external evidence: P(tag_i | token_i-1, token_i, token_i+1)
"OpenClose" tagging: two classifiers, one for the beginning of an NE, one for the end

13 Automatic approaches: Decision trees
Tree-structured sequence of tests on every word
Determine probabilities of having a BIO tag
Use a training corpus
Viterbi, ID3, C4.5 algorithms
Select the most probable tag sequence
SEKINE et al. (1998), BALUJA et al. (1999): F-measure 90%

14 Automatic approaches: HMM and ME
HMM: Markov models, Viterbi
Separate statistical model for each NE category + a model for words outside NEs
Nymble (1997) / IdentiFinder (1999)
Maximum Entropy (ME): separate, independent probabilities for every piece of evidence (external and internal features), merged multiplicatively
MENE (NYU): capitalization, many lexical features, type of text; F-measure 89%

15 Automatic approaches: Hybrid systems
Combination of techniques
IBM's Intelligent Miner: Nominator + DB/2 data mining
WordNet hierarchies: MAGNINI et al. (2002)
Stacks of classifiers: AdaBoost algorithm
Bootstrapping approaches: small set of seeds
Memory-based ML, etc.

16 NER in various languages
Arabic: TAGARAB (1998): pattern-matching engine + morphological analysis; lots of morphological info (no differences in orthographic case)
Bulgarian: OSENOVA & KOLKOVSKA (2002): handcrafted cascaded regular NE grammar; pre-compiled lexicon and gazetteers
Catalan: CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003): extract Catalan NEs with Spanish resources (F-measure 93%); bootstrap using Catalan texts

17 NER in various languages
Chinese & Japanese: many works; special characteristics: character- or word-based, no capitalization
CHINERS (2003): sports domain; machine learning; shallow parsing technique
ASAHARA & MATSUMOTO (2003): character-based method; Support Vector Machine; 87.2% F-measure in the IREX evaluation (outperformed most word-based systems)

18 NER in various languages
Dutch: DE MEULDER et al. (2002): hybrid system; gazetteers, grammars of names; machine learning (Ripper algorithm)
French: BÉCHET et al. (2000): decision trees; Le Monde news corpus
German: non-proper nouns are also capitalized; THIELEN (1995): incremental statistical approach; 65% of proper names correctly disambiguated

19 NER in various languages
Greek: KARKALETSIS et al. (1998): English-Greek GIE (Greek Information Extraction) project; GATE platform
Italian: CUCCHIARELLI et al. (1998): merge rule-based and statistical approaches; gazetteers; context-dependent heuristics; ECRAN (Extraction of Content: Research at Near Market); GATE architecture; lack of linguistic resources: 20% of NEs undetected
Korean: CHUNG et al. (2003): rule-based model, Hidden Markov Model, boosting approach over unannotated data

20 NER in various languages
Portuguese: SOLORIO & LÓPEZ (2004, 2005): adapted the Spanish NER of CARRERAS et al. (2002b); Brazilian newspapers
Serbo-Croatian: NENADIC & SPASIC (2000): hand-written grammar rules; highly inflective language; lots of lexical and lemmatization pre-processing; dual alphabet (Cyrillic and Latin): pre-processing stores the text in an independent format

21 NER in various languages
Spanish: CARRERAS et al. (2002b): machine learning, AdaBoost algorithm; BIO and OpenClose approaches
Swedish: SweNam system (DALIANIS & ASTROM, 2001): Perl; machine learning techniques and matching rules
Turkish: TUR et al. (2000): Hidden Markov Model and Viterbi search; lexical, morphological and context clues

22 Exercise: finding name identification errors in

23 Name Tagging: Task
Person (PER): named person or family
Organization (ORG): named corporate, governmental, or other organizational entity
Geo-political entity (GPE): name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)
Convert it into a sequence labeling problem, using "BIO" tagging:
George/B-PER W./I-PER Bush/I-PER discussed/O Iraq/B-GPE
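A minimal sketch of the BIO conversion above; the span-based input format (token-index spans with an entity type) is an assumption for illustration:

```python
# Convert labeled name spans into "BIO" tags, as in the slide's example.
def to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # Begin of the NE
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # Inside the NE
    return tags

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
spans = [(0, 3, "PER"), (4, 5, "GPE")]
print(to_bio(tokens, spans))
```

This reproduces the tag sequence shown on the slide: B-PER I-PER I-PER O B-GPE.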

24 Quiz Time!
(1) Faisalabad's Catholic Bishop John Joseph, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in
(2) Next, film clips featuring Herrmann's Hollywood music mingle with a suite from "Psycho," followed by "La Belle Dame sans Merci," which he wrote in 1934 during his time at CBS Radio.

25 Supervised Learning for Name Tagging
Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
Decision Trees (Sekine et al., 1998)
Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
Agent-based Approach (Ye et al., 2002)
Support Vector Machines (Takeuchi and Collier, 2002)
Sequence Labeling Models:
Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
Conditional Random Fields (CRFs) (McCallum and Li, 2003)

26 Typical Name Tagging Features
N-gram: unigram, bigram and trigram token sequences in the context window of the current token
Part-of-Speech: POS tags of the context words
Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
Word clusters: to reduce sparsity, using word clusters such as Brown clusters (Brown et al., 1992)
Case and Shape: capitalization and morphology analysis based features
Chunking: NP and VP chunking tags
Global features: sentence-level and document-level features, e.g. whether the token is in the first sentence of a document
Conjunction: conjunctions of various features
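A hedged sketch of what a feature extractor for one token position might look like; the feature names, the `shape()` abstraction, and the toy gazetteer are illustrative assumptions, not a specific system's feature set:

```python
# Extract a few typical name-tagging features for tokens[i].
def shape(word):
    # Map characters to a coarse shape, e.g. "Iraq" -> "Xxxx", "3M" -> "dX"
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in word)

def token_features(tokens, i, gazetteer=frozenset()):
    feats = {
        "word": tokens[i].lower(),
        "shape": shape(tokens[i]),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "in_gazetteer": tokens[i] in gazetteer,
    }
    # Conjunction feature: combine neighboring features into one
    feats["prev+word"] = feats["prev_word"] + "_" + feats["word"]
    return feats

print(token_features(["George", "W.", "Bush"], 2, gazetteer={"Bush"}))
```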

27 Markov Chain for a Simple Name Tagger
[Figure: a state diagram with states START, PER, LOC, X, END; arcs carry transition probabilities and states carry emission probabilities such as George: 0.3, W.: 0.3, discussed: 0.7, $: 1.0, Bush: 0.3, Iraq: 0.1, George: 0.2, Iraq: 0.8]
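A sketch of how such a Markov chain scores one tag sequence: the joint probability is the product of transition and emission probabilities along the path. The transition values below are illustrative stand-ins, since the slide's diagram did not fully survive transcription:

```python
# Score a tag sequence under a toy HMM with explicit transition/emission tables.
trans = {("START", "PER"): 0.5, ("PER", "PER"): 0.4, ("PER", "X"): 0.3,
         ("X", "LOC"): 0.3, ("LOC", "END"): 0.5}
emit = {("PER", "George"): 0.3, ("PER", "W."): 0.3, ("PER", "Bush"): 0.3,
        ("X", "discussed"): 0.7, ("LOC", "Iraq"): 0.8, ("LOC", "George"): 0.2}

def joint_prob(tokens, tags):
    # p(y, x) = product over positions of transition(prev -> tag) * emission(tag -> token)
    p, prev = 1.0, "START"
    for tok, tag in zip(tokens, tags):
        p *= trans.get((prev, tag), 0.0) * emit.get((tag, tok), 0.0)
        prev = tag
    return p * trans.get((prev, "END"), 0.0)

print(joint_prob(["George", "W.", "Bush", "discussed", "Iraq"],
                 ["PER", "PER", "PER", "X", "LOC"]))
```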

28 Viterbi Decoding of Name Tagger
[Figure: a Viterbi trellis decoding "George W. Bush discussed Iraq $" over states START, PER, LOC, X, END at time steps t=0 through t=6]
Current = Previous * Transition * Emission

29 Limitations of HMMs
Joint probability distribution p(y, x)
Assume independent features: cannot represent overlapping features or long-range dependences between observed elements
Need to enumerate all possible observation sequences
Very strict independence assumptions on the observations
Toward discriminative/conditional models:
Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
Allow arbitrary, non-independent features on the observation sequence x
The probability of a transition between labels may depend on past and future observations
Relax the strong independence assumptions of generative models

30 Maximum Entropy
Why maximum entropy? Maximize entropy = minimize commitment
Model all that is known and assume nothing about what is unknown:
Model all that is known: satisfy a set of constraints that must hold
Assume nothing about what is unknown: choose the most "uniform" distribution, i.e. the one with maximum entropy

31 Why Try to be Uniform?
Most uniform = maximum entropy
By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
Abides by the principle of Occam's Razor (fewest assumptions = simplest explanation)
Fewer generalization errors (less over-fitting), hence more accurate predictions on test data

32 Learning by a Maximum Entropy Model
Suppose that if the feature "Capitalization" = "Yes" for token t, then P(t is the beginning of a name | Capitalization = Yes) = 0.7
How do we adjust the distribution? P(t is not the beginning of a name | Capitalization = Yes) = 0.3
What if we don't observe any "Has Title = Yes" samples?
P(t is the beginning of a name | Has Title = Yes) = 0.5
P(t is not the beginning of a name | Has Title = Yes) = 0.5

33 The basic idea
Goal: estimate p
Choose p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence")

34 Setting
From training data, collect (a, b) pairs:
a: the thing to be predicted (e.g., a class in a classification problem)
b: the context
Ex: name tagging: a = person, b = the words in a window and the previous two tags
Learn the probability of each pair: p(a, b)

35 Ex1: Coin-flip example (Klein & Manning, 2003)
Toss a coin: p(H) = p1, p(T) = p2
Constraint: p1 + p2 = 1
Question: what's your estimate of p = (p1, p2)?
Answer: choose the p that maximizes H(p)
[Figure: H as a function of p1]

36 Coin-flip example (cont.)
[Figure: H as a function of (p1, p2) along the constraint line p1 + p2 = 1]
With the constraints p1 + p2 = 1.0 and p1 = 0.3, the distribution is fully determined: p = (0.3, 0.7)
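A small numeric sketch of the coin-flip example: among distributions satisfying p1 + p2 = 1, entropy H peaks at the uniform point, and the extra constraint p1 = 0.3 pins the distribution down entirely. The grid search is just for illustration:

```python
import math

def H(p):
    # Shannon entropy in bits; 0*log(0) is taken as 0
    return -sum(x * math.log2(x) for x in p if x > 0)

# All distributions on the constraint line p1 + p2 = 1
candidates = [(i / 100, 1 - i / 100) for i in range(1, 100)]
best = max(candidates, key=H)
print(best)                 # the uniform distribution maximizes H
print(H((0.3, 0.7)))        # entropy once p1 = 0.3 is also imposed
```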

37 Ex2: An MT example (Berger et al., 1996)
Possible translations for the word "in": dans, en, à, au cours de, pendant
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: 1/5 for each translation

38 An MT example (cont.)
Additional constraint: p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

39 Why ME? Advantages
Combine multiple knowledge sources:
Local: word prefix, suffix, capitalization (POS tagging (Ratnaparkhi, 1996)); word POS, POS class, suffix (WSD (Chao & Dyer, 2002)); token prefix, suffix, capitalization, abbreviation (sentence boundary detection (Reynar & Ratnaparkhi, 1997))
Global: N-grams (Rosenfeld, 1997); word window; document title (Pakhomov, 2002); structurally related words (Chao & Dyer, 2002); sentence length, conventional lexicon (Och & Ney, 2002)
Combine dependent knowledge sources

40 Why ME? Advantages and Disadvantages
Advantages: add additional knowledge sources; implicit smoothing
Disadvantages:
Computational: expected value at each iteration; normalizing constant
Overfitting: feature selection; cutoffs; basic feature selection (Berger et al., 1996)

41 Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state
Considers observation sequences to be events to be conditioned upon
Has all the advantages of conditional models: no longer assumes that features are independent
Does not take future observations into account (no forward-backward)
Subject to the Label Bias Problem: bias toward states with fewer outgoing transitions

42 Conditional Random Fields (CRFs)
Conceptual overview:
Each attribute of the data fits into a feature function that associates the attribute and a possible label
A positive value if the attribute appears in the data; a zero value if the attribute is not in the data
Each feature function carries a weight that gives the strength of that feature function for the proposed label
High positive weights: a good association between the feature and the proposed label
High negative weights: a negative association between the feature and the proposed label
Weights close to zero: the feature has little or no impact on the identity of the label
CRFs have all the advantages of MEMMs without the label bias problem:
A MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
Weights of different features at different states can be traded off against each other
CRFs provide the benefits of discriminative models
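The feature-function idea above can be sketched as indicator functions over (previous label, label, observation, position), each with a weight; summing weighted features over the whole sequence gives the unnormalized global log-score a CRF assigns. The two features and their weights below are illustrative assumptions:

```python
# CRF-style feature functions: 1.0 when the attribute/label combination fires.
def f_cap_begins_name(prev_y, y, x, i):
    return 1.0 if y == "B-PER" and x[i][:1].isupper() else 0.0

def f_after_begin_inside(prev_y, y, x, i):
    return 1.0 if prev_y == "B-PER" and y == "I-PER" else 0.0

features = [(f_cap_begins_name, 1.8), (f_after_begin_inside, 2.3)]

def score(x, ys):
    """Unnormalized log-score of a whole label sequence (a single global model)."""
    total, prev = 0.0, "START"
    for i, y in enumerate(ys):
        total += sum(w * f(prev, y, x, i) for f, w in features)
        prev = y
    return total

x = ["George", "Bush", "spoke"]
print(score(x, ["B-PER", "I-PER", "O"]))   # both features fire
print(score(x, ["O", "O", "O"]))           # neither feature fires
```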

43 Example of CRFs
[Figure not transcribed]

44 Sequential Model Trade-offs
Model | Speed | Discriminative vs. Generative | Normalization
HMM | very fast | generative | local
MEMM | mid-range | discriminative | local
CRF | relatively slow | discriminative | global

45 State-of-the-art and Remaining Challenges
State-of-the-art performance:
On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
On CoNLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)
Remaining challenges:
Identification, especially of organizations
Boundary errors: "Asian Pulp and Paper Joint Stock Company, Ltd. of Singapore"
Need coreference resolution or context event features: "FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies" (FAW = First Automotive Works)
Classification: "Caribbean Union": ORG or GPE?

46 Introduction
IE Overview
Supervised Name Tagging: Features; Models
Advanced Techniques and Trends: Re-ranking and Global Features (Ji and Grishman, 2005; 2006); Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

47 NLP Re-Ranking Framework
Pipeline: Input Sentence → Baseline System → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis
Useful for: Name Tagging, Parsing, Coreference Resolution, Machine Translation, Semantic Role Labeling, ...

48 Name Re-Ranking
Sentence: Slobodan Milosevic was born in Serbia.
Generate N-best hypotheses: Hypo0, Hypo1, Hypo2, ... over "Slobodan Milosevic was born in Serbia."
Goal: re-rank the hypotheses and get the new best one: Hypo0', Hypo1', Hypo2', ...
[The hypotheses differ only in their name annotations, which did not survive transcription.]

49 Enhancing Name Tagging: Prior Work
Work | Name Identification | Name Classification | Joint inference | Model
Roth and Yih, 02 & 04 | No | Yes | Yes (relation) | Linear Programming
Collins, 02 | Yes | No | No | RankBoost
Ji and Grishman, 05 | Yes | Yes | Yes (coref, relation) | MaxEnt-Rank

50 Supervised Name Re-Ranking: Learning Function
Direct re-ranking by classification (e.g., our '05 work): f(h1) = -1, f(h2) = 1, f(h3) = -1
Direct re-ranking by score (e.g., boosting style): f(h1) = 0.8, f(h2) = 0.9, f(h3) = 0.6
Indirect re-ranking + confidence-based decoding using a multi-class solution (e.g., MaxEnt, SVM style): f(h1, h2) = -1, f(h1, h3) = 1, f(h2, h1) = 1, f(h2, h3) = 1, f(h3, h1) = -1, f(h3, h2) = -1
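A sketch of how each of the three learning-function styles on this slide picks a best hypothesis, using the slide's toy values; the pairwise "wins" tally is one simple way to decode pairwise preferences, an assumption for illustration:

```python
hyps = ["h1", "h2", "h3"]
cls = {"h1": -1, "h2": 1, "h3": -1}                # direct classification labels
scores = {"h1": 0.8, "h2": 0.9, "h3": 0.6}         # direct ranking scores
pair = {("h1", "h2"): -1, ("h1", "h3"): 1, ("h2", "h1"): 1,
        ("h2", "h3"): 1, ("h3", "h1"): -1, ("h3", "h2"): -1}

# Classification: keep hypotheses labeled +1
best_by_class = [h for h in hyps if cls[h] == 1]
# Score: take the highest-scoring hypothesis
best_by_score = max(hyps, key=scores.get)
# Indirect (pairwise): pick the hypothesis preferred over the most others
wins = {h: sum(1 for o in hyps if o != h and pair[(h, o)] == 1) for h in hyps}
best_by_pairs = max(hyps, key=wins.get)

print(best_by_class, best_by_score, best_by_pairs)
```

All three schemes agree on h2 for these toy values.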

51 Supervised Name Re-Ranking: Algorithms
MaxEnt-Rank: indirect re-ranking + decoding; computes a reliable ranking probability for each hypothesis
SVM-Rank: indirect re-ranking + decoding; theoretically achieves low generalization error; uses a linear kernel in this work
p-Norm Push Ranking (Rudin, 06): direct re-ranking by score; a boosting-style supervised ranking algorithm that generalizes RankBoost; "p" represents how much to concentrate at the top of the hypothesis list; theoretically achieves low generalization error

52 Wider Context than a Traditional NLP Re-Ranker
Traditional pipeline: i-th Input Sentence → Baseline System → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis
Extensions: use cross-sentence info (Sent 0, Sent 1, ..., Sent i) rather than only the current sentence; use joint inference rather than only the baseline system

53 Name Re-Ranking in IE View
Pipeline: i-th Input Sentence → Baseline HMM → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis
Question: can the Relation Tagger, Event Tagger, and Coreference Resolver feed the re-ranking feature extractor?

54 Look Outside of Stage and Sentence
Sources of evidence: Relation Tagger, Event Tagger, Coreference Resolver, Name Structure Parsing (ORG Modifier + ORG Suffix)
Text 1 (S11, S12): The previous president Milosevic was beaten by the Opposition Committee candidate Kostunica. This famous leader and his wife Zuolica live in an apartment in Belgrade.
Text 2 (S22): Milosevic was born in Serbia's central industrial city.

55 Take Wider Context
Cross-sentence: Sent 0, Sent 1, ..., Sent i feed the re-ranking feature extractor
Cross-stage (joint inference): Relation Tagger, Event Tagger, Coreference Resolver
Pipeline: i-th Input Sentence → Baseline HMM → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis

56 Even Wider Context
Cross-doc: sentences from Doc 1, Doc 2, ..., Doc n feed the re-ranking feature extractor
Cross-stage (joint inference): Relation Tagger, Event Tagger, Cross-doc Coreference Resolver
Pipeline: i-th Input Sentence → Baseline HMM → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis

57 Feature Encoding Example
"CorefNum" feature: the names in hypothesis h are referred to by CorefNum other noun phrases (the larger CorefNum is, the more likely h is correct)
Text 1 (S11): The previous president Milosevic was beaten by the Opposition Committee candidate Kostunica. (S13): Kostunica joined the committee in
Text 2 (S21): Milosevic was born in Serbia's central industrial city.
For the current name hypothesis of S11: CorefNum = 2 + 1 + 1 = 4
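A minimal sketch of computing the "CorefNum" feature; representing the coreference evidence as a mention-to-count mapping is an assumption for illustration:

```python
# CorefNum: how many other noun phrases corefer with the names in a hypothesis.
def coref_num(hypothesis_names, coref_counts):
    """coref_counts: name mention -> number of coreferring noun phrases."""
    return sum(coref_counts.get(name, 0) for name in hypothesis_names)

# Toy counts matching the slide's 2 + 1 + 1 = 4 example
counts = {"Milosevic": 2, "Kostunica": 1, "Opposition Committee": 1}
print(coref_num(["Milosevic", "Kostunica", "Opposition Committee"], counts))
```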

58 Experiment on Chinese Name Tagging: Results
Models compared (Precision / Recall / F-Measure): HMM Baseline; Oracle (N=30): (96.2) / (97.4) / (96.8); MaxEnt-Rank; SVMRank; p-Norm Push Ranking (p=16); RankBoost (p-Norm Push Ranking, p=1)
[Per-model scores other than the oracle's did not survive transcription.]
All re-ranking models achieved significantly better P, R, F than the baseline
Recall and F-Measure values for MaxEnt-Rank and p-Norm are within the confidence interval

59 Introduction
IE Overview
Supervised Name Tagging: Features; Models
Advanced Techniques and Trends: Re-ranking and Global Features (Ji and Grishman, 2005; 2006); Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

60 NLP
Words words words ---> statistics
Data sparsity in NLP: "I have bought a pre-owned car" vs. "I have purchased a used automobile"
How do we represent (unseen) words?

61 NLP
Not so well... We do well when we see words we have already seen in training examples and have enough statistics about them. When we see a word we haven't seen before, we try:
Part-of-speech abstraction
Prefix/suffix/number/capitalization abstraction
We have a lot of text! Can we do better?

62 Word Class Models (Brown et al., 1992)
Can be seen as a hierarchical distributional clustering:
Iteratively reduce the number of states and assign words to hidden states such that the joint probability of the data and the assigned hidden states is maximized
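A sketch of how such word clusters are typically used to fight sparsity: Brown clustering assigns each word a bit-string path in its binary hierarchy, and prefixes of that path serve as features, so a rare word shares statistics with distributionally similar words. The cluster map below is a hypothetical toy, not real Brown-cluster output:

```python
# Toy word -> bit-string cluster paths (illustrative, not induced from data).
clusters = {"car": "110100", "automobile": "110101", "bought": "0111"}

def cluster_features(word, prefixes=(2, 4, 6)):
    # Emit one feature per path prefix length; unseen words get no features.
    path = clusters.get(word)
    if path is None:
        return {}
    return {f"brown_p{k}": path[:k] for k in prefixes}

print(cluster_features("car"))
print(cluster_features("automobile"))
```

Note that "car" and "automobile" share their length-2 and length-4 prefixes, so a tagger trained on one generalizes to the other.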

63 Gazetteers
Weakness of Brown clusters and word embeddings: representing the word "Bill" in "The Bill of Congress" vs. "Bill Clinton"
We either need context-sensitive embeddings or embeddings of multi-token phrases
Current simple solution: gazetteers
Wikipedia category structure: ~4M typed expressions

64 Results
NER is a knowledge-intensive task. Surprisingly, the knowledge was particularly useful (*) on out-of-domain data, even though it was not used for C&W embedding induction or Brown cluster induction.

65 Obtaining Gazetteers Automatically?
Achieving really high performance for name tagging requires deep semantic knowledge and large, costly hand-labeled data
Many systems also exploit lexical gazetteers, but this knowledge is relatively static, expensive to construct, and doesn't include any probabilistic information

66 Data is Power
The Web is one of the largest text corpora; however, web search is slooooow (if you have a million queries)
N-gram data: a compressed version of the web
Already proven to be useful for language modeling
Tools for large N-gram data sets are not widely available
What are the uses of N-grams beyond language models?

67

68 car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10

69 A Typical Name Tagger
Name-labeled corpora: 1375 documents, about 16,500 name mentions
Manually constructed name gazetteer including 245,615 names
Census data including 5,014 person-gender pairs

70 Patterns for Gender and Animacy Discovery
Gender:
Conjunction-Possessive: noun [292,212] | capitalized [162,426] + conjunction + his|her|its|their (e.g., "John and his")
Nominative-Predicate: noun [53,587] + am|is|are|was|were|be + he|she|it|they (e.g., "he is John")
Verb-Nominative: noun [116,607] + verb + he|she|it|they (e.g., "John thought he")
Verb-Possessive: noun [88,577] | capitalized [52,036] + verb + his|her|its|their (e.g., "John bought his")
Verb-Reflexive: noun [18,725] + verb + himself|herself|itself|themselves (e.g., "John explained himself")
Animacy:
Relative-Pronoun: (noun|adjective), not after (preposition|noun|adjective) [664,673] + comma|empty + who|which|where|when (e.g., "John, who")

71 Lexical Property Mapping
Gender: his|he|himself → masculine; her|she|herself → feminine; its|it|itself → neutral; their|they|themselves → plural
Animacy: who → animate; which|where|when → non-animate

72 Gender Discovery Examples
If a mention indicates male or female with high confidence, it is likely to be a person mention
Example patterns for candidate mentions, with frequencies per column (masculine / feminine / neutral / plural):
"John Joseph bought/… his/…": 32 / 0 / 0 / 0
"Haifa and its/…", "screenwriter published/… his/…", "it/… is/… fish": frequencies not recoverable from the transcript

73 Animacy Discovery Examples
If a mention indicates animacy with high confidence, it is likely to be a person mention
Example patterns for candidate mentions, with frequencies per column (who / when / where / which):
supremo: 24 / 0 / 0 / 0
shepherd, prophet, imam, oligarchs, sheikh: frequencies not recoverable from the transcript

74 Overall Procedure
Offline processing: Google N-grams → gender & animacy knowledge discovery → confidence estimation → Confidence(noun, masculine/feminine/animate)
Online processing: test doc → token scanning & stop-word filtering → candidate name mentions and candidate nominal mentions → fuzzy matching against the discovered knowledge → person mentions

75 Unsupervised Mention Detection Using Gender and Animacy Statistics
Candidate mention detection:
Name: capitalized sequence of <= 3 words; filter stop words, nationality words, dates, numbers and title words
Nominal: un-capitalized sequence of <= 3 words without stop words
Margin confidence estimation:
Confidence = (freq(best property) - freq(second best property)) / freq(second best property)
Keep a candidate when Confidence(candidate, Male/Female/Animate) exceeds a threshold
Matching:
Full matching: candidate = full string
Composite matching: candidate = each token in the string
Relaxed matching: candidate = any two tokens in the string
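The margin confidence formula above can be sketched directly; treating an unopposed property (no runner-up evidence) as maximally confident is an assumption for illustration:

```python
# Margin confidence: (freq(best) - freq(second best)) / freq(second best).
def margin_confidence(freqs):
    """freqs: property name -> frequency from the n-gram patterns."""
    ranked = sorted(freqs.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        # No competing evidence at all: treat as maximally confident
        return float("inf") if ranked and ranked[0] > 0 else 0.0
    return (ranked[0] - ranked[1]) / ranked[1]

# e.g. a mention seen 87 times as masculine and 3 times as neutral
print(margin_confidence({"masculine": 87, "feminine": 0, "neutral": 3}))
```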

76 Property Matching Examples
Mention candidate | Matching method | String for matching | Frequencies (masculine / feminine / neutral / plural)
John Joseph | Full matching | "John Joseph" | not recoverable from the transcript
Ayub Masih | Composite matching | "Ayub" | 87 / 0 / 0 / 0; "Masih" | not recoverable
Mahmoud Salim Qawasmi | Relaxed matching | "Mahmoud Salim Qawasmi" | 0 / 0 / 0 / 0

77 Separate Wheat from Chaff: Confidence Estimation
Rank the properties for each noun according to their frequencies: f1 > f2 > ... > fk

78 Experiments: Data

79 Impact of Knowledge Sources on Mention Detection for Dev Set
Patterns applied to n-grams for nominal mentions (P/R/F %): Conjunction-Possessive ("writer and his"), Predicate ("He is a writer"), Verb-Nominative ("writer thought he"), Verb-Possessive ("writer bought his"), Verb-Reflexive ("writer explained himself"), Animacy ("writer, who")
Patterns applied to n-grams for name mentions (P/R/F %): Conjunction-Possessive ("John and his"), Verb-Nominative ("John thought he"), Animacy ("John, who")
[The P/R/F numbers did not survive transcription.]

80 Impact of Confidence Metrics
Why some metrics don't work:
High Percentage ("The") = 95.9%; "The": F:112 M:166 P:12
High Margin&Freq ("Under") = 16; "Under": F:30 M:233 N:15 P:49
[Figure: name gender (conjunction) confidence metric tuning on dev set]

81 Exercise: suggest solutions to remove the name identification errors in

