Analytics for Noisy Unstructured Text Data, Hyderabad, 08/01/2007: A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities.


1 Analytics for Noisy Unstructured Text Data, Hyderabad, 08/01/2007
A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities
Paweł Mazur (University of Technology, Wrocław, Poland), Pawel.Mazur@pwr.wroc.pl
and Robert Dale (Macquarie University, Sydney, Australia), Robert.Dale@mq.edu.au

2 Agenda
- Conjunction in Named Entities
- Our approach
- Experiments
- Results of the experiment
- Results analysis
- Conclusions
- Further work

3 Conjunction in Candidate Named Entity String
- Fujitsu Australia and New Zealand
- Australia and New Zealand Banking Group Limited
- Peter Smith and Ann Arbor Software Council
Candidate named entity string:
- Any sequence of words starting with initial capitals
- A single instance of a conjunction, in the form of the word "and" or the symbol "&"
45 documents out of 13460: 5.7% of candidate named entity strings contained a conjunction; in some documents the frequency is as high as 23%; in MUC-7 it is 4.5%
A lot of candidate named entity strings in this domain contain company names and person names

4 Our Approach - A Classification Task
We distinguish 4 categories of a conjunction in a candidate NE string:
- Category A: Name Internal Conjunction
  Copper Mines and Metals Limited; Herbert P Cooper & Son; Ernst and Young
- Category B: Name External Conjunction
  Proxy Form and Explanatory Memorandum; Hardware & Operating Systems; EchoStar and News Corporation
- Category C: Right-Copy Separator
  William and Alma Ford; Connel and Bent Streets; Eastern and Western Australia
- Category D: Left-Copy Separator
  Hospital Equipment & Systems; J H Blair Company Secretary & Corporate Counsel
Slide callouts: "Could be seen as one linguistic category"; "The most common"

5 Our Approach - Candidate NE String Pattern
String: Australia and New Zealand Banking Group Limited
Pattern: (Loc and Loc Org CompDesig)
String: Peter Smith and Ann Arbor Software Council
Pattern: (GivenName FamilyName and GivenName FamilyName Noun Org)
Patterns are created using gazetteers and simple keyword-based heuristics.
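The pattern-building step can be pictured as a gazetteer lookup that prefers the longest matching phrase. The following is only a minimal sketch under stated assumptions: the tiny GAZETTEER dictionary, the MAX_PHRASE_LEN limit, and the fallback to InitCapped are hypothetical and are not the authors' actual resources or heuristics.

```python
# Hypothetical mini-gazetteer: lowercase phrase -> tag (illustrative entries only).
GAZETTEER = {
    "australia": "Loc",
    "new zealand": "Loc",
    "banking group": "Org",
    "limited": "CompDesig",
    "pty ltd": "CompDesig",
}
CONJUNCTIONS = {"and", "&"}
MAX_PHRASE_LEN = 3  # assumed upper bound on gazetteer phrase length


def tag_pattern(candidate):
    """Turn a candidate NE string into a list of tags, keeping the conjunction."""
    tokens = candidate.split()
    pattern = []
    i = 0
    while i < len(tokens):
        if tokens[i] in CONJUNCTIONS:
            pattern.append(tokens[i])
            i += 1
            continue
        # Try the longest phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in GAZETTEER:
                pattern.append(GAZETTEER[phrase])
                i += n
                break
        else:
            # No gazetteer entry matched: fall back to the generic tag.
            pattern.append("InitCapped")
            i += 1
    return pattern


print(tag_pattern("Australia and New Zealand Banking Group Limited"))
```

With these made-up entries the first example string on the slide maps to the pattern shown there; a string with no gazetteer hits, such as Ernst and Young, falls back to InitCapped on both sides.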

6 Tag Set
Tag          Count   Share
InitCapped     925  42.24%
Loc            245  11.19%
Org            175   7.99%
FamilyName     164   7.49%
CompDesig      138   6.30%
Initial        108   4.93%
CompPos         99   4.52%
GivenName       89   4.06%
Of              76   3.47%
Abbrev          73   3.33%
PersDesig       39   1.78%
Det             31   1.42%
Dir             12   0.55%
Son              7   0.32%
Month            6   0.27%
AlphaNum         3   0.14%
PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir, Madam, Messrs, and Jnr.
CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and many more, as well as Investments Pty Ltd, Management Pty Ltd, Corporate Pty Ltd, Associates Pty Ltd, Family Trust, Co Limited, Partners, Partners Limited, Capital Limited, and Capital Pty Ltd.
CompPos: Director, Secretary, Manager, Counsel, Managing Director, Member, Chairman, Chief Executive, Chief Executive Officer, and CEO, and also some bodies within organizations, such as Board and Committee.

7 Data Encoding
Each instance is encoded with 33 attributes:
- 1 binary attribute for each tag for each conjunct, signaling its presence in the string (2x16=32 attributes in total)
- 1 binary attribute ConjType encoding the lexical form of the conjunction in the string (0 for &, 1 for and)
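The encoding described on this slide can be sketched as follows. This is an illustration only: the 16 tag names are taken from the Tag Set slide, while the attribute order (left conjunct, then right conjunct, then ConjType) and the function name encode are assumptions.

```python
# The 16 tags from the Tag Set slide; their order here is an assumption.
TAGS = [
    "InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
    "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det",
    "Dir", "Son", "Month", "AlphaNum",
]
CONJUNCTIONS = ("and", "&")


def encode(pattern):
    """Encode a tag pattern (with its conjunction token) as 33 binary attributes."""
    positions = [i for i, tag in enumerate(pattern) if tag in CONJUNCTIONS]
    if len(positions) != 1:
        raise ValueError("expected exactly one conjunction in the pattern")
    k = positions[0]
    left, right = set(pattern[:k]), set(pattern[k + 1:])
    vector = [int(tag in left) for tag in TAGS]    # 16 attributes, left conjunct
    vector += [int(tag in right) for tag in TAGS]  # 16 attributes, right conjunct
    vector.append(1 if pattern[k] == "and" else 0)  # ConjType: 0 for &, 1 for and
    return vector


vector = encode(["Loc", "and", "Loc", "Org", "CompDesig"])
print(len(vector), vector)
```

Because each attribute records only presence, the encoding discards tag order and repetition within a conjunct, so different patterns can collapse onto the same vector.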

8 Corpus & Data Sets
Corpus: 13460 text documents, from 8 to 1000 lines long
Our corpus is a subcorpus drawn from a collection of company announcements from the Australian Stock Exchange
Selection of candidate named entity strings: a sequence of initcapped words and a single conjunction (and or &); the words of, a, an, the may also optionally appear
We got a set of 10925 strings, 6437 of which are unique
Hand elimination of wrongly identified strings due to typographic features of the documents (tables)
Random selection of 600 examples from the unique set
Name Internal   Name External   Right-Copy   Left-Copy   Sum
185 (30.8%)     350 (58.3%)     39 (6.5%)    26 (4.3%)   600 (100%)
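The selection rule for candidate strings on this slide (initcapped words, exactly one and or &, optional of, a, an, the) could be approximated as below. This is a rough sketch, not the authors' extraction procedure: the tokenisation regex, the trimming of leading and trailing function words, and the function names are assumptions.

```python
import re

FUNCTION_WORDS = {"of", "a", "an", "the"}
CONJUNCTIONS = {"and", "&"}


def is_initcapped(token):
    return token[0].isupper()


def candidate_strings(text):
    """Return maximal runs of allowed tokens that contain exactly one conjunction."""
    # Words (letters, digits, &, ', -) or single punctuation marks; punctuation breaks a run.
    tokens = re.findall(r"[A-Za-z0-9&'-]+|[^\sA-Za-z0-9&'-]", text)
    candidates, run = [], []
    for token in tokens + [None]:  # None is a sentinel that flushes the last run
        allowed = token is not None and (
            is_initcapped(token) or token in FUNCTION_WORDS or token in CONJUNCTIONS
        )
        if allowed:
            run.append(token)
            continue
        # Assumed clean-up: a candidate must start and end with an initcapped word.
        while run and not is_initcapped(run[0]):
            run.pop(0)
        while run and not is_initcapped(run[-1]):
            run.pop()
        if sum(t in CONJUNCTIONS for t in run) == 1:
            candidates.append(" ".join(run))
        run = []
    return candidates


text = "The board of Ernst and Young met the directors of Hardware & Operating Systems today."
print(candidate_strings(text))
```

A purely surface-based rule like this also picks up strings from tables and headings, which is consistent with the hand elimination step mentioned on the slide.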

9 Machine-learned Classifiers
- Naïve Bayes
- Multilayered Perceptron
- IBk
- K*
- Random Tree
- Logistic Model Trees (LMT)
- J4.8
- SMO
Implementations in WEKA (Waikato Environment for Knowledge Analysis), University of Waikato in New Zealand
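IBk in the list above is WEKA's instance-based (k-nearest-neighbour) learner. To give a feel for how such a learner treats binary attribute vectors, here is a minimal pure-Python 1-nearest-neighbour sketch. It is not the WEKA implementation; the Hamming distance, the tie-breaking (first training example wins), and the three-attribute toy data are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(train, query):
    """train is a list of (vector, label) pairs; return the label of the closest vector."""
    _, label = min(train, key=lambda example: hamming(example[0], query))
    return label


# Made-up toy data with 3 binary attributes per instance (not from the corpus).
train = [
    ([1, 0, 0], "Name Internal"),
    ([0, 1, 1], "Name External"),
    ([0, 0, 1], "Right-Copy"),
]
print(nearest_neighbour(train, [0, 1, 0]))
```

On binary vectors the squared Euclidean distance equals the Hamming distance, so both give the same neighbour ranking.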

10 Baseline
Determined with the 0-R algorithm: always assigns the most common category (Name External): 58.33%
A better baseline is given by the 1-R algorithm:
IF ConjForm=& THEN PredCat = Internal
IF ConjForm=and THEN PredCat = External
baseline = 69.83%
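Both baselines are simple enough to write down directly. The sketch below takes the class counts from the Corpus & Data Sets slide (185, 350, 39, 26) and encodes the 1-R rule exactly as stated on this slide; the function names are illustrative. The 69.83% figure for 1-R cannot be recomputed here, because it depends on the per-example conjunction forms, which the slides do not list.

```python
# Class distribution of the 600 examples, as given on the Corpus & Data Sets slide.
CLASS_COUNTS = {
    "Name Internal": 185,
    "Name External": 350,
    "Right-Copy": 39,
    "Left-Copy": 26,
}


def zero_r(class_counts):
    """0-R: always predict the most common category; return it with its accuracy."""
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / sum(class_counts.values())


def one_r(conj_form):
    """1-R rule from the slide: & means Name Internal, and means Name External."""
    return "Name Internal" if conj_form == "&" else "Name External"


label, accuracy = zero_r(CLASS_COUNTS)
print(label, f"{accuracy:.2%}")
print(one_r("&"), "/", one_r("and"))
```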

11 Results
Algorithm                 Correctly classified
IBk                       504 (84.00%)
Random Tree               503 (83.83%)
K*                        501 (83.50%)
SMO - quadratic kernel    494 (82.33%)
Mult. Perceptron          493 (82.17%)
LMT                       487 (81.17%)
J4.8                      477 (79.50%)
SMO - linear kernel       468 (78.00%)
Naïve Bayes               424 (70.67%)
Baseline                  419 (69.83%)

12 Accuracy by Conjunction Category
Category        Precision  Recall  F-Measure
Name Internal   0.814      0.876   0.844
Name External   0.872      0.897   0.885
Right-Copy      0.615      0.410   0.492
Left-Copy       0.800      0.462   0.585
weighted mean   0.834      0.840   0.833

13 Confusion Matrix
Name Internal  Name External  Right-Copy  Left-Copy  Classified as:
162            28             6           3          Name Internal
18             314            17          11         Name External
4              6              16          0          Right-Copy
1              2              0           12         Left-Copy
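The per-category precision, recall and F-measure can be recomputed from the confusion matrix. The sketch below reads each row as the category assigned by the classifier and each column as the actual category; this reading is an interpretation of the flattened slide, chosen because it makes the column sums match the class distribution (185, 350, 39, 26) and reproduces the reported scores. The function names are illustrative.

```python
CATEGORIES = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]

# Rows: category assigned by the classifier; columns: actual category (assumed reading).
MATRIX = [
    [162, 28, 6, 3],
    [18, 314, 17, 11],
    [4, 6, 16, 0],
    [1, 2, 0, 12],
]


def per_class_scores(matrix):
    """Return (precision, recall, f_measure, actual_count) for each category."""
    n = len(matrix)
    scores = []
    for i in range(n):
        true_positives = matrix[i][i]
        predicted = sum(matrix[i])                    # row sum
        actual = sum(matrix[r][i] for r in range(n))  # column sum
        precision = true_positives / predicted
        recall = true_positives / actual
        f_measure = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f_measure, actual))
    return scores


def weighted_mean(scores, index):
    """Mean of one score, weighted by the number of actual examples per category."""
    total = sum(s[3] for s in scores)
    return sum(s[index] * s[3] for s in scores) / total


scores = per_class_scores(MATRIX)
for name, (p, r, f, _) in zip(CATEGORIES, scores):
    print(f"{name:14s} {p:.3f} {r:.3f} {f:.3f}")
print("weighted mean ", " ".join(f"{weighted_mean(scores, i):.3f}" for i in range(3)))
```

The diagonal sums to 504, the number of correctly classified examples reported for IBk.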

14 Results Analysis: Conjunction Category Indicators
For Name External conjunction:
- Month & X
- X & Month
- CompDesig & X
- X & PersDesig
- X & GivenName
- X & Dir
- X & Deter
- Abbrev & X
- X & Abbrev
For Name Internal conjunction:
- X & Son (note: Sons of Gwalia Ltd and Gwalia Consolidated Ltd)

15 Error Analysis: InitCapped
38 of all 96 misclassified examples are based on the InitCapped tag only (~40%)
In these cases the classification ended up being determined on the basis of the ConjForm attribute (just as the baseline was determined).
There were 134 InitCapped-only patterns in the data set; 96 of them (71.64%) were classified correctly (comparable to the overall baseline result of 69.83%).
There were also 11 misclassified examples consisting mainly of the InitCapped tag. Ex:
Australian Labor Party and Independent Members
Loc InitCapped Org and InitCapped InitCapped

16 Error Analysis: Long Patterns
In 2 cases the misclassification was due to the long patterns of the examples:
Fellow of the Australian Institute of Geoscientists and The Australasian Institute of Mining
CompPos Of Det Loc Org Of InitCapped and Det Loc Org Of InitCapped (Left-Copy => Name Internal)
Fellow of the Royal College of Pathologists of Australasia and Chairman of Scientific Services Limited
Pos Of Deter Org Of InitCapped Of Loc and Pos Of InitCapped InitCapped Desig (Name External => Name Internal)

17 Error Analysis: Other Cases
- 2 cases of extended patterns, where a pattern is built as another (common) pattern plus an additional tag:
  WD & HO Wills Holdings Limited
  Initial Initial & Initial Initial FamilyName CompDesig (Name Internal) vs Initial Initial & Initial Initial FamilyName (Right-Copy)
- A conjunction of a person name and a company name: Wayne Jones and Topsfield Pty Ltd is ambiguous even for humans without contextual information
- A conjunction of two person names: in our domain there is only one case where this is of the name external type
- There are around 20 examples where it is difficult to judge the reason for misclassification; perhaps the reason is the model we have built
- Influence of k-fold evaluation: different classification for the same pattern in different folds

18 Conclusions
- Distinguished 4 categories of conjunctions in NEs
- Presented the problem as one of classification
- Experiment with machine-learned classifiers
- Results: F=0.833
- Simple tag set used
- Some examples are truly ambiguous even for humans

19 Further Work
- Multiple conjunctions
- Human supervised N-gram based preprocessing
- Abbreviation preprocessing
- Limit the number of InitCapped tags
- Take into account the syntactic number of tokens
- Use contextual information (e.g. syntactic number of associated verb)
- Extend the evaluation data
- Evaluation with full named entity recognition process

