Presentation on theme: "The SEER Project Team and Others"— Presentation transcript:
1 From BioNER to AstroNER: Porting Named Entity Recognition to a New Domain
2 The SEER Project Team and Others Bea Alex, Shipra Dingare, Claire Grover, Ben Hachey, Ewan Klein, Yuval Krymolowski, Malvina NissimEdinburgh BioNER:Stanford BioNER:Jenny Finkel, Chris Manning, Huy NguyenBea Alex, Markus Becker, Shipra Dingare, Rachel Dowsett, Claire Grover, Ben Hachey, Olivia Johnson, Ewan Klein, Yuval Krymolowski, Jochen Leidner, Bob Mann, Malvina Nissim, Bonnie WebberEdinburgh AstroNER:
3 Overview Named Entity Recognition The SEER project BioNER Porting to New DomainsAstroNER
4 Named Entity Recognition As the first stage of Information Extraction, Named Entity Recognition (NER) identifies and labels strings in text as belonging to pre-defined classes of entities.(The second stage of Information Extraction (IE) identifies relations between entities.)NER or full IE can be useful technology for Text Mining.
5 Named Entity Recognition Early work in NLP focused on general entities in newspaper texts e.g. person, organization, location, date, time, money, percentage
6 Newspaper Named Entities Helen Weir, the finance director of Kingfisher, was handed a £334,607 allowance last year to cover the costs of a relocation that appears to have shortened her commute by around 15 miles. The payment to the 40-year-old amounts to roughly £23,000 a mile to allow her to move from Hampshire to Buckinghamshire after an internal promotion.
7 Named Entity Recognition For text mining from scientific texts, the entities are determined by the domain, e.g. for biomedical text, gene, virus, drug etc.
9 The SEER Project Stanford-Edinburgh Entity Recognition Funded by the Edinburgh Stanford Link Jan 2002 — Dec 2004Focus:NER technology applied in a range of new domainsgeneralise from named entities to include term entitiesmachine learning techniques in order to enable bootstrapping from small amounts of training dataDomains: biomedicine, astronomy, archaeology
10 Biomedical NER Competitions BioCreativeGiven a single sentence from a Medline abstract, identify all mentions of genes“(or proteins where there is ambiguity)”BioNLPGiven full Medline abstracts, identify five types of entityDNA, RNA, protein, cell line, cell type
11 The Biomedical NER Data SentencesWordsNEs/SentBioCreativeTraining7,500~200,000~1.2Development2,500~70,000Evaluation5,000~130,000BioNLP~19,000~500,000~2.75~4,000~100,000~2.25
12 Evaluation Method Measure Precision, Recall and F-score. Both BioCreative and BioNLP used the exact-match scoring methodIncorrect boundaries doubly penalized as false negatives and false positives.chloramphenicol acetyl transferase reporter gene (FN)transferase reporter gene (FP)chloramphenicol acetyl transferase reporter geneh
13 The SEER BioNER System Maximum Entropy Tagger in Java Based on Klein et al (2003) CoNLL submissionEfforts mostly in finding new featuresDiverse Feature SetLocal FeaturesExternal Resources
14 External Resources Abbreviation TnT POS-tagger Frequency Gazetteers WebSyntaxAbstractABGENE/GENIA
15 Mining the Web Entity Type Query # hits PROTEIN "glucocorticoid protein OR binds OR kinase OR ligation”234DNA"glucocorticoid dna OR sequence OR promoter OR site”101CELL_LINE"glucocorticoid cells OR cell OR cell type OR line"1CELL_TYPE"glucocorticoid proliferation OR clusters OR cultured OR cells”12RNA"glucocorticoid mrna OR transcript OR messenger OR splicing”35
17 Postprocessing – BioCreative Discarded results with mismatched parenthesesDifferent boundaries were detected when searching the sentence forwards versus backwardsUnioned the results of both; in cases where boundary disagreements meant that one detected gene was contained in the other, we kept the shorter gene
19 What If You Lack Training Data? When porting to a new domain it is likely that there will be little or no annotated data available.Do you pay annotators to create it?Are there methods that will allow you to get by with just a small amount of data?Bootstrapping Techniques
20 AstroNER: The ‘Surprise’ Task Aimssimulate a practical situationexperiment with bootstrapping methodsgain practical experience in porting our technology to a new domain using limited resourcesmonitor resource expenditure to compare the practical utility of various methodsCollaborators: Bonnie Webber, Bob MannFollow-up to workshop held at NeSC in Jan 2004
21 MethodThe data was chosen and prepared in secret to ensure fair comparison.The training set was kept very small but large amounts of tokenised unlabelled data were made available.Three teams, each given the same period of time to perform taskApproaches:co-training, weakly supervised learning, active learning
22 Data and AnnotationAstronomy abstracts from the NASA Astrophysics Data Service (Sub-domain: spectroscopy/spectral lines4 entity types: instrument-name, spectral-feature, source-type, source-nameData:Annotation tool based on the NXT toolkit for expert annotation of training & testing sets as well as active learning annotation.abstractssentencesentitiestraining50502874testing1591,4512,568unlabeled7787,979
24 Co-trainingBasic idea: use the strengths of one classifier to rectify the weaknesses of another.Two different methods classify a set of seed data; select results of one iteration, and add them to the training data for the next iteration.Various choices:same classifier with different feature splits, or two different classifierscache size (# examples to tag on each iteration)add labeled data to new training set if both agree, or add labeled data from one to training set of the otherretrain some or all classifiers at each iterationStanford MaxEnt taggerCurran and Clark’s (2003) MaxEnt taggerTnT (HMM part-of-speech tagger)YAMCHA (SVM tagger)internal vs external features (for C&C and Stanford taggers)
25 best settings on biomedical data: #sentences#words#entities#classesbio-data50012,9001,5455+1SEEDastro-data50215,4298744+1bio-data3,8561010398,6625+1TESTastro-data1,451238,6552,5684+1UNLABELLED DATA: ca. 8,000 sentences for both setsSTART PERFORMANCE (F)StanfordC&CTnTYAMCHAbio-data56.8748.4241.6250.64astro-data69.0664.4761.4561.98best settings on biomedical data:Stanford, C&C, and TnT; cache=200; agreement; retrain Stanford onlyStanford and YAMCHA; cache=500; agreementNOTE: in both cases limited improvement (max 2 percentage points)on astronomical data: no real positive results so farTAKE HOME MESSAGE: COTRAINING QUITE UNSUCCESFUL FOR THIS TASK!REASONS: Classifiers not different enough? Classifiers not good enough tostart with?
26 Weakly supervisedMany multi-token entities, typically a head word preceded by modifiers:instrument-name: Very Large Telescopesource-type: radio–quiet QSOsspectral-feature: [O II] emissionFind most likely modifier sequences for a given initial set of conceptsBuild a gazetteer for each entity subtype and use it for markup.Results: F-score = 49%.Head words come from a fixed set of concepts. Most of the variation is in the modifiers, which denote the various subtypes.
27 Active Learning Supervised Learning Active Learning Select random examples for labelingRequires large amount of (relatively expensive) annotated dataActive LearningSelect most ‘informative’ examples for labellingMaximal reduction of error rate with minimal amount of labellingFaster converging learning curvesHigher accuracy for same amount of labelled dataLess labelled data for same levels of accuracy
28 Parameters Annotation level: Document? Sentence? Word? Selection method:Query-by-committee with several sample selection metricsAverage KL-divergenceMaximum KL-divergenceF-scoreBatch size: 1 ideal but impractical. 10? 50? 100?AlternativesConfidenceQBC with e.g. majority vote
29 Experiments BioNLP AstroNER Corpus: developed for BioNLP 2004 shared task, based on GENIA corpusEntities: DNA, RNA, cell-line, cell-type, proteinExperiments: 10 fold cross validation used to tune AL parameters for real experimentsAstroNERExperiments: 20 rounds of annotation with active sample selection
32 Time Monitoring Objective: Method: Result: Progress towards NL engineering (cost/time-aware)Method:Web-based time tracking tool used to record how time was spentSeparation between shared (communication, infastructure) and method-specific time useResult:No dramatic cost differences between 3 methodsRoughly 64 person days total cost (all methods)
33 Time Monitoring Infrastructure 100 h Active Learning 130.5 h Clustering 57.5 hInfrastructure 100 hCommunication 57.5 hCo-Training h*) Don‘t show this – presenter‘s commentary only.Cave: Absolute numbers do not matter much, might be skewed (people might not have recorded 100% of the time). Example calculation of Cost per Method (blue bar diagramme): ActiveL (total) = ActiveL (dedicated time) + (comm. + infrastr.)Each method‘s dedicated time is combined with time shared across methods (that is also the reason why the blue bar diagramme doesn‘t add up to 64 working days, BTW!)