NLP Techniques (Machine Learning): NER in Biomedical Domain. Tsujii Laboratory, Hong-Woo CHUN (D1), February 10th, 2005
Univ. of Tokyo 2/11 Introduction
As research in the biomedical domain has grown rapidly in recent years, a huge amount of natural language resources has been developed and has become a rich knowledge base.
NER (Named Entity Recognition) is in strong demand in the biomedical domain.
  In this project, NER identifies names of genes, gene products and diseases in biomedical text. From now on, genes and gene products are both referred to as "gene".
  NER in the biomedical domain has not yet reached the performance achieved in the newswire domain.
Introduction::Problems in NER
Modifiers often precede basic NEs: activated B cell lines
Biomedical NEs are sometimes very long: 47 kDa sterol regulatory element binding factor
Two or more NEs may share one head noun through a conjunction or disjunction construction: 91 and 84 kDa proteins
An entity may appear in various spelling forms
NEs may be cascaded
One NE may be embedded in another NE
Abbreviations are frequently used
Therefore, it is necessary to explore more evidential features and more effective methods to cope with such difficulties.
NER without NLP tech.
Dictionary-based longest matching
Number of words in the dictionaries
  Gene: 44,463
  Disease: 159,477
Corpus
  1,000 biomedical sentences tagged by biologists
  Gene and disease names and their associations
Results (by annotator)
                  Gene                 Disease
              Hishiki   Nagata     Hishiki   Nagata
  Precision    57.7%     65.0%      78.0%     82.1%
  Recall       100%      100%       100%      100%
  F-score      73.2%     78.8%      87.6%     90.2%
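The dictionary-based longest matching named on this slide can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the experiments: the tokenization, the `max_len` limit, the case-insensitive comparison and the tiny dictionary are all assumptions made for the example. The `f_score` helper only reproduces the F-score arithmetic from the precision and recall values reported above.

```python
def longest_match(tokens, dictionary, max_len=8):
    """Greedy left-to-right longest matching of token n-grams against a dictionary.

    Returns a list of (start, end) token spans, end exclusive.
    max_len is an assumed cap on the entry length in tokens.
    """
    entries = {tuple(entry.lower().split()) for entry in dictionary}
    spans = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first and shrink until an entry matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in entries:
                matched = n
                break
        if matched:
            spans.append((i, i + matched))
            i += matched  # continue after the matched name
        else:
            i += 1
    return spans


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical dictionary and sentence, for illustration only.
gene_dictionary = {"B cell", "B cell lines", "interleukin 2", "interleukin 2 receptor"}
sentence = "activated B cell lines express interleukin 2 receptor".split()
spans = longest_match(sentence, gene_dictionary)
names = [" ".join(sentence[s:e]) for s, e in spans]
```

With recall fixed at 100%, `f_score(0.577, 1.0)` gives about 0.732, which matches the 73.2% reported for Gene under the Hishiki annotation.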
Experimental results(1)
Maximum Entropy based model
Features
  Local context (Name itself, Unigrams and Bigrams)
  POS (Name itself, Unigrams and Bigrams)
  Capitalization (All capital, Mixed capital, No capital)
  Digitalization (All digit, Mixed digit, No digit)
  24 Greek letters (alpha, beta, gamma, …)
  12 suffixes
Corpus
  1,000 biomedical sentences tagged by biologists
  Gene and disease names and their associations
Evaluation
  10-fold cross validation
Context window: L2 L1 NE R1 R2
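A rough sketch of how the surface features listed on this slide could be computed for one candidate name over the L2 L1 NE R1 R2 window. Several details are assumptions, since the slide does not define them: unigrams are taken to be L1 and R1, bigrams to be "L2 L1" and "R1 R2", and the suffix list is a hypothetical stand-in for the 12 suffixes, which the slide does not name. POS features are left out because they need a tagger.

```python
GREEK_LETTERS = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}  # the 24 Greek letters

# Hypothetical suffixes: the slide mentions 12 suffixes without listing them.
SUFFIXES = ("ase", "itis", "oma", "osis")


def capitalization(text):
    """'all', 'mixed' or 'none', judged over the alphabetic characters."""
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "all"
    if any(c.isupper() for c in letters):
        return "mixed"
    return "none"


def digitalization(text):
    """'all', 'mixed' or 'none', judged over the non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if chars and all(c.isdigit() for c in chars):
        return "all"
    if any(c.isdigit() for c in chars):
        return "mixed"
    return "none"


def extract_features(tokens, start, end):
    """Features for the candidate name tokens[start:end] (end exclusive)."""
    def tok(i):
        # Positions outside the sentence are padded.
        return tokens[i].lower() if 0 <= i < len(tokens) else "<PAD>"

    name_tokens = tokens[start:end]
    name = " ".join(name_tokens)
    l2, l1, r1, r2 = tok(start - 2), tok(start - 1), tok(end), tok(end + 1)
    return {
        "name": name.lower(),
        "uni_left": l1,
        "uni_right": r1,
        "bi_left": f"{l2} {l1}",
        "bi_right": f"{r1} {r2}",
        "capitalization": capitalization(name),
        "digitalization": digitalization(name),
        "has_greek": any(t.lower() in GREEK_LETTERS for t in name_tokens),
        "suffix": next((s for s in SUFFIXES if name.lower().endswith(s)), "none"),
    }


tokens = "The 47 kDa sterol regulatory element binding factor was activated".split()
features = extract_features(tokens, 1, 8)
```

Feature dictionaries of this kind can be passed to any maximum entropy (multinomial logistic regression) learner after one-hot encoding.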
Experimental results(2) Example of Corpus
Experimental results(3)::Useful features
[Table: features compared for Gene and Disease. Rows: Local context, Capitalization, Digitalization, Greek Letters, Affix, POS (NE; NE, Uni; NE, Uni, Bi)]
Experimental results(4)
Agreement for annotations between Hishiki-san and Nagata-san
  Gene: 90.3%
  Disease: 89.3%
Comparison
  Features
    Gene: Local context, Capitalization, POS of NE
    Disease: Local context, Capitalization, POS of NE and Unigram
  Evaluation: 10-fold cross validation
Results (P = precision, R = recall, F = F-score, in %)
  Test data                          Training data    Gene                 Disease
                                                      P      R      F      P      R      F
  Nagata (Gene: 650, Disease: 821)   Hishiki          88.6   81.4   84.8   90.4   92.8   91.6
                                     Nagata           86.8   90.9   88.8   89.6   95.7   92.6
                                     Intersection     90.6   80.0   85.0   91.1   89.9   90.5
                                     Union            85.4   91.7   88.4   88.8   97.4   92.9
  Hishiki (Gene: 577, Disease: 780)  Hishiki          80.2   83.0   81.6   88.7   95.9   92.2
                                     Nagata           77.5   91.5   83.9   86.8   97.6   91.9
                                     Intersection     81.7   81.3   81.5   89.9   93.3   91.6
                                     Union            76.5   92.5   83.8   85.5   98.7   91.6
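The Intersection and Union rows combine the two annotators' labels, and every row is scored with precision, recall and F-score. The sketch below shows one way to build the combined sets and compute the metrics. The annotation tuples are hypothetical toy data, and the representation as (sentence id, start, end, type) is an assumption. In the experiments the combined sets serve as training data for the model; here they are scored directly against one annotator only to show the arithmetic.

```python
def prf(predicted, gold):
    """Precision, recall and F-score of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (sentence_id, start_token, end_token, entity_type).
hishiki = {(1, 0, 2, "gene"), (1, 5, 6, "disease"), (2, 3, 4, "gene")}
nagata = {(1, 0, 2, "gene"), (2, 3, 4, "gene"), (2, 7, 9, "disease"), (3, 1, 2, "gene")}

intersection = hishiki & nagata  # names both annotators marked
union = hishiki | nagata         # names at least one annotator marked

p_int, r_int, f_int = prf(intersection, nagata)
p_uni, r_uni, f_uni = prf(union, nagata)
```

On the toy data the intersection has the higher precision and the union the higher recall, the same direction as the Intersection and Union rows in the table.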
Experimental results(5)::Gene
Experimental results(6)::Disease
Conclusions
Through the experiments, we found that NLP techniques (ML approach) play an important role in improving performance.
We expect that performance may increase further by considering more evidential features.
It is necessary to explore more evidential features and more effective methods to cope with NER difficulties.
We found that performance improved as the size of the training corpus increased.