NLP Techniques (Machine Learning)
NER in Biomedical Domain
Tsujii Laboratory
Hong-Woo CHUN (D1)
February 10th, 2005

Univ. of Tokyo 2/11 Introduction
As research in the biomedical domain has grown rapidly in recent years, a large body of natural language resources has been developed and has become a rich knowledge base.
NER (Named Entity Recognition) is in strong demand in the biomedical domain.
  In this project, NER identifies names of genes, gene products and diseases in biomedical text. From now on, genes and gene products are both referred to as "gene".
  NER in the biomedical domain has not yet reached the performance obtained in the newswire domain.

Univ. of Tokyo 3/11 Introduction::Problems in NER
Modifiers often precede basic NEs
  activated B cell lines
Biomedical NEs are sometimes very long
  47 kDa sterol regulatory element binding factor
Two or more NEs may share one head noun through a conjunction or disjunction construction
  91 and 84 kDa proteins
An entity may appear in various spelling forms
NEs may be cascaded
One NE may be embedded in another NE
Abbreviations are frequently used
Therefore, it is necessary to explore more evidential features and more effective methods to cope with such difficulties.

Univ. of Tokyo 4/11 NER without NLP tech.
Dictionary-based longest matching
Number of words in the dictionaries
  Gene: 44,463
  Disease: 159,477
Corpus
  1,000 biomedical sentences tagged by biologists
  Gene and Disease names and their Association

             Gene                Disease
             Hishiki   Nagata    Hishiki   Nagata
Precision    57.7%     65.0%     78.0%     82.1%
Recall       100%      100%      100%      100%
F-score      73.2%     78.8%     87.6%     90.2%
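A minimal sketch of dictionary-based longest matching, assuming whitespace-tokenized input and a dictionary keyed by lowercased token tuples. The function name `longest_match`, the `max_len` cap and the toy dictionary are hypothetical and only illustrate the idea; the slides do not describe the actual implementation.

```python
def longest_match(tokens, dictionary, max_len=8):
    """Greedy left-to-right longest match against a term dictionary.

    dictionary maps a tuple of lowercased tokens to an entity label.
    Returns a list of (start, end, label) spans, with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            label = dictionary.get(key)
            if label is not None:
                spans.append((i, i + n, label))
                i += n  # continue after the matched term
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Toy dictionary (illustrative entries, not taken from the real dictionaries).
TOY_DICT = {
    ("breast", "cancer"): "Disease",
    ("cancer",): "Disease",
    ("brca1",): "Gene",
}

tokens = "Mutations in BRCA1 increase breast cancer risk".split()
result = longest_match(tokens, TOY_DICT)
print(result)
```

Because the longest candidate is tried first, "breast cancer" is tagged as one Disease span rather than "cancer" alone. Matching everything the dictionary contains favors recall over precision, which is the trade-off visible in the table above.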

Univ. of Tokyo 5/11 Experimental results(1)
Maximum Entropy based model
Features
  Local context (Name itself, Unigrams and Bigrams)
  POS (Name itself, Unigrams and Bigrams)
  Capitalization (All capital, Mixed capital, No capital)
  Digitalization (All digit, Mixed digit, No digit)
  24 Greek letters (alpha, beta, gamma, …)
  12 suffixes
  Context window: L2 L1 NE R1 R2
Corpus
  1,000 biomedical sentences tagged by biologists
  Gene and Disease names and their Association
Evaluation
  10-fold cross validation
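The feature set above can be sketched as a function that turns one candidate name and its L2 L1 NE R1 R2 window into string-valued features for a maximum entropy classifier. This is an illustrative sketch only: the feature names, the padding token, the exact definitions of the "mixed" classes and the Greek-letter subset are assumptions, and the POS and suffix features are left out because they need a tagger and the list of 12 suffixes, neither of which the slide provides.

```python
PAD = "<pad>"
# Subset for illustration; the slide mentions 24 Greek letters.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def cap_class(text):
    """All capital / Mixed capital / No capital (assumed definitions)."""
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "ALL_CAP"
    if any(c.isupper() for c in letters):
        return "MIXED_CAP"
    return "NO_CAP"


def digit_class(text):
    """All digit / Mixed digit / No digit (assumed definitions)."""
    chars = [c for c in text if not c.isspace()]
    if chars and all(c.isdigit() for c in chars):
        return "ALL_DIGIT"
    if any(c.isdigit() for c in chars):
        return "MIXED_DIGIT"
    return "NO_DIGIT"


def extract_features(tokens, start, end):
    """Features for the candidate name tokens[start:end]."""
    name = tokens[start:end]
    text = " ".join(name)
    lowered = [t.lower() for t in tokens]
    left = ([PAD, PAD] + lowered[:start])[-2:]   # [L2, L1]
    right = (lowered[end:] + [PAD, PAD])[:2]     # [R1, R2]
    feats = {
        "name=" + text.lower(),
        "L2=" + left[0], "L1=" + left[1],        # unigrams
        "R1=" + right[0], "R2=" + right[1],
        "L2_L1=" + left[0] + "_" + left[1],      # bigrams
        "R1_R2=" + right[0] + "_" + right[1],
        "cap=" + cap_class(text),
        "digit=" + digit_class(text),
    }
    if any(t.lower() in GREEK for t in name):
        feats.add("has_greek")
    return feats


tokens = "the IL-2 receptor alpha gene was expressed".split()
feats = extract_features(tokens, 1, 4)
print(sorted(feats))
```

Each string can be treated as a binary indicator feature, which is the usual input representation for maximum entropy models.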

Univ. of Tokyo 6/11 Experimental results(2) Example of Corpus

Univ. of Tokyo 7/11 Experimental results(3)::Useful features
Table columns: Gene, Disease
Table rows:
  Local context
  Capitalization
  Digitalization
  Greek Letters
  Affix
  POS: NE
  POS: NE, Uni
  POS: NE, Uni, Bi

Univ. of Tokyo 8/11 Experimental results(4)
Agreement for annotations between Hishiki-san and Nagata-san
  Gene: 90.3%
  Disease: 89.3%
Comparison
  Features
    Gene: Local context, Capitalization, POS of NE
    Disease: Local context, Capitalization, POS of NE and Unigram
  Evaluation: 10-fold cross validation

Test data                          Training data   Gene                 Disease
                                                   P      R      F      P      R      F
Nagata (Gene: 650, Disease: 821)   Hishiki         88.6   81.4   84.8   90.4   92.8   91.6
                                   Nagata          86.8   90.9   88.8   89.6   95.7   92.6
                                   Intersection    90.6   80.0   85.0   91.1   89.9   90.5
                                   Union           85.4   91.7   88.4   88.8   97.4   92.9
Hishiki (Gene: 577, Disease: 780)  Hishiki         80.2   83.0   81.6   88.7   95.9   92.2
                                   Nagata          77.5   91.5   83.9   86.8   97.6   91.9
                                   Intersection    81.7   81.3   81.5   89.9   93.3   91.6
                                   Union           76.5   92.5   83.8   85.5   98.7   91.6
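The "Intersection" and "Union" training sets and the P/R/F columns can be illustrated with plain set operations. This is a sketch under assumptions: annotations are represented here as hypothetical (sentence id, start, end, label) tuples with exact-match scoring, and the toy data is invented; the slides do not specify the matching criterion that was actually used.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(gold, predicted):
    """Exact-match precision, recall and F-score between two span sets."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy annotations from two annotators: (sentence_id, start, end, label).
annot_a = {(0, 2, 5, "Gene"), (0, 7, 9, "Disease"), (1, 0, 1, "Gene")}
annot_b = {(0, 2, 5, "Gene"), (1, 0, 1, "Gene"), (1, 4, 6, "Disease")}

intersection = annot_a & annot_b  # names both annotators marked
union = annot_a | annot_b         # names either annotator marked

print(len(intersection), len(union))
print(prf(annot_a, annot_b))
# Cross-check against one table row: P = 88.6, R = 81.4 for Gene.
print(round(100 * f_score(0.886, 0.814), 1))
```

Training on the intersection keeps only names both annotators agree on, which tends to raise precision, while training on the union adds every name either annotator marked, which tends to raise recall; the table rows show the same pattern.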

Univ. of Tokyo 9/11 Experimental results(5)::Gene

Univ. of Tokyo 10/11 Experimental results(6)::Disease

Univ. of Tokyo 11/11 Conclusions
Through the experiments, we found that NLP techniques (the ML approach) play an important role in improving performance.
We can expect performance to increase further when more evidential features are considered.
It is necessary to explore more evidential features and more effective methods to cope with the difficulties of NER.
We found that performance improved as the size of the training corpus increased.

Univ. of Tokyo 12/11 Thank you!!!

Univ. of Tokyo 13/11 Gaussian Prior (Hishiki)

Gaussian Prior   Gene                 Disease
                 P      R      F      P      R      F
20               73.8   78.5   76.0   85.6   96.5   90.7
50               75.2   79.0   77.1   87.0   95.5   91.0
80               75.4   79.0   77.1   87.3   95.4   91.2
100              75.4   79.9   77.2   87.5   95.4   91.2
200              75.6   79.0   77.2   87.6   95.3   91.3
300              75.6   79.0   77.2   87.7   95.2   91.3
400              75.4   79.0   77.1   87.8   95.2   91.3
500              75.4   78.9   77.1   87.8   95.1   91.3
800              75.3   78.2   76.7   87.8   94.9   91.2
1000             75.3   78.1   76.7   87.8   94.7   91.1
1500             75.5   77.4   76.4   87.7   94.7   91.1
2000             75.5   76.7   76.1   87.7   94.7   91.1
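For reference, a Gaussian prior on maximum entropy weights is commonly written as a penalty on the conditional log-likelihood. The form below is the standard zero-mean version with a single shared variance; the symbols are assumptions introduced here, and the slide does not say how the values in the first column map onto this parameterization.

```latex
\[
L(\lambda) \;=\; \sum_{(x,y)} \log p_\lambda(y \mid x) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2},
\qquad
p_\lambda(y \mid x) \;=\;
\frac{\exp\bigl(\sum_i \lambda_i f_i(x,y)\bigr)}
     {\sum_{y'} \exp\bigl(\sum_i \lambda_i f_i(x,y')\bigr)}
\]
```

Here $f_i$ are the binary features, $\lambda_i$ their weights, and $\sigma^2$ the prior variance: a smaller variance pulls the weights more strongly toward zero, and a larger one weakens the regularization.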

Univ. of Tokyo 14/11 Experimental results (Hishiki)

Features            Gene                 Disease
                    P      R      F      P      R      F
Name, context (W)   76.6   83.4   79.8   89.1   95.6   92.3
Caps Info           73.5   68.1   70.7   78.0   99.4   87.4
Digit Info.         63.7   86.8   73.5   77.9   99.5   87.4
Greek               63.2   84.4   72.3   77.9   99.5   87.4
Affix               62.9   83.7   71.8   78.0   99.5   87.4
POS                 64.4   78.9   70.9   78.1   99.2   87.4
W+Caps Info.        80.7   84.6   82.6   87.8   98.2   92.7
W+Digit Info.       79.0   83.9   81.3   87.7   98.2   92.7
W+Greek             75.2   84.1   79.4   87.6   98.3   92.6
W+Affix             75.0   84.9   79.7   87.7   98.2   92.7
W+D+G               79.7   84.2   81.9   87.7   98.2   92.7
W+C+D               80.7   84.6   82.6   87.7   98.2   92.7
W+C+G               80.4   84.1   82.2   87.7   98.2   92.7
W+A+C               80.6   84.2   82.4   87.8   98.2   92.7
W+A+D               78.9   84.1   81.4   87.8   98.2   92.7
W+A+G               75.0   83.9   79.2   87.6   98.2   92.6
W+C+D+G             80.5   83.7   82.1   87.8   98.2   92.7
W+A+C+D             80.5   84.2   82.3   88.0   98.3   92.9
W+A+C+G             80.3   84.1   82.1   87.8   98.2   92.7
W+A+D+G             79.5   83.9   81.6   87.9   98.3   92.8
W+A+C+D+G           80.5   83.9   82.2   87.9   98.2   92.8

Univ. of Tokyo 15/11 Experimental results (Hishiki)

Features            Gene                 Disease
                    P      R      F      P      R      F
Name, context (W)   76.6   83.4   79.8   89.1   95.6   92.3
W+POS of NE         76.3   84.2   80.1   87.7   97.6   92.0
W+POS(NE,uni)       75.9   82.3   79.0   88.6   95.8   92.1
W+POS(NE,uni,bi)    76.0   79.4   77.6   87.8   94.9   91.2
W+Caps Info.        80.7   84.6   82.6   87.8   98.2   92.7
W+C+POS             81.0   83.5   82.3   87.8   97.6   92.4
W+C+POS1            80.0   82.5   81.2   88.6   95.6   92.0
W+C+POS2            77.2   78.9   78.0   88.4   95.1   91.7
W+C+D               80.7   84.6   82.6   87.7   98.2   92.7
W+C+D+POS           80.8   83.0   81.9   87.6   97.6   92.3
W+C+D+POS1          79.9   82.5   81.2   88.7   95.9   92.2
W+C+D+POS2          77.2   79.2   78.2   88.4   95.1   91.7
W+A+C+D             80.5   84.2   82.3   88.0   98.3   92.9
W+A+C+D+POS         81.0   83.4   82.2   87.8   97.6   92.4
W+A+C+D+POS1        79.8   82.3   81.1   88.8   95.8   92.2
W+A+C+D+POS2        77.0   79.0   78.0   88.1   94.5   91.2

Univ. of Tokyo 16/11 Experimental results (Nagata)

Features            Gene                 Disease
                    P      R      F      P      R      F
Name, context (W)   82.7   88.3   85.4   89.7   95.0   92.3
Caps Info           73.4   88.8   80.4   82.1   99.4   89.9
Digit Info.         72.2   89.7   80.0   82.1   99.5   90.0
Greek               71.7   86.2   78.3   82.1   99.5   90.0
Affix               71.6   85.1   77.8   82.1   99.5   90.0
POS                 72.8   86.3   79.0   82.2   99.3   89.9
W+Caps Info.        86.4   90.2   88.3   88.5   97.8   92.9
W+Digit Info.       82.2   91.2   86.5   88.5   97.9   93.0
W+Greek             80.9   92.0   86.1   88.6   98.1   93.1
W+Affix             80.4   92.0   85.8   88.6   98.1   93.1
W+D+G               82.7   91.4   86.8   88.6   98.2   93.1
W+C+D               85.9   90.2   88.0   88.5   97.7   92.9
W+C+G               86.2   90.6   88.4   88.5   97.7   92.9
W+A+C               86.0   90.2   88.1   88.5   97.8   92.9
W+A+D               82.3   91.4   86.6   88.5   98.1   93.0
W+A+G               80.7   91.5   85.8   88.6   98.2   93.1
W+C+D+G             86.1   90.8   88.4   88.6   98.1   93.1
W+A+C+D             85.9   90.2   88.0   88.5   97.8   92.9
W+A+C+G             86.2   90.5   88.3   88.7   98.1   93.1
W+A+D+G             82.6   91.4   86.8   88.7   98.1   93.1
W+A+C+D+G           85.7   90.6   88.1   88.6   97.8   93.0

Univ. of Tokyo 17/11 Experimental results (Nagata)

Features            Gene                 Disease
                    P      R      F      P      R      F
Name, context (W)   82.7   88.3   85.4   89.7   95.0   92.3
W+POS               81.5   90.6   85.8   88.5   96.0   92.1
W+POS1              81.7   90.6   85.9   89.8   95.5   92.6
W+POS2              81.8   86.3   84.0   89.3   95.4   92.2
W+Caps Info.        86.4   90.2   88.3   88.5   97.8   92.9
W+C+POS             86.3   89.4   87.8   88.6   97.0   92.6
W+C+POS1            85.9   90.2   88.0   90.0   96.1   92.9
W+C+POS2            85.7   87.5   86.6   89.4   95.2   92.2
W+C+D+G             86.1   90.8   88.4   88.6   98.1   93.1
W+C+D+G+POS         86.5   89.1   87.8   88.6   97.1   92.6
W+C+D+G+POS1        85.5   89.8   87.6   89.9   96.1   92.9
W+C+D+G+POS2        85.3   87.5   86.4   89.5   95.1   92.2
W+C+G+POS           86.7   89.1   87.9   88.5   97.0   92.6
W+C+G+POS1          85.6   89.7   87.6   89.8   96.3   92.9
W+C+G+POS2          85.2   87.5   86.3   89.2   95.0   92.0

Univ. of Tokyo 18/11 Prefix and suffix
Important cue for terminology identification
  ~cin       actinomycin
  ~mide      cycloheximide
  ~zole      sulphamethoxazole
  ~lipid     phospholipids
  ~rogen     estrogen
  ~vitamin   dihydroxyvitamin
  etc …
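The affix cues on this slide can be checked with a simple string test. The sketch below is an assumption-laden illustration: the cue list contains only the six patterns shown on the slide, the function name `affix_cues` is hypothetical, and the naive plural stripping is added here so that "phospholipids" still matches "~lipid".

```python
# Only the cues shown on the slide; the full list is not given.
AFFIX_CUES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin")


def affix_cues(word):
    """Return the cues that the word ends with, ignoring a plural 's'."""
    w = word.lower()
    # Naive plural stripping (an assumption, not from the slides).
    stem = w[:-1] if w.endswith("s") and len(w) > 3 else w
    return [cue for cue in AFFIX_CUES if stem.endswith(cue)]


examples = ["actinomycin", "cycloheximide", "sulphamethoxazole",
            "phospholipids", "estrogen", "dihydroxyvitamin", "expressed"]
matches = {word: affix_cues(word) for word in examples}
for word, cues in matches.items():
    print(word, cues)
```

Each matched cue can be emitted as one binary feature for the classifier, in the same way as the capitalization and digit features.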
