Semi-Automatic Indexing of Full Text Biomedical Articles Washington D.C. October 25, 2005 Clifford W. Gay Lister Hill National Center for Biomedical Communications.


1 Semi-Automatic Indexing of Full Text Biomedical Articles Washington D.C. October 25, 2005 Clifford W. Gay Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

2 Acknowledgments
- Alan R. Aronson, PhD
- Mehmet Kayaalp, MD, PhD

3 Outline
- Introduction
  - The System: Medical Text Indexer (MTI)
  - The Data: Online biomedical journals
  - The Task: Emulate Medline indexing using full text
- Results
  - Observations on PubMed Central articles
  - Model selection results
  - Recent work

5 Why Semi-Automatic Indexing?
- U.S. National Library of Medicine indexes 5000 journal titles
  - Supports over 60 million PubMed searches each month
  - Has 130 indexers
  - Indexed 570,000 articles in 2004
    - Will need to index 1,000,000 very soon
  - Automated support is helping to meet this demand
    - MTI was used on 26% of articles in 2004
- More about MTI
  - Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004; 11(Pt 1): 268-72. PMID: 15360816

6 Medical Text Indexer (MTI)
[Diagram of the MTI processing paths. Labels: Title + Abstract et al.; MetaMap; Phrasex; Phrases; UMLS Concepts; Trigram Phrase Matching; Restrict to MeSH; MeSH Headings; PubMed Related Citations; Rel. Cits.; Extract MeSH; Postprocessing; Ordered list of MeSH Terms]

7 DCMS with MTI Suggestions

9 Why Full Text?
- Medical Text Indexer uses article title and abstract
- However
  - Human indexers taught not to use abstract
  - Author's complete intent may not be in abstract
  - Check tags may only appear in a table or methods section
- If MTI indexes from full text articles it may
  - Find central concepts missing from abstract
  - Identify terms when article has no abstract
  - More accurately select check tags
  - Be in better compliance with indexing policy

10 Test Collection Selection
- Available online from PubMed Central
- Consistent XML format
  - Identifies title, abstract, sections, tables, figures, references, etc.
- 500 articles from 17 diverse biomedical journals
- Did not use:
  - References
  - Graphics
  - Math

11 Test Collection
- 5 Clinical journals (165):
  - Breast Cancer Research (11)
  - Journal of Clinical Microbiology (80)
- 3 Organization based journals (28):
  - Journal of American Medical Informatics Assoc. (10)
  - Proceedings of the National Academy of Sciences (11)
- 9 Journals in other categories: Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22)

13 Indexing Task

14 Example Article
- Medline Indexing
  - beta-Lactamases/*genetics/*metabolism
  - Enterobacteriaceae/drug effects/*enzymology/genetics
  - Plasmids/*genetics
  - Genes, Bacterial/genetics
  - Genotype
  - Kinetics
  - Microbial Sensitivity Tests
  - Molecular Sequence Data
  - Research Support, Non-U.S. Gov't
- MTI Indexing (terms were marked by source path on the slide: MMI; REL; MMI & REL)
  - DNA Transposable Elements
  - Escherichia coli
  - Genes, Bacterial
  - Cloning, Molecular
  - Klebsiella pneumoniae
  - Amino Acid Sequence
  - Microbial Sensitivity Tests
  - Cephalothin
  - Proteus mirabilis
  - Erwinia
  - Salmonella typhimurium
  - Enterobacteriaceae Infections
  - Lactams
  - beta-Lactamases
  - Plasmids
  - Enterobacteriaceae
  - beta-Lactam Resistance
  - Conjugation, Genetic
  - Cephalosporin Resistance
  - Cefotaxime
  - Nucleotide Sequences
  - Molecular Sequence Data
  - Cephalosporins
  - Chromosomes, Bacterial
  - DNA, Bacterial
- Recall = 0.67, Precision = 0.24, F2 measure = 0.492

15 Evaluation
- F2 Measure
  - Weighted harmonic mean of Recall and Precision
  - Weights Recall twice as important as Precision
  - Values: 0.0 to 1.0
- Computed for each article and averaged
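The evaluation measure can be sketched in a few lines. This is a minimal illustration that assumes the standard F-beta definition with beta = 2; the slides describe the measure only in words, so the exact formula and the function names `f_beta` and `average_f2` are assumptions, not taken from the presentation.

```python
from typing import List, Tuple


def f_beta(recall: float, precision: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of recall and precision.

    With beta = 2, recall counts twice as much as precision
    (assumed standard F-beta form; the slides do not spell it out).
    """
    if recall == 0.0 and precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_f2(per_article: List[Tuple[float, float]]) -> float:
    """Compute F2 for each (recall, precision) pair, then average the scores."""
    scores = [f_beta(r, p) for r, p in per_article]
    return sum(scores) / len(scores) if scores else 0.0


# Rounded values from the example article: recall 0.67, precision 0.24.
print(f_beta(0.67, 0.24))
```

With the rounded recall and precision from the example article, this gives roughly 0.49, in line with the 0.492 reported there. Averaging per-article scores, as `average_f2` does, is not the same as computing F2 from averaged recall and precision, which is why the summary figures cannot be reproduced from the average recall and precision alone.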

17 Section Header Classes
- Semantically equivalent section headers
- MATERIALS AND METHODS class:
  - Materials and Method(s)
  - Method(s)
  - Scoring Methods
  - Experimental Procedures
  - Other Methods Tested
- CAPTIONS class:
  - the titles and captions from tables and figures
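Grouping headers into classes could look something like the sketch below. It is only an illustration: the pattern table is built from the header variants listed on this slide, the `classify_header` function is hypothetical, and the fallback names "NO HEADER" and "OTHER" are borrowed from the class names used elsewhere in the talk. How the actual system normalized headers is not described in the slides.

```python
import re
from typing import Optional

# Hypothetical pattern table, covering only the variants named on the slide.
HEADER_CLASSES = {
    "MATERIALS AND METHODS": [
        r"materials and methods?",
        r"methods?",
        r"scoring methods",
        r"experimental procedures",
        r"other methods tested",
    ],
}


def classify_header(header: Optional[str]) -> str:
    """Map a raw section header to a section class name."""
    if header is None or not header.strip():
        return "NO HEADER"
    # Normalize case and whitespace before matching.
    normalized = re.sub(r"\s+", " ", header.strip().lower())
    for section_class, patterns in HEADER_CLASSES.items():
        if any(re.fullmatch(p, normalized) for p in patterns):
            return section_class
    # Headers that match no known class fall through to OTHER.
    return "OTHER"


print(classify_header("Experimental  Procedures"))
print(classify_header("Genetic environment of blaKLUA-1."))
```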

18 Section Class Performance

Section Class    Average F2
CAPTIONS         0.3175
ABSTRACT         0.2960
INTRODUCTION     0.2869
RESULTS          0.2790
DISCUSSION       0.2734
NO HEADER        0.2574
…                …
CONCLUSIONS      0.1961
ABBREVIATIONS    0.1304

20 Experiments
- Varied MTI components used
  - MetaMap Indexing (MMI)
  - Related Citations (REL)
- Varied section classes processed
  - Used model selection
  - Used binary weighting for sections
- A model is
  - A selection of section classes and
  - The text in those sections
  - That represents the article
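The notion of a model with binary section weights can be illustrated as follows. This is a sketch under stated assumptions: the `Section` type, the `build_model_text` function, and the sample weights are hypothetical, and the slides do not say how the selected text was assembled before being passed to MTI.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Section:
    section_class: str  # e.g. "ABSTRACT", "RESULTS", "CAPTIONS"
    text: str


def build_model_text(sections: List[Section], weights: Dict[str, int]) -> str:
    """Keep the text of every section whose class has weight 1.

    Classes missing from `weights` are treated as weight 0 (excluded).
    """
    selected = [s.text for s in sections if weights.get(s.section_class, 0) == 1]
    return "\n".join(selected)


# Hypothetical article and model, for illustration only.
article = [
    Section("ABSTRACT", "abstract text"),
    Section("RESULTS", "results text"),
    Section("ABBREVIATIONS", "abbreviation list"),
]
model = {"ABSTRACT": 1, "RESULTS": 1, "ABBREVIATIONS": 0}
print(build_model_text(article, model))
```

Under this representation, model selection amounts to searching over the 0/1 assignments and keeping the one with the best average F2 on the test collection.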

21 Production Baseline
[Diagram labels: Title+Abstract; MMI; REL]
F2 = 0.457

22 Naive Mode
[Diagram labels: Title+Abstract; Materials and Methods; Results and Discussion; No Header; All Section Classes; MMI; REL]
F2 = 0.453 (-0.9%)

23 MetaMap Indexing Mode
[Diagram labels: Title+Abstract; Introduction; Results; Discussion; Other; No Header; Captions; MMI; REL]
F2 = 0.373 (-18.4%)

24 Augmented Mode
[Diagram labels: Title+Abstract; Introduction; Results; Discussion; Other; No Header; Captions; MMI; REL]
F2 = 0.475 (+3.9%)

25 Refined Augmented Mode
[Diagram labels: Title+Abstract; Captions; Results; Background; MMI; REL]
F2 = 0.485 (+6.1%)

26 Full MTI Mode
[Diagram labels: Title+Abstract; Introduction; Results; Discussion; Other; No Header; Captions; MMI model; MMI; REL]
F2 = 0.488 (+6.8%)

27 Refined Full MTI
[Diagram labels: Title+Abstract; Results; Results and Discussion; No Header; Captions; Conclusions; MMI; REL]
F2 = 0.491 (+7.4%)

28 MTI Performance Summary

Indexing Model                          Recall  Precision  Avg. F2
Production Baseline (Ti, Ab)            0.53    0.32       0.457
Naive Mode (full text)                  0.57    0.27       0.453
Augmented Mode (MMI + REL (Ti, Ab))     0.59    0.29       0.475
Augmented Mode (refined)                0.60    0.30       0.485
Full MTI (MMI + REL common sections)    0.60    0.30       0.488
Full MTI (refined)                      0.60    0.31       0.491
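The percentages quoted on the individual model slides appear to be relative changes in average F2 against the production baseline. The short check below recomputes them from the summary figures; the helper name `relative_change_pct` is illustrative only.

```python
BASELINE_F2 = 0.457  # Production Baseline (Ti, Ab)

# Average F2 values copied from the model slides and the summary table.
AVERAGE_F2 = {
    "Naive Mode (full text)": 0.453,
    "MetaMap Indexing Mode": 0.373,
    "Augmented Mode": 0.475,
    "Augmented Mode (refined)": 0.485,
    "Full MTI": 0.488,
    "Full MTI (refined)": 0.491,
}


def relative_change_pct(f2: float, baseline: float = BASELINE_F2) -> float:
    """Relative change versus the baseline, in percent, rounded to one decimal."""
    return round((f2 - baseline) / baseline * 100, 1)


for name, f2 in AVERAGE_F2.items():
    print(f"{name}: {relative_change_pct(f2):+.1f}%")
```

The recomputed values match the slide annotations, from -0.9% for Naive Mode to +7.4% for the refined Full MTI model.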

30 Improvement Potential
- With current model
  - No cut off at 25 terms yields maximum recall of 0.79
- If all good terms prioritized correctly
  - F2 = 0.64
  - Improvement over baseline 7% → 40%

31 Increase REL Citations
- MTI currently uses 10 Related Citations
- Optimal number for full text articles is 15
- Best model confirmed for this setting
- Additional improvement in F2 = 0.01

32 Summarization
- Selecting important text before MTI processing
- Using Yeh, Ke, Yang, Meng approach
- Combines
  - Latent Semantic Analysis and
  - Salton's Text Relationship Map
- Start with current model
- Document representation includes
  - Bag of words
  - MetaMap identified concepts

33 NLM Indexing Initiative
Clifford W. Gay
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact: cliff@nlm.nih.gov
Web: ii.nlm.nih.gov/fulltext.shtml

34 NONE Sections
- Most appear in articles that have no abstract
  - 20/23
- Some are errors
  - 4 have "Introduction" header in publisher version
  - 2 appear within other sections with headers
- Many contain the primary text of the article
  - Comments, Editorials, Letters (11/23)

35 Other Sections
- Other section class has 525 sections (16%)
- Non-standard article organization
  - Common in Review articles
- Example
  - β-Lactamases of Kluyvera ascorbata, Probable Progenitors of Some Plasmid-Encoded CTX-M Types
    - Bacterial strains.
    - Antimicrobial agents and susceptibility testing.
    - Kinetic and IEF analyses.
    - Genetic characterization of blaKLUA.
    - Genetic environment of blaKLUA-1.
    - Arguments for mobilization of chromosomal blaKLUA gene.

36 Ranking Function
- Made ranking function for Related Citations more like MetaMap Indexing
- Resulted in a more inclusive model
  - Materials and Methods
  - Introduction
- F2 measure = 0.4865

37 Tuning Path Weight
- Ratio of weights between the two indexing paths
  - MetaMap Indexing: 7
  - Related Citations: 2
- No improvement possible

38 Partial Weight for Singleton Headers
- OTHER section class
  - Header is unique
  - Contain content terms
- Gave section class weight between 0 and 1
  - Some recall improvement
  - No collection wide improvement in F2

