
1 HIKM’2006 AMTEx
Automatic Document Indexing in Large Medical Collections
Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis, Technical University of Crete, Chania, Greece
Evangelos E. Milios, Dalhousie University, Halifax, Canada

2 Overview
- The need for automatic indexing in large medical collections
- Current approach: the US NLM MMTx
- The AMTEx approach to medical document indexing
- AMTEx resources: MeSH & C/NC value
- Experiments & evaluation
- Final thoughts on results & possible future steps for refinement

3 Motivation and Objectives
- Extraction of terms indicating document content
- Automatic indexing in large medical collections (e.g. MEDLINE)
  – Need for automated document indexing
  – Manual indexing in MEDLINE by experts
- MEDLINE indexing is based on MeSH:
  – Subset of UMLS Metathesaurus
- MMTx, the U.S. NLM approach:
  – Maps biomedical documents to UMLS concepts

4 MMTx (MetaMap Transfer)
Maps arbitrary text to UMLS Metathesaurus concepts:
- Parsing (syntactic analysis, linguistic filter)
- Variant Generation (uses SPECIALIST Lexicon)
- Candidate Retrieval (mapping process to Metathesaurus concepts)
- Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)

5 MMTx Example
- Parsing
  – Shallow syntactic analysis of the input text
  – Linguistic filtering: isolates noun phrases
- Variant Generation
  – e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
- Candidate Retrieval
  – Candidate Metathesaurus concepts for the variant “osa”: osa [osa antigen], osa [osa gene product], osa [osa protein], osa [obstructive sleep apnea]
- Candidate Evaluation
  Candidate                  Score
  Obstructive Sleep Apnea    1000
  Sleep Apnea                901
  Apnea                      827
  …
  Sleeping                   793
  Sleepy                     755

6 MMTx limitations
- MMTx focuses on UMLS rather than MeSH
  – But MEDLINE indexing is based on MeSH
- Exhaustive variant generation: the initial phrase is iteratively expanded to all possible UMLS variants, which leads to:
  – term overgeneration
  – term concept diffusion
  – unrelated terms added to the final candidate list

7 The AMTEx method
New method for automatic indexing of medical documents. Main idea:
- Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
- Extracts general single and multi-word terms
- Extracted terms are validated against MeSH

8 AMTEx Outline
- INPUT: Document Collection
- C/NC value: Multi-word Term Extraction & Term Ranking
- MeSH Term Validation
- Single-word Term Extraction: non-MeSH multi-word terms are broken down & validated against MeSH
- Variant Generation
- Term Expansion (MeSH)
- OUTPUT: MeSH Term Lists
Resource: MeSH Thesaurus

9 MeSH: Medical Subject Headings
The NLM medical & biological terms thesaurus:
- Organized in IS-A hierarchies
  – more than 15 taxonomies & more than 22,000 terms
  – a term may appear in multiple taxonomies
- No PART-OF relationships
- Terms organized into synonym sets called entry terms, including stemmed term forms

10 Fragment of the MeSH IS-A Hierarchy

11 The C/NC value method
- Hybrid linguistic/statistical term extraction method
- Domain independent
- Specifically designed for the identification of multi-word and nested terms:
  – compound & multi-word terms are very common in the biomedical domain
  – multi-word terms are often used in indexing

12 C-value
C-value: a phrase may be a term if it often appears within other candidate terms
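
The slide gives no formula, so the following is only a minimal sketch of the commonly published C-value definition (Frantzi, Ananiadou & Mima): a candidate's frequency is weighted by the log of its length, and for a candidate nested in longer candidates the average frequency of those longer candidates is subtracted. The function names `is_sublist` and `c_values` are illustrative, and the nested-frequency bookkeeping is simplified compared with the full algorithm.

```python
import math
from collections import defaultdict


def is_sublist(short, long):
    """True if `short` occurs as a contiguous run of words inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(freq):
    """Compute a simplified C-value for each candidate term.

    `freq` maps a candidate (a tuple of words) to its corpus frequency.
    Assumption: the nested count uses the raw frequency of each longer
    candidate, without the double-counting corrections of the full method.
    """
    candidates = list(freq)
    containers = defaultdict(list)  # candidate -> longer candidates containing it
    for a in candidates:
        for b in candidates:
            if len(b) > len(a) and is_sublist(a, b):
                containers[a].append(b)

    scores = {}
    for a in candidates:
        longer = containers[a]
        if not longer:
            scores[a] = math.log2(len(a)) * freq[a]
        else:
            penalty = sum(freq[b] for b in longer) / len(longer)
            scores[a] = math.log2(len(a)) * (freq[a] - penalty)
    return scores
```

With toy frequencies such as `{("obstructive", "sleep", "apnea"): 4, ("sleep", "apnea"): 10}`, the shorter candidate keeps only the part of its frequency that is not explained by the longer one. Note that `log2(1) = 0`, so single words score zero, which matches the method's multi-word focus.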

13 NC-value
NC-value: a phrase is more likely a term if it often appears in specific word context
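
As a rough illustration of the context idea, the sketch below follows the usual published form of the NC-value: each context word is weighted by the fraction of top-ranked terms it appears with, and the final score combines the C-value with the weighted context counts. The 0.8/0.2 mix comes from the original C/NC-value paper, not from these slides, and the function and parameter names are hypothetical.

```python
from collections import Counter


def context_weights(term_contexts):
    """Weight of a context word = (number of terms it appears with) / (number of terms).

    `term_contexts` maps each top-ranked term to the context words seen next to it.
    """
    n = len(term_contexts)
    seen_with = Counter(
        word for words in term_contexts.values() for word in set(words)
    )
    return {word: count / n for word, count in seen_with.items()}


def nc_value(c_value, context_counts, weights, alpha=0.8):
    """Combine a candidate's C-value with its weighted context frequencies.

    `context_counts` maps a context word to how often it occurs next to the
    candidate. `alpha` = 0.8 is the mix reported for the original method
    (an assumption here).
    """
    context_score = sum(
        count * weights.get(word, 0.0) for word, count in context_counts.items()
    )
    return alpha * c_value + (1 - alpha) * context_score
```

A candidate surrounded by words that frequently accompany known terms is pushed up the ranking, while one with an uninformative context keeps roughly 80% of its C-value.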

14 AMTEx step 1: C/NC value Multi-word Term Extraction & Ranking
- Part-of-Speech Tagging
- Linguistic filtering (N = noun, A = adjective, P = preposition), using one of:
  – N+ N
  – (A|N)+ N
  – ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N
- Candidate term ranking based on C/NC-value
- Keep terms up to threshold T1
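
The linguistic filter can be sketched as a regular expression over a string of part-of-speech codes. This assumes the tagger output has already been reduced to one character per token (`N` noun, `A` adjective, `P` preposition, anything else for other tags), and it reads the slide's filters as `N+ N`, `(A|N)+ N` and `((A|N)+ | ((A|N)* (N P)?) (A|N)*) N`. The names `FILTERS` and `candidate_terms` are illustrative.

```python
import re

# One regex per filter, written over single-character POS codes.
FILTERS = {
    "noun_only": re.compile(r"N+N"),
    "adj_noun": re.compile(r"[AN]+N"),
    "open": re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N"),
}


def candidate_terms(tagged, pattern, max_len=5):
    """Return every multi-word span whose tag sequence fully matches `pattern`.

    `tagged` is a list of (word, code) pairs. Nested spans are kept on
    purpose, because C-value scoring needs the nested candidates too.
    """
    tags = "".join(code for _, code in tagged)
    found = []
    for start in range(len(tagged)):
        last_end = min(start + max_len, len(tagged))
        for end in range(start + 2, last_end + 1):
            if pattern.fullmatch(tags[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found
```

The stricter the filter, the fewer candidates reach the ranking stage: on "obstructive sleep apnea in adults" the noun-only filter keeps just "sleep apnea", while the open filter also admits spans containing the preposition.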

15 AMTEx step 2: MeSH Term Validation
- Candidate terms are validated against the MeSH Thesaurus (simple string matching)
- Only candidate terms matching MeSH are kept
- Multi-word candidates not matching MeSH may contain (shorter) MeSH terms

16 AMTEx step 3: Single-word Term Extraction
For multi-word terms not matching MeSH:
- Multi-word terms are split into single-word terms
- Single-word terms are validated against MeSH
- Matched MeSH terms are added to the term list
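
Steps 2 and 3 together can be sketched as a set lookup with a single-word fallback. This is a minimal sketch assuming case-insensitive exact matching and whitespace tokenization; the slides only say "simple string matching", and `validate_against_mesh` is a hypothetical name.

```python
def validate_against_mesh(candidates, mesh_terms):
    """Keep candidates that match MeSH; split the rest into single words.

    Step 2: a candidate that matches a MeSH term exactly is kept as is.
    Step 3: an unmatched multi-word candidate is split into words, and each
    word that is itself a MeSH term is kept.
    Order of first appearance is preserved and duplicates are dropped.
    """
    mesh = {term.lower() for term in mesh_terms}
    validated = []
    for candidate in candidates:
        key = candidate.lower()
        if key in mesh:
            matches = [key]
        else:
            matches = [word for word in key.split() if word in mesh]
        for match in matches:
            if match not in validated:
                validated.append(match)
    return validated
```

For example, with a toy MeSH subset of "Knee", "Pain" and "Data Collection", the candidate "chronic knee pain" does not match as a whole but still contributes "knee" and "pain".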

17 AMTEx step 4: Term Variant Generation
- Inflectional variants of the extracted terms are identified during term extraction (C/NC-value)
- Stemmed term forms are also available in MeSH and are added to the list of terms

18 AMTEx step 5: Term Expansion
- Each term in the list is expanded with neighbour terms in MeSH
- The expansion may include terms more than one level higher or lower than the original term, depending on T2
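
One way to picture the expansion is a bounded walk up and down the IS-A hierarchy. The slides do not say how T2 is turned into a number of levels, so the sketch below uses a hypothetical `max_levels` parameter in its place, and the hierarchy passed in is expected to be a plain child-to-parents mapping rather than the real MeSH data format.

```python
from collections import deque


def expand_term(term, parents, max_levels=1):
    """Collect ancestors and descendants of `term` within `max_levels` IS-A steps.

    `parents` maps a term to the list of its direct parents; a term may have
    several parents because MeSH terms can appear in multiple taxonomies.
    Siblings are not included: the walk only goes straight up or straight down.
    """
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, []).append(child)

    def walk(graph):
        reached = set()
        queue = deque([(term, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_levels:
                continue
            for neighbour in graph.get(node, []):
                if neighbour not in reached:
                    reached.add(neighbour)
                    queue.append((neighbour, depth + 1))
        return reached

    return walk(parents) | walk(children)
```

Raising `max_levels` widens the expansion in both directions, which is the trade-off the threshold controls: more related terms at the cost of more general or more specific ones that the document may not really be about.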

19 Example
Input: Full text article

MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”

MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”

AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

20 Evaluation
Precision and Recall measures
- Dataset: 61 full MEDLINE documents, from the PMC database of NCBI PubMed
  – MEDLINE documents are paired with their respective MeSH index terms, manually assigned by experts
- Ground Truth: the set of MeSH document index terms
- Benchmark method: MMTx against our AMTEx
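
For a single document, precision and recall against the ground-truth index terms reduce to set overlap. The sketch below assumes terms are compared by exact match after lowercasing; the slides do not describe how terms were normalized (for example, how subheadings such as "Pain/etiology" were handled) or how scores were averaged over the 61 documents, so `precision_recall` is only an illustration.

```python
def precision_recall(extracted, ground_truth):
    """Precision and recall of extracted terms against expert-assigned terms.

    Precision = correct extracted terms / all extracted terms.
    Recall    = correct extracted terms / all ground-truth terms.
    Both are defined as 0.0 when their denominator is empty.
    """
    predicted = {term.lower() for term in extracted}
    gold = {term.lower() for term in ground_truth}
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

With four extracted terms of which three appear among five ground-truth terms, precision is 3/4 and recall is 3/5. A method that outputs many unrelated terms lowers precision without necessarily raising recall.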

21 Multi-Word Terms only
Method             Precision   Recall
MMTx               0.013481    0.015109
AMTEx (T = 0.5)    0.186025    0.108085
AMTEx (T = 0.6)    0.21827     0.090039
AMTEx (T = 0.7)    0.235518    0.072318
AMTEx (T = 0.8)    0.235592    0.072243
AMTEx (T = 0.9)    0.23615     0.070267

22 Contribution of Single-Word Terms
Method                            Precision   Recall
MMTx                              0.013481    0.015109
AMTEx                             0.23615     0.070267
AMTEx & single-word MeSH terms    0.119629    0.228322

23 Conclusions
AMTEx:
- is designed for indexing and retrieval of MEDLINE documents
- focuses on multi-word term extraction using valid linguistic & statistical criteria
- is based on MeSH, similarly to human indexing
- selectively expands to term variants & synonyms
- outperforms the current benchmark MMTx method, reaching better precision & recall

24 Future Work
- Better ranking of terms
- Automatic computation of thresholds T1, T2
- Word sense disambiguation could be applied to detect the correct sense for expansion, rather than using the most common sense

