Automatic Document Categorisation by User Profile in MEDLINE Euripides G.M. Petrakis Angelos Hliaoutakis Intelligent Systems Laboratory www.intelligence.tuc.gr.

1 Automatic Document Categorisation by User Profile in MEDLINE Euripides G.M. Petrakis Angelos Hliaoutakis Intelligent Systems Laboratory www.intelligence.tuc.gr Technical University of Crete (TUC) Chania, Crete, Greece

2 Problem Definition
Medical information systems are designed for experts
–Domain-specific answers for experts
They must also serve naive consumers
–Information that is easy to read and comprehend
Investigate methods for categorizing information by user profile
–Experts: use complex terms in their searches
–Consumers: run simple searches using natural language terms
ISHIMR 2011, Zurich, Switzerland

3 Current Practices
In MEDLINE (U.S. NLM), documents are indexed by human experts
–10-12 MeSH terms per document (pathology, disease, treatment, drugs, etc.)
–Over 15 million documents: manual indexing is slow
–Automate this process
–No categorization
MedScape, MedlinePlus, MedHunt rely on the manual categorization of information
–Slow, does not scale up to large collections

4 Objectives
Investigate methods for automatic document indexing in MEDLINE
The extracted index terms are subsequently used for filtering documents by user profile
Main idea: categorize terms into simple terms comprehensible to consumers and more involved terms suitable for experts

5 Resources
Automatic indexing in MEDLINE
–MMTx [U.S. NLM]: focuses on UMLS rather than MeSH
–AMTEx [DKE, 2009]: MeSH terms, faster and more accurate than MMTx
Dictionaries for biomedical and health-related concepts
–UMLS Metathesaurus, MeSH
Dictionaries for general English words
–WordNet, Specialist

6 MMTx (MetaMap Transfer)
Developed by U.S. NLM
Maps text to UMLS Metathesaurus concepts
–but MEDLINE indexing is based on MeSH
–MeSH is a subset of the Metathesaurus
Suffers from term overgeneration
Unrelated terms are added to the final candidate list
Topic drift
AMTEx HIKM’2006

7 AMTEx
The AMTEx method [DKE 2009]
Main ideas:
Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
Extracts general single and multi-word terms (noun phrases)
Mainly multi-word terms: “heart disease”, “coronary artery disease”
Extracted terms are validated against MeSH
Faster than MMTx, with improved precision, while producing merely a fifth of MMTx's term output
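
The slide names the C/NC value but does not give a formula. As a rough illustration, the sketch below follows the commonly cited C-value definition: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, then scaled by log2 of its length in words. The NC-value context weighting is omitted, the frequencies are invented for the example, and nothing here is taken from the AMTEx implementation itself.

```python
import math

def c_values(freq):
    """Score multi-word candidate terms with a simplified C-value.

    freq maps each candidate term (words separated by single spaces)
    to its frequency in the collection. Hypothetical helper, not AMTEx code.
    """
    scores = {}
    for term, f_term in freq.items():
        # Longer candidates that contain `term` as a contiguous word sequence.
        containers = [other for other in freq
                      if other != term and f" {term} " in f" {other} "]
        length = len(term.split())
        if containers:
            # Subtract the average frequency of the containing candidates.
            nested = sum(freq[c] for c in containers) / len(containers)
            scores[term] = math.log2(length) * (f_term - nested)
        else:
            scores[term] = math.log2(length) * f_term
    return scores

# Invented frequencies, for illustration only.
scores = c_values({
    "coronary artery disease": 4,
    "artery disease": 6,
    "heart disease": 5,
})
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

With log2 of the length as the weight, single-word candidates score zero, which matches the slide's remark that the extracted terms are mainly multi-word.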

8 Example
Input: Full text article
MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”
MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”
AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

9 Term & Document Categorization

10 New Vocabularies
Vocabulary of General Terms (VGT): 105,675 general (WordNet) terms
Vocabulary of Consumer Terms (VCT): 7,165 consumer (MeSH) terms
Vocabulary of Expert Terms (VET): 16,719 expert (MeSH) terms

11 Document Categorization
Documents are represented by vectors of terms extracted by AMTEx, MMTx or assigned by human experts
The more VET (VCT) terms a document contains, the higher its probability of being suitable for experts (consumers)
–E.g., a document with VET% = 0.62 has a 62% probability of being suitable for experts
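
A minimal sketch of how such a score could be computed, assuming VET% (VCT%) is the share of a document's index terms found in the expert (consumer) vocabulary. The slide does not state the normalization, so dividing by the total number of terms is an assumption, and the tiny vocabularies below are invented stand-ins for the real VET and VCT lists.

```python
def profile_scores(doc_terms, expert_vocab, consumer_vocab):
    """Return (VET share, VCT share) for one document's index terms.

    Hypothetical helper: each share is the fraction of the document's
    terms found in the respective vocabulary (assumed normalization).
    """
    terms = [t.lower() for t in doc_terms]
    if not terms:
        return 0.0, 0.0
    vet = sum(t in expert_vocab for t in terms) / len(terms)
    vct = sum(t in consumer_vocab for t in terms) / len(terms)
    return vet, vct

# Invented toy vocabularies; the real VET and VCT are far larger.
expert_vocab = {"immunoelectrophoresis", "radiopharmaceuticals"}
consumer_vocab = {"pain", "knee"}

doc = ["Pain", "Knee", "Immunoelectrophoresis", "Questionnaires"]
vet, vct = profile_scores(doc, expert_vocab, consumer_vocab)
print(vet, vct)  # 0.25 0.5
```

Terms that belong to neither vocabulary (here "Questionnaires") lower both shares, so the two scores need not sum to one.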

12 Evaluation
Precision and recall measures: a good method has high values of both
Dataset: OHSUMED, 348,566 MEDLINE abstracts that come with 64 queries and their relevant answers
Ground truth: the set of MeSH index terms assigned to documents by experts
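
As a sketch of the two measures applied to term extraction: precision is the fraction of extracted terms that appear in the expert-assigned set, and recall is the fraction of the expert-assigned set that was extracted. The exact-match comparison after lowercasing is an assumption; the slides do not say how matches were counted (for example, how MeSH subheadings such as "Pain/etiology" were handled).

```python
def precision_recall(extracted, ground_truth):
    """Term-level precision and recall against expert-assigned terms.

    Hypothetical helper; matching is exact after lowercasing (assumption).
    """
    extracted = {t.lower() for t in extracted}
    ground_truth = {t.lower() for t in ground_truth}
    hits = len(extracted & ground_truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Small illustrative term lists, not an evaluation result.
extracted = ["pain", "knee", "data collection", "science"]
ground_truth = ["Knee", "Data Collection", "Humans", "Aged"]
print(precision_recall(extracted, ground_truth))  # (0.5, 0.5)
```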

13 Categorization by User Profile
How good is the method at retrieving answers for consumers and experts?
We run retrievals for consumers and experts
–15 of the 64 queries contain no expert terms and are suitable for consumers
–The remaining queries are suitable for experts
–Documents are represented by document vectors of MeSH, MMTx, or AMTEx terms
–The retrieval method is the Vector Space Model (VSM)
–The VSM similarity score of each document is multiplied by its respective VET or VCT score
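
The re-ranking step can be sketched as follows, assuming raw term counts as vector weights and cosine similarity as the VSM score (the slides do not specify the weighting scheme). The profile score is taken as the share of a document's terms found in the chosen vocabulary, which is also an assumption; all names and data are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query_terms, docs, profile_vocab):
    """Rank documents by VSM similarity multiplied by their profile score.

    docs maps a document id to its list of index terms; profile_vocab is
    the VCT for a consumer search or the VET for an expert search.
    """
    query = Counter(query_terms)
    scored = []
    for doc_id, terms in docs.items():
        share = sum(t in profile_vocab for t in terms) / len(terms) if terms else 0.0
        scored.append((doc_id, cosine(query, Counter(terms)) * share))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy collection: both documents match the query equally well,
# but d1 contains only consumer terms and is therefore ranked first.
docs = {
    "d1": ["pain", "knee"],
    "d2": ["pain", "immunoelectrophoresis"],
}
print(rank(["pain"], docs, profile_vocab={"pain", "knee"}))
```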

14 Consumers Retrieval Task

15 Experts Retrieval Task

16 Results
Consumers retrieval task:
–Retrievals with the manually assigned MeSH terms perform better
–MMTx and AMTEx perform equally well
Experts retrieval task:
–Retrievals with AMTEx perform better
The results indicate
–A tendency of human experts to assign simple terms to documents
–A selective ability of AMTEx to extract complex terms suitable for experts

17 Conclusions & Future Work
We investigated methods for:
–Automatic document indexing
–Categorization by user profile
AMTEx is very well suited for both problems
Future work: more elaborate document categorization methods (machine learning, fuzzy)
More categories
–According to UMLS SN (pathology, treatment)
–User categories (e.g., specialty)

18 Questions and answers

19 AMTEx Outline
INPUT: Document Collection
C/NC value: Multi-word Term Extraction & Term Ranking
MeSH Term Validation
Single-word Term Extraction: non-MeSH multi-word terms are broken down & validated against MeSH
Variant Generation
Term Expansion (MeSH)
Resource: MeSH Thesaurus
OUTPUT: MeSH Term Lists
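
The validation and breakdown steps of the outline can be sketched as below. This is an illustrative simplification, not the AMTEx implementation: it omits C/NC-value ranking, variant generation and term expansion, it compares terms by exact lowercase match, and the small MeSH set is an invented stand-in for the thesaurus.

```python
def validate_against_mesh(candidates, mesh_terms):
    """Keep candidates found in MeSH; break down the rest into single words.

    A multi-word candidate that is not a MeSH term is split on whitespace,
    and each word is kept only if it is itself a MeSH term.
    Hypothetical helper with exact lowercase matching (assumption).
    """
    mesh = {t.lower() for t in mesh_terms}
    validated = []
    for candidate in candidates:
        candidate = candidate.lower()
        if candidate in mesh:
            matches = [candidate]
        else:
            matches = [word for word in candidate.split() if word in mesh]
        for match in matches:
            if match not in validated:  # keep first occurrence, preserve order
                validated.append(match)
    return validated

# Invented stand-in for the MeSH thesaurus.
mesh = {"pain", "knee", "data collection"}
candidates = ["data collection", "chronic knee pain", "baseline survey"]
print(validate_against_mesh(candidates, mesh))
# ['data collection', 'knee', 'pain']
```

Candidates with no MeSH match at either level, such as "baseline survey" here, are dropped, which is one way the output list stays much shorter than an unvalidated candidate list.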

20 AMTEx vs MMTx
AMTEx: faster than MMTx, with improved precision, while producing merely a fifth of MMTx's term output
Data Set   Method   Number of Terms   Precision   Recall   Time (hours)
OHSUMED    AMTEx    8                 0.125       0.101    7.383
OHSUMED    MMTx     40                0.089       0.336    14.516
PMC        AMTEx    25                0.034       0.062    1.387
PMC        MMTx     72                0.033       0.162    2.727

21 MeSH: Medical Subject Headings
The NLM thesaurus of medical & biological terms
Organized in IS-A hierarchies
–more than 15 taxonomies & more than 22,000 terms
–a term may appear in multiple taxonomies
No PART-OF relationships
Terms are organized into synonym sets called entry terms, including stemmed term forms

22 Fragment of the MeSH IS-A Hierarchy
[Figure: tree fragment with nodes Root, Nervous system diseases, Neurologic manifestations, Cranial nerve diseases, pain, headache, neuralgia, Facial neuralgia]

