
1 Medical Document Categorization Using a Priori Knowledge
L. Itert 1,2, W. Duch 2,3, J. Pestian 1
1 Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, OH, USA
2 Department of Informatics, Nicolaus Copernicus University, Torun, Poland
3 School of Computer Engineering, Nanyang Technological University, Singapore
ICANN 2005, Warsaw, 10-14 Sept. 2005

2 Outline
- Goals & questions
- Medical data
- Data preparation
- Model of similarity
- Computational experiments and results

3 Goals & Questions
- What are the key clinical descriptors for a given disease?
- In what sense are the records describing patients with the same disease similar?
- Can we capture an expert's intuition when evaluating document similarity and diversity?
- Include a priori knowledge in document categorization; this is especially important for rare diseases.
- Use the UMLS ontology and NLM lexical tools.

4 Example of a clinical discharge summary
Jane is a 13yo WF who presented with CF bronchopneumonia. She has noticed increasing cough, greenish sputum production, and fatigue since prior to 12/8/03. She had 2 febrile episodes, but denied any nausea, vomiting, diarrhea, or change in appetite. Upon admission she had no history of diabetic or liver complications. Her FEV1 was 73% on 12/8 and she was treated with 2 z-paks, and on 12/29 FEV1 was 72%, at which time she was started on Cipro. She noted no clinical improvement and was admitted for a 2-week IV treatment of Tobramycin and Meropenem.

5 Unified Medical Language System (UMLS) semantic types
- Semantic relation: "Virus" causes "Disease or Syndrome"
- Other relations: "interacts with", "contains", "consists of", "result of", "related to", …
- Other types: "Body location or region", "Injury or Poisoning", "Diagnostic procedure", …

6 UMLS – Example (keyword: "virus")
Metathesaurus:
- Concept: Virus, CUI: C0042776, Semantic Type: Virus
- Definition (1 of 3): "Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both)." (CRISP Thesaurus)
- Synonyms: Virus, Vira, Viridae
Semantic Network:
- "Virus" causes "Disease or Syndrome"
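As an illustration only, here is a minimal Python sketch of how such a Metathesaurus entry and its Semantic Network relation could be represented as a data structure; this is not the UMLS/UTS API, and all field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class UMLSConcept:
    """Minimal stand-in for a UMLS Metathesaurus concept record (illustrative only)."""
    cui: str                                   # Concept Unique Identifier, e.g. "C0042776"
    name: str                                  # preferred concept name
    semantic_type: str                         # e.g. "Virus"
    definitions: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    # Semantic Network relations as (relation, target semantic type) pairs
    relations: list = field(default_factory=list)

virus = UMLSConcept(
    cui="C0042776",
    name="Virus",
    semantic_type="Virus",
    definitions=["Group of minute infectious agents ... (CRISP Thesaurus)"],
    synonyms=["Virus", "Vira", "Viridae"],
    relations=[("causes", "Disease or Syndrome")],
)
print(virus.cui, virus.relations)
```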

7 Data

Disease name      No. of clinical records  Avg. clinical record size [bytes]  Reference text size [bytes]
Pneumonia         609                      1451                               23583
Asthma            865                      1282                               36720
Epilepsy          638                      1598                               19418
Anemia            544                      2849                               14282
UTI               298                      1587                               13430
JRA               41                       1816                               27024
Cystic fibrosis   283                      1790                               7958
Cerebral palsy    177                      1597                               35348
Otitis media      493                      1420                               32416
Gastroenteritis   586                      1375                               9906

JRA – Juvenile Rheumatoid Arthritis; UTI – urinary tract infection

8 Data processing/preparation
Reference texts → MMTx → UMLS concepts (feature prototypes) → Filtering (focus on 26 semantic types) → Features (UMLS concept IDs)
Clinical documents → MMTx → UMLS concepts → Filtering using the existing feature space → Final data
MMTx discovers UMLS concepts in text.
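A minimal sketch of this preparation step, assuming the MMTx output has already been reduced to (concept ID, semantic type) pairs per document; the real MMTx emits much richer output, and the helper names here are hypothetical:

```python
# Simplified stand-in for MMTx output: each document is a list of
# (concept_id, semantic_type) pairs already extracted from the text.

ALLOWED_TYPES = {"Disease or Syndrome", "Diagnostic Procedure", "Virus"}  # 3 of the 26 types used

def build_feature_space(reference_docs, allowed_types=ALLOWED_TYPES):
    """Concepts found in the reference texts (filtered by semantic type)
    define the features: UMLS concept IDs."""
    return sorted({cui for doc in reference_docs
                   for cui, stype in doc if stype in allowed_types})

def filter_clinical(clinical_docs, feature_space):
    """Keep only concepts that exist in the reference-derived feature space."""
    index = set(feature_space)
    return [[cui for cui, _ in doc if cui in index] for doc in clinical_docs]

# Toy usage: the unknown concept in the clinical note is dropped.
refs = [[("C0042776", "Virus"), ("C0032285", "Disease or Syndrome")]]
notes = [[("C0042776", "Virus"), ("C9999999", "Finding")]]
space = build_feature_space(refs)
print(space, filter_clinical(notes, space))
```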

9 Semantic types used
Values indicate the actual numbers of concepts found in: I – clinical texts, II – reference texts.

10 Data - statistics
- 10 classes
- 4534 vectors
- 807 features (out of 1097 concepts found in the reference texts)
Baselines:
- Majority: 19.1% (asthma class)
- Content based: 34.6% (frequency of the class name in the text)
Remarks:
- Very sparse vectors
- Feature values represent term frequency (tf), i.e. the number of occurrences of a particular concept in the text
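A sketch of how the sparse tf vectors and the two baselines could be computed; the data layout, helper names, and the reading of the content-based baseline are illustrative assumptions, not the original code:

```python
from collections import Counter

def tf_vectors(docs, feature_space):
    """Sparse term-frequency vectors: one {concept_id: count} dict per document."""
    index = set(feature_space)
    return [Counter(cui for cui in doc if cui in index) for doc in docs]

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def content_baseline(raw_texts, labels, class_names):
    """Predict the class whose name occurs most often in the raw text."""
    correct = 0
    for text, label in zip(raw_texts, labels):
        text = text.lower()
        pred = max(class_names, key=lambda name: text.count(name.lower()))
        correct += int(pred == label)
    return correct / len(labels)
```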

11 Model of similarity I
Intuitions:
- The initial distance between a document D and the reference vector R_k should be proportional to d_0k = ||D – R_k|| (1/p(C_k) – 1).
- If a term i appears in R_k with frequency R_ik > 0 but does not appear in D, the distance d(D, R_k) should increase by δ_ik = a_1 R_ik.
- If a term i does not appear in R_k but has non-zero frequency D_i, the distance d(D, R_k) should increase by δ_ik = a_2 D_i.
- If a term i appears with frequency R_ik > D_i > 0 in both vectors, the distance d(D, R_k) should decrease by δ_ik = –a_3 D_i.
- If a term i appears with frequency 0 < R_ik ≤ D_i in both vectors, the distance d(D, R_k) should decrease by δ_ik = –a_4 R_ik.
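The rules above translate directly into a distance function. The sketch below follows that reading, with documents and reference vectors held as dicts mapping concept IDs to term frequencies; the multiplicative prior scaling of the initial distance is a reconstruction of the first bullet, not a confirmed formula:

```python
import math

def modified_distance(doc, ref, prior, a1, a2, a3, a4):
    """Distance between a document and a class reference vector.

    doc, ref: dicts mapping concept id -> term frequency (D_i, R_ik).
    prior:    a priori class probability p(C_k).
    """
    terms = set(doc) | set(ref)
    # Initial distance scaled by the class prior: ||D - R_k|| * (1/p(C_k) - 1)
    d0 = math.sqrt(sum((doc.get(i, 0.0) - ref.get(i, 0.0)) ** 2 for i in terms))
    d = d0 * (1.0 / prior - 1.0)
    for i in terms:
        D_i, R_ik = doc.get(i, 0.0), ref.get(i, 0.0)
        if R_ik > 0 and D_i == 0:          # term only in the reference vector
            d += a1 * R_ik
        elif D_i > 0 and R_ik == 0:        # term only in the document
            d += a2 * D_i
        elif R_ik > D_i > 0:               # in both, reference frequency dominates
            d -= a3 * D_i
        elif 0 < R_ik <= D_i:              # in both, document frequency dominates
            d -= a4 * R_ik
    return d
```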

12 Model of Similarity II
Given a document D, a reference vector R_k and the probabilities p(i|C_k), the probability that the class of D is C_k should be proportional to a decreasing function of the modified distance d(D, R_k), where the increments δ_ik depend on the adaptive parameters a_1, …, a_4, which may be specific for each class.
A linear programming technique can be used to estimate the a_i, maximizing the similarity between documents and the reference vectors of their correct classes (k indicates the correct class), subject to constraints on the a_i.
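Because the modified distance is linear in a_1, …, a_4, the estimation step can indeed be posed as a linear program. The sketch below shows one possible formulation with scipy; the exact objective (pull documents toward the correct-class reference, push them away from the others) and the box constraints 0 ≤ a_i ≤ a_max are assumptions rather than the slide's exact formulas:

```python
import numpy as np
from scipy.optimize import linprog

def delta_coefficients(doc, ref):
    """Coefficients of (a1, a2, a3, a4) in the modified distance d(D, R_k)."""
    A1 = A2 = A3 = A4 = 0.0
    for i in set(doc) | set(ref):
        D_i, R_ik = doc.get(i, 0.0), ref.get(i, 0.0)
        if R_ik > 0 and D_i == 0:
            A1 += R_ik
        elif D_i > 0 and R_ik == 0:
            A2 += D_i
        elif R_ik > D_i > 0:
            A3 += D_i
        elif 0 < R_ik <= D_i:
            A4 += R_ik
    return np.array([A1, A2, -A3, -A4])

def estimate_parameters(docs, labels, references, a_max=1.0):
    """Choose a = (a1, a2, a3, a4) by a linear program: documents should be
    close to their correct-class reference and far from the other references."""
    c = np.zeros(4)
    for doc, k in zip(docs, labels):
        for name, ref in references.items():
            coef = delta_coefficients(doc, ref)
            c += coef if name == k else -coef / (len(references) - 1)
    res = linprog(c, bounds=[(0.0, a_max)] * 4)  # assumed box constraints on a_i
    return res.x
```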

13 Results

                  M0          M1          M2          M3          M4           M5
kNN               48.9        50.2        51.0        51.4        49.5         49.5
SSV               39.5        40.6        31.0        39.5        39.5         42.3
MLP (300 neur.)   66.0        56.5        60.7        63.2        72.3         71.0
SVM (C opt.)      59.3 (1.0)  60.4 (0.1)  60.9 (0.1)  60.5 (0.1)  59.8 (0.01)  60.0 (0.01)
10 Ref. vectors   71.6        –           71.4        71.3        70.7         70.1

10-fold crossvalidation accuracies in % for different feature weightings. M0: tf frequencies; M1: binary data.
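For orientation, a minimal scikit-learn sketch of how 10-fold cross-validation accuracies for kNN and an SVM could be computed; this is not the original experimental setup, and the SSV tree and the 10-reference-vector classifier from the table are not covered here:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def cv_accuracies(X, y, C=1.0):
    """10-fold cross-validated accuracy (in %) for two standard classifiers."""
    models = {
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel="linear", C=C),
    }
    return {name: 100 * cross_val_score(model, X, y, cv=10).mean()
            for name, model in models.items()}
```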

14 Conclusions
- Medical texts contain a large number of rare, specific concepts; a vector representation using the standard tf × idf weighting leads to poor results.
- A priori knowledge was introduced using a single reference vector per class (this certainly needs improvement).
- Expert intuitions were formalized in a model measuring text similarity with only 4 parameters per class; linear programming was used to optimize these parameters.
- The results are quite encouraging. Finding the best set of reference vectors and similarity measures for medical documents remains an interesting challenge.

