Annotation of 311 Admission Summaries of the ICU Corpus Yefeng Wang.

1 Annotation of 311 Admission Summaries of the ICU Corpus
Yefeng Wang

2 Aim
Create evaluation data for measuring SNOMED CT concept matching performance.
Create training data for machine learning systems.
– Rule-based systems have low recall
– Their parameters are difficult to tune and their rules are difficult to build
– Machine learning systems are the state of the art
– No such annotated data is available yet

3 Existing Corpora
Most existing corpora are from the biomedical domain
– GENIA (2000 abstracts from MEDLINE)
– PennBioIE (2300 MEDLINE abstracts)
Only a few are from the clinical domain
– Ogren et al. (clinical conditions only)
– Chapman et al. (clinical conditions only)
– CLEF (semantic annotation, formal reports)

4 Selection of Data
Clinical notes came from the admission summaries of 311 patients.
One note per patient
Admission notes were used for annotation
– Semi-structured, with a variety of information
  Chief Complaint
  Background
  History of Presenting Illness
  Medication
  Examination
  Observation in Nursing Notes
  Social
  Other summaries (Echo reports, Surgical reports, etc.)

5 The Annotation Task
Concept Annotation
– Annotate the semantic category of medical concepts
– Categories are based on SNOMED CT
Relation Annotation
– Relationships between concepts
– Inter-term relation: relationship between two separate concepts
– Intra-term relation: relationship between atomic concepts within a composite concept (post-coordination)

6 An Example Note

7 Annotation Schema
Category      Example
Body          small bowel loops
Procedure     Loop ileostomy
Finding       Persistent tachycardia
Abnormality   Inflammatory adhesions
Qualifier     Grade 3 intubation
Object        Sump drain
Substance     Ceftriaxone
Occupation    Review by cardiologist
Organism      Enterococcus
Behaviour     Lives with son

8 Development of Guidelines
Iterative Approach
10 reports were annotated jointly by two annotators.
– Discussion
– Development of initial guidelines
25 reports were used for iterative refinement of the guidelines
– Annotated separately
– 5 documents per iteration
– New examples and rules were added to the annotation guidelines where necessary

9 Annotation Agreement
Inter-annotator agreement (IAA) was calculated during each development cycle.
The F1 score is used for the calculation
– Harmonic mean of recall and precision
– Precision = # correct annotations / # annotations
– Recall = # correct annotations / # existing concepts
Repeat the development process until annotator agreement reaches a threshold of 90%.
The guidelines are then finalised; no more new rules are added to them.
Differences are resolved by a third annotator to produce a gold standard corpus.
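The precision, recall and F1 definitions above can be sketched in a few lines of Python. This is an illustration only: the slides do not say how a "correct" annotation was decided, so the sketch assumes exact agreement on span boundaries and class, with one annotator treated as the reference. The names `Span`, `prf_from_counts` and `agreement` are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # token offset where the annotation begins
    end: int    # token offset just past the annotation
    label: str  # concept class, e.g. "finding"


def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def agreement(reference: set[Span], other: set[Span]) -> tuple[float, float, float]:
    """Score one annotator against another (assumed: exact span and label match)."""
    tp = len(reference & other)   # annotations both annotators made
    fp = len(other - reference)   # made only by the second annotator
    fn = len(reference - other)   # made only by the reference annotator
    return prf_from_counts(tp, fp, fn)


# Counts reported for iteration 1 of the development cycle.
p, r, f = prf_from_counts(tp=152, fp=38, fn=31)
print(f"P={p * 100:.2f} R={r * 100:.2f} F={f * 100:.2f}")
```

With the iteration 1 counts (TP 152, FP 38, FN 31) this gives the reported 80.00, 83.06 and 81.50, which suggests the table uses the usual TP/(TP+FP) and TP/(TP+FN) forms.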

10 IAA for the development cycle
Iter   TP    FP   FN   P       R       F
1      152   38   31   80.00   83.06   81.50
2      185   21   29   89.81   86.45   88.10
3      194   22   31   89.81   86.22   87.98
4      238   29   25   89.14   90.49   89.81
5      159   19   17   89.33   90.34   89.83

11 IAA for the whole corpus (311)
Class         TP      FP     FN     P       R       F
body          1404    202    298    87.4    82.48   84.87
observable    348     62     89     84.77   79.52   82.06
abnormality   720     139    248    83.74   74.35   78.77
qualifier     1763    198    392    89.89   81.8    85.66
object        158     43     39     78.35   80      79.17
substance     2465    129    156    95.01   94.03   94.52
behaviour     68      16     18     80.49   78.57   79.52
occupations   125     33     35     78.95   77.92   78.43
finding       4116    371    398    91.72   91.17   91.44
organism      25      8      10     75      70.59   72.73
procedure     2076    298    288    87.43   87.82   87.63
overall       13273   1504   1976   89.82   87.04   88.41

12 Concept Frequency
Concept Class   # of Instances   Percentage
body            620              4.88%
observable      210              1.65%
substance       2431             19.14%
qualifier       1727             13.60%
object          158              1.24%
behaviour       375              2.95%
occupation      136              1.07%
finding         4755             37.44%
organism        35               0.28%
procedure       2253             17.74%
total           12700            100.00%

13 Comparison to Other Corpora
Comparison with corpora in the newswire, biomedical and science (astronomy) domains.
Available corpora: MUC, GENIA, ASTRO
Corpus               ICU     GENIA   MUC     ASTRO
# category           10      36      8       43
# entity             12700   40548   11568   10744
# avg. len (words)   1.49    1.70    1.64    1.49
tag density          40.3%   33.8%   11.8%   5.4%
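The slides do not define "tag density" or say how average length was measured. A minimal sketch, assuming tag density is the fraction of tokens that fall inside an entity and average length is tagged tokens per entity, both read off a well-formed BIO tag sequence. The function name `corpus_stats` and the toy tag sequence are hypothetical.

```python
def corpus_stats(tags: list[str]) -> dict[str, float]:
    """Entity count, mean entity length and tag density from BIO tags.

    Assumes well-formed BIO: every entity starts with a "B-" tag.
    """
    n_entities = sum(1 for t in tags if t.startswith("B-"))
    n_tagged = sum(1 for t in tags if t != "O")
    return {
        "entities": n_entities,
        "avg_len": n_tagged / n_entities if n_entities else 0.0,
        "tag_density": n_tagged / len(tags) if tags else 0.0,
    }


# Toy sequence of 10 tokens containing 3 entities that cover 4 tokens.
tags = ["B-finding", "I-finding", "O", "B-body", "O",
        "O", "B-substance", "O", "O", "O"]
print(corpus_stats(tags))
```

Under this reading, the ICU figure of 40.3% would mean that roughly two in every five tokens belong to an annotated concept, far denser than the newswire and astronomy corpora.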

14 Concept Identification Result
279 documents for training
32 documents for testing
4656 tokens, 1218 concepts
Rule-based system (TTSCT)
Use Conditional Random Fields (CRF++) as the learner.
Evaluate using the CoNLL-2000 evaluation script.
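CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences; the CoNLL-2000 evaluation script scores chunk tags of the B-/I-/O kind. The slides do not show how the annotations were encoded, so the following is only a sketch of one common conversion from concept spans to BIO lines. The function `to_bio_lines` and the example sentence are hypothetical, and the sketch handles non-overlapping spans only.

```python
def to_bio_lines(tokens: list[str], spans: list[tuple[int, int, str]]) -> list[str]:
    """Turn (start, end, label) token spans into 'token tag' lines.

    `end` is exclusive. Overlapping or nested spans are not supported.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the concept
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the concept
    return [f"{tok} {tag}" for tok, tag in zip(tokens, tags)]


# Made-up sentence built from two of the schema examples.
tokens = ["Persistent", "tachycardia", "on", "Ceftriaxone"]
spans = [(0, 2, "finding"), (3, 4, "substance")]
print("\n".join(to_bio_lines(tokens, spans)))
```

In a real training file each line would also carry feature columns (for example orthographic or lexicon features) between the token and the label.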

15 Concept Matcher Performance
Configuration                                                  P       R       F1      Δ
No pre-processing at all (simple TTSCT)                        58.76   26.63   36.35   ---
Pruning lexicon, removing unrelated classes                    64.87   46.73   54.33   +17.98
Expanding acronyms + Exact Matching (TTSCT)                    74.89   55.25   63.59   +9.26
Expanding acronyms + Approximate                               71.67   63.19   67.16   +4.47
Expanding acronyms + Approx. Matching 1 + Approx. Matching 2   64.88   59.14   61.88   -5.28
Performance increase over baseline                                                     +30.81
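To make the "expanding acronyms + exact matching" configuration concrete, here is a minimal sketch of a lexicon matcher. It is not the TTSCT implementation, which the slides do not describe: the greedy longest-match strategy, the toy `LEXICON`, the `ACRONYMS` table and both function names are assumptions made for illustration.

```python
# Hypothetical toy resources; a real system would load SNOMED CT terms.
ACRONYMS = {"cxr": "chest x-ray"}
LEXICON = {
    "chest x-ray": "procedure",
    "persistent tachycardia": "finding",
    "tachycardia": "finding",
    "ceftriaxone": "substance",
}


def expand_acronyms(tokens: list[str]) -> list[str]:
    """Replace each known acronym with the tokens of its expansion."""
    out: list[str] = []
    for tok in tokens:
        out.extend(ACRONYMS.get(tok.lower(), tok).split())
    return out


def match_concepts(tokens: list[str], expand: bool = True, max_len: int = 4) -> list[tuple[str, str]]:
    """Greedy longest-first exact matching of token n-grams against LEXICON."""
    toks = [t.lower() for t in (expand_acronyms(tokens) if expand else tokens)]
    matches: list[tuple[str, str]] = []
    i = 0
    while i < len(toks):
        for n in range(min(max_len, len(toks) - i), 0, -1):
            phrase = " ".join(toks[i:i + n])
            if phrase in LEXICON:
                matches.append((phrase, LEXICON[phrase]))
                i += n          # skip past the matched phrase
                break
        else:
            i += 1              # no match starts at this token
    return matches


note = ["CXR", "done", "persistent", "tachycardia", "on", "ceftriaxone"]
print(match_concepts(note))
print(match_concepts(note, expand=False))
```

Without expansion the acronym "CXR" finds no lexicon entry, which is the kind of recall loss the table's first rows point to.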

16 Machine Learning Results
Feature set      P       R       F1      Dec.
Best             84.22   78.90   81.48   ---
Best – abb       83.20   77.26   80.12   -1.36
Best – orth      83.67   78.24   80.87   -0.61
Best – affix     83.16   77.01   79.97   -1.51
Best – SNOMED    79.06   73.15   75.99   -5.49
Best – bigram    83.17   78.74   80.89   -0.59
Best – bow       81.26   73.32   77.08   -4.40
Bow (Baseline)   76.86   66.26   71.16   -10.32 ---
Bow + SNOMED     82.61   74.88   78.55   +7.39

17 Inter Relation Annotation
Annotate relationships between concepts
Inter-concept relations
– Relationship between two outermost concepts
– Example: CXR in ED bilateral mid-lower zone opacification

18 Intra-Concept Relations
Relations between inner concepts and outermost concepts
– Term decomposition
– Example: R groin abscess

19 Relation Types
procedure_site      associated_abnormality
finding_site        has_focus
due_to              associated_with
associated_device   associated_substance
has_finding         interprets
severity            locality
laterality          negation
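One way to hold such annotations in code is a typed triple whose relation name is checked against the 14 types listed on this slide. The sketch below is hypothetical: the `Concept` and `Relation` classes are not from the original work, and the relations assigned to "R groin abscess" are illustrative guesses, since the slides do not show the actual annotation.

```python
from dataclasses import dataclass

# The 14 relation types listed on the slide.
RELATION_TYPES = frozenset({
    "procedure_site", "associated_abnormality", "finding_site", "has_focus",
    "due_to", "associated_with", "associated_device", "associated_substance",
    "has_finding", "interprets", "severity", "locality", "laterality", "negation",
})


@dataclass(frozen=True)
class Concept:
    text: str       # surface text in the note
    category: str   # semantic category from the annotation schema


@dataclass(frozen=True)
class Relation:
    rel_type: str
    source: Concept
    target: Concept

    def __post_init__(self) -> None:
        # Reject relation names that are not in the schema.
        if self.rel_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.rel_type}")


# Assumed decomposition of "R groin abscess" (labels are guesses).
abscess = Concept("abscess", "abnormality")
groin = Concept("groin", "body")
right = Concept("R", "qualifier")
relations = [
    Relation("finding_site", abscess, groin),
    Relation("laterality", groin, right),
]
for rel in relations:
    print(f"{rel.rel_type}({rel.source.text}, {rel.target.text})")
```

The same structure would serve for inter-concept relations, where source and target are two outermost concepts instead of parts of one term.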

20 Inter + Intra Concept Relationships
Example: Hemicolectomy and formation of ileostomy for bowel obstruction

21 Relation Network

