1 Health Text Lexical Processing Mojtaba Sabbagh.


2 Outline
- Health Text Lexical Analyser Objectives
- Tokenisation
- Verification
- Designing General Lexical Process
- Measurements

3 Objectives
The objectives of the lexical analyser are:
- Divide the corpus into words, verify each of them and determine its type.
- Find unknown words and resolve them.
- Convert structures to a standard format.
- Annotate the corpus with this information for use by other systems.

4 Tokenisation
Corpus features:
- The corpus consists of 60 million words of ICU clinical notes written by all staff roles working in the ward: doctors, VMOs, nurses and allied health staff (dietitians, physios, psychologists, etc.).
- To implement the tokeniser we used a rule base of regular expressions.
- These rules were written in Python to match the token patterns in the corpus.
- The rules are applied sequentially, from the most specific to the most general.
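A minimal sketch of how such an ordered rule base might look. The category names, the patterns and the `tokenise` helper are illustrative assumptions, not the rules used in the project; the only idea taken from the slide is that more specific rules are tried before more general ones.

```python
import re

# Hypothetical rule base: (category, pattern) pairs, most specific first.
RULES = [
    ("date", r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})"),
    ("time", r"\d{1,2}:\d{2}(?::\d{2})?"),
    ("range", r"\d+\s*(?:->|-)\s*\d+"),
    ("digit_word", r"[A-Za-z]+\d+[A-Za-z0-9]*"),
    ("digit", r"\d+(?:\.\d+)?%?"),
    ("word", r"[A-Za-z]+"),
    ("operator", r"[+\-<>=~]"),
    ("other", r"\S"),
]

# In a regex alternation the leftmost alternative that matches wins,
# so the order of RULES decides which category a token receives.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def tokenise(text):
    """Return (category, token) pairs; white space is skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


for category, token in tokenise("GCS 10-12 at 14:30 on 2/5/07, FiO2 40%"):
    print(category, token)
```

If the `digit` rule were placed before `range`, the string `10-12` would be split into a digit, an operator and another digit, which is why the ordering matters.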

5 Lexical Analyser Results
Total tokens: 104,422,977
- Date (1.01%): simple dates separated by / (4-digit year, 2-digit year, 0-digit year); simple dates separated by dots or hyphens; month between day and year (2nd may 07); month at the first (Jan the 3rd)
- Time (1.01%): simple time hh:mm:ss; simple time hh:mm; relative time (after at, from, etc.); time followed by hr, hrs, am, pm
- Range (0.28%): simple range 12-17; pointer range 12->17; complex range 12-18/17-19
- Complex words (1.52%): two words connected by / - \ + '; more than two words; words starting with /
- Digit words (0.45%; 471530 tokens): chemical formulas such as O2, Fio2
- Plain words (33.13%)
- Complex digit (0.09%): digits separated by / (85182 tokens, 0.08%)
- Digit (6.06%): digits with #, %, `; plain digits
- Other (55.26%): header separator; question mark (?); ^ sign; white space; punctuation (\.,'";:\/|()[]{}); missing tokens
- Operators (1.20%): plus (+), minus (-), greater (>), lower (<), equals (=), approximate (~)
These categories must be verified.

6 Gazetteers used for Verification
- Moby: English common words (354,984 words)
- Abbreviation list: general medical acronyms (1,000 entries)
- UMLS lexicon (450,895 entries)
- SNOMED CT lexicon (99,992 entries)
- Name list: names of people extracted from the corpus (anonymised)
- Misspelt words: all misspelt words identified in the corpus so far (81,415 entries)
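Verification against gazetteers can be pictured as a series of set lookups. The sketch below is only an illustration: the tiny word sets stand in for the real gazetteers, and the lookup order, the case-insensitive matching and the function names are assumptions.

```python
# Tiny stand-ins for the gazetteers; real ones would be loaded from files.
GAZETTEERS = {
    "moby": {"patient", "stable", "overnight"},
    "abbreviations": {"gcs", "peep", "icp"},
    "misspellings": {"pateint"},
}


def verify(token):
    """Return the name of the first gazetteer containing the token, or None."""
    key = token.lower()  # assumption: lookups ignore case
    for name, entries in GAZETTEERS.items():
        if key in entries:
            return name
    return None


def split_known_unknown(tokens):
    """Separate verified tokens (with their source) from unknown ones."""
    known, unknown = {}, []
    for tok in tokens:
        source = verify(tok)
        if source is None:
            unknown.append(tok)
        else:
            known[tok] = source
    return known, unknown


known, unknown = split_known_unknown(["Patient", "GCS", "pateint", "xyzzy"])
print(known)
print(unknown)
```

Tokens that end up in `unknown` are the ones that would need resolution later in the process.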

7 Designing a general process for the lexical phase
[Diagram: Raw Corpus -> Lexical Processing -> Annotated Corpus]

8 General View of Lexical Processing
[Flow diagram; box labels as extracted: Manual Resolution; Generating/Updating Gazetteers; Raw Corpus; Computational Verification; Unknown tokens; Gazetteers (Moby, SCT, UMLS); Clean up the Corpus; Annotation of the corpus; String & examples Extractor; Corrected Spreadsheets; If # < T; Manual Update RegExp; Annotated Corpus; Computation of Suggestions; Unknown tokens with Suggestions; Spreadsheets of tokens & Examples; Corrected Corpus; Preprocessing (Measurement)]
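The slide does not say how the "Computation of Suggestions" step works. One plausible approach, shown here purely as an assumption, is to rank gazetteer entries by string similarity using Python's standard `difflib` module and to emit one row per unknown token, ready for a spreadsheet. The function name, the cutoff of 0.8 and the row layout are hypothetical.

```python
from collections import Counter
import difflib


def suggestion_rows(unknown_tokens, vocabulary, n=3, cutoff=0.8):
    """Build (token, frequency, suggestions) rows, most frequent token first."""
    counts = Counter(t.lower() for t in unknown_tokens)
    rows = []
    for token, count in counts.most_common():
        # get_close_matches returns up to n vocabulary entries whose
        # similarity ratio to the token is at least the cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=n, cutoff=cutoff)
        rows.append((token, count, matches))
    return rows


vocabulary = ["patient", "stable", "ventilated"]
for row in suggestion_rows(["pateint", "Pateint", "zzzzqq"], vocabulary):
    print(row)
```

Tokens that receive no suggestion would be natural candidates for the manual resolution step.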

9 Measurements
For example: GCS, PEEP, PEARTL, ICP, CPP, ...
In this stage we try to capture the measurements in the corpus and convert them to a standard format.
There are 28 measurements in the ICU corpus.

10 GCS
There are more than 142,000 references to GCS in the corpus. About 12,000 of them were analysed, and 653 different ways of writing GCS were identified. We predict that there are around 7,000 GCS patterns in the corpus and that around 60% of them were used only once.
The most common GCS patterns are:
- GCS 10: 3805/12000
- GCS 10-12: 1088/12000
- GCS 10/15: 963/12000
- GCS: v 2 m 3 e 5 = 10: 956/12000
- GCS=10: 614/12000
- GCS:10: 487/12000
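To make the idea of a standard format concrete, the sketch below maps the six most common GCS patterns listed above onto one dictionary shape. The two regular expressions, the `normalise_gcs` name and the output keys (`low`, `high`, `components`) are assumptions chosen for illustration; the slides do not specify the target format, and the remaining patterns in the corpus would need further rules.

```python
import re

# Component form, e.g. "GCS: v 2 m 3 e 5 = 10" (eye, verbal, motor in any order).
COMPONENTS = re.compile(
    r"GCS\s*[:=]?\s*((?:[evm]\s*\d\s*){3})=\s*(\d{1,2})", re.IGNORECASE)
# Total form, e.g. "GCS 10", "GCS=10", "GCS:10", "GCS 10-12", "GCS 10/15".
TOTAL = re.compile(
    r"GCS\s*[:=]?\s*(\d{1,2})(?:\s*-\s*(\d{1,2})|\s*/\s*15)?", re.IGNORECASE)


def normalise_gcs(text):
    """Return a dict describing the first GCS mention in text, or None."""
    m = COMPONENTS.search(text)
    if m:
        parts = {k.upper(): int(v)
                 for k, v in re.findall(r"([evm])\s*(\d)", m.group(1), re.IGNORECASE)}
        total = int(m.group(2))
        return {"low": total, "high": total, "components": parts}
    m = TOTAL.search(text)
    if m:
        low = int(m.group(1))
        high = int(m.group(2)) if m.group(2) else low
        return {"low": low, "high": high, "components": None}
    return None


for note in ["GCS 10", "GCS 10-12", "GCS 10/15",
             "GCS: v 2 m 3 e 5 = 10", "GCS=10", "GCS:10"]:
    print(note, "->", normalise_gcs(note))
```

The component pattern is tried first because it is the more specific of the two, in line with the specific-to-general ordering used by the tokeniser.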

11 Conclusion
In the lexical analysis we try to reduce the complexity and ambiguity of the corpus. This helps other systems improve their performance and accuracy. Writing each piece of a note in a standard format leads to more accurate information retrieval systems.

