
1 Artificial Intelligence in Medicine HCA 590 (Topics in Health Sciences) Rohit Kate 11. Biomedical Natural Language Processing

2 Reading Chapter 8, Biomedical Informatics: Computer Applications in Health Care and Biomedicine by Edward H. Shortliffe (Editor) and James J. Cimino (Editor), Springer, 2006.

3 Outline Introduction to NLP Linguistic Essentials Challenges of Clinical Language Processing Challenges of Biological Language Processing

4 What is Natural Language Processing (NLP)? Processing of natural languages like English, Chinese etc. by computers to: – Interact with people, e.g. Follow natural language commands Answer natural language questions Provide information in natural language – Perform useful tasks, e.g. Find required information from several documents Summarize large or many documents Translate from one natural language to another Word processing is NOT Natural Language Processing!

5 Example NLP Task: Information Extraction Extract some specific type of information from texts Entity extraction: – Find all the protein names mentioned in a document – Find all the person, organization and location names mentioned in a document Relation extraction: – Find all pairs of proteins mentioned in a document that interact – Find where a person lives, where an organization is located etc.
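As a toy illustration of entity extraction (not how state-of-the-art extractors work), a dictionary of protein names can be matched against text; the protein list and the sentence below are made up for the example:

import re

# Illustrative dictionary of protein names (a tiny, hand-picked assumption).
PROTEINS = ["cyclin A", "cyclin D1", "Rb", "p34cdc2", "p33cdk2"]

def extract_proteins(text):
    """Return the dictionary entries that occur in the text (word-boundary match)."""
    return [name for name in PROTEINS
            if re.search(r"\b" + re.escape(name) + r"\b", text)]

sentence = ("Immunoprecipitation experiments demonstrated that cyclin D1 "
            "is associated with both p34cdc2 and p33cdk2.")
print(extract_proteins(sentence))   # ['cyclin D1', 'p34cdc2', 'p33cdk2']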

6 Sample Medline Abstract
TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein
AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene. Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene. However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved. In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein. The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells. Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro. In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity. Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein. This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the regulation of gene expression.

9 Example NLP Task: Semantic Parsing
Convert a natural language sentence into an executable meaning representation for a domain.
Example: query application for a U.S. geography database
Sentence: Which rivers run through the states bordering Texas?
Query: answer(traverse(next_to(stateid('texas'))))
Answer: Arkansas, Canadian, Cimarron, Gila, Mississippi, Rio Grande, ...
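To make the meaning representation concrete, the nested query from the slide can be executed against a toy geography database; the database contents and helper functions below are illustrative assumptions, not the actual query application:

# Toy, incomplete geography "database"; entries invented for illustration only.
borders = {"texas": ["new mexico", "oklahoma", "arkansas", "louisiana"]}
rivers_in_state = {"new mexico": ["rio grande", "gila", "canadian"],
                   "oklahoma": ["canadian", "cimarron", "arkansas"],
                   "arkansas": ["arkansas", "mississippi"],
                   "louisiana": ["mississippi"]}

def stateid(name):        # constant denoting a state
    return name

def next_to(state):       # states bordering the given state
    return borders.get(state, [])

def traverse(states):     # rivers running through any of the given states
    return sorted({r for s in states for r in rivers_in_state.get(s, [])})

def answer(x):
    return x

print(answer(traverse(next_to(stateid("texas")))))
# ['arkansas', 'canadian', 'cimarron', 'gila', 'mississippi', 'rio grande']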

10 Example NLP Task: Textual Entailment Given two sentences, determine whether the second sentence is implied (entailed) by the first Sentence 1: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year. Sentence 2: Yahoo bought Overture. Sentence 1: The market value of U.S. overseas assets exceeds their book value. Sentence 2: The market value of U.S. overseas assets equals their book value.

11 Example NLP Task: Textual Entailment Given two sentences, determine whether the second sentence is implied (entailed) by the first Sentence 1: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year. Sentence 2: Yahoo bought Overture. TRUE Sentence 1: The market value of U.S. overseas assets exceeds their book value. Sentence 2: The market value of U.S. overseas assets equals their book value. FALSE
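A crude word-overlap baseline shows both what is easy and what is hard about entailment; the threshold below is an arbitrary assumption, and the second pair illustrates why pure lexical overlap is not enough:

import re

# Word-overlap baseline: what fraction of hypothesis words also appear in the premise?
def overlap_entails(premise, hypothesis, threshold=0.6):
    p = set(re.findall(r"\w+", premise.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    return len(h & p) / len(h) >= threshold

premise1 = ("Eyeing the huge market potential, currently led by Google, Yahoo "
            "took over search company Overture Services Inc last year.")
print(overlap_entails(premise1, "Yahoo bought Overture."))   # True (correct label)

premise2 = "The market value of U.S. overseas assets exceeds their book value."
print(overlap_entails(premise2,
      "The market value of U.S. overseas assets equals their book value."))
# True -- but the correct label is FALSE: overlap cannot see that
# "exceeds" contradicts "equals", which is why entailment is hard.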

12 Example NLP Task: Summarization Generate a short summary of a long document or a collection of documents
Article: With a split decision in the final two primaries and a flurry of superdelegate endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him the first African American to head a major-party ticket. Before a chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely outcome to the Democratic race with a nod to the marathon that was ending and to what will be another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee…
Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.

13 Example NLP Task: Question Answering Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web). – When was Barack Obama born? (factoid) August 4, 1961 – Who was president when Barack Obama was born? John F. Kennedy – How many presidents have there been since Barack Obama was born? 9

14 Why is NLP Important? Natural language is the preferred medium of communication for humans – Humans communicate with each other in natural languages – Scientific articles, magazines, clinical reports etc. are all in natural languages – Billions of web pages are also in natural languages Computers can do useful things for us if: – Data is in structured form, e.g. databases, knowledge bases – Specifications are in a formal language, e.g. programming languages NLP bridges the communication gap between humans and computers – Can lead to better and more natural communication with computers – Can process the ever-increasing amount of natural language data generated by people, e.g. extract required information from the web

15 Biomedical Applications of NLP Extract relevant information from large volumes of text, e.g. patient reports, journal articles Identify diagnoses and procedures in patient documents for billing purposes Process an enormous number of patient reports to detect medical errors Extract genomic information from the literature to curate databases

16 NLP is Hard People generally don’t appreciate how intelligent they are as natural language processors! For them, natural language processing is deceptively simple because no conscious effort is required Since computers are orders of magnitude faster at computation, many find it hard to believe that computers are not good at processing natural languages An Artificial Intelligence (AI) problem: – a problem at which people are currently better than computers – Three-year-old kids are better than current computing systems at NLP

17 What Makes NLP Hard? Ambiguity: A word, term, phrase or sentence could mean several possible things: – cad represents 11 different biomolecular entities in flies and the mouse, as well as the clinical concept coronary artery disease – The doctor injected the patient with digitalis. – The doctor injected the patient with malaria. – Time flies like an arrow. – I saw a man on the hill with a telescope. In contrast, computer languages are designed to be unambiguous

18 I saw a man on the hill with a telescope.

19 What Makes NLP Hard? Variability: Lots of ways to express the same thing: – The doctor injected the patient with malaria. – The physician gave the patient suffering from malaria an injection. – The patient with malaria was injected by the doctor. – The doctor injected the patient who has malaria. Computer languages have variability but the equivalence of expressions can be automatically detected

20 Why is there Ambiguity and Variability in Natural Languages? A unique and unambiguous way to express everything would make natural languages unnecessarily complex Language is always used in a physical and/or conceptual context; humans possess a lot of world knowledge and are good at inferring; these are used to simplify language at the cost of making it potentially ambiguous – I am out of money. I am going to the bank. – The river is calm today. I am going to the bank. Variability increases the expressivity of a language and allows humans to express themselves in creative ways – I am headed to the bank. – I am off to the bank. – I am hitting the road to the bank.

21 How to Make Computers Process Natural Languages? Should we model how humans acquire or process language? – A good idea but difficult to model – The human brain is different from a computer processor Humans are good at remembering and recognizing patterns; computers are good at crunching numbers A compromise approach: Model human language processing as much as possible but also utilize the computer’s ability to crunch numbers – Airplanes have wings like birds but they don’t flap them; instead they use engine technology

22 How Do Humans Acquire Language? Children pick up language through experience by associating the language they hear with their perceptual context; language is not taught It is believed that the language experience children get is too little to learn all the complexities of a language; this is known as the “poverty of stimulus” argument It has been postulated (controversially) that key language capabilities are innate and hard-wired into the human brain at birth by genetic inheritance, known as “universal grammar” (Chomsky, 1957) – Children only set the right parameters of this grammar based on the particular language they are exposed to – An interesting and fun book: “The Language Instinct” by Steven Pinker

23 How to Make Computers Process Natural Languages? If language capability requires built-in prior knowledge, then we should manually encode our language knowledge into computers; this is the traditional “rationalist” approach (dominant from the 1960s to the mid-80s)

24 Rationalist Approach
Linguistic Knowledge (manually encoded) -> NLP System
Raw Text -> NLP System -> Automatically Annotated Text

25 How to Make Computers Process Natural Languages? If language capability requires built-in prior knowledge, then we should manually encode our language knowledge into computers; this is the traditional “rationalist” approach (dominant from the 1960s to the mid-80s) – Very difficult and time-consuming – Led to brittle systems that won’t work with slightly different input

26 How to Make Computers Process Natural Languages? An alternate “empiricist” approach is to let the computer acquire the knowledge of language from annotated natural language corpora using machine learning techniques; also known as corpus-based or statistical approach

27 Empiricist Approach
Manually Annotated Training Corpora -> Machine Learning -> Linguistic Knowledge -> NLP System
Raw Text -> NLP System -> Automatically Annotated Text

28 How to Make Computers Process Natural Languages? An alternate “empiricist” approach is to let the computer acquire the knowledge of language from annotated natural language corpora using machine learning techniques; also known as the corpus-based or statistical approach This approach has come to dominate NLP since the 90s because: – Machine learning techniques and computer capabilities have advanced – Large corpora to learn from are available – It typically leads to robust systems – It is easier to annotate a corpus than to directly encode language knowledge These advances may in turn lead to insights into how humans acquire language

29 How to Make Computers Process Natural Languages? Key steps for all NLP tasks: – Formulate the linguistic task as a mathematical/computer problem using appropriate models and data structures – Solve using the appropriate techniques for those models Essential components: – Linguistics – Mathematical/computer models and techniques

30 NLP: State of the Art Several intermediate linguistic analyses for general text can be done with good accuracy: POS tagging, syntactic parsing, dependency parsing, coreference resolution, semantic role labeling – Systems are freely available that do these Many tasks that only require finding specific things from text (i.e. do not require understanding complete sentences) can be done with reasonable success: information extraction, sentiment analysis; they are also commercially important For a particular domain, sentences can be completely understood with good accuracy: semantic parsing

31 NLP: State of the Art Tasks like summarization, machine translation, textual entailment, question-answering that are currently done without fully understanding the sentences can be done with some success but not satisfactorily Fully understanding open domain sentences is currently not possible – Is likely to require encoding or acquiring a lot of common world knowledge Very little work in understanding beyond a sentence, i.e. understanding a whole paragraph or an entire document together

32 Linguistics Essentials Partially based on Chapter 3 of Manning & Schütze’s Statistical NLP book.

33 Basic Steps of Natural Language Processing
Sound waves --Phonetics--> Words --Syntactic processing--> Parses --Semantic processing--> Meaning --Pragmatic processing--> Meaning in context
We will skip phonetics and phonology.

35 Words: Morphology Study of the internal structure of words – carried => carry + ed (past tense) – independently => in + (depend + ent) + ly English has relatively simple morphology; some other languages like German or Finnish have complex word structures Very accurate morphological analyzers are available for most languages; morphology is considered a solved problem Biomedical domains have rich morphology: – hydroxynitrodihydrothymine => hydroxy-nitro-di-hydro-thym-ine – hepaticocholangiojejunostomy => hepatico-cholangio-jejuno-stom-y Identifying morphological structure also helps in dealing with new words
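A toy greedy segmenter illustrates how a known morpheme inventory helps decompose long biomedical terms; the morpheme list below is a tiny hand-picked assumption, not a real morphological lexicon:

# Tiny, hand-picked morpheme inventory (illustrative assumption only).
MORPHEMES = ["hydroxy", "nitro", "di", "hydro", "thym", "ine",
             "hepatico", "cholangio", "jejuno", "stom", "y"]

def segment(word, morphemes=MORPHEMES):
    """Greedily match the longest known morpheme at each position; None on failure."""
    parts, rest = [], word.lower()
    while rest:
        match = max((m for m in morphemes if rest.startswith(m)), key=len, default=None)
        if match is None:
            return None           # unknown prefix: segmentation fails
        parts.append(match)
        rest = rest[len(match):]
    return parts

print(segment("hydroxynitrodihydrothymine"))    # ['hydroxy', 'nitro', 'di', 'hydro', 'thym', 'ine']
print(segment("hepaticocholangiojejunostomy"))  # ['hepatico', 'cholangio', 'jejuno', 'stom', 'y']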

36 Words: Parts of Speech Linguists group words of a language into categories which occur in similar places in a sentence and have similar types of meaning, e.g. nouns, verbs, adjectives; these are called parts of speech (POS) A basic test to see whether or not words belong to the same category is the substitution test – This is a good [dog/chair/pencil]. – This is a [good/bad/green/tall] chair.

37 Parts of Speech Nouns: Typically refer to entities and their names like people, animals, things – John, Mary, boy, girl, dog, cats, mug, table, idea – Can be further divided as proper, singular, plural Pronouns: Variables or place-holders for nouns – Nominative: I, you, he, she, we, they, it – Accusative: me, you, him, her, us, them, it – Possessive: my, your, his, her, our, their, its – 2nd Possessive: mine, yours, his, hers, ours, theirs, its – Reflexive: myself, yourself, himself, herself, ourselves, themselves, itself

38 Parts of Speech Determiners: Describe particular reference of a noun – Articles: a, an, the – Demonstratives: this, that, these, those Adjectives: Describe properties of nouns – good, bad, green, tall Verbs: Describe actions – talk, sleep, eat, throw – Categorized based on tense, person, singular/plural

39 Parts of Speech Adverbs: Modify verbs by specifying space, time, manner or degree – often, slowly, very Prepositions: Small words that express spatial relations and other attributes – in, on, over, of, about, to, with – They introduce prepositional phrases, which often make a sentence ambiguous: I saw a man on the hill with a telescope. – Prepositional phrase attachment: another important NLP problem Particles: Subclass of prepositions that bond with verbs to form phrasal verbs – take off, air out, ran up

40 POS Tagging
POS tagging is often the first step in analyzing a sentence.
Why is this a non-trivial task? The same word can have different POS tags in different sentences:
– His position was near the tree. (position = Noun)
– Position him near the tree. (position = Verb)
John/NOUN saw/VERB the/DT saw/NOUN and/CONJ decided/VERB to/TO take/VERB it/PRP to/PREP the/DT table/NOUN
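For a quick experiment, NLTK's off-the-shelf tagger can be used; this sketch assumes the NLTK 'punkt' and 'averaged_perceptron_tagger' data packages have been downloaded with nltk.download(), and its Penn Treebank tags differ from the coarse labels on the slide:

import nltk

for sentence in ["His position was near the tree.", "Position him near the tree."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Ideally "position" is tagged as a noun (NN) in the first sentence and as a
# verb (VB) in the second; taggers can still err on such cases, which is the
# point of the slide.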

41 Basic Steps of Natural Language Processing
Sound waves --Phonetics--> Words --Syntactic processing--> Parses --Semantic processing--> Meaning --Pragmatic processing--> Meaning in context

42 Phrase Structure Most languages have a word order Words are organized into phrases, groups of words that act as a single unit or constituent – [The dog] [chased] [the cat]. – [The fat dog] [chased] [the thin cat]. – [The fat dog with red collar] [chased] [the thin old cat]. – [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

43 Phrases Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded – An optional determiner followed by zero or more adjectives, a noun head and zero or more prepositional phrases Prepositional phrase: Headed by a preposition; expresses spatial, temporal or other attributes Verb phrase: The part of the sentence that depends on the verb. Headed by the verb. Adjective phrase: Acts like an adjective.

44 An Important NLP Task: Phrase Chunking Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence. – [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs]. – [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] Some applications need all the noun phrases in a sentence
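NP chunking can be approximated with a small regular-expression grammar over POS tags, as in this NLTK sketch (again assuming the tokenizer and tagger data are installed); the chunkers used in shared tasks are learned from annotated corpora:

import nltk

# One chunk rule: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tokens = nltk.word_tokenize("I ate the spaghetti with meatballs.")
tagged = nltk.pos_tag(tokens)
print(chunker.parse(tagged))   # NPs such as (NP the/DT spaghetti/NN) come out bracketed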

45 Phrase Structure Grammars Syntax is the study of word orders and phrase structures Syntactic analysis tells how to determine the meaning of a sentence from the meanings of its words – The dog bit the man. – The man bit the dog. A basic question in Linguistics: What forms a legal sentence in a language? Syntax helps to answer that question – *Bit the the man dog. – Colorless green ideas sleep furiously.

46 Phrase Structure Grammars Linguists have come up with many grammar formalisms to capture the syntax of languages; phrase structure grammar is one of them and is very commonly used A context-free grammar generates the sentences of a language; a small grammar with productions:
S -> NP VP
VP -> Verb
VP -> Verb NP
NP -> Article Noun
Verb -> [slept|ate|made|bit]
Noun -> [girl|cake|dog|man]
Article -> [A|The]
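The toy grammar above can be run directly with NLTK's chart parser; lowercase articles are added below (an assumption) so the example sentence parses:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> Verb | Verb NP
NP -> Article Noun
Verb -> 'slept' | 'ate' | 'made' | 'bit'
Noun -> 'girl' | 'cake' | 'dog' | 'man'
Article -> 'A' | 'The' | 'a' | 'the'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The girl ate the cake".split()):
    print(tree)
# (S (NP (Article The) (Noun girl)) (VP (Verb ate) (NP (Article the) (Noun cake))))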

47 Phrase Structure Grammars
The parse of a sentence is typically shown as a tree (a syntactic derivation or parse tree); internal nodes are non-terminals and leaves are terminals.
The girl ate the cake.
[S [NP [Article The] [Noun girl]] [VP [Verb ate] [NP [Article the] [Noun cake]]]]

48 Phrase Structure Grammars Some of the productions can be recursive (like NP -> NP PP), and these can then expand several times – [S [NP I] [VP saw [NP the man [PP on [NP the hill [PP with [NP the telescope [PP in [NP Texas]]]]]]]]] Because of recursion in the grammar, there is a potentially infinite number of sentences in a language

49 Syntactic Parsing: A Very Important NLP Task Typically a grammar can lead to several parses of a sentence: syntactic ambiguity
– [S [NP I] [VP saw [NP the man [PP on [NP the hill [PP with [NP the telescope [PP in [NP Texas]]]]]]]]]
– [S [NP I] [VP saw [NP the man [PP on [NP the hill]]] [PP with [NP the telescope [PP in [NP Texas]]]]]]
– [S [NP I] [VP saw [NP the man [PP on [NP the hill]] [PP with [NP the telescope [PP in [NP Texas]]]]]]]
– ...
– Not uncommon to have hundreds of parses for a sentence

50 Simple PCFG for ATIS English
Grammar (each rule followed by its probability; the probabilities for each left-hand side sum to 1.0):
S → NP VP 0.8
S → Aux NP VP 0.1
S → VP 0.1
NP → Pronoun 0.2
NP → Proper-Noun 0.2
NP → Det Nominal 0.6
Nominal → Noun 0.3
Nominal → Nominal Noun 0.2
Nominal → Nominal PP 0.5
VP → Verb 0.2
VP → Verb NP 0.5
VP → VP PP 0.3
PP → Prep NP 1.0
Lexicon:
Det → the 0.6 | a 0.2 | that 0.1 | this 0.1
Noun → book 0.1 | flight 0.5 | meal 0.2 | money 0.2
Verb → book 0.5 | include 0.2 | prefer 0.3
Pronoun → I 0.5 | he 0.1 | she 0.1 | me 0.3
Proper-Noun → Houston 0.8 | NWA 0.2
Aux → does 1.0
Prep → from 0.25 | to 0.25 | on 0.1 | near 0.2 | through 0.2

51 Sentence Probability
Assume the production for each node is chosen independently; the probability of a derivation is then the product of the probabilities of its productions.
Derivation D1 of "Book the flight through Houston", with the PP attached inside the noun phrase (the flight that goes through Houston):
S → VP (0.1), VP → Verb NP (0.5), Verb → book (0.5), NP → Det Nominal (0.6), Det → the (0.6), Nominal → Nominal PP (0.5), Nominal → Noun (0.3), Noun → flight (0.5), PP → Prep NP (1.0), Prep → through (0.2), NP → Proper-Noun (0.2), Proper-Noun → Houston (0.8)
P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 0.5 x 1.0 x 0.2 x 0.2 x 0.8 = 0.0000216

52 Syntactic Disambiguation
Resolve ambiguity by picking the most probable parse tree.
Derivation D2 of the same sentence, with the PP attached to the verb phrase (do the booking through Houston):
S → VP (0.1), VP → VP PP (0.3), VP → Verb NP (0.5), Verb → book (0.5), NP → Det Nominal (0.6), Det → the (0.6), Nominal → Noun (0.3), Noun → flight (0.5), PP → Prep NP (1.0), Prep → through (0.2), NP → Proper-Noun (0.2), Proper-Noun → Houston (0.8)
P(D2) = 0.1 x 0.3 x 0.5 x 0.5 x 0.6 x 0.6 x 0.3 x 0.5 x 1.0 x 0.2 x 0.2 x 0.8 = 0.00001296
Since P(D1) > P(D2), D1 is the preferred parse.
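The arithmetic above can be reproduced directly; a minimal sketch where the two probability lists simply copy the productions used by D1 and D2 on the slides (requires Python 3.8+ for math.prod):

from math import prod

# Production probabilities of the two derivations, copied from the slides.
D1 = [0.1, 0.5, 0.5, 0.6, 0.6, 0.5, 0.3, 0.5, 1.0, 0.2, 0.2, 0.8]  # PP inside the NP
D2 = [0.1, 0.3, 0.5, 0.5, 0.6, 0.6, 0.3, 0.5, 1.0, 0.2, 0.2, 0.8]  # PP attached to the VP

print(prod(D1))   # ~2.16e-05
print(prod(D2))   # ~1.296e-05  -> D1 is the more probable parse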

53 Syntax in Biomedical Languages Clinical language often relaxes many syntactic constraints in order to be highly compact – The cough worsened – Cough worsened – Cough – Increased tenderness. Because these forms are widely used, they are not considered ungrammatical but are treated as a sublanguage There is a wide variety of sublanguages in the biomedical domain, each exhibiting specialized content and linguistic forms

54 Basic Steps of Natural Language Processing
Sound waves --Phonetics--> Words --Syntactic processing--> Parses --Semantic processing--> Meaning --Pragmatic processing--> Meaning in context

55 Lexical Semantics Study of the meaning of individual words – How to represent word meanings? – How are they related to each other? Synonyms, antonyms, hypernyms (more general), hyponyms (more specific) A word can have multiple meanings; this leads to lexical ambiguity: “I am going to the bank.” Compositionality: How meanings of individual words combine to give the meaning of a sentence – Many exceptions: “kick the bucket”
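Lexical resources such as WordNet encode many of these relations; a small NLTK sketch, assuming the 'wordnet' corpus has been downloaded with nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:4]:          # "bank" is lexically ambiguous
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")
print([h.name() for h in dog.hypernyms()])     # more general concepts, e.g. canine.n.02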

56 Semantic Role Labeling
Determine the semantic role played by each noun phrase related to a verb.
Example: “Show the long email Alice sent me yesterday.” For the verb sent: sender = Alice, recipient = me, theme = the long email.

57 Semantic Parsing
Convert a natural language sentence into an executable meaning representation for a domain.
Example: query application for a U.S. geography database
Sentence: Which rivers run through the states bordering Texas?
Query: answer(traverse(next_to(stateid('texas'))))
Answer: Arkansas, Canadian, Cimarron, Gila, Mississippi, Rio Grande, ...

58 Semantics in Biomedical Languages Easier to interpret than general language because they exhibit highly restrictive semantic patterns (Harris et al. 1989, 1991; Sager et al. 1987) Biomedical sublanguages tend to have a relatively small number of semantic types (e.g. medication, gene, disease, organism etc.), e.g. the semantic types from UMLS They also have a small number of semantic patterns, e.g.: medication treats disease; gene interacts with gene
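A toy illustration of exploiting such semantic patterns for relation extraction; the gene dictionary and the single surface pattern below are invented for illustration and are far simpler than pattern sets built from UMLS semantic types:

import re

# Illustrative gene/protein dictionary and one surface pattern mapping
# "X is associated with Y" to the relation interacts(X, Y).
GENES = ["cyclin D1", "p34cdc2", "p33cdk2", "cyclin A"]
gene_re = "|".join(re.escape(g) for g in GENES)
pattern = re.compile(rf"({gene_re}) (?:is )?associated with ({gene_re})", re.IGNORECASE)

text = "Immunoprecipitation showed that cyclin D1 is associated with p34cdc2."
for g1, g2 in pattern.findall(text):
    print(f"interacts({g1}, {g2})")            # interacts(cyclin D1, p34cdc2)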

59 Basic Steps of Natural Language Processing
Sound waves --Phonetics--> Words --Syntactic processing--> Parses --Semantic processing--> Meaning --Pragmatic processing--> Meaning in context

60 Pragmatics Discourse analysis: How are sentences or groups of sentences connected in text or speech – Is a sentence elaborating/contradicting/restating the previous sentence? Anaphora resolution: Determine which phrases in a document refer to the same underlying entities (includes pronoun resolution) – Obama visited the town. The president gave a speech. Both these tasks require knowledge about the world and language

61 Pronoun Resolution: An Important NLP Problem John put the cherry on the plate and ate it. John put the cherry on the cake and ate it. John met Jim in the park. He gave him a book. Pronouns need to be resolved before a sentence can be further analyzed.

62 Pragmatics in Biomedical Languages Context determines meaning of words and sentences – “mass” may mean different things in a radiology report of the chest and a mammography report Pronoun resolution An infiltrate was noted in right upper lobe; it was patchy.

63 Basic Steps of Natural Language Processing
Sound waves --Phonetics--> Words --Syntactic processing--> Parses --Semantic processing--> Meaning --Pragmatic processing--> Meaning in context
– This is a conceptual pipeline; humans may well process multiple stages simultaneously
– Processing in a pipeline often propagates errors, but processing stages jointly increases computational complexity

64 Challenges of Clinical Language Processing Good Performance – Performance should be good enough for clinical applications and should not be significantly worse than that of medical experts – The system should have the flexibility to trade off precision and recall Recovery of Implicit Information – The NLP system should contain enough medical knowledge to make appropriate inferences – “rupture” means “rupture of membranes” – “patchy opacity” and “focal infiltrate” may indicate “pneumonia”
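The precision/recall flexibility mentioned above usually comes from thresholding a system's confidence scores; a minimal sketch with made-up scores and gold labels, not output of any real clinical system:

# Each prediction is (confidence score, is the item truly relevant); values invented.
predictions = [(0.95, True), (0.90, True), (0.80, False), (0.60, True),
               (0.40, False), (0.30, True)]
total_relevant = sum(1 for _, rel in predictions if rel)

for threshold in (0.9, 0.5, 0.2):
    kept = [rel for score, rel in predictions if score >= threshold]
    tp = sum(kept)                              # true positives among kept items
    precision = tp / len(kept)
    recall = tp / total_relevant
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# Raising the threshold increases precision but lowers recall, and vice versa.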

65 Challenges of Clinical Language Processing Intraoperability: The NLP system should seamlessly integrate into the clinical information system – Many different interchange formats (e.g. HL7) – Different types of reports with different formats; text may contain tables, structured fields etc. – Output of the NLP system should be mapped to an appropriate controlled vocabulary, e.g. UMLS, SNOMED or ICD Interoperability – A standard representation model for medical language is needed to represent negation, certainty, severity etc. of clinical terms

66 Challenges of Clinical Language Processing Training set availability – Patient records are confidential and their use requires approval of an institutional review board (IRB) – There are methods to de-identify names etc., but identifying them reliably is itself not easy – These issues do not arise when processing literature Limited availability in electronic form – Many clinical documents are still written on paper – Optical Character Recognition (OCR) is not accurate, especially with physicians’ notes

67 Challenges of Clinical Language Processing Evaluation – Difficult to obtain gold-standard data; it is time-consuming for medical experts to annotate data – Evaluation competitions are very useful; they help compare different systems on the same data
BioCreative: http://www.biocreative.org/
BioNLP shared task: http://sites.google.com/site/bionlpst/
i2b2 shared task: https://www.i2b2.org/NLP/Coreference/Call.php
TREC Medical Records task

68 Challenges of Clinical Language Processing Expressiveness – More than 200 different expressions for severity information: faint, mild, borderline, 3rd degree, mild to moderate etc. – Complex modifiers: “no improvement in pneumonia” in text will match a query “improvement in pneumonia” Lack of standardized domains for classifying clinical reports – pvc may mean pulmonary vascular congestion in a chest X-ray report and premature ventricular complexes in an electrocardiogram report

69 Challenges of Clinical Language Processing Compactness of text – Very compact, containing many abbreviations – Sentence boundaries poorly delineated Admit 10/23 71 yo woman h/o DM, HTN, Dilated CM/CHF, Afib s/p embolic event, chronic diarrhea, admitted with SOB. Rare events – Medical errors and adverse events are not reported frequently; it is difficult to train a system to detect them
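Handling such compact text typically starts with an abbreviation dictionary; a toy sketch follows, where the expansions are common readings assumed for illustration (in real reports the right expansion depends on context, as the pvc example above shows):

# Illustrative abbreviation dictionary; expansions are assumed common readings.
ABBREV = {"h/o": "history of", "DM": "diabetes mellitus", "HTN": "hypertension",
          "CHF": "congestive heart failure", "Afib": "atrial fibrillation",
          "s/p": "status post", "SOB": "shortness of breath", "yo": "year old"}

def expand(token):
    core = token.strip(",.")                    # keep trailing punctuation intact
    return token.replace(core, ABBREV[core]) if core in ABBREV else token

note = "71 yo woman h/o DM, HTN, Afib s/p embolic event, admitted with SOB."
print(" ".join(expand(tok) for tok in note.split()))
# 71 year old woman history of diabetes mellitus, hypertension, atrial fibrillation
# status post embolic event, admitted with shortness of breath.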

70 Challenges of Biological Language Processing Dynamic nature of domain – Continuous discoveries of genes, proteins etc. – Continuous creation and withdrawal of names Ambiguous biomolecular names – Different model organism groups name genes and other entities independently – cad represents over 11 different biomolecular entities in Drosophila and the mouse Large number of biomolecular entities – 70,000+ genes, 100,000+ proteins, 1 million species – Requires a very large knowledge base of names and recognizing the correct one from context

71 Challenges of Biological Language Processing Variant names – Authors may use various forms of names: bmp-4, bmp 4, bmp4; syt4, syt IV; Iga, ig alpha Nesting of names – Many biomolecular entity names are long and contain other entity names inside them, e.g. caspase recruitment domain 4
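A rough sketch of normalizing variant gene-name forms like those above by lowercasing and dropping spaces and hyphens; real normalizers rely on curated synonym resources (e.g. Entrez Gene), and the Roman-numeral mapping here is an ad hoc assumption:

# Ad hoc Roman-numeral mapping for trailing numerals (assumption for illustration).
ROMAN = {"iv": "4", "iii": "3", "ii": "2", "i": "1"}

def normalize(name):
    parts = name.lower().replace("-", " ").split()
    if parts and parts[-1] in ROMAN:            # syt IV -> syt 4
        parts[-1] = ROMAN[parts[-1]]
    return "".join(parts)                        # drop remaining spaces: bmp 4 -> bmp4

for variant in ["bmp-4", "bmp 4", "bmp4", "syt4", "syt IV"]:
    print(variant, "->", normalize(variant))
# All bmp variants map to "bmp4"; both syt variants map to "syt4".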

72 Challenges of Biological Language Processing Complexity of language – In clinical text, information is typically expressed as noun phrases together with their modifiers – In biological text, information is typically expressed as highly nested verb and noun phrases: “Bad phosphorylation induced by IL-3 was inhibited by specific inhibitors of PI3 kinase” Nested events: Inhibit(inhibitors of PI3 kinase, Induce(IL-3, Phosphorylate(?, Bad)))

73 Challenge for Both Clinical and Biological Language Processing Multidisciplinary nature – Requires expertise in biomedical domains as well as computer science, mathematics, chemistry etc. – Collaborations are essential

74 Resources Some sources for NLP datasets and tools:
A list of NLP resources: http://nlp.stanford.edu/links/statnlp.html
Biomedical Corpora and Tools: http://orbitproject.org/
OpenNLP Project: http://opennlp.sourceforge.net/projects.html
Natural Language Toolkit: http://www.nltk.org/
NLTK Corpora: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

