Neurocognitive approach to clustering of PubMed query results. P. Matykiewicz, Włodzisław Duch, Dept. of Informatics, Nicolaus Copernicus University, Toruń, Poland; P.M. Zender, K.A. Crutcher, J.P. Pestian, Cincinnati Children's Hospital Medical Center, Ohio, USA. Google: W. Duch. ICONIP 2008, Auckland, NZ
Plan How can we help medical professionals to find relevant information? Neurocognitive informatics. Semantic memory and other types of memory. Creating semantic memory. UMLS as a semantic memory. Spreading activation. Literature based discovery. Neurocognitive approach to literature based discovery. Plans for the future.
Neurocognitive informatics Computational Intelligence. An International Journal (1984) + 10 other journals with “Computational Intelligence” in the title. D. Poole, A. Mackworth, R. Goebel, Computational Intelligence - A Logical Approach (OUP 1998): a GOFAI book on logic and reasoning. CI: lower cognitive functions, perception, signal analysis, action control, sensorimotor behavior. AI: higher cognitive functions, thinking, reasoning, planning etc. Neurocognitive informatics: brain processes can be a great inspiration for AI algorithms, if we could only understand them... What are the neurons doing? Perceptrons, basic units in multilayer perceptron networks, use threshold logic – NN inspirations. What are the networks doing? Specific transformations, memory, estimation of similarity. How do higher cognitive functions map to the brain activity? Neurocognitive informatics = abstractions of this process.
Types of memory Neurocognitive approach to NLP: at least 4 types of memories. Long term (LTM): recognition, semantic, episodic + working memory. Input (text, speech) pre-processed using recognition memory model to correct spelling errors, expand acronyms etc. For dialogue/text understanding episodic memory models are needed. Working memory: an active subset of semantic/episodic memory. All 3 LTM are coupled mutually providing context for recognition. Semantic memory is a permanent storage of conceptual data. “Permanent”: data is collected throughout the whole lifetime of the system, old information is overridden/corrected by newer input. “Conceptual”: contains semantic relations between words and uses them to create concept definitions.
Semantic Memory Models Endel Tulving, “Episodic and Semantic Memory”, 1972. Semantic memory refers to the memory of meanings and understandings. It stores concept-based, generic, context-free knowledge. Permanent container for general knowledge (facts, ideas, words etc.). Figures: semantic network (Collins & Loftus, 1975); hierarchical model (Collins & Quillian, 1969).
Semantic memory Hierarchical model of semantic memory (Collins and Quillian, 1969), followed by most ontologies. Connectionist spreading activation model (Collins and Loftus, 1975), with mostly lateral connections. Our implementation is based on the connectionist model and uses a relational database with an object access layer API. The database stores three types of data: concepts, or objects being described; keywords (features of concepts extracted from data sources); relations between them. The IS-A relation is used to build the ontology tree, which serves for activation spreading, i.e. inheritance of features down the ontology tree. Types of relations (like “x IS y”, or “x CAN DO y” etc.) may be defined when input data is read from dictionaries and ontologies.
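A minimal sketch of the concept/keyword/relation store and of feature inheritance along IS-A links. This is an illustration only: the slides describe a relational database behind an object access layer, whereas the `SemanticMemory` class, its method names, and the canary/bird data below are hypothetical stand-ins kept in memory.

```python
from collections import defaultdict


class SemanticMemory:
    """Toy in-memory stand-in for the concept/keyword/relation store (hypothetical)."""

    def __init__(self):
        # (relation type, concept) -> set of related concepts or keywords
        self.relations = defaultdict(set)

    def add(self, concept, relation, value):
        """Store one typed relation, e.g. ("canary", "IS-A", "bird")."""
        self.relations[(relation, concept)].add(value)

    def features(self, concept, relation):
        """Values of `relation` for `concept`, including those inherited via IS-A."""
        found, seen, stack = set(), set(), [concept]
        while stack:
            node = stack.pop()
            if node in seen:  # guard against cycles in the IS-A links
                continue
            seen.add(node)
            found |= self.relations.get((relation, node), set())
            stack.extend(self.relations.get(("IS-A", node), set()))
        return found


# Invented example data, not taken from the authors' database.
sm = SemanticMemory()
sm.add("canary", "IS-A", "bird")
sm.add("bird", "IS-A", "animal")
sm.add("animal", "CAN DO", "breathe")
sm.add("bird", "CAN DO", "fly")
sm.add("canary", "CAN DO", "sing")

canary_abilities = sm.features("canary", "CAN DO")
```

Here `canary_abilities` collects the canary's own feature plus those stored on its ancestors, which is the sense in which features are inherited down the ontology tree.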
SM & neural distances Activations of groups of neurons presented in activation space define similarity relations in a geometrical model (McClelland, McNaughton, O’Reilly, Why there are complementary learning systems, 1994).
Similarity between concepts Left: MDS on vectors from neural network. Right: MDS on data from psychological experiments with perceived similarity between animals. Vector and probabilistic models are approximations to this process. $S_{ij} \sim \langle (w_i, \mathrm{Cont}) \mid (w_j, \mathrm{Cont}) \rangle$
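One common vector-model approximation of concept similarity is the normalized inner product (cosine) of activation vectors. The sketch below assumes that choice; the slides do not say which measure was used, and the three-dimensional activation vectors are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Normalized inner product of two activation vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical activation vectors for three concepts in one context.
activations = {
    "dog": [0.9, 0.8, 0.1],
    "wolf": [0.8, 0.9, 0.2],
    "canary": [0.1, 0.2, 0.9],
}

sim_dog_wolf = cosine_similarity(activations["dog"], activations["wolf"])
sim_dog_canary = cosine_similarity(activations["dog"], activations["canary"])
```

With these made-up vectors, "dog" comes out closer to "wolf" than to "canary"; a matrix of such pairwise similarities is the kind of input an MDS plot is built from.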
Creating SM The API serves as a data access layer providing logical operations between raw data and higher application layers. Data stored in the database is mapped into application objects and the API allows for retrieving specific concepts/keywords. Two major types of data sources for semantic memory: 1. machine-readable structured dictionaries directly convertible into semantic memory data structures; 2. blocks of text, definitions of concepts from dictionaries/encyclopedias. Three machine-readable data sources are used: the Suggested Upper Merged Ontology (SUMO) and the Mid-Level Ontology (MILO), over 20,000 terms and 60,000 axioms; the WordNet lexicon, more than 200,000 word-sense pairs; ConceptNet, a concise knowledge base with 200,000 assertions.
Creating SM – free text WordNet hypernymic (a kind of ...) IS-A relation + hyponym and meronym relations between synsets (converted into concept/concept relations), combined with ConceptNet relations such as: CapableOf, PropertyOf, PartOf, MadeOf... Relations were added only if present in both WordNet and ConceptNet. Free-text data: Merriam-Webster, WordNet and Tiscali. Whole word definitions are stored in SM linked to concepts. Keywords: a set of the most characteristic words from the definitions of a given concept. For each concept definition, one set of words per source dictionary is used; words are replaced with synset words, and the subset common to all 3 dictionaries is mapped back to synsets – these are most likely related to the initial concept. They were stored as a separate relation type. Articles and prepositions were removed using a manually created stop-word list. Phrases were extracted using ApplePieParser; concept-phrase relations were compared with concept-keyword relations, and only phrases that matched keywords were used.
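The keyword-selection step described above can be sketched roughly as follows. This is a simplified illustration: the synset mapping is omitted, the stop-word list is a tiny invented one, and the three definitions are made up rather than quoted from Merriam-Webster, WordNet or Tiscali.

```python
# Small hypothetical stop-word list (the slides mention a manually created one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "as", "is", "that", "and"}


def keywords(definition):
    """Lower-cased words of a definition with punctuation and stop words removed."""
    words = (w.strip(".,;:()").lower() for w in definition.split())
    return {w for w in words if w and w not in STOP_WORDS}


def common_keywords(definitions):
    """Words shared by the definitions from every source dictionary."""
    sets = [keywords(d) for d in definitions]
    return set.intersection(*sets) if sets else set()


# Invented definitions of "canary", one per (hypothetical) source dictionary.
definitions = [
    "A small yellow finch kept as a cage bird.",
    "Yellow songbird of the finch family.",
    "The canary is a yellow finch native to the Canary Islands.",
]

shared = common_keywords(definitions)
```

Requiring a word to appear in all sources trades recall for precision: only `yellow` and `finch` survive here, which mirrors the idea that the common subset is most likely related to the initial concept.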
UMLS: Expert Semantic Memory Biomedical domain: hundreds of controlled vocabularies, hierarchies and ontologies. GO - gene ontology, used for gene annotation. ICD-9-CM - used for billing in US hospitals. SNOMED CT - used in electronic medical record systems. MeSH - used in annotation of biomedical literature in PubMed. Psychological Index Terms - used to annotate articles in psychology/psychiatry domain in PsycARTICLES citation database. All of these sources and ~90 other sources connected together create the Unified Medical Language System (UMLS). This is the most detailed description of concepts and relations between them created so far.
Some facts about UMLS UMLS version 2007AC has: 92 English sources, including SNOMED CT, MeSH, ICD-9-CM, ICD-10 etc.; 54,245 ambiguous phrases; 3,723,408 unique English phrases; 1,516,299 concepts. Concepts have: 16,918,281 unique structural (semantic) relations; 13,226,382 unique co-occurrence (associative) relations (e.g. PubMed medical subject headings co-occurrence); attributes, contexts, definitions, semantic types,... Is it a good basis for semantic/episodic memory and spreading activation networks approximating associations in an expert’s brain?
Enhancing representations Experts reading the text activate their semantic memory and add a lot of knowledge that is not explicitly present in the text. Semantic memory is difficult to create: co-occurrence statistics do not capture structural relations of real objects and features. Better approximation (not as good as SM): use ontologies, adding parent concepts to those discovered in the text. Examples:
IBD => [C0021390] Inflammatory Bowel Diseases -> [C0341268] Disorder of small intestine -> [C0012242] Digestive System Disorders -> [C1290888] Inflammatory disorder of digestive tract -> [C1334233] Intestinal Precancerous Condition -> [C0851956] Gastrointestinal inflammatory disorders NEC -> [C1285331] Inflammation of specific body organs -> [C0021831] Intestinal Diseases -> [C0178283] [X]Non-infective enteritis and colitis
[C0025677] Methotrexate (Pharmacologic Substance) => [C0003191] Antirheumatic Agents -> [C1534649] Analgesic/antipyretic/antirheumatic
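Adding parent concepts to those found in a text can be sketched as a lookup in a child-to-parents map. The concept identifiers below are taken from the example on this slide, but the `PARENTS` table, the `enhance` function, and the assumption that each listed concept is a direct parent are illustrative; a real system would read these relations from UMLS.

```python
# Hypothetical excerpt of a child -> parents map, using identifiers from the slide.
PARENTS = {
    "C0021390": ["C0021831", "C0012242"],  # Inflammatory Bowel Diseases
    "C0025677": ["C0003191"],              # Methotrexate
}


def enhance(concepts, parents, levels=1):
    """Return the concepts plus their ancestors up to `levels` steps away."""
    enhanced = set(concepts)
    frontier = set(concepts)
    for _ in range(levels):
        # Parents of the current frontier that have not been added yet.
        frontier = {p for c in frontier for p in parents.get(c, ())} - enhanced
        enhanced |= frontier
    return enhanced


found_in_text = {"C0021390", "C0025677"}
enhanced = enhance(found_in_text, PARENTS)
```

The `levels` parameter is an assumed knob: climbing too far up an ontology adds very general concepts that match almost every document, so a small value is the cautious default.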
Literature based discovery Biomedical research is divided into highly specialized fields and subfields, with poor communication between them. The rate of growth of publications makes it difficult for a researcher to derive connections between concepts from different research specialties. Mining hidden connections among biomedical concepts from large amounts of scientific literature is one of the important goals pursued in this field. Swanson explored biomedical literature to find novel connections between medical concepts. He proposed that “Fish Oil” may be used as a cure for “Raynaud's Disease”. Researchers followed up his finding and the hypothesis turned out to be true.
Literature based discovery example Swanson found the hidden connection between “Fish Oil” and “Raynaud's Disease” by finding a common set of concepts in the document sets on “Fish Oil” and on “Raynaud's Disease”. Figure: Fish Oil and Raynaud's disease linked through high blood viscosity and platelet aggregation. You can make medical discoveries!
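The idea of finding linking concepts shared by two otherwise separate literatures can be sketched with plain set operations. This is a toy illustration: each document is reduced to a set of concept names, the four documents are invented, and `linking_concepts` is a hypothetical helper rather than Swanson's actual procedure.

```python
def linking_concepts(docs_a, docs_c, term_a, term_c):
    """Concepts occurring in both document sets, excluding the two query terms."""
    concepts_a = set().union(*docs_a)
    concepts_c = set().union(*docs_c)
    return sorted((concepts_a & concepts_c) - {term_a, term_c})


# Invented documents, each represented by the set of concepts it mentions.
fish_oil_docs = [
    {"fish oil", "high blood viscosity"},
    {"fish oil", "platelet aggregation", "triglycerides"},
]
raynaud_docs = [
    {"raynaud's disease", "high blood viscosity"},
    {"raynaud's disease", "platelet aggregation", "vasoconstriction"},
]

links = linking_concepts(fish_oil_docs, raynaud_docs, "fish oil", "raynaud's disease")
```

No single document mentions both query terms, yet the shared intermediate concepts suggest a candidate connection that an expert can then evaluate.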
Literature based discovery using Visual Language System (VLS) Hypothesis: interesting relations are recognized more quickly when the graph is presented as icons. First, consistent graphs are needed.
Graphs of consistent concepts General GCC idea: when the text is read and understood, activation of a semantic subnetwork in the expert's brain spreads to new patterns corresponding to related concepts; new concepts automatically have to fit into the active network, assuming meanings that increase overall network activation, or the consistency of text interpretation. Many approximations of this process may be defined. Success depends on the quality of the semantic network. Explicit competition/inhibition among network nodes is important. 1. Recognition of concepts. 2. Spreading activation from concepts that are in the text to related concepts. 3. Build the graph, inhibiting concepts that are irrelevant.
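The three steps above can be approximated in a few lines. This is only one of the many possible approximations the slide alludes to: the graph, the `decay`, `steps` and `threshold` parameters are invented, and a simple activation threshold stands in for explicit competition/inhibition between nodes.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2, threshold=0.6):
    """Spread activation from seed concepts, then drop weakly activated ones."""
    activation = {c: 1.0 for c in seeds}  # step 1: concepts recognized in the text
    frontier = dict(activation)
    for _ in range(steps):                # step 2: spread to related concepts
        incoming = {}
        for node, act in frontier.items():
            for neighbor in graph.get(node, ()):
                incoming[neighbor] = incoming.get(neighbor, 0.0) + act * decay
        for node, act in incoming.items():
            activation[node] = activation.get(node, 0.0) + act
        frontier = incoming
    # Step 3: crude inhibition, keeping only sufficiently activated concepts.
    return {c: a for c, a in activation.items() if a >= threshold}


# Hypothetical directed relation graph between concepts.
graph = {
    "fish oil": ["platelet aggregation", "high blood viscosity"],
    "raynaud's disease": ["platelet aggregation", "high blood viscosity", "vasoconstriction"],
    "platelet aggregation": ["thrombosis"],
    "vasoconstriction": ["hypertension"],
}

consistent = spread_activation(graph, ["fish oil", "raynaud's disease"])
```

Concepts supported by both seeds accumulate more activation than those reached from only one, so they survive the threshold while weakly supported concepts are filtered out.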
PubMed queries Searching for: "Alzheimer disease"[MeSH Terms] AND "apolipoproteins e"[MeSH Terms] AND "humans"[MeSH Terms] returns 2899 citations with 1924 MeSH terms. Out of 16 MeSH hierarchical trees only 4 trees have been selected: Anatomy; Diseases; Chemicals & Drugs; Analytical, Diagnostic and Therapeutic Techniques & Equipment. The number of concepts is 1190. Loop over: cluster analysis; feature space enhancement through UMLS relations between MeSH concepts; inhibition, leading to filtering of concepts. Create graphical representation.
Future work Collaborative work with: Graphic designers: design glyphs as a basis for icons; design rules for how glyphs are connected to create an icon; design layout for consistent graphs. Computer scientists: study effects of inhibition (different feature selection methods); study properties of the spreading activation algorithm; apply to other fields (e.g. text classification). Field experts: study performance of experts when a text graph vs. an icon graph is presented; rate graphs based on their content.
Thank you for lending your ears... Google: W. Duch => Papers/presentations/projects