Determining the Syntactic Structure of Medical Terms in Clinical Notes Bridget T. McInnes Ted Pedersen Serguei V. Pakhomov

Determining the Syntactic Structure of Medical Terms in Clinical Notes Bridget T. McInnes¹ Ted Pedersen² and Serguei V. Pakhomov¹ University of Minnesota¹.

Goal
Present a simple but effective approach for identifying the syntactic structure of three-word terms.

Importance
Potentially improve the analysis of unrestricted medical text:
- Mapping of medical text to standardized terminologies
- Unsupervised syntactic parsing

Syntactic Structure of Terms
[Figure: the four possible structures of a three-word term w1 w2 w3: Monolithic, Non-branching, Right-branching, Left-branching. Blue = independence, green = dependence.]

Example
small bowel obstruction

Syntactic Structure of Example
[Figure: "small bowel obstruction" drawn under each of the four structures: Monolithic, Non-branching, Right-branching, Left-branching.]

Method used to determine the structure of a term
The Log Likelihood Ratio is the ratio between the observed probability of a term occurring and the probability with which it would be expected to occur:

$\dfrac{\text{Probability of term occurring}}{\text{Expected probability of term}}$

Log Likelihood Ratio
The expected probability of a term is often based on the Non-branching (Independence) Model:

$\dfrac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel})\,P(\text{obstruction})}$

Numerator: observed probability. Denominator: expected probability.
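To make the observed-versus-expected comparison concrete, here is a minimal Python sketch. The function name and all counts are hypothetical, chosen only for illustration; they are not taken from the Mayo Clinic corpus.

```python
def observed_vs_expected(n_term, n_w1, n_w2, n_w3, n_total):
    """Return (observed, expected) probabilities of a three-word term.

    The expected probability assumes the non-branching (independence)
    model: the three words occur independently of one another.
    """
    observed = n_term / n_total
    expected = (n_w1 / n_total) * (n_w2 / n_total) * (n_w3 / n_total)
    return observed, expected


# Hypothetical counts: the trigram occurs 50 times among 1,000,000 trigrams.
obs, exp = observed_vs_expected(
    n_term=50, n_w1=1000, n_w2=200, n_w3=400, n_total=1_000_000
)
ratio = obs / exp  # far above 1 when the words co-occur more than chance predicts
```

A ratio far from 1 suggests that the independence model describes the term poorly.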

Extended Log Likelihood Ratio
The expected probabilities can also be calculated under two other hypotheses (models):
- Non-branching: P(small) P(bowel) P(obstruction)
- Left-branching: P(small bowel) P(obstruction)
- Right-branching: P(small) P(bowel obstruction)

Three Log Likelihood Ratio Equations
- Non-branching: $\dfrac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel})\,P(\text{obstruction})}$
- Left-branching: $\dfrac{P(\text{small bowel obstruction})}{P(\text{small bowel})\,P(\text{obstruction})}$
- Right-branching: $\dfrac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel obstruction})}$

Expected Probability
The expected probability of a term differs from model to model, and so does the Log Likelihood Ratio:
- Non-branching: P(small) P(bowel) P(obstruction), LL = 11,
- Left-branching: P(small bowel) P(obstruction), LL = 5,169.81
- Right-branching: P(small) P(bowel obstruction), LL = 8,532.90

Model Fitting
The model with the lowest Log Likelihood Ratio best describes the underlying structure of the term, because a lower value means the observed counts deviate less from what that model expects:
- Non-branching: P(small) P(bowel) P(obstruction), LL = 11,
- Left-branching: P(small bowel) P(obstruction), LL = 5,169.81
- Right-branching: P(small) P(bowel obstruction), LL = 8,532.90

Recap
The Log Likelihood Ratio is calculated for each possible model:
- Non-branching
- Right-branching
- Left-branching
The probabilities for each model are obtained from a corpus.
The term is assigned the structure whose model has the lowest Log Likelihood Ratio.
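The selection step in the recap can be sketched in a few lines of Python. The function name and the scores below are hypothetical placeholders, not values from the presentation.

```python
def assign_structure(ll_by_model):
    """Return the name of the model with the lowest Log Likelihood Ratio."""
    return min(ll_by_model, key=ll_by_model.get)


# Hypothetical scores for one three-word term.
scores = {
    "non-branching": 900.0,
    "right-branching": 700.0,
    "left-branching": 400.0,
}
best = assign_structure(scores)  # the lowest score wins
```

Ties are broken by dictionary order here; a real system would need an explicit policy for them.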

Test Set
Contains 708 three-word terms from SNOMED-CT:
- Monolithic: 73 terms
- Non-branching: 6 terms
- Right-branching: 378 terms
- Left-branching: 251 terms

Test Set (cont.)
The syntactic structure of each term was determined through the consensus of two medical text index experts (kappa = 0.704).
The probabilities were obtained from over 10,000 Mayo Clinic clinical notes.

Monolithic Results
[Table: Technique | Percentage agreement with human experts]

Results without Monolithic Terms
[Table: Technique | Percentage agreement with human experts]

Limitations
Monolithic structures:
- possibly identify them through collocation extraction or dictionary lookup
As the number of words in a term grows, so does the number of hypotheses (models) to be evaluated:
- only consider adjacent models
- limit the length of the terms to 5 or 6 words
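To illustrate how quickly the number of hypotheses grows, the sketch below enumerates every way of splitting a term into contiguous groups of adjacent words. This is only one possible reading of "adjacent models", and the function name is hypothetical; nested bracketings of longer terms are not covered.

```python
from itertools import combinations


def segmentations(words):
    """List every split of `words` into contiguous, non-empty groups.

    A term of n words has n - 1 gaps, and each gap is either a cut
    or not, so there are 2 ** (n - 1) segmentations in total.
    """
    n = len(words)
    result = []
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append(
                [tuple(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return result


three = segmentations(["small", "bowel", "obstruction"])
five = segmentations(["w1", "w2", "w3", "w4", "w5"])
```

For three words this yields four segmentations: the unsplit term plus the non-branching, left-branching and right-branching groupings. Five words already yield sixteen.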

Conclusions
- Presented a simple but effective method to identify the structure of three-word terms
- The method uses the Log Likelihood Ratio
- It could be extended to identify the structure of four-, five- and six-word terms

Future Work
Improve the accuracy of the method:
- explore other measures of association (Chi-squared, Phi, Dice coefficient, ...)
- incorporate multiple measures together
Extend the method to four- and five-word terms:
- difficulty: finding a test set
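For reference, two of the measures named above can be sketched for a two-word (2x2) contingency table. The slides do not say how these measures would be adapted to the three-word models, so this is only the standard bigram form, with hypothetical function and parameter names.

```python
import math


def dice(n11, n1p, np1):
    """Dice coefficient: n11 = joint count, n1p and np1 = marginal counts."""
    return 2 * n11 / (n1p + np1)


def phi(n11, n12, n21, n22):
    """Phi coefficient of a 2x2 contingency table."""
    n1p, n2p = n11 + n12, n21 + n22  # row totals
    np1, np2 = n11 + n21, n12 + n22  # column totals
    return (n11 * n22 - n12 * n21) / math.sqrt(n1p * n2p * np1 * np2)


dice_example = dice(n11=10, n1p=20, np1=20)
phi_example = phi(n11=10, n12=0, n21=0, n22=10)
```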

Thank you
Software: Ngram Statistics Package (NSP)

Log Likelihood Ratio Models

Log Likelihood Equation
$LL = 2 \sum_{x,y,z} n_{xyz} \log\left(\dfrac{n_{xyz}}{m_{xyz}}\right)$

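Expected Values
$LL = 2 \sum_{x,y,z} n_{xyz} \log\left(\dfrac{n_{xyz}}{m_{xyz}}\right)$

Here $n_{xyz}$ is an observed count, $m_{xyz}$ is the count expected under the model, and a $+$ subscript denotes a sum over that position.
- Non-branching: $m_{xyz} = \dfrac{n_{x++}\, n_{+y+}\, n_{++z}}{n_{+++}^{2}}$
- Left-branching: $m_{xyz} = \dfrac{n_{xy+}\, n_{++z}}{n_{+++}}$
- Right-branching: $m_{xyz} = \dfrac{n_{x++}\, n_{+yz}}{n_{+++}}$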
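The expected values and the Log Likelihood equation can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the NSP implementation: the function names, the 2x2x2 table layout (one axis per word) and the example counts are all hypothetical. The non-branching expectation divides by the squared total so that the expected counts sum to the total.

```python
import numpy as np


def expected_counts(n, model):
    """Expected counts m_xyz for a 3-way table n under the given model."""
    total = n.sum()
    x = n.sum(axis=(1, 2))  # n_{x++}
    y = n.sum(axis=(0, 2))  # n_{+y+}
    z = n.sum(axis=(0, 1))  # n_{++z}
    if model == "non-branching":
        return np.einsum("i,j,k->ijk", x, y, z) / total**2
    if model == "left-branching":
        xy = n.sum(axis=2)  # n_{xy+}
        return np.einsum("ij,k->ijk", xy, z) / total
    if model == "right-branching":
        yz = n.sum(axis=0)  # n_{+yz}
        return np.einsum("i,jk->ijk", x, yz) / total
    raise ValueError(f"unknown model: {model}")


def log_likelihood(n, model):
    """2 * sum of n_xyz * log(n_xyz / m_xyz), skipping empty cells."""
    n = np.asarray(n, dtype=float)
    m = expected_counts(n, model)
    mask = n > 0  # a cell with n_xyz = 0 contributes 0 to the sum
    return 2.0 * np.sum(n[mask] * np.log(n[mask] / m[mask]))


# Hypothetical table: words 1 and 2 are strongly associated with each
# other, and word 3 is independent of that pair.
pair_counts = np.array([[40.0, 10.0], [10.0, 940.0]])
third_word = np.array([0.1, 0.9])
table = np.einsum("ij,k->ijk", pair_counts, third_word)

models = ("non-branching", "left-branching", "right-branching")
scores = {name: log_likelihood(table, name) for name in models}
best = min(scores, key=scores.get)
```

Because the example table was built so that the first two words form a unit, the left-branching model fits it exactly and receives the lowest score.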