Comparing Frequency of Content- Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001 James.


1 Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001 James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries CECS, HMI, Statistics, and SISLT

2 Abstract Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full text, many still rely on the abstract for retrieval … … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.

3 Background Many retrieval systems still use abstracts as surrogates for full text. Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law. –The product of a word’s occurrence frequency and the rank of that frequency is approximately constant. –The most frequently and least frequently occurring words contribute little to article content.
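The rank-frequency relationship described on this slide can be checked on any word list. The following is a minimal sketch, not the authors' indexing code: the function name `rank_frequency_products` and the toy corpus are assumptions made for illustration, and the corpus is constructed so that frequency is exactly proportional to 1/rank.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Hypothetical toy corpus: frequencies 12, 6, 4, 3 at ranks 1, 2, 3, 4.
toy_words = ["the"] * 12 + ["of"] * 6 + ["patient"] * 4 + ["trial"] * 3

for row in rank_frequency_products(toy_words):
    print(row)
```

On real text the products are only roughly constant, which is why the relationship is usually stated as an approximation.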

4 Background (cont.) Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles. However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.

5 Methods 4 medical journals (BMJ, JAMA, Lancet, and NEJM) –Two different countries –Many medical subdisciplines –Regarded as top journals –Available in electronic format. Studied all articles published during 1999 that contained an abstract and were 2 pages or longer. –1,138 articles – 35 parsing problems = 1,103 articles

6 Methods (cont.) The text of articles and abstracts was downloaded and stored as HTML. The HTML was parsed into separate abstract and article files via a custom C++ parsing program. References and figures were removed.

7 Methods (cont.) “Content-bearing words” extracted from abstracts and articles –Numerical values, special characters, and captions excluded and used as word delimiters Removed words contained in a home-grown “stop word list” (words with little or no medical meaning)
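The extraction step on this slide can be sketched as follows. This is an illustration only: the stop word set is a hypothetical stand-in for the study's home-grown list, the function name `content_words` is an assumption, and caption removal is not modeled.

```python
import re

# Hypothetical stop words for illustration; the study's own list is not reproduced here.
STOP_WORDS = {"the", "of", "in", "and", "was", "were", "with", "a"}

def content_words(text, stop_words=STOP_WORDS):
    """Split on any run of non-letters, lowercase the tokens, and drop stop words.

    Digits and special characters act as word delimiters and are discarded.
    """
    tokens = re.split(r"[^A-Za-z]+", text)
    return [t.lower() for t in tokens if t and t.lower() not in stop_words]

print(content_words("Mortality was 12.5% in the treated group."))
```

Treating every non-letter as a delimiter is the simplest reading of the slide; a real pipeline might need to preserve hyphenated or alphanumeric medical terms.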

8 Methods (cont.) Remaining words conflated using NLM’s LVG tools. –E.g., “reading” -> “read” Frequencies of all conflated words were calculated for abstracts and articles.
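Conflation followed by frequency counting might look like the sketch below. The lookup table `TOY_BASE_FORMS` is a hypothetical stand-in and does not call or imitate the LVG tools; only the “reading” -> “read” mapping comes from the slide.

```python
from collections import Counter

# Toy lookup standing in for a real normalizer; this is NOT the LVG tools or their API.
TOY_BASE_FORMS = {
    "reading": "read",  # example from the slide
    "reads": "read",    # assumed for illustration
    "contraceptives": "contraceptive",  # assumed for illustration
}

def conflate(word, base_forms=TOY_BASE_FORMS):
    """Map a word to its base form, leaving unknown words unchanged."""
    return base_forms.get(word, word)

def conflated_frequencies(words, base_forms=TOY_BASE_FORMS):
    """Count occurrences of each conflated word."""
    return Counter(conflate(w, base_forms) for w in words)

print(conflated_frequencies(["reading", "read", "reads", "contraceptives"]))
```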

9 Analysis Used chi-squared test to determine whether discrepancies between observed occurrences in abstract and occurrences in articles were due to sampling or were truly indicative of a difference in content.

10 Analysis (cont.) Example: Rosing (Lancet) –Abstract contained 140 content bearing words –“contraceptive” appeared 6 times in the abstract and 35 times in the text of the article. –Since text contained 1081 content bearing words, expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.

11 Analysis (cont.) Example: Rosing (Lancet) –The actual number of occurrences was 6, so the squared error divided by the expected count, ((6-3.35)^2)/3.35 = 2.10, was added to the chi-squared statistic as the contribution of this word. –Every other content-bearing word in the article was compared to the abstract in this way, and the sum of these per-word contributions was the total chi-squared statistic for the given article.
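The per-word calculation in this example can be sketched as below. The function names are assumptions, and the block takes the expected count of 3.35 from the slides as given rather than re-deriving it: evaluating 140/1081 × 35 directly gives roughly 4.53, so one of the printed figures may have been altered in transcription. How the authors handled words that appear only in the abstract is not stated; this sketch simply skips them.

```python
def expected_in_abstract(abstract_total, text_total, text_count):
    """Expected abstract occurrences if the abstract sampled the text proportionally."""
    return abstract_total / text_total * text_count

def chi_square_term(observed, expected):
    """Contribution of one word: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected

def chi_square_statistic(abstract_counts, text_counts):
    """Sum the per-word contributions over every word that occurs in the text.

    Words found only in the abstract are skipped (an assumption of this sketch).
    """
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    total = 0.0
    for word, text_count in text_counts.items():
        expected = expected_in_abstract(abstract_total, text_total, text_count)
        observed = abstract_counts.get(word, 0)
        total += chi_square_term(observed, expected)
    return total

# Contribution of "contraceptive" using the slide's observed and expected counts.
print(round(chi_square_term(6, 3.35), 2))
```

With the slide's values of 6 observed and 3.35 expected, the contribution rounds to 2.10, matching the figure in the example.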

12 Analysis (cont.) We reran our analysis using the Bonferroni Inequality measure to assure that we would not have incorrect results simply by virtue of our large sample size.
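A Bonferroni adjustment is commonly applied by dividing the significance level by the number of tests. The sketch below shows that standard form; the slides do not state the significance level or the number of tests, so `alpha=0.05` and the use of one test per article (1,103 tests) are assumptions for illustration.

```python
def bonferroni_alpha(alpha, num_tests):
    """Per-test significance threshold under the Bonferroni inequality."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values remain significant at the adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# Assumed setting: alpha = 0.05 and one chi-squared test per article.
print(bonferroni_alpha(0.05, 1103))
```

The adjusted threshold is much stricter than the unadjusted one, so fewer abstract/article pairs would be flagged as disagreeing.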

13 Cumulative Results w/o Bonferroni

14

15 Cumulative Results w/ Bonferroni

16

17 Future Work Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings). Explore “over agreement”.

18 Future Work (cont.) Compare phrases (terms) rather than words. Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases. –Changes in agreement/disagreement may indicate the use of synonyms, which might still negatively affect retrieval.

19 Conclusion In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent. Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.

20 Acknowledgements This research was supported in part by grant T15-089 LM0708-09 from the National Library of Medicine, United States of America.

21 Questions http://riesj.hmi.missouri.edu JimR@acm.org

