Lindsay & Gordon’s Discovery Support Systems Model


Lindsay & Gordon’s Discovery Support Systems Model
John MacMullen
SILS Bioinformatics Journal Club, Fall 2002

Background
- Specialization in science leads to fragmented, ‘complementary but disjoint’ or ‘non-interactive’ literatures
- Attempt to find ‘undiscovered public knowledge’ in the biomedical literature
- Tools are needed to integrate biomedical knowledge; ‘discovery support systems’ are one class
- Replication and extension of Swanson & Smalheiser’s model (‘Arrowsmith’)

Complementary but Disjoint Literatures
[Diagram: the Fish Oil and Raynaud’s (C) literatures are ‘non-interactive’; they are linked through intermediate B terms: B1, Blood Viscosity; B2, Platelet Aggregation; B3, Vascular Reactivity]
Adapted from Swanson & Smalheiser, 1997 by Weeber, et al., 2001

Swanson & Smalheiser’s method
Systematic trial-and-error method
Procedure 1: Citation acquisition
- Search MEDLINE for topical cites (‘C’ list)
- Apply stopword list and extract unique terms (‘B’ list)
- Search MEDLINE for ‘B’ term cites; prune list
- Perform MEDLINE searches for each ‘B’ term
- Classify results into likely categories
- Derive the intersection of each ‘B’ set with the restriction set, and the union of intersection sets (‘U’)
- Search the resulting terms of ‘U’ set in MEDLINE
- ‘U’ list becomes potential ‘A’ terms, with each ‘A’ term attached to the ‘B’ term that generated it
- Rank ‘A’ term results against ‘B’ co-occurrence
Procedure 2: Relationship Mining
- Search for pre-existing A→C &/or A→B→C relationships
- Search for novel A→C relationships
Output: Display of ‘A’ & ‘C’ cites by their common ‘B’ terms
Goal: a plausible testable hypothesis
Human relevance judgments in each step influence future steps
http://arrowsmith.psych.uic.edu
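
The set operations in Procedure 1 (intersecting each ‘B’ set with the restriction set, taking the union ‘U’, and keeping each ‘A’ term attached to the ‘B’ terms that generated it) can be sketched as below. This is a minimal illustration, not Arrowsmith’s actual code: the function name, the input layout, and the choice to rank ‘A’ terms by how many ‘B’ terms they co-occur with are all assumptions, and the sample terms are toy placeholders rather than results.

```python
from collections import defaultdict


def candidate_a_terms(b_term_sets, restriction_set):
    """Return candidate 'A' terms ranked by the number of 'B' terms they share.

    b_term_sets: dict mapping each 'B' term to the set of terms found in the
                 search results for that 'B' term (hypothetical input layout).
    restriction_set: set of terms allowed to become 'A' candidates.
    """
    a_to_b = defaultdict(set)
    for b_term, terms in b_term_sets.items():
        # Intersection of this 'B' set with the restriction set.
        for a_term in terms & restriction_set:
            # Keep each 'A' term attached to the 'B' term that generated it.
            a_to_b[a_term].add(b_term)
    # The keys of a_to_b form the union 'U' of all intersection sets.
    # Assumed ranking: more linking 'B' terms first, ties broken alphabetically.
    return sorted(a_to_b.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy placeholder data, not real search output.
b_sets = {
    "blood viscosity": {"fish oil", "term x"},
    "platelet aggregation": {"fish oil", "term y"},
    "vascular reactivity": {"fish oil", "term x"},
}
ranked = candidate_a_terms(b_sets, restriction_set={"fish oil", "term x"})
for a_term, b_terms in ranked:
    print(a_term, sorted(b_terms))
```

In the real procedure, human relevance judgments prune the lists between steps, which this sketch leaves out.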

Lindsay & Gordon’s method
- Limited to lexical statistics, no syntactic or semantic evaluation
- Uses full MEDLINE record instead of title only
Steps:
1. Identify a source literature (e.g., a topic)
   Download complete MEDLINE records for all docs “mentioning” (i.e., word match or index term) the source topic
2. Find all single words, bi-grams & tri-grams in source corpus; exclude stop words; normalize singular/plural
   Stop words: general stop word list + common medical terminology
3. Calculate 4 statistics for each token (tf, df, rf, tf*idf)
4. Rank tokens by frequency of occurrence
5. Identify 1 or more intermediate literatures based on stats in step (3), starting with highest ranks from (4)
6. Run process from steps 1-4 for each intermediate literature
Lindsay & Gordon, 1999, pp. 576-577
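
The token extraction and frequency ranking described above might look roughly like the following. This is only a sketch under stated assumptions: the stop word set is a tiny hypothetical stand-in for the general + medical list, the plural rule is a crude suffix strip rather than whatever normalization Lindsay & Gordon used, and dropping any n-gram that contains a stop word is one possible reading of “exclude stop words”.

```python
import re
from collections import Counter

# Hypothetical stand-in for the general + common-medical-terminology stop list.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "with"}


def normalize(word):
    """Crude singular/plural normalization (an assumption, not the paper's rule)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def extract_tokens(text, max_n=3):
    """Return single words, bi-grams and tri-grams that contain no stop word."""
    words = [normalize(w) for w in re.findall(r"[a-z0-9]+", text.lower())]
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if any(w in STOP_WORDS for w in gram):
                continue  # skip n-grams containing a stop word
            tokens.append(" ".join(gram))
    return tokens


def rank_by_frequency(records):
    """Rank tokens by total frequency of occurrence across all records."""
    counts = Counter(tok for rec in records for tok in extract_tokens(rec))
    return counts.most_common()


records = ["Migraines and magnesium levels", "Magnesium level in migraine"]
for token, count in rank_by_frequency(records):
    print(count, token)
```

The top-ranked tokens would then be reviewed by a person to pick intermediate literatures; that manual step is not modeled here.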

Term / Phrase Statistics
- tf = token frequency in sub-corpus
- df = doc frequency (# of docs in sub-corpus w/ this term)
- rf = relative frequency (# of appearances in sub-corpus vs. whole corpus)
- idf(t) = log(N / f(t)), where N = # of docs in whole corpus and f(t) = # of docs with t in them
- tf*idf = token frequency * inverse doc frequency
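
A small sketch of how the four statistics could be computed for tokenized documents. Several details are assumptions that the slide does not settle: rf is taken as the ratio of sub-corpus occurrences to whole-corpus occurrences, f(t) is counted over the whole corpus, the logarithm is natural, and the sub-corpus is assumed to be a subset of the whole corpus (so f(t) is never zero). The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def token_stats(sub_corpus, whole_corpus):
    """Compute tf, df, rf, idf and tf*idf for every token in the sub-corpus.

    Both arguments are lists of documents; each document is a list of tokens.
    Assumes sub_corpus is drawn from whole_corpus.
    """
    tf = Counter(tok for doc in sub_corpus for tok in doc)
    df = Counter(tok for doc in sub_corpus for tok in set(doc))
    whole_tf = Counter(tok for doc in whole_corpus for tok in doc)
    whole_df = Counter(tok for doc in whole_corpus for tok in set(doc))
    n_docs = len(whole_corpus)  # N in the idf formula

    stats = {}
    for tok in tf:
        idf = math.log(n_docs / whole_df[tok])  # idf(t) = log(N / f(t))
        stats[tok] = {
            "tf": tf[tok],
            "df": df[tok],
            "rf": tf[tok] / whole_tf[tok],  # assumed reading of "vs."
            "idf": idf,
            "tfidf": tf[tok] * idf,
        }
    return stats


# Toy documents for illustration only.
whole = [
    ["migraine", "magnesium"],
    ["migraine", "stress"],
    ["calcium", "magnesium"],
    ["stress"],
]
sub = whole[:2]  # the documents that mention "migraine"
stats = token_stats(sub, whole)
print(stats["magnesium"])
```

In this toy data "migraine" gets rf = 1.0 because every occurrence falls inside the sub-corpus, while "magnesium" gets rf = 0.5.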

Experiments
- Reproduction of Swanson & Smalheiser’s magnesium / migraine connection
- Used their method to try to find the 11 intermediate literatures S&S found (+1 new)
- 1,081 MEDLINE records from 1986-1988 on migraine

Outcome
- Is ranking tokens by frequency effective?
- Arbitrary ranking and population size cutoffs
- Domain knowledge used (med student)
- Lots of manual intervention. Is this reproducible?
- Is it really all that different from S&S?

Hypotheses
- H1: Intermediate literatures are best identified by absolute lexical frequencies
- H2: Candidate discoveries are best generated by relative lexical frequencies
- Results showed the opposite of H2 (pp. 582-583)
- Search for the absence of connections between source and target literatures (Swanson’s disjointness)
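
One simple way to check for the absence of connections between a source and a target literature is to compare the record identifiers returned for each topic. The sketch below assumes each literature is available as a collection of record IDs; the function name and the sample IDs are hypothetical, and a real check would also need a person to judge whether any shared records already state the connection.

```python
def overlap_report(source_ids, target_ids):
    """Report whether two literatures share any records.

    source_ids, target_ids: iterables of record identifiers (hypothetical input).
    """
    shared = set(source_ids) & set(target_ids)
    return {"disjoint": not shared, "shared_ids": sorted(shared)}


# Made-up identifiers for illustration only.
print(overlap_report([101, 102, 103], [201, 202]))
print(overlap_report([101, 102, 103], [103, 201]))
```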

Questions
- Is this process more automatable?
- Is 1 intermediate step enough?
- What happens to system complexity when there are i intermediate literatures?
[Diagram: S → I1 → ... → Ii → T, a source literature S linked to a target literature T through intermediate literatures I1 to Ii]