UCB BioText, TREC 2003 Genomics Track
Participants: Marti Hearst, Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz
University of California, Berkeley
Genomics: Tasks 1 and 2

Overview
- The UCB BioText group took part in Task 1 and Task 2
  - Task 1: Information Retrieval + Information Extraction (+ Text Classification)
  - Task 2: Text Classification + Information Extraction
- Commonalities for both tasks
  - Named entity recognition in the text
    - Genes and synonyms
    - MeSH concepts
  - Text classification algorithms

MeSH Hierarchy
- Unique identifier: e.g. Abdomen has D000005
- UMLS semantic tags: e.g. Enzyme, Gene or Genome, Mammal, Tissue, Virus, etc.
- Alphanumeric descriptor codes
  - Top-level categories: [A] Anatomy, [B], [C], [D], [E], [F], [G], [H] Physical Sciences, [I], [J], [K]
  - Under [A] Anatomy: Body Regions [A01], Musculoskeletal System [A02], Digestive System [A03], Respiratory System [A04], Urogenital System [A05], …
  - Under Body Regions [A01]: Abdomen [A01.047], Back [A01.176], Breast [A01.236], Extremities [A01.378], Head [A01.456], Neck [A01.598], …
  - Under [H] Physical Sciences: Electronics, Astronomy, Nature, Time
  - Under Electronics: Amplifiers; Electronics, Medical; Transducers
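The dotted descriptor codes encode the hierarchy directly: each prefix of a code names an ancestor node. The sketch below is a minimal illustration of that reading, not the group's code; the function names `cut_tree_number` and `ancestors` are hypothetical, and "level" follows the convention used later in the slides (level 1: G14, level 2: G14.330).

```python
def cut_tree_number(tree_number: str, level: int) -> str:
    """Truncate a dotted MeSH descriptor code to its first `level` components."""
    parts = tree_number.split(".")
    return ".".join(parts[:level])


def ancestors(tree_number: str) -> list:
    """Return all proper ancestor codes, from the most general to the parent."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


if __name__ == "__main__":
    print(cut_tree_number("A01.047", 1))  # Body Regions is the parent of Abdomen
    print(ancestors("A01.047"))
```

Cutting codes at a fixed level trades specificity for feature frequency: coarser codes are shared by more documents.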

Task 1

TREC Task 1: Overview
- Search 525,938 MEDLINE records
  - Titles, abstracts, MeSH category terms, citation information
- Topics: taken from the GeneRIF portion of the LocusLink database
  - We are supplied with gene names
- Definition of a GeneRIF: "For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states."

TREC Task 1: Sample Query
  3  2120  Homo sapiens  OFFICIAL_GENE_NAME  ets variant gene 6 (TEL oncogene)
  3  2120  Homo sapiens  OFFICIAL_SYMBOL     ETV6
  3  2120  Homo sapiens  ALIAS_SYMBOL        TEL
  3  2120  Homo sapiens  PREFERRED_PRODUCT   ets variant gene 6
  3  2120  Homo sapiens  PRODUCT             ets variant gene 6
  3  2120  Homo sapiens  ALIAS_PROT          TEL1 oncogene
- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
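A minimal sketch of reading the five-column topic format described above. It assumes the columns are tab-separated (the transcript does not preserve the delimiter); `TopicLine`, `parse_topic_line` and `names_by_type` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TopicLine(NamedTuple):
    topic: int          # official topic number (1-50)
    locuslink_id: int   # LocusLink ID for the gene
    organism: str       # organism name, e.g. "Homo sapiens"
    name_type: str      # e.g. OFFICIAL_SYMBOL, ALIAS_SYMBOL
    name: str           # the gene or product name itself


def parse_topic_line(line: str) -> TopicLine:
    """Split one topic line on tabs (assumed delimiter) into typed fields."""
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split("\t")
    return TopicLine(int(topic), int(locus_id), organism, name_type, name)


def names_by_type(lines) -> dict:
    """Group the names of a topic by their name type."""
    grouped = {}
    for line in lines:
        rec = parse_topic_line(line)
        grouped.setdefault(rec.name_type, []).append(rec.name)
    return grouped


if __name__ == "__main__":
    sample = [
        "3\t2120\tHomo sapiens\tOFFICIAL_SYMBOL\tETV6",
        "3\t2120\tHomo sapiens\tALIAS_SYMBOL\tTEL",
    ]
    print(names_by_type(sample))
```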

General Architecture
[Figure: system architecture diagram; surviving labels: classifier "has GeneRIF", weight 0.01]

Main Challenges
- Task 1: given a gene and an organism, find documents likely to have a GeneRIF
- Relevance judgment: GeneRIF references from LocusLink
- Main challenges
  - Ranking
  - Recall
    - Find more gene synonym variations
  - Precision
    - Filter out abstracts with genes from incorrect organisms
    - Lower the rank of documents not likely to have a GeneRIF

Gene Synonym List Creation

How to Find Gene Name Synonyms?
- Strategy: compile a list of gene names from the text
  - Start with a list of gene names from LocusLink and MeSH
  - Use an n-gram-based approximate match algorithm to find alternative representations of these genes in MEDLINE abstracts
  - Look for commonalities and regularities
- Create a set of name transformation rules
  - Some are better than others
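One way to picture an n-gram-based approximate match is a Dice coefficient over character n-grams. The sketch below assumes character bigrams, case folding, and an inclusive 0.5 to 1.0 acceptance range; the slides do not state the n-gram size or the exact matching procedure, and the candidate string "TEL-1" is a made-up example.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Set of character n-grams of the lower-cased string (bigrams by default)."""
    normalized = text.lower()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|); 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def candidate_variants(known_name, candidates, low=0.5, high=1.0):
    """Keep candidates whose n-gram Dice score against known_name is in [low, high]."""
    known = char_ngrams(known_name)
    scored = [(c, dice(known, char_ngrams(c))) for c in candidates]
    return [(c, s) for c, s in scored if low <= s <= high]


if __name__ == "__main__":
    # "TEL-1" is a hypothetical spelling variant; "ETV6" shares no bigram with "TEL1".
    print(candidate_variants("TEL1", ["TEL-1", "ETV6"]))
```

Pairs that pass the threshold are only candidates; per the slides, the transformation rules themselves were then derived by inspection.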

Gene Expansion: Sample Expansion Pairs
- Matches whose Dice coefficient falls between 0.5 and 1.0

Gene Expansion: High Confidence Rules
- Matches whose Dice coefficient falls between 0.5 and 1.0
- Rules determined by inspection

Organism Filtering

Organism Filtering: Strategy
- Problem: the query describes the organism using LocusLink terminology, which differs from MEDLINE's
- Strategy: semi-automatically determine the translation
  - For a given LocusLink organism name, search for that term in the MEDLINE titles, abstracts, and MeSH terms
  - Display the most frequent MeSH terms among the results
  - The correct translation appeared as one of the top 3
- Could be a useful strategy for other translation problems
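The counting step of this strategy can be sketched as follows. This is an illustration under assumptions: records are modelled as plain dictionaries with hypothetical keys `title`, `abstract` and `mesh_terms`, and matching is a case-insensitive substring test rather than whatever search engine the group actually used.

```python
from collections import Counter


def top_mesh_terms(records, organism_name, k=3):
    """Return the k most frequent MeSH terms among records mentioning organism_name.

    Each record is assumed to be a dict with 'title', 'abstract' (strings)
    and 'mesh_terms' (list of strings).
    """
    needle = organism_name.lower()
    counts = Counter()
    for rec in records:
        searchable = " ".join([rec["title"], rec["abstract"], *rec["mesh_terms"]]).lower()
        if needle in searchable:
            counts.update(rec["mesh_terms"])
    return counts.most_common(k)
```

A person then picks the right translation from the short list, which is what makes the procedure semi-automatic.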

Organism Filtering: Results
- Sample Top-Ranked MeSH Terms

GeneRIF Classification

GeneRIF Classification: Training
- Used for our second run
- Motivation: only MEDLINE documents that have been assigned GeneRIFs are considered relevant
- Strategy to improve precision: identify documents likely to have a GeneRIF assigned
  - Naïve Bayes classifier (WEKA ML tools)
- Training
  - 50 gene names, not in the TREC training/testing set
  - Train on the 1000 top-ranked documents for each gene

GeneRIF Classification: Results

Document Ranking

Document Ranking
- DB2 Net Search Extender
- Score = weighted sum:
    1.0   * (H compared to phrases in titles)
  + 1.0   * (H compared to phrases in abstracts)
  + 0.015 * (L compared to phrases in titles)
  + 0.015 * (L compared to phrases in abstracts)
  + 1.4   * (query MeSH compared to document MeSH)
- H: high-confidence gene rules; L: low-confidence gene rules
- Weights determined experimentally
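The weighted sum can be written out as a small function. Only the five weights come from the slide; the component names and the idea that each component arrives as a precomputed per-document match score are assumptions made for this sketch.

```python
# Weights from the slide; the dictionary keys are hypothetical labels.
WEIGHTS = {
    "high_title": 1.0,      # H compared to phrases in titles
    "high_abstract": 1.0,   # H compared to phrases in abstracts
    "low_title": 0.015,     # L compared to phrases in titles
    "low_abstract": 0.015,  # L compared to phrases in abstracts
    "mesh": 1.4,            # query MeSH compared to document MeSH
}


def combined_score(component_scores: dict) -> float:
    """Weighted sum of the component match scores; missing components count as 0."""
    return sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)


if __name__ == "__main__":
    print(combined_score({"high_title": 1.0, "mesh": 1.0}))
```

The very small weight on low-confidence matches lets them break ties without overriding a high-confidence or MeSH match.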

Document Retrieval and Ranking

Task 1: TREC Evaluation
- MAP on TREC training data
  - using GeneRIF classifier: 0.5101
  - without GeneRIF classifier: 0.5028
- MAP on TREC testing data
  - using GeneRIF classifier: 0.3912
  - without GeneRIF classifier: 0.3753
- Analysis
  - Using the classifier performs better on 27 out of 50 queries (tied on 12).
  - Tuning the parameters on the test set (tried afterwards) results in only minor improvement.
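For reference, MAP (mean average precision) is the mean over queries of the average precision of each ranked list. The sketch below uses the standard definition, with relevant documents that are never retrieved contributing zero; it is not the official TREC evaluation tool.

```python
def average_precision(ranked_ids, relevant_ids) -> float:
    """Average of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by the number of relevant docs penalizes relevant docs never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(runs) -> float:
    """runs: iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```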

Task 2

TREC Task 2
- Problem definition: given GeneRIFs formatted as:
    1 355 12107169 J Biol Chem 2002 Sep 13;277(37):34343-8. the death effector domain of FADD is involved in interaction with Fas.
    2 355 12177303 Nucleic Acids Res 2002 Aug 15;30(16):3609-14. In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w …
  reproduce the GeneRIF from the MEDLINE record.

Preliminary Study
- Find the GeneRIF text in the abstract
  - 33,662 MEDLINE abstracts with GeneRIFs
- Best match of the GeneRIF text in the abstract
  - Modified unigram Dice coefficient
  - Accepted if scored above 80%
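A rough sketch of a unigram Dice score between a candidate sentence and a GeneRIF. The slides do not spell out what "modified" means, so this version makes an assumption: overlap is counted over token multisets (repeated words count), with simple lower-casing and no stopword removal or stemming. Treat it as an approximation of the measure, not its definition.

```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lower-cased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_dice(candidate: str, reference: str) -> float:
    """Dice overlap of token multisets: 2 * |shared tokens| / (|cand| + |ref|)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    total = sum(cand.values()) + sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # multiset intersection
    return 2 * overlap / total


def best_match(sentences, generif: str) -> str:
    """Return the sentence with the highest unigram Dice score against the GeneRIF."""
    return max(sentences, key=lambda s: unigram_dice(s, generif))
```

With a score like this, the 80% acceptance rule on the slide becomes a check such as `unigram_dice(sentence, generif) > 0.8`.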

Baseline
- Baseline: pick the whole title verbatim
  - Motivation
    - the best match was a substring of the title: 46.30%
    - the whole title was the best match in 65.10%
  - Baseline modified unigram Dice score: 53.39%
- Choose: title vs. last sentence
  - Observation: the best match is the title OR the last sentence: 73.40%
  - If we choose a whole sentence (title vs. last sentence):
    - Upper bound (best choice each time): 66.33%
    - Lower bound (worst choice each time): 22.62%
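The upper and lower bounds correspond to an oracle that always picks the better (or worse) of the two sentences for each document. A minimal sketch, assuming each document has already been reduced to a pair of scores (title score, last-sentence score) against its GeneRIF; `choice_bounds` is a hypothetical name.

```python
def choice_bounds(score_pairs):
    """Return (upper, lower, title_only) mean scores over documents.

    score_pairs: list of (title_score, last_sentence_score) tuples, one per document.
    upper      : always choosing the better of the two sentences
    lower      : always choosing the worse of the two sentences
    title_only : always choosing the title (the baseline)
    """
    n = len(score_pairs)
    upper = sum(max(t, s) for t, s in score_pairs) / n
    lower = sum(min(t, s) for t, s in score_pairs) / n
    title_only = sum(t for t, _ in score_pairs) / n
    return upper, lower, title_only
```

Any classifier that chooses between the title and the last sentence must land between the lower and upper values, so the gap between the title-only baseline and the upper bound is the most a perfect chooser could gain.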

Features
We experimented with the following features:
- Nominal features
  - words/stems
  - verbs (most frequent, e.g. bind, block, accept, etc.; nominalized)
  - genes
  - gene_freq (number of gene names mentioned)
  - MeSH_unique_ID (e.g. D005796)
  - MeSH_codes (level 1: G14, or level 2: G14.330)
  - MeSH_semantic_type (e.g. cell, human, biological function)
  - journal
  - publication_date (month and year, e.g. 10_2003)
- Boolean features
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)

Best Features
- Standard feature set
  - verbs (most frequent, e.g. bind, block, accept, etc.; nominalized)
  - gene_freq (number of gene names mentioned)
  - MeSH_codes (cut at level 2, e.g. G14.330)
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)
- The last two were not used in the final tests.
- Weighted using TF.IDF (except the Boolean features)
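A small sketch of TF.IDF weighting over nominal features. The slides do not say which TF.IDF variant was used; this version assumes raw term frequency multiplied by log(N / document frequency), and treats each document as a plain list of feature strings.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each feature by count * log(N / df); docs is a list of feature lists."""
    n_docs = len(docs)
    doc_freq = Counter()
    for features in docs:
        doc_freq.update(set(features))  # count each feature once per document
    weighted = []
    for features in docs:
        tf = Counter(features)
        weighted.append(
            {f: count * math.log(n_docs / doc_freq[f]) for f, count in tf.items()}
        )
    return weighted


if __name__ == "__main__":
    docs = [["bind", "bind", "G14.330"], ["block", "G14.330"]]
    for weights in tf_idf(docs):
        print(weights)
```

Under this variant a feature present in every document gets weight 0, so only features that discriminate between documents contribute.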

Title vs. Last Sentence
- Text classification
  - Choose: title (class A) vs. last sentence (class B)
  - Naïve Bayes classifier (WEKA ML tools)
  - The standard features
- Training and testing
  - Each document represents one example
  - Features: extracted from the title and the last sentence only
    - Features from the title and from the last sentence are not distinguished
    - Distinguishing them lowers the accuracy
  - Training set: modified unigram Dice overlap with the GeneRIF
  - Stratified 10-fold cross-validation

Task 2: Evaluation
- Training
  - Document collections: 1000, 2000, 10000, 20000, 33662
    - finally, limited the set to the 5 target journals
  - Classification algorithm selection
    - tried: decision tree, boosting, kNN, logistic regression, etc.
  - Feature selection
    - tuning, for a fixed feature set
    - tuned the best minimum frequency thresholds for verbs and MeSH_codes: 12 and 5, respectively
- TREC run
  - Training: 5 journals, except the 139 abstracts from the TREC test
  - Feature frequency thresholds as found during training: 12 and 5
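Minimum frequency thresholds of this kind can be applied as a simple filter over the training examples. The sketch assumes frequency means total occurrences across the training set (the slides do not say whether occurrences or documents were counted), and the helper names are hypothetical.

```python
from collections import Counter


def frequent_features(examples, min_count: int) -> set:
    """Features that occur at least min_count times across all examples."""
    counts = Counter(f for example in examples for f in example)
    return {f for f, c in counts.items() if c >= min_count}


def filter_examples(examples, keep: set) -> list:
    """Drop features that did not pass the threshold, preserving order."""
    return [[f for f in example if f in keep] for example in examples]
```

With the values reported on the slide, verb features would be filtered with `min_count=12` and MeSH_codes features with `min_count=5`, each computed on the training data only.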

Task 2: Results

Discussion
- Test sets are small and much harder than training sets
- Task 1
  - Organism filter was very helpful
  - Noisy GeneRIF assignment limits the help given by the classifier
  - Initial runs supplied by other research groups were very helpful
- Task 2
  - Sentence truncation could improve the results
  - Need ranking, rather than classification, algorithms
  - Better feature selection needed
    - sensitivity to frequency thresholds
    - MeSH ambiguity
    - verb nominalization

Thank you!
