Beespace Component: Filtering and Normalization for Biology Literature Qiaozhu Mei 03.16.2005.


Beespace Component: Filtering and Normalization for Biology Literature Qiaozhu Mei

Concept Processing Component for Beespace: A Big Picture
Pre-processed text collection + query terms → Retrieval/Filtering Module → relevant documents and a list of representative terms or phrases
Entities & phrases of interest → Normalization and Clustering Module → similarity groups of terms and phrases (concepts)

Concept Processing Component for Beespace: Input and Output
Input: texts (indices) with entities and phrases tagged.
 Filtering: a group of relevant documents for a query
 Normalization: a list of terms, entities, or phrases of interest to be normalized
Output:
 Filtering: a list of highly representative terms & phrases
 Normalization: a hierarchical structure of concepts (compact, loose); a concept dictionary; texts tagged with concepts

Filtering

Term Filtering: Heuristics
 We want a list of representative terms & phrases short enough to enable interactive selection and navigation.
 We want terms with high frequency in the given documents (high term frequency), however…
 Terms that are too frequent in the whole collection are harmful: the, is, cell, bee, … (so we also want terms with low document frequency in the background collection)

Term Filtering: TF*IDF
Adding IDF to the frequency count:
 Weight = tf * log((N − 1) / df)
TF-IDF formula in the Okapi method: Weight = IDF part × TF part, where (in the standard Okapi BM25 form) IDF = log((N − df + 0.5) / (df + 0.5)) and TF part = ((k1 + 1) · tf) / (k1 · ((1 − b) + b · dl / avdl) + tf)
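The simple variant above can be sketched in a few lines. The counts below are hypothetical, chosen only to show the point of the heuristic: IDF damps near-ubiquitous terms like "bee" even when their raw frequency is high.

```python
import math

def tf_idf(tf, df, n_docs):
    """The simple slide variant: Weight = tf * log((N - 1) / df)."""
    return tf * math.log((n_docs - 1) / df)

# Hypothetical counts over a 5505-abstract collection:
# "forager" occurs 12 times in the retrieved documents, in 40 abstracts overall;
# "bee" occurs 30 times but appears in almost every abstract.
w_forager = tf_idf(12, 40, 5505)
w_bee = tf_idf(30, 5000, 5505)

# Despite its lower raw frequency, "forager" outweighs the stopword-like "bee".
```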

Term Filtering (cont.)
Results 1:
 Collection: honeybee.biosis 1980
 Query: “pollen-foraging”
 Select top 2 documents
Results 2:
 Collection: GENIA (on “human & blood cell & transcription factor”), with noun phrases of entities tagged
 Query: “il-2”

Normalization

From Term to Concept: Normalization and Theme Clustering
Normalization: tight concepts
 Group terms/entities/phrases by similarity so that one can represent the others
 Forage: forager, forage-bee, foraging, foragers, pollen-foraging, …
Theme clustering: looser concepts
 Group terms/entities/phrases representing the same subtopic (semantically related)
 forage, pollen, food, detect, feeding, dance, …
 Done in a hierarchical manner.

Normalization
A morphological approach? (stemming)
 Normalize English words across morphological variations, e.g. forag: forage/foraging/forager/foragers
 Concerns:
 Too aggressive? one → on; day → dai; apis → api; useful → us
 Handling biological entities? (some stemmers do nothing when they detect a “-”)
 Not sufficient to normalize phrases

Normalization: Stemmers
 Porter stemmer: does not stem words beginning with an uppercase letter
 Krovetz stemmer: less aggressive than Porter
 Sample results on the Honeybee and Genia collections
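The over-stemming failures listed earlier (one → on, etc.) come from aggressive suffix stripping. A deliberately crude stripper (not Porter or Krovetz, just an illustration of the mechanism) shows both the benefit, collapsing the forage family to one stem, and the harm:

```python
def crude_stem(word):
    """Naive suffix stripping; overly aggressive on short words, on purpose."""
    for suffix in ("ing", "ers", "er", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# The benefit: the whole forage family maps to a single stem.
variants = ["forage", "foraging", "forager", "foragers"]
stems = {crude_stem(w) for w in variants}

# The harm: "one" loses its final "e" and becomes "on".
bad = crude_stem("one")
```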

Normalization (cont.) Semantic and Contextual Approach:  Group the terms which are considered “Replaceable” with each other in a context. E.g. …the pollen-foraging activity of a mellifera… …the nectar-foraging activity of a cerana…  Generally handled with clustering approaches based on statistical information in a large corpus  Usually in the form of hierarchical clusters
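The replaceability idea can be sketched directly from the slide's two example sentences: two terms are candidate substitutes when they occur between the same left and right neighbors. This is a minimal sketch; a real system would aggregate such shared contexts over a large corpus rather than two sentences.

```python
from collections import defaultdict

sentences = [
    "the pollen-foraging activity of a mellifera",
    "the nectar-foraging activity of a cerana",
]

# Map each (left neighbor, right neighbor) context to the terms seen between them.
contexts = defaultdict(set)
for sentence in sentences:
    tokens = sentence.split()
    for i in range(1, len(tokens) - 1):
        contexts[(tokens[i - 1], tokens[i + 1])].add(tokens[i])

# Terms sharing a context are candidates for grouping ("replaceable").
replaceable = [terms for terms in contexts.values() if len(terms) > 1]
```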

Normalization: A Clustering Approach
An N-gram clustering method:
 Ideally, if we consider each term in its N-gram context, the replaceability relation would be global and reliable.
 Concerns: efficiency
 Computational complexity is high! For 2-grams, NV² even after optimization (initially V⁵)
 Space complexity is high!! V³
 Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and grouping the two terms whose merge brings the smallest loss to this avg. MI)
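The merge criterion can be sketched on toy bigram counts (the counts below are hypothetical): compute the average mutual information of the 2-gram distribution, merge two terms into one symbol, and measure the AMI loss. In the actual procedure, at each step the pair whose merge costs the least AMI is grouped.

```python
import math
from collections import Counter

# Hypothetical bigram counts; "forager" and "foragers" appear in similar contexts.
bigrams = Counter({
    ("the", "forager"): 5, ("the", "foragers"): 4,
    ("forager", "dances"): 3, ("foragers", "dance"): 3,
    ("the", "queen"): 6, ("queen", "lays"): 4,
})

def avg_mi(bg):
    """Average mutual information of the bigram distribution."""
    n = sum(bg.values())
    left, right = Counter(), Counter()
    for (a, b), c in bg.items():
        left[a] += c
        right[b] += c
    return sum(c / n * math.log((c * n) / (left[a] * right[b]))
               for (a, b), c in bg.items())

def merge(bg, x, y, new="MERGED"):
    """Replace terms x and y by a single merged symbol in the bigram table."""
    out = Counter()
    for (a, b), c in bg.items():
        a = new if a in (x, y) else a
        b = new if b in (x, y) else b
        out[(a, b)] += c
    return out

# The loss in average MI from merging two terms; the clustering picks the
# pair minimizing this quantity at every step.
loss = avg_mi(bigrams) - avg_mi(merge(bigrams, "forager", "foragers"))
```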

Normalization: A Clustering Approach (cont.)
Toy example on honeybee:
 Vocabulary size: 9,100 words
 Collection size: 5,505 abstracts (honeybee.biosis1980)
 Terms to be clustered: 18
Genia collection: 2,000 abstracts
 200 noun phrases (entities) to be clustered

nectar-foraging foraging-related pollen-foraging preforaging non-foraging foragers worker bee honeybee workers nurseries nursery nursing forage forager foraging queen queens

Sample clusters on Genia: human_and_mouse_gene mouse_il-2r_alpha_gene saos_2_cells saos-2 human_osteosarcoma_ epstein-barr_virus_ interleukin-2 interleukin-2_ epstein-barr_virus phorbol_myristate_acetate phorbol_12-myristate_13-acetate u937_cells monocytic_cells jurkat_cells human_t_cells ipr_cd4-8-_t_cells j_delta_k_cells lymphoid_cells activated_t_cells hematopoietic_cells transcription_factors transcription_factor b_cells jurkat_t_cells hela_cells thp-1 hl60_cells k562_cells thp-1_cells i_kappa_b_alpha nf_kappa_b 2_gene_expression 2_gene

Normalization: Clustering Methods
Other possible clustering approaches cluster terms based on features such as:
 Co-occurring terms (tends to ignore position information)
 Correlation of nouns and verbs
 Dependency-based word similarity
 Proximity-based word similarity
These depend on highly accurate parsing results, which may not be easy to obtain for biology literature.
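The co-occurrence-feature idea can be sketched as cosine similarity between co-occurrence vectors; all counts below are invented for illustration. Terms from the same domain share context words and score high, while unrelated terms score near zero.

```python
import math

# Hypothetical co-occurrence counts against three context words.
context_words = ["activity", "dance", "cell"]
cooc = {
    "forager":  [8, 5, 0],
    "foraging": [9, 4, 1],
    "il-2":     [0, 0, 7],
}

def cosine(u, v):
    """Cosine similarity between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_forage = cosine(cooc["forager"], cooc["foraging"])  # same domain
sim_cross = cosine(cooc["forager"], cooc["il-2"])       # unrelated domains
```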

Theme Clustering
Looser clusters
 Usually in the form of partitioning clusters
 K-Means, Latent Semantic Indexing, Probabilistic LSI
 Compute loose clusters of terms, or clusters represented by term distributions
 Example: #clusters = 10
Sometimes helpful for finding normalizations (e.g., when #clusters is large, or when no stemming was done)
 Comparative Text Mining for concept switching
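The partitioning step can be sketched with a tiny K-Means over toy 2-dimensional term vectors (values invented; think of them as coordinates after dimension reduction), separating honeybee-domain terms from GENIA-style terms:

```python
# Toy term vectors; rows of a (reduced) term-document matrix, invented for illustration.
points = {
    "forage": (0.9, 0.1), "pollen": (0.8, 0.2), "dance": (0.85, 0.15),
    "il-2": (0.1, 0.9), "t-cell": (0.2, 0.8), "nf-kb": (0.15, 0.85),
}

def kmeans2(pts, iters=10):
    """K-Means with k=2, seeded deterministically with the first and last term."""
    names = list(pts)
    centroids = [pts[names[0]], pts[names[-1]]]
    groups = {0: [], 1: []}
    for _ in range(iters):
        groups = {0: [], 1: []}
        # Assignment step: each term joins its nearest centroid.
        for name in names:
            dists = [sum((a - b) ** 2 for a, b in zip(pts[name], c))
                     for c in centroids]
            groups[dists.index(min(dists))].append(name)
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            if groups[k]:
                centroids[k] = tuple(
                    sum(pts[n][i] for n in groups[k]) / len(groups[k])
                    for i in (0, 1)
                )
    return groups

clusters = kmeans2(points)
```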

Future Plan
Customize the stemmers
 Try more morphological approaches, e.g. pollen-foraging, nectar-foraging
Examine more clustering methods:
 How to use theme clustering to help normalization
 Find a way to divide the hierarchical clustering structure into concepts

Thanks!