COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words.


COMP423: Intelligent Agent Text Representation

Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words

Bag of words
Vector Space model
– Documents are term vectors
– TF.IDF for term weights
– Cosine similarity for document similarity
Limitations:
– Word semantics
– Semantic distance between words
– Word order
– Word importance
– …
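The vector space pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy corpus (real systems add stemming, stop-word removal, and smoothed IDF):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF.IDF term vectors, one dict per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents each term appears in
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = ["the car raced past", "a fast car", "the horse raced home"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
print(cosine(vecs[0], vecs[2]))
```

Note how terms that occur in every document get IDF log(n/n) = 0, so they drop out of the similarity entirely, which is exactly the "word importance" weighting the slide refers to.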

Consider Word Order
N-grams model
– Bi-grams: two words as a phrase, though some are not really phrases
– Tri-grams: three words, usually not worth the cost
Phrase based
– Use part of speech, e.g. select noun phrases
– Regular expressions, chunking: expensive to write the patterns
Mixed results
One example [Furnkranz98]
– The representation is evaluated on a Web categorization task (university pages classified as STUDENT, FACULTY, STAFF, DEPARTMENT, etc.)
– A Naive Bayes (NB) classifier and Ripper are used
– Results (words vs. words+phrases) are mixed:
  Accuracy improved for NB but not for Ripper
  Precision at low recall greatly improved
  Some phrasal features are highly predictive for certain classes, but in general have low coverage
More recent work [Yuefeng Li 2010, KDD]
– Applied to classification, positive results
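The n-gram idea is simple enough to show directly. A sketch of contiguous n-gram extraction, which also demonstrates why many bi-grams are not real phrases:

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the faculty home page lists every staff member".split()
# "home page" and "staff member" are genuine phrases;
# "page lists" and "every staff" are not.
print(ngrams(tokens, 2))
```

Phrase-based representations replace this blind window with a part-of-speech chunker that keeps only, say, noun phrases, at the cost of writing or learning the chunking patterns.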

Word semantics
Using external resources
– Early work: WordNet, Cyc
– Wikipedia
– The Web
Mixed results
– Recall is usually improved, but precision is hurt
Disambiguation is critical

WordNet
WordNet’s organization
– The basic unit is the synset = synonym set
– A synset is equivalent to a concept
– E.g. senses of “car” (synsets to which “car” belongs):
  {car, auto, automobile, machine, motorcar}
  {car, railcar, railway car, railroad car}
  {cable car, car}
  {car, gondola}
  {car, elevator car}

WordNet is useful for IR
Indexing with synsets has proven effective [Gonzalo98]
– It improves recall because it maps synonyms to the same indexing object
– It improves precision if only relevant senses are considered
  E.g. a query for “jaguar” in the car sense causes retrieving only documents about the “jaguar car” sense
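Synset indexing can be pictured as replacing surface words with sense identifiers before matching. The sketch below uses a tiny hand-coded synset table and made-up sense ids (a real system would query WordNet and run word sense disambiguation first):

```python
# Toy synset inventory: word -> synset id (hypothetical ids, not real WordNet keys).
SYNSETS = {
    "car": "auto.n.01", "auto": "auto.n.01", "automobile": "auto.n.01",
    # assumes a WSD step has already chosen the car sense of "jaguar"
    "jaguar": "jaguar_car.n.01",
}

def index_terms(tokens):
    """Map each token to its synset id so synonyms share one index entry."""
    return {SYNSETS.get(t, t) for t in tokens}

query = index_terms(["car"])
doc = index_terms(["an", "automobile", "review"])
print(query & doc)  # overlap despite no shared surface word
```

The recall gain is visible here: "car" and "automobile" collapse to one index term. The precision risk is equally visible: if WSD picks the wrong synset, the query matches documents about the wrong concept.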

Mixed results
Concept indexing with WordNet [Scott98, Scott99]
– Used synsets and hypernyms with Ripper
– Failed because they do not perform WSD (Word Sense Disambiguation)
[Junker97]
– Used synsets and hypernyms as generalization operators in a specialized rule learner
– Failed because the proposed learning method gets lost in the hypothesis space
[Fukumoto01]
– Synsets and (limited) hypernyms for SVM, no WSD
– Improvement on less populated categories
In general
– Given that there is no reliable WSD algorithm for (fine-grained) WordNet senses, current approaches do not perform WSD
– Improvements in small categories
– But I believe full, perfect WSD is not required.

Word importance
Feature selection: needs a corpus for training
– Document frequency
– Information Gain (IG)
– Chi-square
Keyword extraction
Feature extraction
Others
– Using Wikipedia as training and testing data
– Using the Web
– Bringing order to words
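Document frequency is the cheapest of the selection criteria listed above. A minimal sketch (toy corpus; IG and chi-square would score terms against class labels instead of a plain count threshold):

```python
from collections import Counter

def select_by_df(docs, min_df=2):
    """Keep only terms whose document frequency is at least min_df."""
    df = Counter(t for doc in docs for t in set(doc.lower().split()))
    return {t for t, count in df.items() if count >= min_df}

docs = ["cheap flights to paris",
        "cheap hotels in paris",
        "machine learning notes"]
print(select_by_df(docs))  # only terms appearing in >= 2 documents survive
```

In practice a high-frequency cut-off is applied as well, since terms occurring in nearly every document (stop words) carry no discriminative weight.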

Other issues
Time
Word categories
– Common words
– Academic words
– Domain-specific words
…

Semantic distance between word pairs
– Thesaurus based: WordNet
– Corpus based: e.g. Latent Semantic Analysis
– Statistical thesaurus: co-occurrence
– Normalised Google Distance (NGD)
– Wikipedia based
  Wikipedia Link-based Measure (WLM)
  Explicit Semantic Analysis (ESA, state of the art)

NGD: Motivation and Goals
– To represent meaning in a computer-digestible form
– To establish semantic relations between common names of objects
– To utilise the largest database in the world: the Web

NGD definition
NGD(x, y) = [max(log f(x), log f(y)) − log f(x, y)] / [log N − min(log f(x), log f(y))]
– x = word one (e.g. 'horse')
– y = word two (e.g. 'rider')
– N = normalising factor (often M)
– M = the cardinality of the set of all pages on the web
– f(x) = the number of pages in which x occurs; f(x, y) = the number of pages containing both
Because of the log N term, NGD is stable as the web grows

Example: NGD(horse, rider)
– "horse" returns 46,700,000 pages
– "rider" returns 12,200,000 pages
– "horse rider" returns 2,630,000 pages
– Google indexed N = 8,058,044,651 pages
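Plugging the slide's page counts into the NGD formula is a one-liner; the log base cancels between numerator and denominator, so natural logs work fine:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from page counts."""
    lfx, lfy, lfxy, ln = (math.log(v) for v in (fx, fy, fxy, n))
    return (max(lfx, lfy) - lfxy) / (ln - min(lfx, lfy))

# Counts from the slide: "horse", "rider", "horse rider", total pages indexed
print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # ~0.44
```

A value near 0 means the words almost always co-occur; a large value means they rarely appear on the same page.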

Wikipedia link similarity measure
– Inlinks
– Outlinks
– Shared inlinks and outlinks; take the average of the two measures
– Inlinks: formula borrowed from NGD
– Outlinks: w(l, A), the weight of a link, similar to inverse document frequency
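The inlink half of the measure can be sketched by reusing the NGD shape over sets of linking articles. The inlink sets and article count below are hypothetical, invented for illustration:

```python
import math

def inlink_distance(a_links, b_links, n_articles):
    """Distance between two articles from their inlink sets (NGD-style).

    a_links, b_links: sets of ids of articles that link *to* each article.
    Returns 0 for identical link sets; 1.0 is assumed when nothing is shared.
    """
    shared = a_links & b_links
    if not shared:
        return 1.0
    la, lb = len(a_links), len(b_links)
    return ((math.log(max(la, lb)) - math.log(len(shared)))
            / (math.log(n_articles) - math.log(min(la, lb))))

# Hypothetical inlink sets over a 1,000,000-article Wikipedia
a = set(range(0, 400))     # 400 inlinks
b = set(range(200, 500))   # 300 inlinks, 200 of them shared with a
print(inlink_distance(a, b, 1_000_000))
```

The outlink half weights each shared outgoing link by how rare its target is (the IDF-like w(l, A) on the slide), and the final measure averages the two.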

Bag of concepts
Wikipedia Miner, WLM, by Ian Witten, David Milne, Anna Huang
– Wikipedia-based approach
– Concepts are anchor texts
  Can be phrases
  Also a way to select important words
– Uses shared inlinks and outlinks to estimate the semantic distance between concepts
– New document similarity measure
There should be other ways
– to define concepts
– to select concepts
– to compare concepts
– …

Statistical Thesaurus
– Existing human-developed thesauri are not easily available in all languages.
– Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
– Semantically related terms can be discovered from statistical analysis of corpora.

Automatic Global Analysis
– Determine term similarity through a pre-computed statistical analysis of the complete corpus.
– Compute association matrices which quantify term correlations in terms of how frequently terms co-occur.

Association Matrix
An n × n matrix over the vocabulary whose entries quantify term co-occurrence:
c_ij = Σ_k f_ik · f_jk
– c_ij: correlation factor between term i and term j
– f_ik: frequency of term i in document k

Normalized Association Matrix
The frequency-based correlation factor favors more frequent terms, so normalize the association scores:
s_ij = c_ij / (c_ii + c_jj − c_ij)
The normalized score is 1 if two terms have the same frequency in all documents.
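Both matrices are straightforward to compute from a term-by-document frequency matrix. A small sketch with a three-term toy vocabulary:

```python
def association(freq):
    """c_ij = sum_k f_ik * f_jk over a term-by-document frequency matrix."""
    n = len(freq)
    return [[sum(fi * fj for fi, fj in zip(freq[i], freq[j]))
             for j in range(n)] for i in range(n)]

def normalise(c):
    """s_ij = c_ij / (c_ii + c_jj - c_ij); equals 1 for identical rows."""
    n = len(c)
    return [[c[i][j] / (c[i][i] + c[j][j] - c[i][j])
             for j in range(n)] for i in range(n)]

# Rows = terms, columns = documents (raw term frequencies)
freq = [[2, 0, 1],   # term 0
        [2, 0, 1],   # term 1: same frequency profile as term 0
        [0, 3, 0]]   # term 2: never co-occurs with the others
s = normalise(association(freq))
print(s[0][1], s[0][2])
```

The output confirms the two boundary cases on the slide: terms with identical frequency in every document get score 1, and terms that never co-occur get score 0.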

Metric Correlation Matrix
Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents. Metric correlations account for term proximity:
c_ij = Σ_{ku ∈ V_i} Σ_{kv ∈ V_j} 1 / r(ku, kv)
– V_i: set of all occurrences of term i in any document
– r(ku, kv): distance in words between word occurrences ku and kv (∞ if ku and kv are occurrences in different documents)

Normalized Metric Correlation Matrix
Normalize the scores to account for term frequencies:
s_ij = c_ij / (|V_i| · |V_j|)
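The metric correlation and its normalization can be sketched over occurrence lists. The documents and word positions below are hypothetical, chosen so the arithmetic is easy to follow:

```python
def metric_correlation(pos_i, pos_j):
    """c_ij = sum over occurrence pairs of 1 / r(ku, kv).

    pos_i, pos_j: {doc_id: [word positions]} for terms i and j.
    Pairs in different documents have r = infinity, so they add nothing
    (the inner lookup simply skips documents term j never appears in).
    """
    c = 0.0
    for doc, positions_i in pos_i.items():
        for ku in positions_i:
            for kv in pos_j.get(doc, []):
                c += 1.0 / abs(ku - kv)
    return c

def normalised_metric(c, pos_i, pos_j):
    """s_ij = c_ij / (|V_i| * |V_j|), |V| counting all occurrences of a term."""
    vi = sum(len(p) for p in pos_i.values())
    vj = sum(len(p) for p in pos_j.values())
    return c / (vi * vj)

# term i occurs at positions 3 and 10 in doc d1; term j at 4 in d1 and 7 in d2
pos_i = {"d1": [3, 10]}
pos_j = {"d1": [4], "d2": [7]}
c = metric_correlation(pos_i, pos_j)  # 1/1 + 1/6; the d2 occurrence is ignored
print(c, normalised_metric(c, pos_i, pos_j))
```

Adjacent occurrences contribute a full unit while distant ones decay as 1/r, which is exactly how this matrix rewards proximity where the plain association matrix cannot.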