Center for NLP Whither Come the Words? Dr. Elizabeth D. Liddy Center for Natural Language Processing School of Information Studies Syracuse University.

Presentation transcript:

A Continuum from Human to Statistical Indexing
- Manual
- Controlled vocabularies
- Mixed initiative
  - Machine-aided / human-assisted
  - Machine learning
- Automatic
  - Statistical indexing
  - Natural Language Processing indexing

Basic Premise
The quality of the representation of documents determines:
- the ‘richness’ of the indexing
- the ‘quality’ of access to relevant information
- the ‘value-add’ analytics the system can accomplish for users

Central Problem of IR
How to represent documents for retrieval (Blair, 1990)
- key issue in controlled vocabulary representation & searching
- still true with full-text indexing and free-text querying systems
- because documents & queries are expressed in language
  - language is complex and ambiguous
  - methods for solving the language issue are difficult
  - some IR systems don’t even attempt to deal with it
  - major challenge of high-quality information access

1. Identify indexable / queryable elements
What is a term?
- Alphanumeric characters between blank spaces or punctuation?
What about non-compositional phrases? Multi-word proper names?
What about inter-word symbols such as hyphens or apostrophes?
- “small business men” vs. “small-business men”
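The effect of the hyphen question can be seen in a minimal sketch, assuming a simple regex tokenizer. The function name `tokenize` and the `keep_hyphens` flag are hypothetical and not from the slides; real systems make this choice in many different ways.

```python
import re

def tokenize(text, keep_hyphens=False):
    """Split text into lowercase terms (hypothetical helper).

    keep_hyphens=False: hyphens and apostrophes act as separators.
    keep_hyphens=True:  hyphens and apostrophes stay inside a term.
    """
    if keep_hyphens:
        pattern = r"[a-z0-9]+(?:[-'][a-z0-9]+)*"
    else:
        pattern = r"[a-z0-9]+"
    return re.findall(pattern, text.lower())

# With hyphens treated as separators, the two phrases index identically.
print(tokenize("small business men"))
print(tokenize("small-business men"))
# With hyphens kept, "small-business" becomes a single indexable term.
print(tokenize("small-business men", keep_hyphens=True))
```

Under the first setting, a query about small men in business and a query about men in small businesses yield the same terms; under the second they do not.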

2. Represent the concept behind the term
Ability to take ‘terms’, and:
- Standardize
- Expand to alternative ‘terms’
- Disambiguate
So that the concept behind the ‘term’ is represented in both documents & queries

Term Expansion
Goal: add all variant terms which refer to the same concept
- either synonymous expressions or associated terms
- use either a thesaurus, a semantic network, or statistically determined co-occurring terms/phrases
- inspired by the success of humanly-consulted IR thesauri used in the earliest systems
- relieves the user from needing to generate all conceptual variants

Term Expansion
Multiple approaches:
- Knowledge-based
- Linguistic
- Statistical

Knowledge-based Thesauri
IR-style
- intended for human indexers and searchers
- manually constructed for a specific domain
Contain synonymous, more general, and more specific terms
- Use For
- Broader
- Narrower
- Related
Current question is how to utilize them appropriately in Web-based systems

Knowledge-based Thesauri
DATABASE MANAGEMENT SYSTEMS
  UF  databases
  NT  relational databases
  BT  file organization
      management information systems
  RT  database theory
      decision support systems

Linguistic Thesauri
General purpose style
- e.g. Roget’s, WordNet
- contain explicit concept hierarchies of up to 8 increasingly specified levels
Based on the assumption that the words in a semicolon group (RIT) or a synset (WordNet) are synonymous or near-synonymous
- issue / difficulty is selecting the correct sense for terms

[Figure: thesaurus concept hierarchy, one level per line]
Abstract Relations | Space | Physics | Matter | Sensation | Intellect | Volition | Affections
The World
Sensation in General | Touch | Taste | Smell | Sight | Hearing
Odor | Fragrance | Stench | Odorless
Incense; joss stick; pastille; frankincense or olibanum; agallock or aloeswood; calambac

Linguistic Thesaurus Use in IR
Can be used on either / both documents or queries
- more commonly done on queries
Terms are expanded by adding one or all of:
- synonyms
- hyponyms
- hypernyms
Issues caused by:
- idiomatic, specialized terms
- non-compositional phrases not in the thesaurus

Process used in Voorhees ’93 research
- Look up each word from the text in WordNet
- If the word is found, the synonyms from all of its synsets are added to the query representation
- Weight each added word as 0.8 rather than 1.0
- Found results to be better than plain SMART
  - Variable performance over queries
  - Major cause of error was expansion using the synsets of ambiguous words
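The expansion steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original experimental code: `TOY_SYNSETS` is a small hypothetical stand-in for WordNet, and `expand_query` is a made-up name. Only the 0.8 weight for added words comes from the slide.

```python
# Hypothetical stand-in for WordNet: each word maps to a list of synsets,
# one synset per sense. Real WordNet entries differ.
TOY_SYNSETS = {
    "bond": [
        {"bond", "security", "certificate"},   # financial sense
        {"bond", "adhesion", "attachment"},    # physical sense
    ],
    "instrument": [
        {"instrument", "tool"},
        {"instrument", "legal document"},
    ],
}

def expand_query(terms, synsets=TOY_SYNSETS, added_weight=0.8):
    """Return {term: weight}: original terms at 1.0, added synonyms at added_weight."""
    weights = {term: 1.0 for term in terms}
    for term in terms:
        # Synonyms from ALL synsets are added; there is no sense selection.
        for synset in synsets.get(term, []):
            for synonym in synset:
                weights.setdefault(synonym, added_weight)
    return weights

expanded = expand_query(["bond", "risk"])
for term in sorted(expanded):
    print(term, expanded[term])
```

Because every sense is used, a financial query about "bond" also picks up "adhesion" and "attachment", which illustrates the error source the slide attributes to ambiguous words.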

Use of Thesauri for Expansion
General thesauri such as Roget’s or WordNet have not been shown conclusively to improve results:
- may sacrifice precision to recall
- not domain specific
- not sense disambiguated
But a currently active field of R & D

Disambiguation
Non-relevant documents may be retrieved because they contain the query term,
- but the wrong sense of the query term
Need good Word Sense Disambiguation

Sample ambiguous query:
I would like information about developments in low-risk instruments, especially those being offered by companies specializing in bonds.

Human Sense Disambiguation
Sources of influence known from psycholinguistics research:
- local context
  - the sentence / query containing the ambiguous word restricts the interpretation of the ambiguous word
- domain knowledge
  - the fact that a text is concerned with a particular domain activates only the sense appropriate to that domain
- frequency data
  - the frequency of each sense in general usage affects its accessibility to the mind

Machine Readable Lexical Sources
Multiple entries for polysemous words
Instrument
- Medical
- Financial
- Dental
- Musical
- Hardware
- Empirical experimentation
- General

Machine Readable Lexical Sources
Senses are ranked by frequency of occurrence in usage:
1. Musical
2. Hardware
3. General
4. Medical
5. Dental
6. Financial
7. Empirical experimentation

Corpus-based Word Sense Disambiguation
Supervised learning from manually sense-tagged corpora
- allows development of algorithms which can correctly tag each word with its correct sense
- utilizes context, which then proves essential in real-time disambiguation
- usually a small window of words surrounding the ambiguous term
Issues
- time & cost in tagging the training sample
- need to retag for new domains or genres
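A minimal sketch of the supervised, window-based idea, assuming a naive Bayes scorer with add-one smoothing. The slides do not name a specific learning algorithm, so this choice is an assumption, as are the helper names `window`, `train`, and `disambiguate` and the two hand-made training sentences.

```python
import math
from collections import Counter, defaultdict

def window(tokens, index, size=3):
    """Words within `size` positions of tokens[index], excluding the word itself."""
    return tokens[max(0, index - size):index] + tokens[index + 1:index + 1 + size]

def train(tagged_examples, size=3):
    """tagged_examples: list of (tokens, index_of_ambiguous_word, sense)."""
    sense_counts = Counter()
    context_counts = defaultdict(Counter)
    for tokens, index, sense in tagged_examples:
        sense_counts[sense] += 1
        context_counts[sense].update(window(tokens, index, size))
    return sense_counts, context_counts

def disambiguate(tokens, index, model, size=3):
    """Pick the sense whose training contexts best match the local window."""
    sense_counts, context_counts = model
    vocab = {w for counts in context_counts.values() for w in counts}
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        denom = sum(context_counts[sense].values()) + len(vocab) + 1
        score = math.log(n / total)
        for word in window(tokens, index, size):
            score += math.log((context_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-tagged training sample for "instruments".
s1 = "the bank offers low-risk financial instruments and bonds".split()
s2 = "the orchestra tuned its musical instruments before the concert".split()
model = train([
    (s1, s1.index("instruments"), "financial"),
    (s2, s2.index("instruments"), "musical"),
])

query = "companies offering low-risk instruments and bonds".split()
print(disambiguate(query, query.index("instruments"), model))
```

The two-sentence sample also hints at the issues listed on the slide: every sense and every new domain needs its own tagged examples.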

Word Sense Disambiguation
Impact on retrieval results
- Results vary
  - by approach used
  - by query (short queries, especially)
  - by engine
- Some consider it a proven technique for improving precision
- Some are concerned about the trade-off in efficiency

Statistical Thesauri
Automatic thesaurus construction
- Classes of terms produced are not necessarily synonymous, nor broader, nor narrower
- Rather, words that tend to co-occur with the head term
- Effectiveness varies considerably depending on the technique used

Automatic Thesaurus Construction (Salton)
Document collection based
- based on index term similarities
- compute vector similarities for each pair of documents
- if sufficiently similar, create a thesaurus entry for each term which includes terms from the similar document
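The pairwise procedure above might look like the following sketch. It assumes raw term-frequency vectors, cosine similarity, and a fixed threshold of 0.5; the slide specifies none of these details, and `cosine`, `build_thesaurus`, and the three toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_thesaurus(docs, threshold=0.5):
    """For each sufficiently similar document pair, link every term in one
    document to the terms of the other."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    thesaurus = defaultdict(set)
    for a, b in combinations(vectors, 2):
        if cosine(a, b) >= threshold:
            for term in a:
                thesaurus[term].update(t for t in b if t != term)
            for term in b:
                thesaurus[term].update(t for t in a if t != term)
    return thesaurus

# Hypothetical three-document collection.
docs = [
    "anneal strain dislocation",
    "anneal strain junction",
    "hysteresis flux-leakage demagnetize",
]
thesaurus = build_thesaurus(docs)
print(sorted(thesaurus["dislocation"]))
```

Here "dislocation" and "junction" end up in each other's entries only because their documents share other terms, which is why such classes are co-occurrence groups rather than synonym sets.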

Sample Automatic Thesaurus Entries
408: dislocation, junction, minority-carrier, point contact, recombine, transition
409: blast-cooled, heat-flow, heat-transfer
410: anneal, strain
411: coercive, demagnetize, flux-leakage, hysteresis, induct, insensitive, magnetoresistance, square-loop, threshold
412: longitudinal, transverse

Dynamic Automatic Thesaurus Construction
Thesaurus short-cut
- Run at query time
- Take all terms in the query into consideration at once
- Look at frequent words and phrases in the top retrieved documents and add these to the query
= Automatic Relevance Feedback
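As a rough illustration of automatic relevance feedback, the sketch below ranks documents by a simple count of query-term matches, then adds the most frequent new words from the top-ranked documents to the query. The scoring function, the parameter names (`top_docs`, `new_terms`), and the sample documents are all assumptions; production systems use weighted ranking and also consider phrases.

```python
from collections import Counter

def score(query_terms, doc_tokens):
    """Toy ranking score: total occurrences of query terms in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def feedback_expand(query_terms, docs, top_docs=2, new_terms=3,
                    stopwords=frozenset()):
    """Return the query plus the most frequent new words from the top documents."""
    tokenized = [doc.lower().split() for doc in docs]
    ranked = sorted(tokenized, key=lambda toks: score(query_terms, toks),
                    reverse=True)
    counts = Counter()
    for toks in ranked[:top_docs]:
        counts.update(t for t in toks
                      if t not in query_terms and t not in stopwords)
    return list(query_terms) + [t for t, _ in counts.most_common(new_terms)]

# Hypothetical mini-collection.
docs = [
    "immigration law amnesty program employer sanctions",
    "immigration law amnesty undocumented workers",
    "football season opens",
]
print(feedback_expand(["immigration", "law"], docs, new_terms=1))
```

No thesaurus is built ahead of time: the "entries" are derived from whatever the initial query retrieves, so a poor first ranking feeds poor terms back into the query.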

Expansion by an Association Thesaurus
Query: Impact of the 1986 Immigration Law
Phrases retrieved by association in corpus:
- illegal immigration
- amnesty program
- immigration reform law
- editorial page article
- naturalization service
- civil fines
- new immigration law
- legal immigration
- employer sanctions
- statutes
- applicability
- seeking amnesty
- legal status
- immigration act
- undocumented workers
- guest worker
- sweeping immigration law
- undocumented aliens

NLP-based Indexing
The computational process of identifying, selecting, and extracting useful information from massive volumes of textual data:
- for potential review by indexers
- or as a stand-alone representation of content
- using Natural Language Processing

Natural Language Processing
A range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications

Levels of Language Understanding
- Pragmatic
- Discourse
- Semantic
- Syntactic
- Lexical
- Morphological

What can NLP Indexing do?
- Phrase recognition
- Disambiguation
- Concept expansion

In Summary
- There exists a range of approaches for representing documents and queries
- Each needs to be evaluated in terms of its ability to accomplish the goals of your application
- Web applications have opened a whole new world of possible variations on the traditional indexing approaches