Evaluating Statistically Generated Phrases
Raymond Wan and Alistair Moffat
Department of Computer Science and Software Engineering, University of Melbourne

Purpose: to develop a framework for evaluating statistical phrases which (1) compares them with phrases identified through natural language processing techniques, and (2) evaluates recall from the compressed representation made from the phrases. The first comparison resembles earlier work by Wolff [3], where a linguist was asked to determine phrases in text.

The framework is demonstrated with a statistical system called Re-Pair and a natural language processing (NLP) system called Link Grammar. The steps employed are shown above. The source text, described later, contains SGML mark-up, which is first removed for both systems. Before it is passed to Re-Pair, the text is transformed so that words are limited to 16 characters, case folded, and then stemmed using the Porter stemming algorithm. The Link Grammar system takes the filtered text and returns a list of phrases. Of these phrases, the simplex noun phrases (those with no coordinating conjunctions or prepositions) are extracted and then transformed in the same way. Two sets of phrases are produced: P_RP and P_LG.
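The word-level transformation described above, and a length-grouped comparison of the two resulting phrase sets, might be sketched as follows. This is only an illustration under stated assumptions: the function names `normalise` and `overlap_by_length` are hypothetical, phrases are assumed to be whitespace-separated strings, and the stemmer is left as a pluggable argument (identity by default) rather than a real Porter implementation.

```python
from collections import Counter

MAX_WORD_LEN = 16  # word length limit stated in the text


def normalise(phrase, stem=lambda word: word):
    """Limit each word to 16 characters, case fold, then stem.

    `stem` is a placeholder (identity by default); a Porter stemmer
    would be passed in here. Returns the phrase as a tuple of words.
    """
    return tuple(stem(word[:MAX_WORD_LEN].lower()) for word in phrase.split())


def overlap_by_length(set_a, set_b):
    """Count the phrases present in both sets, grouped by phrase length."""
    return dict(Counter(len(phrase) for phrase in set_a & set_b))
```

Applying `normalise` to every phrase from both systems before building the sets means that phrases differing only in case or in very long word endings compare as equal, which is what makes the set intersection meaningful.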

Re-Pair [1] is an off-line, dictionary-based compression algorithm which reduces the length of a message by recursively replacing the most frequently occurring pair of symbols (word tokens, in our case) with a new symbol. It produces a dictionary of phrases (the phrase hierarchy) and a sequence of references to that hierarchy. The hierarchical relationship between phrases is illustrated in the graph structure above. Every phrase can be broken into its two components, has siblings with which it shares one component, and can be extended to longer phrases which contain it. The figure above shows some of the Re-Pair phrases identified in a sample news article. Phrases which have two words are underlined; those that use these phrases directly are highlighted.

Initially, Link Grammar [2] classifies words in the text according to their part of speech (noun, verb, etc.). Then, words are linked recursively based on a set of rules. For example, a link would be formed between the pair of words "the account", since "the" is a determiner and "account" is a noun which accepts a determiner to its left. If the sentence is grammatical, then a valid linkage is formed, as shown in the figure above. Constituents (phrases) are then identified. For example, the above sentence would be labelled as: (S (NP South Korea) (VP 's (NP Current Account))). "NP" and "VP" signify noun phrase and verb phrase, respectively. All of the simplex noun phrases that Link Grammar identified in the same sample text as on the left are underlined above.
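The pair-replacement idea behind Re-Pair can be illustrated with a deliberately naive sketch. It is not the implementation of [1]: it recounts all pairs on every pass instead of using the efficient data structures a practical version would need, it counts overlapping occurrences of a pair, and the names `repair`, `expand`, and `min_count` are assumptions made for this example.

```python
from collections import Counter


def repair(tokens, min_count=2):
    """Repeatedly replace the most frequent adjacent pair with a new symbol.

    Returns the reduced sequence and the phrase hierarchy as a dict
    mapping each new symbol to the pair of symbols it replaced.
    """
    seq = list(tokens)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        symbol = ("R", len(rules))  # tuple, so it cannot clash with a word
        rules[symbol] = pair
        reduced, i = [], 0
        while i < len(seq):
            # Replace left to right, skipping both members of a matched pair.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                reduced.append(symbol)
                i += 2
            else:
                reduced.append(seq[i])
                i += 1
        seq = reduced
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into its sequence of words."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Each entry in `rules` is one phrase in the hierarchy: its two components are the stored pair, and longer phrases are those whose pair refers to it. Expanding every symbol of the reduced sequence reproduces the original token list.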

Experiments were conducted on a 20 MB subset of Wall Street Journal news articles from 1987, in SGML mark-up, which form part of Disk 1 of TREC's TIPSTER collection.

The overlap between P_RP and P_LG is shown in the upper table, to the right, grouped according to phrase length. As the table shows, just under 30% of the Re-Pair phrases of length 2 were also identified by Link Grammar. This value diminishes with increasing phrase length.

Recall of the Re-Pair phrases is listed in the lower table. The unweighted recall assumes that every symbol in the phrase hierarchy is equally likely to be queried. The weighted scheme ensures that a symbol's recall is proportional to its frequency in the original text. The average recall for both metrics is no less than [...]. That is, due to Re-Pair's phrase selection heuristic, some phrases cannot be found. This is because sequences of words that form some phrases in the text are broken up by other, more frequent ones. Recall of 1.00 can be achieved if Re-Pair is used to isolate phrases that are then explicitly indexed by an inverted file.

A framework has been described which evaluates the quality of phrases derived from statistics. Two systems were suggested for the framework. Although fewer than 30% of the phrases overlap, statistical phrase selection with Re-Pair is still viable due to its speed. For example, Re-Pair requires 18 seconds for this test data, while Link Grammar needed about 100 hours †. A system which compromises between these two methods may provide a better solution.

† The test machine was a 933 MHz Pentium III with 1 GB RAM and 256 kB on-die cache.

[1] N. J. Larsson and A. Moffat. Offline dictionary-based compression. Proc. IEEE, 88(11): , November
[2] D. D. K. Sleator and D. Temperley. Parsing English with a Link Grammar. Technical Report CMU-CS , Carnegie Mellon University, School of Computer Science, October Software available from current version is
[3] J. G. Wolff. Language acquisition and the discovery of phrase structure. Language and Speech, 23(3): ,