Measuring the Influence of Long Range Dependencies with Neural Network Language Models. Le Hai Son, Alexandre Allauzen, François Yvon. Univ. Paris-Sud and LIMSI/CNRS.

Presentation transcript:

Measuring the Influence of Long Range Dependencies with Neural Network Language Models. Le Hai Son, Alexandre Allauzen, François Yvon, Univ. Paris-Sud and LIMSI/CNRS. Presenter: 郝柏翰, 2012/08, NAACL 2012.

Outline
– Introduction
– Language modeling in a continuous space
– Efficiency issues
– Experimental Results
– Conclusion

Introduction
This paper investigates the potential of language models that use larger context windows, comprising up to the 9 previous words. The study is made possible by several novel Neural Network Language Model (NNLM) architectures that can easily handle such large context windows. The experiments show that extending the context size yields clear gains in perplexity, that the n-gram assumption is statistically reasonable as long as n is sufficiently high, and that efforts should therefore be focused on improving the estimation procedures for such large models.
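Perplexity is the evaluation metric used throughout: the exponentiated average negative log-probability that the model assigns to each word. A minimal sketch of how it is computed (function name and example values are illustrative, not from the paper):

```python
import math

def perplexity(log_probs):
    """Perplexity from natural-log word probabilities log p(w_i | h_i)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Example: a model that assigns probability 0.1 to every one of four words has perplexity 10.
print(perplexity([math.log(0.1)] * 4))  # ~10.0
```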

Language modeling in a continuous space
This architecture can be divided into two parts, with the hidden layer in the middle: the input part, which represents the context of the prediction, and the output part.
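A minimal sketch of this two-part, feed-forward architecture: the input part concatenates the projection (look-up) vectors of the n-1 context words, and the output part maps the hidden layer to a softmax over the vocabulary. All sizes and names below are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

V, d, H, n = 10000, 100, 200, 5                 # vocab size, projection dim, hidden dim, order (assumed)
rng = np.random.default_rng(0)
R   = rng.normal(0, 0.01, (V, d))               # shared projection (look-up) matrix
W_h = rng.normal(0, 0.01, ((n - 1) * d, H))     # input part -> hidden layer
W_o = rng.normal(0, 0.01, (H, V))               # output part: hidden layer -> vocabulary scores

def nnlm_probs(context_ids):
    """p(w | n-1 context word ids) for a feed-forward n-gram NNLM."""
    x = np.concatenate([R[i] for i in context_ids])   # input part: concatenated projections
    h = np.tanh(x @ W_h)                              # hidden layer in the middle
    scores = h @ W_o
    e = np.exp(scores - scores.max())                 # softmax over the vocabulary
    return e / e.sum()

p = nnlm_probs([12, 7, 512, 3])   # four context words for a 5-gram model
```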

Language modeling in a continuous space (figure slide; image not included in the transcript)

In order to better investigate the impact of each context position on the prediction, we introduce a slight modification of this architecture, in a manner analogous to the proposal of Collobert and Weston (2008).
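A hedged sketch of one way to read this modification, in the spirit of Collobert and Weston (2008): give each context position its own transformation and combine the resulting terms additively before the non-linearity, so that the contribution of each position can be inspected or ablated in isolation. This is an illustrative interpretation under that assumption, not necessarily the exact variant defined in the paper.

```python
import numpy as np

V, d, H, n = 10000, 100, 200, 5                              # illustrative sizes (assumed)
rng = np.random.default_rng(0)
R     = rng.normal(0, 0.01, (V, d))                          # shared projection matrix
W_pos = [rng.normal(0, 0.01, (d, H)) for _ in range(n - 1)]  # one transform per context position
W_o   = rng.normal(0, 0.01, (H, V))

def positionwise_probs(context_ids):
    # One additive term per context position; zeroing a term out (or measuring its norm)
    # gives a handle on how much that position influences the prediction.
    terms = [R[i] @ W_pos[k] for k, i in enumerate(context_ids)]
    h = np.tanh(sum(terms))
    scores = h @ W_o
    e = np.exp(scores - scores.max())
    return e / e.sum(), terms

probs, terms = positionwise_probs([12, 7, 512, 3])
```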

Language modeling in a continuous space
Conventional n-gram LMs are usually limited to small values of n; using n greater than 4 or 5 does not seem to be of much use. The gain obtained by increasing the n-gram order from 4 to 5 is almost negligible, whereas the model size increases drastically. The situation is quite different for NNLMs because of their specific architecture: increasing the context length mainly amounts to extending the projection layer with one supplementary projection vector, which can be computed very cheaply through a simple look-up operation. The overall complexity of NNLMs therefore grows only linearly with n in the worst case.
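To make the linear-growth claim concrete, here is a back-of-the-envelope parameter count for a feed-forward NNLM; the layer sizes are assumptions chosen only for illustration. Only the input-to-hidden block depends on n, so each extra context word costs one more d x H weight block plus a single extra look-up at prediction time.

```python
def nnlm_param_count(V=10000, d=100, H=200, n=4):
    """Rough parameter count of a feed-forward n-gram NNLM (illustrative sizes)."""
    projection = V * d                    # shared look-up table, independent of n
    input_to_hidden = (n - 1) * d * H     # the only term that grows with n
    hidden_to_output = H * V
    return projection + input_to_hidden + hidden_to_output

for order in (4, 5, 8, 10):
    print(order, nnlm_param_count(n=order))
# Each +1 in the order adds only d * H = 20,000 weights here (linear growth),
# whereas a back-off n-gram table tends to grow with the number of distinct n-grams observed.
```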

Language modeling in a continuous space (figure slide, continued; image not included in the transcript)

Efficiency issues
– Reducing the training data: our usual approach for training large-scale models is based on resampling a subset of the training data at the n-gram level at each epoch. This is not directly compatible with the recurrent model, which requires iterating over the training data sentence by sentence, in the same order as the sentences occur in the document.
– Bunch mode: after resampling, the training data is divided into several sentence flows that are processed simultaneously (a minimal batching sketch follows below). Using such batches, training can be sped up by a factor of 8 at the price of a slight loss (less than 2%) in perplexity.
– SOUL training scheme
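A minimal sketch of the bunch-mode idea mentioned above (illustrative, not the paper's actual implementation): the resampled sentences are split into a fixed number of parallel streams that are advanced in lock-step, so each training step consumes one position from every stream as a mini-batch.

```python
def make_bunches(sentences, n_streams=8):
    """Split sentences into parallel streams and yield lock-step mini-batches (illustrative)."""
    # Distribute sentences round-robin over the streams, then flatten each stream into a word list.
    streams = [[] for _ in range(n_streams)]
    for k, sentence in enumerate(sentences):
        streams[k % n_streams].extend(sentence)
    # Advance all streams together: each batch contains one word per stream.
    for t in range(min(len(s) for s in streams)):
        yield [s[t] for s in streams]

corpus = [["the", "cat", "sat"], ["a", "dog", "ran", "fast"], ["hello", "world"]]
for batch in make_bunches(corpus, n_streams=2):
    print(batch)   # ['the', 'a'], ['cat', 'dog'], ['sat', 'ran'], ['hello', 'fast']
```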

Experimental Results: the usefulness of remote words (results not included in the transcript)

Experimental Results: POS tags were computed using the TreeTagger.

Experimental Results ─ Translation experiments
Our solution was to resort to a two-pass approach: the first pass uses a conventional back-off n-gram model to produce a list of the k most likely translations; in the second pass, the NNLM probability of each hypothesis is computed and the k-best list is reordered accordingly (a small reranking sketch follows the list below). To clarify the impact of the language model order on translation performance, we considered three different ways to use NNLMs:
– In the first setting, the NNLM is used alone and all the scores provided by the MT system are ignored.
– In the second setting (replace), the NNLM score replaces the score of the standard back-off LM.
– Finally, the score of the NNLM can be added to the linear combination (add).
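A hedged sketch of the second pass in the "add" setting referenced above: each hypothesis in the k-best list receives the decoder score plus a weighted NNLM log-probability, and the list is re-sorted by the combined score. The weight, data layout, and scorer are assumptions for illustration; in practice the combination weight would be tuned like any other feature weight.

```python
def rerank_kbest(kbest, nnlm_logprob, weight=0.5):
    """Re-order a k-best list; kbest is a list of (hypothesis_tokens, decoder_score) pairs."""
    rescored = []
    for hyp, decoder_score in kbest:
        total = decoder_score + weight * nnlm_logprob(hyp)   # 'add' setting: combine both scores
        rescored.append((hyp, total))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy usage with a dummy NNLM scorer (placeholder for a real model):
dummy_nnlm = lambda hyp: -2.0 * len(hyp)
kbest = [("this is a test".split(), -10.0), ("this is test".split(), -9.5)]
print(rerank_kbest(kbest, dummy_nnlm))
```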

Experimental Results
These results strengthen the assumption made in Section 3.3: there seems to be very little information in remote words.

Experimental Results
Surprisingly, on this task, recurrent models seem to be comparable with 8-gram NNLMs.

Conclusion
Experimental results showed that, for the statistical machine translation task, the influence of words beyond the 9 previous ones can be neglected. By restricting the context of recurrent networks, the model can benefit from advanced training schemes, and its training time can be divided by a factor of 8 without any loss in performance. Experimental results also showed that using long-range dependencies (n = 10) with a SOUL language model significantly outperforms conventional LMs.