Domain Independent Key Term Extraction from Spoken Content Based on Context and Term Location Information in the Utterances
Hsien-Chin Lin, Chi-Yu Yang, Hung-Yi Lee, Lin-shan Lee

Presentation transcript:

Domain independent key term extraction from spoken content based on context and term location information in the utterances
Hsien-Chin Lin, Chi-Yu Yang, Hung-Yi Lee, Lin-shan Lee
National Taiwan University
Speaker: Hung-yi Lee

Task Introduction: Key Term Extraction

Task Introduction
Key term extraction aims to identify the important terms in a (spoken) document. The input is plain text or ASR transcriptions, and the extractor is a DNN model (e.g., an LSTM or CNN).

Key Term Extraction Approaches
Unsupervised approaches use some rules to determine whether a term is a key term, but give low accuracy.
Supervised approaches require labelled data and are limited by the training domain.
Practically, we need to extract key terms from documents in unseen domains.

Supervised Approaches [Kamal Sarkar, et al., IJCSI, 2010] [Wang, et al., SERC, 2014] [Shen, INTERSPEECH, 2016]
In supervised approaches, key term extraction is usually formulated as a multi-class classification problem in which each key term is considered as a class. Such models are limited by the training domain: a model trained on soup recipes cannot handle out-of-domain documents such as ICASSP papers (key terms: LSTM, GAN, SLU, ...), and these approaches can never detect key terms not in the training data. Practically, we need to extract key terms from documents in unseen domains.

Towards Domain Independence
We propose a supervised key term extraction approach based on the context and the term location information in the utterances. We focus on keyword extraction, but the proposed approach can be generalized to key phrase extraction.
Example: "The subject of this lecture is primarily about Ricklantis." Even though the model has never seen the term before, the context indicates that it is a key term.

Proposed Models

Basic Idea
Predict, for each word position in the input sentence, whether the word is a key word or not.
Document: $x_1, x_2, x_3, \ldots, x_N$; labels: $y_1, y_2, y_3, \ldots, y_N$, where $y_i = 0$ means the word at that position is NOT a key term and $y_i = 1$ means it is a key term.
Example: in "The lecture is about GAN", only the position of "GAN" is labeled 1.
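To make the labeling scheme concrete, here is a tiny Python sketch (not from the paper; the helper name is hypothetical):

```python
# y_i = 1 if word x_i is one of the reference key terms, else 0.
def make_labels(words, key_terms):
    return [1 if w in key_terms else 0 for w in words]

print(make_labels("The lecture is about GAN".split(), {"GAN"}))  # [0, 0, 0, 0, 1]
```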

Basic Model
Each word in the input sentence passes through feature extraction and hidden layers to produce a representation $v_1, \ldots, v_5$, and an output layer gives a score $o_1, \ldots, o_5$ for each position.
Training phase: for "The lecture is about GAN", the outputs are trained toward the 0/1 key term labels.
Testing phase: for "The lecture is about Ricklantis", the output scores (e.g., 0.0, 0.1, 0.0, 0.0, 0.9) are compared with a threshold th; any position whose score is larger than th (here the last word) is predicted to be a keyword.
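A minimal sketch of such a per-word tagger, assuming PyTorch (the framework is not stated in the transcript); the bidirectional LSTM, layer sizes, and threshold value are illustrative choices, and the 205-dim input anticipates the features described later (200-dim Word2Vec plus 5 word statistics):

```python
import torch
import torch.nn as nn

class BasicKeyTermTagger(nn.Module):
    def __init__(self, feat_dim=205, hidden_dim=128):
        super().__init__()
        # Feature encoding over the word sequence (a bidirectional LSTM is one choice).
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Output layer: one score o_i per word position.
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                                 # x: (batch, N, feat_dim)
        v, _ = self.encoder(x)                            # hidden representation v_i per position
        return torch.sigmoid(self.out(v)).squeeze(-1)     # o_i in [0, 1]

model = BasicKeyTermTagger()
x = torch.randn(1, 5, 205)                                # "The lecture is about GAN" as features
y = torch.tensor([[0., 0., 0., 0., 1.]])                  # only "GAN" is a key term

# Training phase: per-position binary cross-entropy against the 0/1 labels.
loss = nn.functional.binary_cross_entropy(model(x), y)
loss.backward()

# Testing phase: positions whose score exceeds a threshold th are predicted keywords.
th = 0.5
pred = (model(x) > th).long()
```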

Optimizing the Evaluation Measure
The training of the basic model does not directly optimize the evaluation measure (the F-1 score of the extracted key terms).

Optimizing the Evaluation Measure
Can we directly optimize the F-1 score during training? Issue: the F-1 score is not differentiable. We use reinforcement learning!

Example: Playing a Video Game
Start with observation $s_1$, take action $a_1$ ("right"), receive observation $s_2$, take action $a_2$ ("fire"), receive observation $s_3$, and so on. Usually there is some randomness in the environment. The total reward is $R = \sum_{t=1}^{T} r_t$, and the agent learns to maximize the total reward.

Agent
In our task, each word is an observation, and the action is deciding whether the word is a key term or not. The total reward is the F-1 score of the selected words against the reference keywords. We use reinforcement learning to learn the agent.

Reinforcement Learning: Policy Gradient
Using the basic model as initialization, for each word $x_i$ we have an output $o_i$, interpreted as $P(z_i = 1) = o_i$ and $P(z_i = 0) = 1 - o_i$. We obtain a sample value $z_i$ according to $o_i$; all the samples for an input sentence $X$ form $Z$. We repeat this $k$ times ($k = 5$ here).
Example: for $X = x_1 x_2 x_3 x_4$ with outputs $O = [0.1, 0.05, 0.5, 0.8]$, we have $P(z_1{=}1){=}0.1$, $P(z_2{=}1){=}0.05$, $P(z_3{=}1){=}0.5$, $P(z_4{=}1){=}0.8$, and draw five sampled sequences $Z_1, \ldots, Z_5$.

Reinforcement Learning: Policy Gradient
We evaluate the F-measure $\alpha_i$ (the reward) for each sampled sequence $Z_i$. Assume $x_3$ and $x_4$ are the true key terms; then
$Z_1 = [0, 0, 0, 1]$: $\alpha_1 = 0.67$
$Z_2 = [0, 0, 0, 0]$: $\alpha_2 = 0$
$Z_3 = [0, 0, 1, 0]$: $\alpha_3 = 0.67$
$Z_4 = [0, 0, 1, 1]$: $\alpha_4 = 1$
$Z_5 = [1, 0, 0, 1]$: $\alpha_5 = 0.5$
The environment is the F-measure evaluation (the reward function), and the reward is $\alpha_i$.

Reinforcement Learning: Policy Gradient
Define a new training set: the five training pairs $(X, Z_j)$, each with weight $\alpha_j$, are used to retrain the model with the original training method. Training pairs with higher F-measure are weighted higher, so the model tends to generate outputs with larger $\alpha$.
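The whole sampling-and-reweighting loop can be sketched as below, assuming the per-word outputs $o_i$ are treated as independent Bernoulli probabilities; the helper names (f1_reward, policy_gradient_step) are illustrative, not from the paper:

```python
import torch

def f1_reward(z, reference):                     # z, reference: 0/1 tensors of length N
    tp = ((z == 1) & (reference == 1)).sum().item()
    fp = ((z == 1) & (reference == 0)).sum().item()
    fn = ((z == 0) & (reference == 1)).sum().item()
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def policy_gradient_step(model, optimizer, x, reference, k=5):
    o = model(x).squeeze(0)                       # o_i = P(z_i = 1), length N
    loss = 0.0
    for _ in range(k):
        z = torch.bernoulli(o).detach()           # sample one label sequence Z_j
        alpha = f1_reward(z, reference)           # reward = F-1 against the reference keywords
        # (X, Z_j) acts as a training pair weighted by its reward alpha_j.
        loss = loss + alpha * torch.nn.functional.binary_cross_entropy(o, z)
    optimizer.zero_grad()
    (loss / k).backward()
    optimizer.step()
```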

Spoken Documents
For spoken documents, the input is the ASR transcription, so we study the impact of speech recognition errors: "The lecture is about GAN" may be recognized as "The lecture is about gang." Because of the lack of spoken data for key term extraction, we simulate ASR errors.

Speech Recognition Error Simulation
There is a lack of spoken data for training and testing, so we simulate recognition errors to generate data from text. The simulator is trained on LibriSpeech: given the original words $X = x_1, x_2, \ldots, x_N$ and their transcriptions $X' = x'_1, x'_2, \ldots, x'_N$, we build a confusion matrix by
$P(a \mid b) = \dfrac{\mathrm{count}(x' = a, x = b)}{\sum_m \mathrm{count}(x' = m, x = b)}.$
Words in the same confusion group (e.g., "Hair", "Her", "Your", or "One", "Won", "Want", "Once") have higher probabilities of being recognized as each other. The confusion matrix jointly reflects the functions of the acoustic and language models.
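A small sketch of this simulation, assuming word-level alignments between the reference text and the ASR transcriptions (e.g., from LibriSpeech) are already available; the data structures and names are illustrative:

```python
import random
from collections import Counter, defaultdict

def build_confusion(pairs):
    # pairs: iterable of aligned (original word x, transcribed word x') pairs
    counts = defaultdict(Counter)
    for x, x_prime in pairs:
        counts[x][x_prime] += 1
    # P(a | b) = count(x'=a, x=b) / sum_m count(x'=m, x=b)
    return {b: {a: c / sum(cnt.values()) for a, c in cnt.items()}
            for b, cnt in counts.items()}

def simulate_errors(words, confusion):
    noisy = []
    for w in words:
        dist = confusion.get(w)
        if dist is None:
            noisy.append(w)                       # word unseen by the simulator: keep it
        else:
            candidates, probs = zip(*dist.items())
            noisy.append(random.choices(candidates, weights=probs)[0])
    return noisy
```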

Experiments
For each domain, the data is divided into training and testing sets at a 9:1 ratio (used for training the supervised classification LSTM). Key phrases are also considered.

Data Sets
Kaggle competition: https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags, parsed from StackExchange, a Q&A community website where users post their questions and answers together with some key terms.
Example 1: "How can I get chewy chocolate chip cookies? My chocolate chip cookies are always too crisp. How can I get chewy cookies, like those of Starbucks?" Key terms: baking, cookies, texture.
Example 2: "What is the difference between white and brown eggs? I always use brown extra large eggs, but I can't honestly say why I do this other than habit at this point. Are there any distinct advantages or disadvantages like flavor, shelf life, etc.?" Key terms: eggs.

Data Sets
The data contain 6 domains, each divided into a training and a testing set. To evaluate the ability of the proposed approach to extract key terms in an unseen domain not existing in the training set, we choose one domain for testing and use the others for training.

Domain           biology  cooking  travel  robotics  crypto  DIY
# of documents   13196    15404    19279   2771      10432   25918
Vocabulary size  38257    24313    32072   17160     26792   32106
# of key terms   678      736      1645    231       392     734
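The leave-one-domain-out protocol implied by this table can be written as a short sketch (domain names from the table; the training and evaluation calls are placeholders):

```python
# Hold out one StackExchange domain as the unseen test domain; pool the other five
# for training. Repeat for every choice of test domain.
domains = ["biology", "cooking", "travel", "robotics", "crypto", "DIY"]
for test_domain in domains:
    train_domains = [d for d in domains if d != test_domain]
    # train the key term extractor on train_domains, then report F-1 on test_domain
```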

Feature Extraction
Each word is represented by its Word2Vec embedding (dim = 200) concatenated with word statistics (dim = 5): term frequency, inverse document frequency, tf-idf, the word count in the domain, and the position of the word in the sentence.
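A rough sketch of these features; the exact counting and normalization are not given in the transcript, so the formulas below are only one plausible reading, and w2v stands for an assumed pre-trained 200-dim Word2Vec model:

```python
import math
from collections import Counter

def word_statistics(word, position, sentence, domain_docs):
    # domain_docs: list of documents (each a list of words) from the training domains
    tf = Counter(sentence)[word] / len(sentence)                  # term frequency
    df = sum(1 for doc in domain_docs if word in doc)
    idf = math.log(len(domain_docs) / (1 + df))                   # inverse document frequency
    domain_count = sum(doc.count(word) for doc in domain_docs)    # word count in the domain
    rel_pos = position / len(sentence)                            # position of the word in the sentence
    return [tf, idf, tf * idf, domain_count, rel_pos]             # dim = 5

def word_features(word, position, sentence, domain_docs, w2v):
    return list(w2v[word]) + word_statistics(word, position, sentence, domain_docs)   # dim = 205
```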

Experimental Results
Evaluation measure: F-1 score.

Methods                                  DIY    biology  cooking  travel  robotics  crypto
Upper bound                              0.679  0.359    0.672    0.578   0.651     0.684
Baseline: Tf-idf sorting                 0.213  0.095    0.249    0.206   0.193     0.224
Baseline: TextRank                       0.195  0.084    0.242    0.154   0.151     0.178
Classification LSTM                      0.282  0.208    0.221    0.243   0.180     0.222
Proposed: Basic, embedding only          0.235  0.102    0.211    0.185   –         0.237
Proposed: Basic, all features            –      0.103    0.285    0.229   0.202     0.252
Proposed: Optimizing F-1, all features   0.272  0.113    0.318    0.215   0.246     0.255

The proposed approach cannot extract key terms that do not appear in the document. Upper bound: all the key terms that appear in the documents are correctly identified; its F-1 score is not 1.0 because some key terms do not appear in the documents.

Tf-idf sorting and TextRank are unsupervised approaches; the classification LSTM is a supervised approach.

Classification LSTM (baseline)
The classification LSTM is learned from in-domain data: for an input such as "The article is about GAN", the features pass through hidden layers and the output layer classifies over the set of key terms seen in training (e.g., GAN, RNN, CNN).

Adding the word statistics is always helpful (comparing "embedding only" with "all features").

Optimizing the F-1 score improved the performance in most cases.

The proposed approach always outperformed the unsupervised approaches.

Note that the classification LSTM used in-domain data, while the proposed approach used out-of-domain data only.

Speech Recognition Errors
As the WER ranged from 5% to 30%, the performance of the supervised baseline (classification LSTM) was seriously degraded, while the performance of the proposed model (optimizing F-1 with all features) was only slightly degraded.

Conclusions
In this work, we proposed a novel domain-independent approach to key term extraction. Once trained with data from several different domains, it can extract key terms in unseen domains. The performance of this approach degrades only very slightly with speech recognition errors.
Future work: key phrase extraction, and key terms not existing in the documents.
We hope to publish an ICASSP paper whose index terms are extracted by the proposed approach.