Korean version of GloVe
Applying the GloVe and word2vec models to a Korean corpus
speaker: 20121606 양희정
date: 2015. 12. 17
0. Index
1. Introduction
2. Prerequisite
3. Corpus Construction
4. Building Evaluation Task
5. Training Vector Model
6. Result Analysis
7. Project Output
1. Introduction
project name: Korean version of GloVe
project abstract: We studied semantic vector space models that represent each Korean word as a real-valued vector. We conducted an experiment on Korean word representations with a word analogy task and a word similarity task, in which the Global Vector model, the continuous bag-of-words model, and the skip-gram model are compared.
2. Prerequisite
semantic vector space model
Representing each word with a real-valued vector; such vectors can be used as features in a variety of applications: information retrieval, document classification, question answering, named entity recognition, parsing, etc.
[figure: indexing a document space by vector representations]
2. Prerequisite
Global Vector model
The Global Vector model (GloVe) is an unsupervised learning algorithm for obtaining vector representations of words. The main intuition behind GloVe is that ratios of word-word co-occurrence probabilities carry meaning: "ice" co-occurs more frequently with "solid" than with "gas", whereas "steam" co-occurs more frequently with "gas" than with "solid".
[figure: co-occurrence probabilities of "ice" and "steam" with various context words]
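The intuition above can be sketched with toy counts (the numbers below are purely illustrative, not real corpus statistics): the ratio P(k | ice) / P(k | steam) is large when the context word k relates to "ice" but not "steam", and small in the opposite case.

```python
# Toy co-occurrence counts (illustrative numbers only, not real corpus statistics).
counts = {
    ("ice", "solid"): 190, ("ice", "gas"): 7,
    ("steam", "solid"): 6, ("steam", "gas"): 180,
}
totals = {"ice": 1000, "steam": 1000}  # total co-occurrence count of each target word

def cooc_prob(target, context):
    """Estimate P(context | target) from raw co-occurrence counts."""
    return counts[(target, context)] / totals[target]

# Ratios of co-occurrence probabilities: large for "solid", small for "gas".
ratio_solid = cooc_prob("ice", "solid") / cooc_prob("steam", "solid")
ratio_gas = cooc_prob("ice", "gas") / cooc_prob("steam", "gas")
```

Words related to both targets (e.g. "water") or to neither (e.g. "fashion") would give ratios near 1, which is why the ratio, rather than the raw probability, is the informative quantity.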
2. Prerequisite
Global Vector model
GloVe is a log-bilinear model with a weighted least-squares objective.
[figure: the weighted least-squares regression objective of GloVe; plot of the weighting function with α = 3/4]
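The weighting function in the objective is f(x) = (x / x_max)^α for x < x_max and 1 otherwise; a minimal sketch with the defaults x_max = 100 and α = 3/4 reported in the GloVe paper:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting function f(x): down-weights rare co-occurrences
    (which are noisy) and caps the influence of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

In the objective, each squared error term (wi·w̃j + bi + b̃j - log Xij)² is multiplied by f(Xij), so word pairs that never co-occur (Xij = 0) contribute nothing.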
2. Prerequisite
word2vec model
The word2vec model is a simple single-layer neural network architecture that consists of two models, the continuous bag-of-words (CBOW) and skip-gram (SG) models of Mikolov et al. (2013a).
[figure: word2vec model architecture]
2. Prerequisite
word2vec model
The input of the skip-gram model is a single word wI and the output is the words in wI's context. In the sentence "I drove my car to the store", when the word "car" is given as the input, {"I", "drove", "my", "to", "the", "store"} is its output.
[figure: skip-gram neural network architecture]
2. Prerequisite
word2vec model
The input of the continuous bag-of-words model is the words in wI's context and the output is the single word wI; CBOW can be considered the reverse of the SG model. In the sentence "I drove my car to the store", when the words {"I", "drove", "my", "to", "the", "store"} are given as the input, "car" is its output.
[figure: continuous bag-of-words neural network architecture]
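The two directions described above can be sketched as follows: a toy illustration of how (input, output) training examples are read off a sentence. Real word2vec additionally samples a window size and trains the network, which is omitted here.

```python
sentence = "I drove my car to the store".split()

def skipgram_example(tokens, center_idx, window):
    """Skip-gram direction: the center word is the input,
    each word in the surrounding window is an output."""
    lo = max(0, center_idx - window)
    hi = min(len(tokens), center_idx + window + 1)
    context = [tokens[i] for i in range(lo, hi) if i != center_idx]
    return tokens[center_idx], context

def cbow_example(tokens, center_idx, window):
    """CBOW direction: the reverse of skip-gram. The context words
    are the input, the center word is the output."""
    center, context = skipgram_example(tokens, center_idx, window)
    return context, center

sg_in, sg_out = skipgram_example(sentence, sentence.index("car"), window=10)
cbow_in, cbow_out = cbow_example(sentence, sentence.index("car"), window=10)
```

With a window wide enough to cover the whole sentence, this reproduces the slide's example: "car" maps to {"I", "drove", "my", "to", "the", "store"} and vice versa.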
2. Prerequisite
project hypothesis
The Global Vector model is applicable to Korean data as a universal learning algorithm, and it fits Korean data better than the word2vec model, just as it does English data (measured by correlation on the word similarity task and accuracy on the word analogy task).
2. Prerequisite
summary
project target:
- Global Vector model (GloVe): an unsupervised learning algorithm trained from a global word-word co-occurrence matrix
- word2vec model: a neural network architecture consisting of the skip-gram and continuous bag-of-words models
project hypothesis:
- the Global Vector model fits Korean data well, though results can vary because of the peculiarities of Korean
- the Global Vector model is better suited to Korean data than the word2vec model
3. Corpus Construction
corpus construction
From one million sentences collected from the web, 3,552,280 words were extracted.
[figure: samples of Korean sentences; samples of Korean words]
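As a simplistic stand-in for the crawling-and-tokenization pipeline (the slides do not describe the actual tokenizer, so whitespace tokenization is an assumption), counting word types from collected sentences might look like:

```python
from collections import Counter

def build_vocab(sentences):
    """Count word tokens across whitespace-tokenized sentences
    (a toy stand-in for the project's corpus pipeline)."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab

# Toy Korean sentences, for illustration only.
vocab = build_vocab(["나는 학교에 간다", "나는 밥을 먹는다"])
```

The same counting over the one million crawled sentences yields the word list the models are trained on.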
3. Corpus Construction
summary
work no. 1.0: corpus construction; output: one million Korean sentences crawled from the web; date: 2015. 09. 01; detail: example of the Korean corpus
4. Building Evaluation Task
evaluation tasks in previous studies
Evaluation task sets previously used with GloVe: WordSim353 (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)
[figure: evaluation sets used in GloVe]
4. Building Evaluation Task
word similarity task in Korean
The word similarity task compares the similarity values of word pairs, obtained from the cosine similarity of the corresponding vectors, with human judgement scores. Human judgement was done by two graduate students majoring in linguistics.
[figure: cosine similarity and human-judgement score of '엄마' ("mom") and '어머니' ("mother")]
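The two quantities being compared can be sketched in plain Python: the cosine similarity between two word vectors, and the Pearson correlation used later to compare model similarities against human scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pearson(xs, ys):
    """Pearson correlation between model similarities and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For each of the evaluation pairs, `cosine` is computed on the trained vectors, and `pearson` is then taken over the full list of model similarities and human judgement scores.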
4. Building Evaluation Task
word similarity task in Korean
Korean is an agglutinative language.
[figure: the seven slots of a finite Korean verb]
4. Building Evaluation Task
word similarity task in Korean
Korean allows a word to take multiple particles; these derived forms should be recognized as the same word.
[figure: comparison of derived forms between Korean and English]
4. Building Evaluation Task
word similarity task in Korean
To reflect this agglutinative feature, we reset the vector of word i, wi, by building the set V of forms sharing the stem i and recalculating wi as a weighted sum of the elements of V.
[figure: visualization of recalculating the vector of the word '밥' ("rice") from its derived forms with frequencies 86662, 66577, 35627, 17660, 6781, 6614; formula for recalculating wi]
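A minimal sketch of this recalculation, assuming the weights are proportional to the corpus frequencies of the derived forms (the slide only states "weighted sum of the elements of V", so frequency-proportional weighting is an assumption):

```python
def synthesize(forms):
    """Recompute a stem's vector as the frequency-weighted average of the
    vectors of its derived forms. `forms` is a list of (frequency, vector)
    pairs; frequency-proportional weights are an assumption here."""
    total = sum(freq for freq, _ in forms)
    dim = len(forms[0][1])
    result = [0.0] * dim
    for freq, vec in forms:
        weight = freq / total
        for k in range(dim):
            result[k] += weight * vec[k]
    return result

# Toy 2-d vectors for two hypothetical derived forms of '밥', with their frequencies.
w = synthesize([(86662, [1.0, 0.0]), (66577, [0.0, 1.0])])
```

More frequent derived forms thus pull the synthesized stem vector more strongly toward their own positions.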
4. Building Evaluation Task
word similarity task in Korean
We constructed 819 word pairs based on their semantic relatedness and classified them into 4 categories; the vector synthesis method is either applied or not.
[figure: word categorization table]
4. Building Evaluation Task
word similarity task in Korean
Examples of pairs from the 4 categories.
[figure: word pair examples based on '감정' ("emotion"); from left to right, the categories are modifier-noun pairs, entailment pairs, relational pairs, and collocations]
4. Building Evaluation Task
word analogy task in Korean
The word analogy task tries to answer "a is to b as c is to __?" by finding the word d whose representation wd is closest to wb - wa + wc.
[figure: word analogy test on syntactic word pairs]
4. Building Evaluation Task
word analogy task in Korean
3COSADD method: find the word d whose representation wd is closest to wb - wa + wc by cosine similarity.
3COSMUL method: a multiplicative variant of 3COSADD.
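Both methods can be sketched as follows. 3COSADD follows the slide's definition directly; the 3COSMUL form, with cosines shifted into [0, 1] and a small ε to avoid division by zero, follows Levy and Goldberg (2014). The toy vocabulary is purely illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def three_cos_add(vocab, a, b, c):
    """3COSADD: argmax over d of cos(wd, wb - wa + wc), excluding a, b, c."""
    dim = len(vocab[a])
    target = [vocab[b][k] - vocab[a][k] + vocab[c][k] for k in range(dim)]
    scores = {w: cosine(v, target) for w, v in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

def three_cos_mul(vocab, a, b, c, eps=1e-3):
    """3COSMUL: argmax over d of cos(wd, wb) * cos(wd, wc) / (cos(wd, wa) + eps),
    with each cosine shifted into [0, 1] so all factors are non-negative."""
    def sim(u, v):
        return (cosine(u, v) + 1.0) / 2.0
    scores = {w: sim(v, vocab[b]) * sim(v, vocab[c]) / (sim(v, vocab[a]) + eps)
              for w, v in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

# Toy vectors for the classic analogy "man is to king as woman is to __?".
toy = {
    "man": [1.0, 0.0, 0.0], "king": [1.0, 1.0, 0.0],
    "woman": [1.0, 0.0, 1.0], "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 5.0],
}
```

3COSMUL tends to be more robust than 3COSADD when one of the three cosine terms dominates, since multiplication balances the contribution of each term.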
4. Building Evaluation Task
word analogy task in Korean
We constructed 90 word quadruplets based on their semantic relatedness and classified them into 2 categories: 48 semantic analogies and 42 syntactic analogies. Both analogy calculation methods (3COSADD, 3COSMUL) are applied.
4. Building Evaluation Task
summary
work no. 2.1: prerequisite; output: information about previous studies; date: 2015. 09. 01; detail: WordSim353 (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353)
work no. 2.2: word similarity task; output: 819 word pairs in 4 categories; date: 2015. 10. 11; detail: example of collocation pairs; vector synthesis (2015. 10. 10)
work no. 2.3: word analogy task; output: 90 word quadruplets in 2 categories; detail: example of semantic analogy quadruplets; 2 calculation methods
5. Training Vector Model
training vector model
We trained GloVe and word2vec (SG, CBOW) on the corpus from work 1.0, producing vector files of dimension 50, 100, 200, 300, 400, 500, and 1000.
[figure: word vector lists]
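Both GloVe and word2vec (in text mode) emit one word followed by its vector components per line; a minimal parser for such vector files might look like:

```python
def parse_vectors(lines):
    """Parse a text-format vector file: each line holds a word followed by
    its real-valued components, separated by spaces."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Toy lines in the text vector format (the values are made up).
vecs = parse_vectors(["밥 0.12 -0.30 0.07", "어머니 0.55 0.10 -0.21"])
```

In practice the lines would be read from one of the trained vector files, and the dimension would match the training setting (50 through 1000).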
5. Training Vector Model
summary
work no. 3.0: trained vector models; output: vector files of various dimensions trained by GloVe or word2vec; date: 2015. 11. 19; detail: example of a vector file
6. Result Analysis
similarity task result
The best performance, measured by Pearson correlation, was achieved at dimension 500 with vector synthesis applied.
[figure: correlation coefficients across dimensions, before and after vector synthesis]
6. Result Analysis
similarity task result
Comparison between word2vec and GloVe.
[figure: correlation coefficients of word2vec and GloVe]
6. Result Analysis
analogy task result
On the semantic analogy task, the 3COSADD method at dimension 1000 achieved the highest accuracy; on the syntactic task, the 3COSADD method at dimension 50 was highest.
[figures: accuracy on the semantic analogy task; accuracy on the syntactic analogy task]
6. Result Analysis
analogy task result
Comparison between word2vec and GloVe.
[figure: accuracy on the analogy task of word2vec and GloVe]
6. Result Analysis
summary
work no. 4.1: similarity task result; output: GloVe correlations across dimensions; date: 2015. 11. 14; detail: GloVe reached a Pearson correlation coefficient of 0.3133 at dimension 500; in comparison, CBOW reached 0.2637 and SG 0.2177
work no. 4.2: analogy task result; output: GloVe accuracy across dimensions; date: 2015. 10. 20; detail: GloVe reached 69% accuracy on the semantic task and 64% on the syntactic task (overall 67%); CBOW reached 75% semantic and 57% syntactic (overall 66%), while SG reached 65% semantic and 48% syntactic (overall 57%)
7. Project Output
"A Study on Word Vector Models for Representing Korean Semantic Information"