Published by Claud Edwards. Modified over 5 years ago.
1
Word2vec Tutorial
Zhe Ye
yezhejack@sjtu.edu.cn
2017.9.28
2
Outline
One-hot representation vs word vectors
Requirement
  Virtual environment
  Python
  Corpus
  Gensim
Training word vectors
Evaluation
  Analogy
  Word clustering
3
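One-hot representation vs word vectors
One-hot representation
  Sparse: uses 3000K dimensions to represent a vocabulary of 3000K word types
  Not related: the vectors of any two different words are orthogonal
Word vectors
  Dense: uses 300 (or fewer) dimensions to represent a vocabulary of 3000K word types
  Related: the vectors of similar words are close to each other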
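The contrast on this slide can be illustrated with a small sketch. The three-word vocabulary and the dense vectors below are hand-picked assumptions for illustration only, not trained values, and `one_hot` and `cosine` are hypothetical helper names.

```python
import math

def one_hot(index, vocab_size):
    """Return a one-hot vector: all zeros except a 1 at `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["happy", "glad", "table"]

# One-hot: one dimension per word type, so the vector length equals the vocabulary size.
happy_1h = one_hot(0, len(vocab))
glad_1h = one_hot(1, len(vocab))

# Hypothetical dense vectors (made-up numbers, not the output of any model).
dense = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

# Different one-hot vectors are always orthogonal, so their similarity is zero.
print(cosine(happy_1h, glad_1h))
# Dense vectors can place similar words closer together than unrelated ones.
print(cosine(dense["happy"], dense["glad"]))
print(cosine(dense["happy"], dense["table"]))
```

With one-hot vectors, "happy" is exactly as far from "glad" as from "table". With dense vectors, the similarity scores can reflect how related the words are.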
5
Virtual environment
Features
  Provides separate dependency libraries
  Does not require admin or sudo rights to install packages
Two well-known tools that provide these features
  Anaconda: convenient on Windows (scipy)
  Virtualenv
6
Python
It is very popular in NLP (and data science)
It is simple and easy to understand
Version: Python 2.7
7
Corpus
Tokenized plain text (Chinese and English are both OK)
  我们 很 高兴
  We are very happy .
Tokenized plain text resource
  atmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
Tokenizer
  LTP for Chinese
  Stanford Tokenizer
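A tokenized plain-text corpus of this kind can be read as one sentence per line, with tokens separated by whitespace. The sketch below assumes UTF-8 files. The helper name `read_tokenized` and the temporary file are hypothetical, used only to make the example self-contained.

```python
import io
import os
import tempfile

def read_tokenized(path):
    """Yield one list of tokens per non-empty line of a tokenized text file."""
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()  # tokens are already separated by whitespace
            if tokens:
                yield tokens

# Write the two example sentences from the slide to a temporary file.
sample = u"我们 很 高兴\nWe are very happy .\n"
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

sentences = list(read_tokenized(path))
for sentence in sentences:
    print(len(sentence))
```

Because the text is already tokenized, the same reader works for both Chinese and English input.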
8
Gensim
Implements a wrapper for word2vec
It provides a Python API
10
Training word vectors
Linux + virtualenv + gensim is recommended
Windows 10 (64-bit) + Anaconda + gensim is OK
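Training with gensim might look like the following minimal sketch. It assumes gensim 4.x, where the dimensionality parameter is called `vector_size`. Older releases, such as those that ran on Python 2.7, called it `size`. The function names and the `window`, `min_count` and `workers` values are illustrative assumptions, not settings taken from the slides. Only the 300 dimensions echo the earlier slide.

```python
def default_params(dim=300):
    """Illustrative hyperparameters (assumed values, not taken from the slides)."""
    return {
        "vector_size": dim,  # dimensionality of the dense word vectors
        "window": 5,         # context words considered on each side
        "min_count": 5,      # ignore word types rarer than this
        "workers": 4,        # number of training threads
    }

def train_word_vectors(corpus_path, dim=300):
    """Train word2vec on a tokenized corpus with one sentence per line."""
    # Imported lazily so this file still loads where gensim is not installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence(corpus_path)  # streams whitespace-separated tokens
    return Word2Vec(sentences=sentences, **default_params(dim))

# Example usage (requires gensim and a corpus file; the file names are hypothetical):
#   model = train_word_vectors("corpus.txt")
#   model.save("word2vec.model")
#   print(model.wv["happy"])
```

`LineSentence` streams the corpus from disk, so the whole file does not have to fit in memory.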
12
Evaluation
Analogy
  vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')
Word clustering
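The analogy test on this slide can be sketched as vector arithmetic followed by a nearest-neighbour search under cosine similarity. The two-dimensional vectors below are made-up assumptions chosen so the arithmetic is easy to follow, and `analogy` is a hypothetical helper, not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors: axis 0 marks "is a capital", axis 1 identifies the country.
toy = {
    "Paris": [1.0, 1.0],
    "France": [0.0, 1.0],
    "Rome": [1.0, 2.0],
    "Italy": [0.0, 2.0],
    "Berlin": [1.0, 3.0],
    "Germany": [0.0, 3.0],
}

print(analogy(toy, "Paris", "France", "Italy"))  # prints Rome
```

With a trained gensim model, the equivalent query would be `model.wv.most_similar(positive=["Paris", "Italy"], negative=["France"])`, assuming those words are in the model's vocabulary.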