Ratsgo Word2vec Practice

1 Overview

Ratsgo Word2vec Practice
Ratsgo Word2Vec Skip-gram Model Training Practice

2 Environment and data preparation

2.1 Run Docker and git checkout

  • The Docker image is ratsgo/embedding-cpu:1.4.
  • The source code is pinned to tag v1.0.1 of the git repo ratsgo/embedding for this test.
C:\Users\jmnote>docker run -it --rm --hostname=ratsgo ratsgo/embedding-cpu:1.4 bash
root@ratsgo:/notebooks/embedding# git remote -v
origin  https://github.com/ratsgo/embedding.git (fetch)
origin  https://github.com/ratsgo/embedding.git (push)
root@ratsgo:/notebooks/embedding# git pull
remote: Enumerating objects: 197, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 702 (delta 172), reused 174 (delta 153), pack-reused 505
Receiving objects: 100% (702/702), 173.51 KiB | 0 bytes/s, done.
Resolving deltas: 100% (495/495), completed with 22 local objects.
... (omitted)
 create mode 100644 models/xlnet/prepro_utils.py
 create mode 100644 models/xlnet/train_gpu.py
 create mode 100644 models/xlnet/xlnet.py
root@ratsgo:/notebooks/embedding# git checkout tags/v1.0.1
Note: checking out 'tags/v1.0.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at ead260a... [python] #14 튜토리얼 페이지 개선

2.2 Download tokenized data

root@ratsgo:/notebooks/embedding# bash preprocess.sh dump-tokenized
download tokenized data...
... (omitted)
2020-05-29 12:57:58 (649 KB/s) - ‘/notebooks/embedding/data/tokenized.zip’ saved [872719683]

Archive:  tokenized.zip
   creating: tokenized/
  inflating: tokenized/korquad_mecab.txt
  inflating: tokenized/wiki_ko_mecab.txt
  inflating: tokenized/corpus_mecab_jamo.txt
  inflating: tokenized/ratings_okt.txt
  inflating: tokenized/ratings_khaiii.txt
  inflating: tokenized/ratings_hannanum.txt
  inflating: tokenized/ratings_soynlp.txt
  inflating: tokenized/ratings_mecab.txt
  inflating: tokenized/ratings_komoran.txt
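
Each file under data/tokenized/ is expected to hold one tokenized sentence per line, with morphemes separated by single spaces. A quick way to sanity-check the format before training is a short snippet like the one below (a minimal sketch; the chosen file and the one-sentence-per-line assumption come from the listing above, not from the repo's documentation).

# peek at the tokenized corpus format (illustrative helper, not part of the repo)
path = "data/tokenized/ratings_mecab.txt"
with open(path, encoding="utf-8") as f:
    first = f.readline().strip()
print("first sentence:", first)
print("token count:", len(first.split(" ")))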

3 Word2vec practice

3.1 Merge tokenized data

root@ratsgo:/notebooks/embedding# cat data/tokenized/wiki_ko_mecab.txt data/tokenized/ratings_mecab.txt data/tokenized/korquad_mecab.txt > data/tokenized/corpus_mecab.txt
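
The merged corpus is much larger than ratings_mecab.txt alone, which is worth confirming before training on it. The sketch below just counts the number of sentences (lines) per file; the file names are the ones created above, and no particular counts are assumed.

# compare corpus sizes by counting sentences (lines) per file
files = [
    "data/tokenized/wiki_ko_mecab.txt",
    "data/tokenized/ratings_mecab.txt",
    "data/tokenized/korquad_mecab.txt",
    "data/tokenized/corpus_mecab.txt",
]
for path in files:
    with open(path, encoding="utf-8") as f:
        print(path, sum(1 for _ in f))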

3.2 Train the Word2vec skip-gram model (failed)

  • Training on the merged data (corpus_mecab.txt) ran for about 15 minutes and then the process died... (a possible workaround is sketched after the transcript below)
root@ratsgo:/notebooks/embedding# mkdir -p data/word-embeddings/word2vec
root@ratsgo:/notebooks/embedding# python models/word_utils.py --method train_word2vec --input_path data/tokenized/corpus_mecab.txt --output_path data/word-embeddings/word2vec/word2vec
(process killed)
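
The most likely cause is that the merged corpus exhausts the container's memory and the process gets killed, but the log above does not confirm this. If memory is the problem, one workaround is to skip models/word_utils.py and train with gensim directly, streaming sentences from disk via LineSentence instead of loading the whole corpus at once. The sketch below assumes the gensim 3.x API that ships in the ratsgo/embedding-cpu image; the parameters (100 dimensions, sg=1 for skip-gram) are assumptions chosen to match the dim=100 evaluator used in section 4.2.

# memory-lean skip-gram training sketch (not the repo's own script)
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("data/tokenized/corpus_mecab.txt")  # streams one sentence per line
model = Word2Vec(corpus, size=100, workers=4, sg=1)       # sg=1: skip-gram; gensim 3.x uses size=
model.save("data/word-embeddings/word2vec/word2vec")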

4 Word2vec quick practice

  • Without merging the tokenized data, training on ratings_mecab.txt alone finished in about 30 seconds.

4.1 Train the Word2vec skip-gram model (successful)

root@ratsgo:/notebooks/embedding# mkdir -p data/word-embeddings/word2vec
root@ratsgo:/notebooks/embedding# python models/word_utils.py --method train_word2vec --input_path data/tokenized/ratings_mecab.txt --output_path data/word-embeddings/word2vec/word2vec
/usr/local/lib/python3.5/dist-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
root@ratsgo:/notebooks/embedding# ll data/word-embeddings/word2vec/word2vec
-rw-r--r-- 1 root root 23988795 Jun 30 07:23 data/word-embeddings/word2vec/word2vec

4.2 Test example

cat <<EOF > word2vec_test.py
from models.word_eval import WordEmbeddingEvaluator
model = WordEmbeddingEvaluator("data/word-embeddings/word2vec/word2vec", method="word2vec", dim=100, tokenizer_name="mecab")
print( model.most_similar("희망", topn=5) )
EOF
root@ratsgo:/notebooks/embedding# cat <<EOF > word2vec_test.py
> from models.word_eval import WordEmbeddingEvaluator
> model = WordEmbeddingEvaluator("data/word-embeddings/word2vec/word2vec", method="word2vec", dim=100, tokenizer_name="mecab")
> print( model.most_similar("희망", topn=5) )
> EOF
root@ratsgo:/notebooks/embedding# cat word2vec_test.py 
from models.word_eval import WordEmbeddingEvaluator
model = WordEmbeddingEvaluator("data/word-embeddings/word2vec/word2vec", method="word2vec", dim=100, tokenizer_name="mecab")
print( model.most_similar("희망", topn=5) )
root@ratsgo:/notebooks/embedding# python word2vec_test.py 
/usr/local/lib/python3.5/dist-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
[('깨달음', 0.76899534), ('가르침', 0.7657757), ('절망', 0.758234), ('꿈', 0.7523754), ('기쁨', 0.75105363)]
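
The neighbors of 희망 ("hope") look plausible: 깨달음 ("enlightenment"), 가르침 ("teaching"), 절망 ("despair"), 꿈 ("dream"), 기쁨 ("joy"). As a cross-check, the saved file can also be opened with gensim directly instead of WordEmbeddingEvaluator. This assumes the file was written with gensim's model.save(); if it were in word2vec text/binary format, KeyedVectors.load_word2vec_format would be needed instead.

# load the trained model with gensim and repeat the query (gensim 3.x API)
from gensim.models import Word2Vec

model = Word2Vec.load("data/word-embeddings/word2vec/word2vec")
print("vocab size:", len(model.wv.vocab))    # gensim 3.x; 4.x uses model.wv.key_to_index
print("vector dim:", model.wv.vector_size)   # should be 100
print(model.wv.most_similar("희망", topn=5))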

5 See also

6 References
