Ratsgo BERT Practice


Overview

A hands-on walkthrough of the BERT tutorial in ratsgo's embedding repository: preparing data, building a vocabulary, creating pretraining data, and pretraining a BERT model inside the ratsgo/embedding-cpu Docker image.

Environment and data preparation

Running Docker and git checkout

  • The Docker image is ratsgo/embedding-cpu:1.4.
  • The source code is pinned to tag v1.0.1 of the git repo ratsgo/embedding for this test.
C:\Users\jmnote>docker run -it --rm --hostname=ratsgo ratsgo/embedding-cpu:1.4 bash
root@ratsgo:/notebooks/embedding# git pull
remote: Enumerating objects: 197, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 702 (delta 172), reused 174 (delta 153), pack-reused 505
Receiving objects: 100% (702/702), 173.51 KiB | 0 bytes/s, done.
Resolving deltas: 100% (495/495), completed with 22 local objects.
... (omitted)
 create mode 100644 models/xlnet/prepro_utils.py
 create mode 100644 models/xlnet/train_gpu.py
 create mode 100644 models/xlnet/xlnet.py
root@ratsgo:/notebooks/embedding# git checkout tags/v1.0.1
Note: checking out 'tags/v1.0.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at ead260a... [python] #14 improve tutorial page
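
Because the container is started with --rm, anything created inside it (downloaded corpora, vocabularies, checkpoints) disappears when the shell exits. If the work should survive restarts, one option is to drop --rm and give the container a name so it can be reattached later. A minimal sketch (the name bert-practice is arbitrary and not part of the original walkthrough):

C:\Users\jmnote>docker run -it --hostname=ratsgo --name=bert-practice ratsgo/embedding-cpu:1.4 bash
... work inside the container, then exit ...
C:\Users\jmnote>docker start -ai bert-practice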

Downloading the processed data

root@ratsgo:/notebooks/embedding# bash preprocess.sh dump-processed
download processed data...
... (omitted)
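
dump-processed fetches the already-preprocessed corpora into data/processed. Before moving on, it is worth confirming that the movie-review corpus used in the next step actually arrived; a quick check (a sketch, output omitted):

root@ratsgo:/notebooks/embedding# ls -lh data/processed/ | head
root@ratsgo:/notebooks/embedding# wc -l data/processed/corrected_ratings_corpus.txt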

BERT practice

BERT data preprocessing

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/pretrain-data
root@ratsgo:/notebooks/embedding# python preprocess/dump.py --preprocess_mode process-documents --input_path data/processed/corrected_ratings_corpus.txt --output_path data/processed/pretrain.txt
root@ratsgo:/notebooks/embedding# ll data/processed/pretrain.txt 
-rw-r--r-- 1 root root 17768815 Jun 30 06:19 data/processed/pretrain.txt
root@ratsgo:/notebooks/embedding# split -l 300000 data/processed/pretrain.txt data/sentence-embeddings/pretrain-data/data_
root@ratsgo:/notebooks/embedding# ll data/sentence-embeddings/pretrain-data/data_*
-rw-r--r-- 1 root root 10850925 Jun 30 06:20 data/sentence-embeddings/pretrain-data/data_aa
-rw-r--r-- 1 root root  6917890 Jun 30 06:20 data/sentence-embeddings/pretrain-data/data_ab
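
dump.py (process-documents mode) rewrites the review corpus into pretrain.txt, and split -l 300000 then cuts it into 300,000-line shards named data_aa, data_ab, ... so they can be fed to create_pretraining_data.py as separate files. To confirm the shards cover the whole file (a sketch):

root@ratsgo:/notebooks/embedding# wc -l data/processed/pretrain.txt data/sentence-embeddings/pretrain-data/data_*
root@ratsgo:/notebooks/embedding# cat data/sentence-embeddings/pretrain-data/data_* | cmp - data/processed/pretrain.txt && echo "shards match the original"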

Building the BERT vocabulary

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/bert/pretrain-ckpt
root@ratsgo:/notebooks/embedding# python preprocess/unsupervised_nlputils.py --preprocess_mode make_bert_vocab --input_path data/processed/pretrain.txt --vocab_path data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt
sentencepiece_trainer.cc(116) LOG(INFO) Running command: --input=data/processed/pretrain.txt --model_prefix=sentpiece --vocab_size=32000 --model_type=bpe --character_coverage=1.0
sentencepiece_trainer.cc(49) LOG(INFO) Starts training with : 
TrainerSpec {
  input: data/processed/pretrain.txt
  input_format: 
  model_prefix: sentpiece
  model_type: BPE
  vocab_size: 32000
...
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=11 size=28700 all=601937 active=30403 piece=▁베네
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=11 min_freq=5
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=11 size=28720 all=602015 active=30172 piece=▁상속
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=11 size=28740 all=602091 active=30248 piece=▁슨상
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=11 size=28760 all=602161 active=30318 piece=▁안지
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=11 size=28780 all=602216 active=30373 piece=▁예견
trainer_interface.cc(507) LOG(INFO) Saving model: sentpiece.model
trainer_interface.cc(531) LOG(INFO) Saving vocabs: sentpiece.vocab
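
make_bert_vocab trains a SentencePiece BPE model with the options logged above (vocab_size=32000, character_coverage=1.0) and converts the result into BERT's vocab.txt format, adding special tokens at the front of the vocabulary; this is consistent with the examples below, where [CLS], [SEP] and [MASK] map to ids 2, 3 and 4, and with the vocab_size of 32006 (32000 + 6 extra entries) in bert_config.json. A quick look at the output (a sketch; the exact special-token list may differ by repo version):

root@ratsgo:/notebooks/embedding# wc -l data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt
root@ratsgo:/notebooks/embedding# head -n 10 data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt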

Building the BERT training data

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/bert/pretrain-ckpt/traindata
root@ratsgo:/notebooks/embedding# python models/bert/create_pretraining_data.py --input_file data/sentence-embeddings/pretrain-data/* --output_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord --vocab_file=data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt --do_lower_case=False --max_seq_length=128 --max_predictions_per_seq=20 --masked_lm_prob=0.15 --random_seed=7 --dupe_factor=5
INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:  data/sentence-embeddings/pretrain-data/data_aa
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow:  data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] 아직도 잊혀 ##지지 않는 ##다 [SEP] [MASK] [MASK] [SEP]
INFO:tensorflow:input_ids: 2 7154 15237 9164 13274 22 3 4 4 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 7 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 28219 10129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
...
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] [MASK] ##진짜 ##귀여움 [SEP] 멜로 ##고 뭐 ##고 . . [MASK] [SEP]
INFO:tensorflow:input_ids: 2 4 55 5498 3 1254 154 28997 154 28799 28799 4 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 1 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 18107 28799 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
INFO:tensorflow:Wrote 715564 total instances
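
Note that only data_aa appears under "Reading from input files" even though the split produced data_aa and data_ab: the unquoted glob data/sentence-embeddings/pretrain-data/* is expanded by the shell, so only the first match is bound to --input_file and data_ab is silently ignored. Assuming the script keeps Google BERT's behaviour of globbing each comma-separated --input_file pattern itself, quoting the pattern makes both shards count; a sketch:

root@ratsgo:/notebooks/embedding# python models/bert/create_pretraining_data.py --input_file "data/sentence-embeddings/pretrain-data/*" --output_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord --vocab_file=data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt --do_lower_case=False --max_seq_length=128 --max_predictions_per_seq=20 --masked_lm_prob=0.15 --random_seed=7 --dupe_factor=5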

Checking the BERT model hyperparameters

  • If the json file does not exist, create it first (the heredoc under "Writing the BERT model hyperparameters" in the shortened practice below shows one way).
root@ratsgo:/notebooks/embedding# cat /notebooks/embedding/data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json
{
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": 32006
}
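
The settings above are the standard BERT-base layout (12 layers, 12 attention heads, hidden size 768). The one field tied to this tutorial's data is vocab_size: it should match the number of entries in the vocab.txt built earlier, otherwise run_pretraining.py builds an embedding matrix of the wrong size. A one-line consistency check (a sketch):

root@ratsgo:/notebooks/embedding# python -c "import json; cfg = json.load(open('data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json')); n = sum(1 for _ in open('data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt')); print(cfg['vocab_size'], n)"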

BERT model pretraining

root@ratsgo:/notebooks/embedding# python models/bert/run_pretraining.py --input_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord* --output_dir=data/sentence-embeddings/bert/pretrain-ckpt --do_train=True --do_eval=True --bert_config_file=data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json --train_batch_size=32 --max_seq_length=128 --max_predictions_per_seq=20 --learning_rate=2e-5
INFO:tensorflow:*** Input Files ***
INFO:tensorflow:  data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f5811af1488>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_log_step_count_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': 1000, '_num_ps_replicas': 0, '_master': '', '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f5811af3908>, '_save_summary_steps': 100, '_train_distribute': None, '_keep_checkpoint_max': 5, '_experimental_distribute': None, '_is_chief': True, '_cluster': None, '_protocol': None, '_service': None, '_evaluation_master': '', '_tf_random_seed': None, '_save_checkpoints_secs': None, '_eval_distribute': None, '_task_id': 0, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_model_dir': 'data/sentence-embeddings/bert/pretrain-ckpt', '_device_fn': None, '_global_id_in_cluster': 0, '_task_type': 'worker', '_num_worker_replicas': 1}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Batch size = 32
WARNING:tensorflow:From models/bert/run_pretraining.py:369: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From models/bert/run_pretraining.py:386: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (32, 128)
INFO:tensorflow:  name = input_mask, shape = (32, 128)
INFO:tensorflow:  name = masked_lm_ids, shape = (32, 20)
INFO:tensorflow:  name = masked_lm_positions, shape = (32, 20)
INFO:tensorflow:  name = masked_lm_weights, shape = (32, 20)
INFO:tensorflow:  name = next_sentence_labels, shape = (32, 1)
INFO:tensorflow:  name = segment_ids, shape = (32, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (32006, 768)
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768)
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768)
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_0/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/key/kernel:0, shape = (768, 768)
...
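
run_pretraining.py writes checkpoints (model.ckpt-*) and summary event files into --output_dir every 1000 steps (_save_checkpoints_steps in the config dump above), so progress can be watched while the CPU run grinds along. A sketch, assuming TensorBoard is available inside the image; reaching it from the host would also require publishing the port (e.g. -p 6006:6006) in the docker run command, which the command above does not do:

root@ratsgo:/notebooks/embedding# ls data/sentence-embeddings/bert/pretrain-ckpt/
root@ratsgo:/notebooks/embedding# tensorboard --logdir data/sentence-embeddings/bert/pretrain-ckpt --port 6006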


BERT shortened practice

BERT data preprocessing

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/pretrain-data
root@ratsgo:/notebooks/embedding# python preprocess/dump.py --preprocess_mode process-documents --input_path data/processed/corrected_ratings_test.txt --output_path data/processed/pretrain.txt
root@ratsgo:/notebooks/embedding# ll data/processed/pretrain.txt 
-rw-r--r-- 1 root root 4656824 Jun 30 10:30 data/processed/pretrain.txt
root@ratsgo:/notebooks/embedding# split -l 100000 data/processed/pretrain.txt data/sentence-embeddings/pretrain-data/data_
root@ratsgo:/notebooks/embedding# ll data/sentence-embeddings/pretrain-data/data_*
-rw-r--r-- 1 root root 3807410 Jun 30 10:31 data/sentence-embeddings/pretrain-data/data_aa
-rw-r--r-- 1 root root  849414 Jun 30 10:31 data/sentence-embeddings/pretrain-data/data_ab

Building the BERT vocabulary

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/bert/pretrain-ckpt
root@ratsgo:/notebooks/embedding# python preprocess/unsupervised_nlputils.py --preprocess_mode make_bert_vocab --input_path data/processed/pretrain.txt --vocab_path data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt
sentencepiece_trainer.cc(116) LOG(INFO) Running command: --input=data/processed/pretrain.txt --model_prefix=sentpiece --vocab_size=32000 --model_type=bpe --character_coverage=1.0
sentencepiece_trainer.cc(49) LOG(INFO) Starts training with : 
TrainerSpec {
  input: data/processed/pretrain.txt
  input_format: 
  model_prefix: sentpiece
  model_type: BPE
  vocab_size: 32000
...
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=3 min_freq=2
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29520 all=253919 active=12703 piece=▁1:1
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29540 all=253916 active=12700 piece=▁All
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29560 all=253908 active=12692 piece=▁ᄡ같은
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29580 all=253904 active=12688 piece=▁각시탈
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29600 all=253896 active=12680 piece=▁강추다
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=3 min_freq=2
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29620 all=253889 active=12687 piece=▁거꾸로
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3 size=29640 all=253877 active=12675 piece=▁것일뿐
trainer_interface.cc(507) LOG(INFO) Saving model: sentpiece.model
trainer_interface.cc(531) LOG(INFO) Saving vocabs: sentpiece.vocab

Building the BERT training data

root@ratsgo:/notebooks/embedding# mkdir -p data/sentence-embeddings/bert/pretrain-ckpt/traindata
root@ratsgo:/notebooks/embedding# python models/bert/create_pretraining_data.py --input_file data/sentence-embeddings/pretrain-data/* --output_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord --vocab_file=data/sentence-embeddings/bert/pretrain-ckpt/vocab.txt --do_lower_case=False --max_seq_length=128 --max_predictions_per_seq=20 --masked_lm_prob=0.15 --random_seed=7 --dupe_factor=5
INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:  data/sentence-embeddings/pretrain-data/data_aa
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow:  data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord
INFO:tensorflow:*** Example ***
...
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] ; ; ; ; ␞0 [SEP] 일본 [MASK] 공포영화 ##를 시도 ##한듯 ##한 ##돈벌 . 깜짝 [MASK] ##래 ##키 ##는 ##것 갯 없는 일 ##회 ##용 공포영화 ##0 [SEP]
INFO:tensorflow:input_ids: 2 29783 29783 29783 29783 6 3 4045 4 9018 4911 11624 14633 44 7334 29663 11541 4 26144 1202 4909 111 31308 290 29757 1850 907 9018 561 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 8 14 17 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 1889 2145 7093 1222 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] 이게 [MASK] 나오냐 ? [SEP] 비 ##리 [MASK] 사회 ##가 무슨 자랑 ##이라고 [MASK] ##벌 ? ␞0 [SEP]
INFO:tensorflow:input_ids: 2 1874 4 28751 29718 3 29773 393 4 2066 53 2882 8674 6800 4 1201 29718 6 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 2 8 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 19297 24373 30147 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 0
INFO:tensorflow:Wrote 238514 total instances

Writing the BERT model hyperparameters

cat <<EOF > data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json
{
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": 32006
}
EOF
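
Since the heredoc is typed by hand, it is worth confirming that the file parses as valid JSON before pointing run_pretraining.py at it, for example:

root@ratsgo:/notebooks/embedding# python -m json.tool data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json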

BERT model pretraining

root@ratsgo:/notebooks/embedding# python models/bert/run_pretraining.py --input_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord* --output_dir=data/sentence-embeddings/bert/pretrain-ckpt --do_train=True --do_eval=True --bert_config_file=data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json --train_batch_size=16 --max_seq_length=128 --max_predictions_per_seq=5 --learning_rate=2e-2
...
2020-06-30 11:25:42.538179: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_weights.  Can't parse serialized Example.
2020-06-30 11:25:42.537524: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: masked_lm_weights.  Can't parse serialized Example.
INFO:tensorflow:Finished evaluation at 2020-06-30-11:25:42
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.0, masked_lm_accuracy = 0.0, masked_lm_loss = 0.0, next_sentence_accuracy = 0.0, next_sentence_loss = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1: data/sentence-embeddings/bert/pretrain-ckpt/model.ckpt-1
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  global_step = 1
INFO:tensorflow:  loss = 0.0
INFO:tensorflow:  masked_lm_accuracy = 0.0
INFO:tensorflow:  masked_lm_loss = 0.0
INFO:tensorflow:  next_sentence_accuracy = 0.0
INFO:tensorflow:  next_sentence_loss = 0.0
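
The "Can't parse serialized Example" warnings above are not harmless noise: the TFRecords were built with --max_predictions_per_seq=20, but this run passes --max_predictions_per_seq=5, so run_pretraining.py tries to parse masked_lm_positions/ids/weights as fixed-length-5 features and every record fails, which is consistent with the all-zero eval metrics reported at the end. Re-running with the value used at data-creation time (and a more conventional learning rate, as in the full run above) should avoid this; a sketch:

root@ratsgo:/notebooks/embedding# python models/bert/run_pretraining.py --input_file=data/sentence-embeddings/bert/pretrain-ckpt/traindata/tfrecord* --output_dir=data/sentence-embeddings/bert/pretrain-ckpt --do_train=True --do_eval=True --bert_config_file=data/sentence-embeddings/bert/pretrain-ckpt/bert_config.json --train_batch_size=16 --max_seq_length=128 --max_predictions_per_seq=20 --learning_rate=2e-5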

See also