Transformers

pipeline()

Used for model inference:
from transformers import pipeline
pipe = pipeline("text-classification")
pipe("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998743534088135}]

With from transformers, functions are imported from the transformers repository on GitHub:

Everything imported from transformers is implemented under src/transformers in that repository (see the screenshot above).

To see where an imported function comes from, check __init__.py: as in the screenshot below, pipeline is listed there as from .pipelines import pipeline.


As the left screenshot shows, the pipeline function is imported from the pipelines folder.
If you open that folder, the pipeline function is defined there (right screenshot), and its signature matches the transformers.pipeline docs.



Auto Classes

The from_pretrained() method loads a pretrained model so it can be used for inference. The Auto Classes do this work for you: given a checkpoint name, AutoConfig, AutoModel, and AutoTokenizer automatically create the matching pretrained class:
 ex)

from transformers import AutoConfig, AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")




∙ AutoConfig

A generic configuration class: calling the from_pretrained() class method instantiates it as one of the library's configuration classes.
This class cannot be instantiated directly with '__init__()'!!

Following the files under transformers/src (screenshot above) eventually leads to the from_pretrained() function.
In the GitHub code (rightmost screenshot), you can see that __init__() simply raises EnvironmentError.
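The behaviour described above can be sketched in plain Python. This is only an illustration, not the library's code: ToyAutoConfig, ToyBertConfig, ToyRobertaConfig and TOY_CONFIG_MAPPING are hypothetical names, and the sketch assumes the concrete class is picked from a model_type string.

```python
# Hypothetical stand-ins; none of these names come from the transformers source.
class ToyBertConfig:
    model_type = "bert"


class ToyRobertaConfig:
    model_type = "roberta"


# model_type string -> concrete configuration class
TOY_CONFIG_MAPPING = {
    "bert": ToyBertConfig,
    "roberta": ToyRobertaConfig,
}


class ToyAutoConfig:
    def __init__(self):
        # Direct construction is refused, as described in the text.
        raise EnvironmentError(
            "ToyAutoConfig is meant to be created with ToyAutoConfig.from_pretrained(...)"
        )

    @classmethod
    def from_pretrained(cls, config_dict):
        # Assumption: the dict plays the role of a loaded config.json.
        model_type = config_dict["model_type"]
        if model_type not in TOY_CONFIG_MAPPING:
            raise ValueError(f"Unrecognized model_type: {model_type}")
        return TOY_CONFIG_MAPPING[model_type]()


config = ToyAutoConfig.from_pretrained({"model_type": "bert"})
print(type(config).__name__)  # ToyBertConfig
```

The point of the design is that the caller never names the concrete class; the checkpoint's own metadata decides it.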

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config)


# BertConfig {
#   "_name_or_path": "bert-base-uncased",
#   "architectures": [
#     "BertForMaskedLM"
#   ],
#   "attention_probs_dropout_prob": 0.1,
#   "classifier_dropout": null,
#   "gradient_checkpointing": false,
#   "hidden_act": "gelu",
#   "hidden_dropout_prob": 0.1,
#   "hidden_size": 768,
#   "initializer_range": 0.02,
#   "intermediate_size": 3072,
#   "layer_norm_eps": 1e-12,
#   "max_position_embeddings": 512,
#   "model_type": "bert",
#   "num_attention_heads": 12,
#   "num_hidden_layers": 12,
#   "pad_token_id": 0,
#   "position_embedding_type": "absolute",
#   "transformers_version": "4.41.2",
#   "type_vocab_size": 2,
#   "use_cache": true,
#   "vocab_size": 30522
# }

Looking at the output above, the config is a JSON-style description of the model.
It holds architecture settings such as hidden_size, num_hidden_layers, and num_attention_heads,
plus settings such as the ID of a special token (pad_token_id). Training hyperparameters such as batch_size, learning_rate, and weight_decay are not stored here; in transformers those belong to TrainingArguments.
Also, save_pretrained() saves the config together with the model checkpoint!
So whenever you want to change a setting, or are proposing a new model variant, load the config directly!




∙ AutoTokenizer, (blobs, refs, snapshots)

A generic tokenizer class, created through the AutoTokenizer.from_pretrained() class method.
When created, it is instantiated as one of the library's tokenizer classes.
See also) https://chan4im.tistory.com/#%E2%88%99input-ids
This class cannot be instantiated directly with '__init__()'!!

Following the files under transformers/src (screenshot above) eventually leads to the from_pretrained() function.
In the GitHub code (rightmost screenshot), you can see that __init__() simply raises EnvironmentError.

from transformers import AutoTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# If vocabulary files are in a directory 
# (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

# Download vocabulary from huggingface.co and define model-specific arguments
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

tokenizer
# BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
# 	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# }

But isn't there one thing you're curious about?

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

When you run the code above, why does the console show download output for several files?

The short answer first:
tokenizer_config.json holds the tokenizer settings,
config.json holds the model settings,
vocab.txt holds the subwords, and
tokenizer.json is the serialized tokenizer (the vocabulary together with the configured values).


In my case, when cache_dir is set as below, a folder named hub appears in that directory, and the model-related files are created inside it.

parser.add_argument('--cache_dir', default="/data/huggingface_models")

Drilling down, you reach three folders: blobs, refs, snapshots
blobs: the file names are hash values. The folder contains files such as the following:
config-related files, model-related files, tokenizer settings in JSON, weight files, and so on; as in the screenshot below, many files are stored under hashed names:


 
refs: contains only a single file named main:

That file contains the same name as the directory inside snapshots.


snapshots: this is where it is!! You can see tokenizer_config.json, config.json, vocab.txt, and tokenizer.json here!!!



But doesn't something look odd??
The contents of the files shown under blobs and the contents of the files under snapshots are identical!!

Exactly! blobs holds the same content as snapshots, only stored under hash-valued names!!!
To be honest, after all this digging a quick Google search found the answer right away.. (https://huggingface.co/docs/huggingface_hub/v0.16.3/en/guides/manage-cache)
Summary of the Hugging Face explanation:

Refs: the refs folder contains files that indicate the latest revision of a given reference. For example, if you previously fetched a file from a repository's main branch, the refs folder contains a file named main, which holds the commit identifier of the current head.

If the latest commit has the identifier aaaaaa, the file contains aaaaaa.

If the same branch is updated with a new commit bbbbbb, re-downloading a file from that reference updates the refs/main file to bbbbbb.

Blobs: the blobs folder contains the files that were actually downloaded. The name of each file is its hash.

Snapshots: the snapshots folder contains symbolic links to the blobs mentioned above. It is made up of one folder per known revision.

cf) The cache also has the following tree structure:

    [  96]  .
    └── [ 160]  models--julien-c--EsperBERTo-small
        ├── [ 160]  blobs
        │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
        │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
        ├── [  96]  refs
        │   └── [  40]  main
        └── [ 128]  snapshots
            ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
            │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
            │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
            └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
                ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
                └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
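The refs -> snapshots -> blobs lookup can be tried out on a throwaway directory. The sketch below uses only the standard library; the repository name, the commit id "aaaaaa", the blob name and both helper functions are made up for illustration, and it assumes a filesystem that supports symbolic links.

```python
import os
import tempfile


def build_toy_cache(root):
    """Create a tiny blobs/refs/snapshots layout (all names are made up)."""
    repo = os.path.join(root, "models--toy--model")
    commit = "aaaaaa"
    blob_name = "d7edf6"  # stands in for a content hash
    os.makedirs(os.path.join(repo, "blobs"))
    os.makedirs(os.path.join(repo, "refs"))
    os.makedirs(os.path.join(repo, "snapshots", commit))
    # blobs/ holds the real bytes, named by hash
    with open(os.path.join(repo, "blobs", blob_name), "w") as f:
        f.write('{"model_type": "bert"}')
    # refs/main holds the commit id the branch points at
    with open(os.path.join(repo, "refs", "main"), "w") as f:
        f.write(commit)
    # snapshots/<commit>/<filename> is a relative symlink into blobs/
    os.symlink(
        os.path.join("..", "..", "blobs", blob_name),
        os.path.join(repo, "snapshots", commit, "config.json"),
    )
    return repo


def resolve(repo, filename, ref="main"):
    """Follow refs -> snapshots -> blobs and return the real file path."""
    with open(os.path.join(repo, "refs", ref)) as f:
        commit = f.read().strip()
    link = os.path.join(repo, "snapshots", commit, filename)
    return os.path.realpath(link)


with tempfile.TemporaryDirectory() as tmp:
    repo = build_toy_cache(tmp)
    real_path = resolve(repo, "config.json")
    resolved_name = os.path.basename(real_path)
    with open(real_path) as f:
        content = f.read()

print(resolved_name)  # d7edf6
```

This also explains why the same pytorch_model.bin blob can appear under two snapshot folders in the tree above: both revisions link to one stored file.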

 


∙ AutoModel

As before, the source can be tracked down the same way (screenshot below).

First, let's look at the AutoModel.from_config function.

from transformers import AutoConfig, AutoModel

# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_config(config)


@classmethod
def from_config(cls, config, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map
    has_local_code = type(config) in cls._model_mapping.keys()
    trust_remote_code = resolve_trust_remote_code(
        trust_remote_code, config._name_or_path, has_local_code, has_remote_code
    )

    if has_remote_code and trust_remote_code:
        class_ref = config.auto_map[cls.__name__]
        if "--" in class_ref:
            repo_id, class_ref = class_ref.split("--")
        else:
            repo_id = config.name_or_path
        model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
        if os.path.isdir(config._name_or_path):
            model_class.register_for_auto_class(cls.__name__)
        else:
            cls.register(config.__class__, model_class, exist_ok=True)
        _ = kwargs.pop("code_revision", None)
        return model_class._from_config(config, **kwargs)
    elif type(config) in cls._model_mapping.keys():
        model_class = _get_model_class(config, cls._model_mapping)
        return model_class._from_config(config, **kwargs)

    raise ValueError(
        f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
        f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    )

 


Data Collator

Data Collator

Groups a list of samples into the tensors of a "single training mini-batch".
Default Data Collator: as shown below, the Trainer uses data_collator to bundle train_dataset samples into mini-batches that are fed to the model for training.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)





batch["input_ids"], batch["labels"]?

Unlike the default above, most custom data collators look like the code below, which contains two slightly unfamiliar names, input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image, question, answer = example
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}]  # <-- confirmed that the data arrives correctly up to here.
            # tokenize=False keeps the rendered chat as a string; the processor tokenizes it below.
            text = self.processor.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            texts.append(text)
            images.append(image)

        # Pass the whole batch (texts / images), not only the last example.
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        # Padding positions are set to -100 so the loss ignores them.
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
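The label-masking step of such a collator can be shown without any library. A minimal sketch on plain lists, assuming a pad token ID of 0; make_labels and the sample IDs are hypothetical, while -100 is the value the collator above writes into padded positions.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def make_labels(input_ids, pad_token_id):
    """Copy input_ids and replace padding positions with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]


batch_input_ids = [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
labels = make_labels(batch_input_ids, pad_token_id=0)
print(labels)
# [[101, 7592, 102, -100, -100], [101, 7592, 2088, 999, 102]]
```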

So what exactly are batch["input_ids"] and batch["labels"]?

The data_collator mentioned earlier has the form shown below, and there too you can see inputs and labels.

Every model is different, but it shares similarities with other models
= most models take the same kinds of inputs!

∙ Input IDs

Input IDs are often "the only required parameter" passed to the model as input.
Input IDs are token indices: the numerical representation of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but "the underlying mechanism is the same".

ex)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"


The tokenizer splits the sequence (sentence) into tokens that exist in the tokenizer vocab:

tokenized_sequence = tokenizer.tokenize(sequence)


A token is either a word or a subword:

print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA" and "M".
# How do we mark that these tokens are parts of the same word rather than separate words?
# --> a double-hash (##) prefix is added in front of "RA" and "M".
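The splitting into "##" pieces can be imitated with a greedy longest-match-first loop over a single word. This is a simplified sketch under assumptions: the wordpiece function and toy_vocab are hypothetical, and a real BERT tokenizer also performs steps not shown here (whitespace and punctuation splitting, a much larger vocabulary).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"V", "##RA", "##M", "has", "24", "##GB"}
print(wordpiece("VRAM", toy_vocab))  # ['V', '##RA', '##M']
```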


inputs = tokenizer(sequence)


The tokens can now be converted into IDs that the model understands.
Calling the tokenizer returns a "dictionary" with input_ids as the key and the ID values as the value, which is the form the model works with:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

The tokenizer also adds "special tokens" automatically, depending on the model;
these are "special IDs" that the model sometimes uses.

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# Output: [CLS] A Titan RTX has 24GB of VRAM [SEP]





∙ Attention Mask

The attention mask is an optional argument used when batching sequences together;
it tells the model "which tokens should be attended to and which should not".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# Output: (8, 19)
The encoded lengths differ, so the two sequences "cannot be put into the same tensor" as they are.
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# Output: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# Output: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention mask is found under the "attention_mask" key of the dictionary returned by the tokenizer.
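What padding=True produces can be reproduced by hand on plain lists. A minimal sketch, assuming right-side padding and a pad token ID of 0 (both match the output above); pad_batch and the short ID lists are hypothetical.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1188, 102], [101, 1188, 1110, 170, 102]])
print(batch["input_ids"])       # [[101, 1188, 102, 0, 0], [101, 1188, 1110, 170, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```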


∙ Token Type IDs

Some models are built for sentence-pair classification or QA.
Such models need two "different sequences combined into a single input_ids" entry.
= This is done with special tokens such as [CLS] and [SEP].

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# Output: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
In the example above, the two sequences are passed to the tokenizer as two arguments, and it automatically builds the combined sentence shown.
For some models this is enough to tell where one sequence ends and the next begins.

Other models, however, also use token type IDs; the tokenizer returns this mask as token_type_ids.
encoded_dict['token_type_ids']
# Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 

The first sequence, the context used for the question, is set to all 0s,
and the second sequence, the question itself, is set to all 1s.
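The 0/1 pattern follows directly from how the pair is assembled. A small sketch on plain lists, assuming the BERT-style layout [CLS] A [SEP] B [SEP] with IDs 101 and 102 for [CLS] and [SEP] as in the outputs above; build_pair and the other IDs are made up.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two sequences as [CLS] A [SEP] B [SEP] and mark which segment each token is in."""
    first = [cls_id] + ids_a + [sep_id]   # segment 0 includes [CLS] and the first [SEP]
    second = ids_b + [sep_id]             # segment 1 includes the final [SEP]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }


pair = build_pair([7, 8, 9], [5, 6])
print(pair["input_ids"])       # [101, 7, 8, 9, 102, 5, 6, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```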


∙ Position IDs

RNN: the position of each token is built in.
Transformer: has no built-in awareness of each token's position ❌


∴ position_ids is an optional parameter the model uses to identify each token's position in the list of tokens.

If position_ids are not passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
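For the absolute case, the automatically created IDs are simply 0, 1, 2, ... per sequence. A tiny sketch of that default; default_position_ids is a hypothetical helper, and the limit of 512 is taken from the max_position_embeddings value in the BERT config printed earlier.

```python
def default_position_ids(batch_size, seq_length, max_position_embeddings=512):
    """Return 0..seq_length-1 for every row, as an absolute-position default."""
    if seq_length > max_position_embeddings:
        raise ValueError("sequence is longer than max_position_embeddings")
    return [list(range(seq_length)) for _ in range(batch_size)]


position_ids = default_position_ids(batch_size=2, seq_length=5)
print(position_ids)  # [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
```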




∙ Labels

Labels are an optional argument that can be passed so that the model computes the loss itself.
That is, the labels should be the model's expected predictions: the model uses a standard loss to compute the loss between its predictions and the expected values (labels).


The labels differ depending on the model head:

  • AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value of the batch corresponds to the expected label of the entire sequence.

  • AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value corresponds to the expected label of an individual token.

  • AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value corresponds to the expected label of an individual token: the labels are the token IDs of the masked tokens, and the rest are values to be ignored (usually -100).

  • Sequence-to-sequence models (e.g. BartForConditionalGeneration, loaded via AutoModelForSeq2SeqLM): the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value corresponds to the target sequence associated with each input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models that use the Encoder-Decoder framework. See each model's documentation for details on the labels of that specific model.

Base models (BertModel, etc.) do not accept labels; these are the base transformer models and simply output features.




∙ Decoder input IDs

This input is specific to encoder-decoder models and contains the input IDs that are fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs from the labels on their own.
For such models, passing the labels is the preferred way to handle training.

See each model's documentation for how it handles these input IDs for sequence-to-sequence training.



∙ Feed Forward Chunking

In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often larger than the hidden size of the model (e.g. bert-base-uncased: intermediate_size 3072 vs. hidden_size 768).

For an input of size [batch_size, sequence_length], the memory needed to store the intermediate feed forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large part of the memory usage.

The authors of Reformer: The Efficient Transformer noticed that, because the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and then concatenate them into [batch_size, sequence_length, config.hidden_size] with n = sequence_length.

This trades increased computation time for reduced memory usage, while giving a mathematically equivalent result.

For models that use the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel, and therefore the trade-off between memory and time complexity. If chunk_size is set to 0, no feed forward chunking is done.
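The equivalence is easy to check on a toy position-wise layer. The sketch below uses plain lists instead of tensors; feed_forward and apply_chunked are hypothetical stand-ins (not the library's apply_chunking_to_forward), and the "layer" is a made-up widen / ReLU / project-back function chosen only because it treats every token independently.

```python
def feed_forward(tokens):
    """Toy position-wise layer applied to a list of token vectors."""
    out = []
    for vec in tokens:
        widened = [v for x in vec for v in (x, -x)]   # intermediate size = 2 * hidden size
        activated = [max(0.0, v) for v in widened]    # ReLU
        # project back to the hidden size by summing each pair
        out.append([activated[2 * i] + activated[2 * i + 1] for i in range(len(vec))])
    return out


def apply_chunked(layer, sequence, chunk_size):
    """Run `layer` over the sequence dimension in chunks; chunk_size=0 disables chunking."""
    if chunk_size == 0:
        return layer(sequence)
    out = []
    for start in range(0, len(sequence), chunk_size):
        # Only `chunk_size` widened vectors exist at any one time.
        out.extend(layer(sequence[start:start + chunk_size]))
    return out


sequence = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.0]]
full = apply_chunked(feed_forward, sequence, chunk_size=0)
chunked = apply_chunked(feed_forward, sequence, chunk_size=2)
print(full == chunked)  # True
```

Smaller chunks mean fewer intermediate vectors held at once but more separate calls to the layer, which is the memory/time trade-off described above.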

 

 


Optimization

AdamW

Adam looks like such an all-purpose optimizer that there is even a saying, "just use Adam, no questions asked!", but
on some CV tasks SGD quite often performs better.
The AdamW paper explains, from the viewpoint of L2 regularization and weight decay, why Adam generalizes worse than SGD.
(Figure: test error for different initial decay rates and learning rates.)
L2 Regularization: keeps weights from growing abnormally large. (As the weight values grow, the loss grows as well.)
= the model learns weights that minimize the original loss without letting the weights get too large.

weight decay: at each weight update, shrinks the previous weight by a fixed ratio to prevent overfitting.

SGD: L2 = weight_decay
Adam: L2 ≠ weight_decay (because of the adaptive learning rate, the weight update rule differs from SGD's.)
∴ So when an L2-regularized loss is optimized with Adam, the regularization helps generalization less. (The effective decay rate becomes smaller.)
The authors solve this by adding a weight decay term directly to the weight update rule, instead of relying only on the decay effect of L2 regularization. Because the weight decay is separated from L2 regularization, it is called decoupled weight decay.

Algorithms of SGDW and AdamW:
One symbol not explained so far is
η, the learning-rate schedule multiplier that scales (typically decays) the learning rate at every weight update.

Without the part highlighted in green in the algorithm listing, the methods are identical to applying SGD and Adam to a loss that includes L2 regularization.
The green part is added directly to the weight update rule so that the weight decay effect is actually obtained.
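To make the difference concrete, here is a scalar sketch of one optimizer step in both variants. It follows the textbook Adam update with bias correction and is an assumption-laden illustration, not the paper's listing or any library's implementation; adam_step and all numeric values are made up. The gradient of the data loss is set to 0 so that only the regularization acts on the weight.

```python
import math


def adam_step(p, grad, state, lr=0.1, betas=(0.9, 0.999), eps=1e-6,
              weight_decay=0.1, decoupled=True):
    """One scalar Adam step, with either decoupled decay or L2 folded into the gradient."""
    beta1, beta2 = betas
    if not decoupled:
        grad = grad + weight_decay * p          # L2 penalty enters through the gradient
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        p = p - lr * weight_decay * p           # decay applied directly to the weight
    return p


def fresh_state():
    return {"step": 0, "m": 0.0, "v": 0.0}


# L2 inside Adam: the adaptive scaling normalizes the penalty away,
# so a large and a small weight shrink by almost the same amount.
l2_big = adam_step(100.0, 0.0, fresh_state(), decoupled=False)
l2_small = adam_step(1.0, 0.0, fresh_state(), decoupled=False)

# Decoupled decay: each weight shrinks in proportion to its own size.
dec_big = adam_step(100.0, 0.0, fresh_state(), decoupled=True)
dec_small = adam_step(1.0, 0.0, fresh_state(), decoupled=True)

print(100.0 - l2_big, 1.0 - l2_small)    # both close to 0.1
print(100.0 - dec_big, 1.0 - dec_small)  # close to 1.0 and 0.01
```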
optimizer = AdamW(model.parameters(), lr=1e-3, eps=1e-6, weight_decay=0.0)

 

cf) model.parameters() returns the weights and biases.
Now let's look at the formulas above through the GitHub code:
class AdamW(Optimizer):
    """
    Implements Adam algorithm with weight decay fix as introduced in [Decoupled Weight Decay
    Regularization](https://arxiv.org/abs/1711.05101).

    Parameters:
        params (`Iterable[nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (`float`, *optional*, defaults to 0.001):
            The learning rate to use.
        betas (`Tuple[float,float]`, *optional*, defaults to `(0.9, 0.999)`):
            Adam's betas parameters (b1, b2).
        eps (`float`, *optional*, defaults to 1e-06):
            Adam's epsilon for numerical stability.
        weight_decay (`float`, *optional*, defaults to 0.0):
            Decoupled weight decay to apply.
        correct_bias (`bool`, *optional*, defaults to `True`):
            Whether or not to correct bias in Adam (for instance, in Bert TF repository they use `False`).
        no_deprecation_warning (`bool`, *optional*, defaults to `False`):
            A flag used to disable the deprecation warning (set to `True` to disable the warning).
    """

    def __init__(
        self,
        params: Iterable[nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
        no_deprecation_warning: bool = False,
    ):
        if not no_deprecation_warning:
            warnings.warn(
                "This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch"
                " implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this"
                " warning",
                FutureWarning,
            )
        require_version("torch>=1.5.0")  # add_ with alpha
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr} - should be >= 0.0")
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[0]} - should be in [0.0, 1.0)")
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[1]} - should be in [0.0, 1.0)")
        if not 0.0 <= eps:
            raise ValueError(f"Invalid epsilon value: {eps} - should be >= 0.0")
        defaults = {"lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay, "correct_bias": correct_bias}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure: Callable = None):
        """
        Performs a single optimization step.

        Arguments:
            closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Decay the first and second moment running average coefficient
                # In-place operations to update the averages at the same time
                exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
                denom = exp_avg_sq.sqrt().add_(group["eps"])

                step_size = group["lr"]
                if group["correct_bias"]:  # No bias correction for Bert
                    bias_correction1 = 1.0 - beta1 ** state["step"]
                    bias_correction2 = 1.0 - beta2 ** state["step"]
                    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

                p.addcdiv_(exp_avg, denom, value=-step_size)

                # Just adding the square of the weights to the loss function is *not*
                # the correct way of using L2 regularization/weight decay with Adam,
                # since that will interact with the m and v parameters in strange ways.
                #
                # Instead we want to decay the weights in a manner that doesn't interact
                # with the m/v parameters. This is equivalent to adding the square
                # of the weights to the loss with plain (non-momentum) SGD.
                # Add weight decay at the end (fixed version)
                if group["weight_decay"] > 0.0:
                    p.add_(p, alpha=(-group["lr"] * group["weight_decay"]))

        return loss
cf) The state_dict() of an optimizer has the following form:
{
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
        2: {'momentum_buffer': tensor(...), ...},
        3: {'momentum_buffer': tensor(...), ...}
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0]
        },
        {
            'lr': 0.001,
            'weight_decay': 0.5,
            ...
            'params': [1, 2, 3]
        }
    ]
}
Putting this together: AdamW inherits from the Optimizer class, and,
as the state_dict layout above suggests, if len(state) == 0 means that no state has been created for that parameter yet (i.e. this is its first step).
exp_avg corresponds to m_t and exp_avg_sq to v_t, and the final parameter update happens at p.addcdiv_ and in the if group["weight_decay"] branch.

 


LR Schedules & Learning Rate Annealing

LR Schedule: the lr is changed during training according to a predetermined schedule.

The difference is that a schedule may also increase the learning rate in the middle of training!
With warm restarts, as in the figure, the model gets a chance to escape a local minimum.


LR Annealing: often used interchangeably with lr schedule, but it refers to decreasing the lr monotonically over the iterations.
Intuitively, a high learning rate at the start searches aggressively for a good region to converge to,
and a low learning rate at the end lets the model settle precisely into the convergence point.
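Both ideas fit into one small function: a schedule that first raises the lr (warmup) and then anneals it monotonically. The sketch below uses linear warmup followed by cosine annealing, which is one common choice and not necessarily the schedule in the figure; lr_at and all step counts are hypothetical, and restarts are not included.

```python
import math


def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps                      # lr increases
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # lr decreases monotonically


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```

A warm-restart schedule would repeat the annealing phase several times, jumping back to a high lr at the start of each cycle.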

 


Model Outputs

ModelOutput

The output of every model is an instance of a subclass of ModelOutput.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # batch size 1
outputs = model(**inputs, labels=labels)

# SequenceClassifierOutput(loss=tensor(0.4267, grad_fn=<NllLossBackward0>), 
#                           logits=tensor([[-0.0658,  0.5650]], grad_fn=<AddmmBackward0>), 
#                           hidden_states=None, attentions=None)
Here the outputs object has a loss and logits (hidden_states and attentions are None), so when it is treated as a tuple, outputs[:2] returns the tuple (outputs.loss, outputs.logits).
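The "fields that are None are skipped in the tuple view" behaviour can be mimicked with a small dataclass. ToyClassifierOutput is a hypothetical stand-in written for this sketch, not the library's ModelOutput, and the numbers are copied from the output printed above.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyClassifierOutput:
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None
    attentions: Optional[Any] = None

    def to_tuple(self):
        """Keep only the fields that are set, in declaration order."""
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)


outputs = ToyClassifierOutput(loss=0.4267, logits=[[-0.0658, 0.5650]])
print(outputs.to_tuple())  # (0.4267, [[-0.0658, 0.565]])

# Without labels there is no loss, so logits moves to position 0.
no_loss = ToyClassifierOutput(logits=[[-0.0658, 0.5650]])
print(no_loss.to_tuple()[0])  # [[-0.0658, 0.565]]
```

This is why attribute access (outputs.logits) is safer than positional indexing: the position of logits depends on whether labels were passed.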

Cf)
For a CausalLM:
loss: Language modeling loss (for next-token prediction).

logits: Prediction scores of the LM_Head (scores for each vocabulary token before SoftMax)
= raw prediction values and are not bounded to a specific range

To output a word, a transformer goes through: project linearly -> apply softmax.
Note that the LM_Head is already needed for the language-modeling objective in pre-training; what is usually swapped in at fine-tuning time is a task-specific head (e.g. a classification head).
LM_Head refers to the part that takes the model's output hidden_state and performs the prediction.
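The last step, going from unbounded logits to a predicted token, can be shown with a four-word vocabulary. All names and numbers below are made up for illustration; the softmax is the standard numerically stable form.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_vocab = ["[PAD]", "london", "paris", "berlin"]
toy_logits = [-3.8, 1.2, 4.5, 0.3]  # raw, unbounded scores from a head

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the vocabulary
print(toy_vocab[best])  # paris
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same token.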
ex) BERT
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
print(f'logits: {logits}') # `torch.FloatTensor` of shape `(batch_size, sequence_length, vocab_size)`

# Locate the [MASK] token and inspect the prediction at that position.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f'masked_index: {masked_index}') # `torch.LongTensor` of shape `(1,)`

MASK_token = logits[0, masked_index] # logits at the [MASK] position in the first sentence of the batch
print(f'MASK_Token: {MASK_token}')

predicted_token_id = MASK_token.argmax(dim=-1) # index of the largest logit = token_id the model predicts at that position
print(f'predicted_token_id: {predicted_token_id}')


predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)  # prints: paris


# logits: tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
#          [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7618],
#          [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
#          ...,
#          [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
#          [-11.6598, -11.4274, -11.9267,  ...,  -9.8772, -10.2103,  -4.7594],
#          [-11.7267, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]],
#        grad_fn=<ViewBackward0>)
# masked_index: tensor([6])
# MASK_Token: tensor([[-3.7861, -3.8572, -3.5644,  ..., -2.5593, -3.1093, -4.3820]],
#        grad_fn=<IndexBackward0>)
# predicted_token_id: tensor([3000])
# paris


cf) The index returned by argmax is an index into the vocabulary, which can be checked as follows.

vocab = tokenizer.get_vocab()  # token -> id mapping
entries = sorted(vocab.items(), key=lambda kv: kv[1])  # order the entries by id
for word, idx in entries[:5]:  # first 5 vocabulary entries
    print(f"{word}: {idx}")
for word, idx in entries[2990:3010]:  # entries around id 3000
    print(f"{word}: {idx}")
    
# [PAD]: 0
# [unused0]: 1
# [unused1]: 2
# [unused2]: 3
# [unused3]: 4
# jack: 2990
# fall: 2991
# raised: 2992
# itself: 2993
# stay: 2994
# true: 2995
# studio: 2996
# 1988: 2997
# sports: 2998
# replaced: 2999
# paris: 3000
# systems: 3001
# saint: 3002
# leader: 3003
# theatre: 3004
# whose: 3005
# market: 3006
# capital: 3007
# parents: 3008
# spanish: 3009

 


Trainer

Trainer

The Trainer class is optimized for 🤗 Transformers models.
= the model must always return either a tuple (with the loss as its first element) or a subclass of ModelOutput
= the model can compute the loss when the labels argument is provided.

Once the required settings are passed in through TrainingArguments, Trainer can start training without the user writing a train loop by hand.
In addition, SFTTrainer from the TRL library wraps this Trainer class, and features such as LoRA, quantization, and DeepSpeed let it scale efficiently to models of any size.
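To make concrete what kind of loop Trainer takes off the user's hands, here is a deliberately tiny hand-written loop in plain Python. It is a sketch under loud assumptions: toy_model is a hypothetical one-parameter model that returns a tuple with the loss first (echoing the contract above), the gradient is computed by hand instead of by autograd, and the data and hyperparameters are invented.

```python
def toy_model(w, batch):
    """Hypothetical model for y = w * x; returns (loss, grad) with the loss first."""
    xs, ys = batch
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

def train(batches, lr=0.05, epochs=50, accumulation_steps=2):
    """Manual loop: forward pass, gradient accumulation, parameter update."""
    w = 0.0
    for _ in range(epochs):
        accumulated = 0.0
        for i, batch in enumerate(batches, start=1):
            loss, grad = toy_model(w, batch)
            accumulated += grad / accumulation_steps  # average over accumulated batches
            if i % accumulation_steps == 0:
                w -= lr * accumulated  # one optimizer step
                accumulated = 0.0
    return w

# Toy data generated from y = 2 * x, split into two batches.
batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
w = train(batches)
print(f"learned w = {w:.6f}")
```

Trainer handles this bookkeeping, plus evaluation, logging, checkpointing, and device placement, based on the values given in TrainingArguments.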

Before starting, note that the Accelerate library must be installed for distributed environments!
pip install accelerate
pip install accelerate --upgrade

Basic Usage



Checkpoints



Customizing



Callbacks & Logging



Accelerate & Trainer



TrainingArguments

Note)
output_dir (str): The output directory where model predictions and checkpoints are written.
eval_strategy (str or trainer_utils.IntervalStrategy, optional, defaults to "no"): The evaluation strategy to adopt during training. Possible values are:
•	"no": No evaluation is done during training.
•	"steps": Evaluation is done (and logged) every eval_steps.
•	"epoch": Evaluation is done at the end of each epoch.
per_device_train_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
gradient_accumulation_steps (int, optional, defaults to 1): Number of update steps over which gradients are accumulated before a backward/update pass is performed.
eval_accumulation_steps (int, optional): Number of prediction steps to accumulate output tensors for before moving the results to the CPU. If left unset, all predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster, but requires more memory).
learning_rate (float, optional, defaults to 5e-5): The initial learning rate for the [AdamW] optimizer.
weight_decay (float, optional, defaults to 0): The weight decay applied to all layers (except bias and LayerNorm weights) in the [AdamW] optimizer.
max_grad_norm (float, optional, defaults to 1.0): Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0): Total number of training epochs to perform (if not an integer, the fractional part of the last epoch is performed before training stops).
max_steps (int, optional, defaults to -1): If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training iterates over the dataset repeatedly until max_steps is reached.
eval_steps (int or float, optional): Number of update steps between two evaluations if eval_strategy="steps". Defaults to the same value as logging_steps if not set. Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of the total training steps.
lr_scheduler_type (str or SchedulerType, optional, defaults to "linear"): The scheduler type to use. See the documentation of SchedulerType for all possible values.
lr_scheduler_kwargs (dict, optional, defaults to {}): Extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0): Ratio of the total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0): Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
logging_dir (str, optional): TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
logging_strategy (str or trainer_utils.IntervalStrategy, optional, defaults to "steps"): The logging strategy to adopt during training. Possible values are:
•	"no": No logging is done during training.
•	"epoch": Logging is done at the end of each epoch.
•	"steps": Logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False): Whether to log the first global_step.
logging_steps (int or float, optional, defaults to 500): Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of the total training steps.
run_name (str, optional, defaults to output_dir): A descriptor for the run. Typically used for wandb and mlflow logging. If not specified, it is the same as output_dir.
save_strategy (str or trainer_utils.IntervalStrategy, optional, defaults to "steps"): The checkpoint save strategy to adopt during training. Possible values are:
•	"no": No save is done during training.
•	"epoch": A save is done at the end of each epoch.
•	"steps": A save is done every save_steps. If "epoch" or "steps" is chosen, a save is also always performed at the very end of training.
save_total_limit (int, optional): If a value is passed, limits the total number of checkpoints and deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model is always retained in addition to the most recent ones.
For example, with save_total_limit=5 and load_best_model_at_end, the last four checkpoints are always retained alongside the best model. With save_total_limit=1 and load_best_model_at_end, two checkpoints may be saved if the last checkpoint and the best checkpoint differ.
save_safetensors (bool, optional, defaults to True): Use safetensors for saving and loading state_dicts instead of the default torch.load and torch.save.
save_on_each_node (bool, optional, defaults to False): In multi-node distributed training, whether to save models and checkpoints on each node. By default they are saved only on the main node.
seed (int, optional, defaults to 42): Random seed set at the beginning of training. To ensure reproducibility across runs, use the Trainer.model_init function to instantiate the model if it has randomly initialized parameters.
data_seed (int, optional): Random seed used by the data samplers. If not set, the random sampler for data sampling uses the same seed as seed. This makes data sampling reproducible independently of the model seed.
bf16 (bool, optional, defaults to False): Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires an NVIDIA architecture of Ampere or newer, a CPU (use_cpu), or an Ascend NPU. This is an experimental API and it may change.
fp16 (bool, optional, defaults to False): Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
half_precision_backend (str, optional, defaults to "auto"): The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp". "auto" uses CPU/CUDA AMP or APEX depending on the detected PyTorch version, while the other choices force the requested backend.
bf16_full_eval (bool, optional, defaults to False): Whether to use full bfloat16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values. This is an experimental API and it may change.
fp16_full_eval (bool, optional, defaults to False): Whether to use full float16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values.
tf32 (bool, optional): Whether to enable the TF32 mode available on Ampere and newer GPU architectures. The default depends on PyTorch's default for torch.backends.cuda.matmul.allow_tf32. See the TF32 documentation for details. This is an experimental API and it may change.
local_rank (int, optional, defaults to -1): Rank of the process during distributed training.
ddp_backend (str, optional): The backend to use for distributed training. Must be one of "nccl", "mpi", "ccl", "gloo", "hccl".
dataloader_drop_last (bool, optional, defaults to False): Whether to drop the last incomplete batch when the length of the dataset is not divisible by the batch size.
dataloader_num_workers (int, optional, defaults to 0): Number of subprocesses to use for data loading (PyTorch only). 0 means the data is loaded in the main process.
remove_unused_columns (bool, optional, defaults to True): Whether to automatically remove the columns that are not used by the model's forward method.
label_names (List[str], optional): The list of keys in the input dictionary that correspond to the labels. Defaults to the list of label argument names accepted by the model.
load_best_model_at_end (bool, optional, defaults to False): Whether to load the best model found during training at the end of training. When this option is enabled, the best checkpoint is always saved. See save_total_limit for details.
Note: When set to True, save_strategy needs to be the same as eval_strategy, and if it is "steps", save_steps must be a round multiple of eval_steps.
metric_for_best_model (str, optional): Used together with load_best_model_at_end to specify the metric for comparing two models. Must be the name of a metric returned by the evaluation. If unspecified and load_best_model_at_end=True, it defaults to "loss" (that is, eval_loss is used). Setting this value makes greater_is_better default to True; set it to False if a lower metric is better.
greater_is_better (bool, optional): Used together with load_best_model_at_end and metric_for_best_model to specify whether better models should have a greater metric. Defaults to:
•	True if metric_for_best_model is set to a value that does not end in "loss".
•	False if metric_for_best_model is not set, or is set to a value that ends in "loss".
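Several of the arguments above interact arithmetically, so a small calculation may help. The helpers below are written for this note and only mirror the rules as the list describes them (ratio-valued steps, linear warmup from 0 to learning_rate followed by the default "linear" decay, and the greater_is_better default); they are not the library's source, and all concrete values such as num_devices and max_steps are hypothetical.

```python
import math

def resolve_steps(value, max_steps):
    """Treat an int as a step count and a float below 1 as a ratio of max_steps."""
    if isinstance(value, float) and value < 1:
        return math.ceil(max_steps * value)
    return int(value)

def linear_warmup_decay(step, warmup_steps, max_steps, base_lr):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining

def default_greater_is_better(metric_for_best_model):
    """Default described above: False for unset or '...loss' metrics, else True."""
    if metric_for_best_model is None:
        return False
    return not metric_for_best_model.endswith("loss")

# Assumed example settings (hypothetical, not recommendations).
per_device_train_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2
max_steps = 1000
warmup_ratio = 0.1
learning_rate = 5e-5

# Samples that contribute to a single optimizer update.
effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
warmup_steps = math.ceil(max_steps * warmup_ratio)
eval_every = resolve_steps(0.25, max_steps)  # eval_steps given as a ratio

print(effective_batch, warmup_steps, eval_every)
print(linear_warmup_decay(50, warmup_steps, max_steps, learning_rate))
print(default_greater_is_better("accuracy"), default_greater_is_better("eval_loss"))
```

With these assumed values, one update sees 64 samples, warmup lasts 100 steps, and evaluation runs every 250 steps.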

fsdp (bool, str, or list of trainer_utils.FSDPOption, optional, defaults to ''): Use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training only).
fsdp_config (str or dict, optional): Config to be used with fsdp. The value is either the location of an fsdp json config file (e.g., fsdp_config.json) or an already loaded json file as a dict.
deepspeed (str or dict, optional): Use DeepSpeed. This is an experimental feature and its API may change. The value is either the location of a DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file as a dict.
accelerator_config (str, dict, or AcceleratorConfig, optional): Config to be used with the internal Accelerator implementation. The value is either the location of an accelerator json config file (e.g., accelerator_config.json), an already loaded json file as a dict, or an instance of trainer_pt_utils.AcceleratorConfig.
label_smoothing_factor (float, optional, defaults to 0.0): The label smoothing factor to use. 0 means no label smoothing; any other value changes the underlying one-hot encoded labels.
optim (str or training_args.OptimizerNames, optional, defaults to "adamw_torch"): The optimizer to use. Choices include adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, and adafactor.
optim_args (str, optional): Optional arguments supplied to AnyPrecisionAdamW.
group_by_length (bool, optional, defaults to False): Whether to group samples of roughly the same length together in the training dataset (to minimize padding and be more efficient). Only useful when applying dynamic padding.
report_to (str or List[str], optional, defaults to "all"): The list of integrations to report results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". "all" reports to every installed integration; "none" reports to none.
ddp_find_unused_parameters (bool, optional): When using distributed training, the value of the find_unused_parameters flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional): When using distributed training, the value of the bucket_cap_mb flag passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional): When using distributed training, the value of the broadcast_buffers flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
dataloader_persistent_workers (bool, optional, defaults to False): If True, the data loader does not shut down the worker processes after a dataset has been consumed once, which keeps the workers' dataset instances alive. This can speed up training but increases RAM usage.
push_to_hub (bool, optional, defaults to False): Whether to push the model to the Hub every time the model is saved. When enabled, output_dir becomes a git directory and its content is pushed each time a save is triggered (depending on save_strategy). Calling Trainer.save_model also triggers a push.
resume_from_checkpoint (str, optional): The path to a folder with a valid checkpoint for the model. This argument is not used directly by Trainer; it is intended for use by training/evaluation scripts. See the example scripts for details.
hub_model_id (str, optional): The name of the repository to keep in sync with the local output_dir. It can be a simple model ID, in which case the model is pushed to your namespace; otherwise it should be the whole repository name (e.g., "user_name/model"). Defaults to user_name/output_dir_name, where output_dir_name is the name of output_dir.
hub_strategy (str or trainer_utils.HubStrategy, optional, defaults to "every_save"): Defines the scope of what is pushed to the Hub and when. Possible values are:
•	"end": Push the model, its configuration, the tokenizer (if passed), and a draft of a model card.
•	"every_save": Push the model, its configuration, the tokenizer (if passed), and a draft of a model card each time there is a save. Pushes are asynchronous; if saves are very frequent, a new push is only attempted once the previous one has finished. A last push is made with the final model at the end of training.
•	"checkpoint": Like "every_save", but the latest checkpoint is also pushed to a subfolder named last-checkpoint, so training can be resumed easily with trainer.train(resume_from_checkpoint="last-checkpoint").
•	"all_checkpoints": Like "checkpoint", but all checkpoints are pushed as they appear in the output folder (so the final repository contains one checkpoint folder per folder).
hub_token (str, optional): The token to use when pushing the model to the Hub. Defaults to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional, defaults to False): If True, the Hub repository is made private.
hub_always_push (bool, optional, defaults to False): Unless this is True, pushing a checkpoint is skipped when the previous push has not finished.
gradient_checkpointing (bool, optional, defaults to False): Whether to use gradient checkpointing to save memory, at the cost of a slower backward pass.
auto_find_batch_size (bool, optional, defaults to False): Whether to automatically find a batch size that fits into memory, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
ray_scope (str, optional, defaults to "last"): The scope to use for hyperparameter search with Ray. With the default "last", Ray compares the last checkpoint of every trial and selects the best one. Other options are available; see the Ray documentation for details.
ddp_timeout (int, optional, defaults to 1800): The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts in distributed runs. See the PyTorch documentation for details.
torch_compile (bool, optional, defaults to False): Whether to compile the model using PyTorch 2.0 torch.compile. This uses the defaults of the torch.compile API. The defaults can be customized with the torch_compile_backend and torch_compile_mode arguments, but not every value is guaranteed to work. This flag and the whole compile API are experimental and may change in future releases.
torch_compile_backend (str, optional): The backend to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
torch_compile_mode (str, optional): The mode to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
split_batches (bool, optional): Whether to split the batches yielded by the data loader across devices during distributed training. If True, the actual batch size used is the same for any kind of distributed process, but it must be a round multiple of the number of processes in use.
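For label_smoothing_factor, the phrase "changes the one-hot encoded labels" can be made concrete. As far as I understand the upstream description, the 0s become factor / num_labels and the 1 becomes 1 - factor + factor / num_labels; the helper below is a hypothetical illustration of that arithmetic, not code from the library.

```python
def smooth_labels(label, num_labels, factor):
    """Build a one-hot vector for `label` and smooth it by `factor`."""
    off_value = factor / num_labels        # replaces every 0
    on_value = 1.0 - factor + off_value    # replaces the single 1
    return [on_value if i == label else off_value for i in range(num_labels)]

# Assumed example: class 1 out of 4 classes, smoothing factor 0.1.
smoothed = smooth_labels(label=1, num_labels=4, factor=0.1)
plain = smooth_labels(label=1, num_labels=4, factor=0.0)
print(smoothed)
print(plain)
```

The smoothed targets still sum to 1, and a factor of 0 gives back the plain one-hot vector.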




 


DeepSpeed

trust_remote_code=True

The trust_remote_code=True setting shows up often with Chinese models. What does it actually do?
It is needed when the model architecture has not yet been added to "huggingface/transformers":
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True, device_map="cuda")  # device_map places the weights on the GPU (needs accelerate)
This means "download the model code from the Hugging Face repo 'internlm/internlm-chat-7b' and run it together with the weights".
If the value is False, the library uses the model architecture built into huggingface/transformers and downloads only the weights.
