Transformers

pipeline()

Used for model inference:
from transformers import pipeline
pipe = pipeline("text-classification")
pipe("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998743534088135}]

from transformers import ... loads functions from the transformers repository on GitHub:

Everything that can be imported from transformers is implemented under src/transformers in that repository.

Which functions are exported can be checked in __init__.py; there, pipeline appears as from .pipelines import pipeline.


In other words, the pipeline function is imported from the pipelines folder.
Inside that folder the pipeline function is defined, and its signature matches the transformers.pipeline docs.



Auto Classes

The architecture to use can usually be inferred from the name or path passed to the from_pretrained() method. The Auto Classes do this job: AutoConfig, AutoModel, and AutoTokenizer automatically instantiate the matching pretrained class:
ex)

from transformers import AutoConfig, AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")




∙ AutoConfig

A generic configuration class that is instantiated as one of the library's configuration classes when created with the from_pretrained() class method.
This class cannot be instantiated directly with __init__()!

Following the source under src/transformers eventually leads to the from_pretrained() function.
In the GitHub code, __init__() simply raises an EnvironmentError.

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config)


# BertConfig {
#   "_name_or_path": "bert-base-uncased",
#   "architectures": [
#     "BertForMaskedLM"
#   ],
#   "attention_probs_dropout_prob": 0.1,
#   "classifier_dropout": null,
#   "gradient_checkpointing": false,
#   "hidden_act": "gelu",
#   "hidden_dropout_prob": 0.1,
#   "hidden_size": 768,
#   "initializer_range": 0.02,
#   "intermediate_size": 3072,
#   "layer_norm_eps": 1e-12,
#   "max_position_embeddings": 512,
#   "model_type": "bert",
#   "num_attention_heads": 12,
#   "num_hidden_layers": 12,
#   "pad_token_id": 0,
#   "position_embedding_type": "absolute",
#   "transformers_version": "4.41.2",
#   "type_vocab_size": 2,
#   "use_cache": true,
#   "vocab_size": 30522
# }

As the output shows, the config is stored as a JSON file (config.json) that describes the model.
It holds architecture hyperparameters such as hidden_size, num_hidden_layers, and vocab_size, plus a few presets such as the special token ID pad_token_id. Training hyperparameters such as batch_size, learning_rate, and weight_decay are not part of it; with Trainer they are set through TrainingArguments.
It is also saved together with the model checkpoint when save_pretrained() is called.
So to change these settings, or when proposing a new model variant, the config has to be loaded and modified directly.
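The two points above (no direct __init__(), dispatch to a concrete class) can be pictured with a small sketch. This is a toy model of the pattern, not the library's code: ToyAutoConfig, ToyBertConfig, ToyGPT2Config, TOY_CONFIG_MAPPING, and from_dict are all hypothetical names made up for illustration.

```python
class ToyBertConfig:
    model_type = "bert"

    def __init__(self, hidden_size=768, **kwargs):
        self.hidden_size = hidden_size


class ToyGPT2Config:
    model_type = "gpt2"

    def __init__(self, hidden_size=768, **kwargs):
        self.hidden_size = hidden_size


# Hypothetical registry: "model_type" string -> concrete config class.
TOY_CONFIG_MAPPING = {"bert": ToyBertConfig, "gpt2": ToyGPT2Config}


class ToyAutoConfig:
    def __init__(self):
        # Direct instantiation is blocked; only the factory method is allowed.
        raise EnvironmentError(
            "ToyAutoConfig is designed to be instantiated with "
            "ToyAutoConfig.from_dict(config_dict)."
        )

    @classmethod
    def from_dict(cls, config_dict):
        # Pick the concrete class from the "model_type" entry of the config.
        model_type = config_dict["model_type"]
        if model_type not in TOY_CONFIG_MAPPING:
            raise ValueError(f"Unrecognized model_type: {model_type}")
        kwargs = {k: v for k, v in config_dict.items() if k != "model_type"}
        return TOY_CONFIG_MAPPING[model_type](**kwargs)


config = ToyAutoConfig.from_dict({"model_type": "bert", "hidden_size": 1024})
print(type(config).__name__, config.hidden_size)
```

The "auto" class never becomes an instance itself; it only selects which concrete class to build, which is why printing the result of AutoConfig.from_pretrained() shows a BertConfig.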




∙ AutoTokenizer, (blobs, refs, snapshots)

A generic tokenizer class that is instantiated as one of the library's tokenizer classes when created with the AutoTokenizer.from_pretrained() class method.
See also) https://chan4im.tistory.com/#%E2%88%99input-ids
This class cannot be instantiated directly with __init__()!

As with AutoConfig, following the source under src/transformers eventually leads to the from_pretrained() function.
In the GitHub code, __init__() simply raises an EnvironmentError.

from transformers import AutoTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# If vocabulary files are in a directory 
# (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

# Download vocabulary from huggingface.co and define model-specific arguments
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

tokenizer
# BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
# 	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# }

One thing is worth asking here, though.

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

Why does running the code above print download progress for several files in the console?

In short:
tokenizer_config.json holds the tokenizer settings,
config.json holds the model settings,
vocab.txt holds the subwords, and
tokenizer.json lists the resulting tokenizer configuration in full.


In my case, when cache_dir is set as below, a hub folder is created in that directory and the model files are stored inside it.

parser.add_argument('--cache_dir', default="/data/huggingface_models")

Going further in, there are three folders: blobs, refs, snapshots
blobs: the file names are hash values. The folder contains the following:
config files, model files, tokenizer JSON files, weight files, and so on, all stored under hashed names:


 
refs: contains only a single file named main:

That file contains the same name as the directory inside snapshots.


snapshots: this is where tokenizer_config.json, config.json, vocab.txt, and tokenizer.json are found!



But isn't something odd?
The contents of the files in blobs and of the files in snapshots are identical!

Exactly: blobs holds the actual contents of the snapshots files, stored under their hash values.
(Searching for it afterwards turned up the official explanation right away: https://huggingface.co/docs/huggingface_hub/v0.16.3/en/guides/manage-cache)
Summary of the Hugging Face explanation:

Refs: the refs folder contains files that indicate the latest revision of a given reference. For example, if files were previously fetched from the main branch of a repository, the refs folder contains a file named main, which holds the commit identifier of the current head.

If the latest commit has the identifier aaaaaa, the file contains aaaaaa.

If the same branch is updated with a new commit bbbbbb, re-downloading a file from that reference updates the refs/main file to bbbbbb.

Blobs: the blobs folder contains the files that were actually downloaded. The name of each file is its hash.

Snapshots: the snapshots folder contains symbolic links to the blobs mentioned above. It consists of one folder per known revision.

cf) The cache has the following tree structure:

    [  96]  .
    └── [ 160]  models--julien-c--EsperBERTo-small
        ├── [ 160]  blobs
        │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
        │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
        ├── [  96]  refs
        │   └── [  40]  main
        └── [ 128]  snapshots
            ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
            │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
            │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
            └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
                ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
                └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
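To make the refs -> snapshots -> blobs chain concrete, here is a minimal sketch that builds a toy cache with the same layout in a temporary directory and resolves a file the way the tree above suggests. The repo name, commit id, blob name, and the helper functions build_toy_cache and resolve are hypothetical; this is not how huggingface_hub itself is implemented, and it assumes a filesystem that supports symlinks.

```python
import os
import tempfile


def build_toy_cache(root):
    """Create a tiny cache laid out as blobs/, refs/, snapshots/."""
    repo = os.path.join(root, "models--toy-org--toy-model")
    commit = "aaaaaa"  # made-up commit identifier
    snapshot = os.path.join(repo, "snapshots", commit)
    for folder in (os.path.join(repo, "blobs"), os.path.join(repo, "refs"), snapshot):
        os.makedirs(folder)

    # blobs/: the real file content, stored under a (made-up) hash name.
    with open(os.path.join(repo, "blobs", "0123abcd"), "w") as f:
        f.write('{"model_type": "bert"}')

    # refs/main: a text file holding the commit id the branch points to.
    with open(os.path.join(repo, "refs", "main"), "w") as f:
        f.write(commit)

    # snapshots/<commit>/config.json: a relative symlink to the blob.
    os.symlink(
        os.path.join("..", "..", "blobs", "0123abcd"),
        os.path.join(snapshot, "config.json"),
    )
    return repo


def resolve(repo, ref, filename):
    """Follow refs/<ref> -> snapshots/<commit>/<filename> -> blobs/<hash>."""
    with open(os.path.join(repo, "refs", ref)) as f:
        commit = f.read().strip()
    link = os.path.join(repo, "snapshots", commit, filename)
    return os.path.realpath(link)


with tempfile.TemporaryDirectory() as tmp:
    toy_repo = build_toy_cache(tmp)
    blob_path = resolve(toy_repo, "main", "config.json")
    resolved_name = os.path.basename(blob_path)
    with open(blob_path) as f:
        resolved_content = f.read()

print(resolved_name, resolved_content)
```

Because snapshots only holds links, two revisions that share an unchanged file (like pytorch_model.bin in the tree above) point at the same blob and the file is stored once.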

 


∙ AutoModel

The source can of course be traced in the same way as above.

First, let's look at the AutoModel.from_config function.

from transformers import AutoConfig, AutoModel

# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_config(config)


@classmethod
def from_config(cls, config, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map
    has_local_code = type(config) in cls._model_mapping.keys()
    trust_remote_code = resolve_trust_remote_code(
        trust_remote_code, config._name_or_path, has_local_code, has_remote_code
    )

    if has_remote_code and trust_remote_code:
        class_ref = config.auto_map[cls.__name__]
        if "--" in class_ref:
            repo_id, class_ref = class_ref.split("--")
        else:
            repo_id = config.name_or_path
        model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
        if os.path.isdir(config._name_or_path):
            model_class.register_for_auto_class(cls.__name__)
        else:
            cls.register(config.__class__, model_class, exist_ok=True)
        _ = kwargs.pop("code_revision", None)
        return model_class._from_config(config, **kwargs)
    elif type(config) in cls._model_mapping.keys():
        model_class = _get_model_class(config, cls._model_mapping)
        return model_class._from_config(config, **kwargs)

    raise ValueError(
        f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
        f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    )

 


Data Collator

Data Collator

Collates a list of samples into the tensors of a single training mini-batch.
Default Data Collator: as below, the Trainer uses data_collator to group train_dataset samples into mini-batches before they are fed to the model for training.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)





batch["input_ids"] , batch["labels"] ?

Most custom data collators, however, look like the code below, which contains two slightly unfamiliar names, input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples): 
        texts = []
        images = []
        for example in examples:
            image, question, answer = example
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}]  # verified: inputs arrive correctly up to here
            # tokenize=False returns the formatted string; the processor tokenizes the whole batch below
            text = self.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
            texts.append(text)
            images.append(image)

        # pass the accumulated lists (not the last loop variables) so every example is in the batch
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        # -100 is ignored by the loss, so padding positions do not contribute
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)

So what exactly are batch["input_ids"] and batch["labels"]?

The data_collator described earlier has the same shape: it, too, produces inputs and labels.

Every model is different, but models share similarities with one another
= most models take the same inputs!

∙Input IDs

Input IDs are often the only required parameter passed to the model as input.
They are token indices: numerical representations of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but the underlying mechanism is the same.

ex)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"


The tokenizer splits the sequence (sentence) into tokens that exist in the tokenizer vocabulary:

tokenized_sequence = tokenizer.tokenize(sequence)


Each token is either a word or a subword:

print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA", and "M".
# To mark these tokens as parts of the same word rather than separate words,
# a double-hash (##) prefix is added to "RA" and "M".


inputs = tokenizer(sequence)


These tokens can then be converted into IDs that the model understands.
Calling the tokenizer on the sequence returns a dictionary with everything the model needs, with input_ids as the key and the IDs as the value:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

Note that, depending on the model, the tokenizer automatically adds "special tokens";
these are "special IDs" that the model sometimes uses.

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# Output: [CLS] A Titan RTX has 24GB of VRAM [SEP]
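The steps above (split into subwords, mark continuations with ##, wrap in [CLS]/[SEP], map to IDs) can be sketched in plain Python. This is a simplified greedy longest-match-first WordPiece over a tiny made-up vocabulary; toy_vocab, wordpiece, and encode are hypothetical names, and the real tokenizer also handles casing, punctuation, and a much larger vocabulary, so the IDs here do not match bert-base-cased.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Made-up vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "has": 4,
             "V": 5, "##RA": 6, "##M": 7, "24": 8, "##GB": 9}


def encode(text, vocab):
    """Whitespace split, WordPiece per word, then add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, input_ids = encode("has 24GB VRAM", toy_vocab)
print(tokens)     # ['[CLS]', 'has', '24', '##GB', 'V', '##RA', '##M', '[SEP]']
print(input_ids)  # [2, 4, 8, 9, 5, 6, 7, 3]
```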





∙Attention Mask

The attention mask is an optional argument used when batching sequences together;
it tells the model which tokens should be attended to and which should not.

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# Output: (8, 19)
The encoded versions have different lengths, so they cannot be put into the same tensor as-is.
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# Output: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# Output: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention mask is found under the "attention_mask" key of the dictionary returned by the tokenizer: 1 marks a real token to attend to, 0 marks padding.
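What padding=True does to the two keys can be reproduced by hand. The sketch below is a plain-Python stand-in, not the tokenizer's implementation; pad_batch is a hypothetical helper and right-side padding with pad_token_id=0 is assumed, matching the BERT output above.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 = real token, 0 = padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1188, 102], [101, 1188, 1110, 170, 102]])
print(batch["input_ids"])       # [[101, 1188, 102, 0, 0], [101, 1188, 1110, 170, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```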


∙Token Type IDs

Some models are used for classification on sequence pairs or for question answering.
These models need two different sequences joined into a single input_ids entry,
= which is done with special tokens such as [CLS] and [SEP].

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# Output: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
In the example above, the two sequences are passed to the tokenizer as two arguments, and the combined sentence is built automatically.
For some models this is enough to tell where one sequence ends and the next one begins.

Other models, such as BERT, additionally use token type IDs; the tokenizer returns this mask as token_type_ids.
encoded_dict['token_type_ids']
# Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 

The first sequence, the context used for the question, is all 0s,
and the second sequence, the question itself, is all 1s.
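The 0/1 pattern follows directly from the [CLS] A [SEP] B [SEP] layout. A minimal sketch, assuming the BERT convention that segment 0 covers [CLS], sequence A, and the first [SEP], and segment 1 covers sequence B and the final [SEP]; build_pair is a hypothetical helper and the token IDs in the example are arbitrary.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two ID lists as [CLS] A [SEP] B [SEP] and mark each segment."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


pair_ids, pair_types = build_pair([7, 8, 9], [5, 6])
print(pair_ids)    # [101, 7, 8, 9, 102, 5, 6, 102]
print(pair_types)  # [0, 0, 0, 0, 0, 1, 1, 1]
```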


∙Position IDs

RNN: the position of each token is built in (tokens are processed in order).
Transformer: not aware of each token's position ❌


∴ position_ids is an optional parameter the model uses to identify each token's position in the list of tokens.

If no position_ids are passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
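As a rough illustration, default absolute position IDs are just 0, 1, 2, ... up to the sequence length, and a sinusoidal embedding is a fixed function of that position. The helpers default_position_ids and sinusoidal_embedding are hypothetical; the second follows the sin/cos formula from the original Transformer paper and is not tied to any specific model class.

```python
import math


def default_position_ids(seq_len, max_position_embeddings=512):
    """Absolute positions 0..seq_len-1, bounded by the embedding table size."""
    if seq_len > max_position_embeddings:
        raise ValueError("sequence is longer than max_position_embeddings")
    return list(range(seq_len))


def sinusoidal_embedding(position, dim):
    """PE(pos, 2i) = sin(pos / 10000^(2i/dim)), PE(pos, 2i+1) = cos(same angle)."""
    emb = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        emb.append(math.sin(angle))
        emb.append(math.cos(angle))
    return emb[:dim]


print(default_position_ids(4))     # [0, 1, 2, 3]
print(sinusoidal_embedding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```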




∙Labels 

Labels are an optional argument that can be passed so that the model computes the loss itself.
They should be the expected predictions of the model: the model uses a standard loss to compute the loss between its predictions and the expected values (the labels).


The expected labels differ depending on the model head:

  • AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value of the batch is the expected label of the entire sequence.

  • AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token.

  • AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token: the labels are the token IDs at the masked positions, and the rest are values to be ignored (usually -100).

  • Sequence-to-sequence models (e.g. BartForConditionalGeneration): the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value corresponds to the target sequence associated with each input sequence. During training, BART and T5 create the appropriate decoder input IDs and decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models built with the Encoder-Decoder framework. See each model's documentation for details on its labels.

Base models (BertModel, etc.) do not accept labels; they are the base transformer models and simply output features.
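The role of -100 can be shown with a hand-written cross-entropy. This is a plain-Python sketch rather than the library's loss; it assumes the PyTorch convention that positions labeled -100 (the default ignore_index of its cross-entropy loss) are skipped, and make_labels and cross_entropy are hypothetical helpers.

```python
import math

IGNORE_INDEX = -100


def make_labels(input_ids, pad_token_id):
    """Copy input_ids as labels, replacing padding with the ignore value."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else float("nan")


toy_logits = [[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]]
toy_labels = [1, IGNORE_INDEX, 0]
toy_loss = cross_entropy(toy_logits, toy_labels)
print(make_labels([5, 6, 0, 0], pad_token_id=0))  # [5, 6, -100, -100]
print(toy_loss)  # log(2): only the first and third positions are counted
```

Changing the logits of the ignored middle position leaves the loss unchanged, which is why collators set padded label positions to -100.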




∙ Decoder input IDs

This input is specific to encoder-decoder models and contains the input IDs that will be fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs on their own from the labels.
For such models, passing the labels is the preferred way to handle training.

See each model's documentation for how it handles these input IDs in sequence-to-sequence training.



∙Feed Forward Chunking

In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often larger than the hidden size of the model (e.g. bert-base-uncased).

For an input of size [batch_size, sequence_length], the memory required to store the intermediate feed forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large fraction of the memory use.

The authors of Reformer: The Efficient Transformer noticed that, since the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and concatenate them afterwards into [batch_size, sequence_length, config.hidden_size] with n = sequence_length.

This trades increased computation time for reduced memory use, while yielding a mathematically equivalent result.

For models that use the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel and thus the trade-off between memory and time complexity. If chunk_size is set to 0, no feed forward chunking is done.
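The equivalence claimed above is easy to check on a toy example. The sketch below uses plain Python lists for a single sequence and a ReLU feed forward with made-up weights; feed_forward, forward_full, and forward_chunked are hypothetical helpers and not the library's apply_chunking_to_forward().

```python
def feed_forward(x, w1, w2):
    """Position-wise feed forward for one token vector: relu(x @ w1) @ w2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]


def forward_full(tokens, w1, w2):
    """Process every position at once (all intermediates alive together)."""
    return [feed_forward(t, w1, w2) for t in tokens]


def forward_chunked(tokens, w1, w2, chunk_size):
    """Process chunk_size positions at a time and concatenate the outputs."""
    if chunk_size == 0:  # 0 means: no chunking
        return forward_full(tokens, w1, w2)
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(forward_full(tokens[start:start + chunk_size], w1, w2))
    return out


# hidden_size = 2, intermediate_size = 3, sequence_length = 5 (toy values)
w1 = [[1.0, -1.0, 0.5], [0.5, 2.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_tokens = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]

full = forward_full(toy_tokens, w1, w2)
chunked = forward_chunked(toy_tokens, w1, w2, chunk_size=2)
print(full == chunked)  # True: each position is computed independently
```

In a real tensor implementation only one chunk's [chunk_size, intermediate_size] activations need to exist at a time, which is where the memory saving comes from.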

 

 


Optimization

AdamW

Adam is treated as such an all-purpose optimizer that "just use Adam, no questions asked" is practically a proverb,
but on some CV tasks SGD quite often performs better.
The AdamW paper explains, from the viewpoint of L2 regularization and weight decay, why Adam generalizes worse than SGD.
(Figure: test error for different initial decay rates and learning rates)
L2 Regularization: prevents weights from growing abnormally large. (Larger weights make the loss function larger.)
= the model learns weights that minimize the original loss while keeping the weights from growing too large.

weight decay: at each weight update, shrinks the previous weights by a fixed ratio to prevent overfitting.

SGD: L2 = weight_decay (up to a rescaling by the learning rate)
Adam: L2 ≠ weight_decay (because of its adaptive learning rate, Adam uses a different weight update rule from SGD.)
∴ When Adam optimizes a loss that includes L2 regularization, the regularization effect is weaker. (The effective decay rate becomes smaller.)
The authors solve this by adding a weight decay term directly to the weight update rule, instead of relying only on the weight decay effect of L2 regularization. Because this weight decay is separated from L2 regularization, it is called decoupled weight decay.
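The difference can be seen on a single scalar weight. The sketch below is a simplified one-parameter Adam step, not the library optimizer; adam_step and shrink are hypothetical helpers, l2 adds the penalty gradient before the moment estimates, and weight_decay shrinks the weight directly after the Adam update. The task gradient is set to 0 so that only the regularization acts.

```python
import math


def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, l2=0.0, weight_decay=0.0):
    """One Adam step on a scalar parameter p at step count t (starting at 1)."""
    g = grad + l2 * p                      # L2: penalty folded into the gradient
    m = beta1 * m + (1.0 - beta1) * g      # first moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g  # second moment estimate
    step_size = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    p = p - step_size * m / (math.sqrt(v) + eps)
    p = p - lr * weight_decay * p          # decoupled decay: bypasses m and v
    return p, m, v


def shrink(p0, **kwargs):
    """How much a weight shrinks in the first step when the task gradient is 0."""
    p1, _, _ = adam_step(p0, grad=0.0, m=0.0, v=0.0, t=1, **kwargs)
    return p0 - p1


l2_small, l2_large = shrink(1.0, l2=0.1), shrink(10.0, l2=0.1)
wd_small, wd_large = shrink(1.0, weight_decay=0.1), shrink(10.0, weight_decay=0.1)
print(l2_small, l2_large)  # both close to lr = 1e-3, regardless of weight size
print(wd_small, wd_large)  # about 1e-4 and 1e-3: proportional to the weight
```

With L2 inside Adam, the penalty gradient is normalized by sqrt(v), so a weight ten times larger is not pulled back ten times harder; with decoupled decay the shrinkage stays proportional to the weight.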

Algorithms of SGDW and AdamW:
One symbol not explained so far is
𝜂, the learning rate schedule multiplier: a factor applied at every weight update that scales the learning rate (for example, decaying it by a fixed ratio).

Without the parts marked in green, the algorithms are identical to applying SGD and Adam to a loss that includes L2 regularization.
The green parts add the decay directly to the weight update rule, which is what produces the weight decay effect.
optimizer = AdamW(model.parameters(), lr=1e-3, eps=1e-6, weight_decay=0.0)

 

cf) model.parameters() returns the weights and biases.
Now let's look at the formulas above through the GitHub code:
class AdamW(Optimizer):
    """
    Implements Adam algorithm with weight decay fix as introduced in [Decoupled Weight Decay
    Regularization](https://arxiv.org/abs/1711.05101).

    Parameters:
        params (`Iterable[nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (`float`, *optional*, defaults to 0.001):
            The learning rate to use.
        betas (`Tuple[float,float]`, *optional*, defaults to `(0.9, 0.999)`):
            Adam's betas parameters (b1, b2).
        eps (`float`, *optional*, defaults to 1e-06):
            Adam's epsilon for numerical stability.
        weight_decay (`float`, *optional*, defaults to 0.0):
            Decoupled weight decay to apply.
        correct_bias (`bool`, *optional*, defaults to `True`):
            Whether or not to correct bias in Adam (for instance, in Bert TF repository they use `False`).
        no_deprecation_warning (`bool`, *optional*, defaults to `False`):
            A flag used to disable the deprecation warning (set to `True` to disable the warning).
    """

    def __init__(
        self,
        params: Iterable[nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
        no_deprecation_warning: bool = False,
    ):
        if not no_deprecation_warning:
            warnings.warn(
                "This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch"
                " implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this"
                " warning",
                FutureWarning,
            )
        require_version("torch>=1.5.0")  # add_ with alpha
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr} - should be >= 0.0")
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[0]} - should be in [0.0, 1.0)")
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[1]} - should be in [0.0, 1.0)")
        if not 0.0 <= eps:
            raise ValueError(f"Invalid epsilon value: {eps} - should be >= 0.0")
        defaults = {"lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay, "correct_bias": correct_bias}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure: Callable = None):
        """
        Performs a single optimization step.

        Arguments:
            closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Decay the first and second moment running average coefficient
                # In-place operations to update the averages at the same time
                exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
                denom = exp_avg_sq.sqrt().add_(group["eps"])

                step_size = group["lr"]
                if group["correct_bias"]:  # No bias correction for Bert
                    bias_correction1 = 1.0 - beta1 ** state["step"]
                    bias_correction2 = 1.0 - beta2 ** state["step"]
                    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

                p.addcdiv_(exp_avg, denom, value=-step_size)

                # Just adding the square of the weights to the loss function is *not*
                # the correct way of using L2 regularization/weight decay with Adam,
                # since that will interact with the m and v parameters in strange ways.
                #
                # Instead we want to decay the weights in a manner that doesn't interact
                # with the m/v parameters. This is equivalent to adding the square
                # of the weights to the loss with plain (non-momentum) SGD.
                # Add weight decay at the end (fixed version)
                if group["weight_decay"] > 0.0:
                    p.add_(p, alpha=(-group["lr"] * group["weight_decay"]))

        return loss
cf) The optimizer's state_dict() has the following form:
{
                'state': {
                    0: {'momentum_buffer': tensor(...), ...},
                    1: {'momentum_buffer': tensor(...), ...},
                    2: {'momentum_buffer': tensor(...), ...},
                    3: {'momentum_buffer': tensor(...), ...}
                },
                'param_groups': [
                    {
                        'lr': 0.01,
                        'weight_decay': 0,
                        ...
                        'params': [0]
                    },
                    {
                        'lr': 0.001,
                        'weight_decay': 0.5,
                        ...
                        'params': [1, 2, 3]
                    }
                ]
            }
Reading the code, AdamW inherits from the Optimizer class.
As the state_dict layout above suggests, if len(state) == 0 means that no state exists yet for that parameter, i.e. its first step has not happened.
exp_avg corresponds to m_t and exp_avg_sq to v_t; the final parameter update happens in p.addcdiv_ and in the if group["weight_decay"] branch.

 


LR Schedules & Learning Rate Annealing

LR Schedule: the lr is changed during training according to a predefined schedule.

The difference is that the learning rate can also be increased during training!
With warm restarts, as in the figure, the model gets a chance to escape a local minimum.


LR Annealing: often used interchangeably with lr schedule, but refers to the lr decreasing monotonically with the iteration count.
Intuitively, a high learning rate at the start searches aggressively for a good region to converge in,
and a low learning rate at the end lets the model settle precisely into the convergence point.
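The two ideas can be written as small functions of the step index. This is a sketch with hypothetical helper names (warmup_cosine, cosine_warm_restarts): the first does linear warmup followed by monotonic cosine annealing, the second is cosine annealing with fixed-length warm restarts, where the rate jumps back up at the start of every cycle.

```python
import math


def warmup_cosine(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def cosine_warm_restarts(step, cycle_len, base_lr, min_lr=0.0):
    """Cosine annealing that restarts from base_lr every cycle_len steps."""
    t = step % cycle_len
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / cycle_len))


annealed = [warmup_cosine(s, total_steps=100, warmup_steps=10, base_lr=1.0) for s in range(101)]
restarted = [cosine_warm_restarts(s, cycle_len=10, base_lr=1.0) for s in range(21)]
print(annealed[0], annealed[10], annealed[100])  # 0.0 1.0 0.0
print(restarted[9] < restarted[10])              # True: the lr jumps back up at the restart
```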

 


Model Outputs

ModelOutput

Every model's output is an instance of a subclass of ModelOutput.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # batch size 1
outputs = model(**inputs, labels=labels)

# SequenceClassifierOutput(loss=tensor(0.4267, grad_fn=<NllLossBackward0>), 
#                           logits=tensor([[-0.0658,  0.5650]], grad_fn=<AddmmBackward0>), 
#                           hidden_states=None, attentions=None)
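The outputs object has logits and, because labels were passed, a loss; hidden_states and attentions are None here. Treated as a tuple, it only contains the attributes that are not None, so outputs[:2] returns the tuple (outputs.loss, outputs.logits).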
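The attribute, key, and tuple views of such an output object can be imitated with a small dataclass. ToyOutput below is a hypothetical stand-in written for illustration, not the library's ModelOutput; it only reproduces the behavior described above, where the tuple view skips attributes that are None.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyOutput:
    loss: Optional[float] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None
    attentions: Optional[Any] = None

    def to_tuple(self):
        """Tuple of the attributes that are not None, in declaration order."""
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)

    def __getitem__(self, key):
        if isinstance(key, str):     # dict-style access: out["logits"]
            return getattr(self, key)
        return self.to_tuple()[key]  # tuple-style access: out[0], out[:2]


out = ToyOutput(loss=0.43, logits=[[-0.07, 0.57]])
print(out.loss, out["logits"])
print(out[:2])  # (0.43, [[-0.07, 0.57]])
```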

Cf)
For CausalLM:
loss: Language modeling loss (for next-token prediction).

logits: Prediction scores of the LM_Head (scores for each vocabulary token before SoftMax)
= raw prediction values and are not bounded to a specific range

To output a word, a transformer goes through: project linearly -> apply softmax.
Note that the LM_Head itself is already trained during language-model pre-training; other task-specific heads (e.g. a classification head) are typically added at fine-tuning.
The LM_Head is the part that takes the model's output hidden_state and performs the prediction.
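The "logits -> softmax -> pick a token" step can be traced on a four-word toy vocabulary. The vocabulary and logit values below are made up, and softmax is written by hand; the point is only that softmax turns unbounded scores into probabilities without changing which entry is largest.

```python
import math


def softmax(logits):
    """Turn unbounded scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_words = ["[PAD]", "paris", "london", "berlin"]  # made-up vocabulary
toy_scores = [-3.0, 4.0, 1.5, 0.5]                  # made-up logits for one position

probs = softmax(toy_scores)
predicted_id = max(range(len(toy_scores)), key=toy_scores.__getitem__)  # argmax
print(toy_words[predicted_id], round(probs[predicted_id], 3))
```

This is why the BERT example that follows can call argmax on the logits directly: applying softmax first would select the same index.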
ex) BERT
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
print(f'logits: {logits}')  # torch.FloatTensor of shape (batch_size, sequence_length, vocab_size)

# Locate the [MASK] token so that its prediction can be inspected.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f'masked_index: {masked_index}')  # torch.LongTensor of shape (1,)

MASK_token = logits[0, masked_index]  # logits at the [MASK] position of the first sentence in the batch
print(f'MASK_Token: {MASK_token}')

predicted_token_id = MASK_token.argmax(dim=-1)  # index of the largest value along the last dim = token_id the model predicts at that position
print(f'predicted_token_id: {predicted_token_id}')


predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)  # prints: paris


# logits: tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
#          [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7618],
#          [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
#          ...,
#          [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
#          [-11.6598, -11.4274, -11.9267,  ...,  -9.8772, -10.2103,  -4.7594],
#          [-11.7267, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]],
#        grad_fn=<ViewBackward0>)
# masked_index: tensor([6])
# MASK_Token: tensor([[-3.7861, -3.8572, -3.5644,  ..., -2.5593, -3.1093, -4.3820]],
#        grad_fn=<IndexBackward0>)
# predicted_token_id: tensor([3000])
# paris


cf) The index returned by argmax is an index into the vocabulary, which can be checked as follows.

vocab = tokenizer.get_vocab()  # token -> id mapping
by_id = sorted(vocab.items(), key=lambda item: item[1])  # order the entries by id
for word, idx in by_id[:5]:  # first 5 vocabulary entries
    print(f"{word}: {idx}")
for word, idx in by_id[2990:3010]:  # the 20 entries around id 3000
    print(f"{word}: {idx}")
    
# [PAD]: 0
# [unused0]: 1
# [unused1]: 2
# [unused2]: 3
# [unused3]: 4
# jack: 2990
# fall: 2991
# raised: 2992
# itself: 2993
# stay: 2994
# true: 2995
# studio: 2996
# 1988: 2997
# sports: 2998
# replaced: 2999
# paris: 3000
# systems: 3001
# saint: 3002
# leader: 3003
# theatre: 3004
# whose: 3005
# market: 3006
# capital: 3007
# parents: 3008
# spanish: 3009

 


Trainer

Trainer

The Trainer class is optimized for 🤗 Transformers models
= the model must always return a tuple or a subclass of ModelOutput (when it returns a tuple, the loss is the first element)
= the model can compute the loss when the labels argument is provided.

Once the required settings are passed in through TrainingArguments, Trainer starts training without the user having to write a training loop by hand.
In addition, SFTTrainer from the TRL library wraps this Trainer class and, through features such as LoRA, quantization and DeepSpeed, scales efficiently to models of any size.

Before starting, install the Accelerate library, which is needed for distributed environments:
pip install accelerate
pip install accelerate --upgrade

Basic Usage


Checkpoints


Customizing


Callbacks & Logging


Accelerate & Trainer


TrainingArguments

Reference)
output_dir (str): The output directory where the model predictions and checkpoints will be written.
eval_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "no"): The evaluation strategy to adopt during training. Possible values are:
•	"no": No evaluation is done during training.
•	"steps": Evaluation is done (and logged) every eval_steps.
•	"epoch": Evaluation is done at the end of each epoch.
per_device_train_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
gradient_accumulation_steps (int, optional, defaults to 1): Number of update steps to accumulate the gradients for before performing a backward/update pass.
eval_accumulation_steps (int, optional): Number of prediction steps to accumulate the output tensors for before moving the results to the CPU. If left unset, all predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster, but requires more memory).
learning_rate (float, optional, defaults to 5e-5): The initial learning rate for the [AdamW] optimizer.
weight_decay (float, optional, defaults to 0): The weight decay applied by the [AdamW] optimizer to all layers except bias and LayerNorm weights.
max_grad_norm (float, optional, defaults to 1.0): Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0): Total number of training epochs to perform (if not an integer, the fractional part is performed as that fraction of the last epoch before training stops).
max_steps (int, optional, defaults to -1): If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training iterates over the dataset repeatedly until max_steps is reached.
eval_steps (int or float, optional): Number of update steps between two evaluations if eval_strategy="steps". Defaults to the same value as logging_steps if not set. Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of the total training steps.
lr_scheduler_type (str or [SchedulerType], optional, defaults to "linear"): The scheduler type to use. See the documentation of [SchedulerType] for all possible values.
lr_scheduler_kwargs ('dict', optional, defaults to {}): Extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0): Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0): Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
logging_dir (str, optional): TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
logging_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The logging strategy to adopt during training. Possible values are:
•	"no": No logging is done during training.
•	"epoch": Logging is done at the end of each epoch.
•	"steps": Logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False): Whether to log the first global_step or not.
logging_steps (int or float, optional, defaults to 500): Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of the total training steps.
run_name (str, optional, defaults to output_dir): A descriptor for the run, typically used for wandb and mlflow logging. If not specified, it is the same as output_dir.
save_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The checkpoint save strategy to adopt during training. Possible values are:
•	"no": No save is done during training.
•	"epoch": Save is done at the end of each epoch.
•	"steps": Save is done every save_steps.
If "epoch" or "steps" is chosen, a save is also always performed at the very end of training.
save_total_limit (int, optional): If a value is passed, limits the total number of checkpoints and deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model is always retained in addition to the most recent ones.
For example, with save_total_limit=5 and load_best_model_at_end, the last four checkpoints are always retained alongside the best model. With save_total_limit=1 and load_best_model_at_end, two checkpoints may be saved if the last checkpoint and the best checkpoint differ.
save_safetensors (bool, optional, defaults to True): Use safetensors for saving and loading state_dicts instead of the default torch.load and torch.save.
save_on_each_node (bool, optional, defaults to False): When doing multi-node distributed training, whether to save models and checkpoints on each node. By default they are saved only on the main node.
seed (int, optional, defaults to 42): Random seed set at the beginning of training. To ensure reproducibility across runs, use the [~Trainer.model_init] function to instantiate the model if it has randomly initialized parameters.
data_seed (int, optional): Random seed to be used with data samplers. If not set, the random sampler for data sampling uses the same seed as seed. This makes it possible to keep data sampling reproducible independently of the model seed.
bf16 (bool, optional, defaults to False): Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires an Ampere or newer NVIDIA architecture, a CPU (use_cpu), or an Ascend NPU. This is an experimental API and may change.
fp16 (bool, optional, defaults to False): Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
half_precision_backend (str, optional, defaults to "auto"): The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp". "auto" uses CPU/CUDA AMP or APEX depending on the detected PyTorch version, while the other choices force the requested backend.
bf16_full_eval (bool, optional, defaults to False): Whether to use full bfloat16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values. This is an experimental API and may change.
fp16_full_eval (bool, optional, defaults to False): Whether to use full float16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values.
tf32 (bool, optional): Whether to enable the TF32 mode available on Ampere and newer GPU architectures. The default follows the default of torch.backends.cuda.matmul.allow_tf32. See the TF32 documentation for details. This is an experimental API and may change.
local_rank (int, optional, defaults to -1): Rank of the process during distributed training.
ddp_backend (str, optional): The backend to use for distributed training. Must be one of "nccl", "mpi", "ccl", "gloo", "hccl".
dataloader_drop_last (bool, optional, defaults to False): Whether to drop the last incomplete batch when the length of the dataset is not divisible by the batch size.
dataloader_num_workers (int, optional, defaults to 0): Number of subprocesses to use for data loading (PyTorch only). 0 means that the data is loaded in the main process.
remove_unused_columns (bool, optional, defaults to True): Whether to automatically remove the columns that are not used by the model's forward method.
label_names (List[str], optional): The list of keys in the input dictionary that correspond to the labels. Defaults to the list of label arguments used by the model.
load_best_model_at_end (bool, optional, defaults to False): Whether to load the best model at the end of training. When this option is enabled, the best checkpoint is always saved. See save_total_limit for details.
<Tip>
            When set to `True`, the parameter `save_strategy` needs to be the same as `eval_strategy`, and in
            the case it is "steps", `save_steps` must be a round multiple of `eval_steps`.
</Tip>
metric_for_best_model (str, optional): Used together with load_best_model_at_end to specify the metric for comparing two models. Must be the name of a metric returned by the evaluation. If unspecified and load_best_model_at_end=True, it defaults to "loss" (i.e. eval_loss is used). If this value is set, greater_is_better defaults to True; set it to False if a lower metric is better.
greater_is_better (bool, optional): Used together with load_best_model_at_end and metric_for_best_model to specify whether a better model should have a higher metric. Defaults to:
•	True if metric_for_best_model is set to a value that does not end in "loss".
•	False if metric_for_best_model is not set, or is set to a value that ends in "loss".

fsdp (bool, str or list of [~trainer_utils.FSDPOption], optional, defaults to ''): Use PyTorch Fully Sharded Data Parallel training (in distributed training only).
fsdp_config (str or dict, optional): Config to be used with fsdp. The value is either the location of an fsdp json config file (e.g., fsdp_config.json) or an already loaded json file (dict).
deepspeed (str or dict, optional): Use DeepSpeed. This is an experimental feature and its API may change. The value is either the location of a DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file (dict).
accelerator_config (str, dict, or AcceleratorConfig, optional): Config to be used with the internal Accelerator implementation. The value is the location of an accelerator json config file (e.g., accelerator_config.json), an already loaded json file (dict), or an instance of [~trainer_pt_utils.AcceleratorConfig].
label_smoothing_factor (float, optional, defaults to 0.0): The label smoothing factor to use. 0 means no label smoothing; any other value changes the one-hot encoded labels accordingly.
optim (str or [training_args.OptimizerNames], optional, defaults to "adamw_torch"): The optimizer to use. Choose from adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, adafactor.
optim_args (str, optional): Optional arguments supplied to AnyPrecisionAdamW.
group_by_length (bool, optional, defaults to False): Whether to group samples of roughly the same length together in the training dataset (to minimize padding and be more efficient). Only useful when applying dynamic padding.
report_to (str or List[str], optional, defaults to "all"): The list of integrations to report results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", "wandb". "all" reports to all installed integrations, "none" reports to no integration.
ddp_find_unused_parameters (bool, optional): When using distributed training, the value of the find_unused_parameters flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional): When using distributed training, the value of the bucket_cap_mb flag passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional): When using distributed training, the value of the broadcast_buffers flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
dataloader_persistent_workers (bool, optional, defaults to False): If True, the data loader does not shut down the worker processes after a dataset has been consumed once, which keeps the workers' dataset instances alive. This can speed up training but increases RAM usage.
push_to_hub (bool, optional, defaults to False): Whether to push the model to the Hub every time the model is saved. If enabled, output_dir becomes a git directory and its content is pushed each time a save is triggered (depending on save_strategy). Calling [~Trainer.save_model] also triggers a push.
resume_from_checkpoint (str, optional): The path to a folder with a valid checkpoint for the model. This argument is not used directly by [Trainer]; it is intended to be used by training/evaluation scripts instead. See the example scripts for details.
hub_model_id (str, optional): The name of the repository to keep in sync with the local output_dir. It can be a simple model ID, in which case the model is pushed to your namespace; otherwise it should be the whole repository name (e.g., "user_name/model"). Defaults to user_name/output_dir_name, where output_dir_name is the name of output_dir.
hub_strategy (str or [~trainer_utils.HubStrategy], optional, defaults to "every_save"): Defines the scope of what is pushed to the Hub and when. Possible values are:
•	"end": Push the model, its configuration, the tokenizer (if passed) and a draft of a model card.
•	"every_save": Push the model, its configuration, the tokenizer (if passed) and a draft of a model card on every save. Pushes are asynchronous; if saves are very frequent, a new push is only attempted once the previous one has finished. A last push is made with the final model at the end of training.
•	"checkpoint": Like "every_save", but the latest checkpoint is also pushed to a subfolder named last-checkpoint, so training can easily be resumed with trainer.train(resume_from_checkpoint="last-checkpoint").
•	"all_checkpoints": Like "checkpoint", but all checkpoints are pushed as they appear in the output folder (so the final repository has one checkpoint folder per folder).
hub_token (str, optional): The token to use when pushing the model to the Hub. Defaults to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional, defaults to False): If True, the Hub repository is made private.
hub_always_push (bool, optional, defaults to False): Unless this is True, pushing a checkpoint is skipped when the previous push has not finished.
gradient_checkpointing (bool, optional, defaults to False): Whether to use gradient checkpointing to save memory, at the expense of a slower backward pass.
auto_find_batch_size (bool, optional, defaults to False): Whether to automatically find a batch size that fits into memory, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
ray_scope (str, optional, defaults to "last"): The scope to use when doing hyperparameter search with Ray. With the default "last", Ray uses the last checkpoint of all trials, compares those, and selects the best one. Other options are available; see the Ray documentation for details.
ddp_timeout (int, optional, defaults to 1800): The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts in distributed runs. See the PyTorch documentation for details.
torch_compile (bool, optional, defaults to False): Whether to compile the model using PyTorch 2.0 torch.compile. This uses the default values of the torch.compile API. The defaults can be customized with the torch_compile_backend and torch_compile_mode arguments, but not every value is guaranteed to work. This flag and the whole compile API are experimental and may change in future releases.
torch_compile_backend (str, optional): The backend to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
torch_compile_mode (str, optional): The mode to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
split_batches (bool, optional): Whether the batches produced by the data loader are split across devices during distributed training. If True, the actual batch size used is the same on any kind of distributed process, but it must be an integer multiple of the number of processes in use.
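Several of the arguments above interact arithmetically (batch size, gradient accumulation, epochs, and ratio-style values for warmup and logging). The following plain-Python sketch shows that arithmetic under assumptions: training_plan and num_devices are hypothetical names, and the way remainders are rounded here is an assumption of the sketch, not a statement about how Trainer rounds internally.

```python
import math

def training_plan(num_samples, per_device_train_batch_size=8, num_devices=1,
                  gradient_accumulation_steps=1, num_train_epochs=3.0, max_steps=-1,
                  warmup_ratio=0.0, warmup_steps=0, logging_steps=500):
    """Rough arithmetic linking a few TrainingArguments values (rounding is assumed)."""
    samples_per_forward = per_device_train_batch_size * num_devices
    # samples that contribute to one optimizer update
    effective_batch_size = samples_per_forward * gradient_accumulation_steps
    batches_per_epoch = math.ceil(num_samples / samples_per_forward)
    updates_per_epoch = max(1, batches_per_epoch // gradient_accumulation_steps)
    # a positive max_steps overrides num_train_epochs
    if max_steps > 0:
        total_updates = max_steps
    else:
        total_updates = math.ceil(num_train_epochs * updates_per_epoch)
    # an explicit warmup_steps overrides warmup_ratio
    if warmup_steps == 0:
        warmup_steps = math.ceil(total_updates * warmup_ratio)
    # values below 1 are read as a ratio of the total training steps
    if logging_steps < 1:
        logging_steps = math.ceil(total_updates * logging_steps)
    return {
        "effective_batch_size": effective_batch_size,
        "total_updates": total_updates,
        "warmup_steps": warmup_steps,
        "logging_steps": logging_steps,
    }

plan = training_plan(num_samples=10_000, per_device_train_batch_size=8, num_devices=2,
                     gradient_accumulation_steps=4, num_train_epochs=3.0,
                     warmup_ratio=0.1, logging_steps=0.05)
print(plan)
```

The point of the sketch is the relationship between the arguments: raising gradient_accumulation_steps grows the effective batch size while shrinking the number of optimizer updates, which in turn shifts anything expressed as a ratio of total steps.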




 


DeepSpeed

trust_remote_code=True

The trust_remote_code=True setting shows up often with Chinese models. What does it actually do?
It is needed when the model architecture has not yet been added to "huggingface/transformers":
from transformers import AutoTokenizer, AutoModel
# device_map="cuda" (needs accelerate) places the weights on the GPU; the original passed device='cuda', which is not a documented from_pretrained argument
model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True, device_map="cuda")
It means: "download the model code from the Hugging Face repo 'internlm/internlm-chat-7b' and run it together with the weights". Since that code is executed locally, enable it only for repositories you trust.
If this value is False, the library uses the model architecture built into huggingface/transformers and downloads only the weights.


Collate: to gather together.

As the word suggests, a Data Collator plays the following role.

 

 

Data Collator

Data Collator

Bundles a list of samples into the tensors of a "single training mini-batch".
Default Data Collator
As shown below, the data_collator groups samples from train_dataset into mini-batches that are then fed to the model for training.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
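What a default-style collator does can be sketched without any library. The function below is a simplified stand-in under assumptions: toy_collate is a hypothetical name, it returns plain lists instead of tensors, and it assumes every sample already has the same length (no padding).

```python
def toy_collate(samples):
    """Turn a list of per-sample dicts into one dict of batched lists."""
    keys = samples[0].keys()
    batch = {key: [sample[key] for sample in samples] for key in keys}
    # a per-sample "label" key is exposed to the model as batched "labels"
    if "label" in batch:
        batch["labels"] = batch.pop("label")
    return batch

samples = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1], "label": 1},
    {"input_ids": [101, 2088, 102], "attention_mask": [1, 1, 1], "label": 0},
]
batch = toy_collate(samples)
print(batch)
```

Each key of the resulting batch holds one row per sample, which is the shape a model's forward method expects once the lists are converted to tensors.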





batch["input_ids"] , batch["labels"] ?

Unlike the above, however, most custom Data Collator functions look like the code below, which contains the somewhat unfamiliar names input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        # processor is assumed to be a multimodal processor loaded elsewhere
        self.processor = processor

    def __call__(self, examples): 
        texts = []
        images = []
        for example in examples:
            image, question, answer = example 
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}] # <-- verified that the data arrives correctly up to here
            # tokenize=False keeps the templated prompt as a string for the processor call below
            text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
            texts.append(text)
            images.append(image)

        # pass the whole lists (not just the last loop item) so that every example is in the batch
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        # padding positions get -100 so that the loss ignores them
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)

So what exactly are batch["input_ids"] and batch["labels"]?

The data_collator described earlier has the same shape: there too, inputs and labels appear.

Every model is different, yet models share similarities with one another
= most models use the same inputs!

∙ Input IDs

The input IDs are often "the only required parameter" passed to the model as input.
They are token indices: numerical representations of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but "the underlying mechanism is the same".

ex)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"


The tokenizer splits the sequence (sentence) into tokens that exist in the tokenizer vocab:

tokenized_sequence = tokenizer.tokenize(sequence)


A token is either a word or a subword:

print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA" and "M".
# How is it indicated that these tokens are parts of the same word rather than separate words?
# --> a double-hash (##) prefix is added in front of "RA" and "M".


inputs = tokenizer(sequence)


With this call the tokens are converted into IDs that the model can understand.
The tokenizer returns them as a "dictionary" with input_ids as the key and the ID values as the value, which is the form the model works with internally:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

The tokenizer also automatically adds "special tokens" where the model needs them; 
these are "special IDs" that the model sometimes uses.

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# Output: [CLS] A Titan RTX has 24GB of VRAM [SEP]
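The subword splitting above follows a greedy longest-match rule, which can be sketched in plain Python. This is a toy under assumptions: wordpiece and toy_vocab are hypothetical names, the vocabulary contains only the pieces from the example, and real tokenizers add normalization and other steps not shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # try the longest remaining substring first and shrink it until it is in the vocab
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary holding only the pieces needed for the example sentence
toy_vocab = {"A", "Titan", "R", "##T", "##X", "has", "24", "##GB", "of", "V", "##RA", "##M"}
sentence = "A Titan RTX has 24GB of VRAM"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the split shown above, including the ## prefix on every piece that continues a word.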





∙ Attention Mask

The attention mask is an optional argument used when batching sequences together; 
it indicates "which tokens the model should attend to and which it should not".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# Output: (8, 19)
As shown above, the encoded lengths differ, so the two sequences "cannot be put together in the same tensor" as they are.
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# Output: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# Output: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention_mask is found under the "attention_mask" key of the dictionary returned by the tokenizer: 1 marks a token to attend to, 0 marks padding.
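Padding and the matching attention mask can be reproduced by hand. The sketch below uses the two encoded sequences printed above; pad_batch is a hypothetical helper, and pad id 0 is taken from the zeros in that output.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

seq_a = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
seq_b = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
         1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]
batch = pad_batch([seq_a, seq_b])
print(batch["attention_mask"][0])
```

The shorter sequence gains 11 padding positions, and its mask carries a 0 at exactly those positions.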


∙ Token Type IDs

Some models are built for classification over sentence pairs or for question answering.
Such models need two "different sequences joined into a single input_ids" entry.
= this is done with special tokens such as [CLS] and [SEP].

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# Output: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
In the example above, the two sequences are passed to the tokenizer as two arguments, and it automatically builds the combined sentence shown.
For some models this is enough to tell where one sequence ends and the next one begins.

Other models, such as BERT, also use token type IDs; the tokenizer returns this mask as token_type_ids.
encoded_dict['token_type_ids']
# Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 

The first sequence, the context used for the question, is set to all 0s, 
and the second sequence, the question itself, is set to all 1s.
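The 0/1 pattern follows directly from the [CLS] A [SEP] B [SEP] layout. A minimal sketch, assuming a BERT-style layout: pair_token_type_ids is a hypothetical helper, and the segment lengths 8 and 8 are chosen only because they reproduce the 10 zeros and 9 ones printed above.

```python
def pair_token_type_ids(len_a, len_b):
    """Segment ids for the layout: [CLS] A [SEP] B [SEP]."""
    # [CLS], the tokens of A and the first [SEP] belong to segment 0
    first = [0] * (len_a + 2)
    # the tokens of B and the final [SEP] belong to segment 1
    second = [1] * (len_b + 1)
    return first + second

token_type_ids = pair_token_type_ids(8, 8)
print(token_type_ids)
```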


∙ Position IDs

RNN: the position of each token is built in (tokens are processed in order). 
Transformer: not aware of the position of each token ❌


∴ position_ids is an optional parameter that the model uses to identify the position of each token in the list of tokens.

If no position_ids are passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
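Both ideas can be sketched in plain Python. The block is illustrative only: default_position_ids and sinusoidal_embedding are hypothetical helpers, 512 is an assumed value for max_position_embeddings, and the sinusoidal formula is the one from the original Transformer paper rather than something a particular model in this post is known to use.

```python
import math

def default_position_ids(seq_len, max_position_embeddings=512):
    """Absolute position ids 0..seq_len-1, as used when none are passed in."""
    if seq_len > max_position_embeddings:
        raise ValueError("sequence is longer than the position embedding table")
    return list(range(seq_len))

def sinusoidal_embedding(position, dim):
    """Fixed sin/cos position vector: even slots use sin, odd slots use cos."""
    vec = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        if i + 1 < dim:
            vec.append(math.cos(angle))
    return vec

print(default_position_ids(5))
print(sinusoidal_embedding(0, 4))
```

Learned absolute embeddings look up one vector per id in a table of size max_position_embeddings, which is why the ids must stay inside that range, whereas the sinusoidal variant computes the vector from the position directly.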




∙ Labels 

Labels are an optional argument that can be passed so that the model computes the loss itself.
In other words, the labels should be the expected prediction of the model: a standard loss is used to compute the loss between the predictions and the expected values (the labels).


The labels differ depending on the model head:

  • AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value in the batch is the expected label of the entire sequence.

  • AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token.

  • AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token: the labels are the token ids of the masked tokens, and the remaining positions hold a value to be ignored (usually -100).

  • Sequence-to-sequence models (AutoModelForSeq2SeqLM, e.g. BartForConditionalGeneration): the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value is the target sequence associated with each input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention masks internally, so these usually do not need to be supplied. This does not apply to models that use the Encoder-Decoder framework. See the documentation of each model for details on the labels of that specific model.

Base models (BertModel etc.) do not accept labels, because they are the plain transformer models and simply output features.
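Why -100 is used as the "ignore" value becomes clear when the loss is written out. The sketch below is a plain-Python illustration under assumptions: the helper names and the toy vocabulary of size 4 are hypothetical, there is no label shifting, and it only mirrors the idea that positions labelled -100 are left out of the average.

```python
import math

IGNORE_INDEX = -100  # positions carrying this label do not contribute to the loss

def make_labels(input_ids, pad_token_id):
    """Copy the input ids and blank out padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def cross_entropy(logits_per_token, labels):
    """Mean negative log-likelihood over the positions that are not ignored.

    Assumes at least one position has a real label.
    """
    losses = []
    for logits, label in zip(logits_per_token, labels):
        if label == IGNORE_INDEX:
            continue
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # log-sum-exp
        losses.append(log_z - logits[label])
    return sum(losses) / len(losses)

input_ids = [2, 1, 3, 0, 0]  # 0 is the pad id in this toy vocabulary of size 4
labels = make_labels(input_ids, pad_token_id=0)
logits = [
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [5.0, 0.0, 0.0, 0.0],  # padded position: whatever the model says here is ignored
    [0.0, 5.0, 0.0, 0.0],  # padded position
]
loss = cross_entropy(logits, labels)
print(labels, loss)
```

Changing the logits at the two padded positions leaves the loss untouched, which is the effect the collator earlier in the post obtains by writing -100 into batch["labels"].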




∙ Decoder input IDs

This input is specific to encoder-decoder models and contains the input IDs that will be fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs from the labels on their own.
For such models, passing the labels is the preferred way to handle training.

See each model's documentation for how it handles these input IDs during sequence-to-sequence training.
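
How decoder input IDs can be derived from labels is easiest to see with a shift-right sketch. This is a simplified illustration of the idea, not the exact code of any model; the token IDs and the use of 0 as both pad and decoder start token are assumptions for the example.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id, ignore_index=-100):
    """Build decoder input IDs by shifting the labels one step to the right."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # Ignored label positions cannot be fed to the decoder, so use padding there.
    return [pad_token_id if tok == ignore_index else tok for tok in shifted]


# Hypothetical target IDs, padded with -100 so the loss skips the padding.
labels = [8774, 296, 1, -100, -100]
decoder_input_ids = shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=0)
print(decoder_input_ids)
```

At each step the decoder sees the previous target token and is trained to predict the current one, which is why passing only the labels is enough.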



∙ Feed Forward Chunking

In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often larger than the hidden size of the model (e.g. bert-base-uncased).

For an input of size [batch_size, sequence_length], the memory needed to store the intermediate feed forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large share of memory use.

The authors of Reformer: The Efficient Transformer noticed that, because the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and then concatenate them into [batch_size, sequence_length, config.hidden_size], with n = sequence_length.

This trades increased computation time for reduced memory use, while yielding a mathematically identical result.

For models that use the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel and therefore the trade-off between memory and time complexity. If chunk_size is set to 0, no feed forward chunking is done.
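
The equivalence claimed above can be checked with a toy, pure-Python feed forward network. This is a sketch under simplifying assumptions (nested lists instead of tensors, ReLU activation, made-up sizes); `feed_forward` and `chunked_feed_forward` are names invented for the illustration, and the example shows the identical result rather than the memory saving.

```python
import random


def feed_forward(tokens, w_in, w_out):
    """Position-wise FFN: every token vector is transformed independently."""
    outputs = []
    for x in tokens:
        # hidden_size -> intermediate_size, followed by ReLU
        inter = [max(0.0, sum(a * b for a, b in zip(x, row))) for row in w_in]
        # intermediate_size -> hidden_size
        outputs.append([sum(a * b for a, b in zip(inter, row)) for row in w_out])
    return outputs


def chunked_feed_forward(tokens, w_in, w_out, chunk_size):
    """Apply the FFN to chunk_size tokens at a time; 0 disables chunking."""
    if chunk_size == 0:
        return feed_forward(tokens, w_in, w_out)
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        outputs.extend(feed_forward(tokens[start:start + chunk_size], w_in, w_out))
    return outputs


random.seed(0)
hidden_size, intermediate_size, seq_len = 4, 16, 6
w_in = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(intermediate_size)]
w_out = [[random.uniform(-1, 1) for _ in range(intermediate_size)] for _ in range(hidden_size)]
tokens = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]

full = chunked_feed_forward(tokens, w_in, w_out, chunk_size=0)
chunked = chunked_feed_forward(tokens, w_in, w_out, chunk_size=2)
print(full == chunked)
```

Because no token depends on any other inside the feed forward layers, splitting along the sequence dimension changes only how many intermediate vectors exist at once.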


QLoRA hands-on with MLLMs (InternVL)

Step 1. Import the required libraries:

import os

import torch
import torch.nn as nn
import bitsandbytes as bnb
import transformers

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel, 
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
    pipeline,
    TrainingArguments,
)


Step 2. Load the model, then run prepare_model_for_kbit_training(model)

devices = [0]#[0, 3]
max_memory = {i: '49140MiB' for i in devices}

model_name = 'OpenGVLab/InternVL2-8B'


model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    cache_dir='/data/huggingface_models',
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory,
    quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4'
        ),
)

# Print the model architecture
print(model)

# Attach a get_input_embeddings method to the model
def get_input_embeddings(self):
    if hasattr(self, 'embed_tokens'):
        return self.embed_tokens
    elif hasattr(self, 'language_model') and hasattr(self.language_model.model, 'tok_embeddings'):
        return self.language_model.model.tok_embeddings
    else:
        raise NotImplementedError("The model does not have an attribute 'embed_tokens' or 'language_model.model.tok_embeddings'.")

model.get_input_embeddings = get_input_embeddings.__get__(model, type(model))

# Hand-written version of prepare_model_for_kbit_training
def prepare_model_for_kbit_training(model):
    for param in model.parameters():
        param.requires_grad = False  # disable gradient computation for every parameter

    if hasattr(model, 'model') and hasattr(model.model, 'tok_embeddings'):
        for param in model.model.tok_embeddings.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only
    elif hasattr(model, 'embed_tokens'):
        for param in model.embed_tokens.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only

    # Gradients can be enabled for other specific layers if needed
    # Example:
    # if hasattr(model, 'some_other_layer'):
    #     for param in model.some_other_layer.parameters():
    #         param.requires_grad = True

    return model

model = prepare_model_for_kbit_training(model)


Step 3. Select the layers to attach QLoRA to:

def find_all_linear_names(model, train_mode):
    assert train_mode in ['lora', 'qlora']
    cls = bnb.nn.Linear4bit if train_mode == 'qlora' else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # skip modules that belong to the LLM head
        lora_module_names.remove('lm_head')
    
    return list(lora_module_names)


config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=find_all_linear_names(model, 'qlora'),
    lora_dropout=0.05,
    bias="none",
    task_type="QUESTION_ANS"  # one of: CAUSAL_LM, FEATURE_EXTRACTION, QUESTION_ANS, SEQ_2_SEQ_LM, SEQ_CLS, TOKEN_CLS
)

# config has to exist before its target_modules can be inspected or edited
print(sorted(config.target_modules))  # ['1', 'output', 'w1', 'w2', 'w3', 'wo', 'wqkv']
config.target_modules.remove('1')  # drop the module that belongs to the LLM head

model = get_peft_model(model, config)

After that, run training with a trainer.
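
The name handling in find_all_linear_names can be tried on plain strings to see why short names such as 'wqkv', or even a bare '1', end up in target_modules. The module paths below are hypothetical (real names depend on the checkpoint), and `last_name_components` is a helper written only for this sketch.

```python
def last_name_components(module_names, excluded=("lm_head",)):
    """Mimic the name handling of find_all_linear_names on plain strings."""
    picked = set()
    for name in module_names:
        parts = name.split(".")
        picked.add(parts[0] if len(parts) == 1 else parts[-1])
    return sorted(picked - set(excluded))


# Hypothetical module paths; the real ones depend on the loaded model.
names = [
    "language_model.model.layers.0.attention.wqkv",
    "language_model.model.layers.0.attention.wo",
    "language_model.model.layers.0.feed_forward.w1",
    "mlp1.1",   # assumed: a linear layer stored at index 1 of a Sequential
    "lm_head",
]
print(last_name_components(names))
```

Since only the last path component is kept, a linear layer inside a container with numeric children shows up as a number, which is one plausible reason a name like '1' needs to be removed by hand.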

Result after attaching QLoRA: (screenshot not reproduced here)

Which trainer? Trainer vs SFTTrainer

∙ Trainer

 - General-purpose training: used to train models from scratch on supervised tasks such as text classification, question answering, and summarization.
 - Highly customizable: offers a wide range of configuration options for tuning hyperparameters, optimizers, schedulers, logging, metrics, and more.
 - Handles complex training workflows: supports gradient accumulation, early stopping, checkpointing, distributed training, and similar features.
 - Needs more data: generally requires a larger dataset to train effectively.



∙ SFTTrainer

 - Supervised fine-tuning (SFT): optimized for fine-tuning PLMs on small datasets.
 - Simple interface: provides a streamlined workflow with less configuration.
 - Efficient memory use: reduces memory consumption during training with techniques such as PEFT and packing optimization.
 - Fast training: reaches similar or better accuracy with smaller datasets and shorter training time.



∙ Choosing between Trainer and SFTTrainer:

 - Use Trainer:
when you have a large dataset and need extensive customization of the training loop or a complex training workflow.
Data preprocessing and the data collator must be set up by the user, using general-purpose preprocessing.

 - Use SFTTrainer:
when you have a PLM and a relatively small dataset, and want a simpler, faster fine-tuning experience with efficient memory use.
PEFT is supported out of the box; efficient fine-tuning is easy to set up through options such as `peft_config`.
Data preprocessing and the data collator are also optimized for efficient fine-tuning.
Text data is easy to handle through fields such as `dataset_text_field`.
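
The "packing" mentioned above can be illustrated with a small sketch: short tokenized examples are concatenated and cut into fixed-size blocks so that little of each batch is wasted on padding. This is a simplified illustration, not the implementation used by SFTTrainer; the token IDs, the EOS ID of 0, and the function name are assumptions.

```python
def pack_sequences(tokenized_examples, block_size, eos_token_id):
    """Concatenate examples (separated by EOS) and cut fixed-size blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_token_id])
    # Drop the incomplete tail so every block has exactly block_size tokens.
    usable = len(stream) - len(stream) % block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # hypothetical token IDs
blocks = pack_sequences(examples, block_size=4, eos_token_id=0)
print(blocks)
```

Without packing, these three examples would be padded to the longest length; with packing, every position in every block holds a real token or a separator.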



Feature | Trainer | SFTTrainer
Purpose | General-purpose training | Supervised fine-tuning of PLMs
Customization | Highly customizable | Simpler interface with fewer options
Training workflow | Handles complex workflows | Streamlined workflow
Data requirements | Large datasets | Smaller datasets
Memory usage | Higher | Lower with PEFT & packing optimization
Training speed | Slower | Faster with smaller datasets


[PEFT]: Parameter Efficient Fine-Tuning


What is PEFT?

A technique that, when adapting PLMs to a specific task, freezes most of the parameters and fine-tunes only a small number of them.
PEFT keeps model performance while reducing the number of trainable parameters.
It also lowers the risk of catastrophic forgetting.
Hugging Face's 🤗 PEFT library packages these methods for fine-tuning on downstream tasks.

What is catastrophic forgetting?

The phenomenon in which a model, while learning new information, forgets part of the knowledge it had learned before.



Main Concept

  • Reduced Parameter Fine-tuning
    The focus is on freezing the vast majority of the parameters of a pretrained LLM and fine-tuning only a small number of additional parameters.
    This selective fine-tuning sharply reduces the computational requirements.
  • Overcoming Catastrophic Forgetting
    Catastrophic forgetting arises when the entire LLM is fine-tuned; PEFT can mitigate it.
    With PEFT, the model can learn a new downstream task while preserving the knowledge of its pretrained state.
  • Application Across Modalities
    PEFT extends beyond traditional natural language processing (NLP) to many other domains.
    It has been applied successfully to a variety of modalities, including computer vision (CV) with Stable Diffusion and LayoutLM,
    and audio with Whisper and XLS-R.
  • Supported PEFT Methods
    The library supports a variety of PEFT methods.
    LoRA (Low-Rank Adaptation), prefix tuning, prompt tuning, and others are each designed for particular fine-tuning requirements and scenarios.

The output activations of the original (frozen) pretrained weights (left) are augmented by a low-rank adapter composed of weight matrices A and B (right).
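
To get a feel for how few parameters the low-rank adapter adds, the sketch below counts trainable parameters for a single weight matrix. The 4096 x 4096 layer size is an assumed example; the rank of 16 matches the LoraConfig used in the hands-on post.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates W (d_out x d_in). LoRA keeps W frozen and trains
    only B (d_out x rank) and A (rank x d_in), so the layer computes
    h = W x + (alpha / rank) * B A x.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed layer size for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(full, lora, f"{lora / full:.2%}")
```

For this assumed layer the adapter trains well under 1% of the parameters that full fine-tuning would update.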

 

[Q-LoRA]: Quantized-LoRA

What is Q-LoRA?

Released in May 2023 and later presented at NeurIPS 2023, QLoRA combines quantization with LoRA into a method that makes it "possible to fine-tune a 65B model on a single A6000 GPU".
QLoRA is essentially the existing LoRA plus a new quantization scheme.
As in LoRA, the weights of the base PLM are frozen and only the LoRA adapter weights are trainable; the difference is that the frozen PLM weights are quantized to 4 bits.
The main contribution of QLoRA therefore lies mostly in its quantization methodology.

What is quantization?

Converting weights and activation outputs so that they are represented with fewer bits.
That is, it gives up a little information and lowers precision,
but it is a model-compression method that "gains efficiency by reducing the storage and computation required".

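The idea of trading precision for storage can be sketched with a uniform absmax quantizer. Note the assumptions: QLoRA itself uses the non-uniform NF4 data type with double quantization, whereas this sketch uses plain signed 4-bit integers with one scale per block, and the weight values are made up.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integer codes using a single absmax scale."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.12, -0.04]  # hypothetical block of weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
print(codes)
print([round(w, 3) for w in restored])
```

Each weight is now stored as a small integer plus one shared scale, and the reconstruction error per value stays within half a quantization step.
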


How to Use in MLLMs...?

So how can this be applied to MLLMs? There are many kinds of MLLMs, but VLMs make the simplest example.
Because Q-LoRA and LoRA are PEFT methods, they apply equally to LLMs and MLLMs.
Taking a VLM (Vision Encoder + LLM Decoder) as the reference:

  • To strengthen language ability, apply PEFT to the LLM only.
  • To strengthen visual ability, apply PEFT to the Vision Encoder only.
  • To strengthen both, apply PEFT to both the Encoder and the Decoder.
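
One way to express "LLM only" or "Vision Encoder only" is to select adapter targets by module path. The sketch below filters hypothetical module names with regular expressions; as far as I know, PEFT's LoraConfig also accepts a regex string for target_modules, but treat that, the module paths, and the patterns here as assumptions to verify against the actual model and library version.

```python
import re

# Hypothetical module paths for a vision encoder + LLM decoder model.
module_names = [
    "vision_model.encoder.layers.0.attn.qkv",
    "vision_model.encoder.layers.0.mlp.fc1",
    "language_model.model.layers.0.attention.wqkv",
    "language_model.model.layers.0.feed_forward.w1",
]


def select_modules(pattern, names):
    """Return the names that fully match the regular expression."""
    return [n for n in names if re.fullmatch(pattern, n)]


llm_only = select_modules(r"language_model\..*\.(wqkv|wo|w1|w2|w3)", module_names)
vision_only = select_modules(r"vision_model\..*\.(qkv|proj|fc1|fc2)", module_names)
print(llm_only)
print(vision_only)
```

Adapting both components then amounts to using the union of the two selections.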

Reference Code:

A Definitive Guide to QLoRA: Fine-tuning Falcon-7b with PEFT (medium.com)
Finetuning Llama2 with QLoRA, TorchTune documentation (pytorch.org)
Reference: https://github.com/V2LLAIN/Transformers-Tutorials/blob/master/qlora_baseline.ipynb

What is DeepSpeed?

# finetune_qlora.sh

deepspeed ovis/train/train.py \
        --deepspeed scripts/zero2.json \
        ...


Sticking to my own way of launching training is fine, but since most users seem to launch it this way, it is worth understanding, so let's look into it.


DeepSpeed...?

A deep learning optimization library from Microsoft that helps make model training and inference fast and efficient.

Training device setups:

  • CPU
  • Single GPU
  • 1 Node, Multi GPU
  • Multi Node, Multi GPU --> used for training very large models such as GPT-4.

Distributed training approaches:

  • Data Parallel: one device splits the data, and the results processed on each device are gathered back for the computation.
    --> One device uses more memory than the others, so a memory imbalance problem occurs!
  • Distributed Data Parallel: each device is treated as its own process, and each process loads its own copy of the model.
    Gradients are synchronized internally only during the backward pass --> no memory imbalance problem.
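
The gradient synchronization in Distributed Data Parallel can be simulated without any GPU: each "process" computes a gradient on its own shard, and averaging the local gradients reproduces the full-batch gradient. This is a toy simulation with a one-parameter model and made-up data, not how torch.distributed is actually invoked.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # made-up (x, y) pairs
w = 1.5

# Each simulated process sees an equally sized shard of the batch.
shards = [data[0:2], data[2:4]]
local_grads = [grad_mse(w, shard) for shard in shards]

# All-reduce (average) so that every process ends up with the same gradient.
synced_grad = sum(local_grads) / len(local_grads)
full_grad = grad_mse(w, data)
print(abs(synced_grad - full_grad) < 1e-9)
```

Because only gradients are exchanged, no single device has to gather every output, which is why the memory imbalance of plain Data Parallel does not appear.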


cf) Requirements:

- PyTorch must be installed before installing DeepSpeed.
- For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
- A CUDA or ROCm compiler such as nvcc or hipcc used to compile C++/CUDA/HIP extensions.
- Specific GPUs we develop and test against are listed below, this doesn't mean your GPU will not work if it doesn't fall into this category it's just DeepSpeed is most well tested on the following:
        NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
        AMD: MI100 and MI200



DeepSpeed can be installed with:

pip install deepspeed

Usage is as follows.


Usage:
Step1)
Write the config.json file used by DeepSpeed

{
    "train_micro_batch_size_per_gpu": 160,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "contiguous_gradients": false
    }
}

config Args: https://www.deepspeed.ai/docs/config-json/
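
The two batch-related keys determine how many samples go into one optimizer step: DeepSpeed's train_batch_size equals train_micro_batch_size_per_gpu x gradient_accumulation_steps x number of GPUs. The sketch below computes it from a trimmed copy of the config; the GPU count of 4 and the helper name are assumptions for illustration.

```python
import json

# Trimmed copy of the config above, kept inline so the example is self-contained.
config_text = """
{
  "train_micro_batch_size_per_gpu": 160,
  "gradient_accumulation_steps": 1
}
"""


def effective_batch_size(config, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


config = json.loads(config_text)
print(effective_batch_size(config, num_gpus=4))  # num_gpus=4 is an assumed value
```

If memory is tight, lowering the micro batch size and raising gradient_accumulation_steps by the same factor keeps this product unchanged.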

Step2) import & read json

import json

import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Load the DeepSpeed settings written in Step 1
with open('config.json', 'r') as f:
    deepspeed_config = json.load(f)



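Step3) Set up the optimizer & initialize the model and optimizer

# Reuse the learning rate from the JSON config so both definitions agree
lr = deepspeed_config["optimizer"]["params"]["lr"]
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)

# initialize returns (engine, optimizer, dataloader, lr_scheduler)
model, optimizer, _, _ = deepspeed.initialize(model=model,
                                            model_parameters=model.parameters(),
                                            optimizer=optimizer,
                                            config_params=deepspeed_config)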


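cf) The DeepSpeed arguments can also be added to an ArgumentParser!

import argparse

import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)

# Adds DeepSpeed's own command-line arguments (such as --deepspeed_config) to the parser
parser = deepspeed.add_config_arguments(parser)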




Step4) Train!

# Option 1: launch directly from the shell
deepspeed --num_gpus={number of GPUs} train_deepspeed.py

# Option 2: put the same command in train.sh and run the script
bash train.sh


Caution!

The deepspeed launcher does not let you pick specific GPUs through CUDA_VISIBLE_DEVICES!
Specific GPUs can only be selected with --include, as shown below.

deepspeed --include localhost:<GPU_NUM1>,<GPU_NUM2> <python_file.py>


  • The more GPU nodes there are, the more DeepSpeed's main advantage, training speed, improves!
 
