
Machine Learning vs Deep Learning

๋จธ์‹ ๋Ÿฌ๋‹: ์ธ๊ณต์ง€๋Šฅ์˜ ํ•œ ๋ถ„์•ผ๋กœ data์˜ Pattern์„ ํ•™์Šต. (์ด๋•Œ, ๋น„๊ต์  ์ ์€ ์–‘์˜ ๊ตฌ์กฐํ™”๋œ data๋กœ๋„ ์ž‘๋™๊ฐ€๋Šฅ)

๋”ฅ๋Ÿฌ๋‹: ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ๋ถ„์•ผ๋กœ ๋ณต์žกํ•œ ๊ตฌ์กฐ, ๋งŽ์€ ๊ณ„์‚ฐ๋ฆฌ์†Œ์Šค ๋ฐ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•จ.

 

 

 

Transformer(Attention is All You Need-2017)

Transformer ๋ชจ๋ธ์˜ ํ•ต์‹ฌ:

โˆ™ input sequence ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ

โˆ™ Only Use Attention Mechanism (Self Attention)
โˆ™ ์ˆœ์ฐจ์ ์ฒ˜๋ฆฌ, ๋ฐ˜๋ณต์—ฐ๊ฒฐ, ์žฌ๊ท€ ๋ชจ๋‘ ์‚ฌ์šฉโŒ


Transformer ๋ชจ๋ธ๊ตฌ์กฐ:

โˆ™ Embedding: token2dense-vector (์ด๋•Œ, ๋‹จ์–ด๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ๋ณด์กดํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋ชจ๋ธ์ด ํ•™์Šต๋œ๋‹ค.)
โˆ™ Positional Encoding: input sequence์˜ token์œ„์น˜์ •๋ณด๋ฅผ ์ถ”๊ฐ€๋กœ ์ œ๊ณต
โˆ™ Encoder&Decoder: Embedding+PE๋ผ๋Š” ํ•™์Šต๋œ vector๋ฅผ Input์œผ๋กœ ๋ฐ›์Œ(๋ฒกํ„ฐ ๊ฐ’์€ Pretrained weight or ํ•™์Šต๊ณผ์ •์ค‘ ์ตœ์ ํ™”๋จ.)
- MHA & FFN: token๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต, FFN์œผ๋กœ ๊ฐ ๋‹จ์–ด์˜ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ (์ด๋•Œ, ๊ฐ Head๊ฐ’์€ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜๋ฅผ ๊ฐ€์ ธ input sequence์˜ ๋‹ค์–‘ํ•œ ์ธก๋ฉด ํฌ์ฐฉ๊ฐ€๋Šฅ.) 
- QKV: Query(ํ˜„์žฌ์œ„์น˜์—์„œ ๊ด€์‹ฌ์žˆ๋Š”๋ถ€๋ถ„์˜ ๋ฒกํ„ฐ), Key(๊ฐ ์œ„์น˜์— ๋Œ€ํ•œ ์ •๋ณด์˜ ๋ฒกํ„ฐ), Value(๊ฐ ์œ„์น˜์— ๋Œ€ํ•œ ๊ฐ’ ๋ฒกํ„ฐ)

ex) The student studies at the home
query: student
 --> Q: the vector used to gather information about "student"
 --> K: the vectors for The, studies, at, the, home
 --> V: the vectors carrying the associated content (context, meaning, etc.)
 --> With 3 heads: each head can focus on a different aspect, e.g. syntax, semantics, or pragmatics.
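The QKV interaction above can be sketched as scaled dot-product attention. A minimal self-attention sketch in PyTorch; the dimensions and seed are our assumptions, not from the original:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # similarity of each query to every key
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V, weights

# 6 tokens ("The student studies at the home"), embedding dim 8
torch.manual_seed(0)
x = torch.randn(1, 6, 8)
out, w = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape, w.shape)  # torch.Size([1, 6, 8]) torch.Size([1, 6, 6])
```

In multi-head attention this computation runs once per head, each with its own learned projections of Q, K, and V.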

 

Huggingface Transformers

Library ์†Œ๊ฐœ

Tokenizer 

๋ณดํ†ต subword๋กœ tokenํ™”(token์œผ๋กœ ๋ถ„ํ• )ํ•˜๋Š” ๊ณผ์ •์„ ์ง„ํ–‰.
๋ถ€์ˆ˜์ ์œผ๋กœ "ํ…์ŠคํŠธ ์ •๊ทœํ™”, ๋ถˆ์šฉ์–ด์ œ๊ฑฐ, Padding, Truncation ๋“ฑ" ๋‹ค์–‘ํ•œ ์ „์ฒ˜๋ฆฌ ๊ธฐ๋Šฅ๋„ ์ œ๊ณตํ•œ๋‹ค.


Diffusers Library

txt-img์ƒ์„ฑ๊ด€๋ จ ์ž‘์—…์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ Stable Diffusion, DALL-E, LDM ๋“ฑ ๋‹ค์–‘ํ•œ ์ƒ์„ฑ๋ชจ๋ธ์„ ์ง€์›.
- DDPM, DDIM, LDM๋“ฑ ๋‹ค์–‘ํ•œ Diffusion๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ œ๊ณต
- Batch Inference, ๋ณ‘๋ ฌ, ํ˜ผํ•ฉ์ •๋ฐ€๋„ํ•™์Šต ๋“ฑ ์ง€์›


Accelerate

๋ถ„์‚ฐ์ „๋žต์„ ๊ฐ„๋‹จํžˆ ์ถ”์ƒํ™”ํ•ด API๋กœ ์ œ๊ณต, FP16 ๋ฐ BF16๋“ฑ์˜ ๋‚ฎ์€ ํ˜ผํ•ฉ์ •๋ฐ€๋„ํ•™์Šต์„ "์ž๋™์ง€์›"
- ๋ถ„์‚ฐํ•™์Šต ์ง€์›: Data Parallel, Model Parallel ๋“ฑ ์ง€์›.
- Automatic Mixed Precision์ง€์›: FP16, FP32 ๋“ฑ dataํ˜•์‹์„ ์ž๋™์œผ๋กœ ํ˜ผํ•ฉ, ๋ฉ”๋ชจ๋ฆฌ์‚ฌ์šฉ๋Ÿ‰
, ์†๋„

- Gradient Accumulation: ์—ฌ๋Ÿฌ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๋ˆ„์ ํ•˜์—ฌ ํฐ ๋ฐฐ์น˜ ํšจ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ธฐ๋ฒ•
- Gradient Checkpointing: ์ค‘๊ฐ„ activation๊ณ„์‚ฐ์„ ์ €์žฅํ•˜๋Š” ๋Œ€์‹ , ํ•„์š”ํ•  ๋•Œ ์žฌ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•
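Gradient accumulation, which Accelerate automates, can be verified by hand: summing 1/N-scaled mini-batch losses over N backward passes reproduces the full-batch gradient. A sketch with a toy linear model (all names and sizes here are made up):

```python
import torch
from torch import nn

torch.manual_seed(0)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def grad_of(loss_fn):
    # Fresh model with identical, fixed weights for a fair comparison.
    model = nn.Linear(4, 1)
    with torch.no_grad():
        model.weight.fill_(0.5)
        model.bias.fill_(0.0)
    loss_fn(model)
    return model.weight.grad.clone()

def full(model):
    # Full batch: a single backward pass over all 8 samples.
    nn.functional.mse_loss(model(x), y).backward()

def accumulated(model):
    # Accumulation: 4 mini-batches of 2; each loss is scaled by 1/4
    # so the gradients summed across backward() calls match the full batch.
    for i in range(4):
        loss = nn.functional.mse_loss(model(x[i*2:(i+1)*2]), y[i*2:(i+1)*2])
        (loss / 4).backward()

print(torch.allclose(grad_of(full), grad_of(accumulated), atol=1e-6))  # True
```

In practice the optimizer step is simply deferred until the last accumulation step; Accelerate wraps this pattern for you.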

Model ์„ค์ •

๋ชจ๋ธ ์„ค์ • ํด๋ž˜์Šค๋Š” ๋ชจ๋ธ๊ตฌ์กฐ์™€ hyperparameter๊ฐ’์„ "๋”•์…”๋„ˆ๋ฆฌ"ํ˜•ํƒœ๋กœ JSONํŒŒ์ผ์— ์ €์žฅํ•œ๋‹ค.
๋”ฐ๋ผ์„œ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๋ฉด ๋ชจ๋ธ๊ฐ€์ค‘์น˜์™€ ํ•จ๊ป˜ ์ด ๊ฐ’์ด ๋ถˆ๋Ÿฌ์™€์ง„๋‹ค. (์•„๋ž˜ ์‚ฌ์ง„์ฒ˜๋Ÿผ)




PretrainedConfig & ModelConfig

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ชจ๋ธ๊ตฌ์กฐ, hyperparameter๋ฅผ ์ €์žฅํ•˜๋Š” ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ํฌํ•จ
[์˜ˆ์‹œ์ธ์ž ์„ค๋ช…]: 
 - vocab_size: ๋ชจ๋ธ์ด ์ธ์‹๊ฐ€๋Šฅํ•œ ๊ณ ์œ ํ† ํฐ ์ˆ˜
 - output_hidden_states: ๋ชจ๋ธ์˜ ๋ชจ๋“  hidden_state๋ฅผ ์ถœ๋ ฅํ• ์ง€ ์—ฌ๋ถ€
 - output_attentions: ๋ชจ๋ธ์˜ ๋ชจ๋“  attention๊ฐ’์„ ์ถœ๋ ฅํ• ์ง€ ์—ฌ๋ถ€
 - return_dict: ๋ชจ๋ธ์ด ์ผ๋ฐ˜ ํŠœํ”Œ๋Œ€์‹ , ModelOutput๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ• ์ง€ ๊ฒฐ์ •.
๊ฐ ๋ชจ๋ธ ๊ตฌ์กฐ๋ณ„ PretrainedConfig๋ฅผ ์ƒ์†๋ฐ›์€ ์ „์šฉ ๋ชจ๋ธ ์„ค์ • ํด๋ž˜์Šค๊ฐ€ ์ œ๊ณต๋œ๋‹ค.
(ex. BertConfig, GPT2Config ํ˜น์€ ์•„๋ž˜ ์‚ฌ์ง„์ฒ˜๋Ÿผ...)

InternVisionConfig๋ฅผ ์ง์ ‘ ์ธ์Šคํ„ด์Šคํ™”ํ•ด ์„ค์ •ํ•˜๋Š” ์˜ˆ์‹œ

์ด๋•Œ, ์„ค์ •๊ฐ’์ด ์ž˜๋ชป๋˜๋ฉด ๋ชจ๋ธ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ๊ธฐ์— ๋ณดํ†ต "from_pretrained"๋ฅผ ์ด์šฉํ•ด ๊ฒ€์ฆ๋œ pretrainedํ•™์Šต์„ค์ •๊ฐ’์„ ๋ถˆ๋Ÿฌ์˜จ๋‹ค.



PreTrainedTokenizer & ModelTokenizer & PretrainedTokenizerFast

[์˜ˆ์‹œ์ธ์ž ์„ค๋ช…]: 
 - max_model_input_sizes: ๋ชจ๋ธ์˜ ์ตœ๋Œ€์ž…๋ ฅ๊ธธ์ด

 - model_max_length: tokenizer๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์˜ ์ตœ๋Œ€์ž…๋ ฅ๊ธธ์ด
(์ฆ‰, ํ† ํฌ๋‚˜์ด์ €์˜ model_max_length๋Š” ๋ชจ๋ธ์˜ max_model_input_sizes๋ณด๋‹ค ํฌ์ง€ ์•Š๋„๋ก ์„ค์ •ํ•ด์•ผ ๋ชจ๋ธ์ด ์ •์ƒ์ ์œผ๋กœ ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 - padding_side/truncation_side: padding/truncation์œ„์น˜(left/right) ๊ฒฐ์ •

 - model_input_names: ์ˆœ์ „ํŒŒ์‹œ ์ž…๋ ฅ๋˜๋Š” tensor๋ชฉ๋ก(ex. input_ids, attention_mask, token_type_ids)

cf) decode๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด token_id ๋ฌธ์žฅ์„ ์›๋ž˜ ๋ฌธ์žฅ์œผ๋กœ ๋ณต์›ํ•œ๋‹ค.
cf) PretrainedTokenizerFast๋Š” Rust๋กœ ๊ตฌํ˜„๋œ ๋ฒ„์ „์œผ๋กœ Python Wrapper๋ฅผ ํ†ตํ•ด ํ˜ธ์ถœ๋˜๋Š”, ๋” ๋น ๋ฅธ tokenizer๋‹ค.




Datasets

Dataset Upload ์˜ˆ์ œ:

images ๋””๋ ‰ํ† ๋ฆฌ ๊ตฌ์กฐ:
images
โŽฟ A.jpg
โŽฟ B.jpg
  ...

import os
from collections import defaultdict
from datasets import Dataset, Image, DatasetDict

data = defaultdict(list)
folder_name = '../images'

for file_name in os.listdir(folder_name):
    name = os.path.splitext(file_name)[0]
    path = os.path.join(folder_name, file_name)
    data['name'].append(name)
    data['image'].append(path)

dataset = Dataset.from_dict(data).cast_column('image', Image())
# print(data, dataset[0])  # for inspection

dataset_dict = DatasetDict({
    'train': dataset.select(range(5)),
    'valid': dataset.select(range(5, 10)),
    'test': dataset.select(range(10, len(dataset)))
    }
)

hub_name = "<user_name>/<repo_name>" # where to store the dataset on the Hub
token = "hf_###..." # your Hugging Face token
dataset_dict.push_to_hub(hub_name, token=token)

 





๐Ÿค— Embedding๊ณผ์ • ์™„์ „์ •๋ฆฌ!!

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

txt = "I am laerning about tokenizers."
input = tokenizer(txt, return_tensors="pt")
output = model(**input)

print('input:', input)
print('last_hidden_state:', output.last_hidden_state.shape)
# input: {'input_ids': tensor([[  101,  1045,  2572,  2474, 11795,  2075,  2055, 19204, 17629,  2015,  1012,   102]]),
#         'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
#         'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
#
# last_hidden_state: torch.Size([1, 12, 768])
  1. input ๋”•์…”๋„ˆ๋ฆฌ
    • input_ids:
      • ๊ฐ ๋‹จ์–ด์™€ ํŠน์ˆ˜ ํ† ํฐ์ด BERT์˜ ์–ดํœ˜ ์‚ฌ์ „์— ๋งคํ•‘๋œ ๊ณ ์œ ํ•œ ์ •์ˆ˜ ID๋กœ ๋ณ€ํ™˜๋œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.
      • ์˜ˆ์‹œ: 101์€ [CLS] ํ† ํฐ, 102๋Š” [SEP] ํ† ํฐ.
      • ์ „์ฒด ์‹œํ€€์Šค: [CLS] I am laerning about tokenizers. [SEP]
      • ๊ธธ์ด: 12๊ฐœ์˜ ํ† ํฐ (๋ฌธ์žฅ ์ „์ฒด์™€ ํŠน์ˆ˜ ํ† ํฐ ํฌํ•จ)
    • token_type_ids:
      • ๋ฌธ์žฅ ๋‚ด์˜ ๊ฐ ํ† ํฐ์ด ์–ด๋Š segment์— ์†ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋ƒ„.
      • BERT๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ ๋‘ ๊ฐœ์˜ ์„ธ๊ทธ๋จผํŠธ(์˜ˆ: ๋ฌธ์žฅ A์™€ ๋ฌธ์žฅ B)๋ฅผ ๊ตฌ๋ถ„๊ฐ€๋Šฅ.
      • ์—ฌ๊ธฐ์„œ๋Š” ๋‹จ์ผ ๋ฌธ์žฅ์ด๋ฏ€๋กœ ๋ชจ๋“  ๊ฐ’์ด 0์ด๋‹ค.
    • attention_mask:
      • ๋ชจ๋ธ์ด ๊ฐ ํ† ํฐ์— ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์—ฌ์•ผ ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.
      • 1์€ ํ•ด๋‹น ํ† ํฐ์ด ์‹ค์ œ ๋ฐ์ดํ„ฐ์ž„์„ ์˜๋ฏธํ•˜๊ณ , 0์€ ํŒจ๋”ฉ ํ† ํฐ์„ ์˜๋ฏธ.
      • ์—ฌ๊ธฐ์„œ๋Š” ํŒจ๋”ฉ์ด ์—†์œผ๋ฏ€๋กœ ๋ชจ๋“  ๊ฐ’์ด 1์ด๋‹ค.
  2. last_hidden_state
    • Shape: [1, 12, 768]
      • Batch Size (1): ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ์ž…๋ ฅ ๋ฌธ์žฅ์„ ์ฒ˜๋ฆฌ.
      • Sequence Length (12): ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ํ† ํฐ ์ˆ˜ (ํŠน์ˆ˜ ํ† ํฐ ํฌํ•จ).
      • Hidden Size (768): BERT-base ๋ชจ๋ธ์˜ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด 768์ฐจ์›์˜ ๋ฒกํ„ฐ ํ‘œํ˜„์„ ์ƒ์„ฑํ•œ๋‹ค.
    • ์˜๋ฏธ:
      • last_hidden_state๋Š” ๋ชจ๋ธ์˜ ๋งˆ์ง€๋ง‰ ์€๋‹‰ ๊ณ„์ธต์—์„œ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ๋ฒกํ„ฐ ํ‘œํ˜„์„ ๋‹ด๊ณ  ์žˆ๋‹ค.
      • ์ด ๋ฒกํ„ฐ๋“ค์€ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ๋‹ค์–‘ํ•œ NLP ์ž‘์—…(์˜ˆ: ๋ถ„๋ฅ˜, ๊ฐœ์ฒด๋ช… ์ธ์‹ ๋“ฑ)์— ํ™œ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค.

์„ค๋ช…)



ex-1) Embedding lookup-table walkthrough

import torch

train_data = 'you need to know how to code'

# ์ค‘๋ณต์„ ์ œ๊ฑฐํ•œ ๋‹จ์–ด๋“ค์˜ ์ง‘ํ•ฉ์ธ ๋‹จ์–ด ์ง‘ํ•ฉ ์ƒ์„ฑ.
word_set = set(train_data.split())

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์ •์ˆ˜ ๋งตํ•‘.
vocab = {word: i+2 for i, word in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1
print(vocab) # {'need': 2, 'to': 3, 'code': 4, 'how': 5, 'you': 6, 'know': 7, '<unk>': 0, '<pad>': 1}

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ํฌ๊ธฐ๋งŒํผ์˜ ํ–‰์„ ๊ฐ€์ง€๋Š” ํ…Œ์ด๋ธ” ์ƒ์„ฑ.
embedding_table = torch.FloatTensor([[ 0.0,  0.0,  0.0],
                                    [ 0.0,  0.0,  0.0],
                                    [ 0.2,  0.9,  0.3],
                                    [ 0.1,  0.5,  0.7],
                                    [ 0.2,  0.1,  0.8],
                                    [ 0.4,  0.1,  0.1],
                                    [ 0.1,  0.8,  0.9],
                                    [ 0.6,  0.1,  0.1]])

sample = 'you need to run'.split()
idxes = []

# ๊ฐ ๋‹จ์–ด๋ฅผ ์ •์ˆ˜๋กœ ๋ณ€ํ™˜
for word in sample:
  try:
    idxes.append(vocab[word])
  # ๋‹จ์–ด ์ง‘ํ•ฉ์— ์—†๋Š” ๋‹จ์–ด์ผ ๊ฒฝ์šฐ <unk>๋กœ ๋Œ€์ฒด๋œ๋‹ค.
  except KeyError:
    idxes.append(vocab['<unk>'])
idxes = torch.LongTensor(idxes)

# ๊ฐ ์ •์ˆ˜๋ฅผ ์ธ๋ฑ์Šค๋กœ ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜จ๋‹ค.
lookup_result = embedding_table[idxes, :]
print(lookup_result)
print(lookup_result.shape)
# tensor([[0.1000, 0.8000, 0.9000],
#         [0.2000, 0.9000, 0.3000],
#         [0.1000, 0.5000, 0.7000],
#         [0.0000, 0.0000, 0.0000]])
# torch.Size([4, 3])

ex-2) Comparing the manual lookup-table code with nn.Embedding

from torch import nn

train_data = 'you need to know how to code'

# ์ค‘๋ณต์„ ์ œ๊ฑฐํ•œ ๋‹จ์–ด๋“ค์˜ ์ง‘ํ•ฉ์ธ ๋‹จ์–ด ์ง‘ํ•ฉ ์ƒ์„ฑ.
word_set = set(train_data.split())

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์ •์ˆ˜ ๋งตํ•‘.
vocab = {tkn: i+2 for i, tkn in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3, padding_idx=1)
print(embedding_layer.weight)
print(embedding_layer)

# tensor([[ 0.7830,  0.2669,  0.4363],
#         [ 0.0000,  0.0000,  0.0000],
#         [-1.2034, -0.0957, -0.9129],
#         [ 0.7861, -0.0251, -2.2705],
#         [-0.5167, -0.3402,  1.3143],
#         [ 1.7932, -0.6973,  0.5459],
#         [-0.8952, -0.4937,  0.2341],
#         [ 0.3692,  0.0593,  1.0825]], requires_grad=True)
# Embedding(8, 3, padding_idx=1)




ex-3) Example BERT-style embedding module

import torch
from torch import nn

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.word_embeddings = nn.Embedding(config.vocab_size, config.emb_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_seq_length, config.emb_size)
        self.token_type_embeddings = nn.Embedding(2, config.emb_size)
        self.LayerNorm = nn.LayerNorm(config.emb_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.dropout)
        
        # position ids (used in the pos_emb lookup table) that we do not want updated through backpropagation
        self.register_buffer("position_ids", torch.arange(config.max_seq_length).expand((1, -1)))

    def forward(self, input_ids, token_type_ids):
        word_emb = self.word_embeddings(input_ids)
        pos_emb = self.position_embeddings(self.position_ids[:, : input_ids.size(1)])  # slice to the actual sequence length
        type_emb = self.token_type_embeddings(token_type_ids)

        emb = word_emb + pos_emb + type_emb
        emb = self.LayerNorm(emb)
        emb = self.dropout(emb)

        return emb

 

 

 

 

 

NLP

BERT - Classification

NER (Named Entity Recognition)

Token Classification, ์ฆ‰ ๋ฌธ์žฅ์„ ๊ตฌ์„ฑํ•˜๋Š” ๊ฐ token์— label์„ ํ• ๋‹นํ•˜๋Š” Task์ด๋‹ค.
๋จผ์ € ์˜ˆ์‹œ๋กœ BIO Tagging์„ ๋“ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค:
ex) ์ธ๊ณต์ง€๋Šฅ(AI)์—์„œ ๋”ฅ๋Ÿฌ๋‹์€ ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ๋ถ„์•ผ์ž…๋‹ˆ๋‹ค.
--> <์ธ๊ณต์ง€๋Šฅ:B-Tech> <(:I-Tech> <AI:I-Tech> <):I-Tech> <์—์„œ:O> <๋”ฅ๋Ÿฌ๋‹:B-Tech> <์€:O> <๋จธ์‹ ๋Ÿฌ๋‹:B-Tech> <์˜:O> <ํ•œ:O> <๋ถ„์•ผ:O> <์ž…๋‹ˆ๋‹ค:O> <.:O>
์ด๋•Œ, B๋Š” Begin(๊ฐœ์ฒด๋ช…์˜ ์‹œ์ž‘)์„, I๋Š” Inside(๊ฐœ์ฒด๋ช…์˜ ์—ฐ์†)๋ฅผ, O๋Š” Outside(๊ฐœ์ฒด๋ช…์ด ์•„๋‹Œ๊ฒƒ)๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
์ด๋Ÿฐ NER์—์„œ ์ž์ฃผ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ์ด ๋ฐ”๋กœ BERT์ด๋‹ค.

BERT - MLM, NSP

๋ฌธ์žฅ๊ฐ„ ๊ด€๊ณ„(์š”์•ฝ ๋“ฑ)๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ํ™œ์šฉ๋˜๋Š” [CLS]ํ† ํฐ์ด ์‚ฌ์šฉ๋œ๋‹ค.
BERT์—์„œ๋Š” ์ด 3๊ฐ€์ง€ Embedding์ด Embedding Layer์—์„œ ํ™œ์šฉ๋œ๋‹ค:
1. Token Embedding:
 - ์ž…๋ ฅ๋ฌธ์žฅ embedding
2. Segment Embedding:
 - ๋ชจ๋ธ์ด ์ธ์‹ํ•˜๋„๋ก ๊ฐ ๋ฌธ์žฅ์— ๊ณ ์ •๋œ ์ˆซ์ž ํ• ๋‹น.
3. Position Embedding:
 - input์„ ํ•œ๋ฒˆ์— ๋ชจ๋‘ ๋ฐ€์–ด๋„ฃ๊ธฐ์—(= ์ˆœ์ฐจ์ ์œผ๋กœ ๋„ฃ์ง€ ์•Š์Œ)
 - Transformer Encoder๋Š” ๊ฐ token์˜ ์‹œ๊ฐ„์  ์ˆœ์„œ๋ฅผ ์•Œ์ง€ ๋ชปํ•จ
 - ์ด๋ฅผ ์œ„ํ•ด ์œ„์น˜์ •๋ณด๋ฅผ ์‚ฝ์ž…ํ•˜๊ธฐ์œ„ํ•ด sine, cosine์„ ์‚ฌ์šฉํ•œ๋‹ค.
์ถ”์ฒœ๊ฐ•์˜) https://www.youtube.com/watch?app=desktop&v=CiOL2h1l-EE

BART - Summarization

Abstractive & Extractive Summarization

์ถ”์ƒ์š”์•ฝ: ์›๋ฌธ์„ ์™„์ „ํžˆ ์ดํ•ด --> ์ƒˆ๋กœ์šด ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•ด ์š”์•ฝํ•˜๋Š” ๋ฐฉ์‹.
์ถ”์ถœ์š”์•ฝ: ์›๋ฌธ์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•˜๊ณ  ๊ด€๋ จ์„ฑ ๋†’์€ ๋ฌธ์žฅ๋“ค๋งŒ ์„ ํƒํ•ด ๊ทธ๋Œ€๋กœ ์ถ”์ถœ.
(์š”์•ฝ๋ฌธ์ด ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์šธ ์ˆ˜๋Š” ์žˆ์œผ๋ฉฐ, ์ค‘๋ณต์ œ๊ฑฐ๋Šฅ๋ ฅ์ด ํ•„์š”ํ•จ.)


BART (Bidirectional & Auto Regressive Transformers)

Encoder-Decoder ๋ชจ๋‘ ์กด์žฌํ•˜๋ฉฐ, ํŠนํžˆ๋‚˜ Embedding์ธต์„ ๊ณต์œ ํ•˜๋Š” Shared Embedding์„ ์‚ฌ์šฉํ•ด ๋‘˜๊ฐ„์˜ ์—ฐ๊ฒฐ์„ ๊ฐ•ํ™”ํ•œ๋‹ค.
Encoder๋Š” Bidirectional Encoder๋กœ ๊ฐ ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ ์ „์ฒด์˜ ์ขŒ์šฐ context๋ฅผ ๋ชจ๋‘ ์ฐธ์กฐ๊ฐ€๋Šฅํ•˜๋ฉฐ,
Decoder์—์„œ Auto-Regressive๋ฐฉ์‹์œผ๋กœ ์ด์ „์— ์ƒ์„ฑํ•œ ๋‹จ์–ด๋ฅผ ์ฐธ์กฐํ•ด ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.
๋˜ํ•œ, Pre-Train์‹œ Denoising Auto Encoder๋กœ ํ•™์Šตํ•˜๋Š”๋ฐ, ์ž„์˜๋กœ noisingํ›„, ๋ณต์›ํ•˜๊ฒŒ ํ•œ๋‹ค.

RoBERTa, T5 - Text QA

Abstractive & Extractive QA

์ถ”์ƒ์งˆ์˜์‘๋‹ต: ์ฃผ์–ด์ง„ ์ง€๋ฌธ ๋‚ด์—์„œ ๋‹ต๋ณ€์ด ๋˜๋Š” ๋ฌธ์ž์—ด ์ถ”์ถœ (์งˆ๋ฌธ-์ง€๋ฌธ-์ง€๋ฌธ๋‚ด๋‹ต๋ณ€์ถ”์ถœ)
์ถ”์ถœ์งˆ์˜์‘๋‹ต: ์งˆ๋ฌธ๊ณผ ์ง€๋ฌธ์„ ์ž…๋ ฅ๋ฐ›์•„ ์ƒˆ๋กœ์šด ๋‹ต๋ณ€ ์ƒ์„ฑ (์งˆ๋ฌธ-์ง€๋ฌธ-๋‹ต๋ณ€)


RoBERTa

The key changes are a longer max_len, a larger pre-training dataset, and the introduction of dynamic masking.
Dynamic Masking: a different masking pattern is generated each epoch. (NSP is removed.)
Uses BPE tokenization, whereas BERT uses WordPiece.
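Dynamic masking can be sketched as re-sampling the masked positions on every call (i.e. every epoch) instead of fixing them once at preprocessing time; the mask id 103 and the 15% rate are BERT-style assumptions, not RoBERTa's exact recipe:

```python
import random

def dynamic_mask(token_ids, mask_id=103, prob=0.15, rng=random):
    """Return a freshly masked copy -- called once per epoch, unlike static masking."""
    n = max(1, round(len(token_ids) * prob))
    positions = set(rng.sample(range(len(token_ids)), n))  # re-sampled on every call
    return [mask_id if i in positions else t for i, t in enumerate(token_ids)]

ids = list(range(1000, 1020))          # 20 dummy token ids
rng = random.Random(0)
epoch1 = dynamic_mask(ids, rng=rng)    # one masking pattern...
epoch2 = dynamic_mask(ids, rng=rng)    # ...and very likely a different one next epoch
print(epoch1.count(103), epoch2.count(103))  # 3 3
```

Static masking would compute `positions` once and reuse it for every epoch; re-sampling lets the model see many masked views of the same sentence.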

T5 - Machine Translation

SMT & NMT

ํ†ต๊ณ„์  ๊ธฐ๊ณ„๋ฒˆ์—ญ: ์›๋ฌธ-๋ฒˆ์—ญ์Œ ๊ธฐ๋ฐ˜, ๋‹จ์–ด์ˆœ์„œ ๋ฐ ์–ธ์–ดํŒจํ„ด์„ ์ธ์‹ --> ๋Œ€๊ทœ๋ชจ dataํ•„์š”
์‹ ๊ฒฝ๋ง ๊ธฐ๊ณ„๋ฒˆ์—ญ: ๋ฒˆ์—ญ๋ฌธ๊ณผ ๋‹จ์–ด์‹œํ€€์Šค๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต


T5 (Text-To-Text Transfer Transformer)

tast๋ณ„ ํŠน์ • Promptํ˜•์‹์„ ์‚ฌ์šฉํ•ด ์ ์ ˆํ•œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•˜๊ฒŒ ์œ ๋„๊ฐ€๋Šฅํ•˜๋‹ค.
์ฆ‰, ๋‹จ์ผ ๋ชจ๋ธ๋กœ ๋‹ค์–‘ํ•œ NLP Task๋ฅผ ์ฒ˜๋ฆฌ๊ฐ€๋Šฅํ•œ seq2seq๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค.

T5์˜ ๋…ํŠนํ•œ์ ์€ ๋ชจ๋ธ๊ตฌ์กฐ์ฐจ์ œ๊ฐ€ ์•„๋‹Œ, "์ž…์ถœ๋ ฅ ๋ชจ๋‘ Txtํ˜•ํƒœ๋กœ ์ทจ๊ธ‰ํ•˜๋Š” seq2seq๋กœ ์ ‘๊ทผํ•ด Pretrain๊ณผ์ •์—์„œ "Unsupervised Learning"์„ ํ†ตํ•ด ๋Œ€๊ทœ๋ชจ corpus(์•ฝ 75GB)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์ด๋‹ค." ์ด๋ฅผ ํ†ตํ•ด ์–ธ์–ด์˜ ์ผ๋ฐ˜์  ํŒจํ„ด๊ณผ ์ง€์‹์„ ํšจ๊ณผ์ ์œผ๋กœ ์Šต๋“ํ•œ๋‹ค.


LLaMA - Text Generation

Seq2Seq & CausalLM

Seq2Seq: encoder-decoder architectures such as the Transformer, BART, and T5
CausalLM: consists of a single decoder


LLaMA-3 Family

2024๋…„ 4์›”, LLaMA-3๊ฐ€ ์ถœ์‹œ๋˜์—ˆ๋Š”๋ฐ, LLaMA-3์—์„œ๋Š” GQA(Grouped Query Attention)์ด ์‚ฌ์šฉ๋˜์–ด Inference์†๋„๋ฅผ ๋†’์˜€๋‹ค.
LLaMA-3๋Š” Incontext-Learning, Few-Shot Learning ๋ชจ๋‘ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.
Incontext-Learning: ๋ชจ๋ธ์ด ์ž…๋ ฅํ…์ŠคํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ฆ‰์„์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” ๋Šฅ๋ ฅ



์ถ”๊ฐ€์ ์œผ๋กœ 2024๋…„ 7์›”, LLaMA-3.1์ด ๊ณต๊ฐœ๋˜์—ˆ๋‹ค. LLaMA-3.1์€ AI์•ˆ์ •์„ฑ ๋ฐ ๋ณด์•ˆ๊ด€๋ จ ๋„๊ตฌ๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ๋Š”๋ฐ, Prompt Injection์„ ๋ฐฉ์ง€ํ•˜๋Š” Prompt Guard๋ฅผ ๋„์ž…ํ•ด ์œ ํ•ดํ•˜๊ฑฐ๋‚˜ ๋ถ€์ ์ ˆํ•œ ์ฝ˜ํ…์ธ ๋ฅผ ์‹๋ณ„ํ•˜๊ฒŒ ํ•˜์˜€๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ LLaMA-3 ์‹œ๋ฆฌ์ฆˆ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฃผ์š” ํŠน์ง•์ด ์กด์žฌํ•œ๋‹ค:
- RoPE(Rotary Position Embedding): Q, K์— ์ ์šฉ
- GQA(Grouped Query Attention): K, V๋ฅผ ์—ฌ๋Ÿฌ ๊ทธ๋ฃน์œผ๋กœ ๋ฌถ์–ด attention์—ฐ์‚ฐ์ˆ˜ํ–‰ --> ํšจ์œจ์  ์ถ”๋ก 

- RMS Norm: ์•ˆ์ •์  ํ•™์Šต ๋ฐ ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ
- KV cache: ์ถ”๋ก ์‹œ K,V๋ฅผ cache์— ์ €์žฅ --> ์—ฐ์‚ฐ์˜ ํšจ์œจ์„ฑ
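The GQA idea above can be sketched in PyTorch: K and V are produced for fewer heads, and each K/V head is shared by a whole group of query heads (all sizes here are made up):

```python
import torch

B, T, d_head = 1, 5, 8
n_q_heads, n_kv_heads = 8, 2      # each K/V head serves a group of 4 query heads

torch.manual_seed(0)
q = torch.randn(B, n_q_heads, T, d_head)
k = torch.randn(B, n_kv_heads, T, d_head)   # fewer K/V heads -> smaller KV cache
v = torch.randn(B, n_kv_heads, T, d_head)

# Expand K/V so every query head in a group attends with the same shared K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)       # (B, 8, T, d_head)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)
out = attn @ v
print(out.shape)  # torch.Size([1, 8, 5, 8])
```

Because only the 2 K/V heads need to be cached during generation, the KV cache shrinks by the group factor, which is exactly why GQA speeds up inference.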


LLaMA-3 ์ตœ์ ํ™”๊ธฐ๋ฒ•: SFT . RLHF  . DPO

SFT (Supervised Fine-Tuning): train the model directly on high-quality human-written QA pairs
RLHF: based on the PPO algorithm; humans rank several model responses, and the model is retrained from those rankings.
DPO (Direct Preference Optimization): reduces the complexity of RLHF while keeping training effective (trains directly on the human preference rankings, but requires higher-quality preference data).

 

 

 

 

Computer Vision

์ฃผ๋กœ CV(Computer Vision)๋ถ„์•ผ์—์„  CNN๊ธฐ๋ฒ•์ด ๋งŽ์ด ํ™œ์šฉ๋˜์—ˆ๋‹ค.(VGG, Inception, ResNet, ...)
๋‹ค๋งŒ, CNN based model์€ ์ฃผ๋กœ ๊ตญ์†Œ์ (local) pattern์„ ํ•™์Šตํ•˜๊ธฐ์— ์ „์—ญ์ ์ธ ๊ด€๊ณ„(global relation)๋ชจ๋ธ๋ง์— ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ ์ด๋ฏธ์ง€ ํฌ๊ธฐ๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ๊ณ„์‚ฐ๋ณต์žก๋„ ๋˜ํ•œ ์ฆ๊ฐ€ํ•œ๋‹ค.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ViT(Vision Transformer)๊ฐ€ ์ œ์•ˆ๋˜์—ˆ์œผ๋ฉฐ, ๋Œ€๊ทœ๋ชจ dataset์œผ๋กœ ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•œ๋‹ค.
ViT์˜ ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๊ฒฉ์ธ CLIP, OWL-ViT, SAM์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

Zero shot classification

Zero Shot Classification: CLIP, ALIGN, SigLIP

์‚ฌ์‹ค CLIP์€ ๋‹ค์–‘ํ•œ Task์—์„œ ๋งŽ์ด ํ™œ์šฉ๋˜๋‚˜ ๋ณธ ๊ธ€์€ Train dataset์— ์—†๋Š” Label์— ๋Œ€ํ•ด Image Classification์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ์ˆ ์— ํ™œ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์•Œ์•„๋ณด๊ณ ์ž ํ•œ๋‹ค.
์ƒˆ๋กœ์šด Label์ด ์ถ”๊ฐ€๋  ๋•Œ๋งˆ๋‹ค ์žฌํ•™์Šต์ด ํ•„์š”ํ•œ๋ฐ, ์ด๋ฅผ ํ”ผํ•˜๋ ค๋ฉด Zero shot๊ธฐ๋ฒ•์€ ๋ฐ˜ํ•„์ˆ˜์ ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


CLIP (Contrastive Language-Image Pre-training)

Model                              Architecture  Input_Size  Patch_Size  #params
openai/clip-vit-base-patch32       ViT-B/32      224×224     32×32       ~150M
openai/clip-vit-base-patch16       ViT-B/16      224×224     16×16       ~150M
openai/clip-vit-large-patch14      ViT-L/14      224×224     14×14       ~430M
openai/clip-vit-large-patch14-336  ViT-L/14      336×336     14×14       ~430M

์ž‘์€ patch_size: ๋” ์„ธ๋ฐ€ํ•œ ํŠน์ง•์ถ”์ถœ, ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๋ฐ ๊ณ„์‚ฐ์‹œ๊ฐ„ ์ฆ๊ฐ€

ํฐ patch_size: ๋น„๊ต์  ๋‚ฎ์€ ์„ฑ๋Šฅ, ๋น ๋ฅธ ์ฒ˜๋ฆฌ์†๋„
ํŒŒ๋ž€๋ธ”๋ก: Positive Sample , ํฐ๋ธ”๋ก: Negative Sample

๊ธฐ์กด Supervised Learning๊ณผ ๋‹ฌ๋ฆฌ 2๊ฐ€์ง€ ํŠน์ง•์ด ์กด์žฌํ•œ๋‹ค:
1.๋ณ„๋„์˜ Label์—†์ด input์œผ๋กœ image-txt์Œ๋งŒ ํ•™์Šต.
 - img, txt๋ฅผ ๋™์ผํ•œ embedding๊ณต๊ฐ„์— ์‚ฌ์˜(Projection)
 - ์ด๋ฅผ ํ†ตํ•ด ๋‘ Modality๊ฐ„ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ์ง์ ‘์ ์œผ๋กœ ์ธก์ • ๋ฐ ํ•™์Šต๊ฐ€๋Šฅ
 - ์ด ๋•Œ๋ฌธ์— CLIP์€ img-encoder, txt-encoder ๋ชจ๋‘ ๊ฐ–๊ณ ์žˆ์Œ

2. Contrastive Learning:
 - "Positive Sample": ์‹ค์ œimg-txt์Œ --> img-txt๊ฐ„ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ ์ตœ๋Œ€ํ™”
 - "Negative Sample": randomํ•˜๊ฒŒ pair๋œ ๋ถˆ์ผ์น˜img-txt์Œ --> ์œ ์‚ฌ์„ฑ ์ตœ์†Œํ™”
 - ์ด๋ฅผ ์œ„ํ•ด Cosine Similarity๊ธฐ๋ฐ˜์˜ Contrastive Learning Loss๋ฅผ ์‚ฌ์šฉ.


Zero-Shot Classification ์˜ˆ์‹œ

from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dataset = load_dataset("sasha/dog-food")
images = dataset['test']['image'][:2]
labels = ['dog', 'food']
inputs = processor(images=images, text=labels, return_tensors="pt", padding=True)

print('input_ids: ', inputs['input_ids'])
print('attention_mask: ', inputs['attention_mask'])
print('pixel_values: ', inputs['pixel_values'])
print('image_shape: ', inputs['pixel_values'].shape)
# =======================================================
# input_ids:  tensor([[49406,  1929, 49407], [49406,  1559, 49407]])
# attention_mask:  tensor([[1, 1, 1], [1, 1, 1]])
# pixel_values:  tensor([[[[-0.0113, ...,]]]])
# image_shape:  torch.Size([2, 3, 224, 224])

CLIPProcessor์—๋Š” CLIPImageProcessor์™€ CLIPTokenizer๊ฐ€ ๋‚ด๋ถ€์ ์œผ๋กœ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.
input_ids์—์„œ 49406๊ณผ 49407์€ ๊ฐ๊ฐ startoftext์™€ endoftext๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํŠน๋ณ„ํ•œ ๊ฐ’์ด๋‹ค.
attention_mask๋Š” ๋ณ€ํ™˜๋œ token_types๋กœ
๊ฐ’์ด 1์ด๋ฉด ํ•ด๋‹น์œ„์น˜ํ† ํฐ์ด ์‹ค์ œ๋ฐ์ดํ„ฐ๊ฐ’์„ ๋‚˜ํƒ€๋‚ด๊ณ , 0์€ [PAD]๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

with torch.no_grad():
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image
  probs = logits_per_image.softmax(dim=1)
  print('outputs:', outputs.keys())
  print('logits_per_image:', logits_per_image)
  print('probs: ', probs)

for idx, prob in enumerate(probs):
  print(f'- Image #{idx}')
  for label, p in zip(labels, prob):
    print(f'{label}: {p.item():.4f}')

# ============================================
# outputs: odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
# logits_per_image: tensor([[23.3881, 18.8604], [24.8627, 21.5765]])
# probs:  tensor([[0.9893, 0.0107], [0.9640, 0.0360]])
# - Image #0
# dog: 0.9893
# food: 0.0107
# - Image #1
# dog: 0.9640
# food: 0.0360

Zero shot Detection

์ž์—ฐ์–ด์  ์„ค๋ช…์—๋Š” ์ด๋ฏธ์ง€ ๋‚ด ๊ฐ์ฒด์™€ ๊ฐœ๋žต์  ์œ„์น˜์ •๋ณด๋ฅผ ์•”์‹œ์ ์œผ๋กœ ํฌํ•จํ•œ๋‹ค.
CLIP์—์„œ img-txt์Œ์œผ๋กœ ์‹œ๊ฐ์ ํŠน์ง•๊ณผ ํ…์ŠคํŠธ๊ฐ„ ์—ฐ๊ด€์„ฑ์„ ํ•™์Šต๊ฐ€๋Šฅํ•จ์„ ๋ณด์˜€๊ธฐ์—, 
์ถ”๋ก  ์‹œ, ์ฃผ์–ด์ง„ txt prompt๋งŒ ์ž˜ ์„ค๊ณ„ํ•œ๋‹ค๋ฉด ๊ฐ์ฒด์˜ ์œ„์น˜๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๊ฒŒ๋œ๋‹ค.
๋”ฐ๋ผ์„œ zero-shot object detection์—์„œ๋Š” ์ „ํ†ต์ ์ธ annotation์ •๋ณด ์—†์ด๋„ ์‹œ๊ฐ๊ณผ ์–ธ์–ด๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ฐ์ฒดํด๋ž˜์Šค๋ฅผ ๊ฒ€์ถœํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค.
OWL-ViT์˜ ๊ฒฝ์šฐ, Multi-Modal Backbone๋ชจ๋ธ๋กœ CLIP๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค.

OWLv2 (OWL-ViT)

OWL-ViT๊ตฌ์กฐ, OWLv2๋Š” ๊ฐ์ฒด๊ฒ€์ถœํ—ค๋“œ์— Objectness Classifier์ถ”๊ฐ€ํ•จ.
OWL-ViT๋Š” img-txt์Œ์œผ๋กœ pretrainํ•˜์—ฌ Open-Vocabulary๊ฐ์ฒดํƒ์ง€๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.
OWLv2๋Š” Self-Training๊ธฐ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค.
์ฆ‰, ๊ธฐ์กด Detector๋กœ Weak Supervision๋ฐฉ์‹์œผ๋กœ ๊ฐ€์ƒ์˜ Bbox-Annotation์„ ์ž๋™์ƒ์„ฑํ•œ๋‹ค.
ex) input: img-txt pair[๊ฐ•์•„์ง€๊ฐ€ ๊ณต์„ ๊ฐ€์ง€๊ณ  ๋…ธ๋Š”]
๊ธฐ์กด detector: [๊ฐ•์•„์ง€ bbox] [๊ณต bbox] ์ž๋™์˜ˆ์ธก, annotation์ƒ์„ฑ
--> ๋ชจ๋ธ ํ•™์Šต์— ์ด์šฉ (์ฆ‰, ์ •ํ™•ํ•œ ์œ„์น˜์ •๋ณด๋Š” ์—†์ง€๋งŒ ๋ถ€๋ถ„์  supervision signal๋กœ weak signal๊ธฐ๋ฐ˜, ๋ชจ๋ธ์ด ๊ฐ์ฒด์˜ ์œ„์น˜ ๋ฐ ํด๋ž˜์Šค๋ฅผ ์ถ”๋ก , ํ•™์Šตํ•˜๊ฒŒ ํ•จ)

Zero-Shot Detection ์˜ˆ์‹œ

import io
from PIL import Image
from datasets import load_dataset
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")
dataset = load_dataset('Francesco/animals-ij5d2')
print(dataset)
print(dataset['test'][0])

# ==========================================================
# DatasetDict({
#     train: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 700
#     })
#     validation: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 100
#     })
#     test: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 200
#     })
# })
# {'image_id': 63, 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2B0186E4A0>, 'width': 640, 'height': 640, 'objects': {'id': [96, 97, 98, 99], 'area': [138029, 8508, 10150, 20624], 'bbox': [[129.0, 291.0, 395.5, 349.0], [119.0, 266.0, 93.5, 91.0], [296.0, 280.0, 116.0, 87.5], [473.0, 284.0, 167.0, 123.5]], 'category': [3, 3, 3, 3]}}

 

- Label ๋ฐ Image ์ „์ฒ˜๋ฆฌ
images = dataset['test']['image'][:2]
categories = dataset['test'].features['objects'].feature['category'].names
labels = [categories] * len(images)
inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)

print(images)
print(labels)
print('input_ids:', inputs['input_ids'])
print('attention_mask:', inputs['attention_mask'])
print('pixel_values:', inputs['pixel_values'])
print('image_shape:', inputs['pixel_values'].shape)

# ==========================================================
# [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CF790>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CCC10>]
# [['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk'], ['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk']]
# input_ids: tensor([[49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0],
#         [49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0]])
# attention_mask: tensor([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#         [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], 
#          [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#           [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]])
# pixel_values: tensor([[[[ 1.5264, ..., ]]]])
# image_shape: torch.Size([2, 3, 960, 960])

- Detection & Inference
import torch

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.keys()) # odict_keys(['logits', 'objectness_logits', 'pred_boxes', 'text_embeds', 'image_embeds', 'class_embeds', 'text_model_output', 'vision_model_output'])

- Post Processing
import matplotlib.pyplot as plt
from PIL import ImageDraw, ImageFont

# ์˜ˆ์ธกํ™•๋ฅ ์ด ๋†’์€ ๊ฐ์ฒด ์ถ”์ถœ
shape = [dataset['test'][:2]['width'], dataset['test'][:2]['height']]
target_sizes = list(map(list, zip(*shape))) # [[640, 640], [640, 640]]
results = processor.post_process_object_detection(outputs=outputs, threshold=0.5, target_sizes=target_sizes)
print(results)

# Post Processing
for idx, (image, detect) in enumerate(zip(images, results)):
    image = image.copy()
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("arial.ttf", 18)

    for box, label, score in zip(detect['boxes'], detect['labels'], detect['scores']):
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline='red', width=3)

        label_text = f'{labels[idx][label]}: {round(score.item(), 3)}'
        draw.text((box[0], box[1]), label_text, fill='white', font=font)

    plt.imshow(image)
    plt.axis('off')
    plt.show()
    
# ==============================================
# [{'scores': tensor([0.5499, 0.6243, 0.6733]), 'labels': tensor([3, 3, 3]), 'boxes': tensor([[329.0247, 287.1844, 400.3372, 357.9262],
#         [122.9359, 272.8753, 534.3260, 637.6506],
#         [479.7363, 294.2744, 636.4859, 396.8372]])}, {'scores': tensor([0.7538]), 'labels': tensor([7]), 'boxes': tensor([[ -0.7799, 173.7043, 440.0294, 538.7166]])}]

 

Zero shot Semantic segmentation

Image Segmentation์€ ๋ณด๋‹ค ์ •๋ฐ€ํ•œ, ํ”ฝ์…€๋ณ„ ๋ถ„๋ฅ˜๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ์— ๋†’์€ ๊ณ„์‚ฐ๋น„์šฉ์ด ๋“ค๋ฉฐ, ๊ด‘๋ฒ”์œ„ํ•œ train data์™€ ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•„์š”๋กœ ํ•œ๋‹ค.
์ „ํ†ต์  ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” threshold๊ธฐ๋ฐ˜ binary classification, Edge Detection๋“ฑ์ด ์žˆ์œผ๋ฉฐ
์ตœ์‹  ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋”ฅ๋Ÿฌ๋‹๋ชจ๋ธ์„ ์ด์šฉํ•ด Image Segmentation์„ ์ง„ํ–‰ํ•œ๋‹ค.
์ „ํ†ต์  ๋ฐฉ๋ฒ•์€ ๋‹จ์ˆœํ•˜๊ณ  ๋น ๋ฅด์ง€๋งŒ ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋‹ค์–‘ํ•œ ์กฐ๋ช…์กฐ๊ฑด ๋“ฑ์—์„œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ์ €ํ•˜๋˜๋Š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.


SAM (Segment Anything Model)

Model                   Architecture  Input_Size  Patch_Size  #params
facebook/sam-vit-base   ViT-B/16      1024×1024   16×16       ~90M
facebook/sam-vit-large  ViT-L/16      1024×1024   16×16       ~310M
facebook/sam-vit-huge   ViT-H/16      1024×1024   16×16       ~640M

 

SAM๊ตฌ์กฐ: img-encoder, prompt-encoder, mask-decoder

SAM์€ Meta์—์„œ ๊ฐœ๋ฐœํ•œ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์—์„œ ์ˆ˜์ง‘ํ•œ 1100๋งŒ๊ฐœ ์ด๋ฏธ์ง€๋ฅผ ์ด์šฉํ•ด ํ•™์Šตํ•œ ๋ชจ๋ธ์ด๋‹ค.

๊ทธ๋ ‡๊ธฐ์— ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์—์„œ image segmentation์ž‘์—…์„ ๊ณ ์ˆ˜์ค€์œผ๋กœ ์ˆ˜ํ–‰๊ฐ€๋Šฅํ•˜๋‹ค.
SAM์„ ์ด์šฉํ•˜๋ฉด ๋งŽ์€๊ฒฝ์šฐ, ์ถ”๊ฐ€์ ์ธ Fine-Tuning์—†์ด, ๋‹ค์–‘ํ•œ Domain image์— ๋Œ€ํ•œ segmentation์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

SAM์€ prompt๋ฅผ ๋ฐ›์„์ˆ˜๋„ ์žˆ๊ณ , ๋ฐ›์ง€ ์•Š์•„๋„ ๋˜๋Š”๋ฐ, prompt๋Š” ์ขŒํ‘œ, bbox, txt ๋“ฑ ๋‹ค์–‘ํ•˜๊ฒŒ ์ค„ ์ˆ˜ ์žˆ๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ prompt๋ฅผ ์ฃผ์ง€ ์•Š์œผ๋ฉด img ์ „์ฒด์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ Segmentation์„ ์ง„ํ–‰ํ•œ๋‹ค.
๋‹ค๋งŒ, Inference๊ฒฐ๊ณผ๋กœ Binary Mask๋Š” ์ œ๊ณตํ•˜์ง€๋งŒ pixel์— ๋Œ€ํ•œ ๊ตฌ์ฒด์  class์ •๋ณด๋Š” ํฌํ•จํ•˜์ง€ ์•Š๋Š”๋‹ค.

SAM ํ™œ์šฉ ์˜ˆ์‹œ

import io
from PIL import Image
from datasets import load_dataset
from transformers import SamProcessor, SamModel

def filter_category(data):
    # 16 = dog
    # 23 = giraffe
    return 16 in data["objects"]["category"] or 23 in data["objects"]["category"]

def convert_image(data):
    byte = io.BytesIO(data["image"]["bytes"])
    img = Image.open(byte)
    return {"img": img}

model_name = "facebook/sam-vit-base"
processor = SamProcessor.from_pretrained(model_name) 
model = SamModel.from_pretrained(model_name)

dataset = load_dataset("s076923/coco-val")
filtered_dataset = dataset["validation"].filter(filter_category)
converted_dataset = filtered_dataset.map(convert_image, remove_columns=["image"])
import numpy as np
from matplotlib import pyplot as plt


def show_point_box(image, input_points, input_labels, input_boxes=None, marker_size=375):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    ax = plt.gca()
    
    input_points = np.array(input_points)
    input_labels = np.array(input_labels)

    pos_points = input_points[input_labels[0] == 1]
    neg_points = input_points[input_labels[0] == 0]
    
    ax.scatter(
        pos_points[:, 0],
        pos_points[:, 1],
        color="green",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )
    ax.scatter(
        neg_points[:, 0],
        neg_points[:, 1],
        color="red",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )

    if input_boxes is not None:
        for box in input_boxes:
            x0, y0 = box[0], box[1]
            w, h = box[2] - box[0], box[3] - box[1]
            ax.add_patch(
                plt.Rectangle(
                    (x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2
                )
            )

    plt.axis("on")
    plt.show()


image = converted_dataset[0]["img"]
input_points = [[[250, 200]]]
input_labels = [[[1]]]

show_point_box(image, input_points[0], input_labels[0])
inputs = processor(
    image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
)

# input_points shape : torch.Size([1, 1, 1, 2])
# input_points : tensor([[[[400.2347, 320.0000]]]], dtype=torch.float64)
# input_labels shape : torch.Size([1, 1, 1])
# input_labels : tensor([[[1]]])
# pixel_values shape : torch.Size([1, 3, 1024, 1024])
# pixel_values : tensor([[[[ 1.4612,  ...]]])

input_points: [B, ์ขŒํ‘œ๊ฐœ์ˆ˜, ์ขŒํ‘œ] -- ๊ด€์‹ฌ๊ฐ–๋Š” ๊ฐ์ฒด๋‚˜ ์˜์—ญ์ง€์ • ์ขŒํ‘œ
input_labels: [B, ์ขŒํ‘œB, ์ขŒํ‘œ๊ฐœ์ˆ˜] -- input_points์— ๋Œ€์‘๋˜๋Š” label์ •๋ณด
 - input_labels์ข…๋ฅ˜:

๋ฒˆํ˜ธ ์ด๋ฆ„ ์„ค๋ช…
1 foreground ํด๋ž˜์Šค ๊ฒ€์ถœํ•˜๊ณ ์ž ํ•˜๋Š” ๊ด€์‹ฌ๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋œ ์ขŒํ‘œ
0 not foreground ํด๋ž˜์Šค ๊ด€์‹ฌ๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€ ์ขŒํ‘œ
-1 background ํด๋ž˜์Šค ๋ฐฐ๊ฒฝ์˜์—ญ์— ํ•ด๋‹นํ•˜๋Š” ์ขŒํ‘œ
-10 padding ํด๋ž˜์Šค batch_size๋ฅผ ๋งž์ถ”๊ธฐ ์œ„ํ•œ padding๊ฐ’ (๋ชจ๋ธ์ด ์ฒ˜๋ฆฌX)

[Processor๋กœ ์ฒ˜๋ฆฌ๋œ ์ดํ›„ ์ถœ๋ ฅ๊ฒฐ๊ณผ]
input_points
: [B, ์ขŒํ‘œB, ๋ถ„ํ• ๋งˆ์Šคํฌ ๋‹น ์ขŒํ‘œ๊ฐœ์ˆ˜, ์ขŒํ‘œ์œ„์น˜] 
input_labels: [B, ์ขŒํ‘œB, ์ขŒํ‘œ๊ฐœ์ˆ˜] 


import torch


def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()

    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(30, 15))

    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach()
        axes[i].imshow(np.array(raw_image))
        show_mask(mask, axes[i])
        axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
        axes[i].axis("off")
    plt.show()


model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)
print("iou_scores shape :", outputs.iou_scores.shape)
print("iou_scores :", outputs.iou_scores)
print("pred_masks shape :", outputs.pred_masks.shape)
print("pred_masks :", outputs.pred_masks)

# iou_scores shape : torch.Size([1, 1, 3])
# iou_scores : tensor([[[0.7971, 0.9507, 0.9603]]])
# pred_masks shape : torch.Size([1, 1, 3, 256, 256])
# pred_masks : tensor([[[[[ -3.6988, ..., ]]]]])

iou_scores: [B, number of points, IoU scores]
pred_masks: [B, point batch, C, H, W]


input_points = [[[250, 200], [15, 50]]]
input_labels = [[[0, 1]]]
input_boxes = [[[100, 100, 400, 600]]]

show_point_box(image, input_points[0], input_labels[0], input_boxes[0])
inputs = processor(
    image,
    input_points=input_points,
    input_labels=input_labels,
    input_boxes=input_boxes,
    return_tensors="pt"
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)

 



Zero-shot Instance Segmentation

Zero-shot Detection + SAM

SAM has no ability to classify the objects it detects.
This makes it hard to use SAM alone for instance segmentation, where objects in an image must be separated at the pixel level.

To overcome this limitation, a zero-shot detection model can be combined with SAM:
1) Use a zero-shot detection model to detect object classes and bounding box regions.
2) Run SAM inside each bounding box region to produce the segmentation mask.

from transformers import pipeline

generator = pipeline("mask-generation", model=model_name)
outputs = generator(image, points_per_batch=32)

plt.imshow(np.array(image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()

print("outputs mask์˜ ๊ฐœ์ˆ˜ :", len(outputs["masks"]))
print("outputs scores์˜ ๊ฐœ์ˆ˜ :", len(outputs["scores"]))

# outputs mask์˜ ๊ฐœ์ˆ˜ : 52
# outputs scores์˜ ๊ฐœ์ˆ˜ : 52

detector = pipeline(
    model="google/owlv2-base-patch16", task="zero-shot-object-detection"
)

image = converted_dataset[24]["img"]
labels = ["dog", "giraffe"]
results = detector(image, candidate_labels=labels, threshold=0.5)

input_boxes = []
for result in results:
    input_boxes.append(
        [
            result["box"]["xmin"],
            result["box"]["ymin"],
            result["box"]["xmax"],
            result["box"]["ymax"]
        ]
    )
    print(result)

inputs = processor(image, input_boxes=[input_boxes], return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

plt.imshow(np.array(image))
ax = plt.gca()

for mask, iou in zip(masks[0], outputs.iou_scores[0]):
    max_iou_idx = torch.argmax(iou)
    best_mask = mask[max_iou_idx]
    show_mask(best_mask, ax=ax, random_color=True)

plt.axis("off")
plt.show()

#{'score': 0.6905778646469116, 'label': 'giraffe', 'box': {'xmin': 96, 'ymin': 198, 'xmax': 294, 'ymax': 577}}
#{'score': 0.6264181733131409, 'label': 'giraffe', 'box': {'xmin': 228, 'ymin': 199, 'xmax': 394, 'ymax': 413}}

 

Image Matching

image matching์€ ๋””์ง€ํ„ธ ์ด๋ฏธ์ง€๊ฐ„ ์œ ์‚ฌ์„ฑ์„ ์ •๋Ÿ‰ํ™”, ๋น„๊ตํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.
์ด๋ฅผ image์˜ feature vector๋ฅผ ์ถ”์ถœํ•˜์—ฌ ๊ฐ image vector๊ฐ„ ์œ ์‚ฌ๋„(๊ฑฐ๋ฆฌ)๋ฅผ ์ธก์ •ํ•˜์—ฌ ๊ณ„์‚ฐํ•œ๋‹ค.
๊ทธ๋ ‡๊ธฐ์— ์ด๋ฏธ์ง€ ๋งค์นญ์˜ ํ•ต์‹ฌ์€ "์ด๋ฏธ์ง€ ํŠน์ง•์„ ํšจ๊ณผ์ ์œผ๋กœ ํฌ์ฐฉํ•˜๋Š” feature vector์˜ ์ƒ์„ฑ"์ด๋‹ค.
(๋ณดํ†ต ํŠน์ง•๋ฒกํ„ฐ๊ฐ€ ๊ณ ์ฐจ์›์ผ์ˆ˜๋ก ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ์ด ํŠน์ง•๋ฒกํ„ฐ๋Š” classification layer์™€ ๊ฐ™์€ ์ธต์„ ํ†ต๊ณผํ•˜๊ธฐ ์ง์ „(= Feature Extractor์˜ ๊ฒฐ๊ณผ๊ฐ’ = Classifier ์ง์ „๊ฐ’) ๋ฒกํ„ฐ๋ฅผ ๋ณดํ†ต ์˜๋ฏธํ•œ๋‹ค.)
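As a minimal sketch of "measuring similarity between feature vectors" (with made-up toy vectors, not outputs of the models below), cosine similarity is one common choice alongside L2 distance:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = a.b / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 4-dimensional "feature vectors"
v1 = np.array([1.0, 0.0, 1.0, 0.0])
v2 = np.array([1.0, 0.0, 1.0, 0.0])
v3 = np.array([0.0, 1.0, 0.0, 1.0])

print(cosine_similarity(v1, v2))  # 1.0 (identical direction)
print(cosine_similarity(v1, v3))  # 0.0 (orthogonal)
```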

ex) ViT๋ฅผ ์ด์šฉํ•œ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ ์˜ˆ์ œ
import torch
from datasets import load_dataset
from transformers import ViTImageProcessor, ViTModel

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["pixel_values"])

print("shape of the last feature map :", outputs["last_hidden_state"].shape)
print("dimension of the feature vector :", outputs["last_hidden_state"][:, 0, :].shape)
print("feature vector :", outputs["last_hidden_state"][:, 0, :])

# shape of the last feature map : torch.Size([1, 197, 768])
# dimension of the feature vector : torch.Size([1, 768])
# feature vector : tensor([[ 2.9420e-01,  8.3502e-01,  ..., -8.4114e-01,  1.7990e-01]])

ImageNet-21K๋ผ๋Š” ๋ฐฉ๋Œ€ํ•œ ์‚ฌ์ „Dataset์œผ๋กœ ํ•™์Šต๋˜์–ด ๋ฏธ์„ธํ•œ ์ฐจ์ด ๋ฐ ๋ณต์žกํ•œ ํŒจํ„ด์„ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.
ViT์—์„œ feature vector์ถ”์ถœ ์‹œ, ์ฃผ๋ชฉํ• ์ ์€ last_hidden_state ํ‚ค ๊ฐ’์ด๋‹ค:
์ถœ๋ ฅ์ด [1, 197, 768]์˜ [B, ์ถœ๋ ฅํ† ํฐ์ˆ˜, feature์ฐจ์›]์„ ์˜๋ฏธํ•˜๋Š”๋ฐ, 197๊ฐœ์˜ ์ถœ๋ ฅํ† ํฐ์€ ๋‹ค์Œ์„ ์˜๋ฏธํ•œ๋‹ค.
224×224 --> 16×16(patch_size) --> 196๊ฐœ patches,
197 = [CLS] + 196 patches๋กœ ์ด๋ฃจ์–ด์ง„ ์ถœ๋ ฅํ† ํฐ์—์„œ [CLS]๋ฅผ ํŠน์ง•๋ฒกํ„ฐ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.
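The token-count arithmetic above can be checked directly:

```python
# Number of ViT output tokens for a 224x224 input with 16x16 patches
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size  # 14 patches along each side
num_patches = patches_per_side ** 2          # 14 * 14 = 196 patches
num_tokens = num_patches + 1                 # +1 for the [CLS] token

print(num_tokens)  # 197
```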


FAISS (Facebook AI Similarity Search)

FAISS is a high-performance vector similarity search library developed by Meta.
It is designed to "search for similar vectors in large, high-dimensional vector databases".


cf) [Ways to store and manage vectors]

  • Local storage: high-speed devices such as SSDs or NVMe drives allow fast data access.
  • Database systems: databases with vector-search support, such as PostgreSQL with the pgvector extension or MongoDB's Atlas Vector Search.
  • Cloud vector databases: cloud services such as Amazon OpenSearch and Google Vertex AI provide specialized solutions for storing and searching large-scale vector data.
  • Vector search engines: vector databases such as Milvus, Qdrant, Weaviate, and FAISS are optimized for efficient storage of and high-performance search over large vector datasets; they support fast similarity search via ANN (Approximate Nearest Neighbor) algorithms and are especially well suited to real-time search.


ex) CLIP์„ ์ด์šฉํ•œ ์ด๋ฏธ์ง€ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ ์˜ˆ์ œ

import torch
import numpy as np
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

dataset = load_dataset("sasha/dog-food")
images = dataset["test"]["image"][:100]

model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)

vectors = []
with torch.no_grad():
    for image in images:
        inputs = processor(images=image, padding=True, return_tensors="pt")
        outputs = model.get_image_features(**inputs)
        vectors.append(outputs.cpu().numpy())

vectors = np.vstack(vectors)
print("์ด๋ฏธ์ง€ ๋ฒกํ„ฐ์˜ shape :", vectors.shape)

# ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ์˜ shape : (100, 512)

dog-food dataset์—์„œ 100๊ฐœ ์ด๋ฏธ์ง€๋ฅผ ์„ ํƒ  ๊ฐ ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœ
 vectors๋ฆฌ์ŠคํŠธ์— ์ €์žฅ → ndarrayํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜

์ด๋Ÿฐ ํŠน์ง•๋ฒกํ„ฐ๋ฅผ ์œ ์‚ฌ๋„ ๊ฒ€์ƒ‰์„ ์œ„ํ•œ ์ธ๋ฑ์Šค ์ƒ์„ฑ์— ํ™œ์šฉ๊ฐ€๋Šฅํ•˜๋‹ค:
์ƒ์„ฑ๋œ ์ธ๋ฑ์Šค์— ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋ฅผ ๋“ฑ๋กํ•˜๊ธฐ ์œ„ํ•ด add๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ด๋•Œ ์ž…๋ ฅ๋˜๋Š” ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋Š” ๋ฐ˜๋“œ์‹œ numpy์˜ ndarrayํ˜•์‹์˜ [๋ฒกํ„ฐ๊ฐœ์ˆ˜, ๋ฒกํ„ฐ์ฐจ์›์ˆ˜] ํ˜•ํƒœ๋กœ ๊ตฌ์„ฑ๋˜์–ด์•ผ ํ•œ๋‹ค!!

import faiss

dimension = vectors.shape[-1]
index = faiss.IndexFlatL2(dimension)
if torch.cuda.is_available():
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(vectors)


import matplotlib.pyplot as plt

search_vector = vectors[0].reshape(1, -1)
num_neighbors = 5
distances, indices = index.search(x=search_vector, k=num_neighbors)

fig, axes = plt.subplots(1, num_neighbors + 1, figsize=(15, 5))

axes[0].imshow(images[0])
axes[0].set_title("Input Image")
axes[0].axis("off")

for i, idx in enumerate(indices[0]):
    axes[i + 1].imshow(images[idx])
    axes[i + 1].set_title(f"Match {i + 1}\nIndex: {idx}\nDist: {distances[0][i]:.2f}")
    axes[i + 1].axis("off")

print("์œ ์‚ฌํ•œ ๋ฒกํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ฒˆํ˜ธ:", indices)
print("์œ ์‚ฌ๋„ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ:", distances)

# ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ฒˆํ˜ธ: [[ 0  6 75  1 73]]
# ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ: [[ 0.       43.922516 44.92473  46.544144 47.058586]]

์œ„ ๊ณผ์ •์„ ํ†ตํ•ด 100๊ฐœ์˜ ๋ฒกํ„ฐ๋ฅผ ์ €์žฅํ•œ FAISS ์ธ๋ฑ์Šค๊ฐ€ ์ƒ์„ฑ๋˜๋ฉฐ, ๊ฒ€์ƒ‰ํ•˜๊ณ ์žํ•˜๋Š” ์ด๋ฏธ์ง€์˜ ํŠน์ง•๋ฒกํ„ฐ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ธ๋ฑ์Šค ๋‚ด์—์„œ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ถ”์ถœ๊ฐ€๋Šฅํ•˜๋‹ค.
๋‹ค๋งŒ, ์ธ๋ฑ์Šค์— ์ €์žฅ๋œ ๋ฒกํ„ฐ์— ๋Œ€ํ•ด์„œ๋งŒ ๊ฒ€์ƒ‰์ด ๊ฐ€๋Šฅํ•˜๊ธฐ์— ๊ฒ€์ƒ‰๋ฒ”์œ„๋ฅผ ํ™•์žฅํ•˜๊ณ ์ž ํ•œ๋‹ค๋ฉด ๋” ๋งŽ์€ ๋ฒกํ„ฐ๋ฅผ ์ธ๋ฑ์Šค์— ์ถ”๊ฐ€ํ•ด์•ผํ•œ๋‹ค.

์œ„ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ์žˆ๋Š”๋ฐ, FAISS ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์ธ๋ฑ์Šค์œ ํ˜•๋“ค์„ ์ œ๊ณตํ•œ๋‹ค:

index = faiss.IndexFlatL2(dimension)

 

์ด๋ฆ„ ์ •ํ™•๋„ ์†๋„ ํŠน์ง•
IndexFlatL2 ๊ฐ€์žฅ ๋†’์Œ ๊ฐ€์žฅ ๋Š๋ฆผ ๋ชจ๋“  ๋ฒกํ„ฐ์— ๋Œ€ํ•œ ์™„์ „ํƒ์ƒ‰์„ ์ˆ˜ํ–‰
IndexHNSW ๋†’์Œ ๋ณดํ†ต ๊ทธ๋ž˜ํ”„ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ด ํšจ์œจ์  ๊ฒ€์ƒ‰
IndexIVFlat ๋ณดํ†ต ๊ฐ€์žฅ ๋น ๋ฆ„ ๋ฒกํ„ฐ๊ฐ„ clustering์œผ๋กœ ํƒ์ƒ‰๋ฒ”์œ„๋ฅผ ์ค„์—ฌ ๊ฒ€์ƒ‰
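What IndexFlatL2's exhaustive search computes can be sketched in plain numpy (toy random data, not the CLIP vectors above):

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((100, 512)).astype("float32")  # stand-in for the stored index
query = vectors[0:1]                                # search with the first vector

# squared L2 distance from the query to every stored vector (what IndexFlatL2 does)
distances = ((vectors - query) ** 2).sum(axis=1)
indices = np.argsort(distances)[:5]                 # 5 nearest neighbors

print(indices)  # the first hit is the query itself (distance 0)
print(distances[indices])
```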


Multi-Modal

Image Captioning (img2txt)

BLIP

The core idea of BLIP is "modeling the interaction between image and text".
To do so, an image encoder and a text encoder each produce feature vectors, which are combined into a joint representation.

BLIP-2 structure

BLIP-2 introduces the Q-Former to improve image-text interaction and information exchange:
[image-text contrastive learning, ITM, image-grounded text generation] --> performed simultaneously in a single encoder-decoder structure.
The Q-Former takes frozen image feature embeddings as input
and outputs soft visual prompt embeddings in which the image-text relationship is well represented.

DocumentQA

DQA (Document QA) fuses natural language processing with information retrieval to answer questions about documents.
DQA must take visual structure and layout into account; among the models for this, one of the most prominent is LayoutLM.

LayoutLM (Layout-aware Language Model)

LayoutLM is a model from Microsoft pre-trained on document images using not only the text but also its layout information.

[LayoutLMv1]

Based on BERT, it takes the text together with the text's position information as input.
Text and bounding boxes are extracted with an OCR pipeline (using a detector such as Faster R-CNN), added as position embeddings, and the image patches (features) for each word are also fed into the model. However, in LayoutLMv1 the image features are only added at the very end, so they cannot actually be used during pre-training.




[LayoutLMv2]

LayoutLMv2 additionally introduces image embeddings to reflect the document's visual information effectively.
In LayoutLMv2, the visual embeddings are extracted with a ResNeXt-FPN backbone.
That is, text, image patches, and layout information are fed in together and processed with self-attention.
- Main pre-training objectives:
i) Masked Visual-Language Modeling: predict the blanks in a sentence
ii) ITM: learn the association between a piece of text and the corresponding image
iii) Text-Image Alignment: learn to identify, when a particular word is covered in the image, where it is located


[LayoutLMv3]

(left: DocFormer, right: LayoutLMv3)

LayoutLMv3 is the first unified multimodal document model that does not depend on a pre-trained backbone such as Faster R-CNN or a CNN.
To achieve this, unlike its predecessors, it introduces new pre-training strategies and tasks:
i) Masked Language Modeling (MLM): mask some word tokens
ii) Masked Image Modeling (MIM): mask the image regions corresponding to the masked tokens
iii) Word Patch Alignment (WPA): binary classification of whether an image token and its corresponding text token are masked, learning alignment between the two modalities

<LayoutLMv3 structure>: embedding layer, patch_embedding module, encoder
1) The embedding layer combines several types of embeddings:
 - word_embed + token_type_emb + pos_emb + (x_pos_emb, y_pos_emb, h_pos_emb, w_pos_emb)

2) The patch_embedding module processes the image:
 - splits it into patches and converts each patch into an embedding, playing the role of a ViT

3) Encoder
 - composed of several Transformer layers.

 

VQA

VQA process: extract visual features → understand the question (Q) → combine the visual features with the question's text information to produce a meaningful representation (the answer, A).
ViLT was introduced precisely for this.

ViLT (Vision-and-Language Transformer)

ViLT is a single-model architecture that processes visual input in the same way as text input.
A modal-type embedding is added to distinguish the two modalities,
and training is driven by three loss functions:
- ITM: judge whether a given image and text are related.
- MLM: word-level masking to learn whole-word context
- WPA: maximize the vector similarity between image and text


As a result, image and text are combined effectively and represented in a single embedding space.

cf) collate_fn์€ pytorch์˜ dataloader๋กœ batch๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ, ๊ฐ sample์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•  ๊ฒƒ์ธ์ง€ ์ •์˜ํ•˜๋Š” ํ•จ์ˆ˜๋‹ค.

Image Generation

Image generation is the technique of understanding a prompt and using a GAN or a diffusion model to capture the prompt's fine-grained details and produce a new image.

Diffusion Model

[Forward process]: gradually add Gaussian noise to the source image
[Reverse process]: recover the original from pure noise (by updating the mean and standard deviation)
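A toy sketch of the forward process at a single timestep (numpy, with a made-up cumulative noise-schedule value; not the diffusers implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))  # stand-in for a source image
alpha_bar_t = 0.7        # cumulative noise-schedule value at step t (made up)

noise = rng.standard_normal(x0.shape)
# forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

print(x_t.shape)  # (8, 8): same shape as the source, but progressively noisier
```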


[Stable-Diffusion 1]
- generates 512×512 images
- features such as txt2img, img2img, inpainting


[Stable-Diffusion 2]
- generates 768×768 images
- OpenCLIP provides better text-image alignment and improved fine detail


[Stable-Diffusion 3]
- generates even higher-resolution images
- a new architecture based on rectified flow
- a new architecture enabling bidirectional information flow between text and image tokens


etc

Hyperparameter Tuning - Ray Tune

Ray Tune is a distributed hyperparameter optimization framework.
It supports various hyperparameter search algorithms (random, greedy, etc.) in large-scale distributed computing environments and also provides early stopping.
In addition, it offers tools for tracking and visualizing experiment results, helping to identify the best hyperparameter combination effectively.
!pip3 install ray[tune] optuna


ex) NER Ray Tune example

i) Training preparation

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

def preprocess_data(example, tokenizer):
    sentence = "".join(example["tokens"]).replace("\xa0", " ")
    encoded = tokenizer(
        sentence,
        return_offsets_mapping=True,
        add_special_tokens=False,
        padding=False,
        truncation=False
    )

    labels = []
    for offset in encoded.offset_mapping:
        if offset[0] == offset[1]:
            labels.append(-100)
        else:
            labels.append(example["ner_tags"][offset[0]])
    encoded["labels"] = labels
    return encoded

dataset = load_dataset("klue", "ner")
labels = dataset["train"].features["ner_tags"].feature.names

model_name = "Leo97/KoELECTRA-small-v3-modu-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    ignore_mismatched_sizes=True
)

processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=False,
    remove_columns=dataset["train"].column_names
)


ii) Hyperparameter search

from ray import tune
from functools import partial
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForTokenClassification

def model_init(model_name, labels):
    return AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(labels), ignore_mismatched_sizes=True
    )

def hp_space(trial):
    return {
        "learning_rate": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-5, 1e-1),
        "num_train_epochs": tune.choice([1, 2, 3])
    }

def compute_objective(metrics):
    return metrics["eval_loss"]

training_args = TrainingArguments(
    output_dir="token-classification-hyperparameter-search",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # learning_rate=1e-4,
    # weight_decay=0.01,
    # num_train_epochs=5,
    seed=42
)

trainer = Trainer(
    model_init=partial(model_init, model_name=model_name, labels=labels),
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)
)

best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=5,
    direction="minimize",
    hp_space=hp_space,
    compute_objective=compute_objective,
    resources_per_trial={"cpu": 2, "gpu": 1},
    trial_dirname_creator=lambda trial: str(trial)
)
print(best_run.hyperparameters)


model_init function: creates the model instance (assigned so the search can find the best hyperparameters across multiple trials).
That is, it guarantees a consistent initial state for every trial.

hp_space function: specifies which hyperparameters to search during optimization and the range of values for each.

compute_objective function: the "evaluation metric" used during optimization, usually eval_loss or eval_accuracy.

TrainingArguments: lr, weight_decay, and num_train_epochs are searched via hp_space, so they are not set here.

Trainer: uses model_init instead of a fixed model instance.

Example output)

+-------------------------------------------------------------------+
| Configuration for experiment     _objective_2024-11-18_05-44-18   |
+-------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator            |
| Scheduler                        FIFOScheduler                    |
| Number of trials                 5                                |
+-------------------------------------------------------------------+

View detailed results here: /root/ray_results/_objective_2024-11-18_05-44-18
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-11-18_05-44-11_866890_872/artifacts/2024-11-18_05-44-18/_objective_2024-11-18_05-44-18/driver_artifacts`

Trial status: 5 PENDING
Current time: 2024-11-18 05:44:18. Total running time: 0s
Logical resource usage: 0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-------------------------------------------------------------------------------------------+
| Trial name               status       learning_rate     weight_decay     num_train_epochs |
+-------------------------------------------------------------------------------------------+
| _objective_27024_00000   PENDING        2.36886e-05       0.0635122                     3 |
| _objective_27024_00001   PENDING        6.02131e-05       0.00244006                    2 |
| _objective_27024_00002   PENDING        1.43217e-05       1.7074e-05                    1 |
| _objective_27024_00003   PENDING        3.99131e-05       0.00679658                    2 |
| _objective_27024_00004   PENDING        1.13871e-05       0.00772672                    2 |
+-------------------------------------------------------------------------------------------+

Trial _objective_27024_00000 started with configuration:
+-------------------------------------------------+
| Trial _objective_27024_00000 config             |
+-------------------------------------------------+
| learning_rate                             2e-05 |
| num_train_epochs                              3 |
| weight_decay                            0.06351 |
+-------------------------------------------------+

...

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a model-optimization method that can greatly improve the efficiency of LLMs.
It quantizes the model's weights to a lower bit precision, shrinking the model and speeding up inference.
As the output of the example below shows, the model size can be reduced substantially,
and GPTQ applies not only to GPT-family models but to other Transformer-based models as well.

GPTQ๋ฅผ ์ด์šฉํ•œ ๋ชจ๋ธ ์–‘์žํ™” ์˜ˆ์ œ

from transformers import GPTQConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
from transformers import pipeline

origin_generator = pipeline("text-generation", model="facebook/opt-125m")
quantized_generator = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

input_text_list = [
    "In the future, technology wil",
    "What are we having for dinner?",
    "What day comes after Monday?"
]

print("์›๋ณธ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:")
for input_text in input_text_list:
    print(origin_generator(input_text))
print("์–‘์žํ™” ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:")
for input_text in input_text_list:
    print(quantized_generator(input_text))
    
# ์›๋ณธ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:
# [{'generated_text': 'In the future, technology wil be used to make the world a better place.\nI think'}]
# [{'generated_text': 'What are we having for dinner?\n\nWe have a great dinner tonight. We have a'}]
# [{'generated_text': "What day comes after Monday?\nI'm guessing Monday."}]
# ์–‘์žํ™” ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:
# [{'generated_text': 'In the future, technology wil be able to make it possible to make a phone that can be'}]
# [{'generated_text': "What are we having for dinner?\n\nI'm not sure what to do with all this"}]
# [{'generated_text': "What day comes after Monday?\nI'm not sure, but I'll be sure to check"}]

์ถœ๋ ฅ๊ฒฐ๊ณผ, ์ •ํ™•๋„๊ฐ€ ๋‹ค์†Œ ๋–จ์–ด์ง€๊ธด ํ•˜๋‚˜ ์›๋ณธ๋ชจ๋ธ๊ณผ ํฐ ์ฐจ์ด๊ฐ€ ์—†์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

import time
import numpy as np

def measure_inference_time(generator, input_text, iterations=10):
    times = []
    for _ in range(iterations):
        start_time = time.time()
        generator(input_text)
        end_time = time.time()
        times.append(end_time - start_time)
    avg_time = np.mean(times)
    return avg_time

def calculate_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    total_memory = sum(p.numel() * p.element_size() for p in model.parameters())
    total_memory_mb = total_memory / (1024 ** 2)
    return total_memory_mb, total_params

test_input = "Once upon a time in a land far, far away, there was a small village."

size_original, total_params_original = calculate_model_size(origin_generator.model)
avg_inference_time_original = measure_inference_time(origin_generator, test_input)

size_quantized, total_params_quantized = calculate_model_size(quantized_generator.model)
avg_inference_time_quantized = measure_inference_time(quantized_generator, test_input)

print("์›๋ณธ ๋ชจ๋ธ:")
print(f"- ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: {total_params_original:,}")
print(f"- ๋ชจ๋ธ ํฌ๊ธฐ: {size_original:.2f} MB")
print(f"- ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: {avg_inference_time_original:.4f} sec")

print("์–‘์žํ™” ๋ชจ๋ธ:")
print(f"- ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: {total_params_quantized:,}")
print(f"- ๋ชจ๋ธ ํฌ๊ธฐ: {size_quantized:.2f} MB")
print(f"- ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: {avg_inference_time_quantized:.4f} sec")

# ์›๋ณธ ๋ชจ๋ธ:
# - ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: 125,239,296
# - ๋ชจ๋ธ ํฌ๊ธฐ: 477.75 MB
# - ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: 0.1399 sec
# ์–‘์žํ™” ๋ชจ๋ธ:
# - ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: 40,221,696
# - ๋ชจ๋ธ ํฌ๊ธฐ: 76.72 MB
# - ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: 0.0289 sec

The benchmark output confirms that, compared with the original, the quantized model is much smaller and processes requests faster, which can be very efficient for real-time responses.


Transformers

pipeline()

Used for model inference.
from transformers import pipeline
pipe = pipeline("text-classification")
pipe("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998743534088135}]

With from transformers, functions can be imported from the transformers repository on GitHub(๐Ÿˆ‍โฌ›):

everything importable from transformers is implemented under src/transformers in the repository.

To see what is importable, check __init__.py: there you can confirm that pipeline is imported as from .pipelines import pipeline.


So the pipeline function is brought in from the pipelines folder,
and opening that folder shows the pipeline function defined there, matching the contents of the transformers.pipeline docs.



Auto Classes

Inference is possible via the from_pretrained() method; the Auto Classes perform this job and can automatically instantiate one of the pretrained AutoConfig, AutoModel, or AutoTokenizer classes:
 ex)

from transformers import AutoConfig, AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")




∙ AutoConfig

A generic configuration class, instantiated as one of the library's configuration classes via the from_pretrained() class method.
This class cannot be instantiated directly with '__init__()'!!

Following the transformers/src source down, you eventually reach the from_pretrained() function;
in the GitHub code, __init__() raises an EnvironmentError, which confirms this.

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config)


# BertConfig {
#   "_name_or_path": "bert-base-uncased",
#   "architectures": [
#     "BertForMaskedLM"
#   ],
#   "attention_probs_dropout_prob": 0.1,
#   "classifier_dropout": null,
#   "gradient_checkpointing": false,
#   "hidden_act": "gelu",
#   "hidden_dropout_prob": 0.1,
#   "hidden_size": 768,
#   "initializer_range": 0.02,
#   "intermediate_size": 3072,
#   "layer_norm_eps": 1e-12,
#   "max_position_embeddings": 512,
#   "model_type": "bert",
#   "num_attention_heads": 12,
#   "num_hidden_layers": 12,
#   "pad_token_id": 0,
#   "position_embedding_type": "absolute",
#   "transformers_version": "4.41.2",
#   "type_vocab_size": 2,
#   "use_cache": true,
#   "vocab_size": 30522
# }

The code above shows that the config is a JSON file describing the model:
it holds the architecture settings needed to build the model (hidden size, number of layers, dropout, etc.)
along with settings such as the tokenizer's special token ids.
Also, calling save_pretrained() saves the config together with the model checkpoint!
So, if you want to change settings, or in situations like a model proposal, you have to load the config file directly!




∙ AutoTokenizer, (blobs, refs, snapshots)

A generic tokenizer class with the AutoTokenizer.from_pretrained() class method.
On creation, it is instantiated as one of the library's tokenizer classes.
See) https://chan4im.tistory.com/#%E2%88%99input-ids
This class cannot be instantiated directly with '__init__()'!!

Following the transformers/src source down, you eventually reach the from_pretrained() function;
in the GitHub code, __init__() raises an EnvironmentError, which confirms this.

from transformers import AutoTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# If vocabulary files are in a directory 
# (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

# Download vocabulary from huggingface.co and define model-specific arguments
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

tokenizer
# BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
# 	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# }

๊ทธ๋Ÿฐ๋ฐ ํ•œ๊ฐ€์ง€ ๊ถ๊ธˆํ•˜์ง€ ์•Š์€๊ฐ€?

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

์œ„ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ›„ ์‹คํ–‰ํ•˜๋ฉด ์ฝ˜์†”์ฐฝ์— ์™œ ์•„๋ž˜์™€ ๊ฐ™์€ ํ™”๋ฉด์ด ๋‚˜์˜ค๋Š” ๊ฒƒ์ผ๊นŒ?????

๋ฏธ๋ฆฌ ์„ค๋ช…:
tokenizer_config.json์—๋Š” token์— ๋Œ€ํ•œ ์„ค์ •๋“ค์ด,
config.json์—๋Š” ๋ชจ๋ธ๊ด€๋ จ ์„ค์ •์ด,
vocab.txt๋Š” subword๋“ค์ด ๋“ค์–ด์žˆ๊ณ ,
tokenizer.json์€ config๋œ ๊ฐ’๋“ค์— ๋Œ€ํ•ด ๋‚˜์—ดํ•œ ๊ฒƒ์ด๋‹ค.


๋ณธ์ธ์˜ ๊ฒฝ์šฐ, ์•„๋ž˜์™€ ๊ฐ™์ด cache_dir์— ์ง€์ •์„ ํ•˜๋ฉด, ํ•ด๋‹น ๋””๋ ‰ํ† ๋ฆฌ์— hub๋ผ๋Š” ํŒŒ์ผ์ด ์ƒ๊ธฐ๋ฉฐ, ๊ทธ์•ˆ์— ๋ชจ๋ธ๊ด€๋ จ ํŒŒ์ผ์ด ์ƒ๊ธด๋‹ค.

parser.add_argument('--cache_dir', default="/data/huggingface_models")

ํƒ€๊ณ  ๋“ค์–ด๊ฐ€๋‹ค ๋ณด๋ฉด ์ด 3๊ฐ€์ง€ ํด๋”๊ฐ€ ๋‚˜์˜จ๋‹ค: blobs, refs, snapshots
blobs: ํ•ด์‹œ๊ฐ’์œผ๋กœ ๋‚˜ํƒ€๋‚˜์ ธ ์žˆ์Œ์„ ํ™•์ธ๊ฐ€๋Šฅํ•˜๋‹ค. ํ•ด๋‹นํŒŒ์ผ์—๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ํŒŒ์ผ์ด ์กด์žฌํ•œ๋‹ค:
Configํด๋ž˜์Šค๊ด€๋ จํŒŒ์ผ, Model๊ด€๋ จ ํด๋ž˜์ŠคํŒŒ์ผ๋“ค, tokenizer์„ค์ •๊ด€๋ จ jsonํŒŒ์ผ, weight๊ด€๋ จ ํŒŒ์ผ๋“ค ๋“ฑ๋“ฑ ์•„๋ž˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์ด ๋งŽ์€ ํŒŒ์ผ๋“ค์ด ์ฝ”๋“œํ™”๋˜์–ด ๋“ค์–ด์žˆ๋‹ค:


 
refs: ๋‹จ์ˆœํžˆ main์ด๋ผ๋Š” ํŒŒ์ผ๋งŒ ์กด์žฌํ•œ๋‹ค:

ํ•ด๋‹น ํŒŒ์ผ์—๋Š” snapshots์•ˆ์— ์žˆ๋Š” ๋””๋ ‰ํ† ๋ฆฌ์˜ ์ด๋ฆ„๊ณผ ๋™์ผํ•œ ์ด๋ฆ„์ด ์ ํ˜€์žˆ๋‹ค.


snapshots: ๋ฐ”๋กœ ์ด๊ณณ์—!! tokenizer_config.json, config.json, vocab.txt, tokenizer.jsonํŒŒ์ผ์ด ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค!!!



๊ทธ๋Ÿฐ๋ฐ ๋ญ”๊ฐ€ ์ด์ƒํ•˜์ง€ ์•Š์€๊ฐ€??
์œ„์˜ blobs์— ๋‚˜์™€์žˆ๋Š” ์‚ฌ์ง„์˜ ์ฝ”๋“œ์™€ snapshots์˜ ์ฝ”๋“œ๊ฐ€ ๋ชจ๋‘ ์ผ์น˜ํ•œ๋‹ค๋Š” ์‚ฌ์‹ค!!

๊ทธ๋ ‡๋‹ค! ์ฆ‰, blobs๋Š” snapshots ๋‚ด์šฉ์„ ํ•ด์‹œ๊ฐ’ํ˜•ํƒœ๋กœ ์ €์žฅํ•œ ๊ฒƒ์ด์—ˆ๋‹ค!!!
์‚ฌ์‹ค ์ด์ง“ํ•œ๋‹ค์Œ์— ๊ตฌ๊ธ€์— ์น˜๋‹ˆ๊นŒ ๋ฐ”๋กœ ์žˆ๊ธดํ–ˆ์—ˆ๋‹ค..(https://huggingface.co/docs/huggingface_hub/v0.16.3/en/guides/manage-cache)
ํ—ˆ๊น…ํŽ˜์ด์Šค ์„ค๋ช… ์š”์•ฝ:

Refs: the refs folder contains files indicating the latest revision of a given reference. For example, if files were previously fetched from the main branch of a repository, the refs folder contains a file named main, holding the commit identifier of the current head.

If the latest commit has the identifier aaaaaa, the file contains aaaaaa.

If the same branch is later updated with a new commit bbbbbb, re-downloading files from that reference updates the refs/main file to contain bbbbbb.

Blobs: the blobs folder contains the files that have actually been downloaded. The name of each file is its hash.

Snapshots: the snapshots folder contains symbolic links to the files mentioned in blobs above. It consists of several folders, one per known revision.
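This hashing scheme can be mimicked with an ordinary content hash; a minimal sketch (for LFS files the hub uses the file's sha256 as the blob name; small files use a different etag, so this is an illustration, not the exact rule):

```python
import hashlib

def blob_name(content: bytes) -> str:
    # Name a cached file by the hash of its content, like the hub's blobs folder.
    return hashlib.sha256(content).hexdigest()

config_bytes = b'{"model_type": "bert"}'
name = blob_name(config_bytes)
print(name)  # 64 hex characters; a snapshot entry would symlink to blobs/<name>
```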

cf) ๋˜ํ•œ cache๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ tree๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง:

    [  96]  .
    โ””โ”€โ”€ [ 160]  models--julien-c--EsperBERTo-small
        โ”œโ”€โ”€ [ 160]  blobs
        โ”‚   โ”œโ”€โ”€ [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        โ”‚   โ”œโ”€โ”€ [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
        โ”‚   โ””โ”€โ”€ [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
        โ”œโ”€โ”€ [  96]  refs
        โ”‚   โ””โ”€โ”€ [  40]  main
        โ””โ”€โ”€ [ 128]  snapshots
            โ”œโ”€โ”€ [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
            โ”‚   โ”œโ”€โ”€ [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
            โ”‚   โ””โ”€โ”€ [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
            โ””โ”€โ”€ [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
                โ”œโ”€โ”€ [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
                โ””โ”€โ”€ [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd

 


โˆ™ AutoModel

Naturally, just as above, you can trace these files down the same way, as in the picture below.

First, let's look at the AutoModel.from_config function.

from transformers import AutoConfig, AutoModel

# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_config(config)


@classmethod
def from_config(cls, config, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map
    has_local_code = type(config) in cls._model_mapping.keys()
    trust_remote_code = resolve_trust_remote_code(
        trust_remote_code, config._name_or_path, has_local_code, has_remote_code
    )

    if has_remote_code and trust_remote_code:
        class_ref = config.auto_map[cls.__name__]
        if "--" in class_ref:
            repo_id, class_ref = class_ref.split("--")
        else:
            repo_id = config.name_or_path
        model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
        if os.path.isdir(config._name_or_path):
            model_class.register_for_auto_class(cls.__name__)
        else:
            cls.register(config.__class__, model_class, exist_ok=True)
        _ = kwargs.pop("code_revision", None)
        return model_class._from_config(config, **kwargs)
    elif type(config) in cls._model_mapping.keys():
        model_class = _get_model_class(config, cls._model_mapping)
        return model_class._from_config(config, **kwargs)

    raise ValueError(
        f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
        f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    )
 


Data Collator

Data Collator

์ผ๋ จ์˜ sample list๋ฅผ "single training mini-batch"์˜ Tensorํ˜•ํƒœ๋กœ ๋ฌถ์–ด์คŒ
Default Data Collator์ด๋Š” ์•„๋ž˜์ฒ˜๋Ÿผ train_dataset์ด data_collator๋ฅผ ์ด์šฉํ•ด mini-batch๋กœ ๋ฌถ์—ฌ ๋ชจ๋ธ๋กœ ๋“ค์–ด๊ฐ€ ํ•™์Šตํ•˜๋Š”๋ฐ ๋„์›€์ด ๋œ๋‹ค.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
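What the default collation does can be sketched in plain Python, with hypothetical per-sample feature dicts standing in for a real dataset (no torch; a real collator would stack these into tensors):

```python
def default_collate(features):
    # Turn a list of per-sample dicts into one dict of batched lists,
    # the mini-batch shape a data collator hands to the model.
    return {key: [f[key] for f in features] for key in features[0]}

samples = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2088, 102], "attention_mask": [1, 1, 1]},
]
batch = default_collate(samples)
print(batch["input_ids"])  # [[101, 7592, 102], [101, 2088, 102]]
```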





batch["input_ids"], batch["labels"]?

However, unlike the above, most data collator functions take the form below, where two somewhat unfamiliar terms appear: input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples): 
        texts = []
        images = []
        for example in examples:
            image, question, answer = example 
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}] # <-- confirmed that everything up to here goes in correctly.
            text = self.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text)
            images.append(image)

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
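The labels step above, cloning input_ids and replacing padding with -100 so the loss ignores those positions, can be sketched without a real processor (pad token id 0 is an assumed value):

```python
PAD_TOKEN_ID = 0  # assumed pad id for this sketch

def make_labels(input_ids):
    # Clone input_ids into labels, masking pad positions with -100
    # (the index PyTorch's cross-entropy ignores by default).
    return [tid if tid != PAD_TOKEN_ID else -100 for tid in input_ids]

print(make_labels([101, 7592, 2088, 102, 0, 0]))  # [101, 7592, 2088, 102, -100, -100]
```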

So what exactly are batch["input_ids"] and batch["labels"]?

์ „์ˆ ํ–ˆ๋˜ data_collator๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ํ˜•์‹์„ ๋ ๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ๋„ ๋ณด๋ฉด inputs์™€ labels๊ฐ€ ์žˆ๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๋ชจ๋“  ๋ชจ๋ธ์€ ๋‹ค๋ฅด์ง€๋งŒ, ๋‹ค๋ฅธ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•œ์ ์„ ๊ณต์œ ํ•œ๋‹ค
= ๋Œ€๋ถ€๋ถ„์˜ ๋ชจ๋ธ์€ ๋™์ผํ•œ ์ž…๋ ฅ์„ ์‚ฌ์šฉํ•œ๋‹ค!

โˆ™Input IDs

The input IDs are often the only required parameters to be passed to the model as input.
They are token indices: numerical representations of the tokens building the sequence (sentence).
Each tokenizer works differently, but the underlying mechanism remains the same.

ex)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"


The tokenizer splits the sequence (sentence) into tokens that exist in the tokenizer vocabulary:

tokenized_sequence = tokenizer.tokenize(sequence)


token์€ word๋‚˜ subword ๋‘˜์ค‘ ํ•˜๋‚˜์ด๋‹ค:

print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA" and "M".
# How do we indicate that these tokens are parts of the same word rather than separate words?
# --> a double-hash (##) prefix is added in front of "RA" and "M".


inputs = tokenizer(sequence)


The tokens can then be converted into IDs the model understands.
For the model to work on them internally, they must be returned as a dictionary with input_ids as the key and the ID values as the value:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

๋˜ํ•œ, ๋ชจ๋ธ์— ๋”ฐ๋ผ์„œ ์ž๋™์œผ๋กœ "special token"์„ ์ถ”๊ฐ€ํ•˜๋Š”๋ฐ, 
์—ฌ๊ธฐ์—๋Š” ๋ชจ๋ธ์ด ๊ฐ€๋” ์‚ฌ์šฉํ•˜๋Š” "special IDs"๊ฐ€ ์ถ”๊ฐ€๋œ๋‹ค.

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# Output: [CLS] A Titan RTX has 24GB of VRAM [SEP]
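The ## convention makes it mechanical to merge subwords back into words; a minimal sketch:

```python
def merge_wordpieces(tokens):
    # Join '##'-prefixed continuation pieces onto the previous token.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

pieces = ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
print(merge_wordpieces(pieces))  # ['A', 'Titan', 'RTX', 'has', '24GB', 'of', 'VRAM']
```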





โˆ™Attention Mask

The attention mask is an optional argument used when batching sequences together;
it indicates which tokens the model should attend to and which it should not.

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# Output: (8, 19)
As you can see, the encoded lengths differ, so they cannot be packed into a single tensor.
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# Output: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# Output: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention_mask lives under the "attention_mask" key of the dictionary the tokenizer returns.
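What padding=True produced above can be rebuilt by hand; a sketch using 0 as the pad id (as BERT does):

```python
def pad_batch(sequences, pad_id=0):
    # Pad id sequences to a common length and build the matching attention mask:
    # 1 over real tokens, 0 over padding.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```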


โˆ™Token Types IDs

์–ด๋–ค ๋ชจ๋ธ์˜ ๋ชฉ์ ์€ classification์ด๋‚˜ QA์ด๋‹ค.
์ด๋Ÿฐ ๋ชจ๋ธ์€ 2๊ฐœ์˜ "๋‹ค๋ฅธ ๋ชฉ์ ์„ ๋‹จ์ผ input_ids"ํ•ญ๋ชฉ์œผ๋กœ ๊ฒฐํ•ฉํ•ด์•ผํ•œ๋‹ค.
= [CLS], [SEP] ๋“ฑ์˜ ํŠน์ˆ˜ํ† ํฐ์„ ์ด์šฉํ•ด ์ˆ˜ํ–‰๋จ.

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# Output: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
In the example above, the two sequences are passed to the tokenizer as two arguments, and it automatically builds the sentence shown.
This is handy for knowing where the sequence that follows another sequence begins.

However, some models also use token_type_ids, and this mask is returned under the token_type_ids key.
encoded_dict['token_type_ids']
# Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 

You can see that the context used for the question is all 0s,
while the sequence corresponding to the question is all 1s.


โˆ™Position IDs

RNN: each token's position is built in.
Transformer: does not recognize each token's position. ❌


∴ position_ids is an optional parameter the model uses to identify each token's position in the list.

If no position_ids are passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.




โˆ™Labels 

Labels are an optional argument that can be passed so the model computes the loss itself.
That is, the labels should be the model's expected predictions: it uses the standard loss to compute the loss between its predictions and the expected values (the labels).


The labels differ according to the model head:

  • AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), with each value of the batch corresponding to the expected label of the entire sequence.

  • AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), with each value corresponding to the expected label of each individual token.

  • AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), with each value corresponding to the expected label of each individual token: the labels are the masked token_ids, and the rest are values to be ignored (usually -100).

  • AutoModelForConditionalGeneration: the model expects a tensor of dimension (batch_size, tgt_seq_length), with each value representing the target sequence associated with each input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See each model's documentation for details on its specific labels.

Base models (BertModel, etc.) do not accept labels; as bare transformer models they simply output features.




โˆ™ Decoder input IDs

์ด ์ž…๋ ฅ์€ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๋ชจ๋ธ์— ํŠนํ™”๋˜์–ด ์žˆ์œผ๋ฉฐ, ๋””์ฝ”๋”์— ์ž…๋ ฅ๋  ์ž…๋ ฅ ID๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
์ด๋Ÿฌํ•œ ์ž…๋ ฅ์€ ๋ฒˆ์—ญ ๋˜๋Š” ์š”์•ฝ๊ณผ ๊ฐ™์€ ์‹œํ€€์Šค-ํˆฌ-์‹œํ€€์Šค ์ž‘์—…์— ์‚ฌ์šฉ๋˜๋ฉฐ, ๋ณดํ†ต ๊ฐ ๋ชจ๋ธ์— ํŠน์ •ํ•œ ๋ฐฉ์‹์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

๋Œ€๋ถ€๋ถ„์˜ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๋ชจ๋ธ(BART, T5)์€ ๋ ˆ์ด๋ธ”์—์„œ ๋””์ฝ”๋” ์ž…๋ ฅ ID๋ฅผ ์ž์ฒด์ ์œผ๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
์ด๋Ÿฌํ•œ ๋ชจ๋ธ์—์„œ๋Š” ๋ ˆ์ด๋ธ”์„ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ด ํ›ˆ๋ จ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ์„ ํ˜ธ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

์‹œํ€€์Šค-ํˆฌ-์‹œํ€€์Šค ํ›ˆ๋ จ์„ ์œ„ํ•œ ์ด๋Ÿฌํ•œ ์ž…๋ ฅ ID๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ™•์ธํ•˜๋ ค๋ฉด ๊ฐ ๋ชจ๋ธ์˜ ๋ฌธ์„œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.



โˆ™Feed Forward Chunking

ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ฐ ์ž”์ฐจ ์–ดํ…์…˜ ๋ธ”๋ก์—์„œ ์…€ํ”„ ์–ดํ…์…˜ ๋ ˆ์ด์–ด๋Š” ๋ณดํ†ต 2๊ฐœ์˜ ํ”ผ๋“œ ํฌ์›Œ๋“œ ๋ ˆ์ด์–ด ๋‹ค์Œ์— ์œ„์น˜ํ•ฉ๋‹ˆ๋‹ค.
ํ”ผ๋“œ ํฌ์›Œ๋“œ ๋ ˆ์ด์–ด์˜ ์ค‘๊ฐ„ ์ž„๋ฒ ๋”ฉ ํฌ๊ธฐ๋Š” ์ข…์ข… ๋ชจ๋ธ์˜ ์ˆจ๊ฒจ์ง„ ํฌ๊ธฐ๋ณด๋‹ค ํฝ๋‹ˆ๋‹ค(์˜ˆ: bert-base-uncased).

ํฌ๊ธฐ [batch_size, sequence_length]์˜ ์ž…๋ ฅ์— ๋Œ€ํ•ด ์ค‘๊ฐ„ ํ”ผ๋“œ ํฌ์›Œ๋“œ ์ž„๋ฒ ๋”ฉ์„ ์ €์žฅํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ฉ”๋ชจ๋ฆฌ [batch_size, sequence_length, config.intermediate_size]๋Š” ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์˜ ํฐ ๋ถ€๋ถ„์„ ์ฐจ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Reformer: The Efficient Transformer์˜ ์ €์ž๋“ค์€ ๊ณ„์‚ฐ์ด sequence_length ์ฐจ์›๊ณผ ๋…๋ฆฝ์ ์ด๋ฏ€๋กœ ๋‘ ํ”ผ๋“œ ํฌ์›Œ๋“œ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ ์ž„๋ฒ ๋”ฉ [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ณ  n = sequence_length์™€ ํ•จ๊ป˜ [batch_size, sequence_length, config.hidden_size]๋กœ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ์ด ์ˆ˜ํ•™์ ์œผ๋กœ ๋™์ผํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๋Š” ๋Œ€์‹  ๊ณ„์‚ฐ ์‹œ๊ฐ„์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๊ฑฐ๋ž˜๋ฅผ ํ•˜์ง€๋งŒ, ์ˆ˜ํ•™์ ์œผ๋กœ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

apply_chunking_to_forward() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ, chunk_size๋Š” ๋ณ‘๋ ฌ๋กœ ๊ณ„์‚ฐ๋˜๋Š” ์ถœ๋ ฅ ์ž„๋ฒ ๋”ฉ์˜ ์ˆ˜๋ฅผ ์ •์˜ํ•˜๋ฉฐ, ๋ฉ”๋ชจ๋ฆฌ์™€ ์‹œ๊ฐ„ ๋ณต์žก์„ฑ ๊ฐ„์˜ ๊ฑฐ๋ž˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. chunk_size๊ฐ€ 0์œผ๋กœ ์„ค์ •๋˜๋ฉด ํ”ผ๋“œ ํฌ์›Œ๋“œ ์ฒญํ‚น์€ ์ˆ˜ํ–‰๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
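The idea can be sketched with plain lists: because the feed-forward is position-wise, applying it chunk by chunk and concatenating gives the same result (a toy function stands in for the real FFN):

```python
def chunked_forward(forward_fn, sequence, chunk_size):
    # Apply forward_fn to chunk_size positions at a time; chunk_size=0 disables chunking.
    if chunk_size == 0:
        return forward_fn(sequence)
    out = []
    for start in range(0, len(sequence), chunk_size):
        out.extend(forward_fn(sequence[start:start + chunk_size]))
    return out

ffn = lambda xs: [2 * x + 1 for x in xs]  # toy position-wise feed-forward
seq = [0, 1, 2, 3, 4]
print(chunked_forward(ffn, seq, chunk_size=2))  # [1, 3, 5, 7, 9], same as chunk_size=0
```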

 

 


Optimization

AdamW

There's even the half-joking maxim "no questions, just use Adam!", as if it were a universal optimizer,
but in quite a few CV tasks SGD still performs better.
The AdamW paper explains, from the viewpoint of L2 regularization and weight decay, why Adam generalizes worse than SGD.
(figure: test error for different initial decay rates and learning rates)
L2 regularization: keeps weights from growing abnormally large (larger weights make the loss larger).
= the model learns weights that minimize the original loss while keeping the weights from growing too large.

Weight decay: at each weight update, shrink the previous weights by a fixed ratio to prevent overfitting.

SGD: L2 = weight_decay
Adam: L2 ≠ weight_decay (because Adam uses adaptive learning rates, its weight update rule differs from SGD's.)
∴ So when optimizing with Adam on a loss that includes L2 regularization, the regularization effect is weakened (the effective decay rate gets smaller).
The authors fix this by adding a weight decay term directly to the weight update rule, rather than relying on the decay effect of L2 regularization. Because this weight decay is separated from L2 regularization, it is called decoupled weight decay.

The SGDW and AdamW algorithms:
There is an η we have not explained yet:
it is the learning-rate schedule multiplier that shrinks the learning rate by a fixed ratio at every weight update.

Without the parts highlighted in green, this is identical to applying SGD and Adam to a loss containing L2 regularization.
Adding the green parts directly to the weight update rule, however, yields the weight decay effect.
optimizer = AdamW(model.parameters(), lr=1e-3, eps=1e-6, weight_decay=0.0)

 

cf) model.parameters() returns the weights and biases.
Now let's examine the formulas above through the GitHub code:
class AdamW(Optimizer):
    """
    Implements Adam algorithm with weight decay fix as introduced in [Decoupled Weight Decay
    Regularization](https://arxiv.org/abs/1711.05101).

    Parameters:
        params (`Iterable[nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (`float`, *optional*, defaults to 0.001):
            The learning rate to use.
        betas (`Tuple[float,float]`, *optional*, defaults to `(0.9, 0.999)`):
            Adam's betas parameters (b1, b2).
        eps (`float`, *optional*, defaults to 1e-06):
            Adam's epsilon for numerical stability.
        weight_decay (`float`, *optional*, defaults to 0.0):
            Decoupled weight decay to apply.
        correct_bias (`bool`, *optional*, defaults to `True`):
            Whether or not to correct bias in Adam (for instance, in Bert TF repository they use `False`).
        no_deprecation_warning (`bool`, *optional*, defaults to `False`):
            A flag used to disable the deprecation warning (set to `True` to disable the warning).
    """

    def __init__(
        self,
        params: Iterable[nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
        no_deprecation_warning: bool = False,
    ):
        if not no_deprecation_warning:
            warnings.warn(
                "This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch"
                " implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this"
                " warning",
                FutureWarning,
            )
        require_version("torch>=1.5.0")  # add_ with alpha
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr} - should be >= 0.0")
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[0]} - should be in [0.0, 1.0)")
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[1]} - should be in [0.0, 1.0)")
        if not 0.0 <= eps:
            raise ValueError(f"Invalid epsilon value: {eps} - should be >= 0.0")
        defaults = {"lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay, "correct_bias": correct_bias}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure: Callable = None):
        """
        Performs a single optimization step.

        Arguments:
            closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Decay the first and second moment running average coefficient
                # In-place operations to update the averages at the same time
                exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
                denom = exp_avg_sq.sqrt().add_(group["eps"])

                step_size = group["lr"]
                if group["correct_bias"]:  # No bias correction for Bert
                    bias_correction1 = 1.0 - beta1 ** state["step"]
                    bias_correction2 = 1.0 - beta2 ** state["step"]
                    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

                p.addcdiv_(exp_avg, denom, value=-step_size)

                # Just adding the square of the weights to the loss function is *not*
                # the correct way of using L2 regularization/weight decay with Adam,
                # since that will interact with the m and v parameters in strange ways.
                #
                # Instead we want to decay the weights in a manner that doesn't interact
                # with the m/v parameters. This is equivalent to adding the square
                # of the weights to the loss with plain (non-momentum) SGD.
                # Add weight decay at the end (fixed version)
                if group["weight_decay"] > 0.0:
                    p.add_(p, alpha=(-group["lr"] * group["weight_decay"]))

        return loss
cf) optimizer์˜ state_dict()์˜ ํ˜•ํƒœ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค:
{
                'state': {
                    0: {'momentum_buffer': tensor(...), ...},
                    1: {'momentum_buffer': tensor(...), ...},
                    2: {'momentum_buffer': tensor(...), ...},
                    3: {'momentum_buffer': tensor(...), ...}
                },
                'param_groups': [
                    {
                        'lr': 0.01,
                        'weight_decay': 0,
                        ...
                        'params': [0]
                    },
                    {
                        'lr': 0.001,
                        'weight_decay': 0.5,
                        ...
                        'params': [1, 2, 3]
                    }
                ]
            }
Putting this together: AdamW inherits from the Optimizer class, and,
as the state_dict shape shows, if len(state) == 0 means no state has been initialized yet.
exp_avg is m and exp_avg_sq is v_t, and the final parameter update happens at p.addcdiv_ and in the if group["weight_decay"] branch.

 


LR Schedules & Learning Rate Annealing

LR schedule: change the lr during training according to a predefined schedule.

What sets it apart is that the learning rate can also be increased during training!
Warmup restarts, as in the figure, give the optimizer a chance to escape local minima.


LR annealing: often used interchangeably with lr schedule, but refers to decreasing the lr monotonically over iterations.
Intuitively, a high learning rate at the start searches aggressively for a good convergence basin,
and a low learning rate at the end lets the model settle precisely onto the convergence point.

 


Model Outputs

ModelOutput

๋ชจ๋“  ๋ชจ๋ธ์€ ModelOutput์˜ subclass์˜ instance์ถœ๋ ฅ์„ ๊ฐ–๋Š”๋‹ค.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # batch size 1
outputs = model(**inputs, labels=labels)

# SequenceClassifierOutput(loss=tensor(0.4267, grad_fn=<NllLossBackward0>), 
#                           logits=tensor([[-0.0658,  0.5650]], grad_fn=<AddmmBackward0>), 
#                           hidden_states=None, attentions=None)
outputs๊ฐ์ฒด๋Š” ํ•„ํžˆ loss์™€ logits๋ฅผ ๊ฐ–๊ธฐ์— (outputs.loss, outputs.logits) ํŠœํ”Œ์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

Cf)
For CausalLM:
loss: language modeling loss (for next-token prediction).

logits: prediction scores of the LM head (scores for each vocabulary token before softmax)
= raw prediction values, not bounded to a specific range

For a transformer to output a word, it goes through: project linearly -> apply softmax.
Here, the LM head is used in fine-tuning rather than pre-training.
The LM head is the part that takes the model's output hidden_state and performs the prediction.
ex) BERT
from transformers import BertModel, BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
print(f'logits: {logits}') # `torch.FloatTensor` of shape `(batch_size, sequence_length, vocab_size)

# Check the prediction for the [MASK] token.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f'masked_index: {masked_index}') # `torch.LongTensor` of shape `(1,)

MASK_token = logits[0, masked_index] # take the [MASK] token from the first sentence in the batch.
print(f'MASK_Token: {MASK_token}')

predicted_token_id = MASK_token.argmax(axis=-1) # returns the index of the largest value along the given dimension = the token_id the model predicted at that position
print(f'predicted_token_id: {predicted_token_id}')


predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)  # prints paris


# logits: tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
#          [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7618],
#          [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
#          ...,
#          [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
#          [-11.6598, -11.4274, -11.9267,  ...,  -9.8772, -10.2103,  -4.7594],
#          [-11.7267, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]],
#        grad_fn=<ViewBackward0>)
# masked_index: tensor([6])
# MASK_Token: tensor([[-3.7861, -3.8572, -3.5644,  ..., -2.5593, -3.1093, -4.3820]],
#        grad_fn=<IndexBackward0>)
# predicted_token_id: tensor([3000])
# paris


cf) For reference, the following confirms that the index returned by argmax is an index into the vocabulary.

vocab = tokenizer.get_vocab()  # token -> index mapping
for word, idx in list(vocab.items())[:5]:  # print the first 5 vocabulary entries
    print(f"{word}: {idx}")
for word, idx in list(vocab.items())[2990:3010]:  # print vocabulary entries 2990-3010
    print(f"{word}: {idx}")
    
# [PAD]: 0
# [unused0]: 1
# [unused1]: 2
# [unused2]: 3
# [unused3]: 4
# jack: 2990
# fall: 2991
# raised: 2992
# itself: 2993
# stay: 2994
# true: 2995
# studio: 2996
# 1988: 2997
# sports: 2998
# replaced: 2999
# paris: 3000
# systems: 3001
# saint: 3002
# leader: 3003
# theatre: 3004
# whose: 3005
# market: 3006
# capital: 3007
# parents: 3008
# spanish: 3009

 


Trainer

Trainer

Trainerํด๋ž˜์Šค๋Š” ๐Ÿค— Transformers ๋ชจ๋ธ์— ์ตœ์ ํ™”๋˜์–ด ์žˆ๋‹ค
= ๋ชจ๋ธ์ด ํ•ญ์ƒ tuple(= ์ฒซ์š”์†Œ๋กœ loss๋ฐ˜ํ™˜) , ModelOutput์˜ subclass๋ฅผ ๋ฐ˜ํ™˜ํ•ด์•ผํ•จ์„ ์˜๋ฏธ
= labels์ธ์ž๊ฐ€ ์ œ๊ณต๋˜๋ฉด Loss๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Œ.

Trainer๋Š” TrainingArguments๋กœ ํ•„์š”์ธ์ž๋ฅผ ์ „๋‹ฌํ•ด์ฃผ๋ฉด, ์‚ฌ์šฉ์ž๊ฐ€ ์ง์ ‘ train_loop์ž‘์„ฑํ•  ํ•„์š”์—†์ด ํ•™์Šต์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ๋‹ค.
๋˜ํ•œ, TRL ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ SFTTrainer์˜ ๊ฒฝ์šฐ, ์ด Trainerํด๋ž˜์Šค๋ฅผ ๊ฐ์‹ธ๊ณ  ์žˆ์œผ๋ฉฐ, LoRA, Quantizing๊ณผ DeepSpeed ๋“ฑ์˜ ๊ธฐ๋Šฅ์„ ํ†ตํ•ด ์–ด๋–ค ๋ชจ๋ธ ํฌ๊ธฐ์—๋„ ํšจ์œจ์ ์ธ ํ™•์žฅ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

๋จผ์ €, ์‹œ์ž‘์— ์•ž์„œ ๋ถ„์‚ฐํ™˜๊ฒฝ์„ ์œ„ํ•ด์„œ๋Š” Accelerate๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•ด์•ผํ•œ๋‹ค!
pip install accelerate
pip install accelerate --upgrade

Basic Usage


Checkpoints


Customizing


Callbacks & Logging


Accelerate & Trainer


TrainingArguments

Reference)
output_dir (str): The output directory where the model predictions and checkpoints will be written.
eval_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "no"): The evaluation strategy to adopt during training. Possible values:
•	"no": no evaluation is done during training.
•	"steps": evaluation is done (and logged) every eval_steps.
•	"epoch": evaluation is done at the end of each epoch.
per_device_train_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
gradient_accumulation_steps (int, optional, defaults to 1): Number of update steps to accumulate the gradients for, before performing a backward/update pass.
eval_accumulation_steps (int, optional): Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on the GPU/NPU/TPU before being moved to the CPU (faster but requires more memory).
learning_rate (float, optional, defaults to 5e-5): The initial learning rate for the [AdamW] optimizer.
weight_decay (float, optional, defaults to 0): The weight decay to apply to all layers (except all bias and LayerNorm weights) in the [AdamW] optimizer.
max_grad_norm (float, optional, defaults to 1.0): Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0): Total number of training epochs to perform (if not an integer, performs the decimal-part percentage of the last epoch before stopping training).
max_steps (int, optional, defaults to -1): If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset until max_steps is reached.
eval_steps (int or float, optional): Number of update steps between two evaluations if eval_strategy="steps". Defaults to the same value as logging_steps if not set. Should be an integer or a float in range [0,1); if smaller than 1, it is interpreted as a ratio of the total training steps.
lr_scheduler_type (str or [SchedulerType], optional, defaults to "linear"): The scheduler type to use. See the documentation of [SchedulerType] for all possible values.
lr_scheduler_kwargs ('dict', optional, defaults to {}): Extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0): Ratio of the total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0): Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
logging_dir (str, optional): TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
logging_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The logging strategy to adopt during training. Possible values:
•	"no": no logging is done during training.
•	"epoch": logging is done at the end of each epoch.
•	"steps": logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False): Whether to log the first global_step or not.
logging_steps (int or float, optional, defaults to 500): Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1); if smaller than 1, it is interpreted as a ratio of the total training steps.
run_name (str, optional, defaults to output_dir): A descriptor for the run, typically used for wandb and mlflow logging. If not specified, it is the same as output_dir.
save_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The checkpoint save strategy to adopt during training. Possible values:
•	"no": no saving is done during training.
•	"epoch": saving is done at the end of each epoch.
•	"steps": saving is done every save_steps. When "epoch" or "steps" is chosen, saving is also always performed at the very end of training.
save_total_limit (int, optional): If a value is passed, limits the total number of checkpoints, deleting the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model is always retained in addition to the most recent ones.
For example, with save_total_limit=5 and load_best_model_at_end, the last four checkpoints are always retained alongside the best model; with save_total_limit=1 and load_best_model_at_end, two checkpoints can be saved if the last checkpoint and the best checkpoint differ.
save_safetensors (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ True): state_dict๋ฅผ ์ €์žฅํ•˜๊ณ  ๋กœ๋“œํ•  ๋•Œ ๊ธฐ๋ณธ torch.load ๋ฐ torch.save ๋Œ€์‹  safetensors๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
save_on_each_node (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): ๋ฉ€ํ‹ฐ๋…ธ๋“œ ๋ถ„์‚ฐ ํ›ˆ๋ จ์„ ์ˆ˜ํ–‰ํ•  ๋•Œ, ๋ชจ๋ธ๊ณผ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๊ฐ ๋…ธ๋“œ์— ์ €์žฅํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ฉ”์ธ ๋…ธ๋“œ์—๋งŒ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.
seed (int, optional, ๊ธฐ๋ณธ๊ฐ’์€ 42): ํ›ˆ๋ จ ์‹œ์ž‘ ์‹œ ์„ค์ •๋  ๋žœ๋ค ์‹œ๋“œ์ž…๋‹ˆ๋‹ค. ์‹คํ–‰ ๊ฐ„ ์ผ๊ด€์„ฑ์„ ๋ณด์žฅํ•˜๋ ค๋ฉด ๋ชจ๋ธ์— ๋ฌด์ž‘์œ„๋กœ ์ดˆ๊ธฐํ™”๋œ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ [~Trainer.model_init] ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ์ธ์Šคํ„ด์Šคํ™”ํ•˜์„ธ์š”.
data_seed (int, optional): ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ๋Ÿฌ์— ์‚ฌ์šฉํ•  ๋žœ๋ค ์‹œ๋“œ์ž…๋‹ˆ๋‹ค. ์„ค์ •๋˜์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ๋ง์„ ์œ„ํ•œ Random sampler๋Š” seed์™€ ๋™์ผํ•œ ์‹œ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฐ’์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ชจ๋ธ ์‹œ๋“œ์™€๋Š” ๋…๋ฆฝ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ๋ง์˜ ์ผ๊ด€์„ฑ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
bf16 (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): 32๋น„ํŠธ ํ›ˆ๋ จ ๋Œ€์‹  bf16 16๋น„ํŠธ(ํ˜ผํ•ฉ) ์ •๋ฐ€๋„ ํ›ˆ๋ จ์„ ์‚ฌ์šฉํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. Ampere ์ด์ƒ NVIDIA ์•„ํ‚คํ…์ฒ˜ ๋˜๋Š” CPU(์‚ฌ์šฉ_cpu) ๋˜๋Š” Ascend NPU๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์‹คํ—˜์  API์ด๋ฉฐ ๋ณ€๊ฒฝ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
fp16 (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): 32๋น„ํŠธ ํ›ˆ๋ จ ๋Œ€์‹  fp16 16๋น„ํŠธ(ํ˜ผํ•ฉ) ์ •๋ฐ€๋„ ํ›ˆ๋ จ์„ ์‚ฌ์šฉํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
half_precision_backend (str, optional, ๊ธฐ๋ณธ๊ฐ’์€ "auto"): ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ํ›ˆ๋ จ์„ ์œ„ํ•œ ๋ฐฑ์—”๋“œ์ž…๋‹ˆ๋‹ค. "auto", "apex", "cpu_amp" ์ค‘ ํ•˜๋‚˜์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค. "auto"๋Š” ๊ฐ์ง€๋œ PyTorch ๋ฒ„์ „์— ๋”ฐ๋ผ CPU/CUDA AMP ๋˜๋Š” APEX๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, ๋‹ค๋ฅธ ์„ ํƒ์ง€๋Š” ์š”์ฒญ๋œ ๋ฐฑ์—”๋“œ๋ฅผ ๊ฐ•์ œ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
bf16_full_eval (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): 32๋น„ํŠธ ๋Œ€์‹  ์ „์ฒด bfloat16 ํ‰๊ฐ€๋ฅผ ์‚ฌ์šฉํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด๋Š” ๋” ๋น ๋ฅด๊ณ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•˜์ง€๋งŒ ๋ฉ”ํŠธ๋ฆญ ๊ฐ’์— ์•…์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์‹คํ—˜์  API์ด๋ฉฐ ๋ณ€๊ฒฝ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
fp16_full_eval (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): 32๋น„ํŠธ ๋Œ€์‹  ์ „์ฒด float16 ํ‰๊ฐ€๋ฅผ ์‚ฌ์šฉํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด๋Š” ๋” ๋น ๋ฅด๊ณ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•˜์ง€๋งŒ ๋ฉ”ํŠธ๋ฆญ ๊ฐ’์— ์•…์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
tf32 (bool, optional): Ampere ๋ฐ ์ตœ์‹  GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ์‚ฌ์šฉํ•  TF32 ๋ชจ๋“œ๋ฅผ ํ™œ์„ฑํ™”ํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ torch.backends.cuda.matmul.allow_tf32์˜ ๊ธฐ๋ณธ๊ฐ’์— ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ TF32 ๋ฌธ์„œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”. ์ด๋Š” ์‹คํ—˜์  API์ด๋ฉฐ ๋ณ€๊ฒฝ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
local_rank (int, optional, ๊ธฐ๋ณธ๊ฐ’์€ -1): ๋ถ„์‚ฐ ํ›ˆ๋ จ ์ค‘ ํ”„๋กœ์„ธ์Šค์˜ ์ˆœ์œ„๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
ddp_backend (str, optional): ๋ถ„์‚ฐ ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•  ๋ฐฑ์—”๋“œ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. "nccl", "mpi", "ccl", "gloo", "hccl" ์ค‘ ํ•˜๋‚˜์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค.
dataloader_drop_last (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ๊ธธ์ด๊ฐ€ ๋ฐฐ์น˜ ํฌ๊ธฐ๋กœ ๋‚˜๋ˆ„์–ด๋–จ์–ด์ง€์ง€ ์•Š๋Š” ๊ฒฝ์šฐ ๋งˆ์ง€๋ง‰ ๋ถˆ์™„์ „ํ•œ ๋ฐฐ์น˜๋ฅผ ์‚ญ์ œํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
dataloader_num_workers (int, optional, ๊ธฐ๋ณธ๊ฐ’์€ 0): ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์— ์‚ฌ์šฉํ•  ํ•˜์œ„ ํ”„๋กœ์„ธ์Šค ์ˆ˜์ž…๋‹ˆ๋‹ค(PyTorch ์ „์šฉ). 0์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฉ”์ธ ํ”„๋กœ์„ธ์Šค์—์„œ ๋กœ๋“œ๋จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
remove_unused_columns (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ True): ๋ชจ๋ธ์˜ forward ๋ฉ”์„œ๋“œ์—์„œ ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š” ์—ด์„ ์ž๋™์œผ๋กœ ์ œ๊ฑฐํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
label_names (List[str], optional): input dictionary์—์„œ label์— ํ•ด๋‹นํ•˜๋Š” ํ‚ค์˜ ๋ชฉ๋ก์ž…๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ ๋ชจ๋ธ์ด ์‚ฌ์šฉํ•˜๋Š” ๋ ˆ์ด๋ธ” ์ธ์ˆ˜์˜ ๋ชฉ๋ก์ž…๋‹ˆ๋‹ค.
load_best_model_at_end (bool, optional, ๊ธฐ๋ณธ๊ฐ’์€ False): ํ›ˆ๋ จ์ด ๋๋‚  ๋•Œ ์ตœ์ƒ์˜ ๋ชจ๋ธ์„ ๋กœ๋“œํ• ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด ์˜ต์…˜์ด ํ™œ์„ฑํ™”๋˜๋ฉด, ์ตœ์ƒ์˜ ์ฒดํฌํฌ์ธํŠธ๊ฐ€ ํ•ญ์ƒ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ save_total_limit๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.
<Tip>
            When set to `True`, the parameters `save_strategy` needs to be the same as `eval_strategy`, and in
            the case it is "steps", `save_steps` must be a round multiple of `eval_steps`.
</Tip>
metric_for_best_model (str, optional): Used with load_best_model_at_end to specify the metric for comparing two models. Must be the name of a metric returned by evaluation. Defaults to "loss" if unspecified, in which case eval_loss is used when load_best_model_at_end=True. Setting this value sets the default of greater_is_better to True; set that to False if your metric is better when lower.
greater_is_better (bool, optional): Used with load_best_model_at_end and metric_for_best_model to specify whether a better model should have a higher metric. The defaults are:
•	True if metric_for_best_model is set to a value that does not end in "loss".
•	False if metric_for_best_model is unset or set to a value ending in "loss".

fsdp (bool, str, or list of [~trainer_utils.FSDPOption], optional, defaults to ''): Use PyTorch Fully Sharded Data Parallel training (distributed training only).
fsdp_config (str or dict, optional): Configuration to use with fsdp (PyTorch Fully Sharded Data Parallel training). The value can be the location of an fsdp json config file (e.g., fsdp_config.json) or an already-loaded json file as a dict.
deepspeed (str or dict, optional): Use DeepSpeed. This is an experimental feature and its API may change. The value can be the location of a DeepSpeed json config file (e.g., ds_config.json) or an already-loaded json file as a dict.
accelerator_config (str, dict, or AcceleratorConfig, optional): Configuration to use with the internal Accelerator implementation. The value can be the location of an accelerator json config file (e.g., accelerator_config.json), an already-loaded json file as a dict, or an instance of [~trainer_pt_utils.AcceleratorConfig].
label_smoothing_factor (float, optional, defaults to 0.0): The label smoothing factor to use. 0 means no label smoothing; other values change the one-hot encoded labels.
optim (str or [training_args.OptimizerNames], optional, defaults to "adamw_torch"): The optimizer to use, chosen among adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, and adafactor.
optim_args (str, optional): Optional arguments supplied to AnyPrecisionAdamW.
group_by_length (bool, optional, defaults to False): Whether to group samples of roughly equal length in the training dataset (to minimize padding and improve efficiency). Only useful when applying dynamic padding.
report_to (str or List[str], optional, defaults to "all"): The list of integrations to report results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". "all" reports to all installed integrations; "none" reports to no integrations.
ddp_find_unused_parameters (bool, optional): When using distributed training, the value of the find_unused_parameters flag passed to DistributedDataParallel. Defaults to False when gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional): When using distributed training, the value of the bucket_cap_mb flag passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional): When using distributed training, the value of the broadcast_buffers flag passed to DistributedDataParallel. Defaults to False when gradient checkpointing is used, True otherwise.
dataloader_persistent_workers (bool, optional, defaults to False): If True, the data loader does not shut down the worker processes after the dataset has been consumed once, keeping the worker dataset instances alive. Can speed up training but increases RAM usage.
push_to_hub (bool, optional, defaults to False): Whether to push the model to the Hub every time it is saved. If enabled, output_dir becomes a git directory whose contents are pushed each time a save is triggered (depending on save_strategy). Calling [~Trainer.save_model] also triggers a push.
resume_from_checkpoint (str, optional): The path to a folder with a valid checkpoint for the model. This argument is not used directly by [Trainer]; it is intended for training/evaluation scripts instead. See the example scripts for details.
hub_model_id (str, optional): The name of the repository to keep in sync with the local output_dir. It can be a simple model ID, in which case the model is pushed to your namespace; otherwise it should be a full repository name (e.g., "user_name/model"). Defaults to user_name/output_dir_name, using the name of output_dir.
hub_strategy (str or [~trainer_utils.HubStrategy], optional, defaults to "every_save"): Defines the scope and timing of pushes to the Hub. Possible values:
•	"end": Push the model, its configuration, the tokenizer (if passed), and a draft model card at the end.
•	"every_save": Push the model, its configuration, the tokenizer (if passed), and a draft model card each time the model is saved. Pushes are asynchronous; if saves are very frequent, a new push is attempted once the previous one finishes. A last push is made with the final model at the end of training.
•	"checkpoint": Like "every_save", but the latest checkpoint is also pushed to a subfolder named last-checkpoint, allowing training to be resumed easily with trainer.train(resume_from_checkpoint="last-checkpoint").
•	"all_checkpoints": Like "checkpoint", but all checkpoints are pushed as they appear in the final repository (so the final repository contains one checkpoint folder per folder).
hub_token (str, optional): The token to use to push the model to the Hub. Defaults to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional, defaults to False): If True, the Hub repository is set to private.
hub_always_push (bool, optional, defaults to False): Unless this is True, checkpoint pushes are skipped when the previous push has not finished.
gradient_checkpointing (bool, optional, defaults to False): Whether to use gradient checkpointing to save memory, at the cost of a slower backward pass.
auto_find_batch_size (bool, optional, defaults to False): Whether to automatically find a batch size that fits in memory to avoid CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
ray_scope (str, optional, defaults to "last"): The scope to use when doing hyperparameter search with Ray. Defaults to "last": Ray uses the last checkpoint of all trials, compares them, and selects the best one. Other options exist; see the Ray documentation for details.
ddp_timeout (int, optional, defaults to 1800): The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts in distributed runs. See the PyTorch documentation for details.
torch_compile (bool, optional, defaults to False): Whether to compile the model with PyTorch 2.0 torch.compile, using the defaults of the torch.compile API. The defaults can be customized with the torch_compile_backend and torch_compile_mode arguments, but not all values are guaranteed to work. This flag and the whole compile API are experimental and may change in future releases.
torch_compile_backend (str, optional): The backend to use in torch.compile. Setting a value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
torch_compile_mode (str, optional): The mode to use in torch.compile. Setting a value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
split_batches (bool, optional): Whether the data loader splits the batches it produces across devices during distributed training. If True, the actual batch size used is identical across all kinds of distributed processes, but it must be an integer multiple of the number of processes in use.
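The interaction between warmup_ratio/warmup_steps and the default "linear" scheduler can be sketched in plain Python. This is a simplified model of the schedule for illustration, not the transformers implementation; the function name is ours:

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    # Linear warmup: learning rate climbs from 0 to base_lr over warmup_steps,
    # then decays linearly from base_lr down to 0 at total_steps.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# warmup_ratio=0.1 with 1000 total steps corresponds to warmup_steps=100
lr_mid_warmup = linear_schedule_lr(50, 1000, warmup_steps=100)   # halfway through warmup
lr_peak = linear_schedule_lr(100, 1000, warmup_steps=100)        # peak learning rate
```

Note that warmup_steps, when nonzero, overrides warmup_ratio, matching the parameter description above.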
DeepSpeed

trust_remote_code=True

์ค‘๊ตญ๋ชจ๋ธ์—์„œ ํ”ํžˆ๋ณด์ด๋Š” trust_remote_code=True ์„ค์ •, ๊ณผ์—ฐ ์ด๊ฑด ๋ญ˜๊นŒ?
์ด๋Š” "huggingface/transformers"์— Model Architecture๊ฐ€ ์•„์ง ์ถ”๊ฐ€๋˜์ง€ ์•Š์€๊ฒฝ์šฐ:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True, device='cuda')
"huggingface repo 'internlm/internlm-chat-7b'์—์„œ ๋ชจ๋ธ ์ฝ”๋“œ๋ฅผ ๋‹ค์šด๋กœ๋“œํ•˜๊ณ , ๊ฐ€์ค‘์น˜์™€ ํ•จ๊ป˜ ์‹คํ–‰ํ•œ๋‹ค"๋Š” ์˜๋ฏธ์ด๋‹ค.
๋งŒ์•ฝ ์ด ๊ฐ’์ด False๋ผ๋ฉด, ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” huggingface/transformers์— ๋‚ด์žฅ๋œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ๊ฐ€์ค‘์น˜๋งŒ ๋‹ค์šด๋กœ๋“œ
ํ•˜๋Š”๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค.


 

Collate: to gather and combine.

As one can infer from the name, the Data Collator plays the following role.

 

 

Data Collator

A Data Collator bundles a list of samples into a single training mini-batch of Tensors.

Default Data Collator

As shown below, train_dataset is grouped into mini-batches by data_collator and fed to the model for training:
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)





batch["input_ids"], batch["labels"]?

However, unlike the above, most Data Collator functions take the form below, where we run into the somewhat unfamiliar names input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples): 
        texts = []
        images = []
        for example in examples:
            image, question, answer = example 
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}] # <-- verified that the data flows in correctly up to this point.
            text = self.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text)
            images.append(image)

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
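The core of the labels logic above is: copy input_ids and replace padding positions with -100 so the loss ignores them. A minimal dependency-free sketch of that step, with illustrative token values:

```python
PAD_ID = 0           # assumed pad_token_id
IGNORE_INDEX = -100  # positions the loss function skips

input_ids = [101, 7592, 2088, 102, PAD_ID, PAD_ID]
labels = list(input_ids)  # labels start as a copy of input_ids
labels = [IGNORE_INDEX if t == PAD_ID else t for t in labels]
# pad positions in labels are now -100, so they contribute nothing to the loss
```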

So what exactly are batch["input_ids"] and batch["labels"]?

The data_collator described earlier takes the form shown above, and there too we can see inputs and labels.

Every model is different, but shares common ground with other models
= most models use the same inputs!

โˆ™Input IDs

Input IDs are often the "only required parameter" passed to the model as input.
Input IDs are token indices: numerical representations of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but the "underlying mechanism is the same".

ex)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"


The tokenizer splits the sequence (sentence) into tokens present in the tokenizer vocab:

tokenized_sequence = tokenizer.tokenize(sequence)


token์€ word๋‚˜ subword ๋‘˜์ค‘ ํ•˜๋‚˜์ด๋‹ค:

print(tokenized_sequence)
# ์ถœ๋ ฅ: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# ์˜ˆ๋ฅผ ๋“ค์–ด, "VRAM"์€ ๋ชจ๋ธ ์–ดํœ˜์— ์—†์–ด์„œ "V", "RA" ๋ฐ "M"์œผ๋กœ ๋ถ„ํ• ๋จ.
# ์ด๋Ÿฌํ•œ ํ† ํฐ์ด ๋ณ„๋„์˜ ๋‹จ์–ด๊ฐ€ ์•„๋‹ˆ๋ผ ๋™์ผํ•œ ๋‹จ์–ด์˜ ์ผ๋ถ€์ž„์„ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด์„œ๋Š”?
# --> "RA"์™€ "M" ์•ž์— ์ด์ค‘ํ•ด์‹œ(##) ์ ‘๋‘์‚ฌ๊ฐ€ ์ถ”๊ฐ€๋ฉ


inputs = tokenizer(sequence)


This converts the tokens into IDs the model can understand.
The tokenizer returns a "dictionary" with input_ids as the key and the ID values as the value, which is the form the model operates on:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

Also, depending on the model, "special tokens" are added automatically:
these are "special IDs" the model uses on occasion.

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# ์ถœ๋ ฅ: [CLS] A Titan RTX has 24GB of VRAM [SEP]





โˆ™Attention Mask

The Attention Mask is an optional argument used when batching sequences together; it indicates
"which tokens the model should attend to and which it should not".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# ์ถœ๋ ฅ: (8, 19)
As seen above, the encoded lengths differ, so the sequences "cannot be stacked into the same Tensor".
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# ์ถœ๋ ฅ: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# ์ถœ๋ ฅ: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention_mask lives under the "attention_mask" key of the dictionary returned by the tokenizer.
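What the tokenizer does with padding=True can be mimicked in a few lines: pad every sequence to the longest one and mark real tokens 1, padding 0. This is a toy sketch with illustrative token IDs; real tokenizers also handle truncation and special tokens:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the batch maximum and build the attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)             # right-pad with pad_id
        attention_mask.append([1] * len(s) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 8667, 102], [101, 8667, 1291, 1106, 102]])
```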


โˆ™Token Type IDs

Some models are intended for tasks such as sequence-pair classification or QA.
Such models need two sequences with "different roles combined into a single input_ids" entry.
= This is done with special tokens such as [CLS] and [SEP].

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# ์ถœ๋ ฅ: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
์œ„์˜ ์˜ˆ์ œ์—์„œ tokenizer๋ฅผ ์ด์šฉํ•ด 2๊ฐœ์˜ sequence๊ฐ€ 2๊ฐœ์˜ ์ธ์ˆ˜๋กœ ์ „๋‹ฌ๋˜์–ด ์ž๋™์œผ๋กœ ์œ„์™€๊ฐ™์€ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
์ด๋Š” seq์ดํ›„์— ๋‚˜์˜ค๋Š” seq์˜ ์‹œ์ž‘์œ„์น˜๋ฅผ ์•Œ๊ธฐ์—๋Š” ์ข‹๋‹ค.

๋‹ค๋งŒ, ๋‹ค๋ฅธ ๋ชจ๋ธ์€ token_types_ids๋„ ์‚ฌ์šฉํ•˜๋ฉฐ, token_type_ids๋กœ ์ด MASK๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.
encoded_dict['token_type_ids']
# ์ถœ๋ ฅ: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 

์งˆ๋ฌธ์— ์‚ฌ์šฉ๋˜๋Š” context๋Š” ๋ชจ๋‘ 0์œผ๋กœ, 
์งˆ๋ฌธ์— ํ•ด๋‹น๋˜๋Š” sequence๋Š” ๋ชจ๋‘ 1๋กœ ์„ค์ •๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.


โˆ™Position IDs

RNN: each token's position is built in.
Transformer: does not inherently track each token's position โŒ


∴ position_ids is an optional parameter the model uses to identify the position of each token in the list.

If position_ids is not passed to the model, the IDs are automatically created as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embedding, such as sinusoidal position embeddings or relative position embeddings.




โˆ™Labels

Labels is an optional argument that can be passed so the model computes the loss itself.
That is, labels should be the model's expected predictions: a standard loss is used to compute the loss between the predictions and the expected values (the labels).


The labels differ depending on the Model Head:

  • AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value of the batch corresponds to the expected label of the entire sequence.

  • AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value corresponds to the expected label of each individual token.

  • AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value corresponds to the expected label of each individual token: the labels are the masked token_ids, and the rest are values to be ignored (usually -100).

  • AutoModelForConditionalGeneration: the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value corresponds to the target sequence associated with each input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention masks internally, so these usually do not need to be provided. This does not apply to models using the Encoder-Decoder framework. See each model's documentation for details on its specific labels.

Base models (BertModel, etc.) do not accept labels; as plain transformer models they simply output features.




โˆ™ Decoder input IDs

This input is specific to encoder-decoder models and contains the input IDs that will be fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs on their own from the labels.
For such models, passing the labels is the preferred way of handling training.

See each model's documentation to check how it handles these input IDs for sequence-to-sequence training.



โˆ™Feed Forward Chunking

In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed-forward layers.
The intermediate embedding size of the feed-forward layers is often larger than the model's hidden size (e.g., for bert-base-uncased).

For an input of size [batch_size, sequence_length], the memory needed to store the intermediate feed-forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large fraction of memory use.

The authors of Reformer: The Efficient Transformer noticed that, since the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed-forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and concatenate them afterward to [batch_size, sequence_length, config.hidden_size] with n = sequence_length.

This trades increased computation time for reduced memory use, while yielding mathematically identical results.

For models using the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel and thus the trade-off between memory and time complexity. If chunk_size is set to 0, no feed-forward chunking is performed.
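The equivalence that chunking relies on is easy to verify: because a position-wise feed-forward layer acts on each position independently, applying it to slices of the sequence and concatenating gives the same result as one big application. A toy sketch with a scalar stand-in for the FFN:

```python
def ffn(x):
    # stand-in for a position-wise feed-forward layer (applied per position)
    return [2 * v + 1 for v in x]

def chunked_ffn(x, chunk_size):
    # process the sequence chunk by chunk and concatenate the outputs
    out = []
    for i in range(0, len(x), chunk_size):
        out.extend(ffn(x[i:i + chunk_size]))
    return out

seq = [0.5, -1.0, 3.0, 2.0, 4.5]
assert chunked_ffn(seq, 2) == ffn(seq)  # mathematically identical output
```

The chunk size only changes how much intermediate state exists at once, not the result.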
QLoRA hands-on with MLLMs (InternVL)

Step 1. Import the required libraries:

import os

import torch
import torch.nn as nn
import bitsandbytes as bnb
import transformers

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel, 
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
    pipeline,
    TrainingArguments,
)


Step 2. Load the model, then run prepare_model_for_kbit_training(model)

devices = [0]#[0, 3]
max_memory = {i: '49140MiB' for i in devices}

model_name = 'OpenGVLab/InternVL2-8B'


model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    cache_dir='/data/huggingface_models',
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory,
    quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4'
        ),
)

# ๋ชจ๋ธ ๊ตฌ์กฐ ์ถœ๋ ฅ
print(model)

# get_input_embeddings ๋ฉ”์„œ๋“œ๋ฅผ ๋ชจ๋ธ์— ์ถ”๊ฐ€
def get_input_embeddings(self):
    if hasattr(self, 'embed_tokens'):
        return self.embed_tokens
    elif hasattr(self, 'language_model') and hasattr(self.language_model.model, 'tok_embeddings'):
        return self.language_model.model.tok_embeddings
    else:
        raise NotImplementedError("The model does not have an attribute 'embed_tokens' or 'language_model.model.tok_embeddings'.")

model.get_input_embeddings = get_input_embeddings.__get__(model, type(model))

# Implement prepare_model_for_kbit_training directly
def prepare_model_for_kbit_training(model):
    for param in model.parameters():
        param.requires_grad = False  # disable gradient computation for all parameters

    if hasattr(model, 'model') and hasattr(model.model, 'tok_embeddings'):
        for param in model.model.tok_embeddings.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only
    elif hasattr(model, 'embed_tokens'):
        for param in model.embed_tokens.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only

    # Gradients can also be enabled for other specific layers if needed
    # Example:
    # if hasattr(model, 'some_other_layer'):
    #     for param in model.some_other_layer.parameters():
    #         param.requires_grad = True

    return model

model = prepare_model_for_kbit_training(model)


Step 3. Select the layers to attach QLoRA to:

def find_all_linear_names(model, train_mode):
    assert train_mode in ['lora', 'qlora']
    cls = bnb.nn.Linear4bit if train_mode == 'qlora' else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # skip modules that belong to the LM head
        lora_module_names.remove('lm_head')
    
    return list(lora_module_names)


config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=find_all_linear_names(model, 'qlora'),
    lora_dropout=0.05,
    bias="none",
    task_type="QUESTION_ANS" # CAUSAL_LM, FEATURE_EXTRACTION, QUESTION_ANS, SEQ_2_SEQ_LM, SEQ_CLS, TOKEN_CLS.
)

print(sorted(config.target_modules)) # ['1', 'output', 'w1', 'w2', 'w3', 'wo', 'wqkv']
config.target_modules.remove('1') # remove modules that belong to the LM head

model = get_peft_model(model, config)

Then proceed with training via a trainer.

Result after attaching QLoRA:
trainer types? Trainer vs SFTTrainer

โˆ™ Trainer

 - General-purpose training: used to train models from scratch on supervised learning tasks such as text classification, question answering, and summarization.
 - Highly customizable: provides many configuration options for fine-tuning hyperparameters, optimizers, schedulers, logging, metrics, and more.
 - Handles complex training workflows: supports gradient accumulation, early stopping, checkpoint saving, distributed training, and more.
 - Requires more data: generally needs larger datasets for effective training.



โˆ™ SFTTrainer

 - ์ง€๋„ ํ•™์Šต ๋ฏธ์„ธ ์กฐ์ • (SFT): ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ PLMs Fine-Tuning์— ์ตœ์ ํ™”.
 - ๊ฐ„๋‹จํ•œ ์ธํ„ฐํŽ˜์ด์Šค: ๋” ์ ์€ configuration์œผ๋กœ ๊ฐ„์†Œํ™”๋œ workflow๋ฅผ ์ œ๊ณต.
 - ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: PEFT์™€ ํŒจํ‚น ์ตœ์ ํ™”์™€ ๊ฐ™์€ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จ ์ค‘ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋น„๋ฅผ ์ค„์ž…๋‹ˆ๋‹ค.
 - ๋น ๋ฅธ ํ›ˆ๋ จ: ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹๊ณผ ์งง์€ ํ›ˆ๋ จ ์‹œ๊ฐ„์œผ๋กœ๋„ ์œ ์‚ฌํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.



โˆ™ Choosing between Trainer and SFTTrainer:

 - Use Trainer when:
You have a large dataset and need extensive customization of the training loop or a complex training workflow.
Data preprocessing and the data collator must be set up by the user, following standard preprocessing practice.

 - Use SFTTrainer when:
You have a PLM and a relatively small dataset, and want a simpler, faster fine-tuning experience with efficient memory usage.
PEFT is supported out of the box; settings such as `peft_config` make efficient fine-tuning easy to configure.
Data preprocessing and the data collator are already optimized for efficient fine-tuning.
Fields such as `dataset_text_field` make text data easy to handle.



Feature            | Trainer                   | SFTTrainer
Purpose            | General-purpose training  | Supervised fine-tuning of PLMs
Customization      | Highly customizable       | Simpler interface with fewer options
Training workflow  | Handles complex workflows | Streamlined workflow
Required data      | Large datasets            | Smaller datasets
Memory usage       | Higher                    | Lower with PEFT & packing optimization
Training speed     | Slower                    | Faster with smaller datasets
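The comparison above can be made concrete with a minimal SFTTrainer sketch. This is a hedged configuration sketch, not runnable as-is: `model`, `dataset`, and `config` are placeholders for the model, a text dataset, and the LoraConfig defined earlier, and in recent versions of trl the `dataset_text_field` setting has moved into `SFTConfig`.

```python
from trl import SFTTrainer
from transformers import TrainingArguments

# model/dataset/config are assumed to exist already (placeholders).
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,        # a datasets.Dataset with a "text" column
    dataset_text_field="text",    # which field holds the raw text
    peft_config=config,           # PEFT (e.g. the LoraConfig above) works out of the box
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2),
)
trainer.train()
```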

[PEFT]: Parameter Efficient Fine-Tuning


What is PEFT?

A technique that freezes most of a PLM's parameters โ„๏ธ and fine-tunes only a small number of them when adapting the model to a specific task.
PEFT preserves model performance while sharply reducing the number of trainable parameters.
It also carries a lower risk of catastrophic forgetting.
An approach promoted by ๐Ÿค—Huggingface, it is widely used for fine-tuning on downstream tasks.

What is catastrophic forgetting?

The phenomenon in which learning new information causes a model to forget part of the knowledge it had previously learned.



Main Concept

  • Reduced-Parameter Fine-tuning
    Freezes the vast majority of a pretrained LLM's parameters and fine-tunes only a small number of additional parameters.
    This selective fine-tuning sharply reduces computational requirements.
  • Overcoming Catastrophic Forgetting
    Catastrophic forgetting arises when an entire LLM is fine-tuned; PEFT helps mitigate it.
    With PEFT, the model can learn a new downstream task while preserving its pretrained knowledge.
  • Application Across Modalities
    PEFT extends beyond natural language processing (NLP) into other domains.
    It has been applied successfully in computer vision (CV), e.g. Stable Diffusion and LayoutLM,
    and in audio, e.g. Whisper and XLS-R.
  • Supported PEFT Methods
    The library supports a variety of PEFT methods.
    LoRA (Low-Rank Adaptation), prefix tuning, prompt tuning, and others are each designed for specific fine-tuning requirements and scenarios.


The output activations of the original (frozen) pretrained weights (left) are augmented by a low-rank adapter comprised of weight matrices A and B (right).
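Written out, the adapted layer computes W₀·x + (α/r)·B·(A·x), where W₀ is frozen and only A and B are trained. A minimal pure-Python sketch with toy dimensions (not a real model) shows why the adapter is a no-op at initialization, when B is zero:

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha=16, r=2):
    """W0: frozen (d_out x d_in); A: (r x d_in) down-projection;
    B: (d_out x r) up-projection, conventionally initialized to zero."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 identity
A  = [[0.5, 0.0], [0.0, 0.5]]   # toy r=2 adapter
B  = [[0.0, 0.0], [0.0, 0.0]]   # B starts at zero, so the adapter changes nothing yet
print(lora_forward(W0, A, B, [1.0, 2.0]))  # [1.0, 2.0]: identical to the frozen output
```

As B is updated during training, the low-rank term (α/r)·B·A·x gradually steers the frozen layer's output.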

 

[Q-LoRA]: Quantized-LoRA

What is Q-LoRA?

Presented at NeurIPS in May 2023, it combines quantization with LoRA, making it possible to "fine-tune a 65B model on a single A6000 GPU".
QLoRA is, in the end, ordinary LoRA with a new quantization scheme added.
As in LoRA, the base PLM's weights are frozen and only the LoRA adapter's weights are trainable; the difference is that the frozen PLM weights are quantized to 4 bits.
Accordingly, the main contribution of QLoRA is its quantization methodology.

์–‘์žํ™”๋ž€?

Converting weights and activation outputs into a smaller-bit representation.
That is, it trades away a small amount of information and precision
in exchange for "reduced storage and compute requirements, gaining efficiency"; in short, a model-compression technique.



How to Use in MLLMs...?

So how can this be applied to MLLMs? MLLMs come in many varieties, but taking VLMs as the simplest example:
Q-LoRA and LoRA are PEFT methods, so they apply equally to LLMs and MLLMs.
Using a VLM (vision encoder + LLM decoder) as the reference architecture:

  • To strengthen language ability, apply PEFT to the LLM only.
  • To strengthen visual ability, apply PEFT to the vision encoder only.
  • To strengthen both, apply PEFT to both the encoder and the decoder.
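In practice this targeting boils down to toggling `requires_grad` by module-name prefix. A framework-agnostic sketch (the parameter names and prefixes are hypothetical; real VLMs use names like `visual.` or `language_model.`):

```python
def set_trainable(named_params, prefixes):
    """Freeze everything, then unfreeze params whose name starts with a given prefix."""
    for name, param in named_params.items():
        param["requires_grad"] = any(name.startswith(p) for p in prefixes)

# Toy parameter dict standing in for model.named_parameters().
params = {
    "vision_encoder.block0.w": {"requires_grad": True},
    "llm_decoder.layer0.wqkv": {"requires_grad": True},
}
set_trainable(params, prefixes=["llm_decoder."])  # language-only PEFT
print([n for n, p in params.items() if p["requires_grad"]])
# ['llm_decoder.layer0.wqkv']
```

Swapping the prefix list to `["vision_encoder."]`, or listing both prefixes, gives the other two options above.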

Reference Code:

- A Definitive Guide to QLoRA: Fine-tuning Falcon-7b with PEFT (medium.com)
- Finetuning Llama2 with QLoRA — TorchTune documentation (pytorch.org)
- https://github.com/V2LLAIN/Transformers-Tutorials/blob/master/qlora_baseline.ipynb

What is DeepSpeed?

# finetune_qlora.sh

deepspeed ovis/train/train.py \
        --deepspeed scripts/zero2.json \
        ...


Of course, sticking to your own setup is fine, but since most users seem to use this approach, it is worth knowing, so let's take a look.


deepspeed...?

๋ชจ๋ธ์˜ training, inference์†๋„๋ฅผ ๋น ๋ฅด๊ณ  ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ฒŒ ๋„์™€์ฃผ๋Š” Microsoft์‚ฌ์˜ ๋”ฅ๋Ÿฌ๋‹ ์ตœ์ ํ™” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ด๋‹ค.

Training device configurations:

  • CPU
  • Single GPU
  • 1 node, multi-GPU
  • Multi-node, multi-GPU --> used to train very large models such as GPT-4.

Distributed training approaches:

  • Data Parallel (DP): one device splits the data across devices and gathers the per-device results for the final computation
    --> that one device uses more memory than the others, causing a memory-imbalance problem!
  • Distributed Data Parallel (DDP): each device runs as its own process, with a model replica loaded per process.
    Gradients are synchronized internally only during the backward pass --> no memory-imbalance problem.
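The gradient synchronization DDP performs during the backward pass is, conceptually, an all-reduce that averages each gradient across processes. A pure-Python sketch of that idea (real DDP does this with NCCL collectives, overlapped with backprop):

```python
def allreduce_mean(grads_per_worker):
    """grads_per_worker: one gradient list per process.
    After the all-reduce, every worker holds the element-wise mean."""
    n = len(grads_per_worker)
    mean = [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]
    return [list(mean) for _ in range(n)]  # identical averaged grad on each worker

synced = allreduce_mean([[1.0, 2.0], [3.0, 6.0]])
print(synced)  # [[2.0, 4.0], [2.0, 4.0]]
```

Because every worker applies the same averaged gradient, the model replicas stay in sync without any single device carrying extra state.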


cf) Requirements:

- PyTorch must be installed before installing DeepSpeed.
- For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
- A CUDA or ROCm compiler such as nvcc or hipcc used to compile C++/CUDA/HIP extensions.
- The specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it doesn't fall into this category; it's just that DeepSpeed is most well tested on the following:
        NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
        AMD: MI100 and MI200



pip install deepspeed

installs the package; usage is as follows.


Usage:
Step 1)
Write a config.json file for DeepSpeed

{
    "train_micro_batch_size_per_gpu": 160,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "contiguous_gradients": false
    }
}

Config args: https://www.deepspeed.ai/docs/config-json/

Step 2) Import & read the JSON config

import json

import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

with open('config.json', 'r') as f:
    deepspeed_config = json.load(f)



Step 3) Configure the optimizer & initialize the model and optimizer

optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    optimizer=optimizer,
    config_params=deepspeed_config,
)
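The engine returned by `deepspeed.initialize` replaces the usual `loss.backward()` / `optimizer.step()` calls. A minimal loop sketch (the dataloader and loss computation are placeholders; `model` here is the DeepSpeed engine from the step above):

```python
for batch in dataloader:   # placeholder dataloader
    loss = model(batch)    # forward pass returning a scalar loss
    model.backward(loss)   # engine handles loss scaling / ZeRO gradient partitioning
    model.step()           # optimizer step + gradient zeroing
```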


cf) The config can also be wired in through an ArgumentParser!

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)

parser = deepspeed.add_config_arguments(parser)




Step 4) Train!

# Launch directly:
deepspeed --num_gpus={number of GPUs} train_deepspeed.py

# Or put the same command in train.sh and run:
# bash train.sh


์ฃผ์˜ !)

DeepSpeed๋Š” CUDA_VISIBLE_DEVICES๋กœ ํŠน์ • GPU๋ฅผ ์ œ์–ดํ•  ์ˆ˜ ์—†๋‹ค!
์•„๋ž˜์™€ ๊ฐ™์ด --include๋กœ๋งŒ ํŠน์ • GPU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

deepspeed —inlcude localhost:<GPU_NUM1>, <GPU_NUM2> <python_file.py>


  • The more GPU nodes you add, the more DeepSpeed's key advantage, training speed, pays off!
 

