
Machine Learning vs Deep Learning

๋จธ์‹ ๋Ÿฌ๋‹: ์ธ๊ณต์ง€๋Šฅ์˜ ํ•œ ๋ถ„์•ผ๋กœ data์˜ Pattern์„ ํ•™์Šต. (์ด๋•Œ, ๋น„๊ต์  ์ ์€ ์–‘์˜ ๊ตฌ์กฐํ™”๋œ data๋กœ๋„ ์ž‘๋™๊ฐ€๋Šฅ)

๋”ฅ๋Ÿฌ๋‹: ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ๋ถ„์•ผ๋กœ ๋ณต์žกํ•œ ๊ตฌ์กฐ, ๋งŽ์€ ๊ณ„์‚ฐ๋ฆฌ์†Œ์Šค ๋ฐ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•จ.

 

 

 

Transformer(Attention is All You Need-2017)

Transformer ๋ชจ๋ธ์˜ ํ•ต์‹ฌ:

โˆ™ input sequence ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ

โˆ™ Only Use Attention Mechanism (Self Attention)
โˆ™ ์ˆœ์ฐจ์ ์ฒ˜๋ฆฌ, ๋ฐ˜๋ณต์—ฐ๊ฒฐ, ์žฌ๊ท€ ๋ชจ๋‘ ์‚ฌ์šฉโŒ


Transformer ๋ชจ๋ธ๊ตฌ์กฐ:

โˆ™ Embedding: token2dense-vector (์ด๋•Œ, ๋‹จ์–ด๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ๋ณด์กดํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋ชจ๋ธ์ด ํ•™์Šต๋œ๋‹ค.)
โˆ™ Positional Encoding: input sequence์˜ token์œ„์น˜์ •๋ณด๋ฅผ ์ถ”๊ฐ€๋กœ ์ œ๊ณต
โˆ™ Encoder&Decoder: Embedding+PE๋ผ๋Š” ํ•™์Šต๋œ vector๋ฅผ Input์œผ๋กœ ๋ฐ›์Œ(๋ฒกํ„ฐ ๊ฐ’์€ Pretrained weight or ํ•™์Šต๊ณผ์ •์ค‘ ์ตœ์ ํ™”๋จ.)
- MHA & FFN: token๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต, FFN์œผ๋กœ ๊ฐ ๋‹จ์–ด์˜ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ (์ด๋•Œ, ๊ฐ Head๊ฐ’์€ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜๋ฅผ ๊ฐ€์ ธ input sequence์˜ ๋‹ค์–‘ํ•œ ์ธก๋ฉด ํฌ์ฐฉ๊ฐ€๋Šฅ.) 
- QKV: Query(ํ˜„์žฌ์œ„์น˜์—์„œ ๊ด€์‹ฌ์žˆ๋Š”๋ถ€๋ถ„์˜ ๋ฒกํ„ฐ), Key(๊ฐ ์œ„์น˜์— ๋Œ€ํ•œ ์ •๋ณด์˜ ๋ฒกํ„ฐ), Value(๊ฐ ์œ„์น˜์— ๋Œ€ํ•œ ๊ฐ’ ๋ฒกํ„ฐ)

ex) The student studies at home
query: student
 --> Q: student์ •๋ณด๋ฅผ ์–ป๊ธฐ ์œ„ํ•œ ๋ฒกํ„ฐ๊ฐ’
 --> K: The, studies, at, home ๋ฒกํ„ฐ๊ฐ’
 --> V: ๋ฌธ๋งฅ, ์˜๋ฏธ๋“ฑ์˜ ๊ด€๋ จ์ •๋ณด์— ๋Œ€ํ•œ ๋ฒกํ„ฐ๊ฐ’
 --> 3-Head๋ผ๋ฉด: ๊ฐ ํ—ค๋“œ๋ณ„ ์—ญํ• ์ด Syntax, Semantics, Pragmatics ๋“ฑ์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ.

 

Huggingface Transformers

Library ์†Œ๊ฐœ

Tokenizer 

๋ณดํ†ต subword๋กœ tokenํ™”(token์œผ๋กœ ๋ถ„ํ• )ํ•˜๋Š” ๊ณผ์ •์„ ์ง„ํ–‰.
๋ถ€์ˆ˜์ ์œผ๋กœ "ํ…์ŠคํŠธ ์ •๊ทœํ™”, ๋ถˆ์šฉ์–ด์ œ๊ฑฐ, Padding, Truncation ๋“ฑ" ๋‹ค์–‘ํ•œ ์ „์ฒ˜๋ฆฌ ๊ธฐ๋Šฅ๋„ ์ œ๊ณตํ•œ๋‹ค.


Diffusers Library

txt-img์ƒ์„ฑ๊ด€๋ จ ์ž‘์—…์„ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ Stable Diffusion, DALL-E, LDM ๋“ฑ ๋‹ค์–‘ํ•œ ์ƒ์„ฑ๋ชจ๋ธ์„ ์ง€์›.
- DDPM, DDIM, LDM๋“ฑ ๋‹ค์–‘ํ•œ Diffusion๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ œ๊ณต
- Batch Inference, ๋ณ‘๋ ฌ, ํ˜ผํ•ฉ์ •๋ฐ€๋„ํ•™์Šต ๋“ฑ ์ง€์›


Accelerate

๋ถ„์‚ฐ์ „๋žต์„ ๊ฐ„๋‹จํžˆ ์ถ”์ƒํ™”ํ•ด API๋กœ ์ œ๊ณต, FP16 ๋ฐ BF16๋“ฑ์˜ ๋‚ฎ์€ ํ˜ผํ•ฉ์ •๋ฐ€๋„ํ•™์Šต์„ "์ž๋™์ง€์›"
- ๋ถ„์‚ฐํ•™์Šต ์ง€์›: Data Parallel, Model Parallel ๋“ฑ ์ง€์›.
- Automatic Mixed Precision์ง€์›: FP16, FP32 ๋“ฑ dataํ˜•์‹์„ ์ž๋™์œผ๋กœ ํ˜ผํ•ฉ, ๋ฉ”๋ชจ๋ฆฌ์‚ฌ์šฉ๋Ÿ‰
, ์†๋„

- Gradient Accumulation: ์—ฌ๋Ÿฌ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๋ˆ„์ ํ•˜์—ฌ ํฐ ๋ฐฐ์น˜ ํšจ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ธฐ๋ฒ•
- Gradient Checkpointing: ์ค‘๊ฐ„ activation๊ณ„์‚ฐ์„ ์ €์žฅํ•˜๋Š” ๋Œ€์‹ , ํ•„์š”ํ•  ๋•Œ ์žฌ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•
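์•„๋ž˜๋Š” Accelerate๋กœ ํ˜ผํ•ฉ์ •๋ฐ€๋„์™€ Gradient Accumulation์„ ์ ์šฉํ•˜๋Š” ํ•™์Šต ๋ฃจํ”„์˜ ์ตœ์†Œ ์Šค์ผ€์น˜๋‹ค. (์ž„์˜์˜ Linear ๋ชจ๋ธ๊ณผ ๋”๋ฏธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ •ํ–ˆ์œผ๋ฉฐ, fp16์€ GPU ํ™˜๊ฒฝ์„ ์ „์ œ๋กœ ํ•œ๋‹ค.)

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)  # AMP + Gradient Accumulation

model = torch.nn.Linear(10, 1)                      # ๊ฐ€์ •: ์ž„์˜์˜ ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

# ๋ชจ๋ธ/์˜ตํ‹ฐ๋งˆ์ด์ €/๋ฐ์ดํ„ฐ๋กœ๋”์˜ ์žฅ์น˜ ๋ฐฐ์น˜์™€ ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ๋ฅผ Accelerate๊ฐ€ ๋Œ€์‹  ์ˆ˜ํ–‰
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    with accelerator.accumulate(model):                     # 4 ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๋ˆ„์  -> ํฐ ๋ฐฐ์น˜ ํšจ๊ณผ
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)                          # loss.backward() ๋Œ€์‹  ์‚ฌ์šฉ
        optimizer.step()
        optimizer.zero_grad()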

Model ์„ค์ •

๋ชจ๋ธ ์„ค์ • ํด๋ž˜์Šค๋Š” ๋ชจ๋ธ๊ตฌ์กฐ์™€ hyperparameter๊ฐ’์„ "๋”•์…”๋„ˆ๋ฆฌ"ํ˜•ํƒœ๋กœ JSONํŒŒ์ผ์— ์ €์žฅํ•œ๋‹ค.
๋”ฐ๋ผ์„œ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๋ฉด ๋ชจ๋ธ๊ฐ€์ค‘์น˜์™€ ํ•จ๊ป˜ ์ด ๊ฐ’์ด ๋ถˆ๋Ÿฌ์™€์ง„๋‹ค. (์•„๋ž˜ ์‚ฌ์ง„์ฒ˜๋Ÿผ)




PretrainedConfig & ModelConfig

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ชจ๋ธ๊ตฌ์กฐ, hyperparameter๋ฅผ ์ €์žฅํ•˜๋Š” ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ํฌํ•จ
[์˜ˆ์‹œ์ธ์ž ์„ค๋ช…]: 
 - vocab_size: ๋ชจ๋ธ์ด ์ธ์‹๊ฐ€๋Šฅํ•œ ๊ณ ์œ ํ† ํฐ ์ˆ˜
 - output_hidden_states: ๋ชจ๋ธ์˜ ๋ชจ๋“  hidden_state๋ฅผ ์ถœ๋ ฅํ• ์ง€ ์—ฌ๋ถ€
 - output_attentions: ๋ชจ๋ธ์˜ ๋ชจ๋“  attention๊ฐ’์„ ์ถœ๋ ฅํ• ์ง€ ์—ฌ๋ถ€
 - return_dict: ๋ชจ๋ธ์ด ์ผ๋ฐ˜ ํŠœํ”Œ๋Œ€์‹ , ModelOutput๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ• ์ง€ ๊ฒฐ์ •.
๊ฐ ๋ชจ๋ธ ๊ตฌ์กฐ๋ณ„ PretrainedConfig๋ฅผ ์ƒ์†๋ฐ›์€ ์ „์šฉ ๋ชจ๋ธ ์„ค์ • ํด๋ž˜์Šค๊ฐ€ ์ œ๊ณต๋œ๋‹ค.
(ex. BertConfig, GPT2Config ํ˜น์€ ์•„๋ž˜ ์‚ฌ์ง„์ฒ˜๋Ÿผ...)

InternVisionConfig๋ฅผ ์ง์ ‘ ์ธ์Šคํ„ด์Šคํ™”ํ•ด ์„ค์ •ํ•˜๋Š” ์˜ˆ์‹œ

์ด๋•Œ, ์„ค์ •๊ฐ’์ด ์ž˜๋ชป๋˜๋ฉด ๋ชจ๋ธ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ๊ธฐ์— ๋ณดํ†ต "from_pretrained"๋ฅผ ์ด์šฉํ•ด ๊ฒ€์ฆ๋œ pretrainedํ•™์Šต์„ค์ •๊ฐ’์„ ๋ถˆ๋Ÿฌ์˜จ๋‹ค.



PreTrainedTokenizer & ModelTokenizer & PreTrainedTokenizerFast

[์˜ˆ์‹œ์ธ์ž ์„ค๋ช…]: 
 - max_model_input_sizes: ๋ชจ๋ธ์˜ ์ตœ๋Œ€์ž…๋ ฅ๊ธธ์ด

 - model_max_length: tokenizer๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์˜ ์ตœ๋Œ€์ž…๋ ฅ๊ธธ์ด
(์ฆ‰, ํ† ํฌ๋‚˜์ด์ €์˜ model_max_length๋Š” ๋ชจ๋ธ์˜ max_model_input_sizes๋ณด๋‹ค ํฌ์ง€ ์•Š๋„๋ก ์„ค์ •ํ•ด์•ผ ๋ชจ๋ธ์ด ์ •์ƒ์ ์œผ๋กœ ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 - padding_side/truncation_side: padding/truncation์œ„์น˜(left/right) ๊ฒฐ์ •

 - model_input_names: ์ˆœ์ „ํŒŒ์‹œ ์ž…๋ ฅ๋˜๋Š” tensor๋ชฉ๋ก(ex. input_ids, attention_mask, token_type_ids)

cf) decode ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด token id ์‹œํ€€์Šค๋ฅผ ์›๋ž˜ ๋ฌธ์žฅ์œผ๋กœ ๋ณต์›ํ•œ๋‹ค.
cf) PreTrainedTokenizerFast๋Š” Rust๋กœ ๊ตฌํ˜„๋˜์–ด Python Wrapper๋ฅผ ํ†ตํ•ด ํ˜ธ์ถœ๋˜๋Š”, ๋” ๋น ๋ฅธ tokenizer๋‹ค.
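์•„๋ž˜๋Š” ์œ„ ์ธ์ž๋“ค์„ ์ง์ ‘ ํ™•์ธํ•ด ๋ณด๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ๋‹ค.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)   # 512
print(tokenizer.padding_side)       # right
print(tokenizer.model_input_names)  # ['input_ids', 'token_type_ids', 'attention_mask']

ids = tokenizer("Hello world!")["input_ids"]
print(tokenizer.decode(ids))        # [CLS] hello world! [SEP]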




Datasets

Dataset Upload ์˜ˆ์ œ:

images ๋””๋ ‰ํ† ๋ฆฌ ๊ตฌ์กฐ:
images
โŽฟ A.jpg
โŽฟ B.jpg
  ...

import os
from collections import defaultdict
from datasets import Dataset, Image, DatasetDict

data = defaultdict(list)
folder_name = '../images'

for file_name in os.listdir(folder_name):
    name = os.path.splitext(file_name)[0]  # ํ™•์žฅ์ž๋ฅผ ์ œ์™ธํ•œ ํŒŒ์ผ๋ช…
    path = os.path.join(folder_name, file_name)
    data['name'].append(name)
    data['image'].append(path)

dataset = Dataset.from_dict(data).cast_column('image', Image())
# print(data, dataset[0]) # ํ™•์ธ์šฉ

dataset_dict = DatasetDict({
    'train': dataset.select(range(5)),
    'valid': dataset.select(range(5, 10)),
    'test': dataset.select(range(10, len(dataset)))
    }
)

hub_name = "<user_name>/<repo_name>" # dataset์ €์žฅ๊ฒฝ๋กœ
token = "hf_###..." # huggingface token์ž…๋ ฅ
dataset_dict.push_to_hub(hub_name, token=token)

 





๐Ÿค— Embedding๊ณผ์ • ์™„์ „์ •๋ฆฌ!!

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

txt = "I am laerning about tokenizers."
input = tokenizer(txt, return_tensors="pt")
output = model(**input)

print('input:', input)
print('last_hidden_state:', output.last_hidden_state.shape)
input: {'input_ids': tensor([[  101,  1045,  2572,  2474, 11795,  2075,  2055, 19204, 17629,  2015,  1012,   102]]), 
        'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
        'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
        
last_hidden_state: torch.Size([1, 12, 768])
  1. input ๋”•์…”๋„ˆ๋ฆฌ
    • input_ids:
      • ๊ฐ ๋‹จ์–ด์™€ ํŠน์ˆ˜ ํ† ํฐ์ด BERT์˜ ์–ดํœ˜ ์‚ฌ์ „์— ๋งคํ•‘๋œ ๊ณ ์œ ํ•œ ์ •์ˆ˜ ID๋กœ ๋ณ€ํ™˜๋œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.
      • ์˜ˆ์‹œ: 101์€ [CLS] ํ† ํฐ, 102๋Š” [SEP] ํ† ํฐ.
      • ์ „์ฒด ์‹œํ€€์Šค: [CLS] I am laerning about tokenizers. [SEP]
      • ๊ธธ์ด: 12๊ฐœ์˜ ํ† ํฐ (๋ฌธ์žฅ ์ „์ฒด์™€ ํŠน์ˆ˜ ํ† ํฐ ํฌํ•จ)
    • token_type_ids:
      • ๋ฌธ์žฅ ๋‚ด์˜ ๊ฐ ํ† ํฐ์ด ์–ด๋Š segment์— ์†ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋ƒ„.
      • BERT๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ ๋‘ ๊ฐœ์˜ ์„ธ๊ทธ๋จผํŠธ(์˜ˆ: ๋ฌธ์žฅ A์™€ ๋ฌธ์žฅ B)๋ฅผ ๊ตฌ๋ถ„๊ฐ€๋Šฅ.
      • ์—ฌ๊ธฐ์„œ๋Š” ๋‹จ์ผ ๋ฌธ์žฅ์ด๋ฏ€๋กœ ๋ชจ๋“  ๊ฐ’์ด 0์ด๋‹ค.
    • attention_mask:
      • ๋ชจ๋ธ์ด ๊ฐ ํ† ํฐ์— ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์—ฌ์•ผ ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.
      • 1์€ ํ•ด๋‹น ํ† ํฐ์ด ์‹ค์ œ ๋ฐ์ดํ„ฐ์ž„์„ ์˜๋ฏธํ•˜๊ณ , 0์€ ํŒจ๋”ฉ ํ† ํฐ์„ ์˜๋ฏธ.
      • ์—ฌ๊ธฐ์„œ๋Š” ํŒจ๋”ฉ์ด ์—†์œผ๋ฏ€๋กœ ๋ชจ๋“  ๊ฐ’์ด 1์ด๋‹ค.
  2. last_hidden_state
    • Shape: [1, 12, 768]
      • Batch Size (1): ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ์ž…๋ ฅ ๋ฌธ์žฅ์„ ์ฒ˜๋ฆฌ.
      • Sequence Length (12): ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ํ† ํฐ ์ˆ˜ (ํŠน์ˆ˜ ํ† ํฐ ํฌํ•จ).
      • Hidden Size (768): BERT-base ๋ชจ๋ธ์˜ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด 768์ฐจ์›์˜ ๋ฒกํ„ฐ ํ‘œํ˜„์„ ์ƒ์„ฑํ•œ๋‹ค.
    • ์˜๋ฏธ:
      • last_hidden_state๋Š” ๋ชจ๋ธ์˜ ๋งˆ์ง€๋ง‰ ์€๋‹‰ ๊ณ„์ธต์—์„œ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ๋ฒกํ„ฐ ํ‘œํ˜„์„ ๋‹ด๊ณ  ์žˆ๋‹ค.
      • ์ด ๋ฒกํ„ฐ๋“ค์€ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ๋‹ค์–‘ํ•œ NLP ์ž‘์—…(์˜ˆ: ๋ถ„๋ฅ˜, ๊ฐœ์ฒด๋ช… ์ธ์‹ ๋“ฑ)์— ํ™œ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค.

์„ค๋ช…)



ex-1) Embedding Lookup Table ๊ณผ์ • ์ฝ”๋“œ

import torch

train_data = 'you need to know how to code'

# ์ค‘๋ณต์„ ์ œ๊ฑฐํ•œ ๋‹จ์–ด๋“ค์˜ ์ง‘ํ•ฉ์ธ ๋‹จ์–ด ์ง‘ํ•ฉ ์ƒ์„ฑ.
word_set = set(train_data.split())

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์ •์ˆ˜ ๋งตํ•‘.
vocab = {word: i+2 for i, word in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1
print(vocab) # {'need': 2, 'to': 3, 'code': 4, 'how': 5, 'you': 6, 'know': 7, '<unk>': 0, '<pad>': 1}

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ํฌ๊ธฐ๋งŒํผ์˜ ํ–‰์„ ๊ฐ€์ง€๋Š” ํ…Œ์ด๋ธ” ์ƒ์„ฑ.
embedding_table = torch.FloatTensor([[ 0.0,  0.0,  0.0],
                                    [ 0.0,  0.0,  0.0],
                                    [ 0.2,  0.9,  0.3],
                                    [ 0.1,  0.5,  0.7],
                                    [ 0.2,  0.1,  0.8],
                                    [ 0.4,  0.1,  0.1],
                                    [ 0.1,  0.8,  0.9],
                                    [ 0.6,  0.1,  0.1]])

sample = 'you need to run'.split()
idxes = []

# ๊ฐ ๋‹จ์–ด๋ฅผ ์ •์ˆ˜๋กœ ๋ณ€ํ™˜
for word in sample:
  try:
    idxes.append(vocab[word])
  # ๋‹จ์–ด ์ง‘ํ•ฉ์— ์—†๋Š” ๋‹จ์–ด์ผ ๊ฒฝ์šฐ <unk>๋กœ ๋Œ€์ฒด๋œ๋‹ค.
  except KeyError:
    idxes.append(vocab['<unk>'])
idxes = torch.LongTensor(idxes)

# ๊ฐ ์ •์ˆ˜๋ฅผ ์ธ๋ฑ์Šค๋กœ ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜จ๋‹ค.
lookup_result = embedding_table[idxes, :]
print(lookup_result)
print(lookup_result.shape)
# tensor([[0.1000, 0.8000, 0.9000],
#         [0.2000, 0.9000, 0.3000],
#         [0.1000, 0.5000, 0.7000],
#         [0.0000, 0.0000, 0.0000]])
# torch.Size([4, 3])

ex-2) Embedding lookup table ๊ณผ์ • ์ฝ”๋“œ์™€ nn.Embedding ๊ฐ„ ๋น„๊ต

import torch
import torch.nn as nn

train_data = 'you need to know how to code'

# ์ค‘๋ณต์„ ์ œ๊ฑฐํ•œ ๋‹จ์–ด๋“ค์˜ ์ง‘ํ•ฉ์ธ ๋‹จ์–ด ์ง‘ํ•ฉ ์ƒ์„ฑ.
word_set = set(train_data.split())

# ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์ •์ˆ˜ ๋งตํ•‘.
vocab = {tkn: i+2 for i, tkn in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3, padding_idx=1)
print(embedding_layer.weight)
print(embedding_layer)

# tensor([[ 0.7830,  0.2669,  0.4363],
#         [ 0.0000,  0.0000,  0.0000],
#         [-1.2034, -0.0957, -0.9129],
#         [ 0.7861, -0.0251, -2.2705],
#         [-0.5167, -0.3402,  1.3143],
#         [ 1.7932, -0.6973,  0.5459],
#         [-0.8952, -0.4937,  0.2341],
#         [ 0.3692,  0.0593,  1.0825]], requires_grad=True)
# Embedding(8, 3, padding_idx=1)




ex-3) Embedding ์˜ˆ์‹œ ์ฝ”๋“œ (BERT์˜ Embedding ๊ณ„์ธต)

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.word_embeddings = nn.Embedding(config.vocab_size, config.emb_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_seq_length, config.emb_size)
        self.token_type_embeddings = nn.Embedding(2, config.emb_size)
        self.LayerNorm = nn.LayerNorm(config.emb_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.dropout)
        
        # position ids (used in the pos_emb lookup table) that we do not want updated through backpropagation
        self.register_buffer("position_ids", torch.arange(config.max_seq_length).expand((1, -1)))

    def forward(self, input_ids, token_type_ids):
        word_emb = self.word_embeddings(input_ids)
        pos_emb = self.position_embeddings(self.position_ids)
        type_emb = self.token_type_embeddings(token_type_ids)

        emb = word_emb + pos_emb + type_emb
        emb = self.LayerNorm(emb)
        emb = self.dropout(emb)

        return emb

 

 

 

 

 

NLP

BERT - Classification

NER (Named Entity Recognition)

Token Classification, ์ฆ‰ ๋ฌธ์žฅ์„ ๊ตฌ์„ฑํ•˜๋Š” ๊ฐ token์— label์„ ํ• ๋‹นํ•˜๋Š” Task์ด๋‹ค.
๋จผ์ € ์˜ˆ์‹œ๋กœ BIO Tagging์„ ๋“ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค:
ex) ์ธ๊ณต์ง€๋Šฅ(AI)์—์„œ ๋”ฅ๋Ÿฌ๋‹์€ ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ๋ถ„์•ผ์ž…๋‹ˆ๋‹ค.
--> <์ธ๊ณต์ง€๋Šฅ:B-Tech> <(:I-Tech> <AI:I-Tech> <):I-Tech> <์—์„œ:O> <๋”ฅ๋Ÿฌ๋‹:B-Tech> <์€:O> <๋จธ์‹ ๋Ÿฌ๋‹:B-Tech> <์˜:O> <ํ•œ:O> <๋ถ„์•ผ:O> <์ž…๋‹ˆ๋‹ค:O> <.:O>
์ด๋•Œ, B๋Š” Begin(๊ฐœ์ฒด๋ช…์˜ ์‹œ์ž‘)์„, I๋Š” Inside(๊ฐœ์ฒด๋ช…์˜ ์—ฐ์†)๋ฅผ, O๋Š” Outside(๊ฐœ์ฒด๋ช…์ด ์•„๋‹Œ๊ฒƒ)๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
์ด๋Ÿฐ NER์—์„œ ์ž์ฃผ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ์ด ๋ฐ”๋กœ BERT์ด๋‹ค.

BERT - MLM, NSP

๋ฌธ์žฅ ๊ฐ„ ๊ด€๊ณ„(NSP, ๋ถ„๋ฅ˜, ์š”์•ฝ ๋“ฑ)๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด [CLS] ํ† ํฐ์ด ํ™œ์šฉ๋œ๋‹ค.
BERT์—์„œ๋Š” ์ด 3๊ฐ€์ง€ Embedding์ด Embedding Layer์—์„œ ํ™œ์šฉ๋œ๋‹ค:
1. Token Embedding:
 - ์ž…๋ ฅ๋ฌธ์žฅ embedding
2. Segment Embedding:
 - ๋ชจ๋ธ์ด ์ธ์‹ํ•˜๋„๋ก ๊ฐ ๋ฌธ์žฅ์— ๊ณ ์ •๋œ ์ˆซ์ž ํ• ๋‹น.
3. Position Embedding:
 - input์„ ํ•œ ๋ฒˆ์— ๋ชจ๋‘ ๋ฐ€์–ด๋„ฃ๊ธฐ ๋•Œ๋ฌธ์—(= ์ˆœ์ฐจ์ ์œผ๋กœ ๋„ฃ์ง€ ์•Š์Œ)
 - Transformer Encoder๋Š” ๊ฐ token์˜ ์ˆœ์„œ(์‹œ๊ฐ„์  ์ •๋ณด)๋ฅผ ์•Œ์ง€ ๋ชปํ•œ๋‹ค.
 - ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ์œ„์น˜ ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•œ๋‹ค. ์›๋ž˜ Transformer๋Š” sine/cosine ๊ธฐ๋ฐ˜ Positional Encoding์„ ์‚ฌ์šฉํ•˜๊ณ , BERT๋Š” ํ•™์Šต๋˜๋Š” Position Embedding์„ ์‚ฌ์šฉํ•œ๋‹ค. (sine/cosine ๋ฐฉ์‹์€ ์•„๋ž˜ ์Šค์ผ€์น˜ ์ฐธ๊ณ )
์ถ”์ฒœ๊ฐ•์˜) https://www.youtube.com/watch?app=desktop&v=CiOL2h1l-EE

BART - Summarization

Abstractive & Extractive Summarization

์ถ”์ƒ์š”์•ฝ: ์›๋ฌธ์„ ์™„์ „ํžˆ ์ดํ•ด --> ์ƒˆ๋กœ์šด ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•ด ์š”์•ฝํ•˜๋Š” ๋ฐฉ์‹.
์ถ”์ถœ์š”์•ฝ: ์›๋ฌธ์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•˜๊ณ  ๊ด€๋ จ์„ฑ ๋†’์€ ๋ฌธ์žฅ๋“ค๋งŒ ์„ ํƒํ•ด ๊ทธ๋Œ€๋กœ ์ถ”์ถœ.
(์š”์•ฝ๋ฌธ์ด ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์šธ ์ˆ˜๋Š” ์žˆ์œผ๋ฉฐ, ์ค‘๋ณต์ œ๊ฑฐ๋Šฅ๋ ฅ์ด ํ•„์š”ํ•จ.)


BART (Bidirectional & Auto Regressive Transformers)

Encoder-Decoder ๋ชจ๋‘ ์กด์žฌํ•˜๋ฉฐ, ํŠนํžˆ๋‚˜ Embedding์ธต์„ ๊ณต์œ ํ•˜๋Š” Shared Embedding์„ ์‚ฌ์šฉํ•ด ๋‘˜๊ฐ„์˜ ์—ฐ๊ฒฐ์„ ๊ฐ•ํ™”ํ•œ๋‹ค.
Encoder๋Š” Bidirectional Encoder๋กœ ๊ฐ ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ ์ „์ฒด์˜ ์ขŒ์šฐ context๋ฅผ ๋ชจ๋‘ ์ฐธ์กฐ๊ฐ€๋Šฅํ•˜๋ฉฐ,
Decoder์—์„œ Auto-Regressive๋ฐฉ์‹์œผ๋กœ ์ด์ „์— ์ƒ์„ฑํ•œ ๋‹จ์–ด๋ฅผ ์ฐธ์กฐํ•ด ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.
๋˜ํ•œ, Pre-Train์‹œ Denoising Auto Encoder๋กœ ํ•™์Šตํ•˜๋Š”๋ฐ, ์ž„์˜๋กœ noisingํ›„, ๋ณต์›ํ•˜๊ฒŒ ํ•œ๋‹ค.
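์•„๋ž˜๋Š” ๊ณต๊ฐœ๋œ BART ์š”์•ฝ checkpoint๋ฅผ ์ด์šฉํ•œ ์ถ”์ƒ์š”์•ฝ pipeline ์˜ˆ์‹œ๋‹ค. (facebook/bart-large-cnn์€ ์„ค๋ช…์„ ์œ„ํ•œ ์˜ˆ์‹œ ๋ชจ๋ธ์ด๋ฉฐ, article์€ ์ž„์˜์˜ ์˜ˆ์‹œ ๋ฌธ๋‹จ์ด๋‹ค.)

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The Eiffel Tower is 324 metres tall, about the same height as an 81-storey building. "
    "It was the first structure in the world to reach a height of 300 metres."
)
print(summarizer(article, max_length=40, min_length=10, do_sample=False))
# [{'summary_text': '...'}]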

RoBERTa, T5- TextQA

Abstractive & Extractive QA

์ถ”์ƒ์งˆ์˜์‘๋‹ต: ์ฃผ์–ด์ง„ ์ง€๋ฌธ ๋‚ด์—์„œ ๋‹ต๋ณ€์ด ๋˜๋Š” ๋ฌธ์ž์—ด ์ถ”์ถœ (์งˆ๋ฌธ-์ง€๋ฌธ-์ง€๋ฌธ๋‚ด๋‹ต๋ณ€์ถ”์ถœ)
์ถ”์ถœ์งˆ์˜์‘๋‹ต: ์งˆ๋ฌธ๊ณผ ์ง€๋ฌธ์„ ์ž…๋ ฅ๋ฐ›์•„ ์ƒˆ๋กœ์šด ๋‹ต๋ณ€ ์ƒ์„ฑ (์งˆ๋ฌธ-์ง€๋ฌธ-๋‹ต๋ณ€)


RoBERTa

max_len๊ณผ pretrain dataset ์–‘์ด ๋Š˜์—ˆ์œผ๋ฉฐ, Dynamic Masking ๊ธฐ๋ฒ• ๋„์ž…์ด ํ•ต์‹ฌ์ด๋‹ค.
Dynamic Masking: ๊ฐ ์—ํญ๋งˆ๋‹ค ๋‹ค๋ฅธ ๋งˆ์Šคํ‚น ํŒจํ„ด์„ ์ƒ์„ฑํ•œ๋‹ค. (NSP๋Š” ์ œ๊ฑฐ๋จ.)
BPE Tokenization ์‚ฌ์šฉ: BERT๊ฐ€ WordPiece๋ฅผ ์“ฐ๋Š” ๋ฐ˜๋ฉด, RoBERTa๋Š” BPE tokenizer๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.
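์•„๋ž˜๋Š” RoBERTa ๊ธฐ๋ฐ˜ ๊ณต๊ฐœ checkpoint๋ฅผ ์ด์šฉํ•œ ์ถ”์ถœ์งˆ์˜์‘๋‹ต pipeline ์˜ˆ์‹œ๋‹ค. (deepset/roberta-base-squad2๋Š” ์„ค๋ช…์„ ์œ„ํ•œ ์˜ˆ์‹œ ๋ชจ๋ธ์ด๋‹ค.)

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="Where do the students study?",
    context="The students study at the library near the main campus."
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': '...'}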

T5- Machine Translation

SMT & NMT

ํ†ต๊ณ„์  ๊ธฐ๊ณ„๋ฒˆ์—ญ: ์›๋ฌธ-๋ฒˆ์—ญ์Œ ๊ธฐ๋ฐ˜, ๋‹จ์–ด์ˆœ์„œ ๋ฐ ์–ธ์–ดํŒจํ„ด์„ ์ธ์‹ --> ๋Œ€๊ทœ๋ชจ dataํ•„์š”
์‹ ๊ฒฝ๋ง ๊ธฐ๊ณ„๋ฒˆ์—ญ: ๋ฒˆ์—ญ๋ฌธ๊ณผ ๋‹จ์–ด์‹œํ€€์Šค๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต


T5 (Text-To-Text Transfer Transformer)

task๋ณ„ ํŠน์ • Prompt ํ˜•์‹์„ ์‚ฌ์šฉํ•ด ์ ์ ˆํ•œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•˜๋„๋ก ์œ ๋„ํ•  ์ˆ˜ ์žˆ๋‹ค.
์ฆ‰, ๋‹จ์ผ ๋ชจ๋ธ๋กœ ๋‹ค์–‘ํ•œ NLP Task๋ฅผ ์ฒ˜๋ฆฌ๊ฐ€๋Šฅํ•œ seq2seq๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค.

T5์˜ ๋…ํŠนํ•œ์ ์€ ๋ชจ๋ธ๊ตฌ์กฐ์ฐจ์ œ๊ฐ€ ์•„๋‹Œ, "์ž…์ถœ๋ ฅ ๋ชจ๋‘ Txtํ˜•ํƒœ๋กœ ์ทจ๊ธ‰ํ•˜๋Š” seq2seq๋กœ ์ ‘๊ทผํ•ด Pretrain๊ณผ์ •์—์„œ "Unsupervised Learning"์„ ํ†ตํ•ด ๋Œ€๊ทœ๋ชจ corpus(์•ฝ 75GB)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์ด๋‹ค." ์ด๋ฅผ ํ†ตํ•ด ์–ธ์–ด์˜ ์ผ๋ฐ˜์  ํŒจํ„ด๊ณผ ์ง€์‹์„ ํšจ๊ณผ์ ์œผ๋กœ ์Šต๋“ํ•œ๋‹ค.


LLaMA - Text Generation

Seq2Seq & CausalLM

Seq2Seq: Transformer, BART, T5 ๋“ฑ Encoder-Decoder๊ตฌ์กฐ
CausalLM: ๋‹จ์ผ Decoder๋กœ ๊ตฌ์„ฑ


LLaMA-3 Family

2024๋…„ 4์›” LLaMA-3๊ฐ€ ์ถœ์‹œ๋˜์—ˆ๋Š”๋ฐ, LLaMA-3์—์„œ๋Š” GQA(Grouped Query Attention)๊ฐ€ ์‚ฌ์šฉ๋˜์–ด Inference ์†๋„๋ฅผ ๋†’์˜€๋‹ค.
LLaMA-3๋Š” In-context Learning๊ณผ Few-Shot Learning ๋ชจ๋‘์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.
In-context Learning: ๋ชจ๋ธ์ด ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ ์—†์ด ์ž…๋ ฅ ํ…์ŠคํŠธ๋งŒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ฆ‰์„์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” ๋Šฅ๋ ฅ



์ถ”๊ฐ€์ ์œผ๋กœ 2024๋…„ 7์›”, LLaMA-3.1์ด ๊ณต๊ฐœ๋˜์—ˆ๋‹ค. LLaMA-3.1์€ AI์•ˆ์ •์„ฑ ๋ฐ ๋ณด์•ˆ๊ด€๋ จ ๋„๊ตฌ๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ๋Š”๋ฐ, Prompt Injection์„ ๋ฐฉ์ง€ํ•˜๋Š” Prompt Guard๋ฅผ ๋„์ž…ํ•ด ์œ ํ•ดํ•˜๊ฑฐ๋‚˜ ๋ถ€์ ์ ˆํ•œ ์ฝ˜ํ…์ธ ๋ฅผ ์‹๋ณ„ํ•˜๊ฒŒ ํ•˜์˜€๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ LLaMA-3 ์‹œ๋ฆฌ์ฆˆ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฃผ์š” ํŠน์ง•์ด ์กด์žฌํ•œ๋‹ค:
- RoPE(Rotary Position Embedding): Q, K์— ์ ์šฉ
- GQA(Grouped Query Attention): K, V๋ฅผ ์—ฌ๋Ÿฌ ๊ทธ๋ฃน์œผ๋กœ ๋ฌถ์–ด attention์—ฐ์‚ฐ์ˆ˜ํ–‰ --> ํšจ์œจ์  ์ถ”๋ก 

- RMS Norm: ์•ˆ์ •์  ํ•™์Šต ๋ฐ ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ
- KV cache: ์ถ”๋ก ์‹œ K,V๋ฅผ cache์— ์ €์žฅ --> ์—ฐ์‚ฐ์˜ ํšจ์œจ์„ฑ
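์•„๋ž˜๋Š” CausalLM ๋ฐฉ์‹์˜ ํ…์ŠคํŠธ ์ƒ์„ฑ ์Šค์ผ€์น˜๋‹ค. (meta-llama ๊ณ„์—ด checkpoint๋Š” ์ ‘๊ทผ ๊ถŒํ•œ๊ณผ ํ† ํฐ์ด ํ•„์š”ํ•˜๋ฏ€๋กœ, ์•„๋ž˜ model_id๋Š” ์‚ฌ์šฉ ํ˜•ํƒœ๋ฅผ ๋ณด์ด๊ธฐ ์œ„ํ•œ ๊ฐ€์ •์ด๋‹ค.)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # ๊ฐ€์ •: ์ ‘๊ทผ ๊ถŒํ•œ์ด ์žˆ๋Š” LLaMA-3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Q: What is in-context learning?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # ์ถ”๋ก  ์‹œ KV cache๋Š” ๊ธฐ๋ณธ ํ™œ์„ฑํ™”
print(tokenizer.decode(outputs[0], skip_special_tokens=True))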


LLaMA-3 ์ตœ์ ํ™”๊ธฐ๋ฒ•: SFT . RLHF  . DPO

SFT(Supervised Fine-Tuning): ์‚ฌ๋žŒ์ด ์ž‘์„ฑํ•œ ๊ณ ํ’ˆ์งˆQA์Œ์œผ๋กœ ๋ชจ๋ธ์„ ์ง์ ‘ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•
RLHF: PPO์•Œ๊ณ ๋ฆฌ์ฆ˜๊ธฐ๋ฐ˜, ๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ์—ฌ๋Ÿฌ ์‘๋‹ต์— ๋Œ€ํ•ด ์‚ฌ๋žŒ์ด ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๊ณ  ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์žฌํ•™์Šต.
DPO(Direct Preference Optimization): RLHF์˜ ๋ณต์žก์„ฑ์„ ์ค„์ด๋ฉด์„œ ํšจ๊ณผ์  ํ•™์Šต์„ ๊ฐ€๋Šฅ์ผ€ํ•จ.(์‚ฌ๋žŒ์ด ๋งค๊ธด ์‘๋‹ต์ˆœ์œ„๋ฅผ ์ง์ ‘ํ•™์Šต; ๋‹ค๋งŒ ๋”์šฑ ๊ณ ํ’ˆ์งˆ ์„ ํ˜ธ๋„ data๋ฅผ ํ•„์š”๋กœํ•จ.)

 

 

 

 

Computer Vision

์ฃผ๋กœ CV(Computer Vision)๋ถ„์•ผ์—์„  CNN๊ธฐ๋ฒ•์ด ๋งŽ์ด ํ™œ์šฉ๋˜์—ˆ๋‹ค.(VGG, Inception, ResNet, ...)
๋‹ค๋งŒ, CNN based model์€ ์ฃผ๋กœ ๊ตญ์†Œ์ (local) pattern์„ ํ•™์Šตํ•˜๊ธฐ์— ์ „์—ญ์ ์ธ ๊ด€๊ณ„(global relation)๋ชจ๋ธ๋ง์— ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ ์ด๋ฏธ์ง€ ํฌ๊ธฐ๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ๊ณ„์‚ฐ๋ณต์žก๋„ ๋˜ํ•œ ์ฆ๊ฐ€ํ•œ๋‹ค.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ViT(Vision Transformer)๊ฐ€ ์ œ์•ˆ๋˜์—ˆ์œผ๋ฉฐ, ๋Œ€๊ทœ๋ชจ dataset์œผ๋กœ ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•œ๋‹ค.
ViT์˜ ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๊ฒฉ์ธ CLIP, OWL-ViT, SAM์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

Zero shot classification

Zero Shot Classification: CLIP, ALIGN, SigLIP

์‚ฌ์‹ค CLIP์€ ๋‹ค์–‘ํ•œ Task์—์„œ ๋งŽ์ด ํ™œ์šฉ๋˜๋‚˜ ๋ณธ ๊ธ€์€ Train dataset์— ์—†๋Š” Label์— ๋Œ€ํ•ด Image Classification์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ์ˆ ์— ํ™œ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์•Œ์•„๋ณด๊ณ ์ž ํ•œ๋‹ค.
์ƒˆ๋กœ์šด Label์ด ์ถ”๊ฐ€๋  ๋•Œ๋งˆ๋‹ค ์žฌํ•™์Šต์ด ํ•„์š”ํ•œ๋ฐ, ์ด๋ฅผ ํ”ผํ•˜๋ ค๋ฉด Zero shot๊ธฐ๋ฒ•์€ ๋ฐ˜ํ•„์ˆ˜์ ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


CLIP (Contrastive Language-Image Pre-training)

Model Architecture Input_Size Patch_Size #params
openai/clip-vit-base-patch32 ViT-B/32 224×224 32×32 1.5B
openai/clip-vit-base-patch16 ViT-B/16 224×224 16×16 1.5B
openai/clip-vit-large-patch14 ViT-L/14 224×224 14×14 4.3B
openai/clip-vit-large-patch14-336 ViT-L/14 336×336 14×14 4.3B

์ž‘์€ patch_size: ๋” ์„ธ๋ฐ€ํ•œ ํŠน์ง•์ถ”์ถœ, ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๋ฐ ๊ณ„์‚ฐ์‹œ๊ฐ„ ์ฆ๊ฐ€

ํฐ patch_size: ๋น„๊ต์  ๋‚ฎ์€ ์„ฑ๋Šฅ, ๋น ๋ฅธ ์ฒ˜๋ฆฌ์†๋„
ํŒŒ๋ž€๋ธ”๋ก: Positive Sample , ํฐ๋ธ”๋ก: Negative Sample

๊ธฐ์กด Supervised Learning๊ณผ ๋‹ฌ๋ฆฌ 2๊ฐ€์ง€ ํŠน์ง•์ด ์กด์žฌํ•œ๋‹ค:
1.๋ณ„๋„์˜ Label์—†์ด input์œผ๋กœ image-txt์Œ๋งŒ ํ•™์Šต.
 - img, txt๋ฅผ ๋™์ผํ•œ embedding๊ณต๊ฐ„์— ์‚ฌ์˜(Projection)
 - ์ด๋ฅผ ํ†ตํ•ด ๋‘ Modality๊ฐ„ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ์ง์ ‘์ ์œผ๋กœ ์ธก์ • ๋ฐ ํ•™์Šต๊ฐ€๋Šฅ
 - ์ด ๋•Œ๋ฌธ์— CLIP์€ img-encoder, txt-encoder ๋ชจ๋‘ ๊ฐ–๊ณ ์žˆ์Œ

2. Contrastive Learning:
 - "Positive Sample": ์‹ค์ œimg-txt์Œ --> img-txt๊ฐ„ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ ์ตœ๋Œ€ํ™”
 - "Negative Sample": randomํ•˜๊ฒŒ pair๋œ ๋ถˆ์ผ์น˜img-txt์Œ --> ์œ ์‚ฌ์„ฑ ์ตœ์†Œํ™”
 - ์ด๋ฅผ ์œ„ํ•ด Cosine Similarity๊ธฐ๋ฐ˜์˜ Contrastive Learning Loss๋ฅผ ์‚ฌ์šฉ.
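์•„๋ž˜๋Š” Cosine Similarity ๊ธฐ๋ฐ˜ Contrastive Loss๋ฅผ ๋‹จ์ˆœํ™”ํ•œ ์Šค์ผ€์น˜๋‹ค. (์‹ค์ œ CLIP ๊ตฌํ˜„์˜ ์„ธ๋ถ€์™€๋Š” ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋‹ค.)

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # ๊ฐ™์€ index์˜ (img, txt) ์Œ์ด Positive, ๋‚˜๋จธ์ง€ ์กฐํ•ฉ์€ Negative
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # cosine similarity ๊ธฐ๋ฐ˜ logit
    labels = torch.arange(len(img_emb))
    loss_i2t = F.cross_entropy(logits, labels)            # image -> text ๋ฐฉํ–ฅ
    loss_t2i = F.cross_entropy(logits.t(), labels)        # text -> image ๋ฐฉํ–ฅ
    return (loss_i2t + loss_t2i) / 2

print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))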


Zero-Shot Classification ์˜ˆ์‹œ

from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dataset = load_dataset("sasha/dog-food")
images = dataset['test']['image'][:2]
labels = ['dog', 'food']
inputs = processor(images=images, text=labels, return_tensors="pt", padding=True)

print('input_ids: ', inputs['input_ids'])
print('attention_mask: ', inputs['attention_mask'])
print('pixel_values: ', inputs['pixel_values'])
print('image_shape: ', inputs['pixel_values'].shape)
# =======================================================
# input_ids:  tensor([[49406,  1929, 49407], [49406,  1559, 49407]])
# attention_mask:  tensor([[1, 1, 1], [1, 1, 1]])
# pixel_values:  tensor([[[[-0.0113, ...,]]]])
# image_shape:  torch.Size([2, 3, 224, 224])

CLIPProcessor์—๋Š” CLIPImageProcessor์™€ CLIPTokenizer๊ฐ€ ๋‚ด๋ถ€์ ์œผ๋กœ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.
input_ids์—์„œ 49406๊ณผ 49407์€ ๊ฐ๊ฐ startoftext์™€ endoftext๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํŠน๋ณ„ํ•œ ๊ฐ’์ด๋‹ค.
attention_mask๋Š” ๊ฐ ์œ„์น˜์˜ ํ† ํฐ์ด ์‹ค์ œ ๋ฐ์ดํ„ฐ์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’์œผ๋กœ,
๊ฐ’์ด 1์ด๋ฉด ํ•ด๋‹น ์œ„์น˜์˜ ํ† ํฐ์ด ์‹ค์ œ ๋ฐ์ดํ„ฐ์ž„์„, 0์ด๋ฉด [PAD] ํ† ํฐ์ž„์„ ์˜๋ฏธํ•œ๋‹ค.

with torch.no_grad():
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image
  probs = logits_per_image.softmax(dim=1)
  print('outputs:', outputs.keys())
  print('logits_per_image:', logits_per_image)
  print('probs: ', probs)

for idx, prob in enumerate(probs):
  print(f'- Image #{idx}')
  for label, p in zip(labels, prob):
    print(f'{label}: {p.item():.4f}')

# ============================================
# outputs: odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
# logits_per_image: tensor([[23.3881, 18.8604], [24.8627, 21.5765]])
# probs:  tensor([[0.9893, 0.0107], [0.9640, 0.0360]])
# - Image #0
# dog: 0.9893
# food: 0.0107
# - Image #1
# dog: 0.9640
# food: 0.0360

Zero shot Detection

์ž์—ฐ์–ด์  ์„ค๋ช…์—๋Š” ์ด๋ฏธ์ง€ ๋‚ด ๊ฐ์ฒด์™€ ๊ฐœ๋žต์  ์œ„์น˜์ •๋ณด๋ฅผ ์•”์‹œ์ ์œผ๋กœ ํฌํ•จํ•œ๋‹ค.
CLIP์—์„œ img-txt์Œ์œผ๋กœ ์‹œ๊ฐ์ ํŠน์ง•๊ณผ ํ…์ŠคํŠธ๊ฐ„ ์—ฐ๊ด€์„ฑ์„ ํ•™์Šต๊ฐ€๋Šฅํ•จ์„ ๋ณด์˜€๊ธฐ์—, 
์ถ”๋ก  ์‹œ, ์ฃผ์–ด์ง„ txt prompt๋งŒ ์ž˜ ์„ค๊ณ„ํ•œ๋‹ค๋ฉด ๊ฐ์ฒด์˜ ์œ„์น˜๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๊ฒŒ๋œ๋‹ค.
๋”ฐ๋ผ์„œ zero-shot object detection์—์„œ๋Š” ์ „ํ†ต์ ์ธ annotation์ •๋ณด ์—†์ด๋„ ์‹œ๊ฐ๊ณผ ์–ธ์–ด๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ฐ์ฒดํด๋ž˜์Šค๋ฅผ ๊ฒ€์ถœํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค.
OWL-ViT์˜ ๊ฒฝ์šฐ, Multi-Modal Backbone๋ชจ๋ธ๋กœ CLIP๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค.

OWLv2 (OWL-ViT)

OWL-ViT๊ตฌ์กฐ, OWLv2๋Š” ๊ฐ์ฒด๊ฒ€์ถœํ—ค๋“œ์— Objectness Classifier์ถ”๊ฐ€ํ•จ.
OWL-ViT๋Š” img-txt์Œ์œผ๋กœ pretrainํ•˜์—ฌ Open-Vocabulary๊ฐ์ฒดํƒ์ง€๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.
OWLv2๋Š” Self-Training๊ธฐ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค.
์ฆ‰, ๊ธฐ์กด Detector๋กœ Weak Supervision๋ฐฉ์‹์œผ๋กœ ๊ฐ€์ƒ์˜ Bbox-Annotation์„ ์ž๋™์ƒ์„ฑํ•œ๋‹ค.
ex) input: img-txt pair[๊ฐ•์•„์ง€๊ฐ€ ๊ณต์„ ๊ฐ€์ง€๊ณ  ๋…ธ๋Š”]
๊ธฐ์กด detector: [๊ฐ•์•„์ง€ bbox] [๊ณต bbox] ์ž๋™์˜ˆ์ธก, annotation์ƒ์„ฑ
--> ๋ชจ๋ธ ํ•™์Šต์— ์ด์šฉ (์ฆ‰, ์ •ํ™•ํ•œ ์œ„์น˜์ •๋ณด๋Š” ์—†์ง€๋งŒ ๋ถ€๋ถ„์  supervision signal๋กœ weak signal๊ธฐ๋ฐ˜, ๋ชจ๋ธ์ด ๊ฐ์ฒด์˜ ์œ„์น˜ ๋ฐ ํด๋ž˜์Šค๋ฅผ ์ถ”๋ก , ํ•™์Šตํ•˜๊ฒŒ ํ•จ)

Zero-Shot Detection ์˜ˆ์‹œ

import io
from PIL import Image
from datasets import load_dataset
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")
dataset = load_dataset('Francesco/animals-ij5d2')
print(dataset)
print(dataset['test'][0])

# ==========================================================
# DatasetDict({
#     train: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 700
#     })
#     validation: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 100
#     })
#     test: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 200
#     })
# })
# {'image_id': 63, 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2B0186E4A0>, 'width': 640, 'height': 640, 'objects': {'id': [96, 97, 98, 99], 'area': [138029, 8508, 10150, 20624], 'bbox': [[129.0, 291.0, 395.5, 349.0], [119.0, 266.0, 93.5, 91.0], [296.0, 280.0, 116.0, 87.5], [473.0, 284.0, 167.0, 123.5]], 'category': [3, 3, 3, 3]}}

 

- Label ๋ฐ Image ์ „์ฒ˜๋ฆฌ
images = dataset['test']['image'][:2]
categories = dataset['test'].features['objects'].feature['category'].names
labels = [categories] * len(images)
inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)

print(images)
print(labels)
print('input_ids:', inputs['input_ids'])
print('attention_mask:', inputs['attention_mask'])
print('pixel_values:', inputs['pixel_values'])
print('image_shape:', inputs['pixel_values'].shape)

# ==========================================================
# [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CF790>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CCC10>]
# [['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk'], ['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk']]
# input_ids: tensor([[49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0],
#         [49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0]])
# attention_mask: tensor([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#         [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], 
#          [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#           [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]])
# pixel_values: tensor([[[[ 1.5264, ..., ]]]])
# image_shape: torch.Size([2, 3, 960, 960])

- Detection & Inference
import torch

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.keys()) # odict_keys(['logits', 'objectness_logits', 'pred_boxes', 'text_embeds', 'image_embeds', 'class_embeds', 'text_model_output', 'vision_model_output'])

- Post Processing
import matplotlib.pyplot as plt
from PIL import ImageDraw, ImageFont

# ์˜ˆ์ธกํ™•๋ฅ ์ด ๋†’์€ ๊ฐ์ฒด ์ถ”์ถœ
shape = [dataset['test'][:2]['width'], dataset['test'][:2]['height']]
target_sizes = list(map(list, zip(*shape))) # [[640, 640], [640, 640]]
results = processor.post_process_object_detection(outputs=outputs, threshold=0.5, target_sizes=target_sizes)
print(results)

# Post Processing
for idx, (image, detect) in enumerate(zip(images, results)):
    image = image.copy()
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("arial.ttf", 18)

    for box, label, score in zip(detect['boxes'], detect['labels'], detect['scores']):
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline='red', width=3)

        label_text = f'{labels[idx][label]}: {round(score.item(), 3)}'
        draw.text((box[0], box[1]), label_text, fill='white', font=font)

    plt.imshow(image)
    plt.axis('off')
    plt.show()
    
# ==============================================
# [{'scores': tensor([0.5499, 0.6243, 0.6733]), 'labels': tensor([3, 3, 3]), 'boxes': tensor([[329.0247, 287.1844, 400.3372, 357.9262],
#         [122.9359, 272.8753, 534.3260, 637.6506],
#         [479.7363, 294.2744, 636.4859, 396.8372]])}, {'scores': tensor([0.7538]), 'labels': tensor([7]), 'boxes': tensor([[ -0.7799, 173.7043, 440.0294, 538.7166]])}]

 

Zero shot Semantic segmentation

Image Segmentation์€ ๋ณด๋‹ค ์ •๋ฐ€ํ•œ, ํ”ฝ์…€๋ณ„ ๋ถ„๋ฅ˜๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ์— ๋†’์€ ๊ณ„์‚ฐ๋น„์šฉ์ด ๋“ค๋ฉฐ, ๊ด‘๋ฒ”์œ„ํ•œ train data์™€ ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•„์š”๋กœ ํ•œ๋‹ค.
์ „ํ†ต์  ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” threshold๊ธฐ๋ฐ˜ binary classification, Edge Detection๋“ฑ์ด ์žˆ์œผ๋ฉฐ
์ตœ์‹  ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋”ฅ๋Ÿฌ๋‹๋ชจ๋ธ์„ ์ด์šฉํ•ด Image Segmentation์„ ์ง„ํ–‰ํ•œ๋‹ค.
์ „ํ†ต์  ๋ฐฉ๋ฒ•์€ ๋‹จ์ˆœํ•˜๊ณ  ๋น ๋ฅด์ง€๋งŒ ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋‹ค์–‘ํ•œ ์กฐ๋ช…์กฐ๊ฑด ๋“ฑ์—์„œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ์ €ํ•˜๋˜๋Š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.


SAM (Segment Anything Model)

Model Architecture Input_Size Patch_Size #params
facebook/sam-vit-base ViT-B/16 1024×1024 16×16 0.9B
facebook/sam-vit-large ViT-L/16 1024×1024 16×16 3.1B
facebook/sam-vit-huge ViT-H/16 1024×1024 16×16 6.4B

 

SAM๊ตฌ์กฐ: img-encoder, prompt-encoder, mask-decoder

SAM์€ Meta์—์„œ ๊ฐœ๋ฐœํ•œ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์—์„œ ์ˆ˜์ง‘ํ•œ 1100๋งŒ๊ฐœ ์ด๋ฏธ์ง€๋ฅผ ์ด์šฉํ•ด ํ•™์Šตํ•œ ๋ชจ๋ธ์ด๋‹ค.

๊ทธ๋ ‡๊ธฐ์— ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์—์„œ image segmentation์ž‘์—…์„ ๊ณ ์ˆ˜์ค€์œผ๋กœ ์ˆ˜ํ–‰๊ฐ€๋Šฅํ•˜๋‹ค.
SAM์„ ์ด์šฉํ•˜๋ฉด ๋งŽ์€๊ฒฝ์šฐ, ์ถ”๊ฐ€์ ์ธ Fine-Tuning์—†์ด, ๋‹ค์–‘ํ•œ Domain image์— ๋Œ€ํ•œ segmentation์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

SAM์€ prompt๋ฅผ ๋ฐ›์„์ˆ˜๋„ ์žˆ๊ณ , ๋ฐ›์ง€ ์•Š์•„๋„ ๋˜๋Š”๋ฐ, prompt๋Š” ์ขŒํ‘œ, bbox, txt ๋“ฑ ๋‹ค์–‘ํ•˜๊ฒŒ ์ค„ ์ˆ˜ ์žˆ๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ prompt๋ฅผ ์ฃผ์ง€ ์•Š์œผ๋ฉด img ์ „์ฒด์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ Segmentation์„ ์ง„ํ–‰ํ•œ๋‹ค.
๋‹ค๋งŒ, Inference๊ฒฐ๊ณผ๋กœ Binary Mask๋Š” ์ œ๊ณตํ•˜์ง€๋งŒ pixel์— ๋Œ€ํ•œ ๊ตฌ์ฒด์  class์ •๋ณด๋Š” ํฌํ•จํ•˜์ง€ ์•Š๋Š”๋‹ค.

SAM ํ™œ์šฉ ์˜ˆ์‹œ

import io
from PIL import Image
from datasets import load_dataset
from transformers import SamProcessor, SamModel

def filter_category(data):
    # 16 = dog
    # 23 = giraffe
    return 16 in data["objects"]["category"] or 23 in data["objects"]["category"]

def convert_image(data):
    byte = io.BytesIO(data["image"]["bytes"])
    img = Image.open(byte)
    return {"img": img}

model_name = "facebook/sam-vit-base"
processor = SamProcessor.from_pretrained(model_name) 
model = SamModel.from_pretrained(model_name)

dataset = load_dataset("s076923/coco-val")
filtered_dataset = dataset["validation"].filter(filter_category)
converted_dataset = filtered_dataset.map(convert_image, remove_columns=["image"])
import numpy as np
from matplotlib import pyplot as plt


def show_point_box(image, input_points, input_labels, input_boxes=None, marker_size=375):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    ax = plt.gca()
    
    input_points = np.array(input_points)
    input_labels = np.array(input_labels)

    pos_points = input_points[input_labels[0] == 1]
    neg_points = input_points[input_labels[0] == 0]
    
    ax.scatter(
        pos_points[:, 0],
        pos_points[:, 1],
        color="green",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )
    ax.scatter(
        neg_points[:, 0],
        neg_points[:, 1],
        color="red",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )

    if input_boxes is not None:
        for box in input_boxes:
            x0, y0 = box[0], box[1]
            w, h = box[2] - box[0], box[3] - box[1]
            ax.add_patch(
                plt.Rectangle(
                    (x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2
                )
            )

    plt.axis("on")
    plt.show()


image = converted_dataset[0]["img"]
input_points = [[[250, 200]]]
input_labels = [[[1]]]

show_point_box(image, input_points[0], input_labels[0])
inputs = processor(
    image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
)

# input_points shape : torch.Size([1, 1, 1, 2])
# input_points : tensor([[[[400.2347, 320.0000]]]], dtype=torch.float64)
# input_labels shape : torch.Size([1, 1, 1])
# input_labels : tensor([[[1]]])
# pixel_values shape : torch.Size([1, 3, 1024, 1024])
# pixel_values : tensor([[[[ 1.4612,  ...]]])

input_points: [B, ์ขŒํ‘œ๊ฐœ์ˆ˜, ์ขŒํ‘œ] -- ๊ด€์‹ฌ๊ฐ–๋Š” ๊ฐ์ฒด๋‚˜ ์˜์—ญ์ง€์ • ์ขŒํ‘œ
input_labels: [B, ์ขŒํ‘œB, ์ขŒํ‘œ๊ฐœ์ˆ˜] -- input_points์— ๋Œ€์‘๋˜๋Š” label์ •๋ณด
 - input_labels์ข…๋ฅ˜:

๋ฒˆํ˜ธ ์ด๋ฆ„ ์„ค๋ช…
1 foreground ํด๋ž˜์Šค ๊ฒ€์ถœํ•˜๊ณ ์ž ํ•˜๋Š” ๊ด€์‹ฌ๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋œ ์ขŒํ‘œ
0 not foreground ํด๋ž˜์Šค ๊ด€์‹ฌ๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€ ์ขŒํ‘œ
-1 background ํด๋ž˜์Šค ๋ฐฐ๊ฒฝ์˜์—ญ์— ํ•ด๋‹นํ•˜๋Š” ์ขŒํ‘œ
-10 padding ํด๋ž˜์Šค batch_size๋ฅผ ๋งž์ถ”๊ธฐ ์œ„ํ•œ padding๊ฐ’ (๋ชจ๋ธ์ด ์ฒ˜๋ฆฌX)

[Processor๋กœ ์ฒ˜๋ฆฌ๋œ ์ดํ›„ ์ถœ๋ ฅ ๊ฒฐ๊ณผ]
input_points: [B, ์ขŒํ‘œB, ๋ถ„ํ•  ๋งˆ์Šคํฌ๋‹น ์ขŒํ‘œ ๊ฐœ์ˆ˜, ์ขŒํ‘œ ์œ„์น˜]
input_labels: [B, ์ขŒํ‘œB, ์ขŒํ‘œ ๊ฐœ์ˆ˜]


import torch


def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()

    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(30, 15))

    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach()
        axes[i].imshow(np.array(raw_image))
        show_mask(mask, axes[i])
        axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
        axes[i].axis("off")
    plt.show()


model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)
print("iou_scores shape :", outputs.iou_scores.shape)
print("iou_scores :", outputs.iou_scores)
print("pred_masks shape :", outputs.pred_masks.shape)
print("pred_masks :", outputs.pred_masks)

# iou_scores shape : torch.Size([1, 1, 3])
# iou_scores : tensor([[[0.7971, 0.9507, 0.9603]]])
# pred_masks shape : torch.Size([1, 1, 3, 256, 256])
# pred_masks : tensor([[[[[ -3.6988, ..., ]]]]])

iou_scores: [B, ์ขŒํ‘œ ๊ฐœ์ˆ˜, IoU ์ ์ˆ˜]
pred_masks: [B, ์ขŒํ‘œB, C, H, W]


input_points = [[[250, 200], [15, 50]]]
input_labels = [[[0, 1]]]
input_boxes = [[[100, 100, 400, 600]]]

show_point_box(image, input_points[0], input_labels[0], input_boxes[0])
inputs = processor(
    image,
    input_points=input_points,
    input_labels=input_labels,
    input_boxes=input_boxes,
    return_tensors="pt"
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)

 



Zero shot Instance segmentation

Zero shot Detection + SAM

SAM์˜ ๊ฒฝ์šฐ, ๊ฒ€์ถœ๋œ ๊ฐ์ฒด์˜ ํด๋ž˜์Šค๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ธฐ๋Šฅ์ด ์—†๋‹ค.
์ฆ‰, ์ด๋ฏธ์ง€ ๋‚ด ๊ฐ์ฒด๋ฅผ ํ”ฝ์…€๋‹จ์œ„๋กœ ๊ตฌ๋ถ„ํ•˜๋Š” instance segmentation์ž‘์—…์—๋Š” ์–ด๋ ค์›€์ด ์กด์žฌํ•œ๋‹ค.

์ด๋Ÿฐ ํ•œ๊ณ„๊ทน๋ณต์„ ์œ„ํ•ด zero-shot detection model๊ณผ SAM์„ ํ•จ๊ป˜ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค:
1) zero-shot detection ๋ชจ๋ธ๋กœ ๊ฐ์ฒด ํด๋ž˜์Šค์™€ bbox ์˜์—ญ ๊ฒ€์ถœ
2) bbox ์˜์—ญ ๋‚ด์—์„œ SAM ๋ชจ๋ธ๋กœ segmentation ์ง„ํ–‰.

from transformers import pipeline

generator = pipeline("mask-generation", model=model_name)
outputs = generator(image, points_per_batch=32)

plt.imshow(np.array(image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()

print("outputs mask์˜ ๊ฐœ์ˆ˜ :", len(outputs["masks"]))
print("outputs scores์˜ ๊ฐœ์ˆ˜ :", len(outputs["scores"]))

# outputs mask์˜ ๊ฐœ์ˆ˜ : 52
# outputs scores์˜ ๊ฐœ์ˆ˜ : 52

detector = pipeline(
    model="google/owlv2-base-patch16", task="zero-shot-object-detection"
)

image = converted_dataset[24]["img"]
labels = ["dog", "giraffe"]
results = detector(image, candidate_labels=labels, threshold=0.5)

input_boxes = []
for result in results:
    input_boxes.append(
        [
            result["box"]["xmin"],
            result["box"]["ymin"],
            result["box"]["xmax"],
            result["box"]["ymax"]
        ]
    )
    print(result)

inputs = processor(image, input_boxes=[input_boxes], return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

plt.imshow(np.array(image))
ax = plt.gca()

for mask, iou in zip(masks[0], outputs.iou_scores[0]):
    max_iou_idx = torch.argmax(iou)
    best_mask = mask[max_iou_idx]
    show_mask(best_mask, ax=ax, random_color=True)

plt.axis("off")
plt.show()

#{'score': 0.6905778646469116, 'label': 'giraffe', 'box': {'xmin': 96, 'ymin': 198, 'xmax': 294, 'ymax': 577}}
#{'score': 0.6264181733131409, 'label': 'giraffe', 'box': {'xmin': 228, 'ymin': 199, 'xmax': 394, 'ymax': 413}}

 

Image Matching

image matching์€ ๋””์ง€ํ„ธ ์ด๋ฏธ์ง€๊ฐ„ ์œ ์‚ฌ์„ฑ์„ ์ •๋Ÿ‰ํ™”, ๋น„๊ตํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.
์ด๋ฅผ image์˜ feature vector๋ฅผ ์ถ”์ถœํ•˜์—ฌ ๊ฐ image vector๊ฐ„ ์œ ์‚ฌ๋„(๊ฑฐ๋ฆฌ)๋ฅผ ์ธก์ •ํ•˜์—ฌ ๊ณ„์‚ฐํ•œ๋‹ค.
๊ทธ๋ ‡๊ธฐ์— ์ด๋ฏธ์ง€ ๋งค์นญ์˜ ํ•ต์‹ฌ์€ "์ด๋ฏธ์ง€ ํŠน์ง•์„ ํšจ๊ณผ์ ์œผ๋กœ ํฌ์ฐฉํ•˜๋Š” feature vector์˜ ์ƒ์„ฑ"์ด๋‹ค.
(๋ณดํ†ต ํŠน์ง•๋ฒกํ„ฐ๊ฐ€ ๊ณ ์ฐจ์›์ผ์ˆ˜๋ก ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ์ด ํŠน์ง•๋ฒกํ„ฐ๋Š” classification layer์™€ ๊ฐ™์€ ์ธต์„ ํ†ต๊ณผํ•˜๊ธฐ ์ง์ „(= Feature Extractor์˜ ๊ฒฐ๊ณผ๊ฐ’ = Classifier ์ง์ „๊ฐ’) ๋ฒกํ„ฐ๋ฅผ ๋ณดํ†ต ์˜๋ฏธํ•œ๋‹ค.)

ex) ViT๋ฅผ ์ด์šฉํ•œ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ ์˜ˆ์ œ
import torch
from datasets import load_dataset
from transformers import ViTImageProcessor, ViTModel

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["pixel_values"])

print("๋งˆ์ง€๋ง‰ ํŠน์ง• ๋งต์˜ ํ˜•ํƒœ :", outputs["last_hidden_state"].shape)
print("ํŠน์ง• ๋ฒกํ„ฐ์˜ ์ฐจ์› ์ˆ˜ :", outputs["last_hidden_state"][:, 0, :].shape)
print("ํŠน์ง• ๋ฒกํ„ฐ :", outputs["last_hidden_state"][:, 0, :])

# ๋งˆ์ง€๋ง‰ ํŠน์ง• ๋งต์˜ ํ˜•ํƒœ : torch.Size([1, 197, 768])
# ํŠน์ง• ๋ฒกํ„ฐ์˜ ์ฐจ์› ์ˆ˜ : torch.Size([1, 768])
# ํŠน์ง• ๋ฒกํ„ฐ : tensor([[ 2.9420e-01,  8.3502e-01,  ..., -8.4114e-01,  1.7990e-01]])

ImageNet-21K๋ผ๋Š” ๋ฐฉ๋Œ€ํ•œ ์‚ฌ์ „Dataset์œผ๋กœ ํ•™์Šต๋˜์–ด ๋ฏธ์„ธํ•œ ์ฐจ์ด ๋ฐ ๋ณต์žกํ•œ ํŒจํ„ด์„ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.
ViT์—์„œ feature vector์ถ”์ถœ ์‹œ, ์ฃผ๋ชฉํ• ์ ์€ last_hidden_state ํ‚ค ๊ฐ’์ด๋‹ค:
์ถœ๋ ฅ์ด [1, 197, 768]์˜ [B, ์ถœ๋ ฅํ† ํฐ์ˆ˜, feature์ฐจ์›]์„ ์˜๋ฏธํ•˜๋Š”๋ฐ, 197๊ฐœ์˜ ์ถœ๋ ฅํ† ํฐ์€ ๋‹ค์Œ์„ ์˜๋ฏธํ•œ๋‹ค.
224×224 --> 16×16(patch_size) --> 196๊ฐœ patches,
197 = [CLS] + 196 patches๋กœ ์ด๋ฃจ์–ด์ง„ ์ถœ๋ ฅํ† ํฐ์—์„œ [CLS]๋ฅผ ํŠน์ง•๋ฒกํ„ฐ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.


FAISS (Facebook AI Similarity Search)

FAISS๋Š” ๋ฉ”ํƒ€์—์„œ ๊ฐœ๋ฐœํ•œ ๊ณ ์„ฑ๋Šฅ ๋ฒกํ„ฐ์œ ์‚ฌ๋„๊ฒ€์ƒ‰ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ด๋‹ค.
์ด๋Š” "๋Œ€๊ทœ๋ชจ ๊ณ ์ฐจ์› ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—์„œ ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ๋ฅผ ๊ฒ€์ƒ‰"๊ฐ€๋Šฅํ•˜๊ฒŒ ์„ค๊ณ„๋˜์—ˆ๋‹ค.


cf) [๋ฒกํ„ฐ ์ €์žฅ ๋ฐ ๊ด€๋ฆฌ๋ฐฉ์‹]

  • ๋กœ์ปฌ ์ €์žฅ ์žฅ์น˜: SSD๋‚˜ NVMe๊ฐ™์€ ๊ณ ์†์ €์žฅ์žฅ์น˜๋ฅผ ์‚ฌ์šฉํ•ด ๋น ๋ฅธ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ์ด ๊ฐ€๋Šฅ.
  • ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ์‹œ์Šคํ…œ: PstgreSQL, pgvectorํ™•์žฅ์ด๋‚˜ MongoDB์˜ Atlas Vector Search๊ฐ™์€ ๋ฒกํ„ฐ๊ฒ€์ƒ‰๊ธฐ๋Šฅ์„ ์ง€์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ํ™œ์šฉ
  • ํด๋ผ์šฐ๋“œ ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค: Amazon OpenSearch, Ggogle Vetex AI๋“ฑ ํด๋ผ์šฐ๋“œ ์„œ๋น„์Šค๋Š” ๋Œ€๊ทœ๋ชจ ๋ฒกํ„ฐ๋ฐ์ดํ„ฐ์˜ ์ €์žฅ ๋ฐ ๊ฒ€์ƒ‰์„ ์œ„ํ•œ ํŠนํ™”๋œ ์†”๋ฃจ์…˜์„ ์ œ๊ณต
  • ๋ฒกํ„ฐ๊ฒ€์ƒ‰์—”์ง„: Milvus, Qdrant, Weaviate, FAISS ๋“ฑ์˜ ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ ๋ฒ ์ด์Šค๋Š” ๋Œ€๊ทœ๋ชจ ๋ฒกํ„ฐ dataset์˜ ํšจ์œจ์  ์ €์žฅ ๋ฐ ๊ณ ์„ฑ๋Šฅ ๊ฒ€์ƒ‰์„ ์œ„ํ•ด ์ตœ์ ํ™”๋˜์–ด ANN(Approximate Nearest Neighbor)์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋น ๋ฅธ ์œ ์‚ฌ๋„๊ฒ€์ƒ‰์„ ์ง€์›, ์‹ค์‹œ๊ฐ„ ๊ฒ€์ƒ‰์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ ํŠนํžˆ๋‚˜ ์ ํ•ฉํ•˜๋‹ค.


ex) CLIP์„ ์ด์šฉํ•œ ์ด๋ฏธ์ง€ ํŠน์ง•๋ฒกํ„ฐ ์ถ”์ถœ ์˜ˆ์ œ

import torch
import numpy as np
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

dataset = load_dataset("sasha/dog-food")
images = dataset["test"]["image"][:100]

model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)

vectors = []
with torch.no_grad():
    for image in images:
        inputs = processor(images=image, padding=True, return_tensors="pt")
        outputs = model.get_image_features(**inputs)
        vectors.append(outputs.cpu().numpy())

vectors = np.vstack(vectors)
print("์ด๋ฏธ์ง€ ๋ฒกํ„ฐ์˜ shape :", vectors.shape)

# ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ์˜ shape : (100, 512)

dog-food dataset์—์„œ 100๊ฐœ ์ด๋ฏธ์ง€๋ฅผ ์„ ํƒํ•ด ๊ฐ ์ด๋ฏธ์ง€์˜ ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœํ•˜๊ณ ,
vectors ๋ฆฌ์ŠคํŠธ์— ์ €์žฅํ•œ ๋’ค ndarray ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

์ด๋Ÿฐ ํŠน์ง•๋ฒกํ„ฐ๋ฅผ ์œ ์‚ฌ๋„ ๊ฒ€์ƒ‰์„ ์œ„ํ•œ ์ธ๋ฑ์Šค ์ƒ์„ฑ์— ํ™œ์šฉ๊ฐ€๋Šฅํ•˜๋‹ค:
์ƒ์„ฑ๋œ ์ธ๋ฑ์Šค์— ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋ฅผ ๋“ฑ๋กํ•˜๊ธฐ ์œ„ํ•ด add๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ด๋•Œ ์ž…๋ ฅ๋˜๋Š” ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋Š” ๋ฐ˜๋“œ์‹œ numpy์˜ ndarrayํ˜•์‹์˜ [๋ฒกํ„ฐ๊ฐœ์ˆ˜, ๋ฒกํ„ฐ์ฐจ์›์ˆ˜] ํ˜•ํƒœ๋กœ ๊ตฌ์„ฑ๋˜์–ด์•ผ ํ•œ๋‹ค!!

import faiss

dimension = vectors.shape[-1]
index = faiss.IndexFlatL2(dimension)
if torch.cuda.is_available():
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(vectors)


import matplotlib.pyplot as plt

search_vector = vectors[0].reshape(1, -1)
num_neighbors = 5
distances, indices = index.search(x=search_vector, k=num_neighbors)

fig, axes = plt.subplots(1, num_neighbors + 1, figsize=(15, 5))

axes[0].imshow(images[0])
axes[0].set_title("Input Image")
axes[0].axis("off")

for i, idx in enumerate(indices[0]):
    axes[i + 1].imshow(images[idx])
    axes[i + 1].set_title(f"Match {i + 1}\nIndex: {idx}\nDist: {distances[0][i]:.2f}")
    axes[i + 1].axis("off")

print("์œ ์‚ฌํ•œ ๋ฒกํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ฒˆํ˜ธ:", indices)
print("์œ ์‚ฌ๋„ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ:", distances)

# ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ฒˆํ˜ธ: [[ 0  6 75  1 73]]
# ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ: [[ 0.       43.922516 44.92473  46.544144 47.058586]]

์œ„ ๊ณผ์ •์„ ํ†ตํ•ด 100๊ฐœ์˜ ๋ฒกํ„ฐ๋ฅผ ์ €์žฅํ•œ FAISS ์ธ๋ฑ์Šค๊ฐ€ ์ƒ์„ฑ๋˜๋ฉฐ, ๊ฒ€์ƒ‰ํ•˜๊ณ ์žํ•˜๋Š” ์ด๋ฏธ์ง€์˜ ํŠน์ง•๋ฒกํ„ฐ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ธ๋ฑ์Šค ๋‚ด์—์„œ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ถ”์ถœ๊ฐ€๋Šฅํ•˜๋‹ค.
๋‹ค๋งŒ, ์ธ๋ฑ์Šค์— ์ €์žฅ๋œ ๋ฒกํ„ฐ์— ๋Œ€ํ•ด์„œ๋งŒ ๊ฒ€์ƒ‰์ด ๊ฐ€๋Šฅํ•˜๊ธฐ์— ๊ฒ€์ƒ‰๋ฒ”์œ„๋ฅผ ํ™•์žฅํ•˜๊ณ ์ž ํ•œ๋‹ค๋ฉด ๋” ๋งŽ์€ ๋ฒกํ„ฐ๋ฅผ ์ธ๋ฑ์Šค์— ์ถ”๊ฐ€ํ•ด์•ผํ•œ๋‹ค.

์œ„ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ์žˆ๋Š”๋ฐ, FAISS ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์ธ๋ฑ์Šค์œ ํ˜•๋“ค์„ ์ œ๊ณตํ•œ๋‹ค:

index = faiss.IndexFlatL2(dimension)

 

์ด๋ฆ„ ์ •ํ™•๋„ ์†๋„ ํŠน์ง•
IndexFlatL2 ๊ฐ€์žฅ ๋†’์Œ ๊ฐ€์žฅ ๋Š๋ฆผ ๋ชจ๋“  ๋ฒกํ„ฐ์— ๋Œ€ํ•œ ์™„์ „ํƒ์ƒ‰์„ ์ˆ˜ํ–‰
IndexHNSW ๋†’์Œ ๋ณดํ†ต ๊ทธ๋ž˜ํ”„ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ด ํšจ์œจ์  ๊ฒ€์ƒ‰
IndexIVFlat ๋ณดํ†ต ๊ฐ€์žฅ ๋น ๋ฆ„ ๋ฒกํ„ฐ๊ฐ„ clustering์œผ๋กœ ํƒ์ƒ‰๋ฒ”์œ„๋ฅผ ์ค„์—ฌ ๊ฒ€์ƒ‰
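์•„๋ž˜๋Š” IndexIVFFlat์„ ์‚ฌ์šฉํ•˜๋Š” ์ตœ์†Œ ์Šค์ผ€์น˜๋‹ค. (์•ž์˜ ์˜ˆ์ œ์™€ ๊ฐ™์€ [๋ฒกํ„ฐ ๊ฐœ์ˆ˜, ์ฐจ์›] ํ˜•ํƒœ์˜ ndarray๋ฅผ ๊ฐ€์ •ํ•ด ๋‚œ์ˆ˜๋กœ ๋Œ€์ฒดํ–ˆ๋‹ค.)

import faiss
import numpy as np

vectors = np.random.rand(100, 512).astype("float32")   # ๊ฐ€์ •: ์ถ”์ถœ๋œ ํŠน์ง•๋ฒกํ„ฐ ๋Œ€์‹  ๋‚œ์ˆ˜ ์‚ฌ์šฉ
dimension, nlist = vectors.shape[-1], 10               # nlist: cluster ๊ฐœ์ˆ˜

quantizer = faiss.IndexFlatL2(dimension)               # cluster ์ค‘์‹ฌ ํƒ์ƒ‰์šฉ ๊ธฐ๋ณธ ์ธ๋ฑ์Šค
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.train(vectors)                                   # clustering ํ•™์Šต์ด ๋จผ์ € ํ•„์š”
index.add(vectors)

index.nprobe = 3                                       # ๊ฒ€์ƒ‰ ์‹œ ํƒ์ƒ‰ํ•  cluster ์ˆ˜
distances, indices = index.search(vectors[:1], 5)
print(indices, distances)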

 

 

 

 

 

 

Multi-Modal

Image Captioning (img2txt)

BLIP

BLIP์˜ ํ•ต์‹ฌ์•„์ด๋””์–ด๋Š” "img์™€ Txt์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ๋ชจ๋ธ๋งํ•˜๋Š” ๊ฒƒ"์ด๋‹ค.
์ด๋ฅผ ์œ„ํ•ด img-encoder, txt-encoder๋กœ ๊ฐ๊ฐ์˜ feature vector๋ฅผ ์—ฐ๊ฒฐํ•ด ํ†ตํ•ฉ ํ‘œํ˜„์„ ์ƒ์„ฑํ•œ๋‹ค.

BLIP-2 ๊ตฌ์กฐ

BLIP2๋Š” Q-Former๋ฅผ ๋„์ž…ํ•ด img-txt๊ฐ„ ์ƒํ˜ธ์ž‘์šฉ๊ณผ ์ •๋ณด๊ตํ™˜์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค:
[img-txt๋Œ€์กฐํ•™์Šต, ITM, img๊ธฐ๋ฐ˜ txt์ƒ์„ฑ] --> ๋™์‹œ์— ํ•˜๋‚˜์˜ Encode-Decoder๊ตฌ์กฐ๋กœ ์ˆ˜ํ–‰
Q-Former๋Š” ์ž…๋ ฅ์œผ๋กœ ๊ณ ์ •๋œ ์ด๋ฏธ์ง€ feature embedding์„ ๋ฐ›์€ ํ›„
img-txt๊ด€๊ณ„๊ฐ€ ์ž˜ ํ‘œํ˜„๋œ Soft visual prompt Embedding์„ ์ถœ๋ ฅํ•œ๋‹ค.
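์•„๋ž˜๋Š” ๊ณต๊ฐœ BLIP captioning checkpoint๋กœ img2txt๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ์ตœ์†Œ ์˜ˆ์‹œ๋‹ค. (Salesforce/blip-image-captioning-base์™€ cats-image dataset์€ ์„ค๋ช…์„ ์œ„ํ•œ ์˜ˆ์‹œ๋‹ค.)

from datasets import load_dataset
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = load_dataset("huggingface/cats-image")["test"]["image"][0]
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # ex) "two cats laying on a couch"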

DocumentQA

DQA(DocumentQA)๋Š” ์ž์—ฐ์–ด์ฒ˜๋ฆฌ + ์ •๋ณด๊ฒ€์ƒ‰๊ธฐ์ˆ ์„ ์œตํ•ฉํ•ด QA๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
DQA๋Š” ์‹œ๊ฐ์  ๊ตฌ์กฐ์™€ Layout์„ ๊ณ ๋ คํ•ด์•ผํ•˜๋Š”๋ฐ, ์ด ์ค‘ ๊ฐ€์žฅ ์ฃผ๋ชฉ๋ฐ›๋Š” ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๊ฐ€ ๋ฐ”๋กœ LayoutLM์ด๋‹ค.

LayoutLM (Layout-aware Language Model)

LayoutLM์€ Microsoft์—์„œ ๋ฌธ์„œ ์ด๋ฏธ์ง€์˜ txt๋ฟ๋งŒ์•„๋‹ˆ๋ผ Layout์ •๋ณด๊นŒ์ง€ ํ•จ๊ป˜ Pre-Train๋œ ๋ชจ๋ธ์ด๋‹ค.

[LayoutLMv1]

BERT๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ txt์™€ ํ•จ๊ป˜ txt์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.
OCR๋กœ txt์™€ bbox ๊ฒฝ๊ณ„๋ฅผ ์ถ”์ถœํ•ด position embedding์œผ๋กœ ์ถ”๊ฐ€ํ•˜๋ฉฐ, Faster R-CNN ๊ฐ™์€ ๋ชจ๋ธ๋กœ ์–ป์€ ๋‹จ์–ด์˜ image patch(feature)๋„ ๋ชจ๋ธ์— ์ž…๋ ฅํ•œ๋‹ค. ๋‹ค๋งŒ, LayoutLMv1์€ image feature๊ฐ€ ๋งจ ๋งˆ์ง€๋ง‰์— ์ถ”๊ฐ€๋˜์–ด Pre-train ์‹œ์—๋Š” ์‹ค์ œ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.




[LayoutLMv2]

LayoutLMv2๋Š” image embedding์„ ์ถ”๊ฐ€๋กœ ๋„์ž…ํ•ด ๋ฌธ์„œ์˜ ์‹œ๊ฐ์  ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๋ฐ˜์˜ํ•œ๋‹ค.
LayoutLMv2์—์„œ visual embedding์ด ResNeXT-FPN์œผ๋กœ ์ถ”์ถœ๋œ๋‹ค.
์ฆ‰, txt, img-patch, layout์ •๋ณด๋ฅผ ๋™์‹œ์— ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ Self-Attention์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
- ํ•™์Šต ์ฃผ์š” ๋ชฉํ‘œ:
i) Masked Visual-Language Modeling: ๋ฌธ์žฅ์˜ ๋นˆ์นธ ์˜ˆ์ธก
ii) ITM: ํŠน์ • ํ…์ŠคํŠธ์™€ ํ•ด๋‹น ์ด๋ฏธ์ง€๊ฐ„์˜ ์—ฐ๊ด€์„ฑ ํ•™์Šต
iii)Text-Image Alignment: ์ด๋ฏธ์ง€์—์„œ ํŠน์ • ๋‹จ์–ด๊ฐ€ ๊ฐ€๋ ค์กŒ์„ ๋•Œ, ๊ทธ ์œ„์น˜๋ฅผ ์‹๋ณ„ํ•˜๋Š” ๋Šฅ๋ ฅ


[LayoutLMv3]

์ขŒ)DocFormer , ์šฐ)LayoutLMv3

LayoutLMv3๋Š” Faster R-CNN, CNN๋“ฑ์˜ Pre-Trained Backbone์— ์˜์กดํ•˜์ง€ ์•Š๋Š” ์ตœ์ดˆ์˜ ํ†ตํ•ฉ MLLMs์ด๋‹ค.
์ด๋ฅผ ์œ„ํ•ด ์ „๊ณผ ๋‹ฌ๋ฆฌ ์ƒˆ๋กœ์šด ์‚ฌ์ „ํ•™์Šต์ „๋žต ๋ฐ ๊ณผ์ œ๋ฅผ ๋„์ž…ํ•˜์˜€๋‹ค:
i) Masked Language Modeling(MLM): ์ผ๋ถ€ ๋‹จ์–ด token ๋งˆ์Šคํ‚น
ii) Masked Image Modeling(MIM): ๋งˆ์Šคํ‚น๋œ token์— ํ•ด๋‹นํ•˜๋Š” ์ด๋ฏธ์ง€ ๋ถ€๋ถ„์„ ๋งˆ์Šคํ‚น
iii) Word Patch Alignment(WPA): img token๊ณผ ๋Œ€์‘๋˜๋Š” Txt token์˜ ๋งˆ์Šคํ‚น์—ฌ๋ถ€๋ฅผ ์ด์ง„๋ถ„๋ฅ˜, ๋‘ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ๊ฐ„ ์ •๋ ฌ์„ ํ•™์Šต

<LayoutLMv3 ๊ตฌ์กฐ>: embedding๊ณ„์ธต, patch_embedding๋ชจ๋“ˆ, encoder
1) embedding์ธต์€ ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ Embedding์„ ํ†ตํ•ฉ:
 - word_embed + token_type_emb + pos_emb + (x_pos_emb , y_pos_emb, h_pos_emb, w_pos_emb)

2) patch_embedding๋ชจ๋“ˆ์€ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌ:
 - patch๋กœ ๋ถ„ํ• ํ•˜๊ณ  ๊ฐ patch๋ฅผ embedding์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ViT์—ญํ• 

3) encoder
 - ์—ฌ๋Ÿฌ Transformer์ธต์œผ๋กœ ๊ตฌ์„ฑ.
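์•„๋ž˜๋Š” LayoutLM ๊ธฐ๋ฐ˜ ๊ณต๊ฐœ checkpoint๋กœ DocumentQA๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” pipeline ์Šค์ผ€์น˜๋‹ค. (impira/layoutlm-document-qa๋Š” ์˜ˆ์‹œ ๋ชจ๋ธ์ด๊ณ , invoice.png๋Š” ๊ฐ€์ƒ์˜ ๋กœ์ปฌ ๋ฌธ์„œ ์ด๋ฏธ์ง€๋‹ค. OCR์„ ์œ„ํ•ด pytesseract ์„ค์น˜๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ๋‹ค.)

from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
result = doc_qa(image="invoice.png", question="What is the invoice number?")  # ๊ฐ€์ •: ๋กœ์ปฌ ๋ฌธ์„œ ์ด๋ฏธ์ง€ ๊ฒฝ๋กœ
print(result)  # [{'score': ..., 'answer': ..., 'start': ..., 'end': ...}]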

 

VQA

VQA process: ์‹œ๊ฐ์  ํŠน์ง• ์ถ”์ถœ → Q์˜๋ฏธํŒŒ์•… →์‹œ๊ฐ์ ํŠน์ง•๊ณผ Q์˜ txt์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•ด ์˜๋ฏธ์žˆ๋Š” ํ‘œํ˜„(A)์ƒ์„ฑ
์ด๋ฅผ ์œ„ํ•ด ๋“ฑ์žฅํ•œ ๊ฒƒ์ด ๋ฐ”๋กœ ViLT์ด๋‹ค.

ViLT (Vision-and-Language Transformer)

์‹œ๊ฐ์  ์ž…๋ ฅ์„ txt์ž…๋ ฅ๊ณผ ๋™์ผํ•œ ๋ฐฉ์‹์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹จ์ผ๋ชจ๋ธ๊ตฌ์กฐ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.
์ด๋•Œ, ๋‘ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ๊ตฌ๋ถ„์„ ์œ„ํ•ด ๋ชจ๋‹ฌํƒ€์ž… embedding์ด ์ถ”๊ฐ€๋˜๋ฉฐ,
ํ•™์Šต๊ณผ์ •์—์„œ 3๊ฐ€์ง€ ์†์‹คํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ์ด๋ค„์ง„๋‹ค:
- ITM: ์ฃผ์–ด์ง„ Image์™€ Text๊ฐ€ ์„œ๋กœ ์—ฐ๊ด€๋˜์–ด์žˆ๋Š”์ง€ ํŒ๋‹จ.
- MLM: ๋‹จ์–ด๋‹จ์œ„์˜ Masking์œผ๋กœ ์ „์ฒด ๋‹จ์–ด๋งฅ๋ฝ ํŒŒ์•…
- WPA: img-txt๊ฐ„ ๋ฒกํ„ฐ ์œ ์‚ฌ๋„ ์ตœ๋Œ€ํ™”


๊ฒฐ๊ณผ์ ์œผ๋กœ img+txt๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•ด, ๋‹จ์ผ embedding๊ณต๊ฐ„์— ํ‘œํ˜„ํ•œ๋‹ค.

cf) collate_fn์€ pytorch์˜ dataloader๋กœ batch๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ, ๊ฐ sample์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•  ๊ฒƒ์ธ์ง€ ์ •์˜ํ•˜๋Š” ํ•จ์ˆ˜๋‹ค.

Image Generation

์ด๋ฏธ์ง€ ์ƒ์„ฑ์€ prompt๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ดํ•ดํ•˜์—ฌ GAN์ด๋‚˜ Diffusion Model์„ ์ด์šฉํ•ด prompt์˜ ์„ธ๋ถ€์  ํŠน์ง•์„ ์ž˜ ์žก์•„๋‚ด ์ƒˆ๋กœ์šด img๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ธฐ์ˆ ์„ ์˜๋ฏธํ•œ๋‹ค.

Diffusion Model

[Forward process]: src_img์— ์ ์ง„์  ์ •๊ทœ๋ถ„ํฌ Noise ์ถ”๊ฐ€
[Reverse process]: pure_noise์—์„œ ์›๋ณธ์œผ๋กœ ๋ณต์›(by ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ ๊ฐฑ์‹ )


[Stable-Diffusion 1]
- 512×512 img ์ƒ์„ฑ
- txt2img, img2img, inpainting ๋“ฑ์˜ ๊ธฐ๋Šฅ


[Stable-Diffusion 2]
- 768×768 img ์ƒ์„ฑ
- OpenCLIP์œผ๋กœ ๋” ๋‚˜์€ WPA ์ œ๊ณต, ์„ธ๋ถ€์  ๋ฌ˜์‚ฌ ๊ฐœ์„ 


[Stable-Diffusion 3]
- ๋”์šฑ ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€ ์ƒ์„ฑ
- Rectified flow๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ๋ชจ๋ธ๊ตฌ์กฐ
- txt์™€ img token๊ฐ„ ์–‘๋ฐฉํ–ฅ ์ •๋ณดํ๋ฆ„์„ ๊ฐ€๋Šฅํ•˜๊ฒŒํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ๊ตฌ์กฐ
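์•„๋ž˜๋Š” diffusers๋กœ txt2img๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ์ตœ์†Œ ์Šค์ผ€์น˜๋‹ค. (runwayml/stable-diffusion-v1-5๋Š” ๊ณต๊ฐœ checkpoint ์˜ˆ์‹œ์ด๋ฉฐ, GPU ํ™˜๊ฒฝ์„ ๊ฐ€์ •ํ•œ๋‹ค.)

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]  # Reverse process๋กœ noise์—์„œ ์ด๋ฏธ์ง€ ๋ณต์›
image.save("sd_result.png")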

 

 

 

 

 

 

 

etc

Hyperparameter Tuning - ray tune

raytune์€ ๋ถ„์‚ฐ hypereparameter ์ตœ์ ํ™” framework์ด๋‹ค.
๋Œ€๊ทœ๋ชจ ๋ถ„์‚ฐ์ปดํ“จํŒ… ํ™˜๊ฒฝ์—์„œ ๋‹ค์–‘ํ•œ hyperparameter ํƒ์ƒ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜(random, greedy ๋“ฑ)์„ ์ง€์›ํ•˜๋ฉฐ, Early Stopping ๋˜ํ•œ ์ œ๊ณตํ•œ๋‹ค.
์ถ”๊ฐ€์ ์œผ๋กœ ์‹คํ—˜๊ฒฐ๊ณผ ์ถ”์  ๋ฐ ์‹œ๊ฐํ™” ๋„๊ตฌ ๋˜ํ•œ ์ œ๊ณตํ•˜๋ฉฐ, ์ตœ์ ์˜ hyperparameter ์กฐํ•ฉ ๋˜ํ•œ ํšจ๊ณผ์ ์œผ๋กœ ์‹๋ณ„ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋„์™€์ค€๋‹ค.
!pip3 install ray[tune] optuna


ex) NER RayTune ์˜ˆ์ œ

i) ํ•™์Šต ์ค€๋น„

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

def preprocess_data(example, tokenizer):
    sentence = "".join(example["tokens"]).replace("\xa0", " ")
    encoded = tokenizer(
        sentence,
        return_offsets_mapping=True,
        add_special_tokens=False,
        padding=False,
        truncation=False
    )

    labels = []
    for offset in encoded.offset_mapping:
        if offset[0] == offset[1]:
            labels.append(-100)
        else:
            labels.append(example["ner_tags"][offset[0]])
    encoded["labels"] = labels
    return encoded

dataset = load_dataset("klue", "ner")
labels = dataset["train"].features["ner_tags"].feature.names

model_name = "Leo97/KoELECTRA-small-v3-modu-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    ignore_mismatched_sizes=True
)

processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=False,
    remove_columns=dataset["train"].column_names
)


ii) hyperparameter ํƒ์ƒ‰

from ray import tune
from functools import partial
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForTokenClassification

def model_init(model_name, labels):
    return AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(labels), ignore_mismatched_sizes=True
    )

def hp_space(trial):
    return {
        "learning_rate": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-5, 1e-1),
        "num_train_epochs": tune.choice([1, 2, 3])
    }

def compute_objective(metrics):
    return metrics["eval_loss"]

training_args = TrainingArguments(
    output_dir="token-classification-hyperparameter-search",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # learning_rate=1e-4,
    # weight_decay=0.01,
    # num_train_epochs=5,
    seed=42
)

trainer = Trainer(
    model_init=partial(model_init, model_name=model_name, labels=labels),
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)
)

best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=5,
    direction="minimize",
    hp_space=hp_space,
    compute_objective=compute_objective,
    resources_per_trial={"cpu": 2, "gpu": 1},
    trial_dirname_creator=lambda trial: str(trial)
)
print(best_run.hyperparameters)


model_init ํ•จ์ˆ˜: ๋ชจ๋ธ ์ธ์Šคํ„ด์Šค ์ƒ์„ฑ (์—ฌ๋Ÿฌ ์‹คํ—˜์„ ํ†ตํ•ด ์ตœ์ ์˜ hyperparameter ํƒ์ƒ‰ํ•˜๊ฒŒ ํ• ๋‹น๋จ.)
์ฆ‰, ๊ฐ ์‹คํ—˜๋งˆ๋‹ค ์ผ๊ด€๋œ ์ดˆ๊ธฐ์ƒํƒœ๋ฅผ ๋ณด์žฅํ•จ.

hp_space ํ•จ์ˆ˜: ์ตœ์ ํ™” ๊ณผ์ •์—์„œ ํƒ์ƒ‰ํ•  hyperparameter ์ข…๋ฅ˜์™€ ๊ฐ’์˜ ๋ฒ”์œ„ ์ง€์ •.

compute_objective ํ•จ์ˆ˜: ์ตœ์ ํ™” ๊ณผ์ •์—์„œ ์‚ฌ์šฉํ•  "ํ‰๊ฐ€์ง€ํ‘œ"๋กœ ๋ณดํ†ต eval_loss๋‚˜ eval_acc๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์„ค์ •.

TrainingArguments ํ•จ์ˆ˜: lr, weight_decay, train_epochs๊ฐ€ hp_space์—์„œ ํƒ์ƒ‰๋˜๊ธฐ์— ๋”ฐ๋กœ ํ• ๋‹นX

Trainer ํ•จ์ˆ˜: ๊ณ ์ •๋œ ๋ชจ๋ธ์ธ์Šคํ„ด์Šค๊ฐ€ ์•„๋‹Œ, model_init์„ ์‚ฌ์šฉ.

์ถœ๋ ฅ ์˜ˆ์‹œ)

+-------------------------------------------------------------------+
| Configuration for experiment     _objective_2024-11-18_05-44-18   |
+-------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator            |
| Scheduler                        FIFOScheduler                    |
| Number of trials                 5                                |
+-------------------------------------------------------------------+

View detailed results here: /root/ray_results/_objective_2024-11-18_05-44-18
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-11-18_05-44-11_866890_872/artifacts/2024-11-18_05-44-18/_objective_2024-11-18_05-44-18/driver_artifacts`

Trial status: 5 PENDING
Current time: 2024-11-18 05:44:18. Total running time: 0s
Logical resource usage: 0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-------------------------------------------------------------------------------------------+
| Trial name               status       learning_rate     weight_decay     num_train_epochs |
+-------------------------------------------------------------------------------------------+
| _objective_27024_00000   PENDING        2.36886e-05       0.0635122                     3 |
| _objective_27024_00001   PENDING        6.02131e-05       0.00244006                    2 |
| _objective_27024_00002   PENDING        1.43217e-05       1.7074e-05                    1 |
| _objective_27024_00003   PENDING        3.99131e-05       0.00679658                    2 |
| _objective_27024_00004   PENDING        1.13871e-05       0.00772672                    2 |
+-------------------------------------------------------------------------------------------+

Trial _objective_27024_00000 started with configuration:
+-------------------------------------------------+
| Trial _objective_27024_00000 config             |
+-------------------------------------------------+
| learning_rate                             2e-05 |
| num_train_epochs                              3 |
| weight_decay                            0.06351 |
+-------------------------------------------------+

...

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ๋Š” ๋ชจ๋ธ ์ตœ์ ํ™”๋ฐฉ์‹์œผ๋กœ LLM์˜ ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ๊ฐ€๋Šฅํ•˜๋‹ค.
๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋‚ฎ์€ bit์ •๋ฐ€๋„๋กœ ์–‘์žํ™”ํ•ด ๋ชจ๋ธํฌ๊ธฐ๋ฅผ ์ค„์ด๊ณ  ์ถ”๋ก ์†๋„๋ฅผ ๋†’์ธ๋‹ค.
์•„๋ž˜ ์˜ˆ์ œ์˜ ์ถœ๋ ฅ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ, ๋ชจ๋ธ ํฌ๊ธฐ๋ฅผ ์ƒ๋‹นํžˆ ํฐ ํญ์œผ๋กœ ์ค„์ผ ์ˆ˜ ์žˆ๋Š”๋ฐ, 
GPTQ๋ฐฉ๋ฒ•์€ GPT๊ณ„์—ด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋‹ค๋ฅธ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค ๋ชจ๋‘ ์ ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค.

GPTQ๋ฅผ ์ด์šฉํ•œ ๋ชจ๋ธ ์–‘์žํ™” ์˜ˆ์ œ

from transformers import GPTQConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
from transformers import pipeline

origin_generator = pipeline("text-generation", model="facebook/opt-125m")
quantized_generator = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

input_text_list = [
    "In the future, technology wil",
    "What are we having for dinner?",
    "What day comes after Monday?"
]

print("์›๋ณธ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:")
for input_text in input_text_list:
    print(origin_generator(input_text))
print("์–‘์žํ™” ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:")
for input_text in input_text_list:
    print(quantized_generator(input_text))
    
# ์›๋ณธ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:
# [{'generated_text': 'In the future, technology wil be used to make the world a better place.\nI think'}]
# [{'generated_text': 'What are we having for dinner?\n\nWe have a great dinner tonight. We have a'}]
# [{'generated_text': "What day comes after Monday?\nI'm guessing Monday."}]
# ์–‘์žํ™” ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ:
# [{'generated_text': 'In the future, technology wil be able to make it possible to make a phone that can be'}]
# [{'generated_text': "What are we having for dinner?\n\nI'm not sure what to do with all this"}]
# [{'generated_text': "What day comes after Monday?\nI'm not sure, but I'll be sure to check"}]

์ถœ๋ ฅ๊ฒฐ๊ณผ, ์ •ํ™•๋„๊ฐ€ ๋‹ค์†Œ ๋–จ์–ด์ง€๊ธด ํ•˜๋‚˜ ์›๋ณธ๋ชจ๋ธ๊ณผ ํฐ ์ฐจ์ด๊ฐ€ ์—†์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

import time
import numpy as np

def measure_inference_time(generator, input_text, iterations=10):
    times = []
    for _ in range(iterations):
        start_time = time.time()
        generator(input_text)
        end_time = time.time()
        times.append(end_time - start_time)
    avg_time = np.mean(times)
    return avg_time

def calculate_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    total_memory = sum(p.numel() * p.element_size() for p in model.parameters())
    total_memory_mb = total_memory / (1024 ** 2)
    return total_memory_mb, total_params

test_input = "Once upon a time in a land far, far away, there was a small village."

size_original, total_params_original = calculate_model_size(origin_generator.model)
avg_inference_time_original = measure_inference_time(origin_generator, test_input)

size_quantized, total_params_quantized = calculate_model_size(quantized_generator.model)
avg_inference_time_quantized = measure_inference_time(quantized_generator, test_input)

print("์›๋ณธ ๋ชจ๋ธ:")
print(f"- ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: {total_params_original:,}")
print(f"- ๋ชจ๋ธ ํฌ๊ธฐ: {size_original:.2f} MB")
print(f"- ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: {avg_inference_time_original:.4f} sec")

print("์–‘์žํ™” ๋ชจ๋ธ:")
print(f"- ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: {total_params_quantized:,}")
print(f"- ๋ชจ๋ธ ํฌ๊ธฐ: {size_quantized:.2f} MB")
print(f"- ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: {avg_inference_time_quantized:.4f} sec")

# ์›๋ณธ ๋ชจ๋ธ:
# - ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: 125,239,296
# - ๋ชจ๋ธ ํฌ๊ธฐ: 477.75 MB
# - ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: 0.1399 sec
# ์–‘์žํ™” ๋ชจ๋ธ:
# - ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐœ์ˆ˜: 40,221,696
# - ๋ชจ๋ธ ํฌ๊ธฐ: 76.72 MB
# - ํ‰๊ท  ์ถ”๋ก  ์‹œ๊ฐ„: 0.0289 sec

์ถ”๋ก  ๊ณผ์ •์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, ์–‘์žํ™” ๋ชจ๋ธ์€ ์›๋ณธ ๋ชจ๋ธ์— ๋น„ํ•ด ํฌ๊ธฐ๊ฐ€ ํฌ๊ฒŒ ์ค„๊ณ  ์ฒ˜๋ฆฌ ์†๋„๊ฐ€ ๋นจ๋ผ์ ธ, ์‹ค์‹œ๊ฐ„ ์‘๋‹ต์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ ๋งค์šฐ ํšจ์œจ์ ์ผ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
