this.code();

HuggingFace transformers: A Summary for Beginners

V2LLAIN — Mon, 18 Nov 2024 14:54:30 +0900

Preview:

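Machine Learning vs Deep Learning

Machine learning: a subfield of AI that learns patterns from data. (It can work even with a relatively small amount of structured data.)

Deep learning: a subfield of machine learning that relies on complex architectures and needs large amounts of compute and data.

Transformer (Attention Is All You Need, 2017)

Core ideas of the Transformer:

∙ Processes the input sequence in parallel
∙ Uses only the attention mechanism (self-attention)
∙ No sequential processing, recurrent connections, or recursion ❌

Transformer architecture:
∙ Embedding: token to dense vector (the model is trained so that semantic similarity between words is preserved.)
∙ Positional Encoding: supplies the position of each token in the input sequence
∙ Encoder & Decoder: take the Embedding + PE vectors as input (the vector values come from pretrained weights or are optimized during training.)
- MHA & FFN: multi-head attention learns the relations between tokens, and the FFN extracts a feature vector for each word (each head has its own weights, so different heads can capture different aspects of the input sequence.)
- QKV: Query (vector for what the current position is looking for), Key (vector describing each position), Value (vector holding the content of each position)

ex) The student studies at the home
query: student
--> Q: the vector used to look up information for "student"
--> K: the vectors of every token in the sentence (The, student, studies, at, the, home)
--> V: the vectors carrying each token's content, such as context and meaning
--> With 3 heads: each head can specialize, e.g. in syntax, semantics, or pragmatics.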
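
The Q/K/V roles above can be sketched as single-head scaled dot-product attention. This is a minimal NumPy sketch, not the library implementation: the embeddings and the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins chosen for illustration.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q @ k.T / sqrt(d_k)) @ v for a single head."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (n_query, n_key)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = ["The", "student", "studies", "at", "the", "home"]
d_model = 8
x = rng.normal(size=(len(tokens), d_model))  # toy embeddings (random, hypothetical)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q = x[1:2] @ w_q   # query for "student"
k = x @ w_k        # keys for every token, "student" included
v = x @ w_v        # values for every token
context, weights = scaled_dot_product_attention(q, k, v)
print(weights.round(3))  # one weight per token, summing to 1
print(context.shape)     # (1, 8)
```

A multi-head layer would repeat this with separate projection matrices per head and concatenate the results.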

Huggingface Transformers

Library Overview

Tokenizer
Usually splits text into subword tokens (tokenization).
It also provides preprocessing features such as text normalization, stop-word removal, padding, and truncation.
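
To make "subword" concrete, here is a toy greedy longest-match splitter in the spirit of WordPiece. It is only a sketch: the function name and the five-entry vocabulary are made up for illustration, and real tokenizers learn their vocabulary from a corpus and apply normalization first.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:       # no piece matches: the whole word becomes unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"token", "##izer", "##s", "learn", "##ing"}  # hypothetical vocabulary
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("learning", toy_vocab))    # ['learn', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```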

Diffusers Library

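A library for text-to-image generation that supports a range of generative models such as Stable Diffusion, DALL-E, and LDM.
- Provides diffusion-based algorithms such as DDPM, DDIM, and LDM
- Supports batch inference, parallelism, mixed-precision training, and more

Accelerate

Abstracts distributed-training strategies behind a simple API and handles lower-precision mixed-precision training (FP16, BF16) automatically.
- Distributed training: supports data parallelism, model parallelism, and so on.
- Automatic mixed precision: mixes data types such as FP16 and FP32 automatically, lowering memory use and raising speed.

- Gradient Accumulation: accumulates the gradients of several mini-batches to get the effect of a larger batch
- Gradient Checkpointing: recomputes intermediate activations when they are needed instead of storing them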
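
Why gradient accumulation gives "the effect of a larger batch" can be checked numerically. The sketch below uses a plain linear model with a mean-squared-error loss (an assumption made for illustration, not Accelerate's API): averaging the gradients of four equal micro-batches reproduces the full-batch gradient.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y)**2) with respect to w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = np.zeros(4)

full_grad = mse_grad(w, x, y)  # one step on the full batch of 32

accum_steps = 4
accumulated = np.zeros_like(w)
for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    # each micro-batch gradient is scaled by 1/accum_steps before being summed
    accumulated += mse_grad(w, xb, yb) / accum_steps

print(np.allclose(full_grad, accumulated))  # True
```

The equality holds because the micro-batches have the same size; only one micro-batch needs to fit in memory at a time.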

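Model Configuration
The model configuration class stores the model architecture and hyperparameter values as a dictionary in a JSON file.
So when a model is loaded, these values are loaded together with the model weights. (as in the image below)

PretrainedConfig & ModelConfig

Likewise, these hold a dictionary that stores the model architecture and hyperparameters.
[Example arguments]:
- vocab_size: number of unique tokens the model can recognize
- output_hidden_states: whether to output all of the model's hidden states
- output_attentions: whether to output all of the model's attention values
- return_dict: whether the model returns a ModelOutput object instead of a plain tuple.
Each model architecture comes with its own configuration class that inherits from PretrainedConfig.
(ex. BertConfig, GPT2Config, or as in the image below...)

Example of instantiating and setting InternVisionConfig directly

Because wrong configuration values can badly hurt model performance, the usual approach is to load validated pretrained settings with "from_pretrained".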
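
The "dictionary stored in a JSON file" idea can be sketched with the standard library alone. `ToyConfig` below is a hypothetical stand-in written for this note, not a transformers class; it only mimics the round trip between a config object and a `config.json` file.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass
class ToyConfig:  # hypothetical stand-in, not part of transformers
    vocab_size: int = 30522
    hidden_size: int = 768
    output_hidden_states: bool = False
    return_dict: bool = True

    def save(self, directory):
        """Write the hyperparameters as a JSON dictionary."""
        path = Path(directory) / "config.json"
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def load(cls, directory):
        """Rebuild the config from the stored JSON dictionary."""
        return cls(**json.loads((Path(directory) / "config.json").read_text()))

with tempfile.TemporaryDirectory() as tmp:
    ToyConfig(output_hidden_states=True).save(tmp)
    restored = ToyConfig.load(tmp)

print(restored)
```

Loading validated values from a file rather than typing them by hand is the same motivation the text gives for using from_pretrained.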

PreTrainedTokenizer & ModelTokenizer & PreTrainedTokenizerFast

[Example arguments]:
- max_model_input_sizes: maximum input length of the model
- model_max_length: maximum model input length that the tokenizer uses
(That is, the tokenizer's model_max_length must not exceed the model's max_model_input_sizes for the model to process the input correctly.)
- padding_side/truncation_side: which side (left/right) padding/truncation is applied to
- model_input_names: list of tensors fed to the forward pass (ex. input_ids, attention_mask, token_type_ids)

cf) The decode method restores a sequence of token ids to the original sentence.
cf) PreTrainedTokenizerFast is a faster tokenizer implemented in Rust and called through a Python wrapper.
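
What padding_side and truncation_side change can be shown with a small pure-Python helper. `pad_or_truncate` is a hypothetical function written for this note; real tokenizers also keep special tokens such as [CLS] and [SEP] in place when truncating, which this sketch does not.

```python
def pad_or_truncate(ids, max_length, pad_id=0, padding_side="right", truncation_side="right"):
    """Return (input_ids, attention_mask) with exactly max_length entries."""
    if len(ids) > max_length:
        # drop tokens from the chosen side
        ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
    n_pad = max_length - len(ids)
    mask = [1] * len(ids)
    if padding_side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    return [pad_id] * n_pad + ids, [0] * n_pad + mask

ids = [101, 1045, 2572, 102]
print(pad_or_truncate(ids, 6))                          # ([101, 1045, 2572, 102, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate(ids, 6, padding_side="left"))     # ([0, 0, 101, 1045, 2572, 102], [0, 0, 1, 1, 1, 1])
print(pad_or_truncate(ids, 3, truncation_side="left"))  # ([1045, 2572, 102], [1, 1, 1])
```

Left padding is the usual choice for decoder-only generation, since the newest tokens then stay at the right edge of the batch.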

Datasets

Dataset upload example:
images directory layout:
images
⎿ A.jpg
⎿ B.jpg
  ...

import os
from collections import defaultdict
from datasets import Dataset, Image, DatasetDict

data = defaultdict(list)
folder_name = '../images'

for file_name in os.listdir(folder_name):
    name = os.path.splitext(file_name)[0]
    path = os.path.join(folder_name, file_name)
    data['name'].append(name)
    data['image'].append(path)

dataset = Dataset.from_dict(data).cast_column('image', Image())
# print(data, dataset[0]) # for checking

dataset_dict = DatasetDict({
    'train': dataset.select(range(5)),
    'valid': dataset.select(range(5, 10)),
    'test': dataset.select(range(10, len(dataset)))
    }
)

hub_name = "<user_name>/<repo_name>" # dataset repository path on the Hub
token = "hf_###..." # your Hugging Face token
dataset_dict.push_to_hub(hub_name, token=token)

The Embedding Process, Fully Explained!!
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

txt = "I am laerning about tokenizers."
inputs = tokenizer(txt, return_tensors="pt")  # named `inputs` to avoid shadowing the built-in input()
output = model(**inputs)

print('input:', inputs)
print('last_hidden_state:', output.last_hidden_state.shape)
input: {'input_ids': tensor([[  101,  1045,  2572,  2474, 11795,  2075,  2055, 19204, 17629,  2015,  1012,   102]]), 
        'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
        'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
        
last_hidden_state: torch.Size([1, 12, 768])
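
The input dictionary

input_ids:

Each word and special token converted to its unique integer ID in BERT's vocabulary.

Example: 101 is the [CLS] token, 102 is the [SEP] token.

Full sequence: [CLS] I am laerning about tokenizers. [SEP]

Length: 12 tokens (special tokens included; "laerning" and "tokenizers" are each split into three subword pieces)

token_type_ids:

Indicates which segment each token in the input belongs to.

BERT can distinguish two segments by default (e.g. sentence A and sentence B).

Here there is a single sentence, so every value is 0.

attention_mask:

Indicates whether the model should attend to each token.

1 means the token is real data, 0 means it is a padding token.

There is no padding here, so every value is 1.

last_hidden_state

Shape: [1, 12, 768]

Batch Size (1): one input sentence is processed at a time.

Sequence Length (12): number of tokens in the input sequence (including special tokens).

Hidden Size (768): the BERT-base model produces a 768-dimensional vector representation for each token.

Meaning:

last_hidden_state holds the vector representation of each token from the model's last hidden layer.

These vectors carry contextual information and can be used for various NLP tasks (e.g. classification, named entity recognition).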
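
Two common ways to turn last_hidden_state into one vector per sentence are taking the [CLS] position or averaging the token vectors while ignoring padding. The sketch below uses random NumPy arrays as stand-ins for the model output, and `masked_mean_pool` is a helper name chosen for this note.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, skipping padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1, 12, 768))   # stand-in for output.last_hidden_state
mask = np.ones((1, 12), dtype=np.int64)  # no padding, as in the example above

cls_vector = hidden[:, 0, :]             # representation at the [CLS] position
sentence_vector = masked_mean_pool(hidden, mask)
print(cls_vector.shape, sentence_vector.shape)  # (1, 768) (1, 768)
```

Either vector could then be passed to a small classification head; which one works better depends on the task and the checkpoint.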

Explanation)

ex-1) Embedding lookup table code
import torch

train_data = 'you need to know how to code'

# Build the vocabulary: the set of unique words.
word_set = set(train_data.split())

# Map every word in the vocabulary to a unique integer.
# (Set iteration order is not fixed, so the ids can differ between runs.)
vocab = {word: i+2 for i, word in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1
print(vocab) # e.g. {'need': 2, 'to': 3, 'code': 4, 'how': 5, 'you': 6, 'know': 7, '<unk>': 0, '<pad>': 1}

# Create a table with one row per vocabulary entry.
embedding_table = torch.FloatTensor([[ 0.0,  0.0,  0.0],
                                    [ 0.0,  0.0,  0.0],
                                    [ 0.2,  0.9,  0.3],
                                    [ 0.1,  0.5,  0.7],
                                    [ 0.2,  0.1,  0.8],
                                    [ 0.4,  0.1,  0.1],
                                    [ 0.1,  0.8,  0.9],
                                    [ 0.6,  0.1,  0.1]])

sample = 'you need to run'.split()
idxes = []

# Convert each word to its integer id
for word in sample:
  try:
    idxes.append(vocab[word])
  # Words missing from the vocabulary are replaced with <unk>.
  except KeyError:
    idxes.append(vocab['<unk>'])
idxes = torch.LongTensor(idxes)

# Use each integer as an index to fetch rows from the embedding table.
lookup_result = embedding_table[idxes, :]
print(lookup_result)
print(lookup_result.shape)
# tensor([[0.1000, 0.8000, 0.9000],
#         [0.2000, 0.9000, 0.3000],
#         [0.1000, 0.5000, 0.7000],
#         [0.0000, 0.0000, 0.0000]])
# torch.Size([4, 3])

ex-2) Comparing the embedding lookup table code with nn.Embedding
from torch import nn

train_data = 'you need to know how to code'

# Build the vocabulary: the set of unique words.
word_set = set(train_data.split())

# Map every word in the vocabulary to a unique integer.
vocab = {tkn: i+2 for i, tkn in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3, padding_idx=1)
print(embedding_layer.weight)
print(embedding_layer)

# tensor([[ 0.7830,  0.2669,  0.4363],
#         [ 0.0000,  0.0000,  0.0000],
#         [-1.2034, -0.0957, -0.9129],
#         [ 0.7861, -0.0251, -2.2705],
#         [-0.5167, -0.3402,  1.3143],
#         [ 1.7932, -0.6973,  0.5459],
#         [-0.8952, -0.4937,  0.2341],
#         [ 0.3692,  0.0593,  1.0825]], requires_grad=True)
# Embedding(8, 3, padding_idx=1)

ex-3) Embedding example code
import torch
from torch import nn

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.word_embeddings = nn.Embedding(config.vocab_size, config.emb_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_seq_length, config.emb_size)
        self.token_type_embeddings = nn.Embedding(2, config.emb_size)
        self.LayerNorm = nn.LayerNorm(config.emb_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.dropout)
        
        # position ids (used in the pos_emb lookup table) that we do not want updated through backpropagation
        self.register_buffer("position_ids", torch.arange(config.max_seq_length).expand((1, -1)))

    def forward(self, input_ids, token_type_ids):
        word_emb = self.word_embeddings(input_ids)
        # use only the first seq_len positions so the sum below broadcasts over the batch
        seq_len = input_ids.size(1)
        pos_emb = self.position_embeddings(self.position_ids[:, :seq_len])
        type_emb = self.token_type_embeddings(token_type_ids)

        emb = word_emb + pos_emb + type_emb
        emb = self.LayerNorm(emb)
        emb = self.dropout(emb)

        return emb

NLP

BERT - Classification

NER (Named Entity Recognition)
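Token classification, i.e. the task of assigning a label to every token in a sentence.
BIO tagging serves as a first example:
ex) 인공지능(AI)에서 딥러닝은 머신러닝의 한 분야입니다. (In artificial intelligence (AI), deep learning is a subfield of machine learning.)
--> <인공지능:B-Tech> <(:I-Tech> <AI:I-Tech> <):I-Tech> <에서:O> <딥러닝:B-Tech> <은:O> <머신러닝:B-Tech> <의:O> <한:O> <분야:O> <입니다:O> <.:O>
Here, B stands for Begin (start of an entity), I for Inside (continuation of an entity), and O for Outside (not an entity).
A model frequently used for NER is BERT.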
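
Once a token classifier has produced BIO tags, they still have to be merged back into entity spans. A minimal sketch of that decoding step follows; `bio_to_spans` and the English example sentence are illustrative choices for this note, not output from a model.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (label, text) entity spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])       # a B- tag always opens a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)           # an I- tag extends the open entity
        else:
            if current:                        # O (or a stray I-) closes the open entity
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Deep", "learning", "is", "part", "of", "machine", "learning", "."]
tags = ["B-Tech", "I-Tech", "O", "O", "O", "B-Tech", "I-Tech", "O"]
print(bio_to_spans(tokens, tags))  # [('Tech', 'Deep learning'), ('Tech', 'machine learning')]
```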

BERT - MLM, NSP
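BERT uses a [CLS] token, whose representation is used to understand sentence-level relations (summarization, etc.).
The embedding layer of BERT combines three embeddings:
1. Token Embedding:
- embedding of the input tokens
2. Segment Embedding:
- assigns a fixed number to each sentence so the model can tell them apart.
3. Position Embedding:
- the input is fed in all at once (= not sequentially),
- so the Transformer encoder does not know the order of the tokens.
- Position information is injected to compensate: the original Transformer uses sine and cosine functions, while BERT learns a position embedding table.
Recommended lecture) https://www.youtube.com/watch?app=desktop&v=CiOL2h1l-EE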
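
For reference, the sine/cosine scheme from the original Transformer paper can be sketched in a few lines of NumPy. BERT itself learns its position table (the nn.Embedding in ex-3 above), so this shows the fixed alternative only; an even d_model is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=12, d_model=768)
print(pe.shape)   # (12, 768)
print(pe[0, :4])  # [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings, so no parameters are trained for it.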

BART - Summarization

Abstractive & Extractive Summarization
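Abstractive summarization: understand the source text fully --> generate new sentences that summarize it.
Extractive summarization: select only the most important and relevant sentences from the source and extract them as they are.
(The summary may read unnaturally, and redundancy has to be removed.)

BART (Bidirectional & Auto Regressive Transformers)
It has both an encoder and a decoder, and strengthens the link between them with a shared embedding layer.
The encoder is bidirectional, so each word can attend to both left and right context across the whole sentence,
and the decoder is auto-regressive: it predicts the next word from the words generated so far.
For pre-training it is trained as a denoising autoencoder: the input is corrupted with noise and the model learns to restore it.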

RoBERTa, T5- TextQA

Abstractive & Extractive QA
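Abstractive QA: takes the question and the passage as input and generates a new answer (question - passage - answer)
Extractive QA: extracts the string that answers the question from within the given passage (question - passage - answer span inside the passage)

RoBERTa
Longer max_len and more pre-training data, with dynamic masking as the key addition.
Dynamic Masking: a different masking pattern is generated every epoch. (NSP is dropped.)
Uses BPE tokenization, whereas BERT uses WordPiece.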
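
The difference between static and dynamic masking is only when the mask is sampled. The sketch below is deliberately simplified: it replaces tokens with [MASK] at a fixed probability and leaves out the 80/10/10 replacement rule used in practice; `mask_tokens` is a name made up for this note.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the student studies at home every single day".split()

# Static masking: the pattern is sampled once and then reused in every epoch.
static = mask_tokens(tokens, 0.15, random.Random(0))

# Dynamic masking: a fresh pattern is sampled each time the sequence is seen.
rng = random.Random(0)
for epoch in range(3):
    print(epoch, mask_tokens(tokens, 0.15, rng))
```

With dynamic masking the model sees different prediction targets for the same sentence across epochs, which is the point of the technique.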

T5- Machine Translation

SMT & NMT

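Statistical machine translation: built on source-translation pairs, it captures word order and language patterns --> requires large-scale data
Neural machine translation: learns the relation between the source word sequence and its translation

T5 (Text-To-Text Transfer Transformer)
A task-specific prompt format can be used to steer the model toward the appropriate output.
In other words, it is based on a seq2seq architecture in which a single model handles many NLP tasks.

What sets T5 apart is not the model architecture itself but that "it treats both input and output as text in a seq2seq setup, and pre-trains with unsupervised learning on a large-scale corpus (C4, about 750GB)." This lets it pick up general language patterns and knowledge effectively.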

LLaMA - Text Generation

Seq2Seq & CausalLM
Seq2Seq: encoder-decoder architectures such as the original Transformer, BART, and T5
CausalLM: a decoder-only architecture

LLaMA-3 Family
LLaMA-3 was released in April 2024 and uses GQA (Grouped Query Attention) to speed up inference.
LLaMA-3 performs strongly at both in-context learning and few-shot learning.
In-context learning: the ability of a model to perform a new task on the spot, based only on the input text

In July 2024, LLaMA-3.1 was released. It added tools for AI safety and security, including Prompt Guard, which is meant to guard against prompt injection and flag harmful or inappropriate content.
The LLaMA-3 series also has the following key features:
- RoPE (Rotary Position Embedding): applied to Q and K
- GQA (Grouped Query Attention): query heads are split into groups, and each group shares one set of K, V heads --> efficient inference
- RMS Norm: stable training and computational efficiency
- KV cache: K and V are cached during inference --> computational efficiency
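
The GQA item can be sketched in NumPy: several query heads read from the same K/V head, so fewer K/V tensors have to be computed and cached. Head counts and sizes below are arbitrary illustration values, and the causal mask is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d). Causal mask omitted for brevity."""
    n_q_heads, _, d = q.shape
    group_size = n_q_heads // k.shape[0]
    k = np.repeat(k, group_size, axis=0)  # each K/V head serves group_size query heads
    v = np.repeat(v, group_size, axis=0)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))  # 8 query heads
k = rng.normal(size=(2, 5, 16))  # only 2 K/V heads need to be cached
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

With 8 query heads and 2 K/V heads, the KV cache is a quarter of the size that standard multi-head attention would need.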

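LLaMA-3 optimization techniques: SFT . RLHF . DPO
SFT (Supervised Fine-Tuning): trains the model directly on high-quality QA pairs written by people
RLHF: based on the PPO algorithm; people rank several responses generated by the model, and the model is retrained on those rankings.
DPO (Direct Preference Optimization): reduces the complexity of RLHF while still training effectively. (It learns directly from the human preference rankings, but needs higher-quality preference data.)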
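
How DPO "learns directly" from preferences is easiest to see in its loss for a single (chosen, rejected) pair. The sketch below takes summed log-probabilities as plain numbers; the values and beta=0.1 are illustrative assumptions, not settings from LLaMA-3.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])) for one pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference model: no preference learned yet.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy raises the chosen response and lowers the rejected one: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < dpo_loss(-10.0, -12.0, -10.0, -12.0))  # True
```

No separate reward model or PPO loop appears here, which is the simplification over RLHF that the text refers to.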

Computer Vision

CNN-based methods have long dominated CV (Computer Vision). (VGG, Inception, ResNet, ...)
However, CNN-based models mainly learn local patterns, which limits how well they model global relations.
In addition, computational cost grows as the image size grows.

ViT (Vision Transformer) was proposed to address this, and it learns effectively from large-scale datasets.
Let's look at CLIP, OWL-ViT, and SAM, three representative ViT-based models.

Zero shot classification

Zero Shot Classification: CLIP, ALIGN, SigLIP
CLIP is used across many tasks, but this post looks at it as a way to classify images with labels that are absent from the training dataset.
Otherwise the model would have to be retrained every time a new label is added, so avoiding that makes zero-shot techniques close to essential.

CLIP (Contrastive Language-Image Pre-training)

Model	Architecture	Input_Size	Patch_Size	#params
openai/clip-vit-base-patch32	ViT-B/32	224×224	32×32	0.15B
openai/clip-vit-base-patch16	ViT-B/16	224×224	16×16	0.15B
openai/clip-vit-large-patch14	ViT-L/14	224×224	14×14	0.43B
openai/clip-vit-large-patch14-336	ViT-L/14	336×336	14×14	0.43B

Small patch_size: finer-grained feature extraction, higher memory use and compute time
Large patch_size: comparatively lower accuracy, faster processing
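
The cost side of the patch-size trade-off follows from simple arithmetic: a smaller patch means more tokens, and one self-attention map grows with the square of the token count. A short sketch for the four checkpoints listed above (`vit_token_stats` is a helper written for this note):

```python
def vit_token_stats(image_size, patch_size):
    """Return (patches, tokens including [CLS], entries in one attention map)."""
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    n_tokens = n_patches + 1  # patches plus one [CLS] token
    return n_patches, n_tokens, n_tokens ** 2

for name, image_size, patch_size in [
    ("ViT-B/32", 224, 32),
    ("ViT-B/16", 224, 16),
    ("ViT-L/14", 224, 14),
    ("ViT-L/14@336", 336, 14),
]:
    print(name, vit_token_stats(image_size, patch_size))
# ViT-B/32 (49, 50, 2500)
# ViT-B/16 (196, 197, 38809)
# ViT-L/14 (256, 257, 66049)
# ViT-L/14@336 (576, 577, 332929)
```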
Blue blocks: positive samples, white blocks: negative samples

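Compared with conventional supervised learning, there are two distinguishing features:
1. It trains on image-text pairs only, with no separate labels.
- Images and text are projected into the same embedding space
- This allows the semantic similarity between the two modalities to be measured and learned directly
- This is why CLIP has both an image encoder and a text encoder
2. Contrastive Learning:
- "Positive Sample": a real image-text pair --> maximize the semantic similarity between image and text
- "Negative Sample": a randomly paired, mismatched image-text pair --> minimize the similarity
- A contrastive loss based on cosine similarity is used for this.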
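
A cosine-similarity contrastive loss of this kind can be sketched as a symmetric cross-entropy over the image-text similarity matrix, where the matching pairs sit on the diagonal. The embeddings below are random stand-ins and the temperature is fixed at 0.07 for illustration (in CLIP it is a learned parameter).

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of a batch of pairs."""
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = image_embeds @ text_embeds.T / temperature  # scaled cosine similarities
    diag = np.arange(len(logits))                        # positives are on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 32))
aligned = text + 0.01 * rng.normal(size=(4, 32))  # image vectors close to their own text
shuffled = aligned[[1, 2, 3, 0]]                  # every image paired with the wrong text
print(clip_contrastive_loss(aligned, text) < clip_contrastive_loss(shuffled, text))  # True
```

Every other item in the batch acts as a negative sample, so larger batches give more negatives per positive pair.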

Zero-Shot Classification example
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dataset = load_dataset("sasha/dog-food")
images = dataset['test']['image'][:2]
labels = ['dog', 'food']
inputs = processor(images=images, text=labels, return_tensors="pt", padding=True)

print('input_ids: ', inputs['input_ids'])
print('attention_mask: ', inputs['attention_mask'])
print('pixel_values: ', inputs['pixel_values'])
print('image_shape: ', inputs['pixel_values'].shape)
# =======================================================
# input_ids:  tensor([[49406,  1929, 49407], [49406,  1559, 49407]])
# attention_mask:  tensor([[1, 1, 1], [1, 1, 1]])
# pixel_values:  tensor([[[[-0.0113, ...,]]]])
# image_shape:  torch.Size([2, 3, 224, 224])

CLIPProcessor internally contains a CLIPImageProcessor and a CLIPTokenizer.
In input_ids, 49406 and 49407 are the special ids for <|startoftext|> and <|endoftext|>.
attention_mask marks each position of the converted token sequence:
1 means the token at that position is real data, 0 means it is padding.

with torch.no_grad():
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image
  probs = logits_per_image.softmax(dim=1)
  print('outputs:', outputs.keys())
  print('logits_per_image:', logits_per_image)
  print('probs: ', probs)

for idx, prob in enumerate(probs):
  print(f'- Image #{idx}')
  for label, p in zip(labels, prob):
    print(f'{label}: {p.item():.4f}')

# ============================================
# outputs: odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
# logits_per_image: tensor([[23.3881, 18.8604], [24.8627, 21.5765]])
# probs:  tensor([[0.9893, 0.0107], [0.9640, 0.0360]])
# - Image #0
# dog: 0.9893
# food: 0.0107
# - Image #1
# dog: 0.9640
# food: 0.0360

Zero shot Detection

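Natural-language descriptions implicitly contain the objects in an image and their rough locations.
CLIP showed that image-text pairs are enough to learn the association between visual features and text,
so at inference time a well-designed text prompt makes it possible to predict where an object is.
Zero-shot object detection therefore learns the correlation between vision and language without conventional annotations, which lets it detect new object classes.
OWL-ViT uses a CLIP model as its multi-modal backbone.

OWLv2 (OWL-ViT)

OWL-ViT architecture; OWLv2 adds an objectness classifier to the object detection head.

OWL-ViT is pre-trained on image-text pairs, which enables open-vocabulary object detection.
OWLv2 improved performance substantially through self-training.
That is, an existing detector automatically generates pseudo bbox annotations in a weakly supervised fashion.
ex) input: image-text pair [a dog playing with a ball]
existing detector: automatically predicts [dog bbox] [ball bbox] and creates annotations
--> used for model training (that is, there is no exact location information, but the partial supervision acts as a weak signal from which the model infers and learns object locations and classes)

Zero-Shot Detection example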

import io
from PIL import Image
from datasets import load_dataset
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")
dataset = load_dataset('Francesco/animals-ij5d2')
print(dataset)
print(dataset['test'][0])

# ==========================================================
# DatasetDict({
#     train: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 700
#     })
#     validation: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 100
#     })
#     test: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 200
#     })
# })
# {'image_id': 63, 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2B0186E4A0>, 'width': 640, 'height': 640, 'objects': {'id': [96, 97, 98, 99], 'area': [138029, 8508, 10150, 20624], 'bbox': [[129.0, 291.0, 395.5, 349.0], [119.0, 266.0, 93.5, 91.0], [296.0, 280.0, 116.0, 87.5], [473.0, 284.0, 167.0, 123.5]], 'category': [3, 3, 3, 3]}}

- Label and image preprocessing

images = dataset['test']['image'][:2]
categories = dataset['test'].features['objects'].feature['category'].names
labels = [categories] * len(images)
inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)

print(images)
print(labels)
print('input_ids:', inputs['input_ids'])
print('attention_mask:', inputs['attention_mask'])
print('pixel_values:', inputs['pixel_values'])
print('image_shape:', inputs['pixel_values'].shape)

# ==========================================================
# [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CF790>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CCC10>]
# [['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk'], ['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk']]
# input_ids: tensor([[49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0],
#         [49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0]])
# attention_mask: tensor([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#         [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], 
#          [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#           [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]])
# pixel_values: tensor([[[[ 1.5264, ..., ]]]])
# image_shape: torch.Size([2, 3, 960, 960])

- Detection & Inference

import torch

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.keys()) # odict_keys(['logits', 'objectness_logits', 'pred_boxes', 'text_embeds', 'image_embeds', 'class_embeds', 'text_model_output', 'vision_model_output'])

- Post Processing

import matplotlib.pyplot as plt
from PIL import ImageDraw, ImageFont

# Keep only detections whose score is above the threshold
sample = dataset['test'][:2]
# target_sizes expects one (height, width) pair per image
target_sizes = list(map(list, zip(sample['height'], sample['width']))) # [[640, 640], [640, 640]]
results = processor.post_process_object_detection(outputs=outputs, threshold=0.5, target_sizes=target_sizes)
print(results)

# Load the label font once; fall back to PIL's default font if Arial is not installed
try:
    font = ImageFont.truetype("arial.ttf", 18)
except OSError:
    font = ImageFont.load_default()

# Post Processing
for idx, (image, detect) in enumerate(zip(images, results)):
    image = image.copy()
    draw = ImageDraw.Draw(image)

    for box, label, score in zip(detect['boxes'], detect['labels'], detect['scores']):
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline='red', width=3)

        label_text = f'{labels[idx][label]}: {round(score.item(), 3)}'
        draw.text((box[0], box[1]), label_text, fill='white', font=font)

    plt.imshow(image)
    plt.axis('off')
    plt.show()
    
# ==============================================
# [{'scores': tensor([0.5499, 0.6243, 0.6733]), 'labels': tensor([3, 3, 3]), 'boxes': tensor([[329.0247, 287.1844, 400.3372, 357.9262],
#         [122.9359, 272.8753, 534.3260, 637.6506],
#         [479.7363, 294.2744, 636.4859, 396.8372]])}, {'scores': tensor([0.7538]), 'labels': tensor([7]), 'boxes': tensor([[ -0.7799, 173.7043, 440.0294, 538.7166]])}]

Zero shot Semantic segmentation

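Image segmentation performs finer, per-pixel classification, so it is computationally expensive and needs extensive training data and sophisticated algorithms.
Traditional methods include threshold-based binary classification and edge detection,
while recent methods use deep learning models for image segmentation.
Traditional methods are simple and fast, but their performance drops sharply on complex scenes or under varied lighting conditions.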

SAM (Segment Anything Model)

Model	Architecture	Input_Size	Patch_Size	#params
facebook/sam-vit-base	ViT-B/16	1024×1024	16×16	0.09B
facebook/sam-vit-large	ViT-L/16	1024×1024	16×16	0.31B
facebook/sam-vit-huge	ViT-H/16	1024×1024	16×16	0.64B

SAM architecture: img-encoder, prompt-encoder, mask-decoder

SAM is a model developed by Meta and trained on 11 million images collected from diverse domains.

This lets it perform image segmentation at a high level across varied settings.
In many cases SAM can segment images from different domains without any additional fine-tuning.

SAM can run with or without a prompt, and prompts can take several forms such as point coordinates, bboxes, or text.
Without a prompt, it performs a comprehensive segmentation of the whole image.
Note that the inference result provides binary masks but no class information for the pixels.

SAM usage example

import io
from PIL import Image
from datasets import load_dataset
from transformers import SamProcessor, SamModel

def filter_category(data):
    # 16 = dog
    # 23 = giraffe
    return 16 in data["objects"]["category"] or 23 in data["objects"]["category"]

def convert_image(data):
    byte = io.BytesIO(data["image"]["bytes"])
    img = Image.open(byte)
    return {"img": img}

model_name = "facebook/sam-vit-base"
processor = SamProcessor.from_pretrained(model_name) 
model = SamModel.from_pretrained(model_name)

dataset = load_dataset("s076923/coco-val")
filtered_dataset = dataset["validation"].filter(filter_category)
converted_dataset = filtered_dataset.map(convert_image, remove_columns=["image"])

import numpy as np
from matplotlib import pyplot as plt


def show_point_box(image, input_points, input_labels, input_boxes=None, marker_size=375):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    ax = plt.gca()
    
    input_points = np.array(input_points)
    input_labels = np.array(input_labels)

    pos_points = input_points[input_labels[0] == 1]
    neg_points = input_points[input_labels[0] == 0]
    
    ax.scatter(
        pos_points[:, 0],
        pos_points[:, 1],
        color="green",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )
    ax.scatter(
        neg_points[:, 0],
        neg_points[:, 1],
        color="red",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )

    if input_boxes is not None:
        for box in input_boxes:
            x0, y0 = box[0], box[1]
            w, h = box[2] - box[0], box[3] - box[1]
            ax.add_patch(
                plt.Rectangle(
                    (x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2
                )
            )

    plt.axis("on")
    plt.show()


image = converted_dataset[0]["img"]
input_points = [[[250, 200]]]
input_labels = [[[1]]]

show_point_box(image, input_points[0], input_labels[0])
inputs = processor(
    image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
)

# input_points shape : torch.Size([1, 1, 1, 2])
# input_points : tensor([[[[400.2347, 320.0000]]]], dtype=torch.float64)
# input_labels shape : torch.Size([1, 1, 1])
# input_labels : tensor([[[1]]])
# pixel_values shape : torch.Size([1, 3, 1024, 1024])
# pixel_values : tensor([[[[ 1.4612,  ...]]])

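
input_points: [B, number of points, coordinates] -- coordinates that designate the object or region of interest
input_labels: [B, point batch size, number of points] -- label for each entry in input_points
- Types of input_labels:

Value	Name	Description
1	foreground class	point that lies on the object of interest to be detected
0	not foreground class	point that does not lie on the object of interest
-1	background class	point that lies in the background region
-10	padding class	padding value used to match the batch size (ignored by the model)

[Output after the processor]
input_points: [B, point batch size, number of points per mask, coordinate]
input_labels: [B, point batch size, number of points]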

import torch


def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()

    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(30, 15))

    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach()
        axes[i].imshow(np.array(raw_image))
        show_mask(mask, axes[i])
        axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
        axes[i].axis("off")
    plt.show()


model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)
print("iou_scores shape :", outputs.iou_scores.shape)
print("iou_scores :", outputs.iou_scores)
print("pred_masks shape :", outputs.pred_masks.shape)
print("pred_masks :", outputs.pred_masks)

# iou_scores shape : torch.Size([1, 1, 3])
# iou_scores : tensor([[[0.7971, 0.9507, 0.9603]]])
# pred_masks shape : torch.Size([1, 1, 3, 256, 256])
# pred_masks : tensor([[[[[ -3.6988, ..., ]]]]])

iou_scores: [B, number of points, IoU scores]
pred_masks: [B, point batch size, C, H, W]

input_points = [[[250, 200], [15, 50]]]
input_labels = [[[0, 1]]]
input_boxes = [[[100, 100, 400, 600]]]

show_point_box(image, input_points[0], input_labels[0], input_boxes[0])
inputs = processor(
    image,
    input_points=input_points,
    input_labels=input_labels,
    input_boxes=input_boxes,
    return_tensors="pt"
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)

Zero shot Instance segmentation

Zero shot Detection + SAM

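SAM has no ability to classify the objects it segments.
So on its own it struggles with instance segmentation, where every object in the image has to be separated at the pixel level and labeled.

To overcome this limitation, a zero-shot detection model can be combined with SAM:
1) the zero-shot detection model detects object classes and bbox regions
2) SAM segments the object inside each bbox region.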

from transformers import pipeline

generator = pipeline("mask-generation", model=model_name)
outputs = generator(image, points_per_batch=32)

plt.imshow(np.array(image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()

print("number of output masks :", len(outputs["masks"]))
print("number of output scores :", len(outputs["scores"]))

# number of output masks : 52
# number of output scores : 52

detector = pipeline(
    model="google/owlv2-base-patch16", task="zero-shot-object-detection"
)

image = converted_dataset[24]["img"]
labels = ["dog", "giraffe"]
results = detector(image, candidate_labels=labels, threshold=0.5)

input_boxes = []
for result in results:
    input_boxes.append(
        [
            result["box"]["xmin"],
            result["box"]["ymin"],
            result["box"]["xmax"],
            result["box"]["ymax"]
        ]
    )
    print(result)

inputs = processor(image, input_boxes=[input_boxes], return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

plt.imshow(np.array(image))
ax = plt.gca()

for mask, iou in zip(masks[0], outputs.iou_scores[0]):
    max_iou_idx = torch.argmax(iou)
    best_mask = mask[max_iou_idx]
    show_mask(best_mask, ax=ax, random_color=True)

plt.axis("off")
plt.show()

#{'score': 0.6905778646469116, 'label': 'giraffe', 'box': {'xmin': 96, 'ymin': 198, 'xmax': 294, 'ymax': 577}}
#{'score': 0.6264181733131409, 'label': 'giraffe', 'box': {'xmin': 228, 'ymin': 199, 'xmax': 394, 'ymax': 413}}

Image Matching
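Image matching quantifies and compares the similarity between digital images.
It does so by extracting a feature vector from each image and measuring the similarity (distance) between the vectors.
The heart of image matching is therefore "producing feature vectors that capture image characteristics effectively".
(Higher-dimensional feature vectors generally carry more information. The feature vector usually means the vector just before a layer such as the classification layer (= output of the feature extractor = input to the classifier).)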
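
Measuring "similarity (distance) between vectors" usually means cosine similarity or L2 distance. The sketch below uses random 768-dimensional vectors as stand-ins for image features; the function names are chosen for this note.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 for identical directions, near 0 for unrelated vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    """Euclidean distance; 0.0 for identical vectors."""
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
query = rng.normal(size=768)               # stand-in for an image feature vector
near = query + 0.1 * rng.normal(size=768)  # slightly perturbed copy ("similar image")
far = rng.normal(size=768)                 # unrelated vector ("different image")

print(cosine_similarity(query, near) > cosine_similarity(query, far))  # True
print(l2_distance(query, near) < l2_distance(query, far))              # True
```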

ex) Extracting a feature vector with ViT
import torch
from datasets import load_dataset
from transformers import ViTImageProcessor, ViTModel

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["pixel_values"])

print("shape of the last feature map :", outputs["last_hidden_state"].shape)
print("shape of the feature vector :", outputs["last_hidden_state"][:, 0, :].shape)
print("feature vector :", outputs["last_hidden_state"][:, 0, :])

# shape of the last feature map : torch.Size([1, 197, 768])
# shape of the feature vector : torch.Size([1, 768])
# feature vector : tensor([[ 2.9420e-01,  8.3502e-01,  ..., -8.4114e-01,  1.7990e-01]])

The model was pre-trained on the large ImageNet-21K dataset, which lets it recognize subtle differences and complex patterns.
When extracting a feature vector from ViT, the key to look at is last_hidden_state:
the output shape [1, 197, 768] means [B, number of output tokens, feature dimension], and the 197 output tokens break down as follows.
224×224 image --> 16×16 patches (patch_size) --> (224/16)² = 196 patches,
197 = [CLS] + 196 patches, and the [CLS] token is used as the feature vector.

FAISS (Facebook AI Similarity Search)

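FAISS is a high-performance vector similarity search library developed by Meta.
It is designed to "search for similar vectors in large, high-dimensional vector databases".

cf) [Ways to store and manage vectors]

Local storage: fast devices such as SSD or NVMe give quick data access.

Database systems: databases with vector search support, such as PostgreSQL with the pgvector extension or MongoDB Atlas Vector Search

Cloud vector databases: cloud services such as Amazon OpenSearch and Google Vertex AI offer specialized solutions for storing and searching large-scale vector data

Vector search engines: tools such as Milvus, Qdrant, Weaviate, and FAISS are optimized for efficient storage and high-performance search over large vector datasets. They support fast similarity search with ANN (Approximate Nearest Neighbor) algorithms, which makes them especially suitable when real-time search is needed.

ex) Extracting image feature vectors with CLIP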
import torch
import numpy as np
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

dataset = load_dataset("sasha/dog-food")
images = dataset["test"]["image"][:100]

model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)

vectors = []
with torch.no_grad():
    for image in images:
        inputs = processor(images=image, padding=True, return_tensors="pt")
        outputs = model.get_image_features(**inputs)
        vectors.append(outputs.cpu().numpy())

vectors = np.vstack(vectors)
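print("shape of the image vectors :", vectors.shape)

# shape of the image vectors : (100, 512)

Select 100 images from the dog-food dataset → extract a vector for each image
→ store them in the vectors list → convert to an ndarray

These feature vectors can be used to build an index for similarity search:
vectors are registered in the index with add, and the input must be a float32 numpy ndarray of shape [number of vectors, vector dimension]!!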
import faiss

dimension = vectors.shape[-1]
index = faiss.IndexFlatL2(dimension)
if torch.cuda.is_available():
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(vectors)


import matplotlib.pyplot as plt

search_vector = vectors[0].reshape(1, -1)
num_neighbors = 5
distances, indices = index.search(x=search_vector, k=num_neighbors)

fig, axes = plt.subplots(1, num_neighbors + 1, figsize=(15, 5))

axes[0].imshow(images[0])
axes[0].set_title("Input Image")
axes[0].axis("off")

for i, idx in enumerate(indices[0]):
    axes[i + 1].imshow(images[idx])
    axes[i + 1].set_title(f"Match {i + 1}\nIndex: {idx}\nDist: {distances[0][i]:.2f}")
    axes[i + 1].axis("off")

print("유사한 벡터의 인덱스 번호:", indices)
print("유사도 계산 결과:", distances)

# 유사한 벡터의 인덱스 번호: [[ 0  6 75  1 73]]
# 유사도 계산 결과: [[ 0.       43.922516 44.92473  46.544144 47.058586]]
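The steps above build a FAISS index holding 100 vectors; given the feature vector of a query image, the most similar vectors in the index can be retrieved efficiently.
Note that search only covers vectors stored in the index, so widening the search range means adding more vectors to the index.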

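The code above contains the line below; FAISS provides several index types:
index = faiss.IndexFlatL2(dimension)
Name | Accuracy | Speed | Characteristics
IndexFlatL2 | Highest | Slowest | Exhaustive search over all vectors
IndexHNSW (e.g. IndexHNSWFlat) | High | Medium | Efficient search using a graph structure
IndexIVFFlat | Medium | Fastest | Clusters the vectors to narrow the search range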
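To make "exhaustive search" concrete, here is a minimal pure-Python sketch of what a flat L2 index does conceptually. flat_l2_search is a hypothetical helper written for illustration, not a FAISS function. It returns squared L2 distances, which is also what IndexFlatL2 reports.

```python
def flat_l2_search(database, query, k):
    """Compare the query against every stored vector and keep the k closest."""
    scored = []
    for idx, vec in enumerate(database):
        # Squared L2 distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()                      # ascending distance
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(database, query=[0.0, 0.0], k=3)
print(indices, distances)
```

The cost grows linearly with the number of stored vectors, which is the reason graph-based and clustering-based indexes trade a little accuracy for speed.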

Multi-Modal

Image Captioning (img2txt)

BLIP

The core idea of BLIP is to model the interaction between images and text.
To do so, it connects the feature vectors from an image encoder and a text encoder into a unified representation.

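BLIP-2 architecture

BLIP-2 introduces the Q-Former to improve interaction and information exchange between image and text:
[image-text contrastive learning, ITM, image-grounded text generation] --> performed jointly in a single Encoder-Decoder structure
The Q-Former takes the frozen image feature embeddings as input
and outputs soft visual prompt embeddings that capture the image-text relationship well.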

DocumentQA
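DQA (DocumentQA) performs question answering by combining natural language processing with information retrieval.
DQA has to take the visual structure and layout of a document into account, and one of the most prominent models for this is LayoutLM.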

LayoutLM (Layout-aware Language Model)

LayoutLM is a model from Microsoft that is pretrained not only on the text of document images but also on their layout information.

[LayoutLMv1]

LayoutLM-v1

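Built on BERT, it takes the text together with the position of the text as input.
An OCR engine extracts the text and bounding boxes, which are added as position embeddings; image features of each word region, extracted with a Faster R-CNN backbone, are also fed to the model. However, LayoutLMv1 adds the image features only at the very end, so they cannot actually be used during pretraining.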

[LayoutLMv2]

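LayoutLMv2 additionally introduces image embeddings so that the visual information of a document is reflected effectively.
In LayoutLMv2 the visual embeddings are extracted with ResNeXt-FPN.
That is, text, image patches, and layout information are fed in together and processed with Self-Attention.
- Main pretraining objectives:
i) Masked Visual-Language Modeling: predict the masked words in a sentence
ii) ITM: learn whether a given text and image belong together
iii) Text-Image Alignment: when some words are covered in the image, identify which text tokens were covered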

[LayoutLMv3]

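Left) DocFormer, Right) LayoutLMv3

LayoutLMv3 is the first unified multimodal model of this family that does not depend on a pretrained backbone such as Faster R-CNN or a CNN.
To achieve this, it introduces new pretraining strategies and tasks:
i) Masked Language Modeling (MLM): mask some of the word tokens
ii) Masked Image Modeling (MIM): mask image patches and reconstruct them
iii) Word Patch Alignment (WPA): binary classification of whether the image patch corresponding to a text token is masked, which learns the alignment between the two modalities

<LayoutLMv3 structure>: embedding layer, patch_embedding module, encoder
1) The embedding layer combines several kinds of embeddings:
- word_embed + token_type_emb + pos_emb + (x_pos_emb , y_pos_emb, h_pos_emb, w_pos_emb)

2) The patch_embedding module processes the image:
- it splits the image into patches and converts each patch into an embedding, playing the role of a ViT

3) encoder
- consists of multiple Transformer layers.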

VQA
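VQA process: extract visual features → understand the meaning of the question (Q) → combine the visual features with the text of Q to produce a meaningful representation (A)
ViLT was introduced for exactly this.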

ViLT (Vision-and-Language Transformer)

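It is a single-model architecture that processes visual input in the same way as text input.
A modal-type embedding is added to distinguish the two modalities,
and training uses three objectives:
- ITM: decide whether a given image and text are related to each other.
- MLM: mask at the word level so the model learns whole-word context
- WPA: maximize the similarity between image and text vectors

As a result, image and text are combined effectively and represented in a single embedding space.

cf) collate_fn is the function that defines how individual samples are combined when a PyTorch DataLoader builds a batch.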

Image Generation
Image generation is the task of interpreting a prompt and producing a new image that captures the prompt's details, typically with a GAN or a Diffusion Model.

Diffusion Model

[Forward process]: gradually add Gaussian noise to src_img
[Reverse process]: recover the original from pure_noise (by repeatedly estimating the mean and standard deviation of each denoising step)
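The forward process can be sketched in a few lines. This assumes the DDPM-style closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with a linear β schedule; the schedule values and helper names below are illustrative assumptions, not taken from this post.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (values are illustrative)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, alpha_bar, rng):
    """Sample x_t directly from x_0: scaled signal plus scaled Gaussian noise."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                          # stand-in for image pixels
x_early = forward_noise(x0, alpha_bars[10], rng)
x_late = forward_noise(x0, alpha_bars[-1], rng)
print(alpha_bars[0], alpha_bars[-1])
```

Early steps keep most of the signal, while by the last step the signal coefficient is close to zero, so x_late is almost pure noise. The reverse process learns to undo these steps one at a time.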

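[Stable-Diffusion 1]
- generates 512×512 images
- supports txt2img, img2img, inpainting, and more

[Stable-Diffusion 2]
- generates 768×768 images
- better text-image alignment through OpenCLIP, with improved fine detail

[Stable-Diffusion 3]
- generates even higher-resolution images
- new model architecture based on rectified flow
- new model architecture that allows bidirectional information flow between text and image tokens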

etc

Hyperparameter Tuning - ray tune

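Ray Tune is a distributed hyperparameter optimization framework.
It supports a range of hyperparameter search algorithms (random, greedy, etc.) in large-scale distributed environments and also provides Early Stopping.
It additionally offers tools for tracking and visualizing experiment results, which helps identify the best hyperparameter combination.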

!pip3 install ray[tune] optuna

ex) NER example with Ray Tune

i) Preparing for training

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

def preprocess_data(example, tokenizer):
    sentence = "".join(example["tokens"]).replace("\xa0", " ")
    encoded = tokenizer(
        sentence,
        return_offsets_mapping=True,
        add_special_tokens=False,
        padding=False,
        truncation=False
    )

    labels = []
    for offset in encoded.offset_mapping:
        if offset[0] == offset[1]:
            labels.append(-100)
        else:
            labels.append(example["ner_tags"][offset[0]])
    encoded["labels"] = labels
    return encoded

dataset = load_dataset("klue", "ner")
labels = dataset["train"].features["ner_tags"].feature.names

model_name = "Leo97/KoELECTRA-small-v3-modu-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    ignore_mismatched_sizes=True
)

processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=False,
    remove_columns=dataset["train"].column_names
)

ii) Hyperparameter search

from ray import tune
from functools import partial
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForTokenClassification

def model_init(model_name, labels):
    return AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(labels), ignore_mismatched_sizes=True
    )

def hp_space(trial):
    return {
        "learning_rate": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-5, 1e-1),
        "num_train_epochs": tune.choice([1, 2, 3])
    }

def compute_objective(metrics):
    return metrics["eval_loss"]

training_args = TrainingArguments(
    output_dir="token-classification-hyperparameter-search",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # learning_rate=1e-4,
    # weight_decay=0.01,
    # num_train_epochs=5,
    seed=42
)

trainer = Trainer(
    model_init=partial(model_init, model_name=model_name, labels=labels),
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)
)

best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=5,
    direction="minimize",
    hp_space=hp_space,
    compute_objective=compute_objective,
    resources_per_trial={"cpu": 2, "gpu": 1},
    trial_dirname_creator=lambda trial: str(trial)
)
print(best_run.hyperparameters)

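model_init function: creates a fresh model instance (called once per trial while searching for the best hyperparameters).
That is, it guarantees every trial starts from the same initial state.

hp_space function: specifies which hyperparameters to search and the range of values for each.

compute_objective function: the metric to optimize, usually eval_loss or eval_acc.

TrainingArguments: lr, weight_decay, and train_epochs are searched through hp_space, so they are not set here.

Trainer: uses model_init instead of a fixed model instance.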
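For intuition about the log-uniform ranges in hp_space, here is a stdlib-only sketch of log-uniform sampling. sample_loguniform is a hypothetical helper written for illustration; it is not Ray Tune's implementation.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw uniformly in log space, then map back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_loguniform(1e-5, 1e-4, rng) for _ in range(1000)]

# In log space the midpoint is the geometric mean, not the arithmetic mean.
geometric_mid = math.sqrt(1e-5 * 1e-4)
below_mid = sum(s < geometric_mid for s in samples)
print(min(samples), max(samples), below_mid)
```

Roughly half of the samples land below the geometric midpoint (about 3.16e-5), so small learning rates are explored as thoroughly as large ones. A plain uniform draw would put most samples in the upper part of the range.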

Example output)

+-------------------------------------------------------------------+
| Configuration for experiment     _objective_2024-11-18_05-44-18   |
+-------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator            |
| Scheduler                        FIFOScheduler                    |
| Number of trials                 5                                |
+-------------------------------------------------------------------+

View detailed results here: /root/ray_results/_objective_2024-11-18_05-44-18
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-11-18_05-44-11_866890_872/artifacts/2024-11-18_05-44-18/_objective_2024-11-18_05-44-18/driver_artifacts`

Trial status: 5 PENDING
Current time: 2024-11-18 05:44:18. Total running time: 0s
Logical resource usage: 0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-------------------------------------------------------------------------------------------+
| Trial name               status       learning_rate     weight_decay     num_train_epochs |
+-------------------------------------------------------------------------------------------+
| _objective_27024_00000   PENDING        2.36886e-05       0.0635122                     3 |
| _objective_27024_00001   PENDING        6.02131e-05       0.00244006                    2 |
| _objective_27024_00002   PENDING        1.43217e-05       1.7074e-05                    1 |
| _objective_27024_00003   PENDING        3.99131e-05       0.00679658                    2 |
| _objective_27024_00004   PENDING        1.13871e-05       0.00772672                    2 |
+-------------------------------------------------------------------------------------------+

Trial _objective_27024_00000 started with configuration:
+-------------------------------------------------+
| Trial _objective_27024_00000 config             |
+-------------------------------------------------+
| learning_rate                             2e-05 |
| num_train_epochs                              3 |
| weight_decay                            0.06351 |
+-------------------------------------------------+

...

GPTQ (Generative Pre-trained Transformer Quantization)

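GPTQ is a model optimization method that can greatly improve the efficiency of LLMs.
It quantizes the model weights to a lower bit precision, which shrinks the model and speeds up inference.
As the output of the example below shows, the model size can be reduced by a large margin,
and GPTQ applies not only to the GPT family but to other Transformer-based models as well.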
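To show what "quantizing weights to a lower bit precision" means mechanically, here is a sketch of plain round-to-nearest 4-bit quantization. This is deliberately not the GPTQ algorithm: GPTQ works layer by layer and uses approximate second-order information to compensate for rounding error. The helper names are hypothetical.

```python
def quantize_rtn(weights, bits=4):
    """Round-to-nearest quantization onto 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0                                  # all weights identical
    codes = [round((w - w_min) / scale) for w in weights]   # integers in [0, levels]
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    """Map the integer codes back to approximate float weights."""
    return [c * scale + w_min for c in codes]

weights = [-0.8, -0.31, 0.0, 0.12, 0.47, 0.9]
codes, scale, zero = quantize_rtn(weights, bits=4)
restored = dequantize(codes, scale, zero)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as a 4-bit code plus a shared scale and offset, and the reconstruction error is at most half a quantization step. That bounded error is why outputs stay close to the original model.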

Model quantization example with GPTQ

from transformers import GPTQConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)

from transformers import pipeline

origin_generator = pipeline("text-generation", model="facebook/opt-125m")
quantized_generator = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

input_text_list = [
    "In the future, technology wil",
    "What are we having for dinner?",
    "What day comes after Monday?"
]

print("원본 모델의 출력 결과:")
for input_text in input_text_list:
    print(origin_generator(input_text))
print("양자화 모델의 출력 결과:")
for input_text in input_text_list:
    print(quantized_generator(input_text))
    
# 원본 모델의 출력 결과:
# [{'generated_text': 'In the future, technology wil be used to make the world a better place.\nI think'}]
# [{'generated_text': 'What are we having for dinner?\n\nWe have a great dinner tonight. We have a'}]
# [{'generated_text': "What day comes after Monday?\nI'm guessing Monday."}]
# 양자화 모델의 출력 결과:
# [{'generated_text': 'In the future, technology wil be able to make it possible to make a phone that can be'}]
# [{'generated_text': "What are we having for dinner?\n\nI'm not sure what to do with all this"}]
# [{'generated_text': "What day comes after Monday?\nI'm not sure, but I'll be sure to check"}]

The outputs show that, although accuracy drops slightly, the quantized model does not differ much from the original.

import time
import numpy as np

def measure_inference_time(generator, input_text, iterations=10):
    times = []
    for _ in range(iterations):
        start_time = time.time()
        generator(input_text)
        end_time = time.time()
        times.append(end_time - start_time)
    avg_time = np.mean(times)
    return avg_time

def calculate_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    total_memory = sum(p.numel() * p.element_size() for p in model.parameters())
    total_memory_mb = total_memory / (1024 ** 2)
    return total_memory_mb, total_params

test_input = "Once upon a time in a land far, far away, there was a small village."

size_original, total_params_original = calculate_model_size(origin_generator.model)
avg_inference_time_original = measure_inference_time(origin_generator, test_input)

size_quantized, total_params_quantized = calculate_model_size(quantized_generator.model)
avg_inference_time_quantized = measure_inference_time(quantized_generator, test_input)

print("원본 모델:")
print(f"- 매개변수 개수: {total_params_original:,}")
print(f"- 모델 크기: {size_original:.2f} MB")
print(f"- 평균 추론 시간: {avg_inference_time_original:.4f} sec")

print("양자화 모델:")
print(f"- 매개변수 개수: {total_params_quantized:,}")
print(f"- 모델 크기: {size_quantized:.2f} MB")
print(f"- 평균 추론 시간: {avg_inference_time_quantized:.4f} sec")

# 원본 모델:
# - 매개변수 개수: 125,239,296
# - 모델 크기: 477.75 MB
# - 평균 추론 시간: 0.1399 sec
# 양자화 모델:
# - 매개변수 개수: 40,221,696
# - 모델 크기: 76.72 MB
# - 평균 추론 시간: 0.0289 sec

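The measurements show that the quantized model is much smaller than the original and runs faster, which can make it very efficient for real-time responses. Note that calculate_model_size only sums model.parameters(); the drop in parameter count suggests the quantized layers keep their packed weights outside parameters(), so the reported size is best read as a rough indicator rather than an exact footprint.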
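The reported sizes can be reproduced from the parameter counts alone. The sketch below assumes 4 bytes per parameter (FP32) for the original model. The 2-byte figure for the quantized model is an inference from the numbers, not something the output states.

```python
MIB = 1024 ** 2

original_params = 125_239_296
quantized_params = 40_221_696      # parameters still visible via model.parameters()

original_mb = original_params * 4 / MIB      # FP32: 4 bytes each
quantized_mb = quantized_params * 2 / MIB    # assumed FP16: 2 bytes each

print(f"{original_mb:.2f} MB, {quantized_mb:.2f} MB")
```

Both values match the printed 477.75 MB and 76.72 MB, which is consistent with the remaining counted parameters being stored in half precision.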

HuggingFace( )-Tutorials

V2LLAIN — Wed, 31 Jul 2024 14:37:18 +0900

Transformers

pipeline()
Used for model inference
from transformers import pipeline
pipe = pipeline("text-classification")
pipe("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
With from transformers, functions are imported from the transformers repository on GitHub:

Everything you import from transformers is implemented under src/transformers, as in the screenshot above.

To see where an imported function comes from, check __init__.py; as in the screenshot below, pipeline is listed as from .pipelines import pipeline.

As the left screenshot above shows, pipeline is imported from the pipelines folder,
and inside that folder the pipeline function is defined as on the right, in a form that matches the transformers.pipeline docs.

Auto Classes

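The architecture to use can usually be inferred from the name or path passed to the from_pretrained() method. The Auto classes do this for you, automatically instantiating the matching pretrained class through AutoConfig, AutoModel, or AutoTokenizer:
ex)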
from transformers import AutoConfig, AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")
∙ AutoConfig

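A generic configuration class that is instantiated as one of the library's configuration classes when created with the from_pretrained() class method.
This class cannot be instantiated directly with '__init__()'!!

As the screenshots above show, following the files under transformers/src eventually leads to the from_pretrained() function.
In the GitHub code (rightmost screenshot), __init__() is set to raise EnvironmentError.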
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config)


# BertConfig {
#   "_name_or_path": "bert-base-uncased",
#   "architectures": [
#     "BertForMaskedLM"
#   ],
#   "attention_probs_dropout_prob": 0.1,
#   "classifier_dropout": null,
#   "gradient_checkpointing": false,
#   "hidden_act": "gelu",
#   "hidden_dropout_prob": 0.1,
#   "hidden_size": 768,
#   "initializer_range": 0.02,
#   "intermediate_size": 3072,
#   "layer_norm_eps": 1e-12,
#   "max_position_embeddings": 512,
#   "model_type": "bert",
#   "num_attention_heads": 12,
#   "num_hidden_layers": 12,
#   "pad_token_id": 0,
#   "position_embedding_type": "absolute",
#   "transformers_version": "4.41.2",
#   "type_vocab_size": 2,
#   "use_cache": true,
#   "vocab_size": 30522
# }
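As the output above shows, the config is a JSON file that describes the model.
It holds architecture settings such as hidden_size, num_attention_heads, num_hidden_layers, and vocab_size,
along with predefined special-token settings such as pad_token_id. Training settings such as batch_size, learning_rate, and weight_decay are not stored here; they belong to TrainingArguments.
Also, save_pretrained() saves the config together with the model checkpoint!
So if you want to change these settings, or you are proposing a new model variant, you need to load the config directly!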
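Since the config is plain JSON, the architecture numbers can be read with the standard library. The sketch below copies a few fields from the BertConfig output above and derives some quantities from them; the derived variable names are illustrative.

```python
import json

# A subset of the bert-base-uncased config printed above
config_json = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "max_position_embeddings": 512,
  "vocab_size": 30522
}
"""
config = json.loads(config_json)

# Each attention head works on an equal slice of the hidden vector
head_dim = config["hidden_size"] // config["num_attention_heads"]
# The feed-forward layer expands the hidden size by this factor
ffn_ratio = config["intermediate_size"] // config["hidden_size"]
# Size of the token embedding table (vocab_size x hidden_size)
embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, ffn_ratio, embedding_params)
```

Changing a value such as num_hidden_layers in the config and building a model from it gives a different architecture, which is why model proposals start from the config.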

∙ AutoTokenizer, (blobs, refs, snapshots)

A generic tokenizer class created through the AutoTokenizer.from_pretrained() class method.
On creation it is instantiated as one of the library's tokenizer classes.
See) https://chan4im.tistory.com/#%E2%88%99input-ids
This class cannot be instantiated directly with '__init__()'!!

As the screenshots above show, following the files under transformers/src eventually leads to the from_pretrained() function.
In the GitHub code (rightmost screenshot), __init__() is set to raise EnvironmentError.
from transformers import AutoTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# If vocabulary files are in a directory 
# (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

# Download vocabulary from huggingface.co and define model-specific arguments
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

tokenizer
# BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
# 	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# }
But aren't you curious about one thing?
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
Why does the console show the download messages below when this code is run?

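Explanation up front:
tokenizer_config.json holds the tokenizer settings,
config.json holds the model settings,
vocab.txt holds the subword vocabulary,
and tokenizer.json is the serialized tokenizer itself (the vocabulary together with its processing configuration).

In my case, when cache_dir is set as below, a directory named hub is created there, and the model files appear inside it.
parser.add_argument('--cache_dir', default="/data/huggingface_models")
Following the path down, three folders appear: blobs, refs, snapshots
blobs: the files here are named by hash values. It contains files like the following:
config files, model files, tokenizer JSON settings, weight files, and so on; many files are stored under hashed names, as in the screenshot below:

refs: contains only a file named main:

That file holds the same name as the directory inside snapshots.

snapshots: this is where tokenizer_config.json, config.json, vocab.txt, and tokenizer.json actually show up!!!

But isn't something odd??
The file contents shown for blobs above and the file contents in snapshots are identical!!

Exactly! blobs stores the actual files under hash names, and the entries in snapshots point to those same files!!!
To be honest, after all this digging I googled it and the answer was right there.. (https://huggingface.co/docs/huggingface_hub/v0.16.3/en/guides/manage-cache)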
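Summary of the Hugging Face explanation:

Refs: the refs folder contains files that indicate the latest revision of a given reference. For example, if you previously fetched a file from the main branch of a repository, the refs folder contains a file named main, which holds the commit identifier of the current head.

If the latest commit has the identifier aaaaaa, that file contains aaaaaa.

If the same branch is updated with a new commit bbbbbb, downloading a file from that reference again updates the refs/main file to bbbbbb.

Blobs: the blobs folder contains the files that were actually downloaded. The name of each file is its hash.

Snapshots: the snapshots folder contains symbolic links to the blobs mentioned above. It is made up of several folders, one per known revision.

cf) The cache has the following tree structure: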
    [  96]  .
    └── [ 160]  models--julien-c--EsperBERTo-small
        ├── [ 160]  blobs
        │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
        │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
        ├── [  96]  refs
        │   └── [  40]  main
        └── [ 128]  snapshots
            ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
            │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
            │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
            └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
                ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
                └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
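The layout above can be walked by hand: read refs/main to get the commit hash, then look in snapshots/<hash>. The sketch below builds a tiny fake cache in a temporary directory to show this. repo_folder_name and resolve_snapshot_dir are hypothetical helpers, not huggingface_hub functions, and the folder-naming rule is inferred from the tree above.

```python
import os
import tempfile

def repo_folder_name(repo_id, repo_type="model"):
    """'julien-c/EsperBERTo-small' -> 'models--julien-c--EsperBERTo-small'."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")

def resolve_snapshot_dir(cache_dir, repo_id, ref="main"):
    """Follow refs/<ref> to the snapshot folder for that commit."""
    repo_dir = os.path.join(cache_dir, repo_folder_name(repo_id))
    with open(os.path.join(repo_dir, "refs", ref)) as f:
        commit = f.read().strip()
    return os.path.join(repo_dir, "snapshots", commit)

repo_id = "julien-c/EsperBERTo-small"
commit = "bbc77c8132af1cc5cf678da3f1ddf2de43606d48"

with tempfile.TemporaryDirectory() as cache_dir:
    # Recreate only the refs/ and snapshots/ parts of the tree
    repo_dir = os.path.join(cache_dir, repo_folder_name(repo_id))
    os.makedirs(os.path.join(repo_dir, "refs"))
    os.makedirs(os.path.join(repo_dir, "snapshots", commit))
    with open(os.path.join(repo_dir, "refs", "main"), "w") as f:
        f.write(commit)

    snapshot_dir = resolve_snapshot_dir(cache_dir, repo_id)
    snapshot_name = os.path.basename(snapshot_dir)
    snapshot_exists = os.path.isdir(snapshot_dir)

print(snapshot_name, snapshot_exists)
```

In a real cache, the files inside that snapshot folder are symbolic links into blobs, so two revisions that share an unchanged file point at a single blob.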
∙ AutoModel

As before, the source can be traced in the same way, as in the screenshot below.

Let's first look at the AutoModel.from_config function.
from transformers import AutoConfig, AutoModel

# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_config(config)


@classmethod
def from_config(cls, config, **kwargs):
    trust_remote_code = kwargs.pop("trust_remote_code", None)
    has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map
    has_local_code = type(config) in cls._model_mapping.keys()
    trust_remote_code = resolve_trust_remote_code(
        trust_remote_code, config._name_or_path, has_local_code, has_remote_code
    )

    if has_remote_code and trust_remote_code:
        class_ref = config.auto_map[cls.__name__]
        if "--" in class_ref:
            repo_id, class_ref = class_ref.split("--")
        else:
            repo_id = config.name_or_path
        model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
        if os.path.isdir(config._name_or_path):
            model_class.register_for_auto_class(cls.__name__)
        else:
            cls.register(config.__class__, model_class, exist_ok=True)
        _ = kwargs.pop("code_revision", None)
        return model_class._from_config(config, **kwargs)
    elif type(config) in cls._model_mapping.keys():
        model_class = _get_model_class(config, cls._model_mapping)
        return model_class._from_config(config, **kwargs)

    raise ValueError(
        f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
        f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    )

Data Collator

Data Collator
Groups a list of samples into a "single training mini-batch" in Tensor form.
Default Data Collator: as below, train_dataset is grouped into mini-batches by data_collator before being fed to the model for training.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
batch["input_ids"] , batch["labels"] ?
Unlike the above, most custom Data Collators look like the code below, which uses two somewhat unfamiliar names, input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples): 
        texts = []
        images = []
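        for example in examples:
            image, question, answer = example 
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}] # <-- verified that the data arrives correctly up to here
            # tokenize=False returns the formatted string, so the processor can tokenize the whole batch
            text = self.processor.tokenizer.apply_chat_template(
                messages, add_generation_prompt=False, tokenize=False
            )
            texts.append(text)
            images.append(image)

        # Pass the accumulated lists, not only the last example
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)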
        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
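So what exactly are batch["input_ids"] and batch["labels"]?

The data_collator mentioned earlier has the form below, and it too contains inputs and labels.

Every model is different, but models share similarities with each other
= most models use the same inputs!

∙Input IDs

Input IDs are often the "only required parameter" passed to the model as input.
Input IDs are token indices: the numerical representation of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but "the underlying mechanism is the same".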

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
The tokenizer splits the sequence (sentence) into tokens that exist in the tokenizer vocab:
tokenized_sequence = tokenizer.tokenize(sequence)
A token is either a word or a subword:
print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA" and "M".
# How is it indicated that these tokens are parts of the same word rather than separate words?
# --> a double-hash (##) prefix is added in front of "RA" and "M"


inputs = tokenizer(sequence)
The tokens can then be converted into IDs that the model understands.
The tokenizer returns a "dictionary" with input_ids as the key and the ID values as the value, which is the form the model consumes:
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
Depending on the model, the tokenizer also adds "special tokens" automatically;
these are "special IDs" that the model sometimes uses.
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# 출력: [CLS] A Titan RTX has 24GB of VRAM [SEP]
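The "##" splitting can be imitated with a greedy longest-match-first loop, which is the usual description of WordPiece tokenization. The sketch below uses a tiny made-up vocabulary; wordpiece_split is a hypothetical helper, not the BertTokenizer implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1                      # try a shorter piece
        if match is None:
            return [unk_token]            # no piece fits: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"A", "has", "of", "V", "##RA", "##M", "24", "##GB"}
vram_tokens = wordpiece_split("VRAM", toy_vocab)
size_tokens = wordpiece_split("24GB", toy_vocab)
print(vram_tokens, size_tokens)
```

Pieces after the first one carry the "##" prefix, which lets the decoder glue them back into "VRAM" without inserting spaces.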
∙Attention Mask
The attention mask is an optional argument used when batching sequences together;
it indicates "which tokens the model should attend to and which it should not".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# 출력: (8, 19)
As shown above, the encoded lengths differ, so the two sequences "cannot be put together in the same Tensor as is".
--> padding or truncation is needed.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# 출력: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# 출력: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention mask is found under the "attention_mask" key of the dictionary returned by the tokenizer.
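What padding=True does to the two sequences can be reproduced by hand. This sketch pads on the right with ID 0, matching the output above. pad_batch is a hypothetical helper, not the tokenizer's own code.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

seq_a = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
seq_b = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
         1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

batch = pad_batch([seq_a, seq_b])
print(batch["attention_mask"][0])
```

The mask has 8 ones followed by 11 zeros for the short sequence, so the model ignores the padded positions.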

∙Token Types IDs
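Some models are built for classification over sentence pairs or for QA.
Such models need two "different sequences joined into a single input_ids" entry.
= This is done with special tokens such as [CLS] and [SEP].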

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# 출력: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
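In the example above, the two sequences are passed to the tokenizer as two arguments, and it automatically builds the combined sentence shown.
For some models this is enough to tell where one sequence ends and the next begins.

Other models, however, also use token type IDs, a binary mask that identifies the two sequences; the tokenizer returns this mask as token_type_ids.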
encoded_dict['token_type_ids']
# 출력: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The first sequence, the context used for the question, is set to all 0s,
and the second sequence, the question itself, is set to all 1s.

∙Position IDs

RNN: the position of each token is built in.
Transformer: not aware of the position of each token ❌

∴ position_ids is an optional parameter the model uses to identify each token's position in the list of tokens.

If position_ids are not passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.

∙Labels

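Labels are an optional argument that can be passed so that the model computes the loss itself.
That is, the labels should be the expected prediction of the model: it uses a standard loss to compute the loss between its predictions and the expected values (the labels).

The labels differ depending on the model head:

AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value of the batch is the expected label of the entire sequence.

AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token.

AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token: the labels are the token IDs of the masked tokens, and the rest are values to be ignored (usually -100).

Sequence-to-sequence models (e.g. BartForConditionalGeneration): the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value is the target sequence associated with an input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models that use the Encoder-Decoder framework. See the documentation of each model for details on its labels.

Base models (BertModel, etc.) do not accept labels; they are the bare transformer models and simply output features.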

∙ Decoder input IDs

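This input is specific to encoder-decoder models and contains the input IDs that are fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs from the labels on their own.
For such models, passing the labels is the preferred way to handle training.
See the documentation of each model for how it handles these input IDs in sequence-to-sequence training.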

∙Feed Forward Chunking

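In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often larger than the hidden size of the model (e.g. bert-base-uncased).

For an input of size [batch_size, sequence_length], the memory needed to store the intermediate feed forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large part of the memory use.

The authors of Reformer: The Efficient Transformer noticed that, since the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and then concatenate them into [batch_size, sequence_length, config.hidden_size] with n = sequence_length.

This trades increased computation time for reduced memory use, while producing a mathematically equivalent result.

For models that use the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel, and so defines the trade-off between memory and time complexity. If chunk_size is set to 0, no feed forward chunking is done.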
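The equivalence claimed above is easy to check on a toy feed forward block. The sketch below works on a single sequence (batch dimension omitted) with made-up weights. ffn_chunked is a hypothetical helper that mirrors the idea, not the transformers apply_chunking_to_forward implementation.

```python
def feed_forward(token, w1, w2):
    """Two-layer position-wise FFN with ReLU, applied to one token vector."""
    hidden = [max(0.0, sum(x * w for x, w in zip(token, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def ffn_full(sequence, w1, w2):
    """Process every position at once (conceptually one big intermediate tensor)."""
    return [feed_forward(token, w1, w2) for token in sequence]

def ffn_chunked(sequence, w1, w2, chunk_size):
    """Process chunk_size positions at a time; chunk_size == 0 disables chunking."""
    if chunk_size == 0:
        return ffn_full(sequence, w1, w2)
    outputs = []
    for start in range(0, len(sequence), chunk_size):
        outputs.extend(ffn_full(sequence[start:start + chunk_size], w1, w2))
    return outputs

# hidden_size = 2, intermediate_size = 3 (toy values)
w1 = [[0.5, -1.0], [1.0, 0.25], [-0.5, 0.75]]
w2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.25, 0.25]]

full = ffn_full(sequence, w1, w2)
chunked = ffn_chunked(sequence, w1, w2, chunk_size=2)
print(full == chunked)
```

Because each position is transformed independently, only chunk_size intermediate vectors need to exist at a time. Smaller chunks lower peak memory and cost more sequential steps.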

Optimization

AdamW
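People often say "just use Adam, no questions asked", as if it were a universal optimizer,
but on some CV tasks SGD performs better in quite a few cases.
The AdamW paper explains why Adam generalizes worse than SGD from the perspective of L2 regularization and weight decay.
Figure: test error for different initial decay rates and learning rates

L2 Regularization: prevents weights from growing abnormally large. (The penalty term makes the loss grow as the weights grow.)
= The model learns weights that minimize the original loss while keeping the weights from getting too large.

weight decay: at each weight update, shrinks the previous weights by a fixed ratio to prevent overfitting.

SGD: L2 = weight_decay (equivalent up to a rescaling by the learning rate)
Adam: L2 ≠ weight_decay (because of the adaptive learning rate, its weight update rule differs from SGD's.)
∴ When Adam optimizes a loss that includes L2 regularization, the regularization effect is weaker than intended. (The effective decay rate becomes smaller.)
The authors solve this by adding a weight decay term directly to the weight update rule, instead of relying only on the weight decay effect that comes from L2 regularization. Because this weight decay is separated from L2 regularization, it is called decoupled weight decay.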
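The difference between the two schemes shows up even on a single scalar weight. The sketch below implements one Adam step where the decay is either folded into the gradient (L2) or applied directly to the weight (decoupled). It scales the decoupled term by lr, which is an assumption following the common PyTorch-style convention; adam_step is an illustrative helper.

```python
import math

def adam_step(w, grad, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """First Adam step from a fresh state, for one scalar weight."""
    b1, b2 = betas
    grad = grad + l2 * w                      # L2: decay enters through the gradient
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)                      # bias correction at step t = 1
    v_hat = v / (1 - b2)
    adam_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * w             # AdamW: decay bypasses the normalization
    return w - adam_update - decay

w0, grad, wd = 1.0, 0.01, 0.1
plain = adam_step(w0, grad)
with_l2 = adam_step(w0, grad, l2=wd)
with_decoupled = adam_step(w0, grad, decoupled_wd=wd)

l2_effect = plain - with_l2                   # how much extra shrinkage L2 caused
decoupled_effect = plain - with_decoupled     # how much extra shrinkage AdamW caused
print(l2_effect, decoupled_effect)
```

On this first step the L2 term is almost entirely normalized away by the division by sqrt(v_hat), while the decoupled term shrinks the weight by exactly lr * wd * w. Over many steps the picture is less extreme, but the same mechanism is what weakens L2 under Adam.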

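Algorithms for SGDW and AdamW:
There is a $η$ that has not been explained so far; it is the learning rate schedule multiplier, which scales the learning rate at every weight update (for example, decaying it at a fixed rate).

Without the part highlighted in green, the algorithms are identical to applying SGD and Adam to a loss function that includes L2 regularization.
The green part adds the decay term directly to the weight update rule so that the weight decay actually takes effect.
optimizer = AdamW(model.parameters(), lr=1e-3, eps=1e-6, weight_decay=0.0)
cf) model.parameters() returns the weights and biases.
Now let's look at the formulas above through the GitHub code: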
class AdamW(Optimizer):
    """
    Implements Adam algorithm with weight decay fix as introduced in [Decoupled Weight Decay
    Regularization](https://arxiv.org/abs/1711.05101).

    Parameters:
        params (`Iterable[nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (`float`, *optional*, defaults to 0.001):
            The learning rate to use.
        betas (`Tuple[float,float]`, *optional*, defaults to `(0.9, 0.999)`):
            Adam's betas parameters (b1, b2).
        eps (`float`, *optional*, defaults to 1e-06):
            Adam's epsilon for numerical stability.
        weight_decay (`float`, *optional*, defaults to 0.0):
            Decoupled weight decay to apply.
        correct_bias (`bool`, *optional*, defaults to `True`):
            Whether or not to correct bias in Adam (for instance, in Bert TF repository they use `False`).
        no_deprecation_warning (`bool`, *optional*, defaults to `False`):
            A flag used to disable the deprecation warning (set to `True` to disable the warning).
    """

    def __init__(
        self,
        params: Iterable[nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
        no_deprecation_warning: bool = False,
    ):
        if not no_deprecation_warning:
            warnings.warn(
                "This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch"
                " implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this"
                " warning",
                FutureWarning,
            )
        require_version("torch>=1.5.0")  # add_ with alpha
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr} - should be >= 0.0")
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[0]} - should be in [0.0, 1.0)")
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(f"Invalid beta parameter: {betas[1]} - should be in [0.0, 1.0)")
        if not 0.0 <= eps:
            raise ValueError(f"Invalid epsilon value: {eps} - should be >= 0.0")
        defaults = {"lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay, "correct_bias": correct_bias}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure: Callable = None):
        """
        Performs a single optimization step.

        Arguments:
            closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Decay the first and second moment running average coefficient
                # In-place operations to update the averages at the same time
                exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
                denom = exp_avg_sq.sqrt().add_(group["eps"])

                step_size = group["lr"]
                if group["correct_bias"]:  # No bias correction for Bert
                    bias_correction1 = 1.0 - beta1 ** state["step"]
                    bias_correction2 = 1.0 - beta2 ** state["step"]
                    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

                p.addcdiv_(exp_avg, denom, value=-step_size)

                # Just adding the square of the weights to the loss function is *not*
                # the correct way of using L2 regularization/weight decay with Adam,
                # since that will interact with the m and v parameters in strange ways.
                #
                # Instead we want to decay the weights in a manner that doesn't interact
                # with the m/v parameters. This is equivalent to adding the square
                # of the weights to the loss with plain (non-momentum) SGD.
                # Add weight decay at the end (fixed version)
                if group["weight_decay"] > 0.0:
                    p.add_(p, alpha=(-group["lr"] * group["weight_decay"]))

        return loss
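cf) The optimizer's state_dict() has the following shape (this example shows an optimizer with momentum; the AdamW above stores `step`, `exp_avg`, and `exp_avg_sq` per parameter instead of `momentum_buffer`):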
{
                'state': {
                    0: {'momentum_buffer': tensor(...), ...},
                    1: {'momentum_buffer': tensor(...), ...},
                    2: {'momentum_buffer': tensor(...), ...},
                    3: {'momentum_buffer': tensor(...), ...}
                },
                'param_groups': [
                    {
                        'lr': 0.01,
                        'weight_decay': 0,
                        ...
                        'params': [0]
                    },
                    {
                        'lr': 0.001,
                        'weight_decay': 0.5,
                        ...
                        'params': [1, 2, 3]
                    }
                ]
            }
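From this we can see that AdamW inherits from the Optimizer class and keeps per-parameter state, as the state_dict layout above suggests.
`if len(state) == 0` means that no state has been stored for that parameter yet, i.e. this is its first update.
exp_avg corresponds to the first moment m_t and exp_avg_sq to the second moment v_t. The final parameter update happens in p.addcdiv_ (the Adam step) and in the `if group["weight_decay"]` branch (the decoupled weight decay).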
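To make the update order above concrete, here is a minimal sketch of the same step for a single scalar parameter, written in plain Python instead of torch. The function name `adamw_step` and the toy values are assumptions for illustration only; the arithmetic mirrors the `step()` method quoted above.

```python
import math

def adamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
               weight_decay=0.0, correct_bias=True):
    """One AdamW update for a scalar parameter (illustrative, not the library code)."""
    if not state:  # plays the role of `if len(state) == 0`
        state.update(step=0, exp_avg=0.0, exp_avg_sq=0.0)
    beta1, beta2 = betas
    state["step"] += 1

    # Exponential moving averages of the gradient (m_t) and squared gradient (v_t)
    state["exp_avg"] = beta1 * state["exp_avg"] + (1.0 - beta1) * grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"]) + eps

    step_size = lr
    if correct_bias:
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        step_size = lr * math.sqrt(bias_correction2) / bias_correction1

    p = p - step_size * state["exp_avg"] / denom   # Adam step
    if weight_decay > 0.0:
        p = p - lr * weight_decay * p              # decoupled weight decay, applied last
    return p

state = {}
p_no_decay = adamw_step(1.0, grad=0.5, state={}, lr=0.1)
p_decay = adamw_step(1.0, grad=0.5, state=state, lr=0.1, weight_decay=0.01)
print(p_no_decay, p_decay, state["step"])
```

With bias correction on, the very first step moves the parameter by roughly `lr` regardless of the gradient's magnitude, and the weight decay then shrinks the already-updated value without touching m_t or v_t.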

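LR Schedules & Learning Rate Annealing
LR Schedule: the lr is changed during training according to a predefined schedule.

The difference from annealing is that a schedule may also increase the learning rate during training.
With warm restarts, the learning rate is periodically reset to a high value, which gives the optimizer a chance to escape a local minimum.

LR Annealing: often used interchangeably with "lr schedule", but strictly it means the lr decreases monotonically with the iteration count.
Intuitively, a high learning rate early on searches aggressively for a good region of convergence,
and a low learning rate at the end lets the model settle precisely into that point.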
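The two ideas can be sketched as plain functions of the step index. This is a minimal illustration, not the scheduler implementation of any library; the function names and the cosine shape are assumptions chosen for the example.

```python
import math

def linear_warmup_cosine(step, total_steps, warmup_steps, base_lr):
    """Schedule: lr rises linearly during warmup, then anneals along a cosine."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def cosine_with_restarts(step, cycle_len, base_lr):
    """Warm restarts: the cosine decay starts over every `cycle_len` steps."""
    position = (step % cycle_len) / cycle_len
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * position))

# lr at a few steps of a 100-step run with 10 warmup steps
samples = [round(linear_warmup_cosine(s, 100, 10, 1.0), 3) for s in (0, 5, 10, 55, 100)]
# lr just before and right at a restart boundary
before_restart = cosine_with_restarts(49, 50, 1.0)
at_restart = cosine_with_restarts(50, 50, 1.0)
print(samples, before_restart, at_restart)
```

The first function is monotonically decreasing only after the warmup phase, which is why it counts as a schedule rather than pure annealing; the second jumps back to `base_lr` at every cycle boundary.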

Model Outputs

ModelOutput

The output of every model is an instance of a subclass of ModelOutput.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # batch size 1
outputs = model(**inputs, labels=labels)

# SequenceClassifierOutput(loss=tensor(0.4267, grad_fn=<NllLossBackward0>), 
#                           logits=tensor([[-0.0658,  0.5650]], grad_fn=<AddmmBackward0>), 
#                           hidden_states=None, attentions=None)

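Because labels were passed, the outputs object has both loss and logits. When it is indexed like a tuple, attributes that are None are skipped, so outputs[:2] returns the tuple (outputs.loss, outputs.logits).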

Cf)
For a CausalLM:
loss: Language modeling loss (for next-token prediction).
logits: Prediction scores of the LM head (scores for each vocabulary token before softmax)
= raw prediction values that are not bounded to a specific range

To turn the transformer output into a word, the hidden state is projected linearly and then softmax is applied.
The LM head is the layer that takes the model's output hidden_state and performs this prediction. It serves the language-modeling objective used in pre-training; task-specific heads (e.g. classification) are what get added for fine-tuning.
ex) BERT

from transformers import BertModel, BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
print(f'logits: {logits}')  # `torch.FloatTensor` of shape `(batch_size, sequence_length, vocab_size)`

# Check the prediction for the [MASK] token.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f'masked_index: {masked_index}')  # `torch.LongTensor` of shape `(1,)`

MASK_token = logits[0, masked_index]  # logits at the [MASK] position of the first sentence in the batch
print(f'MASK_Token: {MASK_token}')

predicted_token_id = MASK_token.argmax(dim=-1)  # index of the largest value along the vocab dimension = token_id the model predicts at that position
print(f'predicted_token_id: {predicted_token_id}')


predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)  # prints: paris


# logits: tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
#          [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7618],
#          [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
#          ...,
#          [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
#          [-11.6598, -11.4274, -11.9267,  ...,  -9.8772, -10.2103,  -4.7594],
#          [-11.7267, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]],
#        grad_fn=<ViewBackward0>)
# masked_index: tensor([6])
# MASK_Token: tensor([[-3.7861, -3.8572, -3.5644,  ..., -2.5593, -3.1093, -4.3820]],
#        grad_fn=<IndexBackward0>)
# predicted_token_id: tensor([3000])
# paris
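The "project linearly, then apply softmax" step can be illustrated without a model. The sketch below uses a made-up 4-token vocabulary and made-up logits (both are assumptions, not values from BERT) to show that softmax turns unbounded logits into probabilities and that the argmax is unchanged by it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = ["london", "paris", "berlin", "rome"]  # hypothetical vocabulary
toy_logits = [1.2, 4.5, 0.3, 2.0]                  # hypothetical LM-head scores

probs = softmax(toy_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(toy_vocab[predicted_id], probs)
```

Since softmax is monotonic, taking argmax directly on the logits (as the BERT example does) selects the same token id as taking it on the probabilities.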

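cf) The index returned by argmax is an index into the vocabulary, which can be checked as follows.

vocab = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])  # (token, id) pairs ordered by id
for word, idx in vocab[:5]:  # first 5 vocabulary entries
    print(f"{word}: {idx}")
for word, idx in vocab[2990:3010]:  # entries 2990-3009, around "paris"
    print(f"{word}: {idx}")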
    
# [PAD]: 0
# [unused0]: 1
# [unused1]: 2
# [unused2]: 3
# [unused3]: 4
# jack: 2990
# fall: 2991
# raised: 2992
# itself: 2993
# stay: 2994
# true: 2995
# studio: 2996
# 1988: 2997
# sports: 2998
# replaced: 2999
# paris: 3000
# systems: 3001
# saint: 3002
# leader: 3003
# theatre: 3004
# whose: 3005
# market: 3006
# capital: 3007
# parents: 3008
# spanish: 3009

Trainer

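The Trainer class is optimized for Transformers models.
= the model must always return a tuple or a subclass of ModelOutput
= when the labels argument is provided, the model can compute the loss (and if it returns a tuple, the loss must be the first element).

Once the required arguments are passed through TrainingArguments, Trainer can start training without the user having to write a training loop.
The SFTTrainer in the TRL library wraps this Trainer class, and with features such as LoRA, quantization, and DeepSpeed it can scale efficiently to large models.

Before starting, install the Accelerate library, which is needed for distributed environments: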

pip install accelerate
pip install accelerate --upgrade

Basic Usage

Checkpoints

Customizing

Callbacks & Logging

Accelerate & Trainer

TrainingArguments

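Reference)

output_dir (str): The output directory where the model predictions and checkpoints will be written.
eval_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "no"): The evaluation strategy to adopt during training. Possible values are:
•	"no": No evaluation is done during training.
•	"steps": Evaluation is done (and logged) every eval_steps.
•	"epoch": Evaluation is done at the end of each epoch.
per_device_train_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8): The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
gradient_accumulation_steps (int, optional, defaults to 1): Number of update steps to accumulate the gradients for before performing a backward/update pass.
eval_accumulation_steps (int, optional): Number of prediction steps to accumulate the output tensors for before moving the results to the CPU. If left unset, all predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster, but requires more memory).
learning_rate (float, optional, defaults to 5e-5): The initial learning rate for the [AdamW] optimizer.
weight_decay (float, optional, defaults to 0): The weight decay to apply in the [AdamW] optimizer to all layers except bias and LayerNorm weights.
max_grad_norm (float, optional, defaults to 1.0): Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0): Total number of training epochs to perform (if not an integer, the fractional part is performed as a fraction of the last epoch before training stops).
max_steps (int, optional, defaults to -1): If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training iterates over the dataset repeatedly until max_steps is reached.
eval_steps (int or float, optional): Number of update steps between two evaluations if eval_strategy="steps". Defaults to the same value as logging_steps if not set. Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of total training steps.
lr_scheduler_type (str or [SchedulerType], optional, defaults to "linear"): The scheduler type to use. See the documentation of [SchedulerType] for all possible values.
lr_scheduler_kwargs ('dict', optional, defaults to {}): Extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0): Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0): Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.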
logging_dir (str, optional): TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
logging_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The logging strategy to adopt during training. Possible values are:
•	"no": No logging is done during training.
•	"epoch": Logging is done at the end of each epoch.
•	"steps": Logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False): Whether to log the first global_step.
logging_steps (int or float, optional, defaults to 500): Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in the range [0,1). If smaller than 1, it is interpreted as a ratio of total training steps.
run_name (str, optional, defaults to output_dir): A descriptor for the run. Typically used for wandb and mlflow logging. If not specified, it is the same as output_dir.
save_strategy (str or [~trainer_utils.IntervalStrategy], optional, defaults to "steps"): The checkpoint save strategy to adopt during training. Possible values are:
•	"no": No save is done during training.
•	"epoch": Save is done at the end of each epoch.
•	"steps": Save is done every save_steps. If "epoch" or "steps" is chosen, a save is also always performed at the very end of training.
save_total_limit (int, optional): If a value is passed, limits the total number of checkpoints and deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model is always retained in addition to the most recent ones.
For example, with save_total_limit=5 and load_best_model_at_end, the last four checkpoints are always retained alongside the best model. With save_total_limit=1 and load_best_model_at_end, two checkpoints may be saved if the last checkpoint and the best checkpoint differ.
save_safetensors (bool, optional, defaults to True): Use safetensors for saving and loading state dicts instead of the default torch.load and torch.save.
save_on_each_node (bool, optional, defaults to False): When doing multi-node distributed training, whether to save models and checkpoints on each node. By default they are saved only on the main node.
seed (int, optional, defaults to 42): Random seed that is set at the beginning of training. To ensure reproducibility across runs, use the [~Trainer.model_init] function to instantiate the model if it has randomly initialized parameters.
data_seed (int, optional): Random seed to be used with data samplers. If not set, the random samplers for data sampling use the same seed as seed. This makes data sampling reproducible independently of the model seed.
bf16 (bool, optional, defaults to False): Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires an Ampere or newer NVIDIA architecture, a CPU (use_cpu), or an Ascend NPU. This is an experimental API and it may change.
fp16 (bool, optional, defaults to False): Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
half_precision_backend (str, optional, defaults to "auto"): The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp". "auto" uses CPU/CUDA AMP or APEX depending on the detected PyTorch version, while the other choices force the requested backend.
bf16_full_eval (bool, optional, defaults to False): Whether to use full bfloat16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values. This is an experimental API and it may change.
fp16_full_eval (bool, optional, defaults to False): Whether to use full float16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values.
tf32 (bool, optional): Whether to enable the TF32 mode, available on Ampere and newer GPU architectures. The default value follows the default of torch.backends.cuda.matmul.allow_tf32. See the TF32 documentation for details. This is an experimental API and it may change.
local_rank (int, optional, defaults to -1): Rank of the process during distributed training.
ddp_backend (str, optional): The backend to use for distributed training. Must be one of "nccl", "mpi", "ccl", "gloo", "hccl".
dataloader_drop_last (bool, optional, defaults to False): Whether to drop the last incomplete batch if the length of the dataset is not divisible by the batch size.
dataloader_num_workers (int, optional, defaults to 0): Number of subprocesses to use for data loading (PyTorch only). 0 means that the data is loaded in the main process.
remove_unused_columns (bool, optional, defaults to True): Whether to automatically remove the columns unused by the model's forward method.
label_names (List[str], optional): The list of keys in the input dictionary that correspond to the labels. Defaults to the list of label arguments the model accepts.
load_best_model_at_end (bool, optional, defaults to False): Whether to load the best model at the end of training. When this option is enabled, the best checkpoint is always saved. See save_total_limit for details.
<Tip>
            When set to `True`, the parameters `save_strategy` needs to be the same as `eval_strategy`, and in
            the case it is "steps", `save_steps` must be a round multiple of `eval_steps`.
</Tip>
metric_for_best_model (str, optional): Used with load_best_model_at_end to specify the metric for comparing two models. Must be the name of a metric returned by the evaluation. If unspecified and load_best_model_at_end=True, it defaults to "loss" (eval_loss is used). Setting this value makes greater_is_better default to True; set that to False if a lower metric is better.
greater_is_better (bool, optional): Used with load_best_model_at_end and metric_for_best_model to specify whether better models should have a greater metric. Defaults to:
•	True if metric_for_best_model is set to a value that does not end in "loss".
•	False if metric_for_best_model is not set, or is set to a value that ends in "loss".

fsdp (bool, str or list of [~trainer_utils.FSDPOption], optional, defaults to ''): Use PyTorch Fully Sharded Data Parallel training (distributed training only).
fsdp_config (str or dict, optional): Config to be used with fsdp (PyTorch Fully Sharded Data Parallel training). The value is either the location of an fsdp json config file (e.g., fsdp_config.json) or an already loaded json file as a dict.
deepspeed (str or dict, optional): Use DeepSpeed. This is an experimental feature and its API may change. The value is either the location of a DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file as a dict.
accelerator_config (str, dict, or AcceleratorConfig, optional): Config to be used with the internal Accelerator implementation. The value is either the location of an accelerator json config file (e.g., accelerator_config.json), an already loaded json file as a dict, or an instance of [~trainer_pt_utils.AcceleratorConfig].
label_smoothing_factor (float, optional, defaults to 0.0): The label smoothing factor to use. Zero means no label smoothing; any other value changes the one-hot encoded labels.
optim (str or [training_args.OptimizerNames], optional, defaults to "adamw_torch"): The optimizer to use. Choices include adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, and adafactor.
optim_args (str, optional): Optional arguments that are supplied to AnyPrecisionAdamW.
group_by_length (bool, optional, defaults to False): Whether to group together samples of roughly the same length in the training dataset (to minimize padding and be more efficient). Only useful when applying dynamic padding.
report_to (str or List[str], optional, defaults to "all"): The list of integrations to report the results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". "all" reports to all installed integrations, "none" reports to no integration.
ddp_find_unused_parameters (bool, optional): When using distributed training, the value of the find_unused_parameters flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional): When using distributed training, the value of the bucket_cap_mb flag passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional): When using distributed training, the value of the broadcast_buffers flag passed to DistributedDataParallel. Defaults to False if gradient checkpointing is used, True otherwise.
dataloader_persistent_workers (bool, optional, defaults to False): If True, the data loader does not shut down the worker processes after a dataset has been consumed once. This keeps the workers' dataset instances alive. It can speed up training but increases RAM usage.
push_to_hub (bool, optional, defaults to False): Whether to push the model to the Hub every time the model is saved. If this is activated, output_dir becomes a git directory and its content is pushed each time a save is triggered (depending on save_strategy). Calling [~Trainer.save_model] also triggers a push.
resume_from_checkpoint (str, optional): The path to a folder with a valid checkpoint for the model. This argument is not directly used by [Trainer]; it is intended to be used by training/evaluation scripts instead. See the example scripts for details.
hub_model_id (str, optional): The name of the repository to keep in sync with the local output_dir. It can be a simple model ID, in which case the model is pushed to your namespace. Otherwise it should be the whole repository name (e.g., "user_name/model"). Defaults to user_name/output_dir_name, where output_dir_name is the name of output_dir.
hub_strategy (str or [~trainer_utils.HubStrategy], optional, defaults to "every_save"): Defines the scope of what is pushed to the Hub and when. Possible values are:
•	"end": Push the model, its configuration, the tokenizer (if passed along), and a draft of a model card when [~Trainer.save_model] is called.
•	"every_save": Push the model, its configuration, the tokenizer (if passed along), and a draft of a model card on every save. The pushes are asynchronous; if saves are very frequent, a new push is attempted only once the previous one has finished. A last push is made with the final model at the end of training.
•	"checkpoint": Like "every_save", but the latest checkpoint is also pushed to a subfolder named last-checkpoint, which makes it easy to resume training with trainer.train(resume_from_checkpoint="last-checkpoint").
•	"all_checkpoints": Like "checkpoint", but all checkpoints are pushed as they appear in the output folder (so the final repository contains one folder per checkpoint).
hub_token (str, optional): The token to use to push the model to the Hub. Defaults to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional, defaults to False): If True, the Hub repository is set to private.
hub_always_push (bool, optional, defaults to False): Unless this is True, a checkpoint push is skipped when the previous push has not finished.
gradient_checkpointing (bool, optional, defaults to False): Whether to use gradient checkpointing to save memory, at the expense of a slower backward pass.
auto_find_batch_size (bool, optional, defaults to False): Whether to automatically find a batch size that fits into memory, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
ray_scope (str, optional, defaults to "last"): The scope to use when doing hyperparameter search with Ray. With the default "last", Ray uses the last checkpoint of every trial, compares them, and selects the best one. Other options exist; see the Ray documentation for details.
ddp_timeout (int, optional, defaults to 1800): The timeout for torch.distributed.init_process_group calls. Used to avoid GPU socket timeouts in distributed runs. See the PyTorch documentation for details.
torch_compile (bool, optional, defaults to False): Whether to compile the model using PyTorch 2.0 torch.compile. This uses the default values of the torch.compile API. The defaults can be customized with the torch_compile_backend and torch_compile_mode arguments, but not every value is guaranteed to work. This flag and the whole compile API are experimental and may change in future releases.
torch_compile_backend (str, optional): The backend to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
torch_compile_mode (str, optional): The mode to use in torch.compile. Setting any value sets torch_compile to True. See the PyTorch documentation for possible values. This is experimental and may change in future releases.
split_batches (bool, optional): Whether to split the batches yielded by the data loader across devices during distributed training. If True, the actual batch size used is the same on every kind of distributed process, but it must be a round multiple of the number of processes in use.
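Several of the arguments above interact arithmetically. The sketch below works through two of those interactions in plain Python: how the batch-size arguments combine into the number of samples per optimizer update, and how warmup_steps takes precedence over warmup_ratio. The helper names are hypothetical and the rounding rule (ceiling) is an assumption of this sketch, not a quote of the library code.

```python
import math

def effective_batch_size(per_device_train_batch_size, num_devices,
                         gradient_accumulation_steps):
    """Samples that contribute to one optimizer update."""
    return per_device_train_batch_size * num_devices * gradient_accumulation_steps

def resolve_warmup_steps(num_training_steps, warmup_steps=0, warmup_ratio=0.0):
    """warmup_steps wins when it is positive; otherwise derive it from the ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(num_training_steps * warmup_ratio)

# 4 devices, 8 samples each, gradients accumulated over 2 steps
samples_per_update = effective_batch_size(8, 4, 2)

# 1000 total steps: ratio only, then ratio overridden by an explicit step count
from_ratio = resolve_warmup_steps(1000, warmup_ratio=0.25)
overridden = resolve_warmup_steps(1000, warmup_steps=50, warmup_ratio=0.25)
print(samples_per_update, from_ratio, overridden)
```

Raising gradient_accumulation_steps is the usual way to keep the effective batch size constant when per_device_train_batch_size has to shrink to fit in memory.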

DeepSpeed

trust_remote_code=True
What exactly is the trust_remote_code=True setting that often appears with models released by Chinese labs?
It is needed when the model architecture has not yet been added to "huggingface/transformers":
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True, device='cuda')
This means "download the model code from the Hugging Face repo 'internlm/internlm-chat-7b' and execute it together with the weights".
If the value is False, the library uses the model architecture built into huggingface/transformers and downloads only the weights.


[Data Preprocessing] - Data Collator

V2LLAIN — Sun, 14 Jul 2024 18:39:04 +0900

Collate: to gather together.

As the word suggests, a Data Collator plays the following role.

Data Collator
Bundles a list of samples into the Tensors of a "single training mini-batch".
Default Data Collator

As shown below, the data_collator groups samples from train_dataset into mini-batches that are then fed to the model for training.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
batch["input_ids"] , batch["labels"] ?
Unlike the default above, most custom Data Collators look like the code below, which contains two somewhat unfamiliar names, input_ids and labels:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image, question, answer = example
            messages = [{"role": "user", "content": question},
                        {"role": "assistant", "content": answer}]  # verified that the data arrives correctly up to here
            # tokenize=False returns the formatted string; the processor tokenizes it below
            text = self.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
            texts.append(text)
            images.append(image)

        # pass the whole batch (texts, images), not only the last example
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
So what exactly are batch["input_ids"] and batch["labels"]?

The default data_collator mentioned earlier also works with inputs and labels.

Every model is different, but models share similarities with one another
= most models use the same inputs!

∙Input IDs

Input IDs are often the "only required parameter" passed to the model as input.
Input IDs are token indices: the numerical representation of the tokens that make up the sequence (sentence) to be used.
Each tokenizer works differently, but "the underlying mechanism is the same".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
The tokenizer splits the sequence (sentence) into tokens available in the tokenizer vocab:
tokenized_sequence = tokenizer.tokenize(sequence)
A token is either a word or a subword:
print(tokenized_sequence)
# Output: ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
# For example, "VRAM" is not in the model vocabulary, so it is split into "V", "RA", and "M".
# How do we indicate that these tokens are parts of the same word rather than separate words?
# --> a double-hash (##) prefix is added in front of "RA" and "M".


inputs = tokenizer(sequence)
This converts the tokens into IDs the model can understand.
The tokenizer returns a "dictionary" with input_ids as the key and the ID values as the value, which is the form the model expects:
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# Output: [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
Depending on the model, the tokenizer also automatically adds "special tokens",
i.e. "special IDs" that the model sometimes uses.
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# Output: [CLS] A Titan RTX has 24GB of VRAM [SEP]
∙Attention Mask
The Attention Mask is an optional argument used when batching sequences together;
it indicates "which tokens the model should attend to and which it should not".

ex)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

len(encoded_sequence_a), len(encoded_sequence_b)
# Output: (8, 19)
As shown above, the encoded lengths differ, so the two sequences "cannot be put together in the same Tensor" as is.
--> padding or truncation is required.
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

padded_sequences["input_ids"]
# Output: [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

padded_sequences["attention_mask"]
# Output: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
The attention_mask is found under the "attention_mask" key of the dictionary returned by the tokenizer: 1 marks a token to attend to, 0 marks a padded position.

∙Token Types IDs
Some models are meant for classification on sentence pairs or for question answering.
Such models need two "different sequences joined into a single input_ids" entry.
= this is done with special tokens such as [CLS] and [SEP].

ex)
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(decoded)
# Output: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
In the example above, the two sequences are passed to the tokenizer as two arguments, and it automatically produces the joined sentence shown.
For some models this is enough to know where one sequence ends and the next one begins.

Other models, such as BERT, also use token type IDs (segment IDs), which the tokenizer returns as a binary mask under token_type_ids.
encoded_dict['token_type_ids']
# Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The first sequence, the context used for the question, is marked entirely with 0,
and the second sequence, the question itself, is marked entirely with 1.
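The layout of that mask can be reproduced by hand. The following is a small sketch, assuming BERT-style special token ids (101 for [CLS] and 102 for [SEP], as seen in the encoded output earlier); the helper `build_pair` and the dummy token ids are hypothetical and stand in for what the tokenizer does internally.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two token-id lists as [CLS] A [SEP] B [SEP] and mark the segments."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Dummy ids: 8 tokens for the context, 8 tokens for the question
context_ids = list(range(1000, 1008))
question_ids = list(range(2000, 2008))
input_ids, token_type_ids = build_pair(context_ids, question_ids)
print(token_type_ids)
```

With eight tokens in each sequence this yields ten 0s followed by nine 1s, the same pattern as the token_type_ids printed above.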

∙Position IDs

RNN: the position of each token is built in.
Transformer: not aware of each token's position ❌

∴ position_ids is an optional parameter the model uses to identify each token's position in the list of tokens.

If no position_ids are passed to the model, the IDs are created automatically as absolute positional embeddings:

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.

∙Labels

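Labels are an optional argument that can be passed so that the model computes the loss itself.
That is, labels should be the model's expected prediction: a standard loss is used to compute the loss between the predictions and the expected values (labels).

The expected labels differ depending on the model head:

AutoModelForSequenceClassification: the model expects a tensor of dimension (batch_size), where each value of the batch is the expected label of the entire sequence.

AutoModelForTokenClassification: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token.

AutoModelForMaskedLM: the model expects a tensor of dimension (batch_size, seq_length), where each value is the expected label of an individual token: the labels are the token IDs at the masked positions, and the rest are values to be ignored (usually -100).

AutoModelForSeq2SeqLM (e.g. BartForConditionalGeneration, T5ForConditionalGeneration): the model expects a tensor of dimension (batch_size, tgt_seq_length), where each value is the target sequence associated with each input sequence. During training, BART and T5 build the appropriate decoder input IDs and decoder attention mask internally, so they usually do not need to be supplied. This does not apply to models using the Encoder-Decoder framework. See each model's documentation for details on the labels of that specific model.

Base models (BertModel, etc.) do not accept labels; as base transformer models they simply output features.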
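The "-100 means ignore" convention is what the custom collator earlier relies on when it copies input_ids into labels. Below is a minimal pure-Python sketch of that padding and masking step; the function name `collate` is hypothetical, and pad id 0 is assumed because it is the padding value visible in the padded BERT output above.

```python
IGNORE_INDEX = -100  # label value that the loss is expected to skip

def collate(sequences, pad_token_id=0):
    """Pad token-id lists to equal length and build attention_mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids = seq + [pad_token_id] * n_pad
        input_ids.append(ids)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels are a copy of input_ids with padded positions masked out
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in ids])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate([[101, 1188, 102], [101, 1188, 1110, 170, 102]])
print(batch["labels"])
```

Masking by comparing against the pad id, as the collator above does, silently masks any real token that shares that id; deriving the mask from attention_mask instead avoids this when the pad token doubles as another special token.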

∙ Decoder input IDs

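This input is specific to encoder-decoder models and contains the input IDs that will be fed to the decoder.
These inputs are used for sequence-to-sequence tasks such as translation or summarization, and are usually built in a way specific to each model.

Most encoder-decoder models (BART, T5) create their decoder input IDs from the labels on their own.
In such models, passing the labels is the preferred way to handle training.
See each model's documentation for how it handles these input IDs in sequence-to-sequence training.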

∙Feed Forward Chunking

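In each residual attention block of a transformer, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often larger than the hidden size of the model (e.g. bert-base-uncased).

For an input of size [batch_size, sequence_length], the memory required to store the intermediate feed forward embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large fraction of the memory use.

The authors of Reformer: The Efficient Transformer noticed that, since the computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n individually and then concatenate them into [batch_size, sequence_length, config.hidden_size] with n = sequence_length.

This trades increased computation time for reduced memory use, while yielding a mathematically equivalent result.

For models that use the apply_chunking_to_forward() function, chunk_size defines the number of output embeddings computed in parallel, and therefore the trade-off between memory and time complexity. If chunk_size is set to 0, no feed forward chunking is done.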
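The equivalence claim is easy to check on a toy example. The sketch below is plain Python with made-up weights and a ReLU feed forward layer (all assumptions of this example, not the library's `apply_chunking_to_forward`); it applies the same position-wise layer once over the whole sequence and once chunk by chunk.

```python
def feed_forward(token, w1, w2):
    """Position-wise FFN for one token vector: relu(token @ w1) @ w2."""
    hidden = [max(0.0, sum(t * w for t, w in zip(token, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def ffn_full(sequence, w1, w2):
    """Process every position of the sequence in one pass."""
    return [feed_forward(token, w1, w2) for token in sequence]

def ffn_chunked(sequence, w1, w2, chunk_size):
    """Process `chunk_size` positions at a time; 0 disables chunking."""
    if chunk_size == 0:
        return ffn_full(sequence, w1, w2)
    outputs = []
    for start in range(0, len(sequence), chunk_size):
        outputs.extend(ffn_full(sequence[start:start + chunk_size], w1, w2))
    return outputs

# hidden_size = 2, intermediate_size = 4, sequence_length = 5 (toy values)
w1 = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 0.0, -0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
sequence = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0], [-1.0, 0.0], [2.0, 2.0]]

full = ffn_full(sequence, w1, w2)
chunked = ffn_chunked(sequence, w1, w2, chunk_size=2)
print(full == chunked)
```

In a real tensor implementation the saving comes from only ever materializing the intermediate activations for `chunk_size` positions at once; the list-based version here only demonstrates that the outputs agree.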

QLoRA Hands-on & Trainer vs SFTTrainer

V2LLAIN — Fri, 12 Jul 2024 14:43:33 +0900

QLoRA hands-on with MLLMs (InternVL)

Step 1. Import the required libraries:

import os

import torch
import torch.nn as nn
import bitsandbytes as bnb
import transformers

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel, 
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
    pipeline,
    TrainingArguments,
)

Step 2. Load the model, then run prepare_model_for_kbit_training(model)

devices = [0]#[0, 3]
max_memory = {i: '49140MiB' for i in devices}

model_name = 'OpenGVLab/InternVL2-8B'


model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    cache_dir='/data/huggingface_models',
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory,
    quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4'
        ),
)

# Print the model structure
print(model)

# Attach a get_input_embeddings method to the model
def get_input_embeddings(self):
    if hasattr(self, 'embed_tokens'):
        return self.embed_tokens
    elif hasattr(self, 'language_model') and hasattr(self.language_model.model, 'tok_embeddings'):
        return self.language_model.model.tok_embeddings
    else:
        raise NotImplementedError("The model does not have an attribute 'embed_tokens' or 'language_model.model.tok_embeddings'.")

model.get_input_embeddings = get_input_embeddings.__get__(model, type(model))

# Hand-written version of prepare_model_for_kbit_training
def prepare_model_for_kbit_training(model):
    for param in model.parameters():
        param.requires_grad = False  # disable gradient computation for every parameter

    if hasattr(model, 'model') and hasattr(model.model, 'tok_embeddings'):
        for param in model.model.tok_embeddings.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only
    elif hasattr(model, 'embed_tokens'):
        for param in model.embed_tokens.parameters():
            param.requires_grad = True  # enable gradients for the embedding layer only
    
    # Gradients can be enabled for other specific layers as well if needed
    # Example:
    # if hasattr(model, 'some_other_layer'):
    #     for param in model.some_other_layer.parameters():
    #         param.requires_grad = True

    return model

model = prepare_model_for_kbit_training(model)

Step 3. Choose the layers to attach QLoRA to:

def find_all_linear_names(model, train_mode):
    assert train_mode in ['lora', 'qlora']
    cls = bnb.nn.Linear4bit if train_mode == 'qlora' else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # skip modules that belong to the LLM head
        lora_module_names.remove('lm_head')
    
    return list(lora_module_names)


# Collect the target modules first so they can be inspected before LoraConfig is built
target_modules = find_all_linear_names(model, 'qlora')
print(sorted(target_modules)) # ['1','output', 'w1', 'w2', 'w3', 'wo', 'wqkv']
if '1' in target_modules:
    target_modules.remove('1') # drop the entry that belongs to the LLM head


config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="QUESTION_ANS" #CAUSAL_LM, FEATURE_EXTRACTION, QUESTION_ANS, SEQ_2_SEQ_LM, SEQ_CLS, TOKEN_CLS.
)

model = get_peft_model(model, config)

After this, training proceeds with a trainer.

Result of attaching QLoRA:

Which trainer? Trainer vs SFTTrainer

Trainer v.s. SFTTrainer

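∙ Trainer
- General-purpose training: used to train models from scratch on supervised tasks such as text classification, question answering, and summarization.
- Highly customizable: offers a wide range of configuration options for tuning hyperparameters, optimizer, scheduler, logging, metrics, and more.
- Handles complex training workflows: supports gradient accumulation, early stopping, checkpointing, distributed training, and similar features.
- Needs more data: effective training generally requires a larger dataset.

∙ SFTTrainer
- Supervised fine-tuning (SFT): optimized for fine-tuning PLMs on small datasets.
- Simple interface: provides a streamlined workflow with less configuration.
- Efficient memory use: reduces memory consumption during training with techniques such as PEFT and packing optimization.
- Fast training: can reach similar or better accuracy with small datasets and short training times.

∙ Choosing between Trainer and SFTTrainer:
- Use Trainer:
when you have a large dataset and need extensive customization of the training loop or a complex training workflow.
Data preprocessing and the data collator must be set up by the user, using general-purpose preprocessing.

- Use SFTTrainer:
when you have a PLM and a relatively small dataset and want a simpler, faster fine-tuning experience with efficient memory use.
PEFT is supported out of the box, so efficient fine-tuning can be configured easily through settings such as `peft_config`.
Data preprocessing and the data collator are also optimized for efficient fine-tuning.
Text data can be handled easily through fields such as `dataset_text_field`.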

Feature	Trainer	SFTTrainer
Purpose	General-purpose training	Supervised fine-tuning of PLMs
Customization	Highly customizable	Simpler interface with fewer options
Training workflow	Handles complex workflows	Streamlined workflow
Data required	Large datasets	Smaller datasets
Memory usage	Higher	Lower with PEFT & packing optimization
Training speed	Slower	Faster with smaller datasets

[QLoRA] & [PEFT] & deepspeed, DDP

V2LLAIN — Tue, 9 Jul 2024 18:58:49 +0900

[PEFT]: Parameter Efficient Fine-Tuning

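What is PEFT?
A technique for adapting PLMs to a specific task: most parameters are frozen❄️ and only a small number are fine-tuned.
PEFT can maintain model performance while reducing the number of trainable parameters.
It also lowers the risk of catastrophic forgetting.
Hugging Face offers it through its PEFT library, which is used for fine-tuning on downstream tasks.

What is catastrophic forgetting?
The phenomenon in which a model, while learning new information, forgets part of the knowledge it had already learned.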

Main Concept

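Reduced Parameter Fine-tuning
The focus is on freezing the vast majority of a pretrained LLM's parameters and fine-tuning only a small number of additional parameters.
This selective fine-tuning sharply reduces the computational requirements.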
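
To make the reduction concrete, here is a rough count for a single linear layer, assuming a LoRA-style low-rank adapter (one of the PEFT methods listed further down). The 4096x4096 layer size and rank 16 are illustrative assumptions, not values from a specific model.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def low_rank_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in a rank-r adapter: B is (d_out x r), A is (r x d_in)."""
    return d_out * r + r * d_in


# Illustrative sizes (assumptions, not taken from a particular model).
d_out, d_in, r = 4096, 4096, 16
full = full_param_count(d_out, d_in)
adapter = low_rank_param_count(d_out, d_in, r)
ratio = adapter / full
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {ratio:.4%}")
```

Under these assumptions the adapter holds well under 1% of the layer's parameters, which is where most of the compute and memory savings come from.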

Overcoming Catastrophic Forgetting
Catastrophic forgetting arises when the whole LLM is fine-tuned; PEFT can mitigate it.
With PEFT, the model can learn a new downstream task while preserving the knowledge from its pretrained state.

Application Across Modalities
PEFT extends beyond its original domain of Natural Language Processing (NLP) to many other areas.
It has been applied successfully to a range of modalities, including computer vision (CV) with models such as Stable Diffusion and LayoutLM,
and audio with models such as Whisper and XLS-R.

Supported PEFT Methods
The library supports a variety of PEFT methods.
LoRA (Low-Rank Adaptation), Prefix Tuning, Prompt Tuning, and others are each designed for particular fine-tuning requirements and scenarios.

The output activations of the original (frozen❄️) pretrained weights (left) are augmented by a low-rank adapter composed of weight matrices A and B (right).
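
What the caption describes can be sketched with NumPy. This is a minimal illustration, assuming the usual LoRA formulation h = Wx + (alpha/r)·BAx with B initialised to zero; the toy sizes and the values of `r` and `alpha` are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 8, 6, 2, 4  # toy sizes; r and alpha are assumptions

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus scaled low-rank path: W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)

# With B = 0 the adapter contributes nothing, so the output matches the frozen layer.
h0 = lora_forward(x, W, A, B, alpha, r)

# Once B moves away from zero (here set by hand), the output is augmented.
B_trained = rng.normal(size=(d_out, r))
h1 = lora_forward(x, W, A, B_trained, alpha, r)

# The same result is obtained by merging the adapter into the weight.
W_merged = W + (alpha / r) * (B_trained @ A)
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is why the adapter can be merged or removed afterwards.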

[Q-LoRA]: Quantized-LoRA

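What is Q-LoRA?
The QLoRA paper, released in May 2023 and presented at NeurIPS 2023, combines quantization with LoRA so that "a 65B model can be tuned on a single A6000 GPU".
In short, QLoRA is the existing LoRA plus a new quantization scheme.
As in LoRA, the weights of the base PLM are frozen and only the LoRA adapter weights are trainable; the difference is that the frozen PLM weights are quantized to 4 bits.
The main contribution of QLoRA is therefore its quantization methodology.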

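What is quantization?
Converting weights and activation outputs so that they are represented with fewer bits.
It gives up a little information and precision, but
it is a model-compression technique that "gains efficiency by reducing the memory and computation needed for storage and arithmetic".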

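As a rough illustration of that trade-off, the sketch below applies plain absmax integer quantization to a handful of made-up weights. Note that this is a deliberately simple scheme, not the NF4 data type that QLoRA itself uses; the function names and sample values are assumptions.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    q = [round(v / scale) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


weights = [0.12, -0.50, 0.33, 0.07, -0.21]  # made-up example weights
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is bounded by half the scale.
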
How to Use in MLLMs...?

So how can this be applied to MLLMs? There are many kinds of MLLMs, but VLMs make the easiest example.
Q-LoRA and LoRA are PEFT methods, so they apply to LLMs and MLLMs alike.
Taking a VLM (Vision Encoder + LLM Decoder) as the reference:

To strengthen language ability, apply PEFT to the LLM only.

To strengthen visual ability, apply PEFT to the Vision Encoder only.

To strengthen both, apply PEFT to both the Encoder and the Decoder.

Reference Code:

A Definitive Guide to QLoRA: Fine-tuning Falcon-7b with PEFT (medium.com)

Finetuning Llama2 with QLoRA, TorchTune documentation (pytorch.org)

See also: https://github.com/V2LLAIN/Transformers-Tutorials/blob/master/qlora_baseline.ipynb

What is DeepSpeed?

# finetune_qlora.sh

deepspeed ovis/train/train.py \
        --deepspeed scripts/zero2.json \
        ...
Sticking to my own approach is fine, of course, but since most users seem to launch training this way, it is worth understanding, so let's take a look.

deepspeed...?

A deep learning optimization library from Microsoft that makes model training and inference faster and more efficient.

Types of training device setups:

CPU
Single GPU
1 Node, Multi GPU
Multi Node, Multi GPU --> used to train very large models such as GPT-4.

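Distributed training approaches:

Data Parallel: one device splits the data, and the results from each device are gathered back on it for computation
--> that one device uses more memory than the others, so a memory imbalance arises!

Distributed Data Parallel: each device is treated as its own process, and each process loads its own copy of the model.
Gradients are synchronized internally only during the backward pass --> no memory imbalance❌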
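
The gradient synchronization in Distributed Data Parallel can be imitated in plain Python. This is a toy simulation, not PyTorch's DDP API: the "processes" are list entries, `all_reduce_mean` is a hypothetical stand-in for the real all-reduce, and the model is a single weight `w` fitted to made-up data.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every process ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
world_size = 2
lr = 0.01

# Each process holds its own replica of the model and an equal-sized shard of the batch.
replicas = [0.0 for _ in range(world_size)]
shards = [(xs[i::world_size], ys[i::world_size]) for i in range(world_size)]

# Backward pass: local gradients, then synchronization across processes.
local_grads = [grad_mse(w, sx, sy) for w, (sx, sy) in zip(replicas, shards)]
synced_grads = all_reduce_mean(local_grads)
replicas = [w - lr * g for w, g in zip(replicas, synced_grads)]

# For comparison: the gradient a single device would compute on the full batch.
full_batch_grad = grad_mse(0.0, xs, ys)
```

Because every process applies the same averaged gradient, the replicas stay identical without any one device having to gather all the results.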

cf) Requirements:
- PyTorch must be installed before installing DeepSpeed.
- For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
- A CUDA or ROCm compiler such as nvcc or hipcc used to compile C++/CUDA/HIP extensions.
- Specific GPUs we develop and test against are listed below, this doesn't mean your GPU will not work if it doesn't fall into this category it's just DeepSpeed is most well tested on the following:
        NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
        AMD: MI100 and MI200
DeepSpeed can be installed with:
pip install deepspeed
Usage is as follows.

Usage:
Step1) Write a config.json file for DeepSpeed
{
    "train_micro_batch_size_per_gpu": 160,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "contiguous_gradients": false
    }
}
Config arguments: https://www.deepspeed.ai/docs/config-json/
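
As a quick sanity check on the batch settings above, the effective global batch size is the per-GPU micro batch times the accumulation steps times the number of GPUs. The helper below is a hypothetical convenience function, and the 4-GPU setup is an assumption for illustration.

```python
import json

config_text = """
{
    "train_micro_batch_size_per_gpu": 160,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 1}
}
"""


def effective_batch_size(config: dict, num_gpus: int) -> int:
    """micro batch per GPU x accumulation steps x number of GPUs."""
    micro = config["train_micro_batch_size_per_gpu"]
    accum = config.get("gradient_accumulation_steps", 1)
    return micro * accum * num_gpus


config = json.loads(config_text)
batch_on_4_gpus = effective_batch_size(config, num_gpus=4)  # 4 GPUs is an assumption
print(batch_on_4_gpus)
```
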

Step2) import & read json
import json

import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Load the DeepSpeed settings written in Step1
with open('config.json', 'r') as f:
    deepspeed_config = json.load(f)
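Step3) Set up the optimizer & initialize the model and optimizer
lr = 0.001  # same learning rate as in config.json
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
model, optimizer, _, _ = deepspeed.initialize(model=model,
                                            model_parameters=model.parameters(),
                                            optimizer=optimizer,
                                            config_params=deepspeed_config)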
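cf) The DeepSpeed arguments can also be added to an ArgumentParser!
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)

# Adds DeepSpeed's own command-line arguments (e.g. --deepspeed, --deepspeed_config)
parser = deepspeed.add_config_arguments(parser)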
Step4) Train!
# Run directly from the command line
deepspeed --num_gpus={number of GPUs} train_deepspeed.py

# Or put the same command in train.sh
deepspeed --num_gpus={number of GPUs} train_deepspeed.py
# >> bash train.sh
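Caution!)

With the DeepSpeed launcher, specific GPUs cannot be selected through CUDA_VISIBLE_DEVICES!
Specific GPUs can only be selected with --include, as shown below.
deepspeed --include localhost:<GPU_NUM1>,<GPU_NUM2> <python_file.py>
The more GPU nodes there are, the more DeepSpeed's advantage in training speed shows!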


[Gain Study_RL]: Reinforcement Learning(1)

V2LLAIN — Sun, 2 Jun 2024 20:34:36 +0900

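Reinforcement Learning

What is reinforcement learning?

"Sequential decision making" in which an Agent takes Actions in an environment so as to maximize cumulative reward.

∙ Sequential Decision Process: a reward may only arrive after several steps = cumulated rewards
state → action & reward → state → ...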

Exploitation & Exploration
Exploitation: take the best action based on experience (short-term benefit)
Exploration: try unknown actions to gain new information (long-term benefit)
ex) Exploitation: go to your usual restaurant. / Exploration: try a restaurant you have never been to.
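
One common way to balance the two is an epsilon-greedy rule, sketched below. This is an illustration rather than something prescribed by these notes: the function name, the value estimates for the three "restaurants", and the epsilon values are all assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action), otherwise exploit (best known action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [1.0, 5.0, 2.0]  # hypothetical value estimates for three restaurants

# epsilon = 0: pure exploitation, always the usual restaurant (index 1).
greedy_choices = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]

# epsilon = 0.3: mostly the usual restaurant, sometimes a new one.
mixed_choices = [epsilon_greedy(q_values, 0.3, rng) for _ in range(1000)]
```
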

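Reward Hypothesis: the foundation of reinforcement learning theory.
A reward is a "feedback signal" that tells the Agent how good the Action it took was.
With a suitable reward function, "the Agent achieves its goal by maximizing expected cumulative reward (Cumulated Rewards Maximization)".

What is the ultimate goal of RL?
Model: Markov Decision Process
Ultimate goal: find the Optimal Policy that achieves that objective (= max cum_Reward).
cf) In reinforcement learning, the supervision comes from a person providing the reward.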

Markov Property

MDP (Markov Decision Process)
Markov Property: the future is predicted from the current state alone (past states have no influence)
ex) today's weather -> tomorrow's weather

RL assumes the Markov Property, and in discrete time in particular it assumes a "Markov Chain"!
That is, a Markov process that satisfies the Markov property in discrete time is called a Markov chain.
cf) Discrete time: time advances in discrete steps
cf) Stochastic process: a process in which the probability of an event occurring changes over time.
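
The weather example can be written as a two-state Markov chain. The transition probabilities below are made up for illustration; the point is that tomorrow's distribution depends only on today's.

```python
# Hypothetical transition probabilities: P[today][tomorrow]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step(distribution, P):
    """One step of the chain: next[s2] = sum over s1 of distribution[s1] * P[s1][s2]."""
    nxt = {s: 0.0 for s in P}
    for s1, p1 in distribution.items():
        for s2, p12 in P[s1].items():
            nxt[s2] += p1 * p12
    return nxt


today = {"sunny": 1.0, "rainy": 0.0}  # we know it is sunny today
tomorrow = step(today, P)
day_after = step(tomorrow, P)
```
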

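An MDP (Markov Decision Process) is defined by the tuple <S,A,P,R,γ>: states S, actions A, transition probabilities P, reward function R, and discount factor γ.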

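Episode and Return
Episode: Start State ~ Terminal State

Return: the sum of all Rewards received by the end of the Episode
∴ Maximize Return = the Agent's Goal
∴ Finding the Optimal Policy that maximizes Cumulated Reward = the Goal of RL

Return for a Continuing Task...?

Use an infinite series and a Discounting Factor (γ).
--> γ=0: an Agent that cares about short-term reward
--> γ=1: an Agent that cares about long-term reward

Discounted Return

The Return computed with a discount factor γ∈[0,1], so that the reward received now and the rewards received in the future are weighted differently.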
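
In the standard formulation the discounted return is G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... A minimal sketch, with a made-up reward sequence, shows how γ shifts the Agent between short-term and long-term focus:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0, 10.0]  # hypothetical rewards along one episode

myopic = discounted_return(rewards, gamma=0.0)      # only the first reward counts
farsighted = discounted_return(rewards, gamma=1.0)  # plain sum of all rewards
in_between = discounted_return(rewards, gamma=0.5)
```
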

Policy and Value

Policy & Value Function
Policy: a function that decides which action to take in a given state
Value Function: a function that tells how "good" each state or action is, measured by Return (i.e., reward → return → value)

Deterministic Policy: one state - one action (the state reached when learning has finished.)
Stochastic Policy: one state → the action is chosen according to probabilities. (suited to learning.)

State-Value Function: given a policy, the expected return G received from an arbitrary starting state s until the end
Action-Value Function: finds which action is best in a state by computing a value for each action; also called the Q-Value.

Bellman Equation
Bellman Equation: can we predict whether a state is good without finishing the whole episode?
State-value Bellman Equation: the immediate reward plus the discounted value of future states
Action-value Bellman Equation: the immediate reward plus the discounted value of future actions

Bellman Expectation Equation: given a policy, the equation for the value of a particular state (or state-action pair).
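
Written out in the usual textbook notation, the two expectation equations look as follows. The symbols π(a|s), P(s'|s,a) and R(s,a,s') are notation assumed here; they are not defined elsewhere in these notes.

```latex
\begin{align}
v_\pi(s) &= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr] \\[4pt]
q_\pi(s, a) &= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\ A_t = a \right] \\
            &= \sum_{s'} P(s' \mid s, a)\,\Bigl[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
\end{align}
```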

Bellman Optimality Equation: the goal of RL is to find the optimal policy that yields the maximum reward.
Optimal Policy
= the policy when the Agent has achieved its Goal
= the policy obtained by choosing the best action in each state
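
A small value-iteration sketch shows how repeatedly applying the Bellman optimality backup yields the optimal values and, from them, the optimal policy. Value iteration is one dynamic-programming method chosen here for illustration; the 3-state chain and its rewards are hypothetical.

```python
# Hypothetical 3-state chain: state 2 is terminal; reaching it pays reward 1.
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},  # terminal
}
gamma = 0.9


def q_value(s, a, v):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a])


def value_iteration(tol=1e-10):
    """Apply v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(q_value(s, a, v) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


v_star = value_iteration()

# The optimal policy picks the best action in each non-terminal state.
policy = {
    s: max(actions, key=lambda a: q_value(s, a, v_star))
    for s, actions in transitions.items()
    if actions
}
```
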

Value Function Estimation (Planning and Learning)

Planning: knowing the model and using it to find the optimal action - ex) Dynamic Programming
Learning: learning from samples (= without knowing the model) - ex) MC, TD
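
For the Learning side, the two sample-based updates can be sketched as below, assuming the standard tabular forms of Monte Carlo and TD(0). The episode, step size `alpha`, and state names are made up for illustration.

```python
def mc_update(v, s, g, alpha):
    """Monte Carlo: move V(s) toward the full return G observed after visiting s."""
    v[s] += alpha * (g - v[s])


def td0_update(v, s, r, s_next, alpha, gamma):
    """TD(0): move V(s) toward the one-step target r + gamma * V(s_next)."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])


gamma, alpha = 0.9, 0.5

# One sampled episode (hypothetical): A --(r=0)--> B --(r=1)--> end
episode = [("A", 0.0, "B"), ("B", 1.0, "end")]

# TD(0) updates after every step, without waiting for the episode to finish.
v_td = {"A": 0.0, "B": 0.0, "end": 0.0}
for s, r, s_next in episode:
    td0_update(v_td, s, r, s_next, alpha, gamma)

# MC waits for the episode to end, then uses the actual return from each state.
v_mc = {"A": 0.0, "B": 0.0, "end": 0.0}
g = 0.0
for s, r, _ in reversed(episode):
    g = r + gamma * g
    mc_update(v_mc, s, g, alpha)
```

Neither update needs the transition probabilities; both work only from the sampled (state, reward, next state) steps.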

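Name	Accuracy	Speed	Characteristics
IndexFlatL2	Highest	Slowest	Performs an exhaustive search over all vectors
IndexHNSW	High	Moderate	Uses a graph structure for efficient search
IndexIVFFlat	Moderate	Fastest	Clusters the vectors to narrow the search range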
