A transformers Primer for HuggingFace (🤗) Beginners

V2LLAIN 2024. 11. 18. 14:54


Preview:

Machine Learning vs Deep Learning

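Machine learning: a branch of AI that learns patterns from data. (It can work even with relatively small amounts of structured data.)

Deep learning: a branch of machine learning that uses complex model structures and requires substantial compute and data.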

Transformer(Attention is All You Need-2017)

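Core ideas of the Transformer model:

∙ Processes the input sequence in parallel
∙ Uses only the attention mechanism (self-attention)
∙ Uses no sequential processing, recurrent connections, or recursion ❌

Transformer model structure:
∙ Embedding: maps each token to a dense vector (the model is trained so that semantic similarity between words is preserved).
∙ Positional Encoding: adds the position of each token in the input sequence.
∙ Encoder & Decoder: take the Embedding + PE vectors as input (the vector values come from pretrained weights or are optimized during training).
- MHA & FFN: multi-head attention learns the relations between tokens, and the FFN extracts a feature vector for each token (each head has its own weights, so the heads can capture different aspects of the input sequence).
- QKV: Query (vector for what the current position is looking for), Key (vector describing what each position offers), Value (vector holding each position's content)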

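ex) The student studies at the home
query: student
--> Q: the vector used to gather information for "student"
--> K: the vectors of The, student, studies, at, the, home (in self-attention every position, including the query token itself, supplies a key)
--> V: the vectors carrying each token's context and meaning
--> With 3 heads: each head may end up focusing on a different aspect, e.g. syntax, semantics, or pragmatics.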
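
The Q/K/V description above can be sketched as scaled dot-product attention for a single query. This is a minimal pure-Python illustration, not the library implementation; the 2-dimensional vectors are made-up values chosen only to show how the weights are formed.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Similarity of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # The output is the attention-weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors for three token positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights, output)
```

Keys that point in the same direction as the query receive larger weights, so their values dominate the output. Multi-head attention repeats this with separate projection weights per head.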

Huggingface Transformers

Library Overview

Tokenizer
Tokenization usually splits text into subword tokens.
The tokenizer also provides preprocessing features such as text normalization, padding, and truncation.
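
As a rough illustration of subword splitting, the sketch below does a greedy longest-match split in the style of WordPiece. The function name and the tiny vocabulary are hypothetical; real tokenizers use a learned vocabulary with tens of thousands of entries and extra normalization steps.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary.
toy_vocab = {"token", "##izer", "##s", "learn", "##ing"}

print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("learning", toy_vocab))    # ['learn', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```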

Diffusers Library

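A library for text-to-image generation and related tasks that supports a range of generative models such as Stable Diffusion, LDM, and DALL-E-style models (e.g. unCLIP).
- Provides a range of diffusion-based algorithms such as DDPM, DDIM, and LDM
- Supports batch inference, parallelism, mixed-precision training, etc.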

Accelerate

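Wraps distributed-training strategies behind a simple API and "automatically supports" lower-precision mixed-precision training such as FP16 and BF16.
- Distributed training: supports data parallel, model parallel, etc.
- Automatic Mixed Precision: automatically mixes data types such as FP16 and FP32, reducing memory use and increasing speed

- Gradient Accumulation: accumulates the gradients of several mini-batches to get the effect of a larger batch
- Gradient Checkpointing: recomputes intermediate activations when they are needed instead of storing them (trading compute for memory)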
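
Why gradient accumulation reproduces a large batch can be checked with plain arithmetic. The sketch below assumes a one-parameter model y = w * x with mean squared error and equally sized micro-batches; the data and the helper name are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical toy data following y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

# Gradient computed on the full batch of 4 samples.
full_grad = grad_mse(w, data)

# Same data split into 2 micro-batches; each gradient is scaled by
# 1 / (number of micro-batches) and summed before the optimizer step.
micro_batches = [data[0:2], data[2:4]]
accumulated = 0.0
for mb in micro_batches:
    accumulated += grad_mse(w, mb) / len(micro_batches)

print(full_grad, accumulated)
```

The two values agree, which is why accumulating over N micro-batches behaves like a batch N times larger while only one micro-batch has to fit in memory at a time.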

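Model Configuration
A model configuration class stores the model architecture and hyperparameter values as a "dictionary" serialized to a JSON file.
So when a model is loaded, these values are loaded together with the model weights. (As in the screenshot below.)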

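PretrainedConfig & ModelConfig

Likewise, these hold a dictionary that stores the model architecture and hyperparameters.
[Example arguments]:
- vocab_size: number of unique tokens the model can recognize
- output_hidden_states: whether to return the hidden states of all layers
- output_attentions: whether to return the attention weights of all layers
- return_dict: whether the model returns a ModelOutput object instead of a plain tuple.
Each model architecture provides its own configuration class that inherits from PretrainedConfig.
(e.g. BertConfig, GPT2Config, or as in the screenshot below...)

Example of instantiating and setting up InternVisionConfig directly

Because wrong configuration values can severely degrade model performance, the usual practice is to load validated pretrained settings with "from_pretrained".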
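
The idea of a configuration stored as a dictionary in a JSON file can be sketched without the library. `ToyConfig`, its fields, and its `save`/`load` methods are hypothetical stand-ins, not the transformers API; they only mimic the save-then-restore round trip.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict

@dataclass
class ToyConfig:
    # Hypothetical hyperparameters, loosely modeled on a BERT-like config.
    vocab_size: int = 30522
    hidden_size: int = 768
    num_hidden_layers: int = 12
    output_hidden_states: bool = False

    def save(self, path):
        # Serialize the config dictionary to JSON.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path):
        # Rebuild the config object from the stored dictionary.
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    ToyConfig(num_hidden_layers=6).save(path)
    restored = ToyConfig.load(path)

print(restored)
```

Values that were overridden before saving come back unchanged, and untouched fields keep their defaults.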

PreTrainedTokenizer & ModelTokenizer & PreTrainedTokenizerFast

[Example arguments]:
- max_model_input_sizes: maximum input length of the model
- model_max_length: maximum input length the tokenizer uses for the model
(That is, the tokenizer's model_max_length should be set no larger than the model's max_model_input_sizes so that the model can process its inputs correctly.)
- padding_side/truncation_side: side (left/right) on which padding/truncation is applied
- model_input_names: list of tensors fed to the forward pass (e.g. input_ids, attention_mask, token_type_ids)

cf) The decode method converts a sequence of token ids back into the original text.
cf) PreTrainedTokenizerFast is backed by a Rust implementation called through a Python wrapper, which makes it a faster tokenizer.
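
The effect of padding_side and truncation_side can be sketched with a small helper. `pad_or_truncate` is a hypothetical function written for this note, not a tokenizer method; it only shows how the ids and the attention mask change with each setting.

```python
def pad_or_truncate(ids, max_length, pad_id=0, padding_side="right", truncation_side="right"):
    """Return (input_ids, attention_mask) of exactly max_length entries."""
    if len(ids) > max_length:
        # Keep the first tokens when truncating on the right, the last ones otherwise.
        ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
    mask = [1] * len(ids)  # 1 marks real tokens
    n_pad = max_length - len(ids)
    if padding_side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    return [pad_id] * n_pad + ids, [0] * n_pad + mask

print(pad_or_truncate([101, 7, 102], 5))
print(pad_or_truncate([101, 7, 102], 5, padding_side="left"))
print(pad_or_truncate([1, 2, 3, 4], 2, truncation_side="left"))
```

Left padding is commonly chosen for decoder-only generation so that the last position of every row holds a real token.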

Datasets

Dataset upload example:
Layout of the images directory:
images
⎿ A.jpg
⎿ B.jpg
  ...

import os
from collections import defaultdict
from datasets import Dataset, Image, DatasetDict

data = defaultdict(list)
folder_name = '../images'

for file_name in os.listdir(folder_name):
    name = os.path.splitext(file_name)[0]
    path = os.path.join(folder_name, file_name)
    data['name'].append(name)
    data['image'].append(path)

dataset = Dataset.from_dict(data).cast_column('image', Image())
# print(data, dataset[0]) # sanity check

dataset_dict = DatasetDict({
    'train': dataset.select(range(5)),
    'valid': dataset.select(range(5, 10)),
    'test': dataset.select(range(10, len(dataset)))
    }
)

hub_name = "<user_name>/<repo_name>" # repository the dataset is pushed to
token = "hf_###..." # your Hugging Face access token
dataset_dict.push_to_hub(hub_name, token=token)
🤗 The Embedding Process, Fully Explained!!
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

txt = "I am laerning about tokenizers."
input = tokenizer(txt, return_tensors="pt")
output = model(**input)

print('input:', input)
print('last_hidden_state:', output.last_hidden_state.shape)
input: {'input_ids': tensor([[  101,  1045,  2572,  2474, 11795,  2075,  2055, 19204, 17629,  2015,  1012,   102]]), 
        'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
        'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
        
last_hidden_state: torch.Size([1, 12, 768])
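The input dictionary

input_ids:

Each word and special token converted to its unique integer ID in BERT's vocabulary.

Example: 101 is the [CLS] token and 102 is the [SEP] token.

Full sequence: [CLS] I am laerning about tokenizers. [SEP]

Length: 12 tokens (the whole sentence plus the special tokens; words such as the misspelled "laerning" and "tokenizers" are split into several subword tokens, which is why there are more tokens than words)

token_type_ids:

Indicates which segment each token of the input belongs to.

BERT can distinguish two segments (e.g. sentence A and sentence B).

The input here is a single sentence, so every value is 0.

attention_mask:

Indicates whether the model should attend to each token.

1 means the token is real data, and 0 means it is a padding token.

There is no padding here, so every value is 1.

last_hidden_state

Shape: [1, 12, 768]

Batch Size (1): one input sentence is processed at a time.

Sequence Length (12): number of tokens in the input sequence (including special tokens).

Hidden Size (768): the BERT-base model produces a 768-dimensional vector representation for each token.

Meaning:

last_hidden_state holds the vector representation of each token from the model's last hidden layer.

These vectors carry contextual information and can be used for various NLP tasks (e.g. classification, named entity recognition).

Explanation)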

ex-1) Embedding lookup table code
import torch

train_data = 'you need to know how to code'

# Build the vocabulary: the set of words with duplicates removed.
word_set = set(train_data.split())

# Map each word in the vocabulary to a unique integer.
vocab = {word: i+2 for i, word in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1
print(vocab) # e.g. {'need': 2, 'to': 3, 'code': 4, 'how': 5, 'you': 6, 'know': 7, '<unk>': 0, '<pad>': 1} (set order can differ between runs)

# Create a table with as many rows as the vocabulary size.
embedding_table = torch.FloatTensor([[ 0.0,  0.0,  0.0],
                                    [ 0.0,  0.0,  0.0],
                                    [ 0.2,  0.9,  0.3],
                                    [ 0.1,  0.5,  0.7],
                                    [ 0.2,  0.1,  0.8],
                                    [ 0.4,  0.1,  0.1],
                                    [ 0.1,  0.8,  0.9],
                                    [ 0.6,  0.1,  0.1]])

sample = 'you need to run'.split()
idxes = []

# Convert each word to an integer
for word in sample:
  try:
    idxes.append(vocab[word])
  # Words missing from the vocabulary are replaced with <unk>.
  except KeyError:
    idxes.append(vocab['<unk>'])
idxes = torch.LongTensor(idxes)

# Use each integer as an index to fetch rows from the embedding table.
lookup_result = embedding_table[idxes, :]
print(lookup_result)
print(lookup_result.shape)
# tensor([[0.1000, 0.8000, 0.9000],
#         [0.2000, 0.9000, 0.3000],
#         [0.1000, 0.5000, 0.7000],
#         [0.0000, 0.0000, 0.0000]])
# torch.Size([4, 3])
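ex-2) Comparing the embedding lookup table code with nn.Embedding
import torch.nn as nn

train_data = 'you need to know how to code'

# Build the vocabulary: the set of words with duplicates removed.
word_set = set(train_data.split())

# Map each word in the vocabulary to a unique integer.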
vocab = {tkn: i+2 for i, tkn in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3, padding_idx=1)
print(embedding_layer.weight)
print(embedding_layer)

# tensor([[ 0.7830,  0.2669,  0.4363],
#         [ 0.0000,  0.0000,  0.0000],
#         [-1.2034, -0.0957, -0.9129],
#         [ 0.7861, -0.0251, -2.2705],
#         [-0.5167, -0.3402,  1.3143],
#         [ 1.7932, -0.6973,  0.5459],
#         [-0.8952, -0.4937,  0.2341],
#         [ 0.3692,  0.0593,  1.0825]], requires_grad=True)
# Embedding(8, 3, padding_idx=1)
ex-3) Example embedding code
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.word_embeddings = nn.Embedding(config.vocab_size, config.emb_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_seq_length, config.emb_size)
        self.token_type_embeddings = nn.Embedding(2, config.emb_size)
        self.LayerNorm = nn.LayerNorm(config.emb_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.dropout)
        
        # position ids (used in the pos_emb lookup table) that we do not want updated through backpropagation
        self.register_buffer("position_ids", torch.arange(config.max_seq_length).expand((1, -1)))

    def forward(self, input_ids, token_type_ids):
        word_emb = self.word_embeddings(input_ids)
        # slice the position ids to the actual sequence length so inputs shorter than max_seq_length still broadcast
        seq_len = input_ids.size(1)
        pos_emb = self.position_embeddings(self.position_ids[:, :seq_len])
        type_emb = self.token_type_embeddings(token_type_ids)

        emb = word_emb + pos_emb + type_emb
        emb = self.LayerNorm(emb)
        emb = self.dropout(emb)

        return emb

NLP

BERT - Classification

NER (Named Entity Recognition)
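Token classification, i.e. the task of assigning a label to each token that makes up a sentence.
BIO tagging is a typical example:
ex) 인공지능(AI)에서 딥러닝은 머신러닝의 한 분야입니다. ("In artificial intelligence (AI), deep learning is a branch of machine learning.")
--> <인공지능:B-Tech> <(:I-Tech> <AI:I-Tech> <):I-Tech> <에서:O> <딥러닝:B-Tech> <은:O> <머신러닝:B-Tech> <의:O> <한:O> <분야:O> <입니다:O> <.:O>
Here B means Begin (the start of an entity), I means Inside (the continuation of an entity), and O means Outside (not part of an entity).
BERT is one of the models most often used for NER.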
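
To make the B/I/O convention concrete, the sketch below groups the tagged tokens of the example sentence back into entity spans. `bio_to_spans` is a hypothetical helper written for this note; it treats an I- tag without a matching preceding B- tag as outside.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, [tokens]) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any.
            if current:
                spans.append((current_type, current))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current:
                spans.append((current_type, current))
            current, current_type = [], None
    if current:
        spans.append((current_type, current))
    return spans

tokens = ["인공지능", "(", "AI", ")", "에서", "딥러닝", "은", "머신러닝", "의", "한", "분야", "입니다", "."]
tags = ["B-Tech", "I-Tech", "I-Tech", "I-Tech", "O", "B-Tech", "O", "B-Tech", "O", "O", "O", "O", "O"]

spans = bio_to_spans(tokens, tags)
print(spans)
```

For the example this yields three Tech entities: the first spans four tokens, the other two are single tokens.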

BERT - MLM, NSP
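BERT uses a [CLS] token, which serves to capture sentence-level information such as the relation between sentences (summarization, etc.).
BERT uses three embeddings in its embedding layer:
1. Token Embedding:
- embedding of the input tokens
2. Segment Embedding:
- assigns a fixed number to each sentence so the model can tell the sentences apart.
3. Position Embedding:
- the input is fed in all at once (= not sequentially)
- so the Transformer encoder does not know the order of the tokens
- position information is added to compensate; the original Transformer uses fixed sine/cosine encodings, whereas BERT learns its position embeddings.
Recommended lecture) https://www.youtube.com/watch?app=desktop&v=CiOL2h1l-EE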

BART - Summarization

Abstractive & Extractive Summarization
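Abstractive summarization: understands the source text as a whole --> generates new sentences to summarize it.
Extractive summarization: selects only the most important and relevant sentences from the source and extracts them verbatim.
(The summary may read unnaturally, and the method needs a way to remove redundancy.)

BART (Bidirectional and Auto-Regressive Transformers)
It has both an encoder and a decoder, and in particular uses a shared embedding layer, which strengthens the link between the two.
The encoder is bidirectional, so each word can attend to both the left and right context of the whole sentence,
and the decoder is auto-regressive, predicting the next word from the words generated so far.
In pre-training it is trained as a denoising autoencoder: the input is corrupted with random noise and the model learns to reconstruct the original.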

RoBERTa, T5- TextQA

Abstractive & Extractive QA
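Extractive QA: extracts the answer string from within the given passage (question - passage - answer span taken from the passage)
Abstractive QA: takes the question and passage as input and generates a new answer (question - passage - answer)

RoBERTa
The sequence length and the amount of pre-training data were increased, and the key change is dynamic masking.
Dynamic Masking: a different masking pattern is generated each epoch. (NSP was removed.)
Uses BPE tokenization, whereas BERT uses WordPiece tokenization.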
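
The difference between static and dynamic masking can be sketched as follows. This is a simplified illustration: it masks a fixed number of tokens per sequence, whereas BERT-style pre-training masks about 15% of tokens with additional replacement rules, and `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, num_masks, rng):
    """Replace num_masks randomly chosen positions with [MASK]."""
    positions = set(rng.sample(range(len(tokens)), num_masks))
    return ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the student studies at the home".split()

# Static masking: the pattern is sampled once during preprocessing
# and the same masked sequence is reused in every epoch.
static = mask_tokens(tokens, 1, random.Random(0))
static_epochs = [static for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled every time the
# sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 1, rng) for _ in range(3)]

for epoch, masked in enumerate(dynamic_epochs):
    print(epoch, masked)
```

With dynamic masking the model can see different masked positions of the same sentence across epochs, which effectively enlarges the training signal.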

T5- Machine Translation

SMT & NMT

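Statistical machine translation (SMT): built on source-translation pairs; learns word order and language patterns statistically --> requires large-scale data
Neural machine translation (NMT): a neural network learns the relation between the source word sequence and its translation

T5 (Text-To-Text Transfer Transformer)
A task-specific prompt format steers the model toward the appropriate output.
In other words, it is based on a seq2seq architecture in which a single model can handle a variety of NLP tasks.

What is distinctive about T5 is not the model architecture itself but that it treats both input and output as text in a seq2seq formulation and pre-trains with "unsupervised learning" on a large corpus (C4, about 750GB). This lets it effectively acquire general language patterns and knowledge.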

LLaMA - Text Generation

Seq2Seq & CausalLM
Seq2Seq: encoder-decoder architectures such as the Transformer, BART, and T5
CausalLM: consists of a decoder only

LLaMA-3 Family
LLaMA-3 was released in April 2024 and uses GQA (Grouped Query Attention) to speed up inference.
LLaMA-3 performs strongly at both in-context learning and few-shot learning.
In-context learning: the model's ability to perform a new task on the fly based on the input text

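In addition, LLaMA-3.1 was released in July 2024. LLaMA-3.1 added tools for AI safety and security, introducing Prompt Guard, which guards against prompt injection, to help identify harmful or inappropriate content.
The LLaMA-3 series also has the following key features:
- RoPE (Rotary Position Embedding): applied to Q and K
- GQA (Grouped Query Attention): groups of query heads share a single K/V head --> smaller KV cache and more efficient inference
- RMS Norm: stable training and computational efficiency
- KV cache: caches K and V during inference --> avoids recomputing them for earlier tokens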
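
Two of the items above are simple enough to sketch in plain Python: RMS normalization and the query-head to K/V-head mapping used by GQA. The function names and the head counts (32 query heads sharing 8 K/V heads) are illustrative assumptions, not values read from a model config.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by its root mean square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """In GQA, consecutive query heads share one key/value head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

x = [3.0, -4.0, 0.0, 5.0]
normed = rms_norm(x, weight=[1.0] * len(x))
print(normed)

# Illustrative sizes: 32 query heads sharing 8 K/V heads (groups of 4).
mapping = [kv_head_for_query_head(h, 32, 8) for h in range(32)]
print(mapping)
```

Unlike LayerNorm, RMS normalization does not subtract the mean, which saves a reduction per call. With grouped heads the KV cache stores only one K/V pair per group rather than one per query head.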

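LLaMA-3 optimization techniques: SFT . RLHF . DPO
SFT (Supervised Fine-Tuning): trains the model directly on high-quality human-written QA pairs
RLHF: based on the PPO algorithm; humans rank several responses generated by the model, and the model is retrained on that feedback.
DPO (Direct Preference Optimization): reduces the complexity of RLHF while still enabling effective training. (It learns directly from the human response rankings, but requires higher-quality preference data.)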
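
How DPO "learns directly from the rankings" can be sketched with its per-pair loss, -log sigmoid(beta * margin), where the margin compares how much the policy has raised the log-probability of the preferred response relative to a frozen reference model. The log-probability values below are made up, and beta=0.1 is just an example setting.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log(1.0 + math.exp(-beta * margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the chosen response and lowers the rejected one: the loss drops.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)

print(baseline, improved)
```

No reward model or PPO rollout is involved: the preference pair itself supplies the training signal, which is where the reduced complexity compared with RLHF comes from.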

Computer Vision

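In computer vision (CV), CNN-based methods have been the most widely used. (VGG, Inception, ResNet, ...)
However, CNN-based models mainly learn local patterns, which limits their ability to model global relations.
In addition, computational cost grows as the image size increases.

ViT (Vision Transformer) was proposed to address this, and it learns efficiently from large-scale datasets.
Let us look at CLIP, OWL-ViT, and SAM, three representative ViT-based models.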

Zero shot classification

Zero Shot Classification: CLIP, ALIGN, SigLIP
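CLIP is used for many tasks, but this post looks at it as a way to perform image classification on labels that are not in the training dataset.
A conventional classifier has to be retrained every time a new label is added, so zero-shot methods are close to essential for avoiding that.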

CLIP (Contrastive Language-Image Pre-training)

Model	Architecture	Input_Size	Patch_Size	#params
openai/clip-vit-base-patch32	ViT-B/32	224×224	32×32	~0.15B
openai/clip-vit-base-patch16	ViT-B/16	224×224	16×16	~0.15B
openai/clip-vit-large-patch14	ViT-L/14	224×224	14×14	~0.43B
openai/clip-vit-large-patch14-336	ViT-L/14	336×336	14×14	~0.43B

Small patch_size: finer-grained feature extraction, higher memory use and compute time
Large patch_size: relatively lower accuracy, faster processing
Blue blocks: positive samples, white blocks: negative samples

Unlike conventional supervised learning, CLIP has two distinguishing features:
1. It trains only on image-text pairs as input, without separate labels.
- Images and text are projected into the same embedding space
- This lets the semantic similarity between the two modalities be measured and learned directly
- This is why CLIP has both an image encoder and a text encoder
2. Contrastive Learning:
- "Positive sample": a real image-text pair --> maximize the semantic similarity between image and text
- "Negative sample": a randomly paired, mismatched image-text pair --> minimize the similarity
- A contrastive loss based on cosine similarity is used for this.
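
The contrastive objective described above can be sketched as a symmetric cross-entropy over a cosine-similarity matrix, where the i-th image and the i-th text form the positive pair and every other pairing in the batch is a negative. This is a pure-Python illustration with made-up 2-d embeddings; the fixed temperature of 0.07 is an assumption for the example (CLIP learns this value during training).

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed with the log-sum-exp trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    n = len(imgs)
    # Image-to-text direction: each row should pick its own column.
    loss_i = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should pick its own row.
    loss_t = sum(cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return (loss_i + loss_t) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])     # matching pairs on the diagonal
mismatched = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
print(aligned, mismatched)
```

The loss is close to zero when matching pairs are the most similar and large when they are not, which is exactly the pressure that pulls positives together and pushes negatives apart.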

Zero-Shot Classification Example
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dataset = load_dataset("sasha/dog-food")
images = dataset['test']['image'][:2]
labels = ['dog', 'food']
inputs = processor(images=images, text=labels, return_tensors="pt", padding=True)

print('input_ids: ', inputs['input_ids'])
print('attention_mask: ', inputs['attention_mask'])
print('pixel_values: ', inputs['pixel_values'])
print('image_shape: ', inputs['pixel_values'].shape)
# =======================================================
# input_ids:  tensor([[49406,  1929, 49407], [49406,  1559, 49407]])
# attention_mask:  tensor([[1, 1, 1], [1, 1, 1]])
# pixel_values:  tensor([[[[-0.0113, ...,]]]])
# image_shape:  torch.Size([2, 3, 224, 224])
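CLIPProcessor internally contains a CLIPImageProcessor and a CLIPTokenizer.
In input_ids, 49406 and 49407 are special values that mark <|startoftext|> and <|endoftext|>, respectively.
attention_mask marks which token positions hold real input:
a value of 1 means the token at that position is real data, and 0 means padding.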
with torch.no_grad():
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image
  probs = logits_per_image.softmax(dim=1)
  print('outputs:', outputs.keys())
  print('logits_per_image:', logits_per_image)
  print('probs: ', probs)

for idx, prob in enumerate(probs):
  print(f'- Image #{idx}')
  for label, p in zip(labels, prob):
    print(f'{label}: {p.item():.4f}')

# ============================================
# outputs: odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
# logits_per_image: tensor([[23.3881, 18.8604], [24.8627, 21.5765]])
# probs:  tensor([[0.9893, 0.0107], [0.9640, 0.0360]])
# - Image #0
# dog: 0.9893
# food: 0.0107
# - Image #1
# dog: 0.9640
# food: 0.0360
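
The step from logits_per_image to probs is a row-wise softmax. The short check below recomputes it in plain Python from the logits printed above; the values agree with the printed probabilities to about four decimal places.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above (rows: images, columns: 'dog', 'food').
logits_per_image = [[23.3881, 18.8604], [24.8627, 21.5765]]
probs = [softmax(row) for row in logits_per_image]

for idx, row in enumerate(probs):
    print(idx, [round(p, 4) for p in row])
```

Because softmax depends only on differences between logits, a gap of roughly 4.5 between 'dog' and 'food' is already enough to push the 'dog' probability close to 0.99.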

Zero shot Detection

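Natural-language descriptions implicitly contain the objects in an image and their approximate locations.
CLIP showed that associations between visual features and text can be learned from image-text pairs,
so at inference time a well-designed text prompt is enough to predict where an object is.
Zero-shot object detection therefore learns the correlation between vision and language without conventional annotations, which lets it detect new object classes.
OWL-ViT uses a CLIP model as its multi-modal backbone.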

OWLv2 (OWL-ViT)

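OWL-ViT architecture; OWLv2 adds an objectness classifier to the object detection head.

OWL-ViT is pre-trained on image-text pairs, which enables open-vocabulary object detection.
OWLv2 greatly improved performance with self-training.
That is, an existing detector automatically generates pseudo bbox annotations, used as weak supervision.
ex) input: image-text pair [a dog playing with a ball]
existing detector: automatically predicts [dog bbox] [ball bbox] and generates annotations
--> used for model training (i.e. there is no exact location information, but the partial, weak supervision signal lets the model infer and learn object locations and classes)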

Zero-Shot Detection Example

import io
from PIL import Image
from datasets import load_dataset
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")
dataset = load_dataset('Francesco/animals-ij5d2')
print(dataset)
print(dataset['test'][0])

# ==========================================================
# DatasetDict({
#     train: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 700
#     })
#     validation: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 100
#     })
#     test: Dataset({
#         features: ['image_id', 'image', 'width', 'height', 'objects'],
#         num_rows: 200
#     })
# })
# {'image_id': 63, 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2B0186E4A0>, 'width': 640, 'height': 640, 'objects': {'id': [96, 97, 98, 99], 'area': [138029, 8508, 10150, 20624], 'bbox': [[129.0, 291.0, 395.5, 349.0], [119.0, 266.0, 93.5, 91.0], [296.0, 280.0, 116.0, 87.5], [473.0, 284.0, 167.0, 123.5]], 'category': [3, 3, 3, 3]}}

- Label and Image Preprocessing

images = dataset['test']['image'][:2]
categories = dataset['test'].features['objects'].feature['category'].names
labels = [categories] * len(images)
inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)

print(images)
print(labels)
print('input_ids:', inputs['input_ids'])
print('attention_mask:', inputs['attention_mask'])
print('pixel_values:', inputs['pixel_values'])
print('image_shape:', inputs['pixel_values'].shape)

# ==========================================================
# [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CF790>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x640 at 0x7A2ADF7CCC10>]
# [['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk'], ['animals', 'cat', 'chicken', 'cow', 'dog', 'fox', 'goat', 'horse', 'person', 'racoon', 'skunk']]
# input_ids: tensor([[49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0],
#         [49406,  4995, 49407,     0],
#         [49406,  2368, 49407,     0],
#         [49406,  3717, 49407,     0],
#         [49406,  9706, 49407,     0],
#         [49406,  1929, 49407,     0],
#         [49406,  3240, 49407,     0],
#         [49406,  9530, 49407,     0],
#         [49406,  4558, 49407,     0],
#         [49406,  2533, 49407,     0],
#         [49406,  1773,  7100, 49407],
#         [49406, 42194, 49407,     0]])
# attention_mask: tensor([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#         [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0], 
#          [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0],
#           [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]])
# pixel_values: tensor([[[[ 1.5264, ..., ]]]])
# image_shape: torch.Size([2, 3, 960, 960])

- Detection & Inference

import torch

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.keys()) # odict_keys(['logits', 'objectness_logits', 'pred_boxes', 'text_embeds', 'image_embeds', 'class_embeds', 'text_model_output', 'vision_model_output'])

- Post Processing

import matplotlib.pyplot as plt
from PIL import ImageDraw, ImageFont

# Extract the objects with high predicted probability
# target_sizes expects (height, width) for each image
shape = [dataset['test'][:2]['height'], dataset['test'][:2]['width']]
target_sizes = list(map(list, zip(*shape))) # [[640, 640], [640, 640]]
results = processor.post_process_object_detection(outputs=outputs, threshold=0.5, target_sizes=target_sizes)
print(results)

# Post Processing
for idx, (image, detect) in enumerate(zip(images, results)):
    image = image.copy()
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("arial.ttf", 18)

    for box, label, score in zip(detect['boxes'], detect['labels'], detect['scores']):
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline='red', width=3)

        label_text = f'{labels[idx][label]}: {round(score.item(), 3)}'
        draw.text((box[0], box[1]), label_text, fill='white', font=font)

    plt.imshow(image)
    plt.axis('off')
    plt.show()
    
# ==============================================
# [{'scores': tensor([0.5499, 0.6243, 0.6733]), 'labels': tensor([3, 3, 3]), 'boxes': tensor([[329.0247, 287.1844, 400.3372, 357.9262],
#         [122.9359, 272.8753, 534.3260, 637.6506],
#         [479.7363, 294.2744, 636.4859, 396.8372]])}, {'scores': tensor([0.7538]), 'labels': tensor([7]), 'boxes': tensor([[ -0.7799, 173.7043, 440.0294, 538.7166]])}]
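
Note that the dataset annotations and the detector output appear to use different box formats. The `bbox` entries printed for the dataset sample look like [x, y, width, height] (the first box's width × height, 395.5 × 349.0, matches its listed area of 138029), while the post-processed predictions are [xmin, ymin, xmax, ymax]. Under that assumption, a minimal sketch for converting and comparing boxes with IoU:

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [xmin, ymin, xmax, ymax]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# First ground-truth box of the sample printed earlier (assumed xywh).
gt = xywh_to_xyxy([129.0, 291.0, 395.5, 349.0])
# Second predicted box of image #0 from the results above (xyxy).
pred = [122.9359, 272.8753, 534.3260, 637.6506]

print(gt, iou(gt, pred))
```

An IoU close to 1 means the predicted box nearly coincides with the annotation; detection benchmarks typically count a prediction as correct above a chosen IoU threshold such as 0.5.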

Zero shot Semantic segmentation

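Image segmentation performs finer, per-pixel classification, so it is computationally expensive and requires extensive training data and sophisticated algorithms.
Traditional methods include threshold-based binary classification and edge detection,
while recent methods perform image segmentation with deep learning models.
Traditional methods are simple and fast, but their performance degrades sharply on complex scenes or under varying lighting conditions.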

SAM (Segment Anything Model)

Model	Architecture	Input_Size	Patch_Size	#params
facebook/sam-vit-base	ViT-B/16	1024×1024	16×16	~0.09B
facebook/sam-vit-large	ViT-L/16	1024×1024	16×16	~0.31B
facebook/sam-vit-huge	ViT-H/16	1024×1024	16×16	~0.64B

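SAM architecture: image encoder, prompt encoder, mask decoder

SAM is a model developed by Meta and trained on 11 million images collected from diverse domains.

It can therefore perform image segmentation at a high level in a wide range of environments.
In many cases SAM can segment images from different domains without additional fine-tuning.

SAM can run with or without a prompt, and prompts can take several forms such as point coordinates and bboxes (the paper also explores text prompts, while the examples here use points and boxes).
If no prompt is given, it segments the entire image comprehensively.
Note that the inference result provides binary masks but no class information for the pixels.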

SAM Usage Example

import io
from PIL import Image
from datasets import load_dataset
from transformers import SamProcessor, SamModel

def filter_category(data):
    # 16 = dog
    # 23 = giraffe
    return 16 in data["objects"]["category"] or 23 in data["objects"]["category"]

def convert_image(data):
    byte = io.BytesIO(data["image"]["bytes"])
    img = Image.open(byte)
    return {"img": img}

model_name = "facebook/sam-vit-base"
processor = SamProcessor.from_pretrained(model_name) 
model = SamModel.from_pretrained(model_name)

dataset = load_dataset("s076923/coco-val")
filtered_dataset = dataset["validation"].filter(filter_category)
converted_dataset = filtered_dataset.map(convert_image, remove_columns=["image"])

import numpy as np
from matplotlib import pyplot as plt


def show_point_box(image, input_points, input_labels, input_boxes=None, marker_size=375):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    ax = plt.gca()
    
    input_points = np.array(input_points)
    input_labels = np.array(input_labels)

    pos_points = input_points[input_labels[0] == 1]
    neg_points = input_points[input_labels[0] == 0]
    
    ax.scatter(
        pos_points[:, 0],
        pos_points[:, 1],
        color="green",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )
    ax.scatter(
        neg_points[:, 0],
        neg_points[:, 1],
        color="red",
        marker="*",
        s=marker_size,
        edgecolor="white",
        linewidth=1.25
    )

    if input_boxes is not None:
        for box in input_boxes:
            x0, y0 = box[0], box[1]
            w, h = box[2] - box[0], box[3] - box[1]
            ax.add_patch(
                plt.Rectangle(
                    (x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2
                )
            )

    plt.axis("on")
    plt.show()


image = converted_dataset[0]["img"]
input_points = [[[250, 200]]]
input_labels = [[[1]]]

show_point_box(image, input_points[0], input_labels[0])
inputs = processor(
    image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
)

# input_points shape : torch.Size([1, 1, 1, 2])
# input_points : tensor([[[[400.2347, 320.0000]]]], dtype=torch.float64)
# input_labels shape : torch.Size([1, 1, 1])
# input_labels : tensor([[[1]]])
# pixel_values shape : torch.Size([1, 3, 1024, 1024])
# pixel_values : tensor([[[[ 1.4612,  ...]]])

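input_points: [B, number of points, coordinates] -- coordinates that designate the object or region of interest
input_labels: [B, point batch, number of points] -- label information corresponding to input_points
- input_labels types:

Value	Name	Description
1	foreground class	a point that lies on the object of interest to detect
0	not foreground class	a point that does not lie on the object of interest
-1	background class	a point that lies in the background region
-10	padding class	padding value used to match batch sizes (ignored by the model)

[Output after the processor]
input_points: [B, point batch, points per segmentation mask, coordinates]
input_labels: [B, point batch, number of points]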

import torch


def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


def show_masks_on_image(raw_image, masks, scores):
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()

    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(30, 15))

    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach()
        axes[i].imshow(np.array(raw_image))
        show_mask(mask, axes[i])
        axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
        axes[i].axis("off")
    plt.show()


model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)
print("iou_scores shape :", outputs.iou_scores.shape)
print("iou_scores :", outputs.iou_scores)
print("pred_masks shape :", outputs.pred_masks.shape)
print("pred_masks :", outputs.pred_masks)

# iou_scores shape : torch.Size([1, 1, 3])
# iou_scores : tensor([[[0.7971, 0.9507, 0.9603]]])
# pred_masks shape : torch.Size([1, 1, 3, 256, 256])
# pred_masks : tensor([[[[[ -3.6988, ..., ]]]]])

iou_scores: [B, number of points, IoU scores]
pred_masks: [B, point batch, C, H, W]

input_points = [[[250, 200], [15, 50]]]
input_labels = [[[0, 1]]]
input_boxes = [[[100, 100, 400, 600]]]

show_point_box(image, input_points[0], input_labels[0], input_boxes[0])
inputs = processor(
    image,
    input_points=input_points,
    input_labels=input_labels,
    input_boxes=input_boxes,
    return_tensors="pt"
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

show_masks_on_image(image, masks[0], outputs.iou_scores)

Zero shot Instance segmentation

Zero shot Detection + SAM

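SAM has no ability to classify the objects it segments.
This makes instance segmentation, which separates the objects in an image at the pixel level and assigns each a class, difficult with SAM alone.

To overcome this limitation, a zero-shot detection model can be combined with SAM:
1) Detect the object classes and bbox regions with the zero-shot detection model
2) Run SAM segmentation inside each bbox region.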

from transformers import pipeline

generator = pipeline("mask-generation", model=model_name)
outputs = generator(image, points_per_batch=32)

plt.imshow(np.array(image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()

print("number of output masks :", len(outputs["masks"]))
print("number of output scores :", len(outputs["scores"]))

# number of output masks : 52
# number of output scores : 52

detector = pipeline(
    model="google/owlv2-base-patch16", task="zero-shot-object-detection"
)

image = converted_dataset[24]["img"]
labels = ["dog", "giraffe"]
results = detector(image, candidate_labels=labels, threshold=0.5)

input_boxes = []
for result in results:
    input_boxes.append(
        [
            result["box"]["xmin"],
            result["box"]["ymin"],
            result["box"]["xmax"],
            result["box"]["ymax"]
        ]
    )
    print(result)

inputs = processor(image, input_boxes=[input_boxes], return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

plt.imshow(np.array(image))
ax = plt.gca()

for mask, iou in zip(masks[0], outputs.iou_scores[0]):
    max_iou_idx = torch.argmax(iou)
    best_mask = mask[max_iou_idx]
    show_mask(best_mask, ax=ax, random_color=True)

plt.axis("off")
plt.show()

#{'score': 0.6905778646469116, 'label': 'giraffe', 'box': {'xmin': 96, 'ymin': 198, 'xmax': 294, 'ymax': 577}}
#{'score': 0.6264181733131409, 'label': 'giraffe', 'box': {'xmin': 228, 'ymin': 199, 'xmax': 394, 'ymax': 413}}

Image Matching
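Image matching quantifies and compares the similarity between digital images.
It does so by extracting a feature vector from each image and measuring the similarity (distance) between the vectors.
The core of image matching is therefore "producing feature vectors that capture image characteristics effectively".
(Higher-dimensional feature vectors can generally carry more information, and the feature vector usually means the vector just before a layer such as the classification layer (= the output of the feature extractor = the input to the classifier).)

ex) Extracting a feature vector with ViT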
import torch
from datasets import load_dataset
from transformers import ViTImageProcessor, ViTModel

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["pixel_values"])

print("shape of the last feature map :", outputs["last_hidden_state"].shape)
print("feature vector shape :", outputs["last_hidden_state"][:, 0, :].shape)
print("feature vector :", outputs["last_hidden_state"][:, 0, :])

# shape of the last feature map : torch.Size([1, 197, 768])
# feature vector shape : torch.Size([1, 768])
# feature vector : tensor([[ 2.9420e-01,  8.3502e-01,  ..., -8.4114e-01,  1.7990e-01]])
The model was pre-trained on the large ImageNet-21K dataset, which lets it recognize subtle differences and complex patterns.
When extracting a feature vector from ViT, the key to look at is last_hidden_state:
The output shape [1, 197, 768] means [B, number of output tokens, feature dimension], and the 197 output tokens break down as follows.
A 224×224 image split into 16×16 patches (patch_size) gives a 14×14 grid = 196 patches,
and 197 = [CLS] + 196 patches; the [CLS] token's output is used as the feature vector.
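
The token count follows directly from the image and patch sizes. A small helper (hypothetical, written for this note) makes the arithmetic explicit, assuming square images and square patches:

```python
def vit_num_tokens(image_size, patch_size, cls_token=True):
    """Number of output tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # patches along one side
    return per_side * per_side + (1 if cls_token else 0)

print(vit_num_tokens(224, 16))  # 14 * 14 + 1 = 197
print(vit_num_tokens(224, 32))  # 7 * 7 + 1 = 50
```

Halving the patch size quadruples the number of patches, which is why smaller patches cost noticeably more memory and compute in the attention layers.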

FAISS (Facebook AI Similarity Search)

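FAISS is a high-performance vector similarity search library developed by Meta.
It is designed to "search for similar vectors in large, high-dimensional vector databases".

cf) [Ways to store and manage vectors]

Local storage: fast storage devices such as SSD or NVMe give quick data access.

Database systems: use a database that supports vector search, such as PostgreSQL with the pgvector extension or MongoDB Atlas Vector Search

Cloud vector databases: cloud services such as Amazon OpenSearch and Google Vertex AI provide specialized solutions for storing and searching large-scale vector data

Vector search engines: tools such as Milvus, Qdrant, Weaviate, and FAISS are optimized for efficient storage and high-performance search over large vector datasets; they support fast similarity search with ANN (Approximate Nearest Neighbor) algorithms and are especially suitable when real-time search is needed.

ex) Extracting image feature vectors with CLIP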
import torch
import numpy as np
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

dataset = load_dataset("sasha/dog-food")
images = dataset["test"]["image"][:100]

model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)

vectors = []
with torch.no_grad():
    for image in images:
        inputs = processor(images=image, padding=True, return_tensors="pt")
        outputs = model.get_image_features(**inputs)
        vectors.append(outputs.cpu().numpy())

vectors = np.vstack(vectors)
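print("shape of the image vectors :", vectors.shape)

# shape of the image vectors : (100, 512)
Select 100 images from the dog-food dataset → extract a vector for each image
→ store them in the vectors list → convert to an ndarray

These feature vectors can be used to build an index for similarity search:
add registers image vectors in the created index, and the input must be a numpy ndarray (float32) of shape [number of vectors, vector dimension]!!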
import faiss

dimension = vectors.shape[-1]
index = faiss.IndexFlatL2(dimension)
if torch.cuda.is_available():
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(vectors)


import matplotlib.pyplot as plt

search_vector = vectors[0].reshape(1, -1)
num_neighbors = 5
distances, indices = index.search(x=search_vector, k=num_neighbors)

fig, axes = plt.subplots(1, num_neighbors + 1, figsize=(15, 5))

axes[0].imshow(images[0])
axes[0].set_title("Input Image")
axes[0].axis("off")

for i, idx in enumerate(indices[0]):
    axes[i + 1].imshow(images[idx])
    axes[i + 1].set_title(f"Match {i + 1}\nIndex: {idx}\nDist: {distances[0][i]:.2f}")
    axes[i + 1].axis("off")

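print("Indices of the most similar vectors:", indices)
print("Distances:", distances)

# Indices of the most similar vectors: [[ 0  6 75  1 73]]
# Distances: [[ 0.       43.922516 44.92473  46.544144 47.058586]]
The steps above build a FAISS index holding 100 vectors. Given the feature vector of a query image, the most similar vectors in the index can be retrieved efficiently. IndexFlatL2 returns squared L2 distances, so smaller values mean more similar, and the query image itself comes back first with distance 0.
However, only vectors stored in the index can be searched, so widening the search scope means adding more vectors to the index.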
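
To make the search result easier to read, here is a minimal pure-Python sketch of the exhaustive squared-L2 search that a flat index performs conceptually. It is not FAISS code. The function name flat_l2_search and the toy data are made up for illustration.

```python
def flat_l2_search(database, query, k):
    """Return (distances, indices) of the k nearest rows by squared L2 distance."""
    scored = []
    for i, vec in enumerate(database):
        # Squared Euclidean distance between the query and row i.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()                      # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "database"; the query equals row 0, so row 0 comes back with distance 0.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
distances, indices = flat_l2_search(database, [0.0, 0.0], k=3)
print(indices, distances)  # [0, 1, 3] [0.0, 1.0, 8.0]
```

Every query compares against every stored vector, which is why this kind of search is exact but becomes slow as the index grows.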

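The code above contains the line below. The FAISS library provides several index types:
index = faiss.IndexFlatL2(dimension)

Name            Accuracy   Speed     Characteristics
IndexFlatL2     Highest    Slowest   Exhaustive search over all vectors
IndexHNSWFlat   High       Medium    Efficient search using a graph structure
IndexIVFFlat    Medium     Fastest   Clusters the vectors to narrow the search range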

Multi-Modal

Image Captioning (img2txt)

BLIP

The core idea of BLIP is to model the interaction between images and text.
To do so, it combines the feature vectors produced by an image encoder and a text encoder into a unified representation.

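BLIP-2 architecture

BLIP-2 introduces the Q-Former to improve interaction and information exchange between image and text:
[image-text contrastive learning, ITM, image-grounded text generation] --> performed jointly in a single Encoder-Decoder structure
The Q-Former takes image feature embeddings from a frozen image encoder as input
and outputs soft visual prompt embeddings that capture the image-text relationship well.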

DocumentQA
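DQA (DocumentQA) performs question answering by combining natural language processing with information retrieval.
DQA has to take a document's visual structure and layout into account, and one of the most prominent models for this is LayoutLM.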

LayoutLM (Layout-aware Language Model)

LayoutLM is a model from Microsoft that is pre-trained on the layout information of document images as well as their text.

[LayoutLMv1]

LayoutLM-v1

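Based on BERT, it takes the position of each piece of text as input along with the text itself.
An OCR engine extracts the text and its bounding boxes (bbox), which are added as position embeddings. Image features for each word region, extracted with a model such as Faster R-CNN, are also fed to the model. However, LayoutLMv1 adds the image features only at the very end, so they cannot actually be used during pre-training.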
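
As a small illustration of how bbox coordinates become position inputs: the LayoutLM-family models in Hugging Face expect each box as (x0, y0, x1, y1) scaled to a 0-1000 range relative to the page size. The sketch below assumes pixel coordinates from some OCR step. The helper name normalize_bbox and the sample numbers are made up for illustration.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical word box on a 1000x500 pixel page.
print(normalize_bbox((100, 50, 300, 80), page_width=1000, page_height=500))
# [100, 100, 300, 160]
```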

[LayoutLMv2]

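LayoutLMv2 additionally introduces image embeddings so that the visual information of the document is reflected effectively.
In LayoutLMv2, the visual embeddings are extracted with ResNeXt-FPN.
That is, it takes text, image patches, and layout information together as input and performs self-attention over them.
- Main pre-training objectives:
i) Masked Visual-Language Modeling: predict the masked-out words in a sentence
ii) ITM: learn whether a given text and image belong together
iii) Text-Image Alignment: when certain words are covered in the image, identify which positions were covered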

[LayoutLMv3]

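Left: DocFormer, Right: LayoutLMv3

LayoutLMv3 is described as the first unified multimodal model that does not depend on a pre-trained backbone such as Faster R-CNN or a CNN.
To achieve this, it introduces new pre-training strategies and tasks:
i) Masked Language Modeling (MLM): mask some of the word tokens
ii) Masked Image Modeling (MIM): mask the image regions that correspond to the masked tokens
iii) Word Patch Alignment (WPA): binary classification of whether the image patch corresponding to a text token is masked, which teaches alignment between the two modalities

<LayoutLMv3 structure>: embedding layer, patch_embedding module, encoder
1) The embedding layer combines several kinds of embeddings:
- word_embed + token_type_emb + pos_emb + (x_pos_emb, y_pos_emb, h_pos_emb, w_pos_emb)

2) The patch_embedding module processes the image:
- it splits the image into patches and converts each patch into an embedding, as in ViT

3) encoder
- consists of multiple Transformer layers.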

VQA
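VQA process: extract visual features → understand the meaning of the question (Q) → combine the visual features with the text of Q to produce a meaningful representation (A)
ViLT is a model designed for this.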

ViLT (Vision-and-Language Transformer)

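It is a single-model architecture that processes visual input in the same way as text input.
A modal-type embedding is added to tell the two modalities apart,
and training uses three loss functions:
- ITM: decide whether a given image and text are related to each other.
- MLM: mask whole words so the model learns the context of the full word
- WPA: maximize the vector similarity between image and text

As a result, image and text are combined effectively and represented in a single embedding space.

cf) collate_fn is the function passed to a PyTorch DataLoader that defines how individual samples are combined when a batch is built.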
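
A minimal sketch of what such a function does, written in plain Python so it runs without PyTorch. The name pad_collate, the pad_id default, and the sample token ids are made up for illustration. A real collate_fn would typically also convert the padded lists to tensors.

```python
def pad_collate(samples, pad_id=0):
    """Pad variable-length samples to the longest one and build an attention mask."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for s in samples:
        ids = list(s["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


samples = [{"input_ids": [101, 7, 102]}, {"input_ids": [101, 7, 8, 9, 102]}]
batch = pad_collate(samples)
print(batch["input_ids"])       # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With PyTorch, a function like this would be passed as DataLoader(dataset, collate_fn=pad_collate).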

Image Generation
Image generation refers to techniques that interpret a prompt and use a GAN or a diffusion model to create a new image that captures the prompt's fine-grained details.

Diffusion Model

[Forward process]: gradually add Gaussian noise to src_img
[Reverse process]: recover the original from pure_noise (by updating the predicted mean and standard deviation at each step)
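
The forward process can be sketched in a few lines. This assumes the DDPM formulation, where a noisy sample at step t is sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard normal distribution. The linear beta schedule (1e-4 to 0.02 over 1000 steps) and the names make_alpha_bars and q_sample are assumptions for illustration, not taken from the post.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Jump straight from clean data x0 to its noisy version at step t."""
    signal = math.sqrt(alpha_bars[t])        # how much of x0 survives
    noise = math.sqrt(1.0 - alpha_bars[t])   # how much Gaussian noise is mixed in
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = make_alpha_bars()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                        # toy stand-in for image pixels
x_early = q_sample(x0, 0, alpha_bars, rng)   # still very close to x0
x_late = q_sample(x0, 999, alpha_bars, rng)  # almost pure noise
print(alpha_bars[0], alpha_bars[-1])
```

The reverse process trains a network to undo these steps one at a time, which is not shown here.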

[Stable-Diffusion 1]
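- generates 512×512 images
- supports txt2img, img2img, inpainting, etc.

[Stable-Diffusion 2]
- generates 768×768 images
- uses OpenCLIP for better text-image alignment and improved fine detail

[Stable-Diffusion 3]
- generates even higher-resolution images
- a new model architecture based on rectified flow
- a new model architecture that allows bidirectional information flow between text and image tokens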

etc

Hyperparameter Tuning - ray tune

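Ray Tune is a distributed hyperparameter optimization framework.
It supports a variety of hyperparameter search algorithms (random, greedy, etc.) in large-scale distributed computing environments, and it also provides early stopping.
It additionally offers tools for tracking and visualizing experiment results, which help identify the best hyperparameter combination.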

!pip3 install ray[tune] optuna

ex) NER example with Ray Tune

i) Preparing for training

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

def preprocess_data(example, tokenizer):
    sentence = "".join(example["tokens"]).replace("\xa0", " ")
    encoded = tokenizer(
        sentence,
        return_offsets_mapping=True,
        add_special_tokens=False,
        padding=False,
        truncation=False
    )

    labels = []
    for offset in encoded.offset_mapping:
        if offset[0] == offset[1]:
            labels.append(-100)
        else:
            labels.append(example["ner_tags"][offset[0]])
    encoded["labels"] = labels
    return encoded

dataset = load_dataset("klue", "ner")
labels = dataset["train"].features["ner_tags"].feature.names

model_name = "Leo97/KoELECTRA-small-v3-modu-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    ignore_mismatched_sizes=True
)

processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=False,
    remove_columns=dataset["train"].column_names
)

ii) Hyperparameter search

from ray import tune
from functools import partial
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForTokenClassification

def model_init(model_name, labels):
    return AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(labels), ignore_mismatched_sizes=True
    )

def hp_space(trial):
    return {
        "learning_rate": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-5, 1e-1),
        "num_train_epochs": tune.choice([1, 2, 3])
    }

def compute_objective(metrics):
    return metrics["eval_loss"]

training_args = TrainingArguments(
    output_dir="token-classification-hyperparameter-search",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # learning_rate=1e-4,
    # weight_decay=0.01,
    # num_train_epochs=5,
    seed=42
)

trainer = Trainer(
    model_init=partial(model_init, model_name=model_name, labels=labels),
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)
)

best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=5,
    direction="minimize",
    hp_space=hp_space,
    compute_objective=compute_objective,
    resources_per_trial={"cpu": 2, "gpu": 1},
    trial_dirname_creator=lambda trial: str(trial)
)
print(best_run.hyperparameters)

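model_init function: creates a model instance (passed in so that a fresh model is created for each trial while searching for the best hyperparameters).
In other words, it guarantees a consistent initial state for every trial.

hp_space function: specifies which hyperparameters to search and the range of values for each.

compute_objective function: the evaluation metric used during optimization, usually based on eval_loss or eval_acc. Here it returns eval_loss, which matches direction="minimize".

TrainingArguments: learning_rate, weight_decay, and num_train_epochs are searched through hp_space, so they are not set here.

Trainer: uses model_init instead of a fixed model instance.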
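
To show what the search space and objective amount to, here is a minimal pure-Python sketch of random search over the same ranges. It does not use Ray Tune. The helper names and the toy objective (a stand-in for eval_loss that simply prefers learning rates near 3e-5) are made up for illustration.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Sample so that the logarithm of the value is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one trial configuration from the same ranges as hp_space."""
    return {
        "learning_rate": sample_loguniform(1e-5, 1e-4, rng),
        "weight_decay": sample_loguniform(1e-5, 1e-1, rng),
        "num_train_epochs": rng.choice([1, 2, 3]),
    }


def random_search(objective, n_trials, seed=42):
    """Evaluate n_trials random configurations and keep the one with the lowest objective."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=objective)


def toy_objective(config):
    # Hypothetical stand-in for eval_loss; a real run would train and evaluate a model.
    return abs(math.log10(config["learning_rate"]) - math.log10(3e-5))


best = random_search(toy_objective, n_trials=5)
print(best)
```

Sampling on a log scale spreads trials evenly across orders of magnitude, which suits parameters such as learning rate and weight decay.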

Example output)

+-------------------------------------------------------------------+
| Configuration for experiment     _objective_2024-11-18_05-44-18   |
+-------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator            |
| Scheduler                        FIFOScheduler                    |
| Number of trials                 5                                |
+-------------------------------------------------------------------+

View detailed results here: /root/ray_results/_objective_2024-11-18_05-44-18
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-11-18_05-44-11_866890_872/artifacts/2024-11-18_05-44-18/_objective_2024-11-18_05-44-18/driver_artifacts`

Trial status: 5 PENDING
Current time: 2024-11-18 05:44:18. Total running time: 0s
Logical resource usage: 0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-------------------------------------------------------------------------------------------+
| Trial name               status       learning_rate     weight_decay     num_train_epochs |
+-------------------------------------------------------------------------------------------+
| _objective_27024_00000   PENDING        2.36886e-05       0.0635122                     3 |
| _objective_27024_00001   PENDING        6.02131e-05       0.00244006                    2 |
| _objective_27024_00002   PENDING        1.43217e-05       1.7074e-05                    1 |
| _objective_27024_00003   PENDING        3.99131e-05       0.00679658                    2 |
| _objective_27024_00004   PENDING        1.13871e-05       0.00772672                    2 |
+-------------------------------------------------------------------------------------------+

Trial _objective_27024_00000 started with configuration:
+-------------------------------------------------+
| Trial _objective_27024_00000 config             |
+-------------------------------------------------+
| learning_rate                             2e-05 |
| num_train_epochs                              3 |
| weight_decay                            0.06351 |
+-------------------------------------------------+

...

GPTQ (Generative Pre-trained Transformer Quantization)

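GPTQ is a model optimization technique that can greatly improve the efficiency of LLMs.
It quantizes the model weights to a low bit precision, which reduces model size and speeds up inference.
As the output of the example below shows, the model size can be reduced by a large margin,
and GPTQ can be applied not only to GPT-family models but to other Transformer-based models as well.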
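
To give a feel for what "low bit precision" means, here is a minimal sketch of plain round-to-nearest 4-bit quantization of a weight list. This is not the GPTQ algorithm itself, which additionally uses second-order information to compensate for the rounding error. It only shows the storage idea of 16 integer levels plus a scale and an offset. All names and values are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using a shared scale and offset."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 15 or 1.0   # 15 steps between 16 levels; 1.0 if all equal
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, zero):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + zero for c in codes]


weights = [-0.8, -0.1, 0.0, 0.33, 0.7]
codes, scale, zero = quantize_4bit(weights)
restored = dequantize(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error)
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a rounding error of at most half a quantization step.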

Example: quantizing a model with GPTQ

from transformers import GPTQConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)

from transformers import pipeline

origin_generator = pipeline("text-generation", model="facebook/opt-125m")
quantized_generator = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

input_text_list = [
    "In the future, technology wil",
    "What are we having for dinner?",
    "What day comes after Monday?"
]

print("Original model output:")
for input_text in input_text_list:
    print(origin_generator(input_text))
print("Quantized model output:")
for input_text in input_text_list:
    print(quantized_generator(input_text))

# Original model output:
# [{'generated_text': 'In the future, technology wil be used to make the world a better place.\nI think'}]
# [{'generated_text': 'What are we having for dinner?\n\nWe have a great dinner tonight. We have a'}]
# [{'generated_text': "What day comes after Monday?\nI'm guessing Monday."}]
# Quantized model output:
# [{'generated_text': 'In the future, technology wil be able to make it possible to make a phone that can be'}]
# [{'generated_text': "What are we having for dinner?\n\nI'm not sure what to do with all this"}]
# [{'generated_text': "What day comes after Monday?\nI'm not sure, but I'll be sure to check"}]

The outputs show that, although quality drops slightly, the quantized model does not differ much from the original.

import time
import numpy as np

def measure_inference_time(generator, input_text, iterations=10):
    times = []
    for _ in range(iterations):
        start_time = time.time()
        generator(input_text)
        end_time = time.time()
        times.append(end_time - start_time)
    avg_time = np.mean(times)
    return avg_time

def calculate_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    total_memory = sum(p.numel() * p.element_size() for p in model.parameters())
    total_memory_mb = total_memory / (1024 ** 2)
    return total_memory_mb, total_params

test_input = "Once upon a time in a land far, far away, there was a small village."

size_original, total_params_original = calculate_model_size(origin_generator.model)
avg_inference_time_original = measure_inference_time(origin_generator, test_input)

size_quantized, total_params_quantized = calculate_model_size(quantized_generator.model)
avg_inference_time_quantized = measure_inference_time(quantized_generator, test_input)

print("Original model:")
print(f"- Number of parameters: {total_params_original:,}")
print(f"- Model size: {size_original:.2f} MB")
print(f"- Average inference time: {avg_inference_time_original:.4f} sec")

print("Quantized model:")
print(f"- Number of parameters: {total_params_quantized:,}")
print(f"- Model size: {size_quantized:.2f} MB")
print(f"- Average inference time: {avg_inference_time_quantized:.4f} sec")

# Original model:
# - Number of parameters: 125,239,296
# - Model size: 477.75 MB
# - Average inference time: 0.1399 sec
# Quantized model:
# - Number of parameters: 40,221,696
# - Model size: 76.72 MB
# - Average inference time: 0.0289 sec

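The results show that, compared with the original, the quantized model is much smaller and runs faster, which can make it very efficient for real-time responses. Note that calculate_model_size only counts tensors returned by model.parameters(). The quantized layers likely keep their packed weights as buffers rather than parameters, which would explain the lower parameter count, so the reported size probably understates the quantized model's actual footprint.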
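
The reported sizes can be reproduced from the parameter counts alone. The sketch below assumes 4 bytes per parameter (float32) for the original model and 2 bytes per parameter (float16) for the parameters still counted in the quantized model. Those byte widths are inferred from the printed figures, not stated in the post.

```python
def size_mb(num_params, bytes_per_param):
    """Memory taken by num_params values, in MB (1 MB = 1024 ** 2 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)


print(f"{size_mb(125_239_296, 4):.2f} MB")  # 477.75 MB
print(f"{size_mb(40_221_696, 2):.2f} MB")   # 76.72 MB
```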
