📌 Table of Contents

1. Energy-based Model
2. Energy Function
3. Others (BM, RBM)
4. Summary

 

🧠 Preview:

The idea behind an EBM (Energy-Based Model):
the probability of an event can be expressed with a Boltzmann distribution, which normalizes a real-valued energy function into a value between 0 and 1.

1.  Energy with Boltzmann

Boltzmann Distribution

An EBM models the true data-generating distribution with a Boltzmann distribution.
The energy function (score) defines the Boltzmann distribution as shown below:
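In its standard form, with Z(θ) the normalizing constant (partition function) integrated over the whole sample space:

p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \qquad Z(\theta) = \int e^{-E_\theta(x)}\, dx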

Step 1. Train a neural network E(x):

→ Likely samples get a low score → p(x) moves toward 1.
→ Unlikely samples get a high score → p(x) moves toward 0.


Step 2. An intractable integral remains:

→ Except in trivial cases, the normalizing denominator of the equation above is an intractable integral.
If this integral cannot be computed, the model cannot be trained by MLE (maximum likelihood estimation).
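Written out per sample, the negative log-likelihood makes the obstacle explicit; the second term is exactly the intractable integral:

-\log p_\theta(x) = E_\theta(x) + \log Z(\theta)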

∙ The core idea of EBMs: use approximation techniques so that this denominator never has to be computed.

This is in contrast to Normalizing Flows, which go to great lengths to ensure that the transformations applied to a standard normal distribution still produce a valid, normalized probability distribution.


The key ideas of [Implicit Generation and Modeling with EBMs; 2019] are:
(for training): Contrastive Divergence
(for sampling): Langevin Dynamics
→ together these sidestep the troublesome denominator.

2.  Energy Based Model

Energy Function

The energy function Eθ(x) is a neural network with parameters θ that maps an input image x to a single scalar value.
The Swish activation, f(x) = x · σ(x), is used throughout.
Swish mitigates gradient vanishing, which is especially important in EBMs because sampling relies on gradients flowing back to the input.
The network stacks a series of Conv2D layers that shrink the image while increasing the number of channels.
Because the final layer has a linear activation, the network outputs values in the range (-∞, ∞).
The code is shown below.
import torch
import torch.nn as nn


class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)


class EBM(nn.Module):
    def __init__(self, hidden_features=32, out_dim=1, **kwargs):
        super().__init__()
        c_hid1 = hidden_features//2
        c_hid2 = hidden_features
        c_hid3 = hidden_features*2

        self.cnn_layers = nn.Sequential(
                nn.Conv2d(1, c_hid1, kernel_size=5, stride=2, padding=4), # [16x16] - Larger padding to get 32x32 image
                Swish(),
                nn.Conv2d(c_hid1, c_hid2, kernel_size=3, stride=2, padding=1), #  [8x8]
                Swish(),
                nn.Conv2d(c_hid2, c_hid3, kernel_size=3, stride=2, padding=1), # [4x4]
                Swish(),
                nn.Conv2d(c_hid3, c_hid3, kernel_size=3, stride=2, padding=1), # [2x2]
                Swish(),
                nn.Flatten(),
                nn.Linear(c_hid3*4, c_hid3),
                Swish(),
                nn.Linear(c_hid3, out_dim)
        )

    def forward(self, x):
        x = self.cnn_layers(x).squeeze(dim=-1)
        return x
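As a quick sanity check, a minimal usage sketch (assuming 28×28 single-channel inputs such as MNIST, scaled to [-1, 1]); the network returns one unbounded scalar score per image:

energy_model = EBM(hidden_features=32)
x = torch.rand(8, 1, 28, 28) * 2 - 1   # a batch of 8 random "images" in [-1, 1]
scores = energy_model(x)
print(scores.shape)                    # torch.Size([8])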

Langevin Dynamics

The energy function outputs only a single score for a given input.

🤔 How, then, do we generate new samples with low energy scores?
Langevin dynamics: a technique that computes the gradient of the energy function with respect to the input.

Step 1. Start from a random point in the sample space.
Step 2. Move step by step in the opposite direction of the computed gradient, lowering the energy.
Step 3. After training, random noise is transformed into images that resemble the training set.



Stochastic gradient Langevin dynamics:

While moving through the sample space, a small amount of random noise is added to the input at every step.
(Otherwise the process can get stuck in local minima.)

Of course, this differs from ordinary SGD:
∙ SGD: update the parameters step by step in the negative gradient direction to minimize the loss.

∙ SGD with Langevin dynamics:
the network weights are kept fixed, and the gradient of the output with respect to the input is computed.
The input is then updated step by step in the negative gradient direction,
gradually minimizing the output (the energy score).

Both approaches apply the same gradient-descent idea, just to different objective functions.
The theoretical Langevin dynamics update:
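In one common form (step size η, with fresh Gaussian noise ω drawn at every step):

x_k = x_{k-1} - \eta \, \nabla_x E_\theta(x_{k-1}) + \omega, \qquad \omega \sim \mathcal{N}(0, \sigma^2)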


[Langevin sampling function]:
def generate_samples(model, inp_imgs, steps=60, step_size=10, return_img_per_step=False):
        """
        Function for sampling images for a given model.
        Inputs:
            model - Neural network to use for modeling E_theta
            inp_imgs - Images to start from for sampling. If you want to generate new images, enter noise between -1 and 1.
            steps - Number of iterations in the MCMC algorithm.
            step_size - Learning rate nu in the algorithm above
            return_img_per_step - If True, we return the sample at every iteration of the MCMC
        """
        # Before MCMC: set model parameters to "requires_grad=False"
        # because we are only interested in the gradients of the input.
        is_training = model.training
        model.eval()
        for p in model.parameters():
            p.requires_grad = False
        inp_imgs.requires_grad = True

        # Enable gradient calculation if not already the case
        had_gradients_enabled = torch.is_grad_enabled()
        torch.set_grad_enabled(True)

        # We use a buffer tensor in which we generate noise each loop iteration.
        # More efficient than creating a new tensor every iteration.
        noise = torch.randn(inp_imgs.shape, device=inp_imgs.device)

        # List for storing generations at each step (for later analysis)
        imgs_per_step = []

        # Loop over K (steps)
        for _ in range(steps):
            # Part 1: Add noise to the input.
            noise.normal_(0, 0.005)
            inp_imgs.data.add_(noise.data)
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            # Part 2: calculate gradients for the current input.
            out_imgs = -model(inp_imgs)
            out_imgs.sum().backward()
            inp_imgs.grad.data.clamp_(-0.03, 0.03) # For stabilizing and preventing too high gradients

            # Apply gradients to our current samples
            inp_imgs.data.add_(-step_size * inp_imgs.grad.data)
            inp_imgs.grad.detach_()
            inp_imgs.grad.zero_()
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            if return_img_per_step:
                imgs_per_step.append(inp_imgs.clone().detach())

        # Reactivate gradients for parameters for training
        for p in model.parameters():
            p.requires_grad = True
        model.train(is_training)

        # Reset gradient calculation to setting before this function
        torch.set_grad_enabled(had_gradients_enabled)

        if return_img_per_step:
            return torch.stack(imgs_per_step, dim=0)
        else:
            return inp_imgs
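A minimal usage sketch (assuming the EBM class above; the model need not be trained for the call to run): start from uniform noise in [-1, 1] and run the chain.

model = EBM(hidden_features=32)
start_imgs = torch.rand(4, 1, 28, 28) * 2 - 1   # pure-noise starting points
new_imgs = generate_samples(model, start_imgs, steps=60, step_size=10)
print(new_imgs.shape)                           # torch.Size([4, 1, 28, 28])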

Contrastive Divergence

Now that we have seen how to sample new points with low energy from the sample space, let's look at training the model.

The energy function does not output a probability = MLE cannot be applied as-is.
(Of course, as always, the objective is to minimize an NLL loss.)

When pθ(x) is the Boltzmann distribution built from the energy function Eθ(x),
the gradient of this objective is as follows:
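In its usual form, an expectation over real data minus an expectation over the model's own samples:

\nabla_\theta \, \mathbb{E}_{x \sim p_{data}}\!\left[-\log p_\theta(x)\right] = \mathbb{E}_{x \sim p_{data}}\!\left[\nabla_\theta E_\theta(x)\right] - \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]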
The equation above has an intuitive reading:
∙ output a large negative energy score for real samples,
∙ output a large positive energy score for generated fake samples,
→ and make the gap between the two extremes as large as possible, using it as the loss function.


To compute the energy scores of fake samples, we would need to sample exactly from the distribution pθ(x),
which is impossible because the denominator is intractable.
→ Instead, Langevin sampling is used to generate a set of samples with low energy scores.

In addition, the samples from previous iterations are kept in a buffer
and used as the starting points of the next batch instead of pure noise:
def sample_new_exmps(self, steps=60, step_size=10):
        """
        Function for getting a new batch of "fake" images.
        Inputs:
            steps - Number of iterations in the MCMC algorithm
            step_size - Learning rate nu in the algorithm above
        """
        # Choose 95% of the batch from the buffer, 5% generate from scratch
        n_new = np.random.binomial(self.sample_size, 0.05)
        rand_imgs = torch.rand((n_new,) + self.img_shape) * 2 - 1
        old_imgs = torch.cat(random.choices(self.examples, k=self.sample_size-n_new), dim=0)
        inp_imgs = torch.cat([rand_imgs, old_imgs], dim=0).detach().to(device)

        # Perform MCMC sampling
        inp_imgs = Sampler.generate_samples(self.model, inp_imgs, steps=steps, step_size=step_size)

        # Add new images to the buffer and remove old ones if needed
        self.examples = list(inp_imgs.to(torch.device("cpu")).chunk(self.sample_size, dim=0)) + self.examples
        self.examples = self.examples[:self.max_len]
        return inp_imgs
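Putting the pieces together, below is a minimal sketch of one contrastive-divergence training step that could pair with the sampler above; it is not the original post's full code. It follows the sign convention of generate_samples (the network output behaves as a negative energy, so the loss below pushes the energy of real images down and the energy of fake images up); `sampler`, `optimizer`, and `alpha` are illustrative names.

def training_step(model, optimizer, real_imgs, sampler, alpha=0.1):
    # A little noise on the real images stabilizes training
    real_imgs = (real_imgs + 0.005 * torch.randn_like(real_imgs)).clamp(-1.0, 1.0)

    # Fake images come from the replay buffer + Langevin MCMC
    fake_imgs = sampler.sample_new_exmps(steps=60, step_size=10).detach()

    # Score both batches with the same network
    real_out = model(real_imgs)
    fake_out = model(fake_imgs)

    # Contrastive divergence: widen the gap between real and fake scores,
    # plus a small regularizer that keeps the scores from exploding
    cdiv_loss = fake_out.mean() - real_out.mean()
    reg_loss = alpha * (real_out ** 2 + fake_out ** 2).mean()
    loss = cdiv_loss + reg_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()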

At the end of each training step the scores are not normalized to probabilities;
the algorithm simply drives the scores (energies) of real samples down
and the scores of fake samples up.

For the full final code, see the link.

EBMs later evolved through a training technique called score matching,
which developed into the Denoising Diffusion Probabilistic Model (DDPM),
used to build state-of-the-art generative models such as DALL·E 2 and Imagen.

3. Others (BM, RBM)

Boltzmann Machine

One of the earliest examples of an EBM: a fully connected, undirected neural network.
v: visible units
h: hidden units
W, L, J: the learned weight matrices (visible-hidden, visible-visible, and hidden-hidden connections, respectively).
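Ignoring bias terms, the Boltzmann machine energy is commonly written with these three matrices as:

E(v, h) = -\,v^\top W h \;-\; \tfrac{1}{2} v^\top L\, v \;-\; \tfrac{1}{2} h^\top J\, h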

Trained with Contrastive Divergence.
Gibbs sampling alternates between v and h until an equilibrium is reached.

Drawbacks: training is very slow, and the number of hidden units cannot be scaled up much.

 

Restricted Boltzmann Machine

The RBM modifies the Boltzmann machine above by removing connections between units of the same type,
leaving a bipartite graph with two layers.
RBMs can then be stacked to build a Deep Belief Network that models more complex distributions.
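For binary units, the bipartite structure is what makes block Gibbs sampling cheap: with weight matrix W and bias vectors b, c (biases not shown above), the conditionals factorize, so an entire layer can be resampled in parallel:

p(h_j = 1 \mid v) = \sigma\!\left(c_j + \sum_i W_{ij} v_i\right), \qquad p(v_i = 1 \mid h) = \sigma\!\left(b_i + \sum_j W_{ij} h_j\right)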

Drawback: sampling with a long mixing time (the time Gibbs sampling needs to reach the target distribution) is still required,
so it remains impractical for modeling high-dimensional data.
4. Summary

Sampling from a deep EBM is done with Langevin dynamics.
This technique uses the gradient of the score with respect to the input image
and updates the input step by step in the negative gradient direction, lowering its energy score.

As a result, random noise is gradually transformed into plausible samples.
