📌 Table of Contents

1. Energy-based Model
2. Energy Function
3. Others (BM, RBM)
4. Summary

 

🧠 Preview:

The idea behind an EBM (Energy-Based Model):
the probability of an event can be expressed with a Boltzmann distribution, which normalizes a real-valued energy function into a value between 0 and 1.

1.  Energy with Boltzmann

Boltzmann Distribution

An EBM models the true data-generating distribution with a Boltzmann distribution.
The energy function (score) defines the Boltzmann distribution as shown below:
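In its standard form, with Z(θ) the normalizing constant (partition function) integrated over the whole sample space:

p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \qquad Z(\theta) = \int e^{-E_\theta(x)}\, dx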

Step 1. Train a neural network E(x):

→ Likely samples get a low score → p(x) moves toward 1.
→ Unlikely samples get a high score → p(x) moves toward 0.


Step 2. An intractable integral remains:

→ Except in trivial cases, the normalizing denominator of the equation above is an intractable integral.
If this integral cannot be computed, the model cannot be trained by MLE (maximum likelihood estimation).
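Written out per sample, the negative log-likelihood makes the obstacle explicit; the second term is exactly the intractable integral:

-\log p_\theta(x) = E_\theta(x) + \log Z(\theta)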

∙ The core idea of EBMs: use approximation techniques so that this denominator never has to be computed.

This is in contrast to Normalizing Flows, which go to great lengths to ensure that the transformations applied to a standard normal distribution still produce a valid, normalized probability distribution.


The key ideas of [Implicit Generation and Modeling with EBMs; 2019] are:
(for training): Contrastive Divergence
(for sampling): Langevin Dynamics
→ together these sidestep the troublesome denominator.

2.  Energy Based Model

Energy Function

The energy function Eθ(x) is a neural network with parameters θ that maps an input image x to a single scalar value.
The Swish activation, f(x) = x · σ(x), is used throughout.
Swish mitigates gradient vanishing, which is especially important in EBMs because sampling relies on gradients flowing back to the input.
The network stacks a series of Conv2D layers that shrink the image while increasing the number of channels.
Because the final layer has a linear activation, the network outputs values in the range (-∞, ∞).
The code is shown below.
import torch
import torch.nn as nn


class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)


class EBM(nn.Module):
    def __init__(self, hidden_features=32, out_dim=1, **kwargs):
        super().__init__()
        c_hid1 = hidden_features//2
        c_hid2 = hidden_features
        c_hid3 = hidden_features*2

        self.cnn_layers = nn.Sequential(
                nn.Conv2d(1, c_hid1, kernel_size=5, stride=2, padding=4), # [16x16] - Larger padding to get 32x32 image
                Swish(),
                nn.Conv2d(c_hid1, c_hid2, kernel_size=3, stride=2, padding=1), #  [8x8]
                Swish(),
                nn.Conv2d(c_hid2, c_hid3, kernel_size=3, stride=2, padding=1), # [4x4]
                Swish(),
                nn.Conv2d(c_hid3, c_hid3, kernel_size=3, stride=2, padding=1), # [2x2]
                Swish(),
                nn.Flatten(),
                nn.Linear(c_hid3*4, c_hid3),
                Swish(),
                nn.Linear(c_hid3, out_dim)
        )

    def forward(self, x):
        x = self.cnn_layers(x).squeeze(dim=-1)
        return x
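As a quick sanity check, a minimal usage sketch (assuming 28×28 single-channel inputs such as MNIST, scaled to [-1, 1]); the network returns one unbounded scalar score per image:

energy_model = EBM(hidden_features=32)
x = torch.rand(8, 1, 28, 28) * 2 - 1   # a batch of 8 random "images" in [-1, 1]
scores = energy_model(x)
print(scores.shape)                    # torch.Size([8])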

Langevin Dynamics

The energy function outputs only a single score for a given input.

🤔 How, then, do we generate new samples with low energy scores?
Langevin dynamics: a technique that computes the gradient of the energy function with respect to the input.

Step 1. Start from a random point in the sample space.
Step 2. Move step by step in the opposite direction of the computed gradient, lowering the energy.
Step 3. After training, random noise is transformed into images that resemble the training set.



Stochastic gradient Langevin dynamics:

While moving through the sample space, a small amount of random noise is added to the input at every step.
(Otherwise the process can get stuck in local minima.)

Of course, this differs from ordinary SGD:
∙ SGD: update the parameters step by step in the negative gradient direction to minimize the loss.

∙ SGD with Langevin dynamics:
the network weights are kept fixed, and the gradient of the output with respect to the input is computed.
The input is then updated step by step in the negative gradient direction,
gradually minimizing the output (the energy score).

Both approaches apply the same gradient-descent idea, just to different objective functions.
The theoretical Langevin dynamics update:
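In one common form (step size η, with fresh Gaussian noise ω drawn at every step):

x_k = x_{k-1} - \eta \, \nabla_x E_\theta(x_{k-1}) + \omega, \qquad \omega \sim \mathcal{N}(0, \sigma^2)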


[Langevin sampling function]:
def generate_samples(model, inp_imgs, steps=60, step_size=10, return_img_per_step=False):
        """
        Function for sampling images for a given model.
        Inputs:
            model - Neural network to use for modeling E_theta
            inp_imgs - Images to start from for sampling. If you want to generate new images, enter noise between -1 and 1.
            steps - Number of iterations in the MCMC algorithm.
            step_size - Learning rate nu in the algorithm above
            return_img_per_step - If True, we return the sample at every iteration of the MCMC
        """
        # Before MCMC: set model parameters to "requires_grad=False"
        # because we are only interested in the gradients of the input.
        is_training = model.training
        model.eval()
        for p in model.parameters():
            p.requires_grad = False
        inp_imgs.requires_grad = True

        # Enable gradient calculation if not already the case
        had_gradients_enabled = torch.is_grad_enabled()
        torch.set_grad_enabled(True)

        # We use a buffer tensor in which we generate noise each loop iteration.
        # More efficient than creating a new tensor every iteration.
        noise = torch.randn(inp_imgs.shape, device=inp_imgs.device)

        # List for storing generations at each step (for later analysis)
        imgs_per_step = []

        # Loop over K (steps)
        for _ in range(steps):
            # Part 1: Add noise to the input.
            noise.normal_(0, 0.005)
            inp_imgs.data.add_(noise.data)
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            # Part 2: calculate gradients for the current input.
            out_imgs = -model(inp_imgs)
            out_imgs.sum().backward()
            inp_imgs.grad.data.clamp_(-0.03, 0.03) # For stabilizing and preventing too high gradients

            # Apply gradients to our current samples
            inp_imgs.data.add_(-step_size * inp_imgs.grad.data)
            inp_imgs.grad.detach_()
            inp_imgs.grad.zero_()
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            if return_img_per_step:
                imgs_per_step.append(inp_imgs.clone().detach())

        # Reactivate gradients for parameters for training
        for p in model.parameters():
            p.requires_grad = True
        model.train(is_training)

        # Reset gradient calculation to setting before this function
        torch.set_grad_enabled(had_gradients_enabled)

        if return_img_per_step:
            return torch.stack(imgs_per_step, dim=0)
        else:
            return inp_imgs
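A minimal usage sketch (assuming the EBM class above; the model need not be trained for the call to run): start from uniform noise in [-1, 1] and run the chain.

model = EBM(hidden_features=32)
start_imgs = torch.rand(4, 1, 28, 28) * 2 - 1   # pure-noise starting points
new_imgs = generate_samples(model, start_imgs, steps=60, step_size=10)
print(new_imgs.shape)                           # torch.Size([4, 1, 28, 28])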

Contrastive Divergence

Now that we have seen how to sample new points with low energy from the sample space, let's look at training the model.

The energy function does not output a probability = MLE cannot be applied as-is.
(Of course, as always, the objective is to minimize an NLL loss.)

When pθ(x) is the Boltzmann distribution built from the energy function Eθ(x),
the gradient of this objective is as follows:
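In its usual form, an expectation over real data minus an expectation over the model's own samples:

\nabla_\theta \, \mathbb{E}_{x \sim p_{data}}\!\left[-\log p_\theta(x)\right] = \mathbb{E}_{x \sim p_{data}}\!\left[\nabla_\theta E_\theta(x)\right] - \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]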
The equation above has an intuitive reading:
∙ output a large negative energy score for real samples,
∙ output a large positive energy score for generated fake samples,
→ and make the gap between the two extremes as large as possible, using it as the loss function.


To compute the energy scores of fake samples, we would need to sample exactly from the distribution pθ(x),
which is impossible because the denominator is intractable.
→ Instead, Langevin sampling is used to generate a set of samples with low energy scores.

In addition, the samples from previous iterations are kept in a buffer
and used as the starting points of the next batch instead of pure noise:
def sample_new_exmps(self, steps=60, step_size=10):
        """
        Function for getting a new batch of "fake" images.
        Inputs:
            steps - Number of iterations in the MCMC algorithm
            step_size - Learning rate nu in the algorithm above
        """
        # Choose 95% of the batch from the buffer, 5% generate from scratch
        n_new = np.random.binomial(self.sample_size, 0.05)
        rand_imgs = torch.rand((n_new,) + self.img_shape) * 2 - 1
        old_imgs = torch.cat(random.choices(self.examples, k=self.sample_size-n_new), dim=0)
        inp_imgs = torch.cat([rand_imgs, old_imgs], dim=0).detach().to(device)

        # Perform MCMC sampling
        inp_imgs = Sampler.generate_samples(self.model, inp_imgs, steps=steps, step_size=step_size)

        # Add new images to the buffer and remove old ones if needed
        self.examples = list(inp_imgs.to(torch.device("cpu")).chunk(self.sample_size, dim=0)) + self.examples
        self.examples = self.examples[:self.max_len]
        return inp_imgs
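Putting the pieces together, below is a minimal sketch of one contrastive-divergence training step that could pair with the sampler above; it is not the original post's full code. It follows the sign convention of generate_samples (the network output behaves as a negative energy, so the loss below pushes the energy of real images down and the energy of fake images up); `sampler`, `optimizer`, and `alpha` are illustrative names.

def training_step(model, optimizer, real_imgs, sampler, alpha=0.1):
    # A little noise on the real images stabilizes training
    real_imgs = (real_imgs + 0.005 * torch.randn_like(real_imgs)).clamp(-1.0, 1.0)

    # Fake images come from the replay buffer + Langevin MCMC
    fake_imgs = sampler.sample_new_exmps(steps=60, step_size=10).detach()

    # Score both batches with the same network
    real_out = model(real_imgs)
    fake_out = model(fake_imgs)

    # Contrastive divergence: widen the gap between real and fake scores,
    # plus a small regularizer that keeps the scores from exploding
    cdiv_loss = fake_out.mean() - real_out.mean()
    reg_loss = alpha * (real_out ** 2 + fake_out ** 2).mean()
    loss = cdiv_loss + reg_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()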

At the end of each training step the scores are not normalized to probabilities;
the algorithm simply drives the scores (energies) of real samples down
and the scores of fake samples up.

For the full final code, see the link.

EBMs later evolved through a training technique called score matching,
which developed into the Denoising Diffusion Probabilistic Model (DDPM),
used to build state-of-the-art generative models such as DALL·E 2 and Imagen.

3. Others (BM, RBM)

Boltzmann Machine

One of the earliest examples of an EBM: a fully connected, undirected neural network.
v: visible units
h: hidden units
W, L, J: the learned weight matrices (visible-hidden, visible-visible, and hidden-hidden connections, respectively).
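Ignoring bias terms, the Boltzmann machine energy is commonly written with these three matrices as:

E(v, h) = -\,v^\top W h \;-\; \tfrac{1}{2} v^\top L\, v \;-\; \tfrac{1}{2} h^\top J\, h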

Trained with Contrastive Divergence.
Gibbs sampling alternates between v and h until an equilibrium is reached.

Drawbacks: training is very slow, and the number of hidden units cannot be scaled up much.

 

Restricted Boltzmann Machine

The RBM modifies the Boltzmann machine above by removing connections between units of the same type,
leaving a bipartite graph with two layers.
RBMs can then be stacked to build a Deep Belief Network that models more complex distributions.
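For binary units, the bipartite structure is what makes block Gibbs sampling cheap: with weight matrix W and bias vectors b, c (biases not shown above), the conditionals factorize, so an entire layer can be resampled in parallel:

p(h_j = 1 \mid v) = \sigma\!\left(c_j + \sum_i W_{ij} v_i\right), \qquad p(v_i = 1 \mid h) = \sigma\!\left(b_i + \sum_j W_{ij} h_j\right)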

Drawback: sampling with a long mixing time (the time Gibbs sampling needs to reach the target distribution) is still required,
so it remains impractical for modeling high-dimensional data.
4. Summary

Sampling from a deep EBM is done with Langevin dynamics.
This technique uses the gradient of the score with respect to the input image
and updates the input step by step in the negative gradient direction, lowering its energy score.

As a result, random noise is gradually transformed into plausible samples.
