📌 Why Do We Need WandB?

1. Model Experiment Pipeline

(Figure source: https://ml-ops.org/content/mlops-principles)

์œ„ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ, MLOps๊ณผ์ •์€ ํฌ๊ฒŒ 3๋‹จ๊ณ„๋กœ ๋‚˜๋‰œ๋‹ค.

 โˆ™ Project Design

 โˆ™ Experiment & Development

 โˆ™ ๋ฐฐํฌ ๋ฐ ์šด์˜

 

์ด์ค‘, 2๋‹จ๊ณ„์ธ "์‹คํ—˜์— ๋„์›€์„ ์ฃผ๋Š” Tool"์ค‘ ํ•˜๋‚˜๊ฐ€ ๋ฐ”๋กœ WandB์ด๋‹ค.

(cf. TensorBoard is another such tool.)


2. Configuration

ML๊ตฌ์ถ•์„ ์œ„ํ•œ ํ•„์ˆ˜๊ตฌ์„ฑ์š”์†Œ๋กœ ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์‹œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
โˆ™ Dataset
โˆ™ Metric
โˆ™ Model
โˆ™ Hyper-parameter

 

Training๊ณผ์ •์—์„œ, ์ด Configuration๊ฐ’์„ ์ ์ ˆํ•˜๊ฒŒ ์„ ํƒํ•ด์ค˜์•ผํ•œ๋‹ค.

๐Ÿง Batch size์— ๋Œ€ํ•˜์—ฌ
Data๋‚˜ Model์˜ ์ข…๋ฅ˜์— ๋”ฐ๋ผ ์ ์ ˆํ•œ Batch_size๊ฐ€ ์กด์žฌํ•˜๊ธฐ์—
batch_size๋ฅผ ๋„ˆ๋ฌด ์ž‘์€๊ฐ’์ด๋‚˜ ํฐ๊ฐ’์„ ์“ฐ๋ฉด ์˜คํžˆ๋ ค ํ•™์Šต์ด ์ž˜ ์•ˆ๋˜๋Š” ๊ฒฝํ–ฅ์ด ์กด์žฌํ•œ๋‹ค.

cf) Under certain assumptions this can be shown deductively:
if the batch size is doubled, the step size should be scaled by √2.

cf) If you increase the batch size but keep the total number of epochs the same,
the number of iterations per epoch decreases,
so fewer gradient-based parameter updates are performed,
and the loss may decrease more slowly, hurting training.

For these reasons, setting an appropriate configuration is all but essential.

 

ํŠนํžˆ, Dataset์€ Data Augmentation

Metric์€ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ๊ต์ฒดํ•˜๊ณ , Model๋„ ๊ตฌ์กฐ๋ฅผ ๋ณ€๊ฒฝ์‹œํ‚ค๋Š” ์‹œ๊ฐ„์€ ์ƒ๋Œ€์ ์œผ๋กœ ์ ์€ ์‹œ๊ฐ„์ด ๋“ค์ง€๋งŒ

 

Hyper-parameter Tuning์˜ ๊ฒฝ์šฐ ์ ์ ˆํ•œ ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ƒ๋‹นํžˆ ๋งŽ์€ ์‹œ๊ฐ„์„ ํ• ์• ํ•ด์•ผํ•œ๋‹ค.

Model์˜ parameter ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด Hyper-parameter๋ฅผ ์ ์ ˆํžˆ ์กฐ์ ˆํ•ด์•ผํ•˜๊ณ , ์ด๋Š” Hyper-parameter๋ฅผ ๋ณ€๊ฒฝ์‹œํ‚ค๋ฉฐ ๋‹ค์–‘ํ•œ ์‹คํ—˜์„ ํ•ด์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

์ด๋ฅผ ์‚ฌ๋žŒ์ด ์ผ์ผํžˆ ํ•œ๋‹ค๋ฉด?

์ฆ‰, Hyper-parameter๋ฅผ ์‚ฌ๋žŒ์ด ์ง์ ‘ ์ผ์ผํžˆ tuningํ•˜๋Š” ์ž‘์—…์€

๋งค์šฐ ๋น„ํšจ์œจ์ ์ด๊ณ , ๊ธฐ๋ก์ด ๋ˆ„๋ฝ๋ ์ˆ˜๋„ ์žˆ๊ณ  ์ด๋ฅผ ์ˆ˜๊ธฐ๋กœ ์ •๋ฆฌ๊นŒ์ง€ ํ•ด์•ผํ•˜๋Š”, 

์ข…ํ•ฉ๊ณ ๋ฏผ3์ข…์„ธํŠธ๋ผ ํ•  ์ˆ˜ ์žˆ๊ฒ ๋‹ค.


๐Ÿ“Œ WandB?

WandB (Weights & Biases) is an ML experiment tracking tool that helps you build better-optimized models in less time.

Key features

W&B Platform

  • Experiments: provides a dashboard for tracking machine learning experiments.
  • Artifacts: dataset and model version management.
  • Tables: log data to visualize and query it in W&B.
  • Sweeps: automatic hyper-parameter tuning and optimization.
  • Reports: organize experiments into documents to share with collaborators.


📌 W&B Experiments: Functions and Examples

๋ชจ๋ธํ•™์Šต ์‹œ, ๋ชจ๋ธ ํ•™์Šต log๋ฅผ ์ถ”์ ํ•˜์—ฌ Dashboard๋ฅผ ํ†ตํ•ด ์‹œ๊ฐํ™”
์ด๋ฅผ ํ†ตํ•ด ํ•™์Šต์ด ์ž˜ ๋˜๊ณ  ์žˆ๋Š”์ง€ ๋น ๋ฅด๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

1. Config setting

W&B์‹คํ–‰์„ ์œ„ํ•ด configํŒŒ์ผ์ด ํ•„์š”ํ•˜๊ธฐ์— 

Hyper-parameter, Data๋ช… ๋“ฑ ํ•™์Šต์— ํ•„์š”ํ•œ ๊ตฌ์„ฑ๋“ค์„ ๊ทธ๋ฃนํ™”ํ•œ๋‹ค.

๋˜ํ•œ, ์ด configํŒŒ์ผ์€ sweep์— ์ค‘์š”ํ•˜๊ฒŒ ์‚ฌ์šฉ๋œ๋‹ค.

config = {
    # data
    'dataset': 'MNIST',
    'batch_size': 128,
    'epochs': 5,

    # model
    'architecture': 'CNN',
    'classes': 10,
    'kernels': [16, 32],

    # optimization
    'weight_decay': 0.0005,
    'learning_rate': 1e-3,

    # reproducibility
    'seed': 42
}
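A quick sketch of how this dict is consumed: wandb.init() registers it, and the values become attribute-accessible via wandb.config, which the train() and run() functions below rely on (offline mode here is my addition, handy for testing without a server):

import wandb

run = wandb.init(project='MNIST', config=config, mode='offline')  # offline: no server needed
print(wandb.config.batch_size)  # 128 — dict keys become attributes
run.finish()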


2. Dataset with DataLoader

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_loader(batch_size, train=True):
    # Download MNIST (train or test split) and convert images to tensors.
    full_dataset = datasets.MNIST(root='./data/MNIST', train=train, download=True,
                                  transform=transforms.ToTensor())

    loader = DataLoader(dataset=full_dataset,
                        batch_size=batch_size,
                        shuffle=train,  # shuffle only the training split
                        pin_memory=True, num_workers=2)
    return loader
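A quick sanity check of the loader (MNIST images are 1×28×28 grayscale):

images, labels = next(iter(make_loader(batch_size=128)))
print(images.shape, labels.shape)  # torch.Size([128, 1, 28, 28]) torch.Size([128])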


3. CNN Model

import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, kernels, classes=10):
        super(ConvNet, self).__init__()

        # 1×28×28 -> kernels[0]×14×14
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, kernels[0], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # kernels[0]×14×14 -> kernels[1]×7×7
        self.layer2 = nn.Sequential(
            nn.Conv2d(kernels[0], kernels[1], kernel_size=5, stride=1, padding=2),  # was hardcoded as 16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * kernels[-1], classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)  # flatten for the fully connected layer
        out = self.fc(out)
        return out
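A quick shape check confirms the flatten dimension (28 → 14 → 7 after the two 2×2 max-poolings):

import torch

model = ConvNet(kernels=[16, 32])
x = torch.randn(1, 1, 28, 28)  # one dummy MNIST-sized image
print(model(x).shape)          # torch.Size([1, 10])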

์ถ”๊ฐ€์ ์œผ๋กœ W&B๋Š” ๋ชจ๋ธ์˜ weights์™€ bias๊ฐ™์€ parameter๋ฅผ ์ถ”์ ํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด๋ฅผ ํ†ตํ•ด ํ•™์Šต ๋„์ค‘ weights์˜ histogram์ด๋‚˜ distribution์„ ํ†ตํ•ด ์›ํ™œํ•œ ํ•™์Šต๋ฐฉํ–ฅ์ˆ˜์ •์ด ๊ฐ€๋Šฅํ•˜๋‹ค.


4. Train function

import wandb
from tqdm import tqdm

def train(model, loader, criterion, optimizer, config):
    # Track the model's gradients and parameters, logging every 10 batches.
    wandb.watch(model, criterion, log="all", log_freq=10)

    example_ct = 0  # running count of examples seen
    for epoch in tqdm(range(config.epochs)):
        cumu_loss = 0
        for images, labels in loader:
            # `device` is assumed to be defined globally, e.g. torch.device('cuda').
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)
            cumu_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            example_ct += len(images)

        avg_loss = cumu_loss / len(loader)
        wandb.log({"loss": avg_loss}, step=epoch)  # one loss point per epoch
        print(f"TRAIN: EPOCH {epoch + 1:04d} / {config.epochs:04d} | Epoch LOSS {avg_loss:.4f}")

The wandb.log() call lets you visualize the loss.

Note that it records avg_loss once per epoch, passing the epoch as the step.

 

wandb.watch() hooks into the model so that the experiment logs (here, gradients and parameters) are visualized on the dashboard.
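Annotated, the watch call from train() reads as follows (argument meanings per the W&B API):

wandb.watch(
    model,        # module whose parameters and gradients are tracked
    criterion,    # loss function tracked alongside the model
    log="all",    # log both gradient and parameter histograms
    log_freq=10   # log once every 10 batches
)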


5. Run function

def run(config=None):
    # Connect to the wandb web server and start a tracked run.
    wandb.init(project='MNIST', entity='your-entity-name', config=config)

    config = wandb.config  # use the (possibly sweep-overridden) config

    train_loader = make_loader(batch_size=config.batch_size, train=True)
    test_loader = make_loader(batch_size=config.batch_size, train=False)

    model = ConvNet(config.kernels, config.classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

    train(model, train_loader, criterion, optimizer, config)
    test(model, test_loader)  # test() is defined elsewhere (not shown in this post)
    return model

wandb.init() connects the run to the wandb web server.
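Putting it all together, a tracked run can be launched with the config dict from step 1 (a minimal sketch; wandb.finish() cleanly closes the run):

model = run(config)
wandb.finish()  # mark the run as finished on the dashboard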

 

cf) Places where project and entity can be specified:

  • the config file (config.py or config.yaml)
  • wandb.sweep()
  • wandb.init()
  • wandb.agent()


6. Results

As a result of the experiment, we can inspect the gradient values propagated to each layer at every epoch.

In addition, hovering the mouse over the chart shows the gradient distribution for that epoch, as in the figure above.
