😶 Abstract

- Training deep neural networks is complicated by the fact that the inputs and parameters of each layer keep changing during training.
- This slows down training by requiring low learning rates and careful parameter initialization, and makes models with saturating nonlinearities notoriously hard to train.
- We call this phenomenon internal covariate shift, and address the problem by normalizing layer inputs.

- ์šฐ๋ฆฌ์˜ ๋ฐฉ๋ฒ•์€ model ๊ตฌ์กฐ์˜ ์ผ๋ถ€์˜ normalization๊ณผ for each training mini-batch์— normalization์„ ์ ์šฉํ•จ์„ ํ†ตํ•œ Batch-Normalization์„ ํ†ตํ•ด ๋” ๋†’์€ learning rates์™€ ์ดˆ๊ธฐํ™”์— ๋Œ€ํ•ด ๋œ ์‹ ๊ฒฝ ์“ธ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ทธ๋Ÿฐ ๊ฐ•์ ์„ ์ด๋Œ์–ด ๋‚ด์—ˆ๋‹ค.
- Batch-Normalization์€ ๋˜ํ•œ regularizer์˜ ์—ญํ• ๋„ ํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์–ด๋– ํ•œ ๊ฒฝ์šฐ์—๋Š” Dropout์˜ ํ•„์š”์„ฑ์„ ์—†์• ์ฃผ์—ˆ๋‹ค.

- Using an ensemble of BN (Batch Normalization) networks, we achieved the best result on ILSVRC, and we could reach the same accuracy with far fewer training steps.


1. Introduction

- In deep learning, SGD has proved to be a very effective way of training networks, and SGD variants such as momentum and Adagrad have achieved state-of-the-art performance.
- SGD optimizes the network parameters Θ over the training data set x_{1..N} so as to minimize the loss.
- With SGD, training proceeds in steps; at each step consider a mini-batch x_{1...m} of size m.
The mini-batch is used to estimate the gradient of the loss function with respect to the parameters.
• Using mini-batches has several advantages.
  ① The gradient of the loss over a mini-batch is an estimate of the gradient over the whole training set, whose quality improves as the batch size increases.
  ② Computation over a batch can be much more efficient than m separate computations for individual examples, thanks to the parallelism provided by modern computing platforms.
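For reference, the mini-batch gradient step this describes can be written as (with learning rate α):

$$ \Theta \leftarrow \Theta - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta} $$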


- SGD is simple and efficient, but it requires careful tuning of the hyper-parameters, especially the learning rate.
Training is further complicated by the fact that even small parameter changes get amplified as the network becomes deeper.

- layer๋“ค์ด ์ƒˆ๋กœ์šด ๋ถ„ํฌ์— ์ง€์†์ ์œผ๋กœ ์ ์‘ํ•  ํ•„์š”๊ฐ€ ์žˆ์–ด์„œ layer์˜ input์˜ ๋ถ„ํฌ(distribution)๋ณ€ํ™”๋Š” ๋ฌธ์ œ๋ฅผ ์ œ์‹œํ•œ๋‹ค.
ํ•™์Šต์ž…๋ ฅ๋ถ„ํฌ์˜ ๋ณ€ํ™”๋Š” covariate shift(๊ณต๋ณ€๋Ÿ‰ ์ด๋™)์„ ๊ฒฝํ—˜ํ•˜๋Š”๋ฐ, ์ด๋Š” ๋ณดํ†ต domain adaptation์œผ๋กœ ํ•ด๊ฒฐํ•œ๋‹ค.
๋‹ค๋งŒ, covariate shift์˜ ๊ฐœ๋…์€ sub-network(ํ•˜์œ„ ์‹ ๊ฒฝ๋ง)๋‚˜ layer๊ฐ€ ๋™์ผํ•œ ๋ถ€๋ถ„์— ์ ์šฉํ•ด์•ผ ํ•œ๋‹ค.
์ด๋กœ ์ธํ•ด learning system์ด ์ „์ฒด๋ฅผ ๋„˜์–ด์„œ ํ™•์žฅ๋  ์ˆ˜ ์žˆ๋‹ค.
F1๊ณผ F2๊ฐ€ ์ž„์˜์˜ transformation์ด๊ณ , Loss๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด parameter Φ1, Φ2๊ฐ€ ํ•™์Šต๋˜๋Š” ์‹ ๊ฒฝ๋ง์˜ ๊ณ„์‚ฐ์„ ์ƒ๊ฐํ•ด๋ณด์ž. ํ•™์Šต Φ2๋Š” ์ž…๋ ฅ x = F1(u, Φ1)์ด sub-network์— ๊ณต๊ธ‰๋˜๋Š” ๊ฒƒ(fed into)์ฒ˜๋Ÿผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

For example, a gradient descent step (with batch size m and learning rate α)

Θ2 ← Θ2 − (α/m) Σ_{i=1}^{m} ∂F2(x_i, Θ2)/∂Θ2

is exactly equivalent to that for a stand-alone network F2 with input x.
Therefore, the input-distribution properties that make training more efficient (such as having the same distribution between training and test data) apply to training the sub-network as well.
As such, it is advantageous for the distribution of x to remain fixed over time;
then Θ2 does not have to readjust to compensate for changes in the distribution of x.

Consider a sigmoid layer z = g(Wu + b), where g(x) is the sigmoid 1 / (1 + exp(−x)), u is the input, and b is the bias. Since g'(x) = exp(−x) / (1 + exp(−x))^2 tends to 0 as |x| increases, the gradient vanishes and training slows down.
Here x = Wu + b is strongly affected by the parameters of the layers below, so parameter changes during training will likely move many dimensions of x into the "saturated regime of the nonlinearity" and slow down convergence.
This effect is amplified as the depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (ReLU(x) = max(x, 0)), careful initialization, and small learning rates.

- However, if we could ensure that the distribution of the nonlinearity inputs stays more stable as the network trains, the optimizer would be less likely to get stuck in the saturated regime, and training would accelerate. (Note: it is the activation, e.g. a Rectified Linear Unit (ReLU), that introduces non-linearity into the network.)


Internal Covariate Shift
  - We refer to the change in the distributions of a deep network's internal nodes in the course of training as internal covariate shift; eliminating it offers the promise of faster training.
  - We propose a new mechanism for this, called BN (Batch Normalization), which reduces internal covariate shift and, in doing so, dramatically accelerates the training of deep networks.
 
  - layer์˜ ์ž…๋ ฅ๊ฐ’์˜ ํ‰๊ท (means)๊ณผ ๋ถ„์‚ฐ(variance)์„ ๊ณ ์ •ํ•จ์œผ๋กœ ์ด๋ฅผ ๋‹ฌ์„ฑํ•œ๋‹ค.
  - Batch Normalization์˜ ์ถ”๊ฐ€์ ์ธ ์ด์ ์€ ๋ฐ”๋กœ ์‹ ๊ฒฝ๋ง์—์„œ์˜ gradient flow์ด๋‹ค.
  - ์ด๋Š” ๋ฐœ์‚ฐ(divergence)์˜ ์œ„ํ—˜์„ฑ์—†์ด ๋” ๋†’์€ learning rate์˜ ์‚ฌ์šฉ์„ ํ—ˆ์šฉํ•˜๊ฒŒ ํ•œ๋‹ค.
  - ๋˜ํ•œ, BN์€ model์„ ๊ทœ์ œํ™”(regularize)ํ•˜๊ณ  Dropout์˜ ํ•„์š”์„ฑ์„ ์ค„์—ฌ์ค€๋‹ค.
 ์ฆ‰, ์‹ ๊ฒฝ๋ง์˜ ํฌํ™”์ƒํƒœ์— ๊ณ ์ฐฉ๋˜๋Š” ๊ฒƒ์„ ๋ง‰์•„์คŒ์œผ๋กœ ํฌํ™” ๋น„์„ ํ˜•์„ฑ(saturating nonlinearity)์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด์ค€๋‹ค.

cf.

Non-linearity refers to a mathematical operation that is not linear, meaning the output is not proportional to the input (a linear operation's output is proportional to its input).

It is introduced via the activation function applied to a layer's output.

 

[Saturation]

- input์ด ์ฆ๊ฐ€ํ•ด๋„ ํ•จ์ˆ˜์˜ output์ด ํฌํ™”๋˜๊ฑฐ๋‚˜ ํŠน์ •๊ฐ’์— "๊ณ ์ •"๋˜๋Š” ํ˜„์ƒ

- ํ™œ์„ฑํ™”ํ•จ์ˆ˜ ์ž…์žฅ์—์„œ ํฌํ™”๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ input์ด ๊ทน๋‹จ์ ์œผ๋กœ ํฌ๊ฑฐ๋‚˜ ์ž‘์„ ๋•Œ ๋ฐœ์ƒํ•œ๋‹ค.

 - ์ด๋กœ ์ธํ•ด input๊ณผ ๊ด€๊ณ„์—†์ด ์ผ์ •ํ•œ ๊ฐ’์„ ์ถœ๋ ฅํ•˜๊ฒŒ ํ•˜์—ฌ ๊ธฐ์šธ๊ธฐ๊ฐ€ ๋งค์šฐ ์ž‘์•„์ง€๋Š”(gradient vanishing) ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.

 - ๊ฒฐ๊ณผ์ ์œผ๋กœ ์‹ ๊ฒฝ๋ง์—์„œ ํ•™์Šต์†๋„๋ฅผ ๋Šฆ์ถ”๊ฑฐ๋‚˜ ๋ฐฉํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค.

 

[Saturating non-linearity]

- Sigmoid and tanh are the canonical examples of saturating non-linearities: activation functions that exhibit saturation.

- ReLU and its variants do not saturate (for positive inputs) and are therefore called non-saturating non-linearities.

 

[Reference: which common functions are linear?]

  1. Identity function: f(x) = x (linear)
  2. Constant function: f(x) = c (c a constant; affine, but strictly speaking not linear)
  3. Polynomial function: f(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_1 x + a_0 (constant coefficients; non-linear for degree ≥ 2)
  4. Linear regression: y = b_1 x + b_0 (b_0 and b_1 constants)
  5. Affine function: f(x) = a x + b (a and b constants; linear in the strict sense only when b = 0)
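A tiny sanity check of the strict definition, f(a·x + b·y) = a·f(x) + b·f(y) (an illustrative sketch of ours):

```python
import random

def is_linear(f, trials=100):
    """Numerically test f(a*x + b*y) == a*f(x) + b*f(y) on random samples."""
    for _ in range(trials):
        a, b, x, y = (random.uniform(-5, 5) for _ in range(4))
        if abs(f(a * x + b * y) - (a * f(x) + b * f(y))) > 1e-9:
            return False
    return True

print(is_linear(lambda x: 3 * x))      # True: f(x) = 3x
print(is_linear(lambda x: 3 * x + 1))  # False: affine, not linear
print(is_linear(lambda x: x ** 2))     # False: degree-2 polynomial
```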


2. Towards Reducing Internal Covariate Shift

- Internal covariate shift arises because changing the network's parameters during training changes the distributions of internal activations.
- Therefore, to improve training, we seek to reduce internal covariate shift.

 

 Whitening

 - layer์˜ input feature๋ฅผ ์กฐ์ •ํ•˜๊ณ  ํ‰๊ท (means)๊ณผ ๋‹จ์œ„๋ถ„์‚ฐ(unit variance)๊ฐ€ 0์ด ๋˜๊ฒŒ ํ•˜๋Š” ์ •๊ทœํ™”(normalization) ๊ธฐ์ˆ ๋กœ training์ˆ˜๋ ด์ด ๋นจ๋ผ์ง„๋‹ค.
- input์„ whiteningํ•จ์œผ๋กœ์จ BN์€ ๊ฐ ๊ณ„์ธต์— ๋Œ€ํ•œ ์ž…๋ ฅ ๋ถ„ํฌ(input distribution)๋ฅผ ์•ˆ์ •ํ™”ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์‹ ๊ฒฝ๋ง์ด ์ž…๋ ฅ ๋ถ„ํฌ์˜ ๋ณ€ํ™”์— โ€‹โ€‹๋œ ๋ฏผ๊ฐํ•ด์ง€๊ณ  ๋” ๋น ๋ฅด๊ณ  ์•ˆ์ •์ ์œผ๋กœ ์ˆ˜๋ ดํ•  ์ˆ˜ ์žˆ๋‹ค.
- ๋˜ํ•œ ๋‹ค๋ฅธ ๋‰ด๋Ÿฐ๊ณผ activation ์‚ฌ์ด์˜ ๊ณต๋ถ„์‚ฐ์„ ์ค„์—ฌ ์‹ ๊ฒฝ๋ง์„ ์ •๊ทœํ™”(regularize)ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๊ณ  ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.
- ์ด๋Ÿฐ whitening์€ ์‹ ๊ฒฝ๋ง์„ ์ฆ‰์‹œ ์ˆ˜์ •ํ•˜๊ฑฐ๋‚˜ optimization algorithm์˜ parameter๋ฅผ ๋ฐ”๊พธ๋Š” ๊ฒƒ์œผ๋กœ ๊ณ ๋ ค๋  ์ˆ˜ ์žˆ๋‹ค.

However, if these modifications are interspersed with the optimization steps, the gradient descent step may attempt a parameter update in a way that requires the normalization to be updated, which reduces the effect of the gradient step. The following example illustrates this.
Consider a layer with input u that adds a learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x_hat = x − E[x], where x = u + b, X = {x_{1...N}} is the set of values of x over the training set, and E[x] = (1/N) Σ_{i=1}^{N} x_i. If the gradient descent step ignores the dependence of E[x] on b, it updates b ← b + Δb, where Δb ∝ −∂ℓ/∂x_hat. Then u + (b + Δb) − E[u + (b + Δb)] = u + b − E[u + b].

๋”ฐ๋ผ์„œ b๋กœ์˜ update์™€ ๊ทธ์— ๋”ฐ๋ฅธ noramlization์˜ ๋ณ€ํ™”์˜ ์กฐํ•ฉ์€ layer์˜ output์˜ ๋ณ€ํ™”, loss๋ฅผ ์ดˆ๋ž˜ํ•˜์ง€ ์•Š์•˜๋‹ค.
training์ด ๊ณ„์†๋ ์ˆ˜๋ก b๋Š” loss๊ฐ’์ด ๊ณ ์ •๋œ ์ƒํƒœ์—์„œ ๋ฌดํ•œํžˆ(indefinitely) ์ปค์งˆ ๊ฒƒ์ด๋‹ค.
์ด ๋ฌธ์ œ๋Š” ์ •๊ทœํ™”๊ฐ€ activation์˜ ์ค‘์‹ฌ์ผ ๋ฟ๋งŒ์•„๋‹ˆ๋ผ scale์กฐ์ • ์‹œ ๋” ์•…ํ™”๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค.
normalization parameter๊ฐ€ ๊ฒฝ์‚ฌํ•˜๊ฐ•๋‹จ๊ณ„์‹œ ์™ธ๋ถ€์—์„œ ๊ณ„์‚ฐ๋˜๋ฉด ๋ชจ๋ธ์ด ํ„ฐ์ ธ๋ฒ„๋ฆฐ๋‹ค.
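The mechanism is easy to reproduce in a toy setting (our own illustration, not code from the paper): because x_hat = (u + b) − E[u + b] is invariant to b, the loss never changes, yet the ignored dependence of E[x] on b leaves a constant gradient that pushes b forever.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=1000)        # fixed layer inputs
target = rng.normal(size=1000)   # arbitrary regression targets
b, lr = 0.0, 0.5

for step in range(5):
    x_hat = (u + b) - (u + b).mean()   # mean subtraction done OUTSIDE the gradient step
    loss = 0.5 * np.mean((x_hat - target) ** 2)
    grad_b = np.mean(x_hat - target)   # ignores the dependence of E[x] on b
    b -= lr * grad_b
    print(f"step {step}: loss = {loss:.6f}, b = {b:.4f}")
# The printed loss is identical at every step (x_hat does not depend on b),
# while b drifts at a constant rate and would grow without bound.
```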

์ด ๋ฌธ์ œ์˜ ํ•ด๊ฒฐ์„ ์œ„ํ•ด ๋ชจ๋“  parameter๊ฐ’์— ์‹ ๊ฒฝ๋ง์ด ํ•ญ์ƒ ์›ํ•˜๋Š” ๋ถ„ํฌ๋กœ activation์„ ์ƒ์„ฑํ•˜๋ ค ํ•˜๋Š”๋ฐ, ์ด๋Š” ๋ชจ๋ธ์˜ parameter ๋ณ€์ˆ˜์— ๋Œ€ํ•œ loss ๊ธฐ์šธ๊ธฐ(gradient)๊ฐ€ normalization์„ ์„ค๋ช…ํ•˜๊ณ  ๋ชจ๋ธ parameter Θ์— ๋Œ€ํ•œ ์˜์กด์„ฑ์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค.

Let x be a layer input and X the set of these inputs over the training data.
The normalization can then be written as a transformation

x_hat = Norm(x, X)

which depends not only on the given training example x but on all examples X, each of which depends on Θ whenever x is generated by another layer.
For backpropagation, we would need to compute the Jacobians

∂Norm(x, X)/∂x and ∂Norm(x, X)/∂X

(the partial derivatives needed for the gradients with respect to the input x and the set X); ignoring the latter term would lead to the explosion described above.
Within this framework, whitening the layer inputs is expensive, since it requires computing the covariance matrix

Cov[x] = E_{x∈X}[x xᵀ] − E[x] E[x]ᵀ

and its inverse square root to produce the whitened activations

Cov[x]^{−1/2} (x − E[x]),

as well as the derivatives of these transforms for backpropagation.

- This motivates us to seek an alternative: input normalization that is differentiable and does not require analyzing the entire training set after every parameter update, while still preserving the information in the network by normalizing activations during training relative to the statistics of the whole training data.

 

 

3. Normalization via Mini-Batch Statistics

๊ฐ layer์ž…๋ ฅ๋ถ€์˜ full whitening์€ ๋น„์šฉ์ด ๋งŽ์ด๋“ค๊ณ  ๋ชจ๋“ ๊ณณ์—์„œ ๊ตฌ๋ณ„ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— 2๊ฐ€์ง€ ์ค‘์š”ํ•œ ๋‹จ์ˆœ์„ฑ์„ ๋งŒ๋“ค์—ˆ๋‹ค.
โ‘  layer์˜ input๊ณผ output์„ ๊ธด๋ฐ€ํ•˜๊ฒŒ ํ•˜๊ธฐ์œ„ํ•ด feature๋ฅผ whiteningํ•˜๋Š” ๋Œ€์‹ , N(0, 1)์˜ normalization์„ ๊ฐ scalar feature์— ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”์‹œํ‚จ๋‹ค.
  - d์ฐจ์›์˜ input x = (x1 . . . xd)์— ๋Œ€ํ•ด ๊ฐ ์ฐจ์›์„ ์ •๊ทœํ™”(normalize)ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
(์ด๋•Œ, E๋Š” Expectation,  Var๋Š” Variance๋กœ training dataset์— ๋Œ€ํ•ด ๊ณ„์‚ฐ๋œ ๊ฐ’)
- LeNet(LeCun et al., 1998b)์—์„œ ๋ณด์˜€๋“ฏ normalization์€ ์ˆ˜๋ ด์†๋„๋ฅผ ๋†’์—ฌ์ค€๋‹ค.
์ด๋•Œ, ๊ฐ„๋‹จํ•œ ์ •๊ทœํ™”๋Š” ๊ฐ ์ธต์˜ ์ž…๋ ฅ๋ถ€๊ฐ€ ๋ฐ”๋€” ์ˆ˜ ์žˆ๋Š”๋ฐ, ์˜ˆ๋ฅผ ๋“ค๋ฉด sigmoid input์˜ ์ •๊ทœํ™”(normalization)์˜ ๊ฒฝ์šฐ, ๋น„์„ ํ˜•์„ฑ์˜ ์ƒํ™ฉ์„ linearํ•˜๊ฒŒ ๋งŒ๋“ค ์ˆ˜๋„ ์žˆ๋‹ค.
To address this, we make sure that "the transformation inserted in the network can represent the identity transform".
- To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k) which scale and shift the normalized value:

y^(k) = γ^(k) x_hat^(k) + β^(k)

These parameters are learned along with the original model parameters and restore the representation power of the network; indeed, setting γ^(k) = √Var[x^(k)] and β^(k) = E[x^(k)] would recover the original activations, if that were the optimal thing to do.
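A quick numerical check of that identity-recovery claim (a sketch of ours; ε omitted for clarity):

```python
import numpy as np

x = np.random.randn(256, 4) * 3.0 + 1.5   # toy activations, shape (m, d)
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var)           # per-dimension normalization
gamma, beta = np.sqrt(var), mu            # the identity-recovering choice
y = gamma * x_hat + beta
print(np.allclose(y, x))                  # True: the transform can represent identity
```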
- In the batch setting, normalizing activations over the whole training set would be impractical when using stochastic optimization, which led us to the second simplification.


② Since we use mini-batches in SGD training, each mini-batch produces estimates of the mean and variance of each activation.
This way, the statistics used for normalization can fully participate in the gradient backpropagation.
- Note that the use of mini-batches is enabled by computing per-dimension variances rather than a joint covariance;
in the joint case, regularization would be required, since the mini-batch size is likely smaller than the number of activations being whitened (resulting in singular covariance matrices).

- Consider a mini-batch B of size m.
Since the normalization is applied to each activation independently, let us focus on a particular activation x^(k)
(and omit k for clarity).

[Algorithm 1: the Batch Normalizing Transform]

μ_B = (1/m) Σ_{i=1}^{m} x_i                  (mini-batch mean)
σ²_B = (1/m) Σ_{i=1}^{m} (x_i − μ_B)²         (mini-batch variance)
x_hat_i = (x_i − μ_B) / √(σ²_B + ε)           (normalize)
y_i = γ x_hat_i + β ≡ BN_{γ,β}(x_i)           (scale and shift)

Here ε is a constant added to the mini-batch variance for numerical stability.
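In code, Algorithm 1 amounts to just a few lines. Below is a minimal NumPy sketch of ours (function and argument names are our own), for a mini-batch of shape (m, d):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN (Algorithm 1), applied independently per dimension."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```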
 numerical stability
  - During training, the gradient of the loss function is computed via backpropagation (which involves long chains of multiplied partial derivatives).
  - Because this computation involves many numbers, numerical instability such as gradient vanishing/exploding can make optimization difficult or impossible. (Numerical stability is about minimizing the error and inaccuracy inherent in finite-precision arithmetic, e.g. below the decimal point.)
  - Batch Normalization was introduced as a technique that improves numerical stability during training by normalizing the activations of intermediate layers. (This paper emphasizes its importance when training networks with SGD.)
  - By keeping activation scales in check, BN helps prevent gradient vanishing and exploding, enabling safer and more efficient training.
  - Numerical error can also be reduced by omitting the bias term
(though, given the risk of hurting accuracy, this requires an appropriate balance between performance and numerical stability).
 
[numerical instability vs. overfitting]
- Numerical instability can contribute to overfitting: if gradients become extremely large or small, optimization becomes unstable and overfitting may follow.
- Likewise, when numerical computation is inaccurate, the model may fail to converge to a good optimum, which can also manifest as overfitting.
- That said, overfitting can occur without any numerical instability, e.g. due to model complexity or insufficient/noisy training data.

- The BN transform can be added to a network to manipulate any activation.
The distribution of the normalized values has expectation 0 and variance 1, and, if we neglect ε, the elements of each mini-batch can be regarded as samples from the same distribution; this can be verified by taking the corresponding expectations.
training์—์„œ transformation๊ณผ์ •์„ ํ†ตํ•ด ๋‚˜์˜จ loss์˜ ๊ธฐ์šธ๊ธฐ โ„“์— ๋Œ€ํ•ด backpropagation์„ ์ง„ํ–‰ํ•˜๊ณ  BN transformation์˜ parameter์— ๋Œ€ํ•œ gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค.
์ด๋•Œ, chain rule์„ ์‚ฌ์šฉํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋‹จ์ˆœํ™” ํ•œ๋‹ค.
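These are the gradients as given in the paper, backpropagated through the normalization statistics themselves rather than through a fixed transform:

$$
\begin{aligned}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \cdot \gamma \\
\frac{\partial \ell}{\partial \sigma_B^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-1}{2} (\sigma_B^2 + \epsilon)^{-3/2} \\
\frac{\partial \ell}{\partial \mu_B} &= \Bigl( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \Bigr) + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_B)}{m} \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2 (x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i \\
\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{aligned}
$$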
Thus, the BN transformation is a differentiable transformation that introduces normalized activations into the network.
This ensures that, as the model trains, layers continue learning on input distributions that exhibit less internal covariate shift, which accelerates training.
Furthermore, the learned affine transform applied to these normalized activations allows the BN transformation to represent the identity transformation, and preserves the network's capacity.

 

 

3.1 Training and Inference with Batch-Normalized Networks

- To batch-normalize a network, we specify a subset of activations and insert the BN transform for each of them, following Algorithm 1.
- Any layer that previously received x as its input now receives BN(x).
- A model employing BN can be trained with [ BGD || SGD with mini-batch size m > 1 || variants such as Adagrad (Duchi et al., 2011) ].
- The normalization of activations that depends on the mini-batch allows efficient training, but it is neither necessary nor desirable during inference, since we want the output to depend, deterministically, only on the input.
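At inference time, the mini-batch statistics are therefore replaced with population statistics, e.g. E[x] = E_B[μ_B] and the unbiased Var[x] = (m/(m−1)) · E_B[σ²_B], turning BN into a fixed linear transform. A minimal sketch of ours:

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a deterministic linear transform using population
    statistics (estimated e.g. via moving averages of the mini-batch
    means/variances collected during training)."""
    scale = gamma / np.sqrt(pop_var + eps)
    return scale * x + (beta - scale * pop_mean)
```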

 

 

3.2 Batch-Normalized Convolutional Networks

- Batch Normalization์€ ์‹ ๊ฒฝ๋ง์˜ ๋ชจ๋“  activation์ง‘ํ•ฉ์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๊ธฐ์— ์šฐ๋ฆฌ๋Š” ์š”์†Œ๋ณ„(element-wise) ๋น„์„ ํ˜•์„ฑ์„ ๋”ฐ๋ฅด๋Š” affine transformation์œผ๋กœ ๊ตฌ์„ฑ๋œ transform์— ์ดˆ์ ์„ ๋งž์ถ˜๋‹ค.

W ์™€ b๋Š” ํ•™์Šต๋œ ๋งค๊ฐœ๋ณ€์ˆ˜์ด๊ณ  g(.)๋Š” (sigmoid, ReLU๊ฐ™์€) ๋น„์„ ํ˜•์„ฑํ•จ์ˆ˜(saturating nonlinearity)์ด๋‹ค.
์ด ์‹์€ FC์ธต๊ณผ convolution์ธต ๋ชจ๋‘ ์ ์šฉ๋˜๋Š”๋ฐ, ๋น„์„ ํ˜•์„ฑ ์ „์— x = Wu + b๋ฅผ ์ •๊ทœํ™” ํ•จ์„ ํ†ตํ•ด ๋ฐ”๋กœ BN transform์„ ์ถ”๊ฐ€ํ•ด์คฌ๋‹ค.
๋”๋ถˆ์–ด input u๋„ ์ •๊ทœํ™”ํ•  ์ˆ˜ ์žˆ์—ˆ์ง€๋งŒ u๋Š” ๋‹ค๋ฅธ ๋น„์„ ํ˜•์„ฑ์˜ ์ถœ๋ ฅ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„ ๋ถ„ํฌ๊ฐ€ training๊ณผ์ •์—์„œ ๋ฐ”๋€” ์ˆ˜ ์žˆ๊ธฐ์— ์ฒซ๋ฒˆ์งธ์™€ ๋‘๋ฒˆ์งธ moment๋ฅผ ์ œํ•œํ•˜๋Š” ๊ฒƒ๋งŒ์œผ๋กœ๋Š” ๊ณต๋ณ€๋Ÿ‰์ด๋™์„ ์ œ๊ฑฐํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๋‹ค.

- In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is, "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

- ๋”ฐ๋ผ์„œ ์ด์—๋Œ€ํ•ด ์ฃผ๋ชฉํ•  ์ ์€, ์šฐ๋ฆฌ๊ฐ€ Wu + b๋ฅผ ์ •๊ทœํ™”ํ•˜๊ธฐ์— bias b์˜ ํšจ๊ณผ๋Š” ํ›„์† ํ‰๊ท ์˜ ์ฐจ์— ์˜ํ•ด ์ทจ์†Œ๋  ๊ฒƒ์ด๊ธฐ์— ๋ฌด์‹œ๋  ์ˆ˜ ์žˆ๋‹ค.
์ด๋•Œ, bias์˜ ์—ญํ• ์€ Algorithm. 1.์˜ β์— ์˜ํ•ด ๊ฐ€์ •๋œ๋‹ค.
๋”ฐ๋ผ์„œ z = g(Wu + b)๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด BN transformation์ด x = Wu์˜ ๊ฐ ์ฐจ์›์— ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉ๋˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์•„๋ž˜์™€ ๊ฐ™์ด๋Œ€์ฒด๋œ๋‹ค. (์ด๋•Œ, ๊ฐ ์ฐจ์›์—๋Š” ํ•™์Šต๋œ ๋งค๊ฐœ ๋ณ€์ˆ˜ γ(k), β(k)๋ผ๋Š” ๋ณ„๋„์˜ ์Œ์ด ์กด์žฌํ•œ๋‹ค.)

- For conv layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way.
To achieve this, we jointly normalize all the activations in a mini-batch, over all locations.

- In Algorithm 1, we let B be the set of all values in a feature map across both the elements of the mini-batch and the spatial locations. - So, for a mini-batch of size m and feature maps of size p×q, we use an effective mini-batch of size m' = |B| = m · pq.

- We learn a pair of parameters γ^(k) and β^(k) per feature map, rather than per activation.
- Algorithm 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
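For a conv activation of shape (m, c, p, q), this just means taking the statistics over the batch and spatial axes together. A NumPy sketch of ours:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Per-feature-map BN for x of shape (m, c, p, q): each channel is
    normalized over an effective mini-batch of size m * p * q."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)   # one variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have shape (1, c, 1, 1): one learned pair per feature map
    return gamma * x_hat + beta
```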

 

 

3.3 Batch Normalization enables higher learning rates

- ์ „ํ†ต์ ์ธ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์—์„œ, ๋„ˆ๋ฌด ๋†’์€ ํ•™์Šต์œจ์€ poor local minima์— ๊ฐ‡ํžˆ๋Š” ๊ฒƒ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค/ํญ๋ฐœ์„ ์•ผ๊ธฐํ–ˆ๋‹ค. 
- Batch Normalization์€ ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์„ ๋„์™€์ค€๋‹ค.
- ์‹ ๊ฒฝ๋ง์„ ๊ฑฐ์นœ activation์„ ์ •๊ทœํ™”ํ•จ์œผ๋กœ์จ parameter์— ๋Œ€ํ•œ ์ž‘์€ ๋ณ€๊ฒฝ์ด activation์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋” ํฌ๊ณ  ์ตœ์ ์˜ ๋ณ€ํ™”๋กœ ์ฆํญ๋˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๋Š”๋ฐ, training์ด ๋น„์„ ํ˜•์„ฑ์˜ ํฌํ™”์ƒํƒœ์— ๊ฐ‡ํžˆ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๋Š”๊ฒƒ์ด ๋ฐ”๋กœ ๊ทธ ์˜ˆ์‹œ์ด๋‹ค.

- BN also makes training more resilient to the parameter scale: normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to model explosion.
With Batch Normalization, however, backpropagation through a layer is unaffected by the scale of its parameters.

The scale does not affect the layer Jacobian nor, consequently, the gradient propagation.
Moreover, larger weights lead to smaller gradients, so Batch Normalization stabilizes the parameter growth.
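Concretely, the paper shows that for a scalar a the parameter scale cancels out:

$$
\begin{aligned}
\mathrm{BN}(Wu) &= \mathrm{BN}((aW)u) \\
\frac{\partial\, \mathrm{BN}((aW)u)}{\partial u} &= \frac{\partial\, \mathrm{BN}(Wu)}{\partial u} \\
\frac{\partial\, \mathrm{BN}((aW)u)}{\partial (aW)} &= \frac{1}{a} \cdot \frac{\partial\, \mathrm{BN}(Wu)}{\partial W}
\end{aligned}
$$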
We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013).
In reality the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent,
but we nevertheless expect Batch Normalization to make gradient propagation better behaved.

- The precise effect of Batch Normalization on gradient propagation remains an area of further study.

 

 

3.4 Batch Normalization regularizes the model

- When training with Batch Normalization, a training example is seen in conjunction with the other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example.
This effect is quite advantageous to the generalization of the network.

- Whereas Dropout (Srivastava et al., 2014) is typically used to reduce overfitting, we found that in a batch-normalized network it can be either removed or reduced in strength.


4. Experiments

4.1 Activations over time

- To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we used the MNIST dataset (taking the small 28x28 images as input, with a simple network of 3 FC layers of 100 activations each).
- Each layer computes y = g(Wu + b) with a sigmoid nonlinearity, and the weights W are initialized to small random Gaussian values.
- The last hidden layer is followed by an FC layer with 10 activations, with cross-entropy loss.
- Batch Normalization was added to each hidden layer of the network, as in Section 3.1.
- Our main interest is the comparison between the baseline and BN networks, rather than the absolute performance of the model.
Figure (a) shows that the BN network reaches higher accuracy; to investigate why, we studied the inputs to the sigmoids in the networks over the course of training.
In figures (b, c), the original network (b) shows distributions whose means and variances change considerably over time, whereas the BN network (c) shows much more stable distributions, which we can see aids training.

 

 

4.2 ImageNet Classification

- We applied Batch Normalization to the 2014 Inception network; further details are given in the Appendix.
- We used the momentum coefficient described in (Sutskever et al., 2013) and a mini-batch size of 32.
- We evaluated modified versions of the original Inception with Batch Normalization added.
- In particular, normalization was applied to the input of each conv layer's non-linearity (ReLU).
Expressed in code, this ordering looks as follows.
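A toy, self-contained sketch of that ordering (our illustration: the conv output is stubbed with random data, and the bias is omitted since β subsumes it; BN helper as in Section 3.2):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(32, 16, 28, 28)                   # stand-in for a conv output Wu
gamma = np.ones((1, 16, 1, 1))                        # learned scale, one per channel
beta = np.zeros((1, 16, 1, 1))                        # learned shift, one per channel
z = np.maximum(batch_norm_conv(x, gamma, beta), 0.0)  # ReLU applied AFTER BN
```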

 

• 4.2.1 Accelerating BN Networks

Simply adding Batch Normalization to a network does not take full advantage of the method.
To do so, the network and its training parameters were further modified as follows:
Increase learning rate.
  - to speed up training
Remove Dropout.
  - Batch Normalization fulfills the same goal as Dropout
Reduce the L2 weight regularization.
  - the L2 regularization weight was reduced, which improved accuracy
Accelerate the learning rate decay.
  - the learning rate is decayed exponentially faster (to train faster)
Remove Local Response Normalization.
  - with BN, LRN is no longer necessary
Shuffle training examples more thoroughly.
  - thorough shuffling prevents the same examples from always appearing together in a mini-batch
Reduce the photometric distortions.
  - distorting images less lets training focus on more "real" images (since BN networks train faster and observe each example fewer times)
(Figure: distribution of single-crop validation accuracy over training, comparing Inception against its batch-normalized variants.)

 

• 4.2.2 Single-Network Classification

Inception.
  - trained with an initial learning rate of 0.0015
BN-Baseline.
  - same as Inception, with Batch Normalization applied before each non-linearity
BN-x5.
  - the model above with the modifications of Section 4.2.1 applied
  - the initial learning rate was increased 5x, to 0.0075 (the same increase applied to the original Inception would have caused its parameters to reach infinity)
BN-x30.
  - like BN-x5, but with an initial learning rate of 0.045 (30x that of Inception)
BN-x5-Sigmoid.
  - like BN-x5, but using a sigmoid non-linearity instead of ReLU

์œ„์˜ ๊ทธ๋ฆผ์€ 4.2.1์˜ ๊ทธ๋ž˜ํ”„์˜ Max validation accuracy ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ 72.2%์ด ์ •ํ™•๋„์— ๋„๋‹ฌํ•˜๊ธฐ๊นŒ์ง€ ๊ฑธ๋ฆฐ Step์— ๋Œ€ํ•œ ํ‘œ์ด๋‹ค.

Interestingly, BN-x30 was somewhat slower initially, but reached a higher final accuracy.
This also verifies that, with the reduction of internal covariate shift, networks using sigmoid non-linearities can be trained with Batch Normalization (despite the conventional wisdom that such networks are difficult to train).

 

• 4.2.3 Ensemble Classification

We used six networks based on BN-x30, each modified with some of the following; the results are shown in the table of Section 5 (Conclusion):
- increased initial weights in the conv layers
- using Dropout (at only 5~10%, versus the 40% used by the original Inception)
- using non-convolutional, per-activation Batch Normalization with the model's last hidden layer


5. Conclusion

- Covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers,
and removing it from the internal activations of the network may aid training; our method is based on this premise.
- Making normalization a part of the network itself ensures that it is appropriately handled by whatever optimization method is being used to train the network.
Normalization is applied to each mini-batch, and gradients are backpropagated through the normalization parameters using SGD.

- network์˜ ๊ฒฐ๊ณผ๋Š” ํฌํ™”๋น„์„ ํ˜•(saturating nonlinearity)๋กœ ํ›ˆ๋ จ๋  ์ˆ˜ ์žˆ๊ณ , training rate์˜ ์ฆ๊ฐ€์— ๋”์šฑ ๊ด€๋Œ€(tolerant)ํ•˜๋ฉฐ ์ •๊ทœํ™”(regularization)๋ฅผ ์œ„ํ•œ Dropout์ด ํ•„์š”ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.

- Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training.

- By further increasing the learning rate, removing Dropout, and applying the other modifications afforded by Batch Normalization,
we reach the previous state of the art with only a fraction of the training steps, and then outperform it in single-network image classification.
- Furthermore, by combining multiple models trained with Batch Normalization, we perform far better than the best known method on ImageNet.

 

[The goal of Batch Normalization]

- to achieve a stable distribution of activation values throughout training
- in our experiments, matching the first and second moments of the output distribution makes such a stable distribution more likely
∴ Batch Normalization is applied before the nonlinearity.

- BN์˜ ๋˜๋‹ค๋ฅธ ์ฃผ๋ชฉํ•  ์ ์€ Batch Normalization transform์ด ๋™์ผ์„ฑ(ํ‘œ์ค€ํ™” layer๋Š” ๊ฐœ๋…์ ์œผ๋กœ ํ•„์š”ํ•œ scale๊ณผ shift๋ฅผ ํก์ˆ˜ํ•˜๋Š” learned linear transform์ด ๋’ค๋”ฐ๋ฅด๊ธฐ์— ํ•„์š”ํ•˜์ง€ ์•Š์•˜์Œ)์„ ๋‚˜ํƒ€๋‚ด๋„๋กํ•˜๋Š” learned scale๊ณผ shift, conv.layer์˜ ์ฒ˜๋ฆฌ, mini-batch์— ์˜์กดํ•˜์ง€ ์•Š๋Š” ๊ฒฐ์ •๋ก ์  ์ถ”๋ก , ์‹ ๊ฒฝ๋ง์˜ ๊ฐ conv.layer์˜ BN์ด ํฌํ•จ๋œ๋‹ค.

- In other words, Batch Normalization can handle conv layers, supports deterministic inference that does not depend on the mini-batch, and additionally normalizes each conv layer of the network.

[Limitations of this work and future goals]
We have "not" explored the full range of possibilities that Batch Normalization potentially enables.
- One direction is applying it to recurrent networks, where internal covariate shift and gradient vanishing/exploding may be especially severe.
- That setting would also allow a more thorough test of the hypothesis that normalization improves gradient propagation.
- We also plan to investigate whether Batch Normalization can help with domain adaptation in its traditional sense: whether the normalization performed by the network allows it to generalize to a new data distribution more easily, with just a recomputation of the population means and variances.


😶 Appendix

Variant of the Inception Model Used

์œ„์˜ Figure 5๋Š” GoogleNet๊ตฌ์กฐ์™€ ๊ด€๋ จํ•ด ๋น„๊ตํ•œ ๊ฒƒ์œผ๋กœ ํ‘œ์˜ ํ•ด์„์€ GoogleNet(Szegedy et al., 2014)์„ ์ฐธ์กฐ.
GoogLeNet๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ, ์ฃผ๋ชฉํ• ๋งŒํ•œ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
• 5x5 conv.layer๋Š” 2๊ฐœ์˜ ์—ฐ์†๋œ(consecutive) 3x3 conv.layer๋กœ ๋Œ€์ฒด๋œ๋‹ค.
  - ์ด๋Š” network์˜ ์ตœ๋Œ€๊นŠ์ด๋ฅผ 9๊ฐœ์˜ weight layer๋กœ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค.
  - ๋˜ํ•œ, parameter์ˆ˜๋ฅผ 25% ์ฆ๊ฐ€์‹œํ‚ค๊ณ  ๊ณ„์‚ฐ๋น„์šฉ(computational cost)๋ฅผ 30%์ •๋„ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค.

• The number of 28x28 inception modules is increased from 2 to 3.
• Inside the modules, both max-pooling and average-pooling are used.
• There are no pooling layers between any two inception modules, but a stride-2 conv/pooling layer is employed before the filter concatenation in modules 3c and 4e.

์šฐ๋ฆฌ์˜ model์€ ์ฒซ๋ฒˆ์งธ conv.layer์—์„œ 8๋ฐฐ์˜ ๊นŠ์ด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ถ„๋ฆฌ๊ฐ€๋Šฅํ•œ convolution์„ ์‚ฌ์šฉํ–ˆ๋‹ค.
์ด๋Š” training์‹œ์˜ Memory์†Œ๋น„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ณ  computational cost๋ฅผ ๊ฐ์†Œ์‹œํ‚จ๋‹ค.


๐Ÿง ๋…ผ๋ฌธ ๊ฐ์ƒ_์ค‘์š”๊ฐœ๋… ํ•ต์‹ฌ ์š”์•ฝ

"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"
This research paper, published in 2015 by Sergey Ioffe and Christian Szegedy, proposes a technique called Batch Normalization that can greatly improve the training of deep neural networks.

 

[Key concepts]

1. Internal Covariate Shift
์ด ๋…ผ๋ฌธ์€ ๋‚ด๋ถ€ ๊ณต๋ณ€๋Ÿ‰ ์ด๋™(Internal Covariate Shift) ๋ฌธ์ œ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•˜๋Š”๋ฐ, ์ด ๋ฌธ์ œ๋Š” training์‹œ ๋ ˆ์ด์–ด์— ๋Œ€ํ•œ ์ž…๋ ฅ ๊ฐ’์ด๋‚˜ parameter ๋ถ„ํฌ์˜ ๋ณ€ํ™”๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
์ด๊ฒƒ์€ ํ›„์† ๊ณ„์ธตํ•™์Šต์˜ ํ•™์Šต์†๋„๋ฅผ ๋Šฆ์ถ”๋Š” ๊ฒƒ์œผ๋กœ ํ•™์Šต๊ณผ์ •์„ ์–ด๋ ต๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

2. Batch Normalization
์ด ๋…ผ๋ฌธ์€ ๋‚ด๋ถ€ ๊ณต๋ณ€๋Ÿ‰ ์ด๋™ ๋ฌธ์ œ์— ๋Œ€ํ•œ ํ•ด๊ฒฐ์ฑ…์œผ๋กœ Batch-Normalization์„ ์ œ์•ˆํ•œ๋‹ค.
๋ฐฐ์น˜ ์ •๊ทœํ™”๋Š” ๋ฐฐ์น˜ ํ‰๊ท (mean)์„ ๋นผ๊ณ  ๋ฐฐ์น˜ ํ‘œ์ค€ํŽธ์ฐจ(standard deviation)๋กœ ๋‚˜๋ˆ„์–ด ๊ณ„์ธต์— ๋Œ€ํ•œ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค

3. Learnable Scale and Shift
์ž…๋ ฅ์„ ์ •๊ทœํ™”(normalization)ํ•˜๋Š” ๊ฒƒ ์™ธ์—๋„ ๋ฐฐ์น˜ ์ •๊ทœํ™”๋Š” ๋‘ ๊ฐ€์ง€ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜์ธ scale parameter์™€ shift parameter๋ฅผ ๋„์ž…ํ•œ๋‹ค.
์ด๋Ÿฌํ•œ parameters๋ฅผ ํ†ตํ•ด ์‹ ๊ฒฝ๋ง์€ ์ •๊ทœํ™”๋œ ์ž…๋ ฅ์— ๋Œ€ํ•œ optimal scale ๋ฐ shift๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค.

4. Improved Training
์ด ๋…ผ๋ฌธ์€ ๋ฐฐ์น˜ ์ •๊ทœํ™”๊ฐ€ ๋‚ด๋ถ€ ๊ณต๋ณ€๋Ÿ‰ ์ด๋™์˜ ์˜ํ–ฅ์„ ์ค„์ž„์œผ๋กœ์จ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ๊ต์œก์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.
์ด๊ฒƒ์€ ๋” ๋น ๋ฅธ ์ˆ˜๋ ด, ๋” ๋‚˜์€ ์ผ๋ฐ˜ํ™” ๋ฐ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ํ–ฅ์ƒ๋œ ์„ฑ๋Šฅ์œผ๋กœ ์ด์–ด์ง„๋‹ค.

5. Compatibility
์ด ๋…ผ๋ฌธ์€ ๋ฐฐ์น˜ ์ •๊ทœํ™”๊ฐ€ ๋‹ค์–‘ํ•œ ์‹ ๊ฒฝ๋ง ์•„ํ‚คํ…์ฒ˜ ๋ฐ ํ™œ์„ฑํ™” ๊ธฐ๋Šฅ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.

์ „๋ฐ˜์ ์œผ๋กœ ์ด ๋…ผ๋ฌธ์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•œ ์ค‘์š”ํ•œ ๊ธฐ์ˆ ์„ ์†Œ๊ฐœํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์ž‘์—…์— ๋Œ€ํ•œ ํšจ๊ณผ์— ๋Œ€ํ•œ ์ฆ๊ฑฐ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.
