😶 Abstract

- ResNet์€ ๋Š” ์„ค๋“๋ ฅ ์žˆ๋Š” ์ •ํ™•๋„์™€ ๋ฉ‹์ง„ ์ˆ˜๋ ด๋™์ž‘์„ ๋ณด์—ฌ์ฃผ๋Š” ๋งค์šฐ ์‹ฌ์ธต์ ์ธ ์•„ํ‚คํ…์ฒ˜๊ตฐ์œผ๋กœ ๋ถ€์ƒํ–ˆ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” "Identity Mapping"์„ "Skip Connection" ๋ฐ "After-addition Activation"๋กœ ์‚ฌ์šฉํ•  ๋•Œ
์ˆœ์ „ํŒŒ/ํ›„์ „ํŒŒ signal์ด ํ•œ ๋ธ”๋ก์—์„œ ๋‹ค๋ฅธ ๋ธ”๋ก์œผ๋กœ ์ง์ ‘ ์ „ํŒŒ๋  ์ˆ˜ ์žˆ์Œ์„ ์ œ์•ˆํ•˜๋Š” residual building block ์ดํ›„์˜ propagation๊ณต์‹์„ ๋ถ„์„ํ•œ๋‹ค.

์ผ๋ จ์˜ ์ œ๊ฑฐ(ablation)์‹คํ—˜์€ ์ด๋Ÿฐ identity mapping์˜ ์ค‘์š”์„ฑ์„ ๋’ท๋ฐ›์นจํ•œ๋‹ค.
์ด๋Š” ์ƒˆ๋กœ์šด residual unit์„ ์ œ์•ˆํ•˜๋„๋ก ๋™๊ธฐ๋ถ€์—ฌํ•˜์—ฌ ํ›ˆ๋ จ์„ ๋” ์‰ฝ๊ฒŒ ํ•˜๊ณ  ์ผ๋ฐ˜ํ™”๋ฅผ ๊ฐœ์„ ํ•œ๋‹ค.

We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and on CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at https://github.com/KaimingHe/resnet-1k-layers

1. Introduction


2. Analysis of Deep Residual Networks
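For reference, the propagation formulations derived in this section (cited as Eqn. (3)-(5) in Section 4.2 below) can be condensed as follows. With an identity skip connection h(x_l) = x_l and an identity after-addition activation f, a residual unit computes

    x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)    (Eqn. 3)

Applying this recursively, any deeper unit L sees the signal of any shallower unit l plus a sum of residual responses:

    x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)    (Eqn. 4)

Backpropagation therefore splits into a term that flows directly through the identity path and a term that flows through the weight layers:

    \frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \Big( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \Big)    (Eqn. 5)

The additive 1 means the gradient reaching any unit never vanishes outright, even when the weight-layer term is arbitrarily small.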


Discussions


3. On the Importance of Identity Skip Connections
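As a condensed restatement of the paper's argument: if the identity shortcut is replaced by a simple scaling h(x_l) = \lambda_l x_l, the recursion above picks up a multiplicative factor,

    x_L = \Big( \prod_{i=l}^{L-1} \lambda_i \Big) x_l + \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, \mathcal{W}_i)

so the directly propagated signal (and gradient) is scaled by \prod_i \lambda_i, which decays exponentially when \lambda_i < 1 and explodes when \lambda_i > 1. Every variant studied in Section 3.1 below (scaling, gating, 1×1 convolutions, dropout on the shortcut) makes h non-identity and, as the experiments show, hampers optimization.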



3.1  Experiments on Skip Connections  


 • Constant scaling




 • Exclusive gating

This follows the guidance of the Highway Networks paper (people.idsia.ch/~rupesh/very_deep_learning/).
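Schematically (condensed from the paper, where g is implemented as a 1×1 convolution followed by a sigmoid), exclusive gating scales the two paths by complementary gates:

    y_l = g(x_l) \odot \mathcal{F}(x_l, \mathcal{W}_l) + (1 - g(x_l)) \odot x_l, \quad g(x) = \sigma(W_g x + b_g)

The shortcut approaches identity only as g(x) goes to 0, but that simultaneously suppresses the residual function, so the two effects cannot be decoupled.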




 • Shortcut-only gating

 



 • 1×1 convolutional shortcut





 • Dropout Shortcut


3.2  Discussions


4. On the Usage of Activation Functions


4.1  Experiments on Activation

์ด Section์—์„œ๋Š” ResNet-110๊ณผ 164์ธต์˜ Bottleneck Architecture(ResNet-164๋ผ๊ณ ๋„ ํ•จ)์œผ๋กœ ์‹คํ—˜ํ•œ๋‹ค.

Bottleneck Residual Unit์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌ์„ฑ๋œ๋‹ค.
์ฐจ์›์ถ•์†Œ๋ฅผ ์œ„ํ•œ 1×1 ๋ฐ 3×3 layer
์ฐจ์›๋ณต์›์„ ์œ„ํ•œ 1×1 layer 
์ด๋Š” [ResNet๋…ผ๋ฌธ]์—์„œ์˜ ์„ค๊ณ„๋ฐฉ์‹์ฒ˜๋Ÿผ ๊ณ„์‚ฐ๋ณต์žก๋„๋Š” 2๊ฐœ์˜ 3×3 Residual Unit๊ณผ ์œ ์‚ฌํ•˜๋‹ค. (์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋ถ€๋ก์— ๊ธฐ์žฌ)
๋˜ํ•œ, ๊ธฐ์กด์˜ ResNet-164๋Š” CIFAR-10์—์„œ 5.93%์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค.(ํ‘œ2)


 • BN after addition




 • ReLU before addition



 • Post-activation or Pre-activation?

4.2  Analysis


 • Ease of optimization

- This effect is particularly pronounced when training ResNet-1001 (curves in Figure 1).
Training started with the original design of [the ResNet paper] reduces the training loss very slowly.
When f = ReLU, the signal is affected whenever it is negative, and this effect becomes prominent when there are many residual units.
That is, Eqn. (3) (and thus Eqn. (5)) is no longer a good approximation.
In contrast, when f is an identity mapping, the signal can be propagated directly between any two units.
This makes the training loss of even a 1001-layer network decrease very quickly (Figure 1).
It also achieves the lowest loss among all models we investigated, demonstrating the success of this optimization.


- ๋˜ํ•œ ResNet์ด ๋” ์ ์€ ์ธต์„ ๊ฐ€์งˆ ๋•Œ f = ReLU์˜ ์˜ํ–ฅ์ด ์‹ฌ๊ฐํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ๋‹ค(์˜ˆ: ๊ทธ๋ฆผ 6(์˜ค๋ฅธ์ชฝ)).
ํ›ˆ๋ จ ์ดˆ๋ฐ˜, training๊ณก์„ ์ด ์กฐ๊ธˆ ํž˜๋“ค์–ด ๋ณด์ด์ง€๋งŒ ๊ณง ์ข‹์€ ์ƒํƒœ๋กœ ๋œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๋‹จ์ ˆ(truncation)์€ 1000๊ฐœ์˜ ๋ ˆ์ด์–ด๊ฐ€ ์žˆ์„ ๋•Œ ๋” ๋นˆ๋ฒˆํžˆ ์ผ์–ด๋‚œ๋‹ค.





 • Reducing Overfitting

 
์ œ์•ˆ๋œ "pre-activation" unit์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด Regualarizatoin์— ๋ฏธ์น˜๋Š” ๋˜ ๋‹ค๋ฅธ ์˜ํ–ฅ์€ ๊ทธ๋ฆผ 6(์˜ค๋ฅธ์ชฝ)๊ณผ ๊ฐ™๋‹ค.

"pre-activation" ๋ฒ„์ „์€ ์ˆ˜๋ ด ์‹œ training Loss๊ฐ’์ด ์•ฝ๊ฐ„ ๋” ๋†’์ง€๋งŒ "test Error"๋Š” ๋” ๋‚ฎ๋‹ค.

์ด ํ˜„์ƒ์€ CIFAR-10๊ณผ 100 ๋ชจ๋‘์—์„œ ResNet-110, ResNet-110(1-layer) ๋ฐ ResNet-164์—์„œ ๊ด€์ฐฐ๋œ๋‹ค.
์ด๋•Œ, ์šฐ๋ฆฌ๋Š” ์ด๊ฒƒ์ด BN์˜ "regularization" ํšจ๊ณผ์— ์˜ํ•ด ๋ฐœ์ƒํ•œ ๊ฒƒ์œผ๋กœ ์ถ”์ •๋œ๋‹ค.

์›๋ž˜ Residual Unit(๊ทธ๋ฆผ 4(a)์—์„œ BN์ด ์‹ ํ˜ธ๋ฅผ ์ •๊ทœํ™”(normalize)ํ•˜์ง€๋งŒ, ์ด๋Š” ๊ณง shortcut์— ์ถ”๊ฐ€๋˜๋ฏ€๋กœ ๋ณ‘ํ•ฉ๋œ ์‹ ํ˜ธ๋Š” ์ •๊ทœํ™”(normalize)๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

์ด ์ •๊ทœํ™”๋˜์ง€ ์•Š์€ ์‹ ํ˜ธ๋Š” ๊ทธ๋‹ค์Œ weight-layer์˜ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค.

๋Œ€์กฐ์ ์œผ๋กœ, ์šฐ๋ฆฌ์˜ "pre-activation"๋ฒ„์ „์—์„œ, ๋ชจ๋“  weight-layer์˜ ์ž…๋ ฅ๊ฐ’์ด ์ •๊ทœํ™”๋˜์—ˆ๋‹ค.


5. Results

 • Comparisons on CIFAR-10/100
  - Table 4 compares state-of-the-art methods on CIFAR-10/100, where our models achieve competitive results.
   We note that we do not specially tailor the network width or filter sizes, nor use regularization techniques (such as dropout) that are very effective for these small datasets.
  We obtain these results simply by going deeper via a simple but essential concept, showing the potential of pushing the limits of depth.



 • Comparisons on ImageNet
  - ๋‹ค์Œ์œผ๋กœ 1000-class ImageNet dataset์— ๋Œ€ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ์ด๋‹ค.
ResNet-101์„ ์‚ฌ์šฉํ•ด ImageNet์˜ ๊ทธ๋ฆผ 2์™€ 3์—์„œ ์—ฐ๊ตฌํ•œ skip connection์„ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ๋น„ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ–ˆ์„ ๋•Œ, ์œ ์‚ฌํ•œ ์ตœ์ ํ™” ์–ด๋ ค์›€์„ ๊ด€์ฐฐํ–ˆ๋‹ค.
์ด๋Ÿฌํ•œ non-idntity shortcut network์˜ training์˜ค๋ฅ˜๋Š” ์ฒซ learning rate(๊ทธ๋ฆผ 3๊ณผ ์œ ์‚ฌ)์—์„œ ๊ธฐ์กด ResNet๋ณด๋‹ค ๋ถ„๋ช…ํžˆ ๋†’์œผ๋ฉฐ, ์ž์›์ด ์ œํ•œ๋˜์–ด ํ›ˆ๋ จ์„ ์ค‘๋‹จํ•˜๊ธฐ๋กœ ๊ฒฐ์ •ํ–ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‚˜ ์šฐ๋ฆฌ๋Š” ImageNet์—์„œ ResNet-101์˜ "BN after addition" ๋ฒ„์ „(๊ทธ๋ฆผ 4(b)์„ ๋งˆ์ณค๊ณ  ๋” ๋†’์€ training Loss์™€ validation error์˜ค๋ฅ˜๋ฅผ ๊ด€์ฐฐํ–ˆ๋‹ค.
์ด ๋ชจ๋ธ์˜ ๋‹จ์ผ ํฌ๋กญ(224×224)์˜ validation error๋Š” 24.6%/7.5%์ด๋ฉฐ ๊ธฐ์กดResNet-101์˜ 23.6%/7.1%์ด๋‹ค.
์ด๋Š” ๊ทธ๋ฆผ 6(์™ผ์ชฝ)์˜ CIFAR ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.


- Table 5 shows the results of ResNet-152 and ResNet-200, all trained from scratch.
We notice that because the original ResNet paper trained the models with scale jittering of shorter side s ∈ [256, 480], testing a 224×224 crop at s = 256 (as in [the ResNet paper]) is negatively biased.

Instead, we test a single 320×320 crop at s = 320 for all original and pre-activation ResNets.
Even though the ResNets are trained on smaller crops, they can easily be tested on larger crops because they are fully convolutional by design.
This size is also close to the 299×299 used by Inception v3, allowing a fairer comparison.



- The original ResNet-152 has a top-1 error of 21.3% on a 320×320 crop, and the pre-activation counterpart has 21.1%.
The gain is not big on ResNet-152 because this model did not show severe generalization difficulties.
However, the original ResNet-200 has an error rate of 21.8%, higher than the original ResNet-152.
Yet the original ResNet-200 has lower training error than ResNet-152, suggesting that it suffers from overfitting.


"pre-activation" ResNet-200์˜ ์˜ค๋ฅ˜์œจ์€ 20.7%๋กœ ๊ธฐ์กด ResNet-200๋ณด๋‹ค 1.1% ๋‚ฎ๊ณ  ResNet-152์˜ ๋‘ ๋ฒ„์ „๋ณด๋‹ค ๋‚ฎ๋‹ค. GoogLeNet๊ณผ InceptionV3์˜ scale ๋ฐ ์ข…ํšก(aspect)์˜ ๋น„์œจ์˜ ํ™•๋Œ€๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ, ResNet-200์€ Inception v3๋ณด๋‹ค ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค(ํ‘œ 5).
์šฐ๋ฆฌ์˜ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋  ๋•Œ์™€ ๋™์‹œ์—, Inception-ResNet-v2 ๋ชจ๋ธ์€ 19.9%/4.9%์˜ single crop ๊ฒฐ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.




 • Computational Cost
  - Our models' computational complexity is linear in depth (so a 1001-layer net is roughly 10× more complex than a 100-layer net).
On CIFAR, ResNet-1001 takes about 27 hours to train on 2 GPUs;
on ImageNet, ResNet-200 takes about 3 weeks to train on 8 GPUs (on par with the VGGNet paper).


6. Conclusions

์ด ๋…ผ๋ฌธ์€ ResNet์˜ connection๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋’ค์—์„œ ์ž‘๋™ํ•˜๋Š” ์ „ํŒŒ ๊ณต์‹์„ ์กฐ์‚ฌํ•œ๋‹ค.
์šฐ๋ฆฌ์˜ ๊ฒฐ๊ณผ๋ฌผ์€ identity shortcut connection ๋ฐ identity after-addition activation์ด ์ •๋ณด์˜ ์ „ํŒŒ๋ฅผ ์›ํ™œํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์ˆ˜์ ์ด๋ผ๋Š” ๊ฒƒ์„ ์‹œ์‹œํ•œ๋‹ค.
์ด๋Ÿฐ ๋ณ€์ธํ†ต์ œ์‹คํ—˜(Ablation Experimanet)์€ ์šฐ๋ฆฌ์˜ ๊ฒฐ๊ณผ๋ฌผ๊ณผ ์ผ์น˜ํ•˜๋Š” ํ˜„์ƒ์„ ๋ณด์—ฌ์ค€๋‹ค.
์šฐ๋ฆฌ๋Š” ๋˜ํ•œ ์‰ฝ๊ฒŒ ํ›ˆ๋ จ๋˜๊ณ  ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” 1000์ธต ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ์ œ์‹œํ•œ๋‹ค


 • Appendix: Implementation Details


๐Ÿง ๋…ผ๋ฌธ ๊ฐ์ƒ_์ค‘์š”๊ฐœ๋… ํ•ต์‹ฌ ์š”์•ฝ

"Identity Mappings in Deep Residual Networks"
Kaiming He, Xiangyu Zhang, Shaoqing Ren ๋ฐ Jian Sun์ด 2016๋…„์— ๋ฐœํ‘œํ•œ ์—ฐ๊ตฌ ๋…ผ๋ฌธ์œผ๋กœ ์ด ๋…ผ๋ฌธ์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์„ฑ๋Šฅ ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ์ƒˆ๋กœ์šด ์ž”์ฐจ ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค.


[ํ•ต์‹ฌ ๊ฐœ๋…]

1. Differences from the Original ResNet
 ① Shortcut connections
The key point of this paper is that it uses identity mappings for the shortcut connections between layers.
  - Original ResNet: may use a residual mapping that transforms the input to match the output dimension of the next layer
  - ResNet V2: uses an identity mapping that bypasses any transformation and propagates the input directly to the next layer


2. Pre-activation

This paper introduces the concept of pre-activation in ResNet V2, an extension of ResNet.
  - Batch normalization and ReLU are applied before each conv layer, rather than after it.
  - This improves training performance and reduces overfitting in very deep networks.

[์žฅ์  โ‘ _ Easy to Optimization]
  - ์ด ํšจ๊ณผ๋Š” ๊นŠ์€ ์‹ ๊ฒฝ๋ง(1001-layer ResNet)์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ๋ถ„๋ช…ํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚œ๋‹ค.
๊ธฐ์กด ResNet์€ Skip connetion์„ ๊ฑฐ์ณ์„œ ์ž…๋ ฅ๊ฐ’๊ณผ ์ถœ๋ ฅ๊ฐ’์ด ๋”ํ•ด์ง€๊ณ , ReLU ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์นœ๋‹ค.
๋”ํ•ด์ง„ ๊ฐ’์ด ์Œ์ˆ˜์ด๋ฉด ReLU ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์ณ์„œ 0์ด ๋˜๋Š”๋ฐ, ์ด๋Š” ๋งŒ์•ฝ, ์ธต์ด ๊นŠ๋‹ค๋ฉด ์ด ์ฆ์ƒ์˜ ์˜ํ–ฅ์ด ๋” ์ปค์ง€๊ฒŒ ๋˜์–ด ๋” ๋งŽ์€ ๊ฐ’์ด 0์ด ๋˜์–ด ์ดˆ๊ธฐ ํ•™์Šต์‹œ์— ๋ถˆ์•ˆ์ •์„ฑ์œผ๋กœ ์ธํ•œ ์ˆ˜๋ ด์ด ๋˜์ง€ ์•Š๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.
์‹ค์ œ๋กœ ์•„๋ž˜ ํ•™์Šต ๊ณก์„ ์„ ๋ณด๋ฉด ์ดˆ๊ธฐ์— Loss๊ฐ€ ์ˆ˜๋ ด๋˜์ง€ ์•Š๋Š” ๋ชจ์Šต์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
Bold: test, ์ ์„ : train

 ํ•˜์ง€๋งŒ pre-activation ๊ตฌ์กฐ๋Š” ๋”ํ•ด์ง„ ๊ฐ’์ด ReLU ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์น˜์ง€ ์•Š์•„, ์Œ์ˆ˜ ๊ฐ’๋„ ๊ทธ๋Œ€๋กœ ์ด์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.
์‹ค์ œ๋กœ ํ•™์Šต ๊ณก์„ ์„ ์‚ดํŽด๋ณด๋ฉด ์ œ์•ˆ๋œ ๊ตฌ์กฐ๊ฐ€ ์ดˆ๊ธฐ ํ•™์Šต์‹œ์— loss๋ฅผ ๋” ๋น ๋ฅด๊ฒŒ ๊ฐ์†Œ์‹œํ‚ด์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

[์žฅ์  โ‘ก_ Reduce Overfitting]
  - ์œ„ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์ˆ˜๋ ด์ง€์ ์—์„œ pre-activation ๊ตฌ์กฐ์˜ training loss๊ฐ€ original๋ณด๋‹ค ๋†’๋‹ค.
  - ๋ฐ˜๋ฉด, test error๊ฐ€ ๋‚ฎ๋‹ค๋Š” ๊ฒƒ์€ overfitting์„ ๋ฐฉ์ง€ํ•˜๋Š” ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  - ์ด ๋…ผ๋ฌธ์—์„œ ์ด ํšจ๊ณผ์— ๋Œ€ํ•ด Batch Normalization ํšจ๊ณผ ๋•Œ๋ฌธ์— ๋ฐœ์ƒํ•œ๋‹ค๊ณ  ์ถ”์ธกํ•˜๋Š”๋ฐ, Original Residual unit์€ BN์„ ๊ฑฐ์น˜๊ณ  ๊ฐ’์ด shortcut์— ๋”ํ•ด์ง€๋ฉฐ, ๋”ํ•ด์ง„ ๊ฐ’์€ ์ •๊ทœํ™”๋˜์ง€ ์•Š๋Š”๋‹ค.
์ด ์ •๊ทœํ™”๋˜์ง€ ์•Š์€ ๊ฐ’์ด ๋‹ค์Œ conv. layer์˜ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์ „๋‹ฌ๋œ๋‹ค.

  Pre-activation Residual unit์€ ๋”ํ•ด์ง„ ๊ฐ’์ด BN์„ ๊ฑฐ์ณ์„œ ์ •๊ทœํ™” ๋œ ๋’ค์— convolution layer์— ์ž…๋ ฅ๋˜์„œ overfitting์„ ๋ฐฉ์ง€ํ•œ๋‹ค๊ณ  ์ €์ž๋Š” ์ถ”์ธกํ•œ๋‹ค.

3. Recommendations
The paper also comes with practical recommendations for designing and training ResNet V2, as follows (a runnable sketch follows the list).
 ① Initialization
   - Initialize conv layer weights from a Gaussian distribution with standard deviation sqrt(2/n)
     (where n is the fan-in of the layer: kernel size² × number of input channels)

 ② Batch Normalization
   - Use batch normalization in the pre-activation arrangement.
   - Use moving averages of the batch statistics at test time to reduce the influence of mini-batch statistics.

 ③ Learning Rate Schedule
   - Warm up with a smaller learning rate for initial stability, then use a relatively large learning rate to accelerate convergence.
   - Use smaller learning rates in the decay phase for fine-tuning.

 ④ Weight Decay
   - Use weight_decay = 1e-4 (0.0001) to prevent overfitting.

 ⑤ Data Augmentation
   - random cropping
   - horizontal flipping
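A hedged sketch of how these recommendations could be expressed in Keras. The epoch boundaries and learning-rate values are my illustrative assumptions (the post does not pin them down), and L2 regularization is used as the common Keras stand-in for weight decay:

import tensorflow as tf

# (1) He initialization: zero-mean Gaussian with std = sqrt(2 / fan_in),
# (4) weight decay 1e-4 via an L2 kernel regularizer.
conv = tf.keras.layers.Conv2D(
    64, 3, padding='same',
    kernel_initializer='he_normal',
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))

# (3) Warm up with a small LR, run the main phase at a larger LR,
# then decay for fine-tuning. All values here are assumptions.
def lr_schedule(epoch):
    if epoch < 1:
        return 0.01   # warm-up
    elif epoch < 80:
        return 0.1    # main phase
    elif epoch < 120:
        return 0.01   # first decay
    return 0.001      # fine-tuning

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

# (5) CIFAR-style augmentation: pad, random crop, horizontal flip.
def augment(image, label):
    image = tf.image.resize_with_crop_or_pad(image, 40, 40)  # pad 32 -> 40
    image = tf.image.random_crop(image, size=[32, 32, 3])    # random 32x32 crop
    image = tf.image.random_flip_left_right(image)           # horizontal flip
    return image, label

# (2) Pre-activation batch normalization is part of the model itself (see the
# architecture code at the end of this post); Keras BN layers already keep
# moving averages of the batch statistics for use at test time.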


๐Ÿง  ๋…ผ๋ฌธ์„ ์ฝ๊ณ  Architecture ์ƒ์„ฑ (with tensorflow)

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, ReLU, Add, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

def bn_relu_conv(x, filters, kernel_size, strides=1):
    # Pre-activation ordering (Figure 4(e)): BN -> ReLU -> conv,
    # so every weight layer receives a normalized input.
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               strides=strides,
               padding='same',
               kernel_initializer='he_normal')(x)
    return x

def residual_block(x, filters, strides=1):
    # Pre-activation bottleneck unit: 1x1 (reduce) -> 3x3 -> 1x1 (restore).
    out_filters = filters * 4
    shortcut = x
    x = bn_relu_conv(x, filters, 1, strides)
    x = bn_relu_conv(x, filters, 3)
    x = bn_relu_conv(x, out_filters, 1)

    # Projection shortcut only where the shape changes; identity elsewhere.
    if strides != 1 or shortcut.shape[-1] != out_filters:
        shortcut = Conv2D(filters=out_filters,
                          kernel_size=1,
                          strides=strides,
                          padding='same',
                          kernel_initializer='he_normal')(shortcut)

    # No BN or ReLU after the addition: the identity path stays clean.
    return Add()([x, shortcut])

def resnetv2(input_shape, num_classes, num_layers):
    # CIFAR-style depth 9n+2: n bottleneck units in each of the 3 stages
    # (e.g. 164 -> 18 units per stage, 1001 -> 111 units per stage).
    assert (num_layers - 2) % 9 == 0, 'num_layers should be 9n+2 (e.g. 164, 1001)'
    num_blocks = (num_layers - 2) // 9
    filters = 16

    inputs = Input(shape=input_shape)
    x = Conv2D(filters=filters, kernel_size=3, padding='same',
               kernel_initializer='he_normal')(inputs)

    for stage in range(3):
        for block in range(num_blocks):
            # Downsample at the first unit of stages 2 and 3 only.
            strides = 2 if (stage > 0 and block == 0) else 1
            x = residual_block(x, filters, strides)
        filters *= 2

    # The last unit ends with an addition, so apply a final BN + ReLU
    # before pooling.
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = GlobalAveragePooling2D()(x)
    x = Dense(num_classes, activation='softmax')(x)

    return Model(inputs=inputs, outputs=x)


# ResNet-164 on CIFAR-10-sized inputs: (164 - 2) / 9 = 18 units per stage.
model = resnetv2(input_shape=(32, 32, 3), num_classes=10, num_layers=164)
model.summary()
