Abstract

- We propose a new deep neural network architecture, codenamed "Inception", which achieves the state of the art in classification and detection on ILSVRC14.
- The main hallmark of this architecture is the improved utilization of the computing resources inside the network.
- It is notable that the depth and width of the network are increased while the computational budget is kept constant.
- To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing.

cf. [Hebbian Principle]

- The Hebbian principle is a concept from neuroscience describing how the strength of connections between neurons can change over time. It is based on the idea that when two neurons fire together, the connection between them is strengthened, which can lead to the formation of the neural pathways that underlie learning and memory.

- The paper also makes use of "Local Response Normalization" (LRN), a normalization technique loosely inspired by this kind of lateral-inhibition idea, which helps improve CNN performance.

 

 

 

 

1. Introduction

Over the last three years, dramatic progress in deep neural networks has transformed image classification and object detection.
The encouraging part is that this progress is not merely the result of better hardware, larger models, or bigger datasets; it is mainly the consequence of new ideas, algorithms, and improved network architectures.
Our GoogLeNet uses 12x fewer parameters than the AlexNet of just two years ago, while being significantly more accurate.

Object Detection์˜ ๊ฐ€์žฅ ํฐ ์ด์ ์€ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๋‹จ๋…์ด๋‚˜ ๋” ํฐ ๋ชจ๋ธ์˜ ํ™œ์šฉ์ด ์•„๋‹Œ R-CNN algorithm(Girshick et al)๊ณผ ๊ฐ™์€ ์‹ฌ์ธต ๊ตฌ์กฐ์™€ ๊ณ ์ „์  ์ปดํ“จํ„ฐ๋น„์ „์˜ ์‹œ๋„ˆ์ง€์—์„œ ์–ป์—ˆ๋‹ค.


Another notable factor is that, with the ongoing traction of mobile and embedded computing, the efficiency of algorithms (especially their power and memory use) gains importance.
It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included these factors rather than a sheer fixation on accuracy numbers.
For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so they do not end up as a purely academic curiosity but can be put to real-world use, even on large datasets, at a reasonable cost.


๋ณธ๋…ผ๋ฌธ์—์„œ๋Š” ์ปดํ“จํ„ฐ๋น„์ „์„ ์œ„ํ•œ ํšจ์œจ์  ์‹ฌ์ธต์‹ ๊ฒฝ๋ง๊ตฌ์กฐ์ธ "Inception"์— ์ดˆ์ ์„ ๋งž์ถ˜๋‹ค. 
("we need to go deeper"๋ผ๋Š” ์ธํ„ฐ๋„ท ๋ฐˆ์—์„œ ํŒŒ์ƒ)
์ด๋•Œ, "deep"์ด๋ผ๋Š” ๋‹จ์–ด๋ฅผ 2๊ฐ€์ง€ ์˜๋ฏธ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค.
  - "Inception module"์˜ ํ˜•ํƒœ๋กœ ์ƒˆ๋กœ์šด ์ˆ˜์ค€์˜ ์กฐ์ง์„ ๋„์ž…ํ•œ๋‹ค๋Š” ์˜๋ฏธ
  - ์‹ ๊ฒฝ๋ง์˜ ๊นŠ์ด๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค๋Š” ์ง์ ‘์ ์ธ ์˜๋ฏธ

 

 

 

2. Related Work

LeNet-5์—์„œ ์‹œ์ž‘ํ•ด์„œ CNN์€ ์ „ํ˜•์ ์ธ ํ‘œ์ค€๊ตฌ์กฐ๋ฅผ ๊ฐ–๋Š”๋ฐ, stacked conv.layer(๋ณดํ†ต ์„ ํƒ์ ์œผ๋กœ ์ •๊ทœํ™” ๋ฐ maxpooling์ด ๋”ฐ๋ฅธ๋‹ค.)๋Š” ํ•˜๋‚˜ ์ด์ƒ์˜ FC.layer๊ฐ€ ๋’ค๋”ฐ๋ฅธ๋‹ค. 
์ด ๊ธฐ๋ณธ์ ์ธ ์„ค๊ณ„๋ฅผ ๋ณ€ํ˜•ํ•˜๋Š” ๊ฒƒ์€ image classification ๋ฌธํ—Œ์— ๋„๋ฆฌ ํผ์ ธ์žˆ๋‹ค.
ImageNet๊ฐ™์€ ๋Œ€๊ทœ๋ชจ dataset์€ layerํฌ๊ธฐ๋ฅผ ๋Š˜๋ฆฌ๋Š” ๋™์‹œ์— overfitting์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Dropout๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค.

Max-Pooling์ด ์ •ํ™•ํ•œ ๊ณต๊ฐ„์ •๋ณด์˜ ์†์‹คํ•œ๋‹ค๋Š” ์šฐ๋ ค์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  conv.layer์˜ localization, object detection ๋“ฑ ์—ฌ๋Ÿฌ ์‚ฌ๋ก€์— ํ™œ์šฉ๋˜๋Š”๋ฐ, ์˜์žฅ๋ฅ˜์˜ ์‹œ๊ฐํ”ผ์งˆ์˜ ์‹ ๊ฒฝ๊ณผํ•™๋ชจ๋ธ์—์„œ ์˜๊ฐ์„ ๋ฐ›์•˜๋‹ค. (Hebbian principle)
์ด๋•Œ, Inception layer๋Š” ๋งŽ์ด ๋ฐ˜๋ณต๋˜๋ฉฐ Inception model์˜ ๋ชจ๋“  filter๋Š” ํ•™์Šต๋˜๋Š”๋ฐ, GoogLeNet ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ, 22๊ฐœ์˜ layer์˜ ์‹ฌ์ธต๋ชจ๋ธ๋กœ ์ด์–ด์ง„๋‹ค.

Network-in-Network is an approach proposed to increase the representational power of neural networks; when applied to convolutional layers, it can be viewed as additional 1x1 convolutional layers followed by the usual rectified linear activation.
This makes it easy to integrate into current CNN pipelines, and we use this approach heavily in our architecture.

๋‹ค๋งŒ ์šฐ๋ฆฌ์˜ ๊ตฌ์กฐ์„ค์ •์—์„œ๋Š” 1x1 convolution์€ ์ด์ค‘์ ์ธ ๋ชฉ์ ์„ ๊ฐ–๋Š”๋‹ค.
- ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ชฉ์ : ์‹ ๊ฒฝ๋ง์˜ ํฌ๊ธฐ๋ฅผ ์ œํ•œํ•˜๋Š” ๊ณ„์‚ฐ๋ณ‘๋ชฉํ˜„์ƒ(computational bottleneck)์˜ ์ œ๊ฑฐ๋ฅผ ์œ„ํ•œ ์ฐจ์›์ถ•์†Œ๋ชจ๋“ˆ(dimension reduction module).  →  ์„ฑ๋Šฅ์˜ ์ €ํ•˜ ์—†์ด depth, width๋ฅผ ๋Š˜๋ฆด ์ˆ˜ ์žˆ์Œ
(โˆต 1x1 convolution์€ ๋” ํฐ convolution์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ์ ์€ ์ˆ˜์˜ parameter๋ฅผ ํฌํ•จํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ์— ํšจ์œจ์ .)
- ๋ถ€๊ฐ€์ ์ธ ๋ชฉ์ : ์‹ ๊ฒฝ๋ง์˜ ํ‘œํ˜„๋ ฅ์„ ๋†’์ž„


Object Detection์„ ์œ„ํ•œ ๊ฐ€์žฅ ์ตœ์‹  ์ ‘๊ทผ๋ฒ•์€ Girshick ๋“ฑ(et al)์ด ์ œ์•ˆํ•œ R-CNN์ด๋‹ค. 
R-CNN์€ ์ „๋ฐ˜์ ์ธ object detection์„ 2๊ฐ€์ง€์˜ ํ•˜์œ„๋ฌธ์ œ๋กœ ๋‚˜๋ˆˆ๋‹ค.
- category์— ๊ตฌ์• ๋ฐ›์ง€์•Š๋Š” ๋ฐฉ์‹์œผ๋กœ ์ž ์žฌ์  ๋ฌผ์ฒด์ œ์•ˆ์„ ์œ„ํ•œ ์ƒ‰์ƒ ๋ฐ super-pixe์˜ ์ผ๊ด€์„ฑ ๊ฐ™์€ ์ €์ˆ˜์ค€์˜ ๋‹จ์„œ๋ฅผ ๋จผ์ € ํ™œ์šฉํ•œ๋‹ค.
- ๊ทธ ํ›„ CNN classification์„ ํ™œ์šฉํ•ด ๋ฌผ์˜ ํ•ด๋‹น ์œ„์น˜ ๋ฐ category๋ฅผ ์‹๋ณ„ํ•œ๋‹ค.

This two-stage approach leverages the accuracy of bounding-box segmentation based on low-level cues, as well as the highly powerful classification ability of state-of-the-art CNNs.
We adopted a similar pipeline, but explored enhancements in both stages, such as better bounding-box proposals and ensemble approaches for classifying them.

 

 

3. Motivation and High Level Considerations

- ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ ํฌ๊ธฐ๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด๋‹ค.
(์ด๋•Œ, ์‹ ๊ฒฝ๋ง์˜ ๊นŠ์ด(depth), ํญ(depth์˜ unit์ˆ˜)์˜ ์ฆ๊ฐ€ ๋˜ํ•œ ๋ชจ๋‘ ํฌํ•จ๋œ๋‹ค.)
๋‹ค๋งŒ ์ด ํ•ด๊ฒฐ๋ฒ•์€ 2๊ฐ€์ง€์˜ ์ฃผ์š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.

โ‘  ๋” ํฐ ํฌ๊ธฐ๋Š” ๋ณดํ†ต parameter๊ฐ€ ๋งŽ์Œ์„ ์˜๋ฏธํ•˜๋ฉฐ, ์ด๋Š” training set์ด ์ œํ•œ๋œ ๊ฒฝ์šฐ, overfitting ๋ฐœ์ƒ๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ธ๋‹ค.
ํŠนํžˆ, ์•„๋ž˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๋” ์ž์„ธํ•œ ๊ตฌ๋ถ„์„ ์œ„ํ•ด ์ธ๊ฐ„์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ, ๊ณ ํ’ˆ์งˆ์˜ training set์„ ๋งŒ๋“ค๊ธฐ ํž˜๋“ค๊ธฐ์— ์ฃผ์š”๋ณ‘๋ชฉํ˜„์ƒ(major bottleneck)์ด ๋  ์ˆ˜ ์žˆ๋‹ค.


② The other drawback of uniformly increasing the network size is a dramatically increased use of computational resources.
For example, in a deep vision network where two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation (a back-of-the-envelope sketch follows below).
If the added capacity is used inefficiently (for example, if most weights end up close to zero), a lot of that computation is wasted.


- A fundamental way of solving both problems would ultimately be to move from fully connected to sparsely connected architectures, even inside the convolutions.
Arora et al. state that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.
Although the strict mathematical proof requires very strong conditions, the statement resonates with the well-known Hebbian principle.
This suggests that the underlying idea is applicable in practice even under less strict conditions.

- ๋ถ€์ •์ ์ธ ๋ฉด์—์„œ๋Š” ์˜ค๋Š˜๋‚ ์˜ computing infra.๋Š” ๋ถˆ๊ท ์ผํ•œ ํฌ์†Œ์ž๋ฃŒ๊ตฌ์กฐ์— ๋Œ€ํ•œ ์ˆ˜์น˜์ ๊ณ„์‚ฐ๊ณผ ๊ด€๋ จํ•ด ๋งค์šฐ ๋น„ํšจ์œจ์ ์ด๋ผ๋Š” ์‚ฌ์‹ค์ด๋‹ค.
์‚ฐ์ˆ ์—ฐ์‚ฐ์ˆ˜๊ฐ€ 100๋ฐฐ(100x) ์ค„์–ด๋“ ๋‹คํ•ด๋„ lookups๊ณผ cache ์†์‹ค์˜ overhead๊ฐ€ ๋„ˆ๋ฌด๋‚˜๋„ ์ง€๋ฐฐ์ ์ด๊ธฐ์— ํฌ์†Œํ–‰๋ ฌ(sparse matrices)๋กœ์˜ ์ „ํ™˜์€ ํšจ๊ณผ๊ฐ€ ์—†์„ ๊ฒƒ์ด๋‹ค.
๋˜ํ•œ, ๊ท ์ผํ•˜์ง€ ์•Š์€ ํฌ์†Œ๋ชจ๋ธ(sparse model)์€ ๋” ์ •๊ตํ•œ ๊ณตํ•™ ๋ฐ ๊ณ„์‚ฐ์ธํ”„๋ผ๊ฐ€ ํ•„์š”ํ•˜๋‹ค.
ํ˜„์žฌ ๋Œ€๋ถ€๋ถ„์˜ vision์ง€ํ–ฅ ๊ธฐ๊ณ„ํ•™์Šต๋ฐฉ์‹์€ convolution์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์žฅ์ ๋งŒ์œผ๋กœ ๊ณต๊ฐ„์˜์—ญ์˜ ํฌ์†Œ์„ฑ(sparsity in the spatial domain)์„ ํ™œ์šฉํ•œ๋‹ค.
ํ•˜์ง€๋งŒ, convolution์€ ์ด์ „์ธต์˜ patch์˜ ๋ฐ€์ง‘๋œ ์—ฐ๊ฒฐ๋ชจ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋˜๋Š”๋ฐ, ConvNet์€ ์ „ํ†ต์ ์œผ๋กœ ํŠน์ง•๋งต์—์„œ ๋ฌด์ž‘์œ„ ๋ฐ ํฌ์†Œ์—ฐ๊ฒฐ(random and sparse connection tables)์„ ์‚ฌ์šฉํ•ด ๋น„๋Œ€์นญ์ ์œผ๋กœ ํ•™์Šต์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค.
์ด๋Ÿฐ ์ถ”์„ธ๋Š” ๋ณ‘๋ ฌ์ปดํ“จํŒ…์„ ๋”์šฑ ์ž˜ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ „์ฒด์—ฐ๊ฒฐ๋กœ ๋‹ค์‹œ ๋ณ€๊ฒฝ๋˜์—ˆ๋Š”๋ฐ, ์ด๋ฅผ ํ†ตํ•ด ๊ตฌ์กฐ์˜ ๊ท ์ผ์„ฑ ๋ฐ ๋งŽ์€ filter์™€ ๋” ํฐ batch size๋Š” ํšจ์œจ์ ์ธ dense computation์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์˜€๋‹ค.

- ์ด๋Š” ๋‹ค์Œ์˜ ์ค‘๊ฐ„๋‹จ๊ณ„์— ๋Œ€ํ•œ ํฌ๋ง์ด ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ์˜๋ฌธ์„ ์ œ๊ธฐํ•œ๋‹ค.
์ด๋ก ์—์„œ ์ œ์•ˆ๋œ ๊ฒƒ์ฒ˜๋Ÿผ filter์—์„œ์กฐ์ฐจ๋„ ์—ฌ๋ถ„์˜ ํฌ์†Œ์„ฑ์„ ํ™œ์šฉํ•˜์ง€๋งŒ ๊ณ ๋ฐ€๋„์˜ ํ–‰๋ ฌ๊ณ„์‚ฐ์„ ํ™œ์šฉํ•ด ํ˜„์žฌ ํ•˜๋“œ์›จ์–ด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ตฌ์กฐ์ด๋‹ค.
ํฌ์†Œํ–‰๋ ฌ๊ณ„์‚ฐ์— ๋Œ€ํ•œ ๋ฐฉ๋Œ€ํ•œ์ž๋ฃŒ๋Š” ํฌ์†Œํ–‰๋ ฌ์„ ๋น„๊ต์  ๊ณ ๋ฐ€๋„์˜ submatrices๋กœ clusteringํ•˜๋Š” ๊ฒƒ์ด ํฌ์†Œํ–‰๋ ฌ๊ณฑ์— ๋Œ€ํ•œ ์ตœ์ฒจ๋‹จ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค๋Š”๊ฒƒ์„ ์‹œ์‚ฌํ•œ๋‹ค.

- The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network-topology construction algorithm that tries to approximate the sparse structure implied for vision networks while covering the hypothesized outcome with dense, readily available components.
Despite being a highly speculative undertaking, modest gains were observed after only two iterations on the exact choice of topology.
After further tuning of the learning rate, hyper-parameters, and improved training methodology, we established that the resulting Inception architecture was especially useful as a base network in the context of localization and object detection.
Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least close to locally optimal.

- ์ œ์•ˆ๋œ ๊ตฌ์กฐ๊ฐ€ ์ปดํ“จํ„ฐ๋น„์ „์˜ ์„ฑ๊ณต์  ์‚ฌ๋ก€๊ฐ€ ๋˜์—ˆ์ง€๋งŒ, ์ปดํ“จํ„ฐ๋น„์ „์˜ ๊ตฌ์ถ•์„ ์ด๋ˆ guiding ์›๋ฆฌ์— ๊ธฐ์—ฌํ•  ์ง€๋Š” ์˜๋ฌธ์ด๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด, ์•„๋ž˜์— ์„ค๋ช…๋œ ์›์น™์— ๊ธฐ์ดˆํ•œ ์ž๋™ํ™” ํˆด์ด vision์‹ ๊ฒฝ๋ง์— ๋Œ€ํ•ด ์œ ์‚ฌํ•˜์ง€๋งŒ ๋” ๋‚˜์€ topology๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๋ ค๋ฉด ํ›จ์”ฌ ๋” ์ฒ ์ €ํ•œ ๋ถ„์„๊ณผ ๊ฒ€์ฆ์ด ํ•„์š”ํ•˜๋‹ค.
๊ฐ€์žฅ ์„ค๋“๋ ฅ์žˆ๋Š” ์ฆ๊ฑฐ๋Š” ์ž๋™ํ™”๋œ ์‹œ์Šคํ…œ์ด ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜๋‚˜ ๋งค์šฐ ๋‹ค๋ฅด๊ฒŒ ๋ณด์ด๋Š” global ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ด ๋‹ค๋ฅธ ์˜์—ญ์—์„œ ์œ ์‚ฌํ•œ ์ด๋“์„ ์–ป๋Š” network topology๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒฝ์šฐ์ธ๋ฐ, ์ ์–ด๋„ Inception๊ตฌ์กฐ์˜ ์ดˆ๊ธฐ์„ฑ๊ณต์€ ์ด ๋ฐฉํ–ฅ์œผ๋กœ์˜ ํฅ๋ฏธ๋กœ์šด ์ž‘์—…์— ๋Œ€ํ•œ ํ™•๊ณ ํ•œ motivation์„ ์ œ๊ณตํ•œ๋‹ค.

 

 

4. Architectural Details

[Inception๊ตฌ์กฐ์˜ ์ฃผ์š” ์•„์ด๋””์–ด., main idea for inception architecture]
- ๊ธฐ๋ฐ˜: convolution์˜ optimal local sparse structure๊ฐ€ ์–ด๋–ป๊ฒŒ ๊ทผ์‚ฌํ™”๋˜๊ณ  ์‰ฝ๊ฒŒ์‚ฌ์šฉ๊ฐ€๋Šฅํ•œ ์กฐ๋ฐ€ํ•œ ๊ตฌ์„ฑ์š”์†Œ๋กœ ์ปค๋ฒ„๋˜๋Š”์ง€๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•œ ๊ฒƒ.
- translation ๋ถˆ๋ณ€์„ฑ(invariance)๋ฅผ ๊ฐ€์ •ํ•˜๋Š” ๊ฒƒ์€ ์šฐ๋ฆฌ๊ฐ€ ์‹ ๊ฒฝ๋ง์˜ conv.block์„ ๊ตฌ์ถ•ํ•œ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค.
- ๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๊ฐ€ ํ•„์š”ํ•œ ๊ฒƒ์€ ์ตœ์ ์˜ ๊ตญ์†Œ๊ตฌ์กฐ(optimal local construction)๋ฅผ ์ฐพ๊ณ  ๊ณต๊ฐ„์ ์œผ๋กœ ๋ฐ˜๋ณตํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

cf) sparse structure: ๊ณ ์ˆ˜์ค€์˜ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณ„์‚ฐ์ ์œผ๋กœ๋„ ํšจ์œจ์ ์ด๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ ์‹ ๊ฒฝ๋ง๊ตฌ์กฐ
์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ„ํ•ด ํšจ์œจ์ ์ธ ํ•ฉ์„ฑ๊ณฑ์—ฐ์‚ฐ๊ณผ ํฌ์†Œ์„ฑ์˜ ์กฐํ•ฉ์„ ์‚ฌ์šฉํ•˜๋„๋ก ์‹ ๊ฒฝ๋ง์ด ์„ค๊ณ„๋œ ๋ฐฉ์‹์„ ์„ค๋ช…ํ•œ๋‹ค.


- Arora et al. suggest a layer-by-layer construction.
(In this construction, one analyzes the correlation statistics of the last layer and clusters units with high correlation into groups.)
These clusters form the units of the next layer and are connected to the units of the previous layer.

- ์šฐ๋ฆฐ ์ด์ „์ธต์˜ ๊ฐ unit์ด input image์˜ ์ผ๋ถ€์˜์—ญ์ด๊ณ , ์ด๋Ÿฐ unit์€ filter bank๋กœ ๊ทธ๋ฃนํ™”๋œ๋‹ค ๊ฐ€์ •ํ•œ๋‹ค.
์ž…๋ ฅ์ธต๊ณผ ๊ฐ€๊นŒ์šด ํ•˜์œ„์ธต์ผ ์ˆ˜๋ก ์ƒํ˜ธ์—ฐ๊ด€๋œ unit์€ ๊ตญ์†Œ์˜์—ญ(local region)์— ์ง‘์ค‘ ๋  ๊ฒƒ์ด๋‹ค.
์ด๋Š” ์ฆ‰, ๋งŽ์€ cluster๋“ค์ด ๋‹จ์ผ์˜์—ญ์— ์ง‘์ค‘๋˜๊ณ  ๋‹ค์Œ์ธต์—์„œ 1x1 conv.layer๋กœ ์ปค๋ฒ„๋  ์ˆ˜ ์žˆ์Œ์„ ์˜๋ฏธํ•œ๋‹ค.

- ํ•˜์ง€๋งŒ, ๋” ํฐ patch์— ๋Œ€ํ•œ ํ•ฉ์„ฑ๊ณฑ์œผ๋กœ ์ปค๋ฒ„๋˜๋Š” ๋” ์ ์€ ์ˆ˜์˜ ๊ณต๊ฐ„์ ์œผ๋กœ ๋ถ„์‚ฐ๋œ ๊ตฐ์ง‘(spatially spread out clusters)์ด ์žˆ์„ ๊ฒƒ์ด๋ฉฐ, ๋”์šฑ ํฐ ์˜์—ญ์— ๋Œ€ํ•œ patch์ˆ˜ ๊ฐ์†Œ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค.

- In order to avoid this patch-alignment issue, the current incarnation of the Inception architecture restricts the filter sizes to 1x1, 3x3, and 5x5.
This decision was based more on convenience than on necessity.
It means that the Inception architecture is a combination of all those layers, with their output filter banks concatenated into a single output vector forming the input of the next stage.
Additionally, since pooling operations have been essential for the success of current state-of-the-art ConvNets, this suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect.


- As these "Inception modules" are stacked on top of each other, their output correlation statistics are bound to vary, as follows.
Features of higher abstraction are captured by higher layers,
so their spatial concentration is expected to decrease.
This suggests that the ratio of 3x3 and 5x5 convolutions should increase as we move to higher layers.


- The biggest problem of the "Inception module", at least in this naive form, is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
This problem becomes even more pronounced once pooling units are added: the number of their output filters equals the number of filters in the previous stage.
That is, the merging of the pooling-layer output with the convolutional-layer outputs leads to an inevitable increase in the number of outputs from stage to stage (a toy calculation follows below).
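
The blow-up can be illustrated with a small calculation (the branch sizes below are assumed, not taken from the paper): in the naive module the pooling branch passes all of its input channels through unchanged, so the concatenated output keeps growing from stage to stage.

def naive_inception_out_channels(c_in, n1x1, n3x3, n5x5):
    # Pool branch contributes all c_in channels; conv branches add their own filters.
    return n1x1 + n3x3 + n5x5 + c_in

channels = 256
for stage in range(3):
    channels = naive_inception_out_channels(channels, 128, 192, 96)
    print(stage, channels)   # 672, 1088, 1504 -- grows without bound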

- While this architecture might cover the optimal sparse structure,
it would do so very inefficiently, leading to a computational blow-up within a few stages.
This leads to the second idea of the Inception module.




[The second idea of the Inception module]

- The second idea is to judiciously apply dimension reductions and projections wherever the computational requirements would otherwise increase too much; this is based on the success of embeddings:
even low-dimensional embeddings may contain a lot of information about a relatively large image patch.

- However, embeddings represent information in a dense, compressed form, and compressed information is harder to model.
We would like to keep our representation sparse at most places and compress the signals only whenever they have to be aggregated en masse.
That is, 1x1 convolutions are used to compute reductions before the expensive 3x3 and 5x5 convolutions.
Besides being used as reductions, they also include the use of ReLU activation, which makes them dual-purpose.


[์ผ๋ฐ˜์ ์ธ Inception ์‹ ๊ฒฝ๋ง]
- gridํ•ด์ƒ๋„๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๊ธฐ ์œ„ํ•ด ์ด๋”ฐ๊ธˆ์”ฉ MaxPooling(with stride=2)์ด ์„œ๋กœ ์ ์ธต๋œ inception module๋กœ ๊ตฌ์„ฑ๋œ ์‹ ๊ฒฝ๋ง์ด๋‹ค.

training์‹œ ํšจ์œจ์ ์ธ memory์‚ฌ์šฉ์„ ์œ„ํ•œ ๊ธฐ์ˆ ์˜ ์ด์œ ๋กœ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๊ณ ์žํ•œ๋‹ค.
- higher layer: Inception module ์‚ฌ์šฉ.
- lower layer: ์ „ํ˜•์ ์ธ convolution๋ฐฉ์‹
์ด๋Š” ๋„ˆ๋ฌด ์—„๊ฒฉํ•˜๊ฒŒ ํ•„์ˆ˜๋‹ค!๋ผ๊ณ ๋Š” ํ•˜์ง€ ์•Š์œผ๋ฉฐ, ๋‹จ์ˆœํžˆ ํ˜„์žฌ๊ตฌํ˜„๋œ ์ผ๋ถ€ ์ธํ”„๋ผ์  ๋น„ํšจ์œจ์„ฑ์„ ๋ฐ˜์˜ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.



[Inception ๊ตฌ์กฐ์˜ ์ฃผ์š”์ด์ ]
โ‘  ์ œ์–ด๋˜์ง€์•Š๋Š” ๊ณ„์‚ฐ๋ณต์žก์„ฑ์˜ ํญ๋ฐœ์—†์ด, ๊ฐ ๋‹จ๊ณ„์—์„œ unit์ˆ˜๋ฅผ ํฌ๊ฒŒ ๋Š˜๋ฆด ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
  - ์ฐจ์›์ถ•์†Œ์˜ ubiquitous(์–ด๋””์—์„œ๋‚˜)์  ์‚ฌ์šฉ์€ ๋งˆ์ง€๋ง‰๋‹จ๊ณ„์—์„œ input filter๋ฅผ ๋‹ค์Œ์ธต์œผ๋กœ ์ฐจํ(shielding)ํ•  ์ˆ˜ ์žˆ๊ฒŒํ•œ๋‹ค.
  - ๋˜ํ•œ ํฐ patch๋กœ convolvingํ•˜๊ธฐ ์ „์— ์ฐจ์›์„ ์ค„์ธ๋‹ค.

โ‘ก ์ด ์„ค๊ณ„๋ฐฉ์‹์˜ ๋˜๋‹ค๋ฅธ ์‹ค์šฉ์ ์ธก๋ฉด์€ ์‹œ๊ฐ์ •๋ณด๊ฐ€ ๋‹ค์–‘ํ•œ๊ทœ๋ชจ์—์„œ ์ง„ํ–‰ ๋ฐ ์ง‘ํ•ฉํ•˜์—ฌ ๋™์‹œ์— ๋‹ค๋ฅธ ๊ทœ๋ชจ๋กœ๋ถ€ํ„ฐ์˜ ์ถ”์ƒ์ ํŠน์ง•์— ๋Œ€ํ•œ ์ง๊ด€์„ ํ• ๋‹นํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

 

5. GoogLeNet

- The name GoogLeNet is a homage to Yann LeCun's pioneering LeNet-5 network; we use "GoogLeNet" to refer to the particular incarnation of the Inception architecture used in our ILSVRC14 submission.

- We also used one deeper and wider Inception network whose quality was slightly inferior, but adding it to the ensemble seemed to improve the results marginally.
We omit the details of that network, since our experiments showed that the influence of the exact architectural parameters is relatively minor.
The exact same topology (trained with different sampling methodologies) was used for 6 of the 7 models in our ensemble.
The most successful particular instance, GoogLeNet, is described in Table 1 of the paper for demonstrational purposes.

[Inception Module์˜ ๋‚ด๋ถ€๊ตฌ์กฐ]
- ๋ชจ๋“  convolution์€ ReLU์˜ activation์„ ์‚ฌ์šฉํ•œ๋‹ค.
- ์‹ ๊ฒฝ๋ง์—์„œ ์ˆ˜์šฉํ•„๋“œ(receptive field)์˜ ํฌ๊ธฐ๋Š” ํ‰๊ท ๊ฐ๋ฒ•(mean subtraction)์œผ๋กœ 224x224์˜ RGB channel์„ ์ทจํ•œ๋‹ค. 
- "#3x3 reduce" ์™€ "#5x5 reduce"๋Š”  3x3 ๋ฐ 5x5 convolution์ด์ „์— ์‚ฌ์šฉ๋œ  ์ถ•์†Œ์ธต(reduction layer)์—์„œ์˜ 1x1 filter์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.
- Pool proj column์— ๋‚ด์žฅ๋œ MaxPooling ์ดํ›„์˜ translation layer์—์„œ 1x1 filter์˜ ์ˆ˜๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
์ด๋Ÿฐ ๋ชจ๋“  reduction/projection layer๋Š” ReLU๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ ์ด 22๊ฐœ์˜ parameter๊ฐ€ ์žˆ๋Š” ์ธต์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. (์ „์ฒด๋Š” ํ’€๋งํฌํ•จ 27์ธต)

cf) [Projection layer]
- A projection layer is an ordinary convolutional layer that adjusts the depth (number of channels) of the input data so that filters of various sizes can then be applied to it.
In general, pooling layers are used to reduce the spatial size of the data; a pooling layer is also used inside the Inception module, but it operates separately from, and in parallel with, the projection layer.
For example, when applying a 5x5 filter to an input of depth 192 with a desired output feature-map depth of 32, a projection layer performing a 1x1 convolution can first reduce the input depth to 32. Convolving such dimension-matched inputs with filters of various sizes then extracts several kinds of feature maps.


- The use of average pooling before the classifier is based on
[Min Lin, Qiang Chen, and Shuicheng Yan. Network in Network. CoRR, abs/1312.4400, 2013].
However, our implementation differs in that we use an extra linear layer.
This enables easily adapting and fine-tuning the network for other label sets.

Moving from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%.
However, the use of Dropout remained essential even after removing the fully connected layers.



[Concerns]
- Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern.
- One interesting insight is that the strong performance of relatively shallow networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative.
- By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages of the classifier.
They are also expected to increase the gradient signal that gets propagated back and to provide additional regularization.

- These classifiers take the form of smaller ConvNets placed on top of the outputs of the Inception (4a) and (4d) modules.
During training, their losses are added to the total loss of the network with a discount weight.
(The losses of the auxiliary classifiers were weighted by 0.3.)

- At inference time, the auxiliary classifiers are discarded. The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
• An average pooling layer with 5x5 filter size and stride 3, resulting in a 4x4x512 output for the (4a) stage and 4x4x528 for the (4d) stage.
• A 1x1 convolution with 128 filters for dimension reduction and ReLU activation.
• A fully connected layer with 1024 units and ReLU activation.
• A Dropout layer with a 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
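
The following is a minimal Keras sketch of this side branch, assembled from the bullet points above; the exact tensor shapes depend on where it is attached, and the function name is illustrative.

from tensorflow.keras.layers import AveragePooling2D, Conv2D, Flatten, Dense, Dropout

def auxiliary_classifier(x, num_classes=1000):
    # x is assumed to be the output of Inception (4a) or (4d).
    x = AveragePooling2D(pool_size=(5, 5), strides=(3, 3))(x)        # 5x5 avg pool, stride 3
    x = Conv2D(128, (1, 1), padding='same', activation='relu')(x)    # 1x1 reduction + ReLU
    x = Flatten()(x)
    x = Dense(1024, activation='relu')(x)                            # FC layer, 1024 units
    x = Dropout(0.7)(x)                                              # 70% dropout
    return Dense(num_classes, activation='softmax')(x)               # softmax over 1000 classes

During training, such outputs can be attached as extra model outputs and down-weighted with 0.3, e.g. via loss_weights=[1.0, 0.3, 0.3] in model.compile; at inference time only the main output is used.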


cf) Inference time
Inference time is the time it takes for a trained machine-learning model to make predictions on new inputs once training is finished.
That is, it is the time needed to run a forward pass of the input data through the model's layers and produce the predicted output.


๊ฒฐ๊ณผ ๋„คํŠธ์›Œํฌ์˜ ๊ฐœ๋žต๋„๋Š” ๊ทธ๋ฆผ 3, ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

 

 

6. Training Methodology

- ์šฐ๋ฆฌ์˜ ์‹ ๊ฒฝ๋ง์€ DistBelief๋ฅผ training์œผ๋กœ ์‚ฌ์šฉํ–ˆ๋‹ค.
(DistBelief๋Š” ๋Œ€๊ทœ๋ชจ ๋ถ„์‚ฐ ๊ธฐ๊ณ„ํ•™์Šต ์‹œ์Šคํ…œ์œผ๋กœ์„œ ๊ตฌ๊ธ€ ํด๋ผ์šฐ๋“œ ํ”Œ๋žซํผ์—์„œ ์‹คํ–‰๋˜๋Š” ํด๋ผ์šฐ๋“œ ์„œ๋น„์Šค์ด๋‹ค. DistBelief์˜ ๋‹จ์ ์œผ๋กœ๋Š” ๊ตฌ๊ธ€ ๋‚ด๋ถ€ ์ธํ”„๋ผ์™€ ๋„ˆ๋ฌด ๋‹จ๋‹จํžˆ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋‹ค๋Š” ์ , ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋งŒ ์ง€์›ํ•œ๋‹ค๋Š” ์ ์„ ๋“ค ์ˆ˜์žˆ๋‹ค.) 


- ์šฐ๋ฆฐ training์—์„œ Asynchronous SGD(with 0.9 momentum)์„ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, learning rate๋ฅผ 8epoch๋งˆ๋‹ค 4%๊ฐ์†Œํ•˜๋„๋ก ์ˆ˜์ •ํ•˜๋„๋ก ํ•˜์˜€๋‹ค.


- ์šฐ๋ฆฌ์˜ image sampling๋ฐฉ๋ฒ•์€ ๋Œ€ํšŒ๋ฅผ ๊ฑฐ์น˜๋ฉด์„œ ๋ช‡๋‹ฌ๋™์•ˆ ํฌ๊ฒŒ ๋ณ€ํ™”ํ–ˆ๋‹ค.
์ด๋ฏธ ํ†ตํ•ฉ๋œ๋ชจ๋ธ์€ ๋‹ค๋ฅธ๋ฐฉ๋ฒ•์œผ๋กœ (decay, learning rate๊ฐ™์€ hyper-parameter์˜ ๋ณ€๊ฒฝ์œผ๋กœ) training์„ ์ง„ํ–‰ํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฐ ์‹ ๊ฒฝ๋ง์„ trainingํ•˜๋Š” ๊ฐ€์žฅํšจ๊ณผ์ ์ธ ํ•˜๋‚˜์˜ ๋ฐฉ๋ฒ•์„ ๊ฒฐ์ •๋‚ด๋ฆฌ๊ธฐ ์–ด๋ ต๋‹ค.

- ๋”์šฑ ๋ฐฉ๋ฒ•์„ ๋ณต์žกํ•˜๊ฒŒ ํ•˜๊ธฐ์œ„ํ•ด, ๊ด€๋ จ๊ฒฐ๊ณผ๋ฅผ ์ž‘๊ฒŒํ•˜๊ฑฐ๋‚˜ ํฌ๊ฒŒํ•˜๋Š” ๋“ฑ์˜ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.
- ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€ํšŒ ํ›„ ํšจ๊ณผ๊ฐ€ ์ข‹๋‹ค๊ณ  ํ™•์ธํ•œ ํ•œ๊ฐ€์ง€ ๊ทœ์น™(prescription)์€ ํฌ๊ธฐ๊ฐ€ image๋ฉด์ ์˜ 8~100%์‚ฌ์ด์— ๊ณ ๋ฅด๊ฒŒ ๋ถ„ํฌํ•˜๊ณ  ๊ฐ€๋กœ/์„ธ๋กœ๊ฐ€ 3/4 ~ 4/3 ์‚ฌ์ด์—์„œ ๋ฌด์ž‘์œ„์„ ํƒ๋˜๋Š” image์˜ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ patch์˜ sampling์„ ํฌํ•จํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
- ๋˜ํ•œ, ๊ด‘๋„์™œ๊ณก(photometric distortion)์ด overfitting ๊ทน๋ณต์— ์–ด๋А์ •๋„ ์œ ์šฉํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์•Œ์•„๋ƒˆ๋‹ค.

- In addition, we started to use random interpolation methods for resizing (bilinear, area, nearest neighbor, and cubic, with equal probability) relatively late and in conjunction with other hyper-parameter changes.
Therefore we cannot tell definitely whether the final results were positively affected by their use.

 

 

Adam versus SGD

cf. ๋Œ€๋ถ€๋ถ„์˜ ๋…ผ๋ฌธ๋“ค์—์„œ Adam ๋Œ€์‹  SGD๋ฅผ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š”๋ฐ ๊ทธ ์ด์œ ๋Š” ๋ฌด์—‡์ผ๊นŒ?
Adam optimizer: ๋‹ค์–‘ํ•œ ๋ฌธ์ œ์—์„œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๋Š” ์ตœ์‹  optimization Algorithm์ด์ง€๋งŒ ํ•ญ์ƒ ์ตœ์„ ์˜ ์„ ํƒ์€ ์•„๋‹Œ๋ฐ, ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1. Overfitting
- Adam optimizer๋Š” ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•๋ณด๋‹ค ๋” ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ดํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด๋Š” ๋ชจ๋ธ์ด ๋ฐ์ดํ„ฐ์— ๊ณผ์ ํ•ฉ๋  ๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
- ์ด๋Ÿฐ ์ ์€ ์ž‘์€ ํ•™์Šต๋ฅ ์„ ์‚ฌ์šฉํ•  ๋•Œ ๋” ๋‘๋“œ๋Ÿฌ์ง‘๋‹ˆ๋‹ค.

2. Data size(ํฌ๊ธฐ)
- Adam optimizer๋Š” ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง„ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ๋” ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.
- ๊ทธ๋Ÿฌ๋‚˜ ์ž‘์€ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ๋Š” SGD์™€ ๊ฐ™์€ ๋” ๊ฐ„๋‹จํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

3. Convergence speed (์ˆ˜๋ ด์†๋„)
- Adam optimizer๋Š” ์ดˆ๊ธฐ ํ•™์Šต ์†๋„๊ฐ€ ๋น ๋ฅด๋ฏ€๋กœ ์ดˆ๊ธฐ ํ•™์Šต ์†๋„๋ฅผ ๋‚ฎ์ถ”๋Š” ๊ฒƒ์ด ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
- ๊ทธ๋Ÿฌ๋‚˜ ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์ˆ˜๋ ด ์†๋„๊ฐ€ ๋А๋ ค์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๋”ฐ๋ผ์„œ, Adam optimizer๋ณด๋‹ค SGD๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ํฌ๊ธฐ, ๊ณผ์ ํ•ฉ ๋ฌธ์ œ, ์ˆ˜๋ ด ์†๋„ ๋“ฑ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ์š”์ธ์ด ๊ด€๋ จ๋˜๊ธฐ์— ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ ํƒํ•  ๋•Œ๋Š” ํ•ด๋‹น ๋ฌธ์ œ์™€ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค.

 

 

 

 

7. ILSVRC 2014 Classification Challenge Setup and Results

- The ILSVRC 2014 classification challenge provides about 1.2 million images for training, 50,000 for validation, and 100,000 for testing.
top-1 accuracy/error rate: compares the top predicted class against the ground truth.
top-5 accuracy/error rate: compares the ground truth against the first 5 predicted classes (an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank).
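
As a small illustration (toy labels and scores, not from the paper), the two criteria can be computed directly with Keras metrics:

import tensorflow as tf

y_true = tf.constant([[0., 0., 1., 0., 0., 0.]])          # ground truth = class 2
y_pred = tf.constant([[0.05, 0.3, 0.2, 0.25, 0.1, 0.1]])  # class 2 is not the arg-max

top1 = tf.keras.metrics.CategoricalAccuracy()
top5 = tf.keras.metrics.TopKCategoricalAccuracy(k=5)
top1.update_state(y_true, y_pred)
top5.update_state(y_true, y_pred)
print(top1.result().numpy())   # 0.0 -> wrong under the top-1 criterion
print(top5.result().numpy())   # 1.0 -> correct under the top-5 criterion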


- ๋Œ€ํšŒ์ฐธ๊ฐ€์—์„œ training์—๋Š” ์™ธ๋ถ€๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋‹ค.
- training ๊ธฐ์ˆ ์€ ์ด ๋…ผ๋ฌธ์—์„œ ์ „์ˆ (aforemention)ํ•œ ๊ฒƒ๊ณผ ๊ฐ™์œผ๋ฉฐ, testing์—์„œ๋Š” ๊ณ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ์•„๋ž˜์™€ ๊ฐ™์€ ์ •๊ตํ•œ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ–ˆ๋‹ค.
โ‘  ๋™์ผํ•œ GoogLeNet๋ชจ๋ธ์˜ 7๊ฐœ ๋ฒ„์ „์„ ๋…๋ฆฝ์ ์œผ๋กœ trainingํ•˜๊ณ  ensemble prediction์„ ์ˆ˜ํ–‰ํ–ˆ๋‹ค.
  - ์ด ๋ชจ๋ธ์€ ๋™์ผํ•œ ์ดˆ๊ธฐํ™” ๋ฐ learning rate๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จํ–ˆ๊ณ , sampling๋ฐฉ๋ฒ•๊ณผ input image๋Š” ๋‹จ์ง€ ์ˆœ์„œ๋งŒ ๋ฌด์ž‘์œ„๋กœ ๋‹ค๋ฅด๋‹ค.
     (์ด๋•Œ, ์ดˆ๊ธฐํ™”๋Š” ์ฃผ๋กœ ๋™์ผํ•œ ์ดˆ๊ธฐํ™”๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ๊ฐ„๊ณผ(oversight)์˜ ์ด์œ  ๋•Œ๋ฌธ์ด์—ˆ๋‹ค.) 
  
② During testing, we adopted a more aggressive cropping approach than that of AlexNet.
- Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320, and 352, respectively.
We then take the left, center, and right squares of these resized images (for portrait images, we take the top, center, and bottom squares).
- For each square, we take the 4 corner and the center 224x224 crops as well as the square itself resized to 224x224, plus their mirrored versions.
- As a result, this leads to 4x3x6x2 = 144 crops per image.
- As the results below suggest, such aggressive cropping may not be necessary in real applications, since the benefit of more crops becomes marginal once a reasonable number of crops is present (a simplified sketch of this crop scheme follows below).
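
A simplified eager-mode sketch of the 144-crop scheme, assuming a landscape image with a statically known shape (portrait images would use top/center/bottom squares instead); the exact crop positions are illustrative assumptions.

import tensorflow as tf

def multi_crop(image):                                         # image: [H, W, 3], landscape assumed
    crops = []
    height, width = image.shape[0], image.shape[1]
    for short_side in (256, 288, 320, 352):                    # 4 scales
        h = short_side
        w = short_side * width // height
        img = tf.image.resize(image, (h, w))
        for left in (0, (w - h) // 2, w - h):                  # 3 squares: left / center / right
            square = img[:, left:left + h, :]
            views = [tf.image.resize(square, (224, 224))]      # the square itself, resized
            offsets = ((0, 0), (0, h - 224), (h - 224, 0), (h - 224, h - 224),   # 4 corners
                       ((h - 224) // 2, (h - 224) // 2))                          # + center
            for top, lft in offsets:
                views.append(square[top:top + 224, lft:lft + 224, :])
            for v in views:                                    # 6 views per square
                crops.append(v)
                crops.append(tf.image.flip_left_right(v))      # + mirrored version
    return tf.stack(crops)                                     # 4 x 3 x 6 x 2 = 144 crops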

③ The softmax probabilities are averaged over the multiple crops and over all the individual classifiers to obtain the final prediction.
- In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared to simple averaging.
 



[์ตœ์ข…์ œ์ถœ์˜ ์ „๋ฐ˜์ ์ธ ์„ฑ๊ณผ์š”์ธ ๋ถ„์„]
- ์„ฑ๊ณผ:
๋Œ€ํšŒ์—์„œ ์ตœ์ข… ์ œ์ถœ์€ validation, test data์—์„œ 6.67%์˜ top-5 error๋กœ 1์œ„๋ฅผ ์ฐจ์ง€ํ–ˆ๋‹ค. 
์ด๋Š” 2012๋…„ SuperVision๋ฐฉ์‹์— ๋น„ํ•ด 56.5% ๋‚ฎ์€ ์ˆ˜์น˜์ด๋‹ค.
classifier training์—์„œ ์™ธ๋ถ€๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ ์ „๋…„๋„์˜ ์ตœ๊ณ ์ ‘๊ทผ๋ฒ•์ธ Clarifai์— ๋น„ํ•ด ์•ฝ 40% ๊ฐ์†Œํ•œ ์ˆ˜์น˜์ด๋‹ค.
์œ„์˜ ํ‘œ๋Š” ์ตœ๊ณ ์„ฑ๋Šฅ์ ‘๊ทผ๋ฒ•์— ๋Œ€ํ•œ ํ†ต๊ณ„์˜ ์ผ๋ถ€๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

- ์„ฑ๊ณผ์š”์ธ ๋ถ„์„: ์œ„์˜ ํ‘œ์—์„œ image prediction ์‹œ ๋ชจ๋ธ ์ˆ˜์™€ ์‚ฌ์šฉ๋˜๋Š” Crop์˜ ์ˆ˜๋ฅผ ๋ณ€๊ฒฝํ•ด ์—ฌ๋Ÿฌ test์„ ํƒ์˜ ์„ฑ๋Šฅ์„ ๋ถ„์„ํ•  ๊ฒƒ์ด๋‹ค.
ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋งŒ์„ ์‚ฌ์šฉ์‹œ validation data์—์„œ top-1 error rate์ธ ๋ชจ๋ธ์„ ์„ ํƒ
๋ชจ๋“  ์ˆซ์ž๋Š” test data ํ†ต๊ณ„์— overfitting๋˜์ง€ ์•Š๋„๋ก validation set์„ ์ด์šฉํ•œ๋‹ค.

 

 

8. ILSVRC 2014 Detection Challenge Setup and Results


 

 

 

9. Conclusions

- ๋ณธ ๋…ผ๋ฌธ์˜ ๊ฒฐ๊ณผ๋Š” ์‰ฝ๊ฒŒ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•œ ๊ณ ๋ฐ€๋„์˜ building block๋“ค์— ์˜ํ•ด ์˜ˆ์ƒ๋˜๋Š” ์ตœ์ ์˜ ๋นˆ์•ฝํ•œ ๊ตฌ์กฐ๋ฅผ ๊ทผ์‚ฌํ™”ํ•˜๋Š” ๊ฒƒ์ด ์ปดํ“จํ„ฐ๋น„์ „์„ ์œ„ํ•ด ์‹คํ˜„๊ฐ€๋Šฅํ•œ ์‹ ๊ฒฝ๋ง ๊ฐœ์„ ๋ฐฉ๋ฒ•์ด๋ผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
- ์ด ๋ฐฉ๋ฒ•์˜ ์ฃผ์š” ์ด์ ์€ ์–•๊ณ  ๋œ ๋„“์€ ์‹ ๊ฒฝ๋ง์— ๋น„ํ•ด ๊ณ„์‚ฐ์š”๊ตฌ๋Ÿ‰์ด ์•ฝ๊ฐ„ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋  ๋•Œ์˜ ์ƒ๋‹นํ•œ ํ’ˆ์งˆ์˜ ํ–ฅ์ƒ์„ ์–ป๋Š”๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
๋˜ํ•œ, context์˜ ํ™œ์šฉ์ด๋‚˜ bounding box regression์„ ํ•˜์ง€ ์•Š์•„๋„ ์šฐ๋ฆฌ์˜ detection์€ ๊ฒฝ์Ÿ๋ ฅ์ด ์žˆ์—ˆ์œผ๋ฉฐ ์ด ์‚ฌ์‹ค์€ Inception ๊ตฌ์กฐ์˜ ๊ฐ•์ ์— ๋Œ€ํ•œ ์ถ”๊ฐ€์  ์ฆ๊ฑฐ๋ผ๋Š” ์ ์„ ์ œ๊ณตํ•ด์ค€๋‹ค.

- ๋น„๋ก ์œ ์‚ฌํ•œ ๊นŠ์ด์™€ ํญ์„ ๊ฐ–๋Š” ๋” expensiveํ•œ ์‹ ๊ฒฝ๋ง์— ์˜ํ•ด ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์„ ์ˆ˜๋„ ์žˆ์ง€๋งŒ ์šฐ๋ฆฌ์˜ ์ ‘๊ทผ๋ฐฉ์‹์€ ํฌ์†Œ์ ์ธ ๊ตฌ์กฐ๋กœ์˜ ์ด๋™์ด๋ผ๋Š” ์ ์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์‹คํ˜„๊ฐ€๋Šฅํ•˜๊ณ  ์œ ์šฉํ•œ ์ƒ๊ฐ์ด๋ผ๋Š” ํ™•์‹คํ•œ ์ฆ๊ฑฐ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.

 

 

 

๐Ÿง ๋…ผ๋ฌธ ๊ฐ์ƒ_์ค‘์š”๊ฐœ๋… ํ•ต์‹ฌ ์š”์•ฝ

"Going deeper with convolutions"
CNN์˜ ๊ฐœ๋…๊ณผ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ์˜ ํšจ์œจ์„ฑ์„ ์†Œ๊ฐœํ•˜๋Š” ์—ฐ๊ตฌ ๋…ผ๋ฌธ์œผ๋กœ ์ด ๋…ผ๋ฌธ์€ filter ํฌ๊ธฐ๊ฐ€ ๋‹ค๋ฅธ ์—ฌ๋Ÿฌ ๋ณ‘๋ ฌ convolution์„ ์‚ฌ์šฉํ•ด CNN์„ ํšจ์œจ์ ์œผ๋กœ ํ›ˆ๋ จํ•˜๋Š” "Inception Module"์ด๋ผ๋Š” ๋ชจ๋“ˆ์„ ์ œ์•ˆํ•œ๋‹ค.

 

[ํ•ต์‹ฌ ๊ฐœ๋…]

1. Inception Module
- GoogLeNet ์•„ํ‚คํ…์ฒ˜๋Š” ์ปค๋„ ํฌ๊ธฐ(1x1, 3x3 ๋ฐ 5x5)๊ฐ€ ๋‹ค๋ฅธ ์—ฌ๋Ÿฌ ๋ณ‘๋ ฌ ์ปจ๋ฒŒ๋ฃจ์…˜ ๊ฒฝ๋กœ๋กœ ๊ตฌ์„ฑ๋œ Inception ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•œ๋‹ค.
์ด๋ฅผ ํ†ตํ•ด ๋„คํŠธ์›Œํฌ๋Š” ๋‹ค์–‘ํ•œ ๊ทœ๋ชจ์˜ ๊ธฐ๋Šฅ์„ ์บก์ฒ˜ํ•  ์ˆ˜ ์žˆ๊ณ  ๋„คํŠธ์›Œํฌ ํ›ˆ๋ จ์˜ ๊ณ„์‚ฐ ๋น„์šฉ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค.

- ๋˜ํ•œ Inception ๋ชจ๋“ˆ์—๋Š” ์ž…๋ ฅ์˜ ์ฐจ์›์„ ์ค„์ด๊ณ  ๊ณ„์‚ฐ ์†๋„๋ฅผ ๋†’์ด๊ธฐ ์œ„ํ•ด 1x1 ์ปจ๋ณผ๋ฃจ์…˜์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

- GoogLeNet ๊ตฌ์กฐ๋Š” ์—ฌ๋Ÿฌ Inception ๋ชจ๋“ˆ์„ ์Œ“๊ณ  ๋ณด์กฐ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹ ๊ฒฝ๋ง์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ธต์—์„œ ๊ตฌ๋ณ„๋˜๋Š” ํŠน์ง•์„ ํ•™์Šตํ•˜๋„๋ก ํ•œ๋‹ค.
์ด ๋…ผ๋ฌธ์€ GoogLeNet ์•„ํ‚คํ…์ฒ˜๊ฐ€ AlexNet ๋ฐ VGG๋ฅผ ํฌํ•จํ•œ ๋‹ค๋ฅธ CNN ์•„ํ‚คํ…์ฒ˜๋ณด๋‹ค ๋” ์ ์€ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ณ„์‚ฐ์œผ๋กœ ImageNet ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

cf) depth๋Š” ๊นŠ์ด์™€ ๊ด€๋ จ, width๋Š” filter์˜ unit๊ณผ ๊ด€๋ จ


2. 1x1 convolution
- The GoogLeNet architecture also adopts the concept of "Network-in-Network" layers, which use 1x1 convolutions to perform feature transformation and dimensionality reduction.

- Network-in-Network layers are used together with the Inception modules to further reduce the computational cost of the network.

cf) 3x3 and 5x5 convolutions are more expensive than 1x1 convolutions.


3. Normalization
The use of Batch Normalization, which normalizes the inputs to each layer and reduces internal covariate shift, helps the network converge faster and reduces overfitting. (Batch Normalization itself was introduced in the follow-up Inception work rather than in this paper.)


Overall, this paper introduces the Inception module, which trains deep CNNs efficiently by using multiple parallel convolutional pathways with different kernel sizes.
It shows that the GoogLeNet architecture achieves state-of-the-art performance on ImageNet while reducing the computational cost of training deep CNNs.

 

 

 

 

 

๐Ÿง  ๋…ผ๋ฌธ์„ ์ฝ๊ณ  Architecture ์ƒ์„ฑ (with tensorflow)

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Dense, Flatten, AveragePooling2D, concatenate

def inception_module(x, filters):
  # Filters is a tuple that contains the number of filters for each branch
    f1, f2, f3, f4 = filters
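  # NOTE: this sketch reuses f2/f3 for both the 1x1 "reduce" convolution and the
  # following 3x3/5x5 convolution; Table 1 of the paper lists separate filter
  # counts for the reduce layers.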
  
  # Branch 1
    branch1 = Conv2D(f1, (1, 1), padding='same', activation='relu')(x)
    
  # Branch 2
    branch2 = Conv2D(f2, (1, 1), padding='same', activation='relu')(x)
    branch2 = Conv2D(f2, (3, 3), padding='same', activation='relu')(branch2)
  
  # Branch 3
    branch3 = Conv2D(f3, (1, 1), padding='same', activation='relu')(x)
    branch3 = Conv2D(f3, (5, 5), padding='same', activation='relu')(branch3)
  
  # Branch 4
    branch4 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(x)
    branch4 = Conv2D(f4, (1, 1), padding='same', activation='relu')(branch4)
  
  # Concatenate the outputs of the branches
    output = concatenate([branch1, branch2, branch3, branch4], axis=-1)
  
    return output
  
def GoogLeNet(input_shape, num_classes):
    input = tf.keras.layers.Input(shape=input_shape)

  # First Convolutional layer
    x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(input)
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
  
  # Second Convolutional layer
    x = Conv2D(64, (1, 1), padding='same', activation='relu')(x)
    x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
  
  # Inception 3a
    x = inception_module(x, filters=(64, 96, 128, 16))
  # Inception 3b
    x = inception_module(x, filters=(128, 128, 192, 32))
   
  # Max pooling with stride 2
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
  
  # Inception 4a
    x = inception_module(x, filters=(192, 96, 208, 16))
  # Inception 4b
    x = inception_module(x, filters=(160, 112, 224, 24))
  # Inception 4c
    x = inception_module(x, filters=(128, 128, 256, 24))
  # Inception 4d
    x = inception_module(x, filters=(112, 144, 288, 32))
  # Inception 4e
    x = inception_module(x, filters=(256, 160, 320, 32))
  
  # Max pooling with stride 2
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
    
  # Inception 5a
    x = inception_module(x, filters=(256, 160, 320, 32))
  # Inception 5b
    x = inception_module(x, filters=(384, 192, 384, 48))
    
  # global average pooling layer
    x = AveragePooling2D((7, 7), padding='same')(x)
    x = Flatten()(x)
    
  # Dropout followed by a fully connected layer
    x = Dropout(0.4)(x)
    x = Dense(1024, activation='relu')(x)
    
    
  # softmax
    output_layer = Dense(num_classes, activation='softmax')(x)
    
  # define the model with input and output layers  
    model = tf.keras.Model(inputs=input, outputs=output_layer)
    
    return model
    
    
    
model = GoogLeNet(input_shape=(224,224,3), num_classes=1000)
model.summary()
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_2 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_57 (Conv2D)             (None, 112, 112, 64  9472        ['input_2[0][0]']                
                                )                                                                 
                                                                                                  
 max_pooling2d_13 (MaxPooling2D  (None, 56, 56, 64)  0           ['conv2d_57[0][0]']              
 )                                                                                                
                                                                                                  
 conv2d_58 (Conv2D)             (None, 56, 56, 64)   4160        ['max_pooling2d_13[0][0]']       
                                                                                                  
 conv2d_59 (Conv2D)             (None, 56, 56, 192)  110784      ['conv2d_58[0][0]']              
                                                                                                  
 max_pooling2d_14 (MaxPooling2D  (None, 28, 28, 192)  0          ['conv2d_59[0][0]']              
 )                                                                                                
                                                                                                  
 
 ...
 
 
 conv2d_108 (Conv2D)            (None, 7, 7, 384)    295296      ['concatenate_16[0][0]']         
                                                                                                  
 conv2d_110 (Conv2D)            (None, 7, 7, 192)    331968      ['conv2d_109[0][0]']             
                                                                                                  
 conv2d_112 (Conv2D)            (None, 7, 7, 384)    3686784     ['conv2d_111[0][0]']             
                                                                                                  
 conv2d_113 (Conv2D)            (None, 7, 7, 48)     36912       ['max_pooling2d_25[0][0]']       
                                                                                                  
 concatenate_17 (Concatenate)   (None, 7, 7, 1008)   0           ['conv2d_108[0][0]',             
                                                                  'conv2d_110[0][0]',             
                                                                  'conv2d_112[0][0]',             
                                                                  'conv2d_113[0][0]']             
                                                                                                  
 average_pooling2d_1 (AveragePo  (None, 1, 1, 1008)  0           ['concatenate_17[0][0]']         
 oling2D)                                                                                         
                                                                                                  
 flatten_1 (Flatten)            (None, 1008)         0           ['average_pooling2d_1[0][0]']    
                                                                                                  
 dropout_1 (Dropout)            (None, 1008)         0           ['flatten_1[0][0]']              
                                                                                                  
 dense_2 (Dense)                (None, 1024)         1033216     ['dropout_1[0][0]']              
                                                                                                  
 dense_3 (Dense)                (None, 1000)         1025000     ['dense_2[0][0]']                
                                                                                                  
==================================================================================================
Total params: 23,040,216
Trainable params: 23,040,216
Non-trainable params: 0
__________________________________________________________________________________________________
