😶 Abstract

- Achieved strong results classifying 1.2 million high-resolution ImageNet images into 1000 classes.

- The network has 60 million parameters and 650,000 neurons [five convolutional layers (some followed by max-pooling) and three fully-connected layers, the last being a softmax layer that distinguishes 1000 classes].
- To make training faster, non-saturating neurons and an efficient GPU implementation were used.
- To reduce overfitting, dropout (a regularization method that proved very effective) was applied.

- In the ILSVRC-2012 competition (ImageNet Large-Scale Visual Recognition Challenge), the model achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.

 

 

 

1. Introduction

- Current approaches to object recognition make essential use of machine learning methods.
- Performance can be improved by collecting larger datasets, learning more powerful models, and using better techniques for preventing overfitting.
In the past, datasets like MNIST worked well even at small scale, but objects in realistic settings exhibit considerable variability, so much larger training sets have become indispensable.

- ์ˆ˜๋งŽ์€ ์‚ฌ๋ฌผ์„ ๊ตฌ๋ถ„ํ•˜๋ ค๋ฉด ๋งŽ์€ ํ•™์Šต๋Šฅ๋ ฅ์„ ์ง€๋…€์•ผ ํ•œ๋‹ค.
ํ•˜์ง€๋งŒ, ์‚ฌ๋ฌผ์ธ์‹์˜ ๊ฑฐ๋Œ€ํ•œ ๋ณต์žก๋„์˜ ๋ฌธ์ œ๋Š” ImageNet๊ฐ™์€ ๊ฑฐ๋Œ€ํ•œ dataset์—์„œ ๋ช…์‹œ๋  ์ˆ˜ ์—†๋‹ค๋Š” ์ ์ด๋‹ค.

- Learning capacity can be controlled by varying depth and breadth. Compared with standard feedforward neural networks, CNNs have far fewer connections and parameters, so they are easier to train, while their theoretically best performance is likely to be only slightly worse.

 

 

 

2. The Dataset

- ImageNet์„ ์ด์šฉํ–ˆ๋Š”๋ฐ, ImageNet์€ 22,000๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ, 1500๋งŒ๊ฐœ์˜ ๋ผ๋ฒจ๋ง๋œ ๊ณ ํ•ด์ƒ๋„์˜ ์‚ฌ์ง„์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.
- ๋”ฐ๋ผ์„œ image๋ฅผ down-sampleํ•˜์—ฌ ํ•ด์ƒ๋„๋ฅผ 256X256์œผ๋กœ ๋งž์ถฐ image๋ฅผ rescaleํ•˜์˜€๋‹ค.
- raw RGB์˜ pixel๋กœ train์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.
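The rescale-then-center-crop size arithmetic can be sketched as follows (an illustrative helper; the function name is ours, not from the paper):

```python
def center_crop_dims(h, w, out=256):
    # Rescale so the shorter side equals `out`, then take the central
    # out x out patch; returns the rescaled size and the crop offset.
    if h < w:
        new_h, new_w = out, round(w * out / h)
    else:
        new_h, new_w = round(h * out / w), out
    top, left = (new_h - out) // 2, (new_w - out) // 2
    return (new_h, new_w), (top, left)

print(center_crop_dims(300, 400))  # ((256, 341), (0, 42))
```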

 

 

 

3. The Architecture

 

3.1 ReLU Nonlinearity

- In deep CNNs, training with ReLUs is several times faster than with tanh units.
Such fast learning has a great influence on large models trained on large datasets.
The figure in the paper illustrates this:
- Solid line: a four-layer convolutional net with ReLUs, about six times faster.
- Dashed line: the same four-layer convolutional net with tanh.
- ReLUs learn consistently faster than saturating neurons.
It would not have been possible to experiment with such large neural networks using traditional saturating neuron models.
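The speed difference comes from the gradients: tanh saturates for large inputs while the ReLU gradient stays 1 for any positive input. An illustrative numpy sketch (not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-6.0, -1.0, 0.5, 6.0])
print(tanh_grad(x))             # ~[2.5e-05, 0.42, 0.79, 2.5e-05]
relu_grad = (x > 0).astype(float)
print(relu_grad)                # [0. 0. 1. 1.]
```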

 

 

3.2 Training on Multiple GPUs

- A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of network that can be trained on it.
- Therefore a parallelization scheme is used that spreads the net across two GPUs.
- The two-GPU net takes slightly less time to train than a one-GPU net and reduces the top-1 and top-5 error rates by 1.7% and 1.2%, respectively.

 

 

3.3 Local Response Normalization

- ReLU๋Š” ํฌํ™”(saturating)๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ์ž…๋ ฅ ์ •๊ทœํ™”๋ฅผ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š๋Š” ๋ฐ”๋žŒ์งํ•œ ํŠน์„ฑ์„ ๊ฐ–๋Š”๋‹ค.
- ReLU๋Š” local ์ •๊ทœํ™” ์ผ๋ฐ˜ํ™”์— ๋„์›€์ด ๋œ๋‹ค.

- ์œ„์˜ ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ•ด์„๋œ๋‹ค.

 
kernel i์˜ ์œ„์น˜ (x,y)๋ฅผ ์ ์šฉํ•ด ๊ณ„์‚ฐ๋œ ๋‰ด๋Ÿฐ์˜ activity์ด๋‹ค.
์—ฌ๊ธฐ์— ReLU์˜ ๋น„์„ ํ˜•์„ฑ์„ ์ ์šฉํ•œ๋‹ค.
 
์ด๋•Œ, ํ•ฉ์€ ๋™์ผํ•œ ๊ณต๊ฐ„์œ„์น˜์—์„œ n๊ฐœ์˜ '์ธ์ ‘ํ•œ(adjecent)' kernel map์— ๊ฑธ์ณ ์‹คํ–‰๋œ๋‹ค. (N์€ layer์˜ ์ด kernel์ˆ˜)
kernel map์˜ ์ˆœ์„œ๋Š” training์ด ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ์ž„์˜๋กœ(arbitrary) ๊ฒฐ์ •๋œ๋‹ค.
์ด๋Ÿฐ ๋ฐ˜์‘์ •๊ทœํ™”(response normalization)์˜ ์ข…๋ฅ˜๋Š” ์˜๊ฐ์„ ๋ฐ›์€(inspired) ์ธก๋ฉด์–ต์ œ(lateral inhibition)ํ˜•ํƒœ๋ฅผ ๊ตฌํ˜„,
์ด๋ฅผ ํ†ตํ•œ ๋‹ค๋ฅธ ์ปค๋„์˜ ์‚ฌ์šฉ์œผ๋กœ ๊ณ„์‚ฐ๋œ ๋‰ด๋Ÿฐ ์ถœ๋ ฅ๊ฐ„์˜ large activity์— ๋Œ€ํ•œ ๊ฒฝ์Ÿ(competition)์„ ์ƒ์„ฑํ•œ๋‹ค.

์ƒ์ˆ˜ k, n, α, β๋Š” validation set์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ’์ด ๊ฒฐ์ •๋˜๋Š” hyper-parameter๋กœ ๊ฐ๊ฐ k = 2, n = 5, α = 10-4, β = 0.75๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ ํŠน์ • layer์—์„œ ReLU์˜ ๋น„์„ ํ˜•์„ฑ์„ ์ ์šฉํ•œ ํ›„ ์ด ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ–ˆ๋‹ค(3.5 ์ฐธ์กฐ).

This scheme bears some resemblance to Jarrett's local contrast normalization, but since the mean activity is not subtracted,
it would be more correctly termed 'brightness normalization'.

 

 

3.4 Overlapping Pooling

CNN์—์„œ์˜ pooling layer๋Š” ๋™์ผํ•œ kernel map์—์„œ '์ธ์ ‘ํ•œ(adjacent)' ๋‰ด๋Ÿฐ ๊ทธ๋ฃน์˜ output์„ summarizeํ•œ๋‹ค.
๋ณดํŽธ์ ์œผ๋กœ adjacent pooling์œผ๋กœ ์š”์•ฝ๋˜๋ฉด unit๋“ค์ด ๊ฒน์น˜์ง€ ์•Š๋Š”๋‹ค.
์ฆ‰, pooling layer๋Š” ํ”ฝ์…€๋“ค ์‚ฌ์ด์— s๋งŒํผ์˜ ๊ฐ„๊ฒฉ์„ ๋‘” pooling unit๋“ค์˜ grid๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, ๊ฐ๊ฐ์€ pooling unit์˜ ์œ„์น˜๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ•œ ํฌ๊ธฐ(zxz)์˜ neighbor๋ฅผ ์š”์•ฝํ•œ๋‹ค.
๋งŒ์•ฝ ์œ„์—์„œ์˜ s๋ฅผ s = z๋กœ ์„ค์ •ํ•˜๋ฉด CNN์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ณดํŽธ์ ์ธ local pooling์„ ์–ป์„ ์ˆ˜ ์žˆ์œผ๋ฉฐ, s < z๋ฅผ ์„ค์ •ํ•˜๋ฉด overlapping pooling์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

This is why the network uses s = 2 and z = 3 throughout.
It was also generally observed during training that models with overlapping pooling find it slightly more difficult to overfit.

 

 

 

3.5 Overall Architecture

์ด 8๊ฐœ์˜ weight๋ฅผ ๊ฐ–๋Š” layer๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.
5๊ฐœ์˜ CONV (convolution layer)
- 3๊ฐœ์˜ FC (Fully-connected layer) + ๋งˆ์ง€๋ง‰์€ 1000๊ฐœ์˜ ํด๋ž˜์Šค๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” softmax์ธต

์ด ๋ชจ๋ธ์€ ๋‹คํ•ญ์‹ logistic regression ๋ถ„์„๋ชฉํ‘œ๋ฅผ ์ตœ๋Œ€ํ™” ํ•˜๋Š”๋ฐ, ์ด๋Š” ์˜ˆ์ธก๋ถ„ํฌ(prediction distribution)ํ•˜์—์„œ ์˜ฌ๋ฐ”๋ฅธ label์˜ logํ™•๋ฅ (log-probability)์˜ training cases์— ๋Œ€ํ•œ ํ‰๊ท (average)์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

The kernels of the 2nd, 4th, and 5th CONV layers are connected only to those kernel maps in the previous layer that reside on the same GPU (see the figure below).
The kernels of the 3rd CONV layer are connected to all kernel maps in the 2nd layer, and the neurons in the fully-connected layers are connected to all neurons in the previous layer.

Response-normalization layers follow the 1st and 2nd CONV layers, and max-pooling layers of the kind described in Section 3.4 follow both response-normalization layers as well as the 5th CONV layer.

ReLU์˜ ๋น„์„ ํ˜•์„ฑ์€ ๋ชจ๋“  CONV์™€ FC์˜ ์ถœ๋ ฅ์— ์ ์šฉ๋˜๋ฉฐ ์ฒซ ๋ฒˆ์งธ CONV์€ 11×11×3์˜ 96๊ฐœ์˜ kernel์„ ๊ฐ€์ง€๋ฉฐ
224×224×3(150,528 ์ฐจ์›)์˜ input image๋ฅผ  stride=4๋กœ filtering์„ ์ง„ํ–‰ํ•œ๋‹ค.
(์‹ ๊ฒฝ๋ง์˜ ๋‰ด๋Ÿฐ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 150,528-253,440–186,624–64,896–64,896–43,264–4096–1000)

The 2nd CONV layer takes the output of the 1st as input and filters it with 256 kernels of size 5×5×48;
the 3rd CONV layer has 384 kernels of size 3×3×256 connected to the output of the 2nd;
the 4th CONV layer has 384 kernels of size 3×3×192;
and the 5th CONV layer filters with 256 kernels of size 3×3×192.
The 3rd, 4th, and 5th CONV layers are connected to one another without any intervening pooling or normalization layers.

Each fully-connected (FC) layer has 4096 neurons.
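As a sanity check on the kernel/stride numbers above, the standard convolution output-size formula can be applied (the helper name is ours). Note a well-known quirk: with a literal 224×224 input, stride-4 11×11 filtering yields 54×54 maps; the 55×55 maps usually quoted for AlexNet correspond to an effective 227×227 input.

```python
def conv_out(w, k, s, p=0):
    # spatial output size of a convolution:
    # input width w, kernel k, stride s, padding p
    return (w - k + 2 * p) // s + 1

print(conv_out(224, 11, 4))  # 54
print(conv_out(227, 11, 4))  # 55
```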

 

 

 

 

 

 

4. Reducing Overfitting

4.1 Data Augmentation

① Image translations & horizontal reflections
- image translation:
extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images and training on these patches.
- This increases the size of the training set by a factor of 2048, though the resulting examples are highly interdependent; without this scheme, the network would have suffered substantial overfitting.

② Altering the intensities of the RGB channels
- PCA is performed on the set of RGB pixel values throughout the ImageNet training set.
- To each training image, multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a zero-mean Gaussian distribution with standard deviation 0.1.

- ๋”ฐ๋ผ์„œ RGB ํ”ฝ์…€ Ixy =[IR , IG , IB ]T์— ์•„๋ž˜ ๊ฐ’์„ ์ถ”๊ฐ€ํ•œ๋‹ค. 
- ์ด๋•Œ, pi์™€ yi๋Š” ๊ฐ๊ฐ RGB pixel์˜ 3x3 covariance(๊ณต๋ถ„์‚ฐ)ํ–‰๋ ฌ์˜ i๋ฒˆ์งธ eigenvector์™€ eigenvalue์ด๋ฉฐ αi๋Š” ์ด์ „์— ์„ค๋ช…ํ•œ  random ๋ณ€์ˆ˜์ด๋‹ค. ๊ฐ αi๋Š” ํ•ด๋‹น ์ด๋ฏธ์ง€๊ฐ€ ๋‹ค์‹œ ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋  ๋•Œ๊นŒ์ง€ ํŠน์ • ํ›ˆ๋ จ ์ด๋ฏธ์ง€์˜ ๋ชจ๋“  ํ”ฝ์…€์— ๋Œ€ํ•ด ํ•œ ๋ฒˆ๋งŒ ๊ทธ๋ ค์ง€๋ฉฐ, ์ด ์‹œ์ ์—์„œ ๋‹ค์‹œ ๊ทธ๋ ค์ง„๋‹ค.
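The PCA color augmentation above can be sketched in numpy as follows (the function name is ours; an illustrative sketch, not the authors' code):

```python
import numpy as np

def pca_color_augment(image, rng=np.random.default_rng()):
    """image: float array (H, W, 3) of RGB values.
    Adds [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel, where
    p_i / l_i are eigenvectors / eigenvalues of the 3x3 RGB covariance
    and a_i ~ N(0, 0.1), drawn once per image presentation."""
    flat = image.reshape(-1, 3)
    cov = np.cov(flat, rowvar=False)          # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # columns of eigvecs are the p_i
    alphas = rng.normal(0.0, 0.1, size=3)     # a_i, std dev 0.1 as in the paper
    delta = eigvecs @ (alphas * eigvals)      # quantity added to each pixel
    return image + delta
```

For a constant image the covariance is zero, so the perturbation vanishes.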

 

 

4.2 Dropout

- Dropout zeroes the output of each hidden neuron with some probability; dropped neurons do not contribute to the forward pass and do not participate in back-propagation, so a different random subset of neurons is sampled each time, which is useful for preventing complex co-adaptations.
- It was applied with probability 0.5 to the first two FC layers; without dropout the network exhibited substantial overfitting.

 

 

 

 

 

5. Details of Learning

- Optimizer: SGD (stochastic gradient descent), momentum = 0.9, weight decay = 0.0005
- Batch size: 128

The update rule, where i is the iteration index, v the momentum variable, and ε the learning rate:

v_(i+1) = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w | w_i⟩_(D_i)
w_(i+1) = w_i + v_(i+1)

Here ⟨∂L/∂w | w_i⟩_(D_i) is the average over the i-th batch D_i of the derivative of the loss with respect to w, evaluated at w_i.


<Initialization>
The weights in each layer were initialized from a zero-mean Gaussian distribution with standard deviation 0.01.
The biases in the 2nd, 4th, and 5th CONV layers and in the FC layers were initialized to the constant 1; the remaining biases were initialized to 0.

The learning rate was divided by 10 whenever the validation error stopped improving (initialized at 0.01 and reduced three times before termination).
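The update rule above can be sketched in plain Python (the function name is ours):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One step of the paper's update rule:
        v <- 0.9*v - 0.0005*lr*w - lr*grad
        w <- w + v
    where grad is the gradient averaged over the current batch."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, v, grad=1.0, lr=0.1)
print(w)  # ~0.89995
```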

 

 

6. Results

- A summary of the results on ILSVRC-2010:

์‹ ๊ฒฝ๋ง์€ 37.5%์™€ 17.0%5์˜ ์ƒ์œ„ 1์œ„์™€ ์ƒ์œ„ 5์œ„์˜ ํ…Œ์ŠคํŠธ ์„ธํŠธ ์˜ค๋ฅ˜์œจ์„ ๋‹ฌ์„ฑํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2%, with an approach that averages the predictions produced from six sparse-coding models trained on different features.
The best published results since then are 45.7% and 25.7%, with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features.

- Results in the ILSVRC-2012 competition:

๋‹จ๋ฝ(paragraph)์˜ ๋‚˜๋จธ์ง€ ๋ถ€๋ถ„์—์„œ๋Š” ๊ฒ€์ฆ ๋ฐ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์ด 0.1% ์ด์ƒ ์ฐจ์ด๊ฐ€ ๋‚˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์„œ๋กœ ๊ตํ™˜ํ•˜์—ฌ ์‚ฌ์šฉํ•œ๋‹ค.


์ด ๋…ผ๋ฌธ์—์„œ ์„ค๋ช…ํ•œ CNN์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋‘์—ˆ๋‹ค
์ƒ์œ„ 5์œ„ ์ด๋‚ด์˜ ์˜ค๋ฅ˜์œจ 18.2%. 5๊ฐœ์˜ ์œ ์‚ฌํ•œ CNN์˜ ์˜ˆ์ธก์„ ํ‰๊ท ํ•˜๋ฉด 16.4%์˜ ์˜ค๋ฅ˜์œจ์„ ์–ป์—ˆ๋‹ค.

- Training one CNN with an additional sixth convolutional layer after the last pooling layer to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then "fine-tuning" it on ILSVRC-2012, gives an error rate of 16.6%. Averaging the predictions of two CNNs pre-trained on the entire Fall 2011 release with the five CNNs mentioned above gives an error rate of 15.3%. The second-best contest entry achieved an error rate of 26.2% with an approach that averages the predictions of several classifiers trained on FVs computed from different types of densely-sampled features.

- Finally, error rates are also reported on the Fall 2009 version of ImageNet, which has 10,184 categories
and 8.9 million images. On this dataset the convention in the literature of using half the images for training and half for testing is followed.
Since there is no established test set, the split necessarily differs from the splits used by previous authors, but this does not affect the results appreciably. Top-1 and top-5 error rates on this dataset are 67.4% and 40.9%, attained by the net described above with an additional sixth convolutional layer after the last pooling layer. The best published results on this dataset are 78.1% and 60.9%.

 

6.1 Qualitative Evaluations

์ด ๊ทธ๋ฆผ์€ ์‹ ๊ฒฝ๋ง์˜ ๋‘ data-connected layer์— ์˜ํ•ด ํ•™์Šต๋œ Convolution kernel์„ ๋ณด์—ฌ์ค€๋‹ค.
์‹ ๊ฒฝ๋ง์€ ๋‹ค์–‘ํ•œ frequency ๋ฐ orientation-selective kernels ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋‹ค์–‘ํ•œ ์ƒ‰์ƒ blob๋“ค์„ ์‚ฌ์šฉํ•œ ๊ฒƒ ๋˜ํ•œ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Note the specialization exhibited by the two GPUs, a result of the restricted connectivity described in Section 3.5: the kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization (modulo a renumbering of the GPUs).


์œ„ ์‚ฌ์ง„์˜ ์™ผ์ชฝ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
- 8๊ฐœ์˜ ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์ƒ์œ„ 5๊ฐœ ์˜ˆ์ธก์„ ๊ณ„์‚ฐํ•˜์—ฌ ์‹ ๊ฒฝ๋ง์ด ๋ฌด์—‡์„ ๋ฐฐ์› ๋Š”์ง€ ์ •์„ฑ์ ์œผ๋กœ ํ‰๊ฐ€(qualitatively assess)ํ•œ๋‹ค.
- ์ด๋•Œ, ์ง„๋“œ๊ธฐ(mite)๊ฐ™์ด ์ค‘์‹ฌ์„ ๋ฒ—์–ด๋‚œ ๋ฌผ์ฒด๋„ net๋กœ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ์œ ์˜ํ•ด์•ผ ํ•œ๋‹ค.
(์ƒ์œ„ 5๊ฐœ์˜ label์€ ๋Œ€๋ถ€๋ถ„ ํ•ฉ๋ฆฌ์ ์ธ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.)

์œ„ ์‚ฌ์ง„์˜ ์˜ค๋ฅธ์ชฝ์€ ์ฒซ ๋ฒˆ์งธ ์—ด์— 5๊ฐœ์˜ ILSVRC-2010 ํ…Œ์ŠคํŠธ ์˜์ƒ์œผ๋กœ ๋‚˜๋จธ์ง€ ์—ด์€ ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€์˜ ํŠน์ง• ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ(Euclidean distance)๊ฐ€ ๊ฐ€์žฅ ์ž‘์€ ๋งˆ์ง€๋ง‰ ์ˆจ๊ฒจ์ง„ ๋ ˆ์ด์–ด์—์„œ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” 6๊ฐœ์˜ ํ›ˆ๋ จ ์ด๋ฏธ์ง€๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

- If two images produce feature activation vectors with a small Euclidean separation, the higher levels of the network can be said to consider them similar.

- ๊ทธ๋ฆผ์€ testset์˜ 5๊ฐœ ์ด๋ฏธ์ง€์™€ ์ด ์ธก์ •์— ๋”ฐ๋ผ ๊ฐ ์ด๋ฏธ์ง€์™€ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ training set์˜ 6๊ฐœ ์ด๋ฏธ์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋Š”๋ฐ, ํ”ฝ์…€ ์ˆ˜์ค€์—์„œ ๊ฒ€์ƒ‰๋œ ๊ต์œก ์ด๋ฏธ์ง€๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ฒซ ๋ฒˆ์งธ ์—ด์˜ query image์˜ L2์—์„œ ๊ฐ€๊น์ง€ ์•Š๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด, ํšŒ์ˆ˜๋œ ๊ฐœ๋“ค๊ณผ ์ฝ”๋ผ๋ฆฌ๋“ค์€ ๋‹ค์–‘ํ•œ ํฌ์ฆˆ๋กœ ๋‚˜ํƒ€๋‚˜๋ฉฐ ์šฐ๋ฆฌ๋Š” ๋ณด์ถฉ ์ž๋ฃŒ์—์„œ ๋” ๋งŽ์€ train image์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ์‹œํ•œ๋‹ค.

- ๋‘ ๊ฐœ์˜ 4096์ฐจ์› ์‹ค์ œ ๊ฐ’ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ Euclidean distance ์‚ฌ์šฉํ•˜์—ฌ ์œ ์‚ฌ์„ฑ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ๋น„ํšจ์œจ์ ์ด์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฒกํ„ฐ๋ฅผ ์งง์€ ์ด์ง„ ์ฝ”๋“œ๋กœ ์••์ถ•ํ•˜๋„๋ก auto-encoder๋ฅผ ํ›ˆ๋ จ์‹œํ‚ด์œผ๋กœ์จ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.
์ด๋Š” image label์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ธฐ์— ์˜๋ฏธ์ ์œผ๋กœ(semantically) ์œ ์‚ฌํ•œ์ง€ ์—ฌ๋ถ€์— ๊ด€๊ณ„์—†์ด edge๋“ค๊ณผ์˜ ์œ ์‚ฌํ•œ ํŒจํ„ด์„ ๊ฐ€์ง„ ์ด๋ฏธ์ง€๋ฅผ ๊ฒ€์ƒ‰ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋Š” raw-pixel์— auto-encoder๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ํ›จ์”ฌ ๋‚˜์€ ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰ ๋ฐฉ๋ฒ•์„ ์ƒ์„ฑํ•ด์•ผ ํ•œ๋‹ค.
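The retrieval-by-feature-distance idea can be sketched as follows (the function name is ours):

```python
import numpy as np

def nearest_by_features(query_feat, train_feats, k=6):
    # indices of the k training images whose feature vectors (e.g. the
    # 4096-d last-hidden-layer activations) are closest in Euclidean
    # distance to the query image's feature vector
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(d)[:k]
```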

 

 

 

 

 

 

7. Discussion

- A large, deep CNN is capable of achieving record-breaking results.
- Removing any of the middle layers degrades performance, showing that depth really is important for achieving these results.

- ์‹คํ—˜์„ ๋‹จ์ˆœํ™”ํ•˜๊ธฐ ์œ„ํ•ด, ํŠนํžˆ label์ด ์ง€์ •๋œ ๋ฐ์ดํ„ฐ์˜ ์–‘์—์„œ ๊ทธ์— ์ƒ์‘ํ•˜๋Š” ์ฆ๊ฐ€๋ฅผ ์–ป์ง€ ์•Š๊ณ  ์‹ ๊ฒฝ๋ง์˜ ํฌ๊ธฐ๋ฅผ ํฌ๊ฒŒ ๋Š˜๋ฆด ์ˆ˜ ์žˆ๋Š” ์ถฉ๋ถ„ํ•œ ๊ณ„์‚ฐ ๋Šฅ๋ ฅ์„ ์–ป๋Š” ๊ฒฝ์šฐ์— ๋„์›€์ด ๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒํ–ˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  unsupervised pre-training์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋‹ค.

 

 

 

 

 

 

๐Ÿง ๋…ผ๋ฌธ ๊ฐ์ƒ_์ค‘์š”๊ฐœ๋… ํ•ต์‹ฌ ์š”์•ฝ

"ImageNet Classification with Deep Convolutional Neural Networks"
A research paper published in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It proposes a deep convolutional neural network called AlexNet, which achieved state-of-the-art performance at the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge).

 

[Key Concepts]

1. Convolutional Neural Networks
The paper proposes a deep CNN architecture consisting of multiple convolutional and subsampling layers followed by fully-connected layers.

2. Rectified Linear Units (ReLU)
The paper uses the ReLU activation function, which is simpler and more efficient than traditional activation functions such as sigmoid and tanh; it was one of the first major works to demonstrate ReLU's effectiveness at scale, which makes it all the more significant.

3. Local Response Normalization (LRN)
The paper proposes a normalization scheme called LRN, which provides a form of lateral inhibition and helps improve generalization performance.

4. Dropout
The paper makes use of dropout, a regularization technique that randomly removes some neurons during training to prevent overfitting.

5. Data Augmentation
The paper uses techniques such as random crops and horizontal flips to increase the size of the training set and make the model more robust.

6. State-of-the-Art Performance
The paper achieves state-of-the-art performance on the ILSVRC-2012 classification task, far surpassing previous methods.

Overall, the paper introduces several important deep-learning concepts, such as convolutional neural networks, the ReLU activation function, and data augmentation, and demonstrates their effectiveness by achieving state-of-the-art performance on a challenging computer-vision task.

 

 

 

 

 

 

🧠 Implementing the Architecture after Reading the Paper (with TensorFlow)

(Note: this implementation deviates slightly from the paper — dropout is also applied after the convolutional layers, LRN and the two-GPU kernel grouping are omitted, and the padding choices differ, so the layer sizes and parameter count do not match the paper exactly.)

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def AlexNet(input_shape, num_classes):
    input_layer = Input(shape=input_shape)
    
    # first convolutional layer
    x = Conv2D(96, kernel_size=(11,11), strides=(4,4), padding='valid', activation='relu')(input_layer)
    x = MaxPooling2D(pool_size=(3,3), strides=(2,2))(x)
    x = Dropout(0.25)(x)
    
    # second convolutional layer
    x = Conv2D(256, kernel_size=(5,5), strides=(1,1), padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=(3,3), strides=(2,2))(x)
    x = Dropout(0.25)(x)
    
    # third convolutional layer
    x = Conv2D(384, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu')(x)
    x = Dropout(0.25)(x)
    
    # fourth convolutional layer
    x = Conv2D(384, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu')(x)
    x = Dropout(0.25)(x)
    
    # fifth convolutional layer
    x = Conv2D(256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=(3,3), strides=(2,2))(x)
    x = Dropout(0.25)(x)
    
    # flatten the output from the convolutional layers
    x = Flatten()(x)
    
    # first fully connected layer
    x = Dense(4096, activation='relu')(x)
    x = Dropout(0.5)(x)
    
    # second fully connected layer
    x = Dense(4096, activation='relu')(x)
    x = Dropout(0.5)(x)
    
    # output layer
    output_layer = Dense(num_classes, activation='softmax')(x)
    
    # define the model with input and output layers
    model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
    
    return model
model = AlexNet(input_shape=(224,224,3), num_classes=1000)
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_12 (InputLayer)       [(None, 224, 224, 3)]     0         
                                                                 
 conv2d_281 (Conv2D)         (None, 54, 54, 96)        34944     
                                                                 
 max_pooling2d_59 (MaxPoolin  (None, 26, 26, 96)       0         
 g2D)                                                            
                                                                 
 dropout_21 (Dropout)        (None, 26, 26, 96)        0         
                                                                 
 conv2d_282 (Conv2D)         (None, 26, 26, 256)       614656    
                                                                 
 max_pooling2d_60 (MaxPoolin  (None, 12, 12, 256)      0         
 g2D)                                                            
                                                                 
 dropout_22 (Dropout)        (None, 12, 12, 256)       0         
                                                                 
 conv2d_283 (Conv2D)         (None, 12, 12, 384)       885120    
                                                                 
 dropout_23 (Dropout)        (None, 12, 12, 384)       0         
                                                                 
 conv2d_284 (Conv2D)         (None, 12, 12, 384)       1327488   
                                                                 
 dropout_24 (Dropout)        (None, 12, 12, 384)       0         
                                                                 
 conv2d_285 (Conv2D)         (None, 12, 12, 256)       884992    
                                                                 
 max_pooling2d_61 (MaxPoolin  (None, 5, 5, 256)        0         
 g2D)                                                            
                                                                 
 dropout_25 (Dropout)        (None, 5, 5, 256)         0         
                                                                 
 flatten_9 (Flatten)         (None, 6400)              0         
                                                                 
 dense_25 (Dense)            (None, 4096)              26218496  
                                                                 
 dropout_26 (Dropout)        (None, 4096)              0         
                                                                 
 dense_26 (Dense)            (None, 4096)              16781312  
                                                                 
 dropout_27 (Dropout)        (None, 4096)              0         
                                                                 
 dense_27 (Dense)            (None, 1000)              4097000   
                                                                 
=================================================================
Total params: 50,844,008
Trainable params: 50,844,008
Non-trainable params: 0
_________________________________________________________________

 

 

 

 

 
