๐Ÿง ๋…ผ๋ฌธ ํ•ด์„ ๋ฐ ์š”์•ฝ

 

๐Ÿ˜ถ Abstract

- NLP๋ถ„์•ผ์—์„œ transformer๊ฐ€ ์‚ฌ์‹ค์ƒ standardํ•˜์˜€์ง€๋งŒ, vision์—์„œ๋Š” ๋งค์šฐ ๋“œ๋ฌผ๊ฒŒ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.
Vision์—์„œ๋Š” attention ์‚ฌ์šฉ ์‹œ, CNN๊ณผ ํ˜ผํ•ฉํ•ด ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ ์ „์ฒด๊ตฌ์กฐ์—์„œ ๋ช‡๊ฐœ์˜ ๊ตฌ์„ฑ์š”์†Œ๋งŒ ๋Œ€์ฒด๋˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์‚ฌ์šฉ๋˜์–ด ์™”๋‹ค.
๋ณธ ์—ฐ๊ตฌ๋Š” CNN์— ์˜์กดํ•˜๋Š” ๊ฒƒ์ด ๋ถˆํ•„์š”ํ•˜๋ฉฐ ์˜ค์ง transformer๋งŒ์œผ๋กœ image patch๋“ค์˜ sequence๋ฅผ ์ ์šฉํ•ด image classification task์— ๋งค์šฐ ์ž˜ ๋™์ž‘ํ•จ์„ ์‹คํ—˜์ ์œผ๋กœ ์ฆ๋ช…ํ•˜์˜€๋‹ค.
๋งŽ์€ dataset์„ pre-trainํ•œ ํ›„ small~midsize์˜ ์ด๋ฏธ์ง€ ์ธ์‹ bench mark dataset์— ์ „์ดํ•™์Šต ์‹œ, ์—ฌํƒ€ CNN๋ณด๋‹ค ViT๊ฐ€ ์š”๊ตฌ๊ณ„์‚ฐ๋Ÿ‰์€ ์ ์œผ๋ฉด์„œ๋„ S.O.T.A๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

 

 

1. Introduction

- "Self-Attention"๊ธฐ๋ฐ˜ ๊ตฌ์กฐ ํŠนํžˆ๋‚˜ Transformer[Vaswani2017]๋Š” NLP task์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋ฉฐ ๊ฐ€์žฅ ์ง€๋ฐฐ์ ์ด๊ณ  ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ์‹์€ ๋งค์šฐ ๊ธด text์— ๋Œ€ํ•œ pre-train ์ดํ›„ ์ž‘๊ณ  ๊ตฌ์ฒด์ ์ธ task์˜ dataset์— fine-tuningํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
transformer์˜ ํŠน์ง• ์ค‘ ํ•˜๋‚˜์ธ ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ๊ณผ ํ™•์žฅ์„ฑ ๋•๋ถ„์— ์ „๋ก€์—†๋Š” ํฌ๊ธฐ(100์–ต๊ฐœ ์ด์ƒ์˜ parameter)์˜ model์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค. (model๊ณผ dataset ํฌ๊ธฐ์˜ ์ฆ๊ฐ€์—๋„ ์„ฑ๋Šฅ์ด saturating๋˜์ง€ ์•Š์Œ)

- In computer vision, however, CNN-style architectures remain dominant. Inspired by NLP's success, there have been experiments that combine CNNs with "self-attention", and others that replace all convolutional layers with self-attention.
The latter are theoretically efficient, but attention's distinctive computation pattern makes them hard to run efficiently in practice.
(โˆต attending over an entire image at once ≫ attending over word vectors)
Hence, models like ResNet remain S.O.T.A for large-scale image recognition.

- Inspired by NLP's successes, this work experiments with applying a standard transformer architecture, only slightly modified, directly to images.
To do so, an image is split into patches, and the sequence of linear embeddings of these patches is fed to the transformer as input.
(Patches are treated the same way as tokens (words) in an NLP transformer.)
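As a concrete illustration of this patch-splitting step, here is a minimal NumPy sketch; the function name and shapes are illustrative, not taken from the paper:

```python
# Minimal sketch of splitting an image into flattened patches (illustrative).
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into N flattened (P*P*C)-dim patch vectors."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)

x = np.random.rand(224, 224, 3)
print(image_to_patches(x).shape)  # (196, 768): 14*14 patches of 16*16*3 values
```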

- When trained on mid-sized datasets such as ImageNet without strong regularization, these models score a few accuracy points below ResNet-like architectures of comparable size.
This is expected: CNNs come with built-in inductive biases, whereas the Transformer lacks them, so training on an insufficient amount of data was predicted to give poor generalization.

- However, the researchers found that with large datasets (on the order of 14M to 300M images), sheer scale can trump inductive bias.
When ViT is pre-trained on a sufficiently large dataset and then transferred to smaller datasets, it matches or outperforms the existing S.O.T.A models.
cf)
the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

 

 

2. Related Work

- The Transformer [Vaswani2017] became the S.O.T.A method for machine translation in NLP, and large Transformer-based models are typically pre-trained on very large corpora and then fine-tuned for the task at hand: BERT [Devlin2019] uses a denoising self-supervised pre-training task, while GPT [Radford2018, 2019; Brown2020] uses language modeling as its pre-training task.

- image์— ๋Œ€ํ•œ Naiveํ•œ "Self-Attention" ์ ์šฉ์€ ๊ฐ pixel์ด ๋‹ค๋ฅธ ๋ชจ๋“  pixel์— ์ฃผ๋ชฉ(attend)ํ•  ๊ฒƒ์„ ์š”๊ตฌํ•œ๋‹ค.
(= ํ•˜๋‚˜์˜ pixel์„ embedding ์‹œ, ๋‹ค๋ฅธ pixel๋„ embedding์— ์ฐธ์—ฌํ•  ๊ฒƒ์ด ์š”๊ตฌ๋œ๋‹ค๋Š” ์˜๋ฏธ)
pixel์ˆ˜์—์„œ 2์ฐจ์ ์ธ ๊ณ„์‚ฐ๋ณต์žก๋„๋ฅผ ์•ผ๊ธฐํ•˜๋ฉฐ ์ด๋กœ ์ธํ•ด ๋‹ค์–‘ํ•œ input size๋กœ ํ™•์žฅ๋˜๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค.
์ฆ‰, image processing์— transformer๋ฅผ ์ ์šฉํ•˜๋ ค๋ฉด ๋ช‡๊ฐ€์ง€ ๊ทผ์‚ฌ(approximation)๊ฐ€ ํ•„์š”ํ•˜๋‹ค.
โˆ™ local self-attention 
โˆ™ sparse attention
โˆ™ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ block์— scale attention ์ ์šฉ
์ด๋Ÿฐ specialized attention๊ตฌ์กฐ๋“ค์€ computer vision ๋ถ„์•ผ์—์„œ ๊ดœ์ฐฎ์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์ง€๋งŒ ํšจ์œจ์  ๊ตฌํ˜„์„ ์œ„ํ•ด์„œ๋Š” ๋ณต์žกํ•œ engineering์ด ํ•„์š”ํ•˜๋‹ค.
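For intuition about the quadratic blow-up, a quick arithmetic check with illustrative numbers for a 224×224 input:

```python
# Per-pixel vs. per-patch attention: number of query-key pairs per head.
pixels = 224 * 224            # tokens if every pixel attends to every pixel
patches = (224 // 16) ** 2    # tokens if 16x16 patches are used instead
print(pixels ** 2)            # 2_517_630_976 attention pairs per head
print(patches ** 2)           # 38_416 attention pairs per head
```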

- self-attention์„ ์ด์šฉํ•ด feature map์„ augmentationํ•˜๊ฑฐ๋‚˜ CNN์˜ output์— attention์„ ์ถ”๊ฐ€์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋“ฑ CNN๊ณผ self-attention์„ ์œตํ•ฉํ•˜๋ ค๋Š” ๋งŽ์€ ์—ฐ๊ตฌ๋„ ์ด๋ฃจ์–ด ์กŒ๋‹ค.

- The model most related to this work is the one introduced by Cordonnier et al. (2020), which extracts 2×2 patches from the input image and applies full self-attention on top. ViT is very similar, but this work additionally demonstrates that large-scale pre-training makes a vanilla transformer competitive with the S.O.T.A.
Moreover, the Cordonnier et al. (2020) model uses a small patch size of 2×2 pixels, which limits it to small-resolution images, whereas ViT handles medium-resolution images as well.

- ์ตœ๊ทผ์˜ ๋˜๋‹ค๋ฅธ ๊ด€๋ จ์žˆ๋Š” ๋ชจ๋ธ์€ iGPT[Chen2020]๋กœ iGPT๋Š” image์˜ resolution๊ณผ color space๋ฅผ ์ค„์ธ ํ›„ pixel๋“ค์— transformer๋ฅผ ์ ์šฉํ•œ ์ƒ์„ฑ๋ชจ๋ธ๋กœ์จ "Unsupervised"๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จ๋˜์—ˆ๊ณ  ์ด๋ฅผ ํ†ตํ•ด ์–ป์–ด์ง„ representation์€ classification์„ ์œ„ํ•ด ์ „์ดํ•™์Šต์ด๋‚˜ ์„ ํ˜•์ ์œผ๋กœ ํƒ์ƒ‰๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ์˜ˆ์‹œ๋กœ ํ™œ์šฉ ์‹œ ImageNet์—์„œ ์ตœ๋Œ€ ์ •ํ™•๋„ 72%๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.

- This work also extends the line of research that uses recognition datasets with more images than standard ImageNet: it empirically explores how CNN performance varies with dataset size, and studies CNN transfer learning from large-scale datasets (ImageNet-21k, JFT-300M), but, unlike prior work, with a Transformer instead of a ResNet.

 

 

 

3. Method

- The model design follows the original Transformer [Vaswani2017] as closely as possible. This has the advantage that the easily scalable NLP transformer architectures, along with their efficient implementations, can be used almost out of the box.

3.1  Vision Transformer (ViT)
- ๊ธฐ์กด์˜ Transformer๋Š” token embedding์˜ 1D sequence๋ฅผ input์œผ๋กœ ๋ฐ›๋Š”๋‹ค.

- BERT์˜ [CLS]token์ฒ˜๋Ÿผ embedding๋œ patch๋“ค์˜ ๊ฐ€์žฅ ์•ž์— ํ•˜๋‚˜์˜ learnableํ•œ class token embedding vector๋ฅผ ์ถ”๊ฐ€ํ•œ๋‹ค.

- Position Embedding์€ image์˜ ์œ„์น˜์ •๋ณด๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด patch embedding ์‹œ trainableํ•œ position embeddings๊ฐ€ ๋”ํ•ด์ง„๋‹ค.
image๋ฅผ ์œ„ํ•ด ๊ฐœ์„ ๋œ 2D-aware position embedding์„ ์‚ฌ์šฉํ•ด ๋ณด์•˜์ง€๋งŒ 1D Position Embedding๊ณผ์˜ ์œ ์˜๋ฏธํ•œ ์„ฑ๋Šฅํ–ฅ์ƒ์ด ์—†์–ด์„œ "1D Position Embedding"์„ ์‚ฌ์šฉํ•œ๋‹ค. (Appendix D.4)
์ด๋ ‡๊ฒŒ embedding๋œ ๋ฒกํ„ฐ๋“ค์˜ sequence๋ฅผ encoder์˜ ์ž…๋ ฅ์— ๋„ฃ๋Š”๋‹ค.

- Transformer์˜ Encoder๋ถ€๋ถ„์€ Multi-Head Attention(์‹ 2,3)์ธต๋“ค๊ณผ MLP๊ฐ€ ๊ต์ฐจ๋กœ ๊ตฌ์„ฑ๋˜๋Š”๋ฐ, ์ด๋Š” ํ•˜๋‚˜์˜ image์ด๋”๋ผ๋„ ์ฐจ์›์„ ์ชผ๊ฐ  ๋’ค multi-head๋ฅผ ๊ด€์ฐฐํ•˜๋Š” ํ˜„์ƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.






3.2  Fine-tuning & Higher Resolution
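In the paper, fine-tuning is typically done at higher resolution than pre-training: the patch size stays the same, so the patch sequence gets longer, and the pre-trained position embeddings are 2D-interpolated to fit the new grid. A minimal sketch of that interpolation, assuming a square patch grid (the function name is illustrative):

```python
# 2D interpolation of pre-trained position embeddings for a larger grid
# (illustrative sketch, PyTorch).
import torch
import torch.nn.functional as F

def resize_pos_embed(pos: torch.Tensor, new_n: int) -> torch.Tensor:
    """pos: (1, 1+N, D) with a leading [CLS] slot; returns (1, 1+new_n, D)."""
    cls_tok, grid = pos[:, :1], pos[:, 1:]
    s = int(grid.shape[1] ** 0.5)                    # old grid side, e.g. 14
    new_s = int(new_n ** 0.5)                        # new grid side, e.g. 24
    grid = grid.reshape(1, s, s, -1).permute(0, 3, 1, 2)   # (1, D, s, s)
    grid = F.interpolate(grid, size=(new_s, new_s), mode="bicubic",
                         align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_s * new_s, -1)
    return torch.cat([cls_tok, grid], dim=1)

pos = torch.randn(1, 1 + 14 * 14, 768)               # pre-trained at 224/16
print(resize_pos_embed(pos, 24 * 24).shape)          # 384/16 -> (1, 577, 768)
```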

 

 

 

 

4. Experiments

์—ฐ๊ตฌ์ง„๋“ค์€ ResNet, ViT, ๊ทธ๋ฆฌ๊ณ  ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ชจ๋ธ์˜ representation learning capabilities๋ฅผ ํ‰๊ฐ€ํ–ˆ๋‹ค. ๊ฐ ๋ชจ๋ธ์ด ์š”๊ตฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ์–‘์ƒ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์‚ฌ์ด์ฆˆ์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์‚ฌ์ „ํ›ˆ๋ จ์„ ์ง„ํ–‰ํ–ˆ๊ณ  ๋งŽ์€ ๋ฒค์น˜๋งˆํฌ ํ…Œ์Šคํฌ์— ๋Œ€ํ•ด ํ‰๊ฐ€ ์‹ค์‹œํ–ˆ๋‹ค.
ViT๊ฐ€ ๋ชจ๋ธ์˜ ์‚ฌ์ „ํ›ˆ๋ จ ์—ฐ์‚ฐ ๋Œ€๋น„ ์„ฑ๋Šฅ ๋ถ€๋ถ„์—์„œ ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ๋” ๋‚ฎ์€ pre-training ๋น„์šฉ์œผ๋กœ ๊ฐ€์žฅ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ self-supervised๋ฅผ ์ด์šฉํ•œ ์ž‘์€ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•ด ViT๊ฐ€ self-supervised์—์„œ๋„ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ์„ ๋ณด์˜€๋‹ค.

4.1  Setup
โˆ™ Datasets

- ๋ชจ๋ธ ํ™•์žฅ์„ฑ(scalability)์„ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•ด
1,000๊ฐœ class์™€ 1.3M image๊ฐ€ ์žˆ๋Š” ILSVRC-2012 ImageNet dataset,
21,000 class์™€ 14M image๊ฐ€ ์žˆ๋Š” superset ImageNet-21k(Deng2009),
18,000 class์™€ 303M์˜ ๊ณ ํ•ด์ƒ๋„ image์˜ JFT(Sun2017)๋ฅผ ์‚ฌ์šฉ.

[Kolesnikov2020]์— ์ด์–ด downstream task์˜ testset์„ ํ†ตํ•ด pre-training dataset์„ ์ค‘๋ณต ์ œ๊ฑฐ.
์ด dataset์— ๋Œ€ํ•ด ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ benchmark์— ์ „์ดํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค:
์›๋ณธ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋ผ๋ฒจ๊ณผ ์ •๋ฆฌ๋œ ReaL ๋ผ๋ฒจ์˜ ImageNet(Beyer2020), CIFAR-10/100(Krizhevsky2009), Oxford-IIIT Pets(Parkhi2012) ๋ฐ Oxford Flowers-102(Nilsback & Ziserman, 2008).
์ด๋•Œ, dataset์˜ ์ „์ฒ˜๋ฆฌ๊ณผ์ •์€ [Kolesnikov2020]๋ฅผ ๋”ฐ๋ฅธ๋‹ค.


- Evaluation is also carried out on the 19-task VTAB classification suite (Zhai 2019).
VTAB evaluates low-data transfer to diverse tasks, using 1,000 training examples per task.
The tasks fall into three groups: Natural (Pets, CIFAR, etc.), Specialized (medical and satellite imagery), and Structured (tasks requiring geometric understanding, such as localization).



โˆ™ Model Variants
- As summarized in Table 1, the ViT configurations are based on those used for BERT (Devlin 2019): the "Base" and "Large" models are adopted directly from BERT, and a larger "Huge" model is added. Below, a brief notation indicates the model size and the input patch size.
ex) ViT-L/16 denotes the "Large" variant with a 16 × 16 input patch size.
Note that the Transformer's sequence length is inversely proportional to the square of the patch size, so models with smaller patch sizes are computationally more expensive (quick check below).

- ๊ธฐ๋ณธ CNN์˜ ๊ฒฝ์šฐ ResNet์„ ์‚ฌ์šฉํ•œ๋‹ค.
๋‹ค๋งŒ, Batch Normalization(Ioffe & Szegedy, 2015)๋Œ€์‹  Group Normalization(Wu & He, 2018)์œผ๋กœ ๋ฐ”๊พธ๊ณ  ํ‘œ์ค€ํ™”๋œ Convolution(Qiao 2019)์„ ์‚ฌ์šฉํ•œ๋‹ค.
์ด ์ˆ˜์ •์€ transfer๋ฅผ ๊ฐœ์„ ํ•˜๋ฉฐ[Kolesnikov2020], ์ด๋ฅผ "ResNet(BiT)๋ผ ๋ถ€๋ฅธ๋‹ค.

- For the hybrids, the intermediate feature maps are fed into ViT with a patch size of one "pixel". To experiment with different sequence lengths, either
(i) the output of stage 4 of a regular ResNet50 is taken, or
(ii) stage 4 is removed, the same number of layers is placed in stage 3 (keeping the total number of layers), and the output of this extended stage 3 is taken.
Option (ii) results in a 4x longer sequence length, and thus a more expensive ViT model.


โˆ™ Training & Fine-tuning
- All models, including ResNets, are trained with Adam (Kingma & Ba, 2015) (β1 = 0.9, β2 = 0.999),
a batch size of 4096, and a high weight decay of 0.1, which proves useful for the transfer of all models.
(Appendix D.1 shows that, in contrast to common practice, Adam works slightly better than SGD for ResNets in this setting.)

A linear learning-rate warmup and decay are used (see Appendix B.1 for details).
For fine-tuning, SGD with momentum and a batch size of 512 are used for all models, as described in Appendix B.1.1.
For the ImageNet results in Table 2, fine-tuning was done at higher resolution (512 for ViT-L/16 and 518 for ViT-H/14), together with Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran 2019; Wang 2020). A hedged sketch of this training setup follows below.



โˆ™ Metric
- Results on the downstream datasets are reported through either few-shot or fine-tuning accuracy.
Fine-tuning accuracy captures each model's performance after fine-tuning on the respective dataset.
Few-shot accuracy is obtained by solving a regularized least-squares regression problem that maps the representations of training images to {-1, 1}^K target vectors; this formulation allows the exact solution to be recovered in closed form.
The focus is mainly on fine-tuning performance, but fine-tuning is sometimes too costly, so linear few-shot accuracy is used for fast on-the-fly evaluation. (A sketch of the closed-form solve follows below.)




4.2 Comparison to S.O.T.A
- First, the largest models, ViT-H/14 and ViT-L/16, are compared to state-of-the-art CNNs.
The first comparison point is Big Transfer (BiT), which performs supervised transfer learning with large ResNets.
The second is Noisy Student (Xie 2020), a large EfficientNet trained using semi-supervised learning on ImageNet and on JFT-300M with the labels removed.

- ํ‘œ 2์˜ ๊ฒฐ๊ณผ๋Š” JFT-300M์—์„œ pre-train๋œ ์ž‘์€ ViT-L/16 ๋ชจ๋ธ์€ ๋ชจ๋“  ์ž‘์—…์—์„œ BiT-L์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ๋™์‹œ์— ํ›ˆ๋ จ์— ํ›จ์”ฌ ์ ์€ ๊ณ„์‚ฐ ๋ฆฌ์†Œ์Šค๋ฅผ ํ•„์š”๋กœ ํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค.
๋Œ€ํ˜• ๋ชจ๋ธ์ธ ViT-H/14๋Š” ํŠนํžˆ ImageNet, CIFAR-100 ๋ฐ VTAB suite ๋“ฑ ๊นŒ๋‹ค๋กœ์šด dataset์—์„œ ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค.

ํฅ๋ฏธ๋กœ์šด ์ ์€, ์ด ๋ชจ๋ธ์€ ์ด์ „๊ธฐ์ˆ ๋ณด๋‹ค pre-train์— ํ›จ์”ฌ ์ ์€ ์‹œ๊ฐ„์ด ์†Œ์š”๋˜์—ˆ๋‹ค.
ํ•˜์ง€๋งŒ pre-train์˜ ํšจ์œจ์„ฑ์€ architecture์˜ ์„ ํƒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ training schedule, optimizer, weight_decay ๋“ฑ ๊ฐ™์€ ๋‹ค๋ฅธ parameter์—๋„ ์˜ํ–ฅ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์— ์ฃผ๋ชฉํ•œ๋‹ค.
Section 4.4์—์„œ๋Š” ๋‹ค์–‘ํ•œ architecture์— ๋Œ€ํ•œ ์„ฑ๋Šฅ๊ณผ computing์— ๋Œ€ํ•ด ์ œ์–ด๋œ ์—ฐ๊ตฌ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.

- ๋งˆ์ง€๋ง‰์œผ๋กœ, ImageNet-21k dataset์œผ๋กœ pre-train๋œ ViT-L/16 ๋ชจ๋ธ์€ ๋Œ€๋ถ€๋ถ„์˜ dataset์—์„œ๋„ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋ฉด์„œ pre-train์— ํ•„์š”ํ•œ ๋ฆฌ์†Œ์Šค๋Š” ๋” ์ ์Šต๋‹ˆ๋‹ค. ์•ฝ 30์ผ ๋‚ด์— 8๊ฐœ์˜ ์ฝ”์–ด๊ฐ€ ์žˆ๋Š” ํ‘œ์ค€ ํด๋ผ์šฐ๋“œ TPUv3๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ต์œก๋  ์ˆ˜ ์žˆ๋‹ค.

- Figure 2 decomposes the VTAB tasks into their groups and compares against the previous S.O.T.A methods on this benchmark: BiT, VIVI (a ResNet co-trained on ImageNet and Youtube; Tschannen 2020), and S4L (supervised plus semi-supervised learning on ImageNet; Zhai 2019). ViT-H/14 outperforms BiT-R152x4 and the other methods on the Natural and Structured tasks, while on the Specialized tasks the performance of the top two models is similar.



4.3 Pre-training Data Requirements
- ViT performs well when pre-trained on the large JFT-300M dataset, despite having fewer inductive biases than ResNet.
So how crucial is the dataset size? Two series of experiments address this.


 โ‘  ํฌ๊ธฐ๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” dataset์— ๋Œ€ํ•ด ViT ๋ชจ๋ธ์„ pre-trainํ•œ๋‹ค: ImageNet, ImageNet-21k ๋ฐ JFT-300M.
์†Œ๊ทœ๋ชจ dataset์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด weight_decay, dropout, label-smoothing์ด๋ผ๋Š” 3๊ฐ€์ง€ ๊ธฐ๋ณธ์ ์ธ Regularization parameter๋“ค์„ ์ตœ์ ํ™”ํ•œ๋‹ค. ๊ทธ๋ฆผ 3์€ ImageNet์„ pre-train๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค
cf. (๋‹ค๋ฅธ dataset์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ํ‘œ 5์— ๋‚˜์™€ ์žˆ๋‹ค).
ImageNet pre-train๋ชจ๋ธ๋„ ์ „์ดํ•™์Šต๋˜์–ด์žˆ์œผ๋‚˜ ImageNet์—์„œ๋Š” ๋‹ค์‹œ ์ „์ดํ•™์Šต์ด ์ง„ํ–‰๋œ๋‹ค. (์ „์ดํ•™์Šต ์‹œ ํ•ด์ƒ๋„๊ฐ€ ๋†’์•„์ง€๋ฉด ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๊ธฐ ๋•Œ๋ฌธ).
๊ฐ€์žฅ ์ž‘์€ dataset์ธ ImageNet์—์„œ pre-train ์‹œ, ViT-Large ๋ชจ๋ธ์€ (moderate) regularization์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ViT-Base ๋ชจ๋ธ์— ๋น„ํ•ด ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค. JFT-300M๋งŒ ์žˆ์œผ๋ฉด ๋” ํฐ ๋ชจ๋ธ์˜ ์ด์ ์„ ์ตœ๋Œ€ํ•œ ๋ˆ„๋ฆด ์ˆ˜ ์žˆ๋Š”๋ฐ, ๊ทธ๋ฆผ 3์€ ๋˜ํ•œ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ BiT ๋ชจ๋ธ์ด ์ฐจ์ง€ํ•˜๋Š” ์„ฑ๋Šฅ ์˜์—ญ์„ ๋ณด์—ฌ์ค€๋‹ค. BiT CNN์€ ImageNet์—์„œ ViT๋ฅผ ๋Šฅ๊ฐ€ํ•˜์ง€๋งŒ dataset์ด ํด์ˆ˜๋ก ViT๊ฐ€ ์•ž์„œ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 
 โ‘ก Models are trained on random subsets of 9M, 30M, and 90M images as well as on the full JFT-300M dataset.
No additional regularization is applied on the smaller subsets; the same hyper-parameters are used for all settings.
This way, the intrinsic model properties are assessed, not the effect of regularization.
Early stopping is used, however, and the best validation accuracy achieved during training is reported.
To save compute, few-shot linear accuracy is reported instead of full fine-tuning accuracy; the results are shown in Figure 4. ViT overfits more than ResNet at comparable computational cost on the smaller datasets.
ex) ViT-B/32 is slightly faster than ResNet50.

It performs much worse on the 9M subset, but better on the 90M+ subsets; the same holds for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets,
whereas for larger datasets, learning the relevant patterns directly from the data is sufficient, even beneficial.





4.4. Scaling Study
- JFT-300M์˜ transfer ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์ œ์–ด๋œ scaling ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.
์ด ์„ค์ •์—์„œ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋Š” ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ๋ณ‘๋ชฉ ํ˜„์ƒ์„ ์ผ์œผํ‚ค์ง€ ์•Š์œผ๋ฉฐ, ๊ฐ ๋ชจ๋ธ์˜ accuracy/pre-train cost๋ฅผ ํ‰๊ฐ€ํ•œ๋‹ค.
model set์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
- 7 epochs์— ๋Œ€ํ•ด ์‚ฌ์ „ํ›ˆ๋ จ๋œ ResNets(R50x1, R50x2 R101x1, R152x1, R152x2, R152x2)
- 7 epochs์— ๋Œ€ํ•ด ์‚ฌ์ „ ํ›ˆ๋ จ๋œ R16 ๋ฐ R1450
- 14 epochs์— ๋Œ€ํ•ด ์‚ฌ์ „ ํ›ˆ๋ จ๋œ R152x2 ๋ฐ R200x3
- 14 epochs์— ๋Œ€ํ•ด ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ViT-B/32, B/16, L/32, L/16
- 14 epochs์— ๋Œ€ํ•ด ์‚ฌ์ „ ํ›ˆ๋ จ๋œ R50+ViT-L/16
(hybrid์˜ ๊ฒฝ์šฐ ๋ชจ๋ธ ์ด๋ฆ„ ๋ ์ˆซ์ž๋Š” patch size๊ฐ€ ์•„๋‹Œ, ResNet ๋ฐฑ๋ณธ์˜ ์ด downsampling๋น„์œจ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค).


- ๊ทธ๋ฆผ 5์—๋Š” ์ด ์‚ฌ์ „ ๊ต์œก ๊ณ„์‚ฐ ๋Œ€๋น„ ์ด์ „ ์„ฑ๋Šฅ์ด ๋‚˜์™€ ์žˆ๋‹ค(compute detail: Appendix D.5 ; model๋ณ„ detail: Appendix์˜ ํ‘œ 6).
์ด์— ๋Œ€ํ•ด ๋ช‡ ๊ฐ€์ง€ ํŒจํ„ด์„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋‹ค:
i) ViT๋Š” accuracy/computing ์ ˆ์ถฉ์—์„œ ResNets๋ฅผ ์••๋„ํ•œ๋‹ค. ViT๋Š” ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ์•ฝ 2~4๋ฐฐ ์ ์€ ์ปดํ“จํŒ…์„ ์‚ฌ์šฉํ•œ๋‹ค(ํ‰๊ท  5๊ฐœ ์ด์ƒ์˜ dataset).
ii) hybrid๋Š” ์ ์€ computing์œผ๋กœ ViT๋ฅผ ์•ฝ๊ฐ„ ๋Šฅ๊ฐ€ํ•˜๋‚˜, ๋” ํฐ ๋ชจ๋ธ์—์„œ๋Š” ๊ทธ ์ฐจ์ด๊ฐ€ ์‚ฌ๋ผ์ง„๋‹ค.
์ด ๊ฒฐ๊ณผ๋Š” Convolution์˜ local feature processing์ด ๋ชจ๋“  ํฌ๊ธฐ์—์„œ ViT๋ฅผ ์ง€์›ํ•  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์†Œ ๋†€๋ผ์šด ๊ฒฐ๊ณผ๋ผ ํ•  ์ˆ˜ ์žˆ๋‹ค.
iii) ViT๋Š” ์‹œ๋„ํ•œ ๋ฒ”์œ„ ๋‚ด์—์„œ ํฌํ™”๋˜์ง€ ์•Š๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚˜ ํ–ฅํ›„ ํ™•์žฅ ๋…ธ๋ ฅ์— ๋™๊ธฐ๋ฅผ ์ค€๋‹ค.




4.5. Inspecting Vision Transformer
- To understand how ViT processes image data, its internal representations are analyzed.
The first layer of ViT linearly projects the flattened patches into a lower-dimensional space (Eq. 1).
Figure 7 (left) shows the top principal components of the learned embedding filters. These components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.


- ํˆฌ์˜ํ•œ ํ›„, ํ•™์Šต๋œ position embedding์ด patch์˜ representation์— ์ถ”๊ฐ€๋œ๋‹ค.
๊ทธ๋ฆผ 7(๊ฐ€์šด๋ฐ)์€ ๋ชจ๋ธ์ด position embedding์˜ ์œ ์‚ฌ์„ฑ์—์„œ image๋‚ด distance๋ฅผ encodingํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.
์ฆ‰, ๋” ๊ฐ€๊นŒ์šด patch๋Š” ๋” ์œ ์‚ฌํ•œ position embedding์„ ๊ฐ–๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค.

- ๋˜ํ•œ ํ–‰-์—ด(row-column) ๊ตฌ์กฐ๊ฐ€ ๋‚˜ํƒ€๋‚œ๋‹ค. ๋™์ผํ•œ ํ–‰/์—ด์— ์žˆ๋Š” ํŒจ์น˜๋Š” ์œ ์‚ฌํ•œ ์ž„๋ฒ ๋”ฉ์„ ๊ฐ–๋Š”๋‹ค.
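A small sketch of this similarity analysis; the embeddings here are random stand-ins, so only a trained model would show the row/column and distance structure described above:

```python
# Cosine similarity between one patch's position embedding and all others.
import torch
import torch.nn.functional as F

pos = torch.randn(14 * 14, 768)          # stand-in for learned pos. embeddings
i = 5 * 14 + 7                           # reference patch at row 5, col 7
sim = F.cosine_similarity(pos[i : i + 1], pos, dim=-1).reshape(14, 14)
print(sim.shape)                         # (14, 14) similarity map over the grid
```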

- Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants yield no improvement (Appendix D.4).


- Self-Attention์„ ํ†ตํ•ด ViT๋Š” ์ „์ฒด์ ์œผ๋กœ ์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ๊ฐ€์žฅ ๋‚ฎ์€ ์ธต์—์„œ๋„ ์ด๋ฏธ์ง€๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค. ์—ฐ๊ตฌ์ž๋“ค์€ ์‹ ๊ฒฝ๋ง์ด ์ด ์ˆ˜์šฉ๋ ฅ์„ ์–ด๋Š์ •๋„๊นŒ์ง€ ์‚ฌ์šฉํ•˜๋Š”์ง€ ์กฐ์‚ฌํ•œ๋‹ค.
ํŠนํžˆ attention weight๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ •๋ณด๊ฐ€ ํ†ตํ•ฉ๋˜๋Š” image space์˜ average distance๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค(๊ทธ๋ฆผ 7, ์˜ค๋ฅธ์ชฝ).
์ด "attention distance"๋Š” CNN์˜ receptive field size์™€ ๋น„์Šทํ•˜๋‹ค.


- ์—ฐ๊ตฌ์ž๋“ค์€ ์ผ๋ถ€ head๊ฐ€ ์ด๋ฏธ ๋‚ฎ์€ ์ธต์— ์žˆ๋Š” ๋Œ€๋ถ€๋ถ„์˜ image์— ์ฃผ๋ชฉ์„ ํ•˜๋Š”๊ฒƒ์„ ๋ฐœ๊ฒฌ, ์ •๋ณด๋ฅผ globalํ•˜๊ฒŒ ํ†ตํ•ฉํ•˜๋Š” ํŠน์ง•์ด ๋ชจ๋ธ์— ์‹ค์ œ๋กœ ์‚ฌ์šฉ๋˜๋Š”๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋‹ค๋ฅธ attention heads๋Š” ๋‚ฎ์€ ์ธต์—์„œ ์ผ๊ด€๋˜๊ฒŒ attention distance๊ฐ€ ์ž‘๋‹ค. ์ด๋ ‡๊ฒŒ ๊ณ ๋„๋กœ localํ•˜๊ฒŒ ๋œ attention์€ transformer(๊ทธ๋ฆผ 7, ์˜ค๋ฅธ์ชฝ)์ด์ „์— ResNet์„ ์ ์šฉํ•˜๋Š” hybrid ๋ชจ๋ธ์—์„œ ๋œ ๋‘๋“œ๋Ÿฌ์ง€๋ฉฐ, ์ด๋Š” CNN์˜ ์ดˆ๊ธฐ convolution์ธต๊ณผ ์œ ์‚ฌํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.
๋˜ํ•œ ์‹ ๊ฒฝ๋ง์˜ ๊นŠ์ด์— ๋”ฐ๋ผ attention distance๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š”๋ฐ, Globalํ•˜๊ฒŒ ๋ชจ๋ธ์ด clasification๊ณผ ์˜๋ฏธ๋ก ์ (semantically)์œผ๋กœ ๊ด€๋ จ์ด ์žˆ๋Š” image์˜ ์˜์—ญ์— ์ฃผ๋ชฉํ•จ์„ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ๋‹ค. (๊ทธ๋ฆผ 6).




4.6. Self-Supervision
- Transformers show impressive performance on NLP tasks, but much of their success stems not only from their excellent scalability but also from large-scale self-supervised pre-training (Devlin 2019; Radford 2018). Mimicking the masked language modeling task used in BERT, a preliminary exploration of masked patch prediction for self-supervision is carried out. (A hedged sketch follows below.)
With "self-supervised pre-training", the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details.

 

 

5. Conclusion

- ์—ฐ๊ตฌ์ง„๋“ค์€ image ์ธ์‹์— ๋Œ€ํ•œ transformer์˜ ์ง์ ‘์ ์ธ ์ ์šฉ์„ ํƒ๊ตฌํ•˜๋ฉฐ ์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์—์„œ computer vision์— "self-attention"์„ ์‚ฌ์šฉํ•œ ๊ฒƒ๊ณผ ๋‹ฌ๋ฆฌ ์ดˆ๊ธฐ patch์ƒ์„ฑ์„ ์ œ์™ธํ•˜๊ณ  image๋ณ„ inductive biases๋ฅผ architecture์— ๋„์ž…ํ•˜์ง€ ์•Š๋Š”๋‹ค.
๋Œ€์‹ , image๋ฅผ ์ผ๋ จ์˜ patch๋กœ ํ•ด์„ํ•ด ๊ธฐ๋ณธ Transformer์˜ Encoder๋กœ ์ฒ˜๋ฆฌํ•œ๋‹ค.
์ด๋Š” ๋‹จ์ˆœํ•˜์ง€๋งŒ ํ™•์žฅ๊ฐ€๋Šฅํ•˜๊ธฐ์— ๋Œ€๊ทœ๋ชจ dataset์— ๋Œ€ํ•œ pre-train๊ณผ ๊ฒฐํ•ฉํ•˜๊ฒŒ ๋˜๋ฉด ๋งค์šฐ ์ž˜ ์ž‘๋™ํ•œ๋‹ค.
๋”ฐ๋ผ์„œ ViT๋Š” ๋งŽ์€ S.O.T.A๋ชจ๋ธ๊ณผ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋Šฅ๊ฐ€ํ•˜์ง€๋งŒ pre-train์— ์ƒ๋Œ€์ ์œผ๋กœ cheapํ•˜๊ฒŒ ์ž‘๋™ํ•œ๋‹ค.

- ๋ณธ ์—ฐ๊ตฌ์˜ ์ด๋Ÿฐ ์‹œ์ดˆ์ ์ธ ๊ฒฐ๊ณผ๋Š” ๊ณ ๋ฌด์ ์ด๋ผ ํ•  ์ˆ˜ ์žˆ์œผ๋‚˜ ๋งŽ์€ ๊ณผ์ œ๊ฐ€ ์—ฌ์ „ํžˆ ๋‚จ์•„์žˆ๋Š”๋ฐ,
โ‘  Detection, Segmentation ๋“ฑ์˜ Computer Vision์— ์ ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ
โ‘ก "Self-Supervised pre-training"์— ๋Œ€ํ•œ ๋ฐฉ๋ฒ•์„ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค.
์šฐ๋ฆฌ์˜ ์ด ์ดˆ๊ธฐ์‹คํ—˜์€ "Self-Supervised pre-training"์— ๊ฐœ์„ ์„ ๋ณด์—ฌ์ฃผ๊ธด ํ•˜์ง€๋งŒ ๋Œ€๊ทœ๋ชจ "Supervised pre-training"๋ณด๋‹ค๋Š” ์—ฌ์ „ํžˆ ํฐ ๊ฒฉ์ฐจ๊ฐ€ ์กด์žฌํ•˜๊ธฐ์— ViT์˜ ์ถ”๊ฐ€์ ์ธ ํ™•์žฅ์€ ์„ฑ๋Šฅํ–ฅ์ƒ์˜ ์—ฌ์ง€๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. 

 

 

 

๐Ÿ˜ถ Appendix

[A] Multi-Head Self Attention
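As a reference for the Eq. 2-3 citations in Section 3.1, here is a reconstruction of the standard multi-head self-attention formulation the paper uses (qkv projection, scaled dot-product attention, head concatenation), matching its Appendix A up to notation:

```latex
[q,\,k,\,v] = z\,U_{qkv}, \qquad U_{qkv} \in \mathbb{R}^{D \times 3D_h} \\
A = \operatorname{softmax}\!\bigl(q k^{\top} / \sqrt{D_h}\bigr), \qquad A \in \mathbb{R}^{N \times N} \\
\operatorname{SA}(z) = A\,v \\
\operatorname{MSA}(z) = [\operatorname{SA}_1(z);\,\cdots;\,\operatorname{SA}_k(z)]\,U_{msa}, \qquad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D}
```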

 

 

[B] Experiment Details




To this end, a small part of the training set is set aside as a validation set (2% for CIFAR, 1% for ImageNet). For ResNet, [Kolesnikov2020] is followed, and all fine-tuning experiments are run at 384 resolution.
As in [Kolesnikov2020], fine-tuning at a resolution different from (higher than) the training resolution is common.

 

 

 

[C] Additional Results

 ๋…ผ๋ฌธ์— ์ œ์‹œ๋œ ์ˆ˜์น˜์— ํ•ด๋‹นํ•˜๋Š” ์„ธ๋ถ€ ๊ฒฐ๊ณผ๋กœ ํ‘œ 5๋Š” ๋…ผ๋ฌธ์˜ ๊ทธ๋ฆผ 3์— ํ•ด๋‹นํ•˜๋ฉฐ ํฌ๊ธฐ๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” dataset์—์„œ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋‹ค์–‘ํ•œ ViT ๋ชจ๋ธ์˜ ์ „์ดํ•™์Šต ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค: ImageNet, ImageNet-21k ๋ฐ JFT-300M.
ํ‘œ 6์€ ๋…ผ๋ฌธ์˜ ๊ทธ๋ฆผ 5์— ํ•ด๋‹นํ•˜๋ฉฐ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ViT, ResNet ๋ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ชจ๋ธ์˜ ์ „์†ก ์„ฑ๋Šฅ๊ณผ ์‚ฌ์ „ ๊ต์œก์˜ ์˜ˆ์ƒ ๊ณ„์‚ฐ ๋น„์šฉ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

 

 

[D] Additional Analyses

 






 
