📌 Table of Contents

1. Multi-Lingual with Zero-shot Learning
2. Mono-Lingual Corpus

3. Transformer

😚 Closing Remarks...


1. Multi-Lingual with Zero-shot Learning

Now let's look at some advanced techniques for pushing NMT performance further.
In this section we cover multilingual NMT, where a single end-to-end model serves translation for several language pairs at once.

 

1.1 Zero-Shot Learning
An interesting approach called zero-shot learning was proposed for NMT. [Enabling Zero-Shot Translation; 2017]
Its defining property: train a single model on parallel corpora for several language pairs, and as a side effect it can also translate language pairs that appear in none of those corpora.
That is, the model can handle a language pair without ever having been shown data for it.
(Put simply, zero-shot learning is a model's ability to recognize and classify classes it was never directly exposed to in the training data.)

[Implementation]
∙ Simply insert a special token at the front of each source sentence in the existing parallel corpus and train as usual.
∙ The inserted token determines the target language.

|           | src sentence              | target sentence    |
|-----------|---------------------------|--------------------|
| Existing  | Hello, how are you?       | Hola, ¿cómo estás? |
| Zero-Shot | <2es> Hello, how are you? | Hola, ¿cómo estás? |
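The token-insertion step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: `add_target_token` and the `pairs` data are hypothetical, and the `<2es>` token format follows the table above.

```python
def add_target_token(src_sentence: str, tgt_lang: str) -> str:
    # Prepend the target-language token, e.g. "<2es>" for Spanish.
    return f"<2{tgt_lang}> {src_sentence}"

# Toy pairs from several language pairs, merged into one shared corpus;
# the prepended token alone tells the model which target language to emit.
pairs = [
    ("Hello, how are you?", "Hola, ¿cómo estás?", "es"),
    ("Hello, how are you?", "Bonjour, comment ça va ?", "fr"),
]
corpus = [(add_target_token(src, lang), tgt) for src, tgt, lang in pairs]
```

Everything else about the model and training procedure stays unchanged; the token is just another vocabulary item.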
์œ„์˜ ๋ชฉํ‘œ๋Š” ๋‹จ์ˆœํžˆ ๋‹ค๊ตญ์–ด NMT end2end๋ชจ๋ธ๊ตฌํ˜„์ด ์•„๋‹Œ, 
์„œ๋กœ ๋‹ค๋ฅธ ์–ธ์–ด์Œ์˜ corpus๋ฅผ ํ™œ์šฉํ•ด NMT์˜ ๋ชจ๋“  ์–ธ์–ด์Œ์— ๋Œ€ํ•ด ์ „์ฒด์ ์ธ ์„ฑ๋Šฅ์„ ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๋ ค๋Š” ๊ด€์ ์ด๋‹ค.
์ด์— ๋Œ€ํ•ด ์•„๋ž˜ 4๊ฐœ์˜ ๊ด€์ ์œผ๋กœ ์‹คํ—˜์ด ์ง„ํ–‰๋  ๊ฒƒ์ด๋‹ค.

 

Many-to-One
∙ Feed multiple source languages into the encoder and train.

Beyond the dataset for the language pair actually being tested, this setup lets the model draw extra information from the other languages' datasets trained alongside it, improving translation quality for that pair.
One-to-Many
∙ Feed multiple target languages into the decoder and train.

Unlike the method above, this one rarely yields a performance gain.
Worse, for a language pair whose corpus is already plentiful (e.g. ENG-FRA), oversampling actually hurts.
Many-to-Many
∙ Feed multiple languages into both the encoder and the decoder and train.

Most experimental results for this setting show a drop in performance.
(Still, considering that many different language pairs are crammed into a single model, the BLEU scores hold up reasonably well.)

 

Zero-Shot NMT test
∙ Evaluate a model trained as above on a language pair absent from the training corpus.

|     | Method                         | Zero-Shot | BLEU  |
|-----|--------------------------------|-----------|-------|
| (a) | PBMT Bridge                    | X         | 28.99 |
| (b) | NMT Bridge                     | X         | 30.91 |
| (c) | NMT POR→SPA                    | X         | 31.50 |
| (d) | Model 1: POR→ENG, ENG→SPA      | O         | 21.62 |
| (e) | Model 2: ENG↔POR, ENG↔SPA      | O         | 24.75 |
| (f) | Model 2 + incremental training | X         | 31.77 |
(a), (b)
The bridge methods translate in two stages, using English as a pivot language.
Phrase-based machine translation (PBMT) is a form of statistical machine translation (SMT).

(c)
The NMT POR→SPA row is a baseline trained the conventional way on a plain parallel corpus.
Naturally, this is a score the zero-shot training schemes cannot surpass.

(d), (e)
Model 1 trains POR→ENG and ENG→SPA in a single model,
while Model 2 trains ENG↔POR and ENG↔SPA in a single model.

(f)
Model 2 + incremental training takes a conventional model trained on a smaller corpus than (c) and continues training it in the Model 2 fashion.

๋น„๋ก ๋ชจ๋ธ1๊ณผ ๋ชจ๋ธ2๋Š” ํ›ˆ๋ จ์ค‘ ํ•œ๋ฒˆ๋„ POR→SPA parallel corpus๋ฅผ ๋ณด์ง€ ๋ชปํ–ˆ์ง€๋งŒ, 20์ด ๋„˜๋Š” BLEU๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.
ํ•˜์ง€๋งŒ, ๋ฌผ๋ก  (a), (b)๋ณด๋‹ค๋Š” ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค.
๋‹คํ–‰ํžˆ๋„ (f)์˜ ๊ฒฝ์šฐ, (c)๋ณด๋‹ค ํฐ ์ฐจ์ด๋Š” ์•„๋‹ˆ๋‚˜ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚จ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

parallel corpus์˜ ์–‘์ด ์–ผ๋งˆ๋˜์ง€์•Š๋Š” ์–ธ์–ด์Œ์˜ ๋ฒˆ์—ญ๊ธฐ ํ›ˆ๋ จ ์‹œ, ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ์„ ๋Œ์–ด์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋‹ค.
(ํŠนํžˆ ํ•œ๊ตญ์–ด-์ผ์–ด, ์ŠคํŽ˜์ธ์–ด-ํฌ๋ฅดํˆฌ๊ฐˆ์–ด ์™€ ๊ฐ™์ด ๋งค์šฐ ๋น„์Šทํ•œ ์–ธ์–ด์Œ์„ ๊ฐ™์€ src, tgt์–ธ์–ด๋กœ ์‚ฌ์šฉ ์‹œ ๊ทธ ํšจ๊ณผ๊ฐ€ ์ฆํญ๋œ๋‹ค.)


2. Mono-Lingual Corpus

Training an NMT system requires a large parallel corpus.
As a rule of thumb, a translator that is imperfect but usable takes at least 3 million sentence pairs.

The problem: monolingual corpora are plentiful on the internet, but obtaining a large parallel corpus is very hard.
Moreover, because monolingual corpora are so much larger, they come closer to the true probability distribution of the language as we actually use it,
which makes them far better material for training a language model (LM).

In this section, let's look at ways to squeeze more NMT performance out of cheap monolingual corpora.

 

2.1 LM Ensemble
This method was proposed by Yoshua Bengio, one of the pioneers of deep learning.
The idea is to explicitly ensemble an LM with the decoder to improve its performance.

[Shallow Fusion]
∙ Use two separate models side by side.

[Deep Fusion]
∙ Embed the LM inside the seq2seq network and train the whole thing end-to-end as a single model.

Of the two, Deep Fusion showed the better performance.
In both schemes the LM is first trained on a monolingual corpus;
when the actual translator is trained, the LM's network parameters are frozen while the seq2seq model is trained.
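As a rough sketch of what shallow fusion does at decoding time: each candidate next token is rescored by adding a weighted LM log-probability to the translation model's log-probability. The `shallow_fusion` helper, the toy distributions, and the interpolation weight `beta` are all illustrative assumptions, not the paper's exact formulation.

```python
import math

def shallow_fusion(tm_logprob, lm_logprob, beta=0.5):
    # Shallow fusion: score(token) = log p_TM(token) + beta * log p_LM(token)
    return {tok: tm_logprob[tok] + beta * lm_logprob[tok] for tok in tm_logprob}

# Toy next-token distributions from the two separate models.
tm = {"cat": math.log(0.6), "dog": math.log(0.4)}   # translation model
lm = {"cat": math.log(0.3), "dog": math.log(0.7)}   # language model

scores = shallow_fusion(tm, lm, beta=0.5)
best = max(scores, key=scores.get)   # here the LM flips the choice to "dog"
```

Deep fusion instead feeds the frozen LM's hidden state into the decoder and learns the combination end-to-end, so it cannot be reduced to a single rescoring formula like this.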


The table below reports 'Turkish→English' NMT performance with each method applied.

These approaches gain less in raw performance than the back-translation and copied-translation methods described next,
but they have the advantage of being able to exploit every monolingual sentence you have collected.
 

Reference: On Using Monolingual Corpora in Neural Machine Translation (arxiv.org)

 

2.2 Using Dummy Sentences
Instead of explicitly ensembling an LM as above, papers were proposed that have the decoder itself learn from a monolingual corpus. [https://arxiv.org/abs/1508.07909, https://arxiv.org/abs/1511.06709]
The core idea: feed an empty source sentence X, and sever the paths from the encoder (attention included) with dropout, so that no information arrives from the encoder at all.
Trained this way, the decoder effectively learns an LM from the monolingual corpus.

 

2.3 Back-Translation
Meanwhile, the papers above also proposed another, more refined method.
A previously trained reverse-direction translator is used to machine-translate a monolingual corpus, producing a synthetic parallel corpus (synthetic parallel corpus); this is then added to the existing parallel corpus for training.
❗️The key point: the synthetic parallel corpus produced by one NMT model is used to train the model in the opposite direction.

In fact, one parallel corpus lets you build two NMT models, one per direction.
Back-translation exploits these two models obtained at the same time, letting them complement each other to raise performance.
That is, the synthetic parallel corpus produced by the reverse-direction translator is fed to the target network.
For example, given a Korean monolingual corpus, the procedure is:
∙ Machine-translate it with a previously trained KOR→ENG translator.
∙ This yields a synthetic KOR-ENG parallel corpus.
∙ Merge it with the genuine KOR-ENG parallel corpus collected earlier.
∙ Use the merged corpus to train the ENG→KOR translator.
In effect, this is a form of data augmentation.
However, if too much synthetic parallel corpus is generated,
the tail can wag the dog, so its quantity must be capped during training.

cf) Further explanation (https://chan4im.tistory.com/205)

 

2.4 Copied Translation
This method, proposed by Sennrich, refines the dummy-sentence approach described earlier.
Instead of dummy sentences, the same data is placed on both the src and tgt sides for training.

∙ The dummy-sentence scheme feeds empty sentences to the encoder, so the encoder-to-decoder paths must be dropped out during training.
∙ This scheme needs no such dropout, but you must accept the redundancy of the src vocabulary now containing the tgt language's tokens.
It is therefore usually used together with back-translation.

| Method                    | TUR→ENG | ENG→TUR |
|---------------------------|---------|---------|
| Back-Translation          | 19.7    | 14.7    |
| Back-Translation + Copied | 19.7    | 15.6    |

 

Conclusion
Several methods have been proposed, as described above.
Among them, two see the most use, thanks to their ease of implementation and efficiency:
∙ Back-Translation
∙ Copied Translation
Both are very intuitive and simple, yet deliver effective performance gains.

3. Transformer (Attention is all you need)

As the title suggests, it implements seq2seq using only attention operations, and succeeds on both performance and speed.

๊ฐ™์ด๋ณด๋ฉด ์ข‹์„ ๋‚ด์šฉ(https://chan4im.tistory.com/162)

 

Related post: [Paper preview] ViT: Vision Transformer (2020), part 2. Transformer (chan4im.tistory.com)

 

3.1 Architecture
The Transformer performs all of its encoding and decoding using attention alone.
Skip-connections help the network stack deep.
The sub-modules making up the Transformer's encoder and decoder fall into three kinds:
∙ Self-Attention: performs attention over the previous layer's output
∙ Attention: as in the original seq2seq, performs attention over the encoder's output
∙ Feed-Forward: finalizes the result obtained from the attention layers

 

Positional Embedding
A conventional RNN consumes tokens sequentially, so it records order information automatically.

The Transformer, however, uses no RNN.
Order information must therefore be supplied along with each word (the same word can take on a different nuance or role depending on its position).
Consequently, positional information is expressed through a technique called positional embedding.

[Positional Embedding formula]
The positional embedding output has the same dimensionality as the word-embedding vectors.
∴ positional-embedding matrix + sentence-embedding matrix → the input to the Encoder/Decoder.

cf) sentence-embedding matrix = the matrix formed by stacking the sentence's word-embedding vectors
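One common concrete choice is the fixed sinusoidal encoding from "Attention Is All You Need": PE(pos, 2i) = sin(pos / 10000^(2i/dim)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dim)). A plain-Python sketch (`positional_encoding` is an illustrative helper):

```python
import math

def positional_encoding(max_len: int, dim: int):
    # pe[pos] has the same dimensionality as a word-embedding vector,
    # so it can simply be added element-wise to the token at position pos.
    pe = [[0.0] * dim for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, dim=512)  # dim matches the embeddings
```

Because the encoding is a fixed function of position, it needs no training and extrapolates to any position up to `max_len`.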

 

Attention
(Figure: the attention components of the Transformer)
[MHA]
The Transformer proposes Multi-Head Attention (MHA), built from several attentions run in parallel.
The principle is much like multiple kernels in a CNN each extracting a different feature.
Previously, attention was introduced as the process of learning the linear transformation that produces Q.

If we generate multiple different Qs to extract multiple kinds of information, so much the better.
Hence Multi-Head: several attentions performed simultaneously.


The basic attention over inputs Q, K, V is:

  Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

MHA, built on that attention function, is:

  MultiHead(Q, K, V) = concat(head_1, …, head_h) · W^O,  where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

∙ For self-attention, Q, K, and V are all the same value, taken from the previous layer's output.
For regular (cross) attention, Q comes from the previous layer,
while K and V come from the encoder's last layer.

The tensor sizes of Q, K, V are as follows (with n = src_length, m = tgt_length):
∙ |Q| = (batch_size, m, hidden_size)
∙ |K| = |V| = (batch_size, n, hidden_size)

The sizes of the MHA weights W_i^Q, W_i^K, W_i^V, W^O are:
∙ |W_i^Q| = |W_i^K| = |W_i^V| = (hidden_size, head_size)
∙ |W^O| = (head_size × h, hidden_size)
Here hidden_size = head_size × h, and hidden_size typically has the value 512.


In the Transformer, attention is performed for all time-steps of the target sentence against all time-steps of the encoder output (or whatever tensor is being attended to) at once.
In the previous chapter, the attention result tensor had size (batch_size, 1, hidden_size),
but the MHA result tensor has size (batch_size, m, hidden_size).
Self-attention works the same way, except K and V are the same tensor as Q, so m = n.

 

Decoder Self-Attention
Self-attention in the decoder differs slightly in character from the encoder's: it still builds Q, K, V from the previous layer's output, but with an added constraint, because at inference time the model obviously cannot know the inputs of future time-steps.

So even during training, when self-attention uses the previous layer's output as K and V, it must be implemented identically, with no access to future time-steps. To achieve this, masking is added to the attention computation.
This ensures no attention weight can be assigned to future time-steps.

[Creating the mask for attention]
Sentences within a mini-batch can differ in length, and masking also lets us attend selectively.
A mini-batch's width is determined by its longest sentence (max_length).
Shorter sentences are filled with padding after the end of the sentence.
(Figure: mini-batch shapes by sentence length)

A problem then arises when such a mini-batch passes through the encoder and the decoder performs attention on it:
attention weight can spill onto the padded time-steps, handing the decoder useless information.
The attention weights at those time-steps must therefore additionally be forced back to 0.
(Figure: attention with the mask applied)
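The two masks described above can be sketched as boolean matrices (`subsequent_mask` and `pad_mask` are illustrative helpers; True marks positions where attention is allowed):

```python
def subsequent_mask(size):
    # Causal mask for decoder self-attention: step i may see steps j <= i only.
    return [[j <= i for j in range(size)] for i in range(size)]

def pad_mask(lengths, max_len):
    # True over real tokens, False over <pad> positions past each sentence's end.
    return [[j < length for j in range(max_len)] for length in lengths]
```

In the decoder's self-attention the two are combined element-wise, so a position can attend neither to the future nor to padding.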

 

Feed-Forward & Skip-Connection
∙ The FFN consolidates the attention result, and
∙ the attention block's input is added to the attention block's output, implementing a skip-connection.

 

3.2 PyTorch's MHA Class
There is no need to implement attention yourself: you can directly use the attention class PyTorch provides.
(https://pytorch.org/docs/stable/nn.html#multiheadattention)

Just note that this attention class implements the attention used in the Transformer, not the basic attention introduced earlier (https://chan4im.tistory.com/201#n4).
 

Reference: torch.nn — PyTorch documentation (pytorch.org)

 

3.3 Evaluation

| Models            | BLEU (ENG→GER) | FLOPs (training cost) | BLEU (ENG→FRA) | FLOPs (training cost) |
|-------------------|----------------|-----------------------|----------------|-----------------------|
| GNMT + RL         | 24.6           | 2.3 × 10^19           | 39.92          | 1.4 × 10^20           |
| ConvS2S           | 25.16          | 9.6 × 10^18           | 40.46          | 1.5 × 10^20           |
| Transformer       | 27.3           | 3.3 × 10^18           | 38.1           | 3.3 × 10^18           |
| Transformer (Big) | 28.4           | 2.3 × 10^19           | 41.8           | 2.3 × 10^19           |
Google์€ transformer๊ฐ€ ๊ธฐ์กด ์—ฌํƒ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค๋ณด๋‹ค ํ›จ์”ฌ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์Œ์„ ๋ฐํ˜”๋Š”๋ฐ, ๊ธฐ์กด RNN ๋ฐ meta(facebook)์˜ ConvS2S(Convolutional Sequence to Sequence)๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฅธ ์†๋„๋กœ ํ›ˆ๋ จํ–ˆ์Œ์„ ๋ฐํ˜”๋‹ค(FLOPs).

์ด๋Ÿฐ ์†๋„์˜ ๊ฐœ์„ ์˜ ์›์ธ ์ค‘ ํ•˜๋‚˜๋กœ transformer๊ตฌ์กฐ์™€ ํ•จ๊ป˜ input feeding์˜ ์ œ๊ฑฐ 2๊ฐ€์ง€ ์š”์ธ์ด ๊ธฐ์ธํ–ˆ๋‹ค ๋ณด๋Š” ์‹œ๊ฐ์ด ๋งŽ์€๋ฐ, ๊ธฐ์กด RNN ๊ธฐ๋ฐ˜ seq2seq๋ฐฉ์‹์€ input feeding์ด ๋„์ž…๋˜๋ฉด์„œ decoderํ›ˆ๋ จ ์‹œ ๋ชจ๋“  time-step์„ ํ•œ๋ฒˆ์— ํ•  ์ˆ˜ ์—†๊ฒŒ ๋˜์—ˆ๋‹ค.
๋”ฐ๋ผ์„œ FLOPs ๋Œ€๋ถ€๋ถ„์˜ bottleneck๋ฌธ์ œ๊ฐ€ decoder์—์„œ ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค.
ํ•˜์ง€๋งŒ transformer์˜ ๊ฒฝ์šฐ, input feeding์ด ์—†๊ธฐ์— ํ›ˆ๋ จ ์‹œ ํ•œ๋ฒˆ์— ๋ชจ๋“  time-step์— ๋Œ€ํ•œ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

 

Conclusion
The Transformer's innovative architectural changes have been applied to seq2seq-based NMT and natural language generation.
It is also used broadly in the realm of natural language understanding, e.g. BERT.
 

Reference: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)


Closing Remarks...

This time we covered methods for improving NMT performance further.
Neural networks improve as data grows, and when solving seq2seq problems such as translation, the more parallel-corpus training data the better.
However, parallel corpora are very hard to collect and in limited supply. (As a rule of thumb, an imperfect but usable translator takes at least 3 million sentence pairs.)

Hence, zero-shot learning is used.
Zero-shot learning, a model's ability to recognize and classify classes it was never directly exposed to in the training data, can lift performance when training a translator for a language pair with little parallel data.
(The effect is amplified when the src and tgt languages are closely related, e.g. Korean-Japanese or Spanish-Portuguese.)


For this data-constrained setting we also covered methods that use monolingual corpora to improve NMT performance, and we will continue to focus on monolingual-corpus-driven gains later on.
We also saw how the Transformer architecture enabled seq2seq to be implemented in more varied ways.
In particular, the Transformer implements seq2seq with attention alone, winning on both speed and performance.

 
