๐Ÿ“Œ Table of Contents

1. Preview
2. Supervised Learning with Duality

3. Unsupervised Learning with Duality
4. Back-Translation Reinterpreted

๐Ÿ˜š Closing Remarks...

1. Preview

1.1 What is Duality?
In machine learning, we usually learn to approximate a function that takes data X from one domain
and maps it to data Y in another domain.
Accordingly, the datasets used in most machine learning consist of data spanning two domains.
Duality is this relationship between the two domains, and most machine-learning problems have this duality property.
NMT in particular has the distinctive strength that there is almost no difference in the amount of information between the data of its two domains, so actively exploiting duality can take machine translation to a higher level.

 

1.2 CycleGAN
Duality can be exploited in fields other than translation as well;
∙ here I introduce [CycleGAN; 2017] from the Vision field.


CycleGAN is a method that, given unpaired images from two domains as in the figure below,
converts an image of domain X into an image of domain Y.
It keeps the overall structure of a photo while restyling it as a Monet painting, or turns a horse into a zebra and a zebra back into a plain horse.
Since both directions X→Y and Y→X have their own generators G, F and discriminators Dx, Dy, training is a min/max game.
G takes x as input and converts it into ŷ.
F takes y as input and converts it into x̂.
Dx takes x̂ or x as input and judges whether it is synthetic (Dy does the same for ŷ and y).
[Figure] Overview of CycleGAN's operation
The core of this approach is that the synthesized x̂ and ŷ must look like images that actually belong to the original domains X and Y,
and that from each x̂ and ŷ it must be possible to return to the original data.
 

Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (arxiv.org)


2. Supervised Learning with Duality

2.1 DSL (Dual Supervised Learning)
The paper introduced this time is Dual Supervised Learning [DSL; 2017], which applies duality to supervised learning.
When addressing the difficulties of conventional Teacher-Forcing,
this method derives a regularization term from duality instead of resorting to RL.


By Bayes' theorem, the equation below always holds:
P(x) P(y|x) = P(y) P(x|y)
Models trained on a dataset should therefore satisfy this equation.
Applying this premise to the translation-training objective gives the following.
Let's interpret the formulas:

∙ [Objective 1]: satisfy the constraint from Bayes' theorem + minimize ℓ1,
where ℓ1 is the loss between y_i and the output of the translation function f on input x_i.

∙ [Objective 2]: ℓ2 does the same job for the reverse translation function g, and is likewise minimized.

We can therefore construct an MSE loss as above.
This function minimizes the difference between the values on the two sides of the Bayes-theorem constraint.


∙ In the formula above, log P(y | x; θx→y) and log P(x | y; θy→x) are computed by the neural-network parameters we train simultaneously, and
∙ log p̂(x) and log p̂(y) can be approximated by LMs trained separately, in advance, on monolingual corpora.
Adding the loss of this additional constraint to the existing loss and minimizing them together can be expressed as below.
Here λ adjusts the balance within the loss (if λ is too large, optimization concentrates excessively on minimizing the regularization term).
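As a concrete toy illustration, the regularization term can be sketched as below; all log-probabilities here are hypothetical constants, not outputs of real models:

```python
import math

# Sketch of the DSL duality regularizer. The four log-probabilities would
# come from the two translation models and the two pretrained LMs.
def dsl_regularizer(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    # squared gap between the two sides of
    # log P(x) + log P(y|x) = log P(y) + log P(x|y)
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2

def dsl_loss(l1, l2, reg, lam=0.01):
    # both directions' translation losses plus the weighted duality term
    return l1 + l2 + lam * reg

# 0.2 * 0.5 == 0.4 * 0.25, so the constraint holds and the penalty is ~0
reg = dsl_regularizer(math.log(0.2), math.log(0.5),
                      math.log(0.4), math.log(0.25))
print(dsl_loss(1.3, 1.1, reg))
```

When the two factorizations disagree, the squared gap grows and pulls both models back toward probabilistic consistency.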
 

Paper: Dual Supervised Learning (arxiv.org)


3. Unsupervised Learning with Duality

3.1 Dual-learning Machine Translation
Coincidentally, a paper appeared right around the release of CycleGAN: [Dual learning for machine translation; 2016].

Given the nature of NLP, direct delivery of gradients as in CycleGAN is impossible.
The paper nevertheless uses a fundamentally very similar idea:
it aims to "take a baseline NMT model trained on a parallel corpus and maximize its performance using monolingual corpora".

Since GAN-style gradient delivery is impossible in NLP, reinforcement learning must be used to deliver the discriminator-like signal
(∵ sampling discrete words breaks the gradient path between the generator and the discriminator).


∙ Translate a sentence s taken from a monolingual corpus,
∙ translate the result s_mid back in the opposite direction, and
∙ train so that the difference Δ(ŝ, s) between the reconstructed sentence ŝ and the original sentence s is minimized.
Here, how naturally the translated sentence s_mid reads as a sentence of the target language is an important signal.
Let's walk through the algorithm above.
Sentences from two domains (language A, language B) are given, and
the parameters θAB of the generator GA→B and
the parameters θBA of the reverse generator FB→A appear.
Both GA→B and FB→A are pretrained on a parallel corpus.


์•ž์—์„œ ๋ฐฐ์šด Policy Gradient๋ฅผ ํ™œ์šฉํ•ด parameter๋ฅผ updateํ•˜๋ฉด, 14๋ฒˆ์งธ ์‹์ด ๋„์ถœ๋œ๋‹ค.
Ê[r]์„ ๊ฐ๊ฐ์˜ Parameter์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ํ•œ ๊ฐ’์„ ๋”ํ•ด์ฃผ๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

์ด์ œ, ์ด ๋ณด์ƒ์˜ ๊ธฐ๋Œ“๊ฐ’์„ ๊ตฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

์ด์ œ k๊ฐœ์˜ sampling๋œ ๋ฌธ์žฅ์— ๋Œ€ํ•ด ๊ฐ ๋ฐฉํ–ฅ์— ๋Œ€ํ•œ ๋ณด์ƒ์„ ๊ฐ๊ฐ ๊ตฌํ•˜๊ณ  ์ด๋ฅผ ์„ ํ˜•๊ฒฐํ•ฉํ•œ๋‹ค. (์ด๋•Œ, smid: sampling๋œ ๋ฌธ์žฅ์„ ์˜๋ฏธ)
LMB๋ฅผ ์‚ฌ์šฉํ•ด ํ•ด๋‹น ๋ฌธ์žฅ์ด B์–ธ์–ด์˜ ์ง‘ํ•ฉ์— ์†ํ•˜๋Š”์ง€ ๋ณด์ƒ๊ฐ’์„ ๊ณ„์‚ฐํ•œ๋‹ค.
LMB๋Š” ๊ธฐ์กด ์–ธ์–ดB๋‹จ์ผ์ฝ”ํผ์Šค๋กœ ์‚ฌ์ „ํ›ˆ๋ จ๋˜์–ด์žˆ๊ธฐ์— ์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฌธ์žฅ์ด ์ƒ์„ฑ๋ ์ˆ˜๋ก ํ•ด๋‹น LM์—์„œ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์งˆ ๊ฒƒ์ด๋‹ค.

์ด๋•Œ, ๋žŒ๋‹ค์™€ ์•ŒํŒŒ๋Š” ๋™์ผ.
์ด๋ ‡๊ฒŒ ์–ป์€ ๐”ผ[r]์„ ๊ฐ parameter์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ํ•˜๋ฉด ์œ„์™€๊ฐ™์€ ์ˆ˜์‹์„ ์–ป์„ ์ˆ˜ ์žˆ๊ณ , 
์•ž์—์„œ ์„œ์ˆ ํ•œ parameter update์ˆ˜์‹์— ๋Œ€์ž…ํ•˜๋ฉด ๋œ๋‹ค.



The reverse-direction translation B→A can be obtained in a similar way.
The description and results of the final performance are shown above:
Dual-Learning achieves higher performance consistently, regardless of sentence length.
 

Paper: Dual Learning for Machine Translation (arxiv.org)


3.2 DUL (Dual Unsupervised Learning)
DSL, the dual supervised learning described above, used the equation from Bayes' theorem as its constraint.
The method of the paper introduced now [DUL; 2018] instead builds its constraint from a property of the marginal distribution.

By the property of the marginal distribution, the equation below always holds:
P(y) = Σ_x P(y|x) P(x)
This can be written with conditional probabilities and, going one step further, rewritten as an expectation.
It can then be approximated by Monte Carlo sampling with K samples.
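To make the marginalization concrete, here is a toy Monte Carlo estimate of P(y) = E_{x~p(x)}[P(y|x)] over a hypothetical three-sentence source space:

```python
import random

random.seed(0)

# Hypothetical source distribution p(x) and P(y|x) for one fixed y.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
p_y_given_x = {"a": 0.9, "b": 0.1, "c": 0.4}

# Exact marginal: 0.5*0.9 + 0.3*0.1 + 0.2*0.4 = 0.56
exact = sum(p_x[x] * p_y_given_x[x] for x in p_x)

# Monte Carlo: draw K samples x ~ p(x) and average P(y|x).
xs = list(p_x)
ws = [p_x[x] for x in xs]
K = 100_000
samples = random.choices(xs, weights=ws, k=K)
estimate = sum(p_y_given_x[x] for x in samples) / K

print(exact, estimate)  # the estimate converges to 0.56 as K grows
```
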

์œ„์˜ ์ˆ˜์‹์„ NMT์— ์ ์šฉํ•ด๋ณด์ž.
โˆ™src๋ฌธ์žฅ x, tgt๋ฌธ์žฅ y๋กœ ์ด๋ค„์ง„ ์–‘๋ฐฉํ–ฅ๋ณ‘๋ ฌ์ฝ”ํผ์Šค ๐‘ฉ ์™€
โˆ™S๊ฐœ์˜ tgt๋ฌธ์žฅ y๋กœ๋งŒ ์ด๋ค„์ง„ ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค โ„ณ์ด ์žˆ๋‹ค ๊ฐ€์ •ํ•˜์ž.


์ด๋•Œ, ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋™์‹œ์— ์ฃผ๋ณ€๋ถ„ํฌ์— ๋”ฐ๋ฅธ ์ œ์•ฝ์กฐ๊ฑด ๋˜ํ•œ ๋งŒ์กฑ์‹œ์ผœ์•ผํ•œ๋‹ค.
[Objectiv; ๋ชฉํ‘œ]

์œ„์˜ ์ˆ˜์‹์„ DSL๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ λ์™€ ํ•จ๊ป˜ S(θ)์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•ด ๊ธฐ์กด ์†์‹คํ•จ์ˆ˜์— ์ถ”๊ฐ€ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

์ด์ œ DSL๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ pฬ‚(x)์™€ pฬ‚(y)๊ฐ€ ๋“ฑ์žฅํ•œ๋‹ค.
โˆ™pฬ‚(x): ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค๋กœ ๋งŒ๋“  LM์œผ๋กœ ๊ณ„์‚ฐํ•œ ๊ฐ ๋ฌธ์žฅ๋“ค์˜ ํ™•๋ฅ ๊ฐ’


์œ„์˜ ์ˆ˜์‹์— ๋”ฐ๋ฅด๋ฉด pฬ‚(x)๋ฅผ ํ†ตํ•ด src๋ฌธ์žฅ x๋ฅผ samplingํ•ด ์‹ ๊ฒฝ๋ง θ๋ฅผ ํ†ต๊ณผ์‹œ์ผœ P(y | x; θ)๋ฅผ ๊ตฌํ•ด์•ผ ๋  ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ธ๋‹ค.
ํ•˜์ง€๋งŒ, ์•„๋ž˜์ฒ˜๋Ÿผ ์กฐ๊ธˆ ๋” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ ์ ‘๊ทผํ•œ๋‹ค.
์ด์ฒ˜๋Ÿผ ์ค‘์š”๋„ํ‘œ์ง‘๋ฒ•(importance sampling)์„ ํ†ตํ•ด
tgt์–ธ์–ด์˜ ๋ฌธ์žฅ y๋ฅผ ๋ฐ˜๋Œ€๋ฐฉํ–ฅ๋ฒˆ์—ญ๊ธฐ (y→x)์— ๋„ฃ์–ด
K๊ฐœ์˜ src๋ฌธ์žฅ x๋ฅผ samplingํ•ด P(y)๋ฅผ ๊ตฌํ•œ๋‹ค.


์ด ๊ณผ์ •์„ ํ•˜๋‚˜์˜ ์†์‹คํ•จ์ˆ˜๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
โˆ™1ํ•ญ: ๋ฌธ์žฅ xn์ด ์ฃผ์–ด์งˆ ๋•Œ, yn์˜ ํ™•๋ฅ ์„ ์ตœ๋Œ€๋กœํ•˜๋Š” θ๋ฅผ ์ฐพ๋Š”๋‹ค.

โˆ™2ํ•ญ: ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค์—์„œ ์ฃผ์–ด์ง„ ๋ฌธ์žฅ ys LM์—์„œ์˜ ํ™•๋ฅ ๊ฐ’ log pฬ‚(ys)๊ณผ์˜ ์ฐจ์ด๋ฅผ ์ค„์—ฌ์•ผํ•œ๋‹ค.
๊ทธ ๊ฐ’์€ ๋ฐ˜๋Œ€๋ฐฉํ–ฅ๋ฒˆ์—ญ๊ธฐ(y→x)๋ฅผ ํ†ตํ•ด K๋ฒˆ samplingํ•œ ๋ฌธ์žฅ xi์˜ LMํ™•๋ฅ ๊ฐ’ pฬ‚(xi)์™€ xi๊ฐ€ ์ฃผ์–ด์งˆ ๋•Œ, ys์˜ ํ™•๋ฅ ๊ฐ’์„ ๊ณฑํ•˜๊ณ , ๋ฌธ์žฅ ys๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ samplingํ•œ ๋ฌธ์žฅ xi์˜ ํ™•๋ฅ ๊ฐ’์œผ๋กœ ๋‚˜๋ˆ ์ค€ ๊ฐ’์ด ๋œ๋‹ค.


[์„ฑ๋Šฅ ๋ฐ ๊ฒฐ๊ณผ]
์ด ํ‘œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ, DUL๊ณผ ๋‹ค๋ฅธ ๊ธฐ์กด ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค๋ฅผ ํ™œ์šฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค๊ณผ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ๋Š” ์œ„์™€ ๊ฐ™์€๋ฐ, ์ด ๋ฐฉ๋ฒ•์€ ์•ž์„œ ์†Œ๊ฐœํ•œ ๊ธฐ์กด์˜ ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค๋ฅผ ํ™œ์šฉํ•œ ๋ฐฉ์‹๋“ค๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ, ํ›จ์”ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์˜ ๊ฐœ์„ ์„ ๋ณด์—ฌ์ค€๋‹ค.

๋” ๋‚˜์•„๊ฐ€, ์•ž์„œ ์†Œ๊ฐœํ•œ Dual-Learning๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค.
๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ๋น„ํšจ์œจ์  RL์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ ๋„ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ์€ ์ฃผ๋ชฉํ• ๋งŒํ•œ ์„ฑ๊ณผ๋ผ ํ•  ์ˆ˜ ์žˆ๋‹ค.
 

Paper: Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization (Proceedings of AAAI, ojs.aaai.org)


4. Aside) Back-Translation Reinterpreted

 

Previous post: [Gain Study_NLP] 08. NMT ์‹ฌํ™” (Zero-Shot, Transformer) (chan4im.tistory.com)

4.1 Back-Translation
Earlier we saw, from an abstract point of view, why Back-Translation works well;
today's paper tries to identify the reason from a numerical point of view.

∙ Assume a bilingual parallel corpus 𝛣 of N src sentences x and tgt sentences y, and
∙ a monolingual corpus ℳ consisting only of S tgt sentences y.


As with DUL (Dual Unsupervised Learning) covered earlier, the loss function we ultimately want to minimize can be expressed as below.

Here, as in DUL, P(y) can be expressed using the property of the marginal distribution.
Note, however, that this time the left side is log P(y), not P(y).

cf) Jensen's inequality always yields an expression that is less than or equal to log P(y).
[Jensen's Inequality]
For a logarithmic curve as below, let x_m be the mean of the two points x_1 and x_2.
Then log x_m ≥ (1/2)(log x_1 + log x_2) always holds.
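A quick numeric check of the inequality, with two arbitrary points:

```python
import math

# Jensen's inequality for the (concave) log:
# the log of a mean is at least the mean of the logs.
x1, x2 = 0.2, 0.8
lhs = math.log((x1 + x2) / 2)                 # log x_m
rhs = 0.5 * (math.log(x1) + math.log(x2))    # (1/2)(log x1 + log x2)
print(lhs, rhs, lhs >= rhs)  # lhs > rhs: the bound holds
```
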


์œ„์˜ ๋ถ€๋“ฑ์‹ ๊ฒฐ๊ณผ๋ฌผ์—์„œ ์Œ์˜๋ถ€ํ˜ธ๋ฅผ ๋ถ™์—ฌ์ฃผ๋ฉด ์•„๋ž˜์™€ ๊ฐ™๊ณ ,
๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ชฉ์ ์€ -logP(y)๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ๊ฒƒ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
์ด๋ฅผ ์กฐ๊ธˆ ์ „, ์ตœ์†Œํ™”ํ•˜๋ ค๋Š” ์†์‹คํ•จ์ˆ˜์— ์ด ์ˆ˜์‹์„ ๋Œ€์ž…ํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
๊ฒฐ๊ตญ L~์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์€ L์„ ์ตœ์†Œํ™”ํ•˜๋Š” ํšจ๊ณผ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.
๋”ฐ๋ผ์„œ L~์ตœ์†Œํ™”๋ฅผ ์œ„ํ•ด optimizer๋ฅผ ์ด์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
KL Divergence์˜ ๊ฒฝ์šฐ, θ์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ํ•˜๋ฉด ์ƒ์ˆ˜์ด๊ธฐ์— ์ƒ๋žต๋˜๋ฏ€๋กœ
GD๋ฅผ ์ด์šฉํ•œ ์˜ˆ์‹œ์ผ ๊ฒฝ์šฐ, ์•„๋ž˜์™€ ๊ฐ™์ด ๋ฏธ๋ถ„๋œ๋‹ค.


๊ฒฐ๊ณผ์ ์œผ๋กœ ์–ป๊ฒŒ๋œ ์†์‹คํ•จ์ˆ˜์˜ ๊ฐ ๋ถ€๋ถ„์˜ ์˜๋ฏธ๋ฅผ ์‚ดํŽด๋ณด์ž.
์ฒซ๋ฒˆ์งธ ํ•ญ: xn์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, yn์˜ ํ™•๋ฅ ์„ ์ตœ๋Œ€๋กœ ํ•˜๋Š” θ๋ฅผ ์ฐพ๋Š” ๊ฒƒ
๋‘๋ฒˆ์งธ ํ•ญ: sampling๋œ ๋ฌธ์žฅ xi์ด ์ฃผ์–ด์งˆ ๋•Œ, ๋‹จ์ผ์–ธ์–ด์ฝ”ํผ์Šค์˜ ๋ฌธ์žฅ ys๊ฐ€ ๋‚˜์˜ฌ ํ‰๊ท ํ™•๋ฅ ์„ ์ตœ๋Œ€๋กœ ํ•˜๋Š” θ๋ฅผ ์ฐพ๋Š” ๊ฒƒ
Back-Translation์ด๋ž€ L~(θ)๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š”๊ฒƒ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
 

Paper: Joint Training for Neural Machine Translation Models with Monolingual Data (arxiv.org)


Closing Remarks...

This time we looked at ways to maximize NMT performance by exploiting the property called Duality, going beyond conventional RL.
As mentioned in the earlier chapter on RL for NLP, the "Policy Gradient approach" has
∙ Pro) it can use even non-differentiable reward functions, but
∙ Con) its sampling-based operation makes learning far less efficient
    (∵ high variance // exploration-vs-exploitation trade-off).

Duality-based approaches, by contrast, stay within the conventional MLE and Teacher-Forcing framework and maximize model performance by adding a "Regularization Term" that compensates for teacher-forcing's weaknesses.


They additionally offer a reinterpretation of monolingual-corpus techniques such as Back-Translation.
Moreover, since statistics-based analysis is far more tractable, this line of work aligns closely with current research directions in the deep-learning community.

 
