📌 Table of Contents

0. Preprocessing
1. Corpus collection
2. Normalization
3. Sentence-level tokenization
4. Tokenization (Korean, English, Japanese/Chinese)
5. Parallel corpus alignment
6. Subword tokenization (BPE algorithm & UNK, OOV)
7. Detokenization
8. The torchtext library

 

 


0. Preprocessing

0.1 Corpus
A corpus usually means a body of text: sentences made up of many words.
To build training data, we need a corpus consisting of a large number of such sentences.

 ∙ monolingual corpus: a corpus in a single language
 ∙ bilingual corpus: a corpus composed of two languages
 ∙ multilingual corpus: a corpus composed of more than two languages
 ∙ parallel corpus: a corpus of sentence pairs across languages (e.g. English-Korean)

 

0.2 Overview of the preprocessing pipeline
① Corpus collection
② Normalization
③ Sentence-level tokenization
④ Tokenization
⑤ Parallel corpus alignment
⑥ Subword tokenization

 

 

 

 


1. Corpus Collection

There are several ways to collect a corpus, but indiscriminately crawling websites can create legal problems.

Beyond copyright, problems can also arise from the unnecessary traffic that crawling places on web servers.

It is therefore recommended to crawl only appropriate websites, in a proper manner, or restricted to non-commercial purposes.

Whether a given website permits crawling can be checked in its robots.txt file.

ex) checking TED's robots.txt
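Such a check can also be done programmatically with Python's standard library. A minimal sketch, using a made-up robots.txt body (a real check would fetch e.g. https://www.ted.com/robots.txt instead):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /talks
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/talks/123"))    # True: crawling allowed
print(rp.can_fetch("*", "https://example.com/search?q=nlp")) # False: crawling disallowed
```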

 

1.1 Collecting a monolingual corpus
This is the easiest kind of corpus to obtain (it is lying around all over the internet), so large amounts can be gathered cheaply.
However, you still need to collect a corpus from the right domain and process it into a usable form.

 

1.2 Collecting a multilingual corpus
Obtaining a parallel corpus for machine translation is considerably harder than obtaining a monolingual corpus.
Moreover, pronouns such as 'he' and 'she' often correspond to proper nouns such as person names, so translation issues like these must all be taken into account.

 

 

 

 


 

2. Normalization

2.1 Converting full-width characters to half-width

In most Japanese/Chinese documents, and in some Korean documents, digits, Latin letters, and symbols may be full-width characters, so they need to be converted to their half-width equivalents.

 

 

2.2 Case normalization
Unifying different surface forms maps several words that share one meaning onto a single form, reducing sparsity.
However, as deep learning advanced, word embeddings began mapping related words to similar vectors, so the need to resolve case differences explicitly has decreased.

 

2.3 Normalization with regular expressions
A large corpus obtained by crawling contains noise such as special characters and symbols.
It may also exhibit recurring patterns depending on the nature of the website,
so the noise has to be detected and removed efficiently, and regular expressions are essential for this.
The expressions below can be visualized with the regex visualization site https://regexper.com/.

Using [ ]
A character class matches any one of the listed characters, i.e. 2 or 3 or 4 or 5 or c or d or e.
ex) [2345cde]

Using -
A hyphen expresses consecutive ranges of digits or letters.
ex) [2-5c-e]

Using ^
Negation is expressed with the ^ symbol inside the brackets.
ex) [^2-5c-e]

Using ( )
Parentheses create groups.
ex) (x)(yz)




Using ? + *
?: the preceding expression appears zero times or once
+: the preceding expression appears one or more times
*: the preceding expression appears zero or more times

ex) x?
ex) x+
ex) x*



Using ^ and $

When not inside [ ],
^ means the start of a line,
$ means the end of a line.
ex) ^x$
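The constructs above can be checked directly with Python's re module:

```python
import re

# [ ] and ranges: [2-5c-e] matches one character among 2,3,4,5,c,d,e
assert re.findall(r"[2-5c-e]", "1a2c6f") == ["2", "c"]

# ^ inside [ ]: negation
assert re.findall(r"[^2-5c-e]", "1a2c") == ["1", "a"]

# quantifiers ? + *
assert re.fullmatch(r"x?", "") is not None     # zero or one
assert re.fullmatch(r"x+", "xxx") is not None  # one or more
assert re.fullmatch(r"x*", "") is not None     # zero or more

# ^ and $ outside [ ]: start and end of line
assert re.match(r"^x$", "x") is not None
assert re.match(r"^x$", "xy") is None
```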

 

Example 1)

Suppose we want to use the corpus below, which contains personal information (a phone number), as a dataset, with the personal information removed.

And suppose the phone number does not always appear on the last line.

Hello Kim, I would like to introduce regular expression in this section
~~
Thank you!
Sincerely,

Kim: +82-10-1234-5678

 

๊ฐœ์ธ์ •๋ณด์˜ ๊ทœ์น™์„ ๋จผ์ € ํŒŒ์•…ํ•ด๋ณด์ž. ๊ตญ๊ฐ€๋ฒˆํ˜ธ๋Š” ์ตœ๋Œ€ 3์ž๋ฆฌ์ด๊ณ  ์•ž์— +๊ฐ€ ๋ถ™์„ ์ˆ˜ ์žˆ์œผ๋ฉฐ ์ „ํ™”๋ฒˆํ˜ธ ์‚ฌ์ด์— -๊ฐ€ ๋“ค์–ด๊ฐˆ ์ˆ˜๋„ ์žˆ๋‹ค.

์ „ํ™”๋ฒˆํ˜ธ๋Š” ๋นˆ์นธ ์—†์ด ํ‘œํ˜„๋˜๋ฉฐ ์ง€์—ญ๋ฒˆํ˜ธ๊ฐ€ ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๊ณ  ๋งˆ์ง€๋ง‰์€ ํ•ญ์ƒ 4์ž๋ฆฌ ์ˆซ์ž์ด๋‹ค ๋“ฑ๋“ฑ...

import re
regex = r"([\w]+\s*:?\s*)?\(?\+?([0-9]{1,3})?\-[0-9]{2,3}(\)|\-)?[0-9]{3,4}\-?[0-9]{4}"

x = "Name - Kim: +82-10-9425-4869"
re.sub(regex, "REMOVED", x)

Output: Name - REMOVED

 

 

Example 2) Using backreferences in the replacement

Suppose we need to remove only the digits that appear between alphabetic characters in the example below.

If we simply matched digits with [0-9]+ and deleted them, digit-only lines and digits at the edges of words would disappear as well.

x = '''abcdefg
12345
ab12
12ab
a1bc2d
a1
1a'''

 

๋”ฐ๋ผ์„œ ๊ด„ํ˜ธ๋กœ group์„ ์ƒ์„ฑํ•˜๊ณ  ๋ฐ”๋€” ๋ฌธ์ž์—ด ๋‚ด์—์„œ ์—ญ์Šฌ๋ž˜์‹œ(\)์™€ ํ•จ๊ป˜ ์ˆซ์ž๋ฅผ ์ด์šฉํ•ด ๋งˆ์น˜ ๋ณ€์ˆ˜๋ช…์ฒ˜๋Ÿผ ๊ฐ€๋ฆฌํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.

regex = r'([a-z])[0-9]+([a-z])'
to = r'1\2\'

y = '\n'.join([re.sub(regex, to, x_i) for x_i in x.split('\n')])

([a-z])[0-9]+([a-z])

 

 

 

 

 

 

 


3. Sentence-level Tokenization

The problems we typically want to solve take sentences as their unit, not arbitrary chunks of input,

and in most cases each line of the corpus should contain exactly one sentence.

So when several sentences share one line, or one sentence spans several lines, the text must be re-segmented (tokenized) into sentences.

Rather than writing a segmentation algorithm by hand, the usual approach is sent_tokenize from the widely used NLP toolkit NLTK.

Of course, some cases still require additional pre- and post-processing.

3.1 Sentence tokenization example
import sys, fileinput, re
from nltk.tokenize import sent_tokenize

if __name__ == "__main__":
    for line in fileinput.input():
        if line.strip() != "":
            line = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', line.strip())

            sentences = sent_tokenize(line.strip())

            for s in sentences:
                if s != "":
                    sys.stdout.write(s + "\n")

 

3.2 Example: merging and re-splitting sentences
import sys, fileinput
from nltk.tokenize import sent_tokenize

if __name__ == "__main__":
    buf = []

    for line in fileinput.input():
        if line.strip() != "":
            buf += [line.strip()]
            sentences = sent_tokenize(" ".join(buf))

            if len(sentences) > 1:
                buf = sentences[1:]

                sys.stdout.write(sentences[0] + '\n')

    sys.stdout.write(" ".join(buf) + "\n")

 

 

 

 


4. Tokenization

Depending on the problem to solve and the language, we normalize text through morphological analysis or simple segmentation; let us look at spacing in particular.



Korean
In Korean, standardization of spacing is insufficient, so inconsistent spacing is very common; and because spacing has relatively little influence on how a sentence is interpreted, the inconsistency is compounded even further. Normalizing Korean therefore also involves applying standardized spacing; moreover, since Korean is an agglutinative language, affixes must be separated from stems, which also relieves the sparsity problem.
Programs for Korean tokenization include Mecab, written in C++, and KoNLPy, written in Python, which perform this preprocessing.
https://github.com/kh-kim/nlp_with_pytorch_examples/blob/master/chapter-04/tokenization.ipynb


English
English has spaces by default, and in the vast majority of cases texts follow the spacing rules very well.
Since basic spacing is well standardized, as mentioned above, spacing itself poses no major normalization problem in English.
However, commas, periods, quotation marks, and the like must be split off from the words they attach to, so preprocessing is done with NLTK, written in Python.
$ pip install nltk==3.2.5
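NLTK's word_tokenize performs this punctuation splitting. As a rough, dependency-free illustration of what such splitting does (naive_tokenize is a made-up helper, not an NLTK API), consider this sketch:

```python
import re

def naive_tokenize(sentence):
    # Split punctuation (commas, periods, quotes, ...) off from words:
    # a run of word characters, or any single non-space symbol.
    # Only a rough approximation of NLTK's word_tokenize.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(naive_tokenize('He said, "Hello, world."'))
# ['He', 'said', ',', '"', 'Hello', ',', 'world', '.', '"']
```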


Japanese and Chinese
Japanese and Chinese sentences contain no spaces at all, yet segmentation is still needed to build a proper language model.
For Japanese, the C++-based Mecab is used; for Chinese, the Java-based Stanford Parser or PKU Parser.

 

 

 

 

 

 

 


5. Parallel Corpus Alignment

When, for example, an English newspaper and a Korean newspaper are mapped only at the document level,

no sentence-to-sentence alignment exists yet, so some unneeded sentences must be filtered out and the rest aligned.

 

5.1 Overview of building a parallel corpus
① Prepare a word dictionary between the source and target languages.
  If a dictionary is already available, skip to step ⑥; otherwise follow the steps below.

② Collect and clean a corpus for each language.

③ Train word embedding vectors for each language.

④ Train a word-level translator with MUSE.
https://github.com/facebookresearch/MUSE

⑤ Use the trained word-level translator to generate a bilingual word dictionary.

⑥ Feed the resulting dictionary to Champollion to align the previously collected multilingual corpus.
https://github.com/LowResourceLanguages/champollion

⑦ Apply tokenization appropriate to each language so that the dictionary can be applied.

⑧ Normalize each language.

⑨ Generate the parallel corpus with Champollion.

 

5.2 Building the word dictionary
Facebook's MUSE provides a method and code for building a dictionary when no parallel corpus is available.

MUSE takes the word embedding vectors trained separately on each monolingual corpus and
maps one language's embedding space onto the other's, enabling word-level translation; it works by unsupervised learning.
Each line holds a source word and a target word, separated by the delimiter <>.

ex) Part of an English-Korean word translation dictionary obtained as the output of MUSE's unsupervised learning.
The word-level translations are fairly accurate (though not flawless; note green <> 빨간색, i.e. "red"):
stories <> ์ด์•ผ๊ธฐ
stories <> ์†Œ์„ค
contact <> ์—ฐ๋ฝ
contact <> ์—ฐ๋ฝ์ฒ˜
contact <> ์ ‘์ด‰
green <> ์ดˆ๋ก์ƒ‰
green <> ๋นจ๊ฐ„์ƒ‰
dark <> ์–ด๋‘ 
dark <> ์ง™
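Loading a dictionary in this `source <> target` format is straightforward; a minimal sketch using a few of the entries above:

```python
from collections import defaultdict

# A few lines in MUSE's "source <> target" dictionary format.
lines = """stories <> 이야기
stories <> 소설
contact <> 연락""".splitlines()

# A source word may map to several target words, so collect lists.
word_dict = defaultdict(list)
for line in lines:
    src, tgt = line.strip().split(" <> ")
    word_dict[src].append(tgt)

print(word_dict["stories"])  # ['이야기', '소설']
```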

 

5.3 Alignment with CTK
The dictionary built above is used as input to CTK (the Champollion Tool Kit), which aligns the sentences of a parallel corpus based on it.
CTK is an open-source sentence aligner for bilingual corpora, implemented in Perl.

Given an existing or automatically built dictionary, CTK performs sentence alignment; aligning two multi-line documents produces output like the following:
omitted <=> 1
omitted <=> 2
omitted <=> 3
1 <=> 4
2 <=> 5
3 <=> 6
4,5 <=> 7
6 <=> 8
7 <=> 9
8 <=> 10
9 <=> omitted

ํ•ด์„) target์–ธ์–ด์˜ 1,2,3๋ฒˆ์งธ ๋ฌธ์žฅ์€ ์ง์„ ์ฐพ์ง€ ๋ชปํ•ด ๋ฒ„๋ ค์ง
ํ•ด์„) source์–ธ์–ด์˜ 1,2,3๋ฒˆ์งธ ๋ฌธ์žฅ์€ target์˜ 4,5,6๋ฒˆ์งธ ๋ฌธ์žฅ๊ณผ mapping
ํ•ด์„) source์–ธ์–ด์˜ 4,5๋ฒˆ์งธ ๋‘ ๋ฌธ์žฅ์€ target์˜ 7๋ฒˆ ๋ฌธ์žฅ์— ๋™์‹œ์— mapping

 

์œ„์˜ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ์–ด๋–ค ๋ฌธ์žฅ๋“ค์€ ๋ฒ„๋ ค์ง€๊ธฐ๋„ ํ•˜๊ณ 
์ผ๋Œ€์ผ(one-to-one)๋งตํ•‘, ์ผ๋Œ€๋‹ค(one-to-many), ๋‹ค๋Œ€์ผ(many-to-one)๋งตํ•‘์ด ์ด๋ค„์ง€๊ธฐ๋„ ํ•œ๋‹ค.





ex) A Python wrapper script that makes CTK easier to use.
Before running it, point BIN at the Champollion executable inside your CTK installation (i.e. under CTK_ROOT).
import sys, argparse, os

BIN = "NEED TO BE CHANGED"
CMD = "%s -c %f -d %s %s %s %s"
OMIT = "omitted"
DIR_PATH = './tmp/'
INTERMEDIATE_FN = DIR_PATH + "tmp.txt"

def read_alignment(fn):
    aligns = []

    f = open(fn, 'r')

    for line in f:
        if line.strip() != "":
            srcs, tgts = line.strip().split(' <=> ')

            if srcs == OMIT:
                srcs = []
            else:
                srcs = list(map(int, srcs.split(',')))

            if tgts == OMIT:
                tgts = []
            else:
                tgts = list(map(int, tgts.split(',')))

            aligns += [(srcs, tgts)]

    f.close()

    return aligns

def get_aligned_corpus(src_fn, tgt_fn, aligns):
    f_src = open(src_fn, 'r')
    f_tgt = open(tgt_fn, 'r')

    for align in aligns:
        srcs, tgts = align

        src_buf, tgt_buf = [], []

        for src in srcs:
            src_buf += [f_src.readline().strip()]
        for tgt in tgts:
            tgt_buf += [f_tgt.readline().strip()]

        if len(src_buf) > 0 and len(tgt_buf) > 0:
            sys.stdout.write("%s\t%s\n" % (" ".join(src_buf), " ".join(tgt_buf)))

    f_tgt.close()
    f_src.close()

def parse_argument():
    p = argparse.ArgumentParser()

    p.add_argument('--src', required = True)
    p.add_argument('--tgt', required = True)
    p.add_argument('--src_ref', default = None)
    p.add_argument('--tgt_ref', default = None)
    p.add_argument('--dict', required = True)
    p.add_argument('--ratio', type = float, default = 1.1966)

    config = p.parse_args()

    return config

if __name__ == "__main__":
    assert BIN != "NEED TO BE CHANGED"

    if not os.path.exists(DIR_PATH):
        os.mkdir(DIR_PATH)

    config = parse_argument()

    if config.src_ref is None:
        config.src_ref = config.src
    if config.tgt_ref is None:
        config.tgt_ref = config.tgt

    cmd = CMD % (BIN, config.ratio, config.dict, config.src_ref, config.tgt_ref, INTERMEDIATE_FN)
    os.system(cmd)

    aligns = read_alignment(INTERMEDIATE_FN)
    get_aligned_corpus(config.src, config.tgt, aligns)

 

 

 

 

 

 

 


6. Subword Tokenization (with Byte Pair Encoding)

6.1 BPE (Byte Pair Encoding)
Subword tokenization via BPE is now one of the most essential preprocessing steps.
ex) concentrate = con(together) + centr(= center) + ate(= make)
ex) 집중(集中) = 集(gather) + 中(middle)

[The subword tokenization idea]
 - Assume that a word is a composition of smaller subwords that carry meaning,
 - discover appropriate subwords and split words into those units,
 - thereby shrinking the vocabulary and effectively reducing sparsity.
 - In particular, it is an efficient way to deal with UNK tokens.
[UNK tokens, OOV & BPE]
∙ UNK (Unknown Token):
a word not present in the training corpus
∙ OOV (Out-of-Vocabulary) problem: the phenomenon where UNKs make the task hard to solve.

์ž์—ฐ์–ด์ฒ˜๋ฆฌ์—์„œ ๋ฌธ์žฅ์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์„ ๋•Œ ๋‹จ์ˆœํžˆ ๋‹จ์–ด๋“ค์˜ ์‹œํ€€์Šค๋กœ ๋ฐ›๊ธฐ์—

UNK Token์€ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ์˜ ํ™•๋ฅ ์„ ๋ง๊ฐ€๋œจ๋ฆฌ๊ณ  ์ ์ ˆํ•œ embedding(encoding) ๋˜๋Š” ์ƒ์„ฑ์ด ์–ด๋ ค์›Œ์ง€๋Š” ์ง€๋ขฐ์ด๋‹ค.
ํŠนํžˆ, ๋ฌธ์žฅ์ƒ์„ฑ์˜ ๊ฒฝ์šฐ, ์ด์ „๋‹จ์–ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ์— ๋”์šฑ ์–ด๋ ค์›Œ์ง„๋‹ค.
subword ๋ถ„๋ฆฌ๋กœ OOV๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๋Š”๋ฐ, ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ subword ๋ถ„๋ฆฌ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋ฐ”๋กœ BPE(Byte Pair Encoding)
์ด๋‹ค.

ํ•˜์ง€๋งŒ subword๋‹จ์œ„์˜ tokenization์„ ์ง„ํ–‰ํ•˜๋Š” BPE ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฒฝ์šฐ, ์‹ ์กฐ์–ด๋‚˜ ์˜คํƒ€(typo)๊ฐ™์€ UNK Token์— ๋Œ€ํ•ด subword ๋‹จ์œ„๋‚˜ ๋ฌธ์ž(character)๋‹จ์œ„๋กœ ์ชผ๊ฐœ ๊ธฐ์กด train data์˜ token๋“ค์˜ ์กฐํ•ฉ์œผ๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆด ์ˆ˜ ์žˆ๋‹ค.
์ฆ‰, UNK ์ž์ฒด๋ฅผ ์—†์•ฐ์œผ๋กœ์จ ํšจ์œจ์ ์œผ๋กœ UNK์— ๋Œ€์ฒ˜ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.

ex)

Original sentence tokenized by [English NLTK]:
Natural language processing is one of biggest streams in A.I


The same sentence segmented into subwords by [English BPE]:
▁Natural ▁language ▁processing ▁is ▁one ▁of ▁biggest ▁stream s ▁in ▁A. I
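The merge loop at the heart of BPE can be sketched in a few lines of Python, in the style of Sennrich et al.'s reference implementation (the toy vocabulary below is made up for illustration):

```python
import re
import collections

def get_stats(vocab):
    # Count the frequency of each adjacent symbol pair in the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the chosen pair with its merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: space-separated characters plus an end-of-word marker </w>.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(3):  # number of merge operations to learn
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # ('e', 's'), then ('es', 't'), then ('est', '</w>')
```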

 

 

 

 

 

 


7. Detokenization

If tokenization was performed during preprocessing, detokenization must be performed afterwards to restore the text.

That is, preprocessing follows the steps below.

∙ Segment with a language-specific tokenizer module (e.g. NLTK).
 At this point, insert the marker ▁ at the original spaces, to distinguish them from the spaces newly introduced by segmentation.

∙ Perform subword-level tokenization (e.g. the BPE algorithm).
 Again insert ▁, to distinguish the spaces present so far from the spaces introduced by the subword segmentation.
 As a result, a word that followed an original space now carries ▁▁ (two markers) in front of it.

∙ Detokenize:
 First, remove all spaces.
 Then, replace the double marker ▁▁ with a space.
 Finally, remove the remaining ▁ characters.
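The three detokenization steps above can be written directly as string replacements (a minimal sketch; the marker is ▁, U+2581, and the sample input is made up):

```python
def detokenize(line):
    # ▁▁ marks an original space; a single ▁ marks a boundary
    # introduced by (subword) tokenization.
    line = line.replace(" ", "")    # 1) remove all spaces
    line = line.replace("▁▁", " ")  # 2) double marker -> original space
    line = line.replace("▁", "")    # 3) drop remaining markers
    return line.strip()

print(detokenize("▁▁Nat ural ▁▁lang uage ▁▁pro cessing"))
# Natural language processing
```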

 

7.1 Post-processing after tokenization
To make detokenization easy, a special marker character must be inserted after each tokenization pass so that the original spaces can be distinguished from the spaces newly created during segmentation.


Example) Code that inserts the special marker ▁ to distinguish the original spaces from the spaces created in the preprocessing step.

tokenizer.py
import sys

STR = '▁'

if __name__ == "__main__":
    ref_fn = sys.argv[1]

    f = open(ref_fn, 'r')

    for ref in f:
        ref_tokens = ref.strip().split(' ')
        input_line = sys.stdin.readline().strip()

        if input_line != "":
            tokens = input_line.split(' ')

            idx = 0
            buf = []

            # We assume that stdin has more tokens than reference input.
            for ref_token in ref_tokens:
                tmp_buf = []

                while idx < len(tokens):
                    if tokens[idx].strip() == '':
                        idx += 1
                        continue

                    tmp_buf += [tokens[idx]]
                    idx += 1

                    if ''.join(tmp_buf) == ref_token:
                        break

                if len(tmp_buf) > 0:
                    buf += [STR + tmp_buf[0].strip()] + tmp_buf[1:]

            sys.stdout.write(' '.join(buf) + '\n')
        else:
            sys.stdout.write('\n')

    f.close()

 

7.2 Detokenization example
The scripts above are used as follows,
typically chained with pipes after running another segmentation module.
$ cat [tokenized filename] | python tokenizer.py [original filename] | python post_tokenizer.py


post_tokenizer.py
import sys

if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip() != "":
            if '▁▁' in line:
                line = line.strip().replace(' ', '').replace('▁▁', ' ').replace('▁', '').strip()
            else:
                line = line.strip().replace(' ', '').replace('▁', ' ').strip()

            sys.stdout.write(line + '\n')
        else:
            sys.stdout.write('\n')

 

 

 

 

 

 

 


8. The Torchtext Library

8.1 What is Torchtext?
Torchtext is a library that collects code for reading and preprocessing data for natural language processing.
(https://pytorch.org/text/stable/index.html)
With Torchtext, text files can easily be read in and used for training.


For a detailed walkthrough, see https://wikidocs.net/60314 (in Korean).
 


 
