📌 Table of Contents

1. word sense
2. one-hot encoding

3. Thesaurus (lexical taxonomy)
4. Feature
5. Feature Extraction & Text Mining (TF-IDF)
6. Feature Vector Construction
7. Vector Similarity (with Norm)
8. Word Sense Disambiguation (WSD)
9. Selection Preference

😚 In closing...

 

 

 


1. Word Sense

1.1 The Relationship Between Words and Meaning
Let's look at word sense, similarity of meaning, and ambiguity, which are among the most fundamental and yet most difficult problems in NLP.
A word carries multiple senses inside its surface form, the lemma, so it is used with different meanings depending on the situation.
The hidden sense must be inferred and understood from the surrounding context,
but when context is insufficient, ambiguity can increase. (∵ resolving the sense may depend on prior context or memory)

In other words, the 'ambiguity' problem that arises when one word form contains several senses occupies a very important place in NLP. It matters especially in machine translation, where the form of the translated word changes completely depending on the sense of the source word.
That is, the lemma should serve as a medium that gets converted into a 'word sense' in an internal latent space before use.

 

1.2 Homonyms and Polysemes

 Homonym: words with the same form but unrelated meanings (ex. 차 - tea / car)
 Polyseme: like a homonym, except the senses are related to one another (ex. 다리 - a person's leg / a desk leg)

For homonyms and polysemes, where one form carries several senses, a process called word sense disambiguation (WSD) is needed to make the word's meaning explicit.
To clarify a word's meaning, the original sense must be inferred from the surrounding context; for this, end-to-end methods are preferred in DNNs.
This has lowered the need for explicit word sense disambiguation, but ambiguity still remains hard to resolve in many cases.
Synonyms
 Synonym: words of different forms that share the same meaning (ex. home, place)

Of course, the meanings may not match perfectly, but a kind of consensus exists; a group of such synonyms is called a synonym set (synset).
Hypernyms and Hyponyms
A word denotes an abstract concept, and there exist broader and narrower concepts containing it; the corresponding words are called hypernyms and hyponyms. (ex. animal - mammal, mammal - elephant)

Following such a lexical taxonomy, the relationships between words can be organized into a hierarchy.

 

1.3 Resolving Ambiguity (WSD)
A computer holds only the text, so a process is needed to grasp the real meaning the text carries.
In other words, the surface form of a word alone is highly ambiguous, so removing that ambiguity through WSD (word sense disambiguation) can improve NLP performance.


2. One-Hot Encoding

2.1 One-Hot Encoding
The simplest way to turn a word into numbers a computer can recognize is to represent it as a vector, and one of the most basic such representations is one-hot encoding.
As the name says, the encoding consists of a single 1 and many 0s; the dimensionality of a one-hot vector is usually the size of the entire vocabulary. (usually a very large number; typically 30,000~100,000)

Words are discrete symbols and are therefore modeled as discrete random variables.
Each word in the dictionary can be represented as a vector via one-hot encoding as above, but this representation has several problems.
 Prob 1. The vector space becomes very large. (a single 1, the rest 0; a vector mostly filled with 0s is called a sparse vector.)
 Prob 2. The biggest problem with sparse vectors is that operations between them often yield 0, that is, the vectors are frequently orthogonal.
  - Put differently, '강아지' (puppy) and '개' (dog) are similar words, but the inner product of their one-hot vectors is 0, so their similarity comes out 0, which hurts generalization.


  - This is related to the Curse of Dimensionality: as dimensionality grows, the points (vectors) carrying the information are spread sparsely at very low density. Hence the need arises to represent words in a reduced number of dimensions to escape the curse of dimensionality.
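As a quick illustration of Prob 2, a minimal sketch (the 3-word toy vocabulary is made up): the cosine similarity between any two distinct one-hot vectors is always 0, no matter how close the words are in meaning.

```python
import numpy as np

# toy vocabulary (hypothetical): index 0 = '강아지' (puppy), 1 = '개' (dog), 2 = '차' (car)
vocab_size = 3
puppy = np.eye(vocab_size)[0]   # [1, 0, 0]
dog   = np.eye(vocab_size)[1]   # [0, 1, 0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# even though the words are semantically close, their one-hot vectors are orthogonal
print(cosine(puppy, dog))  # 0.0
```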


3. Thesaurus (Lexical Taxonomy)

3.1 WordNet
A thesaurus (lexical taxonomy) is a database built by carefully analyzing and classifying word senses into a hierarchical structure; WordNet is the most representative example of a thesaurus.


WordNet's strength is that its synonym sets (synsets), hypernyms, and hyponyms are well organized, specifically as a directed acyclic graph (DAG).
(It is not a tree: a single node can have multiple parent nodes.)

WordNet can also be used directly on its website. (http://wordnetweb.princeton.edu/perl/webwn)
In the screenshot below, 10 senses are defined for the noun and 8 for the verb; for the noun bank#2, several alternative expressions (depository financial institution#1, banking concern#1) are listed alongside it. (these are exactly the synonym set)
In this way, WordNet predefines and numbers the possible senses of each word and links synonyms together, providing synonym sets.
This makes excellent label data for WSD: running supervised learning on the data WordNet provides can solve the word sense disambiguation (WSD) problem.

cf) Korean WordNets also exist
 ∙ KorLex (http://korlex.pusan.ac.kr/)
 ∙ KWN (http://wordnet.kaist.ac.kr/)

 

3.2 Comparing Word Similarity with WordNet
WordNet also ships wrapped inside NLTK, so it can be imported and used directly.
from nltk.corpus import wordnet as wn

def hypernyms(word):
    # start from the first (most common) synset of the word
    current_node = wn.synsets(word)[0]
    yield current_node

    while True:
        try:
            # climb to the first hypernym until the root is reached
            current_node = current_node.hypernyms()[0]
            yield current_node
        except IndexError:
            break

 

์œ„์˜ ์ฝ”๋“œ๋Š” wordnet์—์„œ ํŠน์ • ๋‹จ์–ด์˜ ์ตœ์ƒ์œ„ ๋ถ€๋ชจ๋…ธ๋“œ๊นŒ์ง€์˜ ๊ฒฝ๋กœ๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๊ณ ,
์ถ”๊ฐ€์ ์œผ๋กœ ๋‹จ์–ด๋งˆ๋‹ค ๋‚˜์˜ค๋Š” ์ˆซ์ž๋กœ ๊ฐ ๋…ธ๋“œ ๊ฐ„ ๊ฑฐ๋ฆฌ๋ฅผ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 
์˜ˆ๋ฅผ ๋“ค์–ด, ์œ„์˜ ๊ฒฝ์šฐ, 'student'์™€ 'fireman' ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋Š” 5 ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

์ด๋ฅผ ํ†ตํ•ด ๊ฐ ์ตœํ•˜๋‹จ ๋…ธ๋“œ๊ฐ„์˜ ์ตœ๋‹จ๊ฑฐ๋ฆฌ๋ฅผ ์•Œ ์ˆ˜ ์žˆ๊ณ , ์ด๋ฅผ ์œ ์‚ฌ๋„๋กœ ์น˜ํ™˜ํ•ด ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
์ด๋ฅผ ์ด์šฉํ•ด ๊ณต์‹์„ ์ ์šฉํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Limitation: building such a dictionary takes a great deal of time and money, and it only works with a dictionary in which hypernym/hyponym relations are well curated.
So dictionary-based similarity is accurate, but its limits are clear.

 



 


4. Feature

4.1 Feature
So far we have seen that representing words as one-hot vectors causes many problems.
This is because the one-hot representation is not effective.

To extract information and learn effectively, the features of the target must be represented well.
Features are expressed as numbers, and since they should explain as many samples as possible, it is best when samples share common features whose values differ from sample to sample.
That is, assigning a number to each feature of a sample and collecting them into a vector yields a feature vector.

4.2 Assumptions for Obtaining Word Feature Vectors
① Words with similar meanings → will have similar usage
② Similar usage → will play similar roles in similar sentences
③ ∴ The words they co-occur with will be similar

 

 

 


5. Feature Extraction & Text Mining (TF-IDF)

Before building feature vectors, let's look at how to extract features using TF-IDF, a tool of central importance in text mining.

 

5.1 TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistic, based on occurrence counts, that indicates how important a word w is within a document d.
That is, the higher the TF-IDF value, the more strongly w characterizes d.
TF: the number of times the word occurs in the document
IDF: the inverse of the number of documents (DF) in which the word appears

For example, 'the' will have a very large TF.
But 'the' is almost never important, which is why IDF is needed.
(Using IDF penalizes words like 'the'.)

∴ The final number is large for a word that rarely appears in other documents but appears often in one particular document, so it can serve as a measure of how important a role the word plays in that document.
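Written out to match the np.log(len(docs) / df[word]) term used in the implementation of 5.2, this is TF-IDF(w, d) = TF(w, d) × log(N / DF(w)), with N the number of documents. A tiny worked example with made-up counts:

```python
import math

N = 3    # total number of documents
tf = 5   # the word occurs 5 times in document d
df = 1   # the word appears in only 1 of the 3 documents

# TF-IDF(w, d) = TF(w, d) * log(N / DF(w))
tfidf = tf * math.log(N / df)
print(round(tfidf, 4))  # 5 * log(3)
```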

 

5.2 TF-IDF Implementation Example
 ∙ Assume three documents (paper transcripts) are stored in the variables doc1, doc2, doc3.

📌 TF function
import pandas as pd

def get_term_frequency(document, word_dict=None):
    if word_dict is None:
        word_dict = {}
    words = document.split()

    for w in words:
        word_dict[w] = 1 + (0 if word_dict.get(w) is None else word_dict[w])

    return pd.Series(word_dict).sort_values(ascending=False)



📌 DF function
def get_document_frequency(documents):
    dicts = []
    vocab = set([])
    df = {}

    for d in documents:
        tf = get_term_frequency(d)
        dicts += [tf]
        vocab = vocab | set(tf.keys())
    
    for v in list(vocab):
        df[v] = 0
        for dict_d in dicts:
            if dict_d.get(v) is not None:
                df[v] += 1

    return pd.Series(df).sort_values(ascending=False)



📌 TF-IDF function
import numpy as np

def get_tfidf(docs):
    vocab = {}
    tfs = []
    for d in docs:
        vocab = get_term_frequency(d, vocab)
        tfs += [get_term_frequency(d)]
    df = get_document_frequency(docs)

    stats = []
    for word, freq in vocab.items():
        tfidfs = []
        for idx in range(len(docs)):
            if tfs[idx].get(word) is not None:
                # TF(w, d) * log(N / DF(w))
                tfidfs += [tfs[idx][word] * np.log(len(docs) / df[word])]
            else:
                tfidfs += [0]

        stats.append((word, freq, *tfidfs, max(tfidfs)))

    return pd.DataFrame(stats, columns=('word',
                                        'frequency',
                                        'doc1',
                                        'doc2',
                                        'doc3',
                                        'max')).sort_values('max', ascending=False)

get_tfidf([doc1, doc2, doc3])

<Output>


6. Feature Vector Construction

6.1 Building a TF Matrix
TF itself also makes a good feature: if the occurrence counts are laid out one per dimension, they form a feature vector.
(Of course, using the per-document TF-IDF values themselves is also fine.)

def get_tf(docs):
    vocab = {}
    tfs = []
    for d in docs:
        vocab = get_term_frequency(d, vocab)
        tfs += [get_term_frequency(d)]

    stats = []
    for word, freq in vocab.items():
        tf_v = []
        for idx in range(len(docs)):
            if tfs[idx].get(word) is not None:
                tf_v += [tfs[idx][word]]
            else:
                tf_v += [0]
        stats.append((word, freq, *tf_v))

    return pd.DataFrame(stats, columns=('word',
                                        'frequency',
                                        'doc1',
                                        'doc2',
                                        'doc3')).sort_values('frequency', ascending=False)

get_tf([doc1, doc2, doc3])

<Output>
The columns are as follows:
frequency: TF(w)
doc1: TF(w, d1)
doc2: TF(w, d2)
doc3: TF(w, d3)


TF(w, d1), TF(w, d2), TF(w, d3) serve as feature vectors built from each word's per-document occurrence counts.
For example, '여러분' gets the feature vector [5, 6, 1].
With more documents, we could obtain even better feature vectors than these.
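For instance, that row can be pulled out of the DataFrame returned by get_tf(); the toy frame below stands in for the real output (values taken from the text):

```python
import pandas as pd

# toy stand-in for the get_tf() output above (hypothetical row)
tf_table = pd.DataFrame(
    [('여러분', 12, 5, 6, 1)],
    columns=('word', 'frequency', 'doc1', 'doc2', 'doc3'),
).set_index('word')

# the per-document counts form the word's feature vector
feature_vector = tf_table.loc['여러분', ['doc1', 'doc2', 'doc3']].tolist()
print(feature_vector)  # [5, 6, 1]
```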

However, if there are too many documents, the dimensionality of the vectors also becomes excessive.
Why this is a problem: most values of such a high-dimensional vector end up filled with 0.

Even in the table above, quite a few words appear in only some of the 3 documents, so their TF is 0 elsewhere.
A vector in which only a small fraction of the entries hold meaningful values is called a sparse vector.
Since most dimensions of a sparse vector are 0, it becomes an obstacle to obtaining meaningful statistics.
In short, because the feature vectors are built from document occurrence counts alone, much information is lost.
(∵ The representation is so simplified that it is still a stretch to call these very accurate feature vectors.)

 

6.2 Using the Information of Co-occurring Words via a Context Window
This is more accurate than building feature vectors from TF as above.

context window: to survey which words appear together, we slide a window over the text and aggregate information about the units inside it; this window is called the context window.
The context window adds one hyper-parameter, window_size, whose value the user must choose.
 ∙ when window_size is too large: even words unrelated to the current word get counted into the TF
 ∙ when window_size is too small: even related words can be missed from the count


[A function that takes sentences as input and counts the words co-occurring within the given window_size]
from collections import defaultdict

import pandas as pd

def get_context_counts(lines, w_size=2):
    co_dict = defaultdict(int)

    for line in lines:
        words = line.split()

        for i, w in enumerate(words):
            # clamp the left edge so a negative index does not wrap around,
            # and include the full w_size words on the right
            for c in words[max(0, i - w_size):i + w_size + 1]:
                if w != c:
                    co_dict[(w, c)] += 1

    return pd.Series(co_dict)

 

[Code that uses the get_term_frequency() function written for TF-IDF to build vectors from the co-occurrence information]
def co_occurrence(co_dict, vocab):
    data = []
    
    for word1 in vocab:
        row = []
        
        for word2 in vocab:
            try:
                count = co_dict[(word1, word2)]
            except KeyError:
                count = 0
            row.append(count)
            
        data.append(row)
    
    return pd.DataFrame(data, index=vocab, columns=vocab)




<Output>

For the high-frequency words near the top, most entries are filled in.
For the low-frequency words near the bottom, most entries are 0.
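A self-contained mini-run of the two functions above on a toy corpus (the sentences are invented for illustration):

```python
from collections import defaultdict

import pandas as pd

def get_context_counts(lines, w_size=2):
    # count word pairs co-occurring within +/- w_size words
    co_dict = defaultdict(int)
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            for c in words[max(0, i - w_size):i + w_size + 1]:
                if w != c:
                    co_dict[(w, c)] += 1
    return pd.Series(co_dict)

def co_occurrence(co_dict, vocab):
    # vocab x vocab matrix of co-occurrence counts
    data = []
    for word1 in vocab:
        row = []
        for word2 in vocab:
            try:
                count = co_dict[(word1, word2)]
            except KeyError:
                count = 0
            row.append(count)
        data.append(row)
    return pd.DataFrame(data, index=vocab, columns=vocab)

lines = ['i drink tea', 'i drink coffee']   # toy corpus (made up)
counts = get_context_counts(lines, w_size=1)
vocab = sorted({w for l in lines for w in l.split()})
co = co_occurrence(counts, vocab)
print(co.loc['drink', 'tea'])   # 'tea' appeared once inside the window of 'drink'
```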


7. Vector Similarity (with Norm)

How can the feature vectors we obtained be used? Feature vectors are very useful for computing similarity between words.

Earlier we saw how to measure the distance between words on WordNet's graph structure and derive word similarity from it.

This time we will cover methods for computing the similarity (or distance) between vectors.

 

 

 

7.1 L1 Distance (Manhattan Distance)
The L1 distance uses the L1 norm and is also called the Manhattan distance.
It is the sum of the absolute differences between the two vectors in each dimension.

In code:
def L1_distance(x1, x2):
    return ((x1 - x2).abs()).sum()

(In the figure, every path except the green line (the L2 distance) is an L1 distance.)

7.2 L2 Distance (Euclidean Distance)
The L2 distance uses the L2 norm and is also called the Euclidean distance.
It is the square root of the sum of the squared per-dimension differences between the two vectors.

In code:
def L2_distance(x1, x2):
    return ((x1 - x2)**2).sum()**.5

 

7.3 Infinity Norm
The infinity distance, which uses the L∞ norm, is the largest of the per-dimension differences.

In code:
def infinity_distance(x1, x2):
    return ((x1 - x2).abs()).max()


7.4 Cosine Similarity
Cosine similarity measures the angle between two vectors: the magnitudes are normalized away, so only direction matters.
It is the most widely used similarity measure in NLP.

∙ The numerator of the formula is the element-wise product of the two vectors, summed (i.e., their dot product).
  - The closer the cosine similarity is to 1, the more the directions coincide;
  - the closer it is to 0, the closer the vectors are to orthogonal;
  - and the closer it is to -1, the more opposite the directions are.


In code:
def cosine_similarity(x1, x2):
    return (x1 * x2).sum() / ((x1**2).sum()**.5 * (x2**2).sum()**.5)


[Points to note]
The dot product in the numerator and the L2 norms in the denominator are costly operations, so the computation grows with the vector dimensionality.
This is worst for sparse vectors: when the non-zero dimensions do not overlap, the product is 0, so the result fails to reflect an accurate similarity or distance.

 

7.5 Jaccard Similarity
Jaccard similarity measures the similarity between two sets.


∙ The numerator of the formula is the size of the intersection of the two sets, and the denominator is the size of their union.
  - Here, each dimension of the feature vector is an element of the set.
  - However, to compute a Jaccard similarity over the actual numeric values in each dimension, rather than just 0 / non-0, it can be computed with min and max operations over the per-dimension values, as in the second line of the formula.

In code:
import torch

def get_jaccard_similarity(x1, x2):
    return torch.stack([x1, x2]).min(dim=0)[0].sum() / torch.stack([x1, x2]).max(dim=0)[0].sum()
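Comparing all five measures on a pair of toy vectors (a numpy re-implementation of the formulas above, so the snippet runs without torch; the vectors are made up):

```python
import numpy as np

v1 = np.array([1., 2., 3.])
v2 = np.array([2., 2., 2.])

l1   = np.abs(v1 - v2).sum()                # |1-2| + |2-2| + |3-2| = 2
l2   = (((v1 - v2) ** 2).sum()) ** .5       # sqrt(1 + 0 + 1) = sqrt(2)
linf = np.abs(v1 - v2).max()                # max(1, 0, 1) = 1
cos  = (v1 * v2).sum() / ((v1**2).sum()**.5 * (v2**2).sum()**.5)
jac  = np.minimum(v1, v2).sum() / np.maximum(v1, v2).sum()   # 5 / 7

print(l1, linf, round(float(cos), 4), round(float(jac), 4))
```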

 

7.6 Similarity Between Documents
document = a set of sentences; sentence = a set of words
Everything in section 7 so far collects features for words and computes similarity between words.

The same idea lets us extract features for documents and compute similarity between documents:
for example, build a vector from the TF or TF-IDF values of the words in each document, and then compute the similarity between those vectors.

Of course, today's deep learning offers far more accurate ways to measure similarity between documents or sentences.


8. Word Sense Disambiguation (WSD)

8.1 Lesk Algorithm (Thesaurus-based WSD)
The Lesk algorithm is the simplest dictionary-based disambiguation method.
It can be used to pin down the sense of a particular word in a given sentence.

∙ Assumption of the Lesk algorithm: words appearing together in a sentence share a common topic
Lesk Algorithm

∙ For the word to disambiguate, fetch the dictionary (WordNet) gloss of each of its senses.

∙ Compute the similarity between those per-sense glosses and the words appearing in the given sentence;
  usually the similarity is simply the count of overlapping words.

∙ Select the sense whose gloss is most similar to the words in the sentence.
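A minimal, self-contained sketch of the overlap counting described above (the senses and glosses are made-up stand-ins for WordNet entries):

```python
def simplified_lesk(context_words, senses):
    """Pick the sense whose gloss overlaps most with the context.

    senses: dict mapping a sense name to its gloss string (hypothetical entries).
    """
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense

# made-up glosses for two senses of 'bank'
senses = {
    'bank#river': 'sloping land beside a body of water',
    'bank#money': 'financial institution that accepts deposits of money',
}
print(simplified_lesk('i deposited money at the bank'.split(), senses))  # bank#money
```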



In code:

def lesk(sentence, word):
    # import inside the function so the local name does not clash with this wrapper
    from nltk.wsd import lesk

    best_synset = lesk(sentence.split(), word)
    print(best_synset, best_synset.definition())





<Example>

Everything works well up to In [26], but at In [26] it predicts a completely different sense.
The Lesk algorithm has clear strengths and weaknesses.
 - Strength) With a well-organized dictionary like WordNet, the WSD (word sense disambiguation) problem can be solved easily and quickly.
 - Weakness) It depends heavily on the dictionary's glosses for words and senses; if the glosses are poor, or the given sentence has no strong cues, its WSD ability drops sharply.



9. Selection Preference

์„ ํƒ์„ ํ˜ธ๋„(Selection Preference): ๋ฌธ์žฅ์€ ์—ฌ๋Ÿฌ ๋‹จ์–ด์˜ ์‹œํ€€์Šค๋กœ ์ด๋ค„์ง€๋Š”๋ฐ, ๋ฌธ์žฅ ๋‚ด ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์— ๋”ฐ๋ผ ์˜๋ฏธ๊ฐ€ ์ •ํ•ด์ง€๋ฉฐ ์ด๋ฅผ ๋” ์ˆ˜์น˜ํ™”ํ•ด ๋‚˜ํƒ€๋‚ด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์ด๋ฅผ ํ†ตํ•ด WSD๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค.

ex) '๋งˆ์‹œ๋‹ค'๋ผ๋Š” ๋™์‚ฌ์— ๋Œ€ํ•œ ๋ชฉ์ ์–ด๋กœ๋Š” '์Œ๋ฃŒ'ํด๋ž˜์Šค์— ์†ํ•˜๋Š” ๋‹จ์–ด๊ฐ€ ์˜ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๊ธฐ์— '์ฐจ'๋ผ๋Š” ๋‹จ์–ด๊ฐ€ '์Œ๋ฃŒ'์ผ์ง€ '์ž๊ฐ€์šฉ'์ผ์ง€ ์–ด๋””์— ์†ํ•  ์ง€ ์‰ฝ๊ฒŒ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

9.1 Selectional Preference Strength
Selectional preference quantifies how selective the relationship between words is:
the larger the difference between the distributions, the stronger the selectional preference
(= the greater the selectional preference strength).

This was defined using the KL divergence (KLD).
The selectional preference strength S_R(w) is defined as the KLD between P(C|w), the distribution over the object classes C given the predicate w, and the prior distribution P(C) of those classes. (That is, it tells us how selectively w prefers particular classes.)

 

9.2 Selectional Association
Now consider the selectional association between a predicate and a particular class.
According to the formula, if the numerator is large for a predicate whose selectional preference strength is low, the predicate and the class have a higher selectional association.
That is, even when a predicate's overall selectional preference strength is low,
one particular class may be strongly affected by the predicate, enlarging the numerator and hence the selectional association value.

 

9.3 Selection Preference & WSD
These properties of selectional preference can be applied to WSD.
ex) When the verb '마시다' (to drink) appears with the object '차', all we need to figure out is whether it belongs to the 'beverage' class or the 'automobile' class.

The corpora we have are expressed only in words, not in classes,
so predefined knowledge or a dataset is needed for this.
9.4 WordNet-based Selection Preference
This is where WordNet shows its power.
Using WordNet, we can find the hypernyms of '차(car)', treat them as classes, and obtain the information we need.
The occurrence count that defines the probability distribution between a predicate and a class can be defined as follows.

A headword h belonging to class c is a word that actually appears in the corpus: we count the occurrences of h together with the predicate w, and divide by |Classes(h)|, the number of classes h belongs to.
Doing this for every headword belonging to class c and summing the results approximates Count_R(w, c).

From this, given a predicate w and a headword h, we can obtain ĉ, the estimate of h's class c.
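The description above corresponds to the following (a reconstruction from the prose; Count_R(w, h) is the co-occurrence count of w and h, and the second line picks the class with the highest selectional association from 9.2):

```latex
\mathrm{Count}_R(w, c) \;\approx\; \sum_{h \in c} \frac{\mathrm{Count}_R(w, h)}{\left|\mathrm{Classes}(h)\right|}

\hat{c} \;=\; \underset{c \in \mathrm{Classes}(h)}{\operatorname{argmax}}\; A_R(w, c)
```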

 

9.5 Evaluating Selection Preference with Pseudo Words
Pseudo words are one answer for designing a precise test set.
A pseudo word is a word artificially created by combining 2 words.

 

9.6 Similarity-Based Method [2007]
WordNet does not exist for every language, and newly coined words are unlikely to be reflected in it.
So what if we could compute selectional preference without depending on WordNet or a thesaurus?

Let's look at a simple, data-driven way to compute selectional preference without relying on a thesaurus or the thesaurus-based Lesk algorithm.
Given a tuple of a predicate w, a headword h, and the relation R between the two words, selectional association can be defined as below.

Here, the weight φ_R(w, h) can be uniformly 1, or it can be defined using IDF as below.

The sim function can be any of various similarity functions, including the earlier cosine or Jaccard similarity.
Since the Seen_R(w) function selects the candidate words for the similarity comparison, the comparison targets vary with the corpus; this lets us compute selectional preference easily without any thesaurus.

 

์œ ์‚ฌ๋„๊ธฐ๋ฐ˜ ์„ ํƒ์„ ํ˜ธ๋„ ์˜ˆ์ œ

import torch
from konlpy.tag import Kkma

with open('ted.aligned.ko.refined.tok.random-10k.txt') as f:
    lines = [l.strip() for l in f.read().splitlines() if l.strip()]

def count_seen_headwords(lines, predicate='VV', headword='NNG'):
    # collect, for each predicate, the list of headwords seen together with it
    tagger = Kkma()
    seen_dict = {}

    for line in lines:
        pos_result = tagger.pos(line)

        word_h = None
        word_p = None
        for word, pos in pos_result:
            if pos == predicate or pos[:3] == predicate + '+':
                word_p = word
                break
            if pos == headword:
                word_h = word

        if word_h is not None and word_p is not None:
            seen_dict[word_p] = [word_h] + ([] if seen_dict.get(word_p) is None else seen_dict[word_p])

    return seen_dict

seen_headwords = count_seen_headwords(lines)

def get_selectional_association(predicate, headword, dataframe, metric):
    # sum the similarities between this headword and every headword previously
    # seen with the predicate; 'dataframe' is the co-occurrence matrix 'co'
    # built in section 6.2
    v1 = torch.FloatTensor(dataframe.loc[headword].values)
    seens = seen_headwords[predicate]

    total = 0
    for seen in seens:
        try:
            v2 = torch.FloatTensor(dataframe.loc[seen].values)
            total += metric(v1, v2)
        except KeyError:
            # a seen headword may be missing from the co-occurrence matrix
            pass

    return total

def wsd(predicate, headwords):
    # cosine_similarity is the function defined in section 7.4
    selectional_associations = []
    for h in headwords:
        selectional_associations += [get_selectional_association(predicate, h, co, cosine_similarity)]

    print(selectional_associations)

'๊ฐ€'๋ผ๋Š” ์กฐ์‚ฌ์— ๊ฐ€์žฅ ์ž˜ ๋งž๋Š” ๋‹จ์–ด๊ฐ€ 'ํ•™๊ต'์ž„์„ ์ž˜ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋”ฐ.


In closing...

A word looks like a discrete form, but internally it carries a continuous 'sense'.
So we know that words can be similar in meaning even when their surface forms differ,
and if we can compute word similarity, training distributions or features from a corpus becomes more accurate.


With the arrival of the WordNet dictionary, similarity (distance) between words became computable,
but building a thesaurus like WordNet is an enormous task, so it would be better to compute similarity without a dictionary.

If we extract features from a corpus alone and turn words into vectors, the result is less refined than WordNet but far simpler to obtain.
Moreover, as the corpus grows, the extracted features, and hence the feature vectors, become increasingly accurate.

Once feature vectors are extracted, similarity can be computed with metrics such as cosine similarity or the L2 distance.


But as described earlier, since a dictionary holds many words, the feature-vector dimensionality equals the dictionary size;
even if we use 'documents instead of words' as features, the vector still gets one dimension per document.

The bigger problem is that these are 'sparse vectors' in which most dimensions are filled with 0:
when trying to learn something from them, cosine similarity can come out 0 because the vectors are orthogonal.
That is, it can become difficult to obtain accurate similarities.

This sparsity problem is a hallmark of NLP, arising because words are discrete symbols.
Traditional NLP struggled greatly with it.


Modern deep learning, however, rather than building such sparse feature vectors,
uses 'word embedding' techniques to build dense vectors in which zeros rarely occur.
ex) word2vec, GloVe, ... (https://chan4im.tistory.com/197)

 
