Paper Review - Word2vec (2)

Distributed Representations of Words and Phrases and their Compositionality

ํ•ด๋‹น ์ธ๋„ค์ผ์€ Wonkook Lee ๋‹˜์ด ๋งŒ๋“œ์‹  Thumbnail-Maker ๋ฅผ ์ด์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค

ํ•ด๋‹น ๋…ผ๋ฌธ์€ word2vec ์˜ ํ›„์† ๋…ผ๋ฌธ์œผ๋กœ์จ vector ํ’ˆ์งˆ๊ณผ ํ•™์Šต ์†๋„๋ฅผ ๋†’์ธ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•˜๊ณ  ์žˆ๋‹ค. ์†Œ๊ฐœ๋Š” ๊ฐ„๋‹จํžˆ ์—ฌ๊ธฐ์„œ ๋งˆ๋ฌด๋ฆฌํ•˜๊ณ  ๋…ผ๋ฌธ์„ ๋ฆฌ๋ทฐํ•ด ๋ณด๋„๋ก ํ•˜์ž.

๐ŸŒŸ Abstract

ํ•ด๋‹น ๋…ผ๋ฌธ ์ด์ „์— ๋ฐœํ‘œํ•œ ๋…ผ๋ฌธ์—์„œ๋Š” continuous Skip-gram ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ณ ํ’ˆ์งˆ์˜ distributed vector representations ์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์„ค๋ช…ํ–ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฒกํ„ฐ์˜ ํ€„๋ฆฌํ‹ฐ์™€ ํ•™์Šต ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค.

  • Subsampling of the frequent words
  • Instead of Hierarchical Softmax, use Negative Sampling

๋˜ํ•œ ๊ด€์šฉ์–ด์ ์œผ๋กœ ๋งž์ง€ ์•Š๋Š” ๋‹จ์–ด๋“ค์˜ ์กฐํ•ฉ๋„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” phrase vector ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. (Air Canada ์ฒ˜๋Ÿผ ์„œ๋กœ ์—ฐ๊ด€ ์—†๋Š” ๋‹จ์–ด๋“ค์ด ๋งŒ๋‚œ ๋‹จ์–ด๋“ค)

๐ŸŽŸ๏ธ Introduction

์ด์ „ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœํ•œ Skip-gram ๋ชจ๋ธ๊ฐ™์€ ๊ฒฝ์šฐ ๊ต‰์žฅํžˆ ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๊ณ ํ’ˆ์งˆ์˜ wordsโ€™ representation vector ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ํ•™์Šต์ด ๊ฐ€๋Šฅํ–ˆ๋˜ ์ด์œ ๋Š” ์—ฌํƒ€ ๋‹ค๋ฅธ NNLM ๋ชจ๋ธ๋“ค๊ณผ ๋‹ค๋ฅด๊ฒŒ dense matrix multiplications ๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋ฐํžˆ๊ณ  ์žˆ๋‹ค.

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด Skip-gram ์˜ ์•„๋ž˜์™€ ๊ฐ™์€ ๊ฐœ์„ ์ ์„ ํ†ตํ•ด ๋” ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•ด์ง„ ๋ชจ๋ธ์„ ๋ณด์—ฌ์ฃผ๊ณ ์ž ํ•œ๋‹ค.

  • Subsampling of frequent words
    • Gives roughly a 2-10x training speedup and improves the accuracy of representations for less frequent words
  • Noise Contrastive Estimation (NCE) / Negative Sampling
    • Speeds up training and produces better-quality vectors for frequent words
  • Learning phrase vectors
    • Word-level training struggles to learn vectors for word combinations that are not natural compositions of their parts
    • Using vectors that capture a phrase's representation therefore makes the Skip-gram model richer
    • ex) vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") = vec("Toronto Maple Leafs")

๊ทธ๋ ‡๊ฒŒ ์ด ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ๊ฐ„๋‹จํ•œ vector addition ์—์„œ ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด, vec(โ€œRussiaโ€) + vec(โ€œriverโ€) ๋Š” vector(โ€œVolga Riverโ€) ์™€ ๊ฐ™์ด ํ‘œํ˜„์ด ๋œ๋‹ค๊ณ  ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ํ•ฉ์„ฑ์„ฑ์€ ๋‹น์—ฐํ•˜์ง€ ์•Š์€ ์–ธ์–ด์˜ ์ดํ•ด๊ฐ€ word vector representation ์˜ ๊ฐ„๋‹จํ•œ ๊ณ„์‚ฐ์œผ๋กœ ๋ณด์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

๐Ÿ•น๏ธ Skip-gram Model

Objective function: maximize the average log probability of predicting the surrounding words given the center word

\[\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right)\]

We look at c words on each side of the center word.
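As a quick illustration, here is a minimal sketch (mine, not the paper's) of enumerating the (center, context) pairs that this objective sums over; `tokens` and the window size `c` are illustrative assumptions:

```python
# Minimal sketch (not from the paper): enumerate the (center, context)
# pairs that the objective above sums over.
def skipgram_pairs(tokens, c=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):          # c words on each side, j != 0
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split(), c=2))
```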

So how is the probability inside the log defined? \[p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left(v_{w}^{\prime}{ }^{\top} v_{w_{I}}\right)}\]

Skip-gram model with softmax function

The probability is computed as the exponential of the output word's dot product with the input (I-th) word's vector, divided by the sum of the exponentials of the dot products for every word in the vocabulary.

์œ„ ์‹์—์„œ ๋ˆˆ์—ฌ๊ฒจ ๋ด์•ผํ•  ์ ์€ \(W\) ์ด๋‹ค. \(W\) ๋Š” vocab ์˜ ๊ฐœ์ˆ˜์ธ๋ฐ ์ด ์‹์ด ์‹ค์šฉ์ ์ด์ง€ ๋ชปํ•œ ์ด์œ ๋Š” softmax ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์ด \(W\) ์— ๋น„๋ก€ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋ณดํ†ต \(W\) ๋Š” \(10^{5}-10^{7}\) ์ •๋„ ๋œ๋‹ค๊ณ  ํ•œ๋‹ค.

โœ‚๏ธ Hierarchical Softmax

์ด์ „ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์€ softmax ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์–ด๋ฅผ Huffman Binary Tree ๋ฅผ ์ด์šฉํ•ด ๊ตฌ์„ฑํ•˜๊ณ  ๊ทธ์— ๋”ฐ๋ผ Hierarchical Softmax ๋ฅผ ์ด์šฉํ•˜์˜€๋‹ค. ๊ทธ์— ๋”ฐ๋ผ ๊ธฐ์กด \(O(W)\) ๋งŒํผ ๊ฑธ๋ฆฌ๋˜ time complexity ๋ฅผ \(O(log_2(W))\) ๋งŒํผ ์ค„์ผ ์ˆ˜ ์žˆ์—ˆ๋‹ค.

However, the method was not laid out precisely there, so let's work through it with this paper.

(Figure: Hierarchical Softmax example)

First, there is some notation to get straight. This is the part that took me the longest while reading the paper.

  • \(n(w, j)\) is the j-th node on the path from the root to w
  • \(L(w)\) is the length of the path from the root to w
  • \(ch(n)\) is a (fixed) child of n
  • \([\![x]\!]\) is 1 if x is true and -1 otherwise
  • By convention, "the child" here means the left child

์ด ๋ถ€๋ถ„์„ ์ƒ๊ธฐํ•˜๋ฉด์„œ ์•„๋ž˜ ์‹์„ ๋ณด๋„๋ก ํ•˜์ž. \[p\left(w \mid w_{I}\right)=\prod_{j=1}^{L(w)-1} \sigma\left([\![n(w, j+1)=\operatorname{ch}(n(w, j))]\!] \cdot v_{n(w, j)}^{\prime}{ }^{\top} v_{w_{I}}\right)\]

Hierarchical Softmax Equation

์œ„ ๊ทธ๋ฆผ์„ ๊ฐ™์ด ๋ณด๋ฉด์„œ ์„ค๋ช…ํ•˜๋ฉด ํŠธ๋ฆฌ๋Š” vocabulary ํฌ๊ธฐ๋งŒํผ leaf ๋…ธ๋“œ๋ฅผ ๊ฐ€์ง„๋‹ค. ์ € ์‹์—์„œ ์ค‘์š”ํ•œ ๊ฑด ์•ˆ์— ์žˆ๋Š” ์ผ์ข…์˜ if ๋ฌธ์ด๋‹ค.

  • If the (j+1)-th node on the path is the (left) child of the j-th node: take the dot product of the center word's vector with the j-th node's vector
  • If the (j+1)-th node is not the (left) child of the j-th node: take -(that same dot product)

ํ•ด๋‹น ํ™•๋ฅ ๊ฐ’์„ maximize ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์žŠ์ง€ ๋ง์ž

์ด๋ ‡๊ฒŒ ๊ตฌ์„ฑ์„ ํ•˜๊ฒŒ ๋˜๋ฉด ์—ฐ์‚ฐ ํšŸ์ˆ˜๊ฐ€ L(w)์— ๊ทผ์‚ฌํ•˜๊ฒŒ ๋˜๊ณ  ์ด๋Š” ํŠธ๋ฆฌ์˜ ๋†’์ด์ด๊ธฐ์— \(log_2(V)\) ๊ฐ€ ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ•œ๋งˆ๋””๋กœ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์„ ๊ทน๋Œ€ํ™” ํ•œ ๊ฒƒ์ด๋‹ค. ๋˜ํ•œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์˜ ๋…ธ๋“œ๋Š” ๋ฃจํŠธ ๋…ธ๋“œ๋กœ๋ถ€ํ„ฐ ๊ฑฐ๋ฆฌ๋ฅผ ๊ฐ€๊น๊ฒŒ ํ•˜์—ฌ ๋” ๋น ๋ฅธ ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•˜๋„๋ก ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค. ๋งˆ์ง€๋ง‰ ์žฅ์ ์€ ํ™•๋ฅ ๊ฐ’ ๊ณ„์‚ฐ์— ์ฐธ์—ฌํ•œ ๋…ธ๋“œ๋งŒ ์—…๋ฐ์ดํŠธ ๋˜์–ด ์‹œ๊ฐ„์„ ์•„๋‚„ ์ˆ˜ ์žˆ๋‹ค.

๐Ÿชš Negative Sampling

Hierarchical Softmax ์˜ ๋Œ€์•ˆ์œผ๋กœ Noise Contrastive Estimation(NCE) ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฒ˜์Œ ๋ณด๋Š” ์šฉ์–ด์ธ๋ฐ ์ด๊ฒƒ์ด ๋ฌด์—‡์ผ๊นŒ?

  • A cost-computation algorithm used with CBoW and Skip-gram models
  • Applies the softmax not over the whole dataset but only over a sampled subset
  • With NCE, the problem can be recast as binary classification: distinguishing data drawn from the actual context (\(X\)) from data drawn from words outside the context (\(Y\))
  • The basic algorithm draws k contrastive words from a noise distribution and averages over them (Monte Carlo)

NCE ๋Š” log ํ™•๋ฅ ์„ maximize ํ•˜๋Š”๋ฐ ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๋Š” ๋ฐ˜๋ฉด Skip-gram ๋ชจ๋ธ์€ ์˜ค์ง ๊ณ ํ’ˆ์งˆ์˜ ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ๋ณธ ๋…ผ๋ฌธ์€ ๋ฒกํ„ฐ์˜ ํ’ˆ์งˆ์€ ์œ ์ง€ํ•˜๋ฉด์„œ NCE ๋ฅผ ๊ฐ„์†Œํ™”ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ฐ„์†Œํ™”๋œ ํ˜•ํƒœ๋ฅผ Negative Sampling ์ด๋ผ ํ•œ๋‹ค. \[\log \sigma\left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right)+\sum_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_{n}(w)}\left[\log \sigma\left(-v_{w_{i}}^{\prime}{ }^{\top} v_{w_{I}}\right)\right]\]

Negative Sampling

๋…ผ๋ฌธ์—์„œ๋Š” Skip-gram ์˜ objective ํ•จ์ˆ˜์— ์žˆ๋Š” \(\log P\left(w_{O} \mid w_{I}\right)\) ์‹์„ ๋ชจ๋‘ ์œ„ ์‹์œผ๋กœ ๊ต์ฒดํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค.

  • Left term: maximize the probability that the positive sample \(w_O\) is the output for the input word \(w_I\)
  • Right term: minimize the probability that each negative sample is the output for \(w_I\) → hence the dot product is multiplied by -1
    • Noise words are sampled from \(P_n(w)\), their unigram probability distribution
    • The unigram distribution assigns each word a probability proportional to how often it appears
    • The paper reports that the unigram distribution raised to the 3/4 power (\(U(w)^{3 / 4} / Z\)) works best empirically

So what is the difference between NCE and NEG?

  • NCE : needs both the samples and the numerical probabilities of the noise distribution
  • NEG : needs only the samples

NCE ๊ฐ™์€ ๊ฒฝ์šฐ softmax ์˜ log ํ™•๋ฅ ์„ maximize ํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์˜€๋‹ค. ์ฆ‰ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์˜€์œผ๋‚˜ ํ•ด๋‹น ๋…ผ๋ฌธ์˜ ์ฃผ์ถ•์ธ word2vec ๊ฐ™์€ ๊ฒฝ์šฐ word representation ์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๋†’์ด๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ์‚ผ์•˜๊ธฐ์— Negative Sampling ์„ ์ด์šฉํ•œ ๊ฒƒ์ด๋‹ค.

๐Ÿ‘ Subsampling of Frequent Words

์—„์ฒญ๋‚œ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๊ฒŒ ๋˜๋ฉด ์ ์€ ์–‘์˜ ์ •๋ณด๋ฅผ ์ฃผ์ง€๋งŒ ์ž์ฃผ ๋‚˜์˜ค๋Š” ๋‹จ์–ด๋“ค์ด ์žˆ๋‹ค. (ex. โ€œtheโ€, โ€œaโ€, โ€œisโ€, โ€ฆ). ํ•˜์ง€๋งŒ ํ’๋ถ€ํ•œ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋‹จ์–ด๋“ค ์ค‘์—์„œ๋Š” ๋นˆ๋„์ˆ˜๊ฐ€ ๋‚ฎ์€ ๋‹จ์–ด๋“ค๋„ ์žˆ๊ธฐ ๋งˆ๋ จ์ด๋‹ค.

๊ฐ ๋‹จ์–ด๋“ค์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜์˜ imbalance ํ•จ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์€ ๊ฐ„๋‹จํ•œ subsampling ๊ธฐ๋ฒ•์„ ์ ์šฉํ•œ๋‹ค. ๊ทธ๊ฒƒ์€ ๋ฐ”๋กœ discared probability

  • The goal is to filter out high-frequency words that carry little meaning
  • Each word's discard probability is set by \(P\left(w_{i}\right)=1-\sqrt{\frac{t}{f\left(w_{i}\right)}}\)
    • \(f(w_i)\) is the frequency with which word \(w_i\) appears
    • \(P(w_i)\) is the probability that word \(w_i\) is not sampled (i.e., discarded)
    • \(t\) is a chosen threshold

์ฆ‰, ์ž์‹ ๋“ค์ด ์ •ํ•œ threshold ๋ฅผ ๋„˜๊ธฐ๋Š” ๋นˆ๋„์ˆ˜์˜ ๋‹จ์–ด๋“ค์„ sampling ํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋นˆ๋„์ˆ˜๊ฐ€ ์ ์ง€๋งŒ ์ค‘์š”ํ•œ ๋‹จ์–ด์˜ representation vector ์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ํ–ฅ์ƒํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

๐ŸŽ๏ธ Learning Phrases

๋Œ€๋ถ€๋ถ„์˜ phrase ๋“ค์€ ๋‹จ์ˆœํžˆ ๊ฐœ๋ณ„ ๋‹จ์–ด๋“ค์„ ํ•ฉ์นœ ๊ฒƒ์ด ์•„๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ๊ทธ phrase ์˜ representation vector ๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ํŠน์ • phrase ์—์„œ๋งŒ ๋นˆ๋„ ์ˆ˜๊ฐ€ ๋†’์€ ๋‹จ์–ด ์Œ์„ ์ฐพ๋Š” ๊ฒƒ์„ ์‹œ์ž‘์œผ๋กœ ํ–ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

For example, combinations with a unique meaning such as New York Times or Toronto Maple Leafs are replaced with a single token, while frequent but meaningless combinations like "this is" or "there are" are left as they are.

์ด๋Ÿฌํ•œ ๋ฐฉ์‹์„ ํ†ตํ•ด vocabulary ์‚ฌ์ด์ฆˆ๋ฅผ ๋Š˜๋ฆฌ์ง€ ์•Š๊ณ  phrase ๋ฅผ ์ง์ ‘ ์ง€์ •ํ•ด์ฃผ์—ˆ๊ณ , data-driven ์„ ํ†ตํ•ด ์ž๋™์œผ๋กœ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค. ๊ทธ๋ ‡๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ์•„๋ž˜์˜ ์‹์„ ์ ์šฉํ•ด ์ •ํ•ด๋†“์€ ๊ธฐ์ค€๋ณด๋‹ค ๋†’์œผ๋ฉด ํ•˜๋‚˜์˜ ๋‹จ์–ด๋กœ ์ธ์‹ํ•˜๋„๋ก ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค. \[\operatorname{score}\left(w_{i}, w_{j}\right)=\frac{\operatorname{count}\left(w_{i} w_{j}\right)-\delta}{\operatorname{count}\left(w_{i}\right) \times \operatorname{count}\left(w_{j}\right)} .\]

Phrase Score

score = (number of times the two words co-occur - \(\delta\)) / (product of each word's individual count)

Here \(\delta\) is a discounting hyperparameter that prevents phrases from being formed out of very infrequent word combinations.
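A minimal sketch of one scoring pass over a tokenized corpus (the `delta` value is illustrative):

```python
from collections import Counter

# Minimal sketch of one phrase-scoring pass.
def phrase_scores(tokens, delta=5):
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigram[a] * unigram[b])
        for (a, b), n in bigram.items()
    }

# Bigrams scoring above a chosen threshold are merged into one token,
# e.g. "new york" -> "new_york"; the paper runs 2-4 such passes with a
# decreasing threshold so that longer phrases can form.
```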

(Figure: analogy test dataset)

์ด 5๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ analogy test ๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค. ๊ฐ๊ฐ ์•ž์ชฝ 3๊ฐœ์˜ ์—ด์„ ์ด์šฉํ•ด ๋งˆ์ง€๋ง‰ ์—ด์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฌธ์ œ์ด๋‹ค.

โš™๏ธ Additive Compositionality

The last topic is the compositionality mentioned earlier. The paper shows that the word and phrase representations learned by the Skip-gram model can solve the analogical reasoning task accurately with simple arithmetic.

  • Skip-gram ์˜ ๋ชฉ์ ์€ ๊ณ ํ’ˆ์งˆ์˜ word representation ๋“ค์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ
    • ์ด๋Š” ์ค‘์‹ฌ๋ถ€ ๋‹จ์–ด์˜ context ๋ฅผ ์ด์šฉํ•ด ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ๋งž์ถœ ํ™•๋ฅ ์„ maximize ํ•˜๋Š” ๊ฒƒ
  • ๋‘ ๋‹จ์–ด์˜ vector ๋ฅผ ๋”ํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋‘ ๋ฌธ๋งฅ์„ AND ์—ฐ์‚ฐํ•œ๋‹ค๋Š” ๊ฒƒ
  • ๊ทธ๋ ‡๊ธฐ์— ๋‘ ๋‹จ์–ด์˜ ํ•ฉ์„ ํ†ตํ•ด ๊ทธ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์žˆ๋˜ context ์ •๋ณด๋ฅผ ํ•ฉ์น  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ

(Figure: element-wise addition)
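As a sketch of what such an addition looks like in code (again assuming a dictionary `emb` of unit-normalized word and phrase vectors, with phrases stored as single tokens like "volga_river"):

```python
import numpy as np

# Minimal sketch of additive compositionality:
# the nearest neighbors of an element-wise vector sum.
def nearest_to_sum(emb, w1, w2, topn=3):
    query = emb[w1] + emb[w2]
    query /= np.linalg.norm(query)
    sims = {w: v @ query for w, v in emb.items() if w not in (w1, w2)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# e.g. nearest_to_sum(emb, "russia", "river") should rank "volga_river" high
```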

๐Ÿš€ Conclusion

๋…ผ๋ฌธ์„ ๋๋งˆ์น˜๋ฉด์„œ ํŠน์ดํ•˜๊ฒŒ representation vector ์˜ ํ’ˆ์งˆ์— ์˜ํ–ฅ์„ ์คฌ๋˜ hyper parameter ๋“ค์„ ์†Œ๊ฐœํ•œ๋‹ค.

  • Choice of Model Architecture
  • Size of the vectors
  • Subsampling rate
  • Size of the training window

Still, the OOV problem remains unsolved (it was 2013, after all...), and for additive compositionality I wonder whether plain addition is really all that is possible. It is also a pity that the paper does not quantify how much subsampling of frequent words improved the accuracy on less frequent words: the previous paper noted that less frequent words ended up with poor representation vectors, so it would have been nice to see the improvement expressed numerically.