Paper Review - Word2vec (1)

Efficient Estimation of Word Representations in Vector Space

The thumbnail for this post was made with Thumbnail-Maker by Wonkook Lee.

A popular way to use DNNs in recommender systems is to build a representation vector for each content item or user and recommend by similarity between those vectors. The foundation of that technique is word2vec. Since I believe the basics should come first, I plan to read this paper and then actually implement it. Let's read it step by step, and in depth.

🏎️ Abstract

  • Introduces two new model architectures for computing continuous vector representations of words from very large datasets.
  • The quality of these representations is measured on a word similarity task and compared with the previously best-performing techniques based on various types of neural networks.
  • Large improvements in accuracy, at a much lower computational cost.
  • State-of-the-art performance on the authors' own test set for measuring syntactic and semantic word similarities.
  • Syntactic similarity
    • Inferring grammatical similarity, as in big-bigger-biggest / small-smaller-smallest
  • Semantic similarity
    • Seoul and Korea are related in meaning

📀 Introduction

  • Recent (as of 2013) NLP systems treated words as atomic units
    • Each word is represented as an index in a vocabulary
    • The advantages are simplicity and robustness
      • If words are one-hot vectors, encoding and decoding form a 1:1 mapping, which is probably what the paper means by robustness
      • With continuous vectors, by contrast, recovering the exact word 100% of the time is hard
    • Simple models trained on huge amounts of data outperform complex models trained on less data

However, representing words purely as indices (one-hot) naturally comes with many limitations.

  • It cannot express any relatedness between words
  • Performance is limited in areas such as ASR and machine translation → it depends on the amount of data available
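To make the first limitation concrete, here is a minimal sketch assuming a tiny, made-up four-word vocabulary (the words and helper names are illustrative, not from the paper). Encoding and decoding are an exact 1:1 mapping, but the dot product between any two different one-hot vectors is always 0, so related and unrelated words look equally far apart.

```python
# Toy vocabulary; the words are illustrative, not taken from the paper.
vocab = ["seoul", "korea", "paris", "france"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


def decode(vec):
    """Recover the word exactly: the 1:1 mapping behind the 'robustness'."""
    return vocab[vec.index(1)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Encoding followed by decoding is lossless.
print(decode(one_hot("korea")))  # korea

# Any two different words are orthogonal, whether related or not.
print(dot(one_hot("seoul"), one_hot("korea")))   # 0
print(dot(one_hot("seoul"), one_hot("france")))  # 0
```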


With progress in machine learning techniques, it has become possible to train more complex models on large datasets. Since NN-based language models significantly outperform statistical N-gram models, the paper sets out to exploit the distributed representations they learn.

  • What is a distributed representation?
    • Following the distributional hypothesis, a word's vector is determined by the distribution of the words around it
    • Lower-dimensional than a one-hot vector, but dense (because it encodes information about the surrounding word distribution)

Goals of the paper

  • The main goal is to introduce techniques for learning high-quality word vectors
    • With good word vectors, similar words not only end up close to each other but can also have multiple degrees of similarity
    • Multiple degrees of similarity
      • The same noun stays similar across differences in form such as singular / plural (e.g. apple / apples)
      • A single word can share several kinds of similarity with other words (big-bigger-biggest)
    • Simple algebraic operations become possible on the embedding vectors
      • vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen"), i.e. the result lies closest to the vector for "Queen"
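The vector arithmetic above can be sketched with hand-made toy vectors. Everything here is an assumption for illustration: the 2-D values are invented (axis 0 roughly "royalty", axis 1 roughly "gender"), whereas real word2vec vectors are learned and have hundreds of dimensions. The answer is the remaining word whose vector has the highest cosine similarity to the computed target.

```python
import math

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are invented for illustration; real vectors are learned.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the search.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```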

The paper develops new model architectures that try to maximize accuracy while preserving the linear regularities among words (the simple algebraic relations shown above).

To do so, it uses the test set mentioned in the Abstract and shows that these linear regularities can be learned with high accuracy.

Previous Work

Attempts to represent words as continuous vectors have a long history.

  • NNLM overview
    • Consists of one linear projection layer and one non-linear hidden layer
    • What the paper finds interesting is an NNLM architecture in which the word vectors are first learned with a network that has a single hidden layer
    • The paper focuses on this first step and extends the architecture so that word vectors are learned with a simple model

🧽 Model Architecture

Many earlier models (LSA, LDA, etc.) also tried to estimate continuous representations of words.

This paper focuses on distributed representations of words learned by neural networks. These have been shown to perform significantly better than LSA at preserving linear regularities among words. LDA, in addition, becomes computationally very expensive on large datasets.

Time Complexity O

\[\begin{align*} O &= E \times T \times Q \\ E &= \text{number of training epochs} \\ T &= \text{number of words in the training set} \\ Q &= \text{cost per training example, defined separately for each model architecture} \end{align*}\]

NNLM

Figure: NNLM structure

  • The NNLM consists of four layers: input, projection, hidden, and output.
  • At the input layer, each of the N previous words is encoded as a one-hot vector of size V (the vocabulary size)
  • These then pass through the projection layer
    • Unlike a hidden layer, the projection layer multiplies by a weight matrix but has no activation function
    • Lookup table: the vectors obtained by multiplying the one-hot inputs with \(W_p\) (\(N \times D\) values in total)


Figure: Projection layer computation

  • Each word passes through the lookup table to produce a \(D\)-dimensional vector, and the \(N\) vectors are concatenated in the projection layer (\(N \times D\)).
  • The projection layer is multiplied by the hidden weight matrix and passed through \(\tanh\) to produce the hidden layer. This matrix is written \(W_h\) here to keep it distinct from the lookup table \(W_p\).
  • Finally, the hidden layer is multiplied by \(W_o\) (\(H \times V\)) to produce the output layer, which feeds the cross-entropy loss.
  • Shape of \(W_h\): \((N \times D) \times H\)
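The layer-by-layer flow described above can be sketched in plain Python. This is a toy forward pass under stated assumptions, not the paper's implementation: the sizes are tiny, the weights are random, biases are omitted, and the names `lookup`, `W_h` and `W_o` are chosen here for the lookup table, the projection-to-hidden matrix and the hidden-to-output matrix.

```python
import math
import random

random.seed(0)

V, N, D, H = 6, 2, 3, 4  # vocab size, context length, embedding dim, hidden dim


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


lookup = random_matrix(V, D)   # lookup table: one D-dim row per word
W_h = random_matrix(N * D, H)  # projection -> hidden weights, (N*D) x H
W_o = random_matrix(H, V)      # hidden -> output weights, H x V


def vec_mat(vec, mat):
    """Multiply a row vector (one entry per row of mat) by a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nnlm_forward(context_ids):
    # Projection layer: look up each context word and concatenate (no activation).
    projection = [x for idx in context_ids for x in lookup[idx]]
    # Hidden layer: linear map followed by tanh.
    hidden = [math.tanh(x) for x in vec_mat(projection, W_h)]
    # Output layer: scores over the vocabulary, normalized into probabilities.
    return projection, softmax(vec_mat(hidden, W_o))


projection, probs = nnlm_forward([1, 4])
print(len(projection), len(probs))  # 6 6
```

The three matrix sizes line up with the three terms of \(Q\): the lookup costs \(N \times D\), the hidden layer \(N \times D \times H\), and the full-softmax output \(H \times V\).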

Figure: Final structure

Computing the per-example complexity \(Q\) introduced above for the NNLM gives \[Q = N \times D + N \times D \times H + H \times V\]

  • In practice the dominating term is \(N \times D \times H\).
  • Taken at face value, \(H \times V\) would dominate, but there are techniques that shrink it
    • Avoiding normalization: the paper says complexity can be reduced by using models that are not normalized during training, but I do not fully understand this part
    • Hierarchical softmax: this is the approach the paper uses
      • With hierarchical softmax and the vocabulary represented as a Huffman binary tree, the output term drops from \(H \times V\) to about \(H \times \log_2(V)\)
    • The reason is that with a binary tree over the vocabulary, only about \(\log_2(V)\) output units have to be evaluated per word
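To see where the roughly \(\log_2(V)\) factor comes from, here is a small sketch that builds a Huffman tree from word counts and reports each word's depth, i.e. how many binary decisions are needed to reach it. It only computes path lengths and does not implement the hierarchical softmax itself. The counts are invented for illustration and `huffman_code_lengths` is a hypothetical helper name.

```python
import heapq
import itertools


def huffman_code_lengths(frequencies):
    """Return {word: code length} for a Huffman tree built from word counts."""
    tie_breaker = itertools.count()  # keeps heap comparisons well-defined on ties
    heap = [(count, next(tie_breaker), {word: 0}) for word, count in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: d + 1 for w, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return heap[0][2]


# Toy counts, invented for illustration.
counts = {"the": 50, "of": 30, "cat": 10, "sat": 6, "zebra": 4}
lengths = huffman_code_lengths(counts)
print(lengths["the"], lengths["zebra"])  # 1 4
```

Frequent words sit near the root and get short paths, so the average number of decisions per training word can be even smaller than for a balanced tree.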

Limitations of NNLM

  • The biggest limitation is the fixed-length input
  • Only the preceding N words can be used, so the model learns from a limited context

RNNLM

A model proposed to overcome the limitations of NNLM.

  • In theory, RNNs can represent more complex patterns more efficiently than shallow NNs
  • The RNN has no projection layer, and it has a distinctive structure in which the hidden layer is connected back to itself
  • This recurrence gives the model a form of short-term memory of earlier inputs, which makes it sequential
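The recurrent connection can be sketched as a single update rule applied word by word. This is a toy illustration under assumptions: the sizes and random weights are made up, the sigmoid activation and the names `U`, `W`, `rnn_step` and `encode` are choices made here, and no output layer is shown.

```python
import math
import random

random.seed(0)

V, H = 5, 3  # toy vocabulary size and hidden size (illustrative values)


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


U = random_matrix(V, H)  # input word -> hidden; a one-hot input just selects a row
W = random_matrix(H, H)  # hidden -> hidden recurrent weights (the H x H term)


def rnn_step(word_id, prev_hidden):
    """New hidden state from the current word and the previous hidden state."""
    recurrent = [sum(prev_hidden[i] * W[i][j] for i in range(H)) for j in range(H)]
    return [1.0 / (1.0 + math.exp(-(U[word_id][j] + recurrent[j]))) for j in range(H)]


def encode(word_ids):
    """Run the recurrence over a sequence, starting from a zero state."""
    hidden = [0.0] * H
    for word_id in word_ids:
        hidden = rnn_step(word_id, hidden)  # the state carries earlier words forward
    return hidden


print(encode([2, 0, 4]))
```

Because each state depends on the previous one, the same words in a different order give a different final state, and there is no fixed limit N on how far back the context reaches.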

Time Complexity of RNN (Q) \[Q = H \times H + H \times V\]

  • The word representations D have the same dimensionality as the hidden layer H, so the recurrent term is \(H \times H\).
  • As with NNLM, hierarchical softmax reduces \(H \times V\) to about \(H \times \log_2(V)\).

🎇 New Log-linear Models

The paper proposes two new models for learning distributed representations while minimizing computational complexity.

  • In the earlier architectures, most of the complexity came from the non-linear hidden layer.
  • Non-linearity is what makes NNs attractive, but the paper explores learning with simpler models instead

Figure: CBOW & Continuous Skip-gram

The paper builds on the idea that distributed vectors can be learned in two steps.

  • Learn continuous word vectors with a simple model
    • Continuous Bag-of-Words
    • Skip-gram
  • Train an N-gram NNLM on top of those vectors

Continuous Bag-of-Words

  • Similar in structure to NNLM, but the non-linear hidden layer is removed
  • The projection layer is shared by all words
  • The projected vectors of all context words are averaged, and that average is the projection layer output
  • Word order has no influence on the result

Unlike NNLM, it uses future words as well as previous words.

The paper reports using the four previous and four following words as input, with the training criterion being to correctly classify the middle word.
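A minimal sketch of the CBOW projection step, assuming a toy vocabulary, random embeddings and a window of 2 on each side instead of the paper's 4. The helper names `cbow_projection` and `context_window` are hypothetical. Because the context vectors are simply averaged, reversing the context gives the same projection up to floating-point rounding.

```python
import random

random.seed(0)

V, D = 8, 4  # toy vocabulary size and embedding dimension
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]


def cbow_projection(context_ids):
    """Average the context word vectors; position in the window is ignored."""
    total = [0.0] * D
    for idx in context_ids:
        for j in range(D):
            total[j] += embeddings[idx][j]
    return [x / len(context_ids) for x in total]


def context_window(tokens, center, size):
    """IDs of up to `size` words before and after position `center`."""
    left = tokens[max(0, center - size):center]
    right = tokens[center + 1:center + 1 + size]
    return left + right


tokens = [3, 1, 4, 1, 5, 2, 6]
context = context_window(tokens, center=3, size=2)
print(context)  # [1, 4, 5, 2]

forward = cbow_projection(context)
backward = cbow_projection(list(reversed(context)))
# Same average either way, up to floating-point rounding.
print(all(abs(a - b) < 1e-12 for a, b in zip(forward, backward)))  # True
```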

Time Complexity (Q) of CBOW \[Q = N \times D + D \times \log_2(V)\]

Compared with NNLM, the terms involving the hidden layer are gone.

Continuous Skip-gram

Similar to CBOW, but instead of predicting the center word, it uses the center word to maximize the classification of the surrounding words.

  • The current word is fed as input to a log-linear classifier with a continuous projection layer
  • The model predicts the words within a certain range before and after the current word
  • Increasing this range improves the quality of the word vectors, but also raises the complexity
  • Words that are farther apart are usually less related, so distant words are sampled less often, which gives them less weight
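The sampling idea can be sketched as follows, assuming the scheme the paper describes: for each center word an effective window R is drawn at random between 1 and the maximum distance C, and only words within R positions are used. The function name `skipgram_pairs` and the example sentence are made up. With R uniform on 1..C, a word at distance d is used with probability (C - d + 1)/C, so nearby words appear in more training pairs.

```python
import random


def skipgram_pairs(tokens, max_window, rng):
    """Return (center, context) training pairs with a randomly shrunk window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Draw the effective window R uniformly from 1..max_window (inclusive).
        r = rng.randint(1, max_window)
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox", "jumps"]

# With max_window=1 the window is always 1, so the result is deterministic.
print(skipgram_pairs(sentence, max_window=1, rng=random.Random(0)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'),
#  ('brown', 'fox'), ('fox', 'brown'), ('fox', 'jumps'), ('jumps', 'fox')]
```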

Time Complexity (Q) of Skip-gram \[Q = C \times (D + D \times \log_2(V))\]

  • C is the maximum distance between words
  • A number R is chosen at random from the range [1, C], and the R words before and the R words after the current word are predicted
  • Since prediction runs on both sides of the center word, a total of 2R word classifications is required
  • The expected value of R is \((1 + C)/2\), so the expected number of classifications 2R is about C, which is where the factor C in the formula above comes from
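To compare the four architectures side by side, the sketch below plugs numbers into the \(Q\) formulas from this write-up, using the hierarchical-softmax form \(H \times \log_2(V)\) for the output term of NNLM and RNNLM. The sizes are illustrative assumptions, not the paper's experimental settings, and the function names are made up here.

```python
import math


def q_nnlm(n, d, h, v):
    """N*D + N*D*H + H*log2(V), with hierarchical softmax on the output."""
    return n * d + n * d * h + h * math.log2(v)


def q_rnnlm(h, v):
    """H*H + H*log2(V), with hierarchical softmax on the output."""
    return h * h + h * math.log2(v)


def q_cbow(n, d, v):
    """N*D + D*log2(V)."""
    return n * d + d * math.log2(v)


def q_skipgram(c, d, v):
    """C * (D + D*log2(V))."""
    return c * (d + d * math.log2(v))


def expected_window(c):
    """Mean of R drawn uniformly from 1..C, which equals (1 + C) / 2."""
    return sum(range(1, c + 1)) / c


# Illustrative sizes only; V is a power of two so log2(V) is exactly 20.
N, D, H, V, C = 8, 300, 500, 2 ** 20, 10
for name, cost in [
    ("NNLM", q_nnlm(N, D, H, V)),
    ("RNNLM", q_rnnlm(H, V)),
    ("CBOW", q_cbow(N, D, V)),
    ("Skip-gram", q_skipgram(C, D, V)),
]:
    print(f"{name:10s} {cost:>12,.0f}")

print(expected_window(C))  # 5.5, so 2R is about C + 1 on average
```

With these sizes the costs come out to 1,212,400 for NNLM, 260,000 for RNNLM, 8,400 for CBOW and 63,000 for Skip-gram, which shows how much of the cost sat in the hidden layer.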

Results

Earlier work evaluated word vectors in an intuitive way: given a word, show the words most similar to it.

That approach has trouble capturing more complex relationships.

  • Similarity between words can take many forms.
    • big is similar to bigger in the same sense that small is similar to smaller
    • The big-biggest pair is similar to the small-smallest pair
  • Such similarities can be computed with simple algebraic operations
    • e.g. vector("biggest") - vector("big") + vector("small") ≈ vector("smallest")

The paper also reports that, given a large amount of training data, the vectors can capture even subtle semantic relationships.

  • e.g. France-Paris / Germany-Berlin

The paper suggests that exploiting such semantic relationships can improve many areas of NLP.

Task Description

Table: Test set

  • The test set contains five types of semantic questions and nine types of syntactic questions
  • The questions are created in two steps
    • A list of similar word pairs is made by hand
    • Two word pairs are connected to form a question
  • Only single-token words are used
  • As the table above shows, each question is built from word pair 1 and word pair 2, which seems to be what makes the algebraic operation possible
  • An answer that is only a synonym of the expected word is counted as wrong
    • So reaching 100% accuracy is likely impossible
    • The authors accept this because they believe that, for some applications, the usefulness of the word vectors should be positively correlated with this accuracy metric
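The two-step construction and the strict scoring rule can be sketched like this. The word pairs are hypothetical examples of one relation, `predict` stands for any function that answers "a is to b as c is to ?", and both helper names are made up for this sketch.

```python
from itertools import permutations


def build_questions(pairs):
    """Connect every ordered pair of distinct word pairs into (a, b, c, answer)."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


def exact_match_accuracy(questions, predict):
    """Fraction answered with exactly the expected word; synonyms count as wrong."""
    correct = sum(1 for a, b, c, d in questions if predict(a, b, c) == d)
    return correct / len(questions)


# Hypothetical hand-made pairs for one relation (country -> capital).
capital_pairs = [("France", "Paris"), ("Germany", "Berlin"), ("Korea", "Seoul")]
questions = build_questions(capital_pairs)
print(len(questions))  # 6
print(questions[0])    # ('France', 'Paris', 'Germany', 'Berlin')
```

A question such as ('France', 'Paris', 'Germany', 'Berlin') corresponds to computing vector("Paris") - vector("France") + vector("Germany") and checking whether the nearest word is exactly "Berlin".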

⛳ Wrap-up

The rest of the paper closes by stressing the significance of obtaining rich word vectors at a low computational complexity.

The main points of the paper are:

  • The hidden layer, which accounted for most of the computational complexity, is removed
  • Even such simple models yield high-quality word vectors.
  • Because the computational cost drops dramatically, good-quality vectors can be learned from large datasets even at high dimensionality

However, word2vec has clear limitations.

  • The out-of-vocabulary (OOV) problem
    • No vector can be produced for a word that was not seen during training
  • Dependence on word frequency
    • If a word appears only rarely, the quality of its vector will inevitably be poor.

FastText from Facebook is a model proposed to overcome these limitations. It reportedly addresses the OOV problem by running skip-gram over subwords (character n-grams) rather than whole words only.
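As a rough sketch of the subword idea only, and not of FastText's actual implementation, the function below splits a word into character n-grams after wrapping it in boundary markers. The function name, the parameter names and the default n-gram range are assumptions made for this example.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen in training.
shared = set(char_ngrams("wherever")) & set(char_ngrams("where"))
print(sorted(shared))
```

If each n-gram has its own vector, a word missing from the training vocabulary can still get a vector from the n-grams it shares with known words.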

That concludes this look at word2vec, which brought major progress to word embeddings. I reviewed this paper because it is the foundation of its successors in recommender systems, such as item2vec and song2vec.

As a bonus, it was a good opportunity to learn about NNLM.