
[CS224N] 2. Neural Classifiers

์žฅ์˜์ค€ 2023. 8. 2. 00:30

I'm writing this post to organize and share what I learned from the second lecture of CS224N. For reference, I took the 2021 Winter version of the course.


Review: Main idea of Word2Vec & Negative Sampling

์ง€๋‚œ๋ฒˆ ๋ธ”๋กœ๊ทธ์—์„œ Word2Vec์— ๊ด€ํ•ด ์ •๋ฆฌํ–ˆ๋‹ค. ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ณต๊ธฐํ•ด ๋ณด์ž.

Word2Vec uses two algorithms: CBOW and Skip-gram.

1. CBOW

๋งฅ๋ฝ ๋ฒกํ„ฐ๊ฐ€ ์ž…๋ ฅ, ์ค‘์‹ฌ ๋ฒกํ„ฐ๊ฐ€ ์ถœ๋ ฅ์ธ ๊ฒฝ์šฐ๋ฅผ CBOW ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ผ๊ณ  ํ•œ๋‹ค. CBOW์˜ ์ „์ฒด์ ์ธ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

๊ณผ์ •์„ ์š”์•ฝํ•˜์ž๋ฉด,

1. The context words, 2 × the specified window size of them, enter as input in one-hot encoded form.

2. Each input is multiplied by the first weight matrix W, and the average of the results becomes the hidden vector M.

3. This vector is multiplied by the second weight matrix W' and passed through the softmax function, producing the prediction vector ŷ.

4. Finally, the cross-entropy loss lets the model learn to predict which word is the center word.

In this final step, where the loss is computed with the cross-entropy function, Gradient Descent is used to keep updating the weights.
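To make the four steps concrete, here is a minimal NumPy sketch of the CBOW forward pass and loss. The vocabulary size, dimensions, and word indices are toy values I made up for illustration:

```python
import numpy as np

V, d = 7, 4                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, d))     # first weight matrix W (input embeddings)
W2 = rng.normal(0, 0.1, (d, V))    # second weight matrix W'

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

context_ids = [0, 1, 3, 4]         # 2 x window-size context words (window = 2)
center_id = 2                      # the true center word

m = W[context_ids].mean(axis=0)    # step 2: multiply one-hots by W, then average
y_hat = softmax(m @ W2)            # step 3: multiply by W', apply softmax
loss = -np.log(y_hat[center_id])   # step 4: cross-entropy against the center word
print(loss)                        # gradient descent minimizes this loss
```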

2. Skip-gram

Skip-gram์€ ์ค‘์‹ฌ ๋ฒกํ„ฐ๋ฅผ ํ†ตํ•ด ๋งฅ๋ฝ ๋ฒกํ„ฐ๋“ค์„ ์˜ˆ์ธกํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. Skip-gram์˜ ์ „์ฒด์ ์ธ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

์—ฌ๊ธฐ์„œ์˜ ์ถœ๋ ฅ ๋ฒกํ„ฐ๋Š” CBOW์˜ ์ž…๋ ฅ ๋ฒกํ„ฐ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๊ณ  ํฌ๊ธฐ๊ฐ€ ์œˆ๋„์šฐ ํฌ๊ธฐ์˜ 2๋ฐฐ์ด๋‹ค.

์ด์™ธ ๋‹ค๋ฅธ ๊ณ„์‚ฐ ๋ฐฉ์‹์€ ์œ„์˜ CBOW์˜ ๊ณ„์‚ฐ ๋ฐฉ์‹๊ณผ ์ˆœ์„œ๋งŒ ๋‹ค๋ฅด๊ณ , ์ด์™ธ์˜ ๊ฒƒ๋“ค์€ ๊ฐ™๋‹ค.

ํ•ด๋‹น ๊ฐ•์˜์—์„œ๋Š” Skip-gram ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ธฐ๋ฐ˜์˜ Word2Vec์„ ์‚ฌ์šฉํ•œ๋‹ค.

 

3. Optimization of Word2Vec

๊ธฐ์กด์˜ Gradient Descent ๋ฐฉ์‹์€ ๊ณ„์‚ฐํ•˜๊ธฐ์— ๋น„์šฉ์ด ๋„ˆ๋ฌด ๋งŽ์ด ๋“ ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด Stochastic Gradient Descent (SGD) ๊ธฐ๋ฒ•์„ ์ ์šฉํ•œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์œ„ ์ด๋ฏธ์ง€์™€ ๊ฐ™์ด, Word2Vec์˜ ์ž…, ์ถœ๋ ฅ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” one-hot encoding์˜ ๊ฒฝ์šฐ, ๋„ˆ๋ฌด sparse ํ•œ vector๋ผ SGD๋ฅผ ์ ์šฉํ•˜๊ธฐ ๋น„ํšจ์œจ์ ์ด๋ผ๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ๋‹ค. 0์—์„œ์˜ gradient๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด ํ•ญ์ƒ 0 ์ผ ํ…๋ฐ, ๊ทธ๊ฒƒ์„ ๋ฌด์‹œํ•˜๊ณ  ๊ณ„์† ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ๋ถˆํ•„์š”ํ•œ ์ง“์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋ฅผ ์œ„ํ•ด negative sampling์ด๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค.

 

4. Negative Sampling

Negative Sampling์˜ ์ค‘์‹ฌ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์„ ๊ธ์ •(positive), ๋žœ๋ค์œผ๋กœ ์ƒ˜ํ”Œ๋ง๋œ ๋‹จ์–ด๋“ค์„ ๋ถ€์ •(negative)์œผ๋กœ ๋ ˆ์ด๋ธ”๋ง ํ•˜์—ฌ ์ด์ง„ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ๋งŒ๋“ค์–ด ์ด์ง„ ์„ ํ˜• ํšŒ๊ท€๋ฅผ ํ•™์Šต์‹œํ‚จ๋‹ค.

Example

์˜ˆ๋ฅผ ๋“ค์–ด, ์ด์ „ ๋ธ”๋กœ๊ทธ์— ์ผ๋˜ 'The fat cat sat on the mat'๋ผ๋Š” ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด ๋ณด์ž.

Skip-gram ๋ฐฉ์‹์€ ํ•˜๋‚˜์˜ ์ค‘์‹ฌ ๋‹จ์–ด๋กœ๋ถ€ํ„ฐ ๋งฅ๋ฝ ๋‹จ์–ด๋“ค์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์ด์ง€๋งŒ, negative sampling์„ ์‚ฌ์šฉํ•˜๋ฉด positive๊ณผ negative ๋‹จ์–ด๋“ค์„ sampling ํ•ด์•ผ ํ•œ๋‹ค.

Source: https://wikidocs.net/69141

์œ„ ์‚ฌ์ง„๊ณผ ๊ฐ™์ด, ์ง€์ •๋œ window ๋‚ด์˜ ๋‹จ์–ด๋“ค์„ sampling ํ•œ ๊ฒƒ์„ positive sampling, ๋‹จ์–ด corpus ๋‚ด์˜ ๋‹จ์–ด๋“ค์„ random ํ•˜๊ฒŒ sampling ํ•œ ๊ฒƒ์„ negative sampling์ด๋ผ๊ณ  ํ•œ๋‹ค. positive sampling์€ label์„ 1๋กœ, negative sampling์€ 0์„ label๋กœ ๊ฐ–๋Š”๋‹ค.

์ดํ›„, ์œ„์™€ ๊ฐ™์ด ํ•ด๋‹น pair์„ ๋‘ ๊ฐœ์˜ ์ž…๋ ฅ์œผ๋กœ ์„ ์ •ํ•˜๋Š”๋ฐ, ์ž…๋ ฅ 1์€ ์ค‘์‹ฌ ๋‹จ์–ด์˜ embedding layer(๊ณ ์ •)๋กœ, ์ž…๋ ฅ 2๋Š” ํ•ด๋‹น ๋‹จ์–ด๋“ค์˜ embedding layer๋กœ ์ง€์ •ํ•œ๋‹ค. ๋‘ ๋‹จ์–ด๋“ค์€ ๋ชจ๋‘ ํ•œ ๋‹จ์–ด corpus ๋‚ด์—์„œ ๋‚˜์˜จ ๋‹จ์–ด๋“ค์ด๊ธฐ ๋•Œ๋ฌธ์— embedding layer์˜ ํฌ๊ธฐ๋Š” ๊ฐ™๋‹ค.

The above shows the layers after this embedding step is complete.

Then the dot product of the center word and the sampled word becomes the model's prediction, and the error against the label is backpropagated to update the embedding vectors of both words. After training, you can use the left embedding matrix as the word embeddings, or add the two matrices together, or concatenate them.

The formula

Expressed as a formula, the negative sampling we just walked through looks like this:
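The per-window loss from the lecture, writing v_c for the center word vector, u_o for the vector of an actual context word o, and u_k for the vectors of the K randomly sampled negative words, is:

J(v_c, o, U) = −log σ(u_oᵀ v_c) − Σ_{k=1}^{K} log σ(−u_kᵀ v_c)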

๊ฐ€์šด๋ฐ -๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์•ž, ๋’ค 2๊ฐœ์˜ ์š”์†Œ๋ฅผ ๋‚˜๋ˆ ๋ณด๊ณ , ๊ทธ์— ๋Œ€ํ•ด ๋ถ„์„ํ•ด ๋ณด์ž.

1. The first term

์ด ์ˆ˜์‹์€ center word์™€ window ๋‚ด์˜ word์˜ ๋‚ด์ ๊ฐ’, positive sampling์„ ๋œปํ•œ๋‹ค. ์ด ๋‚ด์ ๊ฐ’์„ ๋‹ค์‹œ sigmoid function์„ ํ†ตํ•ด 0๊ณผ 1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š”๋ฐ, ์ด ๊ฐ’์ด ์ตœ๋Œ€ํ™”์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค.

 

2. The second term

์ด ์ˆ˜์‹์€ random ํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋งํ•œ ๋‹จ์–ด์™€ ์ค‘์‹ฌ ๋‹จ์–ด์˜ ๋‚ด์ ๊ฐ’, negative sampling์„ ๋œปํ•œ๋‹ค. ์ด ๋‚ด์ ๊ฐ’ ๋˜ํ•œ sigmoid funciton์„ ํ†ตํ•ด 0๊ณผ 1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š”๋ฐ, negative sampling ๊ฐ’์ด๋ฏ€๋กœ, ์ด ๊ฐ’์„ ์ตœ์†Œํ™”์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค.

sigmoid function์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

์œ„ ๊ทธ๋ž˜ํ”„์—์„œ ๋ณด๋‹ค์‹œํ”ผ, input ๊ฐ’์ด ์Œ์ˆ˜์ผ ๊ฒฝ์šฐ, ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘์€ ๊ฐ’์ด ์ถœ๋ ฅ๋˜๊ณ , ์–‘์ˆ˜์ผ ๊ฒฝ์šฐ ์ƒ๋Œ€์ ์œผ๋กœ ๋†’์€ ๊ฐ’์ด ์ถœ๋ ฅ๋˜๋ฏ€๋กœ, negative sampling์—์„œ๋Š” input ๊ฐ’์„ ์Œ์ˆ˜๋กœ, positive sampling์—์„œ๋Š” input ๊ฐ’์„ ์–‘์ˆ˜๋กœ ์„ค์ •ํ–ˆ๋‹ค.

 

์ตœ์ข…์ ์œผ๋กœ ์ •๋ฆฌํ•˜๋ฉด, negative sampling์˜ ๋ชฉ์ ์€ ์‹ค์ œ window ๋‚ด์˜ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”์‹œํ‚ค๊ณ , window ์™ธ๋ถ€์˜ ๋žœ๋คํ•œ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ์ตœ์†Œํ™”์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค.

 


Co-occurrence matrix

Skip-gram์—์„œ๋Š” count-based co-occurrence matrix๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

1. Window based co-occurrence matrix

A window-based co-occurrence matrix is built by counting, sentence by sentence, how many times each word appears within the window around every other word.

Such a matrix has the advantage of capturing both syntactic and semantic information.
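A small sketch of how such a matrix can be built, using the example sentence from earlier as a one-sentence corpus and a window size of 2 (both arbitrary choices):

```python
import numpy as np

corpus = ["the fat cat sat on the mat".split()]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

# X[i, j] = how often word j appears within `window` words of word i
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X)
```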

2. Word-document matrix

A word-document matrix counts how many times each word appears in each document. It rests on the premise that certain words appear especially frequently in particular documents (used for, e.g., measuring similarity between documents, tf-idf, and so on).

Source: https://velog.io/@tobigs-text1314/CS224n-Lecture-2-Word-Vectors-and-Word-Senses

๊ทธ๋Ÿฌ๋‚˜ ์ด์™€ ๊ฐ™์€ count-based matrix๋Š” ๋‹จ์–ด ์–‘์— ๋”ฐ๋ผ vector์–‘ ๋˜ํ•œ ์ฆ๊ฐ€ํ•œ๋‹ค. ๊ทธ๋ž˜์„œ SVD ๋˜๋Š” LSA ๋“ฑ์„ ์ด์šฉํ•˜์—ฌ ์ฐจ์›์„ ์ถ•์†Œ์‹œํ‚จ ํ›„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์ •๋ณด๋ฅผ ์ž‘์€ ์ฐจ์›์˜ ํ–‰๋ ฌ์•ˆ์— ํฌํ•จ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋‚ณ๋Š”๋‹ค.

3. SVD (Singular Value Decomposition)

(To be filled in later.)


GloVe (Global Vectors for Word Representation)

1. Principle

So far we have looked at both count-based and direct-prediction methods.

์œ„ ์‚ฌ์ง„์— ๋‚˜์™€์žˆ๋“ฏ์ด, Co-occurence matrix์™€ ๊ฐ™์€ Count-based ๋ฐฉ์‹์€ ๋น ๋ฅธ ํ›ˆ๋ จ์ด ๊ฐ€๋Šฅํ•˜๊ณ , ํ†ต๊ณ„์ ์œผ๋กœ ํ™œ์šฉ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์ง€๋งŒ, ๋‹จ์–ด ๊ฐ„ ์œ ์‚ฌ ๊ด€๊ณ„๋„๋ฅผ ํŒŒ์•…ํ•˜๊ธฐ ์–ด๋ ต๊ณ , ๋งŽ์ด ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์— ๋„ˆ๋ฌด ํฐ ๊ฐ€์ค‘์„ ๋ถ€์—ฌํ•œ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์—ˆ๋‹ค.

Conversely, direct-prediction methods like Word2Vec achieve strong performance and can capture complex patterns of word similarity, but their performance depends on the corpus size and they use the corpus statistics inefficiently.

 

์œ„ ๊ธฐ๋ฒ•๋“ค์˜ ์žฅ์ ๋งŒ์„ ๊ฐ–์ถ˜ ๊ธฐ๋ฒ•์œผ๋กœ, GLOVE๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์ด ๋“ฑ์žฅํ–ˆ๋‹ค.

GLOVE์˜ ๊ธฐ๋ณธ์ ์ธ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

  • ์ž„๋ฒ ๋”ฉ๋œ ๋‹จ์–ด ๋ฒกํ„ฐ ๊ฐ„ ์œ ์‚ฌ๋„ ์ธก์ •์„ ์ˆ˜์›”ํ•˜๊ฒŒ ํ•˜๋ฉด์„œ (word2vec์˜ ์žฅ์ )
  • ๋ง๋ญ‰์น˜ ์ „์ฒด์˜ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์ž (co-occurrence matrix์˜ ์žฅ์ )

GLOVE์˜ ๋ชฉ์ ํ•จ์ˆ˜๋Š” ๋‘ ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ๋‚ด์ ์ด ๋™์‹œ ๋ฐœ์ƒ ํ™•๋ฅ ์— ๋Œ€ํ•œ ๋กœ๊ทธ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์ด๋‹ค. ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

๋‹ค์Œ ์˜ˆ์‹œ๋ฅผ ํ™œ์šฉํ•ด์„œ GLOVE์˜ ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ์ •์˜ํ•ด๋ณด์ž.

 

๋‹ค์Œ ๋…ธํŠธ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ๋”ฐ๋ผ๊ฐ€๋ฉด ์ดํ•ด๋  ๊ฒƒ์ด๋‹ค.

In this way, we achieve the stated goal: the dot product of two word vectors represents the log of their co-occurrence probability.

2. Results

๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

Looking at the results, the model retrieves animals similar to frog.

 


Evaluation of word vectors

๋‹ค์Œ์€ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ๋“ค์„ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ๋Š”์ง€์— ๊ด€ํ•ด ์„ค๋ช…ํ•œ๋‹ค.

ํ‰๊ฐ€ ๋ฐฉ์‹์€ ๋‚ด์ , ์™ธ์ ์˜ ๋‘ ๊ฐ€์ง€ ํ‰๊ฐ€ ๋ฐฉ์‹์œผ๋กœ ๋‚˜๋‰œ๋‹ค.

 

  • Intrinsic evaluation
    • Evaluates on a specific subtask (e.g., judging similarity between words).
    • Fast to compute.
    • Helps us understand the system itself.
    • Downside: it cannot tell us whether the system is actually useful in the real world.
  • Extrinsic evaluation
    • Evaluates by plugging the system into a real-world task.
    • Slow to compute.
    • Hard to tell whether the system itself or its interaction with other subsystems is the problem.

1. Extrinsic word vector evaluation

This approach evaluates performance by applying the word vectors directly to a real task (named entity recognition, for example). GloVe performed well on nearly all such extrinsic evaluations.

2. Intrinsic word vector evaluation

๊ตฌ์ฒด์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š”, word vector analogies๋ผ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. (analogy๋Š” ์œ ์‚ฌ๋ฅผ ์˜๋ฏธํ•œ๋‹ค)

์˜ˆ๋ฅผ ๋“ค์–ด, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์˜ˆ์‹œ์— ๋Œ€ํ•ด ์˜ˆ์ธก์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์—ฌ๋ถ€์ด๋‹ค.

ex) man:woman :: king: ?

Written as an equation, this is the problem of finding the word d below:
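In the lecture's notation, with word vectors x_a, x_b, x_c:

d = argmax_i (x_b − x_a + x_c)ᵀ x_i / ‖x_b − x_a + x_c‖

that is, the word whose vector is most cosine-similar to x_b − x_a + x_c (for man:woman :: king:?, the word closest to x_woman − x_man + x_king, ideally queen).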

์ด๋Ÿฌํ•œ ๋‚ด์  ํ‰๊ฐ€์˜ ์˜ˆ์‹œ๋กœ๋Š” semantic (์˜๋ฏธ) ๋ฐฉ์‹๊ณผ syntatic (์ˆœ์„œ)๋ฐฉ์‹์ด ์žˆ๋‹ค.

  • Semantic (e.g., Chicago : Illinois :: Houston : Texas)

  • Syntactic (e.g., bad : worst :: big : biggest)

Running analogy evaluations across several embedding models while varying the dimension, corpus size, and so on, GloVe showed strong performance.
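A minimal sketch of how one analogy query is answered; the four-word embedding table here is made up purely for illustration, while real evaluations use trained GloVe or Word2Vec vectors:

```python
import numpy as np

# hypothetical toy embeddings; real evaluation uses trained vectors
emb = {
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.7, 0.6, 0.1]),
    "king":  np.array([0.9, 0.1, 0.8]),
    "queen": np.array([0.8, 0.6, 0.8]),
}
emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}  # unit-normalize

def analogy(a, b, c):
    """Return the word d maximizing cos(x_d, x_b - x_a + x_c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: emb[w] @ target)

print(analogy("man", "woman", "king"))  # ideally "queen"
```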

 

3. Another Intrinsic word vector evaluation

์ด๋ฒˆ์—๋Š” ์ธ๊ฐ„ ํŒ๋‹จ (human judgements)์— ๋”ฐ๋ฅธ word vector distances์™€ ๋‹จ์–ด ๋ฒกํ„ฐ ๊ฐ„ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

๋‹ค์Œ์€ WordSim353์ด๋ผ๋Š” ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค.

GloVe๋Š” ์ด ํ‰๊ฐ€ ๋ฐฉ์‹์—์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.

4. Word senses and word sense ambiguity

How should we represent a word that carries multiple meanings?

๋‹ค์Œ pike๋ผ๋Š” ๋‹จ์–ด์˜ ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด ๋ณด์ž.

How can this problem be solved?

1. Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

ํŠน์ • ๋‹จ์–ด์˜ ์œˆ๋„์šฐ๋“ค์„ ํด๋Ÿฌ์Šคํ„ฐ๋งํ•œ ํ›„, ๋‹จ์–ด๋“ค์„ bank1, bank2, bank3 ๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ๋‹ค์‹œ ์ž„๋ฒ ๋”ฉํ•œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์ด ๋ฐฉ๋ฒ•์€ ๋ถ€์ •ํ™•ํ•˜๋ฏ€๋กœ ๋งŽ์ด ์“ฐ์ด์ง€ ์•Š๋Š”๋‹ค.

2. Linear Algebraic Structure of Word Senses, with Applications to Polysemy

์ด ๋ฐฉ๋ฒ•์€ ๋‹จ์–ด๊ฐ€ ์—ฌ๋Ÿฌ๊ฐœ๋ผ๋„ ํ•œ ๋‹จ์–ด ๋‹น ํ•œ vector๋งŒ์„ ๋ณด์œ ํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒฝ์šฐ์ด๋‹ค. ์ด ๋ฐฉ์‹์—์„œ, ๋ชจ๋“  ์˜๋ฏธ์˜ vector์— ๋Œ€ํ•œ ํ‰๊ท  vector ๋งŒ์„ ์‚ฌ์šฉํ•œ๋‹ค.


์ด๋ ‡๊ฒŒ, ์ตœ์ข…์ ์œผ๋กœ Word2Vec, Co-occurrence matrix, GloVe์™€ ์ด์— ๋Œ€ํ•œ ํ‰๊ฐ€ ๋ฐฉ์‹์— ๊ด€ํ•ด ์•Œ์•„๋ณด์•˜๋‹ค.
