
[CS224N] 6, 7, 8. RNN, LSTM, Seq2seq, Attention & Transformers

์žฅ์˜์ค€ 2023. 12. 30. 01:46

After the semester ended, I've been taking the CS224N lectures that were newly updated for the 2023 version. The recent lectures definitely cover much more up-to-date material, and the quality of the lectures seems better for it.

Unlike in the past, when I just let the lectures wash over me, this time I'm trying to understand the important points and write them up on the blog to check my grasp of the concepts.

 

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” RNN์˜ ๋„์ž…๋ถ€ํ„ฐ LSTM, Transformer๊นŒ์ง€ ์˜ค๊ฒŒ ๋œ ๊ณผ์ •๊ณผ ๊ฐ๊ฐ์˜ ๋ชจ๋ธ๋“ค์— ๋Œ€ํ•ด์„œ ์ž‘์„ฑํ•ด ๋ณด์•˜๋‹ค. ์œ„ ๋ชจ๋ธ์— ๋Œ€ํ•ด์„œ ๋“ค์–ด๋งŒ ๋ณด๊ณ  ์ž˜ ์•Œ์ง€๋Š” ๋ชปํ•˜์‹  ๋ถ„๋“ค์—๊ฒŒ ๊ฐ•์ถ”.


1. RNN

Simple RNN

์ง€๋‚œ ๊ธ€์—์„œ๋„ ์ž‘์„ฑํ–ˆ๋‹ค์‹œํ”ผ, RNN์˜ ํ•ต์‹ฌ์€ ๊ฐ™์€ ๊ฐ€์ค‘์น˜ W๋ฅผ ๋ฐ˜๋ณตํ•˜์—ฌ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์Šค์Šค๋กœ์—๊ฒŒ ํ”ผ๋“œ๋ฐฑ์„ ์ฃผ๋Š” ๋ฐฉ์‹์ด๋‹ค.

๊ธฐ๋ณธ์ ์ธ ๊ตฌ์กฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.


Training RNN

๊ทธ๋Ÿผ ์ด๋Ÿฐ ๊ตฌ์กฐ์˜ RNN์€ ๋ณดํ†ต ์–ด๋–ป๊ฒŒ training ์‹œํ‚ค๋Š” ๊ฒƒ์ผ๊นŒ? ์ด ๊ฐœ๋…์€ RNN์„ ์‚ฌ์šฉํ•˜๋Š” Language Model (RNN-LM)์„ ์˜ˆ๋กœ ๋“ค์–ด ์„ค๋ช…ํ•ด ๋ณด์ž.

  1. ๋จผ์ € ๊ธด ๊ธธ์ด์˜ text ๋ญ‰์น˜๋ฅผ ์ค€๋น„ํ•œ๋‹ค: {x1, x2, ..., xT}
  2. ์ด ํ…์ŠคํŠธ๋ฅผ RNN-LM์— ์ฃผ๊ณ , ๊ฐ step t์— ํ•ด๋‹นํ•˜๋Š” output distribution y^(t)๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
  3. predicted probability y^(t)์™€ ์‹ค์ œ ๋‹จ์–ด y(t) (=x(t+1))์— ๋Œ€ํ•œ Loss function์„ ๋งŒ๋“ ๋‹ค.
  4. ์ „์ฒด training set์— ๋Œ€ํ•ด ์ด๊ฒƒ์˜ ํ‰๊ท ์„ ๊ตฌํ•˜์—ฌ ์ตœ์ข… loss๋ฅผ ๊ตฌํ•œ๋‹ค.

๊ฐ„๋‹จํ•˜๊ฒŒ ๋งํ•˜๋ฉด, ๊ทธ๋ƒฅ ๋ฐ”๋กœ ๋’ค ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํ™•๋ฅ ์„ ๋†’์ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

ํ•™์Šต์€ ์—ญ์ „ํŒŒ๋กœ ์ด๋ฃจ์–ด์ง€๋Š”๋ฐ, gradient๋ฅผ ์–ป์–ด parameter๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ loss์˜ gradient ๊ฐ’์„ ๊ตฌํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ? ์ด๋•Œ ๋‹ค์Œ 2๊ฐ€์ง€์— ์ง‘์ค‘ํ•ด์•ผ ํ•œ๋‹ค:

  1.  RNN์—์„œ๋Š” W๊ฐ€ ๊ณต์œ ๋œ๋‹ค.
  2. Chain Rule์— ์˜ํ•ด ์ตœ์ข… ๋‹จ๊ณ„์—์„œ์˜ loss์˜ ๋ฏธ๋ถ„๊ฐ’์€ ๊ฐ ๋‹จ๊ณ„์˜ ๋ฏธ๋ถ„๊ฐ’์„ ๋”ํ•œ ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

์ด 2๊ฐ€์ง€๋ฅผ ๊ณ ๋ คํ•˜๋ฉด, ๋ฏธ๋ถ„๊ฐ’์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋œ๋‹ค.

๊ทธ๋Ÿผ ์ด์ œ ์ € ์‹์€ time step t, (t-1), (t-2), ..., 0์„ ์ง€๋‚˜๋ฉด์„œ gradient๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด ๋œ๋‹ค. ์ด๋ฅผ backpropagation through time ์ด๋ผ๊ณ  ํ•œ๋‹ค๋”๋ผ.


Generating text with an RNN-LM

RNN-LM์—์„œ๋Š” Repeated Sampling์ด๋ผ๋Š” ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ text๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.

์ด๋Š” ์ž„์˜์˜ ํ•œ step์—์„œ ์ƒ์„ฑ๋˜๋Š” output ์ค‘ ํ•˜๋‚˜๋ฅผ ๊ณจ๋ผ ๋‹ค์Œ step์˜ input์œผ๋กœ ์ฃผ๋Š” ๋ฐฉ์‹์ด๋‹ค.

๋‹น์—ฐํžˆ ๋ง๋งŒ ๋“ค์œผ๋ฉด ์ •ํ™•๋„๊ฐ€ ๋†’๋‹ค๊ณ  ์ƒ๊ฐ์ด ์ „ํ˜€ ๋“ค์ง€ ์•Š๊ฒ ์ง€๋งŒ, ์ƒ๊ฐ๋ณด๋‹ค๋Š” ๋‚˜์˜์ง€ ์•Š๊ณ , ๋ฌด์—‡๋ณด๋‹ค ๋‹น์‹œ์—๋Š” n-gram๋ณด๋‹ค ํ›จ์”ฌ ์ผ๊ด€์„ฑ ์žˆ๋Š” ๋ง์„ ์ƒ์„ฑํ•ด ๋‚ด์„œ ์ข‹์€ ์„ฑ๋Šฅ์œผ๋กœ ํ‰๊ฐ€๋ฐ›์•˜์—ˆ๋‹ค.


RNN์˜ ๋ฌธ์ œ

์ด๋ ‡๊ฒŒ n-gram ์‹œ๋Œ€์— ๋“ฑ์žฅํ•œ RNN์€ ์ •๋ง ๊ดœ์ฐฎ์•„ ๋ณด์˜€์ง€๋งŒ, ์‚ฌ์‹ค์ƒ ๋งŽ์€ ๋ฌธ์ œ์ ์ด ์žˆ์—ˆ๋‹ค.

Vanishing & Exploding Gradients

๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ด๊ณ  ์œ ๋ช…ํ•œ ๋ฌธ์ œ๋Š” ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฐ ์ฆํญ๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ์ด๋‹ค.

์—ญ์ „ํŒŒ๋ฅผ chain rule์„ ํ†ตํ•ด์„œ ์ „๋‹ฌํ•  ๋•Œ, ๊ฐ€์žฅ ๋งˆ์ง€๋ง‰ layer๋Š” layer์˜ ์ˆ˜๋งŒํผ ๊ณฑ์…ˆ์„ ๊ฑฐ์น˜๊ฒŒ ๋œ๋‹ค.

๋งŒ์•ฝ ์ด ๊ณฑํ•˜๋Š” ๊ฐ’๋“ค์˜ ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ์ž‘๋‹ค๋ฉด, ์—ญ์ „ํŒŒ๊ฐ€ ์ „๋‹ฌ๋ ์ˆ˜๋ก ๊ฐ’์ด ์ ์  ๋” ์ž‘์•„์ง€๊ฒŒ ๋œ๋‹ค.

๊ทธ๋ž˜์„œ ์ด๋Ÿฐ gradient vanishing์˜ ๊ฒฝ์šฐ, ๋ฌธ์ œ๋Š” ๋ชจ๋ธ์˜ weight๊ฐ€ ์—…๋ฐ์ดํŠธ๋  ๋•Œ, ๊ทผ์ฒ˜์— ์žˆ๋Š” text์—๋งŒ ์˜ํ–ฅ์„ ๋ฐ›๊ณ , ๋ฉ€๋ฆฌ ์žˆ๋Š” text์— ๋Œ€ํ•ด์„œ๋Š” ์˜ํ–ฅ์„ ๊ฑฐ์˜ ๋ฐ›์ง€ ๋ชปํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

In the same way, if the magnitudes of the multiplied factors are too large, the gradient gets larger and larger as it propagates back.

With gradient explosion, this causes bad updates; in the worst case the values blow up to NaN and can no longer be computed, and you may have to restart training from scratch.

 

Solution for Gradient Explosion: Gradient Clipping

The fix for gradient explosion is gradient clipping: if the norm of the gradient exceeds some threshold, scale it down before the SGD update. It turns out to be a surprisingly simple fix.
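
A minimal sketch of norm-based clipping (PyTorch's torch.nn.utils.clip_grad_norm_ does essentially this):

```python
import torch

def clip_gradients(parameters, threshold):
    """Scale gradients down if their overall norm exceeds the threshold."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > threshold:
        for g in grads:
            g.mul_(threshold / total_norm)   # in-place scale-down before the SGD step
```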

 

So how can the gradient vanishing problem be solved?

The idea proposed for this problem was to use an RNN with a separate memory. The model built this way is the LSTM.


2. LSTM

์•ž์„œ ๋งํ–ˆ๋“ฏ์ด, LSTM์€ RNN์˜ ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด memory๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์„ ์ ์šฉํ•œ ๋ชจ๋ธ์ด๋‹ค. ์ด ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•ด cell state๋ผ๋Š” ๊ฐœ๋…์„ ๋„์ž…ํ•œ๋‹ค.

ํ•˜๋‚˜์˜ hidden state๋ฅผ ๊ฐ€์ง€๋Š” RNN๊ณผ ๋‹ฌ๋ฆฌ LSTM์€ 2๊ฐœ, hidden state๊ณผ cell state๋ฅผ ๊ฐ€์ง„๋‹ค. ์ด cell state์˜ ํŠน์ง•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

  • ๋‘ ๋ฒกํ„ฐ์˜ ๊ธธ์ด๋Š” ๋ชจ๋‘ n์ด๋‹ค.
  • cell์€ long-term ์ •๋ณด๋ฅผ ๋‹ด๋Š”๋‹ค.
  • LSTM์€ cell๋กœ๋ถ€ํ„ฐ ํ•ด๋‹น ์ •๋ณด๋ฅผ ์ฝ๊ณ , ์ง€์šฐ๊ณ , ์“ธ ์ˆ˜ ์žˆ๋‹ค. ๋งˆ์น˜ ์ปดํ“จํ„ฐ์˜ RAM ๊ฐ™์€ ๋А๋‚Œ์ด๋‹ค.

The erasing, writing, and reading are handled by three gates, whose properties are as follows.

  • The gates are also vectors of length n.
  • At each timestep, a gate can be open (1), closed (0), or anywhere in between.
  • The gates are dynamic: their values are computed from the current context.

Referring to the figure below, let's see what the three gates represent.

  1. Forget gate: controls which information from the previous cell state is kept and which is forgotten.
  2. Input gate: controls which parts of the new cell content get written to the cell memory.
  3. Output gate: controls which parts of the cell are passed on to the hidden state.

LSTM์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ ๊ทธ๋ฆผ์ด ์ •๋ง ์ž˜ ๋‚˜ํƒ€๋‚ด์—ˆ๋‹ค.

  • ๊ฐ€์žฅ ์™ผ์ชฝ์—์„œ๋Š” forget gate๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ ๋‹นํ•œ cell content๋ฅผ ์žŠ๋Š”๋‹ค.
  • ์ค‘๊ฐ„์—์„œ๋Š” input gate์™€ new cell content๋ฅผ ๋„ฃ์–ด ์–ด๋–ค ์ •๋ณด๊ฐ€ cell์— ๊ธฐ๋ก๋˜๋Š” ์ •๋ณด๋ฅผ ๊ด€๋ฆฌํ•œ๋‹ค.
  • ๋งˆ์ง€๋ง‰์—์„œ๋Š” output gate๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ cell state์— ์žˆ๋Š” ์ •๋ณด๋“ค์„ hidden state๋กœ ์ „๋‹ฌํ•œ๋‹ค.
  • ํ•‘ํฌ์ƒ‰์œผ๋กœ ํ‘œ์‹œ๋œ +๋ฅผ ํ†ตํ•ด ์ƒˆ๋กœ์šด cell content๋ฅผ ์ž‘์„ฑํ•œ๋‹ค.

์ด๋ ‡๊ฒŒ, LSTM์€ ๊ฐ์ข… gate๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ •๋ณด๊ฐ€ ๋งŽ์€ timestep์œผ๋กœ๋ถ€ํ„ฐ ๋ณด์กด๋  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค.


3. NMT & Seq2Seq

Having wrapped up RNNs and LSTMs to some extent, the course introduces a new task: machine translation.

Machine translation means, literally, translation done by a machine. For example, to translate French into English, we learn a probabilistic model from data.

Statistical Machine Translation (SMT)

Before neural machine translation (NMT) came along, people tried to build probability-based translation models, and their approach went like this.

Let x be the French sentence and y the English sentence; the quantity we want to find is the best y given x, i.e., argmax_y P(y|x).

Applying Bayes' rule, this can be decomposed into two components, argmax_y P(x|y) P(y):

  • the left factor is the translation model, which learns from parallel data how words and phrases should be translated,
  • and the right factor is the language model, which learns how to produce fluent English.

Let's look at the translation model first. How do we learn the translation model P(x|y)?

  1. First, we need a really large amount of parallel data (e.g., pairs of well-translated French-English sentences).
  2. We also need alignment data between the two sentences (e.g., information about which French word corresponds to which English word).
    • Such alignments can be one-to-one, many-to-one, one-to-many, or many-to-many.

์œ„์™€ ๊ฐ™์ด Encoding ํ•œ ์ดํ›„ Decoding ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์ด๋ ‡๊ฒŒ ๋‹จ์–ด๋งˆ๋‹ค ์กฐ๊ฐ์ด ์žˆ์œผ๋ฉฐ, ์ด ์กฐ๊ฐ๋“ค์„ ํ•˜๋‚˜์”ฉ ์งœ ๋งž์ถฐ ๋‚˜๊ฐ์œผ๋กœ์จ ์˜ˆ์ธกํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

์ด๋ ‡๊ฒŒ ์ง„ํ–‰๋œ Statistical Machine Translation์€ ๋„ˆ๋ฌด ๋ณต์žกํ–ˆ๊ณ , ํ•˜์œ„ ์ปดํฌ๋„ŒํŠธ๋“ค์ด ๋„ˆ๋ฌด ๋งŽ์•˜์œผ๋ฉฐ, ๋งŽ์€ feature engineering์ด ์š”๊ตฌ๋˜์—ˆ๊ณ , ๋‹ค์–‘ํ•œ ์ถ”๊ฐ€์ ์ธ ์ž์›๋“ค๊ณผ ์ธ๊ฐ„์˜ ๋…ธ๋ ฅ์ด ํ•„์š”ํ–ˆ๋‹ค...


Neural Machine Translation (NMT)

NMT์˜ ๊ตฌ์กฐ

์ด์ œ Neural Network๋ฅผ ๋„์ž…ํ•ด์„œ machine translation์„ ์ ์šฉํ•œ ๋ถ„์•ผ๋ฅผ ํ•œ๋ฒˆ ๋ณด์ž. ์ด ์•„ํ‚คํ…์ฒ˜๋Š” sequence-to-sequence ๋ชจ๋ธ์ด๋ผ๊ณ  ๋ถˆ๋ฆฌ๋ฉฐ, 2๊ฐœ์˜ RNN์œผ๋กœ ๊ตฌ์„ฑ๋ผ ์žˆ๋‹ค.

๋จผ์ € Seq2Seq์€ ์œ„์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ์ง€๋‹ˆ๊ณ , ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
(๋‹ค๋งŒ ์œ„ ๊ตฌ์กฐ๋Š” Test์˜ ๊ฒฝ์šฐ์ด๊ณ , Train ์‹œ์—๋Š” ์ƒ์„ฑ๋˜๋Š” output ๋‹จ์–ด๋ฅผ ๋‹ค์Œ input์œผ๋กœ ์ž…๋ ฅํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, true data๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ค€๋‹ค.)

  1. The source sentence is encoded by the encoder RNN.
    • The encoder's final hidden state becomes the initial hidden state for the decoder RNN.
  2. The decoder RNN generates the target sentence conditioned on that encoding; in other words, it does the translation.

In this sense, the Seq2Seq model is an example of a conditional language model.
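
A minimal sketch of the two-RNN structure (illustrative PyTorch with hypothetical sizes; with teacher forcing, the ground-truth target tokens are passed in as tgt_ids during training):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder RNN -> final hidden state -> decoder RNN conditioned on it."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, enc_state = self.encoder(self.src_embed(src_ids))            # encode the source
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), enc_state)   # condition on the encoding
        return self.head(dec_out)                                       # next-word logits per step
```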

 

Training NMT

So how does training work?

The training process goes like this.

  1. Take a batch of French-English sentence pairs.
  2. Run the encoder RNN over the French sentence and feed the resulting hidden state to the decoder.
  3. Compare the word the decoder predicts from that hidden state at each step with the actual word, and compute the loss from the mismatch.

 

Multi-layer RNNs

RNN์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ์—ฌ๋Ÿฌ ๊ฒน์˜ RNN์„ ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ์žˆ๋‹ค. ์ด๋Š” multi-layer RNN์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋Š”๋ฐ, ์ด๊ฒƒ์˜ ์žฅ์ ์€ ๋”์šฑ ๋ณต์žกํ•œ ํ‘œํ˜„์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด, ๊ทธ๋ƒฅ RNN์„ ์ข€ ๋” ์Œ“์•„ ์˜ฌ๋ฆฐ ํ˜•์‹์ด๋‹ค.

  • ํผํฌ๋จผ์Šค๊ฐ€ ์ข‹์€ RNN์€ ์ฃผ๋กœ multi-layer๋ผ๊ณ  ํ•œ๋‹ค.
  • ์‹ค์ œ๋กœ NMT์˜ ๊ฒฝ์šฐ, encoder RNN์€ 2~4๊ฒน, decoder RNN์€ 4๊ฒน์ด ๊ฐ€์žฅ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๊ณ  ํ•œ๋‹ค.
  • ๋˜ํ•œ, ์ฃผ๋กœ skip-connection(residual connection)์ด๋‚˜ dense-connection ๋“ฑ์ด RNN์„ ๋” ๊นŠ๊ฒŒ train ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•˜๋‹ค๊ณ  ํ•œ๋‹ค.

 

Decoding methods

1. Greedy Decoding

So far, decoding has meant generating the target sentence by taking the argmax at every step of the decoder.

Picking the single highest-probability word at each time step like this is called greedy decoding.

The problem with this method is that once a greedy choice turns out to be wrong, there is no way to go back.

 

2. Exhaustive search Decoding

To address this, exhaustive search decoding was proposed.
Given x, it tries to consider every possible y, but this costs far too many computational resources.

 

3. Beam search Decoding

์ด ๋ฐฉ๋ฒ•์˜ ์ฝ”์–ด ์•„์ด๋””์–ด๋Š” decoder์˜ ๊ฐ time step์—์„œ k๊ฐœ (beam size)์˜ ๊ฐ€์žฅ ํ™•๋ฅ ์ด ๋†’์€ translation์„ ์ฐพ์ž๋Š” ๊ฒƒ์ด์—ˆ๋‹ค.

์ด ๋ฐฉ๋ฒ•์€ ์ตœ์ ์˜ ํ•ด๊ฒฐ๋ฐฉ์•ˆ์ด ๋ณด์žฅ๋ผ์žˆ์ง€๋Š” ์•Š์ง€๋งŒ, exhaustive search๋ณด๋‹ค๋Š” ํ›จ์”ฌ ํšจ์œจ์ ์ธ ๋ฐฉ์‹์ด์—ˆ๋‹ค.

Beam search Decoding์˜ ์˜ˆ์‹œ

์ด๋ ‡๊ฒŒ ํ•จ์œผ๋กœ์จ, ๊ฝค ๊ดœ์ฐฎ์€ Neural Machine Translation์„ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์—ˆ๊ณ , ์ด์˜ ์žฅ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์•˜๋‹ค.

  • ๋” ๋‚˜์€ ์„ฑ๋Šฅ: ๋” ์œ ์ฐฝํ•˜๊ณ , context๊ฐ€ ๋” ์ผ๊ด€์„ฑ ์žˆ์—ˆ๋‹ค.
  • ์˜ค์ง single neural network๋งŒ ํ•„์š”ํ–ˆ๋‹ค. ๋‹ค๋ฅธ subcomponents๋“ค์ด ํ•„์š” ์—†์—ˆ๋‹ค.
  • Human engineering์„ ์ ๊ฒŒ ์š”๊ตฌํ•œ๋‹ค.

๋‹ค๋งŒ, ๋””๋ฒ„๊น…ํ•˜๊ธฐ ์–ด๋ ต๊ณ , ์ปจํŠธ๋กคํ•˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋‹จ์  ๋˜ํ•œ ์žˆ๊ธด ํ–ˆ๋‹ค.

 

Evaluation of Machine Translation

Machine Translation์€ BLEU (Bilingual Evalutaion Understudy) ์Šค์ฝ”์–ด๋กœ ํ‰๊ฐ€ํ•œ๋‹ค.

์ด๋Š” ํ•˜๋‚˜์˜ machine-written translation๊ณผ human-written translation ์‚ฌ์ด์˜ ์œ ์‚ฌ์„ฑ์„ ๋น„๊ตํ•˜๋Š”๋ฐ, ์ด ์œ ์‚ฌ์„ฑ์€ ๋‹ค์Œ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค.

  • n-gram precision
  • ๋„ˆ๋ฌด ์งง์€ system translation์— ๋Œ€ํ•œ penalty

BLEU ์Šค์ฝ”์–ด๋Š” ์šฉ์ดํ•˜์ง€๋งŒ ์™„๋ฒฝํ•˜์ง€ ๋ชปํ•˜๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. ๊ฐ™์€ ์˜๋ฏธ๋ฅผ ๋‹ด์€ ๋ฌธ์žฅ์ด๋”๋ผ๋„, ์ถฉ๋ถ„ํžˆ ๋‹ค๋ฅธ ์šฉ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅด๊ฒŒ ํ•ด์„๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด ๋•Œ๋ฌธ์— ์ž˜ ๋ฒˆ์—ญ๋œ ๋ฌธ์žฅ๋„ ๋‚ฎ์€ ์Šค์ฝ”์–ด๋ฅผ ๋ฐ›๋Š” ํ˜„์ƒ ๋˜ํ•œ ๋ฐœ์ƒํ•˜๊ธฐ๋„ ํ–ˆ๋‹ค.
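
A toy sketch of those two ingredients, n-gram precision plus a brevity penalty (a simplified illustration, not the exact BLEU definition):

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=2):
    """Toy BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())                    # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # penalize short outputs
    if min(precisions) == 0:
        return 0.0
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
```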


4. Attention

Problem of Seq2Seq Model: Bottleneck problem

Neural Machine Translation ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ๋งŒ๋“ค์–ด์ง„ Seq2Seq ๋ชจ๋ธ๋„ ๋ฌธ์ œ์ ์ด ์žˆ์—ˆ๋Š”๋ฐ, ์ด๋Š” Bottleneck problem, ๋ณ‘๋ชฉ ํ˜„์ƒ์ด์—ˆ๋‹ค. ์‰ฝ๊ฒŒ ๋งํ•˜๋ฉด, Encoder RNN์˜ ๋งˆ์ง€๋ง‰ hidden state์—์„œ ๋ชจ๋“  ์ •๋ณด๋ฅผ ํฌํ•จํ•ด์•ผ ํ•œ๋‹ค๋Š” ๋ฌธ์ œ์˜€๋‹ค.

 

To solve this, think about how a human translates.

When translating, a person reads one part of the sentence and translates it, then reads the next part and translates that, and keeps repeating this.

While reading, they also notice which parts need closer attention and focus on those parts as they translate.

 

์ด ์•„์ด๋””์–ด์— ์ฐฉ์•ˆํ•˜์—ฌ, decoder์˜ ๊ฐ ๋‹จ๊ณ„์—์„œ encoder์™€์˜ ์ง์ ‘์ ์ธ connection์„ ํ†ตํ•ด ํŠน์ • ๋ถ€๋ถ„์— ์ง‘์ค‘ํ•˜๋„๋ก ๋งŒ๋“  ์•„ํ‚คํ…์ฒ˜๊ฐ€ Attention์ด๋‹ค.


Seq2seq with Attention

Seq2seq ๋ชจ๋ธ์— Attention์„ ๊ฒฐํ•ฉํ•œ ๊ตฌ์กฐ๋ฅผ ๋จผ์ € ์‚ดํŽด๋ณด์ž. ์•ž์„œ ๋งํ–ˆ๋“ฏ์ด, Attention์€ ์–ด๋””์— ์ฃผ์˜๋ฅผ ์ค„ ๊ฒƒ์ธ๊ฐ€๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ๋งŒ๋“  ์•„ํ‚คํ…์ฒ˜์ด๋‹ค.

  1. ๋จผ์ € ํŒŒ๋ž€์ƒ‰์œผ๋กœ Attention Scores๋ผ๊ณ  ๋ผ์žˆ๋Š” ๋ถ€๋ถ„์€, ๊ฐ ์œ„์น˜์—์„œ์˜ encoder hidden state์™€ decoder์˜ hidden state๋ฅผ ๊ฐ€์ง€๊ณ  attention score๋ฅผ ๊ณ„์‚ฐํ•œ ๊ฒƒ์ด๋‹ค. ๊ณ„์‚ฐ์€ ์ฃผ๋กœ dot product, ๋‚ด์ ์„ ์‚ฌ์šฉํ•œ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ๊ฐ decoder์˜ timestep์— attention distribution์„ ๋ณด๋ฉด, ์–ด๋–ค hidden state์— ์–ผ๋งˆ๋งŒํผ์˜ attention์„ ์ฃผ๋Š”์ง€์— ๋Œ€ํ•ด ์•Œ ์ˆ˜ ์žˆ๋‹ค.
  2. ๋งŒ๋“ค์–ด์ง„ Attention Distribution์„ ๊ฐ€์ง€๊ณ , encoder hidden state์— ๋Œ€ํ•œ weighted sum์„ ์ ์šฉํ•˜์—ฌ attention output์„ ๋งŒ๋“ ๋‹ค. ์œ„ ๊ณผ์ •์„ ๊ฑฐ์น˜๋ฉด, ์ตœ์ข… attention output์€ ์ฃผ๋กœ high attention์„ ๊ฐ€์ง„, hidden state๋กœ๋ถ€ํ„ฐ ํŒŒ์ƒ๋œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

์ดํ›„์—๋Š” ์ „์˜ Seq2seq ๋ชจ๋ธ์—์„œ ํ–ˆ๋˜ ๋น„์Šทํ•œ ๊ณผ์ •๋“ค์„ ์ญ‰ ๋ฐ˜๋ณตํ•˜๋Š”๋ฐ, decoder hidden state์— attention output์„ concatenate ํ•˜๊ณ , ๊ณ„์‚ฐํ•˜์—ฌ y^์„ ๊ตฌํ•œ๋‹ค.


Problem of RNN: Linear interaction distance

RNN์€ ์ฃผ๋กœ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๊ฐ€๋Š”๋ฐ, ์ด๋Š” ๊ฐ€๊นŒ์ด ์žˆ๋Š” ๋‹จ์–ด๊ฐ€ ์„œ๋กœ์˜ ์˜๋ฏธ์— ์˜ํ–ฅ์„ ์ค€๋‹ค.

์ด์—, RNN์€ sequence ๊ธธ์ด๋งŒํผ์˜ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ ๊ฑธ๋ฆฐ๋‹ค.

์ด๋Š” ๋‹ค์‹œ ๋งํ•˜๋ฉด GPU์˜ ์žฅ์ ์„ ์™„์ „ํžˆ ๋ฌด์‹œํ•ด ๋ฒ„๋ฆฌ๋Š” ๊ฒƒ์ด ๋œ๋‹ค. ๊ทธ ์ด์œ ๋Š” RNN์˜ ๊ฒฝ์šฐ, ์ข…์†์ ์œผ๋กœ ์ „์˜ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ด์•ผ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ํ˜„์ƒ์ด ๋ฐ˜๋ณต๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. GPU์˜ ์ด์ ์„ ์ฑ™๊ธฐ๋ ค๋ฉด ๋ชจ๋“  ๋‹จ์–ด๋“ค์ด ๋…๋ฆฝ์ ์œผ๋กœ ์˜ˆ์ธก๋  ์ˆ˜ ์žˆ๋Š” ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์–ด์•ผ ํ•œ๋‹ค.

 

Attention์€ ๋ชจ๋“  ๋‹จ์–ด๋“ค์˜ ํ‘œํ˜„์„ query๋กœ ์ทจ๊ธ‰ํ•˜์—ฌ ์ ‘๊ทผํ•˜๊ณ  value ์ง‘๋‹จ์˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•œ๋‹ค. Attention์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์ด๋Ÿฐ ๋ณ‘๋ ฌ์ ์ธ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ด์ ธ ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ sequence ๊ธธ์ด์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๊ฒŒ ๋œ ๊ฒƒ์ด๋‹ค.


์ด๋Ÿฐ Attention์˜ ๊ตฌ์กฐ๋Š” ๋งˆ์น˜ lookup table๊ณผ ๊ฐ™๋‹ค. 

  • Attention์—์„œ๋Š” query์™€ key๋“ค๊ณผ์˜ ์œ ์‚ฌ์„ฑ์„ ๊ณ„์‚ฐํ•ด์„œ softmax๋กœ ์˜ํ–ฅ๋ ฅ (=weight)๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ , value์— ํ•ด๋‹น weight๋ฅผ ๊ณฑํ•˜์—ฌ sum ๊ฐ’์„ ๊ตฌํ•œ๋‹ค. ๊ทธ๊ฒƒ์ด ๊ณง output์ด ๋œ๋‹ค.

Self-Attention

Self-Attention์€ keys, queries, values๋ฅผ ๊ฐ™์€ sequence๋กœ๋ถ€ํ„ฐ ๋ฝ‘์•„๋‚ด๋Š” ๊ฒƒ์ด๋‹ค. 

vocabulary V์— ์žˆ๋Š” ๋‹จ์–ด๋“ค์˜ sequence๋ฅผ w_{1:n}์ด๋ผ๊ณ  ํ•˜์ž. 

๊ฐ wi์— ๋Œ€ํ•ด x_i = E * w_i๋ผ๊ณ  ํ•˜์ž. E๋Š” embedding matrix์ด๊ณ , ์ฐจ์›์€ d x |V|์ด๋‹ค.

Self-Attention์ด ์ด๋ฃจ์–ด์ง€๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. dxd ์ฐจ์›์˜ weight matrix์ธ Q, K, V๋ฅผ ๊ฐ€์ง€๊ณ  word embedding์„ transform ํ•œ๋‹ค.
    • q_i = Q * x_i (Queries)
    • k_i = K * x_i (Keys)
    • v_i = V * x_i (Values)
  2. keys์™€ queries์— ๋Œ€ํ•œ ์œ ์‚ฌ์„ฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , softmax๋กœ normalize ํ•œ๋‹ค.
  3. ์ตœ์ข…์ ์œผ๋กœ, weighted sum of values๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ, ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ ์ตœ์ข… output์„ ๊ณ„์‚ฐํ•œ๋‹ค.

  • e_ij๋Š” i๊ฐ€ j๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด ์–ผ๋งˆ๋‚˜ ๋งŽ์ด lookup ํ•ด์•ผ ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” score,
  • alpha_ij๋Š” ์ •๊ทœํ™”๋œ ์ˆซ์ž์ด๋‹ค.
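
Putting the three steps together, a minimal sketch (single head, no scaling yet) might look like this:

```python
import torch

def self_attention(X, Q, K, V):
    """Basic self-attention: keys, queries, and values all come from the same sequence.

    X: (n, d) word embeddings; Q, K, V: (d, d) learned weight matrices.
    """
    queries = X @ Q                          # q_i = Q x_i (row-vector convention)
    keys = X @ K                             # k_i = K x_i
    values = X @ V                           # v_i = V x_i
    scores = queries @ keys.T                # e_ij = q_i . k_j
    alphas = torch.softmax(scores, dim=-1)   # alpha_ij, normalized over j
    return alphas @ values                   # output_i = sum_j alpha_ij v_j
```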

Of course, this attention mechanism had problems of its own.

Problems with attention and their solutions

1. Doesn't have inherent notion of order -> Positional Encoding

Attention์˜ ์ฒซ ๋ฒˆ์งธ ๋ฌธ์ œ๋Š” ๋‚ด์žฌ๋œ ์ˆœ์„œ๊ฐ€ ์—†๋‹ค๋Š” ๊ฒƒ์ด์—ˆ๋‹ค. ์œ„์™€ ๊ฐ™์€ ๋‹จ์ˆœํ•œ Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜๋งŒ์„ ์‚ฌ์šฉํ•˜๋ฉด, 

'Zuko made his uncle a breakfast'์™€ 'His uncle made Zuko a breakfast'๋Š” ๊ฐ™์€ ์˜๋ฏธ๋กœ ์ธ์ฝ”๋”ฉ ๋œ๋‹ค.

 

๊ทธ๊ฐ„ RNN์€ ์ˆœ์ฐจ์ ์œผ๋กœ ์ž…๋ ฅ๋˜์–ด ์ด๋Ÿฐ ์ˆœ์„œ์— ๊ตณ์ด ์‹ ๊ฒฝ ์“ฐ์ง€ ์•Š์•„๋„ ๋์ง€๋งŒ, Attention์˜ ๊ฒฝ์šฐ, ์ˆœ์„œ ์ •๋ณด๊ฐ€ ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ–ˆ๋‹ค.

์ด์— ๋ฌธ์žฅ์˜ ์ˆœ์„œ๋„ ํ•จ๊ป˜ encoding ํ•˜๋Š” ์ž‘์—…์„ ์ƒ๊ฐํ–ˆ๋‹ค. ์ฆ‰, index ๋˜ํ•œ vector๋กœ ์ทจ๊ธ‰ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

Attention์—์„œ๋Š” position vector p_i๋ฅผ embedding ๋œ ๋‹จ์–ด x_i์— ๋”ํ•˜์—ฌ ์ƒˆ๋กญ๊ฒŒ ์ž„๋ฒ ๋”ฉ๋œ ๋‹จ์–ด๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ์ด p_i๋ฅผ ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์„๊นŒ. ๋Œ€ํ‘œ์ ์œผ๋กœ 2๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค.

1. Sinusoidal position representations

์ด ๋ฐฉ์‹์€ ๋ณ€ํ™”ํ•˜๋Š” ์‹œ์ ์— ๋Œ€ํ•ด sinusoidal function์„ ๋„์ž…ํ•œ ๊ฒƒ์ธ๋ฐ, ์‚ฌ์ธ๊ณผ ์ฝ”์‚ฌ์ธ ํ•จ์ˆ˜์˜ ์ฃผ๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์œ„์น˜์— ๋Œ€ํ•œ ๊ณ ์œ ํ•œ ๊ฐ’์„ ์ƒํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ์ฃผ๊ธฐ์ ์ธ ํŒจํ„ด์€ ๋ชจ๋ธ์ด ์ž…๋ ฅ ์‹œํ€€์Šค์—์„œ ๊ฐ ๋‹จ์–ด์˜ ์œ„์น˜๋ฅผ ์ธ์‹ํ•˜๋Š” ๋ฐ ๋„์›€์„ ์ค€๋‹ค.

์ด๋ฅผ ์‹œ๊ฐ์ ์œผ๋กœ ๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • Pros
    • It reduces the importance of a word's absolute position.
    • It may extrapolate to longer sequences.
  • Cons
    • It is not learnable.

 

2. Learned absolute position representations

์ด ๋ฐฉ์‹์€ position์„ ๋‚˜ํƒ€๋‚ด๋Š” matrix์ธ p_i๊นŒ์ง€ ํŒŒ๋ผ๋ฏธํ„ฐํ™” ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ x_i์— ๋”ํ•˜์—ฌ ์‚ฌ์šฉํ•œ๋‹ค.

  • ์žฅ์ 
    • ์œ ์—ฐ์„ฑ: ๊ฐ position์€ data๋ฅผ fit ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต๋  ์ˆ˜ ์žˆ๋‹ค.
  • ๋‹จ์ 
    • ์‚ฌ์ „์— ์ •์˜ํ•œ ๋‹จ์–ด์˜ ๊ฐœ์ˆ˜๋ฅผ ๋„˜์–ด๊ฐ€๋Š” index๋ฅผ ์ถ”์ •ํ•˜์ง€ ๋ชปํ•œ๋‹ค. ์ฆ‰ n๋ณด๋‹ค ๊ธด input์„ ๋ฐ›์•„๋“ค์ผ ์ˆ˜ ์—†๊ฒŒ ๋œ๋‹ค. ์ด๋Š” ์‹ค์ œ์ ์œผ๋กœ๋„ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ๋ถ€๋ถ„์ด๊ณ , computational ๋น„์šฉ์ด n์ œ๊ณฑ์— ๋น„๋ก€ํ•ด์„œ ์ฆ๊ฐ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์‰ฝ์ง€ ์•Š๋‹ค.

2. No nonlinearities for deep learning, it's just weighted averages -> Adding Nonlinearities

์•ž์„œ ์ •์˜ํ•œ Attention๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์€ ๋”ฅ๋Ÿฌ๋‹์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ๊ทธ๋ƒฅ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๊ณ  softmax๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ๋ฟ์ด๋‹ค. 

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Feef-forward network๋ฅผ ๋”ํ•œ๋‹ค. 
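
In the Transformer-style block this is just a small two-layer MLP applied at every position; a sketch with hypothetical sizes:

```python
import torch.nn as nn

# Position-wise feed-forward network added after attention to introduce nonlinearity.
# The sizes are illustrative; the hidden layer is usually a few times wider than the model dimension.
feed_forward = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
```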

3. Need to ensure not to look at the future -> Masking

Machine translation ๋“ฑ์˜ task๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด์„œ, ์ ์–ด๋„ decoder ๋งŒํผ์€ ๋‹น์—ฐํžˆ ๋‹จ๋ฐฉํ–ฅ, ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์•ผ ํ•œ๋‹ค. 

์ด๋ฅผ ์œ„ํ•ด, ๋‚ด ์‹œ์  ์ดํ›„์˜ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•ด์„œ๋Š” ์ •๋ณด๋ฅผ ๋ณด์ง€ ๋ชปํ•˜๋„๋ก masking์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

์ด๋ ‡๊ฒŒ ํ•จ์œผ๋กœ์จ, ๋ฏธ๋ž˜์˜ ์ •๋ณด๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ๋” ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค. (์‚ฌ์‹ค ๋ฏธ๋ž˜์˜ ์ •๋ณด๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฉด ๊ทธ๊ฑธ ํ•™์Šต์ด๋ผ๊ณ  ๋งํ•  ์ˆ˜๋„ ์—†๋‹ค)

 

์œ„์˜ ๋ฌธ์ œ์ ๋“ค์„ ๋ชจ๋‘ ํ•ด๊ฒฐํ•˜์—ฌ ๋งŒ๋“  Attention ์•„ํ‚คํ…์ฒ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์—ฌ๊ธฐ์„œ block์œผ๋กœ ํ‘œ์‹œ๋œ Masked Self-Attention๊ณผ FFN์€ ํ•œ ์Œ์œผ๋กœ ๊ฐ„์ฃผ๋˜๋ฉฐ, ๋ช‡ ๊ฐœ์˜ block์ด ๋ฐ˜๋ณต๋˜๊ธฐ๋„ ํ•œ๋‹ค.


5. Transformer

Attention์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ํ˜„์žฌ NLP ์„ธ๊ณ„์˜ ๋ฟŒ๋ฆฌ๋กœ ์ž๋ฆฌ ์žก์€ Transformer model์„ ๋งŒ๋“ค์—ˆ๋‹ค.

Transformer๋Š” Encoder์™€ Decoder๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋Š”๋ฐ, Decoder์˜ ๊ตฌ์กฐ๋ฅผ ๋จผ์ € ์‚ดํŽด๋ณด์ž.

Decoder

The decoder is the part that builds a language-model-like system. It is almost identical to the attention architecture described above,

and one of the biggest changes is that self-attention is replaced with multi-head attention.

 

1. Multi-head Attention

Self-Attention์— ๊ด€ํ•ด์„œ ์ƒ๊ฐํ•ด ๋ณด๋ฉด, ์•„๋ž˜์˜ 'learned'์— ๋Œ€ํ•œ attention ๊ฐ€์ค‘์น˜๋ฅผ ํ•œ ๋ฒˆ์œผ๋กœ๋งŒ ํ‘œํ˜„ํ•œ๋‹ค.

์ด์™€ ๋‹ค๋ฅด๊ฒŒ multi-head attention์€ ๋‹ค์–‘ํ•œ attention์„ ํ†ตํ•ด ์–ด๋–ค attention์—์„œ๋Š” entitiy ๊ด€๋ จํ•œ attention์„ ๊ณ„์‚ฐํ•˜๋„๋ก, ๋˜ ์–ด๋–ค attention์—์„œ๋Š” syntactic ํ•œ ์ •๋ณด๋ฅผ ๊ณ„์‚ฐํ•˜๋„๋ก ํ•œ๋‹ค.

 

๊ธฐ๋ณธ attention์—์„œ์˜ ์ด๋ก ์€ query์™€ key ์‚ฌ์ด์˜ ์œ ์‚ฌ์„ฑ์„ ๊ธฐ๋ฐ˜์œผ๋กœ attention score๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ํ•ด๋‹น score์— softmax๋ฅผ ์ ์šฉํ•˜์—ฌ ๊ฐ€์ค‘์น˜๋กœ ๋งŒ๋“ค์–ด, value์— ๊ณฑํ•˜๋Š” ๊ฒƒ์ด์—ˆ๋‹ค. ๊ทธ๋ฆผ์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

Multi-headed attention์€ ์ด์™€ ๊ฑฐ์˜ ๋น„์Šทํ•œ๋ฐ, ๋‹ค์–‘ํ•œ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์œ„์—์„œ ์‚ฌ์šฉํ•œ matrix๋“ค์„ ๋ถ„ํ• ํ•˜์—ฌ ๊ณ„์‚ฐํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ถ„ํ• ์ด๋ž€, ๋” ๋‚ฎ์€ ์ฐจ์›์œผ๋กœ ๋‚ด๋ฆฌ๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ๊ทธ๋ž˜์„œ ์ด ๊ณผ์ •์„ ์‹œ๊ฐ์ ์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

 

2. Scaled Dot Product 

The second change introduced in the Transformer is the scaled dot product.

In the computation above, if the dimensionality d gets large, the dot products between vectors get large, which makes the softmax very peaked. When the softmax saturates, the gradients become small, so to prevent this the original score

e_ij = q_i · k_j

was changed to

e_ij = q_i · k_j / (d/h)^(1/2).

Dividing by (d/h)^(1/2) keeps the scores from blowing up.

 

3. Optimization Tricks: Residual Connection & Layer Normalization

Residual Connection์€ gradient vanishing ํ˜„์ƒ์„ ๋ง‰๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ, ์œ„์— RNN์„ ์„ค๋ช…ํ•  ๋•Œ ํ•จ๊ป˜ ์„ค๋ช…ํ–ˆ๋‹ค.

 

Layer Normalization์€ ์ •๊ทœํ™”๋ฅผ ํ•˜๋Š” ๋ฐฉ์‹์ธ๋ฐ, ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

 

Final Architecture

์ตœ์ข… ์•„ํ‚คํ…์ฒ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

block์— decoder๋ผ๊ณ  ์ž‘์„ฑ๋ผ ์žˆ์–ด์•ผ ํ•˜๋Š”๋ฐ, ์˜คํƒˆ์ž๊ฐ€ ์žˆ๋‹ค..

Attention ๊ตฌ์กฐ์™€ ๊ฑฐ์˜ ๋น„์Šทํ•œ๋ฐ, block์— Masked Multi-head Attention์ด ์ ์šฉ๋œ ๊ฒƒ๊ณผ Add&Norm์ด 2๋ฒˆ ์ ์šฉ๋๋‹ค๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ํฐ ๋ณ€ํ™”์ด๋‹ค.


Encoder

Encoder์˜ ๊ตฌ์กฐ๋„ ๊ฑฐ์˜ ๋น„์Šทํ•˜์ง€๋งŒ, language modeling์„ ์œ„ํ•ด unidirectional context๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š” decoder์™€๋Š” ๋‹ค๋ฅด๊ฒŒ encoder์—์„œ๋Š” bidirectional context, ์ฆ‰ ์–‘๋ฐฉํ–ฅ์„ฑ์„ ์œ„ํ•ด masking์„ ์ œ๊ฑฐํ•œ๋‹ค.


Overall & Cross Attention

์ตœ์ข…์ ์ธ Transformer ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

์—ฌ๊ธฐ์„œ ๋˜ ์‚ดํŽด๋ณผ ์ ์€ Encoder์˜ output์ด decoder์˜ masked multi head attention์˜ inputd์œผ๋กœ ๋“ค์–ด๊ฐ€๋Š” cross-attention์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋•Œ Keys์™€ Values๋Š” Encoder๋กœ๋ถ€ํ„ฐ, Queries๋Š” Decoder๋กœ๋ถ€ํ„ฐ ์˜จ๋‹ค๋Š” ๊ฒƒ์ด ํฐ ํŠน์ง•์ด๋‹ค.


Drawbacks

Transformer ๋ชจ๋ธ์˜ ๋‹จ์ ์œผ๋กœ๋Š” ๋‹ค์Œ ์‚ฌํ•ญ๋“ค์ด ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

  1. Quadratic compute in self-attention
  2. Position representation์˜ ๋ถ€์ •ํ™•์„ฑ

Wrap-up

This has been a fairly long write-up, from RNNs all the way to the Transformer.

Because I organized it in a problem -> solution style without touching the heavier math, it isn't very deep, but for anyone encountering the flow and history of these models for the first time, it should be plenty.