
[CS224N] 5. Language Models and Recurrent Neural Networks

์žฅ์˜์ค€ 2023. 11. 20. 17:14

In the previous post, we briefly covered the history of dependency parsing, how to build a dependency parser with a neural net, and regularization in neural nets.

In this post, we first cover Language Modeling briefly, then explain the basics of RNNs.


1. Language Modeling

Language Modeling์ด๋ž€, ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ๋‹ค์Œ์— ๋‚˜์˜ฌ์ง€ ์˜ˆ์ธกํ•˜๋Š” ํƒœ์Šคํฌ๋ฅผ ๋œปํ•œ๋‹ค. ์ฆ‰, context์— ๋Œ€ํ•ด ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

Stated with a bit of notation: given words x(1), x(2), ..., x(t), compute the probability distribution of the next word x(t+1).

Expressed as a probability:

P(x(t+1) | x(t), x(t-1), ..., x(1))

where x(t+1) can be any word in the vocabulary V.

Language Modeling์„ ์œ„ํ•ด ๊ณ ์•ˆ๋œ ๋ฐฉ๋ฒ•๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

1. N-gram Language Models

First, an n-gram is a chunk of n consecutive words. Kinds of n-grams include unigrams, bigrams, trigrams, 4-grams, and so on.

N-gram์„ ํ™œ์šฉํ•œ language model์˜ ์•„์ด๋””์–ด๋Š” ๋‹ค์–‘ํ•œ ๋ฌธ์žฅ๋“ค์— n-gram์— ๋Œ€ํ•œ ํ†ต๊ณ„๋ฅผ ๊ตฌํ•˜๊ณ , ํ•ด๋‹น ํ†ต๊ณ„๋ฅผ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

Markov assumption

First, the Markov assumption is the assumption that the prediction of word x(t+1) depends only on the preceding (n-1) words.

Under this assumption, P(x(t+1) | x(t), x(t-1), ..., x(1)) simplifies to:

P(x(t+1) | x(t), ..., x(t-n+2))

And by the definition of conditional probability, this can be written as the ratio of an n-gram probability to an (n-1)-gram probability:

P(x(t+1) | x(t), ..., x(t-n+2)) = P(x(t+1), x(t), ..., x(t-n+2)) / P(x(t), ..., x(t-n+2))

So how do we obtain these n-gram and (n-1)-gram probabilities?

By simply counting how often each n-gram occurs in a large body of text:

P(x(t+1) | x(t), ..., x(t-n+2)) ≈ count(x(t-n+2), ..., x(t), x(t+1)) / count(x(t-n+2), ..., x(t))

 

As an example, consider the following 4-gram language model.

For the example sentence "students opened their ___", the process for predicting the word in the blank goes as follows (a small code sketch appears after the list):

  1. Since this is a 4-gram model, keep only the 4 words ending at the blank (the blank plus the 3 words before it) and discard everything earlier.
  2. Count how many times "students opened their" appears in the available text.
  3. Among those occurrences, count how many times each word appears right after "students opened their".
    (In the lecture's example, "books" follows 400 times and "exams" 100 times.)
  4. Turn those counts into a probability distribution over the candidate words.
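To make this concrete, here is a minimal Python sketch of the counting approach; the function names and the example numbers are illustrative, not from the lecture:

    from collections import Counter, defaultdict

    def build_ngram_counts(tokens, n=4):
        # Count every (n-1)-word context and, for each context,
        # how often each next word follows it.
        context_counts = Counter()
        next_word_counts = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i : i + n - 1])
            next_word_counts[context][tokens[i + n - 1]] += 1
            context_counts[context] += 1
        return context_counts, next_word_counts

    def next_word_distribution(context, context_counts, next_word_counts):
        # P(w | context) = count(context, w) / count(context)
        total = context_counts[context]
        return {w: c / total for w, c in next_word_counts[context].items()}

    # e.g., if "students opened their" occurred 1,000 times, with "books"
    # after it 400 times and "exams" 100 times, this would give
    # P(books | students opened their) = 0.4 and P(exams | ...) = 0.1.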

But this counting approach clearly has problems.

Problems

  1. The first problem is the Sparsity Problem. Look closely at the quantities used to compute the probability above.
    • The numerator can be 0. This happens when "students opened their w" never appears in the training data for some word w.
      One fix is to add a small delta to the count of every word w in the vocabulary so that no probability is exactly 0. This is called smoothing.
    • The denominator can also be 0. This happens when the context "students opened their" itself never appears.
      One fix is to shorten the context, e.g., backing off from the 4-gram context to "opened their". This is called backoff. (A small sketch of both fixes follows this list.)
  2. The second problem is that the model is huge. With an n-gram model, you have to store the count of every n-gram observed in the corpus, i.e., an enormous table of probabilities. This is essentially unsolvable within the n-gram framework.
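As a rough sketch of these two fixes, building on the hypothetical count structures above (the delta value and the multi-order counting are assumptions for illustration, not the lecture's code):

    def smoothed_distribution(context, context_counts, next_word_counts,
                              vocab, delta=0.5):
        # Add-delta smoothing: every word in the vocabulary gets a small
        # pseudo-count, so no probability is exactly zero.
        total = context_counts[context] + delta * len(vocab)
        return {w: (next_word_counts[context][w] + delta) / total for w in vocab}

    def backed_off_context(context, context_counts):
        # Backoff: if the full context was never seen, condition on a
        # shorter suffix instead (assumes context_counts holds contexts of
        # every order, e.g. by merging build_ngram_counts for n = 2..4).
        while context and context_counts[context] == 0:
            context = context[1:]
        return context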

N-gram language models in practice

๊ทธ๋Ÿผ ์ด n-gram ๋ชจ๋ธ๋“ค์ด ์‹ค์ œ๋กœ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋๋Š”์ง€์— ๋Œ€ํ•ด ํ•œ๋ฒˆ ์•Œ์•„๋ณด์ž.

For example, suppose we have a corpus of 1.7 million words, and we want to fill in the blank in "today the ___" and then keep generating a sentence from there.

And suppose we sample one word at a time at random. Then the process is: look up the distribution over words that followed "today the" in the corpus, sample one word from it, append that word to the sentence, slide the context window forward by one, and repeat.
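A minimal sketch of that sampling loop, reusing the hypothetical helpers from above (random.choices does the weighted draw):

    import random

    def generate(context, context_counts, next_word_counts, num_words=50):
        # Repeatedly sample the next word from the model's distribution,
        # then slide the context window forward by one word.
        sentence = list(context)
        for _ in range(num_words):
            dist = next_word_distribution(context, context_counts,
                                          next_word_counts)
            words, probs = zip(*dist.items())
            next_word = random.choices(words, weights=probs)[0]
            sentence.append(next_word)
            context = (*context[1:], next_word)
        return " ".join(sentence)

    # e.g., generate(("today", "the"), ...) for a trigram model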

์ด ๋ชจ๋“  ๊ฒƒ์ด ๋ฐ˜๋ณต์ ์œผ๋กœ ์ง„ํ–‰๋˜๋ฉด, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฌธ์žฅ์ด ๋งŒ๋“ค์–ด์ง„๋‹ค:

์ƒ์„ฑ๋œ ๋ฌธ์žฅ์„ ๋ดค์„ ๋•Œ, ๋ฌธ๋ฒ•์ ์œผ๋กœ ๊ต‰์žฅํžˆ ์šฐ์ˆ˜ํ•œ ๋ฌธ์žฅ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๋‚ด์šฉ์€ ์ •๋ง ๋ง๋„ ์•ˆ ๋˜๊ธฐ ๋•Œ๋ฌธ์—, ๋‚ด์šฉ์˜ ์šฐ์ˆ˜์„ฑ์„ ์œ„ํ•ด ํ›จ์”ฌ ๋” ๋‚˜์€ model์ด ํ•„์š”ํ•˜๋‹ค.

2. Neural Language Models

The model that emerged to address this is the neural language model. Here we first look at a simple fixed-window classifier, and then treat RNNs in more depth.

So how can we build a neural language model?

First, as we saw with NER, we start from the windowing idea.

NER์—์„œ ๋ดค๋˜ ๊ทธ๋ฆผ

๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

  1. As in the n-gram model, discard the earlier words and fix a window of the preceding words.
  2. Represent the input words as one-hot vectors, look up their word embeddings, and concatenate the embedding vectors.
  3. Pass the concatenated vector through a hidden layer, and
  4. attach a softmax classifier as the final layer to predict the next word.

As a result, words like "books" and "laptops" come out as the predictions.
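A minimal PyTorch sketch of this fixed-window model (all sizes and names here are illustrative assumptions, not the lecture's code):

    import torch
    import torch.nn as nn

    class FixedWindowLM(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, window=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # one-hot x E lookup
            self.hidden = nn.Linear(window * embed_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, window_ids):            # (batch, window) word indices
            e = self.embed(window_ids)            # (batch, window, embed_dim)
            e = e.flatten(start_dim=1)            # concatenate the embeddings
            h = torch.tanh(self.hidden(e))        # hidden layer
            return torch.softmax(self.out(h), dim=-1)  # next-word distribution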

The pros and cons of this classifier are as follows:

 

Pros

  • Unlike n-gram models, there is no sparsity problem, because word embeddings give semantically similar words similar distributions. For example, if we swap "students" for "people" in the example above, an n-gram model has to look up an entirely different context, whereas a neural LM treats "people" and "students" as similar.
  • The storage problem is solved. A neural LM only needs to store the word vectors and the weight matrices; it does not have to store a probability for every possible n-gram.

Cons

  • The window is fixed and too small. Enlarging the window enlarges W, and since each position in the window is multiplied by its own block of W, there is no weight sharing across positions: what the model learns about a word at one position does not transfer to another.

To solve this, we need a new neural architecture that can handle input of any length, and that is exactly what the RNN was created for.


2. Recurrent Neural Networks (RNN)

(๋ณธ ๊ฐ•์˜์—์„œ๋Š” ๋จผ์ € simple RNN์— ๋Œ€ํ•ด ์„ค๋ช…ํ•œ๋‹ค.)

RNN์˜ core idea๋Š” ๊ฐ™์€ ๊ฐ€์ค‘์น˜ W๋ฅผ ๋ฐ˜๋ณตํ•ด์„œ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

classifier์˜ hidden layer๊ฐ€ ์žˆ๋Š” ๋Œ€์‹ , hidden state๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์ž๊ธฐ ์ž์‹ ์—๊ฒŒ feed back ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์Šค์Šค๋กœ feed back ํ•˜๋Š” ๊ฒƒ์„ recurrent ํ•˜๋‹ค๊ณ  ๋งํ•œ๋‹ค. ๊ณ„์‚ฐ ๊ณผ์ •์„ ๊ทธ๋ฆผ์œผ๋กœ ์‚ดํŽด๋ณด์ž.

๊ณผ์ •์„ ๊ธ€๋กœ ์„ค๋ช…ํ•˜์ž๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

  1. Convert the input words to one-hot vectors and apply word embeddings to represent each word as a vector e(t).
  2. Compute the hidden state with the formula: h(t) = tanh(W_h h(t-1) + W_e e(t) + b_1)
    • h(0) is usually initialized to all zeros.
    • W_e is the weight applied to the input embedding,
    • W_h is the weight that updates the network's hidden state.
  3. To get the output distribution, apply softmax: y^(t) = softmax(U h(t) + b_2)
    • With this formula, the hidden state at any position yields a probability distribution over the next word.

Now let's look at the pros and cons of the RNN.

Pros

  • It can process input of any length.
  • The model does not grow with a longer input context: it only ever uses W_h and W_e.
  • Because the same weights are applied at every time step, there is symmetry in how inputs are processed.

Cons

  • It is slow. The hidden states must be computed one after another in a for loop, so it takes O(n) sequential steps.
  • In theory it can use information from words many steps back when predicting, but in practice it fails to do so.

That covers everything from language models up to the simple RNN. In the next post, we will look at fancy RNNs and the LSTM.