๐Ÿ“š ๋…ผ๋ฌธ

GPT-1: Improving Language Understanding by Generative Pre-Training

์žฅ์˜์ค€ 2023. 6. 20. 02:58

Abstract

Natural language์—๋Š” unlabeled text์˜ ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ labeled text์˜ ๋ฐ์ดํ„ฐ ์ˆ˜๋ณด๋‹ค ํ›จ์”ฌ ๋งŽ๋‹ค. ํ•ด๋‹น ์‚ฌ์‹ค์— ๊ทผ๊ฑฐํ•˜์—ฌ OpenAI์—์„œ๋Š” ๋‹ค์–‘ํ•œ unlabeled text๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ generative ํ•˜๊ฒŒ pre-train ์‹œํ‚จ GPT ๋ชจ๋ธ์„ ์ œ์‹œํ–ˆ๋‹ค. ํ•ด๋‹น ๋ชจ๋ธ์€ ์ด์ „ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ›จ์”ฌ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ์ฆ๋ช…ํ–ˆ๋‹ค.

Introduction

unlabeled data๋กœ๋ถ€ํ„ฐ word-level ์ด์ƒ์˜ ์ •๋ณด๋ฅผ ๋Œ์–ด๋‚ด๋Š” ๊ฒƒ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‘ ๊ฐ€์ง€์˜ ์ด์œ ๋กœ ์–ด๋ ต๋‹ค:

  1. transfer์— ์œ ์šฉํ•œ text ํ‘œํ˜„์„ ๋ฐฐ์šฐ๋Š” ๊ฒƒ์— ์–ด๋–คํ•œ ํ˜•ํƒœ์˜ ์ตœ์ ํ™” ๋ชฉ์  (optimation objectives)๊ฐ€ ์ข‹์€์ง€ ๋ชจ๋ฅธ๋‹ค.
  2. ํ•™์Šต๋œ ํ‘œํ˜„์„ target task์— ์ „๋‹ฌํ•  ๊ฐ€์žฅ ์ข‹์€ ๋ฐฉ๋ฒ•์ด ๋ฌด์—‡์ธ์ง€์— ๊ด€ํ•œ ์˜๊ฒฌ ์ผ์น˜๊ฐ€ ์—†๋‹ค.

์ด๋Ÿฐ ๋ถˆ๋ถ„๋ช…์„ฑ์ด NLP์— ํšจ๊ณผ์ ์ธ semi-supervised-learning์„ ๋””๋ฒจ๋กญํ•˜๋Š” ๊ฒƒ์— ์–ด๋ ค์›€์„ ์ฃผ์—ˆ๋‹ค.

ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” unsupervised pre-training๊ณผ supervised fine-tuning์„ ๊ฒฐํ•ฉํ•œ semi-supervised ์ ‘๊ทผ์„ ์‹œ๋„ํ•œ๋‹ค.

์ด ๋…ผ๋ฌธ์˜ ๋ชฉ์ ์€ ์ ์€ ๋ณ€ํ™”๋กœ ๋‹ค์–‘ํ•œ ์ž‘์—…์— transfer ํ•  ์ˆ˜ ์žˆ๋Š” ๋ณดํŽธ์ ์ธ ํ‘œํ˜„๋“ค์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋Œ€๋Ÿ‰์˜ corpus of unlabeled text๊ฐ€ ํ•„์š”ํ•˜๊ณ , ๋ชฉํ‘œ ์ž‘์—…์„ ์œ„ํ•œ labeled data๊ฐ€ ํ•„์š”ํ•˜๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ๊ณผ์ •์„ ๋งŒ๋“ค์—ˆ๋‹ค:

  1. Unlabeled data์— language modeling objective๋ฅผ ์ ์šฉ์‹œ์ผœ ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ํ–ˆ๋‹ค.
  2. ํ•ด๋‹น ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ labeled data๋ฅผ ์ด์šฉํ•˜์—ฌ target task์— fine-tuning์‹œํ‚จ๋‹ค.

๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜๋กœ๋Š” transformer๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ด ๋ชจ๋ธ์€ long-term(๊ธด) ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ RNN๊ณผ ๊ฐ™์€ ์ด์ „ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ›จ์”ฌ robust(ํŠผํŠผ)ํ•œ ๊ฒฐ๊ณผ๋ฌผ์„ ๋‚ด๋†“๋Š”๋‹ค. transfer์‹œ์—๋Š” traversal-style ์ ‘๊ทผ ๊ธฐ๋ฐ˜์—์„œ ์‚ฌ์šฉ๋œ task์— ํŠน์ •์ ์ธ input adaptation์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” task๋งˆ๋‹ค ์š”๊ตฌํ•˜๋Š” input text๋ฅผ ์—ฐ์†์ ์ธ ์‹ฑ๊ธ€ ์‹œํ€€์Šค๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ฆ‰ task์— ๋งž๋Š” ๋ฏธ์„ธ์กฐ์ •์„ ์œ„ํ•ด์„œ pre-trained ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ๋ณ€ํ˜•์‹œํ‚จ ๊ฒƒ์ด๋‹ค. ์ด ๋•Œ๋ฌธ์— pre-trained ๋ชจ๋ธ์˜ ์ถœ๋ ฅ๋งŒ ๋ฐ”๊ฟ”๋„ ๋‹ค์–‘ํ•œ ์ž‘์—…์— ๋ฏธ์„ธ์กฐ์ •์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

Related Work

 Semi-supervised learning for NLP: Early work used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in supervised models. Over the past few years, word embeddings trained on unlabeled corpora have brought performance gains across a variety of tasks. These approaches, however, mainly transfer word-level information, whereas this paper aims to transfer higher-level information. Recent work has accordingly begun to learn phrase-level and sentence-level embeddings from unlabeled data, going beyond word-level semantics.

 

Unsupervised pre-training: Unsupervised pre-training is a special case of semi-supervised learning in which the goal is to find a good initialization point. Early studies applied it to image classification and regression problems; later work showed that pre-training also acts as an effective regularizer, improving generalization in deep neural networks.

ํ•ด๋‹น ๋…ผ๋ฌธ๊ณผ ๋น„์Šทํ•œ ์—ฐ๊ตฌ๋กœ๋Š” ์–ธ์–ด ๋ชจ๋ธ objective๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‚ฌ์ „ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ณ , ์ด๋ฅผ ๋ชฉํ‘œ ์ž‘์—…์„ ์œ„ํ•ด ๋ฏธ์„ธ์กฐ์ •ํ•˜๋Š” ์—ฐ๊ตฌ๊ฐ€ ์žˆ์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด ์—ฐ๊ตฌ์—์„œ๋Š” ์‚ฌ์ „ํ•™์Šต์„ ํ•  ๋•Œ ์–ธ์–ด์  ์ •๋ณด๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด LSTM์„ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ์ด ๋•Œ๋ฌธ์— ์งง์€ ๋ฒ”์œ„์˜ ๋ฐ์ดํ„ฐ์—์„œ๋งŒ ๋ชจ๋ธ์ด ์œ ํšจํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” transformer๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธด ๋ฒ”์œ„์˜ ๋ฐ์ดํ„ฐ์—์„œ๋„ ์œ ํšจํ•˜๋„๋ก ํ–ˆ๋‹ค. ๋˜ํ•œ ๋” ํฐ ๋ฒ”์œ„์˜ ์ž‘์—…์—์„œ๋„ ์œ ์šฉํ•œ๋ฐ, ์ด๋Š” ์ž์—ฐ์–ด ์ถ”๋ก , paraphrase ๊ฐ์ง€, ๊ทธ๋ฆฌ๊ณ  story completion๋ฅผ ํฌํ•จํ•œ๋‹ค. ๋˜ํ•œ ๋‹ค๋ฅธ ์—ฐ๊ตฌ๋“ค์—์„œ๋Š” ๋ชฉํ‘œ ์ž‘์—…์„ ์œ„ํ•œ ์ง€๋„ํ•™์Šต์„ ์ง„ํ–‰ํ•  ๋•Œ ์‚ฌ์ „ํ•™์Šต์ด๋‚˜ ๊ธฐ๊ณ„ ๋ฒˆ์—ญ ๋ชจ๋ธ์—์„œ ๊ฐ€์ ธ์˜จ  ์€๋‹‰ ํ‘œํ˜„(hidden representation)์„ ๋ณด์กฐ feature๋กœ์จ ํ™œ์šฉํ•˜๋Š”๋ฐ,  ์ด๋Š” ๊ฐ ์ž‘์—…์„ ์œ„ํ•œ ์ƒ๋‹นํ•œ ์–‘์˜ ์ƒˆ๋กœ์šด ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์š”๊ตฌํ•œ๋‹ค. ์ด์— ๋น„ํ•ด ๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ transfer ์‹œ ๋ชจ๋ธ ๊ตฌ์กฐ์— ๋Œ€ํ•ด ์ตœ์†Œํ•œ์˜ ๋ณ€ํ™”๋งŒ์„ ์š”๊ตฌํ•œ๋‹ค

 

Auxiliary training objectives: Adding an auxiliary unsupervised objective is an alternative form of semi-supervised learning. Early work used auxiliary NLP tasks such as POS tagging to improve semantic role labeling, and more recent work added an auxiliary language modeling objective to the target-task objective, demonstrating performance gains on sequence labeling. The experiments in this paper also use an auxiliary objective, although unsupervised pre-training by itself already learns linguistic information relevant to the target tasks.

  • ๋‹จ์–ด ์ˆ˜์ค€๋ณด๋‹ค ๋” ๊ณ ์ฐจ์›์ ์ธ ์ˆ˜์ค€์˜ ์ •๋ณด๋ฅผ transfer ํ•˜๊ณ ์ž ํ•œ๋‹ค(๋ฌธ๋งฅ ์ˆ˜์ค€, ๋ฌธ์žฅ ์ˆ˜์ค€ ๋“ฑ)
  • ๋น„์ง€๋„ ์‚ฌ์ „ํ•™์Šต์˜ ๋ชฉํ‘œ๋Š” ์ข‹์€ ์‹œ์ž‘์ ์„ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค.
  • Transformer๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธด ๋ฒ”์œ„์˜ ๋ฐ์ดํ„ฐ์—์„œ๋„ ์œ ํšจํ•˜๋‹ค.
  • Transfer์‹œ ์ตœ์†Œํ•œ์˜ ๋ณ€ํ™”๋งŒ์„ ์š”๊ตฌํ•œ๋‹ค

Framework

์œ„์—์„œ ๋งํ–ˆ๋“ฏ์ด, GPT์˜ ํ•™์Šต์€ unsupervised pre-training๊ณผ supervised fine-tuning์˜ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค.

1. Unsupervised pre-training

GPT๋Š” ์ฃผ์–ด์ง„ embedding๋“ค์— ๋Œ€ํ•ด transformer์˜ decoding block๋“ค๋งŒ ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜๊ณ , ๊ทธ๋ ‡๊ฒŒ ๊ฒฐ๊ณผ๋ฌผ์„ ์˜ˆ์ธกํ•œ๋‹ค. 

์œ„ ์‹์—์„œ ์œ ์ถ”ํ•  ์ˆ˜ ์žˆ๋“ฏ์ด, ๋ฐ”๋กœ ์ „ ๋‹จ๊ณ„์—์„œ k๋ฒˆ์งธ ์ด์ „ ๋‹จ๊ณ„๊นŒ์ง€์˜ token๋“ค์„ ์‚ดํŽด๋ณธ ์ดํ›„์—, ๊ทธ๊ฒƒ์„ ๋ฐ”ํƒ•์œผ๋กœ i๋ฒˆ์งธ ๋‹จ์–ด๊ฐ€ ๋ฌด์—‡์ธ์ง€์— ๋Œ€ํ•œ likelihood๋ฅผ ์ตœ๋Œ€ํ™”์‹œํ‚ค๋Š” ๊ฒƒ์ด unsupervied pre-training์˜ ๋ชฉ์ ์ด๋‹ค.

๋” ์ž์„ธํ•œ ์‹์œผ๋กœ ๋ณด์ž๋ฉด,

์œ„ ์‹์—์„œ ๊ฐ ๋ณ€์ˆ˜๋“ค์˜ ์˜๋ฏธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

  • U = (u_(-k), …, u_(-1)): the context vector of tokens
  • n: the number of layers (the number of stacked decoder blocks, a hyperparameter)
  • We: the token embedding matrix
  • Wp: the position embedding matrix

์šฐ์„  ํ† ํฐ๋“ค์˜ context vector๊ฐ€ ์ž…๋ ฅ๋˜๊ณ  token embedding, position embedding์˜ ์ž‘์—…์„ ๊ฑฐ์ณ h0๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค.

์ดํ›„, h_(l-1) ๋ฒˆ์งธ ํ•ญ๋ชฉ์„ transformer์˜ n๋ฒˆ๋งŒํผ decoder ๋ถ€๋ถ„์— ํ†ต๊ณผ์‹œํ‚ค๊ณ , ์ตœ์ข…์ ์œผ๋กœ feed forward network, softmax ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์ณ ๋งˆ์ง€๋ง‰ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•œ๋‹ค.

์ด๋•Œ ์ค‘์š”ํ•œ ์ ์€ ํ† ํฐ์„ processing ํ•  ๋•Œ, masked self-attention์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์ด๋‹ค.
(masked self-attention: ๋‚ด๊ฐ€ processing ํ•˜๊ณ ์ž ํ•˜๋Š” token์˜ ๋‹ค์Œ sequence์˜ token๋“ค์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค)
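A minimal NumPy sketch of how such a causal mask blocks attention to future positions (illustrative only; the function names are my own, and a real implementation would also use learned query/key/value projections and scaled scores):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # True where attention is allowed: token i may attend to tokens 0..i only
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores: np.ndarray) -> np.ndarray:
    # Replace future (disallowed) positions with -inf so softmax gives them zero weight
    mask = causal_mask(scores.shape[-1])
    masked = np.where(mask, scores, -np.inf)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

With uniform scores, token 0 attends only to itself, while token i spreads its attention evenly over positions 0..i; no weight ever reaches a future token.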

 

2. Supervised fine-tuning

y๋ผ๋Š” label์ด ์ฃผ์–ด์ง„ x_1๋ถ€ํ„ฐ x_m๊นŒ์ง€์˜ ํ† ํฐ๋“ค์˜ sequence๊ฐ€ input์œผ๋กœ ๋“ค์–ด์˜ค๊ฒŒ ๋˜๋ฉด, ํ•ด๋‹น input๋“ค์€ pre-trained model์— ๋“ค์–ด๊ฐ€ final transformer block's activation h_l^m์„ ์–ป๊ฒŒ ๋œ๋‹ค.

์ดํ›„, h_l^m์„ ์ƒˆ๋กœ์šด linear output layer์— ๋„ฃ์–ด ์˜ˆ์ธกํ•œ๋‹ค.

์ฆ‰, GPT์˜ unsupervised hidden state์ธ x^m์˜ hidden state block์„ ๊ฐ€์ ธ๋‹ค๊ฐ€ linear layer์— ๋„ฃ๊ณ , softmax ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์ณ ์ตœ์ข… ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ•ด๋‹น ํ™•๋ฅ ์ด ์•„๋ž˜ ๊ทธ๋ฆผ์˜ L2๊ฐ€ ๋œ๋‹ค.

์ดํ›„, ์ €์ž๋“ค์€ ์œ„ ๋‹จ๊ณ„๋“ค์„ ํ†ตํ•ด ๋” ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ƒˆ๋‹ค.

1. L1(U)๋ฅผ ํ†ตํ•ด language model์„ pre-training ํ•˜๊ณ , 

2. ์ดํ›„ task-specific ํ•œ dataset์ด ์ฃผ์–ด์ง€๋ฉด, ํ•ด๋‹นํ•˜๋Š” dataset์— ๋Œ€ํ•œ language ๋ชจ๋ธ์˜ fine-tuning๊ณผ, supervised learning์— ํ•ด๋‹นํ•˜๋Š” ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ํ•จ๊ป˜ ๊ฒฐํ•ฉํ•˜์—ฌ ๊ทน๋Œ€ํ™”ํ•˜๋ฉด ๋” ์ข‹์€ ์„ฑ๋Šฅ์ด ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒƒ์„ ๋ณด์˜€๋‹ค.

3. Task-specific input transformations

์œ„์™€ ๊ฐ™์ด classification, entailment, similarity, multiple choice ๋“ฑ์˜ task๊ฐ€ ์žˆ๋‹ค.

๊ฐ๊ฐ์˜ task์— ๋Œ€ํ•˜์—ฌ input์„ ์œ„์™€ ๊ฐ™์ด ๋‹ค๋ฅด๊ฒŒ ๋งŒ๋“ค๋ฉด ํ›จ์”ฌ ํšจ๊ณผ์ ์ด์—ˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

Experiments

์œ„ ๊ทธ๋ฆผ์—์„œ, ์™ผ์ชฝ์—์„œ ๋ณด๋“ฏ์ด, decoding block์„ ์Œ“์„์ˆ˜๋ก ์„ฑ๋Šฅ์ด ์ ์  ์ข‹์•„์กŒ๋‹ค. (๋…ผ๋ฌธ์—์„œ๋Š” ์ตœ๋Œ€ 12๊ฐœ๊นŒ์ง€ ์Œ“์•˜๋‹ค.)

์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ์—์„œ ๋ณด๋“ฏ์ด, fine tuning๊ณผ zero-shot ๋ฐฉ๋ฒ•์„ ๋น„๊ตํ–ˆ์„ ๋•Œ, fine-tuning์„ ์ง„ํ–‰ํ•˜๋ฉด ์„ฑ๋Šฅ์ด ๋” ์ข‹์•„์ง์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค.