[CS224N] 6, 7, 8. RNN, LSTM, Seq2seq, Attention & Transformers

2023. 12. 30. 01:46

1. RNN

Simple RNN

Training RNN

Generating text with an RNN-LM

Problems with RNNs

Vanishing & Exploding Gradients

Solution for Gradient Explosion: Gradient Clipping

2. LSTM

3. NMT & Seq2Seq

Statistical Machine Translation (SMT)

Neural Machine Translation (NMT)

Structure of NMT

Training NMT

Multi-layer RNNs

Decoding Methods

Evaluation of Machine Translation

4. Attention

Problem of Seq2Seq Model: Bottleneck problem

Seq2seq with Attention

Problem of RNN: Linear interaction distance

Self-Attention

Problems with Attention and Their Solutions

1. No inherent notion of order -> Positional Encoding

2. No nonlinearities for deep learning, it's just weighted averages -> Adding Nonlinearities

3. Need to ensure not to look at the future -> Masking

5. Transformer

Decoder

1. Multi-head Attention

2. Scaled Dot Product

3. Optimization Tricks: Residual Connection & Layer Normalization

Final Architecture

Encoder

Overall & Cross Attention

Drawbacks

Wrap-up
