Breaking Down the "Attention Is All You Need" Paper


In this post, I'll summarize the Transformer paper, the model at the core of what we do at POZAlabs, and explain some additional techniques along the way.

Why?

Long-term dependency problem

Parallelization

Model Architecture

Encoder and Decoder Structure

[Figure: the encoder-decoder structure]

Encoder and Decoder Stacks

[Figure: the Transformer model architecture]

Encoder

Decoder

Embeddings and Softmax
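The paper shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation, and multiplies the embedding weights by $\sqrt{d_{\text{model}}}$. A minimal PyTorch sketch of this weight tying (the class and method names are my own):

```python
import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, tokens):
        # Embedding lookup, scaled by sqrt(d_model) as in the paper
        return self.embed(tokens) * math.sqrt(self.d_model)

    def project(self, hidden):
        # Pre-softmax logits computed with the transposed embedding matrix,
        # so the embedding and the output projection share their weights
        return hidden @ self.embed.weight.t()
```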

Attention

Scaled Dot-Product Attention

[Figure: Scaled Dot-Product Attention]
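The paper defines attention as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where the dot products are scaled by $\sqrt{d_k}$ so that large magnitudes don't push the softmax into regions with tiny gradients. A minimal PyTorch sketch (the function name is my own):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get -inf so they vanish after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # attention distribution over the keys
    return torch.matmul(weights, v)      # weighted sum of the values
```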

Multi-Head Attention

[Figure: Multi-Head Attention]

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

where

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$

The inputs and projection matrices have dimensions

$$Q \in \mathbb{R}^{n_q \times d_{\text{model}}}, \quad K \in \mathbb{R}^{n_k \times d_{\text{model}}}, \quad V \in \mathbb{R}^{n_v \times d_{\text{model}}}$$

$$W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$$

where $n_q$, $n_k$, and $n_v$ are the numbers of queries, keys, and values, respectively.
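In code, all $h$ heads are usually packed into single $d_{\text{model}} \times d_{\text{model}}$ linear layers and split by reshaping, rather than materializing $h$ separate projections. A minimal sketch, reusing the scaled_dot_product_attention function from above (the class name is my own):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h                  # paper: d_k = d_v = d_model / h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)   # all W_i^Q packed into one matrix
        self.w_k = nn.Linear(d_model, d_model)   # all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        n = q.size(0)
        # Project, then split d_model into h heads of size d_k
        q = self.w_q(q).view(n, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(n, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(n, -1, self.h, self.d_k).transpose(1, 2)
        x = scaled_dot_product_attention(q, k, v, mask)  # attention per head
        # Concatenate the heads back together and apply the final projection
        x = x.transpose(1, 2).contiguous().view(n, -1, self.h * self.d_k)
        return self.w_o(x)
```

With $d_{\text{model}} = 512$ and $h = 8$, each head attends in $d_k = d_v = 64$ dimensions, so the total computational cost stays close to single-head attention with full dimensionality.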

Self-Attention

Encoder Self-Attention Layer

[Figure: encoder self-attention]

Decoder Self-Attention Layer

[Figure: decoder self-attention]
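In the decoder, self-attention must not look at future positions, since those tokens don't exist yet at generation time; this is enforced by masking the attention scores. A minimal sketch of such a subsequent-position mask (the helper name is my own), which plugs into the mask argument of the attention sketch above:

```python
import torch

def subsequent_mask(size):
    # Lower-triangular boolean matrix: position i may attend to positions <= i
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# subsequent_mask(4) ->
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```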

Encoder-Decoder Attention Layer

[Figure: encoder-decoder attention]
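In this layer the queries come from the previous decoder layer, while the keys and values come from the encoder output, so every decoder position can attend over the entire input sequence. Sketched with the MultiHeadAttention class above (decoder_hidden and encoder_output are hypothetical tensors of shape (batch, seq_len, d_model)):

```python
# Cross-attention: Q from the decoder, K and V from the encoder
cross_attention = MultiHeadAttention(d_model=512, h=8)
out = cross_attention(q=decoder_hidden, k=encoder_output, v=encoder_output)
```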

Position-wise Feed-Forward Networks

[Figure: position-wise feed-forward network]

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$
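The same two-layer network is applied to each position separately and identically: a linear expansion to the inner dimension $d_{ff} = 2048$, a ReLU (the $\max(0, \cdot)$ in the formula), and a linear projection back down to $d_{\text{model}} = 512$. A minimal sketch (the class name is my own; the dropout placement follows common implementations rather than the paper):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # inner layer, d_ff = 2048 in the paper
            nn.ReLU(),                 # the max(0, .) in the formula
            nn.Dropout(dropout),       # common implementation choice
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Applied identically and independently at every position
        return self.net(x)
```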

Positional Encoding
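Since the model contains no recurrence or convolution, order information is injected by adding fixed sinusoidal encodings to the embeddings:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

A minimal sketch that precomputes the table (the function name is my own):

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # 1 / 10000^(2i / d_model), computed via exp/log for numerical convenience
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe  # (max_len, d_model), added to the input embeddings
```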

Training

Optimizer

[Figure: the learning rate over training steps]
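The paper uses Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) with a schedule that increases the learning rate linearly over the first warmup steps, then decays it proportionally to the inverse square root of the step number:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$$

A minimal sketch with $warmup\_steps = 4000$ as in the paper (the function name is my own):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps,
    # then decay proportional to 1 / sqrt(step)
    step = max(step, 1)  # guard against 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```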

Regularization

Residual Connection

Layer Normalization

Dropout
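These three pieces combine into one wrapper around every sub-layer: the sub-layer output is dropped out, added back to its input through the residual connection, and then layer-normalized, i.e. $\mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Sublayer}(x)))$. A minimal sketch (the class name is my own):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Dropout on the sub-layer output, residual connection, then LayerNorm
        return self.norm(x + self.dropout(sublayer(x)))
```

An encoder layer applies this wrapper twice, once around self-attention and once around the feed-forward network; a decoder layer applies it three times.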

Label Smoothing
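During training the paper smooths the one-hot targets with $\epsilon_{ls} = 0.1$, which hurts perplexity but improves accuracy and BLEU because the model learns to be less over-confident. A minimal sketch of one common formulation, spreading $\epsilon$ uniformly over the non-target classes (the function name is my own):

```python
import torch

def smooth_labels(target, vocab_size, eps=0.1):
    # Each row becomes eps / (vocab_size - 1) everywhere except the true
    # class, which gets probability 1 - eps
    smoothed = torch.full((target.size(0), vocab_size), eps / (vocab_size - 1))
    smoothed.scatter_(1, target.unsqueeze(1), 1.0 - eps)
    return smoothed  # train against this with KL divergence or cross-entropy
```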

Conclusion

Reference
