Breaking Down the "Attention Is All You Need" Paper


In this post, I will summarize the paper on the Transformer, the model used at the core of 포자랩스's work, and also explain some additional techniques along the way.

Why?

Long-term dependency problem

Parallelization

Model Architecture

Encoder and Decoder structure

[Figure: encoder-decoder]

Encoder and Decoder stacks

[Figure: architecture]

Encoder

Decoder

Embeddings and Softmax

Attention

Scaled Dot-Product Attention

[Figure: scaled dot-product attention]

\[Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]
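
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, the toy shapes, and the optional mask argument are illustrative choices of mine, not something prescribed by the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (batch, len_q, d_v)

# toy example: batch of 2 sequences, 4 tokens each, d_k = d_v = 8
Q = K = V = np.random.randn(2, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)        # (2, 4, 8)
```

The division by \(\sqrt{d_k}\) keeps the dot products from growing too large, which would otherwise push the softmax into regions with very small gradients.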

Multi-Head Attention

[Figure: multi-head attention]

\[MultiHead(Q, K, V) = Concat(head_1,...,head_h)W^O\]

where \(head_i=Attention(QW_i^Q, KW_i^K, VW_i^V)\)

[Figure: multi-head]

[Figure: dimension]

* \(d_Q, d_K, d_V\) denote the number of queries, keys, and values, respectively.
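
Below is a minimal sketch of multi-head attention that reuses the scaled_dot_product_attention function from the earlier sketch. Representing the projections as a list of matrices and the toy dimensions are simplifications of mine; a real implementation would batch the heads into single tensors.

```python
import numpy as np
# assumes scaled_dot_product_attention from the previous sketch is in scope

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O."""
    heads = [
        scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
        for Wq, Wk, Wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

d_model, h = 512, 8
d_k = d_v = d_model // h                                  # 64, as in the paper
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]   # W_i^Q
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]   # W_i^K
W_v = [np.random.randn(d_model, d_v) for _ in range(h)]   # W_i^V
W_o = np.random.randn(h * d_v, d_model)                   # W^O

x = np.random.randn(2, 4, d_model)                        # (batch, seq_len, d_model)
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)  # (2, 4, 512)
```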

Self-Attention

encoder self-attention layer

[Figure: encoder self-attention]

decoder self-attention layer

[Figure: decoder self-attention]

Encoder-Decoder Attention Layer

[Figure: encoder-decoder attention]

Position-wise Feed-Forward Networks

[Figure: feed-forward network]

\[FFN(x)=max(0, xW_1+b_1)W_2+b_2\]

[Figure: FFN]
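
The two linear transformations with a ReLU in between can be written directly from the formula. Here is a small NumPy sketch using the d_model = 512 and d_ff = 2048 sizes from the paper; the random initialization scale is arbitrary.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.01
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.01
b2 = np.zeros(d_model)

x = np.random.randn(2, 4, d_model)                 # (batch, seq_len, d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (2, 4, 512)
```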

Positional Encoding

\[PE_{(pos, 2i)}=sin(pos/10000^{2i/d_{model}})\]

\[PE_{(pos, 2i+1)}=cos(pos/10000^{2i/d_{model}})\]

\[PE_{pos}=[sin(pos/1), cos(pos/1), sin(pos/10000^{2/d_{model}}), cos(pos/10000^{2/d_{model}}), ..., sin(pos/10000), cos(pos/10000)]\]

Writing \(c = 10000^{2i/d_{model}}\), so that \(PE_{(pos, 2i)}=sin(\frac{pos}{c})\) and \(PE_{(pos, 2i+1)}=cos(\frac{pos}{c})\), the encoding at position \(pos+k\) can be expressed in terms of the encoding at \(pos\):

\[PE_{(pos+k, 2i)}=sin(\frac{pos+k}{c})=sin(\frac{pos}{c})cos(\frac{k}{c})+cos(\frac{pos}{c})sin(\frac{k}{c})=PE_{(pos,2i)}cos(\frac{k}{c})+cos(\frac{pos}{c})sin(\frac{k}{c})\]

\[PE_{(pos+k, 2i+1)}=cos(\frac{pos+k}{c})=cos(\frac{pos}{c})cos(\frac{k}{c})-sin(\frac{pos}{c})sin(\frac{k}{c})=PE_{(pos,2i+1)}cos(\frac{k}{c})-sin(\frac{pos}{c})sin(\frac{k}{c})\]
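
Here is a sketch of how the sinusoidal table can be built in NumPy. The vectorized indexing is my own, but the values follow the \(PE_{(pos, 2i)}\) and \(PE_{(pos, 2i+1)}\) definitions above.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2): the exponent 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); these vectors are added to the token embeddings
```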

Training

Optimizer

\[lrate = d_{model}^{-0.5}\cdot min(step\_num^{-0.5},step\_num \cdot warmup\_steps^{-1.5})\]

[Figure: learning rate graph]
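
The schedule is easy to reproduce. Here is a small sketch; the defaults follow the paper's d_model = 512 and warmup_steps = 4000.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# rises linearly during warmup, then decays as 1/sqrt(step)
for step in (100, 1000, 4000, 40000, 100000):
    print(step, round(transformer_lrate(step), 6))
```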

Regularization

Residual Connection

\[y_l = h(x_l) + F(x_l, W_l)\]

\[x_{l+1} = f(y_l)\]

When both \(h\) and \(f\) are identity mappings, the recursion unrolls:

\[x_2 = x_1+F(x_1,W_1)\]

\[x_3 = x_2+F(x_2,W_2)=x_1+F(x_1,W_1)+F(x_2,W_2)\]

\[x_L = x_l+\sum^{L-1}_{i=l} F(x_i, W_i)\]

\[\frac{\partial\epsilon}{\partial x_l}= \frac{\partial\epsilon}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial\epsilon}{\partial x_L} (1+\frac{\partial}{\partial x_l}\sum^{L-1}_{i=l} F(x_i, W_i))\]
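
The unrolled form \(x_L = x_l+\sum F(x_i, W_i)\) is easy to verify numerically. Below is a toy check in which F is a simple ReLU layer chosen only for illustration; it is not the Transformer sub-layer itself.

```python
import numpy as np

def residual_block(x, W):
    """x_{l+1} = x_l + F(x_l, W_l); F is a simple ReLU layer for illustration."""
    return x + np.maximum(0.0, x @ W)

x1 = np.random.randn(4, 16)
Ws = [np.random.randn(16, 16) * 0.1 for _ in range(3)]

x2 = residual_block(x1, Ws[0])
x3 = residual_block(x2, Ws[1])
x4 = residual_block(x3, Ws[2])

# unrolling the recursion reproduces x_L = x_l + sum_i F(x_i, W_i)
manual = x1 + sum(np.maximum(0.0, xi @ W) for xi, W in zip((x1, x2, x3), Ws))
print(np.allclose(x4, manual))   # True: the identity path carries x_1 to every layer
```

Because of that identity path, the gradient always contains the direct term \(\frac{\partial\epsilon}{\partial x_L}\), which is what keeps very deep stacks trainable.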

Layer Normalization

\[\mu^l = \frac{1}{H}\sum_{i=1}^H a^l_i\]

\[\sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^H(a^l_i-\mu^l)^2}\]

\[h^t = f[\frac{g}{\sigma^t}\odot(a^t-\mu^t)+b]\]
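
A minimal NumPy version of the statistics above; the small eps term is a numerical-stability addition of mine and is not part of the equations.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    """Normalize each vector over its hidden dimension H, then apply gain g and bias b."""
    mu = a.mean(axis=-1, keepdims=True)                              # mu
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))    # sigma
    return g * (a - mu) / (sigma + eps) + b

H = 512
g = np.ones(H)                     # learned gain
b = np.zeros(H)                    # learned bias
a = np.random.randn(2, 4, H)       # (batch, seq_len, H)
out = layer_norm(a, g, b)
print(out.mean(axis=-1).round(6).max(), out.std(axis=-1).round(2).max())  # ~0 and ~1
```

Unlike batch normalization, the statistics are computed per example over the hidden units, so they do not depend on the batch size.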

Dropout

Label Smoothing

\[q'(k|x)=(1-\epsilon)\delta_{k,y}+\epsilon u(k)\]
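
A quick sketch of the smoothed target distribution, using the uniform \(u(k)=1/V\) over the vocabulary and \(\epsilon_{ls}=0.1\) as in the paper; the function name and the toy vocabulary size are mine.

```python
import numpy as np

def smooth_labels(y, vocab_size, eps=0.1):
    """q'(k|x) = (1 - eps) * delta_{k,y} + eps * u(k), with u uniform over the vocabulary."""
    q = np.full(vocab_size, eps / vocab_size)   # eps * u(k)
    q[y] += 1.0 - eps                           # (1 - eps) * delta_{k,y}
    return q

q = smooth_labels(y=3, vocab_size=10, eps=0.1)
print(q.round(3))         # correct class gets 0.91, every other class 0.01
print(q.sum().round(6))   # 1.0
```

Training against this softened target hurts perplexity slightly, since the model learns to be less certain, but it improves accuracy and BLEU, as reported in the paper.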

Conclusion

Reference
