Model Architecture

[Figure: GPT-2 model architecture (stacked transformer decoder blocks)]

  • Multiple layers of transformer decoder blocks stacked on top of each other

  • The overall structure looks similar to the original Transformer and to BERT

Masked Self-Attention

[Figures: masked self-attention compared with standard self-attention]

  • Unlike standard self-attention, it masks out the words to the right of the target word

  • Because of this structure, the model is called auto-regressive: it predicts the next word step by step

  • This is more natural in practice, since at generation time we have no idea what the next words will be

  • As a result, it tends to improve performance on generative tasks (see the sketch below)
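
To make the masking concrete, here is a minimal single-head sketch in NumPy. The projection matrices W_q, W_k, W_v, the function name, and the toy shapes are illustrative assumptions, not code from the original post.

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention with a causal (look-ahead) mask.

    x: (seq_len, d_model) token representations
    W_q, W_k, W_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)               # (seq_len, seq_len)

    # Mask out positions to the right of each token: position i may only
    # attend to positions <= i, which is what makes the model auto-regressive.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    # Softmax over the allowed (left-side) positions only.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (seq_len, d_head)

# Toy usage: 5 tokens, d_model = 16, d_head = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = causal_self_attention(x, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```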

How can GPT-2 work on multiple tasks without fine-tuning?

  • Feed the model both the original input and a description of the task as one input
  • Ex. “how are you” + “translate to Korean”
  • Letting the model know the task in this way improves performance (a rough sketch follows below)
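
A rough sketch of this prompt format, using the Hugging Face transformers text-generation pipeline as one possible way to try it. The exact prompt wording is an assumption, and the small public gpt2 checkpoint will not actually produce a good Korean translation; the point is only that the task description and the input go in together as plain text, with no fine-tuning step.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The "task" is expressed in the prompt itself, next to the original input.
task = "Translate English to Korean:"
original_input = "how are you"
prompt = f"{task} {original_input} ="

print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```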

GPT vs BERT

GPT:

  • Pre-trained to predict the next word
  • Uni-directional
  • good at generation tasks
  • layers of decoders only

BERT:

  • Pre-trained to mask out words in the middle of a sentence and predict them
  • Bi-directional
  • good at understanding the meaning of text
  • layers of encoders only (a short sketch contrasting the two objectives follows below)
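
To see the two pre-training objectives side by side, here is a small sketch using the transformers pipelines. The model names gpt2 and bert-base-uncased are the standard public checkpoints, and the example sentences are made up for illustration.

```python
from transformers import pipeline

# GPT-2: left-to-right next-word prediction (decoder-only).
gpt2 = pipeline("text-generation", model="gpt2")
print(gpt2("The cat sat on the", max_new_tokens=5)[0]["generated_text"])

# BERT: fill in a masked word using context from both sides (encoder-only).
bert = pipeline("fill-mask", model="bert-base-uncased")
for pred in bert("The cat sat on the [MASK] all day.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```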

Suppose we want to predict “거기” in the sentence below.

[Figure: example sentence containing “어제”, “카페”, “갔었어”, and the target word “거기”]

GPT refers only to “어제”, “카페”, “갔었어” (though not in a strictly sequential manner), masking out all tokens that come after the target word. BERT, on the other hand, uses all the tokens in the sentence.

In both models, however, the attention mechanism is at the core: it is trained to pick out the “important” words from the candidate pool as the whole model is updated.
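
To tie this back to the example above, the sketch below compares which tokens each model can attend to when predicting “거기”. The continuation tokens “사람” and “많더라” are hypothetical, since the post only names “어제”, “카페”, “갔었어”, and “거기”.

```python
import numpy as np

tokens = ["어제", "카페", "갔었어", "거기", "사람", "많더라"]
target = tokens.index("거기")
n = len(tokens)

# GPT: "거기" is predicted from the previous position, whose causal mask
# only allows attention to tokens on the left of the target.
causal = np.tril(np.ones((n, n), dtype=bool))
print([t for t, ok in zip(tokens, causal[target - 1]) if ok])
# -> ['어제', '카페', '갔었어']

# BERT: the target position (replaced by [MASK] during pre-training)
# attends to every position in the sentence, left and right.
full = np.ones((n, n), dtype=bool)
print([t for t, ok in zip(tokens, full[target]) if ok])
# -> all six tokens
```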

Reference