Model Architecture
- Multiple layers of Transformer decoders (see the sketch after this list)
- Architecturally similar to the original Transformer and to BERT
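A rough sketch of such a decoder-only stack is below. The layer sizes, the class names (DecoderBlock, TinyGPT), and the simplified block layout are assumptions for illustration, not the actual GPT-2 configuration.

```python
# Minimal decoder-only stack: masked self-attention + MLP per block, with
# residual connections and layer norm. Sizes here are toy values, not GPT-2's.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        # Masked self-attention: each position attends only to itself and the left.
        a, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=256, n_layers=4, n_heads=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(DecoderBlock(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        # Boolean mask: True marks positions a query is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x, mask)
        return self.head(self.ln_f(x))   # next-token logits at every position

logits = TinyGPT()(torch.randint(0, 50257, (1, 16)))   # shape (1, 16, 50257)
```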
Masked Self-Attention
- Unlike ordinary self-attention, it masks out the words to the right of the target word (see the sketch after this list)
- Because of this structure, the model is also called auto-regressive: it predicts the next word step by step
- This is more natural, since in practical problems we have no idea what the next words will be
- This leads to better model performance on generative tasks
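A minimal numpy sketch of the masking step, assuming toy token counts and dimensions: scores for positions to the right of each query are set to -inf before the softmax, so those tokens receive zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8                          # 4 tokens, 8-dim embeddings (toy sizes)
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)        # (T, T) attention scores
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[causal_mask] = -np.inf        # block attention to future (right-side) tokens
weights = softmax(scores, axis=-1)   # each row sums to 1 over the visible tokens
out = weights @ V

print(np.round(weights, 2))          # upper triangle is all zeros
```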
How can GPT-2 work on multiple tasks without fine-tuning?
- Feed both the original input and the 'task' together as one input
- Ex. "how are you" + "translate to Korean"
- Letting the model know the task improves performance (see the sketch after this list)
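A minimal sketch of such task-conditioned input; the prompt wording and the separator below are made-up assumptions, not the exact format used by GPT-2. The point is that the task description and the input form a single string.

```python
source_text = "how are you"
task = "translate English to Korean"
prompt = f"{task}: {source_text} ="

# The model simply continues this string, and the continuation is read as the answer.
print(prompt)   # translate English to Korean: how are you =
```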
GPT vs BERT
GPT:
- Pre-trained to predict the next word
- Uni-directional
- Good at generative tasks
- Decoder layers only
BERT:
- Pre-trained to predict a masked-out word in the middle of the sentence
- Bi-directional
- Good at understanding meaning
- Encoder layers only
Suppose that we want to predict "거기" ("there") in the sentence below.
GPT only refers to "어제", "카페", "갔었어" ("yesterday", "cafe", "went"), though not necessarily in a sequential manner, and masks out all tokens that come after the target word. BERT, on the other hand, uses all of the tokens.
In both models, however, the same attention mechanism is at work: the whole model is updated so that it learns which words in the candidate pool are "important". A toy sketch of the difference follows.
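In the sketch below, the "..." tokens after the target are placeholders for the rest of the sentence, which is not given in these notes; they are assumptions for illustration only.

```python
# Toy illustration: which tokens each model can use when predicting "거기" ("there").
tokens = ["어제", "카페", "갔었어", "거기", "...", "..."]   # "yesterday", "cafe", "went", "there", ...
target = tokens.index("거기")

gpt_context = tokens[:target]                                    # left context only
bert_input = tokens[:target] + ["[MASK]"] + tokens[target + 1:]  # all tokens, target masked out

print("GPT predicts the next token from:   ", gpt_context)
print("BERT predicts the masked token from:", bert_input)
```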