BERT Architecture: multi-layer bi-directional transformer encoders
There are two BERT model sizes: BERT BASE and BERT LARGE
- Both look basically the same: multi-layer bi-directional transformer encoders
- We need to remember some numbers - common sense for NLP engineers: BERT BASE has 12 layers, hidden size 768, 12 attention heads (~110M parameters); BERT LARGE has 24 layers, hidden size 1024, 16 heads (~340M parameters)
- Same encoder block as the original Transformer, just stacked deeper (12 layers in BERT BASE, 24 in BERT LARGE)
- For the figure example, we use only the output at the first position, i.e. the one produced from the first input token ([CLS]), for simple classification. Via the self-attention mechanism, this first output is expected to contain all the information needed for classification (see the sketch below).
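A minimal sketch of the "use the first position's output" idea, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the classification head here is a hypothetical single linear layer, not part of the pre-trained model.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT is a bidirectional encoder.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, seq_len, hidden). The first position corresponds
# to the [CLS] token; its vector serves as the sentence summary.
cls_vector = outputs.last_hidden_state[:, 0, :]            # shape: (1, 768)

# Hypothetical classification head on top of the [CLS] vector.
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # 2 = num labels
logits = classifier(cls_vector)                            # shape: (1, 2)
```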
BERT training consists of two parts: Pre-training and Fine-Tuning
1. Pre-training Procedure
Task #1: Masked LM (MLM)
- Randomly select tokens in the given sentence (15% of them) and train the model to predict the originals
- but with these rules for each selected token (see the sketch after this list):
- 80% of the time: replace the token with [MASK]
- 10% of the time: replace the token with a random token
  -> to perturb our model to be more robust
- 10% of the time: leave the token unchanged
- not appropriate for generative tasks but good for classification ones
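A minimal sketch of the masking rule above: 15% of tokens are selected for prediction, and each selected token is then handled by the 80/10/10 rule. The [MASK] id and vocabulary size are illustrative assumptions (bert-base-uncased values), not a real tokenizer.

```python
import random

MASK_ID = 103        # [MASK] id in bert-base-uncased's vocab (assumption)
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size (assumption)

def mask_tokens(token_ids, select_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(inputs)      # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:
            labels[i] = tok            # the model must predict the original
            r = random.random()
            if r < 0.8:                # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:              # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: leave the token unchanged
    return inputs, labels

masked_ids, labels = mask_tokens([2023, 2003, 1037, 7099, 6251])
```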
Task #2: Next Sentence Prediction (NSP)
- When two sentences (A, B) are given,
- BERT is trained to predict whether sentence B actually follows sentence A
- This helps BERT perform well on QA tasks, where the relationship between two sentences is important (see the sketch below)
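A minimal sketch of how NSP training pairs could be built, following the 50/50 scheme from the BERT paper: half of the time B is the actual next sentence (IsNext), half of the time B is a random sentence from the corpus (NotNext). The corpus here is a toy list used purely for illustration.

```python
import random

corpus = [
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless birds .",
]

def make_nsp_pair(corpus, idx):
    """Return (sentence_a, sentence_b, label) with label 0 = IsNext, 1 = NotNext."""
    sent_a = corpus[idx]
    if random.random() < 0.5 and idx + 1 < len(corpus):
        return sent_a, corpus[idx + 1], 0      # actual next sentence
    # a real implementation would avoid sampling the true next sentence here
    return sent_a, random.choice(corpus), 1    # random sentence

sent_a, sent_b, label = make_nsp_pair(corpus, 0)
```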
- To embed two sentences together, BERT combines three embeddings (summed element-wise; see the sketch after this list):
- Token Embedding: the original word (WordPiece) embedding, including the special tokens [CLS] and [SEP]
- Position Embedding: encodes position information as $E_1, E_2, \dots$
- Segment Embedding: separates the two sentences as $E_A, E_B$
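A minimal sketch of how the three embeddings combine, assuming PyTorch; the sizes are bert-base-like values and the ids are toy numbers. Each token's input representation is the element-wise sum of its token, segment, and position embeddings (real BERT also applies LayerNorm and dropout to this sum).

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # bert-base-like sizes

token_emb = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
segment_emb = nn.Embedding(2, hidden)            # E_A = segment 0, E_B = segment 1
position_emb = nn.Embedding(max_len, hidden)     # learned E_1, E_2, ...

# "[CLS] <sentence A> [SEP] <sentence B> [SEP]" as toy ids
input_ids = torch.tensor([[101, 2023, 2003, 102, 2009, 2001, 102]])
segment_ids = torch.tensor([[0,    0,    0,   0,   1,    1,    1]])
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
# shape: (1, 7, 768), fed into the first encoder layer
```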
2. Fine-Tuning
- Slightly adjusts the pre-trained weights to fit the given task;
  for example, add a classification layer on top of BERT for a sentence classification problem (see the sketch below).
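A minimal fine-tuning sketch, assuming the Hugging Face transformers library: BertForSequenceClassification adds a linear classifier on top of the pooled [CLS] output, and all weights (the pre-trained encoder plus the new head) are updated. The two example sentences and labels are toy data.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["a great movie", "a dull movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])             # toy sentiment labels

outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()                   # one fine-tuning step
optimizer.step()
```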