[WIP🚧] Several BERT Variant Usages in Sentence Pair Tasks


Many NLP tasks can be modeled in the form of sentence pair tasks. For example:

Natural Language Inference example:

A: Two men on bicycles competing in a race.

B: Men are riding bicycles on the street.

In the literature (Lan and Xu, 2018), sentence pair models fall into two families:

Sentence Encoding

e.g., SSE, InferSent

a = sent_encode(A)
b = sent_encode(B)
mlp(a, b)
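
The sentence-encoding recipe above can be sketched in PyTorch. This is a minimal sketch, not any specific paper's implementation: `SentPairClassifier` is a hypothetical name, `encoder` stands in for any `sent_encode` module, and the `[a; b; |a-b|; a*b]` feature combination follows the InferSent-style matching layer.

```python
import torch
import torch.nn as nn

class SentPairClassifier(nn.Module):
    """Sentence-encoding approach: encode A and B independently into
    fixed vectors, then classify their combination (InferSent-style)."""
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder  # any sent_encode module: input -> (batch, hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, input_a, input_b):
        a = self.encoder(input_a)           # (batch, hidden_dim)
        b = self.encoder(input_b)           # (batch, hidden_dim)
        # concatenation, absolute difference, and element-wise product
        feats = torch.cat([a, b, torch.abs(a - b), a * b], dim=-1)
        return self.mlp(feats)
```

Note that the two sentences share one encoder; the interaction between A and B happens only at the MLP, after each sentence is already compressed to a single vector.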

Interaction - Aggregation (e.g., ESIM)


a = context_encode(A)
b = context_encode(B)
a', b' = interact(a, b)
mlp(a', b')
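
The `interact(a, b)` step can be sketched as the soft-alignment attention used in ESIM. This is a simplified, unbatched sketch under the assumption of single-sentence tensors without masking; the `[x; x~; x-x~; x*x~]` enhancement follows ESIM's local inference composition.

```python
import torch
import torch.nn.functional as F

def interact(a, b):
    """Soft-align each token of A against B and vice versa (ESIM-style).
    a: (len_a, dim), b: (len_b, dim)."""
    scores = a @ b.t()                       # (len_a, len_b) token similarities
    a_tilde = F.softmax(scores, dim=1) @ b   # each A token as a mixture of B tokens
    b_tilde = F.softmax(scores.t(), dim=1) @ a
    # enhance each token with its aligned counterpart
    a_prime = torch.cat([a, a_tilde, a - a_tilde, a * a_tilde], dim=-1)
    b_prime = torch.cat([b, b_tilde, b - b_tilde, b * b_tilde], dim=-1)
    return a_prime, b_prime
```

Unlike the sentence-encoding family, here A and B exchange information token by token before any pooling, which is why this family usually wins on matching-heavy tasks.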

Applying Contextualized Embeddings (BERT / ELMo) to Sentence Pair Tasks

What is the best configuration for using BERT in sentence pair models?

BERT Classic

tokens = [CLS] + tokens_A + [SEP] + tokens_B + [SEP]
bert_hiddens = bert(tokens)
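
The packing step of BERT Classic can be made concrete. A sketch of the input construction (the helper name `build_bert_classic_input` is mine); segment ids are the `token_type_ids` BERT uses to tell the two sentences apart:

```python
def build_bert_classic_input(tokens_a, tokens_b):
    """Pack a sentence pair into one BERT sequence.
    Segment id 0 covers [CLS] + A + [SEP]; segment id 1 covers B + [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

Because both sentences live in one sequence, BERT's self-attention lets every A token attend to every B token in every layer, i.e., the interaction is built in.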

BERT SentEncoder

a = [CLS] + tokens_A + [SEP]
b = [CLS] + tokens_B + [SEP]
bert_hiddens_a = bert(a)
bert_hiddens_b = bert(b)
mlp(bert_hiddens_a[0], bert_hiddens_b[0])

BERT SentInteractor

a = [CLS] + tokens_A + [SEP]
b = [CLS] + tokens_B + [SEP]
bert_hiddens_a = bert(a)
bert_hiddens_b = bert(b)
a', b' = interact(bert_hiddens_a, bert_hiddens_b)
mlp(a', b')

Some Variants:

  1. [CLS] or max pooling?
  2. Fine-tune BERT or keep it fixed?
  3. For fixed BERT, use the last layer or a (weighted avg. of) all layers?
  4. For BERT SentEncoder and BERT SentInteractor, if it is not feasible to use separate BERT encoders, is it better to use separate transform layers to learn something about the asymmetry of the sentence pair?
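
Variants 1 and 3 above can be sketched concretely. These are minimal sketches under assumed tensor shapes, not a fixed recipe; `ScalarMix` is the ELMo-style learned weighted average of layers, applied here to a fixed BERT's hidden states.

```python
import torch
import torch.nn as nn

def cls_pool(hiddens):
    """Variant 1a: take the hidden state at the [CLS] position.
    hiddens: (batch, seq_len, dim)."""
    return hiddens[:, 0]

def max_pool(hiddens):
    """Variant 1b: element-wise max over the token dimension."""
    return hiddens.max(dim=1).values

class ScalarMix(nn.Module):
    """Variant 3: learned softmax-weighted average over all layers
    of a fixed encoder (ELMo-style scalar mix)."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_hiddens):
        # layer_hiddens: (num_layers, batch, seq_len, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * layer_hiddens).sum(dim=0)
```

For variant 2, fixing BERT amounts to wrapping the encoder call in `torch.no_grad()` (or setting `requires_grad_(False)` on its parameters) so only the layers on top are trained.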


| Model | MRPC (F1/acc) |
| --- | --- |
| BERT-Base | 88.9/84.8 |
| BERT-Large | 89.3/85.4 |
| BERT-Base (reproduced) | 88.5/84.3 |
| BERT-Large (reproduced) | 89.3/85.2 |
| BERT-Sep-ESIM-MLP | 80.3/71.9 |
| BERT-Sep-ESIM-Dist | 78.5/68.9 |
| BERT-Sep-ESIM-Dist-fix | 83.4/76.4 |