|
# LOGIC PRE-TRAINING OF LANGUAGE MODELS |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Pre-trained language models (PrLMs) have been shown useful for enhancing |
|
a broad range of natural language understanding (NLU) tasks. However, the |
|
capacity to capture logic relations in challenging NLU still remains a bottleneck even for state-of-the-art PrLM enhancements, which greatly limits their reasoning abilities. We therefore propose logic pre-training of language models, leading to PROPHET, a PrLM equipped with logical reasoning ability. To let logic pre-training operate on a clear, accurate, and generalized knowledge basis, we introduce the fact instead of the plain language unit used in previous PrLMs. Facts are extracted through syntactic parsing, avoiding unnecessarily complex knowledge injection, and enable logic-aware models to be trained on more general text. To explicitly guide the PrLM to capture logic relations, three pre-training objectives are introduced: 1) logical connectives masking to capture sentence-level logical relations, 2) logical structure completion to accurately capture facts from the original
|
context, 3) logical path prediction on a logical graph to uncover global logic |
|
relationships among facts. We evaluate our model on a broad range of NLP and |
|
NLU tasks, including natural language inference, relation extraction, and machine |
|
reading comprehension with logical reasoning. Results show that the extracted facts and the newly introduced pre-training tasks help PROPHET achieve significant improvements on all the downstream tasks, especially those related to logical reasoning.
|
|
|
1 INTRODUCTION |
|
|
|
Machine reasoning in natural language understanding (NLU) aims to teach machines to understand |
|
human languages by building and analyzing the connections between the facts, events, and |
|
observations using logical analysis techniques like deduction and induction, which is one of the |
|
ultimate goals towards human-parity intelligence. Although pre-trained language models (PrLMs), |
|
such as BERT (Devlin et al., 2018), GPT (Radford et al., 2018), XLNet (Yang et al., 2019) and |
|
RoBERTa (Liu et al., 2019), have established state-of-the-art performance on various aspects in NLU, |
|
they still fall short in complex language understanding tasks that involve reasoning (Helwe et al., 2021). The major reason is that they are insufficiently capable of capturing logic relations
|
such as negation (Kassner & Schütze, 2019), factual knowledge (Poerner et al., 2019), events (Rogers |
|
et al., 2020), and so on. Many previous studies (Sun et al., 2021; Xiong et al., 2019; Wang et al., |
|
2020) are then motivated to inject knowledge into pre-trained models like BERT and RoBERTa. |
|
However, they rely too heavily on massive external knowledge sources and ignore that language itself is a natural knowledge carrier that serves as the basis for acquiring logical reasoning ability (Ouyang et al., 2021).
|
Taking the context in Figure 1 as an example, previous approaches tend to focus on entities such as |
|
the definition of "government" and the concepts related to it like "governor", but overlook the exact |
|
relations inherent in this example, thus failing to model the complex reasoning process. |
|
|
|
Given the fact that PrLMs are the key supporting components in natural language understanding, |
|
in this work, we propose a fundamental solution by empowering the PrLMs with the capacity of |
|
capturing logic relations, which is necessary for logical reasoning. However, logical reasoning can |
|
only be implemented on the basis of clear, accurate, and generalized knowledge. Therefore, we |
|
leverage the fact as the conceptual knowledge unit serving as the basis for logic relation extraction. A fact is organized as a triplet, i.e., in the form of a predicate-argument structure, to represent meanings such
|
as "who-did-what-to-whom" and "who-is-what". Compared with existing studies that inject complex |
|
knowledge like knowledge graphs, the knowledge structure based on fact is far less complicated and |
|
more general in representing events and relations in languages. |
|
|
|
|
|
|
|
|
On top of the fact-based knowledge structure, we present PROPHET, a logic-aware pre-trained |
|
language model that learns logical relations in a universal way from large-scale texts. In detail,
|
we introduce three novel pre-training objectives based on the newly introduced knowledge structure |
|
basis, the fact: 1) logical connectives masking for learning sentence-level logical connections; 2) a logical structure completion task on top of facts for regularization, aligning extracted facts with the original context; 3) logical path prediction to capture the logical relationships between facts. PROPHET is
|
evaluated on a broad range of language understanding tasks: natural language inference, semantic |
|
similarity, machine reading comprehension, etc. Experimental results show that the fact is useful |
|
as a carrier for knowledge modeling, and that the newly introduced pre-training tasks improve PROPHET, yielding significant performance gains on downstream tasks.[1]
|
|
|
2 RELATED WORK |
|
|
|
2.1 PRE-TRAINED LANGUAGE MODELS IN NLP |
|
|
|
Large pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Radford et al., 2018) |
|
have brought dramatic empirical improvements on almost every NLP task in the past few years. |
|
A classical paradigm of pre-training is to train neural models on a large corpus with self-supervised pre-training objectives. "Self-supervised" means that the supervision provided in the training process is automatically generated from the raw text rather than created manually. Designing effective
|
criteria for language modeling is one of the major topics in training pre-trained models, which decides |
|
how the model captures the knowledge from large-scale unlabeled data. The most popular pre-training |
|
objective used today is masked language modeling (MLM), initially used in BERT (Devlin et al., |
|
2018), which randomly masks out tokens that the model is then asked to recover given the surrounding
|
context. Recent studies have investigated diverse variants of denoising strategies (Raffel et al., 2020; |
|
Lewis et al., 2020), model architecture (Yang et al., 2019), and auxiliary objectives (Lan et al., |
|
2019; Joshi et al., 2020) to improve the model strength during pre-training. Although the existing |
|
techniques have shown effectiveness in capturing syntactic and semantic information after large-scale |
|
pre-training, these models are sensitive to role reversal and struggle with pragmatic inference and role-based event knowledge (Rogers et al., 2020), both critical to the ultimate goal of complex reasoning that requires uncovering logical structures. However, it is difficult for pre-trained language
|
models to capture the logical structure inherent in the texts since logical supervision is rarely available |
|
during pre-training. Therefore, we are motivated to explicitly guide the model to capture such clues |
|
via our newly introduced self-supervised tasks. |
|
|
|
2.2 REASONING ABILITY FOR PRE-TRAINED LANGUAGE MODELS |
|
|
|
There is a long line of research on enhancing the reasoning abilities of pre-trained language
|
models via injecting knowledge. The existing approaches mainly design novel pre-training objectives |
|
and leverage abundant knowledge sources such as WordNet (Miller, 1995). |
|
|
|
Notably, ERNIE 3.0 (Sun et al., 2021) uses a broad range of pre-training objectives from word-aware, |
|
structure-aware to knowledge-aware tasks, based on a 4TB corpus consisting of plain texts and a |
|
large-scale knowledge graph. WKLM (Xiong et al., 2019) replaces entity mentions in the document |
|
with other entities of the same type, and the objective is to distinguish the replaced entity from the |
|
original ones. KEPLER (Wang et al., 2021b) encodes textual entity descriptions using embeddings |
|
from a PrLM to take full advantage of the abundant textual information. K-Adapter (Wang et al., 2020) |
|
designs neural adapters to distinguish the type of knowledge sources to capture various knowledge. |
|
|
|
Our proposed method differs from previous studies in three aspects. Firstly, our model does not |
|
require any external knowledge resources like previous methods that use WordNet, WikiData, etc. |
|
We only use small-scale textual sources following the standard PrLMs like BERT (Devlin et al., |
|
2018), along with an off-the-shelf dependency parser to extract facts. Secondly, previous works only |
|
consider triplet-level pre-training objectives. We propose a multi-granularity pre-training strategy,
|
considering not only triplet-level information but also sentence-level and global knowledge to enhance |
|
logic reasoning. Finally, we propose a new training mechanism apart from masked language modeling |
|
(MLM), hoping to shed light on more logic pre-training strategies in this research line. |
|
|
|
1 Our code has been uploaded as supplemental material and will be made public after the double-blind review period.
|
|
|
|
|
[Figure 1 appears here: the example text "Despite concerns, anarchists participated in the Russian Revolution in opposition to the White movement. However, they met harsh suppression after the Bolshevik government was stabilized.", the facts extracted from it (⟨anarchists, participated, revolution⟩, ⟨revolution, opposite, movement⟩, ⟨they, met, suppression⟩, ⟨suppression, after, stabilized⟩, ⟨government, was, stabilized⟩), and the logical graph built over these facts with additional same and coref edges.]
|
Figure 1: How the facts and the logical graph are constructed from raw text inputs. Edges in red denote additional edges added to the logical graph, while text in green indicates the sentence-level logical connectives that will be discussed in §4.
|
|
|
3 PRELIMINARIES |
|
|
|
In this section, we introduce the concepts of fact and logical graph, which form the basis of PROPHET. We also describe how facts are extracted for logical graph construction, with an example shown in Figure 1.
|
|
|
3.1 FACT |
|
|
|
Following Nakashole & Mitchell (2014) and Ouyang et al. (2021), we extract facts which are triplets |
|
represented as $T = \{A_1, P, A_2\}$, where $A_1$ and $A_2$ are the arguments and $P$ is the predicate between them. This form can represent a broad range of facts, reflecting notions such as "who-did-what-to-whom" and "who-is-what".
|
|
|
We extract such facts in a syntactic way, which makes our approach generic and easy to apply. Given |
|
a document, we first split the document into multiple sentences. For each sentence, we conduct |
|
dependency parsing using StanfordCoreNLP (Manning et al., 2014).[2] For the analyzed dependencies, |
|
basically, we consider verb phrases and some prepositions in the sentences as "predicates", and then |
|
we search for their corresponding actors and actees as the "arguments". |
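To make the procedure concrete, below is a minimal sketch of syntactic fact extraction. It assumes spaCy as the dependency parser in place of the StanfordCoreNLP pipeline used in the paper, and the argument-search heuristics (subject/object/prepositional-object labels) are simplified illustrations rather than the exact rules.

```python
# A minimal sketch of syntactic fact extraction, assuming spaCy instead of
# StanfordCoreNLP; the dependency labels used for argument search are a
# simplification of the "actor/actee" rules described in the text.
# requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facts(document: str):
    """Return (argument, predicate, argument) triplets per sentence."""
    facts = []
    for sent in nlp(document).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            # the verb acts as the predicate
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            # also look inside prepositional phrases attached to the verb
            for prep in (c for c in token.children if c.dep_ == "prep"):
                objects.extend(c for c in prep.children if c.dep_ == "pobj")
            for a1 in subjects:
                for a2 in objects:
                    facts.append((a1.text, token.lemma_, a2.text))
    return facts

print(extract_facts("Anarchists participated in the Russian Revolution."))
# e.g. [('Anarchists', 'participate', 'Revolution')]
```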
|
|
|
3.2 LOGICAL GRAPH |
|
|
|
A logical graph is an undirected (but not necessarily connected) graph that represents logical dependency relations between components of facts. In logical graphs, nodes represent arguments/predicates in the facts, and edges indicate whether two nodes are related within a fact. Such a structure can well unveil and organize the semantic information captured by facts. Besides, a logical graph supports reasoning over long-range dependencies by connecting arguments and their relations in different facts across different spans.
|
|
|
We further show how to construct such graphs based on facts. In addition to the relations given in facts, we design another two types of edges based on identical mentions and coreference information. (1) There can be identical mentions in different sentences, resulting in repeated nodes across facts. We connect nodes corresponding to the same non-pronoun arguments by edges of type same. (2) We conduct coreference resolution on the context using an off-the-shelf model to identify arguments in facts that refer to the same entity.[3] We add edges of type coref between them. The final logical graph is denoted as $\mathcal{S} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = A_i \cup P$ with $i \in \{1, 2\}$.
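A minimal sketch of this graph construction is shown below, assuming NetworkX for the graph and externally supplied coreference clusters (e.g., from neuralcoref). Keying nodes by surface form and the small pronoun list are simplifying assumptions, not the paper's exact implementation.

```python
# A minimal sketch of logical graph construction, assuming facts are already
# extracted as (arg1, predicate, arg2) triplets and coreference clusters come
# from an external resolver; both inputs are illustrative.
import itertools
import networkx as nx

PRONOUNS = {"he", "she", "it", "they", "we", "i", "you"}

def build_logical_graph(facts, coref_clusters=()):
    graph = nx.Graph()
    for a1, pred, a2 in facts:
        graph.add_edge(a1, pred, type="fact")   # argument-predicate edges
        graph.add_edge(pred, a2, type="fact")
    arguments = {a for a1, _, a2 in facts for a in (a1, a2)}
    # (1) "same" edges: identical non-pronoun arguments (here, same surface form
    # up to casing; exact-match mentions already share a node in this sketch)
    for u, v in itertools.combinations(arguments, 2):
        if u.lower() == v.lower() and u.lower() not in PRONOUNS:
            graph.add_edge(u, v, type="same")
    # (2) "coref" edges: arguments the resolver places in one cluster
    for cluster in coref_clusters:
        for u, v in itertools.combinations(cluster, 2):
            if u in arguments and v in arguments:
                graph.add_edge(u, v, type="coref")
    return graph

facts = [("anarchists", "participated", "revolution"),
         ("they", "met", "suppression")]
g = build_logical_graph(facts, coref_clusters=[("anarchists", "they")])
print(g.edges(data=True))
```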
|
|
|
2 https://stanfordnlp.github.io/CoreNLP/; we also tried to use OpenIE directly; however, the performance is not satisfactory.

3 https://github.com/huggingface/neuralcoref.
|
|
|
|
|
[Figure 2 appears here: the text encoder takes the text, an extracted fact unit (e.g., ⟨suppression, after, stabilized⟩), and randomly sampled node pairs from the logical graph (e.g., ⟨revolution, government⟩) as input for the three pre-training objectives.]
|
Figure 2: An illustration of the pre-training methods used in PROPHET. The model takes the text, the extracted facts, and randomly sampled node pairs from the logical graph as input. The model is pre-trained with three novel objectives: the standard masked language modeling applied to sentence-level connectives, fact alignment, and logical path prediction.
|
|
|
4 PROPHET |
|
|
|
4.1 MODEL ARCHITECTURE |
|
|
|
We follow BERT (Devlin et al., 2018) and use a multi-layer bidirectional Transformer (Vaswani |
|
et al., 2017) as the model architecture of PROPHET. To keep the focus on the newly introduced techniques, we do not review the ubiquitous Transformer architecture in detail. We develop PROPHET using exactly the same model architecture as BERT-base: 12 Transformer layers, a hidden size of 768, 12 attention heads, and 110M parameters in total.
|
|
|
4.2 LOGIC-AWARE PRE-TRAINING TASKS |
|
|
|
We describe three pre-training tasks used for pre-training PROPHET in this section. Figure 2 is an |
|
illustration of PROPHET pre-training. The first task is logical connectives masking (LCM) generalized |
|
from masked language modeling (Devlin et al., 2018) for logical connectives to learn sentence-level |
|
representation. The second task is logical structure completion (LSC) for learning the logical relationships inside a fact, where we first randomly mask arguments in facts and then predict those items. Finally,
|
a logical path prediction (LPP) task is proposed for recognizing the logical relations of randomly |
|
selected node pairs. |
|
|
|
**Logical Connective Masking** Logical connective masking is an extension of the masked language |
|
modeling (MLM) pre-training objective in Devlin et al. (2018), with a particular focus on connective |
|
indication tokens. We use the Penn Discourse TreeBank 2.0 (PDTB) (Prasad et al., 2008) to draw |
|
the logical relations among sentences. Specifically, PDTB 2.0 contains relations that are manually |
|
annotated on the 1 million Wall Street Journal (WSJ) corpus and are broadly characterized into |
|
"Explicit" and "Implicit" connectives. We use the "Explicit" type (in total 100 such connectives), |
|
which apparently presents in sentences such as discourse adverbial "instead" or subordinating |
|
conjunction "because". Taking all the identified connectives and some randomly sampled other |
|
tokens (for a total 15% of the tokens of the original context), we replace them with a [MASK] token |
|
80% of the time, with a random token 10% of the time and leave them unchanged 10% of the time. |
|
The MLM objective is to predict the original tokens of these sampled tokens, which has proven |
|
effective in previous works (Devlin et al., 2018; Liu et al., 2019). In this way, the model learns |
|
to recover the logical relations for two given sentences, which helps language understanding. The |
|
objective of this task is denoted as $\mathcal{L}_{conn}$.
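As a concrete illustration of this masking scheme, the following is a minimal sketch (not the paper's actual implementation). The 15% budget and the 80/10/10 replacement scheme follow the text; the connective list, toy vocabulary, and word-level tokenization are illustrative assumptions.

```python
# A minimal sketch of logical connective masking over a word-level token list.
import random

EXPLICIT_CONNECTIVES = {"however", "because", "instead", "after", "although"}
VOCAB = ["the", "government", "was", "stabilized", "however", "they", "met"]

def logical_connective_masking(tokens, mask_ratio=0.15):
    """Return (corrupted_tokens, labels); labels are None for untouched positions."""
    is_connective = [t.lower() in EXPLICIT_CONNECTIVES for t in tokens]
    n_targets = max(1, int(mask_ratio * len(tokens)))
    # prioritize connective positions, then fill up with random other tokens
    targets = [i for i, c in enumerate(is_connective) if c]
    others = [i for i, c in enumerate(is_connective) if not c]
    random.shuffle(others)
    targets = (targets + others)[:n_targets]

    corrupted, labels = list(tokens), [None] * len(tokens)
    for i in targets:
        labels[i] = tokens[i]                      # predict the original token
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(VOCAB)    # 10%: random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

print(logical_connective_masking(
    "However they met harsh suppression after the government was stabilized".split()))
```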
|
|
|
**Logical Structure Completion** To align representation between the context and the extracted fact, |
|
we introduce a pre-training task of logical structure completion. The motivation here is to encourage |
|
|
|
|
|
|
|
|
the model to learn structure-aware representations that encode "who-did-what-to-whom"-like meanings for better language understanding. In detail, we randomly select a specific proportion λ of the total facts (λ = 20% in this work) from a given context. For each chosen fact, we ask the model to complete either "Argument-Predicate-?" or "Argument-?-Argument" (the two templates are selected with equal probability). We denote the blanks to be completed as $m^a$ and $m^p$ for arguments and predicates, respectively. In our implementation, this objective takes the same form as masked language modeling for simplicity, using the original loss following Devlin et al. (2018).
|
|
|
|
|
$$\mathcal{L}_{align} = -\sum_{i \in a \cup p} \log D(x_i \mid m^a, m^p), \qquad (1)$$
|
|
|
|
|
where $D$ is the discriminator that predicts a token from a large vocabulary.
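A minimal sketch of how the completion targets could be constructed is shown below, assuming facts are available as triplets; the serialization into "[MASK]"-style strings is an illustrative simplification of feeding the two templates to the MLM head.

```python
# A minimal sketch of logical structure completion target construction.
# The 20% sampling ratio and the two templates follow the text above.
import random

def build_lsc_targets(facts, mask_ratio=0.2):
    """For a sampled subset of facts, mask either the second argument or the
    predicate, returning (masked_fact_string, answer_token) pairs."""
    n_chosen = max(1, int(mask_ratio * len(facts)))
    targets = []
    for a1, pred, a2 in random.sample(facts, n_chosen):
        if random.random() < 0.5:
            # "Argument-Predicate-?": recover the second argument (m^a)
            targets.append((f"{a1} {pred} [MASK]", a2))
        else:
            # "Argument-?-Argument": recover the predicate (m^p)
            targets.append((f"{a1} [MASK] {a2}", pred))
    return targets

facts = [("anarchists", "participated", "revolution"),
         ("government", "was", "stabilized")]
print(build_lsc_targets(facts))
```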
|
|
|
**Logical Path Prediction** To learn representation from the constructed logical graph, thus endowing |
|
the model with global logical reasoning ability, we propose the pre-training task of predicting whether |
|
there exists a path between two selected nodes in the logical graph. In this way, the model learns to |
|
look at logical relations across a long distance of arguments and predicates in different facts. |
|
|
|
We randomly sample 20% of the nodes from the logical graph to form a set $V'$, which yields $C_{|V'|}^{2}$ node pairs in total. We set a maximum number $max_p$ of node pairs to predict. To avoid bias in the training process, we try to ensure that $\frac{max_p}{2}$ of the pairs are positive samples and the rest are negative samples, thus balancing the positive-negative ratio. If the number of positive/negative samples is less than $\frac{max_p}{2}$, we simply keep the original pairs. Formally, the pre-training objective of this task is calculated as below, following Guo et al. (2020):
|
|
|
|
|
|
|
$$\mathcal{L}_{Path} = -\sum_{v_i, v_j \in V'} \big[\delta \log \sigma[v_i, v_j] + (1-\delta) \log(1 - \sigma[v_i, v_j])\big], \qquad (2)$$
|
|
|
|
|
where $\delta$ is 1 when $v_i$ and $v_j$ are connected by a path and 0 otherwise, and $[v_i, v_j]$ denotes the concatenation of the representations of $v_i$ and $v_j$.
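A minimal sketch of the balanced node-pair sampling described above, assuming a NetworkX logical graph; `max_pairs` and the 20% node-sampling ratio follow the text, while the rest is illustrative.

```python
# A minimal sketch of balanced node-pair sampling for logical path prediction.
import random
import itertools
import networkx as nx

def sample_path_pairs(graph: nx.Graph, node_ratio=0.2, max_pairs=32):
    k = max(2, int(node_ratio * graph.number_of_nodes()))
    nodes = random.sample(list(graph.nodes), min(k, graph.number_of_nodes()))
    # label every candidate pair by whether a path exists in the logical graph
    labeled = [((u, v), int(nx.has_path(graph, u, v)))
               for u, v in itertools.combinations(nodes, 2)]
    positives = [p for p in labeled if p[1] == 1]
    negatives = [p for p in labeled if p[1] == 0]
    random.shuffle(positives)
    random.shuffle(negatives)
    # aim for a half/half split; if one side is short, keep whatever is available
    half = max_pairs // 2
    return positives[:half] + negatives[:half]
```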
|
|
|
The final training objective is the weighted sum of the above-mentioned three losses:

$$\mathcal{L} = \mathcal{L}_{conn} + \mathcal{L}_{align} + \mathcal{L}_{Path}. \qquad (3)$$
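The following is a minimal sketch of how Eq. (2) and Eq. (3) could be computed, assuming PyTorch; the node representations, the scoring head, and the other two loss terms are illustrative placeholders rather than the paper's implementation.

```python
# A minimal sketch of the logical path prediction loss (Eq. 2) and the summed
# objective (Eq. 3); the hidden size and scoring head are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 768
scorer = nn.Linear(2 * hidden, 1)  # scores the concatenation [v_i, v_j]

def path_prediction_loss(node_pairs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """node_pairs: (B, 2, H) representations of v_i and v_j; labels: (B,) with 0/1."""
    concat = node_pairs.flatten(start_dim=1)    # [v_i, v_j] concatenation
    logits = scorer(concat).squeeze(-1)         # sigma is applied inside the BCE loss
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# Eq. (3): equal-weight sum of the three objectives (loss_conn and loss_align
# would come from the masking and completion heads, omitted here)
# loss = loss_conn + loss_align + path_prediction_loss(pairs, labels)
```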
|
|
|
4.3 PRE-TRAINING DETAILS |
|
|
|
We use English Wikipedia (1.1 million articles in total) and sample training and validation sets with a split ratio of 19:1 on the original data. We omit the "Reference" and "Literature" sections of each document to ensure data quality. Following previous practice (Devlin et al., 2018), we limit the sequence length in each batch to 512 tokens and set the batch size to 128. We use Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 1e−6, and the weight decay is set to 0.01. We pre-train our model for 500k steps. We use 8 NVIDIA V100 32G GPUs, with FP16 and DeepSpeed for training acceleration. Initialized with the pre-trained weights of BERT-base, we continue training our model for 200k steps.
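A minimal sketch of the optimization setup described above, assuming PyTorch and Hugging Face Transformers: the beta/epsilon/weight-decay values and the BERT-base starting checkpoint follow the text, while the learning rate is not specified there and is purely illustrative; the scheduler, data pipeline, and the custom pre-training heads are omitted.

```python
# A minimal sketch of the optimizer configuration; lr is illustrative.
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # continue from BERT-base
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # illustrative; not given in the text
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)
```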
|
|
|
5 EXPERIMENTS |
|
|
|
5.1 TASKS AND DATASETS |
|
|
|
Our experiments are conducted on a broad range of language understanding tasks, including natural |
|
language inference, machine reading comprehension, semantic similarity, and text classification. |
|
Some of these tasks are a part of GLUE (Wang et al., 2018) benchmark. We also extend our |
|
experiments to DocRED (Yao et al., 2019), a widely used benchmark of document-level relation |
|
extraction, to test generalizability. To verify our model's logical reasoning ability, we perform
|
experiments on two recent logical reasoning datasets in the form of machine reading comprehension, |
|
ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020). |
|
|
|
|
|
|
|
|
| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *In literature* | | | | | | | | | |
| BERT-base | 52.1 | 93.5 | 84.6/83.4 | 90.5 | 66.4 | 88.9 | 71.2 | 85.8 | 79.6 |
| SemBERT-base | 57.8 | 93.5 | 84.4/84.0 | 90.9 | 69.3 | 88.2 | 71.8 | 87.3 | 80.8 |
| *Our implementation* | | | | | | | | | |
| BERT-base | 53.6 | 93.5 | 84.6/83.4 | 90.9 | 66.6 | 88.6 | 71.2 | 85.8 | 79.8 |
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |
|
|
|
Table 1: Leaderboard results on the GLUE benchmark. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks.
|
|
|
| Model | ReClor Dev | ReClor Test | ReClor Test-E | ReClor Test-H | LogiQA Dev | LogiQA Test |
|---|---|---|---|---|---|---|
| Human Performance* | - | 63.0 | 57.1 | 67.2 | - | 86.0 |
| *In literature* | | | | | | |
| FOCAL REASONER (Ouyang et al., 2021) | 78.6 | 73.3 | 86.4 | 63.0 | 47.3 | 45.8 |
| LReasoner (Wang et al., 2021a) | 74.6 | 71.8 | 83.4 | 62.7 | 45.8 | 43.3 |
| DAGN (Huang et al., 2021) | 65.8 | 58.3 | 75.9 | 44.5 | 36.9 | 39.3 |
| BERT-large (Devlin et al., 2018) | 53.8 | 49.8 | 72.0 | 32.3 | 34.1 | 31.0 |
| XLNet-large (Yang et al., 2019) | 62.0 | 56.0 | 75.7 | 40.5 | - | - |
| RoBERTa-large (Liu et al., 2019) | 62.6 | 55.6 | 75.5 | 40.0 | 35.0 | 35.3 |
| DeBERTa-large (He et al., 2020) | 74.4 | 68.9 | 83.4 | 57.5 | 44.4 | 41.5 |
| *Our implementation* | | | | | | |
| BERT-base | 51.2 | 47.3 | 71.6 | 28.2 | 33.8 | 32.1 |
| PROPHET | 53.4 | 48.8 | 72.4 | 32.2 | 35.2 | 34.1 |
|
|
|
Table 2: Accuracy on the ReClor and LogiQA datasets. The published methods are based on large models.
|
|
|
5.2 RESULTS |
|
|
|
Table 1 shows results on the GLUE benchmark datasets. We have the following observations from |
|
the above results. |
|
|
|
(1) PROPHET obtains substantial gains over the BERT baseline (continually trained for 200K steps
|
for a fair comparison), indicating that our model can work well in a general sense of language |
|
understanding. |
|
|
|
(2) PROPHET performs particularly well on language inference tasks including MNLI, QNLI, and |
|
RTE,[4] which indicates our model's reasoning ability.
|
|
|
(3) On both large-scale datasets such as QQP and MNLI and small datasets like CoLA and STS-B,
|
our model demonstrates a consistent improvement, indicating its robustness. |
|
|
|
(4) From Table 2, we can see that PROPHET improves the logical reasoning ability of the BERT baseline by a large margin. In particular, armed with our approach, the BERT-base results on the two datasets are comparable to or even surpass the BERT-large results.
|
|
|
In addition, we conducted experiments on a large-scale human-annotated dataset for document-level |
|
relation extraction (Yao et al., 2019). The results are shown in Table 3.[5] From the table, we can see |
|
that PROPHET also performs well on document-level relation extraction, outperforming the baseline
|
|
|
4We exclude the problematic WNLI set. |
|
5We only report the results for Ign F1 in the annotated setting as the distant supervision is too slow to train. |
|
|
|
|
|
|
|
|
| Model | Dev F1 | Dev Intra-F1 | Dev Inter-F1 | Test F1 |
|---|---|---|---|---|
| BERT-base* (Devlin et al., 2018) | 54.2 | 61.6 | 47.2 | 53.2 |
| Two-Phase BERT* (Wang et al., 2019) | 54.4 | 61.8 | 47.3 | 53.9 |
| PROPHET | 54.8 (↑0.6) | 62.4 (↑0.8) | 47.5 (↑0.3) | 54.3 (↑1.1) |
|
|
|
Table 3: Main results on the dev and test sets of DocRED. * indicates that the results are taken from Nan et al. (2020). Intra- and Inter-F1 indicate F1 scores for intra- and inter-sentence relations, following the setting of Nan et al. (2020).
|
|
|
| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |
| w/o LCM | 53.6 | 93.5 | 85.1/84.0 | 90.9 | 68.2 | 88.6 | 71.2 | 85.8 | 80.1 |
| w/o LSC | 53.6 | 93.6 | 85.0/84.1 | 91.3 | 69.0 | 88.9 | 71.4 | 85.9 | 80.3 |
| w/o LPP | 52.1 | 93.0 | 84.6/83.4 | 90.9 | 66.4 | 88.6 | 71.2 | 85.8 | 79.6 |
|
|
|
Table 4: Ablation studies of PROPHET on the test set of GLUE dataset. |
|
|
|
substantially. It even surpasses Two-Phase BERT. Also, our model is especially good at coping with inter-sentence relations compared with the baseline models, which means that our model is indeed capable of synthesizing information across multiple sentences of a document, verifying the effectiveness of leveraging sentence-level and global information.
|
|
|
6 ANALYSIS |
|
|
|
6.1 ABLATION STUDY |
|
|
|
To investigate the impacts of different objectives introduced, we evaluate three variants of PROPHET |
|
as described in Section 4.2: 1) the w/o LCM model drops the logical connectives masking pre-training objective, 2) the w/o LSC model leaves out the logical structure completion objective, and 3) the w/o LPP model only uses the objectives of connective masking and structure completion. The results are shown in Table 4.
|
|
|
Based on the ablation studies, we come to the following conclusions. Firstly, all three components |
|
contribute to the performance as removing any one of them causes a performance drop on the average |
|
score. In particular, the average score drops the most when we remove the logical path prediction objective, which sheds light on the importance of modeling chain-like relations between events. Secondly, we can see that logical path prediction contributes the most to the reasoning abilities, as the performance on language inference improves the most when we add the sentence-level connective masking objective and the logical path prediction task.
|
|
|
6.2 COMPARISON BETWEEN FACT AND ENTITY-LIKE KNOWLEDGE |
|
|
|
We also replace the injected facts with the common practice of using entity-like knowledge, i.e., named entities. In detail, we change the arguments in facts into named entities recognized by StanfordCoreNLP,[6] and leave the extracted predicates unchanged, resulting in the form $<NE_1, predicate, NE_2>$ (NE stands for named entity). If no named entities are recognized in a fact, we simply leave it out.
|
|
|
The results are shown in Table 5. We can see that the performance drops considerably, even below vanilla BERT. This is quite intuitive, as the number of named entities is far smaller than that of our extracted facts,
|
|
|
6 https://stanfordnlp.github.io/CoreNLP/
|
|
|
|
|
|
|
|
| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.6 | 85.0/84.1 | 91.4 | 69.8 | 89.2 | 71.4 | 86.0 | 81.0 |
| w/ named entities | 50.4 | 93.2 | 84.9/84.2 | 90.8 | 68.7 | 88.4 | 71.0 | 84.9 | 79.3 |
|
|
|
Table 5: Results on the GLUE test set when replacing facts with named entities and keeping the relations unchanged.
|
|
|
missing a lot of information inherent in the context. In contrast, our introduced facts can well capture the knowledge used in the reasoning process, providing a fundamental basis for reasoning.
|
|
|
6.3 ATTENTION MATRIX HEATMAP |
|
|
|
We plot the token-level attention matrix as a heatmap (Figure 3) to see how our model interprets the context.
|
|
|
Figure 3: Heatmap of the attention matrix of vanilla BERT and our implemented PROPHET for the sentence "However, they met harsh suppression after the Bolshevik government was stabilized.". Weights are selected from the first head of the last attention layer.
|
|
|
From the figure, we can see that the vanilla BERT attends to delimiters, particularly punctuation |
|
as suggested in Clark et al. (2019). In comparison, our model exhibits quite different attention |
|
distribution. Firstly, our model clearly decreases the influences introduced by punctuation. Secondly, |
|
our model pays more attention to tokens representing discourse-level information, such as "however" |
|
and "after", which is consistent with our motivation. It also well captures the relations of pronouns. |
|
The event characteristics are also illustrated, as seen from the "after suppression" phrase.
|
|
|
6.4 EFFECT OF DIFFERENT CONTEXT LENGTH |
|
|
|
We group samples into ten subsets of equal size (around 1,000 samples per interval) by context length, since the majority of the samples are concentrated in the interval below 60 tokens. The statistics of the MNLI-matched and MNLI-mismatched dev sets are shown in Table 6. Then we calculate the accuracy of the baseline and PROPHET per group for both the matched and mismatched sets, as shown in Figure 4. We observe that the performance of the baseline drops dramatically on long contexts, especially those longer than 45 tokens, while our model performs more robustly on those intervals (the slope of the dashed line is gentler).
|
|
|
6.5 CASE STUDY |
|
|
|
We also give a case study to demonstrate that PROPHET could enhance the reasoning process in |
|
language understanding. Given two sentences, we use PROPHET and BERT-base to predict whether |
|
|
|
|
|
|
|
|
Table 6: Distribution of context length on the dev sets of MNLI-matched and MNLI-mismatched.

| Dataset | [0, 29) | [30, 59) | [60, 89) | [90, 119) | [120, 149) | [150, 179) | [180, 209) | [210, 239) |
|---|---|---|---|---|---|---|---|---|
| MNLI-matched | 39.2% | 49.8% | 9.6% | 1.0% | 0.12% | 0.12% | 0.02% | 0.06% |
| MNLI-mismatched | 33.7% | 55.0% | 9.7% | 1.7% | 0.3% | 0.1% | 0.1% | 0% |
|
|
|
|
|
|
|
|
|
|
Figure 4: Accuracy for different context lengths on the MNLI-matched (left) and MNLI-mismatched (right) dev sets. There are approximately 1,000 samples in each interval.
|
|
|
the sentences are entailed or not. Results are shown in Figure 5. To probe the language understanding ability of our model, we made two subtle changes to the original training sample. Firstly, we changed the entity referred to in the sentence. We can see that PROPHET learns better alignment relations between entities than the BERT-base model. Additionally, we added a negation to the sentence. Although this change is small, it completely changes the semantics of the sentence and reverses the ground-truth label. We can see that PROPHET handles all the given samples correctly, indicating that it is not only good at reasoning in language understanding but also more robust than the baseline models.
|
|
|
Sentence 1 (identical in all three settings): "Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian."

| | Unchanged | Entity change | Negation |
|---|---|---|---|
| Sentence 2 | The French railway company is called SNCF. | The French railway company is called SBB. | The French railway company is not called SNCF. |
| Label | not entailment | not entailment | entailment |
| PROPHET | not entailment √ | not entailment √ | entailment √ |
| BERT-base | not entailment √ | entailment × | not entailment × |
|
|
|
|
|
Figure 5: We take an example from the RTE dataset and use PROPHET and BERT-base to predict the label of the relation between two given sentences.
|
|
|
7 CONCLUSION |
|
|
|
|
|
In this paper, we leverage facts in a newly pre-trained language model, PROPHET, to capture logic relations, in consideration of the fundamental role PrLMs serve in NLP and NLU tasks. We introduce three novel pre-training tasks and show that PROPHET achieves significant improvements on various NLP and NLU downstream tasks involving logical reasoning, including language inference, sentence classification, semantic similarity, and machine reading comprehension. Further analysis
|
shows that our model can well interpret the inner logical structure of the context to aid the reasoning |
|
process. |
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? |
|
an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019. |
|
|
|
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep |
|
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. |
|
|
|
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, |
|
Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with |
|
data flow. arXiv preprint arXiv:2009.08366, 2020. |
|
|
|
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert |
|
with disentangled attention. arXiv preprint arXiv:2006.03654, 2020. |
|
|
|
Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. Reasoning with transformer-based models: |
|
Deep learning, but shallow reasoning. In 3rd Conference on Automated Knowledge Base |
|
_Construction, 2021._ |
|
|
|
Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, and Xiaodan Liang. DAGN: Discourse-aware graph |
|
network for logical reasoning. In NAACL, 2021. |
|
|
|
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. |
|
SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the |
|
_Association for Computational Linguistics, 8:64–77, 2020._ |
|
|
|
Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models: |
|
Birds can talk, but cannot fly. arXiv preprint arXiv:1911.03343, 2019. |
|
|
|
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint |
|
_arXiv:1412.6980, 2014._ |
|
|
|
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. |
|
ALBERT: A lite BERT for self-supervised learning of language representations. In International |
|
Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=H1eA7AEtvS.
|
|
|
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, |
|
Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for |
|
natural language generation, translation, and comprehension. In Proceedings of the 58th Annual |
|
_Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020._ |
|
|
|
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A |
|
challenge dataset for machine reading comprehension with logical reasoning. In Christian Bessiere |
|
(ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, |
|
_IJCAI-20, pp. 3622–3628. International Joint Conferences on Artificial Intelligence Organization,_ |
|
7 2020. doi: 10.24963/ijcai.2020/501. URL https://doi.org/10.24963/ijcai.2020/501. Main track.
|
|
|
|
|
|
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike |
|
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining |
|
approach. arXiv preprint arXiv:1907.11692, 2019. |
|
|
|
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David |
|
McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd |
|
_annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60,_ |
|
2014. |
|
|
|
George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): |
|
39–41, 1995. |
|
|
|
|
|
|
|
|
Ndapandula Nakashole and Tom Mitchell. Language-aware truth assessment of fact candidates. In |
|
_Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume_ |
|
_1: Long Papers), pp. 1009–1019, 2014._ |
|
|
|
Guoshun Nan, Zhijiang Guo, Ivan Sekulić, and Wei Lu. Reasoning with latent structure refinement
|
for document-level relation extraction. arXiv preprint arXiv:2005.06312, 2020. |
|
|
|
Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. Fact-driven logical reasoning. arXiv preprint |
|
_arXiv:2105.10334, 2021._ |
|
|
|
Nina Poerner, Ulli Waltinger, and Hinrich Schütze. Bert is not a knowledge base (yet): Factual |
|
knowledge vs. name-based reasoning in unsupervised qa. arXiv preprint arXiv:1911.03681, 2019. |
|
|
|
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and |
|
Bonnie L Webber. The penn discourse treebank 2.0. In LREC. Citeseer, 2008. |
|
|
|
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language |
|
understanding by generative pre-training. 2018. |
|
|
|
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi |
|
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text |
|
transformer. Journal of Machine Learning Research, 21:1–67, 2020. |
|
|
|
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about |
|
how bert works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020. |
|
|
|
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi |
|
Chen, Yanbin Zhao, Yuxiang Lu, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training |
|
for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021. |
|
|
|
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz |
|
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information |
|
_processing systems, pp. 5998–6008, 2017._ |
|
|
|
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: |
|
A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint |
|
_arXiv:1804.07461, 2018._ |
|
|
|
Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. Fine-tune bert for |
|
docred with two-step process. arXiv preprint arXiv:1909.11898, 2019. |
|
|
|
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, |
|
Ming Zhou, et al. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv |
|
_preprint arXiv:2002.01808, 2020._ |
|
|
|
Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and |
|
Nan Duan. Logic-driven context extension and data augmentation for logical reasoning of text. |
|
_arXiv preprint arXiv:2105.03659, 2021a._ |
|
|
|
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian |
|
Tang. Kepler: A unified model for knowledge embedding and pre-trained language representation. |
|
_Transactions of the Association for Computational Linguistics, 9:176–194, 2021b._ |
|
|
|
Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: |
|
Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019. |
|
|
|
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. |
|
Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural |
|
_information processing systems, 32, 2019._ |
|
|
|
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie |
|
Zhou, and Maosong Sun. Docred: A large-scale document-level relation extraction dataset. arXiv |
|
_preprint arXiv:1906.06127, 2019._ |
|
|
|
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset |
|
requiring logical reasoning. In International Conference on Learning Representations (ICLR), |
|
April 2020. |
|
|
|
|
|
|
|
|
|