# LOGIC PRE-TRAINING OF LANGUAGE MODELS

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Pre-trained language models (PrLMs) have been shown useful for enhancing
a broad range of natural language understanding (NLU) tasks. However, the
capacity to capture logic relations in challenging NLU remains a bottleneck
even for state-of-the-art PrLMs, which greatly limits their reasoning
abilities. We therefore propose logic pre-training of language models, leading to
PROPHET, a PrLM equipped with logical reasoning ability. To let logic pre-training
operate on a clear, accurate, and generalized knowledge basis, we introduce the fact
as the knowledge unit instead of the plain language units used in previous PrLMs.
Facts are extracted through syntactic parsing, avoiding unnecessarily complex
knowledge injection, and they allow logic-aware models to be trained on more general
language text. To explicitly guide the PrLM to capture logic relations, three
pre-training objectives are introduced: 1) logical connectives masking to capture
sentence-level logic, 2) logical structure completion to accurately capture facts
from the original context, and 3) logical path prediction on a logical graph to
uncover global logic relationships among facts. We evaluate our model on a broad
range of NLP and NLU tasks, including natural language inference, relation
extraction, and machine reading comprehension with logical reasoning. Results show
that the extracted facts and the newly introduced pre-training tasks help PROPHET
achieve significant improvements on all the downstream tasks, especially logical
reasoning related tasks.

1 INTRODUCTION

Machine reasoning in natural language understanding (NLU) aims to teach machines to understand
human languages by building and analyzing the connections between facts, events, and
observations using logical analysis techniques like deduction and induction, which is one of the
ultimate goals towards human-parity intelligence. Although pre-trained language models (PrLMs),
such as BERT (Devlin et al., 2018), GPT (Radford et al., 2018), XLNet (Yang et al., 2019) and
RoBERTa (Liu et al., 2019), have established state-of-the-art performance on various aspects of NLU,
they still fall short on complex language understanding tasks that involve reasoning (Helwe et al.,
2021). The major reason behind this is that they are insufficiently capable of capturing logic relations
such as negation (Kassner & Schütze, 2019), factual knowledge (Poerner et al., 2019), events (Rogers
et al., 2020), and so on. Many previous studies (Sun et al., 2021; Xiong et al., 2019; Wang et al.,
2020) are therefore motivated to inject knowledge into pre-trained models like BERT and RoBERTa.
However, they rely too heavily on massive external knowledge sources and ignore that language itself
is a natural knowledge carrier and a basis for acquiring logical reasoning ability (Ouyang et al., 2021).
Taking the context in Figure 1 as an example, previous approaches tend to focus on entities such as
the definition of "government" and related concepts like "governor", but overlook the exact
relations inherent in this example, thus failing to model the complex reasoning process.

Given that PrLMs are the key supporting components in natural language understanding,
in this work we propose a fundamental solution by empowering PrLMs with the capacity to
capture logic relations, which is necessary for logical reasoning. However, logical reasoning can
only be implemented on the basis of clear, accurate, and generalized knowledge. Therefore, we
leverage the fact as the conceptual knowledge unit serving as the basis for logic relation extraction. A fact is
organized as a triplet, i.e., a predicate-argument structure, to represent meanings such
as "who-did-what-to-whom" and "who-is-what". Compared with existing studies that inject complex
knowledge like knowledge graphs, the knowledge structure based on facts is far less complicated and
more general in representing events and relations in language.


-----

On top of the fact-based knowledge structure, we present PROPHET, a logic-aware pre-trained
language model that learns logic-aware relations in a universal way from large-scale text. In detail,
we introduce three novel pre-training objectives based on the newly introduced knowledge unit,
the fact: 1) logical connectives masking for learning sentence-level logic connections; 2) a logical
structure completion task on top of facts for regularization, aligning extracted facts with the original
context; and 3) logical path prediction to capture the logic relationships between facts. PROPHET is
evaluated on a broad range of language understanding tasks: natural language inference, semantic
similarity, machine reading comprehension, etc. Experimental results show that the fact is useful
as the carrier for knowledge modeling, and the newly introduced pre-training tasks help
PROPHET achieve significant performance gains on downstream tasks.[1]

2 RELATED WORK

2.1 PRE-TRAINED LANGUAGE MODELS IN NLP

Large pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Radford et al., 2018)
have brought dramatic empirical improvements on almost every NLP task in the past few years.
A classical paradigm of pre-training is to train neural models on a large corpus with self-supervised
pre-training objectives. "Self-supervised" means that the supervision provided in the training process
is automatically generated from the raw text instead of being manually annotated. Designing effective
criteria for language modeling is one of the major topics in training pre-trained models, since it decides
how the model captures knowledge from large-scale unlabeled data. The most popular pre-training
objective used today is masked language modeling (MLM), initially used in BERT (Devlin et al.,
2018), which randomly masks out tokens and asks the model to uncover them given the surrounding
context. Recent studies have investigated diverse variants of denoising strategies (Raffel et al., 2020;
Lewis et al., 2020), model architectures (Yang et al., 2019), and auxiliary objectives (Lan et al.,
2019; Joshi et al., 2020) to strengthen models during pre-training. Although the existing
techniques have shown effectiveness in capturing syntactic and semantic information after large-scale
pre-training, they are sensitive to role reversal and struggle with pragmatic inference and
role-based event knowledge (Rogers et al., 2020), which are critical to the ultimate goal of complex
reasoning that requires uncovering logical structures. It is difficult for pre-trained language
models to capture the logical structure inherent in texts since logical supervision is rarely available
during pre-training. Therefore, we are motivated to explicitly guide the model to capture such clues
via our newly introduced self-supervised tasks.

2.2 REASONING ABILITY FOR PRE-TRAINED LANGUAGE MODELS

There is a lot of work in the research line of enhancing reasoning abilities in pre-trained language
models via injecting knowledge. The existing approaches mainly design novel pre-training objectives
and leverage abundant knowledge sources such as WordNet (Miller, 1995).

Notably, ERNIE 3.0 (Sun et al., 2021) uses a broad range of pre-training objectives from word-aware,
structure-aware to knowledge-aware tasks, based on a 4TB corpus consisting of plain texts and a
large-scale knowledge graph. WKLM (Xiong et al., 2019) replaces entity mentions in the document
with other entities of the same type, and the objective is to distinguish the replaced entity from the
original ones. KEPLER (Wang et al., 2021b) encodes textual entity descriptions using embeddings
from a PrLM to take full advantage of the abundant textual information. K-Adapter (Wang et al., 2020)
designs neural adapters to distinguish the type of knowledge sources to capture various knowledge.

Our proposed method differs from previous studies in three aspects. Firstly, our model does not
require any external knowledge resources like previous methods that use WordNet, WikiData, etc.
We only use small-scale textual sources following standard PrLMs like BERT (Devlin et al.,
2018), along with an off-the-shelf dependency parser to extract facts. Secondly, previous works only
consider triplet-level pre-training objectives; we propose a multi-granularity pre-training strategy
that considers not only triplet-level information but also sentence-level and global knowledge to enhance
logical reasoning. Finally, we propose a new training mechanism apart from masked language modeling
(MLM), hoping to shed light on further logic pre-training strategies in this research line.

1 Our code has been uploaded as supplemental material and will be released after the double-blind review period.


-----

[Figure 1 shows the example text "Despite concerns, anarchists participated in the Russian Revolution in opposition to the White movement. However, they met harsh suppression after the Bolshevik government was stabilized.", the facts extracted from it, e.g., (anarchists, participated, revolution), (revolution, opposite, movement), (they, met, suppression), (suppression, after, stabilized), (government, was, stabilized), and the logical graph that links them with same and coref edges.]
Figure 1: How the facts and logical graph are constructed from raw text inputs. Edges in red denote
additional edges added to the logical graph, while text in green indicates the sentence-level logical
connectives mentioned in §4.

3 PRELIMINARIES

In this section, we introduce the concepts of fact and logical graph, which form the basis of PROPHET.
We also describe how facts are extracted for logical graph construction, with an example shown in
Figure 1.

3.1 FACT

Following Nakashole & Mitchell (2014) and Ouyang et al. (2021), we extract facts as triplets
T = {A1, P, A2}, where A1 and A2 are the arguments and P is the predicate between
them. This form can represent a broad range of facts, reflecting notions such as "who-did-what-to-whom"
and "who-is-what".

We extract such facts in a syntactic way, which makes our approach generic and easy to apply. Given
a document, we first split it into multiple sentences. For each sentence, we conduct
dependency parsing using StanfordCoreNLP (Manning et al., 2014).[2] Based on the analyzed dependencies,
we consider verb phrases and some prepositions in the sentences as "predicates", and then
search for their corresponding actors and actees as the "arguments".
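
To make this procedure concrete, the sketch below shows one way to implement the actor/actee search over a dependency parse. It is only an illustrative simplification: the paper parses with StanfordCoreNLP, whereas this snippet substitutes spaCy, and the dependency-label heuristics are an assumption rather than the exact rules used for PROPHET.

```python
# Illustrative fact extraction from dependency parses (spaCy stands in for
# StanfordCoreNLP; heuristics are a simplification of the description above).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facts(sentence: str):
    """Return (argument, predicate, argument) triplets from one sentence."""
    facts = []
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
        # "who-did-what-to-whom": verb as predicate, subject/object as arguments
        for s in subjects:
            for o in objects:
                facts.append((s.text, token.lemma_, o.text))
        # some prepositions also act as predicates, e.g. (suppression, after, stabilized)
        for prep in (c for c in token.children if c.dep_ == "prep"):
            for pobj in (c for c in prep.children if c.dep_ == "pobj"):
                for s in subjects or objects:
                    facts.append((s.text, prep.text, pobj.text))
    return facts

print(extract_facts("Anarchists participated in the Russian Revolution."))
```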

3.2 LOGICAL GRAPH

A logical graph is an undirected (and not necessarily connected) graph that represents
logical dependency relations between components of facts. In logical graphs, nodes represent
arguments/predicates in the facts, and edges indicate whether two nodes are related within a fact.
Such a structure can well unveil and organize the semantic information captured by facts. Besides, a
logical graph supports reasoning over long-range dependencies by connecting arguments and
their relations across different facts and spans.

We further show how to construct such graphs based on facts. Besides the relations given by facts, we
design two additional types of edges based on identical mentions and coreference information. (1) There
can be identical mentions in different sentences, resulting in repeated nodes across facts. We connect
nodes corresponding to the same non-pronoun arguments by edges of type same. (2) We
conduct coreference resolution on the context using an off-the-shelf model to identify arguments in facts
that refer to the same entity.[3] We add edges of type coref between them. The final logical graph is
denoted as $S = (V, E)$, where $V = A_i \cup P$ and $i \in \{1, 2\}$.

2 https://stanfordnlp.github.io/CoreNLP/. We also tried to use OpenIE directly; however, the performance was not satisfactory.
3 https://github.com/huggingface/neuralcoref
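
As a rough sketch of this construction, the snippet below builds such a graph with networkx. Repeated mentions are kept as separate nodes and linked by same edges for non-pronoun arguments, as described above; the pronoun list and the interface for coreference pairs are illustrative assumptions (the paper obtains coreference information from neuralcoref).

```python
# Illustrative logical graph construction from extracted facts.
import itertools
import networkx as nx

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def build_logical_graph(facts, coref_pairs=()):
    """facts: (arg1, predicate, arg2) triplets in document order.
    coref_pairs: argument pairs judged coreferent by an external resolver."""
    graph = nx.Graph()
    node_id = itertools.count()
    mentions = {}  # argument surface form -> list of node ids

    def add_node(text, record=True):
        nid = next(node_id)
        graph.add_node(nid, text=text)
        if record:  # only arguments are tracked for "same"/"coref" edges
            mentions.setdefault(text.lower(), []).append(nid)
        return nid

    for a1, pred, a2 in facts:
        n1 = add_node(a1)
        np_ = add_node(pred, record=False)  # predicates are never merged
        n2 = add_node(a2)
        graph.add_edge(n1, np_, type="fact")
        graph.add_edge(np_, n2, type="fact")

    # "same" edges between repeated non-pronoun arguments
    for text, ids in mentions.items():
        if text in PRONOUNS:
            continue
        for u, v in itertools.combinations(ids, 2):
            graph.add_edge(u, v, type="same")

    # "coref" edges between mentions resolved to the same entity
    for a, b in coref_pairs:
        for u in mentions.get(a.lower(), []):
            for v in mentions.get(b.lower(), []):
                graph.add_edge(u, v, type="coref")
    return graph
```

This graph is the structure later consumed by the logical path prediction objective in §4.2.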


-----

[Figure 2 shows the input text, the facts extracted from it, e.g., (anarchists, participated, revolution) and (suppression, after, stabilized), and the logical graph with same and coref edges; the text encoder is pre-trained with sentence-level connective masking, fact unit alignment, and logical path prediction over sampled node pairs.]

Figure 2: An illustration of the pre-training methods used in PROPHET. The model takes the text,
the extracted facts, and randomly sampled node pairs from the logical graph as input. The model is
pre-trained with three novel objectives: masked language modeling applied to sentence-level logical
connectives, fact alignment, and logical path prediction.

4 PROPHET

4.1 MODEL ARCHITECTURE

We follow BERT (Devlin et al., 2018) and use a multi-layer bidirectional Transformer (Vaswani
et al., 2017) as the model architecture of PROPHET. To keep the focus on the newly introduced
techniques, we do not review the ubiquitous Transformer architecture in detail. We develop
PROPHET using exactly the same model architecture as BERT-base: 12 transformer layers,
a hidden size of 768, 12 attention heads, and 110M model parameters in total.
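
In HuggingFace transformers terms, this backbone corresponds roughly to the configuration sketched below (vocabulary and other hyperparameters left at BERT-base defaults); it is illustrative only, not the released training code.

```python
from transformers import BertConfig, BertModel

# BERT-base-sized backbone: 12 layers, hidden size 768, 12 attention heads.
config = BertConfig(num_hidden_layers=12, hidden_size=768,
                    num_attention_heads=12, intermediate_size=3072)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # ~110M
```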

4.2 LOGIC-AWARE PRE-TRAINING TASKS

We describe the three pre-training tasks used for PROPHET in this section; Figure 2 gives an
illustration of PROPHET pre-training. The first task is logical connectives masking (LCM), which
generalizes masked language modeling (Devlin et al., 2018) to logical connectives in order to learn
sentence-level representations. The second task is logical structure completion (LSC) for learning the
logic relationships inside a fact, where we randomly mask components of facts and then predict those
items. Finally, a logical path prediction (LPP) task is proposed for recognizing the logical relations
of randomly selected node pairs.

**Logical Connective Masking** Logical connective masking is an extension of the masked language
modeling (MLM) pre-training objective of Devlin et al. (2018), with a particular focus on connective
indicator tokens. We use the Penn Discourse TreeBank 2.0 (PDTB) (Prasad et al., 2008) to draw
the logical relations among sentences. Specifically, PDTB 2.0 contains relations manually
annotated on the 1-million-word Wall Street Journal (WSJ) corpus, broadly characterized into
"Explicit" and "Implicit" connectives. We use the "Explicit" type (100 such connectives in total),
which appear explicitly in sentences, such as the discourse adverbial "instead" or the subordinating
conjunction "because". Taking all the identified connectives plus some randomly sampled other
tokens (for a total of 15% of the tokens in the original context), we replace them with a [MASK] token
80% of the time, with a random token 10% of the time, and leave them unchanged 10% of the time.
The MLM objective is to predict the original tokens at these sampled positions, which has proven
effective in previous work (Devlin et al., 2018; Liu et al., 2019). In this way, the model learns
to recover the logical relations between two given sentences, which helps language understanding. The
objective of this task is denoted as $\mathcal{L}_{conn}$.
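
A minimal sketch of this corruption scheme is given below, operating on a pre-tokenized word sequence. The connective set and toy vocabulary are placeholders (PDTB 2.0 defines 100 explicit connectives), and actual pre-training would work on subword IDs rather than whole words.

```python
# Illustrative logical connective masking with the 15% budget and 80/10/10 scheme.
import random

CONNECTIVES = {"because", "instead", "however", "after", "but", "therefore"}
MASK, VOCAB = "[MASK]", ["the", "government", "met", "after", "harsh"]  # toy vocab

def lcm_mask(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    budget = max(1, int(round(mask_rate * len(tokens))))
    # connectives first, then random fillers up to the 15% budget
    picked = [i for i, t in enumerate(tokens) if t.lower() in CONNECTIVES]
    rest = [i for i in range(len(tokens)) if i not in picked]
    rng.shuffle(rest)
    picked = (picked + rest)[:budget]

    corrupted, labels = list(tokens), [None] * len(tokens)
    for i in picked:
        labels[i] = tokens[i]                 # prediction target: original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

tokens = "however they met harsh suppression after the government was stabilized".split()
print(lcm_mask(tokens))
```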

**Logical Structure Completion** To align representations between the context and the extracted facts,
we introduce a pre-training task of logical structure completion. The motivation is to encourage


-----

the model to learn a structure-aware representation that encodes "who-did-what-to-whom"-like
meanings for better language understanding. In detail, we randomly select a proportion λ of the total
facts (λ = 20% in this work) from a given context. For each chosen fact, we either ask the model to
complete "Argument-Predicate-?" or "Argument-?-Argument" (the two templates are selected with equal
probability). We denote the blanks to be completed as m[a] and m[p] for arguments and predicates,
respectively. In our implementation, this objective takes the same form as masked language modeling
for simplicity, using the original loss following Devlin et al. (2018):


$$\mathcal{L}_{align} = -\sum_{i \in a \cup p} \log D(x_i \mid m^a, m^p), \qquad (1)$$

where $D$ is the discriminator that predicts a token from a large vocabulary.
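
For illustration, one way to build such completion targets is sketched below; the sampling ratio follows λ = 20% from the text, while the function name and dictionary layout are assumptions for the sketch, not the released implementation.

```python
# Illustrative construction of logical structure completion (LSC) targets.
import random

def build_lsc_examples(facts, ratio=0.2, seed=0):
    rng = random.Random(seed)
    chosen = rng.sample(facts, max(1, int(round(ratio * len(facts)))))
    examples = []
    for a1, pred, a2 in chosen:
        if rng.random() < 0.5:
            # "Argument-Predicate-?": blank an argument slot (m^a)
            examples.append({"query": (a1, pred, "[MASK]"), "target": a2})
        else:
            # "Argument-?-Argument": blank the predicate slot (m^p)
            examples.append({"query": (a1, "[MASK]", a2), "target": pred})
    return examples

facts = [("anarchists", "participated", "revolution"),
         ("they", "met", "suppression"),
         ("government", "was", "stabilized"),
         ("suppression", "after", "stabilized"),
         ("revolution", "opposite", "movement")]
print(build_lsc_examples(facts))
```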

**Logical Path Prediction** To learn representations from the constructed logical graph and thus endow
the model with global logical reasoning ability, we propose the pre-training task of predicting whether
there exists a path between two selected nodes in the logical graph. In this way, the model learns to
look at logical relations across long distances between arguments and predicates in different facts.

We randomly sample 20% of the nodes from the logical graph to form a set $V'$, yielding
$\binom{|V'|}{2}$ node pairs in total. We set a maximum number $max_p$ of node pairs to predict. To
avoid bias in the training process, we try to ensure that $max_p/2$ of the pairs are positive samples
and the rest are negative samples, thus balancing the positive-negative ratio. If the number of
positive/negative samples is less than $max_p/2$, we simply keep the original pairs. Formally, the
pre-training objective of this task is calculated as below, following Guo et al. (2020):



$$\mathcal{L}_{Path} = -\sum_{v_i, v_j \in V'} \big[\, \delta \log \sigma[v_i, v_j] + (1 - \delta) \log(1 - \sigma[v_i, v_j]) \,\big], \qquad (2)$$

where $\delta$ is 1 when $v_i$ and $v_j$ are connected by a path and 0 otherwise, and $[v_i, v_j]$
denotes the concatenation of the representations of $v_i$ and $v_j$.
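
A possible implementation of the balanced pair sampling is sketched below, reusing the networkx graph from §3.2; the function name and the `max_pairs` default are assumptions, and the graph is assumed to contain at least a handful of nodes.

```python
# Illustrative balanced node-pair sampling for logical path prediction (LPP).
import itertools
import random
import networkx as nx

def sample_lpp_pairs(graph, node_ratio=0.2, max_pairs=32, seed=0):
    rng = random.Random(seed)
    k = max(2, int(round(node_ratio * graph.number_of_nodes())))
    v_prime = rng.sample(list(graph.nodes), k)

    positives, negatives = [], []
    for u, v in itertools.combinations(v_prime, 2):
        label = 1 if nx.has_path(graph, u, v) else 0  # delta in Eq. (2)
        (positives if label else negatives).append((u, v, label))

    rng.shuffle(positives)
    rng.shuffle(negatives)
    half = max_pairs // 2
    # If one side has fewer than max_p/2 pairs, it is simply kept as-is.
    return positives[:half] + negatives[:half]
```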

The final training objective is the sum of the three losses above:

$$\mathcal{L} = \mathcal{L}_{conn} + \mathcal{L}_{align} + \mathcal{L}_{Path}. \qquad (3)$$

4.3 PRE-TRAINING DETAILS

We use English Wikipedia (1.1 million articles in total) and sample training and validation sets
with a split ratio of 19:1 on the original data. We omit the "Reference" and "Literature" sections
of each document to ensure data quality. Following previous practice (Devlin et al., 2018), we
limit sequences in each batch to up to 512 tokens and use a batch size of 128. We use
Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 1e − 6, and weight decay is set to
0.01. We pre-train our model for 500k steps. We use 8 NVIDIA V100 32GB GPUs, with FP16 and
DeepSpeed for training acceleration. Initialized with the pre-trained weights of BERTbase, we continue
training our models for 200k steps.
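
For reference, the optimizer setting above corresponds roughly to the following PyTorch call; the learning rate is not stated in this section, so the value below is a placeholder, and the FP16/DeepSpeed wrapping is omitted.

```python
import torch

def make_optimizer(model, lr=1e-4):  # lr is an assumed placeholder value
    # Adam with beta1=0.9, beta2=0.98, eps=1e-6 and weight decay 0.01,
    # matching the hyperparameters listed in Section 4.3.
    return torch.optim.Adam(model.parameters(), lr=lr,
                            betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)
```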

5 EXPERIMENTS

5.1 TASKS AND DATASETS

Our experiments are conducted on a broad range of language understanding tasks, including natural
language inference, machine reading comprehension, semantic similarity, and text classification.
Some of these tasks are part of the GLUE (Wang et al., 2018) benchmark. To test generalizability,
we also extend our experiments to DocRED (Yao et al., 2019), a widely used benchmark for
document-level relation extraction. To verify our model's logical reasoning ability, we perform
experiments on two recent logical reasoning datasets in the form of machine reading comprehension,
ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020).


-----

| Model | CoLA | SST-2 | MNLI (m/mm) | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *In literature* | | | | | | | | | |
| BERTbase | 52.1 | 93.5 | 84.6/83.4 | 90.5 | 66.4 | 88.9 | 71.2 | 85.8 | 79.6 |
| SemBERTbase | 57.8 | 93.5 | 84.4/84.0 | 90.9 | 69.3 | 88.2 | 71.8 | 87.3 | 80.8 |
| *Our implementation* | | | | | | | | | |
| BERTbase | 53.6 | 93.5 | 84.6/83.4 | 90.9 | 66.6 | 88.6 | 71.2 | 85.8 | 79.8 |
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |

Table 1: Leaderboard results on the GLUE benchmark. CoLA and SST-2 are classification tasks; MNLI,
QNLI, and RTE are language inference tasks; MRPC, QQP, and STS-B are semantic similarity tasks.
F1 scores are reported for QQP and MRPC, Spearman correlations for STS-B, and accuracy for the other tasks.

| Model | ReClor Dev | ReClor Test | ReClor Test-E | ReClor Test-H | LogiQA Dev | LogiQA Test |
|---|---|---|---|---|---|---|
| Human Performance* | - | 63.0 | 57.1 | 67.2 | - | 86.0 |
| *In literature* | | | | | | |
| FOCAL REASONER (Ouyang et al., 2021) | 78.6 | 73.3 | 86.4 | 63.0 | 47.3 | 45.8 |
| LReasoner (Wang et al., 2021a) | 74.6 | 71.8 | 83.4 | 62.7 | 45.8 | 43.3 |
| DAGN (Huang et al., 2021) | 65.8 | 58.3 | 75.9 | 44.5 | 36.9 | 39.3 |
| BERTlarge (Devlin et al., 2018) | 53.8 | 49.8 | 72.0 | 32.3 | 34.1 | 31.0 |
| XLNetlarge (Yang et al., 2019) | 62.0 | 56.0 | 75.7 | 40.5 | - | - |
| RoBERTalarge (Liu et al., 2019) | 62.6 | 55.6 | 75.5 | 40.0 | 35.0 | 35.3 |
| DeBERTalarge (He et al., 2020) | 74.4 | 68.9 | 83.4 | 57.5 | 44.4 | 41.5 |
| *Our implementation* | | | | | | |
| BERTbase | 51.2 | 47.3 | 71.6 | 28.2 | 33.8 | 32.1 |
| PROPHET | 53.4 | 48.8 | 72.4 | 32.2 | 35.2 | 34.1 |

Table 2: Accuracy on the ReClor and LogiQA datasets. The published methods listed are based on large models.

5.2 RESULTS

Table 1 shows results on the GLUE benchmark datasets. We have the following observations from
the above results.

(1) PROPHET obtains substantial gains over the BERT baseline (continually trained for 200K steps
for a fair comparison), indicating that our model works well for general language understanding.

(2) PROPHET performs particularly well on language inference tasks including MNLI, QNLI, and
RTE,[4] which indicates our model's ability to reason.

(3) On both large-scale datasets such as QQP and MNLI and small datasets like CoLA and STS-B,
our model demonstrates consistent improvement, indicating its robustness.

(4) From Table 2, we can see that PROPHET improves the logical reasoning ability of the BERT baseline
by a large margin. In particular, armed with our approach, the results of the BERT-base model on the
two datasets are comparable to or even surpass those of BERT-large.

In addition, we conducted experiments on a large-scale human-annotated dataset for document-level
relation extraction (Yao et al., 2019). The results are shown in Table 3.[5] From the table, we can see
that PROPHET also does well at document-level relation extraction, outperforming the baseline

4 We exclude the problematic WNLI set.
5 We only report the results for Ign F1 in the annotated setting as the distant supervision is too slow to train.


-----

| Model | Dev F1 | Dev Intra-F1 | Dev Inter-F1 | Test F1 |
|---|---|---|---|---|
| BERTbase* (Devlin et al., 2018) | 54.2 | 61.6 | 47.2 | 53.2 |
| Two-Phase BERT* (Wang et al., 2019) | 54.4 | 61.8 | 47.3 | 53.9 |
| PROPHET | 54.8 (↑0.6) | 62.4 (↑0.8) | 47.5 (↑0.3) | 54.3 (↑1.1) |

Table 3: Main results on the dev and test sets of DocRED. * indicates that the results are taken from
Nan et al. (2020). Intra- and Inter-F1 denote F1 scores for intra- and inter-sentence relations,
following the setting of Nan et al. (2020).

| Model | CoLA | SST-2 | MNLI (m/mm) | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |
| w/o LCM | 53.6 | 93.5 | 85.1/84.0 | 90.9 | 68.2 | 88.6 | 71.2 | 85.8 | 80.1 |
| w/o LSC | 53.6 | 93.6 | 85.0/84.1 | 91.3 | 69.0 | 88.9 | 71.4 | 85.9 | 80.3 |
| w/o LPP | 52.1 | 93.0 | 84.6/83.4 | 90.9 | 66.4 | 88.6 | 71.2 | 85.8 | 79.6 |

Table 4: Ablation studies of PROPHET on the GLUE test set.

substantially. It even surpasses Two-Phase BERT. Moreover, our model is especially good at handling
inter-sentence relations compared with the baseline models, which means it is indeed
capable of synthesizing information across multiple sentences of a document, verifying the
effectiveness of leveraging sentence-level and global information.

6 ANALYSIS

6.1 ABLATION STUDY

To investigate the impact of the different objectives introduced, we evaluate three variants of PROPHET
as described in Section 4.2: 1) the w/o LCM model drops the logical connectives masking pre-training
objective, 2) the w/o LSC model leaves out the logical structure completion objective, and 3) the
w/o LPP model only uses the connective masking and structure completion objectives. The results are
shown in Table 4.

Based on the ablation studies, we come to the following conclusions. Firstly, all three components
contribute to the performance, as removing any one of them causes a drop in the average score.
In particular, the average score drops the most when we remove the logical path prediction objective,
which sheds light on the importance of modeling chain-like relations between events. Secondly,
logical path prediction contributes the most to the reasoning abilities, as the performance on
language inference improves the most when we add the sentence-level connective masking objective
and the logical path prediction task.

6.2 COMPARISON BETWEEN FACT AND ENTITY-LIKE KNOWLEDGE

We also replace the injected facts with the common practice of using entity-like knowledge, i.e.,
named entities. In detail, we change the arguments in facts into named entities recognized
by StanfordCoreNLP,[6] and leave the extracted predicates unchanged, resulting in triplets of the form
< NE1, predicate, NE2 > (NE stands for named entity). If no named entities are recognized in a fact,
we simply leave it out.
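
A rough sketch of this entity-only variant is given below; spaCy NER stands in for StanfordCoreNLP, and requiring both arguments to map to a recognized entity is a simplification of the filtering rule described above.

```python
# Illustrative conversion of facts into <NE1, predicate, NE2> triplets.
import spacy

nlp = spacy.load("en_core_web_sm")

def to_entity_facts(facts, context):
    ents = [e.text for e in nlp(context).ents]

    def match(arg):
        # map an argument to a named entity whose span contains or equals it
        return next((e for e in ents if arg.lower() in e.lower()), None)

    kept = []
    for a1, pred, a2 in facts:
        ne1, ne2 = match(a1), match(a2)
        if ne1 and ne2:
            kept.append((ne1, pred, ne2))  # <NE1, predicate, NE2>
    return kept
```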

The results are shown in Table 5. We can see that the performance is hurt considerably, even falling
below vanilla BERT. This is intuitive, as the number of named entities is far smaller than the number of facts we obtain,

6 https://stanfordnlp.github.io/CoreNLP/


-----

| Model | CoLA | SST-2 | MNLI (m/mm) | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.6 | 85.0/84.1 | 91.4 | 69.8 | 89.2 | 71.4 | 86.0 | 81.0 |
| w/ named entities | 50.4 | 93.2 | 84.9/84.2 | 90.8 | 68.7 | 88.4 | 71.0 | 84.9 | 79.3 |

Table 5: Results on the GLUE test set when replacing facts with named entities while keeping the
relations unchanged.

missing much of the information inherent in the context. In contrast, our introduced facts can well capture the
knowledge used in the reasoning process, providing a fundamental basis for reasoning.

6.3 ATTENTION MATRIX HEATMAP

We plot the token-level attention matrix as a heatmap (Figure 3) to see how our model interprets the
context.

Figure 3: Heatmap of the attention matrix of vanilla BERT and our implemented PROPHET for the
sentence "However, they met harsh suppression after the Bolshevik government was stabilized.".
Weights are taken from the first head of the last attention layer.

From the figure, we can see that vanilla BERT attends mostly to delimiters, particularly punctuation,
as suggested by Clark et al. (2019). In comparison, our model exhibits a quite different attention
distribution. Firstly, our model clearly reduces the influence of punctuation. Secondly, it pays more
attention to tokens carrying discourse-level information, such as "however" and "after", which is
consistent with our motivation. It also captures the relations of pronouns well. The event
characteristics are also illustrated, as seen from the attention between "suppression" and "after".
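
For reproducibility of this kind of plot, the sketch below extracts and visualizes the same slice of attention (last layer, first head) with the HuggingFace transformers API; the checkpoint name is a stand-in, since no PROPHET checkpoint identifier is given.

```python
# Illustrative extraction and plotting of an attention heatmap.
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # placeholder; swap in a PROPHET checkpoint if available
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

sentence = ("However, they met harsh suppression after the Bolshevik "
            "government was stabilized.")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one tensor per layer

weights = attentions[-1][0, 0]               # last layer, first head
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(weights.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.tight_layout()
plt.show()
```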

6.4 EFFECT OF DIFFERENT CONTEXT LENGTH

We group samples into ten subsets of equal size (around 1,000 samples per interval) by context length,
since the majority of the samples fall in the interval below 60 tokens. The statistics of the
MNLI-matched and MNLI-mismatched dev sets are shown in Table 6. We then calculate the accuracy of the
baseline and PROPHET per group for both the matched and mismatched sets, as shown in Figure 4. We
observe that the performance of the baseline drops dramatically on long contexts, especially those
longer than 45 tokens, while our model performs more robustly on those intervals (the slope of the
dashed line is gentler).
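
The grouping itself can be reproduced with decile binning, e.g. as in the sketch below; the input arrays and the function name are illustrative placeholders rather than the analysis script used for the paper.

```python
# Illustrative per-length accuracy grouping via decile bins.
import numpy as np

def accuracy_by_length(lengths, correct, n_bins=10):
    lengths = np.asarray(lengths)
    correct = np.asarray(correct, dtype=float)  # 1.0 if prediction correct, else 0.0
    edges = np.quantile(lengths, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, lengths, side="right") - 1, 0, n_bins - 1)
    # returns (interval_lower, interval_upper, accuracy) for each bin
    return [(edges[b], edges[b + 1], correct[bins == b].mean())
            for b in range(n_bins)]
```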

6.5 CASE STUDY

We also give a case study to demonstrate that PROPHET can enhance the reasoning process in
language understanding. Given two sentences, we use PROPHET and BERT-base to predict whether


-----

Table 6: Distribution of context length on the dev sets of the MNLI-matched and MNLI-mismatched datasets.

| Dataset | [0, 29) | [30, 59) | [60, 89) | [90, 119) | [120, 149) | [150, 179) | [180, 209) | [210, 239) |
|---|---|---|---|---|---|---|---|---|
| MNLI-matched | 39.2% | 49.8% | 9.6% | 1.0% | 0.12% | 0.12% | 0.02% | 0.06% |
| MNLI-mismatched | 33.7% | 55.0% | 9.7% | 1.7% | 0.3% | 0.1% | 0.1% | 0% |


[Figure 4: line plots of accuracy (y-axis) versus context-length interval (x-axis) for vanilla BERT and PROPHET ("ours") on the two dev sets.]

Figure 4: Accuracy for different context lengths on the MNLI-matched (left) and MNLI-mismatched (right)
dev sets. There are approximately 1,000 samples in each interval.

the sentences are entailed or not. Results are shown in Figure 5. To probe the language understanding
ability of our model, we make two subtle changes to the original training sample. Firstly, we change
the entity referred to in the second sentence. We can see that PROPHET learns better alignment
relations between entities than the BERT-base model. Secondly, we add a negation to the sentence.
Although this change is small, it completely changes the semantics of the sentence and reverses the
ground-truth label. PROPHET handles all the given samples correctly, indicating that it is not only
good at reasoning in language understanding but also more robust than the baseline model.

| | Unchanged | Entity change | Negation |
|---|---|---|---|
| Sentence 1 | Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian. | (same) | (same) |
| Sentence 2 | The French railway company is called SNCF. | The French railway company is called SBB. | The French railway company is not called SNCF. |
| Label | not entailment | not entailment | entailment |
| PROPHET | not entailment √ | not entailment √ | entailment √ |
| BERT-base | not entailment √ | entailment × | not entailment × |


Figure 5: An example from the RTE dataset; we use PROPHET and BERT-base to predict the label of the
relation between the two given sentences.

7 CONCLUSION


In this paper, we leverage facts in a newly pre-trained language model, PROPHET, to capture logic
relations, in consideration of the fundamental role PrLMs play in NLP and NLU tasks.
We introduce three novel pre-training tasks and show that PROPHET achieves significant improvements
over a variety of reasoning-related NLP and NLU downstream tasks, including language inference,
sentence classification, semantic similarity, and machine reading comprehension. Further analysis
shows that our model can interpret the inner logical structure of the context well, aiding the
reasoning process.


-----

REFERENCES

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at?
an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan,
Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with
data flow. arXiv preprint arXiv:2009.08366, 2020.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert
with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. Reasoning with transformer-based models:
Deep learning, but shallow reasoning. In 3rd Conference on Automated Knowledge Base
_Construction, 2021._

Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, and Xiaodan Liang. DAGN: Discourse-aware graph
network for logical reasoning. In NAACL, 2021.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy.
SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the
_Association for Computational Linguistics, 8:64–77, 2020._

Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models:
Birds can talk, but cannot fly. arXiv preprint arXiv:1911.03343, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
_arXiv:1412.6980, 2014._

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.
ALBERT: A lite BERT for self-supervised learning of language representations. In International
_Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=H1eA7AEtvS._

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. In Proceedings of the 58th Annual
_Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020._

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A
challenge dataset for machine reading comprehension with logical reasoning. In Christian Bessiere
(ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence,
_IJCAI-20, pp. 3622–3628. International Joint Conferences on Artificial Intelligence Organization,_
7 2020. doi: 10.24963/ijcai.2020/501. URL https://doi.org/10.24963/ijcai.2020/501.
Main track.


Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David
McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd
_annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60,_
2014.

George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):
39–41, 1995.


-----

Ndapandula Nakashole and Tom Mitchell. Language-aware truth assessment of fact candidates. In
_Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume_
_1: Long Papers), pp. 1009–1019, 2014._

Guoshun Nan, Zhijiang Guo, Ivan Sekulić, and Wei Lu. Reasoning with latent structure refinement
for document-level relation extraction. arXiv preprint arXiv:2005.06312, 2020.

Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. Fact-driven logical reasoning. arXiv preprint
_arXiv:2105.10334, 2021._

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. Bert is not a knowledge base (yet): Factual
knowledge vs. name-based reasoning in unsupervised qa. arXiv preprint arXiv:1911.03681, 2019.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and
Bonnie L Webber. The penn discourse treebank 2.0. In LREC. Citeseer, 2008.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. 2018.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21:1–67, 2020.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about
how bert works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.

Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi
Chen, Yanbin Zhao, Yuxiang Lu, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training
for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
_processing systems, pp. 5998–6008, 2017._

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue:
A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint
_arXiv:1804.07461, 2018._

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. Fine-tune bert for
docred with two-step process. arXiv preprint arXiv:1909.11898, 2019.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang,
Ming Zhou, et al. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv
_preprint arXiv:2002.01808, 2020._

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and
Nan Duan. Logic-driven context extension and data augmentation for logical reasoning of text.
_arXiv preprint arXiv:2105.03659, 2021a._

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian
Tang. Kepler: A unified model for knowledge embedding and pre-trained language representation.
_Transactions of the Association for Computational Linguistics, 9:176–194, 2021b._

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia:
Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural
_information processing systems, 32, 2019._

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie
Zhou, and Maosong Sun. Docred: A large-scale document-level relation extraction dataset. arXiv
_preprint arXiv:1906.06127, 2019._

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset
requiring logical reasoning. In International Conference on Learning Representations (ICLR),
April 2020.


-----