# LOGIC PRE-TRAINING OF LANGUAGE MODELS

**Anonymous authors**
Paper under double-blind review

## ABSTRACT

Pre-trained language models (PrLMs) have been shown useful for enhancing a broad range of natural language understanding (NLU) tasks. However, the capacity for capturing logic relations in challenging NLU remains a bottleneck even for state-of-the-art PrLMs, which greatly stalls their reasoning abilities. We therefore propose logic pre-training of language models, leading to PROPHET, a PrLM equipped with logical reasoning ability. To let logic pre-training operate on a clear, accurate, and generalized knowledge basis, we introduce the fact in place of the plain language units used in previous PrLMs. Facts are extracted through syntactic parsing, avoiding unnecessarily complex knowledge injection, and they allow logic-aware models to be trained on more general language text. To explicitly guide the PrLM to capture logic relations, three pre-training objectives are introduced: 1) logical connectives masking to capture sentence-level logic, 2) logical structure completion to accurately recover facts from the original context, and 3) logical path prediction on a logical graph to uncover global logic relationships among facts. We evaluate our model on a broad range of NLP and NLU tasks, including natural language inference, relation extraction, and machine reading comprehension with logical reasoning. Results show that the extracted facts and the newly introduced pre-training tasks help PROPHET achieve significant improvements on all the downstream tasks, especially those related to logical reasoning.

## 1 INTRODUCTION

Machine reasoning in natural language understanding (NLU) aims to teach machines to understand human languages by building and analyzing the connections between facts, events, and observations using logical analysis techniques such as deduction and induction; it is one of the ultimate goals towards human-parity intelligence. Although pre-trained language models (PrLMs), such as BERT (Devlin et al., 2018), GPT (Radford et al., 2018), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), have established state-of-the-art performance on various aspects of NLU, they still fall short on complex language understanding tasks that involve reasoning (Helwe et al., 2021). The major reason is that they are insufficiently capable of capturing logic relations such as negation (Kassner & Schütze, 2019), factual knowledge (Poerner et al., 2019), events (Rogers et al., 2020), and so on. Many previous studies (Sun et al., 2021; Xiong et al., 2019; Wang et al., 2020) are thus motivated to inject knowledge into pre-trained models like BERT and RoBERTa. However, they rely too much on massive external knowledge sources and ignore that language itself is a natural knowledge carrier and the basis for acquiring logical reasoning ability (Ouyang et al., 2021). Taking the context in Figure 1 as an example, previous approaches tend to focus on entities such as the definition of "government" and related concepts like "governor", but overlook the exact relations inherent in this example, thus failing to model the complex reasoning process. Given that PrLMs are the key supporting components in natural language understanding, in this work we propose a fundamental solution by empowering PrLMs with the capacity to capture logic relations, which is necessary for logical reasoning.
However, logical reasoning can only be implemented on the basis of clear, accurate, and generalized knowledge. Therefore, we leverage the fact as the conceptual knowledge unit serving as the basis for logic relation extraction. A fact is organized as a triplet, i.e., in the form of a predicate-argument structure, to represent meanings such as "who-did-what-to-whom" and "who-is-what". Compared with existing studies that inject complex knowledge such as knowledge graphs, the knowledge structure based on facts is far less complicated and more general in representing events and relations in language.

On top of the fact-based knowledge structure, we present PROPHET, a logic-aware pre-trained language model that learns logic-aware relations in a universal way from very large texts. In detail, we introduce three novel pre-training objectives based on the newly introduced knowledge structure: 1) logical connectives masking for learning sentence-level logic connections; 2) logical structure completion on top of facts for regularization, aligning the extracted facts with the original context; 3) logical path prediction to capture the logic relationships between facts. PROPHET is evaluated on a broad range of language understanding tasks: natural language inference, semantic similarity, machine reading comprehension, etc. Experimental results show that the fact is a useful carrier for knowledge modeling, and the newly introduced pre-training tasks improve PROPHET and yield significant gains on downstream tasks.[1]

## 2 RELATED WORK

### 2.1 PRE-TRAINED LANGUAGE MODELS IN NLP

Large pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Radford et al., 2018) have brought dramatic empirical improvements on almost every NLP task in the past few years. A classical paradigm of pre-training is to train neural models on a large corpus with self-supervised pre-training objectives. "Self-supervised" means that the supervision provided in the training process is automatically generated from the raw text instead of being manually annotated. Designing effective criteria for language modeling is one of the major topics in training pre-trained models, since it decides how the model captures knowledge from large-scale unlabeled data. The most popular pre-training objective used today is masked language modeling (MLM), initially used in BERT (Devlin et al., 2018), which randomly masks out tokens and asks the model to recover them given the surrounding context. Recent studies have investigated diverse variants of denoising strategies (Raffel et al., 2020; Lewis et al., 2020), model architectures (Yang et al., 2019), and auxiliary objectives (Lan et al., 2019; Joshi et al., 2020) to strengthen the model during pre-training. Although existing techniques have shown effectiveness in capturing syntactic and semantic information after large-scale pre-training, the resulting models are sensitive to role reversal and struggle with pragmatic inference and role-based event knowledge (Rogers et al., 2020), which are critical to the ultimate goal of complex reasoning that requires uncovering logical structures. Moreover, it is difficult for pre-trained language models to capture the logical structure inherent in texts since logical supervision is rarely available during pre-training. Therefore, we are motivated to explicitly guide the model to capture such clues via our newly introduced self-supervised tasks.
### 2.2 REASONING ABILITY FOR PRE-TRAINED LANGUAGE MODELS

There is a substantial line of research on enhancing the reasoning abilities of pre-trained language models by injecting knowledge. Existing approaches mainly design novel pre-training objectives and leverage abundant knowledge sources such as WordNet (Miller, 1995). Notably, ERNIE 3.0 (Sun et al., 2021) uses a broad range of pre-training objectives spanning word-aware, structure-aware, and knowledge-aware tasks, based on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. WKLM (Xiong et al., 2019) replaces entity mentions in the document with other entities of the same type, and the objective is to distinguish the replaced entities from the original ones. KEPLER (Wang et al., 2021b) encodes textual entity descriptions using embeddings from a PrLM to take full advantage of the abundant textual information. K-Adapter (Wang et al., 2020) designs neural adapters that distinguish the type of knowledge source so as to capture various kinds of knowledge.

Our proposed method differs from previous studies in three aspects. Firstly, our model does not require any external knowledge resources such as WordNet or WikiData used by previous methods; we only use small-scale textual sources following standard PrLMs like BERT (Devlin et al., 2018), along with an off-the-shelf dependency parser to extract facts. Secondly, previous works only consider triplet-level pre-training objectives, whereas we propose a multi-granularity pre-training strategy that considers not only triplet-level information but also sentence-level and global knowledge to enhance logical reasoning. Finally, we propose a new training mechanism apart from masked language modeling (MLM), hoping to shed light on more logic pre-training strategies in this research line.

[1] Our codes have been uploaded as supplemental material and will be open-sourced after the double-blind review period.

Figure 1: How the facts and the logical graph are constructed from raw text inputs. The text "Despite concerns, anarchists participated in the Russian Revolution in opposition to the White movement. However, they met harsh suppression after the Bolshevik government was stabilized." yields facts such as (anarchists, participated, revolution), (revolution, opposite, movement), (they, met, suppression), (suppression, after, stabilized), and (government, was, stabilized). Edges in red denote additional edges added to the logical graph, while text in green indicates the sentence-level logical connectives mentioned in §4.

## 3 PRELIMINARIES

In this section, we introduce the concepts of fact and logical graph, which form the basis of PROPHET. We also describe how facts are extracted for logical graph construction, as illustrated in Figure 1.

### 3.1 FACT

Following Nakashole & Mitchell (2014) and Ouyang et al. (2021), we extract facts, which are triplets $T = \{A_1, P, A_2\}$, where $A_1$ and $A_2$ are the arguments and $P$ is the predicate between them. This form can represent a broad range of facts, reflecting the notions of "who-did-what-to-whom", "who-is-what", etc. We extract such facts in a syntactic way, which makes our approach generic and easy to apply. Given a document, we first split the document into multiple sentences.
For each sentence, we conduct dependency parsing using StanfordCoreNLP (Manning et al., 2014).[2] Over the resulting dependencies, we basically take verb phrases and some prepositions in the sentence as "predicates", and then search for their corresponding actors and actees as the "arguments".

### 3.2 LOGICAL GRAPH

A logical graph is an undirected (but not necessarily connected) graph that represents logical dependency relations between the components of facts. In a logical graph, nodes represent the arguments and predicates of facts, and edges indicate whether two nodes are related within a fact. Such a structure unveils and organizes the semantic information captured by facts. Besides, a logical graph supports reasoning over long-range dependencies by connecting arguments and their relations in different facts across different spans.

We now describe how to construct such graphs from facts. In addition to the relations given directly by facts, we design two further types of edges based on identical mentions and coreference information. (1) There can be identical mentions in different sentences, resulting in repeated nodes across facts; we connect nodes corresponding to the same non-pronoun arguments with edges of type same. (2) We conduct coreference resolution on the context using an off-the-shelf model to identify arguments in facts that refer to the same entity,[3] and add edges of type coref between them. The final logical graph is denoted as $S = (V, E)$, where $V = A_i \cup P$ and $i \in \{1, 2\}$.

[2] https://stanfordnlp.github.io/CoreNLP/. We also tried to use OpenIE directly; however, the performance was not satisfactory.
[3] https://github.com/huggingface/neuralcoref
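To make the two-step procedure above concrete, the following is a minimal Python sketch of fact extraction and logical graph construction. It assumes the dependency parse is already available as (head, relation, dependent) tuples from an off-the-shelf parser such as StanfordCoreNLP, and that coreferent argument pairs come from an external resolver such as neuralcoref; the relation names, pronoun list, and helper names are illustrative assumptions, not the authors' released implementation.

```python
from collections import defaultdict
from itertools import combinations

PRONOUNS = {"he", "she", "it", "they", "we", "you", "i"}

def extract_facts(sent_id, dependencies):
    """Turn one sentence's dependency edges (head, relation, dependent) into
    (A1, P, A2) fact triplets, where the predicate is the shared head verb."""
    subj, obj = defaultdict(list), defaultdict(list)
    for head, rel, dep in dependencies:
        if rel.startswith("nsubj"):
            subj[head].append(dep)                      # actor of the predicate
        elif rel in ("obj", "dobj", "obl", "nmod"):
            obj[head].append(dep)                       # actee / prepositional argument
    facts = []
    for pred in subj.keys() & obj.keys():
        for a1 in subj[pred]:
            for a2 in obj[pred]:
                # nodes are (sentence id, word) so repeated mentions stay distinct
                facts.append(((sent_id, a1), (sent_id, pred), (sent_id, a2)))
    return facts

def build_logical_graph(all_facts, coref_pairs=()):
    """Undirected logical graph: edges inside facts, plus `same` edges between
    identical non-pronoun arguments and `coref` edges from a coreference model."""
    edges = []                                           # (node_u, node_v, edge_type)
    for a1, pred, a2 in all_facts:
        edges += [(a1, pred, "fact"), (pred, a2, "fact")]
    args = [a for a1, _, a2 in all_facts for a in (a1, a2)]
    for u, v in combinations(set(args), 2):
        if u[1].lower() == v[1].lower() and u[1].lower() not in PRONOUNS:
            edges.append((u, v, "same"))
    for u, v in coref_pairs:                             # e.g. "they" -> "anarchists"
        edges.append((u, v, "coref"))
    nodes = {n for u, v, _ in edges for n in (u, v)}
    return nodes, edges

# Toy usage on a fragment of the Figure 1 sentence.
deps = [("met", "nsubj", "they"), ("met", "obj", "suppression")]
facts = extract_facts(1, deps)    # [((1, 'they'), (1, 'met'), (1, 'suppression'))]
nodes, edges = build_logical_graph(facts)
```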
Figure 2: An illustration of the pre-training methods used in PROPHET. The model takes the text, the extracted facts, and randomly sampled node pairs from the logical graph as input, and is pre-trained with three novel objectives: standard masked language modeling applied to sentence-level connectives, fact alignment (logical structure completion), and logical path prediction.

## 4 PROPHET

### 4.1 MODEL ARCHITECTURE

We follow BERT (Devlin et al., 2018) and use a multi-layer bidirectional Transformer (Vaswani et al., 2017) as the model architecture of PROPHET. To keep the focus on the newly introduced techniques, we do not review the ubiquitous Transformer architecture in detail. We develop PROPHET with exactly the same architecture as BERT-base: 12 Transformer layers, a hidden size of 768, 12 attention heads, and 110M parameters in total.

### 4.2 LOGIC-AWARE PRE-TRAINING TASKS

We describe the three tasks used for pre-training PROPHET in this section; Figure 2 gives an illustration. The first task is logical connectives masking (LCM), a generalization of masked language modeling (Devlin et al., 2018) to logical connectives for learning sentence-level representations. The second task is logical structure completion (LSC) for learning the logic relationships inside a fact, where we randomly mask items in facts and then predict them. Finally, a logical path prediction (LPP) task is proposed for recognizing the logical relations between randomly selected node pairs.

**Logical Connective Masking** Logical connective masking extends the masked language modeling (MLM) pre-training objective of Devlin et al. (2018) with a particular focus on connective tokens. We use the Penn Discourse TreeBank 2.0 (PDTB) (Prasad et al., 2008) to identify the logical relations among sentences. Specifically, PDTB 2.0 contains relations manually annotated on the 1-million-word Wall Street Journal (WSJ) corpus, broadly characterized into "Explicit" and "Implicit" connectives. We use the "Explicit" type (100 connectives in total), which appear explicitly in sentences, such as the discourse adverbial "instead" or the subordinating conjunction "because". Taking all the identified connectives plus some randomly sampled other tokens (for a total of 15% of the tokens in the original context), we replace them with a [MASK] token 80% of the time, with a random token 10% of the time, and leave them unchanged 10% of the time. The MLM objective is to predict the original tokens at these sampled positions, which has proven effective in previous work (Devlin et al., 2018; Liu et al., 2019). In this way, the model learns to recover the logical relations between two given sentences, which helps language understanding. The objective of this task is denoted as $\mathcal{L}_{conn}$.

**Logical Structure Completion** To align the representations of the context and the extracted facts, we introduce a pre-training task of logical structure completion. The motivation is to encourage the model to learn structure-aware representations that encode "who-did-what-to-whom"-like meanings for better language understanding. In detail, we randomly select a proportion $\lambda$ of all facts ($\lambda = 20\%$ in this work) from a given context. For each chosen fact, we ask the model to complete either "Argument-Predicate-?" or "Argument-?-Argument" (the two templates are selected with equal probability). We denote the blanks to be completed as $m^a$ and $m^p$ for arguments and predicates, respectively. In our implementation, this objective is kept as simple as masked language modeling, using the original loss of Devlin et al. (2018):

$$\mathcal{L}_{align} = -\sum_{i \in a \cup p} \log D(x_i \mid m^a, m^p), \tag{1}$$

where $D$ is the discriminator that predicts a token from a large vocabulary.
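As a concrete illustration of the LCM corruption step, here is a minimal, self-contained Python sketch of how connective-focused masking could be applied to a tokenized sequence. The short explicit-connective list, the toy vocabulary, and the function name are illustrative assumptions; the released implementation may differ.

```python
import random

# A few of PDTB 2.0's "Explicit" connectives, for illustration only.
EXPLICIT_CONNECTIVES = {"because", "however", "instead", "after", "although", "but"}

def logical_connective_masking(tokens, mask_token="[MASK]", vocab=None,
                               mask_budget=0.15, seed=0):
    """Select all explicit connectives plus random filler tokens (15% of the
    sequence in total), then corrupt them with the 80/10/10 BERT scheme."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "of", "government", "met", "revolution"]

    connective_pos = [i for i, t in enumerate(tokens)
                      if t.lower() in EXPLICIT_CONNECTIVES]
    budget = max(int(mask_budget * len(tokens)), len(connective_pos))
    others = [i for i in range(len(tokens)) if i not in connective_pos]
    selected = connective_pos + rng.sample(others, budget - len(connective_pos))

    corrupted, labels = list(tokens), [None] * len(tokens)   # labels = MLM targets
    for i in selected:
        labels[i] = tokens[i]
        p = rng.random()
        if p < 0.8:
            corrupted[i] = mask_token                         # 80%: [MASK]
        elif p < 0.9:
            corrupted[i] = rng.choice(vocab)                  # 10%: random token
        # else: 10% keep the original token unchanged
    return corrupted, labels

sent = ("however they met harsh suppression after the "
        "bolshevik government was stabilized").split()
print(logical_connective_masking(sent))
```

The same routine covers the LSC corruption as well if the "selected" positions are taken from fact arguments or predicates instead of connectives, since both objectives reuse the standard MLM loss.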
**Logical Path Prediction** To learn representations from the constructed logical graph, thus endowing the model with global logical reasoning ability, we propose the pre-training task of predicting whether there exists a path between two selected nodes in the logical graph. In this way, the model learns to relate arguments and predicates across long distances and across different facts. We randomly sample 20% of the nodes in the logical graph to form a set $V'$, giving $C_{|V'|}^{2}$ node pairs in total. We set a maximum number $max_p$ of node pairs to predict. To avoid bias in the training process, we try to ensure that $max_p/2$ of them are positive samples and the rest are negative samples, thus balancing the positive-negative ratio. If the number of positive or negative samples is smaller than $max_p/2$, we simply keep the original pairs. Formally, the pre-training objective of this task is calculated as below, following Guo et al. (2020):

$$\mathcal{L}_{Path} = -\sum_{v_i, v_j \in V'} \big[\delta \log \sigma[v_i, v_j] + (1-\delta) \log(1 - \sigma[v_i, v_j])\big], \tag{2}$$

where $\delta$ is 1 when $v_i$ and $v_j$ are connected by a path and 0 otherwise, and $[v_i, v_j]$ denotes the concatenation of the representations of $v_i$ and $v_j$.

The final training objective is the sum of the three losses above:

$$\mathcal{L} = \mathcal{L}_{conn} + \mathcal{L}_{align} + \mathcal{L}_{Path}. \tag{3}$$
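The PyTorch-style sketch below shows one way the balanced pair sampling and the path-prediction loss of Eq. (2) could be realized on top of node representations pooled from the encoder. The scoring head (a linear layer over concatenated node vectors fed to a sigmoid) and all names are illustrative assumptions rather than the authors' implementation.

```python
import random
import torch
import torch.nn as nn

def sample_node_pairs(reachable, nodes, max_p=32, sample_ratio=0.2, seed=0):
    """Sample 20% of the nodes, enumerate their pairs, and keep a roughly
    balanced set of positive (path exists) and negative pairs."""
    rng = random.Random(seed)
    sub = rng.sample(nodes, max(2, int(sample_ratio * len(nodes))))
    pairs = [(u, v) for i, u in enumerate(sub) for v in sub[i + 1:]]
    pos = [(u, v, 1.0) for u, v in pairs if reachable(u, v)]
    neg = [(u, v, 0.0) for u, v in pairs if not reachable(u, v)]
    half = max_p // 2
    return rng.sample(pos, min(half, len(pos))) + rng.sample(neg, min(half, len(neg)))

class PathPredictionHead(nn.Module):
    """sigma([v_i, v_j]): a linear score over concatenated node representations."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.score = nn.Linear(2 * hidden_size, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()        # binary cross-entropy of Eq. (2)

    def forward(self, node_repr, pairs):
        # node_repr: dict mapping node -> (hidden_size,) tensor pooled from the encoder
        feats = torch.stack([torch.cat([node_repr[u], node_repr[v]])
                             for u, v, _ in pairs])
        labels = torch.tensor([d for _, _, d in pairs])
        return self.loss_fn(self.score(feats).squeeze(-1), labels)
```

Reachability (the `reachable` callable) can be precomputed once per document, e.g. with a union-find or BFS over the logical graph's edges.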
### 4.3 PRE-TRAINING DETAILS

We use English Wikipedia (1.1 million articles in total) and split it into training and validation sets with a ratio of 19:1. We omit the "Reference" and "Literature" parts of each article to ensure data quality. Following previous practice (Devlin et al., 2018), we limit the length of each sequence to 512 tokens and set the batch size to 128. We use Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.98, ε = 1e-6, and a weight decay of 0.01. We pre-train our model for 500k steps on 8 NVIDIA V100 32G GPUs, with FP16 and DeepSpeed for training acceleration. Initialized with the pre-trained weights of BERT-base, we continue training our models for 200k steps.

## 5 EXPERIMENTS

### 5.1 TASKS AND DATASETS

Our experiments are conducted on a broad range of language understanding tasks, including natural language inference, machine reading comprehension, semantic similarity, and text classification. Some of these tasks are part of the GLUE benchmark (Wang et al., 2018). We also extend our experiments to DocRED (Yao et al., 2019), a widely used benchmark for document-level relation extraction, to test generalizability. To verify our model's logical reasoning ability, we further experiment on two recent logical reasoning datasets in the form of machine reading comprehension, ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020).

| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *In literature* | | | | | | | | | |
| BERT-base | 52.1 | 93.5 | 84.6/83.4 | 90.5 | 66.4 | 88.9 | 71.2 | 85.8 | 79.6 |
| SemBERT-base | 57.8 | 93.5 | 84.4/84.0 | 90.9 | 69.3 | 88.2 | 71.8 | 87.3 | 80.8 |
| *Our implementation* | | | | | | | | | |
| BERT-base | 53.6 | 93.5 | 84.6/83.4 | 90.9 | 66.6 | 88.6 | 71.2 | 85.8 | 79.8 |
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |

Table 1: Leaderboard results on the GLUE benchmark. F1 scores are reported for QQP and MRPC, Spearman correlations for STS-B, and accuracy for the other tasks.

| Model | ReClor Dev | ReClor Test | ReClor Test-E | ReClor Test-H | LogiQA Dev | LogiQA Test |
|---|---|---|---|---|---|---|
| Human Performance* | - | 63.0 | 57.1 | 67.2 | - | 86.0 |
| *In literature* | | | | | | |
| FOCAL REASONER (Ouyang et al., 2021) | 78.6 | 73.3 | 86.4 | 63.0 | 47.3 | 45.8 |
| LReasoner (Wang et al., 2021a) | 74.6 | 71.8 | 83.4 | 62.7 | 45.8 | 43.3 |
| DAGN (Huang et al., 2021) | 65.8 | 58.3 | 75.9 | 44.5 | 36.9 | 39.3 |
| BERT-large (Devlin et al., 2018) | 53.8 | 49.8 | 72.0 | 32.3 | 34.1 | 31.0 |
| XLNet-large (Yang et al., 2019) | 62.0 | 56.0 | 75.7 | 40.5 | - | - |
| RoBERTa-large (Liu et al., 2019) | 62.6 | 55.6 | 75.5 | 40.0 | 35.0 | 35.3 |
| DeBERTa-large (He et al., 2020) | 74.4 | 68.9 | 83.4 | 57.5 | 44.4 | 41.5 |
| *Our implementation* | | | | | | |
| BERT-base | 51.2 | 47.3 | 71.6 | 28.2 | 33.8 | 32.1 |
| PROPHET | 53.4 | 48.8 | 72.4 | 32.2 | 35.2 | 34.1 |

Table 2: Accuracy on the ReClor and LogiQA datasets. The published methods are based on large models.

### 5.2 RESULTS

Table 1 shows results on the GLUE benchmark datasets. We make the following observations. (1) PROPHET obtains substantial gains over the BERT baseline (continually trained for 200k steps for a fair comparison), indicating that our model works well for general language understanding. (2) PROPHET performs particularly well on language inference tasks, including MNLI, QNLI, and RTE,[4] which indicates our model's ability to reason. (3) On both large-scale datasets such as QQP and MNLI and small datasets such as CoLA and STS-B, our model shows consistent improvements, indicating its robustness. (4) From Table 2, we can see that PROPHET improves the logical reasoning ability of the BERT baseline by a large margin. In particular, armed with our approach, the BERT-base results on the two datasets are comparable to or even surpass the BERT-large results.

In addition, we conducted experiments on a large-scale human-annotated dataset for document-level relation extraction (Yao et al., 2019). The results are shown in Table 3.[5] We can see that PROPHET also does well on document-level relation extraction, outperforming the baseline substantially; it even surpasses the Two-Phase BERT. Moreover, our model is especially good at coping with inter-sentence relations compared with the baseline models, which means that it is indeed capable of synthesizing information across multiple sentences of a document, verifying the effectiveness of leveraging sentence-level and global information.

[4] We exclude the problematic WNLI set.
[5] We only report the results for Ign F1 in the annotated setting, as the distant-supervision setting is too slow to train.

| Model | Dev F1 | Dev Intra-F1 | Dev Inter-F1 | Test F1 |
|---|---|---|---|---|
| BERT-base* (Devlin et al., 2018) | 54.2 | 61.6 | 47.2 | 53.2 |
| Two-Phase BERT* (Wang et al., 2019) | 54.4 | 61.8 | 47.3 | 53.9 |
| PROPHET | 54.8 (↑0.6) | 62.4 (↑0.8) | 47.5 (↑0.3) | 54.3 (↑1.1) |

Table 3: Main results on the dev and test sets of DocRED. * indicates results taken from Nan et al. (2020). Intra-F1 and Inter-F1 denote F1 scores for intra- and inter-sentence relations, following the setting of Nan et al. (2020).

| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.9 | 85.3/84.3 | 91.4 | 69.8 | 89.5 | 72.0 | 86.0 | 81.1 |
| w/o LCM | 53.6 | 93.5 | 85.1/84.0 | 90.9 | 68.2 | 88.6 | 71.2 | 85.8 | 80.1 |
| w/o LSC | 53.6 | 93.6 | 85.0/84.1 | 91.3 | 69.0 | 88.9 | 71.4 | 85.9 | 80.3 |
| w/o LPP | 52.1 | 93.0 | 84.6/83.4 | 90.9 | 66.4 | 88.6 | 71.2 | 85.8 | 79.6 |

Table 4: Ablation studies of PROPHET on the GLUE test set.
## 6 ANALYSIS

### 6.1 ABLATION STUDY

To investigate the impact of the different objectives, we evaluate three variants of PROPHET as described in Section 4.2: 1) the w/o LCM model drops the logical connectives masking pre-training objective, 2) the w/o LSC model leaves out the logical structure completion objective, and 3) the w/o LPP model only uses the connective masking and structure completion objectives. The results are shown in Table 4.

Based on the ablation studies, we draw the following conclusions. Firstly, all three components contribute to the performance, as removing any one of them causes a drop in the average score. In particular, the average score drops the most when we remove the logical path prediction objective, which highlights the importance of modeling chain-like relations between events. Secondly, logical path prediction contributes the most to the reasoning abilities, as the performance on language inference improves the most when we add the sentence-level connective masking objective and the logical path prediction task.

### 6.2 COMPARISON BETWEEN FACT AND ENTITY-LIKE KNOWLEDGE

We also replace the injected facts with the common practice of using entity-like knowledge, namely named entities. In detail, we change the arguments in facts into named entities recognized by StanfordCoreNLP,[6] and leave the extracted predicates unchanged, resulting in triplets of the form <NE1, predicate, NE2> (NE stands for named entity). If no named entity is recognized for a fact, we simply discard it.

The results are shown in Table 5. We can see that the performance drops considerably, even below vanilla BERT. This is intuitive, as the number of named entities is far smaller than the number of extracted facts, missing much of the information inherent in the context, whereas our facts capture the knowledge used in the reasoning process and provide a fundamental reasoning basis.

[6] https://stanfordnlp.github.io/CoreNLP/

| Model | CoLA | SST-2 | MNLI | QNLI | RTE | MRPC | QQP | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| PROPHET | 57.0 | 93.6 | 85.0/84.1 | 91.4 | 69.8 | 89.2 | 71.4 | 86.0 | 81.0 |
| w/ named entities | 50.4 | 93.2 | 84.9/84.2 | 90.8 | 68.7 | 88.4 | 71.0 | 84.9 | 79.3 |

Table 5: Results on the GLUE test set when replacing facts with named entities and keeping the relations unchanged.

### 6.3 ATTENTION MATRIX HEATMAP

We plot the token-level attention matrix to see how our model interprets the context, as the heatmaps in Figure 3 show.

Figure 3: Heatmaps of the attention matrices of vanilla BERT and our implemented PROPHET for the sentence "However, they met harsh suppression after the Bolshevik government was stabilized.". Weights are selected from the first head of the last attention layer.

From the figure, we can see that vanilla BERT attends to delimiters, particularly punctuation, as suggested by Clark et al. (2019). In comparison, our model exhibits a quite different attention distribution. Firstly, it clearly decreases the influence of punctuation. Secondly, it pays more attention to tokens carrying discourse-level information, such as "however" and "after", which is consistent with our motivation. It also captures the relations of pronouns well, and the event characteristics are illustrated by the "after suppression" phrase.
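For readers who want to reproduce this kind of visualization, the sketch below extracts the first head of the last attention layer with the HuggingFace transformers library and plots it as a heatmap. Since the PROPHET checkpoint is not public, a stock bert-base-uncased checkpoint stands in here as an assumption; swapping in another checkpoint path would produce the corresponding plot.

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

# A public BERT checkpoint stands in for the (not yet released) PROPHET weights.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

sentence = ("However, they met harsh suppression after the "
            "Bolshevik government was stabilized.")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions       # tuple: one tensor per layer

# First head of the last attention layer, as in Figure 3.
attn = attentions[-1][0, 0]                       # shape: (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("Attention heatmap: last layer, head 0")
plt.tight_layout()
plt.show()
```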
### 6.4 EFFECT OF DIFFERENT CONTEXT LENGTHS

We group samples into ten subsets of roughly equal size (around 1000 samples per interval) by context length, since the majority of samples concentrate on lengths under 60 tokens. The length statistics of the MNLI-matched and MNLI-mismatched dev sets are shown in Table 6. We then calculate the accuracy of the baseline and PROPHET per group for both the matched and mismatched sets, as shown in Figure 4. We observe that the performance of the baseline drops dramatically on long contexts, especially those longer than 45 tokens, while our model performs more robustly on those intervals (the slope of the dashed line is gentler).

| Dataset | [0, 29) | [30, 59) | [60, 89) | [90, 119) | [120, 149) | [150, 179) | [180, 209) | [210, 239) |
|---|---|---|---|---|---|---|---|---|
| MNLI-matched | 39.2% | 49.8% | 9.6% | 1.0% | 0.12% | 0.12% | 0.02% | 0.06% |
| MNLI-mismatched | 33.7% | 55.0% | 9.7% | 1.7% | 0.3% | 0.1% | 0.1% | 0% |

Table 6: Distribution of context length on the dev sets of the MNLI-matched and MNLI-mismatched datasets.

Figure 4: Accuracy over different context lengths on the MNLI-matched (left) and MNLI-mismatched (right) dev sets. There are approximately 1000 samples in each interval.

### 6.5 CASE STUDY

We also give a case study to demonstrate that PROPHET enhances the reasoning process in language understanding. Given two sentences, we use PROPHET and BERT-base to predict whether the sentences are entailed or not. Results are shown in Figure 5. To probe the language understanding ability of our model, we make two subtle changes to the original training sample. Firstly, we change the entity referred to in the second sentence; PROPHET learns better alignment relations between entities than the BERT-base model. Secondly, we add a negation to the sentence. Although this change is small, it completely changes the semantics of the sentence and reverses the ground-truth label. PROPHET predicts all the given samples correctly, indicating that it is not only good at reasoning in language understanding but also more robust than the baseline models.

| | Unchanged | Entity change | Negation |
|---|---|---|---|
| Input | Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian. Sentence 2: The French railway company is called SNCF. | Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian. Sentence 2: The French railway company is called SBB. | Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian. Sentence 2: The French railway company is not called SNCF. |
| Label | not entailment | not entailment | entailment |
| Prediction | PROPHET: not entailment √; BERT-base: not entailment √ | PROPHET: not entailment √; BERT-base: entailment × | PROPHET: entailment √; BERT-base: not entailment × |

Figure 5: An example from the RTE dataset, where PROPHET and BERT-base predict the relation between two given sentences under small perturbations.

## 7 CONCLUSION

In this paper, considering the fundamental role that PrLMs play in NLP and NLU tasks, we leverage facts in a newly pre-trained language model, PROPHET, to capture logic relations.
We introduce three novel pre-training tasks and show that PROPHET achieves significant improvements over a variety of downstream NLP and NLU tasks that involve logical reasoning, including language inference, sentence classification, semantic similarity, and machine reading comprehension. Further analysis shows that our model can interpret the inner logical structure of the context well, aiding the reasoning process.

## REFERENCES

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. Reasoning with transformer-based models: Deep learning, but shallow reasoning. In 3rd Conference on Automated Knowledge Base Construction, 2021.

Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, and Xiaodan Liang. DAGN: Discourse-aware graph network for logical reasoning. In NAACL, 2021.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.

Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. arXiv preprint arXiv:1911.03343, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=H1eA7AEtvS.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020.

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3622–3628, 2020. doi: 10.24963/ijcai.2020/501. URL https://doi.org/10.24963/ijcai.2020/501.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60, 2014.

George A Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Ndapandula Nakashole and Tom Mitchell. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1009–1019, 2014.

Guoshun Nan, Zhijiang Guo, Ivan Sekulić, and Wei Lu. Reasoning with latent structure refinement for document-level relation extraction. arXiv preprint arXiv:2005.06312, 2020.

Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. Fact-driven logical reasoning. arXiv preprint arXiv:2105.10334, 2021.

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. arXiv preprint arXiv:1911.03681, 2019.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. The Penn Discourse TreeBank 2.0. In LREC, 2008.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.

Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, et al. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. Fine-tune BERT for DocRED with two-step process. arXiv preprint arXiv:1909.11898, 2019.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, Ming Zhou, et al. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808, 2020.

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. Logic-driven context extension and data augmentation for logical reasoning of text. arXiv preprint arXiv:2105.03659, 2021a.

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics, 9:176–194, 2021b.
Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. DocRED: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127, 2019.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR), April 2020.