# SIMPLE GNN REGULARISATION FOR 3D MOLECULAR PROPERTY PREDICTION & BEYOND

**Jonathan Godwin, Michael Schaarschmidt, Alexander Gaunt,** **Alvaro Sanchez-Gonzalez, Yulia Rubanova, Petar Veličković,** **James Kirkpatrick & Peter Battaglia**

DeepMind, London

{jonathangodwin}@deepmind.com

## ABSTRACT

In this paper we show that simple noisy regularisation can be an effective way to address oversmoothing. We argue that regularisers addressing oversmoothing should both penalise node latent similarity and encourage meaningful node representations. From this observation we derive “Noisy Nodes”, a simple technique in which we corrupt the input graph with noise, and add a noise correcting node-level loss. The diverse node level loss encourages latent node diversity, and the denoising objective encourages graph manifold learning. Our regulariser applies well-studied methods in simple, straightforward ways which allow even generic architectures to overcome oversmoothing and achieve state of the art results on quantum chemistry tasks, and improve results significantly on Open Graph Benchmark (OGB) datasets. Our results suggest Noisy Nodes can serve as a complementary building block in the GNN toolkit.

## 1 INTRODUCTION

Graph Neural Networks (GNNs) are a family of neural networks that operate on graph structured data by iteratively passing learned messages over the graph’s structure (Scarselli et al., 2009; Bronstein et al., 2017; Gilmer et al., 2017; Battaglia et al., 2018; Shlomi et al., 2021). While Graph Neural Networks have demonstrated success in a wide variety of tasks (Zhou et al., 2020a; Wu et al., 2020; Bapst et al., 2020; Schütt et al., 2017; Klicpera et al., 2020a), it has been proposed that in practice “oversmoothing” limits their ability to benefit from overparametrization. Oversmoothing is a phenomenon where a GNN’s latent node representations become increasingly indistinguishable over successive steps of message passing (Chen et al., 2019).
Once these representations are oversmoothed, the relational structure of the representation is lost, and further message-passing cannot improve expressive capacity. We argue that the challenges of overcoming oversmoothing are twofold: first, finding a way to encourage node latent diversity; second, encouraging the diverse node latents to encode meaningful graph representations. Here we propose a simple noise regulariser, Noisy Nodes, and demonstrate how it overcomes these challenges across a range of datasets and architectures, achieving top results on OC20 IS2RS & IS2RE direct, QM9 and OGBG-PCQM4Mv1.

Our “Noisy Nodes” method is a simple technique for regularising GNNs and associated training procedures. During training, our noise regularisation approach corrupts the input graph’s attributes with noise, and adds a per-node noise correction term. We posit that our Noisy Nodes approach is effective because the model is rewarded for maintaining and refining distinct node representations through message passing to the final output, which causes it to resist oversmoothing. Like denoising autoencoders, it encourages the model to explicitly learn the manifold on which the uncorrupted input graph’s features lie, analogous to a form of representation learning. When applied to 3D molecular prediction tasks, it encourages the model to distinguish between low and high energy states. We find that applying Noisy Nodes reduces oversmoothing for shallower networks, and allows us to see improvements with added depth, even on tasks for which depth was assumed to be unhelpful.

This study’s approach is to investigate the combination of Noisy Nodes with generic, popular baseline GNN architectures. For 3D molecular prediction we use a standard architecture working on 3D point clouds developed for particle fluid simulations, the Graph Net Simulator (GNS) (Sanchez-Gonzalez* et al., 2020), which has also been used for molecular property prediction (Hu et al., 2021b).
Without using Noisy Nodes the GNS is not a competitive model, but using Noisy Nodes allows the GNS to achieve top performance on three 3D molecular property prediction tasks: it improves on previous work by 43% on the OC20 IS2RE direct task and by 12% on OC20 IS2RS direct, and achieves top results on 3 out of 12 of the QM9 tasks. For non-spatial GNN benchmarks we test an MPNN (Gilmer et al., 2017) on OGBG-MOLPCBA and OGBG-PCQM4M (Hu et al., 2021a) and again see significant improvements. Finally, we applied Noisy Nodes to a GCN (Kipf & Welling, 2016), arguably the most popular and simple GNN, trained on OGBN-Arxiv and see similar results. These results suggest Noisy Nodes can serve as a complementary GNN building block.

## 2 PRELIMINARIES: GRAPH PREDICTION PROBLEM

Let $G = (V, E, g)$ be an input graph. The nodes are $V = \{v_1, \dots, v_{|V|}\}$, where $v_i \in \mathbb{R}^{d_v}$. The directed, attributed edges are $E = \{e_1, \dots, e_{|E|}\}$: each edge includes a sender node index, receiver node index, and edge attribute, $e_k = (s_k, r_k, \mathbf{e}_k)$, respectively, where $s_k, r_k \in \{1, \dots, |V|\}$ and $\mathbf{e}_k \in \mathbb{R}^{d_e}$. The graph-level property is $g \in \mathbb{R}^{d_g}$.

The goal is to predict a target graph, $G'$, with the same structure as $G$, but different node, edge, and/or graph-level attributes. We denote $\hat{G}'$ as a model’s prediction of $G'$. Some error metric defines the quality of $\hat{G}'$ with respect to the target $G'$, $\mathrm{Error}(\hat{G}', G')$, which the training loss terms are defined to optimize. In this paper the phrase “message passing steps” is synonymous with “GNN layers”.

## 3 OVERSMOOTHING

“Oversmoothing” is when the node latent vectors of a GNN become very similar after successive layers of message passing. Once nodes are identical there is no relational information contained in the nodes, and no higher-order latent graph representations can be learned.
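A toy illustration of this collapse may help. The following is a minimal sketch, not the paper's experimental setup: it assumes a hypothetical 4-node path graph with made-up 2-d features, a sum over neighbours plus a self-loop, and an identity weight matrix. It tracks the mean pairwise cosine distance between node latents before and after repeated message passing.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(nodes):
    """Average cosine distance over all unordered node pairs."""
    pairs = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
    return sum(cosine_distance(nodes[i], nodes[j]) for i, j in pairs) / len(pairs)


def message_passing_step(nodes, neighbours):
    """Replace each node latent by the sum of itself and its neighbours.

    Assumption: identity weight matrix and a self-loop, chosen only to keep
    the toy example short.
    """
    updated = []
    for i, latent in enumerate(nodes):
        total = list(latent)
        for j in neighbours[i]:
            total = [t + x for t, x in zip(total, nodes[j])]
        updated.append(total)
    return updated


# Hypothetical path graph 0 - 1 - 2 - 3 with made-up node features.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]

before = mean_pairwise_distance(nodes)
for _ in range(30):
    nodes = message_passing_step(nodes, neighbours)
after = mean_pairwise_distance(nodes)

print(f"mean pairwise cosine distance: before={before:.3f}, after={after:.2e}")
```

In this sketch the node latents end up pointing in almost the same direction, so the pairwise distance shrinks towards zero even though the initial features were clearly distinct.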
It is easiest to see this effect with the update function of a Graph Convolutional Network with no adjacency normalization, $v_i^{k} = \sum_{j} W v_j^{k-1}$, with $j \in \mathrm{Neighborhood}_{v_i}$, $W \in \mathbb{R}^{d_g \times d_g}$ and $k$ the layer index. As the number of applications increases, the averaging effect of the summation forces the nodes to become almost identical. However, as soon as residual connections are added we can construct a network that need not suffer from oversmoothing by setting the residual updates to zero at a similarity threshold. Similarly, multi-head attention (Vaswani et al., 2017; Veličković et al., 2018) and GNNs with edge updates (Battaglia et al., 2018; Gilmer et al., 2017) can modulate node updates. As such, for modern GNNs oversmoothing is primarily a “training” problem, i.e. how to choose model architectures and regularisers to encourage and preserve meaningful latent relational representations.

We can discern two desiderata for a regulariser or loss that addresses oversmoothing. First, it should penalise identical node latents. Second, it should encourage meaningful latent representations of the data. One such example may be the auto-regressive loss of transformer based language models (Brown et al., 2020). In this case, each word (equivalent to node) prediction must be distinct, and the auto-regressive loss encourages relational dependence upon prior words. We can take inspiration from this observation to derive auxiliary losses that both have diverse node targets and encourage relational representation learning. In the following section we derive one such regulariser, Noisy Nodes.

## 4 NOISY NODES

Noisy Nodes tackles the oversmoothing problem by adding a diverse noise correction target, modifying the original graph prediction problem definition in several ways. It introduces a graph corrupted by noise, $\tilde{G} = (\tilde{V}, \tilde{E}, \tilde{g})$, where $\tilde{v}_i \in \tilde{V}$ is constructed by adding noise, $\sigma_i$, to the input nodes, $\tilde{v}_i = v_i + \sigma_i$.
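The node corruption just defined, combined with a per-node noise correction term, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the noise is taken to be i.i.d. zero-mean Gaussian, the correction term is a mean squared error against the clean node features, and the names `corrupt_nodes`, `noisy_nodes_loss`, `noise_std` and `aux_weight` are hypothetical.

```python
import random


def corrupt_nodes(nodes, noise_std, rng):
    """Return (noisy_nodes, noise) using i.i.d. zero-mean Gaussian noise."""
    noise = [[rng.gauss(0.0, noise_std) for _ in node] for node in nodes]
    noisy = [[x + s for x, s in zip(node, sigma)] for node, sigma in zip(nodes, noise)]
    return noisy, noise


def mean_squared_error(predictions, targets):
    """Mean squared error over all nodes and features."""
    squared = [
        (p - t) ** 2
        for pred, target in zip(predictions, targets)
        for p, t in zip(pred, target)
    ]
    return sum(squared) / len(squared)


def noisy_nodes_loss(primary_loss, node_predictions, clean_nodes, aux_weight):
    """Primary loss plus a weighted per-node denoising term (assumed to be MSE)."""
    return primary_loss + aux_weight * mean_squared_error(node_predictions, clean_nodes)


rng = random.Random(0)
# Made-up node features (e.g. three 3D positions).
clean_nodes = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0]]
noisy_nodes, noise = corrupt_nodes(clean_nodes, noise_std=0.02, rng=rng)

# Stand-in for a GNN's node-level output head: it simply echoes the noisy
# input, so the auxiliary term equals the mean squared noise.
node_predictions = noisy_nodes
primary_loss = 0.5  # placeholder value for the graph-level loss
total = noisy_nodes_loss(primary_loss, node_predictions, clean_nodes, aux_weight=0.1)
print(f"total loss: {total:.6f}")
```

A real model would replace the echoing stand-in with a learned node-level head, so that the auxiliary term falls only when the network actually removes the noise for each node individually.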
The edges, $\tilde{E}$, and graph-level attribute, $\tilde{g}$, can either be uncorrupted by noise (i.e., $\tilde{E} = E$, $\tilde{g} = g$), calculated from the noisy nodes (for example in a nearest neighbors graph), or corrupted independent of the nodes; these are minor choices that can be informed by the specific problem setting.

Figure 1: Noisy Node mechanics during training. Input positions are corrupted with noise σ, and the training objective is the node-level difference between target positions and the noisy inputs.

Figure 2: Per layer node latent diversity, measured by MAD on a 16 layer MPNN trained on OGBG-MOLPCBA. Noisy Nodes maintains a higher level of diversity throughout the network than competing methods.

Our method requires a noise correction target to prevent oversmoothing by enforcing diversity in the last layers of the GNN, which can be achieved with an auxiliary denoising autoencoder loss. For example, where the Error is defined with respect to graph-level predictions (e.g., predict the minimum energy value of some molecular system), a second output head can be added to the GNN architecture which requires denoising the inputs as targets. Alternatively, if the inputs and targets are in the same real domain, as is the case for physical simulations, we can adjust the target for the noise. Figure 1 demonstrates this Noisy Nodes set up. The auxiliary loss is weighted by a constant coefficient λ ∈ R.

In Figure 2 we illustrate the impact of Noisy Nodes on oversmoothing by plotting the Mean Average Distance (MAD) (Chen et al., 2020) of the residual updates of each layer of an MPNN trained on the QM9 (Ramakrishnan et al., 2014) dataset, and compare it to alternative methods DropEdge (Rong et al., 2019) and DropNode (Do et al., 2021). MAD is a measure of the diversity of graph node features, often used to quantify oversmoothing: the higher the number, the more diverse the node features; the lower the number, the less diverse.
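To make the diversity measure concrete, the following is a minimal sketch of a MAD-style computation. It follows our reading of the definition in Chen et al. (2020): cosine distances between node features are averaged over the non-zero entries for each node, then over the nodes with a non-zero average. Including every node pair (no neighbourhood mask), the tolerance `eps`, and the function names are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mad(nodes, eps=1e-12):
    """MAD-style diversity score over all node pairs.

    Assumptions: every pair of distinct nodes is included, node vectors are
    non-zero, and distances below `eps` are treated as zero.
    """
    per_node = []
    for i, node in enumerate(nodes):
        distances = [
            cosine_distance(node, other) for j, other in enumerate(nodes) if j != i
        ]
        non_zero = [d for d in distances if d > eps]
        per_node.append(sum(non_zero) / len(non_zero) if non_zero else 0.0)
    active = [d for d in per_node if d > eps]
    return sum(active) / len(active) if active else 0.0


collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical latents
diverse = [[1.0, 0.0], [0.0, 1.0]]                # orthogonal latents

print("collapsed:", mad(collapsed))
print("diverse:", mad(diverse))
```

Identical latents give a score of zero and orthogonal latents give a score of one, matching the reading that a higher value means more diverse node features.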
In this plot we can see that for Noisy Nodes the node updates remain diverse for all of the layers, whereas without Noisy Nodes diversity is lost. Further analysis of MAD across seeds and with sorted layers can be seen in Appendix Figures 7 and 6 for models applied to 3D point clouds.

**The Graph Manifold Learning Perspective.** By using an implicit mapping from corrupted data to clean data, the Noisy Nodes objective encourages the model to learn the manifold on which the clean data lies; we speculate that the GNN learns to go from low probability graphs to high probability graphs. In the autoencoder case the GNN learns the manifold of the input data. When node targets are provided, the GNN learns the manifold of the target data (e.g. the manifold of atoms at equilibrium). We speculate that such a manifold may include commonly repeated substructures that are useful for downstream prediction tasks. A similar motivation can be found for denoising in Vincent et al. (2010) and Song & Ermon (2019).

**The Energy Perspective for Molecular Property Prediction.** Local, random distortions of the geometry of a molecule at a local energy minimum are almost certainly higher energy configurations. As such, a task that maps from a noised molecule to a local energy minimum is learning a mapping from high energy to low energy. Data such as QM9 contains molecules at local minima.

Some problems have input data that is already high energy, and targets that are at equilibrium. For these datasets we can generate new high energy states by adding noise to the inputs but keeping the equilibrium target the same; Figure 1 demonstrates this approach. To preserve translation invariance we use displacements between input and target, $\Delta$; the corrected target after noise is $\Delta - \sigma$.

## 5 RELATED WORK

**Oversmoothing.** Recent work has aimed to understand why it is challenging to realise the benefits of training deeper GNNs (Wu et al., 2020).
Since first being noted in Li et al. (2018), oversmoothing has been studied extensively and regularisation techniques have been suggested to overcome it (Chen et al., 2019; Cai & Wang, 2020; Rong et al., 2019; Zhou et al., 2020b; Yang et al., 2020; Do et al., 2021; Zhao & Akoglu, 2020). A recent paper, Li et al. (2021), finds, as in previous work (Li et al., 2019; 2020), the optimal depth for some datasets they evaluate on to be far lower (5 for OGBN-Arxiv from the Open Graph Benchmark (Hu et al., 2020a), for example) than the 1000 layers possible.

**Denoising & Noise Models.** Training neural networks with noise has a long history (Sietsma & Dow, 1991; Bishop, 1995). Of particular relevance are Denoising Autoencoders (Vincent et al., 2008) in which an autoencoder is trained to map corrupted inputs $\tilde{x}$ to uncorrupted inputs $x$. Denoising Autoencoders have found particular success as a form of pre-training for representation learning (Vincent et al., 2010). More recently, in research applying GNNs to simulation (Sanchez-Gonzalez et al., 2018; Sanchez-Gonzalez* et al., 2020; Pfaff et al., 2020), Gaussian noise is added during training to input positions of a ground truth simulator to mimic the distribution of errors of the learned simulator. Pre-training methods (Devlin et al., 2019; You et al., 2020; Thakoor et al., 2021) are another similar approach; most similarly to our method, Hu et al. (2020b) apply a reconstruction loss to graphs with masked nodes to generate graph embeddings for use in downstream tasks. FLAG (Kong et al., 2020) adds adversarial noise during training to input node features as a form of data augmentation for GNNs that demonstrates improved performance for many tasks. It does not add an additional auxiliary loss, which we find is essential for addressing oversmoothing. In other related GNN work, Sato et al. (2021) use random input features to improve generalisation of graph neural networks.
Adding noise to help input node disambiguation has also been covered in Dasoulas et al. (2019), Loukas (2020), Vignac et al. (2020) and Murphy et al. (2019), but there is no auxiliary loss. Finally, we take inspiration from Vincent et al. (2008; 2010), Vincent (2011) and Song & Ermon (2019), which use the observation that noised data lies off the data manifold for representation learning and generative modelling.

**Machine Learning for 3D Molecular Property Prediction.** One application of GNNs is to speed up quantum chemistry calculations which operate on 3D positions of a molecule (Duvenaud et al., 2015; Gilmer et al., 2017; Schütt et al., 2017; Hu et al., 2021b). Common goals are the prediction of molecular properties (Ramakrishnan et al., 2014), forces (Chmiela et al., 2017), energies (Chanussot* et al., 2020) and charges (Unke & Meuwly, 2019).

A common approach to embed physical symmetries is to design a network that predicts a rotation and translation invariant energy (Schütt et al., 2017; Klicpera et al., 2020a; Liu et al., 2021). The input features of such models include distances (Schütt et al., 2017), angles (Klicpera et al., 2020b;a) or torsions and higher order terms (Liu et al., 2021). An alternative approach to embedding symmetries is to design a rotation equivariant neural network that uses equivariant representations (Thomas et al., 2018; Köhler et al., 2019; Kondor et al., 2018; Fuchs et al., 2020; Batzner et al., 2021; Anderson et al., 2019; Satorras et al., 2021).

**Machine Learning for Bond and Atom Molecular Graphs.** Predicting properties from molecular graphs without 3D points, such as graphs of bonds and atoms, is studied separately and often used to benchmark generic graph property prediction models such as GCNs (Hu et al., 2020a) or GATs (Veličković et al., 2018). Models developed for 3D molecular property prediction cannot be applied to bond and atom graphs. Common datasets that contain such data are OGBG-MOLPCBA and OGBG-MOLHIV.
## 6 3D MOLECULAR PROPERTY PREDICTION EXPERIMENTS AND RESULTS

In this section we evaluate how a popular, simple model, the GNS (Sanchez-Gonzalez* et al., 2020), performs on 3D molecular prediction tasks when combined with Noisy Nodes. The GNS was originally developed for particle fluid simulations, but has recently been adapted for molecular property prediction (Hu et al., 2021b). We find that without Noisy Nodes the GNS architecture is not competitive, but by using Noisy Nodes we see improved performance comparable to the use of specialised architectures.

We made minor changes to the GNS architecture. We featurise the distance input features using radial basis functions. We group layer weights, similar to grouped layers used in Jumper et al. (2021) for reduced parameter counts; for a group size of $n$ the first $n$ layer weights are repeated, i.e. the first layer with a group size of 10 has the same weights as the 11th, 21st, 31st layers and so on. $n$ contiguous blocks of layers are considered a single group. Finally we find that decoding the intermediate latents and adding a loss after each group aids training stability. The decoder is shared across groups.

Figure 3: Validation curves, OC20 IS2RE ID. A) Without any node targets our model has poor performance and realises no benefit from depth. B) After adding a position node loss, performance improves as depth increases. C) As we add Noisy Nodes and parameters the model achieves SOTA, even with 3 layers, and stops overfitting. D) Adding Noisy Nodes allows a model with even fully shared weights to achieve SOTA.

We tested this architecture on three challenging molecular property prediction benchmarks: OC20 (Chanussot* et al., 2020) IS2RS & IS2RE, and QM9 (Ramakrishnan et al., 2014). These benchmarks are detailed below, but as general distinctions, OC20 tasks use graphs 2-20x larger than QM9.
While QM9 always requires graph-level prediction, one of OC20’s two tasks (IS2RS) requires node-level predictions while the other (IS2RE) requires graph-level predictions. All training details may be found in the Appendix.

### 6.1 OPEN CATALYST 2020

**Dataset.** The [OC20 dataset](https://opencatalystproject.org/) (Chanussot* et al., 2020) (CC Attribution 4.0) describes the interaction of a small molecule (the adsorbate) and a large slab (the catalyst), with total systems consisting of 20-200 atoms simulated until equilibrium is reached.

We focus on two tasks: the Initial Structure to Resulting Energy (IS2RE) task, which takes the initial structure of the simulation and predicts the final energy, and the Initial Structure to Resulting Structure (IS2RS) task, which takes the initial structure and predicts the relaxed structure. Note that we train the more common “direct” prediction task that maps directly from initial positions to target in a single forward pass, and compare against other models trained for direct prediction.

Models are evaluated on 4 held out test sets. Four canonical validation datasets are also provided. Test sets are evaluated on a remote server hosted by the dataset authors with a very limited number of submissions per team.

Noisy Nodes in this case consists of a random jump between the initial position and relaxed position. During training we first sample uniformly from a point in the relaxation trajectory or interpolate uniformly between the initial and final positions $(v_i - \tilde{v}_i)\gamma$, $\gamma \sim U(0, 1)$, and then add I.I.D. Gaussian noise with mean zero and $\sigma = 0.3$. The Noisy Node target is the relaxed structure.

Table 1: OC20 IS2RE Validation, eV MAE, ↓. “GNS-Shared” indicates shared weights. “GNS-10” indicates a group size of 10.
| Model | Layers | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| GNS | 50 | 0.59 ±0.01 | 0.65 ±0.01 | 0.55 ±0.00 | 0.54 ±0.00 |
| GNS-Shared + Noisy Nodes | 50 | 0.49 ±0.00 | 0.54 ±0.00 | 0.51 ±0.01 | 0.51 ±0.01 |
| GNS + Noisy Nodes | 50 | 0.48 ±0.00 | 0.53 ±0.00 | 0.49 ±0.01 | 0.48 ±0.00 |
| GNS-10 + Noisy Nodes | 100 | **0.46 ±0.00** | **0.51 ±0.00** | **0.48 ±0.00** | **0.47 ±0.00** |

Table 2: Results OC20 IS2RE Test

eV MAE ↓

| | SchNet | DimeNet++ | SpinConv | SphereNet | GNS + Noisy Nodes |
|---|---|---|---|---|---|
| OOD Both | 0.704 | 0.661 | 0.674 | 0.638 | **0.465 (-24.0%)** |
| OOD Adsorbate | 0.734 | 0.725 | 0.723 | 0.703 | **0.565 (-22.8%)** |
| OOD Catalyst | 0.662 | 0.576 | 0.569 | 0.571 | **0.437 (-17.2%)** |
| ID | 0.639 | 0.562 | 0.558 | 0.563 | **0.422 (-18.8%)** |

Average Energy within Threshold (AEwT) ↑

| | SchNet | DimeNet++ | SpinConv | SphereNet | GNS + Noisy Nodes |
|---|---|---|---|---|---|
| OOD Both | 0.0221 | 0.0241 | 0.0233 | 0.0241 | **0.047 (+95.8%)** |
| OOD Adsorbate | 0.0233 | 0.0207 | 0.026 | 0.0229 | **0.035 (+89.5%)** |
| OOD Catalyst | 0.0294 | 0.0410 | 0.0382 | 0.0409 | **0.080 (+95.1%)** |
| ID | 0.0296 | 0.0425 | 0.0408 | 0.0447 | **0.091 (+102.0%)** |

We first convert to fractional coordinates (i.e. use the periodic unit cell as the basis) which render the predictions of our model invariant to rotations, and append the following rotation and translation invariant vector $(\alpha\beta^T, \beta\gamma^T, \alpha\gamma^T, |\alpha|, |\beta|, |\gamma|) \in \mathbb{R}^6$ to the edge features, where $\alpha, \beta, \gamma$ are vectors of the unit cell. This additional vector provides rotation invariant angular and extent information to the GNN.

**IS2RE Results.** In Figure 3 we show how using Noisy Nodes allows the GNS to achieve state of the art performance. Figure 3 A shows that without any auxiliary node target, an IS2RE GNS achieves poor performance even with increased depth. The fact that increased depth does not result in improvement supports the hypothesis that GNS suffers from oversmoothing. As we add a node level position target in B) we see better performance, and improvement as depth increases, validating our hypothesis that node level targets are key to addressing oversmoothing.
In C) we add Noisy Nodes and parameters, and see that the increased diversity of the node level predictions leads to very significant improvements and SOTA, even for a shallow 3 layer network. D) demonstrates this effect is not just due to increased parameters: SOTA can still be achieved with shared layer weights.

In Table 1 we conduct an ablation on our hyperparameters, and again demonstrate the improved performance of using Noisy Nodes. Results were averaged over 3 seeds and standard errors on the best obtained checkpoint show little sensitivity to initialisation. All results in the table are reported using sampling states from trajectories. We conducted an ablation on ID comparing sampling from a relaxation trajectory and interpolating between initial & final positions which found that interpolation improved our score from 0.47 to 0.45.

Our best hyperparameter setting was 100 layers which achieved a 95.6% relative performance improvement against SOTA results (Table 2) on the AEwT benchmark. Due to limited permitted test submissions, results presented here were from one test upload of our best performing validation seed.

**IS2RS Results.** In Table 4 we see that GNS + Noisy Nodes is significantly better than the only other reported IS2RS direct result, ForceNet, itself a GNS variant.

Table 3: OC20 IS2RS Validation, ADwT, ↑

| Model | Layers | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| GNS | 50 | 43.0% ±0.0 | 38.0% ±0.0 | 37.5% ±0.0 | 40.0% ±0.0 |
| GNS + Noisy Nodes | 50 | 50.1% ±0.0 | 44.3% ±0.0 | 44.1% ±0.0 | 46.1% ±0.0 |
| GNS-10 + Noisy Nodes | 50 | 52.0% ±0.0 | 46.2% ±0.0 | 46.1% ±0.0 | 48.3% ±0.0 |
| GNS-10 + Noisy Nodes + Pos only | 100 | **54.3% ±0.0** | **48.3% ±0.0** | **48.2% ±0.0** | **50.0% ±0.0** |

Table 4: OC20 IS2RS Test, ADwT, ↑

| Model | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|
| ForceNet | 46.9% | 37.7% | 43.7% | 44.9% |
| GNS + Noisy Nodes | **52.7%** | **43.9%** | **48.4%** | **50.9%** |
| Relative Improvement | **+12.4%** | **+16.4%** | **+10.7%** | **+13.3%** |

### 6.2 QM9

**Dataset. 
The QM9 benchmark (Ramakrishnan et al., 2014) contains 134k molecules in equilibrium** with up to 9 heavy C, O, N and F atoms, targeting 12 associated chemical properties (License: CC BY 4.0). We use 114k molecules for training, 10k for validation and 10k for test. All results are on the test set. We subtract a fixed per atom energy from the target values computed from linear regression to reduce variance. We perform training in eV units for energetic targets, and evaluate using MAE. We summarise the results across the targets using mean standardised MAE (std. MAE), in which MAEs are normalised by their standard deviation, and mean standardised logMAE. Std. MAE is dominated by targets with high relative error such as $\Delta\epsilon$, whereas logMAE is sensitive to outliers such as $R^2$. As is standard for this dataset, a model is trained separately for each target.

For this dataset we add I.I.D. Gaussian noise with mean zero and $\sigma = 0.02$ to the input atom positions. A denoising autoencoder loss is used.

**Results.** In Table 5 we can see that adding Noisy Nodes significantly improves results by 23.1% relative for GNS, making it competitive with specialised architectures. To understand the effect of adding a denoising loss, we tried just adding noise and found nowhere near the same improvement (Table 5).

A GNS-10 + Noisy Nodes with 30 layers achieves top results on 3 of the 12 targets and comparable performance on the remainder (Table 6). On the std. MAE aggregate metric GNS + Noisy Nodes performs better than all other reported results, showing that Noisy Nodes can make even a generic model competitive with models hand-crafted for molecular property prediction. The same trend is repeated for a rotation invariant version of this network that uses the principal axes of inertia ordered by eigenvalue as the co-ordinate frame (Table 5).

$R^2$, the electronic spatial extent, is an outlier for GNS + Noisy Nodes.
Interestingly, we found that without noise GNS-10 + Noisy Nodes achieves 0.33 for this target. We speculate that this target is particularly sensitive to noise, and the best noise value for this target would be significantly lower than for the dataset as a whole.

Table 5: QM9, Impact of Noisy Nodes on GNS architecture.

| Model | Layers | std. MAE | % Change | logMAE |
|---|---|---|---|---|
| GNS | 10 | 1.17 | - | -5.39 |
| GNS + Noise But No Node Target | 10 | 1.16 | -0.9% | -5.32 |
| GNS + Noisy Nodes | 10 | 0.90 | -23.1% | -5.58 |
| GNS-10 + Noisy Nodes | 20 | 0.89 | -23.9% | -5.59 |
| GNS-10 + Noisy Nodes + Invariance | 30 | 0.92 | -21.4% | -5.57 |
| GNS-10 + Noisy Nodes | 30 | **0.88** | **-24.8%** | **-5.60** |

Table 6: QM9, Test MAE, Mean & Standard Deviation of 3 Seeds Reported.

| Target | Unit | SchNet | E(n)GNN | DimeNet++ | SphereNet | PaiNN | **GNS + Noisy Nodes** |
|---|---|---|---|---|---|---|---|
| $\mu$ | D | 0.033 | 0.029 | 0.030 | 0.027 | **0.012** | 0.025 ±0.01 |
| $\alpha$ | $a_0^3$ | 0.235 | 0.071 | **0.043** | 0.047 | 0.045 | 0.052 ±0.00 |
| $\epsilon_{HOMO}$ | meV | 41 | 29.0 | 24.6 | 23.6 | 27.6 | **20.4 ±0.2** |
| $\epsilon_{LUMO}$ | meV | 34 | 25.0 | 19.5 | 18.9 | 20.4 | **18.6 ±0.4** |
| $\Delta\epsilon$ | meV | 63 | 48.0 | 32.6 | 32.3 | 45.7 | **28.6 ±0.1** |
| $R^2$ | $a_0^2$ | **0.07** | 0.11 | 0.33 | 0.29 | 0.07 | 0.70 ±0.01 |
| ZPVE | meV | 1.7 | 1.55 | 1.21 | **1.12** | 1.28 | 1.16 ±0.01 |
| $U_0$ | meV | 14.00 | 11.00 | 6.32 | 6.26 | **5.85** | 7.30 ±0.12 |
| $U$ | meV | 19.00 | 12.00 | 6.28 | 7.33 | **5.83** | 7.57 ±0.03 |
| $H$ | meV | 14.00 | 12.00 | 6.53 | 6.40 | **5.98** | 7.43 ±0.06 |
| $G$ | meV | 14.00 | 12.00 | 7.56 | 8.0 | 7.35 | 8.30 ±0.14 |
| $c_v$ | cal/mol K | 0.033 | 0.031 | 0.023 | **0.022** | 0.024 | 0.025 ±0.00 |
| std. MAE | % | 1.76 | 1.22 | 0.98 | 0.94 | 1.00 | **0.88** |
| logMAE | | -5.17 | -5.43 | -5.67 | -5.68 | **-5.85** | -5.60 |

Table 7: OGBG-PCQM4M Results

| Model | Number of Layers | Using Noisy Nodes | MAE |
|---|---|---|---|
| MPNN + Virtual Node | 16 | Yes | 0.1249 ± 0.0003 |
| MPNN + Virtual Node | 50 | No | 0.1236 ± 0.0001 |
| Graphormer (Ying et al., 2021) | - | - | 0.1234 |
| MPNN + Virtual Node | 50 | Yes | **0.1218 ± 0.0001** |

## 7 NON-SPATIAL TASKS

The previous experiments use the 3D geometries of atoms, and models that operate on 3D points. However, the recipe of adding a denoising auxiliary loss can be applied to other graphs with different types of features.
In this section we apply Noisy Nodes to additional datasets with no 3D points, using different GNNs, and show analogous effects to the 3D case. Details of the hyperparameters, models and training can be found in the appendix.

### 7.1 OGBG-PCQM4M

This dataset from the OGB benchmarks consists of molecular graphs made up of bonds and atom types, with no 3D or 2D coordinates. To adapt Noisy Nodes to this setting, we randomly flip node and edge features at a rate of 5% and add a reconstruction loss. We evaluate Noisy Nodes using an MPNN + Virtual Node (Gilmer et al., 2017). The test set is not currently available for this dataset.

In Table 7 we see that for this task Noisy Nodes enables a 50 layer MPNN to reach state of the art results. Before adding Noisy Nodes, adding capacity beyond 16 layers did not improve results.

### 7.2 OGBG-MOLPCBA

The OGBG-MOLPCBA dataset contains molecular graphs with no 3D points, with the goal of classifying 128 biological activities. On the OGBG-MOLPCBA dataset we again use an MPNN + Virtual Node and random flipping noise. In Figure 4 we see that adding Noisy Nodes improves the performance of the base model, accentuated for deeper networks. Our 16 layer MPNN improved from 27.6% ± 0.004 to 28.1% ± 0.002 Mean Average Precision (“Mean AP”). Figure 5 demonstrates how Noisy Nodes improves performance during training. Of the reported results, our MPNN is most similar to GCN¹ + Virtual Node and GIN + Virtual Node (Xu et al., 2018), which report results of 24.2% ± 0.003 and 27.03% ± 0.003 respectively.

¹ The GCN implemented in the official OGB code base has explicit edge updates, akin to the MPNN.

Figure 4: Adding Noisy Nodes with random flipping of input categories improves the performance of MPNNs, and the effect is accentuated with depth.

Figure 5: Validation curve comparing with and without Noisy Nodes. Using Noisy Nodes leads to a consistent improvement.

We evaluate alternative methods for
oversmoothing, DropNode and DropEdge, in Figure 2 and find that Noisy Nodes is more effective at addressing oversmoothing, although all 3 methods can be combined favourably (results in appendix).

### 7.3 OGBN-ARXIV

The above results use models with explicit edge updates, and are reported for graph prediction. To test the effectiveness of Noisy Nodes with GCNs, arguably the simplest and most popular GNN, we use OGBN-ARXIV, a citation network with the goal of predicting the arXiv category of each paper. Adding Noisy Nodes, with noise as input dropout of 0.1, to a 4 layer GCN with residual connections improves accuracy from 72.39% ± 0.002 to 72.52% ± 0.003. A baseline 4 layer GCN on this dataset reports 71.71% ± 0.002. The SOTA for this dataset is 74.31% (Sun & Wu, 2020).

### 7.4 LIMITATIONS

We have not demonstrated the effectiveness of Noisy Nodes in small data regimes, which may be important for learning from experimental data. The representation learning perspective requires access to a local minimum configuration, which is not the case for all quantum modeling datasets. We have also not demonstrated the combination of Noisy Nodes with more sophisticated 3D molecular property prediction models such as DimeNet++ (Klicpera et al., 2020a); such models may require an alternative reconstruction loss to position change, such as pairwise interatomic distances. We leave this to future work.

Noisy Nodes requires careful selection of the form of noise, and a balance between the auxiliary and primary losses. This can require hyperparameter tuning, and models can be sensitive to the choice of these parameters. Noisy Nodes has a particular effect for deep GNNs, but depth is not always an advantage. There are situations, for example molecular dynamics, which place a premium on very fast inference time. However, even at 3 layers (a comparable depth to alternative architectures) the GNS architecture achieves state of the art validation OC20 IS2RE predictions (Figure 3).
Finally, returns diminish as depth increases, indicating depth is not the only answer (Table 1).

## 8 CONCLUSIONS

In this work we present Noisy Nodes, a novel regularisation technique for GNNs with particular focus on 3D molecular property prediction. Noisy Nodes helps address common challenges around oversmoothed node representations, shows benefits for GNNs of all depths, but in particular improves performance for deeper GNNs. We demonstrate results on challenging 3D molecular property prediction tasks, and some generic GNN benchmark datasets. We believe these results demonstrate Noisy Nodes could be a useful building block for GNNs for molecular property prediction and beyond.

## 9 REPRODUCIBILITY STATEMENT

Code for reproducing OGB-PCQM4M results using Noisy Nodes is available on GitHub, and was prepared as part of a leaderboard submission: [https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq](https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq).

We provide detailed hyperparameter settings for all our experiments in the appendix, in addition to formulae for computing the encoder and decoder stages of the GNS.

## 10 ETHICS STATEMENT

**Who may benefit from this work?** Molecular property prediction with GNNs is a fast-growing area with applications across domains such as drug design, catalyst discovery, synthetic biology, and chemical engineering. Noisy Nodes could aid models applied to these domains. We also demonstrate on OC20 that our direct state prediction approach is nearly as accurate as learned relaxed approaches at a small fraction of the computational cost, which may support material design which requires many predictions.

Finally, Noisy Nodes could be adapted and applied to many areas in which GNNs are used, for example knowledge base completion, physical simulation or traffic prediction.

**Potential negative impact and reflection. 
Noisy Nodes sees improved performance from depth, but** the training of very deep GNNs could contribute to global warming. Care should be taken when utilising depth, and we note that Noisy Nodes settings can be calibrated at shallow depth. REFERENCES Brandon M. Anderson, T. Hy, and R. Kondor. Cormorant: Covariant molecular neural networks. In _NeurIPS, 2019._ Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL [http://github.com/deepmind.](http://github.com/deepmind) V. Bapst, T. Keck, Agnieszka Grabska-Barwinska, C. Donner, E. D. Cubuk, S. Schoenholz, A. Obika, Alexander W. R. Nelson, T. Back, D. Hassabis, and P. Kohli. Unveiling the predictive power of static structure in glassy systems. Nature Physics, 16:448–454, 2020. P. Battaglia, Jessica B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, A. Santoro, R. Faulkner, Çaglar Gülçehre, H. Song, A. J. Ballard, J. Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charlie Nash, Victoria Langston, Chris Dyer, N. Heess, Daan Wierstra, P. Kohli, M. Botvinick, Oriol Vinyals, Y. Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261, 2018. Simon Batzner, T. Smidt, L. Sun, J. Mailoa, M. Kornbluth, N. Molinari, and B. Kozinsky. Se(3)equivariant graph neural networks for data-efficient and accurate interatomic potentials. ArXiv, abs/2101.03164, 2021. Charles M. Bishop. 
Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7:108–116, 1995.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. CoRR, abs/2006.13318, 2020. URL https://arxiv.org/abs/2006.13318.

Lowik Chanussot*, Abhishek Das*, Siddharth Goyal*, Thibaut Lavril*, Muhammed Shuaibi*, Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, and Zachary Ulissi. Open catalyst 2020 (oc20) dataset and community challenges. ACS Catalysis, 0(0):6059–6072, 2020. doi: 10.1021/acscatal.0c04525. URL https://doi.org/10.1021/acscatal.0c04525.

Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun.
Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. CoRR, abs/1909.03211, 2019. URL http://arxiv.org/abs/1909.03211.

Deli Chen, Yankai Lin, W. Li, Peng Li, J. Zhou, and Xu Sun. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. In AAAI, 2020.

Stefan Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, Kristof T. Schütt, and K. Müller. Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3, 2017.

George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural networks for node disambiguation. ArXiv, abs/1912.06058, 2019.

J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Tien Huu Do, Duc Minh Nguyen, Giannis Bekoulis, Adrian Munteanu, and N. Deligiannis. Graph convolutional neural networks with node transition probability-based message passing and dropnode regularization. Expert Syst. Appl., 174:114711, 2021.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pp. 2224–2232, Cambridge, MA, USA, 2015. MIT Press.

F. Fuchs, Daniel E. Worrall, Volker Fischer, and M. Welling. SE(3)-transformers: 3d roto-translation equivariant attention networks. ArXiv, abs/2006.10503, 2020.

J. Gilmer, S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. ArXiv, abs/1704.01212, 2017.

Jonathan Godwin*, Thomas Keck*, Peter Battaglia, Victor Bapst, Thomas Kipf, Yujia Li, Kimberly Stachenfeld, Petar Veličković, and Alvaro Sanchez-Gonzalez.
Jraph: A library for graph neural networks in JAX, 2020. URL http://github.com/deepmind/jraph.

Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ArXiv, abs/2005.00687, 2020a.

Weihua Hu, Bowen Liu, Joseph Gomes, M. Zitnik, Percy Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. arXiv: Learning, 2020b.

Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021a.

Weihua Hu, Muhammed Shuaibi, Abhishek Das, Siddharth Goyal, Anuroop Sriram, J. Leskovec, Devi Parikh, and C. L. Zitnick. Forcenet: A graph neural network for large-scale quantum calculations. ArXiv, abs/2103.01436, 2021b.

John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David A. Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 596:583–589, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.

Johannes Klicpera, Shankari Giri, Johannes T. Margraf, and Stephan Günnemann. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. CoRR, abs/2011.14115, 2020a. URL https://arxiv.org/abs/2011.14115.

Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. ArXiv, abs/2003.03123, 2020b.

Risi Kondor, Hy Truong Son, Horace Pan, Brandon M. Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. CoRR, abs/1801.02144, 2018. URL http://arxiv.org/abs/1801.02144.

Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, G. Taylor, and T. Goldstein. Flag: Adversarial data augmentation for graph neural networks. ArXiv, abs/2010.09891, 2020.

Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: sampling configurations for multi-body systems with symmetric energies, 2019.

G. Li, M. Müller, Ali K. Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9266–9275, 2019.

Guohao Li, C. Xiong, Ali K. Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. ArXiv, abs/2006.07739, 2020.

Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks with 1000 layers. CoRR, abs/2106.07476, 2021. URL https://arxiv.org/abs/2106.07476.

Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Yi Liu, Limei Wang, Meng Liu, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. Spherical message passing for 3d graph networks.
arXiv preprint arXiv:2102.05013, 2021.

Andreas Loukas. How hard is to distinguish graphs with graph neural networks? arXiv: Learning, 2020.

Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak A. Rao, and Bruno Ribeiro. Relational pooling for graph representations. In ICML, 2019.

T. Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and P. Battaglia. Learning mesh-based simulation with graph networks. ArXiv, abs/2010.03409, 2020.

R. Ramakrishnan, Pavlo O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.

Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. The truly deep graph convolutional networks for node classification. CoRR, abs/1907.10903, 2019. URL http://arxiv.org/abs/1907.10903.

Alvaro Sanchez-Gonzalez, N. Heess, Jost Tobias Springenberg, J. Merel, Martin A. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. ArXiv, abs/1806.01242, 2018.

Alvaro Sanchez-Gonzalez*, Jonathan Godwin*, Tobias Pfaff*, Rex Ying*, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8459–8468. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html.

R. Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. In SDM, 2021.

Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2021.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
doi: 10.1109/TNN.2008.2005605.

Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, A. Tkatchenko, and K. Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In NIPS, 2017.

Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, Jan 2021. ISSN 2632-2153. doi: 10.1088/2632-2153/abbf9a. URL http://dx.doi.org/10.1088/2632-2153/abbf9a.

J. Sietsma and Robert J. F. Dow. Creating artificial neural networks that generalize. Neural Networks, 4:67–79, 1991.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. ArXiv, abs/1907.05600, 2019.

Chuxiong Sun and Guoshi Wu. Adaptive graph diffusion networks with hop-wise attention. ArXiv, abs/2012.15024, 2020.

Shantanu Thakoor, C. Tallec, M. G. Azar, R. Munos, Petar Veličković, and Michal Valko. Bootstrapped representation learning on graphs. ArXiv, abs/2102.06514, 2021.

Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. CoRR, abs/1802.08219, 2018. URL http://arxiv.org/abs/1802.08219.

Oliver T. Unke and Markus Meuwly. Physnet: A neural network for predicting energies, forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation, 15(6):3678–3693, May 2019. ISSN 1549-9626. doi: 10.1021/acs.jctc.9b00181. URL http://dx.doi.org/10.1021/acs.jctc.9b00181.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018.

Clément Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph neural networks with structural message-passing. arXiv: Learning, 2020.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011.

Pascal Vincent, H. Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML '08, 2008.

Pascal Vincent, H. Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826.

Chaoqi Yang, Ruijie Wang, Shuochao Yao, Shengzhong Liu, and Tarek Abdelzaher. Revisiting "over-smoothing" in deep gcns. arXiv preprint arXiv:2003.13663, 2020.

Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform bad for graph representation? ArXiv, abs/2106.05234, 2021.

Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. ArXiv, abs/2010.13902, 2020.

L. Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. ArXiv, abs/1909.12223, 2020.

Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun.
Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020a.

Kuangqi Zhou, Yanfei Dong, Wee Sun Lee, Bryan Hooi, Huan Xu, and Jiashi Feng. Effective training strategies for deep graph neural networks. CoRR, abs/2006.07107, 2020b. URL https://arxiv.org/abs/2006.07107.

A APPENDIX

The following sections include details on training setup, hyper-parameters, input processing, as well as additional experimental results.

A.1 ADDITIONAL METRICS FOR OPEN CATALYST IS2RS TEST SET

Relaxation approaches to IS2RS minimise forces with respect to positions, with the expectation that forces at the minimum are close to zero. One metric of such a model's success is to evaluate the forces at the converged structure using ground truth Density Functional Theory calculations and see how close they are to zero. Two metrics are provided by OC20 (Chanussot* et al., 2020) on the IS2RS test set: Force below Threshold (FbT), which is the percentage of structures that have forces below 0.05 eV/Angstrom, and Average Force below Threshold (AFbT), which is FbT calculated at multiple thresholds.

The OC20 project computes test DFT calculations on the evaluation server and presents a summary result for all IS2RS position predictions. Such calculations take 10-12 hours and they are not available for the validation set. Thus, we are not able to analyse the results in Tables 8 and 9 in any further detail. Before application to catalyst screening further work may be needed for direct approaches to ensure forces do not explode from atoms being too close together.
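As an illustration of the FbT and AFbT definitions above, the following is a minimal Python sketch and not the OC20 evaluation code. It assumes that a structure counts as "below threshold" when its largest per-atom force magnitude is below the threshold; the function names and the threshold grid passed to AFbT are hypothetical, and the official protocol is defined by the OC20 evaluation server.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def max_force_norm(forces: Sequence[Vector]) -> float:
    """Largest per-atom force magnitude in one structure (eV/Angstrom)."""
    return max(math.sqrt(sum(c * c for c in f)) for f in forces)


def force_below_threshold(
    structures: Sequence[Sequence[Vector]], threshold: float = 0.05
) -> float:
    """Percentage of structures whose largest atomic force is below `threshold`.

    Assumption: the per-structure criterion is the maximum force norm over atoms.
    """
    hits = sum(1 for forces in structures if max_force_norm(forces) < threshold)
    return 100.0 * hits / len(structures)


def average_force_below_threshold(
    structures: Sequence[Sequence[Vector]], thresholds: Sequence[float]
) -> float:
    """Mean of FbT over several thresholds (the threshold grid is an assumption)."""
    values = [force_below_threshold(structures, t) for t in thresholds]
    return sum(values) / len(values)
```

Under these assumptions, a prediction set in which every structure retains at least one large force scores 0% FbT regardless of how small the remaining forces are, which is one reason the metric is demanding for direct position prediction.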

Table 8: OC20 IS2RS Test, Average Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.09% | 0.00% | 0.29% | 0.54% |

Table 9: OC20 IS2RS Test, Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.0% | 0.0% | 0.0% | 0.0% |

A.2 MORE DETAILS ON GNS ADAPTATIONS FOR MOLECULAR PROPERTY PREDICTION

**Encoder.** The node features are a learned embedding lookup of the atom type, and in the case of OC20 two additional binary features representing whether the atom is part of the adsorbate or catalyst and whether the atom remains fixed during the quantum chemistry simulation.

The edge features, $e_k$, are the distances $|d|$ featurised using $c$ Radial Bessel basis functions,

$$\tilde{e}_{RBF,c}(|d|) = \sqrt{\frac{2}{R}} \, \frac{\sin\left(\frac{c\pi}{R}|d|\right)}{|d|},$$

and the edge vector displacements, $d$, normalised by the edge distance:

$$e_k = \mathrm{Concat}\left(\tilde{e}_{RBF,1}(|d|), \ldots, \tilde{e}_{RBF,c}(|d|), \frac{d}{|d|}\right)$$

Our conversion to fractional coordinates only applied to the vector quantities, i.e. $\frac{d}{|d|}$.

**Decoder.** The decoder consists of two parts, a graph-level decoder which predicts a single output for the input graph, and a node-level decoder which predicts individual outputs for each node. The graph-level decoder implements the following equation:

$$y = W^{Proc} \sum_{i=1}^{|V|} \mathrm{MLP}^{Proc}\left(a_i^{Proc}\right) + b^{Proc} + W^{Enc} \sum_{i=1}^{|V|} \mathrm{MLP}^{Enc}\left(a_i^{Enc}\right) + b^{Enc}$$

where $a_i^{Proc}$ are node latents from the Processor, $a_i^{Enc}$ are node latents from the Encoder, $W^{Enc}$ and $W^{Proc}$ are linear layers, $b^{Enc}$ and $b^{Proc}$ are biases, and $|V|$ is the number of nodes. The node-level decoder is simply an MLP applied to each $a_i^{Proc}$ which predicts $a_i^{\Delta}$.

A.3 MORE DETAILS ON MPNN FOR OGBG-PCQM4M AND OGBG-MOLPCBA

Our MPNN follows the blueprint of Gilmer et al. (2017).
We use $\vec{h}_v^{(t)}$ to denote the latent vector of node $v$ at message passing step $t$, and $\vec{m}_{uv}^{(t)}$ to be the computed message vector for the edge between nodes $u$ and $v$ at message passing step $t$. We define the update functions as:

$$\vec{m}_{uv}^{(t+1)} = \psi_{t+1}\left(\vec{h}_u^{(t)}, \vec{h}_v^{(t)}, \vec{m}_{uv}^{(t)}\right) + \vec{m}_{uv}^{(t-1)} \qquad (1)$$

$$\vec{h}_u^{(t+1)} = \phi_{t+1}\left(\vec{h}_u^{(t)}, \sum_{u \in N_v} \vec{m}_{vu}^{(t+1)}, \sum_{v \in N_u} \vec{m}_{uv}^{(t+1)}\right) + \vec{h}_u^{(t)} \qquad (2)$$

where the message function $\psi_{t+1}$ and the update function $\phi_{t+1}$ are MLPs. We use a "Virtual Node" which is connected to all other nodes to enable long range communication. Our readout function is an MLP. No spatial features are used.

Figure 6: GNS Unsorted MAD per Layer Averaged Over 3 Random Seeds. Evidence of oversmoothing is clear. Model trained on QM9.

Figure 7: GNS Sorted MAD per Layer Averaged Over 3 Random Seeds. The trend is clearer when the MAD values have been sorted. Model trained on QM9.

A.4 EXPERIMENT SETUP FOR 3D MOLECULAR MODELING

**Open Catalyst.** All training experiments were run on a cluster of TPU devices. For the Open Catalyst experiments, each individual run (i.e. a single random seed) utilised 8 TPU devices on 2 hosts (4 per host) for training, and 4 V100 GPU devices for evaluation (1 per dataset). Each Open Catalyst experiment was run until convergence for up to 200 hours. Our best result, the large 100 layer model, requires 7 days of training using the above setting. Each configuration was run at least 3 times in this hardware configuration, including all ablation settings.

We further note that making effective use of our regulariser requires sweeping noise values. These sweeps are dataset dependent and can be carried out using few message passing steps.

**QM9.** Experiments were also run on TPU devices. Each seed was run using 8 TPU devices on a single host for training, and 2 V100 GPU devices for evaluation. QM9 targets were trained for 12-24 hours per experiment.
Following Klicpera et al. (2020b) we define std. MAE as:

$$\text{std. MAE} = \frac{1}{M} \sum_{m=1}^{M} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{\left| f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)} \right|}{\sigma_m} \right)$$

and logMAE as:

$$\text{logMAE} = \frac{1}{M} \sum_{m=1}^{M} \log \left( \frac{1}{N} \sum_{i=1}^{N} \frac{\left| f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)} \right|}{\sigma_m} \right)$$

with target index $m$, number of targets $M = 12$, dataset size $N$, ground truth values $\hat{t}^{(m)}$, model $f_\theta^{(m)}$, inputs $X_i$ and $z_i$, and standard deviation $\sigma_m$ of $\hat{t}^{(m)}$.

A.5 OVER SMOOTHING ANALYSIS FOR GNS

In addition to Figure 2, we repeat the analysis with a mean MAD over 3 seeds (Figure 7). Furthermore, we remove the sorting of layers by MAD value (Figure 6) and find the trend holds.

A.6 NOISE ABLATIONS FOR OGBG-MOLPCBA

We conduct a noise ablation on the random flipping noise for OGBG-MOLPCBA with an 8 layer MPNN + Virtual Node, and find that our model is not very sensitive to the noise value (Table 10), but performance degrades from 0.1 upwards.

| Flip Probability | Mean AP |
|---|---|
| 0.01 | 27.8% ± 0.002 |
| 0.03 | 27.9% ± 0.003 |
| 0.05 | 28.1% ± 0.001 |
| 0.1 | 28.0% ± 0.003 |
| 0.2 | 27.7% ± 0.002 |

Table 10: OGBG-MOLPCBA Noise Ablation

| Model | Mean AP |
|---|---|
| MPNN Without DropEdge | 27.4% ± 0.002 |
| MPNN With DropEdge | 27.5% ± 0.001 |
| MPNN + DropEdge + Noisy Nodes | 27.8% ± 0.002 |

Table 11: OGBG-MOLPCBA DropEdge Ablation

A.7 DROPEDGE & DROPNODE ABLATIONS FOR OGBG-MOLPCBA

We conduct an ablation with our 16 layer MPNN using DropEdge at a rate of 0.1 as an alternative approach to addressing oversmoothing and find it does not improve performance for ogbg-molpcba (Table 11). Similarly, we find DropNode (Table 12) does not improve performance. In addition, we find that these two methods cannot be combined well together, reaching a performance of 27.0% ± 0.003. However, both methods can be combined advantageously with Noisy Nodes. We also measure the MAD of the node latents for each layer and find that Noisy Nodes is indeed more effective at addressing oversmoothing (Figure 8).
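The per-layer MAD measurement mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the global variant of MAD from Chen et al. (2020), i.e. cosine distance between all pairs of node latents, averaged over the non-zero distances per node and then over nodes. The function names, the `eps` tolerance and the handling of zero vectors are hypothetical choices, and the exact variant used for the figures is not specified here.

```python
import math
from typing import List, Sequence

Latent = Sequence[float]


def cosine_distance(a: Latent, b: Latent) -> float:
    """Return 1 - cos(a, b); zero vectors are given distance 0 (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_average_distance(latents: Sequence[Latent], eps: float = 1e-12) -> float:
    """Global MAD of one layer's node latents (higher means more diverse)."""
    per_node: List[float] = []
    for i, h_i in enumerate(latents):
        dists = [cosine_distance(h_i, h_j) for j, h_j in enumerate(latents) if j != i]
        # Average only over non-zero distances; eps absorbs rounding error.
        nonzero = [d for d in dists if d > eps]
        if nonzero:
            per_node.append(sum(nonzero) / len(nonzero))
    # If every pair of latents is identical, the layer is fully oversmoothed.
    return sum(per_node) / len(per_node) if per_node else 0.0


def mad_per_layer(latents_per_layer: Sequence[Sequence[Latent]]) -> List[float]:
    """MAD for each message passing layer, given that layer's node latents."""
    return [mean_average_distance(layer) for layer in latents_per_layer]
```

A MAD curve that decays towards zero with depth indicates oversmoothing, whereas a regulariser that preserves per-node diversity should keep the later entries of `mad_per_layer` away from zero.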

A.8 TRAINING CURVES FOR OC20 NOISY NODES ABLATIONS DEMONSTRATING OVERFITTING

See Figure 9.

| Model | Mean AP |
|---|---|
| MPNN With DropNode | 27.5% ± 0.001 |
| MPNN Without DropNode | 27.5% ± 0.004 |
| MPNN + DropNode + Noisy Nodes | 28.2% ± 0.005 |

Table 12: OGBG-MOLPCBA DropNode Ablation

Figure 8: Comparison of the effect of techniques to address oversmoothing on MPNNs. Whilst some effect can be seen from DropEdge and DropNode, Noisy Nodes is significantly better at preserving per node diversity.

A.9 PSEUDOCODE FOR 3D MOLECULAR PREDICTION TRAINING STEP

**Algorithm 1: Noisy Nodes Training Step**

    G = (V, E, g)   // Input graph
    G̃ = G           // Initialize noisy graph
    λ               // Noisy Nodes Weight
    if not_provided(V′) then
        V′ ← V
    end
    if predict_differences then
        ∆ = {v′_i − v_i | i ∈ 1, ..., |V|}
    end
    for each i ∈ 1, ..., |V| do
        σ_i = sample_node_noise(shape_of(v_i));
        ṽ_i = v_i + σ_i;
        if predict_differences then
            ∆̃_i = ∆_i − σ_i;
        end
    endfor
    Ẽ = recompute_edges(Ṽ);
    Ĝ′ = GNN(G̃);
    if predict_differences then
        V′ = ∆̃;
    end
    Loss = λ NoisyNodesLoss(Ĝ′, V′) + PrimaryLoss(Ĝ′, V′);
    Loss.minimise()

Figure 9: Training curves to accompany Figure 3. This demonstrates that even as the validation performance is getting worse, training loss is going down, indicating overfitting.

Table 13: Open Catalyst training parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| Warm up steps | 5e5 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 5e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 1024 |
| Max edges in batch | 12800 |
| Max graphs in batch | 10 |
| MLP number of layers | 3 |
| MLP hidden sizes | 512 |
| Number of Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 50 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian ($\mu = 0$, $\sigma = 0.3$) |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 1.0 |

A.10 TRAINING DETAILS

Our code base is implemented in JAX using Haiku and Jraph for GNNs, and Optax for training (Bradbury et al., 2018; Babuschkin et al., 2020; Godwin* et al., 2020; Hennigan et al., 2020). Model selection used early stopping.

All results are reported as an average of 10 random seeds. OGBG-PCQM4M & OGBG-MOLPCBA were trained with 16 TPUs and evaluated with a single V100 GPU. OGBN-Arxiv was trained and evaluated with a single TPU.

**3D Molecular Prediction** We minimise the mean squared error loss on mean and standard deviation normalised targets and use the Adam (Kingma & Ba, 2015) optimiser with warmup and cosine decay. For OC20 IS2RE energy prediction we subtract a learned reference energy, computed using an MLP with atom types as input.

For the GNS model the node and edge latents as well as MLP hidden layers were sized 512, with 3 layers per MLP and using shifted softplus activations throughout. OC20 & QM9 models were trained on 8 TPU devices and evaluated on a single V100 GPU. We provide the full set of hyper-parameters and computational resources used separately for each dataset in the Appendix. All noise levels were determined by sweeping a small range of values (≈ 10) informed by the noised feature covariance.

**Non Spatial Tasks**

A.11 HYPER-PARAMETERS

**Open Catalyst.** We list the hyper-parameters used to train the default Open Catalyst experiment. If not specified otherwise (e.g. in ablations of these parameters), experiments were run with this configuration.

Table 14: QM9 training parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 3e-7 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 2e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 256 |
| Max edges in batch | 4096 |
| Max graphs in batch | 8 |
| MLP number of layers | 3 |
| MLP hidden sizes | 1024 |
| Number of Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 10 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian ($\mu = 0$, $\sigma = 0.02$) |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 0.1 |

Dynamic batch sizes refers to constructing batches by specifying maximum node, edge and graph counts (as opposed to only graph counts) to better balance computational load. Batches are constructed until one of the limits is reached.

Parameter updates were smoothed using an EMA for the current training step, with the current decay value computed through decay = min(decay, (1.0 + step)/(10.0 + step)). As discussed in the evaluation, best results on Open Catalyst were obtained by utilising a 100 layer network with group size 10.

**QM9.** Table 14 lists QM9 hyper-parameters which primarily reflect the smaller dataset and geometries with fewer long range interactions. For U0, U, H and G we use a slightly larger number of graphs per batch (16) and a smaller position loss coefficient of 0.01.
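The dynamic batching and EMA warm-up described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our training code: the greedy packing order, the function names, and the standard EMA update rule `ema = decay * ema + (1 - decay) * params` are hypothetical choices, and only the decay formula itself is taken from the text.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

GraphSize = Tuple[int, int]  # (number of nodes, number of edges)


def dynamic_batches(
    graph_sizes: Iterable[GraphSize], max_nodes: int, max_edges: int, max_graphs: int
) -> Iterator[List[GraphSize]]:
    """Greedily pack graphs until adding one more would exceed any limit."""
    batch: List[GraphSize] = []
    nodes = edges = 0
    for n, e in graph_sizes:
        full = (
            nodes + n > max_nodes
            or edges + e > max_edges
            or len(batch) + 1 > max_graphs
        )
        if batch and full:
            yield batch
            batch, nodes, edges = [], 0, 0
        # Assumption: a single graph larger than the limits still forms a batch.
        batch.append((n, e))
        nodes += n
        edges += e
    if batch:
        yield batch


def ema_decay_at_step(max_decay: float, step: int) -> float:
    """decay = min(decay, (1.0 + step) / (10.0 + step)), as stated in the text."""
    return min(max_decay, (1.0 + step) / (10.0 + step))


def ema_update(
    ema_params: Sequence[float], params: Sequence[float], max_decay: float, step: int
) -> List[float]:
    """One EMA smoothing step (the update rule itself is an assumption)."""
    decay = ema_decay_at_step(max_decay, step)
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With a maximum decay of 0.9999 the schedule starts at 0.1, so early averages track the raw parameters closely, and under this formula it reaches the cap after roughly 9e4 steps.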

**OGBG-PCQM4M.** Table 15 provides the hyper parameters for OGBG-PCQM4M.

**OGBG-MOLPCBA.** Table 16 provides the hyper parameters for the OGBG-MOLPCBA experiments.

**OGBN-ARXIV.** Table 17 provides the hyper parameters for the OGBN-Arxiv experiments.

Table 15: OGBG-PCQM4M Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| Warm up steps | 5e4 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 5e5 |
| Loss type | Mean absolute error |
| Reconstruction type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Node/Edge latent vector sizes | 512 |
| Noisy Nodes Category Flip Rate | 0.05 |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction Loss Coefficient | 0.1 |

Table 16: OGBG-MOLPCBA Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 1e5 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 512 |
| Dropnode Rate | 0.1 |
| Dropout Rate | 0.1 |
| Noisy Nodes Category Flip Rate | 0.05 |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction Loss Coefficient | 0.1 |

Table 17: OGBN-ARXIV Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| Warm up steps | 50 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-3 |
| Cosine cycle length | 12,000 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Mean Squared Error |
| Batch size | Full graph |
| MLP number of layers | 1 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 256 |
| Dropout Rate | 0.5 |
| Noisy Nodes Input Dropout | 0.05 |
| Reconstruction Loss Coefficient | 0.1 |
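
To make the non-spatial settings in Table 17 concrete, the following Python sketch shows one way input dropout noise, a mean squared error reconstruction term and the reconstruction loss coefficient could fit together, following the loss structure of Algorithm 1. It is an illustration under assumptions rather than the implementation used for the reported results: the function names are hypothetical, the reconstruction target is taken to be the clean input features, and surviving features are not rescaled, a detail the text does not specify.

```python
import random
from typing import List, Sequence

Features = Sequence[Sequence[float]]


def input_dropout(features: Features, rate: float, rng: random.Random) -> List[List[float]]:
    """Zero each input feature independently with probability `rate`.

    Assumption: surviving features are left unscaled.
    """
    return [[0.0 if rng.random() < rate else x for x in row] for row in features]


def mean_squared_error(pred: Features, target: Features) -> float:
    """Mean squared error over all node features."""
    count = sum(len(row) for row in target)
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / count


def noisy_nodes_loss(
    primary_loss: float,
    reconstructed: Features,
    clean_features: Features,
    weight: float = 0.1,
) -> float:
    """Loss = weight * NoisyNodesLoss + PrimaryLoss, as in Algorithm 1.

    `reconstructed` stands for the node-level decoder output computed from the
    corrupted inputs; `weight` plays the role of the reconstruction loss coefficient.
    """
    return primary_loss + weight * mean_squared_error(reconstructed, clean_features)
```

In a training loop the model would receive `input_dropout(features, rate, rng)` as its node inputs, while `clean_features` is kept aside as the node-level target; at evaluation time no corruption is applied.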