# SIMPLE GNN REGULARISATION FOR 3D MOLECULAR PROPERTY PREDICTION & BEYOND

**Jonathan Godwin, Michael Schaarschmidt, Alexander Gaunt,**
**Alvaro Sanchez-Gonzalez, Yulia Rubanova, Petar Veličković,**
**James Kirkpatrick & Peter Battaglia**
DeepMind, London
{jonathangodwin}@deepmind.com

ABSTRACT

In this paper we show that simple noisy regularisation can be an effective way
to address oversmoothing. We argue that regularisers addressing oversmoothing
should both penalise node latent similarity and encourage meaningful node representations. From this observation we derive “Noisy Nodes”, a simple technique in
which we corrupt the input graph with noise, and add a noise correcting node-level
loss. The diverse node level loss encourages latent node diversity, and the denoising
objective encourages graph manifold learning. Our regulariser applies well-studied
methods in simple, straightforward ways which allow even generic architectures to
overcome oversmoothing and achieve state of the art results on quantum chemistry
tasks, and improve results significantly on Open Graph Benchmark (OGB) datasets.
Our results suggest Noisy Nodes can serve as a complementary building block in
the GNN toolkit.

1 INTRODUCTION

Graph Neural Networks (GNNs) are a family of neural networks that operate on graph structured data
by iteratively passing learned messages over the graph’s structure (Scarselli et al., 2009; Bronstein
et al., 2017; Gilmer et al., 2017; Battaglia et al., 2018; Shlomi et al., 2021). While Graph Neural
Networks have demonstrated success in a wide variety of tasks (Zhou et al., 2020a; Wu et al., 2020;
Bapst et al., 2020; Schütt et al., 2017; Klicpera et al., 2020a), it has been proposed that in practice
“oversmoothing” limits their ability to benefit from overparametrization.

Oversmoothing is a phenomenon where a GNN’s latent node representations become increasingly indistinguishable over successive steps of message passing (Chen et al., 2019). Once these representations
are oversmoothed, the relational structure of the representation is lost, and further message-passing
cannot improve expressive capacity. We argue that the challenges of overcoming oversmoothing are
two fold. First, finding a way to encourage node latent diversity; second, to encourage the diverse
node latents to encode meaningful graph representations. Here we propose a simple noise regulariser,
Noisy Nodes, and demonstrate how it overcomes these challenges across a range of datasets and
architectures, achieving top results on OC20 IS2RS & IS2RE direct, QM9 and OGBG-PCQM4Mv1.

Our “Noisy Nodes” method is a simple technique for regularising GNNs and associated training
procedures. During training, our noise regularisation approach corrupts the input graph’s attributes
with noise, and adds a per-node noise correction term. We posit that our Noisy Nodes approach is
effective because the model is rewarded for maintaining and refining distinct node representations
through message passing to the final output, which causes it to resist oversmoothing. Like denoising
autoencoders, it encourages the model to explicitly learn the manifold on which the uncorrupted input
graph’s features lie, analogous to a form of representation learning. When applied to 3D molecular
prediction tasks, it encourages the model to distinguish between low and high energy states. We
find that applying Noisy Nodes reduces oversmoothing for shallower networks, and allows us to see
improvements with added depth, even on tasks for which depth was assumed to be unhelpful.

This study’s approach is to investigate the combination of Noisy Nodes with generic, popular baseline
GNN architectures. For 3D molecular prediction we use a standard architecture working on 3D point
clouds developed for particle fluid simulations, the Graph Net Simulator (GNS) (Sanchez-Gonzalez*



et al., 2020), which has also been used for molecular property prediction (Hu et al., 2021b). Without
using Noisy Nodes the GNS is not a competitive model, but with Noisy Nodes the GNS
achieves top performance on three 3D molecular property prediction tasks: improving on the OC20 IS2RE
direct task by 43% over previous work, by 12% on OC20 IS2RS direct, and achieving top results on 3 out of
12 of the QM9 tasks. For non-spatial GNN benchmarks we test an MPNN (Gilmer et al., 2017) on
OGBG-MOLPCBA and OGBG-PCQM4M (Hu et al., 2021a) and again see significant improvements.
Finally, we applied Noisy Nodes to a GCN (Kipf & Welling, 2016), arguably the most popular and
simple GNN, trained on OGBN-Arxiv and see similar results. These results suggest Noisy Nodes can
serve as a complementary GNN building block.

2 PRELIMINARIES: GRAPH PREDICTION PROBLEM

Let $G = (V, E, g)$ be an input graph. The nodes are $V = \{v_1, \ldots, v_{|V|}\}$, where $v_i \in \mathbb{R}^{d_v}$. The
directed, attributed edges are $E = \{e_1, \ldots, e_{|E|}\}$: each edge includes a sender node index, receiver
node index, and edge attribute, $e_k = (s_k, r_k, e_k)$, respectively, where $s_k, r_k \in \{1, \ldots, |V|\}$ and $e_k \in \mathbb{R}^{d_e}$. The graph-level property is $g \in \mathbb{R}^{d_g}$.

The goal is to predict a target graph, $G'$, with the same structure as $G$, but different node, edge,
and/or graph-level attributes. We denote $\hat{G}'$ as a model’s prediction of $G'$. Some error metric defines the
quality of $\hat{G}'$ with respect to the target $G'$, $\mathrm{Error}(\hat{G}', G')$, which the training loss terms are defined to
optimize. In this paper the phrase “message passing steps” is synonymous with “GNN layers”.
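As a concrete illustration only (not from the paper), the graph above can be held in a simple container; the field names below are our own:

```python
from typing import NamedTuple
import numpy as np

class Graph(NamedTuple):
    """A directed, attributed graph G = (V, E, g); an illustrative sketch."""
    nodes: np.ndarray      # [|V|, d_v] node attributes v_i
    edges: np.ndarray      # [|E|, d_e] edge attributes e_k
    senders: np.ndarray    # [|E|] sender indices s_k
    receivers: np.ndarray  # [|E|] receiver indices r_k
    globals: np.ndarray    # [d_g] graph-level attribute g
```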

3 OVERSMOOTHING

“Oversmoothing” is when the node latent vectors of a GNN become very similar after successive
layers of message passing. Once nodes are identical there is no relational information contained in
the nodes, and no higher-order latent graph representations can be learned. It is easiest to see this
effect with the update function of a Graph Convolutional Network with no adjacency normalization
$v_i^k = \sum_j W v_j^{k-1}$ with $j \in \mathrm{Neighborhood}(v_i)$, $W \in \mathbb{R}^{d_g \times d_g}$ and $k$ the layer index. As the number
of applications increases, the averaging effect of the summation forces the nodes to become almost
identical. However, as soon as residual connections are added we can construct a network that
need not suffer from oversmoothing by setting the residual updates to zero at a similarity threshold.
Similarly, multi-head attention (Vaswani et al., 2017; Veličković et al., 2018) and GNNs with edge
updates (Battaglia et al., 2018; Gilmer et al., 2017) can modulate node updates. As such, for modern
GNNs oversmoothing is primarily a “training” problem, i.e. how to choose model architectures and
regularisers to encourage and preserve meaningful latent relational representations.
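To make the collapse concrete, the toy sketch below repeatedly applies neighbourhood averaging (a row-normalised variant of the update above, with W omitted for numerical stability) to random node features and tracks the mean pairwise node distance; it is an illustration of the phenomenon, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32

# A random connected undirected graph: a ring plus a few extra edges, with self-loops.
adj = np.eye(n)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
extra = rng.integers(0, n, size=(20, 2))
adj[extra[:, 0], extra[:, 1]] = adj[extra[:, 1], extra[:, 0]] = 1.0
norm_adj = adj / adj.sum(axis=1, keepdims=True)  # mean over each neighbourhood

def mean_pairwise_distance(h):
    """Average Euclidean distance between distinct node feature vectors."""
    dist = np.linalg.norm(h[:, None, :] - h[None, :, :], axis=-1)
    return dist[~np.eye(len(h), dtype=bool)].mean()

h = rng.normal(size=(n, d))
for layer in range(16):
    h = norm_adj @ h  # v_i^k = mean over neighbours j of v_j^{k-1}
    if layer % 4 == 3:
        print(f"layer {layer + 1:2d}: mean pairwise node distance = "
              f"{mean_pairwise_distance(h):.4f}")
# The printed distance shrinks towards zero: the node latents oversmooth.
```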

We can discern two desiderata for a regulariser or loss that addresses oversmoothing. First, it should
penalise identical node latents. Second, it should encourage meaningful latent representations of
the data. One such example may be the auto-regressive loss of transformer based language models
(Brown et al. (2020)). In this case, each word (equivalent to node) prediction must be distinct, and
the auto-regressive loss encourages relational dependence upon prior words. We can take inspiration
from this observation to derive auxiliary losses that both have diverse node targets and encourage
relational representation learning. In the following section we derive one such regulariser, Noisy
Nodes.

4 NOISY NODES

Noisy Nodes tackles the oversmoothing problem by adding a diverse noise correction target, modifying the original graph prediction problem definition in several ways. It introduces a graph corrupted
by noise, $\tilde{G} = (\tilde{V}, \tilde{E}, \tilde{g})$, where $\tilde{v}_i \in \tilde{V}$ is constructed by adding noise, $\sigma_i$, to the input nodes,
$\tilde{v}_i = v_i + \sigma_i$. The edges, $\tilde{E}$, and graph-level attribute, $\tilde{g}$, can either be uncorrupted by noise (i.e.,
$\tilde{E} = E$, $\tilde{g} = g$), calculated from the noisy nodes (for example in a nearest neighbors graph), or
corrupted independently of the nodes—these are minor choices that can be informed by the specific
problem setting.



Figure 1: Noisy Nodes mechanics during training. Input positions are corrupted with noise σ, and the training objective is the node-level difference between target positions and the noisy inputs.

Figure 2: Per-layer node latent diversity, measured by MAD on a 16 layer MPNN trained on OGBG-MOLPCBA. Noisy Nodes maintains a higher level of diversity throughout the network than competing methods.


Our method requires a noise correction target to prevent oversmoothing by enforcing diversity in the
last layers of the GNN, which can be achieved with an auxiliary denoising autoencoder loss. For
example, where the Error is defined with respect to graph-level predictions (e.g., predict the minimum
energy value of some molecular system), a second output head can be added to the GNN architecture
whose targets are the uncorrupted inputs, i.e. a denoising head. Alternatively, if the inputs and targets are in the same
real domain, as is the case for physical simulations, we can adjust the target for the noise. Figure 1
demonstrates this Noisy Nodes set-up. The auxiliary loss is weighted by a constant coefficient λ ∈ R.
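The sketch below shows one way such a combined objective could be written, reusing the illustrative graph container from Section 2 and treating the node attributes as 3D positions; the two-headed `gnn`, the squared-error losses and the default values are placeholder assumptions, not the paper's implementation.

```python
import numpy as np

def noisy_nodes_loss(gnn, graph, graph_target, noise_scale=0.02, lam=0.1,
                     rng=np.random.default_rng(0)):
    """One evaluation of a graph-level loss plus the Noisy Nodes auxiliary loss.

    `gnn` is assumed to return (graph_level_prediction, per_node_prediction);
    `lam` is the auxiliary-loss weight lambda.
    """
    clean_positions = graph.nodes
    sigma = noise_scale * rng.normal(size=clean_positions.shape)
    noisy_graph = graph._replace(nodes=clean_positions + sigma)  # corrupt inputs
    graph_pred, node_pred = gnn(noisy_graph)
    primary_loss = np.mean((graph_pred - graph_target) ** 2)      # graph-level target
    denoising_loss = np.mean((node_pred - clean_positions) ** 2)  # noise correction
    return primary_loss + lam * denoising_loss
```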

In Figure 2 we illustrate the impact of Noisy Nodes on oversmoothing by plotting the Mean Average
Distance (MAD) (Chen et al., 2020) of the residual updates of each layer of an MPNN trained on the
QM9 (Ramakrishnan et al., 2014) dataset, and compare it to the alternative methods DropEdge (Rong
et al., 2019) and DropNode (Do et al., 2021). MAD is a measure of the diversity of graph node
features, often used to quantify oversmoothing: the higher the number, the more diverse the node
features. In this plot we can see that for Noisy Nodes the node
updates remain diverse across all of the layers, whereas without Noisy Nodes diversity is lost. Further
analysis of MAD across seeds and with sorted layers can be seen in Appendix Figures 7 and 6 for
models applied to 3D point clouds.
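For reference, a minimal (unmasked) version of the MAD statistic could be computed as follows; Chen et al. (2020) also describe neighbourhood-masked variants, and which node features to measure (e.g. the residual updates, as above) is a modelling choice.

```python
import numpy as np

def mean_average_distance(node_feats, eps=1e-12):
    """Mean cosine distance between all pairs of node feature vectors."""
    normed = node_feats / (np.linalg.norm(node_feats, axis=1, keepdims=True) + eps)
    cosine_dist = 1.0 - normed @ normed.T                     # [N, N]
    off_diag = cosine_dist[~np.eye(len(node_feats), dtype=bool)]
    return off_diag.mean()  # 0 when all node features point the same way
```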

**The Graph Manifold Learning Perspective.** By using an implicit mapping from corrupted data to
clean data, the Noisy Nodes objective encourages the model to learn the manifold on which the clean
data lies; we speculate that the GNN learns to go from low probability graphs to high probability
graphs. In the autoencoder case the GNN learns the manifold of the input data. When node targets are
provided, the GNN learns the manifold of the target data (e.g. the manifold of atoms at equilibrium).
We speculate that such a manifold may include commonly repeated substructures that are useful for
downstream prediction tasks. A similar motivation can be found for denoising in (Vincent et al.,
2010; Song & Ermon, 2019).

**The Energy Perspective for Molecular Property Prediction.** Local, random distortions of the
geometry of a molecule at a local energy minimum are almost certainly higher energy configurations.
As such, a task that maps from a noised molecule to a local energy minimum is learning a mapping
from high energy to low energy. Data such as QM9 contains molecules at local minima.

Some problems have input data that is already high energy, and targets that are at equilibrium. For
these datasets we can generate new high energy states by adding noise to the inputs but keeping the
equilibrium target the same; Figure 1 demonstrates this approach. To preserve translation invariance
we use displacements between input and target, ∆; the corrected target after noise is ∆ − σ.
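A small sketch of this target construction (variable names are ours): the model sees the noised input positions and is trained on the noise-corrected displacement ∆ − σ.

```python
import numpy as np

def displacement_target(initial_pos, target_pos, sigma_scale, rng):
    """Noisy Nodes inputs/targets when inputs and targets are both 3D positions."""
    sigma = sigma_scale * rng.normal(size=initial_pos.shape)
    noisy_input = initial_pos + sigma
    delta = target_pos - initial_pos   # clean displacement Delta
    return noisy_input, delta - sigma  # corrected target Delta - sigma
```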

5 RELATED WORK

**Oversmoothing.** Recent work has aimed to understand why it is challenging to realise the benefits of
training deeper GNNs (Wu et al., 2020). Since first being noted in (Li et al., 2018), oversmoothing
has been studied extensively and regularisation techniques have been suggested to overcome it (Chen



et al., 2019; Cai & Wang, 2020; Rong et al., 2019; Zhou et al., 2020b; Yang et al., 2020; Do et al.,
2021; Zhao & Akoglu, 2020). A recent paper (Li et al., 2021) finds, as in previous work (Li et al.,
2019; 2020), that the optimal depth for some of the datasets they evaluate on is far lower (5 for OGBN-Arxiv
from the Open Graph Benchmark (Hu et al., 2020a), for example) than the 1000 layers possible.

**Denoising & Noise Models.** Training neural networks with noise has a long history (Sietsma &
Dow, 1991; Bishop, 1995). Of particular relevance are Denoising Autoencoders (Vincent et al., 2008)
in which an autoencoder is trained to map corrupted inputs ˜x to uncorrupted inputs x. Denoising
Autoencoders have found particular success as a form of pre-training for representation learning
(Vincent et al., 2010). More recently, in research applying GNNs to simulation (Sanchez-Gonzalez
et al., 2018; Sanchez-Gonzalez* et al., 2020; Pfaff et al., 2020) Gaussian noise is added during
training to input positions of a ground truth simulator to mimic the distribution of errors of the learned
simulator. Pre-training methods (Devlin et al., 2019; You et al., 2020; Thakoor et al., 2021) are
another similar approach; most similar to our method, Hu et al. (2020b) apply a reconstruction loss
to graphs with masked nodes to generate graph embeddings for use in downstream tasks. FLAG
(Kong et al., 2020) adds adversarial noise during training to input node features as a form of data
augmentation for GNNs that demonstrates improved performance for many tasks. It does not add an
additional auxiliary loss, which we find is essential for addressing oversmoothing. In other related
GNN work, Sato et al. (2021) use random input features to improve the generalisation of graph neural
networks. Adding noise to help with input node disambiguation has also been covered in (Dasoulas et al.,
2019; Loukas, 2020; Vignac et al., 2020; Murphy et al., 2019), but there is no auxiliary loss.

Finally, we take inspiration from (Vincent et al., 2008; 2010; Vincent, 2011; Song & Ermon, 2019)
which use the observation that noised data lies off the data manifold for representation learning and
generative modelling.

**Machine Learning for 3D Molecular Property Prediction.** One application of GNNs is to speed
up quantum chemistry calculations which operate on 3D positions of a molecule (Duvenaud et al.,
2015; Gilmer et al., 2017; Schütt et al., 2017; Hu et al., 2021b). Common goals are the prediction of
molecular properties (Ramakrishnan et al., 2014), forces (Chmiela et al., 2017), energies (Chanussot*
et al., 2020) and charges (Unke & Meuwly, 2019).

A common approach to embed physical symmetries is to design a network that predicts a rotation and
translation invariant energy (Schütt et al., 2017; Klicpera et al., 2020a; Liu et al., 2021). The input
features of such models include distances (Schütt et al., 2017), angles (Klicpera et al., 2020b;a) or
torsions and higher order terms (Liu et al., 2021). An alternative approach to embedding symmetries
is to design a rotation equivariant neural network that use equivariant representations (Thomas et al.,
2018; Köhler et al., 2019; Kondor et al., 2018; Fuchs et al., 2020; Batzner et al., 2021; Anderson
et al., 2019; Satorras et al., 2021).

**Machine Learning for Bond and Atom Molecular Graphs.** Predicting properties from molecular
graphs without 3D points, such as graphs of bonds and atoms, is studied separately and often used
to benchmark generic graph property prediction models such as GCNs (Hu et al., 2020a) or GATs
(Veličković et al., 2018). Models developed for 3D molecular property prediction cannot be applied
to bond and atom graphs. Common datasets that contain such data are OGBG-MOLPCBA and
OGBG-MOLHIV.

6 3D MOLECULAR PROPERTY PREDICTION EXPERIMENTS AND RESULTS

In this section we evaluate how a popular, simple model, the GNS (Sanchez-Gonzalez* et al., 2020),
performs on 3D molecular prediction tasks when combined with Noisy Nodes. The GNS was
originally developed for particle fluid simulations, but has recently been adapted for molecular
property prediction (Hu et al., 2021b). We find that without Noisy Nodes the GNS architecture is
not competitive, but by using Noisy Nodes we see improved performance comparable to that of
specialised architectures.

We made minor changes to the GNS architecture. We featurise the distance input features using radial
basis functions. We group layer weights, similar to grouped layers used in Jumper et al. (2021) for
reduced parameter counts; for a group size of n the first n layer weights are repeated, i.e. the first layer
with a group size of 10 has the same weights as the 11th, 21st and 31st layers, and so on. n contiguous



Figure 3: Validation curves, OC20 IS2RE ID. A) Without any node targets our model has poor
performance and realises no benefit from depth. B) After adding a position node loss, performance
improves as depth increases. C) As we add Noisy Nodes and parameters the model achieves SOTA,
even with 3 layers, and stops overfitting. D) Adding Noisy Nodes allows a model with even fully
shared weights to achieve SOTA.

blocks of layers are considered a single group. Finally we find that decoding the intermediate latents
and adding a loss after each group aids training stability. The decoder is shared across groups.
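Under our reading of this scheme, the mapping from a layer index to its weight set and its group can be written as follows (0-indexed; the group size and depth are placeholders):

```python
def layer_weight_index(layer: int, group_size: int) -> int:
    """Which of the `group_size` weight sets layer `layer` reuses."""
    return layer % group_size

def group_index(layer: int, group_size: int) -> int:
    """Which group a layer belongs to; an intermediate decode and loss follow each group."""
    return layer // group_size

# With a group size of 10, layer 0 shares weights with layers 10, 20 and 30
# (the 1st, 11th, 21st and 31st layers in the 1-indexed description above),
# and layers 0-9 form the first group, layers 10-19 the second, and so on.
assert [layer_weight_index(l, 10) for l in (0, 10, 20, 30)] == [0, 0, 0, 0]
assert [group_index(l, 10) for l in (0, 9, 10, 19)] == [0, 0, 1, 1]
```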

We tested this architecture on three challenging molecular property prediction benchmarks:
OC20 (Chanussot* et al., 2020) IS2RS & IS2RE, and QM9 (Ramakrishnan et al., 2014). These
benchmarks are detailed below, but as general distinctions, OC20 tasks use graphs 2-20x larger than
QM9. While QM9 always requires graph-level prediction, one of OC20’s two tasks (IS2RS) requires
node-level predictions while the other (IS2RE) requires graph-level predictions. All training details
may be found in the Appendix.

6.1 OPEN CATALYST 2020

**Dataset.** The [OC20 dataset](https://opencatalystproject.org/) (Chanussot* et al., 2020) (CC Attribution 4.0) describes the interaction
of a small molecule (the adsorbate) and a large slab (the catalyst), with total systems consisting of
20-200 atoms simulated until equilibrium is reached.

We focus on two tasks; the Initial Structure to Resulting Energy (IS2RE) task which takes the initial
structure of the simulation and predicts the final energy, and the Initial Structure to Resulting Structure
(IS2RS) which takes the initial structure and predicts the relaxed structure. Note that we train the
more common “direct” prediction task that map directly from initial positions to target in a single
forward pass, and compare against other models trained for direct prediction.

Models are evaluated on 4 held out test sets. Four canonical validation datasets are also provided.
Test sets are evaluated on a remote server hosted by the dataset authors with a very limited number of
submissions per team.

Noisy Nodes in this case consists of a random jump between the initial position and relaxed position.
During training we first sample uniformly a point from the relaxation trajectory, or interpolate
uniformly between the initial and final positions, $(v_i - \tilde{v}_i)\gamma$, $\gamma \sim U(0, 1)$, and then add i.i.d. Gaussian
noise with mean zero and σ = 0.3. The Noisy Nodes target is the relaxed structure.
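A sketch of how one such noised training state could be built under our reading of the interpolation branch (names and the uniform interpolation are our simplification; the target is expressed as a displacement from the noised input, as in the previous section):

```python
import numpy as np

def sample_noisy_state(initial_pos, relaxed_pos, sigma=0.3,
                       rng=np.random.default_rng(0)):
    """Interpolate between initial and relaxed positions, then add Gaussian noise."""
    gamma = rng.uniform()                                     # gamma ~ U(0, 1)
    interpolated = initial_pos + gamma * (relaxed_pos - initial_pos)
    noisy_input = interpolated + sigma * rng.normal(size=initial_pos.shape)
    node_target = relaxed_pos - noisy_input                   # displacement to relaxation
    return noisy_input, node_target
```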



Table 1: OC20 IS2RE Validation, eV MAE, ↓.
“GNS-Shared” indicates shared weights. “GNS-10” indicates a group size of 10.

Model Layers OOD Both OOD Adsorbate OOD Catalyst ID

GNS 50 0.59 ±0.01 0.65 ±0.01 0.55 ±0.00 0.54 ±0.00
GNS-Shared + Noisy Nodes 50 0.49 ±0.00 0.54 ±0.00 0.51 ±0.01 0.51 ±0.01
GNS + Noisy Nodes 50 0.48 ±0.00 0.53 ±0.00 0.49 ±0.01 0.48 ±0.00
GNS-10 + Noisy Nodes 100 **0.46±0.00** **0.51 ±0.00** **0.48 ±0.00** **0.47 ±0.00**

Table 2: Results OC20 IS2RE Test


eV MAE ↓

SchNet DimeNet++ SpinConv SphereNet GNS + Noisy Nodes

OOD Both 0.704 0.661 0.674 0.638 **0.465 (-24.0%)**
OOD Adsorbate 0.734 0.725 0.723 0.703 **0.565 (-22.8%)**
OOD Catalyst 0.662 0.576 0.569 0.571 **0.437 (-17.2%)**
ID 0.639 0.562 0.558 0.563 **0.422 (-18.8%)**

Average Energy within Threshold (AEwT) ↑

SchNet DimeNet++ SpinConv SphereNet GNS + Noisy Nodes

OOD Both 0.0221 0.0241 0.0233 0.0241 **0.047 (+95.8%)**
OOD Adsorbate 0.0233 0.0207 0.026 0.0229 **0.035 (+89.5%)**
OOD Catalyst 0.0294 0.0410 0.0382 0.0409 **0.080 (+95.1%)**
ID 0.0296 0.0425 0.0408 0.0447 **0.091 (+102.0%)**

We first convert to fractional coordinates (i.e. use the periodic unit cell as the basis), which renders
the predictions of our model invariant to rotations, and append the following rotation and translation
invariant vector (αβᵀ, βγᵀ, αγᵀ, |α|, |β|, |γ|) ∈ R⁶ to the edge features, where α, β, γ are the vectors
of the unit cell. This additional vector provides rotation-invariant angular and extent information to
the GNN.
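A sketch of this cell descriptor, reading the six entries as the three pairwise dot products of the cell vectors plus their three lengths (our interpretation of the notation above):

```python
import numpy as np

def cell_invariants(alpha, beta, gamma):
    """Rotation and translation invariant summary of the periodic unit cell,
    appended to every edge feature."""
    return np.array([
        np.dot(alpha, beta), np.dot(beta, gamma), np.dot(alpha, gamma),
        np.linalg.norm(alpha), np.linalg.norm(beta), np.linalg.norm(gamma),
    ])
```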

**IS2RE Results.** In Figure 3 we show how using Noisy Nodes allows the GNS to achieve state
of the art performance. Figure 3 A shows that without any auxiliary node target, an IS2RE GNS
achieves poor performance even with increased depth. The fact that increased depth does not result in
improvement supports the hypothesis that GNS suffers from oversmoothing. As we add a node level
position target in B) we see better performance, and improvement as depth increases, validating our
hypothesis that node level targets are key to addressing oversmoothing. In C) we add Noisy Nodes and
parameters, and see that the increased diversity of the node level predictions leads to very significant
improvements and SOTA, even for a shallow 3 layer network. D) demonstrates this effect is not just
due to increased parameters: SOTA can still be achieved with shared layer weights.

In Table 1 we conduct an ablation on our hyperparameters, and again demonstrate the improved
performance of using Noisy Nodes. Results were averaged over 3 seeds and standard errors on the
best obtained checkpoint show little sensitivity to initialisation. All results in the table are reported
using sampling states from trajectories. We conducted an ablation on ID comparing sampling from a
relaxation trajectory and interpolating between initial & final positions which found that interpolation
improved our score from 0.47 to 0.45.

Our best hyperparameter setting was 100 layers which achieved a 95.6% relative performance
improvement against SOTA results (Table 2) on the AEwT benchmark. Due to limited permitted test
submissions, results presented here were from one test upload of our best performing validation seed.

**IS2RS Results.** In Table 4 we see that GNS + Noisy Nodes is significantly better than the only other
reported IS2RS direct result, ForceNet, itself a GNS variant.



Table 3: OC20 IS2RS Validation, ADwT, ↑

Model Layers OOD Both OOD Adsorbate OOD Catalyst ID

GNS 50 43.0% ±0.0 38.0% ±0.0 37.5% ±0.0 40.0% ±0.0
GNS + Noisy Nodes 50 50.1%±0.0 44.3%±0.0 44.1%±0.0 46.1% ±0.0
GNS-10 + Noisy Nodes 50 52.0%±0.0 46.2%±0.0 46.1% ±0.0 48.3% ±0.0
GNS-10 + Noisy Nodes + Pos only 100 **54.3%±0.0** **48.3%±0.0** **48.2% ±0.0** **50.0% ±0.0**

Table 4: OC20 IS2RS Test, ADwT, ↑

Model OOD Both OOD Adsorbate OOD Catalyst ID

ForceNet 46.9% 37.7% 43.7% 44.9%
GNS + Noisy Nodes **52.7%** **43.9%** **48.4%** **50.9%**

Relative Improvement **+12.4%** **+16.4%** **+10.7%** **+13.3%**

6.2 QM9

**Dataset.** The QM9 benchmark (Ramakrishnan et al., 2014) contains 134k molecules in equilibrium
with up to 9 heavy C, O, N and F atoms, targeting 12 associated chemical properties (License: CC BY
4.0). We use 114k molecules for training, 10k for validation and 10k for test. All results are on the
test set. We subtract a fixed per atom energy from the target values computed from linear regression
to reduce variance. We perform training in eV units for energetic targets, and evaluate using MAE.
We summarise the results across the targets using mean standardised MAE (std. MAE) in which
MAEs are normalised by their standard deviation, and mean standardised logMAE. Std. MAE is
dominated by targets with high relative error such as ∆ϵ, whereas logMAE is sensitive to outliers
such as R². As is standard for this dataset, a model is trained separately for each target.

For this dataset we add i.i.d. Gaussian noise with mean zero and σ = 0.02 to the input atom positions.
A denoising autoencoder loss is used.

**Results.** In Table 5 we can see that adding Noisy Nodes significantly improves results by 23.1%
relative for GNS, making it competitive with specialised architectures. To understand the effect of
adding a denoising loss, we tried just adding noise and found nowhere near the same improvement
(Table 5).

A GNS-10 + Noisy Nodes with 30 layers achieves top results on 3 of the 12 targets and comparable
performance on the remainder (Table 6). On the std. MAE aggregate metric GNS + Noisy Nodes
performs better than all other reported results, showing that Noisy Nodes can make even a generic
model competitive with models hand-crafted for molecular property prediction. The same trend is
repeated for a rotation-invariant version of this network that uses the principal axes of inertia, ordered
by eigenvalue, as the coordinate frame (Table 5).
R², the electronic spatial extent, is an outlier for GNS + Noisy Nodes. Interestingly, we found that
without noise GNS-10 + Noisy Nodes achieves 0.33 for this target. We speculate that this target is
particularly sensitive to noise, and the best noise value for this target would be significantly lower
than for the dataset as a whole.

Table 5: QM9, Impact of Noisy Nodes on GNS architecture.

Layers std. MAE % Change logMAE

GNS 10 1.17 -  -5.39
GNS + Noise But No Node Target 10 1.16 -0.9% -5.32
GNS + Noisy Nodes 10 0.90 -23.1% -5.58
GNS-10 + Noisy Nodes 20 0.89 -23.9% -5.59
GNS-10 + Noisy Nodes + Invariance 30 0.92 -21.4% -5.57
GNS-10 + Noisy Nodes 30 **0.88** **-24.8%** **-5.60**



Table 6: QM9, Test MAE, Mean & Standard Deviation of 3 Seeds Reported.

Target Unit SchNet E(n)GNN DimeNet++ SphereNet PaiNN **GNS + Noisy Nodes**


µ D 0.033 0.029 0.030 0.027 **0.012** 0.025 ±0.01
α a₀³ 0.235 0.071 **0.043** 0.047 0.045 0.052 ±0.00
ϵHOMO meV 41 29.0 24.6 23.6 27.6 **20.4 ±0.2**
ϵLUMO meV 34 25.0 19.5 18.9 20.4 **18.6 ±0.4**
∆ϵ meV 63 48.0 32.6 32.3 45.7 **28.6 ±0.1**
R² a₀² **0.07** 0.11 0.33 0.29 0.07 0.70 ±0.01
ZPVE meV 1.7 1.55 1.21 **1.12** 1.28 1.16 ±0.01
U0 meV 14.00 11.00 6.32 6.26 **5.85** 7.30 ±0.12
U meV 19.00 12.00 6.28 7.33 **5.83** 7.57 ±0.03
H meV 14.00 12.00 6.53 6.40 **5.98** 7.43 ±0.06
G meV 14.00 12.00 7.56 8.0 7.35 8.30 ±0.14
cv cal/mol K 0.033 0.031 0.023 **0.022** 0.024 0.025 ±0.00

std. MAE % 1.76 1.22 0.98 0.94 1.00 **0.88**
logMAE -5.17 -5.43 -5.67 -5.68 **-5.85** -5.60

Table 7: OGBG-PCQM4M Results

Model Number of Layers Using Noisy Nodes MAE

MPNN + Virtual Node 16 Yes 0.1249 ± 0.0003
MPNN + Virtual Node 50 No 0.1236 ± 0.0001
Graphormer (Ying et al., 2021) -  -  0.1234
MPNN + Virtual Node 50 Yes **0.1218 ± 0.0001**

7 NON-SPATIAL TASKS

The previous experiments use the 3D geometries of atoms, and models that operate on 3D points.
However, the recipe of adding a denoising auxiliary loss can be applied to other graphs with different
types of features. In this section we apply Noisy Nodes to additional datasets with no 3D points,
using different GNNs, and show analogous effects to the 3D case. Details of the hyperparameters,
models and training details can be found in the appendix.

7.1 OGBG-PCQM4M

This dataset from the OGB benchmarks consists of molecular graphs made up of bonds and
atom types, with no 3D or 2D coordinates. To adapt Noisy Nodes to this setting, we randomly flip
node and edge features at a rate of 5% and add a reconstruction loss. We evaluate Noisy Nodes using
an MPNN + Virtual Node (Gilmer et al., 2017). The test set is not currently available for this dataset.
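One plausible reading of this corruption, sketched below: each integer-coded category is independently resampled uniformly at random with probability 5%, and the clean categories become the reconstruction target (names and details are ours).

```python
import numpy as np

def flip_categorical(features, num_categories, flip_rate=0.05,
                     rng=np.random.default_rng(0)):
    """Corrupt integer-coded categorical node or edge features for Noisy Nodes."""
    features = np.asarray(features)
    mask = rng.uniform(size=features.shape) < flip_rate
    random_cats = rng.integers(0, num_categories, size=features.shape)
    corrupted = np.where(mask, random_cats, features)
    return corrupted, features  # (noisy input, reconstruction target)
```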

In Table 7 we see that for this task Noisy Nodes enables a 50 layer MPNN to reach state of the art
results. Before adding Noisy Nodes, adding capacity beyond 16 layers did not improve results.

7.2 OGBG-MOLPCBA

The OGBG-MOLPCBA dataset contains molecular graphs with no 3D points, with the goal of
classifying 128 biological activities. On the OGBG-MOLPCBA dataset we again use an MPNN +
Virtual Node and random flipping noise. In Figure 4 we see that adding Noisy Nodes improves the
performance of the base model, accentuated for deeper networks. Our 16 layer MPNN improved
from 27.6% ± 0.004 to 28.1% ± 0.002 Mean Average Precision (“Mean AP”). Figure 5 demonstrates
how Noisy Nodes improves performance during training. Of the reported results, our MPNN is
most similar to GCN[1] + Virtual Node and GIN + Virtual Node (Xu et al., 2018) which report
results of 24.2% ± 0.003 and 27.03% ± 0.003 respectively. We evaluate alternative methods for

1The GCN implemented in the official OGB code base has explicit edge updates, akin to the MPNN.



Figure 4: Adding Noisy Nodes with random flipping of input categories improves the performance of MPNNs, and the effect is accentuated with depth.

Figure 5: Validation curve comparing with and without Noisy Nodes. Using Noisy Nodes leads to a consistent improvement.


oversmoothing, DropNode and DropEdge in Figure 2 and find that Noisy Nodes is more effective at
addressing oversmoothing, although all 3 methods can be combined favourably (results in the appendix).

7.3 OGBN-ARXIV

The above results use models with explicit edge updates, and are reported for graph prediction. To
test the effectiveness of Noisy Nodes with GCNs, arguably the simplest and most popular GNN,
we use OGBN-ARXIV, a citation network with the goal of predicting the arXiv category of each paper.
Adding Noisy Nodes, with noise applied as input dropout of 0.1, to a 4 layer GCN with residual connections
improves accuracy from 72.39% ± 0.002 to 72.52% ± 0.003. A baseline 4 layer GCN on
this dataset reports 71.71% ± 0.002. The SOTA for this dataset is 74.31% (Sun & Wu, 2020).

7.4 LIMITATIONS

We have not demonstrated the effectiveness of Noisy Nodes in small data regimes, which may be
important for learning from experimental data. The representation learning perspective requires
access to a local minimum configuration, which is not the case for all quantum modeling datasets. We
have also not demonstrated the combination of Noisy Nodes with more sophisticated 3D molecular
property prediction models such as DimeNet++ (Klicpera et al., 2020a); such models may require an
alternative reconstruction loss to position change, such as pairwise interatomic distances. We leave
this to future work.

Noisy Nodes requires careful selection of the form of noise, and a balance between the auxiliary and
primary losses. This can require hyperparameter tuning, and models can be sensitive to the choice
of these parameters. Noisy Nodes has a particular effect for deep GNNs, but depth is not always an
advantage. There are situations, for example molecular dynamics, which place a premium on very
fast inference time. However even at 3 layers (a comparable depth to alternative architectures) the
GNS architecture achieves state of the art validation OC20 IS2RE predictions (Figure 3). Finally,
returns diminish as depth increases, indicating depth is not the only answer (Table 1).

8 CONCLUSIONS

In this work we present Noisy Nodes, a novel regularisation technique for GNNs with particular
focus on 3D molecular property prediction. Noisy Nodes helps address common challenges around
oversmoothed node representations, and shows benefits for GNNs of all depths, in particular improving
performance for deeper GNNs. We demonstrate results on challenging 3D molecular property
prediction tasks, and some generic GNN benchmark datasets. We believe these results demonstrate
Noisy Nodes could be a useful building block for GNNs for molecular property prediction and
beyond.



9 REPRODUCIBILITY STATEMENT

Code for reproducing OGBG-PCQM4M results using Noisy Nodes is available on GitHub, and
was prepared as part of a leaderboard submission. [https://github.com/deepmind/](https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq)
[deepmind-research/tree/master/ogb_lsc/pcq.](https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq)

We provide detailed hyperparameter settings for all our experiments in the appendix, in addition to
formulae for computing the encoder and decoder stages of the GNS.

10 ETHICS STATEMENT

**Who may benefit from this work?** Molecular property prediction with GNNs is a fast-growing
area with applications across domains such as drug design, catalyst discovery, synthetic biology, and
chemical engineering. Noisy Nodes could aid models applied to these domains. We also demonstrate
on OC20 that our direct state prediction approach is nearly as accurate as learned relaxation approaches
at a small fraction of the computational cost, which may support materials design that requires many
predictions.

Finally, Noisy Nodes could be adapted and applied to many areas in which GNNs are used—for
example, knowledge base completion, physical simulation or traffic prediction.

**Potential negative impact and reflection.** Noisy Nodes sees improved performance from depth, but
the training of very deep GNNs could contribute to global warming. Care should be taken when
utilising depth, and we note that Noisy Nodes settings can be calibrated at shallow depth.

REFERENCES

Brandon M. Anderson, T. Hy, and R. Kondor. Cormorant: Covariant molecular neural networks. In
_NeurIPS, 2019._

Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David
Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones,
Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King,
Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring,
Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan
Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL
[http://github.com/deepmind.](http://github.com/deepmind)

V. Bapst, T. Keck, Agnieszka Grabska-Barwinska, C. Donner, E. D. Cubuk, S. Schoenholz, A. Obika,
Alexander W. R. Nelson, T. Back, D. Hassabis, and P. Kohli. Unveiling the predictive power of
static structure in glassy systems. Nature Physics, 16:448–454, 2020.

P. Battaglia, Jessica B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, Mateusz Malinowski,
Andrea Tacchetti, David Raposo, A. Santoro, R. Faulkner, Çaglar Gülçehre, H. Song, A. J. Ballard,
J. Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charlie Nash, Victoria Langston,
Chris Dyer, N. Heess, Daan Wierstra, P. Kohli, M. Botvinick, Oriol Vinyals, Y. Li, and Razvan
Pascanu. Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261,
2018.

Simon Batzner, T. Smidt, L. Sun, J. Mailoa, M. Kornbluth, N. Molinari, and B. Kozinsky. Se(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. ArXiv,
abs/2101.03164, 2021.

Charles M. Bishop. Training with noise is equivalent to tikhonov regularization. Neural Computation,
7:108–116, 1995.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL
[http://github.com/google/jax.](http://github.com/google/jax)



Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric
deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42,
2017.

T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens
Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
J. Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. _CoRR,_
[abs/2006.13318, 2020. URL https://arxiv.org/abs/2006.13318.](https://arxiv.org/abs/2006.13318)

Lowik Chanussot*, Abhishek Das*, Siddharth Goyal*, Thibaut Lavril*, Muhammed Shuaibi*,
Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati,
Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, and Zachary
Ulissi. Open catalyst 2020 (oc20) dataset and community challenges. ACS Catalysis, 0(0):
[6059–6072, 2020. doi: 10.1021/acscatal.0c04525. URL https://doi.org/10.1021/](https://doi.org/10.1021/acscatal.0c04525)
[acscatal.0c04525.](https://doi.org/10.1021/acscatal.0c04525)

Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. CoRR, abs/1909.03211,
[2019. URL http://arxiv.org/abs/1909.03211.](http://arxiv.org/abs/1909.03211)

Deli Chen, Yankai Lin, W. Li, Peng Li, J. Zhou, and Xu Sun. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. In AAAI, 2020.

Stefan Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, Kristof T. Schütt, and K. Müller.
Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3, 2017.

George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural
networks for node disambiguation. ArXiv, abs/1912.06058, 2019.

J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Tien Huu Do, Duc Minh Nguyen, Giannis Bekoulis, Adrian Munteanu, and N. Deligiannis. Graph convolutional neural networks with node transition probability-based message passing and dropnode
regularization. Expert Syst. Appl., 174:114711, 2021.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for
learning molecular fingerprints. In Proceedings of the 28th International Conference on Neural
_Information Processing Systems - Volume 2, NIPS’15, pp. 2224–2232, Cambridge, MA, USA,_
2015. MIT Press.

F. Fuchs, Daniel E. Worrall, Volker Fischer, and M. Welling. Se(3)-transformers: 3d roto-translation
equivariant attention networks. ArXiv, abs/2006.10503, 2020.

J. Gilmer, S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message
passing for quantum chemistry. ArXiv, abs/1704.01212, 2017.

Jonathan Godwin*, Thomas Keck*, Peter Battaglia, Victor Bapst, Thomas Kipf, Yujia Li, Kimberly
Stachenfeld, Petar Veličković, and Alvaro Sanchez-Gonzalez. Jraph: A library for graph neural
[networks in jax., 2020. URL http://github.com/deepmind/jraph.](http://github.com/deepmind/jraph)

Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020.
[URL http://github.com/deepmind/dm-haiku.](http://github.com/deepmind/dm-haiku)

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta,
and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ArXiv,
abs/2005.00687, 2020a.



Weihua Hu, Bowen Liu, Joseph Gomes, M. Zitnik, Percy Liang, V. Pande, and J. Leskovec. Strategies
for pre-training graph neural networks. arXiv: Learning, 2020b.

Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A
large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021a.

Weihua Hu, Muhammed Shuaibi, Abhishek Das, Siddharth Goyal, Anuroop Sriram, J. Leskovec, Devi
Parikh, and C. L. Zitnick. Forcenet: A graph neural network for large-scale quantum calculations.
_ArXiv, abs/2103.01436, 2021b._

John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David A.
Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu,
Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold.
_Nature, 596:583 – 589, 2021._

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. _CoRR,_
abs/1412.6980, 2015.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
_[CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.](http://arxiv.org/abs/1609.02907)_

Johannes Klicpera, Shankari Giri, Johannes T. Margraf, and Stephan Günnemann. Fast
and uncertainty-aware directional message passing for non-equilibrium molecules. _CoRR,_
[abs/2011.14115, 2020a. URL https://arxiv.org/abs/2011.14115.](https://arxiv.org/abs/2011.14115)

Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular
graphs. ArXiv, abs/2003.03123, 2020b.

Risi Kondor, Hy Truong Son, Horace Pan, Brandon M. Anderson, and Shubhendu Trivedi. Covariant
[compositional networks for learning graphs. CoRR, abs/1801.02144, 2018. URL http://](http://arxiv.org/abs/1801.02144)
[arxiv.org/abs/1801.02144.](http://arxiv.org/abs/1801.02144)

Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, G. Taylor,
and T. Goldstein. Flag: Adversarial data augmentation for graph neural networks. _ArXiv,_
abs/2010.09891, 2020.

Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: sampling configurations for multi-body
systems with symmetric energies, 2019.

G. Li, M. Müller, Ali K. Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns?
_2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9266–9275, 2019._

Guohao Li, C. Xiong, Ali K. Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper
gcns. ArXiv, abs/2006.07739, 2020.

Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks
[with 1000 layers. CoRR, abs/2106.07476, 2021. URL https://arxiv.org/abs/2106.](https://arxiv.org/abs/2106.07476)
[07476.](https://arxiv.org/abs/2106.07476)

Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks
for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.

Yi Liu, Limei Wang, Meng Liu, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. Spherical message
passing for 3d graph networks. arXiv preprint arXiv:2102.05013, 2021.

Andreas Loukas. How hard is to distinguish graphs with graph neural networks? arXiv: Learning,
2020.



Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak A. Rao, and Bruno Ribeiro. Relational
pooling for graph representations. In ICML, 2019.

T. Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and P. Battaglia. Learning mesh-based simulation with graph networks. ArXiv, abs/2010.03409, 2020.

R. Ramakrishnan, Pavlo O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures
and properties of 134 kilo molecules. Scientific Data, 1, 2014.

Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. The truly deep graph convolutional
[networks for node classification. CoRR, abs/1907.10903, 2019. URL http://arxiv.org/](http://arxiv.org/abs/1907.10903)
[abs/1907.10903.](http://arxiv.org/abs/1907.10903)

Alvaro Sanchez-Gonzalez, N. Heess, Jost Tobias Springenberg, J. Merel, Martin A. Riedmiller,
R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control.
_ArXiv, abs/1806.01242, 2018._

Alvaro Sanchez-Gonzalez*, Jonathan Godwin*, Tobias Pfaff*, Rex Ying*, Jure Leskovec, and Peter
Battaglia. Learning to simulate complex physics with graph networks. In Hal Daumé III and Aarti
Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119
of Proceedings of Machine Learning Research, pp. 8459–8468. PMLR, 13–18 Jul 2020. URL
[http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html.](http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html)

R. Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks.
In SDM, 2021.

Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks,
2021.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. doi:
10.1109/TNN.2008.2005605.

Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, A. Tkatchenko,
and K. Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum
interactions. In NIPS, 2017.

Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics.
_Machine Learning: Science and Technology, 2(2):021001, Jan 2021. ISSN 2632-2153. doi:_
[10.1088/2632-2153/abbf9a. URL http://dx.doi.org/10.1088/2632-2153/abbf9a.](http://dx.doi.org/10.1088/2632-2153/abbf9a)

J. Sietsma and Robert J. F. Dow. Creating artificial neural networks that generalize. Neural Networks,
4:67–79, 1991.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
_ArXiv, abs/1907.05600, 2019._

Chuxiong Sun and Guoshi Wu. Adaptive graph diffusion networks with hop-wise attention. ArXiv,
abs/2012.15024, 2020.

Shantanu Thakoor, C. Tallec, M. G. Azar, R. Munos, Petar Veličković, and Michal Valko. Bootstrapped representation learning on graphs. ArXiv, abs/2102.06514, 2021.

Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick
Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point
[clouds. CoRR, abs/1802.08219, 2018. URL http://arxiv.org/abs/1802.08219.](http://arxiv.org/abs/1802.08219)

Oliver T. Unke and Markus Meuwly. Physnet: A neural network for predicting energies, forces, dipole
moments, and partial charges. Journal of Chemical Theory and Computation, 15(6):3678–3693,
[May 2019. ISSN 1549-9626. doi: 10.1021/acs.jctc.9b00181. URL http://dx.doi.org/10.](http://dx.doi.org/10.1021/acs.jctc.9b00181)
[1021/acs.jctc.9b00181.](http://dx.doi.org/10.1021/acs.jctc.9b00181)

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.



Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks, 2018.

Clément Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph
neural networks with structural message-passing. arXiv: Learning, 2020.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011.

Pascal Vincent, H. Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and
composing robust features with denoising autoencoders. In ICML ’08, 2008.

Pascal Vincent, H. Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked
denoising autoencoders: Learning useful representations in a deep network with a local denoising
criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A
comprehensive survey on graph neural networks. IEEE transactions on neural networks and
_learning systems, 2020._

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
[networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826.](http://arxiv.org/abs/1810.00826)

Chaoqi Yang, Ruijie Wang, Shuochao Yao, Shengzhong Liu, and Tarek Abdelzaher. Revisiting
“over-smoothing” in deep GCNs. arXiv preprint arXiv:2003.13663, 2020.

Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and
Tie-Yan Liu. Do transformers really perform bad for graph representation? ArXiv, abs/2106.05234,
2021.

Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph
contrastive learning with augmentations. ArXiv, abs/2010.13902, 2020.

L. Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. ArXiv, abs/1909.12223,
2020.

Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang,
Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications.
_AI Open, 1:57–81, 2020a._

Kuangqi Zhou, Yanfei Dong, Wee Sun Lee, Bryan Hooi, Huan Xu, and Jiashi Feng. Effective
[training strategies for deep graph neural networks. CoRR, abs/2006.07107, 2020b. URL https:](https://arxiv.org/abs/2006.07107)
[//arxiv.org/abs/2006.07107.](https://arxiv.org/abs/2006.07107)

A APPENDIX

The following sections include details on the training setup, hyper-parameters, and input processing, as well as additional experimental results.

A.1 ADDITIONAL METRICS FOR OPEN CATALYST IS2RS TEST SET

Relaxation approaches to IS2RS minimise forces with respect to positions, with the expectation that the forces at the minimum are close to zero. One measure of such a model's success is therefore to evaluate the forces at the converged structure using ground-truth Density Functional Theory (DFT) calculations and check how close they are to zero. OC20 (Chanussot* et al., 2020) provides two such metrics on the IS2RS test set: Force below Threshold (FbT), the percentage of structures whose forces are below 0.05 eV/Å, and Average Force below Threshold (AFbT), which is FbT averaged over multiple thresholds.
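To make the two metrics concrete, the sketch below computes FbT and AFbT from per-atom DFT forces evaluated at predicted relaxed structures; the array shapes, function names, and threshold grid are illustrative assumptions rather than the exact OC20 evaluation code.

```python
import numpy as np

def force_below_threshold(forces_per_structure, threshold=0.05):
    """FbT: percentage of structures whose largest per-atom force magnitude
    (eV/Angstrom) is below `threshold`. `forces_per_structure` is a list of
    (num_atoms, 3) arrays of DFT forces at the predicted relaxed positions."""
    below = [np.linalg.norm(f, axis=-1).max() < threshold
             for f in forces_per_structure]
    return 100.0 * float(np.mean(below))

def average_force_below_threshold(forces_per_structure,
                                  thresholds=np.linspace(0.01, 0.4, 40)):
    """AFbT: FbT averaged over a range of thresholds (this particular grid is
    an assumption; OC20 defines its own set of thresholds)."""
    return float(np.mean([force_below_threshold(forces_per_structure, t)
                          for t in thresholds]))
```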

The OC20 project computes these test DFT calculations on the evaluation server and reports a summary result over all IS2RS position predictions. Such calculations take 10-12 hours and are not available for the validation set, so we are unable to analyse the results in Tables 8 and 9 in any further detail. Before application to catalyst screening, further work may be needed for direct approaches to ensure forces do not explode when atoms are too close together.


-----

Table 8: OC20 IS2RS Test, Average Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.09% | 0.00% | 0.29% | 0.54% |

Table 9: OC20 IS2RS Test, Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.0% | 0.0% | 0.0% | 0.0% |

A.2 MORE DETAILS ON GNS ADAPTATIONS FOR MOLECULAR PROPERTY PREDICTION.

**Encoder.**

The node features are a learned embedding lookup of the atom type, and in the case of OC20 two
additional binary features representing whether the atom is part of the adsorbate or catalyst and
whether the atom remains fixed during the quantum chemistry simulation.

The edge features $e_k$ are the distances $|d|$ featurised using $c$ radial Bessel basis functions, $\tilde{e}_{\mathrm{RBF},c}(|d|) = \sqrt{\tfrac{2}{R}}\,\tfrac{\sin(c\pi|d|/R)}{|d|}$, and the edge vector displacements $d$ normalised by the edge distance:

$$e_k = \mathrm{Concat}\left(\tilde{e}_{\mathrm{RBF},1}(|d|), \ldots, \tilde{e}_{\mathrm{RBF},c}(|d|), \frac{d}{|d|}\right)$$

Our conversion to fractional coordinates only applied to the vector quantities, i.e. $\frac{d}{|d|}$.
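For concreteness, a minimal sketch of this edge featurisation is shown below; the cutoff `R`, the number of basis functions, and the function names are illustrative assumptions rather than our exact implementation.

```python
import jax.numpy as jnp

def bessel_rbf(dist, num_rbf, cutoff):
    """Radial Bessel basis: e_RBF,c(|d|) = sqrt(2/R) * sin(c * pi * |d| / R) / |d|."""
    c = jnp.arange(1, num_rbf + 1)  # basis indices c = 1..num_rbf
    return jnp.sqrt(2.0 / cutoff) * jnp.sin(c * jnp.pi * dist / cutoff) / dist

def edge_features(displacement, num_rbf=4, cutoff=6.0):
    """Concatenate the Bessel-featurised distance with the unit displacement d/|d|."""
    dist = jnp.linalg.norm(displacement)  # |d|
    return jnp.concatenate([bessel_rbf(dist, num_rbf, cutoff),
                            displacement / dist])
```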

**Decoder**


The decoder consists of two parts: a graph-level decoder, which predicts a single output for the input graph, and a node-level decoder, which predicts individual outputs for each node. The graph-level decoder implements the following equation:

$$y = W^{\text{Proc}} \sum_{i=1}^{|V|} \text{MLP}^{\text{Proc}}(a_i^{\text{Proc}}) + b^{\text{Proc}} + W^{\text{Enc}} \sum_{i=1}^{|V|} \text{MLP}^{\text{Enc}}(a_i^{\text{Enc}}) + b^{\text{Enc}}$$

where $a_i^{\text{Proc}}$ are node latents from the Processor, $a_i^{\text{Enc}}$ are node latents from the Encoder, $W^{\text{Enc}}$ and $W^{\text{Proc}}$ are linear layers, $b^{\text{Enc}}$ and $b^{\text{Proc}}$ are biases, and $|V|$ is the number of nodes. The node-level decoder is simply an MLP applied to each $a_i^{\text{Proc}}$, which predicts $a_i^{\Delta}$.
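A minimal functional sketch of this graph-level readout is given below; the MLPs, weight matrices, and biases are assumed to be provided by the caller, so this mirrors the equation rather than our exact Haiku modules.

```python
import jax.numpy as jnp

def graph_level_decode(a_proc, a_enc, mlp_proc, mlp_enc,
                       w_proc, b_proc, w_enc, b_enc):
    """y = W_Proc sum_i MLP_Proc(a_i^Proc) + b_Proc
         + W_Enc  sum_i MLP_Enc(a_i^Enc)  + b_Enc

    a_proc, a_enc: (num_nodes, latent) node latents from Processor / Encoder."""
    pooled_proc = jnp.sum(mlp_proc(a_proc), axis=0)  # sum over the |V| nodes
    pooled_enc = jnp.sum(mlp_enc(a_enc), axis=0)
    return w_proc @ pooled_proc + b_proc + w_enc @ pooled_enc + b_enc
```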

A.3 MORE DETAILS ON MPNN FOR OGBG-PCQM4M AND OGBG-MOLPCBA

Our MPNN follows the blueprint of Gilmer et al. (2017). We use $\vec{h}_v^{(t)}$ to denote the latent vector of node $v$ at message passing step $t$, and $\vec{m}_{uv}^{(t)}$ to denote the computed message vector for the edge between nodes $u$ and $v$ at message passing step $t$. We define the update functions as:

$$\vec{m}_{uv}^{(t+1)} = \psi_{t+1}\left(\vec{h}_u^{(t)}, \vec{h}_v^{(t)}, \vec{m}_{uv}^{(t)}\right) + \vec{m}_{uv}^{(t-1)} \tag{1}$$

$$\vec{h}_u^{(t+1)} = \phi_{t+1}\left(\vec{h}_u^{(t)}, \sum_{v \in N_u} \vec{m}_{vu}^{(t+1)}, \sum_{v \in N_u} \vec{m}_{uv}^{(t+1)}\right) + \vec{h}_u^{(t)} \tag{2}$$

where the message function $\psi_{t+1}$ and the update function $\phi_{t+1}$ are MLPs. We use a "Virtual Node", connected to all other nodes, to enable long-range communication. Our readout function is an MLP. No spatial features are used.
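The sketch below spells out one message passing step corresponding to Equations 1 and 2, with ψ and φ passed in as MLP callables; the explicit sender/receiver bookkeeping and segment sums are illustrative rather than our exact Jraph implementation.

```python
import jax
import jax.numpy as jnp

def mpnn_step(h, m_prev, m_prev_prev, senders, receivers, psi, phi):
    """One message passing step (Eqs. 1 and 2).

    h:           (num_nodes, latent) node latents h^(t)
    m_prev:      (num_edges, latent) messages m^(t)
    m_prev_prev: (num_edges, latent) messages m^(t-1), added as a residual
    senders, receivers: (num_edges,) int arrays for directed edges u -> v
    psi, phi:    MLP callables returning latent-sized vectors
    """
    num_nodes = h.shape[0]
    # Eq. 1: m^(t+1) = psi(h_u, h_v, m^(t)) + m^(t-1)
    m_new = psi(jnp.concatenate([h[senders], h[receivers], m_prev], axis=-1))
    m_new = m_new + m_prev_prev
    # Eq. 2: aggregate incoming and outgoing messages per node, then update.
    incoming = jax.ops.segment_sum(m_new, receivers, num_nodes)
    outgoing = jax.ops.segment_sum(m_new, senders, num_nodes)
    h_new = phi(jnp.concatenate([h, incoming, outgoing], axis=-1)) + h
    return h_new, m_new
```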


-----

Figure 6: GNS Unsorted MAD per Layer
Averaged Over 3 Random Seeds. Evidence
of oversmoothing is clear. Model trained on
QM9.


Figure 7: GNS Sorted MAD per Layer Averaged Over 3 Random Seeds. The trend
is clearer when the MAD values have been
sorted. Model trained on QM9.


A.4 EXPERIMENT SETUP FOR 3D MOLECULAR MODELING

**Open Catalyst.** All training experiments were run on a cluster of TPU devices. For the Open Catalyst experiments, each individual run (i.e. a single random seed) used 8 TPU devices across 2 hosts (4 per host) for training, and 4 V100 GPU devices for evaluation (1 per dataset).

Each Open Catalyst experiment was run until convergence, for up to 200 hours. Our best result, the large 100-layer model, required 7 days of training in this setting. Each configuration, including all ablation settings, was run at least 3 times with this hardware configuration.

We further note that making effective use of our regulariser requires sweeping noise values. These sweeps are dataset-dependent and can be carried out using a small number of message passing steps.

**QM9.** Experiments were also run on TPU devices. Each seed used 8 TPU devices on a single host for training, and 2 V100 GPU devices for evaluation. Each QM9 target was trained for 12-24 hours per experiment.

Following Klicpera et al. (2020b) we define std. MAE as:

$$\text{std. MAE} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \frac{\left|f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)}\right|}{\sigma_m}$$

and logMAE as:

$$\text{logMAE} = \frac{1}{M} \sum_{m=1}^{M} \log\left( \frac{1}{N} \sum_{i=1}^{N} \frac{\left|f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)}\right|}{\sigma_m} \right)$$

with target index $m$, number of targets $M = 12$, dataset size $N$, ground truth values $\hat{t}^{(m)}$, model $f_\theta^{(m)}$, inputs $X_i$ and $z_i$, and standard deviation $\sigma_m$ of $\hat{t}^{(m)}$.
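A small sketch computing both metrics from model predictions is given below; the array shapes and the placement of $\sigma_m$ mirror the definitions above and are intended as illustration only.

```python
import numpy as np

def std_mae_and_log_mae(preds, targets):
    """preds, targets: (N, M) arrays for N molecules and M = 12 targets."""
    abs_err = np.abs(preds - targets)          # |f_theta(X_i, z_i) - t_i|
    sigma = targets.std(axis=0)                # per-target std dev sigma_m
    mae_per_target = abs_err.mean(axis=0)      # (1/N) sum_i |...|
    std_mae = np.mean(mae_per_target / sigma)  # mean of standardised MAEs
    log_mae = np.mean(np.log(mae_per_target / sigma))
    return std_mae, log_mae
```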

A.5 OVER SMOOTHING ANALYSIS FOR GNS

In addition to Figure 2, we repeat the analysis with the MAD averaged over 3 random seeds (Figure 7). Furthermore, we remove the sorting of layers by MAD value (Figure 6) and find that the trend holds.

A.6 NOISE ABLATIONS FOR OGBG-MOLPCBA

We conduct an ablation on the random flipping noise for OGBG-MOLPCBA with an 8-layer MPNN + Virtual Node, and find that our model is not very sensitive to the noise value (Table 10), though performance degrades for flip probabilities above 0.1.
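For reference, a minimal sketch of the flipping noise used here is shown below; resampling a category uniformly (rather than excluding the current one) is an assumption made for brevity, not necessarily our exact implementation.

```python
import jax.numpy as jnp
import jax.random as jrandom

def flip_node_categories(key, node_categories, num_classes, flip_prob=0.05):
    """With probability `flip_prob`, replace each node's categorical feature
    with a uniformly sampled class (illustrative version of the flip noise)."""
    key_mask, key_cat = jrandom.split(key)
    flip = jrandom.bernoulli(key_mask, flip_prob, node_categories.shape)
    random_cats = jrandom.randint(key_cat, node_categories.shape, 0, num_classes)
    return jnp.where(flip, random_cats, node_categories)
```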


-----

| Flip Probability | Mean AP |
|---|---|
| 0.01 | 27.8% ± 0.002 |
| 0.03 | 27.9% ± 0.003 |
| 0.05 | 28.1% ± 0.001 |
| 0.1 | 28.0% ± 0.003 |
| 0.2 | 27.7% ± 0.002 |

Table 10: OGBG-MOLPCBA Noise Ablation

| | Mean AP |
|---|---|
| MPNN Without DropEdge | 27.4% ± 0.002 |
| MPNN With DropEdge | 27.5% ± 0.001 |
| MPNN + DropEdge + Noisy Nodes | 27.8% ± 0.002 |

Table 11: OGBG-MOLPCBA DropEdge Ablation

A.7 DROPEDGE & DROPNODE ABLATIONS FOR OGBG-MOLPCBA

We conduct an ablation with our 16-layer MPNN using DropEdge at a rate of 0.1 as an alternative approach to mitigating oversmoothing, and find that it does not improve performance on ogbg-molpcba (Table 11); similarly, we find that DropNode (Table 12) does not improve performance. In addition, we find that these two methods do not combine well with each other, reaching a performance of 27.0% ± 0.003. However, both methods can be combined advantageously with Noisy Nodes.

We also measure the MAD of the node latents at each layer and find that Noisy Nodes is indeed more effective at addressing oversmoothing (Figure 8).

A.8 TRAINING CURVES FOR OC20 NOISY NODES ABLATIONS DEMONSTRATING
OVERFITTING

See Figure 9.

| | Mean AP |
|---|---|
| MPNN With DropNode | 27.5% ± 0.001 |
| MPNN Without DropNode | 27.5% ± 0.004 |
| MPNN + DropNode + Noisy Nodes | 28.2% ± 0.005 |

Table 12: OGBG-MOLPCBA DropNode Ablation


-----

Figure 8: Comparison of the effect of techniques to address oversmoothing on MPNNs. Whilst some effect can be seen from DropEdge and DropNode, Noisy Nodes is significantly better at preserving per-node diversity.

A.9 PSEUDOCODE FOR 3D MOLECULAR PREDICTION TRAINING STEP

**Algorithm 1: Noisy Nodes Training Step**
G = (V, E, g) // Input graph
G̃ = G // Initialise noisy graph
λ // Noisy Nodes weight
**if** not_provided(V′) **then**
V′ ← V
**end**
**if** predict_differences **then**
∆ = {v′ᵢ − vᵢ | i ∈ 1, . . ., |V|}
**end**
**for each** i ∈ 1, . . ., |V| **do**
σᵢ = sample_node_noise(shape_of(vᵢ));
ṽᵢ = vᵢ + σᵢ;
**if** predict_differences **then**
∆̃ᵢ = ∆ᵢ − σᵢ;
**end**
**end**
Ẽ = recompute_edges(Ṽ);
Ĝ′ = GNN(G̃);
**if** predict_differences **then**
V′ ← ∆̃;
**end**
Loss = λ NoisyNodesLoss(Ĝ′, V′) + PrimaryLoss(Ĝ′, V′);
Loss.minimise()
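The Python sketch below mirrors Algorithm 1 for the 3D (position noise) case; the `gnn` and `recompute_edges` callables, the graph container, and the use of an energy-style mean squared error for the primary loss are assumptions made for illustration, not our exact training code.

```python
import jax.numpy as jnp
import jax.random as jrandom

def noisy_nodes_objective(key, graph, node_targets, gnn, recompute_edges,
                          noise_scale=0.3, lam=1.0, predict_differences=True):
    """Noisy Nodes training objective for 3D inputs (cf. Algorithm 1).

    graph:        GraphsTuple-like container; `nodes` holds positions V
    node_targets: target positions V' (None defaults them to the inputs)
    gnn:          callable returning (per-node predictions, graph prediction)
    """
    positions = graph.nodes
    if node_targets is None:
        node_targets = positions                       # V' <- V
    # Corrupt positions with Gaussian noise and recompute the radius graph.
    noise = noise_scale * jrandom.normal(key, positions.shape)
    noisy_positions = positions + noise
    noisy_graph = graph._replace(nodes=noisy_positions,
                                 edges=recompute_edges(noisy_positions))
    node_preds, graph_pred = gnn(noisy_graph)
    # When predicting differences, the per-node target becomes (V' - V) - noise.
    if predict_differences:
        node_targets = (node_targets - positions) - noise
    noisy_nodes_loss = jnp.mean((node_preds - node_targets) ** 2)
    primary_loss = jnp.mean((graph_pred - graph.globals) ** 2)
    return primary_loss + lam * noisy_nodes_loss
```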


-----

Figure 9: Training curves to accompany Figure 3. Even as validation performance worsens, the training loss continues to decrease, indicating overfitting.


-----

Table 13: Open Catalyst training parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 5e5 |
| Warm up start learning rate | 1e−5 |
| Warm up/cosine max learning rate | 1e−4 |
| Cosine cycle length | 5e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 1024 |
| Max edges in batch | 12800 |
| Max graphs in batch | 10 |
| MLP number of layers | 3 |
| MLP hidden sizes | 512 |
| Number of Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 50 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian (µ = 0, σ = 0.3) |
| Parameter update | Exponential moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 1.0 |

A.10 TRAINING DETAILS

Our code base is implemented in JAX using Haiku and Jraph for GNNs, and Optax for training
(Bradbury et al., 2018; Babuschkin et al., 2020; Godwin* et al., 2020; Hennigan et al., 2020). Model
selection used early stopping.

All results are reported as an average of 10 random seeds. OGBG-PCQM4M & OGBG-MOLPCBA were trained with 16 TPUs and evaluated with a single V100 GPU. OGBN-Arxiv was trained and evaluated with a single TPU.

**3D Molecular Prediction**

We minimise the mean squared error loss on mean and standard deviation normalised targets and use
the Adam (Kingma & Ba, 2015) optimiser with warmup and cosine decay. For OC20 IS2RE energy
prediction we subtract a learned reference energy, computed using an MLP with atom types as input.

For the GNS model, the node and edge latents as well as the MLP hidden layers were sized 512, with 3 layers per MLP and shifted softplus activations throughout. OC20 & QM9 models were trained on 8 TPU devices and evaluated on a single V100 GPU. We provide the full set of hyper-parameters and computational resources used for each dataset separately in the Appendix. All noise levels were determined by sweeping a small range of values (≈ 10) informed by the noised feature covariance.

**Non Spatial Tasks**

A.11 HYPER-PARAMETERS

**Open Catalyst.** We list the hyper-parameters used to train the default Open Catalyst experiment. If not specified otherwise (e.g. in ablations of these parameters), experiments were run with this configuration.


-----

Table 14: QM9 training parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 3e−7 |
| Warm up/cosine max learning rate | 1e−4 |
| Cosine cycle length | 2e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 256 |
| Max edges in batch | 4096 |
| Max graphs in batch | 8 |
| MLP number of layers | 3 |
| MLP hidden sizes | 1024 |
| Number of Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 10 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian (µ = 0, σ = 0.02) |
| Parameter update | Exponential moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 0.1 |

"Dynamic batch size" refers to constructing batches by specifying maximum node, edge, and graph counts (as opposed to only graph counts) to better balance computational load. Batches are constructed until one of the limits is reached.
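A minimal sketch of this batching rule is shown below; the `num_nodes`/`num_edges` interface on each graph is an assumption made for illustration, not our actual data pipeline.

```python
def dynamic_batches(graphs, max_nodes, max_edges, max_graphs):
    """Greedily pack graphs until adding the next one would exceed any limit."""
    batch, n_nodes, n_edges = [], 0, 0
    for g in graphs:
        over_limit = (n_nodes + g.num_nodes > max_nodes or
                      n_edges + g.num_edges > max_edges or
                      len(batch) + 1 > max_graphs)
        if batch and over_limit:
            yield batch
            batch, n_nodes, n_edges = [], 0, 0
        batch.append(g)
        n_nodes += g.num_nodes
        n_edges += g.num_edges
    if batch:
        yield batch
```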

Parameter updates were smoothed using an EMA, with the decay value at the current training step computed as decay = min(decay, (1.0 + step)/(10.0 + step)). As discussed in the evaluation, best results on Open Catalyst were obtained using a 100-layer network with group size 10.
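A short sketch of this EMA schedule applied to a parameter tree is shown below; it uses plain `jax.tree_util` rather than our actual Optax configuration.

```python
import jax

def ema_update(ema_params, new_params, step, max_decay=0.9999):
    """EMA smoothing with decay = min(max_decay, (1 + step) / (10 + step))."""
    decay = min(max_decay, (1.0 + step) / (10.0 + step))
    return jax.tree_util.tree_map(
        lambda e, p: decay * e + (1.0 - decay) * p, ema_params, new_params)
```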

**QM9.** Table 14 lists the QM9 hyper-parameters, which primarily reflect the smaller dataset and geometries with fewer long-range interactions. For U0, U, H and G we use a slightly larger number of graphs per batch (16) and a smaller position loss coefficient of 0.01.

**OGBG-PCQM4M.** Table 15 provides the hyper-parameters for OGBG-PCQM4M.

**OGBG-MOLPCBA.** Table 16 provides the hyper-parameters for the OGBG-MOLPCBA experiments.

**OGBN-Arxiv.** Table 17 provides the hyper-parameters for the OGBN-Arxiv experiments.


-----

Table 15: OGBG-PCQM4M Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 5e4 |
| Warm up start learning rate | 1e−5 |
| Warm up/cosine max learning rate | 1e−4 |
| Cosine cycle length | 5e5 |
| Loss type | Mean absolute error |
| Reconstruction loss type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Node/Edge latent vector sizes | 512 |
| Noisy Nodes category flip rate | 0.05 |
| Parameter update | Exponential moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction loss coefficient | 0.1 |

Table 16: OGBG-MOLPCBA Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 1e−5 |
| Warm up/cosine max learning rate | 1e−4 |
| Cosine cycle length | 1e5 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 512 |
| DropNode rate | 0.1 |
| Dropout rate | 0.1 |
| Noisy Nodes category flip rate | 0.05 |
| Parameter update | Exponential moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction loss coefficient | 0.1 |


-----

Table 17: OGBN-Arxiv Training Parameters.

| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 50 |
| Warm up start learning rate | 1e−5 |
| Warm up/cosine max learning rate | 1e−3 |
| Cosine cycle length | 12,000 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Mean Squared Error |
| Batch size | Full graph |
| MLP number of layers | 1 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 256 |
| Dropout rate | 0.5 |
| Noisy Nodes input dropout | 0.05 |
| Reconstruction loss coefficient | 0.1 |


-----