|
# SIMPLE GNN REGULARISATION FOR 3D MOLECULAR PROPERTY PREDICTION & BEYOND |
|
|
|
**Jonathan Godwin, Michael Schaarschmidt, Alexander Gaunt,** |
|
**Alvaro Sanchez-Gonzalez, Yulia Rubanova, Petar Veličković,**
|
**James Kirkpatrick & Peter Battaglia** |
|
DeepMind, London |
|
{jonathangodwin}@deepmind.com |
|
|
|
ABSTRACT |
|
|
|
In this paper we show that simple noisy regularisation can be an effective way |
|
to address oversmoothing. We argue that regularisers addressing oversmoothing |
|
should both penalise node latent similarity and encourage meaningful node representations. From this observation we derive “Noisy Nodes”, a simple technique in |
|
which we corrupt the input graph with noise, and add a noise correcting node-level |
|
loss. The diverse node level loss encourages latent node diversity, and the denoising |
|
objective encourages graph manifold learning. Our regulariser applies well-studied |
|
methods in simple, straightforward ways which allow even generic architectures to |
|
overcome oversmoothing and achieve state of the art results on quantum chemistry |
|
tasks, and improve results significantly on Open Graph Benchmark (OGB) datasets. |
|
Our results suggest Noisy Nodes can serve as a complementary building block in |
|
the GNN toolkit. |
|
|
|
1 INTRODUCTION |
|
|
|
Graph Neural Networks (GNNs) are a family of neural networks that operate on graph structured data |
|
by iteratively passing learned messages over the graph’s structure (Scarselli et al., 2009; Bronstein |
|
et al., 2017; Gilmer et al., 2017; Battaglia et al., 2018; Shlomi et al., 2021). While Graph Neural |
|
Networks have demonstrated success in a wide variety of tasks (Zhou et al., 2020a; Wu et al., 2020; |
|
Bapst et al., 2020; Schütt et al., 2017; Klicpera et al., 2020a), it has been proposed that in practice |
|
“oversmoothing” limits their ability to benefit from overparametrization. |
|
|
|
Oversmoothing is a phenomenon where a GNN’s latent node representations become increasingly indistinguishable over successive steps of message passing (Chen et al., 2019). Once these representations
|
are oversmoothed, the relational structure of the representation is lost, and further message-passing |
|
cannot improve expressive capacity. We argue that the challenge of overcoming oversmoothing is twofold: first, to encourage node latent diversity; second, to encourage the diverse
|
node latents to encode meaningful graph representations. Here we propose a simple noise regulariser, |
|
Noisy Nodes, and demonstrate how it overcomes these challenges across a range of datasets and |
|
architectures, achieving top results on OC20 IS2RS & IS2RE direct, QM9 and OGBG-PCQM4Mv1. |
|
|
|
Our “Noisy Nodes” method is a simple technique for regularising GNNs and associated training |
|
procedures. During training, our noise regularisation approach corrupts the input graph’s attributes |
|
with noise, and adds a per-node noise correction term. We posit that our Noisy Nodes approach is |
|
effective because the model is rewarded for maintaining and refining distinct node representations |
|
through message passing to the final output, which causes it to resist oversmoothing. Like denoising |
|
autoencoders, it encourages the model to explicitly learn the manifold on which the uncorrupted input |
|
graph’s features lie, analogous to a form of representation learning. When applied to 3D molecular |
|
prediction tasks, it encourages the model to distinguish between low and high energy states. We |
|
find that applying Noisy Nodes reduces oversmoothing for shallower networks, and allows us to see |
|
improvements with added depth, even on tasks for which depth was assumed to be unhelpful. |
|
|
|
This study’s approach is to investigate the combination of Noisy Nodes with generic, popular baseline |
|
GNN architectures. For 3D Molecular prediction we use a standard architecture working on 3D point |
|
clouds developed for particle fluid simulations, the Graph Net Simulator (GNS) (Sanchez-Gonzalez* |
|
|
|
|
|
|
|
|
et al., 2020), which has also been used for molecular property prediction (Hu et al., 2021b). Without |
|
using Noisy Nodes the GNS is not a competitive model, but using Noisy Nodes allows the GNS |
|
to achieve top performance on three 3D molecular property prediction tasks: improving over previous work by 43% on the OC20 IS2RE direct task and by 12% on OC20 IS2RS direct, and achieving top results on 3 out of 12 of the QM9 tasks. For non-spatial GNN benchmarks we test an MPNN (Gilmer et al., 2017) on
|
OGBG-MOLPCBA and OGBG-PCQM4M (Hu et al., 2021a) and again see significant improvements. |
|
Finally, we applied Noisy Nodes to a GCN (Kipf & Welling, 2016), arguably the most popular and |
|
simple GNN, trained on OGBN-Arxiv and see similar results. These results suggest Noisy Nodes can |
|
serve as a complementary GNN building block. |
|
|
|
2 PRELIMINARIES: GRAPH PREDICTION PROBLEM |
|
|
|
Let $G = (V, E, g)$ be an input graph. The nodes are $V = \{v_1, \ldots, v_{|V|}\}$, where $v_i \in \mathbb{R}^{d_v}$. The directed, attributed edges are $E = \{e_1, \ldots, e_{|E|}\}$: each edge $e_k = (s_k, r_k, e_k)$ includes a sender node index, a receiver node index, and an edge attribute, respectively, where $s_k, r_k \in \{1, \ldots, |V|\}$ and $e_k \in \mathbb{R}^{d_e}$. The graph-level property is $g \in \mathbb{R}^{d_g}$.
|
|
|
The goal is to predict a target graph, $G'$, with the same structure as $G$, but different node, edge, and/or graph-level attributes. We denote $\hat{G}'$ as a model’s prediction of $G'$. Some error metric defines the quality of $\hat{G}'$ with respect to the target $G'$, $\mathrm{Error}(\hat{G}', G')$, which the training loss terms are defined to optimize. In this paper the phrase “message passing steps” is synonymous with “GNN layers”.
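To make this notation concrete, such a graph can be stored as a simple record of arrays. The sketch below is purely illustrative; the class and field names are ours (loosely following common GNN libraries) rather than anything defined in this paper, and indices are zero-based as is usual in code.

```python
from typing import NamedTuple
import numpy as np

class Graph(NamedTuple):
    """Illustrative container matching the notation above (not a library API)."""
    nodes: np.ndarray      # [|V|, d_v] node attributes v_i
    edges: np.ndarray      # [|E|, d_e] edge attributes e_k
    senders: np.ndarray    # [|E|] sender indices s_k in {0, ..., |V|-1}
    receivers: np.ndarray  # [|E|] receiver indices r_k in {0, ..., |V|-1}
    globals: np.ndarray    # [d_g] graph-level attribute g

# A toy graph with 3 nodes and 2 directed edges.
g = Graph(
    nodes=np.random.randn(3, 8),
    edges=np.random.randn(2, 4),
    senders=np.array([0, 1]),
    receivers=np.array([1, 2]),
    globals=np.random.randn(2),
)
```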
|
|
|
3 OVERSMOOTHING |
|
|
|
“Oversmoothing” is when the node latent vectors of a GNN become very similar after successive |
|
layers of message passing. Once nodes are identical there is no relational information contained in |
|
the nodes, and no higher-order latent graph representations can be learned. It is easiest to see this |
|
effect with the update function of a Graph Convolutional Network with no adjacency normalization |
|
$v_i^k = \sum_j W v_j^{k-1}$ with $j \in \mathrm{Neighborhood}(v_i)$, $W \in \mathbb{R}^{d_g \times d_g}$ and $k$ the layer index. As the number
|
of applications increases, the averaging effect of the summation forces the nodes to become almost |
|
identical. However, as soon as residual connections are added we can construct a network that |
|
need not suffer from oversmoothing by setting the residual updates to zero at a similarity threshold. |
|
Similarly, multi-head attention (Vaswani et al., 2017; Veličković et al., 2018) and GNNs with edge
|
updates (Battaglia et al., 2018; Gilmer et al., 2017) can modulate node updates. As such, for modern GNNs, oversmoothing is primarily a “training” problem, i.e. how to choose model architectures and
|
regularisers to encourage and preserve meaningful latent relational representations. |
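The averaging argument can be checked numerically. The following toy sketch (our own construction, with row-normalised averaging so the iteration stays bounded) repeatedly applies a neighbourhood-averaging update and prints how quickly the node latents collapse towards their centroid:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, depth = 20, 16, 60

# Connected toy graph: a ring plus a few random chords, with self-loops.
adj = np.eye(n)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
for i, j in rng.integers(0, n, size=(10, 2)):
    adj[i, j] = adj[j, i] = 1.0
norm_adj = adj / adj.sum(axis=1, keepdims=True)  # row-normalised averaging operator

h = rng.normal(size=(n, dim))
for k in range(depth + 1):
    if k % 10 == 0:
        spread = np.linalg.norm(h - h.mean(axis=0), axis=1).mean()
        print(f"layer {k:3d}: mean distance of node latents to centroid = {spread:.4f}")
    h = norm_adj @ h  # repeated neighbourhood averaging, as discussed above
```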
|
|
|
We can discern two desiderata for a regulariser or loss that addresses oversmoothing. First, it should |
|
penalise identical node latents. Second, it should encourage meaningful latent representations of |
|
the data. One such example may be the auto-regressive loss of transformer based language models |
|
(Brown et al. (2020)). In this case, each word (equivalent to node) prediction must be distinct, and |
|
the auto-regressive loss encourages relational dependence upon prior words. We can take inspiration |
|
from this observation to derive auxiliary losses that both have diverse node targets and encourage |
|
relational representation learning. In the following section we derive one such regulariser, Noisy |
|
Nodes. |
|
|
|
4 NOISY NODES |
|
|
|
Noisy Nodes tackles the oversmoothing problem by adding a diverse noise correction target, modifying the original graph prediction problem definition in several ways. It introduces a graph corrupted by noise, $\tilde{G} = (\tilde{V}, \tilde{E}, \tilde{g})$, where $\tilde{v}_i \in \tilde{V}$ is constructed by adding noise, $\sigma_i$, to the input nodes, $\tilde{v}_i = v_i + \sigma_i$. The edges, $\tilde{E}$, and graph-level attribute, $\tilde{g}$, can either be uncorrupted by noise (i.e., $\tilde{E} = E$, $\tilde{g} = g$), calculated from the noisy nodes (for example in a nearest neighbors graph), or corrupted independently of the nodes; these are minor choices that can be informed by the specific problem setting.
|
|
|
|
|
|
|
|
Figure 2: Per layer node latent diversity, measured |
|
by MAD on a 16 layer MPNN trained on OGBG-MOLPCBA. Noisy Nodes maintains a higher level
|
of diversity throughout the network than competing |
|
methods. |
|
|
|
|
|
Figure 1: Noisy Node mechanics during |
|
training. Input positions are corrupted |
|
with noise σ, and the training objective is |
|
the node-level difference between target |
|
positions and the noisy inputs. |
|
|
|
|
|
Our method requires a noise correction target to prevent oversmoothing by enforcing diversity in the |
|
last layers of the GNN, which can be achieved with an auxiliary denoising autoencoder loss. For |
|
example, where the Error is defined with respect to graph-level predictions (e.g., predict the minimum |
|
energy value of some molecular system), a second output head can be added to the GNN architecture |
|
which requires denoising the inputs as targets. Alternatively, if the inputs and targets are in the same real domain, as is the case for physical simulations, we can adjust the target for the noise. Figure 1
|
demonstrates this Noisy Nodes set up. The auxiliary loss is weighted by a constant coefficient λ ∈ R. |
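A minimal sketch of this training setup is given below, assuming a stand-in model with a graph-level head and a node-level denoising head; the function names, the squared-error losses and the choice of predicting the noise correction $-\sigma$ are our own illustrative assumptions rather than the exact configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_scale, lam = 0.3, 1.0  # illustrative noise std and auxiliary-loss weight (lambda)

def model(noisy_positions):
    """Stand-in for a GNN: a graph-level prediction plus a per-node denoising head."""
    graph_pred = float(noisy_positions.sum())   # placeholder graph-level head
    node_pred = np.zeros_like(noisy_positions)  # placeholder node-level head
    return graph_pred, node_pred

def noisy_nodes_loss(positions, graph_target):
    sigma = noise_scale * rng.normal(size=positions.shape)
    noisy_positions = positions + sigma         # corrupt the input graph's node attributes
    graph_pred, node_pred = model(noisy_positions)
    primary = (graph_pred - graph_target) ** 2  # original graph-level objective
    # Auxiliary denoising objective: predict the correction -sigma back to the clean input.
    denoising = np.mean(np.sum((node_pred - (-sigma)) ** 2, axis=-1))
    return primary + lam * denoising

positions = rng.normal(size=(5, 3))  # a toy molecule with 5 atoms
print(noisy_nodes_loss(positions, graph_target=1.0))
```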
|
|
|
In Figure 2 we illustrate the impact of Noisy Nodes on oversmoothing by plotting the Mean Absolute |
|
Distance (MAD) (Chen et al., 2020) of the residual updates of each layer of an MPNN trained on the |
|
QM9 (Ramakrishnan et al., 2014) dataset, and compare it to alternative methods DropEdge (Rong |
|
et al., 2019) and DropNode (Do et al., 2021). MAD is a measure of the diversity of graph node features and is often used to quantify oversmoothing: the higher the value, the more diverse the node features. In this plot we can see that for Noisy Nodes the node
|
updates remain diverse for all of the layers, whereas without Noisy Nodes diversity is lost. Further |
|
analysis of MAD across seeds and with sorted layers can be seen in Appendix Figures 7 and 6 for |
|
models applied to 3D point clouds. |
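For reference, a simplified MAD-style diversity score can be computed as the mean pairwise cosine distance between node latents, as in the sketch below; Chen et al. (2020) average over particular node pairs, whereas this unrestricted variant is our own simplification.

```python
import numpy as np

def mean_average_distance(h: np.ndarray) -> float:
    """Simplified MAD: mean pairwise cosine distance between node latents.

    h: [num_nodes, dim] node latents from one message passing layer.
    """
    unit = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-12)
    distance = 1.0 - unit @ unit.T               # [N, N] pairwise cosine distances
    n = h.shape[0]
    off_diag = distance[~np.eye(n, dtype=bool)]  # exclude self-pairs
    return float(off_diag.mean())

rng = np.random.default_rng(0)
diverse = rng.normal(size=(32, 64))
collapsed = np.ones((32, 64)) + 0.01 * rng.normal(size=(32, 64))
print(mean_average_distance(diverse))    # high: diverse latents
print(mean_average_distance(collapsed))  # near zero: oversmoothed latents
```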
|
|
|
**The Graph Manifold Learning Perspective. By using an implicit mapping from corrupted data to** |
|
clean data, the Noisy Nodes objective encourages the model to learn the manifold on which the clean |
|
data lies— we speculate that the GNN learns to go from low probability graphs to high probability |
|
graphs. In the autoencoder case the GNN learns the manifold of the input data. When node targets are |
|
provided, the GNN learns the manifold of the target data (e.g. the manifold of atoms at equilibrium). |
|
We speculate that such a manifold may include commonly repeated substructures that are useful for |
|
downstream prediction tasks. A similar motivation can be found for denoising in (Vincent et al., |
|
2010; Song & Ermon, 2019). |
|
|
|
**The Energy Perspective for Molecular Property Prediction. Local, random distortions of the** |
|
geometry of a molecule at a local energy minimum are almost certainly higher energy configurations. |
|
As such, a task that maps from a noised molecule to a local energy minimum is learning a mapping |
|
from high energy to low energy. Data such as QM9 contains molecules at local minima. |
|
|
|
Some problems have input data that is already high energy, and targets that are at equilibrium. For |
|
these datasets we can generate new high energy states by adding noise to the inputs but keeping the |
|
equilibrium target the same; Figure 1 demonstrates this approach. To preserve translation invariance we use the displacement between input and target, $\Delta$; the corrected target after noise is $\Delta - \sigma$.
|
|
|
5 RELATED WORK |
|
|
|
**Oversmoothing. Recent work has aimed to understand why it is challenging to realise the benefits of** |
|
training deeper GNNs (Wu et al., 2020). Since first being noted in ((Li et al., 2018)) oversmoothing |
|
has been studied extensively and regularisation techniques have been suggested to overcome it (Chen |
|
|
|
|
|
|
|
|
et al., 2019; Cai & Wang, 2020; Rong et al., 2019; Zhou et al., 2020b; Yang et al., 2020; Do et al., |
|
2021; Zhao & Akoglu, 2020). A recent paper, (Li et al., 2021), finds, as in previous work, (Li et al., |
|
2019; 2020), the optimal depth for some datasets they evaluate on to be far lower (5 for OGBN-Arxiv |
|
from the Open Graph Benchmark (Hu et al., 2020a), for example) than the 1000 layers possible. |
|
|
|
**Denoising & Noise Models. Training neural networks with noise has a long history (Sietsma &** |
|
Dow, 1991; Bishop, 1995). Of particular relevance are Denoising Autoencoders (Vincent et al., 2008) |
|
in which an autoencoder is trained to map corrupted inputs ˜x to uncorrupted inputs x. Denoising |
|
Autoencoders have found particular success as a form of pre-training for representation learning |
|
(Vincent et al., 2010). More recently, in research applying GNNs to simulation (Sanchez-Gonzalez |
|
et al., 2018; Sanchez-Gonzalez* et al., 2020; Pfaff et al., 2020) Gaussian noise is added during |
|
training to input positions of a ground truth simulator to mimic the distribution of errors of the learned |
|
simulator. Pre-training methods (Devlin et al., 2019; You et al., 2020; Thakoor et al., 2021) are |
|
another similar approach; most similarly to our method Hu et al. (2020b) apply a reconstruction loss |
|
to graphs with masked nodes to generate graph embeddings for use in downstream tasks. FLAG |
|
(Kong et al., 2020) adds adversarial noise during training to input node features as a form of data |
|
augmentation for GNNs that demonstrates improved performance for many tasks. It does not add an |
|
additional auxiliary loss, which we find is essential for addressing oversmoothing. In other related |
|
GNN work, Sato et al. (2021) use random input features to improve generalisation of graph neural
|
networks. Adding noise to help input node disambiguation has also been covered in (Dasoulas et al., |
|
2019; Loukas, 2020; Vignac et al., 2020; Murphy et al., 2019), but there is no auxiliary loss. |
|
|
|
Finally, we take inspiration from (Vincent et al., 2008; 2010; Vincent, 2011; Song & Ermon, 2019) |
|
which use the observation that noised data lies off the data manifold for representation learning and |
|
generative modelling. |
|
|
|
**Machine Learning for 3D Molecular Property Prediction. One application of GNNs is to speed** |
|
up quantum chemistry calculations which operate on 3D positions of a molecule (Duvenaud et al., |
|
2015; Gilmer et al., 2017; Schütt et al., 2017; Hu et al., 2021b). Common goals are the prediction of |
|
molecular properties (Ramakrishnan et al., 2014), forces (Chmiela et al., 2017), energies (Chanussot* |
|
et al., 2020) and charges (Unke & Meuwly, 2019). |
|
|
|
A common approach to embed physical symmetries is to design a network that predicts a rotation and |
|
translation invariant energy (Schütt et al., 2017; Klicpera et al., 2020a; Liu et al., 2021). The input |
|
features of such models include distances (Schütt et al., 2017), angles (Klicpera et al., 2020b;a) or |
|
torsions and higher order terms (Liu et al., 2021). An alternative approach to embedding symmetries |
|
is to design a rotation equivariant neural network that uses equivariant representations (Thomas et al.,
|
2018; Köhler et al., 2019; Kondor et al., 2018; Fuchs et al., 2020; Batzner et al., 2021; Anderson |
|
et al., 2019; Satorras et al., 2021). |
|
|
|
**Machine Learning for Bond and Atom Molecular Graphs. Predicting properties from molecular** |
|
graphs without 3D points, such as graphs of bonds and atoms, is studied separately and often used |
|
to benchmark generic graph property prediction models such as GCNs (Hu et al., 2020a) or GATs |
|
(Veličković et al., 2018). Models developed for 3D molecular property prediction cannot be applied
|
to bond and atom graphs. Common datasets that contain such data are OGBG-MOLPCBA and |
|
OGBG-MOLHIV. |
|
|
|
6 3D MOLECULAR PROPERTY PREDICTION EXPERIMENTS AND RESULTS |
|
|
|
In this section we evaluate how a popular, simple model, the GNS (Sanchez-Gonzalez* et al., 2020) |
|
performs on 3D molecular prediction tasks when combined with Noisy Nodes. The GNS was |
|
originally developed for particle fluid simulations, but has recently been adapted for molecular |
|
property prediction (Hu et al., 2021b). We find that without Noisy Nodes the GNS architecture is
|
not competitive, but by using Noisy Nodes we see improved performance comparable to the use of |
|
specialised architectures. |
|
|
|
We made minor changes to the GNS architecture. We featurise the distance input features using radial |
|
basis functions. We group layer weights, similar to grouped layers used in Jumper et al. (2021) for |
|
reduced parameter counts; for a group size of n the first n layer weights are repeated, i.e. the first layer |
|
with a group size of 10 has the same weights as the 11th, 21st and 31st layers, and so on. n contiguous
|
|
|
|
|
|
|
|
Figure 3: Validation curves, OC20 IS2RE ID. A) Without any node targets our model has poor |
|
performance and realises no benefit from depth. B) After adding a position node loss, performance |
|
improves as depth increases. C) As we add Noisy Nodes and parameters the model achieves SOTA, |
|
even with 3 layers, and stops overfitting. D) Adding Noisy Nodes allows a model with even fully |
|
shared weights to achieve SOTA. |
|
|
|
blocks of layers are considered a single group. Finally we find that decoding the intermediate latents |
|
and adding a loss after each group aids training stability. The decoder is shared across groups. |
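A sketch of the weight grouping and intermediate decoding is shown below; the single-matrix residual block, the decoder and the variable names are stand-ins of our own, intended only to illustrate that layer $i$ reuses the parameters of layer $i \bmod n$ and that a shared decoder is applied after every group.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, group_size, dim = 100, 10, 32

# Only `group_size` unique parameter sets; layer i reuses set i % group_size.
layer_params = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(group_size)]
decoder = rng.normal(size=(dim, 1)) / np.sqrt(dim)  # decoder shared across all groups

def forward(h):
    group_outputs = []  # one decoded prediction per group, each receiving a loss
    for i in range(num_layers):
        h = np.tanh(h @ layer_params[i % group_size]) + h  # stand-in residual block
        if (i + 1) % group_size == 0:
            group_outputs.append(h @ decoder)              # decode after every group
    return h, group_outputs

node_latents = rng.normal(size=(8, dim))  # 8 toy node latents
_, intermediate_predictions = forward(node_latents)
print(len(intermediate_predictions))      # num_layers / group_size decoded predictions
```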
|
|
|
We tested this architecture on three challenging molecular property prediction benchmarks: |
|
OC20 (Chanussot* et al., 2020) IS2RS & IS2RE, and QM9 (Ramakrishnan et al., 2014). These |
|
benchmarks are detailed below, but as general distinctions, OC20 tasks use graphs 2-20x larger than |
|
QM9. While QM9 always requires graph-level prediction, one of OC20’s two tasks (IS2RS) requires |
|
node-level predictions while the other (IS2RE) requires graph-level predictions. All training details |
|
may be found in the Appendix. |
|
|
|
6.1 OPEN CATALYST 2020 |
|
|
|
**[Dataset. The OC20 dataset (Chanussot* et al., 2020) (CC Attribution 4.0) describes the interaction](https://opencatalystproject.org/)** |
|
of a small molecule (the adsorbate) and a large slab (the catalyst), with total systems consisting of |
|
20-200 atoms simulated until equilibrium is reached. |
|
|
|
We focus on two tasks; the Initial Structure to Resulting Energy (IS2RE) task which takes the initial |
|
structure of the simulation and predicts the final energy, and the Initial Structure to Resulting Structure |
|
(IS2RS) which takes the initial structure and predicts the relaxed structure. Note that we train the |
|
more common “direct” prediction task that maps directly from initial positions to target in a single
|
forward pass, and compare against other models trained for direct prediction. |
|
|
|
Models are evaluated on 4 held out test sets. Four canonical validation datasets are also provided. |
|
Test sets are evaluated on a remote server hosted by the dataset authors with a very limited number of |
|
submissions per team. |
|
|
|
Noisy Nodes in this case consists of a random jump between the initial position and relaxed position. During training we first sample uniformly from a point in the relaxation trajectory or interpolate uniformly between the initial and final positions, $(v_i - \tilde{v}_i)\gamma$, $\gamma \sim U(0, 1)$, and then add I.I.D. Gaussian noise with mean zero and $\sigma = 0.3$. The Noisy Node target is the relaxed structure.
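The sketch below shows one way to realise this corruption (variable and function names are ours; only the noise scale $\sigma = 0.3$ and the use of the relaxed structure as the target come from the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # noise scale stated above for OC20

def make_noisy_example(initial_pos, relaxed_pos):
    """Interpolate towards the relaxed structure, then add Gaussian noise."""
    gamma = rng.uniform(0.0, 1.0)
    interpolated = initial_pos + gamma * (relaxed_pos - initial_pos)
    noisy_input = interpolated + sigma * rng.normal(size=initial_pos.shape)
    node_target = relaxed_pos  # the Noisy Node target is the relaxed structure
    return noisy_input, node_target

initial = rng.normal(size=(100, 3))                  # toy system of 100 atoms
relaxed = initial + 0.5 * rng.normal(size=(100, 3))  # pretend relaxed positions
noisy_input, node_target = make_noisy_example(initial, relaxed)
print(noisy_input.shape, node_target.shape)
```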
|
|
|
|
|
|
|
|
Table 1: OC20 IS2RE Validation, eV MAE, ↓.
“GNS-Shared” indicates shared weights. “GNS-10” indicates a group size of 10.

| Model | Layers | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| GNS | 50 | 0.59 ±0.01 | 0.65 ±0.01 | 0.55 ±0.00 | 0.54 ±0.00 |
| GNS-Shared + Noisy Nodes | 50 | 0.49 ±0.00 | 0.54 ±0.00 | 0.51 ±0.01 | 0.51 ±0.01 |
| GNS + Noisy Nodes | 50 | 0.48 ±0.00 | 0.53 ±0.00 | 0.49 ±0.01 | 0.48 ±0.00 |
| GNS-10 + Noisy Nodes | 100 | **0.46 ±0.00** | **0.51 ±0.00** | **0.48 ±0.00** | **0.47 ±0.00** |
|
|
|
Table 2: Results OC20 IS2RE Test

eV MAE ↓

| Split | SchNet | DimeNet++ | SpinConv | SphereNet | GNS + Noisy Nodes |
|---|---|---|---|---|---|
| OOD Both | 0.704 | 0.661 | 0.674 | 0.638 | **0.465 (-24.0%)** |
| OOD Adsorbate | 0.734 | 0.725 | 0.723 | 0.703 | **0.565 (-22.8%)** |
| OOD Catalyst | 0.662 | 0.576 | 0.569 | 0.571 | **0.437 (-17.2%)** |
| ID | 0.639 | 0.562 | 0.558 | 0.563 | **0.422 (-18.8%)** |

Average Energy within Threshold (AEwT) ↑

| Split | SchNet | DimeNet++ | SpinConv | SphereNet | GNS + Noisy Nodes |
|---|---|---|---|---|---|
| OOD Both | 0.0221 | 0.0241 | 0.0233 | 0.0241 | **0.047 (+95.8%)** |
| OOD Adsorbate | 0.0233 | 0.0207 | 0.026 | 0.0229 | **0.035 (+89.5%)** |
| OOD Catalyst | 0.0294 | 0.0410 | 0.0382 | 0.0409 | **0.080 (+95.1%)** |
| ID | 0.0296 | 0.0425 | 0.0408 | 0.0447 | **0.091 (+102.0%)** |
|
|
|
We first convert to fractional coordinates (i.e. use the periodic unit cell as the basis), which renders the predictions of our model invariant to rotations, and append the following rotation and translation invariant vector $(\alpha\beta^T, \beta\gamma^T, \alpha\gamma^T, |\alpha|, |\beta|, |\gamma|) \in \mathbb{R}^6$ to the edge features, where $\alpha, \beta, \gamma$ are the vectors of the unit cell. This additional vector provides rotation invariant angular and extent information to the GNN.
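Computing this invariant cell descriptor takes only a few lines; the helper below (our own naming) takes the three unit-cell vectors and returns the six dot products and lengths that are appended to every edge feature:

```python
import numpy as np

def cell_features(alpha: np.ndarray, beta: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Rotation and translation invariant summary of the periodic unit cell.

    alpha, beta, gamma: the three lattice vectors, each of shape [3].
    Returns (alpha.beta, beta.gamma, alpha.gamma, |alpha|, |beta|, |gamma|).
    """
    return np.array([
        alpha @ beta,
        beta @ gamma,
        alpha @ gamma,
        np.linalg.norm(alpha),
        np.linalg.norm(beta),
        np.linalg.norm(gamma),
    ])

cell = np.array([[10.0, 0.0, 0.0], [0.0, 12.0, 0.0], [2.0, 0.0, 15.0]])
print(cell_features(*cell))  # appended to every edge feature vector
```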
|
|
|
**IS2RE Results. In Figure 3 we show how using Noisy Nodes allows the GNS to achieve state** |
|
of the art performance. Figure 3 A shows that without any auxiliary node target, an IS2RE GNS |
|
achieves poor performance even with increased depth. The fact that increased depth does not result in |
|
improvement supports the hypothesis that GNS suffers from oversmoothing. As we add a node level |
|
position target in B) we see better performance, and improvement as depth increases, validating our |
|
hypothesis that node level targets are key to addressing oversmoothing. In C) we add noisy nodes and |
|
parameters, and see that the increased diversity of the node level predictions leads to very significant |
|
improvements and SOTA, even for a shallow 3 layer network. D) demonstrates this effect is not just |
|
due to increased parameters - SOTA can still be achieved with shared layer weights.
|
|
|
In Table 1 we conduct an ablation on our hyperparameters, and again demonstrate the improved |
|
performance of using Noisy Nodes. Results were averaged over 3 seeds and standard errors on the |
|
best obtained checkpoint show little sensitivity to initialisation. All results in the table are reported |
|
using states sampled from trajectories. We conducted an ablation on ID comparing sampling from a
|
relaxation trajectory and interpolating between initial & final positions which found that interpolation |
|
improved our score from 0.47 to 0.45. |
|
|
|
Our best hyperparameter setting was 100 layers which achieved a 95.6% relative performance |
|
improvement against SOTA results (Table 2) on the AEwT benchmark. Due to limited permitted test |
|
submissions, results presented here were from one test upload of our best performing validation seed. |
|
|
|
**IS2RS Results. In Table 4 we see that GNS + Noisy Nodes is significantly better than the only other** |
|
reported IS2RS direct result, ForceNet, itself a GNS variant. |
|
|
|
|
|
|
|
|
Table 3: OC20 IS2RS Validation, ADwT, ↑

| Model | Layers | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| GNS | 50 | 43.0% ±0.0 | 38.0% ±0.0 | 37.5% ±0.0 | 40.0% ±0.0 |
| GNS + Noisy Nodes | 50 | 50.1% ±0.0 | 44.3% ±0.0 | 44.1% ±0.0 | 46.1% ±0.0 |
| GNS-10 + Noisy Nodes | 50 | 52.0% ±0.0 | 46.2% ±0.0 | 46.1% ±0.0 | 48.3% ±0.0 |
| GNS-10 + Noisy Nodes + Pos only | 100 | **54.3% ±0.0** | **48.3% ±0.0** | **48.2% ±0.0** | **50.0% ±0.0** |
|
|
|
Table 4: OC20 IS2RS Test, ADwT, ↑

| Model | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|
| ForceNet | 46.9% | 37.7% | 43.7% | 44.9% |
| GNS + Noisy Nodes | **52.7%** | **43.9%** | **48.4%** | **50.9%** |
| Relative Improvement | **+12.4%** | **+16.4%** | **+10.7%** | **+13.3%** |
|
|
|
6.2 QM9 |
|
|
|
**Dataset. The QM9 benchmark (Ramakrishnan et al., 2014) contains 134k molecules in equilibrium** |
|
with up to 9 heavy C, O, N and F atoms, targeting 12 associated chemical properties (License: CC-BY
|
4.0). We use 114k molecules for training, 10k for validation and 10k for test. All results are on the |
|
test set. We subtract a fixed per atom energy from the target values computed from linear regression |
|
to reduce variance. We perform training in eV units for energetic targets, and evaluate using MAE. |
|
We summarise the results across the targets using mean standardised MAE (std. MAE) in which |
|
MAEs are normalised by their standard deviation, and mean standardised logMAE. Std. MAE is |
|
dominated by targets with high relative error such as ∆ϵ, whereas logMAE is sensitive to outliers |
|
such as $R^2$. As is standard for this dataset, a model is trained separately for each target.
|
|
|
For this dataset we add I.I.D Gaussian noise with mean zero and σ = 0.02 to the input atom positions. |
|
A denoising autoencoder loss is used. |
|
|
|
**Results In Table 6 we can see that adding Noisy Nodes significantly improves results by 23.1%** |
|
relative for GNS, making it competitive with specialised architectures. To understand the effect of |
|
adding a denoising loss, we tried just adding noise and found nowhere near the same improvement
|
(Table 6). |
|
|
|
A GNS-10 + Noisy Nodes with 30 layers achieves top results on 3 of the 12 targets and comparable |
|
performance on the remainder (Table 6). On the std. MAE aggregate metric GNS + Noisy Nodes |
|
performs better than all other reported results, showing that Noisy Nodes can make even a generic |
|
model competitive with models hand-crafted for molecular property prediction. The same trend is |
|
repeated for a rotation invariant version of this network that uses the principal axes of inertia ordered
|
by eigenvalue as the co-ordinate frame (Table 5). |
|
$R^2$, the electronic spatial extent, is an outlier for GNS + Noisy Nodes. Interestingly, we found that without noise, GNS-10 + Noisy Nodes achieves 0.33 for this target. We speculate that this target is
|
particularly sensitive to noise, and the best noise value for this target would be significantly lower |
|
than for the dataset as a whole. |
|
|
|
Table 5: QM9, Impact of Noisy Nodes on GNS architecture.

| Model | Layers | std. MAE | % Change | logMAE |
|---|---|---|---|---|
| GNS | 10 | 1.17 | - | -5.39 |
| GNS + Noise But No Node Target | 10 | 1.16 | -0.9% | -5.32 |
| GNS + Noisy Nodes | 10 | 0.90 | -23.1% | -5.58 |
| GNS-10 + Noisy Nodes | 20 | 0.89 | -23.9% | -5.59 |
| GNS-10 + Noisy Nodes + Invariance | 30 | 0.92 | -21.4% | -5.57 |
| GNS-10 + Noisy Nodes | 30 | **0.88** | **-24.8%** | **-5.60** |
|
|
|
|
|
|
|
|
Table 6: QM9, Test MAE, Mean & Standard Deviation of 3 Seeds Reported.

| Target | Unit | SchNet | E(n)GNN | DimeNet++ | SphereNet | PaiNN | GNS + Noisy Nodes |
|---|---|---|---|---|---|---|---|
| $\mu$ | D | 0.033 | 0.029 | 0.030 | 0.027 | **0.012** | 0.025 ±0.01 |
| $\alpha$ | $a_0^3$ | 0.235 | 0.071 | **0.043** | 0.047 | 0.045 | 0.052 ±0.00 |
| $\epsilon_{HOMO}$ | meV | 41 | 29.0 | 24.6 | 23.6 | 27.6 | **20.4 ±0.2** |
| $\epsilon_{LUMO}$ | meV | 34 | 25.0 | 19.5 | 18.9 | 20.4 | **18.6 ±0.4** |
| $\Delta\epsilon$ | meV | 63 | 48.0 | 32.6 | 32.3 | 45.7 | **28.6 ±0.1** |
| $R^2$ | $a_0^2$ | **0.07** | 0.11 | 0.33 | 0.29 | 0.07 | 0.70 ±0.01 |
| ZPVE | meV | 1.7 | 1.55 | 1.21 | **1.12** | 1.28 | 1.16 ±0.01 |
| $U_0$ | meV | 14.00 | 11.00 | 6.32 | 6.26 | **5.85** | 7.30 ±0.12 |
| $U$ | meV | 19.00 | 12.00 | 6.28 | 7.33 | **5.83** | 7.57 ±0.03 |
| $H$ | meV | 14.00 | 12.00 | 6.53 | 6.40 | **5.98** | 7.43 ±0.06 |
| $G$ | meV | 14.00 | 12.00 | 7.56 | 8.0 | **7.35** | 8.30 ±0.14 |
| $c_v$ | cal/mol K | 0.033 | 0.031 | 0.023 | **0.022** | 0.024 | 0.025 ±0.00 |
| std. MAE | % | 1.76 | 1.22 | 0.98 | 0.94 | 1.00 | **0.88** |
| logMAE | | -5.17 | -5.43 | -5.67 | -5.68 | **-5.85** | -5.60 |
|
|
|
Table 7: OGBG-PCQM4M Results

| Model | Number of Layers | Using Noisy Nodes | MAE |
|---|---|---|---|
| MPNN + Virtual Node | 16 | Yes | 0.1249 ± 0.0003 |
| MPNN + Virtual Node | 50 | No | 0.1236 ± 0.0001 |
| Graphormer (Ying et al., 2021) | - | - | 0.1234 |
| MPNN + Virtual Node | 50 | Yes | **0.1218 ± 0.0001** |
|
|
|
7 NON-SPATIAL TASKS |
|
|
|
The previous experiments use the 3D geometries of atoms, and models that operate on 3D points. |
|
However, the recipe of adding a denoising auxiliary loss can be applied to other graphs with different |
|
types of features. In this section we apply Noisy Nodes to additional datasets with no 3D points, |
|
using different GNNs, and show analogous effects to the 3D case. Details of the hyperparameters,
|
models and training details can be found in the appendix. |
|
|
|
7.1 OGBG-PCQM4M |
|
|
|
This dataset from the OGB benchmarks consists of molecular graphs described by bond and atom types, with no 3D or 2D coordinates. To adapt Noisy Nodes to this setting, we randomly flip
|
node and edge features at a rate of 5% and add a reconstruction loss. We evaluate Noisy Nodes using |
|
an MPNN + Virtual Node (Gilmer et al., 2017). The test set is not currently available for this dataset. |
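A sketch of this corruption is given below; the helper name and the choice of resampling a flipped entry uniformly over its categories are our assumptions, while the 5% flip rate and the reconstruction target come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
flip_prob = 0.05  # flip rate used in the text

def flip_categories(features: np.ndarray, num_categories: np.ndarray):
    """Randomly flip integer-coded categorical features.

    features: [num_items, num_feature_columns] integer categories.
    num_categories: [num_feature_columns] number of categories per column.
    Returns the corrupted features and the clean reconstruction target.
    """
    mask = rng.random(features.shape) < flip_prob
    random_cats = rng.integers(0, num_categories, size=features.shape)
    corrupted = np.where(mask, random_cats, features)
    return corrupted, features  # reconstruction target = uncorrupted categories

atom_features = rng.integers(0, 10, size=(30, 9))  # toy categorical node features
noisy_atoms, recon_target = flip_categories(atom_features, np.full(9, 10))
print(noisy_atoms.shape, (noisy_atoms != recon_target).mean())
```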
|
|
|
In Table 7 we see that for this task Noisy Nodes enables a 50 layer MPNN to reach state of the art |
|
results. Before adding Noisy Nodes, adding capacity beyond 16 layers did not improve results. |
|
|
|
7.2 OGBG-MOLPCBA |
|
|
|
The OGBG-MOLPCBA dataset contains molecular graphs with no 3D points, with the goal of |
|
classifying 128 biological activities. On the OGBG-MOLPCBA dataset we again use an MPNN + |
|
Virtual Node and random flipping noise. In Figure 4 we see that adding Noisy Nodes improves the |
|
performance of the base model, accentuated for deeper networks. Our 16 layer MPNN improved |
|
from 27.6% ± 0.004 to 28.1% ± 0.002 Mean Average Precision (“Mean AP”). Figure 5 demonstrates |
|
how Noisy Nodes improves performance during training. Of the reported results, our MPNN is |
|
most similar to GCN[1] + Virtual Node and GIN + Virtual Node (Xu et al., 2018) which report |
|
results of 24.2% ± 0.003 and 27.03% ± 0.003 respectively. We evaluate alternative methods for |
|
|
|
1The GCN implemented in the official OGB code base has explicit edge updates, akin to the MPNN. |
|
|
|
|
|
|
|
|
Figure 4: Adding Noisy Nodes with random |
|
flipping of input categories improves the performance of MPNNs, and the effect is accentuated with depth. |
|
|
|
|
|
Figure 5: Validation curve comparing with |
|
and without noisy nodes. Using Noisy Nodes |
|
leads to a consistent improvement. |
|
|
|
|
|
oversmoothing, DropNode and DropEdge in Figure 2 and find that Noisy Nodes is more effective at |
|
addressing oversmoothing, although all 3 methods can be combined favourably (results in appendix).
|
|
|
7.3 OGBN-ARXIV |
|
|
|
The above results use models with explicit edge updates, and are reported for graph prediction. To |
|
test the effectiveness of Noisy Nodes with GCNs, arguably the simplest and most popular GNN,
|
we use OGBN-ARXIV, a citation network with the goal of predicting the arxiv category of each paper. |
|
Adding Noisy Nodes, with noise as input dropout of 0.1, to a 4 layer GCN with residual connections
|
improves from 72.39% ± 0.002 accuracy to 72.52% ± 0.003 accuracy. A baseline 4 layer GCN on |
|
this dataset reports 71.71% ± 0.002. The SOTA for this dataset is 74.31% (Sun & Wu, 2020). |
|
|
|
7.4 LIMITATIONS |
|
|
|
We have not demonstrated the effectiveness of Noisy Nodes in small data regimes, which may be |
|
important for learning from experimental data. The representation learning perspective requires |
|
access to a local minimum configuration, which is not the case for all quantum modeling datasets. We |
|
have also not demonstrated the combination of Noisy Nodes with more sophisticated 3D molecular |
|
property prediction models such as DimeNet++ (Klicpera et al., 2020a); such models may require an
|
alternative reconstruction loss to position change, such as pairwise interatomic distances. We leave |
|
this to future work. |
|
|
|
Noisy Nodes requires careful selection of the form of noise, and a balance between the auxiliary and |
|
primary losses. This can require hyperparameter tuning, and models can be sensitive to the choice
|
of these parameters. Noisy Nodes has a particular effect for deep GNNs, but depth is not always an |
|
advantage. There are situations, for example molecular dynamics, which place a premium on very |
|
fast inference time. However even at 3 layers (a comparable depth to alternative architectures) the |
|
GNS architecture achieves state of the art validation OC20 IS2RE predictions (Figure 3). Finally, |
|
returns diminish as depth increases, indicating depth is not the only answer (Table 1).
|
|
|
8 CONCLUSIONS |
|
|
|
In this work we present Noisy Nodes, a novel regularisation technique for GNNs with particular |
|
focus on 3D molecular property prediction. Noisy Nodes helps address common challenges around
|
oversmoothed node representations, shows benefits for GNNs of all depths, but in particular improves |
|
performance for deeper GNNs. We demonstrate results on challenging 3D molecular property |
|
prediction tasks, and some generic GNN benchmark datasets. We believe these results demonstrate |
|
Noisy Nodes could be a useful building block for GNNs for molecular property prediction and |
|
beyond. |
|
|
|
|
|
|
|
|
9 REPRODUCIBILITY STATEMENT |
|
|
|
Code for reproducing OGB-PCQM4M results using Noisy Nodes is available on github, and |
|
was prepared as part of a leaderboard submission. [https://github.com/deepmind/](https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq) |
|
[deepmind-research/tree/master/ogb_lsc/pcq.](https://github.com/deepmind/deepmind-research/tree/master/ogb_lsc/pcq) |
|
|
|
We provide detailed hyperparameter settings for all our experiments in the appendix, in addition to
|
formulae for computing the encoder and decoder stages of the GNS. |
|
|
|
10 ETHICS STATEMENT |
|
|
|
**Who may benefit from this work? Molecular property prediction with GNNs is a fast-growing** |
|
area with applications across domains such as drug design, catalyst discovery, synthetic biology, and |
|
chemical engineering. Noisy Nodes could aid models applied to these domains. We also demonstrate |
|
on OC20 that our direct state prediction approach is nearly as accurate as learned relaxed approaches |
|
at a small fraction of the computational cost, which may support material design which requires many |
|
predictions. |
|
|
|
Finally, Noisy Nodes could be adapted and applied to many areas in which GNNs are used—for |
|
example, knowledge base completion, physical simulation or traffic prediction. |
|
|
|
**Potential negative impact and reflection. Noisy Nodes sees improved performance from depth, but** |
|
the training of very deep GNNs could contribute to global warming. Care should be taken when |
|
utilising depth, and we note that Noisy Nodes settings can be calibrated at shallow depth. |
|
|
|
REFERENCES |
|
|
|
Brandon M. Anderson, T. Hy, and R. Kondor. Cormorant: Covariant molecular neural networks. In |
|
_NeurIPS, 2019._ |
|
|
|
Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David |
|
Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, |
|
Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, |
|
Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring, |
|
Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan |
|
Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL |
|
[http://github.com/deepmind.](http://github.com/deepmind) |
|
|
|
V. Bapst, T. Keck, Agnieszka Grabska-Barwinska, C. Donner, E. D. Cubuk, S. Schoenholz, A. Obika, |
|
Alexander W. R. Nelson, T. Back, D. Hassabis, and P. Kohli. Unveiling the predictive power of |
|
static structure in glassy systems. Nature Physics, 16:448–454, 2020. |
|
|
|
P. Battaglia, Jessica B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, Mateusz Malinowski, |
|
Andrea Tacchetti, David Raposo, A. Santoro, R. Faulkner, Çaglar Gülçehre, H. Song, A. J. Ballard, |
|
J. Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charlie Nash, Victoria Langston, |
|
Chris Dyer, N. Heess, Daan Wierstra, P. Kohli, M. Botvinick, Oriol Vinyals, Y. Li, and Razvan |
|
Pascanu. Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261, |
|
2018. |
|
|
|
Simon Batzner, T. Smidt, L. Sun, J. Mailoa, M. Kornbluth, N. Molinari, and B. Kozinsky. SE(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. ArXiv,
|
abs/2101.03164, 2021. |
|
|
|
Charles M. Bishop. Training with noise is equivalent to tikhonov regularization. Neural Computation, |
|
7:108–116, 1995. |
|
|
|
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal |
|
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and |
|
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL |
|
[http://github.com/google/jax.](http://github.com/google/jax) |
|
|
|
|
|
|
|
|
Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric |
|
deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, |
|
2017. |
|
|
|
T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind |
|
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, |
|
Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens |
|
Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, |
|
J. Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. |
|
Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. |
|
|
|
Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. _CoRR,_ |
|
[abs/2006.13318, 2020. URL https://arxiv.org/abs/2006.13318.](https://arxiv.org/abs/2006.13318) |
|
|
|
Lowik Chanussot*, Abhishek Das*, Siddharth Goyal*, Thibaut Lavril*, Muhammed Shuaibi*, |
|
Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, |
|
Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, and Zachary |
|
Ulissi. Open catalyst 2020 (oc20) dataset and community challenges. ACS Catalysis, 0(0): |
|
[6059–6072, 2020. doi: 10.1021/acscatal.0c04525. URL https://doi.org/10.1021/](https://doi.org/10.1021/acscatal.0c04525) |
|
[acscatal.0c04525.](https://doi.org/10.1021/acscatal.0c04525) |
|
|
|
Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. CoRR, abs/1909.03211, |
|
[2019. URL http://arxiv.org/abs/1909.03211.](http://arxiv.org/abs/1909.03211) |
|
|
|
Deli Chen, Yankai Lin, W. Li, Peng Li, J. Zhou, and Xu Sun. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. In AAAI, 2020. |
|
|
|
Stefan Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, Kristof T. Schütt, and K. Müller. |
|
Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3, 2017. |
|
|
|
George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural |
|
networks for node disambiguation. ArXiv, abs/1912.06058, 2019. |
|
|
|
J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep |
|
bidirectional transformers for language understanding. In NAACL-HLT, 2019. |
|
|
|
Tien Huu Do, Duc Minh Nguyen, Giannis Bekoulis, Adrian Munteanu, and N. Deligiannis. Graph convolutional neural networks with node transition probability-based message passing and dropnode |
|
regularization. Expert Syst. Appl., 174:114711, 2021. |
|
|
|
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for |
|
learning molecular fingerprints. In Proceedings of the 28th International Conference on Neural |
|
_Information Processing Systems - Volume 2, NIPS’15, pp. 2224–2232, Cambridge, MA, USA,_ |
|
2015. MIT Press. |
|
|
|
F. Fuchs, Daniel E. Worrall, Volker Fischer, and M. Welling. Se(3)-transformers: 3d roto-translation |
|
equivariant attention networks. ArXiv, abs/2006.10503, 2020. |
|
|
|
J. Gilmer, S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message |
|
passing for quantum chemistry. ArXiv, abs/1704.01212, 2017. |
|
|
|
Jonathan Godwin*, Thomas Keck*, Peter Battaglia, Victor Bapst, Thomas Kipf, Yujia Li, Kimberly |
|
Stachenfeld, Petar Veličković, and Alvaro Sanchez-Gonzalez. Jraph: A library for graph neural
|
[networks in jax., 2020. URL http://github.com/deepmind/jraph.](http://github.com/deepmind/jraph) |
|
|
|
Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020. |
|
[URL http://github.com/deepmind/dm-haiku.](http://github.com/deepmind/dm-haiku) |
|
|
|
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, |
|
and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ArXiv, |
|
abs/2005.00687, 2020a. |
|
|
|
|
|
|
|
|
Weihua Hu, Bowen Liu, Joseph Gomes, M. Zitnik, Percy Liang, V. Pande, and J. Leskovec. Strategies |
|
for pre-training graph neural networks. arXiv: Learning, 2020b. |
|
|
|
Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A |
|
large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021a. |
|
|
|
Weihua Hu, Muhammed Shuaibi, Abhishek Das, Siddharth Goyal, Anuroop Sriram, J. Leskovec, Devi |
|
Parikh, and C. L. Zitnick. Forcenet: A graph neural network for large-scale quantum calculations. |
|
_ArXiv, abs/2103.01436, 2021b._ |
|
|
|
John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino RomeraParedes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David A. |
|
Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, |
|
Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. |
|
_Nature, 596:583 – 589, 2021._ |
|
|
|
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. _CoRR,_ |
|
abs/1412.6980, 2015. |
|
|
|
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. |
|
_[CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.](http://arxiv.org/abs/1609.02907)_ |
|
|
|
Johannes Klicpera, Shankari Giri, Johannes T. Margraf, and Stephan Günnemann. Fast |
|
and uncertainty-aware directional message passing for non-equilibrium molecules. _CoRR,_ |
|
[abs/2011.14115, 2020a. URL https://arxiv.org/abs/2011.14115.](https://arxiv.org/abs/2011.14115) |
|
|
|
Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular |
|
graphs. ArXiv, abs/2003.03123, 2020b. |
|
|
|
Risi Kondor, Hy Truong Son, Horace Pan, Brandon M. Anderson, and Shubhendu Trivedi. Covariant |
|
[compositional networks for learning graphs. CoRR, abs/1801.02144, 2018. URL http://](http://arxiv.org/abs/1801.02144) |
|
[arxiv.org/abs/1801.02144.](http://arxiv.org/abs/1801.02144) |
|
|
|
Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, G. Taylor, |
|
and T. Goldstein. Flag: Adversarial data augmentation for graph neural networks. _ArXiv,_ |
|
abs/2010.09891, 2020. |
|
|
|
Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: sampling configurations for multi-body |
|
systems with symmetric energies, 2019. |
|
|
|
G. Li, M. Müller, Ali K. Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? |
|
_2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9266–9275, 2019._ |
|
|
|
Guohao Li, C. Xiong, Ali K. Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper |
|
gcns. ArXiv, abs/2006.07739, 2020. |
|
|
|
Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks |
|
[with 1000 layers. CoRR, abs/2106.07476, 2021. URL https://arxiv.org/abs/2106.](https://arxiv.org/abs/2106.07476) |
|
[07476.](https://arxiv.org/abs/2106.07476) |
|
|
|
Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks |
|
for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, |
|
volume 32, 2018. |
|
|
|
Yi Liu, Limei Wang, Meng Liu, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. Spherical message |
|
passing for 3d graph networks. arXiv preprint arXiv:2102.05013, 2021. |
|
|
|
Andreas Loukas. How hard is to distinguish graphs with graph neural networks? arXiv: Learning, |
|
2020. |
|
|
|
|
|
|
|
|
Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak A. Rao, and Bruno Ribeiro. Relational |
|
pooling for graph representations. In ICML, 2019. |
|
|
|
T. Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and P. Battaglia. Learning mesh-based simulation with graph networks. ArXiv, abs/2010.03409, 2020. |
|
|
|
R. Ramakrishnan, Pavlo O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures |
|
and properties of 134 kilo molecules. Scientific Data, 1, 2014. |
|
|
|
Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. The truly deep graph convolutional |
|
[networks for node classification. CoRR, abs/1907.10903, 2019. URL http://arxiv.org/](http://arxiv.org/abs/1907.10903) |
|
[abs/1907.10903.](http://arxiv.org/abs/1907.10903) |
|
|
|
Alvaro Sanchez-Gonzalez, N. Heess, Jost Tobias Springenberg, J. Merel, Martin A. Riedmiller, |
|
R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. |
|
_ArXiv, abs/1806.01242, 2018._ |
|
|
|
Alvaro Sanchez-Gonzalez*, Jonathan Godwin*, Tobias Pfaff*, Rex Ying*, Jure Leskovec, and Peter |
|
Battaglia. Learning to simulate complex physics with graph networks. In Hal Daumé III and Aarti |
|
Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 |
|
of Proceedings of Machine Learning Research, pp. 8459–8468. PMLR, 13–18 Jul 2020. URL |
|
[http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html.](http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html) |
|
|
|
R. Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. |
|
In SDM, 2021. |
|
|
|
Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, |
|
2021. |
|
|
|
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The |
|
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. doi: |
|
10.1109/TNN.2008.2005605. |
|
|
|
Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, A. Tkatchenko, |
|
and K. Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum |
|
interactions. In NIPS, 2017. |
|
|
|
Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics. |
|
_Machine Learning: Science and Technology, 2(2):021001, Jan 2021. ISSN 2632-2153. doi:_ |
|
[10.1088/2632-2153/abbf9a. URL http://dx.doi.org/10.1088/2632-2153/abbf9a.](http://dx.doi.org/10.1088/2632-2153/abbf9a) |
|
|
|
J. Sietsma and Robert J. F. Dow. Creating artificial neural networks that generalize. Neural Networks, |
|
4:67–79, 1991. |
|
|
|
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. |
|
_ArXiv, abs/1907.05600, 2019._ |
|
|
|
Chuxiong Sun and Guoshi Wu. Adaptive graph diffusion networks with hop-wise attention. ArXiv, |
|
abs/2012.15024, 2020. |
|
|
|
Shantanu Thakoor, C. Tallec, M. G. Azar, R. Munos, Petar Veličković, and Michal Valko. Bootstrapped representation learning on graphs. ArXiv, abs/2102.06514, 2021.
|
|
|
Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick |
|
Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point |
|
[clouds. CoRR, abs/1802.08219, 2018. URL http://arxiv.org/abs/1802.08219.](http://arxiv.org/abs/1802.08219) |
|
|
|
Oliver T. Unke and Markus Meuwly. Physnet: A neural network for predicting energies, forces, dipole |
|
moments, and partial charges. Journal of Chemical Theory and Computation, 15(6):3678–3693, |
|
[May 2019. ISSN 1549-9626. doi: 10.1021/acs.jctc.9b00181. URL http://dx.doi.org/10.](http://dx.doi.org/10.1021/acs.jctc.9b00181) |
|
[1021/acs.jctc.9b00181.](http://dx.doi.org/10.1021/acs.jctc.9b00181) |
|
|
|
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, |
|
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017. |
|
|
|
|
|
|
|
|
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
|
Bengio. Graph attention networks, 2018. |
|
|
|
Clément Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph
|
neural networks with structural message-passing. arXiv: Learning, 2020. |
|
|
|
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011.
|
|
|
Pascal Vincent, H. Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and |
|
composing robust features with denoising autoencoders. In ICML ’08, 2008. |
|
|
|
Pascal Vincent, H. Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked |
|
denoising autoencoders: Learning useful representations in a deep network with a local denoising |
|
criterion. J. Mach. Learn. Res., 11:3371–3408, 2010. |
|
|
|
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A |
|
comprehensive survey on graph neural networks. IEEE transactions on neural networks and |
|
_learning systems, 2020._ |
|
|
|
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural |
|
[networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826.](http://arxiv.org/abs/1810.00826) |
|
|
|
Chaoqi Yang, Ruijie Wang, Shuochao Yao, Shengzhong Liu, and Tarek Abdelzaher. Revisiting “over-smoothing” in deep GCNs. arXiv preprint arXiv:2003.13663, 2020.
|
|
|
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and |
|
Tie-Yan Liu. Do transformers really perform bad for graph representation? ArXiv, abs/2106.05234, |
|
2021. |
|
|
|
Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph |
|
contrastive learning with augmentations. ArXiv, abs/2010.13902, 2020. |
|
|
|
L. Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. ArXiv, abs/1909.12223, |
|
2020. |
|
|
|
Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, |
|
Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. |
|
_AI Open, 1:57–81, 2020a._ |
|
|
|
Kuangqi Zhou, Yanfei Dong, Wee Sun Lee, Bryan Hooi, Huan Xu, and Jiashi Feng. Effective |
|
[training strategies for deep graph neural networks. CoRR, abs/2006.07107, 2020b. URL https:](https://arxiv.org/abs/2006.07107) |
|
[//arxiv.org/abs/2006.07107.](https://arxiv.org/abs/2006.07107) |
|
|
|
A APPENDIX |
|
|
|
The following sections include details on training setup, hyper-parameters, input processing, as well |
|
as additional experimental results. |
|
|
|
A.1 ADDITIONAL METRICS FOR OPEN CATALYST IS2RS TEST SET |
|
|
|
Relaxation approaches to IS2RS minimise forces with respect to positions, with the expectation that |
|
forces at the minimum are close to zero. One metric of such a model’s success is to evaluate the |
|
forces at the converged structure using ground truth Density Functional Theory calculations and see |
|
how close they are to zero. Two metrics are provided by OC20 (Chanussot* et al., 2020) on the |
|
IS2RS test set: Force below Threshold (FbT), which is the percentage of structures that have forces |
|
below 0.05 eV/Angstrom, and Average Force below Threshold (AFbT) which is FbT calculated at |
|
multiple thresholds. |
|
|
|
The OC20 project computes test DFT calculations on the evaluation server and presents a summary |
|
result for all IS2RS position predictions. Such calculations take 10-12 hours and they are not available |
|
for the validation set. Thus, we are not able to analyse the results in Tables 8 and 9 in any further |
|
detail. Before application to catalyst screening further work may be needed for direct approaches to |
|
ensure forces do not explode from atoms being too close together. |
|
|
|
|
|
|
|
|
Table 8: OC20 IS2RS Test, Average Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.09% | 0.00% | 0.29% | 0.54% |
|
|
|
Table 9: OC20 IS2RS Test, Force below Threshold %, ↑

| Model | Method | OOD Both | OOD Adsorbate | OOD Catalyst | ID |
|---|---|---|---|---|---|
| Noisy Nodes | Direct | 0.0% | 0.0% | 0.0% | 0.0% |
|
|
|
A.2 MORE DETAILS ON GNS ADAPTATIONS FOR MOLECULAR PROPERTY PREDICTION. |
|
|
|
**Encoder.** |
|
|
|
The node features are a learned embedding lookup of the atom type, and in the case of OC20 two |
|
additional binary features representing whether the atom is part of the adsorbate or catalyst and |
|
whether the atom remains fixed during the quantum chemistry simulation. |
|
|
|
The edge features $e_k$ are the distances $|d|$ featurised using $c$ Radial Bessel basis functions, $\tilde{e}_{RBF,c}(|d|) = \sqrt{\frac{2}{R}}\,\frac{\sin\left(\frac{c\pi}{R}|d|\right)}{|d|}$, and the edge vector displacements, $d$, normalised by the edge distance:

$$e_k = \mathrm{Concat}\left(\tilde{e}_{RBF,1}(|d|), \ldots, \tilde{e}_{RBF,c}(|d|), \frac{d}{|d|}\right)$$
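A sketch of this edge featurisation is shown below; the function names, the cutoff $R$ and the number of basis functions are illustrative choices of ours.

```python
import numpy as np

def bessel_rbf(distance: np.ndarray, num_basis: int, cutoff: float) -> np.ndarray:
    """Radial Bessel basis: sqrt(2 / R) * sin(c * pi * d / R) / d for c = 1..num_basis."""
    c = np.arange(1, num_basis + 1)  # [C]
    d = distance[..., None]          # [..., 1]
    return np.sqrt(2.0 / cutoff) * np.sin(c * np.pi * d / cutoff) / d

def edge_features(displacement: np.ndarray, num_basis: int = 8, cutoff: float = 5.0) -> np.ndarray:
    """Concatenate the Bessel-featurised distance with the unit displacement vector."""
    d = np.linalg.norm(displacement, axis=-1)
    return np.concatenate([bessel_rbf(d, num_basis, cutoff), displacement / d[..., None]], axis=-1)

disp = np.array([[1.0, 0.0, 0.5], [0.2, 0.3, 0.1]])  # toy edge displacement vectors
print(edge_features(disp).shape)                     # (2, 8 + 3)
```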
|
|
|
|
|
Our conversion to fractional coordinates is only applied to the vector quantities, i.e. $\frac{d}{|d|}$.

**Decoder**
|
|
|
|
|
The decoder consists of two parts, a graph-level decoder which predicts a single output for the input |
|
graph, and a node-level decoder which predicts individual outputs for each node. The graph-level |
|
decoder implements the following equation: |
|
|
|
$$y = W^{\mathrm{Proc}} \sum_{i=1}^{|V|} \mathrm{MLP}^{\mathrm{Proc}}(a_i^{\mathrm{Proc}}) + b^{\mathrm{Proc}} + W^{\mathrm{Enc}} \sum_{i=1}^{|V|} \mathrm{MLP}^{\mathrm{Enc}}(a_i^{\mathrm{Enc}}) + b^{\mathrm{Enc}}$$

where $a_i^{\mathrm{Proc}}$ are node latents from the Processor, $a_i^{\mathrm{Enc}}$ are node latents from the Encoder, $W^{\mathrm{Enc}}$ and $W^{\mathrm{Proc}}$ are linear layers, $b^{\mathrm{Enc}}$ and $b^{\mathrm{Proc}}$ are biases, and $|V|$ is the number of nodes. The node-level decoder is simply an MLP applied to each $a_i^{\mathrm{Proc}}$ which predicts $a_i^{\Delta}$.
|
|
|
A.3 MORE DETAILS ON MPNN FOR OGBG-PCQM4M AND OGBG-MOLPCBA |
|
|
|
Our MPNN follows the blueprint of Gilmer et al. (2017). We use $\vec{h}_v^{(t)}$ to denote the latent vector of node $v$ at message passing step $t$, and $\vec{m}_{uv}^{(t)}$ to be the computed message vector for the edge between nodes $u$ and $v$ at message passing step $t$. We define the update functions as:

$$\vec{m}_{uv}^{(t+1)} = \psi_{t+1}\left(\vec{h}_u^{(t)}, \vec{h}_v^{(t)}, \vec{m}_{uv}^{(t)}\right) + \vec{m}_{uv}^{(t-1)} \quad (1)$$

$$\vec{h}_u^{(t+1)} = \phi_{t+1}\left(\vec{h}_u^{(t)}, \sum_{u \in N_v} \vec{m}_{vu}^{(t+1)}, \sum_{v \in N_u} \vec{m}_{uv}^{(t+1)}\right) + \vec{h}_u^{(t)} \quad (2)$$

where the message function $\psi_{t+1}$ and the update function $\phi_{t+1}$ are MLPs. We use a “Virtual Node” which is connected to all other nodes to enable long range communication. Our readout function is an MLP. No spatial features are used.
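A minimal sketch of these updates is given below; the single linear maps standing in for the MLPs, the scatter-sum aggregation and the use of a one-step message residual are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 6, 16
senders = np.array([0, 1, 2, 3, 4, 5])    # toy directed ring of edges
receivers = np.array([1, 2, 3, 4, 5, 0])

def linear(in_dim, out_dim):
    w = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ w

psi = linear(3 * dim, dim)  # message function (an MLP in the paper, a single layer here)
phi = linear(3 * dim, dim)  # node update function

h = rng.normal(size=(num_nodes, dim))
m = np.zeros((len(senders), dim))

for _ in range(4):  # a few message passing steps
    # Message update in the spirit of Eq. (1), with a one-step message residual.
    m = psi(np.concatenate([h[senders], h[receivers], m], axis=-1)) + m
    # Scatter-sum the messages arriving at and leaving each node.
    incoming = np.zeros_like(h)
    outgoing = np.zeros_like(h)
    np.add.at(incoming, receivers, m)
    np.add.at(outgoing, senders, m)
    # Node update in the spirit of Eq. (2), with a residual connection.
    h = phi(np.concatenate([h, incoming, outgoing], axis=-1)) + h

print(h.shape)  # (num_nodes, dim)
```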
|
|
|
|
|
|
|
|
Figure 6: GNS Unsorted MAD per Layer |
|
Averaged Over 3 Random Seeds. Evidence |
|
of oversmoothing is clear. Model trained on |
|
QM9. |
|
|
|
|
|
Figure 7: GNS Sorted MAD per Layer Averaged Over 3 Random Seeds. The trend |
|
is clearer when the MAD values have been |
|
sorted. Model trained on QM9. |
|
|
|
|
|
A.4 EXPERIMENT SETUP FOR 3D MOLECULAR MODELING |
|
|
|
**Open Catalyst. All training experiments were run on a cluster of TPU devices. For the Open Catalyst**
|
experiments, each individual run (i.e. a single random seed) utilised 8 TPU devices on 2 hosts (4 per |
|
host) for training, and 4 V100 GPU devices for evaluation (1 per dataset). |
|
|
|
Each Open Catalyst experiment was run until convergence for up to 200 hours. Our best result, the

large 100 layer model, required 7 days of training using the above setting. Each configuration was run

at least 3 times in this hardware configuration, including all ablation settings.
|
|
|
We further note that making effective use of our regulariser requires sweeping noise values. These |
|
sweeps are dataset dependent and can be carried out using few message passing steps. |
|
|
|
**QM9.** Experiments were also run on TPU devices. Each seed was run using 8 TPU devices on a

single host for training, and 2 V100 GPU devices for evaluation. QM9 targets were trained for

12-24 hours per experiment.
|
|
|
Following Klicpera et al. (2020b) we define std. MAE as:

$$\text{std. MAE} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{N\sigma_m} \sum_{i=1}^{N} \left| f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)} \right|$$

and logMAE as:

$$\text{logMAE} = \frac{1}{M} \sum_{m=1}^{M} \log\!\left( \frac{1}{N\sigma_m} \sum_{i=1}^{N} \left| f_\theta^{(m)}(X_i, z_i) - \hat{t}_i^{(m)} \right| \right)$$
|
with target index $m$, number of targets $M = 12$, dataset size $N$, ground truth values $\hat{t}^{(m)}$, model

$f_\theta^{(m)}$, inputs $X_i$ and $z_i$, and standard deviation $\sigma_m$ of $\hat{t}^{(m)}$.
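A minimal NumPy sketch of these two metrics; computing $\sigma_m$ over the evaluation split shown here is an assumption of this sketch rather than a statement about the benchmark protocol.

```python
import numpy as np

def std_mae_and_log_mae(predictions, targets):
    """predictions, targets: [N, M] arrays over N molecules and M targets."""
    sigma = targets.std(axis=0)                       # sigma_m of the ground truth
    mae = np.abs(predictions - targets).mean(axis=0)  # per-target MAE
    std_mae = np.mean(mae / sigma)                    # standardised MAE
    log_mae = np.mean(np.log(mae / sigma))            # log of standardised MAE
    return std_mae, log_mae
```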
|
|
|
A.5 OVERSMOOTHING ANALYSIS FOR GNS
|
|
|
In addition to Figure 2, we repeat the analysis with the MAD averaged over 3 seeds (Figure 7). Furthermore, we

remove the sorting of layers by MAD value (Figure 6) and find the trend holds.
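For reference, a minimal sketch of the per-layer MAD statistic tracked in this analysis, assuming the standard mean pairwise cosine-distance definition; the function name is ours.

```python
import numpy as np

def mean_average_distance(node_latents):
    """MAD of a set of node latents: the mean pairwise cosine distance,
    assuming the standard cosine-distance definition."""
    normed = node_latents / np.linalg.norm(node_latents, axis=-1, keepdims=True)
    cosine = normed @ normed.T                 # pairwise cosine similarities
    distance = 1.0 - cosine                    # pairwise cosine distances
    off_diagonal = ~np.eye(len(node_latents), dtype=bool)
    return distance[off_diagonal].mean()

# A MAD close to 0 indicates oversmoothed (nearly identical) node latents.
```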
|
|
|
A.6 NOISE ABLATIONS FOR OGBG-MOLPCBA |
|
|
|
We conduct a noise ablation on the random flipping noise for OGBG-MOLPCBA with an 8 layer

MPNN + Virtual Node, and find that our model is not very sensitive to the noise value (Table 10), although

performance degrades from a flip probability of 0.1 onwards.
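For reference, a minimal sketch of this flipping noise under one plausible reading: each categorical node feature is replaced by a uniformly random category with the given probability. The function name and category count are placeholders, not our exact implementation.

```python
import numpy as np

def flip_categorical_features(features, num_categories, flip_probability, rng):
    """Replace each categorical feature with a uniformly random category with
    probability `flip_probability` (the value swept in Table 10)."""
    features = np.asarray(features)
    flip_mask = rng.random(features.shape) < flip_probability
    random_categories = rng.integers(0, num_categories, size=features.shape)
    return np.where(flip_mask, random_categories, features)

rng = np.random.default_rng(0)
atom_types = rng.integers(0, 119, size=32)  # toy categorical node features
noisy_atom_types = flip_categorical_features(atom_types, 119, 0.05, rng)
```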
|
|
|
|
|
----- |
|
|
|
|Flip Probability|Mean AP|
|---|---|
|0.01|27.8% ± 0.002|
|0.03|27.9% ± 0.003|
|0.05|28.1% ± 0.001|
|0.1|28.0% ± 0.003|
|0.2|27.7% ± 0.002|
|
|
|
|
|
Table 10: OGBG-MOLPCBA Noise Ablation |
|
|
|
|Model|Mean AP|
|---|---|
|MPNN Without DropEdge|27.4% ± 0.002|
|MPNN With DropEdge|27.5% ± 0.001|
|MPNN + DropEdge + Noisy Nodes|27.8% ± 0.002|
|
|
|
|
|
|
|
Table 11: OGBG-MOLPCBA DropEdge Ablation |
|
|
|
A.7 DROPEDGE & DROPNODE ABLATIONS FOR OGBG-MOLPCBA |
|
|
|
We conduct an ablation with our 16 layer MPNN using DropEdge at a rate of 0.1 as an alternative

approach to addressing oversmoothing and find it does not improve performance for OGBG-MOLPCBA

(Table 11); similarly, we find DropNode does not improve performance (Table 12). In addition, we

find that these two methods do not combine well, reaching a performance of 27.0% ±

0.003. However, both methods can be combined advantageously with Noisy Nodes.
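For reference, a minimal sketch of the DropEdge corruption used in this ablation; edge lists are taken as integer NumPy arrays, and the function is an illustrative stand-in rather than the exact implementation.

```python
import numpy as np

def drop_edge(senders, receivers, drop_rate, rng):
    """Randomly remove a fraction of edges (DropEdge) before message passing."""
    keep = rng.random(len(senders)) >= drop_rate  # keep each edge with prob 1 - drop_rate
    return senders[keep], receivers[keep]

rng = np.random.default_rng(0)
senders = np.array([0, 1, 2, 3]); receivers = np.array([1, 2, 3, 0])
print(drop_edge(senders, receivers, drop_rate=0.1, rng=rng))
```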
|
|
|
We also measure the MAD of the node latents for each layer and find that Noisy Nodes is indeed more

effective at addressing oversmoothing (Figure 8).
|
|
|
A.8 TRAINING CURVES FOR OC20 NOISY NODES ABLATIONS DEMONSTRATING |
|
OVERFITTING |
|
|
|
The training curves are shown in Figure 9.
|
|
|
|Model|Mean AP|
|---|---|
|MPNN With DropNode|27.5% ± 0.001|
|MPNN Without DropNode|27.5% ± 0.004|
|MPNN + DropNode + Noisy Nodes|28.2% ± 0.005|
|
|
|
|
|
|
|
Table 12: OGBG-MOLPCBA DropNode Ablation |
|
|
|
|
|
----- |
|
|
|
Figure 8: Comparison of the effect of techniques to address oversmoothing on MPNNs. Whilst some effect can be seen from DropEdge and DropNode, Noisy Nodes is significantly better at preserving per-node diversity.
|
|
|
A.9 PSEUDOCODE FOR 3D MOLECULAR PREDICTION TRAINING STEP |
|
|
|
**Algorithm 1: Noisy Nodes Training Step**

$G = (V, E, g)$ // Input graph
$\tilde{G} = G$ // Initialize noisy graph
$\lambda$ // Noisy Nodes weight
**if** not_provided($V'$) **then**
&nbsp;&nbsp;$V' \leftarrow V$
**end**
**if** predict_differences **then**
&nbsp;&nbsp;$\Delta = \{v'_i - v_i \mid i \in 1, \ldots, |V|\}$
**end**
**for each** $i \in 1, \ldots, |V|$ **do**
&nbsp;&nbsp;$\sigma_i$ = sample_node_noise(shape_of($v_i$));
&nbsp;&nbsp;$\tilde{v}_i = v_i + \sigma_i$;
&nbsp;&nbsp;**if** predict_differences **then**
&nbsp;&nbsp;&nbsp;&nbsp;$\tilde{\Delta}_i = \Delta_i - \sigma_i$;
&nbsp;&nbsp;**end**
**end for**
$\tilde{E}$ = recompute_edges($\tilde{V}$);
$\hat{G}' = \mathrm{GNN}(\tilde{G})$;
**if** predict_differences **then**
&nbsp;&nbsp;$V' = \tilde{\Delta}$;
**end**
Loss = $\lambda\,$NoisyNodesLoss($\hat{G}', V'$) + PrimaryLoss($\hat{G}', V'$);
Loss.minimise()
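A minimal NumPy sketch of this training step for the 3D case with predict_differences enabled. The split into a graph-level primary loss and a node-level Noisy Nodes loss, the `gnn` signature and all names are our illustrative reading of Algorithm 1, not the exact implementation.

```python
import numpy as np

def noisy_nodes_training_step(gnn, positions, relaxed_positions, energy_target,
                              noise_scale, noisy_nodes_weight, rng):
    """One training step in the spirit of Algorithm 1 (predict_differences=True).
    `gnn` is assumed to return a graph-level energy and per-node position deltas."""
    # Corrupt the input positions with Gaussian noise (sigma_i in Algorithm 1).
    sigma = rng.normal(scale=noise_scale, size=positions.shape)
    noisy_positions = positions + sigma
    # Targets become differences, corrected for the injected noise: Delta tilde.
    noisy_deltas = (relaxed_positions - positions) - sigma
    # Edges would be recomputed from the noised positions before the GNN call.
    predicted_energy, predicted_deltas = gnn(noisy_positions)
    primary_loss = (predicted_energy - energy_target) ** 2
    noisy_nodes_loss = np.mean((predicted_deltas - noisy_deltas) ** 2)
    return primary_loss + noisy_nodes_weight * noisy_nodes_loss

# Toy usage with a placeholder GNN that predicts zero deltas.
rng = np.random.default_rng(0)
toy_gnn = lambda pos: (pos.sum(), np.zeros_like(pos))
pos = rng.normal(size=(6, 3))
loss = noisy_nodes_training_step(toy_gnn, pos, pos + 0.1, 1.0, 0.3, 0.1, rng)
```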
|
|
|
|
|
----- |
|
|
|
Figure 9: Training curves to accompany Figure 3. Even as validation performance worsens, the

training loss continues to decrease, indicating overfitting.
|
|
|
|
|
----- |
|
|
|
Table 13: Open Catalyst training parameters. |
|
|
|
| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 5e5 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 5e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 1024 |
| Max edges in batch | 12800 |
| Max graphs in batch | 10 |
| MLP number of layers | 3 |
| MLP hidden sizes | 512 |
| Number Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 50 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian (µ = 0, σ = 0.3) |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 1.0 |
|
|
|
A.10 TRAINING DETAILS |
|
|
|
Our code base is implemented in JAX using Haiku and Jraph for GNNs, and Optax for training |
|
(Bradbury et al., 2018; Babuschkin et al., 2020; Godwin* et al., 2020; Hennigan et al., 2020). Model |
|
selection used early stopping. |
|
|
|
All results are reported as an average of 10 random seeds. OGBG-PCQM4M & OGBG-MOLPCBA

were trained with 16 TPUs and evaluated with a single V100 GPU. OGBN-Arxiv was trained and

evaluated with a single TPU.
|
|
|
**3D Molecular Prediction** |
|
|
|
We minimise the mean squared error loss on mean and standard deviation normalised targets and use |
|
the Adam (Kingma & Ba, 2015) optimiser with warmup and cosine decay. For OC20 IS2RE energy |
|
prediction we subtract a learned reference energy, computed using an MLP with atom types as input. |
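For illustration, a minimal sketch of such a warm-up plus cosine schedule using the OC20 values from Table 13; the exact cycling behaviour of our schedule may differ, so treat this as an assumption-laden sketch.

```python
import math

def warmup_cosine_lr(step, warmup_steps=5e5, cycle_length=5e6,
                     start_lr=1e-5, max_lr=1e-4):
    """Linear warm-up from start_lr to max_lr, then cosine decay over the cycle."""
    if step < warmup_steps:
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    progress = min((step - warmup_steps) / cycle_length, 1.0)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```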
|
|
|
For the GNS model the node and edge latents as well as MLP hidden layers were sized 512, with 3

layers per MLP and shifted softplus activations throughout. OC20 & QM9 models were trained

on 8 TPU devices and evaluated on a single V100 GPU. We provide the full set of hyper-parameters

and computational resources used separately for each dataset in the Appendix. All noise levels were

determined by sweeping a small range of values (≈ 10) informed by the noised feature covariance.
|
|
|
**Non Spatial Tasks** |
|
|
|
A.11 HYPER-PARAMETERS |
|
|
|
**Open Catalyst.** We list the hyper-parameters used to train the default Open Catalyst experiment.

If not specified otherwise (e.g. in ablations of these parameters), experiments were run with this

configuration.
|
|
|
|
|
----- |
|
|
|
Table 14: QM9 training parameters. |
|
|
|
| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 3e-7 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 2e6 |
| Loss type | Mean squared error |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 256 |
| Max edges in batch | 4096 |
| Max graphs in batch | 8 |
| MLP number of layers | 3 |
| MLP hidden sizes | 1024 |
| Number Bessel functions | 512 |
| Activation | shifted softplus |
| Message passing layers | 10 |
| Group size | 10 |
| Node/Edge latent vector sizes | 512 |
| Position noise | Gaussian (µ = 0, σ = 0.02) |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.9999 |
| Position loss coefficient | 0.1 |
|
|
|
Dynamic batch sizes refer to constructing batches by specifying maximum node, edge and graph

counts (as opposed to only graph counts) to better balance computational load. Batches are constructed

until one of the limits is reached.
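A minimal sketch of this batching scheme; the `num_nodes`/`num_edges` attributes and the use of the OC20 limits as defaults are illustrative assumptions, not the exact implementation.

```python
from collections import namedtuple

def dynamic_batches(graphs, max_nodes=1024, max_edges=12800, max_graphs=10):
    """Yield batches built by adding graphs until any of the node, edge or
    graph limits would be exceeded."""
    batch, nodes, edges = [], 0, 0
    for graph in graphs:  # each graph exposes num_nodes / num_edges
        if batch and (nodes + graph.num_nodes > max_nodes
                      or edges + graph.num_edges > max_edges
                      or len(batch) + 1 > max_graphs):
            yield batch
            batch, nodes, edges = [], 0, 0
        batch.append(graph)
        nodes += graph.num_nodes
        edges += graph.num_edges
    if batch:
        yield batch

Graph = namedtuple("Graph", ["num_nodes", "num_edges"])
print(len(list(dynamic_batches([Graph(300, 4000)] * 7))))  # 3 batches
```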
|
|
|
Parameter updates were smoothed using an EMA for the current training step with the current decay

value computed as decay = min(decay, (1.0 + step)/(10.0 + step)). As discussed in the

evaluation, best results on Open Catalyst were obtained by utilising a 100 layer network with group

size 10.
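A minimal sketch of this EMA update, assuming parameters are stored as a flat dictionary of arrays; the helper name is ours.

```python
def ema_update(ema_params, params, step, decay=0.9999):
    """EMA parameter smoothing with the warm-up-adjusted decay from above."""
    decay = min(decay, (1.0 + step) / (10.0 + step))
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}
```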
|
|
|
**QM9.** Table 14 lists QM9 hyper-parameters, which primarily reflect the smaller dataset and geometries

with fewer long range interactions. For U0, U, H and G we use a slightly larger number of graphs

per batch (16) and a smaller position loss coefficient of 0.01.
|
|
|
**OGBG-PCQM4M.** Table 15 provides the hyper-parameters for OGBG-PCQM4M.



**OGBG-MOLPCBA.** Table 16 provides the hyper-parameters for the OGBG-MOLPCBA experiments.



**OGBN-ARXIV.** Table 17 provides the hyper-parameters for the OGBN-Arxiv experiments.
|
|
|
|
|
----- |
|
|
|
Table 15: OGBG-PCQM4M Training Parameters. |
|
|
|
| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 5e4 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 5e5 |
| Loss type | Mean absolute error |
| Reconstruction loss type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Node/Edge latent vector sizes | 512 |
| Noisy Nodes category flip rate | 0.05 |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction loss coefficient | 0.1 |
|
|
|
Table 16: OGBG-MOLPCBA Training Parameters. |
|
|
|
| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 1e4 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-4 |
| Cosine cycle length | 1e5 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Softmax Cross Entropy |
| Batch size | Dynamic to max edge/node/graph count |
| Max nodes in batch | 20,480 |
| Max edges in batch | 8,192 |
| Max graphs in batch | 512 |
| MLP number of layers | 2 |
| MLP hidden sizes | 512 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 512 |
| DropNode rate | 0.1 |
| Dropout rate | 0.1 |
| Noisy Nodes category flip rate | 0.05 |
| Parameter update | Exponentially moving average (EMA) smoothing |
| EMA decay | 0.999 |
| Reconstruction loss coefficient | 0.1 |
|
|
|
|
|
----- |
|
|
|
Table 17: OGBN-ARXIV Training Parameters.
|
|
|
| Parameter | Value or description |
|---|---|
| Optimiser | Adam with warm up and cosine cycling |
| β1 | 0.9 |
| β2 | 0.95 |
| Warm up steps | 50 |
| Warm up start learning rate | 1e-5 |
| Warm up/cosine max learning rate | 1e-3 |
| Cosine cycle length | 12,000 |
| Loss type | Softmax Cross Entropy |
| Reconstruction loss type | Mean Squared Error |
| Batch size | Full graph |
| MLP number of layers | 1 |
| Activation | relu |
| Batch Normalization | Yes, after every hidden layer |
| Node/Edge latent vector sizes | 256 |
| Dropout rate | 0.5 |
| Noisy Nodes input dropout | 0.05 |
| Reconstruction loss coefficient | 0.1 |
|
|
|
|
|
----- |
|
|
|
|