Published as a conference paper at ICLR 2022

COLD BREW: DISTILLING GRAPH NODE REPRESENTATIONS WITH INCOMPLETE OR MISSING NEIGHBORHOODS

Wenqing Zheng, Edward W Huang, Nikhil Rao, Sumeet Katariya, Zhangyang Wang, Karthik Subbian
{wenqzhen,ewhuang,nikhilsr,katsumee,wzhangwa,ksubbian}@amazon.com

ABSTRACT

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in node classification, regression, and recommendation tasks. GNNs work well when rich and high-quality connections are available. However, their effectiveness is often jeopardized in many real-world graphs in which node degrees have power-law distributions. The extreme case of this situation, where a node may have no neighbors, is called Strict Cold Start (SCS). SCS forces the prediction to rely completely on the node's own features. We propose Cold Brew, a teacher-student distillation approach to address the SCS and noisy-neighbor challenges for GNNs. We also introduce Feature Contribution Ratio (FCR), a metric that quantifies the behavior of inductive GNNs under SCS. We experimentally show that FCR disentangles the contributions of the different graph data components and helps select the best architecture for SCS generalization. We further demonstrate the superior performance of Cold Brew on several public benchmark and proprietary e-commerce datasets, where many nodes have either very few or noisy connections. Our source code is available at https://github.com/amazon-research/gnn-tail-generalization.

1 INTRODUCTION

Figure 1: Top: Graph nodes may have a power-law ("long-tail") connectivity distribution, with a large fraction of nodes (yellow) having few to no neighbors. Bottom: Long-tail node-degree distributions (normalized number of nodes versus node degree) in real-world datasets (cora, citeseer, pubmed, arxiv, chameleon, actor, squirrel, wisconsin, cornell, texas), which make modern GNNs fail to generalize to the tail/cold-start nodes.

Graph Neural Networks (GNNs) achieve state-of-the-art results across a wide range of tasks such as graph classification, node classification, link prediction, and recommendation (Wu et al., 2020; Goyal & Ferrara, 2018; Kherad & Bidgoly, 2020; Shaikh et al., 2017; Silva et al., 2010; Zhang et al., 2019). Most modern GNNs rely on the principle of message passing to aggregate each node's features from its (multi-hop) neighborhood (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017; Xu et al., 2018a; Wu et al., 2019; Klicpera et al., 2018). The success of GNNs therefore relies on the presence of dense and high-quality connections. Even inductive GNNs (Hamilton et al., 2017) learn a function of the node feature and the node neighborhood, which requires the neighborhood to be present during inference.

A practical barrier to the widespread applicability of GNNs is the long-tail node-degree distribution found in many large-scale real-world graphs. Specifically, the node degree distribution is power-law in nature, with a majority of nodes having very few connections (Hao et al., 2021; Ding et al., 2021; Lam et al., 2008; Lu et al., 2020). Figure 1 (top) illustrates a long-tail distribution, accompanied by the statistics of several public datasets (bottom).

Many information retrieval and recommendation applications face the scenario of Strict Cold Start (SCS) (Li et al., 2019b; Ding et al., 2021), wherein some nodes have no connected edges.
Predicting for these nodes is even more challenging than for the tail nodes in the graph. In these cases, existing GNNs fail to perform well due to the sparsity or absence of the neighborhood.

In this paper, we develop GNN models with truly inductive capabilities: one can learn effective node embeddings for "orphaned" nodes in a graph. This capability is important to fully realize the potential of large-scale GNN models on modern, industry-scale datasets with very long tails and many orphaned nodes. To this end, we adopt the teacher-student knowledge distillation procedure (Yang et al., 2021; Chen et al., 2020b) and propose Cold Brew to distill the knowledge of a GNN teacher into a multilayer perceptron (MLP) student. The Cold Brew framework addresses two key questions: (1) how can we efficiently distill the teacher's knowledge for the sake of tail and cold-start generalization, and (2) how can the student make use of this knowledge? We answer these two questions by learning a latent node-wise embedding using knowledge distillation, which both avoids "over-smoothness" (Oono & Suzuki, 2020; Li et al., 2018; NT & Maehara, 2019) and discovers latent neighborhoods, which are missing for the SCS nodes. Note that in contrast to traditional knowledge distillation (Hinton et al., 2015), our aim is not to train a simpler student model to perform as well as the more complex teacher. Instead, we aim to train a student model that is better than the teacher at generalizing to tail or SCS samples.

In addition, to help select cold-start-friendly model architectures, we develop a metric called Feature Contribution Ratio (FCR) that quantifies the contribution of the node features relative to the adjacency structure in the dataset for a specific downstream task. FCR indicates the difficulty of generalizing to tail and cold-start nodes and guides our principled selection of both teacher and student model architectures in Cold Brew. We summarize our key contributions as follows:

• To generalize better to tail and SCS nodes, we design the Cold Brew knowledge distillation framework: we enhance the teacher GNN by appending a node-wise Structural Embedding (SE) to strengthen the teacher's expressiveness, and design a novel mechanism for the MLP student to rediscover the missing "latent/virtual neighborhoods," on which it can perform message passing.

• We propose Feature Contribution Ratio (FCR), which quantifies the difficulty of generalizing to tail and cold-start nodes. We leverage FCR in a principled "screening process" to select the best model architectures for both the GNN teacher and the MLP student.

• As existing GNN studies only evaluate on the entire graph and do not explicitly evaluate on head/tail/SCS nodes, we uncover the hidden differences among head/tail/SCS by creating bespoke train/test splits. Extensive experiments on public and proprietary e-commerce graph datasets validate the effectiveness of Cold Brew in tail and cold-start generalization.

1.1 PROBLEM SETUP

GNNs learn node representations from two components of graph data: they process node features through distributed node-wise transformations, and they process the adjacency structure through localized neighborhood aggregations. For the first component, GNNs apply shared feature transformations to all nodes regardless of their neighborhoods. For the second component, GNNs use permutation-invariant aggregators to collect neighborhood information.
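As a concrete (though simplified) illustration of these two components, the sketch below applies a shared linear transformation to every node and a permutation-invariant mean aggregation over neighbors. The dense adjacency representation and the function names are ours for illustration, not the paper's implementation.

```python
import torch

def gnn_layer(X, A, W):
    """One message-passing layer.
    X: [N, d_in] node features, A: [N, N] binary adjacency, W: [d_in, d_out]."""
    H = X @ W                                     # shared node-wise transformation
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)
    neigh = (A @ H) / deg                         # permutation-invariant mean over neighbors
    return torch.relu(H + neigh)                  # combine self and neighborhood messages

# toy usage
N, d_in, d_out = 5, 8, 16
X = torch.randn(N, d_in)
A = (torch.rand(N, N) > 0.5).float()
A = ((A + A.T) > 0).float().fill_diagonal_(0)     # symmetric adjacency, no self-loops
W = torch.randn(d_in, d_out)
out = gnn_layer(X, A, W)                          # shape [5, 16]
```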
In the sequel, we focus on the node classification problem for simplicity. All our proposed methods can be easily adapted to other semi-supervised or unsupervised problem settings, which we show in Section 5. We denote the graph data of interest by $G$ with node set $V$, $|V| = N$. Each node possesses a $d_{in}$-dimensional feature and a $d_{out}$-dimensional label (either $d_{out}$ classes or a continuous vector in the case of regression). Let $X^{(0)} \in \mathbb{R}^{N \times d_{in}}$ and $Y \in \mathbb{R}^{N \times d_{out}}$ be the matrices of node features and labels, respectively. Let $N_i$ be the neighborhood of the $i$-th node, $0 \le i < N$. In large-scale graphs, $|N_i|$ is often small for a (possibly substantial) portion of nodes. We refer to these nodes as tail nodes. Some nodes may have $|N_i| = 0$, and we refer to these extreme cold-start cases as isolated nodes.

A classical GNN learns the representation of the $i$-th node at the $l$-th layer as a function of its own representation and its neighborhood's representations at the $(l-1)$-th layer:

$$x_i^l := f\big(\{x_i^{l-1}\},\ \{x_j^{l-1}\}_{j \in N_i}\big) \qquad (1)$$

where $f(\cdot)$ is a general function that applies a node-wise transformation to $x_i^{l-1}$ and aggregates the information of its neighborhood $\{x_j^{l-1}\}_{j \in N_i}$ to obtain the final node representation. Given node $i$'s input features $x_i^0$ and its neighborhood $N_i$, one can use (1) to obtain its representation and predict $y_i$, making these models inductive. We are interested in improving the performance of these GNNs on tail and cold-start nodes, where $N_i$ is either unreliable (for example, a user with only one watched movie, or an item with too few purchases) or absent. In these cases, applying (1) yields a suboptimal node representation, since $\{x_j^{l-1}\}_{j \in N_i}$ is unreliable or empty at inference time.

2 RELATED WORK

GNNs aggregate neighborhood information to learn node representations (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017; Xu et al., 2018a; Wu et al., 2019; Klicpera et al., 2018). Inductive variants of GNNs such as GraphSAGE (Hamilton et al., 2017) require initial node features as well as the neighborhood information of each node to learn the representation. Most works on improving GNNs have focused on learning better aggregation functions, and methods that work when the neighborhood is absent or noisy have not been sufficiently explored, except for two recent concurrent works (Hu et al., 2021; Zhang et al., 2021). In the context of cold start, (Hao et al., 2021) and (Ding et al., 2021) employ a transfer learning approach. (Yang et al., 2021) proposes a knowledge distillation approach for GNNs, while (Chen et al., 2020b) proposes a self-distillation approach. In all of the above cases, the models need full knowledge of the neighbors of the cold-start nodes in question and do not address the case of noisy or missing neighborhoods. Another possible solution is to directly train an MLP that only takes node features: (Hu et al., 2021) proposes to learn graph embeddings with only a node-wise MLP, while using a contrastive loss to regularize toward the graph structure.

Some previous works have studied the relation between node feature similarity and edge connections and how it influences the selection of appropriate graph models. (Pei et al., 2020) proposed a homophily metric that categorizes graphs into assortative and disassortative classes. (Wang et al., 2021) dissected the feature propagation steps of linear GCNs from the perspective of continuous graph diffusion and analyzed why linear GCNs fail to benefit from more propagation steps.
(Liu et al., 2020a) further studied the influence of homophily on model selection and proposed a non-local GNN.

3 STRICT COLD START GENERALIZATION

We now address the problem of generalization to tail and cold-start nodes, where the neighborhood information is missing or noisy (Section 1). A naive baseline is to train an MLP that maps node features to labels. However, such a method disregards all graph information, and we show via our Feature Contribution Ratio and other experimental results that for most assortative graph datasets, the node-wise MLP approach is suboptimal.

The key idea of our framework is the following: the GNN maps node features into a $d$-dimensional embedding space, and since the number of nodes $N$ is usually much larger than the embedding dimensionality $d$, the learned embeddings form an overcomplete set for this space. This implies that any node representation can be cast as a linear combination of $K \ll N$ existing node representations. Our aim is to train a student model that accurately discovers the combination of the best $K$ existing node embeddings for a target isolated node. We call this procedure latent/virtual neighborhood discovery, which is equivalent to using MLPs to "mimic" the node representations learned by the teacher GNN.

We adopt the knowledge distillation procedure (Yang et al., 2021; Chen et al., 2020b) to improve the quality of the learned embeddings for tail and cold-start nodes. We use a teacher GNN model to embed the nodes onto a low-dimensional manifold by utilizing the graph structure. The goal of the student is then to learn a mapping from the node features to this manifold without the knowledge of the graph that the teacher has. We further aim to let the student model generalize to SCS cases where the teacher model fails, beyond merely mimicking the teacher as standard knowledge distillation does.

Figure 2: (a) The teacher-student knowledge distillation of the Cold Brew framework under the cold-start setting. In the normal case, the GNN relies on both the node feature and the adjacency structure to make a prediction. In the cold-start case, when the adjacency structure is missing, the Cold Brew student first estimates the adjacency structure (MLP1: latent neighbors selected among the graph embeddings from the teacher GNN), then uses both the node feature and the estimated structure to make the prediction (MLP2). "SE" denotes the structural embedding learned by Cold Brew's teacher GNN. (b) The four atomic components that decide the GNN embedding of node $i$: 1. the self-label, 2. the neighbor-labels, 3. the self-feature, and 4. the neighbor-features. Our proposed FCR metric disentangles them into two models: an MLP that only considers Parts 1 and 3, and label propagation that only considers Parts 1 and 2.

3.1 THE TEACHER MODEL OF COLD BREW: STRUCTURAL EMBEDDING GNN

Consider a graph $G$.
For a Graph Convolutional Network with $L$ layers, the $l$-th layer transformation can be written as $X^{(l+1)} = \sigma(\tilde{A} X^{(l)} W^{(l)})$, where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the normalized adjacency matrix, $A$ is the adjacency matrix, and $D$ is the diagonal degree matrix. (Compared to Equation (1), multiplication by $\tilde{A}$ plays the role of aggregating both $\{x_i\}$ and $\{x_j\}_{j \in N_i}$.) $X^{(l)} \in \mathbb{R}^{N \times d_1}$ holds the node representations at the $l$-th layer and $W^{(l)} \in \mathbb{R}^{d_1 \times d_2}$ is the feature transformation matrix, where the values of $(d_1, d_2)$ depend on the layer $l$: $(d_1, d_2) = (d_{in}, d_{hidden})$ for $l = 0$, $(d_{hidden}, d_{hidden})$ for $1 \le l \le L-2$, and $(d_{hidden}, n_{classes})$ for $l = L-1$. $\sigma(\cdot)$ is the nonlinearity applied at each layer (e.g., ReLU), and $\mathrm{Norm}(\cdot)$ refers to an optional batch or layer normalization.

GNNs typically suffer from oversmoothing (Oono & Suzuki, 2020; Li et al., 2018; NT & Maehara, 2019), i.e., node representations become too similar to each other. Inspired by the positional encoding in Transformers (Vaswani et al., 2017), we train the teacher GNN to learn an additional set of node embeddings that can be appended, which we term the Structural Embedding (SE). SE learns to incorporate extra information beyond the original node features (such as node labels in the case of semi-supervised learning) through gradient backpropagation. The existence of SE avoids the oversmoothing issue in GNNs: the transformations applied to different nodes are no longer identical, since the SE of each node is different and participates in the feature transformation. This can be of independent interest to GNN researchers. Specifically, for each layer $l$, $0 \le l \le L-1$, the Structural Embedding takes the form of a learnable matrix $E^{(l)}$, and the SE-GNN layer forward pass can be written as:

$$X^{(l+1)} = \sigma\big(\tilde{A}\,(X^{(l)} W^{(l)} + E^{(l)})\big), \qquad X^{(l)} \in \mathbb{R}^{N \times d_1},\; W^{(l)} \in \mathbb{R}^{d_1 \times d_2},\; E^{(l)} \in \mathbb{R}^{N \times d_2} \qquad (2)$$

Remark 1: SE is not the same as the bias term in the traditional feature transformation $X^{(l+1)} = \sigma\big(\tilde{A}\,(X^{(l)} W^{(l)} + b^{(l)})\big)$: in the bias $b \in \mathbb{R}^{N \times d_2}$, the rows are copied/shared across all nodes, whereas we learn a different structural embedding for every node.

Remark 2: SE is also unlike traditional label propagation (LP) (Iscen et al., 2019; Wang & Leskovec, 2020; Huang et al., 2020b). LP encodes label information by iterating $E^{(t+1)} = (1-\alpha) G + \alpha \tilde{A} E^{(t)}$, where $G$ is a one-hot encoding of the ground-truth classes for training nodes and zeros for test nodes, and $0 < \alpha < 1$ is the mixing proportion at each iteration.

SE-GNN enables node $i$ to encode its own and its neighbors' label information (inferred in the case of missing labels) into its own node embedding through $\tilde{A}$. We use Graph Convolutional Networks (Kipf & Welling, 2016), combined with other building blocks proposed in the recent literature, including (1) initial/dense/jumping connections and (2) batch/pair/node/group normalization, as the backbone of Cold Brew's teacher GNN. More details are described in Appendix B. We also apply a regularization term, yielding the following loss function:

$$\mathrm{loss} = \mathrm{CE}\big(X^{(L)}_{train}, Y_{train}\big) + \eta \, \|E\|_2^2 \qquad (3)$$

where $X^{(L)}_{train}$ is the model's embedding at the $L$-th layer on the training set, $\mathrm{CE}(X^{(L)}_{train}, Y_{train})$ is the cross-entropy loss between the model output and the ground truth $Y$ on the training set, and $\eta$ is a regularization coefficient (grid-searched per dataset in practice). The cross-entropy loss can be replaced by any other appropriate loss depending on the task.
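A minimal sketch of an SE-GNN layer and its regularized loss (Equations 2 and 3), assuming a dense normalized adjacency matrix is precomputed; the class and function names are ours for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEGCNLayer(nn.Module):
    """One Structural-Embedding GCN layer: X_{l+1} = sigma(A_hat (X_l W_l + E_l)),
    where E_l holds one learnable row per node (Equation 2)."""
    def __init__(self, num_nodes, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.E = nn.Parameter(torch.zeros(num_nodes, d_out))   # structural embedding

    def forward(self, X, A_hat):
        # A_hat: [N, N] symmetrically normalized adjacency D^{-1/2} A D^{-1/2}
        return F.relu(A_hat @ (self.W(X) + self.E))

def teacher_loss(logits, labels, train_mask, E_list, eta=1e-4):
    """Cross entropy on the training nodes plus the L2 penalty on the
    structural embeddings of all layers (Equation 3)."""
    ce = F.cross_entropy(logits[train_mask], labels[train_mask])
    reg = sum(E.pow(2).sum() for E in E_list)
    return ce + eta * reg
```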
3.2 THE STUDENT MLP MODEL OF COLD BREW

We design the student to be composed of two MLP modules. Given a target node, the first MLP imitates the node embeddings generated by the GNN teacher. Next, given any node, we find a set of virtual neighbors of that node from the graph. Finally, a second MLP attends to both the target node and its virtual neighborhood and transforms them into the embeddings of interest.

Suppose we would like to obtain the embedding of a potentially isolated target node $i$ given only its feature $x_i$. From the teacher GNN, at each layer $l$, we have access to two sets of node embeddings: $X^{(l)} W^{(l)}$ and $E^{(l)}$. Denote by $\bar{E}$ the embeddings that the teacher GNN passes to the student MLPs. We offer two options for $\bar{E}$: it can be the final output of the teacher GNN (in this case, $\bar{E} := X^{(L)} \in \mathbb{R}^{N \times d_{out}}$), or it can be the concatenation of all intermediate results of the teacher GNN, similar to (Romero et al., 2014): $\bar{E} := X^{(L)} \,\Vert\, \big(\Vert_{l=0}^{L-1}\, (E^{(l)} + X^{(l)} W^{(l)})\big) \in \mathbb{R}^{N \times (d_{hidden}(L-1)+d_{out})}$, where $\Vert$ denotes concatenation of matrices along the feature (second) dimension. $\bar{E}$ acts as the target for the first MLP and as the input to the second MLP.

The first MLP learns a mapping from the input node features $X^{(0)}$ to $\bar{E}$: for node $i$, $\hat{e}_i = \xi_1(x_i^{(0)})$, where $\hat{e}_i$ is trained with supervised learning to reproduce $\bar{E}[i,:]$. We then discover the virtual neighborhood by applying an attention-based selection over the existing embeddings in the graph and linearly combining them:

$$\tilde{e}_i = \mathrm{softmax}\big(\Theta_K(\hat{e}_i \bar{E}^{\top})\big)\,\bar{E} \qquad (4)$$

where $\Theta_K(\cdot)$ is the top-$K$ hard-thresholding operator: for $z \in \mathbb{R}^{1 \times N}$, $[\Theta_K(z)]_j = z_j$ if $z_j$ is among the top-$K$ largest elements of $z$, and $[\Theta_K(z)]_j = -\infty$ otherwise. Finally, the second MLP learns a mapping $\xi_2 : [x_i, \tilde{e}_i] \to y_i$, where $y_i = Y[i,:]$ is the ground truth for node $i$.

Equation (4) first selects $K$ of the $N$ nodes that the teacher GNN was trained on via the hard-thresholding operator; $\tilde{e}_i$ is then a linear combination of $K$ node embeddings (we slightly abuse terminology here, since $\bar{E}$ contains node and structural embeddings from multiple layers). Thus, every sample, whether or not it was seen while training the GNN, can be represented as a linear combination of these representations: we decompose every node embedding over an (overcomplete) basis. The MLP $\xi_2(\cdot)$ maps this representation to the final target of interest.

The training of $\xi_1(\cdot)$ minimizes the mean squared error over the non-isolated nodes in the graph (mimicking the teacher's embeddings), and the training of $\xi_2(\cdot)$ minimizes the cross entropy (for node classification) or the mean squared error (for node regression) on the training split of the tail and isolated parts of the graph. An illustration of the student MLP's inference procedure for isolated nodes is shown in Figure 2. When the number of nodes is large, the ranking procedure involved in $\Theta_K(\cdot)$ can be precomputed after training the first MLP and before training the second.
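The sketch below puts Section 3.2 together, assuming the teacher's embedding table $\bar{E}$ has already been computed; the class name, hidden sizes, and default $K$ are illustrative choices of ours, not values prescribed by the paper.

```python
import torch
import torch.nn as nn

class ColdBrewStudent(nn.Module):
    """Two-MLP student: MLP1 maps raw features into the teacher's embedding space;
    the top-K attention of Equation (4) selects a virtual neighborhood from the
    teacher's embedding table E_bar; MLP2 maps [feature, aggregated virtual
    neighborhood] to the prediction target."""
    def __init__(self, d_in, d_emb, d_out, E_bar, K=10, d_hidden=128):
        super().__init__()
        self.register_buffer("E_bar", E_bar)            # [N, d_emb] frozen teacher embeddings
        self.K = K
        self.mlp1 = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, d_emb))
        self.mlp2 = nn.Sequential(nn.Linear(d_in + d_emb, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, d_out))

    def virtual_neighborhood(self, e_hat):
        scores = e_hat @ self.E_bar.T                   # [B, N] similarity to every teacher embedding
        topk = scores.topk(self.K, dim=-1)
        masked = torch.full_like(scores, float("-inf"))
        masked.scatter_(-1, topk.indices, topk.values)  # Theta_K: keep top-K scores, drop the rest
        return torch.softmax(masked, dim=-1) @ self.E_bar

    def forward(self, x):
        e_hat = self.mlp1(x)                            # mimic the teacher embedding
        e_tilde = self.virtual_neighborhood(e_hat)      # Equation (4)
        return self.mlp2(torch.cat([x, e_tilde], dim=-1))
```

Training would follow the two-stage procedure described above: fit `mlp1` with a mean-squared-error loss against $\bar{E}$ on non-isolated nodes, then fit `mlp2` with the task loss on the tail/isolation training split.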
3.3 MODEL INTERPRETATION FROM A LABEL SMOOTHING PERSPECTIVE

We quote Theorem 1 in (Wang & Leskovec, 2020): suppose that the latent ground-truth mapping from node features to node labels is differentiable and $L$-Lipschitz. If the edge weights $a_{ij}$ approximately smooth $x_i$ over its immediate neighbors with error $\epsilon_i$, i.e., $x_i = \frac{1}{d_{ii}} \sum_{j \in N_i} a_{ij} x_j + \epsilon_i$, then the $a_{ij}$ also approximately smooth $y_i$, with the error bounded as $\big|y_i - \frac{1}{d_{ii}} \sum_{j \in N_i} a_{ij} y_j\big| \le L \|\epsilon\|_2 + o\big(\max_{j \in N_i} \|x_j - x_i\|_2\big)$, where $o(\cdot)$ denotes a higher-order infinitesimal.

This theorem indicates that the error of the label prediction is determined by the residual of the features after neighborhood aggregation: if $\epsilon_i$ is large, then the error in the label prediction is also large, and vice versa. With structural embeddings, however, each node $i$ also learns an independent embedding $\bar{E}[i,:]$ during the aggregation, which changes $\frac{1}{d_{ii}} \sum_{j \in N_i} a_{ij} x_j + \epsilon_i$ into $\frac{1}{d_{ii}} \sum_{j \in N_i} a_{ij} x_j + \bar{E}[i,:] + \epsilon_i$. By the theorem, the structural embedding $\bar{E}$ is therefore important for the teacher model: it allows higher flexibility and expressiveness in learning the residual difference between nodes, and hence the error $\epsilon_i$ can be lowered if $\bar{E}$ is properly learned.

From this theorem, one can also see the necessity of introducing neighborhood aggregation as in the Cold Brew student model. If one directly applies an MLP without neighborhood aggregation, $\epsilon_i$ turns out to be non-negligible, leading to higher losses in the label predictions. Cold Brew instead introduces a neighborhood aggregation mechanism in which the second MLP of the student takes over the aggregation of the neighborhood generated by the first MLP. Therefore, Cold Brew eliminates the above residual error even in the absence of the actual neighborhood.

4 MODEL SELECTION AND GRAPH COMPONENT DISENTANGLEMENT WITH FEATURE CONTRIBUTION RATIO

We now discuss Feature Contribution Ratio (FCR), a metric that quantifies the difficulty of learning representations in the truly inductive cold-start case, together with a hyperparameter optimization approach that selects the model architectures best suited to tail and cold-start generalization.

As conceptually illustrated in Figure 2, four atomic components contribute to the learned embedding of node $i$ in the graph: 1. the label of $i$ (self-label); 2. the labels of the neighbors of $i$ (neighbor-labels); 3. the features of $i$ (self-feature); and 4. the features of the neighbors of $i$ (neighbor-features). To quantify the SCS generalization difficulty, we first divide these four components into two submodules to disentangle the contributions of the node features from those of the adjacency structure of the graph dataset. We then quantify the difficulty based on the assumption that it is proportional to the contribution ratio of the node features.

We posit that a submodule that learns accurate node representations must include the node's (self) label, so that training can be performed via backpropagation. What remains is to use the label together with the other atomic components to construct two specialized models that each make use of only the node features or only the adjacency structure. For the first submodule, we build an MLP that maps self-features to self-labels, ignoring any neighborhood information present in the dataset. For the second submodule, we adopt a Label Propagation (LP) method (Huang et al., 2020a) to learn representations from self- and neighbor-labels (we ignore the node features and use the label logits, as explained in Huang et al. (2020a)). This model ignores any node feature information.

With the above two submodules, we introduce the Feature Contribution Ratio (FCR), which characterizes the relative importance of the node features and the graph structure. Specifically, for a graph dataset $G$, we define the contribution of a submodule to be its residual performance compared to a full-fledged GNN (e.g., Equation (1)) that uses both the node features and the adjacency structure. Denote by $z_{MLP}$, $z_{LP}$, and $z_{GNN}$ the performance of the MLP submodule, the LP submodule, and the full GNN on the test set, respectively.
If $z_{MLP} \ll z_{GNN}$, then $\mathrm{FCR}(G)$ is small and the graph structure is important, so noisy or missing neighborhood information will hurt model performance. Based on this intuition, we define FCR as:

$$\delta_{MLP} = z_{GNN} - z_{MLP}, \qquad \delta_{LP} = z_{GNN} - z_{LP} \qquad (5a)$$

$$\mathrm{FCR}(G) = \begin{cases} \dfrac{\delta_{LP}}{\delta_{MLP} + \delta_{LP}} \times 100\% & z_{MLP} \le z_{GNN} \\[2mm] \left(1 + \dfrac{|\delta_{MLP}|}{|\delta_{MLP}| + \delta_{LP}}\right) \times 100\% & z_{MLP} > z_{GNN} \end{cases} \qquad (5b)$$

Interpreting FCR values. For a particular graph $G$, if $0\% \le \mathrm{FCR}(G) < 50\%$, then $z_{GNN} > z_{LP} > z_{MLP}$, and the neighborhood information in $G$ is mainly responsible for the GNN achieving good performance. If $50\% \le \mathrm{FCR}(G) < 100\%$, then $z_{GNN} > z_{MLP} > z_{LP}$, and the node features contribute more to the GNN's performance. If $\mathrm{FCR}(G) \ge 100\%$, then $z_{MLP} > z_{GNN} > z_{LP}$, and the node aggregation in GNNs can actually reduce performance compared to pointwise models. This case usually happens for disassortative graphs, where the majority of neighbors hold labels different from those of the center nodes (e.g., as observed by (Liu et al., 2020a)).

Integrating FCR as a tool to design the teacher and student models. For some graph datasets and models, SCS generalization can be challenging without neighborhood information (i.e., $z_{GNN} > z_{LP} > z_{MLP}$). We hence use FCR as a principled "screening process" to select the model architectures, for both teacher and student, that have the best inductive bias for SCS generalization. To achieve this, during the computation of FCR, we perform an exhaustive grid search over the architectures (residual connection types, normalization, hidden layers, etc.) of the MLP, LP, and GNN modules, and pick the best-performing variant. The detailed definition of the search space can be found in Appendix B. We treat this grid search procedure as a special case of architecture selection and hyperparameter optimization for Cold Brew. We observe that FCR is able to identify the GNN and MLP architectures that are particularly friendly to SCS generalization, improving our method design. In experiments, we observe that different model configurations are favored by different datasets, and we use the optimal teacher GNN and student MLP architectures found in this way to perform Cold Brew. More detailed discussions are presented in Section 5.3.
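A direct transcription of Equation (5) into code; the function name is ours. As a sanity check, the Cora numbers later reported in Table 2 give FCR ≈ 32.86%.

```python
def feature_contribution_ratio(z_gnn: float, z_mlp: float, z_lp: float) -> float:
    """Feature Contribution Ratio (Equation 5), given the test performance of the
    full GNN, the node-wise MLP, and label propagation on the same split."""
    d_mlp = z_gnn - z_mlp
    d_lp = z_gnn - z_lp
    if z_mlp <= z_gnn:
        return d_lp / (d_mlp + d_lp) * 100.0
    return (1.0 + abs(d_mlp) / (abs(d_mlp) + d_lp)) * 100.0

# Cora (Table 2): GNN 86.96, MLP 69.02, LP 78.18  ->  FCR ~ 32.86%
print(round(feature_contribution_ratio(86.96, 69.02, 78.18), 2))
```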
5 EXPERIMENTS AND DISCUSSION

In this section, we first evaluate FCR by training GNNs on several commonly used graph datasets and observing how well they generalize to tail and cold-start nodes. We also compare FCR to the graph homophily metric $\beta$ proposed in (Pei et al., 2020). Next, we apply Cold Brew to these datasets and compare its generalization ability to baseline graph-based and pointwise MLP models. We also show results on proprietary industry datasets.

5.1 DATASETS AND SPLITS

We perform experiments on five open-source datasets and four proprietary datasets. The proprietary e-commerce datasets, "E-comm 1/2/3/4", refer to graphs subsampled from anonymized logs of an e-commerce store. They are sampled so as not to reflect the actual raw traffic distributions, and results on these datasets are reported relative to a baseline model. The different number suffixes refer to different product subsets, and the labels indicate the product categories that we wish to predict. Node features are text embeddings obtained from a fine-tuned BERT model. We show FCR values for the public datasets. The statistics of the datasets are summarized in Table 1.

We create training and test splits of the data in order to specifically study the generalization ability of Cold Brew to tail and cold-start nodes. In the following tables, the head data corresponds to the top 10% highest-degree nodes in the graph and the subgraph that they induce. We take the data corresponding to the bottom 10% of the degree distribution and artificially remove all edges emanating from these nodes; we refer to this set of nodes as the isolation data. The tail data corresponds to the 10% of nodes in the remaining graph with the lowest (non-zero) degree and the subgraph that they induce. All zero-degree nodes are in the isolation data. The overall data refers to the training/test splits without distinguishing head/tail/isolation.

| | Cora | Citeseer | Pubmed | Arxiv | Chameleon | E-comm1 | E-comm2 | E-comm3 | E-comm4 |
|---|---|---|---|---|---|---|---|---|---|
| Num. of Nodes | 2708 | 3327 | 19717 | 169343 | 2277 | 4918 | 29352 | 319482 | 793194 |
| Num. of Edges | 13264 | 12431 | 108365 | 2315598 | 65019 | 104753 | 1415646 | 8689910 | 22368070 |
| Max Degree | 169 | 100 | 172 | 13161 | 733 | 277 | 1721 | 4925 | 12452 |
| Mean Degree | 4.90 | 3.74 | 5.50 | 13.67 | 28.55 | 21.30 | 48.23 | 27.20 | 28.19 |
| Median Degree | 4 | 3 | 3 | 6 | 13 | 10 | 21 | 15 | 14 |
| Isolated Nodes % | 3% | 3% | 3% | 3% | 3% | 6% | 5% | 5% | 6% |

Table 1: The statistics of the datasets selected for evaluation.

5.2 FCR EVALUATION

In Table 2, the top part presents the FCR results together with the homophily metric $\beta$ from (Pei et al., 2020) (Equation 6). The bottom part shows the prediction-accuracy gaps of the GNN between the head and the tail/isolation nodes. As can be seen from the table, FCR differs among datasets and is negatively correlated with the homophily metric (Pearson correlation coefficient $-0.76$). The high absolute correlation and its negative sign indicate that the more similar nodes are to their neighborhoods, the more difficult it is to generalize with MLP-based models. FCR is thus an indicator of the tail generalization difficulty. Evaluations on more datasets (including datasets where FCR > 100%) are presented in Appendix C.

$$\beta(G) = \frac{1}{|V|} \sum_{v \in V} \frac{\text{the number of } v\text{'s direct neighbors that have the same label as } v}{\text{the number of } v\text{'s directly connected neighbors}} \times 100\% \qquad (6)$$

| | Cora | Citeseer | Pubmed | Arxiv | Chameleon |
|---|---|---|---|---|---|
| GNN | 86.96 | 72.44 | 75.96 | 71.54 | 68.51 |
| MLP | 69.02 | 56.59 | 73.51 | 54.89 | 58.65 |
| Label Propagation | 78.18 | 45.00 | 67.8 | 68.26 | 41.01 |
| FCR % | 32.86% | 63.39% | 76.91% | 16.45% | 73.61% |
| β(G) % | 83% | 71% | 79% | 68% | 25% |
| head − tail (GNN) | 4.44 | 23.98 | 11.71 | 5.9 | 0.24 |
| head − isolation (GNN) | 31.01 | 33.09 | 15.21 | 28.81 | 1.55 |

Table 2: Top part: FCR and its components; the β metric is added as a reference. Bottom part: the performance difference of the GNN on the head/tail and head/isolation splits, where "tail/isolation" means the 10% least connected nodes and the isolated nodes in the graph.
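A small sketch of the homophily metric in Equation (6), under our reading that nodes without neighbors are skipped in the average; the edge-list input format and function name are illustrative.

```python
import numpy as np

def homophily_beta(edges, labels):
    """beta(G) from Equation (6): for each node, the fraction of its neighbors
    sharing its label, averaged over nodes with at least one neighbor."""
    labels = np.asarray(labels)
    n = len(labels)
    same = np.zeros(n)
    deg = np.zeros(n)
    for u, v in edges:                        # count each undirected edge in both directions
        for a, b in ((u, v), (v, u)):
            deg[a] += 1
            same[a] += labels[a] == labels[b]
    mask = deg > 0
    return (same[mask] / deg[mask]).mean() * 100.0

# toy usage: a triangle where two nodes share a label -> beta ~ 33.3%
print(homophily_beta([(0, 1), (1, 2), (0, 2)], [0, 0, 1]))
```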
5.3 EXPERIMENTAL RESULTS ON TAIL GENERALIZATION WITH COLD BREW

In Table 3, we present the performance of Cold Brew together with baselines on the tail and isolation splits of several datasets. All models in the table are trained on the training data and evaluated on the tail or isolation splits discussed in Section 5.1. In Table 3, GCN refers to the best configuration found through the FCR-guided grid search (see Appendix B for details) without Structural Embedding; correspondingly, GCN + SE refers to the best FCR-guided configuration with Structural Embedding, which is the default teacher model of Cold Brew. GraphSAGE refers to (Hamilton et al., 2017), Simple MLP refers to a node-wise MLP with two hidden layers of 128 hidden dimensions, and GraphMLP refers to (Hu et al., 2021). The results for the e-commerce datasets are presented as relative improvements over the baseline (each value is the difference with respect to the value of GCN 2 layers on the same dataset and split); we do not disclose the absolute numbers for proprietary reasons.

As shown in Table 3, Cold Brew's student MLP improves accuracy on isolated nodes by up to +11% on the e-commerce datasets and +2.4% on the open-source datasets. Cold Brew's student model handles isolated nodes better, and the teacher GNN also achieves better performance on the tail split than all other models. Compared with GraphMLP in particular, Cold Brew's student MLP consistently performs better. This can be explained by their different mechanisms: GraphMLP encodes graph knowledge implicitly in the learned weights, while Cold Brew explicitly attends to neighborhoods even when they are absent. More detailed comparisons can be found in Appendix C.

| Split | Group | Model | Cora | Citeseer | Pubmed | Arxiv | Chameleon | E-comm1 | E-comm2 | E-comm3 | E-comm4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Isolation | GNNs | GCN 2 layers | 58.02 | 47.09 | 71.50 | 44.51 | 57.28 | − | − | − | − |
| Isolation | GNNs | GraphSAGE | 66.02 | 51.46 | 69.87 | 47.32 | 59.83 | +3.89 | +4.81 | +5.24 | +0.52 |
| Isolation | MLPs | Simple MLP | 68.40 | 53.26 | 65.84 | 51.03 | 60.76 | +5.89 | +9.85 | +5.83 | +6.42 |
| Isolation | MLPs | GraphMLP | 65.00 | 52.82 | 71.22 | 51.10 | 63.54 | +6.27 | +9.46 | +5.99 | +7.37 |
| Isolation | Cold Brew | GCN + SE 2 layers | 58.37 | 47.78 | 73.85 | 45.20 | 60.13 | +0.27 | +0.76 | -0.50 | +1.22 |
| Isolation | Cold Brew | Student MLP | 69.62 | 53.17 | 72.33 | 52.36 | 62.28 | +7.56 | +11.09 | +5.64 | +9.05 |
| Tail | GNNs | GCN 2 layers | 84.54 | 56.51 | 74.95 | 67.74 | 58.33 | − | − | − | − |
| Tail | GNNs | GraphSAGE | 82.82 | 52.77 | 73.07 | 63.23 | 61.26 | -3.82 | -3.07 | -2.87 | -6.42 |
| Tail | MLPs | Simple MLP | 70.76 | 54.85 | 67.21 | 52.14 | 50.12 | -0.37 | +1.74 | -0.13 | -0.45 |
| Tail | MLPs | GraphMLP | 70.09 | 55.56 | 71.45 | 52.40 | 52.84 | -0.33 | +1.64 | +1.27 | +0.80 |
| Tail | Cold Brew | GCN + SE 2 layers | 84.66 | 56.32 | 75.33 | 68.11 | 60.80 | +0.85 | +0.44 | -0.60 | +1.10 |
| Tail | Cold Brew | Student MLP | 71.80 | 54.88 | 72.54 | 53.24 | 51.36 | +0.32 | +3.09 | -0.18 | +2.09 |

Table 3: The performance comparisons on the isolation and tail splits of different datasets (Cora through Chameleon are open-source; E-comm1 through E-comm4 are proprietary). The full comparisons on the head/tail/isolation/overall data are in Appendix C. GCN + SE 2 layers is Cold Brew's teacher model. Cold Brew outperforms the GNN and other MLP baselines, achieving the best performance on the isolation splits as well as some tail splits.

We also evaluated link prediction performance by replacing the node classification loss with a link prediction loss. On the manually created isolation split, the model is asked to recover the ground-truth edges that were manually removed. The results are shown in Table 4.

| Split | Model | Cora | Citeseer | Pubmed | E-comm1 |
|---|---|---|---|---|---|
| Isolation | GCN 2 layers | 34.10 | 50.41 | 51.52 | − |
| Isolation | TailGCN | 36.13 | 51.48 | 51.19 | +2.18 |
| Isolation | Meta-Tail2Vec | 36.92 | 50.90 | 51.62 | +2.34 |
| Isolation | Cold Brew's MLP | 44.59 | 55.14 | 54.82 | +5.39 |

Table 4: Link prediction Mean Reciprocal Ranks (MRR) on the isolation data. Note that Cold Brew outperforms baselines specifically built for generalizing to the tail.

| Split | Model | Cora | Citeseer | Pubmed | E-comm1 |
|---|---|---|---|---|---|
| Isolation | GCN 2 layers | 58.02 | 47.09 | 71.50 | − |
| Isolation | TailGCN | 62.04 | 51.87 | 72.10 | +3.14 |
| Isolation | Meta-Tail2Vec | 61.16 | 50.46 | 71.80 | +2.80 |
| Isolation | Cold Brew's MLP | 69.62 | 53.17 | 72.33 | +7.56 |

Table 5: Node classification accuracies against other baselines specifically created to generalize to the tail. Cold Brew outperforms these methods when edge data is absent from the graph.
The baseline models in Tables 4 and 5 are TailGCN (Vetter et al., 1991) and Meta-Tail2Vec (Liu et al., 2020b); a comparison against these models on node classification on the isolation split is provided in Table 5. As observed from Tables 4 and 5, Cold Brew outperforms TailGCN and Meta-Tail2Vec on the isolation split: neither baseline is a zero-shot method, both require explicit neighborhood nodes, and hence their performance degrades when the neighborhood is empty and padded with zero vectors. The full performance on the other splits is listed in Table 10 in the appendix for reference.

The results across all splits in Table 10 provide evidence for a few phenomena: for example, a high FCR means that the graph structure does not add much information for the task at hand, and GNN-type models tend to perform better on the head while MLP-type models tend to perform better on the tail/isolation splits. On the other hand, the proposed Structural Embeddings show a potential to alleviate the over-smoothness (Oono & Suzuki, 2020; Li et al., 2018; NT & Maehara, 2019) and bottleneck (Alon & Yahav, 2020) issues observed in deep GCN models. As shown in Table 6, Cold Brew's GCN (GCN + SE) significantly outperforms the traditional GCN at 64 layers: the former achieves 34% higher test accuracy on Cora, 23% higher on Citeseer, and comparable accuracy on the others.

Finally, the improvement on the isolation and tail splits (especially the isolation split) comes at a cost: we observed a performance drop for the student MLP model on the head splits and on several datasets' tail splits, compared with the naive GCN model. However, Cold Brew specifically targets the challenging strict cold-start setting, for which it offers a compelling new alternative; in non-cold-start cases, traditional GCN models can still be used to obtain good performance. Note that even on the head splits, the proposed GNN teacher model of Cold Brew still outperforms traditional GNN models. We hence consider it promising future work to adaptively switch between the Cold Brew teacher and student models based on the connectivity degree of the current node.

| Split | Model | Cora | Citeseer | Pubmed | Arxiv | Chameleon | E-comm1 | E-comm2 | E-comm3 | E-comm4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Overall | GCN 64 layers | 40.04 | 23.66 | 75.65 | 65.53 | 58.14 | -5.49 | -6.59 | -6.13 | -3.57 |
| Overall | GCN + SE 64 layers | 74.23 | 46.80 | 78.12 | 69.28 | 59.88 | -1.71 | -2.92 | -3.29 | -0.06 |
| Head | GCN 64 layers | 46.46 | 49.84 | 85.89 | 67.53 | 67.16 | -5.60 | -6.24 | -6.05 | -3.16 |
| Head | GCN + SE 64 layers | 87.38 | 71.18 | 86.81 | 71.35 | 69.63 | -1.78 | -2.17 | -2.79 | -0.35 |
| Tail | GCN 64 layers | 45.14 | 24.42 | 71.89 | 63.91 | 56.48 | -3.85 | -3.62 | -3.84 | -1.14 |
| Tail | GCN + SE 64 layers | 79.56 | 36.52 | 74.88 | 65.19 | 61.73 | -2.42 | -2.52 | -3.68 | -1.23 |
| Isolation | GCN 64 layers | 39.97 | 22.12 | 68.57 | 40.03 | 57.60 | -4.66 | -4.63 | -4.93 | -1.89 |
| Isolation | GCN + SE 64 layers | 40.33 | 24.53 | 71.22 | 41.18 | 60.13 | -3.08 | -3.02 | -4.00 | -2.32 |

Table 6: Comparisons of Cold Brew's GCN and the traditional GCN with deep layers (Cora through Chameleon are open-source; E-comm1 through E-comm4 are proprietary). When the number of layers is large, Cold Brew's GCN retains good performance while the traditional GCN without SE suffers from over-smoothness and degrades. Even with shallow layers, Cold Brew's GCN is better than the traditional GCN.

6 CONCLUSION

In this paper, we studied the problem of generalizing GNNs to tail and strict cold-start nodes, whose neighborhood information is either sparse/noisy or completely missing. We proposed a teacher-student knowledge distillation procedure to better generalize to isolated nodes.
We added an independent set of structural embeddings in GNN layers to alleviate node over-smoothness, and also proposed a virtual neighbor discovery step for the student model to attend to latent neighborhoods. We additionally present the FCR metric to quantify the difficulty of truly inductive representation learning and to optimize our model architecture design. Experiments demonstrated the consistently superior performance of our proposed framework on both public benchmarks and proprietary datasets. 9 Published as a conference paper at ICLR 2022 REFERENCES Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020. Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. arXiv preprint arXiv:2007.02133, 2020a. Yuzhao Chen, Yatao Bian, Xi Xiao, Yu Rong, Tingyang Xu, and Junzhou Huang. On self-distilling graph neural network. arXiv preprint arXiv:2011.02255, 2020b. Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. Zero-shot recommender systems. arXiv preprint arXiv:2105.08318, 2021. Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018. William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017. Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. Pre-training graph neural net- works for cold-start users and items representation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 265–273, 2021. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. Yang Hu, Haoxuan You, Zhecan Wang, Zhicheng Wang, Erjin Zhou, and Yue Gao. Graph-mlp: Node classification without message passing in graph. arXiv preprint arXiv:2106.04051, 2021. Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. Combining label propa- gation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993, 2020a. Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. Combining label propa- gation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993, 2020b. Wenbing Huang, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. Tackling over-smoothing for general graph convolutional networks. arXiv e-prints, pp. arXiv–2008, 2020c. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015. Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi- supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5070–5079, 2019. Mahdi Kherad and Amir Jalaly Bidgoly. Recommendation system using a deep learning and graph analysis approach. arXiv preprint arXiv:2004.08100, 2020. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. Johannes Klicpera, Aleksandar Bojchevski, and Stephan G¨ unnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018. Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. Addressing cold-start problem in recommendation systems. 
In Proceedings of the 2nd international conference on Ubiquitous information management and communication, pp. 208–211, 2008. Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276, 2019a. 10 Published as a conference paper at ICLR 2022 Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. arXiv preprint arXiv:2006.07739, 2020. Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, Yang Yang, and Zi Huang. From zero-shot learning to cold-start recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4189–4196, 2019b. Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Meng Liu, Zhengyang Wang, and Shuiwang Ji. Non-local graph neural networks. arXiv preprint arXiv:2005.14612, 2020a. Zemin Liu, Wentao Zhang, Yuan Fang, Xinming Zhang, and Steven CH Hoi. Towards locality-aware meta-learning of tail node embeddings on networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 975–984, 2020b. Yuanfu Lu, Yuan Fang, and Chuan Shi. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1563–1573, 2020. Sitao Luan, Mingde Zhao, Xiao-Wen Chang, and Doina Precup. Break the ceiling: Stronger multi- scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174, 2019. Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020. Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287, 2020. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014. Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convo- lutional networks on node classification. In International Conference on Learning Representations. https://openreview. net/forum, 2020. Shakila Shaikh, Sheetal Rathi, and Prachi Janrao. Recommendation system in e-commerce websites: A graph based approached. In 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 931–934. IEEE, 2017. Nitai B Silva, Ren Tsang, George DC Cavalcanti, and Jyh Tsang. A graph-based friend recommen- dation system using genetic algorithm. In IEEE congress on evolutionary computation, pp. 1–7. IEEE, 2010. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017. 
Petar Veliˇ ckovi´ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. T Vetter, A Engel, T Biro, and U Mosel. η production in nucleon-nucleon collisions. Physics Letters B, 263(2):153–156, 1991. Hongwei Wang and Jure Leskovec. Unifying graph convolutional neural networks and label propaga- tion. arXiv preprint arXiv:2002.06755, 2020. 11 Published as a conference paper at ICLR 2022 Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Dissecting the diffusion process in linear graph convolutional networks. arXiv preprint arXiv:2102.10739, 2021. Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Sim- plifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. PMLR, 2019. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018a. Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. PMLR, 2018b. Chaoqi Yang, Ruijie Wang, Shuochao Yao, Shengzhong Liu, and Tarek Abdelzaher. Revisiting ”over-smoothing” in deep gcns. arXiv preprint arXiv:2003.13663, 2020. Cheng Yang, Jiawei Liu, and Chuan Shi. Extract the knowledge of graph neural networks and go beyond it: An effective knowledge distillation framework. In Proceedings of the Web Conference 2021, pp. 1227–1237, 2021. Hongwei Zhang, Tijin Yan, Zenjun Xie, Yuanqing Xia, and Yuan Zhang. Revisiting graph convolu- tional network on semi-supervised node classification from an optimization perspective. arXiv preprint arXiv:2009.11469, 2020. Jiani Zhang, Xingjian Shi, Shenglin Zhao, and Irwin King. Star-gcn: Stacked and reconstructed graph convolutional networks for recommender systems. arXiv preprint arXiv:1905.13129, 2019. Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. Graph-less neural networks: Teaching old mlps new tricks via distillation. arXiv preprint arXiv:2110.08727, 2021. Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223, 2019. Kaixiong Zhou, Xiao Huang, Yuening Li, Daochen Zha, Rui Chen, and Xia Hu. Towards deeper graph neural networks with differentiable group normalization. Advances in Neural Information Processing Systems, 33, 2020a. Kuangqi Zhou, Yanfei Dong, Kaixin Wang, Wee Sun Lee, Bryan Hooi, Huan Xu, and Jiashi Feng. Understanding and resolving performance degradation in graph convolutional networks. arXiv preprint arXiv:2006.07107, 2020b. Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer- dependent importance sampling for training deep and large graph convolutional networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Gar- nett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Asso- ciates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 91ba4a4478a66bee9812b0804b6f9d1b-Paper.pdf. A MORE ILLUSTRATIONS The more detailed inference procedures for GNN and Cold Brew are illustrated in Figure 3. 
Figure 3: Inference procedure illustration for GNN and Cold Brew. The GNN maps the node feature and graph structure to an embedding; Cold Brew's MLP #1 maps the node feature to an embedding, a middle step discovers the virtual neighborhood based on embedding similarity, and MLP #2 maps the node feature and virtual neighborhood to the final embedding.

B SEARCH SPACE DETAILS

In computing FCR, we search over a space of model hyperparameters for the GNN, MLP, and LP modules in order to find the configurations best suited for distillation. For the GNN model, we take GCN as the backbone and perform a grid search over the number of hidden layers, whether it uses the structural embedding, the type of residual connection, and the type of normalization. For the number of hidden layers, we consider 2, 4, 8, 16, 32, and 64. For the types of residual connections, we include: (1) connection to the last (preceding) layer (Li et al., 2019a; 2018), (2) initial connection to the initial layer (Chen et al., 2020a; Klicpera et al., 2018; Zhang et al., 2020), (3) dense connection to all preceding layers (Li et al., 2019a; 2018; 2020; Luan et al., 2019), and (4) jumping connection that combines all preceding layers only at the final graph convolutional layer (Xu et al., 2018b; Liu et al., 2020b). For the types of normalization, we grid-search over batch normalization (BatchNorm) (Ioffe & Szegedy, 2015), pair normalization (PairNorm) (Zhao & Akoglu, 2019), node normalization (NodeNorm) (Zhou et al., 2020b), mean normalization (MeanNorm) (Yang et al., 2020), and differentiable group normalization (GroupNorm) (Zhou et al., 2020a). For the types of graph dropout methods, we include Dropout (Srivastava et al., 2014), DropEdge (Rong et al., 2020), DropNode (Huang et al., 2020c), and LADIES (Zou et al., 2019).

For the architecture of Cold Brew's MLP, we conduct a hyperparameter search over the number of hidden layers, the existence of a residual connection, the hidden dimensions, and the optimizer. The number of hidden layers is searched over 2, 8, 16, and 32; the number of hidden dimensions over 128 and 256; and the optimizer over Adam(lr=0.001), Adam(lr=0.005), Adam(lr=0.02), and SGD(lr=0.005).

For Label Propagation, we conduct a hyperparameter search over the number of propagations, the propagation matrix type, and the mixing coefficient α (Huang et al., 2020a). The number of propagations is searched over 10, 20, 50, 100, and 200; the propagation matrix type over the adjacency matrix and the normalized Laplacian matrix; and the mixing coefficient α over 0.01, 0.1, 0.5, 0.9, and 0.99. The best GCN, MLP, and LP configurations are reported in Tables 7, 8, and 9, respectively.
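A minimal sketch of the exhaustive grid search described above; the search-space keys and the `train_and_eval` callback are hypothetical stand-ins, not the paper's actual configuration names.

```python
from itertools import product

# Hypothetical GCN search space mirroring the grid described above.
gcn_space = {
    "num_layers": [2, 4, 8, 16, 32, 64],
    "structural_embedding": [True, False],
    "residual": ["none", "last", "initial", "dense", "jumping"],
    "norm": ["batch", "pair", "node", "mean", "group"],
}

def grid_search(space, train_and_eval):
    """Evaluate every configuration and keep the one with the best validation score.
    `train_and_eval(config) -> float` is assumed to be supplied by the caller."""
    best_cfg, best_score = None, float("-inf")
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```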
| Dataset | Best GCN: num. layers | Structural embedding | Residual type | Normalization type |
|---|---|---|---|---|
| Cora | 2 layers | has structural embedding | no residual | PairNorm |
| Citeseer | 2 layers | has structural embedding | no residual | PairNorm |
| Pubmed | 16 layers | has structural embedding | initial connection | GroupNorm |
| Arxiv | 4 layers | has structural embedding | initial connection | GroupNorm |
| Chameleon | 2 layers | has structural embedding | initial connection | BatchNorm |

Table 7: Best GCN configurations.

| Dataset | Best MLP: hidden layers | Residual connection | Hidden dimensions | Optimizer |
|---|---|---|---|---|
| Cora | 2 layers | no residual | 128 | Adam(lr=0.001) |
| Citeseer | 4 layers | no residual | 128 | Adam(lr=0.001) |
| Pubmed | 2 layers | no residual | 256 | Adam(lr=0.02) |
| Arxiv | 2 layers | no residual | 256 | Adam(lr=0.001) |
| Chameleon | 2 layers | no residual | 256 | Adam(lr=0.001) |

Table 8: Best MLP configurations.

| Dataset | Best LP: number of propagations | Propagation matrix type | Mixing coefficient |
|---|---|---|---|
| Cora | 50 | Laplacian matrix | 0.1 |
| Citeseer | 100 | Laplacian matrix | 0.01 |
| Pubmed | 50 | Adjacency matrix | 0.5 |
| Arxiv | 100 | Laplacian matrix | 0.5 |
| Chameleon | 50 | Laplacian matrix | 0.1 |

Table 9: Best Label Propagation configurations.

C THE PERFORMANCE ON ALL SPLITS OF THE DATA

The performance evaluations on all splits are listed in Table 10, and the FCR evaluations on more datasets are presented in Table 11. We hypothesize that a high FCR means that the graph does not add much information for the task at hand. We indeed see evidence for this hypothesis in Table 10, where for the Pubmed dataset (FCR ≈ 77%) the MLP-type models tend to outperform the GNN-type models in all splits. On the other hand, regardless of FCR, for almost all datasets the MLP-type models outperform the GNN-type models on the isolation split (and a few tail splits), while the GNN-type models are superior on the other splits.

| Split | Group | Model | Cora | Citeseer | Pubmed | Arxiv | Chameleon | E-comm1 | E-comm2 | E-comm3 | E-comm4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Isolation | GNNs | GCN 2 layers | 58.02 | 47.09 | 71.50 | 44.51 | 57.28 | − | − | − | − |
| Isolation | GNNs | GraphSAGE | 66.02 | 51.46 | 69.87 | 47.32 | 59.83 | +3.89 | +4.81 | +5.24 | +0.52 |
| Isolation | MLPs | Simple MLP | 68.40 | 53.26 | 65.84 | 51.03 | 60.76 | +5.89 | +9.85 | +5.83 | +6.42 |
| Isolation | MLPs | GraphMLP | 65.00 | 52.82 | 71.22 | 51.10 | 63.54 | +6.27 | +9.46 | +5.99 | +7.37 |
| Isolation | Cold Brew | GCN + SE 2 layers | 58.37 | 47.78 | 73.85 | 45.20 | 60.13 | +0.27 | +0.76 | -0.50 | +1.22 |
| Isolation | Cold Brew | Student MLP | 69.62 | 53.17 | 72.33 | 52.36 | 62.28 | +7.56 | +11.09 | +5.64 | +9.05 |
| Tail | GNNs | GCN 2 layers | 84.54 | 56.51 | 74.95 | 67.74 | 58.33 | − | − | − | − |
| Tail | GNNs | GraphSAGE | 82.82 | 52.77 | 73.07 | 63.23 | 61.26 | -3.82 | -3.07 | -2.87 | -6.42 |
| Tail | MLPs | Simple MLP | 70.76 | 54.85 | 67.21 | 52.14 | 50.12 | -0.37 | +1.74 | -0.13 | -0.45 |
| Tail | MLPs | GraphMLP | 70.09 | 55.56 | 71.45 | 52.40 | 52.84 | -0.33 | +1.64 | +1.27 | +0.80 |
| Tail | Cold Brew | GCN + SE 2 layers | 84.66 | 56.32 | 75.33 | 68.11 | 60.80 | +0.85 | +0.44 | -0.60 | +1.10 |
| Tail | Cold Brew | Student MLP | 71.80 | 54.88 | 72.54 | 53.24 | 51.36 | +0.32 | +3.09 | -0.18 | +2.09 |
| Head | GNNs | GCN 2 layers | 88.68 | 80.37 | 85.79 | 73.35 | 67.49 | − | − | − | − |
| Head | GNNs | GraphSAGE | 87.75 | 74.81 | 86.94 | 70.85 | 62.08 | -4.26 | -4.17 | -3.50 | -7.46 |
| Head | MLPs | Simple MLP | 74.33 | 72.00 | 89.00 | 56.34 | 60.82 | -16.74 | -18.10 | -16.73 | -16.51 |
| Head | MLPs | GraphMLP | 72.45 | 69.83 | 89.00 | 56.65 | 62.44 | -15.96 | -18.08 | -15.33 | -15.41 |
| Head | Cold Brew | GCN + SE 2 layers | 89.39 | 80.76 | 87.83 | 74.01 | 70.56 | +1.11 | +0.47 | -0.39 | +1.28 |
| Head | Cold Brew | Student MLP | 74.53 | 72.33 | 90.33 | 57.41 | 61.28 | -15.28 | -17.42 | -17.02 | -15.41 |
| Overall | GNNs | GCN 2 layers | 84.89 | 70.38 | 78.18 | 71.50 | 59.30 | − | − | − | − |
| Overall | GNNs | GraphSAGE | 80.90 | 66.21 | 76.73 | 68.33 | 70.02 | -3.09 | -3.86 | -2.58 | -5.48 |
| Overall | MLPs | Simple MLP | 69.02 | 56.59 | 73.51 | 54.89 | 58.65 | -12.69 | -12.86 | -12.68 | -13.16 |
| Overall | MLPs | GraphMLP | 71.87 | 68.22 | 82.03 | 53.81 | 57.67 | -12.26 | -12.01 | -10.80 | -11.41 |
| Overall | Cold Brew | GCN + SE 2 layers | 86.96 | 72.44 | 79.03 | 71.92 | 68.51 | +0.65 | -0.24 | -0.77 | +1.43 |
| Overall | Cold Brew | Student MLP | 72.36 | 67.54 | 82.00 | 54.94 | 59.07 | -11.25 | -11.51 | -11.55 | -11.21 |

Table 10: The performance comparisons on all splits of the different datasets (Cora through Chameleon are open-source; E-comm1 through E-comm4 are proprietary).

D VISUALIZING THE LEARNED EMBEDDINGS

Figure 4 visualizes the last-layer embeddings of the different models after t-SNE dimensionality reduction. In the figure, colors denote node labels and all nodes are marked as dots, with the isolation subset additionally marked with x's and the tail subset additionally marked with triangles. Although the GCN model does a decent job of separating the different classes, a significant portion of the tail and isolation nodes fall into the wrong class clusters. Cold Brew's MLP is more discriminative on the tail and isolation splits.
| | Cora | Citeseer | Pubmed | Arxiv | Cham. | Squ. | Actor | Cornell | Texas | Wisconsin |
|---|---|---|---|---|---|---|---|---|---|---|
| GNN | 86.96 | 72.44 | 75.96 | 71.54 | 68.51 | 31.95 | 59.79 | 65.1 | 61.08 | 81.62 |
| MLP | 69.02 | 56.59 | 73.51 | 54.89 | 58.65 | 38.51 | 37.93 | 86.26 | 83.33 | 85.42 |
| Label Propagation | 78.18 | 45.00 | 67.8 | 68.26 | 41.01 | 22.85 | 29.69 | 32.06 | 52.08 | 40.62 |
| FCR % | 32.86% | 63.39% | 76.91% | 16.45% | 73.61% | 141.91% | 57.93% | 139.04% | 171.2% | 108.48% |
| β(G) % | 83% | 71% | 79% | 68% | 25% | 22% | 24% | 11% | 6% | 16% |
| head − tail (GNN) | 4.44 | 23.98 | 11.71 | 5.9 | 0.24 | -6.51 | 2.22 | -4.37 | -11.26 | -33.92 |
| head − isolation (GNN) | 31.01 | 33.09 | 15.21 | 28.81 | 1.55 | -4.85 | 22.61 | -18.68 | -24.62 | -29.23 |

Table 11: Top part: FCR and its components. The β metric is added as a reference. Bottom part: the performance difference of the GNN on the head/tail and head/isolation splits.

Figure 4: Top two subfigures: the last-layer embeddings of GCN and Simple MLP. Bottom two subfigures: the last-layer embeddings of GraphMLP and Cold Brew's student MLP. All embeddings are projected to 2D with t-SNE. Cold Brew's MLP has the fewest isolated nodes misplaced into wrong clusters.