Published as a conference paper at ICLR 2022

COLD BREW: DISTILLING GRAPH NODE REPRESENTATIONS WITH INCOMPLETE OR MISSING NEIGHBORHOODS
Wenqing Zheng, Edward W Huang, Nikhil Rao, Sumeet Katariya, Zhangyang Wang, Karthik Subbian
{wenqzhen,ewhuang,nikhilsr,katsumee,wzhangwa,ksubbian}@amazon.com
ABSTRACT

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in node classification, regression, and recommendation tasks. GNNs work well when rich and high-quality connections are available. However, their effectiveness is often jeopardized in many real-world graphs in which node degrees have power-law distributions. The extreme case of this situation, where a node may have no neighbors, is called Strict Cold Start (SCS). SCS forces the prediction to rely completely on the node’s own features. We propose Cold Brew, a teacher-student distillation approach to address the SCS and noisy-neighbor challenges for GNNs. We also introduce Feature Contribution Ratio (FCR), a metric that quantifies the behavior of inductive GNNs in solving SCS. We experimentally show that FCR disentangles the contributions of different graph data components and helps select the best architecture for SCS generalization. We further demonstrate the superior performance of Cold Brew on several public benchmark and proprietary e-commerce datasets, where many nodes have either very few or noisy connections. Our source code is available at https://github.com/amazon-research/gnn-tail-generalization.
1 INTRODUCTION
Figure 1: Top: Graph nodes may have a power-law (“long-tail”) connectivity distribution, with a large fraction of nodes (yellow) having few to no neighbors. Bottom: Long-tail node-degree distributions (normalized number of nodes vs. node degree) of real-world datasets (Cora, Citeseer, Pubmed, Arxiv, Chameleon, Actor, Squirrel, Wisconsin, Cornell, Texas), which cause modern GNNs to fail to generalize to the tail/cold-start nodes.
Graph Neural Networks (GNNs) achieve state-of-the-art results across a wide range of tasks such as graph classification, node classification, link prediction, and recommendation (Wu et al., 2020; Goyal & Ferrara, 2018; Kherad & Bidgoly, 2020; Shaikh et al., 2017; Silva et al., 2010; Zhang et al., 2019). Most modern GNNs rely on the principle of message passing to aggregate each node’s features from its (multi-hop) neighborhood (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017; Xu et al., 2018a; Wu et al., 2019; Klicpera et al., 2018). Therefore, the success of GNNs relies on the presence of dense and high-quality connections. Even inductive GNNs (Hamilton et al., 2017) learn a function of the node features and the node neighborhood, which requires the neighborhood to be present during inference.
A practical barrier to the widespread applicability of GNNs arises from the long-tail node-degree distribution found in many large-scale real-world graphs. Specifically, the node degree distribution is power-law in nature, with a majority of nodes having very few connections (Hao et al., 2021; Ding et al., 2021; Lam et al., 2008; Lu et al., 2020). Figure 1 (top) illustrates a long-tail distribution, accompanied by the statistics of several public datasets (bottom).
Many information retrieval and recommendation applications face the scenario of Strict Cold Start (SCS) (Li et al., 2019b; Ding et al., 2021), wherein some nodes have no connected edges at all. Predicting for these nodes is admittedly even more challenging than for the tail nodes in the graph. In these cases, existing GNNs fail to perform well due to the sparsity or absence of the neighborhood.
In this paper, we develop GNN models that have truly inductive capabilities: one can learn effective node embeddings for “orphaned” nodes in a graph. This capability is important to fully realize the potential of large-scale GNN models on modern, industry-scale datasets with very long tails and many orphaned nodes. To this end, we adopt the teacher-student knowledge distillation procedure (Yang et al., 2021; Chen et al., 2020b) and propose Cold Brew to distill the knowledge of a GNN teacher into a multilayer perceptron (MLP) student.
The Cold Brew framework addresses two key questions: (1) how can we efficiently distill the teacher’s knowledge for the sake of tail and cold-start generalization, and (2) how can a student make use of this knowledge? We answer these two questions by learning a latent node-wise embedding using knowledge distillation, which both avoids “over-smoothness” (Oono & Suzuki, 2020; Li et al., 2018; NT & Maehara, 2019) and discovers latent neighborhoods, which are missing for the SCS nodes. Note that in contrast to traditional knowledge distillation (Hinton et al., 2015), our aim is not to train a simpler student model to perform as well as the more complex teacher. Instead, we aim to train a student model that is better than the teacher in terms of generalizing to tail or SCS samples.
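To make the teacher-student setup concrete, the sketch below shows one simple way such a distillation could be wired up: an MLP student that sees only node features is trained to match precomputed teacher GNN embeddings. This is a minimal illustration under our own naming and a plain MSE objective; the actual Cold Brew losses, the Structural Embedding, and the latent-neighborhood discovery mechanism are defined later in the paper.

```python
import torch
import torch.nn as nn

class MLPStudent(nn.Module):
    """Graph-free student: maps raw node features to the teacher's embedding space."""
    def __init__(self, d_in, d_hidden, d_emb):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_emb),
        )

    def forward(self, x):  # x: [N, d_in]; no adjacency information is needed
        return self.net(x)

def distill_step(student, optimizer, x, teacher_emb):
    """One distillation step: regress the (precomputed) teacher GNN node
    embeddings from node features alone, using a simple MSE loss."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(x), teacher_emb)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because such a student never reads the adjacency structure, it can embed tail and strict cold-start nodes at inference time; what distinguishes Cold Brew from plain distillation is how the teacher’s embeddings are enriched and how the student reuses its predicted embedding to recover a latent neighborhood, as outlined in the contributions below.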
In addition, to help select cold-start-friendly model architectures, we develop a metric called Feature Contribution Ratio (FCR) that quantifies the contribution of node features relative to the adjacency structure of the dataset for a specific downstream task. FCR indicates the difficulty of generalizing to tail and cold-start nodes and guides our principled selection of both teacher and student model architectures in Cold Brew. We summarize our key contributions as follows:
• To generalize better to tail and SCS nodes, we design the Cold Brew knowledge distillation framework: we enhance the teacher GNN by appending a node-wise Structural Embedding (SE) to strengthen the teacher’s expressiveness, and design a novel mechanism for the MLP student to rediscover the missing “latent/virtual neighborhoods,” on which it can perform message passing.

• We propose Feature Contribution Ratio (FCR), which quantifies the difficulty of generalizing to tail and cold-start nodes. We leverage FCR in a principled “screening process” to select the best model architectures for both the GNN teacher and the MLP student.

• Because existing GNN studies evaluate only on the entire graph and do not explicitly evaluate on head/tail/SCS nodes, we uncover the hidden differences among head, tail, and SCS nodes by creating bespoke train/test splits. Extensive experiments on public and proprietary e-commerce graph datasets validate the effectiveness of Cold Brew in tail and cold-start generalization.
1.1 PROBLEM SETUP
GNNs effectively learn node representations using two components in graph data: they process node features through distributed node-wise transformations and process adjacency structure through localized neighborhood aggregations. For the first component, GNNs apply shared feature transformations to all nodes regardless of the neighborhoods. For the second component, GNNs use permutation-invariant aggregators to collect neighborhood information.
We take the node classification problem in the sequel for the sake of simplicity. All our proposed methods can be easily adapted to other semi-supervised or unsupervised problem settings, which we show in Section 5. We denote the graph data of interest by $G$ with node set $V$, $|V| = N$. Each node possesses a $d_{in}$-dimensional feature and a $d_{out}$-dimensional label (either $d_{out}$ classes or a continuous vector in the case of regression). Let $X^0 \in \mathbb{R}^{N \times d_{in}}$ and $Y \in \mathbb{R}^{N \times d_{out}}$ be the matrices of node features and labels, respectively. Let $\mathcal{N}_i$ be the neighborhood of the $i$-th node, $0 \le i < N$. In large-scale graphs, $|\mathcal{N}_i|$ is often small for a (possibly substantial) portion of nodes. We refer to these nodes as tail nodes. Some nodes may have $|\mathcal{N}_i| = 0$, and we refer to these extreme cold-start cases as isolated nodes.
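As a purely illustrative way to operationalize this head/tail/isolated distinction, one can bucket nodes by $|\mathcal{N}_i|$; the sketch below uses an arbitrary threshold of our own choosing, not necessarily the cutoff used in the experiments later in the paper.

```python
import numpy as np

def split_by_degree(neighborhoods, tail_threshold=2):
    """Partition node indices into head / tail / isolated sets by |N_i|.
    neighborhoods: length-N list, where neighborhoods[i] lists node i's neighbors.
    tail_threshold: illustrative cutoff separating tail nodes from head nodes."""
    degrees = np.array([len(nbrs) for nbrs in neighborhoods])
    isolated = np.where(degrees == 0)[0]                       # strict cold start
    tail = np.where((degrees > 0) & (degrees <= tail_threshold))[0]
    head = np.where(degrees > tail_threshold)[0]
    return head, tail, isolated
```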
A classical GNN learns representations for the $i$-th node at the $l$-th layer as a function of its representation and its neighborhood’s representations at the $(l-1)$-th layer:
$$x_i^{l} := f\left(x_i^{l-1}, \left\{x_j^{l-1} : j \in \mathcal{N}_i\right\}\right)$$
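As one concrete instantiation of $f$ (a sketch only: mean aggregation with a shared linear transform, roughly in the spirit of GraphSAGE-style aggregators (Hamilton et al., 2017), rather than the exact update of any particular model studied here), a single layer can be written as:

```python
import numpy as np

def gnn_layer(X_prev, neighborhoods, W, act=np.tanh):
    """One generic message-passing layer: x_i^l = f(x_i^{l-1}, {x_j^{l-1} : j in N_i}).
    X_prev: [N, d] node representations at layer l-1; neighborhoods: list of
    neighbor-index lists; W: [2d, d_out] shared weight matrix applied to all nodes."""
    N, d = X_prev.shape
    X_next = np.empty((N, W.shape[1]))
    for i in range(N):
        nbrs = neighborhoods[i]
        # mean of neighbor representations; fall back to the node itself when N_i is empty
        agg = X_prev[nbrs].mean(axis=0) if len(nbrs) > 0 else X_prev[i]
        X_next[i] = act(np.concatenate([X_prev[i], agg]) @ W)
    return X_next
```

For an isolated node, the aggregation term carries no information beyond the node itself, which is exactly the failure mode Cold Brew targets.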