|
Under review as a conference paper at ICLR 2022 |
|
CLASSIFY AND GENERATE RECIPROCALLY: |
|
SIMULTANEOUS POSITIVE-UNLABELLED LEARNING |
|
AND CONDITIONAL GENERATION WITH EXTRA DATA |
|
Anonymous authors |
|
Paper under double-blind review |
|
ABSTRACT |
|
The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this paper, we address this problem by simultaneously leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performance. In particular, we present a novel training framework that jointly targets PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network (CGAN) that is robust to noisy labels, and 2) leveraging extra data with labels predicted by a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that can learn the clean data distribution from the noisy labels predicted by a PU classifier. Theoretically, we prove the optimality condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets, verifying simultaneous improvements in both classification and generation.
|
1   INTRODUCTION
|
Existing machine learning methods, particularly deep learning models, typically require big data to achieve remarkable performance. For instance, conditional deep generative models are able to generate high-fidelity and diverse images, but they have to rely on vast amounts of labeled data (Lucic et al., 2019). Nevertheless, it is often laborious or impractical to collect large-scale, accurately class-labeled data in real-world scenarios, and thus label scarcity is ubiquitous. Under such circumstances, the performance of classification and conditional generation (Mirza & Osindero, 2014) drops significantly (Lucic et al., 2019). At the same time, diverse unlabeled data are available in enormous quantities, so a key issue is how to take advantage of these extra data to enhance conditional generation or classification.
|
Within the unlabeled data, both in-distribution and out-of-distribution data exist, where in-distribution data conform to the distribution of the labeled data while out-of-distribution data do not. Our key insight is to harness the out-of-distribution data. For generation with extra data, most related works have focused on in-distribution data (Lucic et al., 2019; Gui et al., 2020; Donahue & Simonyan, 2019). When it comes to out-of-distribution data, the majority of existing methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempt to forcibly train generative models on a large amount of unlabeled data and then transfer the learned knowledge of the pre-trained generator to the in-distribution data. In classification, a common setting for utilizing unlabeled data is semi-supervised learning (Miyato et al., 2018; Sun et al., 2019; Berthelot et al., 2019), which usually assumes that the unlabeled and labeled data come from the same distribution, ignoring their distributional mismatch. In contrast, Positive and Unlabeled (PU) learning (Bekker & Davis, 2020; Kiryo et al., 2017) is an elegant way of handling this under-studied problem, where a model has access only to positive samples and unlabeled data. Therefore, it is possible to utilize pseudo labels predicted by a PU classifier on unlabeled data to guide the conditional generation. However, the predicted signals from the classifier tend to be noisy. Although there is a flurry of papers on learning from noisy labels for classification (Tsung Wei Tsai, 2019; Ge et al., 2020; Guo et al., 2019), to the best of our knowledge, no work has considered leveraging the noisy labels seamlessly in joint classification and generation. Additionally, another work (Hou et al., 2018) leveraged GANs to recover both the positive and negative data distributions to avoid overfitting, but it did not consider noise-invariant generation or the mutual improvement of the two tasks. Generative-discriminative complementary learning (Xu et al., 2019) was investigated in weakly supervised learning, but we make the first attempt to tackle the (Multi-) Positive and Unlabeled learning setting while developing a method for noise-invariant generation from noisy labels. Please refer to Section 5 for a discussion of further related works.
|
In this paper, we focus on the mutual benefits of conditional generation and PU classification when only a small amount of class-labeled data is accessible but extra unlabeled data, including out-of-distribution data, are available. Firstly, a parallel non-negative multi-class PU estimator is derived to classify both the positive data of all classes and the negative data. Then we design a Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) that is able to learn the clean data distribution on all unlabeled data with noisy labels provided by the PU classifier. Simultaneously, we also leverage our CNI-CGAN to enhance the performance of PU classification through data augmentation, demonstrating a reciprocal benefit for both generation and classification. We provide a theoretical analysis of the optimality condition of our CNI-CGAN and conduct extensive experiments to verify the superiority of our approach.
|
2   OUR METHOD
|
2.1   POSITIVE-UNLABELED LEARNING
|
Traditional Binary Positive-Unlabeled Problem Setting   Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ be the input and output variables, and let $p(x, y)$ be their joint distribution, with $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$ denoting the positive and negative class-conditional distributions. In particular, we denote $p(x)$ as the distribution of the unlabeled data. $n_p$, $n_n$ and $n_u$ are the numbers of positive, negative and unlabeled samples, respectively.
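To make this setting concrete, below is a minimal synthetic sketch (our own illustration, not part of the paper): labeled positives are drawn from $p_p$, while the unlabeled pool is drawn from the mixture $p(x) = \pi_p p_p(x) + \pi_n p_n(x)$ with class prior $\pi_p$. All densities, names, and sizes here are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, pi_p = 2, 0.4                      # dimension and class prior pi_p = P(Y = +1)
n_p, n_u = 500, 5000                  # numbers of positive / unlabeled samples

def sample_pp(n):                     # p_p(x) = p(x | Y = +1), an illustrative Gaussian
    return rng.normal(loc=+2.0, scale=1.0, size=(n, d))

def sample_pn(n):                     # p_n(x) = p(x | Y = -1), an illustrative Gaussian
    return rng.normal(loc=-2.0, scale=1.0, size=(n, d))

# Labeled positives: drawn from p_p only.
x_p = sample_pp(n_p)

# Unlabeled pool: drawn from the mixture p(x) = pi_p * p_p(x) + pi_n * p_n(x),
# with the true labels discarded afterwards.
is_pos = rng.random(n_u) < pi_p
x_u = np.where(is_pos[:, None], sample_pp(n_u), sample_pn(n_u))
```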
|
Parallel Non-Negative PU Estimator   Vanilla PU learning (Bekker & Davis, 2020; Kiryo et al., 2017; Du Plessis et al., 2014; 2015) employs an unbiased and consistent estimator. Denote $g_\theta : \mathbb{R}^d \rightarrow \mathbb{R}$ as the score function parameterized by $\theta$, and $\ell : \mathbb{R} \times \{\pm 1\} \rightarrow \mathbb{R}$ as the loss function. The risk of $g_\theta$ can be approximated by its empirical version, denoted $\widehat{R}_{pn}(g_\theta)$:
$$\widehat{R}_{pn}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) + \pi_n \widehat{R}_n^{-}(g_\theta), \qquad (1)$$
where $\pi_p$ represents the class prior probability, i.e., $\pi_p = P(Y = +1)$ with $\pi_p + \pi_n = 1$. In addition, $\widehat{R}_p^{+}(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), +1)$ and $\widehat{R}_n^{-}(g_\theta) = \frac{1}{n_n}\sum_{i=1}^{n_n} \ell(g_\theta(x_i^n), -1)$.
|
As negative data $x^n$ are unavailable, a common strategy is to offset $R_n^{-}(g_\theta)$. Since $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, we have $\pi_n \widehat{R}_n^{-}(g_\theta) = \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)$. The resulting unbiased risk estimator $\widehat{R}_{pu}(g_\theta)$ can then be formulated as:
$$\widehat{R}_{pu}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta) + \widehat{R}_u^{-}(g_\theta), \qquad (2)$$
where $\widehat{R}_p^{-}(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), -1)$ and $\widehat{R}_u^{-}(g_\theta) = \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(g_\theta(x_i^u), -1)$. The advantage of this unbiased risk minimizer is that the optimal solution can be easily obtained if $g$ is linear in $\theta$. However, in real scenarios we tend to leverage more flexible models $g_\theta$, e.g., deep neural networks, and this strategy pushes the estimator to a point where it starts to suffer from overfitting. Hence, we utilize the non-negative risk (Kiryo et al., 2017) for our PU learning, which has been verified (Kiryo et al., 2017) to mitigate overfitting when training deep neural networks. The non-negative PU estimator is formulated as:
$$\widehat{R}_{pu}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) + \max\left\{0,\; \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)\right\}. \qquad (3)$$
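As a concrete illustration of Eq. (3), the snippet below sketches one possible PyTorch implementation of the non-negative PU risk for a single set of positive/unlabeled samples. The sigmoid surrogate loss, the function names, and the interface are our own assumptions, not the authors' released code.

```python
import torch

def sigmoid_loss(scores, target):
    # Surrogate loss ell(g(x), t) = sigmoid(-t * g(x)); other losses could be substituted.
    return torch.sigmoid(-target * scores)

def nn_pu_risk(g, x_p, x_u, pi_p):
    """Non-negative PU risk of Eq. (3) for positive batch x_p and unlabeled batch x_u."""
    r_p_plus  = sigmoid_loss(g(x_p), +1.0).mean()   # \hat{R}^+_p(g)
    r_p_minus = sigmoid_loss(g(x_p), -1.0).mean()   # \hat{R}^-_p(g)
    r_u_minus = sigmoid_loss(g(x_u), -1.0).mean()   # \hat{R}^-_u(g)
    # Clamp the estimated negative-class risk at zero before adding it (Eq. (3)).
    return pi_p * r_p_plus + torch.clamp(r_u_minus - pi_p * r_p_minus, min=0.0)
```

Here `g` would be any network whose final layer outputs a scalar score; the `min=0.0` clamp is precisely what distinguishes Eq. (3) from the unbiased estimator in Eq. (2).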
|
In pursuit of a parallel implementation of $\widehat{R}_{pu}(g_\theta)$, we replace $\max\left\{0,\; \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)\right\}$ with its lower bound $\frac{1}{N}\sum_{i=1}^{N} \max\left\{0,\; \widehat{R}_u^{-}(g_\theta; X_u^i) - \pi_p \widehat{R}_p^{-}(g_\theta; X_p^i)\right\}$, where $X_u^i$ and $X_p^i$ denote the unlabeled and positive data in the $i$-th mini-batch, and $N$ is the number of batches.
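A hedged sketch of this per-mini-batch variant follows, reusing the same assumed sigmoid surrogate loss as the previous snippet: the clamp is applied inside every mini-batch separately and the results are averaged, which lower-bounds the full-batch clamp and allows the batches to be processed in parallel.

```python
import torch

def sigmoid_loss(scores, target):
    return torch.sigmoid(-target * scores)

def nn_pu_risk_parallel(g, batches, pi_p):
    """Mini-batch form: (1/N) * sum_i [ pi_p R_p^+(X_p^i) + max{0, R_u^-(X_u^i) - pi_p R_p^-(X_p^i)} ]."""
    total = 0.0
    for x_p_i, x_u_i in batches:          # i-th mini-batch of positive / unlabeled data
        r_p_plus  = sigmoid_loss(g(x_p_i), +1.0).mean()
        r_p_minus = sigmoid_loss(g(x_p_i), -1.0).mean()
        r_u_minus = sigmoid_loss(g(x_u_i), -1.0).mean()
        # Per-batch clamp, then average over the N batches.
        total = total + pi_p * r_p_plus + torch.clamp(r_u_minus - pi_p * r_p_minus, min=0.0)
    return total / len(batches)
```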
|
|
From Binary PU to Multi-PU Learning   Previous PU learning focuses on learning a classifier from positive and unlabeled data and cannot easily be adapted to $K+1$ multi-class classification tasks, where $K$ represents the number of classes in the positive data. Multi-Positive and Unlabeled learning (Xu et al., 2017) has been developed, but the proposed algorithm may not allow deep neural networks. Instead, we extend binary PU learning to the multi-class version in a straightforward way by additionally incorporating a cross-entropy loss on all the positive data with labels for the different classes. More precisely, we consider the $K+1$-class classifier $f_\theta$ as a score function $f_\theta =$