|
Under review as a conference paper at ICLR 2022 |
|
CLASSIFY AND GENERATE RECIPROCALLY: |
|
SIMULTANEOUS POSITIVE-UNLABELLED LEARNING |
|
AND CONDITIONAL GENERATION WITH EXTRA DATA |
|
Anonymous authors |
|
Paper under double-blind review |
|
ABSTRACT |
|
The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this paper, we address this problem by simultaneously leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performance. In particular, we present a novel training framework that jointly targets PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network (CGAN) that is robust to noisy labels, and 2) leveraging extra data with labels predicted by a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that can learn the clean data distribution from the noisy labels predicted by a PU classifier. Theoretically, we prove the optimality condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets, verifying simultaneous improvements in both classification and generation.
|
1   INTRODUCTION
|
Existing machine learning methods, particularly deep learning models, typically require big data to achieve remarkable performance. For instance, conditional deep generative models are able to generate high-fidelity and diverse images, but they have to rely on vast amounts of labeled data (Lucic et al., 2019). Nevertheless, it is often laborious or impractical to collect large-scale, accurately class-labeled data in real-world scenarios, and thus label scarcity is ubiquitous. Under such circumstances, the performance of classification and conditional generation (Mirza & Osindero, 2014) drops significantly (Lucic et al., 2019). At the same time, diverse unlabeled data are available in enormous quantities, so a key issue is how to take advantage of these extra data to enhance conditional generation or classification.
|
Within the unlabeled data, both in-distribution and out-of-distribution data exist, where in-distribution data conform to the distribution of the labeled data while out-of-distribution data do not. Our key insight is to harness the out-of-distribution data. For generation with extra data, most related works have focused on in-distribution data (Lucic et al., 2019; Gui et al., 2020; Donahue & Simonyan, 2019). When it comes to out-of-distribution data, the majority of existing methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempt to forcibly train generative models on a large amount of unlabeled data and then transfer the learned knowledge of the pre-trained generator to the in-distribution data. In classification, a common setting for utilizing unlabeled data is semi-supervised learning (Miyato et al., 2018; Sun et al., 2019; Berthelot et al., 2019), which usually assumes that the unlabeled and labeled data come from the same distribution, ignoring their distributional mismatch. In contrast, Positive and Unlabeled (PU) learning (Bekker & Davis, 2020; Kiryo et al., 2017) is an elegant way of handling this under-studied problem, where a model has access only to positive samples and unlabeled data. Therefore, it is possible to utilize pseudo labels predicted by a PU classifier on unlabeled data to guide the conditional generation. However, the predicted signals from the classifier tend to be noisy. Although there is a flurry of papers on learning from noisy labels for classification (Tsung Wei Tsai, 2019; Ge et al., 2020; Guo et al., 2019), to the best of our knowledge, no work has considered leveraging the noisy labels seamlessly in joint classification and generation. Additionally, another work (Hou et al., 2018) leveraged GANs to recover both the positive and negative data distributions to avoid overfitting, but it did not consider noise-invariant generation or the mutual improvement of the two tasks. Generative-discriminative complementary learning (Xu et al., 2019) was investigated in weakly supervised learning, but we make the first attempt to tackle the (Multi-) Positive and Unlabeled learning setting while developing a method for noise-invariant generation from noisy labels. Please refer to Section 5 for a discussion of further related works.
|
In this paper, we focus on the mutual benefits of conditional generation and PU classification when only a small amount of class-labeled data is accessible but extra unlabeled data, including out-of-distribution data, are available. Firstly, a parallel non-negative multi-class PU estimator is derived to classify both the positive data of all classes and the negative data. Then we design a Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) that is able to learn the clean data distribution on all unlabeled data with noisy labels provided by the PU classifier. Simultaneously, we also leverage our CNI-CGAN to enhance the performance of PU classification through data augmentation, demonstrating a reciprocal benefit for both generation and classification. We provide a theoretical analysis of the optimality condition of our CNI-CGAN and conduct extensive experiments to verify the superiority of our approach.
|
2   OUR METHOD
|
2.1   POSITIVE-UNLABELED LEARNING
|
Traditional Binary Positive-Unlabeled Problem Setting   Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ be the input and output variables, and let $p(x, y)$ be their joint distribution, with $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$ denoting the positive and negative class-conditional distributions. In particular, we denote $p(x)$ as the distribution of the unlabeled data. $n_p$, $n_n$ and $n_u$ are the numbers of positive, negative and unlabeled samples, respectively.
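To make this setting concrete, below is a minimal synthetic sketch (our own illustration, not part of the paper): labeled positives are drawn from $p_p$, while the unlabeled pool is drawn from the mixture $p(x) = \pi_p p_p(x) + \pi_n p_n(x)$ with class prior $\pi_p$. All densities, names, and sizes here are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, pi_p = 2, 0.4                      # dimension and class prior pi_p = P(Y = +1)
n_p, n_u = 500, 5000                  # numbers of positive / unlabeled samples

def sample_pp(n):                     # p_p(x) = p(x | Y = +1), an illustrative Gaussian
    return rng.normal(loc=+2.0, scale=1.0, size=(n, d))

def sample_pn(n):                     # p_n(x) = p(x | Y = -1), an illustrative Gaussian
    return rng.normal(loc=-2.0, scale=1.0, size=(n, d))

# Labeled positives: drawn from p_p only.
x_p = sample_pp(n_p)

# Unlabeled pool: drawn from the mixture p(x) = pi_p * p_p(x) + pi_n * p_n(x),
# with the true labels discarded afterwards.
is_pos = rng.random(n_u) < pi_p
x_u = np.where(is_pos[:, None], sample_pp(n_u), sample_pn(n_u))
```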
|
Parallel Non-Negative PU Estimator   Vanilla PU learning (Bekker & Davis, 2020; Kiryo et al., 2017; Du Plessis et al., 2014; 2015) employs an unbiased and consistent estimator. Denote $g_\theta : \mathbb{R}^d \rightarrow \mathbb{R}$ as the score function parameterized by $\theta$, and $\ell : \mathbb{R} \times \{\pm 1\} \rightarrow \mathbb{R}$ as the loss function. The risk of $g_\theta$ can be approximated by its empirical version, denoted $\widehat{R}_{pn}(g_\theta)$:
$$\widehat{R}_{pn}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) + \pi_n \widehat{R}_n^{-}(g_\theta), \qquad (1)$$
where $\pi_p$ represents the class prior probability, i.e., $\pi_p = P(Y = +1)$ with $\pi_p + \pi_n = 1$. In addition, $\widehat{R}_p^{+}(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), +1)$ and $\widehat{R}_n^{-}(g_\theta) = \frac{1}{n_n}\sum_{i=1}^{n_n} \ell(g_\theta(x_i^n), -1)$.
|
As negative data $x^n$ are unavailable, a common strategy is to offset $R_n^{-}(g_\theta)$. Since $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, we have $\pi_n \widehat{R}_n^{-}(g_\theta) = \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)$. The resulting unbiased risk estimator $\widehat{R}_{pu}(g_\theta)$ can then be formulated as:
$$\widehat{R}_{pu}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta) + \widehat{R}_u^{-}(g_\theta), \qquad (2)$$
where $\widehat{R}_p^{-}(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), -1)$ and $\widehat{R}_u^{-}(g_\theta) = \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(g_\theta(x_i^u), -1)$. The advantage of this unbiased risk minimizer is that the optimal solution can be easily obtained if $g$ is linear in $\theta$. However, in real scenarios we tend to leverage more flexible models $g_\theta$, e.g., deep neural networks, and this strategy pushes the estimator to a point where it starts to suffer from overfitting. Hence, we utilize the non-negative risk (Kiryo et al., 2017) for our PU learning, which has been verified (Kiryo et al., 2017) to mitigate overfitting when training deep neural networks. The non-negative PU estimator is formulated as:
$$\widehat{R}_{pu}(g_\theta) = \pi_p \widehat{R}_p^{+}(g_\theta) + \max\left\{0,\; \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)\right\}. \qquad (3)$$
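As a concrete illustration of Eq. (3), the snippet below sketches one possible PyTorch implementation of the non-negative PU risk for a single set of positive/unlabeled samples. The sigmoid surrogate loss, the function names, and the interface are our own assumptions, not the authors' released code.

```python
import torch

def sigmoid_loss(scores, target):
    # Surrogate loss ell(g(x), t) = sigmoid(-t * g(x)); other losses could be substituted.
    return torch.sigmoid(-target * scores)

def nn_pu_risk(g, x_p, x_u, pi_p):
    """Non-negative PU risk of Eq. (3) for positive batch x_p and unlabeled batch x_u."""
    r_p_plus  = sigmoid_loss(g(x_p), +1.0).mean()   # \hat{R}^+_p(g)
    r_p_minus = sigmoid_loss(g(x_p), -1.0).mean()   # \hat{R}^-_p(g)
    r_u_minus = sigmoid_loss(g(x_u), -1.0).mean()   # \hat{R}^-_u(g)
    # Clamp the estimated negative-class risk at zero before adding it (Eq. (3)).
    return pi_p * r_p_plus + torch.clamp(r_u_minus - pi_p * r_p_minus, min=0.0)
```

Here `g` would be any network whose final layer outputs a scalar score; the `min=0.0` clamp is precisely what distinguishes Eq. (3) from the unbiased estimator in Eq. (2).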
|
In pursuit of a parallel implementation of $\widehat{R}_{pu}(g_\theta)$, we replace $\max\left\{0,\; \widehat{R}_u^{-}(g_\theta) - \pi_p \widehat{R}_p^{-}(g_\theta)\right\}$ with its lower bound $\frac{1}{N}\sum_{i=1}^{N} \max\left\{0,\; \widehat{R}_u^{-}(g_\theta; X_u^i) - \pi_p \widehat{R}_p^{-}(g_\theta; X_p^i)\right\}$, where $X_u^i$ and $X_p^i$ denote the unlabeled and positive data in the $i$-th mini-batch, and $N$ is the number of batches.
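A hedged sketch of this per-mini-batch variant follows, reusing the same assumed sigmoid surrogate loss as the previous snippet: the clamp is applied inside every mini-batch separately and the results are averaged, which lower-bounds the full-batch clamp and allows the batches to be processed in parallel.

```python
import torch

def sigmoid_loss(scores, target):
    return torch.sigmoid(-target * scores)

def nn_pu_risk_parallel(g, batches, pi_p):
    """Mini-batch form: (1/N) * sum_i [ pi_p R_p^+(X_p^i) + max{0, R_u^-(X_u^i) - pi_p R_p^-(X_p^i)} ]."""
    total = 0.0
    for x_p_i, x_u_i in batches:          # i-th mini-batch of positive / unlabeled data
        r_p_plus  = sigmoid_loss(g(x_p_i), +1.0).mean()
        r_p_minus = sigmoid_loss(g(x_p_i), -1.0).mean()
        r_u_minus = sigmoid_loss(g(x_u_i), -1.0).mean()
        # Per-batch clamp, then average over the N batches.
        total = total + pi_p * r_p_plus + torch.clamp(r_u_minus - pi_p * r_p_minus, min=0.0)
    return total / len(batches)
```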
|
|
From Binary PU to Multi-PU Learning   Previous PU learning focuses on learning a classifier from positive and unlabeled data and cannot easily be adapted to $K+1$ multi-class classification tasks, where $K$ represents the number of classes in the positive data. Multi-Positive and Unlabeled learning (Xu et al., 2017) has been developed, but the proposed algorithm may not allow deep neural networks. Instead, we extend binary PU learning to the multi-class version in a straightforward way by additionally incorporating a cross-entropy loss on all the positive data with labels for the different classes. More precisely, we consider the $K+1$-class classifier $f_\theta$ as a score function $f_\theta =$