Under review as a conference paper at ICLR 2022 CLASSIFY AND GENERATE RECIPROCALLY: SIMULTANEOUS POSITIVE-UNLABELLED LEARNING AND CONDITIONAL GENERATION WITH EXTRA DATA Anonymous authors Paper under double-blind review ABSTRACT The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this pa- per, we address this problem by leveraging Positive-Unlabeled (PU) classification and the conditional generation with extra unlabeled data simultaneously, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performance. In particular, we present a novel training framework to jointly target both PU classification and conditional generation when exposing to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network (CGAN) that is robust to noisy labels, 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Con- ditional GAN (CNI-CGAN) that can learn the clean data distribution from noisy labels predicted by a PU classifier. Theoretically, we proved the optimal condi- tion of CNI-CGAN and experimentally, we conducted extensive evaluations on diverse datasets, verifying the simultaneous improvements on both classification and generation. 1 INTRODUCTION Existing machine learning methods, particularly deep learning models, typically require big data to pursue remarkable performance. For instance, conditional deep generative models are able to generate high-fidelity and diverse images, but they have to rely on vast amounts of labeled data (Lu- cic et al., 2019). Nevertheless, it is often laborious or impractical to collect large-scale accurate class-labeled data in real-world scenarios, and thus the label scarcity is ubiquitous. Under such cir- cumstances, the performance of classification and conditional generation (Mirza & Osindero, 2014) drops significantly (Lucic et al., 2019). At the same time, diverse unlabeled data are available in enormous quantities, and therefore a key issue is how to take advantage of the extra data to enhance the conditional generation or classification. Within the unlabeled data, both in-distribution and out-of-distribution data exist, where in- distribution data conform to the distribution of the labeled data while out-of-distribution data do not. Our key insight is to harness the out-of-distribution data. In the generation with extra data, most related works focused on the in-distribution data (Lucic et al., 2019; Gui et al., 2020; Donahue & Simonyan, 2019). When it comes to the out-of-distribution data, the majority of existing meth- ods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempted to forcibly train generative models on a large amount of unlabeled data, and then transferred the learned knowledge of the pre-trained generator to the in-distribution data. In classification, a common setting to utilize unlabeled data is semi-supervised learning (Miyato et al., 2018; Sun et al., 2019; Berthelot et al., 2019), which usually assumes that the unlabeled and labeled data come from the same distribution, ignoring their distributional mismatch. In contrast, Positive and Unlabeled (PU) Learning (Bekker & Davis, 2020; Kiryo et al., 2017) is an elegant way of handling this under-studied problem, where a model has the only access to positive samples and unlabeled data. Therefore, it is possible to utilize pseudo labels predicted by a PU classifier on unlabeled data to guide the conditional gen- 1 Under review as a conference paper at ICLR 2022 eration. However, the predicted signals from the classifier tend to be noisy. Although there are a flurry of papers about learning from noisy labels for classification (Tsung Wei Tsai, 2019; Ge et al., 2020; Guo et al., 2019), to our best knowledge, no work has considered to leverage the noisy labels seamlessly in the joint classification and generation. Additionally, another work (Hou et al., 2018) leveraged GANs to recover both positive and negative data distribution to step away from overfit- ting, but they never considered the noise-invariant generation or their mutual improvement. The generative-discriminative complementary learning (Xu et al., 2019) was investigated in weakly su- pervised learning, but we are the first attempt to tackle the (Multi-) Positive and Unlabeled learning setting while developing the method of noise-invariant generation from noisy labels. Please refer to Section 5 for the discussion about more related works. In this paper, we focus on the mutual benefits of conditional generation and PU classification, when we are only accessible to little class-labeled data, but extra unlabeled data, including out- of-distribution data, can be available. Firstly, a parallel non-negative multi-class PU estimator is derived to classify both the positive data of all classes and the negative data. Then we design a Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) that is able to learn the clean data distribution on all unlabeled data with noisy labels provided by the PU clas- sifier. Simultaneously, we also leverage our CNI-CGAN to enhance the performance of the PU classification through data augmentation, demonstrating a reciprocal benefit for both generation and classification. We provide the theoretical analysis on the optimal condition of our CNI-CGAN and conduct extensive experiments to verify the superiority of our approach. 2 OUR METHOD 2.1 POSITIVE-UNLABELED LEARNING Traditional Binary Positive-Unlabeled Problem Setting Let X ∈Rd and Y ∈{±1} be the input and output variables and p(x, y) is the joint distribution with marginal distribution pp(x) = p(x|Y = +1) and pn(x) = p(x|Y = −1). In particular, we denote p(x) as the distribution of unlabeled data. np, nn and nu are the amount of positive, negative and unlabeled data, respectively. Parallel Non-Negative PU Estimator Vanilla PU learning (Bekker & Davis, 2020; Kiryo et al., 2017; Du Plessis et al., 2014; 2015) employs unbiased and consistent estimator. Denote gθ : Rd → R as the score function parameterized by θ, and ℓ: R × {±1} →R as the loss function. The risk of gθ can be approximated by its empirical version denoted as b Rpn(gθ): b Rpn(gθ) = πp b R+ p (gθ) + πn b R− n (gθ), (1) where πp represents the class prior probability, i.e. πp = P(Y = +1) with πp+πn = 1. In addition, b R+ p (gθ) = 1 np Pnp i=1 ℓ(gθ (xp i ) , +1) and b R− n (gθ) = 1 nn Pnn i=1 ℓ(gθ (xn i ) , −1) . As negative data xn are unavailable, a common strategy is to offset R− n (gθ). We also know that πnpn(x) = p(x) −πppp(x), and hence πn b R− n (gθ) = b R− u (gθ) −πp b R− p (gθ). Then the resulting unbiased risk estimator b Rpu(gθ) can be formulated as: b Rpu(gθ) = πp b R+ p (gθ) −πp b R− p (gθ) + b R− u (gθ), (2) where b R− p (gθ) = 1 np Pnp i=1 ℓ(gθ (xp i ) , −1) and b R− u (gθ) = 1 nu Pnu i=1 ℓ(gθ (xu i ) , −1). The advan- tage of this unbiased risk minimizer is that the optimal solution can be easily obtained if g is linear in θ. However, in real scenarios we tend to leverage more flexible models gθ, e.g., deep neural networks. This strategy will push the estimator to a point where it starts to suffer from overfit- ting. Hence, we decide to utilize non-negative risk (Kiryo et al., 2017) for our PU learning, which has been verified in (Kiryo et al., 2017) to allow deep neural network to mitigate overfitting. The non-negative PU estimator is formulated as: b Rpu(gθ) = πp b R+ p (gθ) + max n 0, b R− u (gθ) −πp b R− p (gθ) o . (3) In pursue of the parallel implementation of b Rpu(gθ), we replace max n 0, b R− u (gθ) −πp b R− p (gθ) o with its lower bound 1 N PN i=1 max n 0, b R− u (gθ; X i u) −πp b R− p (gθ; X i p) o where X i u and X i p denote as the unlabeled and positive data in the i-th mini-batch, and N is the number of batches. 2 Under review as a conference paper at ICLR 2022 From Binary PU to Multi-PU Learning Previous PU learning focuses on learning a classifier from positive and unlabeled data, and cannot easily be adapted to K + 1 multi-classification tasks where K represents the number of classes in the positive data. Multi-Positive and Unlabeled learning (Xu et al., 2017) was ever developed, but the proposed algorithm may not allow deep neural networks. Instead, we extend binary PU learning to multi-class version in a straightforward way by addition- ally incorporating cross entropy loss on all the positive data with labels for different classes. More precisely, we consider the K +1-class classifier fθ as a score function fθ = f 1 θ (x), . . . , f K+1 θ (x)  . After the softmax function, we select the first K positive data to construct cross-entropy loss ℓCE, i.e., ℓCE(fθ(x), y) = log PK+1 j=1 exp  f j θ(x)  −f y θ (x) where y ∈[K]. For the PU loss, we consider the composite function h(fθ(x)) : Rd →R where h(·) conducts a logit transforma- tion on the accumulative probability for the first K classes, i.e., h(fθ(x)) = ln( p 1−p) in which p = PK j=1 exp  f j θ(x)  / PK+1 j=1 exp  f j θ(x)  . The final mini-batch risk of our PU learning can be presented as: e Rpu(fθ; X i) = πp b R+ p (h(fθ); X i p) + max n 0, b R− u (h(fθ); X i u) −πp b R− p (h(fθ); X i p) o + b RCE p (fθ; X i p), (4) where b RCE p (fθ; X i p) = 1 np Pnp i=1 ℓCE (fθ (xp i ) , y). 2.2 CLASSIFIER-NOISE-INVARIANT CONDITIONAL GENERATIVE ADVERSARIAL NETWORK (CNI-CGAN) ୥ PU ௥ ௥ PU ୥ Figure 1: Model architecture of our Classifier- Noise-Invariant Conditional GAN (CNI-CGAN). The output xg of the conditional generator G is paired with a noisy label ˜ y corrupted by the PU- dependent confusion matrix ˜ C. The discriminator D distinguishes between whether a given labeled sample comes from the real data (xr, PUθ(xr)) or generated data (xg, ˜ y). To leverage extra data, i.e., all unlabeled data, to benefit the generation, we deploy our condi- tional generative model on all data with pseudo labels predicted by our PU classifier. However, these predicted labels tend to be noisy, reduc- ing the reliability of the supervision signals and thus worsening the performance of the condi- tional generative model. Besides, the noise de- pends on the accuracy of the given PU classi- fier. To address this issue, we focus on devel- oping a novel noise-invariant conditional GAN that is robust to noisy labels provided by a spec- ified classifier, e.g. a PU classifier. We call our method Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) and the architecture is depicted in Figure 1. In the following, we elaborate on each part of it. Principle of the Design of CNI-CGAN Albeit being noisy, the pseudo labels given by the PU classifier still provide rich information that we can exploit. The key is to take the noise generation mechanism into consideration dur- ing the generation. We denote the real data as xr and the predicted hard label through the PU classifier as PUθ(xr), i.e., PUθ(xr) = arg maxi f i θ(xr), as displayed in Figure 1. We let the generator “imitate” the noise generation mechanism to generate pseudo labels for the labeled data. With both pseudo and real labels, we can leverage the PU classifier fθ to estimate a confusion matrix ˜ C to model the label noise from the classifier. During the generation, a real label y, while being fed into the generator G, will also be polluted by ˜ C to compute a noisy label ˜ y, which then will be combined with the generated fake sample xg for the following discrimination. Finally, the discriminator D will distinguish the real samples [xr, PUθ(xr)] out of fake samples [xg, ˜ y]. Overall, the noise “generation” mechanism from both sides can be balanced. 3 Under review as a conference paper at ICLR 2022 Estimation of ˜ C The key in the design of ˜ C is to estimate the label noise of the pre-trained PU classifier by considering all the samples of each class. More specifically, the confusion matrix ˜ C is k + 1 by k + 1 and each entry ˜ Cij represents the probability of a generated sample xg, given a label i, being classified as class j by the PU classifier. Mathematically, we denote ˜ Cij as: ˜ Cij = P(PUθ(xg) = j|y = i) = Ez[I{P Uθ(xg)=j|y=i}], (5) where xg = G(z, y = i) and I is the indicator function. Owing to the stochastic optimization nature when training deep neural networks, we incorporate the estimation of ˜ C in the processing of training by Exponential Moving Average (EMA) method. This choice can balance the utilization of information from previous training samples and the updated PU classifier to estimate ˜ C. We formulate the update of ˜ C(l+1) in the l-th mini-batch as follows: ˜ C(l+1) = λ ˜ C(l) + (1 −λ)∆ ˜ C Xl, (6) where ∆˜ C Xl denotes the incremental change of ˜ C on the current l-th mini-batch data Xl via Eq. 5. λ is the averaging coefficient in EMA. Theoretical Guarantee of Clean Data Distribution Firstly, we denote O(x) as the oracle class of sample x from an oracle classifier O(·). Let πi, i = 1, ..., K +1, be the class-prior probability of the class i in the multi-positive unlabeled setting. Theorem 1 proves the optimal condition of CNI-CGAN to guarantee the convergence to the clean data distribution. The proof is provided in Appendix A. Theorem 1. (Optimal Condition of CNI-CGAN) Let P g be a probabilistic transition matrix where P g ij = P(O(xg) = j|y = i) indicates the probability of sample xg with the oracle label j generated by G with the initial label i. We assume that the conditional sample space of each class is disjoint with each other, then (1) P g is a permutation matrix if the generator G in CNI-CGAN is optimal, with the permutation, compared with an identity matrix, only happens on rows r where corresponding πr, r ∈r are equal. (2) If P g is an identity matrix and the generator G in CNI-CGAN is optimal, then pr(x, y) = pg(x, y) where pr(x, y) and pg(x, y) are the real and the generating joint distribution, respectively. Briefly speaking, CNI-CGAN can learn the clean data distribution if P g is an identity matrix. More importantly, the method we elaborate till now has already guaranteed Pg as a permutation matrix, which is very close to an identity one. We need an additional constraint, although the permutation happens only when same class-prior probabilities exist. The Auxiliary Loss The optimal G in CNI-CGAN can only guarantee that pg(x, y) is close to pr(x, y) as the optimal permutation matrix P g is close to the identity matrix. Hence in practice, to ensure that we can exactly learn an identity matrix for P g and thus achieve the clean data distri- bution, we introduce an auxiliary loss to encourage a larger trace of P g, i.e., PK+1 i=1 P(O(xg) = i)|y = i). As O(·) is intractable, we approximate it by the current PU classifier PUθ(xg). Then we obtain the auxiliary loss ℓaux: ℓaux(z, y) = max{κ − 1 K + 1 K+1 X i=1 Ez(I{P Uθ(xg)=i|y=i}), 0}, (7) where κ ∈(0, 1) is a hyper-parameter. With the support of auxiliary loss, P g has the tendency to converge to the identity matrix where CNI-CGAN can learn the clean data distribution even in the presence of noisy labels. Comparison with RCGAN (Thekumparampil et al., 2018; Kaneko et al., 2019) The the- oretical property of CNI-CGAN has a major advantage over existing Robust CGAN (RC- GAN) (Thekumparampil et al., 2018; Kaneko et al., 2019), for which the optimal condition can only be achieved when the label confusion matrix is known a priori. Although heuristics can be employed, such as RCGAN-U (Thekumparampil et al., 2018), to handle the unknown label noise setting, these approaches still lack the theoretical guarantee to converge to the clean data distribution. To guarantee the efficacy of our approach, one implicit and mild assumption is that our PU classifier will not overfit on the training data, while our non-negative estimator helps to ensure that it as 4 Under review as a conference paper at ICLR 2022 explained in Section 2.1. To further clarify the optimization process of CNI-CGAN, we elaborate the training steps of D and G, respectively. D-Step: We train D on an adversarial loss from both the real data and the generated (xg, ˜ y), where ˜ y is corrupted by ˜ C. ˜ Cy denotes the y-th row of ˜ C. We formulate the loss of D as: max D∈F E x∼p(x)[φ(D(x, PUθ(x)))] + E z∼PZ ,y∼PY ˜ y|y∼˜ Cy [φ(1 −D(G(z, y), ˜ y))], (8) where F is a family of discriminators and PZ is the distribution of latent space vector z, e.g., a Normal distribution. PY is a discrete uniform distribution on [K + 1] and φ is the measuring function. G-Step: We train G additionally on the auxiliary loss ℓaux(z, y) as follows: min G∈G E z∼PZ ,y∼PY ˜ y|y∼˜ Cy [φ(1 −D(G(z, y), ˜ y)) + βℓaux(z, y)] , (9) where β controls the strength of auxiliary loss and G is a family of generators. In summary, our CNI-CGAN conducts K +1 classes generation, which can be further leveraged to benefit the K + 1 PU classification via data augmentation. Algorithm 1 Alternating Minimization for PU Learning and Classifier-Noise-Invariant Generation. Input: Training data (Xp, Xu). Batch size M and hyper-parameter β > 0, λ, κ ∈(0, 1). L0 and L ∈N +. Initializing ˜ C(1) as identity matrix. Number of batches N during the training. Output: Model parameter for generator G, and θ for the PU classifier fθ. 1: / * Pre-train PU classifier fθ * / 2: for i = 1 to N do 3: Update fθ by descending its stochastic gradient of e Rpu fθ; X i via Eq. 4. 4: end for 5: repeat 6: / * Update CNI-CGAN * / 7: for l = 1 to L do 8: Sample {z1, ..., zM}, {y1, ..., yM} and {x1, ..., xM} from PZ, PY and all training data, respectively, and then sample {˜ y1, ..., ˜ yM} through the current ˜ C(l). Then, update the discriminator D by ascending its stochastic gradient of 1 M M X i=1 [φ(D(xi, PUθ(xi)))] + φ(1 −D(G(zi, yi), ˜ yi))]. 9: Sample {z1, ..., zM} and {y1, ..., yM} from PZ and PY , and then sample {˜ y1, ..., ˜ yM} through the current ˜ C(l). Update the generator G by descending its stochastic gradient of 1 M M X i=1 [φ(1 −D(G(zi, yi), ˜ yi)) + βℓaux(yi, zi)]. 10: if l ≥L0 then 11: Compute ∆˜ C Xl = 1 M PM i=1 I{P Uθ(G(zi,yi))|yi} via Eq. 5, and then estimate ˜ C by ˜ C(l+1) = λ ˜ C(l) + (1 −λ)∆ ˜ C Xl. 12: end if 13: end for 14: / * Update PU classifier via Data Augmentation * / 15: Sample {z1, ..., zM} and {y1, ..., yM} from PZ and PY , respectively, and then update the PU classifier fθ by descending its stochastic gradient of 1 M M X i=1 ℓCE (fθ (G(zi, yi)) , yi) . 16: until convergence 5 Under review as a conference paper at ICLR 2022 3 ALGORITHM Firstly, we obtain a PU classifier fθ trained on multi-positive and unlabeled dataset with the par- allel non-negative estimator derived in Section 2.1. Then we train our CNI-CGAN, described in Section 2.2, on all data with pseudo labels predicted by the pre-trained PU classifier. As our CNI- CGAN is robust to noisy labels, we leverage the data generated by CNI-CGAN to conduct data augmentation to improve the PU classifier. Finally, we implement the joint optimization for the training of CNI-CGAN and the data augmentation of the PU classifier. We summarize the proce- dure in Algorithm 1 and provide more details in Appendix C. Computational Cost Analysis In the implementation of our CNI-CGAN, we need to additionally estimate ˜ C, a (K + 1) × (K + 1) matrix. The computational cost of this small matrix is negligible compared with the updating of discriminator and generator networks, although the estimation of ˜ C is crucial. Simultaneous Improvement on PU Learning and Generation with Extra Data From the per- spective of PU classification, due to the theoretical guarantee from Theorem 1, CNI-CGAN is capa- ble of learning a clean data distribution out of noisy pseudo labels predicted by the pre-trained PU classifier. Hence, the following data augmentation has the potential to improve the generalization of PU classification regardless of the specific form of the PU estimator. From the perspective of generation with extra data, the predicted labels on unlabeled data from the PU classifier can provide CNI-CGAN with more supervised signals, thus further improving the quality of generation. Due to the joint optimization, both the PU classification and conditional generative models are able to improve each other reciprocally, as demonstrated in the following experiments. 4 EXPERIMENT Experimental Setup We perform our approaches and several baselines on MNIST, Fashion-MNIST and CIFAR-10. We select the first 5 classes on MNIST and 5 non-clothes classes on Fashion- MNIST, respectively, for K + 1 classification (K = 5). To verify the consistent effectiveness of our method in the standard binary PU setting, we pick the 4 categories of transportation tools in CIFAR- 10 as the one-class positive dataset. As for the baselines, the first is CGAN-P, where a Vanilla CGAN (Mirza & Osindero, 2014) is trained only on limited positive data. Another natural baseline is CGAN-A where a Vanilla CGAN is trained on all data with labels given by the PU classifier. 10 2 10 1 0 20 40 60 80 100 Generator Label Accuracy(%) MNIST CGAN-P CGAN-A Ours 10 2 10 1 Positive Rate 65 70 75 80 85 90 95 PU Accuracy(%) MNIST Original PU CGAN-A Ours 10 2 10 1 75 80 85 90 95 100 Generator Label Accuracy(%) Fashion-MNIST CGAN-P CGAN-A Ours 10 2 10 1 Positive Rate 80.0 82.5 85.0 87.5 90.0 92.5 95.0 97.5 PU Accuracy(%) Fashion-MNIST Original PU CGAN-A Ours 10 2 10 1 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Inception Score CIFAR-10 CGAN-P CGAN-A Ours 10 2 10 1 Positive Rate 78 80 82 84 86 88 90 PU Accuracy(%) CIFAR-10 Original PU CGAN-A Ours Figure 2: Generation and classification performance of CGAN-P, CGAN-A and Ours on three datasets. Results of CGAN-P (blue lines) on PU accuracy do not exist since CGAN-P generates only K classes data rather than K + 1 categories that the PU classifier needs. 6 Under review as a conference paper at ICLR 2022 The last baseline is RCGAN-U (Thekumparampil et al., 2018) where the confusion matrix is totally learnable while training. For fair comparisons, we choose the same GAN architecture. Through a line search of hyper-parameters, we choose κ as 0.75, β as 5.0 and λ = 0.99 across all the datasets. We set L0 as 5 in Algorithm 1. More details about hyper-parameters can be found in Appendix D. Evaluation Metrics For MNIST and Fashion-MNIST, we mainly use Generator Label Accu- racy (Thekumparampil et al., 2018) and PU Accuracy to evaluate the quality of generated images. Generator Label Accuracy compares specified y from CGANs to the true class of the generated ex- amples through a pre-trained (almost) oracle classifier f. In experiments, we pre-trained two K+1 classifiers with 99.28% and 98.23% accuracy on the two datasets, respectively. Additionally, the increased PU Accuracy measures the closeness between generated data distribution and test (almost real) data distribution for the PU classification, serving as a key indicator to reflect the quality of generated images. For CIFAR 10, we use both Inception Score (Salimans et al., 2016) to evaluate the quality of the generated samples, and the increased PU Accuracy to quantify the improvement of generated samples on the PU classification. 4.1 GENERATION AND CLASSIFICATION PERFORMANCE We set the whole training dataset as the unlabeled data and select certain amount of positive data with the ratio of Positive Rate. Figure 2 presents the trend of Generator Label Accuracy, Inception Score and PU Accuracy as the Positive Rate increases. It turns out that CNI-CGAN outperforms CGAN-P and CGAN-A consistently especially when the Positive Rate is small, i.e. little positive data. Remarkably, our approach enhances the PU accuracy greatly when exposed to low positive rates, while CGAN-A even worsens the original PU classifier sometimes in this scenario due to the existence of too much label noise given by a less accurate PU classifier. Meanwhile, when more su- pervised positive data are given, the PU classifier generalizes better and then provides more accurate labels, conversely leading to more consistent and better performance for all methods. Besides, note that even though the CGAN-P achieves comparable generator label accuracy on MNIST, it results in a lower Inception Score. We demonstrate this in Appendix D. Table 1: PU classification accuracy of RCGAN-U and Ours across three datasets. Final PU accuracy represents the accuracy of PU classifier after the data augmentation. Final PU Accuracy \ Positive Rates (%) 0.2% 0.5% 1.0% 10.0% MNIST Original PU 68.86 76.75 86.94 95.88 RCGAN-U 87.95 95.24 95.86 97.80 Ours 96.33 96.43 96.71 97.82 Fashion-MNIST Original PU 80.68 88.25 93.05 95.99 RCGAN-U 89.21 92.05 94.59 97.24 Ours 89.23 93.82 95.16 97.33 CIFAR-10 Original PU 76.79 80.63 85.53 88.43 RCGAN-U 83.13 86.22 88.22 90.45 Ours 87.64 87.92 88.60 90.69 To verify the advantage of theoretical property for our CNI-CGAN, we further compare it with RCGCN-U (Thekumparampil et al., 2018; Kaneko et al., 2019), the heuristic version of robust gen- eration against unknown noisy labels setting without the theoretical guarantee of optimal condition. As observed in Table 1, our method outperforms RCGAN-U especially when the positive rate is low. When the amount of positive labeled data is relatively large, e.g., 10.0%, both our approach and RCGAN-U can obtain comparable performance. Visualization To further demonstrate the superiority of CNI-CGAN compared with the other base- lines, we present some generated images within K +1 classes from CGAN-A, RCGAN-U and CNI- CGAN on MNIST, and high-quality images from CNI-CGAN on Fashion-MNIST and CIFAR-10, in Figure 3. In particular, we choose the positive rate as 0.2% on MNIST, yielding the initial PU classifier with 69.14% accuracy. Given the noisy labels on all data, our CNI-CGAN can generate more accurate images of each class visually compared with CGAN-A and RCGAN-U. Results of Fashion-MNIST and comparison with CGAN-P on CIFAR-10 can refer to Appendix E. 7 Under review as a conference paper at ICLR 2022 MNIST: Positive Rate 0.2%, Initial PU: 69.14% Generator Label Accuracy 39.67% 81.58% 96.33% CGAN-A RCGAN-U CNI-CGAN CNI-CGAN Fashion-MNIST CIFAR-10 Figure 3: Visualization of generated samples on three datasets. Rows below the red line represent the negative class. We highlight the erroneously generated images with red boxes on MNIST. 4.2 ROBUSTNESS OF OUR APPROACH Robustness against the Initial PU accuracy The auxiliary loss can help the CNI-CGAN to learn the clean data distribution regardless of the initial accuracy of PU classifiers. To verify that, we select distinct positive rates, yielding the pre-trained PU classifiers with different initial accuracies. Then we perform our method based on these PU classifiers. Figure 4 suggests that our approach can still attain the similar generation quality under different initial PU accuracies after sufficient training, although better initial PU accuracy can be beneficial to the generation performance in the early phase. 10 2 10 3 10 4 Number of Training Iterations 0 20 40 60 80 100 GAN: Generator Label Accuracy (%) MNIST Initial PU 77.87% Initial PU 84.57% Initial PU 91.28% 10 2 10 3 10 4 10 5 Number of Training Iterations 0 20 40 60 80 100 GAN: Generator Label Accuracy (%) Fashion MNIST Initial PU 88.27% Initial PU 91.03% Initial PU 94.02% 10 2 10 3 10 4 Number of Training Iterations 78 80 82 84 86 88 90 PU Accuracy(%) CIFAR 10 Initial PU 79.49% Initial PU 82.51% Initial PU 85.45% Figure 4: Tendency of generation performance as the training iterations increase on three datasets. Robustness against the Unlabeled data In real scenarios, we are more likely to have little knowl- edge about the extra data we have. To further verify the robustness of CNI-CGAN against the unknown distribution of extra data, we test different approaches across different amounts and dis- tributions of the unlabeled data. Particularly, we consider two different types of distributions for unlabeled data. Type 1 is [ 1 K+1, ..., 1 K+1, 1 K+1] where the number of data in each class, including the negative data, is even, while type 2 is [ 1 2K , ... 1 2K , 1 2] where the negative data makes up half of all unlabeled data. In experiments, we focus on the PU Accuracy to evaluate both the generation quality and the improvement of PU learning. For MNIST, we choose 1% and 0.5% for two settings while we opt for 0.5% and 0.2% on both Fashion-MNIST and CIFAR-10. Figure 5 manifests that the accuracy of PU classifier exhibits a slight ascending tendency with the increasing of the number of unlabeled data. More importantly, our CNI-CGAN almost consistently outperforms other baselines across different amount of unlabeled data as well as distinct distributions of unlabeled data. This verifies that the robustness of our proposal to the distribution of extra data can be maintained potentially. We leave the investigation on the robustness against more imbalanced situations as future works. 8 Under review as a conference paper at ICLR 2022 10 3 10 4 50 60 70 80 90 100 Distribution Type 1: PU Acc(%) MNIST Original PU CGAN-A RCGAN-U Ours 10 3 10 4 Number of Unlabeled Data 75 80 85 90 95 100 Distribution Type 2: PU Acc(%) MNIST Original PU CGAN-A RCGAN-U Ours 10 4 70 75 80 85 90 95 100 Distribution Type 1: PU Acc(%) Fashion-MNIST Original PU CGAN-A RCGAN-U Ours 10 4 Number of Unlabeled Data 70 75 80 85 90 95 100 Distribution Type 2: PU Acc(%) Fashion-MNIST Original PU CGAN-A RCGAN-U Ours 10 4 70 75 80 85 90 95 100 Distribution Type 1: PU Acc(%) CIFAR-10 Original PU CGAN-A RCGAN-U Ours 10 4 Number of Unlabeled Data 70 75 80 85 90 95 100 Distribution Type 2: PU Acc(%) CIFAR-10 Original PU CGAN-A RCGAN-U Ours Figure 5: PU Classification accuracy of CGAN-A, RCGAN-U and Ours after joint optimization across different amounts and distribution types of unlabeled data. 5 RELATED WORKS Positive-Unlabeled (PU) Learning. Positive and Unlabeled (PU) Learning is the setting where a learner has only access to positive examples and unlabeled data (Bekker & Davis, 2020; Kiryo et al., 2017). One related work (Hou et al., 2018) employed GANs (Goodfellow et al., 2014) to recover both positive and negative data distribution to step away from overfitting. Kato et al. (Kato et al., 2018) focused on remedying the selection bias in the PU learning. Besides, Multi-Positive and Unlabeled Learning (Xu et al., 2017) extended the binary PU setting to the multi-class version, therefore adapting to more practical applications. By contrast, our multi-positive unlabeled method absorbs the advantages of previous approaches, and in the meanwhile intuitively extends them to fit the differential deep neural networks optimization. Conditional GANs on Few Labels Data. To attain high-quality images with both fidelity and di- versity, the training of generative models requires a large dataset. To reduce the need of huge amount of data, the vast majority of methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempted to transfer prior knowledge of the pre-trained generator. Another branch (Lucic et al., 2019) is to leverage self- and supervised learning to add pseudo labels on the in-distribution unlabeled data in order to expand labeled dataset. Compared with this approach, our strategy can be viewed to automatically “pick” useful in-distribution data from total unknown unlabeled data via PU learning framework, and then constructs robust conditional GANs to generate clean data distribution out of predicted label noise. Please refer to more related works in Appendix B. 6 DISCUSSION AND CONCLUSION In this paper, we proposed a new method, CNI-CGAN, to jointly exploit PU classification and conditional generation. It is, to our best knowledge, the first method of such kind to break the ceiling of class-label scarcity, by combining two promising yet separate methodologies to gain massive mutual improvements. CNI-CGAN can learn the clean data distribution from noisy labels given by a PU classifier, and then enhance the performance of PU classification through data augmentation in various settings. We have demonstrated, both theoretically and experimentally, the superiority of our proposal on diverse benchmark datasets in an exhaustive and comprehensive manner. In the future, it will be promising to investigate learning strategies on imbalanced data, e.g., cost-sensitive learning (Elkan, 2001), to extend our approach to broader settings, which will further cater to real- world scenarios where highly unbalanced data are commonly available. In addition, the leverage of soft labels in the design of CNI-CGAN is also promising. 9 Under review as a conference paper at ICLR 2022 Ethics Statement. Our designed CNI-CGAN framework can interplay with the PU classification and robust generation, which can mitigate the scarcity of class-labeled data. Leveraging extra data may correlate with the privacy issue as the privacy issue still exists in generative models. Thus, a privacy-guaranteed version of our algorithm can be further proposed in the future to handle the potential privacy issue. Reproducibility Statement. For the theoretical part, we clearly state the related assumption and detailed proof process in Appendix A. In terms of the algorithm, our implementation is directly adapted from the public one of generative models and PU learning. REFERENCES Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: A survey. Machine Learning, 109(4):719–760, 2020. David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060, 2019. Grigorios G Chrysos, Jean Kossaifi, and Stefanos Zafeiriou. Robust conditional generative adver- sarial networks. arXiv preprint arXiv:1805.08657, 2018. Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551, 2019. Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In International conference on machine learning, pp. 1386–1394, 2015. Marthinus C Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In Advances in neural information processing systems, pp. 703–711, 2014. Charles Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pp. 973–978. Lawrence Erlbaum Associates Ltd, 2001. Yixiao Ge, Dapeng Chen, and Hongsheng Li. Mutual mean-teaching: Pseudo label refinery for un- supervised domain adaptation on person re-identification. International Conference on Learning Representations, 2020. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural infor- mation processing systems, pp. 2672–2680, 2014. Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adver- sarial networks: Algorithms, theory, and applications. arXiv preprint arXiv:2001.06937, 2020. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Im- proved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777, 2017. Tianyu Guo, Chang Xu, Boxin Shi, Chao Xu, and Dacheng Tao. Learning from bad data via gener- ation. In Advances in Neural Information Processing Systems, pp. 6042–6053, 2019. Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. International Conference on Learning Representations, 2018. Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier- based outlier detection via direct density ratio estimation. In 2008 Eighth IEEE International Conference on Data Mining, pp. 223–232. IEEE, 2008. Ming Hou, Brahim Chaib-draa, Chao Li, and Qibin Zhao. Generative adversarial positive-unlabelled learning. In J´ erˆ ome Lang (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 2255–2261. ijcai.org, 2018. doi: 10.24963/ijcai.2018/312. 10 Under review as a conference paper at ICLR 2022 Takuhiro Kaneko, Yoshitaka Ushiku, and Tatsuya Harada. Label-noise robust generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2467–2476, 2019. Masahiro Kato, Takeshi Teshima, and Junya Honda. Learning from positive and unlabeled data with a selection bias. 2018. Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in neural information processing systems, pp. 1675–1685, 2017. Kiran Koshy Thekumparampil, Sewoong Oh, and Ashish Khetan. Robust conditional gans under missing or uncertain labels. arXiv preprint arXiv:1906.03579, 2019. Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. nternational Conference on Machine Learning (ICML), 2019. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018. Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2750– 2758, 2019. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 2016. Alex Smola, Le Song, and Choon Hui Teo. Relative novelty detection. In Artificial Intelligence and Statistics, pp. 536–543, 2009. Ke Sun, Bing Yu, Zhouchen Lin, and Zhanxing Zhu. Patch-level neighborhood interpolation: A general and effective graph-based regularization strategy. arXiv preprint arXiv:1911.09307, 2019. Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and Sewoong Oh. Robustness of conditional gans to noisy labels. In Advances in neural information processing systems, pp. 10271–10282, 2018. Jun Zhu Tsung Wei Tsai, Tsung Wei Tsai. Countering noisy labels by learning from auxiliary clean labels. arXiv preprint arXiv:1905.13305, 2019. Qin Wang, Wen Li, and Luc Van Gool. Semi-supervised learning by augmented distribution align- ment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1466–1475, 2019. Yanwu Xu, Mingming Gong, Junxiang Chen, Tongliang Liu, Kun Zhang, and Kayhan Batmanghe- lich. Generative-discriminative complementary learning. AAAI 2020, 2019. Yixing Xu, Chang Xu, Chao Xu, and Dacheng Tao. Multi-positive and unlabeled learning. In IJCAI, pp. 3182–3188, 2017. Shin’ya Yamaguchi, Sekitoshi Kanai, and Takeharu Eda. Effective data augmentation with multi- domain learning gans. arXiv preprint arXiv:1912.11597, 2019. Wei Li Shaogang Gong Yanbei Chen, Xiatian Zhu. Semi-supervised learning under class distribution mismatch. AAAI 2020, 2019. Bing Yu, Jingfeng Wu, Jinwen Ma, and Zhanxing Zhu. Tangent-normal adversarial regularization for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10676–10684, 2019. 11 Under review as a conference paper at ICLR 2022 Miaoyun Zhao, Yulai Cong, and Lawrence Carin. On leveraging pretrained gans for limited-data generation. arXiv preprint arXiv:2002.11810, 2020. 12 Under review as a conference paper at ICLR 2022 A APPENDIX: PROOF OF THEOREM 1 Firstly, we recall some definitions. Denote xr, xg as the real training and generated samples, respectively. x are the population of all data, and xr are sampled from p(x). yg represents the initial labels for the generator G, while ˜ y indicates the labels perturbed by ˜ C from yg. The class-prior πi meets πi = P(yg = i) = P(O(xr) = i). For a rigorous proof of Theorem 1, we elaborate it again in the appendix. Theorem 1 We assume that the following three mild assumptions can be met: (a) PU classifier is not over- fitting on the training data, (b) P(PUθ(xg)|O(xg), yg) = P(PUθ(xg)|O(xg)), (c) the conditional sample space is disjoint from each other class. Then, (1) P g is a permutation matrix if the generator G in CNI-CGAN is optimal, with the permutation, compared with an identity matrix, only happens on rows r where corresponding πr, r ∈r are equal. (2) If P g is an identity matrix and the generator G in CNI-CGAN is optimal, then pr(x, y) = pg(x, y) where pr(x, y) and pg(x, y) are the real and generating joint distribution, respectively. A.1 PROOF OF (1) Proof. For a general setting, the oracle class of xg given by label yg is not necessarily equal to PUθ(xg). Thus, we consider the oracle class of xg, i.e., O(xg) in the proof. Optimal G. In CNI-CGAN, G is optimal if and only if pr(xr, PUθ(xr)) = pg(xg, ˜ y). (10) The equivalence of joint probability distribution can further derive the equivalence of marginal distribution, i.e., pr(xr) = pg(xg). We define a probability matrix C where Cij = P(PUθ(x) = j|O(x) = i) where x are the population data. According to (c), we can apply O(·) on both xr and xg in Eq. 10. Then we have: P(O(xr) = i, PUθ(xr) = j) (c) = P(O(xg) = i, ˜ y = j) P(O(xr) = i)P(PUθ(xr) = j|O(xr) = i) = K+1 X k=1 P(yg = k, O(xg) = i)P(˜ y = j|yg = k, O(xg) = i) πiCij (a) = K+1 X k=1 P(O(xg) = i|yg = k)P(yg = k)P(˜ y = j|yg = k) πiCij = K+1 X k=1 P g⊤ ik πk ˜ Ckj, (11) where assumption (a) indicates that PUθ(xr) is close to PUθ(x) so that P(PUθ(xr) = j|O(xr) = i) = P(PUθ(x) = j|O(x) = i). Then the corresponding matrix form follows as ΠC = P g⊤Π ˜ C (12) Definition. According to the definition of ˜ C and Law of Total Probability, we have: P(yg = i)P(PUθ(xg) = j|yg = i) = πi K+1 X k=1 P(O(xg) = k|yg = i)P(PUθ(xg) = j|O(xg) = k, yg = i) πi ˜ Cij (b) = πi K+1 X k=1 P g ikP(PUθ(xg) = j|O(xg) = k) πi ˜ Cij = πi K+1 X k=1 P g ikCkj, (13) where the last equation is met as p(xg) is close to p(x) when G is optimal, and thus P(PUθ(xg) = j|O(xg) = k) = P(PUθ(x) = j|O(x) = k). Then we consider the corresponding matrix form as follows Π ˜ C = ΠP gC (14) 13 Under review as a conference paper at ICLR 2022 where Π is the diagonal matrix of prior vector π. Combining Eq. 14 and 12, we have P g⊤ΠP g = Π, which indicates P g is a general orthogonal matrix. In addition, the element of P g is non-negative and the sum of each row is 1. Therefore, we have P g is a permutation matrix with permutation compared with the identity matrix only happens on rows r where corresponding πr, r ∈r are equal. Particularly, if all πi are different from each other, then permutation operation will not happen, indicating the optimal conditional of P g is the identity matrix. A.2 PROOF OF (2) We additionally denote yr as the real label of real sample xr, i.e., yr = O(xr). According to the optimal condition of G in Eq. 10, we have pr(xr) = pg(xg). Since we have P g is an identity matrix, then O(xg) = yg a.e. Thus, we have pg(xg|yg = i) = pg(xg|O(xg) = i), ∀i = 1, .., K + 1. According the assumption (c) and Eq. 10, we have pr(xr|O(xr) = i) = pg(xg|O(xg) = i). In addition, we know that pr(xr|O(xr) = i) = pr(xr|yr = i), thus we have pr(xr|yr = i) = pg(xg|yg = i). Further, we consider the identical class-prior πi. Finally, we have pr(xr|yr = i)πi = pg(xg|yg = i)πi pr(xr|yr = i)p(O(xr) = i) = pg(xg|yg = i)p(yg = i) pr(xr|yr = i)p(yr = i) = pg(xg|yg = i)p(yg = i) pr(xr, yr) = pg(xg, yg). (15) B APPENDIX: MORE RELATED WORKS Positive-Unlabeled (PU) Learning. Positive and Unlabeled (PU) Learning is the setting where a learner has only access to positive examples and unlabeled data (Bekker & Davis, 2020; Kiryo et al., 2017). One related work (Hou et al., 2018) employed GANs (Goodfellow et al., 2014) to recover both positive and negative data distribution to step away from overfitting. Kato et al. (Kato et al., 2018) focused on remedying the selection bias in the PU learning. Besides, Multi-Positive and Unlabeled Learning (Xu et al., 2017) extended the binary PU setting to the multi-class version, therefore adapting to more practical applications. By contrast, our multi- positive unlabeled method absorbs the advantages of previous approaches, and in the meanwhile intuitively extends them to fit the differential deep neural networks optimization. Conditional GANs on Few Labels Data. To attain high-quality images with both fidelity and diversity, the training of generative models requires a large dataset. To reduce the need of huge amount of data, the vast majority of methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempted to transfer prior knowledge of the pre-trained generator. Another branch (Lucic et al., 2019) is to leverage self- and supervised learning to add pseudo labels on the in-distribution unlabeled data in order to expand labeled dataset. Compared with this approach, our strategy can be viewed to automatically “pick” useful in-distribution data from total unknown unlabeled data via PU learning framework, and then constructs robust conditional GANs to generate clean data distribution out of predicted label noise. Robust GANs. Robust Conditional GANs (Thekumparampil et al., 2018; Kaneko et al., 2019) were pro- posed to defend against class-dependent noisy labels. The main idea of these methods is to corrupt labels of generated samples before feeding to the adversarial discriminator, forcing the generator to produce sample with clean labels. Another supplementary investigation (Koshy Thekumparampil et al., 2019) explored the scenario when CGANs get exposed to missing or ambiguous labels, while another work (Chrysos et al., 2018) leveraged the structure of the model in the target space to address this issue. In contrast, the noises in our model stem from the prediction error of a given classifier. We employ the imperfect classifier to estimate the label confusion noise, yielding a new branch of Robust CGANs against “classifier” label noises. Semi-Supervised Learning (SSL). One crucial issue in SSL (Miyato et al., 2018; Yu et al., 2019; Sun et al., 2019) is how to tackle with the mismatch of unlabeled and labeled data. Augmented Distribution Align- ment (Wang et al., 2019) was proposed to leverage adversarial training to alleviate the bias, but they focused on the empirical distribution mismatch owing to the limited number of labeled data. Further, Uncertainty Aware Self-Distillation (Yanbei Chen, 2019) was proposed to concentrate on this under-studied problem, which can guarantee the effectiveness of learning. In contrast, our approach leverages the PU learning to construct the “open world” classification. Out-Of-Distribution (OOD) Detection OOD Detection is one classical but always vibrant machine learning problem. PU learning can be used for the detection of outliers in an unlabeled dataset with knowledge only from a collection of inlier data (Hido et al., 2008; Smola et al., 2009). Another interesting and related 14 Under review as a conference paper at ICLR 2022 Table 2: Further evaluation of CGAN-P and Ours from the perspective of Inception Score on MNIST and Fashion-MNIST datasets. Positive Rates 0.75% 1.0% 3.0% 5.0% 10.0% Inception Score (± Standard Deviation) MNIST CGAN-P 5.08±0.02 5.10±0.03 5.09±0.02 5.14±0.03 5.10±0.04 Ours 5.60±0.01 5.59±0.02 5.65±0.02 5.52±0.01 5.63±0.02 Fashion-MNIST CGAN-P 4.95±0.03 5.01 ± 0.03 5.04 ± 0.04 5.02±0.04 5.00 ±0.03 Ours 4.99 ± 0.02 5.01 ± 0.02 5.03±0.01 5.07 ± 0.02 5.04 ± 0.02 work is Outlier Exposure (Hendrycks et al., 2018), an approach that leveraged an auxiliary dataset to enhance the anomaly detector based on existing limited data. This problem is similar to our generation task, the goal of which is to take better advantage of extra dataset, especially out-of-distribution data, to boost the generation. Learning from Noisy Labels Rotational-Decoupling Consistency Regularization (RDCR) (Tsung Wei Tsai, 2019) was designed to integrate the consistency-based methods with the self-supervised rotation task to learn noise-tolerant representations. Mutual Mean-Teaching (Ge et al., 2020) was proposed to refine the soft labels on person re-identification task by averaging the parameters of two neural networks . In addition, the data with noisy labels can also be viewed as bad data. Another work (Guo et al., 2019) provided a worst- case learning formulation from bad data, and designed a data-generation scheme in an adversarial manner, augmenting data to improve the current classifier. C APPENDIX: DETAILS ABOUT ALGORITHM 1 Similar in (Kiryo et al., 2017), we utilize the sigmoid loss ℓsig(t, y) = 1/(1 + exp(ty)) in the implementation of the PU learning. Besides, we denote ri = b R− u g; X i u  −πp b R− p g; X i p  in the i-th mini-batch. Instructed by the algorithm in (Kiryo et al., 2017), if ri < 0 we turn to optimize −∇θri in order to make this mini-batch less overfitting, which is slightly different from Eq. 4. D APPENDIX: DETAILS ABOUT EXPERIMENTS PU classifier and GAN architecture For the PU classifier, we employ 6 convolutional layers with dif- ferent number of filters on MNIST, Fashion-MNIST and CIFAR 10, respectively. For the GAN architecture, we leverage the architecture of generator and discriminator in the tradition conditional GANs (Mirza & Osin- dero, 2014). To guarantee the convergence of RCGAN-U, we replace Batch Normalization with Instance Batch Normalization. The latent space dimensions of generator are 128, 128, 256 for the three datasets, respectively. As for the optimization of GAN, we deploy the avenue same as WGAN-GP (Gulrajani et al., 2017) to pursue desirable generation quality. Specifically, we set update step of discriminator as 1. Fashion-MNIST: Positive Rate 0.3%, Initial PU: 85.41% Generator Label Accuracy 81.17% 94.95% 95.13% CGAN-A RCGAN-U CNI-CGAN Figure 6: Visualization of generated samples from several baselines and ours on Fashion-MNIST. 15 Under review as a conference paper at ICLR 2022 CIFAR-10: Positive Rate 0.3%, Initial PU: 79.46% CGAN-P CNI-CGAN Figure 7: Visualization of generated samples from CGAN-P and ours on CIFAR-10. Choice of Hyper-parameters We choose κ as 0.75, β as 5.0 and λ = 0.99 across all the approaches. The learning rates of PU classifier and CGAN are 0.001 and 0.0001, respectively. In the alternate minimization process, we set the update step as 1 for PU classifier after updating the CGAN, and L0 as 5 in Algorithm 1. We used the same and sufficient epoch for all settings (180 epochs for joint optimization) to guarantee the convergence as well as for fair comparisons. Further Evaluation of CGAN-P and Ours from the Aspect of Inception Score To better verify our approach can generate more pleasant images than CGAN-P, we additionally compare the Inception Score these two methods attain. Specifically, we trained a (almost) perfect classifier with 99.21 % and 91.33% accu- racy for MNIST and Fashion-MNIST respectively. Then we generate 50,000 samples from the two approaches to compute Inception Score, the results of which are exhibited in Table 2. It turns out that our method attain the consistent superiority against CGAN-P on the Inception Score for MNIST, even though the generator label accuracy of these two approaches are comparable. Note that the two method obtains the similar Inception Score on Fashion-MNIST, but our strategy outperforms CGAN-P significantly from the perspective of generator label accuracy. Overall, we can claim that our method is better than CGAN-P. E APPENDIX: MORE IMAGES We additionally show some generated images on other datasets generated by baselines and CNI-CGAN, shown in Figure 6. Note that we highlight the erroneously generated images with red boxes. Specifically, on Fashion- MNIST our approach can generated images with more accurate labels compared with CGAN-A and RCGAN- U. Additionally, the quality of generated images from our approach are much better than those from CGAN-P that only leverages limited supervised data, as shown in Figure 7 on CIFAR-10. 16