# EQUALIZED ROBUSTNESS: TOWARDS SUSTAINABLE FAIRNESS UNDER DISTRIBUTIONAL SHIFTS

**Anonymous authors** Paper under double-blind review

ABSTRACT

Increasing concerns have been raised about deep learning fairness in recent years. Existing fairness metrics and algorithms mainly focus on the discrimination of model performance across different groups on in-distribution data. It remains unclear whether the fairness achieved on in-distribution data can be generalized to data with unseen distribution shifts, which are commonly encountered in real-world applications. In this paper, we first propose a new fairness goal, termed Equalized Robustness (ER), to impose fair model robustness against unseen distribution shifts across majority and minority groups. ER measures robustness disparity by the maximum mean discrepancy (MMD) distance between the loss-curvature distributions of the two groups of data. We show that previous fairness learning algorithms designed for in-distribution fairness fail to meet this new robust fairness goal. We further propose a novel fairness learning algorithm, termed Curvature Matching (CUMA), to simultaneously achieve both traditional in-distribution fairness and our new robust fairness. CUMA debiases model robustness by minimizing the MMD distance between the loss-curvature distributions of the two groups. Experiments on three popular datasets show that CUMA achieves superior fairness in robustness against distribution shifts, without additional sacrifice in either overall accuracy or in-distribution fairness.

1 INTRODUCTION

With the wide deployment of deep learning in modern business applications concerning individual lives and privacy, there naturally emerge concerns about machine learning fairness (Podesta et al., 2014; Muñoz et al., 2016; Smuha, 2019). Research efforts on various fairness evaluation metrics and corresponding enforcement methods have been carried out (Edwards & Storkey, 2016; Hardt et al., 2016; Du et al., 2020). Specifically, many such metrics require some form of "equalized model performance" across different groups on in-distribution data. Examples include Demographic Parity (DP) (Edwards & Storkey, 2016), Equalized Opportunity (EOpp), and Equalized Odds (EO) (Hardt et al., 2016).

Unfortunately, when deployed in real-world applications, deep models commonly encounter data with unforeseeable distribution shifts (Hendrycks & Dietterich, 2019; Hendrycks et al., 2020; 2021). It has been shown that deep learning models can have drastically degraded performance (Hendrycks & Dietterich, 2019; Hendrycks et al., 2020; 2021; Taori et al., 2020) and show unreliable behaviors (Qiu et al., 2019; Yan et al., 2021) under unseen distribution shifts. Intuitively, previous fairness learning algorithms aim to optimize the model towards a local minimum where data from the majority and minority groups have similar average loss values (and thus similar in-distribution performance). However, those algorithms do not take into consideration the stability or "robustness" of the fairness-aware minima they find. Take object detection in a self-driving car as an example: the model might be calibrated on high-quality, clear images to be "fair" across different skin colors; however, such fairness may severely break down when the model is applied to data collected in adverse visual conditions, such as inclement weather, poor lighting, or other digital artifacts.
Our experiments also find that previous state-of-the-art fairness algorithms can be jeopardized when distributional shifts are present in the test data, as illustrated in Figure 1(b). The above findings beg the following question: _How can we achieve practically sustainable fairness, e.g., fairness that holds even under unseen distribution shifts?_

Figure 1: Illustration of the fairness achieved by normal training, traditional fair training, and our proposed robust fair training: (a) Normal training (unfair); (b) Traditional fair training (in-distribution fairness); (c) Robust fair training (in-distribution & robust fairness). Horizontal and vertical axes represent the input x and the corresponding loss value L(x), respectively. Solid blue curves show the loss landscapes. Circles denote majority data points (x_a and x′_a), while triangles denote minority data points (x_i and x′_i). Green points (x_a and x_i) are in-distribution data, while red ones (x′_a and x′_i) are sampled from test sets with distribution shifts. (a) Normal training results in unfair models: the minority group has worse performance (i.e., larger loss values). (b) Traditional fair training algorithms can achieve in-distribution fairness, but not in a robust way: a small distribution shift can break the fairness due to loss-curvature biases across groups. In fact, such learned fair models can have almost the same large bias as normally trained models when facing distribution shifts. (c) Our robust fair training algorithm simultaneously achieves fairness on in-distribution data and under distribution shifts, by matching both loss values and loss curvatures across groups.

To answer this question, we first propose a new fairness objective, termed Equalized Robustness (ER), which imposes equalized robustness against unseen distribution shifts across the majority and minority groups, so that the learned fairness can sustain even when the test data are perturbed. ER explicitly considers a new dimension of fairness that is practically significant yet so far largely overlooked. In other words, ER assesses "out-of-distribution" fairness. It therefore works as a complement to, rather than a replacement for, previous fairness metrics, which assess "in-distribution" fairness. Previous research has shown that model robustness against input perturbations is highly correlated with loss-curvature smoothness (Bartlett et al., 2017; Moosavi-Dezfooli et al., 2019; Weng et al., 2018). Our experiments also observe that the local loss curvature of the minority group is often larger than that of the majority group, leading to a robustness discrepancy between the two groups under distribution shifts. To this end, we propose to empirically quantify the robustness discrepancy as the maximum mean discrepancy (MMD) (Gretton et al., 2012) distance between the local model-smoothness distributions of data samples from the majority and minority groups. We experimentally demonstrate that our new metric aligns well with model performance under real-world distribution shifts. On top of that, we further propose a new fair learning algorithm, termed Curvature Matching (CUMA), to simultaneously achieve both traditional in-distribution fairness and ER. CUMA matches the local curvature distributions of data points from the two groups, as illustrated in Figure 1(c), by adding a curvature-matching regularizer that can be efficiently computed via a one-shot power iteration method.
Our code will be released upon acceptance. Our contributions can be summarized as follows:

- We propose Equalized Robustness (ER), a new fairness objective for machine learning models, which imposes equalized model robustness against unforeseeable distribution shifts across majority and minority groups.
- We further propose a new fairness learning algorithm dubbed Curvature Matching (CUMA), which enforces ER during training by utilizing a one-shot power iteration method.
- Experiments show that CUMA achieves much more robust fairness against distribution shifts, without additional sacrifice in either overall accuracy or in-distribution fairness, compared with traditional in-distribution fair learning methods.

2 PRELIMINARIES

2.1 MACHINE LEARNING FAIRNESS

**Problem Setting and Metrics** Machine learning fairness can be broadly categorized into individual fairness and group fairness (Du et al., 2020). Individual fairness requires similar inputs to have similar predictions (Dwork et al., 2012). Compared with individual fairness, group fairness is a more popular setting and is thus the focus of this paper. Given input data X ∈ ℝⁿ with sensitive attributes A ∈ {0, 1} and corresponding ground-truth labels Y ∈ {0, 1}, group fairness requires a learned binary classifier f(·; θ): ℝⁿ → {0, 1}, parameterized by θ, to give equally accurate predictions (denoted Ŷ := f(X)) on the two groups with A = 0 and A = 1. Multiple fairness criteria have been defined in this context. Demographic parity (DP) (Edwards & Storkey, 2016) requires an identical ratio of positive predictions between the two groups: P(Ŷ = 1 | A = 0) = P(Ŷ = 1 | A = 1). Equalized Odds (EO) (Hardt et al., 2016) requires identical false positive rates (FPRs) and false negative rates (FNRs) between the two groups: P(Ŷ ≠ Y | A = 0, Y = y) = P(Ŷ ≠ Y | A = 1, Y = y), ∀y ∈ {0, 1}. Equalized Opportunity (EOpp) (Hardt et al., 2016) requires only equal FNRs between the groups: P(Ŷ ≠ Y | A = 0, Y = 0) = P(Ŷ ≠ Y | A = 1, Y = 0). Based on these fairness criteria, quantitative metrics are defined to measure fairness. Specifically, the DP, EO and EOpp distances (Madras et al., 2018) are defined as follows:

$$\Delta_{DP} := \left| P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \right| \quad (1)$$

$$\Delta_{EO} := \sum_{y\in\{0,1\}} \left| P(\hat{Y}\neq Y \mid A=0, Y=y) - P(\hat{Y}\neq Y \mid A=1, Y=y) \right| \quad (2)$$

$$\Delta_{EOpp} := \left| P(\hat{Y}\neq Y \mid A=0, Y=0) - P(\hat{Y}\neq Y \mid A=1, Y=0) \right| \quad (3)$$

MMD has previously been used to define fairness metrics: Quadrianto & Sharmanska (2017) define a more general fairness metric using the MMD distance and show ∆DP, ∆EO and ∆EOpp to be special cases of their unified metric. All these metrics consider in-distribution fairness, while our Equalized Robustness is the first fairness metric explicitly aware of robustness under unseen distribution shifts.

**Bias Mitigation Methods** Many methods have been proposed to mitigate model bias. Data pre-processing methods such as re-weighting (Kamiran & Calders, 2012) and data transformation (Calmon et al., 2017) have been used to reduce discrimination before model training. In contrast, Hardt et al. (2016) and Zhao et al. (2017) propose post-processing methods that calibrate model predictions towards a desired fair distribution after model training. Instead of pre- or post-processing, researchers have also explored enforcing fairness during training. For example, Madras et al.
(2018) use an adversarial training technique and show that the learned fair representations can transfer to unseen target tasks. The key technique, adversarial training (Edwards & Storkey, 2016), was designed for feature disentanglement on hidden representations, such that sensitive (Edwards & Storkey, 2016) or domain-specific information (Ganin et al., 2016) is removed while other information useful for the target task is kept. The hidden representations are typically the outputs of intermediate layers of neural networks (Ganin et al., 2016; Edwards & Storkey, 2016; Madras et al., 2018). In contrast, methods like adversarial debiasing (Zhang et al., 2018) and its simplified version (Wadsworth et al., 2018) directly apply the adversary to the output layer of the classifier, which also promotes model fairness. Observing that unfairness can arise from ignoring the worst-case learning risk of specific samples, Hashimoto et al. (2018) propose distributionally robust optimization, which provably bounds the worst-case risk over groups. Creager et al. (2019) propose a flexible fair representation learning framework based on the VAE (Kingma & Welling, 2013) that can be easily adapted to different sensitive-attribute settings at run time. Sarhan et al. (2020) use orthogonality constraints as a proxy for independence to disentangle utility and sensitive representations. Martinez et al. (2020) formulate group fairness with multiple sensitive attributes as a multi-objective learning problem and propose a simple optimization algorithm to find Pareto-optimal solutions. Another line of research focuses on learning unbiased representations from biased ones (Bahng et al., 2020; Nam et al., 2020). Bahng et al. (2020) propose a framework that learns unbiased representations by explicitly enforcing them to be different from a set of pre-defined biased representations. Nam et al. (2020) observe that data bias can be either benign or malicious, and that removing malicious bias alone can achieve fairness. Li & Vasconcelos (2019) jointly learn the network parameters and a data re-sampling weight distribution that penalizes easy samples.

**Applications in Computer Vision** While many fairness metrics and debiasing algorithms are designed for general learning problems, as discussed above, there is also a line of research and applications focusing on fairness in computer vision tasks. For instance, Buolamwini & Gebru (2018) show that commercial gender-recognition systems have substantial accuracy disparities across groups with different genders and skin colors. Wilson et al. (2019) observe that state-of-the-art segmentation models achieve better performance on pedestrians with lighter skin colors. In (Shankar et al., 2017; de Vries et al., 2019), it is found that the common geographical bias in public image databases can lead to strong performance disparities among images from locales with different income levels. Nagpal et al. (2019) reveal that the focus region of face-classification models depends on people's ages or races, which may explain the source of age and race biases in classifiers. Building on this awareness, many efforts have been devoted to mitigating such biases in computer vision tasks. Wang et al. (2019) show the effectiveness of the adversarial debiasing technique (Zhang et al., 2018) in fair image classification and activity recognition tasks.
Beyond supervised learning, FairFaceGAN (Hwang et al., 2020) is proposed to prevent undesired sensitive-feature translation during image editing. Similar ideas have also been successfully applied to visual question answering (Park et al., 2020).

2.2 MODEL ROBUSTNESS AND SMOOTHNESS

Model generalization ability and robustness have been shown to be highly correlated with model smoothness (Moosavi-Dezfooli et al., 2019; Weng et al., 2018). Weng et al. (2018) and Guo et al. (2018) use the local Lipschitz constant to estimate model robustness against small input perturbations within a hyper-ball. Moosavi-Dezfooli et al. (2019) improve model robustness by adding a curvature constraint that encourages model smoothness. Miyato et al. (2018) approximate local model smoothness by the spectral norm of the Hessian matrix, and improve robustness against adversarial attacks by regularizing model smoothness.

3 EQUALIZED ROBUSTNESS: A NEW METRIC FOR FAIR GENERALIZATION AND ROBUSTNESS

Consider a binary classifier f(·; θ) trained on two groups of data, X1 and X2. Our goal is to define a metric that measures the gap in model robustness between the two groups. Formulating such a metric is highly non-trivial, with difficulties mainly from two aspects.

The first challenge is that we need to ensure fair generalization against the many unseen distribution shifts that may be encountered in real-world applications. A trivial solution would be to select a set of predefined distribution shifts and measure the average performance gap (e.g., ∆EO) against them. However, this approach requires engineering overhead in handcrafting the predefined distribution shifts, and the predefined shifts may not be representative enough to cover all unseen cases. Previous research (Miyato et al., 2018; Moosavi-Dezfooli et al., 2019; Guo et al., 2018; Weng et al., 2018) has shown, both theoretically and empirically, that a deep model's robustness scales with its smoothness. Following (Miyato et al., 2018; Moosavi-Dezfooli et al., 2019), we use the spectral norm of the Hessian matrix to approximate local smoothness as an indicator of model robustness. Specifically, given an input x, the Hessian matrix H(x) is defined as the second-order gradient of L(x) with respect to the input x: H(x) = ∇²_x L(x). The approximated local curvature C(x) at point x is then defined as:

$$C(x) = \sigma(H(x)), \quad (4)$$

where σ(H) is the spectral norm of H: $\sigma(H) = \sup_{v:\|v\|_2=1} \|Hv\|_2$. Intuitively, C(x) measures the maximal directional curvature or rate of change at x. Thus, a smaller C(x) indicates better local smoothness around x (Miyato et al., 2018; Moosavi-Dezfooli et al., 2019).

For the second difficulty, unlike previous fairness metrics where the target random variable[1] follows a Bernoulli distribution, the local curvature used in ER is a continuous random variable without a simple underlying distribution. The unknown distribution form makes it difficult to directly measure the difference between the curvature distributions with a parametric statistical test or divergence (e.g., a t-test or the KL divergence). To tackle this problem, we utilize the maximum mean discrepancy (MMD) (Gretton et al., 2012) to perform a two-sample test on C(X1) and C(X2). MMD is a distribution-distance measure that is agnostic to the exact form of the distributions and is based only on the difference of their (kernel) mean embeddings.

[1] Such as Y = 1 in DP and Y ≠ Ŷ in EO and EOpp. (See Section 2.1.)
Formally, our new fairness metric for Equalized Robustness is defined as follows:

**Our new fairness metric ∆ER.** Consider a machine learning model f trained on two groups of data, X1 and X2. Suppose C(X1) ∼ P1 and C(X2) ∼ P2; then the model's ∆ER is defined as the squared maximum mean discrepancy (MMD) distance between C(X1) and C(X2):

$$\Delta_{ER} = \mathrm{MMD}^2(\mathcal{P}_1, \mathcal{P}_2). \quad (5)$$

MMD is widely used to measure the distance between two high-dimensional distributions in deep learning (Li et al., 2015; 2017; Bińkowski et al., 2018). The MMD distance between two distributions P and Q is defined as

$$\mathrm{MMD}^2(\mathcal{P}, \mathcal{Q}) = \|\mu_{\mathcal{P}} - \mu_{\mathcal{Q}}\|^2_{\mathcal{H}} = \mathbb{E}_{\mathcal{P}}[k(X, X')] - 2\,\mathbb{E}_{\mathcal{P},\mathcal{Q}}[k(X, Y)] + \mathbb{E}_{\mathcal{Q}}[k(Y, Y')], \quad (6)$$

where X, X′ ∼ P, Y, Y′ ∼ Q, and k(·, ·) is the kernel function. In practice, we use finite samples from P and Q to statistically estimate their MMD distance:

$$\mathrm{MMD}^2(\mathcal{P}, \mathcal{Q}) = \frac{1}{M^2}\sum_{i=1}^{M}\sum_{i'=1}^{M} k(x_i, x_{i'}) - \frac{2}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} k(x_i, y_j) + \frac{1}{N^2}\sum_{j=1}^{N}\sum_{j'=1}^{N} k(y_j, y_{j'}), \quad (7)$$

where $\{x_i \sim \mathcal{P}\}_{i=1}^{M}$, $\{y_j \sim \mathcal{Q}\}_{j=1}^{N}$, and we use the mixed RBF kernel function $k(x, y) = \sum_{\sigma \in S} e^{-\|x-y\|^2 / (2\sigma^2)}$ with hyperparameter S = {1, 2, 4, 8, 16}. Ablation studies on S values are conducted in Section 5.3.

4 CURVATURE MATCHING: FAIR MACHINE LEARNING TOWARDS EQUALIZED ROBUSTNESS

4.1 PRACTICAL CURVATURE APPROXIMATION

To achieve equalized robustness, one intuitive solution is to add ∆ER (Eq. (5)) as a regularization term to the loss function during training. However, it is impractical to precisely calculate the spectral norm (which equals the absolute value of the dominant eigenvalue) of the Hessian matrix in ∆ER. To solve this problem, we use a one-shot power iteration method (PIM) to approximate C(x) during training. First we rewrite C(x) in the following form: C(x) = σ(H(x)) = ∥H(x)v∥, where v is the dominant eigenvector (associated with the maximal eigenvalue), which can be computed by the power iteration method. In practice, to increase training efficiency, we use a one-shot power iteration. Specifically, we estimate the dominant eigenvector v by the sign of the gradient direction: ṽ := sign(g)/∥sign(g)∥ ≈ v, where g = ∇_x L(x). This is because previous works have observed a large similarity between the dominant eigenvector and the gradient direction (Miyato et al., 2018; Moosavi-Dezfooli et al., 2019). We further approximate the Hessian-vector product by a finite difference of gradients: H(x)v ≈ (∇_x L(x + hv) − ∇_x L(x))/h, where h is a small constant. As a result, the final approximation of the local curvature is

$$C(x) \approx \tilde{C}(x) := \frac{\|\nabla_x L(x + h\tilde{v}) - \nabla_x L(x)\|}{|h|}. \quad (8)$$

4.2 CURVATURE MATCHING

With the practical curvature approximation, we can now match the curvature distributions of the two groups by minimizing their MMD distance. Suppose C̃(X1) ∼ Q1 and C̃(X2) ∼ Q2; we define the curvature-matching loss as:

$$L_{cm} = \mathrm{MMD}^2(\mathcal{Q}_1, \mathcal{Q}_2). \quad (9)$$

We add L_cm as a regularizer to the traditional adversarially fair training loss (Ganin et al., 2016; Madras et al., 2018), in order to attain both in-distribution fairness and fair robustness.
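To make these quantities concrete, the following is a minimal PyTorch sketch of the mixed RBF MMD estimator (Eq. (7)) and the one-shot power-iteration curvature approximation (Eq. (8)), combined into the curvature-matching loss of Eq. (9). It is an illustrative sketch rather than the released implementation: `model`, `loss_fn`, and the per-group batches are placeholders, the MMD estimate is the simple biased V-statistic, and `create_graph=True` is assumed so that the regularizer can be backpropagated into the model parameters during training.

```python
import torch


def mixed_rbf_kernel(x, y, scales=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Mixed RBF kernel k(x, y) = sum_{sigma in S} exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = torch.cdist(x, y).pow(2)  # (M, N) pairwise squared distances
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in scales)


def mmd2(x, y, scales=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Biased (V-statistic) estimate of the squared MMD in Eq. (7)."""
    kxx = mixed_rbf_kernel(x, x, scales).mean()
    kyy = mixed_rbf_kernel(y, y, scales).mean()
    kxy = mixed_rbf_kernel(x, y, scales).mean()
    return kxx - 2.0 * kxy + kyy


def approx_curvature(model, loss_fn, x, y, h=1.0):
    """One-shot power-iteration approximation of C(x) = sigma(H(x)), Eq. (8).

    `loss_fn` is assumed to reduce to a scalar (e.g., mean cross-entropy).
    """
    x = x.detach().requires_grad_(True)
    g = torch.autograd.grad(loss_fn(model(x), y), x, create_graph=True)[0]

    # v ~ sign(g) / ||sign(g)||, one perturbation direction per sample.
    v = g.sign()
    v = v / v.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, *([1] * (x.dim() - 1)))

    x_shift = (x + h * v).detach().requires_grad_(True)
    g_shift = torch.autograd.grad(loss_fn(model(x_shift), y), x_shift, create_graph=True)[0]

    # ||grad_x L(x + h v) - grad_x L(x)|| / |h|, one curvature value per sample.
    return (g_shift - g).flatten(1).norm(dim=1) / abs(h)


def curvature_matching_loss(model, loss_fn, x_major, y_major, x_minor, y_minor, h=1.0):
    """L_cm = MMD^2 between the two groups' curvature samples, Eq. (9)."""
    c1 = approx_curvature(model, loss_fn, x_major, y_major, h).unsqueeze(1)
    c2 = approx_curvature(model, loss_fn, x_minor, y_minor, h).unsqueeze(1)
    return mmd2(c1, c2)
```

In a training step, `curvature_matching_loss` would be weighted by γ and added to the classification and adversarial terms, as in the overall objective of Eq. (10) below.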
Figure 2: The overall framework of CUMA. x is the input sample; f_s(·; θ_s) is the shared backbone; h_t(·; θ_t) is the utility head for the target task, so that f(·; θ) = h_t(f_s(·; θ_s); θ_t); and h_a(·; θ_a) is the adversarial head that predicts sensitive attributes. C(·) is the curvature estimation function defined in Eq. (4). Q1 and Q2 are the local curvature distributions of the majority and minority groups, respectively. L_cm, L_clf and L_adv are the three loss terms defined in Eqs. (9) and (11).

As illustrated in Figure 2, our model follows the same "two-head" structure as traditional adversarial learning frameworks (Ganin et al., 2016; Madras et al., 2018), where h_t is the utility head for the target task, h_a is the adversarial head that predicts sensitive attributes, and f_s is the shared backbone.[2] Suppose that for each sample x_i the sensitive attribute is a_i and the corresponding target label is y_i; then our overall optimization problem can be written as:

$$\min_{\theta_s, \theta_t} \max_{\theta_a} L = \min_{\theta_s, \theta_t} \max_{\theta_a} \left( L_{clf} - \alpha L_{adv} + \gamma L_{cm} \right), \quad (10)$$

where

$$L_{clf} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(h_t(f_s(x_i; \theta_s); \theta_t), y_i\big), \qquad L_{adv} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(h_a(f_s(x_i; \theta_s); \theta_a), a_i\big), \quad (11)$$

ℓ(·, ·) is the cross-entropy loss function, α and γ are trade-off hyperparameters, and N is the number of training samples.

5 EXPERIMENTS

5.1 EXPERIMENTAL SETUP

**Datasets and pre-processing** Experiments are conducted on three datasets widely used to evaluate machine learning fairness: Communities and Crime (C&C) (Redmond & Baveja, 2002), Adult (Kohavi, 1996), and CelebA (Liu et al., 2015).[3] The C&C dataset has 1,994 samples with neighborhood population statistics, of which 1,500 are used for training and the rest for evaluation. The target task is to predict violent crime per capita, and we use "RacePctBlack" (percentage of black population in the neighborhood) and "FemalePctDiv" (divorce ratio of females in the neighborhood) as sensitive attributes. All features in the C&C dataset are continuous values in [0, 1]. To fit the fairness problem setting, we binarize the target and sensitive attributes using the top-30% largest value as the threshold.[4] We also apply data whitening on C&C. The Adult dataset has 48,842 samples with basic personal information such as education and occupation, of which 30,000 are used for training and the rest for evaluation. The target task is to predict a person's annual income, and we use "gender" (male or female) as the sensitive attribute. The features in the Adult dataset are either continuous (e.g., age) or categorical (e.g., sex). We use one-hot encoding on the categorical features and then concatenate them with the continuous ones. We apply data whitening on the concatenated features. CelebA has over 200,000 images of celebrity faces with 40 attribute annotations. The target task is to predict the "attractiveness" attribute, and the sensitive attributes to protect are "chubby" and "eyeglasses". We randomly select 45,000 training samples and 5,000 testing samples. All images are center-cropped and resized to 128 × 128, and pixel values are scaled to [0, 1].

[2] Thus the binary classifier f(·; θ) = h_t(f_s(·; θ_s); θ_t), with θ = θ_t ∪ θ_s.
[3] Traditional image classification datasets (e.g., ImageNet) are not directly applicable since they lack fairness attribute labels.
[4] As a result, P[A = 0] = 30% and P[Y = 0] = 30%.
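As a concrete illustration of the tabular pre-processing described above, the sketch below binarizes a target or sensitive attribute at the top-30% threshold and builds whitened feature vectors. The column names are placeholders, per-feature standardization is used as a simple stand-in for the whitening step, and labeling the top fraction as 0 follows footnote 4; the actual pipeline may differ.

```python
import numpy as np
import pandas as pd


def binarize_top_fraction(values, top=0.30):
    # Threshold at the top-`top` fraction of largest values; the top fraction is
    # labeled 0, so that P[label = 0] = top (cf. footnote 4).
    threshold = np.quantile(values, 1.0 - top)
    return (np.asarray(values) < threshold).astype(int)


def preprocess_tabular(df, continuous_cols, categorical_cols):
    # One-hot encode categorical columns, concatenate with continuous ones, and
    # standardize each feature (a simple stand-in for the whitening step).
    onehot = pd.get_dummies(df[categorical_cols].astype(str))
    feats = pd.concat([df[continuous_cols], onehot], axis=1).to_numpy(dtype=np.float32)
    mean, std = feats.mean(axis=0), feats.std(axis=0) + 1e-8
    return (feats - mean) / std
```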
Table 1: Results on the C&C dataset with "RacePctBlack" as the sensitive attribute. ∆EOpp and ∆EO on the original test set measure in-distribution fairness; ∆ER and the metrics under noise measure robust fairness under distribution shifts. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), uniform | ∆EO (↓), uniform |
|---|---|---|---|---|---|---|---|---|
| Normal | **89.05** | 38.52 | 63.22 | 46.16 | 35.43 | 60.13 | 39.51 | 64.21 |
| AdvDebias | 84.79 | 26.68 | 39.84 | 21.77 | 26.68 | 39.84 | 23.65 | 36.81 |
| LAFTR | 85.80 | 13.32 | 28.83 | 16.98 | 13.53 | 29.04 | 16.69 | 32.20 |
| CUMA | 85.20±1.70 | **12.71±1.47** | **28.17±1.70** | **7.59±0.19** | **10.17±0.89** | **28.69±1.92** | **12.85±2.98** | **27.11±0.82** |

Table 2: Results on the C&C dataset with "FemalePctDiv" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), uniform | ∆EO (↓), uniform |
|---|---|---|---|---|---|---|---|---|
| Normal | **85.60** | 17.28 | 54.74 | 67.69 | 17.63 | 56.41 | 18.77 | 54.60 |
| AdvDebias | 83.57 | 12.80 | 38.73 | 37.17 | 12.80 | 38.73 | 11.38 | 37.15 |
| LAFTR | 83.16 | 11.73 | 27.83 | 28.15 | 11.73 | 29.30 | 11.38 | 30.11 |
| CUMA | 83.39±1.01 | **8.65±0.59** | **27.57±0.74** | **27.70±1.04** | **8.71±0.88** | **27.70±1.04** | **9.63±1.37** | **28.35±1.73** |

**Models** For the C&C and Adult datasets, we use two-layer MLPs for f_s, h_t and h_a. For the CelebA dataset, we use ResNet18 as the backbone, where the first three stages are used as f_s and the last stage (together with the fully connected classification layer) is used as h_t. The auxiliary adversarial head h_a has the same structure as h_t. Detailed model structures are described in Appx. A.

**Baseline Methods** We compare CUMA with the following state-of-the-art in-distribution fairness algorithms. Adversarial debiasing (AdvDebias) (Zhang et al., 2018) is one of the most popular fair training algorithms based on adversarial training (Ganin et al., 2016). Madras et al. (2018) propose a similar framework termed Learned Adversarially Fair and Transferable Representations (LAFTR), which replaces the cross-entropy loss used in (Zhang et al., 2018) with a group-normalized ℓ1 loss and is shown to work better on highly unbalanced datasets. We also include normal (fairness-ignorant) training as a baseline.

**Evaluation Metric** We use three different groups of evaluation metrics: overall accuracy, in-distribution fairness metrics, and robust fairness metrics. We report the overall accuracy on all test samples of the original test sets. To measure in-distribution fairness, we use ∆EOpp and ∆EO on the original test sets. To measure robust fairness under distribution shifts, we use our newly proposed ∆ER on the original test sets, as well as ∆EOpp and ∆EO on a set of pre-defined real-world distribution shifts. We intend to show that ∆ER calculated on the original test sets aligns well with robust fairness under real-world distribution shifts. See the "Distributional shifts" paragraph below for the details of constructing the distribution shifts.
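For reference, the in-distribution fairness metrics of Eqs. (1)–(3) can be estimated from model predictions as in the following NumPy sketch; ∆ER is instead computed from per-sample curvatures with the MMD estimator sketched in Section 4. The function below is illustrative and assumes binary labels, predictions, and group indicators.

```python
import numpy as np


def fairness_gaps(y_true, y_pred, group):
    """Empirical Delta_DP, Delta_EOpp and Delta_EO (Eqs. (1)-(3)) from binary
    labels, binary predictions and a binary sensitive-group indicator."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))

    def pos_rate(a):
        # P(Y_hat = 1 | A = a)
        return y_pred[group == a].mean()

    def err_rate(a, y):
        # P(Y_hat != Y | A = a, Y = y)
        m = (group == a) & (y_true == y)
        return (y_pred[m] != y_true[m]).mean()

    dp = abs(pos_rate(0) - pos_rate(1))
    eopp = abs(err_rate(0, 0) - err_rate(1, 0))
    eo = sum(abs(err_rate(0, y) - err_rate(1, y)) for y in (0, 1))
    return {"DP": dp, "EOpp": eopp, "EO": eo}
```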
**Distributional shifts** On the Adult and C&C datasets, we construct two distribution shifts by adding random Gaussian and uniform noise, respectively, to the test data. Specifically, following (Madras et al., 2018; Zhang et al., 2018), the categorical features in the Adult and C&C datasets are first one-hot encoded and then whitened into real-valued vectors, to which the noise is added. Both types of noise have mean µ = 0 and standard deviation σ = 0.03. On the CelebA dataset, following (Hendrycks & Dietterich, 2019), we construct two distribution shifts by adding random Gaussian noise (with mean µ = 0 and standard deviation σ = 0.08) and impulse noise (with ratio p = 0.03), respectively. We report the fairness in robustness against other settings of distribution shifts in Appx. C.

**Implementation Details** Unless otherwise specified, we set the loss trade-off parameter α to 1 in all experiments by default. We use the Adam optimizer (Kingma & Ba, 2014) with initial learning rate 10⁻³ and weight decay 10⁻⁵. The learning rate is gradually decreased to 0 by a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016). On both the Adult and C&C datasets, we train for 50 epochs from scratch for all methods. On the CelebA dataset, we first normally train a model for 100 epochs and then finetune it for 20 epochs using CUMA. For a fair comparison, we train for 120 epochs on CelebA for all baseline methods. The constant h in Eq. (8) is set to 1 by default. For more implementation details, please see Appx. A.
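A minimal sketch of how such corrupted test copies could be generated is given below. Only the mean, standard deviation, and corruption ratio are specified above, so the exact parameterization here (mapping the stated standard deviation to a uniform range, and implementing impulse noise as salt-and-pepper corruption of pixels in [0, 1]) is an assumption.

```python
import numpy as np


def gaussian_shift(x, sigma=0.03, rng=None):
    # Additive zero-mean Gaussian noise with standard deviation sigma.
    rng = np.random.default_rng(0) if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)


def uniform_shift(x, sigma=0.03, rng=None):
    # Additive zero-mean uniform noise; U(-a, a) has standard deviation a / sqrt(3),
    # so a = sigma * sqrt(3) matches the requested standard deviation.
    rng = np.random.default_rng(0) if rng is None else rng
    a = sigma * np.sqrt(3.0)
    return x + rng.uniform(-a, a, size=x.shape)


def impulse_shift(img, p=0.03, rng=None):
    # Salt-and-pepper corruption of an image with pixel values in [0, 1]:
    # each pixel is independently set to 0 or 1 with probability p.
    rng = np.random.default_rng(0) if rng is None else rng
    out = np.array(img, copy=True)
    mask = rng.random(out.shape) < p
    out[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(out.dtype)
    return out
```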
Table 3: Results on the Adult dataset with "Sex" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), uniform | ∆EO (↓), uniform |
|---|---|---|---|---|---|---|---|---|
| Normal | **86.11** | 6.65 | 15.45 | 34.25 | 6.66 | 15.01 | 6.87 | 15.72 |
| AdvDebias | 85.17 | 5.12 | 5.92 | 16.78 | 5.10 | 5.95 | 5.77 | 7.29 |
| LAFTR | 85.97 | 6.28 | 11.96 | 25.38 | 6.22 | 12.08 | 6.45 | 12.06 |
| CUMA | 85.30±0.73 | **4.83±0.24** | **4.77±0.34** | **5.59±0.28** | **4.74±0.32** | **4.81±0.51** | **5.43±0.19** | **6.87±0.31** |

Table 4: Results on the CelebA dataset with "Chubby" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), impulse | ∆EO (↓), impulse |
|---|---|---|---|---|---|---|---|---|
| Normal | **91.25** | 38.45 | 42.56 | 59.34 | 39.16 | 43.90 | 39.76 | 44.51 |
| AdvDebias | 90.48 | **26.41** | 29.73 | 42.65 | 28.95 | 35.46 | 29.73 | 36.48 |
| LAFTR | 89.92 | 26.54 | **29.10** | 39.16 | 27.94 | 34.60 | 28.96 | 35.12 |
| CUMA | 89.97±0.38 | 27.19±0.75 | 30.26±0.95 | **23.23±0.39** | **27.62±0.85** | **31.49±1.28** | **27.97±0.48** | **31.74±1.14** |

5.2 MAIN RESULTS

Experimental results on the three datasets with different sensitive attributes are shown in Tables 1–5, where we compare CUMA with the baseline methods on the three groups of metrics discussed in Section 5.1. "Normal" means standard training without any fairness regularization. All numbers are shown as percentages. Several findings can be drawn from the results.

First, we see that previous state-of-the-art fairness learning algorithms are jeopardized when distributional shifts are present in the test data. For example, on the C&C dataset with "RacePctBlack" as the sensitive attribute (Table 1), LAFTR achieves ∆EO = 28.83% on the in-distribution test set, while that number increases to 32.20% on the test set perturbed with uniform random noise. Similarly, AdvDebias achieves ∆EO = 29.73% on the original CelebA test set with "chubby" as the sensitive attribute (Table 4), while that number increases to 35.46% and 36.48% on the test sets perturbed with Gaussian and impulse noise, respectively.

Second, we see that CUMA achieves the best robust fairness under distribution shifts on all three benchmark datasets with different sensitive-attribute settings, while maintaining similar in-distribution fairness and overall accuracy. For example, on the C&C dataset with "RacePctBlack" as the sensitive attribute (Table 1), CUMA achieves 2.73% and 4.82% lower ∆EO than the second-best performer (LAFTR) under distribution shifts from additive Gaussian and uniform noise, respectively. Moreover, in the same setting, although CUMA and LAFTR achieve almost identical in-distribution fairness (the difference between their ∆EO on the original test set is within 0.5%), CUMA keeps (and even improves) its fairness under distribution shifts (e.g., 1.33% smaller ∆EO under uniform noise), while the fairness achieved by LAFTR is jeopardized under both types of distribution shifts (e.g., 3.37% larger ∆EO under uniform noise). Similarly, on the CelebA dataset with "Chubby" as the sensitive attribute, LAFTR has even slightly better in-distribution fairness than CUMA. However, when the test sets have distribution shifts, the fairness achieved by LAFTR is jeopardized (with 5.50% and 6.02% more ∆EO under Gaussian and impulse noise, respectively), while CUMA keeps its fairness and achieves better fairness under distribution shifts (e.g., 2.50% and 3.17% less ∆EO compared with LAFTR).
Third, for all three datasets, the ∆ER calculated on the original test set correlates strongly with the traditional fairness metrics (e.g., ∆EOpp, ∆EO) calculated on the perturbed test sets: the smaller the ∆ER on the in-distribution test set, the smaller the ∆EO on the perturbed test sets. This shows that our new metric ∆ER aligns well with robust fairness under real-world distribution shifts, and validates using it as an indicator of the disparity in model robustness. More experimental results are shown in Appx. B (trade-off curves between fairness and accuracy) and Appx. C (results on other settings of distributional shifts).

Table 5: Results on the CelebA dataset with "Eyeglasses" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), impulse | ∆EO (↓), impulse |
|---|---|---|---|---|---|---|---|---|
| Normal | **90.52** | 36.40 | 43.96 | 54.38 | 35.62 | 42.91 | 37.92 | 45.63 |
| AdvDebias | 88.65 | **23.15** | **32.56** | 41.06 | **25.70** | 36.41 | 23.92 | 33.46 |
| LAFTR | 89.72 | 24.90 | 35.48 | 42.93 | 26.12 | 37.94 | 24.52 | 34.10 |
| CUMA | 89.10±0.13 | 24.16±0.40 | 33.39±0.22 | **32.56±0.41** | 25.76±0.50 | **34.77±0.47** | **22.61±0.06** | **31.68±0.15** |

5.3 ABLATION STUDY

**Ablation Study on ∆ER** In this section, we study how well ∆ER predicts robust fairness and how sensitive ∆ER is with respect to S (the sampling set for σ in the mixed RBF kernel function, as described in Section 3). A small σ makes ∆ER more sensitive to differences between the two sample sets, which may be caused either by a true discrepancy between the distributions or by the noise introduced by sampling. In contrast, a large σ may under-estimate the discrepancy. Thus, a proper S should include a wide range of σ values to avoid the domination of either large or small values. In this paper, we choose a geometric sequence with base 2, i.e., S = {1, 2, 4, 8, 16}. Furthermore, we compare ∆ER values under three different sets: S1 = {0.25, 0.5, 1, 2, 4}, S2 = {1, 2, 4, 8, 16} (the default S as defined in Section 3), and S3 = {4, 8, 16, 32, 64}. Results are shown in Table 6. As in Section 5.2, we empirically evaluate robust fairness by ∆EO on the test set corrupted by uniform noise. From the results, we observe that under all three S settings, ∆ER aligns well with the model's fairness under distribution shifts (∆EO under uniform noise).

Table 6: ∆ER values with different mixed RBF kernel scale parameter sets S. ∆ER is computed on the original test set; ∆EO is computed on the test set with uniform noise. Results are reported on the C&C dataset with "RacePctBlack" as the sensitive attribute. Models are trained by CUMA with different γ values.

| | ∆ER, S = S1 | ∆ER, S = S2 | ∆ER, S = S3 | ∆EO, uniform noise |
|---|---|---|---|---|
| γ = 0.1 | 12.72 | 13.52 | 11.06 | 31.09 |
| γ = 1 | 8.56 | 7.61 | 4.22 | 27.02 |
| γ = 10 | 8.40 | 7.24 | 4.02 | 26.98 |
**Ablation Study on CUMA** In this section, we check the sensitivity of CUMA with respect to its hyperparameters: α and γ in Eq. (10), and h in Eq. (8). Results are reported in Table 7. The trade-off between overall accuracy and robust fairness peaks at around α = 1, so we use it as the default α value. When fixing α = 1, the best trade-off between overall accuracy and robust fairness is achieved at around γ = 1, which we use as the default γ. Varying the value of h hardly affects the performance of CUMA.

Table 7: Ablation study results on the loss trade-off parameters α and γ (and the finite-difference constant h) in the CUMA algorithm. Results are reported on the C&C dataset with "RacePctBlack" as the sensitive attribute.

| | α = 0.1 | α = 1 | α = 10 | γ = 0.1 | γ = 1 | γ = 10 | h = 0.1 | h = 1 |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 86.94 | 85.40 | 83.75 | 85.19 | 85.40 | 84.79 | 85.32 | 85.40 |
| ∆EO | 59.74 | 28.35 | 32.68 | 38.85 | 28.35 | 27.99 | 29.15 | 28.35 |
| ∆ER | 42.50 | 7.61 | 18.56 | 13.52 | 7.61 | 7.24 | 7.53 | 7.61 |

6 CONCLUSION

In this paper, we first propose a new fairness goal, termed Equalized Robustness (ER), to impose fair model robustness against unseen distribution shifts across different data groups. We further propose a novel fairness learning algorithm, termed Curvature Matching (CUMA), to simultaneously achieve both traditional in-distribution fairness and our new robust fairness. Experiments show that CUMA achieves superior fairness in robustness against distribution shifts, without additional sacrifice in either overall accuracy or in-distribution fairness compared with traditional in-distribution fair learning methods. As pioneering work, the new concept of ER proposed in this paper aims to measure a dimension of fairness that is practically significant yet so far largely overlooked: ER assesses "out-of-distribution" fairness, while previous metrics focus on "in-distribution" fairness. Therefore, ER works as a complement to, instead of a replacement for, previous fairness metrics. We hope our work can open up more discussions on how to evaluate model fairness across a more complete spectrum.

REFERENCES

Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning, pp. 528–539, 2020.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.

Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91, 2018.

Flavio P Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In International Conference on Neural Information Processing Systems, pp. 3995–4004, 2017.

Elliot Creager, David Madras, Jörn-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, pp. 1436–1445, 2019.

Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. Does object recognition work for everyone? In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 52–59, 2019.

Mengnan Du, Fan Yang, Na Zou, and Xia Hu. Fairness in deep learning: A computational perspective. IEEE Intelligent Systems, 2020.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Harrison Edwards and Amos Storkey. Censoring representations with an adversary.
In International Conference on Learning Representations, 2016.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723–773, 2012.

Yiwen Guo, Chao Zhang, Changshui Zhang, and Yurong Chen. Sparse DNNs with improved adversarial robustness. In Advances in Neural Information Processing Systems, 2018.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929–1938, 2018.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.

Sunhee Hwang, Sungho Park, Dohyung Kim, Mirae Do, and Hyeran Byun. FairFaceGAN: Fairness-aware facial image-to-image translation. In British Machine Vision Conference, 2020.

Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In International Conference on Knowledge Discovery and Data Mining, pp. 202–207, 1996.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In International Conference on Machine Learning, 2017.

Yi Li and Nuno Vasconcelos. REPAIR: Removing representation bias by dataset resampling. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9572–9581, 2019.

Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727, 2015.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, 2015.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp. 3384–3393, 2018.

Natalia Martinez, Martin Bertran, and Guillermo Sapiro. Minimax Pareto fairness: A multi objective perspective.
In International Conference on Machine Learning, pp. 6755–6764, 2020.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9078–9086, 2019.

Cecilia Muñoz, Megan Smith, and DJ Patil. Big data: A report on algorithmic systems, opportunity, and civil rights. United States Executive Office of the President, 2016.

Shruti Nagpal, Maneet Singh, Richa Singh, and Mayank Vatsa. Deep learning for face recognition: Pride or prejudiced? arXiv preprint arXiv:1904.01219, 2019.

Junhyun Nam, Hyuntak Cha, Sung-Soo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. In Advances in Neural Information Processing Systems, 2020.

Sungho Park, Sunhee Hwang, Jongkwang Hong, and Hyeran Byun. Fair-VQA: Fairness-aware visual question answering through sensitive attribute prediction. IEEE Access, 8:215091–215099, 2020.

John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdren, and Jeffery Zients. Big data: Seizing opportunities and preserving values. United States Executive Office of the President, 2014.

Yuxian Qiu, Jingwen Leng, Cong Guo, Quan Chen, Chao Li, Minyi Guo, and Yuhao Zhu. Adversarial defense through network profiling based path extraction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4777–4786, 2019.

Novi Quadrianto and Viktoriia Sharmanska. Recycling privileged learning and distribution matching for fairness. In Advances in Neural Information Processing Systems, 2017.

Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.

Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, and Shadi Albarqouni. Fairness by learning orthogonal disentangled representations. In European Conference on Computer Vision, pp. 746–761, 2020.

Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. In Advances in Neural Information Processing Systems Workshop, 2017.

Nathalie A Smuha. The EU approach to ethics guidelines for trustworthy artificial intelligence. Computer Law Review International, 20(4):97–106, 2019.

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In Advances in Neural Information Processing Systems, 2020.

Christina Wadsworth, Francesca Vera, and Chris Piech. Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199, 2018.

Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In IEEE International Conference on Computer Vision, pp. 5310–5319, 2019.

Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach.
In International Conference on Learning Representations, 2018.

Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. Predictive inequity in object detection. arXiv preprint arXiv:1902.11097, 2019.

Hanshu Yan, Jingfeng Zhang, Gang Niu, Jiashi Feng, Vincent YF Tan, and Masashi Sugiyama. CIFS: Improving adversarial robustness of CNNs via channel-wise importance-based feature selection. arXiv preprint arXiv:2102.05311, 2021.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In AAAI Conference on AI, Ethics, and Society, pp. 335–340, 2018.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457, 2017.

A MORE IMPLEMENTATION DETAILS

On the C&C and Adult datasets, let d denote the input feature dimension; the dimensions of the hidden layers in f_s and h_t are d → 100 → 64 and 64 → 32 → 2, respectively. h_a has an identical structure to h_t. For all three sub-networks, a ReLU activation and a dropout layer with a 0.25 dropout ratio are applied between the two fully connected layers. On the CelebA dataset, we use ResNet18 as the backbone. The input feature size of h_t and h_a is 8 × 8 × 256 (with channel-last layout).
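A minimal PyTorch sketch of the two-layer MLP sub-networks described above is given below for reference. It reflects the stated layer dimensions and the ReLU/dropout placement; whether any additional non-linearity is applied to the backbone output before the heads is not specified and is omitted here.

```python
import torch.nn as nn


def two_layer_mlp(d_in, d_hidden, d_out):
    # Two fully connected layers with ReLU and dropout (p = 0.25) in between,
    # as used for f_s, h_t and h_a on the C&C and Adult datasets.
    return nn.Sequential(
        nn.Linear(d_in, d_hidden),
        nn.ReLU(),
        nn.Dropout(p=0.25),
        nn.Linear(d_hidden, d_out),
    )


class TwoHeadMLP(nn.Module):
    """f_s: d -> 100 -> 64; h_t and h_a: 64 -> 32 -> 2."""

    def __init__(self, d):
        super().__init__()
        self.f_s = two_layer_mlp(d, 100, 64)   # shared backbone
        self.h_t = two_layer_mlp(64, 32, 2)    # utility head (target task)
        self.h_a = two_layer_mlp(64, 32, 2)    # adversarial head (sensitive attribute)

    def forward(self, x):
        z = self.f_s(x)
        return self.h_t(z), self.h_a(z)
```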
B TRADE-OFF CURVES BETWEEN FAIRNESS AND ACCURACY

For CUMA and both baseline methods, we can obtain different trade-offs between fairness and accuracy by setting the loss weights (e.g., α and γ) to different values: for example, the larger α is, the better the fairness and the worse the accuracy. The trade-off curves between fairness and accuracy of the different methods are shown in Figure 3. The closer a curve is to the top-left corner (i.e., larger accuracy and smaller ∆EO), the better the Pareto frontier it achieves. As we can see, our method achieves the best Pareto frontiers for both in-distribution fairness (left panel) and robust fairness under distribution shifts (middle and right panels).

Figure 3: Trade-off curves between fairness and accuracy of different methods. Results are reported on the C&C dataset with "RacePctBlack" as the sensitive attribute.

C RESULTS ON OTHER SETTINGS OF DISTRIBUTIONAL SHIFTS

In this section, we show that the conclusions drawn in Section 5.2 hold under different settings of distributional shifts. Specifically, we consider a new noise setting with mean µ = 0 and standard deviation σ = 0.06 (instead of the mean µ = 0 and standard deviation σ = 0.03 evaluated in the main text) for both the random Gaussian and uniform noise. The results under these new distributional shifts on the C&C dataset are shown in Tables 8 and 9.

Table 8: Results on the C&C dataset with "RacePctBlack" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), uniform | ∆EO (↓), uniform |
|---|---|---|---|---|---|---|---|---|
| Normal | **89.05** | 38.52 | 63.22 | 46.16 | 36.71 | 61.54 | 40.22 | 63.17 |
| AdvDebias | 84.79 | 26.68 | 39.84 | 21.77 | 28.61 | 37.02 | 22.84 | 37.41 |
| LAFTR | 85.80 | 13.32 | 28.83 | 16.98 | 13.96 | 31.25 | 16.58 | 33.42 |
| CUMA | 85.40 | **12.52** | **28.35** | **7.61** | **11.76** | **27.15** | **12.80** | **27.41** |

Table 9: Results on the C&C dataset with "FemalePctDiv" as the sensitive attribute. The best result in each column is shown in bold.

| Method | Accuracy (↑) | ∆EOpp (↓), in-dist. | ∆EO (↓), in-dist. | ∆ER (↓) | ∆EOpp (↓), Gaussian | ∆EO (↓), Gaussian | ∆EOpp (↓), uniform | ∆EO (↓), uniform |
|---|---|---|---|---|---|---|---|---|
| Normal | **85.60** | 17.28 | 54.74 | 67.69 | 18.52 | 57.64 | 20.25 | 55.52 |
| AdvDebias | 83.57 | 12.80 | 38.73 | 37.17 | 14.90 | 39.60 | 12.58 | 35.26 |
| LAFTR | 83.16 | 11.73 | 27.83 | 28.15 | 13.12 | 30.21 | 12.41 | 31.52 |
| CUMA | 83.37 | **8.90** | **27.79** | **23.13** | **9.12** | **28.74** | **9.96** | **29.23** |