# SYNTHETIC REDUCED NEAREST NEIGHBOR MODEL FOR REGRESSION

Anonymous authors
Paper under double-blind review

## ABSTRACT

Nearest neighbor models are among the most established and accurate approaches to machine learning. In this paper, we investigate Synthetic Reduced Nearest Neighbor (SRNN) as a novel approach to regression tasks. Existing prototype nearest neighbor models are initialized by training a k-means model over each class. However, such initialization is only applicable to classification tasks. In this work, we propose a novel initialization and expectation maximization approach that enables the application of SRNN to regression. The proposed initialization applies the k-means algorithm to the target responses of the samples to create clusters of targets. This is followed by learning several centroids in the input space for each cluster found over the targets. Essentially, the initialization consists of finding target clusters and running k-means in the space of feature vectors for each corresponding target cluster. The optimization procedure is an expectation maximization approach, similar to the k-means algorithm, that optimizes the centroids in the input space. The algorithm comprises two steps: (1) the assignment step, where the assignments of the samples to each centroid are found and the target response (i.e., prediction) of each centroid is determined; and (2) the update/centroid step, where each centroid is updated such that the loss function of the entire model is minimized. We show that the centroid step operates over all samples by solving a weighted binary classification problem. However, the centroid step is NP-hard and no surrogate objective function exists for solving this problem. Therefore, a new surrogate is proposed to approximate the solution of the centroid step. Furthermore, we consider the consistency of the model and show that the model is consistent under mild assumptions. The bias-variance relationship in this model is also discussed. We report an empirical evaluation of the proposed SRNN regression model in comparison to several state-of-the-art techniques.

## 1 INTRODUCTION

One of the main topics of research in machine learning is the relation between features and output responses Hastie et al. (2009); Santosa & Symes (1986); Tibshirani (1996); Criminisi & Shotton (2013). Synthetic Reduced Nearest Neighbor (SRNN) models have been shown to be an effective tool for determining the relationships between input features and the sub-clusters of each class in supervised learning tasks Tavallali et al. (2020b). However, existing prototype nearest neighbor models such as SRNN are constrained to classification problems, and to the best of our knowledge, there remains a gap in extending these algorithms towards regression tasks. Such regression reduced nearest neighbor models may find extensive applications in epidemiological studies Tavallali et al. (2020a); Cisneros et al. (2021), medical studies Criminisi & Shotton (2013); Graf et al. (2011b;a), and other applied regression tasks in general Tibshirani (1996). To address this gap in the state of the art, we propose a novel algorithm for the optimization and construction of Regression Synthetic Reduced Nearest Neighbor (Reg-SRNN) models. The proposed Reg-SRNN is capable of discovering various modalities of the input data and relating them to the modalities of the output responses.
The Reg-SRNN algorithm is designed to handle both single-response and multi-response regression. Multi-response regression consists of learning the relation between input samples and several ground-truth output responses. Reg-SRNN partitions the input space into piecewise constant regions, where each region is represented by a centroid and its output response. From this perspective, Reg-SRNN is similar to other piecewise constant models, such as Li & Martin (2017); Begon et al. (2017); Bertsimas & Dunn (2017); Tavallali et al. (2019; 2020c). Reg-SRNN is capable of learning an accurate relation between each cluster of the data and its corresponding output responses. Therefore, Reg-SRNN can also provide enhanced interpretability by reducing the information content of clusters into a compressed representation manifested in their centroids.

The technical contributions of this paper include a novel initialization that by itself is competitive with existing regression models. This is followed by an expectation maximization algorithm for directly minimizing the least squares error of the model. The proposed optimization algorithm is provably convergent, and it is shown that it monotonically decreases the loss function. Therefore, the algorithm has a convergence guarantee on minimizing the loss function and achieving a local optimum. It is also worth mentioning that the algorithm does not cycle. The proposed optimization algorithm consists of two steps and is inspired by the K-means algorithm Lloyd (1982). The first step is the assignment step, which consists of finding the sample assignments and the proper output response of each centroid. The second step is the update step, where each centroid is optimized such that the loss function is decreased. The centroid step is affected by all the samples, and we will show that this update step is an NP-hard weighted binary classification problem. The update step is computed through a surrogate objective function similar to that of an SVM. We establish that the algorithm is efficient because of its linear computational complexity. Finally, the model is evaluated on datasets of various sizes and dimensionalities, the results of which demonstrate that Reg-SRNN is capable of competing with, and even outperforming, similar regression models. Accordingly, the main contributions of this paper are as follows:

- We propose a novel algorithm for the initialization of SRNN models that extends their application to regression tasks.
- We develop an optimization algorithm for regression SRNN models with guarantees on convergence.
- Through experimental evaluation, we demonstrate the feasibility of our proposed regression SRNN model in filling the gap between more complex models (such as random forests) and basic, interpretable models such as linear regression and decision trees.

## 2 RELATED WORK

A regression task consists of learning the relation between samples of the input space and a numerical output space. More specifically, regression is the supervised learning task of mapping inputs (independent variable $X$) to the output $Y$, which is a continuous vector ($Y \in \mathbb{R}^d$). If the dimensionality of the output satisfies $d \geq 2$, the task is known as multi-response regression. Regression has been the workhorse of numerous fields Tai (2021), and various regression models have been developed and expanded fundamentally over the recent decades Hastie et al. (2009).
This expansion has been so rapid that listing all such models and their relationships is a difficult task and out of the scope of this work. However, a brief review of recent models is presented in this paper. A common objective function for regression is the least squares error:

$$\|\hat{Y} - Y\|^2 \qquad (1)$$

where $\hat{Y}$ is the prediction. According to the Gauss-Markov theorem Gauss (1823), the least squares estimator is the unbiased linear estimator of minimum variance under certain assumptions. Ordinary least squares may fail to properly predict outcomes if it is applied in settings where the Gauss-Markov assumptions do not hold. Therefore, it is important to understand these assumptions and, where necessary, apply appropriate changes to the objective function of equation 1 to modify the model Tai (2021). Manifestations of such changes include imposing regularization terms or constraints on the objective function. The literature on ordinary least squares estimation has extensively dealt with some of the well-known concerns that might violate the assumptions, for example through Ridge Hoerl & Kennard (1970a;b), Lasso Tibshirani (1996), Elastic Net Zou & Hastie (2005), trees Quinlan (2014), forests Breiman (2001), boosting Bühlmann & Yu (2003), and others.

Common regression models include bagging, boosting, random forests Criminisi & Shotton (2013), oblique trees Murthy et al. (1994); Norouzi et al. (2015); Heath et al. (1993), and regression SVM Drucker et al. (1997). In the context of regression trees, various approaches for inducing a tree are presented in the literature. Most decision tree induction methods concentrate on the splitting criterion used at the growing phase of the tree Ikonomovska et al. (2011); Levatić et al. (2014). The application of decision tree algorithms to multi-response regression has been previously considered in the literature Breiman et al. (1984); De'Ath (2002). In Breiman et al. (1984); Quinlan (1986), the authors consider training a decision tree for each individual output response. However, such an approach constructs a large model, especially if the number of output responses is high. Another approach, proposed in De'Ath (2002), consists of constructing a single decision tree for all the output responses. In other words, the model predicts all the output values simultaneously through a single decision tree. However, a single model for all the outputs might not be sufficient Kocev et al. (2009), and training a model for each single response does not address the true problem, which is multi-response regression. The authors in Kocev et al. (2009) explored two approaches to the multi-response regression problem by comparing learning a model for each output separately (i.e., multiple regression trees) and learning one model for all outputs simultaneously (i.e., a single multi-target regression tree). In order to improve predictive performance, Kocev et al. (2013) also considered two ensemble learning techniques, namely bagging Breiman (1996); Liang et al. (2011) and random forests Breiman (2001), for regression trees and multi-target regression trees. The results showed that multi-target regression trees created more accurate and compact models.

A related topic is nearest neighbor regression. Nearest neighbor regression and local estimators are well-established methods in the literature of ordinary univariate location estimators Benedetti (1977); Stone (1974); Tukey et al. (1977).
However, as per our extensive search, there remains a gap in prototype nearest neighbor approaches to regression. The only work that considered a model and optimization similar to SRNN is Huang (2017). However, that algorithm has no guarantee of convergence or of achieving any form of optimum solution.

## 3 PROPOSED METHOD

### 3.1 PRELIMINARIES

Assume a dataset consisting of tuples $(x_i, y_i)$, where $x$, $y$ and $i$ represent the input features, the output responses and the index number. Each tuple represents a sample $x_i$ and its corresponding output response $y_i$. Here, $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}^d$. The Regression Synthetic Reduced Nearest Neighbor (Reg. SRNN) consists of $K$ tuples of synthetically produced centroids/prototypes $(c_j, \hat{y}_j)$, where $c_j$, $\hat{y}_j$ and $j$ represent the centroid's point in the input space, its output prediction and its index. At inference time, Reg. SRNN operates like a nearest neighbor model where the centroids are used as the samples. The problem of training Reg. SRNN is as follows:

$$\min_{\{(c_j, \hat{y}_j)\}_1^K} \sum_{i=1}^{N} \|y_i - \hat{y}_{j_i^*}\|^2 \qquad (2)$$
$$\text{s.t.} \quad j_i^* = \underset{\{j\}_1^K}{\operatorname{argmin}}\ d(x_i - c_j)$$

where $d(\cdot)$ is a distance metric. Throughout this paper we use the squared l-2 norm as the distance metric:

$$d(x_i - c_j) = \|x_i - c_j\|^2 \qquad (3)$$

Essentially, the prediction of the model is the output prediction of the centroid closest to the input sample. Formally, we define Reg. SRNN as follows:

$$NN(x) = \sum_{j=1}^{K} \hat{y}_j I(x \in R_j) \qquad (4)$$

where $NN(\cdot)$ represents the nearest neighbor function of the $K$ centroids, $I(\cdot)$ is an indicator function that produces 1 if the input $x$ is in the region $R_j$, and $R_j$ is the region whose closest centroid is $c_j$.

### 3.2 INITIALIZATION

Numerical optimization algorithms require initialization (cite num opt). In this paper, we propose a novel initialization for the regression SRNN. Previous work on the initialization of SRNN models consisted of learning a K-means model for each class of the data. For example, in the case of $M$ classes and $K$ centroids, $\frac{K}{M}$ centroids are learned for each class as the initialization of the SRNN Kusner et al. (2014); Wang et al. (2016); Zhong et al. (2017); Tavallali et al. (2020b). However, such an approach is not applicable to regression. This initialization also has close ties with naive Bayes and density estimation Silverman (2018). Here, we extend this initialization to the case of Reg. SRNN.

Intuitively, the output responses can consist of several modalities. In other words, it is possible that the output responses are generated from several distributions. The clusters of such distributions can be approximated by running K-means over the output space ($M$ centroids). Assume that $S_m$ represents the set of samples assigned to the $m$-th output cluster. The next step consists of learning $\frac{K|S_m|}{N}$ centroids over the input features of $S_m$, for each of the $M$ clusters. In other words, we learn centroids over the input features of each output cluster in proportion to the population of that cluster. The centroids found at this second step are used as the initialization of the Reg. SRNN. At this step, $\hat{y}_j$ is found using the following formula:

$$\hat{y}_j = \operatorname{mean}(y_i \in S_j) \qquad (5)$$

where $S_j$ represents the set of samples that are assigned to the $j$-th centroid, i.e., the samples for which the $j$-th centroid is the closest centroid, and $\operatorname{mean}(\cdot)$ returns the average of its input set. Note that $S_j \subset R_j$.
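As a concrete illustration of this two-stage initialization, the following is a minimal sketch in Python, assuming NumPy and scikit-learn's KMeans; the function name, the rounding of the per-cluster centroid counts (which may not sum exactly to $K$), and the fallback for empty regions are our own simplifications rather than details from the paper.

```python
# Minimal sketch of the Reg. SRNN initialization (Section 3.2).
import numpy as np
from sklearn.cluster import KMeans


def init_reg_srnn(X, Y, K, M):
    """Return initial centroids C (roughly K x D) and predictions Y_hat."""
    N = X.shape[0]
    Y2d = Y.reshape(N, -1)                      # handle single-response targets

    # Step 1: cluster the output responses into M target clusters.
    out_km = KMeans(n_clusters=M, n_init=10).fit(Y2d)

    centroids = []
    for m in range(M):
        S_m = np.where(out_km.labels_ == m)[0]  # samples of the m-th target cluster
        # Number of input-space centroids proportional to the cluster population.
        k_m = max(1, min(len(S_m), int(round(K * len(S_m) / N))))
        in_km = KMeans(n_clusters=k_m, n_init=10).fit(X[S_m])
        centroids.append(in_km.cluster_centers_)
    C = np.vstack(centroids)

    # Assign every sample to its nearest centroid and set y_hat_j = mean(y_i in S_j).
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    Y_hat = np.vstack([Y2d[assign == j].mean(axis=0) if np.any(assign == j)
                       else Y2d.mean(axis=0) for j in range(C.shape[0])])
    return C, Y_hat
```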
### 3.3 CONSISTENCY

Here we discuss the consistency of the Reg. SRNN. We show that Reg. SRNN is nonparametric and consistent under mild assumptions for continuous features. Assume an independent identically distributed (iid) dataset generated from $f(x)$, where $f(x)$ represents the true function relating the inputs to the outputs. Also assume that $f(x)$ is a piecewise constant function. Over a set of $N$ observations, consistency of a nonparametric estimator $\hat{f}_N(x)$ (such as Reg. SRNN) is expressed by the following condition Parzen (1962):

$$\Pr\left(\lim_{N \to \infty} \int \big(\hat{f}_N(x) - f(x)\big)^2\, dx = 0\right) = 1 \qquad (6)$$

The proof of consistency of Reg. SRNN is similar to the proof of consistency for regression trees Breiman et al. (1984) and for the density estimators in Ram & Gray (2011).

**Theorem 1 (Consistency of Reg. SRNN).** The estimator defined in equation 4 satisfies equation 6.

*Proof.* Let $B$ and $d_1$ denote the collection of all sets $t \subset \mathcal{X}$ and a fixed positive integer, respectively. Assume that $B$ describes the sets that are the solution of a system of $d_1$ inequalities $b^T x \leq c$, where $b \in \mathbb{R}^D$ and $c \in \mathbb{R}$. Every region of the Reg. SRNN in equation 4 can be seen as the solution of a system of $d_1$ inequalities of the form $b^T x \leq c$ where $b$ is a one-hot vector (only one element of $b$ is 1 and the rest are 0). Therefore, $\text{Reg. SRNN} \subset B$. Assume random points $X_n$ drawn from $f$ on $\mathcal{X}$ ($n \geq 1$). Let $\hat{F}_N$ represent the empirical function learned by Reg. SRNN over $X_n$, for $N \geq 1$ and $1 \leq n \leq N$, defined on a set $t \subset \mathcal{X}$ by

$$\hat{F}_N(t) = \frac{1}{N}\sum_{n=1}^{N} y_n I(X_n \in t) = \operatorname{mean}(y_n \in R_t) = \int_t \hat{f}_N(x)\, dx \qquad (7)$$

where $y_n = f(x_n)$ and $\hat{f}_N(x)$ is the estimator presented in equation 4. Using a general version of the Glivenko-Cantelli theorem Vapnik & Chervonenkis (2015),

$$\Pr\left(\lim_{N \to \infty} \sup_{t \in B} \Big|\hat{F}_N(t) - \int_t f(x)\, dx\Big| = 0\right) = 1 \qquad (8)$$

By substituting equation 7 into equation 8, we have

$$\Pr\left(\lim_{N \to \infty} \sup_{t \in B} \Big|\int_t \hat{f}_N(x)\, dx - \int_t f(x)\, dx\Big| = 0\right) = 1 \qquad (9)$$

and then

$$\Pr\left(\lim_{N \to \infty} \sup_{t \in B} \int_t \big|\hat{f}_N(x) - f(x)\big|\, dx \geq 0\right) = 1 \qquad (10)$$

Further, assume that the region of $t$ shrinks toward 0 as $N \to \infty$, i.e., $\Pr\big(\lim_{N \to \infty} \int_t dx = 0\big) = 1$. The rest of the steps follow similarly to Theorem 1 in Ram & Gray (2011), and we obtain

$$\Pr\left(\lim_{N \to \infty} \int \big(\hat{f}_N(x) - f(x)\big)^2\, dx = 0\right) = 1 \qquad (11)$$

Note that the steps are carried out as in Ram & Gray (2011), except for the step of equation 7, which is different.

This result shows that, as the number of samples goes to infinity and as $\frac{K}{|S_j|} \to 0$, the Reg. SRNN is consistent. This is provable thanks to the assumption that the true function relating the inputs to the outputs is a piecewise constant function. It also essentially means that the SRNN itself is consistent for classification tasks.

### 3.4 THE EXPECTATION MAXIMIZATION OF REG. SRNN

Problem 2 represents the training problem of Reg. SRNN. SRNN is generally known as a prototype nearest neighbor model in the literature, and the centroids are also called prototypes. Problem 2 resembles the K-means problem Lloyd (1982), except that in K-means the loss function is $\|x_i - c_j\|^2$ over the inputs, whereas here the loss is measured on the output responses. The expectation maximization in this paper follows the same approach as the K-means algorithm and the approach in Tavallali et al. (2020b). The optimization consists of two steps: the assignment step and the update step.

#### 3.4.1 ASSIGNMENT STEP

The assignment step consists of calculating the assignment of the samples to each centroid and finding the optimum output prediction of each centroid. The problem of this step is

$$\min_{\{\hat{y}_j\}_1^K} \sum_{i=1}^{N} \|y_i - \hat{y}_{j_i^*}\|^2 \qquad (12)$$
$$\text{s.t.} \quad j_i^* = \underset{\{j\}_1^K}{\operatorname{argmin}}\ d(x_i - c_j)$$

Note that problem 12 only optimizes over $\hat{y}_j$ for $j = 1, \dots, K$. Using the sets $S_j$, problem 12 can be simplified to

$$\min_{\{\hat{y}_j\}_1^K} \sum_{j=1}^{K} \sum_{\{i\,|\,(x_i, y_i) \in S_j\}} \|y_i - \hat{y}_j\|^2 \qquad (13)$$

Problem 13 can be separated over each centroid and its corresponding set. This can be done since the regions are distinct and samples cannot be shared among the regions. The prediction for each region is the label $\hat{y}_j$ of the $j$-th centroid (the centroid that represents the region). As a result, the problem of optimizing the label for each region is

$$\min_{\hat{y}_j} \sum_{\{i\,|\,(x_i, y_i) \in S_j\}} \|y_i - \hat{y}_j\|^2 \qquad (14)$$

whose minimum is given by formula 5. In other words, the optimum of $\hat{y}_j$ is the mean response of the samples in $R_j$. As a result, in the assignment step, $S_j$ and $\hat{y}_j$ have to be calculated over all the samples.
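The assignment step can be sketched as follows, assuming NumPy; the function name and the fallback for empty regions are illustrative choices, not taken from the paper's implementation.

```python
# Minimal sketch of the assignment step (Section 3.4.1).
import numpy as np


def assignment_step(X, Y, C):
    """Assign samples to nearest centroids and recompute centroid predictions.

    X: (N, D) inputs, Y: (N, d) targets, C: (K, D) centroids.
    Returns the assignment vector (N,) and updated predictions Y_hat (K, d).
    """
    # Squared l-2 distance of every sample to every centroid.
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)   # (N, K)
    assign = dists.argmin(axis=1)                                 # j_i^* for each sample

    K = C.shape[0]
    Y_hat = np.empty((K, Y.shape[1]))
    for j in range(K):
        S_j = (assign == j)
        # Optimal prediction for region R_j is the mean response of S_j (equations 5 / 14);
        # an empty region falls back to the global mean (our choice, not the paper's).
        Y_hat[j] = Y[S_j].mean(axis=0) if S_j.any() else Y.mean(axis=0)
    return assign, Y_hat
```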
#### 3.4.2 UPDATE STEP

The update step consists of updating the positions of the centroids such that the objective function in equation 2 is minimized Lloyd (1982). At this step, the output predictions of the centroids are kept constant. The update step is affected by all the samples in the dataset because, by moving a centroid, the assignments of the samples can change and, as a result, the prediction for each training sample changes. Further, it will be shown that finding the optimum of the problem at this step is NP-hard. Therefore, the centroid problem is approximated through a novel surrogate objective function. Each centroid is optimized individually Tavallali et al. (2020b). The optimization problem for the $k$-th centroid consists of moving the $k$-th centroid such that the samples' assignments are changed in favor of decreasing the objective function of equation 2. The centroid problem is

$$\min_{c_k} \sum_{i=1}^{N} \|y_i - \hat{y}_{j_i^*}\|^2 \qquad (15)$$
$$\text{s.t.} \quad j_i^* = \underset{\{j\}_1^K}{\operatorname{argmin}}\ d(x_i - c_j)$$

Note that the optimization is only over $c_k$. We rewrite problem 15 in terms of the assignment of each sample either to the $k$-th centroid or to the rest of the centroids. For simplicity we introduce the notation $r_{ij} = d(x_i - c_j)$. The problem becomes

$$\min_{c_k} \sum_{i=1}^{N} \|y_i - \hat{y}_k\|^2\, U(r_{i j_i^*} - r_{ik}) + \|y_i - \hat{y}_{j_i^*}\|^2\, U(r_{ik} - r_{i j_i^*}) \qquad (16)$$
$$\text{s.t.} \quad j_i^* = \underset{\{j\}_1^K,\ j \neq k}{\operatorname{argmin}}\ d(x_i - c_j)$$

In problem 16, $U(\cdot)$ is a step function that outputs 1 if its input is larger than 0 and 0 otherwise. Note that the input arguments of the two step functions are negatives of each other. This means a sample has to be assigned either to the $k$-th centroid or to the rest of the centroids; it cannot contribute to both terms of equation 16. The assignment of the $i$-th sample to either term produces a continuous error. This essentially means that problem 16 is a weighted binary classification problem which encourages each sample to be assigned to the side that produces the lower error. For simplicity we introduce the sets $S_{c_k}$ and $S'_{c_k}$: $S_{c_k}$ represents the set of samples that produce lower error if assigned to the centroid $c_k$, and $S'_{c_k}$ represents the set of the remaining samples. Let $t_i = \operatorname{abs}(\|y_i - \hat{y}_k\|^2 - \|y_i - \hat{y}_{j_i^*}\|^2)$, where $\operatorname{abs}(\cdot)$ returns the absolute value of its input. Problem 16 is then equivalent (up to an additive constant) to

$$\min_{c_k} \sum_{(x_i, y_i) \in S_{c_k}} t_i\, U(r_{ik} - r_{i j_i^*}) + \sum_{(x_i, y_i) \in S'_{c_k}} t_i\, U(r_{i j_i^*} - r_{ik}) \qquad (17)$$

Problem 17 is an NP-hard problem Nguyen & Sanner (2013). Therefore, inspired by the SVM, we approximate its solution using the following surrogate objective function:

$$c_k^*(\mu) = \underset{c_k}{\operatorname{argmin}} \sum_{x_i \in S_{c_k}} t_i\, r_{ik} + \sum_{x_i \in S'_{c_k}} t_i\, \operatorname{relu}(\mu\, r_{i j_i^*} - r_{ik}) \qquad (18)$$

where $\mu$ is a penalty coefficient. Intuitively, the surrogate objective function encourages the $k$-th centroid to stay close to the samples of $S_{c_k}$ while staying away from the samples of $S'_{c_k}$. $\mu$ is increased from 0 to 1, and along this path the $c_k^*(\mu)$ that produces the smallest error for equation 17 is selected. This surrogate is a modified version of the surrogate objective function in Tavallali et al. (2020b). The coefficient $\mu$ acts similarly to the slack variables in an SVM problem. The objective function of problem 18 is continuous; thus, a local optimum can be found using gradient-based algorithms.
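A possible realization of the centroid update is sketched below, assuming NumPy and plain full-batch gradient descent on the surrogate of equation 18 (the paper's experiments use batch stochastic gradient descent); the learning rate, iteration count, and $\mu$ grid are illustrative assumptions.

```python
# Minimal sketch of the centroid/update step via the surrogate of equation 18.
import numpy as np


def update_centroid(X, t, r_other, in_Sc, c_k,
                    mus=np.linspace(0.0, 1.0, 11), lr=1e-3, steps=200):
    """Optimize one centroid with the surrogate of equation 18.

    X: (N, D) inputs; t: (N,) weights t_i; r_other: (N,) distances r_{i j*};
    in_Sc: (N,) boolean mask for S_{c_k}; c_k: (D,) current centroid.
    Returns the candidate c_k*(mu) with the lowest value of equation 17.
    """
    def loss_17(c):
        r_k = ((X - c) ** 2).sum(axis=1)
        # Penalty t_i when a sample ends up on the side with the larger error.
        misassigned = np.where(in_Sc, r_k > r_other, r_other > r_k)
        return (t * misassigned).sum()

    best_c, best_loss = c_k.astype(float), loss_17(c_k)
    for mu in mus:
        c = c_k.astype(float)
        for _ in range(steps):
            diff = X - c                                   # (N, D)
            r_k = (diff ** 2).sum(axis=1)                  # r_{ik}
            grad = np.zeros_like(c)
            # d/dc of t_i * r_ik for samples in S_c (pull the centroid closer).
            grad += (-2 * t[in_Sc][:, None] * diff[in_Sc]).sum(axis=0)
            # d/dc of t_i * relu(mu*r_ij* - r_ik) for S_c' (push the centroid away).
            active = (~in_Sc) & (mu * r_other - r_k > 0)
            grad += (2 * t[active][:, None] * diff[active]).sum(axis=0)
            c -= lr * grad
        cand_loss = loss_17(c)
        if cand_loss < best_loss:
            best_c, best_loss = c, cand_loss
    return best_c
```

In the full algorithm, this update is applied to each of the $K$ centroids in turn, alternating with the assignment step until the loss stops decreasing.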
### 3.5 RELATION TO THE EM ALGORITHM

The proposed algorithm is originally inspired by the EM algorithm used for K-means. The assignment step consists of finding the samples assigned to each centroid and finding the optimal output prediction of each centroid. The assignment of samples corresponds to calculating the prior probabilities in an EM algorithm, and finding the optimal output prediction of each centroid can also be considered part of the maximization step. In the update step, the centroids have to be updated. At this step, the outcome of assigning each sample to each centroid is evaluated ($\|y_i - \hat{y}_j\|^2$ for $j = 1, \dots, K$ and all $i = 1, 2, \dots, N$). This evaluation is equivalent to calculating the posterior probabilities. The centroid problem is then approximated using these outcomes, which constitutes the maximization step.

### 3.6 RELATION OF THE SURROGATE TO SVM

The problem of finding the best centroid that is closer to the samples of $S_{c_k}$ than any other centroid can be cast as a feasibility problem:

$$\text{find } c_k \qquad (19)$$
$$\text{s.t.} \quad r_{ik} < r_{i j_i^*} \quad \forall (x_i, y_i) \in S_{c_k}$$
$$\qquad\ \ r_{ik} > r_{i j_i^*} \quad \forall (x_i, y_i) \in S'_{c_k}$$

However, this feasibility problem is NP-hard since the second set of constraints is concave Sahni (1974). These constraints make the problem different from the corresponding SVM problems, where the constraints are convex and a global solution can be found efficiently.

### 3.7 PROPERTIES OF THE ALGORITHM

**Computational complexity** Optimizing the centroid problem takes $O(ND)$ since it uses a gradient-based algorithm for solving the surrogate objective. All $K$ centroids have to be optimized at each iteration. Therefore, the computational complexity of the algorithm is $O(\alpha NDK)$, where $\alpha$ is the number of iterations.

**Convergence** The convergence to a local optimum is similar to that of the K-means algorithm Lloyd (1982); Tavallali et al. (2020b). At each iteration, the error decreases, and the objective function is bounded from below by 0. Further, the number of different combinations of assignments of the samples to the centroids is finite. Therefore, the algorithm stops after a finite number of iterations.

## 4 EXPERIMENTAL RESULTS

In this section, experimental results are presented to demonstrate the merits of the proposed algorithm. Various datasets from the UCI repository Dheeru & Karra Taniskidou (2017) are used for evaluation. We partitioned the datasets using the following setup (a minimal splitting sketch follows this list):

1. If the dataset contained a separate test set, the dataset was used as provided in the repository. In case cross-validation was needed (for specific models) and a validation set was not provided, we partitioned the trainset into 80% train and 20% validation sets.
2. If the test set was not provided, we partitioned the dataset into 80% train and 20% test sets. For models that needed a validation set, we divided the trainset into 80% train and 20% validation sets.
3. If all sets were provided by the repository, then the sets were used as provided.
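The splitting rules above can be sketched as follows, assuming scikit-learn's train_test_split; the helper name and return structure are hypothetical, not the paper's code.

```python
# Minimal sketch of the 80/20 dataset partitioning used in the experiments.
from sklearn.model_selection import train_test_split


def partition(X, y, X_test=None, y_test=None, need_val=False, seed=0):
    """Apply the 80/20 splitting rules described above."""
    if X_test is None:
        # Case 2: no test set provided -> hold out 20% for testing.
        X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    splits = {"test": (X_test, y_test)}
    if need_val:
        # Cases 1 and 2: carve a 20% validation set out of the trainset when needed.
        X, X_val, y, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
        splits["val"] = (X_val, y_val)
    splits["train"] = (X, y)
    return splits
```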
The setup for Reg. SRNN is as follows: at the initialization phase, we used 4 output clusters for the slice localization data and 2 output clusters for the rest of the datasets. The number of input centroids used for each output cluster was proportional to the population of that cluster. The centroid problem was optimized using batch stochastic gradient descent. We used various setups for each dataset to achieve the smallest possible train error.

Various models are used for comparison with Reg. SRNN. The comparison is based on the number of base models used in each model: all the models comparable to Reg. SRNN can be seen as a kind of ensemble that consists of several base models. In the case of forests and boosting, each tree is a base model. For prototype models such as Reg. SRNN and K-means, each centroid is a base model. For Radial Basis Function (RBF) models, each RBF is considered a base model. We also compared our model with linear regression and Ridge regression; these two models are presented as horizontal lines. Figures 1 and 2 present the performance of each model over the various datasets in terms of mean squared error versus the number of base models.

[Figure 1: Train errors (first row) and test errors (second row). The horizontal axis represents the number of base models (e.g., centroids, RBFs, trees, etc.); the vertical axis represents the mean squared error. Panels: slice localization data, CPU, Airfoil, mg. Compared models: SRNN, SRNN0, RF, K-means, regboost, bagging, RBF+lin-reg, lin-reg, ridgeCV.]

Reg. SRNN was compared with Random Forest (RF), K-means, Regression Boosting (regboost), bagging, Radial Basis Functions + linear regression (RBF+lin-reg), linear regression (lin-reg), and ridge regression (ridgeCV). For the RF, each tree was trained using 70% of the trainset, and 0.7 of the features were randomly selected and used at the split nodes. For the K-means baseline, a K-means model with the desired number of centroids was trained over the trainset and the assignment step of Reg. SRNN was applied to the resulting model (a sketch of this baseline is given at the end of this section). For regboost, we used trees of depth 3. For bagging, a setup similar to RF was used except that no feature randomization was applied. RBF+lin-reg consists of first training RBFs (the same number of RBFs as the centroids in the other models) and then training a linear regression over the outputs of the RBFs. The widths of the RBFs were selected by cross-validation. The penalty coefficient of ridgeCV was selected by cross-validation. As can be observed from figures 1 and 2, Reg. SRNN was able to achieve train and test errors that are better than or comparable to the other models.
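As referenced above, the following is a minimal sketch of the K-means baseline, assuming scikit-learn and NumPy: the inputs are clustered, and then the Reg. SRNN assignment step is applied so that each centroid predicts the mean target of its cluster. The function name is hypothetical.

```python
# Minimal sketch of the K-means regression baseline used for comparison.
import numpy as np
from sklearn.cluster import KMeans


def kmeans_regression_baseline(X_train, Y_train, X_test, n_centroids):
    km = KMeans(n_clusters=n_centroids, n_init=10).fit(X_train)
    # Assignment step: y_hat_j = mean of the training targets assigned to centroid j.
    Y2d = Y_train.reshape(len(Y_train), -1)
    y_hat = np.vstack([Y2d[km.labels_ == j].mean(axis=0) for j in range(n_centroids)])
    # Predict with the nearest centroid's stored response.
    return y_hat[km.predict(X_test)]
```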
## 5 LIMITATIONS

The model presented in this paper is new in the sense that it has never been proposed previously. However, since it is by nature a nearest neighbor model, we expect it to share some limitations of nearest neighbor models. In particular, nearest neighbor regression can have high test error on high-dimensional datasets Hastie et al. (2009). Therefore, we suspect that the performance of Reg-SRNN might deteriorate on higher-dimensional datasets. However, in our experiments we noticed that Reg-SRNN performed very well on the slice localization data, which is a high-dimensional dataset (384 features).

## 6 CONCLUSION

In this paper, as per our search, we have proposed the first regression synthetic reduced nearest neighbor. The consistency of the algorithm was proved and its properties were explored. We showed that the algorithm is computationally efficient and converges to a local optimum in the sense that no move can improve the model any further. The approach is essentially the same type of EM algorithm used for K-means, extended to the case of regression synthetic reduced nearest neighbor. The update step is an NP-hard weighted binary classification problem. The optimum of such problems is typically approximated using a surrogate objective function, such as the hinge loss in SVM for the 0-1 binary classification problem. Therefore, we approximated the solution of the update step through a novel surrogate objective function. Further, we analyzed the relation of the update step with the SVM. Experimentally, we showed that Reg. SRNN performs better than or competitively with other similar models in the literature, such as ensembles and centroid-based models.

[Figure 2: Train error (left) and test error (right) on the eunite dataset. The horizontal axis represents the number of base models (e.g., centroids, RBFs, trees, etc.); the vertical axis represents the mean squared error. Reg. SRNN is compared with similar models.]

## 7 REPRODUCIBILITY

The experiment codes (currently attached as supplementary materials) will be published in a GitHub repository to facilitate reproduction and extension of our results. Furthermore, the datasets used in our experiments are publicly available via the UCI repository.

## REFERENCES

Jean-Michel Begon, Arnaud Joly, and Pierre Geurts. Globally induced forest: A prepruning compression scheme. In International Conference on Machine Learning, pp. 420–428, 2017.

Jacqueline K Benedetti. On the nonparametric estimation of regression functions. Journal of the Royal Statistical Society: Series B (Methodological), 39(2):248–253, 1977.

Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Leo J. Breiman, Jerome H. Friedman, R. A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth, Belmont, Calif., 1984.

Peter Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.

Ricardo Cisneros, Hamed Gharibi, Marcela R Entwistle, Pooya Tavallali, Mukesh Singhal, and Donald Schweizer. Nitrogen dioxide and asthma emergency department visits in California, USA during cold season (November to February) of 2005 to 2015: A time-stratified case-crossover analysis. Science of The Total Environment, 754:142089, 2021.

Antonio Criminisi and Jamie Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Advances in Computer Vision and Pattern Recognition. Springer-Verlag, 2013.

Glenn De'Ath. Multivariate regression trees: a new technique for modeling species–environment relationships. Ecology, 83(4):1105–1117, 2002.
Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Harris Drucker, Chris JC Burges, Linda Kaufman, Alex Smola, Vladimir Vapnik, et al. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161, 1997.

Carl Friedrich Gauss. Theoria combinationis observationum erroribus minimis obnoxiae, volume 2. H. Dieterich, 1823.

Franz Graf, Hans-Peter Kriegel, Sebastian Pölsterl, Matthias Schubert, and Alexander Cavallaro. Position prediction in CT volume scans. In Proceedings of the 28th International Conference on Machine Learning (ICML) Workshop on Learning for Global Challenges, Bellevue, Washington, WA, 2011a.

Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro. 2D image registration in CT images using radial image descriptors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 607–614. Springer, 2011b.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer, second edition, 2009.

David Heath, Simon Kasif, and Steven Salzberg. Induction of oblique decision trees. In IJCAI, volume 1993, pp. 1002–1007, 1993.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970a.

Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970b.

Hanwen Huang. Regression in heterogeneous problems. Statistica Sinica, pp. 71–88, 2017.

Elena Ikonomovska, João Gama, and Sašo Džeroski. Incremental multi-target model trees for data streams. In Proceedings of the 2011 ACM Symposium on Applied Computing, pp. 988–993. ACM, 2011.

Dragi Kocev, Sašo Džeroski, Matt D White, Graeme R Newell, and Peter Griffioen. Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling, 220(8):1159–1168, 2009.

Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski. Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3):817–833, 2013.

Matt Kusner, Stephen Tyree, Kilian Weinberger, and Kunal Agrawal. Stochastic neighbor compression. In International Conference on Machine Learning, pp. 622–630, 2014.

Jurica Levatić, Michelangelo Ceci, Dragi Kocev, and Sašo Džeroski. Semi-supervised learning for multi-target regression. In International Workshop on New Frontiers in Mining Complex Patterns, pp. 3–18. Springer, 2014.

Alexander Hanbo Li and Andrew Martin. Forest-type regression with general losses and robust forest. In International Conference on Machine Learning, pp. 2091–2100, 2017.

Guohua Liang, Xingquan Zhu, and Chengqi Zhang. An empirical study of bagging predictors for different learning algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, 2011.

Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

Sreerama K Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994.

Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085–1093, 2013.
Mohammad Norouzi, Maxwell D Collins, David J Fleet, and Pushmeet Kohli. CO2 forest: Improved random forest by continuous optimization of oblique splits. arXiv preprint arXiv:1506.06155, 2015.

Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

J Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.

Parikshit Ram and Alexander G Gray. Density estimation trees. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 627–635. ACM, 2011.

Sartaj Sahni. Computationally related problems. SIAM Journal on Computing, 3(4):262–279, 1974.

Fadil Santosa and William W Symes. Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4):1307–1330, 1986.

Bernard W Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.

Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.

Yunpeng Tai. A survey of regression algorithms and connections with deep learning. arXiv preprint arXiv:2104.12647, 2021.

Pooya Tavallali, Mehran Yazdi, and Mohammad Reza Khosravi. Robust cascaded skin detector based on AdaBoost. Multimedia Tools and Applications, 78(2):2599–2620, 2019.

Pooya Tavallali, Hamed Gharibi, Mukesh Singhal, Donald Schweizer, and Ricardo Cisneros. A multi-pollutant model: a method suitable for studying complex relationships in environmental epidemiology. Air Quality, Atmosphere & Health, 13:645–657, 2020a.

Pooya Tavallali, Peyman Tavallali, Mohammad Reza Khosravi, and Mukesh Singhal. Interpretable synthetic reduced nearest neighbor: An expectation maximization approach. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 1921–1925. IEEE, 2020b.

Pooya Tavallali, Mehran Yazdi, and Mohammad R Khosravi. A systematic training procedure for Viola-Jones face detector in heterogeneous computing architecture. Journal of Grid Computing, pp. 1–16, 2020c.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

John W Tukey et al. Exploratory Data Analysis, volume 2. Reading, Mass., 1977.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pp. 11–30. Springer, 2015.

Wenlin Wang, Changyou Chen, Wenlin Chen, Piyush Rai, and Lawrence Carin. Deep metric learning with data summarization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 777–794. Springer, 2016.

Kai Zhong, Ruiqi Guo, Sanjiv Kumar, Bowei Yan, David Simcha, and Inderjit Dhillon. Fast classification with binary prototypes. In Artificial Intelligence and Statistics, pp. 1255–1263, 2017.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.