Improving Sample and Feature Selection with Principal Covariates Regression

Rose K. Cersonsky, Benjamin A. Helfrecht, Edgar A. Engel, Michele Ceriotti

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it can be used to improve the computational performance, and often also the transferability, of a model. Here we focus on two popular sub-selection schemes which have been applied to this end: CUR decomposition, which is based on a low-rank approximation of the feature matrix, and Farthest Point Sampling (FPS), which relies on the iterative identification of the most diverse samples and most discriminating features. We modify these unsupervised approaches, incorporating a supervised component in the same spirit as the Principal Covariates Regression (PCovR) method. We show that incorporating target information provides selections that perform better in supervised tasks, which we demonstrate with ridge regression, kernel ridge regression, and sparse kernel regression. We also show that incorporating aspects of simple supervised learning models can improve the accuracy of more complex models, such as feed-forward neural networks. We present adjustments to minimize the impact that any sub-selection may incur when performing unsupervised tasks. We demonstrate the significant improvements associated with the use of PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
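
To make the blended selection concrete, below is a minimal NumPy sketch of PCov-style farthest point sampling for sample selection. It assumes the PCovR construction K~ = alpha * X X^T + (1 - alpha) * Yhat Yhat^T, with Yhat the ridge-regression predictions of the targets; the function name pcov_fps, the regularization value, and the choice of starting point are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of PCov-style farthest point sampling (PCov-FPS) for
# sample selection. The blended kernel
#     K~ = alpha * X X^T + (1 - alpha) * Yhat Yhat^T
# follows the PCovR construction described in the abstract; helper names
# and parameter defaults here are illustrative assumptions.
import numpy as np

def pcov_fps(X, y, n_select, alpha=0.5, ridge=1e-8, start=0):
    """Greedily pick n_select rows of X that are maximally spread
    in the metric induced by the PCovR-blended kernel."""
    X = X - X.mean(axis=0)            # center features
    Y = y.reshape(len(X), -1)
    Y = Y - Y.mean(axis=0)            # center targets
    # Ridge-regression predictions: Yhat = X (X^T X + ridge I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Yhat = X @ W
    # Blend structural (Gram matrix) and target information
    K = alpha * (X @ X.T) + (1 - alpha) * (Yhat @ Yhat.T)
    # Squared distances induced by the kernel: d2(i,j) = K_ii + K_jj - 2 K_ij
    diag = np.diag(K)
    selected = [start]
    d2 = diag + diag[start] - 2 * K[:, start]   # distance to the selection
    for _ in range(1, n_select):
        i = int(np.argmax(d2))        # farthest point from current selection
        selected.append(i)
        d2 = np.minimum(d2, diag + diag[i] - 2 * K[:, i])
    return np.array(selected)

# Usage: select 10 of 100 synthetic samples
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20)
print(pcov_fps(X, y, n_select=10, alpha=0.5))
```

Setting alpha = 1 recovers plain unsupervised FPS, while decreasing alpha progressively weights the selection toward samples that are diverse in the space of predicted properties, which is how target information enters the selection.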