R’s model formula infrastructure was discussed in my previous post. Despite the elegance and convenience of the formula method, there are some aspects that are limiting.

Limitations to Extensibility

The model formula interface does have some limitations:

It can be kludgy with many operations on many variables (e.g., log transforming 50 variables via a formula without using paste).
The predvars aspect (discussed in my previous post) limits the utility of the operations. Suppose a formula had knn_impute(x1) + knn_impute(x2). Do we embed the training set twice in predvars?
Operations are constrained to single columns or features (excluding interaction specifications). For example, you cannot apply a single operation, such as PCA, jointly to a group of columns.

I’ll use PCA feature extraction a few times here since it is probably familiar to many readers.
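As a rough sketch of what this limitation means in practice, the multi-column step has to be done by hand outside of the formula. The data set and the names dat, x1, x2, x3 and y below are made up for illustration; only base R functions (prcomp, lm, predict) are used.

```r
# Simulated data; all object and column names here are hypothetical.
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dat$y <- dat$x1 - dat$x2 + rnorm(50)

# Step 1: the multi-column operation happens outside of the formula.
pca <- prcomp(dat[, c("x1", "x2", "x3")], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])   # keep the first two components
scores$y <- dat$y

# Step 2: the formula only ever sees the derived columns PC1 and PC2.
fit <- lm(y ~ PC1 + PC2, data = scores)

# New samples need the stored PCA object, which lm() knows nothing about.
new_dat <- data.frame(x1 = 0.5, x2 = -1, x3 = 2)
new_scores <- as.data.frame(predict(pca, newdata = new_dat)[, 1:2, drop = FALSE])
pred <- predict(fit, newdata = new_scores)
```

The point of the sketch is the bookkeeping: the PCA object lives apart from the model, so the user has to remember to apply it before every prediction.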

Everything Happens at Once

Some of our data operations might be sequential. For example, it is not unreasonable to have predictors that require:

imputation of a missing value
centering and scaling
conversion to PCA scores

Given that the formula method operations happen (in effect) at once, this workflow requires some sort of custom solution. While caret::preProcess was designed for this sequence of operations, it does so in a single call, as opposed to a progression of steps exemplified by ggplot2, dplyr, or magrittr.

Allowing a series of steps to be defined in order is more consistent with how data analysis is conducted. However, it does raise the complexity of the underlying implementation. For example, caret::preProcess dictates the possible sequence of tasks to be: filters, single-variable transformations, normalizations, imputation, signal extraction, and spatial sign. This avoids nonsensical sequences that center the data before applying a Box-Cox calculation (which requires positive data).
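To make the idea of ordered steps concrete, here is a minimal base R sketch of the imputation, normalization and PCA sequence. It is not how caret::preProcess is implemented; simple median imputation stands in for something like KNN imputation, and the helper names impute and normalize are hypothetical.

```r
set.seed(2)
train <- matrix(rnorm(200), ncol = 4, dimnames = list(NULL, paste0("x", 1:4)))
train[sample(length(train), 10)] <- NA    # sprinkle in missing values
test <- matrix(rnorm(40), ncol = 4, dimnames = list(NULL, colnames(train)))
test[1, 2] <- NA

# Step 1: imputation (column medians learned from the training set).
med <- apply(train, 2, median, na.rm = TRUE)
impute <- function(x) {
  for (j in seq_len(ncol(x))) x[is.na(x[, j]), j] <- med[j]
  x
}
train_imp <- impute(train)

# Step 2: centering and scaling, using training-set means and SDs.
mu <- colMeans(train_imp)
sdev <- apply(train_imp, 2, sd)
normalize <- function(x) sweep(sweep(x, 2, mu, "-"), 2, sdev, "/")
train_norm <- normalize(train_imp)

# Step 3: conversion to PCA scores (data are already normalized).
pca <- prcomp(train_norm, center = FALSE, scale. = FALSE)

# The same three steps, in the same order, applied to new data.
test_scores <- predict(pca, normalize(impute(test)))
```

Each step depends on the output of the previous one, and each carries state estimated from the training set, which is exactly what a single at-once formula evaluation has trouble expressing.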

No Recycling

As a corollary to the point above, there is no way to recycle the terms between models that share the same formula and data/environment. For example, if I fit a CART model to a data set with many predictors, the random forest model (theoretically) shouldn’t need to recreate the same terms information about the design matrix. If the model function has a non-formula interface (e.g., mod_func(x, y)), this can make it easier. However, many do not.
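When a non-formula interface does exist, the recycling can be done by hand. The sketch below uses lm.fit and glm.fit from base R as stand-ins for the CART and random forest fits, purely so that it runs without extra packages; the data are simulated.

```r
set.seed(3)
dat <- data.frame(matrix(rnorm(100 * 5), ncol = 5))   # columns X1..X5
dat$y <- dat$X1 + rnorm(100)

# Do the formula/terms work once ...
mf <- model.frame(y ~ ., data = dat)
x <- model.matrix(attr(mf, "terms"), mf)
y <- model.response(mf)

# ... and hand the same x and y to two different fitting routines.
fit_ols <- lm.fit(x, y)
fit_glm <- glm.fit(x, y, family = gaussian())
```

Both fits reuse the same design matrix, so the terms and model frame are only computed one time.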

Also, suppose that one of the pre-processing steps is computationally expensive. We’d like to be able to store the state of the results and then add another layer of computations (perhaps as a separate object).

Formulas and Wide Datasets

The terms object saves a matrix with as many rows as formula variables and at least as many columns (depending on interactions, etc.). Most of this data is zero and a non-sparse representation is used. The current framework was built at a time when there was more focus on interactions, nesting and other operations on a small scale.

It is unlikely that models would have hundreds of interaction terms, but now it is not uncommon to have hundreds or thousands of main effects. As the number of predictors increases, this takes up an inordinate amount of execution time. For simple randomForest or rpart calls, the formula/terms work can account for most of the execution time. For example, we can calculate how much time functions spend generating the model matrix relative to the total execution time. For rpart and randomForest, we used the default arguments and did the calculations with a simulated data set of 200 data points and varying numbers of predictors.
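A rough way to run this kind of measurement yourself is sketched below. It is not the benchmark described above: lm.fit stands in for rpart and randomForest so that only base R is needed, the value of p is arbitrary, and timings will vary by machine.

```r
set.seed(4)
n <- 200
p <- 100                                   # try larger values to see the trend
dat <- as.data.frame(matrix(rnorm(n * p), ncol = p))
dat$y <- rnorm(n)

# Time spent only on the formula machinery (terms, model frame, model matrix).
formula_time <- system.time({
  mf <- model.frame(y ~ ., data = dat)
  x <- model.matrix(attr(mf, "terms"), mf)
})["elapsed"]

# Time spent when the matrix goes straight to a non-formula fitting routine.
fit_time <- system.time(
  fit <- lm.fit(x, model.response(mf))
)["elapsed"]

c(formula = unname(formula_time), fit = unname(fit_time))
```

Repeating this over a grid of p values gives the share of time attributable to the formula work for a given fitting function.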

This is especially problematic for ensemble models. For example, ipred:::ipredbagg creates an ensemble of rpart trees. Since rpart only has a formula method, the footprint of the bagged model object can become very large if X trees are contained in the ensemble. Alternatively, randomForest.formula takes the approach of generating the terms once and feeding the model frame to randomForest.default. This does not work for rpart since there is no non-formula method exposed. Some functions (e.g., lm, survival::coxph) have arguments that can be used to prevent the terms and similar objects from being returned. This saves space but prevents new samples from being predicted. A little more detail can be found here.

One issue is the "factors" attribute of the terms object (discussed in the previous post). This is a non-sparse matrix that has a row for each variable in the formula (the outcome included) and a column for each model term (e.g. main effects, interactions, etc.). The purpose of this object is to know which predictors are involved in which terms.

The issue is that this matrix can get very large and usually has a high proportion of zeros. For example:
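A small sketch with a made-up formula shows the structure. The variable names y, a, b and c are hypothetical; terms() can be called on a formula without any data.

```r
# Three main effects and one two-way interaction.
trm <- terms(y ~ a + b + c + a:b)
fac <- attr(trm, "factors")

fac               # rows are variables, columns are model terms
dim(fac)          # number of variables by number of terms
mean(fac == 0)    # proportion of entries that are zero
```

Even in this tiny case most entries are zero, since each main-effect column has a single nonzero entry.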

As the number of predictors increases, the proportion of ones is likely to approach zero very quickly. For example:
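The following sketch computes that proportion for main-effects-only formulas of increasing size. The helper prop_ones and the variable names are hypothetical; reformulate() is used to build the formulas.

```r
# Proportion of nonzero entries in the "factors" matrix for a formula
# with p main effects and no interactions.
prop_ones <- function(p) {
  vars <- paste0("x", seq_len(p))
  trm <- terms(reformulate(vars, response = "y"))
  fac <- attr(trm, "factors")
  mean(fac != 0)
}

sapply(c(2, 10, 100, 1000), prop_ones)
```

With main effects only, the matrix has p + 1 rows and p columns with a single one per column, so the proportion is 1 / (p + 1).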

Again, it is doubtful that a model with a large number of predictors will have a correspondingly large number of high-level interactions (see the Pareto principle applied to modeling).

Variable Roles

Some packages have implemented extensions of the basic formula. There are cases when formulas are needed for specific sub-models. For example, a random coefficient model can be fit with the lmer function. In this case, a model is specified for a particular clustering variable (e.g., a subject in a clinical trial). The code is an example of how lmer syntax works:
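As an illustration, assuming the sleepstudy data that ships with lme4 (columns Reaction, Days and Subject), a random-coefficient formula can be written and inspected with base R alone. The fit itself is left as a comment because it needs lme4 to be installed.

```r
# A random-coefficient formula in lme4 syntax.
f <- Reaction ~ Days + (Days || Subject)

# Base R sees three variables and cannot tell their roles apart.
all.vars(f)

# With lme4 installed, the model could be fit with:
# fit <- lme4::lmer(f, data = lme4::sleepstudy)
```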

Here Subject is important to the model-fitting routine, but not as a predictor. Similarly, the Bradley-Terry model can be used to model competitions and contests. A model on a set of boxers in a series of contests can include terms for their reach:

Another extension of basic formulas comes from the modeltools and party packages. The function party::mob fits a tree-based model with regression models in the terminal nodes. For this model, a separate list of predictors is used as splitting variables (to define the tree structure) and another set of regression variables is modeled in the terminal nodes. An example of this call is:
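A minimal sketch of the two-part formula that this kind of interface relies on is shown here, using hypothetical variable names and base R only. It does not call mob itself; it only shows how the two sets of variables can be told apart by position around the vertical bar.

```r
# Hypothetical variables: y is the outcome, x1 and x2 are modeled in the
# terminal nodes, and z1 to z3 are only used to define the splits.
f <- y ~ x1 + x2 | z1 + z2 + z3

rhs <- f[[3]]                          # the call `|`(x1 + x2, z1 + z2 + z3)
regression_vars <- all.vars(rhs[[2]])  # left of the bar
splitting_vars  <- all.vars(rhs[[3]])  # right of the bar
```

The roles are encoded purely by position, which works for two roles but does not generalize well to the longer list of roles discussed later.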

The commonality between these three examples is that there are variables that are critical to the model but do not play the role of standard regression terms. For lmer, Subject is the independent experimental unit. For mob, we have variables to be used for splitting, etc.

There are similar issues on the left-hand side of the formula. When there are multivariate outcomes, different packages have different approaches:
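One such approach, sketched here with the built-in mtcars data, is the base R convention of binding the outcomes into a matrix on the left-hand side. The choice of mpg and disp as outcomes is arbitrary, and other packages use different syntax.

```r
# Two outcomes modeled at once by lm() via cbind() on the left-hand side.
fit <- lm(cbind(mpg, disp) ~ wt + hp, data = mtcars)

class(fit)       # a multivariate linear model object
dim(coef(fit))   # one column of coefficients per outcome
```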

The overall point here is that, for the most part, the formula method assumes that there is one variable on the left-hand side of the tilde and that the variables on the right-hand side are predictors (exceptions are discussed below). One can envision other roles that columns could play in the analysis of data. Besides the examples given above, variables could be used for

outcomes
predictors
stratification
data for assessing model performance (e.g., loan amount to compute expected loss)
conditioning or faceting variables (e.g., lattice or ggplot2)
random effects or hierarchical model ID variables
case weights
offsets
error terms (limited to Error in the aov function)

The last three items on this list are currently handled in formulas as “specials” or have existing functions. For example, when the model function has a weights argument, the current formula/terms framework uses a function (model.weights) to extract the weights, and also makes sure that the weights are not included as covariates. The same is true for offsets.
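The weights mechanism can be seen directly with base R. In this sketch the use of hp as a case weight in mtcars is arbitrary and only for illustration.

```r
# The weights are evaluated in the data and stored in the model frame.
mf <- model.frame(mpg ~ wt, data = mtcars, weights = hp)
names(mf)                 # includes a special "(weights)" column

# Extract the weights with the dedicated accessor.
w <- model.weights(mf)

# The design matrix is built from the terms, so the weights are not a covariate.
x <- model.matrix(attr(mf, "terms"), mf)
colnames(x)
```

The weights travel with the model frame under a reserved column name, which is how they are kept out of the design matrix.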

Summary

Some limitations of the current formula interface can be mitigated by writing your own formula-processing code or by using the Formula package.

However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.
