天天看點

The R Formula Method: The Bad Parts Limitations to Extensibility Everything Happens at Once Summary

R’s model formula infrastructure was discussed in my previous post. Despite the elegance and convenience of the formula method, there are some aspects that are limiting.

Limitations to Extensibility

The model formula interface does have some limitations:

  • It can be kludgy with many operations on many variables (e.g., log transforming 50 variables via a formula without using 

    paste

    )
  • The 

    predvars

     aspect (discussed in my previous post) limits the utility of the operations. Suppose a formula had: 

    knn_impute(x1) + knn_impute(x2)

    . Do we embed the training set twice in 

    predvars

    ?
  • Operations are constrained to single columns or features (excluding interaction specifications). For example, you cannot do

I’ll use PCA feature extraction a few times here since it is probably familiar to many readers.

Everything Happens at Once

Some of our data operations might be sequential. For example, it is not unreasonable to have predictors that require:

  1. imputation of a missing value
  2. centering and scale
  3. conversion to PCA scores

Given that the formula method operations happen (in effect) at once, this workflow requires some sort of custom solution. While 

caret::preProcess

 was designed for this sequence of operations, it does so in a single call, as opposed to a progression of steps exemplified by 

ggplot2

dplyr

, or 

magrittr

.

Allowing a series of steps to be defined in order is more consistent with how data analysis is conducted. However, it does raise the complexity of the underlying implementation. For example, 

caret::preProcess

 dictates the possible sequence of tasks to be: filters, single-variable transformations, normalizations, imputation, signal extraction, and spatial sign. This avoids nonsensical sequences that center the data before applying a Box-Cox calculation (which requires positive data).

No Recycling

As a corollary to the point above, there is no way to recycle the 

terms

 between models that share the same formula and data/environment. For example, if I fit a CART model to a data set with many predictors, the random forest model (theoretically) shouldn’t need to recreate the same 

terms

 information about the design matrix. If the model function has the non-formula interface (e.g., 

mod_func(x, y)

), this can make it easier. However, many do not.

Also, suppose that one of the pre-processing steps is computationally expensive. We’d like to be able to store the state of the results and then add another layer of computations (perhaps as a separate object).

Formulas and Wide Datasets

The 

terms

 object saves a matrix with as many rows as formula variables and at least as many columns (depending on interactions, etc). Most of this data is zero and a non–sparse representation is used. The current framework was built in a time where there was more focus on interactions, nesting and other operations on a small scale.

It is unlikely that models would have hundreds of interaction terms, but now it is not uncommon to have hundreds or thousands of main effects. As the number of predictors increases, this takes up an inordinate amount of execution time. For simple 

randomForest

 or 

rpart

 calls, the formula/

terms

 work can account for most of the execution time. For example, we can calculate how much time functions spend generating the model matrix relative to the total execution time. For 

rpart

 and 

randomForest

, we used the default arguments and did the calculations with a simulated data set of 200 data points and varying numbers of predictors:

This is especially problematic for ensemble models. For example, 

ipred:::ipredbagg

 creates an ensemble of 

rpart

 trees. Since 

rpart

 only has a formula method, the footprint of the bagged model object can become very large if X trees are contained in the ensemble. Alternatively, 

randomForest.formula

 takes the approach of generating the 

terms

 once and feeding the model frame to 

randomForest.default

. This does not work for 

rpart

 since there is no non-formula method exposed. Some functions (e.g., 

lm

survival::coxph

) have arguments that can be used to prevent the 

terms

 and similar objects from being returned. This saves space but prevents new samples from being predicted. A little more detail can be found here.

One issue is the 

"factors"

 attribute of the 

terms

 object (discussed in the previous post). This is a non-sparse matrix that has a row for each predictor in the formula and a column for each model term (e.g. main effects, interactions, etc.). The purpose of this object is to know which predictors are involved in which terms.

The issue is that this matrix can get very large and usually has a high proportion of zeros. For example:

As the number of predictors increases, the rate of ones is likely to approach a value close to zero very quickly. For example:

Again, it is doubtful that a model with a large number of predictors will have a correspondingly large number of high-level interactions (see the Pareto principle applied to modeling).

Variable Roles

Some packages have implemented extensions of the basic formula. There are cases when formula are needed for specific sub-models. For example, a random coefficient model can be fit with the 

lmer

 function. In this case, a model is specified for a particular clustering variable (e.g., a subject in a clinical trial). The code is an example of how 

lmer

 syntax works:

Here 

Subject

 is important to the model-fitting routine, but not as a predictor. Similarly, the Bradley-Terry model can be used to model competitions and contests. A model on a set of boxers in a series of contests can include terms for their reach:

Another extension of basic formulas comes from the 

modeltools

 and 

mboost

 packages. The function 

mboost::mob

 fits a tree-based model with regression models in the terminal nodes. For this model, a separate list of predictors are used as splitting variables (to define the tree structure) and another set of regression variables that are modeled in the terminal nodes. An example of this call is:

The commonality between these three examples is that there are variables that are critical to the model but do not play the role of standard regression terms. For 

lmer

Subject

 is the independent experimental unit. For 

mob

, we have variables to be used for splitting, etc.

There are similar issues on the left-hand side of the formula. When there are multivariate outcomes, different packages have different approaches:

The overall point here is that, for the most part, the formula method assumes that there is one variable on the left-hand side of the tilde and that the variables on the right-hand side are predictors (exceptions are discussed below). One can envision other roles that columns could play in the analysis of data. Besides the examples given above, variables could be used for

  • outcomes
  • predictors
  • stratification
  • data for assessing model performance (e.g., loan amount to compute expected loss)
  • conditioning or faceting variables (e.g., 

    lattice

     or 

    ggplot2

    )
  • random effects or hierarchical model ID variables
  • case weights
  • offsets
  • error terms (limited to 

    Error

     in the 

    aov

     function)

The last three items on this list are currently handled in formulas as “specials” or have existing functions. For example, when the model function has a 

weights

 argument, the current formula/

terms

 frame work uses a function (

model.weights

) to extract the weights, and also makes sure that the weights are not included as covariates. The same is true for offsets.

Summary

Some limitations of the current formula interface can be mitigated by writing your own or utilizing the 

Formula

 package.

However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.