Reduce the size of model objects saved to disk

The most useful content begins under the second section on the butcher package. Feel free to skip the backstory.

Backstory

“Learn to fail often and fail fast.” - Hadley Wickham

This summer I was given the chance to build a new R package that reduces the size of modeling objects saved to disk. This internship project was aptly named object scrubber and was catalyzed by two main problems:

  1. The environment (which may carry a lot of junk) is often saved along with the fitted model object; and
  2. Redundancies in the structure of the model object itself often exist.

These problems are a recurring theme, as the examples below show:

  1. Davis Vaughan’s lm-hell and call-hell GitHub gists.
  2. Zach Mayer’s comment on the caret package.
  3. Nina Zumel’s post on trimming the fat from glm models.

Given these starting points, I first latched on to this idea: “Once you’ve examined a model and decided that it’s satisfactory, all you probably want to do is predict.” I adapted Nina’s approach and removed every possible component of fitted model objects such that the predict generic for each specific model object still worked. Unfortunately, this brute-force method severely limited the applicability of the new package. What if we’re not just interested in predict()? What about all the other post-estimation functions, like fitted(), coef(), summary(), plot(), print(), and update()?

So we moved on to another idea: levels of trimming. What if we let users decide how severely model components are trimmed? The “greatest” severity would then be defined by the functions I had originally: remove everything so that only predict() works. The “least” severe might be to just remove the call attribute… Yet we hit another obstacle: how do we generalize the level of severity for each model object? By the amount of memory released? By the kind of post-estimation activity the user wants to perform? And how do we deal with type stability? Our new model object is butchered; it’s fundamentally different from the original fitted object.

It wasn’t until two rounds of refactoring and a month into this internship that I finally had a framework that made sense. We would provide a simple set of five S3 generics that removed the “bad apples”: the parts of a model object that often take up too much memory and are not used in post-estimation functions. This gives the user the option to pick and choose exactly what should be removed, based on (1) how much memory is released and (2) how much functionality is disabled as a result.

This “object scrubbing” tool is called butcher and is now available on CRAN.

The butcher package

library(butcher)

butcher is a new addition to the tidymodels ecosystem; its goal is to reduce the size of model objects saved to disk. It achieves this with the following five S3 generics:

  • axe_call(): To remove the call object.
  • axe_ctrl(): To remove control parameters associated with training.
  • axe_data(): To remove the training data.
  • axe_fitted(): To remove fitted values.
  • axe_env(): To remove the original environment in which the model was processed.

In the case where predict() is actually the only downstream function of interest, we provide the function butcher::butcher(), which executes all five of these methods on a given model object. The objects for which axe methods currently exist are the same as those listed for the parsnip package here. If there is a model object that is not yet covered but could benefit from some kind of axing, we’ve provided a streamlined way to contribute: just run new_model_butcher("your_object_class", "your_model_package") and a template will be generated for a set of relevant axe methods. For anyone looking to make their first open-source contribution, butcher is a great place to start, as there is a world of modeling classes we have yet to consider.
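For example, here is a minimal sketch (my own, not from the original post) of butchering an entire lm object in one go; fit_big and fit_small are just illustrative names:

# Fit a model, then run all five axe methods at once
fit_big <- lm(mpg ~ ., data = mtcars)
fit_small <- butcher::butcher(fit_big, verbose = TRUE)
# verbose = TRUE reports the memory released and any functionality disabled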

In practice

As someone who does not regularly keep tabs on memory consumption in R, I did not initially see the need for butcher. But here are three examples in which the axe generics demonstrate their utility:

Case 1: The use of formulas

Most modeling functions follow the framework outlined in the seminal “white book”, in which statistical models are defined by (1) a formula, (2) data, and (3) something that captures the discrepancies between the model fit and the data. To demo the value of butcher::axe_env(), I’m going to focus only on the formula and data components, as done in Lumley’s paper here. As he shares, most modeling functions consist of a call with a formula and a data argument. An example is a simple linear regression:

lm(y ~ x, data = df, weights = w)

But where do x, y, and w come from? These variables are first looked up in df. If they are not found there, they are then looked up in the environment in which the formula was created (which, oftentimes, is the environment in which lm() was called). So we might have situations in which some variables exist in df and some do not. Things would be a lot simpler if all the variables specified had to exist in the data argument, but that was not a requirement when the formula object was first introduced. As a result, we are forced to deal with the problem Lumley calls attention to:

Authors of modeling and graphics functions are required to implement a limited form of dynamic scope, which they have not done in an entirely consistent way.
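To make this lookup rule concrete, here is a small illustration of my own (not from the original post), in which the weights live only in the calling environment rather than in the data frame:

d <- mtcars[, c("mpg", "cyl")]        # mpg and cyl live in the data
w <- runif(nrow(d))                   # w lives only in the enclosing environment
lm(mpg ~ cyl, data = d, weights = w)  # w is found via the formula's environment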

A formula (an expression distinguished by the ~ symbol, known as the twiddle) thus carries along the environment in which it was created, which can become a problem when we bring the data and formula together to actually fit a model. To be more explicit, most linear models and their extensions rely on pre-processing the formula and data to arrive at a model matrix, and this pre-processing consists of three basic steps (a rough sketch in code follows the list):

  1. Converting the formula into a terms object.
  2. Combining the terms object with the data via model.frame().
  3. Generating the desired model matrix from this output.
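Here is what these three steps look like in base R; this is my own illustration, not code from the post:

f  <- Species ~ .                   # a formula
tt <- terms(f, data = iris)         # 1. convert the formula into a terms object
mf <- model.frame(tt, data = iris)  # 2. combine the terms object with the data
mm <- model.matrix(tt, mf)          # 3. generate the desired model matrix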

Since the output of terms() and model.frame(), two key pre-processing functions, is often carried along with the overall fitted model, their use can result in fitted model objects that take up much more memory than necessary.

Here is an example using terms(), which processes the formula into a convenient form for data pre-processing.

create_terms_object <- function() {
  some_junk_in_environment <- runif(1e6) # we didn't know about
  return(terms(Species ~ ., data = iris))
}

example <- create_terms_object()

We use the lobstr package to assess the memory footprint of this new terms object:

suppressWarnings(library(lobstr))
obj_size(example)
## 8,003,928 B

Why does our simple terms object take up more than 8 MB? If we check the object size of a terms object created without wrapping it in an environment that includes “some junk”, we find it is only:

obj_size(terms(Species ~ ., data = iris))
## 3,712 B
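The difference is not in the terms object itself but in the environment it drags along. One way to see this (my own check, not from the original post) is to peek inside that environment:

env <- environment(example)  # the environment attached to the terms object
ls(env)
## [1] "some_junk_in_environment"
obj_size(env$some_junk_in_environment)
## 8,000,048 B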

Of course, this is a rather contrived example in which we explicitly included a junk object we knew we didn’t need, but such “junk” can inadvertently appear in complex model development pipelines. One way to clean up the parts of fitted models that might have junky environments attached is to run:

cleaned_example <- butcher::axe_env(example, verbose = TRUE)
## ✔ Memory released: '8,000,136 B'

The memory released brings us back to the memory footprint we expect without the some_junk_in_environment object.
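As a quick sanity check (not in the original post), the cleaned object is back down to a few kilobytes, i.e., the original 8,003,928 B minus the 8,000,136 B released:

obj_size(cleaned_example)
## 3,792 B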

Case 1a: The use of quosures

As formulas are the inspiration for quosures (and quosures are a subset of formulas), the use of quosures carries with it similar issues. Quosures and formulas are so alike that, in early versions of tidy evaluation, formulas were used instead! Formulas were attractive as they required only one keystroke: only ~ was needed to create a formula. Yet it was this very feature that made the addition of a quosure essential; as Hadley notes, “there is no clean way to make ~ a quasiquoting function,” that is, a function that supports both quotation and unquotation. More on quasiquotation here.

While the differences between formulas and quosures are subtle, quosures have become ubiquitous given the ease with which they let developers combine their own code with that of their users. Capturing an expression from a user is as simple as:

library(rlang)
user_supplied_expression <- function(x) enquo(x)
user_supplied_expression(y + 2*z)
## <quosure>
## expr: ^y + 2 * z
## env:  global
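For illustration (my own sketch, not from the original post), such a captured quosure can later be evaluated against data of our choosing with rlang::eval_tidy(); the data frame df below is just made up:

df <- data.frame(y = 1:3, z = 4:6)
q <- user_supplied_expression(y + 2 * z)
eval_tidy(q, data = df)  # y and z are looked up in df first
## [1]  9 12 15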

Deferring evaluation like this is particularly useful when we want to control when and where the user-supplied expression is evaluated. A good example is the recipes package, which relies on quosures in most of its transformation functions:

suppressMessages(suppressWarnings(library(recipes)))

create_recipes_object <- function() {
  some_junk_in_environment <- runif(1e6) # we don't need
  return(
    recipe(mpg ~ cyl, data = mtcars) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors()) %>%
      step_spatialsign(all_predictors())
  )
}

example <- create_recipes_object()
# Check original size
obj_size(example)
## 8,012,064 B
# Remove "junky" environments
cleaned_example <- butcher::axe_env(example, verbose = TRUE)
## ✔ Memory released: '8,003,536 B'

Case 2: The call object

Most of the time there is a call object embedded within the final model fit. Storing this call is generally done to support summary() and print() methods. For example, for a simple linear regression model, we’d see the call printed out:

fit <- lm(mpg ~ cyl, data = mtcars)
fit
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

Storing the call in this manner is fine, as it rarely hogs memory:

obj_size(fit$call)
## 728 B

A problem arises when package developers evaluate this lm() function call, not as done above, but with the help of do.call(). In other words, they call their modeling function programmatically:

.f <- mpg ~ cyl
.data <- mtcars
do.call(lm, list(formula = .f, data = .data))
## 
## Call:
## (function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- as.vector(model.offset(mf))
##     if (!is.null(offset)) {
##         if (length(offset) != NROW(y)) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 length(offset), NROW(y)), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 
##             3) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else if (is.matrix(y)) nrow(y) else length(y))
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (is.matrix(y)) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## })(formula = mpg ~ cyl, data = structure(list(mpg = c(21, 21, 
## 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 
## 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 
## 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4), cyl = c(6, 6, 4, 
## 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 
## 8, 4, 4, 4, 8, 6, 8, 4), disp = c(160, 160, 108, 258, 360, 225, 
## 360, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 
## 440, 78.7, 75.7, 71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 
## 95.1, 351, 145, 301, 121), hp = c(110, 110, 93, 110, 175, 105, 
## 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 
## 65, 97, 150, 150, 245, 175, 66, 91, 113, 264, 175, 335, 109), 
##     drat = c(3.9, 3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 
##     3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 
##     4.22, 3.7, 2.76, 3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 
##     3.62, 3.54, 4.11), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 
##     3.46, 3.57, 3.19, 3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 
##     5.424, 5.345, 2.2, 1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 
##     3.845, 1.935, 2.14, 1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 
##     17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 
##     18.9, 17.4, 17.6, 18, 17.98, 17.82, 17.42, 19.47, 18.52, 
##     19.9, 20.01, 16.87, 17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 
##     14.5, 15.5, 14.6, 18.6), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 
##     1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 
##     0, 0, 0, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
##     0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), 
##     gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 
##     3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 
##     4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 
##     1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2)), row.names = c("Mazda RX4", 
## "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", 
## "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
## "Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
## "Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
## "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
## "Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", 
## "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
## "Volvo 142E"), class = "data.frame"))
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

Haha, sorry. I wanted you to scroll through that :) The point is, the data is literally inlined into the stored call, which you can avoid by using rlang::call2() (if you are contributing to a package), as detailed here, or which you can fix after the fact with butcher.
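Here is a rough sketch of the call2() approach (my own example, and it assumes you control the code that builds the call, with .f and .data as defined above): constructing the call with a symbol for the data keeps the rows out of the stored call.

library(rlang)
tidy_call <- call2("lm", formula = .f, data = expr(.data))  # data enters as a symbol
tidy_fit <- eval(tidy_call)
tidy_fit$call
## lm(formula = mpg ~ cyl, data = .data)

Alternatively, axe the bloated call after the fact with butcher: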

ugly_fit <- do.call(lm, list(formula = .f, data = .data))
cleaned_fit <- butcher::axe_call(ugly_fit, verbose = TRUE)
## ✔ Memory released: '66,880 B'
## ✖ Disabled: `print()`, `summary()`

More on all the axe generics available here.

In general

Because I rarely see the intermediate stages that produce a fitted model object, I wasn’t aware of all the places where memory can leak in. With butcher, I can manage and make the most of the memory I have available, but it’s only a short-term fix, a band-aid solution to a much deeper issue: the need for better conventions around model representation and implementation.

Acknowledgements

butcher is a community effort. Huge gratitude to Davis, Max, Alex, Hadley and Bruna for your guidance throughout.
