Best Subset Selection

We can use regsubsets() from the leaps library for the task. I’ll use the mtcars dataset to demonstrate its usage.

regsubsets() considers models with at most 8 variables by default. To truly include every feature, we need to raise the nvmax parameter.

The mtcars dataset has 11 variables. Since we want to predict mpg, the maximum number of predictors we can fit is 10.
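A minimal sketch of the fitting step, assuming the leaps package is already installed. The object names (n_predictors, best_fit) are my own choices for illustration.

```r
library(leaps)  # assumes install.packages("leaps") has been run

data(mtcars)

# 11 columns in total, minus the response (mpg), leaves 10 predictors
n_predictors <- ncol(mtcars) - 1

# Raise nvmax from its default of 8 so that every model size is considered
best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = n_predictors)
```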

The number on the left is the number of features in the model. For each model size, regsubsets() keeps the subset of features with the lowest RSS. For example, in the model with four features, the best four are hp, wt, qsec, and am.
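The table described above comes from calling summary() on the fitted object. The sketch below refits the model so it runs on its own; best_fit, best_summary, and four are illustrative names.

```r
library(leaps)

best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best_fit)

# Character matrix: one row per model size, "*" marks an included feature
best_summary$outmat

# The same information as a logical matrix; row 4 is the four-feature model
four <- best_summary$which[4, ]
selected <- setdiff(names(four)[four], "(Intercept)")
selected
```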

The object returned by regsubsets() contains 28 components, not all of which we need. Let’s extract only the stats we want.

We can then easily extract them as follows:
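One possible way to do the extraction is to collect the per-model statistics from the summary into a data frame. The choice of statistics and the name model_stats are assumptions of mine, not necessarily what the original code used.

```r
library(leaps)

best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best_fit)

# One row per model size, keeping only the statistics of interest
model_stats <- data.frame(
  n_features = seq_along(best_summary$rss),
  rss        = best_summary$rss,
  adj_r2     = best_summary$adjr2,
  cp         = best_summary$cp,
  bic        = best_summary$bic
)

model_stats
```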

Unfortunately, ggplot2 doesn’t work with the regsubsets class directly. Therefore, we need an extra step for visualization.
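The extra step can be as small as copying the statistics into a data frame that ggplot2 understands. This is a sketch assuming ggplot2 is installed and that adjusted R² is the statistic to plot; other criteria such as Cp or BIC could be plotted the same way.

```r
library(leaps)
library(ggplot2)  # assumes install.packages("ggplot2") has been run

best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best_fit)

# ggplot() needs a data frame, so pull the statistic out of the summary
plot_data <- data.frame(
  n_features = seq_along(best_summary$adjr2),
  adj_r2     = best_summary$adjr2
)

p <- ggplot(plot_data, aes(x = n_features, y = adj_r2)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = plot_data$n_features) +
  labs(x = "Number of features", y = "Adjusted R-squared")

print(p)
```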

It seems that model number 5 is the best of the pack. We can then extract its coefficients using coef().
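A short sketch of that extraction. The id argument of coef() selects the model size. The which.max() line assumes adjusted R² is the criterion used to rank the models, which is my assumption; a different criterion may favor a different size.

```r
library(leaps)

best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

# Model size with the highest adjusted R-squared (assumed criterion)
best_size <- which.max(summary(best_fit)$adjr2)
best_size

# Coefficients of the five-feature model, intercept included
coef_5 <- coef(best_fit, id = 5)
coef_5
```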

Unfortunately, regsubsets objects are not compatible with the popular predict() function. Therefore, to predict, we need to do a lot of data munging, which is covered in this post.
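For a rough idea of what that munging involves, here is a minimal sketch of a hand-written prediction helper. The function predict_regsubsets() is hypothetical (it is not part of leaps): it builds the design matrix with model.matrix() and multiplies the relevant columns by the chosen model's coefficients.

```r
library(leaps)

best_fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

# Hypothetical helper, not provided by leaps
predict_regsubsets <- function(object, formula, newdata, id) {
  design <- model.matrix(formula, newdata)  # design matrix with intercept column
  coefs  <- coef(object, id = id)           # coefficients of the model of size id
  # Keep only the columns used by that model, then compute X %*% beta
  drop(design[, names(coefs), drop = FALSE] %*% coefs)
}

preds <- predict_regsubsets(best_fit, mpg ~ ., newdata = mtcars, id = 5)
head(preds)
```

Passing the formula explicitly avoids digging it out of the fitted object's call, at the cost of having to repeat it.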

TL;DR: In an era where we can collect more data than we can analyze, Best Subset Selection is good to know, but it is hardly a preferred method.