Cross-Validation

An ingenious framework to estimate prediction error using the training set.

Cross-validation enables in-sample estimation of the prediction error, that is, a way to estimate how well a model generalizes using only the training set. This is extremely useful since it lets us evaluate and compare models directly on the training set without touching the test set. If the test set were used to select variables, we would overfit to it and our estimate of the prediction error would be biased.
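
As a minimal sketch of this idea (the toy data, the ElasticNet model and the 80/20 split are assumptions for illustration), scikit-learn's cross_val_score estimates the prediction error from the training split alone:

# Minimal sketch: hold out a test set, then estimate generalization
# error using only the training split via cross-validation.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_regression(n_features=50, n_informative=10, n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 10-fold cross-validation on the training split only; the test set stays untouched
scores = cross_val_score(ElasticNet(), X_train, y_train, cv=10,
                         scoring='neg_mean_squared_error')
print(scores.mean())  # in-sample estimate of the prediction error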

Cross-validation is the go-to strategy for selecting models, features and hyperparameters, and rightfully so. Sometimes standard cross-validation is not feasible, e.g. for time series data, where the observations are ordered in time. One then has to rely on either analytical criteria such as AIC and BIC or on holdout validation data. For time series, this means validating by forecasting future values.
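
As an aside, one way to implement this "forecast the future" style of validation is scikit-learn's TimeSeriesSplit, a forward-chaining splitter where each validation fold lies strictly after the observations used for training. A sketch, with the toy data standing in for a genuinely time-ordered series:

# Sketch of forward-chaining validation for time-ordered data:
# each validation fold comes strictly after its training data.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

X, y = make_regression(n_features=50, n_informative=10, n_samples=1000)  # assume rows are in time order

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(ElasticNet(), X, y, cv=tscv,
                         scoring='neg_mean_squared_error')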

To do cross-validation, the training set is randomly partitioned into K equally sized folds.

If your data is imbalanced, be sure to stratify your folds so that each fold keeps the same class distribution (scikit-learn's StratifiedKFold does this).

K-1 of the folds are then used for training and the remaining fold for testing. This is repeated K times so that each fold serves as the test fold exactly once, which means every observation appears in a test fold exactly once. The cross-validation score for a model candidate is then its mean score over the K folds.
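
To make the mechanics concrete, here is a sketch of the loop that K-fold cross-validation performs internally (the toy data and the ElasticNet model are assumptions for illustration):

# Sketch of the K-fold loop: train on K-1 folds, score on the held-out
# fold, and average the K scores.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold  # StratifiedKFold for imbalanced classification

X, y = make_regression(n_features=50, n_informative=10, n_samples=1000)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X):
    model = ElasticNet()
    model.fit(X[train_idx], y[train_idx])                 # train on K-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out fold

cv_score = np.mean(scores)  # the cross-validation score for this candidate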

Using cross-validation we can tune our hyperparameters, i.e. find the best-performing settings for the model. Remember that hyperparameters are the parameters of a model that are not estimated from the data but chosen before training.

Scikit-learn offers cross-validation and hyperparameter search neatly packaged together. Assume that we are interested in finding the best regularization strength and L1 ratio for an ElasticNet regressor:

# Let's first set the scene by importing the regressor and generating data

# Import the ElasticNet regressor
from sklearn.linear_model import ElasticNet

# Make a toy regression dataset using sklearn's built-in function
from sklearn.datasets import make_regression

# Import grid search and randomized search cross-validation
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Import numpy and scipy.stats to create parameter ranges and distributions
import numpy as np
from scipy.stats import uniform

# Generate a toy example with 50 features, 10 of them informative
X, y = make_regression(n_features=50, n_informative=10, n_samples=1000)

# Create an ElasticNet regressor
# Note that we can specify any static hyperparameters here,
# e.g. a generous iteration budget so the solver converges:
regressor = ElasticNet(max_iter=10000)



# Create a parameter grid with feasible values:
param_grid = {
    # Ratio of L1 vs L2 regularization (lasso vs ridge)
    'l1_ratio': np.linspace(0.1, 1, num=10),

    # Regularization strength, log-spaced between 0.1 and 4
    'alpha': np.logspace(np.log10(0.1), np.log10(4), num=10)
}

# Do 10-fold cross-validation over the Cartesian product of all parameter values
grid_search = GridSearchCV(regressor, param_grid, cv=10, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# Grab the best set of parameters
grid_search.best_params_
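
Besides best_params_, the fitted search object also exposes the corresponding cross-validation score and an estimator refitted on the full training data; a short usage note:

# Best mean cross-validation score (negative MSE, since scikit-learn maximizes scores)
grid_search.best_score_

# Estimator refitted on all the data passed to fit() using the best parameters
grid_search.best_estimator_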

The grid search cross-validation used above considers every combination in the parameter grid. This quickly becomes infeasible for more than a handful of parameters and values, since the number of fits grows multiplicatively with each added parameter.
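
A quick back-of-envelope count for the grid above makes the point:

# Every candidate (one per grid combination) is fitted once per fold
n_candidates = len(param_grid['l1_ratio']) * len(param_grid['alpha'])
n_fits = n_candidates * 10  # 10-fold cross-validation
print(n_candidates, n_fits)  # grows multiplicatively with each added parameter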

Randomized search is often more practical: it randomly samples parameter settings to evaluate. Parameters can be provided as lists or as distributions, and continuous parameters are best provided as distributions. In our example, the l1_ratio parameter only needs to be searched coarsely and linearly over 0.1-1, while the regularization strength could take any value in its range. The alpha parameter can therefore be replaced by a uniform distribution:

# With randomized search, lists are sampled uniformly at random and
# continuous distributions can be specified instead.
params = {
    # Ratio of L1 vs L2 regularization (lasso vs ridge)
    'l1_ratio': np.linspace(0.1, 1, num=10),

    # Regularization strength, NOTE: a distribution instead of a list of values
    'alpha': uniform(loc=1, scale=2)
}

# Do 10-fold cross-validation using randomized search
random_search = RandomizedSearchCV(regressor, params, cv=10, scoring='neg_mean_squared_error')
random_search.fit(X, y)

# Grab the best set of parameters
random_search.best_params_
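
Note that the number of sampled candidates defaults to 10 and is controlled by n_iter; a sketch with a larger search budget:

# Sample 100 parameter settings instead of the default 10
random_search = RandomizedSearchCV(regressor, params, n_iter=100, cv=10,
                                   scoring='neg_mean_squared_error')
random_search.fit(X, y)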
