An ingenious framework to estimate prediction error using the training set.
Cross-validation lets us estimate the prediction error from the training data alone. That is, it is a strategy for estimating how well a model generalizes using only the training set. This is extremely useful since it allows us to compare model candidates without touching the test set. If the test set were used to select variables, we would overfit to the test set and our estimate of the prediction error would be optimistically biased.
Cross-validation is the go-to strategy for selecting models, features and parameters, and rightfully so. Sometimes standard cross-validation is not feasible, e.g. for time series data, where random folds would let the model train on observations that come after the ones it is tested on. Then one has to rely on either analytical criteria such as AIC and BIC or holdout validation data. For time series, holdout validation means forecasting future values that were withheld from training.
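As a rough illustration of holdout validation for time series, the sketch below splits a series chronologically and scores a one-step-ahead forecast on the most recent part. The toy series, the 80/20 split and the choice of LinearRegression are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy series: a linear trend plus noise, ordered in time
rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.5 * t + rng.normal(scale=2.0, size=t.size)

# Use the previous value as the only feature (one-step-ahead forecast)
X_ts = series[:-1].reshape(-1, 1)
y_ts = series[1:]

# Chronological split: the first 80% is for training, the last 20% is held out.
# No shuffling, so the model never sees values from the future.
split = int(0.8 * len(X_ts))
X_train, X_holdout = X_ts[:split], X_ts[split:]
y_train, y_holdout = y_ts[:split], y_ts[split:]

model = LinearRegression().fit(X_train, y_train)
holdout_mse = mean_squared_error(y_holdout, model.predict(X_holdout))
print(f"Holdout MSE: {holdout_mse:.2f}")
```

The important design choice is that the split follows time order, so the holdout score reflects forecasting values the model has not seen.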
To do cross-validation, the training set is randomly partitioned into K folds of (roughly) equal size.
If your data is imbalanced, be sure to stratify your folds so that each fold keeps the same class distribution.
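A minimal sketch of stratified folds, assuming a classification problem with made-up labels (90 negatives, 10 positives). It uses scikit-learn's StratifiedKFold and records the share of positives in each test fold; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
# Placeholder features; only the labels matter for how the folds are built
X_imb = np.zeros((len(y_imb), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each test fold
positive_rates = []
for train_idx, test_idx in skf.split(X_imb, y_imb):
    positive_rates.append(y_imb[test_idx].mean())

print(positive_rates)
```

With plain random folds the share of positives could vary from fold to fold; stratification keeps it close to the overall 10% in every fold.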
K-1 of the folds are then used for training and the remaining fold for testing. This is repeated K times so that each fold serves as the test fold exactly once, and hence each observation is in the test set exactly once. Finally, the cross-validation score for a model candidate is its mean score over all K folds.
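The procedure can be sketched by hand with scikit-learn's KFold. The toy data, K=5, the ElasticNet settings and the use of mean squared error as the score are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical toy data for the demonstration
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=5, noise=1.0, random_state=0)

K = 5
kfold = KFold(n_splits=K, shuffle=True, random_state=0)

fold_scores = []
# Count how often each observation ends up in a test fold
times_tested = np.zeros(len(X_demo), dtype=int)

for train_idx, test_idx in kfold.split(X_demo):
    # Train on K-1 folds ...
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ... and score on the remaining fold
    predictions = model.predict(X_demo[test_idx])
    fold_scores.append(mean_squared_error(y_demo[test_idx], predictions))
    times_tested[test_idx] += 1

# The cross-validation score is the mean over the K folds
cv_score = np.mean(fold_scores)
print(f"CV mean squared error: {cv_score:.2f}")
```

In practice, helpers such as cross_val_score wrap this loop, but writing it out makes it clear that every observation is tested exactly once.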
Using cross-validation we can tune our hyperparameters, meaning that we can find the set of hyperparameter values with the best cross-validation score. Remember that hyperparameters are the parameters of a model that are set before fitting rather than estimated from the data, such as the regularization strength.
# Let's first set the scene by importing the regressor and generating data
# Import the ElasticNet regressor
from sklearn.linear_model import ElasticNet
# Make a toy regression dataset using sklearn's built-in function
from sklearn.datasets import make_regression
# Import grid search and randomized search CV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Import numpy and scipy.stats to make parameter ranges and distributions
import numpy as np
from scipy.stats import uniform

# Generate a toy example with 50 features, 10 of them informative
X, y = make_regression(n_features=50, n_informative=10, n_samples=1000)

# Create an ElasticNet regressor
# Note that we can specify any static hyperparameters here:
regressor = ElasticNet(max_iter=42)

# Create a parameter grid with feasible values:
param_grid = {
    # Ratio of L1 vs L2 regularization (lasso vs ridge)
    'l1_ratio': np.linspace(0.1, 1, num=10),
    # Regularization strength; logspace takes exponents, so this spans
    # 10**0.1 (about 1.26) to 4 in 50 log-spaced steps (the default num)
    'alpha': np.logspace(0.1, np.log10(4))
}

# Do 10-fold cross-validation and consider the Cartesian product of all parameters
grid_search = GridSearchCV(regressor, param_grid, cv=10, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# Grab the best set of parameters
grid_search.best_params_
Randomized Search
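The grid search cross-validation used above considers every combination of the parameter grid. Since the number of combinations is the product of the number of values per parameter, this quickly becomes infeasible for more than a handful of parameters and values.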
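To get a feel for how quickly the grid grows, the candidates can be counted with scikit-learn's ParameterGrid. The grid below mirrors the one from the grid search example; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape as the grid in the grid search example:
# 10 values for l1_ratio and 50 for alpha (the default num of np.logspace)
grid = {
    'l1_ratio': np.linspace(0.1, 1, num=10),
    'alpha': np.logspace(0.1, np.log10(4)),
}

# Number of parameter combinations in the Cartesian product
n_candidates = len(ParameterGrid(grid))
# Each candidate is fitted once per fold in 10-fold cross-validation
n_fits = n_candidates * 10

print(n_candidates, n_fits)
```

Two parameters already give 500 candidates and 5000 model fits; adding a third parameter with 10 values would multiply both numbers by 10.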
Randomized search is often more practical: it randomly samples a fixed number of parameter combinations to test. The parameters can be provided as lists or distributions, and continuous values are best provided as distributions. In our example, the l1_ratio parameter only needs a coarse, linear search over 0.1 to 1, while the regularization strength can take any value in a continuous range. As such, the list of alpha values can be replaced by a uniform distribution:
# Using random search, list-valued parameters are sampled at random and
# distributions can be specified instead of ranges.
params = {
    # Ratio of L1 vs L2 regularization (lasso vs ridge)
    'l1_ratio': np.linspace(0.1, 1, num=10),
    # Regularization strength, NOTE: use a distribution instead of a range;
    # uniform(loc=1, scale=2) samples from the interval [1, 3]
    'alpha': uniform(loc=1, scale=2)
}
# Do 10-fold cross-validation using random search
# (n_iter is left at its default of 10 sampled candidates)
random_search = RandomizedSearchCV(regressor, params, cv=10, scoring='neg_mean_squared_error')
random_search.fit(X, y)
# Grab the best set of parameters
random_search.best_params_
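The search budget and reproducibility can be controlled explicitly. The sketch below is a self-contained variant on a smaller, made-up dataset; the values n_iter=20, cv=5, max_iter=10_000 and random_state=0 are arbitrary assumptions, not recommendations.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical small dataset so the example runs quickly
X_small, y_small = make_regression(n_samples=200, n_features=20,
                                   n_informative=5, noise=1.0, random_state=0)

search = RandomizedSearchCV(
    ElasticNet(max_iter=10_000),
    {
        'l1_ratio': np.linspace(0.1, 1, num=10),
        'alpha': uniform(loc=1, scale=2),  # samples from [1, 3]
    },
    n_iter=20,        # number of sampled parameter combinations
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X_small, y_small)

# The scorer returns negative MSE (higher is better), so flip the sign
best_mse = -search.best_score_
print(search.best_params_, best_mse)
```

Unlike grid search, the cost here is fixed by n_iter regardless of how many parameters or values are searched over.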