Cross-Validation
An ingenious framework to estimate prediction error using the training set.
Cross-validation enables in-sample estimation of the prediction error: a strategy for estimating how well a model generalizes using only the training set. This is extremely useful since it allows us to evaluate a model directly on the training set without touching the test set. If the test set were used to select variables, we would overfit to the test set and our estimate of the prediction error would be biased.
Cross-validation is the go-to strategy for selecting models, features and parameters, and rightfully so. Sometimes cross-validation is not feasible, e.g. for time series data. Then one has to rely either on analytical frameworks such as AIC and BIC or on holdout validation data. For time series, holdout validation means evaluating forecasts of future values that were held out.
To perform cross-validation, the training set is randomly partitioned into K equally sized folds. K-1 of the folds are then used for training and the remaining fold for testing. This is repeated K times so that each fold serves as the test fold exactly once, which means every observation appears in a test fold once. Finally, the cross-validation score of a model candidate is its mean score over all K folds, as sketched below.
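A minimal sketch of this procedure with scikit-learn, assuming a synthetic regression dataset and a plain linear model purely for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data; any feature matrix X and target y would do.
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# 5 folds: each observation appears in the test fold exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

# The cross-validation score is the mean over the folds.
print(scores.mean())
```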
Using cross-validation we can tune our hyperparameters, meaning that we can find the best configuration for the model. Remember that hyperparameters are the parameters of a model that are not estimated from the data.
Assume that we are interested in finding the best regularization strength and l1-ratio of an ElasticNet regressor.
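A minimal sketch of such a search using scikit-learn's GridSearchCV; the dataset and the specific grid values are placeholders, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Placeholder data for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],        # regularization strength
    "l1_ratio": np.linspace(0.1, 1.0, 10),  # mix between L1 and L2 penalties
}

# Every combination in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```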
The grid search used above considers every combination in the parameter grid. This quickly becomes infeasible for more than a handful of parameters and values.
An alternative is a randomized search (scikit-learn's RandomizedSearchCV). The parameters can be provided as lists or distributions, and continuous parameters are best provided as distributions. In our example, we know that the l1_ratio parameter only needs to be searched coarsely and linearly over 0.1-1, while the regularization strength could take any value. As such, the alpha parameter can be replaced by a uniform distribution.
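A minimal sketch of that randomized search, assuming scipy's uniform distribution for alpha and the same placeholder dataset; the ranges and number of iterations are illustrative:

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Placeholder data for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

param_distributions = {
    "alpha": uniform(loc=0, scale=10),      # sampled uniformly over [0, 10)
    "l1_ratio": np.linspace(0.1, 1.0, 10),  # coarse, linear grid over 0.1-1
}

# Each iteration samples one candidate from the distributions/lists above.
search = RandomizedSearchCV(
    ElasticNet(max_iter=10_000),
    param_distributions,
    n_iter=50,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
```

Because candidates are sampled rather than enumerated, the cost of the search is controlled by n_iter instead of growing with the size of the grid.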