Pipeline

A machine learning workflow typically consists of several stages, such as encoding, scaling, feature selection, and fitting a model. A pipeline is a convenient tool for ensuring that all stages are applied correctly and in the same order.

Data often needs to be encoded, scaled, or preprocessed in various ways. A model fitted to processed data requires the same processing steps to be repeated every time the model is used. Some steps, such as estimating the correlation between each feature and the response, produce output that must be stored for later use during validation, testing, and inference.

Even a simple toy example, fitting logistic regression to a training set and predicting on a test set, quickly becomes clumsy:

# Import standard scaler and logistic regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Import iris data set and train/test splitter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load iris data
X, y = load_iris(return_X_y=True)

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Create scaler and classifier
scaler = StandardScaler()
clf = LogisticRegression()

# First fit the scaler on X_train and then transform it
X_train_trans = scaler.fit_transform(X_train)
# Then fit the classifier
clf.fit(X_train_trans, y_train)

# Transform X_test with the already fitted scaler; do NOT refit the scaler
X_test_trans = scaler.transform(X_test)

# Finally, score the classifier's accuracy on the test set
clf.score(X_test_trans, y_test)

Using a pipeline, all steps can be bundled into one:
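As a minimal sketch, the manual scale-then-classify workflow above might be written as a pipeline like this. The step keys `scaler` and `clf` and the variable name `pipe` are assumptions chosen for illustration; the data split mirrors the one used earlier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load and split the iris data exactly as in the manual example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Bundle the scaler and the classifier; the keys "scaler" and "clf"
# are illustrative names, any unique strings would work
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() fits the scaler on X_train, transforms X_train, then fits the classifier
pipe.fit(X_train, y_train)

# score() transforms X_test with the already fitted scaler before scoring
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```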

Note that we no longer have to keep track of fit, fit_transform and transform on the individual steps. This also enables scikit-learn's built-in cross-validation to be used on the whole workflow.
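To illustrate the cross-validation point, here is a small sketch using `cross_val_score` on a pipeline. The step keys and the choice of 5 folds are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Illustrative pipeline with assumed step keys "scaler" and "clf"
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The whole pipeline is refit on the training part of each fold, so the
# scaler never sees the held-out part of that fold
scores = cross_val_score(pipe, X, y, cv=5)
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Passing the pipeline rather than a bare classifier is what keeps the preprocessing inside each fold.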

The pipeline implements the same interface as the rest of scikit-learn, so methods such as fit and predict can be used directly. Any component can be accessed using its key or index:
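A short sketch of both access styles, assuming a pipeline whose steps are keyed `scaler` and `clf` (illustrative names):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Illustrative pipeline with assumed step keys "scaler" and "clf"
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Access a step by its key, via named_steps or by indexing with the key
scaler_by_key = pipe.named_steps["scaler"]
clf_by_key = pipe["clf"]

# Access a step by its position in the pipeline
scaler_by_index = pipe[0]
clf_by_index = pipe[-1]

# The pipeline itself behaves like any other estimator
predictions = pipe.predict(X[:5])
```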

Parameters of any step can be set either in the constructor of that specific step, when creating the pipeline, or on the pipeline itself using the set_params method. The parameter name should follow the syntax <key>__<parameter>, with two underscores between the step key and the parameter. For instance, setting the inverse regularization strength hyperparameter C on the logistic regression step can be done as:
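Both options are sketched here, assuming the logistic regression step is keyed `clf` (an illustrative name) and using arbitrary example values for C:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Option 1: set the parameter in the step's constructor
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=0.5)),
])

# Option 2: set it on the pipeline using <key>__<parameter>
pipe.set_params(clf__C=10.0)

# get_params() exposes the value under the same composite name
current_C = pipe.get_params()["clf__C"]
print(current_C)
```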

Because set_params works on the pipeline as a whole, hyperparameters of any step can be grid searched directly on the pipeline!
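A sketch of such a grid search with `GridSearchCV`, assuming the step key `clf` and an arbitrary candidate grid for C chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Illustrative pipeline with assumed step keys "scaler" and "clf"
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Grid keys use the same <key>__<parameter> syntax as set_params;
# the candidate values are arbitrary examples
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is evaluated with 5-fold cross-validation on the full pipeline
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
best_score = search.best_score_
print(best_C, best_score)
```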
