A machine learning workflow consists of several stages, such as encoding, scaling, feature selection and model fitting. A pipeline is a convenient tool to ensure that all stages are applied correctly.
Data often needs to be encoded, scaled or otherwise preprocessed. Fitting a model to processed data means the processing steps must be repeated every time the model is used. Some steps, like estimating the correlation between a feature and the response, also need their output stored for later use during validation, testing and inference.
Even a simple toy example, fitting a logistic regression on a training set and scoring it on a test set, quickly becomes clumsy:
# Import standard scaler and logistic regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Import iris data set and train/test splitter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load iris data
X, y = load_iris(return_X_y=True)
# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
# Create scaler and classifier
scaler = StandardScaler()
clf = LogisticRegression()
# First fit the scaler on X_train and then transform it
X_train_trans = scaler.fit_transform(X_train)
# Then fit the classifier
clf.fit(X_train_trans, y_train)
# Transform X_test with the already-fitted scaler; DO NOT refit it
X_test_trans = scaler.transform(X_test)
# Finally score the classifier's accuracy
clf.score(X_test_trans, y_test)
A pipeline bundles these steps into a single estimator:
# Import the pipeline modules
from sklearn.pipeline import Pipeline, make_pipeline
# If you don't care about the name of the steps:
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
# If you care about the name of the steps:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
# Fit and score work as if the pipeline were just another classifier
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
Note that we no longer have to worry about fit vs fit_transform. This also enables scikit-learn's built-in cross-validation to be used.
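For example, the whole pipeline can be handed to cross_val_score, so each fold refits the scaler only on that fold's training portion and no information leaks from the held-out data. A minimal sketch; the choice of cv=5 is arbitrary:
# Cross-validate the entire pipeline, scaler included
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(scores.mean())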
The pipeline implements the same interface as the rest of scikit-learn, meaning the usual fit and predict methods can be used. Any component can be accessed by its index or key:
# Access using index
scaler = pipeline[0]
# Access using key
clf = pipeline['logreg']
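Since the fitted pipeline behaves like any other estimator, predictions can be made directly on the raw test data. A small sketch; both predict and the named_steps attribute are part of the standard Pipeline API:
# Predict on raw test data; the scaler is applied automatically
y_pred = pipeline.predict(X_test)
# named_steps is a third way to retrieve a fitted step
clf = pipeline.named_steps['logreg']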
Parameters for any step can be set either in the constructor for that step, when the pipeline is created, or on the pipeline itself using the set_params method. The parameter name should follow the syntax <key>__<parameter>, i.e. the step name and the parameter name joined by a double underscore. For instance, setting the inverse regularization strength hyperparameter C on the logistic regression step can be done as:
# Set parameter C to 10 on the logistic regression step
pipeline.set_params(logreg__C=10)
Using set_params allows hyperparameters to be grid searched directly on the pipeline!
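As a minimal sketch, GridSearchCV accepts the pipeline together with a parameter grid keyed with the same <key>__<parameter> syntax; the three C values below are arbitrary:
# Grid search C on the logistic regression step; the scaler is
# refit inside every cross-validation fold
from sklearn.model_selection import GridSearchCV
param_grid = {'logreg__C': [0.1, 1, 10]}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)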