Feature Scaling

Feature scaling is the process of bringing features onto a common scale, which many algorithms require to work well.

Feature scaling is often a necessity in machine learning and data analysis. For instance, imagine two mass variables where one is measured in kilograms and the other in grams. If their true values changed by the same amount, the variable measured in grams would change 1000x as much as the other. This causes problems when fitting coefficients, doing gradient descent, applying regularization, or using any distance metric in feature space.
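
To make this concrete, here is a minimal sketch with made-up masses, showing how the Euclidean distance between two points is dominated by the feature expressed in grams:

# Hypothetical masses: both features differ by the same true amount (1.0 kg)
import numpy as np

a_kg = np.array([70.0, 1.5])
b_kg = np.array([71.0, 2.5])

# Express the second feature in grams instead of kilograms
a_g = a_kg * np.array([1.0, 1000.0])
b_g = b_kg * np.array([1.0, 1000.0])

print(np.linalg.norm(a_kg - b_kg))  # ~1.41: both features contribute equally
print(np.linalg.norm(a_g - b_g))    # ~1000: the gram feature dominates the distance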

Luckily, feature scaling is easy using Scikit-learn's preprocessing module:

# Let's scale the iris dataset
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

# Import the preprocessing module from Scikit
from sklearn import preprocessing

# It contains several tools but perhaps the most useful one is the standard scaler
scaler = preprocessing.StandardScaler()

# The scaler needs to be fitted before it can transform the data
X = scaler.fit_transform(X)

In the snippet above the StandardScaler is used. It centers and scales the data to zero mean and unit variance (i.e. variance = 1) by applying the transform:

x_{transformed} = \frac{x - \bar{x}}{\sigma}

for any feature x, with σ being its standard deviation and x̄ its mean.
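
To see that this is all the scaler does, here is a small sketch reproducing the transform manually with NumPy (StandardScaler uses the biased standard deviation, i.e. ddof=0):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw, _ = load_iris(return_X_y=True)

# Apply the transform by hand: subtract the mean, divide by the standard deviation
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The result matches StandardScaler's output
print(np.allclose(X_manual, StandardScaler().fit_transform(X_raw)))  # True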

The scaler must be refitted for each fold in the cross-validation.

Note that the standard deviation and mean are not known and need to be estimated from the data, which is why the snippet first fits the scaler and then transforms the data. As such, it is very important that the features are not scaled before cross-validation or bootstrapping; instead, the scaler should be fitted on each bootstrap sample or cross-validation fold.
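
A minimal sketch of doing this by hand with KFold; the logistic regression and accuracy score are just placeholder choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the scaler on the training fold only...
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    # ...and reuse the training-fold statistics on the held-out fold
    X_test = scaler.transform(X[test_idx])
    model = LogisticRegression(max_iter=1000).fit(X_train, y[train_idx])
    print(model.score(X_test, y[test_idx]))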

As the scaler needs to be fitted for each cross-validation fold and applied to any other dataset, such as the validation and test sets, it is often convenient to wrap the scaler in a pipeline together with your algorithm.
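
For example, wrapping the scaler and a placeholder classifier in a pipeline lets cross_val_score refit the scaler inside every fold automatically:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# cross_val_score refits the whole pipeline, scaler included, inside every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipeline, X, y, cv=5).mean())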
