Encoding
Data comes in many shapes and forms. To be usable in a machine learning method, all non-numerical variables need to be encoded.
Categorical Variables
The most straightforward encoding is encoding categorical variables as dummy variables, also known as one-hot encoding. That is, each categorical variable with k categories is simply replaced by k binary variables.
This can be done using OneHotEncoder:
# Import the encoder from sklearn's preprocessing module
from sklearn.preprocessing import OneHotEncoder
# Create a OneHotEncoder and choose a strategy for unknown values,
# i.e. values passed to transform that were not in the data provided to fit
enc = OneHotEncoder(handle_unknown='ignore')
# Fit and transform as usual
enc.fit_transform([['cat0', 42], ['cat1', 24], ['cat2', 1]])
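With handle_unknown='ignore', a category that was not seen during fit is encoded as all zeros at transform time. A minimal sketch of that behaviour, using a made-up unseen value 'cat3':
# Transform a row containing a category that was not present during fit
# With handle_unknown='ignore' the unknown category is encoded as all zeros
enc.transform([['cat3', 42]]).toarray()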
Dictionaries
If your data is stored as a dict, the DictVectorizer is very useful. It automatically transforms a dict into a matrix, using dummy encoding for any categorical variable it finds.
# Import the DictVectorizer from the feature extraction module
from sklearn.feature_extraction import DictVectorizer
# Create a DictVectorizer and specify
# sparse - produce scipy.sparse matrices or not
dict_vect = DictVectorizer(sparse=False)
# Use regular fit/transform calls on the vectorizer
any_dict = [{'numerical': 1, 'cat': 'cat0'}, {'numerical': 42, 'cat': 'cat1'}]
dict_vect.fit_transform(any_dict)
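To see which columns were produced, the fitted vectorizer can list its feature names. A rough sketch, assuming a recent sklearn where the method is called get_feature_names_out:
# Inspect the generated columns: the categorical 'cat' field is dummy encoded
# into one column per value, while 'numerical' stays a single column
dict_vect.get_feature_names_out()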
Text
Two additional encoders that are very useful for text are the CountVectorizer and TF-IDF.
The CountVectorizer encodes each document as a sparse vector where each index corresponds to a certain word, much like dummy encoding. The vocabulary is built from all words found in the entire corpus. However, instead of a binary dummy value, the value at a given index is the number of times the corresponding word occurs in the document.
# Import the CountVectorizer from the text feature extraction module
from sklearn.feature_extraction.text import CountVectorizer
# Create the CountVectorizer
count_vect = CountVectorizer()
# Again, call fit/transform as needed
corpus = [
'nlp is the best',
'I wish I did more nlp'
]
count_vect.fit_transform(corpus)
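To inspect what was learned, the fitted vectorizer exposes its vocabulary, and for a small corpus the sparse count matrix can be densified. A minimal sketch, again assuming a recent sklearn with get_feature_names_out:
# The vectorizer is already fitted above: print the learned vocabulary
# (the default tokenizer drops single-character tokens such as 'I')
print(count_vect.get_feature_names_out())
# Print the dense count matrix, one row per document
print(count_vect.transform(corpus).toarray())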
Knowing the count of each word provides more information than just knowing that at least one occurrence is present. As discussed on the next page, machine learning models have a hard time handling features of different scales. Under the assumption that rare words carry more information, the counts from a CountVectorizer can be TF-IDF transformed using TfidfTransformer. The TF-IDF transformer calculates the Term Frequency times the Inverse Document Frequency: the frequency of a word in the current document is weighted inversely by how often the word appears across all documents. This downscales common words and upscales rare ones.
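A rough sketch of the underlying formula, with tf(t, d) the count of term t in document d, n the number of documents, and df(t) the number of documents containing t (by default sklearn additionally smooths the idf term and normalises each row; see its documentation):

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t), \qquad \text{idf}(t) = \log\frac{n}{\text{df}(t)} + 1
$$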
TF-IDF can be used either as a transformer on the output of a CountVectorizer:
# Import the CountVectorizer and the tfidf transformer from the text feature extraction module
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Import sklearn's Pipeline to connect the count and tfidf steps
from sklearn.pipeline import Pipeline
# Create a pipeline using the CountVectorizer above and the tfidf transformer
pipeline = Pipeline([('count', CountVectorizer()),
                     ('tfidf', TfidfTransformer())])
# Use with the usual interface
pipeline.fit_transform(corpus)
or using a TfidfVectorizer directly:
# Import the TfidfVectorizer from the text feature extraction module
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
TfidfVectorizer does a lot more than TF-IDF, such as removing stop words, tokenizing, and creating n-grams; see the sklearn documentation.
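For example, a vectorizer that strips English stop words and also counts word bigrams could be configured roughly like this; the parameter values are purely illustrative:
# Drop common English stop words and include both unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)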