One-hot encoding converts categorical data into numerical data by creating a binary column for each category and assigning each row a 1 or 0 (true or false) in that column. In Python, this can be done with the OneHotEncoder class from scikit-learn’s preprocessing module. OneHotEncoder converts an array of categorical values into a matrix (sparse by default) in which each column corresponds to one possible value of a feature. The related LabelEncoder class does not produce one-hot output: it encodes labels as integers between 0 and n_classes-1 and is intended for target labels rather than input features. Both classes turn categorical data into numbers for machine learning algorithms, but only OneHotEncoder performs one-hot encoding.
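To make the difference between the two encoders concrete, here is a minimal sketch. The `teams` array is an illustrative input chosen for this example, not part of the dataset used later.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# illustrative categorical values (an assumption for this sketch)
teams = np.array(['A', 'B', 'C', 'A'])

# LabelEncoder maps each category to one integer code (0 to n_classes-1)
label_codes = LabelEncoder().fit_transform(teams)
print(label_codes)

# OneHotEncoder expects 2D input and returns a sparse matrix by default,
# so we reshape to one column and convert the result to a dense array
one_hot = OneHotEncoder().fit_transform(teams.reshape(-1, 1)).toarray()
print(one_hot)
```

The integer codes imply an ordering (A < B < C) that the categories do not really have, which is one reason one-hot output is usually preferred for input features.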

One-hot encoding is used to convert categorical variables into a format that can be readily used by machine learning algorithms.

The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to represent the original categorical values.
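The idea can be sketched by hand with plain pandas, before bringing in any encoder class. The `teams` Series and the `team_` column prefix are illustrative choices for this sketch.

```python
import pandas as pd

# illustrative categorical variable (an assumption for this sketch)
teams = pd.Series(['A', 'B', 'C', 'B'], name='team')

# build one 0/1 indicator column per unique category:
# a row gets 1 in the column matching its value and 0 everywhere else
indicators = pd.DataFrame({
    f'team_{value}': (teams == value).astype(int)
    for value in sorted(teams.unique())
})

print(indicators)
```

Each row ends up with exactly one 1, which is where the name "one-hot" comes from.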

For example, the following image shows how we would perform one-hot encoding to convert a categorical variable that contains team names into new variables that contain only 0 and 1 values:

[Image: one-hot encoding of the team variable into 0/1 columns]

The following step-by-step example shows how to perform one-hot encoding for this exact dataset in Python.

Step 1: Create the Data

First, let’s create the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

#view DataFrame
print(df)

  team  points
0    A      25
1    A      12
2    B      15
3    B      14
4    B      19
5    B      23
6    C      25
7    C      29

Step 2: Perform One-Hot Encoding

Next, let’s import the OneHotEncoder class from the sklearn library and use it to perform one-hot encoding on the ‘team’ variable in the pandas DataFrame:

from sklearn.preprocessing import OneHotEncoder

#creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

#perform one-hot encoding on 'team' column 
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

#merge one-hot encoded columns back with original DataFrame
final_df = df.join(encoder_df)

#view final df
print(final_df)

  team  points    0    1    2
0    A      25  1.0  0.0  0.0
1    A      12  1.0  0.0  0.0
2    B      15  0.0  1.0  0.0
3    B      14  0.0  1.0  0.0
4    B      19  0.0  1.0  0.0
5    B      23  0.0  1.0  0.0
6    C      25  0.0  0.0  1.0
7    C      29  0.0  0.0  1.0

Notice that three new columns were added to the DataFrame since the original ‘team’ column contained three unique values.

Note: You can find the complete documentation for the OneHotEncoder class in the official scikit-learn documentation.
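The encoder in this step was created with handle_unknown='ignore'. As a rough illustration of what that setting does, the sketch below fits on three teams and then transforms data containing a team the encoder has never seen. The `train` and `new_data` DataFrames and the team 'D' are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical data: 'D' does not appear in the data used for fitting
train = pd.DataFrame({'team': ['A', 'B', 'C']})
new_data = pd.DataFrame({'team': ['B', 'D']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['team']])

# with handle_unknown='ignore', an unseen category is encoded as all zeros
# rather than raising an error during transform
encoded_new = encoder.transform(new_data[['team']]).toarray()
print(encoded_new)
```

The row for team 'B' gets a 1 in the second column, while the row for the unseen team 'D' is all zeros.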

Step 3: Drop the Original Categorical Variable

Lastly, we can drop the original ‘team’ variable from the DataFrame since we no longer need it:

#drop 'team' column
final_df.drop('team', axis=1, inplace=True)

#view final df
print(final_df)

   points    0    1    2
0      25  1.0  0.0  0.0
1      12  1.0  0.0  0.0
2      15  0.0  1.0  0.0
3      14  0.0  1.0  0.0
4      19  0.0  1.0  0.0
5      23  0.0  1.0  0.0
6      25  0.0  0.0  1.0
7      29  0.0  0.0  1.0

We can also rename the columns of the final DataFrame to make them easier to read:

#rename columns
final_df.columns = ['points', 'teamA', 'teamB', 'teamC']

#view final df
print(final_df)

   points  teamA  teamB  teamC
0      25    1.0    0.0    0.0
1      12    1.0    0.0    0.0
2      15    0.0    1.0    0.0
3      14    0.0    1.0    0.0
4      19    0.0    1.0    0.0
5      23    0.0    1.0    0.0
6      25    0.0    0.0    1.0
7      29    0.0    0.0    1.0

The one-hot encoding is complete and we can now feed this pandas DataFrame into any machine learning algorithm that we’d like.
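For quick exploratory work, a similar result can be reached with the pandas get_dummies() function. This is a different technique from the OneHotEncoder approach used in the steps above, shown here only as a sketch; `dummies_df` is a name introduced for illustration, and dtype=int is passed so the output holds 0/1 integers rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

# encode 'team' in one call; the original column is dropped and the new
# columns are named with the prefix 'team_'
dummies_df = pd.get_dummies(df, columns=['team'], dtype=int)

print(dummies_df)
```

Unlike a fitted OneHotEncoder, get_dummies() does not remember the categories it saw, so it is less convenient when the same encoding must later be applied to new data.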
