One-hot encoding is a process used to convert categorical data into numerical data. In R, one-hot encoding can be done with the model.matrix() function by specifying the categorical variables in a formula. The function creates a matrix of 0s and 1s with one column per category and a 1 in the column corresponding to the category of each observation. Note that with the default intercept, model.matrix() drops the first category as a baseline, so the intercept must be removed from the formula to get a column for every category. The resulting matrix allows the categorical data to be used in numerical analysis.
One-hot encoding is used to convert categorical variables into a format that can be used by machine learning algorithms.
The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to represent the original categorical values.
For example, a categorical variable that contains the team names A, B, and C would be converted into three new variables, one per team, that contain only 0 and 1 values.
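To make the idea concrete, here is a minimal base R sketch that builds the 0/1 columns by hand. The names team, categories, and one_hot are illustrative assumptions, not part of the original tutorial, and the team values are the ones used in the data frame created in Step 1.

```r
# Illustrative team labels (same values as the tutorial's data frame)
team <- c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')

# Find the unique categories of the variable
categories <- sort(unique(team))

# Build one indicator column per category:
# 1 where the row belongs to that category, 0 otherwise
one_hot <- sapply(categories, function(level) as.integer(team == level))

# Name the columns after the variable and its category
colnames(one_hot) <- paste0('team', categories)

# View the resulting 8 x 3 matrix of 0s and 1s
one_hot
```

Each row of the result contains exactly one 1, which is where the name "one-hot" comes from.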
The following step-by-step example shows how to perform one-hot encoding for this exact dataset in R.
Step 1: Create the Data
First, let’s create the following data frame in R:
#create data frame
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
                 points=c(25, 12, 15, 14, 19, 23, 25, 29))

#view data frame
df

  team points
1    A     25
2    A     12
3    B     15
4    B     14
5    B     19
6    B     23
7    C     25
8    C     29
Step 2: Perform One-Hot Encoding
Next, let’s use the dummyVars() function from the caret package to perform one-hot encoding on the ‘team’ variable in the data frame:
library(caret)

#define one-hot encoding function
dummy <- dummyVars(" ~ .", data=df)

#perform one-hot encoding on data frame
final_df <- data.frame(predict(dummy, newdata=df))

#view final data frame
final_df

  teamA teamB teamC points
1     1     0     0     25
2     1     0     0     12
3     0     1     0     15
4     0     1     0     14
5     0     1     0     19
6     0     1     0     23
7     0     0     1     25
8     0     0     1     29
Notice that three new columns were added to the data frame since the original ‘team’ column contained three unique values.
Also notice that the original ‘team’ column was dropped from the data frame since it’s no longer needed.
The one-hot encoding is complete and we can now feed this dataset into any machine learning algorithm that we’d like.
Note: You can find the complete online documentation for the dummyVars() function in the caret package reference.
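The summary at the top of this tutorial mentions model.matrix() as a base R route to the same result. The following is a minimal sketch of that alternative, not the method used in the steps above. It assumes the same data frame as Step 1, and the names encoded and final_base are illustrative.

```r
# Recreate the data frame from Step 1 so this example is self-contained
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
                 points=c(25, 12, 15, 14, 19, 23, 25, 29))

# The '- 1' removes the intercept so every level of 'team' gets its own column;
# with the default intercept, the first level (A) would be dropped as a baseline
encoded <- model.matrix(~ team - 1, data=df)

# Combine the indicator columns with the original numeric column
final_base <- cbind(as.data.frame(encoded), points=df$points)

# View the result
final_base
```

This approach needs no additional packages, while dummyVars() is convenient when you want to reuse the same encoding on new data through predict().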