How can I perform a Box-Cox transformation in Python?

The Box-Cox transformation is a statistical technique that makes skewed data look more like a normal (Gaussian) distribution. It is commonly used in data analysis and machine learning, and it requires strictly positive data. To perform a Box-Cox transformation in Python, import scipy.stats (typically alongside numpy) and call the scipy.stats.boxcox() function on your data. Called with only the data, the function returns the transformed data along with the lambda parameter it selected by maximum likelihood. You can also pass your own value through the lmbda argument to control the degree of transformation, in which case only the transformed data is returned. This flexibility makes it easier to find a suitable transformation for your data. Overall, a Box-Cox transformation in Python is a simple and effective way to bring data closer to normality before further analysis.

Perform a Box-Cox Transformation in Python


A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

  • y(λ) = (y^λ – 1) / λ  if λ ≠ 0
  • y(λ) = log(y)  if λ = 0
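
The two cases of the formula can be sketched directly in NumPy. This is only an illustration: the helper name box_cox_manual is hypothetical, the input values are made up, and the input is assumed to be strictly positive. scipy.stats.boxcox() with its lmbda argument serves as a cross-check.

```python
import numpy as np
from scipy.stats import boxcox


def box_cox_manual(y, lam):
    """Apply the Box-Cox formula for a fixed lambda (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # lambda = 0 case: natural log
        return np.log(y)
    # lambda != 0 case: power transform
    return (y ** lam - 1) / lam


# made-up positive values, for illustration only
values = np.array([0.5, 1.0, 2.0, 4.0])

manual_half = box_cox_manual(values, 0.5)
manual_zero = box_cox_manual(values, 0)

# when lmbda is supplied, scipy's boxcox returns only the transformed array
scipy_half = boxcox(values, lmbda=0.5)
scipy_zero = boxcox(values, lmbda=0)

print(np.allclose(manual_half, scipy_half))
print(np.allclose(manual_zero, scipy_zero))
```

Both comparisons should agree, since SciPy applies the same formula when a fixed lambda is supplied.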

We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function, which requires all data values to be positive.

The following example shows how to use this function in practice.

Example: Box-Cox Transformation in Python

Suppose we generate a random set of 1,000 values that come from an exponential distribution:

#load necessary packages
import numpy as np
from scipy.stats import boxcox
import seaborn as sns

#make this example reproducible
np.random.seed(0)

#generate dataset
data = np.random.exponential(size=1000)

#plot the distribution of data values
sns.kdeplot(data)

We can see that the distribution does not appear to be normal.

We can use the boxcox() function to find an optimal value of lambda that produces a more normal distribution:

#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data) 

#plot the distribution of the transformed data values
sns.kdeplot(transformed_data)


We can see that the transformed data follows much more of a normal distribution.
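
The visual impression can also be checked numerically. The sketch below is an addition to the original example, not part of it: it regenerates the same data and compares sample skewness before and after the transformation using scipy.stats.skew(). A value near 0 indicates a roughly symmetric distribution, though low skewness alone does not prove normality.

```python
import numpy as np
from scipy.stats import boxcox, skew

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)

#compare sample skewness before and after the transformation
skew_before = skew(data)
skew_after = skew(transformed_data)

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

The skewness of the transformed data should be much closer to 0 than that of the original exponential sample.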

We can also find the exact lambda value used to perform the Box-Cox transformation:

#display optimal lambda value
print(best_lambda)

0.2420131978174143

The optimal lambda was found to be roughly 0.242. Thus, each data value was transformed using the following equation:

New = (old^0.242 – 1) / 0.242

We can confirm this by looking at the values from the original data compared to the transformed data:

#view first five values of original dataset
data[0:5]

array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

#view first five values of transformed dataset
transformed_data[0:5]

array([-0.22212062,  0.23427768, -0.07911706, -0.23247555, -0.55495228])

The first value in the original dataset was 0.79587. Thus, we applied the following formula to transform this value:

New = (0.79587^0.242 – 1) / 0.242 = -0.222

We can confirm that the first value in the transformed dataset is indeed -0.222.
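
The same check can be run for every value at once rather than by hand. The following is a minimal sketch that regenerates the example data and applies the formula with the fitted lambda; the variable names manual and matches are introduced here for illustration.

```python
import numpy as np
from scipy.stats import boxcox

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)

#apply the formula by hand using the fitted lambda
manual = (data ** best_lambda - 1) / best_lambda

#compare the first value, then all values
print(manual[0], transformed_data[0])
matches = np.allclose(manual, transformed_data)
print(matches)
```

If the hand-computed values agree with the output of boxcox(), the final line prints True.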

