A Box-Cox transformation is a statistical method for transforming non-normally distributed data so that it more closely follows a normal distribution. It can be performed in Python using the scipy.stats.boxcox() function, which takes an array of positive data values and, when no lambda is supplied, returns the transformed data along with the optimal lambda parameter used for the transformation. The lambda parameter determines the type of transformation: a value of 0 corresponds to a log transformation, and any other value corresponds to a power transformation.
A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.
The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:
- y(λ) = (y^λ – 1) / λ if λ ≠ 0
- y(λ) = log(y) if λ = 0
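The piecewise formula above can be sketched directly in NumPy. This is only an illustration of the two branches, not the routine that scipy uses internally; the helper name box_cox_manual is made up for this example, and the check for positive values reflects the fact that the transformation is only defined for y > 0.

```python
import numpy as np

def box_cox_manual(y, lam):
    """Apply the Box-Cox formula to strictly positive values (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(y)           # log branch, used when lambda = 0
    return (y**lam - 1) / lam      # power branch, used when lambda != 0

values = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox_manual(values, 0.5))  # power transformation
print(box_cox_manual(values, 0))    # log transformation
```

As λ approaches 0, the power branch approaches log(y), which is why the log is used to define the transformation at λ = 0.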
We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function.
The following example shows how to use this function in practice.
Example: Box-Cox Transformation in Python
Suppose we generate a random set of 1,000 values that come from an exponential distribution:
#load necessary packages
import numpy as np
from scipy.stats import boxcox
import seaborn as sns

#make this example reproducible
np.random.seed(0)

#generate dataset
data = np.random.exponential(size=1000)

#plot the distribution of data values (kernel density estimate)
sns.kdeplot(data)
We can see that the distribution does not appear to be normal.
We can use the boxcox() function to find an optimal value of lambda that produces a more normal distribution:
#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)

#plot the distribution of the transformed data values (kernel density estimate)
sns.kdeplot(transformed_data)
We can see that the transformed data follows much more of a normal distribution.
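The visual impression can be backed up with a number. The sketch below compares the sample skewness before and after the transformation using scipy.stats.skew; a symmetric distribution such as the normal has a skewness of 0. Skewness captures only one aspect of normality, so treat this as a rough check rather than a formal test. The variable names skew_before and skew_after are chosen for this example.

```python
import numpy as np
from scipy.stats import boxcox, skew

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation
transformed_data, best_lambda = boxcox(data)

#compare sample skewness before and after the transformation
skew_before = skew(data)
skew_after = skew(transformed_data)

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

A skewness much closer to 0 after the transformation is consistent with the more symmetric shape seen in the plot.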
We can also find the exact lambda value used to perform the Box-Cox transformation:
#display optimal lambda value
print(best_lambda)

0.2420131978174143
The optimal lambda was found to be roughly 0.242. Thus, each data value was transformed using the following equation:
New = (old^0.242 – 1) / 0.242
We can confirm this by looking at the values from the original data compared to the transformed data:
#view first five values of original dataset
data[0:5]

array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

#view first five values of transformed dataset
transformed_data[0:5]

array([-0.22212062, 0.23427768, -0.07911706, -0.23247555, -0.55495228])
The first value in the original dataset was 0.79587. Thus, we applied the following formula to transform this value:
New = (0.79587^0.242 – 1) / 0.242 = -0.222
We can confirm that the first value in the transformed dataset is indeed -0.222.
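The same check can be run on every value at once rather than just the first. The sketch below applies the formula with the fitted lambda to the whole array and compares the result with the output of boxcox() using np.allclose, which allows for small floating-point differences. The variable name manual is chosen for this example.

```python
import numpy as np
from scipy.stats import boxcox

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation
transformed_data, best_lambda = boxcox(data)

#apply the formula by hand to every value using the fitted lambda
manual = (data**best_lambda - 1) / best_lambda

#check that the manual result matches the output of boxcox()
print(np.allclose(manual, transformed_data))
```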