How to Calculate Cook’s Distance in SAS

Cook’s distance is a measure used in regression analysis to identify the observations that have the greatest influence on the fitted regression model. It can be calculated in SAS by using the OUTPUT statement in PROC REG with the COOKD= keyword. This produces a dataset that contains Cook’s distance for each observation, which can be used to identify influential points that may be skewing the results.


Cook’s distance is used to identify influential observations in a regression model.

The formula for Cook’s distance is:

$$D_i = \frac{r_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where:

  • $r_i$ is the ith residual
  • $p$ is the number of coefficients in the regression model (including the intercept)
  • $MSE$ is the mean squared error
  • $h_{ii}$ is the ith leverage value

Essentially, Cook’s distance measures how much all of the fitted values in the model change when the ith observation is deleted.
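
This deletion-based view is usually written as the formula below. The notation $\hat{y}_{j(i)}$ is introduced here for illustration and does not appear in the formula above: it denotes the fitted value for observation j when the model is refit without observation i.

```latex
$$D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \cdot MSE}$$
```

For a linear regression model fit by least squares, this expression is algebraically equivalent to the leverage-based formula above, so no refitting is needed in practice.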

The larger the value for Cook’s distance, the more influential a given observation.

A rule of thumb is that any observation with a Cook’s distance greater than 4/n (where n = total observations) is considered to be highly influential.

The following example shows how to calculate Cook’s distance for each observation in a regression model in SAS.

Example: Calculating Cook’s Distance in SAS

Suppose we have the following dataset in SAS:

/*create dataset*/
data my_data;
    input x y;
    datalines;
8 41
12 42
12 39
13 37
14 35
16 39
17 45
22 46
24 39
26 49
29 55
30 57
;
run;

/*view dataset*/
proc print data=my_data;
run;

We can use PROC REG to fit a simple linear regression model to this dataset and then use the OUTPUT statement with the COOKD= keyword to calculate Cook’s distance for each observation in the regression model:

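/*fit simple linear regression model and calculate Cook's distance for each obs*/
proc reg data=my_data;
    model y=x;
    /*save the input data plus Cook's distance in a new dataset*/
    output out=cooksData cookd=cookd;
run;
quit; /*PROC REG is interactive, so end it explicitly*/

/*print Cook's distance values for each observation*/
proc print data=cooksData;
run;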

The final table in the output displays the original dataset along with Cook’s distance for each observation:

  • Cook’s distance for the first observation is 0.36813.
  • Cook’s distance for the second observation is 0.06075.
  • Cook’s distance for the third observation is 0.00052.

And so on.
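
As a sanity check, these values can be reproduced from the formula given earlier. The following is a minimal sketch, assuming the my_data dataset from above. The dataset names est, diag, and check and the variables resid, lev, and cookd_manual are hypothetical names chosen for illustration. It relies on the OUTEST= dataset storing the root mean squared error in _RMSE_, and it fixes p = 2 because this model has an intercept and one slope.

```sas
/*save residuals, leverage, and Cook's distance; OUTEST= stores the root MSE in _RMSE_*/
proc reg data=my_data outest=est noprint;
    model y=x;
    output out=diag r=resid h=lev cookd=cookd;
run;
quit;

/*recompute Cook's distance by hand (assumes p = 2: intercept and slope)*/
data check;
    if _n_ = 1 then set est(keep=_RMSE_); /*read the root MSE once; it is retained*/
    set diag;
    p = 2;
    mse = _RMSE_**2;
    cookd_manual = (resid**2 / (p*mse)) * (lev / (1 - lev)**2);
run;

/*compare the value from PROC REG with the hand calculation*/
proc print data=check;
    var x y cookd cookd_manual;
run;
```

If the formula has been applied correctly, the cookd and cookd_manual columns should agree for every observation.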

PROC REG also produces several diagnostic plots in the output, including a chart of Cook’s distance:

[Figure: Cook’s distance plot from PROC REG in SAS]

The x-axis shows the observation number and the y-axis shows Cook’s distance for each observation.
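
If only this chart is needed, the PLOTS= option of PROC REG can request it on its own. The following is a small sketch, assuming the my_data dataset from above and that ODS Graphics is available in your SAS session. The LABEL suboption asks SAS to label the observations whose Cook’s distance exceeds 4/n.

```sas
/*request only the Cook's distance plot and label points above 4/n*/
ods graphics on;

proc reg data=my_data plots(only)=cooksd(label);
    model y=x;
run;
quit;
```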

Note that a cutoff line is placed at 4/n (in this case n = 12, so the cutoff is 4/12 ≈ 0.33), and we can see that three observations in the dataset lie above this line.

This indicates that these observations could be highly influential to the regression model and should perhaps be examined more closely before interpreting the output of the model.
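
To list these observations rather than read them off the chart, one option is to filter the output dataset with the same 4/n rule of thumb. This is a minimal sketch, assuming the cooksData dataset created above. The dataset name influential and the variable cutoff are hypothetical, and n is taken as the row count of cooksData, which matches the number of observations used in the fit only when no rows have missing values.

```sas
/*keep only observations whose Cook's distance exceeds 4/n*/
data influential;
    set cooksData nobs=n; /*n = number of rows in cooksData*/
    cutoff = 4 / n;
    if cookd > cutoff;    /*subsetting IF: drop rows at or below the cutoff*/
run;

/*print the flagged observations*/
proc print data=influential;
run;
```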
