What is an influential observation in statistics?

An influential observation is an observation that has a large effect on the results of a statistical model. Candidates are often outliers or points that lie far from the rest of the data, although not every outlier is influential. An influential observation affects the fitted model far more than a typical observation does, and it can drastically change how the model is interpreted.


In statistics, an influential observation is an observation in a dataset that, when removed, dramatically changes the coefficient estimates of a regression model.

The most common way to measure the influence of observations is to use Cook’s distance, which quantifies how much all of the fitted values in a regression model change when the ith observation is deleted.
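Written out, one standard form of this definition is given below. The notation is introduced here for illustration and is not part of the original text: ŷ_j is the fitted value for observation j from the full model, ŷ_j(i) is the fitted value for observation j when the model is refit without observation i, p is the number of estimated coefficients (including the intercept), s² is the mean squared error of the full model, e_i is the residual of observation i, and h_ii is its leverage.

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
    = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{\left( 1 - h_{ii} \right)^2}
```

The second form shows that Cook’s distance grows when an observation has both a large residual and high leverage, and it can be computed without refitting the model n times.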

As a rule of thumb, any observation with a Cook’s distance greater than 1 is considered to be potentially influential.

The following example shows how to calculate and interpret Cook’s distance for a given dataset to detect potential influential observations.

Example: Detecting Influential Observations

Suppose we have the following dataset with 14 values:

Now suppose we fit a simple linear regression model. The regression output is shown below:

Using statistical software, we can calculate the following values for Cook’s distance for each observation:

Notice that the last observation has a value significantly greater than 1 for Cook’s distance, which tells us that it’s an influential observation.
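As a rough illustration of how statistical software might arrive at such values, here is a minimal pure-Python sketch for simple linear regression. The function names fit_line and cooks_distance are hypothetical, and the data are made up for this sketch; they are not the dataset from the example. The code uses the closed-form leverage for a single predictor, h_ii = 1/n + (x_i - x̄)² / Sxx, together with D_i = e_i² / (p·s²) · h_ii / (1 - h_ii)².

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x. Returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def cooks_distance(x, y):
    """Cook's distance for every observation in a simple linear regression."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)  # s^2 of the full model
    distances = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this observation
        distances.append(e ** 2 / (p * mse) * h / (1 - h) ** 2)
    return distances


# Hypothetical data: seven points near a line plus one point far out in x.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]

distances = cooks_distance(x, y)
flagged = [i for i, d in enumerate(distances) if d > 1]  # rule-of-thumb cutoff
print("Cook's distances:", [round(d, 3) for d in distances])
print("Indices above the cutoff of 1:", flagged)
```

With these made-up values, only the last point exceeds the cutoff of 1, which mirrors the situation described above.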

Suppose we remove this value from the dataset and fit a new simple linear regression model. The output for this model is shown below:

Notice that the regression coefficients for the intercept and x both changed dramatically. This tells us that removing the influential observation from the dataset completely changed the fitted regression model.
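The same before-and-after comparison can be sketched in a few lines of pure Python. The data and the helper fit_line are hypothetical and are not the dataset or output from the example; the index of the flagged point is simply assumed to have been identified already, for instance by Cook’s distance.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x. Returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: the last point sits far from the trend of the others.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]
flagged_index = 7  # assumed to have been flagged as influential

# Fit once on all observations, then again without the flagged one.
full = fit_line(x, y)
x_kept = x[:flagged_index] + x[flagged_index + 1:]
y_kept = y[:flagged_index] + y[flagged_index + 1:]
reduced = fit_line(x_kept, y_kept)

print(f"All observations:   intercept={full[0]:.3f}, slope={full[1]:.3f}")
print(f"Without flagged:    intercept={reduced[0]:.3f}, slope={reduced[1]:.3f}")
```

In this sketch the slope more than triples once the flagged point is dropped, which is the kind of shift that marks an observation as influential.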

The following plots show the difference between these two fitted regression equations:

Notes

It’s important to note that Cook’s distance should be used as a way to identify potentially influential observations. However, just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset.

First, you should verify that the observation isn’t the result of a data entry error or some other odd occurrence. If it turns out to be a legitimate value, you can then decide to deal with it in one of the following ways:

  • Delete it from the dataset.
  • Leave it in the dataset.
  • Replace it with an alternative value like the mean or median.

Depending on your specific scenario, one of these options may make more sense than the others.
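The three options above can be sketched as a small helper. The function handle_observation, its strategy names, and the values are hypothetical and serve only as an illustration. The sketch operates on a single list of values; in a regression setting, deleting an observation would mean removing both its x and y values.

```python
from statistics import median


def handle_observation(values, index, strategy):
    """Return a new list with the observation at `index` handled per `strategy`.

    strategy is one of:
      "delete" - drop the observation
      "keep"   - leave the data unchanged
      "median" - replace the observation with the median of the other values
    """
    others = values[:index] + values[index + 1:]
    if strategy == "delete":
        return others
    if strategy == "keep":
        return list(values)
    if strategy == "median":
        replaced = list(values)
        replaced[index] = median(others)
        return replaced
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical response values; the last one is assumed to have been flagged.
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]
flagged_index = 7

for strategy in ("delete", "keep", "median"):
    print(strategy, handle_observation(y, flagged_index, strategy))
```

Whichever option is chosen, refitting the model afterwards and comparing the coefficients shows how much the decision matters.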

How to Calculate Cook’s Distance in Practice

The following tutorials explain how to calculate Cook’s distance for a given dataset in Python and R:
