Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a statistical technique used for classification and dimensionality reduction. It aims to find a linear combination of features that best separates different classes of data, and can be used for both binary and multiclass classification problems. LDA works by maximizing the ratio of between-class variance to within-class variance, making it an effective method for distinguishing between different groups or clusters of data. It is commonly used in various fields such as machine learning, pattern recognition, and data analysis to improve the accuracy and efficiency of classification tasks.

What is Linear Discriminant Analysis?

Linear Discriminant Analysis is a statistical test used to predict a single categorical variable using one or more other continuous variables. It also is used to determine the numerical relationship between such sets of variables. The variable you want to predict should be categorical and your data should meet the other assumptions listed below.

Linear Discriminant Analysis is a statistical test used to predict a single categorical variable using one or more other continuous variables.

Linear Discriminant Analysis is sometimes also called normal discriminant analysis (NDA), or discriminant function analysis.

Assumptions for Linear Discriminant Analysis

Every statistical method has assumptions. Assumptions mean that your data must satisfy certain properties in order for statistical method results to be accurate.

The assumptions for Linear Discriminant Analysis include:

  1. Linearity
  2. No Outliers
  3. Independence
  4. No Multicollinearity
  5. Similar Spread Across Range
  6. Normality

Let’s dive in to each one of these separately.


Logistic regression fits a logistic curve to binary data. This logistic curve can be interpreted as the probability associated with each outcome across independent variable values. Logistic regression assumes that the relationship between the natural log of these probabilities (when expressed as odds) and your predictor variable is linear.

No Outliers

The variables that you care about must not contain outliers. Logistic Regression is sensitive to outliers, or data points that have unusually large or small values. You can tell if your variables have outliers by plotting them and observing if any points are far from all other points.


Each of your observations (data points) should be independent. This means that each value of your variables doesn’t “depend” on any of the others. For example, this assumption is usually violated when there are multiple data points over time from the same unit of observation (e.g. subject/participant/customer/store), because the data points from the same unit of observation are likely to be related or affect one another.

No Multicollinearity

Multicollinearity refers to the scenario when two or more of the independent variables are substantially correlated amongst each other. When multicollinearity is present, the regression coefficients and statistical significance become unstable and less trustworthy, though it doesn’t affect how well the model fits the data per se.

Similar Spread Across Range

In statistics this is called homoscedasticity, which describes when variables have a similar spread across their ranges.

Homoscedasticity describes a variable that has equal spread across its range. In this figure, the two variables in the top plot satisfy this assumption, whereas the two in the bottom plot do not.


Linear Discriminant Analysis assumes that the distribution of these your variables will match a normal (or bell curve) distribution shape.

When to use Linear Discriminant Analysis?

You should use Linear Discriminant Analysis in the following scenario:

  1. You want to use one variable in a prediction of another, or you want to quantify the numerical relationship between two variables
  2. The variable you want to predict (your dependent variable) is categorical
  3. Your dependent variables are all continuous

Let’s clarify these to help you know when to use Linear Discriminant Analysis


You are looking for a statistical test to predict one variable using another. This is a prediction question. Other types of analyses include examining the strength of the relationship between two variables (correlation) or examining differences between groups (difference).

Categorical Dependent Variable

A categorical variable is a variable that describes a category that doesn’t relate naturally to a number. Examples of categorical variables are eye color, city of residence, type of dog, etc..

Types of data that are NOT categorical include ordered data (such as finishing place in a race, best business rankings, etc.), binary data (true/false, purchased the product or not, etc.), or continuous data (height, income, etc.).

If your dependent variable is continuous, you should use Simple Linear Regression, and if your dependent variable is binary, then you should use Simple Logistic Regression.

All Continuous

Linear Discriminant Analysis is used when each of the predictor variables is continuous.

If your independent variables are all continuous, then you can use Multinomial Logistic Regression.

Linear Discriminant Analysis Example

Dependent Variable: Website format preference (e.g. format A, B, C, etc)
Independent Variable 1: Consumer age
Independent Variable 2: Consumer income

The null hypothesis, which is statistical lingo for what would happen if the treatment does nothing, is that there is no relationship between consumer age/income and website format preference. Our test will assess the likelihood of this hypothesis being true.

We gather our data and observe that consumer income is not normally distributed, so we transform it to meet the normality assumption of this analysis.

One output of linear discriminant analysis is a formula describing the decision boundaries between website format preferences as a function of consumer age in income. In addition, the results of this analysis can be used to predict website preference using consumer age and income for other data points.