Multivariate Multiple Linear Regression

Multivariate Multiple Linear Regression is a statistical method used to analyze the relationship between multiple independent variables and multiple dependent variables. It is an extension of multiple linear regression, which predicts only a single dependent variable. This technique allows us to understand the impact of each independent variable on each dependent variable while controlling for the effects of the other variables. It is often used in data analysis and predictive modeling to make accurate predictions and uncover underlying relationships between variables. This method is particularly useful when dealing with complex data sets in which multiple variables influence multiple outcomes.


What is Multivariate Multiple Linear Regression?

Multivariate Multiple Linear Regression is a statistical test used to predict multiple outcome variables using one or more predictor variables. It is also used to determine the numerical relationship between these sets of variables. The variables you want to predict should be continuous, and your data should meet the other assumptions listed below.

In short, multivariate multiple linear regression uses multiple independent variables to predict multiple dependent variables: “multivariate” refers to the multiple dependent variables, and “multiple” to the multiple independent variables.

Assumptions for Multivariate Multiple Linear Regression

Every statistical method has assumptions: properties your data must satisfy in order for the method’s results to be accurate.

The assumptions for Multivariate Multiple Linear Regression include:

  1. Linearity
  2. No Outliers
  3. Similar Spread across Range
  4. Normality of Residuals
  5. No Multicollinearity

Let’s dive into each of these separately.

Linearity

The variables that you care about must be related linearly. This means that if you plot the variables, you will be able to draw a straight line that fits the shape of the data.
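A quick way to eyeball this is a scatter matrix of all the variables. Below is a minimal sketch using pandas and matplotlib; the file name stores.csv and the column names are hypothetical placeholders for your own data.

```python
# Visual linearity check: each off-diagonal panel should look roughly linear.
# "stores.csv" and the column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stores.csv")
pd.plotting.scatter_matrix(
    df[["revenue", "traffic", "ad_spend", "population"]], figsize=(8, 8)
)
plt.show()
```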

No Outliers

The variables that you care about must not contain outliers. Linear Regression is sensitive to outliers, or data points that have unusually large or small values. You can tell if your variables have outliers by plotting them and observing if any points are far from all other points.
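One simple way to screen for outliers numerically is to flag observations that sit several standard deviations from the mean. The sketch below uses scipy’s z-scores; the 3-standard-deviation cutoff and the column names are illustrative assumptions, not fixed rules.

```python
# Flag rows where any variable is more than 3 standard deviations from its mean.
# The cutoff of 3 and the column names are illustrative assumptions.
import pandas as pd
from scipy import stats

df = pd.read_csv("stores.csv")  # hypothetical data file
z = df[["revenue", "traffic", "ad_spend", "population"]].apply(stats.zscore)
outliers = (z.abs() > 3).any(axis=1)
print(df[outliers])  # inspect these rows before deciding how to handle them
```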

Similar Spread across Range

In statistics this is called homoscedasticity, which describes when variables have a similar spread across their ranges. In practice, this means the residuals of the model should have roughly constant spread across the range of the data: if the spread fans out or narrows as the values increase, the assumption is violated.
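If you prefer a formal check over eyeballing a plot, the Breusch-Pagan test in statsmodels tests the residuals of a fitted model for non-constant spread. The sketch below fits one of the two regressions from the example later in this article; run it once per dependent variable. The file and column names are hypothetical.

```python
# Breusch-Pagan test: a small p-value suggests heteroscedasticity (unequal spread).
# "stores.csv" and the column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("stores.csv")
X = sm.add_constant(df[["ad_spend", "population"]])
model = sm.OLS(df["revenue"], X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")
```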

Normality of Residuals

The word “residuals” refers to the values resulting from subtracting the expected (or predicted) dependent variables from the actual values. The distribution of these values should match a normal (or bell curve) distribution shape.

Meeting this assumption helps ensure that the results of the regression are equally applicable across the full spread of the data and that there is no systematic bias in the predictions.
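A common way to check this is a Q-Q plot of the residuals or the Shapiro-Wilk test. A minimal sketch, reusing the hypothetical file and columns from above:

```python
# Shapiro-Wilk test on the residuals: a large p-value is consistent with normality.
# "stores.csv" and the column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv("stores.csv")
X = sm.add_constant(df[["ad_spend", "population"]])
model = sm.OLS(df["revenue"], X).fit()

stat, p = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {p:.3f}")

sm.qqplot(model.resid, line="s")  # points hugging the line suggest normal residuals
plt.show()
```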

No Multicollinearity

Multicollinearity refers to the scenario in which two or more of the independent variables are substantially correlated with one another. When multicollinearity is present, the regression coefficients and their statistical significance become unstable and less trustworthy, though it doesn’t affect how well the model fits the data per se.
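The variance inflation factor (VIF) is a standard diagnostic here. As a rule of thumb, VIFs above roughly 5-10 are often taken as a warning sign, though the exact cutoff is a judgment call. A minimal sketch with statsmodels, again using the hypothetical file and columns:

```python
# Variance inflation factors: higher values mean a predictor is more redundant.
# "stores.csv" and the column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("stores.csv")
X = sm.add_constant(df[["ad_spend", "population"]])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # VIFs above ~5-10 commonly flag multicollinearity
```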


When to use Multivariate Multiple Linear Regression?

You should use Multivariate Multiple Linear Regression in the following scenario:

  1. You want to use one or more variables to predict multiple other variables, or you want to quantify the numerical relationships between them
  2. The variables you want to predict (your dependent variables) are continuous
  3. You have more than one independent variable, or predictor
  4. You have no repeated measures from the same unit of observation
  5. You have more than one dependent variable

Let’s clarify these to help you know when to use Multivariate Multiple Linear Regression.

Prediction

You are looking for a statistical test to predict one or more variables using others. This is a prediction question. Other types of analyses include examining the strength of the relationship between two variables (correlation) or examining differences between groups (difference).

Continuous Dependent Variable

The variables you want to predict must be continuous. Continuous means that your variable of interest can basically take on any value, such as heart rate, height, weight, number of ice cream bars you can eat in 1 minute, etc.

Types of data that are NOT continuous include ordered data (such as finishing place in a race, best business rankings, etc.), categorical data (gender, eye color, race, etc.), or binary data (purchased the product or not, has the disease or not, etc.).

If your dependent variable is binary, you should use Multiple Logistic Regression, and if your dependent variable is categorical, then you should use Multinomial Logistic Regression or Linear Discriminant Analysis.

More than One Independent Variable

Multivariate Multiple Linear Regression is used when you have more than one predictor variable, each with a value for every unit of observation.

No Repeated Measures

This method is suited for the scenario in which there is only one observation for each unit of observation. The unit of observation is what composes a “data point”: for example, a store, a customer, a city, etc.

If you have one or more independent variables but they are measured for the same group at multiple points in time, then you should use a Mixed Effects Model.

More than One Dependent Variable

To run Multivariate Multiple Linear Regression, you should have more than one dependent variable, or variable that you are trying to predict.

If you are only predicting one variable, you should use Multiple Linear Regression.


Multivariate Multiple Linear Regression Example

Dependent Variable 1: Revenue
Dependent Variable 2: Customer traffic
Independent Variable 1: Dollars spent on advertising by city
Independent Variable 2: City Population

The null hypothesis, which is statistical lingo for what would happen if the treatment does nothing, is that there is no relationship between revenue or customer traffic and advertising spend or city population. Our test will assess the likelihood of this hypothesis being true.

We gather our data and after assuring that the assumptions of linear regression are met, we perform the analysis.

This analysis effectively runs multiple linear regression twice, once for each dependent variable. Thus, when we run this analysis, we get beta coefficients and p-values for each term in the “revenue” model and in the “customer traffic” model. For any linear regression model, you will have one beta coefficient that equals the intercept of your linear regression line (often labelled with a 0, as β0). This is simply where the regression line crosses the y-axis if you were to plot your data. In the case of multiple linear regression, there are additional beta coefficients (β1, β2, etc.), one per independent variable, which represent the relationship between the independent and dependent variables. A short sketch of this two-model setup follows below.
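As a concrete sketch of what “running multiple linear regression twice” looks like in practice, the snippet below fits one model per dependent variable using statsmodels’ formula interface. The file name cities.csv and the column names are hypothetical placeholders for the example’s variables.

```python
# One multiple linear regression per dependent variable, as described above.
# "cities.csv" and the column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cities.csv")
revenue_model = smf.ols("revenue ~ ad_spend + population", data=df).fit()
traffic_model = smf.ols("traffic ~ ad_spend + population", data=df).fit()

print(revenue_model.params)   # beta0 (Intercept), beta1 (ad_spend), beta2 (population)
print(revenue_model.pvalues)  # p-value for each term in the revenue model
print(traffic_model.summary())
```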

These additional beta coefficients are the key to understanding the numerical relationship between your variables. Essentially, for each unit (value of 1) increase in a given independent variable, your dependent variable is expected to change by the value of the beta coefficient associated with that independent variable (while holding other independent variables constant).

The p-value associated with each of these additional beta values is the chance of seeing our results assuming there is actually no relationship between that variable and the outcome. A p-value less than or equal to 0.05 is conventionally taken to mean the result is statistically significant, i.e., that we can be reasonably confident the relationship is not due to chance alone. To get an overall p-value for the model and individual p-values that represent each variable’s effect across the two models, a MANOVA is often used.
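statsmodels exposes this through its MANOVA class. The sketch below, with the same hypothetical file and columns, reports the standard multivariate test statistics (Wilks’ lambda, Pillai’s trace, etc.) for each term across both outcomes.

```python
# Multivariate tests across both dependent variables at once.
# "cities.csv" and the column names are hypothetical placeholders.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("cities.csv")
mv = MANOVA.from_formula("revenue + traffic ~ ad_spend + population", data=df)
print(mv.mv_test())  # Wilks' lambda, Pillai's trace, etc. for each term
```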

In addition, this analysis will produce an R-squared (R²) value for each model. This value can range from 0 to 1 and represents how well the linear regression line fits your data points. The higher the R², the better your model fits your data.
