how to Use aggregate() and Not Drop Rows with NA in R

The aggregate() function in R can be used to compute summary statistics of data. This function does not drop rows with NA values by default, and instead returns NA in the corresponding output statistic. If the data contains NA values, users should set the na.rm argument to TRUE in order to remove those rows from the calculations. This will ensure that the output summary statistics are accurate and represent the true mean or sum of the data.


You can use the aggregate() function in R to calculate summary statistics for variables in a data frame.

By default, if the aggregate() function encounters a row in a data frame with one or more NA values, it will simply drop the row when performing calculations.

This can cause unintended consequences when performing calculations.

To avoid this behavior, you can use the argument na.action=NULL within the aggregate() function.

The following example shows how to use this argument in practice.

Example: Use aggegate() in R and Do Not Drop Rows with NA

Suppose we have the following data frame in R that shows the points and assists for basketball players on various teams:

#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'),
                 points=c(5, 9, 12, 14, 14, 13, 10, 6, 15, 18),
                 assists=c(NA, 4, 4, 5, 3, 6, 8, 4, 10, NA))

#view data frame
df

   team points assists
1     A      5      NA
2     A      9       4
3     A     12       4
4     A     14       5
5     A     14       3
6     B     13       6
7     B     10       8
8     B      6       4
9     C     15      10
10    C     18      NA

Now suppose that we attempt to use the aggregate() function to calculate the sum of points and assists, grouped by team:

#attempt to calculate sum of points and assists, grouped by team
aggregate(. ~ team, data=df, FUN=sum, na.rm=TRUE)

  team points assists
1    A     49      16
2    B     29      18
3    C     15      10

The output appears to show us the sum of points and assists by team, but the rows with NA values were actually dropped when performing these calculations.

We can confirm this by viewing the original data frame and seeing that team C has two values in the points column:

  • 15
  • 18

Thus, team C should have a sum of points of 33, but the output only shows 15.

This is because the row with a points value of 18 has a value of NA in the assists column, which means this row was actually not used when calculating the sum of points for team C.

To ensure that rows with NA values are not dropped when performing calculations, we must use the argument na.action=NULL as follows:

#calculate sum of points and assists, grouped by team (don't drop NA rows)
aggregate(. ~ team, data=df, FUN=sum, na.rm=TRUE, na.action=NULL)

  team points assists
1    A     54      16
2    B     29      18
3    C     33      10

Note: The argument na.rm=TRUE specifies that NA values should be ignored when performing a calculation in a specific column.

The following tutorials explain how to perform other common tasks in R:

 

x