How can data cleaning be performed in R, with an example?

How can data cleaning be performed in R, with an example?

Data cleaning is a process of identifying and correcting inaccurate, incomplete, or irrelevant data in a dataset to ensure its accuracy and reliability for analysis. In R, data cleaning can be performed using various functions and packages such as “tidyverse” and “dplyr”. For example, the “na.omit” function can be used to remove rows with missing values, the “select” function can be used to select specific columns, and the “mutate” function can be used to create new variables. Furthermore, the “gsub” function can be used to replace incorrect values in a dataset. Overall, data cleaning in R involves using a combination of functions and packages to identify and correct any errors or inconsistencies in the data.

Perform Data Cleaning in R (With Example)


Data cleaning refers to the process of transforming into data that is suitable for analysis or model-building.

In most cases, “cleaning” a dataset involves dealing with missing values and duplicated data.

Here are the most common ways to “clean” a dataset in R:

Method 1: Remove Rows with Missing Values

library(dplyr)

#remove rows with any missing values
df %>% na.omit()

Method 2: Replace Missing Values with Another Value

library(dplyr)
library(tidyr)

#replace missing values in each numeric column with median value of column
df %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))

Method 3: Remove Duplicate Rows

library(dplyr)

df %>% distinct(.keep_all=TRUE)

The following examples show how to use each of these methods in practice with the following data frame in R that contains information about various basketball players:

#create data frame
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(4, 4, NA, 8, 6, 12, 14, 86, 13, 8),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(2, 2, NA, 7, 6, 6, 9, 10, NA, 14))

#view data frame
df

   team points rebounds assists
1     A      4        9       2
2     A      4        9       2
3     B     NA        7      NA
4     C      8        6       7
5     D      6        8       6
6     E     12       NA       6
7     F     14        9       9
8     G     86       14      10
9     H     13       12      NA
10    I      8       11      14

Example 1: Remove Rows with Missing Values

We can use the following syntax to remove rows with missing values in any column:

library(dplyr)

#remove rows with missing values
new_df <- df %>% na.omit()

#view new data frame
new_df

   team points rebounds assists
1     A      4        9       2
2     A      4        9       2
4     C      8        6       7
5     D      6        8       6
7     F     14        9       9
8     G     86       14      10
10    I      8       11      14

Notice that the new data frame does not contain any rows with missing values.

Example 2: Replace Missing Values with Another Value

We can use the following syntax to replace any missing values with the median value of each column:

library(dplyr)
library(tidyr)

#replace missing values in each numeric column with median value of column
new_df <-df %>% mutate(across(where(is.numeric),~replace_na(.,median(.,na.rm=TRUE)))) 

#view new data frame
new_df

   team points rebounds assists
1     A      4        9     2.0
2     A      4        9     2.0
3     B      8        7     6.5
4     C      8        6     7.0
5     D      6        8     6.0
6     E     12        9     6.0
7     F     14        9     9.0
8     G     86       14    10.0
9     H     13       12     6.5
10    I      8       11    14.0

Notice that the missing values in each numeric column have each been replaced with the median value of the column.

Note that you could also replace median in the formula with mean to instead replace missing values with the mean value of each column.

Note: We also had to load the tidyr package in this example because the drop_na() function comes from this package.

Example 3: Remove Duplicate Rows

We can use the following syntax to replace any missing values with the median value of each column:

library(dplyr)

#remove duplicate rows
new_df <- df %>% distinct(.keep_all=TRUE)

#view new data frame
new_df

  team points rebounds assists
1    A      4        9       2
2    B     NA        7      NA
3    C      8        6       7
4    D      6        8       6
5    E     12       NA       6
6    F     14        9       9
7    G     86       14      10
8    H     13       12      NA
9    I      8       11      14

Notice that the second row has been removed from the data frame because each of the values in the second row were duplicates of the values in the first row.

Note: You can find the complete documentation for the dplyr distinct() function .

Cite this article

stats writer (2024). How can data cleaning be performed in R, with an example?. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/how-can-data-cleaning-be-performed-in-r-with-an-example/

stats writer. "How can data cleaning be performed in R, with an example?." PSYCHOLOGICAL SCALES, 25 Jun. 2024, https://scales.arabpsychology.com/stats/how-can-data-cleaning-be-performed-in-r-with-an-example/.

stats writer. "How can data cleaning be performed in R, with an example?." PSYCHOLOGICAL SCALES, 2024. https://scales.arabpsychology.com/stats/how-can-data-cleaning-be-performed-in-r-with-an-example/.

stats writer (2024) 'How can data cleaning be performed in R, with an example?', PSYCHOLOGICAL SCALES. Available at: https://scales.arabpsychology.com/stats/how-can-data-cleaning-be-performed-in-r-with-an-example/.

[1] stats writer, "How can data cleaning be performed in R, with an example?," PSYCHOLOGICAL SCALES, vol. X, no. Y, ص Z-Z, June, 2024.

stats writer. How can data cleaning be performed in R, with an example?. PSYCHOLOGICAL SCALES. 2024;vol(issue):pages.

Download Post (.PDF)
Slide Up
x
PDF
Scroll to Top