How can the sample.split() function be used in R?

The sample.split() function in R is used to split data into training and testing subsets. It takes a vector of outcomes as input and returns a logical vector that randomly assigns each observation to either the training set (TRUE) or the testing set (FALSE) based on a specified proportion. This is helpful in machine learning and data analysis tasks, where a model should be evaluated on data it was not trained on. Holding out a test set gives a more honest estimate of model performance and makes overfitting easier to detect.


You can use the sample.split() function from the caTools package in R to split a data frame into training and testing sets for model building.

This function uses the following basic syntax:

sample.split(Y, SplitRatio, …)

where:

  • Y: vector of outcomes, with one value per row of the data
  • SplitRatio: proportion of the data to use in the training set (for example, 0.8 for 80%)
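
To make the return value concrete, here is a minimal sketch on a small, made-up label vector. The names y and mask are illustrative only and are not part of caTools, and the sketch assumes the caTools package is installed. It shows that sample.split() returns a logical vector with one entry per element of Y, and that when Y contains only a few distinct labels, each label is split separately so both sets keep a similar label mix.

```r
library(caTools)

#make this example reproducible
set.seed(1)

#hypothetical outcome vector: 6 observations of "a" and 4 of "b"
y <- c(rep("a", 6), rep("b", 4))

#TRUE marks observations assigned to the training set
mask <- sample.split(y, SplitRatio=0.5)

#the result is a logical vector with one entry per element of y
is.logical(mask)
length(mask) == length(y)

#cross-tabulate labels against the split assignment
table(y, mask)
```

The logical vector can then be used to index the rows of a data frame, which is what the example below does with subset().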

The following example shows how to use this function in practice.

Example: How to Use sample.split() in R

Suppose we have some data frame in R with 1,000 rows that contains information about hours studied by students and their corresponding score on a final exam:

#make this example reproducible
set.seed(0)

#create data frame
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))

#view head of data frame
head(df)

     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139

Suppose we would like to fit a model that uses hours studied to predict final exam score.

We would like to train the model on 80% of the rows in the data frame and test it on the remaining 20% of rows.

The following code shows how to use the sample.split() function from the caTools package to split the data frame into training and testing sets:

library(caTools)

#specify split (TRUE = training row, FALSE = test row)
split <- sample.split(df$score, SplitRatio=0.8)

#create training set
df_train <- subset(df, split==TRUE)

#create test set
df_test <- subset(df, split==FALSE)

#view number of rows in each set
nrow(df_train)

[1] 800

nrow(df_test)

[1] 200

We can see that our training dataset contains 800 rows, which represents 80% of the original dataset.

Similarly, we can see that our test dataset contains 200 rows, which represents 20% of the original dataset.
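
As a quick sanity check on the counts above, the following sketch recomputes the split and confirms the proportions directly. It repeats the setup from the example so that it runs on its own, and it assumes the caTools package is installed. The overlap check relies on the original row names being carried over by subset().

```r
library(caTools)

#recreate the data frame and split from the example above
set.seed(0)
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
split <- sample.split(df$score, SplitRatio=0.8)
df_train <- subset(df, split==TRUE)
df_test <- subset(df, split==FALSE)

#share of rows that landed in each set
nrow(df_train) / nrow(df)
nrow(df_test) / nrow(df)

#number of rows that appear in both sets (should be zero)
length(intersect(rownames(df_train), rownames(df_test)))

#together the two sets should account for every original row
nrow(df_train) + nrow(df_test) == nrow(df)
```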

We can also view the first few rows of each set:

#view head of training set
head(df_train)

     hours    score
1 8.966972 55.93220
5 9.082078 97.29928
6 2.016819 47.10139
7 8.983897 42.34600
8 9.446753 70.27030
9 6.607978 74.70895

#view head of testing set
head(df_test)

      hours    score
2  2.655087 71.84853
3  3.721239 81.09165
4  5.728534 62.99700
20 3.800352 47.95551
23 2.121425 89.17611
35 1.862176 98.07025
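
With the two sets in hand, the model can be trained and evaluated. The sketch below is one possible way to do this, not the only one: it assumes a simple linear regression fit with lm() and uses root mean squared error (RMSE) as the test metric. Neither choice is specified in the example above, and the names model, pred, and rmse are illustrative.

```r
library(caTools)

#recreate the data frame and split from the example above
set.seed(0)
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
split <- sample.split(df$score, SplitRatio=0.8)
df_train <- subset(df, split==TRUE)
df_test <- subset(df, split==FALSE)

#fit a simple linear regression on the training set only (assumed model type)
model <- lm(score ~ hours, data=df_train)

#predict exam scores for the held-out test rows
pred <- predict(model, newdata=df_test)

#root mean squared error on the test set
rmse <- sqrt(mean((df_test$score - pred)^2))
rmse
```

Because hours and score in this example are independent random draws, the fitted model should not be expected to predict well. The point of the sketch is the workflow: fit on df_train, then measure error on df_test.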
