The sample.split() function in R is used to split data into training and testing subsets. It takes a vector of outcome values as input and returns a logical vector that randomly marks each observation as belonging to either the training set (TRUE) or the testing set (FALSE), based on a specified proportion. This is helpful in machine learning and data analysis tasks, where a separate set of data is needed to measure the accuracy and performance of a model on observations it was not fitted to. Evaluating on held-out data gives a less biased estimate of model performance and makes overfitting easier to detect.
You can use the sample.split() function from the caTools package in R to split a data frame into training and testing sets for model building.
This function uses the following basic syntax:
sample.split(Y, SplitRatio, …)
where:
- Y: vector of outcomes
- SplitRatio: proportion of data to use in training set (for example, 0.8 for 80%)
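To make the return value concrete, here is a minimal sketch using a made-up factor of labels. The vector `labels`, its 60/40 class mix, and the name `mask` are assumptions for illustration only. It shows that sample.split() returns one logical value per element of Y and, as the caTools documentation describes, that the relative ratios of the labels in Y are preserved in the split.

```r
library(caTools)

# make this sketch reproducible
set.seed(1)

# hypothetical outcome vector: 60 "pass" labels and 40 "fail" labels
labels <- factor(rep(c("pass", "fail"), times = c(60, 40)))

# TRUE marks elements for the training set, FALSE for the test set
mask <- sample.split(labels, SplitRatio = 0.75)

# one logical value per element of Y
length(mask)
is.logical(mask)

# cross-tabulate labels against the split to compare class proportions
table(labels, mask)
```

The result is a mask rather than a pair of data frames, so it still has to be applied to the data with subset() or logical indexing.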
The following example shows how to use this function in practice.
Example: How to Use sample.split() in R
Suppose we have some data frame in R with 1,000 rows that contains information about hours studied by students and their corresponding score on a final exam:
#make this example reproducible
set.seed(0)
#create data frame
df <- data.frame(hours=runif(1000, min=0, max=10),
score=runif(1000, min=40, max=100))
#view head of data frame
head(df)
     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139
Suppose we would like to fit a model that uses hours studied to predict final exam score.
We would like to train the model on 80% of the rows in the data frame and test it on the remaining 20% of rows.
The following code shows how to use the sample.split() function from the caTools package to split the data frame into training and testing sets:
library(caTools)
#specify split
split = sample.split(df$score, SplitRatio=0.8)
#create training set
df_train = subset(df, split==TRUE)
#create test set
df_test = subset(df, split==FALSE)
#view number of rows in each set
nrow(df_train)
[1] 800
nrow(df_test)
[1] 200
We can see that our training dataset contains 800 rows, which represents 80% of the original dataset.
Similarly, we can see that our test dataset contains 200 rows, which represents 20% of the original dataset.
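These counts can also be checked programmatically. The sketch below rebuilds the same data frame so that it runs on its own, then confirms the share of rows flagged for training and that the two subsets together cover every row exactly once. It is only a sanity check under the same seed and settings as the example, not an additional requirement of sample.split().

```r
library(caTools)

# same seed and data as the example
set.seed(0)
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))

# same split as the example
split <- sample.split(df$score, SplitRatio = 0.8)
df_train <- subset(df, split == TRUE)
df_test  <- subset(df, split == FALSE)

# share of rows assigned to the training set
mean(split)

# counts of FALSE (test) and TRUE (training) flags
table(split)

# number of rows that appear in both sets
length(intersect(rownames(df_train), rownames(df_test)))

# TRUE if the two sets together account for every original row
nrow(df_train) + nrow(df_test) == nrow(df)
```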
We can also view the first few rows of each set:
#view head of training set
head(df_train)
     hours    score
1 8.966972 55.93220
5 9.082078 97.29928
6 2.016819 47.10139
7 8.983897 42.34600
8 9.446753 70.27030
9 6.607978 74.70895
#view head of testing set
head(df_test)
      hours    score
2  2.655087 71.84853
3  3.721239 81.09165
4  5.728534 62.99700
20 3.800352 47.95551
23 2.121425 89.17611
35 1.862176 98.07025
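With both sets in hand, the model mentioned earlier can be fitted on the training rows and scored on the held-out rows. The sketch below assumes a simple linear regression via lm() and uses root mean squared error (RMSE) as the evaluation metric. Neither choice is specified in the example, so treat both as illustrative. Because hours and score were generated independently here, the fitted slope should not be expected to be meaningful.

```r
library(caTools)

# same seed, data, and split as the example
set.seed(0)
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))
split <- sample.split(df$score, SplitRatio = 0.8)
df_train <- subset(df, split == TRUE)
df_test  <- subset(df, split == FALSE)

# fit the model on the training set only (linear regression is an assumption)
model <- lm(score ~ hours, data = df_train)

# predict scores for the rows the model has not seen
pred <- predict(model, newdata = df_test)

# root mean squared error on the test set
rmse <- sqrt(mean((df_test$score - pred)^2))
rmse
```

Keeping df_test out of the lm() call is what makes the RMSE an out-of-sample estimate rather than a measure of fit to the training data.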