How can I use the cut() function in R?

The cut() function in R bins a continuous variable by dividing its range into intervals. It takes a numeric vector and returns a factor whose levels indicate the intervals. This is useful in data analysis for grouping observations into categories such as age groups or income brackets, and for converting continuous data into discrete categories for plotting. You can also specify a label for each bin.

This function uses the following syntax:

cut(x, breaks, labels = NULL, …)

where:

  • x: A numeric vector to be cut into bins
  • breaks: Number of intervals to create, or a vector of break points
  • labels: Labels for the resulting bins
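
As a quick illustration of these arguments, here is a minimal sketch on a bare vector. The vector scores and the labels "low" and "high" are made up for this illustration and are not part of the data frame used in the examples.

```r
# hypothetical vector, used only for this illustration
scores <- c(1, 5, 9)

# breaks given as a vector of cut points, labels naming the two resulting bins
binned <- cut(scores, breaks = c(0, 5, 10), labels = c("low", "high"))

# cut() returns a factor with one level per bin
class(binned)
levels(binned)
as.character(binned)
```

Because the result is a factor, it can be passed directly to functions such as table() to count the values in each bin.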

The following examples show how to use this function in different scenarios with the following data frame in R:

#create data frame
df <- data.frame(player=c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(4, 7, 8, 12, 14, 16, 20, 26, 36))

#view data frame
df

  player points
1      A      4
2      B      7
3      C      8
4      D     12
5      E     14
6      F     16
7      G     20
8      H     26
9      I     36

Example 1: Cut Vector Based on Number of Breaks

The following code shows how to use the cut() function to create a new column called category that cuts the points column into four bins of equal width:

#create new column that places each player into four categories based on points
df$category <- cut(df$points, breaks=4)

#view updated data frame
df

  player points  category
1      A      4 (3.97,12]
2      B      7 (3.97,12]
3      C      8 (3.97,12]
4      D     12 (3.97,12]
5      E     14   (12,20]
6      F     16   (12,20]
7      G     20   (12,20]
8      H     26   (20,28]
9      I     36   (28,36]

Since we specified breaks=4, the cut() function split the values in the points column into four bins of equal width.

Here is how the cut() function did this:

  • First, it found the difference between the largest and smallest values in the points column (36 – 4 = 32)
  • Then, it divided this difference by 4 (32 / 4 = 8)
  • The result is four bins each with a width of 8
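
The arithmetic in the steps above can be reproduced by hand. This is a sketch of the idea rather than the internal code of cut(), and the variable names are chosen for illustration only.

```r
points <- c(4, 7, 8, 12, 14, 16, 20, 26, 36)

# difference between the largest and smallest values
rng <- max(points) - min(points)

# width of each of the four bins
width <- rng / 4

# five evenly spaced edges delimit four bins
edges <- seq(min(points), max(points), length.out = 5)

rng
width
edges
```

The five edges (4, 12, 20, 28, 36) line up with the interval boundaries shown in the category column, apart from the small adjustment to the outer limits.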

Note: The lower bound of the first interval is 3.97 instead of 4 because of the following behavior described in the cut() documentation:

When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals.
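
To see the effect of this adjustment, the sketch below widens the outer edges by 0.1% of the range and passes them to cut() as explicit break points. It assumes the default label formatting of three significant digits (dig.lab = 3), and the variable names are illustrative.

```r
points <- c(4, 7, 8, 12, 14, 16, 20, 26, 36)

rng <- diff(range(points))
edges <- seq(min(points), max(points), length.out = 5)

# move the outer limits away by 0.1% of the range
edges[1] <- edges[1] - rng / 1000
edges[5] <- edges[5] + rng / 1000

manual <- cut(points, breaks = edges)
auto   <- cut(points, breaks = 4)

# compare the bins produced by the two approaches
identical(levels(manual), levels(auto))
identical(as.integer(manual), as.integer(auto))
```

The upper limit is moved as well, from 36 to 36.032, but it is still displayed as 36 because the labels are rounded to three significant digits.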

Example 2: Cut Vector Based on Specific Break Points

The following code shows how to use the cut() function to create a new column called category that cuts the points column based on a vector of specific break points:

#create new column based on specific break points
df$category <- cut(df$points, breaks=c(0, 10, 15, 20, 40))

#view updated data frame
df

  player points category
1      A      4   (0,10]
2      B      7   (0,10]
3      C      8   (0,10]
4      D     12  (10,15]
5      E     14  (10,15]
6      F     16  (15,20]
7      G     20  (15,20]
8      H     26  (20,40]
9      I     36  (20,40]

The cut() function categorized each player into bins based on the specific vector of break points we provided.
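
By default the intervals are closed on the right, so a value equal to a break point falls into the bin that ends at that point, and values outside the outermost break points become NA. The sketch below checks this on a few made-up values that are not part of the data frame.

```r
# hypothetical values chosen to sit on or outside the break points
test_values <- c(10, 20, 45)

result <- cut(test_values, breaks = c(0, 10, 15, 20, 40))

# 10 lands in (0,10], 20 lands in (15,20], and 45 is outside every bin
as.character(result)
```

This is why player G, with exactly 20 points, is placed in (15,20] rather than (20,40].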

Example 3: Cut Vector Using Specific Break Points and Labels

The following code shows how to use the cut() function to create a new column called category that cuts the points column based on a vector of specific break points with custom labels:

#create new column based on values in points column
df$category <- cut(df$points,
                   breaks=c(0, 10, 15, 20, 40),
                   labels=c('Bad', 'OK', 'Good', 'Great'))

#view updated data frame
df

  player points category
1      A      4      Bad
2      B      7      Bad
3      C      8      Bad
4      D     12       OK
5      E     14       OK
6      F     16     Good
7      G     20     Good
8      H     26    Great
9      I     36    Great

The new category column classifies each player as Bad, OK, Good, or Great depending on their corresponding value in the points column.

Note: The number of labels must be exactly one less than the number of break points. Otherwise, cut() returns the following error:

Error in cut.default(df$points, breaks = c(0, 10, 15, 20, 40), labels = c("Bad",  : 
  lengths of 'breaks' and 'labels' differ
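
One way to guard against this is to compare the two lengths before calling cut(). The following is a sketch: the variable names are illustrative, and tryCatch() is used only so that the mismatch can be shown without stopping the script.

```r
breaks <- c(0, 10, 15, 20, 40)
labels <- c("Bad", "OK", "Good", "Great")

# five break points define four bins, so four labels are needed
length(labels) == length(breaks) - 1

# supplying only three labels triggers an error, caught here for illustration
outcome <- tryCatch(
  cut(c(4, 12, 36), breaks = breaks, labels = labels[1:3]),
  error = function(e) "error"
)
outcome
```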
