What is A Complete Guide to the Built-in Datasets in R?

A Complete Guide to the Built-in Datasets in R is a reference for the datasets that ship with the R programming language. It introduces several of the most widely used datasets, describes their structure, and shows how to inspect, summarize, and visualize them. It is intended for both beginners and experienced R users.

A Complete Guide to the Built-in Datasets in R


The R programming language comes with several built-in datasets that are useful for practicing building models, summarizing datasets, and creating visualizations.

You can find a complete list of available built-in datasets by typing the following into your R console:

library(help='datasets')
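If you would rather work with the list as an R object than read it in the help viewer, the following sketch may help. It relies on the base function data(), whose result holds a character matrix named results; the variable name ds is only illustrative.

```r
# List the datasets in the 'datasets' package as a character matrix
ds <- data(package = "datasets")$results

# Show the name and short title of the first few entries
print(head(ds[, c("Item", "Title")]))

# Count how many entries are listed
print(nrow(ds))
```

Each row of the matrix describes one dataset, so it can be filtered or searched like any other matrix.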

The datasets package contains over 100 built-in datasets, but some of the most popular ones include:

  • iris: A dataset that contains measurements of 4 different attributes (in centimeters) for 150 flowers, 50 from each of 3 different species.
  • mtcars: A dataset that contains measurements of 11 different attributes for 32 different cars.
  • airquality: A dataset that contains daily air quality measurements in New York City from May to September 1973, with 153 observations and 6 variables.
  • AirPassengers: A time series that contains the monthly totals of international airline passengers (in thousands) from 1949 to 1960.
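The size and type of each of these datasets can be checked directly in the console. The sketch below uses only base R functions; the variable names are illustrative. Note that AirPassengers is a time series (class ts) rather than a data frame, so it is inspected with start(), end(), and frequency() instead of dim().

```r
# Rows and columns of the three data frames
iris_dim <- dim(iris)
mtcars_dim <- dim(mtcars)
airquality_dim <- dim(airquality)
print(iris_dim)
print(mtcars_dim)
print(airquality_dim)

# AirPassengers is a monthly time series, not a data frame
ap_class <- class(AirPassengers)
print(ap_class)
print(start(AirPassengers))      # first year and month
print(end(AirPassengers))        # last year and month
print(frequency(AirPassengers))  # observations per year
```

To read the full documentation for a dataset, including its source, type a question mark followed by its name, for example ?iris.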

The following example explains how to gain a quick understanding of any of these datasets by using the iris dataset as an example.

Example: How to Analyze a Built-in Dataset in R

One of the easiest ways to gain a quick understanding of a built-in dataset is to use the head() function, which displays the first six rows of the dataset by default.

#view first six rows of iris dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
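The number of rows shown is not fixed at six. As a small sketch, the n argument of head() changes how many rows are returned, and the companion function tail() returns rows from the end of the dataset instead. The variable names here are illustrative.

```r
# View only the first three rows
first_three <- head(iris, n = 3)
print(first_three)

# View the last three rows
last_three <- tail(iris, n = 3)
print(last_three)
```

Looking at both ends of a dataset is a quick way to notice whether the rows are sorted by some variable, as they are here by Species.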

You can also use the summary function to quickly summarize each variable in the dataset:

#summarize iris dataset
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

For each of the numeric variables we can see the following information:

  • Min: The minimum value.
  • 1st Qu: The value of the first quartile (25th percentile).
  • Median: The median value.
  • Mean: The mean value.
  • 3rd Qu: The value of the third quartile (75th percentile).
  • Max: The maximum value.
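The same statistics can also be computed one variable at a time, which may be convenient when only a single column is of interest. This sketch uses the base functions quantile() and mean() on Sepal.Length; the variable names are illustrative.

```r
# Minimum, quartiles, and maximum of sepal length
sepal_quartiles <- quantile(iris$Sepal.Length)
print(sepal_quartiles)

# Mean of sepal length
sepal_mean <- mean(iris$Sepal.Length)
print(sepal_mean)
```

The values labeled 0%, 25%, 50%, 75%, and 100% correspond to Min, 1st Qu, Median, 3rd Qu, and Max in the summary() output.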

For the only categorical variable in the dataset (Species) we see a frequency count of each value:

  • setosa: This species occurs 50 times.
  • versicolor: This species occurs 50 times.
  • virginica: This species occurs 50 times.
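The frequency counts for a categorical variable can be reproduced on their own with table(), and prop.table() converts them to proportions. This is a minimal sketch with illustrative variable names.

```r
# Count how many times each species occurs
species_counts <- table(iris$Species)
print(species_counts)

# Express the counts as proportions of all rows
species_share <- prop.table(species_counts)
print(species_share)
```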

You can also use the dim function to get the dimensions of the dataset in terms of number of rows and number of columns:

#display rows and columns
dim(iris)

[1] 150   5
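If you need the row count or column count separately, or want to see the column names and types, a few other base functions may be useful. The sketch below uses nrow(), ncol(), names(), and str(); the variable names are illustrative.

```r
# Number of rows and number of columns as separate values
n_rows <- nrow(iris)
n_cols <- ncol(iris)
print(n_rows)
print(n_cols)

# Column names
col_names <- names(iris)
print(col_names)

# Compact overview of each column's type and first few values
str(iris)
```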

We can also create some plots to visualize the values in the dataset.

For example, we can use the hist() function to create a histogram of the values for a certain variable:

#create histogram of values for sepal length
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')

This histogram allows us to visualize the distribution of values for the Sepal.Length variable.
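To see the numbers behind the bars, hist() can be called with plot = FALSE, in which case it returns the bin boundaries and counts without drawing anything. This is a small sketch; the variable name h is illustrative.

```r
# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, plot = FALSE)

# Bin boundaries chosen by hist()
print(h$breaks)

# Number of flowers falling in each bin
print(h$counts)
```

The counts always add up to the number of non-missing values in the variable, which is 150 for Sepal.Length.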

Feel free to use each of the functions shown here to explore any of the built-in datasets in R that you’d like.
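As one possible illustration, the same functions can be applied to the airquality dataset. Unlike iris, this dataset contains missing values, so the sketch also counts the NA values in each column with is.na() and colSums(). The variable name na_counts is illustrative.

```r
# View the first six rows of the airquality dataset
print(head(airquality))

# Summarize each variable (missing values are reported as NA's)
print(summary(airquality))

# Display rows and columns
print(dim(airquality))

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

Checking for missing values early is worthwhile because functions such as mean() return NA for a column that contains them unless na.rm = TRUE is supplied.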
