This guide is a reference to the datasets that ship with the R programming language. It describes several of the most popular ones and shows how to access and explore them within R, so that you can quickly find a suitable built-in dataset for analysis and visualization.
The R programming language comes with several built-in datasets that are useful for practicing building models, summarizing datasets, and creating visualizations.
You can find a complete list of available built-in datasets by typing the following into your R console:
library(help='datasets')
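If you would rather work with the list programmatically than read it in the help viewer, the base function data() can return the same catalogue as an object. The snippet below is a minimal sketch; the variable name ds is just an illustrative choice.

```r
# data() with a package argument returns an object whose $results
# element is a character matrix with one row per dataset
ds <- data(package = "datasets")$results

# number of entries in the catalogue
nrow(ds)

# name and short title of the first few datasets
head(ds[, c("Item", "Title")])
```

Each row holds the dataset name in the Item column and a one-line description in the Title column.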
There are over 50 built-in datasets but some of the most popular ones include:
- iris: A dataset that contains measurements on 4 different attributes (in centimeters) for 50 flowers from each of 3 different species (150 flowers in total).
- mtcars: A dataset in R that contains measurements on 11 different attributes for 32 different cars.
- airquality: A dataset that contains daily air quality measurements in New York City from May to September 1973, with 153 observations and 6 variables.
- AirPassengers: A dataset that contains the number of monthly airline passengers from 1949 to 1960.
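The size and type of each of these datasets can be checked directly in the console. This is a minimal sketch using only base R functions. Note that AirPassengers is a time series object rather than a data frame, so it is inspected with class() and length() instead of dim().

```r
# rows and columns of the three data frames
dim(iris)        # 150   5
dim(mtcars)      #  32  11
dim(airquality)  # 153   6

# AirPassengers is a monthly time series, not a data frame
class(AirPassengers)   # "ts"
length(AirPassengers)  # 144 monthly values
start(AirPassengers)   # 1949 1
end(AirPassengers)     # 1960 12
```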
The following example shows how to gain a quick understanding of any of these datasets, using the iris dataset for illustration.
Example: How to Analyze a Built-in Dataset in R
One of the easiest ways to gain a quick understanding of a built-in dataset is by using the head function, which allows you to view the first six rows of the dataset.
#view first six rows of iris dataset
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
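A few closely related base R functions can complement head(). The sketch below shows how to change the number of rows displayed, how to look at the end of the dataset with tail(), and how to get a compact overview of the column types with str().

```r
# view only the first three rows
first_three <- head(iris, n = 3)
first_three

# view the last six rows
last_six <- tail(iris)
last_six

# compact overview: one line per column with its type and first values
str(iris)
```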
You can also use the summary function to quickly summarize each variable in the dataset:
#summarize iris dataset
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
For each of the numeric variables we can see the following information:
- Min: The minimum value.
- 1st Qu: The value of the first quartile (25th percentile).
- Median: The median value.
- Mean: The mean value.
- 3rd Qu: The value of the third quartile (75th percentile).
- Max: The maximum value.
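Each of these statistics can also be computed individually with base R functions. This is a minimal sketch for the Sepal.Length column, where x is just an illustrative variable name. The values should agree with the first column of the summary() output, which is rounded for display.

```r
x <- iris$Sepal.Length

min(x)                                   # minimum
quantile(x, probs = c(0.25, 0.5, 0.75))  # quartiles: 1st, median, 3rd
mean(x)                                  # mean
max(x)                                   # maximum
```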
For the only categorical variable in the dataset (Species) we see a frequency count of each value:
- setosa: This species occurs 50 times.
- versicolor: This species occurs 50 times.
- virginica: This species occurs 50 times.
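The same frequency counts can be obtained on their own with the base function table(), which is useful when you only care about a single categorical column. A minimal sketch:

```r
# count how many times each species occurs
species_counts <- table(iris$Species)
species_counts

# the distinct categories of the factor
levels(iris$Species)
```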
You can also use the dim function to get the dimensions of the dataset in terms of number of rows and number of columns:
#display rows and columns
dim(iris)
[1] 150 5
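If you only need one of the two dimensions, or the column names, base R provides nrow(), ncol() and names(). A short sketch:

```r
nrow(iris)   # 150
ncol(iris)   # 5
names(iris)  # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
```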
We can also create some plots to visualize the values in the dataset.
For example, we can use the hist() function to create a histogram of the values for a certain variable:
#create histogram of values for sepal length
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')
This histogram allows us to visualize the distribution of values for the Sepal.Length variable.
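To see the numbers behind the histogram, hist() can be called with plot = FALSE, in which case it returns the bin boundaries and counts without drawing anything. This is a minimal sketch; the variable name h is an illustrative choice.

```r
# compute the histogram without plotting it
h <- hist(iris$Sepal.Length, plot = FALSE)

h$breaks  # boundaries of the bins
h$counts  # number of observations falling in each bin

# every observation lands in exactly one bin
sum(h$counts) == nrow(iris)
```

The number of bins can be influenced with the breaks argument of hist(), for example breaks = 20, which R treats as a suggestion rather than an exact count.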
Feel free to use each of the functions shown here to explore any of the built-in datasets in R that you’d like.
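As one way of reusing these steps, the calls can be bundled into a small helper. The function explore_dataset below is hypothetical (it is not part of R) and is only a sketch that assumes its input is a data frame; it is applied here to mtcars.

```r
# Hypothetical helper, not part of R: collects the dimensions,
# first rows and summary of a data frame in one list
explore_dataset <- function(df) {
  stopifnot(is.data.frame(df))
  list(
    dimensions = dim(df),
    first_rows = head(df),
    summary    = summary(df)
  )
}

result <- explore_dataset(mtcars)
result$dimensions  # 32 11
result$first_rows
result$summary
```

Because of the is.data.frame() check, this sketch would stop with an error for a time series such as AirPassengers, which needs to be inspected with functions suited to ts objects.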