This guide introduces the datasets that ship with R: how to list them, what a few of the most popular ones contain, and how to explore any of them with functions such as head(), summary(), dim(), and hist(). It is aimed at both beginners and experienced R users who want ready-made data for practice.
A Complete Guide to the Built-in Datasets in R
The R programming language comes with several built-in datasets that are useful for practicing building models, summarizing datasets, and creating visualizations.
You can find a complete list of available built-in datasets by typing the following into your R console:
library(help='datasets')
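The same list can also be pulled into an object so it can be counted or searched. The following is a minimal sketch, assuming the standard datasets package is installed (it ships with base R); the variable names ds and n_datasets are only illustrative.

```r
# data() with a package argument returns an object whose $results
# component is a character matrix with one row per dataset entry
# (columns include "Item" and "Title").
ds <- data(package = "datasets")$results

# Number of entries listed for the datasets package
n_datasets <- nrow(ds)
n_datasets

# Name and short description of the first few entries
head(ds[, c("Item", "Title")])
```

The exact count depends on the installed R version, so no expected output is shown here.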
There are over 50 built-in datasets, but some of the most popular ones include:
- iris: A dataset that contains measurements on 4 different attributes (in centimeters) for 150 flowers, 50 from each of 3 different species.
- mtcars: A dataset in R that contains measurements on 11 different attributes for 32 different cars.
- airquality: A dataset that contains daily air quality measurements in New York City from May to September 1973, with 153 observations and 6 variables.
- AirPassengers: A dataset that contains the number of monthly airline passengers from 1949 to 1960.
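The size of each of these datasets can be checked directly from the console. This is a short sketch rather than a required step; note that AirPassengers is stored as a time series instead of a data frame, so it is inspected with different functions.

```r
# Rows and columns of the three data frames
dim(iris)
dim(mtcars)
dim(airquality)

# AirPassengers is a time series (class "ts"), not a data frame,
# so length(), start(), and end() are more informative than dim().
class(AirPassengers)
length(AirPassengers)   # number of monthly observations
start(AirPassengers)    # first year and month
end(AirPassengers)      # last year and month

# Run ?airquality (or ?iris, ?mtcars, ...) interactively to read
# the documentation page for a dataset, including its source.
```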
The following example explains how to gain a quick understanding of any of these datasets by using the iris dataset as an example.
Example: How to Analyze a Built-in Dataset in R
One of the easiest ways to gain a quick understanding of a built-in dataset is by using the head function, which allows you to view the first six rows of the dataset.
#view first six rows of iris dataset
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
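Six rows is only the default. As a small sketch, the n argument of head() changes how many rows are returned, and tail() does the same for the end of the dataset; the names first_three and last_three are just for illustration.

```r
# view only the first three rows
first_three <- head(iris, n = 3)
first_three

# view the last three rows
last_three <- tail(iris, n = 3)
last_three
```

Looking at both ends is a quick way to notice that iris is ordered by species: the first rows are all setosa and the last rows are all virginica.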
You can also use the summary function to quickly summarize each variable in the dataset:
#summarize iris dataset
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
For each of the numeric variables we can see the following information:
- Min: The minimum value.
- 1st Qu: The value of the first quartile (25th percentile).
- Median: The median value.
- Mean: The mean value.
- 3rd Qu: The value of the third quartile (75th percentile).
- Max: The maximum value.
For the only categorical variable in the dataset (Species) we see a frequency count of each value:
- setosa: This species occurs 50 times.
- versicolor: This species occurs 50 times.
- virginica: This species occurs 50 times.
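Each of these values can also be computed on its own, which helps when only one statistic is needed. The sketch below reproduces the summary for Sepal.Length and the species counts using base R functions; x and species_counts are illustrative names.

```r
# pull out one numeric column
x <- iris$Sepal.Length

min(x)              # minimum value
quantile(x, 0.25)   # first quartile (25th percentile)
median(x)           # median value
mean(x)             # mean value
quantile(x, 0.75)   # third quartile (75th percentile)
max(x)              # maximum value

# frequency count for the categorical variable
species_counts <- table(iris$Species)
species_counts
```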
You can also use the dim function to get the dimensions of the dataset in terms of number of rows and number of columns:
#display rows and columns
dim(iris)
[1] 150 5
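If only one of the two dimensions is needed, or the column names and types matter more than the counts, a few related base R functions may be handy. This is a brief sketch, not an exhaustive list.

```r
# number of rows and number of columns, separately
nrow(iris)
ncol(iris)

# column names
names(iris)

# compact overview: type of each column plus its first few values
str(iris)
```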
We can also create some plots to visualize the values in the dataset.
For example, we can use the hist() function to create a histogram of the values for a certain variable:
#create histogram of values for sepal length
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')
This histogram allows us to visualize the distribution of values for the Sepal.Length variable.
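Other base R plotting functions work the same way. As an illustrative sketch (the titles, labels, and the bp variable are choices made here, not part of the original example), a box plot can compare sepal length across species and a scatter plot can show how two numeric variables relate:

```r
# create box plot of sepal length for each species
# (formula interface: numeric variable ~ grouping variable).
# boxplot() also returns its summary statistics invisibly; storing
# them in bp is optional and only done here for inspection.
bp <- boxplot(Sepal.Length ~ Species,
              data = iris,
              col = 'steelblue',
              main = 'Sepal Length by Species',
              xlab = 'Species',
              ylab = 'Sepal Length (cm)')

# create scatter plot of petal length vs. petal width,
# with points colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = iris$Species,
     pch = 19,
     main = 'Petal Length vs. Petal Width',
     xlab = 'Petal Length (cm)',
     ylab = 'Petal Width (cm)')
```

The box plot makes differences between groups easy to see, while the scatter plot shows whether two measurements tend to increase together.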
Feel free to use each of the functions shown here to explore any of the built-in datasets in R that you’d like.