Table of Contents
The e(sample) command is a useful tool for identifying the cases used by an estimation command. By using this command, the user can easily identify the specific cases that were used in the estimation process. This can be helpful in understanding the results of the estimation and ensuring that all relevant cases were included in the analysis. The e(sample) command can also be used to subset or manipulate the data based on the cases used in the estimation, allowing for further analysis and exploration. Overall, the e(sample) command is a valuable tool for identifying and working with specific cases in an estimation command.
How can I identify cases used by an estimation command using e(sample)? | Stata FAQ
When performing data analysis, it is very common for a given model (e.g. a
regression model), to not use all cases in the dataset. This can occur for a
number of reasons, for example because if was used to tell Stata to perform the analysis on a subset of
cases, or because some cases had missing values on some or all of the variables
in the analysis. To allow you to identify the cases used in an analysis,
most Stata estimation commands return a function that takes on a value of
one if the case was included in the analysis, and zero otherwise (for more
information see our Stata FAQ: How can I access information stored after I run a command in Stata (returned results)?). Below we show how this can be useful in
two common situations. Many more situations exist, once you’re aware of this
function and how it works, you’ll recognize them. The examples below use
two different versions of the hsb2 dataset. Both versions contain information on
200 high school
students, including their scores on a series of standardized tests, and some demographic information.
When analyzing a subset of data
In this example we run a regression model predicting student’s reading scores based on
their scores for math, and science. However, we use if to indicate that we want to
run our model on only those cases where the variable write is greater
than or equal to 50. Below we see the output for this regression. Note that 128
observations were used in the analysis, rather than the full 200, because we
restricted the sample using if.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
regress read math science if write>=50
Source | SS df MS Number of obs = 128
-------------+------------------------------ F( 2, 125) = 43.61
Model | 4595.51237 2 2297.75618 Prob > F = 0.0000
Residual | 6585.72982 125 52.6858386 R-squared = 0.4110
-------------+------------------------------ Adj R-squared = 0.4016
Total | 11181.2422 127 88.0412771 Root MSE = 7.2585
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | .4952647 .0898744 5.51 0.000 .3173921 .6731373
science | .2960287 .0942091 3.14 0.002 .1095772 .4824803
_cons | 11.7755 4.86182 2.42 0.017 2.153353 21.39764
------------------------------------------------------------------------------Once we have run our model, we can generate predicted values using the predict
command. Below generate a new variable, p1, which contains the predicted
values for each case. When we use summarize to examine the predicted values, we see that predict
that the variable p1 has 200 observations, but the model from which these
predictions was made used only 128 observations. Predicted values were generated
for both the 128 cases where write>=50 and the 72 cases where write<50 (who were
not used to estimate the model). Generally, we don’t want to use a model
estimated on one sample (in our case, observations where write>=50) on a
different sample (observations where write<50). This is particularly true in
cases like this one, where we know there is a systematic difference between the
samples.
predict p1
(option xb assumed; fitted values)
summarize p1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
p1 | 200 53.1978 6.875587 40.55244 70.23441We can use e(sample) to generate predicted values only for those cases
used to estimate the model. Below we use predict to generate a new
variable, p2, that contains predicted
values, but this time we add if e(sample)==1, which indicates that
predicted values should only be created for cases used in the last model we ran.
This time Stata tells us that we have generated 72 missing values. There are 72
cases where write<=50 in the dataset, rather than predicted values, these cases
were given missing values for p2. Summarizing the data again
predict p2 if e(sample)==1
(option xb assumed; fitted values)
(72 missing values generated)
sum p2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
p2 | 128 56.11719 6.015408 42.64159 70.23441
For model comparison
When we want to compare nested models, the models must be estimated on the
same sample in order for the comparison to be valid. When a dataset contains
missing values, adding additional predictor variables to a model often reduces
the number of cases available for a given model. In this example we fit a model
where write predicts read, and compare the fit of this model to a model that
contains math and science as well as write as predictors. We will
compare the two models using a likelihood ratio test (i.e. the command lrtest). Below we first run a
regression model where the variable read is predicted
by the variable write and store the estimates from that model as m1 using
the command estimates store m1.
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_mar, clear
regress read write
Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 1, 168) = 94.38
Model | 6188.25135 1 6188.25135 Prob > F = 0.0000
Residual | 11014.7428 168 65.563945 R-squared = 0.3597
-------------+------------------------------ Adj R-squared = 0.3559
Total | 17202.9941 169 101.792865 Root MSE = 8.0972
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .6496086 .0668652 9.72 0.000 .5176042 .7816129
_cons | 17.65687 3.589724 4.92 0.000 10.5701 24.74365
------------------------------------------------------------------------------
estimates store m1Below we run a second model where read is predicted by write, math,
and science. We store the estimates from this model as m2.
reg read write math science
Source | SS df MS Number of obs = 141
-------------+------------------------------ F( 3, 137) = 47.27
Model | 7560.153 3 2520.051 Prob > F = 0.0000
Residual | 7304.15906 137 53.3150296 R-squared = 0.5086
-------------+------------------------------ Adj R-squared = 0.4979
Total | 14864.3121 140 106.173658 Root MSE = 7.3017
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .2143165 .0915771 2.34 0.021 .0332291 .3954039
math | .3973615 .1020276 3.89 0.000 .1956088 .5991141
science | .3108218 .0905435 3.43 0.001 .1317781 .4898654
_cons | 3.851736 4.091921 0.94 0.348 -4.239757 11.94323
------------------------------------------------------------------------------
estimates store m2Now that we have estimated the two models and stored the results, we want to test whether the model that contains
write, math, and science fits significantly better
than the model that contains only write as a predictor. One way to do this is using a likelihood ratio
test, which is what is done below with the command lrtest m1 m2. However this command generates an error message.
It turns out, the models were not estimated on the same number of cases. In order for the test to be valid,
the two models must be run on the same cases, clearly this is not the case. Looking at the error message and the output from our regressions we see that the
model using only write as a predictor was run on 170 cases, while the model that contained
write, math, and
science as predictors was run on 141 cases. The only difference between these two models is the
addition of the variables
math and science, indicating that the difference in sample size for the two models is due to missing data on
math, and science.
lrtest m1 m2 observations differ: 141 vs. 170 r(498);
So how do we make sure that the two models contain the same number of cases? First,
we run the model with write, math, and science as predictors, and store the estimates
as m2. Then we use the generate command (gen) to create a new variable
called sample that is equal to the function e(sample). In other words the variable sample
is equal to one if the case was included in the last analysis (i.e. the model we just
ran) and zero otherwise.
regress read write math science
Source | SS df MS Number of obs = 141
-------------+------------------------------ F( 3, 137) = 47.27
Model | 7560.153 3 2520.051 Prob > F = 0.0000
Residual | 7304.15906 137 53.3150296 R-squared = 0.5086
-------------+------------------------------ Adj R-squared = 0.4979
Total | 14864.3121 140 106.173658 Root MSE = 7.3017
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .2143165 .0915771 2.34 0.021 .0332291 .3954039
math | .3973615 .1020276 3.89 0.000 .1956088 .5991141
science | .3108218 .0905435 3.43 0.001 .1317781 .4898654
_cons | 3.851736 4.091921 0.94 0.348 -4.239757 11.94323
------------------------------------------------------------------------------
estimates store m2
generate sample = e(sample)Now we can use the variable sample to run the model with only write as
a predictor
regress read write if sample==1
Source | SS df MS Number of obs = 141
-------------+------------------------------ F( 1, 139) = 70.12
Model | 4984.37291 1 4984.37291 Prob > F = 0.0000
Residual | 9879.93915 139 71.0786989 R-squared = 0.3353
-------------+------------------------------ Adj R-squared = 0.3305
Total | 14864.3121 140 106.173658 Root MSE = 8.4308
------------------------------------------------------------------------------
read | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .6479233 .0773728 8.37 0.000 .4949436 .800903
_cons | 17.43003 4.1396 4.21 0.000 9.245303 25.61475
------------------------------------------------------------------------------
estimates store m1Now we can use the lrtest command again, to test whether the model with write, math and
science as predictors fits significantly better than a model with just write as a predictor.
lrtest m1 m2 Likelihood-ratio test LR chi2(2) = 42.59 (Assumption: m1 nested in m2) Prob > chi2 = 0.0000
Cite this article
stats writer (2024). How can I identify cases used by an estimation command using e(sample)?. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/how-can-i-identify-cases-used-by-an-estimation-command-using-esample/
stats writer. "How can I identify cases used by an estimation command using e(sample)?." PSYCHOLOGICAL SCALES, 1 Jul. 2024, https://scales.arabpsychology.com/stats/how-can-i-identify-cases-used-by-an-estimation-command-using-esample/.
stats writer. "How can I identify cases used by an estimation command using e(sample)?." PSYCHOLOGICAL SCALES, 2024. https://scales.arabpsychology.com/stats/how-can-i-identify-cases-used-by-an-estimation-command-using-esample/.
stats writer (2024) 'How can I identify cases used by an estimation command using e(sample)?', PSYCHOLOGICAL SCALES. Available at: https://scales.arabpsychology.com/stats/how-can-i-identify-cases-used-by-an-estimation-command-using-esample/.
[1] stats writer, "How can I identify cases used by an estimation command using e(sample)?," PSYCHOLOGICAL SCALES, vol. X, no. Y, ص Z-Z, July, 2024.
stats writer. How can I identify cases used by an estimation command using e(sample)?. PSYCHOLOGICAL SCALES. 2024;vol(issue):pages.
