What are the some of the methods for analyzing clustered data in Stata?

What are the some of the methods for analyzing clustered data in Stata?

Stata is a statistical software that offers various methods for analyzing clustered data, which is data that is grouped or clustered together in some way. Some of the methods for analyzing clustered data in Stata include cluster-robust standard errors, cluster-robust variance-covariance matrix estimation, and clustered linear regression. These methods take into account the correlation among observations within the same cluster, which can lead to biased results if not properly addressed. Additionally, Stata offers tools such as intra-cluster correlation coefficients and cluster-specific estimates to further explore and understand the clustering effect on the data. These methods are useful in a variety of fields, such as economics, social sciences, and public health, where data is often naturally clustered. Using these methods in Stata allows for more accurate and reliable analysis of clustered data.

What are the some of the methods for analyzing clustered data in Stata? | Stata FAQ

This page was created to show various ways that Stata can analyze clustered data. The intent is
to show how the various cluster approaches relate to one another. It is not meant as a way to select a
particular model or cluster approach for your data. In selecting a method to be used in analyzing
clustered data the user must
think carefully about the nature of their data and the assumptions underlying each of the approaches
shown below. More examples of analyzing clustered data can be found on our webpage
Stata Library: Analyzing Correlated
Data.

The dataset we will use to illustrate the various procedures is imm23.dta that was used in the Kreft and de Leeuw
Introduction
to multilevel modeling
. This dataset has 519 students clustered in 23 schools. In each of
the models we will regress math on homework. The intraclass correlation for these
data are a hefty 0.30.

We begin by reading in the data.

use https://stats.idre.ucla.edu/stat/examples/imm/imm23, clear

The table below provides a summary of the various models shown that we tried.
Some of the models only adjust the standard error by computing
a cluster robust standard error for the coefficient. Other procedures do more complex
modeling of the multilevel structure. And there are some procedures that do various combinations of
the two.

 # model                                              coef    se coef  ss residucal  bic
 1 regress math homework                              3.126    .286      48259.9     3837.7
 2 regress math homework, cluster(schid)              3.126    .543      48259.9     3837.7
 3 svy: regress math homework                         3.126    .543      48259.9       **

 4 areg math homework, absorb(schid)                  2.361    .281      35281.8     3675.1
 5 areg math homework, absorb(schid) cluster(schid)   2.361    .650      35281.8     3668.9
 6 xtreg math homework, i(schid) fe                   2.361    .281      35281.8     3675.1
 7 xtreg math homework, i(schid) robust fe            2.361    .321      35281.8     3675.1
 
 8 xtreg math homework, i(schid) re                   2.398    .277      35510.8       **
 9 xtreg math homework, i(schid) robust re            2.398    .300      35510.8       **
 
10 xtreg math homework, i(schid) corr(exc) pa         2.383    .259        ??          **
11 xtreg math homework, i(schid) corr(exc) robust pa  2.383    .623        ??          **

12 xtgls math homework, i(schid) panels(iid)          3.126    .286      48259.9     3837.7
13 xtgls math homework, i(schid) panels(hetero)       3.536    .271      48472.9     3805.8

14 xtreg math homework, i(schid) mle                  2.402    .277      35283.3     3755.5
15 xtmixed math homework || schid:, mle               2.402    .277      35546.0     3755.5
16 xtmixed math homework || schid:, reml              2.400    .277      35525.0     3754.3

?? population averaged models do not generate residuals with predict
** bic not avaiable for this procedure

Below you will find the complete results for each of the above models.

The plain vanilla OLS
that does not account for clustering is included for two reasons: 1) to provide
comparison values for other more appropriate models and 2) because many people
still analyze clustered
data in this manner.

/* model 1 -- plain vanilla OLS */

regress math homework

      Source |       SS       df       MS              Number of obs =     519
-------------+------------------------------           F(  1,   517) =  119.43
       Model |  11148.1461     1  11148.1461           Prob > F      =  0.0000
    Residual |  48259.9001   517   93.346035           R-squared     =  0.1877
-------------+------------------------------           Adj R-squared =  0.1861
       Total |  59408.0462   518  114.687348           Root MSE      =  9.6616

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   3.126375   .2860801    10.93   0.000     2.564352    3.688397
       _cons |   45.56015   .7055719    64.57   0.000     44.17401    46.94629
------------------------------------------------------------------------------

/* model 2 -- same as svy: regress with psu */

regress math homework, cluster(schid)

Linear regression                                      Number of obs =     519
                                                       F(  1,    22) =   33.09
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1877
Number of clusters (schid) = 23                        Root MSE      =  9.6616

------------------------------------------------------------------------------
             |               Robust
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   3.126375   .5434562     5.75   0.000     1.999315    4.253434
       _cons |   45.56015   1.428639    31.89   0.000     42.59734    48.52297
------------------------------------------------------------------------------

/* model 3 -- same as OLS with cluster option */

svyset schid

      pweight: 
          VCE: linearized
     Strata 1: 
         SU 1: schid
        FPC 1: svy: regress math homework
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       519
Number of PSUs     =        23                  Population size    =       519
                                                Design df          =        22
                                                F(   1,     22)    =     33.16
                                                Prob > F           =    0.0000
                                                R-squared          =    0.1877

------------------------------------------------------------------------------
             |             Linearized
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   3.126375   .5429314     5.76   0.000     2.000404    4.252345
       _cons |   45.56015    1.42726    31.92   0.000      42.6002    48.52011
------------------------------------------------------------------------------

/* model 4 -- same as xtreg with fe option */

areg math homework, absorb(schid)

Linear regression, absorbing indicators                Number of obs =     519
                                                       F(  1,   495) =   70.82
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4061
                                                       Adj R-squared =  0.3785
                                                       Root MSE      =  8.4425

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.360971   .2805572     8.42   0.000     1.809741    2.912201
       _cons |   47.06884   .6656947    70.71   0.000      45.7609    48.37677
-------------+----------------------------------------------------------------
       schid |        F(22, 495) =      8.276   0.000          (23 categories)
       
/* model 5 */  

areg math homework, absorb(schid) cluster(schid)

Linear regression, absorbing indicators                Number of obs =     519
                                                       F(  1,    22) =   13.18
                                                       Prob > F      =  0.0015
                                                       R-squared     =  0.4061
                                                       Adj R-squared =  0.3785
                                                       Root MSE      =  8.4425

                                 (Std. Err. adjusted for 23 clusters in schid)
------------------------------------------------------------------------------
             |               Robust
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.360971   .6502249     3.63   0.001     1.012487    3.709455
       _cons |   47.06884   1.281657    36.72   0.000     44.41084    49.72683
-------------+----------------------------------------------------------------
       schid |   absorbed                                      (23 categories)

/* model 6 -- same as areg */

xtreg math homework, i(schid) fe

Fixed-effects (within) regression               Number of obs      =       519
Group variable (i): schid                       Number of groups   =        23

R-sq:  within  = 0.1252                         Obs per group: min =         5
       between = 0.1578                                        avg =      22.6
       overall = 0.1877                                        max =        67

                                                F(1,495)           =     70.82
corr(u_i, Xb)  = 0.2213                         Prob > F           =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.360971   .2805572     8.42   0.000     1.809741    2.912201
       _cons |   47.06884   .6656947    70.71   0.000      45.7609    48.37677
-------------+----------------------------------------------------------------
     sigma_u |  5.0555127
     sigma_e |  8.4425339
         rho |  .26393678   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(22, 495) =     8.28             Prob > F = 0.0000

/* model 7 */

xtreg math homework, i(schid) robust fe

Fixed-effects (within) regression               Number of obs      =       519
Group variable (i): schid                       Number of groups   =        23

R-sq:  within  = 0.1252                         Obs per group: min =         5
       between = 0.1578                                        avg =      22.6
       overall = 0.1877                                        max =        67

                                                F(1,495)           =     53.95
corr(u_i, Xb)  = 0.2213                         Prob > F           =    0.0000

------------------------------------------------------------------------------
             |               Robust
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.360971   .3214347     7.35   0.000     1.729426    2.992516
       _cons |   47.06884   .7078415    66.50   0.000     45.67809    48.45958
-------------+----------------------------------------------------------------
     sigma_u |  5.0555127
     sigma_e |  8.4425339
         rho |  .26393678   (fraction of variance due to u_i)
------------------------------------------------------------------------------

/* model 8 */

xtreg math homework, i(schid) re

Random-effects GLS regression                   Number of obs      =       519
Group variable (i): schid                       Number of groups   =        23

R-sq:  within  = 0.1252                         Obs per group: min =         5
       between = 0.1578                                        avg =      22.6
       overall = 0.1877                                        max =        67

Random effects u_i ~ Gaussian                   Wald chi2(1)       =     74.91
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |    2.39842   .2771195     8.65   0.000     1.855276    2.941564
       _cons |   46.36018    1.17714    39.38   0.000     44.05303    48.66733
-------------+----------------------------------------------------------------
     sigma_u |  4.7060369
     sigma_e |  8.4425339
         rho |  .23705881   (fraction of variance due to u_i)
------------------------------------------------------------------------------

/* model 9 */

xtreg math homework, i(schid) robust re

Random-effects GLS regression                   Number of obs      =       519
Group variable (i): schid                       Number of groups   =        23

R-sq:  within  = 0.1252                         Obs per group: min =         5
       between = 0.1578                                        avg =      22.6
       overall = 0.1877                                        max =        67

Random effects u_i ~ Gaussian                   Wald chi2(1)       =     63.70
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |               Robust
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |    2.39842   .3004959     7.98   0.000     1.809459    2.987381
       _cons |   46.36018   1.185593    39.10   0.000     44.03646     48.6839
-------------+----------------------------------------------------------------
     sigma_u |  4.7060369
     sigma_e |  8.4425339
         rho |  .23705881   (fraction of variance due to u_i)
------------------------------------------------------------------------------

/* model 10 */

xtreg math homework, i(schid) corr(exc) nolog pa

GEE population-averaged model                   Number of obs      =       519
Group variable:                      schid      Number of groups   =        23
Link:                             identity      Obs per group: min =         5
Family:                           Gaussian                     avg =      22.6
Correlation:                  exchangeable                     max =        67
                                                Wald chi2(1)       =     84.39
Scale parameter:                  94.57586      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.382626   .2593715     9.19   0.000     1.874267    2.890985
       _cons |   46.41468   1.340197    34.63   0.000     43.78794    49.04142
------------------------------------------------------------------------------

/* model 11 */

xtreg math homework, i(schid) corr(exc) robust nolog pa

GEE population-averaged model                   Number of obs      =       519
Group variable:                      schid      Number of groups   =        23
Link:                             identity      Obs per group: min =         5
Family:                           Gaussian                     avg =      22.6
Correlation:                  exchangeable                     max =        67
                                                Wald chi2(1)       =     14.61
Scale parameter:                  94.57586      Prob > chi2        =    0.0001

                                  (Std. Err. adjusted for clustering on schid)
------------------------------------------------------------------------------
             |             Semi-robust
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.382626   .6232753     3.82   0.000     1.161029    3.604223
       _cons |   46.41468   1.632429    28.43   0.000     43.21518    49.61418
------------------------------------------------------------------------------

/* model 13 -- same as regular OLS */

xtgls math homework, i(schid) panels(iid)

Cross-sectional time-series FGLS regression

Coefficients:  generalized least squares
Panels:        homoskedastic
Correlation:   no autocorrelation

Estimated covariances      =         1          Number of obs      =       519
Estimated autocorrelations =         0          Number of groups   =        23
Estimated coefficients     =         2          Obs per group: min =         5
                                                               avg =  22.56522
                                                               max =        67
                                                Wald chi2(1)       =    119.89
Log likelihood             =   -1912.6          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   3.126375   .2855283    10.95   0.000     2.566749       3.686
       _cons |   45.56015   .7042111    64.70   0.000     44.17992    46.94038
------------------------------------------------------------------------------

/* model 13 */

xtgls math homework, i(schid) panels(hetero)

Cross-sectional time-series FGLS regression

Coefficients:  generalized least squares
Panels:        heteroskedastic
Correlation:   no autocorrelation

Estimated covariances      =        23          Number of obs      =       519
Estimated autocorrelations =         0          Number of groups   =        23
Estimated coefficients     =         2          Obs per group: min =         5
                                                               avg =  22.56522
                                                               max =        67
                                                Wald chi2(1)       =    170.34
Log likelihood             = -1896.654          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   3.535692   .2709065    13.05   0.000     3.004725    4.066659
       _cons |   44.95865   .6672198    67.38   0.000     43.65092    46.26638
------------------------------------------------------------------------------

/* model 14 -- same as xtmixed with mle option */

xtreg math homework, i(schid) mle nolog

Random-effects ML regression                    Number of obs      =       519
Group variable (i): schid                       Number of groups   =        23

Random effects u_i ~ Gaussian                   Obs per group: min =         5
                                                               avg =      22.6
                                                               max =        67

                                                LR chi2(1)         =     70.28
Log likelihood  =  -1865.247                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.401972   .2771764     8.67   0.000     1.858717    2.945228
       _cons |   46.34945   1.142005    40.59   0.000     44.11117    48.58774
-------------+----------------------------------------------------------------
    /sigma_u |   4.497317   .7861614                      3.192697    6.335039
    /sigma_e |   8.434685   .2677724                      7.925855    8.976181
         rho |   .2213627   .0616007                      .1202136    .3589456
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=   94.71 Prob>=chibar2 = 0.000

/* model 15 -- same xtreg with mle option */

xtmixed math homework || schid:, mle nolog

Mixed-effects ML regression                     Number of obs      =       519
Group variable: schid                           Number of groups   =        23

                                                Obs per group: min =         5
                                                               avg =      22.6
                                                               max =        67


                                                Wald chi2(1)       =     75.32
Log likelihood =  -1865.247                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.401972   .2767745     8.68   0.000     1.859504     2.94444
       _cons |   46.34945   1.141154    40.62   0.000     44.11283    48.58607
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
schid: Identity              |
                   sd(_cons) |   4.497318   .7861623      3.192697    6.335042
-----------------------------+------------------------------------------------
                sd(Residual) |   8.434685   .2677724      7.925855    8.976181
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    94.71 Prob >= chibar2 = 0.0000

/* model 16 */

xtmixed math homework || schid:, nolog

Mixed-effects REML regression                   Number of obs      =       519
Group variable: schid                           Number of groups   =        23

                                                Obs per group: min =         5
                                                               avg =      22.6
                                                               max =        67


                                                Wald chi2(1)       =     74.95
Log restricted-likelihood = -1864.6598          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    homework |   2.399867   .2771976     8.66   0.000     1.856569    2.943164
       _cons |   46.35575   1.162776    39.87   0.000     44.07676    48.63475
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
schid: Identity              |
                   sd(_cons) |   4.619694   .8198156      3.262557    6.541363
-----------------------------+------------------------------------------------
                sd(Residual) |    8.44297   .2682979      7.933157    8.985545
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    96.43 Prob >= chibar2 = 0.0000

Cite this article

stats writer (2024). What are the some of the methods for analyzing clustered data in Stata?. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/what-are-the-some-of-the-methods-for-analyzing-clustered-data-in-stata/

stats writer. "What are the some of the methods for analyzing clustered data in Stata?." PSYCHOLOGICAL SCALES, 1 Jul. 2024, https://scales.arabpsychology.com/stats/what-are-the-some-of-the-methods-for-analyzing-clustered-data-in-stata/.

stats writer. "What are the some of the methods for analyzing clustered data in Stata?." PSYCHOLOGICAL SCALES, 2024. https://scales.arabpsychology.com/stats/what-are-the-some-of-the-methods-for-analyzing-clustered-data-in-stata/.

stats writer (2024) 'What are the some of the methods for analyzing clustered data in Stata?', PSYCHOLOGICAL SCALES. Available at: https://scales.arabpsychology.com/stats/what-are-the-some-of-the-methods-for-analyzing-clustered-data-in-stata/.

[1] stats writer, "What are the some of the methods for analyzing clustered data in Stata?," PSYCHOLOGICAL SCALES, vol. X, no. Y, ص Z-Z, July, 2024.

stats writer. What are the some of the methods for analyzing clustered data in Stata?. PSYCHOLOGICAL SCALES. 2024;vol(issue):pages.

Download Post (.PDF)
Slide Up
x
PDF
Scroll to Top