Threat to validity of regression analysis – Omitted Variables Bias

Most readers of this blog will be familiar with the ordinary least squares (OLS) estimator and regression models. Let us talk about one source of bias and inconsistency in these estimates, which is especially important when we think about the causal relationship of interest, i.e., the relationship being studied. Proving causality is difficult, but if we assess our econometric analysis against the different threats to the validity of statistical inference, we can be more confident in our analysis, results, and inferences.

We shall be discussing omitted variables bias. This is bias that occurs when

  • the regressor X is correlated with an omitted variable Z, and
  • the omitted variable Z is a determinant of the dependent variable Y.

When both of these conditions hold, the Gauss-Markov assumption of ordinary least squares regression is violated, namely the assumption that the error term is uncorrelated with the regressors:

E(u | X) = 0

where u denotes the error term and X the regressors. Simply put, this bias occurs when an econometric model leaves out one or more relevant variables, and the model then attributes the effect of the missing variable(s) to the variables that are included.

A simple illustration of this bias uses the dependent variable Salary, the regressor Education, and the omitted variable Ability. Here, Salary is the annual salary of the individuals in our sample. Education can be years of education, test scores, or any other measure of schooling. Ability is a variable that signifies talent, skill, or proficiency in general; we can also think of Ability as unmeasurable.

Let us think about the bias induced when we omit the Ability variable as a regressor, either by mistake or because we cannot measure it, while the true data generating process includes Ability:

Salary = β0 + β1 Education + β2 Ability + u

In this case, Ability has some impact on both Salary and Education, so Ability is correlated with Education. Its direct effect on Salary could be captured by including it in the regression, but we have no way to quantify Ability. If we leave it out, its effect on Salary is instead picked up indirectly through the Education variable.

If we do not use the true data generating process and instead estimate

Salary = β0 + β1 Education + v,   where v now absorbs β2 Ability + u,

then we run into the problem of omitted variable bias. Our ordinary least squares estimate of the coefficient on Education (denoted by β1) is biased and inconsistent. The bias cannot be removed by increasing the sample size, because omitted variable bias prevents the OLS estimator from converging in probability to the true parameter value. The strength and direction of the bias are determined by the correlation between the error term and the regressor.
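A quick simulation makes the bias visible. The data generating process below follows the Salary/Education/Ability story, with a true Education coefficient of 3; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical DGP: Ability drives both Education and Salary
ability = rng.normal(size=n)
education = 12 + 2 * ability + rng.normal(size=n)               # X correlated with Z
salary = 20 + 3 * education + 5 * ability + rng.normal(size=n)  # true beta_1 = 3

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with a constant)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Short regression omits Ability: the estimate is biased upward
b_short = ols_slope(education, salary)

# Long regression includes Ability: the estimate is close to the true 3
X_full = np.column_stack([np.ones(n), education, ability])
b_long = np.linalg.lstsq(X_full, salary, rcond=None)[0][1]

print(round(b_short, 2), round(b_long, 2))  # roughly 5 vs. roughly 3
```

The short-regression slope converges to β1 + β2·Cov(Education, Ability)/Var(Education), which is why more data does not help.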

Now that we know exactly what the issue of Omitted Variable Bias is, let us consider some solutions.

One answer to this issue is to include more variables in the regression model. The model then uses as independent variables not only those whose effects on the dependent variable are of interest, but also any variables that might otherwise cause omitted variables bias. Including these additional variables reduces the risk of omitted variables bias, but it may increase the variance of the estimator.

Some general guidelines that help us decide whether to include additional variables are:

  • Specify the coefficient of interest.
  • Based on your knowledge of the variables and model, identify possible sources of omitted variables bias. This gives you a baseline specification and a set of additional regressors, sometimes called control variables.
  • Use different model specifications and test against your baseline.
  • Use tables to provide full disclosure of your results – by presenting different model specifications, you can support your argument and enable readers to see the impact of including other regressors.

If reducing the bias by including additional variables is not possible, for instance when there are no adequate control variables, a variety of approaches can still help:

  • Making use of panel data methods.
  • Making use of instrumental variables regression methods such as Two Stage Least Squares.
  • Making use of a randomized control experiment.

These approaches are important to consider because they help us to avoid false inferences of causality due to the presence of another underlying variable, the omitted variable, that influences both the independent and dependent variables.

Wooldridge, Jeffrey M. (2009). “Omitted Variable Bias: The Simple Case”. Introductory Econometrics: A Modern Approach. Mason, OH: Cengage Learning. ISBN 9780324660548.
Greene, W. H. (1993). Econometric Analysis (2nd ed.). Macmillan.

The black box and econometrics.

Note: Picture was taken from Aravindan (2019)

Many of the most popular models used in data analysis rely on the so-called "Black Box" approach. In its simplest interpretation, a black box is judged by the inputs it receives and the outputs it delivers, that is, by its predictive power, rather than by the mechanism inside.

If econometrics is about estimating population parameters and providing causal inference about them, the black box approach of data analysis is in some sense the opposite. We only care about responses and predicted responses in order to discriminate across models given a certain amount of data (an observable sample). We contrast predictions with actual values, derive measures of the error, and select the model that best explains the response variable, considering, of course, the tradeoff between variance and bias.

An article by Mullainathan & Spiess (2017) in the Journal of Economic Perspectives gives a short description of supervised and unsupervised machine learning approaches. The out-of-sample performance of these methods is potentially greater than that of least squares. See the next table, taken from the article:

Source: Mullainathan & Spiess (2017, 90) Note: The dependent variable is the log-dollar house value of owner-occupied units in the 2011 American Housing Survey from 150 covariates including unit characteristics and quality measures. All algorithms are fitted on the same, randomly drawn training sample of 10,000 units and evaluated on the 41,808 remaining held-out units. The numbers in brackets in the hold-out sample column are 95 percent bootstrap confidence intervals for hold-out prediction performance, and represent measurement variation for a fixed prediction function. For this illustration, we do not use sampling weights. Details are provided in the online Appendix at

In this exercise, a training sample and a test sample were used to calculate the "prediction performance" given by the R2. In econometrics we would call this the goodness of fit of the model, or the share of variation linearly explained by the model. When the goodness of fit increases, we have higher prediction power (although we would never actually obtain an R2 of 1 unless we have some overfitting issue).

Comparing the Table 1 results in the hold-out sample column, you can find that other approaches may outperform least-squares regression in terms of prediction power. One example is the row corresponding to the LASSO estimates, which shows an increased prediction performance compared to least squares. The LASSO model is therefore capturing the behavior of the response variable somewhat better (at least for this sample).

At this point one should ask what the objective of the analysis is. If we are going for statistical inference and the estimation of population parameters, we should stick to the non-black-box approaches, such as traditional LS, GMM, or 2SLS. But if we are more interested in prediction power and performance, the black box approaches will surely come in handy and may sometimes outperform the econometric procedures used to estimate population parameters. The way I see it, the black box, even when its inner details are unknown to us, has the ability to adapt itself to the data (considering, of course, the variety of machine learning methods and algorithms, not just penalized regression).

As the authors express in their article, it could be tempting to draw conclusions from these methods as we usually do in econometrics, but first we need to consider some limitations of the black box approaches: 1) sometimes spurious correlation steps in; 2) the production of standard errors becomes harder; 3) some of the methods are inconsistent if we change the initial conditions; and 4) there is a risk of choosing incorrect models, which may induce omitted variable bias.

However, even with the above problems, there are useful connections between the black box approaches and econometric methods. The advantage of machine learning over traditional econometric estimation may be greatest in the context of large samples, in which the researcher needs to identify a set of influential covariates to specify or test a theory. Even for policymakers it can be a useful tool alongside econometric analysis, providing the economist "a tool to infer actual from reported" values and to proceed with comparisons given the researcher's samples.

We can also exploit prediction power when estimating population parameters. As the authors point out, consider two-stage least squares: in the first stage we are required to predict the endogenous regressor using an instrument, and a black box approach may deliver better first-stage predictions to include in the second-stage regression. It should be noted, however, that the instruments selected must be at least reasonably exogenous; if we leave the black box alone, it will simply pick up correlations and possibly bring up reverse causality problems.

Supervised or unsupervised machine learning methods may thus provide a better understanding from a different angle, the "black box" approach. Even when it is not exactly part of causal analysis, it may be useful for selecting possible covariates of a phenomenon; the rational analysis and the selected outcome should always be scrutinized in order to provide the best inference. From this perspective, even when we do not know exactly what happens inside the box, the outcome of the black box gives us useful information.

This topic is under constant review and enhancement for real-world applications. I believe the bridge between the black box approaches of machine learning and econometric theory will only grow stronger over time, considering, of course, the information needs of a growing society.


Aravindan, G. (2019). Challenges of AI-based adoptions: Simplified. Sogetilabs. Retrieved from:

Mullainathan, S. & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87–106.

Rudin, C. & Radin, J. (2019). Why Are We Using Black Box Models in AI When We Don't Need To? A Lesson From An Explainable AI Competition. Retrieved from:

Estimating long-run coefficients from an ARDL model

Whether if we’re working with Time Series Data or Panel Data, most of the times we want to follow the analysis of the long-run behavior and the short-run dynamics. An interesting but well-known model that enable us for such approach is the Auto-Regressive Distributed Lag model which stands as ARDL. There are a lot of implications regarding the form of the ARDL, maybe some re-parametrizations, maybe some conditional cointegration forms, or fully cointegration equations derived from the ARDL. In this article we’re going to describe how to calculate the long-run coefficient of an ARDL model either for time series or panel data.

Consider the basic Autoregressive Distributed Lag model with an exogenous variable, which is of the form:

y_t = ß1 y_{t-1} + … + ß_p y_{t-p} + α1 x_t + α2 x_{t-1} + … + α_{l+1} x_{t-l} + u_t

where y represents the dependent variable, p is the autoregressive order of the ARDL, directly associated with the lags of y (the dependent variable), x is an exogenous explanatory variable with l lags (a contemporaneous value of x can also be included), and u is the residual term.

This form of the ARDL is not actually a long-run form; it is more of a short-run model. The actual impact of x through the α's must therefore be assessed considering the size and order of the autoregressive terms in y through the ß's. This leads us to weigh the cumulative impact of the α's by using a long-run multiplier. Blackburne & Frank (2007) indicate that this long-run multiplier involves a non-linear transformation of the coefficients, given in general form by:

θ = (α1 + α2 + … + α_{l+1}) / (1 − ß1 − ß2 − … − ß_p)

where θ is the long-run multiplier of the variable x. Note how this formula works: it takes the sum of the α coefficients associated with the independent variable (and its lags) divided by 1 minus the sum of the autoregressive ß coefficients. The numerator corresponds to the Long-Run Propensity of x towards y, which is simply the sum of the coefficients: given a permanent change of one unit in x, this sum is the long-run impact on y. The denominator weights this by the response of the autoregressive structure.

This means that, for example, an ARDL(2,2) refers to a model with two lags of the dependent variable and two lags of the independent variable (plus the contemporaneous value of x). This model has the form:

y_t = ß1 y_{t-1} + ß2 y_{t-2} + α1 x_t + α2 x_{t-1} + α3 x_{t-2} + u_t

And the weighted long-run multiplier is given by:

θ = (α1 + α2 + α3) / (1 − ß1 − ß2)

where α runs from 1 up to 3: α1 is the coefficient on the contemporaneous value of x, α2 the coefficient on lag 1, and α3 the coefficient on lag 2. Notice that we subtract the sum of the autoregressive parameters ß from unity to weight the size of the cumulative impact of x.

The interpretation of the long-run coefficient is as follows: if x changes permanently by one unit in levels, the average/expected long-run change in y is given by the long-run coefficient.
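The arithmetic of the long-run propensity and the long-run multiplier can be sketched with hypothetical ARDL(2,2) coefficients (all values invented purely for illustration):

```python
# Hypothetical ARDL(2,2) coefficients, for illustration only
alpha = [0.30, 0.20, 0.10]   # coefficients on x_t, x_{t-1}, x_{t-2}
beta = [0.40, 0.20]          # coefficients on y_{t-1}, y_{t-2}

# Long-run propensity: the plain sum of the x coefficients
long_run_propensity = sum(alpha)

# Long-run multiplier: the propensity weighted by the autoregressive structure
long_run_multiplier = sum(alpha) / (1 - sum(beta))

print(long_run_propensity, long_run_multiplier)  # 0.6 and 1.5 (up to floating point)
```

With these made-up numbers, a permanent one-unit increase in x raises y by 0.6 ignoring the dynamics, but by 1.5 once the autoregressive feedback through the lags of y is accounted for.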

Let’s put this together with an example in Stata.

Load the database and generate a time identification variable with:

generate t = _n

Then tell Stata that you are working with yearly time series:

tsset t, yearly

Now let’s estimate an ARDL (2,2) model using the variables of price and weight, where the price is the dependent variable and weight is the independent variable (all assumed to be stationary variables).

reg price L.price L2.price weight L1.weight L2.weight

From here you can analyze several things; for example, the long-run propensity is given by:

** Long-run propensity of x (weight)
display _b[weight] +_b[L1.weight]+_b[L2.weight]

And the long-run multiplier which we discussed, can be calculated by:

** Long-run multiplier of x
display (_b[weight] +_b[L1.weight]+_b[L2.weight]) / (1-(_b[L1.price] + _b[L2.price]))

From here you can even estimate the long-run coefficient together with its statistical significance by using nlcom:

nlcom (_b[weight] +_b[L1.weight]+_b[L2.weight]) / (1-(_b[L1.price] + _b[L2.price]))

Notice that when weight increases by one unit, the expected long-run change in price is about 1.68 units, statistically significant at the 10% level.

You can extend this analysis to the well-known long-run and short-run dynamics of the Engle & Granger cointegration framework, where you compute the short-run coefficients in order to obtain the long-run ones; this will be covered in a future post.

An excellent video to help you grasp this idea can be found in Nyboe Tabor (2016).


Blackburne, E. F. & Frank, M. W. (2007). Estimation of nonstationary heterogeneous panels. The Stata Journal, 7(2), 197–208.

Nyboe Tabor, M. (2016). The ADL Model for Stationary Time Series: Long-run Multipliers and the Long-run Solution. Retrieved from:

Panel Data Nonlinear Simultaneous Equation Models with Two-Stage Least Squares using Stata

In this article, we follow Wooldridge's (2002) procedure to estimate a set of equations with nonlinear functional forms for panel data using the two-stage least squares estimator. It has to be mentioned that this topic is quite uncommon in applied econometrics, because instrumenting the nonlinear terms can be somewhat complicated.

Assume a two-equation system of the form:

y1 = γ1 y2 + γ2 y2² + Z1 δ1 + u1
y2 = γ3 y1 + Z2 δ2 + u2

where the y's represent the endogenous variables, the Z's represent the exogenous variables taken as instruments, and the u's are the residuals of each equation. Notice that y2 enters the first equation in quadratic form but appears only in linear terms in the second equation.

Wooldridge calls this model nonlinear in an endogenous variable; the model is still linear in the parameters γ, which makes this a particular problem where we need to somehow instrument the quadratic term of y2.

Finding instruments for the quadratic term is even more of a challenge than it already is for linear terms in a simple instrumental variables regression. Wooldridge suggests the following:

“A general approach is to always use some squares and cross products of the exogenous variables appearing somewhere in the system. If something like exper2 appears in the system, additional terms such as exper3 and exper4 would be added to the instrument list.” (Wooldridge, 2002, p. 235).

Therefore, it is worth trying nonlinear terms of the exogenous variables in Z, such as Z² or even Z³, and using these as instruments to deal with the endogeneity of the quadratic term of y2. Once we define our set of instruments, any such nonlinear equation can be estimated with two-stage least squares. And as always, we should check the overidentifying restrictions to make sure we avoid inconsistent estimates.
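The suggestion quoted above can be sketched on simulated data. The system, coefficients, and instrument powers below are all invented for illustration; the point is only that squares and cubes of an exogenous z can instrument both y2 and y2²:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Exogenous variable z and structural errors; u is correlated with v,
# which makes y2 endogenous in the first equation
z = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n)

# Reduced form for the endogenous regressor, and the structural equation
y2 = 1.0 + z + 0.5 * z**2 + v
y1 = 2.0 + 1.5 * y2 - 0.5 * y2**2 + u

# Endogenous design matrix; instrument set adds powers of z,
# following the idea of using squares/cubes of the exogenous variables
X = np.column_stack([np.ones(n), y2, y2**2])
W = np.column_stack([np.ones(n), z, z**2, z**3])

# Two-stage least squares: project X onto the instruments, then run OLS
Xhat = W @ np.linalg.lstsq(W, X, rcond=None)[0]       # first-stage fitted values
beta_2sls = np.linalg.lstsq(Xhat, y1, rcond=None)[0]  # second stage

# Naive OLS for comparison (inconsistent because y2 is endogenous)
beta_ols = np.linalg.lstsq(X, y1, rcond=None)[0]

print("2SLS:", beta_2sls.round(2))  # close to the true (2.0, 1.5, -0.5)
print("OLS :", beta_ols.round(2))
```

With four instruments for three parameters the system is overidentified, which is exactly the situation where the overidentification test mentioned above becomes relevant.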

The process with an example.

Let’s work with the Example of a nonlinear labor supply function. Which is a system of the form:

A brief description of the model: in the first equation, hours (worked) is a nonlinear function of the wage, and also depends on the level of education (educ), age (age), the children's situation by age, i.e., whether they are younger than 6 or between 6 and 18 (kidslt6 and kidsge6), and the household's other income (nwifeinc).

In the second equation, the wage is a function of education (educ) and a nonlinear function of the exogenous variable experience (exper and exper2).

We work under the natural assumption that E(u|Z) = 0, so the instruments are exogenous. Z in this case contains all the other variables that are not endogenous (hours and wage are the endogenous variables).

We will instrument the quadratic term of the logarithm of the wage in the first equation, and for this instrumenting process we add three new quadratic terms:

educ², age², nwifeinc²

and we include those in the first-stage regression.

With Stata we first load the dataset which can be found here.

Load up the data (double click the file with Stata open or use some path command to get it ready)

use MROZ.dta

Generate the squared term for the logarithm of the wage with:

gen lwage_sq = lwage*lwage

Then we use the following ivregress command, which we will explain in detail:

ivregress 2sls hours educ age kidslt6 kidsge6 nwifeinc (lwage lwage_sq  = educ c.educ#c.educ exper expersq age c.age#c.age kidsge6 kidslt6 nwifeinc c.nwifeinc#c.nwifeinc), first

This has the following interpretation according to Stata's syntax. First, we specify the first equation with its associated exogenous variables:

ivregress 2sls hours educ age kidslt6 kidsge6 nwifeinc

Now, let’s tell to Stata that we have two other endogenous regressors, which are the wage and the squared term of the wages. We open the bracket and put

(lwage lwage_sq  =

This tells Stata that lwage and lwage_sq are endogenous in the hours equation. After the equal sign, we specify ALL the exogenous variables, including the instruments for the endogenous terms, which leads to the second part:

(lwage lwage_sq  = educ c.educ#c.educ exper expersq age c.age#c.age kidsge6 kidslt6 nwifeinc c.nwifeinc#c.nwifeinc)

Notice that this second part uses the c.var#c.var structure; this is Stata's factor-variable operator for interacting continuous variables (here we induce the quadratic terms without generating new variables, as we did for the wage).

So notice we have c.educ#c.educ, which is the square of the educ variable, c.age#c.age, the square of age, and c.nwifeinc#c.nwifeinc, the square of nwifeinc. These are the instruments for the quadratic term.

The fact that we have two variables on the left (lwage and lwage_sq) means that the same set of instruments is used in a first-stage equation for lwage and another for lwage_sq.

We include the option , first to display the first-stage regressions.

The command displays the output of the first-stage regressions, followed by the output of the two-stage (2SLS) equation.

This yields coefficients identical to those in Wooldridge's book (2002, p. 236), with slight differences in the standard errors (which do not change the interpretation of the statistical significance of the estimators).

In this way, we instrumented both endogenous regressors, lwage and lwage_sq, which enter the model in a nonlinear relationship.

As we can see, the quadratic term is not statistically significant to explain the hours worked.

Finally, we need to make sure the overidentifying restrictions are valid, so after the regression we use:

estat overid

Given this result, we cannot reject the null that the overidentifying restrictions are valid.


Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Wooldridge Serial Correlation Test for Panel Data using Stata.

In this article, we follow Drukker's (2003) procedure to implement the first-order serial correlation test proposed by Jeff Wooldridge (2002) for panel data. This test is considered robust, since it works with fewer assumptions on the behavior of the heterogeneous individual effects.

We start with the linear model:

y_it = X_it β1 + Z_i β2 + µ_i + ε_it   (1)

where y is the dependent variable, X_it is a (1×K) vector of exogenous time-varying covariates, Z_i is a vector of time-invariant covariates, and µ_i is the individual effect of each individual. Special importance is attached to the correlation between X and µ: if this correlation is zero, the random-effects model is appropriate; if X and µ are correlated, it is better to stick with fixed effects.

The fixed- and random-effects estimators rely on the absence of serial correlation. Wooldridge therefore works with the residuals of regression (1) estimated in first differences:

Δy_it = ΔX_it β1 + Δε_it

Notice that differencing eliminates the individual effects contained in µ (as well as the time-invariant covariates Z): because the individual effects do not vary over time, they drop out when we look at changes.

Once we have the regression in first differences (with the individual-level effects eliminated), we take the predicted residuals of the first-difference regression and check the correlation between the residual and its first lag. If the original errors are serially uncorrelated, this correlation should equal −0.5:

Corr(Δε_it, Δε_i,t−1) = −0.5

This follows because Cov(Δε_it, Δε_i,t−1) = −Var(ε_it), while Var(Δε_it) = 2 Var(ε_it). Therefore, if the estimated correlation is close to −0.5, the original model in (1) has no first-order serial correlation; if it differs significantly, we have a first-order serial correlation problem in the original model in (1).

In all of these regressions we account for within-panel correlation, so every step uses cluster-robust standard errors at the panel level, and we omit the constant term in the difference equation. In sum, we:

  1. Specify our model (whether with fixed or random effects; either way the individual effects should be time-invariant).
  2. Create the difference model (first-differencing all the variables, so the difference model has no individual effects). We run this regression clustering by individual and omitting the constant term.
  3. Predict the residuals of the difference model.
  4. Regress the predicted residual on its first lag, again clustering and omitting the constant.
  5. Test the hypothesis that the coefficient on the lagged residual equals −0.5.
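The five steps can also be sketched outside Stata. The panel below is simulated with serially uncorrelated errors, so the coefficient on the lagged residual should land near −0.5 (a minimal sketch on made-up data, not Drukker's dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
N, T = 500, 10   # panels and time periods (toy example)
beta = 2.0

# Panel with individual effects mu_i and serially uncorrelated errors e_it
x = rng.normal(size=(N, T))
mu = rng.normal(size=(N, 1))
e = rng.normal(size=(N, T))
y = mu + beta * x + e

# Step 2: first-difference the model (the individual effects drop out)
dy, dx = np.diff(y, axis=1), np.diff(x, axis=1)

# Pooled OLS of dy on dx without a constant, then residuals (step 3)
b = (dx * dy).sum() / (dx * dx).sum()
du = dy - b * dx

# Step 4: regress the residual on its first lag within each panel
u_t, u_lag = du[:, 1:].ravel(), du[:, :-1].ravel()
rho = (u_lag * u_t).sum() / (u_lag * u_lag).sum()

print(round(rho, 3))  # close to -0.5 when e_it has no serial correlation
```

If the underlying errors were serially correlated instead, this coefficient would drift away from −0.5, which is exactly what the hypothesis test in step 5 picks up.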

Let’s do a quick example of these steps using the same example as Drukker.

We start loading the database.


Then we declare the panel structure to Stata with:

xtset idcode year

Then we generate some quadratic variables.

gen age2 = age^2
gen tenure2 = tenure^2

We estimate our model of the form:

xtreg ln_wage age* ttl_exp tenure* south, fe

It doesn’t matter whether if it is fixed or random effects as long as we assume that individuals’ effects are time invariant (therefore they get eliminated in the first difference model).

Now let’s do the manual estimation of the test. In order to do this, we use a pooled regression of the model without the constant and clustering the regression for the panel variable. This is done of the form:

reg d.ln_wage d.age* d.ttl_exp d.tenure* d.south, noconst cluster(idcode)

The noconst option eliminates the constant term in the difference model, the cluster option applies clustered standard errors, and idcode is the panel variable that identifies the individuals in the panel.

The next thing to do is predict the residuals of the last pooled difference regression, and we do this with:

predict u, res

Then we regress the predicted residual u on its first lag, again clustering and omitting the constant as before.

reg u L.u, noconst cluster(idcode)

Finally, we test whether the coefficient on the first lag of the residual from the pooled difference equation equals −0.5:

test L.u==-0.5

According to the results, we strongly reject the null hypothesis of no serial correlation at the 5% level of significance. Therefore, the model has first-order serial correlation problems.

We can also perform the test with Drukker's user-written Stata command, which is somewhat faster:

xtserial ln_wage age* ttl_exp tenure* south, output

and we’ll have the same results. However, the advantage of the manual procedure of the test is that it can be done for any kind of model or regression.


Drukker, D. M. (2003). Testing for serial correlation in linear panel-data models. The Stata Journal, 3(2), 168–177. Taken from:

Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

HAC robust standard errors.

While we’re using the time series datasets, often we’re highly likely to find serial correlation and heteroskedasticity in our data. These cases increase the chances to obtain serially correlated errors with non-constant variance.

If we’re purely interested in statistical inferences, we should go for the HAC robust standard errors under the Time Series context. This name as Woolridge appoints refers to:

“In the time series literature, the serial correlation–robust standard errors are sometimes called heteroskedasticity and autocorrelation consistent, or HAC, standard errors.” (Wooldridge, 2013, p. 432).

Note that HAC standard errors (also called HAC estimators) derive from the work of Newey & West (1987), whose objective was to build a robust approach to the usual time series problems of serial correlation and heteroskedasticity.

What’s the idea behind these standard errors? Well, we can summarize it as:

  1. We do not know the form of the serial correlation.
  2. Works for arbitrary forms of serial correlation and the autocorrelation structure can be derived from the sample size.
  3. With larger samples, we can be flexible in the amount of serial correlation.

This means that even though the robust standard errors are consistent in the presence of serial correlation and heteroskedasticity, we still need to choose the lag structure for the autocorrelation. Again, Wooldridge helps us decide on a simple basis:

Annual data: 1–2 lags. Quarterly data: 4–8 lags. Monthly data: 12–24 lags.

Let’s dig into some formulas to understand the relationship between HAC and OLS.

First, Newey & West standard errors work with the ordinary least squares estimator:

β̂ = (X'X)⁻¹ X'Y

where X is the matrix of independent variables and Y is the vector of the dependent variable. The Newey-West point estimates therefore do not differ from the OLS estimates; only the standard errors change.

Second, Newey & West standard errors modify the role of the estimated variances to include White’s robust approach to heteroskedasticity and also the serial correlation structure.

Consider that the variance of the OLS estimator is given by:

Var(β̂) = (X'X)⁻¹ X'ΩX (X'X)⁻¹

where Ω is the diagonal matrix containing the distinct error variances (a representation of heteroskedasticity).

Now, White's robust estimator is defined by:

Var_White(β̂) = (X'X)⁻¹ ( Σ_t e_t² x_t'x_t ) (X'X)⁻¹

where e_t is the estimated residual for time period t and x_t is the t-th row of the matrix of independent variables. We can think of this as the robust estimate with 0 lags (since it only handles heteroskedasticity).

Now here’s where Newey & West extended the White estimator to include the arbitrary forms of serial correlation with a m-lag structure:

As it is visible, the HAC estimates of the variance now include the heteroskedasticity and a m-lag consistent estimate. K represents the number of independent variables, t the time periods and x is the row of matrix of independent variables observed at time t.

With this, it is clearer how to work within the frame of:

Annual data: m = 1 or 2. Quarterly data: m = 4 up to 8. Monthly data: m = 12 up to 24.

Let’s see an example:

generate t = _n
tsset t
regress price weight displ, vce(robust)

Up to this point, these are the White standard errors robust to heteroskedasticity. Now let's estimate the HAC estimator with its equivalent, which is 0 lags:

newey price weight displ, lag(0)

As you can see, the results match White's robust standard errors exactly. Now let's use the HAC structure with 2 lags:

newey price weight displ, lag(2)

Notice that the standard errors of the independent variables have changed with this estimation.

I would recommend always providing HAC standard error estimates, in order to obtain comparable estimates and correct inferences.

As a last remark, Greene (2012) states that a usual practice is to select the integer part of T^(1/4), where T is the number of time periods. For our case, with 74 periods of annual data, it would be

display (74)^(1/4)

Stata displays approximately 2.93, so our lag candidates would be 3 or 2 (with no specific criterion to select one over the other).
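To see how the Newey-West correction works in practice outside Stata, here is a minimal sketch of the estimator on simulated data (the series, coefficients, and lag choice are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
T, m = 74, 2   # 74 periods as in the example; m = 2 lags for annual data

# Toy regression with AR(1) errors, so serial correlation is present
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(T), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS point estimates (unchanged by HAC)
u = y - X @ beta                           # residuals

# "Meat" of the sandwich: White term plus Bartlett-weighted lag terms
S = (u[:, None] * X).T @ (u[:, None] * X)  # lag-0 part: sum of u_t^2 x_t x_t'
for l in range(1, m + 1):
    w = 1 - l / (m + 1)                    # Bartlett kernel weight
    G = (u[l:, None] * X[l:]).T @ (u[:-l, None] * X[:-l])
    S += w * (G + G.T)

XtX_inv = np.linalg.inv(X.T @ X)
V = XtX_inv @ S @ XtX_inv                  # sandwich covariance matrix
se_hac = np.sqrt(np.diag(V))

print(beta.round(3), se_hac.round(3))
```

Setting m = 0 collapses the loop and recovers the White estimator, mirroring the equivalence between newey with lag(0) and regress with vce(robust).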


Greene, W. H. (2012). Econometric Analysis (7th ed.), section 20.5.2, p. 960.

Newey, W. K., and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703–708.

Wooldridge, J. 2013. Introductory Econometrics: A Modern Approach, Fifth Edition. South-Western CENGAGE Learning.

The Rise of Behavioral Econometrics.

The lessons from behavioral economics have improved social wellbeing and economic outcomes in recent years. Academics and policymakers now recognize that integrating how individuals behave and make decisions in real life dramatically improves the effectiveness of public policies and the validity of simple theoretical models. This area of research has thus enhanced our understanding of the barriers to decision-making and led to the emergence of a wider and richer theoretical and empirical framework to inform human decision making.

This framework builds on fields such as sociology, anthropology, psychology, economics, and political science. Two of the last four Nobel Prizes in Economics (2017 and 2019) have been awarded to behavioral and experimental economists, working also on development-related problems. The wider results from this body of work have been used by academics, governments, and international organizations to design evidence-based policies in a wide range of activities such as finance, tax collection, healthcare, education, energy consumption, and human cooperation.

Given this relevance, the present workshop aims to teach the foundations of behavioral economics and show how its instruments can help improve social and economic outcomes in problems found in modern public policy. The workshop will also establish statistical and econometric techniques (and commands) to secure the correct implementation of interventions and the assessment of their results.

Learn more and register at the upcoming workshop in March 2021 at