## Estimating long-run coefficients from an ARDL model

Whether we are working with time series or panel data, we usually want to analyze both the long-run behavior and the short-run dynamics. A well-known model that allows for this is the Auto-Regressive Distributed Lag (ARDL) model. The ARDL admits several forms: re-parametrizations, conditional cointegration forms, and full cointegration equations derived from it. In this article we describe how to calculate the long-run coefficient of an ARDL model, for either time series or panel data.

Consider the basic Auto-Regressive Distributed Lag model with an exogenous variable, which is of the form:
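In generic notation, taking the intercept $a_0$ and the convention that $\alpha_1$ multiplies the contemporaneous $x$ as assumptions, such a model can be written as:

$$
y_t = a_0 + \sum_{i=1}^{p} \beta_i\, y_{t-i} + \sum_{j=0}^{l} \alpha_{j+1}\, x_{t-j} + u_t
$$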

Here y is the dependent variable and p is the autoregressive order of the ARDL, that is, the number of lags of y. x is an exogenous explanatory variable that enters with l lags (its contemporaneous value can also be included), and u is the residual term.

This form of the ARDL is not a long-run form; it is closer to a short-run model. The full impact of x, captured by the α coefficients, must therefore be assessed taking into account the size and order of the autoregressive terms of y, captured by the β coefficients. We want to weigh the cumulative impact of α, and the way to do so is with a long-run multiplier. Blackburne & Frank (2007) indicate that this long-run multiplier involves a non-linear transformation of the coefficients, which has the general form:
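Writing $\theta$ for the long-run multiplier (the symbol is an assumption) and indexing $\alpha$ so that $\alpha_1$ is the contemporaneous coefficient, the transformation is:

$$
\theta = \frac{\sum_{j=1}^{l+1} \alpha_j}{1 - \sum_{i=1}^{p} \beta_i}
$$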

This is the long-run multiplier of x. Note how the formula works: it divides the sum of the α coefficients associated with the independent variable (and its lags) by 1 minus the sum of the autoregressive β coefficients. The numerator is the long-run propensity of x towards y, which is simply the sum of the coefficients: given a permanent one-unit change in x, this sum is the long-run propensity as an impact on y. The denominator is the weight associated with the response of the autoregressive structure.

For example, an ARDL(2,2) is a model with two lags of the dependent variable and two lags of the independent variable (plus, of course, the contemporaneous value of x). This model has the form:
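With $\alpha_1$ on the contemporaneous $x$ and an intercept $a_0$ (both labels are assumptions about the notation), the ARDL(2,2) reads:

$$
y_t = a_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \alpha_1 x_t + \alpha_2 x_{t-1} + \alpha_3 x_{t-2} + u_t
$$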

The weighted long-run multiplier is then given by:
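For the ARDL(2,2), with $\alpha_1$ as the contemporaneous coefficient and $\theta$ as an assumed symbol for the multiplier, this is:

$$
\theta = \frac{\alpha_1 + \alpha_2 + \alpha_3}{1 - (\beta_1 + \beta_2)}
$$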

Here α runs from 1 to 3: it starts with the contemporaneous value of x, with coefficient α1, and then adds the coefficients of the lags, α2 for lag 1 and α3 for lag 2. Notice that we subtract the sum of the autoregressive β parameters from unity to weight the cumulative impact of x.

The long-run coefficient is interpreted as follows: if x in levels changes by one unit, the average (expected) change in y is given by the long-run coefficient.

Let’s put this together with an example in Stata.

Load the dataset and generate a time identification variable with:

```
use https://www.stata-press.com/data/r16/auto
generate t = _n
```

Then tell Stata that you are working with time series:

`tsset t, y`

Now let’s estimate an ARDL(2,2) model with price as the dependent variable and weight as the independent variable (both assumed to be stationary).

`reg price L.price L2.price weight L1.weight L2.weight`

From here you can analyze many things. For example, the long-run propensity is given by:

```
** Long-run propensity of x (weight)
display _b[weight] +_b[L1.weight]+_b[L2.weight]
```
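Because the long-run propensity is a linear combination of coefficients, its standard error can be obtained as well. The following is a minimal sketch that reloads the same data so that it runs on its own; the use of lincom here is an illustration and not part of the original walkthrough:

```stata
* Reload the example data so the block runs on its own
use https://www.stata-press.com/data/r16/auto, clear
generate t = _n
tsset t

* ARDL(2,2) of price on weight, as in the text
quietly regress price L.price L2.price weight L1.weight L2.weight

* Long-run propensity: sum of the weight coefficients, with its standard error
lincom weight + L1.weight + L2.weight
```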

The long-run multiplier discussed above can be calculated with:

```
** Long-run multiplier of x
display (_b[weight] +_b[L1.weight]+_b[L2.weight]) / (1-(_b[L1.price] + _b[L2.price]))
```

From here, you can also obtain the long-run coefficient together with its statistical significance by using nlcom:

`nlcom (_b[weight] +_b[L1.weight]+_b[L2.weight]) / (1-(_b[L1.price] + _b[L2.price]))`

Notice that when weight increases by one unit in the long run, the expected change in price is 1.68 units, which is statistically significant at the 10% level.

You can extend this analysis to the well-known long-run and short-run dynamics of the Engle & Granger cointegration approach, where the long-run coefficients are obtained from the estimated short-run coefficients. This will be covered in a future post.

An excellent video that explains this idea can be found in Nyboe Tabor (2016).

Bibliography:

Blackburne, E. F. & Frank, M. W. (2007). Estimation of nonstationary heterogeneous panels. The Stata Journal, 7(2), pp. 197-208.

Nyboe Tabor, M. (2016). The ADL Model for Stationary Time Series: Long-run Multipliers and the Long-run Solution. Retrieved from: https://www.youtube.com/watch?v=GLpCVrZbW-g

## Panel Data Nonlinear Simultaneous Equation Models with Two-Stage Least Squares using Stata

In this article, we follow the procedure in Wooldridge (2002) to estimate a set of equations with nonlinear functional forms for panel data using the two-stage least squares estimator. This topic is fairly uncommon in applied econometrics, because instrumenting the nonlinear terms can be somewhat complicated.

Assume a two-equation system of the form:
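Following the structure described in Wooldridge (2002), and taking the coefficient labels $\gamma$ and $\delta$ and the subscripts on $Z$ as assumptions about the notation, the system can be written as:

$$
\begin{aligned}
y_1 &= \gamma_{12}\, y_2 + \gamma_{13}\, y_2^{2} + Z_1 \delta_1 + u_1 \\
y_2 &= \gamma_{21}\, y_1 + Z_2 \delta_2 + u_2
\end{aligned}
$$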

Here the y’s are the endogenous variables, Z contains the exogenous variables used as instruments, and the u’s are the residuals of each equation. Notice that y2 enters the first equation in quadratic form but appears linearly in the second equation.

Wooldridge calls this model nonlinear in the endogenous variables, yet it is still linear in the parameters γ. This makes it a particular problem in which we need to somehow instrument the quadratic term of y2.

Finding instruments for the quadratic term is an even greater challenge than it already is for linear terms in simple instrumental variable regression. He suggests the following:

“A general approach is to always use some squares and cross products of the exogenous variables appearing somewhere in the system. If something like exper² appears in the system, additional terms such as exper³ and exper⁴ would be added to the instrument list.” (Wooldridge, 2002, p. 235).

Therefore, it is worth trying nonlinear terms of the exogenous variables in Z, such as Z² or even Z³, and using them as instruments to deal with the endogeneity of the quadratic term of y2. Once we define our set of instruments, any equation that is nonlinear in the endogenous variables can be estimated with two-stage least squares. As always, we should test the overidentifying restrictions to make sure we avoid inconsistent estimates.

Let’s walk through the process with an example.

We work with the example of a nonlinear labor supply function, which is a system of the form:
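Based on the variables described for each equation in the next two paragraphs, and taking the coefficient labels as assumptions about the notation, the system can be written as:

$$
\begin{aligned}
hours &= \gamma_{12}\log(wage) + \gamma_{13}[\log(wage)]^{2} + \delta_{10} + \delta_{11}\,educ + \delta_{12}\,age \\
      &\quad + \delta_{13}\,kidslt6 + \delta_{14}\,kidsge6 + \delta_{15}\,nwifeinc + u_1 \\
\log(wage) &= \delta_{20} + \delta_{21}\,educ + \delta_{22}\,exper + \delta_{23}\,exper^{2} + u_2
\end{aligned}
$$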

Briefly, in the first equation the hours worked are a nonlinear function of the wage, the level of education (educ), age (age), the number of children younger than 6 and between 6 and 18 years old (kidslt6 and kidsge6), and the household’s non-wife income (nwifeinc).

In the second equation, the wage is a function of education (educ) and a nonlinear function of the exogenous variable experience (exper and exper²).

We work under the natural assumption that E(u|z)=0, so the instruments are exogenous. Here z contains all the variables that are not endogenous (hours and wage are the endogenous variables).

We will instrument the quadratic term of the logarithm of the wage in the first equation. To do so, we add three new quadratic terms as instruments: educ², age² and nwifeinc². We include them in the first-stage regression.

In Stata, we first load the dataset, which can be found here:

`https://drive.google.com/file/d/1m4bCzsWgU9sTi7jxe1lfMqM2T4-A3BGW/view?usp=sharing`

Load the data (double-click the file with Stata open, or use a path command to locate it):

`use MROZ.dta`

Generate the squared term for the logarithm of the wage with:

`gen lwage_sq=lwage *lwage`

Next we use the following ivregress command, which we explain in detail below.

`ivregress 2sls hours educ age kidslt6 kidsge6 nwifeinc (lwage lwage_sq  = educ c.educ#c.educ exper expersq age c.age#c.age kidsge6 kidslt6 nwifeinc c.nwifeinc#c.nwifeinc), first`

It has the following interpretation, according to Stata’s syntax. First, specify the first equation with its associated exogenous variables; we do that with the part:

`ivregress 2sls hours educ age kidslt6 kidsge6 nwifeinc`

Now, let’s tell Stata that we have two endogenous regressors, the wage and the squared term of the wage. We open the parenthesis and write:

`(lwage lwage_sq  =`

This tells Stata that lwage and lwage_sq are endogenous regressors in the hours equation. After the equal sign, we specify ALL the exogenous variables, including the instruments for the endogenous terms, so that the second part becomes:

`(lwage lwage_sq  = educ c.educ#c.educ exper expersq age c.age#c.age kidsge6 kidslt6 nwifeinc c.nwifeinc#c.nwifeinc)`

Notice that this second part uses the c.var#c.var structure. This is Stata’s operator for the product of continuous variables, which lets us create the quadratic terms without generating new variables with a separate command, as we did for the wage.

So c.educ#c.educ is the square of educ, c.age#c.age is the square of age, and c.nwifeinc#c.nwifeinc is the square of the non-wife income. These are the instruments for the quadratic term.

Having two variables on the left-hand side (lwage and lwage_sq) means that there are two first-stage equations, one for lwage and one for lwage_sq, both using exactly the same set of instruments.

We include the option `first` to display the first-stage regressions.

`ivregress 2sls hours educ age kidslt6 kidsge6 nwifeinc (lwage lwage_sq  = educ c.educ#c.educ exper expersq age c.age#c.age kidsge6 kidslt6 nwifeinc c.nwifeinc#c.nwifeinc), first`

Stata first displays the two first-stage regressions, one for lwage and one for lwage_sq, followed by the second-stage (2SLS) estimates.

These estimates yield coefficients identical to those in Wooldridge’s book (2002, p. 236), with slight differences in the standard errors (which do not change the interpretation of the statistical significance of the estimators).

In this way, we instrumented both endogenous regressors, lwage and lwage_sq, which enter the model nonlinearly.

As we can see, the quadratic term is not statistically significant to explain the hours worked.

Finally, we need to make sure that the overidentifying restrictions are valid, so after the regression we run:

`estat overid`

With this result, we cannot reject the null hypothesis that the overidentifying restrictions are valid.

Bibliography

Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

## Wooldridge Serial Correlation Test for Panel Data using Stata.

In this article, we follow the procedure in Drukker (2003) to derive the first-order serial correlation test proposed by Jeff Wooldridge (2002) for panel data. This test is considered robust, since it relies on fewer assumptions about the behavior of the heterogeneous individual effects.
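The starting point is a linear one-way panel model. In notation along the lines of Drukker (2003), where the symbols $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon_{it}$ are assumptions about the labels, the model referred to as (1) can be written as:

$$
y_{it} = \alpha + X_{it}\beta_1 + Z_i\beta_2 + \mu_i + \epsilon_{it} \qquad (1)
$$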

Here y is the dependent variable, X is the (1×K) vector of exogenous variables, Z is a vector of time-invariant covariates, and µ is the individual effect of each individual. The correlation between X and µ is of special importance: if they are uncorrelated, the random-effects model is preferable, whereas if X and µ are correlated, it is better to stick with fixed effects.

The fixed- and random-effects estimators rely on the absence of serial correlation. To test for it, Wooldridge uses the residuals from the regression of (1) in first differences, which has the form:
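Writing $\Delta$ for the first-difference operator, $\beta_1$ for the coefficients on X and $\epsilon_{it}$ for the idiosyncratic error (these labels are assumptions), the differenced equation is:

$$
\Delta y_{it} = \Delta X_{it}\beta_1 + \Delta\epsilon_{it}, \qquad \Delta y_{it} = y_{it} - y_{i,t-1}
$$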

Notice that differencing eliminates the individual effects contained in µ. Because these level effects are time-invariant, they show no variation over time and drop out when we analyze the changes.

Once we have the regression in first differences (with the individual-level effects eliminated), we obtain its residuals. We then check the correlation between the residual of the first-difference equation and its first lag: if there is no serial correlation in the original errors, this correlation should equal -0.5, as the next expression states.
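To see where -0.5 comes from, suppose the original errors $\epsilon_{it}$ are serially uncorrelated with constant variance $\sigma^2_\epsilon$ (this labeling is an assumption). Then:

$$
\begin{aligned}
\operatorname{Var}(\Delta\epsilon_{it}) &= \operatorname{Var}(\epsilon_{it} - \epsilon_{i,t-1}) = 2\sigma^2_\epsilon \\
\operatorname{Cov}(\Delta\epsilon_{it}, \Delta\epsilon_{i,t-1}) &= \operatorname{Cov}(\epsilon_{it} - \epsilon_{i,t-1},\ \epsilon_{i,t-1} - \epsilon_{i,t-2}) = -\sigma^2_\epsilon \\
\operatorname{Corr}(\Delta\epsilon_{it}, \Delta\epsilon_{i,t-1}) &= \frac{-\sigma^2_\epsilon}{2\sigma^2_\epsilon} = -0.5
\end{aligned}
$$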

Therefore, if the correlation equals -0.5, the original model in (1) has no serial correlation. If it differs significantly from -0.5, the original model in (1) has a first-order serial correlation problem.

All of the regressions account for within-panel correlation, so every step uses clustered standard errors, and we omit the constant term in the difference equation. In sum, we do the following:

1. Specify our model (with either fixed or random effects, as long as they are time-invariant).
2. Create the difference model (taking first differences of all the variables, so that it has no individual effects). We run this regression clustering by individual and omitting the constant term.
3. Predict the residuals of the difference model.
4. Regress the predicted residual on its first lag. We also cluster this regression and omit the constant.
5. Test the hypothesis that the coefficient on the lagged residual equals -0.5.

Let’s do a quick example of these steps using the same example as Drukker.

`use http://www.stata-press.com/data/r8/nlswork.dta`

Then we declare the panel structure in Stata with:

`xtset idcode year`

Then we generate some quadratic variables.

```
gen age2 = age^2
gen tenure2 = tenure^2
```

We estimate our model with:

`xtreg ln_wage age* ttl_exp tenure* south, fe`

It doesn’t matter whether we use fixed or random effects, as long as we assume that the individual effects are time-invariant (so that they are eliminated in the first-difference model).

Now let’s compute the test manually. To do this, we run a pooled regression of the model in first differences, without the constant and clustering on the panel variable:

`reg d.ln_wage d.age* d.ttl_exp d.tenure* d.south, noconst cluster(idcode)`

The option noconst removes the constant term from the difference model, and the cluster option computes cluster-robust standard errors; idcode is the panel variable that identifies the individuals in the panel.

Next, we predict the residuals of this pooled difference regression:

`predict u, res`

Then we regress the predicted residual u on its first lag, again clustering and omitting the constant:

`reg u L.u, noconst cluster(idcode)`

Finally, we test whether the coefficient on the first lag of the residual equals -0.5:

`test L.u==-0.5`

According to the results, we strongly reject the null hypothesis of no serial correlation at the 5% level of significance. Therefore, the model has serial correlation problems.

We can also perform the test with Drukker’s packaged Stata command, xtserial, which can be somewhat faster:

`xtserial ln_wage age* ttl_exp tenure* south, output`

and we obtain the same results. The advantage of the manual procedure, however, is that it can be applied to any kind of model or regression.

Bibliography

Drukker, D. (2003). Testing for serial correlation in linear panel-data models. The Stata Journal, 3(2), pp. 168–177. Retrieved from: https://journals.sagepub.com/doi/pdf/10.1177/1536867X0300300206

Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

## HAC robust standard errors.

When we work with time series data, we are highly likely to find serial correlation and heteroskedasticity. In these cases the errors are serially correlated and have non-constant variance.

If we are mainly interested in statistical inference, we should use HAC robust standard errors in the time series context. As Wooldridge points out, the name refers to the following:

“In the time series literature, the serial correlation–robust standard errors are sometimes called heteroskedasticity and autocorrelation consistent, or HAC, standard errors.” (Wooldridge, 2013, p. 432).

We should point out that HAC standard errors (also called HAC estimators) derive from the work of Newey & West (1987), whose objective was to build a robust approach to the usual time series problems of serial correlation and heteroskedasticity.

What is the idea behind these standard errors? We can summarize it as follows:

1. We do not know the form of the serial correlation.
2. The approach works for arbitrary forms of serial correlation, and the lag structure of the autocorrelation can be chosen from the sample size.
3. With larger samples, we can be more flexible about the amount of serial correlation allowed.

This means that, even though the robust standard error is consistent in the presence of serial correlation and heteroskedasticity, we still need to choose the lag structure for the autocorrelation. Again, Wooldridge helps us decide this with a simple rule of thumb:

Annual data: 1 or 2 lags. Quarterly data: 4 to 8 lags. Monthly data: 12 to 24 lags.

Let’s dig into some formulas to understand the relationship between HAC and OLS.

First, Newey & West standard errors are built on the ordinary least squares estimator, which has the form:
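In matrix notation, with $\hat\beta$ as an assumed label for the vector of estimated coefficients, this is:

$$
\hat\beta = (X'X)^{-1}X'Y
$$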

Here X is the matrix of independent variables and Y is the vector of the dependent variable. It follows that the Newey-West coefficient estimates do not differ from the OLS estimates; only the standard errors change.

Second, Newey & West standard errors modify the estimated variances to incorporate White’s robust approach to heteroskedasticity as well as the serial correlation structure.

Consider that estimates of the variance in OLS are given by:
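In the sandwich form commonly used for robust variance estimators (the hat notation is an assumption about the labels), this is:

$$
\widehat{\operatorname{Var}}(\hat\beta) = (X'X)^{-1}\,X'\hat\Omega X\,(X'X)^{-1}
$$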

Where Ω is the diagonal matrix containing the distinct variances (for a representation of heteroskedasticity).

Now, White’s robust estimator is defined by:
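One common way to write the middle term of the sandwich when only heteroskedasticity is handled, including the small-sample factor $n/(n-k)$, is shown next. The labels $\hat\Omega_0$, $\hat e_i$ and $x_i$ are assumptions about the notation:

$$
X'\hat\Omega_0 X = \frac{n}{n-k}\sum_{i=1}^{n}\hat e_i^{2}\,x_i'x_i
$$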

Here n is the sample size, e is the estimated residual for period i, and x is the ith row of the matrix of independent variables. Let’s call these the robust estimates with 0 lags (since they only handle heteroskedasticity).

This is where Newey & West extended the White estimator to allow for arbitrary forms of serial correlation with an m-lag structure:
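With $m$ lags, the middle term of the sandwich adds weighted cross-products of residuals up to $m$ periods apart. A sketch in assumed notation, where $X'\hat\Omega_0 X = \frac{n}{n-k}\sum_{i}\hat e_i^{2}x_i'x_i$ is the heteroskedasticity-only term:

$$
X'\hat\Omega X = X'\hat\Omega_0 X + \frac{n}{n-k}\sum_{l=1}^{m}\left(1 - \frac{l}{m+1}\right)\sum_{t=l+1}^{n}\hat e_t\,\hat e_{t-l}\left(x_t'x_{t-l} + x_{t-l}'x_t\right)
$$

The weights $1 - l/(m+1)$ decline with the lag $l$, so more distant autocovariances count less.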

As can be seen, the HAC estimate of the variance now includes both the heteroskedasticity term and an m-lag term for serial correlation. k is the number of independent variables, t indexes the time periods, and x is the row of the matrix of independent variables observed at time t.

With this, it is clearer how to work within the frame of:

Annual data: m = 1 or 2 lags. Quarterly data: m = 4 to 8 lags. Monthly data: m = 12 to 24 lags.

Let’s see an example:

```
use https://www.stata-press.com/data/r16/auto
generate t = _n
tsset t
regress price weight displ, vce(robust)
```

Up to this point, these are White’s heteroskedasticity-robust standard errors. Now let’s estimate the equivalent HAC estimator, which uses 0 lags.

`newey price weight displ, lag(0)`

As you can see, everything matches White’s robust standard errors exactly. Now let’s use the HAC structure with 2 lags.

`newey price weight displ, lag(2)`

Notice that the standard errors of the independent variables have changed with this estimation.

I would recommend always reporting the HAC standard errors, in order to have a point of comparison and to draw correct inferences.

As a last remark, Greene (2012) mentions the usual practice of choosing the lag as the integer approximation of T^(1/4), where T is the total number of time periods. In our case, treating the data as annual, it would be:

`display (74)^(1/4)`

and Stata displays approximately 2.93, so the number of lags to select would be 3 or 2 (with no specific criterion to choose one over the other).
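The same rule of thumb can be computed from the sample itself instead of typing the sample size by hand. The following is a minimal sketch using the auto dataset from this example; the choice of floor() for the rounding is an assumption, since the rule leaves open whether to round up or down:

```stata
* Load the example data and declare the time variable
use https://www.stata-press.com/data/r16/auto, clear
generate t = _n
tsset t

* Rule-of-thumb lag: integer part of T^(1/4), with T = number of observations
local m = floor(_N^(1/4))
display "lags selected: `m'"

* HAC (Newey-West) standard errors with the selected lag
newey price weight displ, lag(`m')
```

Replacing floor() with ceil() would pick the larger of the two candidate lags instead.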

Bibliography:

Greene, W. H. 2012. Econometric Analysis, 7th edition, section 20.5.2, p. 960

Newey, W. K., and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703–708.

Wooldridge, J. 2013. Introductory Econometrics: A Modern Approach, Fifth Edition. South-Western CENGAGE Learning.

## Identifying Patterns with Stata Graphs

When we start to analyze any economic relationship, it is often said that we should always graph the data first. A visual helps us understand the relationships present in the data. Sometimes it also helps us improve the functional form of the econometric model, so that it better captures the relationships and dynamics in the data.

I would suggest the following steps first:

1. Scatter your dependent variable (on the y-axis) against your independent variable (on the x-axis).
2. Observe what kind of linear and non-linear relationships may exist in the graph.
3. Mark the mean values of the variables to get an idea of where the data are concentrated.
4. Make your inferences accordingly, and draw a matrix of correlations (scatter plots) for all the variables.

As an example, let’s use a Data Generating Process of the form:
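Read off from the generating code that follows, where x and z are independent standard normal draws, the process is:

$$
y = 1 + 0.5\,x - 0.2\,x^{2} + 1.5\,z, \qquad x \sim N(0,1),\quad z \sim N(0,1)
$$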

And to generate the random sample we will use:

```
clear all
set obs 100
gen n=_n
set seed 1234
gen x=rnormal()
gen x_sq=x*x
gen z=rnormal()
gen y= 1 + (0.5*x)+ (- 0.2*x_sq) + (1.5*z)
```

Now let’s see a summary of our variables.

`sum`

In the resulting table, skipping n, which is just the observation identifier, we can see the mean values of these variables. Now let’s start to play with some scatter plots.

```
scatter y x
scatter y z
```

And we will have two graphs that look like this:

The first graph, the scatter of y against x, doesn’t show any clear relationship; judging by the dispersion, we might even conclude that there is no relationship. The second graph, on the other hand, suggests a possible linear relationship between y and z.

Let’s mark the means of each variable in the scatter plots. Remember that the mean of x is 0.0078 and the mean of y is 0.7479; with these values we have:

```
scatter y x, xline(.0078032) yline(.747933)
scatter y z, xline(-.0452837) yline(.747933)
```
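Typing the means by hand is error-prone, so they can also be taken from the results that summarize stores. The following is a minimal sketch that regenerates the same sample so that it runs on its own; the local macro names mean_x and mean_y are illustrative choices:

```stata
* Regenerate the simulated sample used in this example
clear all
set obs 100
set seed 1234
generate x = rnormal()
generate x_sq = x*x
generate z = rnormal()
generate y = 1 + (0.5*x) + (-0.2*x_sq) + (1.5*z)

* Store the sample means returned by summarize in local macros
quietly summarize x
local mean_x = r(mean)
quietly summarize y
local mean_y = r(mean)

* Scatter plot with reference lines at both means
scatter y x, xline(`mean_x') yline(`mean_y')
```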

According to this, the data appear to be normally distributed (as they should be, since we sampled from a normal distribution). In other cases, we might find the mean located at extreme values on either axis, which might point to skewed or otherwise non-normal distributions.

Now let’s add linear and non-linear fits using the less common lfitci and qfitci plot types. To do this, we type:

```
twoway (lfitci y x)
twoway (lfitci y z)
```

And the respective output will be:

If we want lines instead of a shaded area, we can type:

```
twoway (lfitci y x, ciplot(rline) )
twoway (lfitci y z, ciplot(rline) )
```

And it will display the same graph, but without shaded areas.

We can extend the same idea to non-linear relationships with a quadratic fit using qfitci:

```
twoway (qfitci y x)
twoway (qfitci y z)
```

And the output of the graph will be:

Notice that the quadratic relationship between x and y is now more visible with the quadratic fit. It is therefore good practice to try the quadratic fit even when the relationship is completely linear, as in the case of y and z.

One last type of graphical analysis uses fractional polynomials, with the syntax:

```
twoway (fpfitci y x)
twoway (fpfitci y z)
```

Finally, to complete the steps mentioned in this post, let’s draw the matrix of correlations, which is simply all the pairwise scatter plots together.

`graph matrix y x z`

The useful thing about this matrix is that we observe not only the scatter plots for one particular variable, but the scatter plots for all pairs of variables included in the command. In regression analysis, this is quite useful for inspecting multicollinearity among the independent variables, and not only their correlation with the dependent variable.

For x and z, we can say that there is no strong linear correlation, since the plot looks more like a cloud of dots than the linear relationship between y and z.

Notice, however, that without a quadratic fit it is not easy to detect the quadratic relationship between y and x. It is therefore recommended to use qfitci to investigate such non-linear relationships.

Bibliography.

StataCorp (2020). Graph twoway fpfitci. Retrieved from: https://www.stata.com/manuals13/g-2graphtwowayfpfitci.pdf#g-2graphtwowayfpfitci

## Investigating Non-linear relationships with curvefit using Stata

When modelling specific phenomena in economics, we sometimes encounter a functional form that is not linear in the explanatory variables. As long as the model is still linear in the parameters, we can include powers of the variables in the regression. As an example, consider the following model:
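With $\beta_0$, $\beta_1$, $\beta_2$ as the parameters and $u$ as the error term (the labels for the intercept and the error are assumptions), a second-order polynomial model is:

$$
Y = \beta_0 + \beta_1 X + \beta_2 X^{2} + u
$$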

This equation presents the dependent variable Y as a function of X, with a polynomial of second degree. A few remarks follow: 1) the model is still linear in the parameters β; 2) the correlation between X and its square should not be treated as a multicollinearity problem: the two regressors are highly correlated by construction, because we are deliberately modelling a structure in which both move together; 3) the parameters no longer represent a constant marginal effect. To find the marginal effect we need to calculate the derivative of the model, given by:
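Differentiating a quadratic model $Y = \beta_0 + \beta_1 X + \beta_2 X^{2} + u$ with respect to $X$ (the parameter labels are assumptions) gives:

$$
\frac{\partial Y}{\partial X} = \beta_1 + 2\beta_2 X
$$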

That is, when X increases by one unit, the change in Y is given by the expression above, which depends on the level of X.

From the derivative we can find the turning point in the effect of X on Y, by setting the derivative equal to 0 (the point where the slope is zero) and solving the equation for X:
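Using the marginal effect $\beta_1 + 2\beta_2 X$ of the quadratic model (with assumed parameter labels), the condition is:

$$
\beta_1 + 2\beta_2 X = 0
$$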

Solving for X, we have:
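Rearranging the first-order condition $\beta_1 + 2\beta_2 X = 0$, and writing $X^{*}$ for the turning point (both the parameter labels and the star are assumptions about notation):

$$
X^{*} = -\frac{\beta_1}{2\beta_2}
$$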

Let’s see this in practice. First, let’s formulate a Data Generating Process (DGP) without any noise or error, as follows:
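Read off from the generating code that follows, the process is:

$$
y = 1 + 0.5\,x - 0.2\,x^{2}
$$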

where X~N(0,1). In Stata, let’s generate some random observations and the squared variable:

```
clear all
** Setting observations
set obs 50
gen n=_n
set seed 1234
gen x=rnormal()
gen x_sq=x*x
gen y= 1 + (0.5*x)+ (- 0.2*x_sq)
```

After that, let’s scatter y against x. Using scatter y x, we obtain the next graph:

If we regress this functional form with the next command:

`regress y x x_sq`

The regression fits the DGP perfectly, but many statistics are missing (since there is no residual at all!).

Notice also that the R-squared is 1, meaning the model matches the data perfectly.

Having confirmed that the coefficients are 0.5 and -0.2, with a constant of 1, let’s locate the turning point of the model. Substituting the parameters into the turning-point expression, we have:
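With $\beta_1 = 0.5$ and $\beta_2 = -0.2$ in $X^{*} = -\beta_1/(2\beta_2)$ (the symbols are assumed labels for the coefficients on x and x_sq):

$$
X^{*} = -\frac{0.5}{2(-0.2)} = \frac{0.5}{0.4} = 1.25
$$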

The slope of the curve is 0 at X=1.25, where Y=1+0.5(1.25)-0.2(1.25^2)=1.3125. Beyond that point, further increases in X decrease Y.

Let’s redo the graph but marking those points.

`scatter y x, yline(1.3125) xline(1.25)`

We have located the exact point at which the x variable is large enough to have a decreasing effect on the dependent variable (specifically at x=1.25, y=1.3125). For x>1.25 the effect on y is negative, whereas before this point it was positive.

With this context, let’s introduce the curvefit command.

This package was created by Liu Wei (2010) and is useful for investigating this kind of nonlinearity. Let’s see it in action.
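The turning point can also be computed from the stored regression coefficients rather than by hand. The following is a minimal sketch that regenerates the noise-free sample so that it runs on its own; the local macro names xstar and ystar are illustrative choices:

```stata
* Regenerate the noise-free sample used in this example
clear all
set obs 50
set seed 1234
generate x = rnormal()
generate x_sq = x*x
generate y = 1 + (0.5*x) + (-0.2*x_sq)

* Fit the quadratic model
quietly regress y x x_sq

* Turning point: minus the linear coefficient over twice the quadratic one
local xstar = -_b[x] / (2*_b[x_sq])
display "turning point in x: " `xstar'

* Fitted value of y at the turning point
local ystar = _b[_cons] + _b[x]*`xstar' + _b[x_sq]*`xstar'^2
display "y at the turning point: " `ystar'
```

Since the fit is essentially exact, the displayed values should be very close to 1.25 and 1.3125.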

`curvefit y x, function(1)`

After the variables of interest (y as the dependent and x as the independent variable), we need to specify the form of the function. Here, function(1) corresponds to a first-order polynomial (a straight line). The output is the following:

As you can see, it gives the coefficient estimates (b0 as the constant and b1 as the slope) along with basic statistics: the number of observations (N) and the adjusted R-squared. The graph displayed is:

This is a linear model, a simple regression on the first power of X. Let’s try another function, the quadratic. We type:

`curvefit y x, function(4)`

Which gives the following output:

Here b0 is the constant, b1 is the coefficient on X, and b2 is the coefficient on X^2. An adjusted R^2 of 1 means that the model fits the data perfectly. The associated graph is:

As you can see, curvefit provides quite decent estimates of the structure of the data under different types of mathematical models.

Here is the complete list of functions that can be modeled:

```
function(string) The following alternative models correspond to the values of the string:

. string = 1 Linear: Y = b0 + (b1 * X)
. string = 2 Logarithmic: Y = b0 + (b1 * ln(X))
. string = 3 Inverse: Y = b0 + (b1 / X)
. string = 4 Quadratic: Y = b0 + (b1 * X) + (b2 * X^2)
. string = 5 Cubic: Y = b0 + (b1 * X) + (b2 * X^2) + (b3 * X^3)
. string = 6 Power: Y = b0 * (X^b1) OR ln(Y) = ln(b0) + (b1 * ln(X))
. string = 7 Compound: Y = b0 * (b1^X) OR ln(Y) = ln(b0) + (ln(b1) * X)
. string = 8 S-curve: Y = e^(b0 + (b1/X)) OR ln(Y) = b0 + (b1/X)
. string = 9 Logistic: Y = b0 / (1 + b1 * e^(-b2 * X))
. string = 0 Growth: Y = e^(b0 + (b1 * X)) OR ln(Y) = b0 + (b1 * X)
. string = a Exponential: Y = b0 * (e^(b1 * X)) OR ln(Y) = ln(b0) + (b1 * X)
. string = b Vapor Pressure: Y = e^(b0 + b1/X + b2 * ln(X))
. string = c Reciprocal Logarithmic: Y = 1 / (b0 + (b1 * ln(X)))
. string = d Modified Power: Y = b0 * b1^(X)
. string = e Shifted Power: Y = b0 * (X - b1)^b2
. string = f Geometric: Y = b0 * X^(b1 * X)
. string = g Modified Geometric: Y = b0 * X^(b1/X)
. string = h nth order Polynomial: Y = b0 + b1 * X + b2 * X^2 + b3 * X^3 + b4 * X^4 + b5 * X^5 …
. string = i Hoerl: Y = b0 * (b1^X) * (X^b2)
. string = j Modified Hoerl: Y = b0 * b1^(1/X) * (X^b2)
. string = k Reciprocal: Y = 1 / (b0 + b1 * X)
. string = l Reciprocal Quadratic: Y = 1 / (b0 + b1 * X + b2 * X^2)
. string = m Bleasdale: Y = (b0 + b1 * X)^(-1 / b2)
. string = n Harris: Y = 1 / (b0 + b1 * X^b2)
. string = o Exponential Association: Y = b0 * (1 - e^(-b1 * X))
. string = p Three-Parameter Exponential Association: Y = b0 * (b1 - e^(-b2 * X))
. string = q Saturation-Growth Rate: Y = b0 * X/(b1 + X)
. string = r Gompertz Relation: Y = b0 * e^(-e^(b1 - b2 * X))
. string = s Richards: Y = b0 / (1 + e^(b1 - b2 * X))^(1/b3)
. string = t MMF: Y = (b0 * b1+b2 * X^b3)/(b1 + X^b3)
. string = u Weibull: Y = b0 - b1*e^(-b2 * X^b3)
. string = v Sinusoidal: Y = b0+b1 * b2 * cos(b2 * X + b3)
. string = w Gaussian: Y = b0 * e^((-(b1 - X)^2)/(2 * b2^2))
. string = x Heat Capacity: Y = b0 + b1 * X + b2/X^2
. string = y Rational: Y = (b0 + b1 * X)/(1 + b2 * X + b3 * X^2)
. string = ALL refers to all of the above models (note: it must be uppercase)
nograph Curve estimation without the curve fit graph.
```
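As a cross-check on what some of these functional forms mean, a few of them can be reproduced with built-in commands. The sketch below does not call curvefit itself; the `auto` dataset and the variable name `ln_weight` are assumptions chosen only for illustration.

```stata
* Sketch only: reproduces three of the listed forms with built-in commands,
* using the auto dataset as an assumed example (not part of the original post)
sysuse auto, clear

* string = 1, Linear: Y = b0 + (b1 * X)
regress mpg weight

* string = 4, Quadratic: Y = b0 + (b1 * X) + (b2 * X^2)
regress mpg c.weight##c.weight

* string = 2, Logarithmic: Y = b0 + (b1 * ln(X))
generate double ln_weight = ln(weight)
regress mpg ln_weight
```

The forms that are linear in the parameters (or become linear after taking logs) can be checked this way; the intrinsically nonlinear ones, such as Logistic or Gompertz, would need a nonlinear estimator instead.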

This package can be installed using:

`ssc install curvefit, replace`

Bibliography.

Liu Wei (2010) “CURVEFIT: Stata module to produces curve estimation regression statistics and related plots between two variables for alternative curve estimation regression models,” Statistical Software Components S457136, Boston College Department of Economics, revised 28 Jul 2013.

## Box-Pierce Test of autocorrelation in Panel Data using Stata.

The test of Box & Pierce was derived from the article “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models” in the Journal of the American Statistical Association (Box & Pierce, 1970).

The approach is used here to test for first-order serial correlation. In its general form, the test statistic is:

$$Q = n \sum_{h=1}^{m} \hat{\rho}_h^{2}$$

where the Box-Pierce statistic Q is the product of the number of observations n and the sum of the squared sample autocorrelations $\hat{\rho}_h$ at lags h = 1, …, m. The test is closely related to the Ljung & Box (1978) autocorrelation test, and it is used to detect serial correlation in time series analysis. Under the null hypothesis, Q asymptotically follows a chi-square distribution.

The null hypothesis of this test can be stated as H0: the data are independently distributed, against the alternative H1: the data are not independently distributed. In other words, the null hypothesis is that the data have no autocorrelation structure, while the alternative is that they do.
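To make the arithmetic concrete, here is a minimal sketch that computes the statistic by hand for a single simulated series with one lag. It is not the panel implementation discussed in this post; the simulated data, the seed, and the scalar names are assumptions, and the stored results `r(ac1)` and `r(q1)` are taken from the built-in `corrgram` command.

```stata
* Minimal sketch on simulated white noise (not the article's data)
clear
set seed 12345
set obs 200
generate t = _n
tsset t
generate e = rnormal()

* Sample autocorrelation at lag 1 and the portmanteau Q stored by corrgram
quietly corrgram e, lags(1)
scalar rho1 = r(ac1)
scalar Q_lb = r(q1)

* Box-Pierce statistic with one lag: Q = n * rho1^2, compared to chi-square(1)
scalar Q_bp = _N * rho1^2
display "Box-Pierce Q(1) = " Q_bp "   p-value = " chi2tail(1, Q_bp)

* Ljung-Box version reported by corrgram, for comparison
display "Ljung-Box Q(1)  = " Q_lb
```

With white-noise data, both statistics should usually be small and close to each other, since the Ljung-Box version only applies a finite-sample correction to the same squared autocorrelations.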

The test was implemented in Stata for panel data by Emad Abd Elmessih Shehata & Sahra Khaleel A. Mickaiel (2014). It works in the context of an ordinary least squares panel data regression (the pooled OLS model), and we will work through an example here.

First we install the package using the command ssc install as follows:

`ssc install lmabpxt, replace`

Then we open the help file to see the available options:

`help lmabpxt`

This displays the help file for the command.

We can see that the general syntax is:

`lmabpxt depvar indepvars [if] [in] [weight] , id(var) it(var) [noconstant coll ]`

In this case, id(var) and it(var) are the identifier of the individuals (id) and the identifier of the time structure (it), so both must be specified in the command.

Consider the next example:

```
clear all
use http://www.stata-press.com/data/r9/airacc.dta
xtset airline time, y
reg pmiles inprog
lmabpxt pmiles inprog, id(airline) it(time)
```

Notice that the Box-Pierce test implemented by Emad Abd Elmessih Shehata & Sahra Khaleel A. Mickaiel (2014) re-estimates the pooled regression. The general output would display this:

In this case, we can see the p-value associated with the Lagrange multiplier version of the Box-Pierce test. This p-value is around 0.96, so at the 5% level of significance we cannot reject the null hypothesis of no AR(1) panel autocorrelation in the residuals.

Now suppose that you are using a fixed effects approach. One way to proceed is to include dummy variables for the individuals (airlines in this case), as in a least squares dummy variable regression, and then compare the results.

To do that we can use:

`tab airline, gen(a)`

and then include from a2 to a20 in the regression structure, with the following code:

`lmabpxt  pmiles inprog a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 , id(airline) it(time)`

This differs from the error component structure; it is simply a fixed effects approach using a least squares dummy variable regression. Notice the output.

With the fixed effects approach using dummy variables, the p-value has decreased considerably. In this case, we reject the null hypothesis at the 5% level of significance, meaning that we might have a problem of first-order serial correlation in the panel data.
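As an informal cross-check that relies only on built-in commands, one could regress the LSDV residuals on their own first lag. This is a rough sketch and not the statistic computed by lmabpxt; the residual name `ehat` is hypothetical, and with few time periods per airline the auxiliary regression is only indicative.

```stata
* Rough cross-check with built-in commands; this is NOT the lmabpxt statistic
use http://www.stata-press.com/data/r9/airacc.dta, clear
xtset airline time

* LSDV regression: i.airline adds one dummy per airline
regress pmiles inprog i.airline

* Residuals and an auxiliary regression on their first lag
predict double ehat, residuals
regress ehat L.ehat, vce(cluster airline)
```

A coefficient on `L.ehat` that is clearly different from zero would point in the same direction as a rejection of the no-AR(1) null hypothesis.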

With this example, we have performed the Box-Pierce test for panel data (and, additionally, we have seen that it is sensitive to the inclusion of fixed effects in the regression).

Notes:

The lmabpxt command appears to be somewhat sensitive when the number of observations is large (more than 5000 units).

There is an impressive compilation of contributions by Shehata, Emad Abd Elmessih & Sahra Khaleel A. Mickaiel, which can be found at the following link:

http://www.haghish.com/statistics/stata-blog/stata-programming/ssc_stata_package_list.php

I suggest you check it out if you need anything related to Stata.

Bibliography

Box, G. E. P. and Pierce, D. A. (1970) “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models”, Journal of the American Statistical Association, 65: 1509–1526. JSTOR 2284333

Ljung, G. M. and Box, G. E. P. (1978) “On a Measure of a Lack of Fit in Time Series Models”, Biometrika, 65 (2): 297–303. doi:10.1093/biomet/65.2.297.

Shehata, Emad Abd Elmessih & Sahra Khaleel A. Mickaiel (2014) LMABPXT: “Stata Module to Compute Panel Data Autocorrelation Box-Pierce Test”

## Ramsey RESET Test on Panel Data using Stata

In regression analysis, we often check the assumptions of the estimated econometric model. One of the key assumptions is that the model has no omitted variables (and that it is correctly specified). Ramsey (1969) developed an omitted variable test, which uses powers of the predicted values of the dependent variable to check whether the model has an omitted variable problem.

Assume a basic fitted model given by:

$$y = Xb + e \tag{1}$$

where y is the n×1 vector containing the dependent variable, X is the n×k matrix containing the explanatory variables (n is the total number of observations and k is the number of independent variables), b is the estimated coefficient vector, and e is the vector of residuals.

The Ramsey test fits a regression model of the type:

$$y = Xb + zt + u \tag{2}$$

where z contains powers of the fitted values of y and u is the error term. The Ramsey test performs a standard F test of t=0, and the default setting takes the powers to be:

$$z = (\hat{y}^{2}, \hat{y}^{3}, \hat{y}^{4}) \tag{3}$$

In Stata this is easily done with the command

`estat ovtest`

after the regression command reg.

To illustrate this, consider the following code:

```
use https://www.stata-press.com/data/r16/auto
regress mpg weight foreign
* Ramsey RESET test using powers of the fitted values
estat ovtest
```

The null hypothesis is that t=0, which means that the powers of the fitted values have no power to explain the dependent variable y, and therefore that the model has no omitted variables. The alternative hypothesis is that the model suffers from an omitted variable problem.
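To see what the test does behind the scenes, the same idea can be sketched by hand: predict the fitted values, add their second to fourth powers to the regression, and test them jointly. The variable names `yhat`, `yhat2`, `yhat3` and `yhat4` are hypothetical, and the hand-made version may differ slightly from `estat ovtest` because of how the powers are scaled numerically.

```stata
* Sketch of the RESET logic by hand, for comparison with estat ovtest
use https://www.stata-press.com/data/r16/auto, clear
regress mpg weight foreign
estat ovtest

* Fitted values and their powers
predict double yhat, xb
generate double yhat2 = yhat^2
generate double yhat3 = yhat^3
generate double yhat4 = yhat^4

* Augmented regression and joint F test of t = 0
regress mpg weight foreign yhat2 yhat3 yhat4
test yhat2 yhat3 yhat4
```

The joint F test at the end plays the role of the test of t=0: a small p-value suggests that the powers of the fitted values still explain part of mpg.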

In the panel data structure, we have multiple time points and multiple observations at each time point. In this case we fit a model like:

$$y_{it} = x_{it}\beta + v_i + \varepsilon_{it} \tag{4}$$

with i = 1, 2, 3, …, n individuals and, for each i, t = 1, 2, …, T time periods. Here v represents the heterogeneous effect, which can be treated as a parameter to be estimated (fixed effects, where it may be correlated with the explanatory variables) or as a random variable (random effects, where it is assumed to be uncorrelated with the explanatory variables).

To implement the Ramsey test manually in this regression structure in Stata, we will follow the recommendation of Santos Silva (2016). We start by predicting the fitted values of the regression (including the heterogeneous effects!). Then we generate the powers of the fitted values and include them in the regression in (4) with clustered standard errors. Finally, we perform a joint significance test on the coefficients of the powers.
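Written out, the procedure amounts to the following augmented regression and joint hypothesis. The notation is an assumption made here for illustration: the panel model is written as $y_{it} = x_{it}\beta + v_i + \varepsilon_{it}$, the fitted values including the estimated effects are $\hat{y}_{it} = x_{it}\hat{\beta} + \hat{v}_i$, and the coefficients $\gamma_2, \gamma_3, \gamma_4$ are hypothetical names for the coefficients on the powers.

$$y_{it} = x_{it}\beta + \gamma_2\,\hat{y}_{it}^{2} + \gamma_3\,\hat{y}_{it}^{3} + \gamma_4\,\hat{y}_{it}^{4} + v_i + \varepsilon_{it}, \qquad H_0:\ \gamma_2 = \gamma_3 = \gamma_4 = 0$$

Failing to reject $H_0$ is read as no evidence of omitted variables or misspecification, in the same way as the test of t=0 in the cross-sectional case.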

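```
use https://www.stata-press.com/data/r16/nlswork

* Fixed-effects model with standard errors clustered by individual
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south, fe cluster(idcode)

* Fitted values including the estimated fixed effects (xb + u_i)
predict y_hat, xbu

* Powers of the fitted values
gen y_h_2=y_hat*y_hat
gen y_h_3=y_h_2*y_hat
gen y_h_4=y_h_3*y_hat

* Augmented regression and joint test of the powers
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south y_h_2 y_h_3 y_h_4, fe cluster(idcode)

test y_h_2 y_h_3 y_h_4
```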

Alternatively, you can skip generating the powers and build them directly with the c. and # operators in the command, as in this other code:

```
* clear is needed if the previous example left modified data in memory
use https://www.stata-press.com/data/r16/nlswork, clear

xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south, fe cluster(idcode)

predict y_hat, xbu

* Powers of y_hat built with factor-variable notation (no spaces around #)
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat#c.y_hat, fe cluster(idcode)

test c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat#c.y_hat
```

At the end of the procedure you will have this result.

The null hypothesis is that the model is correctly specified and has no omitted variables. In this case, however, we reject the null hypothesis at the 5% level of significance, which suggests that our model has omitted variables.

As an alternative that is somewhat more restricted but offers more features, you can use the user-written package “resetxt” developed by Shehata & Mickaiel (2015), which can be used after installing it with:

`ssc install resetxt, replace`

This package, however, does not work with factor variables or time series operators, so we cannot use operators such as c., i., D. or L. in the variable list.

```
clear all

use https://www.stata-press.com/data/r16/nlswork

* Squared terms created by hand, since resetxt does not accept c. or # operators
gen age_sq=age*age
gen ttl_sq=ttl_exp*ttl_exp
gen tenure_sq=tenure*tenure

xtreg ln_w grade age age_sq ttl_exp ttl_sq tenure tenure_sq race not_smsa south, fe cluster(idcode)

resetxt ln_w grade age age_sq ttl_exp ttl_sq tenure tenure_sq race not_smsa south, model(xtfe) id(idcode) it(year)
```

However, the above code may be hard for Stata to compute, depending on how much memory you have available for the procedure. That is why this post implemented the manual procedure of the Ramsey test in the panel data structure.

Bibliography

Shehata, E. A. E., & Mickaiel, S. K. A. (2015). RESETXT: Stata Module to Compute Panel Data REgression Specification Error Tests (RESET). Obtained from: Statistical Software Components S458101: https://ideas.repec.org/c/boc/bocode/s458101.html

Ramsey, J. B. (1969). Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society Series B 31, 350–371.

Santos Silva, J. (2016). Reset test after xtreg & xi:reg. Obtained from: The Stata Forum: https://www.statalist.org/forums/forum/general-stata-discussion/general/1327362-reset-test-after-xtreg-xi-reg