One frequently discussed way to satisfy the normality assumption in regression models is to specify the Data Generating Process correctly (Rodríguez Revilla, 2014). The objective here is to demonstrate how the functional form can influence the distribution of the residuals in a regression model estimated by ordinary least squares.
Let’s start with a Monte Carlo exercise based on the theory of Mincer (1974), in which we have a Data Generating Process (DGP) for income in a cross-sectional study of the population of a city.
The DGP expressed in (1) is the correct specification of income for the population of our city, where y is income in monetary units, schooling is the individual's years of schooling, and exp is the number of years of experience in the current job. Finally, the DGP includes the square of experience, whose negative sign reflects the decreasing returns of experience on income.
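Equation (1) itself is not reproduced in this text. From the description above, its general shape can be sketched as follows. The symbols \beta_0 to \beta_3 and the error term u_i are placeholders introduced here (assumptions, not the numerical values used in the simulation), with \beta_3 < 0 capturing the decreasing returns to experience:

```latex
y_i = \beta_0 + \beta_1\,\text{schooling}_i + \beta_2\,\text{exp}_i + \beta_3\,\text{exp}_i^{2} + u_i,
\qquad \beta_3 < 0 \tag{1}
```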
Suppose we want to study income in our city, so we start with a simple approximating model for the regression equation. We know from basic reasoning that schooling and experience are related to income, so we propose the model in (2) to study the phenomenon.
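Equation (2) is also missing from this text. Based on the description, it is presumably the linear model in schooling and experience only. A symbolic sketch, where the coefficient symbols and the error term e_i are assumptions introduced for illustration:

```latex
y_i = \beta_0 + \beta_1\,\text{schooling}_i + \beta_2\,\text{exp}_i + e_i \tag{2}
```

Under this reading, any curvature in the effect of experience is left in the error term e_i rather than being modeled.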
Estimating the specification in (2) on our Monte Carlo data, with a sample size of 1000 individuals, we get the following results.
We can see that the coefficient on experience and the constant term are not very close to their values in the DGP, whereas the estimate for years of schooling is approximately accurate. All variables are significant at the 5% level and R^2 is quite high.
We want to make sure we have the right variables, so we use the Ramsey RESET test to check for omitted variables. First we predict the residuals of the above regression with predict u, res, and then we run the omitted-variable test with estat ovtest:
The Ramsey test does not reject the null hypothesis of no omitted variables at the 5% significance level, which suggests that we are using the right variables. Let’s now check the normality assumption by plotting the distribution of the predicted residuals; in Stata we use the command histogram u, norm
Graphically, the residuals do not look normally distributed. To confirm this, we perform a formal test with sktest u, which gives the following results.
The normality test rejects the null hypothesis: at the 5% significance level, the predicted residuals are not normally distributed. This undermines the exact inference based on the t statistics of the coefficients in the regression of equation (2).
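The steps so far can be sketched as one Stata do-file. This is a minimal illustration, not the original simulation: the seed, the distributions of the regressors and every coefficient in the line that generates y are hypothetical values chosen here, since the text does not report the ones actually used. The numerical output, including the test results, will therefore differ from what is described above.

```stata
* Simulate a hypothetical DGP (all parameter values are assumptions)
clear
set seed 12345
set obs 1000
generate schooling  = floor(runiform()*16) + 1
generate experience = floor(runiform()*40)
generate e = rnormal(0, 50)
generate y = 300 + 80*schooling + 40*experience - 0.8*experience^2 + e

* Misspecified model: the squared experience term is left out
regress y schooling experience

* Residuals, Ramsey RESET test, and normality checks
predict u, residuals
estat ovtest
histogram u, normal
sktest u
```

The commands predict, estat ovtest, histogram and sktest are the same ones used in the text; only the data-generating lines are invented for the example.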
We should go back to the functional form of the regression model in (2) and consider that experience might have decreasing or increasing returns on income. So we adapt our specification by including the squared term of experience, which lets the marginal effect of the variable change with its level:
To estimate this model in Stata, we first need to generate the squared experience term. To do this we type gen exp_sq=experience*experience, where experience is our variable.
We now have the squared experience variable, which we include in the regression command as the following image shows.
We can see that the coefficients are quite close to those of the DGP in (1), because the specification is closer to the true relationship among the variables in our simulated exercise. The negative sign on the squared term indicates decreasing returns of experience on income, and the marginal effect is given by:
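The expression itself is missing from this text. Assuming the quadratic specification described above, with generic coefficients \beta_2 on experience and \beta_3 on its square (symbols introduced here for illustration), differentiating with respect to experience gives:

```latex
\frac{\partial y}{\partial\,\text{exp}} = \beta_2 + 2\,\beta_3\,\text{exp},
\qquad
\text{exp}^{*} = -\frac{\beta_2}{2\,\beta_3}
```

With \beta_2 > 0 and \beta_3 < 0, the marginal effect shrinks as experience grows and reaches zero at exp^*, the level of experience at which predicted income peaks.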
Let’s predict the residuals of our new regression model with predict u2, res and check their distribution using histogram u2, norm
By graphical inspection, the residuals appear to be normally distributed. We confirm this with a formal normality test, using the command sktest u2
According to this last result, we cannot reject the null hypothesis of a normal distribution for the predicted residuals of our second regression model at the 5% significance level, so we treat the residuals of our last estimates as normally distributed.
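The corrected specification can be sketched in the same way. The block below repeats a small simulation so that it runs on its own; the seed, regressor distributions and coefficients are again hypothetical values, not the ones behind the results reported in the text.

```stata
* Simulate a hypothetical DGP (all parameter values are assumptions)
clear
set seed 12345
set obs 1000
generate schooling  = floor(runiform()*16) + 1
generate experience = floor(runiform()*40)
generate e = rnormal(0, 50)
generate y = 300 + 80*schooling + 40*experience - 0.8*experience^2 + e

* Corrected model: add the squared experience term
generate exp_sq = experience*experience
regress y schooling experience exp_sq

* Residuals and normality checks for the new model
predict u2, residuals
histogram u2, normal
sktest u2
```

Because the simulated error term here is drawn from a normal distribution and the regression matches the simulated functional form, the residuals u2 should behave much like that error term.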
The conclusion of this exercise is that even if we have the right variables for a regression model, as we assumed in equation (2), the residuals may fail to be normally distributed when the functional form is misspecified.
Correcting the functional form of the regression model can therefore be considered a remedy for non-normality problems, since the relationships among the variables are modeled better. In real estimations, however, finding the right functional form is often harder, because it is tied to data problems, non-linear relationships, external shocks and atypical observations. Even so, it is worth inspecting the data to look for the proper functional form of the variables, in order to build a regression model that comes as close as possible to the data generating process.
Mincer, J. (1974). Education, Experience and the Distribution of Earnings and of Employment. New York: National Bureau of Economic Research (for the Carnegie Commission).
Rodríguez Revilla, R. (2014). Econometria I y II. Bogotá, Colombia: Universidad Los Libertadores.
StataCorp (2017). Stata Statistical Software: Release 15. College Station, TX: StataCorp LLC. Available at: https://www.stata.com/products/