Many types of economic models consist of equations for which no closed-form solution exists. In these cases, we can use a method known as log-linearisation. One example is the class of Dynamic Stochastic General Equilibrium (DSGE) models, which are non-linear in both parameters and variables. Because of this, solving and estimating these models is challenging.
Hence, we have to use approximations to the non-linear models. This involves concessions, since some features of the models are lost, but the models become more manageable.
In the simplest terms, we first take the natural logs of the non-linear equations and then we linearise the logged difference equations about the steady state. Finally, we simplify the equations until we have linear equations where the variables are percentage deviations from the steady state. We use the steady state as that is the point where the economy ends up in the absence of future shocks.
Historically, most estimation in the literature relied on linearised models, but since the global financial crisis non-linear models have been used more and more. Many discrete-time dynamic economic problems require the use of log-linearisation.
There are several ways to do log-linearisation, some examples of which are provided in the bibliography below.
One of the main methods is the application of a Taylor series expansion. Taylor’s theorem gives the first-order approximation of any differentiable function around a point x*, as below.
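In standard notation, for a differentiable function f and an expansion point x*, the first-order approximation can be written as follows (the symbol f is only a placeholder for whichever function is being approximated):

```latex
f(x) \approx f(x^{*}) + f'(x^{*})\,(x - x^{*})
\qquad\Longrightarrow\qquad
\ln x \approx \ln x^{*} + \frac{x - x^{*}}{x^{*}}
```

The second expression is the special case f(x) = ln x, which is the one used repeatedly when log-linearising.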
We can use this to log-linearise equations around the steady state. Since we would be log-linearising around the steady state, x* would be the steady state.
For example, let us consider a Cobb-Douglas production function and then take a log of the function.
The next step would be to apply Taylor Series Expansion and take the first order approximation.
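Since the steady-state values themselves satisfy the logged production function, those parts of the expansion cancel out. We are left with the deviation terms only.
For notational ease, we define these terms as percentage deviations of x from x*, where x* signifies the steady state, which gives the final linear expression.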
At last, we have log-linearised the Cobb-Douglas production function around the steady state.
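The steps above can be written out as a worked derivation. This is a sketch that assumes the specification Y_t = A_t K_t^α L_t^{1-α}; the exact functional form and the symbols A, K and L are assumptions made for illustration, and tildes denote percentage deviations from the steady state.

```latex
\begin{align*}
Y_t &= A_t K_t^{\alpha} L_t^{1-\alpha} \\
\ln Y_t &= \ln A_t + \alpha \ln K_t + (1-\alpha)\ln L_t \\
\ln Y^{*} + \frac{Y_t - Y^{*}}{Y^{*}} &\approx \ln A^{*} + \frac{A_t - A^{*}}{A^{*}}
  + \alpha\left(\ln K^{*} + \frac{K_t - K^{*}}{K^{*}}\right)
  + (1-\alpha)\left(\ln L^{*} + \frac{L_t - L^{*}}{L^{*}}\right) \\
\ln Y^{*} &= \ln A^{*} + \alpha \ln K^{*} + (1-\alpha)\ln L^{*}
  \quad \text{(steady state, cancels out)} \\
\frac{Y_t - Y^{*}}{Y^{*}} &\approx \frac{A_t - A^{*}}{A^{*}}
  + \alpha\,\frac{K_t - K^{*}}{K^{*}} + (1-\alpha)\,\frac{L_t - L^{*}}{L^{*}} \\
\tilde{y}_t &\approx \tilde{a}_t + \alpha\,\tilde{k}_t + (1-\alpha)\,\tilde{l}_t,
  \qquad \tilde{x}_t \equiv \frac{x_t - x^{*}}{x^{*}}
\end{align*}
```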
The Box-Pierce test comes from the article “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models”, published in the Journal of the American Statistical Association (Box & Pierce, 1970).
The approach is used here to test for first-order serial correlation; the general form of the test statistic is given as:
Where the Box-Pierce Q statistic is defined as the number of observations multiplied by the sum of the squared sample autocorrelations ρ at each lag h. The test is closely related to the Ljung & Box (1978) autocorrelation test, and it is used to determine the existence of serial correlation in time series analysis. Under the null hypothesis, the statistic follows a chi-square distribution asymptotically.
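The statistic described above can be written as follows, where n is the number of observations, m is the highest lag considered (the symbol m is my own notation) and the hatted ρ is the sample autocorrelation at lag h. The closely related Ljung-Box statistic is shown next to it for comparison.

```latex
Q_{BP} = n \sum_{h=1}^{m} \hat{\rho}_{h}^{\,2},
\qquad
Q_{LB} = n(n+2) \sum_{h=1}^{m} \frac{\hat{\rho}_{h}^{\,2}}{n-h}
```

For the first-order case discussed here, m = 1 and the Box-Pierce statistic reduces to n times the squared first-order autocorrelation.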
The null hypothesis of this test can be defined as H0: the data are independently distributed, against the alternative hypothesis H1: the data are not independently distributed. In other words, the null hypothesis is that the data have no autocorrelation structure, while the alternative proposes that they do.
The test was implemented in Stata for the panel data structure by Emad Abd Elmessih Shehata & Sahra Khaleel A. Mickaiel (2014). It works in the context of ordinary least squares panel data regression (the pooled OLS model), and we will develop an example here.
First we install the package using the command ssc install as follows:
ssc install lmabpxt, replace
Then we type help lmabpxt to see the available options.
The help file displays the general form of the syntax.
In that syntax, id(var) and it(var) represent the identifier of the individuals (id) and the identifier of the time structure (it), so we need to specify both in the command.
Consider the next example
clear all
use http://www.stata-press.com/data/r9/airacc.dta
xtset airline time, y
reg pmiles inprog
lmabpxt pmiles inprog, id(airline) it(time)
Notice that the Box-Pierce test implemented by Emad Abd Elmessih Shehata & Sahra Khaleel A. Mickaiel (2014) re-estimates the pooled regression before displaying its general output.
In this case, we can see a p-value associated with the Lagrange multiplier version of the Box-Pierce test. This p-value is around 0.96; therefore, at a 5% level of significance, we cannot reject the null hypothesis of no AR(1) panel autocorrelation in the residuals.
Consider now that you might be using a fixed effects approach. A simple way to do this is to include dummy variables for the individuals (airlines in this case), as in least squares dummy variables (LSDV) regression, and then compare the results.
To do that we can use:
tab airline, gen(a)
and then include a2 to a20 in the regression structure, leaving a1 out as the base category.
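The LSDV step described above could be run as in the sketch below. It assumes that the dummies a1 to a20 were created with tab airline, gen(a) on the airacc data, and it simply reuses the lmabpxt call from the earlier example with the dummies added to the variable list.

```stata
* LSDV version: airline dummies a2-a20 added, a1 omitted as the base category
reg pmiles inprog a2-a20
* Same Box-Pierce panel test as before, now with the dummies in the model
lmabpxt pmiles inprog a2-a20, id(airline) it(time)
```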
This is different from the error component structure; it is simply a fixed effects approach using least squares dummy variable regression. Notice the output.
Using the fixed effects approach with dummy variables, the p-value has decreased considerably. In this case, we reject the null hypothesis at a 5% level of significance, meaning that we might have a problem of first-order serial correlation in the panel data.
With this example, we have performed the Box-Pierce test for panel data (and, additionally, we established that it is sensitive to the fixed effects in the regression structure).
The lmabpxt command appears to be somewhat sensitive when the number of observations is large (more than 5000 units).
Shehata, Emad Abd Elmessih & Sahra Khaleel A. Mickaiel have made an impressive compilation of contributions, which can be found at the following link:
I suggest you check it out if you need anything related to Stata.
Box, G. E. P. and Pierce, D. A. (1970) “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models”, Journal of the American Statistical Association, 65: 1509–1526. JSTOR 2284333
G. M. Ljung; G. E. P. Box (1978). “On a Measure of a Lack of Fit in Time Series Models”. Biometrika 65 (2): 297-303. doi:10.1093/biomet/65.2.297.
Shehata, Emad Abd Elmessih & Sahra Khaleel A. Mickaiel (2014) LMABPXT: “Stata Module to Compute Panel Data Autocorrelation Box-Pierce Test”
In regression analysis, we often check the assumptions of the estimated econometric model. One of the key assumptions is that the model has no omitted variables (and that it is correctly specified). Ramsey (1969) developed an omitted variable test, which basically uses the powers of the predicted values of the dependent variable to check whether the model has an omitted variable problem.
Assume a basic fitted model given by:
Where y is the n×1 vector containing the dependent variable, X is the n×k matrix that contains the explanatory variables (n is the total number of observations and k is the number of independent variables), and b represents the estimated coefficient vector.
The Ramsey test fits a regression model of the type
Where z represents the powers of the fitted values of y. The Ramsey test performs a standard F test of t=0, and the default setting uses the second, third and fourth powers of the fitted values.
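In matrix notation, the three expressions referred to above can be sketched as follows. The error terms e and u and the equation numbers are my own labels, chosen so that the panel model discussed later corresponds to equation (4).

```latex
\begin{align}
y &= Xb + e \tag{1} \\
y &= Xb + Zt + u \tag{2} \\
Z &= \left[\hat{y}^{2},\ \hat{y}^{3},\ \hat{y}^{4}\right], \qquad \hat{y} = Xb \tag{3}
\end{align}
```

The F test of t = 0 in (2) checks whether the columns of Z in (3) add explanatory power beyond X.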
In Stata this is easily done with the command estat ovtest, run after the regression command reg.
To illustrate this, consider the following code, which uses Stata's auto dataset (loaded with sysuse auto):
regress mpg weight foreign
estat ovtest
The null hypothesis is that t=0, which means that the powers of the fitted values do not help to explain the dependent variable y, so the model has no omitted variables. The alternative hypothesis is that the model suffers from an omitted variable problem.
In the panel data structure, where we have multiple time periods and multiple individuals observed at each time point, we fit a model like:
With i=1, 2, 3, …, n individuals and, for each i, t=1, 2, …, T time periods. Here v represents the heterogeneous effect, which can be estimated as a parameter (in fixed effects, where it can be correlated with the explanatory variables) or treated as a random variable (in random effects, where it is not correlated with the explanatory variables).
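A standard way to write the panel model just described is given below. The symbols for the coefficient vector and the idiosyncratic error are assumptions; the number (4) matches the reference made to this regression in the text.

```latex
\begin{equation}
y_{it} = x_{it}\beta + v_{i} + \varepsilon_{it},
\qquad i = 1,\dots,n, \quad t = 1,\dots,T \tag{4}
\end{equation}
```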
To implement the Ramsey test manually for this regression structure in Stata, we will follow the recommendation of Santos Silva (2016). We start by predicting the fitted values of the regression (including the heterogeneous effects!). Then we generate the powers of the fitted values and include them in the regression in (4) with clustered standard errors. Finally, we perform a joint significance test for the coefficients of the powers.
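* Load the example panel data (assumed here to be Stata's nlswork dataset)
webuse nlswork, clear
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south, fe cluster(idcode)
* Fitted values including the fixed effects (xb + u_i)
predict y_hat, xbu
gen y_h_2=y_hat*y_hat
gen y_h_3=y_h_2*y_hat
gen y_h_4=y_h_3*y_hat
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south y_h_2 y_h_3 y_h_4, fe cluster(idcode)
test y_h_2 y_h_3 y_h_4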
Alternatively, you can skip the generation of the powers and apply them directly using the c. and # operators in the command, as in this other code:
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south, fe cluster(idcode)
* Fitted values including the fixed effects (skip this line if y_hat already exists)
predict y_hat, xbu
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure c.tenure#c.tenure 2.race not_smsa south c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat#c.y_hat, fe cluster(idcode)
test c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat c.y_hat#c.y_hat#c.y_hat#c.y_hat
At the end of the procedure you will obtain the result of the joint test.
The null hypothesis is that the model is correctly specified and has no omitted variables. In this case, however, we reject the null hypothesis at a 5% level of significance, meaning that our model has omitted variables.
As an alternative that is somewhat more restricted but offers more features, you can use the user-written package “resetxt” developed by Emad Abd & Sahra Khaleel (2015), which can be installed with:
ssc install resetxt, replace
This package, however, doesn’t work with factor variables or time series operators, so we cannot include the c., i., d. or L. operators, for example. The squared terms therefore have to be generated manually:
gen age_sq=age*age
gen ttl_sq=ttl_exp*ttl_exp
gen tenure_sq=tenure*tenure
xtreg ln_w grade age age_sq ttl_exp ttl_sq tenure tenure_sq race not_smsa south, fe cluster(idcode)
resetxt ln_w grade age age_sq ttl_exp ttl_sq tenure tenure_sq race not_smsa south, model(xtfe) id(idcode) it(year)
However, the above code might be demanding to compute in Stata, depending on how much memory you have available for the procedure. That’s why this post implemented the manual procedure of the Ramsey test for the panel data structure.
Emad Abd, S. E., & Sahra Khaleel, A. M. (2015). RESETXT: Stata Module to Compute Panel Data REgression Specification Error Tests (RESET). Obtained from: Statistical Software Components S458101: https://ideas.repec.org/c/boc/bocode/s458101.html
Ramsey, J. B. (1969). Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society Series B 31, 350–371.
Santos Silva, J. (2016). Reset test after xtreg & xi:reg . Obtained from: The Stata Forum: https://www.statalist.org/forums/forum/general-stata-discussion/general/1327362-reset-test-after-xtreg-xi-reg?fbclid=IwAR1vdUDn592W6rhsVdyqN2vqFKQgaYvGvJb0L2idZlG8wOYsr-eb8JFRsiA
“A replication study attempts to validate the findings of a published piece of research. By doing so, that prior research is confirmed as being both accurate and broadly applicable”
A replication process generally consists of two parts. The first part is concerned with reproducing key findings from the original study. If this step was successful, the next part will be performing robustness checks. Meta-analysis reveals another side of replicating published research. Meta-based studies survey the empirical results of a group of published papers attempting to test three key dimensions— statistical power, selective reporting bias, and between-study heterogeneity.
From the perspective of contributing to scientific research, replication studies are important for the continued progress of science. Given the relative scarcity of replication studies and in recognition of the importance of these methods, there has been increasing attention by editors of A-class journals (American Economic Review, Journal of Political Economy, Review of Economic Studies, Journal of Applied Econometrics) in publishing replicative studies.
The one-day intensive online workshop “Econometric Replication: Methods & Guidelines for Designing a Replicated Study”, held on 29 June 2020, will teach you theoretically and practically how to design a novel replicated study.
Last month, while I was going through the literature on military expenditure and economic growth, I found a short statement in an article that points to one of the less discussed issues in econometrics. The statement is:
“The Holy Grail of applied econometrics is a tight theoretical model, which fits the data well. Like the Holy Grail, such models are hard to find.” (Dunne, Smith, & Willenbockel, 2005)
When one, as a researcher, reflects on this, one realizes that matching theoretical models with regression equations is indeed hard. Even though econometrics can be defined as the measurement and validation part of economic science, the relationships under study are not exactly as accurate as the theory states.
As an example, let’s look at the conclusions of the Solow-Swan (1956) model with technology, which are summarized in the next equation.
Where Y/L is the gross domestic product (GDP) of the economy measured in per capita units, A is the level of technology, α is the elasticity of output with respect to the aggregate stock of capital, s is an exogenous saving rate, δ is the depreciation rate, x is the growth rate of technology, and n is the growth rate of the population.
The term ε is added as the stochastic error in the equation in order to proceed with the regression analysis. Theoretically, it is defined as independent of the variables in the regression and represents external shocks to per capita output. However, if this does not hold in the time series context, this term may contain all the variables not included in the regression, therefore violating the exogeneity assumption and inducing omitted variable bias through misspecification.
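Using the symbols defined above, the steady-state Solow equation in its usual estimable form (as presented by Mankiw, Romer & Weil, 1992) can be sketched as follows. Writing the technology level as an initial value A_0 plus a trend x·t is an assumption about how A enters the original equation.

```latex
\ln\left(\frac{Y_t}{L_t}\right) = \ln A_0 + x\,t
  + \frac{\alpha}{1-\alpha}\ln(s)
  - \frac{\alpha}{1-\alpha}\ln(n + x + \delta) + \varepsilon
```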
The model is telling us that the growth of the economy is positively given by the technology and the rate of saving of the economy which is invested in physical capital.
Now the augmented Solow model proposed by Mankiw, Romer & Weil (1992), also known as the MRW model, concludes the following:
Here we have some new terms: β is the elasticity of output with respect to the aggregate stock of human capital in the production function, and the saving rate is split into s_k, the saving rate dedicated to the accumulation of physical capital, and s_h, the saving rate dedicated to the accumulation of human capital.
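With the same notation, the steady-state equation of the augmented model in its usual form can be sketched as below. As before, the decomposition of technology into A_0 and the trend x·t is an assumption about the original expression.

```latex
\ln\left(\frac{Y_t}{L_t}\right) = \ln A_0 + x\,t
  - \frac{\alpha+\beta}{1-\alpha-\beta}\ln(n + x + \delta)
  + \frac{\alpha}{1-\alpha-\beta}\ln(s_k)
  + \frac{\beta}{1-\alpha-\beta}\ln(s_h) + \varepsilon
```

Setting β = 0 collapses this expression back to the simple Solow-Swan case, which makes the omitted human capital term easy to see.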
The Augmented Model proposed by Mankiw, Romer & Weil has more variables in the specification of the growth of the economy.
Which one is correct? The answer relies on the regressions they performed with both models. In general, the augmented model explains economic growth and the convergence of economies better than the simple Solow-Swan model.
The simple Solow-Swan model has a specification problem in the form of an omitted variable; the augmented Solow-Swan model corrects this by introducing a measure of human capital accumulation and its importance. Both are theoretical constructions, but the augmented model fits reality better than the original model.
Going further, one could ask whether it would be wrong to consider all variables as endogenous. In the last two models, the saving rates for physical and human capital are exogenous, along with the growth rate of technology. However, further theoretical considerations, like the Ramsey (1928) model, could determine savings endogenously, and even depreciation and technology can be endogenized, so estimating the above equation with two-stage or three-stage least squares might be a more suitable approach.
Considering this set of ideas, econometricians face a difficult situation when the theoretical approach is not suitable for the reality of the sample. I say this because this is a complex world, where a single explanation for all situations is not plausible.
We also need to remember that the whole objective of theory is to explain reality; if a theory fails in this objective, even the most logical explanation would be useless. What makes no sense at all is to modify reality so that it matches the theory.
The holy grail, then, would be the match between theory and reality. In econometrics, this means that we need to find a strong theoretical framework that matches our data generating process. The validation techniques, in turn, should follow a logical approach that considers the assumptions of the regression.
Going backward, before theory and empirical methods, we are interested in finding the truth, and this means discovering which relationships and causal links exist and which do not, in order to explain reality. Such findings, even when they start from a deviated or wrong approach, are useful for building knowledge.
A great example of this is the Phillips curve (Phillips, 1958). It started as an empirical fact that related the rate of wage inflation to the level of employment, and it was then studied intensively by Phelps (1967) and Friedman (1977), who added theoretical concepts such as expectations about inflation.
Econometricians should therefore do research with sound economic logic when they set out to estimate relationships, while being aware that samples and individuals are not the same everywhere (they change across locations and over time). The theoretical framework remains the main basis we should always consider in economic research, but we should also remember that we can propose a new theoretical framework to explain reality on the basis of facts and past theories.
Dunne, J., Smith, R. P., & Willenbockel, D. (2005). Models of Military Expenditure and Growth: A Critical Review. Defence and Peace Economics, Volume 16, Issue 6, 449-461.
Friedman, M. (1977). Nobel Lecture: Inflation and Unemployment. Journal of Political Economy, Vol. 85, No. 3 (Jun., 1977), 451-472.
Kwat, N. (2018). The Circular Flow of Economic Activity. Retrieved from Economics Discussion.
Mankiw, N. G., Romer, D., & Weil, D. N. (1992). A Contribution to the Empirics of Economic Growth. Quarterly Journal of Economics, 407-440.
Marmolejo, I. (2012). Indifference Curve Confusion and Possible Critique. Retrieved from Radical Subjectivist.
Nicholson, W. (2002). Microeconomic Theory. México D.F.: Thompson Learning.
Phelps, E. (1967). Phillips Curves, Expectations of Inflation and Optimal Unemployment over Time. Economica, New Series, Vol. 34, No. 135 (Aug., 1967), 254-281.
Phillips, A. W. (1958). The Relation between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861-1957. Economica, New Series, Vol. 25, No. 100 (Nov., 1958), 283-299.
Ramsey, F. P. (1928). A mathematical theory of saving. Economic Journal, vol. 38, no.
Solow, R. (1956). A Contribution to the Theory of Economic Growth. The Quarterly Journal of Economics, Vol. 70, No. 1 (Feb., 1956), 65-94.
Following the last post, which gave an example of how to model the Cobb-Douglas utility function in microeconometrics, we now turn to an important aspect of consumer behavior: the budget constraint (a linear monetary constraint), which limits the number of goods that the consumer can buy and use to reach a certain level of utility.
In this article, I want to start with an introduction to the basic concept of the budget constraint related to income in microeconomics, that is, the linear constraint given by the quantities and prices of the goods that determine the consumer's utility. This is closely related to the Cobb-Douglas utility function (and utility functions overall), since it is one of the main aspects of microeconomic theory.
Keeping the utility function as the traditional Cobb-Douglas function:
We know that utility is sensitive to the elasticities α and β, with α and β less than or equal to one. Since resources are not infinite, the amount of goods that the consumer can pay for is not infinite either. In markets, the only way to get goods and services is with money, and according to the circular flow of the economy, the factor market remunerates two productive factors in particular, labor and capital, so we can say that consumers have a level of income derived from their productive activities.
In microeconomic theory in general, utility U is restricted by the income of the consumer within a maximization process with a linear constraint containing the goods consumed and their prices. The budget constraint for the two-good model looks as follows:
Where I is the income of the individual, Px is the price of good X and Py is the price of good Y. One might wonder whether the income of the consumer really is the sum of prices times quantities, which at first glance doesn’t seem close to what the circular flow states. Income could be defined as the sum of the salary and the overall returns on the productive activities of the consumer (like returns on assets), and nothing of the sort appears in the budget equation.
However, if you look at the equation as a reflection of all the spending on goods (assuming the consumer spends everything), it will exactly match all that the consumer has earned from productive activities.
The maximization problem of the consumer is established as:
The typical solution uses the Lagrange multiplier, where the whole Lagrangian function can be stated as:
A useful trick to remember how to write this function is that if λ is positive, then income enters positively and prices times goods enter negatively (we’re moving everything in the constraint equation to one side). The first-order conditions are given by:
By simply dividing the first two first-order conditions you’ll get the solution to the consumer’s problem, which satisfies the following ratio:
Each good is then primarily sensitive to its own price and its weight (elasticity) in the utility function, followed by the price and quantity of the other good. If we substitute one of the solutions, say X, into the last first-order condition (the budget constraint), we’ll get:
Taking Py*Y as a common factor will result in:
The quantity of good Y equals income times the elasticity β, divided by the price of good Y times the sum of the elasticities. As stated before, α + β = 1, so β = 1 - α, and the optimum quantities of the goods can now be defined as:
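The whole derivation described above can be written out step by step. This is a sketch that uses U = X^α Y^β as the utility function and I = P_x X + P_y Y as the budget constraint, with the Lagrangian sign convention described in the text (income positive, spending negative).

```latex
\begin{align*}
\max_{X,Y}\; & U = X^{\alpha}Y^{\beta} \quad \text{s.t.} \quad I = P_x X + P_y Y \\
\mathcal{L} &= X^{\alpha}Y^{\beta} + \lambda\,(I - P_x X - P_y Y) \\
\frac{\partial \mathcal{L}}{\partial X} &= \alpha X^{\alpha-1}Y^{\beta} - \lambda P_x = 0 \\
\frac{\partial \mathcal{L}}{\partial Y} &= \beta X^{\alpha}Y^{\beta-1} - \lambda P_y = 0 \\
\frac{\partial \mathcal{L}}{\partial \lambda} &= I - P_x X - P_y Y = 0 \\
\frac{\alpha Y}{\beta X} &= \frac{P_x}{P_y}
  \;\Longrightarrow\; X = \frac{\alpha}{\beta}\,\frac{P_y Y}{P_x} \\
I &= \frac{\alpha}{\beta} P_y Y + P_y Y = P_y Y\,\frac{\alpha+\beta}{\beta} \\
Y^{*} &= \frac{\beta I}{(\alpha+\beta)P_y} = \frac{(1-\alpha)\,I}{P_y},
  \qquad X^{*} = \frac{\alpha I}{P_x}
\end{align*}
```

In words, with α + β = 1 the consumer spends the share α of income on X and the share 1 - α on Y.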
This optimum is displayed graphically ahead, and it represents the point where utility is at its maximum given a certain level of income and a set of prices for two goods. If you want to expand this analysis, please refer to Nicholson (2002).
The budget constraint: An econometric appreciation
Suppose we have a sample of n individuals who consume only a finite number of goods. The income of each individual is given, and so are the quantities of each good. How would we be able to estimate the average price of each good? We start by assuming that income is a function of prices and quantities, as in the next expression:
Where X_1 is good number one, associated with its price P_1. Income is then the sum of all quantities multiplied by their prices or, simply, the sum of all expenses. That’s the demand-based approach to income. In this case we have m goods consumed.
Now assume we can replace each price with a coefficient to be estimated.
Looks familiar, doesn’t it? It’s a regression structure, so in theory we are able to estimate each price with ordinary least squares, taking the β coefficients associated with each good as the estimated prices, and treating income as the other side of the coin of the spending process.
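The two expressions discussed above can be sketched as follows for individual i and m goods. The error term in the second line is an assumption added so that the expression reads as a regression equation.

```latex
\begin{align*}
I_i &= P_1 X_{1i} + P_2 X_{2i} + \dots + P_m X_{mi} \\
I_i &= \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_m X_{mi} + \varepsilon_i
\end{align*}
```

Each estimated β_j is then read as the average price paid for good j.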
The simulation exercises
Assume we have a process which relates the following variables (interpret it as the Data Generating Process, DGP):
Where I is the total income, Px, Py and Pz are the given prices of the goods X, Y and Z, and s is a certain amount of savings, all for individual i. According to the DGP, this population uses its income not only to buy the goods X, Y and Z, but also to deposit an amount of savings s. The prices used in the Monte Carlo approach are Px=10, Py=15, and Pz=20.
If we regress income on the demanded quantities of each good, we’ll have:
The coefficients don’t match our DGP, because our model suffers from omitted variable bias. In this case, we’re not taking into account that income is not only the sum of expenses on goods; part of it also goes to savings. Regressing the expression with the s variable included, we have:
The coefficients for the prices of each good (X, Y, Z) now match our DGP almost exactly, and the R-squared has increased considerably, from 51.45% to 99.98%. The overall variance of the model has also been reduced. The interesting thing to note here is that an increase of one monetary unit in savings is associated with an increase of one monetary unit in income.
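A minimal Stata sketch of this kind of Monte Carlo exercise is shown below. Only the prices (10, 15, 20) and the structure of the DGP (income equals spending on X, Y and Z plus savings) come from the text. The sample size, the seed, the distributions of the quantities and the assumption that savings are correlated with the consumption of X are hypothetical choices made for illustration, so the output will not reproduce the figures quoted above.

```stata
clear
set seed 12345
set obs 1000
* Quantities of each good (uniform draws, an illustrative assumption)
gen x = 1 + 49*runiform()
gen y = 1 + 49*runiform()
gen z = 1 + 49*runiform()
* Savings, assumed here to be correlated with the consumption of x
gen s = 100 + 8*x + rnormal(0, 50)
* DGP: income = 10*x + 15*y + 20*z + s, plus a small noise term
gen income = 10*x + 15*y + 20*z + s + rnormal(0, 5)
* Misspecified model: savings omitted, so the coefficient on x is biased
reg income x y z
* Model including savings: coefficients should be close to 10, 15, 20 and 1
reg income x y z s
```

The correlation between s and x matters for the illustration: if savings were independent of all three quantities, omitting s would mainly inflate the residual variance rather than bias the price estimates.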
Remember that this is not an exercise in causality; it is rather an exercise in correlation. In this case, we’re just using the information on the goods consumed by the individuals in our sample to estimate the average price of each good. If we have a misspecification problem, such an approach cannot be performed.
This is one way to estimate the prices that consumers pay for each good. However, keep in mind the underlying assumptions: 1) prices are given for everyone and do not vary across individuals; 2) the quantities of the goods and the amount of savings must be known for each individual, and spending (including money deposited in savings) must be assumed to equal income; 3) the spending of each individual must be assumed to be distributed among the goods and other variables, all of which have to be included in the regression, otherwise omitted variable bias can distort the estimators for the goods we’re analyzing.
Kwat, N. (2018). The Circular Flow of Economic Activity. Economics Discussion. Retrieved from: http://www.economicsdiscussion.net/circular-flow/the-circular-flow-of-economic-activity/18159
Marmolejo, I. (2012). Indifference Curve Confusion and Possible Critique. Radical Subjectivist. Retrieved from: https://radicalsubjectivist.wordpress.com/2012/02/10/indifference-curve-confusion-and-possible-critique/
Nicholson, W. (2002). Microeconomic Theory. México D.F.: Thompson Learning.
In microeconometrics, we can find applications that range from latent variable models of market decisions (like logit and probit models) to techniques for estimating the basic relationships for consumers and producers.
In this article, I want to start with an introduction to a basic concept in microeconomics, the Cobb-Douglas utility function, and its estimation with Stata. We’ll review the basic utility function and some mathematical forms used to estimate it, and finally we’ll see an application using Stata.
Let’s start with the traditional Cobb-Douglas function:
Depending on the elasticities α and β for goods X and Y, the utility function above gives the respective preferences of the consumer. In basic terms, we restrict α + β = 1 in order to have an appropriate utility function which reflects a rate of substitution between the two goods X and Y. If we assume a constant level of utility U* for the consumer, we can graph the curve by solving the equation for Y.
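In symbols, the two-good Cobb-Douglas utility function and the indifference curve obtained by solving it for Y at a constant utility level U* are:

```latex
U = X^{\alpha}Y^{\beta},
\qquad
Y = \left(\frac{U^{*}}{X^{\alpha}}\right)^{1/\beta}
```

For example, with α = β = 0.5 the curve simplifies to Y = (U*)^2 / X.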
The behavior of the utility function is then given by the quantity of good Y explained by X and the respective elasticities α and β. We can graph the indifference curve for a constant utility level in terms of the quantities of X and Y. For a start, we will assume that α = 0.5 and β = 0.5, for which the function has the following pattern at a given level of utility U* (for example U = 10); this reflects the substitution between the goods.
If you wonder what happens when we alter the elasticity of each good, for example to α = 0.7 and β = 0.3, the result would be a faster-decaying curve than the previous pattern.
Estimating the utility function of the Cobb-Douglas type will require data of a set of goods (X and Y in this case) and the utility.
Also, it would imply that you somehow measured the utility (that is, selecting a unit or a measure for the utility); sometimes this can be in monetary units or more complex ideas deriving from subjective utility measures.
Applying logarithms to the equation of the Cobb-Douglas function would result in:
Which with properties of logarithms can be expressed as:
This also linearizes the function, and we can see that the only things we don’t know about the original function are the elasticities α and β. The above equation fits perfectly into a regression model with two explanatory variables. Remember to add the stochastic part when modeling the function (that is, to include the error term in the expression). With this, we can regress the logarithm of the consumers' utility on the logarithms of the demanded quantities of goods X and Y. The result allows us to estimate the behavior of the curve.
However, some assumptions must be noted: 1) we’re assuming that the individuals i in our sample (or subsample) tend to have a similar utility function; 2) the estimated elasticity for each good is a generalization of individual behavior to the aggregate. One could argue that each individual i has a different utility function to maximize, and also that the elasticities for each good differ across individuals. But we can also argue that if the individuals i are somewhat homogeneous (regarding income, tastes, and priorities, for example, people of the same socioeconomic stratum), we might be able to proceed with the estimation of the function to model consumer behavior toward the goods.
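The logarithmic steps just described can be written as follows. The error term in the last line is the stochastic part mentioned in the text; its symbol is my own choice.

```latex
\begin{align*}
\ln U &= \ln\left(X^{\alpha}Y^{\beta}\right) \\
\ln U &= \alpha\ln X + \beta\ln Y \\
\ln U_i &= \alpha\ln X_i + \beta\ln Y_i + \varepsilon_i
\end{align*}
```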
The Stata application
As a first step, we inspect the data graphically. The scatter command is useful here since it displays the behavior and correlation of the utility (U) and the goods (X and Y). Adding simple fitted lines, the commands are:
twoway scatter U x || lfit U x
twoway scatter U y || lfit U y
Up to this point, we can detect a higher dispersion for good Y. Also, the slope of the fitted line is different for each good. For now, this might lead us to assume that the overall preference of the n individuals is, on average, higher for good X than for good Y. The slope is in fact telling us that an increase of one unit in good X comes with a sizeable increase in utility (U), while the slope of the fitted line for good Y tells us that, comparatively speaking, it doesn’t increase utility as much as good X. For this cross-sectional study, it is also useful to calculate Pearson’s correlation coefficient. This can be done with:
correlate U y x
The coefficients indicate some linear association between the utility (U) and good Y, and a stronger linear relationship between the utility and good X. Finally, there is an inverse, but weak and not significant, linear relationship between goods X and Y. The negative sign suggests that they are substitutes for each other.
Now, instead of regressing U on X and Y, we need to convert the variables into logarithms, because we want to linearize the Cobb-Douglas utility function.
gen ln_U=ln(U)
gen ln_X=ln(x)
gen ln_Y=ln(y)
reg ln_U ln_X ln_Y
And now we perform the regression without the constant (the noconstant option of regress).
Both regressions (with and without the constant) place the parameters at about α = 0.6 and β = 0.4, which matches the data generating process of the Monte Carlo simulation. The model with the constant term appears to have a smaller variance, so we shall select its parameters for further analysis.
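For readers who want to reproduce the exercise, here is a minimal sketch of how such a Monte Carlo sample could be generated and estimated. The sample size, the uniform range of the goods, the seed, and the noise scale are all assumptions made for illustration; the data generating process actually used for the results discussed here is not shown in this article.

```stata
clear
set seed 12345
set obs 500

* hypothetical quantities demanded of the two goods
gen x = runiform(1, 100)
gen y = runiform(1, 100)

* Cobb-Douglas utility with alpha = 0.6, beta = 0.4 and multiplicative noise
gen U = x^0.6 * y^0.4 * exp(rnormal(0, 0.1))

* log transformation and the two regressions
gen ln_U = ln(U)
gen ln_X = ln(x)
gen ln_Y = ln(y)
regress ln_U ln_X ln_Y
regress ln_U ln_X ln_Y, noconstant
```

With noise entering multiplicatively, the log model is exactly linear, so both regressions should return slope estimates close to 0.6 and 0.4.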
What would our estimate of this utility function look like for our sample? We can start from the mean value of the utility, obtained from descriptive statistics, and then plot a function with the associated parameters. Remember that we got:
We already know the parameters, and we can assume that the expected utility is the mean utility in our sample. For this, we can use the command:
sum U y x
With this, the estimated function for the utility level U=67.89, with approximate elasticities of 0.6 and 0.4, would look like this:
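A curve of this kind could be drawn as follows. This is a sketch that assumes the utility function has the form U = X^α Y^β, so that solving for Y gives Y = (U / X^α)^(1/β). The plotting range for X is an arbitrary choice made for illustration.

```stata
* indifference curve for U = 67.89 with alpha = 0.6 and beta = 0.4
* twoway function requires the argument to be called x
twoway function y = (67.89 / x^0.6)^(1/0.4), range(20 300) ///
    xtitle("Good X") ytitle("Good Y")
```

Every point on the plotted curve is a combination of X and Y that yields the same utility level of 67.89 under the assumed functional form.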
In this way, we have estimated the indifference curve for a population consisting of a set of i individuals. The expected utility from both goods was assumed to be the mean utility of the sample, and with it we can identify the combinations of goods X and Y that yield the expected utility. This ends our brief example of modeling the Cobb-Douglas utility function in a sample with two goods and defined utilities.
As we saw in other econometrics blogs of M&S Research Hub, the use of logarithms is a common practice in econometrics. Those posts discussed not only the problems that can arise from overusing them, but also their advantage in reducing heteroscedasticity -HT- (Nau, 2019) in the series of a dataset, along with other improvements that this monotonic transformation brings to the data.
In this article, we explore the usefulness of the logarithm transformation for reducing the presence of structural breaks in a time series context. First, we review what a structural break is and the implications of regressing data with structural breaks; finally, we perform a short empirical analysis with the Gross Domestic Product -GDP- of Colombia in Stata.
The structural break
We can define a structural break as a situation where a sudden, unexpected change occurs in a time series variable, or in the relationship between two time series (Casini & Perron, 2018). A structural change might look like this:
The basic idea is to identify abrupt changes in time series variables, but such identification is not restricted to the time domain: a break can also be detected in a scatter plot of X and Y variables where time is not one of the variables. We can distinguish different types of breaks in this context. According to Hansen (2012), we can encounter breaks in 1) mean, 2) variance, and 3) relationships, and we can face single breaks, multiple breaks, and continuous breaks.
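To make two of these break types concrete, the following sketch simulates them on artificial data. Everything here is hypothetical: the variable names, the break position at observation 50, and the magnitudes were chosen only for illustration.

```stata
clear
set seed 2024
set obs 100
gen t = _n

* break in mean: the level shifts from 10 to 15 after observation 50
gen y_mean = 10 + 5*(t > 50) + rnormal()

* break in relationship: the slope on t rises from 0.2 to 0.7 after observation 50
gen y_slope = 10 + 0.2*t + 0.5*(t - 50)*(t > 50) + rnormal()

tsset t
tsline y_mean y_slope
```

In the plot, the first series jumps to a new level while the second keeps its level at the break but bends to a steeper trend.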
Basic problems of the structural breaks
Without going into complex mathematical definitions of structural breaks, we can establish some of the problems that arise when our data has them. The first problem was identified by Andrews (1993) and concerns the stability of the parameters under structural change. In simple terms, in the presence of a break, the least squares estimates tend to vary over time, which is of course undesirable: ideally the estimates would be time invariant, so that least squares is the Best Linear Unbiased Estimator -BLUE-.
The second problem of structural breaks (or changes) not taken into account during the regression analysis is that the estimator becomes inefficient, since the estimated parameters show a significant increase in variance. As a result, we do not get a statistically reliable estimator, and our inferences or forecasts would not match reality.
A third problem may appear if the structural break influences the unit root identification process. This is not a widely explored topic, but Tai-Leung Chong (2001) makes excellent points on it. Any time series analysis should consider the existence of unit roots in the variables, in order to provide further tools to handle a phenomenon, including cointegration and forecasting techniques.
An empirical approximation
Suppose we want to model the trend of the GDP of the Colombian economy. Naturally, this kind of analysis takes GDP as the dependent variable and time as the independent variable, in the following form:
In this case, GDP, expressed as Y, is a function of time t. To start, we can assume that the function f(t) follows a linear approximation.
With this expression in (1), the gross domestic product has an autonomous value, independent of time, given by a, and a slope coefficient α with the usual interpretation: an increase of one time unit raises GDP by α.
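Written out, the general form and its linear approximation would read as below. This is a reconstruction from the surrounding description; the error term u_t is an assumption added here, since a regression needs a stochastic part.

```latex
\begin{align*}
Y_t &= f(t) \\
Y_t &= a + \alpha\, t + u_t \tag{1}
\end{align*}
```

Under (1), the change in GDP from one year to the next is the constant amount α, whatever the level of GDP.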
The linear approximation sounds ideal to model GDP against changes over time, assuming that t is measured in years, meaning that we have annual data (so we are excluding seasonal phenomena); however, we should always inspect the data graphically.
In Stata, once we have tsset the dataset, we can look at the graphical behavior with the command “scatter Y t”.
In sum, given this behavior of the real GDP of the Colombian economy over the period of analysis (1950-2014), the linear approximation might not be a good idea. There also appear to be structural changes, judging by the trend, whose slope changes drastically around the year 2000.
If we regress the expression in (1), we get the following results.
The linear fit of GDP on time (in years) is quite good: time, the independent variable, explains around 93% of the variation in the GDP of the Colombian economy, and the parameter is significant at the 5% level.
Now I want you to focus on two basic things: the variance of the model, which is 1.7446e+09, and the confidence interval, which places the estimate between 7613.081 and 8743.697. Without other values to compare them with, we should just keep them in mind.
Now, we can proceed with a test to identify structural breaks in the regression we have just performed. So, we just type “estat sbsingle” in order to test for a structural break with an unknown date.
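Put together, the steps so far could look like the sketch below. It assumes the dataset holds real GDP in Y and the year in t, as in the commands used in this article; the last line requests the average and exponential Wald tests alongside the default supremum Wald test.

```stata
* declare the yearly time variable
tsset t, yearly

* linear trend model in levels, expression (1)
regress Y t

* test for a single structural break at an unknown date (supremum Wald by default)
estat sbsingle

* same test, also reporting the average and exponential Wald statistics
estat sbsingle, swald awald ewald
```

The test is a postestimation command, so it has to be run right after the regression it refers to.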
The interesting thing here is that the structural break test identifies one important change over the full sample period of 1950 to 2014. The test computed over the whole sample is called the “supremum Wald test”, and it is said to have less power than the average or exponential tests. However, the test is useful for simply identifying structural breaks, which also tend to match the graphical analysis. According to the test, we have a structural break in the year 2002, so it is useful to graph the behavior before and after this year in order to assess the possible changes. We can do this with the command “scatter Y t”, adding some if conditions as follows.
twoway (scatter Y t if t<=2002)(lfit Y t if t<=2002)(scatter Y t if t>=2002)(lfit Y t if t>=2002)
We can observe that the trend indeed changes if we fit the line for the partial periods t<2002 and t>2002, meaning that the change in slope is a sign of the structural break detected by the program. You can address this issue by including a dummy variable equal to 0 before 2002 and equal to 1 after 2002. However, let’s now graph the logarithm transformation of GDP. The mathematical model would be:
Applying natural logarithms, we get:
α now becomes the average growth rate per year of the GDP of the Colombian economy. To implement this transformation, use the command “gen ln_Y=ln(Y)”; the graphical behavior would look like this:
scatter ln_Y t
The power of the monotonic transformation is now visible: the points lie along a nearly straight line, which can be fitted using a linear regression. In fact, let’s regress the expression in Stata.
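The model being regressed can be sketched as follows. This is a reconstruction that assumes an exponential trend for GDP with the same symbols a and α used in (1); the error term u_t is an assumption added for the regression.

```latex
\begin{align*}
Y_t &= e^{\,a + \alpha t + u_t} \\
\ln Y_t &= a + \alpha\, t + u_t \\
\frac{\partial \ln Y_t}{\partial t} &= \alpha
\end{align*}
```

The last line is why α reads as a growth rate: a one-year increase in t changes ln Y by α, which is approximately a proportional change of α in Y when α is small.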
Remember that I told you to keep in mind the variance and the confidence interval of the first regression? Now we can compare them, since we have two models. The variance of the last regression is 0.0067 and the confidence interval is indeed tight around the coefficient (around 0.002 of difference between the upper and lower bounds for the parameter). So, this model fits even better than the first.
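The two quantities being compared can be pulled out of each regression as in this sketch. It assumes that ln_Y has already been generated and that the variance quoted here is the residual mean square, which Stata stores as the square of e(rmse); if another variance measure was meant, the display lines would need to change.

```stata
* model in levels
regress Y t
display "residual variance (levels): " e(rmse)^2

* model in logarithms
regress ln_Y t
display "residual variance (logs): " e(rmse)^2
```

Keep in mind that the first figure is in squared GDP units and the second in squared log units, so the confidence intervals and the graphs are a helpful complement to the raw numbers.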
If we perform the “estat sbsingle” test again, it is highly likely that another structural break will appear. But we should not worry much if this happens, because we rely on the graphical analysis to proceed with the inferences. In other words, we should be parsimonious with our models: explain the most with little.
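Both routes can be sketched in a few lines: re-running the break test on the log model, and the dummy-variable alternative mentioned earlier for the model in levels. The variable name post2002 is hypothetical, and letting the dummy shift both the intercept and the slope is one possible specification, not necessarily the only one.

```stata
* break test on the log model
regress ln_Y t
estat sbsingle

* dummy alternative in levels: 0 up to 2002, 1 afterwards
gen post2002 = (t > 2002) if !missing(t)

* c.t##i.post2002 expands to t, the dummy, and their interaction
regress Y c.t##i.post2002
```

The interaction coefficient measures how much the slope of the trend changes after 2002, which is the change the split fitted lines showed graphically.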
The main conclusion of this article is that logarithms, through their property of monotonic transformation, constitute a quick, powerful tool that can help us reduce (or even remove) the influence of structural breaks in our regression analysis. Structural changes are also, for example, signs of exogenous transformations of the economy. Applying this idea to the Colombian economy, we see its growth rate changing from 2002 until recent years. We need to consider that in 2002 Colombia had a change of government, focused on implementing public policies aimed at eliminating terrorist groups, which probably had an impact on investment in the economy and might explain the growth since then.
Andrews, D. W. (1993). Tests for Parameter Instability and Structural Change With Unknown Change Point. Journal of the Econometric Society Vol. 61, No. 4 (Jul., 1993),
Casini, A., & Perron, P. (2018). Structural Breaks in Time Series. Retrieved from Economics Department, Boston University: https://arxiv.org/pdf/1805.03807.pdf
Hansen, B. E. (2012). Advanced Time Series and Forecasting. Retrieved from Lecture 5 Structural Breaks. University of Wisconsin Madison:
Nau, R. (2019). The logarithm transformation. Retrieved from Data concepts The logarithm
Shresta, M., & Bhatta, G. (2018). Selecting appropriate methodological framework for time series data analysis. Retrieved from The Journal of Finance and Data Science: https://www.sciencedirect.com/science/article/pii/S2405918817300405
Tai-Leung Chong, T. (2001). Structural Change In AR(1) Models. Retrieved from Econometric Theory, 17. Printed in the United States of America: 87–155