ARDL model and General to simple methodology

Listening to the word ARDL, the first things that comes into mind is the bound testing approach introduced by Pesaran and Shin (1999).The Pesaran and Shin’s approach is an incredible use of the ARDL, however, the term ARDL is much elder, and the ARDL model has many other uses as well. In fact, the equation used by Pesaran and Shin is a restricted version of ARDL, and the unrestricted version of ARDL was introduced by Sargan (1964) and popularized by David F Hendry and his coauthors in several papers. The most important paper is one which is usually known as DHSY, but we will come to the details DHSY later. Let me introduce what is ARDL and what are the advantages of this model


What is ARDL model?

ARDL model is an a-theoretic model for modeling relationship between two time series. Suppose we want to see the effect of time series variable Xt on another variable Yt. The ARDL model for the purpose will be of the form

The same model can be written as

This means, in the layman language the dependent variable is regressed on its own lags, independent variable and the lags of independent variables. The above ARDL model can be termed as ARDL (j, k) model, referring to number of lags j & K in the model.

The model itself is written without any theoretical considerations. However, a large number of theoretical models are embedded inside this model and one can drive appropriate theoretical model by testing and imposing restrictions on the model.

To have more concrete idea, let’s consider the case of relationship between consumption and income. To further simplify, lets consider j=k=1, so that the ARDL(1,1) model for the relationship of consumption and income can be written as

Model 1:          Ct=a+b1Ct-1+d0Yt+d1Yt-1+et

HereC denotes consumption and Y denotes income, a,b1,d0,d1 denote the regression coefficient and et denotes error term. So far, no theory is used to develop this model and the regression coefficients don’t have any theoretical interpretation. However, this model can be used to select appropriate theoretical model for the consumption.

Suppose we have estimated the above mentioned model and found the regression coefficients. We can test any one of the coefficient and/or number of coefficient for various kinds of restriction. Suppose we test the restriction that   

R1: H0: (b1 d1)=0

Suppose testing restriction on actual data implies that restriction is valid, this means we can exclude the curresponding variables from the model. Excluding the variables, the model will become

Model 2:          Ct=a+ d0Yt+et

The model 2 is actually the Keynesian consumption (also called absolute income hypothesis), which says that current consumption is dependent on current income only. The coefficient of income in this equation is the marginal propensity to consume and Keynes predicted that this coefficient would be between 0 and 1, implying that individuals consume a part of their income and save a part of their incomes for future.

Suppose that the data did not suppose the restriction R1, however, the following restriction is valid

R2: H0: d1=0

This means model 1 would become

Model 3:          Ct=a+b1Ct-1+d0Yt+et

This means that current consumption is dependent on current income and past consumption. This is called Habit Persistence model. The past consumption here is the proxy of habit. The model says that what was consumed in the past is having effect on current consumption and is evident from human behavior.

Suppose that the data did not suppose the restriction R1, however, the following restriction is valid

R3: H0: b1=0

This means model 1 would become

Model 4:          Ct=a+ d0Yt+d1Yt-1+et

This means that current consumption is dependent on current income and past income. This is called Partial Adjustment model. As per implications of Keynesian consumption function, the consumption should only depend on the current income, but the partial adjustment model says that it takes sometimes to adjust to the new income. Therefore, the consumption is partially on the current income and partially on the past income

In a similar way, one can derive many other models out of the model 1 which are representative of different theories. The details of the models that can be drawn from model 1 can be found in Charemza and Deadman (1997)’s ‘New Directions in Econometric Practice…’.  

It can also be shown that the difference form models are also derivable from model 1. Consider the following restriction

R 4:

If this restriction is valid, the model 1 will become

Ct=a+Ct-1+d0Yt-d0Yt-1+et

This model can be re-written as

Ct-Ct-1=d0(Yt-Yt-1)+et

  This means

Model 5: DCt=d0DYt+et

This indicates that the difference form models can also be derived from the model 1 with certain restrictions

Further elaboration shows that the error correction models can also be derived from model 1.

Consider model 1 again and subtract Ct-1 both sides, we will get

 Ct- Ct-1=a+b1Ct-1 -Ct-1+d0Yt+d1Yt-1+et

Adding and subtracting d0Yt-1 on the right hand side we get

DCt=a+(b1-1)Ct-1+d0Yt+d1Yt-1 +d0Yt-1 -d0Yt-1 +et

DCt=a+(b1-1)Ct-1+d0DYt+d1Yt-1 +d0Yt-1 +et

DCt=a+(b1-1)Ct-1+d0DYt+(d1+d0)Yt-1+et

This equation contains error correction mechanism if

R6: (b1-1)= – (d1+d0)

Assume

 (b1-1)= – (d1+d0)=F

The equation will reduce to

DCt=a+F(Ct-1-Yt-1)+ d0DYt +et

This is our well known error correction model and can be derived if R6 is valid.

Therefore, existence of an error correction mechanism can also be tested from model 1 and restriction to be considered valid if R6 is valid.  

As we have discussed, number of theoretical models can be driven from model 1 by testing certain restrictions. We can start from model 1 and go with testing different restrictions. We can impose the restriction which is found valid and discard the restrictions which were found invalid in our testing. This provides us a natural way of selection among various theoretical models.

When we say theoretical model, this means there is some economic sense of the model. For example the models 2 to model 6 all make economic sense. So, how to decide between these models? This problem can be solved if we start with an ARDL model and choose to impose restrictions which are permitted by the data

The famous DHSY paper recommends a methodology like this. DHSY recommend that we should start with a large model which encompasses various theoretical models. The model can then be simplified by testing certain restrictions.

In another blog I have argued that if there are different theories for a certain variables, the research must be comparative. This short blog gives the brief outlines about how we can do this. Practically, one need to take larger ARDL structures and number of models that can be derived from the parent model would also be large.

Please follow and like us:

Research in presence of multiple theories

Consider a hypothetical question, a researcher was given with a research question; compare the mathematical ability of male and female students of grade 5. The researcher collected data of 300 female students and 300 male students of grade 5 and administered a test of mathematical questions. The average score for female students was 80% and average score of male student was 50%, the difference was statistically significant and therefore, the researcher concluded that the female students have better mathematical aptitude.

The findings seem strong and impressive, but let me add into the information that the male students were chosen from a far-off village with untrained educational staff and lack of educational facilities. The female students were chosen from an elite school of a metropolitan city, where the best teachers of the city actually serve. What should be the conclusion now? It can be argued that actually difference doesn’t come from the gender, the difference is coming from the school type.

The researcher carrying out the project says ‘look, my research assignment was only to investigate the difference due to gender, the school type is not the question I am interested in, therefore, I have nothing to do with the school type’.

Do you think that the argument of researcher is valid and the findings should be considered reliable? The answer is obvious, the findings are not reliable, and the school type creates a serious bias.  The researcher must compare students from the same school type. This implies you have to take care of the variables not having any mention in your research question if they are determinants of your dependent variable.

Now let’s apply the same logic to econometric modeling, suppose we have the task to analyze the impact of financial development on economic growth. We are running a regression of GDP growth on a proxy of financial development; we are getting a regression output and presenting the output as impact of financial development on economic growth. Is it a reliable research?

This research is also deficient just like our example of gender and mathematical ability. The research is not reliable if ceteris paribus doesn’t hold. The other variables which may affect the output variable should remain same.

But in real life, it is often very difficult to keep all other variables same. The economy continuously evolves and so are the economic variables. The other solution to overcome the problem is to take the other variables into account while running regression. This means other variables that determine your dependent variable should be taken as control variables in the regression. This means suppose you want to check the effect of X1 on Y using model Y=a+bX1+e. Some other research studies indicate that another model exist for Y which is Y=c+dX2+e.   Then I cannot run the first model ignoring the second model. If I am running only model 1 ignoring the other models, the results would be biased in a similar way as we have seen in our example of mathematical ability. We have to use the variables of model 2 as control variable, even if we are not interested in coefficients of model 2. Therefore, the estimated model would be like Y=a+bX1+cX2+e

Taking the control variables is possible when there are a few models. The seminal study of Davidson, Hendry, Sarba and Yeo titled ‘Econometric modelling of the aggregate time-series relationship between …. (often referred as DHSY)’ summarizes the way to build a model in such a situation. But it often happens that there exists very large number of models for one variable. For example, there is very large number of models for growth. A book titled ‘Growth Econometrics’ by Darlauf lists hundreds of models for growth used by researchers in their studies. Life becomes very complicated when you have so many models. Estimating a model with all determinants of growth would be literally impossible for most of the countries using the classical methodology. This is because growth data is usually available at annual or quarterly frequency and the number of predictors taken from all models collectively would exceed number of observations. The time series data also have dynamic structure and taking lags of variables makes things more complicated. Therefore, classical techniques of econometrics often fail to work for such high dimensional data.

Some experts have invented sophisticated techniques for the modeling in a scenario where number of predictor becomes very large. These techniques include Extreme Bound Analysis, Weighted Average Least Squares, and Autometrix etc. The high dimensional econometric techniques are also very interesting field of econometric investigation. However, DHSY is extremely useful for the situations where there are more than one models for a variable based on different theories. The DHSY methodology is also called LSE methodology, General to Specific Methodology or simply G2S methodology.  

Please follow and like us:

Learning Central Limit Theorem with Microsoft Excel

Many statistical and econometric procedures depend on the assumption of normality. The importance of the normal distribution lies in the fact that sums/averages of random variables tend to be approximately normally distributed regardless of the distribution of draws. The central limit theorem explains this fact. Central Limit Theorem is very important since it provides justification for most of statistical inference. The goal of this paper is to provide a pedagogical introduction to present the CLT, in form of self study computer exercise. This paper presents a student friendly illustration of functionality of central limit theorem. The mathematics of theorem is introduced in the last section of the paper. 

CENTRAL LIMIT THEOREM

We start by an example where we observe a phenomenon and than we will discuss the theoretical background of the phenomenon.

Consider 10 players playing with identical dice simultaneously. Each player rolls the dice large number of times. The six numbers on the dice have equal probability of occurrence on any roll and before any player. Let us ask computer to generate data that resembles with the outcomes of these rolls.

We need to have Microsoft Excel ( above 2007 preferable) for this exercise. Point to ‘Data’ tab in the menu bar, it should show ‘Data Analysis’ in the tools bar. If Data Analysis is not there, than you need to install the data analysis tool pack, for this  you have to click on the office button, which is the yellow color button at top left corner of Microsoft Excel Window.  Choose ‘Add Ins’ from the left pan that appears, than check the box against ‘Analysis Tool Pack’ and click OK.

Select Office Button Excel OptionsSelect Add Ins Þ  Analysis ToolPack ÞGo from the screen that appears

Computer will take few moments to install the analysis toolpack. After installation is done, you will see ‘Data Analysis’ on pointing again to Data Tab in the menu bar. The analysis tool pack provides a variety of tool for statistical procedures.

We will generate data that matches with the situation described above using this tool pack.

Open an Excel spread sheet, write 1, 2, 3,…6 in cells A1:A6,

Write ‘=1/6’ in cell B1 and copy it down

This shows you possible outcomes of roll of dice and their probabilities.

 This will show you following table:

10.167
20.167
30.167
40.167
50.167
60.167

Here first column contain outcomes of roll of dice and second column contain probability of outcomes. Now we want the computer to have some draws from this distribution. That is, we want computer to roll dice and record outcomes.

For this go to Data Þ Data AnalysisÞ Random Number Generation and select discrete distribution. Write number of variables =10 and number of random number =1000, enter value input and probability range A1:B6, put output range D1 and click OK.

This will generate a 1000×10 matrix of outcomes of roll of dice in cells A8:J1007. Each column represent outcome for a certain player in 1000 draws whereas rows represent outcomes for 10 players in some particular draw. In the next column ‘K’ we want to have sum of each row. Write ‘=SUM(A8:J8) and copy it down. This will generate column of sum for each draw.

Now, we are interested in knowing that what distribution of outcome for each player is:  

Let us ask Excel to count the frequency of each outcome for player 1. Choose Tools/Data Analysis/Histogram and fill the dialogue box as follows:

The screenshot shows the dialogue box filled to count the frequency of outcomes listed observed by player A. The input range is the column for which we want to count frequency of outcomes and bin range is the range of possible outcomes.  This process will generate frequency of six possible outcomes for the single player. When we did this, we got following output:

BinFrequency
1155
2154
3160
4169
5179
6183
More0

The table above gives the frequency of the outcomes whereas same frequencies are plotted in the bar chart. You observe that frequency of occurrence is not approximately equal. The height of vertical bars is approximately same. This implies that the distribution of draws is almost uniform. And we know this should happen because we made draws from uniform distribution. If we calculate percentage of each outcome it will become 15.5%, 15.4%, 16%, 16.9%, 17.9% and 18.3% respectively. These percentages are close to the probability of these outcomes i.e. 16.67%.

Now we want to check the distribution of column which contain sum of draws for 10 players, i.e. the column K. Now the range of possible values of column of sum varies from 10 to 60 (if all column have 1, the sum would be 10 and if all columns have 6 than sum would be 60, in all other cases it would be between these two numbers). It would be in-appropriate to count frequencies of all numbers in this range. Let us make few bins and count the frequencies of these bins. We choose following bins; (10,20), (20, 30),…(50, 60). Again we would ask Excel to count frequencies of these bins. To do this, write 10, 20,…60 in column M of Excel spread sheet (these numbers are the boundaries of bins we made). Now select Tools/Data Analysis/Histogram and fill the dialogue box that appears.

 The input range would be the range that contains sum of draws i.e. K8 to K1007 and bin range would be the address of cells where we have written the boundary points of our desired bins. Completing this procedure would produce the frequencies of each bin. Here is the output that we got from this exercise.

BinFrequency
100
205
30211
40638
50144
602
More0

First row of this output tells that there was no number smaller than starting point of first bin i.e. smaller than 10, and 2nd, 3rd …rows tell frequencies of bins (10-20), (20,30),…respectively. Last row informs about frequency of numbers larger than end point of last bin i.e. 60.

Below is the plot of this frequency table.

 Obviously this plot has no resemblance with uniform distribution. Rather if you remember famous bell shape of the normal distribution, this plot is closer to that shape.

Let us summarize our observation out of this experiment. We have several columns of random numbers that resemble roll of dice i.e. possible outcomes are 1…6 each with probability 1/6 (uniform distribution). If we count frequency of these outcomes in any column, the outcomes reveal the distributional shape and the histogram is almost uniform. Last column was containing sum of 10 draws from uniform distribution and we saw that distribution of this column is no longer uniform, rather it has closer match with shape of normal distribution.

Explanation of the observation:

The phenomenon that we observed may be explained by central limit theorem. According to central limit, let  be independent draws from any distribution (not necessarily uniform) with finite variance, than distribution of sum of draws  and average of draws would be approximately normal if sample size ‘n’ is large.

Mean and SE for sum of draws:

From our primary knowledge about random variables we know that:

And

Suppose

Let , than and

These two statements tell the parameters of normal distribution that emerges from sum of random numbers and we have observed this phenomenon described above.

Verification

Consider the exercise discussed above; column A:J are draws from dice roll with expectation 3.5 and variance 2.91667. Column K is sum of 10 previous columns. Thus expected value of K is thus 10*3.5=35 and variance 2.91667*10. This also implies that SE of column K is 5.400 (square root of variance.

The SD and variance in the above exercise can be calculated as follows:

Write ‘AVERAGE(K8:K1007)’ in any blank cell in spreadsheet. This will calculate sample mean of numbers in column K. The answer will be close to 35. When I did this, I found 34.95.

Write ‘VAR(K8:K1007)’ in any blank cell in spreadsheet. This will calculate sample variance of numbers in column K. The answer will be close to 29.16, when I did this, I found 30.02

Summary:

In this exercise, we observed that if we take draws from some certain distribution, the frequency of draws will reflect the probability structure of parent distribution. But when we take sum of draws, the distribution of sum reveals the shape of normal distribution. This phenomenon has its root in central limit theorem which is stated in Section …..

Please follow and like us:

Spurious Regression With Stationary Time Series

The spurious relationship is said to have occurred if the statistical summaries are indicating that two variables are related to each other when in fact there is no theoretical relationship between two variables. It often happens in time series data and there are many well-known examples of spurious correlation in time series data as well. For example, Yule (1926) observed strong relationship between marriages in church and the mortality rate in UK data. Obviously, it is very hard to explain that how the marriages in church can possibly effect the mortality, but the statistics says one variable has very strong correlation with other. This is typical example of spurious regression. Yule (1926) thought that this happens due to missing third variable.

This term spurious correlation was invented on or before 1897 i.e. in less than 15 years after invention of regression analysis. In 1897, Karl Pearson wrote a paper entitled, ‘Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs’. The title indicates the terms spurious regression was known at least as early as 1897, and it was observed in the data related to measurement of organs. The reason for this spurious correlation was use of indices. In next 20 years, many reasons for spurious correlation were unveiled with the most popular being missing third variable. This means if X is a cause of Y and X is also a cause of Z, but Y and Z are not directly associated. If you regress Y on Z, you will find spurious regression.

In 1974, Granger and Newbold (Granger won noble prize later) found that two non-stationary series may also yield spurious results even if there is no missing variable. This finding only added another reason to the possible reasons of spurious regression. Neither this finding can be used to argue that the non-stationarity is one and only reason of spurious regression nor this can be used to argue that the spurious regression is time series phenomenon. However, unfortunately, the economists adapted the two misperception. First, they thought that spurious regression is time series phenomenon and secondly, although not explicitly stated, it appears that the economists assume that the non-stationarity is the only cause of spurious regression. Therefore, although not explicitly stated, most of books and articles discussing the spurious regression, discuss the phenomenon in the context of non-stationary time series.

Granger and his coauthors in 1998 wrote a paper entitled “Spurious regressions with stationary series”, in which they show that spurious regression can occur in the stationary data. Therefore, they clear one of the common misconception that the spurious regression is only due to non-stationarity, but they were themselves caught in the second misconception that the spurious regression is time series phenomenon. They define spurious regression as “A spurious regression occurs when a pair of independent series but with strong temporal properties, are found apparently to be related according to standard inference in an OLS regression”. The use of term temporal properties implies that they assume the spurious regression to be time series related phenomenon. But a 100 years ago, Pearson has shown the spurious regression a cross-sectional data.

The unit root and cointegration analysis were developed to cope with the problem of spurious regression. The literature argues that spurious regression can be avoided if there is cointegration. But unfortunately, cointegration can be defined only for non-stationary data. What is the way to avoid spurious regression if the underlying are stationary? The literature is silent to answer this question.

Pesaran et al (1998) developed a new technique ‘ARDL Bound Test’ to test the existence of level relationship between variables. People often confuse the level relationship with cointegration and the common term used for ARDL Bound test is ARDL cointegration, but the in reality, this does not necessarily imply cointegration. The findings of Bound test are more general and imply cointegration only under certain conditions. The ARDL is capable of testing long run relationship between pair of stationary time series as well as between pair of non-stationary time series. However, the long run relationship between stationary time series cannot be termed as cointegration because by definition cointegration is the long run relationship between stationary time series.

In fact, ARDL bound test is a better way to deal with the spurious regression in stationary time series, but several misunderstandings about the test has restricted the usefulness of the test. We will discuss the use and features of ARDL in a future blog. 

Please follow and like us:

Can cointegration analysis solve spurious regression problem?

The efforts to avoid the existence of spurious regression has led to the development of modern time series analysis (see How Modern Time Series Analysis Emerged? ). The core objective of unit root and cointegration procedures is to differentiate between genuine and spurious regression. However, despite the huge literature, the unit root and cointegration analysis are unable to solve spurious regression problem. The reason lies mainly in the misunderstanding of the term spurious regression.

Spurious correlation/spurious correlation occur when a pair of variable having no (weak) causal connection appears to have significant (strong) correlation/regression. In these meanings the term spurious correlation/spurious has the same history as the term regression itself. The correlation and regression analysis were invented by Sir Francis Galton in around 1888 and in 1897, Karl Pearson wrote a paper with the following title, ‘Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs’ (Pearson, 1897).

This title indicates number of important things about the term spurious correlation: (a) the term spurious correlation was known as early as 1897, that is, in less than 10 years after the invention of correlation analysis (ii) there were more than one types of spurious correlation known to the scientists of that time, therefore, the author used the phrase ‘On a Farm of Spurious Regression’, (c) the spurious correlation was observed in measurement of organs, a cross-sectional data (d) the reason of spurious correlation was use of indices, not the non-stationarity.

One can find in classical econometric literature that that many kinds of spurious correlations were known to experts in first two decades of twentieth century. These kinds of spurious correlations include spurious correlation due to use of indices (Pearson, 1897), spurious correlation due to variations in magnitude of population (Yule, 1910), spurious correlation due to mixing of heterogeneous records (Brown et al, 1914), etc. The most important reason, as the econometricians of that time understand, was the missing third variable (Yule, 1926).

Granger and Newbold (1974) performed a simulation study in which they generated two independent random walk time series x(t)=x(t-1)+e(t) and y(t)=y(t-1)+u(t) . The two series are non-stationary and the correlation of error terms in the two series is zero so that the two series are totally independent of each other. The two variables don’t have any common missing factor to which the movement of the two series can be attributed. Now the regression of the type y(t)=a+bx(t)+e(t) should give insignificant regression coefficient, but the simulation showed very high probability of getting significant coefficient. Therefore, Granger and Newbold concluded that spurious regression occurs due to non-stationarity.

Three points are worth considering regarding the study of Granger and Newbold. First, the above cited literature clearly indicates that the spurious correlation does exist in cross-sectional data and the Granger-Newbold experiment is not capable to explain cross-sectional spurious correlation. Second, the existing understanding of the spurious correlation was that it happens due to missing variables and the experiment adds another reason for the phenomenon which cannot deny the existing understanding. Third, the experiment shows that non-stationarity is one of the reasons of spurious regression. It does not prove that non-stationarity  is ‘the only’ reason of spurious regression.

However, unfortunately, the econometric literature that emerged after Granger and Newbold, adapted the misconception. Now, many textbooks discuss the spurious regression only in the context of non-stationarity, which leads to the misconception that the spurious regression is a non-stationarity related phenomenon. Similarly, the discussion of missing variable as a reason of spurious regression is usually not present in the recent textbooks and other academic literature.

To show that spurious regression is not necessarily a time series phenomenon, consider the following example:

A researcher is interested in knowing the relationship between shoe size and mathematical ability level of school students. He goes to a high school and takes a random sample of the students present in the school. He takes readings on shoe size and ability to solve the mathematical problems of the selected students. He finds that there is very high correlation between two variables. Would this be sufficient to argue that the admission policy of the school should be based on the measurement of shoe size? Otherwise, what accounts for this high correlation?

If sample is selected from a high school having kids in different classes, the same observation is almost sure to occur. The pupil in higher classes have larger shoe size and have higher mathematical skills, whereas student in lower classes have lower mathematical skills. Therefore, high correlation is expected. However, if we take data of only one class, say grade III, we will not see such high correlation. Since theoretically, there should be no correlation between shoe size and mathematical skills, this apparently high correlation may be regarded as spurious correlation/regression. The reason for this spurious correlation is mixing of missing age factor which drives both shoe size and mathematical skills.

Since this is not a time series data, there is no question of the existence of non-stationarity, but the spurious correlation exists. This shows that spurious correlation is no necessarily a time series phenomenon. The unit root and cointegration would be just incapable to solve this problem.

Similarly, it can be shown that the unit root and cointegration analysis can fail to work even with time series data, and this will be discussed in our next blog

Please follow and like us:

How Modern Time Series Analysis Emerged?

The regression analysis which is basic tool of econometrics was invented in 1880s by Francis Galton, a cousin of Charles Darwin who is famous for his theory of evolution. Like Darwin, Galton was a biologist interested in the laws of heredity, and he intended to use the regression for the laws of heredity. He used the regression for analysis of cross-sectional data of the heights of fathers and sons. The regression analysis was than adapted and developed by the economists for analysis of economic data without any discrimination of time series and cross-sectional data.

Soon it was discovered that the application of regression analysis to the time series data could produced misleading results. In particular, regression analysis applied to time series data sometimes shows the two series to be highly correlated, when in fact there is no sensible economic relationship between the variables. This phenomenon was termed as ‘spurious regression’.

Yule (1926) wrote a detailed commentary on the spurious regression in time series. He gave number of examples in which two independent time series appear to be highly correlated. One of his examples was the relationship in marriages in Church of England and mortality rate. Obviously, the two variables don’t have any causal connection, but Yule fond 95% correlation between two variables. Yule thought that this phenomenon is because of some missing variable and could be avoided by taking into account all relevant variables. He further assumed that spurious regression would disappear if longer time series are available. This means by increasing the time series length, the chances of spurious regression will gradually diminish. In coming half century, the missing variable was thought as the main reason for the spurious correlation in time series.

In 1974, Granger and Newbold observed that in case of non-stationary time series, the spurious regression may exist even if there is no missing variable. They further found that the probability of spurious regression increases by increasing the time series length, contrary to the perception of Yule who had thought that probability of spurious regression will decrease with the increase in time series length. A few years later in 1982, Nelson and Plossor analyzed a set of time series of the United States and found that most of these series are non-stationary. Many other studies supported the finding of Nelson and Plossor creating a doubt about stationarity of time series.

If one combines the finding of Granger and Newbold with that of Nelson and Plossor, the conclusion would be, ‘most of regressions between economic time series are spurious because of non-stationarity of the underlying time series’. Therefore these studies put a big question mark on the validity of regression analysis for time series data.

In a later study, Engle and Granger (1986) found that regression of non-stationary time series could be genuine, if the underlying series are ‘cointegrated’. This means, if you have a set of  time series variables which are non-stationary, you have to ensure that they are cointegrated as well, in order to insure that the regression is no spurious.

If you are running a regression between time series variables, first you have to check the stationarity of the series because as warned by Nelson and Plossor and predecessors, most of the economic time series are non-stationary. If the series are actually non-stationary, than you have to make sure that the series have cointegration as well, otherwise the regression will be spurious.

Therefore, in order to test the validity of a regression analysis for time series, testing for stationarity and cointegration became the preliminary steps in the analysis of time series.  A new stream of literature emerged on focusing on the testing for stationarity and cointegration which give rise to current tools of time series analysis.

References

Galton, F. (1886). “Regression towards mediocrity in hereditary stature”. The Journal of the Anthropological Institute of Great Britain and Ireland. 15: 246–263.

Engle, Robert F.; Granger, Clive W. J. (1987). “Co-integration and error correction: Representation, estimation and testing”. Econometrica. 55 (2): 251–276.

Granger, C. W., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of econometrics, 2(2), 111-120

Nelson, C. R. and Plosser, C. R. (1982). Trends and random walks in macroeconmic time series: some evidence and implications. Journal of monetary economics, 10(2):139– 162

Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between Time-Series?–a study in sampling and the nature of time-series. Journal of the royal statistical society, 89(1), 1-63.

Please follow and like us: