## Spurious Regression With Stationary Time Series

A spurious relationship occurs when statistical summaries indicate that two variables are related although there is no theoretical relationship between them. It often arises in time series data, and there are many well-known examples. For instance, Yule (1926) observed a strong relationship between marriages in church and the mortality rate in UK data. It is very hard to explain how marriages in church could possibly affect mortality, yet the statistics show a very strong correlation between the two variables. This is a typical example of spurious regression. Yule (1926) attributed it to a missing third variable.

The term spurious correlation was coined in or before 1897, that is, less than 15 years after the invention of regression analysis. In 1897, Karl Pearson wrote a paper entitled ‘Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs’. The title indicates that the term spurious correlation was known at least as early as 1897, and that the phenomenon was observed in data on the measurement of organs. The cause of this spurious correlation was the use of indices. Over the next 20 years, many causes of spurious correlation were identified, the most popular being a missing third variable: if X is a cause of Y and X is also a cause of Z, but Y and Z are not directly associated, then a regression of Y on Z will be spurious.

In 1974, Granger and Newbold (Granger later won the Nobel Prize) found that two non-stationary series may also yield spurious results even when there is no missing variable. This finding only added one more item to the list of possible causes of spurious regression. It cannot be used to argue that non-stationarity is the one and only cause of spurious regression, nor that spurious regression is a time series phenomenon. Unfortunately, economists adopted both misperceptions. First, they came to regard spurious regression as a time series phenomenon. Second, they appear to assume that non-stationarity is its only cause. As a result, although this is rarely stated explicitly, most books and articles discuss spurious regression in the context of non-stationary time series.

Granger and his coauthors in 1998 wrote a paper entitled “Spurious regressions with stationary series”, in which they showed that spurious regression can occur in stationary data. They thereby cleared up one common misconception, that spurious regression is only due to non-stationarity, but they were themselves caught in the second misconception, that spurious regression is a time series phenomenon. They define spurious regression as follows: “A spurious regression occurs when a pair of independent series but with strong temporal properties, are found apparently to be related according to standard inference in an OLS regression”. The reference to temporal properties implies that they regard spurious regression as a time series phenomenon. Yet a century earlier, Pearson had shown spurious correlation in cross-sectional data.
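The claim that independent stationary series can look related is easy to check numerically. The following is a minimal sketch, not the design used in the paper: it assumes AR(1) series z(t) = phi*z(t-1) + e(t) with standard normal errors, 100 observations, 500 replications and a plain OLS t-test, and all function names are illustrative.

```python
import math
import random


def slope_t_stat(x, y):
    """OLS t-statistic of b in y = a + b*x + error (textbook i.i.d. formula)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b / math.sqrt(rss / (n - 2) / sxx)


def ar1(n, phi, rng, burn_in=100):
    """Stationary AR(1) series (|phi| < 1); the burn-in draws are discarded."""
    z, path = 0.0, []
    for t in range(n + burn_in):
        z = phi * z + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(z)
    return path


def rejection_rate(phi, n_obs=100, n_sims=500, seed=1):
    """Share of simulations in which two independent AR(1) series look related."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = ar1(n_obs, phi, rng)
        y = ar1(n_obs, phi, rng)  # generated independently of x
        if abs(slope_t_stat(x, y)) > 1.96:  # two-sided test at roughly 5%
            hits += 1
    return hits / n_sims


rates = {phi: rejection_rate(phi) for phi in (0.0, 0.5, 0.9)}
for phi, rate in rates.items():
    print(f"phi = {phi}: nominal 5% test rejects in {rate:.1%} of samples")
```

With phi = 0 the series are white noise and the test should reject close to its nominal rate. As phi approaches 1 the series remain stationary, but the rejection rate rises well above 5%, which is the stationary form of spurious regression.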

Unit root and cointegration analysis were developed to cope with the problem of spurious regression. The literature argues that spurious regression can be avoided if there is cointegration. Unfortunately, cointegration is defined only for non-stationary data. How, then, can spurious regression be avoided if the underlying series are stationary? The literature is silent on this question.

Pesaran et al (1998) developed a new technique, the ‘ARDL Bound Test’, to test for the existence of a level relationship between variables. People often confuse a level relationship with cointegration, and the test is commonly called ARDL cointegration, but in reality a level relationship does not necessarily imply cointegration. The findings of the Bound test are more general and imply cointegration only under certain conditions. ARDL is capable of testing for a long run relationship between a pair of stationary time series as well as between a pair of non-stationary time series. However, a long run relationship between stationary time series cannot be termed cointegration, because by definition cointegration is a long run relationship between non-stationary time series.

In fact, the ARDL bound test is a better way to deal with spurious regression in stationary time series, but several misunderstandings about the test have restricted its usefulness. We will discuss the use and features of ARDL in a future blog.

## Can cointegration analysis solve spurious regression problem?

The efforts to avoid spurious regression have led to the development of modern time series analysis (see How Modern Time Series Analysis Emerged?). The core objective of unit root and cointegration procedures is to differentiate between genuine and spurious regression. However, despite the huge literature, unit root and cointegration analysis are unable to solve the spurious regression problem. The reason lies mainly in a misunderstanding of the term spurious regression.

Spurious correlation/spurious regression occurs when a pair of variables having no (weak) causal connection appears to have a significant (strong) correlation/regression. In this sense, the term spurious correlation/spurious regression is nearly as old as the term regression itself. Correlation and regression analysis were invented by Sir Francis Galton around 1888, and in 1897 Karl Pearson wrote a paper with the following title: ‘Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs’ (Pearson, 1897).

This title indicates a number of important things about the term spurious correlation: (a) the term was known as early as 1897, that is, less than 10 years after the invention of correlation analysis; (b) more than one type of spurious correlation was known to the scientists of that time, which is why the author used the phrase ‘On a Form of Spurious Correlation’; (c) the spurious correlation was observed in measurements of organs, which are cross-sectional data; (d) the cause of the spurious correlation was the use of indices, not non-stationarity.

The classical econometric literature shows that many kinds of spurious correlation were known to experts in the first two decades of the twentieth century. These include spurious correlation due to the use of indices (Pearson, 1897), spurious correlation due to variations in the magnitude of population (Yule, 1910), spurious correlation due to the mixing of heterogeneous records (Brown et al, 1914), etc. The most important cause, as the econometricians of that time understood it, was the missing third variable (Yule, 1926).

Granger and Newbold (1974) performed a simulation study in which they generated two independent random walk time series, x(t)=x(t-1)+e(t) and y(t)=y(t-1)+u(t). Both series are non-stationary, and the correlation between the error terms e(t) and u(t) is zero, so the two series are totally independent of each other. There is no common missing factor to which the movement of the two series can be attributed. A regression of the type y(t)=a+bx(t)+v(t) should therefore give an insignificant regression coefficient b, but the simulation showed a very high probability of getting a significant coefficient. Granger and Newbold therefore concluded that spurious regression occurs due to non-stationarity.
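An experiment of this kind can be reproduced in a few lines. The sketch below is an illustration in the spirit of the study, not a replication: the sample size (100), the number of replications (500), the standard normal errors and the white noise benchmark are all assumptions made here, and the function names are illustrative.

```python
import math
import random


def slope_t_stat(x, y):
    """OLS t-statistic of b in y = a + b*x + v (textbook standard error)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b / math.sqrt(rss / (n - 2) / sxx)


def random_walk(n, rng):
    """x(t) = x(t-1) + e(t), with standard normal e(t) and x(0) = 0."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(n, rng):
    """Stationary benchmark: independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def share_significant(make_series, n_obs=100, n_sims=500, seed=0):
    """Share of simulations in which the slope looks significant at about 5%."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = make_series(n_obs, rng)
        y = make_series(n_obs, rng)  # generated independently of x
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / n_sims


walk_share = share_significant(random_walk)
noise_share = share_significant(white_noise)
print(f"independent random walks: {walk_share:.1%} significant slopes")
print(f"independent white noise:  {noise_share:.1%} significant slopes")
```

For white noise the share of significant slopes should stay near the nominal 5%, while for independent random walks it is many times larger, even though b is zero by construction in both cases.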

Three points are worth considering regarding the study of Granger and Newbold. First, the literature cited above clearly indicates that spurious correlation does exist in cross-sectional data, and the Granger-Newbold experiment cannot explain cross-sectional spurious correlation. Second, the existing understanding was that spurious correlation arises from missing variables; the experiment adds another cause of the phenomenon but does not refute the existing one. Third, the experiment shows that non-stationarity is one of the causes of spurious regression. It does not prove that non-stationarity is ‘the only’ cause.

Unfortunately, the econometric literature that emerged after Granger and Newbold adopted exactly this misconception. Many textbooks now discuss spurious regression only in the context of non-stationarity, which gives the impression that spurious regression is a non-stationarity related phenomenon. Similarly, the discussion of a missing variable as a cause of spurious regression is usually absent from recent textbooks and other academic literature.

To show that spurious regression is not necessarily a time series phenomenon, consider the following example:

A researcher is interested in the relationship between shoe size and the mathematical ability of school students. He goes to a high school and takes a random sample of the students present in the school. He records the shoe size and the ability to solve mathematical problems of the selected students, and finds a very high correlation between the two variables. Would this be sufficient to argue that the admission policy of the school should be based on the measurement of shoe size? If not, what accounts for this high correlation?

If the sample is selected from a high school with pupils in different classes, the same observation is almost sure to occur. Pupils in higher classes have larger shoe sizes and higher mathematical skills, whereas pupils in lower classes have smaller shoe sizes and lower mathematical skills. Therefore, a high correlation is expected. However, if we take data from only one class, say grade III, we will not see such a high correlation. Since theoretically there should be no correlation between shoe size and mathematical skills, this apparently high correlation may be regarded as spurious correlation/regression. Its cause is the missing age factor, which drives both shoe size and mathematical skills.
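The shoe size example can be illustrated with made-up cross-sectional data. In the sketch below every number is hypothetical: ten grades with 200 pupils each, and shoe size and maths score that each depend on the grade plus independent noise. Within a grade the two variables are unrelated by construction.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(42)
pupils = []  # (grade, shoe size, maths score); all coefficients are hypothetical
for grade in range(1, 11):
    for _ in range(200):
        shoe_size = 20 + 1.5 * grade + rng.gauss(0.0, 1.0)  # driven by grade (age)
        maths_score = 10 * grade + rng.gauss(0.0, 10.0)     # also driven by grade
        pupils.append((grade, shoe_size, maths_score))

# Pooled sample: the missing age factor is mixed into both variables.
pooled_corr = pearson([p[1] for p in pupils], [p[2] for p in pupils])

# Single class: the age factor is held fixed.
grade_3 = [p for p in pupils if p[0] == 3]
within_corr = pearson([p[1] for p in grade_3], [p[2] for p in grade_3])

print(f"correlation across all grades:  {pooled_corr:.2f}")
print(f"correlation within grade 3:     {within_corr:.2f}")
```

The pooled correlation is high although shoe size has no effect on maths score, and it largely vanishes once the sample is restricted to one grade. No time dimension is involved anywhere in this data set.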

Since these are not time series data, the question of non-stationarity does not arise, yet the spurious correlation exists. This shows that spurious correlation is not necessarily a time series phenomenon. Unit root and cointegration analysis are simply incapable of solving this problem.

Similarly, it can be shown that unit root and cointegration analysis can fail to work even with time series data. This will be discussed in our next blog.

## How Modern Time Series Analysis Emerged?

Regression analysis, the basic tool of econometrics, was invented in the 1880s by Francis Galton, a cousin of Charles Darwin, who is famous for his theory of evolution. Like Darwin, Galton was a biologist interested in the laws of heredity, and he intended to use regression to study them. He applied regression to cross-sectional data on the heights of fathers and sons. Regression analysis was then adopted and developed by economists for the analysis of economic data, without any distinction between time series and cross-sectional data.

Soon it was discovered that applying regression analysis to time series data could produce misleading results. In particular, regression analysis applied to time series data sometimes shows two series to be highly correlated when in fact there is no sensible economic relationship between the variables. This phenomenon was termed ‘spurious regression’.

Yule (1926) wrote a detailed commentary on spurious regression in time series. He gave a number of examples in which two independent time series appear to be highly correlated. One of his examples was the relationship between marriages in the Church of England and the mortality rate. Obviously, the two variables have no causal connection, but Yule found a correlation of 95% between them. Yule thought that this phenomenon was due to some missing variable and could be avoided by taking all relevant variables into account. He further assumed that spurious regression would disappear if longer time series were available, that is, that the chances of spurious regression gradually diminish as the length of the time series increases. For the following half century, the missing variable was regarded as the main cause of spurious correlation in time series.

In 1974, Granger and Newbold observed that with non-stationary time series, spurious regression may occur even if there is no missing variable. They further found that the probability of spurious regression increases with the length of the time series, contrary to the view of Yule, who had thought that it would decrease. A few years later, in 1982, Nelson and Plosser analyzed a set of time series for the United States and found that most of these series are non-stationary. Many other studies supported the finding of Nelson and Plosser, casting doubt on the stationarity of economic time series.
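The effect of series length on independent random walks can be checked with a short simulation. This is only a sketch under assumptions chosen here (standard normal errors, lengths of 25, 100 and 400 observations, 300 replications each, a plain OLS t-test at roughly the 5% level); the function names are illustrative.

```python
import math
import random


def slope_t_stat(x, y):
    """OLS t-statistic of the slope in a regression of y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b / math.sqrt(rss / (n - 2) / sxx)


def random_walk(n, rng):
    """Cumulative sum of n independent standard normal shocks."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def share_significant(n_obs, n_sims=300, seed=2):
    """Share of independent random walk pairs with |t| > 1.96 at length n_obs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = random_walk(n_obs, rng)
        y = random_walk(n_obs, rng)
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / n_sims


shares = {n_obs: share_significant(n_obs) for n_obs in (25, 100, 400)}
for n_obs, share in shares.items():
    print(f"length {n_obs:>3}: {share:.1%} of regressions look significant")
```

Under these assumptions the share of apparently significant regressions grows with the length of the series instead of shrinking, so collecting more data does not cure this kind of spurious regression.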

If one combines the finding of Granger and Newbold with that of Nelson and Plosser, the conclusion would be, ‘most regressions between economic time series are spurious because of the non-stationarity of the underlying time series’. These studies therefore put a big question mark over the validity of regression analysis for time series data.

In a later study, Engle and Granger (1987) found that a regression of non-stationary time series could be genuine if the underlying series are ‘cointegrated’. This means that if you have a set of time series variables which are non-stationary, you have to ensure that they are cointegrated as well, in order to ensure that the regression is not spurious.

If you are running a regression between time series variables, you first have to check the stationarity of the series because, as warned by Nelson and Plosser and their predecessors, most economic time series are non-stationary. If the series are indeed non-stationary, then you have to make sure that the series are cointegrated as well, otherwise the regression will be spurious.

Therefore, in order to test the validity of a regression analysis for time series, testing for stationarity and cointegration became the preliminary steps in the analysis of time series. A new stream of literature emerged that focused on testing for stationarity and cointegration, which gave rise to the current tools of time series analysis.
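The cointegration step of this routine can be sketched with a simplified residual-based test in the spirit of Engle and Granger (1987). This is a rough illustration only: it uses a Dickey-Fuller regression without lag augmentation, the data are simulated, the function names are illustrative, and the 5% critical value of about -3.4 is an approximation taken from published tables for two variables with a constant. For real work, a tested library routine with proper critical values is preferable.

```python
import math
import random

APPROX_CRITICAL_5PCT = -3.4  # assumption: approximate tabulated value, not exact


def ols_simple(x, y):
    """Intercept a and slope b of the OLS regression y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b * mean_x, b


def df_t_stat(u):
    """t-statistic of rho in u(t) - u(t-1) = rho*u(t-1) + error (no constant)."""
    lagged = u[:-1]
    diffs = [u[t] - u[t - 1] for t in range(1, len(u))]
    sxx = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, diffs)) / sxx
    rss = sum((d - rho * v) ** 2 for v, d in zip(lagged, diffs))
    return rho / math.sqrt(rss / (len(diffs) - 1) / sxx)


def residual_test_stat(x, y):
    """Regress y on x, then test the residuals for a unit root."""
    a, b = ols_simple(x, y)
    residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
    return df_t_stat(residuals)


def random_walk(n, rng):
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


rng = random.Random(7)
x = random_walk(200, rng)
y_cointegrated = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # shares x's trend
y_independent = random_walk(200, rng)                                # unrelated to x

stat_coint = residual_test_stat(x, y_cointegrated)
stat_indep = residual_test_stat(x, y_independent)
for label, stat in (("cointegrated pair", stat_coint), ("independent pair", stat_indep)):
    verdict = "reject" if stat < APPROX_CRITICAL_5PCT else "do not reject"
    print(f"{label}: statistic {stat:.2f}, {verdict} 'no cointegration'")
```

A strongly negative statistic indicates stationary residuals, so the regression in levels can be treated as genuine. For an unrelated pair the statistic usually stays above the critical value, although at the 5% level it will fall below it in a small share of samples.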

References

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation and testing. Econometrica, 55(2), 251–276.

Granger, C. W. J., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2(2), 111–120.

Nelson, C. R., & Plosser, C. I. (1982). Trends and random walks in macroeconomic time series: Some evidence and implications. Journal of Monetary Economics, 10(2), 139–162.

Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between time-series? A study in sampling and the nature of time-series. Journal of the Royal Statistical Society, 89(1), 1–63.