Learning Central Limit Theorem with Microsoft Excel

Many statistical and econometric procedures depend on the assumption of normality. The importance of the normal distribution lies in the fact that sums/averages of random variables tend to be approximately normally distributed regardless of the distribution of the draws. The central limit theorem (CLT) explains this fact, and it is very important since it provides the justification for most of statistical inference. The goal of this paper is to provide a pedagogical introduction to the CLT in the form of a self-study computer exercise. The paper presents a student-friendly illustration of how the central limit theorem works; the mathematics of the theorem is introduced in the last section of the paper.

CENTRAL LIMIT THEOREM

We start with an example in which we observe a phenomenon, and then we discuss the theoretical background of the phenomenon.

Consider 10 players playing with identical dice simultaneously. Each player rolls the die a large number of times. The six numbers on the die have an equal probability of occurrence on any roll and for any player. Let us ask the computer to generate data that resembles the outcomes of these rolls.

We need Microsoft Excel (2007 or later is preferable) for this exercise. Point to the 'Data' tab in the menu bar; it should show 'Data Analysis' in the tools bar. If Data Analysis is not there, then you need to install the Analysis ToolPak. For this, click on the Office button at the top left corner of the Microsoft Excel window and open Excel Options. Choose 'Add-Ins' from the left pane that appears, then check the box against 'Analysis ToolPak' and click OK.

In short: Office Button → Excel Options → Add-Ins → Analysis ToolPak → Go, from the screen that appears.

The computer will take a few moments to install the Analysis ToolPak. After the installation is done, you will see 'Data Analysis' on pointing again to the Data tab in the menu bar. The Analysis ToolPak provides a variety of tools for statistical procedures.

Using this tool pack, we will generate data that matches the situation described above.

Open an Excel spreadsheet and write 1, 2, 3, …, 6 in cells A1:A6.

Write '=1/6' in cell B1 and copy it down to B6.

This gives the possible outcomes of a roll of the die and their probabilities, as in the following table:

Outcome   Probability
1         0.167
2         0.167
3         0.167
4         0.167
5         0.167
6         0.167

Here the first column contains the outcomes of the roll of the die and the second column contains the probabilities of those outcomes. Now we want the computer to take some draws from this distribution; that is, we want the computer to roll the die and record the outcomes.

For this, go to Data → Data Analysis → Random Number Generation and select the discrete distribution. Set Number of Variables = 10 and Number of Random Numbers = 1000, enter A1:B6 as the Value and Probability Input Range, set the Output Range to A8, and click OK.

This will generate a 1000×10 matrix of die-roll outcomes in cells A8:J1007. Each column represents the outcomes for one player over 1000 draws, whereas each row represents the outcomes of the 10 players on a particular draw. In the next column, K, we want the sum of each row. Write '=SUM(A8:J8)' in K8 and copy it down. This generates a column containing the sum for each draw.
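For readers who prefer to check the numbers outside Excel, the same data can be generated with a few lines of Python. This is not part of the Excel exercise, just a minimal sketch using numpy; the seed is an arbitrary choice for reproducibility.

    import numpy as np

    rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility

    # Analogue of the 1000x10 block in A8:J1007: 1000 rolls for each of 10 players
    rolls = rng.integers(1, 7, size=(1000, 10))   # integers 1..6 (upper bound exclusive)

    # Analogue of column K: '=SUM(A8:J8)' copied down
    sums = rolls.sum(axis=1)
    print(rolls[:3], sums[:3])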

Now we are interested in knowing what the distribution of outcomes for each player is.

Let us ask Excel to count the frequency of each outcome for player 1. Choose Data → Data Analysis → Histogram and fill in the dialogue box that appears.

In the dialogue box, the Input Range is the column whose outcome frequencies we want to count (for player 1, this is A8:A1007) and the Bin Range is the range of possible outcomes (A1:A6). This process generates the frequency of the six possible outcomes for a single player. When we did this, we got the following output:

Bin    Frequency
1      155
2      154
3      160
4      169
5      179
6      183
More   0

The table above gives the frequencies of the outcomes; the same frequencies can be plotted in a bar chart. You can observe that the frequencies of occurrence are not exactly equal, but the heights of the vertical bars are approximately the same. This implies that the distribution of the draws is almost uniform, and we know this should happen because we made the draws from a uniform distribution. If we calculate the percentage of each outcome, we get 15.5%, 15.4%, 16.0%, 16.9%, 17.9% and 18.3% respectively. These percentages are close to the probability of each outcome, i.e. 16.67%.
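The same frequency count can be reproduced in Python. A minimal sketch (the counts will differ from the table above, since the draws are random):

    import numpy as np

    rng = np.random.default_rng(seed=1)   # arbitrary seed
    rolls = rng.integers(1, 7, size=(1000, 10))

    # Frequency of each face for player 1 (first column), the analogue of the histogram
    faces, counts = np.unique(rolls[:, 0], return_counts=True)
    for face, count in zip(faces, counts):
        print(face, count)                # each count should be near 1000/6 ≈ 167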

Now we want to check the distribution of the column that contains the sum of the draws for the 10 players, i.e. column K. The possible values of the sum range from 10 to 60 (if all columns contain 1 the sum is 10, and if all columns contain 6 the sum is 60; in all other cases it lies between these two numbers). It would be inappropriate to count the frequency of every number in this range, so let us make a few bins and count the frequencies of these bins. We choose the following bins: (10, 20], (20, 30], …, (50, 60]. Again we ask Excel to count the frequencies of these bins. To do this, write 10, 20, …, 60 in column M of the spreadsheet (these numbers are the boundaries of the bins we made). Now select Data → Data Analysis → Histogram and fill in the dialogue box that appears.

The Input Range is the range containing the sums of the draws, i.e. K8:K1007, and the Bin Range is the address of the cells where we wrote the boundary points of our desired bins. Completing this procedure produces the frequency of each bin. Here is the output that we got from this exercise:

Bin    Frequency
10     0
20     5
30     211
40     638
50     144
60     2
More   0

The first row of this output shows that no sum was as small as the boundary of the first bin, i.e. 10; the second, third, … rows give the frequencies of the bins (10, 20], (20, 30], …, respectively. The last row reports the frequency of numbers larger than the end point of the last bin, i.e. 60.

Below is the plot of this frequency table.

Obviously this plot bears no resemblance to the uniform distribution. Rather, if you remember the famous bell shape of the normal distribution, this plot is much closer to that shape.
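The binning of the sums can also be mirrored in Python. A sketch (note that numpy's histogram uses half-open bins, so the counts match Excel's only approximately at the boundaries):

    import numpy as np

    rng = np.random.default_rng(seed=1)   # arbitrary seed
    sums = rng.integers(1, 7, size=(1000, 10)).sum(axis=1)

    # Count sums in bins with boundaries 10, 20, ..., 60, as in column M
    counts, edges = np.histogram(sums, bins=[10, 20, 30, 40, 50, 60])
    for lo, hi, count in zip(edges[:-1], edges[1:], counts):
        print(f"{lo}-{hi}: {count}")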

Let us summarize our observations from this experiment. We have several columns of random numbers that resemble rolls of a die, i.e. the possible outcomes are 1, …, 6, each with probability 1/6 (a uniform distribution). If we count the frequencies of the outcomes in any single column, the outcomes reveal the distributional shape, and the histogram is almost uniform. The last column contained the sum of 10 draws from this uniform distribution, and we saw that the distribution of this column is no longer uniform; rather, it closely matches the shape of the normal distribution.

Explanation of the observation:

The phenomenon that we observed is explained by the central limit theorem. According to the central limit theorem, if $X_1, X_2, \ldots, X_n$ are independent draws from any distribution (not necessarily uniform) with finite variance, then the distribution of the sum of the draws $S_n = X_1 + X_2 + \cdots + X_n$, and likewise of the average of the draws $\bar{X}_n = S_n/n$, is approximately normal if the sample size $n$ is large.
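For completeness, the standard formal statement (in the usual notation, with $\mu$ and $\sigma^2$ the mean and variance of a single draw) is

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \;\xrightarrow{d}\; N(0,1) \qquad \text{as } n \to \infty.$$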

Mean and SE for sum of draws:

From our primary knowledge of random variables we know that

$$E(X+Y) = E(X) + E(Y)$$

and, for independent $X$ and $Y$,

$$Var(X+Y) = Var(X) + Var(Y).$$

Suppose $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$ for each draw $X_i$. Let $S_n = X_1 + X_2 + \cdots + X_n$; then $E(S_n) = n\mu$ and $Var(S_n) = n\sigma^2$, so that $SE(S_n) = \sigma\sqrt{n}$.

These two statements give the parameters of the normal distribution that emerges from the sum of random draws, which is the phenomenon we observed above.

Verification

Consider the exercise discussed above: columns A:J are draws from a die roll with expectation 3.5 and variance 2.91667. Column K is the sum of the 10 previous columns. Thus the expected value of K is 10 × 3.5 = 35 and its variance is 2.91667 × 10 = 29.1667. This also implies that the SE of column K is 5.40 (the square root of the variance).

The sample mean and variance in the above exercise can be calculated as follows:

Write '=AVERAGE(K8:K1007)' in any blank cell of the spreadsheet. This calculates the sample mean of the numbers in column K. The answer will be close to 35; when I did this, I found 34.95.

Write '=VAR(K8:K1007)' in any blank cell of the spreadsheet. This calculates the sample variance of the numbers in column K. The answer will be close to 29.17; when I did this, I found 30.02.
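The same verification can be done in code. A minimal Python sketch (the exact sample values will differ from the Excel run, since the draws are random):

    import numpy as np

    rng = np.random.default_rng(seed=1)   # arbitrary seed
    sums = rng.integers(1, 7, size=(1000, 10)).sum(axis=1)

    # Theoretical mean and variance of one fair-die roll: 3.5 and 35/12 ≈ 2.91667
    mu, var = 3.5, 35 / 12
    print("theory:", 10 * mu, 10 * var)              # 35.0 and 29.1667
    print("sample:", sums.mean(), sums.var(ddof=1))  # should be close to theory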

Summary:

In this exercise, we observed that if we take draws from a given distribution, the frequency of the draws reflects the probability structure of the parent distribution. But when we take the sum of the draws, the distribution of the sum reveals the shape of the normal distribution. This phenomenon has its roots in the central limit theorem, stated in the section above.


Spurious Regression With Stationary Time Series

A relationship is said to be spurious if the statistical summaries indicate that two variables are related to each other when in fact there is no theoretical relationship between them. This often happens in time series data, and there are many well-known examples of spurious correlation in time series. For example, Yule (1926) observed a strong relationship between marriages in church and the mortality rate in UK data. Obviously, it is very hard to explain how marriages in church could possibly affect mortality, but the statistics say that one variable has a very strong correlation with the other. This is a typical example of spurious regression. Yule (1926) thought that this happens due to a missing third variable.

The term spurious correlation was coined on or before 1897, that is, less than 15 years after the invention of regression analysis. In 1897, Karl Pearson wrote a paper entitled 'Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs'. The title indicates that the term spurious correlation was known at least as early as 1897, and that it was observed in data related to the measurement of organs; the reason for that spurious correlation was the use of indices. In the next 20 years, many more reasons for spurious correlation were unveiled, the most popular being the missing third variable: if X is a cause of Y and X is also a cause of Z, but Y and Z are not directly associated, then a regression of Y on Z will be spurious.
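The missing-third-variable mechanism is easy to reproduce by simulation. A minimal Python sketch (the variable names and coefficients are hypothetical, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=0)   # arbitrary seed
    n = 1000
    x = rng.normal(size=n)                # the common cause
    y = x + rng.normal(size=n)            # Y is driven by X, not by Z
    z = x + rng.normal(size=n)            # Z is driven by X, not by Y

    # Y and Z have no direct link, yet their correlation is about 0.5 by construction
    print(np.corrcoef(y, z)[0, 1])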

In 1974, Granger and Newbold (Granger later won the Nobel Prize) found that two non-stationary series may also yield spurious results even if there is no missing variable. This finding only added another item to the list of possible reasons for spurious regression. It cannot be used to argue that non-stationarity is the one and only reason for spurious regression, nor that spurious regression is a time series phenomenon. Unfortunately, however, economists adopted both misperceptions. First, they came to think of spurious regression as a time series phenomenon; second, although it is not explicitly stated, economists appear to assume that non-stationarity is the only cause of spurious regression. Consequently, most books and articles discussing spurious regression treat the phenomenon in the context of non-stationary time series.

Granger and his coauthors wrote a paper in 1998 entitled 'Spurious regressions with stationary series', in which they showed that spurious regression can occur in stationary data. They thus cleared up one common misconception, that spurious regression is due only to non-stationarity, but they were themselves caught in the second misconception, that spurious regression is a time series phenomenon. They define spurious regression as follows: 'A spurious regression occurs when a pair of independent series, but with strong temporal properties, are found apparently to be related according to standard inference in an OLS regression'. The use of the term temporal properties implies that they assume spurious regression to be a time-series-related phenomenon. But a hundred years earlier, Pearson had shown spurious regression in cross-sectional data.

Unit root and cointegration analysis were developed to cope with the problem of spurious regression. The literature argues that spurious regression can be avoided if there is cointegration. But unfortunately, cointegration can be defined only for non-stationary data. What is the way to avoid spurious regression if the underlying series are stationary? The literature is silent on this question.

Pesaran et al. (1998) developed a new technique, the ARDL bounds test, to test the existence of a level relationship between variables. People often confuse the level relationship with cointegration, and the term commonly used for the ARDL bounds test is 'ARDL cointegration', but in reality it does not necessarily imply cointegration. The findings of the bounds test are more general and imply cointegration only under certain conditions. The ARDL approach is capable of testing the long run relationship between a pair of stationary time series as well as between a pair of non-stationary time series. However, the long run relationship between stationary time series cannot be termed cointegration, because by definition cointegration is the long run relationship between non-stationary time series.

In fact, the ARDL bounds test is a better way to deal with spurious regression in stationary time series, but several misunderstandings about the test have restricted its usefulness. We will discuss the use and features of ARDL in a future blog post.


Can cointegration analysis solve spurious regression problem?

The efforts to avoid spurious regression have led to the development of modern time series analysis (see How Modern Time Series Analysis Emerged?). The core objective of unit root and cointegration procedures is to differentiate between genuine and spurious regressions. However, despite the huge literature, unit root and cointegration analysis are unable to solve the spurious regression problem. The reason lies mainly in a misunderstanding of the term spurious regression.

Spurious correlation/regression occurs when a pair of variables having no (or only a weak) causal connection appears to have a significant (strong) correlation/regression. In this sense, the term spurious correlation has nearly the same history as the term regression itself. Correlation and regression analysis were invented by Sir Francis Galton around 1888, and in 1897 Karl Pearson wrote a paper with the following title: 'Mathematical Contributions to the Theory of Evolution: On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs' (Pearson, 1897).

This title indicates a number of important things about the term spurious correlation: (a) the term was known as early as 1897, that is, less than 10 years after the invention of correlation analysis; (b) more than one type of spurious correlation was known to the scientists of that time, which is why the author used the phrase 'On a Form of Spurious Correlation'; (c) the spurious correlation was observed in the measurement of organs, a cross-sectional data set; and (d) the reason for the spurious correlation was the use of indices, not non-stationarity.

One can find in the classical econometric literature that many kinds of spurious correlation were known to experts in the first two decades of the twentieth century. These include spurious correlation due to the use of indices (Pearson, 1897), spurious correlation due to variations in the magnitude of populations (Yule, 1910), and spurious correlation due to the mixing of heterogeneous records (Brown et al., 1914). The most important reason, as the econometricians of that time understood it, was the missing third variable (Yule, 1926).

Granger and Newbold (1974) performed a simulation study in which they generated two independent random walk time series, x(t) = x(t-1) + e(t) and y(t) = y(t-1) + u(t). The two series are non-stationary, and the correlation of the error terms in the two series is zero, so the two series are totally independent of each other. The two variables have no common missing factor to which their movements could be attributed. A regression of the type y(t) = a + b x(t) + e(t) should therefore give an insignificant regression coefficient, but the simulation showed a very high probability of getting a significant coefficient. Therefore, Granger and Newbold concluded that spurious regression occurs due to non-stationarity.
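The Granger-Newbold experiment is easy to replicate. A minimal sketch using plain numpy (the sample size, number of replications and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(seed=42)   # arbitrary seed
    T, reps, rejections = 100, 1000, 0     # arbitrary sample size and replication count

    for _ in range(reps):
        # Two independent random walks, as in Granger and Newbold (1974)
        x = np.cumsum(rng.normal(size=T))
        y = np.cumsum(rng.normal(size=T))

        # OLS of y on a constant and x, with the usual t-statistic for the slope
        X = np.column_stack([np.ones(T), x])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        s2 = resid @ resid / (T - 2)
        se_slope = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
        rejections += abs(beta[1] / se_slope) > 1.96   # nominal 5% two-sided test

    print("share of 'significant' slopes:", rejections / reps)   # far above 0.05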

Three points are worth considering regarding the study of Granger and Newbold. First, the literature cited above clearly indicates that spurious correlation does exist in cross-sectional data, and the Granger-Newbold experiment is not capable of explaining cross-sectional spurious correlation. Second, the existing understanding of spurious correlation was that it happens due to missing variables; the experiment adds another reason for the phenomenon, but it cannot deny the existing understanding. Third, the experiment shows that non-stationarity is one of the reasons for spurious regression; it does not prove that non-stationarity is 'the only' reason.

However, unfortunately, the econometric literature that emerged after Granger and Newbold adopted the misconception. Many textbooks now discuss spurious regression only in the context of non-stationarity, which leads to the misconception that spurious regression is a non-stationarity-related phenomenon. Similarly, the discussion of the missing variable as a reason for spurious regression is usually absent from recent textbooks and other academic literature.

To show that spurious regression is not necessarily a time series phenomenon, consider the following example:

A researcher is interested in the relationship between shoe size and the mathematical ability of school students. He goes to a high school and takes a random sample of the students present in the school. He records the shoe size and the ability to solve mathematical problems of the selected students. He finds that there is a very high correlation between the two variables. Would this be sufficient to argue that the admission policy of the school should be based on the measurement of shoe size? If not, what accounts for this high correlation?

If the sample is selected from a high school having children in different classes, the same observation is almost sure to occur. Pupils in higher classes have larger shoe sizes and stronger mathematical skills, whereas pupils in lower classes have weaker mathematical skills; therefore, a high correlation is expected. However, if we take data from only one class, say grade III, we will not see such a high correlation. Since, theoretically, there should be no correlation between shoe size and mathematical skill, this apparently high correlation may be regarded as spurious correlation/regression. The reason for this spurious correlation is the missing age factor, which drives both shoe size and mathematical skill.
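This example can also be reproduced by simulation. A sketch with made-up numbers (the grades, class sizes and coefficients are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=7)   # arbitrary seed

    # Hypothetical school: grades 1-10 with 50 pupils each; age/grade drives both variables
    grade = np.repeat(np.arange(1, 11), 50)
    shoe  = grade + rng.normal(size=grade.size)   # shoe size grows with grade
    maths = grade + rng.normal(size=grade.size)   # math skill grows with grade

    print("pooled correlation:",
          np.corrcoef(shoe, maths)[0, 1])         # high (~0.9): spurious
    print("within grade 3 only:",
          np.corrcoef(shoe[grade == 3], maths[grade == 3])[0, 1])   # near zero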

Since this is not time series data, there is no question of non-stationarity, yet the spurious correlation exists. This shows that spurious correlation is not necessarily a time series phenomenon, and unit root and cointegration analysis are simply incapable of solving this problem.

Similarly, it can be shown that unit root and cointegration analysis can fail to work even with time series data; this will be discussed in our next blog post.


How Modern Time Series Analysis Emerged?

Regression analysis, the basic tool of econometrics, was invented in the 1880s by Francis Galton, a cousin of Charles Darwin, who is famous for his theory of evolution. Like Darwin, Galton was a biologist interested in the laws of heredity, and he intended to use regression to study those laws. He used regression to analyse cross-sectional data on the heights of fathers and sons. Regression analysis was then adopted and developed by economists for the analysis of economic data, without any distinction between time series and cross-sectional data.

It was soon discovered that the application of regression analysis to time series data could produce misleading results. In particular, regression analysis applied to time series data sometimes shows two series to be highly correlated when in fact there is no sensible economic relationship between the variables. This phenomenon was termed 'spurious regression'.

Yule (1926) wrote a detailed commentary on spurious regression in time series. He gave a number of examples in which two independent time series appear to be highly correlated. One of his examples was the relationship between marriages in the Church of England and the mortality rate. Obviously, the two variables have no causal connection, but Yule found a 95% correlation between them. Yule thought that this phenomenon arises because of some missing variable and could be avoided by taking into account all relevant variables. He further assumed that spurious regression would disappear if longer time series were available, meaning that increasing the length of the time series should gradually diminish the chances of spurious regression. For the next half century, the missing variable was thought to be the main reason for spurious correlation in time series.

In 1974, Granger and Newbold observed that, in the case of non-stationary time series, spurious regression may exist even if there is no missing variable. They further found that the probability of spurious regression increases with the length of the time series, contrary to the perception of Yule, who had thought that the probability of spurious regression would decrease as the time series grows longer. A few years later, in 1982, Nelson and Plosser analyzed a set of time series for the United States and found that most of these series are non-stationary. Many other studies supported the finding of Nelson and Plosser, creating doubt about the stationarity of economic time series.

If one combines the finding of Granger and Newbold with that of Nelson and Plosser, the conclusion would be that most regressions between economic time series are spurious because of the non-stationarity of the underlying series. These studies therefore put a big question mark over the validity of regression analysis for time series data.

In a later study, Engle and Granger (1987) showed that a regression of non-stationary time series can be genuine if the underlying series are 'cointegrated'. This means that if you have a set of time series variables which are non-stationary, you have to ensure that they are cointegrated as well, in order to ensure that the regression is not spurious.
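As an illustration, the Engle-Granger idea can be checked by simulation. The sketch below assumes the statsmodels package and its coint function (an Engle-Granger style residual-based test); the data-generating processes are made up for illustration:

    import numpy as np
    from statsmodels.tsa.stattools import coint   # Engle-Granger style cointegration test

    rng = np.random.default_rng(seed=3)   # arbitrary seed
    T = 500
    x = np.cumsum(rng.normal(size=T))     # non-stationary random walk
    y = 2.0 * x + rng.normal(size=T)      # shares x's stochastic trend: cointegrated with x
    z = np.cumsum(rng.normal(size=T))     # independent random walk: not cointegrated with x

    t_stat, p_value, _ = coint(y, x)
    print("y vs x: p =", round(p_value, 3))   # small p-value: cointegration, genuine relation
    t_stat, p_value, _ = coint(z, x)
    print("z vs x: p =", round(p_value, 3))   # large p-value: spurious-regression risk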

If you are running a regression between time series variables, you first have to check the stationarity of the series because, as warned by Nelson and Plosser and their predecessors, most economic time series are non-stationary. If the series are indeed non-stationary, then you have to make sure that they are cointegrated as well; otherwise the regression will be spurious.

Therefore, in order to test the validity of a regression analysis for time series, testing for stationarity and cointegration became the preliminary steps in the analysis of time series. A new stream of literature emerged focusing on tests for stationarity and cointegration, which gave rise to the current tools of time series analysis.

References

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation and testing. Econometrica, 55(2), 251–276.

Granger, C. W. J., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2(2), 111–120.

Nelson, C. R., & Plosser, C. R. (1982). Trends and random walks in macroeconomic time series: Some evidence and implications. Journal of Monetary Economics, 10(2), 139–162.

Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between time-series? A study in sampling and the nature of time-series. Journal of the Royal Statistical Society, 89(1), 1–63.
