Consider a hypothetical question, a researcher was given with a research question; compare the mathematical ability of male and female students of grade 5. The researcher collected data of 300 female students and 300 male students of grade 5 and administered a test of mathematical questions. The average score for female students was 80% and average score of male student was 50%, the difference was statistically significant and therefore, the researcher concluded that the female students have better mathematical aptitude.
The findings seem strong and impressive, but let me add into the information that the male students were chosen from a far-off village with untrained educational staff and lack of educational facilities. The female students were chosen from an elite school of a metropolitan city, where the best teachers of the city actually serve. What should be the conclusion now? It can be argued that actually difference doesn’t come from the gender, the difference is coming from the school type.
The researcher carrying out the project says ‘look, my research assignment was only to investigate the difference due to gender, the school type is not the question I am interested in, therefore, I have nothing to do with the school type’.
Do you think that the argument of researcher is valid and the findings should be considered reliable? The answer is obvious, the findings are not reliable, and the school type creates a serious bias. The researcher must compare students from the same school type. This implies you have to take care of the variables not having any mention in your research question if they are determinants of your dependent variable.
Now let’s apply the same logic to econometric modeling, suppose we have the task to analyze the impact of financial development on economic growth. We are running a regression of GDP growth on a proxy of financial development; we are getting a regression output and presenting the output as impact of financial development on economic growth. Is it a reliable research?
This research is also deficient just like our example of gender and mathematical ability. The research is not reliable if ceteris paribus doesn’t hold. The other variables which may affect the output variable should remain same.
But in real life, it is often very difficult to keep all other variables same. The economy continuously evolves and so are the economic variables. The other solution to overcome the problem is to take the other variables into account while running regression. This means other variables that determine your dependent variable should be taken as control variables in the regression. This means suppose you want to check the effect of X1 on Y using model Y=a+bX1+e. Some other research studies indicate that another model exist for Y which is Y=c+dX2+e. Then I cannot run the first model ignoring the second model. If I am running only model 1 ignoring the other models, the results would be biased in a similar way as we have seen in our example of mathematical ability. We have to use the variables of model 2 as control variable, even if we are not interested in coefficients of model 2. Therefore, the estimated model would be like Y=a+bX1+cX2+e
Taking the control variables is possible when there are a few models. The seminal study of Davidson, Hendry, Sarba and Yeo titled ‘Econometric modelling of the aggregate time-series relationship between …. (often referred as DHSY)’ summarizes the way to build a model in such a situation. But it often happens that there exists very large number of models for one variable. For example, there is very large number of models for growth. A book titled ‘Growth Econometrics’ by Darlauf lists hundreds of models for growth used by researchers in their studies. Life becomes very complicated when you have so many models. Estimating a model with all determinants of growth would be literally impossible for most of the countries using the classical methodology. This is because growth data is usually available at annual or quarterly frequency and the number of predictors taken from all models collectively would exceed number of observations. The time series data also have dynamic structure and taking lags of variables makes things more complicated. Therefore, classical techniques of econometrics often fail to work for such high dimensional data.
Some experts have invented sophisticated techniques for the modeling in a scenario where number of predictor becomes very large. These techniques include Extreme Bound Analysis, Weighted Average Least Squares, and Autometrix etc. The high dimensional econometric techniques are also very interesting field of econometric investigation. However, DHSY is extremely useful for the situations where there are more than one models for a variable based on different theories. The DHSY methodology is also called LSE methodology, General to Specific Methodology or simply G2S methodology.