Some of the most popular models used in data analysis rely on the so-called “black box” approach. In its simplest interpretation, a black box model is judged only by its inputs and outputs, that is, by the prediction power it delivers, and not by its internal workings.
If econometrics aims to estimate population parameters and support causal inference about them, the black box approach typical of data analysis is somewhat the opposite. We only care about actual and predicted responses, which we use to discriminate across models given a certain amount of data (captured in an observable sample). We contrast each prediction with the actual value, derive measures of the error, and select the model that best predicts the response variable, taking into account, of course, the tradeoff between bias and variance.
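The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the two candidate models (a mean-only baseline and a simple least-squares line) are chosen for brevity, and names such as `select_by_holdout` are hypothetical helpers, not part of any library.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Simple least-squares line y = a + b*x, fitted in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mean_squared_error(model, xs, ys):
    """Average squared gap between actual and predicted responses."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

def select_by_holdout(candidates, train, holdout):
    """Fit each candidate on the training sample, score it on the hold-out sample."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mean_squared_error(model, *holdout)
    best = min(scores, key=scores.get)
    return best, scores

# Simulated data (an assumption for illustration): y = 1 + 2x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
train = (xs[:150], ys[:150])
holdout = (xs[150:], ys[150:])

candidates = {"mean only": fit_mean, "linear": fit_line}
best, scores = select_by_holdout(candidates, train, holdout)
print(best, scores)
```

Nothing inside the candidate models is inspected here: each one is treated as a box that maps inputs to predictions, and the only criterion is the error on data it has not seen.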
An article by Mullainathan & Spiess (2017) in the Journal of Economic Perspectives gives a short description of the supervised and unsupervised approaches of machine learning. The out-of-sample performance of these methods is potentially greater than that of least squares. See the next table, taken from their article:
Source: Mullainathan & Spiess (2017, 90) Note: The dependent variable is the log-dollar house value of owner-occupied units in the 2011 American Housing Survey from 150 covariates including unit characteristics and quality measures. All algorithms are fitted on the same, randomly drawn training sample of 10,000 units and evaluated on the 41,808 remaining held-out units. The numbers in brackets in the hold-out sample column are 95 percent bootstrap confidence intervals for hold-out prediction performance, and represent measurement variation for a fixed prediction function. For this illustration, we do not use sampling weights. Details are provided in the online Appendix at http://e-jep.org.
In this exercise, a training sample and a test sample were used to calculate the “prediction performance”, measured by the R². In econometrics we would call this the goodness of fit of the model, that is, the share of the variation in the response that the model explains. It is no secret that a higher goodness of fit on the hold-out sample means higher prediction power (considering, of course, that we are never actually going to have an R² of 1 unless we have some overfitting issue).
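As a rough sketch of the measure, the R² can be computed from actual and predicted values as one minus the ratio of the residual sum of squares to the total sum of squares. The function below is an illustrative assumption about the formula (the total sum of squares is taken around the mean of the evaluated sample); the exact convention used in the article is not confirmed here.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_residual / SS_total, with SS_total around the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0: perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: no better than the mean
print(r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -3.0: worse than the mean
```

When the function is applied to a hold-out sample, nothing forces the result to be positive: a model that predicts worse than the sample mean gets a negative value.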
When you compare the results in the hold-out sample column of Table 1, you can find that some other approaches outperform least-squares regression in terms of prediction power. One example is the row corresponding to the LASSO estimates, which shows better prediction performance than least squares. The LASSO model is therefore capturing the behavior of the response variable somewhat better (at least for this sample).
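To make the comparison between least squares and LASSO concrete, here is a minimal sketch of LASSO fitted by coordinate descent. It is not the implementation used by Mullainathan & Spiess: the data are simulated, the penalty values are arbitrary, no intercept is fitted, and `lasso_coordinate_descent` is a hypothetical helper written for this illustration. Setting the penalty to zero gives the least-squares coefficients, so both models come from the same function.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, sweeps=200):
    """Minimise (1/2n) * sum of squared residuals + penalty * sum(|b_j|).

    X is a list of rows. No intercept is fitted, so the data are assumed
    to be (roughly) centred. penalty = 0 gives least squares.
    """
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residuals = list(y)  # residuals when all coefficients are zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j
            rho = sum(X[i][j] * residuals[i] for i in range(n)) / n
            rho += col_scale[j] * coefs[j]
            new_coef = soft_threshold(rho, penalty) / col_scale[j]
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    residuals[i] -= delta * X[i][j]
                coefs[j] = new_coef
    return coefs

def mean_squared_error(X, y, coefs):
    """Average squared prediction error of the linear model on (X, y)."""
    return sum(
        (yi - sum(c * v for c, v in zip(coefs, row))) ** 2
        for row, yi in zip(X, y)
    ) / len(y)

# Simulated data (an assumption for illustration): 5 covariates, only 2 matter.
rng = random.Random(0)
true_coefs = [3.0, -2.0, 0.0, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in true_coefs] for _ in range(160)]
y = [sum(c * v for c, v in zip(true_coefs, row)) + rng.gauss(0, 1) for row in X]
X_train, y_train = X[:120], y[:120]
X_hold, y_hold = X[120:], y[120:]

for penalty in (0.0, 0.1, 0.5):
    coefs = lasso_coordinate_descent(X_train, y_train, penalty)
    hold_mse = mean_squared_error(X_hold, y_hold, coefs)
    print(penalty, [round(c, 2) for c in coefs], round(hold_mse, 3))
```

Whether the penalized fit beats least squares on the hold-out sample depends on the data and on the penalty, which in practice would be tuned (for example by cross-validation) rather than fixed by hand as it is here.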
At this point one should ask what the objective of the analysis is. If we are aiming at statistical inference and the estimation of population parameters, we should stick to the non-black-box approaches, such as traditional LS, GMM, or 2SLS, to mention a few. But if we are more interested in prediction power and performance, the black box approaches will surely come in handy, and sometimes they may predict better than the econometric procedures designed to estimate population parameters. The way I see it, the black box, even when its finer details are unknown to us, has the ability to adapt itself to the data (although this applies to the broader variety of machine learning methods and algorithms rather than to penalized regression).
As the authors note in their article, it could be tempting to draw conclusions from these methods as we usually do in econometrics, but first we need to consider some of the limitations of the black box approaches. Among them: 1) correlation among the covariates can drive which variables a method picks up; 2) producing standard errors becomes harder; 3) some of the methods are inconsistent if we change the initial conditions; 4) there is a risk of choosing an incorrect model, which may induce omitted variable bias.
However, even with the above problems, we can draw some useful connections between the black box approaches and econometric methods. The advantage of machine learning over traditional econometric models may be greatest in the context of large samples, in which the researcher needs to choose a set of relevant covariates to define or test a theory. It can also be a useful complement to econometric analysis for policymakers. It gives the economist “a tool to infer actual from reported” values and to proceed with comparisons given the researcher’s samples.
We can also use this prediction power to address some of the problems in estimating population parameters. As the authors point out, consider the case of two-stage least squares, where the first stage requires predicting the endogenous regressor from an instrument. The black box approach may produce better predictions there, which are then included in the second-stage regression. It should be noted, however, that the selected instruments must be at least reasonably exogenous, because the black box left alone would just pick up correlations and possibly bring in reverse causality problems.
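The two-stage logic can be sketched as follows, under strong simplifying assumptions: one endogenous regressor, one instrument that is exogenous by construction, simulated data with a true effect of 2, and a plain least-squares line as the first-stage predictor. The `fit_first_stage` argument is a hypothetical hook showing where a different prediction function could be plugged in; it is not an API of any library, and the sketch says nothing about standard errors.

```python
import random

def fit_linear_predictor(inputs, targets):
    """Fit a least-squares line and return it as a prediction function."""
    n = len(inputs)
    mean_in, mean_out = sum(inputs) / n, sum(targets) / n
    s_in = sum((v - mean_in) ** 2 for v in inputs)
    s_cross = sum((v - mean_in) * (t - mean_out) for v, t in zip(inputs, targets))
    slope = s_cross / s_in
    intercept = mean_out - slope * mean_in
    predictor = lambda v: intercept + slope * v
    predictor.slope = slope  # kept so the caller can read the estimate
    return predictor

def two_stage_slope(instrument, regressor, outcome,
                    fit_first_stage=fit_linear_predictor):
    """Stage 1: predict the regressor from the instrument.
    Stage 2: regress the outcome on that prediction and return the slope."""
    first_stage = fit_first_stage(instrument, regressor)
    predicted = [first_stage(z) for z in instrument]
    return fit_linear_predictor(predicted, outcome).slope

# Simulated data (an assumption for illustration):
# u is an unobserved confounder, z is an instrument independent of u.
rng = random.Random(1)
n = 10000
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2 * xi + 3 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

ols_slope = fit_linear_predictor(x, y).slope  # biased upward by the confounder
iv_slope = two_stage_slope(z, x, y)           # uses only variation driven by z
print(round(ols_slope, 2), round(iv_slope, 2))
```

In this simulation the naive regression of y on x absorbs the confounder, while the two-stage estimate lands close to the true effect because the first-stage prediction depends only on the instrument. If the instrument were correlated with the confounder, a better first-stage predictor would not repair the estimate.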
Supervised or unsupervised machine learning methods may provide a better understanding from a different angle, and by this I mean the “black box” approach. Even when it is not strictly part of causal analysis, it may be useful for selecting possible covariates of a phenomenon. The reasoning behind the analysis and the selected outcome should therefore always be examined critically in order to provide the best inference. From this perspective, even when we don’t know exactly what happens inside the box, the output of the black box is itself giving us useful information.
This is a topic that is constantly being reviewed and enhanced for real-world applications. I believe that the bridge between the black box approaches of machine learning and econometric theory will grow stronger over time, considering, of course, the growing informational needs of society.
Aravindan, G. (2019) Challenges of AI-based adoptions: Simplified, Sogetilabs. Retrieved from: https://labs.sogeti.com/challenges-of-ai-based-adoptions/
Mullainathan, S. & Spiess, J. (2017) Machine Learning: An Applied Econometric Approach, Journal of Economic Perspectives, Volume 31, Number 2, Spring 2017, Pages 87–106.
Rudin, C. & Radin, J. (2019) Why Are We Using Black Box Models in AI When We Don’t Need To? A Lesson From An Explainable AI Competition. Retrieved from: https://hdsr.mitpress.mit.edu/pub/f9kuryi8/release/6