I was just reading the Reddit thread "My issue with data science" in r/datascience. One of the main points made in the thread is that prediction is a fundamentally different game from causal inference. When we deal with real-world data, it often isn't feasible to design controlled experiments that would let us perform causal inference. In that case, and since we often primarily care about prediction (that is, we often don't care why something happens, just that it happens), people fall back on purely predictive methods.
In the real world, we often have (1) limited data that was (2) not generated through any kind of controlled experiment. My understanding is that this is the worst of both worlds: having large amounts of data, even if it was not generated through any kind of controlled experiment, enables us to make good predictions (using, for instance, deep learning), and having limited data that was generated by a strictly controlled experiment also enables us to make good predictions.
So what statistical methods/tools are statistically sound in such cases? What statistical methods can we use to squeeze as much predictive value as possible out of limited data that was generated without any experimental design or controls? Are there any machine learning tools that are appropriate here, or do they all require lots of data? What research should I be looking at? Someone mentioned that Bayesian methods are good for this, but I don't know enough to have an opinion.
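For context on that last point, here is a minimal sketch of what people usually mean by "Bayesian methods help with small data": a conjugate Bayesian linear regression in plain NumPy, where a Gaussian prior on the weights regularizes the fit (much like ridge regression) and the posterior gives honest uncertainty that stays wide when the data is scarce. Everything here (the tiny synthetic dataset, the prior precision `alpha`, the assumed-known noise level) is an illustrative assumption, not a recommendation for any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny "observational" dataset: 12 points from y = 1 + 2x + noise.
# No experimental design here -- x is just sampled however it came.
n = 12
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # design matrix with intercept
true_w = np.array([1.0, 2.0])
y = X @ true_w + rng.normal(0.0, 0.3, n)

# Conjugate Bayesian linear regression with a zero-mean Gaussian prior on
# the weights and a noise precision assumed known for simplicity.
alpha = 1.0            # prior precision on the weights (acts like a ridge penalty)
beta = 1.0 / 0.3**2    # noise precision (assumed known here)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
m_N = beta * S_N @ X.T @ y                               # posterior mean

# Predictive distribution at a new input: both a point prediction and a
# variance that grows away from the observed data.
x_new = np.array([1.0, 0.5])
pred_mean = x_new @ m_N
pred_var = 1.0 / beta + x_new @ S_N @ x_new

print("posterior mean weights:", m_N)
print("prediction at x=0.5:", pred_mean, "+/-", np.sqrt(pred_var))
```

The point of the example is the last two lines: with only 12 points, a maximum-likelihood fit would hand back a single number, while the posterior predictive variance `pred_var` makes the limited-data uncertainty explicit.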