This is classic data mining, and it should never be relied upon to make future forecasts.
Salil Mehta, former TARP director of analytics and author of “Statistics Topics,” has been critical of pollsters’ election forecasts. He spent much of the time before the election lecturing them that their models were underestimating the possibility of a Trump victory. In an email exchange, he observed:
There is an increased craving to slice and dice the recent election data, particularly given that the major pollsters have been shamed as they all immensely errored in projecting this year’s election’s victor. All gave President-elect Trump <15% a faux probability of winning. The risk of now retorting with data-mining this single election result is that they often miss an analysis of the predictive errors in this unique match-up (e.g., record high undecideds on Election eve), don’t take into account budding geospatial patterns to validate evidence, and in most cases none of this should deceptively be promoted as an election forecasting model.
Correlations are very different from what is required to create a reliable model that correctly forecasts future election or investing outcomes. Rather than mine data, Mehta suggests we engage in hypothesis testing: state the claim first, then check it against the evidence.
The obvious parallel to investing is the myriad of back-tested strategies, many of which engage in the same sort of data mining as the recent election post-mortems. They appear to work perfectly on historical data, but they prove far less robust once applied to new data. Models that inform us of what has already happened but not what might occur in the future are of limited value.
Cliff Asness of AQR warns us not to confuse factor investing with data mining. He notes that Fama-French factors such as value, momentum and size have all been tested out of sample and proven to be robust. Out-of-sample testing could verify whether an election model’s backtest is valid: Take the five data claims above, then apply them to Obama versus McCain or Bush versus Gore to see if they are at all predictive. The same is true for investing models. To avoid poorly constructed models that are form-fitted to past experience, test them on data sets other than the ones used to build them.
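The difference between a mined result and an out-of-sample test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anyone's actual forecasting model: the "signals" and the "target" are pure random noise, and the function name `mine_and_validate` and its parameters are hypothetical. The sketch picks whichever noise signal best fits the first half of the data, then checks that same signal against the second half, which it has never seen.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_and_validate(n_signals=200, n_train=20, n_test=20, seed=0):
    """Hypothetical helper: mine noise in-sample, then test out of sample.

    Assumption: every series is independent Gaussian noise, so no signal
    has any real relationship to the target.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    signals = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_signals)]

    # Data-mining step: keep whichever signal fits the past best.
    in_sample = [pearson(s[:n_train], target[:n_train]) for s in signals]
    best = max(range(n_signals), key=lambda i: abs(in_sample[i]))

    # Out-of-sample step: score the chosen signal on data it never saw.
    out_of_sample = pearson(signals[best][n_train:], target[n_train:])
    return best, in_sample[best], out_of_sample


if __name__ == "__main__":
    best, r_in, r_out = mine_and_validate()
    print(f"best signal #{best}: in-sample r={r_in:.2f}, out-of-sample r={r_out:.2f}")
```

With enough candidate signals, one of them will almost always look impressive on the data used to select it. Averaged over many random seeds, the in-sample fit of the winning signal tends to be much stronger than its fit on the held-out half, which is the pattern a back-tested sales pitch hides.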
If a gold mine is a hole in the ground with a liar standing on top of it, a successful data miner is a quant with a data set, lying to himself. You probably have never seen a sales pitch that didn’t have a backtest “proving” market-beating returns. If only you had a time machine to go back to the period covered by the data set.
Investing after the fact is easy. Investors should be cautious when presented with results that tell them only what just happened, not what is about to occur.
1 Lots of “multicollinearities” (economic inequality, poor health, low educational attainment) may be associated with Trump voters, but they are not likely to forecast the next election. For example, higher education (and therefore better health and possibly higher income) might indicate a proclivity toward voting red or blue, but as Mehta points out, not all college degrees are created equal. Some generate much greater potential future incomes than others (“nonheterogeneous”).