With the biggest event in the soccer calendar now underway, fans are speculating on which team might emerge victorious from the 2018 World Cup in Russia – and an artificial intelligence model based on 100,000 simulations has made its prediction too.
Using a database of statistics from previous tournaments, and three different AI methods to crunch the numbers, the international team of researchers behind the work thinks Spain is going to emerge victorious… but it's going to be close.
At the moment the bookmakers are backing Germany to be World Cup winners, but the AI analysed both the strength of the teams and their route to the final. While Germany would beat Spain in a one-off game, the models showed, the German team is likely to face more difficult opponents through the course of the competition.
"By analysing the winning probabilities conditional on reaching the single stages of the tournament, it turns out that the fact that overall Spain is slightly favoured over Germany is mainly due to the fact that Germany has a comparatively high chance to drop out in the round of 16," write the researchers.
The research team was made up of academics from Technische Universität Dortmund and the Technical University of Munich in Germany, and Ghent University in Belgium.
They used three techniques to process the statistics: Poisson regression (modelling the goals a team scores as a function of many weighted variables), a ranking method (evaluating the current strength of each team), and the random forest approach.
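To give a feel for the Poisson idea, here is a minimal sketch in Python. It is not the researchers' model: the features, coefficients and intercept are made-up values, and the function names are hypothetical. It assumes each team's goal count follows an independent Poisson distribution whose expected value comes from a weighted sum of team variables.

```python
import math

def expected_goals(features, coefficients, intercept):
    """Poisson regression link: expected goals = exp(intercept + weighted sum)."""
    return math.exp(intercept + sum(c * x for c, x in zip(coefficients, features)))

def poisson_pmf(k, lam):
    """Probability of scoring exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def match_probabilities(lam_a, lam_b, max_goals=15):
    """Win/draw/loss probabilities for team A, assuming independent Poisson scores."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss

# Hypothetical inputs (not from the study): [ranking advantage, squad-quality score]
COEFFICIENTS = [0.3, 0.1]   # assumed weights
INTERCEPT = 0.2             # assumed baseline

lam_a = expected_goals([0.5, 1.0], COEFFICIENTS, INTERCEPT)
lam_b = expected_goals([-0.5, 0.2], COEFFICIENTS, INTERCEPT)
win, draw, loss = match_probabilities(lam_a, lam_b)
print(f"A wins {win:.2f}, draw {draw:.2f}, B wins {loss:.2f}")
```

In a real analysis the weights would be estimated from past matches rather than chosen by hand.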
For the World Cup, the random forest method is particularly suitable. It can be structured like a football tournament, but its secret sauce is in the way it uses machine learning to map out random branches for a decision tree – different scenarios and their consequences – many times over.
By getting the AI to place emphasis on different factors when producing the decision trees, repeating the simulations and then comparing across factors, the forest method can give statisticians a better look at which factors are most important in the ultimate outcome – such as the opponents Germany and Spain are likely to face.
Compared with other mathematical models, it also gives more accurate information later on in the decision tree, where training data might be scarce.
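The core of a random forest (many trees, each grown on a random resample of the data and a random choice of variables, then averaged) can be sketched in a few lines of Python. This is a toy, not the study's implementation: every tree here is only one split deep, the data are synthetic, and the two features are hypothetical stand-ins for things like ranking difference.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split tree: pick the threshold on `feature` with least squared error."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    if best is None:  # every sampled value identical: predict the mean everywhere
        mean = sum(targets) / len(targets)
        return (feature, 0.0, mean, mean)
    _, thr, lm, rm = best
    return (feature, thr, lm, rm)

def fit_forest(rows, targets, n_trees=200, seed=0):
    """Grow each tree on a bootstrap sample using one randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feature = rng.randrange(n_features)          # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [lm if row[f] <= thr else rm for f, thr, lm, rm in forest]
    return sum(preds) / len(preds)

# Synthetic data (an assumption): feature 0 drives the target, feature 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(60)]
targets = [1.0 + (1.0 if r[0] > 0 else 0.0) + data_rng.gauss(0, 0.1) for r in rows]

forest = fit_forest(rows, targets)
strong = predict(forest, [0.8, 0.0])   # high value of the informative feature
weak = predict(forest, [-0.8, 0.0])    # low value of the informative feature
print(strong, weak)
```

Full-size forests grow much deeper trees and re-draw the candidate variables at every split, but the resample-and-average principle is the same.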
The researchers plugged in a whole host of source data to inform their models. By looking back at previous tournaments, they found some factors were important (like the number of Champions League players on a team) and some were less so (like the nationality of a team's coach).
With the statistics showing that a combination of ranking methods and random forests was the most accurate predictor, the 100,000 simulations of the 2018 World Cup were run based on this combined model.
Spain emerged at the top, with a 17.8 percent chance of winning. Germany is second on 17.1 percent, Brazil third on 12.3 percent, France fourth on 11.2 percent, and Belgium fifth on 10.4 percent.
Saudi Arabia, meanwhile, has a zero percent chance of reaching the final. (Apologies to anyone rooting for them.)
Keep your eyes on your tournament wall charts, though – should Germany make it through to the quarter-finals (the last eight teams), the model gives it the same chance of lifting the trophy as Spain.
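Why a stronger team can still be the less likely champion, and why its odds recover once it survives an early round, can be illustrated with a small simulation in Python. Everything here is an assumption for illustration: the eight teams "A" to "H", their strength ratings, the bracket, and the rule that a team wins with probability proportional to its strength. In this toy bracket the first round is the quarter-finals, so the survivors are semi-finalists.

```python
import random
from collections import Counter

# Hypothetical strength ratings, not figures from the study.
STRENGTH = {"A": 2.0, "B": 1.9, "C": 1.2, "D": 1.0,
            "E": 1.8, "F": 1.0, "G": 0.9, "H": 0.8}

# Adjacent teams meet first: A (strongest) draws B, while E gets an easier path.
BRACKET = ["A", "B", "C", "D", "E", "F", "G", "H"]

def play(team_a, team_b, rng):
    """Pick a winner with probability proportional to strength (assumed toy rule)."""
    p_a = STRENGTH[team_a] / (STRENGTH[team_a] + STRENGTH[team_b])
    return team_a if rng.random() < p_a else team_b

def simulate_knockout(bracket, rng):
    """Play one single-elimination tournament; return the teams alive in each round."""
    rounds = [list(bracket)]
    while len(rounds[-1]) > 1:
        teams = rounds[-1]
        rounds.append([play(teams[i], teams[i + 1], rng)
                       for i in range(0, len(teams), 2)])
    return rounds

def run(bracket, n_sims=100_000, seed=42):
    """Repeat the tournament and count titles and first-round survivors per team."""
    rng = random.Random(seed)
    titles, semis = Counter(), Counter()
    for _ in range(n_sims):
        rounds = simulate_knockout(bracket, rng)
        titles[rounds[-1][0]] += 1   # champion of this simulated tournament
        semis.update(rounds[1])      # teams that survived the first round
    return titles, semis

titles, semis = run(BRACKET)
for team in ("A", "E"):
    overall = titles[team] / sum(titles.values())
    conditional = titles[team] / semis[team]
    print(team, f"overall {overall:.3f}", f"after surviving round one {conditional:.3f}")
```

With these made-up ratings, team A is the stronger side in a head-to-head with E, yet E wins the title more often because A's first match is so tough. Once A has survived that match, its chance of winning overtakes E's.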
Not everyone agrees with that assessment though. Financial firm Goldman Sachs has trained its own machine learning engines on the World Cup in Russia and reckons Brazil is going to emerge victorious. The Goldman Sachs calculations also involved elements of random forest modelling.
Of course soccer, and sport in general, has a long history of upsetting both bookmakers and AI models. It's very difficult to take into account everything that might affect a team's performance – and what machine learning algorithm can predict a spectacular goal being created out of nothing?
Still, it's interesting to look at the statistical methods involved in predicting what's going to happen in Russia. By the evening of the 15th of July, we'll know if those hundred thousand AI simulations got it right.
The study has yet to be peer-reviewed, but is available on the arXiv.org pre-print server.