Data helps make decisions in football, but most of the data out there is used to analyse teams and players in post-match analysis. This means that we look back at events after they have happened and assess performances. This shows us how players have done in a particular time period or how a team has fared against other teams.
The next step is to look at how we can use data to forecast certain events. We can try to predict outcomes and results, but if we look more closely at specific data, we can get an idea of how future events might turn out based on historical data.
In this analysis, I want to look at shot locations over the season and use them to forecast where shots will take place in the next match. We can do that with an AutoRegressive Integrated Moving Average (ARIMA) model.
- What is ARIMA?
- Why do we need it?
- Data
- Methodology
- Visualisation
- Final thoughts
What is AutoRegressive Integrated Moving Average (ARIMA)?
ARIMA (AutoRegressive Integrated Moving Average) is a popular model used in time series forecasting to analyze and predict future values based on past data. The model combines three main components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).
- AutoRegressive (AR): This part models the relationship between an observation and a specified number of lagged observations, or previous values in the time series. The parameter p indicates the number of lagged terms to consider.
- Integrated (I): This component represents differencing, which is a way to make the time series stationary by subtracting the previous value from the current one. The parameter d is the number of differencing steps applied to stabilize the mean of the series.
- Moving Average (MA): This part focuses on the relationship between an observation and a residual error from a moving average model applied to lagged observations. The parameter q defines the number of lagged forecast errors included.
ARIMA is commonly written as ARIMA(p, d, q), where the values of p, d, and q determine the complexity of the model. By combining these three components, ARIMA can capture trends and short-term dependence in a series, which makes it a useful tool for time series prediction. Seasonal patterns need the seasonal extension of the model (SARIMA).
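To make the differencing step (the "I" and the parameter d) more concrete, here is a minimal sketch in plain Python. The function name `difference` and the shots-per-match numbers are made up for illustration; they are not the data used in this analysis.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_(t-1)."""
    out = list(series)
    for _ in range(d):
        # Each pass subtracts the previous value from the current one.
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Made-up shots-per-match counts with an upward trend (illustrative only).
shots_per_match = [10, 12, 13, 15, 18, 19, 21]

once = difference(shots_per_match, d=1)
twice = difference(shots_per_match, d=2)

print(once)   # [2, 1, 2, 3, 1, 2]
print(twice)  # [-1, 1, 1, -2, 1]
```

After one round of differencing the upward trend is gone and the values hover around a stable level, which is what the AR and MA parts of the model expect. Each round also shortens the series by one value.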
In the context of football, ARIMA can model shot frequencies from different zones on the pitch, revealing trends and patterns in shooting behaviour. The model can help identify which zones are becoming more popular for shooting over time.
Why do we need it?
So why do we need it? As I said before, we mostly focus on post-match analysis and look back at what has happened. But data can be used for pre-match analysis as well, and most definitely as a means to prepare for the opposition. Coach analysis is growing every year and this can help in that regard.
With this new data, we can predict where an opponent's next shots will come from, so we can prepare for them defensively. It can also help our attacking thinking and make sure we get the most out of it, whether that means maximising expected goals or seeing if we need to tweak our attacking open play.
Data
The data we are using for this metric comes from Opta. The shot locations are from event data from Opta and the expected goals values come from my own model. That model looks at different variables to make sure the xG values are contested. The data was collected on October 29th, 2024.
The data will be manipulated and categorised so it will fit the needs of our research. More will be explained in the methodology section.
Methodology
A few things have to be done in order to get the right results. First, we have to gather all the data, which will be the basis for our database and calculations.
After that, we will divide the pitch into different zones. We need the zones to determine where shots are taken from most or least, which helps us determine a strategy or predict where shots will come from. Will they come from the central zones, the half-spaces, the wide areas or deeper on the pitch? We will divide the pitch into 18 zones, so we can see clearly where the shots come from.
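As a rough sketch of how a shot location could be mapped to one of 18 zones: the code below assumes Opta-style coordinates, where x and y both run from 0 to 100, and a grid of 6 columns along the length of the pitch by 3 lanes across the width. The grid layout, the zone numbering and the function name `shot_zone` are my assumptions for illustration, not necessarily the exact split used here.

```python
def shot_zone(x, y, n_cols=6, n_rows=3):
    """Return a zone number from 1 to n_cols * n_rows for a pitch location.

    Assumes x and y are on a 0-100 scale, with x running towards the
    attacking goal and y running across the width of the pitch.
    """
    # Clamp the bin index so that x == 100 or y == 100 lands in the last bin.
    col = max(0, min(int(x / 100 * n_cols), n_cols - 1))
    row = max(0, min(int(y / 100 * n_rows), n_rows - 1))
    # Zones are numbered lane by lane within each column, starting at 1.
    return col * n_rows + row + 1


print(shot_zone(88.5, 50.0))  # 17: final sixth of the pitch, central lane
print(shot_zone(50.0, 50.0))  # 11: just past halfway, central lane
```

With this numbering, zones 16 to 18 cover the sixth of the pitch closest to the opposition goal, and the middle zone of each column is the central lane.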
Now, I will filter the data down to the specific team we will focus on: Liverpool in the English Premier League. The season is 2024/2025, and we use the historical data (that is, the current season's data so far) to see how many shots come from each bin/zone on the field. That shows us what has happened so far and gives us a starting point for the prediction.
However, I’ve found that this isn’t the right approach here, as I want to look more closely at clustering shots. Zones are a good option, but they are rather static. That’s why I will look at clustering those shots.
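The filtering and counting step could look something like the sketch below. The field names (`match_id`, `team`, `zone`) and the toy records are hypothetical and not Opta's actual schema; the point is only to show how shots turn into one time series per zone, which is the shape a model like ARIMA works on.

```python
from collections import Counter, defaultdict


def zone_counts_per_match(shots, team):
    """Map each match_id to a Counter of zone -> number of shots by `team`."""
    counts = defaultdict(Counter)
    for shot in shots:
        if shot["team"] == team:
            counts[shot["match_id"]][shot["zone"]] += 1
    return dict(counts)


def zone_series(counts, match_order, zone):
    """Shots from one zone, match by match, in chronological order."""
    return [counts.get(match, Counter())[zone] for match in match_order]


# Toy records for illustration only (not real Liverpool data).
shots = [
    {"match_id": 1, "team": "Liverpool", "zone": 17},
    {"match_id": 1, "team": "Liverpool", "zone": 17},
    {"match_id": 1, "team": "Liverpool", "zone": 14},
    {"match_id": 1, "team": "Opponent", "zone": 17},
    {"match_id": 2, "team": "Liverpool", "zone": 14},
    {"match_id": 3, "team": "Liverpool", "zone": 17},
]

counts = zone_counts_per_match(shots, "Liverpool")
print(zone_series(counts, [1, 2, 3], zone=17))  # [2, 0, 1]
print(zone_series(counts, [1, 2, 3], zone=14))  # [1, 1, 0]
```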
In the above shotmap, we see all of Liverpool’s shots in the 2024–2025 season so far. This shows us Liverpool’s expected goals and the locations of the shots. There is a correlation between shot distance and the expected goals value: shots closer to goal tend to carry higher xG.
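For readers who want to check the link between distance and xG on their own data, here is a small sketch. It assumes Opta-style 0-100 coordinates with the centre of the goal at (100, 50) and a 105 x 68 metre pitch; the pitch size, the function names and the five toy shots are assumptions for illustration and are not values from my xG model.

```python
from math import hypot, sqrt

PITCH_LENGTH_M = 105.0  # assumed pitch length in metres
PITCH_WIDTH_M = 68.0    # assumed pitch width in metres


def distance_to_goal(x, y):
    """Distance in metres from a 0-100 scaled location to the goal centre."""
    dx = (100.0 - x) / 100.0 * PITCH_LENGTH_M
    dy = (50.0 - y) / 100.0 * PITCH_WIDTH_M
    return hypot(dx, dy)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in xs))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (sd_x * sd_y)


# Toy shots as (x, y, xG); the values are invented for illustration.
toy_shots = [(95, 50, 0.45), (88, 45, 0.20), (85, 60, 0.12),
             (78, 50, 0.05), (70, 35, 0.02)]

distances = [distance_to_goal(x, y) for x, y, _ in toy_shots]
xg_values = [xg for _, _, xg in toy_shots]
r = pearson(distances, xg_values)
print(round(r, 2))  # negative: longer distance goes with lower xG
```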
The next step is to look at clusters.
In the shot map above we have divided the shots into 4 clusters. Clustering is a technique in machine learning and statistics used to group data points into clusters based on their similarities. The goal is to categorise data so that items within each group (cluster) are more alike to each other than to those in other groups.
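To show what clustering shot locations into 4 groups can look like, here is a minimal sketch using k-means written in plain Python. I am assuming k-means purely for illustration, since it is the most common choice; the toy locations and the `kmeans` helper are hypothetical.

```python
from math import dist


def kmeans(points, k, n_iter=100):
    """Basic k-means (Lloyd's algorithm) on a list of (x, y) tuples."""
    # Farthest-point initialisation: deterministic, and it spreads the
    # starting centroids out across the data.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(n_iter):
        # Assignment step: each shot joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy shot locations on a 0-100 scale (illustrative only).
points = [
    (92, 48), (94, 52), (93, 50),   # close and central
    (85, 30), (84, 28), (86, 32),   # one side of the box
    (85, 70), (86, 72), (84, 68),   # the other side of the box
    (72, 50), (70, 48), (71, 53),   # long range
]

labels, centroids = kmeans(points, k=4)
print(labels)
```

Each shot ends up with a cluster label and each cluster has a centre, which is a more flexible summary than fixed zones because the groups follow where the shots actually are.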
Right now we have the shots clustered and we see what Liverpool has done so far. Now we have to go to the part where we calculate the prediction.
To do so, we take the average number of shots per game, which is 14 over the first 9 games of the season. We use that number to calculate the locations of the next 14 shots, i.e. the shots for the next game. This is done in Python.
Based on the first 9 matches, these are the shot locations for the predicted next match. With this information we can look into some concrete takeaways and actions.
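The code for this step is not shown in this article, so the sketch below is only an illustration of the idea under my own assumptions. It treats the x and y coordinates of the shots, in chronological order, as two separate time series and fits the simplest ARIMA variant, ARIMA(1, 0, 0), which is an AR(1) model with a constant, by least squares. The model order, the helper names and all the numbers are made up.

```python
def shots_to_forecast(shots_per_match):
    """Average shots per match, rounded to a whole number of shots."""
    return round(sum(shots_per_match) / len(shots_per_match))


def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_(t-1), i.e. ARIMA(1, 0, 0)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    var = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    phi = cov / var if var else 0.0
    return mean_curr - phi * mean_prev, phi


def forecast_ar1(series, steps):
    """Recursive multi-step forecast: each prediction feeds the next one."""
    c, phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# Made-up numbers for illustration (not Liverpool's real data).
shots_per_match = [12, 15, 14, 13, 16, 14, 15, 13, 14]
xs = [88.0, 91.5, 79.0, 94.2, 85.3, 90.1, 83.7, 92.8, 87.4, 80.9]
ys = [45.0, 52.3, 61.0, 48.7, 38.2, 55.1, 50.4, 43.9, 57.6, 49.8]

n_shots = shots_to_forecast(shots_per_match)  # 14 for these toy counts
predicted = list(zip(forecast_ar1(xs, n_shots), forecast_ar1(ys, n_shots)))
print(n_shots, len(predicted))  # 14 14
```

A low-order model like this tends to drift towards the average location after a few steps, so in practice a library such as statsmodels (its `ARIMA` class in `statsmodels.tsa.arima.model`) is the more sensible route, since it handles higher orders, differencing and parameter estimation.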
Visualisation
In the shot map above we see the prediction based on historical data. The clusters differ from those in the historical data, but they give us some insights:
- It looks like Liverpool will try to shoot most from the left side with lower xG.
- Most xG is generated from the central zones, meaning that the player will be straight in front of the goal.
- The shots on the left side will have a good angle and be slightly further inside than those on the right side.
This is an analysis we can make actionable. Why do we need to make it actionable? These are predictions using data, but in the end, we need to find a way to make them work for coach-analysts, for example.
There are two ways we can use the predicted shooting locations to our advantage in coaching/training:
- Attacking: Liverpool can see how they will shoot and where the volume of shots will come from, but also where the higher xG locations are. They can train on solidifying this current way of playing or explore other shooting zones, to make sure the opposition will be surprised.
- Defensive: If you play against Liverpool, this can be used as a way to set up defensively. For example, if Liverpool take many shots from the right, you will try to guide them to the left. Also try to prevent shots from central areas, as they generate the most xG.
Final thoughts
This is a way of looking at pre-match analysis using data that I started earlier this year. Data can be used in every aspect of the game, but post-match is always a bit easier than pre-match. By predicting shot locations, we can make it actionable for the training ground.
There are different ways of approaching this and the ARIMA model is just one way of looking at it, and it’s not foolproof. In time I need to make a 2.0 version of predicting shot locations that is more accurate and takes into account shot outcomes, angles, left/right foot, game situations, distance and whether the shot was assisted.
I’ve looked at zones too, so next time I want to explore that a little more in a separate piece of research, mostly to see if that’s plausible too.
All in all, this gives a broader perspective on shot locations and how to use them to our advantage in future games.