One of the most interesting topics discussed after a football match is whether a player was “involved” or not. The concept makes intuitive sense to me. The oft-cited example of a player “dictating” or “influencing” a team’s possession is Andrea Pirlo. One doesn’t have to crunch data to understand that Pirlo was heavily involved in the possession of his AC Milan, Juventus, and Italy sides. However, as a data analyst, I was curious how we could measure this idea. Who are the most influential players in a team’s possession? Are there less obvious players than Pirlo that we’re missing? And how could one apply a measure of influence to help solve real-world problems?
I decided to use graph databases to model team passing networks. From there, I borrowed a legendary algorithm from Silicon Valley to measure influence, and ranked all players across the Top 5 leagues in Europe.
Note – this is not a direct measure of “how good” a player is. Rather, it is a measure of how involved a player is in a team’s possession.
Background: Graph Databases
Graph databases are an alternative to the traditional data warehouse for storing data. In a graph database, nodes represent entities and edges represent the links between them. Graph databases rose in popularity with the rise of social networks. Within a social network’s graph, each node represents a person, and each edge represents a relationship between people. For example, with Facebook, each profile is graphed as a node, while each friendship is represented as an edge:
Graph databases can be applied to football. In a football match, each player is graphed as a node, while each pair of players who pass to each other during the match is connected by an edge. Additionally, the size of the node is often scaled by a proxy for influence such as total passes or touches, while the thickness of the edge is weighted by the number of passes between the two players. @11tegen11 has done great work to popularize this visualization. For example:
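To make the node-and-edge idea concrete, here is a minimal sketch of a passing network built from a list of pass events. The player names, the pass list, and the `build_pass_graph` helper are all hypothetical, and the choice to keep edges directed (passer to receiver) is an assumption; the visualizations described above may well merge both directions into one edge.

```python
from collections import Counter

# Hypothetical pass events for one team in one match: (passer, receiver).
passes = [
    ("Pirlo", "Marchisio"),
    ("Pirlo", "Marchisio"),
    ("Marchisio", "Pirlo"),
    ("Pirlo", "Vidal"),
    ("Vidal", "Pirlo"),
    ("Bonucci", "Pirlo"),
]


def build_pass_graph(passes):
    """Return (nodes, edges) for a list of (passer, receiver) events.

    nodes: set of player names.
    edges: Counter mapping (passer, receiver) -> number of passes,
           i.e. a directed edge weighted by pass count.
    """
    edges = Counter(passes)
    nodes = {player for pair in passes for player in pair}
    return nodes, edges


nodes, edges = build_pass_graph(passes)

# Total passes made per player, one possible proxy for node size.
passes_made = Counter(passer for passer, _ in passes)
```

In this toy data, the edge from Pirlo to Marchisio carries a weight of 2, which is what would drive the thickness of that edge in a passing map.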
— 11tegen11 (@11tegen11) December 19, 2016
Example of @11tegen11’s work
@11tegen11’s work inspired me to get into modeling matches as graphs. He does really interesting work and I recommend giving him a follow.
The Legendary PageRank
So how do we measure the influence of an individual player on a team’s possession using our graphs? Luckily, some really smart people have developed algorithms to measure influence within a graph. So, we can simply apply them to our passing networks.
Perhaps the most popular measure of connectivity is one that influences your decisions every day: PageRank. PageRank was originally developed by the founders of Google as a means to organize the internet. More specifically, it is meant to answer the question: what is the relative importance of each website on the internet? The higher the PageRank, the higher a website appears in your search results. If you would like to know more details about the algorithm, there is plenty of good writing on the subject that I will not cover here today.
Applying PageRank to a football team’s passing network provides a similar insight. In this case, it tells us the relative importance of an individual to a team’s possession. The higher the score, the more involved the player is. This is not a measure of “who is the best.” This is a measure of involvement, or influence on a team while that team is in possession.
Like my work on the classification of central midfielders, I limited my dataset to the last 18 months of player game-level data from the Top 5 Leagues in Europe: English Premier League, Spanish La Liga, Italian Serie A, French Ligue 1, and German Bundesliga. I included all positions in my analysis. I only deemed a player eligible if he had played the equivalent of 20 matches (1800 minutes) in the past 18 months.
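For readers who want to see the mechanics, here is a minimal sketch of weighted PageRank computed by power iteration on a tiny passing network. The players "A" to "D", the pass counts, the damping factor of 0.85, and the `pagerank` function itself are illustrative assumptions, not the exact implementation or settings behind the rankings in this post.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    edges: dict mapping (passer, receiver) -> pass count.
    Returns a dict mapping player -> score; scores sum to 1.
    """
    nodes = sorted({player for pair in edges for player in pair})
    n = len(nodes)

    # Total outgoing pass weight per player.
    out_weight = {p: 0.0 for p in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by players with no outgoing passes is spread evenly.
        dangling = sum(rank[p] for p in nodes if out_weight[p] == 0)
        new_rank = {
            p: (1 - damping) / n + damping * dangling / n for p in nodes
        }
        # Each player hands rank to receivers in proportion to pass counts.
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]

        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical network: B, C and D all play the ball to A.
toy_edges = {
    ("A", "B"): 2,
    ("B", "A"): 1,
    ("A", "C"): 1,
    ("C", "A"): 1,
    ("D", "A"): 1,
}
ranks = pagerank(toy_edges)
most_involved = max(ranks, key=ranks.get)
```

Here player A receives the ball from every teammate, so A ends up with the highest score, while D, who never receives a pass, ends up with the lowest.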
I wrote a script that created a graph for each team in each match in the dataset. From there, I filtered each graph through PageRank. After sending the data through the algorithm, I had measures of influence on ball possession for each player in the dataset, for each match. From there, I simply took a player’s average measure of influence over the entire dataset, and ranked the players. Below are the Top 5 most influential players to their team’s possession in Europe over the past 18 months. I have included a passing map of their most influential match.
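The aggregation step described above can be sketched roughly as follows. The row layout, the player labels, and the `rank_players` helper are hypothetical, and the minutes threshold is lowered in the example only so the toy data produces a result; the analysis itself used 1800 minutes.

```python
from collections import defaultdict

# Hypothetical per-match rows: (player, minutes_played, pagerank_score).
rows = [
    ("Player A", 90, 0.20),
    ("Player A", 90, 0.30),
    ("Player B", 90, 0.40),
    ("Player B", 45, 0.20),
    ("Player C", 30, 0.50),
]


def rank_players(rows, min_minutes=1800):
    """Average each player's per-match PageRank and sort descending.

    Players with fewer than min_minutes total minutes are excluded.
    """
    minutes = defaultdict(int)
    scores = defaultdict(list)
    for player, mins, score in rows:
        minutes[player] += mins
        scores[player].append(score)

    averages = {
        player: sum(vals) / len(vals)
        for player, vals in scores.items()
        if minutes[player] >= min_minutes
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Threshold lowered for the toy data; Player C is filtered out.
ranking = rank_players(rows, min_minutes=135)
```

Note that Player C has the single highest match score in the toy data but is dropped for lack of minutes, which is the point of the eligibility rule.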
5. Bruno, Villarreal: Captain Bruno, the central ball playing midfielder, was a key member of Villarreal’s 4th place finish under Marcelino last season. On the second to last match day, Bruno played alongside Manu Trigueros in a double pivot. On the day, most of the possession flowed through Bruno as he completed 94 percent of 104 passes in a disappointing 2-0 defeat to Deportivo.
Map: Villarreal 0 – Deportivo de La Coruña 2, May 8, 2016
4. Pascal Groß, Ingolstadt: Groß plays an influential role in Ingolstadt’s limited ball possession as what I would categorize as a deep-lying forward. When they do have the ball, it seems to be funneled straight through Groß and teammate Tobias Levels (when Levels is in the starting 11). Groß played a key role in Ingolstadt’s 2-1 victory over Augsburg in February of last season. Groß only completed 66 percent of his 53 passes, but tallied 6 key passes. You can see how Groß played centrally behind the strikers in the map below.
Map: Ingolstadt 2 – Augsburg 1, February 6th, 2016
3. Daniel Drinkwater, Leicester City: Although Kante got the plaudits and the big money move to Chelsea, it was Drinkwater that dictated Leicester’s possession last season. Drinkwater has tallied 4 of the top 10 most influential passing matches in the Premier League over the last 18 months, including the top 3 most influential passing performances. Drinkwater’s most involved performance was in Leicester City’s 0-3 loss to Chelsea in October of this season. He drifted in front of his partner Daniel Amartey and completed 85 percent of his 102 passes.
Map: Leicester City 0 – Chelsea 3, October 15, 2016
2. Jorginho, Napoli: Central controller Jorginho had three of the top 10 most influential games in Serie A over the last 18 months, including the most influential game in a 2-0 win over Verona. Jorginho played as a single pivot on the day, completed 93 percent of his astounding 195 passes, and provided 9 key passes. Performances like this make you wonder why clubs like Barcelona don’t sign Jorginho. He’s also an obvious heir apparent to Cazorla at Arsenal (Cazorla also scores very high on PageRank).
Map: Napoli 2 – Verona 0, November 22, 2015
1. Roberto Trashorras, Rayo Vallecano: All of the top 10 single-game PageRanks in La Liga over the last 18 months belonged to central possession midfielder Roberto Trashorras. The 35-year-old king of influence is still playing for Rayo this season in La Liga 2. It is worth noting that Paco Jemez, the manager Trashorras worked with last season, did not agree to terms with Rayo before the start of this season. On February 28th, 2016, Trashorras tallied the most influential performance in La Liga over the last 18 months in a 2-2 draw against Betis. He completed 87 percent of 100 passes, 17 of 20 long passes, and provided an assist.
Map: Rayo Vallecano 2 – Real Betis 2, February 28, 2016
Some of you may be thinking – so what? I think ‘so what’ is always an important question to keep in mind when working with data. It helps us remember if we’re actually answering a question that can lead to an actionable insight.
The most obvious application of PageRank as a measure of player influence is opposition analysis. One look at Rayo Vallecano’s passing maps and PageRanks clearly shows that under Paco Jemez, all the possession went through Trashorras. Stop Trashorras, and you disrupt Rayo’s tactics. Similarly, you could evaluate the balance of a team by looking at PageRanks. For example, if a team’s PageRank scores are disproportionately heavy on the wings compared to the rest of Europe, the team uses more width in possession than an average team. Using a measure of influence like PageRank helps us understand the way a team likes to operate in possession, thus giving us the opportunity to disrupt it.
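One rough way to look at team balance is to sum PageRank by area of the pitch. The sketch below assumes each player has already been labeled "wide" or "central"; those labels, the scores, and the `pagerank_share_by_zone` helper are hypothetical and only illustrate the idea.

```python
def pagerank_share_by_zone(ranks, zones):
    """Return the fraction of a team's total PageRank held by each zone.

    ranks: dict mapping player -> PageRank score for one match.
    zones: dict mapping player -> zone label (e.g. "wide", "central").
    Players missing from zones are grouped under "unknown".
    """
    total = sum(ranks.values())
    shares = {}
    for player, score in ranks.items():
        zone = zones.get(player, "unknown")
        shares[zone] = shares.get(zone, 0.0) + score / total
    return shares


# Hypothetical scores for four players in one match.
team_ranks = {"LB": 0.15, "RB": 0.15, "CM": 0.40, "CB": 0.30}
team_zones = {"LB": "wide", "RB": "wide", "CM": "central", "CB": "central"}

shares = pagerank_share_by_zone(team_ranks, team_zones)
```

Comparing a team's "wide" share against the same figure computed across the rest of the dataset would give a simple indication of how much that team relies on width in possession.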
Please feel free to add any comments/thoughts about the results or methodology.
Note – All pass networks other than @11tegen11’s network of Liverpool were created by me. All other non-sourced images were found using Google’s Advanced Image Search option with image rights set to ‘free to use or share.’ If Google’s classification was incorrect and you would like your image removed, please contact me and I will do so immediately.