Improving Match of the Day with Python

Despite the long overdue ejection of Alan Hansen and Mark Lawrenson from the Match of the Day couch, I still find myself thinking the same thing every time I tune in. Why do they bother with the punditry?

The current state of Match of the Day punditry places the show uncomfortably between the enlightening discussions of Sky’s Monday Night Football and a pure highlights-only style show. As a result, viewers are deprived of the full benefits of either show and instead tune into a bloated mess. I seriously doubt there’s anyone watching the highlights at home thinking ‘Gee, I sure can’t wait to hear Alan Shearer’s opinion on that penalty call!’ (and not just because the usage of ‘Gee’ hasn’t really recovered since the heady days of 1939)

So instead of continuing to complain, I wrote a small Python module to trim the unwanted fluff out of a video of Match of the Day, leaving only the highlights of actual football.

Method

After some initial (and unsuccessful) experiments in identifying portions of the video in which highlights were playing, I settled on a method. In portions of the video in which highlights are playing, the top left of the screen displays a scoreboard:

Screen Shot 2016-08-29 at 19.10.49

This scoreboard remains in place even during closeups on players, coaches and members of the crowd (something an alternative frame ‘greenness’ method had difficulty with). Meanwhile, shots of post-match discussion in the studio (including analysis of highlights) do not have the scoreboard showing (another thing the aforementioned naive approach failed at).

As a result, we can use this to identify which parts of Match of the Day we want to keep and actually watch. But how do we identify whether the scoreboard is showing?

Thankfully, if you’re lazy like me, the scikit-image documentation has a helpful example on corner detection that comes in handy here (I’ve also put a demo on GitHub). By taking a rolling average of the number of peaks detected in the top left corner of the screen (where the score box will be during matches), we can see a clear distinction between the match highlights and the analysis segments:

cornerspic
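As a rough sketch of the smoothing step, assume we already have one peak count per sampled frame (for example from applying scikit-image's `corner_peaks` to the cropped top left corner). The function name, the trailing window and the sample numbers below are illustrative assumptions, not the module's actual code:

```python
from collections import deque


def rolling_mean(values, window):
    """Trailing rolling mean of per-frame peak counts.

    Early entries are averaged over however many values have been
    seen so far, so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to drop out of the window.
            total -= recent[0]
        recent.append(value)
        total += value
        smoothed.append(total / len(recent))
    return smoothed


# Hypothetical counts: studio frames (no scoreboard) then match frames.
peak_counts = [0, 0, 12, 12, 12, 0]
print(rolling_mean(peak_counts, window=3))
```

A longer window gives a cleaner separation between segments at the cost of blurrier boundaries, so the window size is something to tune by eye.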

We can then identify the start and end points of different matches by looking at when the number of peaks rises above or falls below the show-long average*. All that’s left to do then is add some finishing touches like fade in/out, made easy by the moviepy Python package.
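The thresholding idea can be sketched as follows. This is a minimal illustration, assuming a list of smoothed peak counts indexed by sampled frame; the function name and the `min_length` parameter are hypothetical additions for the example, not part of the original module:

```python
def find_highlight_segments(smoothed, min_length=1):
    """Return (start, end) index pairs where the smoothed peak count
    is above the show-long mean. `end` is exclusive."""
    if not smoothed:
        return []
    threshold = sum(smoothed) / len(smoothed)
    segments = []
    start = None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i  # scoreboard appears: a highlight segment begins
        elif value <= threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None  # scoreboard gone: back to the studio
    if start is not None and len(smoothed) - start >= min_length:
        segments.append((start, len(smoothed)))
    return segments


# Hypothetical smoothed counts for a very short "show".
print(find_highlight_segments([0, 0, 10, 10, 10, 0, 0, 10, 10]))
```

The resulting index pairs would then need converting to timestamps (index divided by the sampling rate) before cutting the video and adding fades.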

Results

Somehow, this pretty crude solution comes out really well. The module was developed and ‘trained’ using the last episode of Match of the Day from the 2014/15 season and I’ve since successfully tested it on the last two weeks’ episodes. Despite initially starting out as a silly experiment, I am actually considering using this for the foreseeable future. There’s also the side benefit of time saved: this week’s episode goes from 1 hour and 30 minutes down to 50 minutes after being trimmed. So long as Match of the Day don’t remove the scoreboard from the top left corner, it should continue to work, and you can go from this (plus another few minutes of Shearer & co):

ndVEt8QaGM.gif

… to this:
bi7aE1kezx.gif

(I should also apologise for the horrible quality of the gifs and note that the actual output video quality is much higher)

If you’re interested in trying this out yourself or looking at the method in a bit more detail, I put it up on Github here.


* There are more sophisticated ways to cluster 1d series, but none of the ones that I tried performed significantly better than this simple method. If you know of any method you think is particularly suited to this kind of thing, I’d be really interested to know, so tweet at me or something.

 


Middlesbrough, Game States and Zone 14

One of my biggest pet peeves when watching a game of football is what I’ll term the faux-forwards pass (if you can think of something that sounds less pretentious, please let me know). What I mean by this is the sideways pass, often into lots of space, that progresses the ball vertically up the pitch but does not actually get the team any closer to scoring a goal. It is made worse by the fact that it is often greeted by cheers and encouragement from the crowd (see also: corners). It’s a distaste that Tom Payne at Huddersfield also seems to share:

Screen Shot 2016-04-06 at 02.15.07

Why do I bring this up? Well, over the course of the season, it’s felt like the 15/16 Middlesbrough side have done this a lot. This probably wasn’t helped by the signing of Jordan Rhodes, who offers little in build-up and to whom the rest of the team seem to love crossing at the expense of other (probably more fruitful) methods of attack.

More specifically, I’ve been struck by how much more impotent build-up has seemed when level versus when Middlesbrough have been in the lead, especially when it comes to faux-forwards passes. At a quick glance, it isn’t just me: the pattern is supported by the data.

The following image shows the angle of passes originating in zone 14 (the central area in front of the box) at different game states for the 14/15 and 15/16 seasons. The length of each bar corresponds to the number of passes in that direction. In other words, if there are a lot of passes towards the goal, there will be a large bar pointing up.

Rplot21
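For anyone wanting to reproduce this kind of plot, the counting behind the bars can be sketched like this. It assumes passes have already been filtered to those starting in zone 14, and that coordinates run with x increasing towards the opposition goal and y increasing towards the right touchline; both conventions, and the function names, are assumptions for illustration rather than a description of the data used here:

```python
import math
from collections import Counter


def pass_direction(x_start, y_start, x_end, y_end):
    """Pass angle in degrees: 0 = straight at goal, 90 = towards the
    right touchline, 180 = backwards, 270 = towards the left."""
    dx = x_end - x_start  # progress towards goal
    dy = y_end - y_start  # movement towards the right touchline
    return math.degrees(math.atan2(dy, dx)) % 360


def direction_counts(passes, n_bins=8):
    """Count passes per direction bin; bin 0 is centred on 0 degrees."""
    width = 360 / n_bins
    counts = Counter()
    for x0, y0, x1, y1 in passes:
        angle = pass_direction(x0, y0, x1, y1)
        # Shift by half a bin so that "straight ahead" sits mid-bin.
        counts[int(((angle + width / 2) % 360) // width)] += 1
    return [counts[i] for i in range(n_bins)]


# Hypothetical passes: forward, right, backward, left.
passes = [(70, 50, 80, 50), (70, 50, 70, 60), (70, 50, 60, 50), (70, 50, 70, 40)]
print(direction_counts(passes))
```

Splitting the passes by game state before counting gives one set of bars per panel.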

A few things are immediately apparent. At close game states, Middlesbrough’s attack is clearly far more skewed towards the right wing compared to a relatively symmetrical distribution last season, as well as being perhaps slightly less vertical. This is likely in part due to Emilio Nsué’s emergence as first choice right back (Nsué captained Equatorial Guinea and played at centre forward in last year’s AFCON). I think it’s also fair to attribute some of this to Stewart Downing, who plays the sideways pass to Adomah/Nsué fairly frequently (this is all the more painful, when he (occasionally) is capable of things like this).

Rplot22

Meanwhile at +1 and -1 (leading and trailing by one goal, respectively), there seem to be significant differences in how the ball is played from the central attacking area. When trailing this year, it would seem like the team has been significantly less direct (at least in this area) than they were previously. That said, it’s worth noting that Karanka’s Middlesbrough are not a team that goes behind very often, so these numbers may be distorted due to sample size. On the other hand, at +1 it would seem that despite passing the ball forwards more, the team is actually penetrating the box less; last season, the attack seemed to be more focused through the channels.

Rplot23

These aren’t great signs for Middlesbrough’s attacking play and the turnover of players at the sharp end of the pitch this summer can’t have helped. However, it seems unlikely that Karanka will tweak the tactical scheme that has got him this far and, to be fair to him, it will probably get Boro promoted anyway; the man can coach a mean defence (someone also needs to have a serious chat to whoever’s organising the transfer strategy but that’s an issue for another time).

The last question to ask is how the passing pattern displayed above compares to the average. How do teams normally pass from zone 14 at different game states? Perhaps surprisingly, the overall pattern seems relatively robust across game states, at least for the Championship:

Rplot25

Visualising the Championship: the Shot Worm 2014/15

A short post inspired by (stolen from) this chart by David Sumpter. I looked at visualising how teams’ shots for and against changed throughout the season. The result is a connected scatter plot showing teams’ shots for versus shots against. Naturally, the best teams tend to be towards the bottom right and that’s where teams should be aiming to be. Teams above the dotted line are being outshot by their opponents, while teams below the dotted line are taking more shots than their opponents. The shots are grouped into rolling 10 game averages (shots per game).
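The rolling averages behind each worm are simple to compute. Here is a minimal sketch, assuming two equal-length lists of per-game shot counts in match order; the function name and the sample figures are made up for illustration:

```python
def shot_worm(shots_for, shots_against, window=10):
    """Return one (shots for, shots against) point per full window,
    each a per-game average over the last `window` matches."""
    if len(shots_for) != len(shots_against):
        raise ValueError("need one shots-against value per game")
    points = []
    for end in range(window, len(shots_for) + 1):
        avg_for = sum(shots_for[end - window:end]) / window
        avg_against = sum(shots_against[end - window:end]) / window
        points.append((avg_for, avg_against))
    return points


# Hypothetical four-game run with a two-game window.
print(shot_worm([10, 12, 14, 16], [8, 8, 10, 12], window=2))
```

Plotting the points in order and joining them up gives the worm; a point where the second value exceeds the first is a stretch in which the team was outshot.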

Rplot49

Points of interest:

  • Brentford’s late streak into elite shot dominance numbers (shown by being far to the right) is driven by their one-off thrashing of Blackpool. They’re a good side, but not quite that good (yet).
  • Charlton spent almost their entire season above the line (compare their worm to Blackpool’s for instance). They have been a pretty dire team by the shots numbers all year and have been kept up by a large number of draws and unsustainable conversion numbers. Leeds are another team in this category.
  • Despite gathering the plaudits early on in the season, Derby’s shots numbers were consistently fairly average and they were probably fortunate to be flying high for so long, driven as they were by unsustainably high conversion rates.
  • Norwich were a truly elite shots team and were probably unlucky to have to go up via the playoffs.

Brentford: Playing Football Manager in real-life

This weekend, Brentford beat 22nd placed Wigan by three goals to nil and secured a place in the playoffs. Brentford, a favourite among the analytics scene due to their chairman’s data-led approach, now have a chance at back to back promotions. Regardless of what happens over the coming weeks, Brentford appear to be a smartly run club with a bright future. Part of this forward-thinking approach is present in their squad construction, which shares similarities with Alex Stewart’s recent Moneyball/Football Manager mashup.
Brentford age plot

This chart shows each player in Brentford’s squad in relation to their age and the percentage of the maximum minutes that they have played this season. What is promising about this is the cluster of players just approaching peak age (shown in red). While definitions of peak age vary depending on position, injury history and other factors, the majority of first team players in Brentford’s squad are just entering their peak production years. This bodes well for the club on the pitch and on the balance sheet.

As any Football Manager veteran will no doubt tell you (see point 7 in the piece by Alex Stewart), players signed before they hit peak performance will tend to be cheaper than more experienced players. While it’s obviously not as easy to sign the next Maradona in real life as it is in a video game, by signing players in their early 20s, you can eliminate some of the risk associated with youngsters due to the larger sample of games played, while still paying less than you would for a “proven” 26 year old.

Rplot10

One example of this is the signing of Andre Gray. Gray was signed in the summer for a fee of around £550,000 (Transfermarkt) on the back of 30 goals and 14 assists in 45 appearances for Luton in the Conference. This has since proved a smart acquisition as he has performed well in 2014/15 and, at 23, is likely to improve; Gray’s 2014/15 haul of 0.61 Goals+Assists per 90 minutes is highly impressive, especially considering the low fee. Moreover, players in this age range are likely to command lower wages than older players with similar output (point 1).
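For reference, a per-90 figure like the one quoted for Gray is just output scaled by playing time. A tiny sketch, with a hypothetical function name and made-up numbers (not Gray's actual goals, assists or minutes):

```python
def goals_assists_per_90(goals, assists, minutes):
    """Goals plus assists per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return 90 * (goals + assists) / minutes


# Made-up example: 10 goals and 5 assists in 2,700 minutes.
print(goals_assists_per_90(10, 5, 2700))
```

Scaling by minutes rather than appearances stops substitutes and injury-hit players from looking worse than they are.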

Rplot09

As well as having more years of high performance remaining, players in their early 20s have a greater resale value (point 9, sort of) and are therefore more likely to make a profit for the club. While there are naturally benefits to having a balance between youth and experience in a playing squad, most of Brentford’s signings have landed in the Moneyball sweet-spot of 20-23 (see above chart).

The flip side of this is that there are fewer old players playing big minutes. According to the Transfermarkt data on the age-minutes plot, there is only one player over 30 who has played over half the available minutes for the club. Because there are fewer players declining due to age, there is less turnover pressure on the club; fewer players need to be replaced. Likewise, the only player with significant minutes on a contract of a year or less was Alex Pritchard. This is further evidence of the team’s stability and suggests good management of contracts; good, young players are secured for beyond just the short term, with little reliance on players in decline.

These two characteristics of a young squad, an improving first team and high retention of playing minutes, mean that a stable squad can continue to play and develop together. I believe that this is evidence of a club run well beyond the level of manager or head coach (and therefore less reliant on one person). This bodes well for the future, with or without Warburton, with or without promotion.

Comparing shots across the divisions of English football

Watching the highlights of each week’s Football League action, it can sometimes feel like there are more spectacular goals in the lower divisions, particularly shots from range. There is, though, the obvious counter-argument that because the highlights for leagues that aren’t the Premier League tend to be much more condensed, we are more likely to forget the average goals as they are quickly passed over in favour of goals from the likes of Lee Trundle. However, this raises a serious question: do teams shoot differently at different levels of the Premier League and Football League? Another way of thinking about this would be whether the quality of both attacking and defending increases evenly as you ascend the levels of the footballing pyramid.

While I do not intend to look at this from every possible point of view (being able to see how different skills transfer from one league to another would be one of the most valuable commodities in football; just look at the clichés about Eredivisie strikers), I will look briefly at location and mean conversion of shots in different zones in the top 3 divisions of English football.

Shot locations

These maps show the proportion of shots originating from each area of the pitch in each of the divisions.

RplotA

Conversion (Goals per Shot)

(NB: the sample sizes for headed shots outside the box are understandably very small and so the differences here are unlikely to be significant)

RplotB

Interestingly, both shot locations and conversion rates are very similarly distributed across at least these divisions of English football. Perhaps this suggests that a large degree of attacking and defending scales up and down the leagues. There is also the point that these numbers are aggregated from the whole league and so do not reflect the distribution within each league. However, I think this is a noteworthy result, if not particularly earth-shattering.
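The two measures used in this post, share of shots per zone and goals per shot per zone, can be sketched as below. The input format (one `(zone, is_goal)` pair per shot), the zone labels and the function name are assumptions for illustration:

```python
from collections import defaultdict


def zone_summary(shots):
    """For each zone, return its share of all shots and its
    conversion rate (goals per shot)."""
    taken = defaultdict(int)
    scored = defaultdict(int)
    for zone, is_goal in shots:
        taken[zone] += 1
        if is_goal:
            scored[zone] += 1
    total = sum(taken.values())
    return {
        zone: {
            "share": taken[zone] / total,
            "conversion": scored[zone] / taken[zone],
        }
        for zone in taken
    }


# Made-up shots: four from inside the box (one goal), four from outside.
shots = [("box", True)] + [("box", False)] * 3 + [("outside", False)] * 4
print(zone_summary(shots))
```

Running this once per division gives the numbers to compare side by side; zones with very few shots will have noisy conversion rates.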

Visualising the Championship: the Age-Utility matrix

The Age-Utility matrix is (if you’ll excuse the self-important name) something I came up with to show squad balance across different age bands. By plotting the % of maximum minutes played (y axis) vs the age of each player (x axis), we can get an approximate idea of squad balance. For instance, it would be undesirable to be relying on lots of older players playing a lot of minutes (especially without younger players ready to come in to replace them), because as they begin to decline as a result of normal ageing, they will need to be replaced. I have also included a marker for players on short term contracts (loans or single year deals) because, if continuity is desirable, then having lots of minutes devoted to players on short term deals is not wanted. Finally, there is also an indicator for an estimate of peak years based on the league average distribution of minutes. This is, of course, an estimate, as the age curve varies depending on myriad factors, such as position, injury history and the like.
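The data behind each point on the matrix is easy to assemble. A minimal sketch, assuming each player is a dict with `name`, `age`, `minutes` and a `short_term` flag, and that the maximum is simply games played times 90; the field names, function names and players are all hypothetical:

```python
def age_utility_points(players, max_minutes):
    """One plottable point per player: age (x axis) against
    percentage of maximum minutes played (y axis)."""
    return [
        {
            "name": p["name"],
            "age": p["age"],
            "pct_minutes": 100 * p["minutes"] / max_minutes,
            "short_term": p["short_term"],  # loan or single year deal
        }
        for p in players
    ]


def short_term_minutes_share(players):
    """Fraction of the squad's minutes played by short-term players."""
    total = sum(p["minutes"] for p in players)
    short = sum(p["minutes"] for p in players if p["short_term"])
    return short / total if total else 0.0


# Made-up two-player squad over a 46-game season.
squad = [
    {"name": "Player A", "age": 24, "minutes": 2070, "short_term": False},
    {"name": "Player B", "age": 31, "minutes": 1035, "short_term": True},
]
for point in age_utility_points(squad, max_minutes=46 * 90):
    print(point)
print(short_term_minutes_share(squad))
```

The short-term share is a rough single-number summary of the continuity concern described above.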

Example: AFC Bournemouth

Rplot11

  • We can see the majority of Bournemouth’s minutes are coming from peak age players, with few older players, perhaps a contributing factor to their success this season. Does this age balance allow Bournemouth to play at a higher intensity at times?
  • There are also few loanees, with only Boruc playing a major role.
  • Likewise, there is a clearly demarcated set of core players (top group) and squad options (middle/bottom). It is reasonable to suggest that Bournemouth’s team cohesion has also been a factor in their success so far this season.

Quantifying attacking player reliance

Back in 2012, Brendan Rodgers remarked that he has “always thought that if you have three-and-a-half goalscorers in your team, you have got a chance”. This raises the question: are good teams more likely to have their shots and goals spread out amongst the team, or localised in a handful of players? And what does the answer tell us about the nature of the sport?

To test this, we need a metric that can help quantify the distribution of shots within a team. One option would be to use the Gini coefficient, a common measure of inequality. However, this would not entirely account for the fact that teams with fewer shots will naturally tend to have a more equal distribution of shots. Instead, I have chosen to use the coefficient of variation (the standard deviation divided by the mean) of teams’ simple expected goals (expected goals being a model for weighting shots based on their likelihood of resulting in a goal). By this measure, a team with a lot of players contributing more equally to their expected goals tally will have a lower coefficient of variation than a team whose shots are coming from only a couple of players.
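As a concrete illustration of the measure, here is a minimal sketch assuming a list of per-player xG totals for one team. Using the population standard deviation, and the question of which players to include, are choices made for this example rather than details taken from the analysis:

```python
from statistics import mean, pstdev


def coefficient_of_variation(player_xg):
    """Standard deviation of per-player xG divided by the mean."""
    average = mean(player_xg)
    if average == 0:
        raise ValueError("team has no expected goals")
    return pstdev(player_xg) / average


# Two made-up teams with the same total xG (20) across four players.
spread_out = [5, 5, 5, 5]
star_striker = [17, 1, 1, 1]
print(coefficient_of_variation(spread_out))
print(coefficient_of_variation(star_striker))
```

The evenly spread team scores zero, and the figure rises as more of the xG is concentrated in one or two players.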

When we plot this measure against goals scored in the league, we can see a weak correlation:

Rplot09
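To put a number on the strength of a relationship like this, one option is Pearson's correlation coefficient. The sketch below is a plain implementation with a hypothetical function name; the inputs would be one value per team for each axis:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of 2+ values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Made-up values that rise together perfectly.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

Values near 1 or -1 indicate a tight linear relationship, while values near 0 indicate a diffuse one.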

So that’s it, focusing your attack around fewer players is more effective? Well, not necessarily. Firstly, attacking systems will tend to focus themselves around star players and if we look at where teams are placed along the x axis, it would seem sensible to suggest that teams with high quality strikers will tend to have a higher coefficient of variation. For instance, Blackburn have both Gestede and Rhodes, Watford have Deeney and Ighalo (and Vydra), and Ipswich have Murphy scoring goals for fun. If we can expect top strikers to score goals somewhat independently of the quality of their teammates, then a correlation such as the one seen here would be expected.

So what conclusions can we draw from this, if any? Well, there are obvious limitations in the method used here. For one thing, a simple shots-based evaluation of a team’s attack will fail to capture all the subtleties of attacking contribution and variation. With this and the diffuse nature of the correlation, it would be unwise to try to draw any large or spectacular conclusions. However, we have at least derived a useful method for determining the spread of xG around a team.