Jerome Williams
March 3, 2024

Back to jeromewilliams.net.

Preamble

First, let’s load some packages and define some helpers we will use later.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1      ✔ purrr   1.0.2 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(plotrix)

source_string <- 'Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams'

theme_jw <- function() { 
  theme_bw() + 
    theme(
      plot.title = element_text(face = 'bold', hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      plot.caption = element_text(hjust = 0)
    )
}

Preparing the data

Fantasy Premier League data

We will use data from the Fantasy Premier League data project, which I downloaded previously. The data is stored season by season, so we have to combine seasons. Data on each player/gameweek (such as total points scored) are saved in merged_gw.csv. Player/gameweek data for 2016/17 through 2019/20 does not include player position data (e.g., defender, midfielder, etc.) Since we will need player position data later, we also merge merged_gw.csv with players_raw.csv (which does contain player position data) for seasons 2016/17 through 2019/20.

fpl_data_path <- 'data/Fantasy-Premier-League-master/data/'

load_and_merge_player_and_gameweek_data <- function(season_str) {
  season_string_dash <- str_replace(season_str, '/', '-')
  merged_gw <- read_csv(paste0(fpl_data_path, season_string_dash, '/gws/merged_gw.csv'),
                        show_col_types = FALSE) %>% mutate(season = season_str)
  players_raw <- read_csv(paste0(fpl_data_path, season_string_dash, '/players_raw.csv'),
                          show_col_types = FALSE)
  player_position_data <- players_raw %>% select(id, element_type)
  gameweek_data <- merged_gw %>% 
    left_join(player_position_data, by = c('element' = 'id')) %>%
    mutate(position = case_when(element_type == 1 ~ 'GK',
                              element_type == 2 ~ 'DEF',
                              element_type == 3 ~ 'MID',
                              element_type == 4 ~ 'FWD')
  )
}
load_gameweek_data <- function(season_str) {
  season_string_dash <- str_replace(season_str, '/', '-')
  merged_gw <- read_csv(paste0(fpl_data_path, season_string_dash, '/gws/merged_gw.csv'),
                        show_col_types = FALSE) %>% mutate(season = season_str) %>%
    mutate(position = if_else(position == 'GKP', 'GK', position))
}

seasons <- list()
seasons[['2016/17']] <- load_and_merge_player_and_gameweek_data(season_str = '2016/17')
seasons[['2017/18']] <- load_and_merge_player_and_gameweek_data(season_str = '2017/18')
seasons[['2018/19']] <- load_and_merge_player_and_gameweek_data(season_str = '2018/19')
seasons[['2019/20']] <- load_and_merge_player_and_gameweek_data(season_str = '2019/20')
seasons[['2020/21']] <- load_gameweek_data(season_str = '2020/21')
seasons[['2021/22']] <- load_gameweek_data(season_str = '2021/22')
seasons[['2022/23']] <- load_gameweek_data(season_str = '2022/23')
gw <- bind_rows(seasons)

As a check, let’s see how many observations we have in our combined dataset.

print(nrow(gw))
## [1] 166813

We have 166,813 observations (player/gameweek pairs).

Later on, we will need a ‘Percent selected by’ field, representing, for each player/gameweek observation, the percentage of FPL managers who selected that player in their teams in that given gameweek. To calculate ‘Percent selected by’, we need data on the total number of managers each season, which we have from the Manager Count dataset. Note that the Manager Count dataset includes data only on the end-of-season manager count and, moreover, is approximate for some of the earlier seasons. Thus, unforrtunately, our ‘Percent selected by’ numbers will be approximate.

Let’s ensure we have position data for all observations.

any_without_position_data <- gw$position %>% is.na() %>% any()
print(any_without_position_data)
## [1] FALSE

As a check, let’s plot the number of observations per season.

ggplot(gw) + 
  geom_bar(aes(x = season)) + 
  scale_y_continuous(labels = scales::number_format(big.mark = ",")) +
  theme_jw() +
  labs(title = 'Number of gameweek/player observations, by season',
       subtitle = 'Fantasy Premier League, 2016/17 - 2022/23',
       x = 'Season',
       y = 'Count',
       caption = source_string)

We are missing a few observations at the end of the 2022/23 season, since our dataset was created mid-season. This should have a noticeable effect on our calculations or conclusions, but is worth bearing in mind.

Let’s also check the number of distinct gameweeks per season. We should have 38 gameweeks per season.

by_season <- gw %>% group_by(season) %>% summarize(n_gameweeks = n_distinct(GW))
ggplot(by_season) + geom_col(aes(x = season, y = n_gameweeks)) + 
  theme_jw() +
  labs(
    title = 'Gameweeks per season',
    subtitle = 'Fantasy Premier League, 2016/17 - 2022/23',
    x = 'Season',
    y = 'Number of gameweeks',
    caption = source_string
  )

Team level data from football-data.co.uk

Our FPL data does not have team-level results for every season we are interested in, so we’ll also use data from www.football-data.co.uk. Let’s load this data as well.

fd_data_path <- 'data/footballdata_co_uk/';

load_fd_data <- function(season_string) {
  season_no_slash <- str_replace(season_string, '/', '') %>% str_replace('20', '')
  season <- read_csv(
    paste0(fd_data_path, 'E0_', season_no_slash, '.csv'),
    show_col_types = FALSE
    ) %>%
    mutate(season = season_string)
}

season_strings <- c('2016/17',
                    '2017/18',
                    '2018/19',
                    '2019/20',
                    '2020/21',
                    '2021/22',
                    '2022/23')

fd <- lmap(season_strings, load_fd_data) %>% 
  bind_rows()

Team-level home advantage

Before we consider home advantage in FPL points for individual players, let’s confirm that we see home advantage in team-level outcomes. Specifically, let’s plot the average win rate and average goals scored for Premier League teams playing at home and playing away between 2016/17 and 2022/23.

Because the data has a single row for each match (with e.g. home goals and away goals stored in separate fields), let’s reshape our dataset so that we have a single row for each match-team observation (i.e., two rows per match, with the home_away field indicating ‘Home’ or ‘Away’).

fd_home <- fd %>% select(season,
                      date = Date,
                      team = HomeTeam,
                      opponent = AwayTeam,
                      goals_scored = FTHG, # full time home goals
                      goals_conceded = FTAG, # full time away goals
                      FTR) %>%
  mutate(home_away = 'Home',
         result = case_when(FTR == 'H' ~ 'W', # full time result
                            FTR == 'A' ~ 'L',
                            FTR == 'D' ~ 'D'),
         win = (result == 'W'))
fd_away <- fd %>% select(season,
                      date = Date,
                      team = AwayTeam,
                      opponent = HomeTeam,
                      goals_scored = FTAG,
                      goals_conceded = FTHG,
                      FTR) %>%
  mutate(home_away = 'Away',
         result = case_when(FTR == 'H' ~ 'L',
                            FTR == 'A' ~ 'W',
                            FTR == 'D' ~ 'D'),
         win = (result == 'W'))

fd2 <- bind_rows(fd_home, fd_away)

Team-level home advantage in win rate

Now that we’ve have created our home/away dataset, let’s plot the mean win rate, by season and by home/away.

win_rate_stats <- fd2 %>% 
  group_by(season, home_away) %>% 
  summarize(mean = mean(win), 
            se = std.error(win),
            n = n()) %>%
  mutate(home_away = fct_relevel(home_away, 'Home', 'Away'))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
ggplot(win_rate_stats, aes(x = season, color = home_away)) + 
  geom_point(aes(y = mean)) + 
  geom_linerange(aes(ymin = mean - 1.96 * se,
                     ymax = mean + 1.96 * se)) +
  theme_jw() + 
  labs(
    title = 'Home and away win rates, by season',
    subtitle = 'English Premier League, 2016/17 - 2022/23',
    color = '',
    x = '',
    y = 'Win rate',
    caption = 'Source: football-data.co.uk; Jerome Williams'
  )

Interestingly, we see that home advantage disappeared completely in the 2020/21 season. Because of the COVID-19 pandemic, part of 2020/21 Premier League season was played behind closed doors, i.e., with no supporters in stadiums (premierleague.com), and part was played with only a limited number of supporters in stadiums (wikipedia).

It is well documented that the behind-closed-doors 2020/21 season resulted in diminished home advantage (link)[https://www.sciencedirect.com/science/article/pii/S146902922100131X]. Indeed, the natural experiment created by the COVID-19 pandemic has provided insight into the sources of home advantage. In their 2021 article in Psychology of Sport & Exercise, Merrick, Bilalic, Neave, and Wolfson find that the diminished home advantage in behind-closed-doors matches provides support for the theory that home advantage stems from (i) the home crowd’s effect on the home team’s performacne, and (ii) the home crowd’s effect on the referee. After the 2020/21 season, alternative possible explanations, such as the effects of travel on the away team’s performance, seem less likely.

In our data, we also see that home advantage is also smaller in the 2021/22 season, when supporters were allowed back into stadiums. As far as I know, there is no good explanation for why home advantage was diminished in 2021/22.

Apart from 2020/21 and 2021/22, home advantage in other seasons is pronounced and fairly consistent. Home win rates tend to be between 0.45 to 0.5, while away win rates tend to be between approximately 0.28 and 0.35.

Team-level home advantage in goals scored

Let’s check whether home advantage also shows up in goals scored. It should, given the likely high correlation between goals scored and win rate.

win_rate_stats <- fd2 %>% 
  group_by(season, home_away) %>% 
  summarize(mean = mean(goals_scored), 
            se = std.error(goals_scored),
            n = n()) %>%
  mutate(home_away = fct_relevel(home_away, 'Home', 'Away'))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
ggplot(win_rate_stats, aes(x = season, color = home_away)) + 
  geom_point(aes(y = mean)) + 
  geom_linerange(aes(ymin = mean - 1.96 * se,
                     ymax = mean + 1.96 * se)) +
  theme_jw() + 
  labs(
    title = 'Home and away goals scored, by season',
    subtitle = 'English Premier League, 2016/17 - 2022/23',
    color = '',
    x = '',
    y = 'Goals scored',
    caption = 'Source: football-data.co.uk; Jerome Williams'
  )

Indeed, we see a similar home advantage in goals scored, with the effect disappearing in 2020/21 and smaller in 2021/22. As the plot above shows, outside of 2020/21 and 2021/22, home teams score an average of 1.5-1.6 goals per match while away teams score an average of 1.15-1.25 goals per match.

Player-level home advantage

Now that we’ve confirmed that home advantage obtains at the team-level, let’s turn to our question of interest: do players playing at home score more FPL points on average than players playing away?

To answer this question, we’ll start using the Fantasy Premier League dataset.

Filtering to FPL-relevant players

The FPL game includes a complete squad for each Premier League team every season. However, a team’s complete squad typically includes several players who end up playing only a small number of minutes over the course of a season and who therefore tend not to be relevant to FPL. We want to exclude these players from the analysis since they will generally score zero points, whether or not their team plays at home or away. We want to focus instead on the players whom FPL managers would actually select.

To restrict to game-relevant players, I opt to filter to players who are selected by at least 0.5% of managers in the relevant gameweek.

Let’s check whether this filtering is sensible. Let’s plot the average minutes played and average FPL points among included and excluded observations, using the 0.5% selection rate exclusion criteria.

gw_check <- gw %>% 
  mutate(included = if_else(pct_selected >= 0.005, 'Included', 'Excluded'))

ggplot(gw_check) + 
  geom_histogram(aes(x = minutes), binwidth = 2) +
  facet_grid(included ~ .) +
  theme_jw() + 
  scale_y_continuous(labels = scales::comma_format()) +
  labs(title = 'Minutes played, by exclusion status',
       subtitle = 'Fantasy Premier League, 2016/17 - 2022/23',
       y = 'Observations',
       x = 'Minutes played',
       caption = '
Excluded observations are gameweek/player pairs where "% managers selected by" is less \nthan 0.5%.

Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams')

The plot confirms shows that, if we exclude player/gameweek observation with selection rates less than 0.5%, the players excluded are mostly (but not all) players playing zero minutes in the relevant gameweek. Many included players also play zero minutes, but that is to be expected: there is nothing stopping a player who is highly-selected in particular gameweek from playing zero minutes, whether because of injury or squad rotation or some other reason.

For the sake of completeness, let’s also compare FPL points scored for included/excluded observations.

gw_check <- gw %>% 
  mutate(included = if_else(pct_selected >= 0.005, 'Included', 'Excluded'))

ggplot(gw_check) + 
  geom_histogram(aes(x = total_points), binwidth = 1) +
  facet_grid(included ~ .) +
  theme_jw() + 
  scale_y_continuous(labels = scales::comma_format()) +
  labs(title = 'Gameweek FPL points, by exclusion status',
       subtitle = 'Fantasy Premier League, 2016/17 - 2022/23',
       y = 'Observations',
       x = 'Gameweek FPL points',
       caption = '
Excluded observations are gameweek/player pairs where "% managers selected by" is less \nthan 0.5%.

Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams')

As expected, excluded observations are more than included observations likely to score zero FPL points than included observations.

Plotting home advantage in FPL points

Now that we’ve established that filtering out observations with ‘Percentage selected by’ less than 0.5% is sensible, let’s proceed with analyzing player-level home advantage in FPL points. As we did with team-level home advantage, let’s plot the average numbe of FPL points scored (per player per gameweek), displayed by season and by home/away.

home_away_data <- gw %>%
  filter(pct_selected >= 0.05) %>%
  mutate(home_away = if_else(was_home, 'Home', 'Away')) %>%
  mutate(home_away = as_factor(home_away) %>% fct_relevel('Home'))

home_away_stats <- home_away_data %>%
  group_by(home_away, season) %>%
  summarize(
    total_points_mean = mean(total_points),
    total_points_se = std.error(total_points)
  )
## `summarise()` has grouped output by 'home_away'. You can override using the
## `.groups` argument.
ggplot(home_away_stats) +
  geom_point(aes(x = season, y = total_points_mean, color = home_away)) + 
  geom_linerange(aes(x = season, 
                     ymin = total_points_mean - 2*total_points_se,
                     ymax = total_points_mean + 2*total_points_se,
                     color = home_away)) +
  theme_jw() +
  labs(
    title = 'Home advantage in FPL points',
    subtitle = 'Fantasy Premier League, 2016/17 - 2022/23',
    x = '',
    y = 'Points per player per gameweek',
    color = '',
    caption = paste0(
'Plot shows means and 95% confidence intervals.
Excludes player-gameweek observations where player is selected by less than 0.5% of managers.\n\n',
source_string)
)

The plot shows that home advantage does exist in player-level FPL points: players score more FPL points on average in gameweeks when their team plays at home than in gameweeks when their team plays away. Interestingly, following the same pattern as team-level home advantage, home advantage in FPL points is reduced in 2021/22 and disppears entirely in 2020/21.

Interestingly, the plot also shows that average FPL points per player/gameweek varies quite a bit from season to season. I suspect that the performance of popular players and of popular teams in any given season is what determines this.

Sources of home advantage: point-scoring actions

Let’s examine the sources of FPL home advantage. FPL points are based on a variety of different point-scoring actions, as shown in the table below.

FPL point-scoring action Points
Played more than 0 minutes 1
Played more than 60 minutes 1
Goal scored (GK / DEF / MID / FWD) 6 / 6 / 5 / 4
Assist 3
Every 2 goals conceded (GK / DEF) -1
Clean sheet (GK / DEF / MID) 6 / 6 / 1
Every 3 saves (GK) 2
Yellow card -1
Red card -3
Penalty saved (GK) 5
Penalty missed -2

Source: https://fantasy.premierleague.com/help/rules

Let’s plot the frequency of the different point-scoring actions, for home games and for away games. For this analysis, we will exclude the 2020/21 season, since it exhibits no home advantage with respect to total points.

df_for_source_analysis <- home_away_data %>%
  filter(season != '2020/21')

vars_for_all_players <- list('minutes', 
                             'goals_scored',
                             'assists',
                             'bonus',
                             'red_cards',
                             'yellow_cards')

calc_stats <- function(df, var_, ...) {
  tibble(mean = mean(df[var_] %>% pull()),
         se = std.error(df[var_]))
}

all_stats <- tibble()
for (var in vars_for_all_players) {
  temp <- df_for_source_analysis %>% 
    group_by(home_away) %>%
    group_modify(function(df, ...) { calc_stats(df, var, ...) }) %>%
    mutate(var = var) %>%
    mutate(var = case_when(var == 'assists' ~ 'Assists',
                           var == 'goals_scored' ~ 'Goals scored',
                           var == 'bonus' ~ 'Bonus points',
                           var == 'minutes' ~ 'Minutes played',
                           var == 'red_cards' ~ 'Red cards',
                           var == 'yellow_cards' ~ 'Yellow cards'))
  all_stats <- all_stats %>% bind_rows(temp)
}

def_gk <- df_for_source_analysis %>% filter(position == 'GK' | position == 'DEF')
def_gk_vars <- list('clean_sheets', 'goals_conceded')

def_gk_stats <- tibble()
for (var_ in def_gk_vars) {
  temp <- def_gk %>% 
    group_by(home_away) %>%
    group_modify(function(df, ...) { calc_stats(df, var_, ...) }) %>%
    mutate('var' = var_)
  def_gk_stats <- def_gk_stats %>% bind_rows(temp)
}
def_gk_stats <- def_gk_stats %>%
  mutate(var = case_when(var == 'clean_sheets' ~ 'Clean sheets',
                         var == 'goals_conceded' ~ 'Goals conceded'))

gk_only <- df_for_source_analysis %>% filter(position == 'GK')
gk_only_stats <- gk_only %>%
  group_by(home_away) %>%
  group_modify(function(df, ...) { calc_stats(df, 'saves', ...)}) %>%
  mutate('var' = 'Saves')


plotting_data <- all_stats %>% 
  bind_rows(def_gk_stats) %>%
  bind_rows(gk_only_stats) %>%
  mutate(var = fct_relevel(var, c('Goals scored', 'Assists', 'Minutes played', 
                                  'Bonus points', 'Red cards', 'Yellow cards',
                                  'Goals conceded', 'Clean sheets', 'Saves')))

ggplot(plotting_data, aes(x = home_away, color = home_away)) +
  geom_point(aes(y = mean)) +
  geom_linerange(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se)) +
  facet_wrap(~var, scales = 'free') +
  labs(title = 'Home advantage in FPL point-scoring actions',
       subtitle = 'Fantasy Premier League, 2016/17 to 2022/23',
       caption = 'Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams

Note: Plots show means and 95% confidence intervals.
Calculations exclude 2020/21 season and player-gameweeks with selected-by rates less than 0.5%.
\'Goals conceded\' and \'Clean sheets\' calculated for defenders and goalkeepers only. \'Saves\' calculated for goalkeepers only.',
      x = '',
       y = '', 
       color = '') +
  theme_jw() + 
  theme(legend.position = 'none')

The plot above shows that most point-scoring actions in FPL contribute to home advantage in total points. There are significant differences between home and away matches in all plotted point-scoring actions except for red cards. (I omit Penalties missed and Penalties saved from the plot on account of their rarity.)

Interestingly, the average number of minutes played (per gameweek/player) is slightly higher for home games than for away games. The difference is small, however—approximately 68.5 minutes for home games vs 67.75 minutes for away games—and is therefore unlikely to be meaningful for FPL points.

Sources of home advantage: player positions

Let’s also examine home advantage by player position. Players in FPL are assigned one of four positions—GK, DEF, MID, or FWD—which is fixed for the duration of a season. As I explain above, some of the available point-scoring actions differ by position. However, given that almost all actions, including both attacking actions (goals, assists) and defensive actions (clean sheets, goals conceded, saves, etc.), exhibit home advantage, I expect that we will find home advantage across all four positions as well.

df_for_position_analysis <- home_away_data %>%
  filter(season != '2020/21')

var = 'total_points'
calculate_stats <- function(df) {
  mean_ <- mean(df[var] %>% pull())
  se_ <- std.error(df[var])
  tibble(
    position = df$position[1],
    home_away = df$home_away[1],
    mean = mean_,
    se = se_)
}

pos_stats <- df_for_position_analysis %>%
  split(list(.$position, .$home_away)) %>%
  map(calculate_stats) %>%
  list_rbind() %>%
  mutate(
    position = as_factor(position) %>% fct_relevel(c('GK', 'DEF', 'MID', 'FWD'))
  )

ggplot(pos_stats, aes(x = position, color = home_away, group = home_away)) + 
  geom_point(aes(y = mean)) +
  geom_linerange(aes(ymin = mean - 2 * se, ymax = mean + 2 * se)) +
  theme_jw() +
  labs(title = 'Home advantage in FPL points, by player position',
       subtitle = 'Fantasy Premier League, 2016/17 to 2022/23',
       x = '',
       y = 'Points per player per gameweek',
       color = '',
       caption = 'Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams

Plot shows means and 95% confidence intervals.
Calculations exclude 2020/21 season and player-gameweeks with selected-by rates less than 0.5%.')

Indeed, we do find home advantage in FPL points for all four positions. Midfielders seem to have the largest difference, as well as the highest number of points on average.

Do FPL managers take advantage of home advantage?

We have demonstrated that home advantage exists in FPL and exists across seasons and player position. This prompts a question: do managers take advantage of home advantage by selecting home players more frequently than away players?

Plotting Percent selected by for home and away fixtures

To answer this question, let’s look at how ‘Percent selected by’ varies between home matches and away matches.

Ideally, we would also account for the selection of players in managers’ first team of 11 players each gameweek: every gameweek, a manager must choose 11 players from the squad of 15 to “play” in that gameweek (i.e., to score points for the team). The remaining four players sit on the manager’s “bench” and will only score points if they are automatically substituted, which happens if one or more players from the first team of 11 fail to register a single minute of play that gameweek. My guess is that managers may account for home and away fixtures in the first 11 and bench decisions more so than in overall squad selection decisions. Managers are restricted to a single ‘free’ transfer each gameweek—subsequent transfers cost 4 points—meaning that squads change slowly over time and players are typically held by managers for multiple gameweeks (during which they will likely play both home and away matches). However, our FPL dataset includes data only on selection in the squad of 15 (our ‘Percent selected by’ variable) and not on selection in each manager’s first 11 every gameweek.

This time, let’s not exclude 2020/21. I don’t think it was widely known at the time that 2020/21 would result in no discernible home advantage, so I suspect FPL managers did not adjust their behavior that season.

selected_by_stats <- home_away_data %>% 
  group_by(season, home_away) %>% 
  summarize(
    mean = mean(pct_selected),
    se = std.error(pct_selected),
    n = n()
  )
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
ggplot(selected_by_stats, 
       aes(x = season, color = home_away)) + 
  geom_point(aes(y = mean)) + 
  geom_linerange(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se)) + 
  theme_jw() + 
  scale_y_continuous(labels = scales::percent_format(scale = 100)) +
  labs(
    title = 'FPL managers\' selection of players, home vs away fixtures',
    subtitle = 'Fantasy Premier League, 2016/17 to 2022/23',
    x = '',
    color = '',
    y = '% managers selected by',
    caption = 'Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams
    
Plot shows means and 95% confidence intervals.
Calculations exclude player-gameweeks with selection rates less than 0.5%.'
  )

The plot above shows that there is no meaningful difference in the average selection rate of players with home fixtures vs players with away fixtures.

However, the plot also prompts some additional questions: what exactly is the distribution of ‘Percent selected by’? I am also curious what explains a player’s selection rates: do higher scoring players have higher selection rates? Note that, as explained above, our Percent selected by numbers are (i) based on season-end manager counts, and (ii) based on approximate manager counts for some seasons. As a result, Percent selected by is an approximation and comparisons across seasons should be taken with a grain of salt.

The distribution of Percent selected by

Let’s look at the distribution of Percent selected by. For the reasons mentioned above, we’ll look at the distribution for a single season only, rather than grouping observations from multiple seasons—I selected 2018/19 arbitrarily. As before, let’s exclude player-gameweek observations where Percent selected by is less than 0.5%.

ggplot(gw %>% filter(season == '2018/19', pct_selected > 0.005)) + 
  geom_histogram(aes(x = pct_selected), binwidth = 0.001) + 
  theme_jw() +
  labs(
    title = 'Distribution of player selection rates',
    subtitle = 'Fantasy Premier League, 2018/19 season',
    x = 'Percent selected by',
    y = 'Observations',
    caption = 'Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams
    
Excludes player-gameweeks with selection rates less than 0.5%.'
  )

### Do more highly selected players score more FPL points?

What we have seen so far also prompts us to ask whether FPL managers are more likely to select players who score more highly. Let’s plot gameweek FPL points against percent selected by.

breaks <- c(-0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 1)
labels <- c('0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25-30%', '30-35%', '35-40%', '40-45%', '45-50%', '50-100%')
gw2 <- gw %>% 
  mutate(pct_selected_bin = cut(pct_selected, breaks = breaks, labels = labels))

stats <- gw2 %>% 
  group_by(pct_selected_bin) %>%
  summarize(mean = mean(total_points),
            se = std.error(total_points),
            std = sd(total_points),
            min = min(total_points),
            max = max(total_points),
            n = n()) %>%
  ungroup() %>% 
  mutate(n_pct = n / sum(n))

ggplot(stats) +
  geom_point(aes(x = pct_selected_bin, y = mean)) +
  geom_linerange(aes(x = pct_selected_bin, ymin = mean - 1.96 * se, ymax = mean + 1.96*se)) +
  theme_jw() +
  labs(
    title = 'Player/gameweek points, by selection rate',
    subtitle = 'Fantasy Premier League, 2016/17 to 2022/23',
    x = 'Percent selected by',
    y = 'Player/gameweek points',
    caption = '
Source: https://github.com/vaastav/Fantasy-Premier-League; Jerome Williams'
  )

Indeed, it does seem appear that FPL managers are more likely to select higher-scoring players. The relationship is consistent up until Percent selected by of 40% or so—at which point, there are very few observations.