Guest Post: Tyler Widdison (USA Beach) – AVP Women’s Champions Cup EDA (exploratory data analysis)

With the AVP Champions Cup over, let’s look at the stats! I scraped the data from the AVP results page; I won’t go over the scraping here. For additional beach volleyball stats, see Adam Vagner’s repository: https://github.com/BigTimeStats/beach-volleyball/tree/master/data. Adam’s database is quite impressive: it goes back to the year 2000 and was featured on tidytuesday. I didn’t use it for this post, though, because it doesn’t include qualifier matches, the PASS numbers, serves in freeze (Sif), or Cblk. I want to see which AVP stats correlate most with winning.

library(tidyverse)
df <- read_csv('women_cc_2020.csv')
The first step is to wrangle the data into a usable shape.

Right away I notice that each player has an ‘S1’ or ‘S2’ at the end of their name, supposedly indicating who served first. (Annoyingly, it doesn’t actually match who the first server was.) The comments in the code explain each wrangling step. I also notice that the data set is quite small, with only 138 matches.

# ---------------------------------
# Wrangles
# ---------------------------------
df <- df %>% 
  # this match variable name is wrong
  mutate(match = ifelse(match == 'Match 12 Final', 'Match 12 Semif', match)) %>% 
  # separate the player names from their serve order
  separate(Player, c('Player', 'serve_order'), ' S') %>% 
  # change pardon name since she played with 2 partners across 3 tournaments 
  mutate(Player = ifelse(team == 'Jace PardonEmily Hartong' & Player == 'Pardon', 'HPardon', Player),
         # sets played column
         sets_played = ifelse(set_3 != 0, 3, 2),
         # I want to know the result of the match
         result = ifelse(match_score == 2, 1, 0))
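
One thing to watch with splitting on ' S' is that the pattern is not anchored to the end of the string, so a name that itself contains a space followed by a capital S would be cut early. Below is a minimal base R sketch of an anchored alternative, using made-up names. It is not the code used in this post, and `players`, `player_name` and `serve_order` are illustrative only.

```r
# Made-up player strings in the "<name> S<1|2>" shape described above.
players <- c("Ross S1", "Klineman S2", "Van Sickle S1")

# Strip only a trailing " S1" / " S2" suffix to get the clean name.
player_name <- sub(" S[12]$", "", players)

# Capture the trailing digit as the serve order.
serve_order <- sub("^.* S([12])$", "\\1", players)

data.frame(player_name, serve_order, stringsAsFactors = FALSE)
```

Because the pattern ends in `$`, only the suffix is removed and "Van Sickle" survives intact.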

I need a way to identify teams consistently, because the team name columns change across tournaments. I create a new variable for that, named ‘teamname’.

# ---------------------------------
# Wrangles - Get team names accurately
# ---------------------------------
names <- unique(df$Player)

teamnames <- names %>% 
  as.data.frame() %>% 
  rename(Player = '.') %>% 
  mutate(team = lead(Player,1)) %>% 
  filter(row_number() %% 2 == 1) %>% 
  mutate(teamname = paste0(Player, '/', team)) %>% 
  pivot_longer(c(Player, team)) %>% 
  select(-name) %>% 
  rename(Player = value)

# join on Player explicitly rather than relying on the shared column names
df <- left_join(df, teamnames, by = 'Player')
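
The pairing step relies on an assumption worth spelling out: the unique player list must contain an even number of names, with partners next to each other (which is why Pardon gets a second name above). Here is a small base R sketch of the same idea on a made-up list. The object names `first`, `second` and `lookup` are illustrative and do not appear in the original code.

```r
# Made-up player list; partners are assumed to be adjacent.
players <- c("Ross", "Klineman", "Stockman", "Kolinske")
stopifnot(length(players) %% 2 == 0)  # pairing needs an even count

# Odd positions are the first partner, even positions the second.
first  <- players[seq(1, length(players), by = 2)]
second <- players[seq(2, length(players), by = 2)]
teamname <- paste0(first, "/", second)

# One row per player, each carrying the shared team name.
lookup <- data.frame(
  Player   = c(rbind(first, second)),  # interleave back to original order
  teamname = rep(teamname, each = 2),
  stringsAsFactors = FALSE
)
lookup
```

If the ordering assumption does not hold for a given scrape, the lookup would silently pair the wrong players, so it is worth eyeballing the result.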

Next I want to create some feature variables.

# ---------------------------------
# Wrangles / feature variables
# ---------------------------------
avp <- df %>% 
  #create opponent 
  mutate(Opponent_team = ifelse(index == 1, lead(teamname, 2),
                         ifelse(index == 3, lag(teamname, 1), NA)),
         Opponent_team = zoo::na.locf(Opponent_team)) %>%   
  #I want to be looking at these stats by team and match by match
  group_by(teamname, tourn, match, Opponent_team, result) %>% 
  #And combine the stats for team level
  summarise(Att = sum(Att),
            K = sum(K),
            Err = sum(Err),
            Digs = sum(Dig),
            Blk = sum(Blk),
            Cblk = sum(Cblk),
            Ace = sum(Ace),
            Se = sum(Se),
            Team.Pct = (K-Err)/Att,
            PASS = mean(PASS),
            Sif = sum(Sif)) %>% 
  ungroup() %>% 
  #I want to give each match a unique identifier
  arrange(tourn, match) %>% 
  mutate(., match_id = group_indices(., tourn, match))
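
A note on Team.Pct: because K, Err and Att are summed first, the team percentage is weighted by attempts rather than being a plain average of the two players’ percentages. A tiny worked example with made-up numbers (the helper `hit_pct` is hypothetical):

```r
# Hitting efficiency: (kills - errors) / attempts.
hit_pct <- function(kills, errors, attempts) (kills - errors) / attempts

# Made-up stat lines for two partners in one match.
K   <- c(12, 9)
Err <- c(3, 2)
Att <- c(30, 25)

# Team level, as in the summarise() step: sum first, then divide.
team_pct <- hit_pct(sum(K), sum(Err), sum(Att))   # 16 / 55

# Plain average of the two individual percentages, for comparison.
avg_pct <- mean(hit_pct(K, Err, Att))             # mean of 9/30 and 7/25

c(team = team_pct, average = avg_pct)
```

The two numbers differ whenever the partners take a different number of swings. PASS, by contrast, is averaged with `mean()` above, so it is not weighted by how many serves each player received.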

And explore! Now I want to see how the stats influence winning a match.

# ---------------------------------
# EDA - Which skills correlate with winning the most? 
# ---------------------------------
library(GGally)

avp %>% 
  select(result, Att:Sif) %>% 
  ggpairs()
An interesting deep dive would be to look at digs versus attack attempts on the women’s side. A nice correlation!

I am most interested in the top row, which shows how each stat correlates (‘Corr:’) with the result. ‘Team.Pct’ has the highest correlation, with Kills second and Aces and Digs behind. I’m surprised Digs is not ahead of Aces. Here are some other plots: (you can find the code here)
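
If you only want the top row as numbers, the same Pearson correlations can be pulled out with `cor()`. The sketch below uses a tiny made-up data frame, not the AVP data; only the column names are borrowed from the post.

```r
# Made-up match rows: result is 1 for a win, 0 for a loss.
toy <- data.frame(
  result   = c(1, 0, 1, 0, 1, 0),
  Team.Pct = c(0.42, 0.18, 0.35, 0.22, 0.40, 0.25),
  Ace      = c(3, 1, 2, 2, 4, 1),
  Se       = c(2, 4, 3, 3, 1, 5)
)

# Correlate every stat column with the result column.
cors <- sapply(toy[-1], function(col) cor(toy$result, col))

# Strongest positive relationship first.
sort(cors, decreasing = TRUE)
```

On the real data, the equivalent would be to run this over the `Att:Sif` columns of `avp`.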

Where the red line is the ‘mean’ for the losing team and the blue line is the ‘mean’ for the winning team.

Looking at both of these plots, I would say straight away that ‘Team.Pct’ is the AVP stat most strongly correlated with winning. To make sure, I will run a random forest to estimate the importance of each stat:

# ---------------------------------
# Random Forest and AVP stat importance 
# ---------------------------------
library(randomForest)
x <- avp %>% 
  select(result, Att:Sif)

# ---------------------------------------------------------------
# Create randomForest model to find important variables using (almost) all AVP stats (missing Me).
rf <- randomForest(result~., data=x, ntree=500, 
                   mtry=2, importance=TRUE)

# ---------------------------------------------------------------
importance(rf)
#          %IncMSE        IncNodePurity
#Att       3.2809235      2.242540
#K        16.5606505      4.791360
#Err       7.3001452      2.676184
#Digs      7.2437441      2.918214
#Blk       5.9031109      1.712375
#Cblk      0.1665503      1.537938
#Ace       4.4346670      2.076082
#Se        1.7959743      1.795940
#Team.Pct 21.8656690      7.290439
#PASS     -2.3802474      1.644211
#Sif       6.9321642      2.137900

# ---------------------------------------------------------------
# According to this random forest plot, Team.Pct and K are most important on the AVP women's side
varImpPlot(rf)

Team.Pct correlates best with winning matches! Sif likely ranks as high as it does because the teams that serve more in the freeze are the ones that go on to win the set or match. The A-Team (April Ross and Alix Klineman) had 66 Sif! They didn’t drop a match, which I am sure influences this model a bit. At the other end, Emily Stockman & Kelley Larsen Kolinske had 42 Sif and lost a match.
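
To make the ranking easier to read, here is a short base R sketch that sorts the %IncMSE values printed above. The numbers are copied from the `importance(rf)` output; the data frame `imp` and its column names are just for illustration.

```r
# %IncMSE values copied from the importance(rf) output above.
imp <- data.frame(
  stat    = c("Att", "K", "Err", "Digs", "Blk", "Cblk",
              "Ace", "Se", "Team.Pct", "PASS", "Sif"),
  inc_mse = c(3.2809235, 16.5606505, 7.3001452, 7.2437441, 5.9031109,
              0.1665503, 4.4346670, 1.7959743, 21.8656690, -2.3802474,
              6.9321642),
  stringsAsFactors = FALSE
)

# Sort from most to least important and attach a rank.
ranked <- imp[order(imp$inc_mse, decreasing = TRUE), ]
ranked$rank <- seq_len(nrow(ranked))
ranked
```

Sorted this way, Team.Pct and K lead, followed by a close group of Err, Digs and Sif. The negative value for PASS means that shuffling it did not hurt the out-of-bag error, so this model finds no evidence that it helps predict the result.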

It would be worth it to take a look at the following:

  • This data on a set level instead of a match level.
  • Opponent stats (any) as a variable.
  • Opp PASS, Ace, Serve Error, Digs, Cblk and Blk. These have the potential to give insight into how a team performs defensively after it serves.
  • Include outside variables such as weather, time, court and refs.
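
As a starting point for the opponent-stats idea, one option would be to join the team-level table to itself on the match and the opponent name. A rough base R sketch on made-up rows follows. The `match_id`, `teamname` and `Opponent_team` columns mirror `avp`, while the team names, values and `Opp_` column names are hypothetical.

```r
# Made-up team-level rows, two teams per match.
team_match <- data.frame(
  match_id      = c(1, 1, 2, 2),
  teamname      = c("A/B", "C/D", "A/B", "E/F"),
  Opponent_team = c("C/D", "A/B", "E/F", "A/B"),
  Ace           = c(4, 1, 2, 3),
  Digs          = c(20, 15, 18, 22),
  stringsAsFactors = FALSE
)

# Copy of the stats, relabelled so each row describes "the opponent".
opp <- team_match[, c("match_id", "teamname", "Ace", "Digs")]
names(opp) <- c("match_id", "Opponent_team", "Opp_Ace", "Opp_Digs")

# Attach the opponent's stats to every team row.
with_opp <- merge(team_match, opp, by = c("match_id", "Opponent_team"))
with_opp
```

Each row then carries both the team’s own numbers and the numbers the opponent put up in the same match, which could be fed into the same correlation and random forest steps.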