With the AVP Champions Cup over, let’s look at the stats! I scraped the data from their results page, but I won’t go over how I did the web scraping here. An additional resource for beach volleyball stats is https://github.com/BigTimeStats/beach-volleyball/tree/master/data by Adam Vagner. Adam’s database is quite impressive! It goes back to the year 2000 and was featured on tidytuesday. For this post I didn’t use Adam’s stats, because they don’t include qualifier matches, the PASS numbers, Serves in freeze (Sif) or Cblk. I want to see which AVP stats correlate the most with winning.
library(tidyverse)
df <- read_csv('women_cc_2020.csv')

Right away I notice that each player has an ‘S1’ or ‘S2’ at the end of their name, supposedly indicating who was the first server. (Annoyingly, it does not actually match who served first.) The comments in the code explain my wrangling choices. I also notice the data set is quite small, with only 138 matches.
# ---------------------------------
# Wrangles
# ---------------------------------
df <- df %>%
  # this match variable name is wrong
  mutate(match = ifelse(match == 'Match 12 Final', 'Match 12 Semif', match)) %>%
  # separate the player names from their serve order
  separate(Player, c('Player', 'serve_order'), ' S') %>%
  # change Pardon's name since she played with 2 partners across 3 tournaments
  mutate(Player = ifelse(team == 'Jace PardonEmily Hartong' & Player == 'Pardon', 'HPardon', Player),
         # sets played column
         sets_played = ifelse(set_3 != 0, 3, 2),
         # I want to know the result of the match
         result = ifelse(match_score == 2, 1, 0))
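
One thing to watch with splitting on ' S': a name that itself contains a space followed by an S would be cut in the wrong place. Here is a minimal sketch of a stricter split, assuming the suffix is always ‘S1’ or ‘S2’ at the very end of the name. The names below are made up for illustration, and tidyr::extract() is swapped in for separate(); this is not what I ran on the real data.

```r
library(tidyr)

# Made-up names, not rows from the scraped data
toy <- data.frame(Player = c('Smith S1', 'De Silva S2'),
                  stringsAsFactors = FALSE)

# Capture everything before the trailing " S1" / " S2" as the name,
# and the final digit as the serve order
toy_split <- tidyr::extract(toy, Player, c('Player', 'serve_order'),
                            regex = '^(.*) S([12])$')

print(toy_split)
```

With this pattern ‘De Silva S2’ keeps its full name, whereas splitting on ' S' would break it at the first ' S'.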

I need a way to identify teams consistently, because the team name column changes from one tournament to the next. I create a new variable for that named ‘teamname’.
# ---------------------------------
# Wrangles - Get team names accurately
# ---------------------------------
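names <- unique(df$Player)
teamnames <- names %>%
  as.data.frame() %>%
  rename(Player = '.') %>%
  # pair each player with the next one in the list
  mutate(team = lead(Player, 1)) %>%
  # keep every other row so each pair appears once
  filter(row_number() %% 2 == 1) %>%
  mutate(teamname = paste0(Player, '/', team)) %>%
  # back to one row per player, carrying the new teamname
  pivot_longer(c(Player, team)) %>%
  select(-name) %>%
  rename(Player = value)
# join explicitly on Player
df <- left_join(df, teamnames, by = 'Player')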
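
To make the pairing step easier to follow, here is the same idea on a toy vector of four players. It is only a sketch: it assumes partners always sit next to each other in the list of unique players, and `players` and `toy_teamnames` are names I made up for the example.

```r
library(dplyr)
library(tidyr)

# Toy input: partners are assumed to be adjacent in the vector
players <- c('Ross', 'Klineman', 'Stockman', 'Kolinske')

toy_teamnames <- tibble(Player = players) %>%
  mutate(team = lead(Player, 1)) %>%             # next player in the list
  filter(row_number() %% 2 == 1) %>%             # keep rows 1, 3, ... (one per pair)
  mutate(teamname = paste0(Player, '/', team)) %>%
  pivot_longer(c(Player, team)) %>%              # one row per player again
  select(-name) %>%
  rename(Player = value)

print(toy_teamnames)

# Sanity check: every player should map to exactly one teamname
stopifnot(!any(duplicated(toy_teamnames$Player)))
```

If the adjacency assumption breaks (for example, a player appears with a new partner), the duplicate check is a cheap way to find out before joining.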

Next I want to create some feature variables.
# ---------------------------------
# Wrangles / feature variables
# ---------------------------------
avp <- df %>%
  # create opponent
  mutate(Opponent_team = ifelse(index == 1, lead(teamname, 2),
                         ifelse(index == 3, lag(teamname, 1), NA)),
         Opponent_team = zoo::na.locf(Opponent_team)) %>%
  # I want to be looking at these stats by team and match by match
  group_by(teamname, tourn, match, Opponent_team, result) %>%
  # And combine the stats for team level
  summarise(Att = sum(Att),
            K = sum(K),
            Err = sum(Err),
            Digs = sum(Dig),
            Blk = sum(Blk),
            Cblk = sum(Cblk),
            Ace = sum(Ace),
            Se = sum(Se),
            Team.Pct = (K - Err) / Att,
            PASS = mean(PASS),
            Sif = sum(Sif)) %>%
  ungroup() %>%
  # I want to give each match a unique identifier
  arrange(tourn, match) %>%
  mutate(., match_id = group_indices(., tourn, match))
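
A quick worked example of the team hitting percentage, since the order of operations matters: the kills, errors and attempts are summed across both players first, and only then is (K - Err) / Att computed. The numbers below are invented for illustration.

```r
library(dplyr)

# Invented stat lines for two partners in one match
toy <- tibble(
  teamname = c('A/B', 'A/B'),
  Att = c(20, 15),
  K   = c(10, 6),
  Err = c(2, 3)
)

toy_team <- toy %>%
  group_by(teamname) %>%
  summarise(Att = sum(Att),
            K = sum(K),
            Err = sum(Err),
            # uses the summed values defined just above
            Team.Pct = (K - Err) / Att)

# For comparison: the plain average of the two individual percentages
mean_of_pcts <- mean((toy$K - toy$Err) / toy$Att)

print(toy_team)      # Team.Pct = (16 - 5) / 35
print(mean_of_pcts)  # (0.4 + 0.2) / 2
```

The two numbers differ (11/35 versus 0.3) because the summed version weights each player by how many attempts they took.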

And explore! Now I want to see how the stats influence winning a match.
# ---------------------------------
# EDA - Which skills correlate with winning the most?
# ---------------------------------
library(GGally)
avp %>%
  select(result, Att:Sif) %>%
  ggpairs()
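
If you only want the numbers for the result row rather than the whole grid, the correlations can also be computed directly. This is a small sketch in base R on invented data; `rank_by_correlation` is a helper name I made up, and it uses the default Pearson correlation from cor().

```r
# Hypothetical helper: correlate every other column with the outcome
# column and sort from highest to lowest
rank_by_correlation <- function(data, outcome) {
  predictors <- setdiff(names(data), outcome)
  cors <- sapply(predictors, function(v) cor(data[[outcome]], data[[v]]))
  sort(cors, decreasing = TRUE)
}

# Invented toy matches, not AVP data
toy <- data.frame(
  result   = c(1, 0, 1, 0, 1, 0),
  Team.Pct = c(0.40, 0.20, 0.35, 0.10, 0.50, 0.25),
  Ace      = c(2, 3, 1, 2, 4, 1)
)

ranked <- rank_by_correlation(toy, 'result')
print(ranked)
```

On the real data the call would look like `rank_by_correlation(select(avp, result, Att:Sif), 'result')`, assuming the columns are all numeric.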

I am most interested in the top row, which shows how strongly each stat correlates (Corr) with the result. ‘Team.Pct’ has the highest correlation, with Kills second and Aces and Digs behind. I’m surprised Digs is not ahead of Aces. Here are some other plots: (you can find the code here)


Looking at both of these plots, I would say straight away that ‘Team.Pct’ is the AVP stat most highly correlated with winning. To make sure, I will run a random forest to check skill importance:
# ---------------------------------
# Random Forest and AVP stat importance
# ---------------------------------
library(randomForest)
x <- avp %>%
  select(result, Att:Sif)
# ---------------------------------------------------------------
# Create randomForest model to find important variables using (almost) all AVP stats (missing Me).
rf <- randomForest(result ~ ., data = x, ntree = 500,
                   mtry = 2, importance = TRUE)
# ---------------------------------------------------------------
importance(rf)
#             %IncMSE IncNodePurity
#Att        3.2809235      2.242540
#K         16.5606505      4.791360
#Err        7.3001452      2.676184
#Digs       7.2437441      2.918214
#Blk        5.9031109      1.712375
#Cblk       0.1665503      1.537938
#Ace        4.4346670      2.076082
#Se         1.7959743      1.795940
#Team.Pct  21.8656690      7.290439
#PASS      -2.3802474      1.644211
#Sif        6.9321642      2.137900
# ---------------------------------------------------------------
# According to this random forest importance plot, Team.Pct and K are the most important stats for the AVP women's side
varImpPlot(rf)
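
A side note on the model type: because result is stored as a numeric 0/1, randomForest fits a regression forest, which is why the importance table reports %IncMSE. If you would rather treat win/loss as a class, converting the outcome to a factor switches it to classification. The sketch below uses simulated data with a made-up rule linking Team.Pct to winning, so the importance values mean nothing beyond showing the mechanics.

```r
library(randomForest)

set.seed(42)  # simulated data only, not AVP results
n <- 200
sim <- data.frame(
  Team.Pct = runif(n, 0, 0.6),
  Ace      = rpois(n, 3)
)
# Made-up rule: a higher Team.Pct makes a win (1) more likely
sim$result <- factor(as.integer(sim$Team.Pct + rnorm(n, sd = 0.1) > 0.3))

# A factor response makes randomForest fit a classification forest
rf_class <- randomForest(result ~ ., data = sim, ntree = 500,
                         mtry = 1, importance = TRUE)

print(rf_class$type)
print(importance(rf_class))
```

In classification mode the importance table reports MeanDecreaseAccuracy and MeanDecreaseGini instead of %IncMSE and IncNodePurity.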

Team.Pct correlates best with winning matches! Sif is likely so high because the teams that serve more in the freeze are the ones that go on to win the set/match. A-team (April Ross and Alix Klineman) had 66 Sif! They didn’t drop a match, which I am sure influences this model a bit. On the other end, Emily Stockman & Kelley Larsen Kolinske had 42 Sif and lost a match.
It would be worth it to take a look at the following:
- This data on a set level instead of a match level.
- Opponent stats (any) as a variable.
- Opp PASS, Ace, Serve Error, Digs, Cblk and Blk. This has the potential to give insight into how a team performs defensively after they serve.
- Include outside variables such as weather, time, court and refs.
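
As a starting point for the opponent-stats idea in the list above, one option is to join the team-level table to itself so each row also carries the other team’s numbers. This is a rough sketch on an invented two-row table; the `Opp_` column names are my own, and it assumes match_id, teamname and Opponent_team exist as built earlier.

```r
library(dplyr)

# Invented single match with two teams
toy_avp <- tibble(
  match_id      = c(1, 1),
  teamname      = c('A/B', 'C/D'),
  Opponent_team = c('C/D', 'A/B'),
  Ace           = c(3, 1),
  Se            = c(5, 7)
)

# Relabel each team's stats so they can be attached to its opponent's row
opp <- toy_avp %>%
  select(match_id, Opponent_team = teamname, Opp_Ace = Ace, Opp_Se = Se)

with_opp <- toy_avp %>%
  left_join(opp, by = c('match_id', 'Opponent_team'))

print(with_opp)
```

Each row now has its own Ace and Se next to the opponent’s, which could then go into the correlation plot or the random forest as extra variables.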