So…what was the point again?
Honestly, we didn’t set out to make a ratings system. We were building our xPWP model (expected Point Win Probability) and needed a reasonably accurate estimate of the initial chance of each team winning the point (e.g., Nebraska serving to Wisconsin: before the serve happens, how likely is Nebraska to win this point?). Initially we just set every input xPWP for serve to the average PS% for the entire dataset, which is pretty similar to what other expected points/expected goals models do. The problem is that this only works when the worst teams are still roughly similar in quality to the best teams – possibly a fair assumption in most professional sports leagues, but not so much in the NCAA.
Pablo rankings gave us the chance of each team winning the match, but it wasn’t clear how to turn this into the chance of winning the point. So we built our own model to estimate it.
About mid-September, we realized that we could use the R volleysim package (using various statistical inputs, the package runs many simulated matches to return the likelihood of winning) to turn those xPWP values into more granular match predictions – not just who would win but in how many games. A few tweaks later, the Volleydork Ratings were born.
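The volleysim package handles this with more statistical nuance than we can show here, but the core idea – turn point-level win probabilities into match-level probabilities by simulating lots of matches – fits in a short sketch. This is our own illustration in Python, not volleysim’s actual API; the function names and rally-scoring assumptions (sets to 25, win by 2, fifth set to 15, best of 5) are ours:

```python
import random

def sim_set(p_serve_a, p_serve_b, target=25):
    """Simulate one rally-scoring set; team A serves first.
    p_serve_a: P(A wins the rally when A serves); p_serve_b likewise for B.
    Returns True if A wins the set (reach `target`, win by 2)."""
    a = b = 0
    serving_a = True
    while max(a, b) < target or abs(a - b) < 2:
        p_a_wins = p_serve_a if serving_a else 1 - p_serve_b
        if random.random() < p_a_wins:
            a += 1
            serving_a = True   # A won the rally, so A serves next
        else:
            b += 1
            serving_a = False
    return a > b

def sim_match(p_serve_a, p_serve_b, n=10000):
    """Estimate P(A wins a best-of-5 match) by Monte Carlo simulation."""
    wins = 0
    for _ in range(n):
        sets_a = sets_b = 0
        while sets_a < 3 and sets_b < 3:
            target = 15 if sets_a + sets_b == 4 else 25  # fifth set to 15
            if sim_set(p_serve_a, p_serve_b, target):
                sets_a += 1
            else:
                sets_b += 1
        wins += sets_a == 3
    return wins / n
```

Two evenly matched teams come out near 50%, as you’d hope, and the same machinery gives you set-count predictions for free: just tally how many sets each simulated match took instead of only who won.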
Ok cool, so how did we do?
The gold standard for calibration at the match level is the Pablo rankings. If Pablo says Illinois has an 80% chance of winning the match, you can expect in the long run Illinois will win 80 out of 100 matches. We are…not quite there yet. But we’re really happy with what we were able to accomplish running an essentially unoptimized model on one season’s worth of games.
We grouped our matches by prediction confidence, from 50-52% up to 98-100%, and looked at the proportion of matches we correctly predicted in each interval. If our confidence is accurate and we say we have a 50-52% chance of winning a match, we’d expect to win around 51 out of every 100 such matches. However, we know that sometimes we might get lucky and win a lot more than 51 – and sometimes we might be unlucky and win far fewer.
The graph below shows the calibration curve for our model over all games in our dataset starting after our one tweak to the model (to tie the serve “advantage” to the actual overall SO% for the season). The black points indicate the actual proportion of matches correctly predicted in each interval, and the blue error bars indicate what we would expect to get in 95% of seasons, assuming our confidence is accurate. So for example, we want to be at 51% correctly predicted in the very left bar, but 95% of the time we’d actually get somewhere between 38% and 64% correct, just because of randomness.
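If you want to reproduce those error bars, each one is just a binomial interval: given n matches in a bin and a stated confidence p, how many correct predictions would you see 95% of the time if p is right? A stdlib-only Python sketch (the function names are ours):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials at success rate p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def calibration_band(n, p, level=0.95):
    """Central interval of correct-prediction counts covering at least
    `level` probability, assuming the stated confidence p is accurate."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    tail = (1 - level) / 2
    lo, acc = 0, 0.0
    while acc + pmf[lo] <= tail:      # trim the lower tail
        acc += pmf[lo]
        lo += 1
    hi, acc = n, 0.0
    while acc + pmf[hi] <= tail:      # trim the upper tail
        acc += pmf[hi]
        hi -= 1
    return lo, hi                     # counts; divide by n for proportions
```

Run this on a bin of a few dozen matches at 51% confidence and you land in the neighborhood of the 38%-to-64% band quoted above. The fewer matches in a bin, the wider the band – which is exactly why the November/December bars below are wider.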
Our predictions were inside the blue range in 23 of 25 intervals (92%, compared to the 95% expected), but several intervals just barely hit the bottom of that range, and we have no idea what happened around 85% confidence.
Here we show the same thing, but for games played in November and December. The bars are wider because there are fewer games in the sample, but our predictions were inside the blue range in 24 of 25 intervals (96% compared to 95% expected).
We’ve also been reporting the Brier Score for our model predictions. Remember that a Brier Score is basically “how wrong you were”, squared. This means that a score of 0 would be savant-like prediction (you predict with 100% confidence in the outcome and you’re never wrong). A Brier Score of 0.25 indicates you’re just sitting on the fence (you predict every team has a 50/50 chance to win, so you’re always 50% wrong, hence 0.5² = 0.25).
Over the course of the season we ended up with a Brier Score of 0.15, which is pretty good considering how much randomness there is in volleyball.
Just for an example to help wrap your brain around it: we said Wisconsin had a 60% chance to beat Nebraska in the 2021 National Championship. What actually happened was that Wisconsin won 100% of that match – hence the big trophy and confetti and all that… So our Brier Score on that single match was (1 – 0.6)² = 0.4² = 0.16. For the season, we average these scores to see how well we’re doing overall. Below is a 100-match rolling average throughout the season.
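The per-match score and the rolling average are both a couple of lines. A minimal Python sketch (the names are ours):

```python
def brier(pred, outcome):
    """Squared error of a probability forecast.
    outcome is 1 if the predicted team won, 0 if it lost."""
    return (outcome - pred) ** 2

def rolling_brier(scores, window=100):
    """Rolling mean of per-match Brier scores over a sliding window."""
    out, running = [], 0.0
    for i, x in enumerate(scores):
        running += x
        if i >= window:
            running -= scores[i - window]   # drop the oldest match
        if i >= window - 1:
            out.append(running / window)
    return out
```

Here `brier(0.6, 1)` reproduces the 0.16 from the Wisconsin example, and `brier(0.5, outcome)` gives the fence-sitter’s 0.25 no matter what happens.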
Did the Model Tweak Make It More Accurate?
We can’t really say, because there are two different forces at play here. One is the tweak we added to make the model less overconfident, and the other is feeding the model more data. In fact, Dwight looked at some papers from really smart people at Stanford and Google that suggested that the model was overconfident early on precisely because there wasn’t enough data.
Back To the Original Purpose
How do we feel about using the model to feed into our xPWP calculations? Pretty good. The graph below shows the actual sideout percentage (y-axis) against what the model predicted the sideout percentage to be (x-axis) for about 4500 team-matches (around 2250 matches, 2 teams per match). The red dots show the predictions from when we didn’t have a lot of data and didn’t tie the serve advantage to the overall SO%, and the blue dots show the later predictions after the tweak.
The graph suggests our model predictions of sideout rates are roughly accurate +/- around 10%. That is, when we predict a team to have a SO% of 65% in the match, they’re typically going to sideout at somewhere between 55-75%. In a match where you receive 100 serves at 65% sideout rate, math says you should sideout on between 56 and 74 serves (95% of the time), so our model predictions are estimating sideout rates about as well as they can.
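That 56-to-74 range is just the normal approximation to a binomial count. A quick sketch (the function name is ours), treating each reception as an independent coin flip at the predicted sideout rate:

```python
from math import sqrt

def sideout_range(n_serves, so_pct, z=1.96):
    """Approximate central 95% range for the number of sideouts in
    n_serves receptions, via the normal approximation to the binomial
    (z = 1.96 covers 95% of a normal distribution)."""
    mean = n_serves * so_pct
    sd = sqrt(n_serves * so_pct * (1 - so_pct))
    return round(mean - z * sd), round(mean + z * sd)
```

`sideout_range(100, 0.65)` returns (56, 74), matching the range above – the point being that a roughly ±10% spread in realized SO% is close to the floor that rally-to-rally randomness imposes on any prediction.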
Conclusions and Future Work
We’re really happy with how our model turned out for being essentially a public test run. We were half-expecting this project to crash and burn spectacularly, and it ended up being pretty accurate, especially with regard to what we were originally trying to predict.
The one drawback to the ratings model is that the churn takes forever…not so much the modeling part, but the part where you have to go get and manipulate all the play-by-play from the NCAA website because they don’t report team PS and SO anywhere. This shouldn’t be as big of a problem for NCAA men’s volleyball (many fewer teams to check) or for professional leagues (also many fewer teams, and better stats reporting). That said, we’re spending the next couple of months updating our xPWP model, so we’re probably not going to release men’s ratings…at least until mid-February. Then we’ll see.
We buried this at the bottom of the post because it’s the one area our model didn’t do so well in. For some reason, we were consistently underestimating the chance that a match would go 3 sets while consistently overestimating the chance that the match would go 5. This might have something to do with how variable the sideout rates are around our predictions…something to spend the next few months looking into.
But hey, thanks for reading this far! If you’ve got random thoughts, suggestions of studies to do, or just want to get in touch and say hi – Dwight and I are pretty accessible via email or Twitter or whatever…carrier pigeon…