Dwight Wynne teaches statistics at California State University, Fullerton and supervises student projects helping their women’s volleyball team get more out of their data.
One tool that has become increasingly common in the modern consumption of sports analytics is a win probability graph. A win probability graph shows, given the current score and time left, the probability of each team winning the game. Using this graph, we can tell the story of a match: Did one team dominate from the start? How unlikely was that miraculous comeback? At what point should we have started to believe in that underdog?
In this post, we will introduce win probability graphs and explore their advantages in telling the story of a match, then (for those of you interested in the math) explore the general concepts behind building a win probability model for volleyball, and finally, get you started with creating your own win probability graphs.
Example Win Probability Graph: 2018 Men’s National Championship
In the 2018 men’s final, Long Beach State was down 2-1 to UCLA before putting the Bruins away in 5 sets. Typically, if you’ve been trying to tell the story of a match visually, you’ve been using a point differential graph – showing how many points ahead or behind a team is as a set progresses. A point differential graph looks something like this:
You might see this kind of graph in some different formats – maybe as a bar graph, maybe with a bit of extra information or color, maybe broken up by set instead of a single graph – but the main point of this kind of graph is to illustrate where the major runs in each set were. On the other hand, a win probability graph for the match looks something like this:
Like with the point differential graph, we can use this graph to visually tell the story of the game, for someone who missed watching it – or for a coach who wants to go beyond the box score to figure out just how critical that service run was. But there is an additional bit of information here that makes our story much more powerful: the direct contribution of each point toward winning or losing the match.
At the start of the match, Long Beach State is rightly favored – the model gives UCLA about a 30% chance to win. The relatively flat line toward the end of Set 1 indicates the point at which Long Beach State is in complete control of the first set; at this point, UCLA’s win probability has dropped to about 20%. Note that the Long Beach State service run toward the end of the first set – from up 4 to up 7 – contributes very little to the match win probability.
For most of the second set, UCLA’s win probability is below 20% until they start to climb out of a hole in the middle of the set. By the end of the set, having tied the match at 1-1, UCLA is almost up to a 40% chance to win.
In the middle of Set 3, we twice see a run by UCLA answered shortly by a run from Long Beach before a huge UCLA run toward the end of the set gives them the 2-1 lead. The win probability graph shows that even after UCLA’s second run of the set (from down 2 to up 2), Long Beach is still favored to win the match – UCLA does not become the favorite until taking a 19-17 lead in the set. Even after taking the 2-1 lead, UCLA only has somewhere between a 60% and 70% chance to win.
UCLA continues to lead in Set 4, reaching over 90% win probability twice – the last being at 17-13. (Again, think about what this says compared to the point differential graph: being up 4 points in Set 4 vs. having a 91% chance to win the match.) We can immediately see the extended run from Long Beach dropping UCLA’s win probability back down to about 60%, before a sequence of sideouts. Finally, UCLA drops the set, and at the end of Set 4 is back down to around a 40% chance to win.
For Set 5, being the pivotal set, the point differential graph and the win probability graph have similar shapes – but again, think about how important the different points are. Even when UCLA is up 2 in Set 5, they only have around a 65% chance to win, while once they get down 2, their chance to win has gone down under 20% again!
If we dive a bit deeper into the actual win probability data, we add some additional numerical context to key plays in the match:
- The UCLA service run toward the end of Set 3 (from down 1 to up 5) added about 30 percentage points to their match win probability, while the extended Long Beach State run toward the end of Set 4 (from down 4 to up 1) added about the same amount to their chances of winning the match.
- If you watched the match, it was clear that UCLA’s struggles from the service line – especially in Sets 4 and 5 – were a major factor. While there’s no guarantee that the outcome would have been any different had any of these serves gone in, we can now quantify exactly how badly those service errors hurt UCLA:
| Set | LBSU Service Errors | LBSU Win Prob Lost | UCLA Service Errors | UCLA Win Prob Lost |
As Chad has posted before, this kind of high-risk, high-reward serving strategy is a hallmark of a John Speraw team – you can get away with a lot of service errors as long as they come at the right times, but it really hurts when they come at the wrong ones.
Now let’s take a look at the math behind those win probability calculations. If you’re just interested in the graphs, scroll down to the next example.
If you’ve read Steve Aronson’s excellent analysis of set win percentages, you should have a pretty good idea of why models are very useful. To recap, modeling is useful because it allows us to “fill in” gaps where we don’t have a lot of data, or even make predictions where we don’t have any data at all!
While models are never able to perfectly capture the complexity of the real world, they are often “good enough” to give us useful insights into whatever we are modeling. To give you an example, look at Steve’s graphs predicting the probability of winning a set at different Home Rank Advantages. The model suggests that you are more likely to win the set if you are up 3-1 than if you are up 5-3. This is kind of ridiculous if you think about it, but we go along with it because overall the model does a very good job of explaining the data we observed.
Assumptions of a Win Probability Model
While many models in statistics are produced by looking at observed data and thinking about what kind of function “fits” the observed data well, other models are produced by thinking about how the data are generated and making some simplifying assumptions that allow us to express those thoughts mathematically. Here are some simplifying assumptions I make about how point-by-point data is generated in volleyball:
- What happened on the previous point has no effect on what happens on this point. There is no “hot hand” effect – if your best server just smashed an ace, that won’t make their upcoming serve any better or worse. This assumption may not perfectly describe the game – volleyball is a game of momentum – but for a first attempt at modeling the chance of winning a game, it makes the math a lot easier.
- On each serve, a team has an “average” chance of scoring a point no matter who is serving or what rotation they (or their opponent) are in. Depending on the model, we can assume that every team has the same chance, or vary that average depending on team and opponent strength. Again, we are oversimplifying – some teams get stuck in certain rotations more often than others, not all servers are created equal – but in the long run, all of these things will average out (and it makes the math easier).
- What happened in the previous set has no effect on what happens in this set. This is similar to assumption #1 – your chance of winning Set 2 is the same as your chance of winning Set 1. Again, this assumption may not perfectly describe the game – presumably the coaching staff is making adjustments – but (say it with me) it makes the math easier.
Modeling a Set: Assumptions 1 and 2
Mathematicians like to model games like volleyball using something called a directed graph. Basically, it looks like a flow chart. By following the arrows, we can find paths to get from one “node” (flow chart box) to another. For example, the directed graph below models a deuce game (next team to get a 2-point lead wins):
Starting at the “Up 1, with Serve” box, we can go to the “Win by 2” box or to the “Tied, Receiving” box. From the “Tied, Receiving” box we can go to the “Down 1, Receiving” box or back to the “Up 1, with Serve” box. In fact, we can make a loop between “Up 1, with Serve” and “Tied, Receiving” boxes very many times! (Imagine that you get up 24-23, and then both teams keep siding out for what seems like forever.) However, since there are only arrows going into the “Win by 2” and “Lose by 2” boxes, no arrows going out, once we get to one of those boxes, we have to stay there. This makes sense – once one of those conditions is met, the set is over.
Now that we have our graph, the next step is to figure out how likely it is that we go from one box to another box. Here is where the first two assumptions come in. Let’s say that you side out 70 percent of the time, while your opponent sides out only 60 percent of the time. By the first two assumptions, you will always have a 70% chance of sideout when your opponent serves, while your opponent will always have a 60% chance of sideout when you serve. Then, from the “Up 1, with Serve” box, there is a 60% chance that you will go to the “Tied, Receiving” box (because there is a 60% chance the opponent will side out and get the point and serve) and the remaining 40% of the time you will go to the “Win by 2” box. Similarly, from the “Tied, Receiving” box, there is a 70% chance that you will go to the “Up 1, with Serve” box and a 30% chance that you will go to the “Down 1, Receiving” box.
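One way to get a feel for the graph is to walk it at random many times. The sketch below is a simulation, not the algebra a mathematician would use, and the state names are my own labels for the boxes. It assumes the 70% and 60% sideout rates from the example and starts in the “Up 1, with Serve” box.

```python
import random

# Assumed inputs from the example: we side out 70% of the time, the opponent 60%.
OUR_SIDEOUT = 0.70
OPP_SIDEOUT = 0.60

def simulate_deuce(start, rng):
    """Walk the deuce graph from `start` until one team wins by 2.

    Returns True if our team wins the set.
    """
    state = start
    while True:
        serving = state.endswith("serve")
        # Serving: we score when the opponent fails to side out.
        # Receiving: we score when we side out.
        p_point = (1 - OPP_SIDEOUT) if serving else OUR_SIDEOUT
        won_point = rng.random() < p_point
        if state == "up1_serve":
            if won_point:
                return True            # reached "Win by 2"
            state = "tied_receive"
        elif state == "down1_receive":
            if not won_point:
                return False           # reached "Lose by 2"
            state = "tied_serve"
        else:                          # tied_serve or tied_receive
            state = "up1_serve" if won_point else "down1_receive"

rng = random.Random(2018)              # fixed seed so the run is repeatable
trials = 100_000
wins = sum(simulate_deuce("up1_serve", rng) for _ in range(trials))
print(f"Estimated chance to win from 'Up 1, with Serve': {wins / trials:.3f}")
```

With this many trials the estimate should land close to 0.81. Changing the `start` argument to `"tied_serve"` gives a rough check on any exact answer worked out from the same graph.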
A mathematician would then turn the graph into a series of equations and use a bunch of algebra to find that you have about a 56% chance to win the set starting at the “Tied, with Serve” box. Once they solve this problem for scores of 24-24 (serving/receiving), 25-24 (serving), and 24-25 (receiving), they can then start using a process called recursion to find the chance of winning the set at any other score. For example, if you are down 23-24 with serve, you have a 60% chance of losing the set on this point and a 40% chance of reaching the “Tied, with Serve” box on this point (where we know you have a 56% chance of winning), so overall you would have about a 22% chance to win the set. Then, if you are instead down 22-24 with serve, you still have a 60% chance of losing the set on this point, but you now have a 40% chance of being down 23-24 with serve (where we know you have a 22% chance of winning) after the point, so overall you would have a 9% chance of winning. In fact, in 2014, two Italian mathematicians (warning: paywalled) used this process to model the probability of winning a set for both the rally-scoring and sideout-scoring methods!
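Here is a minimal sketch of that two-step process: solve the deuce boxes first, then recurse backward to any score. It is my own illustration under the first two assumptions, not the code from the paper or from any package. The function names are hypothetical, and instead of algebra the deuce equations are solved by applying them repeatedly until the values settle.

```python
from functools import lru_cache

def deuce_probs(p_serve, p_receive, sweeps=500):
    """Solve the four deuce boxes by repeatedly applying their equations.

    p_serve   = chance we win a point when serving (1 - opponent sideout rate)
    p_receive = chance we win a point when receiving (our sideout rate)
    """
    up1 = down1 = tied_s = tied_r = 0.0
    for _ in range(sweeps):
        up1 = p_serve + (1 - p_serve) * tied_r      # win by 2, or back to tied
        down1 = p_receive * tied_s                  # tie it up, or lose by 2
        tied_s = p_serve * up1 + (1 - p_serve) * down1
        tied_r = p_receive * up1 + (1 - p_receive) * down1
    return {"up1_serve": up1, "down1_receive": down1,
            "tied_serve": tied_s, "tied_receive": tied_r}

def set_win_prob(us, them, we_serve, our_sideout, opp_sideout, target=25):
    """Chance we win the set from the score us-them, by recursion."""
    p_serve, p_receive = 1 - opp_sideout, our_sideout
    deuce = deuce_probs(p_serve, p_receive)

    @lru_cache(maxsize=None)
    def win(a, b, serve):
        if a >= target and a - b >= 2:
            return 1.0
        if b >= target and b - a >= 2:
            return 0.0
        if min(a, b) >= target - 1:                 # 24-24 or later: deuce boxes
            if a == b:
                return deuce["tied_serve" if serve else "tied_receive"]
            return deuce["up1_serve"] if a > b else deuce["down1_receive"]
        p = p_serve if serve else p_receive
        # Whoever wins the point serves the next one.
        return p * win(a + 1, b, True) + (1 - p) * win(a, b + 1, False)

    return win(us, them, we_serve)

print(round(deuce_probs(0.40, 0.70)["tied_serve"], 2))      # about 0.56
print(round(set_win_prob(23, 24, True, 0.70, 0.60), 2))     # about 0.22
print(round(set_win_prob(22, 24, True, 0.70, 0.60), 2))     # about 0.09
print(round(set_win_prob(0, 0, True, 0.70, 0.60), 2))       # start of a set
```

Passing `target=15` gives the same calculation for a fifth set.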
Modeling a Match: Assumption 3
Let’s say that we’ve won the first two sets of a five-set match already. Should we be confident in our chances to win the match, or is our opponent likely to steal the match from us? Here’s where that third assumption comes in.
When we use recursion in the previous section, we can get all the way to starting a set 0-0 either serving or receiving. This means that we can find exactly the chance of winning a set when either we start with serve or our opponent does. With our third assumption, this chance should be the same no matter which set we are modeling. If we have a 76% chance of winning Set 1 when we start with service, we should also have a 76% chance of winning Set 2, Set 3, and (if necessary) Set 4. Things get a little weird with Set 5, since it’s only played to 15 instead of 25 – in our example with our team siding out at 70% and the opponent at 60%, our chance of winning is a little lower at 73%, but we can assume this is true no matter which 2 sets (out of the first 4) we won.
So if we’ve already won the first two sets, we could either win Set 3, or lose Set 3 and win Set 4, or lose both of the next two sets but win Set 5 – any of those would be good enough for us to win. We would have a 76% chance to win Set 3, a 24% x 76% chance to lose Set 3 but win Set 4, and a 24% x 24% x 73% chance to lose both Sets 3 and 4 but get it done in Set 5. Overall, that adds up to a better than 98% chance of winning!
On the other hand, let’s say that we’ve already lost the first two sets. We would have to win all three remaining sets, which we have a 76% x 76% x 73% chance of doing – about a 42% chance to pull it off!
We can apply similar ideas at each stage of a match, although figuring out just how many different ways there are to win a match can be a bit tricky. The recursion trick we saw when modeling our set is sometimes useful!
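The same recursion idea can do the bookkeeping over sets so that we don’t have to list every way of winning by hand. This is a rough sketch that takes the rounded 76% and 73% set probabilities from the example as given; the function name is mine.

```python
from functools import lru_cache

# Rounded set win probabilities from the 70% vs. 60% sideout example.
P_SET_TO_25 = 0.76   # Sets 1-4
P_SET_TO_15 = 0.73   # Set 5

@lru_cache(maxsize=None)
def match_win_prob(sets_won, sets_lost):
    """Chance of winning a best-of-five match from the given set score."""
    if sets_won == 3:
        return 1.0
    if sets_lost == 3:
        return 0.0
    # The next set is Set 5 only when the match is tied 2-2.
    p = P_SET_TO_15 if sets_won + sets_lost == 4 else P_SET_TO_25
    return (p * match_win_prob(sets_won + 1, sets_lost)
            + (1 - p) * match_win_prob(sets_won, sets_lost + 1))

for score in [(2, 0), (0, 2), (1, 1), (0, 0)]:
    print(score, round(match_win_prob(*score), 3))
```

Because of the third assumption, the set score is the only thing the function needs to know. Which sets were won, and how, does not matter.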
Creating the Win Probability Model
To find our chance of winning at each point in the match, we need to estimate three probabilities:
prob1: Our chance of winning this set
prob2: Our chance of winning the match, if we win this set
prob3: Our chance of winning the match, if we lose this set
We can find prob1 for each point in a set from our set-modeling, and we can find prob2 and prob3 for each set from our match-modeling. We can then find our chance of winning at each point in the match as:
Prob(win match) = prob1*prob2 + (1-prob1)*prob3
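As a quick illustration of that formula, suppose the match is tied 1-1 and we are somewhere in Set 3. The sketch below reuses the rounded 76% and 73% set probabilities from the earlier example to get prob2 and prob3. The prob1 values are made up, just to show how the match number moves as the set swings.

```python
def match_win_prob_now(prob1, prob2, prob3):
    """Weight the two 'what if' match probabilities by the chance of each outcome."""
    return prob1 * prob2 + (1 - prob1) * prob3

# Sets tied 1-1, using the rounded set probabilities from the example.
p25, p15 = 0.76, 0.73
prob2 = p25 + (1 - p25) * p15   # up 2-1: win Set 4, or lose it and win Set 5
prob3 = p25 * p15               # down 1-2: must win both Set 4 and Set 5

# Hypothetical chances of winning the current set at different moments.
for prob1 in (0.25, 0.50, 0.76, 0.95):
    print(prob1, round(match_win_prob_now(prob1, prob2, prob3), 3))
```

The match probability can never leave the range between prob3 and prob2 during a set, which is why the graph tends to level off once one team has a set nearly locked up.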
Estimating Win Probabilities for the 2018 Men’s National Championship
Note that all three of the probabilities in the previous section were computed directly from an estimate of each team’s probability of sideout. In other words, all we need to know (or guess) is the chance of each team siding out and we can compute the set and match win probabilities at each point in the match.
Entering the championship match, Long Beach State was the #1 team in the nation in sideout rate at just over 72%, while UCLA was not far behind at about 71%. The big difference was at the service line: Long Beach State held opponents to only about a 57% sideout rate, while UCLA was much higher at around 63%. Doing some crude averaging, we could expect Long Beach State to sideout about 67% of the time and UCLA to sideout at about a 64% rate.
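One plausible reading of “crude averaging” is a simple mean of a team’s own sideout rate and the sideout rate its opponent usually allows. That reading is an assumption on my part, and the function name is hypothetical, but it lands on roughly the numbers quoted above.

```python
def expected_sideout(own_sideout, sideout_allowed_by_opponent):
    """Assumed 'crude average': the midpoint of the two season rates."""
    return (own_sideout + sideout_allowed_by_opponent) / 2

# Season rates quoted in the text, rounded to whole percentages.
lbsu = expected_sideout(0.72, 0.63)   # LBSU's own rate vs. what UCLA allows
ucla = expected_sideout(0.71, 0.57)   # UCLA's own rate vs. what LBSU allows
print(f"LBSU {lbsu:.3f}, UCLA {ucla:.3f}")
```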
We plug these numbers into our model for estimating the chance of winning a set and a match at each point in the match, and get the corresponding win probabilities.
Example 2: Kentucky @ Florida, March 19, 2021
Like the first example, this example comes from a 5-set match. However, in this case, I have no idea what the sideout rates of either team should be, so we’re going to investigate the effect of varying some sideout rates.
- First, we’ll assume that each team has a 60% sideout rate, which is a reasonable guess for high-level NCAA women’s volleyball.
- Second, we’ll investigate the effect of giving Florida a slight home court advantage (+0.5 percentage points for Florida, -0.5 percentage points for Kentucky to sideout rate). It turns out that this home court advantage corresponds to about a 53%-47% home court advantage in the chance of winning a set, which is slightly less than the 55%-45% home court advantage Steve found.
Here’s what we get when we assume the two teams each have a 60% sideout rate:
Under these assumptions, Kentucky is anywhere from a slight favorite to win to better than an 80% favorite to win for practically the entire match – as long as Kentucky is up in sets (1-0 in Set 2, 2-1 in Set 3) there is no way Florida can have better than a 50% chance to win. As soon as Kentucky takes control of Set 1 (a 7-0 run from down 12-14 to up 19-14) and Set 3 (a 6-1 run from down 12-13 to up 18-14), Florida is basically playing catch-up.
Even though Florida is clearly the better team in Set 4, the win probability graph flatlines at just under 50% – until Florida actually wins the set, Kentucky has a 50% chance of winning Set 5 + a very small probability of coming back and ending the match in 4. The slowly-increasing sawtooth pattern in the middle of Set 5 is common for a sequence of Set 5 sideouts – one team is slightly ahead, while the other team is running out of chances to catch up.
Let’s look at what happens when we give Florida the slight home-court advantage:
Compared to the previous graph, there’s a bit less switching between orange and blue at the start of Sets 1 and 3 – Florida starts off with a slight advantage in both sets and it takes a little bit longer for Kentucky to build up a big enough lead to eat into that advantage – and the plateau at the end of Set 4 now shows Florida as a slight favorite rather than a slight underdog. However, the general pattern is pretty similar.
Model Talk: Sensitivity Analysis
Okay, I know I said there wouldn’t be more math stuff, but this is a really important point.
In both examples, we basically guessed each team’s sideout rate. In the Long Beach State – UCLA match, we used some crude estimates based on each team’s sideout rates that season. In the Kentucky-Florida match, we didn’t even have that – I just spitballed a reasonable guess for each team’s sideout rate. If these guesses are way off, even if our model itself is awesome, we could have some serious problems with using our model to do anything informative. In a sensitivity analysis, we play around with our model inputs (sideout rates) and see how badly our model outputs (win probabilities) are affected.
For example, we can continue looking at the effect of changing the base sideout rate and the home-court advantage in sideout rate:
| Base Sideout Rate | Home Court Advantage (Sideout Rate Change) | Home Court Advantage (Set Win Probability at 0-0) | Home Court Advantage (Match Win Probability at 0-0) |
Here it looks like there isn’t a major effect of base sideout rate on the initial probability of winning a set or a match (even assuming the much less realistic 55% or 70%, the probabilities don’t change all that much), but our model is pretty sensitive to our choice of a home court advantage. This suggests that it’s more important to get right how much better one team is than the other than it is to get right the exact sideout rates for each team.
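If you want to run this kind of sensitivity check yourself, a compact sketch follows. It is my own simplified version of the model, not the volleysim code. It averages over which team serves first in each set (an assumption), and apart from the 55%, 60%, and 70% base rates and the half-point advantage mentioned in this post, the grid values are arbitrary picks.

```python
from functools import lru_cache

def deuce(p_s, p_r, sweeps=500):
    """Deuce boxes, solved by applying their equations until they settle."""
    up1 = dn1 = t_s = t_r = 0.0
    for _ in range(sweeps):
        up1 = p_s + (1 - p_s) * t_r
        dn1 = p_r * t_s
        t_s = p_s * up1 + (1 - p_s) * dn1
        t_r = p_r * up1 + (1 - p_r) * dn1
    return up1, dn1, t_s, t_r

def set_win_prob(home_so, away_so, home_serves_first, target=25):
    """Home team's chance of winning a set from 0-0."""
    p_s, p_r = 1 - away_so, home_so
    up1, dn1, t_s, t_r = deuce(p_s, p_r)

    @lru_cache(maxsize=None)
    def win(a, b, serve):
        if a >= target and a - b >= 2:
            return 1.0
        if b >= target and b - a >= 2:
            return 0.0
        if min(a, b) >= target - 1:
            if a == b:
                return t_s if serve else t_r
            return up1 if a > b else dn1
        p = p_s if serve else p_r
        return p * win(a + 1, b, True) + (1 - p) * win(a, b + 1, False)

    return win(0, 0, home_serves_first)

def set_prob_avg(home_so, away_so, target=25):
    # Assumption: each team is equally likely to serve first.
    return 0.5 * (set_win_prob(home_so, away_so, True, target)
                  + set_win_prob(home_so, away_so, False, target))

def match_win_prob(p25, p15):
    """Home team's chance of winning a best-of-five match from 0-0 in sets."""
    @lru_cache(maxsize=None)
    def win(w, l):
        if w == 3:
            return 1.0
        if l == 3:
            return 0.0
        p = p15 if w + l == 4 else p25
        return p * win(w + 1, l) + (1 - p) * win(w, l + 1)
    return win(0, 0)

print("base   adv     set    match")
for base in (0.55, 0.60, 0.70):
    for adv in (0.005, 0.01, 0.02):       # home gains adv, away loses adv
        p25 = set_prob_avg(base + adv, base - adv, 25)
        p15 = set_prob_avg(base + adv, base - adv, 15)
        print(f"{base:.2f}  {adv:+.3f}  {p25:.3f}  {match_win_prob(p25, p15):.3f}")
```

Reading down a column of the output shows how much the result depends on the base rate, while reading across the advantage values shows how much it depends on the gap between the teams.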
How to Make Your Own Win Probability Graph
The beauty of this approach is that you don’t need a DataVolley file or access to Volleymetrics to collect the data necessary to create your win probability graphs. You just need an Excel sheet with five columns – in one row per point played, record the set number, each team’s score at the end of the point, which team served, and which team scored. I add two more columns for team names (to make the graphs look pretty), but it’s not strictly necessary. If you’re a club coach and have a parent who’s willing to help and decent with Excel, this is a great way to get started with analytics.
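To make the layout concrete, here is a small sketch that reads such a sheet after it has been saved as CSV and recovers the situation before each point (set, score, server), which is what a win probability model takes as input. The column names and the four sample rows are invented purely for illustration.

```python
import csv
import io

# Made-up rows, only to show the five-column layout; column names are hypothetical.
SAMPLE = """set,home_score,away_score,served,scored
1,1,0,home,home
1,1,1,home,away
1,1,2,away,away
1,2,2,away,home
"""

def states_before_each_point(rows):
    """Yield the game state just before each recorded point was played."""
    prev_set, home, away = None, 0, 0
    for row in rows:
        set_no = int(row["set"])
        if set_no != prev_set:            # new set: the score resets to 0-0
            prev_set, home, away = set_no, 0, 0
        yield {"set": set_no, "home": home, "away": away, "server": row["served"]}
        # The sheet stores the score at the END of the point; carry it forward.
        home, away = int(row["home_score"]), int(row["away_score"])

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for state in states_before_each_point(rows):
    print(state)
```

For a real match, swapping `io.StringIO(SAMPLE)` for an opened CSV file should be the only change needed, as long as the column names match.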
If you’re more interested in the college level, most of the work is already done for you – just find the team you’re interested in, click the link to the Box Score for the game you’re interested in, and click the Play-By-Play tab. Using some cut-and-paste, the Filter function, and some basic IF formulas to clean up the data, you can get a workable Excel file in about 5 minutes (Sidearm Sports is easy as long as you use the “Paste Special” option and hand-code the set number; Presto Sports is a bit more involved but still doable). There are a few teams that don’t post point-by-point data and a few where that data isn’t easy to copy into Excel, but for the most part you should be okay.
Once you have the point-by-point data in an Excel file, it’s time to start making the win probability graph. I coded the win probability model and made all of the graphs in this post using R. The model has already been part of the internals of the volleysim package for a few months, and the plotting function should be available as of the Version 0.3.0 update.
Whew! This was a long post! But what did we accomplish here?
- We built some win probability graphs and explored how they tell a richer and more interesting story than a point differential graph
- We explored some theoretical underpinnings of a model for estimating the chance that a team wins a set/match, changing from a “see data, fit model” approach to a “think about how the data is generated” approach
- We performed a basic sensitivity analysis to investigate the effect of changing teams’ sideout rates on our win probabilities
- We discussed some future steps to get you started making your own win probability graphs