Having a lead in a set conveys confidence that we are more likely to win. We should be even more confident if that lead is near the end of the set, unless we are matched against a much better opponent. Using a lot of data and some modeling, we can quantify that confidence into winning percentages. Here, we will look at winning percentage as a function of the score, the lead, and the competition.
The first step is pure statistics. For each combination of score and lead, calculate the fraction of sets that resulted in wins. The resulting win percentages will provide some insights into home court advantage and how much a single point increases our chance of winning.
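As a minimal sketch of this counting step, the snippet below tallies home wins for each (score, lead) combination. The tuple layout and the toy rows are assumptions made for illustration; they are not the article's data set or its actual code.

```python
from collections import defaultdict

def win_fraction_by_state(observations):
    """Map (home_score, lead) -> (fraction of home wins, number of observations)."""
    counts = defaultdict(lambda: [0, 0])  # state -> [home wins, total]
    for home_score, lead, home_won in observations:
        cell = counts[(home_score, lead)]
        cell[0] += int(home_won)
        cell[1] += 1
    return {state: (wins / total, total) for state, (wins, total) in counts.items()}

# Hypothetical toy rows: (home score, home lead, did the home team win the set?)
observations = [
    (10, 1, True), (10, 1, True), (10, 1, False),
    (10, -1, False), (10, -1, True),
    (20, 1, True), (20, 1, True),
]

table = win_fraction_by_state(observations)
for state in sorted(table):
    fraction, n = table[state]
    print(state, round(fraction, 2), n)
```

Keeping the observation count next to each fraction makes it easy to see which combinations rest on too few sets to trust.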
What about the competition? A one-point lead against a weaker team should have a higher win percentage than a one-point lead against a stronger opponent. Including competitive advantage complicates our simple statistics because we don’t have enough data at each set of conditions (score, lead, advantage) to reliably calculate the percentages. What now?
The answer is modeling, which lets us fit a function to the sparse data. We can then use the fitted function to explore how winning percentage changes as we vary score, lead, and advantage.
To build the model, we will start from the statistics of the first step and find a function that reasonably fits the winning percentage data. This function can be thought of as the probability of winning conditioned on score and lead, with parameters chosen to best fit our data. Once we have our function, we will add a term for competitive advantage and refit to the data.
Step 1: Statistics
Our data set is 2019 NCAA Women’s D1 and D2 matches. A quick check shows that we have ~23K sets and that the home team wins 55% of them.
Limiting our data to sets 1-4 and final scores <= 25, we get the following for the home team:
Let’s digest this plot in steps:
- The y-axis is the fraction of home wins: 0 means no sets resulted in a home win, 0.5 is an even 50%/50% win/lose split, and 1 means every set resulted in a home win.
- The x-axis is the current home score.
- Each color represents a point lead from -3 (down by 3) to +3 (ahead by 3).
- Looking at the black dots (tied score), at the left we see the home court advantage (0.55), which fades as the set progresses. This makes sense, as any home court advantage (5% in our data) is much less meaningful at the end of the set.
- The vertical separation of colors is the impact of one point on the win%. Plotting the data at 3 scores on new axes gives the following:
The slope of these lines (the second term in the legend) is the change in win% for a one-point change in lead when the score is close. We see that the slope at the end of the set is about twice the slope at the beginning and middle (~20% vs ~10%). We have quantified how much winning a point near the end of the set is worth.
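For readers who want to reproduce a slope like this, here is a hedged sketch of an ordinary least-squares slope of win fraction against lead at one fixed score. The win fractions are invented for illustration and are not read off the article's plot.

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit straight line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

leads = [-3, -2, -1, 0, 1, 2, 3]
# Hypothetical win fractions at a single home score (illustrative only)
win_fractions = [0.22, 0.31, 0.42, 0.52, 0.61, 0.72, 0.80]

slope = least_squares_slope(leads, win_fractions)
print(f"change in win fraction per point of lead: {slope:.3f}")
```

Repeating the fit at an early, a middle, and a late score gives the three slopes that the comparison above is based on.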
Step 2: Modeling with Competitive Advantage
Up to now, all analysis was done with data and simple statistics. Why transition to modeling? In this case, we want to see how the competitive advantage impacts win%. If we tried to use simple statistics, we would spread our data set across a larger set of factors (score, lead, advantage) and end up calculating win% on small counts (fewer than 20 sets). The resulting statistics would be unreliable and would likely hide the trends that we are trying to observe.
With a model, we can fit to the sparser data by assuming that win% shouldn’t change dramatically between adjacent points. For this model, we will take a 2-step approach:
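To see why small counts are a problem, the sketch below uses the standard error of a binomial proportion, sqrt(p(1-p)/n). The choice of p = 0.5 and the three sample sizes are assumptions picked for illustration.

```python
import math

def proportion_std_error(p, n):
    """Standard error of an observed win fraction p measured over n sets."""
    return math.sqrt(p * (1 - p) / n)

for n in (20, 200, 2000):
    se = proportion_std_error(0.5, n)
    # 1.96 standard errors is the usual approximate 95% half-width
    print(f"n={n:5d}  std error={se:.3f}  approx 95% half-width={1.96 * se:.3f}")
```

With only 20 sets in a cell, the approximate 95% half-width is about 0.22, which is larger than the roughly 10% effect of a single point that we are trying to measure.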
- Find a model structure (equation) that fits the win fraction data we calculated above. We consider the plot above a sample of the probability distribution of the win%. We are trying to find a function that best estimates this probability distribution.
- Add competitive advantage as a term into the same structure found above.
Our model’s outcome is the probability of a win. A logistic regression (logit) is a good choice for this type of outcome. We only need to decide the type of equation to plug into the logit function. A simple assumption is to use a polynomial of the input factors (score and lead). We just need to determine the order of the polynomial (score^2, score^3, etc.). We try orders from 1 to 5 and choose the lowest order that gives a reasonable fit. The plot below shows these results.
While orders 4 and 5 have slightly better fits, the 3rd order is a good tradeoff between fit and complexity. Having established our structure, we are ready for competitive advantage.
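The fitting step can be sketched roughly as follows. This is not the article's code: it uses plain gradient descent instead of a statistics library, inputs rescaled to roughly [-1, 1], and synthetic win counts generated from an invented rule. The helper names (`poly_features`, `fit_logistic`, `scaled`) are hypothetical.

```python
import math
from itertools import combinations_with_replacement

def poly_features(inputs, order):
    """Constant term plus every monomial of the inputs up to the given order."""
    feats = [1.0]
    for degree in range(1, order + 1):
        for combo in combinations_with_replacement(inputs, degree):
            feats.append(math.prod(combo))
    return feats

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, inputs, order):
    feats = poly_features(inputs, order)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)))

def fit_logistic(cells, order, steps=1000, rate=0.5):
    """Gradient descent on the binomial log-loss.

    cells is a list of (inputs, wins, total), one entry per (score, lead) cell.
    """
    rows = [(poly_features(x, order), wins, total) for x, wins, total in cells]
    n_obs = sum(total for _, _, total in rows)
    weights = [0.0] * len(rows[0][0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for feats, wins, total in rows:
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)))
            for j, f in enumerate(feats):
                grad[j] += (total * p - wins) * f
        weights = [w - rate * g / n_obs for w, g in zip(weights, grad)]
    return weights

def scaled(score, lead):
    """Rescale inputs so that high-order terms stay well behaved (an assumption)."""
    return (score / 25.0, lead / 3.0)

# Synthetic cells: 20 sets per (score, lead), wins drawn from an invented rule
cells = []
for score in range(0, 25, 4):
    for lead in range(-3, 4):
        p_true = sigmoid(0.4 * lead * (1 + score / 25.0))
        cells.append((scaled(score, lead), round(20 * p_true), 20))

weights = fit_logistic(cells, order=3)
p_ahead = predict(weights, scaled(20, 3), order=3)
p_behind = predict(weights, scaled(20, -3), order=3)
print(f"up 3 at 20: {p_ahead:.2f}   down 3 at 20: {p_behind:.2f}")
```

Running `fit_logistic` for each order from 1 to 5 and comparing the resulting fit quality is one way to reproduce the order comparison described above.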
Competitive Advantage Using Pablo Ratings
We will define competitive advantage as the difference in rankings, in this case Pablo Rankings, courtesy of the Rich Kern site. Because divisions are ranked separately, we will limit this section to D1 only. Home team advantage will be defined as:
HomeAdv = -(Home Rank - Visiting Rank)
The sign inversion creates a positive value when the home team is ranked higher (lower rank number).
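A short worked example of this definition, with made-up rank numbers (the function name `home_adv` is only illustrative):

```python
def home_adv(home_rank, visiting_rank):
    """Positive when the home team holds the better (numerically lower) rank."""
    return -(home_rank - visiting_rank)

print(home_adv(20, 50))    # home team ranked 30 spots better -> 30
print(home_adv(120, 20))   # home team ranked 100 spots worse -> -100
```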
Since the Pablo rankings are based on match results, we expect them to be a good predictor of set win%. The plot below shows the home win% versus the home advantage. The prediction is another logistic regression using just HomeAdv as a linear predictor, and it fits the data relatively well.
One more takeaway is the slope near the middle: about 0.3% win probability per spot of ranking difference. Comparing this to our lead slope from earlier:
1 point of lead ≡ ranking 30 spots ahead.
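The arithmetic behind this equivalence can be checked directly. The two slopes below are the approximate values quoted in the text; the implied coefficient rests on the additional assumption that the slope was measured where the win probability is 0.5, where a logistic curve with coefficient b has slope b/4.

```python
point_slope = 0.10   # ~10% win probability per point of lead (early or mid set)
rank_slope = 0.003   # ~0.3% win probability per spot of ranking difference

# How many ranking spots move the win probability as much as one point of lead
ranks_per_point = point_slope / rank_slope
print(f"one point of lead is worth about {ranks_per_point:.0f} ranking spots")

# Logistic coefficient on HomeAdv implied by the mid-curve slope (slope = b / 4)
implied_coefficient = 4 * rank_slope
print(f"implied logistic coefficient on HomeAdv: {implied_coefficient:.3f}")
```

The ratio comes out near 33, which the text rounds to 30 given how approximate both slopes are.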
We create our 3 variable model and fit to the data:
Prob(Home Win) = LOGIT((Home Score + Point Lead + Home Rank Adv)^3)
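The formula above is shorthand. One plausible reading, and it is an assumption here, is a full third-order polynomial in the three inputs, including cross terms, passed through the logistic function. The sketch below lists the terms such a polynomial would contain; the variable names are illustrative.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, order):
    """Every monomial of the named inputs up to the given order, as strings."""
    terms = ["1"]  # intercept
    for degree in range(1, order + 1):
        for combo in combinations_with_replacement(names, degree):
            terms.append("*".join(combo))
    return terms

terms = polynomial_terms(["score", "lead", "adv"], 3)
print(len(terms))
print(terms)
```

Under this reading the model has 20 coefficients to fit, which is one reason to keep the polynomial order as low as the data allows.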
To avoid trying to understand 3D plots, we will hold one variable constant and see how the model changes with the other 2. Let’s start by plotting the model as we did before, holding the Home Rank Adv constant at three distinct values.
- The middle plot looks almost identical to our 2D model, which seems reasonable as our 2D model effectively averages all the Home Ranks together.
- On the left, the home team is ranked 100 spots lower than the visiting team. Following the top two lines which represent having leads of 2 and 1 points, we notice that they cross Prob Home Win = 0.5 (even chance of winning) at scores of 14 and 18. This matches our intuition that leading a better team doesn’t mean much until later in the set.
- On the right, we have the flipped scenario (home team is ranked 100 spots higher). Looking at the lower two lines, where the home team is behind, early on the home team is still more likely to win, but both lines cross into more likely to lose as the set progresses.
- The Home Rank Adv has the biggest impact at the start and nearly disappears towards the end of the set.
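Crossover scores like the ones read off the plots can also be found by scanning a model over the home score. The sketch below uses a toy stand-in for the fitted model with invented coefficients, so its crossings will not match the values quoted from the article's plots.

```python
import math

def toy_home_win_prob(score, lead, adv):
    """Hypothetical stand-in for the fitted model; the coefficients are invented."""
    z = 0.4 * lead * (1 + score / 25) + 0.01 * adv * (1 - score / 25)
    return 1.0 / (1.0 + math.exp(-z))

def first_score_favoring_home(prob_fn, lead, adv, max_score=24):
    """Lowest home score at which the home team becomes at least even money."""
    # The home score cannot be smaller than a positive lead
    for score in range(max(lead, 0), max_score + 1):
        if prob_fn(score, lead, adv) >= 0.5:
            return score
    return None  # never reaches 0.5 within the scanned range

for lead in (2, 1):
    crossing = first_score_favoring_home(toy_home_win_prob, lead, adv=-100)
    print(f"lead {lead}, ranked 100 spots lower: home favored from score {crossing}")
```

In the toy model the rank term shrinks as the score grows while the lead term grows, which mirrors the qualitative behavior described in the bullets above.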
We can get different views by changing the variable we hold constant. Below, we hold Point Lead, put Home Rank Adv on the x-axis, and use each color for the score at the beginning, middle, and end of the set.
- Most of the curves have the same S shape we saw in the original ranking data. The different colors show how these change from the beginning of the set (blue) to near the end (green).
- Looking at tied scores in the middle plot, we see that the slopes decrease from the beginning to the end of the set. The slope represents how sensitive the probability of winning is to the rank advantage, and we see this sensitivity decrease as the set progresses.
- Only the green lines (near end of set) show significant differences at the 3 leads. This reinforces that while leads have some impact on the probability of winning early and mid set, they become most dominant at the end.
Using statistics and models, we were able to quantify the impact of point and ranking differences and see how these change from beginning to end of a set. By looking at the slopes of the models, we established some equivalencies (under conditions listed):
1 point ≡ 10% probability of win ≡ 30 difference in rank
1. At the start of a set
2. Score is close (within 3 points)
3. Team rankings are within 100 of each other