Unfortunately, just knowing the average point score % for two opponents doesn’t really predict who will win the set. I honestly thought it would…or that it would at least fare better than it did. Here’s what I did:
1. Find the point score % average for every team in the dataset
2. Set up all the matchups between teams, per set, and tag which team won the set. I merged that with our previous dataframe and added the PointScore differential between the two teams. Again, this is the difference in average PS% between team 1 and team 2 based on their overall data.
3. Because we’re looking at a binary (only 2 outcomes, winning or losing) classification problem, we’re going to use logistic regression.
3a. Specifically, we’re going to use the historical difference in PS% between the two teams to try and predict who will win the set
3b. More specifically, we’re going to split our dataset in two. We’re going to use 80% of the data to build a model – this is called the training set. The other 20% is what we call the testing set. This 20% portion of the data will allow us to see how well our model works at predicting unknown outcomes – the future, for example.
3c. Last thing, I swear. Logistic regression is really about probabilities: given an input X, what is the % likelihood that Y will occur? In our case, given the PS% difference, what is the % likelihood the team will win the set? To that end, we say the model predicts a win if the probability it spits out is >50%. Otherwise, we say the model predicts a loss. This is seen in the code:
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
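For anyone who wants to follow along, here is a minimal sketch of steps 3a to 3c in R. It runs on simulated data, not the real dataset, and the names `sets`, `ps_diff`, `won`, `train`, `test`, and the slope of 20 used to generate outcomes are all assumptions made up for illustration. Only `glm.fit`, `glm.probs`, and `glm.pred` mirror names that appear in this post.

```r
set.seed(42)  # make the simulated data and the split reproducible

# Simulated stand-in for the real matchups: one row per set.
# ps_diff = team 1's average PS% minus team 2's; won = 1 if team 1 won the set.
n <- 2000
sets <- data.frame(ps_diff = rnorm(n, mean = 0, sd = 0.05))
sets$won <- rbinom(n, size = 1, prob = plogis(20 * sets$ps_diff))

# 80/20 split: sample row indices for training, keep the rest for testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sets[train_idx, ]
test  <- sets[-train_idx, ]

# Fit the logistic regression on the training set only.
glm.fit <- glm(won ~ ps_diff, data = train, family = binomial)

# Predicted probability of winning for each held-out set,
# then call it a win when that probability is above 50%.
glm.probs <- predict(glm.fit, newdata = test, type = "response")
glm.pred  <- ifelse(glm.probs > 0.5, 1, 0)

# Confusion matrix and accuracy on the held-out 20%.
conf_mat <- table(predicted = glm.pred, actual = test$won)
accuracy <- mean(glm.pred == test$won)
print(conf_mat)
print(accuracy)
```

The important design choice is that the model never sees the test rows while fitting, so the accuracy at the end is an honest estimate of how it does on sets it has not seen.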
These two charts are the same. The distribution of PS% difference is on the x-axis, the probability of winning the set is on the y-axis. The curve shows that for a given value of x (PS% diff), we expect a probability of y (prob. of winning the set)
4. So, what do these graphs even mean?! First of all, they both represent the same thing. The blue/red curves represent the logistic model we get from the data. Translation: the curve is the probability of winning the set, given the specific PS% difference – and we see that as the PS% difference goes from negative to positive, the team has a higher probability of winning the set. This makes logical sense, right?
5. Reading the charts: if a team’s PS% has historically been 45% and their opponent has historically averaged 40% (a difference of 5 percentage points, or 0.05 on our charts), we find +0.05 on the x-axis, move up until we hit the curve, then look left at the y-axis to find the probability of winning the set. In this case it looks like it’s just below 75%.
6. Takeaway from the logistic regression curve: if your historical PS% is greater than your opponent’s (diff > 0) then you are expected to win the set (probability > 50%).
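Reading a point off the curve is the same as plugging a PS% difference into the logistic function. Here is a rough sketch of that. The coefficients are hypothetical, not the fitted values: an intercept of 0 matches the curve crossing 50% at a difference of 0, and the slope of 21.5 is simply picked so that +0.05 lands just below 75%, like the chart reading.

```r
# Hypothetical coefficients, NOT the fitted values from the model.
b0 <- 0      # intercept: curve crosses 50% at a difference of 0
b1 <- 21.5   # slope on PS% difference (assumed to match the chart reading)

# Logistic curve: probability of winning the set for a given PS% difference.
win_prob <- function(ps_diff) plogis(b0 + b1 * ps_diff)

p_even   <- win_prob(0)      # evenly matched teams
p_plus5  <- win_prob(0.05)   # 45% vs 40% historical PS%
p_minus5 <- win_prob(-0.05)  # the same matchup seen from the other side

print(c(p_even, p_plus5, p_minus5))
```

With a fitted model you would skip the hand-picked numbers and ask R directly, for example `predict(glm.fit, newdata = data.frame(ps_diff = 0.05), type = "response")`, assuming the predictor column is named `ps_diff`.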
7. Does this actually help us predict the winner? If we knew the historical PS% of each team before the match started, how well could we actually predict the future? Not so much.
8. Few things here ^
8a – That first table with 0s and 1s at the top is a confusion matrix; it’s our way to see if our predictions match up with reality. In this case, we get 1186 (605 + 581) predictions correct out of a total of 1762, which gives us an accuracy of 67.3%. That means 67.3% of the time, if a team lost, our model predicted a loss (0, 0 in table), or if a team won, our model predicted a win (1, 1 in table). So it’s better than picking winners at random, but it also means that about a third of the time, we guessed wrong. Uh oh. Still worse than Paul the Octopus in the World Cup.
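The accuracy arithmetic is easy to check by hand. This small sketch uses only the counts quoted above; the variable names are just for illustration.

```r
diagonal <- c(605, 581)  # the two cells where prediction and outcome agree
total    <- 1762         # all predictions in the table

correct  <- sum(diagonal)   # correct calls
accuracy <- correct / total # share of correct calls
misses   <- total - correct # wrong calls

print(round(100 * accuracy, 1))        # accuracy as a percentage
print(round(100 * misses / total, 1))  # share of wrong calls as a percentage
```

On a real confusion matrix `m` built with `table()`, the same number comes from `sum(diag(m)) / sum(m)`.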
8b – The summary(glm.fit) call is just a description of the logistic model itself. We get some useful information and the coefficient is certainly significant, but again, would we trust this model to accurately predict?
8c – The pR2 function gives us a “pseudo-R2” that helps us account for how much variance we can explain using this model. Specifically, we use the McFadden version to get something comparable to the R2 value we’d get from a typical linear regression model.
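To make the McFadden number less of a black box, here is a sketch that computes it by hand in base R instead of calling pR2. The data is simulated, and `ps_diff`, `won`, `fit`, and `null` are made-up names. McFadden's pseudo-R2 compares the log-likelihood of the model against an intercept-only model:

```r
set.seed(1)  # reproducible simulated data

# Made-up sets: PS% difference in, win/loss out.
n <- 1000
ps_diff <- rnorm(n, mean = 0, sd = 0.05)
won <- rbinom(n, size = 1, prob = plogis(20 * ps_diff))

# Full model uses PS% difference; null model is intercept-only.
fit  <- glm(won ~ ps_diff, family = binomial)
null <- glm(won ~ 1, family = binomial)

# McFadden's pseudo-R2: 1 - logLik(full) / logLik(null).
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# For 0/1 outcomes the same number falls out of the deviances glm already stores.
mcfadden_dev <- 1 - fit$deviance / fit$null.deviance

print(mcfadden)
print(mcfadden_dev)
```

A value near 0 means the PS% difference barely improves on always guessing the overall win rate, and values closer to 1 mean it explains the outcomes much better.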
9. So this model doesn’t accurately predict the outcome of the set – only accounting for 16% of variance. I’m not sure I’d be willing to wager money on this match, given this data. So why is the concept so good, but the execution so terrible? Variation.
10. So this is the PS% for each set – for just the 5 teams listed above. While these teams certainly have averages hovering around 45%, there’s also a wide range of PS% values (from 0% up to 84%). Because of this wild variation, it becomes tough to actually predict the outcome of a set using the difference between historical PS% averages. Dammit.
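One way to put a number on that variation is to compare the gap between team averages with the typical set-to-set swing within a team. The sketch below does this on made-up data: the five placeholder teams "A" to "E", the `set_ps` data frame, and the 0.12 standard deviation are all assumptions, not the real teams or values.

```r
set.seed(7)  # reproducible made-up data

# Made-up per-set PS% for five placeholder teams, all averaging about 45%.
teams <- c("A", "B", "C", "D", "E")
set_ps <- data.frame(
  team = rep(teams, each = 40),
  ps   = pmin(pmax(rnorm(200, mean = 0.45, sd = 0.12), 0), 1)  # clamp to [0, 1]
)

# Per-team mean, standard deviation, and range of per-set PS%.
spread <- do.call(rbind, lapply(split(set_ps$ps, set_ps$team), function(x) {
  data.frame(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}))
print(round(spread, 3))

# Gap between the best and worst team average versus the typical set-to-set swing.
gap_between_teams <- diff(range(spread$mean))
typical_set_sd    <- mean(spread$sd)
print(c(gap_between_teams, typical_set_sd))
```

When the set-to-set swing is much bigger than the gap between averages, a small edge in historical PS% says little about any single set.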
11. So now what? Maybe we need something more granular than just PS%? Maybe attack efficiency or passer rating? Would using the difference between two teams’ ratings in those categories be a better predictor of the future? Possibly.