for the RMD version: https://rpubs.com/chadgordon09/step2
Unfortunately, just knowing the average point score % for two opponents doesn’t really predict who will win the set. I honestly thought it would, or at least that it would fare better than it did. Here’s what I did:
1. Find the point score % average for every team in the dataset
2. Set up all the matchups between teams, per set, and tag which team won the set. I merged this with our previous dataframe and added the PointScore differential between the two teams. Again, this is the difference in average PS% between team 1 and team 2, based on their overall data.
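Steps 1 and 2 can be sketched roughly like this in base R. The toy data, the column names (match_id, set_num, team, ps_pct, won) and the teams A, B and C are all made up for illustration; the real dataframe may well be laid out differently.

```r
# Hypothetical per-set data: one row per team per set (all names are assumptions).
sets <- data.frame(
  match_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  set_num  = c(1, 1, 2, 2, 1, 1, 2, 2),
  team     = c("A", "B", "A", "B", "A", "C", "A", "C"),
  ps_pct   = c(0.48, 0.41, 0.39, 0.46, 0.50, 0.44, 0.47, 0.42),
  won      = c(1, 0, 0, 1, 1, 0, 1, 0)
)

# Step 1: average point score % per team across all of its sets.
team_avg <- aggregate(ps_pct ~ team, data = sets, FUN = mean)
names(team_avg)[2] <- "avg_ps"

# Step 2: pair the two teams in each set (self-join, then drop self-pairs).
# Each set appears twice, once from each team's point of view.
matchups <- merge(sets, sets, by = c("match_id", "set_num"),
                  suffixes = c("_1", "_2"))
matchups <- matchups[matchups$team_1 != matchups$team_2, ]

# Attach each side's overall average, then take the difference.
matchups <- merge(matchups, team_avg, by.x = "team_1", by.y = "team")
matchups <- merge(matchups, team_avg, by.x = "team_2", by.y = "team",
                  suffixes = c("_1", "_2"))
matchups$ps_diff <- matchups$avg_ps_1 - matchups$avg_ps_2

print(matchups[, c("team_1", "team_2", "won_1", "ps_diff")])
```

Here won_1 is the outcome for team 1 and ps_diff is the predictor that goes into the model in step 3.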
3. Because we’re looking at a binary (only 2 outcomes, winning or losing) classification problem, we’re going to use logistic regression.
3a. Specifically, we’re going to use the historical difference in PS% between the two teams to try and predict who will win the set
3b. More specifically, we’re going to split our dataset in two. We’re going to use 80% of the data to build a model – this is called the training set. The other 20% is what we call the testing set. This 20% portion of the data will allow us to see how well our model works at predicting unknown outcomes – the future, for example.
3c. Last thing, I swear. Logistic regression is really about probabilities: given an input X, what is the % likelihood that Y will occur? In our case, given the PS% difference, what is the % likelihood the team will win the set? To that end, we say the model predicts a win if the probability it spits out is >50%. Otherwise, we say the model predicts a loss. This is seen in the code:
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
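For context, here is a minimal sketch of how that line might sit inside the whole of step 3: the 80/20 split, the fit, and the thresholding. The data is simulated (it is not the real matchup data), and the names matchups, ps_diff and won, as well as the slope of 20 used to generate the fake outcomes, are assumptions.

```r
set.seed(42)  # make the simulated data and the split reproducible

# Simulated stand-in for the matchup data (NOT the real dataset).
n <- 2000
ps_diff <- rnorm(n, mean = 0, sd = 0.04)
won <- rbinom(n, size = 1, prob = plogis(20 * ps_diff))
matchups <- data.frame(ps_diff = ps_diff, won = won)

# 3b: 80% of the rows for training, the remaining 20% for testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- matchups[train_idx, ]
test  <- matchups[-train_idx, ]

# 3a: logistic regression of set outcome on PS% difference.
glm.fit <- glm(won ~ ps_diff, data = train, family = binomial)

# 3c: predicted win probabilities for the held-out sets, cut at 50%.
glm.probs <- predict(glm.fit, newdata = test, type = "response")
glm.pred  <- ifelse(glm.probs > 0.5, 1, 0)

# Compare predictions with what actually happened.
conf_mat <- table(predicted = glm.pred, actual = test$won)
accuracy <- mean(glm.pred == test$won)
print(conf_mat)
print(accuracy)
```

Using type = "response" matters here: without it, predict() returns log-odds rather than probabilities, and the 0.5 cutoff would no longer mean 50%.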
4. So, what does this graph even mean?! The red curve represents the logistic model we get from the data. Translation: the curve is the probability of winning the set, given the specific PS% difference – and we see that as the PS% difference goes from negative to positive, the team has a higher probability of winning the set. This makes logical sense, right?
5. Reading the chart: if a team’s PS% has historically been 45% and their opponent has historically averaged 40% (a difference of 5 percentage points, or 0.05 on our charts), then we find +0.05 on the x-axis, move up until we hit the curve, and look left at the y-axis to find the probability of winning the set. In this case it looks like it’s just below 75%. Essentially: given a PS difference of x, what probability of winning the set (y) do I have?
6. Takeaway from the logistic regression curve: if your historical PS% is greater than your opponent’s (diff > 0), then you are expected to win the set (probability > 50%). But as you can tell, it’s not even close to perfect. The histogram of PS diff when the team wins vs. loses has a ton of overlap, meaning you could have a PS diff of 0.05 and it might only be 60/40 that you win.
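Reading a value off the curve is the same as plugging the PS% difference into the fitted logistic function. A small sketch of that, with the caveat that the intercept and slope below are hypothetical numbers picked only to give a curve with a similar shape; they are not the coefficients from the actual model.

```r
# Hypothetical coefficients, NOT taken from the real summary(glm.fit) output.
b0 <- 0    # assumed intercept
b1 <- 21   # assumed slope on PS% difference

# Logistic curve: probability of winning the set for a given PS% difference.
win_prob <- function(ps_diff) plogis(b0 + b1 * ps_diff)

print(win_prob(0))      # equal historical averages
print(win_prob(0.05))   # 45% vs. 40%
print(win_prob(-0.05))  # the same matchup from the other side
```

With an intercept of zero, the two sides of a matchup always sum to 100%, and a difference of 0 lands exactly on 50%.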
7. Does this actually help us predict the winner? If we knew the historical PS% of each team before the match started, how well could we actually predict the future? Not so much.
8. A few things about the output above:
8a – That first table with 0s and 1s at the top is a confusion matrix. It’s our way to see if our predictions match up with reality. In this case, we get 3285 (1611+1674) predictions correct out of a total of 9729, which gives us an accuracy of 33.8%. In other words, 33.8% of the time the model either predicted a loss and the team actually lost (0, 0 in the table) or predicted a win and the team actually won (1, 1 in the table). So, this isn’t great, because picking winners and losers at random would be right about half the time and do better than this…
8b – The summary(glm.fit) call is just a description of the logistic model itself. We get some useful information and the coefficient is certainly significant, but again, would we trust this model to accurately predict?
8c – The pR2 function gives a “pseudo-R2” that estimates how much of the variance we can explain using this model. Specifically, we use the McFadden value to get something comparable to the R2 value we’d get from a typical linear regression model.
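Both numbers can be reproduced by hand. The accuracy uses the counts quoted in 8a. The McFadden part is a sketch on simulated data, assuming pR2 is the one from the pscl package and that its McFadden value is 1 minus the ratio of the model’s log-likelihood to the intercept-only model’s log-likelihood.

```r
# Accuracy: correct predictions (the diagonal of the confusion matrix) over all predictions.
correct  <- 1611 + 1674
total    <- 9729
accuracy <- correct / total
print(round(100 * accuracy, 1))  # 33.8

# McFadden's pseudo-R2 by hand, on simulated data (NOT the real dataset).
set.seed(1)
x <- rnorm(500, sd = 0.04)                 # stand-in for PS% difference
y <- rbinom(500, 1, plogis(20 * x))        # stand-in for set won / lost

fit  <- glm(y ~ x, family = binomial)      # model with the predictor
null <- glm(y ~ 1, family = binomial)      # intercept-only baseline

mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
print(mcfadden)
```

A value near 0 means the predictor barely improves on always guessing the overall win rate.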
9. So this model doesn’t accurately predict the outcome of the set – only accounting for 12.9% of variance. I’m not sure I’d be willing to wager money on this match, given this data. So why is the concept so good, but the execution so terrible? Variation within the data.
10. So this is the PS% for each set – for just the 5 teams listed above (Stanford, Texas, Illinois, Wisconsin, and Nebraska). While these teams certainly have averages hovering around 45%, there’s also a wide range of PS% values (from 0% up to 84%). Because of this wild variation, it becomes tough to actually predict the outcome of a set using the difference between historical PS% averages. Dammit.
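One way to put numbers on that spread is to summarize per-set PS% by team. The sketch below uses simulated values (60 made-up sets per team, drawn around 45%), so the output says nothing about the real teams; the column names are also assumptions.

```r
set.seed(7)
teams <- c("Stanford", "Texas", "Illinois", "Wisconsin", "Nebraska")

# Simulated per-set PS% values, clamped to [0, 1]. NOT the real data.
per_set <- data.frame(
  team   = rep(teams, each = 60),
  ps_pct = pmin(pmax(rnorm(300, mean = 0.45, sd = 0.10), 0), 1)
)

# Mean, standard deviation, min and max of per-set PS% for each team.
spread <- aggregate(
  ps_pct ~ team, data = per_set,
  FUN = function(v) c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))
)
spread <- do.call(data.frame, spread)  # flatten the matrix column into plain columns
print(spread)
```

If the per-set standard deviation is large compared with the gap between two teams’ averages, the difference in averages can’t separate winners from losers very well.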
11. So now what? Maybe we need something more granular than just PS%? Maybe attack efficiency or passer rating? Would using the difference between two teams’ ratings in those categories be a better predictor of the future? Possibly.