Data Driven Sports Betting
Using Machine Learning and Back Testing to Improve and Validate Sports Betting Strategies - Edward Krueger
Can we predict the future? A question everyone asks at one point or another. Some may want to predict the stock market, the next winning lottery number, or the winner of the big game.
We wanted to take on this challenge ourselves by predicting NFL scores. We created a model to do just that and are pleased to share the results.
Results
Spread and Over-Under
2020 Betting Results (source: by authors)
When testing our model on the 2020 season, we ended up with these results for the spread and over-under. The wagered, returned, and profit amounts are assuming $10 bets on every regular season game with odds of -110 (a common odds amount for these bet types). This results in a profit of $9.09 per win.
Moneyline
For moneyline bets, we did not collect historical odds for these games and therefore can’t calculate profit, but here are the results in terms of correct predictions.
Moneyline results (source: by authors)
Although the hit rate is quite high at 83.75%, this is without using any sort of betting strategy, which we discuss later in the article. There would likely have been games where we did not bet the moneyline due to poor odds or other factors that week.
How Sports Betting Works
In sports betting, the 3 most common bets to make are on the moneyline, the spread, and the over-under. Let’s go through each bet type and explain them.
Odds (source: by authors)
Before we get into the betting types, we need to take a moment to explain how odds work, specifically American odds.
Using the image above, we can see the odds in the Win column are +230 for the Steelers and -280 for the Bills. What this means is if we bet $100 on the Steelers, we will win $230. On the flip side, if we are betting on the Bills, we need to bet $280 to win $100. Underdogs and favorites are indicated by the + and -, respectively. For the moneyline (win bet), the favorite will always be at minus odds and require more money wagered to win less. In contrast, the underdog will be at plus odds and require less wagered to win more.
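The odds arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of our model; the function name profit_on_win is our own label for this example.

```python
def profit_on_win(stake: float, american_odds: int) -> float:
    """Profit (excluding the returned stake) if a bet at American odds wins."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds > 0:
        # Plus odds: the number is the profit on a $100 stake.
        return stake * american_odds / 100
    # Minus odds: the number is the stake needed to profit $100.
    return stake * 100 / abs(american_odds)


print(round(profit_on_win(100, 230), 2))   # $100 on the Steelers at +230
print(round(profit_on_win(280, -280), 2))  # $280 on the Bills at -280
print(round(profit_on_win(10, -110), 2))   # a $10 bet at -110
```

The last call reproduces the $9.09 profit per winning $10 bet at -110 used in the results section.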
A common misconception among novice bettors is that when oddsmakers set odds, they try to “predict” the winner. That is not the case; the odds move based on how people are betting and will change up until the game begins.
For example, for a close game, Team A may have opening odds of -120 (favorite), and Team B will be at +110 (underdog); however, many bets come in for Team B, which pushes their odds to -110 and Team A’s odds to +100. The change in odds does not mean that oddsmakers now believe Team B will win the match but instead are trying to entice bettors to bet on Team A. They are in the business of making money and adjust the odds accordingly.
Now let’s get into the 3 common types of sports bets.
Moneyline
The moneyline is the most basic form of sports bet. We are betting on who we think will win. Simple enough!
With the moneyline, the odds can get quite outrageous if one team has a major advantage. Again using the image above, the Buffalo Bills are pretty decent favorites (-280) against the Steelers. If a team is a massive favorite, at -1000 odds or worse, a sportsbook may not even allow us to take that bet as it could be seen as “free money,” which, as economists, we know does not exist.
Spread
The spread is the number of points oddsmakers say a team will win (or lose) by.
Using the Steelers and Bills again, the spread is set at 7: +7 for the Steelers and -7 for the Bills. Again, we can use the + and - to distinguish between the underdog and the favorite. This line is suggesting that the Bills will win by 7 points. If we bet on the Bills spread, we will win if the Bills win by 8 or more points (7 will push). If we bet on the Steelers’ spread, we will win if they win outright or lose by 6 or fewer points (again, 7 will push).
Note here that there is a chance to push with this bet, but it is common to see half points for spreads to prevent pushes. Just another way sportsbooks can keep our money.
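A spread bet can be settled mechanically by adding the line to the score of the team we backed. The sketch below assumes that convention; settle_spread is a name we made up for this example, and the final scores are invented.

```python
def settle_spread(team_points: int, opponent_points: int, line: float) -> str:
    """Settle a spread bet on a team given its line (e.g. -7 for a favorite)."""
    # Add the line to the backed team's score, then compare with the opponent.
    adjusted_margin = team_points + line - opponent_points
    if adjusted_margin > 0:
        return "win"
    if adjusted_margin < 0:
        return "loss"
    return "push"


# Invented final scores, used only to exercise the three outcomes.
print(settle_spread(30, 20, -7))  # favorite wins by 10
print(settle_spread(27, 20, -7))  # favorite wins by exactly 7
print(settle_spread(20, 26, +7))  # underdog loses by 6
```

With a half-point line such as -7.5, the adjusted margin can never be zero, which is why half points remove the push.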
Over-Under
The over-under is another very basic bet. Oddsmakers set a total score, and we bet on whether the actual total will be over or under that number.
The over-under is not a team-specific bet (although they do have over-unders for just a single team as well), but a total bet. The Bills-Steelers game has an over-under set at 50.5 (notice the half-point here to make pushes impossible). If we bet the over, we will win if the total score is 51 or more. If we bet the under, we will win if the total score is 50 or less.
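Settling an over-under only needs the combined score. Here is a small sketch under that assumption; settle_total is a hypothetical helper name, and the scores are invented.

```python
def settle_total(points_a: int, points_b: int, total_line: float, side: str) -> str:
    """Settle an over-under bet; side must be 'over' or 'under'."""
    if side not in ("over", "under"):
        raise ValueError("side must be 'over' or 'under'")
    combined = points_a + points_b
    if combined == total_line:
        # Only possible when the line is a whole number.
        return "push"
    went_over = combined > total_line
    return "win" if went_over == (side == "over") else "loss"


# Invented final scores against the 50.5 line from the example.
print(settle_total(27, 24, 50.5, "over"))   # combined score of 51
print(settle_total(30, 20, 50.5, "under"))  # combined score of 50
```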
Implied Point Totals
A neat thing we can calculate when we have the over-under and spread is the implied point total. The implied point total is the presumed points for each team given the over-under and spread set by oddsmakers.
For the Bills-Steelers game above, we have an over-under of 50.5 points and a spread of 7 points. To calculate the implied points for the underdog, we take the 50.5, subtract the spread (7), and divide the difference in half.
Underdog implied points = (over-under - spread) / 2
(50.5 - 7) / 2 = 43.5 / 2 = 21.75
The above is the implied points for the Steelers (the underdog). To get the implied points for the Bills, we take this same number and add back the spread.
Favorite implied points = underdog implied points + spread
21.75 + 7 = 28.75
After these calculations, we have the implied point totals for the game.
Steelers: 21.75
Bills: 28.75
Now we know there are no fractional points in football, but this is nice to give an idea of what we may expect each team to score. The implied point totals are also useful in fantasy football as we will want to have a player on a team with a high implied score.
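The same calculation as a short Python sketch. It assumes the spread is passed as a positive number of points (7 for this game); implied_points is a name chosen for this example.

```python
from typing import Tuple


def implied_points(over_under: float, spread: float) -> Tuple[float, float]:
    """Return (favorite, underdog) implied points.

    The spread is given as a positive number of points, e.g. 7.
    """
    underdog = (over_under - spread) / 2
    favorite = underdog + spread
    return favorite, underdog


favorite, underdog = implied_points(50.5, 7)
print(favorite, underdog)  # 28.75 21.75
```

A quick sanity check: the two implied totals always add back up to the over-under.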
Becoming a Successful Sports Bettor
There are two things we consider to be required to be successful at sports betting.
1. Being able to accurately predict the outcome of a game before it happens (recall the question about predicting the future from the intro).
2. Having a good betting strategy. A good prediction of a game is not enough. If the odds are not in our favor, we will not make that bet; even if we predict the bet to hit, there is ALWAYS a chance to lose.
We confirm that we meet the above criteria by doing two things:
1. Quality checking our predictions using standard machine learning validation techniques. These include, but are not limited to, MSE (Mean Squared Error) for regression models or accuracy for classification models.
2. Once we create our model(s), we do additional verification by backtesting our strategies. Backtesting allows us to see our profit/loss over a period of time, in this case, over the 2020 NFL Regular Season.
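To make the two checks concrete, here is a minimal sketch of an MSE calculation and a flat-stake spread backtest. It is not our actual pipeline: the Game fields, the function names, the rule of betting whichever side the predicted margin favors, and the four games are all invented for illustration. It also assumes every bet is priced at -110.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Game:
    predicted_margin: float  # model's predicted (home - away) margin
    actual_margin: int       # final (home - away) margin
    home_line: float         # spread for the home team, e.g. -3.5


def mean_squared_error(games: List[Game]) -> float:
    """Average squared gap between predicted and actual margins."""
    return sum((g.predicted_margin - g.actual_margin) ** 2 for g in games) / len(games)


def backtest_spread(games: List[Game], stake: float = 10.0, odds: int = -110) -> float:
    """Total profit from a flat stake on the side the model favors (minus odds assumed)."""
    profit_per_win = stake * 100 / abs(odds)
    profit = 0.0
    for g in games:
        model_edge = g.predicted_margin + g.home_line  # > 0: model likes the home side
        result = g.actual_margin + g.home_line         # > 0: home side covered
        if model_edge == 0 or result == 0:
            continue  # no opinion, or a push with the stake returned
        bet_home = model_edge > 0
        home_covered = result > 0
        profit += profit_per_win if bet_home == home_covered else -stake
    return profit


# Invented games, for illustration only.
games = [
    Game(6.0, 10, -3.5),
    Game(-1.0, 3, -2.5),
    Game(4.0, -7, 3.0),
    Game(-8.0, -14, 6.5),
]
print(mean_squared_error(games))
print(round(backtest_spread(games), 2))
```

In this toy sample the model goes 2-2 against the spread and still loses money, because each loss costs $10 while each win only returns $9.09.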
If you are interested in learning more about backtesting and its applications to stock trading, check out the following article by Luke Posey for an example:
Backtesting Your First Trading Strategy
towardsdatascience.com
Why do other approaches fail?
We are certainly not the first to attempt the task of predicting football games, but we have yet to come across an NFL prediction model that is strong enough to follow with continued success.
A popular computer model from oddsshark is a great example of this:
oddsshark computer picks for last 100 NFL games (source: by authors)
Looking at the model’s last 100 game picks, we can see that if we were to tail them (follow their picks), we would be down quite a large amount of money. Their ATS (against the spread) and O/U (over-under) picks are more often wrong than right, and their to-win (moneyline) picks, while often correct, do not net a large profit. Losing money while being mostly correct on moneyline picks indicates that no strategy is involved, as they are not paying attention to the odds. If they pick a massive favorite, they are not gaining much profit even if they hit. The benefit does not outweigh the potential cost, and therefore, the bet should not be taken.
While we do not know how they built this model, we can make some overall assumptions on why oddsshark’s model, along with others, may be failing.
They fail to model the prediction state appropriately
These modelers may be failing to restrict themselves to information that is available before the game begins, instead basing their model on stats that are only known once the game has been played. As an example, we may know that a team with more complete passes and yards wins the game more often than not, but this information is irrelevant because it is not available to us before the game. We would then have to predict both the game and these more detailed statistics, thus making the problem more difficult.
They focus too much on individual teams
With the way the NFL schedule is constructed each year, some teams rarely play each other from season to season unless they are division rivals. Due to this, it is nearly impossible to generate enough data from these teams playing each other to gather any useful information. The only way would be to use decades of data of these teams playing each other, which again will be useless as they could be drastically different teams from the last time they played. This is why our model is “team agnostic”: it allows us to gather a large amount of information to train our model that can then be applied to any team.
Another reason why focusing on individual teams (or pairings of teams) creates poor models is that the large number of possible team pairings causes the feature matrix to expand and suffer from the “curse of dimensionality.”
They are using the wrong type of model for the problem
Throughout our many years of teaching bootcamps, there has always been at least one project group in each class that attempts to predict a sports game. Like many neophyte modelers, they are attracted to neural networks like crows to shiny objects. Even for intelligent beings like our students, picking up these shiny objects may cause more harm than good. Unlike the crow, they may not choke on this choice, but they will end up with poor predictions from their model.
Neural networks require extremely large datasets consisting of thousands, maybe tens of thousands, of observations. Obtaining a dataset this large is not possible when there are 256 (now 272) regular season games a season, and with teams changing from year to year and rule changes that impact the game, the data becomes less useful with each passing year.
Another common approach to predicting NFL games is to create a classification model (even worse if attempting to use a neural network). This is an okay approach if we want to predict the winner of a game, but for betting the spread or over-under, it is virtually useless. One might expect that, because the spread and the over-under each have only 2 outcomes (3 if able to push, but we will not count those), a decent classification model should be possible, but this is not the case. Large spreads and extreme over-unders occur rarely, which creates an imbalanced dataset and, in turn, a poor model.
Creating a regression model, on the other hand, is useful for all three main betting types. A single regression model can predict the teams’ scores, which allows us to pick the winner, who covers the spread, and the over-under.
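As a rough sketch of how one pair of predicted scores can drive all three bet types: the function name, the dictionary keys, and the predicted scores below are invented for illustration, and the spread is passed as a positive number of points.

```python
def picks_from_scores(favorite_pred: float, underdog_pred: float,
                      spread: float, total_line: float) -> dict:
    """Turn two predicted scores into moneyline, spread, and over-under picks."""
    margin = favorite_pred - underdog_pred
    total = favorite_pred + underdog_pred

    if margin == 0:
        moneyline = "no pick"
    else:
        moneyline = "favorite" if margin > 0 else "underdog"

    if margin == spread:
        against_spread = "no pick"
    else:
        # The favorite covers only if it is predicted to win by more than the spread.
        against_spread = "favorite" if margin > spread else "underdog"

    if total == total_line:
        over_under = "no pick"
    else:
        over_under = "over" if total > total_line else "under"

    return {"moneyline": moneyline, "spread": against_spread, "over_under": over_under}


# Invented predicted scores against the 7-point spread and 50.5 total from earlier.
print(picks_from_scores(27.3, 23.1, spread=7, total_line=50.5))
```

Here the model would take the favorite to win but the underdog to cover, which a single win/loss classifier could not express.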
They fail to follow a betting strategy
As noted above, if we blindly follow the oddsshark computer model without a strategy, we will have small gains and big losses. The model is not accurate enough to be used without a strategy.
It is all too common for journeyman sports bettors to think that predicting the outcome is enough and that they do not require a betting strategy. They will make their bets with too much confidence in their model and end up disappointed at the end of the season.
No matter how great a model is, bettors should have a strategy to prevent risky bets. This is why we accompany our model with a betting strategy along with a discussion on Dylan’s podcast before making any bets.
We follow a simple strategy to avoid any bets with unfavorable odds, no matter how much of a “lock” it may seem.
For example, this season, Kansas City (a top-tier football team) will be playing against the Philadelphia Eagles (a bottom-tier football team) in week 4. More than likely, Kansas City will be heavily favored to win this game and have moneyline odds of -500 or worse. Remember, this means that we will have to wager $500 to win $100. Even though it is highly likely that Kansas City will win this game, anything can happen in the NFL, and there is always a chance that they can lose. The risk is not worth the reward in this scenario.
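The risk/reward trade-off can be put in numbers with the break-even win probability implied by the odds. This is a sketch of the standard arithmetic rather than our full strategy; the function names are ours, and the 80% win probability is an assumption picked purely for illustration.

```python
def break_even_probability(american_odds: int) -> float:
    """Win probability at which a bet at these odds has zero expected profit."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    return 100 / (american_odds + 100)


def expected_profit(stake: float, american_odds: int, win_probability: float) -> float:
    """Expected profit of one bet given an assumed win probability."""
    if american_odds > 0:
        profit_if_win = stake * american_odds / 100
    else:
        profit_if_win = stake * 100 / abs(american_odds)
    return win_probability * profit_if_win - (1 - win_probability) * stake


print(round(break_even_probability(-500), 4))     # share of bets we must win at -500
print(round(expected_profit(500, -500, 0.80), 2)) # assumed 80% chance to win
```

At -500 we need to win more than five bets out of six just to break even, so even a favorite we believe wins 80% of the time is a losing bet at that price.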
They don’t incorporate football or betting knowledge into their models
Edward does not follow football, nor is he involved in any sports betting. Without Dylan sharing this knowledge, the venture may not have turned out as successful. When creating the model, Edward used Dylan as a subject matter expert to explain the importance of different features to ensure they made sense before incorporating them into the model. We have seen many models that attempted to predict the outcome of NFL games with great test metrics but did not make sense from a logical standpoint.
An experienced data scientist will be able to create a model, but if they are not knowledgeable about the game, they may end up focusing on and keeping features that are not as important as they think.
They think they are smarter than Vegas
“Oh, come now — don’t play the fool. Vegas has fools enough, a superfluity of them. They’re what makes it so profitable.
They come to Vegas chasing penny-ante dreams of high-living, to feel like they’re big shots, like they’re winners.”
— Mr. House, Fallout: New Vegas
The cardinal sin of sports betting (or any gambling) is the assumption that you are smarter than Vegas. Sportsbooks and Casinos are first and foremost businesses trying to make a profit, and they are very good at it.
Novice sports bettors may see their favorite team labeled the underdog, think the oddsmakers made a mistake, and see an “easy path” to a big payout. So they will make a larger wager than they usually do because they are confident it will hit. After all, “the oddsmakers are wrong.” Then, after the game, their unbeatable team has been blown out of the water, and they are left in disbelief with empty pockets.
Remember, folks, Vegas is never wrong, even if you are right.
How are we going to be different?
We construct our models using available data
We build our models using only information available before game time, such as historical stats for the teams playing. We will not be using endogenous variables to predict other endogenous variables.
Our models will be team agnostic
As previously stated, our model will not focus on individual teams; we do not even include team names in our data sets. Instead, our model learns how a team is likely to perform against another by using opposing stats (offense vs. defense).
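To illustrate what a team-agnostic row might look like, here is a hypothetical sketch. The stat names and numbers are invented and are not our actual feature set; the point is only that an offense is paired with the defense it faces and no team identity is stored.

```python
def matchup_row(offense: dict, opposing_defense: dict) -> dict:
    """Pair one team's offensive averages with its opponent's defensive averages."""
    return {
        "off_points_per_game": offense["points_per_game"],
        "off_yards_per_game": offense["yards_per_game"],
        "def_points_allowed_per_game": opposing_defense["points_allowed_per_game"],
        "def_yards_allowed_per_game": opposing_defense["yards_allowed_per_game"],
    }


# Made-up season averages entering a game; no team names appear anywhere.
offense = {"points_per_game": 27.5, "yards_per_game": 390.0}
opposing_defense = {"points_allowed_per_game": 21.0, "yards_allowed_per_game": 340.0}
row = matchup_row(offense, opposing_defense)
print(row)
```

Because every game yields two such rows (one per offense), rows from any pair of teams can be pooled into one training set.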
Simple Models
Our models are simple and easy to inspect. We prefer to use linear models where possible and then continue to expand into more complex methods. However, after seeing multiple models attempting to predict sports games, we noticed that the simpler models tend to perform better in our evaluation metrics.
SME Feature Selection
We construct and select our features using a subject matter expert’s knowledge (if that’s what we want to call Dylan). Using an SME ensures that we are not including any features that may improve evaluation metrics but logically do not make sense in the model.
Regression over Classification
Classification models are overused for predicting games and appear to yield unsatisfying results. This poor performance is why we are choosing to use regression models instead. Regression models allow us to predict each team’s score, which we can use to determine the winner, the over-under, and which team to pick against the spread.
Betting Strategy & Discussion
While we believe in our model, we still do not want to follow it without question; this is how you lose money. Each week during the NFL regular season, the model’s predictions will be discussed on the Sleeping with the Numbers podcast. This discussion will allow additional insight into the upcoming games and determine, along with our strategy, if we ultimately decide to tail the model’s bets.
Conclusion
We are extremely excited to put our model to use this NFL season. If you’d like to follow the model’s picks and results you can do so in a few ways:
If you are interested in the discussion around our picks, consider subscribing to Sleeping with the Numbers on YouTube and Apple Podcasts or following us on Twitter and Instagram.
For more content on data science, machine learning, and development videos check out Edward’s YouTube Channel and subscribe to his newsletter!