{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n# Fitting the model\n\nIn this section we use odds from previous tournaments to \nfit the model to data. We do this using a method known \nas [logistic regression](https://youtu.be/yIYKR4sgzI8). \nLet's start by loading in the data. \n\nThe odds here are for two World Cups and two Euros. They are given \nin European form. Remember, the UK odds are found by taking away one\nfrom the European odds.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import the libraries we will use.\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\n\n# Load in the data\nwc_load = pd.read_csv(\"../data/WorldCupEuroCupOdds.csv\", delimiter=';')\nwc_load = wc_load[(wc_load['Stage']==1) | (wc_load['Stage']==2)]\nwc = wc_load.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals'})\n\n# We start by looking at the closing odds\nwc_df = wc[['AwayTeam','HomeTeam','HomeGoals','AwayGoals','PSH','PSD','PSA']].assign(goaldiff=wc['HomeGoals']-wc['AwayGoals'])\noddslabel='Closing odds'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making the bookmakers odds fair\n\nBookmakers odds are set up so that they have a margin (an edge). If we calculate\n\n\\begin{align}\\end{align}\n\n t = \\frac{1}{o_\\mbox{home}} + \\frac{1}{o_\\mbox{draw}} + \\frac{1}{o_\\mbox{away}} \n\nwhere \n\n\\begin{align}\\end{align}\n\n o_\\mbox{home}, o_\\mbox{draw} \\mbox{ and } o_\\mbox{away} \n\nare the (European) odds of each outcome \u2013 then we typically find a value \ngreater than one (if it is one then the odds are fair). For more about making odds\nfair (and another method for correcting the odds for the margin) see \n[here](https://www.football-data.co.uk/The_Wisdom_of_the_Crowd_updated.pdf).\n\nTo make the probabilities implied by the odds fair we thus divide each\nof the probabilites by $t$. Now the probabilities of the three outcomes add up\nto one. This is done below, along with a change that allows us to work out\nvalues for the favourite.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def MakeOddsCalculations(wc_df): \n # Calculate win, draw and lose for the outcomes.\n wc_df = wc_df.assign(draw=(wc['HomeGoals']==wc['AwayGoals'])).assign(homewin=(wc['HomeGoals']>wc['AwayGoals']))\n wc_df = wc_df.assign(awaywin=(wc['HomeGoals']wc_df['awayprob'])) | wc_df['awaywin'] & (wc_df['awayprob']>wc_df['homeprob'])))\n wc_df = wc_df.assign(favprob=np.maximum(wc_df['homeprob'],wc_df['awayprob']))\n \n return wc_df\n\n\nwc_df = MakeOddsCalculations(wc_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression\n\nA logistic regression model has the following form \n\n\\begin{align}\\end{align}\n\n P( Y = 1 | z ) = \\frac{1}{1 + \\exp(b_0+b_1 z)}\n\nwhere $z$ is a feature of the data and $Y$ is the event to be predicted. In our case, $Y$\nis the event that the favourite wins. So, $Y=1$ means the favourit wins, $Y=0$\nmeans they don't (draw or underdog wins). \n\nIn the last`section` section, we presented \nThe Betting Equation as\n\n\\begin{align}\\end{align}\n :label: eq:Betting\n\n P( Y = 1 | x ) = \\frac{1}{1 + \\alpha x^\\beta} \n\nwhere $x$ are the fair odds for the favourite in UK form. These two equations\nare slightly different but related. What we need to do now is find a \na way of calculating $z$ from the data we have just loaded in and use it to \nestimate $\\alpha$ and $\\beta$. \n\nThe (fair) UK odds for the favourite is the ratio of the probability that the favourite doesn't win and\nthe probability that the favourite does win. That is\n\n\\begin{align}x = 1/p - 1 = \\frac{1-p}{p}\\end{align}\n\nwhere $p$ is the probability that the favourite wins. We now set \n\n\\begin{align}z = \\ln \\left( \\frac{p}{1-p} \\right) = - \\ln \\left( x \\right)\\end{align}\n\nFrom this, we get\n\n\\begin{align}\\end{align}\n\n P( Y = 1 | z ) &= \\frac{1}{1 + \\exp(b_0+b_1 z)} \\\\\n &= \\frac{1}{1 + \\exp(b_0 - b_1 \\ln ( x ) )} \\\\ \n &= \\frac{1}{1 + \\exp(b_0) \\exp \\left( \\ln(x)^{-b_1} \\right)} \\\\\n &= \\frac{1}{1 + \\exp(b_0) x^{-b_1}} \\\\\n\nThis is the same as the :math:numref:`Betting Equation `, with $\\alpha = \\exp(b_0)$ and $\\beta = - b_1$. \n\nLet's calculate the log odds for our data.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Create a variable which is log odds.\n\nwc_df = wc_df.assign(favlog=np.log(wc_df['favprob']/(1-wc_df['favprob'])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now lets import tools required for a logistic regression model\nand use them to fit the parameters.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"test_model = smf.glm(formula=\"favwin ~ favlog\", data=wc_df, family=sm.families.Binomial()).fit()\nprint(test_model.summary())\nb=test_model.params\nalpha=np.exp(b[0])\nbeta=-b[1]\n \nprint('Estimate of alpha: ', alpha)\nprint('Estimate of beta: ',beta)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have an estimate of our paramters $\\alpha$ and $\\beta$.\n\n## Plotting the predictions\n\nLet's plot the model and compare it to the data. \nNotice that we make the comparison as a difference between reality\nand outcome. \n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def FormatFigure(ax): \n \n ax.spines[\"top\"].set_visible(False) \n ax.spines[\"right\"].set_visible(False) \n ax.set_ylim(-0.15, 0.15) \n ax.set_xlim(0, 1.0) \n ax.set_yticks(np.arange(-0.15, 0.16, 0.05), [str(x) + \"%\" for x in np.arange(-15, 16, 5)]) \n ax.set_xticks(np.arange(0, 1.01, 0.2), [str(x) + \"%\" for x in np.arange(0, 101, 20)]) \n ax.set_xlabel('Win probability implied by odds')\n ax.set_ylabel('Actual win prob - dds win prob') \n ax.legend()\n ax.plot([0,1],[0,0],color='C0')\n ax.set_title(\"Probability favourite wins\") \n\n return ax\n\n \n #Favourite model\n \ndef PlotOddsData(ax,wc_df):\n wins = wc_df['favwin']\n winpreds = wc_df['favprob']\n numbins=15\n bin_means = np.zeros(numbins)\n bin_counts = np.zeros(numbins)\n bins=np.arange(0,1+1/numbins,1/numbins)\n for i,lb in enumerate(bins[:-1]):\n lbp1=bins[i+1]\n dfs=wins[np.logical_and(winpreds>=lb,winpreds0:\n bin_means[i]=np.mean(dfs)\n bin_counts[i]=len(dfs)\n ax.scatter(bins[1:]-1/(2*(numbins-1)),np.array(bin_means)-bins[1:]+1/(2*(numbins-1)), marker='o', s=bin_counts*2, label='%s (data)' % oddslabel)\n\n return ax\n\np=np.arange(0.01,0.99,0.00001)\n \nfig,ax=plt.subplots(1,1)\nax.plot(p,1/(1+alpha*np.power((1-p)/p,beta))-p, label=oddslabel)\nax = PlotOddsData(ax,wc_df)\nax = FormatFigure(ax)\n\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These values caught my attention immediately when I first fit this model. \nThe fact that both parameters, \u03b1 and \u03b2, were \nbigger than 1 indicated that there was a (small) bias in the way the odds\nwere set.\n\nThe plot above shows two ways in which we might have an edge\nFirstly, the right hand side of the plot shows that a \nlong-shot bias exists against strong favourites. \nThese teams are typically under-estimated by the bookmakers\u2019 odds \nand therefore worth backing. The middle and left hand side of the plot\nshows that weaker favourites are over-estimated. This difference between\nodds and outcome is even more pronounced and seems to be where\nmost of the value in the model is to be found.\n\nAlthough these differences between predictions and model were small, \nJan, Marius and I knew that they were big enough for us to make a profit...\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Opening odds\n\nWe found a similar pattern in the opening odds as the closing odds, but\nthe bias was smaller. We do this fitting below.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"wc_df = wc[['AwayTeam','HomeTeam','HomeGoals','AwayGoals','Home Open','Draw Open','Away Open']].assign(goaldiff=wc['HomeGoals']-wc['AwayGoals'])\nwc_df = wc_df.rename(columns={'Home Open': 'PSH', 'Away Open': 'PSA', 'Draw Open': 'PSD'}) \noddslabel='Opening odds'\nwc_df = MakeOddsCalculations(wc_df) \n\nwc_df = wc_df.assign(favlog=np.log(wc_df['favprob']/(1-wc_df['favprob']))) \n\ntest_model = smf.glm(formula=\"favwin ~ favlog\", data=wc_df, family=sm.families.Binomial()).fit()\nprint(test_model.summary())\nb=test_model.params\nalpha=np.exp(b[0])\nbeta=-b[1]\n \nprint('Estimate of alpha: ', alpha)\nprint('Estimate of beta: ',beta)\n\nfig,ax=plt.subplots(1,1)\nax.plot(p,1/(1+alpha*np.power((1-p)/p,beta))-p, label=oddslabel)\nax = PlotOddsData(ax,wc_df)\nax = FormatFigure(ax)\n\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the big tournaments approach the odds reflect reality less!\nThis is seen in values of \u03b1 and \u03b2 being closer to one \nfor the opening odds than for the closing odds (above). \n\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}