
Linear Regression

Once the data was formatted, we ran the stochastic gradient descent regressor (SGDRegressor) from scikit-learn's linear model package on the training data to predict the number of goals the home and away teams would score in each match. The model's output for each team's goal count was always some real number, which we tried to bring as close as possible to the integer number of goals actually scored in the game.
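A minimal sketch of this setup, assuming scikit-learn's SGDRegressor and synthetic placeholder data (the real feature matrix and targets from our dataset are not shown here):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))                 # placeholder match features
y_home = rng.poisson(1.5, size=500).astype(float)   # placeholder home goal counts
y_away = rng.poisson(1.1, size=500).astype(float)   # placeholder away goal counts

# One regressor per target: home goals and away goals.
home_model = SGDRegressor(random_state=0).fit(X_train, y_home)
away_model = SGDRegressor(random_state=0).fit(X_train, y_away)

# Predictions are real numbers, not integer goal counts.
X_match = rng.normal(size=(1, 8))
print(home_model.predict(X_match)[0], away_model.predict(X_match)[0])
```

Fitting two independent regressors, one per target, keeps each model a plain single-output linear regression.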

 

Checking the accuracy of each model on the validation set, we tweaked many of the parameters available in the model, first to minimize bias and then variance. Among the things we tried were changing the learning rate (α), changing the type of regularization (from the standard L2), and creating polynomial features. The polynomial features we created were products of the different playing-style scores for attack, defense, and build-up play. For example, we multiplied the "Defensive Pressure" score and the "Defensive Aggression" score to make an overall defense score. We also changed the loss function from the default squared loss to the huber loss, which is less sensitive to outliers. Because the results of soccer games depend on chance to some degree, we thought this more forgiving loss function would suit the problem better. For this, we had to tune epsilon, the error threshold beyond which mistakes are penalized only linearly rather than quadratically. The optimal value of epsilon turned out to be around 5.

We did not have the option to alter the number of iterations or epochs, because the regression automatically continued to run until the cost function stopped improving between iterations. For this reason, we do not have a graph of the cost function against the number of iterations: each time the model was fit, the cost was only available at the end. Instead, we checked that the algorithm was converging to the same solution every time by looking at the coefficient matrix for the features and verifying that the coefficients were close to identical after each fit. Because the model ran on a different training set each time, the coefficients were never perfectly identical, but they were very close every time. Below is an example coefficient matrix generated by the model, where each entry is the weight on a given feature:

[Image: example coefficient matrix of feature weights]
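The tuning steps described above can be sketched as follows. This is an illustrative example with synthetic data, not our actual pipeline: the style-score names are assumptions, and we standardize features here to keep the sketch numerically stable.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
def_pressure = rng.uniform(20, 80, size=400)    # illustrative style scores
def_aggression = rng.uniform(20, 80, size=400)
other_features = rng.normal(size=(400, 4))

# Polynomial feature: product of the two defensive style scores,
# used as an overall defense score.
overall_defense = def_pressure * def_aggression
X = np.column_stack([def_pressure, def_aggression, overall_defense, other_features])
X = StandardScaler().fit_transform(X)
y = rng.poisson(1.5, size=400).astype(float)    # placeholder goal counts

# Huber loss penalizes errors larger than epsilon only linearly,
# making the fit less sensitive to fluky scorelines.
model = SGDRegressor(loss="huber", epsilon=5.0, penalty="l2", random_state=0)

# Refit and compare coefficient vectors to check stable convergence.
coefs = [model.fit(X, y).coef_.copy() for _ in range(2)]
print(np.max(np.abs(coefs[0] - coefs[1])))
```

With a fixed random seed and the same data the refits match exactly; on resampled training sets, as in our runs, the coefficients only come out close.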

We used the regression model’s built-in “score” method to measure how accurate our regression was. This score is the coefficient of determination R², defined as R² = 1 − u/v, where u = Σ(y_true − y_pred)² is the residual sum of squares and v = Σ(y_true − ȳ_true)² is the total sum of squares. The best possible score is 1, and scores can be infinitely negative. Originally our score was hovering around −65, but with all of our adjustments we were able to get it to about 0.07 for the home goals and 0.03 for the away goals.
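A quick sketch (on synthetic data) showing that computing R² = 1 − u/v by hand matches the model's built-in score:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

model = SGDRegressor(random_state=0).fit(X, y)
y_pred = model.predict(X)

u = ((y - y_pred) ** 2).sum()      # residual sum of squares
v = ((y - y.mean()) ** 2).sum()    # total sum of squares
r2 = 1 - u / v

print(round(r2, 4), round(model.score(X, y), 4))  # the two agree
```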

 

Once we were satisfied with the regressions for the home and away team goals, we compared the two to see how well linear regression predicted the results of games. If the home team was predicted to score more goals than the away team, we classified the match as a home win; otherwise it was an away win. We ignored the possibility of a tie because the data showed that bookies predict ties only about 0.004% of the time. With this method, our accuracy was about 48.3%. In effect, the linear regression predicted the home team to outscore the away team in nearly every game, with the exception of only the most unbalanced matchups. This is an example of the output of our linear regression on a test set with the huber loss function and epsilon = 5:

 

[Image: example linear regression output on a test set]
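The comparison step can be sketched like this, again on synthetic placeholder data rather than our actual match features:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(400, 6)), rng.normal(size=(100, 6))
home_train = rng.poisson(1.5, size=400).astype(float)
away_train = rng.poisson(1.1, size=400).astype(float)
home_test = rng.poisson(1.5, size=100)
away_test = rng.poisson(1.1, size=100)

home_model = SGDRegressor(loss="huber", epsilon=5.0, random_state=0).fit(X_train, home_train)
away_model = SGDRegressor(loss="huber", epsilon=5.0, random_state=0).fit(X_train, away_train)

# Home win when predicted home goals exceed predicted away goals;
# ties are lumped in with away wins for this sketch, as in the write-up.
pred_home_win = home_model.predict(X_test) > away_model.predict(X_test)
true_home_win = home_test > away_test
accuracy = (pred_home_win == true_home_win).mean()
print(f"accuracy: {accuracy:.3f}")
```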

Thus, linear regression proved relatively ineffective for predicting the outcome of soccer matches. We next tried treating the problem as classification. Click the button below to see our SVM implementation:
