
SVM Classification

To better predict the results of games, we framed the problem as a more traditional classification task using scikit-learn’s support vector classifier (SVC). We used the same data as in linear regression (including the polynomial features we created), labeling a home win as 1, a draw as 2, and a home loss as 3. We also used the same technique of randomizing the training and validation data to minimize variance, with the same ratios of training, validation, and test data. As before, the iteration limit was left at its default, so the solver keeps iterating until the cost function converges.
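A minimal sketch of this setup is below. The feature matrix and labels here are randomly generated placeholders (our actual match features and polynomial expansion are assumed to already exist), and the split ratios are illustrative rather than our exact ones:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: X stands in for the linear regression features
# (including polynomial features); y encodes home win = 1, draw = 2, loss = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.choice([1, 2, 3], size=1000)

# Randomized train / validation / test split (60 / 20 / 20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# max_iter=-1 is the default: the solver runs until convergence.
clf = SVC(kernel="rbf", max_iter=-1)
clf.fit(X_train, y_train)
val_acc = clf.score(X_val, y_val)
```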


The kernel that yielded the most accurate predictions turned out to be the default, the radial basis function (RBF) kernel. This kernel has two main parameters, C and γ, where C is the regularization term and controls how heavily each data point is weighted. With the default value of C = 1, the SVM classified every match as a home win, resulting in 46% accuracy. To lead the model away from this (at the expense of potential overfitting), we increased the value of C to be much larger. An analysis of the bookies’ predictions showed that they predicted the away team to win about 13% of the time. With each test we checked whether any away wins were being predicted and, if so, at what level of accuracy. We settled on C = 5000, because this was the point at which accuracy seemed to be highest while the model was still willing to classify games as away wins. However, with a C value this high, overfitting was certainly a concern, so we watched the validation accuracy closely. We ultimately got the model to predict an away win about 10% of the time, which was as high as we could go without incurring significant variance.
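The C sweep above can be sketched as follows. The data here is synthetic (drawn with roughly our class proportions), so the printed accuracies and away-win rates are illustrative only; the point is the loop structure: fit at each C, then measure both validation accuracy and the fraction of matches classified as away wins (label 3):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data with roughly the observed class balance
# (about 46% home wins); real match features are assumed elsewhere.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(800, 10))
y_train = rng.choice([1, 2, 3], size=800, p=[0.46, 0.28, 0.26])
X_val = rng.normal(size=(200, 10))
y_val = rng.choice([1, 2, 3], size=200, p=[0.46, 0.28, 0.26])

for C in [1, 100, 1000, 5000]:
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    preds = clf.predict(X_val)
    away_rate = np.mean(preds == 3)  # fraction classified as away wins (label 3)
    print(f"C={C}: val accuracy={clf.score(X_val, y_val):.3f}, away rate={away_rate:.3f}")
```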


The highest accuracy we were able to achieve on the training and validation sets was 52.6%, very close to the bookies’ average prediction rate of 53%. Given the very high value of C needed to push the model away from predicting all home wins, there was a noticeable variance between training and test performance. The accuracy on the test set came out to 51.5%, just under the bookies’ rate of prediction.


One idea we had to improve our SVM model’s accuracy was to feed the output of the linear regression model into the SVM as new features. We took the predicted home and away goals from the linear regression and appended them to the training and test sets for the SVM model. We hoped these new features would increase classification accuracy, but the accuracy barely changed. Here is an example output of our model with the linear regression output added as features:
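This feature-stacking step can be sketched as below. The feature matrices and goal counts are randomly generated placeholders standing in for our real data; the relevant part is fitting the regression on goals and appending its predictions as two extra columns before training the SVM:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Placeholder data: features plus actual home/away goals for training matches.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(800, 10))
X_test = rng.normal(size=(200, 10))
goals_train = rng.poisson(1.4, size=(800, 2)).astype(float)  # [home, away] goals
y_train = rng.choice([1, 2, 3], size=800)

# Predict home and away goals with linear regression, then append those
# two predicted columns to the feature sets used by the SVM.
reg = LinearRegression().fit(X_train, goals_train)
X_train_aug = np.hstack([X_train, reg.predict(X_train)])
X_test_aug = np.hstack([X_test, reg.predict(X_test)])

clf = SVC(kernel="rbf", C=5000).fit(X_train_aug, y_train)
preds = clf.predict(X_test_aug)
```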

[Image: SVM classification output with the linear regression predictions added as features]

The classification score for the SVM model here is 51.4%. For comparison, here is an example output without the linear regression output added as features:

[Image: SVM classification output without the linear regression predictions as features]

This output shows a classification score of 51.5% for the SVM model, so there is virtually no difference from the model run above. While we thought creating these new features would be a good idea, it did not meaningfully affect our results.
