Experimental Setup

Given a dataset of nearly 26,000 European soccer matches and statistics on almost 300 teams, we set out to develop a model that could accurately predict match outcomes. The dataset, “European Soccer Database”, comes from Kaggle. For each match it records the teams playing, the league and date, and the three-way money lines (win, tie, loss) set by various bookies. It also includes attributes describing each team, such as how quickly the team builds up its play, how well it creates crossing chances, and how compactly it defends. One issue with the dataset was the large number of missing values for some features: about 12% of the total dataset was null.

Ultimately, we wanted to predict the outcome of a match given the attributes of the two teams and the odds that the bookies had set on the game. We used two thresholds to judge the accuracy of our predictions. First, the home team won approximately 46% of the games in our dataset, so a model that predicted a home win for every match would automatically get 46% correct. Second, the bookies correctly predict the outcome of a given match approximately 53% of the time. The ideal outcome would be a model that outperforms the bookies’ accuracy rate; however, given that we are working with significantly less information than the bookies use, a model that matches or slightly underperforms the bookies’ prediction rate would also be considered successful.
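As an illustration, the two thresholds could be computed roughly as follows. This is a minimal sketch on a small toy table, not the project's actual code: the column names are hypothetical, and it assumes that the bookies' "prediction" for a match is the outcome with the lowest odds.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the match table; all column names are hypothetical.
matches = pd.DataFrame({
    "home_goals": [2, 1, 0, 3, 1],
    "away_goals": [1, 1, 2, 0, 2],
    "odds_home":  [1.80, 2.60, 3.10, 1.50, 2.20],
    "odds_draw":  [3.50, 3.10, 3.30, 4.00, 3.20],
    "odds_away":  [4.20, 2.90, 2.30, 6.50, 3.40],
})

# Label each match from the home team's point of view.
goal_diff = matches["home_goals"] - matches["away_goals"]
matches["outcome"] = np.select(
    [goal_diff > 0, goal_diff == 0], ["home", "draw"], default="away"
)

# Threshold 1: always predict a home win.
home_baseline = (matches["outcome"] == "home").mean()

# Threshold 2 (assumption): the bookies' pick is the outcome with the lowest odds.
odds_to_outcome = {"odds_home": "home", "odds_draw": "draw", "odds_away": "away"}
bookie_pick = matches[list(odds_to_outcome)].idxmin(axis=1).map(odds_to_outcome)
bookie_accuracy = (bookie_pick == matches["outcome"]).mean()

print(f"home-win baseline: {home_baseline:.2f}")
print(f"bookie accuracy:   {bookie_accuracy:.2f}")
```

On the real data, the same two numbers would correspond to the roughly 46% and 53% figures quoted above, provided the bookies' accuracy was measured in this way.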

Data setup

An important element of this project was formatting the data into a structure conducive to machine learning. This involved merging two distinct datasets, the matches and the teams, into one. Since the team dataset tracks the changing characteristics and playing styles of the teams over the eight-year span of the match data, we had to pair each match with the two teams’ characteristics at the time the match was played. Additionally, several of the team attributes are given as qualitative descriptions, which we translated into numbers so that a model could learn from them. For example, each team’s defensive aggression is classified as “Press”, “Double”, or “Contain”, which we translated into 1, 2, and 3 respectively. We also normalized all of the features in order to perform linear regression. Our data were divided into three groups: training (~65%), validation (~17%), and test (~17%). Each time we trained our model and checked it against the validation set, we first shuffled the training and validation sets together and then split them again, to avoid overfitting to a single training set.
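The preparation steps described above can be sketched as follows. This is a hedged illustration on toy tables, not the project's actual pipeline. All column names and the helper functions `attach_team_attributes` and `split_indices` are hypothetical. The sketch also makes three assumptions: "at the time the match was played" means the most recent team snapshot on or before the match date, normalization means z-scoring, and the test set is held out once while only the training and validation rows are reshuffled.

```python
import numpy as np
import pandas as pd

# Toy team snapshots (attributes change over time) and matches.
teams = pd.DataFrame({
    "team_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2010-02-22", "2012-02-22", "2010-02-22", "2012-02-22"]),
    "buildup_speed": [60, 45, 70, 52],
    "defence_pressure_class": ["Press", "Double", "Contain", "Double"],
})
matches = pd.DataFrame({
    "date": pd.to_datetime(["2011-05-01", "2013-03-10"]),
    "home_team_id": [1, 2],
    "away_team_id": [2, 1],
})

# Translate the qualitative labels into the numeric codes given in the text.
teams["defence_pressure_class"] = teams["defence_pressure_class"].map(
    {"Press": 1, "Double": 2, "Contain": 3}
)

def attach_team_attributes(matches, teams, side):
    """Join each match to the latest snapshot of one side's team (hypothetical helper)."""
    renamed = teams.rename(
        columns={c: f"{side}_{c}" for c in teams.columns if c != "date"}
    )
    return pd.merge_asof(
        matches.sort_values("date"), renamed.sort_values("date"),
        on="date", by=f"{side}_team_id", direction="backward",
    )

merged = attach_team_attributes(matches, teams, "home")
merged = attach_team_attributes(merged, teams, "away")

# Normalize the feature columns (assumption: z-score per column).
features = merged.drop(columns=["date", "home_team_id", "away_team_id"])
normalized = (features - features.mean()) / features.std()

def split_indices(n_rows, seed):
    """Fixed ~17% test set; train/validation reshuffled per seed (hypothetical helper)."""
    order = np.random.default_rng(0).permutation(n_rows)  # test rows chosen once
    n_holdout = round(0.17 * n_rows)
    test_idx, pool = order[:n_holdout], order[n_holdout:]
    pool = np.random.default_rng(seed).permutation(pool)   # reshuffle train + validation
    return pool[n_holdout:], pool[:n_holdout], test_idx

train_idx, val_idx, test_idx = split_indices(100, seed=1)
print(merged)
print(len(train_idx), len(val_idx), len(test_idx))
```

Calling `split_indices` with a different `seed` before each training run gives a new training/validation partition while leaving the test rows untouched.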

Our first model was linear regression.