Predict EPL results (Part 1: Logistic Regression Example in Python)
Logistic Regression
- It estimates the probability of an outcome (aka the dependent variable) based on one or more inputs (aka independent variables or "features")
- It can be used for binary classification problems (only 2 possible outcomes) or multiclass classification problems (more than 2 outcomes)
- For example, it can be used to predict the chance that a particular transaction is a fraud, given input information such as the customer's details, transaction details etc.
- If there are more than 2 outcomes, like an EPL game (i.e. win, draw and lose), logistic regression can still be used, as it can estimate the probability of each outcome.
- The Logistic Regression Function is defined as follows:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Logistic Regression Function
- Note: f(x) represents the probability of an output y given each input x.
- The formula tells us that the function can take in any input x (positive or negative) and produce an output f(x) that always lies between 0 and 1.
- Back to the earlier example, where we are trying to estimate the probability that a transaction is a fraud. Let's say that y = 1 means the transaction is a fraud. If f(x) = 0.7, there is a 70% chance that the transaction is a fraud (i.e. y = 1) given the input information.
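To make the function concrete, here is a minimal NumPy sketch of it. The name `sigmoid` is our own choice for illustration, not something from a particular library.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# An input of 0 sits exactly in the middle of the curve.
print(sigmoid(0.0))

# Negative inputs give values below 0.5, positive inputs give values above 0.5.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))

# The input that corresponds to a 70% probability is log(0.7 / 0.3).
x_70 = np.log(0.7 / 0.3)
print(sigmoid(x_70))
```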
An Example in Python
- Now, let's go through how to build a Logistic Regression model to predict the English Premier League (EPL) results!
How to build the model?
1) Get past EPL results
- Download data from https://datahub.io/sports-data/english-premier-league (we did this!)
- Or you can 'crawl' / 'scrape' the data from https://www.premierleague.com/tables
2) Re-organize the data
- We only downloaded the data for the 17/18 season, so that's 380 examples (m) in total
- Since there is not much data, we simply split the examples into a training set of 280 and a test set of 100
- We used a total of 16 features:
- Home team season points
- Home team season goal difference (GD)
- Away team season points
- Away team season goal difference (GD)
- Home match number
- Away match number
- Home team past 5 games points per game
- Home team past 5 home games points per game
- Home team past 5 games goal difference (GD)
- Home team current points per game
- Home team current goal difference (GD)
- Away team past 5 games points per game
- Away team past 5 home games points per game
- Away team past 5 games goal difference (GD)
- Away team current points per game
- Away team current goal difference (GD)
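As an illustration of how one of these features could be derived, here is a small sketch for "past 5 games points per game". The helper name `points_per_game` and the 'W' / 'D' / 'L' result codes are hypothetical choices for this example. The only fact it relies on is the league's scoring of 3 points for a win, 1 for a draw and 0 for a loss.

```python
POINTS = {"W": 3, "D": 1, "L": 0}  # league points per result

def points_per_game(results, window=5):
    """Average points over a team's most recent `window` matches.

    `results` lists the team's earlier results, oldest first.
    Returns 0.0 if the team has not played yet.
    """
    recent = results[-window:]
    if not recent:
        return 0.0
    return sum(POINTS[r] for r in recent) / len(recent)

# Last five results are W, D, L, W, D -> 8 points from 5 games.
form = ["W", "W", "D", "L", "W", "D"]
print(points_per_game(form))  # 1.6
```

The same idea works for the home-only and goal-difference features: filter the team's earlier matches first, then average over the last five.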
3) Vectorize the data
- Arrange the filtered data into matrices
4) Check the size of the matrices
- Train_set_x = (16, 280)
- Train_set_y = (3, 280)
- Test_set_x = (16, 100)
- Test_set_y = (3, 100)
Note: Since there are 3 possible outcomes of a match (i.e. win, draw or lose), train_set_y has to be 3 by m, where m is the number of training examples.
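Here is a rough sketch of steps 3 and 4 with placeholder data. The random numbers stand in for the real feature table, and taking the first 280 matches for training is an assumption made for this example, since how the split was done is not stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 380, 16

# Placeholder data: one row per match, one column per feature.
features = rng.normal(size=(m, n_features))
results = rng.choice(["win", "draw", "lose"], size=m)

# Map each outcome to a class index, then to a one-hot column.
classes = ["win", "draw", "lose"]
class_index = np.array([classes.index(r) for r in results])

X = features.T                           # (16, 380): one column per match
Y = np.eye(len(classes))[class_index].T  # (3, 380): one-hot outcome per match

# Assumed split: first 280 matches for training, remaining 100 for testing.
train_set_x, test_set_x = X[:, :280], X[:, 280:]
train_set_y, test_set_y = Y[:, :280], Y[:, 280:]

print(train_set_x.shape, train_set_y.shape)
print(test_set_x.shape, test_set_y.shape)
```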
5) Normalize the raw data
- Mean normalization is one of the feature scaling methods
- It is used to increase the effectiveness of gradient descent by making sure that the features of the data (i.e. independent variables) are on a similar scale
- Tip: use numpy.std
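A minimal sketch of this step, assuming the features are stored as rows (16 by m) as in step 4. It subtracts each feature's mean and divides by its standard deviation, using statistics from the training set only so that the test set is scaled the same way. The function name `normalize` and the placeholder data are ours.

```python
import numpy as np

def normalize(train_x, test_x):
    """Scale each feature (row) using the training set's mean and std."""
    mu = train_x.mean(axis=1, keepdims=True)
    sigma = train_x.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant features
    return (train_x - mu) / sigma, (test_x - mu) / sigma

# Placeholder data with a large offset and spread.
rng = np.random.default_rng(0)
train_set_x = rng.normal(loc=50.0, scale=20.0, size=(16, 280))
test_set_x = rng.normal(loc=50.0, scale=20.0, size=(16, 100))

train_norm, test_norm = normalize(train_set_x, test_set_x)
print(train_norm.mean(axis=1).round(6))  # each feature is now centred on 0
print(train_norm.std(axis=1).round(6))   # and has unit standard deviation
```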
6) Implement the Logistic Regression hypothesis model
- Set the hyperparameters, such as the learning rate α and the number of iterations, to appropriate values
- Recall that hyperparameters are values that we set ourselves, and they can be changed to improve the performance of the machine learning algorithm
- Let us know if you want us to upload our code in the blog!
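In the meantime, here is one way such a model could look. This is a sketch and not the code behind the results in this post: it assumes a one-vs-all setup (one logistic unit per outcome) trained with plain gradient descent, and it runs on random placeholder data. The names `train`, `W` and `b` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.5, num_iterations=2000):
    """One-vs-all logistic regression trained with gradient descent.

    X: (n_features, m) inputs, Y: (n_classes, m) one-hot labels.
    """
    n_features, m = X.shape
    n_classes = Y.shape[0]
    W = np.zeros((n_classes, n_features))
    b = np.zeros((n_classes, 1))
    costs = []
    for i in range(num_iterations):
        A = sigmoid(W @ X + b)  # predicted probabilities, (n_classes, m)
        if i % 100 == 0:
            eps = 1e-12  # keeps log() away from zero
            loss = Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)
            costs.append(float(-loss.sum(axis=0).mean()))
        dZ = A - Y  # gradient of the cost with respect to W @ X + b
        W -= alpha * (dZ @ X.T) / m
        b -= alpha * dZ.mean(axis=1, keepdims=True)
    return W, b, costs

# Placeholder data with the same shapes as our training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 280))
labels = np.argmax(rng.normal(size=(3, 16)) @ X, axis=0)
Y = np.eye(3)[labels].T

W, b, costs = train(X, Y)
print(costs[0], costs[-1])  # the cost should go down as training proceeds
```

Recording the cost every 100 iterations makes it easy to spot a learning rate that is too high, because the recorded costs start increasing instead of decreasing.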
7) Evaluate the results to check if the hypothesis is accurate
- Well... we are happy that our first model performs slightly better than guessing (33%)
- We played around with the model and this is what we found out:
- When α = 2 and the number of iterations = 2000, the train accuracy = 54.6% and the test accuracy = 44%
- When we changed α to 0.5 and kept the number of iterations at 2000, the train accuracy became 55.4% and the test accuracy remained at 44%
- When α was increased to 3, the model started to go haywire... kidding... the cost no longer converged to the minimum (i.e. the cost was increasing instead)
- A high number of iterations, like 20,000, decreased both the train and test accuracy
- Here is what we infer from the results:
- Underfitting problem: The model has high bias as the accuracy is low (just slightly higher than the chance of guessing - 33%)
- Overfitting problem: It also has high variance as there is a large difference between the train and test accuracy
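For reference, train and test accuracy figures like the ones in this step could be computed along these lines. This is a sketch that assumes the model outputs a 3 by m matrix of probabilities and that the predicted outcome is the one with the highest probability. The function name `accuracy` and the toy numbers are ours.

```python
import numpy as np

def accuracy(probabilities, Y):
    """Percentage of matches where the most likely outcome is the true one.

    probabilities: (3, m) predicted probabilities, Y: (3, m) one-hot labels.
    """
    predicted = np.argmax(probabilities, axis=0)
    actual = np.argmax(Y, axis=0)
    return float(np.mean(predicted == actual)) * 100

# Toy example with 4 matches (columns) and 3 outcomes (rows).
probs = np.array([[0.6, 0.2, 0.3, 0.5],
                  [0.3, 0.5, 0.3, 0.3],
                  [0.1, 0.3, 0.4, 0.2]])
truth = np.eye(3)[[0, 1, 2, 1]].T  # the last match is predicted wrongly

print(accuracy(probs, truth))  # 75.0
```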
8) Think of possible ways to improve the performance
- To decrease the bias, we can
- Add more features
- To decrease the variance, we can
- Get more training examples from the past seasons
- Do regularization
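As a sketch of what regularization could look like here, the gradient step below adds an L2 penalty on the weights. The choice of L2, the parameter name `lam` and the one-vs-all setup are assumptions for this example. The penalty adds (lam / m) * W to the weight gradient, which pulls the weights towards zero on every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(W, b, X, Y, alpha, lam=0.0):
    """One gradient descent step with an optional L2 penalty on W."""
    m = X.shape[1]
    dZ = sigmoid(W @ X + b) - Y
    dW = (dZ @ X.T) / m + (lam / m) * W  # penalty term shrinks the weights
    db = dZ.mean(axis=1, keepdims=True)  # the bias is not regularized
    return W - alpha * dW, b - alpha * db

# Placeholder data with the same shapes as our training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 280))
Y = np.eye(3)[rng.integers(0, 3, size=280)].T
W0 = rng.normal(size=(3, 16))
b0 = np.zeros((3, 1))

W_plain, _ = gradient_step(W0, b0, X, Y, alpha=0.5)
W_reg, _ = gradient_step(W0, b0, X, Y, alpha=0.5, lam=10.0)

# The regularized step differs only by the extra shrinkage of W0.
print(np.abs(W_plain - W_reg).max())
```

Like α, `lam` is a hyperparameter, so it would need to be tuned by checking the test accuracy.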
Thanks for reading! We appreciate any feedback or questions from you so do feel free to leave us a comment below.
