Predict EPL results (Part 1: Logistic Regression Example in Python)

March 29, 2019

Logistic Regression

It estimates the probability of an outcome (the dependent variable) from one or more inputs (the independent variables, or "features")
It can be used for a binary classification problem (only 2 possible outcomes) or a multiclass classification problem (more than 2 outcomes)

For example, it can be used to predict the chance that a particular transaction is fraudulent, given input information such as the customer's details and the transaction details.
If there are more than 2 outcomes, as in an EPL game (i.e. win, draw or lose), logistic regression can still be used, as it can estimate the probability of each outcome.

The logistic function is defined as follows:

f(x) = 1 / (1 + e^(-x))

Note: f(x) represents the probability that the output y = 1 given the input x.
The formula tells us that it can take in any input x (positive or negative) and produce an output between 0.0 and 1.0.
Back to the earlier example, where we are trying to estimate the probability that a transaction is fraud. Let's say that y = 1 means the transaction is fraud. If f(x) = 0.7, there is a 70% chance that the transaction is a fraud (i.e. y = 1) given the input information.
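To make the 0.0 to 1.0 range concrete, here is a minimal sketch of the logistic (sigmoid) function in NumPy. The function name `sigmoid` and the sample inputs are our own choices for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, zero maps to 0.5,
# and large positive inputs approach 1.
values = sigmoid(np.array([-5.0, 0.0, 5.0]))
print(values)
```

Whatever the input, the output can be read as a probability, which is what lets us say "70% chance of fraud" when f(x) = 0.7.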

An Example in Python

Now, let's go through how to build a Logistic Regression model to predict the English Premier League (EPL) results!

How to build the model?

1) Get past EPL results

Download the data from https://datahub.io/sports-data/english-premier-league (this is what we did!)
Or you can crawl / scrape the data from https://www.premierleague.com/tables

2) Re-organize the data

We only downloaded the 17/18 season data, so that's 380 examples (m) in total
Since that is not much data, we simply split the examples into a training set of 280 and a test set of 100
A total of 16 features:

Home team season points
Home team season goal difference (GD)
Away team season points
Away team season goal difference (GD)
Home match number
Away match number
Home team past 5 games points per game
Home team past 5 home games points per game
Home team past 5 games goal difference (GD)
Home team current points per game
Home team current goal difference (GD)
Away team past 5 games points per game
Away team past 5 away games points per game
Away team past 5 games goal difference (GD)
Away team current points per game
Away team current goal difference (GD)

3) Vectorize the data

Arrange the filtered data into matrices

4) Check the size of the matrices

Train_set_x = (16, 280)
Train_set_y = (3, 280)
Test_set_x = (16, 100)
Test_set_y = (3, 100)

Note: Since there are 3 possible outcomes of a match (i.e. win, draw or lose), train_set_y has to be 3 by m, where m is the number of training examples.
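One way to get a 3 by m label matrix is one-hot encoding: each column has a single 1 in the row of the actual outcome. The sketch below assumes the results have already been coded as integers (0 = win, 1 = draw, 2 = lose); that coding and the helper name `one_hot` are hypothetical, chosen only for illustration.

```python
import numpy as np

def one_hot(labels, num_classes=3):
    """Turn integer class labels of shape (m,) into a (num_classes, m) matrix."""
    m = labels.shape[0]
    encoded = np.zeros((num_classes, m))
    # Put a 1 in row labels[j] of column j
    encoded[labels, np.arange(m)] = 1.0
    return encoded

# Hypothetical coding: 0 = win, 1 = draw, 2 = lose
labels = np.array([0, 2, 1, 0])
train_y = one_hot(labels)
print(train_y.shape)  # (3, 4)
```

Printing the shape right after building each matrix is a cheap way to catch a transposed or mislabeled array before training.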

5) Normalize the row data

Mean normalization is a feature scaling method
It makes gradient descent more effective by putting the features of the data (i.e. independent variables) on a similar scale
Tip: use numpy.mean and numpy.std
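Here is a rough sketch of the scaling step, assuming the features are stored as rows, as in the (16, 280) matrix above. Subtracting the mean and dividing by the standard deviation is, strictly speaking, usually called standardization. The helper names and the synthetic data are ours, and computing the statistics on the training set only and then reusing them for the test set is a common convention, not necessarily what was done for the results in this post.

```python
import numpy as np

def fit_scaler(train_x):
    """Per-feature mean and standard deviation; each row is one feature."""
    mean = np.mean(train_x, axis=1, keepdims=True)
    std = np.std(train_x, axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return mean, std

def apply_scaler(x, mean, std):
    """Center and scale x using statistics computed elsewhere."""
    return (x - mean) / std

# Synthetic stand-in for the real feature matrices
rng = np.random.default_rng(0)
train_x = rng.normal(loc=30.0, scale=12.0, size=(16, 280))
test_x = rng.normal(loc=30.0, scale=12.0, size=(16, 100))

mean, std = fit_scaler(train_x)
train_norm = apply_scaler(train_x, mean, std)
test_norm = apply_scaler(test_x, mean, std)
```

After this step every training feature has mean 0 and standard deviation 1, so no single feature (season points, say) dominates the gradient.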

6) Implement the Logistic Regression hypothesis model

Set the hyperparameters, such as the learning rate α and the number of iterations, to appropriate values

Recall that hyperparameters are values that we set ourselves, and they can be changed to improve the performance of the machine learning algorithm

Let us know if you want us to upload our code to the blog!
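For readers who want something to start from, below is a minimal sketch of a 3-class model trained with batch gradient descent. It is not the exact code behind the numbers in this post: it uses a softmax output (multinomial logistic regression), whereas three separate sigmoid outputs would also fit the (3, m) label shape. The function names, the synthetic data and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; each column sums to 1."""
    z = z - z.max(axis=0, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def train(x, y, alpha=0.5, num_iterations=2000):
    """Batch gradient descent. x: (n_features, m), y: one-hot (n_classes, m)."""
    n, m = x.shape
    k = y.shape[0]
    w = np.zeros((k, n))
    b = np.zeros((k, 1))
    costs = []
    for _ in range(num_iterations):
        a = softmax(w @ x + b)                            # predicted probabilities
        costs.append(-np.sum(y * np.log(a + 1e-12)) / m)  # cross-entropy cost
        dz = (a - y) / m                                  # gradient w.r.t. the scores
        w -= alpha * (dz @ x.T)
        b -= alpha * dz.sum(axis=1, keepdims=True)
    return w, b, costs

def predict(x, w, b):
    """Index of the most probable class for each column of x."""
    return np.argmax(w @ x + b, axis=0)

# Synthetic, already-scaled data standing in for the EPL features
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 280))
true_w = rng.normal(size=(3, 16))
labels = np.argmax(true_w @ x, axis=0)
y = np.eye(3)[:, labels]  # one-hot labels, shape (3, 280)

w, b, costs = train(x, y, alpha=0.5, num_iterations=200)
train_accuracy = np.mean(predict(x, w, b) == labels)
print(costs[0], costs[-1], train_accuracy)
```

Keeping the list of costs is useful when tuning α: if the cost rises from one iteration to the next, the learning rate is probably too large.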

7) Evaluate the results to check if the hypothesis is accurate

Well... we are happy that our first model performs slightly better than random guessing (33%)
We played around with the model and this is what we found out:

When α = 2 and the number of iterations = 2000, the train accuracy is 54.6% and the test accuracy is 44%
When we changed α to 0.5 and kept the number of iterations at 2000, the train accuracy became 55.4% and the test accuracy remained at 44%
When α was increased to 3, the model started to go haywire... kidding... the cost no longer converged to the minimum (i.e. the cost increased instead)
A high number of iterations, like 20,000, decreases both train and test accuracy
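The accuracies quoted above are simply the fraction of matches where the predicted outcome equals the actual one. A minimal sketch, with made-up toy labels (0 = win, 1 = draw, 2 = lose is a hypothetical coding) rather than real EPL predictions:

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of examples whose predicted class equals the actual class."""
    return float(np.mean(predicted == actual))

# Toy labels for illustration only
actual = np.array([0, 1, 2, 0, 2])
predicted = np.array([0, 2, 2, 0, 1])
print(accuracy(predicted, actual))  # 0.6
```

Running the same function on the training set and on the test set gives the two numbers to compare: a low value on both points to bias, and a large gap between them points to variance.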

Here is what we infer from the results:

Underfitting problem: The model has high bias as the accuracy is low (just slightly higher than the chance of guessing, 33%)
Overfitting problem: It also has high variance as there is a large difference between the train and test accuracy

8) Think of possible ways to improve the performance

To decrease the bias, we can

Add more features

To decrease the variance, we can

Get more training examples from the past seasons
Do regularization
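As a sketch of what "do regularization" could mean in practice, here is L2 regularization, one common choice (we have not tried it on this model yet). It adds a penalty on large weights to the cost and a matching term to the weight gradient. The strength `lam` is a new hyperparameter, and the helper names are hypothetical.

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Term added to the cost: (lam / 2m) * sum of squared weights."""
    return lam / (2.0 * m) * np.sum(w ** 2)

def l2_gradient(w, lam, m):
    """Term added to the weight gradient; the bias is usually left unregularized."""
    return lam / m * w

# Toy weights for illustration
w = np.array([[1.0, -2.0], [0.5, 0.0]])
penalty = l2_penalty(w, lam=1.0, m=280)
extra_step = 0.5 * l2_gradient(w, lam=1.0, m=280)  # extra shrinkage with alpha = 0.5
print(penalty)
```

In a gradient descent loop, the gradient term is added to the weight gradient before the update, which shrinks every weight slightly toward zero on each iteration.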

Thanks for reading! We appreciate any feedback or questions from you so do feel free to leave us a comment below.
