Predict EPL results (Part 1: Logistic Regression Example in Python)

Logistic Regression

  • It estimates the probability of an outcome (aka the dependent variable) based on a set of one or more inputs (aka independent variables or "features")
  • It can be used for binary classification problems (only 2 possible outcomes) or multiclass classification problems (more than 2 outcomes)
    • For example, it can be used to predict the chance that a particular transaction is fraudulent, given input information such as the customer's details, transaction details etc.
    • If there are more than 2 outcomes, as in an EPL game (i.e. win, draw and lose), logistic regression can still be used, as it can estimate the probability of each outcome.
  • Logistic Regression Function is defined as follows:

      f(x) = 1 / (1 + e^(-x))

    • Note: f(x) represents the probability that the output y = 1 given the input x
    • From the formula above, we can see that it can take in any input x (positive or negative) and produce an output in the range of 0.0 to 1.0.
    • Back to the earlier example, where we are trying to estimate the probability that a transaction is fraud. Let's say that y = 1 means the transaction is fraud. If f(x) = 0.7, it means that there is a 70% chance that the transaction is a fraud (i.e. y = 1) given the input information
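
To make the function concrete, here is a minimal sketch in Python using NumPy. The name `sigmoid` is our own choice for illustration; it simply evaluates the logistic function on a few sample inputs.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, zero maps to 0.5,
# and large positive inputs approach 1.
values = sigmoid(np.array([-5.0, 0.0, 5.0]))
print(values)
```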


An Example in Python

  • Now, let's go through how to build a Logistic Regression model to predict the English Premier League (EPL) results! 


How to build the model?

1) Get past EPL results


2) Re-organize the data 

  • We only downloaded the 17/18 season data, so that's 380 examples (m) in total
  • Since there is not much data, we simply split the examples into a training set of 280 and a test set of 100
  • A total of 16 features:
    • Home team season points
    • Home team season goal difference (GD)
    • Away team season points
    • Away team season goal difference (GD)
    • Home match number
    • Away match number
    • Home team past 5 games points per game
    • Home team past 5 home games points per game
    • Home team past 5 games goal difference (GD)
    • Home team current points per game
    • Home team current goal difference (GD)
    • Away team past 5 games points per game
    • Away team past 5 home games points per game
    • Away team past 5 games goal difference (GD)
    • Away team current points per game
    • Away team current goal difference (GD)
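
As an illustration of how a "past 5 games points per game" feature could be computed, here is a small sketch. The function name `points_per_game_last_n`, the 'W'/'D'/'L' encoding and the sample results are all hypothetical; the actual data layout depends on the file you downloaded.

```python
def points_per_game_last_n(results, n=5):
    """Average points over the most recent n results.

    `results` is a list of 'W', 'D' or 'L', ordered oldest first
    (an assumed encoding, used only for this example).
    """
    points = {"W": 3, "D": 1, "L": 0}
    recent = results[-n:]
    if not recent:
        return 0.0  # no games played yet
    return sum(points[r] for r in recent) / len(recent)

# Made-up form guide; only the last 5 results (W, D, L, W, D) are used.
form = ["W", "W", "D", "L", "W", "D"]
ppg = points_per_game_last_n(form)
print(ppg)
```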

3) Vectorize the data 

  • Arrange the filtered data into matrices 
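
A rough sketch of this step is shown below, assuming the features have already been collected into a table with one row per match. The random numbers are placeholders for the real data, and taking the first 280 matches as the training set is our assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 380, 16

# Placeholder data: one row per match, one column per feature.
data = rng.normal(size=(m, n_features))

# Transpose so that each column is one example: shape (16, 380).
X = data.T

# First 280 examples for training, remaining 100 for testing.
train_set_x = X[:, :280]
test_set_x = X[:, 280:]
print(train_set_x.shape, test_set_x.shape)
```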


4) Check the size of the matrices

  • Train_set_x = (16, 280)
  • Train_set_y = (3, 280)
  • Test_set_x = (16, 100)
  • Test_set_y = (3, 100)
Note: Since there are 3 possible outcomes of a match (i.e. win, draw or lose), the dimension of train_set_y has to be 3 by m, where m is the number of training examples.
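
One way to get a 3 by m label matrix is one-hot encoding, sketched below. The mapping 0 = win, 1 = draw, 2 = lose and the sample labels are assumptions made for this example.

```python
import numpy as np

# Hypothetical outcomes for 4 matches: 0 = win, 1 = draw, 2 = lose.
labels = np.array([0, 2, 1, 0])

# Each column has a single 1 in the row of the actual outcome.
y = np.eye(3)[labels].T

print(y.shape)
print(y)
```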

5) Normalize the row data

  • Mean normalization is a feature scaling method
  • It is used to make gradient descent more effective by making sure that the features of the data (i.e. independent variables) are on a similar scale
  • Tip: use numpy.std 
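
Here is a minimal sketch of the normalization, assuming the (features, examples) layout from step 4. The helper name `normalize` is ours. Reusing the training mean and standard deviation on the test set is a common practice that we assume here; it is not necessarily what the original code did.

```python
import numpy as np

def normalize(train_x, test_x):
    """Scale each feature (row) to zero mean and unit standard deviation."""
    mean = np.mean(train_x, axis=1, keepdims=True)
    std = np.std(train_x, axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (train_x - mean) / std, (test_x - mean) / std
```

The `keepdims=True` argument keeps the mean and standard deviation as column vectors, so they broadcast across all examples.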

6) Implement the Logistic Regression hypothesis model

  • Set the hyperparameters, such as the learning rate (α) and the number of iterations, to appropriate values
    • Recall that hyperparameters are values that are set by us, and they can be changed to improve the performance of the machine learning algorithm
  • Let us know if you want us to upload our code in the blog!
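
In the meantime, here is a rough sketch of what such a model could look like. It is not our original code. It assumes one sigmoid output per outcome (a one-vs-all setup) trained with plain batch gradient descent; the function names and default values are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.5, num_iterations=2000):
    """Fit one logistic unit per outcome with batch gradient descent.

    X has shape (n_features, m) and Y has shape (n_outcomes, m).
    """
    n, m = X.shape
    k = Y.shape[0]
    W = np.zeros((k, n))
    b = np.zeros((k, 1))
    eps = 1e-12  # keeps log() away from zero
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(W @ X + b)  # predicted probabilities, shape (k, m)
        cost = -np.mean(np.sum(Y * np.log(A + eps)
                               + (1 - Y) * np.log(1 - A + eps), axis=0))
        costs.append(cost)
        dZ = A - Y  # gradient of the cost w.r.t. W @ X + b
        W -= alpha * (dZ @ X.T) / m
        b -= alpha * np.mean(dZ, axis=1, keepdims=True)
    return W, b, costs
```

Keeping the list of costs makes it easy to check whether gradient descent is converging for a given α.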

7) Evaluate the results to check if the hypothesis is accurate

  • Well, we are happy that our first model performs slightly better than random guessing (33%)
  • We played around with the model and this is what we found out:
    • When α = 2 and number of iterations = 2000, the train accuracy = 54.6% and the test accuracy = 44%
    • When we changed α to 0.5 and kept the number of iterations at 2000, the train accuracy became 55.4% and the test accuracy remained at 44%
    • When α was increased to 3, the model started to go haywire...kidding...the cost no longer converged to the minimum (i.e. the cost was increasing instead)
    • A high number of iterations like 20,000 decreased both the train and test accuracy
  • Here is what we infer from the results:
    • Underfitting problem: The model has high bias as the accuracy is low (just slightly higher than the chance of guessing, 33%)
    • Overfitting problem: It also has high variance as there is a large difference between the train and test accuracy
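
For reference, train and test accuracy can be computed along these lines. This is a sketch that assumes the predicted outcome is the one with the highest probability; the name `accuracy` and the sample numbers are made up for illustration.

```python
import numpy as np

def accuracy(probabilities, Y):
    """Percentage of examples whose most likely outcome matches the label.

    Both arrays have shape (n_outcomes, m); Y is one-hot encoded.
    """
    predicted = np.argmax(probabilities, axis=0)
    actual = np.argmax(Y, axis=0)
    return np.mean(predicted == actual) * 100

# Made-up probabilities for 4 matches (one column per match).
probs = np.array([[0.7, 0.2, 0.1, 0.6],
                  [0.2, 0.5, 0.3, 0.3],
                  [0.1, 0.3, 0.6, 0.1]])
Y = np.eye(3)[[0, 1, 1, 0]].T  # actual outcomes: win, draw, draw, win
print(accuracy(probs, Y))
```

Running this on both the training set and the test set gives the two numbers whose gap indicates variance.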

8) Think of possible ways to improve the performance 

  • To decrease the bias, we can 
    • Add more features
  • To decrease the variance, we can
    • Get more training examples from the past seasons
    • Do regularization 
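
As a sketch of what regularization could look like, the gradient for the weights can include an L2 penalty term. The choice of L2, the function name `regularized_gradient` and the parameter `lambd` are assumptions for this example; we have not applied this to the model above.

```python
import numpy as np

def regularized_gradient(W, X, A, Y, lambd):
    """Gradient of the logistic cost w.r.t. W with an L2 penalty.

    W: (n_outcomes, n_features), X: (n_features, m),
    A and Y: (n_outcomes, m). lambd = 0 gives the plain gradient.
    """
    m = X.shape[1]
    data_term = (A - Y) @ X.T / m
    penalty_term = (lambd / m) * W  # shrinks large weights towards zero
    return data_term + penalty_term
```

A larger `lambd` pushes the weights towards zero, which tends to reduce variance at the cost of some extra bias.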

Thanks for reading! We appreciate any feedback or questions from you so do feel free to leave us a comment below.
