PumpGym only shows the current capacity of the gym; it does not show the quietest or busiest hours of the day.
Create an application that predicts the capacity of the gym at any given hour of the day, to work out the best times to go to (or avoid) the gym.
The solution was made of two scripts:
- Capacity Scraper - stores the capacity of the gym in a dated file.
- Capacity Predictor - predicts the capacity of the gym at a given hour and day of the week using a Machine Learning (ML) model trained on the data stored by the Capacity Scraper.
The Capacity Scraper gets the capacity image from the website, applies some filters to the image and then uses Tesseract to convert the text in the image to a string. The string is then stored in a dated file alongside the current time.
From left to right: Original image, filtered image, result stored in file.
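As a rough illustration of the "dated file" step, here is a minimal sketch using only the standard library. The function name `store_reading`, the `YYYY-MM-DD.csv` file name and the `HH:MM,capacity` row layout are assumptions made for this example, not necessarily what the scraper uses.

```python
import csv
from datetime import datetime
from pathlib import Path


def store_reading(capacity: str, now: datetime, directory: Path) -> Path:
    """Append one reading to the file named after the reading's date."""
    directory.mkdir(parents=True, exist_ok=True)
    # One file per day, e.g. 2021-03-01.csv (assumed naming scheme).
    path = directory / f"{now:%Y-%m-%d}.csv"
    with path.open("a", newline="") as handle:
        # Each row holds the time of the reading and the text read by Tesseract.
        csv.writer(handle).writerow([now.strftime("%H:%M"), capacity])
    return path
```

Calling `store_reading("45%", datetime.now(), Path("data"))` every few minutes would build up one file per day, which keeps each day's readings easy to inspect or discard.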
The filtering applied to the original image was:
- cropping: to remove any pieces of information not required, like "current occupancy" and "FORTX".
- greyscale/thresholding/noise removal: to make it easier for Tesseract to distinguish between the capacity text and background.
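To show what these filters do to the pixel values, here is a dependency-free sketch that works on a tiny image held as nested lists of `(R, G, B)` tuples. It is only an illustration: a real scraper would use an imaging library, the helper names and the cutoff of 128 are assumptions, and noise removal is left out.

```python
def crop(pixels, left, top, right, bottom):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def to_greyscale(pixels):
    """Convert (R, G, B) tuples to a single 0-255 luminance value."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


def threshold(grey, cutoff=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in grey]


# A 2x3 "image": the first column stands in for information we do not need.
image = [
    [(255, 255, 255), (10, 10, 10), (200, 200, 200)],
    [(0, 0, 0), (250, 250, 250), (90, 90, 90)],
]
filtered = threshold(to_greyscale(crop(image, 1, 0, 3, 2)))
```

After thresholding, every pixel is either text or background, which is the kind of clean contrast Tesseract handles best.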
As for the arguments passed to Tesseract:
capacity = pytesseract.image_to_string(img, config='-c tessedit_char_whitelist=0123456789% --psm 7 --oem 2')
- -c tessedit_char_whitelist=0123456789%: limits the characters which Tesseract looks for in the image to digits and the percentage sign.
- --psm 7: treats the image as a single text line.
- --oem 2: uses the combined Legacy and LSTM engines, as this mode supports the character whitelist.
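Even with the whitelist, the returned string usually carries trailing whitespace and can occasionally be misread, so it helps to validate it before storing it. The helper below is a hypothetical sketch (the name `parse_capacity` and the 0 to 100 range check are assumptions for this example):

```python
import re
from typing import Optional


def parse_capacity(ocr_text: str) -> Optional[int]:
    """Return the percentage as an int, or None if the text is not a valid reading."""
    # strip() removes the newline / form feed Tesseract tends to append.
    match = re.fullmatch(r"(\d{1,3})%?", ocr_text.strip())
    if match is None:
        return None
    value = int(match.group(1))
    # A capacity percentage outside 0..100 is almost certainly a misread.
    return value if 0 <= value <= 100 else None
```

For example, `parse_capacity("45%\n")` gives `45`, while an empty string or a value such as `"450%"` gives `None`.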
The Capacity Predictor is a Regression model which was trained using the data gathered by the Capacity Scraper. The two features of the model are the time and day of the week.
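A minimal sketch of how the two features could be derived from a stored timestamp is shown below. The names `to_features` and `day_index`, and the choice to encode the day as an integer with Monday as 0, are assumptions for illustration; the actual script may encode them differently.

```python
from datetime import datetime
from typing import Tuple

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def to_features(timestamp: datetime) -> Tuple[int, int]:
    """Return (hour of day, day of week), with Monday encoded as 0."""
    return timestamp.hour, timestamp.weekday()


def day_index(name: str) -> int:
    """Convert a day name such as 'monday' to the same 0-6 encoding."""
    return DAYS.index(name.strip().lower())
```

With this encoding, a reading taken at 18:30 on a Monday becomes the feature pair `(18, 0)`.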
The main steps taken to create, pick and use a Regression model:
- Prepare the data: remove any malformed data (such as capacity at 0) and create new features such as the day of the week.
- Select a model: cross-validation was run on three models (LinearRegression, DecisionTreeRegressor, RandomForestRegressor), alongside fine-tuning of the hyperparameters (using RandomizedSearchCV), to find the model with the lowest error. Root Mean Square Error (RMSE) was used to measure how well a model performed, where a lower score is better.
- Test the final model: the DecisionTreeRegressor was chosen and then run against the test set to evaluate it.
- Use model: pass in the day of the week and hour of the day to predict the capacity.
$ python3 predict_gym_capacity.py --predict 18 monday
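Two of the steps above are easy to sketch without any ML library: dropping malformed readings and scoring predictions with RMSE. The code below is an illustration only; the function names, the `(hour, day, capacity)` tuple layout and the rule that anything outside 1 to 100 is malformed are assumptions, and the project itself relies on scikit-learn for the modelling.

```python
import math


def drop_malformed(readings):
    """Keep only (hour, day, capacity) rows with a capacity from 1 to 100."""
    return [(hour, day, cap) for hour, day, cap in readings if 0 < cap <= 100]


def rmse(actual, predicted):
    """Root Mean Square Error: the lower the score, the better the model."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))
```

Because RMSE squares each error before averaging, a model that is occasionally far off scores worse than one that is consistently slightly off, which suits a predictor meant to flag busy hours reliably.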
The steps taken are inspired by "Chapter 2. End-to-End Machine Learning Project" in Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. I would highly recommend having a look through this book, and the second chapter in particular, to get an idea of the main steps involved in an ML project.