NYC Taxi Fare Prediction

Rider Fare Prediction in The Big Apple

Allen Kong
Towards Data Science


The Data

Loading the Data

The data for this project can be found on Kaggle in the New York City Taxi Fare Prediction competition held by Google Cloud. The full training set consists of about 55 million rows of NYC taxi fare data. My machine does not have enough RAM to train a model on all of it, so I decided to use 4 million rows of training data, which is roughly 7% of the entire dataset. However, I was able to load the entire dataset on the GPU, which I will go over later.
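A minimal sketch of that loading step, assuming the Kaggle competition file is saved locally as train.csv, looks like this:

import pandas as pd

# Read only the first 4 million rows (~7% of the 55M-row training file).
train_df = pd.read_csv("train.csv", nrows=4_000_000)
print(train_df.shape)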

Let’s take a quick peek at what the data looks like.
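For instance, with the subset loaded into train_df:

print(train_df.head())
print(train_df.dtypes)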

We can see that there are 8 columns which consist of the unique key, the fare amount and 6 features. Before jumping into the statistics, let’s check the sample for null entries.

Null Values

There appears to be only a negligible number of null entries in the sample, so it is safe to remove them. Now we can review the statistics of the training set and the test set to look for anomalies.
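Counting and dropping the null rows takes two lines:

# Count the null entries per column, then drop the affected rows.
print(train_df.isnull().sum())
train_df = train_df.dropna()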

Outliers, Cleaning & Engineering

After comparing the statistics of the training set and the test set, it appears that the training data contains outliers. Keeping these points would distort the model's fit, so it is best to handle them. The fare amount ranges from -$62 to almost $1,300, which makes no sense. The base fare for an NYC taxi cab is $2.50, so we will require the fare amount to be at least $2.50 and below $500. The passenger count goes up to 208, which is equally implausible, so we will cap it at the maximum capacity of 5 passengers.
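A sketch of those filters, using the competition's column names (fare_amount and passenger_count):

# Keep fares between $2.50 and $500 and cap the passenger count at 5.
train_df = train_df[
    (train_df["fare_amount"] >= 2.50)
    & (train_df["fare_amount"] < 500)
    & (train_df["passenger_count"] <= 5)
]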

Currently, Pandas reads the “pickup_datetime” feature as an object (string) type. To use it in our machine learning model, it needs to be converted to a numeric representation. This can be done by converting the column to a datetime type and then splitting it into multiple attributes using Pandas’ datetime functionality. From the datetime object, we can create useful attributes like Year, Month, Day, Day of Week, and Hour.
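In code, that conversion and the derived attributes might look like this (the new column names are my own):

# Parse the timestamp strings, then pull out numeric date/time attributes.
train_df["pickup_datetime"] = pd.to_datetime(train_df["pickup_datetime"])
train_df["year"] = train_df["pickup_datetime"].dt.year
train_df["month"] = train_df["pickup_datetime"].dt.month
train_df["day"] = train_df["pickup_datetime"].dt.day
train_df["day_of_week"] = train_df["pickup_datetime"].dt.dayofweek
train_df["hour"] = train_df["pickup_datetime"].dt.hour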

New York City's longitude ranges roughly between -75 and -72 degrees, and its latitude between 40 and 42 degrees. A few points in the dataset lie outside these bounds, and they will be removed since they are not within the boundaries of the city.
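The bounding-box filter is a straightforward boolean mask:

# Drop rows whose pick-up or drop-off coordinates fall outside the NYC bounding box.
train_df = train_df[
    train_df["pickup_longitude"].between(-75, -72)
    & train_df["dropoff_longitude"].between(-75, -72)
    & train_df["pickup_latitude"].between(40, 42)
    & train_df["dropoff_latitude"].between(40, 42)
]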

Along with the datetime attributes, there are a few features that can be added using the pick-up and drop-off coordinates. The distance between the two points is a common choice; I decided to use geopy's geodesic and great_circle distance calculations, which use both the pick-up and drop-off points. The city is also full of notable locations that taxi riders tend to travel to, so we will add the distance between each of these landmarks and the drop-off point. The locations I decided to use were JFK airport, LaGuardia airport, Newark airport, Times Square, Central Park, the Statue of Liberty, Grand Central, the Met museum, and the World Trade Center.
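Below is a rough sketch of these distance features. The landmark coordinates are approximate and only a few of the landmarks are shown; the row-wise loop keeps the example simple rather than fast.

from geopy.distance import geodesic, great_circle

def add_distance_features(df):
    # Pair up the coordinates as (latitude, longitude) tuples, the order geopy expects.
    pickups = list(zip(df["pickup_latitude"], df["pickup_longitude"]))
    dropoffs = list(zip(df["dropoff_latitude"], df["dropoff_longitude"]))

    # Trip distance computed two ways.
    df["geodesic_miles"] = [geodesic(p, d).miles for p, d in zip(pickups, dropoffs)]
    df["great_circle_miles"] = [great_circle(p, d).miles for p, d in zip(pickups, dropoffs)]

    # Approximate landmark coordinates; JFK, LaGuardia, and Times Square shown as examples.
    landmarks = {
        "jfk": (40.6413, -73.7781),
        "laguardia": (40.7769, -73.8740),
        "times_square": (40.7580, -73.9855),
    }
    for name, location in landmarks.items():
        df[f"dropoff_to_{name}_miles"] = [great_circle(location, d).miles for d in dropoffs]
    return df

train_df = add_distance_features(train_df)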

Model Training

Now that we have a clean dataset, we are ready to train a model for predicting taxi fares. Before we do that, though, we will split the dataset into a training set (80%) and a test set (20%).
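One way to do that split, using scikit-learn and the engineered feature columns:

from sklearn.model_selection import train_test_split

# Everything except the key, the target, and the raw timestamp becomes a feature.
feature_cols = [c for c in train_df.columns
                if c not in ("key", "fare_amount", "pickup_datetime")]
X_train, X_test, y_train, y_test = train_test_split(
    train_df[feature_cols], train_df["fare_amount"],
    test_size=0.2, random_state=42,
)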

XGBoost

The model I chose for this dataset was XGBoost. A popular alternative is LightGBM, which many others in the Kaggle competition have used. XGBoost uses the DMatrix data structure for optimized and efficient computation, so we will transform the training and test sets into DMatrix objects. The parameters in my model are already preset from hyperparameter tuning with grid search. With these parameters, we will train a booster for up to 700 rounds, stopping early if the loss does not improve at least once in 100 consecutive rounds.
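A sketch of that training setup is below. The tuned parameter values are not listed in this article, so the ones here are placeholders; the 700 rounds and 100-round early stopping follow the text.

import xgboost as xgb

# Placeholder parameters; the actual values came from the grid search.
params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "max_depth": 8,
    "eta": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=700,
    evals=[(dtest, "test")],
    early_stopping_rounds=100,  # stop if the test RMSE does not improve for 100 rounds
)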

Prediction

After boosting the model, we are ready to test it against the real test dataset. The model that was trained and boosted on the CPU ended up achieving a score of 3.0325, which places us in the 80th percentile on the Kaggle leaderboard. To achieve a better score, it may be time to scale up and use more of the data, since we are only using about 7% of it.
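Generating the submission itself might look roughly like this, assuming Kaggle's test.csv has been given the same cleaning and feature engineering as the training data:

# Kaggle's held-out test file.
test_df = pd.read_csv("test.csv")
# ... same datetime and distance feature engineering as above ...

dsubmit = xgb.DMatrix(test_df[feature_cols])
submission = pd.DataFrame({
    "key": test_df["key"],
    "fare_amount": booster.predict(dsubmit),
})
submission.to_csv("submission.csv", index=False)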

GPU Based Training

It’s very easy to run out of memory when training the XGBoost model on the CPU. On top of that, cleaning and training with 7% of the data took more than 8 hours, which makes the CPU very inefficient for large datasets. GPUs, on the other hand, are extremely efficient at parallel computing. A great fit for this case is RAPIDS, a suite of GPU-accelerated data science libraries built on CUDA that can use Dask for parallel computing. Using the same data cleaning process as earlier, I decided to put the XGBoost model to the test using the entire dataset.
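A single-GPU sketch of that run, using cuDF to load the data and XGBoost's GPU histogram tree method (the multi-GPU Dask setup is omitted; the 7,000 boost rounds match the figure mentioned below):

import cudf
import xgboost as xgb

# Load the full training file straight into GPU memory with cuDF instead of pandas.
gdf = cudf.read_csv("train.csv")

# ... same cleaning and feature engineering as before, expressed with cuDF operations ...

dtrain_gpu = xgb.DMatrix(gdf[feature_cols], label=gdf["fare_amount"])
gpu_params = {**params, "tree_method": "gpu_hist"}  # build histograms on the GPU
booster_gpu = xgb.train(gpu_params, dtrain_gpu, num_boost_round=7000)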

Training with 7,000 boost rounds on 44 million rows (80%) of data took only about 17 minutes! That shows the magnitude of the speedup from training on the GPU. It feels too good to be true, but the numbers don't lie. I only used one GPU for this process, and it was easily more than 10 times faster than the CPU. The predictions from the GPU-trained model turned out to be much better than those from the CPU-trained model: it achieved a score of 2.89185, which places us in the 94th percentile on the Kaggle leaderboard.

Conclusion

When we’re given data, it's extremely important to be able to use all of it. Using 7% of the data versus 100% makes a huge difference, and we have the technology to do exactly that. The GPU offers better performance, higher efficiency, and far more computing power than its CPU counterpart. When you plan your next data science project, just remember that the GPU can save you a whole day of pre-processing and training!
