This article introduces ByteHub: an open-source feature store designed to help data scientists build better models of time-series data, ready to be used for tasks like forecasting and anomaly detection. While we tend to focus on the machine-learning techniques and algorithms used for these tasks, the data plumbing often gets overlooked, even though it is crucial both to developing a high-quality model and to deploying and maintaining it as part of a larger project.
We’ll explore three ways in which the data-science workflow can be improved when building time-series models, with examples to demonstrate how each works in ByteHub.
Access to data
Models are only as good as the input data they are trained on. An important task for any data scientist working with time-series models is to identify which data sources to use. The results can be dramatic: the inclusion of a single weather variable, for example, can often produce big improvements in accuracy across modelling problems in retail, energy, transport and other sectors.
Unfortunately, adding a new data source to a model is often not an easy task. Datasets are frequently locked away in different databases or behind complex APIs, making it hard to iterate quickly through different data inputs and find which are best.
ByteHub addresses this problem in two ways. Firstly, it provides a simple way to store time-series data so that it can easily be accessed, filtered, resampled and fed into a model. For example, the following code snippet demonstrates how multiple time-series features can be loaded and resampled to hourly resolution for use in model training, without any additional preparation.
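Here’s a minimal sketch of what this looks like, assuming the feature store has already been populated (the feature names below are placeholders for illustration):

```python
import bytehub as bh

# Connect to the feature store (defaults to a local sqlite database)
fs = bh.FeatureStore()

# Load multiple time-series features in a single call, resampled to
# hourly resolution and ready to feed straight into model training
df = fs.load_dataframe(
    ['tutorial/weather.actual.temperature', 'tutorial/weather.wind-speed'],
    from_date='2020-01-01',
    to_date='2021-01-01',
    freq='1h',
)
```

The result is a single pandas DataFrame, indexed by time, with one column per feature.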
Secondly, these datasets are stored with descriptions and extra metadata, and are searchable from a simple Python interface. Instead of re-inventing the wheel on each new project, data scientists can search for and re-use existing data much more easily, developing better models in less time.
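As a sketch, a search for existing weather-related features might look like this, assuming `list_features` returns the feature metadata as a pandas DataFrame, which we can then filter with ordinary pandas:

```python
# fs is the FeatureStore connection from the previous snippet

# List every feature in the store, along with descriptions and metadata
features = fs.list_features()

# Filter the metadata to find anything weather-related
weather_features = features[
    features['description'].str.contains('weather', case=False, na=False)
]
print(weather_features[['name', 'description']])
```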

Prepare data
After finding the right data, the next step is to prepare and transform it before feeding it into a model. This process, known as feature engineering, lets us apply our own domain expertise to improve model performance.
ByteHub introduces a concept called feature transforms: snippets of Python code that can be applied to raw time-series data. These can be chained together, allowing you to build surprisingly complex feature-engineering pipelines, which are then published for easy re-use between different projects and models.
For example, in this weather-forecasting example we needed to convert wind speed and direction into wind x- and y-components, because otherwise the model we are using would struggle to interpret the wind angle properly. The following Python functions contain the maths required for this, while the decorators add these transforms to the feature store so that they can be reused.
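A sketch of how these transforms might look (the feature names are illustrative, and we assume the transform function receives a DataFrame with one column per input feature, named after that feature):

```python
import numpy as np

# fs is the FeatureStore connection from the earlier snippets

@fs.transform(
    'tutorial/weather.wind-x',
    from_features=['tutorial/weather.wind-speed', 'tutorial/weather.wind-direction'],
)
def wind_x(df):
    # Convert wind direction (degrees) and speed into an x-component
    angle = np.deg2rad(df['tutorial/weather.wind-direction'])
    return df['tutorial/weather.wind-speed'] * np.sin(angle)

@fs.transform(
    'tutorial/weather.wind-y',
    from_features=['tutorial/weather.wind-speed', 'tutorial/weather.wind-direction'],
)
def wind_y(df):
    # ...and the corresponding y-component
    angle = np.deg2rad(df['tutorial/weather.wind-direction'])
    return df['tutorial/weather.wind-speed'] * np.cos(angle)
```

Once registered, `tutorial/weather.wind-x` and `tutorial/weather.wind-y` can be loaded like any other feature, with the transforms applied on the fly.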
Keeping it organised!
In the past, all of this data plumbing could easily result in a bird’s nest of code that is difficult to understand and maintain. By using a feature store it becomes possible to decouple the data preparation from the model-training code. Not only does this make the model easier to understand, it also makes it much easier to deploy: the data loading and preparation can be done with a single line of code. For example, a typical model-training script using ByteHub looks as follows, with simple, clear steps to load the data, then build and train the model.
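A sketch of such a script, with placeholder feature names and a scikit-learn model standing in for whatever you want to train:

```python
import bytehub as bh
from sklearn.ensemble import RandomForestRegressor

fs = bh.FeatureStore()

# Load: fetch the target and all input features in a single call
df = fs.load_dataframe(
    [
        'tutorial/cycle-hires',
        'tutorial/weather.actual.temperature',
        'tutorial/weather.wind-x',
        'tutorial/weather.wind-y',
    ],
    from_date='2019-01-01',
    to_date='2021-01-01',
    freq='1D',
)

# Build: drop incomplete rows, then split the target from the inputs
df = df.dropna()
y = df.pop('tutorial/cycle-hires')
X = df

# Train: fit the model on the prepared data
model = RandomForestRegressor()
model.fit(X, y)
```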
Final thoughts
I hope this post has inspired you to try out ByteHub on your next time-series modelling project. If you’d like to take it for a spin, full documentation and installation instructions are on GitHub.
All of the tutorials and examples can be run from Google Colab notebooks.
If you have any thoughts, comments or questions, please leave a comment below or feel free to get in touch directly.