Notes from Industry
Building datasets for machine learning is a time-consuming process. I find that I tend to spend about 70–80% of my time on a project preparing and cleaning the datasets.
How I process the datasets often follows a similar path. I filter my data down to only the relevant samples and features, and then I clean it. The cleaning process itself often changes depending on the data type, but a simple solution is to drop the dirty rows, provided you still have enough data for an adequate sample size afterwards.
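As a rough illustration of that filter-then-drop step with pandas (the file name and column names here are hypothetical, not from the library):

```python
import pandas as pd

# Load the raw extract (hypothetical file and column names).
raw = pd.read_csv("raw_extract.csv")

# Filter down to the relevant samples and features.
relevant_cols = ["patient_id", "attribute", "datetime", "value"]
filtered = raw.loc[raw["attribute"].isin(["pulse", "weight"]), relevant_cols]

# Drop the dirty rows outright; acceptable when plenty of data remains.
cleaned = filtered.dropna(subset=["datetime", "value"])
```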
Once the data is cleaned, I start manipulating it to the appropriate level. When I'm dealing with time-series data, this manipulation often includes some form of quantisation. Sometimes time-series data is nice and clean, with events happening at regular intervals, but this is rarely the case when dealing with healthcare-related data. People don't get sick or visit the doctor on a schedule, and this random distribution of samples is what I like to call "Jagged Timeseries" data. The jagged nature of this data means that quantisation is the only option. I use the word quantisation to describe the process of aggregating/averaging and imputing values to massage the data into an evenly spread format. This might mean taking the average of all measures during a month and imputing values for the months with no measurements.
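As a sketch of what I mean by quantisation, here is how monthly averaging might look with pandas (the data and column names are illustrative, not the library's API):

```python
import pandas as pd

# A hypothetical jagged time-series: one row per measurement event.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "recorded_at": pd.to_datetime([
        "2008-01-03", "2008-01-20", "2008-04-11", "2008-02-02", "2008-03-15",
    ]),
    "pulse": [80, 84, 78, 70, 72],
})

# Quantise onto a monthly grid: average all measures within each month and
# create explicit rows for the in-between months with no measurements
# (their values stay NaN until the imputation step later on).
quantised = (
    events.set_index("recorded_at")
          .groupby("patient_id")["pulse"]
          .resample("MS")   # month-start frequency
          .mean()
          .reset_index()
)
```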
It is relatively common for an ML model in the healthcare setting to need the data prepared so that the absolute times of events are shifted to be relative to each other. This would mean that the first time step represents the onset of a given disease for every patient, or each patient's first hospital admission. I commonly refer to this action as rolling; the name comes from the NumPy operation of rolling an axis, and to me, it describes the situation well.
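A minimal sketch of that rolling step, assuming one row per patient per period after quantisation (the function and column names are mine, not the library's):

```python
import pandas as pd

def roll_to_relative(df, entity_col="patient_id", time_col="recorded_at"):
    """Re-index each entity's timeline so that time step 0 is its first
    event (e.g. disease onset or first hospital admission). Assumes the
    data has already been quantised onto a regular grid."""
    df = df.sort_values([entity_col, time_col]).copy()
    # Number the time steps within each entity, starting at 0.
    df["time_step"] = df.groupby(entity_col).cumcount()
    return df
```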
One important thing to note is that I often leave the imputation of missing data until the final step, once all the cleaning, quantisation (excluding imputation) and rolling have been done. This means imputation can then happen in a more intuitive way for things like disease progression: the average for the entire population at a given time step encodes a little helpful information, unlike the case before rolling occurs.
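Continuing the earlier sketches, population-mean imputation at each relative time step could be as simple as this (again, variable and column names are illustrative):

```python
# After quantisation and rolling, fill each missing value with the mean of
# the whole population at that relative time step.
quantised_and_rolled["pulse"] = quantised_and_rolled["pulse"].fillna(
    quantised_and_rolled.groupby("time_step")["pulse"].transform("mean")
)
```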

After doing this many times, I realised this would be the perfect opportunity to build a simple library to handle these standard functions for me across many projects. In addition to simplifying the building process, I realised that the library could handle versioning datasets. Doing this in a scriptable and configurable way could be very useful for concepts like AutoML, allowing the parameters of the dataset's construction to be part of the hyperparameter optimisation space!
How can I write a library to do something like this and expect it to be usable against other datasets? It is simple: the EADV format. EADV stands for Entity, Attribute, Datetime, Value. An entity represents a unique instance of something (for instance, a person), which can have multiple attributes (for instance, a pulse rate), and each of those attributes can have multiple datetime stamps and values (their pulse was 80 bpm at 14:21:04 on 01/01/2008). Using this format, you can express even very complex data as a single four-column table. You can convert any time-series dataset to this format, and because of that fact, this library can work its magic!
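Converting a typical wide table to EADV is essentially a melt. A hedged sketch with pandas (the column names are my own, not the library's):

```python
import pandas as pd

# Hypothetical wide-format clinical data: one row per patient per timestamp.
wide = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "recorded_at": pd.to_datetime([
        "2008-01-01 14:21:04", "2008-02-01 09:00:00", "2008-01-15 10:30:00",
    ]),
    "pulse": [80, 82, 75],
    "weight": [70.5, 70.1, 88.0],
})

# Melt into the four EADV columns: entity, attribute, datetime, value.
eadv = wide.melt(
    id_vars=["patient_id", "recorded_at"],
    var_name="attribute",
    value_name="value",
).rename(columns={"patient_id": "entity", "recorded_at": "datetime"})

print(eadv[["entity", "attribute", "datetime", "value"]])
```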
Before we can produce the dataset, some initial cleaning needs to take place so that the code later on can make assumptions about the dataset. This includes renaming all features to their lowercase counterparts and fixing dates if your data has two-digit year columns.
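A sketch of that initial cleaning, assuming the raw data sits in a pandas DataFrame (the function name and the century cutoff of 30 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def initial_clean(df, two_digit_year_cols=()):
    """Lowercase all feature names and expand any two-digit year columns
    (e.g. 08 -> 2008, 85 -> 1985, using an assumed cutoff of 30)."""
    df = df.copy()
    df.columns = [col.lower() for col in df.columns]
    for col in two_digit_year_cols:
        years = df[col].astype(int)
        df[col] = years.where(
            years > 99,  # already four digits, leave alone
            years + np.where(years < 30, 2000, 1900),
        )
    return df
```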
Once we have clean input data, we can produce the actual ML dataset. This is done using the following function, which filters, quantises, rolls and imputes!
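As a rough sketch of what such a function might look like when driven by a config, something along these lines (all names, keys and parameters here are hypothetical, not the library's actual API):

```python
import pandas as pd

def build_dataset(eadv, config):
    """Hypothetical filter -> quantise -> roll -> impute pipeline over an
    EADV table with columns: entity, attribute, datetime, value."""
    # 1. Filter to the attributes the model actually needs.
    df = eadv[eadv["attribute"].isin(config["attributes"])]

    # 2. Quantise: average values per entity/attribute within each period.
    df = (
        df.set_index("datetime")
          .groupby(["entity", "attribute"])["value"]
          .resample(config["frequency"])   # e.g. "MS" for monthly
          .mean()
          .reset_index()
    )

    # 3. Roll: re-index each entity's timeline so step 0 is its first record.
    df = df.sort_values(["entity", "attribute", "datetime"])
    df["time_step"] = df.groupby(["entity", "attribute"]).cumcount()

    # 4. Impute: fill gaps with the population mean at each time step.
    df["value"] = df["value"].fillna(
        df.groupby(["attribute", "time_step"])["value"].transform("mean")
    )
    return df
```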
Here is an example config file written in JSON that describes the actions required for a real dataset I had to build a regressor for at work.
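A config in that spirit might look something like the following; the attribute names and option values below are purely illustrative, not the real config:

```json
{
    "attributes": ["pulse", "weight", "systolic_bp"],
    "frequency": "MS",
    "roll_to": "first_event",
    "imputation": "population_mean"
}
```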