What format should I store my ECG data in for DL training?

A case study of optimizing ECG data layout to improve deep learning training performance.

Emily Potyraj
Towards Data Science


Historically, we saved data so we could read it. Now, with rapid advances in data science, machines are reading our data. We’re shifting from humans accessing small data to deep learning scripts accessing large volumes of data repeatedly.

It’s the difference between loading a patient’s x-ray on a monitor to visually assess it versus running the x-ray 1,000 times through a neural network that converts the raw image to vectors, crops it, rotates it, blurs it, and then tries to identify a pneumothorax.

That’s why we’ve seen increasing usage of GPUs for deep learning: we need to perform more and more math on our data. The compute elements in a GPU are designed for the massive, data-parallel math operations that are the foundation of modern graphics pipelines. As it turns out, the same kind of math is the foundation of deep learning workflows.

But it’s not as easy as [data → GPU]

While GPUs perform the complex math of deep learning, each training job is a pipeline with steps before that GPU work.

In training, before the data enters the neural network, it is preprocessed through a series of manipulations: indexing (listing) the items in the training dataset, shuffling them, loading them from storage, and then applying distortions. Preprocessing usually occurs on CPUs (frequently, on the CPUs within a GPU server).
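As a rough illustration, those steps often look something like the following in a PyTorch-style input pipeline. This is a sketch only; the in-memory `store`, the record shapes, and the jitter distortion are placeholders rather than our actual dataset or preprocessing.

```python
# Sketch of a typical preprocessing pipeline (PyTorch-style).
# The in-memory "store" stands in for real on-disk records so the
# example is self-contained.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

store = {f"patient_{i}": np.random.randn(15, 5000).astype(np.float32)
         for i in range(256)}                      # placeholder ECG records

class ECGDataset(Dataset):
    def __init__(self, store):
        self.store = store
        self.keys = list(store.keys())             # 1) index (list) the items

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, i):
        record = self.store[self.keys[i]]          # 3) load the record
        x = torch.from_numpy(record)
        return x + 0.01 * torch.randn_like(x)      # 4) apply a distortion

# 2) shuffling happens in the DataLoader; worker processes do this on CPU cores
loader = DataLoader(ECGDataset(store), batch_size=32, shuffle=True, num_workers=4)
```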

So data has to move from where it’s stored, through a CPU, to the GPUs for computation.

The data load itself is often the largest throughput bottleneck in a training pipeline.

If you were loading a patient’s x-ray to look at it, you might find a 0.04 sec data load time acceptable. You’d hardly notice the latency because your “review” time might be several minutes long.

On the other hand, a GPU server that only takes 0.001 sec to “review” the image would be slowed down by a 0.04 sec data load time. And the delay compounds: with massive training datasets that contain not one item but thousands or millions, and with iterative computation (e.g. 20 epochs in a single job), that data loading delay is incurred many times over.
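To make the compounding concrete, here is a rough back-of-the-envelope using the illustrative numbers above, assuming one million items, 20 epochs, and loads that are not overlapped with GPU work:

```python
# Back-of-the-envelope: per-item load (0.04 s) vs. compute (0.001 s),
# scaled to an illustrative 1,000,000-item dataset and 20 epochs.
items, epochs = 1_000_000, 20
load_s, compute_s = 0.04, 0.001

print(f"compute: {items * epochs * compute_s / 3600:.1f} hours")  # ~5.6 hours
print(f"loading: {items * epochs * load_s / 86400:.1f} days")     # ~9.3 days
```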

Regardless of how fast your GPUs are, your training job can only be as fast as your data load time.

So, there’s a massive performance benefit in ensuring that data loading is efficient.

This post describes our project with Geisinger’s Department of Imaging Science and Innovation to optimize training throughput for a dataset of electrocardiogram results. For the optimization testing, we produced a simulated dataset matched to the real data in size and type.

First, identify the performance baseline

A good place to start a DL performance investigation is to check the ratio between duration of data load work and duration of training work.
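One simple way to get that breakdown is to time the two phases per batch. A minimal sketch, assuming an existing PyTorch `loader` and `model` (as in the earlier sketch) and a CUDA GPU:

```python
# Per-batch timing: time blocked on the loader ("load") vs. time in the
# training step ("train"). `loader` and `model` are assumed to exist.
import time
import torch

load_time, train_time = 0.0, 0.0
t0 = time.perf_counter()
for batch in loader:                    # waiting here is data load time
    t1 = time.perf_counter()
    load_time += t1 - t0

    loss = model(batch.cuda()).mean()   # stand-in loss for the sketch
    loss.backward()
    torch.cuda.synchronize()            # make the GPU work show up in the timer

    t0 = time.perf_counter()
    train_time += t0 - t1

print(f"train:load ratio = 1:{load_time / train_time:.1f}")
```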

After we ran this basic task breakdown, we saw that data load time was significantly larger than training time. This is highly inefficient because, with a train:load ratio of 1:4, our GPUs are idle for ¾ of the job duration.

Every batch, 75% of the GPU time was spent idle while we waited for the next batch to load.

We needed to load data much faster.

Why are files slow to load?

Our synthetic training data represented logs from 15-lead ECG tests: 15x 5GB HDF5 files, each with 1,000,000 records (patients). Each file represented a single ECG channel; within each file, the ‘patient_id’ was the key and that channel’s results were the record.
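Concretely, assembling one patient’s full 15-lead sample meant one keyed read against each of the 15 channel files. A sketch of that access pattern (the file names and key format here are illustrative):

```python
# Assemble one training sample from the per-channel HDF5 files.
# File naming and key format are illustrative assumptions.
import numpy as np
import h5py

def load_patient(patient_id, n_leads=15):
    channels = []
    for lead in range(1, n_leads + 1):
        with h5py.File(f"ecg_channel_{lead:02d}.h5", "r") as f:
            channels.append(f[patient_id][...])   # one keyed read per file
    return np.stack(channels)                     # shape: (15, samples)

sample = load_patient("patient_0001234")
```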

We were treating HDF5 like a key-value store, but the file format is actually designed to support more complex data structures. HDF (Hierarchical Data Format) is great for storing large numerical datasets that have extensive metadata. It enables grouping within the file, effectively making an HDF file a portable file system. HDF is primarily used for scientific datasets in HPC use cases.

A downside of that functionality, however, is that there’s significant overhead per read of an HDF5 file.

We can use strace to snoop on the system calls that occur when we perform a single read via the h5py Python library.
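For example, a minimal single-read script (using the same illustrative file name and key as above) can be run under strace to capture the syscalls:

```python
# read_one.py: issue a single keyed read so the strace output stays small.
# Run it under strace to see the underlying system calls, e.g.:
#   strace -e trace=openat,lseek,read python read_one.py
import h5py

with h5py.File("ecg_channel_01.h5", "r") as f:
    value = f["patient_0001234"][...]
```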

In the strace output, ‘lseek’ moves the file pointer to specific locations within the file, and ‘read’ then actually reads data. The final call is a read of 4992 bytes that retrieves one value from the key-value pairs that comprise our HDF5 file. The 5 reads before it fetch metadata (pointers) inside the file that are required to locate the desired data.

Why are there so many metadata reads? Because HDF uses a B-Tree structure to help increase performance of its file-system-like metadata queries. Just like on a filesystem, knowing where to look for data can be a harder problem than moving the data into a CPU.

After seeing how many redundant system calls are needed for HDF reads, we were convinced that there was a more read-performant way to save our simplistic dataset. We weren’t really using the hierarchical functionality that gives HDF its name.

What would optimal look like?

As a thought experiment, we started with the question, “What’s the best possible layout for key-value data inside a file for maximum read throughput?”

What are we solving for?

1. We don’t need to save complex metadata

The only descriptive information we’re using for the records is the key.

Let’s move away from the functionality in HDF that we’re not making use of.

2. Fast reads for both individual records and for collecting the list of keys

As is common during training jobs, our scripts start by shuffling the dataset. The dataset’s contents must be enumerated before those items can be shuffled. Since this enumeration step occurs at the beginning of every job, we don’t want to build in front-end delays by having slow key collection (which we had seen with HDF5).
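For example, with h5py that enumeration looks something like the following (using the same illustrative file name as above), and it was one of the slow spots we saw:

```python
# Enumerate the dataset's keys, then shuffle them to set the training order.
# Listing ~1,000,000 keys from an HDF5 file was slow in our tests.
import random
import h5py

with h5py.File("ecg_channel_01.h5", "r") as f:
    keys = list(f.keys())
random.shuffle(keys)
```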

3. The larger the IO, the better

We’d prefer to train from datasets saved on remote storage. Why? First, data management is simpler, and we don’t have to spend time copying datasets to local devices ahead of time. Second, our dataset was too large to fit in local memory in its entirety, so we’d have to manage some data sharding process if we trained from a local storage location.

So, since we’ll be issuing reads that have to go out over the network to storage, we might as well optimize the cost of that network round trip by making the read size larger.

We started by combining data from all 15 ECG leads into a single record. This change both simplifies our dataset and increases IO size: a single patient’s ECG result can now be collected in 1 trip over the network, not 15.

Each record contains an array of data from all 15 ECG leads:

[ [channel 1], … , [channel 15] ]

We can associate a single result with the ECG lead it came from simply by its position in the array.

We store all of the keys together in a single block at the end of the file. This key block contains a [key_id, value_location] pair for each record, so we can jump (`lseek`) directly to a specific record’s data inside the file.

Upon opening the file, we read a header that contains file configuration information, such as schema specs and, most importantly, the location of the key block within the file. Key loading is extremely efficient because we can jump straight to the starting location of the key block and then read all the keys at once.

In our HDF5 files, the keys were scattered throughout the file. Keeping the keys contiguous in the new format allows large, sequential reads of the key block, minimizing network overhead.
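To make the layout concrete, here is a minimal sketch of the idea: a fixed-size header, then the record data, then a contiguous key block at the end. This is only an illustration, not the actual Fleet implementation; the JSON encoding of the header and key block and the fixed 4 KB header size are assumptions.

```python
# Layout sketch: [header][record data ...][key block]
# Illustrative only; not the real Fleet on-disk encoding.
import json
import struct
import numpy as np

HEADER_SIZE = 4096  # fixed space reserved for the header (assumption)

def write_store(path, records):                 # records: {key: float32 array}
    offsets = {}
    with open(path, "wb") as f:
        f.write(b"\0" * HEADER_SIZE)            # reserve room for the header
        for key, array in records.items():
            offsets[key] = (f.tell(), array.shape)
            f.write(np.ascontiguousarray(array, dtype=np.float32).tobytes())
        key_block_offset = f.tell()
        f.write(json.dumps(offsets).encode())   # contiguous key block at the end
        header = json.dumps({"dtype": "float32",
                             "key_block_offset": key_block_offset}).encode()
        f.seek(0)                               # header fits in the reserved space
        f.write(struct.pack("<I", len(header)) + header)

def read_keys(path):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(hlen))
        f.seek(header["key_block_offset"])      # jump straight to the key block
        return json.loads(f.read()), header

def read_record(path, key, offsets):
    offset, shape = offsets[key]
    with open(path, "rb") as f:
        f.seek(offset)                          # one seek, one large read
        nbytes = int(np.prod(shape)) * 4        # float32 = 4 bytes per value
        return np.frombuffer(f.read(nbytes), dtype=np.float32).reshape(shape)
```

With this layout, reading the keys is one seek plus one large sequential read, and each record read is a single seek followed by a single contiguous read of the full 15-lead result.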

A 30x solution

The resulting file format, which we’ve named “Fleet” and are making available on GitHub, gave 30x higher read throughput than HDF5 for our workload. Even better, our train:load ratio was now above 1:1 (for both remote storage and local storage). We can efficiently train with our ECG data in-place on NFS.

We think such a lightweight file format is optimal for large numeric datasets being curated for deep learning training. If grouping were necessary, additional Fleet files could be created to represent each group.

Fleet is an open source file format. You can find the code and several helper scripts on the Fleet GitHub page.

What we learned

While training scripts that consume JPEG images or tensor records have been tested and tuned extensively in the AI ecosystem, training on other file formats will likely require some performance analysis to optimize overall job throughput.

Sometimes the format of a training dataset may introduce read patterns or compute demands that slow the overall training job. In this case, we optimized training performance by switching to a lightweight, read-optimized file format.

As we continue to apply deep learning methods to a wider range of use cases and datasets, we should continue to ask ourselves, “What would our data look like if we solved for optimal training throughput?”

This investigation was a project run with my colleague Brian Gold at Pure Storage and with the Department of Imaging Science and Innovation at Geisinger. Thanks especially to David Vanmaanen, Sush Raghunath, and Alvaro Ulloa Cerna.
