
Note – To learn how to write a data loader for a custom dataset, whether sequential or image-based, refer here.
For a sequential dataset where data points can differ in size, we used zero-padding to make all data points the same size, so that the batch can be converted to a tensor and passed to the graphics card (GPU) for parallel processing.
But this method is not optimal. Consider two batches of 8 data points each, with the following data-point sizes:
- Batch 1 — [2, 5, 5, 4, 6, 7, 8, 2], final_size_of_batch = 8 * 8
- Batch 2 — [2, 32, 5, 36, 6, 34, 8, 2], final_size_of_batch = 36 * 8
In Batch 1, every data point is padded to size 8 using zero-padding, while in Batch 2 every data point is padded to size 36. In both cases the minimum size is 2, but in Batch 1 we add only 6 zeros to the element of size 2, whereas in Batch 2 we must add 34. Batch 2 therefore wastes a lot of GPU memory processing zeros that are of no use, while Batch 1 is an efficient packing that wastes very little. This example exposes the problem and also suggests the solution: if we can form the batches manually, we can ensure that each one is packed efficiently.
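A quick way to quantify this waste is to compare the padded batch size with the number of useful (non-padding) tokens. This is a minimal sketch using the two example batches from the text; the helper name `padding_waste` is illustrative.

```python
# The two example batches from the text: each number is the length
# of one data point before padding.
batch_1 = [2, 5, 5, 4, 6, 7, 8, 2]
batch_2 = [2, 32, 5, 36, 6, 34, 8, 2]

def padding_waste(lengths):
    # After zero-padding, every data point is stretched to the batch maximum,
    # so the batch occupies max(lengths) * len(lengths) slots on the GPU.
    padded = max(lengths) * len(lengths)
    useful = sum(lengths)
    return padded, padded - useful

print(padding_waste(batch_1))  # (64, 25)  -> 8*8 slots, 25 wasted on zeros
print(padding_waste(batch_2))  # (288, 163) -> 36*8 slots, 163 wasted on zeros
```

Batch 2 spends more than half of its GPU memory on padding zeros, which is exactly the inefficiency the custom sampler below is meant to avoid.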
But how do we do that?
Remember the index argument of the `__getitem__` function, which is provided internally? If we can instead provide the indices for each batch ourselves, our job is done. To do so, we can use the `batch_sampler` argument of the data loader, as shown below.
```python
train_dataset = Dataset_seq(word2id, train_path)
# data is the list of sentences present in the whole corpus
sampler = Sampler(tokens, data, bin_size)
train_batch_sampler_loader = DataLoader(
    train_dataset,
    batch_sampler=sampler,
    collate_fn=collate_fn,
)
```
Now the indices for each batch will be provided by the sampler, which we define below.
Note – It is preferable for a batch to contain a different set of data points in every epoch, i.e. if in the first epoch a batch contains (data 1, data 2, data 3, data 4), in later epochs we should avoid sending that same set (data 1, data 2, data 3, data 4) together again. This ensures that our model does not learn the pattern/sequence in which the data points are provided.
Let us now understand how we will write the algorithm for our sampler function:
- We will create a list text_len containing the length of each data point.
- Then we will create bins (buckets) such that each bin stores the indices of data points whose size is less than or equal to the size corresponding to that bin. The size corresponding to each bin depends on bin_size. For example, if bin_size = 2 then the bin sizes will be 3, 5, 7, … up to the maximum size present in text_len.
- If, say, text_len = [2, 3, 4, 5, 6, 7, 8], we get the bins {3: [0, 1], 5: [2, 3], 7: [4, 5], 8: [6]}, i.e. the values at indices 0 and 1 have size ≤ 3, the values at indices 2 and 3 have size ≤ 5, and so on. The last bin has size 8 because that is the maximum size present in text_len.
- Now that the whole dataset is segregated by size, we can create our batches. For that we use a parameter n_tokens, which indicates the maximum total size (including zero-padding) that can be loaded onto the GPU. So if n_tokens = 500, we build each batch so that after zero-padding the sum of the sizes of its data points is at most 500.
- To form the batches, we start from the largest bucket and keep picking indices sequentially until the total size of the batch is just under (or equal to) n_tokens. Once a batch is formed, we append it to final_indices, which is a list of lists. This process continues until every data point (across all bins) has been picked and allotted to a batch.
- To make sure the same set of batches is not sent in different epochs, we randomly shuffle the list stored in each bin after every epoch. So when we start picking sequentially from the bins, we get different batches each time.
Refer to the code for this algorithm below
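The steps above can be sketched as follows. This is a minimal, dependency-free illustration, not the post's original code: the class name `BucketBatchSampler`, the `seed` argument, and the exact bucketing details are assumptions. In practice you would subclass `torch.utils.data.Sampler` and pass the instance as the `batch_sampler` argument shown earlier.

```python
import random

class BucketBatchSampler:
    """Sketch of the bucketing sampler: groups indices of similar-length
    data points into batches whose padded size stays within n_tokens."""

    def __init__(self, text_len, bin_size, n_tokens, seed=0):
        self.text_len = text_len    # length of each data point
        self.bin_size = bin_size    # spacing between bucket boundaries
        self.n_tokens = n_tokens    # max padded tokens allowed per batch
        self.rng = random.Random(seed)
        self.bins = self._make_bins()

    def _make_bins(self):
        # Bucket boundaries start just above the minimum length and step by
        # bin_size; the last bucket is capped at the maximum length, e.g.
        # text_len=[2,3,4,5,6,7,8], bin_size=2 -> boundaries [3, 5, 7, 8].
        edges = list(range(min(self.text_len) + 1, max(self.text_len), self.bin_size))
        if not edges or edges[-1] != max(self.text_len):
            edges.append(max(self.text_len))
        bins = {e: [] for e in edges}
        for idx, length in enumerate(self.text_len):
            # Each index goes into the first bucket whose boundary covers it.
            edge = next(e for e in edges if length <= e)
            bins[edge].append(idx)
        return bins

    def __iter__(self):
        # Reshuffle each bin every epoch so batch composition varies.
        for edge in self.bins:
            self.rng.shuffle(self.bins[edge])
        final_indices = []          # list of batches, each a list of indices
        batch, batch_max = [], 0
        # Walk buckets from the largest lengths down, picking sequentially.
        for edge in sorted(self.bins, reverse=True):
            for idx in self.bins[edge]:
                new_max = max(batch_max, self.text_len[idx])
                # After padding, the batch occupies new_max * (count) tokens;
                # flush the batch if adding this index would exceed n_tokens.
                if batch and new_max * (len(batch) + 1) > self.n_tokens:
                    final_indices.append(batch)
                    batch, batch_max = [], 0
                    new_max = self.text_len[idx]
                batch.append(idx)
                batch_max = new_max
        if batch:
            final_indices.append(batch)
        return iter(final_indices)
```

Iterating over a sampler built from the example lengths, e.g. `BucketBatchSampler([2, 3, 4, 5, 6, 7, 8], bin_size=2, n_tokens=12)`, yields batches whose padded size never exceeds the 12-token budget, and repeated iteration yields differently composed batches because each bin is reshuffled per epoch.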
I hope this blog helps you understand and explore new applications. Please share your feedback, and any other methods you follow, to help make this blog better.
Become a [Medium](https://medium.com/@AnveeNaik) member to unlock and read many other stories on Medium. Follow us on Medium to read more such blog posts.