How to use Dataset in TensorFlow

The built-in Input Pipeline. Never use ‘feed-dict’ anymore

Updated to TensorFlow 1.5

As you should know, feed-dict is the slowest possible way to pass information to TensorFlow and it must be avoided. The correct way to feed data into your models is to use an input pipeline, to ensure that the GPU never has to wait for new data to come in.

Fortunately, TensorFlow has a built-in API, called Dataset, that makes it easier to accomplish this task. In this tutorial, we are going to see how we can create an input pipeline with it and how to feed data into the model efficiently.

This article will explain the basic mechanics of the Dataset, covering the most common use cases.

You can find all the code as a Jupyter notebook here:

https://github.com/FrancescoSaverioZuppichini/Tensorflow-Dataset-Tutorial/blob/master/dataset_tutorial.ipynb

Generic Overview

In order to use a Dataset we need to follow three steps (a minimal sketch of the full flow is shown right after this list):

  • Importing Data. Create a Dataset instance from some data.
  • Create an Iterator. Use the created dataset to make an Iterator instance that iterates through the dataset.
  • Consuming Data. Use the created iterator to get the elements from the dataset and feed them to the model.
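
Here is a minimal sketch of that flow, using a random numpy array as data (the two imports below are assumed in all the snippets of this article):

import numpy as np
import tensorflow as tf

# 1. Importing Data: create a Dataset instance from a numpy array
data = np.random.sample((100,2))
dataset = tf.data.Dataset.from_tensor_slices(data)
# 2. Create an Iterator over the dataset
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
# 3. Consuming Data: run the tensor returned by get_next() to read the values
with tf.Session() as sess:
    print(sess.run(el))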

Importing Data

We first need some data to put inside our dataset.

From numpy

This is the most common case: we have a numpy array and we want to pass it to TensorFlow.

# create a random vector of shape (100,2)
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)

We can also pass more than one numpy array; a classic example is when we have data divided into features and labels:

features, labels = (np.random.sample((100,2)), np.random.sample((100,1)))
dataset = tf.data.Dataset.from_tensor_slices((features,labels))
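
Each element of such a dataset is a (feature, label) pair. As a quick preview of the iterator mechanics covered later, this sketch reads one pair back:

iter = dataset.make_one_shot_iterator()
feature, label = iter.get_next()
with tf.Session() as sess:
    # prints one (feature, label) pair, e.g. [array([0.2, 0.7]), array([0.5])]
    print(sess.run([feature, label]))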

From tensors

We can, of course, initialise our dataset with some tensor

# using a tensor
dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([100, 2]))

From a placeholder

This is useful when we want to dynamically change the data inside the Dataset; we will see how later.

x = tf.placeholder(tf.float32, shape=[None,2])
dataset = tf.data.Dataset.from_tensor_slices(x)

From generator

We can also initialise a Dataset from a generator; this is useful when we have an array of elements of different lengths (e.g. a sequence):

sequence = np.array([[1],[2,3],[3,4]])

def generator():
    for el in sequence:
        yield el

dataset = tf.data.Dataset.from_generator(generator,
                                         output_types=tf.float32,
                                         output_shapes=tf.TensorShape([None]))

In this case, you also need to specify the types and the shapes of your data, so that the correct tensors can be created.
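
To check that it works, we can consume this dataset with a one-shot iterator; since the elements have different lengths, this sketch reads them one at a time rather than batching them:

iter = dataset.make_one_shot_iterator()
el = iter.get_next()
with tf.Session() as sess:
    # prints each sequence element in turn
    for _ in range(len(sequence)):
        print(sess.run(el))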

Create an Iterator

We have seen how to create a dataset, but how do we get our data back? We have to use an Iterator, which gives us the ability to iterate through the dataset and retrieve the real values of the data. There are four types of iterators.

One shot Iterator

This is the easiest iterator. Using the first example:

x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
# create the iterator
iter = dataset.make_one_shot_iterator()

Then you need to call get_next() to get the tensor that will contain your data

...
# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

We can run el in order to see its value

with tf.Session() as sess:
    print(sess.run(el)) # output: [ 0.42116176  0.40666069]

Initializable Iterator

In case we want to build a dynamic dataset in which we can change the data source at runtime, we can create a dataset with a placeholder. Then we can initialise the placeholder using the usual feed-dict mechanism. This is done with an initializable iterator. Using the third example from the previous section:

# using a placeholder
x = tf.placeholder(tf.float32, shape=[None,2])
dataset = tf.data.Dataset.from_tensor_slices(x)
data = np.random.sample((100,2))
iter = dataset.make_initializable_iterator() # create the iterator
el = iter.get_next()
with tf.Session() as sess:
    # feed the placeholder with data
    sess.run(iter.initializer, feed_dict={ x: data })
    print(sess.run(el)) # output [ 0.52374458  0.71968478]

This time we call make_initializable_iterator. Then, inside the sess scope, we run the initializer operation in order to pass our data, in this case a random numpy array.

Imagine that now we have a train set and a test set, a real common scenario:

train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))

Then we would like to train the model and evaluate it on the test dataset; this can be done by initialising the iterator again after training:

# initializable iterator to switch between dataset
EPOCHS = 10
x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x, y))
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))
iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()
with tf.Session() as sess:
    # initialise iterator with train data
    sess.run(iter.initializer, feed_dict={ x: train_data[0], y: train_data[1]})
    for _ in range(EPOCHS):
        sess.run([features, labels])
    # switch to test data
    sess.run(iter.initializer, feed_dict={ x: test_data[0], y: test_data[1]})
    print(sess.run([features, labels]))

Reinitializable Iterator

The concept is similar to before: we want to dynamically switch between data. But instead of feeding new data to the same dataset, we switch datasets. As before, we want a train dataset and a test dataset:

# making fake data using numpy
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))

We can create two Datasets

# create two datasets, one for training and one for test
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)

Now, here is the trick: we create a generic Iterator

# create a iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)

and then two initialisation operations:

# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)

We get the next element as before

features, labels = iter.get_next()

Now, we can directly run the two initialisation operations using our session. Putting it all together we get:

# Reinitializable iterator to switch between Datasets
EPOCHS = 10
# making fake data using numpy
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))
# create two datasets, one for training and one for test
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
# create a iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
features, labels = iter.get_next()
# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)
with tf.Session() as sess:
    sess.run(train_init_op) # switch to train dataset
    for _ in range(EPOCHS):
        sess.run([features, labels])
    sess.run(test_init_op) # switch to val dataset
    print(sess.run([features, labels]))

Feedable Iterator

Honestly, I don’t think they are very useful. Basically, instead of switching between datasets, they switch between iterators, so you can have, for example, one iterator from make_one_shot_iterator() and one from make_initializable_iterator().
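
For completeness, here is a minimal sketch of how it works, reusing the train_dataset and test_dataset defined above: each concrete iterator exposes a string handle, and a placeholder decides at runtime which one feeds the model.

# generic iterator driven by a string handle fed at runtime
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle,
                                               train_dataset.output_types,
                                               train_dataset.output_shapes)
features, labels = iterator.get_next()
# one concrete iterator per dataset
train_iter = train_dataset.make_one_shot_iterator()
test_iter = test_dataset.make_initializable_iterator()
with tf.Session() as sess:
    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())
    sess.run(test_iter.initializer)
    print(sess.run([features, labels], feed_dict={handle: train_handle}))
    print(sess.run([features, labels], feed_dict={handle: test_handle}))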

Consuming data

In the previous examples we have used the session to print the value of the next element in the Dataset.

...
next_el = iter.get_next()
...
print(sess.run(next_el)) # will output the current element

In order to pass the data to a model we just have to pass the tensors generated by get_next().

In the following snippet we have a Dataset that contains two numpy arrays, using the same example from the first section. Notice that we need to wrap np.random.sample in another numpy array to add a dimension, which is needed to batch the data.

# using two numpy arrays
features, labels = (np.array([np.random.sample((100,2))]),
np.array([np.random.sample((100,1))]))
dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)

Then, as always, we create an iterator:

iter = dataset.make_one_shot_iterator()
x, y = iter.get_next()

We make a model, a simple neural network

# make a simple model
net = tf.layers.dense(x, 8) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8)
prediction = tf.layers.dense(net, 1)
loss = tf.losses.mean_squared_error(prediction, y) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

We directly use the tensors from iter.get_next() as input to the first layer and as labels for the loss function. Putting it all together:

EPOCHS = 10
BATCH_SIZE = 16
# using two numpy arrays
features, labels = (np.array([np.random.sample((100,2))]),
                    np.array([np.random.sample((100,1))]))
dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)
iter = dataset.make_one_shot_iterator()
x, y = iter.get_next()
# make a simple model
net = tf.layers.dense(x, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)
loss = tf.losses.mean_squared_error(prediction, y) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))

Output:

Iter: 0, Loss: 0.1328 
Iter: 1, Loss: 0.1312
Iter: 2, Loss: 0.1296
Iter: 3, Loss: 0.1281
Iter: 4, Loss: 0.1267
Iter: 5, Loss: 0.1254
Iter: 6, Loss: 0.1242
Iter: 7, Loss: 0.1231
Iter: 8, Loss: 0.1220
Iter: 9, Loss: 0.1210

Useful Stuff

Batch

Usually batching data is a pain in the ass; with the Dataset API we can use the method batch(BATCH_SIZE), which automatically batches the dataset with the provided size. The default value is one. In the following example, we use a batch size of 4.

# BATCHING
BATCH_SIZE = 4
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x).batch(BATCH_SIZE)
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
with tf.Session() as sess:
    print(sess.run(el))

Output:

[[ 0.65686128  0.99373963]
 [ 0.69690451  0.32446826]
 [ 0.57148422  0.68688242]
 [ 0.20335116  0.82473219]]

Repeat

Using .repeat() we can specify the number of times we want the dataset to be iterated. If no argument is passed, it will loop forever; usually it is best to just loop forever and directly control the number of epochs with a standard loop.
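
As a minimal sketch of that pattern (the EPOCHS and BATCH_SIZE values below are arbitrary):

# repeat() with no argument loops over the data forever,
# so we control the number of epochs with a standard Python loop
EPOCHS = 2
BATCH_SIZE = 4
x = np.random.sample((100,2))
dataset = tf.data.Dataset.from_tensor_slices(x).repeat().batch(BATCH_SIZE)
el = dataset.make_one_shot_iterator().get_next()
n_batches = len(x) // BATCH_SIZE
with tf.Session() as sess:
    for epoch in range(EPOCHS):
        for _ in range(n_batches):
            batch = sess.run(el)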

Shuffle

We can shuffle the Dataset by using the method shuffle(), which by default shuffles the dataset every epoch.

Remember: shuffling the dataset is very important to avoid overfitting.

We can also set the parameter buffer_size, a fixed-size buffer from which the next element will be uniformly chosen. Example:

# SHUFFLE
BATCH_SIZE = 4
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(BATCH_SIZE)
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
with tf.Session() as sess:
    print(sess.run(el))

First run output:

[[4]
 [2]
 [3]
 [1]]

Second run output:

[[3]
 [1]
 [2]
 [4]]

Yep. It was shuffled. If you want, you can also set the seed parameter.
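
For example, fixing the seed (the value below is arbitrary) makes the shuffling order reproducible across runs:

dataset = dataset.shuffle(buffer_size=100, seed=42)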

Map

You can apply a custom function to each member of a dataset using the map method. In the following example we multiply each element by two:

# MAP
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.map(lambda x: x*2)
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
with tf.Session() as sess:
    # print every element of the dataset
    for _ in range(len(x)):
        print(sess.run(el))

Output:

[2]
[4]
[6]
[8]

Other resources

TensorFlow dataset tutorial: https://www.tensorflow.org/programmers_guide/datasets

Dataset docs:

https://www.tensorflow.org/api_docs/python/tf/data/Dataset

Conclusion

The Dataset API gives us a fast and robust way to create optimized input pipelines to train, evaluate and test our models. In this article, we have seen most of the common operations we can perform with it.

You can use the Jupyter notebook that I’ve made for this article as a reference.

Thank you for reading,

Francesco Saverio