# Autoencoders — Introduction and Implementation in TF.

### Introduction and Concepts:

*Autoencoders (AE)* are a family of neural networks for which **the input is the same as the output** (they learn to approximate an identity function). They work by compressing the input into a *latent-space representation* and then reconstructing the output from this representation.

A really popular use for autoencoders is to apply them to images. The **trick** is to replace *fully connected* layers with *convolutional* layers. These, along with pooling layers, convert the input from **wide** and **thin** (say, 100 x 100 px with 3 RGB channels) to **narrow** and **thick**. This helps the network extract **visual features** from the images, and therefore obtain a much more accurate latent-space representation. The reconstruction process uses *upsampling* and convolutions.

The resulting network is called a *Convolutional Autoencoder* (*CAE*).

#### Use of CAEs

#### Example: Ultra-basic image reconstruction

Convolutional autoencoders can be useful for reconstruction. They can, for example, learn to remove noise from a picture or reconstruct missing parts.

To do so, we don’t use the same image as input and output, but rather a **noisy version as input** and the **clean version as output**. With this process, the network learns to fill in the gaps in the image.

Let’s see what a CAE can do to **replace part of an image of an eye**. *Let’s say there’s a crosshair and we want to remove it*. We can manually create the dataset, which is extremely convenient.
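A minimal sketch of how such a dataset could be built by hand, assuming grayscale images stored as nested lists of floats in [0, 1]. The helper names `add_crosshair` and `make_pairs` are hypothetical and are not taken from the original experiment.

```python
import copy


def add_crosshair(image, value=0.0):
    """Return a copy of `image` with a one-pixel-wide crosshair through its centre."""
    height, width = len(image), len(image[0])
    marked = copy.deepcopy(image)  # leave the clean image untouched
    mid_row, mid_col = height // 2, width // 2
    for col in range(width):
        marked[mid_row][col] = value  # horizontal bar
    for row in range(height):
        marked[row][mid_col] = value  # vertical bar
    return marked


def make_pairs(clean_images):
    """Build (input, target) pairs: crosshair image in, clean image out."""
    return [(add_crosshair(img), img) for img in clean_images]


# Toy 5x5 "image" of uniform grey, standing in for a real eye picture.
clean = [[0.5] * 5 for _ in range(5)]
corrupted, target = make_pairs([clean])[0]
```

Because the corruption is applied programmatically, every clean picture yields a training pair for free, which is what makes this setup convenient.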

Now that our autoencoder is trained, we can use it to remove the crosshairs on pictures of eyes **we have never seen**!

### Implementation in TF:

Let’s go over a sample implementation using the MNIST dataset in TensorFlow.

Notebook: https://github.com/mchablani/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder.ipynb

#### Network Architecture

The encoder part of the network will be a typical convolutional pyramid. Each convolutional layer will be followed by a max-pooling layer to reduce the dimensions of the layers. The decoder needs to convert from a narrow representation to a wide reconstructed image.
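The shrinking pyramid can be traced with a few lines of arithmetic. The sketch below assumes `'same'` padding, for which the output size is the input size divided by the stride, rounded up; the filter counts are the ones used in the model definition further down. The function names are hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Spatial size after a 'same'-padded convolution or pooling layer."""
    return math.ceil(size / stride)


def encoder_shapes(size=28, filters=(32, 32, 16)):
    """Trace (height, width, channels) after each conv + max-pool pair."""
    shapes = []
    for n_filters in filters:
        size = same_padding_output(size, stride=1)  # 3x3 conv, stride 1
        size = same_padding_output(size, stride=2)  # 2x2 max-pool, stride 2
        shapes.append((size, size, n_filters))
    return shapes


print(encoder_shapes())  # [(14, 14, 32), (7, 7, 32), (4, 4, 16)]
```

Note that the last pooling step rounds 7 / 2 up to 4, which is why the encoded representation is 4x4x16 rather than 3x3x16.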

Usually, you’ll see **transposed convolution** layers used to increase the width and height of the layers. They work almost exactly the same as convolutional layers, but in reverse. A stride in the input layer results in a larger stride in the transposed convolution layer. For example, if you have a 3x3 kernel, a 3x3 patch in the input layer will be reduced to one unit in a convolutional layer. Conversely, one unit in the input layer will be expanded to a 3x3 patch in a transposed convolution layer. The TensorFlow API provides us with an easy way to create these layers: `tf.nn.conv2d_transpose`.
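To make the "one unit becomes a 3x3 patch" idea concrete, here is a naive pure-Python sketch of a single-channel transposed convolution with a square kernel and no padding. It is an illustration only, not how `tf.nn.conv2d_transpose` is implemented; the function name mirrors the TensorFlow one purely for readability.

```python
def conv2d_transpose(inputs, kernel, stride=1):
    """Naive single-channel transposed convolution (square kernel, no padding)."""
    k = len(kernel)
    in_h, in_w = len(inputs), len(inputs[0])
    out_h = (in_h - 1) * stride + k
    out_w = (in_w - 1) * stride + k
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            # Each input unit "stamps" a scaled copy of the kernel on the output.
            for di in range(k):
                for dj in range(k):
                    output[i * stride + di][j * stride + dj] += inputs[i][j] * kernel[di][dj]
    return output


kernel = [[1.0] * 3 for _ in range(3)]

# A single input unit expands to a 3x3 patch.
patch = conv2d_transpose([[2.0]], kernel)

# A 2x2 input with stride 2 grows to 5x5; neighbouring stamps overlap.
grown = conv2d_transpose([[1.0, 1.0], [1.0, 1.0]], kernel, stride=2)
```

In `grown`, the centre value is 4.0 while the corners are 1.0, because four kernel stamps overlap in the middle.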

However, transposed convolution layers can lead to artifacts in the final images, such as checkerboard patterns. This is due to overlap in the kernels, which can be avoided by setting the stride and kernel size equal. In a Distill article, Augustus Odena et al. show that these checkerboard artifacts can be avoided by resizing the layers using nearest neighbor or bilinear interpolation (upsampling) followed by a convolutional layer. In TensorFlow, this is easily done with `tf.image.resize_images` followed by a convolution. Odena et al. claim that nearest neighbor interpolation works best for the upsampling.
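The following sketch illustrates both halves of that argument in plain Python. `overlap_counts` counts, in one dimension, how many kernel placements of a transposed convolution touch each output position, and `resize_nearest` is a simple nearest-neighbor resize. Both names are hypothetical, and the index mapping `floor(index * in_size / out_size)` is one common convention; TensorFlow's exact rounding may differ.

```python
def overlap_counts(input_size, kernel_size, stride):
    """Count how many kernel placements touch each output position (1-D)."""
    output_size = (input_size - 1) * stride + kernel_size
    counts = [0] * output_size
    for i in range(input_size):
        for k in range(kernel_size):
            counts[i * stride + k] += 1
    return counts


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize using floor(index * in_size / out_size)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Kernel size 3 with stride 2: coverage alternates, which shows up as a checkerboard in 2-D.
uneven = overlap_counts(4, kernel_size=3, stride=2)  # [1, 1, 2, 1, 2, 1, 2, 1, 1]

# Kernel size equal to the stride: every output position is covered exactly once.
even = overlap_counts(4, kernel_size=2, stride=2)    # [1, 1, 1, 1, 1, 1, 1, 1]

# Upsampling first gives every output pixel a value before any convolution runs.
upsampled = resize_nearest([[1, 2], [3, 4]], 4, 4)
```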

Autoencoders can be used to denoise images quite successfully just by training the network on noisy images. We can create the noisy images ourselves by adding Gaussian noise to the training images, then clipping the values to be between 0 and 1. We’ll use noisy images as input and the original, clean images as targets.

Note that we are using `sigmoid_cross_entropy_with_logits` for the loss. According to the TF documentation, it measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time. Here, each pixel plays the role of an independent label whose target is the clean pixel intensity between 0 and 1.
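For intuition, the per-pixel loss can be written out in plain Python. The sketch below uses the numerically stable form given in the TensorFlow documentation, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is the logit and `z` the target. The function names are hypothetical, and this is a re-implementation for illustration rather than the library code.

```python
import math


def stable_sigmoid_xent(label, logit):
    """Numerically stable loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_sigmoid_xent(label, logit):
    """Direct definition, for comparison; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def mean_loss(labels, logits):
    """Average the per-pixel losses, as tf.reduce_mean does for the cost."""
    losses = [stable_sigmoid_xent(z, x) for z, x in zip(labels, logits)]
    return sum(losses) / len(losses)


# Targets are pixel intensities in [0, 1]; logits are raw network outputs.
example = mean_loss([0.0, 1.0, 0.7], [0.3, -1.2, 2.0])
```

Working on logits rather than on the sigmoid output is what lets the stable form avoid taking the logarithm of a value that has already rounded to 0 or 1.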

Model definition:

```python
import tensorflow as tf  # written against the TensorFlow 1.x API (tf.placeholder, tf.layers)

learning_rate = 0.001
inputs_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='inputs')
targets_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='targets')

### Encoder
conv1 = tf.layers.conv2d(inputs=inputs_, filters=32, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 28x28x32
maxpool1 = tf.layers.max_pooling2d(conv1, pool_size=(2,2), strides=(2,2), padding='same')
# Now 14x14x32
conv2 = tf.layers.conv2d(inputs=maxpool1, filters=32, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 14x14x32
maxpool2 = tf.layers.max_pooling2d(conv2, pool_size=(2,2), strides=(2,2), padding='same')
# Now 7x7x32
conv3 = tf.layers.conv2d(inputs=maxpool2, filters=16, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 7x7x16
encoded = tf.layers.max_pooling2d(conv3, pool_size=(2,2), strides=(2,2), padding='same')
# Now 4x4x16

### Decoder
upsample1 = tf.image.resize_images(encoded, size=(7,7), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
# Now 7x7x16
conv4 = tf.layers.conv2d(inputs=upsample1, filters=16, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 7x7x16
upsample2 = tf.image.resize_images(conv4, size=(14,14), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
# Now 14x14x16
conv5 = tf.layers.conv2d(inputs=upsample2, filters=32, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 14x14x32
upsample3 = tf.image.resize_images(conv5, size=(28,28), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
# Now 28x28x32
conv6 = tf.layers.conv2d(inputs=upsample3, filters=32, kernel_size=(3,3), padding='same', activation=tf.nn.relu)
# Now 28x28x32

logits = tf.layers.conv2d(inputs=conv6, filters=1, kernel_size=(3,3), padding='same', activation=None)
# Now 28x28x1

# Pass logits through sigmoid to get reconstructed image
decoded = tf.nn.sigmoid(logits)

# Compute the per-pixel cross-entropy loss directly from the logits
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)

# Get cost and define the optimizer
cost = tf.reduce_mean(loss)
opt = tf.train.AdamOptimizer(learning_rate).minimize(cost)
```

Training:

```python
import numpy as np

# `mnist` is assumed to be the dataset object loaded in the linked notebook,
# e.g. via tensorflow.examples.tutorials.mnist.input_data.read_data_sets.

sess = tf.Session()
epochs = 100
batch_size = 200
# Sets how much noise we're adding to the MNIST images
noise_factor = 0.5
sess.run(tf.global_variables_initializer())

for e in range(epochs):
    for ii in range(mnist.train.num_examples//batch_size):
        batch = mnist.train.next_batch(batch_size)
        # Get images from the batch
        imgs = batch[0].reshape((-1, 28, 28, 1))

        # Add random noise to the input images
        noisy_imgs = imgs + noise_factor * np.random.randn(*imgs.shape)
        # Clip the images to be between 0 and 1
        noisy_imgs = np.clip(noisy_imgs, 0., 1.)

        # Noisy images as inputs, original images as targets
        batch_cost, _ = sess.run([cost, opt], feed_dict={inputs_: noisy_imgs,
                                                         targets_: imgs})

        # Report the loss of the current batch
        print("Epoch: {}/{}...".format(e+1, epochs),
              "Training loss: {:.4f}".format(batch_cost))
```

Credits: https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694