Star Wars Episode IV (1977) Remastered

A Deep Learning pipeline to remaster the deleted scenes

Neel Iyer
Towards Data Science


TLDR

Here’s the output from the model. Left is remastered output. Right is original video.

The full Jupyter Notebook for training and running inference is available on GitHub

A New Hope for the deleted scenes

I’m a huge Star Wars fan. And like a lot of Star Wars fans I’ve been getting into Star Wars: The Clone Wars on Cartoon Network and Disney+. It’s a phenomenal show.

But I’m always annoyed by the drop in video quality when I watch the older stuff. For example, here are the deleted scenes from Star Wars: Episode IV: A New Hope (1977). This was the very first Star Wars film ever made.

Video by Marcelo Zuniga

There are these weird black specks that keep popping up. No wonder these are the deleted scenes.

Apparently those weird specks are called cue marks. They come from scratches on the film. Star Wars is a fantastic series, but it’s also fantastically old.

Deep Learning has recently been used for video restoration, and the results have been very promising. DeOldify, for example, allows users to colorize old videos and images. NVIDIA’s Noise2Noise model allows people to restore old images to their former glory.

But so far there’s nothing I know of that can specifically remove ‘cue marks’ and grainy spots from old film. So let’s build it!

Creating the Dataset

Creating the dataset was tricky, but still doable. Here’s what I did: I downloaded a high-quality video from YouTube, then I ruined it. I added black specks and reduced the resolution of the video. Ffmpeg was very useful for this.

First we’ll download the video.

youtube-dl --format best -o seinfeld.mp4 https://www.youtube.com/watch?v=nEAO60ON7yo

I’m using this video, a clip from Seinfeld. Because why not?

Video by SeriesHD

Then we’ll need to ruin it. To do this, I downloaded a grainy film overlay from YouTube and overlaid it onto the video using ffmpeg with the blend mode set to softlight. Finding the right blend setting took a lot of trial and error, since the ffmpeg docs don’t have many examples.
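For reference, the overlay step looked roughly like this. This is a sketch rather than my exact command: the overlay filename, the opacity and the output name are placeholders, and scale2ref is just one way to make the two inputs the same size before blending.

import subprocess

# Blend a grainy overlay onto the clean clip using ffmpeg's softlight blend mode.
# 'grain_overlay.mp4' and the 0.7 opacity are placeholders -- tune them to taste.
filter_complex = (
    "[1:v][0:v]scale2ref[grain][base];"  # scale the overlay to the clip's resolution
    "[base][grain]blend=all_mode=softlight:all_opacity=0.7[out]"
)
subprocess.run([
    "ffmpeg", "-i", "seinfeld.mp4", "-i", "grain_overlay.mp4",
    "-filter_complex", filter_complex,
    "-map", "[out]", "-shortest", "ruined.mp4",
], check=True)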

Now we have two videos. One in perfect quality and another in crappy quality.

Video by author
Video by author

Now we’ll extract frames from each video. Initially I took a naive approach, going through the video in Python and scraping each frame individually. But that took too long. I eventually realised we can use multiprocessing here to really speed things up. This was adapted from Hayden Faulkner’s script.
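Here’s a minimal sketch of that idea. It simply parallelises across the two videos rather than splitting a single video into chunks the way the original script does, and the folder names are placeholders.

import cv2
from multiprocessing import Pool
from pathlib import Path

def extract_frames(args):
    """Dump every frame of one video to numbered JPEGs in out_dir."""
    video_path, out_dir = args
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
        idx += 1
    cap.release()
    return idx

if __name__ == "__main__":
    jobs = [("seinfeld.mp4", "frames/clean"), ("ruined.mp4", "frames/crappy")]
    with Pool(processes=2) as pool:
        print(pool.map(extract_frames, jobs))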

Great. Now we have two datasets. One of crappy quality images (taken from the ruined video) and one of good quality images (taken from the high quality video). To make the crappy images crappier, I’ll downscale them (this isn’t a necessary step though).

def resize_one(img, size):
    # resize_to is a fastai helper that picks the target size (keeping the aspect ratio)
    targ_sz = resize_to(img, size, use_min=True)
    img = img.resize(targ_sz, resample=PIL.Image.BILINEAR).convert('RGB')
    return img

This is what the crappy and normal images look like now. Side note: this is a great scene from Seinfeld.

Image by author
Image by author

A quick check shows that we have a dataset of 10,014 files. Pretty good.

Neural Network

Let’s make the most of those 10,014 files by using transforms.

I added horizontal and vertical flips, zoom changes, lighting changes and rotation changes. With Fastai this is really easy to do.
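In fastai v1 that’s essentially one call to get_transforms. The exact limits below are illustrative rather than the precise values from my notebook:

from fastai.vision import get_transforms

# Horizontal/vertical flips plus rotation, zoom and lighting jitter.
tfms = get_transforms(
    do_flip=True,      # random horizontal flips
    flip_vert=True,    # allow vertical flips too
    max_rotate=10.0,   # up to +/- 10 degrees of rotation
    max_zoom=1.2,      # up to 20% zoom
    max_lighting=0.3,  # brightness/contrast jitter
)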

Here are some of the image transforms.

Image by author

Not bad!

We’ll train on this data using the NoGAN approach pioneered by fastai and Jason Antic. The code was inspired by lesson 7 of the fastai course.
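Here’s a sketch of that setup in fastai v1. It assumes the frame folders from earlier, pairs each crappy frame with the clean frame of the same name, and uses a plain pixel loss where lesson 7 uses a feature (perceptual) loss:

from fastai.vision import *

path_crappy = Path('frames/crappy')  # model inputs
path_clean = Path('frames/clean')    # targets

def get_data(bs, size):
    data = (ImageImageList.from_folder(path_crappy)
            .split_by_rand_pct(0.1, seed=42)
            .label_from_func(lambda x: path_clean/x.name)  # pair frames by filename
            .transform(get_transforms(flip_vert=True), size=size, tfm_y=True)
            .databunch(bs=bs)
            .normalize(imagenet_stats, do_y=True))
    return data

data = get_data(bs=64, size=128)
learn = unet_learner(data, models.resnet34, wd=1e-3,
                     blur=True, norm_type=NormType.Weight,
                     self_attention=True, y_range=(-3., 3.),
                     loss_func=MSELossFlat())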

I trained the model on Google Colab’s free GPUs. They’re a great resource, and I can’t believe they’re free.

Training

One interesting thing fastai recommends is gradually increasing the size of your images.

So at first you train on small images, then you upscale and retrain on larger ones. It saves you a lot of time. Pretty smart.

First, we’ll train on images of size 128x128. Because the images are so small I can up the batch size to 64.

I picked a learning rate of 1e-2 for this: something aggressive, but still on the safe side of explosion. An aggressive-but-stable rate like this works well with one-cycle training.
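In code, that first stage is just a one-cycle fit (the epoch count here is illustrative):

# Stage 1: 128x128 images, batch size 64, aggressive learning rate.
learn.fit_one_cycle(10, max_lr=1e-2)
learn.save('stage1-128')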

Image by author

The network will print the results during training. The input is on the left, the prediction in the middle and the target on the right. The results look very promising!

Image by author

I resized and trained again. And again. Each time I made the images slightly larger than before, moving from 128x128 to 480x480 to the original size of the video frames.

Image by author

This was the final training run. For this I used pct_start = 0.3: I wanted the learning rate to be decreasing for 70% of training, since I prefer a lower learning rate when fine-tuning models. The results from this run look really good.
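The resize-and-retrain loop looks roughly like this; the batch sizes, epoch counts and learning-rate slice are illustrative:

# Progressive resizing: rebuild the data at a larger size, shrink the batch size,
# then fine-tune with a gentler schedule.
learn.data = get_data(bs=8, size=480)  # repeat again at the full frame size
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(1e-6, 1e-4), pct_start=0.3)
learn.save('stage2-final')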

Image by author

Inference: Applying to Star Wars

Once this network had trained, I ran inference. This was more involved than I originally thought.

I had to download the Star Wars deleted scenes (using youtube-dl) and then extract every frame of the video, using the same multiprocessing method as before.

Image by author

Then I had to run inference from the learner on each individual frame of the video. That takes a long time.

I added some hacks here.

First, I added a render factor. This idea was taken from DeOldify: I downscale the image and convert it to a square, then run inference on that square image. The model is more receptive to square images, and this has been shown to reduce ‘glitches’ considerably.

After running inference on the square image, I convert it back to its original shape. I found this reduces glitches and generally results in smoother video output. I set the render_factor to 40, although it can go higher if we want higher-resolution output. I’d need more RAM for that though.
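A rough sketch of that trick is below. The 16-pixel multiple is borrowed from DeOldify’s render_factor; the function name and the fastai plumbing are my own and may need adjusting:

import numpy as np
import PIL
from fastai.vision import Image, pil2tensor

def infer_frame(learn, img, render_factor=40):
    """Run the model on a downscaled square copy, then restore the original size."""
    orig_size = img.size                                 # (width, height)
    sz = render_factor * 16                              # side length of the square
    square = img.resize((sz, sz), resample=PIL.Image.BILINEAR)
    x = Image(pil2tensor(square, np.float32).div_(255))  # wrap as a fastai Image
    pred = learn.predict(x)[0]                           # model output, still square
    arr = (pred.data.numpy().transpose(1, 2, 0).clip(0, 1) * 255).astype('uint8')
    return PIL.Image.fromarray(arr).resize(orig_size, resample=PIL.Image.BILINEAR)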

Second, I adjust brightness. This isn’t really a hack; it’s more of a mistake that I’m correcting manually. For some reason, model inference produces images that are very low in brightness.

I suspect it’s something to do with the softlight filter we used in ffmpeg earlier, but for now I’m correcting it manually. I’ll need to look into this further.
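The correction itself is just a uniform brightness boost with PIL; the factor is something you tune by eye rather than anything principled:

from PIL import ImageEnhance

def brighten(img, factor=1.5):
    # Compensate for the darkened model output; 1.5 is a guessed factor, tune by eye.
    return ImageEnhance.Brightness(img).enhance(factor)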

Third, I’m using matplotlib’s save functionality. fastai’s image saving gave me very weird results (Luke’s clothes were fluorescent blue and red), but strangely, matplotlib’s gives me okay results. I’ll need to look into this. I suspect I may be losing some image quality by going through matplotlib’s savefig functionality.
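Concretely, the save step is just matplotlib’s imsave on the predicted frame. The variable names here follow the earlier sketches and are placeholders:

import numpy as np
import matplotlib.pyplot as plt

# frame_img is one extracted Star Wars frame loaded as a PIL image.
out_frame = brighten(infer_frame(learn, frame_img))
plt.imsave('output/frame_000001.jpg', np.asarray(out_frame))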

Here are some of the outputs from the model.

Images by author
Image by author
Image by author

Then I had to stitch all these frames back together into a video. I initially used ffmpeg for that, but it ended up overloading my RAM. Instead, I used OpenCV’s VideoWriter.
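A minimal version of that stitching step with OpenCV, assuming the remastered frames were written as numbered JPEGs (the frame rate and paths are placeholders):

import cv2
from pathlib import Path

def frames_to_video(frame_dir, out_path, fps=24.0):
    """Stitch numbered frames back into a video without holding them all in memory."""
    frames = sorted(Path(frame_dir).glob('*.jpg'))
    h, w = cv2.imread(str(frames[0])).shape[:2]
    writer = cv2.VideoWriter(str(out_path), cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
    for f in frames:
        writer.write(cv2.imread(str(f)))
    writer.release()

frames_to_video('output', 'star_wars_remastered.mp4')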

Here is the final output.

Video by author

And the original video

Video by Marcelo Zuniga

Improvements

1. The sky needs a bit more work, but I like the vibrancy of the background. That’s an interesting (and completely unplanned) effect. The goal was to remove the ‘cue marks’ (the annoying black specks) from the video, and I think it’s done okay in that respect, but there’s still more to do. I like how the network has intensified the sun, though. It completely changes the scene between Luke and Biggs when Biggs says he’s joining the Rebellion.
Original Frame (left). Output from network (right)

2. There’s a weird horizontal bar that shows up around the 22-second mark. I didn’t add any horizontal bars to the training set, so it’s completely understandable that the network didn’t remove it. In the future, I’ll need to add horizontal bars to my training set to fix these.

3. I’m also thinking of doing more super-resolution on the video. It would be nice to show a young Luke Skywalker in high quality. To do that I could downscale the images further before training; I’ve already downscaled them once, but potentially I could go further. Alternatively, I could use a ready-made upscaler such as VapourSynth. That is probably the best option, since the original video is already of poor quality.

4. Inference is also an issue. It tends to overload memory and crash, which is why 42 seconds is the longest clip I could produce. I’m not completely sure how to solve this yet, but I’ll need to if I’m going to use this further.

So much to do!

Full code available on Github

Originally published at https://spiyer99.github.io on June 19, 2020.
