The magazine that writes itself

We asked an AI to write about the future. Here's what it has to say, and the steps we took to build it.


by Kathryn Lawrence & Ann Kidder

[Image: A screenshot of MONTAG magazine's homepage at montag.wtf]

TL;DR (or if you don’t care how it’s built, and just want to read some articles about the future written by AI): Yes, the magazine can write itself, but it still needs a human publisher. Check it out at montag.xyz

The source material for this Spell-enabled project is MONTAG Magazine, an online and print magazine of stories about how technology will change the way we live. MONTAG started in 2017 as the brainchild of Grover, a Berlin-based startup that rents out consumer electronics on flexible monthly terms. We started MONTAG to look beyond the tech devices available today and think more deeply about the effects of technology on society.

People often ask why we distribute such forward-thinking content in such an archaic format: what place does print media have in the digital age? And we often asked ourselves: how will we work with technology in media and the creative industries in the future? How will interactions with AI influence human creativity?

This project was conceived to answer some of these questions, particularly in light of rapid advances in AI's capacity to write convincingly, such as OpenAI's GPT-2. After covering neural network-generated art and text in MONTAG Issue 3, "Coding Creativity," I decided to take a crack at using a neural network to produce a magazine that writes itself.

  • Step one, web scraping using shell scripts and Python
  • Step two, training Max Woolf's gpt-2-simple using Spell.run
  • Step three, taking text samples

Step 1: Scraping and cleaning the article text

Most data scientists will tell you they spend far more time building good data sets (collecting, organizing, and cleaning their data) than doing anything fun or useful with it. Unfortunately, this is also the case with the text data we'll need to train the neural network.

As one of the magazine’s writers, I could theoretically go into our content management tool (MONTAG uses Ghost, a very cool open source publishing platform) and copy out all the text manually, but it will be way easier to scrape the text data off of the website. Plus, if I want to repeat this process when new articles have been written or use it for another online blog or magazine, it’ll be much more replicable.

To scrape the data from the website, I’m alternating between writing shell scripts which ask my computer to download HTML pages and Python scripts that read those HTML pages for me and find the information I need. Using this process, I just have to find all of the article links, download each article as an HTML page, and then convert all the HTML page data to a text file. I’m going to use Spell for this from the jump, so that all the text data lives in the same place we’ll be doing the processing, and my computer doesn’t get full of HTML files — if we were scraping a much bigger text corpus, that could be a problem.

First, I’ll need to sign up for and log in to a Spell account (more on how to do that here). Then I’ll install the Spell command line interface and log in with the username and password I created on sign up:

$ pip install spell
$ spell login

Then, I’ll create a folder to house any scripts I write for my project.

$ mkdir montagproject
$ cd montagproject

We’ll leave the folder empty for now, and add our scripts later on in this article.

Looking at the site I want to scrape, which in this case is www.montag.wtf, I see the pagination at the bottom says “Page 1 of 27,” so that’s great, now I know how many pages I need to grab. Here, I’ll write a script which will go through every page on montag.wtf and store it in my Spell run. Since this script doesn’t require GPU, we can just run it on Spell’s default CPU machine.

$ spell run 'for p in `seq 1 27`; do curl https://www.montag.wtf/page/$p/ -o montagpage$p.html; done'
Could not find a git repository, so no user files will be available. Continue anyway? [Y/n]:y
💫 Casting spell #1...
✨ Stop viewing logs with ^C
✨ Machine_Requested... done
✨ Building... done
✨ Run is running
% Total % Received % Xferd Average Speed Time Time Time Current
.
.
.
✨ Saving... done
✨ Pushing... done
🎉 Total run time: 30.39674s
🎉 Run 1 complete

Voila, a few seconds later I have 27 HTML files. We can view them by using the ls command. If you want to download any of the files (or all of them) you can use the spell cp command, which works on either a path to a directory or a path to a file:

$ spell ls runs/1
35 May 21 18:48 montagpage1.html
12228 May 21 18:48 montagpage10.html
12166 May 21 18:48 montagpage11.html
12224 May 21 18:48 montagpage12.html
$ spell cp runs/1/montagpage1.html
✔ Copied 1 files

The links to the articles are hiding somewhere on these pages. I can look at the first HTML page in this folder to see how I’m going to extract the links to the articles using BeautifulSoup, and prepare another script to scrape the article pages.

It looks like each article link is inside a div with the class homepage-post-info, so first we ask BeautifulSoup to find all of those. Then, for each element it finds, we ask it to find all of the anchor tags inside, and for each anchor tag we extract the href, which is our relative link to the article. The next part opens the file for our new shell script, which will download all the article HTML pages, and appends a line to fetch each article into a new HTML file.

The Python script to extract the links for the articles and create a new shell script to download them is below. Let’s copy the code into a file in our montagproject folder called montagarticlesoup.py.

from urllib import urlopen
from bs4 import BeautifulSoup
import os
import sys

count = 1
montagpages = os.listdir(sys.argv[1])
path_to_pages = sys.argv[1]

for htmlFile in montagpages:
    textPage = urlopen(path_to_pages + "/" + str(htmlFile))
    text = textPage.read()
    soup = BeautifulSoup(text, features="html.parser")
    articles = soup.find_all(class_="homepage-post-info")
    for x in articles:
        links = x.find_all("a")
        for a in links:
            link = a['href']
            # Append a curl command for this article to the new shell script
            f = open("getmontagarticles.sh", "a")
            f.write("curl https://www.montag.wtf" + link + " -o montagarticles" + str(count) + ".html\n")
            count += 1
            f.close()

To run the script, we’ll first need to commit our file to git. This isn't necessary if you're running locally, but since we're going to run on Spell we'll need our files in a git repository. We'll also need to init the git repo since it's our first time using it.

$ git init && git add * && git commit -m "add montagarticlesoup.py"

Then, to run our script, use the following command. You have to pass in one command line argument, the name of the directory with our MONTAG pages. The argument goes right after the name of the Python file.

Don’t forget to replace runs/1 below with the number of the run you ran above!

If you’ve forgotten the run number, you can look it up on web.spell.run/runs or using spell ps.

$ spell run -m runs/1:pages --pip urllib3 --pip beautifulsoup4 --python2 "python montagarticlesoup.py pages"
.
.
.
✨ Saving... done
✨ Pushing... done
🎉 Total run time: 23.557325s
🎉 Run 2 complete

Another nice thing about running this whole project on Spell is that I don't need to install anything on my computer, or use a virtual environment to collect all the dependencies. I use Spell's --pip flag to add the packages I need.

I also need to use the -m flag to mount my files from the previous command. If you'll recall, all the html files are in the output of run 1 (or, if you've been playing around with Spell a bit, whatever the run number is for when you did the first step). Then I indicate I want to mount those files into a directory called pages in my new run. Lastly, I'll pass that directory name, pages, in as my command line argument so my script knows where to look for the files.

When the run is done, we can check its output. If we ls the run, we'll see our getmontagarticles.sh file - nice!

$ spell ls runs/2
14542 May 22 14:50 getmontagarticles.sh

Let's run the new script to get all the articles. You might need to change the permissions on your script before you can run it; I've done that below with the chmod command. Don't forget to replace 2 with the number from your run.

$ spell run -m runs/2:script "chmod +x script/getmontagarticles.sh; script/getmontagarticles.sh"
.
.
.
🎉 Run 3 complete

We should find that our run output now has over 100 HTML pages in it. That’s a lot of articles we now have to get sweet, sweet text data from to train the neural network.

There’s just one more soup we have to make before we’re home free with all the text data. This time we’re going to go through all of those articles and pull out everything that is text and write it to a text file called montagtext.txt - no links, no images, no formatting, no problemo. Let's call this montagtextsoup.py:

from urllib import urlopen
from bs4 import BeautifulSoup
import os
import sys

articlepages = os.listdir(sys.argv[1])
path_to_pages = sys.argv[1]

for htmlFile in articlepages:
    if ".html" in htmlFile:
        textUrl = path_to_pages + "/" + htmlFile
        textPage = urlopen(textUrl)
        text = textPage.read()
        soup = BeautifulSoup(text, features="html.parser")
        articletext = soup.find_all(class_="content-text")
        for element in articletext:
            words = element.get_text().encode('utf-8')
            # Append the article text to one big training file
            f = open("montagtext.txt", "a")
            f.write(words)
            f.close()

Add this to your montagproject directory, and don't forget to commit it before you run it.

$ git add . && git commit -m "add montagtextsoup.py script"

Then run:

$ spell run -m runs/3:articles "python montagtextsoup.py articles" --pip urllib3 --pip beautifulsoup4 --python2
.
.
.
🎉 Run 4 complete

Bingo bango, we now have a 1.3 MB plain text file of training data!

Since we’re going to need this file a lot, let’s create a link to it so we don’t need to memorize the run id when mounting it.

$ spell link montag_text_file runs/4/montagtext.txt
Successfully created symlink:
montag_text_file → runs/4/montagtext.txt Jun 11 10:59

Step 2: Feeding the text to a neural network

Now we can start playing with Spell's GPUs. To get the hang of this, I watched some of the videos on Learn.Spell.run and found the article Generating RNN Text on Spell really helpful.

The GPT-2 code we're going to be using is in this repository: https://github.com/minimaxir/gpt-2-simple.git. You don't have to clone it or add it locally to use it on Spell.

I'm going to add a Python script to my montagproject folder called run.py, which is going to call on the power of GPT-2 and fine-tune the 117M model on our MONTAG text data:

import gpt_2_simple as gpt2
import sys

model_name = "117M"
file_name = sys.argv[1]

gpt2.download_gpt2(model_name=model_name)  # model is saved into current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,              # name of the file with our text data
              model_name=model_name,
              steps=1000)             # steps is max number of training steps
gpt2.generate(sess)

I commit my git repo again, noting that I added the run.py script.

$ git add . && git commit -m "add run.py script"

It’s the moment of truth, time to make the biscuits.

The last argument here, -t k80, specifies that we want to use Spell's K80 GPUs, which will be much faster than using CPU power - and Spell gives you $10 of free credits when you sign up, so this will be both fast and free.

$ spell run --pip gpt_2_simple "python run.py montag.txt" -m montag_text_file:montag.txt -t k80

If all is going as planned, you’ll start seeing a lot of output, which means it’s alive, it’s aliiiive!

This is the beauty of Spell — you don’t have to worry about watching your computer to make sure it doesn’t catch fire or take flight. In fact, you can keep using your computer normally the whole time the neural network is training. Since we selected a K80, it should go faster than a CPU, but there's still enough time to get up and do a little celebratory dance while it works.

Step 3: Sampling the text

One hour, 14 minutes, and 55 seconds later, GPT-2 has finished training. Looking at some of the outputs that were produced along the way, we know it’s on a good path. Here are two of the most convincing snippets from samples:

101: "The idea of a currency is exciting, but not at all convincing. As a modern society, we often try to imagine how much food and drink we would need in order to live - which, no matter how much you drink, doesn't always seem as important as what someone else wants, either. So, in the past two millennia, we have tried to create things that are like any existing economy, but with fewer and fewer laws and regulations, and easier access to credit, buying and selling seems like simple economics.In reality, we're doing things the old-fashioned way, creating new currencies for buying and selling, but with a big emphasis on flexibility and anonymity."
501: "The future (with some exceptions) tends to bang itself around a little more, so sometimes the combination of tech and mindfulness that accompany truly being constantly alive gets you really high.
For me, that combination of being constantly alive is extremely important to the success of my company. Back in 2014, I completed a study by Citi that found that mindfulness meditation led to lower prices for wine and a better experience, and that engaging with material things was linked to less dopamine release.And while a healthy and happy life is fun and hip, a study by Mindful reported an 80% decrease in anxiety and panic attacks among suitably excited customers.So while you're at it, join me as I meditate on a beautiful day: for a low, gain access to plenty of happiness in all you potential customers I can influence into spending less time on pointless things like Netflix, I can boost creativity, and my life will be more fulfilling.Open Happiness.Find it on Twitter.Youtube.And don't miss the chance to get involved!"

That’s good content marketing material if I’ve ever seen it. Now we want to create a small script that will let us sample the text whenever we want and tweak things like the temperature, which (put very, very simply) controls how closely the output matches the data — lower temperatures should produce more realistic writing, and higher ones, weirder stuff.
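
To make the temperature knob concrete, here is a minimal sketch of the general idea (not the actual gpt-2-simple or TensorFlow internals): the model's raw next-token scores are divided by the temperature before being turned into probabilities, so low temperatures sharpen the distribution and high temperatures flatten it.

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution (more predictable text);
    # higher temperature flattens it (weirder, more surprising text).
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Toy example: three candidate tokens with raw scores 2.0, 1.0, and 0.1
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=0.5))  # almost always picks token 0
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=1.5))  # picks tokens 1 and 2 more often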

One problem we run into using Max Woolf's GPT-2 implementation is that load_gpt2 and generate, two commands which we see in the README example, don't work with Spell without a little fiddling. After taking in run_name as a parameter, the next part of this function tries to find the trained data under checkpoint/run1 by default. We want to point the load and generate functions to the outputs we created in the training run on Spell instead, so here's what we have to do:
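
(As an aside: newer releases of gpt-2-simple expose run_name and checkpoint_dir arguments on both functions, so if your installed version supports them, a sketch like the one below would let you point at a non-default checkpoint location directly. The mounting approach that follows doesn't rely on this.)

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
# run_name and checkpoint_dir default to "run1" and "checkpoint";
# check your installed gpt-2-simple version before relying on these arguments.
gpt2.load_gpt2(sess, run_name="run1", checkpoint_dir="checkpoint")
gpt2.generate(sess, run_name="run1", checkpoint_dir="checkpoint")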

First create a Spell link to mount the checkpoint from the run you trained (in my case, it was Run 5). This isn't strictly necessary but it makes it easier to keep track of the run:

$ spell link montagmodel runs/5

Next, we’ll put the script to generate new text from a checkpoint (which you can find in the README.md of the repo) into a file called generate.py:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)
gpt2.generate(sess)

When we run generate.py, we mount the checkpoints from the trained model in run 5 so the script can find them. Don't forget to commit the new file to git first:

$ git add . && git commit -m "add generate.py"
$ spell run --pip gpt_2_simple "python generate.py" -m montagmodel/checkpoint:checkpoint

Now that I've made sure the script can find the trained models and generate text from them, let's tweak generate.py a little, changing the temperature and producing more samples to see if we can get a more realistic article. This code creates 20 samples each at temperatures 0.5, 0.7, 1.0, and 1.5, and writes each batch to its own text file.

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)
gpt2.generate(sess)
gpt2.generate_to_file(sess, destination_path='montagarticlernn-temp0-5-20samples.txt', nsamples=20, temperature=0.5)
gpt2.generate_to_file(sess, destination_path='montagarticlernn-temp0-7-20samples.txt', nsamples=20, temperature=0.7)
gpt2.generate_to_file(sess, destination_path='montagarticlernn-temp1-20samples.txt', nsamples=20, temperature=1.0)
gpt2.generate_to_file(sess, destination_path='montagarticlernn-temp1-5-20samples.txt', nsamples=20, temperature=1.5)

And to run, we use our same command from before:

$ spell run --pip gpt_2_simple "python generate.py" -m montagmodel/checkpoint:checkpoint

Here are a few article samples from the most conservative temperature, 0.5:

What appears to be an explanation of a new YouTube trend and modern art movement manifesto, which is particularly impressive because it includes a Twitter picture link and caption attribution to a Berlin art museum which does not exist:

"Youtube Video - "Bombing"Bombing is when an utterly stupid idea, quickly borne from parents' and peers' knowledge base, sets off alarm bells.Bombing is when a concept, idea or practice - known in some countries as "art" - breaks through the normal rules of normal human behaviour and convinces - or induces - us to act.Bombing movies, TV shows, games - many more are likely to be flaunted around the world than art.So if you really want to see how susceptible we are to the ravages of modern art-based violence, look no further than the Brutalist fortress city of Dusseldorf, where the city walls have been razed to the ground and graffiti-covered lampposts have been torn up and flung into the distance; where standing ovation is still the most effective form of resistance.But don't be alarmed if art-based paleo-ethno-blasting gets itself into a bit of a quandary. Berlin's Trafigura Museum has started a fund-raiser to pay for wall-removal, claiming $150,000 as their "reward" for completing and destroying a 60-foot-tall wall in the centre of the museum.pic.twitter.com/7xilOswu1h - The Trafigura Museum (@TheTrafigura) August 8, 2017The fact that this kind of art-based annihilation is so readily available to a fraction of the population is down to a stubborn refusal to take risks, a stubborn refusal that is partly understandable, partly reflects the fragility of local democratic culture, and partly reflects the desperation of the ruling class, who are determined that any resistance is met with force and violence.Anti-art, anti-future, anti-art, anti-future. The only thing standing between us and this kind of apocalypse is a willingness to take the risk.The day has come."

Speculations on the future of artificial intelligence, by artificial intelligence:

"OMFG, what is the future of artificial intelligence?Let's keep these two threads running for as long as possible. AI generally likes to think of itself as a race to be first in the race, and if it can be thought of as a, well, any race, then so be it. But human intelligence also likes to think of it as a race to be free. To be the person who is not afraid to die, who doesn't want to be drained of all sense of self, and who doesn't want to be anything but.And the more you think about it, the more you'll understand why automation is bad for our planet and why deep learning is bad for us.And it's not even just the jobs that are going to be automated: the people who are unaugmented by their jobs and are actively working to make their jobs obsolete. It's the automation of the cultural sphere, at least for a while.What about the other jobs too? Those that are situational unemployment insurance workers, who either can't or won't work due to the lack of cultural capital or cultural influence to live elsewhere, until they find a better life elsewhere.These people are going to do anything to survive and thrive in a technologically-driven world, and even if they survive, they're doing it in a way that makes them less than human. And they're not going to get a job or a raise any time soon."

From samples at Temperature 1.0:

Musings about the future of capitalism and automation with a byline by Federico Zannier (a software developer who sold his own data on Kickstarter in 2013 — guess he virtually freelances for our AI now!), and a made up quote from expert Dan Merica, a political reporter for CNN. As convincingly as it starts, however, this article does devolve into a recipe:

"By Federico ZannierWhat does the future have in store for us after billions of Euro has been gambled on global capitalism? One company has taken advantage of this reality, opening factories in China, and setting up shop in one of the world's most sweatshops - cutting the work environment for Chinese workers, and for the global capitalist project. Workers in China start cashing in on their garment and real estate investments, and after years of being let off the hook for unpaid internships and part-time work, are finally being punished for their collective labor and working conditions. Workers in Costa Rica, chipping in for ethical sweatshows, and setting up shop in sweatshops all over the world. Open Capitalism Now!But is it really open for capitalism to go to a sweatshop and begin to innovate? Maybe it's not capitalism that takes the risk, it's open work, capitalism in action.Some are afraid that, if allowed to go ahead, the next few years will be one of the most dangerous for capitalism. Over at MONTAG, Dan Merica warns of the dangers of automation and digital wage manipulation:"Americans are going to be driving the decision to buy a car - and many are predicting that as more automation and digital labor labor comes to the U.S. they'll be the ones who are hardest hit. And their predictions are off, at least for the summer."And Mark Giustina at technobabylontips says something vaguely optimistic about the future of work: "The worst fears of automation workers - who want strict control over their jobs before they ever have a chance to build their own company - are unfounded."Return to text version: this article originally ran on MONTAGIn the second of my looks into tomorrow's tomorrow's sweatshop-like work, I want to see how robots may fare in the kitchen. This part is tricky. Knock down a tomato and bake it for about 500 minutes, then fry it in a fragrant butter sauce until golden brown. Repeat. Flavor it up."

And speculations about a cryptocurrency called Minie-Pants, which is as good a name for an altcoin as any:

"TL;DR: There's no point in hoarding currencies in the first placeOnly after discussing the likelihood with a crypto evangelist, would we not want to write a very detailed analysis of the individual reasons why, even with all of the giddiness and fanciness of the concept, it is hard to picture putting two and two (even though there are plenty of them in all of us).The death of currency would cease when most people no longer needed it, and cryptocurrencies would be the wrong entity to mint. We need instead a new coin that we can knock around and use to ferry messages between millions.There is already an utterly fascinating array of altcoins waiting in the wings, and none of them are nearly as obscure as the Minie-Pants. Oh, the lost alloys Tenka and The Bull.They too are trying to mint a new currency, and likelihood of success is as good: if the altcoin is truly effective at changing how people spend USD, it must also be persuasive enough to convince potential buyers that it is a good investment.And yet, the Minie-Pants seem more convinced than I.T. pros that they're still in the early days of crypto. So they seem to be genuinely interested in the coin, and can see clearly when to take it out of proportion. They also seem to be fundamentally wrong about taking money out of thin air.Why? The coins and concepts that make them tick all have a lot to do with money's freeze-and-reset cycle: everything from debit cards to US currency, including Bitcoin, has to be part of the normal economy to keep the system running smoothly.The Minie-Pants are fundamentally unfair: they want to make sure everyone has a USB stick to use Bitcoin at all times, and don't want governments, banks and central banks to be sending coins to those countries where there are no internet services or banks to be found.They want to make sure no-one has access to money, and neither do we. But to this we say: Shut the fuck up, girl!"

While this neural network is clearly a fairly competent writer on future-thinking tech topics like cryptocurrency, artificial intelligence, and labor automation, it hasn’t learned to upload articles and hit the publish button yet — but with a little more programming, this could be easily accomplished.
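
For the curious, here is roughly what that "little more programming" could look like: a rough, untested sketch that assumes Ghost's Admin API plus the requests and PyJWT libraries, with GHOST_URL and GHOST_ADMIN_API_KEY as placeholders you'd fill in from your own Ghost install (the exact endpoint path and token audience vary by Ghost version).

import time

import gpt_2_simple as gpt2
import jwt        # PyJWT
import requests

GHOST_URL = "https://your-ghost-blog.example"        # placeholder
GHOST_ADMIN_API_KEY = "admin_key_id:admin_secret"    # placeholder, from a Ghost custom integration

# Generate one article from the fine-tuned model (assumes the checkpoint is available, as above)
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)
article_text = gpt2.generate(sess, temperature=0.5, return_as_list=True)[0]

# Build a short-lived token for Ghost's Admin API
key_id, secret = GHOST_ADMIN_API_KEY.split(":")
now = int(time.time())
token = jwt.encode(
    {"iat": now, "exp": now + 300, "aud": "/admin/"},  # audience differs on older Ghost versions
    bytes.fromhex(secret),
    algorithm="HS256",
    headers={"kid": key_id},
)
if isinstance(token, bytes):  # PyJWT < 2 returns bytes
    token = token.decode("utf-8")

# Create the post as a draft, so a human publisher still gets the final say
requests.post(
    GHOST_URL + "/ghost/api/admin/posts/?source=html",
    json={"posts": [{"title": "Written by a neural network",
                     "html": "<p>" + article_text + "</p>",
                     "status": "draft"}]},
    headers={"Authorization": "Ghost " + token},
)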

Hold on… did I just automate myself out of a job?

Until the AI decides I’m obsolete, you can read a magazine written by AI, published by humans, at montag.xyz.
