New York Seeks Haikus: Generating Haikus from NYC Government Job Descriptions

Six years ago, New York City passed a law requiring city agencies to make their data publicly available. Since then, over 1,600 datasets have been made available on the city’s open data portal and new data is constantly being made available.

Open Data Week – a city-wide endeavor by the NYC Mayor’s Office of Data Analytics – was a celebration of this progress, and coincided with the Data Through Design exhibit. The goal of this exhibit was to challenge artists to use this open data to tell insightful and interesting stories about the city.

As a native New Yorker, data scientist and civil servant, I was drawn to this challenge, as it is my job to use the city’s data in creative new ways.

It takes 400,000 government employees to keep New York City (the United States’ most populous metro area) running. From maintaining infrastructure and providing emergency services to creating innovative methods for reducing waste, homelessness and crime, it all comes down to them.

So, for Data Through Design, I wanted to tell the story of those 400,000 individuals working behind the scenes and what they do to keep my hometown alive and thriving. Eventually, this concept evolved into a program that algorithmically generated haikus based on job descriptions for city government jobs.


New York City Seeks

The data comes from the NYC Jobs dataset. It contains current job postings available on the City of New York’s official jobs site. Most of the data on the Open Data portal is about what happens in the city and what is done by the city. This dataset offers a glimpse into who and what is behind those things.

Each job posting contains information such as agency, salary range and posting date. For this project, I used three columns: civil service title, job description and job qualifications.

Additionally, I used a dataset of per-word syllable counts from Jim Kang’s phonemenon. I had to modify the data to include non-standard words such as agency acronyms (e.g. NYPD, 4 syllables, and DOT, 3 syllables).
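As a rough illustration of that modification, the sketch below merges a base syllable table with manual acronym entries. The table contents and the function names are hypothetical; only the NYPD and DOT counts come from the text above.

```python
# Hypothetical stand-in for the external syllable dataset.
BASE_SYLLABLES = {"data": 2, "analysis": 4, "city": 2}

# Manual additions for agency acronyms, which are read letter by letter.
ACRONYM_SYLLABLES = {"nypd": 4, "dot": 3}


def build_syllable_table(base, extra):
    """Merge the base table with manual entries (manual entries win)."""
    table = {word.lower(): count for word, count in base.items()}
    table.update({word.lower(): count for word, count in extra.items()})
    return table


def count_syllables(text, table):
    """Sum syllables for a whitespace-separated phrase; None if a word is unknown."""
    total = 0
    for word in text.lower().split():
        if word not in table:
            return None
        total += table[word]
    return total


SYLLABLES = build_syllable_table(BASE_SYLLABLES, ACRONYM_SYLLABLES)
print(count_syllables("NYPD data", SYLLABLES))
```

Returning `None` for unknown words is one possible design choice: it lets the generator skip words it cannot measure instead of guessing a count.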


Generating Haikus

The goal was to use NYC job descriptions to create haikus. Originally a Japanese poetic form, haikus are poems which contain three lines with five syllables in the first line, seven in the second and five in the third.

To produce a haiku, I used a custom Markov chain method. A Markov chain generates a sequence by choosing each next value based only on the current value and the probabilities of what follows it. In this case, given a word, what words are likely to follow?

The first step is to determine these probabilities. I divided the data up by civil service title (Computer Systems Manager, Painter, Civil Engineer, etc.) and for each, built a separate corpus from the text in the job description and preferred skills fields. Then I split the corpus into sentences, divided each sentence into words, and counted the number of times one word followed another.
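The counting step can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name, the regex-based sentence and word splitting, and the sample corpus are all assumptions. In practice one such table would be built per civil service title.

```python
import re
from collections import Counter, defaultdict


def build_transitions(corpus):
    """Count how often each word is followed by each other word, per sentence."""
    transitions = defaultdict(Counter)
    # Naive sentence split on ., ! or ?; real text may need a proper tokenizer.
    for sentence in re.split(r"[.!?]+", corpus.lower()):
        words = re.findall(r"[a-z']+", sentence)
        # Pair each word with the word that follows it in the same sentence.
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


# Made-up sample text standing in for one title's corpus.
sample = "Analyze data sets. Maintain data quality. Review data sets."
transitions = build_transitions(sample)
print(transitions["data"].most_common())
```

Counting within sentences only means the last word of one sentence is never treated as leading into the first word of the next.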

The following example shows the most common words to follow "data" for a Computer Systems Manager. Given a table like this, the Markov chain will pick a random next word weighted by the probabilities. It will then take that resulting word and repeat the process again and again.
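A minimal sketch of that weighted pick, using Python's standard `random.choices`. The counts below are invented for illustration and are not the figures from the project; the function names are hypothetical.

```python
import random
from collections import Counter


def next_word(followers, rng=random):
    """Pick a follower at random, weighted by how often it was observed."""
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(transitions, start, max_words, rng=random):
    """Walk the chain from `start` until a dead end or `max_words` is reached."""
    words = [start]
    while len(words) < max_words and transitions.get(words[-1]):
        words.append(next_word(transitions[words[-1]], rng))
    return words


# Invented follower counts, for illustration only.
transitions = {
    "data": Counter({"analysis": 3, "sets": 1}),
    "analysis": Counter({"tools": 1}),
}
print(generate(transitions, "data", max_words=3))
```

With these counts, "analysis" is three times as likely as "sets" to follow "data", but either can be chosen on any given run.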

Since I was generating haikus, I had a strict syllable constraint. I only considered the next word if it fit within the syllable limit. For example, if I was on the first line (5 syllables) and my current word was "data" (2 syllables), I wouldn’t choose "analysis" or "integration" (4 syllables each) as the next word because either would put the line over 5 syllables.
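The filter described above might look something like this. It is a sketch under assumptions: the function name and the follower counts are hypothetical, and words without a known syllable count are simply dropped.

```python
def valid_followers(followers, syllables, used, limit):
    """Keep only followers with a known syllable count that still fit in the line."""
    remaining = limit - used
    return {
        word: count
        for word, count in followers.items()
        if word in syllables and syllables[word] <= remaining
    }


syllables = {"data": 2, "analysis": 4, "integration": 4, "entry": 2, "sets": 1}
# Invented follower counts for "data", for illustration only.
followers = {"analysis": 5, "integration": 3, "entry": 2, "sets": 1}

# First line of a haiku: 5 syllables, 2 already used by "data".
options = valid_followers(followers, syllables, used=syllables["data"], limit=5)
print(options)
```

The surviving counts can then be used as weights for the random pick, so the relative likelihood of the remaining words is preserved.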

During this process the generator would at times write itself into an impossible state, when there were no valid choices within the syllable limit. In this case, it would go back and try a new word to see if it would lead to a valid haiku.
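One way to implement this kind of retry is a depth-first search with backtracking. The sketch below is an assumption about the general approach, not the original code; the names, the tiny transition table, and the rule that words may not straddle a line break are all choices made for illustration.

```python
import random

LINE_LIMITS = (5, 7, 5)  # syllables per haiku line


def generate_haiku(transitions, syllables, start, rng=random):
    """Return three lines (lists of words) forming a 5-7-5 haiku, or None."""

    def extend(lines, used):
        # `used` is the syllable count of the last (in-progress) line.
        line_index = len(lines) - 1
        if used == LINE_LIMITS[line_index]:
            if line_index == len(LINE_LIMITS) - 1:
                return lines  # all three lines are full
            lines, used = lines + [[]], 0  # start the next line
            line_index += 1
        remaining = LINE_LIMITS[line_index] - used
        previous = lines[-1][-1] if lines[-1] else lines[-2][-1]
        followers = transitions.get(previous, {})
        # Unknown words get a count of 99, so they never fit.
        candidates = [w for w in followers if syllables.get(w, 99) <= remaining]
        while candidates:
            weights = [followers[w] for w in candidates]
            word = rng.choices(candidates, weights=weights, k=1)[0]
            candidates.remove(word)  # never retry the same word at this position
            grown = lines[:-1] + [lines[-1] + [word]]
            result = extend(grown, used + syllables[word])
            if result is not None:
                return result
        return None  # dead end: the caller will try a different word

    if syllables.get(start, 99) > LINE_LIMITS[0]:
        return None
    return extend([[start]], syllables[start])


# Tiny invented chain in which some paths dead-end and must be abandoned.
transitions = {"open": {"data": 1}, "data": {"open": 3, "and": 1}, "and": {"open": 1}}
syllables = {"open": 2, "data": 2, "and": 1}
haiku = generate_haiku(transitions, syllables, "open", random.Random(1))
for line in haiku:
    print(" ".join(line))
```

Because a haiku holds at most 17 one-syllable words, the recursion stays shallow, and removing each tried word from `candidates` guarantees the search terminates.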

The haikus came out of the process in a raw state – all lower case, no punctuation and sometimes they just weren’t very good. The biggest problem I found was that, because haikus are so short, the results were often incomplete thoughts.

The Markov chain would be in the middle of a sentence when it hit the syllable count and stopped short. I tried to correct for that by only ending with words that could be logical ending words, but it didn’t work for every situation.
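The text does not say how "logical ending words" were identified. One plausible heuristic, shown here purely as an assumption, is to accept only haikus whose final word also ended at least one sentence in the corpus. The function names and the sample corpus are hypothetical.

```python
import re


def sentence_final_words(corpus):
    """Collect every word that ends at least one sentence in the corpus."""
    endings = set()
    for sentence in re.split(r"[.!?]+", corpus.lower()):
        words = re.findall(r"[a-z']+", sentence)
        if words:
            endings.add(words[-1])
    return endings


def ends_cleanly(haiku_words, endings):
    """True if the haiku's last word was seen ending a sentence."""
    return bool(haiku_words) and haiku_words[-1] in endings


# Made-up sample corpus.
corpus = "Create a design and construction process. Work with the department of design."
endings = sentence_final_words(corpus)
print(ends_cleanly(["construction", "process"], endings))
print(ends_cleanly(["design", "and"], endings))
```

A check like this only looks at the last word, so it cannot tell a finished sentence from a fragment that happens to end on the same word.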

For example, both of the following end with "design and construction process" but only the first is a complete sentence:

Work with project leads to create a design and construction process

and

The new york city department of design and construction process

Some of these results were actually amusing:

The environment and the environment and the environment.

and

The city of New York City: open data, open government.

Through a semi-manual, semi-automated editing process, I cleaned up the haikus to get presentable results. The final piece had 750+ haikus.


Adding Audience Participation

I now had the ability to generate haikus, but how should I present them? I wanted to give some insight into how the algorithm worked. To do that, I decided to show the haikus as they were iteratively created. Each word the algorithm tries is shown, and when it hits a fail state it deletes words and tries again.

Exhibit organizer Michelle Ho suggested printing the haikus as they were generated on labels so they could be taken home as souvenirs for Data Through Design exhibit goers.

I added a button that would generate and print the next haiku when pressed.

The result was a whimsical and interactive look at some of the responsibilities and skills of the city’s civil servants.

Here are a few favorites:

The New York City government is a plus but is not required.

Finance is seeking a dynamic intern to function as a team.

Ensure data is accurate, neat, timely and ready for audit.

Profound knowledge of trunk water and wastewater collection system.

Who wants to become part of an information security staff?


_This was originally posted on Data Driven Journalism._

Thanks to Abigail Pope-Brooks for editing and feedback.

_All the code and data used is available on GitHub. Jeremy Neiman can be found doing equally pointless things on his website: http://jeremyneiman.com/_

