Self-Serve Local News

Scaling Empathy in an Unstable Mediascape

Brian Contreras
Towards Data Science

--

The Crisis at Hand

At the dual inflection points of the internet and the Donald Trump presidency, American journalism is in crisis. Social media and the online attention economy have disrupted the economic models and readership patterns that once propped up good reporting, resulting in mass layoffs of talented reporters from print outlets and a swing toward outrage, clickbait and corporate sponsorship in their digital counterparts. Concurrently — and perhaps as a result of the same underlying techno-cultural shifts — the election of the most openly anti-press president since at least Richard Nixon (if not earlier) has rekindled endemic skepticism of media institutions and disrupted the presumption of consensus support for free, independent journalism.

But although it is the big names — the Times and Posts of the nation — that have drawn attention in these tumultuous times, it is at the local level that the brunt of the impact has been borne. A recent study by the Pew Research Center found that American newsroom employment dropped by nearly a quarter between 2008 and 2017, and at 45 percent, that loss was even worse for newspapers. But it is mid-sized papers that seem to have lost the most blood; while national coverage has (ironically) been re-energized since 2016, and hyper-local papers are supported by their tight-knit communities, many city and regional outlets lack both the reader support and economic solvency to keep up pre-internet standards.

This is, it should go without saying, a crisis of democracy. The press is named in the First Amendment not only to ensure the right of journalists to write but also the right of citizens to read. Voting is only truly democratic, after all, insofar as it is informed.

One answer to this issue — among many — is to revitalize city and regional papers through new economic models and coverage methods. A possible method for doing so? By partially outsourcing the production of copy to external partners, time and energy expenditures within newsrooms could be cut and redundancies between newsrooms (as when multiple city papers commit reporters to covering the same story) would be eliminated. Of course, this is already standard for certain types of reporting, as with syndicated wire stories (the Associated Press), multipurpose data tools (ProPublica) or public document repositories (The Intercept). But these options run into problems when it comes to creating a unique local experience. For instance, every city paper that runs a Reuters story about the latest gun control measure runs the same Reuters story, even if the national trends or details it highlights aren’t the ones most relevant to a particular local readership.

Consider a 2015 Pew report which found that “about nine-in-ten residents in each city said they followed news about their local area at least somewhat closely, while about eight-in-ten said the same about neighborhood news.” These readers have a demonstrable interest in being educated about the issues — and yet, without economically viable outlets to meet that demand, it will likely go unmet. By rethinking how journalists mediate between regional and national stories and how their employers allocate resources between coverage of the two, that gap can be filled.

The question of local coverage is also one of journalistic efficacy. By localizing what would otherwise be national stories, abstract issues are made into tangible concerns for readers. A New York Times article about the U.S. drug war might interest readers in, say, Tampa, but that same story adapted to include details specific to Tampa is of much more immediate interest to them. If journalism is in large part a tool of political education, and local politics are those that have the most explicit impact on readers (at least in the short term), then it makes sense for reporters to explore big-picture issues through a localized lens when possible.

The proposition of self-serve local news, then, is this: by creating technical infrastructures and reportorial processes that leverage national-scale resources toward local-scale stories, it’s possible to simultaneously engage with both the economic realities and the content needs of modern local journalism.

Localizing the News

As a proof of concept of this theory, our team chose to work with a FiveThirtyEight article from 2017 by Maimuna Majumder titled “Higher Rates Of Hate Crimes Are Tied To Income Inequality.” Majumder’s article uses data analysis to explore — at the national scale — the connection between hate crimes per capita and economic inequality (as well as, to a lesser degree, education inequality).

We chose to experiment with localizing this article for a number of reasons:

  1. The idea was interesting; especially in the context of Trump’s surprise electoral victory and the spike in hate crimes that followed, Majumder’s thesis piqued our interest.
  2. It had clearly defined claims — hate crimes and economic inequality are related; hate crimes and education levels are related — which could be tested quantitatively.
  3. Majumder made all her data available in a GitHub repository, eliminating many sourcing challenges.
  4. Data about hate crimes, economic inequality and education inequality is inherently geographic, such that our central task of localizing the article made sense given its content.

To begin developing a system through which to localize Majumder’s article, we first created a neutral “template” version of the article from which modified versions could be adapted. This involved removing any date-specific content from the template (as the article was over a year old) and cutting out most of the content explaining the methodology behind Majumder’s statistical analysis. This left us with what we referred to as the base layer.

We then had to aggregate all the data necessary to take Majumder’s story and zoom in to the local level. This local data came from a variety of sources. We sourced demographic data for a given city or state, such as the Gini Index of economic inequality or high school graduation rates, from the annual American Community Survey by the United States Census Bureau. Meanwhile, data on hate crimes came from the FBI’s Uniform Crime Reporting Program, which compiles crime data from local agencies. Lastly, we queried the Southern Poverty Law Center (SPLC) to find hate groups active in each state.

With the exception of the SPLC data, we utilized precomputed summary statistics. For instance, for the graduation rates, the Census Bureau had already gathered details about the educational attainment of every resident in a given city and computed the percentage who attained at least a high school degree. As such, sourcing the data meant downloading these precomputed percentages rather than downloading the educational records of every single resident.

We then created a dataframe in the programming language R that would allow us to test the correlation between hate crimes and both economic inequality and education rates. Joining together data from the FBI (hate crimes), SPLC (hate groups) and U.S. Census Bureau (Gini Index of economic inequality and high school graduation rates), we were left with a robust data set that could explore relationships between different factors over time — for instance, how closely correlated hate crime rates and economic inequality were over several years in any given city or state. We only looked at cities and states with data from at least six years in order to calculate the correlations.
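As a rough illustration of that step, the R sketch below joins hypothetical versions of those tables and computes the per-locality correlations; the file and column names are stand-ins rather than the project's actual code.

```r
# A minimal sketch of the join-and-correlate step, assuming hypothetical CSVs
# and column names (the real data came from the FBI, SPLC and Census Bureau).
library(dplyr)

hate_crimes <- read.csv("fbi_hate_crimes.csv")   # locality, year, hate_crimes_per_100k
census      <- read.csv("census_acs.csv")        # locality, year, gini_index, hs_grad_rate
hate_groups <- read.csv("splc_hate_groups.csv")  # locality, year, active_hate_groups

combined <- hate_crimes %>%
  inner_join(census,      by = c("locality", "year")) %>%
  inner_join(hate_groups, by = c("locality", "year"))

# For each city or state with at least six years of data, measure how closely
# hate crime rates track economic inequality and education levels over time.
correlations <- combined %>%
  group_by(locality) %>%
  filter(n() >= 6) %>%
  summarise(
    cor_inequality = cor(hate_crimes_per_100k, gini_index),
    cor_education  = cor(hate_crimes_per_100k, hs_grad_rate)
  )
```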

Moving from the particularities of regional data to the broad strokes of the stories a journalist might want to tell about that regional data, we settled on eight general types of articles that might be necessary to publish, depending on which city or state a reporter was writing for. These articles were based on three binary values:

  1. Whether or not there was a statistically significant relationship between hate crimes and economic inequality.
  2. Whether or not there was a statistically significant relationship between hate crimes and education levels.
  3. Whether the given region was a city or a state.

thus giving us 2 × 2 × 2 combinations, or eight total possible articles. We conceptualized this stage of article localization as a decision tree whereby relevant details about the data (correlations identified in our data set) and user input (the geography to which the article would be localized) were leveraged to identify which of multiple pre-written article skeletons would be loaded into the system.
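In code, that decision tree amounts to little more than a three-way lookup. The R sketch below shows one way it might work; the skeleton file names and boolean inputs are illustrative assumptions, not the project's actual implementation.

```r
# Map the three binary values onto one of eight pre-written skeleton files.
# File naming is hypothetical; any keying scheme over the 2 x 2 x 2 space works.
choose_skeleton <- function(inequality_significant, education_significant, is_city) {
  scale_label      <- if (is_city) "city" else "state"
  inequality_label <- if (inequality_significant) "inequality" else "no-inequality"
  education_label  <- if (education_significant) "education" else "no-education"
  paste0("skeleton_", scale_label, "_", inequality_label, "_", education_label, ".txt")
}

# Example: a city where only the economic-inequality relationship is significant.
choose_skeleton(inequality_significant = TRUE,
                education_significant  = FALSE,
                is_city                = TRUE)
# [1] "skeleton_city_inequality_no-education.txt"
```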

Each of the eight “leaves” on the resulting tree was identified with one of these skeletons, which were assembled by modifying the base layer article. There were two main ways we approached this modification.

One type of modification was to reorganize the article by moving certain chunks of text up or down according to their relative importance, as determined by which trends were identified as relevant for a given locality. For instance, an article about a city with an identified correlation between economic inequality and hate crimes would want to emphasize that relationship, while one about a state with an identified correlation between education levels and hate crimes would want to emphasize that one instead. This is rooted in the journalistic practice of the inverted pyramid, whereby the most important information in an article goes at the top.
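One way to picture this reordering is as a priority ranking over named chunks of the base layer; the sketch below is a loose illustration with made-up chunk names and priorities, not the skeletons we actually wrote.

```r
# Rank each named chunk of the base layer, boosting whichever relationship is
# significant for the locality, then rebuild the skeleton in that order.
reorder_chunks <- function(chunks, inequality_significant, education_significant) {
  priority <- c(
    lede       = 1,
    inequality = if (inequality_significant) 2 else 4,
    education  = if (education_significant) 3 else 5,
    methods    = 6
  )
  chunks[order(priority[names(chunks)])]
}

chunks <- c(
  lede       = "Hate crimes are on the rise in [CITY]...",
  inequality = "Economic inequality tracks closely with...",
  education  = "Education levels show a weaker relationship...",
  methods    = "The analysis draws on FBI and Census data..."
)

# A locality where only the education relationship is significant: the
# education chunk moves above the inequality chunk.
reorder_chunks(chunks, inequality_significant = FALSE, education_significant = TRUE)
```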

The other type of modification was to identify content-neutral sites within an article skeleton where additional, location-specific information could be plugged in. We conceptualized this stage of article localization as a Mad Libs game. In Mad Libs, the framework of a short story is provided, but spaces are left where certain words can go. However, to make sense within the context of the article — and to emulate natural language — these plugged-in words have to meet certain parameters; most commonly, a part of speech. Similarly, our article skeletons identified what type of data or other information could be automatically inserted into a specific part of the given skeleton. But where a Mad Libs game might call for a verb or noun, the article skeletons might call for the most common cause of hate crimes in State X or the Gini Index in City Y.

First we identified nearly 70 possible data features that could be plugged into an article:

Then we incorporated those data points into our skeletons, based on which would be most relevant when telling a story about a locality that fell into one of the eight bins on the decision tree. This resulted in the eight final skeletons, one of which started with:

The bracketed purple elements are the plugged-in data points, while the green text identifies any copy added onto the base layer. Meanwhile, the bracketed orange features are a secondary type of Mad Libs-style plug-in that calls for more narrative (rather than numerical) details. Because certain locality-specific “values” that one might want to include in an article — such as anecdotes or quotes — don’t (yet) exist in organized databases or spreadsheets, these spots identify to a local reporter where they would still need to go out and do their own reporting. The final output of this system, then, would fill in the purple features with relevant values but leave the orange ones unmodified.
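In R, the fill-in step can be as simple as string interpolation over a templated skeleton. The sketch below uses the glue package with invented placeholder names and values; the actual skeletons and features differ.

```r
# A simplified Mad Libs-style fill-in: numerical (purple) features interpolate
# automatically, while narrative (orange) features are left for the reporter.
library(glue)

skeleton <- paste(
  "In {locality}, the Gini Index of economic inequality stands at {gini_index},",
  "and {hate_crimes_per_100k} hate crimes were reported per 100,000 residents.",
  "[REPORTER: add a quote from a local advocacy group here.]"
)

glue(skeleton,
     locality             = "Los Angeles",
     gini_index           = 0.50,
     hate_crimes_per_100k = 2.4)
```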

Localization in Practice

If a journalist were writing for a paper in Los Angeles, this two-part approach to localization would result in the following decision tree…

… which would lead to the following article skeleton…

… which, when the system plugged in the relevant data points, would output the final article…

With eight possible article skeletons combined with data on 3020 cities, 44 states and the District of Columbia, the result is a flexible system for automatically generating unique, customized articles that could be syndicated to local outlets across the country with relatively little overhead. Although not every aspect of the process is automated — namely, writing the article skeletons and filling in the orange text features — this process does suggest one possible approach to producing localized articles at scale.

Using the JavaScript libraries Mapbox for mapping and D3.js for data visualization, our team developed a web app that makes this system easy to use and accessible for those without a background in computer or data science:

By clicking on their city or state, a resource-scarce newsroom could pull up not only a fleshed-out article to work from…

… but also useful summary statistics…

… locality-specific data visualizations…

… and even a succinct tip sheet that, if the article content were not itself necessary, indicates what might still be worth looking into…

Expanding the Scope

To further develop this proof-of-concept, our team decided to apply the same methodology to a second article: Ben Casselman’s “Where Police Have Killed Americans In 2015,” another FiveThirtyEight piece. Using data from the U.S. Census Bureau and a data project by The Guardian about police killings, Casselman’s article explores the relationship between poverty, race and where in the country police kill the most civilians.

Localizing this narrative followed essentially the same process as before:

  1. Create a neutral “base layer” from initial copy.
  2. Aggregate data and join them together in a dataframe.
  3. Create a decision tree that accounts for scale (state or city) and statistical trends (whether police killings per capita and average household income are above or below the national average), as sketched after this list.
  4. Identify Mad Libs-style features (quantitative and qualitative) that can be plugged in, either from the dataset or by local reporters.
  5. Modify the base layer to account for the decision tree results (eight possibilities again) and Mad Libs features.
  6. Embed this process in a map-based web app.
  7. Incorporate additional data visualizations, summary statistics, tip sheets, etc. as available.
  8. Make the outcomes accessible to journalists.
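For step 3, the statistical-trend check reduces to comparing each locality against national figures. The R sketch below shows one plausible version, again with hypothetical file and column names rather than the project's actual code.

```r
# Flag whether each locality's police killing rate and household income sit
# above or below the national averages; columns are illustrative stand-ins.
library(dplyr)

localities <- read.csv("police_killings_by_locality.csv")
# columns: locality, killings_per_100k, avg_household_income

# Simple unweighted averages; a population-weighted national rate could be
# substituted here.
national_killing_rate <- mean(localities$killings_per_100k, na.rm = TRUE)
national_income       <- mean(localities$avg_household_income, na.rm = TRUE)

trends <- localities %>%
  mutate(
    killings_above_national = killings_per_100k > national_killing_rate,
    income_above_national   = avg_household_income > national_income
  )
```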

Relying primarily on summary statistics raises the question of whether this sort of product could scale to account for massive datasets. For instance, when working with the police killings article, we encountered a dataset containing records of every single police killing in the United States over several years. When the user clicks on a specific locality to see the auto-generated article populated with relevant data, a naive approach would involve querying the dataset for all police killings that happened in the specific city and then computing the appropriate information. However, this approach is computationally expensive and inefficient because the same queries and calculations might be performed many times. A more robust solution would involve precomputing the necessary information for each city and storing that information in a database that maps cities to their specific data, such that each time the user clicks on a specific city, our system would directly return the information without making additional computations.

For instance, the diagram below illustrates what happens when the user clicks on the city Waco, Texas. Each row contains information about a single police killing victim. In the aforementioned naive solution, the program would search through the table to select every row whose city matches Waco. In the robust solution, by contrast, the program would simply look up Waco and return the chunk of data already associated with it. This keeps the system computationally scalable even for large datasets.
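In R, the difference between the two approaches can be sketched in a few lines; the file name and city value here are illustrative.

```r
# One row per police killing victim (hypothetical file).
killings <- read.csv("police_killings_records.csv")

# Naive approach: scan the full table on every request.
naive_lookup <- function(city_name) {
  killings[killings$city == city_name, ]
}

# Precomputed approach: split the table into a named list of per-city chunks
# once, then answer each click with a direct lookup.
by_city <- split(killings, killings$city)
fast_lookup <- function(city_name) {
  by_city[[city_name]]
}

fast_lookup("Waco")
```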

The Big Picture

For a newsroom strapped for cash, suffering from multiple bouts of layoffs and working to maintain reader trust in an environment geared for the opposite, this sort of system could — with the infrastructural and reportorial backing of an organization like the AP or ProPublica — outsource a lot of the menial labor involved in standard reporting processes. Currently, work like data acquisition, data cleaning, statistical analysis, article writing, source editing, copy editing, graphic development and more is a significant strain on newsrooms, and can also create labor redundancies across multiple newsrooms that cover similar issues. Were that work to be centralized, it would lift a considerable burden off of media outlets already under pressure to produce lots of high-quality content with limited resources.

As with any automation, there is the risk that newsrooms would see this as an opportunity to cut staff. But smart ones, at least, would see it for what it is: a chance to free up talented reporters for more important work, like empathy-stoking human interest stories or in-depth investigative pieces that hold those in positions of power accountable. Done intelligently and with passion, the end result would be a stronger, smarter, more effective mediascape at every level of the American free press.

Notes

This project was created for the Stanford University class “Exploring Computational Journalism” (CS206/COMM281) in Autumn 2018, in concert with the Brown Institute for Media Innovation.

Both web apps can be accessed here: http://web.stanford.edu/~sharon19/.

Special thanks to:

Profs. Krishna Bharat, Maneesh Agrawala and R.B. Brenner for providing insight, guidance and expert advice over the course of this project;

Maimuna Majumder, Ben Casselman and FiveThirtyEight for their incisive reporting and accessible data;

Prof. Cheryl Phillips, Dan Jenson and the Stanford Open Policing Project for their help with an early iteration of this project;

and everyone else involved in making this work a reality.

Image citations:

Graphic 1: Pew Research Center (link)

Graphic 5: Sam Spurlin, Medium (link)

Data sources include:

The Federal Bureau of Investigation

The Guardian

The Southern Poverty Law Center

The United States Census Bureau

The Washington Post

If you are interested in learning more about this work, those involved can be contacted at:

Sharon Chen (sharon19@stanford.edu)

Brian Contreras (brianc42@stanford.edu, @_Brian_Contreras_)

Daniel Huang (dhuang7@stanford.edu)
