What is “good enough” for automated fact-checking?

Automated fact-checking is coming, but we can’t agree on basic definitions and standards.

Emma Lurie
Towards Data Science


While fact-checkers are always busy, election season inevitably increases their workloads. As a result, many in the fact-checking community dream of the day when automated fact-checking systems display live fact-checks throughout important events like presidential debates.

But what is automated fact-checking? For the purposes of this post, automated fact-checking systems rely on computational methods to 1) decrease the lag time between a problematic statement (a claim) and a correction (a fact-check article) and 2) increase the number of claims that are associated with fact-checks.

The most commonly discussed (and funded) strategy for developing automated fact-checking relies on “claim matching.” Claim matching systems extract text from political debates, speeches, tweets, etc., identify statements that have already been fact-checked, and link an existing fact-check article to that “problematic” text snippet. This method theoretically helps fact-check more content because many untrue or misleading statements in speeches, debates, and tweets are repeated claims that may have already been fact-checked (e.g., that Obama was not born in the U.S., or that vaccines cause autism).
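To make the pipeline concrete, here is a minimal sketch of a claim matching step. It is purely illustrative: the repository of claims and the incoming statement are invented, and it uses simple TF-IDF cosine similarity where real systems typically rely on more sophisticated semantic models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical repository of claims that already have fact-check articles.
fact_checked_claims = [
    "Barack Obama was not born in the United States",
    "Vaccines cause autism",
    "The earth is flat",
]

# A new statement pulled from a speech, debate, or tweet.
new_statement = "Here we go again: vaccines are the real cause of autism."

# Represent the repository and the new statement as TF-IDF vectors.
vectorizer = TfidfVectorizer().fit(fact_checked_claims + [new_statement])
claim_vectors = vectorizer.transform(fact_checked_claims)
statement_vector = vectorizer.transform([new_statement])

# Score the new statement against every previously fact-checked claim.
scores = cosine_similarity(statement_vector, claim_vectors)[0]
best = scores.argmax()

# Only propose a match above some threshold; choosing that threshold is
# exactly the "good enough" question this post is about.
THRESHOLD = 0.3
if scores[best] >= THRESHOLD:
    print(f"Matched to: {fact_checked_claims[best]!r} (score={scores[best]:.2f})")
else:
    print("No sufficiently similar fact-check found.")
```

Even in this toy version, the two consequential choices are how similarity is measured and where the match threshold sits, which is where the rest of this post lives.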

Example: Reviewed Claims

One implementation of a claim matching system was Google’s short-lived Reviewed Claims feature, which appeared on a subset of news publishers’ Google search result pages. This feature matched claims from news publishers’ articles to already existing fact-check articles.

Two months after the feature was released, conservative news outlets complained that they were being unfairly targeted. Google removed the feature, explaining to Poynter that they had “…encountered challenges in our systems that maps fact checks to publishers, and on further examination, it’s clear that we are unable to deliver the quality we’d like for users.”

Image of the Reviewed Claims feature that appeared on some news publisher search result pages from November 2017 to January 2018. Reviewed Claims was removed by Google after conservative media backlash and amid quality concerns.

While I don’t definitively know how the Reviewed Claims system was designed, Google researchers published a 2018 paper outlining a system that addresses the “claim-relevance discovery problem,” i.e., identifying online articles that contain (and support) a fact-checked statement. If this sounds familiar, it is essentially the claim matching described earlier. The Google paper reports an accuracy of roughly 82% for its approach.

The Reviewed Claims case study has fascinated me for the past year and has prompted me to consider these two questions:

1. What is the appropriate accuracy score (and related metrics) for a publicly released automated fact-checking tool?

2. How should we define “relevant” claim matches?

I’m doubtful (as are many others in the fact-checking community) that we are close to reaching near-perfect accuracy levels in automated fact-checking systems, so we need to have frank discussions about satisfactory precision and recall scores for these systems as we plan to move this research into the public domain. Such conversations are important for the fact-checking organizations building tools, for the users attempting to understand the credibility of sources based on the presence of fact-check articles, and for the news publishers whose revenue streams are altered by recommendation algorithms that take into account whether publisher content has been flagged by fact-checkers.
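To make those discussions concrete, here is a hedged sketch of how a claim matching system might be scored against human relevance judgments. The labels and system outputs below are invented; the point is only that precision (of the matches the system proposes, how many are relevant) and recall (of the relevant matches, how many the system finds) answer different questions, and each of the groups above weighs them differently.

```python
# Hypothetical evaluation data: for each (statement, fact-check) pair,
# whether a human judged the match relevant and whether the system proposed it.
human_relevant = [1, 1, 0, 1, 0, 0, 1, 0]   # gold labels from annotators
system_matched = [1, 0, 0, 1, 1, 0, 1, 0]   # what the system actually linked

true_pos = sum(h == 1 and s == 1 for h, s in zip(human_relevant, system_matched))
false_pos = sum(h == 0 and s == 1 for h, s in zip(human_relevant, system_matched))
false_neg = sum(h == 1 and s == 0 for h, s in zip(human_relevant, system_matched))

precision = true_pos / (true_pos + false_pos)  # of proposed matches, how many were relevant?
recall = true_pos / (true_pos + false_neg)     # of relevant matches, how many were found?
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

A system tuned for precision can decline to match anything it is unsure about and still miss most repeated claims; a system tuned for recall does the opposite. Deciding which trade-off is acceptable for a publicly released tool is as much a policy question as a technical one.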

However, the conversation about appropriate metrics is useless if we do not share an understanding of when a claim should be matched to a fact-check. Broadly, the fact-checking literature has established that a “good” automated claim matching system will match a problematic claim to a relevant fact-check article.

While this may seem simple enough, determining relevance is far more complicated than it appears at first glance. In other words, not only is there no common definition of “good enough” in automated fact-checking, but, more fundamentally, we do not have a shared definition of a “relevant” claim match.

Determining Relevance: A Thought Experiment

Here’s a quick exercise to illustrate the difficulty of determining relevance in claim matches. Consider which of the headlines below should be claim matched (i.e., judged relevant) to the fact-check “Vaccines Don’t Cause Autism.”

  • “Vaccines May Cause Autism”: Note that this article doesn’t say that vaccines cause autism. The headline qualifies the link between vaccines and autism with the word “may.”
  • “We Don’t Know Everything We Should Know About the Effects of Vaccines”: This headline is an even more indirect version of the first. For an added layer of complexity, let’s imagine that this article doesn’t mention the word autism anywhere, but it does recommend that parents wait as long as possible to vaccinate their children because of potential risks.
  • “Jessica Biel Says Vaccines May Cause Autism”: This article doesn’t take a stance on whether vaccines cause autism; the news publisher is seemingly only reporting on what a celebrity said.
  • “Vaccines May Cause Breast Cancer”: This article doesn’t mention autism, but it does say that there is a direct link between the MMR vaccine and breast cancer.
  • “Vaccines Cause Autism and I’m the Queen of England”: This article says multiple times that vaccines cause autism, but it was intended to be satire.

Was it simple to determine whether “Vaccines Don’t Cause Autism” is a relevant claim match for each of these articles? My research indicates that for most people it isn’t. I ran a small-scale qualitative study, similar in premise to the thought experiment above, with undergraduate students and Amazon Mechanical Turk workers. I found that the undergraduate students had a challenging time with the task but eventually came to moderate levels of agreement with each other after a reconciliation process. The Amazon Mechanical Turk workers, however, were much more liberal in their definition of relevant claim matches than the undergraduate students. Further research is needed before saying much more about this preliminary finding.
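For readers wondering what “moderate levels of agreement” looks like in practice, agreement between annotators is often summarized with a chance-corrected statistic such as Cohen’s kappa. The judgments below are invented for illustration and are not data from my study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical relevance judgments (1 = relevant match, 0 = not relevant)
# from two annotators labeling the same ten headline/fact-check pairs.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

# Kappa corrects raw agreement for the agreement expected by chance;
# this example works out to roughly 0.58, a range often called "moderate."
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```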

How is relevance defined in fact-checking literature?

What I find concerning about the challenge of defining relevance is how little importance current publications and descriptions of automated fact-checking systems place on these foundational definitions. I have provided two examples below, but there are others.

  • In the Google research paper about “claim-relevant document discovery,” two comments are made about relevance. One is a technical definition: “given a fact-checking article with claim c, a claim relevant document is a related document that addresses c.” I personally don’t find this definition sufficient. The more elucidating mention of this dilemma is a comment the authors make in the introduction: “the claim-relevance discovery problem does not require the literal or precise claim to appear in the document but rather aims to find ones that seem to align in spirit.”

The idea that a claim only has to “align in spirit” to be a relevant match is somewhat clarifying: it probably means that the first hypothetical headline above, “Vaccines May Cause Autism,” should be labeled as relevant to the fact-check “Vaccines Don’t Cause Autism.” But this definition still leaves questions about how to resolve several of the other claims.

  • ClaimBuster is an end-to-end automated fact-checking system with a claim matching component. To my knowledge, it is the automated fact-checking system that has been written about most in academic venues. When discussing how the “Claim Matcher” component works, the following explanation is given: “the claim matcher searches a fact-check repository and returns those fact-checks matching the claim… The system has two approaches to measuring the similarity between a claim and a fact-check….An Elasticsearch server is deployed for searching the repository based on token similarity, while a semantic similarity search toolkit, Semilar, is applied for the search based on semantic similarity. We combine the search results from both in finding fact-checks similar to the given claims.”

I also think this explanation of determining relevance is insufficient. A description of the technical methods adopted in a system does not resolve any of the lingering questions of nuance around the definition of relevance, and it does not allow us to compare this system to the “align in spirit” guideline from the Google paper in a meaningful way. To see why, consider the sketch below of what combining two similarity signals might look like.
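This is a minimal sketch under my own assumptions, not ClaimBuster’s actual implementation: the token-overlap scorer, the placeholder semantic scorer, and the weighting are all invented for illustration.

```python
def token_overlap(a: str, b: str) -> float:
    """Crude lexical similarity: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def semantic_similarity(a: str, b: str) -> float:
    """Placeholder for a real semantic model (embeddings, Semilar, etc.)."""
    # A real system would call into a paraphrase or embedding model here;
    # the stand-in just reuses token overlap so the sketch runs end to end.
    return token_overlap(a, b)


def combined_score(claim: str, fact_check: str, lexical_weight: float = 0.5) -> float:
    """Blend the two signals; the weighting itself encodes a relevance judgment."""
    return (lexical_weight * token_overlap(claim, fact_check)
            + (1 - lexical_weight) * semantic_similarity(claim, fact_check))


claim = "Jessica Biel says vaccines may cause autism"
fact_check = "Vaccines don't cause autism"
print(f"combined score: {combined_score(claim, fact_check):.2f}")
```

Whatever numbers such a pipeline produces, someone still has to decide whether a headline about a celebrity’s statement counts as a “relevant” match for the fact-check, and neither description tells us how to make that call.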

Finally…

Others have cautioned against overpromising about the power of automated fact-checking, and I think those concerns are warranted. However, I’m proposing something different: I think the fact-checking community needs to make sure we are thinking critically about what “good enough” and “relevant” mean when it comes to automated fact-checking systems. I’m interested in hearing other people’s ideas about these issues, so please feel free to reach out.

Thank you to Eni Mustafaraj whose feedback greatly improved the trajectory of this blog post.
