Understanding breast cancer screening data

How the epidemiological basis of data can be used to better train and understand AI models

Thijs Kooi
Towards Data Science

--

The first months of 2020 will likely go down in the history books as the start of the greatest pandemic in modern history. Epidemiologists from all over the world scrambled to come up with emergency plans to contain the disease. The immense loss of life overshadowed brighter news: impressive progress in AI, especially in screening mammography [1, 2].

Successful applications of AI to the medical domain are usually at least as much about tuning the data as about tuning a model. It does not matter whether you feed VGG or ResNet pictures of dogs and cats: neither will learn to detect cancer in mammograms. Knowledge of the epidemiological basis of the data you work with can really make a difference.

Getting into this field, however, can be daunting, as it requires distilling medical literature from the past six decades. Additionally, the field tends to be less structured than hard sciences like engineering or math, from which most AI practitioners originate.

This post is intended to give a concise introduction to (mammography based) breast cancer screening data and the aspects that are most important for developing AI applications and for understanding the results of recent AI papers on mammography.

The breast cancer screening process

Breast cancer is a common disease: 1 in 8 women will get it at some point in their lifetime. Most European countries, the US and some countries in Asia screen asymptomatic women over a certain age for breast cancer by means of regular mammographic exams (X-ray images). Several randomized controlled trials have shown a significant reduction in mortality; screening can therefore save lives [3]. How this screening process is executed differs per country. Most European countries have nationalized programs, in which women are invited with a letter by a government institution; the US does not have such a program.

If a woman attends screening, a mammogram is recorded. If this exam is found suspicious, she is recalled for a diagnostic exam, which can include further imaging. If this is still inconclusive, a biopsy is performed. In some countries women skip the official invitations and go straight to diagnostic exams; this is referred to as gray screening. An illustration is provided in figure 1.

Figure 1. Overview of the breast cancer screening process as it is commonly executed in Europe. In some countries (such as Germany) not all women who are invited participate in regular screening, but sometimes go opportunistically for a diagnostic mammogram. This is referred to as ‘gray screening’. (image by author)

Apart from the process itself, the parameters of the screening policy also differ per country. Below are several important variables.

Age of first screening

The risk of cancer increases with age, so screening at a young age is often not worthwhile. Exceptions are women in high risk groups (for example because of genetic mutations that make breast cancer much more likely), who are typically screened from a younger age than the regular population and receive supplementary imaging such as MRI. In addition to the risk, the tissue composition changes with age, making cancers harder to see in younger women. In most European countries screening starts at age 50 (with some exceptions, such as Sweden and the UK); in the US and East Asia it typically starts at 40.

When a woman is screened for the first time, there is a higher chance that a cancer is detected, because you will find cancers that have been growing for a few years. The first and subsequent screening rounds are often referred to as prevalence and incidence screening, respectively.

Time between screening rounds

The time between screenings, or screening interval, varies from one year in the US to three years in the UK; most European countries use a two-year interval. Similar to the age at which screening starts and the screening round, the interval in part determines the number of cancers that are (or can be) detected in every screening round. An illustration of this is provided in figure 2.

The y-axis represents the number of detectable cancers in a population, the x-axis represents time. The longer one waits between screenings, the more cancers are likely to be detected. Note that this is a simplification: in practice not all women in a population are screened simultaneously. Also note that the number of detectable cancers does not go down to zero: some cancers are missed, and we will get to that later in this post.

Figure 2. Illustration of two different screening intervals and the effect it has on the number of detectable cancers. The red line illustrates the number of detectable cancers, the yellow line a screening round. The longer you wait in between exams, the more cancers you detect. Note that a ‘screening round’ is represented as a single event where all patients are screened simultaneously. In practice this is not the case of course (image by author).
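To make figure 2 concrete, here is a toy simulation of the effect of the screening interval. It is purely illustrative, not a model from any of the cited studies: it assumes cancers become detectable at a constant rate and that each screening round finds a fixed fraction of the detectable pool, and the onset rate, sensitivity and intervals are made-up numbers.

```python
# Toy model of figure 2: detectable cancers accumulate at a constant rate
# and each screening round removes a fixed fraction of the detectable pool.
# All numbers (onset rate, sensitivity, horizon) are illustrative assumptions.

def simulate_detections(interval_months, months=120,
                        onset_per_month=10, screen_sensitivity=0.8):
    """Return the number of cancers detected at each screening round."""
    detectable = 0.0
    detections = []
    for month in range(1, months + 1):
        detectable += onset_per_month       # new cancers become detectable
        if month % interval_months == 0:    # a screening round takes place
            found = screen_sensitivity * detectable
            detections.append(found)
            detectable -= found             # missed cancers stay in the pool
    return detections

# Longer intervals -> more detectable cancers per round (but fewer rounds).
for interval in (12, 24, 36):
    per_round = simulate_detections(interval)
    print(f"{interval}-month interval: ~{per_round[-1]:.0f} cancers per round")
```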

On top of the variables mentioned above, there are a few parameters that can be captured by ‘rates’, which are (except for the participation rate) mostly the result of a trade-off between the sensitivity and specificity of the program. Below are some commonly used rates; a short computation sketch follows figure 3.

  • Participation rate In a national screening program, the government usually sends out letters of invitation. The participation rate is the fraction of invited women who actually attended screening. Looking at figure 3, we get the participation rate by dividing the orange circle by the gray circle.
  • Recall rate Of all women who participate in screening, only a fraction is recalled (invited for a follow-up exam). This fraction depends strongly on the screening program and/or center. In the US recall rates are typically high, in the order of 10%, whereas in Europe they are closer to 3%. Recall rates are also usually higher in prevalence screening. In figure 3, the recall rate is obtained by dividing the dark red circle by the orange circle.
  • Biopsy rate When a woman is recalled for a follow-up exam, the work-up can include a second (diagnostic) mammogram, an ultrasound, or a biopsy if the findings are still inconclusive. About 50% of diagnostic exams result in a biopsy. The biopsy rate is obtained by dividing the blue circle by the dark red circle in figure 3.
  • Cancer detection rate The cancer detection rate (CDR) is simply the number of cancers detected per 1,000 women screened. Again using the blobs in figure 3, it is obtained by taking the intersection of the bright red and bright blue circles and dividing it by the orange circle. Just like the recall rate, the CDR is usually higher in prevalence screening.
Figure 3. Illustration of different ‘rates’ in the screening process. The big gray blob represents all women in the screened age group, the orange circle all women who actually attend screening, the dark red circle all women who are recalled, the bright blue circle all cases that are biopsied and the bright red circle all cancers (image by author).
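As a small computational companion to figure 3, the sketch below derives the rates above from raw counts. The counts are hypothetical and the definitions follow the simple ‘circle’ view of figure 3; real registries use more careful denominators.

```python
# Hypothetical counts following the circles in figure 3.
invited = 100_000        # gray circle: women invited to screening
screened = 80_000        # orange circle: women who attended
recalled = 2_400         # dark red circle: women recalled for follow-up
biopsied = 1_200         # blue circle: recalled women who were biopsied
cancers_detected = 480   # intersection of the biopsy and cancer circles

participation_rate = screened / invited            # fraction of invitees attending
recall_rate = recalled / screened                  # ~3%, typical for Europe
biopsy_rate = biopsied / recalled                  # ~50% of diagnostic work-ups
cdr_per_1000 = 1000 * cancers_detected / screened  # cancers per 1,000 women screened

print(f"participation: {participation_rate:.1%}, recall: {recall_rate:.1%}, "
      f"biopsy: {biopsy_rate:.1%}, CDR: {cdr_per_1000:.1f} per 1,000")
```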

Missed cancers

Note that in figure 2 the number of detectable cancers does not go down to zero after a screening round and that in figure 3 the bright red blob is not a subset of the dark red blob. This is to illustrate that cancers are missed. Some of these are detected in between two screenings; these are referred to as interval cancers. Some are detected at the next screening.

Interval cancers are often used to evaluate the quality of a screening program. If there are many, it is likely that many cancers are being missed during screening and the sensitivity is too low. An interval cancer does not necessarily mean a cancer was missed during screening, however. Studies show that about 20–30% of interval cancers, and roughly the same fraction of cancers detected at the next screening round, were missed [4, 5, 6] (i.e., detectable on the prior mammogram). The rest either started growing after the last screening round or were simply not visible on the mammogram. An illustration of this process is provided in figure 4.

Figure 4. Illustration of the cancer detection process in breast cancer screening. The bright red circle on the left illustrates all cancers detected at screening time t (screen detected cancers). The blob in the center is the set of interval cancers and the blob on the right the cancers detected at the next screening round (time t + interval i). About 20 to 30% of the interval cancers and cancers detected at the next screening round are already visible on the previous screening exam. Together, the bright red blobs represent all detectable cancers at time t (image by author).

Defining labels

Defining labels is important: without good labels, AI models cannot learn well. For medical data this is often less straightforward than for natural images, because it is not always clear what the ‘truth’ is. A distinction is sometimes made between a reference standard and a gold standard. The former is the best guess of a set of readers; the latter incorporates additional information, for instance from histopathology, that a reader cannot know just by looking at the image.

Unfortunately it is also not obvious which to choose. When building AI tools for screening, for instance, should you only detect cancer, or detect everything radiologists thought was suspicious? If you choose the former and use the tool to assist radiologists, they may find it strange that the system does not mark something they consider very suspicious. Getting significantly better than a radiologist will be hard, though, if the system only has access to the same information. These two choices are further explained below.

BIRADS scores — reference standard

In most European countries, an exam is read by two certified radiologists. In the US this is typically just a single person. After reading the exam, the reader assigns a score: the BIRADS score [7]. This is a rating defined in an international standardized reporting system for mammography. The scores can be explained as follows [8]:

  • BIRADS 1 This is good news, the exam is ‘empty’: there are no findings.
  • BIRADS 2 This is also good news, but some benign abnormalities were found.
  • BIRADS 3 In this case the radiologist is unsure and a follow-up in about half a year is needed. This score is not allowed in some countries, because readers tend to overuse it.
  • BIRADS 4A Something suspicious enough to warrant a diagnostic exam was found on the mammogram. However, the chance that it is actually cancer is still small.
  • BIRADS 4B This means a slightly higher chance of cancer than 4A. This score is not used in all countries either, because 4A is already enough reason to recall the woman.
  • BIRADS 4C A very suspicious lesion was found and the reader is fairly sure the exam depicts cancer.
  • BIRADS 5 One or more textbook cancers were found and the reader is very certain the exam depicts a malignant tumor.

The BIRADS score is technically not an ordinal scale, because BIRADS 1 and 2 represent a similar degree of suspicion; it should therefore not be interpreted as one.
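When BIRADS scores are used as a reference standard for training, they usually have to be binarized into a ‘recall’ label. The sketch below shows one possible mapping; the cut-off at BIRADS 3 is an assumption that differs per program (programs that forbid BIRADS 3 effectively cut at 4A).

```python
# One possible binarization of BIRADS scores into a 'recall' label.
# The cut-off (here: BIRADS 3 and above counts as suspicious) is an
# assumption and varies between screening programs.
RECALL_LABEL = {"1": 0, "2": 0, "3": 1, "4A": 1, "4B": 1, "4C": 1, "5": 1}

def birads_to_label(birads: str) -> int:
    """Map a BIRADS category to a binary 'suspicious enough to recall' label."""
    try:
        return RECALL_LABEL[birads.strip().upper()]
    except KeyError:
        raise ValueError(f"Unexpected BIRADS category: {birads!r}")

print(birads_to_label("2"))   # 0: benign finding, no recall
print(birads_to_label("4a"))  # 1: suspicious, diagnostic work-up warranted
```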

Histopathological outcome — gold standard

If the diagnostic exam cannot rule out cancer, clinicians proceed to biopsy the breast. This is typically done with a (vacuum assisted) core needle biopsy, whereby a small piece of tissue is extracted from the suspicious site. The tissue is placed under a microscope to determine exactly what it is. Pathologists typically agree on whether it is cancer or not, but can disagree about the exact sub-classification.

A common way to classify (breast) cancers is by their tissue of origin. Most cancers in the breast are carcinomas, meaning they originated from epithelial tissue. The breast consists of lobules and ducts, responsible for milk production and transport. Cancers can be in situ, meaning they have not proliferated beyond their tissue of origin, or invasive/infiltrating, meaning cancer cells have started migrating.

Taking the Cartesian product of these gives four main types: invasive ductal carcinoma (the most common), invasive lobular carcinoma, ductal carcinoma in situ and lobular carcinoma in situ. The last one is no longer considered a true cancer in recent guidelines on breast pathology [9]. Invasive cancers are the most dangerous and should definitely not be missed by AI systems. This is just the tip of the iceberg: there are dozens of other pathologies and classification systems [9].
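When histopathology is used as a gold standard, these subtypes are typically grouped into a binary label. Below is a minimal sketch; treating lobular carcinoma in situ as a negative follows the guideline remark above, and the exact set of benign diagnoses is hypothetical.

```python
# Minimal grouping of biopsy outcomes into a binary 'cancer' label.
# Treating LCIS as negative follows the remark that recent guidelines no
# longer consider it a true cancer; other label policies are defensible
# depending on the application. The benign examples are hypothetical.
MALIGNANT = {"invasive ductal carcinoma", "invasive lobular carcinoma",
             "ductal carcinoma in situ"}
NON_MALIGNANT = {"lobular carcinoma in situ", "fibroadenoma", "cyst", "benign"}

def pathology_to_label(diagnosis: str) -> int:
    d = diagnosis.strip().lower()
    if d in MALIGNANT:
        return 1
    if d in NON_MALIGNANT:
        return 0
    raise ValueError(f"Unmapped pathology result: {diagnosis!r}")

print(pathology_to_label("Ductal carcinoma in situ"))   # 1
print(pathology_to_label("Lobular carcinoma in situ"))  # 0
```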

Defining positives and negatives

The most common definition of a positive is to simply look at the histopathological outcome, i.e., cancer or not. However, that alone is not enough. Because the screening process has a temporal component, we also need to define when the outcome was obtained. There are (at least) three options:

Cancer is confirmed by biopsy right after the exam: this is the most common definition of a positive. Using this definition, however, it can be hard to show that an algorithm has a higher sensitivity than humans, unless you compare individual reads or let readers read the same cases again.

This definition is equivalent to taking only the screen detected cancers, the big red blob on the left in figure 4. The corresponding human operating point is depicted by the red cross in the ROC plot in figure 5.

Cancer is confirmed by biopsy up until the next screening exam: this is commonly used to determine the sensitivity of a screening program, because it is easy to compute from a cancer registry. You simply take the number of screen detected cancers and divide it by the sum of screen detected and interval cancers. This typically results in a sensitivity of 70–80%. It is not commonly used in evaluation studies, though.

In figure 4, this means taking the big red blob on the left and the bigger of the two blobs in the center as positives. The corresponding human operating point is depicted by the yellow cross in figure 5.

Cancer is confirmed by biopsy up to three months after the next screening exam: this definition was used in the recent Nature paper by DeepMind [1]. The reason is something they refer to as the ‘gatekeeper effect’: by only taking screen detected cancers as positives, you bias the results and make the humans look better. The flip side, however, is that many cancers that will develop over the next couple of years are not yet visible on the current exam, meaning the sensitivity drops significantly.

We get this definition by considering all big blobs in figure 4 as positives. The respective human operating point is the blue cross in figure 5.

One last, and probably purest, way to define positives is to simply look at all ‘detectable cancers’ (all the bright red blobs in figure 4). However, this can be tedious, as it requires determining what has been missed, which can only be assessed by reading the cases again.

Figure 5. Illustration of the different human operating points for different definitions of positives in an ROC plot. Red: only screen detected cancers are considered positive. Yellow: screen detected and interval cancers are positives. Blue: Cancers detected up to two years after the exam are considered positive (image by author).
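The effect of these definitions on measured sensitivity can be made explicit with a few counts. The numbers below are hypothetical; the formula for the program sensitivity (screen detected divided by screen detected plus interval cancers) follows the description above.

```python
# Hypothetical counts illustrating how the definition of a positive
# changes the measured sensitivity of the same set of screening reads.
screen_detected = 70     # cancers found at the current screen
interval_cancers = 20    # cancers surfacing before the next screen
next_round_cancers = 30  # cancers found at the next screening round

# Definition 1: only screen-detected cancers are positives.
# Human sensitivity is 1.0 by construction, so beating it is impossible
# without a separate reader study.
sens_def1 = screen_detected / screen_detected

# Definition 2: screen-detected + interval cancers (program sensitivity).
sens_def2 = screen_detected / (screen_detected + interval_cancers)

# Definition 3: also count cancers detected at (or shortly after) the
# next screening round, as in the DeepMind evaluation.
sens_def3 = screen_detected / (screen_detected + interval_cancers
                               + next_round_cancers)

print(f"def 1: {sens_def1:.0%}, def 2: {sens_def2:.0%}, def 3: {sens_def3:.0%}")
```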

Implications for AI studies

So what does all of this have to do with AI? Firstly, it matters a great deal how you define the data and how you present it to the model. This is part of the secret ingredients of AI products, though, so I will not go into details here. It is also interesting to consider the above variables when evaluating a model and comparing your results to the literature. Below are a few examples from recent literature:

  • As mentioned above, the recent Nature paper by DeepMind [1] uses a somewhat uncommon definition of a positive. This makes the results in the paper (for both readers and algorithms) look a lot worse than those in other papers, which has confused many people.
  • The recent evaluation paper from the mammography DREAM challenge [10] presented results of models on two datasets: a US dataset provided by the challenge organizers and a Swedish dataset. The model was developed on the first dataset, but showed better performance when applied to the second. The authors postulate (among other things) that this is because of the longer screening interval and the cancer composition of the second set: the longer you wait between screenings, the more invasive cancers you will find, and these are likely easier to detect.
  • Several AI papers combine screening and diagnostic data [2, 11]. Although cancers will likely look almost the same on a diagnostic and a screening exam, taking negatives from diagnostic instead of screening data will likely mean ‘harder’ negatives (also depending on how you train your model, of course), because these exams were recalled during regular screening and therefore likely contain something suspicious. This may lead to underestimating the model’s performance on pure screening data.
  • There are quite substantial differences between US and European screening, which can make the comparison of methods difficult. The US typically screens every year and has a higher recall rate. Cancer detection rates in the US are also higher, meaning more subtle cancers may end up in the set of positives, and these are likely harder to detect. The AUCs of around 0.8 reported by Yala et al. [12] may seem low in comparison to European studies, but may simply be the result of much harder data.

To summarize, recent progress in AI for mammography is exciting and could mean great steps towards better healthcare. Taking the clinical and epidemiological factors of your data into account can help understand results and improve performance. However, comparing studies remains complicated. The best comparison for now may be to simply benchmark every method against radiologists [13].

References

[1] McKinney, S.M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G.C., Darzi, A. and Etemadi, M., 2020. International evaluation of an AI system for breast cancer screening. Nature, 577(7788), pp.89–94.

[2] Kim, H.E., Kim, H.H., Han, B.K., Kim, K.H., Han, K., Nam, H., Lee, E.H. and Kim, E.K., 2020. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. The Lancet Digital Health, 2(3), pp.e138-e148.

[3] Marmot, M.G., Altman, D.G., Cameron, D.A., Dewar, J.A., Thompson, S.G. and Wilcox, M., 2013. The benefits and harms of breast cancer screening: an independent review. British journal of cancer, 108(11), pp.2205–2240.

[4] Hofvind, S., Geller, B., Vacek, P.M., Thoresen, S. and Skaane, P., 2007. Using the European guidelines to evaluate the Norwegian breast cancer screening program. European journal of epidemiology, 22(7), p.447.

[5] Hoff, S.R., Abrahamsen, A.L., Samset, J.H., Vigeland, E., Klepp, O. and Hofvind, S., 2012. Breast cancer: missed interval and screening-detected cancer at full-field digital mammography and screen-film mammography — results from a retrospective review. Radiology, 264(2), pp.378–386.

[6] Perry, N., Broeders, M., de Wolf, C., Törnberg, S., Holland, R. and von Karsa, L., 2008. European guidelines for quality assurance in breast cancer screening and diagnosis. — summary document. Annals of Oncology, 19(4), pp.614–622.

[7] Sickles, EA, D’Orsi CJ, Bassett LW, et al. ACR BI-RADS® Mammography. In: ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. Reston, VA, American College of Radiology; 2013.

[8] https://www.hopkinsmedicine.org/breast_center/

[9] Giuliano, A.E., Edge, S.B. and Hortobagyi, G.N., 2018. Eighth edition of the AJCC cancer staging manual: breast cancer. Annals of surgical oncology, 25(7), pp.1783–1785.

[10] Schaffter, T., Buist, D.S., Lee, C.I., Nikulin, Y., Ribli, D., Guan, Y., Lotter, W., Jie, Z., Du, H., Wang, S. and Feng, J., 2020. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA network open, 3(3), pp.e200265-e200265.

[11] Rodriguez-Ruiz, A., Lång, K., Gubern-Merida, A., Teuwen, J., Broeders, M., Gennaro, G., Clauser, P., Helbich, T.H., Chevalier, M., Mertelmeier, T. and Wallis, M.G., 2019. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. European radiology, 29(9), pp.4825–4832.

[12] Yala, A., Schuster, T., Miles, R., Barzilay, R. and Lehman, C., 2019. A deep learning model to triage screening mammograms: a simulation study. Radiology, 293(1), pp.38–46.

[13] Lotter, W., Diab, A.R., Haslam, B., Kim, J.G., Grisot, G., Wu, E., Wu, K., Onieva, J.O., Boxerman, J.L., Wang, M. and Bandler, M., 2019. Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach. arXiv preprint arXiv:1912.11027.
