Every year in Germany, around 3 million women take part in mammography screening. The programme detects around 18,000 breast cancer cases annually.
At a screening appointment, four x-ray images (mammograms) are recorded. This collection of images from a given screening appointment is called a study. To make the problem easy to visualize, in this demonstration we will look at 5000 studies. Each study is represented by a single dot on the diagram on the right.
One of the challenges of mammography screening is that very few studies, on average between 0.6% and 0.7%, are malignant. Out of the 5000 studies, 30 are malignant.
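As a quick check, using only the numbers quoted above, the 30 malignant studies in this subset sit at the lower end of that prevalence range:

```python
# Prevalence in the 5000-study subset, from the figures quoted above
n_studies = 5000
n_malignant = 30
prevalence = n_malignant / n_studies
print(f"{prevalence:.1%}")  # prints 0.6%
```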
When a study is read by a radiologist, they will assess the study as suspicious (potentially malignant) or normal. Reading mammography x-ray images is difficult, and even highly trained radiologists can make mistakes. Whether a cancer is actually present is determined by a follow-up investigation and, finally, a biopsy.
If the radiologist’s assessment is correct (i.e. confirmed by biopsy) they identify: a true positive (a malignant study assessed as suspicious) or a true negative (a normal study assessed as normal).
There are two kinds of incorrect assessments, identifying: a false positive (a normal study assessed as suspicious) or a false negative (a malignant study assessed as normal, i.e. a missed cancer).
The two types of incorrect assessments are captured by two important measures: sensitivity, the fraction of malignant studies that are correctly assessed as suspicious, and specificity, the fraction of normal studies that are correctly assessed as normal.
For these 5000 exams, the radiologist reaches a sensitivity of 0.83 and a specificity of 0.93.
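Both measures can be computed from a confusion matrix. A minimal sketch, with hypothetical counts chosen to approximately reproduce the radiologist's figures on this subset (30 malignant, 4970 normal studies):

```python
# Hypothetical confusion-matrix counts for the 5000 studies,
# chosen so the results match the reported figures approximately
tp, fn = 25, 5        # malignant studies: correctly flagged vs missed
tn, fp = 4622, 348    # normal studies: correctly cleared vs false alarms

sensitivity = tp / (tp + fn)  # fraction of malignant studies flagged
specificity = tn / (tn + fp)  # fraction of normal studies cleared
print(round(sensitivity, 2), round(specificity, 2))  # prints 0.83 0.93
```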
Trained machine learning models usually output a continuous score rather than a binary decision. This is a consequence of the model essentially being a series of additions, multiplications and other operations on real numbers.
A continuous score does not by itself yield a decision, so we must determine a cut-off value, or threshold, for the model's output.
All the studies with scores above the threshold will be categorized as positive (suspicious), and those with scores below the threshold are categorized as negative (normal).
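In code, this binarisation is a single comparison per study. A minimal sketch with hypothetical scores:

```python
# Hypothetical model scores for five studies
scores = [0.02, 0.41, 0.55, 0.97, 0.30]
threshold = 0.50

# True = suspicious (score above threshold), False = normal
predictions = [s >= threshold for s in scores]
print(predictions)  # prints [False, False, True, True, False]
```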
With a threshold of 0.50 there are 493 false positives and 2 cancers missed by the algorithm.
Try moving the threshold line on the right to see how the classification is affected by the choice of threshold.
As digital mammography is not perfect, neither is the algorithm: no single threshold cleanly separates all the malignant cases from the normal ones. By moving the threshold to lower values we catch more cancers, at the expense of falsely classifying more normal cases as malignant. This corresponds to a trade-off between the two measures introduced previously: sensitivity and specificity.
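The trade-off can be made concrete by sweeping the threshold over simulated scores and recomputing both measures at each step. This is a sketch with synthetic data; the score distributions are assumptions, not the real model's outputs:

```python
import random

random.seed(0)
# Synthetic scores: normal studies cluster near 0, malignant near 1
normal_scores = [random.betavariate(1, 8) for _ in range(200)]
malignant_scores = [random.betavariate(6, 2) for _ in range(10)]

# Lowering the threshold raises sensitivity but lowers specificity
for threshold in (0.2, 0.5, 0.8):
    sens = sum(s >= threshold for s in malignant_scores) / len(malignant_scores)
    spec = sum(s < threshold for s in normal_scores) / len(normal_scores)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Sweeping the threshold over all possible values traces out exactly the curve shown on the right.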
You can move both the threshold line and the point on the curve.
Notice that the curve is quite jagged: due to the small number of malignant exams, we get a large discrete change in sensitivity as soon as the threshold reaches one of them.
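The size of those jumps follows directly from the counts: with only 30 malignant studies in this subset, sensitivity can only take values that are multiples of 1/30.

```python
# With n malignant studies, sensitivity can only take values k/n,
# so each malignant study crossing the threshold is a step of 1/n
n_malignant = 30
step = 1 / n_malignant
print(f"{step:.3f}")  # prints 0.033
```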
With the threshold at 0.50, we reach a sensitivity of 0.93 and a specificity of 0.90.
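These figures are consistent with the counts mentioned earlier (493 false positives, 2 missed cancers at this threshold), which a quick check confirms:

```python
# Counts at threshold 0.50, taken from the text
n_malignant, missed = 30, 2
n_normal, false_positives = 5000 - 30, 493

sensitivity = (n_malignant - missed) / n_malignant
specificity = (n_normal - false_positives) / n_normal
print(round(sensitivity, 2), round(specificity, 2))  # prints 0.93 0.9
```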
In contrast, the radiologist reaches a sensitivity of 0.83 and a specificity of 0.93.
On this small subset of exams, we can find thresholds where the algorithm outperforms the radiologist on both sensitivity and specificity. However, when evaluating on a much larger set of exams, this is currently not the case. For this reason we combine the algorithm with reads from radiologists, yielding a system that works better than either the algorithm or the radiologist alone. How this is done is explained in decision referral.