Every year in Germany, around 3 million women take part in mammography screening. The programme detects around 18,000 breast cancer cases annually.
At a screening appointment, four x-ray images (mammograms) are recorded. This collection of images from a given screening appointment is called a study. To make the problem easy to visualize, in this demonstration we will look at 5000 studies. Each study is represented by a single dot on the diagram on the right.
One of the challenges of mammography screening is that very few studies, on average between 0.6% and 0.7%, are malignant. Out of the 5000 studies, 30 are malignant.
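As a quick check, using only the numbers quoted above, the 30 malignant studies in this subset sit at the lower end of that prevalence range:

```python
# Prevalence in the 5000-study subset, from the figures quoted above
n_studies = 5000
n_malignant = 30
prevalence = n_malignant / n_studies
print(f"{prevalence:.1%}")  # prints 0.6%
```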
When a study is read by a radiologist, they will assess the study as suspicious (potentially malignant) or normal. Reading mammography x-ray images is difficult, and even highly trained radiologists can make mistakes. Whether a cancer is actually present is determined by a follow-up investigation and, finally, a biopsy.
If the radiologist’s assessment is correct (i.e. confirmed by biopsy) they identify: a true positive (a malignant study assessed as suspicious) or a true negative (a normal study assessed as normal).
There are two kinds of incorrect assessments, identifying: a false positive (a normal study assessed as suspicious) or a false negative (a malignant study assessed as normal, i.e. a missed cancer).
The two types of incorrect assessments are captured by two important measures: sensitivity, the fraction of malignant studies that are correctly assessed as suspicious, and specificity, the fraction of normal studies that are correctly assessed as normal.
For these 5000 exams, the radiologist reaches a sensitivity of 0.83 and a specificity of 0.93.
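Both measures can be computed from a confusion matrix. A minimal sketch, with hypothetical counts chosen to approximately reproduce the radiologist's figures on this subset (30 malignant, 4970 normal studies):

```python
# Hypothetical confusion-matrix counts for the 5000 studies,
# chosen so the results match the reported figures approximately
tp, fn = 25, 5        # malignant studies: correctly flagged vs missed
tn, fp = 4622, 348    # normal studies: correctly cleared vs false alarms

sensitivity = tp / (tp + fn)  # fraction of malignant studies flagged
specificity = tn / (tn + fp)  # fraction of normal studies cleared
print(round(sensitivity, 2), round(specificity, 2))  # prints 0.83 0.93
```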
Trained machine learning models usually output a continuous score rather than a binary decision. This is a consequence of the model essentially being a series of additions, multiplications and other operations on real numbers.
A continuous score does not by itself yield a decision, so we must determine a cut-off value, or threshold, for the model's output.
All the studies with scores above the threshold will be categorized as positive (suspicious), and those with scores below the threshold are categorized as negative (normal).
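In code, this binarisation is a single comparison per study. A minimal sketch with hypothetical scores:

```python
# Hypothetical model scores for five studies
scores = [0.02, 0.41, 0.55, 0.97, 0.30]
threshold = 0.50

# True = suspicious (score above threshold), False = normal
predictions = [s >= threshold for s in scores]
print(predictions)  # prints [False, False, True, True, False]
```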
With a threshold of 0.50 there are 493 false positives and 2 cancers missed by the algorithm.
Try moving the threshold line on the right to see how the classification is affected by the choice of threshold.
As digital mammography is not perfect, neither is the algorithm: no single threshold cleanly separates all the malignant cases from the normal ones. By moving the threshold to lower values we catch more cancers, at the expense of falsely classifying more normal cases as malignant. This corresponds to a trade-off between the two measures introduced previously: sensitivity and specificity.
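The trade-off can be made concrete by sweeping the threshold over simulated scores and recomputing both measures at each step. This is a sketch with synthetic data; the score distributions are assumptions, not the real model's outputs:

```python
import random

random.seed(0)
# Synthetic scores: normal studies cluster near 0, malignant near 1
normal_scores = [random.betavariate(1, 8) for _ in range(200)]
malignant_scores = [random.betavariate(6, 2) for _ in range(10)]

# Lowering the threshold raises sensitivity but lowers specificity
for threshold in (0.2, 0.5, 0.8):
    sens = sum(s >= threshold for s in malignant_scores) / len(malignant_scores)
    spec = sum(s < threshold for s in normal_scores) / len(normal_scores)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Sweeping the threshold over all possible values traces out exactly the curve shown on the right.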
You can move both the threshold line and the point on the curve.
Notice that the curve is quite jagged: due to the small number of malignant exams, we get a large discrete change in sensitivity as soon as the threshold reaches one of them.
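The size of those jumps follows directly from the counts: with only 30 malignant studies in this subset, sensitivity can only take values that are multiples of 1/30.

```python
# With n malignant studies, sensitivity can only take values k/n,
# so each malignant study crossing the threshold is a step of 1/n
n_malignant = 30
step = 1 / n_malignant
print(f"{step:.3f}")  # prints 0.033
```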
With the threshold at 0.50, we reach a sensitivity of 0.93 and a specificity of 0.90.
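These figures are consistent with the counts mentioned earlier (493 false positives, 2 missed cancers at this threshold), which a quick check confirms:

```python
# Counts at threshold 0.50, taken from the text
n_malignant, missed = 30, 2
n_normal, false_positives = 5000 - 30, 493

sensitivity = (n_malignant - missed) / n_malignant
specificity = (n_normal - false_positives) / n_normal
print(round(sensitivity, 2), round(specificity, 2))  # prints 0.93 0.9
```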
In contrast, the radiologist reaches a sensitivity of 0.83 and a specificity of 0.93.
On this small subset of exams, we can find thresholds where the algorithm outperforms the radiologist on both sensitivity and specificity. However, when evaluating on a much larger set of exams, this is currently not the case. For this reason we combine the algorithm with reads from radiologists, yielding a system that works better than either the algorithm or the radiologist alone. How this is done is explained in decision referral.