TY - GEN
T1 - The Impact of Using Voxel-Level Segmentation Metrics on Evaluating Multifocal Prostate Cancer Localisation
AU - Yan, Wen
AU - Yang, Qianye
AU - Syer, Tom
AU - Min, Zhe
AU - Punwani, Shonit
AU - Emberton, Mark
AU - Barratt, Dean
AU - Chiu, Bernard
AU - Hu, Yipeng
PY - 2022/9/30
Y1 - 2022/9/30
N2 - Dice similarity coefficient (DSC) and Hausdorff distance (HD) are widely used for evaluating medical image segmentation. They have also been criticised, when reported alone, for their unclear or even misleading clinical interpretation. DSCs may also differ substantially from HDs, due to boundary smoothness or multiple regions of interest (ROIs) within a subject. More importantly, either metric can also have a nonlinear, non-monotonic relationship with outcomes based on Type 1 and 2 errors, designed for specific clinical decisions that use the resulting segmentation. Whilst cases causing disagreement between these metrics are not difficult to postulate, one might argue that they may not necessarily be substantiated in real-world segmentation applications, as a majority of ROIs and their predictions often do not manifest themselves in extremely irregular shapes or locations that are prone to such inconsistency. This work first proposes a new asymmetric detection metric, adapting those used in object detection, for planning prostate cancer procedures. The lesion-level metric is then compared with the voxel-level DSC and HD, with a 3D UNet used for segmenting lesions from multiparametric MR (mpMR) images. Based on experimental results using 877 sets of mpMR images, we report pairwise agreement and correlation 1) between DSC and HD, and 2) between voxel-level DSC and recall-controlled precision at lesion-level, with Cohen’s κ ∈ [0.49, 0.61] and Pearson’s r ∈ [0.66, 0.76] (p-values < 0.001) at varying cut-offs. However, the differences in false-positives and false-negatives, between the actual errors and the perceived counterparts if DSC is used, can be as high as 152 and 154, respectively, out of the 357 test set lesions. We therefore carefully conclude that, despite the significant correlations, voxel-level metrics such as DSC can misrepresent lesion-level detection accuracy for evaluating localisation of multifocal prostate cancer and should be interpreted with caution.
AB - Dice similarity coefficient (DSC) and Hausdorff distance (HD) are widely used for evaluating medical image segmentation. They have also been criticised, when reported alone, for their unclear or even misleading clinical interpretation. DSCs may also differ substantially from HDs, due to boundary smoothness or multiple regions of interest (ROIs) within a subject. More importantly, either metric can also have a nonlinear, non-monotonic relationship with outcomes based on Type 1 and 2 errors, designed for specific clinical decisions that use the resulting segmentation. Whilst cases causing disagreement between these metrics are not difficult to postulate, one might argue that they may not necessarily be substantiated in real-world segmentation applications, as a majority of ROIs and their predictions often do not manifest themselves in extremely irregular shapes or locations that are prone to such inconsistency. This work first proposes a new asymmetric detection metric, adapting those used in object detection, for planning prostate cancer procedures. The lesion-level metric is then compared with the voxel-level DSC and HD, with a 3D UNet used for segmenting lesions from multiparametric MR (mpMR) images. Based on experimental results using 877 sets of mpMR images, we report pairwise agreement and correlation 1) between DSC and HD, and 2) between voxel-level DSC and recall-controlled precision at lesion-level, with Cohen’s κ ∈ [0.49, 0.61] and Pearson’s r ∈ [0.66, 0.76] (p-values < 0.001) at varying cut-offs. However, the differences in false-positives and false-negatives, between the actual errors and the perceived counterparts if DSC is used, can be as high as 152 and 154, respectively, out of the 357 test set lesions. We therefore carefully conclude that, despite the significant correlations, voxel-level metrics such as DSC can misrepresent lesion-level detection accuracy for evaluating localisation of multifocal prostate cancer and should be interpreted with caution.
KW - Lesion-level localisation metrics
KW - Multi-parametric MR
KW - Prostate cancer
KW - Voxel-level segmentation metrics
UR - http://www.scopus.com/inward/record.url?scp=85140447191&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85140447191&origin=recordpage
U2 - 10.1007/978-3-031-17721-7_14
DO - 10.1007/978-3-031-17721-7_14
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 978-3-031-17720-0
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 128
EP - 138
BT - Applications of Medical Artificial Intelligence
A2 - Wu, Shandong
A2 - Shabestari, Behrouz
A2 - Xing, Lei
PB - Springer, Cham
T2 - 1st International Workshop on Applications of Medical Artificial Intelligence, AMAI 2022, held in conjunction with the 25th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2022
Y2 - 18 September 2022 through 18 September 2022
ER -