Classification of Prostate Cancer in 3D Magnetic Resonance Imaging Data based on Convolutional Neural Networks
Article Summary
This paper evaluates the performance of different convolutional neural networks (CNNs) in classifying prostate cancer using 3D magnetic resonance imaging (MRI) data. The goal is to develop a model that can reliably predict whether an MRI sequence contains malignant lesions, which could help streamline the diagnosis process and reduce the need for invasive biopsies.
The researchers used a dataset provided by Cantonal Hospital Aarau, which included MRI sequences and histopathological reports for each case. They trained three CNN models (ResNet3D, ConvNet3D, and ConvNeXt3D) using various data augmentation techniques, learning rates, and optimizers. The models were evaluated based on their ability to classify the images as either containing malignant lesions or being benign.
The best result was achieved by the ResNet3D model, with an average precision score of 0.4583 and an AUC ROC score of 0.6214. However, the models' overall performance was not as high as expected, and they struggled to generalize well on the given dataset. The authors suggest that incorporating additional information, such as lesion locations or anatomy segmentations, could potentially improve the models' performance.
In conclusion, while the CNN models showed some promise in classifying prostate cancer using MRI data, further improvements and refinements are needed to achieve more reliable and accurate results. Future work should focus on exploring additional data sources, machine learning algorithms, and incorporating more detailed information from histopathological reports to enhance the models' performance.
Clinical Impact
Based on the results presented in the paper, the current clinical impact of the study is likely to be limited. The authors acknowledge that the performance of the CNN models was not as high as expected, with the best model achieving an average precision score of 0.4583 and an AUC ROC score of 0.6214. These scores indicate that the models' reliability in correctly classifying prostate cancer using MRI data is not yet sufficient for clinical application.
However, the study does provide valuable insights and suggestions for future research directions that could potentially lead to more clinically significant results. The authors propose several improvements, such as:
1. Incorporating lesion locations or anatomy segmentations to provide additional context for the models.
2. Exploring additional data sources to increase the dataset size and diversity.
3. Investigating other machine learning algorithms beyond CNNs, such as logistic regression, support vector machines (SVMs), or visual transformers.
4. Utilizing more detailed information from histopathological reports, such as sextant information, to enhance the models' performance.
If future studies build upon these suggestions and achieve significantly higher accuracy and reliability in classifying prostate cancer using MRI data, there could be a potential clinical impact. A well-performing model could assist radiologists in identifying malignant lesions more quickly and accurately, reducing the need for invasive biopsies and streamlining the diagnostic process. However, before any real-world clinical application, the models would need to undergo rigorous testing and validation to ensure their safety and efficacy.
In summary, while the current study's clinical impact is limited, it lays the groundwork for further research that could potentially lead to the development of clinically useful tools for prostate cancer diagnosis using MRI data.
Questions
As a man who has just received a pathology report indicating possible prostate cancer, it is important to have an informed discussion with your doctor about your diagnosis and treatment options. While this study focuses on using AI to classify prostate cancer from MRI data, it may not have an immediate impact on your personal case. However, you can still use the general insights from the study to guide your questions for your doctor. Here are some questions you may want to ask:
1. What is the Gleason score of my biopsy, and what does it mean in terms of the aggressiveness of the cancer?
2. Did my MRI results show any visible lesions or abnormalities in the prostate? If so, what is their size and location?
3. Based on my biopsy results and MRI findings, what is the likely stage of my prostate cancer?
4. What are the recommended treatment options for my specific case, and what are the potential benefits and risks of each option?
5. Are there any additional tests or imaging studies that could provide more information about my cancer, such as a bone scan or a PET scan?
6. How will my treatment progress be monitored, and how often will I need to have follow-up tests or scans?
7. Are there any lifestyle changes or support services that could help me manage my diagnosis and treatment?
While the AI models discussed in the study are not yet ready for clinical use, you can ask your doctor if they are aware of any similar research or if they use any AI-assisted tools in their practice. It's important to remember that every case is unique, and your doctor will be able to provide personalized advice based on your specific situation.
Seeking a second opinion on your pathology report is a good idea, especially if you have any doubts or concerns about the initial diagnosis. A second opinion can provide additional insight, confirm the original diagnosis, or sometimes even result in a different diagnosis. Here are a few ways to assess the reliability of your pathology results and seek a second opinion:
1. Check the pathologist's qualifications: Make sure that a board-certified pathologist with expertise in prostate cancer reviewed your biopsy samples. You can ask your doctor about the pathologist's qualifications or look them up on the College of American Pathologists' website.
2. Ask about the lab's accreditation: Inquire whether the laboratory that processed your biopsy is accredited by the College of American Pathologists (CAP) or another recognized accrediting organization. Accreditation ensures that the lab meets quality standards.
3. Request a review by another pathologist: Ask your doctor if they can arrange for your biopsy slides to be reviewed by another pathologist within the same hospital or from a different institution. Many hospitals have routine procedures for obtaining second opinions on pathology reports.
4. Consult with a specialist: Consider seeking a second opinion from a urologist or oncologist who specializes in prostate cancer. They can review your pathology report, MRI results, and other clinical information to provide a comprehensive assessment of your case.
5. Seek a second opinion from a National Cancer Institute (NCI)-designated cancer center: NCI-designated cancer centers are known for their expertise in cancer diagnosis and treatment. They often have experienced pathologists and specialists who can provide second opinions on complex cases.
When discussing your case with another pathologist or specialist, make sure to provide them with all the relevant information, including your original biopsy slides, pathology report, MRI images, and any other test results. This will help ensure a thorough and accurate second opinion.
Keep in mind that while a second opinion can provide valuable information, it may not always be necessary or feasible. Trust your instincts and discuss your concerns with your primary doctor to determine the best course of action for your individual case.
[2404.10548] Classification of Prostate Cancer in 3D Magnetic Resonance Imaging Data based on Convolutional Neural Networks
Electrical Engineering and Systems Science > Image and Video Processing
Submission history
From: Malte Rippa [v1] Tue, 16 Apr 2024 13:18:02 UTC (351 KB)
Classification of Prostate Cancer in 3D Magnetic Resonance Imaging Data based on Convolutional Neural Networks
FUSE-AI GmbH, Hamburg
Cantonal Hospital Aarau, Aarau (CH)
Ruben Schulze, Marian Himstedt, Felice Burn (contact: malte.rippa@gmx.de)
Abstract
Prostate cancer is a commonly diagnosed cancerous disease among men world-wide. Even with modern technology such as multi-parametric magnetic resonance tomography and guided biopsies, the process for diagnosing prostate cancer remains time consuming and requires highly trained professionals. In this paper, different convolutional neural networks (CNN) are evaluated on their abilities to reliably classify whether an MRI sequence contains malignant lesions. Implementations of a ResNet, a ConvNet and a ConvNeXt for 3D image data are trained and evaluated. The models are trained using different data augmentation techniques, learning rates, and optimizers. The data is taken from a private dataset, provided by Cantonal Hospital Aarau. The best result was achieved by a ResNet3D, yielding an average precision score of 0.4583 and AUC ROC score of 0.6214.
1 Introduction
With 1,276,106 newly diagnosed cases world-wide in 2018, prostate carcinomas (PCa) are the second most frequently diagnosed cancer and account for 3.8% of all cancer-related deaths among men [1]. The methods for PCa diagnosis are constantly improving and have reached a new pinnacle with the introduction of multi-parametric magnetic resonance imaging (mpMRI) [2]. mpMRI describes the use of multiple imaging sequences such as T2-weighted (T2W), diffusion-weighted imaging (DWI), dynamic contrast-enhanced (DCE) and apparent diffusion coefficient (ADC) sequences. Every sequence reveals different characteristics of the abdominal tissue, allowing a broad assessment of the prostate [2]. However, the analysis of mpMRI sequences for PCa diagnosis remains a time-consuming task and requires further assessment of the severity and clinical significance of the identified lesion(s), e.g. by conducting biopsies. In the current state of the art in medical image processing, neural networks have been shown to provide reliable predictions of the clinical significance of lesions on different types of images [3]. It has been shown that a convolutional neural network (CNN) can predict the Gleason grade of a histological slice of a prostate biopsy [4]. Recently, CNNs were also shown to reliably detect carcinomas in liver images when trained with MRI sequences and histopathological ground truth [5].
The scope of this paper is to compare the performance of different CNN architectures that predict whether a prostate contains malignant lesions, based on whole mpMRI sequences, with information gathered from histopathological tissue assessment serving as the image-level ground truth label during training. In contrast to state-of-the-art methods [6], voxel-wise annotations are omitted, as it is to be determined whether a whole-image classification can be realized reliably. A whole-image classification model would profit from a larger amount of data, since image-level labels are easier to obtain, perhaps resulting in a more robust and stable classifier. In addition, a well-performing classifier could be used as a filter without the need to run computationally expensive segmentation models.
2 Material and Methods
2.1 Image Data
The data for the experiments was taken from a private dataset exclusively provided by Kantonspital Aarau, Switzerland. The dataset is structured into cases, studies and sequences, where a case contains one or more studies and a study contains one or more mpMRI sequences. Studies containing a T2W image of insufficient quality, i.e. with artifacts that hide the prostate (e.g. due to endorectal coils, hip implants, or anatomical phenomena such as the bladder protruding into the prostate), were excluded from the dataset. After refinement, a dataset consisting of 365 studies and 1095 image sequences was obtained. Each image has a fixed size, with a voxel spacing of 0.75, 0.75 and 3 in the x, y and z directions, respectively.
2.2 Labels
For each study, a histopathological report is available, containing the Gleason score per prostate sextant. The Gleason grading system is considered one of the most powerful grading systems in prostate cancer analysis, as it provides information about the condition of the tissue [7]. The information from the reports is refined into image-level binary labels by condensing the Gleason scores into one binary score. Per definition of the scoring system [7], a Gleason score of 7 or higher is considered malignant/clinically significant. Therefore, if at least one sextant contains a malignant lesion, the entire image is labeled with 1; otherwise it is labeled with 0, indicating benign/clinically insignificant lesions or no lesions at all. Inconsistencies, errors and incompleteness in human annotation make the ground truth unreliable or incomplete for some cases. If the ground truth was found to be of insufficient quality for these reasons, the entire study, including the image data, was excluded from the dataset. This resulted in a total of 246 benign labels and 119 malignant labels.
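The label-condensation rule above can be sketched in a few lines; the function name and the list-of-scores input format are illustrative assumptions, not the paper's implementation:

```python
def image_label(sextant_gleason_scores):
    """Condense per-sextant Gleason scores into one binary image-level
    label: 1 if any sextant is clinically significant (Gleason >= 7),
    else 0 (benign/insignificant or no lesion)."""
    return int(any(score >= 7 for score in sextant_gleason_scores))

# A single malignant sextant labels the whole study as malignant.
assert image_label([6, 6, 7, 6, 6, 6]) == 1
assert image_label([6, 6, 6, 6, 6, 6]) == 0
```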
2.3 Models
Generally, CNNs are considered appropriate for fast image processing and are widely used in computer vision tasks. The literature [3, 5] suggests using CNNs for the classification of prostate cancer in mpMRI data, as they have delivered strong results on similar problems. For this paper, the ConvNet3D was chosen for its simplicity combined with efficiency. A ResNet3D was chosen because the use of residual connections could be beneficial. The recently published ConvNeXt is described as a modernized ResNet incorporating principles of transformers and is reported to deliver strong performance in object detection. All these models rely on a CNN backbone and use fully connected layers for the final classification. The ConvNet3D uses two convolutional blocks to increase the channel/feature dimension while decreasing the spatial resolution; it has 168,705 trainable parameters. The ResNet3D consists of eight convolutional blocks containing convolution operations, batch normalization and ReLU. The residual connections are realized by "bottleneck blocks", convolutional blocks that add the output after one convolution block to the input before the next convolution block. Six bottleneck blocks are used; in total, 4,527,906 parameters have to be learned. The ConvNeXt3D uses features such as grouped convolutions and inverted bottlenecks with modified normalizations and activation functions. A combination of three convolution blocks, three inverted bottleneck blocks containing three convolutional layers each, and a fully connected layer results in 31,321,561 trainable parameters.
All models output a single class score activated with a sigmoid, indicating the confidence that the image contains a malignant lesion.
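A minimal PyTorch sketch of this shared output design (a toy backbone, not any of the paper's actual architectures): a 3D CNN reduces the volume to a feature vector, a fully connected layer produces one logit, and a sigmoid turns it into a malignancy confidence.

```python
import torch
import torch.nn as nn

class TinyConvNet3D(nn.Module):
    """Illustrative stand-in: 3D CNN backbone + fully connected head
    emitting one sigmoid-activated malignancy confidence per image."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse spatial dimensions
        )
        self.fc = nn.Linear(32, 1)     # single class logit

    def forward(self, x):
        features = self.backbone(x).flatten(1)
        return torch.sigmoid(self.fc(features)).squeeze(1)

x = torch.randn(2, 3, 8, 32, 32)       # (batch, channels, z, y, x)
probs = TinyConvNet3D()(x)
print(probs.shape)                      # torch.Size([2])
```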
As a baseline, the institutional in-house lesion segmentation model currently deployed in a product by FUSE-AI was run on the same dataset. The model, an anisotropic U-Net, is designed to create a segmentation of benign and malignant lesions; it therefore must be able to classify the clinical significance of detected lesions. For comparability with the models mentioned earlier, the model output is aggregated using global max pooling to generate a classification score.
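The max-pooling aggregation can be sketched as follows; the tensor layout (a batch of voxel-wise malignancy probability maps) is an assumption for illustration:

```python
import torch

def segmentation_to_score(seg_probs):
    """Aggregate voxel-wise malignant-lesion probabilities of shape
    (batch, z, y, x) into one image-level classification score per
    image via global max pooling: the most suspicious voxel decides."""
    return seg_probs.amax(dim=(1, 2, 3))

seg = torch.zeros(1, 4, 8, 8)
seg[0, 2, 3, 3] = 0.9      # one confidently malignant voxel
score = segmentation_to_score(seg)
print(score)                # tensor([0.9000])
```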
2.4 Training process and experiment setup
Selecting only data with sufficient image and ground-truth quality leaves 365 cases with a total of 1095 sequences for training. Three MRI sequences (T2W, ADC and DWI) are stacked in the channel dimension to compose a three-channel 3D image as input for the model. It has been shown that incorporating DWI and ADC sequences increases performance when aiming to recognize anomalies in the prostate [6].
The data is then split into training, test and validation partitions of 70, 15 and 15 percent, respectively. To improve generalization, data augmentation in the form of mirror transforms, changes in contrast and resolution, addition of noise, and spatial transforms is applied to the images.
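The input composition and split described above can be sketched like this; the volume shape and the random-split procedure are illustrative assumptions (the paper does not specify how the partitioning was implemented):

```python
import numpy as np

def stack_sequences(t2w, adc, dwi):
    """Stack the three MRI sequences in the channel dimension,
    producing one (3, z, y, x) input volume."""
    return np.stack([t2w, adc, dwi], axis=0)

def split_indices(n, seed=0):
    """70/15/15 train/test/validation split over n study indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(0.70 * n), int(0.15 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

vol = stack_sequences(*[np.zeros((24, 160, 160))] * 3)  # assumed shape
train, test, val = split_indices(365)
print(vol.shape, len(train), len(test), len(val))  # (3, 24, 160, 160) 255 54 56
```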
To enhance the focus on the important regions of the image, an existing model is utilized to generate a segmentation of the prostate (see Fig. 1), which is then passed to the classification network for further processing. The selected models are also applied to the whole image to measure the impact of the pre-segmentation; however, only AUC ROC is calculated for the whole-image models.
The models were trained multiple times with different hyperparameters and loss calculations to find a training setup that delivers reasonable results. Once the results showed that the performance of the networks was stable, the setup was considered successful and fine-tuning of the hyperparameters could commence. To counter the imbalance in the class distribution, the binary cross entropy (BCE) loss was weighted with inverse class frequencies. Additionally, the networks were trained on different partitions of the entire dataset as well as on smaller subsamples of it (40 to 50 samples), to determine whether the composition of the data impacts the outcome of the model training, as no cross validation was conducted. The performance of the different network implementations was measured by calculating the area under the receiver operating characteristic curve (AUC ROC) and the average precision (AP), and by analyzing confusion matrices.
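One common way to realize the inverse-frequency weighting of the BCE loss is PyTorch's `pos_weight` argument, which scales the positive (malignant) term; this is a sketch of the idea, not necessarily the paper's exact formulation. The label counts are taken from Sec. 2.2.

```python
import torch
import torch.nn as nn

n_benign, n_malignant = 246, 119   # label counts from Sec. 2.2

# Scale the loss of the rarer malignant class by the benign/malignant
# ratio, so both classes contribute comparably to the gradient.
criterion = nn.BCEWithLogitsLoss(
    pos_weight=torch.tensor(n_benign / n_malignant))

logits = torch.tensor([1.2, -0.7, 0.1])    # raw model outputs
targets = torch.tensor([1.0, 0.0, 1.0])    # binary image-level labels
loss = criterion(logits, targets)
print(loss.item() > 0)   # True
```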
As a reference, the performance of a U-Net-based lesion segmentation pipeline is used, denoted as "Baseline model" in Tab. 1. The results of the lesion segmentations are aggregated to compose an image-level label.
3 Results
It can be seen directly that the introduction of pre-segmentations around the prostate gland had a positive effect on the AUC ROC score. Tab. 1 documents that the models applied to the whole image perform with an average AUC ROC of 0.5075, whereas the models that focus only on the gland yield an average AUC ROC of 0.5970. The ResNet3D was trained for 300 epochs with an exponentially decaying learning rate and weight decay applied every 100 steps, a batch size of 2, a dropout rate of 70% and the AdamW optimizer. It was observed that the network started to overfit after approximately 19 epochs. Decreasing the initial learning rate prevented overfitting but led to worse results on the validation and test partitions.
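The optimizer and schedule just described can be sketched as follows; the initial learning rate and decay factor are placeholders, since the exact values are not given in the text, and the linear layer is a stand-in for the 3D CNN:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 1)               # stand-in for the ResNet3D
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-4, weight_decay=1e-2)
# Exponential decay applied every 100 steps (gamma is assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=100, gamma=0.9)

for step in range(300):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 32)).sum()   # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()                   # lr shrinks at steps 100, 200, 300

print(optimizer.param_groups[0]["lr"] < 1e-4)   # True
```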
The ConvNet3D was trained in the same setup as the ResNet3D. After fine-tuning the learning rate, the AUC ROC score and AP increased slightly to 0.6214 and 0.4583, respectively, in the best epoch for the test partition. The ConvNeXt3D likewise overfitted in the first epochs at its initial learning rate, and further fine-tuning did not improve the results.
It should be mentioned that training with smaller learning rates did not lead to any learning at all: the classification loss on the training data oscillated throughout the entire training of 300 epochs, and thus the AUC ROC and AP scores did not improve over time. The best performance of this network was achieved in the earlier epochs of training, and the scores decreased over time while the loss remained constant.
As can be seen in Tab. 1, the best result in terms of AUC ROC was achieved by the ConvNet with the prostate segmentation as input, yielding 0.5732; at the same time, this model yields the worst AP. In terms of AP, none of the models was able to outperform the baseline model, which yields an AP of 0.4801.
No significant difference between applying weighted BCE and regular BCE loss could be observed for any of the models.
Training the models on small subsets of the dataset delivers seemingly strong models, with AUC ROC and AP at 1 for training and over 0.8 for validation and test. However, as discussed later, these results are not reliable.
4 Discussion
Several convolutional neural networks were tested in different training configurations to obtain the best performance for each network. The performance was measured by calculating AUC ROC and AP for the training, test and validation partitions; the test scores were used for comparing the models. The scores were comparatively low, with an AUC ROC of 0.6214 and an AP of 0.4583 at best. In terms of AUC ROC, the ResNet3D performed similarly to the in-house solution on this dataset, but the low AP shows that its predictions are not precise. Fig. 2 shows the confusion matrices for training and test after 300 epochs of training. Analyzing them, the majority of correctly classified samples are benign (clinically insignificant) cases, which adheres directly to the distribution of the class labels. The high rates of false positives and false negatives also lower both the AUC ROC and the AP scores.
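For reference, the three evaluation measures used above can be computed with scikit-learn as follows; the predictions and the 0.5 decision threshold here are made up for illustration, not the paper's data:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])          # toy labels
y_score = np.array([0.2, 0.6, 0.1, 0.7, 0.4, 0.3, 0.8, 0.5])

auc = roc_auc_score(y_true, y_score)                 # threshold-free
ap = average_precision_score(y_true, y_score)        # threshold-free
# Confusion matrix needs hard decisions; 0.5 is an assumed cutoff.
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))

print(round(auc, 4), round(ap, 4))
print(cm)   # rows: true benign/malignant, cols: predicted
```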
It can be observed that the different models do not generalize well with small learning rates in the setup presented here. Usually, the best epochs are found among the first training epochs. Increasing the learning rate leads to overfitting, delivering a model that learned the distribution of the training data instead of learning a generalized representation of the input. Furthermore, the good performance of the models on smaller subsamples of the dataset supports the observation that the networks learn distributions instead of representations. Especially on the smaller sets, it is more likely to randomly select a set for which the distribution of the training data matches the distribution of the validation and test data, which results in perfect but unreliable scores. The subsamples are very small considering that a median dataset size of 127 patients/studies is reported when comparing other approaches to PCa classification on MRI data [3].
The results show that the chosen networks are not able to provide reliable predictions of the clinical significance of lesions in the prostate by processing the MRI sequences without additional information or improvements in the training process. In the literature, an AUC ROC of 0.8328 ± 0.0878 is reported for lesion detection and characterization [6], which sets the lower boundary to be reached by the model proposed in this work. Notably, the well-performing models in related work use lesion locations [8] or lesion segmentations as ground truth for training a U-Net [9]. Making use of larger pretrained ConvNets, e.g. Inception v3 [10], in combination with zonal information and lesion locations delivers strong results on input similar to that proposed in this paper.
In future work, more than one data source should be considered, which would help determine whether the data quality is insufficient or the hyperparameters were chosen poorly. Additionally, it would be useful to assess other machine learning algorithms such as logistic regression, SVMs or visual transformers, instead of focusing exclusively on CNN architectures. Moreover, the implementation of cross validation could help to measure the performance more accurately. As suggested by the state of the art and the results achieved in this paper, the incorporation of localization information in the form of anatomy segmentations should be targeted. It could also be helpful to use the entirety of the information from the histopathological report instead of the aggregated label, thus incorporating sextant information. Using pretrained weights and/or self-supervised training has proven effective [10, 11] and should be tried as well.
The work has been carried out at FUSE-AI GmbH in Hamburg and supervised by the Institute of Medical Informatics, Universität zu Lübeck. Thanks to Alexander Cornelius, Sebastian Schindera, Rainer Grobholz, Stephan Wyler and Maciej Kwiatkowski from Kantonspital Aarau for providing the image data and histopathological reports. Thanks to Quang Thong Nguyen for the extensive help on the preparation of the dataset and the implementation of ResNet3D, ConvNet3D and ConvNeXt3D.
References
- [1] P. Rawla “Epidemiology of Prostate Cancer” In World journal of oncology 10, 2019 DOI: https://doi.org/10.14740/wjon1191
- [2] O. Rouviere and P. Moldovan “The current role of prostate multiparametric magnetic resonance imaging” In Asian Journal of Urology 6.2, 2019, pp. 137–146 DOI: 10.1016/j.ajur.2018.12.001
- [3] Jose M. Castillo T. et al. “Automated Classification of Significant Prostate Cancer on MRI: A Systematic Review on the Performance of Machine Learning Applications” In Cancers 12.6, 2020 DOI: 10.3390/cancers12061606
- [4] D. Karimi et al. “Deep Learning-Based Gleason Grading of Prostate Cancer From Histopathology Images - Role of Multiscale Decision Aggregation and Data Augmentation” In IEEE journal of biomedical and health informatics 24.5, 2020, pp. 1413–1426 DOI: 10.1109/JBHI.2019.2944643
- [5] P. Oestmann et al. “Deep learning-assisted differentiation of pathologically proven atypical and typical hepatocellular carcinoma (HCC) versus non-HCC on contrast-enhanced MRI of the liver” In European radiology 31.7, 2019, pp. 4981–4990 DOI: 10.1007/s00330-020-07559-1
- [6] H. Li et al. “Machine Learning in Prostate MRI for Prostate Cancer: Current Status and Future Opportunities” In Diagnostics (Basel, Switzerland), 2022 DOI: 10.3390/diagnostics12020289
- [7] J. Gordetsky and J. Epstein “Grading of prostatic adenocarcinoma current state and prognostic implications” In Diagn Pathol 11.25, 2016 DOI: 10.1186/s13000-016-0478-2
- [8] N. Aldoj, S. Lukas, M. Dewey and T. Penzkofer “Semi-automatic classification of prostate cancer on multi-parametric MR imaging using a multi-channel 3D convolutional neural network” In European Radiology, 2019 DOI: 10.1007/s00330-019-06417-z
- [9] M. Arif et al. “Clinically significant prostate cancer detection and segmentation in low-risk patients using a convolutional neural network on multi-parametric MRI” In European radiology, 2020 DOI: 10.1007/s00330-020-07008-z
- [10] S. Armato et al. “Computer-Aided Diagnosis - A transfer learning approach for classification of clinical significant prostate cancers from mpMRI scans” In Medical Imaging 2017, 2017 DOI: 10.1117/12.2279021
- [11] Zongwei Zhou et al. “Models Genesis” In Computer Vision and Pattern Recognition arXiv, 2020 DOI: 10.48550/ARXIV.2004.07882