Radiomics Features Based on MRI-ADC Maps of Patients with Breast Cancer: Relationship with Lesion Size, Features Stability, and Model Accuracy

Begumhan BAYSAL; Hakan BAYSAL; Mehmet Bilgin ESER; Mahmut Bilal DOGAN; Orhan ALIMOGLU

doi:10.4274/MMJ.galenos.2022.70094

ABSTRACT

Conclusions:

The stability of radiomics features increases with lesion size. Neural networks may predict molecular subtypes of breast cancers over 1 cm³ with high accuracy.

Results:

Of the 851 radiomics features, 611 had ICC >0.75, and 37 remained stable in the first experiment, 49 in the second, and 59 in the third based on CoV and VIF analysis. High accuracy was demonstrated by the Luminal B, HER2-enriched, and triple-negative models in the first experiment (>80%), all models in the second experiment, and HER2-enriched and triple-negative models in the third experiment.

Methods:

This retrospective study included 221 consecutive patients (224 lesions) with breast cancer imaged between January 2015 and January 2020. Three sample size configurations were identified based on tumor size (experiment 1: all cases, experiment 2: >1 cm³, and experiment 3: >2 cm³). The tumors were segmented by three observers based on diffusion-weighted imaging-registered ADC maps, and the volumetric agreement of these segmentations was evaluated using the Dice coefficient. Stability of radiomics features (n=851) was evaluated with intraclass correlation coefficient (ICC, >0.75) and coefficient of variation (CoV, <0.15). Feature selection was made with variance inflation factor (VIF, <10) and least absolute shrinkage and selection operator regression. Outcomes were identified as molecular subtypes (Luminal A, Luminal B, HER2-enriched, triple-negative). Neural network performance was presented as an area under the curve and accuracies.

Objective:

To predict breast cancer molecular subtypes with neural networks based on magnetic resonance imaging apparent diffusion coefficient (ADC) radiomics and to detect the relation of lesion size with the stability of radiomics features.

Keywords:

Breast carcinoma, diffusion magnetic resonance imaging, computer-assisted image processing, machine learning, artificial intelligence

INTRODUCTION

Breast cancer is the most common cancer in women^1,2. Remarkable developments in the fields of imaging, surgery, pathology, medical oncology, and genetics have led to a significant decline in breast cancer-related death rates over the past 30 years^1,3,4,5. In current clinical practice, patients are classified based on their molecular subtypes^3,6, which can be identified via biopsy, thus informing treatment selection⁶. Currently, treatment is guided by molecular subtypes determined from tissue samples or by immunohistochemical markers^6,7. However, the major challenges at this point are the limited volume of tumor represented in the biopsy sample and the reliability of the biopsy sample, especially in heterogeneous tumors⁷. A diagnostic prediction model may predict molecular subtypes using imaging data^8,9,10, for example, automated artificial neural networks (ANN)⁸, in which many networks can be trained with different configurations. In automation, the separation of the sample into training, test (hyperparameter tuning), and validation (hold-out) sets can be used to determine the training, error function, hidden activation, and output activation to be selected for the network structure¹¹, and human-induced bias is reduced¹². However, explainability may be limited in nonlinear models.

Previous studies focused on radiomics features extracted from dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI)^{10,13,14,15,16,17,18,19}. Although DCE-MRI is the most sensitive sequence in breast cancer imaging, the accumulation of gadolinium-based contrast agents in the brain has been a concern²⁰. Additionally, using contrast agents in patients with obesity, diabetes, and renal failure requires great attention^1,20. Therefore, non-contrast MRI protocols such as diffusion-weighted imaging (DWI) should be developed⁵. DWI delivers functional diffusivity data, and, with apparent diffusion coefficient (ADC) mapping, quantifies diffusivity related to cell density in solid tumors. DCE-MRI has been performed to predict molecular subtype, tumor histology, risk of recurrence, response to chemotherapy, and the probability of metastasis^{8,10,13,14,15,16,17,19,21,22,23}. However, few studies used DWI or ADC radiomics as predictors^9,18,24.

Previous studies focusing on breast cancer molecular subtypes have not evaluated spatial overlap^{10,13,14,15,16,17,18,19}, and few have tested the interobserver reproducibility of the radiomics features^22,23. Reproducibility is very important in radiomics feature extraction^25,26,27 and, along with sharing data, is as important as the studies’ design, precision, and accuracy^26,28. Moreover, to maintain the quality and reproducibility of the studies, studies on artificial intelligence that include complex models should report data with transparency.

The relation between the size of lesions and the stability of radiomics features has not been studied before. This study primarily aimed to predict breast cancer molecular subtypes with automated ANN created based on MRI ADC radiomics features and then to investigate the relationship between lesion size and stability of radiomics features and model accuracy.

MATERIALS and METHODS

Ethical Considerations

This retrospective study was approved by the Local Ethics Committee of the Istanbul Medeniyet University Goztepe Training and Research Hospital (decision no: 2020/0303, date: 18.05.2020). The requirement for written informed patient consent was waived by the local ethics committee. We ensured adherence to the STARD 2015 statement²⁹ and the white papers and statements of European, United States, and Canadian societies^{26,30,31,32}. The radiomics quality score was 18/36³³. An Image Biomarker Standardization Initiative (IBSI)-compliant software was used for feature extraction³⁴.

Study Population and Data Collection

This model development study was conceived in Istanbul Medeniyet University Goztepe Prof. Dr. Suleyman Yalcin City Hospital. Data of patients histopathologically diagnosed with breast cancer in the general-surgery service of the university hospital between January 2015 and January 2020 were collected. The inclusion criteria were as follows: patients with breast MRI, including DWI sequences; detectable lesion on the DWI and ADC map; and invasive breast cancer as per the pathology report (if the patient had two different types of tumor histology, the tumors were included separately). The exclusion criteria were as follows: patients operated on at the research center but without available imaging results, and lesions that were detected on the ADC map but pathologically diagnosed as in situ cancer. A complete pipeline is presented in Figure 1, and a flowchart of sample selection for the study is presented in E-Figure 1 (all E-Figures and E-Tables can be accessed from the link at the end of the article).

Based on the above criteria, 221 patients with 224 lesions were identified for the study (experiment 1, n=224). Breast cancer molecular subtypes based on pathological examination of surgery specimens were recorded in a worksheet by the surgery team. The analysis was repeated by narrowing the data set to lesions over 1 cm³ (experiment 2, n=172) and lesions over 2 cm³ (experiment 3, n=139). The 1.5-Tesla MRI protocols are described in E-Table 1.

Statistical Analysis

Predictors: Analysis of the ADC Maps

MRIs of the included patients were taken from the hospital archive and anonymized. Using the 3D Slicer (version 4.10.2; https://www.slicer.org) software, three radiologists with 8, 5, and 3 years of experience segmented the tumor on each axial section in which it was located, as seen on high b-value DWI images and verified on the ADC map^35,36. Co-registration with T2-weighted images was performed if the lesion was not detected on DWI. The predictor variables (radiomics features) of this study were extracted with PyRadiomics (version 2.2.0). All the radiomics features (n=851) were included, and wavelet-based filters were used. Raw ADC maps and resampled images (2.0×2.0×2.0 mm) were used and normalized^36,37. Other detailed information about the radiomics features included in the study is provided in E-Table 2.

Outcomes

The outcomes of the study were molecular subtypes of breast cancer based on the biopsy: Luminal A, Luminal B, human epidermal growth factor receptor 2 (HER2)-enriched, and triple-negative (TN). These outcomes were coded in a “one-vs-rest” orientation (E-Table 3)²⁷.
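The “one-vs-rest” coding described above can be illustrated with a minimal sketch. The function name `one_vs_rest` and the five example labels are hypothetical and are not taken from the study data; the sketch only shows how one four-class subtype column becomes four binary outcome columns.

```python
# Minimal sketch of "one-vs-rest" outcome coding (illustrative, not the authors' code).
SUBTYPES = ("Luminal A", "Luminal B", "HER2-enriched", "TN")


def one_vs_rest(labels, positive):
    """Return 1 where a lesion has the `positive` subtype and 0 otherwise."""
    if positive not in SUBTYPES:
        raise ValueError(f"unknown subtype: {positive!r}")
    return [1 if label == positive else 0 for label in labels]


# Hypothetical subtype labels for five lesions (not study data).
labels = ["Luminal A", "TN", "Luminal B", "Luminal A", "HER2-enriched"]

# One binary target vector per subtype; each feeds its own binary classifier.
targets = {subtype: one_vs_rest(labels, subtype) for subtype in SUBTYPES}

for subtype, column in targets.items():
    print(f"{subtype:>14}: {column}")
```

Because every lesion has exactly one subtype, each lesion is positive in exactly one of the four binary columns.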

Features Stability Assessment

Interobserver agreement on the segmentations and radiomics features was evaluated using the Dice similarity coefficient²⁷ and the intraclass correlation coefficient (ICC), respectively³⁸. Segmentations with a Dice similarity coefficient <0.50 were resolved by consensus. Features with an ICC >0.75 were further analyzed. Features from the three measurements were averaged, and the data were combined with the worksheet containing the pathology data. Worksheets containing three different sample size configurations were created. In the coefficient of variation (CoV) analysis, radiomics features showing a CoV >15% were eliminated^26,28. Then, Spearman’s correlation (SC) analysis was performed to evaluate the remaining features, followed by variance inflation factor (VIF) analysis.
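The three stability measures named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the ICC form is not specified in the text, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is assumed here; the values over which the CoV is computed are likewise not spelled out, so the function simply takes one list of values for one feature. All function names and example numbers are hypothetical.

```python
from statistics import mean, stdev


def dice(mask_a, mask_b):
    """Dice similarity coefficient for two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 2.0 * overlap / total if total else 1.0


def icc_2_1(ratings):
    """ICC(2,1), assumed form: two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per lesion and one value per observer in each row.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    return stdev(values) / abs(mean(values))


# Hypothetical data: two observers' masks and one feature measured by three observers.
observer_1 = [1, 1, 1, 0, 0]
observer_2 = [0, 1, 1, 1, 0]
feature = [[10.0, 10.5, 9.8], [20.1, 19.7, 20.4], [15.2, 15.0, 15.5], [30.3, 29.9, 30.1]]

print("Dice:", round(dice(observer_1, observer_2), 3))
print("ICC(2,1) above 0.75:", icc_2_1(feature) > 0.75)
print("CoV below 0.15:", coefficient_of_variation([10.0, 11.0, 9.0, 10.0]) < 0.15)
```

The thresholds in the print statements mirror those stated in the text (ICC >0.75, CoV <0.15).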

Collinearity-multicollinearity Analysis and Final Feature Selection

To achieve low collinearity-multicollinearity, VIF analyses were performed. If the VIF of a radiomic feature was above 10, the feature was eliminated^28,39. For validation, SC analysis was performed between features and outcomes (p<0.01)²⁶. Least absolute shrinkage and selection operator (LASSO) was used for feature selection with a random sampling method and 10-fold cross-validation.
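As an illustration of the VIF criterion, the sketch below computes each feature's VIF as the corresponding diagonal element of the inverse of the Pearson correlation matrix, which equals 1/(1 − R²) for that feature regressed on the others. The iterative rule (drop the feature with the largest VIF until all are ≤10) is one common procedure and is an assumption here; the text does not state how the elimination was ordered. Function names and data are hypothetical, and the small matrix inversion is written out only to keep the example dependency-free.

```python
from statistics import mean


def pearson_matrix(columns):
    """Pearson correlation matrix for a list of equally long feature columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj) for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(pearson_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_collinear(features, threshold=10.0):
    """Assumed rule: repeatedly drop the largest-VIF feature until all are <= threshold."""
    kept = dict(features)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Hypothetical features: x2 is nearly 2 * x1, x3 is only weakly related to either.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.1, 5.9, 7.9, 10.0, 12.0],
    "x3": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0],
}
print("VIF:", [round(v, 1) for v in vif(list(features.values()))])
print("kept:", drop_collinear(features))
```

In practice, an established routine such as `variance_inflation_factor` in statsmodels would normally replace the hand-written linear algebra.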

Structuring Automated Artificial Neural Networks

Neural networks were binary classifiers in a “one-vs-rest” orientation, and a diagnostic model was developed after feature selection using the features selected for each experiment and outcome²⁷. The data were divided into three subsamples in each training session using a random number generator. The software randomly sampled 50%-70% of the cases as a training set, 10%-20% as a test set (hyperparameter tuning set), and 20%-30% as a validation (hold-out) set. Multilayer perceptron (MLP) and radial basis function (RBF) neurons were trained, and networks were feedforward and fully connected¹¹. In each analysis, MLP or RBF neurons were trained, tested, and validated with an unseen data set. The software automatically assigned the number of neurons (6-25); the number of layers [input layer (n = predictors for RBF and n = predictors + bias neuron for MLP), minimum two hidden layers, and output layer (n=2, positive event or not)]; the number of bias neurons (minimum one per hidden layer for both RBF and MLP); the hidden and output activation functions [identity, logistic sigmoid, hyperbolic tangent, exponential, Softmax, and Gaussian (only available for RBF networks)]; and the error function (sum of squares or cross entropy) in these models by evaluating the input data, output data, and subsample proportions¹¹. Hyperparameter tuning was performed with an early-stopping algorithm¹¹. Then, a neural network search was performed for each outcome in the three sample size configurations. Figure 2 summarizes how the automated ANN were trained, tested, and validated. The most accurate networks were retained for each experiment and outcome. The results of the most efficient networks are presented as the area under the curve (AUC) (95% confidence intervals; lower and upper bounds)^26,27. In receiver operating characteristic curve analysis, a network with AUC >0.85 and p<0.01 was considered a validated classifier. TIBCO Statistica version 13.5 (TIBCO Software, Palo Alto, CA) was used for statistical analyses and neural network training.
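The three-way sampling and the AUC criterion can be sketched as below. This does not reproduce the Statistica sampler or its networks; the 60%/15%/25% proportions are one hypothetical choice within the stated ranges, the seed and scores are invented, and the AUC is computed by the rank (Mann-Whitney) formulation without the accompanying p-value or confidence interval.

```python
import random


def split_indices(n_cases, train=0.60, test=0.15, seed=0):
    """Randomly partition case indices into training, test, and hold-out validation sets."""
    if not (0 < train and 0 < test and train + test < 1):
        raise ValueError("train and test must be positive and sum to less than 1")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = round(n_cases * train)
    n_test = round(n_cases * test)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_test],
        indices[n_train + n_test:],  # remaining cases form the hold-out set
    )


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Experiment 2 has n=172 lesions; the split below is illustrative only.
train_idx, test_idx, valid_idx = split_indices(172)
print(len(train_idx), len(test_idx), len(valid_idx))

# Hypothetical network outputs for six hold-out cases (1 = positive subtype).
scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
print("AUC:", round(value, 3), "| exceeds 0.85:", value > 0.85)
```

Keeping the hold-out indices disjoint from the training and tuning indices is what makes the reported validation AUC an estimate on unseen data.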

RESULTS

Patients’ Characteristics

This study included 221 patients (mean age, 54±11 years); of them, 220 (99%) were women. Clinicopathologic characteristics of the patients are presented in Table 1.

Feature Selection Results

The interobserver mean Dice coefficient values were as follows: between observers 1 and 2, 0.81±0.08 (0.80-0.82); for observers 1 and 3, 0.80±0.10 (0.79-0.82); and between observers 2 and 3, 0.73±0.11 (0.71-0.74).

The results of the resampled image features were not presented due to lower performance on ICC, CoV, VIF, LASSO analyses, and multivariate diagnostic models.

CoV, VIF, and LASSO regression analyses were performed separately in all three experiments (Figure 3). Of the 851 radiomic features, 611 had ICC values >0.75 across the three segmentations and were included in the CoV analysis, after which 93 features remained in the first experiment, 118 in the second experiment, and 136 in the third experiment. E-Table 3 presents the exact number of participants and outcome events for each analysis.

In the VIF analysis, features showing collinearity-multicollinearity were excluded from the models (Figure 4), resulting in 37 features in the first experiment, 49 in the second experiment, and 59 in the third experiment (Table 2, E-Figure 2).
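The feature counts reported above can be expressed as shares of the 851 extracted features with a short arithmetic check. The sketch reads the three CoV counts (93, 118, 136) as the numbers of features retained after CoV filtering, an interpretation that is consistent with the percentages quoted in the Discussion; the variable names are illustrative.

```python
# Feature counts as reported in the Results; percentages are simple shares of 851.
TOTAL_FEATURES = 851
ICC_STABLE = 611
after_cov = {"experiment 1": 93, "experiment 2": 118, "experiment 3": 136}
after_vif = {"experiment 1": 37, "experiment 2": 49, "experiment 3": 59}


def percent(count, total=TOTAL_FEATURES):
    """Share of the extracted features, rounded to the nearest whole percent."""
    return round(100 * count / total)


print(f"ICC >0.75: {ICC_STABLE}/{TOTAL_FEATURES} = {percent(ICC_STABLE)}%")
for name in after_cov:
    print(
        f"{name}: {percent(after_cov[name])}% after CoV, "
        f"{percent(after_vif[name])}% after VIF"
    )
```

Both retained counts grow from experiment 1 to experiment 3, that is, as the minimum lesion volume increases.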

In the correlation analysis, none of the radiomics features met the validation criterion in the first and second experiments. In the third experiment, all SC ‘r’ values were <0.40, and p<0.01 was reached for 12 predictors of TN breast cancer.

LASSO regression was used for regularization, and the analysis results for each outcome are shared in Table 3.

Diagnostic Prediction Model Results

Across the three experiments, 12 neural networks were retained, that is, four multivariable binary classifier models per experiment (Table 4). Confusion matrices and detailed performance metrics are presented for Luminal A in E-Table 4, for Luminal B in E-Table 5, for HER2-enriched in E-Table 6, and for TN in E-Table 7. In the validation (hold-out) set, the model trained for Luminal B in the first experiment and Luminal A in the second experiment reached AUC of 0.87 (0.73-0.99) and 0.87 (0.73-0.99), respectively. These findings indicate a high accuracy (>0.80) for Luminal B, HER2-enriched, and TN models in the first experiment; all models in the second experiment; and HER2-enriched and TN models in the third experiment.

All data used in this study, the results of the analyses, and the trained neural network codes were shared publicly on GitHub (https://github.com/MBE-hub/Breast).

DISCUSSION

Based on the results of the present study, neural networks may predict molecular subtypes of breast cancer over 1 cm³. Compared with previous studies, the present study evaluated the stability of radiomic features using the Dice similarity index, ICC, and CoV; the VIF was used to eliminate highly collinear features.

Currently, a minimally invasive approach is the most prevalent in medical practice⁵. Contrast-enhanced examinations are also considered an intervention^5,20. Therefore, we focused on ADC radiomics as an alternative to invasive imaging modalities. Chen et al.¹⁴ suggested that ADC radiomics provided a more accurate diagnosis than DCE MRI radiomics. Unlike previous studies, the present study included all breast cancers without limiting lesion size^{9,10,13,14,15,16,22,23,35}. Experiment 2, which included lesions over 1 cm³, showed the best accuracy. In the first experiment, fewer pixels were segmented in small lesions, affecting the stability of the radiomics features. Experiment 3 had a relatively limited sample size, and the validation set proportion was set to 30% to overcome this challenge and prevent overfitting.

Previous studies that assessed the molecular subtypes of breast cancer using an interobserver design did not evaluate the spatial overlap with the Dice coefficient^{9,10,13,14,15,16}. However, the evaluation of spatial overlap is recommended²⁷. Traverso et al.^36,37 performed two studies based on cervical and rectal cancer using the Dice coefficient; the median Dice coefficient for the two observers was 0.73 and 0.75, respectively, which are similar to our results. The Dice coefficient provides a sensitive analysis as it depends on the pixel-to-pixel overlap of segmentation^36,37. Given the sensitivity of the Dice method, the agreement of segmentations was almost perfect in this study. Furthermore, as mitigating certain discrepancies has been a challenge, observers re-evaluated the patients with a Dice coefficient <0.50 (n=9) to avoid bias.

In the present study, 72% of the radiomic features had an ICC value >0.75. Using super-resolution ADC images, Fan et al.²² performed a radiomics analysis to predict the histologic grade and Ki-67 expression status of breast cancer and found that shape and first-order features had an ICC >0.7, and neighborhood gray-tone difference matrix features showed large variance, with a low mean ICC. Zhang et al.³⁵ modelled multiparametric MRI to differentiate benign and malignant lesions from radiomics features and noted that all features had an ICC value of >0.75. Similarly, the present study included features with an ICC of >0.75. However, concerns about reproducibility arise when no exclusion criterion is applied for ICC^25,38 and an interobserver assessment is not considered.

The European Society of Radiology (ESR) has recently published a statement on the validation of imaging biomarkers and described the validation pipeline²⁶. The first step of this pipeline recommends evaluating features with a CoV analysis, stating, “high precision (low variance) is considered mandatory for the validation.” Due to the novelty of this statement, none of the previous studies have used this analysis. In the stability analysis, only 16% of features showed high stability even under the best condition (experiment 3: lesions over 2 cm³).

Parekh and Jacobs¹⁹ reported that multivariable models had increased AUC (9-28%). Therefore, in the present study, we used multivariable models. Given the emerging use of multivariable regressions in feature selection tasks, collinearity-multicollinearity has become an essential problem. Kim³⁹ has described multicollinearity as a high degree of linear intercorrelation between predictor variables in a multivariable regression model. If collinearity is ignored, features entering the analysis can be almost identical, thus increasing the relative error rate. In addition, features that better explain the model are ruled out due to the many identical features chosen. Although various methods have been defined, we preferred to eliminate features that show collinearity-multicollinearity in this study, and only 7% of the features remained stable after this elimination. Previous studies have not reported VIF analysis^{9,10,13,14,15,16}.

These results suggest that the stability of radiomics features is related to lesion size. The number of stable features increased with increasing lesion size. Despite the decrease in sample size in the second and third experiments, the increase in the Spearman correlation coefficient values, together with the increase in the number of significant predictors, indicates a relationship between the stability of radiomics features and lesion size.

For validated biomarkers, the third item of the ESR statement pipeline requested that the p-value be <0.01 in the correlation analysis²⁶. In the univariate analysis, a few radiomics features were validated in this study (E-Figure 2). Furthermore, their correlation coefficients were weak since LASSO regression was used for regularization in the current study^23,24,25.

Sutton et al.¹⁰ used a support vector machine that included 38 features (mostly shape and contrast-enhancement patterns) and found an accuracy of 81% for the prediction of the TN molecular subtype of breast cancer. This study used 851 features and found that all shape features, except for sphericity, were not stable in precision and accuracy. Moreover, sphericity could not be measured accurately, even in IBSI-compliant software³⁴.

Diagnostic prediction models will benefit the clinician in detecting HER2-enriched and TN tumors. Leithner et al.⁸ trained a TN ANN classifier with AUC =0.80 and 68.2% accuracy in the validation set. However, their other classifiers presented accuracies of 38.7%-70.3%. In the present study, networks trained in the >1 cm³ configuration can estimate Luminal A, Luminal B, and HER2-enriched subtypes with accuracy as high as that of the TN model; high specificity (>80%) was observed in the networks in experiment 2, with moderate to high sensitivity (33%-80%).

The study has some limitations. The retrospective study design and single-center nature limit the generalizability of results. However, MR scans at our center were performed with two different devices and four different protocols and b-values, which increased the potential diversity. For external validation, The Cancer Imaging Archive was searched, but no suitable data were found.

We used the manual segmentation method in this study because approximately 1/4 of our lesions were less than 1 cm³. In addition, a recent study showed that automatic segmentation is not a good option for small lesions⁴⁰. Fortunately, automated segmentation methods have made rapid progress²¹. Future studies with large datasets may focus on breast cancer molecular subtype discrimination using convolutional neural networks and automated segmentation methods.

Using automated ANN minimizes human-induced bias¹⁷. However, the weak linear relationship between predictors and outcomes reduces the explainability of the resulting networks. In the preliminary stage of the study, we also experimented with machine learning methods such as support vector machines and K-nearest neighbors, and we attempted to train multiclass classifier methods such as gradient descent boosting and adaptive boosting. However, all these machine learning algorithms showed clearly lower accuracy than the models used in this study.

Radiomics feature extraction yields the best results on iso-voxel partitioned images. Therefore, this study used both raw data (highly interpolated) and 2.0 mm iso-voxel images. Contrary to expectations, raw images showed better performance in this study, which was not supported by the literature, thereby limiting our discussion^8,9,10. Based on our findings, high interpolation and high slice thickness caused artificial homogeneity on the resampled images, so that model performance was no better than with raw images. Future studies should aim to increase the stability of radiomics features and model success. Especially for DWI and ADC, it is necessary to increase the matrix values, decrease the section thickness, and increase the signal-to-noise ratio.

CONCLUSION

The stability of radiomics features is positively correlated with lesion size. A diagnostic prediction model may help in triaging patients, expediting biopsy, and/or supporting histopathologic results in equivocal cases. Although this prediction does not replace biopsy, it may allow patients to be prioritized in radiology reporting, biopsy, and pathology reporting. The rapid and accurate triage of breast cancer molecular subtypes using imaging will be a potential development.