ABSTRACT
Objective:
This study aimed to classify open-access gene expression data of patients with hepatitis B virus-related hepatocellular carcinoma (HBV + HCC) and chronic HBV without HCC (HBV alone) using the XGBoost method, one of the machine learning methods, and reveal important genes that may cause HCC.
Methods:
This case-control study used open-access gene expression data of patients with HBV + HCC and HBV alone. Data from 17 patients with HBV + HCC and 36 patients with HBV alone were included. An XGBoost model was constructed for the classification and evaluated via 10-fold cross-validation. Model performance was assessed using accuracy, balanced accuracy, sensitivity, specificity, positive-predictive value, negative-predictive value, and F1 score.
Results:
Eighteen genes were selected by the feature-selection method, and modeling was performed with these input variables. Accuracy, balanced accuracy, sensitivity, specificity, positive-predictive value, negative-predictive value, and F1 score obtained from the XGBoost model were 98.1%, 98.6%, 100%, 97.2%, 94.4%, 100%, and 97.1%, respectively. Based on the predictor importance findings acquired from XGBoost, RNF26, FLJ10233, ACBD6, RBM12, PFAS, H3C11, and GKP5 can be employed as potential biomarkers of HBV-related HCC.
Conclusions:
In this study, genes that may be possible biomarkers of HBV-related HCC were determined using a machine learning-based prediction approach. After the reliability of the obtained genes is clinically verified in subsequent research, therapeutic procedures can be established based on these genes, and their usefulness in clinical practice may be documented.
INTRODUCTION
Current epidemiological and clinical data indicate that primary liver cancer is the sixth most frequently diagnosed cancer and the fourth leading cause of cancer-related death worldwide1. Approximately 841,000 people are diagnosed with primary liver cancer each year, and 782,400 people die from it. Hepatocellular carcinoma (HCC) accounts for the majority of primary liver cancer cases. HCC is the world’s fifth most common malignant tumor, with the second-highest mortality rate among malignant tumors2,3. The most important risk factors associated with HCC are hepatitis B virus (HBV), hepatitis C virus, alcohol abuse, and non-alcoholic fatty liver disease4.
HBV infection is a global public health problem that causes significant morbidity and mortality. HBV is responsible for more than half of all HCC cases worldwide. The proportion of HCC attributable to HBV reflects the geographic distribution of HBV infection and varies significantly, accounting for <20% of all HCC cases in the United States and up to 65% in China and the Far East. Chronic HBV carriers have a 10- to 25-fold higher lifetime risk of developing HCC than non-infected individuals5.
Epidemiological studies have shown that many risk factors, especially hepatotropic viruses such as HBV, affect HCC development. Three basic mechanisms are suggested for HCC development from the background of HBV infection: (1) development of chronic inflammation and hepatocyte regeneration during the HBV infection process, (2) activation of the host genes responsible for proliferation as a result of the integration of the HBV DNA into the host genome, and (3) promotion of cell proliferation by HBV-related proteins (HBx, etc.)6. These mechanisms indicate that HBV-related HCC is not only a clinical disease but also a disease with a genetic basis. The biologically different behavioral patterns of the tumor indicate that genetic and epigenetic aberrations may be important in the development and course of HCC7,8. With the detection of genetic and epigenetic anomalies in the pathogenesis of HCC, studies on the molecular pathogenesis of HCC have gained tremendous momentum in the last two decades. These studies analyze thousands of genes and the transcription and translation pathways associated with them, which is a complex and challenging process. Therefore, artificial intelligence (AI) models are needed to analyze thousands of data points and interpret the analyses.
Machine learning (ML) is a subfield of AI that learns patterns from existing data and uses them to make predictions about new data. AI/ML methods have been among the technologies most widely used in disease diagnosis and clinical decision support systems in recent years. In health care, ML constitutes the basic infrastructure of applications for determining genetic diseases, diagnosing cancer early, and identifying patterns in medical imaging. In the last decade, with the availability of large datasets and greater computing power, ML methods have achieved high performance in various situations9,10. At present, it is essential to diagnose HCC, determine or predict the genes that cause HCC as biomarkers, and use them with respect to the HCC stage. Thus, many studies have used ML methods to identify genes that may be biomarkers related to HCC11. One study used gene expression profiling and supervised ML to predict HBV-positive metastatic HCCs12. In another study, genes that could be biomarkers were identified by ML methods using genome-wide data to predict relapse in patients with HCC13. This study aimed to classify open-access gene expression data of patients with HBV-related HCC (HBV + HCC) and chronic HBV without HCC (HBV alone) using XGBoost and reveal important genes that may cause HCC.
MATERIALS and METHODS
Study Design and Data
This retrospective case-control study applied XGBoost, one of the ML methods, to open-access gene expression data of patients with HBV-related HCC and chronic HBV without HCC. Data from 17 HBV-related HCC and 36 chronic HBV samples were analyzed. Complementary DNA (cDNA) microarrays obtained from liver samples were used14. cDNA refers to DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase; it is the double-stranded DNA version of the mRNA molecule. In eukaryotes, mRNA is more helpful than the genomic sequence in determining the polypeptide sequence because introns have been spliced out. Researchers nevertheless prefer to work with cDNA rather than mRNA because RNA is inherently less stable than DNA. In addition, RNA is less amenable than DNA to standard amplification and purification techniques. To produce cDNA, mRNA is used as a template, and reverse transcriptase synthesizes a single-stranded DNA molecule, which is then used to synthesize double-stranded DNA15.
Feature Selection
Variable selection is an essential step in predictive modeling. One of the most critical steps in developing a statistical model is deciding which data to include in the model. Before working with large datasets and models with high computational costs, determining the most valuable features of the dataset makes the analysis more efficient. Feature selection identifies the most prominent features that affect a dataset’s dependent variable. The use of numerous explanatory variables can lead to long computation times, a risk of overfitting (overlearning) the data, and biased results. In addition, models created with numerous variables are challenging to interpret. Therefore, selecting the important variables that affect the dependent variable before statistical modeling is recommended16. Most ML and data-mining methods can produce ineffective results when working with high-dimensional data and give more effective results when the dimensionality is reduced17.
Gene expression datasets are large and complex and include raw data for the analyses. Because of their size, modeling analyses take a long time and can be computationally inefficient. As a result of this high dimensionality, the model’s performance may suffer. A classification algorithm can also overfit the training samples and generalize poorly to new samples if the dataset contains numerous genes. In this study, LASSO, one of the feature-selection methods, was used to address these problems. The LASSO method requires that the sum of the absolute values of the model parameters be less than a fixed value (upper limit). The method achieves this by penalizing the coefficients of the regression variables, causing some of them to shrink to exactly zero. LASSO is particularly suited to datasets with many variables and few observations. Furthermore, by removing variables unrelated to the response variable, LASSO improves model interpretability and reduces overfitting18.
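The shrinkage behavior described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the cyclic coordinate descent solver, the function names, the penalty value, and the five-sample toy data are all assumptions chosen for illustration, and the features and response are assumed to be centered.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within [-penalty, penalty] become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/2n) * sum((y - X b)^2) + penalty * sum(|b|) by cyclic coordinate descent.

    X is a list of rows with centered features; y is a list of centered responses.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for i in range(n):
                fitted_without_j = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
                norm += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return coef


# Hypothetical toy data: feature 0 drives the response, feature 1 is noise.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.5], [1.0, -1.0], [2.0, 0.5]]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

penalized = lasso_coordinate_descent(X, y, penalty=0.5)
unpenalized = lasso_coordinate_descent(X, y, penalty=0.0)
selected = [j for j, b in enumerate(penalized) if b != 0.0]
```

With the penalty applied, the coefficient of the noise feature is set exactly to zero and only the informative feature remains in `selected`, whereas the unpenalized fit keeps both coefficients nonzero.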
XGBoost Algorithm
Gradient boosting is a powerful ML technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak predictive models, typically decision trees. Because it is based on the boosting method, gradient boosting constructs many weak learners in sequence and combines them into a single strong model19.
XGBoost, the abbreviation for extreme gradient boosting, is an implementation of gradient boosting machines, which are among the most effective supervised learning algorithms. Its basic structure is built on gradient boosting and decision-tree algorithms. Compared with other algorithms, it is in a very advantageous position regarding speed and performance. XGBoost has high predictive power, has been reported to be about 10 times faster than comparable algorithms, and includes several regularization terms that improve overall performance and reduce overfitting (overlearning). Gradient boosting is an ensemble method that combines weak classifiers through boosting to create a robust classifier. The strong learner is trained iteratively, starting with a basic learner. Gradient boosting and XGBoost follow the same principle and differ mainly in implementation. By using different regularization techniques to control the complexity of the trees, XGBoost can achieve better performance19.
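The sequential principle described above can be sketched in a few lines. This is plain gradient boosting with one-split trees (stumps) and squared-error loss, not XGBoost itself: it omits the regularization and other refinements of XGBoost, and the function names, learning rate, and toy data are assumptions made for illustration.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) of the least-squares split on one feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def gradient_boost(x, y, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            pi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if xi <= threshold else right)
        for threshold, left, right in stumps
    )


# Hypothetical toy data: one expression value per sample and a 0/1 group label.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
label = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(expression, label)
```

Each round corrects only a fraction (the learning rate) of the remaining error, so the ensemble approaches the labels gradually; this is the sense in which many weak learners combine into a strong one.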
Bioinformatics Analysis
For patients (HBV + HCC and HBV alone) whose gene expression profiles were examined, differential expression analyses were performed using the limma package in the R programming language20. Differential expression analysis is the statistical analysis of normalized expression data to find quantitative differences in expression levels between groups. A pipeline was designed for the relevant analyses in the R software environment. The results are presented as a table of genes in order of importance and a graph that visualizes differentially expressed genes. The result table contains adjusted p and log2-fold change (Log2FC) values, and genes with the smallest adjusted p values are considered the most reliable. Log2FC >1 was used to identify upregulated genes, and Log2FC <-1 was used to identify downregulated genes21. A volcano plot was drawn to quickly highlight genes with large changes.
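The Log2FC thresholds above can be expressed as a short sketch. The function names and the example genes with their values are hypothetical and serve only to show how the cut-offs and the conversion from Log2FC to a fold change work; the adjusted p value is not considered here.

```python
def classify_gene(log2fc, threshold=1.0):
    """Label a gene using the Log2FC > 1 / Log2FC < -1 cut-offs."""
    if log2fc > threshold:
        return "upregulated"
    if log2fc < -threshold:
        return "downregulated"
    return "below threshold"


def fold_change(log2fc):
    """Convert a Log2FC value to the size of the fold change (direction is given by the sign)."""
    return 2 ** abs(log2fc)


# Hypothetical example values, not taken from the study tables.
example_log2fc = {"GENE_A": -1.5, "GENE_B": 0.4, "GENE_C": 1.2}
labels = {gene: classify_gene(value) for gene, value in example_log2fc.items()}
```

A Log2FC of -1 or 1 corresponds to a two-fold decrease or increase, which is why the cut-offs select genes whose expression at least doubles or halves between the groups.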
Study Protocol and Ethics Committee Approval
This study, which used the National Center for Biotechnology Information Gene Expression Omnibus open-access dataset involving human participants, was conducted in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Ethical approval was obtained from the Inonu University Institutional Review Board for Non-Interventional Clinical Research (decision no: 2022/3646, date: 07.06.2022). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline was used to assess the likelihood of bias and overall quality of this study.
Statistical Analysis
The Shapiro-Wilk test of normality was used to determine whether the variables followed a normal distribution. Data were given as median (minimum-maximum) or mean ± standard deviation. The Mann-Whitney U test was employed to compare non-normally distributed data, and independent-sample t-tests were used to compare normally distributed data, where appropriate. Logistic regression analysis was performed to estimate each gene’s odds ratio (OR) (a measure of effect size). The Hosmer-Lemeshow goodness-of-fit test and the omnibus test of model coefficients were calculated for logistic regression. A p value <0.05 was considered significant. IBM SPSS Statistics, version 25.0, was used in the analysis.
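As a brief illustration of how an OR follows from logistic regression, the sketch below converts a regression coefficient (the change in log-odds per one-unit increase in expression) to an OR. The function name and the example coefficient are hypothetical; the study itself obtained its ORs from SPSS.

```python
import math


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds per unit) to an odds ratio."""
    return math.exp(coefficient)


example_coefficient = math.log(10)  # hypothetical coefficient, chosen so that the OR is 10
example_or = odds_ratio(example_coefficient)
```

A coefficient of zero gives an OR of 1 (no association), and larger positive coefficients give ORs that grow exponentially, which is why very large ORs can arise from moderate coefficients.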
Modeling Process
XGBoost, one of the ML methods, was used in the modeling. Analyses were conducted using the n-fold cross-validation method. In n-fold cross-validation, the data are first divided into n parts. In each round, one of the n parts is used for testing, whereas the other n-1 parts are used for training the model, so that every part serves as the test set once. The mean of the values obtained across the rounds is taken as the cross-validation result. In this study, 10-fold cross-validation was employed for the modeling process. Accuracy, balanced accuracy, sensitivity, specificity, positive-predictive value, negative-predictive value, and F1 score were used as performance evaluation criteria. In addition, variable importances were calculated, which indicate how much each input variable contributes to explaining the output variable.
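The splitting scheme described above can be sketched as follows. The function names, the shuffling with a fixed seed, and the `evaluate` callback are assumptions made for illustration; the source does not state how the folds were formed or whether they were stratified by group.

```python
import random


def k_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    # Deal the shuffled indices into n_folds parts of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in indices if i not in held_out]
        yield train_indices, test_indices


def cross_validate(n_samples, evaluate, n_folds=10):
    """Average a user-supplied evaluate(train, test) score over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, n_folds)]
    return sum(scores) / len(scores)


# 53 samples, as in this study, split into 10 folds.
splits = list(k_fold_splits(53, n_folds=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 53 samples and 10 folds, three folds hold six samples and seven hold five, so each model is trained on 47 or 48 samples and every sample is tested exactly once.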
RESULTS
This study included 53 patients (HBV + HCC =17; HBV alone =36), of whom 42 were male and 11 were female. The mean age of the patients was 54.91±13.76 years. In the HBV + HCC group, 15 patients were male and two were female; in the HBV alone group, 27 patients were male and nine were female. The mean age of patients with HBV + HCC was 60.47±9.01 years, and the mean age of patients with HBV alone was 52.28±14.90 years. The dataset contains 8516 expressions. The first 10 results of the bioinformatics analysis, ordered by minimum adjusted p value, are summarized in Table 1. As shown in Table 1, five genes (ID: 1474, 1817, 6277, 4496, and 7165) were downregulated, and the other five genes were upregulated.
Table 2 presents descriptive statistics for the selected genes by group. According to Table 2, Log2FC values for IGFBP3, HGFAC, SLC39A14, CXCL12, PLG, FBP1, RNF26, ACBD6, C8A, and CCT3 were -1.54, -1.79, -1.06, -0.96, -0.92, -1.27, 0.46, 0.65, 1.02, and 0.66, respectively. Significant differences between the groups were found in PFAS, FRA16B, GCNT2, GKP5, MEN1, MUC4, RBM12, RNF26, TIMP3, MCM3, VPS28, CRY1, SF3B2, H3C11, ACBD6, and FLJ10233 (p<0.05). CYP24A1 and the Homo sapiens chromosome 5 clone RP11-998B18 complete sequence were not significantly different between the groups (p>0.05).
The volcano plot used to visualize differentially expressed genes is given in Figure 1. The plot shows statistical significance on the y-axis against the log2 fold change on the x-axis, which allows differentially expressed genes to be identified quickly.
Applying the LASSO feature-selection method to the 8516 expressions yielded 18 selected expressions. The explanations of the dataset with the selected expressions, the examined target variable, and the OR per gene for the target variable are presented in Table 2. The performance metrics of the XGBoost model are provided in Table 3.
Accuracy, balanced accuracy, sensitivity, specificity, positive-predictive value, negative-predictive value, and F1 score obtained from the XGBoost model were 98.1%, 98.6%, 100%, 97.2%, 94.4%, 100%, and 97.1%, respectively. The performance criteria values are plotted for the XGBoost model in Figure 2. Figure 3 shows the importance levels of expressions for the selected genes in explaining the output variable. RNF26 had the highest predictor importance of 100.0%, followed by FLJ10233 at 66.21% and ACBD6 at 51.47%.
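The relationship among these metrics can be checked with a short sketch. The confusion-matrix counts below are not reported in the text; they are inferred here as the counts that are consistent with 17 HBV + HCC and 36 HBV alone patients and with the reported percentages (all 17 HCC cases detected and one HBV alone patient misclassified), and should be read as an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported performance metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Inferred counts (assumption): HBV + HCC is the positive class.
metrics = classification_metrics(tp=17, fp=1, tn=35, fn=0)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
```

Under these assumed counts, the rounded percentages match the values reported for the XGBoost model, which suggests that the model misclassified a single patient across the cross-validation folds.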
DISCUSSION
Although the gene expression profiles of HCC and the background liver have been widely examined14, ML-based prediction of HBV-related HCC and detection of crucial candidate biomarkers using an AI approach have not been clarified. Hence, this study aimed to classify HBV-related HCC and HBV without HCC gene expression data using the XGBoost method and identify important genes that may cause HCC.
HBV is widespread worldwide, with varying levels of infection in different regions. According to the World Health Organization, approximately two billion people have been infected with HBV worldwide, 240 million of whom are chronically infected, and approximately 650,000 people die annually from hepatic failure, liver cirrhosis, and HCC caused by HBV infection. HBV infection is responsible for 30% and 45% of cases of liver cirrhosis and HCC worldwide, respectively22,23.
The overall survival of patients with HCC is low, and the management of HCC risk factors needs to be rationally expanded to reduce the burden of HCC worldwide. There is growing interest in genomics and molecular biology research to identify early diagnostic markers, prognostic markers, and new therapeutic targets, to uncover the mechanisms of liver carcinogenesis, and thus to improve the clinical management of patients with HCC. Building on these studies, advances in HCC surveillance promise to significantly reduce the worldwide burden of HCC over the next few decades24,25.
In the dataset analyzed in this study, genomic data of samples obtained from liver tissues of 17 patients with HBV-related HCC and 36 with chronic HBV without HCC were used for the relevant analyses. cDNA microarrays were obtained from the samples, and the dataset contained 8516 expressions. According to the Log2FC values used to determine the expression fold changes between the two groups in the bioinformatics analyses (Table 2), IGFBP3 had 2.90-fold lower gene expression in patients with HBV-related HCC than in patients with chronic HBV. Similarly, HGFAC had 3.45-fold, SLC39A14 had 2.08-fold, and FBP1 had 2.41-fold lower gene expression, whereas C8A, whose Log2FC was positive (1.02), had 2.02-fold higher gene expression. CXCL12, PLG, RNF26, and ACBD6 showed smaller differences between the two groups that did not reach the |Log2FC| >1 threshold. Because gene expression data are so large, modeling with these datasets can result in long analysis times and computational inefficiency. Therefore, before modeling with the existing dataset, the most important genes associated with the output variable were selected with the LASSO variable-selection method. The 18 genes selected by the LASSO method were used in building the XGBoost model. The accuracy, balanced accuracy, sensitivity, specificity, positive- and negative-predictive values, and F1 score obtained with the XGBoost model were 98.1%, 98.6%, 100%, 97.2%, 94.4%, 100%, and 97.1%, respectively. These performance metrics indicate that the proposed XGBoost model could correctly classify the two groups of patients. Among the genes whose OR values were calculated, RNF26 (OR =722), VPS28 (OR =225), RBM12 (OR =170), ACBD6 (OR =57), FLJ10233 (OR =26), H3C11 (OR =22), PFAS (OR =10), and TIMP3 (OR =10) had the highest OR values, in descending order. According to the variable importance obtained from XGBoost, RNF26, FLJ10233, ACBD6, RBM12, PFAS, H3C11, and GKP5 can be used as candidate predictive biomarkers of HBV-related HCC.
In addition, the calculated OR values and variable importance values in the study support each other. According to the variable importance results, genes with large OR values were identified as genes contributing to HBV-related HCC development. Additionally, the proposed pipeline produced a volcano plot representing the up- and downregulation of the genes. Such plots are becoming more common in omics experiments, such as genomics, proteomics, and metabolomics, where there are often thousands of replicate data points between two conditions26.
A medical study reported that RNF26 was abnormally expressed in patients with HCC27. In another study, VPS28 was upregulated28. Another study showed that a high RBM12 level in HCC indicates a poor patient prognosis29. One study reported that ACBD6 was expressed differently in HCC and chronic hepatitis30. In a study, high-grade tumors exhibited progressively higher levels of PFAS, ATIC, IMPDH1, IMPDH2, GMPS, and ADSL than low-grade tumors or normal liver tissue31. In one study, TIMP3 was found as a candidate gene in HBV-related HCC32. Another study determined that epigenetic methylation of TIMP3 is associated with HBV-associated HCC33.
In a study, SHCBP1, FOXM1, KIF4A, ANLN, KIF15, KIF18A, FANCI, NEK2, ECT2, and RAD51AP1 were found as the top 10 most important genes for HBV-related HCC34. In addition, patients with FOXM1, NEK2, RAD51AP1, ANLN, and KIF18A showed worse overall survival. In another study with HCC, the expression levels for PER1, PER2, PER3, and CRY2 genes were lower35. Another study showed that high expression of FOXM1 causes a poor prognosis for HBV-related HCC and promotes tumor metastasis36.
All diseases that cause chronic liver damage are risk factors for HCC development. Therefore, following up such patients in accordance with international guidelines is crucial for detecting possible HCC, ideally at an early stage37. The most authoritative guidelines on monitoring patients with chronic liver disease are published periodically by the European Association for the Study of the Liver, the Asian-Pacific Association for the Study of the Liver, and the American Association for the Study of Liver Diseases37. The tumor doubling time of HCC varies between 4 and 6 months. Therefore, these guidelines suggest that patients with chronic liver disease without HCC should be followed up with ultrasonography (US) and alpha-fetoprotein (AFP) at 6-month intervals37. Patients with suspected HCC (nodule diameter <10 mm) should be followed up with US and AFP at 3- or 6-month intervals. Patients with a strong suspicion of HCC should be followed up with US and AFP. Patients with nodule diameter >10 mm and/or AFP >20 ng/mL should be evaluated further with radiological examinations37.
However, these approaches may not always provide the expected results because it is not always easy for patients to reach healthcare providers in underdeveloped or developing countries. False-negative results may be more frequent than expected because US is an operator-dependent examination. There is a correlation between the duration of chronic liver disease and the probability of HCC development. As in all other cancer types, gene mutations and mutation-related mRNA expression changes are expected in HCC. Therefore, in the follow-up of patients with chronic liver disease, fundamental genetic analysis can be performed after a certain period to determine whether there is a genetic mutation. As our results suggest, if changes are detected in the expression of genes that are strongly associated with HCC, patients can be followed more closely, and preventive treatments can be initiated when necessary. However, there are no evidence-based data on when genetic analysis should be performed in patients with chronic liver disease. Therefore, a prospective multicenter study is needed to determine the timing of genetic analysis for these patients. Building on the present findings, increasing the number of patients may further broaden the scope of genetic information and increase the power of the study.
CONCLUSION
In conclusion, this study revealed possible genomic biomarkers of HBV-related HCC using gene expression data from patients with HBV-related HCC and patients with chronic HBV alone. Future, more comprehensive analyses can test the reliability of these genes, treatment approaches can be developed based on them, and their usability in clinical practice can be described in detail. Thus, individualized treatments and immunotherapy approaches that are more applicable to clinical practice may become possible.


