AI in Cancer Prevention: A Retrospective Study
Abstract
This study examines whether population-scale cancer screening can be supported by artificial intelligence (AI) methods that predict the risk of malignant neoplasms from minimal electronic health record (EHR) data: medical diagnosis codes and service codes. We evaluated a broad range of modern approaches, including classical machine learning, survival analysis, deep learning, and large language models (LLMs). In numerical experiments, gradient boosting that uses survival analysis models as additional predictors showed the best ability to rank patients by cancer risk, because this combination accounts for both population-level and individual risk factors for malignant neoplasms. The predictors constructed from EHR data include demographic characteristics, healthcare utilization patterns, and clinical markers. The solution was tested in retrospective experiments under the supervision of specialized oncologists. In a retrospective experiment involving more than 1.9 million patients, the risk group captured up to 5.4 times more patients with cancer for the same number of medical examinations. The method is scalable: it uses only diagnosis and service codes, requires no specialized infrastructure, and can be integrated into oncological vigilance processes, which makes it applicable to population-scale cancer screening.
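The reported gain, more patients with cancer found for the same number of examinations, is a lift-style ranking metric. The sketch below is a minimal illustration of one way to compute such a ratio. It assumes the baseline is examining the same number of patients chosen at random; the abstract does not state the baseline. The function name `lift_at_budget`, its parameters, and the toy data are hypothetical and are not the authors' evaluation code.

```python
def lift_at_budget(risk_scores, has_cancer, budget):
    """Lift of a risk ranking at a fixed examination budget.

    Returns the number of cancer cases among the `budget` highest-scored
    patients divided by the number expected when `budget` patients are
    examined at random (assumed baseline, not taken from the paper).
    """
    n = len(risk_scores)
    if n != len(has_cancer):
        raise ValueError("risk_scores and has_cancer must have equal length")
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of patients")
    total_cases = sum(has_cancer)
    if total_cases == 0:
        raise ValueError("lift is undefined when there are no cancer cases")

    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(risk_scores, has_cancer),
                    key=lambda pair: pair[0], reverse=True)

    # Cases captured by examining only the top-`budget` patients.
    found = sum(label for _, label in ranked[:budget])

    # Cases expected from examining `budget` randomly chosen patients.
    expected_random = total_cases * budget / n
    return found / expected_random


# Toy data (hypothetical): 10 patients, 2 of whom have cancer.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# With a budget of 2 examinations, the ranking finds 1 case,
# while random selection would be expected to find 0.4.
print(lift_at_budget(scores, labels, budget=2))
```

A lift of 1 means the ranking is no better than random selection. Examining every patient always gives a lift of exactly 1, so the metric is only informative for budgets well below the population size.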

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically consent to grant Kazan (Volga) Federal University (KFU) a limited license to use the materials, provided that the article is accepted for publication. Under this license, KFU has the right to publish the article in the next issue of the journal (on the website or in print), to reprint it in the RDLJ archives on CD, and to include it in any information system or database produced by KFU.