Abstract:
This study investigates the feasibility of effectively solving population-scale cancer screening problems using artificial intelligence (AI) methods that predict malignant neoplasm risk based on minimal electronic health record (EHR) data – medical diagnosis and service codes. To address the formulated problem, we considered a broad spectrum of modern approaches, including classical machine learning methods, survival analysis, deep learning, and large language models (LLMs). Numerical experiments demonstrated that gradient boosting using survival analysis models as additional predictors possesses the best ability to rank patients by cancer risk level, enabling consideration of both population-level and individual risk factors for malignant neoplasms. Predictors constructed from EHR data include demographic characteristics, healthcare utilization patterns, and clinical markers. This solution was tested in retrospective experiments under the supervision of specialized oncologists. In the retrospective experiment involving more than 1.9 million patients, we established that the risk group captures up to 5.4 times more patients with cancer at the same level of medical examinations. The investigated method represents a scalable solution using exclusively diagnosis and service codes, requiring no specialized infrastructure and integrable into oncological vigilance processes, making it applicable for population-scale cancer screening.