Optimizing long term disease prevention with reinforcement learning: a framework for precision lipid control

Study design and participants

This study included patients who had utilized public healthcare services provided by the Hong Kong Hospital Authority (HA) since 2004. HA is the largest public healthcare provider in Hong Kong, offering government-subsidized primary, secondary, and tertiary care to all residents. It accounts for over 70% of all hospitalizations in Hong Kong²³. Previous research has confirmed the reliability of the HA’s data source which has been extensively used in multinational collaborative studies⁴⁴, including research on CVD and CVD drug studies⁴⁵, with a positive predictive value of 85% for myocardial infarction and 91% for stroke⁴⁶.

Two patient cohorts were identified based on their primary location of residence in Hong Kong: Hong Kong Island (Hong Kong West Cluster, HKWC) and Kowloon and New Territories. The Hong Kong Island (HKWC) cohort was utilized for model development, while the Kowloon and New Territories cohort served for model validation, ensuring there was no overlap between the development and validation groups. Specifically, the Hong Kong Island (Hong Kong West Cluster) cohort consisted of patients aged 18 or above who had undergone a lipid test at a hospital within the Hong Kong West Cluster between January 1, 2004, and December 31, 2019, as identified by the Hospital Authority. The Kowloon and New Territories cohort included patients aged 35 or above whose blood pressure was recorded in the Hospital Authority’s database between January 1, 2005, and December 31, 2019. Patients who predominantly sought healthcare on Hong Kong Island and those without a lipid test record during the study period were excluded. The cohort entry date was defined as the date of their first lipid test in any inpatient or outpatient setting since 2004. Patients were censored at the earliest occurrence of the first recorded CVD diagnosis, registered death, or the study’s end date (December 31, 2019). Patients who experienced a CVD event before the first lipid test or who died on the same day as the test were excluded from the cohort. The primary outcome was the initial diagnosis of CVD, as defined by the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes. The outcome was a composite measure encompassing coronary heart disease, ischemic or hemorrhagic stroke, peripheral artery disease, and congestive heart failure (see Supplementary Table 2).

Patient trajectory selection and formalization

To formalize patient trajectories, we defined states as the time steps of each lipid test and actions as the choice of LMD prescription between states. The trajectory consisted of repeated state-action pairs until reaching the cohort end date, with a final state indicating the occurrence of CVD within one year after the last state.

We selected representative real-world patient trajectories by applying a filtration process. (1) We excluded patients with fewer than two lipid test records during the study period to ensure consecutive trajectories. (2) Considering clinical guidelines and common practice, we excluded trajectories with visit intervals of less than one month or more than two years, so that the retained trajectories reflect routine follow-up. (3) Trajectories with incomplete lipid profiles (missing LDL-C, HDL-C, or triglyceride measurements) were excluded. Each patient state included a risk profile comprising 90 features, such as disease history, laboratory test results, healthcare utilization, and medication count. Disease history encompassed any previous diseases recorded before the state, identified using ICD-9-CM codes (refer to Supplementary Table 3 for details). Laboratory test results were obtained on the same date as the state. Healthcare utilization was determined by the number of visits within one year prior to the state’s date. Medication count referred to the number of different drugs with different British National Formulary (BNF) codes prescribed within one month prior to the state’s date (refer to Supplementary Table 4 for different drugs identified).
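Filtration steps (1) and (2) above can be sketched as a simple check on the gaps between consecutive lipid tests. This is a minimal illustration, not the authors' code: the function name `keep_trajectory`, the representation of test dates as day offsets, and the conversion of "one month" and "two years" to 30 and 730 days are all assumptions made for the example.

```python
def keep_trajectory(test_days, min_gap=30, max_gap=730):
    """Return True if a patient's lipid-test sequence passes steps (1) and (2).

    test_days: sorted day offsets of the patient's lipid tests (hypothetical input).
    min_gap / max_gap: assumed day equivalents of one month and two years.
    """
    # Step (1): at least two lipid tests are needed to form a trajectory.
    if len(test_days) < 2:
        return False
    # Step (2): every visit interval must lie between min_gap and max_gap days.
    gaps = [later - earlier for earlier, later in zip(test_days, test_days[1:])]
    return all(min_gap <= gap <= max_gap for gap in gaps)


print(keep_trajectory([0, 90, 180]))   # regular follow-up
print(keep_trajectory([0, 10, 100]))   # second test too soon after the first
```

Whether the boundary values themselves (exactly one month, exactly two years) are retained is not stated in the text; the sketch keeps them.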

Representing the actions, which involve the specific LMDs or combinations of LMDs taken by patients during each interval between two consecutive states, poses significant challenges. The task becomes even more demanding when attempting to identify a series of actions from a sequence of LMD records associated with lipid tests, as the prescribed medications and laboratory records often do not align perfectly. A typical scenario involves lipid tests occurring at the 0th, 3rd, 6th, and 18th months of a patient’s trajectory, while a particular LMD is prescribed from the 3rd to the 9th month. In this case, the action for the first interval (0–3 months) is clear, indicating no LMD was taken. The action for the second interval (3–6 months) is also evident, representing the specific LMD prescribed during that period. However, the third interval (6–18 months) presents ambiguity, as the prescription only covers (9 – 6) / (18 – 6) = 25% of the interval. Determining whether to consider the third action as a continuation or discontinuation of the LMD becomes uncertain and deciding whether to include trajectories with such ambiguous actions poses a challenging choice. The complexity further escalates when multiple types of LMDs need to be considered simultaneously.

To prioritize representative and high-confidence trajectories, we implemented an empirical strategy consisting of the following steps:

1. Calculating LMD Coverage. We calculated the coverage of LMDs for each interval between two consecutive lipid tests. For instance, if a patient had a total prescription of simvastatin 10 mg covering half of a six-month interval, the coverage for simvastatin 10 mg would be 50%. We performed this calculation for multiple types of LMDs recorded in the database, considering each interval within the patient trajectory.

2. Excluding Ambiguous Trajectories. To ensure the quality of included trajectories, we considered any trajectory that had intervals with LMD coverage ranging from 1% to 50% as ambiguous in terms of drug continuation and discontinuation. Consequently, we excluded the entire trajectory if any intervals fell within this coverage range. If an interval had multiple LMDs prescribed with coverage above 50%, it was considered a combination of LMDs. Thus, we only considered trajectories that were unambiguous in terms of no drug, drug initiation, continuation, and discontinuation throughout their entire trajectory.

3. Handling False Combination of LMDs. Consecutive intervals might exhibit false combinations of LMDs, e.g., representing early transitions between drugs where the prior prescription was long enough to cover the next interval by more than half. To mitigate these artifacts, we examined the prescriptions in the last interval of each patient trajectory, which are generally more stable as they approach the end. The set of prescriptions in the last interval was considered the final set of actions. We removed patient trajectories that had prescriptions not matching the defined set of actions.

By following this approach, we were able to define patient actions in terms of LMD types, ensuring representative and reliable trajectories.
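The coverage calculation and the ambiguity rule can be illustrated with the scenario described earlier (lipid tests at months 0, 3, 6, and 18; one LMD prescribed from month 3 to month 9). The sketch below is an assumption-laden illustration: the helper names `coverage` and `classify`, the use of month numbers instead of calendar dates, the merging of overlapping prescriptions, and the treatment of coverage below 1% as "off" are choices made for the example rather than details given in the text.

```python
def coverage(interval_start, interval_end, prescriptions):
    """Fraction of [interval_start, interval_end) covered by one LMD.

    prescriptions: list of (start, end) periods for that LMD, in the same
    time unit as the interval. Overlapping periods are counted once.
    """
    length = interval_end - interval_start
    # Clip each prescription to the interval; drop those entirely outside it.
    clipped = sorted(
        (max(start, interval_start), min(end, interval_end))
        for start, end in prescriptions
        if start < interval_end and end > interval_start
    )
    covered, cursor = 0, interval_start
    for start, end in clipped:
        start = max(start, cursor)      # skip time already counted
        if end > start:
            covered += end - start
            cursor = end
    return covered / length


def classify(cov, low=0.01, high=0.50):
    """Label an interval for one LMD: 1% to 50% coverage is ambiguous."""
    if cov < low:
        return "off"
    if cov <= high:
        return "ambiguous"
    return "on"


tests = [0, 3, 6, 18]        # months of the lipid tests
prescription = [(3, 9)]      # one LMD prescribed from month 3 to month 9
labels = [classify(coverage(a, b, prescription)) for a, b in zip(tests, tests[1:])]
print(labels)  # ['off', 'on', 'ambiguous']
```

Under the exclusion rule in step 2, the single ambiguous interval (25% coverage between months 6 and 18) would remove this whole trajectory.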

Risk-based state representation

In order to incorporate the overall CVD risk level into each state, we aimed to quantify the contribution of individual features within the states. To achieve this, we performed survival analysis on the development cohort, considering the start time of their last state until the observation of CVD occurrence. Initially, we conducted a robust feature selection process to identify significant features associated with CVD occurrence³⁰. For statistical reliability and clinical relevance, we selected features without missing values (e.g., clinical laboratory tests) and an event rate above 1% (e.g., disease and medication history). The Cox proportional hazards model (CPH)⁴⁷ with least absolute shrinkage and selection operator (LASSO) regularization⁴⁸ was employed to identify statistically significant features (p-value < 0.05). The CPH model is widely used for survival analysis, and its regression coefficients can be interpreted as hazard ratios, facilitating better decision-making by clinicians. LASSO is a robust feature selection method that chooses a representative and independent set of features, ensuring reliability for downstream manual prioritization. The final set of features was also determined based on current clinical evidence to ensure comprehensiveness and relevance to CVD prognosis. Subsequently, we applied CPH with ridge regularization on the final feature set to quantify the contribution of each identified feature to CVD occurrence. Ridge regularization, a widely used stabilizer of regression coefficients, provided reliable estimates of hazard ratios for the risk variables. The contribution of each feature was represented as the natural logarithm of the hazard ratio. The calculation details for the state number are provided in Supplementary Fig. 2. The SHAP values⁴⁹ of the features were consistent with their corresponding hazard ratios (Supplementary Fig. 4). The feature selection results are presented in Supplementary Tables 5–7.

To assess the overall risk, we calculated a prognostic index (PI) for each patient by summing the contributions of individual features. The PI allowed us to unify the overall risk of different patients on the same scale. Next, we sorted the PIs of patients in the development cohort and divided them proportionally into clusters, with each cluster corresponding to a state number. Consequently, the state number now incorporates information about CVD risk, and its increase reflects an increasing CVD risk in an interpretable and transparent manner. It is important to note that the specific number of clusters (i.e., states) was determined through manual prioritization based on qualitative evaluation of the RL policy decision boundary during model development. We added 200 to the state number to indicate states after the initiation of LMDs (ranging from 200 to 399), distinguishing them from states without prior LMD usage (ranging from 0 to 199). This state representation, which accurately captures the patient’s overall CVD risk, enables the RL agent to make more informed decisions. Furthermore, an added advantage of this state representation is that actions considered within the same state number pool share a similar baseline CVD risk, which helps mitigate selection bias. Selection bias, a significant concern in retrospective studies, occurs when higher-risk patients are more likely to be prescribed high-intensity LMDs and may still experience a higher risk of CVD compared to low-risk patients using low-intensity LMDs. This approach also facilitated a direct comparison of the safety line for LDL-C. For example, by accounting for the coefficients of 0.43 for diabetes and 0.18 for LDL-C per unit increase, we can determine that patients with diabetes and an LDL-C level of 3 mmol/L have an approximate CVD risk similar to patients without diabetes but with an LDL-C level of 5 mmol/L.
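The prognostic index and the state numbering can be sketched as follows. This is a minimal illustration under stated assumptions: only the two coefficients quoted above (0.43 for diabetes, 0.18 per mmol/L of LDL-C) are used, "divided proportionally" is interpreted as equal-sized quantile groups, 200 risk groups are assumed from the stated ranges (0 to 199 and 200 to 399), and all function names and the synthetic development-cohort values are hypothetical.

```python
import bisect


def prognostic_index(features, log_hazard_ratios):
    """Sum of feature values weighted by their log hazard ratios."""
    return sum(log_hazard_ratios[name] * value for name, value in features.items())


def quantile_cut_points(dev_pis, n_states):
    """Cut points splitting sorted development-cohort PIs into n_states groups."""
    ordered = sorted(dev_pis)
    return [ordered[len(ordered) * k // n_states] for k in range(1, n_states)]


def state_number(pi, cut_points, on_lmd, offset=200):
    """Risk group index, shifted by `offset` once LMD has been initiated."""
    base = bisect.bisect_right(cut_points, pi)
    return base + offset if on_lmd else base


coefs = {"diabetes": 0.43, "ldl_c": 0.18}   # coefficients quoted in the text
pi_diabetic = prognostic_index({"diabetes": 1, "ldl_c": 3.0}, coefs)
pi_non_diabetic = prognostic_index({"diabetes": 0, "ldl_c": 5.0}, coefs)
print(round(pi_diabetic, 2), round(pi_non_diabetic, 2))  # 0.97 0.9

# Synthetic PIs standing in for the development cohort (illustration only).
cuts = quantile_cut_points([float(i) for i in range(1000)], n_states=200)
print(state_number(0.0, cuts, on_lmd=False), state_number(999.0, cuts, on_lmd=True))  # 0 399
```

With these two coefficients the two example patients differ by about 0.07 in PI, which is the sense in which their risks are approximately similar.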

Formalization of the computational model

We formulated the patient trajectory and treatment decision-making process as a Markov decision process (MDP)¹⁷. The MDP was defined by the tuple [S, A, T, R, γ], where:

S is a finite set of states representing the risk states of patients during their healthcare visits for lipid tests (as described in the previous section).
A is the finite set of available actions representing the chosen LMD and LMD combinations (as described in the previous section).
T(s’ | s, a) is the transition matrix, which determines the probability of transitioning from state s at time t to state s’ at time t + 1 given action a. We estimated the transition matrix by counting the observed transitions in the development cohort and converting the counts to a stochastic matrix. To enhance safety, we limited the set of actions to frequently observed choices made by clinicians, excluding transitions with fewer than twenty occurrences. This approach ensures that the RL policy will learn from treatment options with high safety²².
R(s’, s, a) is the immediate reward received for a transition. Transitions to desirable states yield positive rewards, while reaching undesirable states incurs penalties. In our model, if s’ is the final state and the patient experiences no CVD occurrence within one year, a high positive reward is given; conversely, a high negative reward is assigned if CVD occurs²². For patient actions involving LMD, a small penalty is applied to account for potential side effects²⁰. The specific penalty values were determined through manual prioritization based on qualitative evaluation of the RL policy decision boundary during model development.
γ is the discount factor, which accounts for the decreasing importance of future rewards compared to immediate rewards. In healthcare applications, γ is commonly set between 0.9 and 0.99¹⁸,¹⁹,²¹,²². We chose a γ value of 0.99, indicating that we assign nearly equal importance to late and early occurrences of rewards¹⁸,²².
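The count-based estimate of T described above can be sketched as follows. The text does not specify whether the twenty-occurrence threshold applies to state-action pairs or to individual (s, a, s') transitions; this sketch assumes the former. The function name `estimate_transitions` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(observed, min_count=20):
    """Estimate T(s' | s, a) from a list of observed (s, a, s_next) triples.

    State-action pairs seen fewer than `min_count` times are dropped, so the
    resulting policy can only choose actions clinicians used frequently
    (assumed interpretation of the threshold).
    """
    pair_counts = Counter((s, a) for s, a, _ in observed)
    triple_counts = Counter(observed)
    T = defaultdict(dict)
    for (s, a, s_next), count in triple_counts.items():
        if pair_counts[(s, a)] >= min_count:
            # Normalising by the pair count makes each row sum to 1.
            T[(s, a)][s_next] = count / pair_counts[(s, a)]
    return dict(T)


observed = (
    [(0, "none", 0)] * 15
    + [(0, "none", 1)] * 5
    + [(0, "statin", 0)] * 3      # too rare: excluded from the model
)
T = estimate_transitions(observed)
print(T)  # {(0, 'none'): {0: 0.75, 1: 0.25}}
```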

After defining and calculating the components of the MDP, we employed policy iteration²²,²⁶, an offline model-based dynamic programming algorithm in RL. This algorithm learns a state-action value function $Q^{\pi}$, which quantifies the expected long-term reward of choosing an action in a given state, and a policy π that selects the action with the highest expected long-term reward according to $Q^{\pi}$¹⁷.

The policy iteration process began with a random policy and iteratively evaluated and improved it until convergence to an optimal solution⁵⁰.

1. Policy evaluation of the expected reward $V^{\pi}(s)$ under the current policy $\pi$:

$$V^{\pi}(s)=\sum_{s'\in S}T(s'\mid s,\pi(s))\left[R(s',s,\pi(s))+\gamma V^{\pi}(s')\right] \tag{1}$$

2. Policy improvement using the state-action value function $Q^{\pi}$:

$$Q^{\pi}(s,a)=\sum_{s'\in S}T(s'\mid s,a)\left[R(s',s,a)+\gamma V^{\pi}(s')\right] \tag{2}$$

$$\pi(s)=\mathop{\mathrm{argmax}}_{a\in A}Q^{\pi}(s,a) \tag{3}$$

Steps 1 and 2 are repeated until $\pi(s)$ converges.
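The evaluation and improvement loop can be sketched on a toy problem. This is an illustration only: the four-state MDP, the reward values (+100 for ending without CVD, −100 for CVD, −2 per LMD step), and the data layout are invented for the example, and the initial policy is the first allowed action rather than a random one so that the run is reproducible.

```python
def policy_iteration(states, actions, T, R, gamma=0.99, tol=1e-8):
    """actions[s]: allowed actions in s (empty list for terminal states).
    T[(s, a)]: dict s_next -> probability.  R(s_next, s, a): immediate reward."""

    def q(s, a, V):
        # Expected immediate reward plus discounted value of the next state.
        return sum(p * (R(s2, s, a) + gamma * V[s2]) for s2, p in T[(s, a)].items())

    policy = {s: actions[s][0] for s in states if actions[s]}
    V = {s: 0.0 for s in states}          # terminal states keep value 0
    while True:
        # Policy evaluation: sweep until V stops changing for the current policy.
        while True:
            delta = 0.0
            for s in policy:
                v_new = q(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in policy:
            best = max(actions[s], key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, policy[s], V) + 1e-12:
                policy[s], stable = best, False
        if stable:
            return policy, V


states = ["low", "high", "ok", "cvd"]     # "ok" and "cvd" are terminal
actions = {"low": ["none", "statin"], "high": ["none", "statin"], "ok": [], "cvd": []}
T = {
    ("low", "none"): {"ok": 0.99, "cvd": 0.01},
    ("low", "statin"): {"ok": 0.995, "cvd": 0.005},
    ("high", "none"): {"ok": 0.8, "cvd": 0.2},
    ("high", "statin"): {"low": 0.5, "ok": 0.4, "cvd": 0.1},
}


def R(s_next, s, a):
    terminal = {"ok": 100.0, "cvd": -100.0}.get(s_next, 0.0)
    return terminal - (2.0 if a == "statin" else 0.0)


policy, V = policy_iteration(states, actions, T, R)
print(policy)  # {'low': 'none', 'high': 'statin'}
```

In this toy setting the learned policy withholds the drug in the low-risk state, where the side-effect penalty outweighs the small risk reduction, and prescribes it in the high-risk state.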

Model validation

We evaluated the policy value of the trained RL agent using a large independent validation time series dataset. To provide a comprehensive comparison, we introduced and evaluated two additional policies: the “no drug” policy, where the RL agent always chose not to prescribe any LMD, and the “random drug” policy, where the RL agent randomly selected an action from the available pool of actions. These policies served as baselines for comparison, allowing us to assess the performance of the RL agent against alternative decision-making strategies²².

We denote the validation cohort as $C=[J_i,\ i=1,2,\ldots,n]$. Each trajectory $J_i=[(s_{i,t},a_{i,t},r_{i,t}),\ t=1,2,\ldots,\tau_i]$ is a sequence of transitions $(s_{i,t},a_{i,t},r_{i,t},s_{i,t+1})$ from step $t$ to step $t+1$, where $\tau_i$ denotes the length of trajectory $i$. Within each trajectory, $s_{i,t}$ is the current state, $a_{i,t}$ the action taken, and $r_{i,t}$ the immediate reward. The policy value of the clinicians’ policy is the mean discounted return over the observed trajectories:

$$V_{\pi_0}=\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{\tau_i}\gamma^{t-1}r_{i,t} \tag{4}$$
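The clinicians' policy value is simply the average discounted return of the observed trajectories, which can be sketched as below. The function names and the toy reward sequences are hypothetical, and γ = 0.5 is used only to keep the arithmetic easy to follow.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^(t-1) * r_t for t = 1..tau (enumerate starts the exponent at 0)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def behavior_policy_value(reward_sequences, gamma=0.99):
    """Mean discounted return over all observed trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)


# Two toy trajectories: one ends without CVD after three steps, one ends in CVD.
value = behavior_policy_value([[0, 0, 100], [-100]], gamma=0.5)
print(value)  # -37.5
```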

In order to ensure reliable estimates of the new policy’s performance before its deployment in real-world clinical settings, we engaged in off-policy evaluation (OPE)¹⁷. This process aimed to evaluate the RL policy’s performance using patient trajectories generated by the clinicians’ policy, as observed in the validation dataset. Formally, within the context of OPE, we defined π₀ as the behavior policy (the clinicians’ policy) and π₁ as the RL policy. To account for the discrepancy between these two policies and estimate their policy value, we employed importance sampling¹⁷,²¹. Importance sampling is a widely recognized method in RL policy estimation, allowing us to correct for the differences between π₀ and π₁ and obtain accurate estimates of their respective policy values.

For trajectory $i$ at time step $t$, the importance ratio is calculated as:

$$\rho_{i,t}=\frac{\pi_1(a_{i,t}\mid s_{i,t})}{\pi_0(a_{i,t}\mid s_{i,t})} \tag{5}$$

The weight of the trajectory is:

$$w_i=\prod_{t=1}^{\tau_i}\rho_{i,t} \tag{6}$$

The estimated value of the RL policy is the weighted mean of the discounted returns:

$$V_{IS}=\frac{1}{n}\sum_{i=1}^{n}w_i\sum_{t=1}^{\tau_i}\gamma^{t-1}r_{i,t} \tag{7}$$

The same procedure was applied to the no drug policy and the random drug policy to estimate their policy value.
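The importance-sampling estimate can be sketched as follows: each trajectory's discounted return is weighted by the product of its per-step importance ratios, and the weighted returns are averaged. This is a sketch under assumptions. The 1/n normalisation mirrors the clinicians' policy value, the text does not describe how π₀ is estimated (a uniform behavior policy is assumed here purely for illustration), and the function names and toy trajectories are hypothetical.

```python
def importance_sampling_value(trajectories, pi_eval, pi_behavior, gamma=0.99):
    """Ordinary importance-sampling estimate of the evaluation policy's value.

    trajectories: list of [(state, action, reward), ...] generated by pi_behavior.
    pi_eval(a, s), pi_behavior(a, s): probability of action a in state s.
    """
    total = 0.0
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for k, (s, a, r) in enumerate(trajectory):
            weight *= pi_eval(a, s) / pi_behavior(a, s)   # per-step ratio rho
            ret += gamma ** k * r                         # discounted return
        total += weight * ret
    return total / len(trajectories)


def pi_rl(a, s):
    """Toy deterministic policy: statin in state 1, no drug in state 0."""
    return 1.0 if a == ("statin" if s == 1 else "none") else 0.0


def pi_clinician(a, s):
    """Assumed behavior policy: both actions equally likely."""
    return 0.5


trajectories = [
    [(1, "statin", 0), (0, "none", 100)],   # agrees with pi_rl at every step
    [(1, "none", -100)],                    # disagrees, so its weight is 0
]
v_is = importance_sampling_value(trajectories, pi_rl, pi_clinician, gamma=0.5)
print(v_is)  # 100.0
```

Trajectories on which the evaluated policy would never have taken the observed actions receive zero weight, which is why estimates for deterministic policies rest on the subset of trajectories consistent with them.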
