Rural communities experience poorer health outcomes, lower life expectancies and more chronic disease and hospitalizations than communities in more densely populated urban environments1. Factors contributing to rural health disparities include older populations, lower incomes, less education, fewer occupational opportunities and less healthy lifestyle behaviors2,3. Additionally, rural areas struggle to both attract and retain physicians, nurses, dentists and pharmacists, creating access barriers4 that further rural health disparities5. These shortages make research that identifies factors related to health professionals selecting careers in rural settings a topic of interest to public health officials, policy-makers, training institutions and others interested in rural health.
Previous research addressing the rural healthcare professional workforce primarily focused on rural training and developing a pipeline of health professionals interested in rural practice. These studies identified rural clinical rotations as a strong predictor for choosing rural medicine6-11. Other factors identified as influencing physician retention were rural upbringing and financial incentives such as loan forgiveness11,12. Factors such as race and size of undergraduate college were not associated with rural medicine practice12. Some studies found that men were more likely to practise rural medicine than women9,13 while others did not find a link between being male and rural practice12. One study connected personality traits to rural practice and found physicians scoring higher on openness to experience, agreeableness and self-confidence more likely to choose rural medicine14.
Research about other healthcare professions, such as dentists, nurses and pharmacists, also linked an individual’s rural upbringing and rural training experiences to choosing a rural practice setting15-18. Dentists were more likely to practise in rural areas if they were male or had positive experiences with rural dental role models16,19. An Iowan study found that a dentist’s birth state was not associated with rural practice decisions, but observed a correlation with being older and being female in solo practice20. Rural nurses tended to have less nursing education and were more likely to work full time in public/community health, long-term care, or ambulatory care settings than their non-rural counterparts21. Other studies addressing the rural nursing workforce centered on job satisfaction, turnover rates and burnout22,23 and less on factors influencing career choice. Studies related to the pharmacy workforce are limited, but report that rural pharmacists are more likely to have rural roots and have received some type of rural exposure in their clinical training17,24.
Despite the existing body of research examining rural workforce predictors, a systematic review by Lee and Nichols8 concluded there is a need for more rigorous analysis and research related to implementing rural recruitment and retention strategies. More recently, Grobler et al25 conducted an extensive systematic literature review to reconcile and update the literature on the healthcare professions workforce to help determine effective incentives for retention. They concluded that many previous studies were limited by bias and confounding factors and expressed the need for more well-designed studies related to factors associated with choosing rural practice. Moreover, they cited a survey study by Trickett-Shockey et al26 that challenged previous findings identifying rural upbringing or rural training and education as a strong predictor of student intent to choose rural practice. Though small, this study illustrates the complexity of predicting professional practice settings and suggests that choice results from a nexus of underlying factors.
Another limitation to the published literature is that earlier researchers primarily used traditional statistical methods. To the authors’ knowledge, no existing study has applied newer, more robust analytics such as machine learning (ML). The algorithms employed by ML are known to be valuable, practical and applicable to a wide range of research questions, especially in health care27. ML can use data to detect patterns to predict outcomes, and advanced analytics from ML could improve the accuracy and precision of rural modelling and prediction as well as validate earlier findings3. The technique is useful to identify relationships between multiple data inputs or ‘features’ and an outcome. In ML, the computer learns by testing multiple sets of algorithms on a training dataset to determine which data variables help to classify an outcome. The results from using new analytic techniques should be of interest to educators, policy-makers and others interested in rural health. ML methods can also stimulate hypothesis testing research to explore and test previously identified associations.
This study seeks to add to the understanding of the factors related to rural practice. Specifically, it will be the first to apply ML techniques to a healthcare workforce database and to assess the utility of ML as a tool to identify factors predicting the decision to practise in a rural area. Because the study uses a regional database, a secondary aim is to capture a more defined understanding of the demographic characteristics of the Utah (USA) healthcare professions workforce.
This study used data collected by the Utah Medical Education Council (UMEC). Utah is a state in the Western Mountain region of the USA with a population of about 3.2 million, of which 335 000 live in rural areas. UMEC gathers data on the supply of healthcare professions on different cycles every calendar year, including information on demographic characteristics, practice settings, education, hours worked per week and career outlook. Each discipline reported on income by selecting from a range of income classifications. For example, physicians selected among 12 salary classifications (before taxes and excluding benefits) ranging from less than US$49,000 to more than US$300,000. After excluding residents and fellows, these responses yielded a median hour-adjusted income of US$178,000 for primary care providers and US$229,000 for specialists28. Similarly, the other disciplines allowed respondents to select an income classification but used classifications that reflected the salary ranges of the different health professions. Most data were collected through paper surveys and Qualtrics, an online surveying tool. All data were de-identified before running any analyses.
Datasets from four healthcare professions were cleaned and merged to run analyses. These four unpublished UMEC datasets include the 2017 dental workforce, the 2016 physician workforce, the 2014 nursing workforce and the 2017 pharmacy workforce. Similar variables for each dataset were identified and recoded to match accordingly. For example, data for practice settings in the nursing workforce had different values (ie 2=hospitals) compared to the physician workforce (ie 1=hospitals) and needed to be matched in order for values to represent the same practice settings.
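The recoding step described above can be illustrated with a minimal sketch. It assumes each dataset carries its own numeric code book for practice setting. Only the two hospital codes (2 in the nursing data, 1 in the physician data) come from the text; every other code, label and field name below is hypothetical.

```python
# Hypothetical code books. The source states only that 2 = hospitals in the
# nursing data and 1 = hospitals in the physician data; all other entries
# are invented for illustration.
NURSING_SETTINGS = {1: "long_term_care", 2: "hospital", 3: "public_health"}
PHYSICIAN_SETTINGS = {1: "hospital", 2: "office_practice", 3: "public_health"}


def harmonize(records, profession, code_book):
    """Return new records that share one labelled practice-setting variable."""
    merged = []
    for rec in records:
        merged.append({
            "profession": profession,
            # Unknown or missing codes become None instead of raising.
            "practice_setting": code_book.get(rec.get("setting_code")),
        })
    return merged


nurses = [{"setting_code": 2}, {"setting_code": 1}]
physicians = [{"setting_code": 1}, {"setting_code": 9}]

combined = (harmonize(nurses, "nursing", NURSING_SETTINGS)
            + harmonize(physicians, "physician", PHYSICIAN_SETTINGS))
for row in combined:
    print(row)
```

Mapping every source code to a shared label before merging means that a value such as "hospital" has the same meaning in all rows, regardless of which survey it came from.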
Primary practice location was the outcome variable of interest for this study. For the purpose of this study, the variable was categorized as practising either in a non-rural (urban or suburban area) or in a rural area and was computed by taking the primary zip code and converting it to rural and non-rural areas based on rural-urban commuting area (RUCA) codes. Although there are different ways to classify rural and non-rural areas, RUCA codes, which were developed by the United States Department of Agriculture Economic Research Service, are widely used in healthcare research29-31. RUCA codes take into account census tracts based on daily commuting patterns, population density and urbanization, and are split into 33 distinct categories. For this study, these categories were clustered into two simple levels: non-rural (RUCA codes 1, 2 and 3, representing communities with ≥50 000 residents) and rural (RUCA codes 4 and above, or ≤49 999 residents).
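The dichotomization described above could be sketched as follows. The ZIP codes and their RUCA values are placeholders, not real assignments, and the function names are hypothetical. The sketch also assumes that primary RUCA codes run from 1 to 10 with secondary codes expressed as decimals, and that any value outside that range should be treated as unclassified.

```python
import math

# Placeholder ZIP-to-RUCA lookup. These ZIP codes and values are invented;
# a real table would come from the published RUCA ZIP code files.
ZIP_TO_RUCA = {"00001": 1.0, "00002": 7.0, "00003": 4.1, "00004": 2.0}


def classify_ruca(code):
    """Collapse a RUCA code into 'non-rural' (primary 1-3) or 'rural' (4+)."""
    if code is None:
        return None
    primary = math.floor(code)  # e.g. 4.1 -> 4; the secondary code is ignored
    if primary < 1 or primary > 10:
        return None  # assumption: out-of-range codes are left unclassified
    return "non-rural" if primary <= 3 else "rural"


def practice_location(zip_code):
    """Look up a primary practice ZIP code and return its two-level class."""
    return classify_ruca(ZIP_TO_RUCA.get(zip_code))


for z in ["00001", "00002", "00003", "99999"]:
    print(z, practice_location(z))
```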
Data included 20 149 cases and 88 variables (ie features or attributes) obtained by merging the dental, physician, nursing and pharmacy workforces. Surveys were sent to individuals working in Utah who held active Utah licenses for dentistry, medicine, nursing or pharmacy. Response rates for the survey were 50.8% for dentistry, 47% for physicians, 42% for nursing and 30.4% for pharmacy. As mentioned in the workforce reports from UMEC, these response rates were considered satisfactory due to meeting a sufficient 95% confidence interval. In addition to general questions that crossed disciplines, each survey contained questions relevant to the specific health profession. Further information regarding the UMEC workforce supply surveys has been published previously32-34. In preparation for data analyses, this study excluded non-relevant variables (such as license ID) and variables with more than 50% of missing data among the survey respondents; this yielded a total sample size of 11 259 (dental n=902, physician n=2587, nursing n=6932, pharmacy n=838) for outcome prediction of practice location. Variables with more than 50% of missing data were excluded to ensure reliable precision and accuracy while limiting errors that may occur with the ML algorithms. The random forest (RF) method was then applied to select the most important variables or features for outcome prediction, known as feature selection. The RF method takes data inputs and randomly applies them to multiple decision trees iteratively until it identifies data features that help predict an outcome. In contrast to traditional statistical models, the models created by ML algorithms are extremely complex. Thousands of rules or parameters might be tested to define the model and consequently the exact internal processing pathways may be hard to identify. In this study, the RF method isolated 28 features for inclusion in the next steps of model building and validation. 
Feature selection helps prevent over-fitting of data and reduces errors in model complexity along with training time35. Appendix I outlines which features among the 28 variables, such as gender, race and debt, were dropped.
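As an illustration of the missing-data screen that preceded feature selection, the following sketch drops any variable with more than 50% missing values. It covers only that screening rule, not the RF ranking itself. The variable names and records are hypothetical, and missing values are assumed to be stored as None.

```python
def drop_sparse_variables(rows, max_missing=0.5):
    """Return the variables whose share of missing values is <= max_missing."""
    variables = sorted({name for row in rows for name in row})
    kept = []
    for name in variables:
        # A variable counts as missing when absent from a row or set to None.
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            kept.append(name)
    return kept


# Hypothetical survey records, for illustration only.
rows = [
    {"age": 40, "income": 5, "optional_item": None},
    {"age": 52, "income": None, "optional_item": None},
    {"age": None, "income": 7, "optional_item": None},
    {"age": 33, "income": 6, "optional_item": 2},
]

print(drop_sparse_variables(rows))
```

In this toy input, `optional_item` is missing in three of four rows (75%) and is dropped, while `age` and `income` (25% missing each) are retained.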
The outcome variable of interest was practice location. To adjust for an observed imbalance of data, the analyses employed a replacement strategy to create a more balanced dataset to adjust for the minority class (eg the rural class of practice location, which had a much smaller sample size). Without adjusting for the minority class of data, ML methods often fail to correctly predict the minority class and cause an inflated model performance to the majority class. Oversampling, also known as sampling with replacement, has been used in previous ML studies because it is effective in treating class imbalance with large datasets36-39. Adjustment using sampling with replacement can reduce gaps between sensitivity, specificity and errors of misclassification36,40. Thus, the minority class in the data was adjusted (proportions of the minority class were resampled until reaching a similar sample size to the majority class), resulting in a more balanced dataset of 20 291 cases with 10 130 cases classified as non-rural and 10 161 cases classified as rural for modelling.
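A minimal sketch of sampling with replacement for a two-class outcome is shown below. It is an illustration of the general technique rather than the procedure run in WEKA. The function name, field name and toy data are hypothetical, and the sketch equalizes the classes exactly, whereas the study reports class sizes that were similar but not identical.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Resample the smaller class with replacement until both classes match."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(row[label_key] for row in rows)
    # Assumes exactly two classes are present in the data.
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    minority_rows = [row for row in rows if row[label_key] == minority]
    extra = rng.choices(minority_rows, k=n_major - n_minor)
    return rows + extra


# Toy data: 9 non-rural and 3 rural cases (invented for illustration).
data = ([{"location": "non-rural"} for _ in range(9)]
        + [{"location": "rural"} for _ in range(3)])

balanced = oversample_minority(data, "location")
print(Counter(row["location"] for row in balanced))
```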
This study used several different supervised ML methods including decision tree (DT), RF regression, extreme gradient boosting (XGBoost) and support vector machine (SVM) for modelling. In supervised ML methods, the outcome of the study is known, predetermined, or preset by the data scientist or researcher. For example, the outcome of this study was preset to be the practice site. However, in unsupervised ML, the outcome is determined by the machine during the course of data exploration, making supervised ML methods more suitable for prediction studies and unsupervised ML methods more appropriate for studies focusing on clustering and feature reduction. In contrast to more traditional statistical methods such as logistic regression, ML includes higher-order interactions and examines complex non-linear relationships between model variables and outcomes. The ML methods in this study were chosen because of their established applicability in healthcare research, capability of over-fitting prevention, simplicity of comprehension, and general acceptance as useful ML methods41-45. Application of ML involves both training and test datasets, where algorithms applied to a training dataset can help identify associations that might be challenging to observe in complex and larger datasets. After the training dataset explores and ultimately predicts the outcome variable, the prediction is validated by comparing it against the test dataset, recognized as the validation set. The model that performs the best through the stages of validation will be the final chosen model. In this study, the data were randomly split into training and testing sets for model building and validation, using DT, RF, XGBoost and SVM. An 80/20 split (ie 80% of data for training and 20% of data for validation) was chosen based on previous literature and is referred to as the 80/20 rule or the Pareto principle44. 
More specifically, 80% of data were trained using k-folds cross validation and then 20% of data were tested for validation to minimize issues of over-fitting and model errors46-48. The metrics used to evaluate model performance included accuracy, sensitivity, specificity and area under the curve (AUC) of the receiver operating characteristic (ROC). Several sources provide detailed descriptions of ML and the techniques used in this study49-51.
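The partitioning scheme described above can be sketched with index lists alone. The number of folds is not stated in the text, so k=5 is an assumption, as are the function names and the fixed random seed. The sketch shows only how cases are allocated, not model fitting.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and hold out test_fraction for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_folds(train_idx, k=5):
    """Yield (fit, held_out) index lists; each case is held out exactly once."""
    folds = [train_idx[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, held_out


train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))
for fit, held_out in k_folds(train_idx, k=5):
    print(len(fit), len(held_out))
```

The held-out 20% is never used inside the cross-validation loop, so it can serve as an independent check on the model chosen from the training folds.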
Descriptive statistics on demographic characteristics and clinical practice were analyzed for all the healthcare professions. The ML analyses were conducted using WEKA v3.9.4 (https://www.cs.waikato.ac.nz/ml/weka). Other statistical analyses, such as descriptive statistics, were performed using SPSS v25.0 for Windows (IBM; http://www.spss.com).
The Roseman University of Health Science Institutional Review Board reviewed this study and determined it to be non-human subject research.
The study sample consisted of 11 259 healthcare professionals licensed in Utah, of which 36.6% were male and 63.4% were female with an average age of 46.6 years (standard deviation (SD) 12.98). Of the sample group, most healthcare professionals were Caucasian (n=10 375, 94.5%), went to a school outside of Utah (n=7024, 62.4%), and had a non-rural upbringing (n=8128, 73.5%). Only 1.9% (n=318) identified as being Hispanic. Table 1 summarizes the sample group demographics. There were significant differences in profession (p<0.001), race (p<0.001) and upbringing (p<0.001) between rural and non-rural practice locations (see Table 1).
The average age of the rural health professions workforce was 46.6 years (SD=12.95), and 48% (n=538) of them worked in hospital settings (Table 2). The top five specialties for rural practice were general surgery (n=187, 17.0%), primary care consisting of general internal medicine, pediatrics and family medicine (n=177, 16.1%), other specialty (n=157, 14.2%), general dentistry (n=91, 8.3%) and emergency medicine (n=85, 7.7%).
The non-rural healthcare workforce had an average age of 46.6 years (SD=12.99) and about half worked in hospital settings (n=5096, 50.7%). The top five specialties for non-rural practice were other specialty (n=1564, 15.4%), primary care defined as general internal medicine, pediatrics and family medicine (n=1458, 14.4%), general surgery (n=946, 9.3%), general pharmacist (n=673, 6.6%) and critical care medicine (n=646, 6.4%).
Among the ML methods, the best performing classifier was SVM (accuracy 99.7%, precision 100%, sensitivity 100%, specificity 99.4%), followed by XGBoost (accuracy 96.6%, precision 100%, sensitivity 93.1%, specificity 100%), RF regression (accuracy 96.6%, precision 93.7%, sensitivity 100%, specificity 93.2%) and DT (accuracy 89.0%, precision 83.4%, sensitivity 97.5%, specificity 86.0%) (Table 3). Figure 1 presents the feature importance graph for SVM, and includes the 10 most important predictors: income, upbringing, total hours, age, years until retirement, school state, patients per week, degree year, practice setting and specialty. Importance scores are derived by constructing a prediction model in which variables that influence the model the most have the greatest impact on reducing model error. When variables with high importance are excluded from the prediction model, increased model error occurs38,52,53. The higher the scores for these features, the more important they were in identifying rural practice location. In terms of rural practice decisions, income and upbringing were found to be the most important features. The ROC curves indicated that all ML algorithms performed exceptionally well, where curves placed closer to the top-left corners represent better performance (Fig2). Table 3 lists the model performance evaluations. Table 2 shows the descriptive statistics of the top 10 important features derived from SVM.
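For readers less familiar with the performance metrics reported above, the following sketch shows how they are conventionally computed from confusion-matrix counts and classifier scores. The counts and scores are invented for illustration and are not study data, and treating rural practice as the positive class is an assumption.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return count-based metrics, with rural assumed as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # predicted rural that are truly rural
        "sensitivity": tp / (tp + fn),  # truly rural that are detected
        "specificity": tn / (tn + fp),  # truly non-rural that are detected
    }


def auc(scores_pos, scores_neg):
    """AUC as the chance a positive case outscores a negative (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))


# Invented counts and scores, for illustration only.
m = confusion_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in m.items():
    print(name, round(value, 3))
print("auc", round(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4]), 3))
```

An AUC of 1.0 corresponds to a ROC curve that passes through the top-left corner, while 0.5 corresponds to chance-level ranking.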
This study is the first to apply ML techniques to explore factors associated with practising in a rural area. Identifying these factors can facilitate development of effective strategies for recruitment and retention of healthcare professionals into rural settings. By using a healthcare workforce database, ML methods such as DT, RF regression, XGBoost and SVM assessed factors related to rural practice. Among the methods utilized, this study found that SVM worked best in terms of performance in classifying rural practice location. Performance metrics from DT, RF regression and XGBoost also fared well. This experience with a single-state database suggests that ML tools, especially SVM, will be valuable for future research analyzing other state or larger databases that have enough data points to apply ML techniques.
While several studies examine predictors, few assess the relative importance of predictor variables. An important predictor found in this study was upbringing, evidenced by having the second highest importance score (Fig1) and being identified as an important variable linked to rural practice across all the ML methods. This finding is consistent with previous research that also identified rural background as an important predictor of practising in a rural setting6-11,15-18,24. By employing four new analytic methods, each of which reconfirmed that rural upbringing remains linked to choosing a rural practice, this study updates older results and counters concerns8,25 about the validity of earlier research.
By far, income exhibited the strongest association to practice location. This association aligns with previous studies that found financial factors play a significant role in determining physician practice setting and also for nurses and dentists54-57. In the case of physicians, recent salary data indicate that the gap between rural and urban income has narrowed for primary care over the past 5 years58. Finding that income connects strongly to rural practice suggests that attractive income packages might help rural communities compete more successfully with urban areas to recruit health professionals and to address shortages. This may be especially important for surgical subspecialties where urban practice remains more lucrative58. It also highlights the need for research examining how income influences practice choice and what types of offers are most attractive. Also, while these findings demonstrate linkage to income across four healthcare professions, it may be that flexibility, incentives and other earning features are equally or more important than absolute income. While added income expense may be challenging for rural health professional employers, it might prove to be cost effective if it minimizes turnover and the number and duration of staff vacancies.
Factors with low importance scores in this study also validate earlier research about rural practice choice. Royston et al12 found that neither gender nor race predicted rural practice, which matched the current findings demonstrated by all four ML methods dropping both features from the final model. The current models also dropped current and total debt as important predictors. Current educational debt being dropped from the model suggests the need for future evaluation of loan repayment incentives. Typically, loan repayment programs offer to pay off a portion of student debt in return for working in an area of high need for a certain period. Both practice setting and specialty had relatively low importance scores compared to the other factors. This finding may be due to rural areas having fewer types of practice settings. Future studies employing ML methods should investigate the association of these factors with rural practice.
Although this study looked at providers in the USA, a disparity of health professionals between urban and rural areas remains a global problem. The finding linking rural upbringing as an important factor for selecting rural practice is similar to international studies examining rural pipelines59,60. Although there is substantially more research related to rural pipelines in high-income countries, studies from low- and middle-income countries also demonstrate that students with rural roots are more likely to practise in rural areas61,62. Fewer studies explore the impact of income as a factor and most of these focus on physicians and less on other health professions63. The current finding identifying income as a strong predictive factor suggests the need for more research about income incentives for allied health professionals and physicians both in high- and lower-income countries.
This study has several limitations. The sample size for each profession differed and was too small to apply ML techniques to each profession separately. Associations identified by this study could be more representative of one healthcare profession over another. However, the models described here provide a method to apply ML techniques to larger databases that have enough data points for separate health professions. Also, examining health professions as a group might be useful for guiding comprehensive strategies to address rural health professional shortages and merits further exploration. Causality from the models is another limitation worth noting. Although the authors identified factors associated with rural practice, causality cannot be inferred. Future studies using ML methods such as causal forest are planned to evaluate causality64. A third limitation is that more than half of the original features were dropped due to missing data. A future study using survey questionnaires or databases matched more precisely among the healthcare professions is planned to minimize missing data. Another limitation is how data balance was processed for the training and validation datasets. Although other studies found sampling with replacement to be a reliable method37-39, over-fitting could still occur when carried out on both datasets. While other techniques to prevent over-fitting, such as balance processing only on the training dataset, exist, sampling with replacement on both the training and validation datasets can reduce bias and model prediction inaccuracy, since it minimizes highly skewed distribution towards healthcare professionals choosing urban practice40,53,65. Another limitation was that the sample consisted only of professionals licensed in Utah. Data from other states and internationally will be helpful to confirm model performance and important feature selections. 
However, Utah represents both large non-rural and rural environments, making it a good state to test ML applications. Also, the times of data collection differed and could affect outcomes. Nonetheless, there is no reason to believe that significant changes occurred over the limited time frame examined. Further research using longitudinal designs is also needed to explore trends. Finally, there are several ways to define ‘rural’, and this study grouped smaller communities and rural communities into a single designation. In doing this, nuances between a very remote community and a smaller town near a metro area might have been lost.
This study is the first to demonstrate the utility of applying ML methods to identify features linked to rural practice. The study indicates that income is the most important factor associated with rural practice and suggests the need to study what types of income structures might attract more healthcare professionals to rural settings. Rural upbringing emerged as the next most important factor, validating and updating earlier research that identified upbringing as an important factor. Further research that applies ML methods to large databases, explores linkages and deploys ML algorithms in software applications offers a new tool with the potential to guide and inform strategies that maximize efforts to address rural workforce shortages.