Benchmarking Foundation Models for Antibiotic Susceptibility Prediction

Helio Halperin1    Yanai Halperin1    Simon A. Lee2    Jeffrey N. Chiang2   
1Santa Monica High School 2 UCLA Computational Medicine 2024

DALL-E's attempt at visualizing the research we performed. Makes our work look a lot cooler...

Abstract

The rise of antibiotic-resistant bacteria has been identified as a critical global healthcare crisis that compromises the efficacy of essential antibiotics. This crisis is largely driven by the inappropriate and excessive use of antibiotics, which leads to increased bacterial resistance. In response, clinical decision support systems integrated with electronic health records (EHRs) have emerged as a promising solution. These systems employ machine learning models to improve antibiotic stewardship by providing actionable, data-driven insights. This study therefore evaluates pre-trained language models for predicting antibiotic susceptibility, using several open-source models available on the Hugging Face platform. Despite the abundance of models and ongoing advancements in the field, there is still no consensus on the most effective model for encoding clinical knowledge.

Antibiotic Resistance

Antibiotic resistance is a growing global health threat, reducing the efficacy of treatments against bacterial infections. However, with the growing ubiquity of clinical data, we can leverage Electronic Health Records (EHRs), which capture patient histories, to help predict antibiotic susceptibility. In this study, we therefore use pre-trained foundation models as part of an antibiotic stewardship effort, combating antibiotic resistance with clinical decision support backed by a data-driven approach.

Transforming Electronic Health Records into text

In previous work, we introduced a methodology for transforming Electronic Health Records (EHRs) into text, which we then used as input to pre-trained language models for predicting antibiotic susceptibility. We used the MIMIC-IV dataset, which contains de-identified health data, and built a patient cohort composed of patients presumed to have a Staphylococcus aureus infection. We detail our patient cohort in the table below:
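As a minimal sketch of this serialization step (the field names and the "key: value" template here are hypothetical illustrations, not the exact format from our pipeline), each tabular EHR record can be flattened into a sentence-like string before tokenization:

```python
# Sketch: flatten one tabular EHR record into plain text for a language model.
# Field names and the joining template are illustrative assumptions.

def serialize_record(record: dict) -> str:
    """Turn one patient record into a single text string."""
    parts = [f"{key.replace('_', ' ')}: {value}" for key, value in record.items()]
    return "; ".join(parts)

row = {"age": 59, "sex": "female", "culture_site": "blood"}
print(serialize_record(row))  # age: 59; sex: female; culture site: blood
```

The resulting string is what gets tokenized and passed to the pre-trained model.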

Description          Category         train    test     totals
Prescriptions, n     total            4803     1173     5976
Unique IDs, n        total            3283     878      4161
Age, mean (SD)                        59 (17)  58 (17)
Sex, n               female           1341     351      1692
                     male             1942     527      2469
Race/Ethnicity, n    White            2212     583      2795
                     Black            416      119      535
                     Other            401      96       497
                     Hispanic/Latino  150      55       205
                     Asian            88       20       108
                     Unable           12       3        15
                     Native Hawaiian  4        2        6

Modeling Setup

In this work we used the Hugging Face platform to access pre-trained language models. We used the following models:

[Figure: the pre-trained language models evaluated in this study]
In this study, we freeze the pre-trained model parameters and use the resulting embeddings as input to a light gradient boosting machine (LGBM) model. We frame each antibiotic susceptibility prediction task as an independent binary classification. We use the Area under the Receiver Operating Characteristic curve (AUROC) and the Area under the Precision-Recall Curve (AUPRC) as our evaluation metrics.
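The setup above can be sketched as follows. Here random arrays stand in for the frozen model's token embeddings (in practice they would come from a Hugging Face model's last hidden state), and the pooled vectors are what we feed to the downstream classifier:

```python
import numpy as np

# Sketch of the frozen-embedding step: mean-pool a pre-trained model's token
# embeddings into one fixed-length vector per clinical note. The random
# arrays below are stand-ins for real model outputs.

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions."""
    mask = attention_mask[:, :, None]             # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1)         # avoid divide-by-zero
    return summed / counts

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 768))            # 4 notes, 16 tokens, 768 dims
mask = np.ones((4, 16))
mask[:, 12:] = 0                                  # last 4 positions are padding
features = mean_pool(tokens, mask)                # one 768-d vector per note
print(features.shape)                             # (4, 768)

# Each antibiotic then gets its own binary classifier over `features`
# (in our setup, a lightgbm.LGBMClassifier), scored with AUROC and AUPRC.
```

Mean pooling is one common choice for sentence-level features; some models instead use the [CLS] token vector.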

Antibiotics & Their Prevalence

We studied eight antibiotics. The table below shows the prevalence of each in our dataset.

Category     Antibiotic          train   test   totals   Prevalence (%)
Antibiotics  Clindamycin         2645    624    3269     54.6838
             Erythromycin        2626    639    3265     3.5141
             Gentamicin          4549    1127   5676     54.6352
             Levofloxacin        2866    715    3581     94.9799
             Oxacillin           2702    667    3369     6.4759
             Tetracycline        3747    909    4656     39.9598
             Trimethoprim/sulfa  3671    908    4579     77.9116
             Vancomycin          2529    611    3140     76.6232

Results

We begin by presenting the results of our study. Figure 1 displays the Area under the Receiver Operating Characteristic curve and Figure 2 displays the Area under the Precision-Recall Curve. We rank our models from best (top) to worst (bottom).

[Figure 1: AUROC by model and antibiotic]
[Figure 2: AUPRC by model and antibiotic]
Below we present the results in a cleaner format. Figure 3 displays the Area under the Receiver Operating Characteristic curve and Figure 4 displays the Area under the Precision-Recall Curve, with models ranked from best (top) to worst (bottom) to show which model wins each antibiotic benchmark.
[Figure 3: AUROC, models ranked per antibiotic]
[Figure 4: AUPRC, models ranked per antibiotic]
We see that the best model varies across the different antibiotic objectives. However, these plots show that BiomedRoBERTa performs best on four of the eight antibiotics.

We therefore also present the average rank of each model across antibiotics, where SciBERT achieves the best average rank.
[Figure: average rank of each model across antibiotics]
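The average-rank summary can be computed as sketched below: rank the models within each antibiotic by score (1 = best), then average each model's ranks across antibiotics. The scores here are made-up numbers for illustration, not our actual results:

```python
# Sketch of the average-rank summary. Scores are illustrative placeholders,
# keyed as scores[model][antibiotic] = AUROC.
scores = {
    "SciBERT":       {"Clindamycin": 0.73, "Oxacillin": 0.64},
    "BiomedRoBERTa": {"Clindamycin": 0.68, "Oxacillin": 0.66},
    "ClinicalBERT":  {"Clindamycin": 0.71, "Oxacillin": 0.61},
}

def average_ranks(scores: dict) -> dict:
    """Average each model's per-antibiotic rank (1 = best); lower is better."""
    antibiotics = next(iter(scores.values())).keys()
    totals = {model: 0.0 for model in scores}
    for ab in antibiotics:
        ordered = sorted(scores, key=lambda m: scores[m][ab], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    n = len(antibiotics)
    return {model: total / n for model, total in totals.items()}

print(average_ranks(scores))
# {'SciBERT': 1.5, 'BiomedRoBERTa': 2.0, 'ClinicalBERT': 2.5}
```

This simple summary can hide large per-task differences, which is why we report the per-antibiotic rankings as well.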

Discussion & Conclusion

From this study, we notice that the winning model varies across antibiotics. This suggests that foundation models may be optimized for the specific tasks on which they claim to be the new "state of the art". We therefore advise those who use foundation models as feature representation methods to run their own benchmarks, as no clear winner is conclusive.

As a follow-up, we intend to fine-tune these foundation models on our dataset to see whether we can improve their performance and whether a clear-cut winner with state-of-the-art embeddings emerges.

Questions

If there are any questions or concerns, please feel free to reach out to us at heliohalperin@gmail.com or simonlee711@g.ucla.edu.