Decisive differences in healthcare AI
This story was written by Truly Render for the Rackham Graduate School
Machine learning and artificial intelligence (AI) are part of the fabric of our everyday lives, informing everything from our election campaigns to our healthcare delivery. But according to Ph.D. student Trenton Chang, we have a long way to go before developing machine learning models that promote equity.
Chang studies machine learning with the Machine Learning for Data-Driven Decisions group (MLD3), part of the AI Lab within U-M’s Division of Computer Science and Engineering.
Before joining the MLD3 lab, Chang earned both his undergraduate degree in American studies and his master’s degree at Stanford University, a path that he says fuels his current research.
“American studies introduced me to thinking about what equity actually means,” Chang says. “When I started taking computer science courses, I enjoyed the discussions about ethics, bias, and inequities embedded within machine learning models.”
As a humanities student, Chang was interested in analyzing how disparities arise and impact societies. Entering graduate school, his research focused on understanding the ways disparity informs machine learning within healthcare settings. And unfortunately, the disparities in healthcare delivery are well documented.
“In healthcare, there’s examples of lower rates of colon cancer screenings for Black patients in the VA [Veterans Affairs], or the differential presentation for heart attacks between men and women, or disparities in pain perception,” Chang says. “On the clinical end, people have known for a long time that there are inequities in the way that we deliver healthcare.”
Currently, clinicians may use AI to support a variety of care decisions, including diagnostics like screenings and tests.
While AI is often thought of as a way to attain a neutral and unbiased perspective, it’s important to remember who is behind the machine learning process: humans.
Decoding Our Bias
A large fraction of the time spent on an AI project is often dedicated to data labeling, the process of annotating raw data such as images, videos, text, and audio. Labels help an AI model recognize what an object is, or what it could mean, when the model encounters it in untagged data. When labels are inaccurate, so too is the model’s output.
“The problem that I analyze is inherently a problem with mislabeling,” Chang says. “One important application of healthcare AI may be predicting the outcome of laboratory tests, but when the rates at which tests are ordered vary across patient subgroups, the machine learning data is flawed.”
Flawed data is problematic because it is what the machine learning model is trained on. The issue is exacerbated when machine learning developers label patients who have yet to receive diagnostic tests as “negative” for a condition within the data.
“Most of the time this is probably pretty reasonable if we trust clinicians’ judgment, but other times this can lead to serious issues in terms of performance disparities between groups in your population,” Chang says.
Healthcare disparities and “negative” test labeling defaults can cause machine learning models to underestimate the risk for patients who were already less likely to receive a test, creating a feedback loop that perpetuates inequities.
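To make that feedback loop concrete, here is a toy simulation, a sketch of my own and not Chang’s actual model: two groups share the same true prevalence of a condition, untested patients are always recorded as “negative,” and each round’s testing effort follows the risk that a model calibrated to those labels would predict. The “test at three times the predicted risk” policy and every number here are invented purely for illustration.

```python
import random

random.seed(0)

TRUE_PREVALENCE = 0.3   # same true rate of the condition in both groups

def apparent_prevalence(test_rate, n=100_000):
    """Positive-label rate when every untested patient is recorded
    as 'negative' by default."""
    positives = sum(
        1 for _ in range(n)
        if random.random() < TRUE_PREVALENCE and random.random() < test_rate
    )
    return positives / n

def run_feedback(test_rate, rounds=4):
    """Let each round's testing effort track the risk that a model
    calibrated to the previous round's (biased) labels would predict."""
    history = [test_rate]
    for _ in range(rounds):
        predicted_risk = apparent_prevalence(test_rate)
        # Hypothetical policy: allocate tests at three times predicted risk.
        test_rate = min(1.0, 3.0 * predicted_risk)
        history.append(test_rate)
    return history

well_tested = run_feedback(0.9)    # group that starts with a 90% test rate
under_tested = run_feedback(0.3)   # group that starts with a 30% test rate
print(well_tested)
print(under_tested)
```

In this sketch, the recorded prevalence always understates the true rate of 30 percent, so each round predicts slightly less risk than the last, testing rates drift downward for both groups, and the group that started out under-tested stays the worse off, which is the loop the paragraph above describes.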
Taking a Closer Look
In a recent simulation study, Chang trained machine learning models on simulated data to identify the conditions under which large performance gaps arise between two patient groups. The process showed that disparities in label quality across patient subgroups cause machine learning models to underperform.
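One way such a performance gap can surface, sketched here as a hypothetical stand-in rather than Chang’s actual experiment: generate two groups with identical risk distributions but different testing rates, set a classifier’s flagging cutoff from the observed (biased) label rate, and compare how often each group’s truly positive patients go unflagged. All rates and thresholds below are invented for illustration.

```python
import random

random.seed(2)

def make_group(n, test_rate):
    """One group: a latent risk score drives true disease status, and
    untested patients default to a 'negative' observed label."""
    rows = []
    for _ in range(n):
        risk = random.random()                     # latent risk in [0, 1]
        truly_positive = random.random() < risk    # higher risk, more disease
        observed = int(truly_positive and random.random() < test_rate)
        rows.append((risk, truly_positive, observed))
    return rows

def flag_cutoff(rows):
    """Risk cutoff at which the model flags as many patients as the
    observed labels mark positive -- roughly what a model calibrated
    to the biased labels would do."""
    observed_rate = sum(o for _, _, o in rows) / len(rows)
    risks = sorted((r for r, _, _ in rows), reverse=True)
    return risks[int(observed_rate * len(rows))]

def miss_rate(rows, cutoff):
    """Share of truly positive patients who fall below the cutoff."""
    positives = [r for r, t, _ in rows if t]
    return sum(1 for r in positives if r < cutoff) / len(positives)

group_a = make_group(50_000, test_rate=0.9)   # well-tested group
group_b = make_group(50_000, test_rate=0.3)   # under-tested group

miss_a = miss_rate(group_a, flag_cutoff(group_a))
miss_b = miss_rate(group_b, flag_cutoff(group_b))
print(miss_a, miss_b)   # the under-tested group's positives are missed far more often
```

Both groups have identical true risk, yet the model learned from the under-tested group’s labels misses a much larger share of its sick patients, because the biased labels make the condition look rarer there than it really is.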
While Chang’s simulations used mathematical models as points of inquiry, he illustrates his findings using the example of COVID-19 testing disparities.
Chang points out that cough and fever are symptoms of COVID-19, although they are also symptoms of other illnesses. According to a 2020 article in the New England Journal of Medicine, African Americans are more likely to have a cough or fever at time of testing than other populations because of what Chang calls “embedded health disparities,” but they are less likely to be tested for COVID-19 due to a number of factors, including clinician bias and underfunded health clinics. This leads to further disparities in patient treatment and machine learning, skewing AI’s ability to accurately understand and predict healthcare outcomes.
Chang asserts that performance disparities in AI could be reduced significantly if a high-risk patient were always more likely to be tested. He acknowledges, though, that this would likely require a significant reallocation of resources.
“Not everyone can afford to run these expensive tests,” Chang says. “It’s important to understand the need for societal and policy changes alongside the computational insights here. We’re not going to ‘machine learning’ our way into reducing resource disparities between two hospitals.”
While Chang’s current work looks at challenges within the AI and machine learning field, he is cautiously optimistic about the future.
“If we have the humility to allow experts and all the stakeholders to come in and make these decisions about machine learning models together, and we understand our own limitations, we can create collaborations that can really pave a way for creating new models that can truly have positive, equitable impacts on people,” Chang says.
Chang’s research cites two important studies available through the National Library of Medicine’s National Center for Biotechnology Information, listed below.
Sex Differences in the Presentation and Perception of Symptoms Among Young Patients with Myocardial Infarction: Evidence from the VIRGO Study (Variation in Recovery: Role of Gender on Outcomes of Young AMI Patients).
Colorectal Cancer Screening Among African-American and White Male Veterans.
In Chang’s COVID-19 example, he cited two articles published in 2020, listed below.
Hospitalization and Mortality Among Black Patients and White Patients with COVID-19.
The Fullest Look Yet at the Racial Inequity of Coronavirus.
Chang’s simulation study, referenced above, is listed below.
Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning.