Marzyeh Ghassemi, who earned her PhD in computer science at MIT, has been studying the biases that artificial intelligence (AI) techniques can introduce into healthcare. Ghassemi, now an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and three collaborators from the Computer Science and Artificial Intelligence Laboratory have examined the disparities that arise in machine learning models, which cause the models to perform poorly for subgroups with limited training data.
The researchers focused on “subpopulation shifts”: differences in how a machine learning model performs for different subgroups of the population it serves. In healthcare, such shifts can lead to worse diagnoses and treatment outcomes for some patients. The main objective of their study is to characterize the types of subpopulation shift that occur and to uncover the mechanisms behind them, in order to develop more equitable models.
The team identified four main types of shift: spurious correlations, attribute imbalance, class imbalance, and attribute generalization. Biases can stem from the class, the attribute, or both. For instance, if a machine learning model is trained to sort images of animals into two classes, cows and camels, and in the training data cows always appear on grass and camels always on sand, the model may learn to rely on the background rather than the animal itself, misclassifying a cow photographed on sand. This failure arises from a spurious correlation, a bias involving both the class and the attribute.
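The effect is easy to reproduce on synthetic data. The sketch below, using made-up features rather than the study’s actual setup, trains a logistic regression on examples where a “background” feature almost perfectly tracks the label, then evaluates it on data where that correlation is broken:

```python
# A minimal sketch, assuming synthetic data (not the study's setup): a
# classifier latches onto a spurious "background" feature that tracks the
# label in training but not at test time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, background_corr):
    # Label: 0 = cow, 1 = camel.
    y = rng.integers(0, 2, n)
    # Genuine but noisy "animal" feature, weakly predictive of the label.
    animal = y + rng.normal(0.0, 1.0, n)
    # Background feature (0 = grass, 1 = sand) that matches the label with
    # probability `background_corr`.
    background = np.where(rng.random(n) < background_corr, y, 1 - y)
    X = np.column_stack([animal, background + rng.normal(0.0, 0.1, n)])
    return X, y

# Cows almost always on grass, camels almost always on sand in training.
X_train, y_train = make_data(5000, background_corr=0.95)
# The shortcut disappears at test time: backgrounds become uninformative.
X_test, y_test = make_data(5000, background_corr=0.50)

clf = LogisticRegression().fit(X_train, y_train)
print(f"train accuracy: {clf.score(X_train, y_train):.2f}")  # high
print(f"test accuracy:  {clf.score(X_test, y_test):.2f}")    # drops sharply
```

Because the background is a cheaper signal than the animal itself during training, the model leans on it heavily, and its accuracy collapses once that shortcut stops working.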
In a medical context, a machine learning model might be used to diagnose pneumonia from X-ray images. Attribute imbalance occurs when, for example, the training data contains more males diagnosed with pneumonia than females, leading to better performance for males. Class imbalance arises when there are far more healthy patients than patients with pneumonia, biasing the model toward predicting the healthy class. The study also highlighted attribute generalization, where the model must predict outcomes for subgroups that are not represented in the training data at all.
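One way to surface these imbalances is to break model performance down by subgroup. The sketch below, with toy arrays and hypothetical error rates rather than real patient data, computes accuracy for each (attribute, class) pair, which makes a gap like the one described above immediately visible:

```python
# A minimal sketch with toy arrays and hypothetical error rates: breaking
# accuracy down by (attribute, class) subgroup to expose imbalance effects.
import numpy as np

def subgroup_accuracy(y_true, y_pred, attribute):
    """Accuracy for every (attribute value, class) subgroup."""
    results = {}
    for a in np.unique(attribute):
        for c in np.unique(y_true):
            mask = (attribute == a) & (y_true == c)
            if mask.any():
                results[(a, c)] = (y_pred[mask] == y_true[mask]).mean()
    return results

rng = np.random.default_rng(1)
# Toy cohort: 800 healthy patients (class 0), then 150 male and 50 female
# pneumonia patients (class 1); attribute coding 0 = male, 1 = female.
y_true = np.concatenate([np.zeros(800, int), np.ones(200, int)])
sex = np.concatenate([rng.integers(0, 2, 800),
                      np.zeros(150, int), np.ones(50, int)])
# Simulate a model that errs more often on the under-represented subgroup.
flip_prob = np.where((sex == 1) & (y_true == 1), 0.30, 0.05)
flip = rng.random(len(y_true)) < flip_prob
y_pred = np.where(flip, 1 - y_true, y_true)

for (a, c), acc in sorted(subgroup_accuracy(y_true, y_pred, sex).items()):
    print(f"sex={a} class={c}: accuracy {acc:.2f}")
```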
The team tested 20 state-of-the-art algorithms on a range of datasets to evaluate how they performed across different population groups. While improvements to the classifier and encoder layers of the neural network reduced the effects of spurious correlations and attribute imbalance, attribute generalization remained a challenge.
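The distinction between fixing the classifier (the final prediction layer) and fixing the encoder (the feature extractor) can be made concrete. One common intervention in this family, sketched below in PyTorch with stand-in layer sizes and a hypothetical group-balanced sample, freezes the encoder and retrains only the classifier head; this is an illustration of the idea, not one of the specific algorithms the study benchmarked:

```python
# A minimal sketch, assuming a PyTorch model split into an encoder and a
# classifier head with stand-in sizes: freeze the encoder and retrain only
# the head on a hypothetical group-balanced sample.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # stand-in feature extractor
classifier = nn.Linear(32, 2)                          # stand-in prediction head

for p in encoder.parameters():
    p.requires_grad = False  # keep learned features fixed; update the head only

opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical sample re-balanced across subgroups (random stand-in data here).
X_balanced = torch.randn(256, 16)
y_balanced = torch.randint(0, 2, (256,))

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(classifier(encoder(X_balanced)), y_balanced)
    loss.backward()
    opt.step()
```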
Assessing fairness across population groups is another concern. The commonly used metric, worst-group accuracy (WGA), rests on the assumption that raising accuracy for the worst-performing group improves the model as a whole. However, the study found that boosting worst-group accuracy can come at the cost of worst-case precision, the fraction of a group’s positive predictions that are actually correct. Both accuracy and precision are essential in medical decision-making, so a metric that trades one for the other can be misleading.
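The two quantities are straightforward to compute side by side. Here is a minimal sketch, assuming binary labels and one integer group identifier per example (hypothetical inputs, not the study’s evaluation code):

```python
# A minimal sketch: worst-group accuracy next to worst-group precision,
# since optimizing the first can quietly degrade the second.
import numpy as np

def worst_group_metrics(y_true, y_pred, group):
    """Return (worst-group accuracy, worst-group precision)."""
    accs, precs = [], []
    for g in np.unique(group):
        m = group == g
        accs.append((y_pred[m] == y_true[m]).mean())
        positives = m & (y_pred == 1)
        if positives.any():  # precision is undefined without positive predictions
            precs.append((y_true[positives] == 1).mean())
    return min(accs), min(precs)

# Toy example: two groups, binary diagnosis.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
wga, wgp = worst_group_metrics(y_true, y_pred, group)
print(f"worst-group accuracy: {wga:.2f}, worst-group precision: {wgp:.2f}")
```

In this toy example both groups have the same accuracy, yet their precision differs, which is exactly the kind of gap a WGA-only evaluation would miss.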
Overall, this research sheds light on the biases that can arise in machine learning models, contributing to the development of fairer and more accurate healthcare AI systems.