Session: Assessing Equitable Practices for Addressing Health Disparities and Inequities
A Characterization of Synthetic Data Products for AI: Transparency and Ethical Values
Friday, September 20, 2024
3:45 PM – 4:45 PM CT
Location: Midway 7-8 (First Floor)
Abstract: The underrepresentation of marginalized communities in health data has important consequences for predictive models: algorithms trained on inadequately diverse data sets perform worse for the groups that are not represented. Synthetic data (artificially created data that exhibits the same properties as real-world data) offers exciting opportunities to ensure that the benefits of artificial intelligence are shared by all. Little is known, however, about the characteristics of the synthetic data products being offered, as they are often created and distributed within the commercial realm, where transparency is lacking. It is therefore difficult to judge whether these products are aligned with important values associated with underrepresentation, such as fairness. This study aims to understand how synthetic data producers characterize their products and the degree to which these products are scrutable. We will conduct systematic database searches of relevant business news and academic research for synthetic data products marketed as augmenting representation in data sets. We will then use content analysis to generate synthetic data product categories and to characterize the organizations marketing the products, the audiences to whom the products are marketed, and the use cases offered. Finally, we will compare information about synthetic data products offered in the health care space, where stakes are higher, with information about products offered outside it. Our findings will serve as a foundational reference to inform the analysis of specific ethical and regulatory challenges arising from the use of synthetic data to improve model performance.
Learning Objectives:
After participating in this conference, attendees should be able to:
Describe the types of commercial entities offering synthetic data products, how and to whom the products are marketed, and the use cases offered.
Analyze the degree to which synthetic data producers describe ethically important attributes of their products.
Compare attributes of health-related synthetic data products to non-health-related products.