Application
Synthetic data generation from private source data is a prominent and ambitious goal that has been the focus of a substantial body of prior work.
A prevalent line of work relies on Differential Privacy (DP) in the data generation process. DP is widely considered the gold standard for data privacy, as it offers rigorous, provable guarantees. Due to these guarantees, DP has been adopted by multiple organizations and governmental agencies, including Google, Apple, and the U.S. Census Bureau.
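For reference, DP has a precise formal meaning (stated here in standard notation, which is not taken from the text above): a randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for every pair of datasets $D, D'$ differing in one individual's record and every set of outputs $S$,

\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta ,
\]

where smaller $\varepsilon$ and $\delta$ correspond to stronger privacy.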
This line of work spans diverse approaches to synthetic data generation, including workload-based, statistical model-based, and GAN-based (Generative Adversarial Network) approaches.
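As a concrete illustration of the noise-addition primitive that workload-based and statistical model-based DP approaches build on, below is a minimal sketch of the Laplace mechanism applied to a single counting query. The data, attribute, and parameter values are hypothetical and not drawn from the cited work.

```python
import numpy as np

def laplace_count(true_count: float, sensitivity: float, epsilon: float,
                  rng: np.random.Generator) -> float:
    """Release a count with epsilon-DP via the Laplace mechanism.

    Adding Laplace(sensitivity / epsilon) noise to a query whose output
    changes by at most `sensitivity` when one record is added or removed
    satisfies epsilon-differential privacy.
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: a private table with one binary attribute.
rng = np.random.default_rng(0)
private_records = rng.integers(0, 2, size=1000)   # stand-in for sensitive data
true_count = float(private_records.sum())         # records with value 1

# A counting query changes by at most 1 when a single record changes.
noisy_count = laplace_count(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(f"true count = {true_count:.0f}, DP release = {noisy_count:.1f}")
```

DP synthetic data generators repeat this kind of noisy measurement over many queries or model parameters and then fit or sample a dataset consistent with the noisy answers.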
However, these approaches do not guarantee other desirable properties of the synthetic data. In particular, if the private data is biased or unfair, the generated synthetic data will likely be biased as well, and may even exacerbate these issues.
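To make the notion of unfairness concrete, one commonly used criterion (among several; the specific criterion studied varies across works) is demographic parity. Applied to synthetic data with a protected attribute $A$ and a binary outcome $Y$, it asks that

\[
\Pr[Y = 1 \mid A = a] \;=\; \Pr[Y = 1 \mid A = b] \quad \text{for all groups } a, b ,
\]

i.e., that the rate of positive outcomes in the generated data does not differ across protected groups.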
Our Innovation
The researchers are working on several complementary directions, depending on the data generation approach and the fairness criterion. For example, some non-private GAN-based data generation systems can be made to satisfy DP while ensuring the fairness of the synthetic data with only modest modifications, whereas other frameworks require more substantial research effort.
Our plan is to devise a tailored solution for each data generation approach and examine its usefulness in terms of fairness guarantees and faithfulness to the original data.
Our first work on this subject: https://www.vldb.org/pvldb/vol16/p1573-pujol.pdf
Opportunity
We are interested in understanding how industry uses such data generation systems and in collaborating with companies to support commercial implementations of these systems.