Privacy-Preserving Synthetic Data
Generating privacy-preserving synthetic datasets and evaluating their accuracy and fairness using Data Synthesizer and MST.
Introduction
This project explores privacy-preserving synthetic data generation using the Data Synthesizer and MST methods. We evaluate how closely the generated data matches the statistical properties of the sensitive source data and how well it preserves the relationships between attributes.
Objectives:
- Generate synthetic datasets from sensitive data using various modes (Random, Independent Attribute, Correlated Attribute).
- Evaluate the accuracy of statistical properties in the synthetic data compared to the real data.
- Compare two privacy-preserving synthetic data generators: Data Synthesizer and MST.
Methodology
Synthetic Data Generation
Using Data Synthesizer, synthetic datasets were generated under the following configurations:
- Mode A: Random mode.
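- Mode B: Independent attribute mode with privacy budget ε = 0.1.
- Mode C: Correlated attribute mode with k = 1 (at most one parent per node in the Bayesian network), ε = 0.1.
- Mode D: Correlated attribute mode with k = 2 (at most two parents per node), ε = 0.1.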
Each dataset contains 10,000 samples.
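To make the role of ε concrete, the idea behind a differentially private, attribute-by-attribute generator can be sketched in a few lines of standard-library Python. This is an illustration only, not Data Synthesizer's implementation: the function names, the toy `sex` column, and the assumption that each histogram count has sensitivity 1 (add/remove-one-record neighbours) are all hypothetical choices made for this sketch.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(values, domain, epsilon, rng):
    """Normalized histogram over a fixed, public `domain` with Laplace noise."""
    counts = Counter(values)
    # Assumption: sensitivity 1, so the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    noisy = [max(counts.get(v, 0) + laplace_noise(scale, rng), 0.0) for v in domain]
    total = sum(noisy)
    if total == 0:
        # Every noisy count was clamped to zero: fall back to uniform.
        return [1.0 / len(domain)] * len(domain)
    return [c / total for c in noisy]


def sample_column(domain, probs, n, rng):
    """Draw n synthetic values from the noisy distribution."""
    return rng.choices(domain, weights=probs, k=n)


rng = random.Random(0)
domain = ["Female", "Male"]
# Toy stand-in for a sensitive column; not data from this project.
real_sex = rng.choices(domain, weights=[0.3, 0.7], k=1000)
probs = noisy_histogram(real_sex, domain, epsilon=0.1, rng=rng)
synthetic_sex = sample_column(domain, probs, n=10_000, rng=rng)
```

A smaller `epsilon` gives a larger noise scale, so the sampled column drifts further from the real histogram. Clamping, normalizing, and sampling only post-process the noisy counts, so they do not consume additional privacy budget.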
Evaluation Metrics:
- Statistical Queries: Compare the mean, median, minimum, and maximum of the age and score attributes between the real and synthetic datasets.
- Distribution Analysis: Visualize histograms of age and sex for synthetic and real datasets.
- Statistical Tests: Use the two-sample Kolmogorov-Smirnov test and KL divergence to measure how similar the synthetic distributions are to the real ones.
- Mutual Information: Analyze pairwise mutual information between attributes.
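The metrics listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the project's evaluation code: the function names are hypothetical, KL divergence and mutual information are computed on discrete columns (a continuous attribute such as age would need binning first), and a small smoothing constant is added to the synthetic distribution so the KL divergence stays finite.

```python
import math
import statistics
from collections import Counter


def summary(values):
    """Mean, median, min and max of a numeric column."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def kl_divergence(real, synth, smoothing=1e-9):
    """KL(real || synth) in nats over the categories seen in either column."""
    cats = list(set(real) | set(synth))
    p_counts, q_counts = Counter(real), Counter(synth)
    p = [p_counts[c] / len(real) for c in cats]
    # Assumption: additive smoothing so categories missing from `synth` stay finite.
    q = [q_counts[c] / len(synth) + smoothing for c in cats]
    q_total = sum(q)
    return sum(pi * math.log(pi * q_total / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(x, y):
    """Mutual information in nats between two discrete columns of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())
```

Calling `summary` on the same attribute in the real and synthetic data and differencing the results gives the statistical-query error. `ks_statistic` suits numeric attributes, `kl_divergence` suits categorical ones, and evaluating `mutual_information` for every attribute pair yields the matrix that can be compared between datasets.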

Results
- Synthetic Data Accuracy: Random mode (A) answered statistical queries less accurately than the correlated attribute modes (C and D).
- Distribution Analysis: Independent attribute mode (B) better preserved the original distribution than random mode (A).
- Privacy Budget Impact: Lower ε values increased privacy but reduced the fidelity of the synthetic data.
Code
The code can be found in the GitHub repository.
Future Work
- Extend to multi-class classification datasets.
- Evaluate synthetic data generation on larger and more complex datasets.
- Compare additional privacy-preserving methods such as PrivBayes and PATE-CTGAN.