Privacy-Preserving Synthetic Data

Introduction

This project explores privacy-preserving synthetic data generation using the Data Synthesizer and MST methods. We evaluate the generated data’s statistical properties and how well it preserves the relationships between attributes found in the sensitive source data.

Objectives:

Generate synthetic datasets from sensitive data using various modes (Random, Independent Attribute, Correlated Attribute).
Evaluate the accuracy of statistical properties in the synthetic data compared to the real data.
Compare two privacy-preserving synthetic data generators: Data Synthesizer and MST.

Methodology

Synthetic Data Generation

Using Data Synthesizer, synthetic datasets were generated under the following configurations:

Mode A: Random mode.
Mode B: Independent attribute mode with ε = 0.1.
Mode C: Correlated attribute mode with k = 1, ε = 0.1.
Mode D: Correlated attribute mode with k = 2, ε = 0.1.

Each dataset contains 10,000 samples.
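
The four configurations above can be written down as a small table and passed to the generator. The following is a minimal sketch, assuming the DataSynthesizer Python package is installed; the `ModeConfig` class, the `generate` helper, and all file paths are hypothetical names introduced for illustration and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModeConfig:
    mode: str                         # "random", "independent", or "correlated"
    epsilon: Optional[float] = None   # privacy budget; not used in random mode
    k: Optional[int] = None           # max parents per node in the Bayesian network


# The four configurations listed above (labels A-D follow the text).
MODES = {
    "A": ModeConfig("random"),
    "B": ModeConfig("independent", epsilon=0.1),
    "C": ModeConfig("correlated", epsilon=0.1, k=1),
    "D": ModeConfig("correlated", epsilon=0.1, k=2),
}

NUM_SAMPLES = 10_000


def generate(label, input_csv, description_json, output_csv):
    """Describe the real data, then sample a synthetic dataset for one mode.

    Hypothetical helper: file paths are supplied by the caller.
    """
    # Imported lazily so the configuration table can be inspected
    # even when the DataSynthesizer package is not installed.
    from DataSynthesizer.DataDescriber import DataDescriber
    from DataSynthesizer.DataGenerator import DataGenerator

    cfg = MODES[label]
    describer = DataDescriber()
    generator = DataGenerator()

    if cfg.mode == "random":
        describer.describe_dataset_in_random_mode(input_csv)
    elif cfg.mode == "independent":
        describer.describe_dataset_in_independent_attribute_mode(
            dataset_file=input_csv, epsilon=cfg.epsilon
        )
    else:
        describer.describe_dataset_in_correlated_attribute_mode(
            dataset_file=input_csv, epsilon=cfg.epsilon, k=cfg.k
        )
    describer.save_dataset_description_to_file(description_json)

    if cfg.mode == "random":
        generator.generate_dataset_in_random_mode(NUM_SAMPLES, description_json)
    elif cfg.mode == "independent":
        generator.generate_dataset_in_independent_mode(NUM_SAMPLES, description_json)
    else:
        generator.generate_dataset_in_correlated_attribute_mode(
            NUM_SAMPLES, description_json
        )
    generator.save_synthetic_data(output_csv)
```

A call such as `generate("C", "real.csv", "description_C.json", "synthetic_C.csv")` would then produce one dataset; the file names here are placeholders.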

Evaluation Metrics:

Statistical Queries: Compare summary statistics (mean, median, min, max) of the age and score attributes between the real and synthetic data.
Distribution Analysis: Visualize histograms of age and sex for synthetic and real datasets.
Statistical Tests: Use the Kolmogorov-Smirnov test and KL divergence to measure how closely each synthetic distribution matches the real one.
Mutual Information: Analyze pairwise mutual information between attributes.
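
The statistical queries and the two similarity measures can be sketched in plain Python. This is an illustrative version only, assuming the Kolmogorov-Smirnov statistic is applied to a numerical attribute such as age and KL divergence to a categorical attribute such as sex; the function names and the smoothing constant are assumptions, not the project's actual code.

```python
import statistics
from bisect import bisect_right
from collections import Counter
from math import log


def summary(values):
    """Mean, median, min and max of one numerical attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def kl_divergence(real, synth, smoothing=1e-9):
    """KL(real || synth) in nats over the union of observed categories.

    The synthetic distribution is smoothed slightly (an assumption) so that
    a category missing from the synthetic data does not cause division by zero.
    """
    categories = set(real) | set(synth)
    real_counts, synth_counts = Counter(real), Counter(synth)
    total = 0.0
    for c in categories:
        p = real_counts[c] / len(real)
        q = (synth_counts[c] + smoothing) / (len(synth) + smoothing * len(categories))
        if p > 0:
            total += p * log(p / q)
    return total
```

In practice, `scipy.stats.ks_2samp` and `scipy.stats.entropy` cover the same ground and also return a p-value for the KS test.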

Figure 1: Mutual Information Heatmap of Synthetic Data
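
A heatmap like this is built from a matrix of pairwise mutual information values. The sketch below shows one way to compute those values for discrete attributes, along with an aggregated difference between real and synthetic data. It assumes continuous attributes such as age have already been binned, and it assumes the aggregation is a sum of absolute differences, which the text does not specify; all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts.
    return sum(
        (c / n) * log((c * n) / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def pairwise_mi(table):
    """table maps attribute name -> list of values; returns MI per attribute pair."""
    return {
        (a, b): mutual_information(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


def aggregated_mi_difference(real, synth):
    """Sum of absolute MI differences over all attribute pairs (assumed metric)."""
    mi_real, mi_synth = pairwise_mi(real), pairwise_mi(synth)
    return sum(abs(mi_real[pair] - mi_synth[pair]) for pair in mi_real)
```

For mixed or larger tables, `sklearn.metrics.mutual_info_score` computes the same quantity for a pair of label arrays.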

Results

Synthetic Data Accuracy: Random mode (A) was less accurate on statistical queries than the correlated attribute modes (C and D).
Distribution Analysis: Independent attribute mode (B) better preserved the original distribution than random mode (A).
Privacy Budget Impact: Lower ε values increased privacy but reduced the fidelity of the synthetic data.
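
The trade-off between ε and fidelity can be illustrated with a generic Laplace mechanism applied to histogram counts. This is a simplified sketch and not the exact noise calibration used by either generator; the sensitivity of 1, the example counts, and the function names are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_histogram(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity / epsilon to every bin count."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def mean_abs_error(counts, epsilon, trials=200, seed=0):
    """Average per-bin absolute error of the noisy histogram over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = noisy_histogram(counts, epsilon, rng)
        total += sum(abs(a - b) for a, b in zip(counts, noisy)) / len(counts)
    return total / trials


if __name__ == "__main__":
    counts = [120, 340, 410, 130]  # hypothetical bin counts
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean abs error {mean_abs_error(counts, eps):.2f}")
```

Because the noise scale is sensitivity / ε, shrinking ε by a factor of ten makes the expected per-bin error ten times larger, which is the fidelity loss described above.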

Figure 2: Box-and-whisker plot of KL divergence

Figure 3: Box-and-whisker plot of aggregated difference in pairwise mutual information

Figure 4: Box-and-whisker plot of aggregated difference in pairwise mutual information for MST

Code

The code can be found in the GitHub repository.

Future Work

Extend to multi-class classification datasets.
Evaluate synthetic data generation on larger and more complex datasets.
Compare additional privacy-preserving methods such as PrivBayes and PATE-CTGAN.