Privacy Metrics

Privacy metrics allow you to assess:

The degree of similarity between rows in the synthetic dataset and those in the source dataset.
The probability of an adversary re-identifying individuals in the source dataset when given access to the synthetic dataset.

Memorization Rate

Memorization rate measures the proportion of synthetic records that are exact duplicates of records from the original dataset. A high memorization rate indicates that the synthetic data generator is simply copying original data rather than learning the underlying patterns, which can pose privacy risks. Lower memorization rates suggest better generalization and privacy protection.

See rockfish.labs.metrics.memorization_rate.
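
To make the definition concrete, here is a minimal sketch of the calculation in plain Python. It is not the implementation behind rockfish.labs.metrics.memorization_rate; the function name memorization_rate_sketch and the representation of records as tuples are assumptions made for illustration.

```python
from typing import Hashable, Sequence, Tuple

Row = Tuple[Hashable, ...]


def memorization_rate_sketch(source: Sequence[Row], synthetic: Sequence[Row]) -> float:
    """Fraction of synthetic rows that exactly match some source row.

    Hypothetical helper for illustration only, not the Rockfish API.
    """
    if not synthetic:
        return 0.0
    # A set gives constant-time exact-match lookups against the source rows.
    source_rows = set(source)
    copied = sum(1 for row in synthetic if row in source_rows)
    return copied / len(synthetic)


# Toy data: two of the four synthetic rows are verbatim copies of source rows.
source = [(31, "Oslo"), (45, "Lima"), (27, "Pune"), (52, "Cairo")]
synthetic = [(31, "Oslo"), (40, "Lima"), (27, "Pune"), (60, "Cairo")]

rate = memorization_rate_sketch(source, synthetic)
print(rate)  # 0.5
```

Exact matching is sensitive to formatting: differences in floating-point rounding or string casing will hide copies, so values would normally be normalized before comparison.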

Distance to Closest Record

The Distance to Closest Record (DCR) score quantifies privacy risk by measuring how close records in the synthetic dataset are to records in the source dataset.

It does so by comparing two DCR distributions, one for each dataset pair: (source, synthetic) and (source, test). The more similar these two distributions are, the more "private" the synthetic data, because the synthetic records are then no closer to the source records than unseen real records are.

Note that the test dataset should be sampled from the same distribution as the source dataset, and should not be used to train your synthetic data generator.
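
The comparison described above can be sketched as follows. This is an illustration only: it assumes numeric records, Euclidean distance, and that each DCR is measured from a synthetic (or test) record to its nearest source record. The ratio of medians used to compare the two distributions is a stand-in chosen for simplicity and is not confirmed to be the formula Rockfish uses. The names dcr_distribution and median_dcr_ratio are hypothetical.

```python
import math
from statistics import median


def dcr_distribution(queries, reference):
    """Distance from each query record to its closest reference record."""
    return [min(math.dist(q, r) for r in reference) for q in queries]


def median_dcr_ratio(source, synthetic, test):
    """Toy comparison of the two DCR distributions.

    Assumption: ratio of medians, NOT Rockfish's actual scoring formula.
    """
    dcr_synthetic = dcr_distribution(synthetic, source)
    dcr_test = dcr_distribution(test, source)
    baseline = median(dcr_test)
    if baseline == 0:
        raise ValueError("test records coincide with source records; ratio undefined")
    return median(dcr_synthetic) / baseline


source = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
test = [(0.0, 2.0), (4.0, 6.0), (7.0, 4.0)]  # held out, never used for training

# Synthetic records about as far from the source as the test records are.
synthetic_ok = [(2.0, 0.0), (4.0, 2.0), (0.0, 7.0)]
# Synthetic records that sit on top of (or very near) source records.
synthetic_leaky = [(0.0, 0.0), (4.0, 0.5), (0.0, 4.0)]

print(dcr_distribution(test, source))                    # [2.0, 2.0, 3.0]
print(median_dcr_ratio(source, synthetic_ok, test))      # 1.0
print(median_dcr_ratio(source, synthetic_leaky, test))   # 0.0
```

On real data, numeric columns would typically be scaled and categorical columns encoded before computing distances, otherwise a single wide-range column dominates the result.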

The DCR score is a value between 0 and positive infinity. It can be interpreted using the following Likert scale for quality:

Low: [0, 0.75)
Medium: [0.75, 1.0)
High: [1.0, positive infinity)


Linkability

The linkability score measures the level of protection against linkability attacks, in which an attacker uses the synthetic records to link different fields of the same original record. Higher scores indicate better privacy protection, while lower scores suggest an increased risk of re-identification.

See rockfish.actions.EvaluateLinkability.
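
For intuition, a linkability attack can be sketched as below. This is a simplified illustration, not the procedure used by rockfish.actions.EvaluateLinkability. It assumes the attacker holds two disjoint groups of fields for the same original records, looks up the nearest synthetic record for each group separately, and declares a link when both lookups land on the same synthetic record. The function names and the use of Euclidean distance are assumptions, and the sketch reports the attack success rate rather than a protection score.

```python
import math


def nearest_index(query, candidates):
    """Index of the candidate closest to query (first one wins on ties)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


def linkability_attack_rate(targets, synthetic, cols_a, cols_b):
    """Fraction of original records whose two field groups get linked.

    Hypothetical helper for illustration only, not the Rockfish API.
    """
    def project(row, cols):
        return tuple(row[c] for c in cols)

    syn_a = [project(s, cols_a) for s in synthetic]
    syn_b = [project(s, cols_b) for s in synthetic]
    hits = 0
    for t in targets:
        # A link is made when both partial views point at the same synthetic row.
        if nearest_index(project(t, cols_a), syn_a) == nearest_index(project(t, cols_b), syn_b):
            hits += 1
    return hits / len(targets)


targets = [(1, 10, 100, 1000), (2, 20, 200, 2000), (3, 30, 300, 3000)]

# Worst case: the synthetic data is a copy of the original records.
copied = list(targets)
# Field groups (0, 1) and (2, 3) no longer line up with the original pairing.
decoupled = [(1, 10, 300, 3000), (2, 20, 100, 1000), (3, 30, 200, 2000)]

rate_copied = linkability_attack_rate(targets, copied, (0, 1), (2, 3))
rate_decoupled = linkability_attack_rate(targets, decoupled, (0, 1), (2, 3))
print(rate_copied, rate_decoupled)  # 1.0 0.0
```

A realistic evaluation would also compare the success rate against a baseline, for example the same attack run on records that were not used for training, since some links succeed by chance.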

Inference

Similar to the linkability score, the inference score measures the level of protection against inference attacks, in which an attacker attempts to infer sensitive information about individuals by analyzing patterns in the synthetic data and combining them with auxiliary knowledge. Higher scores indicate better privacy protection against such inference attempts.

See rockfish.actions.EvaluateInference.
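
An inference attack can be sketched in a similar way. This is an illustration under stated assumptions, not the procedure used by rockfish.actions.EvaluateInference: the attacker knows some numeric fields of an individual (the auxiliary knowledge), finds the nearest synthetic record on those fields, and guesses the sensitive field from it. The function names are hypothetical, and the sketch reports the attack success rate rather than a protection score.

```python
import math


def nearest_index(query, candidates):
    """Index of the candidate closest to query (first one wins on ties)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


def inference_attack_rate(targets, synthetic, known_cols, secret_col):
    """Fraction of targets whose secret field is guessed correctly.

    Hypothetical helper for illustration only, not the Rockfish API.
    """
    syn_known = [tuple(s[c] for c in known_cols) for s in synthetic]
    correct = 0
    for t in targets:
        known = tuple(t[c] for c in known_cols)
        # The guess is the secret value of the closest synthetic record.
        guess = synthetic[nearest_index(known, syn_known)][secret_col]
        if guess == t[secret_col]:
            correct += 1
    return correct / len(targets)


# Columns: age, smoker flag, diagnosis (the sensitive field).
targets = [(30, 1, "flu"), (60, 0, "asthma"), (45, 1, "diabetes")]

# Synthetic records that preserve each individual's diagnosis almost verbatim.
leaky = [(31, 1, "flu"), (59, 0, "asthma"), (44, 1, "diabetes")]
# Synthetic records whose diagnoses do not track the individuals.
safer = [(31, 1, "asthma"), (59, 0, "flu"), (44, 1, "flu")]

rate_leaky = inference_attack_rate(targets, leaky, (0, 1), 2)
rate_safer = inference_attack_rate(targets, safer, (0, 1), 2)
print(rate_leaky, rate_safer)  # 1.0 0.0
```

A high success rate is only meaningful relative to a baseline, because common values of the sensitive field can often be guessed correctly without any access to the synthetic data.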