# Fidelity Metrics and Visualizations

## Metrics

For more details on the metric functions, please refer to `rockfish.labs.metrics`.
### Overall Fidelity Score
The overall fidelity score is a weighted average that quantifies the similarity between the source and synthetic data by examining their marginal distributions.
The distance metric depends on the field type:

- For categorical fields, the total variation distance is used.
- For continuous fields, the Kolmogorov–Smirnov distance is used.

| Data Type | Fields Considered |
|---|---|
| Tabular | All |
| Timeseries | Metadata, Measurement, Session Length, Interarrival Time |

- Default weights: 1 for every field. Users can adjust the weights to compute a customized score.
- Score range: 0 to 1, with 1 indicating the highest fidelity.
For detailed usage of the overall fidelity score, refer to `rockfish.labs.metrics.marginal_dist_score`.
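As an illustration of how such a score can combine per-field distances into a single fidelity number, here is a minimal sketch. The helper `marginal_fidelity` is hypothetical, not the library's implementation; it only shows the weighted-average-of-distances idea described above.

```python
def marginal_fidelity(distances, weights=None):
    """Map a weighted average of per-field distances (each in [0, 1])
    to a fidelity score in [0, 1], where 1 means identical marginals.
    Hypothetical helper for illustration only."""
    if weights is None:
        weights = [1.0] * len(distances)  # default weight of 1 per field
    total = sum(weights)
    avg_distance = sum(w * d for w, d in zip(weights, distances)) / total
    return 1.0 - avg_distance

# e.g. TV distance 0.1 for a categorical field and KS distance 0.2 for a
# continuous field, with the default equal weights:
score = marginal_fidelity([0.1, 0.2])
print(round(score, 2))  # 0.85
```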
### Univariate Metrics
Univariate metrics evaluate how well synthetic data preserves the statistical characteristics of individual variables/fields compared to the original data.
| Metric | Description | Value Range |
|---|---|---|
| Range coverage | Measures how well the synthetic data covers the full range of values from the original data | 1 = Best; 0 = Worst |
| Category coverage | Measures the proportion of categories from the original data that appear in the synthetic data | 1 = Best; 0 = Worst |
| Total variation distance | Measures the maximum difference between the probability distributions of categorical variables | 0 = Best; 1 = Worst |
| Jensen–Shannon distance | Measures the similarity between probability distributions using a symmetrized, smoothed form of the Kullback–Leibler divergence | 0 = Best; 1 = Worst |
| Wasserstein distance | Measures the minimum "cost" to transform one distribution into another | 0 = Best; ∞ = Worst |
| Kolmogorov–Smirnov distance | Measures the maximum difference between the cumulative distribution functions | 0 = Best; 1 = Worst |
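To make the two distances used in the overall fidelity score concrete, here is a minimal plain-Python sketch of the total variation and Kolmogorov–Smirnov distances on raw samples. This is illustrative only, not the library's implementation.

```python
from collections import Counter

def tv_distance(a, b):
    """Total variation distance between the empirical distributions of
    two categorical samples: half the L1 distance between probability
    vectors. 0 = identical, 1 = disjoint."""
    pa, pb = Counter(a), Counter(b)
    cats = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / len(a) - pb[c] / len(b)) for c in cats)

def ks_distance(x, y):
    """Kolmogorov–Smirnov distance: the maximum gap between the two
    empirical cumulative distribution functions."""
    def ecdf(sample, v):
        return sum(1 for e in sample if e <= v) / len(sample)
    grid = sorted(set(x) | set(y))
    return max(abs(ecdf(x, v) - ecdf(y, v)) for v in grid)

print(tv_distance(["a", "a", "b", "b"], ["a", "b", "b", "b"]))  # 0.25
print(ks_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 (identical samples)
```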
### Bivariate Metrics
Bivariate metrics evaluate the strength and nature of relationships between pairs of variables, measuring how synthetic data maintains the statistical dependencies present in the source data.
#### For Numerical Fields: Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear correlation between two variables, i.e. how strongly the two variables move together.

- The coefficient ranges from -1 to 1:
  - A value of -1 indicates a perfect negative linear relationship: the variables move together in opposite directions.
  - A value of 0 indicates no linear relationship between the two fields.
  - A value of 1 indicates a perfect positive linear relationship: the variables move together in the same direction.
- The p-value indicates the statistical significance of the correlation coefficient. A low p-value (typically < 0.05) suggests that the correlation is statistically significant.
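As a quick illustration of the definition, here is a self-contained Pearson implementation. It is a sketch; in practice a library routine such as `scipy.stats.pearsonr` also reports the p-value mentioned above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient: covariance of x and y divided by
    the product of their standard deviations. Illustrative only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative)
```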
#### For Categorical Fields: Cramér's V

Cramér's V measures the association between two categorical variables. It can be calculated with or without bias correction[^1]. By default, bias correction is enabled (`correction=True`); set `correction=False` to disable it.

- A value of 0 indicates no association between the two categorical fields.
- A value of 1 indicates a perfect association between the two categorical fields, meaning one field can be completely determined from the other.
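As an illustration of the definition, here is a minimal plain-Python Cramér's V with an optional Bergsma (2013) bias correction. This is a sketch, not the library's implementation; for production use, a routine such as `scipy.stats.contingency.association` covers the uncorrected case.

```python
import math

def cramers_v(table, correction=False):
    """Cramér's V from a contingency table (list of rows).
    correction=True applies the Bergsma (2013) bias correction.
    Illustrative sketch only."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # Pearson chi-squared statistic against independence.
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(r) for j in range(c)
    )
    phi2 = chi2 / n
    if correction:
        phi2 = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        c = c - (c - 1) ** 2 / (n - 1)
    return math.sqrt(phi2 / min(r - 1, c - 1))

print(cramers_v([[5, 0], [0, 5]]))  # 1.0 (perfect association)
print(cramers_v([[2, 2], [2, 2]]))  # 0.0 (independence)
```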
## Examples

> **Note:** Please import the following libraries before evaluating the datasets:

```python
import rockfish as rf
import rockfish.labs as rl
```
### Prerequisite for Timeseries Dataset Evaluation

Metadata fields or a session key that defines sessions are required for session-related evaluation (such as the session length, interarrival time, and transitions measurements).

```python
# Specify the metadata fields or session key, and add that schema to each
# dataset so that session-related metrics can be computed.
table_metadata = rf.TableMetadata(metadata=["<metadata fields or session_key>"])
ts_data = ts_data.with_table_metadata(table_metadata)
ts_syn = ts_syn.with_table_metadata(table_metadata)
```
### Session Length Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Session length refers to the number of records per session. See the prerequisite above before evaluating.

To visualize the distribution:

```python
# Compute session lengths for the source and synthetic datasets.
source_sess = rf.metrics.session_length(ts_data)
syn_sess = rf.metrics.session_length(ts_syn)
# "session_length" is a fixed column name in the computed datasets.
rl.vis.plot_kde([source_sess, syn_sess], "session_length");
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(source_sess, syn_sess, "session_length")
```
### Interarrival Time Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Interarrival time is the difference between consecutive timestamps within a session. See the prerequisite above before evaluating.

To visualize the distribution:

```python
# Compute interarrival times for the source and synthetic datasets.
timestamp_field = "<timestamp_field>"
source_interarrival = rf.metrics.interarrivals(ts_data, timestamp_field)
syn_interarrival = rf.metrics.interarrivals(ts_syn, timestamp_field)
# "interarrival" is a fixed column name in the computed datasets.
rl.vis.plot_kde(
    [source_interarrival, syn_interarrival], "interarrival", duration_unit="s"
);
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(source_interarrival, syn_interarrival, "interarrival")
```
### Transitions Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Transitions consider all state transitions within sessions. See the prerequisite above before evaluating.

Transitions can be counted in different ways, depending on the n-gram size and on whether consecutive repeated states are collapsed. Suppose a dataset contains the following 2 sessions:

Session 1: A -> B -> B
Session 2: A -> B -> B -> C

With 2-grams, uncollapsed counting keeps repeated states at consecutive events, producing the transitions:

Session 1: A -> B, B -> B
Session 2: A -> B, B -> B, B -> C

With 2-grams, collapsed counting skips repeated states, so the sessions are first reduced to:

Session 1: A -> B
Session 2: A -> B -> C
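The uncollapsed and collapsed counting described above can be sketched in plain Python. This is illustrative only; `count_transitions` is a hypothetical helper, not the library API.

```python
from collections import Counter
from itertools import groupby

def count_transitions(sessions, n=2, collapse=False):
    """Count n-gram state transitions within each session. With
    collapse=True, consecutive repeated states are merged first.
    Illustrative sketch only."""
    counts = Counter()
    for states in sessions:
        if collapse:
            # Merge runs of identical consecutive states, e.g. A,B,B -> A,B.
            states = [s for s, _ in groupby(states)]
        for i in range(len(states) - n + 1):
            counts[" -> ".join(states[i : i + n])] += 1
    return counts

sessions = [["A", "B", "B"], ["A", "B", "B", "C"]]
print(count_transitions(sessions))                 # uncollapsed 2-grams
print(count_transitions(sessions, collapse=True))  # collapsed 2-grams
```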
To visualize the distribution:

```python
field_name = "<stateful_field>"
transitions_source = rf.metrics.transitions_within_sessions(ts_data, field=field_name)
transitions_syn = rf.metrics.transitions_within_sessions(ts_syn, field=field_name)
rl.vis.plot_bar(
    [transitions_source, transitions_syn], f"{field_name}_transitions", orient="horizontal"
);
```

To compute the similarity of the distributions, you can use the total variation distance:

```python
rl.metrics.tv_distance(transitions_source, transitions_syn, f"{field_name}_transitions")
```
### Individual Field Measurement

Applies to: Timeseries and tabular datasets.

For a continuous numerical field, we can visualize its distribution via a probability density function (using KDE) or a histogram:

```python
# Plot KDE
rl.vis.plot_kde([dataset, syn], field="<continuous_field_name>");
# Plot histogram
rl.vis.plot_hist([dataset, syn], field="<continuous_field_name>");
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(dataset, syn, "<continuous_field_name>")
```

For a categorical field, we can visualize its distribution via a bar chart:

```python
# Plot bar chart
rl.vis.plot_bar([dataset, syn], field="<categorical_field_name>");
```

To compute the similarity of the distributions, you can use the total variation distance:

```python
rl.metrics.tv_distance(dataset, syn, "<categorical_field_name>")
```
### Correlation Measurement

Applies to: Timeseries and tabular datasets.

This measures the pairwise correlation between multiple numerical fields. To visualize the correlation between two numerical fields, we can show a scatter plot:

```python
rl.vis.plot_correlation([dataset, syn], "<numerical_field1>", "<numerical_field2>")
```

To visualize correlations among more than two numerical fields, we can show a correlation heatmap:

```python
rl.vis.plot_correlation_heatmap(
    [dataset, syn], ["<numerical_field1>", "<numerical_field2>", ..., "<numerical_field_k>"]
)
```

To compute the score:

```python
rl.metrics.correlation_score(dataset, syn, ["<numerical_field1>", "<numerical_field2>", ..., "<numerical_field_k>"])
```
### Association Measurement

Applies to: Timeseries and tabular datasets.

This measures the pairwise association between multiple categorical fields.

To visualize the association among more than two categorical fields, we can show an association heatmap:

```python
rl.vis.plot_association_heatmap(
    [dataset, syn], ["<categorical_field1>", "<categorical_field2>", ..., "<categorical_field_k>"]
)
```

To compute the score:

```python
rl.metrics.association_score(dataset, syn, ["<categorical_field1>", "<categorical_field2>", ..., "<categorical_field_k>"])
```
View example for timeseries dataset

View example for tabular dataset

## Visualizations

Refer to `rockfish.labs.vis`.
[^1]: Bergsma, Wicher (2013). "A bias correction for Cramér's V and Tschuprow's T". Journal of the Korean Statistical Society. 42 (3): 323–328. doi:10.1016/j.jkss.2012.10.002