# Fidelity Metrics and Visualizations

## Metrics

For more details on the metric functions, please refer to `rockfish.labs.metrics`.
### Overall Fidelity Score
The overall fidelity score is a weighted average that quantifies the similarity between the source and synthetic data by examining their marginal distributions.
The distance metric depends on the field type:

- For categorical fields, the total variation distance is used.
- For continuous fields, the Kolmogorov–Smirnov distance is used.

| Data Type | Fields Considered |
|---|---|
| Tabular | All |
| Timeseries | Metadata, Measurement, Session Length, Interarrival Time |

- Default weights: 1 for every field. Users can adjust the weights to compute a customized score.
- Score range: 0 to 1, with 1 indicating the highest fidelity.
For detailed usage of the overall fidelity score, refer to `rockfish.labs.metrics.marginal_dist_score`.
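As an illustration of how such a score can combine per-field distances into a single fidelity number, here is a minimal sketch. The helper `marginal_fidelity` is hypothetical, not the library's implementation; it only shows the weighted-average-of-distances idea described above.

```python
def marginal_fidelity(distances, weights=None):
    """Map a weighted average of per-field distances (each in [0, 1])
    to a fidelity score in [0, 1], where 1 means identical marginals.
    Hypothetical helper for illustration only."""
    if weights is None:
        weights = [1.0] * len(distances)  # default weight of 1 per field
    total = sum(weights)
    avg_distance = sum(w * d for w, d in zip(weights, distances)) / total
    return 1.0 - avg_distance

# e.g. TV distance 0.1 for a categorical field and KS distance 0.2 for a
# continuous field, with the default equal weights:
score = marginal_fidelity([0.1, 0.2])
print(round(score, 2))  # 0.85
```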
### Univariate Metrics
Univariate metrics evaluate how well synthetic data preserves the statistical characteristics of individual variables/fields compared to the original data.
| Metric | Description | Value Range |
|---|---|---|
| Range coverage | Measures how well the synthetic data covers the full range of values from the original data | 1 = Best; 0 = Worst |
| Category coverage | Measures the proportion of categories from the original data that appear in the synthetic data | 1 = Best; 0 = Worst |
| Total variation distance | Measures the maximum difference between the probability distributions of categorical variables | 0 = Best; 1 = Worst |
| Jensen–Shannon distance | Measures the similarity between probability distributions using a symmetrized, smoothed form of the Kullback–Leibler divergence | 0 = Best; 1 = Worst |
| Wasserstein distance | Measures the minimum "cost" to transform one distribution into another | 0 = Best; ∞ = Worst |
| Kolmogorov–Smirnov distance | Measures the maximum difference between the cumulative distribution functions | 0 = Best; 1 = Worst |
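To make the two distances used in the overall fidelity score concrete, here is a minimal plain-Python sketch of the total variation and Kolmogorov–Smirnov distances on raw samples. This is illustrative only, not the library's implementation.

```python
from collections import Counter

def tv_distance(a, b):
    """Total variation distance between the empirical distributions of
    two categorical samples: half the L1 distance between probability
    vectors. 0 = identical, 1 = disjoint."""
    pa, pb = Counter(a), Counter(b)
    cats = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / len(a) - pb[c] / len(b)) for c in cats)

def ks_distance(x, y):
    """Kolmogorov–Smirnov distance: the maximum gap between the two
    empirical cumulative distribution functions."""
    def ecdf(sample, v):
        return sum(1 for e in sample if e <= v) / len(sample)
    grid = sorted(set(x) | set(y))
    return max(abs(ecdf(x, v) - ecdf(y, v)) for v in grid)

print(tv_distance(["a", "a", "b", "b"], ["a", "b", "b", "b"]))  # 0.25
print(ks_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 (identical samples)
```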
### Bivariate Metrics
Bivariate metrics evaluate the strength and nature of relationships between pairs of variables, measuring how synthetic data maintains the statistical dependencies present in the source data.
#### For Numerical Fields: Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear correlation between two variables, i.e. how strongly the two variables move together.

- The coefficient ranges from -1 to 1:
  - A value of -1 indicates a perfect negative linear relationship: the variables move together in opposite directions.
  - A value of 0 indicates no linear relationship between the two fields.
  - A value of 1 indicates a perfect positive linear relationship: the variables move together in the same direction.
- The p-value indicates the statistical significance of the correlation coefficient. A low p-value (typically < 0.05) suggests that the correlation is statistically significant.
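As a quick illustration of the definition, here is a self-contained Pearson implementation. It is a sketch; in practice a library routine such as `scipy.stats.pearsonr` also reports the p-value mentioned above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient: covariance of x and y divided by
    the product of their standard deviations. Illustrative only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative)
```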
#### For Categorical Fields: Cramér's V

Cramér's V measures the association between two categorical variables. It can be calculated with or without bias correction[^1]. By default, bias correction is enabled (`correction=True`); set `correction=False` to disable it.

- A value of 0 indicates no association between the two categorical fields.
- A value of 1 indicates a perfect association between the two categorical fields, meaning one field can be completely determined from the other.
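As an illustration of the definition, here is a minimal plain-Python Cramér's V with an optional Bergsma (2013) bias correction. This is a sketch, not the library's implementation; for production use, a routine such as `scipy.stats.contingency.association` covers the uncorrected case.

```python
import math

def cramers_v(table, correction=False):
    """Cramér's V from a contingency table (list of rows).
    correction=True applies the Bergsma (2013) bias correction.
    Illustrative sketch only."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # Pearson chi-squared statistic against independence.
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(r) for j in range(c)
    )
    phi2 = chi2 / n
    if correction:
        phi2 = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        c = c - (c - 1) ** 2 / (n - 1)
    return math.sqrt(phi2 / min(r - 1, c - 1))

print(cramers_v([[5, 0], [0, 5]]))  # 1.0 (perfect association)
print(cramers_v([[2, 2], [2, 2]]))  # 0.0 (independence)
```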
## Examples

> **Note:** Please import the following libraries before evaluating the datasets:

```python
import rockfish as rf
import rockfish.labs as rl
```
### Prerequisite for Timeseries Dataset Evaluation

Metadata fields or a session key that defines sessions are required for session-related evaluation (such as the session length, interarrival time, and transitions measurements).

```python
# Specify the metadata fields or session key, and add that schema to each
# dataset so that session-related metrics can be computed.
table_metadata = rf.TableMetadata(metadata=["<metadata fields or session_key>"])
ts_data = ts_data.with_table_metadata(table_metadata)
ts_syn = ts_syn.with_table_metadata(table_metadata)
```
### Session Length Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Session length refers to the number of records per session. See the prerequisite above before evaluating.

To visualize the distribution:

```python
# Compute session lengths for the source and synthetic datasets.
source_sess = rf.metrics.session_length(ts_data)
syn_sess = rf.metrics.session_length(ts_syn)
# "session_length" is a fixed column name in the computed datasets.
rl.vis.plot_kde([source_sess, syn_sess], "session_length");
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(source_sess, syn_sess, "session_length")
```
### Interarrival Time Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Interarrival time is the difference between consecutive timestamps within a session. See the prerequisite above before evaluating.

To visualize the distribution:

```python
# Compute interarrival times for the source and synthetic datasets.
timestamp_field = "<timestamp_field>"
source_interarrival = rf.metrics.interarrivals(ts_data, timestamp_field)
syn_interarrival = rf.metrics.interarrivals(ts_syn, timestamp_field)
# "interarrival" is a fixed column name in the computed datasets.
rl.vis.plot_kde(
    [source_interarrival, syn_interarrival], "interarrival", duration_unit="s"
);
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(source_interarrival, syn_interarrival, "interarrival")
```
### Transitions Measurement

Applies to: Timeseries datasets. Tabular datasets: NA.

Transitions consider all state transitions within sessions. See the prerequisite above before evaluating.

Transitions can be counted in different ways, depending on the n-gram size and on whether consecutive repeated states are collapsed. Suppose a dataset contains the following 2 sessions:

Session 1: A -> B -> B
Session 2: A -> B -> B -> C

With 2-grams, uncollapsed counting keeps repeated states at consecutive events, producing the transitions:

Session 1: A -> B, B -> B
Session 2: A -> B, B -> B, B -> C

With 2-grams, collapsed counting skips repeated states, so the sessions are first reduced to:

Session 1: A -> B
Session 2: A -> B -> C
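The uncollapsed and collapsed counting described above can be sketched in plain Python. This is illustrative only; `count_transitions` is a hypothetical helper, not the library API.

```python
from collections import Counter
from itertools import groupby

def count_transitions(sessions, n=2, collapse=False):
    """Count n-gram state transitions within each session. With
    collapse=True, consecutive repeated states are merged first.
    Illustrative sketch only."""
    counts = Counter()
    for states in sessions:
        if collapse:
            # Merge runs of identical consecutive states, e.g. A,B,B -> A,B.
            states = [s for s, _ in groupby(states)]
        for i in range(len(states) - n + 1):
            counts[" -> ".join(states[i : i + n])] += 1
    return counts

sessions = [["A", "B", "B"], ["A", "B", "B", "C"]]
print(count_transitions(sessions))                 # uncollapsed 2-grams
print(count_transitions(sessions, collapse=True))  # collapsed 2-grams
```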
To visualize the distribution:

```python
field_name = "<stateful_field>"
transitions_source = rf.metrics.transitions_within_sessions(ts_data, field=field_name)
transitions_syn = rf.metrics.transitions_within_sessions(ts_syn, field=field_name)
rl.vis.plot_bar(
    [transitions_source, transitions_syn], f"{field_name}_transitions", orient="horizontal"
);
```

To compute the similarity of the distributions, you can use the total variation distance:

```python
rl.metrics.tv_distance(transitions_source, transitions_syn, f"{field_name}_transitions")
```
### Individual Field Measurement

Applies to: Timeseries and tabular datasets.

For a continuous numerical field, we can visualize its distribution via a probability density function (using KDE) or a histogram:

```python
# Plot KDE
rl.vis.plot_kde([dataset, syn], field="<continuous_field_name>");
# Plot histogram
rl.vis.plot_hist([dataset, syn], field="<continuous_field_name>");
```

To compute the similarity of the distributions, you can use the Kolmogorov–Smirnov distance:

```python
rl.metrics.ks_distance(dataset, syn, "<continuous_field_name>")
```

For a categorical field, we can visualize its distribution via a bar chart:

```python
# Plot bar chart
rl.vis.plot_bar([dataset, syn], field="<categorical_field_name>");
```

To compute the similarity of the distributions, you can use the total variation distance:

```python
rl.metrics.tv_distance(dataset, syn, "<categorical_field_name>")
```
### Correlation Measurement

Applies to: Timeseries and tabular datasets.

This measures the pairwise correlation between multiple numerical fields. To visualize the correlation between two numerical fields, we can show a scatter plot:

```python
rl.vis.plot_correlation([dataset, syn], "<numerical_field1>", "<numerical_field2>")
```

To visualize correlations among more than two numerical fields, we can show a correlation heatmap:

```python
rl.vis.plot_correlation_heatmap(
    [dataset, syn], ["<numerical_field1>", "<numerical_field2>", ..., "<numerical_field_k>"]
)
```

To compute the score:

```python
rl.metrics.correlation_score(dataset, syn, ["<numerical_field1>", "<numerical_field2>", ..., "<numerical_field_k>"])
```
### Association Measurement

Applies to: Timeseries and tabular datasets.

This measures the pairwise association between multiple categorical fields.

To visualize the association among more than two categorical fields, we can show an association heatmap:

```python
rl.vis.plot_association_heatmap(
    [dataset, syn], ["<categorical_field1>", "<categorical_field2>", ..., "<categorical_field_k>"]
)
```

To compute the score:

```python
rl.metrics.association_score(dataset, syn, ["<categorical_field1>", "<categorical_field2>", ..., "<categorical_field_k>"])
```
View example for timeseries dataset

View example for tabular dataset

## Visualizations

Refer to `rockfish.labs.vis`.
[^1]: Bergsma, Wicher (2013). "A bias correction for Cramér's V and Tschuprow's T". Journal of the Korean Statistical Society. 42 (3): 323–328. doi:10.1016/j.jkss.2012.10.002