Custom sql metrics
Custom Evaluation
Rockfish platform supports a comprehensive set of built-in evaluation mechanisms, allowing you to easily assess data quality.
In addition, we offer the flexibility to create custom SQL queries for tailored evaluations, ensuring you can meet specific needs beyond standard metrics.
You can write SQL queries on your datasets using the DataFusion SQL query
engine(see SQL
reference), which
powers the SQL capabilities of the platform. Hence, according to their SELECT syntax
, if you want to query against a capitalized field, make sure to use double quotes.
Currently, the input dataset is always referred to as my_table
in the query.
Evaluate datasets based on specific conditions.
Example 1: Select the Field Without Null Values
query = """SELECT *
FROM my_table
WHERE <field_name> IS NOT NULL"""
Example 2: Select the Categorical Field Within Specified Categories
query = """SELECT *
FROM my_table
WHERE <categorical_field> IN (<category1>, <category2>)"""
Example 3: Select the Continuous Numerical Field Within a Specified Range
query = """SELECT *
FROM my_table
WHERE <numerical_field> BETWEEN <lower_bound_value> AND <upper_bound_value>"""
If you want to explore the distribution of the aggregated data per session, you can use the query below to aggregate the dataset.
Evaluate datasets based on the aggregated values
Example: count the number of unique values
query = """SELECT COUNT(DISTINCT <field_name>) AS "number of distinct values"
FROM my_table
GROUP BY <metadata_field_1>, <metadata_field_2>"""
You can then apply the defined query to the custom_plot
function by selecting the suitable plot method and providing the corresponding parameters (see table of Paramteters) except for the first parameter, list[LocalDataset], which is already provided.
import rockfish.labs as rl
# Argument of `list[LocalDataset]` should be a list of one or more datasets to which the SQL query can be applied.
rl.vis.custom_plot(list[LocalDataset], query, <rl.vis.plot_func>, <parameter1>, <parameter2>)
Alternatively, you can first obtain the selected dataset(s) and then apply them to the suitable available plot (refer to rockfish.labs.vis-funcitons)
dataset = rf.Dataset.from_csv("example", "data/path/example.csv")
dataset = dataset.sync_sql(query)
rl.vis.plot_hist(list[LocalDataset], "number of distinct values")