Basic Generation

Synthetic generation is the process of creating artificial data that mimics real-world datasets while preserving privacy and meeting specific requirements. It is useful for applications such as training machine learning models, testing, and privacy-preserving analytics.

Synthetic conditional generation involves generating data based on specified conditions or constraints, enabling the creation of realistic datasets that follow particular patterns, distributions, or dependencies. This approach allows more control over the synthetic data, making it especially useful for testing, scenario simulations, and analytics while ensuring that privacy requirements are met.

Generate Module

Rockfish's Generate Module gives users the flexibility to use any trained model to generate high-quality synthetic data for specific use cases.

With the Generate Module, you can:

  • Use the trained model to generate synthetic data based on your specified configurations.
  • Customize the generation process to meet specific requirements, such as data volume or target features.

Generation Process

1. Fetch the trained model

After training is complete, fetch the model with:

model = await workflow.models().last()

2. Create a Generate Action

The generate action is created specific to the trained Rockfish model with its generation configuration. For details, please check out Model Generation Configuration.

3. Create a SessionTarget action

The session target action assigns a target generation value to the synthetic output.

target = ra.SessionTarget(target = <target generation value>)

Default Generation

For default generation, you do not need to specify a target value:

target = ra.SessionTarget()

By default:

  • For time series models, it generates the same number of sessions as in the training data.
  • For tabular models, it generates the same number of records as in the training data.

4. Create a Save action

The save action is used to store the generated dataset.

# Replace the placeholder with a name for the synthetic dataset.
save = ra.DatasetSave(name="<synthetic data name>")

5. Build generation workflow

Build the generation workflow and start the generation job as follows:

builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model, target])
builder.add_action(target, parents=[generate])
builder.add_action(save, parents=[generate])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")

6. Generate synthetic data

syn = None
async for sds in workflow.datasets():
    syn = await sds.to_local(conn)
syn.to_pandas()
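
The loop in step 6 keeps only the last dataset the workflow yields, and `syn` stays `None` if nothing is yielded, in which case `syn.to_pandas()` would fail. The following is a minimal sketch of a guard for that pattern, assuming only that `workflow.datasets()` behaves like an async iterator. The names `last_item` and `fake_datasets` are hypothetical helpers for illustration and are not part of the Rockfish SDK.

```python
import asyncio
from typing import AsyncIterator, Optional, TypeVar

T = TypeVar("T")


async def last_item(items: AsyncIterator[T]) -> Optional[T]:
    """Return the last value yielded by an async iterator, or None if it is empty."""
    last: Optional[T] = None
    async for item in items:
        last = item
    return last


async def fake_datasets(names):
    """Hypothetical stand-in for workflow.datasets(); yields placeholder names."""
    for name in names:
        yield name


# A workflow that produced two datasets: the last one is kept.
last_seen = asyncio.run(last_item(fake_datasets(["partial-output", "final-output"])))

# A workflow that produced nothing: the caller can test for None before converting.
nothing = asyncio.run(last_item(fake_datasets([])))
```

In a notebook that already supports top-level `await`, the helper could be awaited directly instead of going through `asyncio.run`.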


Use Cases for generation

Now that you know how to generate data using the Generate Module, let's explore the different ways to generate data for your specific use case.

Let's take a look at a few example use cases.

Use Case 1: Regulatory Compliance and Data Masking

  • Scenario: A healthcare provider needs to share patient data with a third-party analytics firm for research purposes.
  • Problem: Sensitive patient data cannot be shared.
  • Solution: Since the actual patient data is sensitive, you can generate synthetic health records to share safely.

Solution: Generate a Specific Amount of Data

To use this solution, please follow the steps described in the General Generation Process.


Use Case 2: Stress Testing Systems or Applications

  • Scenario: A telecom company is launching a new billing system and wants to ensure it can handle millions of transactions per minute.
  • Problem: There is not enough real data to conduct the test efficiently.
  • Solution: By generating a massive amount of synthetic transaction data, they can stress test the system under load.

Solution: Continuous Generation

To use this solution, assign a large target generation value to the session target action for large-scale generation, as described in Step 3 above in the General Generation Process.


Use Case 3: Tracking Customer Transaction Behavior Based on Session Metadata

  • Scenario: A platform aims to analyze transaction behavior for various customer groups to promote relevant categories effectively.
  • Problem: The platform possesses historical session data for customer transactions, but specific combinations of metadata (e.g., age and gender) are scarce or nonexistent, making it challenging or impossible to determine these groups' transaction behavior.
  • Solution: By defining "given_metadata" in the generation config, the model generates customer data with specific demographic characteristics based on patterns learned from the training data, thereby providing the platform with sufficient data for analysis.

Solution: Generation with Conditions on Session Metadata

To use this solution, update the generate configuration described in Step 2 of the General Generation Process.

Note: This feature is only supported with the RF-Time-GAN model.

Model: RF-Time-GAN

Steps to Update Generate Action:

  1. Define Metadata Constraints:
    given_metadata = {"<metadata1>":["value1",..."value N"], "<metadata2>": ["value 1","value 2",..."value N"] }

  2. Generate Configuration to include number of sessions and the metadata constraints:
    generate_config = ra.GenerateTimeGAN.Config(doppelganger=ra.GenerateTimeGAN.DGConfig(given_metadata = given_metadata))

  3. Create Generate Action with Specified Configurations:
    generate = ra.GenerateTimeGAN(generate_config)

Tutorial: See the example tutorial (Open in Colab).
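
Before filling in `given_metadata`, it can help to find which metadata combinations are actually scarce in the historical sessions. The following is a minimal sketch in plain Python, not a Rockfish API: the function `scarce_combinations`, the field names, and the sample values are all hypothetical. Because this document does not state how multiple values per field are combined, the sketch builds one dict per combination with a single value per field.

```python
from collections import Counter
from itertools import product


def scarce_combinations(sessions, fields, min_count):
    """List every combination of observed values for `fields` that appears
    fewer than `min_count` times in `sessions` (including zero times)."""
    counts = Counter(tuple(s[f] for f in fields) for s in sessions)
    # Distinct observed values per field, sorted for a stable order.
    values = [sorted({s[f] for s in sessions}) for f in fields]
    return [combo for combo in product(*values) if counts[combo] < min_count]


# Hypothetical session metadata; field names and values are illustrative only.
fields = ["age", "gender"]
sessions = [
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "M"},
    {"age": "65+", "gender": "M"},
    {"age": "65+", "gender": "M"},
]

scarce = scarce_combinations(sessions, fields, min_count=2)

# One dict per scarce combination, in the same shape as given_metadata.
per_combination = [
    {field: [value] for field, value in zip(fields, combo)} for combo in scarce
]
```

Each entry of `per_combination` could then serve as the `given_metadata` value for a separate generation run.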

Use Case 4: Simulating Rare Events

  • Scenario: An e-commerce platform wants to improve its fraud detection system.
  • Problem: Fraudulent transactions make up only a tiny fraction of their overall transactions.
  • Solution: Generate a large number of synthetic fraudulent transactions to address the imbalance between normal and fraudulent transactions, potentially improving the fraud detection system's performance.

Solution: Generate specific amount of data with conditions

To use this solution, add a condition action and update the session target and workflow steps (Steps 3 and 5) of the General Generation Process.

Note: Applicable to all four models. If you have multiple conditions with different desired amounts, follow the steps below for each condition and then concatenate the results.

For example, users may want to generate 1000 records or sessions of fraud events.

  1. Set Conditions: You can define conditions either with the PostAmplify action or with the SQL action:
    condition_filter = ra.PostAmplify({
        "query_ast": {
            "eq": ["fraud", 1]
        },
    })
    
    Alternatively, you can use the SQL action:
    condition_filter = ra.SQL(query = "SELECT * FROM my_table WHERE fraud=1")
    
  2. Set Target Value: it controls the number of generated conditional records (for tabular data) or sessions (for time-series data):

    target = ra.SessionTarget(target=1000)
    

  3. Build the Workflow:

    builder = rf.WorkflowBuilder()
    builder.add_model(model)
    builder.add_action(generate, parents = [model, target])
    builder.add_action(condition_filter, parents=[generate])
    builder.add_action(target, parents=[condition_filter])
    builder.add_action(save, parents=[condition_filter])
    workflow = await builder.start(conn)
    print(f"Workflow: {workflow.id()}")
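
To make the interaction between the condition filter and the target value more concrete, the following is a conceptual sketch in plain Python. It is not a description of Rockfish internals: `generate_batch` is a hypothetical stand-in for a trained model, and the 5% fraud rate, batch size, and field names are assumptions. The idea shown is to keep generating, keep only the records that match the condition, and stop once the target count is reached. It also shows the concatenation of per-condition results mentioned in the note above.

```python
import random


def generate_batch(rng, size):
    """Hypothetical stand-in for a generator: random records, about 5% fraud."""
    return [
        {"amount": round(rng.uniform(1, 500), 2),
         "fraud": 1 if rng.random() < 0.05 else 0}
        for _ in range(size)
    ]


def generate_until_target(condition, target, batch_size=200, seed=0, max_batches=1000):
    """Generate batches and keep matching records until `target` are collected."""
    rng = random.Random(seed)
    kept = []
    for _ in range(max_batches):
        batch = generate_batch(rng, batch_size)
        kept.extend(record for record in batch if condition(record))
        if len(kept) >= target:
            return kept[:target]
    raise RuntimeError("target not reached within max_batches")


# Same predicate as the "eq": ["fraud", 1] condition, with a target of 1000.
fraud = generate_until_target(lambda r: r["fraud"] == 1, target=1000)

# A second condition with a different desired amount, then concatenation.
normal = generate_until_target(lambda r: r["fraud"] == 0, target=500, seed=1)
combined = fraud + normal
```

The `max_batches` cap is a design choice for the sketch: a very rare condition would otherwise loop for a long time, so it fails loudly instead.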
    


Use Case 5: Ensuring equal representation to meet specific synthetic data requirements.

  • Scenario: A snack manufacturer wants equal representation of flavors in their snack packages, but when generating synthetic stock data for this process, the distribution of flavors can vary, with some flavors being over- or under-represented.
  • Problem: When synthesizing stock data for the manufacturer, the model learns most of the distribution but cannot guarantee an exact equal distribution. Some flavors may be generated more frequently than others, leading to an imbalanced dataset. Hence, the stock data may not meet the marketing requirement for equal representation of each flavor.
  • Solution: With Equal Data Distribution, the synthetic data generation model can be modified to enforce equal representation of values in the "flavors" field. This constraint ensures that all flavors are equally distributed, satisfying the manufacturer's marketing requirement for stock data and promotional standards. This guarantees the generated synthetic dataset mirrors the ideal flavor distribution for the snack packages.

Solution: Equal Data Distribution

To use this solution, update Step 5 of the General Generation Process.

replacement = ra.Replace(
    field="flavors", 
    condition=ra.EqualizeCondition(equalization=True)
) 
builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model])
builder.add_action(replacement, parents=[generate])
builder.add_action(save, parents=[replacement])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
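
To illustrate what equal representation of a field means, the following is a minimal sketch in plain Python. It is an assumption-laden illustration, not the algorithm Rockfish uses: the function `equalize` and the sample flavor values are hypothetical. It reassigns surplus occurrences of over-represented values to under-represented ones, so that every distinct value ends up with the same count (or counts that differ by at most one when the total does not divide evenly).

```python
from collections import Counter


def equalize(values):
    """Reassign surplus occurrences so each distinct value appears
    floor(n/k) or ceil(n/k) times. Illustrative only."""
    counts = Counter(values)
    categories = sorted(counts)
    base, extra = divmod(len(values), len(categories))
    # The first `extra` categories (in sorted order) get one additional slot.
    quota = {c: base + (1 if i < extra else 0) for i, c in enumerate(categories)}
    # One entry per slot still needed by an under-represented category.
    deficit = [c for c in categories for _ in range(max(0, quota[c] - counts[c]))]
    seen = Counter()
    result = []
    for value in values:
        seen[value] += 1
        if seen[value] > quota[value]:
            # This occurrence exceeds the quota: hand it to a category in deficit.
            result.append(deficit.pop())
        else:
            result.append(value)
    return result


# Hypothetical imbalanced "flavors" column.
flavors = ["salt"] * 6 + ["bbq"] * 2 + ["lime"] * 1
balanced = equalize(flavors)
balanced_counts = Counter(balanced)
```

Comparing `Counter(flavors)` with `balanced_counts` is also a quick way to check whether a generated dataset meets the equal-representation requirement.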

Use Case 6: Validating product combinations in retail inventory data generation.

  • Scenario: In retail, certain products can only be sold together under specific bundles or promotions. For example, a "Back-to-School" promotion bundles notebooks with pens, but never with unrelated items like kitchen utensils. The inventory system must ensure that products adhere to these combination rules when generating synthetic stock or sales data.
  • Problem: When generating synthetic retail data, the model might create product combinations that are not allowed (e.g., notebooks paired with kitchen utensils) because it doesn't understand the bundling constraints. This leads to invalid datasets that don't align with business rules.
  • Solution: By applying Inclusion & Exclusion Constraints, you can ensure that the generated data respects the valid product combinations. Inclusion constraints enforce that certain products (e.g., notebooks) are only paired with allowed items (e.g., pens), while exclusion constraints prevent disallowed combinations (e.g., notebooks and kitchen utensils). This helps maintain the integrity of the synthetic data while preserving other important data characteristics.

Solution: Inclusion & Exclusion Constraint on Generation

To use this solution, update Step 5 of the General Generation Process.

For example, if the front doors are fixed to black, the rear doors must also be black.

replacement = ra.Replace(
    field="color",
    # Mask the rows where a rear door is not black.
    condition=ra.SQLCondition(query="select door='rear' and color!='black' as mask from my_table"),
    # Replace the masked values with "black".
    resample=ra.ValuesResample(replace_values=["black"]),
)
builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model])
builder.add_action(replacement, parents=[generate])
builder.add_action(save, parents=[replacement])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
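
The mask-and-replace logic of the door color example can be checked locally with a few lines of plain Python. This is a minimal sketch and not a Rockfish API: the functions `violates` and `enforce` and the sample rows are hypothetical, and the rule is the one from the example above (a rear door must be black).

```python
def violates(row):
    """Mask from the example above: a rear door whose color is not black."""
    return row["door"] == "rear" and row["color"] != "black"


def enforce(rows):
    """Return copies of the rows with violating colors replaced by 'black'."""
    return [{**row, "color": "black"} if violates(row) else dict(row) for row in rows]


# Hypothetical generated rows; only the second one breaks the rule.
rows = [
    {"door": "front", "color": "black"},
    {"door": "rear", "color": "red"},
    {"door": "rear", "color": "black"},
    {"door": "front", "color": "white"},
]

fixed = enforce(rows)
remaining_violations = [row for row in fixed if violates(row)]
```

Running the same `violates` check on a downloaded synthetic dataset is one way to confirm that a constraint was applied as intended.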