Rockfish Models

Rockfish understands the challenge users face in connecting their high-level intents to the actual runnable algorithms and their configurations. Decisions such as when to preprocess or encode the dataset or its columns in a specific way, so that structure is preserved during generation, play a critical role in the quality of the generated data.

To dramatically lower the barrier to entry, and because the naive approach of trying all possible algorithms and configurations can be costly, Rockfish uses its proprietary Recommendation Engine, which analyzes the dataset along with the fidelity and privacy requirements of the use case to suggest an appropriate model and model parameters.

Rockfish Platform supports several Gen AI models to handle different dataset types:

  • Rockfish DoppelGANger (RF-Time-GAN)
  • Rockfish REaLTabFormer time-series (RF-Time-Transformer)
  • Rockfish REaLTabFormer tabular (RF-Tab-Transformer)
  • Rockfish CTGAN (RF-Tab-GAN)

Guidelines for Manual Model Selection

For those who want to explore and pick a model themselves, here are some guidelines:

  • Time Series Data: The RF-Time-GAN and RF-Time-Transformer models are optimized for time series data, which consists of metadata fields, a timestamp field, and measurement fields. Generally, RF-Time-GAN has a shorter training time than RF-Time-Transformer.

  • Tabular Data: RF-Tab-GAN and RF-Tab-Transformer are better suited for tabular data, i.e., a common 2-dimensional dataset. Generally, RF-Tab-GAN has a shorter training time than RF-Tab-Transformer.
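
As a rough illustration (this helper and its names are ours, not part of the Rockfish SDK), the selection guidelines above can be sketched as a small decision function:

```python
# Illustrative sketch of the manual model-selection guidelines above.
# The function name and heuristic are assumptions, not a Rockfish API.

def suggest_model(has_timestamp: bool, prefer_short_training: bool = True) -> str:
    """Pick a model family from the dataset's shape.

    Time series data (metadata fields + a timestamp field + measurement
    fields) maps to the RF-Time-* models; plain 2-dimensional tabular data
    maps to the RF-Tab-* models. Within each family, the GAN variant
    generally has the shorter training time.
    """
    if has_timestamp:
        return "RF-Time-GAN" if prefer_short_training else "RF-Time-Transformer"
    return "RF-Tab-GAN" if prefer_short_training else "RF-Tab-Transformer"
```

In practice, the Recommendation Engine makes this choice for you; the sketch only restates the rule of thumb from the bullets above.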

Guidelines for Model Training Parameters

RF-Time-GAN

  • batch_size (default: 100): Size of batches during model training and generation. Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization.
  • sample_len (default: 1): Number of records generated at a time within each session. Typically, sample_len is set to avg_session_len / 50. Increasing sample_len can reduce training time and memory utilization.
  • activate_normalization_per_sample (default: True): When true, continuous fields are normalized per session. Setting this to True improves fidelity (with a slight increase in training time and memory utilization), but can sometimes generate out-of-range values for continuous fields, which can be filtered out.
  • epoch (default: 400): Number of training iterations.

RF-Time-Transformer

  • batch_size (default: 8): Size of batches during model training and generation. Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization.
  • train_kwargs.gradient_accumulation_step (default: 4): Number of batches over which gradients are accumulated for each update to the model weights. Larger values increase the effective batch size to batch_size * train_kwargs.gradient_accumulation_step (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time.
  • train_kwargs.learning_rate: Learning rate used by the optimizer.
  • train_kwargs.weight_decay: Weight decay used by the optimizer.
  • transformer.gpt2_config.layer (default: 12): Number of hidden layers in the encoder. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • transformer.gpt2_config.head (default: 12): Number of attention heads for each attention layer in the encoder. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • transformer.gpt2_config.embed (default: 768): Dimensionality of the embeddings and hidden states (i.e., the session representation the model uses). Should be a multiple of gpt2_config.head. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • child.output_max_length (default: 512): Maximum number of tokens in each session (i.e., the encoded metadata and list of measurements). Sessions longer than this value are ignored during training. If set to None, the model automatically computes the maximum possible length. Set to None if the session lengths are homogeneous; otherwise shorter sessions will be padded to the longest session in the batch, which increases training time.
  • epochs (default: 100): Number of training iterations.

RF-Tab-Transformer

  • batch_size (default: 8): Size of batches during model training and generation. Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization.
  • train_kwargs.gradient_accumulation_step (default: 4): Number of batches over which gradients are accumulated for each update to the model weights. Larger values increase the effective batch size to batch_size * train_kwargs.gradient_accumulation_step (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time.
  • train_kwargs.learning_rate: Learning rate used by the optimizer.
  • train_kwargs.weight_decay: Weight decay used by the optimizer.
  • transformer.gpt2_config.layer (default: 12): Number of hidden layers in the encoder. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • transformer.gpt2_config.head (default: 12): Number of attention heads for each attention layer in the encoder. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • transformer.gpt2_config.embed (default: 768): Dimensionality of the embeddings and hidden states (i.e., the session representation the model uses). Should be a multiple of gpt2_config.head. The default gpt2_config values should work for most cases, but you can tune them empirically for better fidelity.
  • output_max_length (default: 512): Maximum number of tokens in each row. Since the dataset is tabular, set this to x * n_continuous_columns + n_categorical_columns, where x is approximately the maximum width of a number in the data. If set to None, the model automatically computes the maximum possible length.
  • epochs (default: 100): Number of training iterations.

RF-Tab-GAN

  • batch_size (default: 500): Size of batches during model training and generation. This value must be even, and it must be divisible by the pac parameter.
  • pac (default: 1): Number of samples to group together when applying the discriminator.
  • epochs (default: 10): Number of training iterations.
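
The rules of thumb above reduce to simple arithmetic. As a sketch (the helper names are ours, not Rockfish SDK functions), here is how the sample_len, effective batch size, output_max_length, and RF-Tab-GAN batch-size guidelines compute out:

```python
# Illustrative arithmetic for the hyperparameter guidelines above.
# Function names are assumptions for this sketch, not a Rockfish API.

def suggested_sample_len(avg_session_len: float) -> int:
    # RF-Time-GAN guideline: sample_len ~= avg_session_len / 50 (at least 1).
    return max(1, round(avg_session_len / 50))

def effective_batch_size(batch_size: int, gradient_accumulation_step: int) -> int:
    # RF-Time/Tab-Transformer: gradients accumulate over several batches,
    # so each weight update effectively sees batch_size * accumulation records.
    return batch_size * gradient_accumulation_step

def suggested_output_max_length(n_continuous: int, n_categorical: int,
                                max_number_width: int) -> int:
    # RF-Tab-Transformer guideline: x * n_continuous_columns + n_categorical_columns,
    # where x is roughly the widest number (in characters) in the data.
    return max_number_width * n_continuous + n_categorical

def valid_tab_gan_batch_size(batch_size: int, pac: int) -> bool:
    # RF-Tab-GAN constraint: batch_size must be even and divisible by pac.
    return batch_size % 2 == 0 and batch_size % pac == 0

print(effective_batch_size(8, 4))             # 32, matching the documented default
print(suggested_sample_len(500))              # 10 for sessions averaging 500 records
print(suggested_output_max_length(5, 3, 8))   # 43 for a hypothetical 8-column schema
print(valid_tab_gan_batch_size(500, 1))       # True for the RF-Tab-GAN defaults
```

For example, leaving the transformer defaults in place already yields an effective batch size of 32, so raising gradient_accumulation_step is usually preferable to raising batch_size when memory is tight.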

If there is a specific model implementation you would like to see in Rockfish, please send a feature request to us at support@rockfish.ai.