# Rockfish Models
Rockfish understands the challenge users face in connecting their high-level intents to the actual runnable algorithms and configurations. Decisions such as when and how to preprocess or encode the dataset or individual columns, so that structure is preserved during generation, play a critical role in the quality of the generated data.

To dramatically lower this barrier to entry, and because naively trying every possible algorithm and configuration can be costly, Rockfish uses its proprietary Recommendation Engine, which analyzes the dataset along with the fidelity and privacy requirements of the use case to suggest an appropriate model and model parameters.
The Rockfish Platform supports several generative AI models to cover different dataset types:
- Rockfish DoppelGANger (RF-Time-GAN)
- Rockfish REaLTabFormer time-series (RF-Time-Transformer)
- Rockfish REaLTabFormer tabular (RF-Tab-Transformer)
- Rockfish CTGAN (RF-Tab-GAN)
## Guidelines for Manual Model Selection
For users who want to explore and pick a model themselves, here are some guidelines:

- **Time-series data:** RF-Time-GAN and RF-Time-Transformer are optimized for time-series data, which consists of metadata fields, a timestamp field, and measurement fields. Generally, RF-Time-GAN has a shorter training time than RF-Time-Transformer.
- **Tabular data:** RF-Tab-GAN and RF-Tab-Transformer are better suited for tabular data, i.e. a common 2-dimensional dataset. Generally, RF-Tab-GAN has a shorter training time than RF-Tab-Transformer.
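The selection guidelines above can be sketched as a small decision helper. This is a minimal illustration, not part of the Rockfish SDK; only the model names come from the list above, and the function name and flags are assumptions.

```python
# Hypothetical helper mirroring the manual-selection guidelines above.
# It is not a Rockfish API; only the model names are from the docs.

def suggest_model(has_timestamp_field: bool, prefer_short_training: bool) -> str:
    """Pick a model name following the guidelines: time-series vs tabular,
    then GAN (shorter training) vs Transformer."""
    if has_timestamp_field:
        # Time-series data: metadata fields + timestamp field + measurements.
        return "RF-Time-GAN" if prefer_short_training else "RF-Time-Transformer"
    # Tabular (2-dimensional) data.
    return "RF-Tab-GAN" if prefer_short_training else "RF-Tab-Transformer"

print(suggest_model(has_timestamp_field=True, prefer_short_training=True))
# RF-Time-GAN
```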
## Guidelines for Model Training Parameters
| Model | Hyperparameter | Description | Default Value | Guidelines |
|---|---|---|---|---|
| RF-Time-GAN | `batch_size` | Size of batches during model training and generation. | 100 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `sample_len` | Number of records generated at a time within each session. | 1 | Typically, `sample_len` is set to `avg_session_len / 50`. Increasing `sample_len` can reduce training time and memory utilization. |
| | `activate_normalization_per_sample` | When true, continuous fields are normalized per session. | True | Setting `activate_normalization_per_sample = True` improves fidelity (with a slight increase in training time and memory utilization), but can sometimes generate out-of-range values for continuous fields (which can be filtered out). |
| | `epoch` | Number of training iterations. | 400 | |
| RF-Time-Transformer | `batch_size` | Size of batches during model training and generation. | 8 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `train_kwargs.gradient_accumulation_step` | Number of batches over which gradients are accumulated for each update to the model weights. | 4 | Larger values increase the effective batch size to `batch_size * train_kwargs.gradient_accumulation_step` (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time. |
| | `train_kwargs.learning_rate` | Learning rate used by the optimizer. | | |
| | `train_kwargs.weight_decay` | Weight decay used by the optimizer. | | |
| | `transformer.gpt2_config.layer` | Number of hidden layers in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.head` | Number of attention heads for each attention layer in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.embed` | Dimensionality of the embeddings and hidden states (i.e. the session representation the model uses). | 768 | Should be a multiple of `gpt2_config.head`. The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `child.output_max_length` | Maximum number of tokens in each session (i.e. the encoded metadata and list of measurements). | 512 | Sessions longer than this value are ignored during training. If set to `None`, the model automatically computes the maximum possible length. Set to `None` if session lengths are homogeneous; otherwise shorter sessions will be padded to the longest session in the batch (which will increase training time). |
| | `epochs` | Number of training iterations. | 100 | |
| RF-Tab-Transformer | `batch_size` | Size of batches during model training and generation. | 8 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `train_kwargs.gradient_accumulation_step` | Number of batches over which gradients are accumulated for each update to the model weights. | 4 | Larger values increase the effective batch size to `batch_size * train_kwargs.gradient_accumulation_step` (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time. |
| | `train_kwargs.learning_rate` | Learning rate used by the optimizer. | | |
| | `train_kwargs.weight_decay` | Weight decay used by the optimizer. | | |
| | `transformer.gpt2_config.layer` | Number of hidden layers in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.head` | Number of attention heads for each attention layer in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.embed` | Dimensionality of the embeddings and hidden states (i.e. the row representation the model uses). | 768 | Should be a multiple of `gpt2_config.head`. The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `output_max_length` | Maximum number of tokens in each row. | 512 | Since the dataset is tabular, set this to `x * n_continuous_columns + n_categorical_columns`, where `x` ~ (max width of a number in the data). If set to `None`, the model automatically computes the maximum possible length. |
| | `epochs` | Number of training iterations. | 100 | |
| RF-Tab-GAN | `batch_size` | Size of batches during model training and generation. | 500 | This value must be even, and it must be divisible by the `pac` parameter. |
| | `pac` | Number of samples to group together when applying the discriminator. | 1 | |
| | `epochs` | Number of training iterations. | 10 | |
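The sizing rules in the table above are simple arithmetic, sketched here as a worked example. The field counts, session length, and number width below are made-up illustration values, not Rockfish defaults.

```python
# Worked examples for the parameter guidelines above.

# Effective batch size for the transformer models:
batch_size = 8
gradient_accumulation_step = 4
effective_batch_size = batch_size * gradient_accumulation_step
print(effective_batch_size)  # 32 (matches the documented default)

# RF-Time-GAN rule of thumb: sample_len ~ avg_session_len / 50.
avg_session_len = 500  # assumed average session length for illustration
sample_len = avg_session_len // 50
print(sample_len)  # 10

# RF-Tab-Transformer output_max_length heuristic:
# x * n_continuous_columns + n_categorical_columns,
# where x ~ max width of a number in the data.
n_continuous_columns = 5   # assumed
n_categorical_columns = 3  # assumed
x = 10                     # e.g. numbers up to 10 characters wide
output_max_length = x * n_continuous_columns + n_categorical_columns
print(output_max_length)  # 53
```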
If there is a specific model implementation you would like to see in Rockfish, please send us a feature request at support@rockfish.ai.