# Rockfish Models
Rockfish understands the challenge users face in connecting their high-level intents to the actual runnable algorithms and configurations. Decisions such as when and how to preprocess or encode the dataset or individual columns, so that structure is preserved during generation, play a critical role in the quality of the generated data.

To dramatically lower this barrier to entry, and because naively trying every possible algorithm and configuration can be costly, Rockfish uses its proprietary Recommendation Engine, which analyzes the dataset along with the fidelity and privacy requirements of the use case to suggest an appropriate model and model parameters.
The Rockfish Platform supports several generative AI models to cover different dataset types:
- Rockfish DoppelGANger (RF-Time-GAN)
- Rockfish REaLTabFormer time-series (RF-Time-Transformer)
- Rockfish REaLTabFormer tabular (RF-Tab-Transformer)
- Rockfish CTGAN (RF-Tab-GAN)
## Guidelines for Manual Model Selection
For users who want to explore and pick a model themselves, here are some guidelines:

- **Time-series data:** RF-Time-GAN and RF-Time-Transformer are optimized for time-series data, which consists of metadata fields, a timestamp field, and measurement fields. Generally, RF-Time-GAN has a shorter training time than RF-Time-Transformer.
- **Tabular data:** RF-Tab-GAN and RF-Tab-Transformer are better suited for tabular data, i.e. a common 2-dimensional dataset. Generally, RF-Tab-GAN has a shorter training time than RF-Tab-Transformer.
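The selection guidelines above can be sketched as a small decision helper. This is a minimal illustration, not part of the Rockfish SDK; only the model names come from the list above, and the function name and flags are assumptions.

```python
# Hypothetical helper mirroring the manual-selection guidelines above.
# It is not a Rockfish API; only the model names are from the docs.

def suggest_model(has_timestamp_field: bool, prefer_short_training: bool) -> str:
    """Pick a model name following the guidelines: time-series vs tabular,
    then GAN (shorter training) vs Transformer."""
    if has_timestamp_field:
        # Time-series data: metadata fields + timestamp field + measurements.
        return "RF-Time-GAN" if prefer_short_training else "RF-Time-Transformer"
    # Tabular (2-dimensional) data.
    return "RF-Tab-GAN" if prefer_short_training else "RF-Tab-Transformer"

print(suggest_model(has_timestamp_field=True, prefer_short_training=True))
# RF-Time-GAN
```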
## Guidelines for Model Training Parameters
| Model | Hyperparameter | Description | Default Value | Guidelines |
|---|---|---|---|---|
| RF-Time-GAN | `batch_size` | Size of batches during model training and generation. | 100 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `sample_len` | Number of records generated at a time within each session. | 1 | Typically, `sample_len` is set to `avg_session_len / 50`. Increasing `sample_len` can reduce training time and memory utilization. |
| | `activate_normalization_per_sample` | When true, continuous fields are normalized per session. | True | Setting `activate_normalization_per_sample = True` improves fidelity (with a slight increase in training time and memory utilization), but can sometimes generate out-of-range values for continuous fields (which can be filtered out). |
| | `epoch` | Number of training iterations. | 400 | |
| RF-Time-Transformer | `batch_size` | Size of batches during model training and generation. | 8 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `train_kwargs.gradient_accumulation_step` | Number of batches over which gradients are accumulated for each update to the model weights. | 4 | Larger values increase the effective batch size to `batch_size * train_kwargs.gradient_accumulation_step` (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time. |
| | `train_kwargs.learning_rate` | Learning rate used by the optimizer. | | |
| | `train_kwargs.weight_decay` | Weight decay used by the optimizer. | | |
| | `transformer.gpt2_config.layer` | Number of hidden layers in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.head` | Number of attention heads for each attention layer in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.embed` | Dimensionality of the embeddings and hidden states (i.e. the session representation the model uses). | 768 | Should be a multiple of `gpt2_config.head`. The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `child.output_max_length` | Maximum number of tokens in each session (i.e. the encoded metadata and list of measurements). | 512 | Sessions longer than this value are ignored during training. If set to `None`, the model automatically computes the maximum possible length. Set to `None` if session lengths are homogeneous; otherwise shorter sessions will be padded to the longest session in the batch (which will increase training time). |
| | `epochs` | Number of training iterations. | 100 | |
| RF-Tab-Transformer | `batch_size` | Size of batches during model training and generation. | 8 | Larger batch sizes can reduce training time and improve fidelity, but will increase memory utilization. |
| | `train_kwargs.gradient_accumulation_step` | Number of batches over which gradients are accumulated for each update to the model weights. | 4 | Larger values increase the effective batch size to `batch_size * train_kwargs.gradient_accumulation_step` (default = 32). This can improve fidelity while keeping memory utilization low, but will increase training time. |
| | `train_kwargs.learning_rate` | Learning rate used by the optimizer. | | |
| | `train_kwargs.weight_decay` | Weight decay used by the optimizer. | | |
| | `transformer.gpt2_config.layer` | Number of hidden layers in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.head` | Number of attention heads for each attention layer in the encoder. | 12 | The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `transformer.gpt2_config.embed` | Dimensionality of the embeddings and hidden states (i.e. the row representation the model uses). | 768 | Should be a multiple of `gpt2_config.head`. The default `gpt2_config` values should work for most cases, but you can tune them empirically for better fidelity. |
| | `output_max_length` | Maximum number of tokens in each row. | 512 | Since the dataset is tabular, set this to `x * n_continuous_columns + n_categorical_columns`, where `x` ~ (max width of a number in the data). If set to `None`, the model automatically computes the maximum possible length. |
| | `epochs` | Number of training iterations. | 100 | |
| RF-Tab-GAN | `batch_size` | Size of batches during model training and generation. | 500 | This value must be even, and it must be divisible by the `pac` parameter. |
| | `pac` | Number of samples to group together when applying the discriminator. | 1 | |
| | `epochs` | Number of training iterations. | 10 | |
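The sizing rules in the table above are simple arithmetic, sketched here as a worked example. The field counts, session length, and number width below are made-up illustration values, not Rockfish defaults.

```python
# Worked examples for the parameter guidelines above.

# Effective batch size for the transformer models:
batch_size = 8
gradient_accumulation_step = 4
effective_batch_size = batch_size * gradient_accumulation_step
print(effective_batch_size)  # 32 (matches the documented default)

# RF-Time-GAN rule of thumb: sample_len ~ avg_session_len / 50.
avg_session_len = 500  # assumed average session length for illustration
sample_len = avg_session_len // 50
print(sample_len)  # 10

# RF-Tab-Transformer output_max_length heuristic:
# x * n_continuous_columns + n_categorical_columns,
# where x ~ max width of a number in the data.
n_continuous_columns = 5   # assumed
n_categorical_columns = 3  # assumed
x = 10                     # e.g. numbers up to 10 characters wide
output_max_length = x * n_continuous_columns + n_categorical_columns
print(output_max_length)  # 53
```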
If there is a specific model implementation you would like to see in Rockfish, please send us a feature request at support@rockfish.ai.