GenSynth Documentation

Creating Datasets

Refer to the TensorFlow documentation for more information and the most recent recommendations. At this time, DarwinAI recommends:

  • Using the shard operator early in the dataset pipeline.

  • Applying the shard before any randomizing operator (such as shuffling).

  • Ordering the training dataset pipeline so that data is sharded, shuffled, augmented, and then batched.

  • Training, test, and validation datasets may prepare data differently. For example, you could augment only the training dataset, avoiding added randomness in the test and validation datasets. Note that avoiding randomness in the test dataset facilitates a more reproducible accuracy comparison.
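The recommended ordering above can be sketched in plain Python. This is an illustrative sketch only, not GenSynth or TensorFlow API: `make_pipeline` and its `augment` callback are hypothetical names, and each step stands in for the corresponding tf.data operation (shard, shuffle, map, batch).

```python
import random

def make_pipeline(data, *, num_shards, shard_index, batch_size,
                  shuffle=False, augment=None, seed=None):
    """Illustrative sketch of the recommended ordering:
    shard -> shuffle -> augment -> batch."""
    # 1. Shard early, before any randomizing operator, so every worker
    #    sees a disjoint, deterministic slice of the data.
    items = data[shard_index::num_shards]
    # 2. Shuffle (training only) after sharding.
    if shuffle:
        items = items[:]
        random.Random(seed).shuffle(items)
    # 3. Augment (training only).
    if augment is not None:
        items = [augment(x) for x in items]
    # 4. Batch last; the final batch may be smaller than batch_size.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Training pipeline: shuffled and augmented.
train_batches = make_pipeline(list(range(10)), num_shards=2, shard_index=0,
                              batch_size=3, shuffle=True,
                              augment=lambda x: x * 2, seed=0)

# Validation pipeline: deterministic (no shuffle, no augmentation).
val_batches = make_pipeline(list(range(10)), num_shards=2, shard_index=0,
                            batch_size=3)
```

In an actual tf.data pipeline the same ordering would be expressed by chaining `shard`, `shuffle`, `map`, and `batch` on a `tf.data.Dataset`.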


If you intend to run Root Cause Analysis, avoid shuffling your validation and test datasets; shuffling them can cause issues while running your job.

The GenSynth process expects batched data. If you wish to produce data one item at a time, you must still use a batch dimension of size 1.
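For example, a single item still needs a leading batch dimension of size 1. The NumPy sketch below illustrates the shape requirement; in a tf.data pipeline the equivalent is batching with a batch size of 1.

```python
import numpy as np

image = np.zeros((28, 28))          # a single, unbatched item
batched = np.expand_dims(image, 0)  # add a leading batch dimension of size 1
print(batched.shape)                # (1, 28, 28)
```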

You can repeat datasets (e.g., with TensorFlow's `repeat()` method), since the epoch size is determined from the number-of-data and batch-size values returned by each method. Specifically, these indicate the size of a training epoch; the number of batches executed during validation and testing will be number-of-data / batch-size.

Ideally, the dataset objects returned should correspond to the same training, validation, and test sets used to train your network. (Any alteration to the data split could bias the results.)