Dataset handling in the Onyx Engine is intentionally minimal: there are no special file types or elaborate schemas to learn. The goal is to let you drop in data and train models as quickly as possible.

Collecting Good Data

Onyx tools are used after you already have data, but since data quality is critical to model performance, we've included some tips for collecting good data.

What Data to Collect

The first rule of thumb: whatever dynamics you want to model should be captured in the data. More technically, the distribution of your dataset determines the distribution over which your AI models will be accurate. To get models that generalize well when simulating dynamics, data collection should cover as many of the system's frequencies and behaviors as possible. To excite the dynamics of a controlled system, we suggest mixing the control behaviors you anticipate during operation with sine waves, sine chirps, and random steps/pulses at varying frequencies and amplitudes. That said, if you already have data, try it first! We've made it easy to experiment with different datasets so you can iterate on your models rapidly.
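
As a rough sketch of what excitation signals might look like, here's a short example using NumPy and SciPy (not the Onyx SDK; all parameter values are illustrative and should be tuned to your system):

import numpy as np
from scipy.signal import chirp

# Illustrative parameters: tune to your own system
fs = 100          # sampling rate (Hz)
duration = 10.0   # seconds per excitation segment
t = np.linspace(0, duration, int(fs * duration), endpoint=False)

# Sine chirp sweeping 0.1 Hz -> 10 Hz to cover a band of frequencies
chirp_signal = chirp(t, f0=0.1, t1=duration, f1=10.0)

# Random steps: hold a random amplitude for fixed-length intervals
rng = np.random.default_rng(0)
steps = np.repeat(rng.uniform(-1, 1, size=20), int(len(t) / 20))

# Mix segments into one excitation sequence for your controller
excitation = np.concatenate([chirp_signal, steps])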

Amount of Data and Sampling Rate

There is no strict requirement for the amount of data you need, and more data is generally better, but here's a rough baseline:
  • 1 hour of time series data (which can be split across separate episodes/experiments), though 10-15 minutes may be enough for some applications.
  • A sampling rate of 10 Hz or higher. Higher is better, and you can always downsample later (see the sketch below).
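
As a minimal downsampling sketch with pandas (assuming, for illustration, 1 kHz data with a consistent time step; the file and column names are hypothetical):

import pandas as pd

# Downsample 1 kHz data to 100 Hz by keeping every 10th row
df = pd.read_csv('my_hardware_data.csv')
df_100hz = df.iloc[::10].reset_index(drop=True)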

Creating an Onyx Dataset

Datasets in the Python SDK are managed through the OnyxDataset class, which wraps a pandas DataFrame together with metadata about the dataset's features and time step.
import pandas as pd
from onyxengine import Onyx
from onyxengine.data import OnyxDataset

# Initialize the client
onyx = Onyx()

# Load your raw data
df = pd.read_csv('my_hardware_data.csv')

# Create an OnyxDataset
dataset = OnyxDataset(
    dataframe=df,
    features=['acceleration', 'velocity', 'position', 'control_input'],
    dt=0.01  # Time step in seconds
)

# Access the dataset's dataframe
print(dataset.dataframe.head())

Parameters

Parameter   Type           Description
dataframe   pd.DataFrame   Your time series data
features    List[str]      Column names to include in the dataset
dt          float          Time step between samples in seconds

Data Format Requirements

Our requirements for data formatting follow general best practices for AI modeling that are applicable to any training framework. Before training, your data will need to be a time series with:
  • Labeled columns: Each column/feature in your dataset should have a unique name
  • Numeric columns: All features must be numeric (Onyx converts to float32 for efficiency)
  • Consistent time step: The time delta between all rows should be the same (e.g., 0.001s)
  • No missing values: Drop rows with NaN values or use interpolation to fill them in (a cleanup sketch follows the example below)
Here is an example .csv file that could immediately be used for training:
time_s,acceleration,velocity,position,control_input
0.000,0.12,0.0,0.0,0.5
0.001,0.15,0.0012,0.0,0.5
0.002,0.18,0.0027,0.00001,0.5
0.003,0.14,0.0041,0.00003,0.5
0.004,0.10,0.0052,0.00005,0.5
0.005,0.08,0.0063,0.00007,0.5
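
If your raw logs don't meet these requirements yet, a quick pandas pass can usually fix them. Here's a minimal sketch (it assumes the time_s column from the example above; the interpolation limit is an illustrative choice):

import pandas as pd

df = pd.read_csv('my_hardware_data.csv')

# Verify the time step is consistent before training
dt = df['time_s'].diff().dropna()
assert dt.round(6).nunique() == 1, "Inconsistent time step between rows"

# Fill short gaps by interpolation, then drop anything left over
df = df.interpolate(limit=5).dropna().reset_index(drop=True)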

Saving Datasets

Save your dataset to the Engine for training:
# Create an OnyxDataset
dataset = OnyxDataset(
    dataframe=df,
    features=['acceleration', 'velocity', 'position', 'control_input'],
    dt=0.01  # Time step in seconds
)

# Save the dataset to the Engine
onyx.save_dataset(
    name='my_training_data',
    dataset=dataset,
    time_format='s'  # Time is in seconds
)
After the dataset is uploaded, navigate to the platform and click "Process for Training" in the Data tab of the dataset. After processing completes, the dataset will be marked as Active and ready for training.

Source Dataset Tracking

When creating derived datasets, you can specify source datasets to automatically trace data lineage:
# Load an existing OnyxDataset from the Engine
raw_dataset = onyx.load_dataset('raw_sensor_data')

# Process and create a training dataset
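# (process_data is a placeholder for your own preprocessing function)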
train_df = process_data(raw_dataset.dataframe)
train_dataset = OnyxDataset(
    dataframe=train_df,
    features=['acceleration', 'velocity', 'position', 'control'],
    dt=0.01
)

# Save with source dataset tracing
onyx.save_dataset(
    name='processed_training_data',
    dataset=train_dataset,
    source_datasets=[{'name': 'raw_sensor_data'}],
    time_format='s'
)

Time Format Options

Format       Description
's'          Seconds (default)
'ms'         Milliseconds
'us'         Microseconds
'ns'         Nanoseconds
'datetime'   Python datetime objects
'none'       No time column
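
For example, if your time column were recorded in milliseconds, you would pass 'ms' to the same save_dataset call shown above (the dataset name here is hypothetical):

# Save a dataset whose time column is in milliseconds
onyx.save_dataset(
    name='my_training_data_ms',
    dataset=dataset,
    time_format='ms'
)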

Loading Datasets

Load datasets from the Engine:
from onyxengine import Onyx

onyx = Onyx()

# Load the latest version
dataset = onyx.load_dataset('my_training_data')

# Access the data
print(dataset.dataframe.head())
print(f"Features: {dataset.config.features}")
print(f"Time step: {dataset.config.dt} seconds")

Loading Specific Versions

Each dataset save creates a new version. Load a specific version by ID:
# Load a specific version
dataset = onyx.load_dataset(
    'my_training_data',
    version_id='a05fb872-0a7d-4a68-b189-aeece143c7e4'
)

Local Caching

Datasets are cached locally after the first download. The SDK automatically:
  1. Checks if the local version matches the requested version
  2. Downloads only if the local cache is outdated
  3. Stores files in ~/.onyx/datasets/
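
If you want to see what's cached on your machine, you can inspect the directory directly. A minimal sketch (the file layout inside the cache is an SDK implementation detail, so treat this as read-only exploration):

from pathlib import Path

# List locally cached dataset files
cache_dir = Path.home() / '.onyx' / 'datasets'
if cache_dir.exists():
    for path in sorted(cache_dir.iterdir()):
        print(path.name)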

Next Steps

Training Models

Use your dataset to train a model.