Collecting Good Data
Onyx tools are used after you already have data, but since data is critical to model performance, we've included some tips for getting good data.

What Data to Collect
The first rule of thumb is: whatever dynamics you want to model, you should capture in the data. More technically, the distribution of your dataset determines the distribution over which your AI models will be accurate. To get models that generalize well when simulating dynamics, data collection should cover as much of the frequency content and behaviors of your dynamics as possible.

To excite the dynamics of controlled systems, we suggest mixing the kinds of control behaviors you anticipate seeing during operation with a combination of sine waves, sine chirps, and random steps/pulses at varying frequencies and amplitudes. With that said, if you already have data, try it out first! We've made it easy to experiment with different datasets so you can rapidly iterate on your models.

Amount of Data and Sampling Rate
There is no strict requirement for the amount of data you need, and more data is generally better, but here's a rough baseline:

- 1 hour of time series data (can be separate episodes/experiments); 10-15 minutes may be enough for some applications.
- 10 Hz+ sampling rate. Higher is better; you can always downsample later.
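To illustrate the excitation advice above, a signal mixing a sine chirp with random steps could be generated as below. This is a sketch using only NumPy; the amplitudes, frequencies, and durations are placeholders to adapt to your system.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, duration = 0.01, 10.0                  # 100 Hz sampling for 10 s
t = np.arange(0.0, duration, dt)

# Sine chirp: frequency sweeps linearly from 0.1 Hz to 5 Hz.
f0, f1 = 0.1, 5.0
chirp = np.sin(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * duration)))

# Random steps: hold a random amplitude for 1 s at a time.
steps = np.repeat(rng.uniform(-1.0, 1.0, size=int(duration)), int(1.0 / dt))

# Mix the two to cover both broadband and step-like behavior.
excitation = 0.5 * chirp + 0.5 * steps
```

Recording the system's response to signals like this alongside your normal operating data helps the model see both typical and edge-of-envelope behavior.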
Creating an Onyx Dataset
Datasets in the Python SDK are managed through the `OnyxDataset` class, which is simply a pandas DataFrame with additional metadata about the features and time step.
Parameters
| Parameter | Type | Description |
|---|---|---|
| `dataframe` | `pd.DataFrame` | Your time series data |
| `features` | `List[str]` | Column names to include in the dataset |
| `dt` | `float` | Time step between samples in seconds |
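Putting the parameters together, constructing a dataset might look like the sketch below. The column names are made up, and a stand-in dataclass mirrors the documented parameters so the snippet runs without the SDK installed; in practice you would import `OnyxDataset` from the Onyx Python SDK.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd

@dataclass
class OnyxDataset:
    """Stand-in mirroring the documented parameters;
    the real class comes from the Onyx Python SDK."""
    dataframe: pd.DataFrame
    features: List[str]
    dt: float

# Hypothetical time series sampled at 100 Hz.
df = pd.DataFrame({
    "position": [0.0, 0.1, 0.2],
    "velocity": [1.0, 0.9, 0.8],
})
dataset = OnyxDataset(dataframe=df, features=["position", "velocity"], dt=0.01)
```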
Data Format Requirements
Our requirements for data formatting follow general best practices for AI modeling and apply to any training framework. Before training, your data will need to be a time series with:

- Labeled columns: Each column/feature in your dataset should have a unique name
- Numeric columns: All features must be numeric (Onyx converts to `float32` for efficiency)
- Consistent time step: The time delta between all rows should be the same (e.g. `0.001` s)
- No missing values: You should drop rows with `NaN` values or use interpolation to fill them in
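The checks above can be sketched with plain pandas and NumPy. This is only an illustration of the formatting rules, not the SDK's own validation, and the `time` column name is an assumption about your data.

```python
import numpy as np
import pandas as pd

def prepare_timeseries(df: pd.DataFrame, dt: float) -> pd.DataFrame:
    """Apply the formatting rules: interpolate gaps, cast to float32,
    and verify the time step is consistent."""
    df = df.interpolate().dropna()   # fill NaNs; drop any left at the edges
    df = df.astype(np.float32)       # all features numeric
    steps = df["time"].diff().dropna()
    assert np.allclose(steps, dt), "inconsistent time step"
    return df.drop(columns="time")

raw = pd.DataFrame({
    "time": [0.0, 0.001, 0.002, 0.003],
    "position": [0.0, np.nan, 0.2, 0.3],  # one missing value to fill
})
clean = prepare_timeseries(raw, dt=0.001)
```

Whether to drop the time column (as here) or keep it depends on your pipeline; the important part is that the step between rows is uniform.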
For example, here is a `.csv` file that could immediately be used for training:
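A minimal sketch of such a file (column names and values are hypothetical): unique labeled columns, all numeric, a consistent 0.01 s time step, and no missing values.

```csv
time,position,velocity,torque
0.00,0.000,1.000,0.25
0.01,0.010,0.998,0.25
0.02,0.020,0.995,0.24
0.03,0.030,0.991,0.24
```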
Saving Datasets
Save your dataset to the Engine for training. After saving, select "Process for Training" in the Data tab of the dataset. After processing completes, the dataset will be marked as Active and ready for training.
Source Dataset Tracking
When creating derived datasets, you can specify source datasets to automatically trace data lineage.

Time Format Options
| Format | Description |
|---|---|
's' | Seconds (default) |
'ms' | Milliseconds |
'us' | Microseconds |
'ns' | Nanoseconds |
'datetime' | Python datetime objects |
'none' | No time column |
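If your data carries datetime timestamps rather than a fixed `dt` (the `'datetime'` format above), you can derive the uniform step in seconds yourself. A sketch using only pandas; the column names are made up.

```python
import pandas as pd

# Hypothetical log with a datetime timestamp column at 100 Hz.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=5, freq="10ms"),
    "position": [0.0, 0.1, 0.2, 0.3, 0.4],
})

# Derive the time step in seconds and confirm it is consistent.
deltas = df["timestamp"].diff().dropna().dt.total_seconds()
assert deltas.nunique() == 1, "time step must be consistent"
dt = deltas.iloc[0]
```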
Loading Datasets
Load datasets from the Engine.

Loading Specific Versions
Each dataset save creates a new version. Load a specific version by ID.

Local Caching
Datasets are cached locally after the first download. The SDK automatically:

- Checks if the local version matches the requested version
- Downloads only if the local cache is outdated
- Stores files in `~/.onyx/datasets/`
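The caching behavior can be sketched roughly as follows. This is an illustration of the version-checked cache described above, not the SDK's actual implementation; `download_fn` stands in for the network call, and the file layout under `~/.onyx/datasets/` is an assumption.

```python
from pathlib import Path

CACHE_DIR = Path.home() / ".onyx" / "datasets"

def load_cached(dataset_id: str, version: int, download_fn) -> bytes:
    """Return the bytes for (dataset_id, version), downloading only
    when that version is not already in the local cache."""
    path = CACHE_DIR / dataset_id / f"v{version}.csv"
    if not path.exists():  # cache miss or outdated version: fetch and store
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(download_fn(dataset_id, version))
    return path.read_bytes()
```

Because each version is stored under its own file, requesting a new version triggers a download while older versions stay cached.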
Next Steps
Training Models
Use your dataset to train a model