Sampling into datasets

FathomNet
Jun 9, 2025

Portal supports generating datasets in formats popular with the AI community, such as the YOLO formats for Detection and Classification. This post goes into more detail about those formats and outputs, and includes an end-to-end tutorial.

What makes up a dataset

In a world with lots of data, creating dataset samples is a useful endeavor for converting raw video into usable formats for development of machine vision models or AI exploitation algorithms. Portal supports generating datasets in formats popular with the AI community, such as the YOLO formats for Detection and Classification.

Detection

Detection datasets are made up of key frames, each either extracted from raw video or copied from a processed image. Alongside each frame is an annotation label file, which records each object's location in the frame as well as its label index.

The extracted frame and label file names tie back to the Portal database. The file name schema is `_.png` for frame extracts and `_.txt` for label files. This format is compatible with training YOLOv5 or YOLOv8 detection networks, and it can also be imported into [FiftyOne](https://voxel51.com/fiftyone/).
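
To make the label files concrete, here is a minimal sketch of reading one. It assumes Portal's `.txt` files follow the usual YOLO detection convention: one line per object, holding the label index followed by the box center x, center y, width, and height, all normalized to the image size. The names `YoloBox` and `parse_yolo_labels` are hypothetical helpers for illustration, not part of Portal.

```python
from dataclasses import dataclass


@dataclass
class YoloBox:
    label_index: int
    x_center: float  # normalized by image width (assumed convention)
    y_center: float  # normalized by image height (assumed convention)
    width: float
    height: float

    def to_pixels(self, image_width, image_height):
        """Return (left, top, right, bottom) in pixel coordinates."""
        half_w = self.width * image_width / 2
        half_h = self.height * image_height / 2
        cx = self.x_center * image_width
        cy = self.y_center * image_height
        return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def parse_yolo_labels(text):
    """Parse the contents of one label file into a list of YoloBox."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
        boxes.append(YoloBox(int(parts[0]), *map(float, parts[1:])))
    return boxes
```

Calling `parse_yolo_labels(open(label_path).read())` and then `to_pixels(...)` with the frame's dimensions gives boxes that can be drawn over the matching `.png` for a quick visual check.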

Classification

Classification datasets differ in that they contain only images, with no separate label files. Instead, a folder structure forms a tree of images in which each folder is named for a label and contains the regions of interest extracted from frames for that label.

In the extraction area, the on-disk layout is chosen to be compatible with tools for training and visualizing classification datasets:

Label_A

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_A.png

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_A.png

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_A.png

|- ....

Label_B

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_B.png

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_B.png

|- ....

...

Label_N

|- <MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_Label_N.png

|- ....
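
The layout above can be indexed with a few lines of Python. This is a sketch under stated assumptions: the four ID fields contain no underscores (the label itself may, as in `Label_A`), and the field values are kept as strings because their types are not specified here. `parse_crop_name` and `index_dataset` are hypothetical helper names.

```python
from pathlib import Path


def parse_crop_name(path):
    """Split '<MEDIA_ID>_<FRAME_NUM>_<OBSERVATION_ID>_<BOX_ID>_<LABEL>.png' into fields."""
    stem = Path(path).stem
    # Split on the first four underscores only, so the label may contain underscores.
    parts = stem.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {path}")
    media_id, frame_num, observation_id, box_id, label = parts
    return {
        "media_id": media_id,
        "frame_num": frame_num,
        "observation_id": observation_id,
        "box_id": box_id,
        "label": label,
    }


def index_dataset(root):
    """Return sorted (image_path, label) pairs, taking the label from the folder name."""
    root = Path(root)
    return sorted((str(png), png.parent.name) for png in root.glob("*/*.png"))
```

Comparing the folder name from `index_dataset` with the `label` field from `parse_crop_name` is a cheap consistency check before training.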

How a dataset is sampled

Methods for generating a dataset range from simple temporal sampling to complex algorithm-assisted methods. From a process perspective, the first step after generating a dataset is to review it for accuracy before using it to train new models. Tools such as [FiftyOne](https://voxel51.com/fiftyone/) can be used to view and correct datasets locally.
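
As an illustration of the simplest case, temporal sampling can be read as picking frames at a fixed time interval. The sketch below is one possible interpretation, not a description of how Portal samples; the function name and its parameters are hypothetical.

```python
def temporal_sample(total_frames, fps, interval_seconds):
    """Return frame indices spaced roughly interval_seconds apart."""
    if total_frames < 0 or fps <= 0 or interval_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_seconds must be > 0")
    # Convert the time interval to a whole number of frames, never less than 1.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))
```

For a 30 fps video, `temporal_sample(100, 30, 1)` selects one frame per second of footage. Algorithm-assisted methods would replace the fixed step with a selection rule driven by the content of the video.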