Hyperparameter Tuning a YOLOv8 Model with Amazon SageMaker

Object detection is a computer vision task that involves identifying and locating objects in images and videos. YOLO (You Only Look Once) is a state-of-the-art object detection model that is widely used within the computer vision field. It uses a Convolutional Neural Network (CNN) that takes an image and predicts bounding boxes around objects along with their corresponding class labels.

YOLOv8 is the newest of the series of YOLO models and will be used throughout this blog.

Hyperparameter tuning is an essential part of training any machine learning model. Hyperparameters are settings, such as the learning rate, that are fixed before training starts and influence the learning process. To produce the best possible predictions from a model, we must find a good set of hyperparameters.

In this blog, we will describe how to tune a custom YOLOv8 model using Amazon SageMaker’s resources to find the optimal hyperparameter configuration.

For the purposes of this blog, we will assume the following:

- We have a set of training images and labels saved in an S3 bucket
- We have a train.py file that contains the YOLO model
- We have a .yaml file that contains the training and validation directories and the class labels

To run a hyperparameter tuning job, we need to set up an Estimator. Before we can do this, we must import all the necessary libraries and define the metrics that SageMaker should read from the training logs. An example is shown below and more details on each input can be found here.

# importing the libraries

import sagemaker

from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner, HyperbandStrategyConfig, StrategyConfig

sagemaker_session = sagemaker.Session()
role = get_execution_role()

# setting the metric definitions for the YOLO model

# raw strings keep the escaped parentheses intact for the regex engine
metric_definitions = [
    {
        "Name": "precision",
        "Regex": r"YOLO Metric metrics/precision\(B\): (.*)"
    },
    {
        "Name": "recall",
        "Regex": r"YOLO Metric metrics/recall\(B\): (.*)"
    },
    {
        "Name": "mAP50",
        "Regex": r"YOLO Metric metrics/mAP50\(B\): (.*)"
    },
    {
        "Name": "mAP50-95",
        "Regex": r"YOLO Metric metrics/mAP50-95\(B\): (.*)"
    },
    {
        "Name": "box_loss",
        "Regex": r"YOLO Metric val/box_loss: (.*)"
    },
    {
        "Name": "cls_loss",
        "Regex": r"YOLO Metric val/cls_loss: (.*)"
    },
    {
        "Name": "dfl_loss",
        "Regex": r"YOLO Metric val/dfl_loss: (.*)"
    }
]

estimator = PyTorch(
    entry_point="train.py",
    role=role,
    image_uri='your/image',  # your image
    source_dir="./src",
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    framework_version="1.12.1",
    py_version="py38",
    sagemaker_session=sagemaker_session,
    hyperparameters={},
    use_spot_instances=True,
    input_mode='File',  # FastFile causes an issue with writing the label cache
    debugger_hook_config=False,
    max_wait=360000 + 3600,  # seconds; with spot instances this must be >= max_run
    max_run=360000,  # seconds (100 hours)
    output_path='path/to/output',
    enable_sagemaker_metrics=True,
    metric_definitions=metric_definitions,
)

The estimator defined above takes your train.py (source_dir must be the directory where this file is saved), sets the instance type, uses spot instances and has a max_run time of 100 hours (360,000 seconds). This means that after 100 hours Amazon SageMaker terminates the job regardless of how far it has progressed.
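Both max_run and max_wait are given in seconds. As a quick sanity check of the values used in the estimator, the conversion can be sketched as follows. The helper name hours_to_seconds is hypothetical and not part of the SageMaker SDK.

```python
def hours_to_seconds(hours: float) -> int:
    """Convert hours to whole seconds (hypothetical helper)."""
    return int(hours * 3600)


max_run = hours_to_seconds(100)            # training time limit
max_wait = max_run + hours_to_seconds(1)   # one hour of headroom for spot capacity

# With spot instances, max_wait must not be smaller than max_run.
assert max_wait >= max_run
print(max_run, max_wait)
```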

Any hyperparameters you want to keep constant across all training jobs can also be set here, through the estimator's hyperparameters argument. Again, more details on these can be found here.

The train.py file should include code similar to the following, with the hyperparameters that you want to tune added to the parser:
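SageMaker reads each metric by applying its Regex to the training job's log output and taking the captured group as the value. Before launching a job, it can be worth checking the patterns locally. The sketch below does this for two of the patterns. The sample log lines and their numbers are invented for illustration, and extract_metrics is a hypothetical helper, not a SageMaker function.

```python
import re

# Two of the patterns from metric_definitions, written as raw strings.
metric_definitions = [
    {"Name": "mAP50-95", "Regex": r"YOLO Metric metrics/mAP50-95\(B\): (.*)"},
    {"Name": "box_loss", "Regex": r"YOLO Metric val/box_loss: (.*)"},
]

# Hypothetical log lines; the numbers are made up.
sample_log = "\n".join([
    "YOLO Metric metrics/mAP50-95(B): 0.612",
    "YOLO Metric val/box_loss: 1.0342",
])


def extract_metrics(log_text, definitions):
    """Return {metric name: float value} for every pattern that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_text)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


extracted = extract_metrics(sample_log, metric_definitions)
print(extracted)
```

If a metric is missing from the result, the pattern and the log format disagree, and the tuner would have no objective value to work with.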

# train.py

import argparse
import sys
import os
import shutil

from ultralytics import YOLO

parser = argparse.ArgumentParser()
# SageMaker passes hyperparameters as command-line strings, so convert them here
parser.add_argument('--epochs', type=int, help='number of training epochs')
parser.add_argument('--optimizer', type=str, help='optimizer to use')
parser.add_argument('--lr0', type=float, help='initial learning rate')
parser.add_argument('--lrf', type=float, help='final learning rate')
parser.add_argument('--momentum', type=float, help='momentum')
parser.add_argument('--weight_decay', type=float, help='optimizer weight decay')
args = parser.parse_args()

print('---------------Debug injected environment and arguments--------------------')
print(sys.argv)
print(os.environ)
print('---------------End debug----------------------')

model = YOLO("yolov8n.yaml")

model.train(data='./blaa.yaml', 
            epochs=int(args.epochs), 
            batch=64, 
            optimizer=args.optimizer, 
            lr0=float(args.lr0), 
            lrf=float(args.lrf), 
            momentum=float(args.momentum),
            weight_decay=float(args.weight_decay)
           )

model.export()
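The regular expressions in metric_definitions only match if the training log contains lines that begin with "YOLO Metric". The train.py shown here does not include that step, so the following is one possible way to produce such lines, not necessarily the author's. It assumes the Ultralytics callback mechanism (model.add_callback with the "on_fit_epoch_end" event) and that the trainer exposes its latest results as a dict in trainer.metrics. The function names are hypothetical.

```python
from types import SimpleNamespace


def format_metric_lines(metrics):
    """Turn a metrics dict into 'YOLO Metric <key>: <value>' log lines."""
    return [f"YOLO Metric {key}: {value}" for key, value in metrics.items()]


def log_yolo_metrics(trainer):
    """Callback sketch: print the trainer's current metrics, one per line."""
    for line in format_metric_lines(trainer.metrics):
        print(line, flush=True)


# In train.py this would be registered before model.train(), assuming the
# Ultralytics callback API:
#     model.add_callback("on_fit_epoch_end", log_yolo_metrics)

# Stand-in trainer with invented values, so the sketch runs without a GPU.
fake_trainer = SimpleNamespace(
    metrics={"metrics/mAP50-95(B)": 0.612, "val/box_loss": 1.0342}
)
log_yolo_metrics(fake_trainer)
```

Printing once per epoch also gives the Hyperband strategy intermediate values to compare while jobs are still running.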

As mentioned at the start, we need a .yaml file to run the YOLOv8 model. This should contain the following details:

# .yaml file

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path:  /opt/ml/input/your/s3/bucket  # dataset root dir
train: images/train  # train images (relative to 'path')
val: images/val  # val images (relative to 'path'); should differ from the train images
test:  # test/images # test images (optional)

# Classes
names:
  0: 'label1'
  1: 'label2'

Now, we need to define the ranges of the hyperparameters we want to tune.

This is shown below, where each hyperparameter is an IntegerParameter, a CategoricalParameter or a ContinuousParameter.

hyperparameter_ranges = {
    'epochs': IntegerParameter(100, 300),
    'optimizer': CategoricalParameter(['SGD', 'Adam', 'AdamW', 'RMSProp']),
    'lr0': ContinuousParameter(0.00001, 0.01),
    'lrf': ContinuousParameter(0.00001, 0.01),
    'momentum': ContinuousParameter(0.9, 0.9999),
    'weight_decay': ContinuousParameter(0.0003, 0.00099)
}
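The lr0 and lrf ranges above span three orders of magnitude. With linear scaling, very little of the search range covers the smallest learning rates. The sketch below is plain arithmetic that makes no SageMaker call, and its helper names are hypothetical. It compares how much of the range lies below 0.0001 under linear and logarithmic scaling.

```python
import math

LOW, HIGH = 0.00001, 0.01  # the lr0 range used above


def linear_fraction_below(x, low=LOW, high=HIGH):
    """Share of the range below x when values are spread linearly."""
    return (x - low) / (high - low)


def log_fraction_below(x, low=LOW, high=HIGH):
    """Share of the range below x when values are spread on a log10 scale."""
    span = math.log10(high) - math.log10(low)
    return (math.log10(x) - math.log10(low)) / span


print(f"linear: {linear_fraction_below(1e-4):.4f}")
print(f"log:    {log_fraction_below(1e-4):.4f}")
```

Under linear scaling, under 1% of the range lies below 0.0001, compared with about a third under logarithmic scaling. SageMaker's parameter classes accept a scaling_type argument, for example ContinuousParameter(0.00001, 0.01, scaling_type="Logarithmic"). Whether that helps here depends on the model and the chosen strategy, so treat it as an option to test, not a recommendation.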

To create a tuner, we use HyperparameterTuner, which takes the following inputs:

- Our estimator
- The objective metric and the metric definitions (set above)
  - here we have chosen to maximise the mean average precision (mAP50-95)
- Hyperparameter ranges
- Strategy
  - we have set the strategy to Hyperband. More details on these options here

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="mAP50-95",
    objective_type="Maximize",  # the default, stated explicitly for clarity
    metric_definitions=metric_definitions,
    hyperparameter_ranges=hyperparameter_ranges,
    strategy='Hyperband',
    max_jobs=50,
    strategy_config=StrategyConfig(
        hyperband_strategy_config=HyperbandStrategyConfig(max_resource=10, min_resource=1)
    ),
)

Finally, we want to fit the tuner by passing in the S3 paths to the training data