COCO with YOLO
- Complexity: MEDIUM
- Computational requirement: HIGH
In this tutorial, we will walk through the configuration of a Deeplodocus project for object detection on the COCO dataset. We will use the Deeplodocus implementations of YOLOv3 and its loss function, so no Python coding is required. However, there is plenty of scope for extending this project with your own custom-built modules.
The primary objectives of this tutorial are to demonstrate how to:
- configure the CocoDetection dataset from torchvision,
- configure YOLOv3 and its loss function, and
- process the input and label data, and YOLO outputs.
- Initialise a New Project
- Data Configuration
- Model Configuration
- Loss & Metric Configuration
- Optimiser Configuration
- Transformer Configuration
We recommend following each step to create this project from scratch. However, you can clone a copy of this project from here if you'd prefer to jump ahead - but don't forget to follow the prerequisite steps.
1. Install pycocotools
To begin, we need to install pycocotools, on which the CocoDetection torchvision module depends.
pycocotools requires Cython, so we'll install that first, with:
pip3 install Cython
Then we can install pycocotools itself with:
pip3 install pycocotools
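If you want to confirm the installation before moving on, a quick check from a Python shell might look like this (optional, and not part of the Deeplodocus workflow itself):

```python
# Optional sanity check that the dependencies import correctly
from pycocotools.coco import COCO   # Provided by pycocotools
import torchvision                  # CocoDetection lives in torchvision.datasets

print(torchvision.__version__)
```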
2. Download the COCO Detection Dataset
Secondly, let's download the appropriate data from the COCO website.
Specifically, we need the following items:
- 2017 Train images download [18GB]
- 2017 Val images download [1GB]
- 2017 Train/Val annotations download [241MB]
Once you have initialised your Deeplodocus project, extract each of these into its data folder.
1. Initialise the Project
Initialise a new Deeplodocus project in your terminal with:
deeplodocus new-project COCO-with-YOLO
After initialising your project, extract the COCO dataset into the data directory of the empty project. The resultant structure should look like this:
```
data
├─ annotations
│  ├─ instances_train2017.json
│  └─ instances_val2017.json
├─ train2017
│  ├─ 000000000???.jpg
│  ├─ 000000000???.jpg
│  └─ ...
└─ val2017
   ├─ 000000000???.jpg
   ├─ 000000000???.jpg
   └─ ...
```
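If you'd like to verify the extraction before configuring anything, a short pycocotools check (optional, using the paths assumed above) could look like this:

```python
# Optional: confirm the annotations load and the image files are where we expect them
from pathlib import Path
from pycocotools.coco import COCO

coco = COCO("data/annotations/instances_train2017.json")
image_ids = coco.getImgIds()
print(f"{len(image_ids)} images referenced in the training annotations")

# Spot-check that the first few referenced files exist under data/train2017
for img in coco.loadImgs(image_ids[:5]):
    path = Path("data/train2017") / img["file_name"]
    print(path, "OK" if path.exists() else "MISSING")
```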
2. Data Configuration
Open up the config/data.yaml file, and let's get started.
2.1. Data loader
At the top of the file you'll see the dataloader entry; use this to set the batch size and the number of workers. If you have limited GPU memory, you may need to reduce your batch size.
```yaml
dataloader:
  batch_size: 12    # Possible batch sizes will depend on the available memory
  num_workers: 8    # This will depend on your CPU, you probably have at least 4 cores
```
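For intuition, these two values are the same knobs you would pass to a plain PyTorch DataLoader; Deeplodocus builds the loader for you from this file, so the snippet below (with a stand-in dataset) is purely illustrative:

```python
# Illustrative only: how batch_size and num_workers map onto torch.utils.data.DataLoader
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy = TensorDataset(torch.zeros(100, 3, 448, 448))     # Stand-in dataset with 100 instances
loader = DataLoader(dummy, batch_size=12, num_workers=8)
print(len(loader))                                        # ceil(100 / 12) = 9 batches
```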
2.2. Enable the required pipelines
Next, use the enabled entry to enable different types of pipeline. As we only have training and validation data in this case, we need to enable just the trainer and validator as follows:
```yaml
enabled:
  train: True          # Enable the trainer
  validation: True     # Enable the validator
  test: False          # There is no test data
  predict: False       # There is no prediction data
```
2.3. Configure the dataset
Finally, we arrive at the datasets entry. We define this with a list of two items, which configure the training and validation portions of the dataset respectively. We'll start with the first item - training data - which is shown below.
Within the training portion we define two entries, one for the input and one for the label, and for each entry we define a single data source.
```yaml
datasets:
  # Training portion
  - name: COCO Train 2017        # Human-readable name
    type: train                  # Dataset type (train/validation/test/predict)
    num_instances: Null          # Number of instances to use (Null = use all)
    entries:
      # Input Entry
      - name: COCO Image         # Human-readable name
        type: input              # Entry type (input/label/additional data)
        load_as: image           # Load data as image
        convert_to: float32      # Convert to float32
        move_axis: [2, 0, 1]     # Permute : (h x w x ch) to (ch x h x w)
        enable_cache: True       # Give other entries access to this entry
        # We define one source for this entry - CocoDetection from torchvision.datasets
        sources:
          - name: CocoDetection
            module: torchvision.datasets
            kwargs:
              root: data/train2017                                 # Path to training image directory
              annFile: data/annotations/instances_train2017.json   # Training annotations
      # Label Entry
      - name: COCO Label         # Human-readable name
        type: label              # Entry type (input/label/additional data)
        load_as: given           # Do not use any additional methods on loading
        convert_to: float32      # Convert to float32 (after data transforms)
        move_axis: Null          # No need for move axis
        enable_cache: False      # No other entries need access to this data
        # Define one source for this entry - point to data from the input entry
        sources:
          - name: SourcePointer  # Point to an existing data source
            module: Null         # Import from default modules
            kwargs:
              entry_id: 0        # Take data from the first entry (defined above)
              source_id: 0       # Take from the first (and only) source
              instance_id: 1     # Take the second item - the label
```
Why are we using a SourcePointer?
When using torchvision datasets, the input and label entries are loaded together in a single iterable. This does not change how we configure the input source. However, for the label source, we use a SourcePointer to reference the second item from the first (and only) source of the first (input) entry.
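For context, this is roughly what a single CocoDetection instance looks like when inspected directly (illustration only; Deeplodocus handles the loading for you):

```python
# Each CocoDetection instance is an (image, annotations) pair, so instance_id 1 is the label
from torchvision.datasets import CocoDetection

dataset = CocoDetection(
    root="data/train2017",
    annFile="data/annotations/instances_train2017.json",
)
image, target = dataset[0]
print(type(image))                    # A PIL image
print(len(target))                    # A list of annotation dicts, one per object
if target:
    print(sorted(target[0].keys()))   # Includes 'bbox' and 'category_id', among others
```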
Now, we can include the validation configurations, which will look very similar. There are only 4 differences:
- dataset name
- dataset type
- input entry source root
- input entry source annFile
Validation dataset configurations:
```yaml
  # Validation portion
  - name: COCO Val 2017          # Human-readable name
    type: validation             # Dataset type (train/validation/test/predict)
    num_instances: Null          # Number of instances to use (Null = use all)
    entries:
      # Input
      - name: COCO Image         # Human-readable name
        type: input              # Entry type (input/label/additional data)
        load_as: image           # Load data as image
        convert_to: float32
        move_axis: [2, 0, 1]     # Permute : (h x w x ch) to (ch x h x w)
        enable_cache: True       # Give other entries access to this entry
        # We define one source for this entry - CocoDetection from torchvision.datasets
        sources:
          - name: CocoDetection
            module: torchvision.datasets
            kwargs:
              root: data/val2017                                 # Path to val image directory
              annFile: data/annotations/instances_val2017.json   # Validation annotations
      # Label
      - name: COCO Label         # Human-readable name
        type: label              # Entry type (input/label/additional data)
        load_as: given           # Do not use any additional methods on loading
        convert_to: float32      # Convert to float32
        move_axis: Null          # No need for move axis
        enable_cache: False      # No other entries need access to this data
        sources:
          - name: SourcePointer  # Point to an existing data source
            module: Null         # Import from default modules
            kwargs:
              entry_id: 0        # Take data from the first entry (defined above)
              source_id: 0       # Take from the first (and only) source
              instance_id: 1     # Take the second item - the label
```
3. Model Configuration
We'll use the Deeplodocus implementation of the YOLOv3 architecture for this project, which can be used with different backbone feature extractors.
Open and edit the config/model.yaml as follows to specify the object detector.
```yaml
name: YOLO                                    # Select YOLO
module: deeplodocus.app.models.yolo           # From the deeplodocus app
from_file: False                              # Don't try to load from file
file: Null                                    # No need to specify a file to load from
input_size:                                   # Specify the input size
  - [3, 448, 448]
kwargs:                                       # Keyword arguments for the model class
  num_classes: 91                             # Number of classes in COCO
  backbone:                                   # Specify the backbone
    name: Darknet53                           # Select Darknet53 (Darknet19 is also available)
    module: deeplodocus.app.models.darknet    # Point to the darknet module
    kwargs:                                   # Keyword arguments for the backbone
      num_channels: 3                         # Tell it to expect an input with 3 channels
      include_top: False                      # Don't include the classifier
```
That's it! YOLO is configured and ready to go.
You should now be able to load the model and print a summary.
Want to know more about YOLO?
For an in-depth understanding of the network architecture, we strongly recommend reading the YOLO papers.
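As a rough intuition for what the 448 x 448 input configured above implies: YOLOv3 predicts at three scales, so assuming the usual strides of 32, 16 and 8 and three anchors per grid cell (details of the Deeplodocus implementation may differ), the prediction counts work out as follows:

```python
# Back-of-the-envelope prediction counts for a 448 x 448 input (assumed strides and anchors)
input_size = 448
for stride in (32, 16, 8):
    cells = input_size // stride
    print(f"stride {stride}: {cells} x {cells} grid -> {cells * cells * 3} boxes (3 anchors per cell)")
```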
4. Loss & Metric Configuration
To train our YOLO object detector, we need to configure a loss function.
We can specify Deeplodocus' implementation of the YOLO loss function by editing the config/losses.yaml file as follows:
```yaml
YOLOLoss:
  name: YOLOLoss                        # Name of the loss object
  module: deeplodocus.app.losses.yolo   # Import from deeplodocus app
  weight: 1                             # Multiplier for loss function
  kwargs:
    iou_threshold: 0.5
    # Weights applied to cells that do not contain an object and cells that do contain an object respectively
    obj_weight: [0.5, 1]
    # Multiplier applied to loss from coordinate predictions
    box_weight: 5
    # Options: Null (no weights), auto, list of weight values (w0, w1, ..., wn)
    # Auto: total / frequency * num_classes
    class_weight: Null
    # Sets a minimum class weight (may be useful when classes are very imbalanced)
    min_class_weight: Null
```
We have done our best to implement this loss function as described in the literature; the source code is published here.
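The iou_threshold above is an intersection-over-union threshold. If you want a feel for the quantity involved, here is a minimal, framework-agnostic IoU sketch (not the Deeplodocus implementation):

```python
# Minimal IoU between two boxes in (x1, y1, x2, y2) format - for intuition only
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])     # Top-left of the intersection
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])     # Bottom-right of the intersection
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 25 / 175 = 0.142857...
```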
Currently, Deeplodocus does not include any of the traditional metrics for evaluating object detection. Unless you wish to include your own metrics, make sure that the config/metrics.yaml file is empty.
5. Optimiser Configuration
Have a look at the optimiser configurations specified in config/optimizer.yaml. By default we have specified the Adam optimiser from torch.optim. The learning rate is specified by lr, and additional parameters can also be given.
name: "Adam" module: "torch.optim" kwargs: lr: 0.0001 betas: [0.9, 0.999] weight_decay: 0
Make sure the learning rate is not too high, otherwise training can become unstable. If you are able to use a pre-trained backbone, a learning rate of 1e-3 should be just fine. However, if you are training from scratch - like in this tutorial - a lower learning rate will be necessary in the beginning.
6. Transformer Configuration
The final step is the configuration of two data transformers:
- An input transformer to pre-process images and labels before they are given to the network.
- An output transformer for post-processing and visualisation.
Edit the config/transform.yaml file as follows:
```yaml
train:
  name: Train Transform Manager
  inputs:
    - config/transformers/input.yaml    # Path to input transformer
  labels:
    - '*inputs:0'                       # Point to the first input transformer
  additional_data: Null
  outputs:
    - config/transformers/output.yaml   # Path to output transformer
validation:
  name: Validation Transform Manager
  inputs:
    - config/transformers/input.yaml    # Path to input transformer
  labels:
    - '*inputs:0'                       # Point to the first input transformer
  additional_data: Null
  outputs:
    - config/transformers/output.yaml   # Path to output transformer
test:
  name: Test Transform Manager
  inputs: Null
  labels: Null
  additional_data: Null
  outputs: Null
predict:
  name: Predict Transform Manager
  inputs: Null
  additional_data: Null
  outputs: Null
```
Why does the label transformer point to the input transformer?
COCO images are different sizes, therefore each must be resized before being concatenated into a batch. To keep the bounding box labels consistent, we need to normalise them by the width and height of their associated image before it is resized. Therefore, each label transformer points to the input transformer, so that each label transform depends on the transform applied to its corresponding image.
6.1. Input Transformer
We now need to set up a transformer that defines the sequence of functions to apply to the inputs and labels. Open the config/transformers/input.yaml file and edit it as follows:
```yaml
method: sequential
name: Transformer for COCO input
mandatory_transforms_start:
  - format_labels:
      name: reformat_pointer
      module: deeplodocus.app.transforms.yolo.input
      kwargs:
        n_obj: 100          # Maximum number of objects per image
  - resize:
      name: resize
      module: deeplodocus.app.transforms.yolo.input
      kwargs:
        shape: [448, 448]   # Resize to match the model input size
transforms: Null
mandatory_transforms_end: Null
```
These two transforms constitute the two stages of the transformer pipeline described below:
In the first stage, the label is formatted into an array of size (n_obj x 5) and the box coordinates are normalised by the corresponding image shape (a rough sketch follows the list below).
- An input (image) is given to the reformat_pointer function, which returns:
- the image (unchanged) and,
- a TransformData object that stores the shape of the given image and another transform function, reformat.
- As the label transformer points to the input transformer, the label will be passed to the function specified by this TransformData object, which:
- formats the label into a numpy array and,
- normalises the box coordinates w.r.t. the given image shape.
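A rough sketch of this first stage is shown below. The layout (one [x, y, w, h, class] row per object, zero-padded to n_obj rows) and the helper itself are assumptions for illustration, not the Deeplodocus code:

```python
# Hypothetical sketch of stage 1: format COCO annotations into an (n_obj x 5) array and
# normalise the boxes by the original image size
import numpy as np

def format_labels(annotations, image_height, image_width, n_obj=100):
    labels = np.zeros((n_obj, 5), dtype=np.float32)
    for i, ann in enumerate(annotations[:n_obj]):
        x, y, w, h = ann["bbox"]                  # COCO boxes are [x, y, w, h] in pixels
        labels[i, :4] = [x / image_width, y / image_height,
                         w / image_width, h / image_height]
        labels[i, 4] = ann["category_id"]
    return labels
```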
In the second stage, the input image is resized to the given shape and the box coordinates of the corresponding label are scaled up accordingly (again, a sketch follows the list below).
- The image is inputted to the resize function, which returns:
- the image, resized to (448 x 448) and,
- a TransformData object that points to a scale transform function.
- The label is given to the scale transform.
- This scales up the label box coordinates by the new shape of the image.
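And a matching sketch of the second stage, under the same assumed label layout (again a hypothetical helper, not the Deeplodocus transform):

```python
# Hypothetical sketch of stage 2: resize the image and scale the normalised boxes back up
import cv2
import numpy as np

def resize_and_scale(image, labels, shape=(448, 448)):
    resized = cv2.resize(image, shape)        # Resize the image to the target shape
    labels = np.array(labels, dtype=np.float32)
    labels[:, [0, 2]] *= shape[0]             # Scale x and w to the new width
    labels[:, [1, 3]] *= shape[1]             # Scale y and h to the new height
    return resized, labels
```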
6.2. Output Transformer
To visualise the outputs of our YOLO model during training or validation, we can apply some post-processing transforms. To do this, we need to initialise an output transformer configuration file.
Navigate to the config/transformers directory and use the command:
deeplodocus output-transformer output.yaml
This will create a new configuration file that you can open and edit to look like this:
```yaml
# Define skip - for use in multiple transforms
# A skip of 25 will cause the transforms to only process every 25th batch
skip: &skip 25

# Sometimes there can be lots of false detections at the beginning of training
# This can slow things down
# Use initial skip to skip the first few batches of the first epoch
initial_skip: &initial_skip 0

name: Output Transformer
transforms:
  Activate:
    name: Activate
    module: deeplodocus.app.transforms.yolo.output
    kwargs:
      skip: *skip
      initial_skip: *initial_skip
  NonMaximumSuppression:
    name: NMS
    module: deeplodocus.app.transforms.yolo.output
    kwargs:
      iou_threshold: 0.5        # IoU threshold for NMS
      obj_threshold: 0.5        # Threshold for suppression by objectness score
      skip: *skip
      initial_skip: *initial_skip
  Visualization:
    name: Visualize
    module: deeplodocus.app.transforms.yolo.output
    kwargs:
      obj_threshold: 0.5        # Objectness threshold
      key: data/key.txt         # Key of object class names
      rows: Null                # No. of rows when displaying images (Null = auto)
      cols: Null                # No. of cols when displaying images (Null = auto)
      scale: 0.6                # Re-scale the images before displaying
      wait: 1                   # How long to wait (ms) (0 = wait for a keypress)
      width: 2                  # Line width for drawing boxes
      lab_col: [32, 200, 32]    # Color for drawing ground truth boxes (BGR)
      det_col: [32, 32, 200]    # Color for drawing model detections (BGR)
      font_thickness: 2
      font_scale: 0.8
      skip: *skip
      initial_skip: *initial_skip
```
This transformer specifies three output transform functions:
- Activate - applies activation functions to the model outputs
- NonMaximumSuppression - removes duplicate predictions
- Visualization - displays the ground truth and predictions
The source code for each of these transform functions can be found here.
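For intuition about the second step, here is a bare-bones non-maximum suppression sketch (not the Deeplodocus implementation, which is linked above): keep the highest-scoring box, discard any remaining box whose overlap with it exceeds iou_threshold, and repeat.

```python
# Minimal NMS sketch: boxes are (x1, y1, x2, y2), scores are objectness scores
import numpy as np

def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]              # Indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = np.array([box_iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_threshold]   # Drop boxes that overlap too much
    return keep
```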
Now you're good to go! Run the project main file, then use the load() and train() commands.
A useful series of commands can be found (commented out) under the on_wake entry in config/project.yaml.