3. Data Preprocessing


In chapter 2, we explored the data in the Face Mask Detection dataset. In this chapter, we will perform data preprocessing.

Popular datasets contain tens of thousands of images, but many datasets are created on a much smaller scale, which raises the question of how to train a model on limited data.

You don’t have to collect new images just because your dataset is small. Instead, a variety of additional images can be generated through data augmentation.

Figure 3-1 shows tennis balls that look identical to the human eye. To a deep learning model, however, the three tennis balls are all different inputs. Using this principle, we can produce multiple training samples from a single image by modifying it.

In chapter 3.1, we will look at the torchvision.transforms and albumentations modules used for image augmentation. torchvision.transforms is the module officially provided by PyTorch, while albumentations is an optimized open-source computer vision library built on top of OpenCV that generally offers faster processing along with additional features.

Both modules can be used for augmentation when building an image classification model. However, only albumentations provides image augmentation suitable for building an object detection model: augmentation for object detection must transform not only the image but also the bounding boxes, which torchvision.transforms does not support.

Therefore, in chapter 3.2 we will practice bounding box augmentation using albumentations. Finally, in chapter 3.3, we will split the data into training data and test data.

3.1. Augmentation Practice

For the augmentation practice, we will load the data using the code from chapter 2.1.

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils
!python Tutorial-Book-Utils/PL_data_loader.py --data FaceMaskDetection
!unzip -q Face\ Mask\ Detection.zip
Cloning into 'Tutorial-Book-Utils'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 1), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (12/12), done.
Face Mask Detection.zip is done!

Make sure to use the latest version of the albumentations module by upgrading it if necessary. We can upgrade a specific module through the pip install --upgrade command.

!pip install --upgrade albumentations
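
To confirm which version is actually installed after the upgrade, you can print the module’s version string (a quick optional check, not required for the rest of the chapter):

import albumentations
print(albumentations.__version__)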

To visualize the augmentation output, let’s use the bounding box schematic code from chapter 2.3.

import os
import glob
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from bs4 import BeautifulSoup

def generate_box(obj):
    
    xmin = float(obj.find('xmin').text)
    ymin = float(obj.find('ymin').text)
    xmax = float(obj.find('xmax').text)
    ymax = float(obj.find('ymax').text)
    
    return [xmin, ymin, xmax, ymax]

def generate_label(obj):
    # 1: with_mask, 2: mask_weared_incorrect, 0: without_mask (everything else)
    if obj.find('name').text == "with_mask":
        return 1
    elif obj.find('name').text == "mask_weared_incorrect":
        return 2
    return 0


def generate_target(file): 
    with open(file) as f:
        data = f.read()
        soup = BeautifulSoup(data, "html.parser")
        objects = soup.find_all("object")

        num_objs = len(objects)

        boxes = []
        labels = []
        for i in objects:
            boxes.append(generate_box(i))
            labels.append(generate_label(i))

        boxes = torch.as_tensor(boxes, dtype=torch.float32) 
        labels = torch.as_tensor(labels, dtype=torch.int64) 
        
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        
        return target

def plot_image_from_output(img, annotation):

    # matplotlib expects [height, width, channels], so reorder the tensor axes
    img = img.permute(1, 2, 0)

    fig, ax = plt.subplots(1)
    ax.imshow(img)

    for idx in range(len(annotation["boxes"])):
        xmin, ymin, xmax, ymax = annotation["boxes"][idx]

        if annotation['labels'][idx] == 0:      # without mask
            edgecolor = 'r'
        elif annotation['labels'][idx] == 1:    # with mask
            edgecolor = 'g'
        else:                                   # mask worn incorrectly
            edgecolor = 'orange'

        rect = patches.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                 linewidth=1, edgecolor=edgecolor, facecolor='none')
        ax.add_patch(rect)

    plt.show()

There are a few differences from the functions used in chapter 2. The torch.as_tensor calls have been added to the generate_target function so that the boxes and labels are returned as tensors, ready for the tensor operations used when training the deep learning model later.

Additionally, the plot_image function introduced in chapter 2 read images from a file path. The plot_image_from_output function instead visualizes an image that has already been converted to a torch.Tensor. In PyTorch, images are stored as [channels, height, width], whereas matplotlib expects [height, width, channels]. Therefore, we use the permute function to reorder the dimensions into the order matplotlib expects.
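
As a minimal illustration of this channel reordering (using a small random tensor rather than a real image):

# A dummy 3-channel "image" in PyTorch's [channels, height, width] order.
dummy = torch.rand(3, 4, 5)
print(dummy.shape)                    # torch.Size([3, 4, 5])
print(dummy.permute(1, 2, 0).shape)   # torch.Size([4, 5, 3]), matplotlib's order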

3.1.1. Torchvision Transforms

To practice torchvision.transforms, let’s first define the TorchvisionMaskDataset class. The class loads an image in its __getitem__ method and then applies data augmentation according to the rules stored in the transform parameter. It uses the time function to measure how long the transform takes, and finally returns the image, the target, and total_time.

from PIL import Image
import cv2
import numpy as np
import time
import torch
import torchvision
from torch.utils.data import Dataset
from torchvision import transforms
import albumentations
import albumentations.pytorch
from matplotlib import pyplot as plt
import os
import random

class TorchvisionMaskDataset(Dataset):
    def __init__(self, path, transform=None):
        self.path = path
        self.imgs = list(sorted(os.listdir(self.path)))
        self.transform = transform
        
    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        file_image = self.imgs[idx]
        file_label = self.imgs[idx][:-3] + 'xml'
        img_path = os.path.join(self.path, file_image)
        
        if 'test' in self.path:
            label_path = os.path.join("test_annotations/", file_label)
        else:
            label_path = os.path.join("annotations/", file_label)

        img = Image.open(img_path).convert("RGB")
        
        target = generate_target(label_path)
        
        start_t = time.time()
        if self.transform:
            img = self.transform(img)

        total_time = (time.time() - start_t)

        return img, target, total_time

Let’s practice image augmentation using the functions provided in torchvision.transforms. After resizing the image to (300, 300), we will randomly crop it to 224×224. Then we’ll randomly change the image’s brightness, contrast, saturation, and hue. Finally, after flipping the image horizontally, we will convert it to a tensor.

torchvision_transform = transforms.Compose([
    transforms.Resize((300, 300)), 
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    transforms.RandomHorizontalFlip(p = 1),
    transforms.ToTensor(),
])

torchvision_dataset = TorchvisionMaskDataset(
    path = 'images/',
    transform = torchvision_transform
)
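
Before visualizing full before/after examples, a quick sanity check (a small sketch, assuming the dataset images really are under the images/ folder used above) is to apply the composed transform to a single PIL image and confirm that a 224×224 tensor comes out:

sample_path = os.path.join('images/', sorted(os.listdir('images/'))[0])
sample_img = Image.open(sample_path).convert("RGB")
out = torchvision_transform(sample_img)
print(type(out), out.shape)   # expected: a torch.Tensor of shape [3, 224, 224]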

We can resize the image with the Resize function and crop it with the RandomCrop function provided by transforms. The ColorJitter function randomly changes brightness, contrast, saturation, and hue. RandomHorizontalFlip flips the image horizontally with probability p (here p=1, so the flip is always applied). Let’s run the code below to compare the image before and after the transforms.

only_totensor = transforms.Compose([transforms.ToTensor()])

torchvision_dataset_no_transform = TorchvisionMaskDataset(
    path = 'images/',
    transform = only_totensor
)

img, annot, transform_time = torchvision_dataset_no_transform[0]
print('Before applying transforms')
plot_image_from_output(img, annot)
Before applying transforms
[output image: the first sample with its bounding boxes, before the transforms]
img, annot, transform_time = torchvision_dataset[0]

print('After applying transforms')
plot_image_from_output(img, annot)
After applying transforms
[output image: the same sample after the torchvision transforms]

We can see that the listed transforms have been applied to the image. However, while the image itself has changed, the bounding box has not: the augmentation provided by torchvision.transforms only affects the pixel values, not the location of the bounding box.
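
To see why the boxes must move with the image, consider a horizontal flip: for an image of width W, a pascal_voc box (xmin, ymin, xmax, ymax) should become (W - xmax, ymin, W - xmin, ymax). A tiny sketch with made-up numbers:

# Hypothetical box on a 300-pixel-wide image, flipped horizontally.
W = 300
xmin, ymin, xmax, ymax = 50, 80, 120, 200
flipped_box = (W - xmax, ymin, W - xmin, ymax)
print(flipped_box)   # (180, 80, 250, 200)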

For image classification, the label stays the same even if the image changes. For object detection, however, the labels (the bounding boxes) must change along with the image. We’ll see how to solve this problem in chapter 3.2. First, though, let’s continue comparing the torchvision and albumentations modules. The code below measures the time spent transforming the image in torchvision_dataset, repeated 100 times.

total_time = 0
for i in range(100):
  sample, _, transform_time = torchvision_dataset[0]
  total_time += transform_time

print("torchvision time: {} ms".format(total_time*10))
torchvision time: 10.138509273529053 ms

The transform took about 10 to 12 ms per image, averaged over the 100 runs. In the next section, we will check the augmentation speed of the albumentations module.

3.1.2. Albumentations

In chapter 3.1.1, we measured the conversion speed of torchvision.transforms. In this section, we will look at another augmentation module, albumentations. As before, let’s first define a dataset class. AlbumentationsDataset has a similar structure to TorchvisionMaskDataset, except that it reads the image with the cv2 module and converts it from BGR to RGB before applying the transform and returning the result.

class AlbumentationsDataset(Dataset):
    def __init__(self, path, transform=None):
        self.path = path
        self.imgs = list(sorted(os.listdir(self.path)))
        self.transform = transform
        
    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        file_image = self.imgs[idx]
        file_label = self.imgs[idx][:-3] + 'xml'
        img_path = os.path.join(self.path, file_image)

        if 'test' in self.path:
            label_path = os.path.join("test_annotations/", file_label)
        else:
            label_path = os.path.join("annotations/", file_label)
        
        # Read an image with OpenCV
        image = cv2.imread(img_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        target = generate_target(label_path)

        # time only the augmentation step (0 if no transform is given)
        total_time = 0
        if self.transform:
            start_t = time.time()
            augmented = self.transform(image=image)
            total_time = (time.time() - start_t)
            image = augmented['image']

        return image, target, total_time

For a speed comparison with torchvision.transforms, let’s use the equivalent functions: Resize, RandomCrop, ColorJitter, and HorizontalFlip. Then we will compare the images before and after the transforms.

# Same transform with torchvision_transform
albumentations_transform = albumentations.Compose([
    albumentations.Resize(300, 300), 
    albumentations.RandomCrop(224, 224),
    albumentations.ColorJitter(p=1), 
    albumentations.HorizontalFlip(p=1), 
    albumentations.pytorch.transforms.ToTensor()
])
# Before applying transforms
img, annot, transform_time = torchvision_dataset_no_transform[0]
plot_image_from_output(img, annot)
[output image: the first sample before the transforms]
# After applying transforms
albumentation_dataset = AlbumentationsDataset(
    path = 'images/',
    transform = albumentations_transform
)

img, annot, transform_time = albumentation_dataset[0]
plot_image_from_output(img, annot)
[output image: the same sample after the albumentations transforms]

As with torchvision.transforms, the image has been transformed but the bounding box has not. To measure the speed, we will apply the albumentations transform 100 times.

total_time = 0
for i in range(100):
    sample, _, transform_time = albumentation_dataset[0]
    total_time += transform_time

print("albumentations time/sample: {} ms".format(total_time*10))
albumentations time/sample: 2.1135759353637695 ms

On average, each albumentations transform took about 2.0 to 2.5 ms, which is roughly four to five times faster than torchvision.transforms.
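
Dividing the two per-sample averages printed above gives a rough speedup factor (the exact numbers will vary from run to run):

# Rough speedup ratio from the two timings above (run-dependent).
print(round(10.138509273529053 / 2.1135759353637695, 1))   # roughly 4.8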

3.1.3. Probability-Based Augmentation Combination

Albumentations is not only faster than torchvision.transforms, it also provides functionality that torchvision lacks. In this section, we will look at the OneOf function provided by albumentations. OneOf selects one of the transforms in its list according to their probability values; the probability of the OneOf block itself and the probabilities of the transforms inside it are combined to decide what actually gets executed. In the example below, each OneOf block has p=1, and since the three albumentations transforms inside each block are also given p=1, each of them ends up being selected and applied with probability 1/3. A wide variety of augmentation pipelines can be created by adjusting these probability values.

albumentations_transform_oneof = albumentations.Compose([
    albumentations.Resize(300, 300), 
    albumentations.RandomCrop(224, 224),
    albumentations.OneOf([
                          albumentations.HorizontalFlip(p=1),
                          albumentations.RandomRotate90(p=1),
                          albumentations.VerticalFlip(p=1)            
    ], p=1),
    albumentations.OneOf([
                          albumentations.MotionBlur(p=1),
                          albumentations.OpticalDistortion(p=1),
                          albumentations.GaussNoise(p=1)                 
    ], p=1),
    albumentations.pytorch.ToTensor()
])
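
As a side note, the inner p values act as relative weights inside OneOf, while the outer p decides whether the block runs at all. A small sketch with illustrative values (not used elsewhere in this chapter):

# This OneOf block runs 80% of the time; when it does, HorizontalFlip is chosen
# twice as often as VerticalFlip because the inner p values are normalized.
weighted_oneof = albumentations.Compose([
    albumentations.OneOf([
        albumentations.HorizontalFlip(p=0.8),   # relative weight 0.8
        albumentations.VerticalFlip(p=0.4),     # relative weight 0.4
    ], p=0.8),
])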

Below is the result of applying albumentations_transform_oneof to an image 10 times.

albumentation_dataset_oneof = AlbumentationsDataset(
    path = 'images/',
    transform = albumentations_transform_oneof
)

num_samples = 10
fig, ax = plt.subplots(1, num_samples, figsize=(25, 5))
for i in range(num_samples):
  ax[i].imshow(transforms.ToPILImage()(albumentation_dataset_oneof[0][0]))
  ax[i].axis('off')
[output images: 10 augmented versions of the same sample produced by albumentations_transform_oneof]

3.2. Bounding Box Augmentation

When augmenting images used to build an object detection model, the bounding boxes must be transformed along with the image. As we saw in chapter 3.1, if the bounding boxes are not transformed together, model training will not work properly because the boxes will point at the wrong locations. Bounding box augmentation is available through the bbox_params parameter of the Compose function provided by albumentations.

First, let’s create a new dataset class using the code below. The transform part of the AlbumentationsDataset class from section 3.1.2 has been modified: since not only the image but also the bounding boxes are transformed, the inputs and outputs of the transform call change accordingly.

class BboxAugmentationDataset(Dataset):
    def __init__(self, path, transform=None):
        self.path = path
        self.imgs = list(sorted(os.listdir(self.path)))
        self.transform = transform
        
    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        file_image = self.imgs[idx]
        file_label = self.imgs[idx][:-3] + 'xml'
        img_path = os.path.join(self.path, file_image)

        if 'test' in self.path:
            label_path = os.path.join("test_annotations/", file_label)
        else:
            label_path = os.path.join("annotations/", file_label)
        
        image = cv2.imread(img_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        target = generate_target(label_path)

        if self.transform:
            transformed = self.transform(image = image, bboxes = target['boxes'], labels = target['labels'])
            image = transformed['image']
            target = {'boxes':transformed['bboxes'], 'labels':transformed['labels']}
        
            
        return image, target

Next, let’s use the albumentations.Compose function to define the transformation. First we flip the image horizontally, then rotate it by a random angle between -90 and 90 degrees. To transform the bounding boxes as well, pass an albumentations.BboxParams object to the bbox_params parameter. In the Face Mask Detection dataset, boxes are annotated as xmin, ymin, xmax, ymax, which matches the pascal_voc notation, so we enter pascal_voc in the format parameter. We also pass labels to label_fields so that the class value of each object is carried in the labels parameter when the transform is applied.

bbox_transform = albumentations.Compose(
    [albumentations.HorizontalFlip(p=1),
     albumentations.Rotate(p=1),
     albumentations.pytorch.transforms.ToTensor()],
    bbox_params=albumentations.BboxParams(format='pascal_voc', label_fields=['labels']),
)
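
Before wiring this into a dataset class, a minimal direct call with a hypothetical dummy image and box shows how the bboxes and labels arguments flow through the transform:

# Dummy 100x100 image with a single pascal_voc box and its label, for illustration only.
dummy_image = np.zeros((100, 100, 3), dtype=np.uint8)
dummy_boxes = [[20, 30, 60, 80]]   # xmin, ymin, xmax, ymax
dummy_labels = [1]

out = bbox_transform(image=dummy_image, bboxes=dummy_boxes, labels=dummy_labels)
print(out['bboxes'])   # box coordinates transformed along with the image
print(out['labels'])   # class labels carried through unchanged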

Now let’s instantiate the BboxAugmentationDataset class and check the augmentation result.

bbox_transform_dataset = BboxAugmentationDataset(
    path = 'images/',
    transform = bbox_transform
)

img, annot = bbox_transform_dataset[0]
plot_image_from_output(img, annot)
[output image: transformed sample with correctly transformed bounding boxes]

Each time we run the code above, a newly transformed image is produced. This time the bounding boxes are transformed along with the image, so they still accurately enclose the masked faces in the transformed image. In chapters 4 and 5, we will build models using data in which the image and bounding boxes are transformed together.

3.3. Data Separation

To build an Artificial Intelligence model, we need training data and test data. The training data is used when training the model, and the test data is used when evaluating the model. Test data must not overlap with training data. Let’s divide the data imported in chapter 3.1 into training data and test data. First, let’s check the total number of data with the code below.

print(len(os.listdir('annotations')))
print(len(os.listdir('images')))
853
853

We can see that there are 853 images in total. A 7:3 split between training and test data is common, but since this dataset is relatively small we will use an 8:2 ratio; 20% of 853 comes to about 170 images. To use 170 of the 853 images as test data, we will move them into separate folders. First, create the folders for the test data using the Linux command mkdir.

!mkdir test_images
!mkdir test_annotations

If we run the code above, the test_images folder and test_annotations folder will be created. Now, we will move 170 files each from the images folder and annotations folder into the newly created folders. We will use the sample function in the random module to randomly extract numbers and use them as index values.

import random
random.seed(1234)
idx = random.sample(range(853), 170)
print(len(idx))
print(idx[:10])
170
[796, 451, 119, 7, 92, 826, 596, 35, 687, 709]
import numpy as np
import shutil

for img in np.array(sorted(os.listdir('images')))[idx]:
    shutil.move('images/'+img, 'test_images/'+img)

for annot in np.array(sorted(os.listdir('annotations')))[idx]:
    shutil.move('annotations/'+annot, 'test_annotations/'+annot)

As seen in the code above, we can use the shutil package to move 170 images and 170 annotation files to the test_images folder and test_annotations folder, respectively. Let’s check the number of files in each folder.

print(len(os.listdir('annotations')))
print(len(os.listdir('images')))
print(len(os.listdir('test_annotations')))
print(len(os.listdir('test_images')))
683
683
170
170

For image classification, it is enough to check the number of images in each split. For object detection, however, we should also check how many objects of each class exist in each split. Let’s use the code below to count the objects per class in each dataset.

from tqdm import tqdm
import pandas as pd
from collections import Counter

def get_num_objects_for_each_class(dataset):

    total_labels = []

    for img, annot in tqdm(dataset, position = 0, leave = True):
        total_labels += [int(i) for i in annot['labels']]

    return Counter(total_labels)


train_data =  BboxAugmentationDataset(
    path = 'images/'
)

test_data =  BboxAugmentationDataset(
    path = 'test_images/'
)

train_objects = get_num_objects_for_each_class(train_data)
test_objects = get_num_objects_for_each_class(test_data)

print('\n Object in training data', train_objects)
print('\n Object in test data', test_objects)
100%|██████████| 683/683 [00:13<00:00, 51.22it/s]
100%|██████████| 170/170 [00:03<00:00, 52.67it/s]
Object in training data Counter({1: 2691, 0: 532, 2: 97})

Object in test data Counter({1: 541, 0: 185, 2: 26})

get_num_objects_for_each_class is a function that collects the label values of all bounding boxes in the dataset into total_labels and then uses the Counter class to count how many of each label there are. In the training data, there are 532 objects of class 0, 2,691 objects of class 1, and 97 objects of class 2. In the test data, there are 185 objects of class 0, 541 objects of class 1, and 26 objects of class 2. The class ratios (0, 1, 2) are roughly similar across the two splits, so the data has been divided reasonably.
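
To back up the claim about similar class ratios, a short check using the Counter objects computed above prints the per-class proportions of each split:

# Per-class proportions in the training and test splits, from the counts above.
for name, counter in [('train', train_objects), ('test', test_objects)]:
    total = sum(counter.values())
    print(name, {cls: round(counter[cls] / total, 3) for cls in sorted(counter)})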

So far, we have seen how to use the albumentations module to expand the set of images used to build an object detection model, and how to split the available data into training and test sets. In chapter 4, we will build a mask-wearing detection model by training RetinaNet, a one-stage detector.