This research focuses on generating race car image data using Generative Adversarial Networks (GANs) from a given set of input images. Three different GAN approaches are explored and compared for generating synthetic race car images. The work centers on learning about advanced GANs and tuning the parameters of the generator algorithms.
The following sections give a detailed explanation of GANs and introduce the input data (section one, methodology). The experimental setup (section two) leads into the implementation and results of the three model techniques applied (section three).
A GAN consists of two neural networks: a generator and a discriminator. The generator produces fake data, and the discriminator tries to differentiate between the fake and real data. The two train against each other, as demonstrated in Figure 1.
A key feature of this structure is that the generator never sees the real data and learns how to produce similar looking data through feedback from the discriminator. Thus, in situations involving confidential data, one can train the full network in a secure environment and then release only the generator to outside researchers. Then, the generator can be used to produce arbitrary quantities of data for analysis.
Training the discriminator
This works like any other neural network, but with the extra step of producing a current batch of fake data from the generator prior to each training iteration. One feeds both the real and generated data through the neural network and trains the discriminator to output the correct real and fake labels.
Figure 1. GAN Training Process
Training the generator
To train the generator, one uses the combined architecture but trains only the layers belonging to the generator. These layers are updated with backpropagation to achieve labels of "real" for the generated data, as in Figure 1, shown above.
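As a minimal sketch of how the labels differ between the two training steps, the following Python stands in for the real networks with plain lists of discriminator scores. The helper names (`discriminator_batch`, `generator_loss`) are hypothetical, and binary cross-entropy is assumed as the loss for illustration.

```python
import math


def discriminator_batch(real_samples, fake_samples):
    """Combine real and generated samples with the labels the discriminator should output."""
    samples = list(real_samples) + list(fake_samples)
    labels = [1.0] * len(real_samples) + [0.0] * len(fake_samples)  # 1 = real, 0 = fake
    return samples, labels


def binary_cross_entropy(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(labels)


def generator_loss(disc_scores_on_fakes):
    """Generator step: generated samples are scored against the label 'real' (1.0)."""
    targets = [1.0] * len(disc_scores_on_fakes)
    return binary_cross_entropy(disc_scores_on_fakes, targets)


# The generator's loss falls as the discriminator is fooled more often.
loss_when_caught = generator_loss([0.1, 0.2])
loss_when_fooling = generator_loss([0.8, 0.9])
```

In a full implementation, only the generator's weights would be updated from `generator_loss`, while the discriminator's weights stay frozen during that step.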
Figure 2. Input data for race car analysis
GAN models were trained on a proprietary data set of 60,000 NASCAR race car images captured during races. For information on how these images were curated and labeled, see our previous paper, Image Classification of Race Cars. Each car number had distinctive features in terms of color, design and sponsor positioning. Additionally, model training was tried at two different resolutions, 64x64 and 128x128.
Training for the GANs was performed on a Cisco C480ML machine with Tesla V100 GPU acceleration. Docker containers were loaded on the C480ML to run Python 3 and the packages listed below. Most of the code development was done using Jupyter notebooks in a JupyterLab environment.
The following Python packages were used to perform the training and testing:
In this experiment, the objective was to generate multi-label image data using GANs trained on a proprietary data set of 60,000 NASCAR race car images. The experiment began with deep convolutional GAN (DCGAN), a GAN architecture commonly used to work with images. A single DCGAN cannot generate multiple image classes at once, so another GAN architecture capable of handling this problem, Conditional GAN (CGAN), was also leveraged.
CGAN has been successfully used to generate multi-label data in the past. In this experiment, however, the results were not as expected. Instead of learning distinct features from each image label, the model learned complex new features that were a combination of distinct features from multiple image classes. This resulted in output cars whose color compositions were a mix of multiple car labels. A similar observation was made for output car shapes and labels. This could potentially be attributed to instability in the discriminator training, which led the efforts towards spectral normalization GAN (SN-GAN).
The spectral normalization technique has proven to be a successful methodology to stabilize the training of the discriminator. This is a weight normalization technique which is computationally light and easy to incorporate into existing implementations. The final output images were generated using this technique.
- Python 3.5+
- SciPy 0.19.1
Deep Convolutional GAN (DCGAN) is a GAN architecture introduced by Radford et al. (Alec Radford n.d.) in 2016. It combines elements of Convolutional Neural Network (CNN) architectures with GANs to model images. DCGAN is characterized by the following changes to CNN architectures, which make it possible to train higher-resolution and deeper generative models:
1. Replacing pooling layers with strided convolutions and using transposed convolution for up-sampling
2. Eliminating fully connected hidden layers for deeper architectures
3. Batch Normalization
Batch normalization (Sergey Ioffe 2015) stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This alleviates training problems that arise due to poor initialization and helps gradient flow in deeper models. It also helps deep generators begin learning, which prevents the generator from collapsing all samples to a single point, a common failure mode observed in GANs.
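The normalization step itself can be sketched in a few lines of Python. This is an illustration for a single unit over one mini-batch; the function name `batch_normalize` is hypothetical, and the learned scale and shift parameters that a full batch normalization layer also applies are omitted.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize one unit's activations across a mini-batch to zero mean, unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the division stable when the batch variance is close to zero
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


activations = [1.0, 2.0, 3.0, 4.0]   # one unit, mini-batch of four samples
normalized = batch_normalize(activations)
```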
ReLU and LeakyReLU activation functions are used in DCGANs for stable training. In the generator, ReLU is used for all layers except the output layer, which uses the Tanh function; this combination helps with performance. Using a bounded activation allows the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator, LeakyReLU is used as the activation function for all layers.
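For readers less familiar with these activations, a scalar sketch follows. The function names are illustrative, and the leak slope of 0.1 is used here only as an example default.

```python
import math


def relu(x):
    """Pass positive inputs through unchanged and zero out negative inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.1):
    """Like ReLU, but negative inputs keep a small slope (alpha) instead of becoming zero."""
    return x if x > 0 else alpha * x


def tanh_output(x):
    """Bounded activation: squashes any real input into the open interval (-1, 1)."""
    return math.tanh(x)
```

Because `leaky_relu` never has a zero slope, a unit that receives negative input still passes some gradient back during training.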
Figure 3. DCGAN Generator. The input layer is a randomly generated 100-dimensional vector. The subsequent layers transform this into an image of dimension 64x64, with 3 color channels. Each layer uses either a convolution or a transposed convolution.
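The up-sampling path to a 64x64 image can be checked with simple arithmetic. The sketch below uses the standard output-size formula for a transposed convolution (no output padding, no dilation). The 4x4 kernel, stride of 2, padding of 1 and 4x4 starting feature map are assumptions chosen because they are common in DCGAN implementations; they are not confirmed values for the model in Figure 3.

```python
def transposed_conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no output padding or dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


def upsampling_path(start_size, num_layers, **layer_kwargs):
    """List the feature-map size after each transposed-convolution layer."""
    sizes = [start_size]
    for _ in range(num_layers):
        sizes.append(transposed_conv_output_size(sizes[-1], **layer_kwargs))
    return sizes


# Assumed layout: a 4x4 feature map doubled four times to reach 64x64.
path = upsampling_path(start_size=4, num_layers=4)
```

With these assumed settings each layer doubles the spatial size, so one additional layer would reach 128x128.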
The DCGAN model was trained using 3700 64 * 64 * 3 colored images of a single race car label (car 48). These images were originally 224 * 224 * 3 and were resized to 64 * 64 * 3 for training purposes. No other pre-processing steps were applied to the training images. The model was trained using mini-batch stochastic gradient descent (SGD), with a mini-batch size of 128. All weights were initialized from a zero-centered normal distribution with standard deviation 1. In the LeakyReLU, the slope of the leak was set to 0.1 in all the layers. The Adam optimizer was used for training the network with a learning rate of 0.0002.
The following results were obtained by training the DCGAN with a batch size of 32 for 30 epochs on 3700 colored images (64*64*3) of car 48 and 3500 colored images (64*64*3) of car 9.
Figure 4. Results obtained by training the DCGAN with a batch size of 32 for 30 epochs on 3700 colored images (64*64*3) of car 48 and 3500 colored images (64*64*3) of car 9.
The DCGAN generates decent quality car images which are visually similar to the original images, but the model could use some fine tuning (explained in the next sections) to further improve the output quality. However, DCGAN generates only one image label at a time. So, in order to generate multi-class images, we tried another GAN architecture, Conditional GAN (CGAN), which is capable of generating multiple image labels.
GANs can be extended to a conditional model if both the generator and discriminator are conditioned on some extra information, y. This y could be any kind of auxiliary information, such as class labels or data from other modalities. The conditioning is performed by feeding y into both the discriminator and the generator as an additional input layer. In this way, CGAN extends GAN with the extra information y.
Recall that a GAN has two neural networks, the generator G(z) and the discriminator D(x), where the generator produces a generic output from a random noise input z. In the Conditional GAN, the generator learns to generate a fake image with a specific condition or characteristic, such as an image label. Hence, the generator and discriminator in CGAN each take an extra input y: generator G(z,y) and discriminator D(x,y).
Figure 5. Illustration of Conditioning input (y) to Generator and Discriminator
- Python 2.7+
The CGAN model was trained using 224 * 224 * 3 colored images of 4 different cars, which were resized for training purposes. The car label of each image was passed as a condition input in one-hot representation (y in the figure above). The generator network was built using BatchNormalization to improve the performance and stability of the network, strided convolutions and transposed convolutions for up-sampling, LeakyReLU (alpha=0.1), which allows a small gradient when the unit is not active and helps stabilize training, and a tanh activation function that bounds the generator output between -1 and 1. The discriminator network was built with BatchNormalization, strided convolutions and LeakyReLU. For compiling the model, binary_crossentropy was used as the loss function. The Adam optimizer was used for training the network with a learning rate of 0.0002.
The model was trained with four different classes of images, totaling 12400 images. The class labels of car numbers 9, 24, 42, and 48, were used as classes. Original image size was 224 * 224 * 3 which was reduced to 64 * 64 * 3.
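The one-hot condition input for these four car classes might be built as in the sketch below. Concatenating the label vector onto the noise vector is one common way of feeding y to a generator and is assumed here for illustration; the 100-dimensional noise size and the helper names are likewise assumptions rather than details of the model used.

```python
import random

CAR_CLASSES = [9, 24, 42, 48]  # car numbers used as class labels


def one_hot(car_number, classes=CAR_CLASSES):
    """Encode a car number as a vector with a single 1.0 at that class's index."""
    encoded = [0.0] * len(classes)
    encoded[classes.index(car_number)] = 1.0
    return encoded


def generator_input(car_number, noise_dim=100, rng=None):
    """Build a conditioned generator input: noise vector z followed by label vector y."""
    rng = rng or random.Random()
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(car_number)


conditioned = generator_input(42, rng=random.Random(0))
```

The discriminator would receive the same one-hot vector alongside each real or generated image, so that it judges whether an image is plausible for the stated car number.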
The results below were generated with a batch size of 32 after 60 epochs. First, the color composition of the generated images did not match the originals, and the shape of the car was distorted compared to the original images. Second, the labels were not correctly identified: the model generated images with the same features (i.e. black, yellow and red color) in all four classes. Generating the same type of image regardless of the input is a common GAN problem called mode collapse.
Mode collapse is the scenario in which the generator produces the same or nearly the same images every time and is able to successfully fool the discriminator.
Figure 6. Images generated by the CGAN show nearly the same results for each class of cars, which is an indication of mode collapse
To overcome the issue of mode collapse, experience replay was implemented: a sample array of generated images was stored, with a size equal to the batch size. At each epoch, a randomly chosen generated image was appended to the sample array. In the epoch that filled the array, the collected samples were provided to the discriminator and the array was then emptied.
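One way to read this procedure is as a small buffer of past generator outputs, sketched below. The class and method names are hypothetical, strings stand in for image arrays, and the choice to release the buffer as soon as it reaches the batch size is an interpretation of the description rather than a confirmed detail.

```python
import random


class ReplayBuffer:
    """Holds one generated image per epoch until a full batch has been collected."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity          # set equal to the training batch size
        self.samples = []
        self.rng = rng or random.Random()

    def add_from_epoch(self, generated_images):
        """Store one randomly chosen generated image from the current epoch."""
        self.samples.append(self.rng.choice(generated_images))

    def is_full(self):
        return len(self.samples) >= self.capacity

    def drain(self):
        """Return the stored images for a discriminator update and empty the buffer."""
        batch, self.samples = self.samples, []
        return batch


buffer = ReplayBuffer(capacity=3, rng=random.Random(0))
replayed_batches = []
for epoch in range(6):
    fakes = ["epoch%d_img%d" % (epoch, i) for i in range(4)]  # placeholder images
    buffer.add_from_epoch(fakes)
    if buffer.is_full():
        # In real training these would be fed to the discriminator labeled as fake.
        replayed_batches.append(buffer.drain())
```

Showing the discriminator older generator outputs makes it harder for the generator to cycle back to a previously successful single mode.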
Below are the output images after implementing Experience Replay:
Figure 7. Output images after implementing experience replay
After implementing experience replay, there was an improvement in the shape of the cars as compared to the images generated without experience replay. The color reproduced was similar to the original images, but mode collapse still played a role and the model generated images of car 42 in all four classes of images it generated.
A major challenge of leveraging GANs is the instability of their training. The spectral normalization technique, applied during GAN training, has proven to be a successful methodology to stabilize the training of the discriminator. This is a weight normalization technique which is computationally light and easy to incorporate into existing implementations. In the experiments below, spectral normalization GANs are used to generate better quality images at a higher resolution.
- Python 3.5+
- SciPy 0.19.1
To improve upon prior results using DCGAN and CGAN, the 1-Lipschitz constraint was tested as a way to keep the discriminator's gradients under control during training. We used the recently proposed method of enforcing the 1-Lipschitz constraint on the discriminator, as described by Miyato et al.
The 1-Lipschitz constraint was explored and enforced by acting on the weights of the discriminator. However, rather than simply clipping the weights in each layer to be small, we normalized the spectral norm of the weight matrix W at each layer in the discriminator.
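The spectral norm of a weight matrix is its largest singular value, and it can be estimated with power iteration. The pure-Python sketch below illustrates the idea on a small matrix; the function names are hypothetical, and it runs the iteration to convergence for clarity rather than mirroring any particular training implementation.

```python
import math
import random


def mat_vec(W, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mat_t_vec(W, u):
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def spectral_norm(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in W])
    for _ in range(n_iters):
        v = l2_normalize(mat_t_vec(W, u))
        u = l2_normalize(mat_vec(W, v))
    return sum(ui * wi for ui, wi in zip(u, mat_vec(W, v)))  # sigma = u^T W v


def spectrally_normalize(W, **kwargs):
    """Divide W by its spectral norm so the result has spectral norm close to 1."""
    sigma = spectral_norm(W, **kwargs)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]   # singular values are 3 and 1
W_sn = spectrally_normalize(W)
```

Unlike weight clipping, this rescales the whole matrix by a single factor, so the relative structure of the weights is preserved.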
The cited paper (Miyato et al.) used a TensorFlow implementation of spectral normalization which was publicly available as Python code on GitHub. This served as our starting point for training and for comparing the results of DCGAN, CGAN and SN-GAN.
The spectral normalization GAN was trained, as described above, on a subset of the car images dataset, with classes defined on car numbers 9, 24, 42 and 48. We had about 12000 input images for four different classes. Images were reduced to a resolution of 128 * 128 px and were center-cropped. The models were trained for 50,000 iterations with a batch size of 64, which means that each model was trained on 3.2 million image samples (50,000 × 64). We started with the original parameter settings, which used the Adam optimizer. SN-GAN experiments used a learning rate of 0.0002, β1 = 0.5, and β2 = 0.999 for Adam. The regularization parameter for gradient penalty, λ, was set to be 1.
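For reference, a single Adam update with the hyperparameters quoted above can be written out for one scalar parameter. This is an illustrative sketch: the function name `adam_step` and the epsilon value of 1e-8 are assumptions, and a real optimizer applies the same rule element-wise to every weight.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# On the first step the update size is close to the learning rate,
# regardless of the gradient's magnitude.
theta, m, v = adam_step(theta=1.0, grad=3.0, m=0.0, v=0.0, t=1)
```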
Spectral normalization significantly improved the quality of the generated images. The Structural Similarity Index Metric (SSIM), although not standard practice for GAN model evaluation, was used to give a comparative score for every generated image. An SSIM score measures how similar one image is to another in luminance, contrast and structure. In this case, each generated image was compared with a set of randomly chosen real images from the pool, and the resulting SSIM scores were averaged to give one score per generated image. To select good images, generated images can be sorted and selected based on this score.
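The scoring and ranking procedure can be sketched as follows. This is a simplified illustration: it computes SSIM once over the whole image (a single global window) on flat lists of grayscale pixel values, whereas standard SSIM implementations average the statistic over local windows. The function names are hypothetical, and the constants follow the commonly used values K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized images given as flat pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(generated, real_images):
    """Average SSIM of one generated image against a set of real reference images."""
    return sum(global_ssim(generated, real) for real in real_images) / len(real_images)


def rank_by_ssim(generated_images, real_images):
    """Sort generated images from highest to lowest average SSIM."""
    return sorted(generated_images, key=lambda g: mean_ssim(g, real_images), reverse=True)


# Toy 2x2 "images" flattened to four pixels each.
real_pool = [[10.0, 60.0, 120.0, 250.0], [12.0, 58.0, 125.0, 245.0]]
close_match = [11.0, 59.0, 122.0, 248.0]
poor_match = [250.0, 120.0, 60.0, 10.0]
ranked = rank_by_ssim([poor_match, close_match], real_pool)
```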
Below are the generated set of images for different classes of cars:
Figure 8. Generated set of images for the different classes of cars using the SN-GAN and showing the SSIM score for each.
Evaluation II: Training classifier model on generated data, and test on original images
Another evaluation method used to check the quality of the SN-GAN output was to train a classification model on the generated images and test its ability to predict the class of real images. The classifier was trained on 4000 randomly generated SN-GAN images per class, 16000 images in total, with no preference given to any class. We then used this model to classify sets of original images, which yielded highly encouraging results: classification of original images worked very well, giving an average accuracy of 89.6% across all classes.
Figure 9. Classifier Results
Below is the confusion matrix of actual vs. predicted classes for the classification model built on images generated using SN-GAN. "Actual Classes" are the classes from the set of real images and "Predicted Classes" are the output of the classifier trained on SN-GAN generated images. Accuracy reaches 90% in two out of the four classes. An important observation is that, in the training dataset for SN-GAN, some images in one class looked very similar to images in another class across different races. Since SN-GAN was trained irrespective of race number, this resulted in the slight inaccuracy seen here. It also explains the relatively large share of car number 42 images classified as car number 9 (9.68%).
Figure 10. Confusion matrix for actual vs. predicted classes
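Per-class and combined accuracy can be read off a confusion matrix as in the sketch below. The counts are made-up toy values chosen only to show the calculation and are not the results reported in this paper; the function names are hypothetical. Both a pooled accuracy and a mean of per-class accuracies are shown, since either could be meant by an "average accuracy".

```python
CLASSES = [9, 24, 42, 48]

# Toy counts for illustration only. Rows are actual classes, columns are predicted classes.
toy_confusion = [
    [8, 1, 1, 0],
    [0, 9, 1, 0],
    [2, 0, 7, 1],
    [1, 0, 0, 9],
]


def per_class_accuracy(confusion, classes):
    """Fraction of each actual class that was predicted correctly (diagonal / row sum)."""
    return {c: confusion[i][i] / sum(confusion[i]) for i, c in enumerate(classes)}


def pooled_accuracy(confusion):
    """Correct predictions divided by all predictions, across every class."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def mean_class_accuracy(confusion, classes):
    """Unweighted mean of the per-class accuracies."""
    accuracies = per_class_accuracy(confusion, classes)
    return sum(accuracies.values()) / len(accuracies)


accuracy_by_car = per_class_accuracy(toy_confusion, CLASSES)
```

The two combined figures coincide when every class has the same number of test images, as in this toy matrix, and diverge otherwise.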
Generative Adversarial Networks (GAN) have been used in the past to generate new images of people and objects. In this paper, building on that work, GANs were trained using a proprietary race car image dataset to generate synthetic race car images. Different GAN Architectures (DCGAN, CGAN and SN-GAN) were explored to generate real looking multi-label car images. The experiments uncovered strengths and limitations of each GAN architecture and outlined ways to deal with common GAN training problems such as mode collapse.
DCGAN
- This is an architecture commonly used to work with image data
- Visually similar images were obtained, but DCGAN couldn't train on and generate multiple classes all at once
CGAN
- Extension of GAN to a conditional model, with car label information provided as input
- Instead of learning distinct features from each image label, the model learned complex new features which were a combination of distinct features from multiple image classes
- This resulted in output cars having color compositions which were a mix of multiple car labels
SN-GAN
- The spectral normalization technique has proven to be a computationally light method for stabilizing the training of the discriminator
- This technique normalizes the weights of the discriminator, enforcing the 1-Lipschitz constraint on each discriminator layer
From the results obtained with these architectures, it can be concluded that the SN-GAN model successfully achieved the objective of this research: generating multi-label image data for different cars.
The paper also explores quantifiable measurements of image outputs from GANs such as Structural Similarity Index Metric (SSIM) and training a classifier instead of the commonly popular "eyeballing method" for assessing the quality of the final output. The multinomial classifier trained on the four classes of generated images achieved a combined accuracy of 89.6% on real images data for all the classes, while reaching a highest accuracy of 93.6% for a single class (car label 24).
In the future, this approach can be used to generate large-scale image datasets for computer vision research in sensitive areas such as healthcare scans, or in areas where data collection requires a lot of effort, such as collecting soil sample images. Machine Learning (ML) models can then be trained on these generated image datasets to perform a variety of tasks, from identifying tumors or early signs of disease in healthcare to soil and crop quality analysis for improving farm yields.
This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research. It consists of the opinions of WWT Research and as such should not be construed as statements of fact. WWT provides the Report "AS-IS", although the information contained in the Report has been obtained from sources that are believed to be reliable. WWT disclaims all warranties as to the accuracy, completeness or adequacy of the information.