Project 2: Plant Disease Classification and Detection along with Plant Leaf Generation
Overview
- We will use transfer learning to classify plant diseases.
- We will implement various object detection models such as Faster R-CNN and SSD to detect plant diseases.
- We will generate fake leaf images using various GANs and study their utility as training samples for the experiment.
1. Introduction
Plant disease is a serious problem for farmers all over the world. It reduces both the quantity and the quality of food production. While the global population is rapidly increasing, reduced availability of and access to food drives up its cost. Various methods have been developed to diagnose diseases, yet plant disease still poses a threat to farmers. Machine learning and deep learning have been successfully applied to domains such as health care, finance, and communication, and the agricultural industry can also benefit from this modern technology. Most existing datasets are lab-controlled, and models trained on them perform very poorly in real-life conditions with natural backgrounds, varying lighting, different stages of symptoms, and so on. For this reason, the authors of the PlantDoc dataset prepared a dataset containing images from a non-controlled environment. We will use the PlantDoc dataset in our project for the classification and detection of plant diseases.
Nowadays most image classification problems are solved using a convolutional neural network (CNN). CNNs became popular when AlexNet beat the previous traditional methods by a huge margin. The architecture of AlexNet consists of five convolution layers, max-pooling layers, and fully connected layers with ReLU non-linearity. Due to its huge success, it has become standard practice to tackle image classification tasks with CNNs. The disadvantage of using a CNN is that it requires a very large dataset for training; otherwise the model will overfit (it performs well only on the data it was trained on). The dataset we are using consists of only 2340 images from 27 classes, so training a CNN from scratch is very difficult.
2. Learning with a small dataset
The problem of learning with a small dataset can be approached through:
- Data Augmentation
- Transfer Learning
- Data Augmentation: Data augmentation is the process of increasing the number of training samples through transformations such as scaling, zooming, flipping, and rotation, while the labels of the data are preserved. It is used as a regularizer to prevent overfitting and to increase the generalization power of the model. Another approach is to generate fake images with generative models such as GANs to increase the number of training images; we will discuss GANs in more detail in a later section. Figures illustrating scaling, flipping, and rotation are given below, and a code sketch combining augmentation with transfer learning follows this list.
- Transfer Learning: Transfer learning is applying knowledge gained on one task to solve a problem in another, similar domain. Transfer learning has shown promising results in many image classification tasks. When a large labeled dataset is not available, transfer learning addresses this limitation by transferring the learned parameters of a CNN that was well trained on a large dataset (e.g. ImageNet) to the new classification problem. Transfer learning can be performed in two different ways:
- Feature extraction: This approach uses a CNN that was well trained on a large dataset as a fixed feature extractor for the target domain. All convolution layers of the pre-trained model are frozen, the original fully connected layers are removed, and a new classifier is trained on the extracted features.
- Fine-tuning: This approach also uses a CNN that was well trained on a large dataset as the base, and replaces the classifier layer with a new one. In this method the convolution layers of the pre-trained model are not frozen, so their weights can be updated during training. The base model is initialized with the pre-trained weights, while the new classifier layer is initialized with random weights.
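To make the two approaches concrete, below is a minimal Keras sketch that combines the augmentation transformations listed above with transfer learning from an ImageNet-pretrained MobileNetV2. The input size, layer choices, and hyperparameters are illustrative assumptions rather than the exact configuration used in this project.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Label-preserving augmentation applied on the fly during training.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

# ImageNet-pretrained backbone used as the base model.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # feature extraction: freeze all convolution layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = augment(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(27, activation="softmax")(x)  # 27 PlantDoc classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# For fine-tuning, set base.trainable = True (optionally only the top blocks),
# recompile with a lower learning rate, and continue training.
```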
In our project, we experiment with InceptionResNetV2 and MobileNet as our base models. The results from the two models are:
Now we will combine the PlantDoc data with the PlantVillage dataset. The PlantVillage dataset is a collection of plant disease images collected in a lab setting, whereas PlantDoc was collected in a natural setting. We add 100-132 images per class to our original dataset, except for two classes, Apple Rust Leaf and Grape Black Rot, to which no images are added. The test set contains images from the PlantDoc dataset only. The results after using both PlantDoc and PlantVillage data are given in the table below.
We can see that the model's accuracy increases by 5% for InceptionResNetV2 and by 4% for MobileNet after adding images from the PlantVillage dataset, even though the PlantVillage and PlantDoc datasets contrast strongly with each other.
Images from the PlantDoc dataset.
Images from the PlantVillage dataset.
Examining the wrong predictions, we found that 14 images had been placed in the wrong classes. The images were:
test/Corn Gray leaf spot/IMG_42231.jpg
test/Corn Gray leaf spot/rsz0803Figure6.jpg
test/Corn leaf blight/0796.20graylssymt.jpg
test/Corn leaf blight/07c.jpg
test/Corn leaf blight/2013Corn_GrayLeafSpot_0815_0003.JPG.jpg
test/Corn leaf blight/corn-disease-update-fig-3-gray-leaf-spot.jpg
test/Corn leaf blight/corn-gray-leaf-spot-f4.jpg
test/Potato leaf early blight/3023.jpg
test/Potato leaf early blight/backus-056-potato-blight.jpg
test/Potato leaf early blight/early-blight-or-target-spot-alternaria-solani-lesions-on-a-tomato-AXK6AY.jpg
test/Potato leaf late blight/20090710-lateblight.jpg
test/Potato leaf late blight/5816740026_d42ef24413_Phytophthora-Infestans.jpg
test/Soyabean leaf/leaf-raspberry-isolated-on-a-white-stock-photography-image-10106222-1625198.jpg
test/Potato leaf early blight/potatobd001.jpg
After moving these images from the misplaced classes to the correct ones, we performed prediction on the test set again and found that, out of the 14 images, our model predicted 11 correctly. Finally, we obtained 66% and 70% accuracy with the InceptionResNetV2 and MobileNet models respectively. The results are given in the table below:
The heatmap of the confusion matrix is given below:
Analyzing the heatmap, we can see that most wrong predictions occur within the same category, i.e. the Apple, Corn, Potato, and Tomato categories. Images from these classes are very hard to distinguish because they look visually similar. Some of them are shown below:
These problems can be tackled by increasing the amount of labeled data for the above-mentioned categories.
3. Interpreting the model predictions with Gradient-weighted Class Activation Mapping (Grad-CAM)
Interpretability, in the formal sense, is a relation between theories expressing the possibility of interpreting or translating one into the other (from Wikipedia). Interpretability of machine learning and deep learning models has been a major concern for a long time. Traditional machine learning algorithms are more interpretable, but their accuracy and robustness are lower; deep learning, on the other hand, achieves high accuracy and robustness but lacks interpretability, so there is a tradeoff between accuracy and interpretability. Deep neural networks are considered black boxes because it is really hard to figure out how they arrive at an output, so interpreting these models can help us gain more trust in their predictions.
In 2016, a method called Gradient-weighted Class Activation Mapping was developed to make convolutional neural network-based models more transparent by visualizing the regions of the input that are important for a prediction. A good visualization is class-discriminative (it localizes the category in the image) and high-resolution (it captures fine-grained detail). Grad-CAM is a strict generalization of CAM (Class Activation Mapping). The disadvantage of CAM is that the architecture must feed the output of global average pooling directly into the softmax layer, which hurts the accuracy of the model; Grad-CAM removes this constraint and lets us visualize the important input regions for any kind of CNN architecture. Grad-CAM computes the gradient of the score for a class c with respect to the feature map activations of a convolution layer. These gradients are global-average-pooled to obtain the neuron importance weights.
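In the notation of the Grad-CAM paper (Selvaraju et al.), the neuron importance weight for feature map $A^k$ and class $c$ is

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}$$

where $y^c$ is the score for class $c$ and $Z$ is the number of spatial locations in the feature map.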
The neuron importance weight captures the importance of feature map k for a target class c. A linear combination of the forward activation maps weighted by these values is then computed, followed by a ReLU, to obtain the result.
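In the same notation, the class-discriminative localization map is

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$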
This result is a heatmap of the same size as the feature map, for example 14 x 14. The ReLU is used because we are only interested in features that have a positive influence on the class of interest. The produced heatmap is rescaled and superimposed on the original image to visualize the regions that are important for the prediction.
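A minimal TensorFlow sketch of this procedure is shown below. The function and layer names are placeholders; in practice `conv_layer_name` would be the last convolutional layer of the backbone, and `image` a single preprocessed input of shape (H, W, 3).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name):
    """Compute a Grad-CAM heatmap for one image and one target class."""
    # Model mapping the input to (chosen conv layer activations, predictions).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradient of the class score w.r.t. the feature map activations.
    grads = tape.gradient(class_score, conv_out)
    # Global-average-pool the gradients -> neuron importance weights alpha_k^c.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted combination of the forward activation maps, followed by ReLU.
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    # Normalize to [0, 1]; the map is then rescaled and overlaid on the input image.
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()
```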
Below are some of the results obtained by Grad-CAM.
From the above figure, we can see that the model learns to focus on the infected parts of the leaves to make its predictions. The last two figures are of the same class, i.e. Tomato Early Blight; for these figures we generated the visualizations with respect to the Tomato Early Blight and Tomato Septoria Gray Spot classes respectively. The visualizations are quite interesting: the model focuses on the infected upper part when predicting Tomato Early Blight and on the infected lower part when predicting Tomato Septoria Gray Spot.
4. Plant disease detection
The plant disease classification approach often fails in certain situations, some of which are listed below:
- The classification method fails when there are multiple diseases on the same leaf.
- It fails when there are multiple leaves in an image, which could be from the same species but belong to different classes.
- Images with complex backgrounds pose difficulties for classification models.
So, object detection algorithms were explored as a solution to the situations above.
Deep learning-based object detectors can be classified into two types:
- Two-stage detectors (such as R-CNN, Fast R-CNN, and Faster R-CNN)
- One-stage detectors (such as YOLO and SSD)
We will briefly look at both types of algorithms and compare their results on the plant disease detection task.
Two-stage detectors
In this type of detector, a series of regions of interest is first extracted using a proposal algorithm, and then classification and localization are performed on these proposals. Since the regions of interest are processed further, two-stage detectors achieve higher accuracy than one-stage detectors.
We will begin with a short introduction to R-CNN, Fast R-CNN, and Faster R-CNN, which are two-stage detectors.
R-CNN stands for Region-based Convolutional Neural Network. In short, two different types of computation take place in R-CNN. A region proposal method extracts nearly 2000 regions of the input image that are likely to contain objects. Each region proposal is reshaped into a fixed shape, and these reshaped images are fed into a convnet such as AlexNet, VGG-16, or ResNet, which produces a feature representation. The features are passed to a linear classifier, which classifies the region into one of the categories in the data, and to a bounding box regressor, which produces bounding box offsets to refine the original proposal.
Figure of RCNN Model
Image Source:http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Limitations of R-CNN
- Training takes place in three different stages:
  - fine-tune the convnet using log loss
  - train the SVM classifier
  - train the bounding box regressor
- Training and inference are slow.
So, to overcome these limitations, Fast R-CNN was proposed. The main idea was to reduce the computation time of the R-CNN algorithm. Instead of running a convnet on 2000 resized region proposals, the authors run the whole input image through the convnet once to get a feature map and run a selective search algorithm to get region proposals. Each region proposal is mapped onto the feature map using RoI projection, and a fixed-dimension representation is extracted for each proposal, regardless of its original size and aspect ratio, using RoI pooling. The fixed-dimension representations are then passed to fully connected layers for prediction. Unlike R-CNN, training is done in a single stage using a combined loss (i.e. classification loss + bounding box loss).
Figure of Fast R-CNN Model
Image Source:http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Fast R-CNN significantly decreased the training time, but test time is still dominated by computing region proposals, and the selective search algorithm becomes a bottleneck for a complex dataset like COCO. So the authors moved beyond the hand-designed method for region proposals to a learned method and introduced Faster R-CNN.
In Faster R-CNN, the authors replaced the selective search method with a CNN for region proposals, which they named the Region Proposal Network (RPN). The RPN takes a feature map as input and predicts region proposals over a wide range of scales and aspect ratios. Once we have region proposals, the following steps are the same as in Fast R-CNN.
Figure of Faster R-CNN Model
Image Source: The above image was taken from paper Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”
One-stage detectors
A one-stage detector detects objects in a single pass of a convolutional neural network. These detectors are therefore fast compared to two-stage detectors and can be used for real-time applications. Examples of one-stage detectors are YOLO and SSD. We will discuss SSD and use this architecture in our project.
SSD stands for Single Shot MultiBox Detector. One of the objectives of SSD is to detect objects using a single deep neural network. Images with ground truth boxes are passed as input to SSD. A standard image classification architecture is used to extract feature maps, from which bounding boxes and scores for the presence of objects are produced, followed by non-maximum suppression to produce the final detections. However, the last feature map is very small in size, so small objects get lost in the later feature maps, which makes detecting small objects difficult. To overcome this problem, instead of using a single feature map, detection is performed at several feature map layers, which allows detection at multiple scales, as shown in the figure below:
Figure of SSD Model
Image Source: The above image was taken from paper Liu et al, SSD: Single Shot MultiBox Detector
For plant disease detection, we will use the PlantDoc dataset available as a Roboflow public dataset, because Roboflow has corrected 28 annotations and the data is also available in TFRecord format.
We will use two models, i.e. faster_rcnn_resnet101_v1_640x640 and ssd_mobilenetv2_320x320, in our project and compare their results. The Faster R-CNN model achieved an mAP score of 45.43% and the SSD model achieved an mAP score of 41.54%, so the Faster R-CNN model performed better than the SSD model by about 4%. The results are given in the table below:
We used the momentum optimizer with a cosine-decay learning rate for both models. For ssd_mobilenetv2, the learning_rate_base and warmup_learning_rate are 0.008 and 0.0133 respectively, and the batch size is 64. For faster_rcnn_resnet101, the learning_rate_base and warmup_learning_rate are 0.004 and 0.0133 respectively, and the batch size is set to 4. The nms_threshold for the first and second stage is set to 0.6 and 0.5 respectively, and max_total_detections is set to 200.
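As a usage sketch, assuming the trained checkpoints were exported to a SavedModel with the TensorFlow Object Detection API's exporter, inference on a single image looks roughly like this (the paths and the score threshold are placeholders):

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load a detection model exported with the TF Object Detection API (placeholder path).
detect_fn = tf.saved_model.load("exported_model/saved_model")

image = np.array(Image.open("test_leaf.jpg").convert("RGB"))  # placeholder image
input_tensor = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(int)

keep = scores > 0.5  # confidence threshold (assumption)
print(list(zip(classes[keep].tolist(), scores[keep].tolist())))
```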
Some of the detections are given below:
5. Generative Models
A deep learning system requires a huge amount of data for training; otherwise it will suffer from overfitting and low accuracy on complex data. In our case, the number of images per class is between 45 and 190, which is fairly small compared to what these deep learning systems require. Data augmentation has been the method of choice for increasing the number of training samples. Recently developed variants of GANs have also shown promising results in generating images, so we will look at images generated by different GAN variants such as DCGAN and ProGAN and at their usefulness as training samples.
GANs were first introduced in a 2014 paper by Ian Goodfellow and other researchers. In a GAN, the goal is to learn a data distribution and generate samples that look like they come from the original distribution. GANs are deep neural network architectures comprising two networks competing against each other, which is why they are called adversarial. The two networks are a generator and a discriminator. The generator is tasked with generating samples that look like real images from a noise vector, while the discriminator is used to distinguish generated samples from original samples. The role of the generator is to create images in such a way that they fool the discriminator, while the role of the discriminator is to distinguish real data from generated data accurately.
Figure of GAN Architecture
Image Source:www.kdnuggets.com/2017/01/generative-…-learning.html
The loss function for GAN is:
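Written out as the minimax objective from Goodfellow et al. (2014):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$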
where
D(x) is discriminator output for real data
D(G(z)) is discriminator output for generated data
The discriminator is trained to maximize the loss function so that D(x) is close to 1 and D(G(z)) is close to 0. The generator is trained to minimize log(1 − D(G(z))) so that D(G(z)) is close to 1.
Problems with GANs
- Mode Collapse: During training, the generator may collapse to a setting where it always produces the same output.
- Vanishing Gradient: The gradient of the generator network becomes close to zero during the early stages of training.
- Lack of a proper evaluation metric.
GANs are well known for being delicate and unstable to train, so a variant of GAN called the Wasserstein GAN (WGAN) was proposed. In WGAN, the Wasserstein distance, or Earth Mover's distance, is used as the GAN loss function. The Wasserstein distance is a way of measuring the distance between two probability distributions.
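In the notation of the WGAN paper (Arjovsky et al., 2017), the Wasserstein distance is defined as

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big]$$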
where
Pr = real probability distribution
Pg = generated probability distribution
γ ∈ Π(Pr, Pg) = one transport plan, where Π(Pr, Pg) is the set of all joint distributions whose marginals are Pr and Pg
In the above equation, the infimum (greatest lower bound) over all transport plans is intractable, so the authors applied the Kantorovich-Rubinstein duality to obtain a tractable form.
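In the paper's notation, this dual form is

$$W(P_r, P_g) = \sup_{\lVert f \rVert_L \le 1}\ \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{x \sim P_g}\big[f(x)\big]$$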
where the supremum is over all 1-Lipschitz functions f.
A function is 1-Lipschitz if the slope of the line between any two points x and y does not exceed 1, i.e. |f(x) − f(y)| ≤ |x − y|.
The supremum over all 1-Lipschitz functions is then replaced with a supremum over a parameterized family of functions f_w, with weights w lying in a compact space W.
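This yields the WGAN critic objective

$$\max_{w \in \mathcal{W}}\ \mathbb{E}_{x \sim P_r}\big[f_w(x)\big] - \mathbb{E}_{z \sim p(z)}\big[f_w(g_\theta(z))\big]$$

where $g_\theta$ is the generator.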
During training, it is difficult to maintain the 1-Lipschitz continuity of f_w, so the authors clamp the weights to a fixed box (W = [−0.01, 0.01]) after each gradient update to preserve the Lipschitz property.
Weight clipping to enforce the Lipschitz constraint can lead to undesired behavior such as exploding gradients when the clipping value c is large and vanishing gradients when c is small. Thus, to get rid of weight clipping, the authors of WGAN-GP proposed adding a gradient penalty term to the loss function of the critic. The critic is similar to the discriminator in a GAN, but instead of outputting a binary class it outputs a realness score for a given image. The new objective is the following.
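As given in the WGAN-GP paper (Gulrajani et al., 2017), with λ the gradient penalty coefficient:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]$$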
where
x̂ = εx + (1 − ε)x̃, where ε is sampled uniformly from [0, 1]
x = real image
x̃ = generated image
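A minimal TensorFlow sketch of the gradient penalty term, following the interpolation defined above (the `critic` model and the batch handling are assumptions):

```python
import tensorflow as tf

def gradient_penalty(critic, real_images, fake_images, batch_size):
    """Gradient penalty term of the WGAN-GP critic loss."""
    # Interpolate between real and generated images: x_hat = eps*x + (1-eps)*x_tilde.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    x_hat = eps * real_images + (1.0 - eps) * fake_images
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        scores = critic(x_hat, training=True)
    grads = tape.gradient(scores, x_hat)
    # Penalize deviation of the gradient norm from 1 (the Lipschitz constraint).
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))

# Critic loss with the penalty (lambda = 10 in the WGAN-GP paper):
# critic_loss = tf.reduce_mean(critic(fake)) - tf.reduce_mean(critic(real)) \
#               + 10.0 * gradient_penalty(critic, real, fake, batch_size)
```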
Now, we will use a DCGAN (Deep Convolutional Generative Adversarial Network) trained with the WGAN-GP loss as a proof of concept, to check whether the generated synthetic data has features similar to those of the original images. The architecture of the generator consists of one dense layer, one reshaping layer, four upsampling layers, five convolution layers, and four batch normalization layers, while the architecture of the critic consists of five convolution layers, five dropout layers, and one dense layer (a code sketch of the generator layout follows the figure below). Training the DCGAN takes a lot of resources, so we only take a sample of the dataset to train the model. The training samples were taken from three classes, i.e. Apple Scab Leaf, Apple Rust Leaf, and Apple Leaf. The images generated by DCGAN WGAN_GP are shown below:
Image generated by DCGAN WGAN_GP
Based on the above images, it can be observed that generating images with a complex background fails, because the generated images are heavily affected by background features.
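For concreteness, here is a Keras sketch of one possible generator matching the layer layout described above (one dense layer, a reshape, four upsampling layers, five convolution layers, and four batch-normalization layers). The filter widths, activations, and latent size are illustrative assumptions, not the exact configuration used in this project.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=128):
    """One dense layer, one reshape, four upsampling steps (4x4 -> 64x64),
    five convolution layers, and four batch-normalization layers."""
    return tf.keras.Sequential([
        layers.Dense(4 * 4 * 256, input_shape=(latent_dim,)),
        layers.Reshape((4, 4, 256)),
        layers.UpSampling2D(), layers.Conv2D(256, 3, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        layers.UpSampling2D(), layers.Conv2D(128, 3, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        layers.UpSampling2D(), layers.Conv2D(64, 3, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        layers.UpSampling2D(), layers.Conv2D(32, 3, padding="same"),
        layers.BatchNormalization(), layers.LeakyReLU(0.2),
        # Final convolution maps to 3 channels in [-1, 1] for a 64x64 RGB image.
        layers.Conv2D(3, 3, padding="same", activation="tanh"),
    ])
```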
We also used the PlantVillage dataset, which doesn't contain backgrounds because the images were captured in a laboratory setup. The training samples were taken from a single class, i.e. Apple Healthy Leaf. The images generated by DCGAN WGAN_GP are shown below:
Image generated by DCGAN WGAN_GP
Based on the above images, it can be observed that the generated images are much more appealing than those generated from the PlantDoc dataset; the images are beginning to capture the shape of the leaf. The DCGAN model was trained for only 2500 epochs because training the DCGAN WGAN_GP model takes a lot of resources and time. Training on the PlantVillage dataset for a longer duration could produce images that look similar to the original images.
The images generated by the DCGAN are of size 64x64. Training a DCGAN to produce higher-resolution images can be hard because of training instability, yet to use synthetic images for training purposes we need images of around 256x256.
For this reason, we used a Progressively Growing GAN (ProGAN) to generate images of size 256x256. The key idea of ProGAN is to grow the generator and discriminator progressively: both start by producing images of size 4x4 for a fixed number of iterations, and the resolution is then increased step by step by adding layers to the networks until it reaches 1024x1024. This both speeds up training and greatly stabilizes it, allowing high-resolution images to be generated.
For training ProGAN, sample images were taken from 8 classes of the PlantVillage dataset, with a target image size of 256x256. Below are the images generated by ProGAN:
Image generated by ProGAN of size 32x32
Image generated by ProGAN of size 64x64
Image generated by ProGAN of size 128x128
Due to resource constraints, we couldn't reach the target size of 256x256, so we only generated images up to size 128x128.
Based on the above images, it can be observed that the generated images possess qualities such as shape, color, and texture similar to those of real images. However, training the same ProGAN architecture on the PlantDoc dataset produces images that are largely unrecognizable because of the complex background features, so it is difficult to generate images with complex backgrounds or images taken under real-life conditions. The images generated by ProGAN on PlantVillage look similar to the original images and can potentially be used as training samples.
6. Conclusion
In this project, we proposed an approach to detect and classify plant diseases using various image classification and object detection models. We also explored GAN architectures such as DCGAN_WGAN_GP and ProGAN to generate fake images of plant leaves. We used both datasets, i.e. PlantDoc and PlantVillage, and achieved a classification accuracy of 70% with the MobileNet model, which outperforms InceptionResNetV2 by 4%. For object detection we used the improved dataset from Roboflow and achieved an mAP score of 45% with the faster_rcnn_resnet101 model, which outperforms ssd_mobilenetv2 by 4%. We also generated fake images using both datasets: images from the natural setting were very hard to generate, whereas images generated from the lab setting look similar to real images. Finally, the goal of this project, to create a web application to classify and detect plant diseases, was achieved.
7. References
PlantDoc: A Dataset for Visual Plant Disease Detection
How to Train a Progressive Growing GAN in Keras
8. Links
The whole code can be found Here
The original dataset can be found Here and Here