Evaluating Deep Neural Network-based Fire Detection for Natural Disaster Management

Recently, climate change has led to more frequent extreme weather events, introducing new challenges for Natural Disaster Management (NDM) organizations. This makes modern technological tools such as Deep Neural Network-based fire detectors a necessity, as they can help such organizations manage these extreme events more effectively. In this work, we argue that the mean Average Precision (mAP) metric that is commonly used to evaluate typical object detection algorithms cannot be trusted for the fire detection task, due to its high dependence on the employed data annotation strategy. This means that the mAP score of a fire detection algorithm may be low even when it predicts fire bounding boxes that accurately enclose the depicted fires. To address this, a new evaluation metric for fire detection is proposed, denoted as Image-level mean Average Precision (ImAP), which reduces the dependence on the bounding box annotation strategy by rewarding/penalizing bounding box predictions at the image level, rather than at the bounding box level. Experiments using different object detection algorithms show that the proposed ImAP metric reveals the true fire detection capabilities of the tested algorithms more effectively.


INTRODUCTION
Climate change has resulted in a notable upsurge in the frequency and severity of natural disasters, particularly wildfires and floods, posing significant threats to both ecosystems and human lives. Since 2000, the recorded average number of fires per year has been 70,600 [11]. These climatic events are projected to persist in the future, requiring significant improvements in the field of Natural Disaster Management (NDM). One pivotal facet of NDM is emergency response, which focuses on the safety of human lives, immediate relief provision, and the restoration of stability in disaster-stricken areas. Within emergency response, the deployment of advanced fire detection mechanisms emerges as a critical task. This technology not only detects fires at their early stages, preventing uncontrolled conflagrations, but also provides useful outputs that can be used to optimize the allocation and utilization of available firefighting resources.
Traditional image processing-based approaches for fire detection primarily relied on processing video frames using wavelet transformations [28] or combining color and motion detection to identify fire pixels [26]. More recent fire detection methodologies [9,27] typically rely upon Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) to identify and localize fires within images or video frames. DNNs and CNNs can be trained for fire detection across various scales and in a diverse range of environmental conditions, offering a more effective solution when compared to conventional sensors. However, they also face some challenges, primarily due to their reliance on both the quality and quantity of available data. In order to achieve good generalization ability, DNNs/CNNs typically require annotated datasets that encompass a wide array of scenes and numerous fire-related scenarios. Additionally, the development of larger, high-accuracy models usually entails an increased computational demand, posing a greater challenge for real-time fire detection.
Nowadays, the prevailing approaches for fire detection [2,20,21] leverage advanced object detection algorithms built on CNNs [13,15], which are well suited to handling spatial information. Conversely, some methods opt for Transformer-based approaches [1,31], which, despite being slower, utilize architectures capable of capturing the global context of an image.
All these approaches utilize the mean Average Precision (mAP) metric to evaluate their performance in detecting objects/fires, which rewards/penalizes object/fire bounding box predictions based on their alignment with the corresponding ground-truth boxes. For most objects, such as cars, numerous "children" objects that belong to different classes (e.g., car wheel, car window) collectively make up the "parent" object (car). Consequently, each "parent" object corresponds to exactly one ground-truth bounding box. However, in the case of objects like fire, "children" objects belong to the same class as the "parent" object (fire), which creates uncertainty regarding whether each "child" is, in fact, a "parent" object. This uncommon property of fire entities introduces uncertainty for both human annotators and DNNs/CNNs concerning the number of bounding boxes required to represent a fire object accurately. This is illustrated in Fig. 1, where, although all annotation styles are deemed correct, it is probable that only one of them will align with the predicted bounding boxes (case A, sub-figures). In cases like the ones depicted in sub-figures c) and d), the mAP scores do not represent the actual object/fire detection performance of the detectors. To tackle this, we propose a new evaluation metric for fire detection, namely Image-level mean Average Precision (ImAP). Instead of looking at each predicted bounding box separately, ImAP evaluates the fire detection models on their ability to predict fire object bounding boxes in the whole image. Experiments using different object detectors show that the proposed metric is more suitable for evaluating these models in the fire detection task.

RELATED WORK
Object detection involves identifying and localizing numerous distinct objects within an image. Training DNNs to identify specific objects typically requires a manually annotated dataset, where each object of interest is outlined by its corresponding ground-truth bounding box and labeled with its associated class. At test time, object detection algorithms typically output the predicted bounding box coordinates (in a pre-defined format), the object class and the corresponding prediction confidence score. The correctness of a bounding box prediction ($B_p$) with respect to its corresponding ground truth ($B_{gt}$) is measured using the Intersection over Union (IoU) metric. This metric computes their overlapping area divided by their union area, as depicted in Fig. 2, and it is defined as:

$$\mathrm{IoU}(B_p, B_{gt}) = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \qquad (1)$$

Based on Eq. (1) and the ground-truth and predicted classes $C_{gt}$, $C_p$ respectively, True Positives (TP), False Positives (FP) and False Negatives (FN) are defined as follows:
• True Positive (TP): A prediction for which the IoU of the predicted bounding box with the corresponding ground truth is higher than a threshold $T$ and both of them belong to the same class, $\mathrm{IoU}(B_p, B_{gt}) > T$ AND $C_p = C_{gt}$.
• False Positive (FP): A prediction for which the IoU of the predicted bounding box with the corresponding ground truth is lower than a threshold $T$, or the predicted box and the ground truth do not belong to the same class, $\mathrm{IoU}(B_p, B_{gt}) < T$ OR $C_p \neq C_{gt}$.
• False Negative (FN): A ground-truth bounding box which the DNN fails to detect.

It is important to highlight that if several predictions satisfy the conditions to be counted as TP for a particular ground-truth bounding box, only the one with the highest confidence score is marked as TP, and the remaining ones are classified as FP. Then, in order to calculate the average precision metric, the precision and recall metrics are utilized.
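The IoU computation and the TP/FP/FN assignment described above can be sketched as follows. This is a minimal single-class illustration under stated assumptions, not the evaluation code used in this work: the box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `match_detections`, and the choice to match each prediction against the best still-unmatched ground truth are all assumptions made for the example.

```python
# Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, thr=0.5):
    """Flag each prediction as TP (True) or FP (False) and count FNs.

    preds: list of (box, confidence); gts: list of boxes (one class only).
    Predictions are visited in descending confidence, so for each ground
    truth only the most confident matching prediction becomes a TP.
    """
    order = sorted(range(len(preds)), key=lambda i: preds[i][1], reverse=True)
    taken = [False] * len(gts)
    is_tp = [False] * len(preds)
    for i in order:
        box = preds[i][0]
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if taken[j]:
                continue  # this ground truth already has its TP
            value = iou(box, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou > thr:
            taken[best_j] = True
            is_tp[i] = True
    false_negatives = taken.count(False)
    return is_tp, false_negatives


# One large ground-truth fire box, predicted as two adjacent halves.
gts = [(0, 0, 20, 10)]
preds = [((0, 0, 10, 10), 0.9), ((10, 0, 20, 10), 0.8)]
print(match_detections(preds, gts))
```

In the toy case at the bottom, each half-box reaches an IoU of exactly 0.5 with the ground truth, so with a strict threshold of 0.5 both predictions are counted as FP and the ground truth as FN, even though together they cover the fire completely.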
The precision metric (Eq. 2) is the fraction of the model's predictions that are correct. A higher precision value implies a greater likelihood that a given prediction is correct. On the other hand, the recall metric (Eq. 3) is the fraction of ground-truth objects that the model detects.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

Both of these metrics provide information about the weaknesses and strengths of detection algorithms. However, selecting the best algorithm becomes a challenging task without a single scalar metric. Additionally, neither metric is influenced by the confidence scores of the predicted bounding boxes. These disadvantages are addressed by the Average Precision (AP) metric, which calculates the area under the precision-recall curve (PR-curve) depicted in Fig. 3. In Fig. 3, the X-axis represents the recall rate, while the Y-axis denotes the corresponding precision values. In order to generate the PR-curve, all predictions must be arranged in descending order based on their confidence scores. Subsequently, each TP prediction is given a value DT=1, while each FP one is given a value of DT=0. The Y-axis and X-axis values of the PR-curve for the n-th prediction are then defined as:

$$P_n = \frac{1}{n} \sum_{i=1}^{n} DT_i \qquad (4)$$

$$R_n = \frac{1}{N_{gt}} \sum_{i=1}^{n} DT_i \qquad (5)$$

where $N_{gt}$ is the total number of ground-truth bounding boxes. However, drawing the PR-curve by linear interpolation of the points produced by Eqs. (4) and (5) causes many "zigzags" on the curve, as shown in Fig. 3, which may lead to inaccurate evaluation results [17]. This phenomenon arises because, when consecutive FP predictions are followed by a TP, the precision value of the TP is higher than the minimum of the FPs (consecutive FPs have the same recall and different precision values). All AP variations select the maximum precision from the right, with the aim of eliminating the error that the "zigzag" curve produces. Therefore, the precision for a specific recall value $R$ is the highest precision achieved among all recalls $R'$ where $R' \geq R$ [6]. Based on the updated piece-wise constant curve (Fig. 3), object detection challenges [6,7,14] utilize either N-point [19] or all-point [30] interpolation for the AP metric computation.
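As an illustration of how the PR-curve points of Eqs. 4 and 5 and the "maximum precision from the right" rule could be computed, consider the following sketch. The function names `pr_points` and `interpolate`, and the toy TP/FP sequence, are assumptions made for the example.

```python
def pr_points(is_tp, n_gt):
    """Precision and recall after each prediction.

    is_tp: TP/FP flags (DT values) already sorted by descending confidence.
    n_gt:  total number of ground-truth boxes.
    """
    precisions, recalls = [], []
    cumulative_tp = 0
    for n, dt in enumerate(is_tp, start=1):
        cumulative_tp += 1 if dt else 0
        precisions.append(cumulative_tp / n)   # denominator grows with n
        recalls.append(cumulative_tp / n_gt)   # denominator is constant
    return precisions, recalls


def interpolate(precisions):
    """Running maximum of the precision values, taken from the right.

    At the first point of every distinct recall level, the result equals the
    highest precision among all points with an equal or larger recall.
    """
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed


# Toy ranking: TP, FP, TP, FP, TP with four ground-truth boxes.
precisions, recalls = pr_points([True, False, True, False, True], n_gt=4)
print(recalls)
print(precisions)
print(interpolate(precisions))
```

In the toy ranking, the second prediction (an FP) has precision 0.5 while the third (a TP) has precision 2/3, which is the "zigzag" the interpolation removes.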
The N-point interpolation creates a set of N equally spaced recall values $R' = \{\frac{1}{N}, \frac{2}{N}, \ldots, \frac{N-1}{N}, \frac{N}{N}\}$ in order to compute the average of their corresponding precision values [17].
All-point interpolation calculates the AP across all recall values generated by Eq. 5. While this approach is more accurate than N-point interpolation, it may be computationally inefficient when applied to large datasets [29].
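The two interpolation schemes can be compared with a short sketch. It assumes that the N recall levels are 1/N, 2/N, ..., N/N and that the interpolated precision is zero beyond the largest recall reached; the function names and the toy PR points are hypothetical.

```python
def interpolated_precision_at(r, precisions, recalls):
    """Highest precision among all points whose recall is >= r (0.0 if none)."""
    candidates = [p for p, rc in zip(precisions, recalls) if rc >= r]
    return max(candidates) if candidates else 0.0


def ap_n_point(precisions, recalls, n=10):
    """Mean interpolated precision over the recall levels 1/n, 2/n, ..., n/n."""
    levels = [k / n for k in range(1, n + 1)]
    total = sum(interpolated_precision_at(r, precisions, recalls) for r in levels)
    return total / n


def ap_all_point(precisions, recalls):
    """Area under the piece-wise constant interpolated PR-curve."""
    ap, previous_recall = 0.0, 0.0
    for r in sorted(set(recalls)):
        ap += (r - previous_recall) * interpolated_precision_at(r, precisions, recalls)
        previous_recall = r
    return ap


# Toy PR points for the ranking TP, FP, TP, FP, TP with four ground truths.
precisions = [1.0, 0.5, 2 / 3, 0.5, 0.6]
recalls = [0.25, 0.25, 0.5, 0.5, 0.75]
print(ap_all_point(precisions, recalls))
print(ap_n_point(precisions, recalls, n=4))
```

With four ground truths and N = 4, the recall levels coincide with the recall values actually reached, so both schemes sum the same rectangles. With other values of N, the N-point result is only an approximation of the all-point area.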

IMAGE-LEVEL MEAN AVERAGE PRECISION
In the field of object detection, each image is linked to a set of fire predictions (Preds) and a set of ground truths (Gts). Each element of Gts comprises Bounding Box coordinates (BB) and the corresponding Class (CLS) label, while each element of Preds also includes a Confidence Score (CS). Unlike most objects, fire objects can be represented by various BB combinations, leading to discrepancies between the predicted and the ground-truth bounding boxes (cases b, c, of Fig. 1). Handling these scenarios requires an IoU between the sets Preds and Gts, in order to evaluate the overall fire prediction performance within an image. Image-level Intersection over Union (ImIoU) is a modification of the IoU that measures how well the union of Preds fits the union of Gts (Fig. 5):

$$\mathrm{ImIoU}(\mathrm{Preds}, \mathrm{Gts}) = \frac{\mathrm{area}\left(\bigcup_{i=1}^{N_p} BB^{p}_{i} \,\cap\, \bigcup_{j=1}^{N_{gt}} BB^{gt}_{j}\right)}{\mathrm{area}\left(\bigcup_{i=1}^{N_p} BB^{p}_{i} \,\cup\, \bigcup_{j=1}^{N_{gt}} BB^{gt}_{j}\right)}$$

where $N_p$, $N_{gt}$ are the lengths of the prediction and ground-truth sets, respectively. Therefore, we redefine True Positives, False Positives, False Negatives and True Negatives at the image level, based on the ImIoU value and a threshold $T$. When two small bounding boxes in one set correspond to a bigger bounding box in the other, the part of the union that lies outside the intersection causes a drop in the ImIoU value. Comparing ImAP@[0.5 : 0.05 : 0.95] with ImAP@[0.5] theoretically, the latter can tolerate these 'error areas' thanks to its low ImIoU threshold $T$, while the metric with large thresholds incorrectly evaluates image-level predictions as FP. Thus, ImAP@[0.5] fulfils the purposes of image-level evaluation, while ImAP@[0.5 : 0.05 : 0.95] tends to behave like a combination of the box-to-box mAP and ImAP.
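A minimal sketch of the ImIoU idea, that is, the IoU between the union of all predicted boxes and the union of all ground-truth boxes in one image, is given below. It is an illustration under assumptions rather than the implementation used in this work: boxes are taken to be `(x_min, y_min, x_max, y_max)` tuples, the union areas are computed on the grid formed by all box edges, an image without any boxes is given an ImIoU of 0.0, and the names `union_cells` and `im_iou` are hypothetical.

```python
def union_cells(boxes, xs, ys):
    """Grid cells (i, j) covered by at least one box.

    xs, ys: sorted edge coordinates of all boxes; cell (i, j) spans
    [xs[i], xs[i + 1]] x [ys[j], ys[j + 1]].
    """
    cells = set()
    for x0, y0, x1, y1 in boxes:
        for i in range(len(xs) - 1):
            if xs[i] >= x0 and xs[i + 1] <= x1:
                for j in range(len(ys) - 1):
                    if ys[j] >= y0 and ys[j + 1] <= y1:
                        cells.add((i, j))
    return cells


def im_iou(pred_boxes, gt_boxes):
    """IoU between the union of predictions and the union of ground truths."""
    all_boxes = list(pred_boxes) + list(gt_boxes)
    xs = sorted({v for b in all_boxes for v in (b[0], b[2])})
    ys = sorted({v for b in all_boxes for v in (b[1], b[3])})

    def area(cells):
        return sum((xs[i + 1] - xs[i]) * (ys[j + 1] - ys[j]) for i, j in cells)

    pred_cells = union_cells(pred_boxes, xs, ys)
    gt_cells = union_cells(gt_boxes, xs, ys)
    union = area(pred_cells | gt_cells)
    # Assumption: an image with no boxes at all yields 0.0.
    return area(pred_cells & gt_cells) / union if union > 0 else 0.0


# One large ground-truth fire box predicted as two adjacent halves.
print(im_iou([(0, 0, 10, 10), (10, 0, 20, 10)], [(0, 0, 20, 10)]))
# Only one half is predicted.
print(im_iou([(0, 0, 10, 10)], [(0, 0, 20, 10)]))
```

The first call shows the behaviour the metric is designed for: two half-boxes that individually reach a box-level IoU of only 0.5 with the ground truth cover it completely as a set. Rasterizing the boxes into binary masks would be an equally valid way to compute the union areas.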
DNN-based object detection models often predict more objects than actually exist. High-confidence predictions typically align closely with the ground-truth boxes. In contrast, low-confidence predictions may not correspond to any target and often exhibit low or zero IoU with the corresponding ground truth. These incorrect predictions expand the "error area", resulting in a large number of false positives due to low ImIoU. To optimize the evaluation performance of our metric, we need to identify the CS threshold that maximizes the ImAP. By filtering predictions based on this threshold, we retain only the essential predictions that best match the union of ground truths.
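The threshold search described above amounts to a sweep over candidate CS values. The sketch below is hypothetical: `best_confidence_threshold` accepts any scoring function, and `toy_score` is only a stand-in (a set-overlap score on labelled toy predictions), not the ImAP metric itself.

```python
def best_confidence_threshold(preds, score_fn, thresholds):
    """Return the (threshold, score) pair that maximizes score_fn.

    preds:      list of (item, confidence) pairs.
    score_fn:   callable that maps the filtered list to a float, for
                example an ImAP implementation.
    thresholds: candidate confidence-score thresholds to try.
    """
    best_thr, best_score = None, float("-inf")
    for thr in thresholds:
        kept = [p for p in preds if p[1] >= thr]
        score = score_fn(kept)
        if score > best_score:  # on ties, the first threshold tried wins
            best_thr, best_score = thr, score
    return best_thr, best_score


# Stand-in score: overlap between the kept predictions and the set of
# predictions that are known to be correct in this toy example.
CORRECT = {"a", "b"}


def toy_score(kept):
    names = {name for name, _ in kept}
    union = names | CORRECT
    return len(names & CORRECT) / len(union) if union else 0.0


toy_preds = [("a", 0.9), ("b", 0.6), ("c", 0.3), ("d", 0.1)]
print(best_confidence_threshold(toy_preds, toy_score, [0.0, 0.2, 0.5, 0.8]))
```

In the toy data, a threshold of zero keeps the two spurious low-confidence predictions, while a very high threshold discards a correct one. The sweep selects the value in between.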
For natural disaster management, visualizing and analyzing fire detection results is crucial. Setting the confidence score threshold to zero often results in poor visualizations due to a high number of false positives. Moreover, manually selecting the threshold can be imprecise. By setting the threshold to the value that maximizes ImAP, we can filter out false positives while retaining the predictions that best capture fire within an image. The mAP metric cannot identify this threshold, because removing detections decreases its value. When the AP algorithm evaluates the low-confidence predictions in order to draw the rightmost points of the PR-curve, FP detections affect the metric less than the potential TPs that would be removed by the filtering. The reason, based on Eqs. (4) and (5), is that a low-confidence TP extends the PR-curve by the same amount as a high-confidence TP, since the denominator of Eq. 5 is constant and equal to the number of ground truths. In contrast to recall, the changes in precision are small regardless of the prediction result, as the denominator of Eq. 4 is equal to n, which is large for the lowest-ranked predictions.

EXPERIMENTS

Dataset
In order to train deep neural networks across a wide range of fire scenarios and environments, such as cities, forests, and aerial images, we combine three datasets: dfire [3], jhope [12], and crossican [22,23]. Crossican, originally a segmentation dataset for forest fires, is converted into a detection dataset using image processing. Jhope, sourced from roboflow, contains diverse fire types. Dfire is a fire-smoke dataset from which we remove the smoke boxes, while retaining all images for training, including those left without ground truth. This approach helps deep neural networks distinguish fires from fire-like objects. We also created a test set that emphasizes forest fire and wildfire scenarios, in line with the purposes of natural disaster management.
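The text does not specify which image processing steps are used to turn segmentation masks into bounding boxes. One common option, shown here purely as an assumption, is to draw one box around every connected region of fire pixels. The function name `mask_to_boxes`, the 4-connectivity and the exclusive maximum coordinates are choices made for this sketch.

```python
from collections import deque


def mask_to_boxes(mask):
    """One (x_min, y_min, x_max, y_max) box per 4-connected region.

    mask: list of rows, where truthy values mark fire pixels.
    x_max and y_max are exclusive, so a single pixel gives a 1x1 box.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search over the region that contains (x, y).
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height \
                            and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes


toy_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(mask_to_boxes(toy_mask))
```

Whether nearby regions are merged into one box or kept separate is itself an annotation-style decision of the kind discussed in the introduction, so a different conversion rule would yield different ground-truth boxes for the same fire.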

DNN-based Object Detection
The choice of DNN models was not guided solely by which real-time object detectors are the most recent. Models process image information in diverse ways, resulting in differences between their detections. CNN-based models [10,18,25] have been dominant in computer vision, due to their ability to extract rich spatial information. Yolo-v8 [13] is a powerful CNN-based real-time detector, which has the best accuracy among the architectures of the YOLO [15,25] family. It extracts 3 feature maps of different scales produced by the backbone and transfers information from one map to another via down-sampling and up-sampling. Subsequently, predictions are generated from each new feature map. In contrast to Yolo-v8, Faster-RCNN [18] generates its predictions based on the Region Proposal Network (RPN). The RPN, for every vector of the last feature map, predicts the coordinates of many bounding boxes along with their confidence scores. Then, for every proposal, Faster-RCNN extracts a fixed-size feature representation from the feature map, from which it predicts the final class and the refined bounding box coordinates.

The rise of Natural Language Processing, thanks to transformer-based [4,24] architectures with excellent capabilities for capturing long-range dependencies, also made an impact in the field of computer vision. Soon enough, architectures like the Vision Transformer (ViT) [5] and the Detection Transformer (DETR) [1] demonstrated superior performance compared to traditional CNN-based approaches. RT-DETR [16] is a state-of-the-art real-time object detector. The RT-DETR architecture consists of a ResNet backbone, a hybrid encoder that transfers information between the last three feature maps produced by the backbone, and a decoder comprising several stacked transformer decoder layers. From the output of the decoder, RT-DETR predicts the bounding boxes along with their associated classes.

Experimental Setup
All models were trained for 72 epochs with an input image size of 640, using each model's recommended setup. Faster R-CNN was trained using the Stochastic Gradient Descent (SGD) optimizer, with a learning rate of $10^{-3}$ and a batch size of 4. RT-DETR employed the AdamW optimizer, with learning rates of $10^{-4}$ and $10^{-5}$, and the same batch size of 4. Lastly, YOLOv8 was trained using SGD, with a higher learning rate of $10^{-2}$ and a larger batch size of 16.

Experimental Results
Fig. 6 depicts the ImAP@[0.5 : 0.05 : 0.95] and ImAP@[0.5] scores in relation to the Confidence Score (CS) threshold. It is crucial to emphasize that maintaining a zero or constant CS threshold across all models, as observed in Fig. 6, can lead to an unfair and biased comparison of object detectors. By selecting the CS threshold that maximizes ImAP, we obtain a robust metric that reveals the maximum fire detection performance of the models.
The maximum ImAP values for each detector are presented in Table 1, accompanied by mAP results for a comprehensive assessment. Notably, the strong correlation between ImAP@[0.5 : 0.05 : 0.95] and mAP (Table 2) indicates that, as the ImIoU threshold increases, the ImAP metric mirrors the behavior of mAP, leaving no margin for "error" areas in fire detection. However, ImAP@[0.5] shows a lower correlation with mAP, since it can tolerate the discrepancies between predicted and ground-truth bounding boxes (Fig. 5, case C). Consequently, based on its results, we can identify the model that excels in predicting fire within an image.

CONCLUSIONS
In this work, a new metric for evaluating fire detection algorithms, called Image-level mean Average Precision (ImAP), is proposed. Due to the particular nature of the fire detection task, the proposed metric measures how well the overall fire is detected in the whole image, extending the bounding box-per-bounding box evaluation protocol followed by the typical mAP metric. Experiments using a variety of object detection algorithms and a challenging fire detection dataset have shown that the proposed metric can more accurately capture and represent the actual performance of the fire detectors. As a result, it can serve as a very useful tool for a wide range of NDM applications. Additional experiments also show that, for increased threshold values, the proposed ImAP metric behaves similarly to the typical mAP one. Finally, it is shown that ImAP with a threshold value $T = 0.5$ provides very useful insights for selecting the most appropriate fire detection model.

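Figure 3: Precision-Recall curve.

All-point interpolation computes $AP = \sum_{n=1}^{N_p} (R_n - R_{n-1}) \max_{R' \geq R_n} P(R')$, with $R_0 = 0$ and $N_p$ the total number of predictions made by the model. In both approaches, the AP metric is computed for each class separately. The mean Average Precision (mAP) metric then evaluates the performance of the model across all classes: $mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c$, where $C$ is the number of classes.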

Figure 4: A visualization of the Image-level Intersection over Union.

Figure 5: Comparison of the IoU with the ImIoU for three different cases. Above the images are the ImIoU results along with the detection results for each scenario. Below the images are the IoU values of each prediction P1, P2, P3 with their corresponding ground truth. We observe that ImIoU evaluates the images as True Positives regardless of the annotation style. In contrast, IoU will in most cases produce many incorrect FPs, affecting the mAP.

Table 1: Comparative evaluation of Faster-RCNN, YOLO-v8, and RT-DETR DNN methods using mAP and ImAP evaluation metrics.

Table 2: Correlation between the Table 1 columns.