Object Detection in Unstructured Driving Environments

This paper conducts a comprehensive error analysis of the inference performed by the YOLOv8 and RT-DETR models on two distinct datasets: MS COCO and the India Driving Dataset (IDD).


Problem Statement
Despite the widespread adoption of object detection models such as YOLOv8 and RT-DETR, there remains a critical need to understand how their performance varies across datasets. This paper addresses this gap by conducting an error analysis of YOLOv8 and RT-DETR inference on two distinct datasets: MS COCO, the dataset on which both models are trained, and IDD, an unseen dataset. The specific focus is on evaluating model performance using the mean Average Precision (mAP) and Intersection over Union (IoU) metrics. By identifying and analyzing discrepancies in model performance across these datasets, this study seeks to provide insights into the models' effectiveness and limitations in real-world scenarios.
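Since IoU underlies both evaluation metrics used here, a minimal reference implementation may help make the analysis concrete. The sketch below computes IoU for two axis-aligned boxes in (x1, y1, x2, y2) form; the function name and box format are illustrative, not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two identical boxes give an IoU of 1.0, while a box shifted by half its width against an equal-sized box gives 1/3.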

YOLO
YOLOv8 [5] utilizes a deep neural network with numerous convolutional layers, including backbone components such as CSPDarknet53 and SPP (Spatial Pyramid Pooling), followed by detection layers. YOLOv8 aims to strike a balance between speed and accuracy, which is crucial for real-time applications. It achieves this by optimizing various components of the network, including the backbone architecture, training strategies, and post-processing techniques. YOLOv8 introduces optimizations that enhance inference speed without compromising accuracy: techniques such as model pruning, network quantization, and efficient post-processing are employed to achieve real-time performance on resource-constrained devices. The paper provides comprehensive experimental results on benchmark datasets, demonstrating the superior performance of YOLOv8 over previous versions and other state-of-the-art object detection models in terms of both speed and accuracy.

RT-DETR
RT-DETR [2], a groundbreaking object detector developed by Baidu, combines Vision Transformers (ViT) with innovative techniques to achieve real-time performance without compromising accuracy. Its efficient architecture processes multi-scale features by decoupling intra-scale interaction from cross-scale fusion, reducing computational cost and enabling rapid detection. Notably, it features IoU-aware query selection for improved detection accuracy and supports flexible adjustment of inference speed by varying the number of decoder layers, making it highly adaptable to diverse real-time scenarios. Compatible with accelerated backends such as CUDA with TensorRT, RT-DETR surpasses many existing real-time detectors, including YOLO models, in performance.

MS COCO 2017
With the dataset YAML file in place, we executed the YOLOv8 and RT-DETR models to generate results and predictions. These outputs serve as the basis for a rigorous analysis of both the successes and errors of the detection process.
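The dataset YAML step above can be sketched as follows. The paths and class list are hypothetical placeholders, not the actual configuration used in this work; the commented inference calls show how such a file would typically be consumed (they require the ultralytics package and downloaded weights, so they are not executed here).

```python
from pathlib import Path

# Hypothetical IDD dataset config in the Ultralytics YAML format.
# Paths and class names are illustrative, not the paper's actual files.
idd_yaml = """\
path: datasets/idd
train: images/train
val: images/val
names:
  0: person
  1: rider
  2: car
  3: truck
  4: bus
  5: motorcycle
"""

def write_dataset_yaml(text, out="idd.yaml"):
    """Write the dataset config to disk and return its path."""
    path = Path(out)
    path.write_text(text)
    return path

# With the YAML written, validation would be run along these lines:
#   from ultralytics import YOLO, RTDETR
#   YOLO("yolov8n.pt").val(data="idd.yaml")
#   RTDETR("rtdetr-l.pt").val(data="idd.yaml")
```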

Results
Upon conducting experiments, we observed that the reported results were successfully reproduced in our experimental setup. Specifically, models trained on the COCO dataset exhibited remarkable performance when evaluated on COCO's validation data, consistent with previous findings in the literature. However, when applied to the IDD dataset, these models yielded poor results. This discrepancy may be attributed to the unique challenges present in the IDD dataset, such as occlusions and the traffic conditions characteristic of Indian roads. These conditions differ significantly from those encountered in the COCO dataset, highlighting the importance of dataset diversity and the need for specialized models to address specific environmental contexts.

COCO
For the COCO dataset, a thorough examination was conducted to assess the model's performance in object detection. Of the 4900 images, a significant portion, approximately 1850 images, contained ground truths the model failed to recognize: in each of these images, over 50% of the ground truths remained undetected even at a stringent 0.6 Intersection over Union (IoU) threshold. This observation highlights a considerable challenge faced by the model in accurately identifying objects within the COCO dataset. To explore the extent of this challenge, a closer examination was conducted on the subset of images with the highest number of ground truths, in order to determine whether the model's performance varied significantly with object density. A graphical representation of how the percentage of undetected ground truths fluctuated across the top 100 images with the most ground truths shed light on potential patterns or anomalies in the model's behavior under varying object densities.
Similarly, the investigation extended to the IDD dataset, which encompasses a diverse array of urban scenes captured from onboard vehicle cameras. Of the 4762 images scrutinized from this dataset, a noteworthy trend emerged: 3078 images exhibited a significant shortfall in ground-truth detection. Again applying the 0.6 IoU threshold, more than 50% of ground truths remained undetected in these images, indicative of the model's difficulty in accurately identifying objects within urban environments. In essence, the findings from both the COCO and IDD datasets underscore the nuanced challenges the model faces in object detection tasks, ranging from diverse object categories to varying environmental contexts. By meticulously analyzing the prevalence of undetected ground truths across a substantial number of images, this experiment provides valuable insights into the limitations of, and areas for improvement in, contemporary object detection models.
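The per-image measurement described above can be sketched as follows. The greedy one-to-one matching shown here is an assumption for illustration, since the matching procedure is not specified in the text; the function name and box format (x1, y1, x2, y2) are likewise hypothetical.

```python
def undetected_fraction(gt_boxes, pred_boxes, iou_thr=0.6):
    """Fraction of ground-truth boxes not matched by any prediction
    at the given IoU threshold, using greedy one-to-one matching.
    An image 'fails' under the paper's criterion if this exceeds 0.5."""
    def iou(a, b):
        # Standard IoU for axis-aligned boxes (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    if not gt_boxes:
        return 0.0
    unmatched = 0
    used = set()  # indices of predictions already matched to a ground truth
    for g in gt_boxes:
        best, best_iou = None, iou_thr
        for i, p in enumerate(pred_boxes):
            if i in used:
                continue
            v = iou(g, p)
            if v >= best_iou:
                best, best_iou = i, v
        if best is None:
            unmatched += 1
        else:
            used.add(best)
    return unmatched / len(gt_boxes)
```

For instance, with two ground truths and a prediction matching only one of them, the function returns 0.5, exactly the boundary of the ">50% undetected" criterion used in the analysis.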