UAV Detection System

Abstract

The proliferation of small Unmanned Aerial Vehicles (UAVs) poses significant security challenges to critical infrastructure. Detecting these small targets in real-time on edge devices is difficult due to their low radar cross-sections and complex environmental backgrounds. This report presents a computer vision system designed to address these challenges using a YOLOv8-based architecture. The system is enhanced by two key strategies: (1) Multiscale Processing using a 5-crop inference mechanism with Non-Maximum Weighted (NMW) fusion to simulate a zoom effect without interpolation loss; and (2) Cross-Head Knowledge Distillation (CrossKD) combined with Progressive KD to efficiently transfer detection-sensitive features from a large Teacher model to a lightweight Student model.

Technical Challenges

As commercial drones become more accessible, the risk of unauthorized surveillance and airspace intrusion increases.

Small Object Size

Drones often occupy a tiny fraction of the frame (<5%), and standard resizing in object detectors causes significant feature loss.

Computational Cost

High-accuracy models (e.g., YOLOv8-X) are too heavy for edge devices. The goal is the speed of a 'Nano' model with accuracy close to that of a 'Large' model.

Domain Generalization

The system must generalize across diverse environments including clear sky, fog, night, and various weather conditions.

Methodology

Our approach combines two powerful strategies to achieve high accuracy with real-time performance.

Multiscale Processing

To resolve the small object issue, we implemented a grid cropping strategy during inference. Instead of resizing the entire large image down to the model input size (which destroys small details), we process the image in segments.

5-Crop Pattern

The image is divided into 5 crops: 4 anchored at the corners and 1 at the center
Each crop covers approximately 55-65% of the original frame
Each crop is resized to the model input size (320×256), creating a "zoom" effect
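The crop layout above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `five_crop_boxes` and the parameter `frac` are hypothetical, and the 55-65% coverage is interpreted here as a fraction of each frame dimension (0.6 by default), which makes neighbouring crops overlap. The report does not say whether the figure refers to area or to side length.

```python
def five_crop_boxes(width, height, frac=0.6):
    """Return five (x1, y1, x2, y2) crop windows: four corners, then center.

    `frac` is an assumed per-dimension coverage, not a value from the report.
    """
    crop_w, crop_h = round(width * frac), round(height * frac)
    right, bottom = width - crop_w, height - crop_h   # top-left of the far crops
    center = (right // 2, bottom // 2)                # top-left of the center crop
    origins = [(0, 0), (right, 0), (0, bottom), (right, bottom), center]
    return [(x, y, x + crop_w, y + crop_h) for x, y in origins]


boxes = five_crop_boxes(1920, 1080)
for box in boxes:
    print(box)
```

Because every window has the same size, the five crops can be stacked into a single batch after resizing.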

NMW Fusion

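Merging the per-crop results with standard NMS proved too aggressive, because NMS keeps only the top-scoring box and discards every overlapping one. We adopted Non-Maximum Weighted (NMW) fusion instead, which replaces each group of overlapping boxes with a confidence-weighted average of their coordinates.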

Note: Using only the 5 crops (discarding the full-frame original image) yielded the best F1-score and AP, as the original resized frame often introduced false positives.
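A minimal sketch of NMW-style fusion follows. It assumes detections are already in full-frame coordinates and weights each box by its confidence multiplied by its IoU with the cluster's top-scoring box, which is how NMW is commonly described; the report itself only states that confidence scores are used. The names `iou` and `nmw_fuse`, the 0.5 threshold, and the choice to keep the top score for the fused box are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2, ...)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nmw_fuse(dets, iou_thr=0.5):
    """Fuse (x1, y1, x2, y2, score) detections; iou_thr is an assumed value."""
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    fused = []
    while remaining:
        top, rest = remaining[0], remaining[1:]
        overlaps = [iou(top, d) for d in rest]
        # The top box always joins its own cluster with weight = its score.
        cluster = [(top, top[4])]
        cluster += [(d, d[4] * o) for d, o in zip(rest, overlaps) if o >= iou_thr]
        remaining = [d for d, o in zip(rest, overlaps) if o < iou_thr]
        total = sum(w for _, w in cluster)
        if total > 0:
            box = tuple(sum(w * d[k] for d, w in cluster) / total for k in range(4))
        else:
            box = tuple(top[:4])
        fused.append(box + (top[4],))  # assumption: fused box keeps the top score
    return fused


dets = [
    (100, 100, 140, 140, 0.9),   # same drone seen in two overlapping crops
    (104, 100, 144, 140, 0.6),
    (400, 300, 430, 330, 0.7),   # a separate detection
]
fused = nmw_fuse(dets)
```

Unlike NMS, the lower-confidence duplicate still pulls the fused box slightly toward its own position instead of being thrown away.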

Knowledge Distillation

To achieve low latency, we distilled knowledge from a heavy Teacher model to a lightweight Student model (YOLOv8-Nano) using Cross-Head Knowledge Distillation (CrossKD).

Progressive Distillation

Directly distilling from a massive model to a tiny one often causes "Knowledge Shock" due to the capacity gap. We employed a progressive pipeline:

Teacher (v8-X) → Intermediate (v8-L) → Student (v8-Nano)

Cross-Head Knowledge Distillation (CrossKD)

Instead of traditional feature mimicking, CrossKD establishes cross-connections between Teacher and Student detection heads:

Student Features → Teacher Head: Forces the Student's backbone to learn features robust enough for the Teacher's powerful detection head
Teacher Features → Student Head: Trains the Student's head to process high-quality, complex features from the Teacher

By interacting directly at the head level, the model learns better bounding box regression and focuses on detection-sensitive features rather than background noise.
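The two cross-connections can be illustrated with a toy loss. This is a heavily simplified sketch, not the training code: each detection head is reduced to a single linear layer over a feature vector, only classification logits are considered, and both cross-head predictions are matched to the Teacher's own prediction with a KL divergence. The choice of target and all names (`linear_head`, `cross_kd_loss`, the weight matrices) are assumptions made for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_div(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def linear_head(weights, features):
    """Toy stand-in for a detection head: one logit per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cross_kd_loss(f_student, f_teacher, w_student, w_teacher):
    target = softmax(linear_head(w_teacher, f_teacher))  # Teacher's own prediction
    s2t = softmax(linear_head(w_teacher, f_student))     # Student features -> Teacher head
    t2s = softmax(linear_head(w_student, f_teacher))     # Teacher features -> Student head
    return kl_div(target, s2t) + kl_div(target, t2s)


w_teacher = [[1.0, 0.0], [0.0, 1.0]]
w_student = [[0.5, 0.0], [0.0, 0.5]]
f_teacher = [2.0, 0.0]
f_student = [1.0, 0.5]

loss = cross_kd_loss(f_student, f_teacher, w_student, w_teacher)
matched = cross_kd_loss(f_teacher, f_teacher, w_teacher, w_teacher)
```

In this toy setup the loss is zero only when both cross-head predictions match the Teacher's own prediction, as they do when features and heads are identical. In a real detector the same idea would also apply to the box-regression outputs.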

System Pipeline

Overview of our training and inference architecture

Phase 1: Training Pipeline

Step 1: Robust Init

Transfer Learning

Source: DUT Anti-UAV Dataset (10k images)

Pre-training YOLOv8 on a diverse dataset to learn general drone features (shape, motion) before seeing the target data.

Step 2: Progressive KD

Knowledge Distillation

Target: VIP Cup 2025 Dataset
YOLOv8-X → YOLOv8-L
YOLOv8-L → YOLOv8-n

Using Intermediate Teacher (L) to bridge the capacity gap between X and Nano.

Step 3: CrossKD Loss

Cross-Head Distillation

Applying Cross-Head Knowledge Distillation during training.

Cross-connecting Student backbone → Teacher head and Teacher backbone → Student head for detection-sensitive feature transfer.

Phase 2: Inference Pipeline

Input Frame

Raw RGB/IR Image

5-Crop Split

4 Corners + 1 Center
(Zoom Effect)

YOLOv8-n

Distilled Model
(Batch Inference)

NMW

Non-Maximum Weighted
(Fusion)

Result

Final Bounding Box
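Between the batched model pass and NMW fusion, each crop's detections have to be expressed in full-frame coordinates. The sketch below shows one way to do that, assuming the crop was plainly resized (no letterboxing) to a 320×256 input read as width × height. The function name `to_frame_coords` and the example values are hypothetical.

```python
def to_frame_coords(det, crop_box, input_size=(320, 256)):
    """Map a (x1, y1, x2, y2, score) detection from model-input pixels
    back to pixels of the original frame.

    crop_box is the (x1, y1, x2, y2) window the crop was taken from.
    """
    cx1, cy1, cx2, cy2 = crop_box
    sx = (cx2 - cx1) / input_size[0]   # horizontal scale: crop width / input width
    sy = (cy2 - cy1) / input_size[1]   # vertical scale: crop height / input height
    x1, y1, x2, y2, score = det
    return (cx1 + x1 * sx, cy1 + y1 * sy, cx1 + x2 * sx, cy1 + y2 * sy, score)


# A detection found in the bottom-right crop of a 1920x1080 frame.
crop_box = (768, 432, 1920, 1080)
mapped = to_frame_coords((160, 128, 170, 136, 0.8), crop_box)
```

Only after this remapping do boxes from different crops share one coordinate system, which is what allows overlapping detections of the same drone to be fused.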

Experimental Results

Evaluated on VIP Cup 2025 and DUT Anti-UAV datasets

Quantitative Results

Configuration	mAP@0.5	mAP@0.5-0.9	FPS
YOLOv11-Nano	0.51	0.23	72
YOLOv12-Nano	0.55	0.22	69
YOLOv8-Nano (Baseline)	0.55	0.22	77
YOLOv8-Nano (KD)	0.65	0.25	77
YOLOv8-Nano (Pretrained)	0.79	0.48	77
YOLOv8-Nano (Multiscale)	0.61	0.23	28
YOLOv8-Nano (KD + Pretrained + Multiscale)	0.84	0.51	28

Multiscale Analysis

Experiments on the cropping strategy showed that including the original full-frame image often degraded precision.

Original + 4 Crops: Higher Recall, but lower Precision (more false positives)
5 Crops Only: Best trade-off, raising the F1-score from 0.2978 to 0.3323

Knowledge Distillation Results

We distilled knowledge from YOLOv8-X to YOLOv8-Nano, with YOLOv8-L as an intermediate teacher, using Cross-Head Knowledge Distillation (CrossKD). This raised mAP@0.5 from 0.55 to 0.65 (+0.10 absolute) by focusing on detection-sensitive features and improving localization.

Conclusion

This project developed a UAV detection system that balances accuracy and speed. The multiscale processing strategy handles small objects by simulating a zoom effect, while the cross-head knowledge distillation combined with progressive KD enables efficient knowledge transfer with enhanced localization capabilities.

+29% Accuracy

mAP@0.5 improved from 0.55 to 0.84 (+0.29 absolute)

Real-time Performance

28 FPS on edge devices

Better Localization

Improved bounding box regression via CrossKD

References

Research papers that informed our approach

High-Speed Drone Detection Based On Yolo-V8

Jun-Hwa Kim, Namho Kim, Chee Sun Won

ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (2023)

Vision-based Anti-UAV Detection and Tracking

Jie Zhao, Jingshu Zhang, Dongdong Li, Dong Wang

IEEE Transactions on Intelligent Transportation Systems (2022)

A Fourier-based Framework for Domain Generalization

Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, Qi Tian

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

Improving Small Drone Detection Through Multi-Scale Processing and Data Augmentation

Rayson Laroca, Marcelo dos Santos, David Menotti

International Joint Conference on Neural Networks (IJCNN) (2025)

Domain-invariant Progressive Knowledge Distillation for UAV-based Object Detection

Liang Yao, Fan Liu, Chuanyi Zhang, Zhiquan Ou, Ting Wu

IEEE Geoscience and Remote Sensing Letters (2024)

CrossKD: Cross-Head Knowledge Distillation for Object Detection

Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
