Robust and Label-Efficient Deep Waste Detection

Hassan Abid, Khan Muhammad, Muhammad Haris Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Sungkyunkwan University, Seoul, South Korea

Abstract

Effective waste sorting is critical for sustainable recycling, but academic AI still trails commercial systems due to limited datasets and reliance on legacy detectors. We advance AI-driven waste detection by benchmarking open-vocabulary object detectors (OVOD), establishing strong supervised baselines, and introducing an ensemble-based semi-supervised learning framework on the real-world ZeroWaste dataset. Class-only prompts perform poorly in zero-shot OVOD, whereas LLM-optimized prompts substantially improve accuracy. Fine-tuning modern transformer-based detectors yields new baselines of 51.6 mAP, more than doubling prior CNN results. Finally, we fuse model predictions to create soft pseudo-labels that improve semi-supervised training; applied to the unlabeled ZeroWaste-s subset, this produces high-quality annotations that boost downstream detectors beyond fully supervised training.

Key Highlights

Zero-shot OVOD: Class-only prompts underperform; LLM-optimized prompts improve OWLv2 from 7.3 → 13.5 mAP.

Fine-tuned baselines: Co-DETR, DETA, and Grounding DINO each reach 51.6 mAP, >2× stronger than prior CNN baselines (e.g., TridentNet 24.2).

Semi-supervised: Ensemble-based soft pseudo-labels push Grounding DINO (Swin-B) to 54.3 mAP with consistent per-class gains, including the rare metal class.

Final pseudo-annotations: 33,075 boxes over 6,065 images (ZeroWaste-s); training on these improves YOLO11 (+6.3 mAP) and RT-DETR (+4.3 mAP).

Method Overview

We study waste detection in three stages designed to separate capability from adaptation and then scale with unlabeled data:

  1. Zero-shot OVOD benchmarking. Evaluate Grounding DINO, OWLv2, and YOLO-World on ZeroWaste with class-only prompts; then apply an LLM-driven prompt refinement loop to test how far text guidance alone can go under clutter, deformation, and reflectivity.
  2. Supervised fine-tuning. Adapt modern detectors to the domain by training on ZeroWaste-f—establishing strong closed-set baselines (≈51.6 mAP) and quantifying the gap to zero-shot transfer.
  3. Semi-supervised soft pseudo-labels. Build an ensemble of fine-tuned models to label ZeroWaste-s: filter detections, IoU-cluster, fuse with WBF, then soft-weight confidence by spatial consistency and inter-model agreement to supervise large-scale training.
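The fusion step in stage 3 can be sketched as follows. This is a simplified stand-in for weighted box fusion (WBF) with a consensus-scaled soft score, not the paper's exact implementation; the agreement-based down-weighting shown here is one plausible reading of "inter-model agreement".

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fuse_cluster(boxes: List[Box], scores: List[float],
                 n_models: int, n_agreeing: int) -> Tuple[Box, float]:
    """Confidence-weighted average of clustered boxes (WBF-style),
    with the fused score scaled by inter-model agreement."""
    w = sum(scores)
    fused = tuple(sum(s * b[i] for s, b in zip(scores, boxes)) / w
                  for i in range(4))
    # mean confidence, down-weighted when fewer ensemble members agree
    soft = (w / len(scores)) * (n_agreeing / n_models)
    return fused, soft
```

A cluster seen by only two of three models, for example, keeps its fused box but has its confidence reduced by a factor of 2/3 before it supervises training.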

This pipeline (i) exposes zero-shot limitations under domain shift, (ii) sets robust supervised baselines, and (iii) unlocks further gains by converting unlabeled frames into reliable training signal.

Zero-Shot OVOD

We benchmark Grounding DINO, OWLv2, and YOLO-World in a zero-shot setting on ZeroWaste using only class-level prompts (“cardboard”, “soft plastic”, “rigid plastic”, “metal”). Performance is uniformly low (mAP ≤ 7.3), with large objects detected more reliably than transparent or reflective ones.

Zero-shot detection with class-only prompts: overall mAP is very low across models.

To improve results, we introduce an LLM-guided prompt optimization pipeline, where GPT-4o enriches class names with contextual cues (e.g., “flexible plastic bag”). This yields consistent gains—OWLv2 +6.2 mAP, Grounding DINO +5.4 mAP—yet still trails supervised baselines, underscoring the need for domain adaptation.
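The refinement loop can be sketched as a greedy search over prompt sets. Here `evaluate` (prompts to validation mAP) and `refine` (a wrapper around an LLM call such as GPT-4o) are placeholder interfaces, not the paper's exact ones.

```python
def optimize_prompts(class_names, evaluate, refine, rounds=3):
    """Greedy LLM-guided prompt refinement: keep whichever prompt set
    scores best on a held-out validation split.

    evaluate: maps a list of prompts to a validation mAP.
    refine:   asks an LLM to enrich the prompts with contextual cues.
    Both callables are assumptions for this sketch.
    """
    best, best_map = list(class_names), evaluate(class_names)
    for _ in range(rounds):
        candidate = refine(best)
        cand_map = evaluate(candidate)
        if cand_map > best_map:          # keep only improvements
            best, best_map = candidate, cand_map
    return best, best_map
```

Keeping only score-improving candidates makes the loop monotone in validation mAP, which matches the "how far can text guidance alone go" framing of the zero-shot study.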

Iterative prompt optimization pipeline for OVOD with GPT-4o.
Optimized prompts substantially improve zero-shot detection but remain below supervised baselines.

Fine-Tuned Baselines

To overcome zero-shot limits, we fine-tune state-of-the-art transformer-based detectors on ZeroWaste-f. This more than doubles performance compared to legacy CNNs: while TridentNet reached 24.2 mAP, Co-DETR (Swin-L), DETA (Swin-L), and Grounding DINO (Swin-B) each achieve 51.6 mAP, setting new baselines for industrial waste detection.

Fine-tuning transformer-based detectors on ZeroWaste-f sets new benchmarks, far surpassing legacy CNN models.

Semi-Supervised Learning

We apply ensemble-based soft pseudo-labeling on the unlabeled ZeroWaste-s subset. Predictions are fused with consensus-aware weighting to generate reliable supervision for semi-supervised training, reducing the need for costly manual labels.

This boosts Grounding DINO (Swin-B) from 51.6 → 54.3 mAP with consistent per-class AP gains—including rare metal—demonstrating effective scaling beyond limited annotations.
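The IoU-clustering step that precedes fusion can be sketched as a greedy grouping of the pooled ensemble detections. The threshold of 0.55 is an assumed value for illustration, not the paper's setting.

```python
def iou(a, b):
    """IoU of (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def iou_cluster(detections, thr=0.55):
    """Greedily group pooled (box, score, model_id) detections from all
    ensemble members; each cluster is later fused into one pseudo-box."""
    clusters = []
    for det in sorted(detections, key=lambda d: -d[1]):  # high score first
        for cluster in clusters:
            if iou(det[0], cluster[0][0]) >= thr:  # compare to cluster seed
                cluster.append(det)
                break
        else:
            clusters.append([det])
    return clusters
```

Seeding each cluster with its highest-scoring detection keeps the grouping deterministic and cheap, which matters when labeling thousands of unlabeled frames.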

Ensemble-based soft pseudo-labeling: filter → IoU cluster → WBF → consensus-weighted confidence → training.
Semi-supervised training with ensemble-based soft pseudo-labels improves detection across both Swin-T and Swin-B backbones.

Dataset & Settings

All experiments use the ZeroWaste dataset, collected in a full-scale Material Recovery Facility (MRF) with high-resolution overhead imagery. It provides labeled and unlabeled subsets for supervised and semi-supervised learning.

  • ZeroWaste-f: 4,503 labeled images with bounding boxes for cardboard, soft plastic, rigid plastic, and metal.
  • ZeroWaste-s: 6,212 unlabeled images captured under identical conditions for semi-supervised training.
  • Challenges: severe class imbalance (cardboard >66%, metal <2%), cluttered scenes (often 15+ objects), occlusions, and deformable materials.
ZeroWaste dataset examples: (a) ZeroWaste-f with ground-truth annotations; (b) ZeroWaste-s for semi-supervised learning.
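The class imbalance above is easy to verify from the annotation files. This sketch assumes COCO-style JSON labels for ZeroWaste-f, which is an assumption about the local file layout rather than a documented guarantee.

```python
import json
from collections import Counter

def class_distribution(ann_file: str) -> dict:
    """Per-class box counts and fractions from a COCO-style annotation
    file (assumed format: top-level 'categories' and 'annotations')."""
    with open(ann_file) as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]]
                     for a in coco["annotations"])
    total = sum(counts.values())
    # map class name -> (box count, fraction of all boxes)
    return {name: (n, n / total) for name, n in counts.items()}
```

On ZeroWaste-f this kind of tally is what surfaces the reported skew (cardboard above 66% of boxes, metal below 2%).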

Final Pseudo-Annotations

Leveraging our soft pseudo-labeling framework, we release 33,075 bounding boxes across 6,065 images in ZeroWaste-s. These pseudo-labels offer a scalable alternative to manual annotations while maintaining high reliability.

Detectors trained exclusively on this pseudo-labeled set generalize strongly: YOLO11 improves by +6.3 mAP and RT-DETR by +4.3 mAP on ZeroWaste-f test—showing pseudo-annotations can rival or even surpass fully supervised training in this domain.

Final pseudo-annotations on ZeroWaste-s: 33k high-quality boxes enabling scalable semi-supervised training.

Conclusion

We benchmark zero-shot OVOD on real-world waste data, establish strong supervised baselines via fine-tuning, and introduce an ensemble-based semi-supervised framework that produces high-quality pseudo-annotations. While zero-shot models benefit from prompt optimization, task-specific fine-tuning and consensus-driven soft pseudo-labels deliver the largest gains—outlining a scalable path for AI-assisted waste recovery in industrial MRFs. Code and pseudo-annotations are available to support future work on robust industrial waste detection.

BibTeX

@misc{abid2025robustlabelefficientdeepwaste,
  title         = {Robust and Label-Efficient Deep Waste Detection},
  author        = {Hassan Abid and Khan Muhammad and Muhammad Haris Khan},
  year          = {2025},
  eprint        = {2508.18799},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2508.18799}
}