While humans naturally identify novel objects and understand their relationships, deep learning-based object detectors struggle to detect and relate objects that are not observed during training.
To overcome this issue, Open World Object Detection (OWOD) has been introduced to enable models to detect unknown objects in open-world scenarios. However, OWOD methods fail to capture the fine-grained relationships between detected objects, which are crucial for comprehensive scene understanding and applications such as class discovery and tracking.
In this paper, we propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions by leveraging the knowledge of Vision Foundation Models (VFM). We first utilize the semantic masks from the Segment Anything Model to supervise the box regression of unknown objects, ensuring accurate localization. By transferring the instance-wise similarities obtained from the VFM features to the detector’s instance embeddings, our method then learns a semantically rich feature space of these embeddings.
Extensive experiments show that our method learns a robust and generalizable feature space, outperforming other OWOD-based feature extraction methods. Additionally, we demonstrate that the enhanced features from our model increase the detector’s applicability to tasks such as open-world tracking.
▲ The overall pipeline of the proposed method. PROB, built on Deformable DETR, is adopted as the open-world object detector. The Unknown Box Refine Module improves the box regression of unknown proposals using segmentation masks from SAM. The Embedding Transfer Module distills instance-wise relationships obtained from DINOv2’s rich feature space: for each known and refined unknown proposal, the similarity between DINOv2 features serves as a weight in the contrastive loss applied to the detector’s instance embeddings.
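The similarity-weighted distillation described in this caption can be illustrated with a minimal sketch. The exact loss is not given here, so the form below is an assumption: teacher (DINOv2-like) cosine similarities, clipped at zero and normalized per anchor, act as soft targets for a softmax over student similarities. The names `weighted_contrastive_loss`, `student`, `teacher`, and `tau` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def weighted_contrastive_loss(student, teacher, tau=0.1):
    """Assumed loss form, not the paper's exact formulation.

    student: detector instance embeddings, one vector per proposal.
    teacher: foundation-model features for the same proposals.
    For each anchor i, teacher similarities to the other proposals are
    clipped at zero, normalized, and used to weight the log-softmax of
    the student similarities (temperature tau).
    """
    n = len(student)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        weights = [max(cosine(teacher[i], teacher[j]), 0.0) for j in others]
        w_sum = sum(weights)
        if w_sum == 0.0:
            continue  # no teacher-similar proposal for this anchor
        logits = [cosine(student[i], student[j]) / tau for j in others]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -sum(w / w_sum * (l - log_z) for w, l in zip(weights, logits))
    return total / n


# Toy example: proposals 0 and 1 are similar under the teacher, 2 is not.
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
loss_aligned = weighted_contrastive_loss(aligned, teacher)
loss_misaligned = weighted_contrastive_loss(misaligned, teacher)
```

In this toy setup the loss is lower when the student embeddings reproduce the teacher's similarity structure than when two proposals are swapped, which is the behavior a relational distillation objective is meant to reward.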
▲ Proposals with high feature similarity to the query proposal are colored red, while highly dissimilar proposals are colored blue. Our method successfully captures similarities among both known and unknown objects. For example, the giraffe and the horse are similar to the giraffe query, while the fire hydrant is highly dissimilar. In contrast, PROB fails to distinguish features across proposals, treating all of them as highly similar. Although RNCDL uses self-supervision to learn features, it fails to learn semantically meaningful representations.
▲ Our model detects unknown objects more accurately by using the semantic masks generated by SAM to guide the box regression of these objects.
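One way to picture mask-guided regression is to convert each mask into a tight bounding box and use it as the regression target for the corresponding unknown proposal. The sketch below makes that assumption explicit; how SAM is prompted and how the target enters the loss are not specified here, so the SAM call is omitted and `mask_to_box` and `l1_box_loss` are hypothetical helper names.

```python
def mask_to_box(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the nonzero pixels.

    mask: 2D list of 0/1 values (rows of pixels). The max coordinates are
    exclusive. Returns None when the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def l1_box_loss(pred, target):
    """Mean absolute coordinate error between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


# Toy mask: the object covers rows 1-2 and columns 1-3.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
target_box = mask_to_box(mask)        # assumed regression target
loose_proposal = (0, 0, 5, 4)         # hypothetical coarse unknown proposal
loss = l1_box_loss(loose_proposal, target_box)
```

The L1 term is only a stand-in: any box regression loss the detector already uses could consume the mask-derived target in the same way.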
▲ Bounding box colors represent track IDs, with both the squirrel and the ball belonging to unknown classes. While PROB successfully detects the squirrel, it fails to associate the detections across frames because the embeddings of the squirrel’s varying shapes are not similar enough. Our method successfully tracks the squirrel by learning semantically rich instance embeddings, which are used to compute feature similarity between inter-frame proposals.
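The inter-frame association mentioned in this caption can be sketched as matching proposals by the cosine similarity of their instance embeddings. The greedy matching and the threshold below are assumptions for illustration, not the tracker used in the experiments; `associate` and `threshold` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def associate(prev_embs, curr_embs, threshold=0.5):
    """Greedy one-to-one matching of current proposals to previous ones.

    Returns a dict mapping current index -> previous index. Pairs are
    visited in order of decreasing similarity; pairs below the threshold
    are left unmatched (a tracker would start a new track for them).
    """
    pairs = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prev_embs)
            for j, c in enumerate(curr_embs)
        ),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for sim, i, j in pairs:
        if sim < threshold:
            break
        if i in used_prev or j in matches:
            continue
        matches[j] = i
        used_prev.add(i)
    return matches


# Toy frames: two tracked objects, then three proposals in the next frame.
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.95, 0.05], [-1.0, -1.0]]
matches = associate(prev_frame, curr_frame)
```

Here the third proposal in the current frame has no sufficiently similar predecessor, so it stays unmatched. When embeddings of the same object differ too much across frames, as described for PROB, every pair falls below the threshold and the track ID is lost.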