Open World Object Detection (OWOD) addresses realistic scenarios where unseen object classes emerge, enabling detectors trained on known classes to detect unknown objects and to incrementally learn them once their labels become available.
While existing OWOD methods primarily focus on detecting unknown objects, they often overlook the rich semantic relationships between detected objects, which are essential for scene understanding and applications in open-world environments (e.g., open-world tracking and novel class discovery).
In this paper, we extend the OWOD framework to jointly detect unknown objects and learn semantically rich instance embeddings, enabling the detector to capture fine-grained semantic relationships between instances. To this end, we propose two modules that leverage the rich and generalizable knowledge of Vision Foundation Models (VFMs). First, the Unknown Box Refine Module uses semantic masks from the Segment Anything Model (SAM) to accurately localize unknown objects. The Embedding Transfer Module then distills instance-wise semantic similarities from VFM features into the detector's embeddings via a relaxed contrastive loss, enabling the detector to learn semantically meaningful and generalizable instance features.
Extensive experiments show that our method significantly improves both unknown object detection and instance embedding quality, while also enhancing performance in downstream tasks such as open-world tracking.
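To make the Unknown Box Refine Module concrete, the sketch below prompts SAM with each unknown proposal box and takes the tight bounding box of the returned semantic mask as the refined pseudo ground truth. This is a minimal illustration assuming the official `segment_anything` API; the checkpoint path, score threshold, and fallback behavior are placeholders rather than our exact implementation.

```python
# Minimal sketch of the Unknown Box Refine Module: prompt SAM with each
# unknown proposal box and take the tight bounding box of the predicted
# semantic mask as the refined pseudo ground truth. Checkpoint path, score
# threshold, and fallback behavior are illustrative placeholders.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def refine_unknown_boxes(image, unknown_boxes, min_score=0.8):
    """image: HxWx3 uint8 RGB array; unknown_boxes: (N, 4) XYXY proposals."""
    predictor.set_image(image)
    refined = []
    for box in unknown_boxes:
        masks, scores, _ = predictor.predict(
            box=np.asarray(box), multimask_output=False
        )
        mask = masks[0]
        if scores[0] < min_score or not mask.any():
            refined.append(box)  # keep the original proposal if SAM is unsure
            continue
        ys, xs = np.nonzero(mask)  # tight box around the semantic mask
        refined.append([xs.min(), ys.min(), xs.max(), ys.max()])
    return np.asarray(refined, dtype=np.float32)
```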
▲ Our method extends OWOD by not only detecting unknown objects but also extracting semantically rich features. We adopt PROB as the baseline open-world object detector. During training, the known and unknown proposals from PROB, together with their corresponding instance embeddings, are fed into the proposed modules. The Unknown Box Refine Module improves the localization of unknown objects by treating refined unknown boxes from SAM as pseudo ground truth. The Embedding Transfer Module extracts source embeddings by average-pooling DINOv2 features within the refined unknown and known proposals. Pairwise similarities between source embeddings are then computed and used as weights for the relaxed contrastive loss, controlling the attraction and repulsion between instance embeddings. At inference, the detector generates semantically rich instance embeddings, capturing fine-grained relationships between detected proposals.
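A rough sketch of the Embedding Transfer Module follows: DINOv2 patch features are average-pooled inside each known or refined unknown box to form source embeddings, and their pairwise similarities weight the attraction and repulsion terms of a relaxed contrastive loss on the detector's instance embeddings. The use of `torchvision.ops.roi_align` for in-box pooling and the exact loss form below are assumptions in the spirit of the method, not our precise implementation.

```python
# Hedged sketch of the Embedding Transfer Module: pool DINOv2 features inside
# each proposal (via roi_align, an assumed choice), then use pairwise source
# similarities as weights in a relaxed contrastive loss over detector embeddings.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def source_embeddings(dino_feats, boxes, patch_size=14):
    """dino_feats: (1, C, H, W) DINOv2 patch features; boxes: (N, 4) XYXY
    boxes in image coordinates. Returns L2-normalized (N, C) embeddings."""
    rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)
    pooled = roi_align(dino_feats, rois, output_size=1,
                       spatial_scale=1.0 / patch_size)  # average-pool in-box
    return F.normalize(pooled.flatten(1), dim=1)

def relaxed_contrastive_loss(det_emb, src_emb, margin=1.0):
    """det_emb: (N, D) detector instance embeddings; src_emb: (N, C) source
    embeddings. Source similarities control attraction vs. repulsion."""
    w = (src_emb @ src_emb.t()).clamp(min=0)     # pairwise weights in [0, 1]
    z = F.normalize(det_emb, dim=1)
    d = torch.cdist(z, z)                        # pairwise Euclidean distances
    attract = w * d.pow(2)                       # pull semantically similar pairs
    repel = (1 - w) * F.relu(margin - d).pow(2)  # push dissimilar pairs apart
    return (attract + repel).sum() / det_emb.size(0) ** 2
```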
▲ Proposals with high feature similarity to the reference are shown in red, while dissimilar proposals are shown in blue. Our method successfully captures semantic similarities between both known and unknown objects. For example, the reference giraffe is similar to both an unknown giraffe and a known horse, while the fire hydrant is highly dissimilar. In contrast, PROB treats all proposals as highly similar. RNCDL, despite using self-supervision to learn features, fails to capture meaningful semantics, mistakenly considering the giraffe and the fire hydrant highly similar.
▲ Unlike PROB, where embeddings are mixed and lack clear separation, our method produces a well-structured feature space with distinct class-wise clusters. Additionally, semantically similar animal classes appear closer together, while unrelated categories remain separate. This shows that our method effectively captures semantic relationships within the feature space.
▲ Unknown object detections from PROB (top row) and our model (bottom row) are compared. By leveraging semantic masks from SAM, our model achieves accurate localization.
▲ Bounding box colors represent track IDs, with both the squirrel and the ball belonging to unknown classes. While PROB successfully detects the squirrel, it fails to associate the detections across frames because the embeddings of the squirrel's varying shapes are not similar enough. Our method successfully tracks the squirrel by learning semantically rich instance embeddings, which are used to compute feature similarity between inter-frame proposals.
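As a rough illustration of how these embeddings plug into open-world tracking, the sketch below associates detections across frames by the cosine similarity of their instance embeddings with Hungarian matching. The matching scheme and threshold are illustrative assumptions, not the exact tracker configuration used in our experiments.

```python
# Hedged sketch: associate inter-frame proposals by cosine similarity of their
# instance embeddings using Hungarian matching. Threshold and matching scheme
# are illustrative assumptions, not the exact tracker configuration.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def associate(prev_emb, curr_emb, min_sim=0.5):
    """prev_emb: (M, D), curr_emb: (N, D). Returns matched (prev, curr) index
    pairs whose embedding similarity clears the threshold."""
    sim = F.normalize(prev_emb, dim=1) @ F.normalize(curr_emb, dim=1).t()
    rows, cols = linear_sum_assignment(-sim.cpu().numpy())  # maximize similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]
```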
00:05-00:23 : Model explanation
00:26-01:06 : Inter-Proposal Relationship Results (OWOD split)
01:06-01:45 : Inter-Proposal Relationship Results (Unknown Unknown split)
01:45-02:13 : Detection Results
02:13-02:50 : Tracking Results