Detect Anything via Next Point Prediction

Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks, including object detection, OCR, pointing, keypointing, and visual prompting, into a single next point prediction framework.

Key capabilities

Unified vision tasks through next point prediction

Object Detection

Detect and localize objects from category names given as natural language input, enabling flexible and intuitive text-based object detection.

Object Referring

Identify and localize objects corresponding to natural language referring expressions, enabling fine-grained alignment between linguistic descriptions and visual content.

Object Pointing

Predict the precise point location of a target object specified by a natural language description, allowing fine-grained and lightweight spatial localization.

OCR

Detect and recognize words or text lines by predicting bounding boxes or polygons corresponding to textual regions in the image.

Visual Prompting

Detect all objects belonging to the same category as the provided visual prompt, where the reference object is specified by one or more bounding boxes in the input image.

Keypoint Detection

Detect instances and output a standardized set of semantic keypoints (e.g., 17 joints for humans/animals), providing structured pose representations.
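
These tasks differ mainly in the geometric primitive they return. Below is a minimal illustration of those primitives as plain Python data; the values and field names are made up for illustration and are not Rex-Omni's actual output schema (see Quick Start below for the real API).

# Illustration only: the geometric primitives produced by the tasks above,
# written as plain Python data. Values and names are hypothetical, not the
# library's actual output schema.
box = [48, 120, 310, 415]          # detection / referring: [x0, y0, x1, y1]
point = [182, 267]                 # pointing: a single [x, y] location
polygon = [[30, 40], [200, 38],    # OCR: [x, y] vertices of a text region
           [205, 110], [28, 112]]
keypoints = {"nose": [150, 90],    # keypointing: named joints per instance,
             "left_eye": [140, 82],
             "right_eye": [160, 82]}  # e.g. 17 joints for a person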

Unified architecture for multiple vision tasks

Next point prediction framework

Rex-Omni reformulates visual perception as a next point prediction problem, unifying diverse vision tasks within a single generative framework. It predicts spatial outputs (e.g., boxes, points, polygons) auto-regressively and is optimized through a two-stage training pipeline—large-scale Supervised Fine-Tuning (SFT) for grounding, followed by GRPO-based reinforcement learning to refine geometry awareness and behavioral consistency.
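
As a concrete illustration of what next point prediction means for coordinates, here is a minimal sketch of coordinate quantization, assuming pixel positions are binned into a fixed number of discrete values that the model can emit as tokens one at a time. The bin count and helper functions are assumptions for illustration, not Rex-Omni's exact design.

# Illustrative sketch: quantize pixel coordinates into discrete bins so that a
# box or point becomes a short sequence of coordinate tokens the model can emit
# autoregressively. The bin count (1000) is an assumption, not Rex-Omni's spec.
NUM_BINS = 1000

def quantize(value: float, size: int) -> int:
    """Map a pixel coordinate in [0, size) to a bin index in [0, NUM_BINS)."""
    return min(NUM_BINS - 1, int(value / size * NUM_BINS))

def dequantize(bin_idx: int, size: int) -> float:
    """Map a bin index back to the approximate pixel coordinate (bin center)."""
    return (bin_idx + 0.5) / NUM_BINS * size

# A box on a 640x480 image becomes four coordinate tokens, predicted in order:
w, h = 640, 480
box = [48.0, 120.0, 310.0, 415.0]  # x0, y0, x1, y1 in pixels
tokens = [quantize(box[0], w), quantize(box[1], h),
          quantize(box[2], w), quantize(box[3], h)]
decoded = [dequantize(tokens[0], w), dequantize(tokens[1], h),
           dequantize(tokens[2], w), dequantize(tokens[3], h)]
print(tokens)   # [75, 250, 484, 864]
print(decoded)  # approximately recovers the original box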

3B parameters · 10+ vision tasks · 1 unified model
Architecture

Quick Start

Get started with Rex-Omni in just a few lines of code

# Install Rex-Omni
conda create -n rexomni python=3.10 -y
conda activate rexomni
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
pip install -v -e .
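
# Python usage example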
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize

# Initialize model
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers"
)

# Load image and run detection
image = Image.open("your_image.jpg")
results = rex.inference(
    images=image, 
    task="detection", 
    categories=["person", "car", "dog"]
)

# Visualize results
vis = RexOmniVisualize(
    image=image,
    predictions=results[0]["extracted_predictions"]
)
vis.save("result.jpg")
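
Other tasks go through the same inference call. The task strings and arguments below are assumptions for illustration (visual prompting, for example, additionally takes reference boxes); consult the repository's tutorials for the exact names each task expects.

# Illustration: invoking other tasks through the same wrapper. Task names here
# ("pointing", "ocr_box", "keypoint") are assumptions; see the tutorials for
# the strings Rex-Omni actually expects.
pointing = rex.inference(images=image, task="pointing", categories=["dog"])
ocr = rex.inference(images=image, task="ocr_box")
keypoints = rex.inference(images=image, task="keypoint", categories=["person"])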

Open innovation

To enable the research community to build upon this work, we're publicly releasing Rex-Omni with comprehensive tutorials and examples.

🔬 Research: Complete model architecture, training details, and evaluation results

💻 Code: Full implementation with an easy-to-use Python package and tutorials

🎮 Demo: Interactive Gradio demo and comprehensive example scripts

📚 Tutorials: Step-by-step guides for each vision task with Jupyter notebooks