Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks, including object detection, OCR, pointing, keypointing, and visual prompting, into a single next-point prediction framework.
Detection: detect and localize objects by taking category names as natural language input, enabling flexible, text-based object detection.
Referring: identify and localize objects described by natural language referring expressions, enabling fine-grained alignment between linguistic descriptions and visual content.
Pointing: predict the precise point location of a target object specified by a natural language description, allowing fine-grained and lightweight spatial localization.
OCR: detect and recognize words or text lines by predicting bounding boxes or polygons for textual regions in the image.
Visual Prompting: detect all objects belonging to the same category as a provided visual prompt, where the reference object is specified by one or more bounding boxes in the input image.
Keypointing: detect instances and output a standardized set of semantic keypoints (e.g., 17 joints for humans/animals), providing structured pose representations; a minimal sketch of such output follows below.
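To illustrate what a standardized keypoint set looks like, here is a minimal sketch assuming the widely used 17-point COCO human skeleton; the exact schema and keypoint names Rex-Omni emits may differ:

# A minimal sketch of a structured keypoint result, assuming the
# 17-point COCO human skeleton. The exact names and output schema
# used by Rex-Omni may differ; this only illustrates the idea of a
# standardized, per-instance set of semantic keypoints.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# One detected person: a bounding box plus one (x, y) point per keypoint.
example_instance = {
    "box": [120, 40, 310, 420],  # x0, y0, x1, y1 in pixels
    "keypoints": {name: (0.0, 0.0) for name in COCO_KEYPOINTS},
}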
Rex-Omni reformulates visual perception as a next-point prediction problem, unifying diverse vision tasks within a single generative framework. It predicts spatial outputs (e.g., boxes, points, polygons) autoregressively and is optimized through a two-stage training pipeline: large-scale Supervised Fine-Tuning (SFT) for grounding, followed by GRPO-based reinforcement learning to refine geometry awareness and behavioral consistency.
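To make the next-point formulation concrete, the sketch below shows how discrete coordinate tokens could be decoded back to pixel coordinates. The bin count and decoding rule here are illustrative assumptions, not Rex-Omni's exact tokenization:

# Illustrative decoding of quantized coordinate tokens into pixels.
# Assumes each coordinate is one of NUM_BINS discrete tokens spanning
# the image; Rex-Omni's actual vocabulary and binning may differ.
NUM_BINS = 1000

def bin_to_pixel(bin_idx: int, extent: int) -> float:
    """Map a coordinate bin index back to a pixel position."""
    return (bin_idx + 0.5) / NUM_BINS * extent

def decode_box(tokens, width, height):
    """Decode four coordinate tokens (x0, y0, x1, y1) into a pixel box."""
    x0, y0, x1, y1 = tokens
    return [
        bin_to_pixel(x0, width), bin_to_pixel(y0, height),
        bin_to_pixel(x1, width), bin_to_pixel(y1, height),
    ]

# Example: a box predicted as four coordinate tokens on a 640x480 image.
print(decode_box([100, 250, 500, 900], 640, 480))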
Get started with Rex-Omni in just a few lines of code
# Install Rex-Omni
conda create -n rexomni python=3.10
conda activate rexomni
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
pip install -v -e .
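Optionally, verify that the expected PyTorch build is active inside the new environment (a standard PyTorch check, nothing Rex-Omni-specific):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"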
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize

# Initialize model
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers",
)

# Load image and run detection
image = Image.open("your_image.jpg")
results = rex.inference(
    images=image,
    task="detection",
    categories=["person", "car", "dog"],
)

# Visualize results
vis = RexOmniVisualize(
    image=image,
    predictions=results[0]["extracted_predictions"],
)
vis.save("result.jpg")
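The schema inside extracted_predictions is defined by the rex_omni package and is not spelled out here, so a defensive first step is to print the structure before relying on specific fields; a minimal sketch:

# Inspect the raw prediction structure before relying on specific fields;
# the exact schema of extracted_predictions is defined by the rex_omni
# package and may differ from what this sketch assumes.
preds = results[0]["extracted_predictions"]
print(type(preds))
if isinstance(preds, dict):
    for category, instances in preds.items():
        print(category, instances)
elif isinstance(preds, list):
    for pred in preds:
        print(pred)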
To enable the research community to build upon this work, we're publicly releasing Rex-Omni with comprehensive tutorials and examples.
Paper: complete model architecture, training details, and evaluation results
Code: full implementation with an easy-to-use Python package and tutorials
Demo: interactive Gradio demo and comprehensive example scripts
Tutorials: step-by-step guides for each vision task with Jupyter notebooks