=Resources
- Data is hosted on [Huggingface], including 1,006,782 descriptive captions for 3D objects in Objaverse and Objaverse-XL, each associated with a point cloud (16,384 colorful points) and 20 rendered images along with camera details (intrinsics & extrinsics), depth data, and masks (see the loading sketch after this list).
- Our code for captioning, rendering, and view selection is released at [Github (still cleaning)].
- Our code for finetuning text-to-3D models is released at [Github].
- Some of our fine-tuned model checkpoints can be found on [Huggingface].
- Compositional descriptive captions for 3D objects in the ABO dataset are available on [Huggingface].
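For reference, here is a minimal sketch of how the released data might be loaded. The repo id, file names, and file layout below (e.g. `captions.csv`, one `.npy` point cloud per object) are illustrative assumptions, not the actual structure; consult the Huggingface dataset card linked above for the real paths.

```python
# Minimal loading sketch. The repo id and file layout are assumptions for
# illustration only; check the Huggingface dataset card for the actual paths.
import csv

import numpy as np
from huggingface_hub import hf_hub_download

REPO_ID = "user/cap3d-data"  # placeholder: substitute the actual dataset repo id

# Assumed caption table: a two-column CSV of (object_uid, caption).
caption_path = hf_hub_download(REPO_ID, "captions.csv", repo_type="dataset")
captions = {}
with open(caption_path, newline="") as f:
    for uid, caption in csv.reader(f):
        captions[uid] = caption

# Assumed point-cloud file: one .npy per object of shape (16384, 6) = xyz + rgb.
uid = next(iter(captions))
pc_path = hf_hub_download(REPO_ID, f"point_clouds/{uid}.npy", repo_type="dataset")
points = np.load(pc_path)
xyz, rgb = points[:, :3], points[:, 3:]
print(uid, captions[uid][:80], xyz.shape, rgb.shape)
```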
=Overview
Our experimental findings indicate that the choice of rendered views significantly impacts the performance of 3D captioning with image-based models such as BLIP2 and GPT4-Vision. In particular, our method, which uses 8 rendered views, produces higher-quality, more detailed captions with fewer hallucinations than GPT4-Vision with 28 views.
Randomly sampled examples of views selected by DiffuRank. The left row shows the top-6 views ranked by DiffuRank, while the right row shows the bottom-6. We adopt two different kinds of rendering and observe that DiffuRank selects views with the appropriate rendering that highlights object features.
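The overview above does not spell out DiffuRank's scoring rule, so the sketch below only illustrates the surrounding selection step: rank each rendered view of an object by some alignment score and keep the top-k (k = 8 in our setup) for captioning. The `score_view_alignment` callable is a hypothetical placeholder, not the authors' diffusion-based scorer.

```python
# Schematic sketch of the view-selection step: rank rendered views by an
# alignment score and keep the top-k for captioning. `score_view_alignment`
# is a hypothetical stand-in for DiffuRank's diffusion-based scoring and is
# NOT the authors' implementation.
from typing import Callable, List, Sequence, Tuple


def select_top_views(
    views: Sequence,                  # rendered images of one 3D object
    score_view_alignment: Callable,   # view -> float, higher = better aligned
    k: int = 8,                       # we caption with 8 selected views
) -> Tuple[List, List]:
    """Return (top_k_views, bottom_k_views) ranked by the alignment score."""
    ranked = sorted(views, key=score_view_alignment, reverse=True)
    return list(ranked[:k]), list(ranked[-k:])


# Usage sketch: caption only the selected views with an image captioner
# (e.g. BLIP2 or GPT4-Vision), then fuse the per-view captions.
# top_views, _ = select_top_views(rendered_views, my_alignment_score, k=8)
# view_captions = [caption_model(v) for v in top_views]
```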