=Resources

  • Data is hosted at [Huggingface], including 1,006,782 descriptive captions for 3D objects in Objaverse and Objaverse-XL, each associated with a point cloud (16,384 colored points) and 20 rendered images with camera details (intrinsics & extrinsics), depth maps, and masks.
  • Our code for captioning, rendering, and view selection is released in [Github (still cleaning)]
  • Our code for finetuning text-to-3D models is released in [Github]
  • Some of our fine-tuned model checkpoints can be found in [Huggingface].
  • Compositional descriptive captions for 3D objects in the ABO dataset are at [Huggingface]

=Overview

Our experimental findings indicate that the choice of rendering views significantly impacts the performance of 3D captioning with image-based models such as BLIP2 and GPT4-Vision. Notably, our method, which uses only 8 rendered views, produces higher-quality, more detailed captions with less hallucination than GPT4-Vision given 28 views.

Both Cap3D and our newer method (DiffuRank) render input 3D objects into multiple views for caption generation (green steps). However, while Cap3D consolidates these captions into a final description (blue steps), DiffuRank employs a pre-trained text-to-3D diffusion model to identify views that better match the input object’s characteristics. These selected views are then processed by a Vision-Language Model (we used GPT4-Vision) for captioning (orange steps).
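The view-ranking step above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the per-(view, caption) alignment scores (e.g., negative denoising losses from a pre-trained text-to-3D diffusion model) have already been computed, and only shows how views would be averaged over captions and ranked. The function name `diffurank_select` and all numbers are made up for illustration.

```python
import numpy as np

def diffurank_select(alignment_scores, top_k=6):
    """Rank rendered views by average alignment with candidate captions.

    alignment_scores: (num_views, num_captions) array. In a DiffuRank-style
    setup, each entry would be e.g. the negative denoising loss of a
    pre-trained text-to-3D diffusion model for that (view, caption) pair,
    so higher means better alignment.
    Returns the indices of the top_k views, best first.
    """
    scores = np.asarray(alignment_scores, dtype=float)
    per_view = scores.mean(axis=1)   # average alignment over captions
    order = np.argsort(-per_view)    # sort views by descending score
    return order[:top_k].tolist()

# Toy example: 4 views scored against 3 candidate captions (made-up values).
scores = [[0.9, 0.8, 0.7],   # view 0: well aligned
          [0.1, 0.2, 0.1],   # view 1: poorly aligned
          [0.6, 0.5, 0.6],   # view 2
          [0.3, 0.4, 0.2]]   # view 3
print(diffurank_select(scores, top_k=2))  # -> [0, 2]
```

The selected indices would then pick out the rendered images passed to the Vision-Language Model for the final captioning step.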

Randomly sampled views selected by DiffuRank. The left row shows the top-6 views as ranked by DiffuRank, while the right row shows the bottom-6. We adopt two different rendering styles and observe that DiffuRank selects views whose rendering best highlights object features.

=Related Publications

Objaverse: A Universe of Annotated 3D Objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi

ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik

GPT-4 Technical Report

OpenAI

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

Objaverse-XL: A Universe of 10M+ 3D Objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi