=Resources

  • Data can be found on [Huggingface], including descriptive captions for 3D objects in Objaverse and ABO, along with Objaverse's point clouds, rendered images, and Shap-E latent codes (a minimal loading sketch follows this list).
  • Our code for rendering, captioning, and finetuning text-to-3D models is released on [Github].
  • Some finetuned model checkpoints can be found on [Huggingface].
  • More captioning examples can be found in [Link].
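
Below is a minimal sketch of loading the released captions, assuming they are distributed as a CSV of (object UID, caption) pairs. The repository ID and file name are placeholders standing in for the [Huggingface] links above, not verified values.

    import pandas as pd
    from huggingface_hub import hf_hub_download

    # Placeholder repo/file names -- substitute the actual [Huggingface] dataset above.
    csv_path = hf_hub_download(
        repo_id="<cap3d-dataset-repo>",
        filename="<captions>.csv",
        repo_type="dataset",
    )

    # Each row pairs a 3D object's UID with its descriptive caption.
    captions = pd.read_csv(csv_path, header=None, names=["uid", "caption"])
    print(captions.head())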

=Overview

Cap3D produces detailed descriptions of 3D objects by leveraging pretrained image captioning, image-text alignment, and large language models (LLMs) to consolidate information from multiple rendered views.

Figure 1: Example captioning results by Cap3D.

Table 1: Human evaluations show that Cap3D surpasses crowdsourced annotation in quality, cost, and speed.

Our proposed method, Cap3D, employs a four-step process. First, we render a set of 2D views of each 3D object. Next, we apply image captioning (BLIP-2) to obtain preliminary descriptions of each view. Because these captions may contain inaccuracies, an image-text alignment model, CLIP, is used in the third step to filter out captions that do not match the rendered view. Finally, an LLM (GPT-4) consolidates the captions from the different views into a single comprehensive caption. This process is shown in Figure 2 and detailed below.

Figure 2: Overview of Cap3D pipeline.
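
Below is a minimal sketch of this pipeline built on off-the-shelf Hugging Face models. The model checkpoints, number of caption candidates, prompt wording, and file layout are illustrative assumptions rather than Cap3D's exact configuration; the final LLM consolidation call is left as a stub.

    import glob
    import torch
    from PIL import Image
    from transformers import (Blip2Processor, Blip2ForConditionalGeneration,
                              CLIPModel, CLIPProcessor)

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Step 2: image captioning on each rendered view (BLIP-2).
    blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

    # Step 3: image-text alignment (CLIP) scores each candidate caption against its view.
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    def caption_view(image, num_candidates=5):
        # Sample several candidate captions for one rendered view.
        inputs = blip_proc(images=image, return_tensors="pt").to(device)
        ids = blip.generate(**inputs, do_sample=True,
                            num_return_sequences=num_candidates, max_new_tokens=30)
        return [blip_proc.decode(seq, skip_special_tokens=True).strip() for seq in ids]

    def select_caption(image, candidates):
        # Keep the candidate with the highest CLIP image-text similarity.
        inputs = clip_proc(text=candidates, images=image,
                           return_tensors="pt", padding=True).to(device)
        scores = clip(**inputs).logits_per_image[0]
        return candidates[int(scores.argmax())]

    # Step 1 is assumed done offline: each object has a folder of rendered views.
    views = [Image.open(p).convert("RGB") for p in sorted(glob.glob("renders/object_0/*.png"))]
    per_view = [select_caption(img, caption_view(img)) for img in views]

    # Step 4: consolidate the per-view captions with an LLM (prompt wording is illustrative).
    prompt = ("Several captions describe different views of the same 3D object. "
              "Write one concise caption for the object:\n" + "\n".join(per_view))
    # final_caption = <call an LLM such as GPT-4 with `prompt`>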

=Related Publication

Objaverse: A Universe of Annotated 3D Objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi

ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik

GPT-4 Technical Report

OpenAI

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi