=Resources
- Data is hosted on [Huggingface], including 1,006,782 descriptive captions for 3D objects in Objaverse and Objaverse-XL, each associated with a point cloud (16,384 colorful points) and 20 rendered images along with camera details (intrinsics & extrinsics), depth data, and masks (see the loading sketch after this list).
- Our code for captioning, rendering, and view selection is released at [Github (still cleaning)].
- Our code for finetuning text-to-3D models is released at [Github].
- Some of our fine-tuned model checkpoints can be found on [Huggingface].
- Compositional descriptive captions for 3D objects in the ABO dataset are available on [Huggingface].
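For reference, here is a minimal sketch of how the released data might be loaded. The repo id, file names, and file layout below (e.g. `captions.csv`, one `.npy` point cloud per object) are illustrative assumptions, not the actual structure; consult the Huggingface dataset card linked above for the real paths.

```python
# Minimal loading sketch. The repo id and file layout are assumptions for
# illustration only; check the Huggingface dataset card for the actual paths.
import csv

import numpy as np
from huggingface_hub import hf_hub_download

REPO_ID = "user/cap3d-data"  # placeholder: substitute the actual dataset repo id

# Assumed caption table: a two-column CSV of (object_uid, caption).
caption_path = hf_hub_download(REPO_ID, "captions.csv", repo_type="dataset")
captions = {}
with open(caption_path, newline="") as f:
    for uid, caption in csv.reader(f):
        captions[uid] = caption

# Assumed point-cloud file: one .npy per object of shape (16384, 6) = xyz + rgb.
uid = next(iter(captions))
pc_path = hf_hub_download(REPO_ID, f"point_clouds/{uid}.npy", repo_type="dataset")
points = np.load(pc_path)
xyz, rgb = points[:, :3], points[:, 3:]
print(uid, captions[uid][:80], xyz.shape, rgb.shape)
```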
=Overview
Our experimental findings indicate that the choice of rendered views significantly impacts the performance of 3D captioning with image-based models such as BLIP2 and GPT4-Vision. In particular, our method, which uses 8 rendered views, produces higher-quality, more detailed captions with fewer hallucinations than GPT4-Vision with 28 views.
Randomly sampled examples of views selected by DiffuRank. The left row shows the top-6 views ranked by DiffuRank, while the right row shows the bottom-6. We adopt two different kinds of rendering and observe that DiffuRank selects views with the appropriate rendering that highlights object features.
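The overview above does not spell out DiffuRank's scoring rule, so the sketch below only illustrates the surrounding selection step: rank each rendered view of an object by some alignment score and keep the top-k (k = 8 in our setup) for captioning. The `score_view_alignment` callable is a hypothetical placeholder, not the authors' diffusion-based scorer.

```python
# Schematic sketch of the view-selection step: rank rendered views by an
# alignment score and keep the top-k for captioning. `score_view_alignment`
# is a hypothetical stand-in for DiffuRank's diffusion-based scoring and is
# NOT the authors' implementation.
from typing import Callable, List, Sequence, Tuple


def select_top_views(
    views: Sequence,                  # rendered images of one 3D object
    score_view_alignment: Callable,   # view -> float, higher = better aligned
    k: int = 8,                       # we caption with 8 selected views
) -> Tuple[List, List]:
    """Return (top_k_views, bottom_k_views) ranked by the alignment score."""
    ranked = sorted(views, key=score_view_alignment, reverse=True)
    return list(ranked[:k]), list(ranked[-k:])


# Usage sketch: caption only the selected views with an image captioner
# (e.g. BLIP2 or GPT4-Vision), then fuse the per-view captions.
# top_views, _ = select_top_views(rendered_views, my_alignment_score, k=8)
# view_captions = [caption_model(v) for v in top_views]
```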