Abstract
We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Thanks to the learned general map, finetuning HyperFields converges quickly, synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods. Our ablation experiments show that both the dynamic architecture and NeRF distillation are critical to the expressivity of HyperFields.
Community
My summary of the day: Generating 3D objects based solely on text descriptions has proven extremely challenging for AI. Current state-of-the-art methods require optimizing a full 3D model from scratch for each new prompt, which is computationally demanding.
A new technique called HyperFields shows promising progress toward generating detailed 3D models directly from text prompts.
Instead of optimizing per prompt, HyperFields aims to learn a generalized mapping from language to 3D geometry representations, so a tailored 3D model for a new text prompt can be produced efficiently in a single feedforward pass, without per-prompt optimization.
HyperFields combines two key techniques:
- A dynamic hypernetwork that takes in text and progressively predicts weights for a separate 3D generation network. The weight predictions are conditioned on the previous layer's activations, enabling per-scene specialization (see the first sketch after this list).
- Distilling individually optimized 3D networks into the hypernetwork, which provides dense supervision for learning the complex text-to-3D mapping (see the second sketch below).
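
To make the first bullet concrete, here is a minimal PyTorch sketch of the dynamic-hypernetwork idea, not the authors' exact architecture: each NeRF layer's weights are predicted from the text embedding concatenated with a summary of the previous layer's activations. All module names, layer sizes, and the mean-pooled activation summary are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicHyperNeRF(nn.Module):
    """Illustrative sketch: a hypernetwork that generates NeRF MLP weights
    conditioned on text and on the previous generated layer's activations."""

    def __init__(self, text_dim=768, hidden=64, n_layers=4):
        super().__init__()
        self.hidden = hidden
        dims = [3] + [hidden] * n_layers  # 3D position in, hidden layers after
        # One small weight-prediction head per NeRF layer; its input is the
        # text embedding plus a summary of the previous layer's activations.
        self.weight_gens = nn.ModuleList(
            nn.Sequential(
                nn.Linear(text_dim + hidden, 256), nn.ReLU(),
                nn.Linear(256, dims[i + 1] * (dims[i] + 1)),  # weights + bias
            )
            for i in range(n_layers)
        )
        self.to_rgb_sigma = nn.Linear(hidden, 4)  # color (3) + density (1)

    def forward(self, xyz, text_emb):
        # xyz: (N, 3) sampled points; text_emb: (1, text_dim) pooled tokens.
        h = xyz
        act_summary = torch.zeros(1, self.hidden, device=xyz.device)
        for gen in self.weight_gens:
            cond = torch.cat([text_emb, act_summary], dim=-1)
            flat = gen(cond).squeeze(0)
            in_dim, out_dim = h.shape[-1], self.hidden
            W = flat[: out_dim * in_dim].view(out_dim, in_dim)
            b = flat[out_dim * in_dim:]
            h = torch.relu(h @ W.T + b)
            # Condition the next layer's weight prediction on these activations.
            act_summary = h.mean(dim=0, keepdim=True)
        return self.to_rgb_sigma(h)  # (N, 4): RGB + density per point
```

Calling `DynamicHyperNeRF()(points, text_emb)` with `points` of shape `(N, 3)` and a pooled text embedding of shape `(1, 768)` returns per-point color and density, with the NeRF weights generated on the fly from the prompt.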
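The second bullet, NeRF distillation, could then be implemented roughly as below. This is a hedged sketch, not the paper's exact loss: the hypernetwork-generated "student" NeRF is trained to match renders from the individually pre-trained per-scene "teacher" NeRFs. The helpers `text_encoder`, `sample_rays`, and `render` are hypothetical placeholders.

```python
import torch

def distillation_step(hyper_nerf, teacher_nerfs, prompts, text_encoder,
                      sample_rays, render, optimizer):
    """One step distilling pre-trained per-scene NeRFs into the hypernetwork."""
    optimizer.zero_grad()
    loss = 0.0
    for prompt, teacher in zip(prompts, teacher_nerfs):
        text_emb = text_encoder(prompt)   # hypothetical: (1, text_dim) embedding
        rays = sample_rays()              # hypothetical: random camera rays
        # Student: NeRF whose weights are generated from the text embedding.
        pred = render(lambda x: hyper_nerf(x, text_emb), rays)
        with torch.no_grad():
            target = render(teacher, rays)  # dense per-pixel teacher supervision
        loss = loss + ((pred - target) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The appeal of this setup is that the teachers supply dense, stable pixel-level targets for every training prompt, which is much stronger supervision than re-running a 2D-guidance loss for each scene.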
In experiments, HyperFields exceeded previous state-of-the-art methods in sample efficiency and converged 5-10x faster in wall-clock time. It demonstrated the ability to:
- Encode over 100 distinct objects like "yellow vase" in a single model
- Generalize to new text combinations without having seen the exact prompt during training
- Rapidly adapt to generate completely novel objects with minimal fine-tuning
However, limitations remain around flexibility, fine-grained details, and reliance on existing 2D guidance systems.
TL;DR: HyperFields uses a dynamic hypernetwork to predict weights for a 3D generation network. The method is 5-10x faster than existing techniques and can quickly adapt to new text prompts, but has limitations in fine details.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- ED-NeRF: Efficient Text-Guided Editing of 3D Scene using Latent Space NeRF (2023)
- TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields (2023)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else (2023)
- Noise-Free Score Distillation (2023)
- ProteusNeRF: Fast Lightweight NeRF Editing using 3D-Aware Image Context (2023)