HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image

3D content creation from a single image is a long-standing yet highly desirable task. Recent advances introduce 2D diffusion priors, yielding reasonable results. However, existing methods are not hyper-realistic enough for post-generation usage, as users cannot view, render, or edit the resulting 3D content from the full range of viewpoints. To address these challenges, we introduce HyperDreamer with several key designs and appealing properties: 1) Viewable: 360° mesh modeling with high-resolution textures enables the creation of visually compelling 3D models from the full range of observation points. 2) Renderable: Fine-grained semantic segmentation and data-driven priors are incorporated as guidance to learn reasonable albedo, roughness, and specular properties of the materials, enabling semantic-aware arbitrary material estimation. 3) Editable: For a generated model or their own data, users can interactively select any region via a few clicks and efficiently edit the texture with text-based guidance. Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and enabling user-friendly editing. We believe that HyperDreamer holds promise for advancing 3D content creation and finding applications in various domains.


INTRODUCTION
In light of the high costs associated with expert-assisted 3D content creation and the increasing demand across diverse applications, such as gaming, online conferencing, and virtual social presence, 3D content generation has attracted growing attention, particularly in the domain of controllable generation. Traditional approaches [Chan et al. 2021; Deng et al. 2021] in this field have predominantly relied on training category-specific models using large-scale 3D or 2D datasets, limiting their applicability to specific categories. However, recent years have witnessed remarkable progress [Lin et al. 2023; Poole et al. 2022], notably through the incorporation of diffusion priors derived from state-of-the-art 2D generative models. These advancements have facilitated the generation of reasonably accurate 3D content, marking a significant breakthrough in the field.
In recent 2D diffusion-based 3D content generation methods [Poole et al. 2022; Tang et al. 2023], it has become common practice to incorporate text or single-image conditions to achieve controllable generation. Due to the inherently ill-posed nature of the task, researchers rely on a 2D diffusion model [Rombach et al. 2021] as a prior to guide the rendering process, ensuring that all generated images are concentrated within the high-realism regions of the latent space. By confining the generated content to these regions, the overall realism of the produced 3D content is significantly enhanced.
Despite notable advancements, current methods for 3D content generation suffer from two major drawbacks: limited post-generation usability and 2D diffusion bias. The former stems from the use of implicit 3D representations that trade usability for fidelity: users are unable to freely zoom, re-render, or edit the generated 3D content, which hampers its practical applicability and restricts creative possibilities. The latter arises from training the diffusion model on 2D datasets that contain rich lighting and shading variations. These variations enhance the realism of the 2D images but also introduce unwanted effects into the textures of the 3D models, as shown in Figure 4-d.
To address the above issues, we propose HyperDreamer, a 3D content generation and editing framework that is full-range viewable, renderable, and editable. 1) Full-range viewable: A novel custom super-resolution module is introduced, which incorporates pseudo multi-view images to facilitate high-resolution supervision. This module enables the generation of high-resolution textures for 360° content, allowing the creation of visually captivating 3D models from a full range of observation points. 2) Full-range renderable: The Segment Anything Model [Kirillov et al. 2023] is integrated into our generation approach, enabling online 3D semantic segmentation. Leveraging the segmentation mask, we introduce a semantic-aware albedo regularization loss to mitigate the diffusion bias. To enable more realistic rendering in downstream applications, we model the appearance using a spatially varying Bidirectional Reflectance Distribution Function (BRDF) [Chen et al. 2022] and learn reasonable albedo, roughness, and specular properties of the materials, enabling semantic-aware arbitrary material estimation. 3) Full-range editable: An interactive editing method is introduced, enabling users to perform interactive segmentation on 3D meshes effortlessly. By leveraging a normal-to-image diffusion model, HyperDreamer allows users to edit the textures of specific regions on 3D meshes using text-based guidance. With just a few clicks, users can efficiently modify the targeted region, enhancing the editability and flexibility of HyperDreamer.
Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and facilitating user-friendly editing, and show that HyperDreamer surpasses state-of-the-art methods by a significant margin in terms of both 3D generation and editing quality. We believe that HyperDreamer, with its markedly superior quality and flexibility, effectively broadens the accessibility of AI-generated 3D content for practical applications.

RELATED WORKS
Text-guided 3D Generation. Text-guided 3D generation has gained significant attention following the remarkable success of text-to-image generation methods. Dream Fields [Jain et al. 2022] employed the text-image model CLIP [Radford et al. 2021] to optimize NeRFs [Mildenhall et al. 2020] by aligning the text and image embeddings. Building on the same principle, DreamFusion [Poole et al. 2022] replaced CLIP with diffusion models and devised an SDS loss to distill knowledge from the denoising procedure. Magic3D [Lin et al. 2023] further enhanced generation performance by employing a coarse-to-fine framework and using meshes as the 3D representation in the second stage. Fantasia3D [Chen et al. 2023] disentangled the geometry and appearance modeling and introduced the spatially varying bidirectional reflectance distribution function (BRDF) for photo-realistic texture. Our approach utilizes a single image as the guiding condition instead of text, which provides more detailed and specific information and introduces additional challenges.
Single-image Reconstruction. Reconstructing 3D models from a single image is a long-standing topic. Inference-based methods [Choy et al. 2016; Gu et al. 2023; Jun and Nichol 2023; Melas-Kyriazi et al. 2023a; Nichol et al. 2022; Pavllo et al. 2023; Tulsiani et al. 2017; Vasudev et al. 2022; Wu et al. 2023] depend heavily on the datasets used for training, and many of them cannot handle diverse, general objects. Optimization-based methods utilize priors from 2D text-to-image diffusion models to guide the reconstruction process. RealFusion [Melas-Kyriazi et al. 2023b] employs the textual inversion technique to bridge the gap between the reference image and text-conditioned guidance. Make-it-3D [Tang et al. 2023] employs a two-stage framework and leverages high-quality textures extracted from the reference image. Zero-1-to-3 [Liu et al. 2023] synthesizes novel views by fine-tuning diffusion models [Rombach et al. 2021] conditioned on relative camera poses.

Material and Illumination Estimation. Multi-view reconstruction methods [Munkberg et al. 2022] benefit from separately modeling geometry, material, and illumination, whereas this decomposition is a highly ill-posed problem for generation. Previous works such as Fantasia3D [Chen et al. 2023] propose to learn globally varying roughness and metallic distributions, which may not always align with realistic material properties. Building on material estimation approaches from a single image [Sang and Chandraker 2020; Wimbauer et al. 2022], we further propose a more plausible assumption that regions within the same semantic class share similar material properties, enabling spatially varying material modeling while preventing degenerate solutions.
Text-guided 3D Editing. Recently, text-guided image processing has experienced rapid development in both quality and diversity. Text2Mesh [Michel et al. 2022] proposes a neural style field, which uses CLIP to guide the stylization of an initial mesh based on text. TANGO [Chen et al. 2022] follows a similar scheme and uses a BRDF to optimize the appearance. However, their accuracy is insufficient for practical use. More recently, TEXTure [Richardson et al. 2023] leverages an improved depth-to-image diffusion process and applies an iterative scheme that paints a 3D model from different viewpoints. However, none of these methods enable text-guided editing of a local area on a 3D object. We propose an interactive editing method with which users can edit textures in selected 3D regions, or in a global manner, based on text guidance and a few simple clicks.

PRELIMINARIES

3D Representation
Inspired by Magic3D [Lin et al. 2023], we adopt NeRF [Mildenhall et al. 2020] and DMTet [Shen et al. 2021] for the first- and second-stage training, respectively. In the first stage, NeRF represents the scene as an implicit function that maps a 3D location $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ to a volume density $\sigma$ and color $\mathbf{c}$. To render a pixel, NeRF alpha-composites the densities and colors of samples along the ray cast from the camera through the pixel:
$$\hat{C}(\mathbf{r}) = \sum_{i} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \quad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$
where $\delta_i$ is the distance between adjacent samples. To accelerate training, we employ the efficient hash grid encoding from Instant NGP [Müller et al. 2022] instead of pure MLPs.
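For concreteness, the compositing step can be sketched as follows (a minimal PyTorch fragment assuming per-ray densities, colors, and sample spacings have already been queried from the network; variable names are illustrative rather than taken from our implementation):

```python
import torch

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite per-sample densities and colors along one ray.

    sigmas: (N,) volume densities, colors: (N, 3) RGB, deltas: (N,) distances
    between adjacent samples. Returns the rendered pixel color (3,).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # opacity of each segment
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)   # transmittance after each segment
    trans = torch.cat([torch.ones(1), trans[:-1]])       # transmittance before each segment
    weights = alphas * trans                              # compositing weights
    return (weights.unsqueeze(-1) * colors).sum(dim=0)
```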
In the second stage, we adopt DMTet to produce high-resolution outputs without excessive computational and memory requirements. DMTet is a hybrid representation that integrates implicit and explicit surface representations and can efficiently render high-resolution textured meshes with differentiable rasterization. Formally, DMTet models the 3D shape with a deformable tetrahedral grid $(V_T, T)$, where $V_T$ are the vertices of the tetrahedral grid $T$. Each tetrahedron $T_k \in T$ has four vertices $\{v_{i_k} \mid i_k \in \{a_k, b_k, c_k, d_k\}\}$, each associated with an SDF value $s(v_{i_k})$ and a deformation $\Delta v_{i_k}$. The surface mesh is extracted by the differentiable marching tetrahedra algorithm.
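The core operation of this extraction, placing a surface vertex on a sign-changing edge by linear interpolation of the SDF, can be sketched as follows (a didactic fragment; the full algorithm also enumerates the per-tetrahedron triangle configurations, which we omit here):

```python
import torch

def edge_vertex(v_a, v_b, s_a, s_b):
    """Locate the SDF zero crossing on edge (v_a, v_b) by linear interpolation.

    v_a, v_b: (3,) deformed vertex positions; s_a, s_b: SDF values with opposite signs.
    """
    return (v_a * s_b - v_b * s_a) / (s_b - s_a)
```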

Score Distillation Sampling (SDS)
Previous works [Lin et al. 2023; Poole et al. 2022] have leveraged the 2D diffusion model [Rombach et al. 2021] as prior knowledge for text-to-3D generation. The diffusion model $\phi$ learns a denoising function $\epsilon_\phi(\mathbf{x}_t; y, t)$ that estimates the noise $\epsilon$ from the noisy image $\mathbf{x}_t$, the text embedding $y$, and the noise step $t$; it progressively reduces the noise and introduces image structure. To optimize the 3D scene $\theta$, Score Distillation Sampling (SDS) guides all rendered images to match the given text embedding $y$ under the diffusion prior:
$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, \mathbf{x} = g(\theta)) = \mathbb{E}_{t, \epsilon}\Big[ w(t)\big(\epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon\big) \frac{\partial \mathbf{x}}{\partial \theta} \Big],$$
where $g$ denotes the image renderer and $w(t)$ represents a weighting function. In addition to text-conditional SDS, Zero-1-to-3 [Liu et al. 2023] introduces a 3D-aware SDS that conditions on the input view and the relative camera extrinsics to exploit 3D-consistent priors:
$$\nabla_\theta \mathcal{L}_{\mathrm{3D\text{-}SDS}}(\phi, \mathbf{x} = g(\theta)) = \mathbb{E}_{t, \epsilon}\Big[ w(t)\big(\epsilon_\phi(\mathbf{x}_t; \mathbf{x}^r, t, R, T) - \epsilon\big) \frac{\partial \mathbf{x}}{\partial \theta} \Big],$$
where $\mathbf{x}^r$ represents the input view, and $R$ and $T$ are the relative camera rotation and translation from the input view to the desired viewpoint.
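A schematic sketch of an SDS update is given below (the denoiser interface `unet(x_t, t, cond)` is a stand-in for the actual diffusion model, and the weighting and timestep range are simplified assumptions):

```python
import torch

def sds_loss(render, unet, cond, alphas_cumprod):
    """One Score Distillation Sampling step on a differentiably rendered image.

    render: (1, 3, H, W) rendered image (requires grad upstream),
    unet:   noise predictor eps = unet(x_t, t, cond)   (stand-in interface),
    cond:   text or image-view conditioning,
    alphas_cumprod: (T,) diffusion schedule.
    """
    T = alphas_cumprod.shape[0]
    t = torch.randint(20, T - 20, (1,))                      # random noise level
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(render)
    x_t = a_t.sqrt() * render + (1.0 - a_t).sqrt() * noise   # forward diffusion
    with torch.no_grad():
        eps_pred = unet(x_t, t, cond)                        # frozen diffusion prior
    w = 1.0 - a_t                                            # a common weighting choice
    grad = w * (eps_pred - noise)                            # SDS gradient w.r.t. the image
    # Treating grad as a constant target backpropagates it into the 3D representation.
    return (grad.detach() * render).sum()
```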

Segment Anything Model (SAM)
Segment Anything Model (SAM) [Kirillov et al. 2023] is a foundation model for general image segmentation, which supports various segmentation modes such as automatic everything and manual prompting. Taking point prompts as an example, SAM takes an image $I$ and a set of user-specified prompts $\mathcal{P} = (p, l)$ as inputs, and outputs a corresponding segmentation mask $M$, where $p$ is the set of point coordinates and $l$ is the set of corresponding point labels. We use $S$ to denote the SAM model, so that $M_{I, \mathcal{P}} = S(I, \mathcal{P})$.
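With the publicly released SAM code, point-prompted segmentation looks roughly as follows (the checkpoint path is a placeholder, and the wrapper name is ours):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

def segment_with_points(image, coords, labels):
    """image: HxWx3 uint8 array; coords: (N, 2) xy points; labels: (N,) 1=positive, 0=negative."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.asarray(coords, dtype=np.float32),
        point_labels=np.asarray(labels, dtype=np.int32),
        multimask_output=False,
    )
    return masks[0]  # boolean HxW mask, i.e. M = S(I, P)
```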

METHODOLOGY
This section elaborates on the proposed framework in detail. Despite the inherent challenges posed by the ill-posed nature of the problem, HyperDreamer capitalizes on deep priors from the 2D diffusion model, the semantic segmentation model, and the material estimation model, which collectively enable full-range viewing, rendering, and interactive editing. Specifically, (1) to achieve high-fidelity texture generation, we utilize high-resolution pseudo multi-view images for auxiliary supervision, as detailed in Sec. 4.1. (2) For material modeling, we introduce online 3D semantic segmentation and semantic-aware regularizations, which are initialized via material estimation results, as described in Sec. 4.2.
(3) Furthermore, a novel interactive editing approach is proposed in Sec. 4.3 for effortless targeted modification of regions on 3D meshes via interactive segmentation.

360° High-Resolution Texture Generation
In the second training stage, the mesh representation allows for rapid image rendering, unlocking the potential for high-resolution texture maps. However, our guidance model, Zero-1-to-3 [Liu et al. 2023], was originally trained on low-resolution images (256 × 256). The resulting SDS loss fails to handle higher-resolution images, thereby limiting the benefits offered by the mesh representation. The disparity between the resolutions used for training and inference leads to a relatively blurry texture map.
To overcome this challenge, we propose a high-resolution texture generation module. We first select a set of novel views and directly generate several images per view using Zero-1-to-3. Subsequently, we employ a super-resolution network [Rombach et al. 2021] to upscale the sampled images, enabling high-resolution supervision. Since the multi-view images generated by Zero-1-to-3 are not perfectly 3D-consistent, directly applying a per-pixel loss can lead to network instability. Instead, we employ a perceptual loss [Johnson et al. 2016] in feature space. By leveraging the perceptual loss, we minimize the content and style differences between two images without relying on pixel-level alignment, effectively alleviating inconsistencies during training.
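A minimal sketch of this perceptual supervision, assuming VGG-16 features as in [Johnson et al. 2016] (the layer selection and loss weighting are illustrative choices, not necessarily those of our implementation):

```python
import torch
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(rendered, target, layers=(3, 8, 15, 22)):
    """Compare rendered views and super-resolved pseudo views in VGG feature space.

    rendered, target: (B, 3, H, W) in [0, 1]. Feature-space distances tolerate the mild
    misalignment of pseudo multi-view supervision better than a per-pixel loss.
    """
    loss, x, y = 0.0, rendered, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + torch.nn.functional.mse_loss(x, y)
        if i >= max(layers):
            break
    return loss
```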

Semantic

Global semantic segmentation. During the second training stage, we also propose to integrate a new MLP-based branch upon the hash encoding, equipping the framework with a globally consistent mesh segmentation for further semantic regularization. We first use SAM to produce over-segmented results of the reference image (Figure 3 (b)), and then we cluster different semantic parts by thresholding the feature similarity among them before assigning the semantic labels, as shown in Figure 3 (c). We assume that the reference image already contains all of the semantic components of the generated 3D model. We also assign pseudo labels to novel-view images by thresholding the feature similarities, and all of these 2D labels are used to supervise the training of the semantic branch. Detailed implementations are provided in the supplementary materials.

Two issues motivate our albedo regularization. First, the diffusion priors suffer from intrinsic shading and reflectance effects. For instance, Stable Diffusion and Zero-1-to-3 are trained on abundant images with lighting and shading variations, inevitably baking these effects into the textures of the generated 3D models. As shown in Figure 4, given the front view of the teddy bear, Zero-1-to-3 tends to generate a dark back view, as if the light source only existed at the front, leading to a black back on the 3D model, as in Figure 4-b.

Second, the shading and reflectance characteristics of the reference image are baked into the albedo colors of the model via the RGB reconstruction loss, making re-rendering challenging.
We introduce several albedo losses to alleviate the aforementioned problems. For the diffusion bias, we assume that albedo colors within regions sharing the same semantic label are similar. For the $K$ semantic labels, we maintain an albedo library $A = \{a_k\}_{k=1}^{K}$, which is updated regularly during training according to the semantic-region-averaged albedo colors of the reference image. For each novel view, we predict segmentation masks with the semantic branch and then use a Gaussian filter to obtain a weighted average $\tilde{a}_k$ of the predicted albedo colors inside each semantic group. We propose a semantic-aware albedo regularization:
$$\mathcal{L}_{\mathrm{sem}} = \sum_{k=1}^{K} \big\| \tilde{a}_k - a_k \big\|_2^2.$$
Furthermore, we incorporate a state-of-the-art single-image de-rendering framework [Wimbauer et al. 2022] to generate the albedo map of the reference image as an additional albedo supervision.
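A sketch of this regularization under the stated assumptions (the function and argument names are illustrative; the per-label library is maintained as running averages of the reference-view albedo):

```python
import torch

def semantic_albedo_reg(pred_albedo, sem_masks, albedo_lib, weights=None):
    """Pull the region-averaged predicted albedo toward the per-label library color.

    pred_albedo: (H, W, 3) albedo predicted at a novel view,
    sem_masks:   (K, H, W) soft or binary masks from the semantic branch,
    albedo_lib:  (K, 3) library colors averaged from the reference view,
    weights:     optional (H, W) Gaussian weights emphasizing region interiors.
    """
    if weights is None:
        weights = torch.ones(pred_albedo.shape[:2])
    loss = 0.0
    for k in range(sem_masks.shape[0]):
        w = sem_masks[k] * weights
        denom = w.sum().clamp(min=1e-6)
        region_mean = (w.unsqueeze(-1) * pred_albedo).sum(dim=(0, 1)) / denom
        loss = loss + ((region_mean - albedo_lib[k]) ** 2).sum()
    return loss
```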

Appearance
We utilize spherical Gaussians (SGs) [Yan et al. 2012] to approximate the rendering equation in closed form. For a spherical Gaussian with $n$ dimensions, given the lobe axis $\boldsymbol{\xi} \in \mathbb{S}^2$, lobe sharpness $\lambda \in \mathbb{R}^+$, and lobe amplitude $\boldsymbol{\mu} \in \mathbb{R}^n_+$, the spherical function is formulated as
$$G(\boldsymbol{\nu}; \boldsymbol{\xi}, \lambda, \boldsymbol{\mu}) = \boldsymbol{\mu}\, e^{\lambda (\boldsymbol{\nu} \cdot \boldsymbol{\xi} - 1)},$$
where $\boldsymbol{\nu} \in \mathbb{S}^2$ denotes the input direction.

The environment map $L_i(\boldsymbol{\omega}_i)$ is represented as a mixture of SGs:
$$L_i(\boldsymbol{\omega}_i) = \sum_{k=1}^{M} G(\boldsymbol{\omega}_i; \boldsymbol{\xi}_k, \lambda_k, \boldsymbol{\mu}_k).$$
The SVBRDF is divided into a diffuse BRDF and a specular BRDF: $f_r(\boldsymbol{\omega}_i, \boldsymbol{\omega}_o; \mathbf{x}) = a(\mathbf{x})/\pi + f_s(\boldsymbol{\omega}_i, \boldsymbol{\omega}_o; \mathbf{x})$. The diffuse albedo $a(\mathbf{x})$ is modeled as an MLP on top of the multi-resolution hash input encoding, and the specular term at location $\mathbf{x}$ is formulated as
$$f_s(\boldsymbol{\omega}_i, \boldsymbol{\omega}_o; \mathbf{x}) = \mathcal{M}(\boldsymbol{\omega}_i, \boldsymbol{\omega}_o)\, D(\mathbf{h}),$$
where $\mathbf{h}$ is the half vector, $D$ is the normal distribution function represented with SGs, and $\mathcal{M}$ accounts for the Fresnel and shadowing effects. The cosine term in the rendering equation is approximated following [Meder and Brüderlin 2018]:
$$(\boldsymbol{\omega}_i \cdot \mathbf{n}) \approx G(\boldsymbol{\omega}_i; \mathbf{n}, 0.0315, 32.7080) - 31.7003.$$

Therefore, the rendering equation is represented as a product of SGs and can be calculated in closed form. The learnable parameters above include $\{\boldsymbol{\xi}_k, \lambda_k, \boldsymbol{\mu}_k\}_{k=1}^{M}$ for the environment map, the diffuse albedo $a$, and the spatially varying roughness and specular coefficients. We assume that regions with the same semantic label usually share similar materials and enforce channel consistency in roughness and specular.
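For illustration, evaluating an SG lobe and the single-SG cosine approximation above can be written as follows (a didactic fragment, not our full closed-form shading code):

```python
import torch

def sg_eval(v, axis, sharpness, amplitude):
    """Evaluate G(v; xi, lambda, mu) = mu * exp(lambda * (v . xi - 1)) for unit vectors v, axis."""
    return amplitude * torch.exp(sharpness * ((v * axis).sum(dim=-1, keepdim=True) - 1.0))

def cosine_sg(v, normal):
    """Single-SG approximation of the cosine term (v . n), cf. [Meder and Bruederlin 2018]."""
    return sg_eval(v, normal, sharpness=0.0315, amplitude=32.7080) - 31.7003

def environment_radiance(v, lobes):
    """Mixture-of-SGs environment map: sum_k G(v; xi_k, lambda_k, mu_k).

    lobes: iterable of (axis (3,), sharpness scalar, amplitude (3,)) tensors.
    """
    return sum(sg_eval(v, xi, lam, mu) for xi, lam, mu in lobes)
```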
Please refer to the supplementary materials for more details.

Interactive Editing
Editing a 3D model requires complex interaction with 3D shapes while maintaining global consistency to achieve the desired design.
We propose an intelligent and user-friendly interactive 3D editing tool that allows users to quickly specify the target area in 3D space through one-shot selection and to edit its texture based on text guidance.

Interactive Segmentation in Mesh

Interactive segmentation on the 3D mesh enables users to segment any region of the 3D object. In our method, as shown in Figure 5, we use two UV maps to represent the masks of the 3D mesh: $T_s$ for the selected regions and $T_r$ for the remaining regions. Given a target view $v_i$, we render the masks $M_s^{i-1}$ and $M_r^{i-1}$, which act as a point-prompt cache accumulated from the previous $i-1$ views rather than complete segmentation results in the current view. We then sample points with a patch-sampling mechanism, taking points on $M_s^{i-1}$ as positive prompts and points on $M_r^{i-1}$ as negative prompts in each patch, and feed them to SAM to generate refined segmentation results $M_s^{i}$ and $M_r^{i}$. Inverse rendering is then applied to project $M_s^{i}$ and $M_r^{i}$ back onto $T_s$ and $T_r$: we run a gradient-based optimization over the values of $T_s$ (and $T_r$) so that, when rendered through the differentiable renderer and restricted to the mesh mask at view $v_i$, they match the refined masks. The same projection procedure is applied for each subsequent view.
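A sketch of the patch-based prompt sampling used to query SAM from the rendered prompt caches (the patch size is an illustrative choice; `segment_with_points` refers to the SAM wrapper sketched in the preliminaries):

```python
import numpy as np

def sample_patch_prompts(pos_mask, neg_mask, patch=64):
    """Sample one positive and one negative point prompt per patch where available.

    pos_mask, neg_mask: (H, W) boolean masks rendered from the prompt caches T_s, T_r.
    Returns point coordinates (N, 2) in xy order and labels (N,) with 1=positive, 0=negative.
    """
    coords, labels = [], []
    H, W = pos_mask.shape
    rng = np.random.default_rng(0)
    for y0 in range(0, H, patch):
        for x0 in range(0, W, patch):
            for mask, label in ((pos_mask, 1), (neg_mask, 0)):
                ys, xs = np.nonzero(mask[y0:y0 + patch, x0:x0 + patch])
                if len(ys) > 0:
                    i = rng.integers(len(ys))
                    coords.append([x0 + xs[i], y0 + ys[i]])
                    labels.append(label)
    return np.array(coords), np.array(labels)

# refined_mask = segment_with_points(rendered_view, *sample_patch_prompts(pos_mask, neg_mask))
```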
Text-Guided Texture Synthesis

We apply a normal-to-image model $\mathcal{M}_n$ based on ControlNet [Zhang and Agrawala 2023] to directly paint textures that closely match the surface details of the 3D mesh. To address the inconsistency problem, we divide each rendered view into three partitions: a generate partition, a keep partition, and a refine partition. The generate partition is the target region that needs to be painted for the first time. The keep partition is either a previously well-painted target region or a region outside the target region. The refine partition consists of regions painted from previous views, mainly at the junctions of adjacent views, which need further refinement.
To obtain the refine partition, we first perform an opening operation $\mathcal{O}$ on the mask $M$ to eliminate outlying small regions, and then apply erosion $\mathcal{E}$ and dilation $\mathcal{D}$ to extract the boundary band between adjacent views. In $\mathcal{M}_n$, we modify the sampling process in the style of blended diffusion to inject the region-partition information into the denoising process: at each step, the mask explicitly blends the noised latent of the existing texture with the denoised latent estimate, so that painted regions follow the model prediction while kept regions stay anchored to the existing texture. Based on the above texture synthesis method, we can achieve local editing on the 3D mesh. In more detail, we restrict the editing area to the target region by taking the element-wise product of the original texture map $T$ with the mask $T_s$ obtained by interactive segmentation on the mesh. Finally, users can select any region of the 3D object and edit it based on text guidance, as illustrated in Figure 5.
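A sketch of the region partitioning and the per-step latent blending described above (the morphological kernel size and variable names are illustrative; the blending follows the blended-diffusion formulation):

```python
import cv2
import numpy as np

def trimap(mask, kernel_size=9):
    """Split a target-view mask into generate / keep / refine partitions.

    mask: (H, W) uint8 {0, 1} mask of the yet-unpainted target region in this view.
    """
    k = np.ones((kernel_size, kernel_size), np.uint8)
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k)   # drop outlying small regions
    generate = cv2.erode(opened, k)                      # confident interior to paint
    refine = cv2.dilate(opened, k) - generate            # seam band to repaint / refine
    keep = 1 - cv2.dilate(opened, k)                     # already painted or outside the target
    return generate, keep, refine

def blend_latents(mask, z_denoised, z_noised):
    """Blended-diffusion style mixing inside one denoising step.

    mask: (1, 1, h, w) region to (re)paint; z_denoised: model estimate at this step;
    z_noised: latent of the existing texture noised to the current step.
    """
    return mask * z_denoised + (1.0 - mask) * z_noised
```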

Figure 8: Analysis of the material modeling. We show an example of the output roughness and specular maps in (a), together with its rendering results in Blender. We show how the albedo loss at the reference view helps alleviate the learning of shading and reflectance in the albedo texture in (b).

EXPERIMENTS

We build on Instant NGP [Müller et al. 2022] and adopt a two-stage training pipeline: a coarse NeRF is trained for 50 epochs in the first stage, guided by the Zero-1-to-3-based SDS loss and other regularization terms such as depth and normal losses. We use MiDaS [Ranftl et al. 2021] and OmniData [Eftekhar et al. 2021] to extract the depth and normal estimations, respectively. We train a DMTet [Shen et al. 2021] for 100 epochs in the second stage based on the first-stage model. The SR, semantic, and material modules are integrated into the second stage only. More details are presented in the supplementary materials.

Qualitative Comparisons
We show qualitative comparisons with several state-of-the-art works in Figure 6, where we present both the reference view and the back view of the object for each method. The results from Shap-E are relatively worse than those of the other optimization-based methods. The instances generated by NeuralLift-360 are small in size and low in quality, although the basic semantics are preserved. RealFusion and Zero-1-to-3 both leverage the reference-view RGB reconstruction loss as a constraint and thus maintain high fidelity to the reference image; however, RealFusion suffers severely from the multi-face problem, and the results from Zero-1-to-3 are blurry. Our method achieves the highest quality in both the reference view and the back view, producing realistic and reasonable generations.

Quantitative Comparisons
We adopt three metrics for quantitative comparisons: 1) LPIPS [Johnson et al. 2016] evaluates the reconstruction quality of the reference-view image; 2) Contextual distance [Mechrez et al. 2018] evaluates the pixel-level distance between the rendered novel-view images and the reference image; 3) CLIP-Score [Radford et al. 2021] measures the semantic-level distance between the novel-view images and the reference image. We select 20 images online with a wide range of diversity, and we select 10 largely complete instances from the DTU dataset [Aanaes et al. 2016]. The results for the two datasets are shown in Table 1 and Table 2, respectively. Our model outperforms the comparison methods on all three metrics by a large margin, quantitatively demonstrating the effectiveness of the pipeline.

Resolution. We show how our SR module works in Figure 7. It largely enhances the texture details and realism of our model, enabling high-resolution zoom-in views in comparison with other methods.

Materials. Examples of the generated roughness and specular maps are shown in Figure 8-a, where we observe that the material properties are highly correlated with the semantic label of the region. We also show how the albedo loss helps decompose the albedo texture from the reference view.

Editing. We show in Figure 9 that a naive scheme for segmentation on the mesh, which only inputs positive prompts and randomly samples from the point-prompt cache, has a high probability of failure, especially when dealing with discrete and complex regions. In contrast, our scheme, which samples both positive and negative prompts per patch and feeds both into SAM, is more robust.

CONCLUSION
This paper introduces HyperDreamer, a framework that enables hyper-realistic 3D content generation and editing from a single image. In contrast to previous works, the 3D content generated by our method is full-range viewable, renderable, and editable. Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and enabling user-friendly editing. We believe that HyperDreamer holds promise for advancing 3D content creation and editing, and will prove practical for both academic and industrial usage.

"
Turn it to red coat, Chinese style" "Turn it to brown hat"

Figure 2: Overview of our 3D generation and editing pipeline. We introduce diffusion priors, semantic priors, and de-rendering priors into this highly under-constrained problem to enable high-resolution textures with material modeling and interactive editing after the generation.
Figure 3: SAM at the generation stage. We effectively cluster concise semantic groups compared to the raw SAM results.

Figure 4: Diffusion bias. The 2D diffusion bias in (d) leads to 3D generation failures in (b), which can be alleviated by the albedo regularization in (c).

Figure 5: Interactive editing process. Users can select the regions of interest, and our method then outputs the texture mask of the target area to the texture synthesis pipeline for text-guided editing.

Figure 6: Qualitative comparisons. HyperDreamer generates a high-fidelity reference view and more realistic and reasonable results at novel views.

Figure 7: Ablation on the super-resolution (SR) module. High-frequency details on textures are generated under SR supervision.

Figure 9: Analysis of our segmentation-in-mesh scheme. Our method handles complex circumstances better.

Figure 10: Additional results by HyperDreamer with more views. Images in the last column are the specular and roughness maps, respectively (from top to bottom).


Table 1: Quantitative results on our data.

Table 2: Quantitative results on the DTU dataset.
RealFusion [Melas-Kyriazi et al. 2023b] leverages the SDS loss based on Stable Diffusion [Rombach et al. 2021] and an RGB reconstruction loss with the reference view; in comparison, Zero-1-to-3 [Liu et al. 2023] leverages the SJC loss based on its viewpoint-conditioned model as the guidance.