An Implicit Neural Representation for the Image Stack: Depth, All in Focus, and High Dynamic Range

In everyday photography, physical limitations of camera sensors and lenses frequently lead to a variety of degradations in captured images such as saturation or defocus blur. A common approach to overcome these limitations is to resort to image stack fusion, which involves capturing multiple images with different focal distances or exposures. For instance, to obtain an all-in-focus image, a set of multi-focus images is captured. Similarly, capturing multiple exposures allows for the reconstruction of high dynamic range. In this paper, we present a novel approach that combines neural fields with an expressive camera model to achieve a unified reconstruction of an all-in-focus high-dynamic-range image from an image stack. Our approach is composed of a set of specialized implicit neural representations tailored to address specific sub-problems along our pipeline: we use neural implicits to predict flow to overcome misalignments arising from lens breathing, depth and all-in-focus images to account for depth of field, as well as tone mapping to deal with sensor responses and saturation, all trained using a physically inspired supervision structure with a differentiable thin lens model at its core. An important benefit of our approach is its ability to handle these tasks simultaneously or independently, providing flexible post-editing capabilities such as refocusing and exposure adjustment. By sampling the three primary factors in photography within our framework (focal distance, aperture, and exposure time), we conduct a thorough exploration to gain valuable insights into their significance and impact on overall reconstruction quality. Through extensive validation, we demonstrate that our method outperforms existing approaches in both depth-from-defocus and all-in-focus image reconstruction tasks. Moreover, our approach exhibits promising results in each of these three dimensions, showcasing its potential to enhance captured image quality and provide greater control in post-processing.

Fig. 1. Our approach is capable of accurately reconstructing a canonical all-in-focus (AiF), high-dynamic-range (HDR) radiance map alongside depth information, using a sparse Time-Aperture-Focus (TAF) stack as input. This gives full post-processing control over focus, aperture, and exposure conditions.

INTRODUCTION
Capturing and reproducing the real world is a fundamental goal of image processing. However, due to imaging device limitations, capturing all relevant information in a single shot is often challenging. Image fusion techniques improve image quality by combining multiple images. Traditionally, all-in-focus (AiF) image reconstruction [Bouzos et al. 2019; Li et al. 2013; Liu et al. 2017; Si et al. 2023; Suwajanakorn et al. 2015; Zhang and Levine 2016], depth estimation [Favaro 2010; Hazirbas et al. 2019; Lin et al. 2013; Maximov et al. 2020; Moeller et al. 2015; Si et al. 2023; Suwajanakorn et al. 2015; Won and Jeon 2022; Yang et al. 2022], and high-dynamic-range (HDR) reconstruction [Debevec and Malik 1997; Mertens et al. 2007] are treated separately, requiring multiple images with varying focus and exposure settings. This process is time-consuming and leads to redundant image capture.
In this work, we address these challenges by focusing on the reconstruction of AiF HDR radiance maps from a sparse image burst, where each image may contain entangled information from both defocus blur and different exposure conditions. Our approach simplifies image capture, improves fusion efficiency, and enables post-editing capabilities such as refocusing, bokeh reproduction, and exposure adjustment. We combine an implicit neural representation with an expressive camera model to reconstruct both the AiF HDR image and the corresponding depth from the sparse image burst without the need for ground-truth supervision.
The problem we address is notably challenging and requires a comprehensive approach due to the interplay of three crucial elements in the image capture process: focal distance, aperture, and exposure time, which we model in our approach to simulate the imaging pipeline (Sec. 3). Focal distance determines which parts of the scene will appear sharp and in focus. During focal distance adjustment in a focal stack, lens breathing typically occurs, leading to pixel misalignment and ghosting artifacts in the captured images. The aperture controls the lens opening, affecting both the amount of light reaching the sensor and the depth of field. Our framework incorporates a differentiable thin lens formulation to model both focal distance and aperture. Additionally, we introduce a novel approach to mitigate lens breathing using an implicit flow, which enables warping each image in the focal stack to a canonical view, eliminating the need for a reference image and preserving all pixel information. While a few existing learning-based methods have attempted to address lens breathing, they focus on a single specific problem such as depth from defocus (DfD) [Moeller et al. 2015; Suwajanakorn et al. 2015; Won and Jeon 2022] and often result in a loss of pixel information. Our framework also tackles the problem of DfD by leveraging a thin-lens formulation as a physical constraint to model shallow depth of field in the input images, outperforming the state of the art in both depth estimation and AiF image reconstruction (Sec. 4.1). Finally, exposure time controls the light exposure of the image sensor. Longer exposures capture more light, resulting in a brighter image. However, extended exposures can also saturate the sensor and lead to clipping. To model this process, we introduce an implicit tone mapper that supports the reconstruction of the HDR radiance map from the different exposures in the low-dynamic-range (LDR) input images.
Relying on these components, our framework achieves high flexibility. In contrast to previous approaches [Debevec and Malik 1997; Hasinoff and Kutulakos 2007; Mertens et al. 2007; Si et al. 2023; Suwajanakorn et al. 2015; Won and Jeon 2022; Yang et al. 2022], it supports as input any set of images with different unstructured combinations of focal distance, aperture, and exposure to learn an AiF HDR image and its corresponding depth in an unsupervised manner. We term this stack of unstructured images the Time-Aperture-Focus (TAF) stack. Our approach can handle an arbitrary number of input images and produces outstanding results with only five images. To achieve this, the estimated depth serves as an input to our lens model, which produces a differentiable disk kernel. This kernel is applied to blur the AiF image, effectively modeling the combined effects of focus and aperture. In a final step, the blurred image is tone mapped to produce a defocused LDR image. Then, a reconstruction loss is computed by comparing this output image to the corresponding input image from the TAF stack, guiding the iterative refinement of the predicted AiF image and depth map. After optimization, our framework further enables post-editing capabilities such as focus, aperture, and exposure editing (Fig. 1). More results can be found in the supplementary material and video.
Further, we extensively investigate the impact of different combinations of focal distance, aperture, and exposure on the overall reconstruction performance of our method (Sec. 4.2), revealing two effective strategies for improving AiF HDR image reconstruction and depth estimation. On the one hand, utilizing a moderate aperture size and multiple exposure times with focal sweeping yields better AiF HDR image reconstruction. On the other hand, maintaining a fixed exposure time while varying the aperture size during focal sweeping leads to improved depth estimation.
In summary, we present the following contributions:
• We introduce a unified implicit neural representation that takes as input any set of images with different unstructured combinations of focal distance, aperture, and exposure (a TAF stack), guided by a physically inspired supervision structure centered around a differentiable thin lens model. This representation enables the reconstruction of an AiF HDR image and depth, and supports flexible post-editing, such as refocusing, as well as aperture and exposure adjustment in a well-disentangled manner.
• Our proposed method achieves outstanding results in DfD and AiF image reconstruction tasks, outperforming the state of the art.
• To support our analysis and validation, we introduce a new dataset of focal stacks comprising 10 synthetic scenes simulating lens-breathing effects (including five HDR scenes), and 25 captured scenes using various cameras and lenses.

RELATED WORK
We first discuss previous work addressing the problem of obtaining depth and AiF images from a focal stack, which is one of the main applications of our framework. Then, we focus on implicit neural representations and their recent use in applications close to ours.

Depth and AiF from Focal Stacks
Traditional depth estimation from a focal stack can be categorized into depth from focus (DfF) and depth from defocus (DfD) methods. DfF approaches face challenges in determining a suitable focus criterion that accurately captures the focus information across various scenes and conditions, and in effectively detecting the highest focus values and accurately localizing the corresponding depths. DfD approaches, on the other hand, face challenges in accurately capturing depth information from defocused images and in balancing depth estimation accuracy with image resolution. Despite notable advancements [Favaro 2010; Lin et al. 2013; Moeller et al. 2015; Suwajanakorn et al. 2015], the quality of depth estimation in traditional methods still falls short of achieving satisfactory results.
With the advent of deep learning, the differences between DfF and DfD have diminished, as both approaches usually utilize focal stacks as input and ground-truth depth as supervision to train end-to-end models [Hazirbas et al. 2019; Maximov et al. 2020; Won and Jeon 2022; Yang et al. 2022]. Wang et al. [2021] presented a novel approach for jointly estimating depth and an AiF image from an input focal stack by designing a shared common network that can be trained either supervised with ground-truth depth maps or unsupervised with only ground-truth AiF images. Recently, in concurrent work, Si et al. [2023] proposed a self-supervised framework for DfD which also considers a thin lens model for predicting both depth and AiF images from focal stacks while being supervised through input reconstruction. However, their end-to-end network is trained on a synthetic focal stack dataset, which may not generalize well to real-world scenarios. Additionally, they do not consider lens breathing and struggle with a small number of input images. We show and discuss the limitations of their approach through quantitative and qualitative comparisons.
Despite significant advancements in learning-based approaches, multiple problems remain unsolved. For instance, obtaining real focal stacks with ground-truth depth for training is challenging due to factors like lens breathing. Although synthetic datasets can circumvent this issue, they often introduce domain gaps [Si et al. 2023; Yang et al. 2022]. Moreover, while many learning-based methods treat DfD and DfF as a regression problem, defocus blur can be more effectively modeled as a physical phenomenon that holds implicit cues for direct depth inference. Our approach does not rely on ground-truth data for supervision during training and, unlike previous works, is highly flexible, as it can handle diverse combinations of focus, aperture, and exposure in the TAF stack as input to reconstruct the AiF HDR image and corresponding depth. Further, our approach outperforms state-of-the-art approaches in reconstructing depth from the focal stack.

Implicit Neural Representations and Their Applications
Recent literature has demonstrated the potential of fully connected networks for memory-efficient and continuous implicit representations, known as implicit neural representations. Sitzmann et al. [2020] demonstrated how to use these representations to effectively model various signals, including images and 3D shapes, in particular using periodic activation functions and Multi-Layer Perceptrons (MLPs). Neural Radiance Fields (NeRF) [Mildenhall et al. 2021] employ MLPs to parametrize 5D radiance fields, enabling high-fidelity novel-view synthesis and 3D reconstruction. Building upon NeRF, Dark-NeRF [Mildenhall et al. 2022] uses raw images to train the model and maximize the available information for tasks like denoising, HDR reconstruction, and refocusing. HDR-NeRF [Huang et al. 2022] directly learns HDR from LDR inputs and uses a tone mapper for re-projecting HDR content to different exposure intervals. Recently, Jun-Seong et al. [2022] explored HDR radiance fields using a plenoptic function as a scene representation, eliminating the need for exposure information during training. Wu et al. [2022] identified that the quality of NeRF decreases when the input images have shallow depth of field (DoF) and introduced a differentiable circle of confusion (CoC) representation to simulate radiance scattering, allowing the synthesis of AiF images and DoF rendering. For recovering clear scene representations from blurred images, Ma et al. [2022] learned a deformable kernel as a degradation model. Other implicit neural representation applications include image alignment and layering [Nam et al. 2022], video fitting with latent codes [Feng et al. 2022], video editing using layered 2D atlases parameterized by MLPs [Kasten et al. 2021], and improving mixed-reality rendering coherence through learned camera characteristics [Mandl et al. 2021]. Our framework draws inspiration from these works and leverages implicit neural representations.

METHOD
Our goal is to design an implicit neural representation that, by learning from an image stack, models the AiF HDR image and depth map of the scene. The pipeline of our approach is shown in Fig. 2. We first introduce how we represent the depth and the AiF HDR image (Sec. 3.1). Importantly, we present an implicit flow model to compensate for the lens breathing effect in the image stack. Next, we describe our differentiable thin lens model that renders a defocused HDR image from the depth and AiF image (Sec. 3.2). We then introduce an implicit tone mapper that maps the defocused HDR image to an LDR image conditioned on the level of exposure (Sec. 3.3). Finally, we discuss the loss functions used to train the implicit neural representation (Sec. 3.4). After training, the representation can be used to synthesize images under new focal distances, apertures, and exposures.

Depth and Image Representation with Implicit Flow
Input to our system is an image stack $\{I_i\}_{i=1}^{n}$. For each image $I_i$, we also have its corresponding camera metadata including focal distance, aperture, and exposure, which differ between images in the stack. We aim to learn a shared depth map $D$ and AiF HDR image $H$ from the image stack.
In our pipeline, we represent the depth map $D$ and AiF HDR image $H$ using coordinate-based MLPs. A simple design is to use two MLPs that take pixel coordinates $(x, y)$ as input and output the depth $s_2 \in \mathbb{R}^{+}$ and RGB color $\mathbf{c} \in \mathbb{R}^{3}$, respectively. However, this design does not take into account lens breathing, the phenomenon where changing the focus distance causes a slight change in the field of view (FoV) [Gross 2005]. This effect occurs due to the movement of the lens elements inside the lens barrel to adjust the focus distance, and can be observed even when the camera is stationary. Consequently, it introduces misalignments between the same pixel coordinates of images with different focus distances.
Previous techniques for addressing lens breathing typically use the smallest-FoV image in a focal stack as reference and align all other images in the stack to it by cropping and resizing using feature extraction and matching [Suwajanakorn et al. 2015; Won and Jeon 2022]. However, this approach may cause a loss of information in the focal stack and can only generate an AiF image for a single FoV. In this paper, we propose a novel approach to address lens breathing by learning a canonical view from the focal stack that encompasses all pixel information. This approach employs an implicit flow to project the canonical view onto each FoV in the stack, instead of aligning to a specific reference image. To account for lens breathing, we embed the focus distance $s_1$, which is the key factor in lens breathing, into the coordinate input, and use an MLP $M_F$ as the flow predictor to produce an offset $(\Delta x, \Delta y) = M_F(x, y, s_1)$ in the $x$ and $y$ coordinates. This continuous offset compensates for the lens breathing effect and warps pixel locations to canonical coordinates:
$$(x', y') = (x + \Delta x,\; y + \Delta y). \quad (2)$$
This way, pixels corresponding to different focus distances are aligned in the canonical space. An example of the learned flow is shown in Fig. 10.
The canonical coordinates are then used to query the depth and HDR color with two MLPs: $M_D : (x', y') \mapsto s_2$ and $M_H : (x', y') \mapsto \mathbf{c}$. Thus, the full depth map $D$ and AiF image $H$ can be obtained by querying the MLPs on the coordinate grid of all pixels, as illustrated in the left half of Fig. 2.
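To make this concrete, the following PyTorch sketch shows how the three MLPs interact. It is an illustration under our own assumptions, not the authors' implementation: layer widths, ReLU activations, and the softplus/exponential output mappings are hypothetical, and positional encodings are omitted.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=128, depth=4):
    """Plain coordinate MLP; the paper's networks may differ in size."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    return nn.Sequential(*layers, nn.Linear(d, d_out))

flow_mlp  = mlp(3, 2)  # M_F: (x, y, s1) -> (dx, dy)
depth_mlp = mlp(2, 1)  # M_D: (x', y')   -> depth s2
color_mlp = mlp(2, 3)  # M_H: (x', y')   -> AiF HDR radiance

def query(xy, s1):
    # xy: (B, 2) pixel coordinates; s1: (B, 1) focus distance of the shot.
    dxy = flow_mlp(torch.cat([xy, s1], dim=-1))   # lens-breathing offset
    xy_canonical = xy + dxy                       # canonical coords (Eq. 2)
    s2 = nn.functional.softplus(depth_mlp(xy_canonical))  # keep depth > 0
    hdr = torch.exp(color_mlp(xy_canonical))      # keep radiance > 0
    return s2, hdr
```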

Differentiable Thin Lens Model
Given the AiF HDR image and depth map, our next step is to estimate the defocused HDR image. Defocus blur is a prevalent phenomenon in everyday photography that arises when a camera lens fails to focus light rays onto a single point on the image sensor. The thin-lens model [Potmesil and Chakravarty 1981] can be used to explain defocus blur in an image, where the camera lens is parameterized by a focal length $f$, focus distance $s_1$, and aperture $N$ (i.e., F-number). When an object is in focus, the distance between the lens and the image sensor is adjusted to ensure that the object's image is formed on the image sensor. However, if the object is not at the focal distance, the lens will produce a disk on the sensor rather than a single point. This disk is referred to as the circle of confusion (CoC), and its diameter $c$ can be computed as
$$c = \frac{|s_2 - s_1|}{s_2} \cdot \frac{f^2}{N (s_1 - f)}. \quad (3)$$
Once we have obtained $c$, we can simulate the defocus blur via a spatially-varying convolution. Let $H(x, y)$ be the pixel value of $H$ at $(x, y)$ and $w(x, y, u, v)$ be the weight of a spatially-varying kernel, where $x$ and $y$ represent the current position of the convolution in image space, and $u$ and $v$ are the location within the kernel. The defocused image $B$ of the spatially-varying convolution is given by
$$B(x, y) = \sum_{u, v} w(x, y, u, v)\, H(x + u, y + v). \quad (4)$$
Following the discussion above, we model $w$ using a disk of unit energy with diameter $c$. Notice that a disk kernel is a closer approximation to the defocus operation than a commonly used Gaussian kernel [Potmesil and Chakravarty 1981]. However, a conventional disk kernel is discontinuous and therefore not differentiable with respect to its diameter $c$ [Bangaru et al. 2021]. Consequently, the gradients cannot be back-propagated to the depth $s_2$, and thus the depth map cannot be optimized. In this paper, we propose an approach inspired by Gwosdek et al. [2012] to address this issue by introducing a differentiable disk kernel. This kernel enables smooth optimization while accurately representing the defocus blur effect.
To connect the continuous CoC diameter $c$ to the discrete kernel grid, we introduce a soft boundary of the kernel as follows:
$$w(x, y, u, v) = \operatorname{clip}\!\left(\frac{1}{2} + \frac{c}{2} - r,\; 0,\; 1\right), \quad (5)$$
where $r$ represents the distance from $(u, v)$ to the center of the disk, and $(u, v)$ are evaluated on the discrete pixel grid. The kernel is subsequently normalized to unit energy:
$$w(x, y, u, v) \leftarrow \frac{w(x, y, u, v)}{\sum_{u', v'} w(x, y, u', v')}. \quad (6)$$
Notice how this is a simple and efficient approximation of anti-aliasing, which does not require integrating over the pixel area. This allows the gradients to be back-propagated through $c$, ensuring smooth optimization of the depth map. Fig. 3 illustrates our kernels.
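The following sketch illustrates this rendering step. The formulas follow Eqs. 3, 5, and 6 as reconstructed above; the exact one-pixel linear ramp of the soft boundary is our assumption.

```python
import torch

def coc_diameter(s2, s1, f, N):
    # Circle-of-confusion diameter (Eq. 3); s2: scene depth, s1: focus
    # distance, f: focal length (all in metres), N: F-number.
    return (f * f / N) * (s2 - s1).abs() / (s2 * (s1 - f))

def soft_disk_kernel(c_px, radius):
    # Differentiable disk kernel with a soft boundary (Eq. 5), normalized
    # to unit energy (Eq. 6). c_px: CoC diameter in *pixels* (tensor);
    # radius: integer kernel half-size.
    u = torch.arange(-radius, radius + 1, dtype=torch.float32)
    r = torch.sqrt(u[None, :] ** 2 + u[:, None] ** 2)  # distance to center
    w = torch.clamp(0.5 + 0.5 * c_px - r, 0.0, 1.0)    # one-pixel ramp
    return w / w.sum()

# Gradients flow from the blurred image back to c_px and, via Eq. 3, to
# the estimated depth s2, which a hard (binary) disk kernel would not allow.
```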

Implicit Tone Mapper
In order to infer HDR information from an LDR image stack with different levels of exposure, we require a tone mapping operation that projects HDR to LDR. Typically, the following function is used for tone mapping [Liu et al. 2020; Wang et al. 2022]:
$$\hat{I} = \operatorname{clip}\!\big(g(e \cdot B)\big), \quad (7)$$
where $\hat{I}$ is an LDR image, and $e$ quantifies the level of exposure, i.e., the amount of light that reaches the sensor, depending on aperture and exposure time. The clip operation is employed to constrain the range of values within the interval $[0, 1]$, and $g(\cdot)$ is the gamma correction. However, there is still a gap between this approximation and real camera response curves, and we found that directly incorporating this explicit model into our framework produces artifacts (Sec. 4.3).
To tackle this challenge, we use an implicit tone mapper $M_T$, again represented by an MLP:
$$\hat{I}_k = M_T(B_k, e), \quad k \in \{r, g, b\}, \quad (8)$$
where subscripts $k$ denote color channels, i.e., each channel is processed independently using the same network. To ensure an LDR output, we employ a hyperbolic tangent as the last layer of the MLP and linearly remap its outputs to $[0, 1]$.
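A minimal sketch of such a tone mapper follows; the layer sizes are assumptions, and the exposure input could equally be given in the log domain.

```python
import torch
import torch.nn as nn

class ImplicitToneMapper(nn.Module):
    """M_T: maps one HDR channel value plus exposure level e to LDR (Eq. 8)."""
    def __init__(self, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1), nn.Tanh())

    def forward(self, hdr_channel, e):
        # hdr_channel, e: (B, 1) tensors; output lies in [0, 1].
        x = torch.cat([hdr_channel, e], dim=-1)
        return 0.5 * (self.net(x) + 1.0)  # linearly remap tanh to [0, 1]

# The same instance is applied to R, G, and B separately; this weight
# sharing across channels prevents color shifts (see Sec. 4.3).
```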

Loss
We jointly optimize all four MLPs using the loss function
$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \alpha\, \mathcal{L}_{\text{VGG}} + \beta\, \mathcal{L}_{\text{reg}} + \gamma\, \mathcal{L}_{\text{TV}}, \quad (9)$$
where $\mathcal{L}_{\text{rec}}$ is the MSE loss, $\mathcal{L}_{\text{VGG}}$ is a perceptual loss [Johnson et al. 2016], $\mathcal{L}_{\text{reg}}$ is a regularization term for the flow, and $\mathcal{L}_{\text{TV}}$ is a total variation regularization penalty for the depth map. $\alpha$, $\beta$, and $\gamma$ are weights to balance the components, which are set to $\alpha = 0.01$, $\beta = 0.5$, and $\gamma = 0.05$ in all our experiments. The combination of MSE and perceptual loss is a common practice [Feng et al. 2022; Liu et al. 2020]. For the perceptual loss $\mathcal{L}_{\text{VGG}}$, we compute squared differences of intermediate feature maps of a pre-trained VGG16 [Simonyan and Zisserman 2014] network (layers 5, 10, and 16). To enforce smoothness in our implicit flow, we include the regularization term
$$\mathcal{L}_{\text{reg}} = \big\| J_{M_F} \big\|_2^2, \quad (10)$$
where $J_{M_F}$ is a Jacobian matrix assembled from the gradients of $M_F$ with respect to all input coordinates. Finally, to encourage smoothness in the estimated depth values, we apply the total variation loss $\mathcal{L}_{\text{TV}}$ on the depth map $D$, penalizing depth differences of neighboring pixels in both the horizontal and vertical direction.
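Assembled into code, the loss could look as follows. This is a sketch: the flow Jacobian is passed in precomputed (e.g., via autograd), ImageNet input normalization for VGG is omitted, and the helper names are hypothetical.

```python
import torch
import torchvision

# Pre-trained VGG16 feature extractor; layers 5, 10, and 16 as in the paper.
_VGG = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
_LAYERS = {5, 10, 16}

def perceptual_loss(x, y):
    # x, y: (1, 3, H, W) LDR images in [0, 1].
    loss, fx, fy = 0.0, x, y
    for i, layer in enumerate(_VGG):
        fx, fy = layer(fx), layer(fy)
        if i in _LAYERS:
            loss = loss + ((fx - fy) ** 2).mean()
        if i >= max(_LAYERS):
            break
    return loss

def total_loss(pred, target, flow_jac, depth, a=0.01, b=0.5, g=0.05):
    # pred, target: rendered vs. captured LDR images (1, 3, H, W);
    # flow_jac: Jacobian of the flow MLP w.r.t. its input coordinates;
    # depth: (H, W) estimated depth map.
    l_rec = ((pred - target) ** 2).mean()                 # MSE
    l_reg = (flow_jac ** 2).sum()                         # Eq. 10
    l_tv = (depth[:, 1:] - depth[:, :-1]).abs().mean() + \
           (depth[1:, :] - depth[:-1, :]).abs().mean()    # TV on depth
    return l_rec + a * perceptual_loss(pred, target) + b * l_reg + g * l_tv
```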
After training, we can synthesize new images by feeding any focus distance, aperture, and exposure into the pipeline. Results are presented in Fig. 4. Please refer to the supplementary material for additional implementation details, including network architectures and training details.

EVALUATION
In this section, we demonstrate the advantages of our method in a series of experiments. In Sec. 4.1, we evaluate depth-from-defocus (DfD) and all-in-focus (AiF) reconstruction. In Sec. 4.2, we extensively investigate the three factors focal distance, aperture, and exposure time, which our approach, for the first time, allows us to consider in a unified framework. We conduct ablation studies in Sec. 4.3 and give an overview of runtime performance in Sec. 4.4.

Depth-from-Defocus and All-in-Focus Reconstruction
Here, we consider the one-dimensional problem of DfD reconstruction by comparing to the state-of-the-art methods DFFMobile [Suwajanakorn et al. 2015], DFFWild [Won and Jeon 2022], DEReD [Si et al. 2023], and DFFV [Yang et al. 2022]. Among these methods, DFFMobile and DEReD are also able to reconstruct AiF images, so we consider them as baselines for this task as well. As there is no variation in exposure in either task, we do not use our tone mapper in this set of experiments.
To facilitate this comparison, we consider three datasets: (a) a novel rendered dataset, consisting of 10 scenes, generated using path tracing, providing ground-truth data for quantitative evaluation; (b) a synthetic dataset generated from the RGBD NYUv2 corpus [Silberman et al. 2012] in conjunction with the camera simulator from DFFWild [Won and Jeon 2022], consisting of 100 scenes; and (c) a novel real dataset for qualitative analysis, captured using a Canon RP camera with EF 50mm/F1.8, RF 24-105mm/F4.0, and RF 85mm/F2.0 lenses, as well as a Canon 6D2 camera with an EF 24-105mm F/3.5-5.6 lens, consisting of 25 scenes. All datasets contain lens breathing. For each experiment, we select five images as input to the methods, each with a different focus distance. In the supplementary material, we provide further details about the datasets.
Most of our competitors take lens breathing into account by choosing the input image with the smallest effective FoV as reference and performing alignment with respect to it, cropping away pixels close to the image boundaries in all other images. Our approach is markedly different in this regard and does not require any cropping. However, to facilitate meaningful comparisons, in the following we consider only pixels that are visible for all methods.
We use the mean absolute error (MAE), the mean squared error (MSE), and the absolute relative distance (Abs-Rel) as metrics for the DfD evaluations, which are commonly used in this field [Eigen et al. 2014; Laina et al. 2016; Li et al. 2017]. For the evaluation of AiF results, we utilize the PSNR and SSIM [Wang et al. 2004] metrics.
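For reference, these depth metrics are conventionally computed as in the following sketch (function name hypothetical; evaluation is restricted to the mutually visible pixels discussed above):

```python
import numpy as np

def dfd_metrics(pred, gt):
    """Standard DfD error metrics over valid pixels (gt > 0 assumed)."""
    diff = pred - gt
    return {
        "MAE": float(np.abs(diff).mean()),
        "MSE": float((diff ** 2).mean()),
        "Abs-Rel": float((np.abs(diff) / gt).mean()),
    }
```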
Based on the results presented in Tables 1 and 2, our method demonstrates superior performance compared to other approaches on the DfD task. Similarly, from the results shown in Table 3, our method outperforms the baselines on the AiF task.
In Figures 8 and 9, we showcase qualitative results for DfD and AiF, confirming our numerical evaluation. We observe that DFFWild and DFFV, which require full supervision, suffer from a domain gap: while they can generate visually plausible results on the rendered dataset, their performance deteriorates significantly on the other datasets. On the other hand, we see that label-free methods like DFFMobile and DEReD generally struggle to produce high-quality results across the spectrum. Regarding the AiF task, our results appear sharper and more natural compared to other methods. DFFMobile's cropping-based approach to account for lens breathing produces undesirable black boundaries, while DEReD, which does not account for lens breathing, tends to produce noticeable ghosting artifacts. For more qualitative results on both AiF and DfD tasks, please refer to the supplementary material.
We are further interested in the performance of all methods when the provided data does not exhibit lens breathing. To this end, we re-ran all evaluations on a variant of the NYU dataset that does not contain this effect. On this simpler dataset, our competitors significantly increase their performance, in particular DFFMobile and DFFWild. The performance of our method increases only marginally, but it still outperforms all competitors, indicating robustness. We provide all details in the supplementary material.

Exploration of TAF Sampling Strategies
Here, we investigate different sparse sampling strategies of the TAF cube. We are interested in devising strategies that result in highest-quality reconstruction and disentanglement, providing guidelines for capturing TAF stacks.
The final photometric exposure $e$ of an image is determined by both aperture and exposure time:
$$e = \frac{t}{N^2}, \quad (11)$$
where $t$ is the exposure time and $N$ the F-number. Our TAF formulation allows us to disentangle these factors [Jacobson et al. 2000], while previous methods often struggle with this decomposition [Huang et al. 2022; Wang et al. 2022]. In our exploration setup, we simplify Eq. 11 by switching to the log domain, resulting in the linear relationship (Fig. 5a)
$$\log_2 e = T - 2A, \quad (12)$$
where $T = \log_2(t)$ and $A = \log_2(N)$. Based on this parameterization, we conduct a comprehensive set of experiments. We base our evaluation on five path-traced scenes. For each scene, we render $5 \times 5 \times 5 = 125$ samples on a parallelepiped in the linearized TAF space (Fig. 5b). Please refer to the supplementary material for details of the dataset creation.
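As a quick numeric sanity check of Eq. 12 (the values below are hypothetical settings): trading one stop of exposure time against one stop of aperture keeps $\log_2 e$ constant, i.e., both settings lie on the same iso-exposure line of Fig. 5a.

```python
import math

def log2_exposure(t, N):
    # Eq. 12 in the log domain: log2 e = log2 t - 2 log2 N.
    return math.log2(t) - 2.0 * math.log2(N)

# Halving the exposure time while opening the aperture by one stop
# (N -> N / sqrt(2)) leaves the photometric exposure unchanged:
e1 = log2_exposure(1 / 50, 2.0 * math.sqrt(2.0))
e2 = log2_exposure(1 / 100, 2.0)
assert abs(e1 - e2) < 1e-9
```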
In each of our experiments, we sample five images according to a variety of strategies and evaluate reconstruction quality. We consider 1D, 2D, and 3D sampling strategies.
As 1D strategies, we consider the three colored lines in Fig. 5b, which correspond to specific tasks previously considered in the literature: DfD, i.e., varying focal distance while fixing aperture and exposure (yellow); HDR fusion [Debevec and Malik 1997], i.e., varying exposure time while fixing focal distance and aperture (blue); and varying-aperture photography [Hasinoff and Kutulakos 2007], i.e., varying aperture while fixing exposure time and focal distance (red). In the 2D case, we manually select four distinct cross-sections. Three of these cross-sections represent scenarios where the aperture is fixed at small, medium, and large values (denoted as $A_s$, $A_m$, and $A_l$), while varying the exposure time and focus distance. The fourth cross-section represents a scenario where the exposure time is fixed, while aperture and focus distance vary (denoted as $T_f$). Within each cross-section, we design four sampling patterns: two diagonals, a vertical cross, and a horizontal cross. Please refer to the supplementary material for more details. In the 3D case, we produce random instances of n-rooks sampling for all three dimensions, as sketched below.
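A minimal sketch (function name hypothetical) of drawing one such n-rooks pattern over the 5 × 5 × 5 grid:

```python
import numpy as np

def n_rooks_taf(n=5, seed=0):
    # One random n-rooks sample of the n x n x n TAF grid: along each of
    # the time / aperture / focus axes, every slice is hit exactly once.
    rng = np.random.default_rng(seed)
    t, a, f = rng.permutation(n), rng.permutation(n), rng.permutation(n)
    return list(zip(t.tolist(), a.tolist(), f.tolist()))

print(n_rooks_taf())  # five (time, aperture, focus) grid indices
```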
Our main focus lies in exploring the strategies in 2D and 3D, which are in principle capable of reconstructing and disentangling all information, while the 1D results are naturally limited. We evaluate the results in terms of quality for AiF HDR as well as depth map reconstruction. We use HDR-VDP-3.0.6 [Mantiuk et al. 2023] and PU21-SSIM [Azimi et al. 2021] to measure AiF HDR results. For the depth map evaluation, we again employ the Abs-Rel metric. Table 4 provides the results of this analysis, while more detailed results can be found in the supplementary material. We observe that our method demonstrates robustness and achieves good performance under various sampling strategies. Results of our 1D evaluation are found in the supplementary material.
In the 2D evaluation, $A_l$ exhibits the lowest performance in AiF HDR reconstruction, as expected due to significant defocus blur. Conversely, $A_s$ demonstrates lower performance in depth estimation, primarily attributed to a wider depth of field and the limited blur cues available. $A_m$ offers a trade-off, showcasing satisfactory results in both AiF HDR reconstruction and depth estimation.
Remarkably, within the fixed-exposure-time cross-section, $T_f$ demonstrates competitive performance in AiF HDR reconstruction and improved performance in depth estimation. This can be attributed to the fact that, in this cross-section, the variation in exposure primarily arises from changes in the aperture size. Consequently, this variation introduces both exposure variation and defocus blur, resulting in a wider range of defocus blur types and richer cues for depth estimation. Both focus distance and aperture size contribute to the strength of the defocus blur, explaining the observed performance variations. We observe similar trends in the 3D n-rooks sampling scenario, although the performance is marginally inferior to $T_f$, with a notable increase in standard deviation indicating higher instability. The good quality of depth estimation in n-rooks sampling further verifies our conjecture that jointly changing the focus distance and aperture size is beneficial for accurate depth estimation.
In conclusion, for capturing we recommend using either a moderate aperture size with multiple exposure times, or fixing the exposure time while changing the aperture size, along with focal sweeping. These combinations provide a good balance between capturing high-quality images and maintaining flexibility in adjusting the depth and HDR information. We also verify these conclusions on real datasets in the supplementary material.
While the results in this section indicate that for TAF stack capture it suffices to consider a 2D subspace, we want to emphasize that TAF space synthesis is fully 3D, and that our method for the first time enables this kind of systematic analysis.

Ablations
In this section, we ablate selected components of our pipeline. Numerical results for all ablations are provided in Table 5, based on our real dataset. We observe that removing any component decreases the performance of our method. We elaborate on the components in the following paragraphs.
Implicit Flow Predictor. We choose to model lens breathing using a neural network. We study alternative models in Tab. 5 and Fig. 16. First, we consider removing flow prediction from our pipeline and observe severe ghosting artifacts due to the misalignment between the input images. Second, we consider a polynomial model inspired by established practices in lens distortion correction [Weng et al. 1992]. Specifically, we model radial distortion in polar coordinates, i.e., as a function of the radius $r = \sqrt{x^2 + y^2}$, via
$$r' = \sum_{k=1}^{n} a_k\, r^k, \quad (13)$$
where $r'$ is the remapped radius, and the coefficients $a_k$ are 3rd-order polynomials of the focal distance $s_1$. During training, we optimize for the coefficients of the latter polynomials. We study the cases $n = 1$, i.e., simple linear magnification, and $n = 3$, which includes non-linear distortions. We observe that neither model reaches the quality of our neural flow predictor.
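For completeness, a sketch of this polynomial baseline (class name and parameterization details are our assumptions):

```python
import torch

class RadialFlow(torch.nn.Module):
    # Polynomial baseline of Eq. 13; each coefficient a_k is itself a
    # 3rd-order polynomial of the focus distance s1 (coefficients in p).
    def __init__(self, n=3):
        super().__init__()
        self.p = torch.nn.Parameter(torch.zeros(n, 4))
        self.p.data[0, 0] = 1.0  # start from the identity mapping r' = r

    def forward(self, xy, s1):
        # xy: (B, 2) centered pixel coordinates; s1: (B, 1) focus distance.
        r = torch.sqrt((xy ** 2).sum(-1, keepdim=True))        # (B, 1)
        s_pows = torch.stack([s1 ** i for i in range(4)], -1)  # (B, 1, 4)
        a = (self.p * s_pows).sum(-1)                          # (B, n)
        r_new = sum(a[:, k:k + 1] * r ** (k + 1) for k in range(a.shape[1]))
        return xy * (r_new / r.clamp_min(1e-8))  # radially remapped coords
```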
Disk vs. Gaussian Kernel. Previous DfD methods [Favaro 2010; Si et al. 2023] commonly employ a Gaussian kernel to simulate defocus blur. Fig. 11 demonstrates that a disk kernel produces more realistic bokeh.
Implicit Tone Mapper. We compare our implicit tone mapper to two alternatives. First, we consider an explicit tone mapping function as per Eq. 7 (Fig. 7), which struggles to reproduce highlights. Second, we consider an alternative implicit tone mapper without weight sharing between the color channels (Fig. 12), which leads to color shifts. Table 5 reveals that our choice of implicit tone mapping delivers the highest-quality results.
In all our experiments, we have used JPEG images as input. Alternatively, linear/RAW images could be considered, for which a much simpler tone mapper would suffice, at the expense of additional processing steps [Mildenhall et al. 2022]. We emphasize that linear images are still not HDR, therefore requiring at least a clipping operation for "tone mapping".
Loss Terms. Here we investigate the effect of our loss terms. Fig. 13 reveals that the VGG loss is important to reproduce high-frequency details, avoiding blurry reconstructions. The Jacobian regularization helps to smooth the implicit flow and thus prevents deformation artifacts, as shown in Fig. 14. The TV loss enhances smoothness and continuity of the depth map by reducing noise and abrupt pixel transitions, as illustrated in Fig. 15. Again, corresponding numerical results are provided in Table 5.

Number of Input Images. We investigate depth reconstruction quality as a function of the number of input images to our system, evaluated on the NYU dataset. In Fig. 6 we plot errors relative to our solution, which uses five images. We see that fewer images give inferior results, while more images tend to only marginally improve the reconstruction. Even though these results tend to vary across scenes, they indicate that five input images are a reasonable default choice.

Runtime
Our method runs at interactive rates, but the inference time strongly depends on the aperture size during post-editing, as it determines the blur kernel size. When a small aperture is used, a frame takes as little as 22 ms at a resolution of 384 × 256 pixels, and 85 ms at a resolution of 768 × 512 pixels on an Nvidia RTX 3090. In the worst case, when the aperture is large and a bigger kernel has to be used, our method provides results within 62 ms and 788 ms per frame for the lower and higher resolution, respectively. These results could be further improved by replacing the differentiable blur kernel with a faster approximation at inference time. Training time is substantial: it takes around 20 minutes for the lower resolution and 4 hours for the higher resolution on an RTX 8000 GPU.

CONCLUSIONS, LIMITATIONS, AND FUTURE WORK
In this work, we have proposed a novel approach utilizing implicit neural fields and a differentiable thin-lens model to represent the Time-Aperture-Focus (TAF) stack. This representation allows us to extract complete information from the stack, including depth, all-in-focus, and high-dynamic-range data. Our method achieves state-of-the-art performance in depth-from-defocus tasks and allows a faithful exploration of the three-dimensional space of imaging parameters. One notable feature of our method is the ability to perform flexible post-editing: after fitting the model, users can adjust parameters such as focus distance, aperture, and exposure, enabling effective disentanglement. Additional results can be found in the supplementary material.
Our method does not account for the sensitivity of the sensor (ISO), which affects exposure and noise levels. Additionally, we primarily focus on static scenes and do not address motion blur caused during image capture. Our implementation is not performance-optimized, and we expect that leveraging recent advances in neural field training and inference [Müller et al. 2022] will significantly boost its performance. As our approach provides just one depth value per pixel, we are currently not able to handle transparent objects. We hope our method inspires future work on full-dimensional image recovery from sparse measurements.


Fig. 2. Pipeline of our method. We represent a depth map and AiF HDR image via coordinate-based MLPs. An implicit flow predictor is adopted to model the lens breathing effect. Given the depth map, the AiF HDR image, and the metadata of the camera, we calculate the defocused HDR image via a novel thin-lens model that is differentiable w.r.t. the depth map. Finally, the defocused HDR image is projected to LDR via an implicit tone mapper, again modeled by an MLP.

Fig. 4. Novel views with varying exposure time, aperture, and focus distance synthesized by our method.
Fig. 5. a) The parallelogram visually depicts the relationship between F-number, exposure time, and photometric exposure (EV) in the logarithmic domain. Dots with the same color signify an equal exposure intensity. b) Focus distance is incorporated, resulting in the construction of a 3D parallelepiped. Previous approaches only deal with a single dimension (arrows), while we consider the entire 3D space.

Fig. 7. Comparing LDR outputs of an explicit tone mapper to our implicit approach. The explicit solution struggles with the accurate reproduction of out-of-focus highlights.

Fig. 8. Qualitative depth-from-defocus results for the three datasets used in our evaluation.

Fig. 9. All-in-focus image reconstruction for the three datasets used in our evaluation. Our method successfully recovers the AiF image, while other methods introduce a variety of artifacts: DFFMobile struggles with resolving the blur and also produces undesirable black boundaries on the real dataset. DEReD produces results with visible hue shifts and ghosting artifacts. More results can be found in the supplementary material.

Fig. 10. Visualization of our learned flow. We warp the learned canonical view to images corresponding to different focal distances, simulating lens breathing.

Fig. 12. Comparison of our implicit tone mapper to a variant without weight sharing between the color channels.

Fig. 15. Comparison of the estimated depth map with (ours) and without the TV loss. This loss allows the method to reconstruct smoother depth maps and reduces local fluctuations caused by textures of the captured objects.

Fig. 16. Comparison of novel-view image quality for different flow predictors. Only our implicit flow-based approach eliminates the ghosting artifacts arising from misalignment caused by lens breathing.

Table 1. Evaluation of DfD on the rendered dataset.

Table 2. Evaluation of DfD on the NYU dataset.

Table 3. Evaluation of AiF image reconstruction on the rendered and NYU datasets.

Table 4. Exploration of different TAF sampling combinations.