Frustum Volume Caching for Accelerated NeRF Rendering


Fig. 1. We use a frustum volume cache to store the expensively computed view-independent output of Θ_base, allowing for efficient lookups via backward reprojection and trilinear interpolation, and re-evaluation of view-dependent effects via the smaller Θ_head. We utilize an occupancy grid to detect cache-misses and seamlessly combine cached samples with newly evaluated ones along view rays. Our approach accelerates real-time rendering, as well as offline rendering with expensive effects, e.g. motion blur.
Neural Radiance Fields (NeRFs) have revolutionized the field of inverse rendering due to their ability to synthesize high-quality novel views and applicability in practical contexts. NeRFs leverage volume rendering, evaluating view-dependent color at each sample with an expensive network, where a high computational burden is placed on extracting an informative, view-independent latent code. We propose a temporal coherence method to accelerate NeRF rendering by caching the latent codes of all samples in an initial viewpoint and reusing them in consecutive frames. By utilizing a sparse frustum volume grid for caching and performing lookups via backward reprojection, we enable temporal reuse of NeRF samples while maintaining the ability to re-evaluate view-dependent effects efficiently. To facilitate high-fidelity rendering from our cache with interactive framerates, we propose a novel cone encoding and explore a training scheme to induce local linearity into the latent information. Extensive experimental evaluation demonstrates that these choices enable high-quality real-time rendering from our cache, even when reducing latent code size significantly. Our proposed method scales exceptionally well for large networks, and our highly optimized real-time implementation allows for cache initialization at runtime. For offline rendering of high-quality video sequences with expensive supersampled effects like motion blur or depth of field, our approach provides speed-ups of up to 2×.

INTRODUCTION
NeRFs [Mildenhall et al. 2020] have recently received considerable attention due to their ability to faithfully synthesize novel views from a learned 3D scene representation, leveraging differentiable volume rendering. Although explicit or hybrid representations [Fridovich-Keil et al. 2022; Kerbl et al. 2023] have emerged as competitors citing faster rendering times, NeRF-based methods, particularly Zip-NeRF [Barron et al. 2023], still achieve the highest view synthesis quality. The volumetric nature of NeRFs requires spatial sampling along view rays with high sample counts. Although common volume rendering techniques such as empty space skipping or early ray termination can alleviate this limitation, many of the best-performing methods are not capable of real-time rendering due to large per-sample computational cost. Even in the context of offline rendering, high-quality NeRF-based methods can be prohibitively slow. The most common way to accelerate NeRF rendering is to bake the underlying representation of a trained model into a more render-friendly representation, e.g. a mesh [Chen et al. 2023; Reiser et al. 2024; Yariv et al. 2023] or a voxel grid [Garbin et al. 2021; Hedman et al. 2021]. However, baking these models can be time-consuming, and often involves a trade-off between quality retention and memory requirements.
Intending to accelerate NeRF-based rendering, we first consider the typical network architecture employed across virtually all NeRF methods. Generally, the evaluation of each sample is split into two parts: an expensive base network predicts density and view-independent latent codes, while a lightweight head network predicts outgoing radiance conditioned on a view direction and the view-independent intermediate output. However, when rendering consecutive frames of a video sequence or during real-time rendering, the expensive base network is evaluated repeatedly at very similar 3D positions, making temporal reuse of view-independent information desirable.
We therefore propose a volumetric caching approach, storing view-independent latent information per sample in a view-aligned sparse frustum voxel (froxel) grid, which can be efficiently sampled via backward reprojection (see Fig. 1). For each sample position, we interpolate the latent code and density from our cache and re-evaluate view-dependent effects via the head network. While our approach works out-of-the-box with popular NeRF models like Instant-NGP [Müller et al. 2022], we find that both quality and performance are subpar. To address both limitations, we first propose a novel view-directional cone encoding, which allows for significantly smaller latent codes and an even more lightweight head network, resulting in increased performance. To remedy interpolation artifacts, we propose to induce spatially local linearity to the latent codes during training, resulting in higher image quality when rendering from our cache. When combined, both techniques enable high-fidelity rendering from our cache with interactive framerates.
We provide the source code for our training framework, and a real-time capable renderer that showcases our method's capabilities. Our highly optimized CUDA implementation features an adaptive, asynchronous cache initialization for the real-time viewer, supersampled motion blur and depth of field for offline rendering, as well as a custom fused kernel to speed up the head network. Our method effectively accelerates real-time rendering, being up to 4.8× faster than Instant-NGP while achieving equal quality; compared to non-cached variants, caching increases rendering speed by up to 2×. Consequently, our renderer performs particularly well for high-quality video sequences with expensive effects, where we do not observe any loss of rendering quality.

BACKGROUND & RELATED WORK
In the following, we give an overview of the most important NeRF methods, acceleration techniques for neural rendering, and temporal coherence methods.

Neural Radiance Fields
NeRFs [Mildenhall et al. 2020] tackle the inverse rendering problem, learning a 3D scene from a set of posed 2D images. Prior techniques mostly tried to solve this problem with image-based approaches, e.g. multi-layered images [Mildenhall et al. 2019; Srinivasan et al. 2019]. Instead, NeRF uses an MLP to learn and encode an implicit volumetric representation of a 3D scene as density and view-dependent outgoing radiance. By applying positional encoding [Tancik et al. 2020], their approach produced state-of-the-art results in novel view synthesis; however, expensive raymarching using a large MLP limited its applicability. Follow-up work aimed to speed up training and increase quality by explicitly subdividing the 3D space, employing either multiple smaller MLPs [Reiser et al. 2021] or learning per-voxel Spherical Harmonics (SH) coefficients [Fridovich-Keil et al. 2022]. Other works used more efficient explicit representations like octrees [Yu et al. 2021] or factorized the 5D space into lower-rank tensors [Chen et al. 2022; Tang et al. 2022]. Müller et al. [2022] introduced an efficient multi-resolution hash encoding, allowing for much smaller MLPs and, therefore, faster optimization and rendering.
To alleviate the issue of high sample counts, recent work explored depth oracles [Neff et al. 2021], occupancy grids [Liu et al. 2020; Müller et al. 2022], or automatic per-ray integration [Lindell et al. 2021]. Other methods investigated more efficient sample placement with proposal networks [Barron et al. 2022, 2023] or sparse sampling networks [Kurz et al. 2022]. To circumvent aliasing, Mip-NeRF [Barron et al. 2021] proposed an integrated positional encoding, allowing the model to reason about scale. Mip-NeRF 360 [Barron et al. 2022] extended this idea to unbounded scenes, and Zip-NeRF [Barron et al. 2023] combined multisampled anti-aliasing with iNGP's hash grid encoding [Müller et al. 2022]. In contrast to volumetric neural rendering, 3D Gaussian Splatting [Kerbl et al. 2023] recently demonstrated fast training and rendering with high visual quality, relying on differentiable rasterization of a mixture of anisotropic 3D Gaussians.

Baking NeRFs
One common way to accelerate NeRF rendering is to "bake" the neural representation into a render-friendly format, frequently resulting in a trade-off between quality retention and memory requirements. FastNeRF [Garbin et al. 2021] factorizes the NeRF network into separate position-dependent and view-dependent MLPs, whose intermediate outputs are cached in world-space and can be efficiently queried. Hedman et al. [2021] leverage a sparse voxel grid storing opacity, diffuse color and a neural feature vector for efficient rendering on commodity hardware. To further limit storage requirements, MERF [Reiser et al. 2023] reduces the voxel grid resolution and utilizes 2D feature planes. Duckworth et al. [2024] subsequently demonstrated real-time rendering of apartment-scale scenes with a set of MERFs. Other approaches bake NeRFs into a mesh [Chen et al. 2023; Reiser et al. 2024; Yariv et al. 2023], leveraging the efficient polygon rasterization pipeline; these approaches struggle with fine geometric structures and semi-transparent objects during meshing. Finally, other data structures for baking, such as duplex meshes with neural features [Wan et al. 2023] or view-dependent volumes [Yu et al. 2021], have also been explored. Our approach circumvents common drawbacks of related work by constructing a view-aligned representation in real-time from the original NeRF representation.

Temporal Coherence Methods
Reusing rendering information in consecutive frames has always been desirable for computer graphics applications. Early work used forward reprojection with mesh-based 2.5D reconstructions [Mark et al. 1997], or layered depth images [Shade et al. 1998]. On the contrary, backward reprojection for shading reuse is done by re-rendering geometry [Nehab et al. 2007] or through flow fields and fixed-point iteration [Bowles et al. 2012]. These surface-based methods mostly do not translate well to volume rendering. Greger et al. [1998] propose to use 3D probes to store irradiance volumetrically. Other works explored image-based techniques, rendering layers of slabs [Mueller et al. 1999], or performing point-based rendering [Zellmann et al. 2012]. More recently, neural networks were used to cache 3D radiance [Müller et al. 2021] for path-tracing, showing promising results for both surfaces and participating media. Wronski [2014] and Hillaire [2015] both use a frustum voxel (froxel) grid to temporally integrate scattered light for volumetric effects, e.g. fog. Lochmann et al. [2016] propose irregularly sized frustum segments to store a piecewise-analytic emission-absorption representation. Our temporal coherence method is entirely volumetric and uses a sparse regular froxel grid and backward reprojection to store and retrieve NeRF latent codes, enabling re-evaluation of view-dependent effects.

PRELIMINARIES
In this section, we first review details about volumetric rendering with NeRFs, discuss extensions to handle unbounded scenes, and analyze the underlying architecture of common NeRF methods.
A straightforward approach for sample placement is via an invertible function t(·), mapping an input from stepping-space S to a corresponding t value. iNGP [Müller et al. 2022] performs exponential stepping for unbounded scenes, with t_{k+1} = t_k · (1 + δ) for a small cone angle δ = 1/256. This translates to a step size function \(t(s) = t_0 \cdot (1 + \delta)^s = t_s\), with inverse \(t^{-1}(t_s) = \log_{1+\delta}(t_s / t_0) = s\). To prevent under-/oversampling, the step size is clamped between [Δ_min, Δ_max], dependent on the size of the scene's bounding box (cf. Supplemental A for details). iNGP distills a binary occupancy grid G during optimization to enable empty space skipping, representing the expected density within discretized voxels; for large-scale scenes, a hierarchy of grids is employed.
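To make the stepping-space mapping concrete, here is a minimal Python sketch; the function names are ours, and t0 and the clamp bounds are illustrative placeholders rather than values from the paper.

```python
import numpy as np

DELTA = 1.0 / 256.0   # cone angle delta from iNGP's exponential stepping

def t_of_s(s, t0=0.05):
    """Stepping-space to ray distance: t(s) = t0 * (1 + delta)^s."""
    return t0 * (1.0 + DELTA) ** s

def s_of_t(t, t0=0.05):
    """Inverse mapping: s = log_{1+delta}(t / t0)."""
    return np.log(t / t0) / np.log(1.0 + DELTA)

def step_size(t, dt_min=0.001, dt_max=1.0):
    """Consecutive samples obey t_{k+1} = t_k * (1 + delta); the resulting
    step t * delta is clamped to [dt_min, dt_max] to avoid under-/oversampling."""
    return np.clip(t * DELTA, dt_min, dt_max)
```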
To ease the learning objective for large, unbounded scenes, multiple works utilize a scene contraction [Barron et al. 2022; Neff et al. 2021] to bound the input domain of Θ. Mip-NeRF 360 [Barron et al. 2022] contracts 3D points x that lie outside the unit norm ball:

\[ \mathrm{contract}(\mathbf{x}) = \begin{cases} \mathbf{x} & \text{if } \lVert\mathbf{x}\rVert \le 1 \\ \left(2 - \frac{1}{\lVert\mathbf{x}\rVert}\right) \frac{\mathbf{x}}{\lVert\mathbf{x}\rVert} & \text{otherwise.} \end{cases} \]

When using the infinity norm ‖·‖_∞, this leads to a final contracted space that spans the entire domain in range [−2, 2]^3, thereby fitting the cubic hash grid geometrically [Tancik et al. 2023].
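A minimal PyTorch sketch of this contraction, assuming the standard Mip-NeRF 360 formulation:

```python
import torch

def contract(x: torch.Tensor, ord=float('inf')) -> torch.Tensor:
    """Mip-NeRF 360 scene contraction: identity inside the unit ball,
    (2 - 1/||x||) * x/||x|| outside. With the infinity norm, the output
    covers [-2, 2]^3, matching a cubic hash grid. x: [..., 3]."""
    n = torch.linalg.vector_norm(x, ord=ord, dim=-1, keepdim=True)
    return torch.where(n <= 1.0, x, (2.0 - 1.0 / n) * x / n)
```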
Mip-NeRF 360 [Barron et al. 2022] introduces a distortion loss L_dist that encourages weights along a ray to form a Dirac distribution in stepping-space. Their loss also includes a second term, which forces intervals to become smaller and can be omitted for a method with fixed step sizes. When using a stepping-space where \(t^{-1}(t_s) = s\), the loss L_dist and its partial derivative can be rewritten as

\[ \mathcal{L}_{\mathrm{dist}} = \sum_i \sum_j w_i\, w_j\, \lvert s_i - s_j \rvert, \tag{2} \]

\[ \frac{\partial \mathcal{L}_{\mathrm{dist}}}{\partial w_k} = 2 \sum_j w_j\, \lvert s_k - s_j \rvert. \tag{3} \]
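A compact PyTorch sketch of the rewritten loss as a quadratic-time reference; practical implementations typically use a cumulative-sum rearrangement for linear time.

```python
import torch

def distortion_loss(w: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Distortion loss in stepping-space: sum_i sum_j w_i * w_j * |s_i - s_j|.
    The interval term of the original loss is constant for fixed step sizes
    and omitted. w, s: [num_rays, num_samples]."""
    d = (s.unsqueeze(-1) - s.unsqueeze(-2)).abs()            # |s_i - s_j|
    return (w.unsqueeze(-1) * w.unsqueeze(-2) * d).sum((-2, -1)).mean()
```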

NeRF Network Architecture
As our final approach leverages the underlying NeRF architecture for efficient caching, we examine current state-of-the-art networks. The most prominent NeRF variants split Θ into a view-independent Θ_base and a view-dependent Θ_head network, where density σ is view-independent by design and Θ_head is conditioned on the view-independent latent code l ∈ R^{N_l}. NeRF's Θ_base consists of a frequency encoding γ and a deep MLP, with N_l = 256. Mip-NeRF [Barron et al. 2021] and Mip-NeRF 360 [Barron et al. 2022] introduce a scale-aware positional encoding, with even wider MLP hidden layers and latent code for Mip-NeRF 360. On the contrary, iNGP [Müller et al. 2022] uses its multi-resolution hash encoding and a tiny MLP as Θ_base, with a narrow l (N_l = 16; the first value of l is also the log-space density σ), and an SH encoding of the view-direction. Finally, Zip-NeRF [Barron et al. 2023] follows iNGP's network design, but with wider latent codes (N_l = 256) and a much larger view-dependent MLP for increased quality. We contrast the architectures of NeRF and iNGP in Fig. 2. All discussed architectures provide opportunity for accelerated rendering through caching, as they use compute-intensive Θ_base networks; however, large latent codes ultimately lead to excessive cache size requirements. In this work, we propose several techniques to produce smaller but more informative latent codes.
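To illustrate the base/head split, a minimal PyTorch sketch follows; layer sizes and names are illustrative and do not match any specific method's configuration.

```python
import torch
import torch.nn as nn

class SplitNeRF(nn.Module):
    """Sketch of the base/head split shared by most NeRF architectures."""
    def __init__(self, pos_enc_dim=32, dir_enc_dim=16, n_latent=16):
        super().__init__()
        # Theta_base: expensive, view-independent; outputs density + latent code.
        self.base = nn.Sequential(
            nn.Linear(pos_enc_dim, 64), nn.ReLU(), nn.Linear(64, 1 + n_latent))
        # Theta_head: lightweight, conditioned on latent code and view direction.
        self.head = nn.Sequential(
            nn.Linear(n_latent + dir_enc_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x_enc, v_enc):
        out = self.base(x_enc)                       # view-independent part
        sigma, l = out[..., :1], out[..., 1:]
        rgb = self.head(torch.cat([l, v_enc], -1))   # view-dependent part
        return sigma, l, rgb                         # l is what a cache would store
```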

INTERPOLATING NERF SAMPLES
We propose a temporal coherence method for caching and temporal reuse of view-independent information to accelerate NeRF rendering. Our key idea is to cache the output of Θ_base for each evaluated sample in a view-aligned volumetric data structure, sample this cache with trilinear interpolation from novel viewpoints, and efficiently re-evaluate view-dependent effects via Θ_head. This poses several challenges, addressed here.

Interpolation
An integral part of our approach is interpolating latent information {σ, l} of NeRF samples from a volumetric data structure. The cache is initialized from a camera at position o_c, with corresponding view-projection matrix P_c. We store the information for a view frustum in 3D froxel grids C = {C_σ, C_l} with indices in froxel-space Z = (u, v, s), consisting of (u, v) in image pixel space and s in stepping space S. Any world-space point x ∈ R^3 can be transformed into a z ∈ Z via the transformation Φ:

\[ \Phi(\mathbf{x}) = (u, v, s), \quad \text{with } (u, v) \text{ the pixel coordinates of } P_c\,\mathbf{x} \text{ and } s = t^{-1}(\lVert \mathbf{x} - \mathbf{o}_c \rVert_2). \tag{6} \]

Due to the ℓ2 norm, the frustum appears curved along the s-dimension when transformed back into view-space. Notably, C is sparsely populated, with a cell (u, v, s) only being set if the ray at pixel (u, v) placed a sample at t = t(s) during cache initialization. To detect cache-hits and handle missing information, we store occupancy information in an additional binary froxel grid C_β. We obtain our interpolated values {σ̃, l̃} via trilinear interpolation from C_σ and C_l, where we replace any value with zero if the corresponding cell in C_β is unoccupied. This introduces a zero-shift into the interpolation results, effectively assuming that values must be zero in unoccupied space, which is a reasonable assumption for density but not for latent codes. Hence, we re-normalize l̃ by considering adjacent occupancy information, effectively disregarding unoccupied cells:

\[ \tilde{\mathbf{l}} = \frac{\sum_{i=1}^{8} \beta_i\, w_i\, \mathbf{l}_i}{\sum_{i=1}^{8} \beta_i\, w_i}, \tag{7} \]

where w_i are the trilinear interpolation weights and β_i ∈ {0, 1} the occupancy values of the eight neighboring cells.
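A NumPy sketch of the occupancy-normalized trilinear interpolation of Eqn. (7); the argument names and corner ordering are ours.

```python
import numpy as np

def interp_latent(corner_l, corner_beta, frac):
    """Occupancy-normalized trilinear interpolation (sketch of Eqn. (7)).
    corner_l:    [8, n_latent] latent codes of the 8 neighboring cells,
                 ordered x-fastest, then y, then z
    corner_beta: [8] binary occupancy of those cells
    frac:        (fx, fy, fz) fractional position inside the cell"""
    fx, fy, fz = frac
    w = np.array([(1 - fx) * (1 - fy) * (1 - fz), fx * (1 - fy) * (1 - fz),
                  (1 - fx) * fy * (1 - fz),       fx * fy * (1 - fz),
                  (1 - fx) * (1 - fy) * fz,       fx * (1 - fy) * fz,
                  (1 - fx) * fy * fz,             fx * fy * fz])
    wb = w * corner_beta                       # zero out unoccupied cells
    denom = wb.sum()
    if denom == 0.0:                           # all corners unoccupied: cache-miss
        return None
    return (wb[:, None] * corner_l).sum(0) / denom   # re-normalize
```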

Learning Spatial Linearity
Each latent code l is an intermediate output of a neural network, approximating a highly non-linear function. Naïve linear interpolation of l is not well behaved, even for spatially close 3D samples. We can induce linearity by performing the interpolation from Eqn. (7) with randomly shifted samples during training; however, performing full trilinear interpolation from eight samples is expensive. Notably, for large-scale unbounded scenes and when using exponentially increasing step sizes, adjacent samples in a view frustum exhibit the largest spatial differences along the view rays, i.e. the froxel-space s-dimension. Therefore, we propose to perform interpolation from just two samples, shifted along the view ray.
During training, we generate a single random offset in stepping-space Δ_s ∈ [0, 1] per ray r to interpolate every actual sample x = r(t_s) from two artificial samples at x_0 = r(t(s − Δ_s)) and x_1 = r(t(s + 1 − Δ_s)); cf. Fig. 3 for a visualization. We then evaluate Θ_base at these positions, set all interpolation values to zero if the corresponding position is unoccupied, and perform the same normalization as in Eqn. (7) with

\[ \{\tilde{\sigma}, \tilde{\mathbf{l}}\} = \frac{\beta_0\,(1 - \Delta_s)\,\{\sigma_{x_0}, \mathbf{l}_{x_0}\} + \beta_1\,\Delta_s\,\{\sigma_{x_1}, \mathbf{l}_{x_1}\}}{\beta_0\,(1 - \Delta_s) + \beta_1\,\Delta_s}, \]

where σ_{x_i}, l_{x_i} = Θ_base(x_i), and β_i = G(x_i) ∈ {0, 1}. We then continue the evaluation of the network with σ̃ and l̃.
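A PyTorch-style sketch of this two-sample training step; the callables theta_base, occupancy, and r are assumed stand-ins for the network, the occupancy grid G, and the ray, and all inputs are tensors.

```python
import torch

def interpolated_base(theta_base, occupancy, r, s, delta_s):
    """Two-sample linearity training (sketch): evaluate Theta_base at two
    samples shifted along the ray and linearly interpolate, normalizing by
    occupancy as in Eqn. (7). r maps stepping-space s to a 3D ray position."""
    x0, x1 = r(s - delta_s), r(s + 1.0 - delta_s)   # unit spacing in s
    sig0, l0 = theta_base(x0)
    sig1, l1 = theta_base(x1)
    b0, b1 = occupancy(x0), occupancy(x1)           # binary, from grid G
    w0, w1 = b0 * (1.0 - delta_s), b1 * delta_s     # linear interp weights
    denom = (w0 + w1).clamp_min(1e-8)
    sigma = (w0 * sig0 + w1 * sig1) / denom
    l = (w0.unsqueeze(-1) * l0 + w1.unsqueeze(-1) * l1) / denom.unsqueeze(-1)
    return sigma, l                                 # continue with Theta_head
```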
As can be seen in Fig. 3, rendering novel views from our frustum volume cache can produce disturbing artifacts when interpolating naïvely. Inducing linearity along view rays during optimization eliminates these artifacts. Our experiments suggest that this linearity constraint leads to slightly degraded image quality metrics. However, considering the increased performance with our temporal coherence method, inducing spatial linearity proves worthwhile.

View-dependent Cone Encoding
The performance of our caching approach is highly dependent on the size of l, as this directly influences cache size and lookup speed. The second crucial factor is the performance of Θ_head, as it is always re-evaluated for cached samples. Notably, l is fully view-independent, meaning the whole 360° viewing information is encoded into l, and Θ_head is able to produce outgoing radiance for every possible view-direction. This is wasteful for a temporal coherence approach, where cached samples will only ever be viewed from viewpoints similar to the cached viewpoint.
We introduce a view-dependent cone encoding, which produces view-dependent latent codes that encode the viewing information for a cone C(v_c, γ_max) of angle γ_max around a view-direction v_c. This change allows us to shift computational load from Θ_head to a new cone encoding network Θ_neck, which takes the view-independent output of Θ_base and v_c as input, and outputs the cone encoding l. The outgoing color for the actual view-direction v_a can be recovered from l by providing the difference of encoded view-directions to Θ_head. During optimization, Θ_head needs to be provided with examples to learn the relationship between l, v_c, and v_a. For this purpose, we generate N_v view-directions {v_{(c,1)}, . . ., v_{(c,N_v)}} for every actual view-direction v_a, such that v_{(c,i)} ∈ C(v_a, γ_max). For each sample along v_a, each of those shifted view-directions computes l_i from Θ_neck, evaluates the sample color via Θ_head(l_i, v_{(c,i)}, v_a), and accumulates them along the ray. The total loss is then computed as the average loss of all N_v view-directions.
Randomly sampling v_{(c,i)} from C(v_a, γ_max) does not result in a sufficient variety of view-directions during training. Therefore, we propose a slightly more sophisticated sampling scheme, depicted in Fig. 4. We first uniformly sample a smaller cone C(v_a, γ_max/2) to obtain a center view-direction v′_c. Next, we place the N_v samples uniformly on a circle of radius γ_max/2 around v′_c, with a random radial offset φ′. By effectively encircling v_a, we ensure a good distribution of view-directions while guaranteeing that no sample lies outside the encoding cone. Note that v_c = v_a when rendering without a cache, and Θ_head does not need to be executed during cache initialization.
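A NumPy sketch of this sampling scheme; the helper names are ours, and the uniform spherical-cap sampling is one standard choice.

```python
import numpy as np

def _frame(v):
    """Two unit vectors orthogonal to unit vector v."""
    a = np.array([1.0, 0.0, 0.0]) if abs(v[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(v, a)
    u /= np.linalg.norm(u)
    return u, np.cross(v, u)

def _tilt(v, theta, phi):
    """Rotate unit vector v by angle theta, in azimuthal direction phi."""
    u, w = _frame(v)
    return np.cos(theta) * v + np.sin(theta) * (np.cos(phi) * u + np.sin(phi) * w)

def sample_cone_directions(v_a, gamma_max, n_v=4, rng=np.random.default_rng()):
    """Sample a random center v_c' inside C(v_a, gamma_max/2), then place n_v
    directions on a circle of angular radius gamma_max/2 around v_c', offset
    by a shared random angle phi'. All results lie within C(v_a, gamma_max)."""
    half = gamma_max / 2.0
    # Uniform sampling over the spherical cap of the smaller cone.
    theta_c = np.arccos(1.0 - rng.uniform() * (1.0 - np.cos(half)))
    v_center = _tilt(v_a, theta_c, rng.uniform(0.0, 2.0 * np.pi))
    phi0 = rng.uniform(0.0, 2.0 * np.pi)   # random radial offset phi'
    return [_tilt(v_center, half, phi0 + 2.0 * np.pi * i / n_v) for i in range(n_v)]
```

Since the center lies at most γ_max/2 from v_a and each sample at exactly γ_max/2 from the center, no sample can leave the encoding cone, matching the guarantee stated above.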

A Model for Efficient Rendering From Cache
To maximize quality and performance, we propose a NeRF model that is particularly well suited for our caching approach in Fig. 5. The model is based on iNGP (big) [Müller et al. 2022], but with several adaptations. Most importantly, we leverage our view-dependent cone encoding and spatial linearity training recipe, detailed previously. These modifications enable a significantly smaller l and Θ_head. To counter the capacity loss when shrinking l further, we provide the encoded sample position x to Θ_head, where we use NeRF's positional encoding [Mildenhall et al. 2020], which can be re-evaluated efficiently and does not need to be cached.
To reduce the overall sample count and improve quality for unbounded scenes, we incorporate different techniques from recent NeRF methods. We use the scene contraction and an adapted version of the distortion loss of Mip-NeRF 360 [Barron et al. 2022], as described in Eqns. (2) and (3). Our scenes are still bounded by the occupancy grid G; however, the input domain of Θ is contracted. Finally, we downweigh gradients close to the camera to reduce floaters, as suggested by Philip and Deschaintre [2023].
Notably, combining our view-dependent cone encoding with interpolation along the view ray means that each sample is evaluated twice for each of the N_v view-directions during optimization. A naïve approach would execute Θ for each of those combinations, leading to large memory requirements and computational overhead. However, the two interpolation samples are joined when computing l, and the N_v view-directions of the cone encoding can share the same Θ_base outputs. Assuming N_v = 4, this reduces the number of evaluations per sample to 2× Θ_base, 8× Θ_neck, and 4× Θ_head, ensuring fast optimization for our method.

NERF FRUSTUM VOLUME CACHING & REPROJECTION
We propose a caching and reprojection approach to retrieve interpolated NeRF latent information from a view-aligned sparse froxel grid and efficiently re-evaluate view-dependent effects. Our proposed datastructures and methodologies are designed to fulfill the following objectives: we aim to maximize cache lookup speed, minimize the cache size, and perform fast cache initialization during runtime. Our design decisions were guided by quality and performance evaluations on our highly optimized CUDA/C++ implementation.

Caching Datastructure
Due to its design, our NeRF model produces a low number of samples per ray, which are sparsely allocated inside the view frustum. We utilize a sparse froxel grid with fixed-size cubic bricks of side length B_s, enabling efficient parallel initialization and providing a favorable compromise between sparsity and lookup speed. We employ a brick froxel grid B that stores per-brick binary occupancy information B_β and indices B_idx, pointing into the data voxel grid D where the brick's actual data resides. D stores per-sample latent information in three separate voxel grids: density D_σ, latent code D_l, and binary sample occupancy D_β. As neighboring bricks in B are not necessarily adjacent in D, prohibiting the use of efficient hardware interpolation, we opt to pad bricks inside D to store additional information of neighboring bricks.
We maintain an additional datastructure to accelerate occupancy checks by spanning a 2.5D frustum volume, representing the known range K of our cache. Essentially, each sample outside K is guaranteed to be unoccupied in C. We can easily construct this volume during cache initialization by using the current camera position, its corresponding near plane, and each ray's termination depth. For a visualization of our datastructures, cf. Fig. 6.
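A minimal Python sketch of these datastructures; the array layouts and field names are assumptions for illustration, not our actual CUDA texture layout.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FrustumCache:
    """Sketch of the brick froxel grid B, data grid D, and known range K."""
    res_uvs: tuple      # froxel grid resolution (U, V, S), in bricks
    brick_size: int     # B_s, cubic brick side length
    n_latent: int       # latent code size N_l
    max_bricks: int     # capacity of the data grid D

    def __post_init__(self):
        U, V, S = self.res_uvs
        b, n = self.brick_size, self.max_bricks
        self.brick_occ = np.zeros((U, V, S), dtype=bool)          # B_beta
        self.brick_idx = np.full((U, V, S), -1, dtype=np.int32)   # B_idx -> D
        self.d_sigma = np.zeros((n, b, b, b), dtype=np.float32)           # D_sigma
        self.d_latent = np.zeros((n, b, b, b, self.n_latent), np.float32) # D_l
        self.d_occ = np.zeros((n, b, b, b), dtype=bool)                   # D_beta
        # K: per-pixel [near, termination] depth bounds of the known range.
        self.known_range = np.zeros((U * b, V * b, 2), np.float32)
```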

Cache Initialization
During cache initialization, all rays place their s-th sample at the same distance t_s, allowing us to use a regular froxel grid; note that all values are explicitly initialized as zero. A naïve initialization strategy would let all rays perform sample placement and early ray termination independently, only synchronizing for the allocation of bricks. Regrettably, this can leave bricks underutilized if most rays have already terminated, but a few distant rays remain and keep placing samples.
We mitigate this limitation with brick-wise initialization, operating on brick-sized ray bundles of dimension B_s × B_s instead of individual rays. Ray bundles always perform stepping collectively, only terminate as a whole, and never terminate before reaching the end of the currently sampled brick in B. Additionally, bricks are only allocated if any ray of the bundle places a valid sample, leading to much higher utilization of bricks and better memory coherence. This strategy maximizes brick occupancy and lookup performance, but leads to slower cache initialization, as collective ray bundle termination inevitably leads to more evaluated samples. Note that we differentiate between inner and outer rays when using padded bricks: only inner rays are considered for ray bundle termination and can initiate brick initialization, whereas outer rays merely place samples when bricks are initialized. A minimal sketch of this bundle-wise stepping is shown below.
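The following Python sketch conveys the bundle-wise control flow only; ray.point_at, ray.transmittance, and cache.store are hypothetical helpers, and the inner/outer ray distinction for padded bricks is omitted for brevity.

```python
def init_cache_bundle(cache, bundle_rays, theta_base, occupancy, s_max):
    """Brick-wise initialization for one B_s x B_s ray bundle (sketch):
    the bundle steps through depth-slabs of one brick at a time and only
    terminates as a whole, at brick boundaries."""
    b = cache.brick_size
    for s0 in range(0, s_max, b):              # one brick depth-slab at a time
        any_sample, any_alive = False, False
        for ray in bundle_rays:
            for s in range(s0, s0 + b):
                x = ray.point_at(s)
                if occupancy(x):               # world-space occupancy grid G
                    sigma, l = theta_base(x)
                    cache.store(ray, s, sigma, l)  # allocates brick on first write
                    any_sample = True
            any_alive |= ray.transmittance > 1e-4  # early-termination test
        if not (any_sample or any_alive):      # whole bundle terminates together
            break
```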

Reprojection & Sampling
With our caching approach established, we show our algorithm for sampling in Alg. 1. We transform samples x into froxel-space Z via Eqn. (6). A sample z ∈ Z is inside the known range K if z_s lies between the cache's near plane and the termination depth at K[z_u, z_v]. If B_β is occupied, we determine the position d inside D by expanding the brick index from B_idx and adding z's offset inside the current brick. Finally, we count this sample as a cache-hit if D_β is occupied at d and opacity exceeds a threshold ε_σ, and perform our normalized trilinear interpolation as proposed in Eqn. (7). If a sample is outside K but occupied in G, we treat it as a cache-miss and evaluate Θ.
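A sketch of the cache-hit test, building on the FrustumCache layout above. This nearest-cell version returns a single cell's values for brevity; the real lookup performs the normalized trilinear interpolation of Eqn. (7), and phi is an assumed stand-in for the froxel-space transformation Φ.

```python
def lookup(cache, x, phi, eps_sigma=1e-5):
    """Cache-hit test of Alg. 1 (sketch). Returns (sigma, latent) on a hit,
    or None, meaning the caller falls back to evaluating Theta (cache-miss)."""
    u, v, s = phi(x)                            # froxel-space z = (u, v, s)
    near, term = cache.known_range[int(u), int(v)]
    if not (near <= s <= term):                 # outside known range K
        return None
    b = cache.brick_size
    bu, bv, bs = int(u) // b, int(v) // b, int(s) // b
    if not cache.brick_occ[bu, bv, bs]:         # B_beta check
        return None
    d = cache.brick_idx[bu, bv, bs]             # expand brick index into D
    lu, lv, ls = int(u) % b, int(v) % b, int(s) % b
    if not cache.d_occ[d, lu, lv, ls]:          # D_beta check
        return None
    if cache.d_sigma[d, lu, lv, ls] < eps_sigma:  # opacity threshold
        return None
    return cache.d_sigma[d, lu, lv, ls], cache.d_latent[d, lu, lv, ls]
```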

Implementation Details
Our implementation consists of a PyTorch training framework, based on NerfAcc [Li et al. 2023], and an optimized CUDA/C++ real-time viewer and offline renderer. Both applications leverage tiny-cuda-nn [Müller 2021] for efficient MLP and input encoding inference. We provide our source code in the supplementary material.

Datastructures. To allow for better memory access patterns and hardware interpolation, we store the data of B and D in 3D textures. All values are stored in full precision, except for l, which is natively output by tiny-cuda-nn in half precision. To further speed up cache retrieval, we group the entries of D_l into batches of four (half4), a format supported by CUDA and GPU hardware.
Sampling. Following iNGP [Müller et al. 2022], we perform sampling rounds with a fixed number of samples per ray, terminating and compacting the remaining rays after each round. In addition, we reject samples with opacity below ε_σ = 10^{-5} and perform early ray termination at transmittance T < 10^{-4}. We split our implementation of Alg. 1 across multiple kernels: the sampling kernel only performs the cache-hit detection and writes the t-values of each ray's cached and new samples to separate t-buffers. Individual rays might exclusively place cached or new samples during a round, leading to sparse sample buffers and unnecessary evaluations of Θ and Θ_head. Therefore, we compact the t-buffers after sampling, and perform the interpolation of l and the generation of network inputs in a separate per-sample kernel. Finally, to accelerate empty space skipping during ray initialization, we mesh and ray-trace the occupancy grid following Wald et al. [2021] to place the first sample.

Fused Head Network. Our potential speedup is clearly dependent on the runtime of Θ_head, which is evaluated for each cached sample. Our model employs two additional input encodings, γ(x) and the SH encoding of v_c, which would need to be communicated via slow global memory if implemented naïvely. We thus opt for a custom fused kernel that performs the full evaluation of Θ_head, thereby only requiring communication of (x, v_c, v_a) instead of their much larger encodings. Each thread is responsible for loading and encoding one input sample, performing cooperative tensor-core matrix multiplications for the input layer, and matrix-vector multiplication in registers for the output layer.
Real-time Viewer & Offline Renderer. Our real-time viewer is designed for high interactivity, providing automatic cache updates based on the cache-hit ratio (CHR). We hide latency with double-buffered caches and run initialization asynchronously in a lower-priority stream, enhancing user experience. Our offline renderer can be used to render high-quality images and video sequences with supersampled motion blur and depth of field, adapted from iNGP [Müller et al. 2022].

RESULTS
We evaluate our proposed model and caching approach on challenging, real-world scenes from the Mip-NeRF 360 dataset [Barron et al. 2022]. First, we demonstrate the effectiveness of our proposed model components and training schemes by comparing against other popular NeRF models. Additionally, we evaluate the performance and quality of our caching approach when applied to our trained models. We provide additional implementation details in Supplemental B.
Models. We refer to our proposed model from Sec. 4.4 as Ours, using N_v = 4 samples and angle γ_max = 25° for optimizing the cone encoding, and 4 degrees for the frequency encoding γ. To showcase the scalability of our caching approach, we add a version of our model with a larger hash grid (10 levels, 2^22 entries, and a max. resolution of 8192^3), called Ours (huge).

Model Evaluation
For the quantitative evaluation of our trained models, we report PSNR, SSIM [Wang et al. 2004], LPIPS [Zhang et al. 2018] and FLIP [Andersson et al. 2020] in Tab. 1. We also include numbers for our re-implementation of iNGP (big)† [Müller et al. 2022], where we use the same hash grid configuration as for Ours. To facilitate cross-method comparisons, we also include metrics for 3DGS [Kerbl et al. 2023], Mip-NeRF 360 [Barron et al. 2022] and Plenoxels [Fridovich-Keil et al. 2022], which are reproduced from Kerbl et al. [2023] and marked with ‡. Our evaluations show that Ours is able to maintain competitive quality and reduce sample counts, even though it uses smaller latent codes and a smaller Θ_head. Reducing γ_max of the cone encoding to 5° improves quality slightly, but struggles when re-evaluating the latent codes for larger view changes (see Supplemental D for details). Ours (huge) performs best out of our methods and rivals even Mip-NeRF 360 and 3DGS, but leads to slower render times and a considerably larger model size compared to the (big) variant. Contrasting iNGP Ours with the base iNGP model showcases the effect of the distortion loss and scene contraction, reducing the sample count drastically and improving image quality.
On average, training time increases slightly from 20 to 24 minutes when enabling interpolation training in iNGP Ours, and to 47 minutes with our cone encoding (Ours). Although each sample is evaluated 8×, training times only increase by 2.5×, which is achieved by sharing common computations (cf. Sec. 4.4). For comparison, 3DGS trains for 30−40 minutes on our system, and Mip-NeRF 360 for 12 hours on 4 A100 GPUs.

Caching & Reprojection
To evaluate our cache-based rendering for different view changes, we utilize the test set views (at position o with view-direction v) from Mip-NeRF 360 [Barron et al. 2022] and initialize our cache from different locations in the vicinity of o. To quantify the effect of small/large translation, we move the cache initialization camera's position o_trans along v, i.e. o_trans = o ± δ·v for a small or large offset δ.
Similarly, we create rotated initialization views o_rot by rotating o around a reference point along v, with rotation angles θ and φ (see Supplemental E for details). Following Lochmann et al. [2016], we shift the viewing-angle by θ ∈ {5°, 10°, 15°}, and sample each θ at six different angles φ ∈ {0°, 60°, 120°, 180°, 240°, 300°}. We utilize this setup for both performance and qualitative evaluation, reporting average results over φ and for positive/negative translation. All evaluations use padded bricks of size B_s = 6 for our cache datastructure.
Image Metrics. We provide a comparison based on PSNR for positive/negative translation and different rotation angles θ in Tab. 2. Both Ours and iNGP Ours are able to faithfully render the test set views from cache, even for larger rotational movements. When linearity is not induced during training, quality degrades significantly. Contrary to iNGP Ours, Ours is able to faithfully render from cache even when the latent code size is further reduced. Our cone encoding delivers higher initial quality when trained for smaller cone angles; however, quality degrades heavily for larger rotational view changes. Example images for a single test view can be seen in Fig. 7.
Performance & Cache Size. We evaluate the average performance over all scenes of the Mip-NeRF 360 dataset [Barron et al. 2022] for different configurations of Ours and iNGP Ours in Tab. 3. The timings are measured on an NVIDIA RTX 4090, and averaged over 10 runs at FullHD resolution. iNGP Ours has the fastest baseline performance but only experiences slight speed-ups when rendering from the cache, as evaluation of Θ_head is expensive for this model. Reducing N_l for iNGP Ours from 16 to 8 leads to a decreased memory footprint, but increases performance only slightly. Ours experiences large speedups when employing caching, especially for smaller rotational movements where the CHR is high. Scaling up Θ_base in Ours (huge) leads to slower overall render times; however, we can take full advantage of caching if the CHR is high enough. Finally, changing N_l for Ours has a considerable impact on cache size and speedup.
Cache initialization times and cache size are also highly dependent on brick size B_s and padding choice, with initialization for Ours and B_s = 6 (w/ padding) taking 161 ms and using 3.41 GiB of memory on average, while B_s = 8 (w/o padding) requires 75 ms and 1.64 GiB (cf. Supplemental G). Our double-buffered initialization allows us to hide this latency over multiple frames.

Fig. 7. Qualitative comparison of our cache-based rendering approach for an example view of the Kitchen scene, rendered with Ours. The cache is rendered from the test set viewpoint, but initialized from a slightly rotated/translated viewpoint in its vicinity. We visualize the contribution of cache-hit vs. cache-miss samples and provide a FLIP comparison to the baseline render (= without caching). As can be seen in the inset PSNR scores (compared to the ground truth image), we maintain the baseline quality and even exceed it in some configurations. Note that the stripes in the FLIP images appear due to deterministic sample placement during rendering, and smooth out when rendering multiple "jittered" rays per pixel.

Table 3. Performance and cache size (w/o double-buffering) ablation of our proposed methods. We report the speedup of each method compared to its baseline performance ("No Cache"). Times in ms for FullHD resolution. The speedups are heavily dependent on the efficiency of Θ_head, as can be seen when we disable our custom fused Θ_head implementation. We include render times for iNGP (big)† to put the other results into perspective, indicating that Ours can achieve speedups of 3.8× to 4.8× compared to iNGP (big). We show detailed per-stage timings of the cache rendering pipeline in Supplemental F.
Video Sequence. To showcase the scalability of our approach for high-quality offline rendering, we measure performance for a 300-frame video sequence of the Stump and Bonsai scenes. We initialize the cache with double the sampling rate along the view ray to prevent any possible undersampling, reaching a maximum cache size of 5.15 GiB for the Stump scene, and 3.95 GiB for Bonsai, with only ∼2.5% of bricks occupied in the sparse brick froxel grid. The cache is re-initialized automatically after the CHR drops below 85%, which occurs around 50 times in both scenes and takes around 200 ms per initialization. Considering overall render times, including the time spent for cache initialization, we achieve an average speed-up of ∼1.84× through our caching approach, without any noticeable loss in quality (see supplementary videos and Supplemental H for details).

LIMITATIONS & FUTURE WORK
Even though our work shows promising results, there are limitations to its application. The most obvious limitation is cache size, which can become prohibitively large for lower-end graphics cards, especially at higher resolutions. Secondly, even though cache initialization can be performed efficiently at runtime, it can still lead to a worse user experience during real-time rendering if performed naïvely. Our double-buffered initialization hides latency to a high extent; however, more sophisticated asynchronous approaches or prediction mechanisms would be required to completely mitigate this drawback. This is not an issue for offline rendering, where initialization times are amortized by faster rendering and longer cache reuse. Furthermore, our approach currently relies on a deterministic step size function to perform cache lookups, which prohibits the usage of proposal sampling techniques. Finally, choosing the stepping intervals too large during cache initialization can lead to undersampling, which could be solved by using anti-aliased sampling techniques, as proposed by Zip-NeRF [Barron et al. 2023]. Applying our method to more expensive models could also lead to more significant speedups, but would only be feasible for offline rendering. Another promising application is virtual reality, where a single cache can be used for both viewpoints and reused across multiple frames.
While our method is currently outperformed by 3DGS [Kerbl et al. 2023] and mesh-based methods [Chen et al. 2023] in terms of performance, volumetric NeRF approaches do not exhibit some of their drawbacks. 3DGS suffers from popping artifacts (discussed in detail in Radl et al. [2024]), slight inaccuracies due to its projection approximation, and comparatively large models. Mesh-based methods struggle to faithfully represent thin structures and semi-transparent surfaces, and often require an additional baking step.
Reducing cache size through compression, e.g. by consolidating similar latent codes hierarchically, is a logical next step. Additionally, updating the caching datastructure continuously with the information of cache-miss samples could help improve performance and postpone the next cache initialization. Finally, although we only tested our method on static NeRF scenes, it would be intriguing to apply it to dynamic NeRFs by using a bi-directional deformation field.

CONCLUSION
In this paper, we examined common NeRF network architectures in detail and proposed a caching and reprojection approach to exploit their underlying structure. Our method allows for temporal reuse of NeRF samples via a view-aligned, fully volumetric cache representation, while enabling re-evaluation of view-dependent effects. We are able to accelerate rendering by caching expensively computed view-independent latent codes, and proposed efficient cache sampling algorithms and training schemes to further improve quality and performance. Additionally, we introduced a novel view-dependent cone encoding that allows for much smaller latent codes, thereby decreasing memory requirements and further improving performance when rendering from the cache. Our approach scales exceptionally well with larger models and can speed up offline rendering with expensive effects by up to 2×, without requiring any baking of the underlying NeRF model. Furthermore, our insights into interpolation and caching of volumetric sample latent codes have the potential to inspire further developments in classical volume rendering. We think that temporal coherence methods will be essential going forward, especially for expensive, high-quality NeRF models. The source code of our training framework and renderer are publicly available (after review).

D VIEW-CONSISTENCY ABLATION

As expected, a smaller γ_max leads to a higher baseline quality, as Θ_neck can produce a more exact view-dependent encoding, and l needs to be less informative. However, re-evaluating l for larger view-direction changes fails catastrophically, as the model has not seen such large view changes during training. Ultimately, there is a trade-off between quality and adaptability to view changes, which can be chosen to accommodate the application's use case.

E GENERATING OFFSET VIEW-DIRECTIONS
In order to evaluate our cache rendering after rotational view-changes, we aim to generate rotated views o_rot with view-direction v_rot for each test view at position o with view-direction v. Therefore, we require some reference point p_ref along v to act as our rotation anchor. As Mip-NeRF 360 views are all inward-facing towards the scene origin, we choose p_ref as the closest point to the scene origin along v, determined via the projection of −o onto v. We then use the scene's up-vector and the vector p_ref − o as rotation axes for our rotation angles (θ, φ) to obtain the rotated view position o_rot and view-direction \(\mathbf{v}_{\mathrm{rot}} = \frac{\mathbf{p}_{\mathrm{ref}} - \mathbf{o}_{\mathrm{rot}}}{\lVert \mathbf{p}_{\mathrm{ref}} - \mathbf{o}_{\mathrm{rot}} \rVert}\).
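A NumPy sketch under one plausible reading of this construction, using Rodrigues rotations; the helper names are ours, and the exact axis convention is an assumption.

```python
import numpy as np

def rodrigues(p, axis, ang):
    """Rotate p around a unit axis by ang (radians)."""
    axis = axis / np.linalg.norm(axis)
    return (p * np.cos(ang) + np.cross(axis, p) * np.sin(ang)
            + axis * np.dot(axis, p) * (1.0 - np.cos(ang)))

def rotated_view(o, v, theta, phi, up):
    """Generate (o_rot, v_rot) by tilting the view by theta around p_ref,
    in the azimuthal direction phi. v and up are assumed unit vectors."""
    p_ref = o + np.dot(-o, v) * v            # closest point to the origin along v
    axis_phi = (p_ref - o) / np.linalg.norm(p_ref - o)
    tilt_axis = rodrigues(up, axis_phi, np.radians(phi))
    o_rot = p_ref + rodrigues(o - p_ref, tilt_axis, np.radians(theta))
    v_rot = (p_ref - o_rot) / np.linalg.norm(p_ref - o_rot)
    return o_rot, v_rot
```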

F DETAILED PER-STAGE TIMINGS
We report detailed timings for the individual stages of the rendering pipeline in Tab. 6. In the case of slight translation, where the CHR is close to 100%, Ours achieves speedups of almost 2×. Notably, already for rotational movements of θ = 10° (CHR ≈ 72%), the inference of new samples (cache-misses) exceeds the combined cost of cache lookup and inference. Unfortunately, the rendering pipeline also spends a considerable amount of time in stages whose performance is independent of caching ("Misc" and "Sample"), thereby limiting the speedup potential of a caching approach. If we only consider the inference and cache lookup stages, we actually achieve speedups of 2.7×.

G PERFORMANCE ABLATION: BRICK SIZES
We evaluate the performance of our caching approach for Ours with different brick sizes B_s, and show their impact on cache size and cache initialization times. In theory, padding should allow for faster memory access, since it eliminates the need for interpolation across brick borders. We see this effect for larger brick sizes; however, it does not manifest for small bricks. One possible reason is the larger memory overhead, as padded bricks with B_s = 6 already use 57% of their values to store neighboring information, leading to larger cache sizes and cache initialization times. In general, larger bricks allow for less sparsity in the datastructure, therefore also increasing cache size. Note: we use a custom interpolation kernel for the no-padding case, where we launch eight threads per sample and perform trilinear interpolation via shuffle operations. A naïve implementation might show a different runtime behavior.

H VIDEO SEQUENCE RENDERING
To showcase the capabilities of our caching approach for accelerating offline rendering of high-quality videos, we render a 300-frame video sequence for the Stump and Bonsai scenes in FullHD, using 256× supersampled motion blur and depth of field. We use Ours (huge) for both video sequences, reporting average rendering time and per-frame timings in Fig. 9. Note that cache initialization is already accounted for in the render times.
For caching, we use B_s = 16 without padding and initialize the cache with twice the resolution along the view ray to prevent any undersampling. Note that this leads to larger cache sizes and longer cache initialization times, as well as slightly slower rendering from the cache. The cache is re-initialized around 50 times in both video sequences, taking around 200 ms per initialization. Evidently, this overhead is quickly amortized through the increased rendering performance, especially since the cache is sampled 256× per frame and reused across multiple frames. The small spikes in the plot signal how render times increase as the CHR decreases, dropping back down as soon as the cache is re-initialized. Independent of caching, the performance is heavily influenced by the overall number of samples per ray, which depends on the currently rendered scene content.

Fig. 2. Network architecture comparison between the original NeRF [Mildenhall et al. 2020] and iNGP [Müller et al. 2022]. NeRF uses a frequency encoding γ for both the 3D position x and view-direction v, a deep base MLP, and a wide latent code l (N_l = 256). In contrast, iNGP uses a multi-resolution hash grid, a tiny base MLP, a very narrow l (N_l = 16), and an SH encoding of the view-direction.

Fig. 3. Depiction of our proposed training scheme to induce spatial linearity into the latent information l and σ. (a) Interpolation fails catastrophically at object boundaries if spatial linearity is not learned, which manifests as bright dots all over the object in the shown example. (b) We induce linearity during training by interpolating every actual sample at x_i from two artificial samples x_{(i,0)}, x_{(i,1)}, which are shifted along the view ray by a random per-ray stepping-space offset Δ_s ∈ [0, 1]. (c) This enables our final model to correctly perform trilinear interpolation from cache.

Fig. 5. Our proposed model utilizes a novel cone encoding to shrink the size of latent codes l and minimize the cost of re-evaluating l from a different view-direction v_a. The newly introduced intermediate network Θ_neck receives an SH-encoded initial view-direction SH(v_c), which is allowed to differ by up to γ_max degrees from v_a. The resulting l is now view-dependent; however, our proposed training scheme enables re-evaluation from v_a by providing the difference in viewing-angle to Θ_head. Forwarding the frequency-encoded 3D sample position γ(x_t) to Θ_head helps to counteract a possible information loss between Θ_base and Θ_head.

Fig. 8. View-consistency evaluation for our proposed model and cone encoding with different training degrees γ_max (Ours 5°−25°) and actual view-direction changes θ for the Room scene of the Mip-NeRF 360 dataset.

Fig. 9. Per-frame render timings (including cache initialization) for high-quality video sequences, rendering the Stump and Bonsai scenes with 256 samples per ray in FullHD on Ours (huge).
Fig. 4. Our proposed cone encoding learns to produce latent codes l, which were generated from a view-direction v_c, but can be re-evaluated from a similar view-direction v_a lying in the same cone C(v_c, γ_max). By randomly generating a number of view-directions {v_{(c,1)}, . . ., v_{(c,N_v)}} ∈ C(v_a, γ_max) during training, providing the difference in viewing-angle to Θ_head, and averaging the loss of all N_v samples, the model learns to treat l as a cone encoding around v_c. To guarantee variety during each training iteration, we randomly choose a center point v′_c ∈ C(v_a, γ_max/2), and uniformly space the N_v samples on a circle with radius γ_max/2.
Fig. 6. Visualization of the cache datastructures for a 2D toy example with a center object and an opaque background. During cache initialization, each ray places a number of samples at the same depths t_s, and performs early ray termination after reaching full opacity. The near plane and termination depths form a 2.5D frustum volume, representing the known range K. Memory is sparsely allocated for fixed-size bricks inside a froxel grid B, storing occupancy B_β and indices B_idx per brick. The indices B_idx point into voxel grid D, which contains the actual per-sample cached data. Note that this is only a simplified visualization; our froxel grids are actually curved along the s-dimension when transformed to view-space.

Table 2. Quantitative evaluation based on PSNR when rendering from our cache for different rotational/translational viewpoint changes. Models trained without induced linearity deliver significantly degraded image quality during cache rendering. Ours can faithfully render from cache even for small latent code sizes, if trained for a large enough cone angle.

Table 4. Ablation of different model configurations for Ours and iNGP Ours.

Table 6. Detailed timings for the individual stages of the cache rendering pipeline, for rotational and minimal translational view-changes. "Sample" refers to raymarching and sample placement with occupancy grid checks. "Misc" contains the color accumulation and compaction stages. Times in ms for FullHD resolution for Ours. Note: summed-up stage timings can differ slightly from total timings in other tables.

Table 7. Ablation of performance and cache sizes/initialization times for different brick sizes B_s and padding configurations. Times in ms for FullHD resolution on Ours.