What Operations can be Performed Directly on Compressed Arrays, and with What Error?

In response to the rapidly escalating data movement-related costs of computing with large matrices and multidimensional arrays, several lossy compression methods have been developed that help reduce the volume of data moved. Unfortunately, all these methods require the data to be decompressed before operating on it. In this work, we develop a lossy compressor for arbitrary-dimensional arrays called PyBlaz 1 that supports a dozen non-trivial operations directly on compressed data while also offering good compression ratios. PyBlaz is based on the PyTorch framework, and thus can be run on CPUs or GPUs without any code changes. We evaluate the efficacy of PyBlaz on datasets originating in three applications: comparing shallow-water simulation implementations, measuring statistics from MRI images, and detecting the scission point in plutonium fission data. Our results demonstrate that PyBlaz’s compressed-domain operations achieve good scalability while incurring controllable errors. To our best knowledge, this is the first such lossy compressor that supports compressed-domain operations in the realm of handling arbitrary-dimensional scientific datasets.


I. INTRODUCTION
Today's computational processes in areas such as highperformance computing (HPC) and machine learning (ML) consume as well as produce large multi-dimensional matrices and tensors, making it imperative that we rein in the volume of data moved. 1 It has been demonstrated that lossy compression methods (elaborated later in Section II), can achieve significant data volume reduction-anywhere from a factor of 8 to 60 or more.These methods have been used to reduce the volume of data checkpointed and even to avoid recomputing frequently reused data: this data can be stored in compressed form, and the compression costs are amortized over multiple reuses.While sufficient details of the original data are available when the compressed data is decompressed, existing methods require that the data be decompressed before operating.In this work, we show that through a suitable modification of existing data compression pipelines, a dozen commonly needed operations can be performed directly without decompression-with many operations incurring no additional error than sustained during compression.
While such direct operation has been attempted in limited settings before (Section II) and is also found in methods such as homomorphic encryption [1], ours is the first proposal that significantly advances the available repertoire of operations while also (1) providing an easily accessible and reasonably efficient implementation on top of GPU-powered PyTorch, and (2) studying the performance and error of direct operations in a variety of nuanced ways.
To provide an adequate context for detailing our work, let us consider an obvious alternative: precision tuning where one chooses FP16 or FP32 (IEEE floating-point standard 16bit number or 32-bit number [2]) in lieu of FP32 or FP64, respectively.While almost a drop-in replacement (modulo recompilation, etc.) to an existing piece of code, precision tuning attains a compression ratio of only two to four.Unfortunately, one can almost never be sure which few words are about to overflow or underflow: all other words may still have unused dynamic range.This is one reason why, despite all the hoopla, precision tuning has seen scanty update in HPC. 2ossy data compression operates by taking a data volume and distributing the error.As one example, some methods take the largest magnitude exponent in the volume, shift the remaining values right (thus losing some low-significance bits in the process) to match this exponent, and then encode the remaining data using an orthonormal transform.Thereafter, only the compression coefficients are kept-not the original data.Lossy compression is not a drop-in replacement to an existing application (e.g., specific array location identities are spread around during compression).However, many "bulk operations" can be supported quite naturally, and directly without decompression.
Here is one use-case: consider the so-called shallow-water simulation [4].Suppose one wants to perform these simulations over time at two working precisions, and obtains two time-series data sets ("two movies"), and also wants to determine the point in time at which the two time-series deviate beyond a threshold.Such deviations can be coarsely calculated using the L 2 -norm distance (which looks at the whole surface) or the Wasserstein distance3 that can capture subtle local features (e.g., "water ripples").In this work, we show that a suitably designed lossy compressor can determine both these types of distance metrics with respect to compressed data representations-without first having to decompress.
Key Contributions: We now list our two key contributions, providing insights.First, how can we incorporate operations that act directly in the compressed space?This is achieved by making sure that the compression coefficients are a much smaller proxy to the original data and are obtained through a set of linear operations (detailed in the paper).Thus one can operate on the coefficients in lieu of the data.These types of transformations are, in principle, supported by existing compression schemes such as ZFP [5]-but these have not been realized or evaluated in practice.In this regard, our contribution is to support the "obvious" operations such as addition and "non-obvious" ones such as Wasserstein distance.This results in a compressor that does not achieve as high a compression ratio as would have been achievable if we only had aimed for high compression ratios without the need to directly support operations-but with the bonus of having direct operation capability.
Second, even if these compressed-space operations are available, one has to be able to use them meaningfully in actual problem contexts.In this regard, we demonstrate these operations in the setting of three distinct applications where the distance between two data sets arises naturally.We show that it is possible to discern sufficient differences even in the image space of compressed representations. 4

II. BACKGROUND AND RELATED WORK
Lossy data compression methods [5]- [13], notably ZFP [5] and SZ [6], often achieve compression ratios of 50 or more, while maintaining sufficient fidelity for many applications.These compressors, developed with scientific data in mind, generally do not make many assumptions regarding the dimensionality of the data or the nature of each dimension.Specialized compression algorithms such as MP3 [14] for audio, JPEG [15] for images, or HEVC [16] for video exist, but are outside the scope of our work.None of these were created with the goal of facilitating operations on compressed data.Blaz [17] is a lossy compressor for 2-dimensional arrays, allowing certain compressed-space operations such as addition and multiplying by a scalar.We take inspiration from Blaz to build a compressor named PyBlaz for higher performance and also (especially) for direct operation.There are also designs that permit direct operation, but mainly for neural networks [18]- [23].

A. PyBlaz and three related compressors
We now discuss three closely related compressors to provide some context: ZFP [5], SZ [6], [10], [12], [13], and Blaz [17].We first start with a few concepts common to all compressors, motivating their purpose: • Decorrelation: Decorrelation is any process that reduces the autocorrelation in data.Orthonormal transformations and prediction are two decorrelation techniques.• An orthonormal transformation is a linear transformation that consolidates selectable bands of information while preserving orthogonality and dot products, key to realizing efficient compressed-space operations.• Prediction is describing elements of an array in relation to other nearby elements using a model of the data, such as a sliding template.a) ZFP: is a compression method for arbitrarydimensional FP64 arrays.It begins by blocking input arrays into blocks of size 4 in each direction.Each of these blocks is converted to a block floating-point format, such that each block shares the exponent of the biggest element in the block, with other significands scaled appropriately and converted to fixpoint form.Next, a near-orthogonal transform is applied to the fixed-point numbers in all directions.The resulting coefficients are converted to negabinary and their bits are encoded in decreasing order of significance using a Huffman-like scheme.
b) SZ: is a compression method designed for compressing 1-to 5-dimensional floating-point arrays.SZ uses a constant, linear, or quadratic prediction model to predict each element in the array based on its neighbors and quantizes the residuals using Huffman coding.c) Blaz: is a compression method for 2-dimensional FP64 arrays.Blaz blocks input arrays into 8 × 8 blocks.It then saves the first element of each block and encodes the others as the difference from their previous elements.Then, a block-wise discrete cosine transform (DCT) [24] is applied, resulting in blocks of coefficients.The biggest coefficient in each block is saved, and the others are binned into 255 bins, indexed using 8-bit integers from -127 to 127.Each block of indices is then pruned by dropping the 6 × 6 square in the higher-index corner of the block and then flattening the remaining indices.

B. Notations
We refer to the length (or size) of an array in each direction as its shape and notate shapes using tuples, e.g.(4,4) for a 4×4 array, or more generally using lower-case boldface letters (e.g.x), like vectors.Similarly, we notate indices into arrays A is the input array, and A ′ is the array obtained after lowering the precision, B 1..4 represent the blocks we get after blocking.In this figure, we show the rest of the procedure only on block B 1 .We perform DCT on each block, after which we obtain an array of coefficients C, which is further binned to result in an array of indices I. Finally, we apply the pruning mask, shown with black and white colors representing Boolean values, which results in pruned indices represented as F .F is later flattened.Unlike in Blaz, we skip the differentiation step (called normalization in [17]), which facilitates certain compressed-space operations explained in detail in § IV-A.
like vectors, e.g.i indexes into array X as X i .Also, X ...1 is all elements of X whose indices' last coordinate is 1.We use X to notate the sum of all elements in X, and X as the product of all elements in X. X + Y is element-wise addition of X and Y , X − Y is element-wise subtraction of Y from X, X ⊙ Y is the element-wise product of X and Y , and X ⊘ Y is the element-wise quotient of X and Y , broadcasting where necessary.X p is the element-wise exponentiation of each element of X to the pth power and b X is an array of b exponentiated by elements in X. ⌈X⌉ is the ceiling of X, an array of integers that are at least as great as their corresponding element of X.

III. PYBLAZ ARCHITECTURE
We developed PyBlaz to compress arbitrary-dimensional arrays of floating-point numbers.The key goal of this paper is, of course, to have PyBlaz directly support (i.e., without decompression) operations such as the dot product, mean, variance, covariance, L 2 norm, cosine similarity, structural similarity, and Wasserstein distance.(all detailed in § IV)-all without losing compression efficacy.The compression ratio ( § IV-C) depends only on compression settings and is independent of data.This is in contrast to other compressors, such as SZ, that might change the compression ratio in order to enforce some L ∞ error bound.Thus, the amount of compression-induced error depends on how closely the settings suit the data being compressed.

A. Compression Steps
Compression consists of five major steps: data type conversion, blocking, orthonormal transform, binning, and pruning.Each step is explained below and an example of compressing a 2-dimensional array is shown in Figure 1.Decompression consists of the compression steps in reverse.As PyBlaz is a lossy compressor, only blocking is exactly invertible.a) Data type conversion (to lower precision): Uncompressed scientific data is often found in 64-bit floatingpoint (FP64) arrays.During compression, it is sometimes advantageous to convert the array elements to a lowerprecision data type because the upcoming binning and pruning steps will further discard information.Lowering the precision not only reduces the compressed form memory usage, but also speeds up compression, decompression, and certain compressed-space operations.PyBlaz supports Brain Floating Point (bfloat16) [25], FP16, FP32, and FP64 as the precision types to which one can convert the input data.Loss is expected when converting to a type with fewer significand bits or converting to a type with insufficient dynamic range.b) Blocking: Blocking allows subsequent steps in the compression process to be performed on each block independently, facilitating parallelization.PyBlaz supports all block shapes that are a power of two in all directions, including shapes that are not the same length in all directions, i.e. nonhypercubic shapes.During blocking, an input array shaped s, with dimensionality d = |s|, is padded with zeros such that its size in each direction is a multiple of the block size in the corresponding direction.Let the block shape be i (the maximum intrablock index), and let the arrangement of blocks in the reshaped array have shape b = ⌈s ⊘ i⌉ (the maximum block index).Then the reshaped array will have shape bi.For example, using an input array shape (3, 224, 224) and block shape (4,4,4), the reshaped array will have shape (1,56,56,4,4,4).c) Orthonormal transform: Each block is then transformed into coefficients of some orthonormal transform, which we use to reduce periodic (wave-like) redundancy by consolidating data of the same spatial frequency into a scalar coefficient.PyBlaz uses DCT [24] by default, such that the DCT coefficients correspond to the correlation between the block and sampled cosine functions of different frequencies in different directions.Let {H 1 , . . ., H d } be the DCT matrices for each block size.Matrix H for block size s contains each ith element of each j-frequency sampled cosine function such that

2s
, where i ∈ [1..s] and j ∈ [1..s].Let Σ be a list of at least 2d symbols to use as indices.Then if B is the blocked array, the block-wise transformed array is using Einstein's summation notation [26].An example is explained in Appendix VI-A.PyBlaz also supports other transforms besides DCT, such as the Haar wavelet transform.The block can be recovered, modulo floating-point rounding error, by multiplying these coefficients with their corresponding sampled functions and summing them.
d) Binning: Binning coarsens the coefficient space, allowing coefficients to be referred to using shorter descriptors (fewer bits).To perform binning, the biggest elements per block of the coefficients array C are collected as N k = ∥C k ∥ ∞ for all block indices k.Then the elements in C are binned into a number of bins according to the bin index type, an integer type.PyBlaz supports int8, int16, int32, and int64.The number of bins is the number of values distinguishable by the data type minus one, and more bins enables finer rounding.The values of the bins in each block are centered at 0 and range to the biggest element.We define an index type radius r = 2 b−1 − 1, where b is the number of bits in the integer type.The values are binned by multiplying them by r, dividing by the biggest coefficient in their blocks, and rounding to the nearest integer.The result is an array of indices e) Pruning: Pruning allows selection of certain coefficient indices, and thus certain frequencies, to keep in the compressed array representation.The choice of indices to keep depends on the application.For example, noise of a certain frequency may be filtered by pruning its corresponding coefficient.Indices to keep are specified using a pruning mask P , shaped i. Loss in this step stems from the dropping indices, as it is effectively rounding to zero.After pruning, the remaining indices are flattened into a sequence F k for all blocks k.Because P is saved in the compressed representation, F can be unflattened with zeros in place of unspecified indices.

B. Compressed Form
The result of compressing the input data is a compressed array consisting of the original shape s, the block shape i, the biggest coefficient for each block N , the flattened specified bin indices F , as well as information that is required for decompression, such as the pruning mask.All components and their memory usage are listed in § IV-C.In our discussion of compressed-space operations ( § IV), we consider only the set {s, i, N, F } as a simplification of a compressed array.From this, an array can be decompressed by scaling F by N appropriately, unflattening blocks, performing the inverse orthonormal transform, merging blocks, and cropping to the original shape.

IV. COMPRESSED-SPACE OPERATIONS
Table I lists all the compressed-space operations along with the result type and source of error that can occur in each operation.All of the operations, except the approximate Wasserstein distance, are differentiable.This feature enables their incorporation into gradient-based optimization pipelines,  The list of operations supported by PyBlaz along with their result types (array of scalar).Some details are explained in § IV-A and § IV-B.Notably, most of the compressed-space operations introduce no additional error (beyond the inevitable error incurred during compression.)such as those that form the backbone of state-of-the-art ML models.We will explain these operations with pseudocode.

A. Operations in PyBlaz
The output obtained from compressing the input data is used to perform the compressed-space operations.As discussed before, we use the set {s, i, N, F } as a simplification of a compressed array.The algorithms presented in this section employ several element-wise array operations which have a natural implementation on GPUs (facilitated by our use of GPU-based PyTorch in realizing PyBlaz).The compression pipeline facilitates these operations due to two key properties.
(1) Each block of F is proportional to each block of transform coefficients, and scaling F by N recovers the specified coefficients.Thus, operating on F and in some cases N is equivalent to operating on the coefficients.(2) Because the orthonormal transform preserves orthonormality and dot products, the inverse transform is unnecessary for addition and multiplication and for obtaining related summative information, such as magnitudes and variances.
1) Negation: Negation (Algorithm 1) is done by negating F .Because the bin indices are proportional to the coefficients, negating the bin indices is equivalent to negating the coefficients that would result during decompression.
2) Element-wise addition: To add two compressed arrays (Algorithm 2), we calculate a sum of their specified coefficients and scale the indices to the possibly changed biggest coefficients.
3) Addition of a scalar: This and some later operations depend on the specified coefficients (Algorithm 3).
If the block shape is i and the first coefficients in each block were not pruned away, the first coefficient in each block is the mean of the uncompressed block scaled by c = (i Then the mean of the uncompressed array is the mean of all first coefficients divided by c.The addition of a scalar (Algorithm 4) is then achieved by adding the scalar multiplied by c to all first coefficients.
4) Multiplication by a scalar: Multiplication by a scalar (Algorithm 5) is done by multiplying N by the absolute value of the scalar, and if the scalar is negative, negating F .
In addition to the above operations, some other operations are facilitated by skipping the differentiation step.
5) Dot product: The dot products before and after an orthonormal transform are equal.Thus, the compressed-space dot product (Algorithm 6) is the dot product of specified coefficients.

Algorithm 3: SpecifiedCoefficients(A)
Data: compressed array A = {s, i, N, F } Result: specified coefficients of a compressed array b ← the number of bits in the element type of Algorithm 4: AdditionOfScalar(A, x) Data: compressed array A = {s, i, N, F }, scalar x Result: the array with x added to all elements Ĉ ← SpecifiedCoefficients(A); // Algorithm 3 ) is obtained by getting the mean of the element-wise product of the centered coefficients of two compressed arrays [20].Centered coefficients represent an uncompressed array with its mean subtracted.Block-wise covariance is also available by getting the block-wise means of this product.
) is obtained by finding the covariance of an array with itself.Block-wise variance is available as the block-wise mean of squared block-wise centered coefficients.Standard deviation or block-wise standard deviation are available as the square root of variance or blockwise variance.9) L 2 norm: The L 2 norm [27], also known as Euclidean norm, is commonly used to measure physical or spatially meaningful distances.Taking advantage of orthonormality, we a) Cosine similarity: [28] captures the degree of angular similarity between two vectors.We obtain cosine similarity by dividing the dot product of two arrays by the product of their L 2 norms (Algorithm 11).was developed to assess the visual similarity of images to humans, but has also been used to evaluate the quality of compressed data [11].SSIM compares the luminance, contrast, and structure of the images to assign a similarity score from 0 to 1 through a weighted product of these three terms (Algorithm 12).

B. Approximate operations in PyBlaz
While we have the possibility of realizing many approximate operations in PyBlaz, we discuss only one that is relevant to this work, namely the Wasserstein distance [30], used to measure the distance between probability distributions.Because it can be described as the cost of transforming one distribution to another by moving probability mass or density, it is also known as the Earth mover's distance.It is especially useful where one distribution is a perturbed version of another.
In PyBlaz, we can use the block-wise mean to find approximations of arbitrary operations on uncompressed arrays.The granularity of the approximation depends on the block shape specified for the blocking step during compression.At one extreme, one-element blocks will result in an exact operation while discarding all compression benefits.Thus, the choice of block shape should anticipate necessary approximate operations.
return l w l c wc s ws ; 1) Approximate Wasserstein distance: By using the blockwise mean, we can obtain an approximation of decompressed arrays, and thus can perform the conventional p-order Wasserstein distance between them (Algorithm 13).Because the Wasserstein distance measures distances between probability distributions, we ensure that the arrays are probability distributions by applying the softmax function 5 to them if they are not.Because sorting is involved, this function is not differentiable.

Algorithm 13: ApproximateWassersteinDistance(A, B)
Data: compressed arrays ; We now define a crucial figure of merit of a compressor, namely the compression ratio it delivers.

C. Compression Ratio
After compressing an array whose original shape is s, with dimensionality d = |s|, using block shape i, f -bit floating point numbers, i-bit integers as the bin index type, and a pruning mask P , certain quantities are to be stored.Because P and N have shapes i and ⌈s ⊘ i⌉ respectively, their shapes need not be stored, only their elements flattened into a 1dimensional sequence.The components to be stored are • the floating point and integer types, specified in 4 bits, • s, which takes 64d bits, • a marker for the end of s, taking up to 64 bits, • i, which takes 64d bits, • P flattened, which takes i bits, • N flattened, taking f ⌈s ⊘ i⌉ bits, and • F taking i( P )( ⌈s ⊘ i⌉) bits.Thus, if the input array were provided with u-bit elements, the compression ratio approaches u s (f + i P ) ⌈s ⊘ i⌉ as array size approaches infinity.The compression settings that most impact the compression ratio are the bin index type and pruning mask.For example, with an input array shaped (3, 224, 224) of 64-bit elements, block shape (4, 4, 4), floating-point type FP32, index type int16, and no pruning, the compression ratio will be ≈ 2.91.Using int8 and pruning half the indices, the compression ratio will be ≈10.66.

D. Compression Error
Compression error is introduced in the data type conversion, orthonormal transform, binning, and pruning steps.Error in the data type conversion and orthonormal transform steps consists only of floating-point rounding error, which we will exclude from this discussion.Here we describe error in the other steps.
In the binning step, we will consider the maximum error per block, which is half the width of a bin, determined by the highest-magnitude coefficient in the block and the number of bins.If the magnitude of the biggest coefficient in block k is N k , then the range of numbers that the bins must cover is 2N k .Recall that because the bins are centered at zero, the number of bins given an integer bin index type is the number of values distinguishable by that type minus one, which is also twice the radius r plus one, so we have 2r + 1 bins.Then, the width of a bin is 2N k 2r+1 , and so the maximum error in terms of coefficients is N k 2r+1 .In the pruning step, error is introduced by pruning nonzero bin indices, which correspond to non-zero coefficients.Therefore, in terms of coefficients, the error is simply the corresponding coefficients of the pruned indices.
As for how coefficient errors translate to errors in the uncompressed space, we must consider the orthonormal transform used.Recall that the basis vectors in orthonormal transform matrices are of unit length.Therefore, if a coefficient were to maximally influence an element of the decompressed array, the basis vector should consist of all zeros except for a one in some element, such as if we used the standard basis as a transform matrix (although doing so is useless).Therefore, by changing a single coefficient, the maximum error that may appear in an element of a decompressed array (i.e. the L ∞ error bound) is the same as the error in the coefficient.Of course, binning and pruning usually change more than one coefficient.Their combined effects on one element of a decompressed array is the sum of the corresponding elements of the transform matrix scaled by the coefficients.Therefore, if C is the coefficients and i is the block shape, then in block k, the only L ∞ error bound we can provide is a rather loose ∥C k ∥ ∞ i.However, due to the dot product-preserving properties of orthonormal transforms, we know the the L 2 error in the block is the same as the L 2 norm of the coefficient errors in the block.

E. Compression, Decompression, and Operation Time
An expected benefit of PyBlaz using the GPU is higher throughput compared to the single-threaded Blaz. Figure 2 shows the typical behavior of compressed-space operation time in PyBlaz, with comparisons to that in Blaz.The time taken remains constant until the GPU threads are saturated, after which it scales polynomially with array size, like the single-threaded Blaz.We include additional time figures for 3-dimensional arrays using a conservative block shape and various other compression settings in Appendix VI-B.We also compared the compression and decompression speeds of PyBlaz and those of SZ and ZFP.We compressed and decompressed hypercubic arrays with elements ranging from 0 to 1 arranged in a constant gradient from the lowest indices to the highest indices.That is, we used the array X, shaped (d) 3-dimensional decompression Fig. 3: Compression and decompression time taken compared to ZFP using CUDA.ZFP with CUDA supports only arrays of up to 3 dimensions and compression using fixed-rate mode.ZFP decompression does not use the GPU.ZFP compression ratios of approximately 8, 4, and 2 were specified using 8, 16, and 32 bits per scalar.PyBlaz ratios of approximately 8 and 4 were achieved using bin index types int8 and int16.This experiment was performed on a machine with an AMD Ryzen 5 3600 CPU and an NVIDIA GeForce RTX 2070 Super GPU.
s, such that X x = (x−1) (s−1) for all indices x. Figure 3 shows the time taken compared to ZFP.

V. EXPERIMENTS AND RESULTS
We study three different practical situations using PyBlaz.
Shallow water simulation is particularly used to show how errors caused due to precision tuning can be captured using our compressed-space operations.We use the simplest difference operation on each pixel of the data (done using the negation and addition compressed space operations) to show this error.
MRI: Next, we experiment with brain MRI images in the LGG segmentation data, showing absolute error and relative error between uncompressed and compressed mean, variance, and L 2 norm of one MRI image, and SSIM between two images.
Fission Data: Finally, we experiment with plutonium atom fission data to show how the compressed-space Wasserstein distance can be used to capture topological features (e.g. a scission point).A scission point is a time interval in which the atom's nucleus splits.This is achieved by compressing data at each time step and then comparing data using the L 2 norm and Wasserstein distance in adjacent time steps.We also compare the use of these two measures.

A. Shallow water simulation data
The shallow-water equations are a set of partial differential equations derived from Navier-Stokes equations that describe the fluid flow below the pressure surface in a 2-dimensional confined boundary domain.These simulations can output data on multiple simulation features over time, including water velocity, depth, pressure, and height of the surface.
We use a Julia implementation of the shallow water simulation [31], [32] that supports different precisions including FP16, FP32, and FP64.Even though 16-bit floating-point numbers (FP16) could be appropriate for specific applications, their precision is sometimes insufficient for certain scientific simulations compared to FP32 or FP64.Using FP16 could present difficulties related to the stability or accuracy of the simulations, which could lead to numerical errors or instability in subsequent calculations.We first verify that some imprecision occurs due to different floating-point types by looking at uncompressed simulation outputs.Then, we try to capture it by using aggressive compression settings in PyBlaz and checking whether we can achieve similar results with PyBlaz as well.
a) Dataset: For our experiments, we used a 500-day, double-gyre simulation with a domain shaped 200 × 400, with 100 grid cells in the first dimension.We varied the data type between FP16 and FP32 with boundary condition set to nonperiodic, wind forcing in x direction as double gyre and topography as seamount.More details on each of the parameters can be found in [31].
b) Experiment: The experiment highlights the importance of carefully choosing the precision settings in numerical simulations, and of considering the potential effects of changing precision on the shallow water simulation.We used the water surface height data obtained from the output of Fig. 4: Height of the water surface at one-time step from a shallow water simulation using different precisions.(a) Surface height using FP16 and (b) Surface height using FP32.These visualizations show the areas affected by the change in precision, with immediately visible differences marked with black rectangles in (a) and (b).By finding the difference between these outputs, we also capture other areas that have major differences in (c) and (d) marked with green rectangles.We also show that PyBlaz is able to capture similar differences from compressed data.Note that the surface height from this simulation could be negative.two simulations using FP16 and FP32 as data type.We first visualized the raw output to compare the effect of precision.We then visualized the element-wise difference in the data between both uncompressed forms and used compressedspace negation and addition to see the difference between the compressed forms.The compressor block shape is set to 16 × 16 for this experiment, the floating-point type is FP32, and the bin index type is int8.c) Observation and Result: Figure 4(a) and (b) show the surface height at one simulation step using different precision settings (FP16 and FP32), while Figure 4(c) and  (d) show the difference between the two precision settings using uncompressed and compressed calculations, respectively.The black rectangles in Figure 4(a) and (b) highlight the areas with the biggest perturbations.The perturbation caused by changing precision is not limited to the areas marked with black rectangles.The green rectangle in Figure 4(c) highlights an area where the perturbation is also present, as observed by calculating the pixel-by-pixel difference between Figure 4(a) and (b).With this experiment, we also show that the compressed-space difference operation (negation and element-wise addition) as shown in Figure 4(d) can be used to calculate the perturbations caused by changing precision.
The key takeaway from the experiment is that chang-ing precision causes perturbations in both uncompressed and compressed calculations, and that compressed data and compressed-space difference operations can be used to analyze and quantify these perturbations.This is useful because such scientific simulations have large data and are usually stored in compressed form and such lightweight compressed space operations are useful in most cases.

B. LGG segmentation data
This experiment characterizes the behavior of error on four compressed-space operations as a function of different compression settings on 3-dimensional MRI images.
a) Dataset: The LGG segmentation dataset [33] is a set of magnetic resonance imaging (MRI) images of human brains and corresponding patient data.Each image contains three channels: pre-contrast, fluid-attenuated inversion recovery (FLAIR), and post-contrast.The dataset consists of 110 MRI images.The first dimension of the images, corresponding to the up direction from the perspective of the human that was scanned, varies in size from 20 to 88, with a mean size of 35  in some directions is higher than that in others, which makes these images an appropriate candidate for compression using PyBlaz.Because some examples had missing pre-contrast or post-contrast channels, we experimented only with the FLAIR channel.After normalizing the values to the range [0, 1], we compressed the images in the FLAIR channel using various compression settings and measured the mean absolute and relative error between certain PyBlaz scalar functions and those on uncompressed images using plain PyTorch.The relative errors are relative to a FLAIR mean of 0.0870, averaged across each image and then across all images.The FLAIR standard deviation is 0.1238.The operations used in this study were mean, variance, L 2 norm, and SSIM.We took the first three measures on individual images.We calculated the SSIM between all pairs of images in the dataset, cropping or padding one of them to match shapes if necessary.
c) Observations and Result: We show mean absolute errors, mean relative errors, and compression ratios in Figure 5.Among the floating-point types, FP32 and FP64 achieved almost the same error, while FP16 and bfloat16 errors were probably unacceptable for many applications.Among the 16-bit types, FP16 usually achieved lower error than bfloat16 from its longer significand, while bfloat16 avoids NaNs because of its longer exponent.Among the FP32 and FP64 results, the lowest error is achieved using the smallest blocks and index type int16.However, depending on the function used, int8 may achieve the same error while approximately doubling the compression rate.Also, we can see the advantage of non-hypercubic blocks.When compressing large arrays whose size is similar in all directions, we usually expect bigger block sizes to have a higher compression ratio, as fewer floating points need to be stored.However, because the first dimension of the images in the dataset varies in size, and is usually about an eighth the size of the other dimensions, block sizes that have bigger first dimensions often take up significant additional space for padding.
Thus, for this dataset, the highest compression ratios are achieved using the non-hypercubic 4 × 16 × 16 blocks, which also achieve lower error than the hypercubic 8 × 8 × 8 blocks on the mean, variance, and SSIM functions.

C. Plutonium atom fission data
Nuclear density functional theory (DFT) is an approach used to study the properties inside an atom's nucleus.Nuclear fission is a phenomenon that takes place in a nucleon-nucleon interaction in atomic nuclei.The process of nuclear fission involves the division of an atom's nucleus into two or more parts.During this process, the nucleus is stretched, which causes deformation and changes in the nucleus structure.Accurately identifying the nuclear scission point i.e. the time steps between which the nucleus splits is a significant issue in nuclear fission research.At the point when nuclear scission happens, the atom's topology changes, which leads to a highmagnitude change in data values.Due to the large floatingpoint values generated during nuclear fission, it is advanta-geous to compress the data and to look for the nuclear scission in compressed space.In this experiment, we show how we can compress the data and perform a basic compressed-space operation to identify the time steps between which the scission occurred.
a) Dataset: The dataset consists of spatial densities in plutonium atoms that include the spatial densities of neutrons in the nucleus.These densities are sampled on a grid of shape 40 × 40 × 66.The dataset is negative log-transformed 6  and sampled at 15 different time steps, specifically [665,670,675,680,685,686,687,688,689,690,692,693,694,695,699].Previous work shows that the nuclear scission happens between time steps 690 and 692 [34]- [36].
b) Experiment: We take the negative log-transformed neutron densities of every time step, each provided as a 3dimensional array, and compress each array individually.To find the nuclear scission point we find the L 2 norm of the difference in log density between adjacent time steps D 1 and D 2 as ∥D 1 − D 2 ∥ 2 .Figure 6a shows these differences per time step pair.Since the L 2 norm was not able to capture the nuclear scission point with certainty (as we explain later), we also calculate the approximate Wasserstein distance between each time step and plot the line graph with different values of the order (p).Although the higher-order norms such as L ∞ are also able to ignore the noise.We used the approximate Wasserstein distance to demonstrate a use for approximations in PyBlaz.
For our experiment, we used a block size of 16 × 16 × 16, index type int16, and FP32 as the data type for the compressor.
c) Observation and Results: The L 2 norm differences and Wasserstein distances between adjacent time steps are shown in Figure 6.A major peak is obtained between time steps 690 and 692 when operating on compressed datamatching previously known results.
We first observe the results obtained using L 2 norm with block size 16×16×16.Although a major peak is shown at time step 690 (indicating a scission event between time steps 690 and 692), we also observed other nearby peaks between time steps 685 and 686 and between steps 695 and 699.Existing literature shows that other peaks are noise, and the topology of the data is not changing at those points [34], [35].These noise peaks make the scission less certain.Changing compression settings did not change this behavior.The compressed-space L 2 norm maintains sufficient fidelity to capture topological changes of interest because the maximum error between the L 2 norms of each time step from uncompressed and compressed data is approximately 1.68 (as shown in the right side of Figure 6a), while the mean L 2 norm is 618.97.
Using the p-order Wasserstein distance between adjacent time steps (Figure 6b, we observe that as p increases, the 6 Negative log-transformation is a method of normalizing and managing negative values in a dataset.A constant is added to all the data values, and then log transformation is performed on them.Since this process maintains the relative order of the values and an analogue of their relative magnitudes, it does not affect the overall behavior of the data and hence a similar effect would appear in the data if the experiments were performed on raw densities.smaller peaks are suppressed, and only one major peak (where nuclear scission occurred) is first observed when the order is set to 68.We also observe that if the order ≥80 all the peaks vanish.This shows that (approximate) compressed-space operations are able to identify nuclear scission and that one distance measure has advantages over another in certain cases.

VI. CONCLUSIONS
In this paper, We studied many direct operations on compressed data, selecting the appropriate distance metrics to characterize errors on top of compression-decompression errors (if any).Using a shallow water simulation, we showed how operations such as negation and element-wise addition can be used on large floating-point data of different precisions.We also show how large block sizes can be used while still capturing precision-change induced perturbations in the dataset.We then characterized the error between compressed-space scalar functions and their uncompressed-space counterparts as a function of data from the LGG segmentation dataset and compression settings.We finally used the physics dataset of nuclear fission of plutonium atoms to show the change in the topology of time-series data, illustrating the merits of using of the Wasserstein distance over the L 2 norm.This validates the promise of PyBlaz, the first data compressor supporting a dozen compressed domain operations.
Future work: We did not have an occasion to illustrate most of the supported operations through experiments, leaving them for future evaluation.Incorporating compression or decompression might be controlled by pragmas placed against their data structure declarations.Another usage scenario is in ensemble-testing where applications are compiled under different flags for testing the numerical characteristics of the code, as in [37].In [37], only simplistic distance measures were used and the distance calculation was in the uncompressed domain.With PyBlaz, we can employ more sophisticated measures while keeping the time-sequences of evolving simulation results in compressed form.PyBlaz can be made to automatically change its compression settings in order to enforce some L ∞ error bound through Bayesian optimization or a similar search process instead of relying on the user to find optimal compression settings.We plan to develop rigorous stage-wise error analysis for PyBlaz similar to what has been done for ZFP [38].
Last but not least, formal verification of compression, decompression, and compressed-space operations is almost a requirement, because subtle flaws might look confusingly similar to actual data aberrations.This verification would be necessary both at the higher-level (e.g., by coming up with equational axioms pertaining to various operations).Codelevel verification is equally important: an off-by-one error again might not cause a visible alarm until when one inadvertently handles the wrong (and critical) data.

A. Example orthonormal transform
Suppose we wish to perform DCT on a blocked array B using block shape (4,8).Recall that {H 1 , . . ., H d } are the DCT matrices for each block size and H ij = 1+(j>1) s cos πi(2j+1)

2s
. In this case, we have two DCT matrices H 1 and H 2 , one for each dimension, such that    Then for all blocks, let b be the arrangement of blocks in the blocked array, so that all blocks of coefficients are C [1..b]

Fig. 1 :
Fig.1: PyBlaz architecture showing compression of a 2-dimensional array.The blue colors represent floating-point numbers and red colors represent integers.A is the input array, and A ′ is the array obtained after lowering the precision, B 1..4 represent the blocks we get after blocking.In this figure, we show the rest of the procedure only on block B 1 .We perform DCT on each block, after which we obtain an array of coefficients C, which is further binned to result in an array of indices I. Finally, we apply the pruning mask, shown with black and white colors representing Boolean values, which results in pruned indices represented as F .F is later flattened.Unlike in Blaz, we skip the differentiation step (called normalization in[17]), which facilitates certain compressed-space operations explained in detail in § IV-A.

Algorithm 1 :
Negation(A) Data: compressed array A = {s, i, N, F } Result: the negated array return {s, i, N, −F }; Algorithm 2: Addition(A, B) Data: compressed arrays A = {s, i, N 1 , F 1 }, B = {s, i, N 2 , F 2 } Result: the element-wise sum of A and B b ← the number of bits in the element type of F

Algorithm 9 :
Variance(A) Data: compressed array A = {s, i, N, F } Result: the variance of the array v ← Covariance(A, A); // Algorithm 8 return v; can obtain the L 2 norm by calculating the L 2 norm of the specified coefficients (Algorithm 10).Algorithm 10: L 2 Norm(A) Data: compressed array A = {s, i, N, F } Result: the L 2 norm of the array Ĉ ← SpecifiedCoefficients(A); // Algorithm 3 return ∥ Ĉ∥ 2 ;

Fig. 5 :
Fig. 5: Absolute error and relative error between compressedspace scalar functions available in PyBlaz and uncompressed scalar functions on the FLAIR channel of the LGG segmentation dataset.Faint dots are individual examples.Squares show mean errors (MAE on the absolute axis) across all examples.Squares are missing where NaNs occurred on some example(s).Black horizontal lines show mean compression ratios over all examples, whose values are shown on the right vertical axis.No pruning was used.There is no relative error axis on SSIM because it is an index in [0, 1], which accounts for the magnitude of the arrays it compares.
Zoomed L2 norm difference between time step 687 and 688 for Pu densities L2 norm difference between adjacent time steps for Pu densities (a) L2 difference for uncompressed, (de)compressed, and compressed data, showing that the error between each of them is almost negligible.A zoomed version of L2 norm between time steps 687 and 688 is shown on the right side to show that uncompressed, (de)compressed, and compressed values are not exactly the same and actually differ by a small value.L2 norm does not eliminate other peaks (noise) captured at different time steps Wasserstein Distance Between Adjacent Time Steps of Pu Density (b) Approximate Wasserstein distance with different values of p.With p = 68, we clearly differentiate the nuclear scission time step from others.

Fig. 6 :
Fig. 6: (a) L 2 norm difference between adjacent time steps for negative log-transformed plutonium neutron density for compressed and uncompressed data.We observe a major change at time step 692 showing that the nuclear scission happened at this time step.But there are other peaks that might be misleading.(b) Approximate Wasserstein distance between adjacent time steps.We see that the misleading peak starts to vanish as the order (p) of the Wasserstein distance changes and only one peak is left when p = 68 (shown with the red line), showing only major changes (or scission point) in the data.A red line graph with p = 68 is also shown, clearly showing one major peak captured using the Wasserstein distance.

Figure 7
Figure7shows the time taken to perform various compressed-space operations available in PyBlaz.