Fast 2D Bicephalous Convolutional Autoencoder for Compressing 3D Time Projection Chamber Data

High-energy large-scale particle colliders produce data at high speed, on the order of 1 terabyte per second in nuclear physics and petabytes per second in high-energy physics. Developing real-time data compression algorithms that reduce such data at high throughput to fit permanent storage has drawn increasing attention. Specifically, at the newly constructed sPHENIX experiment at the Relativistic Heavy Ion Collider (RHIC), a time projection chamber is used as the main tracking detector, which records particle trajectories in a three-dimensional (3D) cylindrical volume. The resulting data are usually very sparse, with occupancy around 10.8%. Such sparsity presents a challenge to conventional learning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. The 3D convolutional neural network (CNN)-based approach, Bicephalous Convolutional Autoencoder (BCAE), outperforms traditional methods in both compression rate and reconstruction accuracy. BCAE can also utilize the computation power of graphics processing units, making it suitable for deployment in a modern heterogeneous high-performance computing environment. This work introduces two BCAE variants: BCAE++ and BCAE-2D. BCAE++ achieves a 15% better compression ratio and a 77% better reconstruction accuracy, measured in mean absolute error, compared with BCAE. BCAE-2D treats the radial direction as the channel dimension of an image, resulting in a 3x speedup in compression throughput. In addition, we demonstrate that an unbalanced autoencoder with a larger decoder can improve reconstruction accuracy without significantly sacrificing throughput. Lastly, we observe that both BCAE++ and BCAE-2D benefit from half-precision mode, gaining 76-79% in throughput without loss in reconstruction accuracy. The source code and links to data and pretrained models can be found at https://github.com/BNL-DAQ-LDRD/NeuralCompression_v2.


INTRODUCTION
High-energy particle accelerators, such as the Large Hadron Collider (LHC) [7] and the Relativistic Heavy Ion Collider (RHIC) [17], play a critical role in advancing knowledge about the fundamental building blocks of the universe. Particle accelerators work by accelerating charged particles close to the speed of light and colliding them, where they interact and produce new subatomic particles. Particle detectors are built around the collision point to detect these particle products, which encode information about the interactions at the collision. In essence, a tracking detector acts as a camera capturing three-dimensional (3D) particle trajectories. Each pixel or voxel that a particle passes through registers an analog-to-digital converter (ADC) value above a zero-suppression threshold. This 3D "camera" can work with a large number of channels (millions to billions) at a high frame rate (10 kHz to GHz), producing a large amount of ADC data.
Specifically, the recently constructed sPHENIX experiment at RHIC [17] consists of layers of tracking and calorimeter detectors that aim to study the microscopic nature of strongly interacting matter, ranging from nucleons to the strongly coupled quark-gluon plasma. Most sPHENIX data come from its main tracking detector, a Time Projection Chamber (TPC) (shown in Figure 1). The sPHENIX TPC continuously digitizes 42M-voxel 3D pictures of the collisions at 77 kHz. Traditionally, a filtering system called a level-1 trigger has been used to reduce and store the data in time. The trigger determines which data are more valuable, so that only a small subset of the data is selected and stored for later analysis. As an alternative to a level-1 trigger, developing a high-throughput real-time compression algorithm that reduces and stores all collision signals has become increasingly important for future collider experiments with streaming data acquisition (DAQ), which aim to record all collisions [1,5].
There is abundant literature in the lossy compression community, but few existing methods have been optimized or designed for sparse 3D TPC data. For example, the effectiveness of the error-bounded SZ [6,14,18] compression algorithm has been demonstrated on climate science and cosmology data. The fixed-rate compression method (ZFP) [13] was motivated by hydrodynamics simulations, while the MultiGrid Adaptive Reduction of Data (MGARD) [2,11] method was developed for compressing turbulent channel flow and climate simulation data. We hypothesize that most data challenges impacting the high-performance computing (HPC) community stem from distributed high-fidelity simulation in climate science, fluid dynamics, cosmology, and molecular dynamics. Therefore, by introducing this unique challenge from the particle accelerator community, we seek to ignite research interest in the scientific data reduction community. Although all these compression algorithms have demonstrated reasonable performance with 3D TPC data, a specially designed neural network-based model, the Bicephalous Convolutional Auto-Encoder (BCAE) [10], can outperform them in both compression rate and reconstruction accuracy. However, as an initial proof-of-concept, BCAE has some drawbacks, including suboptimal compression throughput.
This work proposes an improved version of BCAE, called BCAE++, which improves the compression ratio from 27 to 31 and decreases the mean absolute error (MAE) from 0.198 to 0.112. MAE is an indicator of reconstruction accuracy: the lower, the better. We also introduce a two-dimensional (2D) variant of BCAE, BCAE-2D, which replaces the 3D CNN layers with 2D ones and treats the radial dimension of the TPC data as a channel dimension. This change results in a 3× speedup in throughput. Because only the encoder portion of the autoencoder will be used in real time and the decoder (decompression) can be used offline, this work also explores whether expanding the number of decoder parameters can improve reconstruction accuracy. Lastly, we demonstrate a post-training "trick" that uses the half-precision representation of a network. It can enhance the throughput by over 70% without losing reconstruction accuracy.

TPC DATA AND BCAE COMPRESSION METHODS

TPC Data Preparation
The sPHENIX TPC (Figure 1) is a well-suited testbed for developing a high-throughput real-time compression algorithm. It is located between the inner silicon vertex tracker and the electromagnetic calorimeter. Along the radial dimension, the TPC is composed of 48 cylindrical layers of small sensors, which are grouped into three layer groups: inner, middle, and outer. Each layer group has 16 consecutive layers. In the digitized data and for each TPC layer, the voxels are presented as a rectangular grid with rows along the $z$ (or horizontal) direction and columns along the azimuthal direction. Within one layer group, all layers have the same number of rows and columns. This allows us to represent the ADC values from one layer group as a 3D array. This study focuses on the outer layer group, where the array of ADC values has shape (16, 2304, 498) in the radial, azimuthal, and horizontal order. To match the subdivision of the TPC data assembly module in the readout chain, the voxel data are divided into 24 equal-size non-overlapping sections: 12 along the azimuthal direction (30 degrees per section) and 2 along the horizontal direction (divided by the transverse plane passing through the collision point). We call one such section a TPC wedge (Figure 2). The array of ADC values from each TPC wedge in the outer layer group has shape (16, 192, 249), listed in the radial, azimuthal, and horizontal directions, respectively. All ADC data from the same wedge will be transmitted to the same group of front-end electronics, after which a real-time lossy compression algorithm could be deployed. Therefore, TPC wedges are used as the direct input to the deep neural network compression algorithms.
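The wedge subdivision described above can be checked with a few lines of index arithmetic. The following is a minimal sketch, not the authors' data loader: the helper names `wedge_slices` and `slice_shape` are hypothetical, and the ordering of the wedges is an assumption made only for illustration.

```python
# Shape of the outer layer group: (radial, azimuthal, horizontal), as stated in the text.
OUTER_SHAPE = (16, 2304, 498)
N_AZIMUTHAL, N_HORIZONTAL = 12, 2  # number of sections along each direction


def wedge_slices(shape=OUTER_SHAPE, n_azim=N_AZIMUTHAL, n_horz=N_HORIZONTAL):
    """Return one (radial, azimuthal, horizontal) slice triple per wedge.

    Hypothetical helper: the wedge ordering is an illustrative assumption.
    """
    n_rad, azim, horz = shape
    assert azim % n_azim == 0 and horz % n_horz == 0
    da, dh = azim // n_azim, horz // n_horz
    return [
        (slice(0, n_rad), slice(a * da, (a + 1) * da), slice(h * dh, (h + 1) * dh))
        for a in range(n_azim)
        for h in range(n_horz)
    ]


def slice_shape(slices):
    """Shape of the sub-array selected by a tuple of slices."""
    return tuple(s.stop - s.start for s in slices)


wedges = wedge_slices()
print(len(wedges), slice_shape(wedges[0]))  # 24 (16, 192, 249)
```

Each slice triple can be applied directly to a 3D array of that shape to extract one wedge.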
Here, we use simulation data of 1310 events for central $\sqrt{s_{NN}} = 200$ GeV Au+Au collisions with 170 kHz pile-up collisions. The data were generated with the HIJING event generator [20] and the Geant4 Monte Carlo detector simulation package [4] integrated with the sPHENIX software framework [16]. The simulated TPC readouts (ADC values) from these events are represented as 10-bit unsigned integers in [0, 1023]. To reduce unnecessary data transmission between detector pixels and front-end electronics, a zero-suppression algorithm has been applied. All ADC values below 64 are suppressed to zero as most of them are noise. This zero-suppression makes the TPC data sparse, at about 10% occupancy (non-zero values).
We divide the 1310 total events into 1048 events for training and 262 for testing. Each event contains 24 outer-layer wedges. Thus, the training partition contains 25152 TPC outer-layer wedges, while the testing partition has 6288 wedges. The compression algorithm aims to compress each wedge independently. Finally, as trajectory locations must be interpolated from neighboring sensors using the ADC values, it is important to preserve the relative ADC ratio between the sensors. Hence, for this study, we trained autoencoders to reconstruct the log ADC values, $\log_2(\mathrm{ADC} + 1)$, instead of the raw ADC values. A log ADC value is a floating-point number in [0, 10]. Because of the zero-suppression at 64, all nonzero log ADC values exceed 6. The ground truth distribution of log ADC values is plotted in Figure 3.
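The preprocessing described above is small enough to write out. The sketch below assumes scalar inputs and uses hypothetical helper names (`zero_suppress`, `log_adc`); it only illustrates why nonzero log ADC values fall between 6 and 10.

```python
import math

THRESHOLD = 64   # zero-suppression threshold stated in the text
ADC_MAX = 1023   # largest 10-bit unsigned value


def zero_suppress(adc, threshold=THRESHOLD):
    """Set ADC values below the threshold to zero."""
    return adc if adc >= threshold else 0


def log_adc(adc):
    """Map a raw ADC value to log2(ADC + 1)."""
    return math.log2(adc + 1)


# Made-up raw ADC values, chosen to straddle the threshold.
raw = [0, 12, 63, 64, 300, ADC_MAX]
transformed = [log_adc(zero_suppress(a)) for a in raw]
print([round(v, 3) for v in transformed])

# Wedge counts implied by the train/test split (24 wedges per event).
train_wedges, test_wedges = 1048 * 24, 262 * 24
print(train_wedges, test_wedges)  # 25152 6288
```

The smallest surviving ADC value, 64, maps to log2(65), which is slightly above 6, and the largest, 1023, maps to exactly 10.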

Bicephalous Convolutional Auto-Encoder (BCAE)
An autoencoder [9] is composed of one encoder and one decoder. The output of the encoder is called a code. Autoencoders are commonly used for data compression. The compression ratio is defined as the ratio between the size of an input and the size of its code: the smaller the code, the higher the compression ratio. The decoder takes in a code and produces a reconstruction of the input. The distance between the original input and the reconstruction is used as the loss function to train the autoencoder network. Although this approach may work well when the input distribution is regular (for example, resembling a Gaussian distribution), it may struggle with a distribution such as that of zero-suppressed log ADC values [3,8]. As evident in Figure 3, the distribution of log ADC values is bimodal and has a sharp edge at 6.0. Such an irregular distribution calls for a modified autoencoder structure, and the BCAE [10] was proposed as a potential solution. As illustrated in Figure 4, in addition to the regression decoder $D_{\text{reg}}$, BCAE has a segmentation decoder $D_{\text{seg}}$ for voxelwise two-class classification. The segmentation decoder determines whether a voxel is zero (class 0) or nonzero (class 1). The output produced by $D_{\text{seg}}$ is assessed using the focal loss $\mathcal{L}_{\text{seg}}$, a specialized loss function designed to address imbalanced datasets [12]. The output generated by the regression decoder $D_{\text{reg}}$ is combined with that from $D_{\text{seg}}$ to form the reconstruction of the input, which is then evaluated using a regression loss $\mathcal{L}_{\text{reg}}$. This study uses MAE for the regression loss.
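Specifically, if we denote the total number of voxels by $n$ and let $\hat{l}_i$ be the output of $D_{\text{seg}}$ for voxel $i$, the focal loss is defined to be

$$\mathcal{L}_{\text{seg}} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i\,(1-\hat{l}_i)^{\gamma}\log \hat{l}_i + (1-y_i)\,\hat{l}_i^{\gamma}\log(1-\hat{l}_i) \right]. \tag{1}$$

Here, $y_i = 1$ if voxel $i$ is positive (nonzero) and $0$ otherwise, and $\gamma$ is the focusing parameter. We employ the focal loss because, on average, only 10.8% of ADC values are nonzero. In this study, we set the focusing parameter $\gamma$ to 2. Given a classification threshold $h$, and letting $\hat{v}_i$ be the prediction for voxel $i$ produced by $D_{\text{reg}}$, the masked prediction $\tilde{v}_i$ is defined as

$$\tilde{v}_i = \hat{v}_i\,\mathbb{1}_{\hat{l}_i > h}, \tag{2}$$

where $\mathbb{1}$ is the characteristic function. Hence, the regression loss $\mathcal{L}_{\text{reg}}$ is defined as

$$\mathcal{L}_{\text{reg}} = \frac{1}{n}\sum_{i=1}^{n}\left| v_i - \tilde{v}_i \right|, \tag{3}$$

where $v_i$ is the ground-truth log ADC value of voxel $i$. Finally, to manage the gap between 0 and 6, we adopt another technique proposed by [10] called regression output transformation. We apply the output activation function $x \mapsto 6 + 3\exp(x)$ to the output of the regression decoder. Note that with this activation, all regression output values are above 6, and the zero values in the reconstruction result from the masking by the segmentation output (refer to the definition of $\tilde{v}_i$).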
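To make the interplay of the two decoders concrete, the following pure-Python sketch evaluates the quantities described above on a handful of voxels. It assumes the standard form of the focal loss from [12] with focusing parameter 2, the threshold-based masking, the output transformation 6 + 3 exp(x), and MAE as the regression loss. All function names and the toy values are hypothetical, and a real implementation would operate on tensors rather than Python lists.

```python
import math


def focal_loss(labels, probs, gamma=2.0, eps=1e-12):
    """Mean focal loss over voxels; labels are 0/1, probs are segmentation outputs."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total -= (1.0 - p) ** gamma * math.log(p)
        else:
            total -= p ** gamma * math.log(1.0 - p)
    return total / len(labels)


def output_activation(x):
    """Regression output transformation 6 + 3 * exp(x); always above 6."""
    return 6.0 + 3.0 * math.exp(x)


def masked_prediction(reg_out, probs, h=0.5):
    """Keep the transformed regression value only where the segmentation output exceeds h."""
    return [output_activation(r) if p > h else 0.0 for r, p in zip(reg_out, probs)]


def mae(target, prediction):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - q) for t, q in zip(target, prediction)) / len(target)


# Toy example with four voxels (all values are made up for illustration).
target = [0.0, 0.0, 7.0, 9.0]                  # ground-truth log ADC values
labels = [1 if v > 0 else 0 for v in target]   # 1 = nonzero voxel
probs = [0.1, 0.6, 0.9, 0.4]                   # hypothetical segmentation outputs
reg_out = [-2.0, -1.0, -1.0, 0.0]              # hypothetical raw regression outputs

recon = masked_prediction(reg_out, probs)
print(recon)
print(focal_loss(labels, probs), mae(target, recon))
```

In this toy case, the second voxel is a false positive and the fourth a false negative, so both contribute large absolute errors; this is why segmentation quality matters as much as regression quality for the final reconstruction.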

BCAE++ and BCAE-HT
Two modifications are made to the original BCAE [10]. First, we pad the horizontal direction from length 249 to length 256 with zeros. This makes halving the dimension more straightforward and enables using convolution/deconvolution with kernel size 4, padding 1, and stride 2 uniformly throughout the encoder and decoder construction. This change streamlines programmatic neural network architecture search. In addition, this modification reduces the code dimension from (8, 17, 13, 16) to (8, 16, 12, 16) and improves the BCAE compression ratio from 27.041 to 31.125. The zero-padding in the horizontal direction is clipped during evaluation, so reconstruction accuracy metrics are not inflated. Second, we remove all the normalization layers in BCAE, as they do not affect reconstruction performance significantly in a sufficiently long training, and removing them speeds up both training and inference. Based on these modifications, we constructed BCAE++ with a similar number of parameters to the original BCAE but with better performance (Table 1) and a larger compression ratio.
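The benefit of padding to 256 can be seen from the standard output-length formula of a strided convolution, floor((n + 2p - k) / s) + 1. The sketch below applies it with kernel size 4, padding 1, and stride 2. It assumes four such downsampling steps, which is consistent with the reported code dimensions but is an assumption here, and the helper names are hypothetical.

```python
def conv_out_len(n, kernel=4, padding=1, stride=2):
    """Output length of a convolution along one axis (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def downsample_chain(n, steps):
    """Lengths after repeatedly applying the kernel-4 / padding-1 / stride-2 layer."""
    lengths = [n]
    for _ in range(steps):
        lengths.append(conv_out_len(lengths[-1]))
    return lengths


print(downsample_chain(256, 4))  # [256, 128, 64, 32, 16]  padded horizontal length
print(downsample_chain(192, 4))  # [192, 96, 48, 24, 12]   azimuthal length
print(downsample_chain(249, 4))  # [249, 124, 62, 31, 15]  unpadded horizontal length
```

With the padded length, every step is an exact halving, so the matching transposed convolutions in the decoder recover the original length without special-case layers. With the unpadded length 249, odd intermediate lengths appear and the halving is no longer exact.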
We also introduce a high-throughput (HT) variation, BCAE-HT. The difference between BCAE++ and BCAE-HT is in the number of features (output channels) in the four residual blocks of their respective encoders. For BCAE++, the numbers of features are (8, 16, 32, 32), whereas for BCAE-HT they are reduced to (2, 4, 4, 8).

BCAE-2D
Because the TPC wedge is thin along the radial dimension (16 layers, versus 192 in the azimuthal and 249 in the horizontal dimensions), it may be more reasonable to treat the layer dimension as the channels of an "image." Moreover, while the layers all have different radii, the number of columns (along the azimuthal direction) in one layer group remains the same. This means the distance between two adjacent voxels along the azimuthal direction is larger in an outer layer than in an inner one. This breaks the translation invariance that a 3D convolution assumes along the radial direction, making 2D convolution an even more appropriate choice for a TPC wedge.
We detail the construction of the 2D encoder and decoder in Algorithm 1 and Algorithm 2, respectively. The algorithms specify each convolution layer by its number of input and output channels, kernel size, padding (default is 0), and stride (default is 1).

Algorithm 2: BCAE_decoder_2D
Input: number of blocks, number of upsampling layers, output activation function
Output: a PyTorch module
# NOTE: a decoder must have the same number of upsampling steps as the downsampling steps in its corresponding encoder
Initialize the network to be an empty module list

Training Procedure
We implement all BCAEs with PyTorch 2.0. The training is conducted on the 25152 outer-layer TPC wedges in the training partition of the dataset, while 6288 TPC wedges are reserved for testing. We set the classification threshold $h$ in Equation (2) to 0.5 for both training and testing. All BCAE models are trained with a batch size of 4. We train all BCAE++ and BCAE-HT models for 1000 epochs. The initial learning rate is set to $10^{-3}$ and remains constant for the first 100 epochs. In the remaining epochs, we decrease the learning rate by 5% every 20 epochs. All BCAE-2D models are trained for 500 epochs. We set the initial learning rate to $10^{-3}$ and keep it constant for the first 50 epochs. In the remaining epochs, we decrease the learning rate by 5% every 10 epochs. For all BCAE models, we use the AdamW [15] optimizer with $(\beta_1, \beta_2) = (0.9, 0.999)$ and weight decay 0.01.
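The learning rate schedule described above can be written as a small function of the epoch index. This is a sketch under stated assumptions rather than the training code: the text does not say exactly at which epoch the first 5% reduction takes effect, so the choice below (immediately after the constant phase, with 0-indexed epochs) is an assumption, and `learning_rate` is a hypothetical name.

```python
def learning_rate(epoch, base_lr=1e-3, constant_epochs=100, step=20, decay=0.95):
    """Learning rate at a 0-indexed epoch: constant phase, then 5% step decay."""
    if epoch < constant_epochs:
        return base_lr
    # Assumption: the first reduction is applied as soon as the constant phase ends.
    n_decays = (epoch - constant_epochs) // step + 1
    return base_lr * decay ** n_decays


# Schedule for BCAE++ / BCAE-HT (1000 epochs) at a few sample epochs.
for epoch in (0, 99, 100, 119, 120, 999):
    print(epoch, learning_rate(epoch))

# Schedule for BCAE-2D (500 epochs): shorter constant phase and decay step.
for epoch in (0, 49, 50, 60, 499):
    print(epoch, learning_rate(epoch, constant_epochs=50, step=10))
```

In PyTorch, the same shape could be obtained by passing an equivalent multiplicative factor to a lambda-based scheduler attached to the AdamW optimizer.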
To improve the classification performance, we dynamically balance the contributions of the segmentation and regression losses: after each epoch, the coefficient of $\mathcal{L}_{\text{seg}}$ for the following epoch is updated based on the current coefficient and the segmentation and regression losses observed in that epoch. We set the initial coefficient to 2000.

RESULTS

Compression Ratio
The compression ratio is computed as the ratio between the size of the input and the size of the code. Both the input and the code are treated as 16-bit floats. As mentioned in Sections 2.3 and 2.4, the code produced by BCAE-2D has shape (32, 24, 32), and those produced by BCAE++ and BCAE-HT both have shape (8, 16, 12, 16). Because the TPC wedge has shape (16, 192, 249), the compression ratio is 31.125 for all newly introduced BCAE variants. This is greater than the compression ratio of 27.041 of the original BCAE [10].
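Since the input and the code use the same 16-bit element type, the compression ratio reduces to a ratio of element counts. The short check below reproduces the reported numbers from the shapes given in the text; the dictionary and function names are chosen here for illustration only.

```python
from math import prod

WEDGE_SHAPE = (16, 192, 249)  # unpadded TPC wedge: radial, azimuthal, horizontal

CODE_SHAPES = {
    "BCAE (original)": (8, 17, 13, 16),
    "BCAE++ / BCAE-HT": (8, 16, 12, 16),
    "BCAE-2D": (32, 24, 32),
}


def compression_ratio(input_shape, code_shape):
    """Ratio of element counts; valid when input and code share one element size."""
    return prod(input_shape) / prod(code_shape)


for name, shape in CODE_SHAPES.items():
    print(f"{name}: {compression_ratio(WEDGE_SHAPE, shape):.3f}")
# BCAE (original): 27.041
# BCAE++ / BCAE-HT: 31.125
# BCAE-2D: 31.125
```

The 3D and 2D codes contain the same number of elements (24576), which is why all new variants share one compression ratio.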

Comparing Encoder Model Size and Throughput
The encoder model size is measured in the number of trainable parameters. Encoder throughput is measured in the number of TPC wedges processed per second. The input and output are allocated in GPU memory. Therefore, file system input/output (IO) and host-to-device data transfer are not considered. All throughput experiments are conducted on a single NVIDIA RTX A6000 GPU with driver version 535. On the software side, we used PyTorch 2.0 compiled with CUDA 12.2.
As shown in Table 1, BCAE++ has the largest encoder size of 226k parameters, followed by the original BCAE with 202k. The

Reconstruction Accuracy Comparison
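We evaluated the performance of the BCAEs using four metrics: MAE, peak signal-to-noise ratio (PSNR), precision, and recall. Here, precision and recall are defined with respect to the voxelwise classification, with nonzero voxels as the positive class:

$$\text{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \text{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative voxels, respectively.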
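As a small illustration of the two classification metrics, the sketch below computes them under the usual definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), treating nonzero voxels as the positive class. Counting a reconstructed voxel as positive when its value is nonzero is an assumption of this sketch, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(target, prediction):
    """Precision and recall of nonzero-voxel detection in a reconstruction."""
    pairs = list(zip(target, prediction))
    tp = sum(1 for t, p in pairs if t > 0 and p > 0)    # nonzero, predicted nonzero
    fp = sum(1 for t, p in pairs if t == 0 and p > 0)   # zero, predicted nonzero
    fn = sum(1 for t, p in pairs if t > 0 and p == 0)   # nonzero, predicted zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up log ADC values for five voxels.
target = [0.0, 0.0, 7.0, 9.0, 8.0]
prediction = [0.0, 7.1, 7.1, 0.0, 8.2]
print(precision_recall(target, prediction))  # two TP, one FP, one FN
```

Both metrics depend only on which voxels are reconstructed as nonzero, so they complement MAE and PSNR, which measure how close the reconstructed values are.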
We compare BCAE-2D and BCAE++ performance in two computation modes in Table 2. Given that compressing in half-precision yields negligible performance degradation while significantly boosting throughput, it is the most likely computation mode for future deployment. Hence, all BCAE performance results reported in Table 1 are obtained in the half-precision computation mode. BCAE++ achieves the best scores in all reconstruction measurements.
Figure 5 compares the reconstruction performance of BCAE-2D, BCAE++, and BCAE-HT on one test TPC wedge. The noticeably different plots (second row) indicate that the reconstruction produced by BCAE++ is the most accurate.

Investigating Half-precision Speedup
Throughput is tested in two computation modes: full-precision and half-precision. In full-precision mode, the encoder weights and input are all set to 32-bit floats. In half-precision mode, we manually cast the encoder weights and input to 16-bit floats. In Figure 6A and 6B, for BCAE-2D and BCAE++, half-precision affords more than 70% improvement in throughput. In full-precision mode, BCAE-HT and BCAE-2D have a similar throughput of 4000 frames per second. However, BCAE-HT's speedup is much smaller (Figure 6C). This is due to the extremely small model size (9.8k) of BCAE-HT after reducing the 3D convolution channel sizes from BCAE++'s (8, 16, 32, 32) to (2, 4, 4, 8). As shown in Figure 6D, Tensor Core units are not used by those time-consuming convolution computations.
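As a side note on what casting to 16-bit floats does to the data itself, the sketch below rounds values in the nonzero log ADC range [6, 10] to IEEE 754 half precision using only the Python standard library and reports the largest rounding error. It says nothing about throughput or about errors accumulated inside the network; it only illustrates the granularity of the 16-bit representation on this value range, and `to_half` is a hypothetical helper.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Evenly spaced grid over the nonzero log ADC range [6, 10].
values = [6.0 + 4.0 * k / 1000 for k in range(1001)]
max_err = max(abs(v - to_half(v)) for v in values)

print(max_err)
print(max_err <= 2 ** -8)  # True: half the spacing of binary16 values in [8, 16)
```

Half precision has 10 explicit mantissa bits, so adjacent representable values are 2^-8 apart in [4, 8) and 2^-7 apart in [8, 16), which bounds the rounding error of a single stored value by 2^-8 on this range.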

Investigating Auto-encoder Design
Here, we study how the depth (the number of blocks) of a BCAE-2D model's encoder and decoders influences its reconstruction accuracy. For this purpose, we conduct a grid search on the number of encoder blocks (see Algorithm 1), ranging from 3 to 7, and on the number of decoder blocks. Figure 7 illustrates the reconstruction accuracy of BCAE-2D models in MAE, precision, and recall. While the performance benefits significantly from deepening the decoders, the influence of encoder depth is less clear. The benefit of a deep encoder becomes evident only when it is paired with decoders that are significantly deeper. We also measure the compression throughput in half-precision mode and show the result in Figure 6E. After balancing reconstruction accuracy and compression throughput, we choose the BCAE-2D model with 4 encoder blocks and 8 decoder blocks to represent the BCAE-2D models.

CONCLUSION AND DISCUSSION
The TPC tracking detectors examined in this work represent a modern particle accelerator detector that produces a large volume of data at an extreme rate (1 TB/s to 1 PB/s). Unlike common simulation data from hydrodynamics, climate science, and cosmology, TPC data are zero-suppressed and sparse, presenting a unique challenge to real-time data compression algorithms. We present two BCAE variants: BCAE++ and BCAE-2D. Compared to the BCAE, BCAE++ improves the compression ratio by 15% and reconstruction accuracy by 77% measured in MAE. The BCAE-2D model treats the radial dimension of TPC data as a channel dimension and achieves a 3x speedup during inference compared to BCAE++.
Based on this initial effort, there are several research directions worthy of future pursuit. For example, to further optimize the neural network throughput performance, we want to incorporate network pruning, quantization, and sparse CNN techniques. We also seek to extend our throughput comparison to include results of GPU-accelerated conventional lossy compression methods, such as MGARD-GPU and cuSZ [19]. Finally, we anticipate this work can attract additional research interest in particle detector data compression by the scientific data reduction community.

Figure 2: Example of a TPC wedge. The $z$ axis is in the horizontal direction. The $x$ and $y$ axes are on the plane spanned by the radial and azimuthal directions.

Figure 6: Panels A-C: Throughput in half- and full-precision modes on a single NVIDIA RTX A6000 GPU. Panel D: Diagnosing the lack of speedup when changing from full-precision to half-precision in the BCAE-HT model, which is due to small kernel sizes and the lack of Tensor Core activity. Panel E: Throughput in half-precision of BCAE-2D with 3, 4, 5, 6, and 7 encoder blocks and 3 downsampling layers. The encoder size is measured in the number of parameters.

Figure 7: Reconstruction accuracy of BCAE-2D models with varying encoder and decoder depths.

Table 1: Performance, encoder model size, and throughput comparison. The encoder size is measured in the number of trainable parameters. The reconstruction accuracy metrics and throughput are all measured in half-precision mode. The best performance with respect to each metric is underlined.

Table 2: Reconstruction accuracy in full- and half-precision computation modes.