Soft Delivery: Survey on a New Paradigm for Wireless and Mobile Multimedia Streaming

The increasing demand for video streaming services is the key driver of modern wireless and mobile communications. Although many studies have designed digital-based delivery schemes to send video content over wireless and mobile networks, significant quality degradation, known as the cliff and leveling effects, often occurs owing to fluctuating channel characteristics. In this article, we present a comprehensive summary of soft delivery, which is a new paradigm for wireless and mobile video streaming, and discuss its future directions. Existing studies found that introducing the multi-dimensional cosine transform, human visual system models, and graph signal processing can make soft delivery schemes more effective than digital-based delivery schemes in untethered immersive experiences, including virtual reality and volumetric media. In addition, this study finds that soft delivery has the potential to become a new standard for delivering deep neural network models and tactile information over wireless and mobile networks.


I. INTRODUCTION
Video streaming over wireless and mobile networks is one of the major applications in wireless environments; according to the Cisco Visual Networking Index studies, approximately four-fifths (82%) of the world's mobile data traffic will be video content by 2022 [1]. The explosive growth of data traffic, especially video traffic, poses a huge challenge to wireless and mobile networks. In recent years, immersive content, such as virtual reality (VR), augmented reality (AR), and mixed reality (MR), has shown strong potential to become the next important application for networks. Together with the development of fifth generation (5G) technology and smart wearable devices, which are enabling technologies for all extended reality (XR) applications, such immersive applications are growing rapidly.

Takuya Fujihashi and Takashi Watanabe are with the Graduate School of Information Science and Technology, Osaka University, Suita, Osaka, 565-0871 JAPAN; e-mail: fujihashi.takuya@ist.osaka-u.ac.jp, watanabe@ist.osaka-u.ac.jp.
In general, wireless video streaming systems transmit image and video signals to a single user or to multiple users with different channel characteristics. For high-quality video streaming applications, the main challenge is to fully utilize each user's channel capacity and to provide each user with the best video quality possible under his/her channel conditions, thereby improving the quality of experience (QoE). To address this challenge, conventional streaming systems, which consist of video coding and transmission technologies, have been built on digital-based solutions. In the video coding part, H.265/High-Efficiency Video Coding (HEVC) [2], which has been standardized by the Joint Collaborative Team on Video Coding (JCT-VC), can be used to encode VR/360-degree videos. As the successor of HEVC, the video coding standard named H.266/Versatile Video Coding (VVC) has been developed by the Joint Video Experts Team (JVET); VVC takes camera-view video, high dynamic range video, and VR/360-degree video into account. In addition, video- and geometry-based point cloud coding [3], [4] have been standardized by the Moving Picture Experts Group (MPEG) for volumetric video encoding and decoding.

In the video transmission part, the source bits are channel-coded with time interleaving to provide robustness against a certain level of channel errors. The channel-coded bits are then mapped onto transmit data symbols according to a modulation scheme, such as binary phase shift keying (BPSK), quadrature phase shift keying (QPSK), or quadrature amplitude modulation (QAM). To choose appropriate source and channel coding rates for the user's channel condition, the channel statistics generally need to be known at the time of source and channel coding. Once both the source and channel coding processes are completed, the conventional systems work optimally only for a specific channel condition.
If the observed channel quality, i.e., the channel signal-to-noise ratio (SNR), falls below a threshold, the decoding process tends to break down completely. This phenomenon is called the cliff effect [5]. In contrast, if the observed channel quality increases beyond the threshold, it does not improve the performance unless an adaptive rate control of the source and channel coding is performed in real-time according to the rapid fading channels. This phenomenon is known as the leveling effect.
Thus, accurate channel estimation and real-time rate control of the source and channel coding are desired for conventional streaming systems. However, the channel conditions of wireless and mobile networks may vary drastically and unpredictably, resulting in imperfect channel estimation and rate control owing to this variation. Conventional streaming systems tend to utilize the channel capacity conservatively to prevent cliff and leveling effects, taking into account the fact that rate control may be inaccurate.
A new paradigm of wireless video delivery, namely, soft delivery [6]-[8], has been proposed to fully utilize the instantaneous channel capacity without cliff and leveling effects. It is essentially a scheme with "lossless compression and lossy transmission." The compression stage is solely a transform to decorrelate the image and video signals into frequency domain coefficients, leaving out the conventional quantization and entropy coding. The transmission stage skips digital-based channel coding. Instead, it scales each transform coefficient individually and modulates it directly to a dense constellation for transmission. Here, the scaling operation serves the purposes of both power allocation and unequal signal protection against channel noises and fading effects to maximize the reconstruction quality. At the receiver end, the image and video signals are reconstructed by demodulating the received signals and inverting the scaling and transform operations. The soft delivery scheme was shown to not only provide a graceful performance transition in a wide channel SNR range but also achieve competitive performance compared with the conventional digital-based delivery schemes.

A. Contributions of the Study
This work entails a comprehensive survey of soft delivery schemes, including an overview of existing techniques, extensions for immersive content, and future research directions. Some existing studies have focused on soft delivery schemes, which are shown in Fig. 1, with a brief description of the related topics and key techniques. To the best of our knowledge, this survey is the first to introduce methodologies and approaches for soft delivery to transmit high-quality image and video signals via unstable and diverse wireless and mobile channel environments. The main contributions of this study are summarized as follows:
• An overview of the conventional digital-based and soft delivery schemes, as well as the benefits of the latter, is presented.
• The existing soft delivery techniques, such as energy compaction, power allocation, bandwidth utilization, and packet loss resilience, are surveyed. In this context, the abstraction and key contributions of these techniques are reviewed and summarized.
• The extensions needed for immersive video streaming using soft delivery are classified, including free-viewpoint video, 360-degree video, and point cloud.
• We finally review the future research directions related to soft delivery.

B. Survey Structure
The remainder of this paper is organized as follows:
• Section II describes an overview of conventional digital-based delivery schemes and their issues, such as the cliff, leveling, and staircase effects.
• Section III presents the basic principles of the pioneering work on soft delivery to solve the aforementioned effects in wireless and mobile video streaming applications.
• Section IV presents a review of the existing techniques on soft delivery. We classify these techniques into energy compaction, power allocation, bandwidth utilization, packet loss resilience, overhead reduction, and receiver heterogeneity, and discuss their implementation, as well as their contributions.
• Section V lists the existing techniques vis-à-vis soft delivery for immersive contents, i.e., future multimedia applications.
• Section VI suggests the future directions of the soft delivery approach based on the overall trends observed from the survey results.
• Section VII concludes the paper.

A. Overview
One of the major issues in wireless video delivery is sending high-quality videos within the considerably limited capacity of wireless links. For this purpose, conventional video delivery schemes [9]-[11] apply standardized digital video compression to the video frames, as shown in Fig. 2 (a), to remove redundancy among video frames. In particular, the H.264/Advanced Video Coding (AVC) [12], H.265/HEVC, and H.266/VVC standards are typical video coding standards for generating a compressed bitstream from video frames. In such standard video encoders, the video frames are classified into I-, P-, and B-frames. The I-frame is a reference frame encoded independently from other video frames. In each I-frame, pixel values are divided into blocks and transformed into frequency domain coefficients using the discrete cosine transform (DCT) or discrete sine transform (DST), and the coefficients are then non-uniformly quantized according to a quantization parameter (QP). A larger QP indicates a larger quantization step, leading to a smaller bit rate. Finally, the quantized coefficients are compressed by an entropy coder, which removes statistical redundancy in the coefficients. Variable length coding (VLC) is widely deployed for entropy coding because of its efficiency and simplicity. The P-frame obtains residuals between the previous reference frame and the current video frame using block-based motion estimation and then encodes the residuals using the aforementioned procedures. The B-frame likewise obtains and encodes residuals between the current video frame and both past and future reference video frames using motion estimation. The P-frame and B-frame generate less traffic than the I-frame.
After digital video compression, the bitstream is passed to the wireless transmission part in sequence. Channel coding is first applied to the bitstream to protect it against channel errors. For example, binary convolutional codes and low-density parity-check (LDPC) codes are widely used for forward error correction in Wi-Fi systems. The coded bitstream is then mapped onto in-phase and quadrature (I and Q) components using digital modulation formats such as QPSK and M-ary QAM. In both wireless and mobile networks, each combination of a modulation format and a channel coding rate, for example, 1/2 or 3/4, is defined as a modulation and coding scheme (MCS). According to the measured wireless channel SNR, the sender adapts its MCS value to maximize the link data rate. At the receiver end, bit errors may occur in the channel-coded bits owing to effective noise and/or fading effects. The receiver then tries to reconstruct the video frames from the received bits using the inverse procedures, i.e., demodulation, channel decoding, and video decoding.
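The SNR-driven MCS adaptation described above can be sketched as a table lookup. The following is a minimal illustration only: the SNR thresholds in `MCS_TABLE` and the function name `select_mcs` are hypothetical and are not taken from any wireless standard or from the surveyed works.

```python
# Hypothetical MCS table: (minimum SNR in dB, modulation, bits per symbol, code rate).
# The thresholds are illustrative assumptions, not values from a standard.
MCS_TABLE = [
    (4.0, "BPSK", 1, 1 / 2),
    (7.0, "QPSK", 2, 1 / 2),
    (10.0, "QPSK", 2, 3 / 4),
    (14.0, "16-QAM", 4, 1 / 2),
    (18.0, "16-QAM", 4, 3 / 4),
]


def select_mcs(snr_db):
    """Return the highest-rate entry whose SNR threshold is met, or None."""
    best = None
    for entry in MCS_TABLE:
        min_snr, _, bits, rate = entry
        # Spectral efficiency of an entry is bits per symbol times code rate.
        if snr_db >= min_snr and (best is None or bits * rate > best[2] * best[3]):
            best = entry
    return best


if __name__ == "__main__":
    for snr in (3.0, 11.0, 15.0, 25.0):
        print(snr, select_mcs(snr))
```

In this toy model, no entry is usable below the lowest threshold, and nothing better than the last entry is available above the highest one, which loosely mirrors the threshold behavior of digital schemes discussed in the next subsection.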

B. Critical Issues on Quality
If the measured wireless channel quality is stable during video transmission, conventional digital-based schemes can provide high-quality video frames for users. However, the channel quality of each user fluctuates over time owing to a combination of user mobility, multipath propagation, and shadowing by obstacles.
1) Cliff Effect: In a compressed bitstream, a single bit error can cause the loss of entire data [13]. As mentioned earlier, channel coding schemes are adopted to correct burst and random bit errors. However, they generally exhibit an all-or-nothing behavior for error correction. When the instantaneous channel quality, i.e., the SNR, falls below a certain threshold, the errors that occur in the bitstream during wireless communications will disable video decoding. A collapsed signal reconstruction causes a cliff effect. The cliff effect is a phenomenon whereby the quality of the received information abruptly drops as soon as the channel quality falls below the threshold, as shown in Fig. 3 (a). For example, the video quality of the BPSK modulation format with 1/2-rate channel coding drops abruptly once the wireless channel SNR falls below 4 dB. In modern network environments (e.g., content delivery, mobile, and wireless networks), the cliff effect becomes a major impediment when video frames are transmitted over diverse channel conditions to heterogeneous users. In this case, users whose channel quality is below the critical point receive unwatchable video frames. Some solutions, such as hybrid automatic repeat request and rateless coding schemes [14]-[18], have addressed the cliff effect associated exclusively with channel coding. They adapt the number of transmissions to changing channel conditions for error prevention. However, these schemes are not well suited for streaming to multiple users under diverse channel conditions. In addition, they do not reduce the quantization error at the video encoder end; thus, the leveling effect still occurs in video quality.
2) Leveling Effect: Once the channel quality surpasses the threshold, the video quality remains constant, as shown in Fig. 3 (a). As mentioned earlier, the cliff effect occurs when the receiver SNR is below 4 dB in the BPSK modulation format with 1/2-rate channel coding, whereas above a wireless channel SNR of 4 dB, the channel gain is not reflected in the video quality. Digital-based schemes determine the parameters of the video coding and wireless transmission parts based on the channel estimation. If the instantaneous channel quality is better than the estimated one, no additional gain can be obtained because the distortion introduced by video coding cannot be recovered at each user.
3) Staircase Effect: To mitigate the cliff and leveling effects, some layered coding schemes, which combine scalable video coding (SVC) [19] with hierarchical modulation (HM) [20], have been proposed for wireless and mobile video streaming [21], [22]. These layered coding schemes encode video frames into one base layer (BL) and several enhancement layers (ELs). The BL is used to ensure that all the users in the target channel SNR range can receive the baseline quality of video frames, whereas the ELs are used to enhance the video quality of the users in high-channel SNRs. Each SVC layer is then mapped onto the corresponding HM layer. Notably, HM provides unequal error protection to the transmitted video frames according to their relative importance. However, SVC with HM cannot completely remove the cliff effect; it only divides one big cliff into multiple stairs according to the number of layers, as shown in Fig. 3 (b). In addition, because the transmission power assigned to each layer is lower than that of the single-layer coding schemes, the cliff shifts to higher wireless channel SNRs.

A. Overview
To prevent the cliff, leveling, and staircase effects in wireless video delivery, a pioneering soft delivery scheme, namely, SoftCast, was proposed in [6]-[8]. The block diagram of SoftCast is illustrated in Fig. 2 (b). SoftCast's design is based on a simple principle that ensures that the transmitted signal samples are linearly related to the original pixel values. This principle naturally enables a sender to satisfy multiple receivers with diverse channel qualities, as well as a single receiver, where different packets experience different channel qualities.
The sender first takes a group of pictures (GoP) and uses a full-frame 3D-DCT as the decorrelation transform. The DCT frames are then divided into N small rectangular blocks of transformed coefficients called chunks. The coefficients in each chunk are then scaled to match the transmission power constraints. Specifically, the scaling coefficients are chosen to minimize the reconstruction mean square error (MSE). A Walsh-Hadamard transform (WHT) is then applied to the scaled chunks for power normalization across the chunks to provide packet loss resilience. This process transforms the chunks into slices. Each slice is a linear combination of all scaled chunks. Finally, the coefficients in the slices are directly mapped to the I and Q components in a pseudo-analog manner for transmission. Here, channel coding operations are skipped for the coefficients.
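The chunk-to-slice step can be illustrated with a small orthonormal Walsh-Hadamard transform. This is a sketch under stated assumptions rather than SoftCast's actual implementation: the number of chunks is assumed to be a power of two, the chunks are toy lists of already-scaled values, and the helper names (`hadamard`, `chunks_to_slices`, `energy`) are hypothetical.

```python
import math


def hadamard(n):
    """Sylvester-construction Hadamard matrix of order n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Build [[H, H], [H, -H]] from the current matrix H.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def chunks_to_slices(chunks):
    """Mix N equally sized chunks into N slices with an orthonormal WHT."""
    n = len(chunks)
    h = hadamard(n)
    norm = 1.0 / math.sqrt(n)
    length = len(chunks[0])
    return [[norm * sum(h[r][c] * chunks[c][k] for c in range(n))
             for k in range(length)] for r in range(n)]


def energy(block):
    """Sum of squared values in a chunk or slice."""
    return sum(v * v for v in block)


if __name__ == "__main__":
    chunks = [[8.0, 6.0], [1.0, -1.0], [0.5, 0.2], [0.0, 0.1]]
    slices = chunks_to_slices(chunks)
    print([round(energy(c), 3) for c in chunks])
    print([round(energy(s), 3) for s in slices])
```

Because the normalized Sylvester matrix is symmetric and orthonormal, applying `chunks_to_slices` to the slices returns the original chunks, and the total energy is preserved. Every slice depends on every chunk, so the loss of one slice is spread over all chunks instead of erasing one of them.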
Fig. 4 (a) through (f) shows the conventional digital-based modulation, i.e., 16-QAM, and the pseudo-analog modulation proposed in SoftCast. Conventional modulation maps channel-coded bits to real-valued digital samples that are transmitted over the channel. For example, 16-QAM modulation takes sequences of four bits and maps each sequence to a complex I/Q number, as shown in Fig. 4 (a). After modulation, the wireless physical layer (PHY) of the sender transmits the mapped complex numbers to the receiver. In this example, we consider a transmitted signal sample of 1010. Because of the broadcast nature of the wireless medium, multiple receivers hear the transmitted samples but with different noise levels. For example, in Figs. 4 (b) and (c), a receiver with a high channel SNR can distinguish which of the 16 small squares the original signal sample belongs to, and hence, correctly decode the transmitted sample. A receiver with a low channel SNR can distinguish only the quadrant of the transmitted sample, and hence, can decode only two of the four bits of the transmitted sample. In this case, the resulting bit errors may cause a collapsed signal reconstruction during digital video decoding.
In contrast to the existing modulation design, SoftCast outputs real-valued DCT coefficients that have already been scaled for error protection. The pseudo-analog modulation directly maps pairs of the scaled DCT coefficients to the I and Q components of the digital signal samples, as shown in Fig. 4 (d). As mentioned earlier, multiple receivers hear the transmitted samples under different channel SNRs. Although the transmitted samples are distorted according to each receiver's SNR, the receiver regards the received samples as scaled DCT coefficients. This process avoids the cliff, leveling, and staircase effects because the sender does not need to estimate the channel condition, and the noise level in the received samples faithfully reflects the instantaneous channel condition [23]. Consequently, pseudo-analog modulation ensures that the received video quality is proportional to the instantaneous channel quality, as shown in Fig. 3 (c).
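The pairing of scaled coefficients onto I and Q components can be sketched as follows. This is a simplified illustration, assuming an ideal channel that only adds noise to each complex sample; the function names `to_iq` and `from_iq` are hypothetical and the zero padding for an odd number of coefficients is an assumption.

```python
def to_iq(coeffs):
    """Map consecutive coefficient pairs to complex symbols (I = first, Q = second)."""
    values = list(coeffs)
    if len(values) % 2:
        values.append(0.0)  # assumption: pad an odd-length input with one zero
    return [complex(values[k], values[k + 1]) for k in range(0, len(values), 2)]


def from_iq(symbols, count):
    """Read the first `count` coefficients back from the received complex symbols."""
    flat = []
    for z in symbols:
        flat.extend((z.real, z.imag))
    return flat[:count]


if __name__ == "__main__":
    coeffs = [3.0, -1.5, 0.25, 4.0, 2.0]
    symbols = to_iq(coeffs)
    # A fixed offset stands in for channel noise on every symbol.
    noisy = [z + complex(0.1, -0.2) for z in symbols]
    print(from_iq(noisy, len(coeffs)))
```

Because the mapping is linear, the error in each recovered coefficient equals the noise added to its I or Q component, so the distortion grows smoothly with the noise level rather than through bit errors.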
In parallel, SoftCast sends a small amount of data, referred to as metadata, for signal reconstruction. These metadata consist of the mean and variance of each transmitted chunk as well as a bitmap. The mean of each chunk is subtracted from all values in that chunk so that the chunk approximately follows a zero-mean distribution [24]. The variance of each chunk is used to find the per-chunk scaling factors such that the reconstruction error is minimized. The bitmap indicates the positions of the discarded chunks in the GoP. When the available channel bandwidth for SoftCast is less than the required bandwidth, SoftCast discards chunks with less energy. Specifically, when the available and required bandwidths for SoftCast are M chunks and N (> M) chunks, respectively, SoftCast discards the N − M chunks with the least energy to meet the bandwidth requirement. On the receiver side, these discarded chunks are replaced by null values. The discarded chunks are registered in a bitmap, which is then compressed using run-length encoding. Metadata are strongly protected and transmitted in a robust way (e.g., BPSK modulation format with a low-rate channel code) to ensure correct delivery and decoding.
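The chunk discarding and bitmap compression described above can be sketched in a few lines. This is a minimal illustration, assuming that chunk energy is represented by the chunk variance and that the bitmap marks transmitted chunks with 1; the function names and the (bit, run length) format are hypothetical, not SoftCast's actual metadata layout.

```python
def select_chunks(variances, budget):
    """Keep the `budget` highest-energy chunks; return a 0/1 bitmap (1 = sent)."""
    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    kept = set(order[:budget])
    return [1 if i in kept else 0 for i in range(len(variances))]


def run_length_encode(bitmap):
    """Encode a bitmap as (bit, run length) pairs."""
    runs = []
    for bit in bitmap:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]


def run_length_decode(runs):
    """Expand (bit, run length) pairs back into the bitmap."""
    return [bit for bit, count in runs for _ in range(count)]


if __name__ == "__main__":
    variances = [90.0, 40.0, 0.5, 12.0, 0.1, 0.2, 3.0, 0.05]  # N = 8 chunks
    bitmap = select_chunks(variances, budget=4)                # M = 4 chunks fit
    print(bitmap, run_length_encode(bitmap))
```

With N = 8 and M = 4, the N − M = 4 lowest-variance chunks are dropped, and the receiver can restore their positions from the decoded bitmap before filling them with null values.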
At the receiver side, a minimum MSE (MMSE) decoder is used to estimate the content of the chunks corrupted by channel noise. The MMSE decoder provides a high-quality estimate of the DCT coefficients by leveraging the knowledge of the statistics of the DCT coefficients, i.e., chunk variance, as well as the statistics of the channel noise. Using the metadata, the denoised chunks are properly reassembled and undergo an inverse 3D-DCT, thereby providing the corresponding GoP.

B. Details of Scaling and Inverse Scaling Operations
In contrast to digital-based delivery schemes, soft delivery schemes use power allocation and MMSE filters, i.e., scaling and inverse scaling operations, on analog-modulated symbols to realize error protection over wireless channels. Figure 5 illustrates the procedure for obtaining the reconstructed DCT coefficients at the receiver end. SoftCast realizes chunk-wise power allocation and filter operations according to the statistics of the chunks and the channel conditions. Let $x_i$ denote the $i$th analog-modulated symbol. Each analog-modulated symbol is scaled by $g_i$ for noise reduction:

$$x_i = g_i \cdot s_i$$

Here, $s_i$ is the $i$th DCT coefficient, and $g_i$ is the scale factor for the coefficient power allocation. The sender performs optimal power control for $g_i$ to achieve the highest video quality. Specifically, the best $g_i$ is obtained by minimizing the MSE under the power constraint with the total power budget $P$ as follows:

$$\min_{\{g_i\}} \; \mathrm{MSE} = \mathbb{E}\left[(s_i - \hat{s}_i)^2\right] = \sum_{i=1}^{N} \frac{\sigma^2 \lambda_i}{g_i^2 \lambda_i + \sigma^2} \quad \text{s.t.} \quad \sum_{i=1}^{N} g_i^2 \lambda_i = P$$

where $\mathbb{E}[\cdot]$ denotes the expectation, $\hat{s}_i$ is the receiver's estimate of the $i$th DCT coefficient, $\lambda_i$ is the power of the $i$th DCT coefficient, $N$ is the number of DCT coefficients, and $\sigma^2$ is the receiver noise variance. The near-optimal solution is expressed as:

$$g_i = \lambda_i^{-1/4} \sqrt{\frac{P}{\sum_{j=1}^{N} \sqrt{\lambda_j}}}$$

After transmission over the wireless channel, each symbol at the receiver end can be modeled as follows:

$$y_i = x_i + n_i$$

where $y_i$ is the $i$th received symbol and $n_i$ is an effective noise with a variance of $\sigma^2$. The receiver extracts the DCT coefficients from the I and Q components and reconstructs the coefficients using the MMSE filter [6] as follows:

$$\hat{s}_i = \frac{g_i \lambda_i}{g_i^2 \lambda_i + \sigma^2} \cdot y_i$$

The receiver then obtains the corresponding video sequence by applying the inverse 3D-DCT to the filter output $\hat{s}_i$.
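The scaling and inverse scaling operations can be checked numerically with a short sketch. It assumes the commonly reported SoftCast closed forms, namely a scale factor proportional to $\lambda_i^{-1/4}$ under a total power budget $P$ and a linear MMSE filter with gain $g_i \lambda_i / (g_i^2 \lambda_i + \sigma^2)$; the function names and the toy values are hypothetical.

```python
import math


def scaling_factors(lam, total_power):
    """g_i = lam_i^(-1/4) * sqrt(P / sum_j sqrt(lam_j)) for positive coefficient powers."""
    denom = sum(math.sqrt(l) for l in lam)
    return [l ** -0.25 * math.sqrt(total_power / denom) for l in lam]


def mmse_estimate(y, g, lam, noise_var):
    """Linear MMSE estimate of s from y = g*s + n with Var(s) = lam, Var(n) = noise_var."""
    return g * lam / (g * g * lam + noise_var) * y


def expected_mse(g, lam, noise_var):
    """Sum over coefficients of noise_var*lam_i / (g_i^2*lam_i + noise_var)."""
    return sum(noise_var * l / (gi * gi * l + noise_var) for gi, l in zip(g, lam))


if __name__ == "__main__":
    lam = [100.0, 16.0, 1.0]   # toy coefficient powers
    power = 3.0                # total power budget P
    g = scaling_factors(lam, power)
    uniform = [math.sqrt(power / sum(lam))] * len(lam)  # same scale for every coefficient
    print(expected_mse(g, lam, 0.01), expected_mse(uniform, lam, 0.01))
```

Both allocations spend the same total power, yet in this toy setting the $\lambda_i^{-1/4}$ allocation yields a lower expected MSE than uniform scaling because it boosts low-energy coefficients relative to high-energy ones.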

IV. TECHNICAL SOLUTIONS FOR SOFT DELIVERY
Because SoftCast skips nonlinear digital-based encoding and decoding operations corresponding to motion estimation, quantization, and entropy coding, it realizes a linear quality improvement associated with channel quality improvement. In particular, SoftCast has shown outstanding performance compared with the conventional digital-based delivery schemes when receivers are highly diverse and/or the channel condition of each receiver varies drastically. However, SoftCast's design is simplistic, so there remains much scope for improvement in adopting soft delivery in practical scenarios, including stable channel conditions and band-limited and/or error-prone environments. For this purpose, many studies have been conducted to improve the performance of soft delivery. The existing works on soft delivery schemes can be classified into seven types, as shown in Fig. 1: energy compaction, optimal scaling, bandwidth utilization, resilience to packet loss, overhead reduction, multi-user heterogeneity, and hardware implementation.

A. Energy Compaction of Source Signals
In soft delivery schemes via linear mapping (from source signals to channel signals), the reconstruction quality greatly depends on the performance of the energy compaction technique for the source signals. Specifically, the study in [52] clarified that the performance of soft delivery schemes degrades as the ratio of the maximum energy to the minimum energy of the source components increases. To yield better quality under both stable and unstable channel conditions, existing studies have adopted the different energy compaction techniques listed in Table II for the source signals.
Typical solutions adopt wavelet-based signal decorrelation methods. Specifically, some studies [26]-[31] have adopted a motion-compensated temporal filter (MCTF), which is a temporal wavelet transform method, to remove inter-frame redundancy by realizing motion compensation in soft delivery. The MCTF recursively decomposes video frames into low- and high-frequency frames according to a predefined level. For example, WaveCast [29] adopted a 3D-discrete wavelet transform (DWT), i.e., the integration of 2D-DWT and MCTF, to remove temporal and spatial redundancy. Although SoftCast exploits a full-frame 3D-DCT to remove the intra- and inter-frame redundancy for energy compaction, WaveCast can further improve the reconstruction quality by fully exploiting the inter-frame redundancy using motion compensation. A detailed discussion on the effects of other decorrelation methods is presented in [25], [53]. The study in [54] also utilized inter-frame redundancy by designing an adaptive GoP size mechanism. It adaptively controlled the GoP size based on shot changes and the spatio-temporal characteristics of the video frames and applied a full-frame 3D-DCT for energy compaction across the video frames in one GoP.
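The recursive temporal decomposition can be illustrated with a plain Haar filter along the time axis. This sketch deliberately omits motion compensation, so it is not an MCTF as used in WaveCast; it only shows how frames are split into low- and high-frequency frames over a predefined number of levels. Frames are toy lists of pixel values and the function name `haar_temporal` is hypothetical.

```python
import math


def haar_temporal(frames, levels):
    """Recursively split frames into temporal low- and high-pass frames (no motion compensation)."""
    highs = []
    low = [list(f) for f in frames]
    s = 1.0 / math.sqrt(2.0)
    for _ in range(levels):
        if len(low) < 2 or len(low) % 2:
            break  # need an even number of frames to form pairs
        pairs = [(low[i], low[i + 1]) for i in range(0, len(low), 2)]
        # High-pass frame: scaled difference of each frame pair.
        highs.append([[s * (b - a) for a, b in zip(f0, f1)] for f0, f1 in pairs])
        # Low-pass frame: scaled sum of each frame pair, decomposed again at the next level.
        low = [[s * (a + b) for a, b in zip(f0, f1)] for f0, f1 in pairs]
    return low, highs


if __name__ == "__main__":
    static_scene = [[10.0, 20.0]] * 4  # four identical two-pixel frames
    low, highs = haar_temporal(static_scene, levels=2)
    print(low, highs)
```

For a static scene, all high-frequency frames are zero and the energy concentrates in a single low-frequency frame, which is the kind of energy compaction that benefits soft delivery. Motion compensation extends this behavior to scenes with movement by aligning frames before filtering.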
Another typical solution is to send large-energy coefficients as metadata, and thus avoid transmitting such coefficients using pseudo-analog modulation. The study in [32] designed Advanced SoftCast (ASoftCast) to send low-frequency coefficients as the metadata. ASoftCast decomposed the original images into frequency components using 2D-DWT; the frequency components were then divided into two parts: the lowest-frequency sub-band and the other sub-bands. The wavelet coefficients in the lowest-frequency sub-band are processed by run-length coding; they are then channel-coded and digitally modulated for additional metadata transmissions. The optimized power allocation for the SoftCast scheme in [33] selected and sent high-energy coefficients as the metadata to reduce the energy of the analog-modulated symbols. As a result, a higher transmission power can be assigned to the low-energy coefficients to improve the received quality. Here, determining the high-energy coefficients for each GoP is computationally complex owing to the use of an exhaustive search. To reduce the computational complexity, [55] adopted a zigzag scan to select the side information. Other studies in [34]-[39] divided the video into a BL and ELs, which were coded and sent in digital and pseudo-analog ways, respectively. For example, the base layer in gradient-based image SoftCast (G-Cast) [38] sent the DC and low-frequency coefficients of the image, while the enhancement layer extracted and sent an image gradient, which represents the edge portion of the image, using a gradient transform. The receiver then created a final estimate of the image via a gradient-based reconstruction (GBR) procedure, utilizing both the image gradient from the enhancement layer and the low-frequency coefficients provided by the base layer.
Other solutions adopted a nonlinear encoder and decoder for the source signals to decrease the ratio of the maximum to the minimum energy of the analog-modulated symbols. The typical solution is to introduce coset coding [56], [57], which is a typical technique in distributed source coding, into soft delivery. Coset coding partitions the set of possible source values into several cosets and transmits the coset residual codes to the receiver. With the received coset codes and the predictor, the receiver can recover the source value by choosing the coset member closest to the predictor. In DCast, the coset residual code $l_i$ of the $i$th DCT coefficient $s_i$ is obtained with a coset step $q$ as follows:

$$l_i = s_i - \left\lfloor \frac{s_i}{q} + \frac{1}{2} \right\rfloor q$$

where $\left\lfloor \frac{s_i}{q} + \frac{1}{2} \right\rfloor$ represents the coset index. At this time, the sender only needs to transmit the coset residual code for energy compaction. At the user side, with the received coset residual code $\hat{l}_i$ and the side information $\tilde{s}_i$ (i.e., the predicted DCT coefficient obtained from the reference video frame), the receiver reconstructs the DCT coefficients by coset decoding. Given the coset residual code $\hat{l}_i$, there are multiple possible reconstructions of $s_i$ that form a coset $\mathcal{C}$:

$$\mathcal{C} = \left\{ \hat{l}_i + kq \mid k \in \mathbb{Z} \right\}$$

DCast then selects the element in coset $\mathcal{C}$ that is nearest to the side information $\tilde{s}_i$ as the reconstruction of the DCT coefficient:

$$\hat{s}_i = \arg\min_{c \in \mathcal{C}} \left| c - \tilde{s}_i \right|$$

In this case, the value of each coset step $q$ is crucial for the coding performance of DCast. The value of $q$ is calculated by estimating the noise at the receiver end, as shown in [42], [43]. However, the reconstruction quality of DCast also depends on the side information quality. If the side information $\tilde{s}_i$ is rough, the receiver may make wrong decisions with a smaller $q$. The study in [58] introduced a side information refinement (SIR) algorithm [59] to refine the side information for the quality enhancement of DCast.
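Coset encoding and decoding can be sketched for a single coefficient as follows. The sketch assumes the standard modulo-style coset code, in which the residual is the coefficient minus the nearest multiple of the coset step q, and a noise-free delivery of the residual; the function names and numeric values are hypothetical.

```python
import math


def coset_encode(s, q):
    """Coset residual l = s - floor(s/q + 1/2) * q, which lies in [-q/2, q/2)."""
    return s - math.floor(s / q + 0.5) * q


def coset_decode(l_received, side_info, q):
    """Pick the member of the coset {l + k*q} that is nearest to the side information."""
    k = math.floor((side_info - l_received) / q + 0.5)
    return l_received + k * q


if __name__ == "__main__":
    s = 37.3                       # original DCT coefficient
    l = coset_encode(s, 8.0)       # only this small residual is transmitted
    print(l)
    print(coset_decode(l, 35.0, 8.0))   # accurate side information
    print(coset_decode(l, 30.0, 8.0))   # rough side information, small step
    print(coset_decode(coset_encode(s, 16.0), 30.0, 16.0))  # rough side information, larger step
```

Decoding succeeds when the side information lies within q/2 of the true coefficient. With the rough prediction of 30.0, the step q = 8 selects the wrong coset member, whereas q = 16 recovers the coefficient at the cost of a larger residual range, which illustrates the trade-off in choosing q.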
The concept of coset coding has been widely applied in other studies on soft delivery for the same purpose. For example, [44]-[48] utilized pseudo-coset coding for the lower-frequency components and sent the coset index using the digital framework. Here, the residuals in the lowest-frequency components and the other frequency components are sent using pseudo-analog modulation. The main difference between coset coding and pseudo-coset coding is that the latter sends the coset index as additional metadata. Layered coset coding and adaptive coset coding were applied to the soft delivery schemes in [47] and [48], respectively. LayerCast in [47] introduced layered coset coding to simultaneously accommodate heterogeneous users with diverse SNRs and bandwidths. The layered coset coding used large to small coset steps to obtain coarse to fine layers from each chunk. The coarse layer, i.e., the BL, is sufficient to reconstruct a low-quality DCT chunk for narrowband users, whereas each fine layer, i.e., an EL, provides refinement information of the DCT chunk for wideband users.
The studies in [49]-[51] utilized coset coding for cooperative soft delivery systems, i.e., a three-node relay network. A sender broadcasts the DCT coefficients obtained from the video frames using pseudo-analog modulation to the relay node and the destination node. If the channel quality between the sender and the destination node is higher than a threshold, the destination node reconstructs the video frames from the soft-delivered DCT coefficients. If the channel quality is lower than the threshold, the relay node sends the coset residual code to the destination node, and then the destination node reconstructs the video frames using the received coset residual code and the side information obtained from the soft-delivered DCT coefficients from the sender.

Table: Channel models and quality metrics considered in power allocation for soft delivery
Reference | Channel or system | Quality metric
SoftCast | AWGN | MSE
[60] | Fading | MSE
[61], [62] | OFDM | MSE
[63]-[65] | MIMO | MSE
[66]-[68] | MIMO-OFDM | MSE
[69] | Impulse noise | MSE
[70], [71] | NOMA | MSE
[72] | Underwater acoustic networks | MSE
[73] | UAV-enabled networks | MSE
[74] | mmWave lens MIMO | MSE
[75] | AWGN | SSIM
[76] | AWGN and MIMO | FWD
[77] | AWGN | EQMSE
[78] | AWGN | Foreground and background distortions

B. Channel-Aware and Perception-Aware Power Allocation
As mentioned in Section III-B, the power allocation in SoftCast minimizes the MSE between the original and reconstructed video signals over additive white Gaussian noise (AWGN) channels. There are two drawbacks in adopting SoftCast in practical scenarios: 1) practical wireless channels have more complex characteristics, e.g., fading caused by multipath and impulse noise, than AWGN channels, and 2) MSE is not an effective index for describing the perceptual fidelity of images/videos.
For the first drawback, the existing studies re-designed the power allocation for practical wireless channels, including fading [60], frequency-selective fading, i.e., orthogonal frequency-division multiplexing (OFDM) [61], [62], impulse noise [69], multiple-input and multiple-output (MIMO) [63]-[65], and MIMO-OFDM channels [66]-[68]. In [60], the authors designed an optimal power allocation for fading channels. In fading channels, a fading effect, i.e., multiplicative noise, will degrade the reconstruction quality. Although SoftCast assumes that multiplicative noise can be canceled with exact channel estimation at the receiver end, no algorithm can guarantee an error-free channel estimation. In addition to the power allocation design, the authors analyzed the effect of the channel estimation error on the reconstruction quality at the receiver end.
For frequency-selective fading channels, such as OFDM and MIMO-OFDM channels, the key issue is how to match the analog-modulated symbols to the independent subcarriers/subchannels for high-quality image/video reconstruction.In [66], [67], they observed similarities between the source and channel characteristics and exploited the similarities for subcarrier/subchannel matching.Specifically, Figs. 6 (a) and (b) show a similar spread between the highest-and lowestenergy source and channel components, respectively.It seems natural to match both sides so that high-energy DCT components are transmitted on high-gain subchannels to prevent them from acting against each other.ParCast [66] and the extended version of ParCast+ [67] assigned the more important DCT coefficients to higher gain channel components and allocated power weights for each DCT coefficient with joint consideration of the source and channel for video unicast systems.ECast in [68] extended the source and channel matching and power allocation for video multicast systems.For multicast systems, it is necessary to deal with the large overhead of channel feedback from multiple receivers.In ECast, multiple users simultaneously send tone signals for the channel feedback, and the sender receives the superposition of multiple tone signals.
Although the sender cannot distinguish each of the channel gains, the weighted harmonic means of channel gains can be obtained from the superposed tone signals; thus, ECast utilizes the channel gain for the source and channel matching and power allocation.
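The source and channel matching described above can be illustrated with a minimal sketch. It assumes that the number of chunks equals the number of subchannels and that a simple rank-based pairing is used; the function name and the example values are hypothetical, and the sketch is not the exact ParCast or ECast algorithm.

```python
import numpy as np

def match_chunks_to_subchannels(chunk_energy, channel_gain):
    """Return assignment[k] = index of the subchannel that carries chunk k.

    Rank-based pairing (an illustrative assumption): the chunk with the
    highest energy is sent on the subchannel with the highest gain, etc.
    """
    chunk_order = np.argsort(chunk_energy)[::-1]      # strongest chunk first
    channel_order = np.argsort(channel_gain)[::-1]    # best subchannel first
    assignment = np.empty(len(chunk_energy), dtype=int)
    assignment[chunk_order] = channel_order
    return assignment

# Hypothetical chunk energies and subchannel gains.
chunk_energy = np.array([4.0, 100.0, 0.5, 25.0])
channel_gain = np.array([0.9, 0.2, 1.5, 0.6])
assignment = match_chunks_to_subchannels(chunk_energy, channel_gain)
```

In this example, the highest-energy chunk (index 1) is mapped to the highest-gain subchannel (index 2), and the lowest-energy chunk is mapped to the weakest subchannel.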
Other studies solved power allocation problems in modern wireless systems, including non-orthogonal multiple access (NOMA) [70], [71], underwater acoustic OFDM [72], unmanned aerial vehicle (UAV)-enabled [73], and mmWave lens MIMO systems [74]. For example, in NOMA systems, source signals are coded into BL and ELs and then transmitted simultaneously through superposition coding (SC). With successive interference cancellation (SIC), near users with strong channel gains can decode both BL and EL signals, whereas far users with weak channel gains may only decode BL signals.
In the existing studies, both BL and ELs are analog-coded in [70], whereas BL and ELs are digital- and analog-coded, respectively, in [71]. These studies solved the power allocation across the BL and ELs to minimize the distortion for all receivers with heterogeneous channel conditions. In underwater acoustic OFDM [72] and mmWave lens MIMO systems [74], the error behavior differed substantially across channel components, and the channel characteristics showed a similar tendency, as depicted in Fig. 6 (b). These studies solved the source and channel matching and power allocation problems, which were also discussed for frequency-selective fading channels, to minimize the distortion at the receiver end.
For the second drawback, some studies [75]-[78] redesigned the power allocation with perceptual considerations, including structural similarity (SSIM) [75], foveation [76], and saliency [77]. In these studies, determining the perception-aware weights for each source component is challenging. Specifically, in SoftCast, the scaling factor for each coefficient is obtained from its power information to minimize the MSE.
These studies incorporated a perception-aware weight $w_i$ for the $i$th coefficient into the scaling factor to minimize the perceptual distortion. For this purpose, [75] demonstrated the relationship between the MSE in the DCT coefficients and the SSIM distortion to obtain the weight $w_i$ for the $i$th DCT coefficients of all the chunks. They found that the weight for the high-frequency coefficients was larger than that for the low-frequency coefficients, which was consistent with the characteristics of the human visual system (HVS). FoveaCast [76] introduced the foveation-based HVS [79] and the corresponding HVS-based visual perceptual quality metric, called foveated weighted distortion (FWD), as the optimization objective. For a given foveation point $(f_x, f_y)$ in the pixel and frequency domains, the error sensitivity for each pixel/frequency coefficient at location $(x, y)$ can be defined in the foveation-based HVS. FoveaCast regarded the error sensitivity in the DWT domains as the weight $w_i$ and performed foveation-aware power allocation. In [77], visual saliency maps were introduced for the perception-aware power allocation. Saliency maps represent the regions in an image that a user attends to when watching the image owing to the visual attention mechanism of the human brain. In this case, the weight $w_i$ for the $i$th pixel is based on the normalized visual saliency defined from any arbitrary visual saliency model, such as the Itti-Koch-Niebur model [80]. Based on the weight, the scheme allocates considerable transmission power to salient regions to minimize the eye-tracking weighted MSE (EQMSE).
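As an illustration of how a perception-aware weight can enter the scaling factor, the following sketch assumes a high-SNR approximation in which the distortion of the $i$th coefficient is proportional to $w_i / g_i^2$ and the power constraint is $\sum_i g_i^2 \lambda_i = P$. Under these assumptions, a Lagrangian argument gives $g_i \propto (w_i / \lambda_i)^{1/4}$. The function name and example values are hypothetical, and the exact expressions used in [75]-[77] may differ.

```python
import numpy as np

def weighted_scaling(lam, w, total_power):
    """Scaling factors g_i that minimize sum_i w_i / g_i**2
    subject to sum_i g_i**2 * lam_i = total_power (high-SNR assumption)."""
    lam = np.asarray(lam, dtype=float)
    w = np.asarray(w, dtype=float)
    return (w / lam) ** 0.25 * np.sqrt(total_power / np.sum(np.sqrt(w * lam)))

# Hypothetical coefficient powers and perceptual weights.
lam = np.array([400.0, 50.0, 5.0, 0.5])
w = np.array([1.0, 1.0, 2.0, 4.0])
P = 10.0
g = weighted_scaling(lam, w, P)

# With uniform weights the expression reduces to the MSE-oriented
# scaling g_i = lam_i**(-1/4) * sqrt(P / sum_j sqrt(lam_j)).
g_uniform = weighted_scaling(lam, np.ones_like(lam), P)
```

A larger weight raises the scaling factor of the corresponding coefficient at the expense of the others, while the total transmission power stays fixed.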

C. Bandwidth Utilization
The source bandwidth of soft delivery schemes depends on the number of analog-modulated symbols transmitted every second, i.e., the baud rate. In the aforementioned designs, the channel bandwidth is mainly considered sufficient to send all the non-zero analog-modulated symbols over the wireless medium. However, when the channel bandwidth is lower than the source bandwidth, some analog-modulated symbols are discarded at the receiver side. Here, the loss of the important coefficients, i.e., the low-frequency coefficients, may have a significant impact on the reconstruction quality. Specifically, the expected distortions in soft delivery schemes for single and multiple contents owing to the bandwidth constraint under the transmission power constraint are discussed in [81] and [82], [83], respectively. To meet the bandwidth constraint, the typical method is to selectively discard the chunks in higher frequency components to fit within the bandwidth [6], [84]. When the sender discards some chunks, the receiver regards all the coefficients in the discarded chunks as zeros. Because the sender needs to send the locations of the discarded chunks to the receiver, SoftCast sends the location information as a bitmap. Although SoftCast assumes equal-size chunks across low- to high-frequency components, [84] adopted smaller chunk sizes in high-frequency components to realize fine-grained control to meet the bandwidth limitation. Another study [85] used bandwidth-reducing Shannon-Kotelnikov (SK) mappings to increase the number of chunks transmitted over bandwidth-constrained channels. The SK mappings are typical $N$:1 bandwidth-reducing or 1:$M$ bandwidth-expanding non-linear mappings. In this study, 2:1 SK mappings are used to encode several pairs of chunks with less energy to send more chunks with medium energy within the channel bandwidth.

Fig. 7. Data-assisted soft delivery schemes [96]. They utilize the stored images and videos on the cloud to limit the number of transmission symbols.
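The chunk discarding with a bitmap described in this subsection can be sketched as follows. The sketch assumes that the chunks with the lowest variance, which are typically the high-frequency chunks, are the ones discarded; the function names, the chunk budget, and the example values are hypothetical.

```python
import numpy as np

def discard_chunks(chunk_var, budget):
    """Keep the `budget` highest-variance chunks and return a 0/1 bitmap
    that tells the receiver which chunks were transmitted."""
    chunk_var = np.asarray(chunk_var, dtype=float)
    keep = np.argsort(chunk_var)[::-1][:budget]
    bitmap = np.zeros(chunk_var.size, dtype=np.uint8)
    bitmap[keep] = 1
    return bitmap

def receiver_fill(received_chunks, bitmap, chunk_len):
    """Place the received chunks according to the bitmap; the coefficients
    of discarded chunks are regarded as zeros."""
    out = np.zeros((bitmap.size, chunk_len))
    out[bitmap.astype(bool)] = received_chunks
    return out

# Hypothetical example: four chunks, bandwidth for only two of them.
bitmap = discard_chunks([50.0, 0.1, 8.0, 0.02], budget=2)
restored = receiver_fill(np.ones((2, 4)), bitmap, chunk_len=4)
```

The bitmap costs one bit per chunk, which is why chunk-level discarding keeps the side information small.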
Other studies [86]-[93] introduced compressive sensing (CS) techniques [94], [95] for soft delivery over bandwidth-constrained wireless channels. Notably, CS is a sampling paradigm that allows the simultaneous measurement and compression of signals that are sparse or compressible in some domains. In general, recovering source signals from compressed signals is impossible because the system is underdetermined. However, if the source signals are sufficiently sparse in some domains, the CS theory indicates that the source signals can be reconstructed from the compressed signals by solving the $\ell_1$ minimization problem. The advantage of CS-based soft delivery is the recovery of chunks in high-frequency coefficients using CS-based signal reconstruction algorithms, such as approximate message passing (AMP) and iterative thresholding, even though the chunks are discarded at the sender's end. For high-quality reconstruction, adaptive rate control and reconstruction algorithms are mainly adopted for CS-based soft delivery. For instance, [88] adaptively controlled the compression rate based on visual attention, i.e., both the texture complexity and visual saliency, to satisfy the bandwidth constraint while maintaining better perceptual quality. The scheme in [92] adaptively selected reliable columns from the measurement matrix and compressed source signals using the selected columns. With regard to the reconstruction algorithm, [89] designed an adaptive transform for noisy measurement signals to obtain sparser transform coefficients for clean reconstruction. The studies in [90] and [91] designed grouping methods for measurement signals to utilize the similarity between video frames for the reconstruction.
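To make the CS reconstruction step concrete, the following is a minimal sketch of iterative soft thresholding (ISTA) for the $\ell_1$-regularized least-squares problem. It assumes a random Gaussian measurement matrix, a noiseless channel, and a signal that is sparse in the canonical basis; the regularization weight, iteration count, and signal sizes are hypothetical choices for illustration and are not taken from the cited schemes.

```python
import numpy as np

def ista(A, y, lam=0.05, n_iter=1000):
    """Iterative soft thresholding for
    min_x 0.5 * ||y - A x||_2^2 + lam * ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x + step * (A.T @ (y - A @ x))        # gradient step on data term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrinkage
    return x

rng = np.random.default_rng(0)
n, m = 128, 64                                    # signal length, measurements
x_true = np.zeros(n)
support = rng.choice(n, size=5, replace=False)    # 5-sparse signal
x_true[support] = np.array([5.0, -4.0, 3.0, -6.0, 4.0])
A = rng.normal(size=(m, n)) / np.sqrt(m)          # random measurement matrix
y = A @ x_true                                    # compressed measurements
x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Although only half as many measurements as unknowns are available, the sparse signal is recovered with a small relative error, which is the property that CS-based soft delivery relies on.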
Other studies utilized stored images/videos on the cloud to reduce the bandwidth requirement in soft delivery. Specifically, data-assisted communication of mobile image (DaC-Mobi) [97], data-assisted cloud radio access network (DaC-RAN) [98], and knowledge-enhanced mobile video broadcasting (KMV-Cast) schemes [96], [99], [100], which are referred to as data-assisted soft delivery schemes, have been proposed for high-quality image/video transmission. Fig. 7 presents an overview of the data-assisted soft delivery schemes. The main ideas of the data-assisted soft delivery schemes are that 1) the sender sends a limited number of analog-modulated symbols and 2) the receiver reconstructs images/videos using correlated images, i.e., side information, obtained from a cloud.
In DaC-Mobi [97], successive coset encoders were introduced to divide the DCT coefficients into three layers of bit planes: most significant bits (MSBs) in low-frequency coefficients, MSBs in other frequency coefficients and middle bits, and least significant bits (LSBs). Here, MSBs in low-frequency coefficients and LSBs were transmitted to the receiver in digital and pseudo-analog manners, respectively, whereas MSBs in other frequency coefficients and middle bits were discarded. Based on the received MSBs in the low-frequency coefficients, the receiver reconstructed a downsampled image to retrieve correlated images in the cloud. The retrieved correlated images were used as side information to resolve the ambiguity due to discarded bits and reconstruct the entire image. DaC-RAN [98] and its extended version, KMV-Cast [96], [99], [100], adopted Bayesian reconstruction algorithms that utilize correlated images/videos in the cloud as prior information to reduce the required bandwidth for soft delivery. The main difference between the DaC-RAN and KMV-Cast schemes is that the former assumes that the same images/videos exist in the cloud, whereas the latter does not require that the same images/videos exist at the receiver end because it broadcasts the prior knowledge in a digital manner.
The aforementioned studies considered the channel bandwidth to be lower than the source bandwidth. If the channel bandwidth is greater than the source bandwidth, the soft delivery schemes become less efficient. In this case, the soft delivery schemes utilize the extra bandwidth by retransmission. The studies in [101] and [102] designed analog channel coding to use the extra channel bandwidth for quality enhancement. For example, [102] proposed a chaotic function-based analog encoding [103] for soft delivery. The existing chaotic function-based analog coding is designed for uniformly distributed sources; when applied to Gaussian distributed sources, it significantly amplifies the source signals and thus consumes unnecessary transmission power. Therefore, they designed a chaotic map function for Gaussian distributed source signals to prevent power increments compared to the input power. MCast [104] also utilized extra bandwidth for quality improvement. As mentioned earlier, the sender can send the source data multiple times if extra bandwidth is available. In this case, the utilization of extra time slots for quality improvement is a key issue. To overcome this issue, MCast optimized the assignment of the chunks of the DCT coefficients to available channels in multiple time slots to fully exploit the time and frequency diversities.
In contrast to the aforementioned studies, [105], [106] dealt with bandwidth variations. When the available bandwidth is less than the expected bandwidth at the sender's end, some important chunks will not have the opportunity to be transmitted before the playback deadline. These studies grouped several chunks into a tile and sent the tiles with a large variance with high priority to dispatch important coefficients before the playback deadline.

D. Packet Loss Resilience
Even though the channel bandwidth is sufficient to send all the non-zero analog-modulated symbols, some analog-modulated symbols are lost before reaching the receiver owing to loss-prone wireless channels. Specifically, the packet loss owing to strong fading and interference may have a significant impact on the reconstruction quality if important chunks and coefficients are lost. SoftCast used the WHT to redistribute the energy of the source signals across whole packets for resilience against packet loss. However, each packet still contains a large amount of energy, and thus, degradation owing to packet losses remains considerable.
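The energy redistribution by the WHT can be illustrated with a small sketch. It assumes four chunks with very different energies, one chunk per row, and an orthonormal Hadamard matrix applied across the chunks so that each output row forms one packet; the sizes and standard deviations are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
stds = np.array([10.0, 3.0, 1.0, 0.3])            # hypothetical chunk scales
X = rng.normal(size=(4, 256)) * stds[:, None]     # one chunk per row
H = hadamard(4)
Y = H @ X                                         # one packet per row

energy_before = np.sum(X**2, axis=1)              # highly unequal
energy_after = np.sum(Y**2, axis=1)               # nearly equal per packet
```

The total energy is preserved because the transform is orthonormal, but it is spread almost evenly across packets, so that no single packet carries most of the information. As noted above, each packet still carries a large share of the energy, so a lost packet remains costly.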
To maintain better reconstruction quality in error-prone wireless channels, some related studies [107]-[109] have introduced CS techniques, i.e., block-wise CS [110], for packet loss resilience. The CS technique is suitable for wireless transmission with random packet loss owing to its random measurement. Random measurement treats all packets as equally important. In contrast to typical CS techniques, block-wise CS can reduce the storage and computational costs of the reconstruction. A pioneering work on packet loss resilience is the distributed compressed sensing-based multicast scheme (DCS-Cast) [107]. Figure 8 shows an overview of the sender and receiver operations in DCS-Cast. In DCS-Cast, each image is first divided into blocks, and the coefficients in each block are randomized using the same measurement matrix across the blocks. One coefficient from every block is placed in each packet to normalize the importance across packets. Even though some packets may be lost over loss-prone wireless channels, the receiver obtains noisy pixel values using the same measurement matrix as the sender and reconstructs the lost pixel values using the CS reconstruction algorithm in the DCT/DWT domains. Because the lost pixel values can be recovered by the reconstruction algorithm, DCS-Cast maintains high image/video quality in loss-prone channels. To further improve the reconstruction quality, multi-scale [108] and adaptive [109] block-wise CS algorithms have been adopted for soft delivery. The multi-scale block-wise CS algorithm [108] decomposes each video frame using a multi-level 2D-DWT and then optimizes the sampling rate for each DWT level according to its importance. In contrast, the adaptive block-wise CS algorithm [109] divides several video frames into one reference frame and subsequent non-reference frames and adaptively determines whether direct or predictive sampling should be used for each block in a non-reference frame. Direct sampling randomizes the signals in the block, whereas predictive sampling calculates the residuals between the blocks in the reference and non-reference frames and randomizes the residuals to utilize the inter-frame similarity for the reconstruction.

E. Overhead Reduction
In soft delivery schemes without chunk division, a sender needs to let the receiver know the power information of all the DCT coefficients to demodulate the signals. For the receiver to carry out the MMSE filtering in Eq. (6), the sender needs to transmit $\lambda_i$ of all coefficients without errors as metadata, which may constitute a large overhead. For example, when the sender transmits eight video frames with a resolution of 352 × 288, the sender needs to transmit metadata for all DCT coefficients, i.e., 352 × 288 × 8 = 811,008 variables in total, to the receiver. This overhead may induce performance degradation owing to the rate and power losses in the transmission of analog-modulated symbols.
To reduce the overhead, SoftCast divides the DCT coefficients into chunks and carries out chunk-wise power allocation using an MMSE filter.However, overheads are still high, and chunk division causes performance degradation due to improper power allocation.
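The overhead arithmetic above can be reproduced with a few lines. The frame size and frame count come from the example in the text, whereas the number of chunks per frame is a hypothetical value chosen only to show the order-of-magnitude difference.

```python
# Metadata count without chunk division: one power value per DCT coefficient.
width, height, frames = 352, 288, 8
per_coefficient = width * height * frames

# Metadata count with chunk division: one power value per chunk.
chunks_per_frame = 64            # hypothetical chunk count per frame
per_chunk = chunks_per_frame * frames

reduction_ratio = per_coefficient / per_chunk
```

Even with chunk division, the metadata must be delivered without errors, which is why the remaining overhead is still a concern.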
To achieve better quality under a low overhead requirement, the related studies can be classified into two types, as shown in Figs. 9 (a) and (b): 1) sender-side overhead reduction and 2) receiver-side overhead reduction. Studies on the sender-side overhead reduction [111]-[114] designed fitting functions to obtain the power information with fewer parameters. In this case, the sender and receiver share the same fitting function in advance, and the sender transmits only the parameters as metadata for overhead reduction. Specifically, [111] designed a fitting function with four parameters for each chunk, and [112] designed a log-linear function with two parameters for each chunk. Another study [113] found that equal-size chunk division was not suitable for chunk-wise fitting, and thus, an adaptive chunk division, i.e., L-shaped chunk division, was designed for accurate fitting. In addition, [114] exploited a Lorentzian fitting function with seven parameters based on a Gaussian Markov random field for each GoP.
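The idea of replacing per-coefficient power values with a few fitting parameters can be sketched as follows. The sketch assumes one plausible two-parameter log-linear form, $\log \lambda_k \approx a + b \log k$, fitted by least squares; this form, the function names, and the synthetic variances are assumptions for illustration and are not necessarily the function designed in [112].

```python
import numpy as np

def fit_log_linear(variances):
    """Fit log(variance_k) ~ a + b * log(k) for k = 1..K; return (a, b)."""
    k = np.arange(1, len(variances) + 1, dtype=float)
    slope, intercept = np.polyfit(np.log(k), np.log(variances), 1)
    return intercept, slope

def eval_log_linear(a, b, count):
    """Receiver side: regenerate the K variances from the two parameters."""
    k = np.arange(1, count + 1, dtype=float)
    return np.exp(a + b * np.log(k))

# Synthetic variances that follow a power law exactly (hypothetical data).
true_var = 200.0 * np.arange(1, 65, dtype=float) ** -1.5
a, b = fit_log_linear(true_var)          # sender transmits only (a, b)
approx = eval_log_linear(a, b, 64)       # receiver rebuilds 64 values
```

For real video, the fit is only approximate, and the resulting mismatch in the power information translates into a quality loss at the receiver.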
Studies on receiver-side overhead reduction [115], [116] estimate the power information only from the received signals without any additional computational cost at the sender side. The study in [115] is a pioneering work on estimating the power information from the received signals, and blind data detection (BDD) [116] was proposed to decode the received analog-modulated symbols without the power information at the receiver. Specifically, BDD uses a zero-forcing estimator and the sign of the received signals to approximate the source signals.
We note that both types of overhead reduction cause quality degradation owing to estimation errors.In [112], the effect of modeling accuracy on the reconstruction quality in soft delivery was analyzed.

F. Antenna and Resolution Heterogeneity
In SoftCast and later soft delivery schemes, the channel heterogeneity of the receivers can be handled by pseudo-analog modulation with linear coding and decoding operations. However, heterogeneity in other aspects still impairs each user's experience. To mitigate the impairment caused by the heterogeneity across multiple receivers, some studies extended soft delivery to deal with the heterogeneity of resolution [117] and receiver antennas [118], [119]. In [117], the authors dealt with the resolution heterogeneity of receivers in video broadcast systems. Specifically, they designed a novel spatial decomposition method based on linear projections to meet differentiated resolution demands. After decomposition, the input videos are divided into a BL and multiple ELs. The BL guarantees the base resolution of the video content, while the ELs progressively enlarge the video resolution.
For the antenna heterogeneity in MIMO systems, the CS algorithm was adopted in [118], whereas AirScale [119] designed a combination of multiple similar description (MSD) coding and multiplexed space-time block coding (M-STBC). In [118], the transformed coefficients were randomized using a measurement matrix before transmission. The randomized symbols are transmitted from multiple antennas; a receiver with an insufficient number of antennas may receive only a limited number of symbols via wireless channels. Even though the receiver does not receive some symbols owing to the number of antennas, the CS-based signal reconstruction algorithm recovers the lost symbols to decrease the degradation owing to the antenna heterogeneity. In AirScale [119], the MSD coding produces highly similar descriptions from video frames in one GoP to provide an additional feature that any linear combination of the descriptions can be used to reconstruct the source signal. M-STBC is then adopted for the descriptions to achieve either multiplexing gain or diversity gain for multiple receivers with a diverse number of antennas. When there is an insufficient number of receiver antennas, M-STBC places similar symbols at corresponding space-time positions to enhance the reconstruction quality using the linear combinations of the received descriptions. When there is a sufficient number of receiver antennas, different symbols are used to achieve multiplexing gain.

G. Implementation
The aforementioned studies mainly discussed performance improvements in theoretical analyses and simulations. Some studies implemented a soft delivery scheme on software-defined radio [133] and field-programmable gate array (FPGA) [134]-[136] platforms to empirically demonstrate the benefits of soft delivery in practical wireless channels. In [133], the authors used Ettus Research X310 and GNU Radio for implementation and evaluated the visual quality of soft delivery in both simulations and experiments. In [134]-[136], the authors used the Xilinx Virtex7 FPGA for implementation and tested the reconstruction quality as a function of wireless channel SNRs. Other studies [137], [138] implemented soft delivery on the prototypes of multi-user MIMO (MU-MIMO) and long-term evolution (LTE) systems, respectively. For example, in [137], SoftCast is implemented on BUSH, which is a large-scale MU-MIMO prototype that performs scalable beam user selection with hybrid beamforming for phased-array antennas in legacy WLANs. The authors performed experiments to evaluate the video quality in terms of PSNR and SSIM over a lossy MU-MIMO channel.

V. EXTENSION FOR IMMERSIVE CONTENTS
SoftCast and other soft delivery schemes mentioned in the previous sections were designed for conventional images and video signals. In modern wireless and mobile communication scenarios, the streaming of immersive content will be a key application for reconstructing 3D perceptual scenes that provide full parallax and depth information for human eyes. The immersive contents can be applied to various applications, such as three to six degrees-of-freedom (6-DoF) entertainment, remote device operation, medical imaging, vehicular perception, VR/AR/MR, and simulated training. Figures 10 (a) through (c) show the typical immersive contents of free-viewpoint video [139]-[141], 360-degree video, and point cloud [142]. Even in immersive contents, the video frames are compressed in a digital manner, and the compressed bitstream is then channel-coded and modulated in sequence. This means that cliff and leveling effects still occur in the streaming of the immersive contents owing to the variation in the channel conditions. To prevent cliff, leveling, and staircase effects, some studies have extended soft delivery schemes toward immersive content for future wireless multimedia services.

A. Free Viewpoint Video
Free-viewpoint videos enable us to observe a 3D scene from freely switchable angles/viewpoints. Fig. 10 (a) shows an example of a free-viewpoint video wherein numerous closely spaced RGB and infrared (IR) camera arrays are deployed to capture the texture and depth frames of a 3D scene, such as a football game. Even though the number of deployed cameras in the field is limited owing to physical constraints, the receiver can synthesize intermediate virtual viewpoints using rendering techniques, e.g., depth image-based rendering [143], [144], to obtain numerous switchable viewpoints. To synthesize intermediate virtual viewpoints using the rendering technique, the sender encodes and transmits the texture and depth frames of two or more adjacent viewpoints, the format of which is known as multi-view plus depth (MVD) [145].
For conventional MVD video streaming over wireless links, digital video compression for MVD video frames, e.g., MVC+D [146] or 3D-AVC [147], fully utilizes the redundancy between the cameras and between texture and depth for compression. In this case, the streaming schemes need to solve view synthesis problems in addition to cliff and leveling effects to yield better video quality even in the synthesized virtual viewpoints. Specifically, the video quality of the virtual viewpoint is determined by the distortion of each texture and depth frame. In digital-based MVD schemes, the distortion depends on the bit and power assignments for each texture and depth frame. It is often cumbersome to achieve the best quality at a target virtual viewpoint using parameter optimization owing to the combinatorial problem with nonlinear quantization.
Some studies [120]-[124] designed soft delivery schemes for free-viewpoint video. Specifically, FreeCast [120]-[122] is the first such scheme. Because MVD video frames contain inter-camera and texture-depth redundancy, FreeCast jointly transforms texture and depth frames using 5D-DCT to exploit inter-view and texture-depth correlations for energy compaction. In addition, FreeCast simplifies the optimization problem of view synthesis by reformulating it as a simple power assignment problem. This is possible because bit allocation, i.e., quantization, is not required in FreeCast. The authors found that the power assignment problem for the texture and depth frames can be solved using a quadratic function to yield the best quality at the desired virtual viewpoint. Another study [123] focuses on the view synthesis problems under the 3D-DCT operations for each camera's texture and depth frames and designs a power assignment method for solving the problem. The main difference between the power assignment methods of FreeCast and [123] is that FreeCast pre-assigns transmission power before the decorrelation, whereas [123] controls the transmission power after the decorrelation. Whereas the aforementioned studies assume that the video frames at the server are error-free, other studies [125], [126] adopt soft delivery to collect video frames from multiple wireless and mobile cameras via wireless channels.

B. 360-Degree Video
Notably, 360-degree video contents build a synthetic virtual environment to mimic the real world with which the users interact. Each user can watch 360-degree videos through a traditional computer-supported VR headset or an all-in-one headset (e.g., Oculus Go). When the user requests the 360-degree video, the sender sends the 360-degree video frames, and the user may play a part of the 360-degree video frames, which is referred to as the viewport, through the user's headset. Here, 360-degree videos are mainly captured by an omnidirectional camera or a combination of multiple cameras and saved in a spherical format. Before transmission, the sphere frames are mapped onto the 2D plane using a certain projection method, e.g., equirectangular and cube map projections.
In 360-degree video streaming, the major issue is to yield better video quality in the user's viewport by effectively reducing perceptual redundancy within 360-degree video frames. Because each user only watches the viewport via the headset at each time instance, a large amount of video traffic is created if the sender sends the full resolution of the 2D-projected video frames with an identical quantization parameter. One of the simplest methods to reduce perceptual redundancy is viewport-only streaming [148]. In video playback, the user may move the viewport according to the user's head/eye movement. Based on the movement, the user requests a new viewport from the sender, and the sender sends back the corresponding viewport. Because the sender transmits one viewport at each time instant, viewport-only streaming can reduce the video traffic. However, the user needs to receive a new viewport from the sender at every viewport switch, which causes a long switching delay. A long switching delay, i.e., approximately 10 ms, may cause simulator sickness [149]. Owing to the long delay in the standard Internet, it is difficult for viewport-only streaming schemes to satisfy the switching delay requirements. To prevent simulator sickness, conventional schemes [150] divide 360-degree video frames into multiple tiles and independently encode them with different quantization parameters to yield better viewport quality within the bandwidth constraint.
The studies [127]-[130] on soft delivery schemes focus on the quality optimization of the user's viewport in addition to cliff and leveling effect prevention. The scheme in [127] is the first for viewport-aware soft 360-degree video delivery. According to the viewport, the sender first adopts pixel-wise power allocation to reduce the perceptual redundancy in 360-degree video frames and then carries out a combination of one-dimensional DCT (1D-DCT) and spherical wavelet transform (SWT) as a decorrelation to utilize the redundancy in the sphere and time domains. OmniCast [128] further incorporates the features of 360-degree videos into the quality optimization. Specifically, the authors analyze the relationship of the distortion between the spherical and projected 2D domains as the spherical distortion for each projection method and design the power allocation to realize the optimal quality in the 2D-projected 360-degree videos. 360Cast [129] and its extended version, 360Cast+ [130], adopt viewport prediction based on linear regression and foveation-aware power allocation within the predicted viewport to further reduce the perceptual redundancy.

C. Point Cloud
Volumetric content delivery provides highly immersive experiences for users through XR devices. The point cloud [142] is arguably the most popular volumetric data structure for representing 3D scenes and objects on holographic displays [151], [152]. A point cloud typically consists of a set of 3D points, and each point is defined by 3D coordinates, i.e., (X, Y, Z), and color attributes, i.e., (R, G, B). In contrast to conventional 2D images and videos, 3D point cloud data are neither well aligned nor uniformly distributed in space.
The major challenge in volumetric delivery over wireless channels is how to efficiently compress and send the numerous and irregularly structured points of the 3D point cloud within a limited bandwidth. Some compression methods have been proposed for point clouds to deliver 3D data. Specifically, Draco [153] employs kd-tree-based compression [154], and the Point Cloud Library (PCL) uses octree-based compression [155]-[157]. To further reduce the amount of data traffic in point cloud delivery, two transform techniques have been proposed for energy compaction of the non-ordered and non-uniformly distributed signals: Fourier-based transforms, e.g., the graph Fourier transform (GFT), and wavelet-based transforms, e.g., the region-adaptive Haar transform [158]. For example, recent studies used GFT for the color components [159] and 3D coordinates [160] of graph signals for signal decorrelation. They used quantization and entropy coding for the compression of decorrelated signals.
HoloCast [131] is a pioneering work on soft 3D point cloud delivery for unstable wireless channels. Specifically, motivated by [160], [161], the authors regard 3D points as vertices in a graph with edges between nearby vertices to deal with the irregular structure of the 3D points. HoloCast uses GFT for such graph signals to exploit the underlying correlations among adjacent graph signals and directly transmits linear-transformed graph signals as a pseudo-analog modulation over the channel. However, it has been found that graph-based coding schemes need to send the graph-based transform basis matrix used in GFT as additional metadata for signal decoding. For example, the sender needs to send $N^2$ real elements of the graph-based transform basis matrix as the metadata when the number of 3D points is $N$. In [132], Givens rotation [162], [163] was used for GFT basis matrix compression. Givens rotation is used to selectively introduce zeros into a matrix to create an identity matrix from the basis matrix using angle parameters. The angle parameters are quantized prior to the metadata transmission for overhead reduction.
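The graph-based decorrelation and its metadata cost can be illustrated with a minimal sketch. It assumes that edges connect points within a fixed radius, that the edge weights follow a Gaussian kernel of the distance, and that the GFT basis is the eigenvector matrix of the combinatorial graph Laplacian; the radius, kernel, and random point cloud are hypothetical and are not the exact HoloCast construction.

```python
import numpy as np

def gft_basis(points, radius):
    """Eigenvectors of the graph Laplacian built from nearby 3D points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Symmetric weights: connect distinct points closer than `radius`.
    W = np.where((dist > 0) & (dist <= radius), np.exp(-dist**2), 0.0)
    L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian
    _, U = np.linalg.eigh(L)                     # orthonormal GFT basis
    return U

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(50, 3))     # hypothetical (X, Y, Z)
colors = rng.uniform(0.0, 255.0, size=(50, 3))   # hypothetical (R, G, B)

U = gft_basis(points, radius=0.3)
coeffs = U.T @ colors                            # forward GFT (sender)
recovered = U @ coeffs                           # inverse GFT (receiver)
metadata_elements = U.size                       # N^2 real elements
```

The receiver can invert the transform only if it knows the basis matrix, whose size grows quadratically with the number of points; this is the overhead that the Givens rotation-based compression in [132] targets.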

VI. FUTURE DIRECTIONS
As mentioned in the previous sections, soft delivery schemes have been studied to overcome the issues of conventional image and video streaming in modern wireless and mobile networks since 2010. The main concept of soft delivery schemes is to replace conventional nonlinear operations with only linear operations, thus preventing the cliff and leveling effects, which are caused by such nonlinear operations.
Moreover, the reconstruction quality of soft delivery schemes highly depends on the performance of the linear encoding and decoding operations. Recent studies integrate nonlinear encoding and decoding operations into soft delivery to obtain further quality improvements. Specifically, the studies in [164], [165] integrate low-rate digital-based encoding and decoding into soft delivery. Although high-rate digital-based operations are sensitive to channel quality fluctuations and thus have a cliff effect at the receiver end, operations with a relatively low rate can compact the signal energy and prevent bit errors even with channel fluctuations. However, the reconstruction quality of the low-rate digital part alone is significantly low. For quality enhancement, the hybrid digital-analog delivery is utilized to send the residual signals; then, the receiver adds the received residuals to the digitally coded images/videos. In this case, the reconstruction quality can be gradually improved based on the wireless channel quality.
Other recent studies utilize deep neural network (DNN) architectures for nonlinear encoding and decoding operations. In particular, deep convolutional neural networks (DCNNs) have been successfully applied to image-based tasks [166], [167]. The DCNN first learns the weights of the mapping function using noisy and original images based on a massive number of typical image datasets. Because the mapping function can represent linear/nonlinear noise, DCNN-based nonlinear encoding and decoding operations offer better performance compared with conventional nonlinear operations.

A. Hybrid Digital-Analog Delivery
Although integration with digital operations, e.g., coset coding, was initially proposed in 2011 [40], the pioneering studies in [164], [165] proposed the superposition of analog-coded and digital-coded symbols, i.e., hybrid digital-analog (HDA) delivery, to exploit the benefits of both conventional digital-based and soft delivery schemes. Fig. 11 shows an overview of the HDA delivery schemes. The HDA delivery schemes consist of digital and analog coding parts. At the sender side, the video frames are first encoded by the digital video encoder, and the digitally coded bitstream is channel coded, modulated, and assigned transmission power. Meanwhile, the residuals are coded, power-assigned, and modulated by the soft delivery scheme. The outputs of the digital and analog coding parts are superposed and transmitted over wireless channels. In this case, the transmitted signal $x_i$ is the sum of the BPSK-modulated vector signal $x_i^{d}$ and the output vector signal of the soft delivery scheme $x_i^{a}$ as follows:

$$x_i = x_i^{d} + x_i^{a}.$$

The BPSK-modulated symbol and the analog-modulated symbol are scaled by $P_d$ and $g_i$, respectively, where $b_i \in \mathcal{X} = \{\pm 1\}$ is the BPSK-modulated symbol and $j = \sqrt{-1}$ denotes the imaginary unit. Here, $g_i$ is set to the near-optimal solution under the transmission power budget $P_a$. We note that the budgets of the transmission power for the digital and analog parts need to satisfy the total power budget $P_t$:

$$P_t = P_d + P_a.$$

At the receiver side, the receiver first decodes the digital-modulated symbols and then obtains the analog-modulated symbols by subtracting the digital-modulated symbols from the received symbols. Finally, the receiver reconstructs the baseline quality of the video frames from the output of the digital part and enhances the video quality by adding the output of the analog part.
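The superposition at the sender and the decode-then-subtract procedure at the receiver can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [164], [165]: the BPSK symbol is placed on the Q axis and scaled by the square root of the digital power, one gain is shared by all analog symbols, pairs of coefficients form one complex analog symbol, the channel is AWGN, and the receiver uses plain inverse scaling. All function names and numerical values are hypothetical.

```python
import math
import random


def hda_superpose(bits, coeffs, p_d, gain):
    """Build x = x_d + x_a: BPSK on the Q axis (assumed) plus an analog symbol."""
    symbols = []
    for k, bit in enumerate(bits):
        x_d = 1j * math.sqrt(p_d) * (1.0 if bit else -1.0)
        x_a = gain * complex(coeffs[2 * k], coeffs[2 * k + 1])
        symbols.append(x_d + x_a)
    return symbols


def hda_receive(received, p_d, gain):
    """Decode the digital part, subtract it, then rescale the analog part."""
    bits, coeffs = [], []
    for y in received:
        bit = y.imag >= 0.0                       # analog Q component acts as noise
        x_d_hat = 1j * math.sqrt(p_d) * (1.0 if bit else -1.0)
        s_hat = (y - x_d_hat) / gain              # inverse scaling of the remainder
        bits.append(bit)
        coeffs.extend([s_hat.real, s_hat.imag])
    return bits, coeffs


rng = random.Random(1)
p_d, p_a, noise_var = 0.95, 0.05, 0.01            # assumed budgets, P_t = 1
coeff_var = 4.0                                   # assumed coefficient variance
n_symbols = 200
bits = [rng.random() < 0.5 for _ in range(n_symbols)]
coeffs = [rng.gauss(0.0, math.sqrt(coeff_var)) for _ in range(2 * n_symbols)]
gain = math.sqrt(p_a / (2.0 * coeff_var))         # expected analog symbol power is P_a
sent = hda_superpose(bits, coeffs, p_d, gain)
sigma = math.sqrt(noise_var / 2.0)                # per-axis noise standard deviation
received = [x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for x in sent]
bits_hat, coeffs_hat = hda_receive(received, p_d, gain)
bit_errors = sum(a != b for a, b in zip(bits, bits_hat))
coeff_mse = sum((a - b) ** 2 for a, b in zip(coeffs, coeffs_hat)) / len(coeffs)
print(bit_errors, coeff_mse)
```

With most of the power assigned to the digital part, the bits are recovered reliably and the analog coefficients are obtained with an error that depends only on the channel noise and the analog gain.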
A key issue in HDA delivery is the assignment of transmission power to the digital and analog parts [169]. Specifically, the power assigned to the digital part must guarantee the correct decoding of the digital symbols, while the digital decoder treats the superimposed analog-modulated symbols $x_i^{a}$ as noise. To achieve better decoding performance, the I component of $x_i^{a}$ should be kept as small as possible. The scheme in [170] selects only the high-frequency coefficients, which are expected to have very small values, for superposition. The remaining low-frequency coefficients are delivered using pseudo-analog modulation. The HDA framework in [171] regards the superposed symbols as three main parts: 1) orthogonal analog symbols, 2) digital symbols, and 3) nonorthogonal analog symbols superimposed onto digital symbols. It designs the resource allocation among these three parts to achieve a better balance between lowering interference and improving reconstruction quality. Another study [172] designs a prediction model to describe the relationship between the variance of the residuals and the quantization parameter, and determines the transmission power for the analog part that maximizes the reconstruction quality while keeping the digital part correctly decodable. The HDA delivery scheme in [173] accounts for imperfect decoding of the digital part and finds the best assignment of the transmission power for the digital and analog parts. This avoids assigning excessive power to the digital part merely to ensure a low bit error rate (BER). In contrast to the aforementioned studies, [174] treats the bandwidth of other digital traffic as a hidden resource for HDA video delivery. Specifically, the analog-modulated symbols are superimposed onto the digital symbols of the other digital traffic to utilize the hidden resource under the constraint that the BER requirement of the other digital traffic is not compromised.
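The trade-off behind this power assignment can be made concrete with a small sketch. It does not reproduce any of the schemes in [169]-[174]; it assumes BPSK detection on one axis, models the analog interference on that axis as Gaussian noise with power `overlap * p_a`, and searches a grid for the smallest digital power that meets a target BER. The names `bpsk_ber`, `split_power`, and `overlap` are hypothetical.

```python
import math


def bpsk_ber(p_d, interference, noise):
    """BER of BPSK when interference on the decision axis is treated as noise."""
    sinr = p_d / (interference + noise)
    return 0.5 * math.erfc(math.sqrt(sinr / 2.0))


def split_power(p_total, noise, target_ber, overlap=0.5, steps=1000):
    """Return (p_d, p_a) with the smallest p_d meeting the BER target, or None."""
    for k in range(1, steps + 1):
        p_d = p_total * k / steps
        p_a = p_total - p_d
        # overlap (assumed): share of analog power falling on the BPSK axis.
        if bpsk_ber(p_d, overlap * p_a, noise) <= target_ber:
            return p_d, p_a
    return None                                   # target unreachable at this noise


for noise in (0.01, 0.001):
    print(noise, split_power(1.0, noise, target_ber=1e-4))
```

Under these assumptions, a cleaner channel lets the sender move more power to the analog part while still meeting the BER target of the digital part.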
Other studies [182]-[184] extend HDA video delivery to immersive content. Swift in [182] considers stereo video delivery and designs a zigzag coding structure for the stereo video to utilize both intra- and inter-view correlations. In the zigzag coding structure, the odd frames in the left view and the even frames in the right view are encoded digitally, and the rest of the frames are encoded in analog. Here, the reconstructions of the digitally coded frames are used as side information to further remove redundant information from the analog-coded frames. Another study [183] extends HDA delivery to MVD videos and solves the view synthesis optimization to yield the best quality at an intermediate virtual viewpoint. HoloCast+ in [184] designs HDA delivery for point clouds.
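The frame assignment of the zigzag coding structure can be written down in a few lines. The sketch below only encodes the rule stated above; the function name and the 1-based frame numbering are assumptions, and the actual coding of each frame in [182] is not modeled.

```python
def zigzag_mode(view, frame_number):
    """Return 'digital' or 'analog' for a frame; frame_number starts at 1 (assumed)."""
    if view not in ("left", "right"):
        raise ValueError("view must be 'left' or 'right'")
    odd = frame_number % 2 == 1
    # Odd frames of the left view and even frames of the right view are digital.
    digital = odd if view == "left" else not odd
    return "digital" if digital else "analog"


for n in range(1, 5):
    print(n, zigzag_mode("left", n), zigzag_mode("right", n))
```

With this rule, every time instant has exactly one digitally coded frame, so each analog-coded frame has digitally coded neighbors in both the other view and the adjacent frames of its own view.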

B. Applied Deep Neural Network
In recent studies, DNN-based nonlinear operations have been integrated with soft delivery. A multi-layer perceptron auto-encoder was first adopted to reduce the overhead of soft delivery [185]. Specifically, the proposed encoder obtains a few latent variables from the pixel values, and the proposed decoder recovers accurate power information from the received latent variables for proper power allocation. The reconstruction quality can be maintained even with only one piece of metadata per GoP. Another study [186] integrates a DCNN-based auto-encoder into a soft delivery scheme. The proposed encoder directly compresses each image into a limited number of latent variables, and the proposed decoder reconstructs the image from the latent variables. Here, the latent variables are transmitted over wireless channels using pseudo-analog modulation. Even though the latent variables are obtained by nonlinear functions and delivered over wireless channels with a lower SNR, cliff and leveling effects can be prevented via pseudo-analog modulation. Other studies have introduced DNN architectures for power allocation [187] and decoding operations [188]. The study in [187] uses a you-only-look-once (YOLO) structure [189] to extract the region of interest (ROI) and non-ROI parts from each image and then assigns unequal transmission power across the ROI and non-ROI parts for perceptual quality enhancement. The scheme in [188] integrates DCNN-based image denoising, specifically deep image prior (DIP) [190], into soft delivery. The DIP captures linear and nonlinear noise effects to reconstruct clean images from noisy images, so the scheme can remove fading and noise effects from the received images using DCNN-based image restoration. Another study [168] introduces graph neural networks (GNNs) [191] for wireless point cloud delivery. The GNN is a model for graph representation learning that allows the analysis of the irregular geometric structure of graph data. Fig. 12 shows their GNN-based auto-encoder (GAE) [192], [193], which encodes 3D point clouds into a limited number of latent variables. One of the benefits of the GAE is that it allows graph signal reconstruction from a limited number of latent variables without requiring additional metadata.

Fig. 12. GNN-based end-to-end encoder and decoder for wireless 3D point cloud delivery [168].
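The unequal power assignment across ROI and non-ROI parts can be illustrated with a simple weighted split. This is only a sketch of the general idea, not the allocation rule of [187]: the ROI flags are assumed to come from an external detector, and the function name and the weight `roi_weight` are hypothetical.

```python
def roi_power_allocation(roi_flags, total_power, roi_weight=4.0):
    """Split total_power over blocks; ROI blocks get roi_weight times more power."""
    if roi_weight <= 0:
        raise ValueError("roi_weight must be positive")
    # roi_weight (assumed): relative importance of an ROI block vs. a non-ROI block.
    weights = [roi_weight if flag else 1.0 for flag in roi_flags]
    scale = total_power / sum(weights)            # keeps the total power budget
    return [scale * w for w in weights]


flags = [True, False, False, True]                # assumed detector output per block
powers = roi_power_allocation(flags, total_power=10.0)
print(powers)
```

Because the weights are normalized by their sum, raising the power of the ROI blocks automatically lowers the power of the remaining blocks, so the total transmission power budget is unchanged.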

VII. CONCLUSION
Herein, we present an exhaustive survey and research outlook of the soft delivery schemes. We first review conventional digital-based video delivery schemes and their critical issues, including cliff, leveling, and staircase effects. Next, we provide an overview of the soft delivery schemes and a taxonomy of the existing schemes from the perspectives of energy compaction, power allocation, bandwidth utilization, packet loss resilience, overhead reduction, receiver heterogeneity, and implementation. Finally, future research directions, including the extension to immersive content, HDA delivery, and DNN-based soft delivery, are discussed.

Fig. 1. Taxonomy of the studies on soft delivery schemes.

Fig. 4. Mapping coded video to I/Q components of the transmitted signal. (a) Traditional 16-QAM maps a bit sequence to the complex number corresponding to the point labeled with that sequence. (b) By contrast, SoftCast's PHY treats pairs of coded values as the real and imaginary parts of a complex number.

Fig. 5. Scaling and inverse scaling operations in soft delivery schemes.

Fig. 9. Block diagram of the sender-side and receiver-side overhead reduction methods. (a) Sender utilizes fitting functions to obtain power information with fewer parameters. (b) Receiver estimates the power information only from the received symbols.
Table I lists three critical issues regarding the video quality of the digital-based schemes because of the channel quality fluctuation: cliff, leveling, and staircase effects.
1) Cliff Effect: Digitally encoded bits are known to be susceptible to errors during wireless transmission. Because entropy coding schemes have an all-or-nothing behavior, even

TABLE I
CRITICAL ISSUES REGARDING VIDEO QUALITY IN WIRELESS AND MOBILE VIDEO STREAMING

TABLE II
BRIEF INTRODUCTION TO TYPICAL ENERGY COMPACTION TECHNIQUES FOR SOFT DELIVERY SCHEMES

i by a coset step q and obtains the coset residual code l_i as follows: