COCONUT: Content Consumption Energy Measurement Dataset for Adaptive Video Streaming

HTTP Adaptive Streaming (HAS) has emerged as the predominant solution for delivering video content on the Internet. The urgency of the climate crisis has accentuated the demand for investigations into the environmental impact of HAS techniques. In HAS, clients rely on adaptive bitrate (ABR) algorithms to drive the quality selection for video segments. These algorithms often prioritize maximizing video quality under favorable network conditions, disregarding the impact on energy consumption. Further research is still needed to thoroughly investigate energy consumption, including the impact of bitrate and of other video parameters such as resolution and codec. In this paper, we propose COCONUT, a COntent COnsumption eNergy measUrement daTaset for adaptive video streaming, collected through a digital multimeter on various types of client devices, such as a laptop and a smartphone, streaming MPEG-DASH segments. Furthermore, we analyze the dataset and derive insights into the influence on energy consumption of multiple codecs, various video encoding parameters, such as segment length, framerate, bitrate, and resolution, and the decoding type, i.e., hardware or software. We gather and categorize these measurements based on segment retrieval through the network interface card (NIC), decoding, and rendering. Additionally, we compare the impact of different HAS players on energy consumption. This research offers valuable perspectives on the energy usage of streaming devices, which could contribute to creating a media consumption experience that is both more sustainable and resource-efficient. Dataset URL: https://athena.itec.aau.at/coconut/.


INTRODUCTION
The impact of video streaming on global Internet traffic has experienced significant growth over the past years. As outlined in the Ericsson Mobility Report [8], by the end of the first half of 2023, video streaming constituted 71% of all mobile data traffic. Projections foresee a further increase, reaching 80% by 2028. Most videos available online are transmitted using HTTP Adaptive Streaming (HAS), including Dynamic Adaptive Streaming over HTTP (DASH) [7] and HTTP Live Streaming (HLS) [20], which stands out as a prominent method for delivering video data over the Internet [3,24]. In HAS, each video is encoded at various quality levels, each characterized by a specific bitrate and resolution. Subsequently, the video is temporally divided into segments of uniform duration. On the client side, an Adaptive Bitrate (ABR) algorithm dynamically selects and requests the appropriate bitrate for each segment. This adaptive approach accommodates fluctuations in network throughput, with the ultimate goal of delivering high Quality of Experience (QoE) [22,23,17].
Advanced Video Coding (AVC) [27] is still the most widely embraced video coding format, enjoying support from numerous hardware manufacturers and adoption by major platforms. Despite its widespread use, AVC exhibits lower compression efficiency, particularly noticeable in high-resolution videos, prompting the establishment of a new video coding standard in 2013, i.e., High-Efficiency Video Coding (HEVC) [21]. This standard achieved a notable 50% reduction in bitrate compared to its predecessor, AVC [19].
The climate crisis has brought attention to the environmental repercussions of video streaming, emphasizing the necessity for studies to evaluate its carbon footprint [2,18]. In the domain of video consumption, streaming-related processes at the client side can be organized into three classes [1]: segment (i) retrieval, (ii) decoding, and (iii) rendering. Retrieval refers to the client's process of requesting and downloading video segments from the server, whose energy consumption mainly depends on the network interface card (NIC). Decoding considers the system processes involved in decompressing the encoded video segments. Rendering describes the process of displaying the decoded video frames on the device screen. Several studies remark on the impact of video bitrate and resolution on energy consumption [26,5,9]. However, these

studies focus on only one type of device (e.g., smartphone, PC, laptop), trying to find a generalized model for the energy consumption of all device types. Since energy consumption highly depends on hardware and software characteristics, the energy measurements will likely vary if collected on different devices. Therefore, a comparison across various device types is crucial. Furthermore, in [5,9], the energy consumption is computed and analyzed as a whole, with no insights into the different processes involved. On the other hand, [26] only considers energy measurements from segment decoding, trying to estimate segment retrieval and rendering separately with existing models.
In this paper, we introduce a COntent COnsumption eNergy measUrement daTaset for adaptive video streaming (COCONUT) collected on various types of client devices, such as a laptop and a smartphone, streaming MPEG-DASH segments. Our dataset is publicly available at https://athena.itec.aau.at/coconut/. For each combination of parameters defining a scenario, COCONUT includes a collection of measurements of current intensity (A) useful to determine the power consumption (W) at each sampled instant. The dataset is useful for investigating the impact of different hardware, software, and video streaming parameters on the energy consumption of devices employing HAS.
The remainder of this paper is organized as follows. Section 2 gives an overview of the related work. We present our collection testbed, dataset characteristics, and structure in Section 3. The analysis of the collected data is reported in Section 4. Then, we provide possible applications for COCONUT in Section 5. This is followed by the concluding remarks in Section 6.

RELATED WORK
The increasing concern about the climate crisis has brought more attention to the energy usage of everyday computing devices. This has led to the publication of multiple datasets to assess their energy consumption [25,4,26].
Green500 [25] presents a list of over 500 computing systems characterized by the number of computing cores, maximum performance rating according to the High-Performance Linpack (HPL) benchmark (PFlops/s), overall power (kW), and energy efficiency (GFlops/W). The data is further ordered in a double-ranked way, highlighting both the most powerful and the most energy-efficient machines. In the context of HAS, Chachou et al. [4] and Turkkan et al. [26] release collections of energy measurements related to segment decoding on client devices. Chachou et al. [4] collected energy measurements for two devices, one desktop PC and one laptop. This data is, however, only relative to software video decoders, leaving an open question on the impact of optimized hardware-accelerated video decoding. Turkkan et al. [26] perform an intense analysis of the optimal trade-offs between video quality and the energy consumption of a mobile video streaming device. This study is based on a series of collected data that the authors publicly released. This data collection includes video parameters, such as bitrate, frame rate or frames per second (fps), resolution, motion rate, and perceived quality expressed in Video Multi-Method Assessment Fusion (VMAF) [11], together with the power measured by short-circuiting the device battery while streaming the relative video content. Although they provide interesting insights on the energy consumption of client devices, this data is limited to the decoding process. Therefore, no measurement has been provided for segment retrieval and rendering.
Contribution. Table 1 depicts the main differences between our dataset, COCONUT, and the related work. Unlike other studies, we collected numerous measurements on one laptop and one smartphone. Furthermore, we provide measurements for both segment retrieval and decoding plus rendering, to better differentiate their impact on the overall energy consumption. Given that different codecs lead to different visual qualities [14], we additionally employ AVC, HEVC, VP9, and AV1 to compare their energy efficiency in decoding video segments with the quality they provide. For each video frame, we also report the Peak Signal-to-Noise Ratio (PSNR), the perceived VMAF quality, and the spatial information (SI) and temporal information (TI) computed using VCA [15]. This provides an estimation of the video complexity, ensuring diversity in the complexity of the videos. Lastly, we employ multiple HAS players, i.e., Shaka, dashjs, and ExoPlayer, and different decoding options, i.e., software and hardware-accelerated.

METHODOLOGY AND DATASET STRUCTURE
In this section, we first describe the hardware and software setup of our testbed. We then provide a categorical decomposition of the energy consumption based on the specific operating system (OS) layers and processes. Subsequently, we present the structure and characteristics of the dataset.

Testbed
Our testbed comprises a HAS client (smartphone or laptop), supplied by a power supply with a defined voltage between 0 and 30 V, and a HAS server, which communicate either via Wi-Fi™ (smartphone, laptop), Ethernet (laptop), or LTE (smartphone). Furthermore, a Voltage/Ampere meter is required to collect energy consumption data from the client. An overview of the testbed is presented in Figure 1 and described in the following.
Client. The HAS client is responsible for requesting and downloading the desired video segments via the HTTP GET method from the HAS server. Once the content has been retrieved, the client proceeds to play it back according to its reproduction timeline. In our experiments, the client varies between the Lenovo© ThinkPad™ P1 Gen4 (laptop) and the Xiaomi© Mi 11 5G (smartphone), each with a custom set of hardware specifications, OS, and HAS player, as presented in Table 2. During the experiments, the players have been set to full screen to reproduce the same conditions happening in a real-world scenario. It is crucial to mention that, for evaluating the effects of the decoding and rendering processes exclusively, we locally retrieved the video sequences without engaging the server. Furthermore, the maximum buffer capacity of the player is set to 0 s to force the retrieval of the following video segment to happen right before the rendering of the current segment is completed. Subsequently, the client activates its NIC and requests all video segments from the server to evaluate the energy impact of segment retrieval.
Server. The HAS server is a key component of a video streaming architecture. Its duty is to serve the manifest, which contains information about the available video streams and their different quality levels, and the requested video content to the clients. Information about the video sequences stored and served by the server is provided in Section 3.2. For Wi-Fi™ and Ethernet, the HAS server is in the same private network as the HAS client. For the experiments with LTE, instead, the smartphone client fetches the content from an Amazon© Simple Storage Service (S3) bucket.
Power supply. Every electronic device requires a power source to operate. Although AC/DC adapters are supposed to provide a stable power supply, we noticed in our experiments that the voltage applied to the selected devices was heavily oscillating. Therefore, to minimize these oscillations, we connected the client to the NANKADF© WPS-10H, a device that supplies a suitable voltage between 0 and 30 V and supports a current flow up to 10 A. This way, we ensure the accuracy of the measurements. Based on the device and charger specifications, we supplied 20 V to the laptop and 5 V to the smartphone.
Voltage/Ampere meter. Energy-related measurements, such as voltage (V) and current (A), for a device can be obtained either by physical devices or by hardware-specific software. In our experiments, we used the multimeter Voltcraft© VC-7200BT, capable of sampling voltage and current in sub-second intervals with a frequency of up to 200 Hz, ensuring a capture rate of one measurement every 5 ms. In this way, we can accurately measure the actual energy consumption of the devices. Given the voltage supplied to the device, we measure and gather the current intensity (A) to obtain the power consumption as the product P(t) = V(t) × I(t).
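The computation above can be sketched as follows. This is a minimal illustration, not code from the dataset: the constants reflect the testbed description (20 V laptop supply, 200 Hz sampling), and the helper names are our own.

```python
# Sketch: deriving power and energy from the multimeter's current samples.
# Assumes a constant supply voltage and a fixed 5 ms sampling interval.

SUPPLY_VOLTAGE_V = 20.0     # laptop supply (5.0 V for the smartphone)
SAMPLE_INTERVAL_S = 0.005   # 200 Hz sampling rate -> one sample every 5 ms

def power_samples(current_samples_a, voltage_v=SUPPLY_VOLTAGE_V):
    """P(t) = V(t) * I(t); the voltage is treated as constant here."""
    return [voltage_v * i for i in current_samples_a]

def energy_joules(current_samples_a, voltage_v=SUPPLY_VOLTAGE_V,
                  dt_s=SAMPLE_INTERVAL_S):
    """Energy is the integral of power over time, approximated by a sum."""
    return sum(power_samples(current_samples_a, voltage_v)) * dt_s

# Example: one second of a constant 0.5 A draw at 20 V yields 10 J.
print(energy_joules([0.5] * 200))
```

The same computation applies to the smartphone by passing `voltage_v=5.0`.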

Video preparation
Video sequences. We selected several video sequences from publicly available datasets depending on their genre, as presented in Figure 2, namely (a) Lifting Off (balloon festival), (b) Big Buck Bunny (animation), (c) Indoor Soccer (indoor sport), (d) Unspoken Friend (horse riding), and (e) Seconds That Count (slow motion). To prepare the media content for delivery, essentially two steps are necessary, namely encoding and packaging.
Encoding. We encoded the selected sequences with AVC, HEVC, VP9, and AV1 using a command template in which IN and OUT are the input and output (encoded) media files, respectively, LIB is the adopted video codec library, i.e., libx264 for AVC, libx265 for HEVC, libvpx-vp9 for VP9, and libsvtav1 for AV1, and BITRATE is the target bitrate for the encoded video stream. FPS is the number of frames per second and SL is the segment length in seconds. At this time, audio is not considered since video has the highest impact on the energy consumption of a device. Furthermore, our goal is to compare different streaming parameters that influence only the visual domain.
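Since the encoding command itself is not reproduced here, the following is a hypothetical reconstruction of how the described parameters (IN, OUT, LIB, BITRATE, FPS, SL) could map onto an ffmpeg invocation; the exact flags used by the authors may differ. The GOP alignment to the segment length is our assumption, commonly required for clean DASH segmentation.

```python
# Hypothetical sketch of the ffmpeg command described above; illustrative
# reconstruction, not the authors' original invocation.

def encode_cmd(infile, outfile, lib, bitrate_kbps, fps, seg_len_s):
    # Keyframe every SL seconds so each DASH segment can start on an
    # IDR frame (assumption, not stated in the text).
    gop = fps * seg_len_s
    return ["ffmpeg", "-y", "-i", infile,
            "-c:v", lib,                 # libx264 / libx265 / libvpx-vp9 / libsvtav1
            "-b:v", f"{bitrate_kbps}k",  # target bitrate
            "-r", str(fps),              # frame rate
            "-g", str(gop),              # GOP size aligned to segment length
            "-keyint_min", str(gop),
            "-an",                       # audio is not considered
            outfile]

print(" ".join(encode_cmd("IN.mp4", "OUT.mp4", "libx265", 3000, 24, 4)))
```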
Packaging. After being encoded, the video sequences are packaged using DASH to enable their distribution to the clients. This process involves splitting the whole content into shorter segments, represented and described in an XML-formatted file known as the manifest or Media Presentation Description (MPD). Therefore, we used MP4Box, a multimedia packager available in the Multimedia Open Source Project (GPAC), executing a command in which SL is the segment length in milliseconds, SN is the name specification for the generated segments, IN is a sequence of paths (separated by spaces) of the encoded videos to package, and MANIFEST is the output path for the generated manifest.

Energy consumption decomposition
From a client perspective, the energy consumption E of a client device involved in video streaming can be modeled as E = E_ret + E_dec + E_ren, where E_ret, E_dec, and E_ren are the energy consumed by segment retrieval, decoding, and rendering, respectively [1]. Figure 3 depicts visual interconnections between these stacks within the client's streaming pipeline.
Retrieval, also known as segment fetching, includes all network-related operations designed for the transmission stack (TCP/IP for HTTP/1 and HTTP/2, and UDP for HTTP/3), such as packetization, addressing, routing, and error control, ensuring reliable data transmission through the NIC between devices on a network. In this way, the client and the server can engage in data transfer processes to exchange information, such as files or synchronization data.
Decoding refers to the process of decompressing an encoded multimedia file, here encoded video segments. Typically, decoding happens after unpackaging, which is the process of extracting the actual encoded video segment together with information useful for the segment playback, such as its position in the timeline. In this work, we do not differentiate between the two processes.
Lastly, rendering describes the process of displaying the decoded video frames on the device screen, applying image corrections, such as upscaling or downscaling, if necessary.

Dataset structure
Our dataset is structured as follows. The data available in COCONUT are presented and described in Table 3. The main.csv file is necessary to access the desired collection of energy measurements and contains the information relative to each scenario, defined by the unique identification sequence ScenarioID.
Each line in the main.csv file represents a different scenario for which two collections of measurements have been made available. The first collection refers to the retrieval measurements, stored in the file Energy_details.csv, which contains measurements related to the current intensity (A) and download time (s) for each scenario, both averaged over all retrieved segments. The second collection is contained in the file [ScenarioID]_details.csv and represents a series of measurements for decoding+rendering with the NIC disabled, gathered by the multimeter during the reproduction. These measurements are then grouped based on the streaming time ST and averaged on a per-second basis, excluding the first 3 seconds since those measurements are biased by the player initialization. Whereas PM is available for both devices and refers to the current intensity measured by the multimeter, the last 5 columns, namely PN, GPU_T, GPU_D, GPU_R, and GPU_CPU, are only available for the laptop, as output by Nvitop [12] and Psutil [13].
In addition to the data presented in Table 3, for each video content and frame, we provide PSNR, VMAF, SI, and TI values in the file content_details.csv.
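As an illustration, a scenario row from main.csv could be combined with its per-second decoding+rendering measurements as in the sketch below. The synthetic rows, and any column names beyond ScenarioID, ST, and PM described above, are assumptions for demonstration only.

```python
# Minimal sketch: join one scenario from main.csv with the average current
# from its [ScenarioID]_details.csv. In-memory stand-ins replace real files.
import csv
import io

main_csv = io.StringIO("ScenarioID,Codec,Fps\nS001,hevc,24\n")      # synthetic stand-in
details_csv = io.StringIO("ST,PM\n4,0.11\n5,0.12\n6,0.10\n")        # synthetic stand-in

scenario = next(csv.DictReader(main_csv))                           # one scenario row
currents = [float(row["PM"]) for row in csv.DictReader(details_csv)]
avg_current_a = sum(currents) / len(currents)                       # mean multimeter current (A)

print(scenario["ScenarioID"], round(avg_current_a, 3))  # S001 0.11
```

With the real files, the same pattern extends to the GPU-related columns available for the laptop.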

RESULTS AND ANALYSIS
According to Table 2 and the specifications provided in Section 3, we can count a total of roughly 10,000 different scenarios, i.e., combinations of parameters. In this section, we visually present an excerpt of the collected data (cf. Figures 4a-4h) and provide insights on it. If not otherwise specified, the reported measurements refer to either decoding+rendering, meaning that the content is fetched locally, or retrieval, requesting the segments from a HAS server. The measurements reported for decoding+rendering (Dec. and Ren. in Figure 4) are obtained as the difference between the value reported by the multimeter and the current intensity required by the respective client device when idle. Furthermore, only AVC, HEVC, and VP9 support hardware decoding, while AV1 is only decodable via software. Hence, when not specified, the current intensity measurements refer to software decoding for AV1 and hardware decoding for AVC, HEVC, and VP9. Figure 4a presents the average spatial information (SI) and temporal information (TI) of the selected video sequences. From the graph, we can notice that the chosen video sequences cover a large portion of the complexity spectrum with high and low SI-TI combinations.
Figure 4e reports the current intensity (A) required by the laptop and smartphone devices when idle, i.e., when no streaming is being performed, depending on the display brightness level (%). From the illustration, we can notice a linear dependency between current intensity and brightness level for both devices, up to 100% brightness for the smartphone. During the experiments, the brightness of the display has been set to 80% for the smartphone and to 100% for the laptop.
Figure 4b depicts the direct current measured for decoding and rendering, averaged over all samples, for the laptop streaming the Big Buck Bunny (BBB) sequence. It is interesting to note the large gap in the measured current intensity between AV1 and the other codecs, especially at 60 fps. While AV1 at 24 fps already leads to a higher current intensity than all other codecs, even at 60 fps, AV1 at 60 fps consumes roughly 4 times more than AV1 at 24 fps for representations with an ID higher than 10. This linear dependency on the representation ID, i.e., bitrate and resolution, is due to the absence of hardware decoding for AV1, which is instead decoded via software. At the bottom of the graph, we can find HEVC at 24 fps, whose current intensity is as low as roughly a third of that of the compared codecs.
Figure 4f presents a scenario similar to Figure 4b, with the smartphone streaming the BBB sequence. It is interesting to note the difference from the scenario presented in Figure 4b. In this scenario, although the difference between 24 fps and 60 fps is significant for all codecs (e.g., from 0.1 A to 0.22 A for HEVC), there is no meaningful difference between AVC, HEVC, and VP9. On the other hand, when streaming AV1 content to the smartphone, we notice a considerable increase in current intensity from representation 1 to representation 12 (roughly 4 times more for 24 fps and 2 times more for 60 fps), while the measurements for AVC, HEVC, and VP9 do not seem to be affected by bitrate and resolution.
Figure 4c and Figure 4g present the current intensity for different representations when BBB, at 24 fps and 60 fps, is locally retrieved through the VLC media player on the laptop and ExoPlayer on the smartphone, respectively. In both graphs, we notice that hardware acceleration (HW) always requires the lowest amount of current. On the other hand, software decoders (SW) struggle to compete with hardware-accelerated decoders, with significant increments in current intensity. For instance, in Figure 4c, we can observe up to roughly 350% higher current intensity for representation 12 and 24 fps from HW|VP9 to SW|VP9. An interesting insight is that SW|AV1 consumes less energy than SW|AVC, SW|HEVC, which shows the highest current intensity for all representations at 60 fps, and SW|VP9, which leads to the highest current intensity for all representations at 24 fps.
On the other hand, Figure 4g shows SW|AV1, for both 24 fps and 60 fps, as the most intense software decoding strategy, with 4 times more current required than all other software decoding techniques at 24 fps. When analyzing the 60 fps scenario, the current intensity for hardware decoding is roughly 0.2 A for all codecs, while the values for software decoding show a linear dependency on the representation ID, increasing by more than 100% (from HW|60fps|VP9 to SW|60fps|AVC). It is also worth noting that ExoPlayer is not able to initialize the software decoder for VP9 for representations with resolutions higher than 1080p (10, 11, 12) and 60 fps, meaning that the tested smartphone device has limited decoding capabilities. Therefore, some results for SW|VP9 are missing.
Figure 4d and Figure 4h illustrate the energy consumption in kilojoules (kJ) of the laptop and smartphone devices, respectively, for the supported NIC types when retrieving video segments from the Lifting Off sequence, varying the representation ID, segment length (1 s, 4 s, and 10 s), and bandwidth (5 Mbps, 10 Mbps, and 20 Mbps). The results are averaged over all video segments for each representation ID. Furthermore, we report the average segment size (MB) as colored dot markers for each representation ID and segment length. In Figure 4d, it is interesting to note that varying the segment length does not significantly affect the energy consumption, for which we report a maximum difference of less than 0.1 kJ between 1 s and 10 s at 20 Mbps. On the other hand, different bandwidths do have a considerable impact on the energy consumption, which roughly doubles from 5 Mbps to 10 Mbps and from 10 Mbps to 20 Mbps for both Ethernet and Wi-Fi™. This is reasonable, as the NIC operates when retrieving a video segment, whose download time depends on the bandwidth. Ethernet and Wi-Fi™ present similar energy consumption values since the NIC is mainly active in receiving rather than delivering data, for which we would expect wireless communication to have a higher impact.
Similar considerations for Wi-Fi™ can be drawn from Figure 4h. For LTE, instead, compared to the results for 10 s and 4 s segment lengths, which follow the same trend, retrieving Lifting Off video segments with a 1 s length requires roughly 2 times more energy. It is worth noting that we ran the experiments in less than 1 hour to limit the variability in network congestion happening during the day.
Due to the lack of space, we could not present the comparison for other parameters we gathered and included in COCONUT.

APPLICATIONS
Possible applications for COCONUT are highlighted as follows.
Power prediction models.With the data provided by COCONUT, custom prediction models can be trained to estimate the power consumption of a client device relative to segment retrieval or decoding+rendering, based on several specific parameters, such as bitrate, resolution, codec, device type, NIC, and available bandwidth.
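A minimal version of this idea can be sketched as a linear least-squares fit; the feature set, the linear model form, and the sample values below are illustrative assumptions, not results from the dataset.

```python
# Sketch: a simple linear power-prediction model over COCONUT-style
# features. The training rows are synthetic placeholders.
import numpy as np

# Features: [bitrate (Mbps), resolution height / 1000, fps / 30]
X = np.array([[1.0, 0.36, 0.8],
              [3.0, 0.72, 0.8],
              [5.0, 1.08, 2.0],
              [8.0, 2.16, 2.0]])
y = np.array([0.10, 0.13, 0.22, 0.30])   # target: measured current (A), synthetic

# Least-squares fit with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(bitrate_mbps, height_px, fps):
    """Predict the current intensity (A) for a new configuration."""
    return float(np.array([bitrate_mbps, height_px / 1000, fps / 30, 1.0]) @ coef)

print(round(predict(5.0, 1080, 60), 3))
```

With the real dataset, richer models (per-device, per-codec, or non-linear) could be trained from the same features.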
Impact analysis of specific parameters. COCONUT can be used to perform comparisons between several parameters, e.g., codecs, bitrates, resolutions, HAS players, and software/hardware decoding. An in-depth analysis can be performed to shed light on the impact of these parameters on the power consumption of a client device.
Energy-aware ABR algorithm. COCONUT can be a substantial help in designing energy-aware ABR algorithms to select the appropriate video segment. A custom cost function could be employed to trade off between video quality (↑) and energy consumption (↓) for segment retrieval, decoding, and rendering. For instance, if a video sequence is available in multiple codecs, such an ABR algorithm could either retrieve or estimate the quality and energy consumption of each available video segment to fetch the codec that provides the lowest cost.
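The cost-function idea above can be sketched as follows; the weights, candidate values, and helper names are illustrative assumptions rather than a proposed algorithm.

```python
# Sketch: pick the codec variant minimizing a quality/energy trade-off.
# Quality is rewarded (negative term), energy is penalized.

def cost(vmaf, energy_j, w_quality=1.0, w_energy=2.0):
    return -w_quality * vmaf + w_energy * energy_j

def select_segment(candidates):
    """candidates: list of (codec, vmaf, estimated_energy_j) tuples."""
    return min(candidates, key=lambda c: cost(c[1], c[2]))

# Synthetic candidates for one segment, available in three codecs.
candidates = [("avc", 90.0, 12.0),
              ("hevc", 93.0, 9.0),
              ("av1", 95.0, 20.0)]

print(select_segment(candidates)[0])  # hevc
```

Tuning the weights shifts the policy between quality-first and energy-first behavior.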

CONCLUSION
In this paper, we presented COCONUT, a content consumption energy measurement dataset for adaptive video streaming. This data collection contains several key parameters, such as bitrate, resolution, complexity, codec, quality, frame rate, segment length, NIC, HAS player, and decoding type. Based on different combinations of the mentioned parameters, we analyze the gathered energy measurements and present insights into their influence on energy consumption, distinguishing between segment retrieval through the NIC and decoding+rendering. Lastly, we present possible applications for COCONUT that we plan to study in future works.

Figure 1: Testbed used to collect the measurements in COCONUT.

Figure 2: Sample frames from the selected video sequences.
Figure 3: SI and TI of the video sequences.

Figure 4: Visual results extracted from COCONUT for different scenarios.

Table 1: Comparison of COCONUT with the related work.

Table 2: Client devices utilized throughout the experiments.

Table 3: Structure of the CSV files in COCONUT.