Tensor-Aware Energy Accounting

With the rapid growth of Artificial Intelligence (AI) applications supported by deep learning (DL), the energy efficiency of these applications has an increasingly large impact on sustainability. We introduce Smaragdine, a new energy accounting system for tensor-based DL programs implemented with TensorFlow. At the heart of Smaragdine is a novel white-box methodology of energy accounting, which we call tensor-aware energy accounting: Smaragdine is aware of the internal structure of the DL program. With Smaragdine, the energy consumption of a DL program can be broken down into units aligned with its logical hierarchical decomposition structure. We apply Smaragdine to understand the energy behavior of BERT, one of the most widely used language models. Layer by layer and tensor by tensor, Smaragdine is capable of identifying the highest energy/power-consuming components of BERT. Furthermore, we conduct two case studies on how Smaragdine supports downstream toolchain building: one on the comparative energy impact of hyperparameter tuning of BERT, the other on how the energy behavior evolves when BERT evolves to its next generation, ALBERT.


INTRODUCTION
Green AI [42] is a fundamental challenge with far-reaching implications for the future practice of software engineering and the sustainability of our society [45]. Deep learning (DL) [3,8,12,29,40], the technology that drives the current wave of the AI revolution, happens to be excessively energy-hungry. For example, training DL-based large language models is estimated to consume 1,287 megawatt hours of electricity [36]. Optimizing the energy consumption of DL systems and applications is a fast-developing direction with intense interest [5,14,39].
The first step toward change is awareness. Underpinning many solutions of energy optimization is the fundamental problem of energy accounting: a deep understanding of energy consumption obtained by breaking it down by (software or hardware) components. This is in contrast with the "monolithic" black-box approach, e.g., measuring the end-to-end energy consumption of a DL training session. Energy accounting is a classic problem, with established solutions [2,11,52,53] focusing on breaking down the energy consumption by hardware components and OS components. It is not difficult to envision a gray-box approach that retrofits these existing solutions to DL programs, i.e., breaking down the energy consumption of a DL training session by architectural units or OS threads.
Our key insight is that, be it a black-box approach or a gray-box approach, it is a missed opportunity when an energy accounting system ignores the unique abstractions latent in the DL program. After all, a DL program is highly structured, often broken down into modules formed by layers of tensors. Both the black-box approach and the gray-box approach pessimistically treat the DL program just like any other program. How can we leverage the structural information in the DL program for energy accounting? What benefits do we gain with this new flavor of energy accounting?

Tensor-Aware Energy Accounting
In this paper, we introduce Smaragdine, a novel multi-grained energy accounting system for TensorFlow [1]-based DL programs that aligns the decomposition of energy consumption with that of the logical structure of the DL program. At the heart of Smaragdine is a novel white-box methodology of energy accounting: the system is aware of the internal structure of the DL program, and its energy consumption can be broken down following its abstraction boundaries. The output of Smaragdine is a series of nested Energy Distribution Diagrams (EDDs) corresponding to the hierarchical decomposition of the DL program. For instance, a top-level EDD shows how the overall energy consumption of a DL program is broken down into modules; the EDDs for sub-modules show how the energy consumption of a particular module is broken down into Deep Neural Network (DNN) layers; and so on. With the hierarchically decomposed EDDs, the atomic unit of energy accounting is a tensor operation in TensorFlow-based implementations.
The white-box approach of Smaragdine has some distinct benefits. First, it complements the active research of explainable AI [1,7,37] with a perspective on the explainability of non-functional properties such as energy consumption. The hierarchical decomposition structure of EDDs explains how energy consumption is distributed in a layer-by-layer, or even tensor-by-tensor, manner. Second, EDDs, with their "logical" nature of energy accounting, may offer insights to downstream designs in energy optimization and energy debugging. With EDDs, it is a trivial task for a downstream designer to zero in on the "energy hotspot": the highest energy-consuming unit (a module, a layer, or even a tensor) of the DL program. These energy hotspots are likely to be the ideal candidates for optimization or bug fixes due to their proportionally larger impact. Third, the white-box approach of Smaragdine promotes the study of portability in energy accounting. In gray-box approaches, the result of energy accounting is fundamentally platform-specific: the breakdown of energy consumption across hardware components depends on what hardware components each platform has. In contrast, Smaragdine breaks down the energy consumption by the logical components of a DL program.
The design of Smaragdine must overcome two major challenges. First, there remains a semantic gap between the DL program and the underlying physical system: no ready-made profiler or tool can connect the semantic features of the DL program execution to energy consumption. The solution of Smaragdine is a trace-based alignment algorithm, which conceptually can be viewed as a monitor that simultaneously tracks the DL program runtime events and the energy consumption of the system, and aligns the two to compute the EDD (details in § 3). Second, there are complex interactions at the application-system interface. For example, most TensorFlow applications are multi-threaded and run on heterogeneous platforms, i.e., with both CPUs and GPUs. How to account for energy consumption in this setting is non-trivial (see § 3 for details).

Understanding BERT Energy Behavior
To evaluate the effectiveness of tensor-based energy accounting, we apply Smaragdine to BERT [10], a widely used text analysis model. In the domain of natural language processing (NLP), BERT plays a central role in powering numerous NLP end-user applications. In § 5, we show how the BERT application can be hierarchically decomposed into EDDs. We also show how BERT transformers [46], especially their attention modules, dominate the energy consumption in different stages of BERT use, from pre-training to fine-tuning to prediction. Throughout the experiments, we find that Smaragdine incurs low overhead while retaining high precision and stability.
To further demonstrate the usefulness of Smaragdine in building the downstream toolchain, we use BERT in two case studies. First, we provide a white-box study on the impact of hyperparameter tuning of BERT. We show the energy/power consumption trends with different configurations of the number of layers and the number of hidden embeddings. Second, we conduct an evolutionary study to compare BERT with a newer variant, ALBERT [25]. Through Smaragdine, we show how the energy behavior has evolved from BERT to ALBERT.

Contributions
To the best of our knowledge, Smaragdine is the first white-box energy accounting system for tensor-based DL programs. This paper makes the following contributions:
• a distinct methodology for the energy accounting of DL programs where energy accounting is aware of the internal structure of the DL program, i.e., semantics-aware, illustrated by EDDs (see § 3)
• a trace-based alignment algorithm for bridging the semantic gap of energy accounting while considering complex application-system interactions (see § 3)
• an in-depth evaluation of Smaragdine through BERT, revealing the module-level, layer-level, and tensor-level energy behavior of BERT (see § 5)
• a comparative study on the energy/power impact of hyperparameter tuning in BERT, and the energy/power behavior evolution from BERT to its next generation, ALBERT (see § 6)
Smaragdine is an open-source project. The source code and all raw data of this paper can be found at the anonymous site: https://github.com/project-smaragdine/smaragdine.

BACKGROUND
Deep Learning Programs. A DL program is a dataflow program that composes a number of neural networks (NNs), often deep neural networks (DNNs), together. A neural network (NN) is a collection of neurons wired together, where each neuron serves as a transformation function. Each neuron can be associated with a weight, a potentially adjustable value that indicates the importance of the transformation. DNNs hierarchically organize NNs together, each of which is called a layer. Each DNN layer can be implemented by a tensor [33], which generalizes scalar and vector computations over the input/output of the neurons.
Semantically, one may view a DL program as a potentially nested dataflow program. A DL program consists of modules wired together through dataflows, and each module can be implemented either as a tensor, or as another (nested) dataflow program. A concretization of this view is to consider a DL program as a "nested DNN," consisting of layers wired together. Each layer may either be an atomic tensor layer or a composite layer, i.e., another DNN. In the rest of the paper, we adopt this view.
DL programs can be trained, i.e., the neuron weights of their resident NNs are adjusted to fit a data set. Once trained, a DL program can be used for prediction, i.e., estimating an output from an input. DNNs are trained through backpropagation [29,40], where the error of the model's prediction is used to adjust weights. This requires computing the network's gradient (the differential impact of neuron connections) to correctly adjust the weights.
BERT. BERT [10] is a tunable NLP model relying on DNNs. BERT popularized the idea of splitting training into two stages: pre-training and fine-tuning. BERT is initially pre-trained on a general data set, where all model weights are trained for a long time. BERT can then be fine-tuned on a curated data set, where only a subset of model weights are trained for a shorter time. As a result, BERT can be quickly configured to solve specific kinds of text problems without retraining from scratch [10,28,38], leading to a breakthrough in NLP.
For NLP models, two important tasks are embedding and encoding. Embedding represents the input (such as a text) for processing, and encoding addresses the transformation between the input and the output. In design, BERT's encoding module, called an encoder, is a nested DNN: it stacks together a number of modules (i.e., composite layers), each of which is called a transformer [46]. For our purpose, note that a transformer is a nested dataflow program with NNs inside. A key BERT innovation is one of the NNs nested inside a transformer, called self-attention. This unit enables a (now widely used) form of encoding known as bidirectional representation.
TensorFlow. TensorFlow [1] is a machine learning library that supports complex tensor calculus. TensorFlow programs can be designed using the Keras API, a framework for constructing DL programs. For DL training, computing the gradient by hand is nontrivial. TensorFlow therefore automates this process with automatic differentiation [29].

SMARAGDINE DESIGN

Problem Statement
We use $\ell \in \mathrm{LN}$ to represent layer names, $t \in \mathrm{TN}$ tensor (layer) names, and $c \in \mathrm{CN}$ composite layer names, with $\mathrm{LN} = \mathrm{TN} \cup \mathrm{CN}$. We further use $\mathrm{TO}$ to represent the set of tensor operations. It is important to observe that the structure of an EDD in Def. 3.2 mirrors that of the DNN program in Def. 3.1. This is intentional: it is the goal of Smaragdine that the output of energy accounting reflects the logical structure of the DNN program.
Most DL programs rely on a very limited set of tensor operations, i.e., the sizes of $\mathrm{TN}$ and $\mathrm{TO}$ are small. For example, BERT is built on top of a limited number of tensor operations, such as matrix multiplication (MatMul) and the hyperbolic tangent gradient (TanhGrad). However, it is important to note that where tensors appear in the DL program makes them semantically different: a matrix multiplication used for computing self-attention is different from one used in an Einstein summation. To effectively support this difference, we introduce the qualified tensor name (QTN, $q \in \mathrm{QN}$), of the form $\langle [c_1, \ldots, c_n]; t \rangle$ for some $n \geq 0$. Intuitively, this refers to a tensor named $t$ that immediately resides in layer $c_n$, which is in turn nested in $c_{n-1}$, and so on.
Example 3.3 (Qualified Tensor Name and its Shorthand). The QTN ⟨[bert, encoder, layer_0, output, dense]; MatMul⟩ refers to the tensor MatMul defined in the dense layer, which in turn is nested in the output layer, which is nested in layer_0, and so on. From now on, we use its more mnemonic shorthand form bert/encoder/layer_0/output/dense/MatMul.
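The correspondence between a QTN and its shorthand can be illustrated with a minimal Python sketch. The type QualifiedTensorName and the helpers parse_qtn and to_shorthand are hypothetical names introduced here for illustration only; they are not part of Smaragdine.

```python
from typing import NamedTuple, Tuple


class QualifiedTensorName(NamedTuple):
    """A tensor name qualified by the chain of layers that enclose it."""
    layers: Tuple[str, ...]  # enclosing layers, outermost first
    tensor: str              # name of the tensor itself


def parse_qtn(shorthand: str) -> QualifiedTensorName:
    """Split 'a/b/c/Op' into the layers ('a', 'b', 'c') and the tensor 'Op'."""
    *layers, tensor = shorthand.split("/")
    return QualifiedTensorName(tuple(layers), tensor)


def to_shorthand(qtn: QualifiedTensorName) -> str:
    """Inverse of parse_qtn: join the layers and the tensor with '/'."""
    return "/".join((*qtn.layers, qtn.tensor))
```

A tensor that is not nested in any layer (the case of an empty layer list) simply parses to an empty tuple of layers.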
With the QTN, we can now "flatten" the EDD to produce a representation more friendly for tensor-level comparison of energy consumption:
Definition 3.4 (Tensor Energy Footprint). We define a tensor energy footprint (TEF) $\phi$ as a function $\phi : \mathrm{QN} \rightharpoonup \mathrm{ENERGY}$.
Given a program $P$, the transformation between an EDD and a TEF is simple, and is defined in § 3.3.

Algorithm Overview
Design Challenges. As we discussed in § 1, Smaragdine first must overcome a Semantic Gap challenge: while it is generally known how to perform (gray-box) energy accounting in a per-hardware-component or per-thread manner, how such systems-level energy consumption can be mapped to the structure of the DL program is an open question. Second, there is an Application-Systems Interface challenge. A typical DL program is multi-threaded, so addressing concurrency is the rule, not the exception; DL programs also routinely run in a heterogeneous environment, where there may be multiple devices, such as CPUs and GPUs. This complicates accounting further, as devices report their consumption separately.
A Solution Overview. To address the Semantic Gap challenge, Smaragdine is designed as a runtime monitor that performs trace-based alignment: it collects two traces of information at time intervals, and aligns them based on timestamps. The two conceptual traces are (1) a tensor event trace $E : \mathrm{TIMESTAMP} \rightharpoonup \mathrm{QN}$, which temporally records the tensor activities, where each tensor is identified by its QTN; and (2) a power trace $W : \mathrm{TIMESTAMP} \rightharpoonup \mathrm{POWER}$. The core algorithm is conceptually simple: Smaragdine continuously tracks what tensor events have happened, where each event has happened, and how much power the device draws when it happens. Since $E$ contains the semantic information, its alignment with $W$ bridges the semantic gap between the program runtime and the system runtime. Let us now overview several technical challenges:
• Imperfect Alignment: neither $E$ nor $W$ is total; in other words, some timestamps may not have a tensor event or a power reading. In a nutshell, Smaragdine uses a recency-based alignment algorithm: we align a tensor event with the most recent power reading in the timeline.
• Multiplicity: in practice $E$ is not single-valued; in other words, multiple tensor events may occur at the same timestamp. In Smaragdine, all tensor events that happen at the same time are attributed an equal share of the energy consumption.
• Durable Events: tensor events last for a duration. This realistic view is in contrast with our conceptual formulation above, where the $E$ mapping appears to indicate that a tensor event occurs at an instantaneous timestamp. We resolve this with a conceptual algorithm (Algorithm 2) and an optimized algorithm (see the discussion in § 3.3).
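The first two points can be sketched in a few lines of Python. This is an illustrative sketch rather than Smaragdine's implementation: the function names are hypothetical, and treating a reading taken exactly at the queried timestamp as "most recent" is an assumption.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Sequence


def most_recent_reading(timestamps: Sequence[int],
                        readings: Sequence[float],
                        t: int) -> Optional[float]:
    """Return the power reading taken at the latest timestamp <= t.

    `timestamps` must be sorted in ascending order and parallel to `readings`.
    """
    i = bisect_right(timestamps, t)
    return readings[i - 1] if i > 0 else None  # None: no reading taken yet


def split_equally(energy: float, events: List[str]) -> Dict[str, float]:
    """Give every event active at one timestamp an equal share of the energy."""
    shares: Dict[str, float] = {}
    for name in events:
        shares[name] = shares.get(name, 0.0) + energy / len(events)
    return shares
```

For example, with readings taken at timestamps 0, 4, and 8, an event observed at timestamp 5 is aligned with the reading taken at timestamp 4.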
To address the Application-System Interface challenge, Smaragdine behaves as a universal accountant, abstracting the system's consumption as energy traces. First, Smaragdine is aware of the heterogeneity of the system, and trace alignment is performed in a per-device manner. In other words, Smaragdine operates on the device tensor event trace $E_D : \mathrm{DEVICE} \rightharpoonup E$, which maps each device to its own event trace, and the device power trace $W_D : \mathrm{DEVICE} \rightharpoonup W$, which maps each device to its own power trace. Second, Smaragdine is aware of the concurrency of the system. With Multiplicity, the tensor events that are concurrently executed on different threads residing on one device each receive an equal share of the energy consumption of that device. Smaragdine also addresses complex systems behavior such as thread migration, where the mapping between threads and devices is updated.

Algorithm Specification
Key Data Structures. Algorithm 1 presents the key data structures Smaragdine works with at run time. The $E$ and $W$ traces we discussed in the previous section are represented by EventTrace and PowerTrace, respectively. Due to Durable Events, the EventTrace also records the duration of the tensor event (dur), together with its starting time, as shown in Lines 9-13. The QTN of the tensor operation is kept in the op field. To address the Application-Systems Interface challenge, we also record where in the underlying system such an event is happening, in the device field. Possible values are the CPU/GPU units, as shown in the enum definition for Device. Smaragdine provides Start and Stop methods (Lines 19-20) to enable an accounting session. Due to the need for Imperfect Alignment, the utility function Now returns the last interval timestamp that is still smaller than the given timestamp.
TEF Computation. Algorithm 2 specifies the core monitoring algorithm, which ultimately produces the TEF. Due to the technical challenge of Durable Events, we flatten the EventTrace, i.e., we turn it into a DeviceFlatTrace where the event is repeated for every interval covered by its duration. The Flatten function iterates through the EventTrace events and adds each of them to a DeviceFlatTrace from the start to the end of the event (Lines 2-8). Finally, we account operations by iterating over each device's DeviceFlatTrace. The tensor operations in each time interval are each assigned an equal fraction of the total energy of that interval. The TEF, represented in the algorithm as TEFootprint, is aggregated by combining all TSFootprints.
The algorithm implemented by Smaragdine is an optimized version of Algorithm 2. In practice, if a tensor operation has a long duration, the Flatten process in Algorithm 2 would require inserting many entries into the DeviceFlatTrace. Our optimized algorithm keeps track of the start/end timestamps of a tensor operation internally without an explicit flattening process. The specification of this optimized algorithm can be found in the repository.
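The flatten-then-account structure described above can be sketched as follows. This is a simplified Python illustration of the conceptual (unoptimized) algorithm, not the code of Algorithm 2 itself: the Event fields, the device labels, and the millisecond time unit are assumptions, and intervals without a power reading are skipped for brevity rather than aligned with the most recent reading.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    device: str  # hypothetical device label, e.g. "cpu0" or "gpu0"
    op: str      # qualified tensor name in shorthand form
    start: int   # start time, in the same unit as the interval (assumed ms)
    dur: int     # duration, in the same unit


def flatten(events: List[Event], interval: int) -> Dict[str, Dict[int, List[str]]]:
    """Repeat every event once per sampling interval that it overlaps."""
    flat: Dict[str, Dict[int, List[str]]] = defaultdict(lambda: defaultdict(list))
    for ev in events:
        first = ev.start // interval
        last = (ev.start + max(ev.dur, 1) - 1) // interval
        for slot in range(first, last + 1):
            flat[ev.device][slot * interval].append(ev.op)
    return flat


def account(events: List[Event],
            power: Dict[str, Dict[int, float]],
            interval: int,
            seconds_per_unit: float = 1e-3) -> Dict[str, float]:
    """Compute a TEF (op -> joules) from per-device power (watts per interval)."""
    tef: Dict[str, float] = defaultdict(float)
    for device, slots in flatten(events, interval).items():
        for ts, ops in slots.items():
            watts = power.get(device, {}).get(ts)
            if watts is None:
                continue  # simplification: no reading for this interval
            # Energy of the interval, shared equally among its operations.
            share = watts * interval * seconds_per_unit / len(ops)
            for op in ops:
                tef[op] += share
    return dict(tef)
```

Because every interval's energy is divided among the operations active in it, the entries of the resulting TEF sum to the total energy of the intervals in which at least one operation was running.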
EDD Reconstruction. Given a TEF, it is simple to compute a corresponding EDD. Function T2E($\phi$, $P$) computes the EDD for program $P$ as the smallest nested mapping $N$ such that $N(c_1)(c_2) \ldots (c_n)(t) = \phi(q)$ for any $q \in \mathrm{domain}(\phi)$ with $q = \langle [c_1, \ldots, c_n]; t \rangle$.
Summarized TEF. Realistic DNN programs have a complex topology. This implies that a typical TEF in the real world contains numerous entries. From the standpoint of program understanding, it would be desirable if we could combine the energy consumption of "similar" tensors together.
For transformer-based DNN programs such as BERT, an opportunity exists: while such programs consist of a large number of transformers, the internals of all transformers are self-similar, i.e., they have indistinguishable topology [20]. This allows us to sum up the energy consumption of all tensors that reside in self-similar transformers. In BERT, the boundary of a transformer is identified by the module name layer_$i$, where $i$ ranges over the transformer indices. In other words, we can sum up the TEF entries $\phi(q)$ whose qualified names $q$ differ only in the index $i$ of the enclosing layer_$i$ module; we call the result a Summarized Tensor Energy Footprint (STEF). In addition, Smaragdine can also produce the power counterpart of the STEF, which we call the Summarized Tensor Power Footprint (STPF). We elide their verbose definitions in this presentation.
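Both transformations can be sketched in Python over a TEF stored as a dictionary keyed by shorthand QTNs. This is an illustration under assumptions: the function names t2e and summarize are hypothetical, the nested dictionary is one possible encoding of an EDD, and the layer_* placeholder used for merged entries is a notation chosen here, not one taken from Smaragdine.

```python
import re
from typing import Dict


def t2e(tef: Dict[str, float]) -> dict:
    """Rebuild a nested EDD: inner dicts are layers, float leaves are tensors."""
    edd: dict = {}
    for name, energy in tef.items():
        *layers, tensor = name.split("/")
        node = edd
        for layer in layers:
            node = node.setdefault(layer, {})
        node[tensor] = node.get(tensor, 0.0) + energy
    return edd


def summarize(tef: Dict[str, float]) -> Dict[str, float]:
    """Merge entries that differ only in the index of an enclosing layer_<i>."""
    stef: Dict[str, float] = {}
    for name, energy in tef.items():
        key = re.sub(r"(^|/)layer_\d+(?=/|$)", r"\1layer_*", name)
        stef[key] = stef.get(key, 0.0) + energy
    return stef
```

Under this encoding, the dense MatMul entries of layer_0 and layer_1 collapse into a single bert/encoder/layer_*/output/dense/MatMul entry whose energy is their sum.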

SMARAGDINE IMPLEMENTATION
Decoupled Monitoring. We choose to implement Smaragdine as a separate process co-running with the monitored application. In our implementation, Smaragdine is written in Rust. The monitored TensorFlow applications over which we evaluate Smaragdine run on Python runtimes. The inter-runtime communication is implemented through gRPC, a widely used language-agnostic remote procedure call framework. At the beginning and end of the training epochs in TensorFlow, we use SessionRunHook to attach callbacks that asynchronously communicate with Smaragdine. The Start and Stop methods of Smaragdine are called upon receiving the gRPC messages. Decoupled monitoring enables a more language-neutral approach toward the monitored application, anticipating the diversity of future ML applications, which may be written in other languages.
Power Sampling. Smaragdine samples the power consumption of all CPU/DRAM/GPU components of the underlying system. The power trace is broken down by device, as described at line 4 of Algorithm 2.
Specifically, the CPU and DRAM energy consumption is obtained through Intel's RAPL interface. RAPL provides Model-Specific Registers (MSRs) that report the cumulative energy consumption of Intel processors, reporting for each domain (i.e., motherboard socket) separately. Within each domain, it further breaks down the report by components, i.e., the CPU cores, the uncore (cache, TLB, etc.), and the DRAM regulator. The MSRs are read through powercap, a Linux power management module where MSR values are exposed as a pseudo-filesystem.
GPU energy consumption is obtained through the NVIDIA Management Library (NVML), an interface for both monitoring and managing NVIDIA devices. The NVML provides high-level querying of GPU devices, including the instantaneous power. Smaragdine samples from the NVML using the nvml-wrapper package, which provides a thin Rust wrapper around the library.
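As an illustration of how cumulative RAPL counters can be turned into energy values, the following Python sketch reads the powercap pseudo-filesystem and computes the energy consumed between two readings. It is a sketch, not Smaragdine's Rust code: the helper names are hypothetical, the path shown is that of the first package domain on a typical Linux system, and handling at most one counter wraparound between readings is an assumption. Reading these files may require elevated privileges.

```python
from pathlib import Path

# powercap directory of the first RAPL package domain; other domains and
# sub-domains (e.g. DRAM) live under different indices.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(domain: Path = RAPL_DIR) -> int:
    """Read the cumulative energy counter, in microjoules."""
    return int((domain / "energy_uj").read_text())


def read_max_range_uj(domain: Path = RAPL_DIR) -> int:
    """Read the range of the counter, i.e. the value at which it wraps."""
    return int((domain / "max_energy_range_uj").read_text())


def energy_delta_joules(before_uj: int, after_uj: int, max_range_uj: int) -> float:
    """Energy consumed between two readings, tolerating one wraparound."""
    delta = after_uj - before_uj
    if delta < 0:  # the counter wrapped between the two readings
        delta += max_range_uj
    return delta / 1e6
```

Dividing such a delta by the time elapsed between the two readings yields the average power over that period.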
Our power sampling period is set at 4 milliseconds, which is the smallest period that we observe where power data are updated in hardware.
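For a device that reports instantaneous power, energy over a window can be approximated by summing the samples and multiplying by the sampling period. The short Python sketch below illustrates this arithmetic with the 4 ms period stated above; the function name is hypothetical, and the rectangle-rule approximation is a choice made here for illustration.

```python
from typing import Sequence

SAMPLING_PERIOD_S = 0.004  # the 4 ms sampling period stated in the text


def energy_from_samples(watts: Sequence[float],
                        period_s: float = SAMPLING_PERIOD_S) -> float:
    """Approximate energy (J) as the sum of power samples (W) times the period (s)."""
    return sum(watts) * period_s
```

For instance, 250 consecutive samples of 100 W cover one second and correspond to roughly 100 J.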
Excluding Energy Consumption by OS and Smaragdine. One practical concern is that the OS maintains a basic level of energy consumption, such as through daemons. In a similar vein, Smaragdine as a co-running process also incurs a small share of energy consumption itself. We need to exclude the energy consumption resulting from the processes outside of our application, including the co-running Smaragdine process. We resort to a prior tool, Eflect, to accomplish this goal. Eflect virtualizes the energy consumption, i.e., it separates the fraction of energy consumption due to a specific application from that of the rest of the system. In other words, each power sample obtained by Smaragdine at line 19 in Algorithm 1 only consists of the fraction of energy consumption due to the monitored application itself. We do not virtualize the energy consumption of the GPUs. Unlike the CPUs, the GPUs only execute the kernels required by our monitored application, without background OS daemon processes. The Smaragdine process itself does not execute on the GPU.
Event Trace Collection. We obtain the TensorFlow event trace through TensorFlow's built-in profiler. The profiler monitors the application with callbacks at the start and end of all operations executed. We start the profiler through the ProfilerHook class, an extension of the SessionRunHook. At the end of execution, the profiler produces an event trace using TensorFlow's timeline API, where events are identified by the tensor's QTN. Combined with the traces from the Smaragdine hook described above, we account the epoch with Algorithm 2.
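A trace of this kind can be converted into event records with a few lines of Python. The sketch below assumes, without confirmation from the text, that the trace is serialized in the Chrome trace-event JSON format, that complete events carry "ts" and "dur" fields in microseconds, that the qualified name is stored under args.name, and that metadata events name each process after its device. The Event type and parse_trace are hypothetical names.

```python
import json
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    device: str    # device name taken from the process metadata, if present
    op: str        # qualified tensor name
    start_us: int  # start timestamp (assumed microseconds)
    dur_us: int    # duration (assumed microseconds)


def parse_trace(text: str) -> List[Event]:
    """Extract duration events from a Chrome-trace-style JSON document."""
    records = json.loads(text).get("traceEvents", [])
    # Metadata events ("ph" == "M") are assumed to name each process (device).
    devices: Dict[int, str] = {
        r["pid"]: r["args"]["name"]
        for r in records
        if r.get("ph") == "M" and r.get("name") == "process_name"
    }
    events = []
    for r in records:
        if r.get("ph") != "X":  # keep only complete (duration) events
            continue
        op = r.get("args", {}).get("name", r["name"])
        device = devices.get(r["pid"], str(r["pid"]))
        events.append(Event(device, op, r["ts"], r["dur"]))
    return events
```

Records produced this way have the shape that the accounting step needs: a device, a qualified name, a start time, and a duration.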

EVALUATION
We present an experimental evaluation of Smaragdine that aims to answer the following questions:
• RQ1: What insights can Smaragdine provide to the designers of DL applications on their energy consumption?
• RQ2: What are the precision, overhead, and scalability characteristics of Smaragdine-based energy accounting?
• RQ3: Can Smaragdine help build the downstream toolchain for understanding DL applications?
We answer RQ1 and RQ2 in this section, and RQ3 in § 6.

Experimental Setup
All experiments are conducted on a server consisting of an Intel Xeon Silver 4300 v3 2.30 GHz CPU with 20 cores, a PNY NVIDIA Quadro P5000 GPU with 2560 CUDA cores, and 64 GB of DDR4 RAM.
The CPU is configured with hyperthreading enabled. The machine runs Debian 11 with the default ondemand governor, where the P-state is on. All experiments were run with TensorFlow 2.8 on Python 3.8. We use BERT through its experiment repository. Both pre-training and fine-tuning were performed for 500 epochs with the BERT repository's recommended parameters:
• A max sequence length of 128 words
BERT is pre-trained with BookCorpus, a collection of free, unpublished novels, and English Wikipedia, which contains annotated entries from a variety of domains. BERT is fine-tuned with the Corpus of Linguistic Acceptability (CoLA) [49] data set. Each experiment described in this paper is repeated 5 times.
Recall from § 2 that BERT operates in three stages: pre-training, fine-tuning, and prediction. Smaragdine is capable of performing energy accounting for all three stages. For pre-training and fine-tuning, we further separate each into the forward and backward passes of the execution, following our discussion in § 2.

A Bird's Eye View of BERT's Energy Consumption
Fig. 1 provides a high-level comparison of the energy consumption of the three stages. Here, pre-training is shown to be the most expensive, consuming over 60 kJ, vs. 45 kJ for fine-tuning. This observation is aligned with our understanding of BERT: pre-training updates more neurons than fine-tuning. Prediction consumes very little energy, around 2 kJ. Prediction does not require the machinery of training, such as batching and iterative execution. As a result, we expect prediction to be the cheapest task.
Since pre-training is the largest consumer, as well as the first step in building a model, we present our energy accounting results for this stage in the rest of this section. The results for the other stages are provided in the public repository (https://github.com/project-smaragdine/smaragdine).

Multi-Grained Energy Accounting
Smaragdine is able to report the energy consumption of a DL program following the logical structure of its hierarchical decomposition:
• (whole) program-level accounting. At the top level, Smaragdine can provide an overview of how the energy consumption of the program is distributed among its top-level layers. For example, Fig. 2 is an EDD that shows the top-level view of BERT energy consumption.
• (composite) layer-level accounting. Smaragdine can show the energy consumption of a composite layer through nested EDDs. For example, Fig. 2 shows the EDD of BERT's encoder layer. Fig. 4 shows a set of hierarchical views of the first transformer (a composite layer in the encoder) and its components.
• tensor-level accounting. At this fine-grained level, Smaragdine reports the energy consumption of tensors, i.e., the "leaves" in the hierarchical decomposition structure. Fig. 5 shows an STEF that provides a summarized view where the energy consumption of tensors is ranked.
There are two key observations. First, the encoder and its enclosed transformers dominate the energy consumption of BERT. BERT has two main tasks: embedding and encoding. As it turns out, the latter accounts for more than 99% of BERT's energy consumption. Within the encoder, the stacked transformers again dominate the energy consumption. Given the central role that transformers play in language models like BERT, this comes as no surprise. Second, the transformers do not consume energy uniformly. The transformers closer to the input are lower consumers, up to 2% less than the average consumption of 8.25±0.61%. The consumption rises until plateauing around 8.55% at the fifth transformer. We speculate that this is due to the power state of the executing devices. To confirm this, we investigated the power trace of the underlying system, with results shown in Fig. 3. The TensorFlow runtime appears to schedule the transformer execution in phases, where the first transformer is executed in the earlier timestamps of each training epoch. Interestingly, the CPU/GPU system starts in a lower-power state, and only ramps up when the workload increases. As a result, operations executed in the early phases (such as the first transformer) consume less energy overall.
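The percentages reported at each level of the hierarchy can be derived mechanically from a nested energy mapping. The Python sketch below shows one way to do so, assuming an EDD encoded as nested dictionaries whose leaves are energy values; the function names and this encoding are hypothetical and serve only as an illustration.

```python
def total(node) -> float:
    """Energy of a tensor leaf, or the summed energy of a composite layer."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return float(node)


def shares(layer: dict) -> dict:
    """Percentage of a layer's energy consumed by each of its direct children."""
    whole = total(layer)
    if whole == 0:
        return {name: 0.0 for name in layer}
    return {name: 100.0 * total(child) / whole for name, child in layer.items()}
```

Applying shares to the top-level mapping gives the program-level view, and applying it to any inner dictionary gives the view of that composite layer.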

Energy Accounting for Nested Layers. Based on our earlier discussion, transformers are the largest consumers in BERT. We use Smaragdine to "zoom in" deeper in the hierarchy, to the first transformer (bert/encoder/layer_0). Fig. 4a shows its EDD.
Two observations are noteworthy. First, the attention layer consumes a significant amount of energy; in addition, the attention layer's energy is primarily consumed in computing self-attention (self). This confirms the important role that the attention mechanism plays in BERT. Second, dense computation dominates the energy consumption. Through the EDDs of the sub-layers in Fig. 4b, 4c, and 4d, we can observe that the dense layers inside intermediate and output dominate the energy consumption. The dense layers are implemented as matrix multiplication, one of the most computationally intensive operations.
The forward pass and the backward pass exhibit similar energy behavior (EDDs in the repository), with one notable exception: the attention layer consumes a noticeably larger share in the backward pass (48.19%) than in the forward pass (40.68%). This difference in share also persists in the self-attention layer: the shares for the backward pass vs. the forward pass are 79.52% vs. 72.51%. In the backward pass, gradient calculation is relatively expensive for attention layers.

5.3.3 Tensor-Level Accounting. Fig. 5 presents the top-10 tensors in the form of STEF and STPF. Here, we highlight three observations. First, matrix multiplication dominates the energy consumption. All top-10 energy-consuming tensors are matrix multiplications across different BERT layers. Second, the backward pass is a much larger consumer than the forward pass. In BERT, all backward passes are included in the top composite layer of gradient. Here, 8 of the top 10 energy-consuming tensors come from the backward pass. This is consistent with the top-level view we showed in Fig. 1, but the STEF here provides a significantly finer-grained view of which tensors contribute to the larger energy consumption. Third, the power consumption of different tensors remains stable. Indeed, regardless of the different semantic purposes that different tensor computations serve, all share the nature of matrix multiplication (MatMul). As power consumption is strongly correlated with the nature of the computation itself, the power remains similar for all MatMul tensors.

Precision. For all sampling-based systems, the precision of the results may be impacted by the sampling design itself, such as how many and how often samples are taken. We evaluate the precision of Smaragdine with two metrics: accounting similarity with sample sparsing (ASSS) and accounting similarity with sample widening (ASSW). Accounting similarity is the Pearson correlation coefficient (PCC) between two STEFs, indexed by the tensor IDs. We show an example in Fig. 6. Intuitively, a higher value of accounting similarity implies that two accounting results demonstrate more similar trends.

ASSS is built upon the intuition that the ground truth is approached when the sampling rate reaches infinity. Recall from § 4 that power consumption is sampled every four milliseconds, and cannot be sampled at a higher rate due to hardware constraints. We circumvent this challenge by computing the accounting similarity when the sampling interval is further lengthened. This counter-intuitive idea is rooted in how discrete systems approximate continuous values: if we view the result at an infinite sampling rate as the limit, the shape of the trajectory along which the limit is approached offers clues on the error of the approximation. The results are shown in Fig. 7a. The similarity oscillates between 0.90 and 0.94. Overall, the curve forms a "plateau": further increasing the sampling rate would likely offer little benefit in changing the trend exhibited in the STEFs.

The intuition behind ASSW is that the ground truth of a sampling-based algorithm can be approached when the number of samples reaches infinity. Recall that each Smaragdine experiment is repeated in 5 runs (see § 5.1). We now compute ASSW by relating the STEFs generated when 2, 3, 4, 5, etc., experiments are conducted, i.e., in 10, 15, 20, 25 runs. The results are shown in Fig. 7b. The similarity is high, between 0.97 and 0.99.
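The accounting similarity used by both metrics can be sketched in a few lines. The sketch below is illustrative only: it assumes a STEF can be represented as a mapping from tensor ID to energy, and that each power sample is attributed to the tensor running at sampling time (a simplification of the accounting algorithm in § 4). The function names (`pcc`, `build_stef`, `accounting_similarity`) and all sample values are hypothetical.

```python
from math import sqrt


def pcc(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def build_stef(samples, period_s):
    """Assumed attribution: each (tensor_id, watts) sample contributes
    watts * period_s joules to the tensor active when it was taken."""
    stef = {}
    for tensor_id, watts in samples:
        stef[tensor_id] = stef.get(tensor_id, 0.0) + watts * period_s
    return stef


def accounting_similarity(stef_a, stef_b):
    """PCC between two STEFs, aligned on the tensor IDs they share."""
    ids = sorted(set(stef_a) & set(stef_b))
    return pcc([stef_a[i] for i in ids], [stef_b[i] for i in ids])


# Hypothetical power samples taken every 4 ms: (tensor ID, power in W).
samples = [
    ("matmul_a", 100.0), ("matmul_a", 110.0), ("matmul_b", 90.0),
    ("softmax", 60.0), ("matmul_a", 105.0), ("matmul_b", 95.0),
    ("softmax", 55.0), ("matmul_b", 92.0),
]

# Sample sparsing: keep every 2nd sample, so the period doubles to 8 ms.
stef_full = build_stef(samples, period_s=0.004)
stef_sparse = build_stef(samples[::2], period_s=0.008)
asss = accounting_similarity(stef_full, stef_sparse)
```

Sample widening follows the same pattern: instead of dropping samples, the STEF built from the default number of runs is compared against STEFs built from progressively more runs.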

Stability. In addition to precision, a sampling system must preserve stability: when the same experiment is repeated, a stable sampling system should produce consistent results. To evaluate this, we compute the accounting similarity between different experiments, shown in Fig. 8. All pairs of experiments have a high accounting similarity. Generally, a PCC greater than 0.7 is considered a strong correlation.

5.4.3 Overhead. Finally, we quantify the overhead Smaragdine introduces to the application under energy accounting. We compare the application under Smaragdine's accounting with the same application without accounting. We report a runtime overhead of −0.052 ± 0.12% and an energy overhead of 0.47 ± 1.10%. Both overheads are well within the margin of error, so Smaragdine is a low-overhead energy accounting system.
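A slightly negative overhead is not a contradiction; it is what run-to-run noise looks like when the true overhead is close to zero. The sketch below shows one plausible way to compute a relative overhead with its spread. It is an assumption about the calculation, not a description of how the reported figures were derived, and the measurements are made up for illustration.

```python
from statistics import mean, stdev


def relative_overhead_percent(with_accounting, without_accounting):
    """Per-run overhead relative to the mean baseline, in percent.

    Returns (mean, sample standard deviation) over the accounted runs.
    """
    baseline = mean(without_accounting)
    per_run = [100.0 * (x - baseline) / baseline for x in with_accounting]
    return mean(per_run), stdev(per_run)


# Hypothetical runtimes in seconds for three runs of each configuration.
runtime_with = [101.0, 99.0, 100.0]
runtime_without = [100.0, 100.0, 100.0]

overhead_mean, overhead_std = relative_overhead_percent(
    runtime_with, runtime_without
)
```

When the spread is larger than the magnitude of the mean, as in the figures reported above, the overhead cannot be distinguished from zero.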

CASE STUDIES
In this section, we describe two case studies that demonstrate the usefulness of Smaragdine for supporting client studies of DL energy consumption. This section is aimed at addressing RQ3.

A Comparative Study on BERT Variants
As shown by the original developers of BERT [10], BERT can be configured with different hyperparameter settings. In particular, there are two important ones that impact the model topology: the number of stacked transformer layers (L), and the number of hidden embeddings (H). Our analysis in § 5 was applied to the largest model described in the original paper, i.e., BERT_BASE: L=12, H=768.
We use Smaragdine to generate the STEFs and STPFs for BERT under alternative hyperparameter settings, with the comparative results shown in Fig. 9. Specifically, Fig. 9a and 9b show a high PCC across all variants in terms of both power and energy. This means that the relative standing of power/energy consumption of different tensors remains stable across different BERT variants. In other words, the top-consuming tensors in one BERT variant are likely the top-consuming tensors in the others too.
Fig. 9c and 9d show the results in mean error difference (MED). According to Fig. 9c, the dominating factor of power consumption is the number of hidden embeddings (H). All BERT variants with H=768 have a higher power consumption than their counterparts where H=512. For different variants with the same number of hidden embeddings, the ones with more layers consume more power.
Fig. 9d shows the energy trend. Interestingly, this figure does not strictly follow the trend exhibited for power consumption. It is true that when the number of layers increases, the energy consumption also increases. However, the number of hidden embeddings is no longer a deciding factor in energy consumption. For example, observe the cell between E and D, which shows that the former has less (mean) energy consumption than the latter, even though the former has more hidden embeddings. This is a reminder to energy-conscious DL program developers: both the number of layers and the number of hidden embeddings have an impact on energy consumption, and neither factor necessarily dominates.

From BERT To ALBERT
Finally, we also apply Smaragdine to ALBERT [25], a variation of BERT with a more efficient training method. Our experiments were conducted over ALBERT. The same hyperparameters are used as those in our default setting of BERT. Fig. 10 presents the STEF and STPF for ALBERT. For all top-10 energy consumers, the EINSUM tensor is used. This refers to Einstein summation, an index-based approach to defining tensor transformations. Indeed, matrix multiplication can also be represented with Einstein summation. From BERT to ALBERT, the transformation has changed from MATMUL to EINSUM, but tensor-based mathematical transformation remains dominant in energy consumption.

ALBERT does exhibit some consumption behavior that differs from BERT. While the distribution of the energy for the STEFs is similar, the values are much smaller, almost half in some cases. In contrast, the STPF of ALBERT shows much higher power consumption (at least 120W) for all of the top-consuming tensors. This indicates that ALBERT is more likely to place GPUs at a higher utilization level. In addition, the standard deviations in power consumption are significantly larger than those of BERT tensors. We speculate that this may result from changes in scheduling from BERT to ALBERT. Topologically, BERT chains the transformers in a sequential stack, i.e., the output of the i-th transformer is fed into the (i+1)-th transformer as input. The topology of ALBERT, however, consists of multiple data pipelines. We think this design change may offer ALBERT more opportunities to execute multiple layers in parallel, non-deterministically. This conjecture is also consistent with the fact that GPUs operate at a higher power in ALBERT than in BERT.
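For reference, the remark that matrix multiplication is a special case of Einstein summation corresponds to the standard identity below, where the repeated index j is summed over. The matrix names A, B, and C are illustrative.

```latex
C_{ik} = \sum_{j} A_{ij} \, B_{jk}
```

In einsum notation this contraction is written as the subscript string `ij,jk->ik`, which is the form accepted by `tf.einsum`.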

RELATED WORK
Energy Estimation for DL. DeLight [39] analytically models the energy cost resulting from forward and backward passes, mathematically captured through the arithmetic operations, activation functions, and propagation errors latent in the DNN. NeuralPower [5] is another analytical approach to estimating the energy consumption of a DNN given its topological details, with a mathematical model to estimate the power consumption and execution time of a DNN inference, together with parameters related to GPU scheduling such as stride size. Energy estimation and energy accounting are fundamentally different problems. While energy estimation offers a priori insight into the DNN energy consumption, energy accounting is conducted a posteriori to answer "what happened." For energy estimation, a model is assumed, e.g., how propagation and its errors are represented, how computations are kernelized, and how the GPU scheduler sets the stride size. Smaragdine is model-less.
Garcia-Martin et al. [14] surveyed the energy estimation approaches for ML applications, with a focus on non-DL systems.
Energy Optimization of DL. A direction that has received significant interest is the energy optimization of DNNs. Domain-specific architectures and accelerators [22,27] often deliver better energy efficiency. As a well-known example, Tensor Processing Units (TPUs) enable more performance- and energy-efficient executions for TensorFlow-based applications. On the algorithm level, there is a long tradition of designing neural networks with better energy efficiency. For example, SqueezeNext [15] and ChamNet [9] consider energy efficiency as a key design constraint. Model compression techniques can often lead to increased energy efficiency, including quantization [16,17,23], pruning [18,26], and distillation [19,21].
Black-box systems-level approaches such as GPOEO [47] and Zeus [50] provide tools to optimize energy consumption on GPUs. For example, Zeus adaptively conducts the training of DNNs with different combinations of batch sizes and GPU power limits, and selects the more energy-optimal configurations on the Pareto curve.
Energy optimization and energy accounting go hand in hand. Smaragdine can complement existing approaches by providing a white-box view of the impact of their energy optimization, describing it in a per-layer or per-tensor manner. § 6 provides examples that demonstrate how Smaragdine can help designers gain insight into energy optimization, i.e., what has really happened inside a DL program when its energy consumption is reduced or changed.
Modular DL. BERT and ALBERT adopt a modular approach for model construction, whose hierarchical decomposition structure is leveraged by Smaragdine for accounting. DNN modularization is common. For example, Inception [43,44] evolves by replacing high-dimension convolutions with a sequence of small ones. Montavon et al. [32] use Taylor decomposition of subnetworks to improve network understandability. In addition, there is emerging work in applying modularization to search for sub-networks that were not part of the design specifications [24,34].
Understanding/Profiling/Debugging DL Programs. TensorFlow has released TensorBoard [1], an API that collects and visualizes key metrics from a DL program runtime. Amazon similarly created the Amazon SageMaker Debugger [37] for real-time monitoring of DL applications. Both of these tools allow designers to compare training and performance metrics. PACE [7] performs accuracy estimation over both the data set and model to identify potential issues before performing full training. Debugging DL programs [6,30,41,48,54] is an important direction in software engineering.

THREATS TO VALIDITY
First, our experiments are constructed on systems only with CPUs and GPUs; no additional accelerators are available. Hardware acceleration for DL programs is a rapidly developing field [22]. Our algorithm can be extended to support additional hardware by extending the energy domains, i.e., the Device construct at Line 6 in Algorithm 1. Second, we rely on RAPL for monitoring CPU power consumption, available primarily on Intel architectures, and on an Nvidia-specific interface for monitoring GPU power consumption. This limitation can be overcome through external meters or power modeling [4,51]. Third, the TensorFlow runtime we currently support is in Python. While this is in sync with the majority of TensorFlow applications (BERT and ALBERT are both developed in Python), we do not have experimental evidence on the effectiveness of TensorFlow energy accounting in other language runtimes. In § 4, we discussed decoupled monitoring, which may facilitate prototyping of our designs for other languages.

CONCLUSION
Tensor-aware energy accounting is a novel methodology where the accounting of energy consumption is aligned with the hierarchical decomposition structure of nested deep neural networks defined in TensorFlow-based deep learning programs. Through energy distribution diagrams and tensor energy footprints, our energy accounting system, Smaragdine, is capable of revealing insights into the white-box energy behavior of two widely used natural language models, BERT and ALBERT.

Definition 3.1 (DL Program). We define a program $p \in \mathbb{P}$ as a directed graph $\langle N; E \rangle$ where $N = C \cup T$, and $C : \mathrm{CN} \rightharpoonup \mathbb{P}$ is a bijective partial function for the set of composite layers, $T : \mathrm{TN} \rightharpoonup \mathrm{TO}$ is a bijective partial function for the set of tensor layers, and $E : \mathrm{LN} \rightharpoonup \mathrm{LN}$ is a partial mapping denoting the dataflow among layers.

The primary goal of Smaragdine, i.e., tensor-level energy accounting, is to produce an EDD:

Definition 3.2 (Energy Distribution Diagram). We define an EDD $d \in \mathrm{EDD}$ as a directed graph $\langle N; E \rangle$ where $N = C \cup T$, and $C : \mathrm{CN} \rightharpoonup \mathrm{EDD}$ is a partial function for the set of EDDs (indexed by composite layer names), and $T : \mathrm{TN} \rightharpoonup \mathrm{ENERGY}$ is a partial function for the set of energy consumption values (indexed by tensor names).
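The recursive shape of an EDD can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not Smaragdine's implementation: the class name `EDD`, its field names, the helper methods, the layer names, and the energy values are all hypothetical. Composite layers map to nested EDDs, tensor layers map to energy values, and the dataflow is a partial mapping between layer names.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class EDD:
    """Energy distribution diagram, mirroring the definition above."""
    composites: Dict[str, "EDD"] = field(default_factory=dict)  # composite layer name -> nested EDD
    tensors: Dict[str, float] = field(default_factory=dict)     # tensor name -> energy in joules
    dataflow: Dict[str, str] = field(default_factory=dict)      # layer name -> successor layer name

    def total_energy(self) -> float:
        """Energy of this level plus all nested composite layers."""
        nested = sum(c.total_energy() for c in self.composites.values())
        return sum(self.tensors.values()) + nested

    def flatten(self, prefix: str = "") -> Iterator[Tuple[str, float]]:
        """Yield (slash-separated path, energy) for every tensor in the hierarchy."""
        for name, joules in self.tensors.items():
            yield prefix + name, joules
        for name, child in self.composites.items():
            yield from child.flatten(prefix + name + "/")


# Hypothetical two-level hierarchy with made-up energy values.
attention = EDD(tensors={"MatMul": 2.5, "Softmax": 1.5})
layer_0 = EDD(
    composites={"attention": attention},
    tensors={"LayerNorm": 2.0},
    dataflow={"attention": "LayerNorm"},
)

per_tensor = dict(layer_0.flatten("layer_0/"))
```

Flattening the hierarchy yields one energy value per tensor path, which is the kind of per-tensor view a STEF presents, while `total_energy` gives the per-layer roll-up.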

Figure 1: BERT Total Energy Consumption (The forward and backward passes are shown as stacked bars. Throughout the paper, the whiskers show the standard deviation.)

Recall from § 2 that BERT operates in three stages: pre-training, fine-tuning, and prediction. Smaragdine is capable of performing energy accounting for all 3 stages. For pre-training and fine-tuning, we further separate each into the forward and backward passes of the execution, following our discussion in § 2. Fig. 1 provides a high-level comparison of the energy consumption of the 3 stages. Here, pre-training is shown as the most expensive, consuming over 60 kJ, vs. fine-tuning's 45 kJ. This observation is aligned with our understanding of BERT: pre-training updates more neurons than fine-tuning. Prediction consumes very little energy, around 2 kJ. Prediction does not require the machinery of training, such as batching and iterative execution. As a result, we expect prediction to be the cheapest task. Since pre-training is the largest consumer, as well as the first step in building a model, we will present our results of energy accounting of this stage for the rest of this section. The results for other stages are provided in the public repository (https://github.com/projectsmaragdine/smaragdine).

Figure 3: BERT Single-Epoch Power Trace (The X-axis is elapsed time and the Y-axis is the power consumption. Each point in the figure is the average of power consumption of all training epochs at the same elapsed time. The orange data points show the time stamps when the first transformer is executed, while those in blue show the time stamps when other transformers are in execution.)

Figure 4: Transformer-Level EDDs for the Forward Pass of BERT's First Transformer (bert/encoder/layer_0) and its Nested Layers.

Figure 6: Accounting Similarity (STEF_2 has half the sampling rate of STEF_1. The PCC of the two STEFs is shown to the right.)

Figure 7: ASSS and ASSW (In the left subfigure, the X-axis is the sampling period and the Y-axis is the ASSS value. In the right subfigure, the X-axis is the number of experiments, and the Y-axis is the ASSW value. The PCC is computed between the STEF of Smaragdine's default setting and that of the setting on the X-axis.)

Figure 8: Accounting Similarity across Experiments (Each box is the PCC between two experiments indexed by a-i. Each experiment is the default 5-run by Smaragdine.)