YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs

We address the challenges associated with deploying neural networks on CPUs, with a particular focus on minimizing inference time while maintaining accuracy. Our novel approach is to use the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities using heuristic-guided analysis and a code generation framework, which enables exploration of various Single Instruction, Multiple Data (SIMD) implementations to achieve optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while also maximizing both input and weight reuse consistently yields the best performance for a wide variety of inference workloads, achieving up to 3x speedup for 8-bit neural networks, and up to 4.8x speedup for binary neural networks, respectively, over the optimized implementations of neural networks today.


I. INTRODUCTION
In recent years, neural networks have expanded their reach beyond high-performance computing environments, permeating low-end servers and edge devices such as smartphones, IoT devices, and smart sensors [1]-[4]. However, the deployment of neural networks on these devices presents various challenges, with inference time being a critical factor [4]-[8]. The Single Instruction, Multiple Data (SIMD) capabilities of contemporary CPUs present an opportunity to accelerate neural networks. SIMD allows a single instruction to be executed on multiple data elements concurrently, thereby substantially improving computational throughput and overall performance, and yielding benefits in terms of both energy conservation and efficient utilization of computational resources [9]-[11].
Dataflow refers to an execution order of computational operations of a neural network, and it is an important consideration when utilizing SIMD for inference. It determines the reuse opportunities of different variables (e.g., inputs, weights, and outputs), and can therefore guide how to best allocate valuable SIMD register resources to maximize reuse. While dataflows for deep learning accelerators have been extensively explored [12]-[15], the majority of previous studies and libraries for CPUs do not consider dataflows [16]-[19]. Instead, the weight stationary dataflow, i.e., reusing the same weight value until all computations requiring it are done before moving on to the next weight value, is widely adopted [20]-[22]. However, we found that by adopting a carefully designed dataflow and co-optimizing it with other techniques (i.e., blocking and operator fusion), the inference speed can be improved significantly: up to 3.5 times compared to state-of-the-art implementations of 8-bit integer networks [18], and >10 / 4.8 times compared to optimized bitserial [18], [23] / state-of-the-art SIMD [20] implementations of binary neural networks, respectively.
The nuances of SIMD optimization, such as ensuring nondependency in vector register values, are highlighted in [27], [31]. These complexities are compounded by the reliance on fragile heuristics in current autovectorization techniques, as critiqued in [26], [30], [33]. This is also true for highly optimized frameworks like TVM [18], as they rely on compiler backends such as LLVM [34]. With these challenges, the burden of SIMD optimization predominantly lies with programmers. Consequently, there is a pressing need for a systematic approach to maximize SIMD implementation efficiency.
To this end, we present the first work that employs the notion of dataflow to systematically explore the full SIMD computation capacities on CPUs for efficient neural network inference. The major contributions include:
1) We extended the existing dataflows, which typically specify only one type of variable to be reused, by allowing all types of variables to be reused. Extended dataflows enable systematic exploration to fully utilize SIMD register resources, and substantially reduce costs associated with data and instruction movements.
2) We formalized a set of heuristics, based on data movement costs, to optimize three basic, general neural network dataflows (defined in Sec. II) by maximizing reuse opportunities within each dataflow.
3) We implemented a code generator that automatically uses SIMD instructions to implement the three basic dataflows and various extended dataflows, for any given neural network configuration. This code generator allows us to compare different dataflows to determine the most efficient implementation.
4) We quantitatively compared our best implementation against state-of-the-art implementations using representative workloads, and show that our results achieve substantial improvements: up to 3.5x speedup for 8-bit neural networks (against TVM [18]), and up to 4.8x speedup for binary neural networks (against [20]), respectively.

II. BASIC DATAFLOWS OF NEURAL NETWORKS
Three major, basic dataflows have been identified in the literature¹ [36]-[38], as shown in Algorithms 1, 2, and 3 in the semantics of ARM SIMD intrinsics [39], using convolution layers as an example.

A. Input Stationary (IS)
IS operates by iterating through the input tensor. It applies all relevant filters to each input and accumulates the results to the respective entries in the output.

Algorithm 1 IS Dataflow for Convolution Layers.
Require: inputs[H], weights[R], outputs[E]
for h in H do
    input ← vload(&inputs[h]);
    for r in R do
        weight ← vload(&weights[r]);
        calculate e from h, r;
        outputs[e] += vredsum(vmul(input, weight));
    end for
end for

¹ We exclude dataflows that are specifically tailored to specific deep learning accelerator architectures (e.g., Row-stationary [12], No-local-reuse [35], etc.) as they cannot be applied to CPUs. For example, row-stationary keeps software variables stationary in the rows of processing engines of a 2D systolic array; however, there is no notion of "rows of cores" in CPUs.
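As a scalar illustration of Algorithm 1, the following minimal sketch assumes a 1-D, single-channel convolution without padding. The function name and the use of plain Python lists in place of SIMD vectors are assumptions made for illustration, not the paper's generated code. Each input is visited exactly once, and every weight that touches it scatters a partial sum into the matching output.

```python
def conv1d_input_stationary(inputs, weights, stride=1):
    """Input-stationary 1-D convolution (no padding); scalar stand-in for Algorithm 1."""
    H, R = len(inputs), len(weights)
    E = (H - R) // stride + 1              # number of outputs
    outputs = [0] * E
    for h in range(H):                     # anchor: one input at a time
        x = inputs[h]                      # stands in for vload(&inputs[h])
        for r in range(R):
            # "calculate e from h, r": output e consumes input e*stride + r
            if (h - r) % stride != 0:
                continue                   # this weight never meets this input
            e = (h - r) // stride
            if 0 <= e < E:
                outputs[e] += x * weights[r]
    return outputs
```

Note that the output index e changes on every inner iteration, which is why this order scatters partial sums across the output tensor.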

B. Weight Stationary (WS)
WS iterates through the weight tensor. For each output entry whose computation depends on the current weight tensor, WS collects each relevant entry from the input for computations and accumulates the result to the corresponding output.
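The weight-stationary order described above can be sketched in the same scalar style. This is an illustrative example for a 1-D, single-channel convolution without padding; the function name is hypothetical and plain Python lists stand in for SIMD vectors.

```python
def conv1d_weight_stationary(inputs, weights, stride=1):
    """Weight-stationary 1-D convolution (no padding), scalar illustration."""
    H, R = len(inputs), len(weights)
    E = (H - R) // stride + 1              # number of outputs
    outputs = [0] * E
    for r in range(R):                     # anchor: one weight at a time
        w = weights[r]                     # loaded once, reused for all E outputs
        for e in range(E):
            # collect the input that this weight pairs with for output e
            outputs[e] += inputs[e * stride + r] * w
    return outputs
```

Here every weight sweeps over the whole output tensor, so each output is read and written R times unless it is kept in a register.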

D. Memory Layout and Computation Order
Naturally, the computation order under a dataflow follows the sequential memory addresses of the corresponding data elements. We illustrate the memory layout scheme in Fig. 1.
We opt for the NCHW[xc] memory layout for each input/output tensor. In traditional NCHW alignment, tensors are arranged by first the number of images (batch size, N), then channels (C), followed by height (H), and lastly width (W). In NCHW[xc], data are grouped into blocks of size x × H × W, and we call these blocks channel blocks. The channel blocks follow the NCHW layout, while data in each channel block follows the HW[xc] layout, and x is typically chosen so that x × element width is a multiple of the size of the physical vector registers (1-3× in our implementation).
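To make the layout concrete, the sketch below computes the flat memory offset of an element in an NCHW[xc] tensor. It assumes that C is divisible by x and that the tensor is stored as (N, C/x, H, W, x); the function name is hypothetical. In this block H and W denote height and width as in the layout name, not the tensor sizes used later in the paper.

```python
def nchw_xc_offset(n, c, h, w, C, H, W, x):
    """Flat offset of element (n, c, h, w) in an NCHW[xc] tensor (assumes C % x == 0)."""
    assert C % x == 0, "sketch assumes the channel count is a multiple of x"
    block, lane = divmod(c, x)             # channel block index and position inside it
    # Channel blocks follow NCHW; inside a block the x channels are innermost (HW[xc]).
    return (((n * (C // x) + block) * H + h) * W + w) * x + lane
```

With this indexing, the x channels of one spatial position are contiguous, so a single vector load can fetch them.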
There are two main reasons for this memory layout choice. First, vectorization in the channel dimension streamlines vector computations, avoiding excessive operations such as shifting, because the number of channels multiplied by data size in a neural network layer is usually a multiple of SIMD register length (or vice versa). Previous works have demonstrated the effectiveness of this scheme for floating-point, integer and binary neural networks [19], [20], [40].
Second, NCHW[xc] enables data reuse between successive channel blocks. With NHWC, no element engages in calculations for two successive elements, whether inputs, weights, or outputs, under any dataflow. In contrast, NCHW[xc] enables various dataflows to be exploited to maximize data reuse (see Sec. III). Note that, for binary networks, NHWC can be largely the same as NCHW[xc] in performance since the number of channels in most network architectures is ≤ 512 and a multiple of vector register size in modern ISAs [19].
To optimize weight data access locality, we adopt the CKRS[xc] memory layout (matching the input/output tensor layout), where C, K, R, S denote #Input Channels, #Output Channels, #rows/filter height, #columns/filter width, respectively, and x in the notation for the weight tensor is chosen to be exactly the x of the input tensor.Following this layout, the output tensors can be written back sequentially regardless of the size of the input/output channel blocks and dataflows.
In terms of the compute order across input channel blocks, for better memory locality (as validated by our observation), we proceed along the output channel dimension before moving on to the next input channel block. In other words, the loop on the input channel dimension is an outer loop of that on the output channel dimension.

E. Implementation and Performance of Basic SIMD Dataflows
In software, we declare three vector variables to implement any of the three basic dataflows, one for each of the input, weight, and output data types. The size of each vector variable is x × element width (as shown in Fig. 1), which is a multiple of the vector register size. Also, the total size of all vector variables is less than or equal to the total size of all vector registers. We distinguish vector variables from vector registers because physical vector registers in some architectures can be concatenated to form longer vectors. For example, in ARM, vector registers are 128 bits in size, but vector variables can be multiples of 128 bits occupying multiple physical registers.
We compared the three basic dataflows (the experiment setup is outlined in Sec. V), and the results can be found in Fig. 2. We see that OS consistently outperforms the others in all tests conducted in terms of runtime. With a stride of 1, OS is, by median, 1.93x and 3.41x faster than IS and WS, respectively. With a stride of 2, OS is, by median, 5.39x and 2.81x faster than IS and WS, respectively. The superior performance of OS is due to a multitude of factors including lowered numbers of reduction sum operations, reduced output tensor data movement, and more regular instruction and memory access patterns.
While basic dataflows capture the reuse opportunities of the data that are active in the current computation, they only utilize a limited number of vector registers (precisely 3 × vector variable size / vector register size), leaving all others idle. This is because, as discussed in Sec. I, compilers today are not able to discover vectorizable code, except in simple cases, and fully utilize all vector registers automatically. This motivates extending the basic dataflows for faster inference.
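The register count above can be worked out for the vector variable sizes used in this paper. The sketch assumes a register file of 32 vector registers of 128 bits each, as on AArch64 NEON; the 128-bit register size is from the text, while the count of 32 and the function name are assumptions of this example.

```python
def basic_dataflow_register_use(vec_var_bits, vec_reg_bits=128, total_regs=32):
    """Registers used by a basic dataflow and registers left idle (assumed 32-entry file)."""
    n = vec_var_bits // vec_reg_bits       # physical registers per vector variable
    used = 3 * n                           # one variable each for input, weight, output
    return used, total_regs - used

for bits in (128, 256, 512):
    used, idle = basic_dataflow_register_use(bits)
    print(f"{bits}-bit vector variables: {used} registers used, {idle} idle")
```

Under these assumptions, even the widest vector variables leave well over half of the register file unused, which is the headroom the extended dataflows aim to exploit.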

III. EXTENDING THE BASIC DATAFLOWS
We say that a dataflow utilizes the stationarity of some data if it keeps that data close to the compute units (in vector registers in our case) for reuse. A dataflow is σ stationary if it uses σ stationarity, where σ is a predefined type of data (inputs, weights, or outputs). We extend the notion of dataflow by defining two types of stationarities, i.e., anchoring stationarities and auxiliary stationarities.
Anchoring stationarity is the stationarity that decides the execution order of computations. For example, output stationary dataflows have the outputs as their anchoring data type, so we always complete all computations involving an output element before moving on to the next. One dataflow can have at most one anchoring stationarity. The most naive implementation of a dataflow consists of an anchoring stationarity only, which is equivalent to one of the basic dataflows discussed in Sec. II. The major limitation of the basic dataflows is that not all vector registers are utilized.
In optimized implementations, vector registers are fully utilized to stash data to lower the data movement costs associated with both anchoring and non-anchoring data types (non-anchoring data types are also referred to as auxiliary data types). The auxiliary stationarities determine which auxiliary data types should be allocated in vector registers. For example, an output-anchored dataflow may be accompanied by weight and/or input auxiliary stationarity. More than one auxiliary stationarity can accompany an anchoring stationarity.
An important question is how to allocate vector registers to store (or stash) anchoring and auxiliary data types, which depends on two factors: (1) the total number of available vector registers, which constrains the overall SIMD capability, and (2) data reuse opportunities, which affect data movement costs and also bound the benefits that can be obtained by stashing the corresponding data in vector registers.

IV. OPTIMIZING EXTENDED DATAFLOWS
Our methodology for optimizing an extended dataflow follows two steps. First, we analyze reuse opportunities and develop heuristics to maximize data reuse benefits within each basic (i.e., anchoring stationarity only) dataflow to derive the corresponding auxiliary stationarities. Next, we empirically compare different implementations of the extended dataflows by varying vector register allocation schemes using a code generator to determine the best dataflow for performance.
While this methodology can be applied to most layers in neural networks, we focus our discussions on convolution layers, including simple convolutions [41], depthwise convolutions [6], [42], grouped convolutions [43], shuffled grouped convolutions [44], and so on. This is because these layers are common, and their latencies are generally longer compared to other layers [5]-[7], [45], [46]. The convolution operation is shown in Fig. 3. Notation-wise, we use ih, iw, fh, fw, oh, ow for input height, input width, filter/weight height, filter/weight width, output height, and output width, s for strides, x for the number of data elements in a vector variable, and H, R, E for the sizes of input, filter/weight, and output tensors. Thus, H = ih · iw, R = fh · fw, and E = oh · ow.

A. Maximizing Data Reuse under Each Basic Dataflow
1) Reuse under Output Stationary Dataflows: Under output-anchored dataflows with the computation sequence following the description in Sec. II-C, all corresponding weights in each channel, totaling R, are reused between the computations for two successive output elements. Additionally, there are (fw − s) · fh reusable input elements involved in the computations for two successive outputs. We demonstrate these reuse opportunities in Fig. 4a. The reuse scheme of inputs is similar for s > 1, as shown in Fig. 4b, differing only in the number of inputs reusable between the computations around two successive outputs.
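The (fw − s) · fh count can be checked by intersecting the input windows of two horizontally adjacent outputs. The sketch below is illustrative only: it considers a single channel without padding, and the helper names are hypothetical.

```python
def window(oh, ow, fh, fw, s):
    """Set of input coordinates read when computing output (oh, ow)."""
    return {(oh * s + r, ow * s + c) for r in range(fh) for c in range(fw)}

def reusable_inputs(fh, fw, s):
    """Inputs shared by two successive outputs in the same output row."""
    return len(window(0, 0, fh, fw, s) & window(0, 1, fh, fw, s))

for fh, fw, s in [(3, 3, 1), (3, 3, 2), (5, 5, 2)]:
    # The direct count agrees with the closed form (fw - s) * fh.
    assert reusable_inputs(fh, fw, s) == (fw - s) * fh
```

For a 3 × 3 filter, for example, six inputs carry over between successive outputs at s = 1 but only three at s = 2.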
2) Reuse under Input Stationary Dataflows: Given the algorithm of input-anchored dataflows (Sec. II-A), when s = 1, all corresponding weights in each channel, totaling R, can be reused between the computations around two successive input elements. Outputs (partial sums) under input-anchored dataflows can be reused in a way similar to how inputs are reused under output-anchored dataflows. We demonstrate this reuse scheme in Fig. 4d. Note that we would need to reverse the sequence of the weights (i.e., following the order of the outputs) to enable this reuse scheme (see Fig. 4d).
When s > 1, reusing both outputs and weights becomes complicated. Not all weights are applied to every input. For s = 2, the number of weights/outputs associated with the computations around one input can be 1, 2, or 4, as demonstrated in Fig. 5. In this case, the reuse opportunities become sparse. Additionally, code structure becomes less regular.
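The 1, 2, or 4 pattern can be reproduced by counting, for every input position, how many (weight, output) pairs touch it. The sketch below assumes a 3 × 3 filter, a single channel, and no padding (the filter size is an assumption of this example, chosen so that the counts match those stated above); the function name is hypothetical.

```python
def weights_per_input(ih, iw, fh, fw, s):
    """Map each input (h, w) to the number of weights/outputs it contributes to."""
    oh = (ih - fh) // s + 1
    ow = (iw - fw) // s + 1
    counts = {}
    for h in range(ih):
        for w in range(iw):
            n = 0
            for r in range(fh):
                for c in range(fw):
                    dh, dw = h - r, w - c
                    # Weight (r, c) meets input (h, w) only if an output sits at (dh/s, dw/s).
                    if dh % s == 0 and dw % s == 0 and 0 <= dh // s < oh and 0 <= dw // s < ow:
                        n += 1
            counts[(h, w)] = n
    return counts

counts_s2 = weights_per_input(9, 9, 3, 3, 2)
print(sorted(set(counts_s2.values())))
```

With s = 1 the same function reports 9 for every interior input of a 3 × 3 filter, which is the regular case that makes weight reuse straightforward.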
3) Reuse under Weight Stationary Dataflows: In weight-anchored dataflows (Sec. II-B), between the computations around two successive weights in an input channel block, all H inputs and E outputs can be reused, as depicted in Fig. 4c.
When using vector registers to stash an input, the input will not be reused in the computation involving each weight when s > 1. On the other hand, stashed outputs are guaranteed to be reused with each weight. As stashing outputs also saves write-related operations and the size of the output tensor is almost always greater than the remaining SIMD vector registers, we will later demonstrate the sufficiency of only supporting output auxiliary stationarity under weight-anchored dataflows.
4) Heuristics to Quantify the Effectiveness of Data Reuse under Each Dataflow: We use the reduction in the number of memory instructions (both read and write, data size = c × elem width) for each input channel as the guiding metric for framing the heuristics for choosing auxiliary stationarities, summarized in Table I. The baseline configurations correspond to the basic dataflow implementations discussed in Sec. II, where only 3 × (vector variable size / vector register size) vector registers are allocated. For the extended dataflows, we utilize additional vector variables (which are mapped to vector registers) for the auxiliary data types to further reduce data movement costs.
a) Output-anchored Dataflows: Independent of the value of s, the numbers of inputs and weights associated with an output element, disregarding edge cases, are always equal to R for each input channel. Thus, every time we stash an input or weight vector variable in one or more vector register(s), the number of memory reads always goes down by the size of the output tensor.
b) Input-anchored Dataflows: When s = 1, the gains from auxiliary allocation mimic those under output-anchored dataflows. We expect a reduction of H memory reads and H memory writes for every vector variable allocated to stash outputs for each input channel block. For each vector variable allocated for stashing weights, we expect a reduction of H memory reads per input channel block. Note that H ≈ E in this case. When s > 1, the gains from auxiliary allocation become complex, as shown in Table I.
c) Weight-anchored Dataflows: Recall from Sec. IV-A3 that we iterate through both the whole input and output tensors under weight-anchored dataflows. While we proceed by 1 element on the output tensor, we need to leap forward by s elements on the input tensor and also increment the starting input index (i.e., the first weight starts with the input at index 0, the second weight starts with the input at index 1, and so forth) for the computations associated with each weight element. This naturally implies that each vector variable allocated for inputs saves R ≈ H/s² memory reads, and each vector variable assigned to stash outputs saves R reads and R writes, respectively, per input channel block.
Guided by the heuristics, we derive the following observations:
Observation 1: Weight-anchored dataflows will gain the least performance improvement from auxiliary stationarities.
Observation 2: Output-anchored dataflows will likely yield better performance than input-anchored dataflows when both are fully optimized.
Observation 3: Under output-anchored dataflows, prioritizing input auxiliary stationarity and prioritizing weight auxiliary stationarity will yield similar results.
Observation 4: Under input-anchored dataflows, prioritizing output auxiliary stationarity will yield better performance than prioritizing weight auxiliary stationarity.
Observation 5: Under weight-anchored dataflows, prioritizing output auxiliary stationarity will yield better performance than prioritizing input auxiliary stationarity.
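The output-anchored heuristic in Sec. IV-A4 can be sanity-checked by counting weight loads directly. The sketch below models one input channel with E outputs and R weights per output, and assumes that stashed weights are loaded once up front and never evicted; the function and parameter names are illustrative, not taken from the code generator.

```python
def weight_loads_output_anchored(E, R, num_wgt_stash):
    """Count weight vector loads for one input channel under an output-anchored loop."""
    stashed = set(range(min(num_wgt_stash, R)))
    loads = len(stashed)                   # preparation: load each stashed weight once
    for e in range(E):                     # anchor: one output at a time
        for r in range(R):
            if r not in stashed:
                loads += 1                 # weight must be fetched from memory again
    return loads

E, R = 56 * 56, 9
saved = weight_loads_output_anchored(E, R, 0) - weight_loads_output_anchored(E, R, 1)
print(saved)   # E - 1: roughly the size of the output tensor per stashed variable
```

The saving per stashed variable is E − 1 rather than exactly E because of the one-time initial load, which the heuristic disregards.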

B. Extended Dataflow Implementations and Code Generator
Based upon the above observations, we develop a code generator to extend all three basic anchoring dataflows with auxiliary stationarities to further determine vector register allocation schemes, which is done by varying the number of vector registers allocated to each type of data. We first allocate a subset of vector registers (sweeping from v_0 to v_{3n−1}, where n = size(vec_var)/size(vec_reg), size(vec_var) ∈ {128, 256, 512}, and size(vec_reg) = 128 in our implementation) to store the vector variables corresponding to the anchoring data type, then the remaining vector registers to the auxiliary data types.
1) Implementation of Output-anchored Dataflows: For each output element under computation, we first determine if the required input and weight elements are already stashed in vector variables. If so, we perform the computation using those stashed data. Otherwise, we load the required data from memory into 2 vector variables of length size(vec_var) = x × element width. Note that the sequence of vector variable usage between every two consecutive outputs is identical for weights but different for inputs. This means that we incur the cost of SIMD data transfer if we assign vector registers in the same way across all unrolled iterations of the weight loop, as the same position on the "window" covering all inputs involved in the computations of an output data would be matched to a different input in two successive iterations.
To circumvent unnecessary data transfers between vector registers used for auxiliary input stationarity, we implement secondary unrolling, performed on the output loop with a magnitude of the least common multiple of all numbers of input vector variables per row (in the input tensor) that are greater than s, so that each iteration of the secondary unrolled loop uses vector variables differently: the specific sequence of allocating input vector variables differs between the computations around two successive outputs if the number of input vector variables in that row is greater than s, and remains the same otherwise. Algorithm 4 demonstrates the sequences of vector variable allocations for input auxiliary stationarity across each secondary-unrolled iteration, and Fig. 6 provides a graphical example of secondary loop unrolling.

Fig. 5: Under input-anchored dataflows: weights and outputs associated with each input when s=2 for each channel. Darker color means more data are associated with that input element.
Fig. 6: Secondary loop unrolling to bypass vector data transfer, using one channel for demonstration.

Algorithm 4 Allocation sequence for inputs under secondary-unrolled output-anchored dataflows. (The same sequence applies for outputs under input-anchored dataflows when s = 1.)
Initialize the original allocation sequence with sequential row-major allocation.
for un in range[1, lcm(all #vector variables per row > stride)) do
    if #vector variables on this row > stride then
        Rotate stash indices on this row left by stride
    else
        The sequence stays the same
    end if
end for
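One possible reading of Algorithm 4 is sketched below. It assumes that the stash is described by the number of stashed input vector variables on each filter row and that "rotate left by stride" means a cyclic shift of the stash indices on that row; the function name and the list-of-lists representation are illustrative, not the code generator's actual interface.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def allocation_sequences(vars_per_row, stride):
    """Stash-index order used by each secondary-unrolled iteration (one list per row)."""
    # Initial sequence: sequential row-major allocation of stash indices.
    rows, nxt = [], 0
    for n in vars_per_row:
        rows.append(list(range(nxt, nxt + n)))
        nxt += n
    # Unroll factor: lcm of all per-row counts that exceed the stride.
    unroll = reduce(lcm, [n for n in vars_per_row if n > stride], 1)
    sequences = [[row[:] for row in rows]]
    for _ in range(1, unroll):
        # Rows with more stashed variables than the stride rotate; others stay put.
        rows = [row[stride:] + row[:stride] if len(row) > stride else row[:] for row in rows]
        sequences.append([row[:] for row in rows])
    return sequences

for seq in allocation_sequences([3, 3, 3], stride=1):
    print(seq)
```

Because each unrolled iteration names its registers explicitly, the data that slides along the window never has to be copied between registers; only the slots whose contents are fully consumed are overwritten by new loads.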
To further minimize data movements, we directly load vectors of input data to be newly stashed into their corresponding vector variables (thereby overwriting the previous data), instead of new vector variables.
It is also worth noting that, through our observations, we found it advantageous to accumulate all results in a single vector register (instead of a scalar register) and execute the reduction sum operation only when all computations involving an output element have been completed.Although this approach consumes more vector registers, it ultimately saves costs related to performing a reduction sum operation on a scalar variable upon the completion of each computation.
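The difference between the two accumulation strategies can be illustrated with plain Python lists standing in for vector registers. This is only a functional sketch with hypothetical function names; it counts reduction operations and does not model their actual cost on hardware.

```python
def dot_reduce_each_step(inputs, weights):
    """Reduce to a scalar after every vector multiply (one reduction per step)."""
    acc, reductions = 0, 0
    for x_vec, w_vec in zip(inputs, weights):
        acc += sum(a * b for a, b in zip(x_vec, w_vec))   # multiply, then reduction sum
        reductions += 1
    return acc, reductions

def dot_reduce_once(inputs, weights):
    """Accumulate lane-wise in a vector and reduce once at the end."""
    acc_vec = [0] * len(inputs[0])
    for x_vec, w_vec in zip(inputs, weights):
        acc_vec = [acc + a * b for acc, a, b in zip(acc_vec, x_vec, w_vec)]  # multiply-accumulate
    return sum(acc_vec), 1                                 # single reduction sum

xs = [[1, 2, 3, 4], [5, 6, 7, 8]]
ws = [[1, 1, 1, 1], [2, 2, 2, 2]]
assert dot_reduce_each_step(xs, ws)[0] == dot_reduce_once(xs, ws)[0]
```

Both variants produce the same output value; the second trades one extra vector accumulator for R − 1 fewer reductions per output element.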
Algorithm 5 summarizes the implementation of output-anchored dataflows.
Algorithm 5 Implementation of Output-anchored Dataflows
Prep 1: Initialize a total of numInStash input vector variables by loading data from the input tensor.
Prep 2: Initialize a total of numWgtStash weight vector variables by loading data from the weight tensor.
for c in ic by x do
    for k in oc do
        for h in oh by s do
            for w in ow by s do    ▷ Secondary Unroll
                Set the anchoring output vector variable to the zero vector
                for r in fh do
                    Overwrite a completely used input stash with the new input by vload(c …
                    Use the stashed vector as weight
                    else weight = vload(c …
                end for
            end for
        end for
    end for
end for

2) Implementation of Input-anchored Dataflows: Under input-anchored dataflows, we can allocate the remaining vector variables to both weights and outputs. When s is 1, we observe that the sequences of vector variable usage between every two consecutive inputs are identical for weight data but different for output data. Similar to the output-anchored dataflows, this means that we incur the cost of vector data transfer if we consistently use variables in the same sequence. Therefore, again, we perform secondary unrolling on the output loop, following a similar procedure as described in Sec. IV-B1, but with the sequence of weights in reverse. We write the stashed outputs back to memory when their usage is complete for this row, i.e., when the output is in the first column of the current window of computation. The pseudocode of input-anchored dataflows is provided in Algorithm 6.
3) Implementation of Weight-anchored Dataflows: Similar to output- and input-anchored dataflows, we describe a concrete and general method to implement weight-anchored dataflows in Algorithm 7. For input and output auxiliary stationarity under weight-anchored dataflows, we always stash the earliest yet unstashed element to exploit locality. We perform a loop split on the weight loop on top of unrolling to write stashed outputs back to memory only when their last usage is complete. When s > 1, inputs are reused once for every s weights.
Our code generator follows Algorithms 5, 6, and 7 to implement various extended dataflows using ARM Intrinsics.Users input the anchoring stationarity, the number of vector variables to be allocated to each auxiliary stationarity, and the layer configurations to generate custom dataflow implementations.

C. End-to-End Optimization of Memory Layout Sequence
Algorithm 6 Implementation of Input-anchored Dataflows
Prep 1: Initialize a total of numInStash input vector variables by loading data from the input tensor.
Prep 2: Initialize a total of numWgtStash weight vector variables by loading data from the weight tensor.
for c in ic by x do
    for k in oc do
        for h in ih do
            for w in iw do
                for ((h′, w′), (r, s)) in assoc_idx(h, w, c) do    ▷ In reverse order
                    ▷ Output and weight indices. See Fig. 5
                    if r · fw + s ∈ stashedWeightsIndices then
                        Use stashed vector as weight
                    …

Consistent memory layout alignment across consecutive layers is a prerequisite for efficient neural network inference. Any layout discrepancy entails the need for transformations, leading to additional overhead. To combat this issue, we resort to the commonly adopted dynamic programming approach based on searched results [20], [47], [48]. The algorithm's strategy hinges on minimizing layout transformations by using costs obtained from repeated runs of different scheduling schemes on each layer, ensuring reduced variance. By leveraging these costs, the algorithm determines optimal layouts that synchronize every two successive layers, thus curtailing the necessity for layout transformations.
In addition, we search for the optimal blocking schemes at compile time by running the program under each of the possible configurations and comparing their performance.
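A dynamic program of the kind described above can be sketched as follows for a simple chain of layers. The layout names, cost values, and the transform_cost callback are hypothetical placeholders; in practice the per-layer costs would come from measured runs, and the cited works may formulate the search differently.

```python
def choose_layouts(layer_costs, transform_cost):
    """Pick one layout per layer, minimizing layer costs plus layout-transform costs.

    layer_costs[i][l] is the measured cost of layer i under layout l;
    transform_cost(a, b) is the cost of converting layout a to b (0 when a == b).
    """
    layouts = list(layer_costs[0])
    # best[l] = (cheapest total cost ending in layout l, layouts chosen so far)
    best = {l: (layer_costs[0][l], [l]) for l in layouts}
    for costs in layer_costs[1:]:
        new_best = {}
        for l in layouts:
            prev = min(layouts, key=lambda p: best[p][0] + transform_cost(p, l))
            total = best[prev][0] + transform_cost(prev, l) + costs[l]
            new_best[l] = (total, best[prev][1] + [l])
        best = new_best
    return min(best.values(), key=lambda t: t[0])

# Hypothetical measured costs for three layers and two candidate layouts.
costs = [{"NCHW4c": 2, "NCHW8c": 4}, {"NCHW4c": 10, "NCHW8c": 2}, {"NCHW4c": 2, "NCHW8c": 6}]
total, plan = choose_layouts(costs, lambda a, b: 0 if a == b else 3)
print(total, plan)
```

In this toy instance the first layer accepts a locally slower layout so that no transformation is needed before the second layer, which is the trade-off the dynamic program resolves.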

V. EXPERIMENT SETUP
Algorithm 7 Implementation of Weight-anchored Dataflows
Prep 1: Initialize a total of numInStash input vector variables by loading data from the input tensor.
Prep 2: Initialize a total of numOutStash output vector variables by setting them to 0's.
for c in ic by x do
    for k in oc do
        for … do
            for h in oh do
                for w in ow do
                    Calculate ih and iw with oh, ow, padding, s
                    if ih · input width + iw < numInStash then
                        Use the stashed vector as input
                        Use the stashed vector as output
                        output = vadd(vmul(input, weight))
                    end if
                end for
            end for
        end for
    end for
end for

We use physical ARM machines to quantitatively evaluate and compare dataflows implemented using our code generator. These experiments encompass executing convolution layers with various combinations of the following parameters, as well as collecting end-to-end runtime results for neural networks, to facilitate a thorough and comprehensive evaluation and comparison of different dataflows.
• Input Size: We focus on larger convolution layers that are time-consuming with input sizes of 56×56 and 112×112.
• Weight Filter Size: We use filters of sizes 3×3, 4×4, and 5×5, as these dimensions are most widely employed.
• Stride: We use strides of 1 and 2, as these values are also the most commonly used.
• Number of Filters: We tested with 128, 256, and 512 filters to compare the different dataflows across various numbers of filters.
• Vector Lengths: 128, 256, and 512, which are supported by modern ISAs such as ARM [28] and x86 [29], [49].
We use the GCC compiler [50] with the most aggressive optimization flags to compile all programs. We ran our experiments on a system with 64-bit quad-core ARM Neoverse-N1 CPUs, which adopt the aarch64 architecture. Each program was executed 100 times to obtain the average run time.
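The parameter sweep and the averaging procedure can be sketched as below. The text states only that "various combinations" were run, so enumerating the full cross-product is an assumption of this example, and the function names are hypothetical; the timing helper is a generic stand-in, not the paper's measurement harness.

```python
import time
from itertools import product

INPUT_SIZES = [56, 112]            # square inputs: 56x56 and 112x112
FILTER_SIZES = [3, 4, 5]           # square filters: 3x3, 4x4, 5x5
STRIDES = [1, 2]
NUM_FILTERS = [128, 256, 512]
VECTOR_LENGTHS = [128, 256, 512]   # bits

def configurations():
    """All combinations of the listed layer parameters (assumed full cross-product)."""
    return list(product(INPUT_SIZES, FILTER_SIZES, STRIDES, NUM_FILTERS, VECTOR_LENGTHS))

def average_runtime(fn, repeats=100):
    """Run fn `repeats` times and return the mean wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

print(len(configurations()))       # number of layer configurations in the full sweep
```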

A. Validation of Heuristics
We generated programs that implement extended dataflows for various convolution layers in ARM Intrinsics and ran experiments following the setup described in Sec. V to validate the heuristics described in Sec. IV.
We primarily present the results for s = 1 because: (1) With output-anchored dataflows, the relative gains from weight and input auxiliary stationarities stay constant regardless of whether s is 1 or 2. (2) For weight-anchored dataflows, according to our heuristics in Sec. IV-A4, the improvement of extended dataflows over the basic, anchoring-only dataflow under s = 2 is expected to be less than that for s = 1. (3) Under input-anchored dataflows, as s increases, the difference between gains from weight and output auxiliary stationarity amplifies, a behavior we have empirically observed. (4) To compare output-anchored and input-anchored dataflows, we aim to determine whether the additional memory writes due to auxiliary output stationarity can exceed the 1.93x difference. Studying this under s = 2 is less insightful, as the difference between OS and IS (5.39x) is considerably larger.

1) Comparing Different Anchoring Stationarities:
Finding 1: Weight-anchored dataflows yield the least improvement from auxiliary dataflow optimizations and are consistently the slowest by a large magnitude.
Weight-anchored dataflows, even when fully optimized, significantly underperform in comparison to other anchoring stationarities (Fig. 7b). Surprisingly, fully optimized output-anchored dataflow implementations are, by median, approximately 7.41x faster than their weight-anchored counterparts. However, when comparing the basic dataflows, we observe only a median performance difference of about 5.44x between WS and OS, and roughly 2.91x between WS and IS, given s = 1. This escalating disparity is attributed to the different performance enhancements yielded by our optimization technique for different anchoring dataflows. As illustrated in Fig. 7a, the introduction of auxiliary stationarities results in a modest median improvement of around 1.08x for WS, while IS and OS enjoy more substantial median speedups of approximately 1.96x and 1.78x, respectively. In fact, we find that adding auxiliary stationarities to the basic WS dataflow can sometimes lengthen the compute time. This is due to a low reuse frequency of the stashed auxiliary data and a more dominant increase in the size of the instruction cache. This result validates Observation 1 derived from our heuristics.
Finding 2: Output-anchored dataflows outperform input-anchored dataflows in the majority of the cases.
While IS seems to gain a larger performance improvement from the addition of auxiliary stationarities, we still find output-anchored dataflows to be superior upon full optimization. For the same convolution layer configuration, optimized output-anchored dataflows are faster than input-anchored dataflows for around 90% of the cases, which validates Observation 2.
Fig. 7: (a) Speedup from the most optimized extended dataflows, normalized to the respective basic dataflows (i.e., results from Fig. 2). (b) Relative latency comparing the most optimized extended dataflows, normalized to the performance of OS.
This finding validates Observation 3. By comparing the latency of dataflows that prioritize allocation for weight auxiliary stationarity and the ones that prioritize input auxiliary stationarity, we observe that neither allocation scheme is consistently superior to the other, and the differences between the two schemes are small (within 6%).
Finding 4: Allocating vector variables to outputs first improves performance compared to prioritizing allocation for weights under input-anchored dataflows.
On average, prioritizing stashing outputs yields an 8% performance gain, which becomes more evident as we increase the vector length. This validates Observation 4.
We find that in almost all cases, prioritizing output auxiliary stationarity brings a performance gain of up to 3% over prioritizing weight auxiliary stationarity. This validates Observation 5; however, the differences are negligible.
3) Optimized Dataflow: From all previous analyses and results, we conclude that the OS-anchored dataflow with auxiliary weight stationarity is the most optimized dataflow in our study. While there is generally little difference between prioritizing auxiliary WS and prioritizing auxiliary IS, we find the former to yield better code readability and more regular instruction patterns. Algorithm 8 summarizes this dataflow.
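To illustrate the idea behind an output-anchored dataflow with auxiliary weight stationarity, the following is a minimal scalar sketch and not the paper's Algorithm 8. It assumes a 1-D convolution with stride 1 and no padding, models vector registers as plain Python variables, and counts how often a weight is fetched from memory. The function name `conv1d_os` and the parameter `n_stash` (the number of weights kept resident across outputs) are hypothetical names introduced only for this example.

```python
def conv1d_os(x, w, n_stash=0):
    """Output-anchored 1-D convolution (stride 1, no padding).

    n_stash is a hypothetical knob: the number of leading weights that are
    loaded once and kept resident (modelling spare vector registers).
    Returns (outputs, number of weight loads from memory).
    """
    fw = len(w)
    n_out = len(x) - fw + 1
    stash = list(w[:n_stash])      # loaded once, before the output loop
    weight_loads = len(stash)
    out = []
    for o in range(n_out):
        acc = 0                    # the anchored output accumulator
        for k in range(fw):
            if k < len(stash):
                wv = stash[k]      # reused across successive outputs
            else:
                wv = w[k]          # reloaded for every output
                weight_loads += 1
            acc += x[o + k] * wv
        out.append(acc)            # one store per finished output
    return out, weight_loads


x = [1, 2, 3, 4, 5]
w = [1, 2, 3]
basic, basic_loads = conv1d_os(x, w)                 # no auxiliary stationarity
stashed, stashed_loads = conv1d_os(x, w, n_stash=3)  # all weights resident
print(basic, basic_loads, stashed_loads)
```

In this toy setting the outputs are identical while the number of weight loads drops from one per multiply-accumulate to one per weight. Real gains depend on how many registers are available and on the layer configuration, as the findings above indicate.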

B. Neural Network Speedup against State-of-the-Art Implementations
Applying the end-to-end optimizations discussed in Sec. IV-C, we compare our technique to state-of-the-art baselines.
For INT8 neural networks, we use TVM as one of the baselines. TVM is a highly optimized machine learning compiler stack for efficient neural network deployment across various hardware platforms [18]. We compare the end-to-end inference latency of variants of ResNet [51] (ResNet-18 and ResNet-34) and VGG [52] (VGG-11, VGG-13, and VGG-16) with TVM-autotuned implementations (we use GridSearchTuner as the KernelTuner, which enumerates the entire search space of configurations [53]) and untuned implementations (TVM default). We set TVM's target architecture and SIMD extension to match the physical machines used for our experiments. Across all network architectures and numbers of threads, we observe a ∼3x speedup over TVM's autotuned implementations, and up to ∼14x over its untuned implementation. Moreover, our multithreading scheme yields comparable scalability. We also compare the end-to-end results with programs generated by gcc/clang (with the highest level of optimization and autovectorization enabled). Our implementations achieve a significant (4x-6x) speedup.
Fig. 8: End-to-end relative speedups for 8-bit quantized neural networks from our techniques, normalized to TVM default mode without autotuning. (Note: for DenseNet-121 we do not have results for TVM default mode and had to use a different tuner (TaskScheduler); we use the first tuning trial as the baseline.)
For the evaluation of binary neural networks, we compare the inference latency of our implementations with Cowan et al.'s TVM-based bitserial implementations [23]. Since the code released by Cowan et al. only works for convolution layers on CPUs (their end-to-end code generation tool targets Raspberry Pi and is not applicable to CPUs), we perform this comparison only for convolution layers. Bitserial implementations, although optimized for low power consumption, do not offer satisfactory inference speed. Notably, our implementations are over 12x faster for various convolution layers. Based on the end-to-end results reported in their paper (which incorporates additional optimizations through microkernel synthesis) [23], we anticipate that our implementations will still outperform theirs by a large margin (6x or higher) in end-to-end comparisons. We also compare our implementations of various convolution layers in VGG against those from [20], and ours achieve up to 4.8x speedup.
a) Unroll-and-Jam: Unroll-and-jam reduces memory access costs by reordering instructions without breaking data dependencies [67]- [70], which can enhance the performance of convolution and fully-connected layers in DNNs [18], [71], [72].Our technique bypasses unneeded load instructions previously handled by jamming, and further jamming can be applied on top of our technique to lower latency.
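As a rough illustration of unroll-and-jam in general (not of the code our framework generates), the sketch below applies it to a matrix-vector product. The outer loop over rows is unrolled by two and the two resulting inner loops are fused, so each vector element is loaded once per pair of rows. The function names and the load counter are hypothetical and exist only for this example.

```python
def matvec_plain(a, v):
    """Reference loop nest: v[j] is loaded once per (row, column) pair."""
    y = [0] * len(a)
    loads = 0
    for i in range(len(a)):
        for j in range(len(v)):
            vj = v[j]
            loads += 1
            y[i] += a[i][j] * vj
    return y, loads


def matvec_unroll_and_jam(a, v):
    """Outer loop unrolled by 2, inner loops jammed: v[j] is shared by two rows."""
    n = len(a)
    y = [0] * n
    loads = 0
    i = 0
    while i + 1 < n:                   # outer loop, unrolled by a factor of 2
        for j in range(len(v)):        # the two inner loops fused into one
            vj = v[j]
            loads += 1
            y[i] += a[i][j] * vj
            y[i + 1] += a[i + 1][j] * vj
        i += 2
    if i < n:                          # remainder row when n is odd
        for j in range(len(v)):
            vj = v[j]
            loads += 1
            y[i] += a[i][j] * vj
    return y, loads


a = [[1, 2], [3, 4], [5, 6]]
v = [1, 1]
print(matvec_plain(a, v), matvec_unroll_and_jam(a, v))
```

Both versions compute the same result; the jammed version simply issues fewer loads of `v`, which is the kind of redundant load that a reuse-aware dataflow avoids by construction.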
b) Winograd Convolution: Winograd convolution reduces the complexity of convolution operations [73]- [77] and there exist various optimizations of its implementation on CPUs [78]- [82]. Utilizing a similar concept of reusing data to speed up convolution inference, DREW [83] optimizes Winograd convolution by clustering data and reusing computed results, trading off accuracy against inference performance. In contrast, our method retains accuracy and suits all architectures with SIMD support. Moreover, standard Winograd convolutions struggle with quantization [80], [84]- [86], while our technique does not suffer from this limitation.
c) Transformer Optimizations: Transformers have revolutionized several areas of machine learning [87]- [92]. However, optimizing their performance, particularly on CPUs, remains a significant challenge [93]- [96]. Efforts to date include pruning [97]- [100], quantization [101]- [104], knowledge distillation [105]- [108], architecture search [94], [109], [110], GEMM optimizations [95], [96], [111], and hardware-level optimizations [93], [112]. Moreover, while there exist previous works studying dataflows for transformers on other hardware platforms [113]- [116], no dataflow work has been done on CPUs to the best of our knowledge. Our technique is orthogonal to and may be combined with other Transformer optimization techniques such as GEMM optimizations (e.g., [96]).
d) Intel AMX Extension: Intel's AMX [49] is designed to accelerate matrix-level operations on CPUs, and is only available in high-performance processors like the 4th Generation Xeon Scalable Processors [117]. Our research focuses on prevalent SIMD extensions. Moreover, it is essential to develop dataflows that maximize data reuse opportunities in AMX to further optimize its performance, and our methodology may be extended for this purpose.
e) Binary Neural Network Optimizations: Frameworks that specifically optimize binary neural networks exist. An example is daBNN [118], which employs various assembly-level microkernels to optimize performance. However, daBNN fails to harvest all data reuse opportunities, such as reusing input data between two successive outputs, or reusing weight data. By combining our dataflow technique with daBNN, further improvements can be achieved.

Fig. 1: Memory layout of tensors. Red arrows show a subset of data elements following sequential memory addresses. Input channel blocks are traversed first along the output channel dimension. The purple shade covers a single vector variable.

Fig. 2: Relative latency of basic dataflows for various convolution layers for Vector Length = (elem width × c) ∈ {128, 256, 512} (mean of 100 runs), normalized to the latency of OS. Configurations on the y-axes are in the format of (fw/fh, iw/ih, nf).

Fig. 4: Reuse opportunities under each anchoring dataflow, showing only one channel and one kernel.

2) Findings Related to Auxiliary Stationarity: Here, we compare different auxiliary stationarity schemes under each anchoring dataflow.

Finding 3: Prioritizing stashing inputs or weights does not significantly impact performance under output-anchored dataflows.

Finding 5: Prioritizing output allocation yields only slightly better performance than prioritizing input allocation under weight-anchored dataflows.
Algorithm 2 WS Dataflow for Convolution Layers.
OS iterates through the output tensor. It performs all necessary multiply-accumulate computations to obtain the final result for one output entry before moving on to the next.
Algorithm 3 OS Dataflow for Convolution Layers.
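The difference between the two loop orders can be sketched in scalar Python. This is an illustrative simplification rather than a transcription of Algorithms 2 and 3: it assumes a single input channel, a single kernel, stride 1, and no padding, and the function names `conv2d_os` and `conv2d_ws` are hypothetical.

```python
def conv2d_os(x, w):
    """Output-stationary order: finish one output entry before the next."""
    fh, fw = len(w), len(w[0])
    oh, ow = len(x) - fh + 1, len(x[0]) - fw + 1
    out = [[0] * ow for _ in range(oh)]
    for oy in range(oh):
        for ox in range(ow):
            acc = 0                      # accumulator stays "in a register"
            for ky in range(fh):
                for kx in range(fw):
                    acc += x[oy + ky][ox + kx] * w[ky][kx]
            out[oy][ox] = acc            # single store per output entry
    return out


def conv2d_ws(x, w):
    """Weight-stationary order: use one weight everywhere before the next."""
    fh, fw = len(w), len(w[0])
    oh, ow = len(x) - fh + 1, len(x[0]) - fw + 1
    out = [[0] * ow for _ in range(oh)]
    for ky in range(fh):
        for kx in range(fw):
            wv = w[ky][kx]               # the weight is held fixed
            for oy in range(oh):
                for ox in range(ow):
                    # every output is read and written once per weight
                    out[oy][ox] += x[oy + ky][ox + kx] * wv
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]
print(conv2d_os(x, w), conv2d_ws(x, w))
```

Both orders produce the same output tensor. They differ in which operand stays fixed in the innermost loops, and therefore in which loads and stores are repeated: the OS order writes each output once, whereas the WS order revisits every partial sum once per weight.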

TABLE I: Summary of gains from auxiliary allocation for each operation involving one channel block and one kernel