Speech Understanding on Tiny Devices with A Learning Cache

This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2 MB. Evaluated on challenging speech benchmarks, our system resolves 45%--90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.


INTRODUCTION
Speech is a pervasive human-computer interface. It is particularly suitable for small embedded devices where form factors are limited and hands-free interactions are preferred. Examples include voice user interfaces, smart home gadgets (such as smart lights, speakers, and thermostats), and fitness trackers. The task that underpins speech interfaces is spoken language understanding (SLU), which translates a spoken utterance in the form of an audio wave to a structured semantic label, e.g. "Turn the temperature in the bedroom up" is mapped to {intent: "change_temperature", slots: [{action: increase, object: heat, location: bedroom}]}. By 2023, 4.2 billion devices are estimated to have voice input capabilities [5].
To deploy SLU on small embedded devices, neither on-device execution nor offloading is ideal: (1) On-device execution is constrained by resources. Satisfactory SLU accuracy often requires a deep neural network, typically an encoder-decoder architecture with transformers [23,28,32]. Their memory footprint (hundreds of megabytes) far exceeds the memory on tiny devices, which is on the order of a few megabytes [76]. While recent research has tailored SLU to small devices [15,25], the resultant models are limited to predefined, simple commands such as "Lights up"; the models fall short of handling the longer and more complex speech commonly seen in daily life [45].
(2) Offloading inputs to the cloud incurs higher delays and costs. The delays consist of waking up network interfaces, establishing connections, and transmitting data. Users are well known to be sensitive to delays in speech interactions [41,59]. The cost of cloud inference is also a rising concern, shared by both research communities [73] and industry stakeholders [7,11,12]. As a notable example, Qualcomm estimates that each machine learning (ML) query costs 10x more than a conventional cloud API call [7].
To this end, this paper sets out to integrate on-device execution (for reduced delay/cost) with cloud offloading (for well-established accuracy) in a novel way.
Insight: locality in spoken inputs. Our key observation is the temporal locality in voice inputs: for a device, the received voice commands are likely uttered by a small number of users; most voice inputs are recurring with nearly identical transcripts. See §2.3 for detailed evidence.
The temporal locality implies a new opportunity for system efficiency: the device may match any new input against cached ones and directly emit user intents in case of high similarity. The rationale is that the device could resolve a significant fraction of inputs without offloading, reducing latency while still retaining accuracy across all inputs.
Figure 1: Our system in the design space of speech understanding systems.
While the idea seems simple, realizing it poses daunting challenges. For one thing, raw speech signals exhibit strong acoustic variability induced by disturbance and noise [42]. For another, even utterances of the same transcript by the same speaker vary notably in the acoustic domain, which renders straightforward waveform comparisons ineffective. The drastic difference can be seen in Figure 3.

Match at which level of representations?
The choice of voice features requires careful examination and a thoughtful tradeoff. On one hand, to be robust against variations, our system should match utterances represented by higher-level speech features. On the other hand, extracting higher-level features is beyond the capacity of small feature extractors that can fit in the devices. In principle, our chosen speech representations should generalize beyond individual utterance instances; yet, they should not over-generalize, e.g., extracting text from utterances is overkill for matching repeated utterances.
How to ensure accuracy? Lightweight feature extractors often output noisy features. To achieve competitive match accuracy, our ideas are twofold: (1) specialization: instantiating multiple versions of feature extractors for one device and for utterances of similar lengths; (2) learning from offloading: using the SLU results returned by the cloud as a supervision signal, continuously tuning the on-device feature extractors. Combined, the two ideas specialize a device's extractors to the inputs received by the device in situ.
Our system is a learning cache for SLU on embedded devices. As shown in Figure 1(a), SC entails two major designs: (1) it caches recent offload results: input utterances as represented by low- or middle-level acoustic features; (2) it uses the recent offload results to finetune the device's feature extractors for specialization. SC matches new inputs first by clustered raw frequency vectors (referred to as L1, a low-level representation) and, in case of L1 mismatch, by CTC (Connectionist Temporal Classification) loss [30] over phoneme sequences (referred to as L2, a middle-level representation). Notably, SC's feature extractors are streaming, computing on utterance segments as they are ingested, which hides much of the computation delay behind the IO.
As shown in Figure 1(b), SC takes a new point in the design space of speech understanding. SC's choice of acoustic features sets it apart from work that matches voice commands by transcripts [74]; SC's learning capability sets it apart from query-by-example (QbE) [21,33,40,47] and keyword spotting (KWS) [15,52], which cannot personalize online. To ensure reproducibility, we release our training and inference code along with sample models at the following URL: https://github.com/afsara-ben/SpeechCache
Results. We implemented SC on a low-cost MCU and evaluated it on 75K+ voice inputs from 210 users. Compared to offloading all inputs, SC resolves 45%-90% of inputs with recurring transcripts on device, reducing the average latency by up to 81% (from near 1 second down to around 150 ms). Meanwhile, it retains accuracy comparable to that of the state-of-the-art models. Even in challenging situations such as scarce cached inputs or multi-tenant usage of the device, SC still reduces the latency by 3%-34% compared to offloading to the cloud. SC has a low memory footprint of less than 2 MB (model and cache combined), which is suitable for low-cost embedded devices.
Contributions. Towards deep speech understanding on small embedded devices:
• We identify a fresh opportunity: temporal locality in voice inputs, and leverage it through on-device caching.
• We propose two novel designs: matching utterances at hierarchical levels of acoustic features; personalizing the feature extractors, which is done via finetuning the extractors with offload results.
• We present a suite of optimizations which are vital to performance, including an ensemble of models for various input lengths, dynamic thresholds, and model compression.

MOTIVATIONS

Our system model
We target embedded devices as commonly seen in smart home and wearable gadgets. Examples include Arm Cortex-M microcontrollers clocked at less than a few hundred MHz, with less than a few MB of RAM. We will refer to them as "devices" for short.
Following prior work, we assume that a device continuously detects the presence of human voices with low overhead, and activates SLU upon detection [76]. We assume that the device has network connectivity to a cloud service or a nearby hub, which we refer to as "cloud" for short. The cloud serves a state-of-the-art (SOTA) SLU model. The resources needed by such an SLU model (memory in the 100s of MB) far exceed those of a device.
The goal and target benefits. We set out to make the devices capable of executing SLU with near-SOTA accuracy, while processing a significant fraction of inputs locally. Our primary benefits are (1) lower cloud cost by reducing the invocations of its ML service; (2) lower inference latency and thus better user experience. Beyond that, our system may strengthen privacy and security for applications that allow occasional offloading despite the risk of data breach.
A large speech recognition model [36] can resolve rich, complex commands that resemble daily conversations (e.g. 'please check and tell me about the reminders i have placed today') but is beyond the compute and memory resources of a simple MCU. This is illustrated in Figure 2. SLU under resource constraints is feasible only with certain restricted input types: the utterance is short and is either a predefined keyword ("yes", "stop", "go" etc.) [15,52] or a few-word command ("play music", "lights on") [48].
Cost breakdown. The typical large end-to-end (E2E) SLU model in Figure 2 (a), although highly accurate, is impossible to deploy on an MCU. To understand caching opportunities and the potential reduction in overhead, we study a small SLU model [48] in Figure 2 (Bottom). Note that the model was designed with efficiency in mind, not accuracy. Its capacity is much lower than that of SC (ours), as discussed in §7.2. Our observations are that: (1) the early stages require much less compute and memory than the later stages. These early stages extract lower-level acoustic features (sound units, phonemes). The later stages, especially word embeddings, require memory that far exceeds what embedded devices offer. (2) Executing the early stages takes much less time (∼96 ms) than offloading one second of input to the cloud (∼300 ms). Note that the choice of offloading the audio waveform (∼10s of KB per input) or intermediate features (e.g. phonemes at 160 bytes per input second) does not affect the offloading latency much; the delay is primarily contributed by the network round trips (RTT) and the device's network interface power management (wakeup or duty cycling) [53].
Applicability. Our system targets smart devices, typically interface devices for controlling smart homes, used by one or several users and seeing recurring queries or commands. Voice user interfaces (VUIs), smart wearables, and smart home controls and gadgets such as thermostats and light controls fall under this category. Additionally, our system has use cases in human-robot interaction and smart navigation.

Motivation: locality in spoken inputs
Such input locality is pervasive, for the following reasons. In many settings, a smart device serves no more than several users. For instance, a wearable serves its wearer exclusively; home devices in a bedroom serve family members that use the room [19]; smartspace devices such as activity trackers serve one or a few workers in close proximity. Speech inputs received by a device often follow a small set of transcripts. We attribute the causes to two sources of locality. (a) Intent locality. Studies show that, despite the rich features of home assistants, most invocations are for a small set of daily routines, such as querying schedules, checking weather, and IoT controls [13]. The users rarely change their usage patterns, and the majority of sessions (77%) involve only one or two domains [19]. (b) Transcript locality. To express the same intent, a user is likely to utter the same transcript repeatedly, e.g. "what time is it" for querying the time. Studies show that humans naturally follow linguistic/social conventions to facilitate understanding among individuals [50]. The idiolect study also demonstrates that a user often has her own preferred choice of words and sentences [26]. Such human-to-human conventions are reinforced in human-computer interactions; users are motivated to use recurring transcripts to facilitate machine understanding.
We recognize that the above observations may not apply to certain devices, e.g. a public kiosk serving diverse users/inputs. For them, our system will not fail but will see diminished benefits, as §7 will show.

Design implications
The locality motivates us to cache recently spoken inputs, for which we make three design choices.
Choosing cache representations. Our top challenge is robust matching of utterances at low cost. The solution hinges on the representations of cached inputs. (1) While low-level acoustic representations require smaller ML models and less computation, the representations can be brittle. This is further exacerbated by background noise and the speaker distance. Figure 3 compares three spoken instances of the same transcript, by the same user, back to back. These representations not only appear visually different, but also fail common metrics for sequence matching: Euclidean similarity, cosine similarity, and Levenshtein distance. (2) By contrast, higher-level representations such as words are more robust; yet, the required on-device memory and computation would defeat our goal of efficiency.
Considering these tradeoffs, we choose to represent cache inputs as sequences of acoustic or phoneme features.

Specialization
The cache adapts to each device and its users and, more aggressively, to each recent transcript. This allows SC to specialize its model parameters (mapping from raw waveforms to acoustic representations) and cache representations. This drives the accuracy high and allows simple model structures.

Learning online
The device can receive online supervision from the cloud model in situ, using the cloud output to finetune its local models for specialization. This sets our design apart from prior embedded speech systems [20,47,48,76], which lack such supervision; these systems run generic models frozen at development time and compromise on accuracy.

SC OVERVIEW
Architecture. SC consists of a device runtime, which extracts input features and performs cache lookup, and a cloud runtime, which resolves offloaded inputs with a SOTA SLU model and finetunes the feature extractors for the device.
On a device, the cache comprises two levels: level one (L1) represents an input as a sequence of sound units and level two (L2) represents an input as a sequence of phonemes. The two levels run ML models as their feature extractors. This is illustrated in Figure 1. The maximum number of cached inputs is user configurable; we expect it to be no more than one hundred based on workload studies [19,29,67].
To extract the features at two levels, SC runs two lightweight acoustic models: (1) SincNet [65] and 1D convolutions, followed by frame-level frequency discretization and (2) GRUs with linear classification.
We further apply a series of optimizations (discussed in §6): augmenting input data for finetuning, grouping utterances by lengths and running separate feature extractors, and dynamic thresholds for similarity comparisons.
Operations. When deployed, SC installs generic, pretrained feature extractors on a device. It uploads a shadow copy of the feature extractors to the cloud.
During operation, any new input goes through the L1 cache and, in case of L1 mismatch, the L2 cache. Each cache level generates input representations and compares them against existing entries. In case of both L1 and L2 mismatch, SC uploads the input, as a raw waveform, to the cloud. The size of a waveform contributes little to the upload delay: a typical input lasting 3 seconds is no more than tens of KB, for which the upload delay is dominated by a network round trip [53]. After processing an offloaded input, the cloud sends back the resultant intent as well as new L1/L2 entries (as sequences of sound units and phonemes, respectively), which the device installs. Note that upon an L1 or L2 cache hit, the cloud is not invoked and no new cache entries are installed.
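The lookup-then-offload flow described above can be sketched as follows. This is a minimal illustration, not SC's actual implementation: the names `best_match`, `resolve`, and `offload` are hypothetical, the loss functions are passed in as plain callables, and for brevity the same query object is used at both levels, whereas SC derives a separate representation for L1 and L2.

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[tuple, str]                      # (cached key, intent)
LossFn = Callable[[object, tuple], float]      # lower loss = better match


def best_match(query, entries: List[Entry], loss_fn: LossFn,
               threshold: float) -> Optional[str]:
    """Return the intent of the lowest-loss entry if it beats the threshold."""
    best_intent, best_loss = None, float("inf")
    for key, intent in entries:
        loss = loss_fn(query, key)
        if loss < best_loss:
            best_intent, best_loss = intent, loss
    return best_intent if best_loss < threshold else None


def resolve(query, l1: List[Entry], l2: List[Entry],
            l1_loss: LossFn, l2_loss: LossFn,
            l1_thr: float, l2_thr: float, offload) -> Tuple[str, str]:
    """Try L1, then L2, then the cloud; install new entries only on a miss."""
    for level, entries, loss_fn, thr in (("L1", l1, l1_loss, l1_thr),
                                         ("L2", l2, l2_loss, l2_thr)):
        intent = best_match(query, entries, loss_fn, thr)
        if intent is not None:
            return level, intent               # cache hit: cloud not invoked
    # Hypothetical cloud call returning the intent and new L1/L2 keys.
    intent, l1_key, l2_key = offload(query)
    l1.append((l1_key, intent))
    l2.append((l2_key, intent))
    return "cloud", intent
```

Passing the loss functions and thresholds as parameters keeps the control flow independent of how each level computes its match score.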
In processing offloaded inputs, the cloud also finetunes its shadow copy of the feature extractor. After a fixed number of offloaded inputs, the cloud sends the shadow copy back to the device, replacing its local feature extractors. Empirically, we find that an interval of 100 offloaded inputs gives a good balance between model freshness and download frequency. The model download takes as little as 1-2 seconds, as a single model is under 2 MB before compression and 0.62 MB after compression. See §7.6 for details on memory footprint. For finetuning details, see §4 and §5. SC caches multiple entries per transcript because utterances of the same transcript vary, as is evident in Figure 3. The cache entry for an utterance is updated at runtime in the absence of a cache hit.
The overall latency for SC is calculated by taking into account the percentage of inputs that are offloaded. Since SC is streaming, only the processing of the last 10 frames, or 0.25 s of audio, adds to the latency overhead. The choice of segment/step size is consistent with prior work [51]. The filter rate is the percentage of inputs that are processed on device. A higher filter rate thus implies a lower latency, as most processing is done locally.
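The relation between filter rate and average latency can be illustrated with a short calculation. This is a sketch under stated assumptions: every input pays the residual on-device cost, only cache misses additionally pay the offloading cost, and the example values of 100 ms and 900 ms are placeholders rather than measurements from the evaluation.

```python
def expected_latency_ms(filter_rate: float, local_ms: float,
                        offload_ms: float) -> float:
    """Average latency when a fraction `filter_rate` of inputs hits the cache.

    Assumption: every input pays the on-device lookup cost `local_ms`;
    cache misses additionally pay the offloading cost `offload_ms`.
    """
    if not 0.0 <= filter_rate <= 1.0:
        raise ValueError("filter_rate must lie in [0, 1]")
    return local_ms + (1.0 - filter_rate) * offload_ms


# Placeholder costs: 100 ms on device, 900 ms for a cloud round trip.
for rate in (0.0, 0.45, 0.9):
    print(rate, expected_latency_ms(rate, local_ms=100.0, offload_ms=900.0))
```

With these placeholder costs, raising the filter rate from 0 to 0.9 moves the average from about 1000 ms to about 190 ms, which shows why the filter rate dominates the end-to-end latency.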
Cache space management. Since an entry takes as few as 200 bytes (see Table 7 for details), SC keeps all entries in memory by default. A mid-range MCU with 2 MB of memory can easily allocate 1% of its flash for 50-60 cache entries, sufficient for covering the most recently received utterances. After all free entries are used, the device invokes the well-known LRU policy [38] to evict victim entries. The device allows caching multiple inputs that map to the same intent. It, however, prevents a small number of popular intents from dominating the whole space by capping the number of entries per intent.
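A possible shape for this eviction policy is sketched below. The class name `IntentCappedLRU` and its methods are hypothetical, and the choice to evict the oldest entry of an over-represented intent before falling back to the global LRU victim is an assumption; the text only states that LRU is used and that entries per intent are capped.

```python
from collections import OrderedDict


class IntentCappedLRU:
    """LRU cache of (key -> intent) with a cap on entries per intent."""

    def __init__(self, capacity: int, per_intent_cap: int):
        self.capacity = capacity
        self.per_intent_cap = per_intent_cap
        self._entries = OrderedDict()          # oldest entry first

    def touch(self, key) -> None:
        """Mark an entry as recently used (call this on a cache hit)."""
        self._entries.move_to_end(key)

    def insert(self, key, intent: str) -> None:
        if key in self._entries:               # refresh an existing entry
            self._entries[key] = intent
            self._entries.move_to_end(key)
            return
        same_intent = [k for k, v in self._entries.items() if v == intent]
        if len(same_intent) >= self.per_intent_cap:
            del self._entries[same_intent[0]]  # oldest entry of this intent
        elif len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # global LRU victim
        self._entries[key] = intent

    def keys(self) -> list:
        return list(self._entries)
```

An `OrderedDict` keeps recency order with constant-time moves, which suits a cache of only a few dozen entries.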
Cache warm up. For a new device without prior inputs, SC preloads the cache with example entries. These entries can be chosen for a specific deployment, e.g. common voice commands in smart homes. In this warm-up period, our cache resolves fewer inputs (i.e. lower hit rates) but still functions. As the cache encounters new commands and replaces the preloaded entries, its accuracy will ramp up to normal.
Alternative designs rejected by us. While it may be tempting to specialize for individual speakers (rather than devices as SC does), we keep SC simple without any need for speaker identification. Instead, SC caches together the inputs that may come from different speakers sharing a device. We experimentally test the impact of the number of users per device in §7.2.
SC matches intents with identical transcripts. It cannot match utterances of different transcripts implying the same intent (with "Turn off the light" cached, SC still regards "Turn the light off" as a cache miss). Trivial fixes, e.g. raising the similarity threshold, fail because intents are sensitive to slight differences in transcripts ("Turn bedroom light on" and "Turn bathroom light on" imply different intents). The systematic fix is transitioning from acoustic representations to word embeddings, which however defeats SC's efficiency goal.

Figure 4: The design of sound unit cache (L1), showing the cache insertion path and the cache lookup path.

SOUND UNIT CACHE (L1)
Our rationale for L1 is to absorb "easy" audio inputs (e.g. those highly similar to some prior inputs). For this reason, we design the L1 cache for low cost, and in exchange tolerate a lower recall (i.e. L1 will miss a significant number of truly recurring inputs).

Overview
L1 compares low level spectral features of sound units as shown in Figure 4.
At runtime, given a waveform input $x$, the system queries the cache by: extracting spectral features from $x$ (1) and matching the frame-wise features against each cached key, which is a sequence of spectral feature centroids. For the matching, the system computes the CTC loss between the incoming features and the cached centroids (4).
In case of cache miss, the system updates the cache by: discretizing the frame-wise features through clustering (2) and further collapsing adjacent frames (3). It saves the resultant sequence of centroids $\{c_k\}$ (which represent distinct sound units) alongside the cloud-supplied intent.

Designs
We next elaborate on the major steps in Figure 4.
(1) Feature extraction. L1 splits the input into $T$ audio frames. For each frame $f_t$, it extracts spectral features $s_t = \{v_1, v_2, \ldots, v_d\}$, a vector of continuous real numbers.
An audio frame is a short segment or window of consecutive audio samples having a fixed duration. A distinct sound spanning one or more frames is referred to as a sound unit or phone. Converting audio frames to salient spectral features is a standard preprocessing step of all modern SLU pipelines [52,65]. It is also required by L2; therefore, the extraction comes at no extra cost to L1. It is of low cost, as it is done by a sequence of convolutional layers. Notably, we operate in the time domain: we apply convolutions on the raw input waveform, instead of transforming the waveform to the frequency domain [27,56,65].
SC sends each frame to a SincNet layer [65] followed by two 1D convolutional layers. These layers build upon the high-dimensional features acquired in the initial SincNet layer. The output features are of lower abstraction.
Since the extraction is generic to human speech, SC adopts a frozen SincNet layer and convolutional layers pretrained for ASR. For the pretrained feature extractors, we use the extractor in the phoneme module from Fluent [48]. Finetuning these pretrained layers brought little benefit.
(2) Feature discretization. SC discretizes the spectral feature vector of each frame through K-means clustering [44], as inspired by self-supervised speech models [17,24]. To do so, it clusters vectors based on their Euclidean distance from randomly initialized centroids. Each vector is assigned to its nearest centroid and the centroid positions are updated iteratively until convergence. With each iteration L1 learns more about the inherent characteristics present within the feature vectors. After discretization, each frame $f_t$ is represented by a numerical ID $z_t$ for $t = 1 \ldots T$. The extracted spectral feature vectors $S = \{s_1, \ldots, s_T\}$ are continuous in nature, but our sequence matching task needs discrete values. We opt for discretizing the frame features before sending them to the subsequent ML layers (as prediction targets). The rationale is that each discretized feature could (roughly) represent a distinct sound, simplifying comparisons/matching of the sounds.
Unique to SC's specialization principle, the clustering is per utterance instead of per transcript, so each utterance has its distinct sequence of centroids. At runtime, for each seen transcript $u$, SC augments the speech as described in §6 and clusters the extracted spectral vectors corresponding to $u$ using K-means clustering. Whenever the device encounters a new utterance for $u$, the sequence of centroids $C_u = \{c_1, \ldots, c_K\}$ for $u$ is updated. The results are separate sets of $K$ centroids, one for each distinct utterance of $u$.
We also experimented with $K$ centroids per device instead of per utterance. The performance is inferior to the prior approach because it is too coarse-grained for varied transcripts. A set of $K$ centroids per device is inconsistent with the original transcript, hence we reject the idea.
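The per-utterance discretization can be sketched with a plain K-means loop. This is an illustrative version only: the function names are hypothetical, centroids are initialized by sampling input vectors (one common way to realize random initialization), and the toy 2-D vectors stand in for real spectral features.

```python
import random


def nearest(vector, centroids) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vector, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(vectors, k: int, iters: int = 20, seed: int = 0):
    """Cluster frame features; return (centroids, per-frame centroid IDs)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        ids = [nearest(v, centroids) for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, i in zip(vectors, ids) if i == c]
            if members:                        # mean of the assigned vectors
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:                              # keep an empty cluster in place
                updated.append(centroids[c])
        if updated == centroids:               # converged
            break
        centroids = updated
    return centroids, [nearest(v, centroids) for v in vectors]


frames = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, ids = kmeans(frames, k=2)
print(ids)   # the first two frames share one ID, the last two share another
```

The returned ID sequence is what the later collapsing and matching steps operate on.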
(3) Frame collapsing. L1 collapses adjacent frames that have identical IDs (e.g. "... 13 ...").
(4) Sequence matching. SC does a "soft" match between a sequence of probabilistic sound units and cached sequences. Due to the noisy nature of audio, a "hard" match between deterministic sound units would force SC to emit the single most likely sound unit per frame, which does not optimize for the most likely sequence match as a whole and is therefore susceptible to input disturbance.
For each spectral feature $s_t$ in $S$, it generates a probability distribution $y^t$ over all discrete sound units. The distribution is calculated based on the distances $d_t$ to the cached centroids. Since the closer a spectral feature is to a certain centroid, the more likely it belongs to that centroid, we utilize the inverse of the distance to measure the likelihood of a match. Specifically, for time frame $t$, the inverse is calculated as $I_t = \max(d_t) - d_t$. The predicted match to a specific cached centroid, $P_t$, is obtained with $\arg\min_k d_{t,k}$, where $d_{t,k}$ is the distance from $s_t$ to the $k$-th centroid.
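The distance-to-likelihood conversion for one frame can be sketched as below. Two details are assumptions that the text does not specify: the inverted distances are normalized to sum to one, and a small `eps` is added so that the farthest centroid keeps a non-zero probability (which avoids a logarithm of zero in the later loss). The function name `frame_posterior` is hypothetical.

```python
import math


def frame_posterior(feature, centroids, eps: float = 1e-6):
    """Turn distances to cached centroids into a probability distribution.

    Follows I = max(d) - d from the text; the normalization and the
    `eps` smoothing are assumptions made for this sketch.
    """
    d = [math.dist(feature, c) for c in centroids]
    inverse = [max(d) - di + eps for di in d]
    total = sum(inverse)
    probs = [v / total for v in inverse]
    predicted = min(range(len(d)), key=d.__getitem__)   # argmin over distances
    return probs, predicted


probs, predicted = frame_posterior([0.0, 0.0],
                                   [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(predicted, [round(p, 3) for p in probs])
```

In this toy case the distances are 0, 5, and 10, so the nearest centroid receives roughly two thirds of the probability mass.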
Given the probabilistic $y$ and a cached sequence $l$, L1 determines all concrete sequences of centroids (alignments) that would collapse to $l$. It computes the aggregated probability of a single alignment $\pi$ as $p(\pi|x) = \prod_{t=0}^{T-1} y^t_{\pi_t}$, where $y^t_{\pi_t}$ is the probability of observing label $\pi_t$ at time $t$.
After that, SC uses the CTC loss $\mathcal{L}$ [30], a well-known objective function in speech recognition, to compute the aggregated probability of all possible input sequences (valid alignments) in the inverse mapping $\mathcal{B}^{-1}(l)$ that would collapse to a cached sequence: $p(l|x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi|x)$. The sum over alignments is computed efficiently with a dynamic programming procedure, the forward-backward algorithm, and SC picks the cached sequence with the lowest loss value. If the loss is below a pre-defined threshold X, SC deems it a cache hit and returns the associated intent. Intuitively, the threshold X should be normalized to the sequence length (e.g. tolerating higher loss for longer sequences).
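To make the sum over alignments concrete, the sketch below enumerates every alignment by brute force and keeps those that collapse to the cached sequence. It is for illustration on tiny inputs only, since the enumeration is exponential in the number of frames; a real system would use the forward-backward recursion. The collapse rule here merges adjacent repeats as in step (3) and uses no blank symbol, which is an assumption about L1; the function names are hypothetical.

```python
import itertools
import math


def collapse(ids):
    """Merge runs of identical adjacent IDs, e.g. [13, 13, 7] -> [13, 7]."""
    return [k for k, _ in itertools.groupby(ids)]


def alignment_loss(posteriors, target) -> float:
    """-log of the summed probability of all alignments collapsing to target.

    posteriors[t][k] is the probability of sound unit k at frame t.
    """
    n_frames, n_units = len(posteriors), len(posteriors[0])
    total = 0.0
    for pi in itertools.product(range(n_units), repeat=n_frames):
        if collapse(pi) == list(target):
            total += math.prod(posteriors[t][pi[t]] for t in range(n_frames))
    return -math.log(total) if total > 0 else float("inf")


# Three frames, two sound units; cached sequence [0, 1].
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(alignment_loss(posteriors, [0, 1]))
```

Only the alignments (0, 0, 1) and (0, 1, 1) collapse to [0, 1] here, contributing 0.432 and 0.288, so the loss is the negative logarithm of 0.72.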

Implementation details
At a standard sampling rate of 16 kHz [39], we apply a Hamming window of size 401 to $x$, creating a sequence of $T$ frames, each spanning 25 ms, or 401 audio samples.
In step (1), each 1D convolutional layer consists of 60 filters of length 5 and step size 1. To decrease the dimensionality of features, each layer undergoes a temporal 1D average max-pooling operation with a kernel size of 2 and a stride of 2. All hidden layers use the leaky ReLU activation function with a negative slope of 0.2, and batch normalization is applied before each activation layer to speed up training convergence.

PHONEME CACHE (L2)

Overview
L2 matches inputs based on phonemes as shown in Figure 5.
At runtime, given a waveform input $x$, L2 runs a feature extractor (1): the input is a sequence of continuous spectral features produced by the L1 feature extractor in Figure 4; the output is a sequence of probability distributions represented as phoneme posteriors. For matching, SC computes the CTC loss between the phoneme posteriors and each cached phoneme sequence (3).
Cache entries are updated upon L2 cache miss: the ground truth utterance sent to the cloud is tokenized and used as a "reference" phoneme sequence (2). L2 caches this phoneme sequence alongside the cloud-provided intent. With each utterance seen in the cloud, the model is finetuned (4).

Designs
(1) Feature extraction. A phoneme is a sound unit/phone spanning one or more audio frames. We follow a common design for extracting phonetic features from raw speech.
Consuming a sequence of uncollapsed, frame-level spectral features $s_t$ for each frame $f_t$ from L1 (step (1) in Figure 4), L2 applies two bi-directional GRU layers (hidden size 128), followed by a linear classifier and a 42-way phoneme softmax output layer (41 context-independent (CI) phoneme targets and an additional «sp» (blank) target).
The output is a sequence $Y = (y^1, \ldots, y^T)$ of phoneme posteriors that represents the log probabilities (phoneme logits) at each time step; here $y^t = \{y_1, \ldots, y_{42}\}$. Implementation is similar to §4.3.
(2) Tokenization. Upon L2 cache miss, the cloud provides a "reference" phoneme sequence $l = \{l_1, l_2, \cdots, l_n\}$ for $x$. To do so, the cloud runtime transcribes $x$ to words and then to phonemes, using a standard tokenizer such as NLTK [6]. A special "blank" token is inserted between adjacent words. The rationale is that the cloud, with its large speech/language models, can transcribe $x$ with a low word error rate (WER); the phonemes (reversed) derived from the transcript and used as the cache key are therefore less error prone and closer to the ground truth.
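The tokenization step can be sketched as a word-to-phoneme lookup with a blank token between adjacent words. The three-word `TOY_LEXICON` and the function name `reference_phonemes` are hypothetical stand-ins; a real cloud runtime would rely on a full pronunciation dictionary or tokenizer instead.

```python
BLANK = "sp"   # blank token inserted between adjacent words

# Hypothetical mini lexicon, for illustration only.
TOY_LEXICON = {
    "turn": ["T", "ER", "N"],
    "lights": ["L", "AY", "T", "S"],
    "on": ["AA", "N"],
}


def reference_phonemes(transcript: str, lexicon=TOY_LEXICON, blank=BLANK):
    """Map a transcript to a phoneme sequence with blanks between words.

    Raises KeyError for words missing from the lexicon.
    """
    sequence = []
    for i, word in enumerate(transcript.lower().split()):
        if i > 0:
            sequence.append(blank)
        sequence.extend(lexicon[word])
    return sequence


print(reference_phonemes("Turn lights on"))
```

The resulting list is the deterministic key that the device stores in L2 alongside the intent.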
(3) Sequence matching. The generated sequence of phoneme posteriors $Y = (y^1, \ldots, y^T)$, with one posterior per time step, is matched against all existing L2 entries $\{l_i\}_{i=0..N}$, which store deterministic phoneme sequences. Here, the match employs CTC loss: from the probabilistic $Y$ and a saved sequence $l_i$, L2 determines all concrete phoneme sequences (alignments) that would collapse to $l_i$. It computes the aggregated probability of a single alignment $\pi$ as $p(\pi|x) = \prod_{t=0}^{T-1} y^t_{\pi_t}$, where $y^t_{\pi_t}$ is the probability of observing label $\pi_t$ at time $t$.
Then the CTC loss $\mathcal{L}(l)$ for a saved sequence $l$ is computed from the sum of the probabilities of all the valid alignments mapped onto it by $\mathcal{B}$, $p(l|x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi|x)$; following [30], the loss is the negative logarithm of this probability, so a more likely entry has a lower loss. The cache entry with the lowest CTC loss is chosen; if the loss is below a predefined threshold, L2 deems it a hit and returns the associated intent.
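The sum over valid alignments is usually evaluated with the CTC forward recursion rather than by enumeration. The sketch below follows the standard formulation with a blank label (index 0 by assumption) and works in the probability domain for readability, which would underflow on long utterances; a practical implementation would work with log probabilities. The function name `ctc_loss` is chosen for this example.

```python
import math


def ctc_loss(posteriors, target, blank: int = 0) -> float:
    """-log p(target | input) via the CTC forward recursion.

    posteriors[t][k] is the probability of label k at time step t.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = posteriors[0][blank]
    if size > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, len(posteriors)):
        prev, alpha = alpha, [0.0] * size
        for s in range(size):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * posteriors[t][ext[s]]

    prob = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(prob) if prob > 0 else float("inf")


# Two time steps, labels {0: blank, 1: a phoneme}; cached sequence [1].
print(ctc_loss([[0.1, 0.9], [0.6, 0.4]], [1]))
```

For this toy input the valid alignments are (1, 1), (blank, 1), and (1, blank), with probabilities 0.36, 0.04, and 0.54, so the loss is the negative logarithm of 0.94. The cache lookup would compute this value for every entry and keep the smallest.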
§6 discusses threshold determination.
(4) Online finetuning. Online finetuning is crucial, as shown in §7, as it allows SC to be personalized. The cloud keeps a shadow copy (M) of the phoneme feature extractor that the device is currently running. In the training phase, given the phoneme logits (or posteriors), the model is finetuned to predict phoneme targets (ground truth). The cloud runs a forward pass of M, gets a sequence of phoneme posteriors, and calculates the CTC loss between the generated sequence and the reference sequence. Finally, the loss is backpropagated to update the parameters of M. Having accumulated all updates, the cloud pushes the updated model M' to the device.

Implementation details
The objective of the model is to perform a single-label intent classification task. The training process involves both real and augmented inputs with batch size 16, the Adam optimizer, and learning rate $10^{-4}$. L2 is finetuned until the CTC loss converges. We use the CTC loss implementation in PyTorch, which evaluates the loss with an optimized dynamic programming version of the traditional forward-backward algorithm. For the L2 feature extractor, a hidden dimension of 128 is optimal.

KEY OPTIMIZATIONS
Model Ensembling. A novel optimization is to specialize feature extractors to input length, a reasonable indicator of acoustic and lexical complexity. Intuitively, utterances lasting 4-5 seconds (e.g. "At one pm today start the robot vacuum cleaner in kitchen") are likely queries with context, unlike short commands lasting only a couple of seconds (e.g. "Play music"). We hypothesize that their matching tasks could benefit from separate models finetuned on the input complexity.
Concretely, SC instantiates multiple versions of the feature extraction models, each version finetuned on a range of input audio lengths. Empirically, SC runs three models, for input lengths (0, 2.7] sec, [2, 4) sec, and [4, ∞) sec respectively, also referred to as buckets. This multi-versioning of models can be seen as a special case of Mixture of Experts (MoE) [37]. Similar to MoE, the predictive modeling task is subdivided, here by input length, and an expert (bucket model) is developed for each subtask. Unlike in MoE, we do not need a neural gating network to route individual inputs to an expert; routing is done simply by comparing against the input length.
At runtime, an input goes through each bucket model. After the command ends, the extracted features are taken from the bucket model corresponding to the input audio length. Note that multi-versioning only applies to devices; the cloud still runs a monolithic model for SOTA performance.
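Routing by input length needs no learned gate; a sorted list of bucket boundaries is enough. In the sketch below the boundary values of 2.0 and 4.0 seconds are placeholders for illustration, and the function names are hypothetical.

```python
import bisect


def pick_bucket(duration_s: float, upper_bounds=(2.0, 4.0)) -> int:
    """Return the index of the bucket model for an utterance duration.

    `upper_bounds` are placeholder boundaries in seconds; a duration equal
    to a boundary falls into the next (longer) bucket.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return bisect.bisect_right(upper_bounds, duration_s)


def select_features(duration_s: float, per_bucket_features,
                    upper_bounds=(2.0, 4.0)):
    """All bucket models run while streaming; keep one output at the end."""
    return per_bucket_features[pick_bucket(duration_s, upper_bounds)]


outputs = ["short-model features", "mid-model features", "long-model features"]
print(select_features(4.6, outputs))
```

Because the utterance length is only known once the command ends, the selection happens after all bucket models have consumed the stream.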
L1 underperforms on short commands as it does not have enough features for accurate inference. For such commands we bypass L1 and process them directly in L2. To determine the threshold for short commands, we systematically explore cutoff values at 0.5-second intervals up to 3 seconds and observe that commands below a cutoff of 2 seconds (or commands in range (0, 2.7] sec) can be considered 'short' and eligible for bypassing L1.
Weight Sharing Only the weights in L2 are learnable. The acoustic feature extractor of L1 (which is also common to L2) is independent of the learnable weights and can thus be shared by all bucket models.

Input augmentation
To finetune the on-device models, a considerable difficulty is that a device may not have enough inputs, which we expect can be as few as 5-10. To address this difficulty, we find data augmentation vital.
For any offloaded waveform, the cloud creates multiple augmented versions of it to resemble the possible variation in future inputs: (a) Temporal Shift: SC shifts the waveform in either direction by a percentage of the total duration drawn uniformly from [-5, 5]. This simulates users starting to speak before activating the device or continuing to speak after recording ends. (b) Frequency Shift: SC varies the waveform frequency by a percentage drawn uniformly from [-10, 10]. (c) Ambient Noise: SC applies Gaussian noise at 5% of the maximum volume in the recording. This helps the model distinguish important phonemes from background noise. In a typical home setting, a wide range of ambient noise could occur at the same time, e.g., television and phone.
Given an input, each transformation above creates five versions of it.
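The three transformations can be sketched in pure Python as follows. This is a minimal illustration under several assumptions that are not stated in the text: the frequency shift is approximated by linear-interpolation resampling (which also changes the duration), the noise standard deviation is taken as 5% of the peak amplitude, and all function names are hypothetical.

```python
import math
import random


def temporal_shift(wave, rng, max_pct=5.0):
    """Shift the waveform left or right by up to max_pct% of its length."""
    pct = rng.uniform(-max_pct, max_pct)
    k = int(round(len(wave) * pct / 100.0))
    if k > 0:   # delay: pad silence in front, drop the tail
        return [0.0] * k + wave[:len(wave) - k]
    if k < 0:   # advance: drop the head, pad silence at the end
        return wave[-k:] + [0.0] * (-k)
    return list(wave)


def frequency_shift(wave, rng, max_pct=10.0):
    """Scale all frequencies by up to max_pct% via resampling (assumption)."""
    if not wave:
        return []
    factor = 1.0 + rng.uniform(-max_pct, max_pct) / 100.0
    n_out = max(1, int(len(wave) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append(wave[lo] * (1.0 - frac) + wave[hi] * frac)
    return out


def add_noise(wave, rng, level=0.05):
    """Add Gaussian noise whose std is `level` times the peak amplitude."""
    peak = max((abs(x) for x in wave), default=0.0)
    return [x + rng.gauss(0.0, level * peak) for x in wave]


def augment(wave, copies=5, seed=0):
    """Create `copies` versions per transformation (15 in total by default)."""
    rng = random.Random(seed)
    versions = []
    for transform in (temporal_shift, frequency_shift, add_noise):
        versions.extend(transform(wave, rng) for _ in range(copies))
    return versions


# Toy input: 0.1 s of a 440 Hz tone sampled at 16 kHz.
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
versions = augment(wave)
```

A fixed seed keeps the augmented set reproducible, which is convenient when comparing finetuning runs.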
In-domain training applies to SLURP. Before finetuning, the base model is trained on domain-specific utterances from SLURP. The purpose is to familiarize the model with the complexity of SLURP sentences. We train on 10% of the total SLURP utterances.
Dynamic thresholds for CTC loss Intuitively, the CTC loss threshold for a cache hit should correlate with the length of the input sequence. To determine the threshold dynamically, SC adopts a small 2-layer MLP (with a hidden dimension of 64 and ReLU activation) that maps the input sequence length to the threshold. The model has only 193 parameters; during cache lookup, its inference overhead is negligible compared to other computations, e.g. the CTC loss. This optimization is needed only when querying inputs during evaluation; it is not needed for learning. We train the model offline on a held-out set loaded with 100 entries and use the MLP to predict, on the fly, the threshold that would give the best results.
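The parameter count follows from the layer shapes: a 1-to-64 layer has 64 weights and 64 biases, and a 64-to-1 layer has 64 weights and 1 bias, giving 193 in total. The sketch below mirrors that structure in pure Python. It is illustrative only: the class `ThresholdMLP`, the random (untrained) weights, and the rule that a hit means a loss at or below the predicted threshold are assumptions.

```python
import random


class ThresholdMLP:
    """2-layer MLP mapping a scalar sequence length to a scalar threshold."""

    def __init__(self, hidden=64, seed=0):
        rng = random.Random(seed)
        # Untrained placeholder weights; the real model is trained offline.
        self.w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # 1 -> hidden
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # hidden -> 1
        self.b2 = 0.0

    def num_parameters(self):
        return len(self.w1) + len(self.b1) + len(self.w2) + 1

    def predict(self, seq_len):
        # Hidden layer with ReLU, followed by a linear output.
        hidden = [max(0.0, w * seq_len + b) for w, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def is_cache_hit(ctc_loss_value, seq_len, model):
    """Assumed hit rule: the loss must not exceed the length-dependent threshold."""
    return ctc_loss_value <= model.predict(seq_len)


model = ThresholdMLP()
param_count = model.num_parameters()
```

With 64 hidden units, a forward pass costs roughly 128 multiply-adds, which is consistent with the claim that its overhead is negligible next to the CTC computation.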

EVALUATION
We answer the following questions through experiments: (1) Can SC achieve competitive accuracy & latency?
(2) Do the key designs of SC positively contribute to its performance? (3) Is the performance of SC sensitive to environments and configurations?

Methodology
Test platforms We implement SC atop PyTorch 2.0.1 (for the cloud runtime) and X-Cube-AI 8.1.0 (for the device runtime). We deploy SC on a low-resource platform as detailed in Table 1. We choose a Cortex-M7 processor as it has integrated single-instruction multiple-data (SIMD) and multiply-accumulate (MAC) instructions useful for accelerating low-precision computations.
We run SC's cloud runtime on an x86 server in our lab. To better estimate the cloud/device network delays in real deployment, we invoke Microsoft's speech service [2] with the benchmark inputs and measure end-to-end wall time. The input is sent from the US east coast and invokes data centers on the east coast. We repeat the test on enterprise WiFi and LTE, and use that measurement as the offload delay in our experiments. Our delays are 0.29-0.34 RTF (on average: 900 ms for a 3-second audio input; stddev: 100 ms), which largely match the cloud API delays in prior work [59,60,75]. The payload is only tens of KB; hence bandwidth is not an issue. Note that the Azure speech service is used only to measure the RTT (in ms); the reported accuracy comes from the actual hardware detailed in Table 1.
Datasets are summarized in Table 2: (1) SLURP-C is curated from a popular speech benchmark SLURP [18], which comprises lexically complex, linguistically diverse utterances close to daily conversations, e.g. "please add an event to my schedule". As the original SLURP waveforms were captured with varying devices and acoustic conditions, we construct SLURP-C as the subset recorded in the close-range setting (2745 utterances from 74 speakers; each speaker utters 5-961 transcripts), best matching our targeted scenarios such as smart homes and wearables. We also consider adversarial inputs: a subset SLURP-mix recorded with a mix of near/far-range devices and in noisy conditions (52,935 utterances from 157 speakers); (2) FSC [48] comprises shorter and simpler utterances representing voice assistant commands, e.g. "play music", with 30K utterances from 97 speakers, covering 31 intents. Given a speaker, a distinct transcript is uttered 1-2 times.
Note that compared to the original SLURP and FSC datasets, we exclude speakers that have too few (<5) utterances, as they lack enough data to warm up our cache.See §3 for discussion on such a situation.
Comparisons CloudOnly offloads all inputs to the cloud, which runs a model that achieves the SOTA accuracy with the best efficiency, with regard to each dataset. For SLURP-C, the cloud runs NVIDIA's Conformer-Transformer-Large [1], pretrained on NeMo ASR-Set 3.0 and finetuned on SLURP. For FSC, the cloud runs an attention-based RNN sequence-to-sequence model from SpeechBrain [4]. CloudOnly's accuracy is considered as gold. OnDevice-{S|L} are models that run completely on device. (1) OnDevice-S is tailored to MCUs and targets simple utterances. As a popular model by Fluent.AI [48], OnDevice-S has only 3.96M parameters and was shown to have good accuracy on FSC. (2) OnDevice-L is a compressed model that can handle complex utterances as in SLURP-C. It comprises a Conformer(S) [31] for ASR and a Mobile-BERT [69] for NLU, totalling ∼110M parameters (ASR:NLU=1:10). OnDevice-L far exceeds an MCU's resources; we regard it as a reference point: an efficiency-optimized model that still generates reasonable accuracy on SLURP-C. We confirm that even smaller models would not produce meaningful results.
Ours-{D|T|M} + is our system with different combinations of optimizations (as discussed in §6), in which: D stands for dynamic thresholds for feature matching; T stands for in-domain training; M stands for model ensembling.
Benchmark settings We control two factors that have high performance impact: the number of speakers sharing a device; the fraction of unseen speech transcripts.
1-speaker-100%-seen comprises 74 (SLURP-C) and 53 (FSC) separate tests, each for a distinct speaker. In the test for a given speaker, all test inputs are from this speaker. We construct each test as follows (also see Figure 6).
From the speaker's utterances corresponding to a distinct transcript, the test includes one randomly selected utterance as a cached input and all other utterances as test inputs. A test has two phases. In the learning phase, SC processes all the cached inputs (which have various transcripts), building the device cache and finetuning the feature extractors. In the test phase, SC uses the tuned feature extractors and processes the test inputs. We report performance for the test phase: for each of SLURP-C and FSC, we aggregate the results from all the per-speaker tests (74 for SLURP-C and 53 for FSC).
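The per-speaker split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' benchmark script: `build_test`, the `(utterance_id, transcript)` tuple layout, and the sample data are hypothetical, and utterance ids are assumed to be unique.

```python
import random
from collections import defaultdict


def build_test(utterances, seed=0):
    """Split one speaker's utterances into cached inputs and test inputs.

    utterances: list of (utterance_id, transcript) pairs from ONE speaker.
    For each distinct transcript, one randomly chosen utterance is cached
    and all remaining utterances of that transcript become test inputs.
    """
    rng = random.Random(seed)
    by_transcript = defaultdict(list)
    for utt_id, transcript in utterances:
        by_transcript[transcript].append(utt_id)

    cached, test = [], []
    for transcript, ids in by_transcript.items():
        pick = rng.choice(ids)
        cached.append((pick, transcript))
        test.extend((i, transcript) for i in ids if i != pick)
    return cached, test


# Hypothetical utterances for a single speaker.
speaker_utts = [
    ("u1", "play music"), ("u2", "play music"),
    ("u3", "stop"), ("u4", "stop"),
    ("u5", "lights on"),
]
cached_inputs, test_inputs = build_test(speaker_utts)
```

By construction, every test transcript also appears among the cached inputs, which is what the 100%-seen setting requires; transcripts uttered only once contribute a cached input but no test input.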
1-speaker-k%-seen is the same as above, except that in each test, only the transcripts of k% of test inputs have appeared in the cached inputs. We experiment with k=70 and k=0 (an extreme case in which no transcripts were seen).
n-speakers-k%-seen is the same as above, except that each test now comprises utterances from n randomly grouped speakers. Speaker groups are disjoint. The utterances carry no speaker IDs. We report results for n = 3 and k = 100, for which we perform 24 (SLURP) and 18 (FSC) tests. Additionally, we also test on all-speakers-100%-seen.

Metrics
Table 3: Measured performance of our system as compared to baselines, showing that we deliver strong accuracy while incurring much lower latency (ms in table). The absolute latency is for an input of 3 seconds. Note that the baselines' performance remains largely unaffected w.r.t. benchmark settings.

We report accuracy and latency averaged over all the test inputs. The latency is end-to-end, from the moment an utterance completes until the system generates a response. Real-time factor (RTF) measures how fast a speech model can process audio input and is the ratio of the processing time to audio length. We report latency both in wall-clock time and RTF (wall time normalized by the voice length).
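As a small worked example of the RTF definition, the sketch below normalizes a wall-clock latency by the audio length. The helper name `real_time_factor` is hypothetical; the sample numbers reuse the average offload delay quoted in the methodology (900 ms for a 3-second input).

```python
def real_time_factor(processing_ms, audio_ms):
    """RTF = processing time divided by audio length (both in milliseconds)."""
    if audio_ms <= 0:
        raise ValueError("audio length must be positive")
    return processing_ms / audio_ms


# Average cloud offload delay from the methodology: 900 ms for 3 s of audio.
offload_rtf = real_time_factor(900.0, 3000.0)
```

This evaluates to 0.3, which falls inside the 0.29-0.34 RTF range reported for the offload delays.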

End-to-end results
As shown in Figure 7, our system offers competitive latency/accuracy tradeoffs.
(1) On SLURP-C, comprising complex inputs, our system delivers high accuracy of 0.89 (only 0.01 lower than gold) while incurring 4x lower latency (159 ms vs. 870 ms). OnDevice-S, which suits most MCUs, fails to generate meaningful responses: its accuracy is as low as 0.06.
(2) On FSC, which comprises simple voice inputs, our system and all baselines deliver accuracy as high as 0.98-0.99. Yet, our system runs much faster: its latency is 1.3x lower than CloudOnly and even 50% lower (663 ms vs. 1221 ms) as compared to OnDevice-S of only 3.96M parameters.
On-device compute is measured in GOps per input second using PyTorch.
Impact of input complexity Our benefit is more pronounced on complex, richer inputs as in SLURP-C. As shown in Table 3, SC incurs lower latency (due to higher filter rates) on SLURP-C than on FSC, because its cache is more effective at matching longer inputs comprising richer sound features. By contrast, on these inputs, full on-device SLU is beyond what an MCU can afford, and shallow, matching-based query-by-example (QbE) can only handle inputs as short as a few words [47] because it lacks the personalization capability.
Impact of the number of speakers Our system is robust against additional speakers sharing a device. Compared to the default 1-speaker setting, 3-speakers-100%-seen sees a modest ∼400 ms increase in latency at similar accuracy on SLURP-C, as shown in Table 3. Table 4 shows the breakdown results: slightly lower filter rates while maintaining the same level of cache accuracy. With additional speakers, although SC loses some benefit of personalization, it is still more specialized (thus more efficient) than a generic on-device model trained to fit all training data.
A generic SC trained on all data from SLURP-C gives a low accuracy of 0.80, showing the significance of personalization (our design).

In-the-wild evaluation
To thoroughly test SC, we conduct a user study and report performance in the four benchmark settings.

Design and Data collection:
We collect 210 audio recordings from three volunteer speakers. Among them, 126 utterances are recorded with a headset (close range) and 84 without (far range). The volunteers (1 female, 2 male) are non-native speakers from diverse ethnic backgrounds (Korean, Chinese, Bengali). For the uttered transcripts, 21 unique commands were chosen from a randomly selected subset of the original SLURP dataset. The close-range recording is done using a standard headset (Sony WH320); for adversarial inputs (far range), audio is recorded using the M1 Mac microphone. The adversarial dataset contains a mix of recordings with and without the headset.
Evaluation: We obtain the filter rate, cache accuracy, overall accuracy, latency, and RTF in the four benchmark settings. We observe that, on the collected recordings, SC retains accuracy similar to that in the original experiments on SLURP data, with a minor added delay (end-to-end latency of 409 ms). Across the different benchmark settings, the accuracy and latency are consistent. The experimental setting was Ours-M, where only the model ensembling optimization was used. For adversarial inputs, performance is slightly inferior on our custom dataset. For the 1-spk-100%-seen benchmark with adversarial inputs, after noise reduction, overall accuracy is 0.80 at a filter rate of 0.26, with a latency of 644 ms. The detailed benchmark results and comparison with baselines are reported in Table 5.

Cache efficacy
Cache latency SC's cache is fast. The most expensive operation, feature extraction, is done in a streaming fashion in parallel to the voice ingestion. As a result, the delay for processing all but the last streaming segment (which spans around 250 ms) is hidden. After features are extracted, the overhead of matching the features is negligible. In case of an L1 hit, SC's latency is 96 ms on our test platform (most of which is from SincLayer); in case of an L1 miss and L2 hit, the latency is an additional 89 ms. The on-device latency is almost 5x lower than offloading to the cloud.
Cache accuracy is decent as shown in Table 4. (1) On inputs with known transcripts (1-speaker-100%-seen), our cache processes (i.e. filters) 46.23% of inputs on device, avoiding offloading them. Among such locally processed inputs, our cache's accuracy is as high as 0.96. Between the two cache levels, L1 is more selective (i.e. lower filter rate) but shows higher accuracy.
(2) On inputs with unseen transcripts (e.g. 1-speaker-0%-seen), the cache correctly deems almost all inputs as mismatches, offloading them to the cloud. Filter rates are thus very low (3-5%). As a result, the overall accuracy only sees a minor drop (about 2-3%) compared to the setting with all seen transcripts.
Two-level caching Both levels complement each other and contribute to the overall performance as shown in Table 4. On longer inputs (SLURP-C), L1 is more effective (i.e. higher filter rates) and incurs low cost; yet, on short inputs, L1's simple features are often noisy.This is compensated by L2 with its deeper features.We deem both levels essential.
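To see how per-level filter rates and latencies combine, the following back-of-the-envelope sketch computes an overall filter rate and an expected latency for a two-level cache. It is a simplified model, not a reproduction of the reported results: it assumes a miss pays both local lookups before offloading, it ignores the L1 bypass for short commands, it takes the 900 ms average offload delay from the methodology, and the filter rates passed in are hypothetical.

```python
L1_HIT_MS = 96.0      # on-device latency on an L1 hit
L2_EXTRA_MS = 89.0    # additional latency on an L1 miss and L2 hit
OFFLOAD_MS = 900.0    # assumed average offload delay for a 3-second input


def overall_filter_rate(fr_l1, fr_l2):
    """Fraction of inputs resolved on device.

    Each level's filter rate is normalized by the inputs that reach it,
    so L2 only sees the inputs that L1 missed.
    """
    return fr_l1 + (1.0 - fr_l1) * fr_l2


def expected_latency_ms(fr_l1, fr_l2,
                        l1_ms=L1_HIT_MS, l2_ms=L2_EXTRA_MS,
                        offload_ms=OFFLOAD_MS):
    """Expected end-to-end latency under the simplified model above."""
    p_l1 = fr_l1
    p_l2 = (1.0 - fr_l1) * fr_l2
    p_miss = 1.0 - p_l1 - p_l2
    return (p_l1 * l1_ms
            + p_l2 * (l1_ms + l2_ms)
            + p_miss * (l1_ms + l2_ms + offload_ms))


# Hypothetical per-level filter rates, for illustration only.
example_rate = overall_filter_rate(0.3, 0.5)
example_latency = expected_latency_ms(0.3, 0.5)
```

The model makes the tradeoff explicit: each additional input filtered on device replaces an offload round trip with a local lookup that costs at most 185 ms.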

Significance of key optimizations
Online learning SC not only benefits from caching but also crucially from learning (i.e. personalization). Using frozen, pretrained feature extractors without online finetuning sees overall accuracy drop by 0.11 (from 0.99 to 0.88) on FSC. Deeper investigation shows that the cache is much less effective: the filter rate is as low as 2%; among all the cache hits, the accuracy is reduced by 40%.
Model ensembling is vital. Replacing the default ensembled model with one monolithic model results in a notable accuracy drop as shown in Figure 8 (Ours-DM vs. Ours-DT). The reason is that using one model limits the room for input specialization, rendering caching less effective. Further increasing the number of buckets, e.g. from 3 to 5, sees diminishing returns.
caches intermediate results and skips frames to reduce compute. [34] incorporates prompting and in-context learning for IC using large pretrained language models. [71] delineates the correlation between speech recognition performance and training size; acceptable performance requires more than 100 training samples. As discussed in §3, our model is robust against this challenge. [45] mentions exploring the opportunity to focus on short, complex commands with multiple elements instead of extended conversations. We take a middle ground and focus on longer, complex, but single-intent commands. [75] introduces an intermediate caching framework (or hub) between the edge and cloud for IC. [49] employs a moderately small LSTM network for large-vocabulary speech recognition. Multiple works employ content prefetching for service acceleration [54,66]. A concurrent work optimizes SLU for Armv8 SoCs in the local/cloud setting [14]. Unlike Armv8 SoCs, which can run a complete SLU engine, microcontrollers can only run an inference cache (our unique design). The two projects do not depend on each other, and their contributions are orthogonal.
Additionally, commercial off-the-shelf speech-to-intent engines, such as Rhino from PicoVoice [3] and Wio Terminal [10], cannot match the performance of SC.

CONCLUSIONS
In this work, we propose a novel, hierarchical, learning cache for end-to-end SLU inference. It leverages the benefits of both on-device and cloud infrastructure. SC exploits the temporal locality of voice commands and performs SLU with up to 80% lower latency at comparable accuracy. It can run on an MCU with just 2 MB of memory. Moreover, we implement a series of novel optimizations to increase performance and adapt to the resource constraints of tiny devices.

Figure 3: Raw speech waveforms often exhibit strong variations. Three waveforms of the same sentence ("wake me up at 5am this week"), uttered by the same speaker, from SLURP are shown. The first two are recorded in far-field conditions while the third is recorded at close range (headset).

Figure 6: Benchmark settings, showing the test for one given speaker. We report the accuracy averaged over all speakers.

Figure 7: SC (ours) with only 0.5M parameters and 1.8 MB model size incurs low compute per input and delivers high accuracy.The closest other on-device model is 2x more expensive in FSC and 10x in SLURP-C.

Figure 8: Our optimizations play complementary roles in the overall performance. Ours-DM gives the best accuracy while Ours-TM gives the best latency.

Table 2: Datasets in experiments. See §7.1 for details.

Table 4: Detailed cache performance, showing the cache's efficacy and that L1/L2 complement each other. Filter rate (FR): # of inputs hit at a cache level, normalized by # of inputs received by that cache level. Cache accuracy (CA): of all hits at a cache level, the fraction that the cache level correctly classifies. Measurement from Ours-DM.

Table 5: Our performance on real users. Absolute latency (ms in table) is for a 3-second input.