FM Tone Transfer with Envelope Learning

Tone Transfer is a novel deep-learning technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while keeping their musical form content. Due to its good audio quality results and continuous controllability, it has been recently applied in several audio processing tools. Nevertheless, it still presents several shortcomings related to poor sound diversity, and limited transient and dynamic rendering, which we believe hinder its possibilities of articulation and phrasing in a real-time performance context. In this work, we present a discussion on current Tone Transfer architectures for the task of controlling synthetic audio with musical instruments and discuss their challenges in allowing expressive performances. Next, we introduce Envelope Learning, a novel method for designing Tone Transfer architectures that map musical events using a training objective at the synthesis parameter level. Our technique can render note beginnings and endings accurately and for a variety of sounds; these are essential steps for improving musical articulation, phrasing, and sound diversity with Tone Transfer. Finally, we implement a VST plugin for real-time live use and discuss possibilities for improvement.


INTRODUCTION
Synthesizers can be very expressive instruments, whether controlled by the ubiquitous keyboard [28], by augmented instruments, or instrument-like interfaces [1,25,26], or by whole new sets of gestures enabled by novel controllers [8,35].
Recent developments in integrating Deep Neural Networks (DNNs) with audio generators have renewed interest in using the unaltered audio of a musical instrument as a control source for a synthesizer. One such example is the DDSP architecture and its derivatives [5,12,16,34], which allow real-time control of a synthesizer using a set of features extracted from an input audio signal. It has been used to develop various creative timbre transformation applications [3,4,24,30], which we collectively refer to as Tone Transfer applications.
We situate the scope of our work on audio-based synthesis control for real-time performances, looking at sonic diversity and synthesizer phrasing and articulation. These essential components of musical expression have been thoroughly studied for composition with MIDI for decades [2,43,44], but we argue that they open new challenges and possibilities when considering an audio-based control approach. Transients at the beginnings of notes and the transitions between notes play a vital role in defining the continuity and flow of musical phrasing. We argue that a continuous control approach such as Tone Transfer could potentially learn mappings that capture beginnings, endings, and the links between notes during performance, generating musically articulated synthetic sounds.
In this work, we begin by examining the challenges faced by existing Tone Transfer architectures when it comes to effectively supporting aspects of musical expression such as phrasing, articulation, and sonic diversity. We argue that these challenges are primarily linked to the training methods employed and the commonly used synthesis models. Next, we propose Envelope Learning as a method to circumvent these issues. This technique revolves around designing Tone Transfer architectures that focus on matching synthesis parameters instead of audio features. Since our models learn musical events at the level of synthesis control, they can reproduce quick changes in sound, such as the start and end of musical notes, which are essential for musical phrasing.
We train the models to learn different tones by using patches from a well-known FM synthesizer, which provides a diverse range of sounds to work with. Finally, we implement our models in an audio plugin for real-time performances and reflect on its performance and possibilities for improvement. The source code for training and deployment is available in the online supplement (footnote 1). We expect our models to complement existing Tone Transfer architectures and offer further performance possibilities for live use and sound design.

BACKGROUND

2.1 Synthesis control with audio signals
Audio-based control in synthesis has a longstanding history, exemplified by pioneering instruments like the Roland guitar [19].
In recent times, a prevalent technique involves using onset detectors and fundamental frequency trackers, enabling control of a synthesizer through MIDI signals [11]. This method facilitates the translation of audio input into synthesized sounds, offering a versatile approach to musical control.
However, generating MIDI triggers through explicit note onset detection introduces a bottleneck in the gestural channel [18] between the instrument and the synthesizer, which may hinder an expressive performance. In that regard, timbre characteristics related to musical phrasing, such as variations in dynamics, frequency spectrum, amplitude envelope, and attack transients [27], are compressed into a single scalar velocity value. In MIDI, phrasing is implicitly represented through the timing and velocities of a sequence of note events. Audio-to-MIDI converters typically identify single notes at a time without consideration of longer sequences, so any temporal jitter or inaccuracy in dynamics could result in a disjointed sense of phrasing when the MIDI sequence is replayed by a synthesizer.
In addition to MIDI-based control, alternative strategies have been investigated, involving the extraction of continuous features from the audio signal of a musical instrument to control synthesis processes. In prior work, audio signals from instruments have been utilized as oscillators [29]. This technique employs the audio signal itself as an oscillator for generating synthesized sounds, a special case of an Adaptive Digital Audio Effect [20,39]. However, this approach restricts the method's versatility, as it binds the sonic characteristics of the input directly to the output, limiting its range of sonic possibilities.
Continuous control offers the possibility of better supporting musical expression. Interestingly, by closely analyzing how notes from an audio source are intertwined, we may also be able to facilitate longer phrasing arcs on the synthesizer. The problem now resides in navigating the complexity of the mapping design. What are the features we should extract, and how should we associate them to support a variety of synthetic sounds? There are many degrees of freedom, and the strategy becomes much less evident [31].
One possible answer can be found in the work of Levitin et al., who proposed a valuable framework for analyzing the processes involved in the control of a musical event [21]. They outline distinct stages of control within a musical event, including the beginning, middle, ending, and terminus of the event, and highlight that Digital Musical Instruments (DMIs) often provide greater control over the middle.
In this context, different beginnings and endings can encompass musical articulation and serve as vital contextual links within musical phrases [23]. The audio signal contains this important information within a very short duration and may require direct attention and specific handling to accurately capture and preserve these critical elements. These insights underscore the need for a focused approach to address beginnings and endings explicitly.

Differentiable Signal Processing
The seminal work of van den Oord et al. [38] spawned a novel approach for data-driven audio generation and control called Neural Audio Synthesis. In this context, Deep Neural Networks (DNNs) learn complex synthesizers from audio corpora, which can be used for composition [13], singing voice control [42], timbre transformation [17], and synthesizer parameter estimation [6], to name a few applications.
Engel et al. [12] proposed a method called Differentiable Digital Signal Processing (DDSP) that combines neural networks and DSP modules, such as synthesizers and audio effects, allowing an error signal to be backpropagated through them. This approach enables joint training of the whole pipeline, effectively biasing the network to learn to control the DSP modules. It allows efficient sound generation with DNN models that can comfortably run in real-time on a CPU [14] and yields impressive results on a variety of differentiable synthesis architectures for musical instrument [5,16,34] and singing voice rendering [42,45].

Tone Transfer
Tone Transfer [4] is a promising application enabled by DDSP for audio-based control of synthesizers. The supporting architecture, called DDSP Decoder [12], learns to control parameters of a synthesizer, conditioned by frame-wise fundamental frequency ($f_0$) and loudness sequences extracted from an input audio signal. These characteristics are instrument-agnostic and relate uniquely to musical form; during inference, the model can support any musical instrument signal that contains a tractable $f_0$.
A continuously controllable synthesizer such as the DDSP Decoder can potentially deal with the fine-grained characteristics of note beginnings and endings, essential for phrasing. Nevertheless, we note that certain design decisions related to its architecture and training methods may hinder phrasing, articulation, and sound diversity in a performance setting.
One problem is related to the training process, which aims to resynthesize an audio corpus of a particular instrument from a set of $f_0$ and loudness conditioning sequences, guided by the Multiscale Spectrogram Loss [36,42]. This involves a trade-off between time and frequency resolution [32], affecting the model's capacity to discern and synthesize accurate instrument onsets, which typically happen in the order of tens of milliseconds [41] and are essential to convey distinct articulation and build musical phrases.
Transient rendering is also affected by the synthesizer architectures typically employed in DDSP decoders. In the majority of cases, a harmonic source such as a harmonic synth [12], a waveshaper [16], or a wavetable [34] is paired with a noise synthesizer in a setting that resembles a Spectral Modelling Synthesizer [33]. This configuration is usually not sufficient for an accurate representation of transients [9,40].
Regarding sound diversity, we note that the resynthesis objective implicitly ensures a high correlation between the input and output loudness, as indicated in the original paper [12]. Since different musical instruments have different loudness profiles, in many cases performers expect the dynamic characteristics of the generated audio to be different from those of the input. Losing this degree of freedom may make the learned timbre track the dynamics of the input too closely, producing unnatural sounds and limiting sonic diversity.
Another issue is related to the availability of training data. Single musical instrument datasets are difficult to collect [22], and in many cases, $f_0$ may not be easy to extract, especially for synthetic sounds. This also limits the amount and type of sounds that can be synthesized with Tone Transfer.
Finally, it is worth noting that sound design practitioners are not familiar with the spectral modeling synthesizers typically employed for Tone Transfer. A well-known architecture with interpretable parameters allows performers to intervene in the synthesis process and manipulate results, enhancing the possibilities of pre-trained models [5].

FM Synthesis
Frequency Modulation (FM) synthesis is a well-known method to generate complex sounds from a compact set of synthesis parameters [7]. One of the best-known implementations is the Yamaha DX7, which utilizes a well-established linear FM synthesis architecture that has been used in other works for applications such as sound matching and neural audio synthesis [5,6].
The DX7 generates its distinctive sound using six frequency-modulated sinusoidal oscillators. Programming the synthesizer involves configuring a patch that specifies various parameters for each oscillator. These parameters include the routing, which determines how the oscillators are interconnected (e.g., in a stacked or additive manner), the frequency ratios of the oscillators relative to the played note, as well as the Attack-Decay-Sustain-Release (ADSR) parameters of its Envelope Generators (EGs).
During audio rendering, the oscillators' frequency ratios and routing remain fixed. Instead, the sound dynamics are primarily controlled by the ADSR envelopes. These envelopes modulate the output levels of each oscillator, influencing either their volume or modulation index, depending on their interconnection. Sound design on the DX7 involves configuring the routing, frequency ratios, and ADSR parameters of the EGs.
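The role of the envelopes described above can be illustrated with a minimal two-oscillator sketch. It is not the DX7 algorithm: it assumes a single stacked modulator-carrier pair, realizes FM as phase modulation, and takes one envelope value per audio sample. The function name `fm_two_operator` and all of its parameters are hypothetical.

```python
import math


def fm_two_operator(f0, ratio_mod, ratio_car, mod_env, car_env, sample_rate=44100):
    """Render a stacked modulator -> carrier pair (illustrative, not the DX7).

    mod_env and car_env hold one output level per audio sample (assumption).
    """
    out = []
    phase_mod = 0.0
    phase_car = 0.0
    for mod_level, car_level in zip(mod_env, car_env):
        # The modulator's output level acts as a modulation index (timbre).
        modulator = mod_level * math.sin(phase_mod)
        # The carrier's output level acts as the volume of the audible signal.
        out.append(car_level * math.sin(phase_car + modulator))
        # Frequency ratios stay fixed; only the envelopes change over time.
        phase_mod += 2.0 * math.pi * f0 * ratio_mod / sample_rate
        phase_car += 2.0 * math.pi * f0 * ratio_car / sample_rate
    return out
```

For instance, a modulator envelope that decays from 2.0 to 0.0 while the carrier envelope stays at 1.0 would produce a tone that starts bright and settles into a plain sinusoid at constant volume.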

METHOD
Existing Tone Transfer architectures have shown the ability to learn relationships between control inputs and synthesizer features. However, we have observed certain limitations in terms of transient generation and sonic diversity that could restrict the performative possibilities of the models.
We propose an alternative design method for a model that learns relationships from a dataset of synthesis control signals extracted from synthesizer patches and designed following the musical event control model described by Levitin et al. [21]. The model learns to render note beginnings, middles, and endings directly from a continuous control source. We use an FM synthesizer based on the Yamaha DX7, for which there is a wide variety of sounds available on the web [37].
We divide our approach into three stages, shown in Figure 1, namely a dataset generation step that creates event-aligned sequences from synthesizer patches, a training step we call Envelope Learning that learns a mapping function $g_\theta$ between these sequences, and an inference step where we deploy trained models into a Tone Transfer pipeline and use them to control an FM oscillator block with audio signals.
Our current research shares similarities with our previous work, where we utilized a neural network to control oscillator amplitudes of an FM synthesizer based on a sequence of audio signal features [5]. However, in contrast to our earlier approach, we introduce a new design strategy that (1) avoids the reconstruction objective and Multiscale Spectrogram (MSS) loss, allowing decoupled dynamics; (2) learns an input-to-output mapping at the level of short frames of signal, allowing for accurate transients and avoiding the use of differentiable synthesis components; and (3) does not require an audio corpus for training, and instead can learn from a patch collection of the FM synthesizer.

Dataset Generation
In this step, we create a dataset of $N$ training tuples $(a_i, f_i, o_i)$, $i = 1, \dots, N$, with $a = a_1, \dots, a_T \in \mathbb{R}$ and $f = f_1, \dots, f_T \in \mathbb{R}$ being sequences of length $T$ modeling amplitude and fundamental frequency of a monophonic audio input respectively. $o = o_1, \dots, o_T \in \mathbb{R}^6$ represents the linear output level envelopes of the six FM oscillators, which we extract from a synthesizer patch. For training, we use $(a, f)$ as input sequences to our model, and $o$ as supervision.
In order to generate the input sequences $(a_i, f_i)$ we take into consideration the model proposed by Levitin et al. [21]. To simplify the dataset generation process, we only consider separate notes as musical events. For our Tone Transfer use case, an explicit note beginning is determined by a sudden change in the input amplitude contour $a$ and a valid $f_0$ detected in $f$. During the middle, the amplitude and fundamental frequency are sustained over time. Finally, the ending of a note is characterized by a decay trajectory in amplitude, while the $f_0$ remains valid until the terminus. Considering this, we can model our amplitudes $a$ with a trapezoid generator, that is, a step generator plus a decay ramp. The fundamental frequency contour of a note can be represented with a step generator.
To obtain $o_i$, we use a Python implementation of the Yamaha DX7 ADSR Envelope Generators (EGs) adapted from a well-known emulator [15]. These EGs can be programmed with a synthesizer patch $p$ and actuated through MIDI to obtain the amplitude envelope sequences of the six oscillators.
The dataset generation starts with a designer selecting a synthesizer patch $p$ they want to enable for Tone Transfer. We program the ADSR parameters of the EGs with $p$ and generate a set of MIDI notes of random duration and with random velocity and note values.
For each note, we obtain the oscillator envelope sequences $o_i$, and create the aligned input sequences $(a_i, f_i)$ following a simple set of rules. When a "NOTE ON" message is received, we generate a step response with an amplitude proportional to the velocity and note value for $a$ and $f$ respectively. After the "NOTE OFF" event is received, a linear decay ramp is rendered in $a$ until the last oscillator envelope in $o$ reaches zero. During this time $f$ remains valid and then is set to zero.
Next, the input sequences are normalized to [0, 1], establishing a linear range of $a$ and $f$ corresponding to MIDI velocity and MIDI note values respectively. The oscillator envelope sequences fall in the range of [0, 2]; they are also normalized to a range within [0, 1]. Finally, all sequences are padded with zeroes before and after so that each training tuple features the same length.
The end result is a dataset that aligns two characteristics of musical events ($a$ and $f$) to synthesizer controls ($o$) that can render a specific timbre obtained from the patch. Since the input $a$ is proportional to the velocity, note beginnings are characterized by different amplitude discontinuities which, in turn, are aligned with the oscillator envelopes $o$ that render different onsets. Middles are mostly aligned with the decay and sustain parts of $o$, and the decaying sections of note endings are synchronized with the release sections of the envelopes. Figure 2 shows a plot of the first six training instances of a dataset generated from the "E.PIANO 1" patch, a well-known DX7 electric piano patch, illustrating the synchrony between inputs and oscillator envelopes.
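The rules above for the input sequences can be sketched for a single note as follows. This is a minimal illustration, not the authors' generator: the function name `make_note_inputs` and its parameters are hypothetical, and the length of the decay ramp is passed in directly, whereas in the text it is set by the frame at which the last oscillator envelope reaches zero.

```python
def make_note_inputs(velocity, note, hold_frames, release_frames,
                     total_frames, lead_frames=0):
    """Build normalized amplitude and pitch input sequences for one note.

    velocity, note: MIDI values in 0..127, mapped linearly to [0, 1].
    hold_frames: frames between NOTE ON and NOTE OFF (beginning and middle).
    release_frames: length of the linear decay ramp (ending); assumed given.
    lead_frames: zero padding placed before the note.
    """
    level = velocity / 127.0
    pitch = note / 127.0
    amp = [0.0] * lead_frames
    f0 = [0.0] * lead_frames
    # Beginning and middle: step response in both amplitude and pitch.
    amp += [level] * hold_frames
    f0 += [pitch] * hold_frames
    # Ending: linear ramp down to zero while the pitch stays valid.
    for t in range(release_frames):
        amp.append(level * (1.0 - (t + 1) / release_frames))
        f0.append(pitch)
    # Terminus: pad with zeroes up to the common sequence length.
    pad = total_frames - len(amp)
    if pad < 0:
        raise ValueError("note does not fit in total_frames")
    return amp + [0.0] * pad, f0 + [0.0] * pad
```

The matching oscillator envelopes would come from the envelope generator emulation, which is not reproduced here.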
It is important to recognize that musical instrument notes can often display ambiguous behavior, and it is not always the case that a decrease in amplitude indicates the end of a note or a sustained amplitude indicates the middle. In a causal setting for real-time use, we cannot be sure that a note is ending even if there is a decaying amplitude trajectory in the input. Although the input trajectories used in this setting may not fully represent a real-world scenario, they are valuable in demonstrating the proof of concept and analyzing opportunities for improvement.

Envelope Learning
To implement our system, we need a neural network model that can learn the temporal relationships between the inputs $(a, f)$ and the oscillator envelopes $o$, rendering the attack and decay sections of the control sequences after a discontinuity in the input is detected, and generating note ends accordingly when a decay trajectory is detected in the input.
To this end, and following the design of other Tone Transfer architectures, we employ a model that features a stateful Gated Recurrent Unit (GRU) and a linear layer as output layer. The GRU is a causal model that works frame-by-frame, is conditioned by the $a$ and $f$ sequences, and learns the relationships between current and past inputs, producing a hidden state that is projected with the linear layer into six controls for the oscillators. We denote the neural net as the parameterized function $g_\theta$, as shown in Eqn. 1, where $t$ denotes the frame index.
$$\hat{o}_t = g_\theta(a_t, f_t) \qquad (1)$$
Since we do not employ audio during our training process, we train the network by conditioning it with $a$ and $f$, and using the oscillator envelopes $o$ from the dataset as supervision. We use the L1 loss between the oscillator envelope predictions and ground truth as the minimization objective: $\mathcal{L} = \lVert o - \hat{o} \rVert_1$. We call this process Envelope Learning.
The L1 loss aims to match every single frame that is generated by the network directly with the ground truth. This is unlike the DDSP-based methods, which learn to control envelopes indirectly by employing a resynthesis objective, a spectrogram audio loss, and noise synthesizers. That indirect approach results in limited transient resolution, as explained in the previous section. Learning in a direct fashion allows us to explicitly address and reproduce transients during training.
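To make the frame-by-frame, stateful behavior and the per-frame objective concrete, the following is a small, untrained stand-in written with the standard library only. It is a sketch under several assumptions: the class `TinyEnvelopeGRU` and the function `l1_loss` are hypothetical names, the weights are random, the hidden size is far smaller than the one used in this work, the loss is averaged rather than summed, and the textbook GRU update is used, which differs slightly from PyTorch's `nn.GRU` in where the reset gate is applied.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights, bias, vec):
    """Return weights @ vec + bias, with weights as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class TinyEnvelopeGRU:
    """Untrained, illustrative GRU cell plus linear output layer."""

    def __init__(self, hidden_size=8, n_outputs=6, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        n_in = 2 + hidden_size  # [a_t, f_t] joined with the hidden state
        self.hidden_size = hidden_size
        self.w_update, self.b_update = mat(hidden_size, n_in), [0.0] * hidden_size
        self.w_reset, self.b_reset = mat(hidden_size, n_in), [0.0] * hidden_size
        self.w_cand, self.b_cand = mat(hidden_size, n_in), [0.0] * hidden_size
        self.w_out, self.b_out = mat(n_outputs, hidden_size), [0.0] * n_outputs
        self.reset_state()

    def reset_state(self):
        """Zero the hidden state, e.g. before a new note."""
        self.h = [0.0] * self.hidden_size

    def step(self, a_t, f_t):
        """Consume one frame of inputs and return six oscillator levels."""
        x = [a_t, f_t]
        z = [_sigmoid(v) for v in _affine(self.w_update, self.b_update, x + self.h)]
        r = [_sigmoid(v) for v in _affine(self.w_reset, self.b_reset, x + self.h)]
        gated = [ri * hi for ri, hi in zip(r, self.h)]
        cand = [math.tanh(v) for v in _affine(self.w_cand, self.b_cand, x + gated)]
        # Blend the previous state with the candidate state.
        self.h = [(1.0 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, self.h)]
        return _affine(self.w_out, self.b_out, self.h)


def l1_loss(pred_seq, target_seq):
    """Mean absolute error over all frames and oscillators."""
    total, count = 0.0, 0
    for pred, target in zip(pred_seq, target_seq):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count
```

Because the hidden state persists between calls to `step`, the output for a frame depends on the history of the inputs, which is what would let a trained model unfold an attack or a release over many frames after a single discontinuity in the amplitude input.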

Inference
Model inference takes place within the Tone Transfer pipeline, which takes an input audio signal $x_n$ and yields a synthesized output $y_n$, with $n$ denoting the audio sample index. Similar to other Tone Transfer approaches, we divide this pipeline into three stages:
(1) Feature Extraction, which obtains aligned features $\hat{a}_t$ and $\hat{f}_t$ from the input audio, related to input amplitude and fundamental frequency $f_0$ respectively, with $t$ denoting the frame index. The features are extracted from the input audio across an analysis window of length $W$. $A(\cdot)$ may denote a signal amplitude or power estimator algorithm, $F(\cdot)$ an $f_0$ tracker, and $\eta(\cdot)$ a normalization function that maps $f_0$ into the range [0, 1].
$$\hat{a}_t = A(x_{tW}, \dots, x_{(t+1)W}), \qquad \hat{f}_t = \eta(F(x_{tW}, \dots, x_{(t+1)W}))$$
(2) Control Prediction, where we use our neural network $g_\theta$ to infer a set of frame-wise FM synthesis controls, the oscillator output levels $\hat{o}_t$, from the conditioning signals $\hat{a}_t$ and $\hat{f}_t$.
(3) An FM oscillator bank renders a window of $W$ audio samples of $y_n$ from the output levels $\hat{o}_t$ and the fundamental frequency $f_0$.
We configure the bank with the oscillator routing and frequency ratios of the patch $p$ used to train $g_\theta$, although this can be changed during inference.

IMPLEMENTATION

4.1 Training
We select a set of common DX7 patches and create a training dataset for each one of them. Next, we train one neural net model per patch following our Envelope Learning method.
For each patch, we generate 1000 random MIDI notes with velocities between 1 and 127, and note values between 0 and 127. We set a random duration for each note between 600 and 732 frames. Next, we generate the aligned input and oscillator envelope sequences $a$, $f$, and $o$, as described in Section 3.1. Finally, we pad them with zeroes to reach a final size of 1000 frames per instance, so that the active notes occupy about two-thirds of the total length. We split the dataset with a ratio of [0.8, 0.1, 0.1] for training, validation, and testing respectively.
Our neural net features a GRU with a hidden size of 128. We empirically choose this value as we note that the training loss does not improve with bigger models, and to keep the computing requirements low. We train one model per dataset for a total of 120000 steps, using the Adam optimizer with a learning rate of 1e-3, a learning rate decay of 0.98 every 10000 steps, and a batch size of 32 instances. We use PyTorch as the training framework. The process takes about four hours per model using a single NVIDIA GeForce RTX 2080 Ti GPU.
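The learning rate schedule and the dataset split stated above can be written out as a small sketch. It assumes a staircase decay (the rate drops by a factor of 0.98 at every multiple of 10000 steps), which the text does not specify; the helper names `learning_rate` and `split_sizes` are hypothetical.

```python
def learning_rate(step, base_lr=1e-3, decay=0.98, decay_every=10_000):
    """Staircase exponential decay (assumed form of the schedule)."""
    return base_lr * decay ** (step // decay_every)


def split_sizes(n_items, ratios=(0.8, 0.1, 0.1)):
    """Return (train, validation, test) counts for a dataset of n_items."""
    n_train = round(n_items * ratios[0])
    n_val = round(n_items * ratios[1])
    # Give the remainder to the test set so the counts always add up.
    return n_train, n_val, n_items - n_train - n_val
```

Under these assumptions, the 1000 notes of each dataset would be divided into 800 training, 100 validation, and 100 test instances, and the learning rate would stay within the same order of magnitude over the whole run.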
To assess the effectiveness of the training process, we compare on the test set the absolute distance between the ground truth oscillator envelopes and the predictions, $\lVert o - \hat{o} \rVert_1$, for each trained model.
Furthermore, we set out to assess the capabilities of each of the trained networks for synthesizing audio with the learned timbre. Firstly, we render audio using both the ground truth envelopes $o$ and the predicted envelopes $\hat{o}$, using an FM oscillator block configured with the oscillator routing and ratios extracted from the patches used to train the models. We employ the fundamental frequency $f_0$ extracted from the normalized MIDI note values present in the sequences $f$ of the test set.
Next, we compute the signal-to-noise ratio (SNR) as a power quotient of our reference and an error computed from the sample-by-sample difference between both signals. We compute the SNR in decibels (dB), as shown in Eqn. 5.
We use this metric to assess the reconstruction quality of note beginnings and endings. To account for note beginnings, we aggregate the first 100 milliseconds of each note in the test set for both rendered audios and then compute the SNR on these signals, obtaining $\mathrm{SNR}_{\mathrm{beg}}$. We use 100 ms to account for the different onset times that the models present. Furthermore, we identify the ending sections of each note by looking at the decaying ramp in $a$, which is aligned with $o$ in our dataset. We aggregate the audio samples of these sections for each note and compute $\mathrm{SNR}_{\mathrm{end}}$. We aggregate the rest of the audio section of each note, between the note onset and the start of the decaying ramp, and compute $\mathrm{SNR}_{\mathrm{mid}}$ at the note middle.
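The SNR described above, a power quotient between the reference and the sample-by-sample error expressed in decibels, can be sketched as follows. The function name `snr_db` is hypothetical, and the handling of the zero-error and silent-reference cases is an assumption added for robustness.

```python
import math


def snr_db(reference, estimate):
    """SNR in dB between a reference signal and an estimate of it."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(r * r for r in reference)
    # Error is the sample-by-sample difference between both signals.
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_power == 0.0:
        raise ValueError("reference signal is silent")
    if error_power == 0.0:
        return math.inf  # identical signals (assumed convention)
    return 10.0 * math.log10(signal_power / error_power)
```

With this definition, an SNR of 20 dB corresponds to an error power equal to 1% of the reference power, and every additional 10 dB divides the error power by ten.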
Table 1 shows the results for the metrics. The low $L_1$ loss indicates that our model is able to minimize the training objective and predict the oscillator envelope sequences from the conditioning signals. This translates into an adequate reproduction of note beginnings, middles, and endings; the SNR metrics show that even in the worst case, the models can render the note sections with no more than about 1% of power error.
Although these results do not represent our models' performance capabilities when deployed in a Tone Transfer pipeline, they show that our training objective allows the networks to learn the envelope contours of the oscillators for different timbres, and that the networks can accurately render the beginnings, middles, and endings of notes when conditioned with the continuous input sequences $a$ and $f$.

Deployment
We implement the Tone Transfer pipeline in a real-time audio plugin using JUCE and LibTorch, PyTorch's C++ API. Our prototype can load new neural net models and FM configurations, supporting all the learned timbres. It runs in real-time and performs inference and synthesis at a frame rate of 690 Hz to render audio at a sample rate of 44.1 kHz, similar to our Yamaha DX7 reference implementation [15].
Within the pipeline, we extract frame-wise fundamental frequency $f_0$ using the YIN algorithm [10] with an analysis window of 1024 samples, which yields a minimum detectable frequency of about 90 Hz. We compute the conditioning signal $\hat{f}_t$ by converting the fundamental frequency values from Hz to MIDI note value and then normalizing to [0, 1], as shown in Eqn. 6. Furthermore, when a valid fundamental frequency is not detected, the extractor returns zero.
$$\hat{f}_t = \frac{12 \log_2(f_0 / 220) + 57.01}{127} \qquad (6)$$
Next, we supply the continuous amplitude input for our system, $\hat{a}_t$, from a decibel-scale RMS detector. We compute it block-wise over a sliding window of $W$ samples, clamping the minimum value to -70 dB and normalizing to the range 0 to 1.
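The two conditioning signals can be sketched in a few lines. The pitch mapping follows Eqn. 6; the amplitude mapping assumes that 0 dB corresponds to a full-scale RMS of 1.0 and that the range from -70 dB to 0 dB is mapped linearly onto [0, 1], neither of which is stated in the text. The function names and the final clamping of the pitch value are likewise assumptions.

```python
import math


def normalize_f0(f0_hz):
    """Eqn. 6: Hz -> MIDI note value -> [0, 1]; zero means no pitch found."""
    if f0_hz <= 0.0:
        return 0.0
    midi_note = 12.0 * math.log2(f0_hz / 220.0) + 57.01
    # Clamp to the valid range (assumption, guards against extreme inputs).
    return min(max(midi_note / 127.0, 0.0), 1.0)


def normalize_rms_db(frame, floor_db=-70.0):
    """Decibel-scale RMS of one analysis window, mapped to [0, 1]."""
    mean_square = sum(x * x for x in frame) / len(frame)
    if mean_square <= 0.0:
        return 0.0
    level_db = 10.0 * math.log10(mean_square)  # same as 20 * log10(rms)
    level_db = min(max(level_db, floor_db), 0.0)
    return (level_db - floor_db) / -floor_db
```

For example, an input at 220 Hz maps to 57.01 / 127, roughly 0.45, and any window quieter than -70 dB maps to zero.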
Since our datasets (and therefore, our trained models) present a linear amplitude range in $a$, our system tries to match normalized RMS in decibels to envelope variations associated with MIDI velocity. Next, our model predicts the current envelope values for the FM synth, which are interpolated from frame rate to sample rate and used for the synthesis process. The fundamental frequency $f_0$ is also linearly interpolated and used to drive the oscillators at the synthesis step.
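The step from frame rate to sample rate can be illustrated with a simple linear interpolator. This is a sketch of one possible approach, ramping from each frame value to the next over one hop; the function name `upsample_linear` is hypothetical and the plugin's actual interpolation code may differ.

```python
def upsample_linear(frame_values, hop_size):
    """Linearly interpolate one value per frame to one value per sample.

    Each pair of consecutive frame values yields hop_size samples that
    ramp from the first value towards the second.
    """
    samples = []
    for prev, nxt in zip(frame_values, frame_values[1:]):
        for n in range(hop_size):
            samples.append(prev + (nxt - prev) * n / hop_size)
    return samples
```

Interpolating in this way avoids the audible steps that would appear if each predicted level were held constant for a whole hop.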
Furthermore, we reset the model's hidden state to all zeroes when both conditioning sequences are zero. This ensures that the model starts from a known state to process a new incoming note. The plugin runs on a 2021 MacBook Pro with a USB audio interface running at 44.1 kHz and a hop size of 64 samples, yielding a pipeline delay of 3 ms including buffering.
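The timing figures above can be checked with a short calculation. The helper names are hypothetical, and the sketch only covers the contribution of the hop size; it does not model the audio interface buffering that is included in the reported delay.

```python
def hop_duration_ms(hop_size, sample_rate):
    """Duration of one hop of audio in milliseconds."""
    return 1000.0 * hop_size / sample_rate


def frame_rate_hz(hop_size, sample_rate):
    """Number of control frames produced per second."""
    return sample_rate / hop_size
```

A hop of 64 samples at 44.1 kHz lasts about 1.45 ms and gives about 689 frames per second, in line with the frame rate of roughly 690 Hz quoted for the plugin.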

DISCUSSION
Our model offers the capability to generate a wide range of timbres on an FM synthesizer by learning the dynamic trajectories of oscillator envelopes reflected in the dataset. Our approach effectively replaces the traditional envelope generator of the DX7 with a recurrent neural network (RNN) that provides continuous controllability instead of MIDI.
In this context, the dataset generation approach serves as the bridge between explicit note beginnings and endings, which are event-based, and the continuous control framework. When trained with our Envelope Learning method, the network is able to learn and reproduce the rapid beginnings and endings of notes, even without explicit information about note boundaries.
Previous Tone Transfer architectures learn to control synthesizers indirectly by minimizing an audio loss of a resynthesis task, using spectrogram losses that act upon long windows of audio. These effectively look at the middle of musical events and present a limited temporal resolution for beginnings and endings. In contrast, our approach overcomes this limitation by learning a direct correspondence between inputs and synthesis parameters at a control level. This allows for precise rendering of the transient characteristics of the learned timbre, provided we have a representation of that timbre available in the form of a synthesizer patch. We argue these are the first steps for achieving expressive and nuanced synthesizer articulation with Tone Transfer algorithms.
We suggest that these results encourage further exploration of the Envelope Learning technique by building further input-output associations. One possibility would be to align additional input features, such as spectral features, to other sound characteristics like attack and decay rates. Another would be to introduce multiple notes per training instance to explicitly model phrasing in context, modeling note events of specific musical instruments in the dataset for a more nuanced control. Other alternatives include exploring further conditioning choices to assess responsivity in terms of dynamics, modifying the trapezoidal amplitude note model in $a$ to better account for particular instruments and input detectors, or training using patch data from other synthesizer architectures.

Reflections on performances
As a proof of concept, we record two musicians using our audio plugin in real-time, playing guitar and saxophone (footnote 2). We select these two source instruments since they provide very different volume dynamics and articulations to drive our plugin. We record three models, trained with electric piano, strings, and brass patches respectively.
We informally observe that our Tone Transfer approach can effectively render timbre from the learned patches, including note beginnings. This is reflected particularly well in the example of the guitar controlling the electric piano, which shows a bright attack at the beginning of those notes that are not legato. Next, in the guitar example that plays a string tone, the synthesizer features a slow attack, even though the guitar is plucked, showing that our model does not project input loudness to the output, as DDSP does.
Note endings are much more difficult to assess since their generation depends on a decaying amplitude envelope presented by the audio input. On the guitar, the decay envelope may be too fast to render the learned note ending before the fundamental frequency can no longer be tracked.
For the case of the saxophone as a control source, note beginnings and endings are not that clear, but this is to be expected as it presents a much different amplitude contour, including amplitude modulations that were not accounted for during dataset generation. These may force the model to re-render note characteristics of beginnings or endings, which can be observed when the saxophone controls the electric piano.
On the other hand, we note a lack of dynamic range in the synthesis output. We argue that this is due to the fact that the patches were originally designed to be played with a keyboard with MIDI notes and velocity controls. For the piano patch, for instance, we observe that a low-velocity value still produces a signal with high amplitude but less brightness. Redesigning the patches to obtain higher variations in output amplitude and retraining the models may improve the results.

CONCLUSIONS
Transforming the audio of an instrument to a synthetic sound is a challenging task, as it involves a one-to-many relationship. Each instrument has its unique timbral palette, dynamic contour, and articulation possibilities, which can vary significantly even among instruments of the same type. On the other hand, the sound produced by a synthesizer can be highly versatile, and only a subset of the source instrument's characteristics may be desired in the output.
We can argue that there is no definitive "gold standard" that can provide a baseline mapping between an instrument's audio and a synthetic sound: tradeoffs are necessary to find viable solutions.
In this work, we first analyzed current Tone Transfer architectures and identified a tradeoff in their rendering capabilities: these models learn new timbres from audio corpora and can project the input loudness to the output, at the expense of a good resolution of note beginnings and endings which are essential for musical articulation and phrasing.
In light of the analyzed shortcomings, we presented Envelope Learning, a design method where a model learns a set of input-to-synthesis-parameter correspondences and accurately replicates note beginnings and endings. The tradeoff, in this case, is on the note middles: we use a simplified musical note model for our dataset generation that does not consider variations in amplitude or pitch during an event. This works well during testing in the training environment but may result in unexpected transitions and reduced dynamic range when used in a Tone Transfer setting, especially with sustaining instruments. We leave for future work an assessment of the performance possibilities of our algorithm and an exploration of techniques to overcome current limitations.
Finally, we implemented a Tone Transfer pipeline in an audio plugin for real-time performance, taking a step towards improving sound diversity and phrasing capabilities for audio-based control of synthesizers. Our system bridges the sonic diversity gap of previous approaches, learning new sounds from a vast number of DX7 patches whose timbre can now be continuously controlled with musical instruments. We hope that our work motivates further research in model design with the goal of improving phrasing and articulation in real-time neural synthesizers controlled by musical instruments.

Figure 1: Design steps for our Tone Transfer system. a) We create a synthetic dataset of aligned sequences $(a, f, o)$. $a$ and $f$ model the frame-wise amplitude and $f_0$ trajectories of a monophonic audio signal, while $o$ are the oscillator output levels of an FM synthesizer programmed with a patch $p$. b) We train a Recurrent Neural Network model $g_\theta$ to learn the correspondences between the features $a$, $f$ and the controls $o$ reflected in the dataset. c) We deploy the RNN into a Tone Transfer pipeline. In this context, $g_\theta$ processes frame-wise input features from real audio $x_n$, and controls the envelopes of an FM oscillator bank configured according to $p$.

Figure 2: Five training tuples of a dataset extracted from the "E.PIANO 1" patch, showing the input sequences $a$ and $f$ and the synchronized envelopes $o$.

Table 1: Envelope absolute error and audio SNR at beginnings and endings of the test set notes for trained models.