A Reward Modulated Spike Timing Dependent Plasticity inspired algorithm applied on a MultiLayer Perceptron

This work presents a framework in which traditional Machine Learning and neuromorphic algorithms compete to solve a shared Reinforcement Learning environment. In addition, this configuration allows the exploitation of modern and widely used Machine Learning libraries. The PyTorch framework is used to investigate the expanded capabilities and potential of training an actor-critic network pair composed of specialised units using a custom learning algorithm. The policy and value networks used in this context are fully connected MultiLayer Perceptrons. The training procedure employs two distinct algorithms: an algorithm inspired by Reward Modulated Spike Timing Dependent Plasticity and the conventional Back Propagation technique. A comparative evaluation and analysis of the findings is performed.


INTRODUCTION
Currently, Machine Learning (ML) is widely regarded as the most prevalent type of Artificial Intelligence (AI) due to its ability to address a wide variety of challenging problems. As a result, a plethora of tools, frameworks, and algorithms have been developed to improve the efficacy and efficiency of machine learning applications [34]. Reinforcement Learning (RL) is a subfield of machine learning that has gained significant popularity. RL aims to optimise the cumulative reward obtained over a sequence of actions, rather than focusing solely on the immediate rewards associated with individual actions. Consequently, its effectiveness is increasingly notable in fields ranging from technology to health care to playing games [23]. As a result of this demonstrated effectiveness, companies such as OpenAI and DeepMind have focused their research efforts on advancing reinforcement learning. The publication of DeepMind's 'ATARI paper' in 2013 [20], which investigated the application of RL agents to playing Atari games, sparked considerable interest and captivated both the general public and the scientific community. The OpenAI Gym, which was introduced in 2016 [6], offers a standardised application programming interface (API) for RL environments, thereby providing substantial assistance to researchers in this field.
An additional subfield of ML research that is gaining prominence is the investigation of Spiking Neural Networks (SNNs) [33]. SNNs are more biologically plausible than their traditional counterparts. Hence, research on them can provide a deeper understanding of the brain's low power consumption, particularly in contrast to artificial neural networks. Additionally, it can potentially contribute to the advancement of more efficient brain-computer interfaces. In recent years, there has been notable progress in the field of SNN simulation tools and neuromorphic algorithms, facilitating their study and expanding their potential applications. There has also been a gradual emergence of SNN hardware. The integration of RL and SNNs is gradually attracting interest within the academic community. According to Izhikevich [14], this approach has the potential to provide insights into various processes, such as the influence of dopamine on learning. It can also contribute to the development of robots that are significantly more energy efficient [5]. Nevertheless, there has been limited progress in integrating the breakthroughs from both disciplines.
The resulting question is how a setup can be constructed in which classical ML and neuromorphic algorithms can be compared on the same virtual environment while the ability to use modern ML libraries is retained.
In line with this perspective, the objective is to leverage the expanding capabilities of PyTorch [25] in a manner that provides adequate adaptability, enabling the training of a network comprising customized units with a specialized learning algorithm. The model should be able to learn a Farama Foundation Gymnasium RL environment [6]. Therefore, the learning algorithm should be able to communicate with the environment and use its output, especially the rewards, for the successful training of the model.
To validate the efficacy of the proposed methodology, the implemented configuration is assessed by training an intelligent agent to solve the CartPole environment within the Gymnasium framework [6]. The learning algorithm uses Proximal Policy Optimization (PPO) [30]. The generated loss function is handled by an Adam optimizer [16], but instead of applying the usual Back Propagation (BP), a variant inspired by Reward Modulated Spike Timing Dependent Plasticity (R-STDP) [15] is applied. For simplicity, the policy and value networks are fully connected MultiLayer Perceptrons (MLPs). Each layer, however, corresponds to a custom torch.nn.Module [25] and can therefore easily be replaced with any type of layer: classical, spiking, recursive or hybrid.
Since its introduction, BP has been the main and most successful optimization algorithm used in artificial neural networks [10]. It is based on the evaluation, through the application of the chain rule, of the derivatives of each layer's error signal with respect to the parameters of the layer's units [28]. However, as Weiderman points out, BP is not a biologically faithful process [35]. Biological neurons use their own group of learning processes, of which STDP is perhaps the best documented and understood. It is expressed through the processes of Long Term Potentiation (LTP) and Long Term Depression (LTD) [4] and provides a biological explanation for Hebbian learning [24].
The remainder of this work is organized as follows. Section 2 provides an overview of related models and applications. Section 3 presents an analytical description of the overall methodology. Section 4 presents the results. Finally, Section 5 concludes on the work's findings and discusses future plans.

RELATED WORK
The subsequent paragraphs provide a concise overview of prior research, with a particular emphasis on the use of neuromorphic training algorithms in various applications. Specifically, Spike Timing Dependent Plasticity (STDP) [4] is examined in Supervised Learning (SL) and RL, through the BP of an error or reward signal respectively. A case of SNN training through BP is also examined.

STDP in SL
A target-reaching navigation system for a mobile vehicle is proposed by Bing et al., based on an R-STDP learning rule and a Leaky Integrate and Fire SNN [5]. In their approach, the weight modification is the product of an STDP component, an annealing learning rate, a reward signal local to each synapse, and an eligibility trace local to each synapse that is designed to represent the synaptic efficacy. The reward signal is back-propagated to every synapse and its local value is evaluated based on every weight contribution.
Liu et al. [19] proposed a supervised STDP as an efficient training method for an SNN classifier. The multilayer network uses Integrate and Fire neurons and is trained to classify the MNIST dataset [7]. In the proposed approach, only the first spike of the spike train and its timing carry significant information. The error signal is computed in the usual way, as the Jacobian of the loss function with respect to the weights. The STDP components take part in this computation as partial derivatives of the output's first spike timing with respect to the weights.

STDP in RL
An obstacle-avoiding navigation system for a mobile robot is developed by Shim and Li [31]. The proposed model uses a feed-forward Leaky Integrate and Fire SNN with one hidden layer. The network is trained through additive R-STDP. An eligibility trace is introduced which keeps track of the STDP contribution to the weight change.
The reward signal at any given moment is global for the whole network. The resulting weight change is dictated by the product of the current weight, the learning rate, the global reward, and the eligibility trace.
Mozafari et al. [21] categorize their work as RL because they also use R-STDP. However, the proposed SNN is trained on image classification. Therefore, there do not seem to be any sparse rewards involved. In an approach similar to that of Liu et al. [19], Mozafari et al. consider the timing of the first spike to carry all the significant information. A reward-punishment mechanism is proposed and a comparison is carried out between traditional STDP and R-STDP.

BP in SNNs
Closely related to the present research is the work of Esser et al. [9]. The researchers use a non-neuromorphic training network and a neuromorphic deployment network. They apply a BP rule to the training network and then copy the weight updates to the deployment network. The output of each training network unit represents the probability that the corresponding neuron of the deployment network spikes. In the present paper, in a manner almost antithetical to the above publication yet at the same time similar to it, the authors regard the output of each MLP unit as the probability that the corresponding neuron of a hypothetical SNN spikes.

METHODOLOGY
The following paragraphs provide a detailed description of the constructed setup. The parameters and details of the use case are also presented and explained. The developed code is available on GitHub [11].

Experimental Setup and RL Environment
All the simulations are conducted on a DELL XPS 15 9570 [12] equipped with an Intel i7-8750H [29] processor and 16GB of RAM.
The graphics card is an NVIDIA GeForce GTX 1050 Ti [17]. All code is written in Python 3. Version 3.11.3 64-bit [25] is used during all runs. The PyTorch version is 2.0.1 with CUDA 11.8 [25].
The RL environment used is CartPole [6], which is part of Gymnasium's Classic Control environments. A cart, moving along a line, tries to balance an unstable upright pole under the effect of gravity. The version used is CartPole-v1. The implementation is able to use parallel environments. During all runs the algorithm collects data from four parallel synchronized environments. They are all wrapped together as a single gymnasium.vector.SyncVectorEnv subclass [6].

Networks and Parameters' Update
Throughout this subsection, indices i and j denote the corresponding input and output units respectively. Index l corresponds to the number of the example in a mini-batch of size n. Index k denotes the layer level of the corresponding unit. Furthermore, a, z, and w represent the signal after activation (that is, after application of the sigmoid function), the signal before activation, and the synaptic weight respectively. This notation remains consistent across all equations in the present publication.
An Agent is created for the control of the environment. It consists of two MLP [27] networks, an actor and a critic. Each of their layers comprises 64 fully connected units. Each unit implements the Perceptron model without bias, as shown in equations (1) and (2).
The MLPs are implemented with the PyTorch tensor library [25]. The usual implementation of a layer of perceptrons in PyTorch is a sequence of the torch.nn.Linear and torch.nn.Sigmoid classes. Instead, the perceptron layer is implemented as a single subclass of torch.nn.Module. This approach allows different types of unit layers (linear, nonlinear, recursive, neuromorphic) to be defined as separate modules.
Through the extension capabilities of PyTorch autograd [25], different custom BP methods can be implemented for each module, offering the opportunity to compare them. A similar method and implementation is followed by Liu et al. [19] in their application of STDP to SL. Autograd normally performs the task of generating a Jacobian matrix of the loss function with respect to the weights. As stated in the documentation [25], torch.autograd.backward computes the gradients of the given tensors with respect to the graph leaves. This Jacobian is passed by the step method of the torch.optim optimizer to the optimization algorithm, which is the one performing the weight updates. In this work, two different methods are implemented.
The first implementation performs the same operations and chain rule differentiation as the intrinsic autograd does, in order to perform the usual BP on an MLP [28]. It is included mainly for comparison but also for additional validation, especially during the early phases of development, in order to test that the overall process produces consistent results. The underlying chain rule describing the differentiation between layers, for the case where no biases are present, is presented in equations (3), (4) and (5). The second implementation generates a matrix in which each element corresponds to an appropriate weight correction, but instead of being generated by chain rule differentiation, a different scheme is applied, inspired by R-STDP. STDP is a bio-plausible procedure that provides the biological neuron with a learning mechanism [4]. When STDP is combined with biological reward systems, such as the dopamine system, it yields R-STDP. R-STDP can explain RL in living organisms, as shown and also modeled by Izhikevich [15].
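For orientation, a bias-free perceptron with sigmoid activation is conventionally written as below. This is a sketch in the notation introduced above, under the assumption that the activations of layer k-1 feed layer k; it is not necessarily the exact form of equations (1) and (2).

```latex
z^{k}_{j} = \sum_{i} w^{k}_{ij}\, a^{k-1}_{i},
\qquad
a^{k}_{j} = \sigma\!\left(z^{k}_{j}\right) = \frac{1}{1 + e^{-z^{k}_{j}}}
```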
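The first, BP-reproducing implementation can be illustrated with a minimal sketch of a bias-free sigmoid layer whose backward pass is written by hand through `torch.autograd.Function`. The names `SigmoidLayerFn`, `SigmoidLayer` and `max_grad_difference` are hypothetical and chosen for illustration; this is not the code of the repository [11]. The helper compares the hand-written gradients with those of the built-in autograd, mirroring the validation role described above.

```python
import torch


class SigmoidLayerFn(torch.autograd.Function):
    """Bias-free sigmoid layer with a hand-written chain-rule backward pass."""

    @staticmethod
    def forward(ctx, a_in, weight):
        # a_in: (n, in_features), weight: (out_features, in_features)
        a_out = torch.sigmoid(a_in @ weight.t())
        ctx.save_for_backward(a_in, weight, a_out)
        return a_out

    @staticmethod
    def backward(ctx, grad_out):
        a_in, weight, a_out = ctx.saved_tensors
        delta = grad_out * a_out * (1.0 - a_out)  # dL/dz via the sigmoid derivative
        grad_in = delta @ weight                  # dL/da_in, sent to the previous layer
        grad_w = delta.t() @ a_in                 # dL/dW, later read by the optimizer
        return grad_in, grad_w


class SigmoidLayer(torch.nn.Module):
    """One layer as a single Module, so other layer types can be swapped in."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(out_features, in_features))
        torch.nn.init.orthogonal_(self.weight)

    def forward(self, a_in):
        return SigmoidLayerFn.apply(a_in, self.weight)


def max_grad_difference(seed=0):
    """Largest absolute gap between the custom and the built-in gradients."""
    torch.manual_seed(seed)
    layer = SigmoidLayer(4, 3).double()
    x = torch.randn(5, 4, dtype=torch.double, requires_grad=True)

    layer(x).pow(2).sum().backward()              # custom backward
    custom_w, custom_x = layer.weight.grad.clone(), x.grad.clone()

    layer.weight.grad = None
    x.grad = None
    torch.sigmoid(x @ layer.weight.t()).pow(2).sum().backward()  # built-in autograd
    return max((custom_w - layer.weight.grad).abs().max().item(),
               (custom_x - x.grad).abs().max().item())
```

Replacing the body of `backward` with a different rule changes what the optimizer receives as the "gradient" without touching the rest of the training loop, which is the property the second implementation relies on.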
In this formulation the reward signal for the hidden layers is given by equation (6), where $r^{k}_{j}$ is the reward assigned to the jth neuron of the kth hidden layer. This approach is similar to the one previously suggested by Bing et al. [5], with the difference that in the present case a separate reward signal is stored at this stage for each training example. The reward signal corresponding to every synaptic weight $w^{k}_{ij}$ will be $r^{k}_{ij} = r^{k}_{j}$. For the outer layer, the reward signal is calculated by PyTorch's backward method applied to the loss of the Proximal Policy Optimization (PPO) algorithm.
The STDP-inspired component of the weight update signal is calculated on the forward pass as described by equation (9). The assumption is made that the perceptron's output can be viewed as equivalent to the spike rate of an SNN neuron. Then, the conditional rate at which the presynaptic neuron of a synapse spikes, given that its postsynaptic neuron is not spiking at a given trial, can be approximated by equation (7). This condition between pre-synaptic and post-synaptic neurons is known to be the required condition for LTP, a state under which the synaptic weight becomes stronger [4]. The opposite process, LTD, which results in the weakening of the synaptic strength, is known to happen when the postsynaptic neuron spikes given that the presynaptic neuron does not spike. The conditional rate of this event is approximated by equation (8). In equations (7) and (8), intersections are calculated as the average of their minimum and maximum bounds.
In order to calculate the STDP component of the weight correction, a few further assumptions are made. Each step is considered to have a duration of 1 time unit. The two phenomena, LTP and LTD, are assumed to obey a Poisson distribution. The difference between the probabilities of each event happening once in one time step is taken here as a valid measure of the change in synaptic weight due to STDP. The resulting formula is presented in equation (9).
The final weight correction signal, which is passed from autograd to the optimization algorithm, consists of the mean, over all the training examples of the mini-batch, of the element-by-element products of the reward and the STDP component. The corresponding evaluation is shown in equation (10). A negative sign is added so that the optimizer can treat this as a minimization problem. The term epsm stands for machine epsilon and is added in order to avoid division by zero.
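The assumptions above can be turned into a small numerical sketch. It is one possible reading of the description, not a transcription of equations (7) to (9): the intersection bounds are taken to be the Fréchet bounds on the probability of a joint event, the function names are hypothetical, and the placement of the machine epsilon in the denominators is an assumption.

```python
import math
import sys

EPSM = sys.float_info.epsilon  # machine epsilon; its placement here is assumed


def intersection_midpoint(p, q):
    """Average of the lower and upper bounds on P(A and B) for P(A)=p, P(B)=q."""
    lower = max(0.0, p + q - 1.0)
    upper = min(p, q)
    return 0.5 * (lower + upper)


def once_in_unit_time(rate):
    """Poisson probability of exactly one event during one time unit."""
    return rate * math.exp(-rate)


def stdp_component(a_pre, a_post):
    """STDP-like term for one synapse, from unit outputs read as spike rates."""
    # LTP condition as stated in the text: pre spikes, given post does not spike.
    ltp_rate = intersection_midpoint(a_pre, 1.0 - a_post) / (1.0 - a_post + EPSM)
    # LTD condition as stated in the text: post spikes, given pre does not spike.
    ltd_rate = intersection_midpoint(1.0 - a_pre, a_post) / (1.0 - a_pre + EPSM)
    return once_in_unit_time(ltp_rate) - once_in_unit_time(ltd_rate)
```

Under this reading the term is positive when the presynaptic output dominates, negative in the opposite case, and zero when both outputs are equal.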
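As a plain-Python illustration of the averaging step, the sketch below combines per-example rewards and STDP terms into a single correction matrix. The function name and the nested-list layout are hypothetical, and the epsm term is left out because this sketch contains no division that could hit zero.

```python
def weight_correction(rewards, stdp):
    """Negative mini-batch mean of reward times STDP term, per synapse.

    rewards[l][j]: reward of output unit j for training example l.
    stdp[l][i][j]: STDP-inspired term of synapse i -> j for example l.
    Returns corr[i][j], the value handed to the optimizer as a gradient.
    """
    n = len(rewards)
    n_in, n_out = len(stdp[0]), len(stdp[0][0])
    return [
        [-sum(rewards[l][j] * stdp[l][i][j] for l in range(n)) / n
         for j in range(n_out)]
        for i in range(n_in)
    ]
```

Because the optimizer subtracts what it receives, the negative sign makes a positive product of reward and STDP term increase the corresponding weight.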

Training and Optimization
For the purposes of training, an appropriate variation of PPO [30] is applied. The present implementation of the algorithm is heavily based on the one discussed in the blog post 'The 37 Implementation Details of Proximal Policy Optimization' and implemented in the corresponding GitHub repository by Huang [13]. A clipped loss version is used with Generalized Advantage Estimation. The clipping factor is set to 0.2. This value is in the range suggested by Andrychowicz et al. [2] for similar setups. Their publication uses continuous action spaces and harder-to-solve environments, but a similar dependence on the clipping factor is assumed in this work's experiments. The discount factor is set to 0.99, while the Generalized Advantage Estimation hyper-parameter is set to 0.95. The entropy coefficient equals 0.01. Finally, the coefficient of the value function component is set to 0.5. PPO is an on-policy algorithm and therefore improves the policy that is used, through constant evaluation. The evaluation is carried out by a separate critic network, while the policy is represented by an actor network as a probability distribution over the actions for any given state. Andrychowicz et al. generally find better performance in setups with separate networks than in setups with a shared actor-critic network [2], and this is also the approach followed here. An orthogonal weight initialization is performed as recommended by Engstrom et al. [8], and the actor network is initialized with a distribution of zero mean and a low standard deviation of 0.01, as recommended by Andrychowicz et al. [2].
A variation of the Adam algorithm [16] is used for the optimization of the parameters (the synaptic weights) of both the value and the policy networks. Adam combines the technique of adaptive individual learning rates with the idea of momentum, where a moving average is used instead of the actual gradients. The implemented version is the one supplied by torch.optim.Adam [25]. Furthermore, the stabilizing hyperparameter of the algorithm is set to 1e-5 during all runs, instead of the default 1e-8.
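The role of the coefficients listed above can be sketched in plain Python. This is a generic illustration of Generalized Advantage Estimation and the clipped PPO objective, not the code of [11] or [13]; the function names are hypothetical, and the convention that `dones[t]` marks an episode ending at step t is an assumption of this sketch.

```python
def gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    dones[t] is 1.0 if the episode ended at step t, else 0.0 (assumed convention).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * v_next * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


def clipped_policy_loss(ratio, advantage, clip=0.2):
    """Negative clipped surrogate objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return -min(ratio * advantage, clipped_ratio * advantage)


def total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Weighted sum of the policy, value and entropy terms."""
    return policy_loss - ent_coef * entropy + vf_coef * value_loss
```

With a probability ratio of 1.5 and a positive advantage, for example, the clipping caps the ratio at 1.2, so large policy changes stop being rewarded.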
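To show where the stabilizing hyperparameter enters, the following is a scalar sketch of one standard Adam update. It is a textbook form of the algorithm [16] rather than the library's internal code; the function name is hypothetical, and the defaults for `beta1` and `beta2` are the usual ones and are assumed here since they are not stated in the text.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-5):
    """One Adam update of a scalar parameter; t is the step count, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad   # moving average of its square
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps stabilizes the division
    return theta, m, v
```

A larger `eps` damps the update for parameters whose gradient history is very small, which is the effect of choosing 1e-5 over 1e-8.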

RESULTS
Performance is often harder to track in RL tasks than in other types of learning. This is mainly due to the highly noisy output. Total average episode reward is often chosen as an appropriate metric [20] and the same approach is followed here. All diagrams are generated with the aid of TensorBoard [1], which offers visualization of the measurements collected during the workflow. Exponential smoothing by a factor of 0.99 is applied to all the graphs in order to reduce the noise and make them easier to read. All diagrams show a training period of 1 million total steps of the parallelized environment.
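The effect of the smoothing factor can be reproduced with a plain exponential moving average, as in the sketch below. The function name is hypothetical, and the exact variant applied by the plotting tool may differ in details such as initialization or bias correction.

```python
def exponential_smoothing(values, factor=0.99):
    """Exponential moving average; the first value initializes the average."""
    smoothed, last = [], None
    for x in values:
        last = x if last is None else factor * last + (1.0 - factor) * x
        smoothed.append(last)
    return smoothed
```

With a factor of 0.99 each new episode return contributes only 1% to the plotted value, so short-lived spikes are strongly damped.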

Vanilla BP
As demonstrated in Figure 1, BP in combination with proven training and optimization algorithms provides excellent results and converges extremely fast, even if it does not fully solve the environment. Its consistent behavior indicates that the present setup is valid and functional. The algorithm achieves its best performance for a learning rate of 1e-3. Learning rate annealing is also applied, as suggested by Sutton and Barto [32], for optimal fine-tuning of the networks' parameters. The Steps Per Second (SPS) rate converges to a value of around 850 SPS while training the networks.
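Learning rate annealing can be sketched as follows, assuming a linear schedule that decays towards zero over the policy updates, as is common in PPO implementations. The schedule actually used in these runs is not specified in the text, so both the linear form and the function name are assumptions.

```python
def annealed_lr(update, num_updates, initial_lr=1e-3):
    """Linearly annealed learning rate; update counts from 1 to num_updates."""
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * initial_lr
```

The first update uses the full learning rate, and the last one uses `initial_lr / num_updates`.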

Hybrid R-STDP Inspired Algorithm
The hybrid algorithm did not manage to solve the environment within 1,000,000 steps. However, it shows a clear ability to learn. In the absence of successful learning, the average episode return stays very close to 22 ± 2 (after smoothing), which is clearly not the case here. The best performance, demonstrated in Figure 2, is achieved with a learning rate of 5.6e-3 and no learning rate annealing. The learning process seems to accelerate after about 600k steps. A possible explanation may be that, for this learning rule, the networks' initial state is not favorable. SPS converges to approximately 280 SPS while training the networks. [2]. In a future project, the additional use of a tuning library like Ray [18] can assist the search for a functional set of hyperparameters and achieve an optimization level that in practice cannot be attained by trial and error.
Early versions of the algorithm, prior to the implementation of the custom autograd mechanism, were up to two orders of magnitude slower than the later versions that use autograd. To put this in context, prior to the autograd implementation the achieved rate was consistently under 10 SPS, while after the autograd implementation and subsequent optimization the rate increased to the reported 280 SPS. BP achieved 850 SPS. This shows that, for this setup, normal BP is not only more efficient at learning than the proposed experimental algorithm but also computationally cheaper.
However, there are a few remarks to be made. Spiking is not an intrinsic function of the Perceptron. Therefore, LTP and LTD are not intrinsic processes but simulated ones, which adds to the computational cost. Furthermore, in the case of the Perceptron an exact, computationally cheap derivative does exist and therefore classic BP is a clear winner. In neuromorphic models, by contrast, exact differentiation is usually impossible [22], whereas spiking is an intrinsic process. Therefore, STDP components can be evaluated directly. Furthermore, with the aid of appropriate traces [26], this can be done in a computationally efficient way, while exact differentiation is not an option.
Another issue is that BP achieves 850 SPS, which is far behind the approximately 2500 SPS achieved on the same machine with a similar setup if the built-in autograd differentiation is used. The main reason for this is that the built-in functions are written in C++, while the custom differentiation functions used in this implementation are written in Python 3. Therefore, there is a large speed gap, since C++ is a faster language. Aruoba and Fernández-Villaverde [3] note an approximately 44-fold speed difference between the two. However, LibTorch, a PyTorch version with a C++ frontend, offers the option to write custom differentiation functions directly in C++ and add them to torch::autograd, thereby achieving performance similar to that of the built-in functions [25].
The present work demonstrates that a process normally linked to units with temporal dynamics can be applied even in the absence of such behavior. It also hints at the high potential of R-STDP as a process in network training. Moreover, it is the authors' belief that the fusion of different techniques, combining recently developed RL tools with concepts inspired by neuromorphic algorithms, is still an 'uncharted' field with great potential for further research. Many tools, such as training algorithms, optimization algorithms, and network structures, have been developed for either classic ML or neuromorphic computing. Hybrid implementations and their capabilities, however, remain largely unexplored.
The present approach enables researchers to easily implement different kinds of setups and apply them to the same problem. The behavior of different neuromorphic spiking models, under variations of R-STDP and gradient-based algorithms, can be examined and their performance can be compared. It might also be useful to examine, under an appropriate metric, the similarities and differences between parameter updates performed with different methods. The option of implementing autograd extensions directly in C++ is very appealing and a worthwhile endeavor that should be pursued. Additional algorithms can also be examined through further customization of the optim package of PyTorch [25], which is responsible for the applied optimizer. Other combinations of neuromorphic hardware and software setups might be considered too.

Figure 1: BP Algorithm. Diagram of Episode's Return vs. Global Step

Figure 2: R-STDP Inspired Algorithm. Diagram of Episode's Return vs. Global Step