Facilitating Experimental Reproducibility in Neural Network Research with a Unified Framework

In neural network research, experiment reproducibility is paramount for building upon existing knowledge and advancing the field. This paper evaluates how well a multi-agent neural network framework facilitates the reproduction of experiments. We also address the reproducibility problem when data or source code are unavailable. The framework supports experiment reproducibility through the exchange of data, layer outputs, architectures, and weights among its agents. By integrating these functionalities, the framework enables researchers to reproduce and validate experimental results consistently, fostering a more robust and collaborative research environment in the field of neural networks. The experimental results demonstrate the framework's reproducibility abilities. Furthermore, we test the framework's reproducibility in an emergency natural disaster management scenario. Finally, we analyze how the privacy limitations of the original neural network affect the reproducibility results.

Deep Neural Networks (DNNs) have achieved impressive results in various complex tasks, and much effort has gone into improving their predictive accuracy. However, an equally crucial aspect of any machine learning system is its ability to produce stable and reproducible predictions. Ensuring prediction reproducibility is a concern even when the network architecture and training data remain constant across different training runs. Unfortunately, two critical elements that contribute to the high accuracy of deep networks, over-parameterization and the randomization of training algorithms, present substantial challenges to achieving reproducibility. Over-parameterization means that neural networks often have multiple solutions that minimize the training objective, leading to different model solutions [23,34]. Randomization, on the other hand, stems from various sources of uncertainty in standard neural network training, such as weight initialization, mini-batch ordering, non-deterministic aspects of training platforms, and, in some cases, data augmentation. When combined, these factors mean that training neural networks can result in vastly different solutions in each run, even when the training data remains the same, creating a significant challenge for reproducibility.
On the other hand, achieving reproducible experimental results can be particularly challenging because most papers do not provide the source code or data used for their experiments. Most reproducibility techniques require the exact original experimental setting (dataset, source code) and fail to proceed if any requirement is missing. This paper identifies the parts of the source code needed to reproduce the original deep learning experiment and uses a multi-agent neural network framework to facilitate the process. To this end, we define these availability limitations as privacy, and we introduce the five different privacy levels the framework can handle.
Reproducibility hinges on the ability to recreate a neural network model with minimum disagreement with the original one, ensuring that the model's behavior remains consistent. However, methods to mitigate this disagreement have so far addressed only the over-parameterization and randomization problems. This paper examines reproducibility solutions for the availability problem (privacy) using the framework described in [15]. This framework introduces a novel approach by creating a collaborative neural network environment composed of agents capable of exchanging information and learning tasks from each other. In this context, solving the reproducibility problem entails successful collaboration between a newly introduced agent and the agent to be reproduced. The framework offers three key mechanisms to enhance reproducibility: a) data exchange between the agents, b) weights exchange between the agents, and c) learning via knowledge distillation techniques. This paper aims to assess and validate the reproducibility capabilities of this framework. By doing so, it seeks to aid the process of reproducing experimental results within the scientific community, thereby advancing modern research in the field of Deep Learning. In practical applications, frameworks of this kind can be used to transfer knowledge safely, effectively, and quickly in emergencies, such as natural disasters.

RELATED WORK

Reproducibility
Reproducibility is a widely discussed [11] and concerning issue across various scientific disciplines [2,30], highlighting a potential crisis in this regard. Many scientists have encountered difficulties when attempting to reproduce experimental findings [2], with failure rates exceeding 50% when attempting to reproduce their own work in fields like medicine, physics, and engineering. The failure rate rises to more than 75% when attempting to reproduce results from the works of others in the same fields.
In the field of computational biology, reproducibility has been identified as a significant challenge [24]. Studies have shown that the percentage of reproducible studies in this field can be as low as 10% or less [24]. This lack of reproducibility has been attributed to factors such as unclear method descriptions and inherent variability in biological systems [24]. To address this issue, efforts have been made to improve reproducibility in computational biology research. Tools have been developed to facilitate reproducibility, such as comprehensive code libraries and online evaluation platforms [10]. These tools aim to provide researchers with the necessary resources to accurately reproduce and validate computational models and results [10].
Reproducibility is also a concern in the field of computer vision. In the context of face presentation attack detection competitions, reproducibility has been emphasized as a key aspect [33]. In a specific competition, the top-performing teams released their source code and summarized their approaches, promoting transparency and reproducibility [33]. This highlights the importance of sharing code and methodologies to enable others to reproduce and build upon research findings.
In the field of deep learning, related work has until recently focused on mitigating the disagreement between the original and the reproduced neural network caused by over-parameterization and randomness [23,34]. Another concern regarding the reproducibility of a research experiment is dataset and code availability. Although works like [33] emphasize the code and dataset availability problem, few, if any, methods exist for handling it.
Assessing the different reproducibility techniques requires a metric. The metric used for this purpose is churn [22], which refers to the disagreement in predictions between two models. Specifically, churn is the fraction of test examples on which the predictions of the two models differ. Note that when both models have perfect accuracy, which is often unattainable in practical scenarios, the churn is zero. While churn can in principle be reduced by removing all sources of uncertainty in the training process, such as controlling the random initialization seed and data order, doing so remains challenging because of the inherent non-determinism of current computational algorithms.
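The definition of churn can be made concrete with a small sketch. The function names and the toy prediction lists below are illustrative assumptions, not part of the framework in [15] or of the definition in [22]; the example only shows that two models with identical accuracy can still disagree on individual test examples.

```python
from typing import Sequence


def churn(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of test examples on which two models predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must be evaluated on the same test examples")
    if len(preds_a) == 0:
        raise ValueError("at least one test example is required")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of test examples predicted correctly."""
    return 1.0 - churn(preds, labels)


# Toy values for illustration only: 1 = fire, 0 = no fire.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
teacher_preds = [1, 0, 1, 1, 0, 0, 0, 0]  # one mistake (index 6)
student_preds = [1, 0, 1, 0, 0, 0, 1, 0]  # one mistake (index 3)

print(accuracy(teacher_preds, labels))      # 0.875
print(accuracy(student_preds, labels))      # 0.875
print(churn(teacher_preds, student_preds))  # 0.25
```

Both toy models reach the same accuracy, yet they disagree on two of eight examples, which is why accuracy alone is not a sufficient measure of reproducibility.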

Teacher-Student Learning
The concept of teacher-student learning originated from the idea of condensing the knowledge stored in one or multiple complex neural networks into a single, simpler student network [7]. This approach balances efficiency and performance in learning tasks [3]. The study in [19] showed that a large pre-trained network can provide labels for unlabeled data. Notably, [13] achieved significant results by distilling knowledge from a group of models into a single student model. Expanding on the work of [19] and [13], subsequent research focused on teacher-student interactions to refine the process of knowledge distillation, leading to highly proficient student models [32], [18], [35], [27], [3], [31].
In [27], a novel framework is introduced to compress wide and deep networks into narrower yet deeper ones. This method, known as FitNets, utilizes the teacher network's outputs and intermediate representations to train a deeper but narrower student network. Consequently, this approach enhances the training process and improves student performance. [32] defined distilled knowledge as the procedural flow learned through intermediate-layer feature representations. With this formulation, the student DNN in that work surpasses the teacher DNN. The capabilities of teacher-student interaction frameworks could be utilized to bolster deep neural network reproducibility.
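As a rough illustration of how a student can learn from a teacher's soft targets, the sketch below computes a temperature-softened KL divergence between teacher and student output distributions for a single sample. This is a commonly used soft-target formulation; whether the cited works or the framework in [15] use exactly this loss is an assumption, and the function names and the default temperature are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


# Illustrative logits for a two-class problem (fire / no fire).
teacher = [2.0, -1.0]
close_student = [1.8, -0.9]
far_student = [-1.0, 2.0]

print(distillation_loss(close_student, teacher) < distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two output distributions diverge, which is the property that makes soft targets useful for reducing disagreement between the two networks.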

UTILISING FRAMEWORK FOR REPRODUCIBILITY

Brief framework description
The framework proposed in [15] facilitates efficient knowledge transfer and collaboration by establishing a network of agents. Each agent consists of an out-of-distribution detector, a classifier, and a set of rules for training, knowledge distillation, and out-of-distribution detection, as shown in Figure 1. The agents can use their set of rules to train their classifier, collaborate with other agents, and assess their knowledge of a given dataset. The framework also defines the rules for agent communication, which enable the transmission of data, weights, feature maps, and soft-targets. By utilizing the framework's communication rules, agents can exchange knowledge seamlessly in a collaborative environment. Because agents either possess knowledge or need to acquire it, they can act both as students and as teachers. The framework also provides a library of widely used architectures to avoid discrepancies between architecture variants.

Framework reproducibility functionalities
Experiment reproducibility is bolstered through a well-defined framework for neural network communications. The framework proposed in [15] offers a wide range of options for data and knowledge transmission among its agents. It defines the rules of communication and collaboration between different agents, facilitating the knowledge exchange among them. Let us now suppose that we want to reproduce the neural network of one agent (the teacher) on another agent (the student). The framework provides the following options:
(1) Training data and architecture: The teacher agent sends its training data and network architecture so that the student can repeat the original training process.
(2) Knowledge distillation with teacher's training dataset: The teacher agent sends its training data (D), network architecture, and soft-targets (a) for every sample of (D) so that the student can reproduce the teacher's network using knowledge distillation.
(3) Knowledge distillation without teacher's training dataset: In case the teacher's training data are private, the student agent sends its training data to the teacher. The teacher agent sends its network architecture and soft-targets (of the student's training data D) so that the student can reproduce the teacher's network using knowledge distillation.
Options (4) and (5) are listed with Table 1.

In the context of the problem at hand, the term privacy refers to the degree of transparency of the teacher agent. Within the examined framework, weights and architecture transmission (option 5) is considered the most transparent (non-private) option, and option 4 constitutes the most private option. The three remaining options rank 2, 1, and 3 from lowest to highest privacy. In fact, option 2 is a special case of option 3 in which the dataset used for the teacher's training is also used to distill the teacher's knowledge to the student. The difference between options 1 and 2 is the contribution of the teacher agent to the training process of the student agent, which is useful for reducing the disagreement between the two networks. Option 1 is ranked as less private than option 3 because actual datasets contain more information than soft-target activations about the circumstances under which the teacher agent was trained, and this additional information facilitates churn reduction. These remarks are illustrated in Table 1, which summarizes the information shared by the teacher agent for every option. In our experiments, we compare the different reproducibility options in terms of privacy and churn.
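The five options can be summarized by what the teacher agent transmits in each case. The sketch below encodes one reading of the option descriptions in this paper; it is not a copy of Table 1 and not the framework's API. The class, field, and function names are hypothetical, and the entry for option 1 follows the later description of that option as a repetition of the original training process.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SharedByTeacher:
    """What the teacher agent transmits to the student under one option."""
    dataset: bool
    architecture: bool
    soft_targets: bool
    weights: bool


# Interpretation of the option descriptions (assumption, not Table 1 itself).
OPTIONS: Dict[int, SharedByTeacher] = {
    1: SharedByTeacher(dataset=True, architecture=True, soft_targets=False, weights=False),
    2: SharedByTeacher(dataset=True, architecture=True, soft_targets=True, weights=False),
    3: SharedByTeacher(dataset=False, architecture=True, soft_targets=True, weights=False),
    4: SharedByTeacher(dataset=False, architecture=False, soft_targets=True, weights=False),
    5: SharedByTeacher(dataset=False, architecture=True, soft_targets=False, weights=True),
}

# In options 3 and 4 the student first sends its own training data to the teacher.
STUDENT_SENDS_DATA = {3, 4}


def shared_items(option: int) -> List[str]:
    """Names of the items the teacher shares under the given option."""
    return [name for name, shared in asdict(OPTIONS[option]).items() if shared]


for option in sorted(OPTIONS):
    print(option, shared_items(option), "student sends data:", option in STUDENT_SENDS_DATA)
```

Laying the options out this way makes the privacy discussion easier to follow: option 4 shares only soft-targets, while option 5 shares the weights themselves.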

Reproducibility for Natural Disaster Management
The framework's reproducibility options have direct application to Natural Disaster Management systems. The student agent in this case can be a natural disaster control center, a drone, or another vehicle suitable for the situation. The student agent can collect data from the disaster at hand and accumulate the knowledge of the other agents seamlessly by distilling their knowledge. Another useful feature is weights transmission, as it is quick and directly applicable, saving time, which is crucial in situations like natural disasters. As discussed above, the framework provides a range of options that take into consideration the privacy limitations of the teacher agent. This feature allows the original neural network to be reproduced under many kinds of privacy limitations, which is extremely useful for natural disaster management. Knowledge about wildfires, flooding, tropical cyclones, tsunamis, volcanic activity, and other kinds of natural disasters can be transferred effectively and quickly using this framework. This enables the users of natural disaster management systems to acquire useful knowledge seamlessly and handle the situation more effectively. The experimental results demonstrate the framework's capabilities in a wildfire situation.

EXPERIMENTAL RESULTS
This section discusses the experimental setup, which verifies the effect of the framework's functionalities on the reproducibility problem. Following the previous rationale, the teacher agent is the agent to be reproduced on another agent (the student). Let us now suppose that the teacher agent represents a neural network on a fire monitoring drone, performing fire classification [17]. The natural disaster control center determines that a second drone is required for monitoring. The problem at hand requires reproducing the neural network of the first drone (teacher agent) on the second drone (student agent) with the minimum possible disagreement.
The fire dataset consists of 1900 images divided into 1520 for training and 380 for testing. The dataset contains annotated images with labels indicating the existence or the absence of fire in the image, as shown in Figure 2. The teacher's architecture is ResNet18, with a measured accuracy of 96.05%. We test the framework's ability to reproduce the teacher agent's results under the different privacy limitations. To this end, we measured the churn and accuracy metrics for all reproducibility options. The experimental setting for options 1 and 2 is straightforward: option 1 is the repetition of the original training process, and option 2 is the simplest knowledge distillation technique [13]. Regarding option 3, the student agent has managed to access only a subset of the teacher's training dataset, so this subset is used for training. For option 4, the student uses the same subset as in option 3 for training; however, the teacher's architecture is private. The student can pick an architecture from the architecture library. For this experiment, the student picks MobileNetV3 [14]. Option 5 is also a standard procedure and is applied by copying the teacher's architecture and weights.
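The reported figures can be checked with a short calculation. The sketch below assumes that the 96.05% accuracy was measured on the 380 test images, which the text does not state explicitly; the variable names are illustrative.

```python
TOTAL_IMAGES = 1900
TRAIN_IMAGES = 1520
TEST_IMAGES = 380

# The split accounts for every image and corresponds to an 80/20 partition.
assert TRAIN_IMAGES + TEST_IMAGES == TOTAL_IMAGES
train_fraction = TRAIN_IMAGES / TOTAL_IMAGES
print(train_fraction)  # 0.8

# Assumption: the teacher's 96.05% accuracy refers to the 380 test images.
correct = round(0.9605 * TEST_IMAGES)
print(correct)                                 # 365
print(round(100 * correct / TEST_IMAGES, 2))   # 96.05

# With 380 test images, accuracy and churn can only change in steps of 1/380.
step_percent = 100 / TEST_IMAGES
print(round(step_percent, 2))                  # 0.26
```

Under this assumption the teacher misclassifies 15 of the 380 test images, and a churn difference of about 0.26 percentage points between two options corresponds to a single test image.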
Based on the definition of privacy given above, we define privacy levels to make the results easier to interpret and visualize. To this end, we designate option 5 as having level 1 privacy, while options 1 to 3 are allocated level 3, level 2, and level 4 privacy, respectively. Option 4 is allocated level 5 privacy. Table 2 presents the accuracy and churn metrics for the options provided by the framework. The results show the deviation in churn and accuracy of the experiment reproducibility process according to the different levels of privacy of the teacher agent. Table 2 and Figure 3 indicate the effect of the different privacy levels on the churn metric. The more private the components of the teacher agent are, the more disagreement emerges between the teacher and the student agent. The experimental results support the framework's reproducibility abilities, and its options constitute a useful solution for the privacy limitations of the teacher agent. The experiments exhibit good reproducibility results in terms of churn, i.e., the disagreement between the original and the reproduced model. The framework's potential is demonstrated by its ability to replicate deep learning models while taking real-world restrictions and limitations into consideration. Although the system is tested on a natural disaster management scenario, the method is directly applicable to reproducing the results of a research paper. Deploying the framework on an accessible, worldwide scale promises to boost reproducibility and accelerate research progress in every field that uses deep learning models, as related work could be reproduced with minimal effort.
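The assignment of privacy levels to options, and the claim that churn grows with privacy, can be expressed as a small sketch. The mapping is taken from the text above; the helper names are hypothetical, and no measured values from Table 2 are used here.

```python
from typing import Dict, List

# Privacy level per option, as assigned in the text (1 = least private).
PRIVACY_LEVEL: Dict[int, int] = {5: 1, 2: 2, 1: 3, 3: 4, 4: 5}


def options_by_privacy() -> List[int]:
    """Option numbers ordered from least to most private."""
    return sorted(PRIVACY_LEVEL, key=lambda option: PRIVACY_LEVEL[option])


def churn_grows_with_privacy(churn_by_option: Dict[int, float]) -> bool:
    """True if churn never decreases as the privacy level increases."""
    ordered = [churn_by_option[option] for option in options_by_privacy()]
    return all(lower <= higher for lower, higher in zip(ordered, ordered[1:]))


print(options_by_privacy())  # [5, 2, 1, 3, 4]
```

Passing the measured churn of each option to `churn_grows_with_privacy` would check the trend described above directly against the experimental results.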

CONCLUSION
In conclusion, this paper has evaluated the reproducibility capabilities of a multi-agent neural network framework. We demonstrated the functionality of the examined framework in a natural disaster management setting, where safe and effective reproducibility is crucial. During the evaluation process, we considered privacy issues that are encountered frequently in natural disaster management and experiment reproducibility. The same privacy limitations apply to research papers, so our work is directly applicable to research experiments. The framework empowers researchers to reproduce and validate experimental results consistently, fostering a more robust and collaborative research environment in the field of neural networks. The results are promising, and future work should consider integrating churn reduction methods into the existing framework to achieve better reproducibility results.

Figure 1: The representation of a framework agent, containing an out-of-distribution detector, a classifier, and a set of rules. Knowledge distillation rules enable agents to change roles (teacher/student) depending on the experiment setting [15].

(4) Knowledge distillation without teacher's training dataset or architecture: In case the teacher's training data and architecture are private, the student agent sends its training data (D) to the teacher. The teacher agent sends the soft-targets (a) produced from the student's training data. The student loads an architecture from the architecture library and uses knowledge distillation to reproduce the teacher's network.
(5) Architecture and weights: Having the minimum privacy possible, the teacher can send its architecture and weights (w) for the exact reproduction of the teacher's network.

Table 1: Shared information of the teacher agent for reproducibility purposes according to the provided options

Figure 2: Examples of the forest fire classification dataset [17]

Figure 3: Relationship between different levels of privacy and the churn metric

Table 2: Accuracy and churn measurements for the different options provided by the framework