A Containerization Framework for Bioinformatics Software to Advance Scalability, Portability, and Maintainability

Rapid advancements in structural bioinformatics result in short software lifespans due to issues with scalability, portability, and maintainability. In many cases, researchers aim to distribute scientific software as part of a research project but lack the development resources to maintain a robust web application server. Here, we introduce a web application framework that takes advantage of containers and their virtualization capabilities, and present an example application as a template. We package the full web application in a container, which bundles all the code and dependencies required by the software. The application itself is built specifically for structural bioinformatics, with a front-end GUI and molecular viewer as well as a back end that can be altered to run arbitrary software tools. The container, which is a snapshot of the software, limits the effort required to port and maintain the software. At the same time, the architecture we introduce streamlines the process of starting a web server for programmers who are not web developers. Finally, for computationally intensive work, the container transfers computing costs to users in a pay-as-you-use model. An example of the web application implementation built in a container can be found in section 5.


INTRODUCTION
The exchange of software and code is central to the advancement of structural bioinformatics but fraught with practical challenges. Software implementations are always at risk of losing functionality as operating systems and libraries evolve. Software releases must also cope with difficult practical decisions: The software could be distributed as source code, which demands a distribution system for updates and faith that users will figure out how to compile it on their own systems. Alternatively, it could be distributed as a web service on a local server, but the developer must then develop the server and absorb the computational costs of all user requests. A further problem, especially in structural bioinformatics, is that three-dimensional (3D) visualization is important for describing the inputs and outputs of the software, and often a source of additional hardware and software constraints. Given these challenges, it is unsurprising that, in 2019, Mangul et al. estimated that 28% of all bioinformatics resources from 2005 to 2017 are not accessible through the URL provided in their original publication, and that 28% of the software they examined for an installability test failed to install [4].
This paper presents a reusable template that can limit these problems. Specifically, it provides detailed instructions and examples for running software from web servers that are hosted on cloud-based containers. This approach offers fundamental advantages over the traditional strategies mentioned above: First, containerization fully describes the environment required by the software, helping it to remain operable as systems evolve. Second, by releasing containers with software, developers transfer the computational costs of running the software to its users and centralize their software distribution system. We also demonstrate how the outputs of the software can be interactively visualized on a web interface. These methods are not individually novel. However, integrated into a template for the structural bioinformatics community, they offer a simplified way to enhance three aspects of software in the field: (1) Scalability: A measure of how well a system can adapt to growth in data volume, traffic volume, or code complexity. (2) Portability: The ability for software to run in different systems and environments. (3) Maintainability: How easily a software can be repaired, improved, or understood.
We present this template as a way to promote scalability, portability, and maintainability across the bioinformatics software ecosystem by demonstrating its ease of deployment and its modest software engineering requirements.
The open-source distribution of scientific software is increasingly the norm, but it presents the user with problems relating to portability and maintenance. Technical issues with compiling and installing software are common, as users may have different systems and libraries. Nearly 49% of bioinformatics resources are excessively difficult to install [4]. One reason is that installable software requires conscientious development for portability. Minor differences in software versions, software packages, and unique settings can cause software to work incorrectly. Nonportable software leads to extensive maintenance, as developers must keep up with each evolving library and deprecated function. Another issue with sharing source code on archives like GitHub is that open source may be undesirable; some researchers prefer to conceal source code while still providing functional access, especially on a temporary basis as new projects advance. Even though the distribution of source code is "scalable", since any user can download the code, open-source distribution is not without practical issues.
Distributing software as a locally hosted web service makes the software portable, but markedly exacerbates scalability and maintainability problems. Web applications require no technical skill from the user to access the software; however, the entire technical and financial burden is placed on the developer. The development and maintenance of a web application are huge undertakings for nonprogrammers, barring most independent researchers from offering web applications. Maintaining an application through library changes, security issues, storage utilization, host hardware errors, and other concerns requires time and expertise that only larger development teams can manage. Furthermore, the host absorbs all of the computational expenses of hosting a server; scalability can quickly become a problem with a large user base and computationally demanding work. Portability thus comes at the cost of exacerbated scalability and maintainability issues; without keeping up with scalability and software maintenance, software can become obsolete.
Both open-source distribution and web services are imperfect ways to achieve scalable, portable, and maintainable software, but they can work given sufficient software development experience and time. Unfortunately, neither is readily available to many independent researchers whose goal is to release scientific software. Below, we discuss recent efforts to improve or streamline software distribution with cloud computing infrastructure.

Related Work
There has long been interest in utilizing cloud computing products to organize, develop, and distribute software. This is especially true of containers for distributing scientific software. Simply put, a container is a standalone unit which packages the code and dependencies required by a software tool. Many developers have utilized containers for software distribution, dissemination, and organization. For example, some have attempted to organize large, difficult-to-manage software architectures using containers. Since containers are self-contained, like a walled garden, portions of an architecture that perform different tasks can each be abstracted into their own container. Kugele et al. evaluated the containerization of automotive software, which is complex and difficult to manage, and showed how container organization breaks services down into single-container components for faster software development, faster maintenance, and more portability [3]. Similarly, containers have been a way to standardize bioinformatics software, allowing easier installation and execution as well as the building of more complex data analysis pipelines in omics technology [1]. Recently, workflow managers have been a key development in improving the reproducibility and portability of bioinformatics analysis and research [11]. Workflow managers allow researchers to connect a series of specialized bioinformatics tools to perform complex analyses while managing software versions, parameters, and tools. Through a graphical user interface, users can drag and drop tools into workflows to generate pipelines.
These uses of containers are representative of the substantial utility containers can provide; however, both rely on a development team and an ongoing community dedicated to open-sourcing their work to sustain them, and they leave little support for structural bioinformatics, especially for the integration of protein visualization and protein structure analysis.

ARCHITECTURE
The containerized web application framework consists of four components in a pyramidal organizational structure: a client-side molecular viewer, a web server, a container, and cloud computing services. The front end consists of a graphical user interface containing the molecular viewer and form inputs that users can interact with. The form inputs are directly connected to the web server to submit computational processes to be run on the back end. The client and server are bundled into a container as a full-scale web application that can be hosted on a local machine or on the cloud. Finally, we streamlined the process of hosting the web application on a cloud computing service like Google Cloud.
We provide a template in section 5 which implements an example containerized web application framework for calculating and displaying hydrogen bonds. This example exhibits the integration of a command-prompt-based hydrogen bond calculation algorithm with interactive visualization, a combination that is not often supported by molecular viewers. The value of this example is that users of the code can remove our simple hydrogen bond computation software and replace it with any software they wish. The software is triggered by the user on the client side, and modifications to the client can enable multiple desired computations to be run on the server side, with results returned for download or visualization. As such, the basics of the application are already constructed, needing only modifications for the desired code and desired interactions.

Client Side Molecular Viewer
Work in structural bioinformatics often utilizes visualizations generated by molecular viewers, such as 3D protein structures, protein surfaces, and other outputs. We embedded NGL viewer, a web application for molecular visualization, directly into the client side of the web app, controlled via its JavaScript API. The NGL viewer implementation is, in itself, an abstraction of functions from the three.js library, which interfaces with WebGL to produce 3D models [7, 8]. In addition to standard protein structure visualization, NGL allows developers to visualize 3D results from structural bioinformatics software output. For example, we applied the custom geometry capabilities in NGL to create custom objects like hydrogen bonds or the surfaces shown in Fig. 1.
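As a sketch of this client-side pattern, the snippet below converts hydrogen-bond results into NGL's selection-pair format and draws them with NGL's distance representation. The function names, the bond data format, and the example PDB ID are illustrative assumptions rather than the template's actual API; `NGL.Stage`, `loadFile`, and the `distance` representation are part of the published NGL viewer API.

```javascript
// Convert hypothetical server output into NGL "atomPair" selections.
// NGL selection syntax for a single atom is resno:chain.atomName.
function toAtomPairs(bonds) {
  return bonds.map((b) => [b.donor, b.acceptor]);
}

// Browser-only part: embed the viewer and overlay hydrogen bonds.
// Guarded so the module can also be loaded outside the DOM.
function showHydrogenBonds(bonds) {
  const stage = new NGL.Stage("viewport"); // a <div id="viewport"> in the page
  stage.loadFile("rcsb://1crn").then((component) => {
    component.addRepresentation("cartoon");
    // Draw each donor/acceptor pair as a dashed distance line.
    component.addRepresentation("distance", {
      atomPair: toAtomPairs(bonds),
      labelVisible: false,
    });
    component.autoView();
  });
}

if (typeof module !== "undefined") module.exports = { toAtomPairs };
```

Because `toAtomPairs` is a pure function, a modified back end only needs to emit donor/acceptor selections in NGL syntax for the overlay to work unchanged.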
Figure 1: Client-side graphical user interface with NGL viewer and file tree system for structure organization. Both PDB objects and custom surface objects can be displayed.

Server Side Web Application
The web application relies on Node.js, a JavaScript runtime environment, for serving web application content to the client. This includes handling HTTP requests between server and client via express.js, managing asynchronous processes like requesting large files or completing long-running tasks, and storing private code and private files on the server or in a database platform like MongoDB. One of the most important features of Node.js is asynchronous handling, since we expect the web application to run arbitrary software ranging from batches of short-running tasks to extremely long-running tasks [10]. In our template application, the server asynchronously runs software to find hydrogen bonds that will later be displayed on the client. Users modifying our template can replace that software with their own software that performs arbitrary computations, which may be displayed in different ways on the client. Likewise, additional functionality can be built into the client to trigger different computations as desired.

Containers and Container Images
A container image is a standalone executable package of software that bundles everything necessary to run the application, including code, runtime, system libraries and tools, and settings. More simply, it is a snapshot of the local environment, so that the software will replicate that environment and run uniformly regardless of differences in system-wide environments [5, 6, 9]. At runtime, a container image becomes a container, which starts up quickly and reliably and begins running the software. By packaging bioinformatics software into container images, we can remove issues with unreliable installation: If a developer or researcher can run their software from a container image, then a user provided the same image will also be able to run it [6, 9]. Containers thus form a foundation on which the example web application is built, addressing issues of version deprecation and system discrepancies. Although containers are not a novel solution for enhancing maintainability and portability, few bioinformatics tools utilize containers. We aim to demonstrate a deployed container to encourage future applications to be built on containers.
We utilize Docker for managing containers and container images. This means web applications built in our framework will have several features which further accelerate the deployment of web applications. Docker containers include a Dockerfile which handles configuration when building a container image. We build Docker container images from this Dockerfile, which can be kept locally or uploaded to a container repository service like Docker Hub. The container image associated with the web application presented here is linked in section 5. Finally, the container image is run using Docker Engine, a container runtime that can be run across various Linux systems and Windows Server systems [5].
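A Dockerfile for a Node.js web application of this kind might look like the following sketch; the base image, file paths, port, and entry point are assumptions for illustration, not the template's actual configuration.

```dockerfile
# Sketch of a Dockerfile for a containerized Node.js web application.
FROM node:18-slim

WORKDIR /app

# Install server dependencies first so this layer is cached across builds.
COPY package*.json ./
RUN npm install --production

# Copy the application code, including any bundled command-line tools.
COPY . .

# The server listens on 8080 (the default port Cloud Run expects).
EXPOSE 8080
CMD ["node", "server.js"]
```

Running `docker build -t my-app .` against this file produces the image that is then shared or pushed to Docker Hub.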

Google Cloud: Cloud Computing service
When running a container, users have the option to host it on their local machine or on a cloud computing service like Google Cloud. A key difference between the two options is ease of setup and access from a web-accessible URL. If a user simply wants private access to the software, they can manually start the container locally to access the software and leave it running. However, this requires a few additional steps: making sure Docker is installed and running, downloading and starting the container, and keeping the host computer on until the software completes.
The advantage of using Google Cloud is that deployment of Docker containers is streamlined by Cloud Run, a full-featured container deployment service. Hosting via Google Cloud removes the need to have Docker pre-installed before accessing the software, and does not even require the container image to be downloaded, since Cloud Run can deploy from GitHub and Docker Hub repositories directly. Additionally, Google Cloud provides computing resources to the host; local hosting is limited by the user's hardware specifications, whereas Google Cloud provides powerful hardware capable of running intensive work.
In Google Cloud, the container deployment pipeline is simplified to only a few steps: (1) Build a Docker image. (2) Point Cloud Run to the image via the Google Cloud console or by providing a URL where the image is available. (3) Set up deployment settings and begin deployment. (4) Open the URL, which directs anyone with the link directly to the hosted web application. At the time of this publication, web services provided with Google Cloud incur limited costs, and containers can be kept deployed with few wasted resources. Users can access the container at any time via the associated URL [2, 12].
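With the Docker and gcloud command-line tools, these steps can be sketched as follows; `PROJECT_ID`, `my-app`, and the region are placeholders, not real values from the template.

```shell
# (1) Build the image and push it to a registry Cloud Run can reach.
docker build -t gcr.io/PROJECT_ID/my-app .
docker push gcr.io/PROJECT_ID/my-app

# (2)-(3) Point Cloud Run at the image and deploy it.
gcloud run deploy my-app \
    --image gcr.io/PROJECT_ID/my-app \
    --region us-central1 \
    --allow-unauthenticated

# (4) gcloud prints the service URL; anyone with the link can open the app.
```

The same deployment can be performed entirely through the Google Cloud console for users who prefer not to use the command line.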

DESIGN
The software template we introduced incorporates several key design decisions which support the deployment of scalable, portable, and maintainable software. As our design goals served as the basis for many of our architectural decisions, we discuss each goal and the major architecture features here.

Maintainability and Portability of Software using Containers
Porting software to various systems, and continued maintenance across all those systems, is a crucial and typically difficult task; however, we use containers to circumvent both issues. Since containers are an isolated, virtualized snapshot of the files and libraries used by the host, anyone can replicate this exact environment by downloading the corresponding container image. This image, which contains all of the settings, code, and dependencies necessary to run a piece of software, allows a user to later replicate the software as it was when it was first stored as an image, making portability issues negligible. As a result, maintenance intervals for scientific software can be much longer and software updates will not require any software downtime, both of which are especially beneficial to independent researchers and small development teams.

Distribution using Container Image
One of our core objectives was to distribute software in a way that allocates server hosting and computation costs to the user rather than the developer. With that in mind, we chose to distribute container images as the primary object shared from researcher to user. Sharing container images not only allows users to run software reliably, but also gives any user the ability to host their own web application container privately and access it anytime at their own expense. This process is as simple as downloading the container image, or having it on Docker Hub, and pointing Google Cloud to the image location to begin hosting. Cost-wise, resource and API costs are very low for a single user. As long as the container is hosted on Google Cloud and deployed, one can access a private URL at any time to reach their copy of the web application. In this way, costs are effectively transferred from the developer to the user in a way that is proportional to how much the user employs the application.
In addition to distributing software as container images, the traditional method of hosting a URL-accessible web server maintained by the host is still supported. This gives very fine control of cost distribution to anyone with access to the container image, enabling a tree-like organization of costs as shown in Fig. 3. A user may deploy a container for themselves when they are individually accessing software, allowing quick and reliable access with little setup. On the other hand, a lab might deploy one copy of a piece of software for the whole lab to access when the software is highly frequented, so that the computation costs are billed to the lab rather than to individuals. A case in which multiple deployments of software, one for each individual in a lab, may be desirable is when the base software is needed but each user requires small modifications, whether in visualization or in computation. The branching of software copies to individuals, labs, or organizations naturally distributes the resource expenses based on who is hosting.
This strategy eliminates problems relating to a local back-end server, such as supporting a large user base or high computational resource demands, because cloud resources are simply allocated by, and billed to, the users seeking to employ the container. Distributing via container image limits the total number of users and resources per container by having users host their own server, paying for the resources they use as they use them.

Web Application Template
One of the most important goals of this framework is to streamline the container and web application setup process, since few researchers in structural bioinformatics have a background in web development, and the questions of scalability, portability, and maintainability can be difficult to fully address. Although many recent methods recommend containers as a software distribution method, setting up a container and a full-scale web application is a considerable undertaking. We have abstracted as much of the relevant client-side and server-side code as necessary to make web application deployment quick and reliable. In addition, step-by-step tutorials for various phases of setup are provided in the documentation, including: the container pipeline from building an image to deploying it, modifications to web application code on the server and client side with examples, and Google Cloud account setup.
Another goal of this project is to have a flexible web application that allows developer-defined code to run with the click of a button, as shown in Fig. 2. Many bioinformatics tools were designed to run using command-line commands; we took advantage of this common practice to run software on the server using commands which begin asynchronously through an HTTP request. The web app template requires tweaks to only a few lines of code to run a typical bioinformatics tool and obtain output. In our template, we run software to locate hydrogen bonds, but any software can be modified to run on the web app, ranging from commercially used pairwise structure alignments to molecular dynamics simulations.

NGL Viewer Design Features
There are many molecular viewers available today, most of which are open source and have a strong development community supporting them, such as PyMOL, Jmol, and VMD. In contrast to these state-of-the-art viewers, which aim to add more and more tools to keep up with significant developments in structural biology, our viewer template serves as an auxiliary feature to software execution. We aim to distill our viewer to best support the display of software output. For this reason, we have removed redundant functionality, like the many different protein structure views, and added features directly related to the display of software output. The 3D object modelling functionality included in the embedded NGL viewer allows us to display custom objects like custom surfaces and hydrogen bonds, and it can be extended further. For example, it could be used to display molecular dynamics simulations or structure alignments as outputs of developer code.
A benefit of embedding a molecular viewer on top of a web server is that, as software runs and completes, output files can be delivered asynchronously to the viewer for users to see. This decoupling of software execution and visualization further consolidates the structural bioinformatics research process, on top of all the organization provided by this containerization framework.

CONCLUSION
Obsolescence of bioinformatics software has become an issue, whether due to archival instability, software deprecation, or installation difficulty. Each of these barriers hinders software access and makes research harder to replicate, which ultimately decreases the scientific impact of the software. We introduced a containerized web application framework that simplifies the software distribution and dissemination process while making the software more scalable, portable, and maintainable. By using containers as a package for distributing a full-scale web application, we enable developers to enhance portability and maintainability through a container's virtualized environment. Similarly, we support scalability by cloud hosting containers that are pay-as-you-use. By having users obtain copies of the software via container images, this approach lends itself naturally to a tree-like distribution model. Finally, we abstracted much of the development code so our template can easily be extended, and provided step-by-step tutorials for creating containers of different designs. Overall, we attempted to simplify all of the development setup so that structural bioinformatics researchers and non-programmers can quickly build a scalable, portable, and maintainable tool.

DATA
Examples of the webserver can be found here and here. Documentation and tutorials can be found here. The container image uploaded to Docker Hub can be downloaded or directly accessed via GitHub.
This work is licensed under a Creative Commons Attribution International 4.0 License.

Figure 2 :
Figure 2: Client-server communications to run software. The client requests that the server run software on the data, and the server, hosted in a cloud computing service, performs the calculations before returning the software output to the client.

Figure 3 :
Figure 3: Model for distributing containerized software via container images. Blue arrows indicate forking a container image to obtain a replica of the containerized server and starting it, locally or on Google Cloud. Red arrows indicate accessing the server via a URL.