Accelerating Antimicrobial Peptide Discovery with Latent Structure

Antimicrobial peptides (AMPs) are promising therapeutic agents against drug-resistant pathogens. Recently, deep generative models have been used to discover new AMPs. However, previous studies mainly focus on peptide sequence attributes and ignore crucial structure information. In this paper, we propose a latent sequence-structure model for designing AMPs (LSSAMP). LSSAMP exploits multi-scale vector quantization in the latent space to represent secondary structures (e.g., alpha helix and beta sheet). By sampling in the latent space, LSSAMP can simultaneously generate peptides with ideal sequence attributes and secondary structures. Experimental results show that the peptides generated by LSSAMP have a high probability of antimicrobial activity. Our wet laboratory experiments verified that two of the 21 candidates exhibit strong antimicrobial activity. The code is released at https://github.com/dqwang122/LSSAMP.


[Figure 1: A typical AMP discovery pipeline. A candidate library passes through sequence-level screening (attribute filters and activity predictors) and structure-level modeling before wet laboratory experiments.]
INTRODUCTION
In recent years, the development of neural networks for drug discovery has attracted increasing attention. It can facilitate the discovery of potential therapies and reduce the time and cost of drug development [41]. Great success has been achieved in applying deep generative models to accelerate the discovery of potential drug-like molecules [23,38,40,53]. Antimicrobial peptides (AMPs) are one of the most promising emerging therapeutic agents to replace antibiotics. They are short proteins that can kill bacteria by destroying the bacterial membrane [2,10]. Compared with the chemical interactions between antibiotics and bacteria, which bacteria can evade through evolution, this physical mechanism is more difficult to resist.
A typical antimicrobial discovery process usually consists of four steps, as shown in Figure 1. First, a candidate library is built based on existing AMP databases. These candidates can be created by applying manual heuristic approaches or by training deep generative models. Then, several sequence-based filters are created to screen candidate peptides based on different chemical features, including computational metrics and predictive models trained to estimate antimicrobial activity.

Our contributions are as follows:
• We propose LSSAMP, a sequence-structure generative model that incorporates secondary structure information into the generation process, which can further accelerate AMP discovery.
• We develop a multi-scale VQ-VAE to control generation in a fine-grained manner and map patterns in sequences and structures into the same latent space.
• Experimental results with AMP predictors show that LSSAMP generates peptides with high AMP probabilities. Moreover, 2 of 21 generated peptides show strong antimicrobial activity in wet laboratory experiments.

RELATED WORK
Antimicrobial Peptide Generation Traditional methods for AMP discovery can be divided into three approaches [44]: (i) Pattern recognition algorithms first build an antimicrobial sequential pattern database from existing AMPs; each time, a template peptide is chosen and its local fragments are substituted with those patterns [31,35]. (ii) Genetic algorithms use the AMP database to design antimicrobial activity functions and optimize ancestral sequences with these functions [32]. (iii) Molecular modeling and molecular dynamics methods build 3D models of peptides and evaluate their antimicrobial activity by the interaction between the peptides and the bacterial membrane [6,33]. Pattern recognition and genetic algorithms are bottlenecked by pattern representation, while modeling and dynamics methods are computationally expensive and time-consuming.
Deep generative models have grown rapidly in recent years. Dean and Walper [15] encode the peptide into the latent space and interpolate along a predictive vector between a known AMP and its scrambled version to generate novel peptides. PepCVAE [14] and CLaSS [12] employ variational auto-encoders to generate sequences. AMPGAN [47] uses a generative adversarial network to generate new peptide sequences, with a discriminator distinguishing real AMPs from artificial ones. To our knowledge, ours is the first study to incorporate secondary structure information into the generative phase, which is conducive to efficiently generating well-structured sequences with desired properties.
Sequence Generation via VQ-VAE Variational auto-encoders (VAEs) were first proposed by Kingma and Welling [28] for image generation, and then widely applied to sequence generation tasks such as language modeling [8], paraphrase generation [19], and machine translation [4]. Instead of mapping the input to a continuous latent space as in a VAE, the vector quantized variational auto-encoder (VQ-VAE) [46] learns a codebook to obtain a discrete latent representation. It can avoid posterior collapse while achieving performance comparable to VAEs. Based on it, Razavi et al. [36] use a multi-scale hierarchical organization to capture global and local features for image generation. Bao et al. [3] learn implicit categorical information of target words with a VQ-VAE and model the categorical sequence with conditional random fields for non-autoregressive machine translation. In this paper, we employ the multi-scale vector quantization technique to obtain a discrete representation for each position of the peptide.

LATENT SEQUENCE-STRUCTURE MODEL
In this section, we first introduce the background of AMP discovery and discuss the limitations of existing generative models. Then we introduce the Latent Sequence-Structure model for AMP discovery (LSSAMP), which uses a multi-scale VQ-VAE to co-design the sequence and secondary structure within the same latent space.

Background
As shown in Figure 1, a typical AMP discovery pipeline includes sequence-level attribute screening and structure-level modeling before the wet laboratory experiments. Existing deep generative models have shown promise in accelerating AMP discovery by considering sequence-level attributes during generation. However, they still need to check and filter structures with external tools after sequence generation, which makes the process less efficient. For example, Van Oort et al. [47] manually checked the generated peptides and obtained only 12 AMP candidates with the ideal cationic and helical structure. Capecchi et al. [9] required an extra secondary structure predictor (SPIDER3 [21]) to filter the generated peptides based on the predicted α-helix fraction.
We explore the effect of secondary structure on sequence properties by filtering the generated sequences based on the proportion of α-helices, the most common secondary structure in AMPs. In Table 1, we use three sequence attributes (charge, hydrophobicity, hydrophobic moment) that are crucial to the AMP mechanism to evaluate generation performance [18,51,54]¹.
The ratio in Table 1 is the difference in performance before and after the secondary structure filter. We find that most of the results are improved by restricting to alpha-helical structures, which shows that controlling the structure can improve the sequence properties. Thus, incorporating structure information into generative models can not only accelerate discovery by combining all the steps before the wet laboratory but also improve sequence properties and make the generative process more efficient.
To address these challenges, we combine the secondary structure with sequence attributes in AMP discovery and generate peptides with ideal sequence attributes and secondary structures simultaneously.
Notation. A peptide² of length $n$ can be denoted by $\mathbf{x} = \{x_1, x_2, \cdots, x_n\}$, where each $x_i \in \mathcal{V}_x$ is one of the 20 common amino acids, also called a residue. The secondary structure describes the local form of the 3D structure of the peptide. It can be annotated as $\mathbf{y} = \{y_1, y_2, \cdots, y_n\}$, where $y_i \in \mathcal{V}_y$ is one of the 8 secondary structure types³. The goal is to generate peptide candidates with high antimicrobial activity to accelerate the AMP discovery process.

VAE-based AMP Models
Given a sequence $\mathbf{x}$, variational auto-encoders assume that it depends on a continuous latent variable $\mathbf{z}$. Thus the likelihood can be denoted by:

$$p(\mathbf{x}) = \int_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x}|\mathbf{z})\, d\mathbf{z} \quad (1)$$

Controlled sequence generation incorporates an attribute $a$ and models the conditional probability $p(\mathbf{x}|a)$. Previous work such as PepCVAE [14] assumes that $\mathbf{z}$ and $a$ are independent, i.e., $p(\mathbf{x}|a) = \int_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x}|\mathbf{z}, a)\, d\mathbf{z}$, while CLaSS [12] models the dependency between $\mathbf{z}$ and $a$ by $p(\mathbf{x}|a) = \int_{\mathbf{z}} p(\mathbf{z}|a)\, p(\mathbf{x}|\mathbf{z})\, d\mathbf{z}$.

The vanilla VAE is usually trained in an auto-encoder framework with regularization. The encoder parameterizes an approximate posterior distribution $q_\phi(\mathbf{z}|\mathbf{x})$ and the decoder $p_\theta(\mathbf{x}|\mathbf{z})$ reconstructs $\mathbf{x}$ from the latent $\mathbf{z}$. The model optimizes an evidence lower bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \quad (2)$$

where $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$ is the reconstruction loss and the KL divergence is the regularization. For conditional generation, the attributes are either fed directly to the decoder together with the latent variable $\mathbf{z}$ for $p_\theta(\mathbf{x}|\mathbf{z}, a)$, or trained on the latent space to obtain an attribute-conditioned posterior distribution $p(\mathbf{z}|a)$. VAE-based peptide generative models are first trained on unsupervised peptide or protein sequences and then fine-tuned on a few sequences with biological attribute labels.

¹The definitions of these three attributes can be found in Section 4.2.2.
²Here, we use "peptide" to refer to both oligopeptides (< 20 amino acids) and polypeptides (< 50 amino acids).
³The three alpha-helix types are denoted H, G, and I based on their angles. The two beta types are divided into E and T by shape. The rest are random coil structures [24].

LSSAMP
To capture the sequence and secondary structure features simultaneously, we model the joint distribution $p(\mathbf{x}, \mathbf{y})$. Based on Eqn 2, the training objective is the corresponding ELBO on the joint likelihood (Eqn 3). For sequence $\mathbf{x}$ and secondary structure $\mathbf{y}$, we assume they are independent given the latent variable $\mathbf{z}$, so the joint distribution can be written as $p(\mathbf{x}, \mathbf{y}) = \int_{\mathbf{z}} p(\mathbf{x}|\mathbf{z})\, p(\mathbf{y}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$. Moreover, since a peptide has a deterministic secondary structure given its amino acid sequence, we further approximate $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{y})$ with $q_\Phi(\mathbf{z}|\mathbf{x})$. Therefore, the training objective is written as:

$$\mathcal{L} = \mathbb{E}_{q_\Phi(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z}) + \log p(\mathbf{y}|\mathbf{z})] - \mathrm{KL}(q_\Phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \quad (4)$$

For fine-grained control over each position, we assign one latent variable $z_i$ to each $x_i$ instead of a single continuous $\mathbf{z}$ for the whole sequence. Since it is computationally intractable to integrate continuous latent variables over the sequence, we use VQ-VAE [46] to look up a discrete embedding vector $\mathbf{z}_q = \{z_q(x_1), \cdots, z_q(x_n)\}$ for each position by vector quantization.
Specifically, the original latent variable $z_e(x_i) \in \mathbb{R}^d$ is replaced by the codebook entry $z_q(x_i) \in \mathbb{R}^d$ via a nearest-neighbor lookup in the codebook $B \in \mathbb{R}^{K \times d}$:

$$z_q(x_i) = b_k, \quad \text{where } k = \arg\min_j \| z_e(x_i) - b_j \|_2 \quad (5)$$

Here, $K$ is the number of slots in the codebook and $d$ is the dimension of each codebook entry $b$. The decoder then takes $z_q(x_i)$ as its input. Thus, $\mathrm{KL}(q_\Phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$ in Eqn 4 is replaced by:

$$\mathcal{L}_{vq} = \| \mathrm{sg}(z_e(x_i)) - z_q(x_i) \|_2^2 + \beta \| z_e(x_i) - \mathrm{sg}(z_q(x_i)) \|_2^2 \quad (6)$$

which measures the difference between the original latent variable $z_e(x_i)$ and its nearest codebook entry, rather than the KL divergence between two continuous distributions. Here, $\mathrm{sg}(\cdot)$ is the stop-gradient operator, whose argument receives zero gradient in the backward pass, and $\beta$ is the commitment coefficient controlling the codebook loss.
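The nearest-neighbor lookup of Eqn 5 can be sketched in NumPy as follows; `vq_lookup` and the toy shapes are our own illustration, not the released implementation:

```python
import numpy as np

def vq_lookup(z_e, codebook):
    """Replace each encoder output with its nearest codebook entry (Eqn 5).

    z_e: (L, d) encoder outputs, one vector per residue position.
    codebook: (K, d) codebook matrix B.
    Returns the quantized vectors (L, d) and the chosen indices (L,).
    """
    # Squared Euclidean distance between every position and every entry.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (L, K)
    idx = dists.argmin(axis=1)  # index of the nearest entry per position
    return codebook[idx], idx
```

The indices returned here are also what the autoregressive priors are later trained on.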
For the sequence feature, we use the reconstruction loss to learn $p(\mathbf{x}|\mathbf{z}_q)$, and for the secondary structure, we treat prediction as an 8-category labeling task. Therefore, the first term in Eqn 4 becomes:

$$\mathcal{L}_{rec} = \sum_{i=1}^{n} \log p(x_i|\mathbf{z}_q) + \sum_{i=1}^{n} \log p(y_i|\mathbf{z}_q) \quad (7)$$

However, structure motifs are often longer than sequence patterns. Therefore, we establish multiple codebooks to capture features at various scales.
Multi-scale VQ-VAE Structure motifs are often longer than sequence patterns. For example, a valid α-helix contains at least 4 residues and may be longer than 12, whereas sequence patterns with specific biological functions are much shorter, usually between 1 and 8 residues. To capture these features and map them into the same latent space, we first apply $m$ multi-scale pattern selectors $f_k$ to $z_e$ and obtain $z_e^k$. Then, we establish multiple codebooks $B_k \in \mathbb{R}^{K \times d}$ and use Eqn 5 to look up the nearest codebook embedding $z_q^k(x_i)$. We share the codebooks between sequence reconstruction and secondary structure prediction to capture common features and the relationships between a residue and its structure. The concatenated multi-scale codebook embeddings are fed to the sequence generator:

$$p(x_i|\mathbf{z}_q) = p\big(x_i \,\big|\, [z_q^1(x_i); \cdots; z_q^m(x_i)]\big) \quad (8)$$

The total training objective is composed of the reconstruction loss, the labeling loss, and the codebook loss (Eqn 9).
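The multi-scale selection and lookup might be sketched as below, assuming (for illustration only) that each pattern selector is a simple sliding-window average over neighboring positions; the actual selectors in LSSAMP may differ:

```python
import numpy as np

def pattern_selector(h, window):
    """Average residue representations over a sliding window of size `window`,
    padding with the last vector so every position keeps one feature vector."""
    L, d = h.shape
    padded = np.concatenate([h, np.repeat(h[-1:], window - 1, axis=0)], axis=0)
    return np.stack([padded[i:i + window].mean(axis=0) for i in range(L)])

def multi_scale_quantize(h, codebooks, windows=(1, 2, 4, 8)):
    """One VQ lookup per scale; concatenate the quantized vectors per position."""
    outs = []
    for B, w in zip(codebooks, windows):
        f = pattern_selector(h, w)                  # (L, d) features at scale w
        d2 = ((f[:, None] - B[None]) ** 2).sum(-1)  # (L, K) squared distances
        outs.append(B[d2.argmin(axis=1)])           # nearest entries per position
    return np.concatenate(outs, axis=1)             # (L, m * d)
```

With 4 codebooks of dimension d, each position ends up represented by a concatenated vector of dimension 4d, which is what the decoder consumes.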

Algorithm 1 Training and Sampling phase of LSSAMP
Require: A protein dataset $D_p$, a peptide dataset with secondary structure $D_s$, and the AMP dataset $D_a$. The model $\theta = \{\theta_{enc}, \theta_{dec}, B_k\}$ with $m$ codebooks. A set of $m$ prior models $p_{prior}^k$.
1: Train on $D_p$ to optimize $\theta_{enc}$ and $\theta_{dec}$.
2: Train on $D_s$ to optimize $\theta_{enc}$, $\theta_{dec}$, $B_k$ via Eqn 9.
3: Finetune $\theta$ on $D_a$ via Eqn 9.
4: Create an empty dataset $D_{idx}$.
5: for each codebook $k$ do
6:   for $\mathbf{x} \in D_a$ do
7:     Get the $k$-th codebook index sequence of $\mathbf{x}$ via Eqn 5 and save it to $D_{idx}$.
8:   end for
9:   Train an auto-regressive language model $p_{prior}^k$ on $D_{idx}$.
10: end for

The framework is shown in Figure 2.
Training Like other VAE-based generative models, we first train LSSAMP in an unsupervised manner on protein sequences, which is similar to the original ELBO in Eqn 2. Then, we incorporate structure information by jointly training on a smaller protein dataset with secondary structure annotations. Finally, we finetune our model on the AMP dataset to capture the specific AMP characteristics. The whole training process is described in Algorithm 1.
Following Kaiser et al. [25], we use an Exponential Moving Average (EMA) to update the embedding vectors in the codebooks. Specifically, we keep a count $c_k$ measuring the number of times the embedding vector $b_k$ is chosen as the nearest neighbor of $z_e(x_i)$ via Eqn 5. The counts and embeddings are updated with a sort of momentum:

$$c_k \leftarrow \gamma c_k + (1-\gamma)\, n_k, \qquad b_k \leftarrow \gamma b_k + (1-\gamma)\, \frac{1}{n_k} \sum_{z_e(x_i) \in \mathcal{N}_k} z_e(x_i)$$

where $n_k$ is the number of latent vectors assigned to entry $b_k$ in the current batch, $\mathcal{N}_k$ is the set of those vectors, and $\gamma$ is the decay parameter.
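A minimal sketch of an EMA codebook update under these definitions; the bookkeeping arrays `ema_sum`/`counts` and the `eps` smoothing are our assumptions, following common VQ-VAE practice rather than the paper's exact implementation:

```python
import numpy as np

def ema_update(ema_sum, counts, z_e, idx, gamma=0.99, eps=1e-5):
    """EMA update of a codebook from one batch of assignments.

    ema_sum: (K, d) running EMA of the summed encoder vectors per entry.
    counts:  (K,)   running EMA of assignment counts c_k.
    z_e:     (L, d) encoder outputs; idx: (L,) nearest-entry indices (Eqn 5).
    Returns the refreshed codebook plus the updated running statistics.
    """
    K = counts.shape[0]
    one_hot = np.eye(K)[idx]                                   # (L, K) assignments
    counts = gamma * counts + (1 - gamma) * one_hot.sum(axis=0)
    ema_sum = gamma * ema_sum + (1 - gamma) * (one_hot.T @ z_e)
    codebook = ema_sum / (counts[:, None] + eps)               # entry = running mean
    return codebook, ema_sum, counts
```

Each entry drifts toward the running mean of the encoder vectors assigned to it, without any gradient flowing through the codebook.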
Modeling dependency between positions To model the dependency between $z_1, \cdots, z_n$, we build an auto-regressive model over the index sequence of each codebook: $p(k_{1:n}) = \prod_{i} p(k_i \mid k_{<i})$. Specifically, we train a Transformer-based language model $p_{prior}^k$ on the index sequences from Eqn 5 for each codebook $B_k$.

Sampling We sample several index sequences from the prior models for each codebook $B_k$, and then look up the codebook to get the embedding vectors $z_q^k$. Finally, $z_q^k$ is fed to the decoder to generate the sequence together with its secondary structure. We further control the secondary structure with heuristic structure patterns to improve generation quality.
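The per-codebook autoregressive sampling step could look like the following sketch, where `prior_logits_fn` is a hypothetical stand-in for the trained Transformer prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_indices(prior_logits_fn, length, K):
    """Autoregressively sample one codebook-index sequence from a prior.

    prior_logits_fn(prefix) -> (K,) unnormalized logits for the next index,
    given the indices sampled so far (stands in for the Transformer prior).
    """
    seq = []
    for _ in range(length):
        logits = prior_logits_fn(seq)
        p = np.exp(logits - logits.max())  # softmax with max-shift for stability
        p /= p.sum()
        seq.append(int(rng.choice(K, p=p)))
    return seq
```

One such sequence per codebook is then mapped through its codebook to embedding vectors and fed to the decoder.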

EXPERIMENT
We first describe our experiment settings (Section 4.1), then introduce the automatic evaluation metrics (Section 4.2) and results (Section 4.3). Then, we verify the antimicrobial activity in the wet laboratory (Section 4.4). Last but not least, we conduct an in-depth analysis (Section 4.5) to understand LSSAMP and discuss its limitations (Section 4.6).

Experiment Setup
Dataset The Universal Protein Resource (UniProt) is a comprehensive protein dataset. We download the reviewed protein sequences (550k) limited to 100 residues in length, yielding a protein dataset of 57k examples. Then we use ProSPr [5], a community reimplementation of AlphaFold [1], to predict the secondary structure for these sequences. After filtering low-quality examples, we obtain a dataset of 46k examples with both sequence and secondary structure information. For the antimicrobial peptide dataset, we download from the Antimicrobial Peptide Database (APD) [50] and filter repeated entries, yielding 3,222 AMPs. We randomly extract 3,000 examples for validation and 3,000 for test from the protein and secondary-structure datasets; for the AMP dataset, the validation and test sets each contain 100 examples. Following Veltri et al. [49], we create a decoy set of negative examples without antimicrobial activity for comparison: peptide sequences with antimicrobial activity, and sequences with length < 10 or > 40, are removed from UniProt, resulting in 2,021 non-AMP sequences (Decoy).
Baseline Traditional methods usually randomly replace several residues of existing AMPs and conduct biological experiments on the results. Thus, we use a Random baseline that replaces each residue randomly with a fixed probability. Following Dean and Walper [15], we use a VAE to embed the peptides into the latent space and sample the latent variable $\mathbf{z}$ from the standard Gaussian distribution $\mathcal{N}(0, 1)$. For a fair comparison, we use the same Transformer architecture as our model LSSAMP and train on the UniProt and APD datasets. AMP-GAN, proposed by Van Oort et al. [47], uses a BiCGAN architecture with convolution layers. It consists of three parts: the generator, the discriminator, and the encoder; the generator and discriminator share the same encoder. It is trained on 49k presumed negative sequences from UniProt and 7k positive AMP sequences. PepCVAE is a semi-supervised VAE generative model that concatenates attribute features to the latent variable for conditional generation [14]. Since the authors did not release their code, we use the model architecture from Hu et al. [22] and modify the reproduced code for AMPs, as described in their paper. The original paper uses 93k sequences from UniProt and 7,960/6,948 positive/negative AMPs for training; for comparison, we train it on our UniProt and APD datasets. MLPeptide [9] is an RNN-based generator; it is first trained on 3,580 AMPs and then transferred to specific bacteria. LSSAMP is implemented as described in Section 3.3, and the detailed hyperparameters are given in Appendix A.3.

Automatic Evaluation Metric
Following previous work [13,47], we use open-source AMP prediction tools to estimate the AMP probability of the generated sequences. Since these open-source AMP predictors are trained on, and report results for, different AMP datasets, we use all of them for prediction.

4.2.1 AMP Classifiers. Thomas et al. [42] trained on an AMP database of 3,782 sequences with random forest (RF), discriminant analysis (DA), support vector machine (SVM), and artificial neural network (ANN) classifiers, respectively. We also use AMP Scanner v2, a deep neural network classifier [49].

4.2.2 Sequence Attributes. Charge is important because the bacterial membrane usually carries a negative charge, so peptides with a positive charge are more likely to bind to the membrane. We only consider integer charges. The total charge of a peptide sequence $\mathbf{x}$ is defined as the sum of the charges of its residues at pH 7.4: $Q(\mathbf{x}) = \sum_{x_i \in \mathbf{x}} q(x_i)$.
Hydrophobicity reflects the tendency to bind lipids on the bacterial membrane. A peptide with high hydrophobicity moves easily from the solution environment to the bacterial membrane. We use the hydrophobicity scale $h(x_i)$ of Eisenberg et al. [17] to calculate the hydrophobicity of a sequence: $H(\mathbf{x}) = \sum_{x_i \in \mathbf{x}} h(x_i)$.

Hydrophobic Moment is viewed as a measure of amphipathicity, indicating the ability of the peptide to bind water and lipids simultaneously; it is a definitive feature of antimicrobial peptides [20]. The hydrophobic moment $\mu H$, defined by Eisenberg et al. [17], is determined by the hydrophobicity $h(x_i)$ of each residue $x_i$ together with the angle $\delta$ between consecutive residues. The angle can be estimated from the secondary structure: $\delta$ is 100° for the α-helix structure and 180° for the β-sheet.

For each peptide, we calculate the above attributes to measure its antimicrobial activity. For comparison, we draw the distributions on the APD and decoy datasets and select a range for each attribute based on the biological mechanism (Appendix A.1). We use the percentage of peptides within each attribute range to evaluate generation performance, and use Combination to measure the percentage of peptides that satisfy all three conditions simultaneously.

[Table 3: Sequence attributes of generated sequences. We use the percentage of peptides meeting the valid range (Appendix A.1) to measure performance. Uniq is the number of unique generated sequences. C, H, uH correspond to charge, hydrophobicity, and hydrophobic moment. The best results are in bold.]
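For illustration, the three attributes can be computed as below. The per-residue values shown are a handful of approximate Eisenberg consensus-scale entries and integer side-chain charges at pH 7.4, included only as placeholders for the full tables used in the paper:

```python
import math

# Illustrative per-residue values (a subset; approximate Eisenberg consensus
# hydrophobicities and integer side-chain charges at pH 7.4).
HYDRO = {'A': 0.25, 'L': 0.53, 'K': -1.10, 'G': 0.16, 'F': 0.61, 'I': 0.73,
         'S': -0.26, 'R': -1.76, 'E': -0.62, 'D': -0.72}
CHARGE = {'K': 1, 'R': 1, 'D': -1, 'E': -1}

def net_charge(seq):
    """Q(x): sum of residue charges."""
    return sum(CHARGE.get(a, 0) for a in seq)

def hydrophobicity(seq):
    """H(x): sum of residue hydrophobicities."""
    return sum(HYDRO.get(a, 0.0) for a in seq)

def hydrophobic_moment(seq, delta_deg=100.0):
    """Eisenberg hydrophobic moment; delta = 100 degrees for an alpha-helix."""
    sin_s = sum(HYDRO.get(a, 0.0) * math.sin(math.radians(delta_deg * i))
                for i, a in enumerate(seq))
    cos_s = sum(HYDRO.get(a, 0.0) * math.cos(math.radians(delta_deg * i))
                for i, a in enumerate(seq))
    return math.hypot(sin_s, cos_s)
```

A peptide whose hydrophobic residues cluster on one helix face yields a large moment, which is exactly the amphipathicity the uH metric rewards.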

Novelty.
To measure the novelty of the generated peptides, we define three evaluation metrics: Uniqueness, Diversity, and Novelty. Uniqueness is the percentage of unique peptides among the generated ones. Diversity measures the dissimilarity among the generated peptides: we calculate the Levenshtein distance [30] between every pair of sequences, normalize it by the sequence length, and average the normalized distances. The higher the diversity, the more dissimilar the generated peptides are. Novelty is the difference between the generated peptides and the training AMP set: for each generated sequence, we find the training peptide with the smallest Levenshtein distance, normalize that distance by the sequence length, and average over all generated sequences.
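These metrics can be computed with a standard dynamic-programming edit distance. Normalizing by the longer of the two lengths is our assumption, since the text only says "normalized by the sequence length":

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def diversity(peptides):
    """Mean normalized pairwise edit distance among generated peptides."""
    dists = [levenshtein(a, b) / max(len(a), len(b))
             for i, a in enumerate(peptides) for b in peptides[i + 1:]]
    return sum(dists) / len(dists)

def novelty(generated, training):
    """Mean normalized distance from each peptide to its nearest training AMP."""
    return sum(min(levenshtein(g, t) / max(len(g), len(t)) for t in training)
               for g in generated) / len(generated)
```

Both metrics lie in [0, 1]; identical sets score 0 and fully dissimilar ones approach 1.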

Experimental Results
We generate 5,000 sequences for each baseline. During generation, we add structural restrictions on positions based on the antimicrobial mechanism. Specifically, we reject peptides with more than 30% coil structure ('-'), which can hardly fold in the solution environment and insert into the bacterial membrane, during in-silico screening. Besides, we limit the minimum length of a continuous helix ('H') to 4 according to physical rules. We name our model with structural control LSSAMP and the model without extra conditions LSSAMP w/o cond.

[Table 4: Novelty of generated sequences, measured by Uniqueness, Diversity, and Novelty (mean ± std).]

Sequence Attributes As listed in Table 3, LSSAMP outperforms all baselines on the combination percentage, which indicates that our model can generate sequences satisfying multiple properties at the same time. Besides, its combination percentage is similar to that of APD, which means that our model generates sequences with a distribution close to APD. LSSAMP tends to generate peptides with higher hydrophobicity, while AMP-GAN and MLPeptide sample more cationic sequences; this is because they were trained only on sequences and focus more on direct amino acid attributes. Compared with them, LSSAMP better captures amphipathicity, as indicated by the highest uH, owing to the incorporation of structure labels. PepCVAE inefficiently generates redundant sequences, which results in a significant decrease in the number of unique sequences; since the percentage is calculated over the whole generation size (5,000), this leads to low performance on all attributes. Furthermore, by further controlling the secondary structure, H, uH, and Combination can all be improved. This verifies that the secondary structure also affects sequence attributes.
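The two structural restrictions described above can be expressed as a simple predicate. This is a sketch; `passes_structure_filter` and the requirement that at least one helix run exists are our assumptions:

```python
import re

def passes_structure_filter(ss, max_coil=0.30, min_helix_run=4):
    """Heuristic structure filter: reject sequences with more than 30% coil
    ('-') or with any continuous helix run ('H') shorter than 4 residues.

    ss: predicted secondary structure string, e.g. "-HHHHHHHH-".
    """
    if ss.count('-') / len(ss) > max_coil:
        return False  # too much coil: unlikely to fold and insert
    runs = re.findall(r'H+', ss)
    # Assumption: require at least one helix, each run physically plausible.
    return bool(runs) and all(len(r) >= min_helix_run for r in runs)
```

Applied as a rejection step during sampling, this keeps only structures consistent with the membrane-insertion mechanism.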
Novelty From Table 4, we see that VAE has the highest diversity and novelty. However, from Table 2 and Table 3, we find that the peptides generated by VAE have neither a high AMP probability nor ideal sequence attributes, which means that a vanilla VAE trained on AMP datasets without attribute control can hardly capture antimicrobial features. Meanwhile, LSSAMP has a significant advantage over the strong baselines PepCVAE and MLPeptide, which means that our model can generate promising AMPs with relatively high novelty. Besides, constraining the secondary structure leads to a decline in diversity. However, it does not result in more redundant peptides, since the uniqueness does not decrease. This indicates that the restrictions make the model capture similar local patterns without generating exactly the same sequences.

Ablation Study
We conduct an ablation study for LSSAMP and show the results in Table 5. PPL is the perplexity of generated sequences, which measures fluency. Loss is the model loss on the validation set. AA Acc. is the reconstruction accuracy of residues, and SS Acc. is the prediction accuracy of the secondary structure. We find that without the first training phase on the protein dataset, the model can hardly generate valid sequences. Removing the second phase, which trains the model on the large-scale secondary structure dataset, hurts prediction performance on the target AMP dataset. If we replace the multiple sub-codebooks with a single large codebook of the same total size, performance also declines.

Wet Laboratory Experiments
We synthesized and experimentally characterized peptides designed with LSSAMP. First, we filtered the 5,000 generated peptide sequences based on their physical attributes (as outlined in Section 4.2.2 and Appendix A.1) and employed off-the-shelf AMP classifiers to select the ones with high antimicrobial scores (Section 4.2.1). Second, we ranked the sequences according to their novelty (Section 4.2.3) and selected those with an edit distance greater than 5 residues from the existing training sequences. Finally, we obtained 21 peptides and synthesized them for wet-lab experiments.
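The three-stage selection could be sketched as below, with `attr_ok` and `amp_score` as hypothetical stand-ins for the attribute-range filter and the AMP classifiers:

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def select_candidates(peptides, attr_ok, amp_score, training, min_edit=5, top_k=21):
    """Sketch of the selection pipeline above.

    Stage 1: keep peptides whose attributes fall in the valid ranges (attr_ok).
    Stage 2: rank by classifier score (amp_score; higher = more likely AMP).
    Stage 3: keep peptides more than `min_edit` edits from every training AMP.
    """
    kept = [p for p in peptides if attr_ok(p)]
    kept.sort(key=amp_score, reverse=True)
    novel = [p for p in kept
             if min(edit_distance(p, t) for t in training) > min_edit]
    return novel[:top_k]
```

Any concrete attribute predicate and classifier can be plugged in for `attr_ok` and `amp_score`; only the structure of the pipeline is taken from the text.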
Following previous AMP design work [9,12], we use the minimum inhibitory concentration (MIC) to indicate peptide activity, defined as the lowest concentration of an antibiotic that prevents visible growth of bacteria; a lower MIC means higher antimicrobial activity. To determine MIC, the broth microdilution method was used. A colony of bacteria was grown in LB (lysogeny broth) medium overnight at 37°C at pH 7. A peptide concentration range of 0.25 to 128 mg/liter was used for the MIC assay. The concentration of bacteria was quantified by measuring the absorbance at 600 nm and diluted to OD600 = 0.022 in MH medium. The sample solutions (150 µL) were mixed with 4 µL of diluted bacterial suspension and finally inoculated with about 5 × 10⁵ CFU. The plates were incubated at 37°C for 18 h until satisfactory growth. For each test, two columns of plates were reserved for sterility control (broth only) and growth control (broth with bacterial inoculum, no antibiotics). The MIC was defined as the lowest concentration of the peptide that inhibited visible growth of bacteria after treatment with MTT.
We tested the antimicrobial activities against three panels of Gram-negative bacteria (A. baumannii, P. aeruginosa, E. coli), which took about 30 days. As shown in Table 6, two peptides were found to be effective against A. baumannii; P2 also showed activity against P. aeruginosa and P1 against E. coli. Besides, these two newly discovered AMPs differ from existing AMPs and had low toxicity, which makes them promising new therapeutic agents. The wet-lab results demonstrate that LSSAMP can effectively find AMP candidates and reduce discovery time.

Analysis
Codebook Number We explore the effect of different numbers of codebooks on generation performance. From Table 7, we find that a single small codebook can hardly learn enough information to reconstruct the sequence. The PPL, Loss, and SS Acc. improve as the number of codebook entries increases. However, the reconstruction accuracy is best with 3 codebooks. This may be due to the relatively short local patterns of sequences, which make a window of 8 too long. We do not increase the window sizes to [1,2,4,8,16] because the maximum sequence length is set to 32, making a window of 16 too long to capture local features.

[Table 7: The influence of the number of codebooks on the validation set. '[1,2,4,8]' indicates 4 codebooks with window sizes of 1, 2, 4, and 8. The symbols have the same meanings as in Table 5.]
Case Study We show 10 peptides generated by LSSAMP in Table 8. All of them have a long alpha-helix in the middle and coil structures at the head or tail. These sequences also have positive charges with high hydrophobicity and hydrophobic moment. We further predict and build 3D structures of 4 generated sequences via PEP-FOLD3 [39] and render them with PyMOL [37] in Figure 3. All four peptides have helical structures, which is consistent with our secondary structure predictions. These helical structures make them more likely to have antimicrobial ability. However, our model fails by predicting a single long continuous helical structure for Y4 and Y9; in fact, they have a small coil structure between two helical segments. This indicates that our model tends to predict one long continuous secondary structure instead of several discontinuous small fragments.
Visualization of LSSAMP Distribution We plot the distributions of residues, charge, sequence length, hydrophobicity, and hydrophobic moment for APD, Decoy, LSSAMP, and LSSAMP w/o cond in Figure 4. For the distribution of amino acids, the x-axis shows the different amino acids and the y-axis their frequency in the generation set. Without the extra condition, LSSAMP w/o cond has an amino acid distribution similar to APD. However, adding the secondary structure conditions greatly increases the frequency of A, K, and L. These amino acids also have relatively high frequencies in APD, which suggests they might be responsible for the antimicrobial activity. Therefore, the addition of secondary structures may further shift the amino acid distribution to increase the probability of generating AMPs.
For the global charge, compared with APD, the decoy dataset has more negatively charged sequences, and our models tend to generate sequences with positive charges. For sequence length, the structural control makes the generated sequences less diverse in length and tend to be longer, because we force the generated sequences to contain an alpha-helix of length greater than 4. For hydrophobicity and hydrophobic moment, we find a similar tendency: LSSAMP w/o cond captures the distribution of APD, and the secondary structure condition makes the distribution more concentrated with a higher mean.
In the 3D visualization, we use the charge, H, and uH as the axes to examine the combination of the three attributes. LSSAMP and LSSAMP w/o cond almost completely overlap with APD, which indicates that the three have similar distributions over these attributes, while the decoy sequences fall outside this region.

To conclude, Figure 4 is a sanity check indicating that our base model LSSAMP w/o cond successfully captures the distribution of APD along various dimensions, and the extra secondary structure condition further improves the distribution.

Limitation
Although LSSAMP has shown its effectiveness, it is limited by several factors. First, LSSAMP models the secondary structure instead of the 3D structure. The use of 3D structures is limited by the amount of available precise data and the difficulty of predicting 3D structures from a single input sequence. Compared to 3D structures, secondary structures are much easier to annotate. Benefiting from the progress of AlphaFold2 in predicting 3D structures, we expect to extend this work by incorporating 3D information into the latent representation space and generating sequences with ideal 3D structures.
The other limitation is that there is currently no standard evaluation metric for antimicrobial activity. In this paper, we follow previous practice and use AMP classifiers and sequence properties to evaluate performance. However, the performance of AMP classifiers on generated peptides may be unreliable because they are trained on existing AMPs [2]. Furthermore, it is difficult to identify a reasonable range of sequence properties, because existing AMPs have different mechanisms that result in various sequence attributes. The only reliable way to check antimicrobial activity is the wet laboratory test, but it is expensive and time-consuming, which makes large-scale evaluation impossible. In the future, we hope that more reliable automatic evaluation metrics for AMPs will be proposed.

CONCLUSION
In this paper, we propose LSSAMP, which jointly learns sequential and structural features in the same latent space and can generate peptides with ideal sequence attributes and secondary structures simultaneously. Moreover, it leverages a multi-scale VQ-VAE for fine-grained control of each position. The performance evaluated with open-source AMP predictors and computational sequence attributes indicates the effectiveness of LSSAMP. It further yielded two peptides with high activity against Gram-negative bacteria. This suggests that our generative model can effectively create an AMP library with high-quality candidates for follow-up biological experiments, accelerating AMP discovery.

A APPENDIX
A.1 Attribute Distribution
To determine effective thresholds for the charge, hydrophobicity, and hydrophobic moment of AMPs, we analyze the sequence distributions of APD and the decoy set in Figure 5. For charge, we follow the rule summarized by experts and choose sequences whose net charge is +2 to +10. For the remaining two attributes, we draw histograms and compare the proportions in each bin. If the proportion in APD is larger than that in the decoy set, we add the bin to the acceptance range of the evaluation metric.
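The bin-by-bin comparison above can be sketched as a small helper. This is a minimal illustration of the procedure, not the paper's implementation; the function name, bin count, and shared-range binning are assumptions.

```python
import numpy as np

def acceptance_bins(amp_values, decoy_values, n_bins=20):
    """Return the histogram bins where the APD (AMP) proportion exceeds
    the decoy proportion, following the paper's generate-then-filter setup.

    Hypothetical helper: both attribute distributions are binned over a
    shared range, and a bin is accepted whenever the fraction of AMPs
    falling into it is larger than the decoy fraction.
    """
    lo = min(np.min(amp_values), np.min(decoy_values))
    hi = max(np.max(amp_values), np.max(decoy_values))
    edges = np.linspace(lo, hi, n_bins + 1)
    amp_hist, _ = np.histogram(amp_values, bins=edges)
    dec_hist, _ = np.histogram(decoy_values, bins=edges)
    amp_prop = amp_hist / amp_hist.sum()
    dec_prop = dec_hist / dec_hist.sum()
    # Accept the bin if AMPs are over-represented relative to decoys.
    return [(edges[i], edges[i + 1])
            for i in range(n_bins) if amp_prop[i] > dec_prop[i]]
```

The union of accepted bins then serves as the acceptance range when scoring generated peptides on hydrophobicity and hydrophobic moment.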

A.2 Secondary Structure Filter
Similar to proteins, the biological functions of AMPs are determined by their amino acid sequences and folded structures [7]. If a peptide cannot fold into an appropriate structure, it is still difficult for it to take effect. For example, by forming a helical structure, a peptide can gather hydrophobic amino acids on one side and hydrophilic amino acids on the other. This amphiphilic structure helps the peptide insert into the membrane and maintain a stable pore with other molecules in the membrane, as shown in Figure 6. Without it, the peptide can hardly penetrate the membrane and merely attaches to the surface. But does controlling the secondary structure also affect sequence attributes? To answer this question, we constrain the secondary structure of the peptides generated by our baseline to the alpha-helix. The performance gaps are shown in Table 9, from which we can find that most of the results are improved by limiting sequences to alpha-helix structures. This shows that by controlling the structure, the sequence attributes can be improved, which verifies the importance of introducing secondary structures into the controlled generation process. However, the sequence size decreases significantly, indicating that this generate-then-filter pipeline is inefficient.
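The alpha-helix constraint above amounts to a post-hoc filter on predicted secondary-structure strings. A minimal sketch, assuming per-residue annotations in the format of Table 8 ('H' = alpha-helix, 'T' = turn, '-' = coil); the function names and the 0.5 threshold are illustrative assumptions, not the paper's settings:

```python
def helix_fraction(ss: str) -> float:
    """Fraction of residues annotated as alpha-helix ('H')."""
    return ss.count('H') / len(ss) if ss else 0.0

def filter_by_helix(pairs, min_helix=0.5):
    """Generate-then-filter step (sketch): keep (sequence, structure)
    pairs whose predicted secondary structure is mostly alpha-helical.
    Annotation follows Table 8: 'H' = alpha-helix, 'T' = turn, '-' = coil."""
    return [(seq, ss) for seq, ss in pairs if helix_fraction(ss) >= min_helix]
```

Because many generated sequences fail the threshold, the surviving set shrinks sharply, which is exactly the inefficiency of generate-then-filter that motivates building the structure constraint into the latent space instead.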
Visualization of Residue Distribution. To illustrate the distribution of residues in the generated peptides, we plot tSNE in Figure 7. We represent each peptide as a vector whose dimensions give the probability of each residue, then use tSNE to project these high-dimensional vectors to 2D for visualization. We find a large overlap between LSSAMP w/o condition and APD, which indicates that our model has captured the global distribution of APD instead of collapsing to a local mode. Furthermore, LSSAMP covers APD and has some outliers. The results show that, with the secondary structure condition, our model can not only learn the existing AMP distribution but also explore more of the possible space.

Structure Condition. As described above, controlling the secondary structure can affect the attributes of generated peptides. Thus, we limit the percentage of the coil structure to different ratios and calculate the sequence attributes of the generated peptides. The results are shown in Figure 8, where the x-axis is the maximum percentage of coil structure allowed during generation. We
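The residue-probability vectors that feed the tSNE plot can be computed as below. This is a minimal sketch of the vectorization step only; the 2D projection itself would use an off-the-shelf tool such as scikit-learn's `sklearn.manifold.TSNE` (an assumed choice, since the paper does not name its implementation here).

```python
import numpy as np

# The 20 canonical amino acids, one dimension per residue type.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_distribution(peptide: str) -> np.ndarray:
    """Map a peptide to a 20-dim vector whose i-th entry is the
    probability (relative frequency) of AMINO_ACIDS[i] in the sequence."""
    vec = np.zeros(len(AMINO_ACIDS))
    for aa in peptide:
        vec[AMINO_ACIDS.index(aa)] += 1
    return vec / max(len(peptide), 1)

# Stacking these vectors for all peptides in a dataset gives the
# high-dimensional points that tSNE reduces to 2D for Figure 7.
```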

Figure 1 :
Figure 1: The overview of AMP discovery. The first two steps focus on sequence attributes and the third models the structure. The final step verifies antimicrobial activity by inhibiting the growth of bacteria. The grey region is the bacterial suspension, and the white region indicates a low bacterial concentration.

Figure 2 :
Figure 2: The encoder of LSSAMP. Here, we use four pattern selectors to select local patterns at different scales and use the corresponding codebooks to obtain discrete latent variables for each position. The number of selectors is further discussed in Section 4.5.

Figure 4 :
Figure 4: The distribution of amino acids, charge, sequence length, hydrophobicity, hydrophobic momentum, and a 3D visualization for three sequence attributes.

Figure 5 :
Figure 5: The histogram of charge, hydrophobicity and hydrophobic moment on APD and decoy dataset.

Figure 6 :
Figure 6: An example of the antimicrobial mechanism. Blue indicates hydrophobic amino acids, and red indicates hydrophilic ones. On the left, although the peptides with suitable amino acids have attached to the bacterial membrane, they still cannot insert into it. However, by folding into a helix structure, as shown on the right, the peptides maintain a stable pore that breaks the membrane of the bacterium.

Figure 7 :
Figure 7: The tSNE plot of the residue distribution of each sequence on four datasets.

Figure 8 :
Figure 8: The sequence attributes of peptides with different percentages of the coil structure. The x-axis is the maximum coil percentage and the y-axis is the percentage of peptides that fall within the attribute range.
[9,12,14,47] as Scanner, is a CNN- and LSTM-based deep neural network trained on 1778 AMPs picked from APD. AMPMIC [52] trained a CNN-based regression model on 6760 unique sequences and 51345 MIC measurements to predict MIC values. IAMPE [26] is a model based on Xtreme Gradient Boosting; it achieves the highest correct prediction rate on a set of ten more recent AMPs [2]. ampPEP [29] is a random-forest-based model trained on 3268 AMPs; it has the best performance across multiple datasets [2].

4.2.2 Sequence Attributes. Following previous AMP design work [9,12,14,47], we use three sequence attributes to evaluate generation performance: Charge (C), Hydrophobicity (H), and Hydrophobic Moment (uH), computed from the residues of each sequence.
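The three attributes can be computed per peptide as sketched below. This is an illustrative implementation, not the paper's: the hydrophobicity values are loosely based on the Eisenberg consensus scale (the paper's exact scale is not specified here), the charge approximation counts only K/R against D/E, and the 100-degree turn angle assumes an ideal alpha-helix.

```python
import math

# Illustrative per-residue hydrophobicity values (loosely based on the
# Eisenberg consensus scale); the exact scale is an assumption.
HYDRO = {"A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
         "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
         "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
         "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08}

def net_charge(seq: str) -> int:
    """Approximate net charge (C) at neutral pH: basic (K, R) minus acidic (D, E)."""
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")

def hydrophobicity(seq: str) -> float:
    """Mean per-residue hydrophobicity (H)."""
    return sum(HYDRO[a] for a in seq) / len(seq)

def hydrophobic_moment(seq: str, angle_deg: float = 100.0) -> float:
    """Hydrophobic moment (uH) for an ideal alpha-helix: magnitude of the
    vector sum of per-residue hydrophobicities, rotated 100 degrees per residue."""
    x = sum(HYDRO[a] * math.cos(math.radians(angle_deg * j))
            for j, a in enumerate(seq))
    y = sum(HYDRO[a] * math.sin(math.radians(angle_deg * j))
            for j, a in enumerate(seq))
    return math.hypot(x, y) / len(seq)
```

A generated peptide then passes the attribute filter when its charge falls in +2 to +10 and its H and uH land in the acceptance ranges derived from the APD/decoy histograms of Appendix A.1.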

Table 2 :
The percentage of generated sequences predicted as AMP. The classifiers are described in Section 4.2.1. The first part shows the prediction results on the AMP and non-AMP datasets as a reference. The bold numbers are the best model results.

Table 4 :
The novelty of the sampling.↑ means higher is better.

Table 2 .
LSSAMP performs best on four of the seven classifiers and has the highest average score across all classifiers, indicating its advantage over the baselines. PepCVAE performs best on the AMPMIC and IAMPE predictors; however, it performs poorly on the other predictors and gets a low average score. MLPeptide performs relatively evenly across predictors, outperforming other models on Scanner and slightly underperforming our model on the average score. The comparison between LSSAMP and LSSAMP w/o cond indicates that adding fine-grained control over the secondary structure further improves generation performance, which again shows the importance of taking the secondary structure into account during AMP generation.

Table 8 :
10 generated peptides. 'H' is the alpha-helix, 'T' is the turn, and '-' is the coil.