An Approach to Developing Benchmark Datasets for Protein Secondary Structure Segmentation from Cryo-EM Density Maps

More and more deep learning approaches have been proposed to segment secondary structures from cryo-electron density maps at medium resolution range (5--10Å). Although the deep learning approaches show great potential, only a few small experimental data sets have been used to test the approaches. There is limited understanding about potential factors, in data, that affect the performance of segmentation. We propose an approach to generate data sets with desired specifications in three potential factors - the protein sequence identity, structural contents, and data quality. The approach was implemented and has generated a test set and various training sets to study the effect of secondary structure content and data quality on the performance of DeepSSETracer, a deep learning method that segments regions of protein secondary structures from cryo-EM map components. Results show that various content levels in the secondary structure and data quality influence the performance of segmentation for DeepSSETracer.


INTRODUCTION
Deep learning has been widely applied in biological problems.With the fast accumulation of cryo-electron microscopy (cryo-EM) image data and 3-dimensional molecular structure data, an increasing number of deep learning methods have been developed in many subdomains of cryo-EM or cryo-electron tomography (cryo-ET).For example, deep learning approaches have been developed for picking out molecular particles from 2D cryo-EM images [1], for segmentation of protein secondary structures from cryo-EM 3D density maps [2][3][4][5][6], for deriving initial backbones from cryo-EM density maps [7], and for segmentation of cellular objects from cryo-ET images [8,9].The problem of segmentation of protein secondary structures from a cryo-EM maps of medium resolution (5-10Å) is to detect the location of helices and β-sheets in the cryo-EM density map (Fig. 1).
Many image processing methods have been proposed for segmentation of protein secondary structures [10][11][12][13][14].In addition, recent deep learning methods have shown great potential leading to increased accuracy [2][3][4][5][6]15].However, different approaches were tested using different experimental data, and there has been limited study suggesting a proper procedure to construct a test data set.Emap2sec used 4-fold cross-validation to train and test using 43 experimental maps with the medium resolution [3].The data set was obtained after two steps of screening.The first step eliminates low-quality maps that have lower than 0.65 cross-correlation score between the map and the atomic structure.The second screening removes maps if any of their chains share more than 25% sequence identity with any chain of a map in the data set [3]. EMNUSS used the same 43 experimental maps that was developed in EMap2sec method [6].Emap2sec+, which segments both secondary structures and nucleic acids, was tested four times using medium-resolution experimental maps, each with 4 to 5 density maps.The small number of experimental maps used in testing may be related to limited cryo-EM maps that contain nucleic acids.Haruspex was trained and tested using high-resolution density maps with resolution better than 4Å [4].It was not tested using mediumresolution maps.Our previous method DeepSSETracer was tested using cryo-EM density map components centered around protein chains, rather than entire density maps [5].Structure validation has been an important problem in the cryo-EM community.There have been coordinated efforts to develop test data for structure validation [16][17][18], Map and Model Challenge of 2016 [17], and Model Metrics Challenge in 2019 [18].The data used in the challenges were designed for validation of atomic structures, particularly for cryo-EM maps with better than 5 Å resolution, making them inapplicable for secondary structure segmentation from mediumresolution maps.A benchmark data set will likely advance the development of approaches for the segmentation of secondary structures.
Developing benchmark set requires understanding the distribution of data over metrics that potentially influence the segmentation.A general hypothesis is to eliminate bad quality data in training and testing.However, there has been no study how quality affects the performance and how many data at what quality level should be included in a benchmark set to fairly represent the entire database.There has also been no study on other potential factors for segmentation, such as the complexity of structures, the size of secondary structures, the size of the protein, and repetitive sequences that are often in cryo-EM density maps.
In this study, we developed a data stratification approach to create data sets satisfying pre-defined specifications in sequence identify, secondary structure contents, and quality of data.The idea is to cluster the entire data set using potential factors and then to compose and select data sets satisfying specific requirements.This approach was implemented and has generated a test set and various training sets with different levels of data quality and secondary structures contents.Our results show that the distributions of data across different structure clusters and quality clusters in the training set can affect the performance of deep learning models of DeepSSETracer.

METHOD
Cryo-EM density maps with resolution range 5-10 Å were downloaded from Electron Microscopy Data Bank (EMDB) [19].Their corresponding atomic structures were downloaded from Protein Data Bank (PDB) [20].Since many cryo-EM density maps contain multiple copies of the same atomic structure chains, only one of the repeated chains in a map was used as an envelope to isolate the corresponding density region with Chimera [21].The data set used in this paper contains 1292 atomic structure chains and their corresponding regions in the cryo-EM maps.The overall idea for stratification is to first create clusters from the entire data set based on each factor that potentially affects the performance of segmentation training and testing.Three factors are used in this study -the sequence similarity between protein chains, the secondary structure content (helix, β-sheets) in a chain, the structure-map fit that often reflects the quality of a map region (Fig. 1B).The atomic structure (ribbon) and its corresponding cryo-EM density map region (gray) for EMD-3850 PDB-5oqm chain 4; (B): Segmented helix (blue) and β sheet (pink) regions detected using DeepSSETracer [5].The density map and the atomic structure are visualized in ChimeraX [22].(C): Clusters created based on three metricssequence similarity, secondary structure content, and quality of structural fit in cryo-EM maps.

Creation of Sequence Clusters
A matrix of sequence identity scores for protein chains was created to represent the similarity between each pair of sequences across all obtained chains.The sequence identity score of any given pair was calculated by aligning two chain sequences using Needleman Wunsch algorithm [23] and defined as the percentage of identical amino acids in the alignment.The obtained similarity matrix was subsequently used to acquire sequence clusters of chains by the agglomerative hierarchical clustering algorithm.All clusters were created to ensure that any two chains with an identity score equal or above 40% were grouped into the same cluster.The agglomerative clustering algorithm is implemented in Python, using scikit-learn library [24].The single-linkage with a distance threshold of 60 was used to maintain that any two chains with a sequence identity score equal to or above 40% were grouped into the same cluster, and any cluster shared no more than 40% sequence identity with the remaining clusters.

Creation of Structure Clusters
Seven variables describing varying aspects of secondary structure content in each chain were used to obtain structure clusters of chains.These variables include the chain length (i.e., the number of amino acid residues), the number of helix residues, the number of β-sheet residues, the number of helices, the number of β-strands, the average helix length, and the average of β-strand length.Each variable was normalized to between 0 and 1 before clustering.Specifically, the length of a chain was normalized by the minimum and maximum chain length in the dataset.The number of helix (βsheet) residues was divided by the chain length, followed by minmax normalization.Similarly, the number of helices (β-strands) was also divided by the chain length followed by min-max normalization.The average length of helices (β-strands) in a chain was calculated as the number of total helix (β-strands) residues divided by the number of helices (β-strands) and was also normalized by min-max normalization.We plotted a dendrogram to visualize the hierarchical relationship of individual chains in our dataset and to determine the optimal number of structure clusters, then applied the agglomerative hierarchical clustering with Ward's linkage to obtain 4 clusters.

Creation of Quality Clusters
For a given density map, its quality was estimated by that of its helix regions.A previously developed metric, the cylindrical fit between atomic structures of helices and their respective density map regions [25] was used to assess the quality.Quality clusters of chains were subsequently obtained by applying thresholds on the quality scores, leading to four non-overlapping clusters indexed with integers from 1 to 4. Quality cluster 1 is considered as the highest quality cluster, in which the cylindrical fit scores (F1 measurement) of chains are between 0.7 and 1.0.The difference between precision and recall () are further used for the classification of clusters 2 and 3. Specifically, cluster 2 includes chains with cylindrical fit score ranging between 0.6 and 0.7, with  less than 0.15.Quality cluster 3 consists of chains under one of the two conditions: Either the cylindrical fit score between 0.6-0.7,and >0.15, or the quality score between 0.55 and 0.6, and <0.15.All remaining chains that do not enter into any of the clusters 1, 2 and 3 are placed in cluster 4.

Test Set Construction
In general, the construction of a good test set involves the consideration of multiple factors, such as redundancy in density map regions, map quality, representativeness of targeted population, and structural complexity.Here, we implemented a method to construct a test set that has a distribution of chains in both structure content and map quality similar to those in the full dataset.Specifically, a total of 50 candidate test sets were first obtained by uniform random sampling from the full dataset (Fig. 2A).For each candidate set, the Mean Square Error (MSE) [26] defined in below was then calculated to measure the difference in the distribution of chains by structure content between the candidate set and the reference set, which, in this case, is the full set.
where  is total number of structure clusters,    is the proportion of chains in the full set that are from  -th structure cluster, and    is the similar proportion in a candidate test set.A similar MSE was also calculated to measure the difference in the distribution over quality clusters.Finally, the candidate test set with the minimum average of the two MSEs was chosen as the test set.

Construct Training Sets with Pre-defined Specifications
The constructed sequence, structure, and quality clusters of chains provide a foundation for producing various data sets satisfying desired distributions over the three factors.For example, since the four structure clusters have distinct structural compositions (such as numbers and lengths of helices and β-strands), and therefore a data set with high content of β-sheets can be composed to include more chains from those structure clusters with high β-sheet composition.As illustrated in Figure 2B, once the test set was determined, a training set with a pre-defined specification was derived through sampling from a specially prepared pool of chains.The construction of this pool started with the inclusion of all chains in the entire data set except those included in the test set.The pool was further prepared to meet the pre-defined specification.As an example, a pre-defined specification (Specification 1 in Table 1) is intended to create a training set biased towards chains in clusters 1 through 4 with specification: <70%, 50%, 30%, 100%>.To achieve this, 30%, 50%, 70%, and 0% chains from structure clusters 1 through 4, respectively, were randomly chosen and removed from the pool.
Once the pool was prepared, 50 times of random sampling was conducted to generate candidate training sets with each containing 400 chains.If maintaining the same distribution of map quality as in the full set is among the desired specification, a MSE was calculated as in Eq. ( 1) where the full set and the candidate training set were used for    and    respectively.The one with the minimum MSE was chosen to be the final training set satisfying the specification.The above procedure also allows construction of training sets that are biased towards any of the quality clusters while maintaining similar distribution of structural content as in the full set, such as in Specification 6 (Table 1).Table 1: Structure and quality specifications used in generating training sets.a : percentages of chains from the four structure clusters (Specifications 1 to 3) and quality clusters (Specifications 4 to 6) entered into the sampling pool; b : numbers of chains from structure clusters s1 to s4 and quality clusters q1 to q4.

RESULTS
The full data set contains 1,292 protein chains, each having its corresponding region isolated from cryo-EM density maps.To understand the overall characteristics of the aggregated data, we examined the distribution of all chains in length (i.e., number of amino acid residues), percentage of helix and β-sheet content, and number of helices and β-strands.The most popular chain length is about 150 amino acid residuals; and chains are predominantly within 400 residuals (Figure 3A).Large number of chains have around 40% of their residues from helices and 20% from β-sheets, although substantial number of chains vary widely in length and are without β-sheets (Figure 3B).In majority of the chains, both the numbers of helices and β-strands are within 20.The distribution of these chains is somewhat uniform in the two numbers (Figure 3C), meaning no obvious correlation between the two.

Sequence Clusters
Cluster analysis based on pairwise sequence identity led to a total of 414 distinct clusters, guaranteeing that the chains in the same cluster share at least 40% sequence identity with each other, and every cluster shares at most 39.9% sequence similarity with any other.The distribution of size among these clusters is provided in Figure 4.The largest cluster contains 21 chains, and 22 clusters have more than 10 chains.There are 196 clusters that contain only one member, indicating that the aggregated data set has 196 chains with less than 40% sequence identity shared with any other chain.To simplify the procedure, the test set was constructed via random sampling from these 196 unique clusters, and the set with the lowest average MSE was chosen.

Structure Clusters
With the agglomerative hierarchical clustering on the seven variables characterizing secondary structural content of chains, we obtained four structure clusters indexed using integers from 1 to 4. The numbers of chains in the four clusters are 318, 402, 500, and 72, respectively (Table 2).Structure cluster 3 comprises the majority of protein chains in the full data set.The average length among the chains in this cluster is 215 residues, echoing the peak in the histogram of chain lengths (Figure 3A).On average, each chain in this cluster has 40% residues in helices and 10% residues in β-sheets, suggesting that this type of secondary structural composition is the most common in our dataset.Structure cluster 1 contains chains with helix-rich structure, as chains in this cluster have 68% of residues in helices, on average, and almost no βstrands.Cluster 2 has the highest β-sheet content, with over 22% β-sheet residues, followed by cluster 3 with 10%.Chains in cluster 4 are characterized by 32% residues in helices and almost no β-sheet content.This cluster likely contains more loops, representing the minority chains in our dataset, as the number of chains in this cluster is significantly lower than the other three.

Quality Clusters
There are 96, 221, 196, and 779 chains included in the four quality clusters 1, 2, 3, and 4, respectively.Among these clusters, cluster 4 is the largest and contains maps that have the lowest quality, indicating that most of the chains in the aggregated set have poor structure-map fit [25].Lower fit score indicates less cylindrical density at a helix region.Since a helix is expected to have a rough cylindrical shape at a density threshold, the structure-map fit score represents the best cylinder score among all density thresholds.A low score suggests either the density map has low quality at the helix region or the erroneous placement of the atomic structure of the helix.

Test Set
We created a test set consisting of 50 chains by following the procedures depicted in Figure 2A.Specially, all 50 chains came from the 196 unique clusters (containing one chain in each cluster).This means for every chain in the test set, there is no other that has considerable amount (over 40%) of sequence identity in the entire dataset (also among the rest 49 chains in the test set).This test set has 9,804 residues in total, with 43.16% of all residues from helices, and 12.48% from β-sheets.
Since we chose the candidate set with the lowest MSE calculated as in Eq. ( 1) to be the final test set, it is representative of the full dataset in terms of both secondary structure content and map quality.The numbers of test chains in structure clusters 1, 2, 3, and 4 are, respectively, 12, 16, 19 and 3.These are proportional to the sizes of the four clusters in the entire chain population as indicated in Table 2.The numbers of test chains from the four quality clusters 1 through 4 are 7, 7, 6, and 30 respectively.This distribution also reflects that of the entire dataset, as out of the total 50 test chains, 20 (40%) belong to high-medium quality clusters (i.e., clusters 1, 2, and 3), and 30 chains (60%) belong to low quality cluster (i.e., cluster 4).These numbers for the entire dataset are 39.94% and 60.06%, respectively.Note that the test set currently represents the quality distribution of the entire data set that includes significant portion of poor data.However, the same data stratification method can be applied to a subset of the entire data, after extremely poor data are excluded.

Training Sets with Various Pre-defined Specifications
We created six training sets using six distinct specifications (Table 1) to study the effect of various factors on the performance of trained models.For example, since structure cluster 1 contains mostly helices (Table 2) and Specification 1 uses a greater percentage of cluster 1 than does Specification 2, its training set contains more helices than that of Specification 2. This is observed in the number of chains from structure cluster 1 in Training sets 1 and 2 (148 and 58 chains respectively) (Table 1 rows 1 and 2).However, Specification 2 led to a training set with more β-sheet content, as it included a greater percentage of structure cluster 2, the richest of the four clusters in β-strands.
These training sets enable the study of how changes in the ratio between helix and β-sheet content impact model performance.As another example, Specification 5 produced Training set 5, which contained only low-quality density maps, and hence can be used to study the performance of a model that is trained with only low quality data.

Performance of Models trained using Training Sets Generated with Two Specifications
To showcase the utility of the obtained test set and training sets, we trained models using training sets generated according to Specifications 1 and 2 (defined in Table 1).As discussed in section 3.5, specification 1 led to training set with higher helix content, while specification 2 led to training set with higher βsheet content.Although Specification 1 and 2 have different desired distributions over structure clusters, they have the same distribution over quality clusters (Table 1).Both distributions over the quality clusters are same as that in the full data set.Since 779 of 1292 chains in the full data set belong to quality cluster 4 (Section 3.3), the worst of the four in quality, the training sets produced using Specification 1 and 2 have about 60% of the training data with poor quality.On the other hand, the test set was selected to resemble the distributions the same as those in the full data set, and therefore 30 of the 50 chains in the test set belong to quality cluster 4 (Section 3.4).Therefore, both the training sets and the test sets contain about 60% of the data with poor quality, and this may be reflected in the lower F1 scores of the performance.In fact, we observed a higher F1 score for both helix and β-sheet detection for models trained and tested using data without cluster 4 previously [5,27].
To reduce observations due to random chance, two independent training sets (A and B) were generated for each specification (Table 3).With each training set, a model was obtained by training the U-Net deep neural network in DeepSSETracer [5].
The performance of all models was evaluated using the test set described in section 3.4 by calculating the F1 score.The average performance across all 50 test cases is consistent between the two replicates (A and B) for each specification (Table 3).Typically, machine learning models benefit from more training examples.So, as expected, models trained with data from Specification 1 performed better for helix than those trained with data from Specification 2, while the opposite is true for β-strands.For example, higher averaged F1 scores of 49.4% (trained using set A) and 49.7% (trained using set B) for helix detection were observed, when Specification 1 was used to generate the two training sets (Table 3).These F1 scores are higher than the corresponding F1 scores of 47.8% and 48.3%, when Specification 2 was used to generate its training sets A and B. In terms of βsheet detection, the F1 scores (35.1% and 37.0%) for models trained using Specification 2 (more β-content) are higher than those (33.7% and 32.3%) for models trained using Specification 1.
We noticed that changes in distribution of structure clusters in the training sets has a greater impact on β-sheet detection than helix detection, since the difference in the average β-sheet F1 scores between the two specifications are much higher than that of helix.This can be explained as our dataset contains more helix residues than β-sheet (Figure 3B).This imbalanced problem is also common in other datasets where coil is the majority component, and β-sheet is the minority class.Our method can be used to adjust the distribution over secondary structure content in a training set to target the minority group in the dataset.The ideas of combining clustering algorithm and under-sampling have been implemented in many studies and in various fields.Some of them include density-based majority under-sampling technique (DBMUTE) [28] and cluster-based hybrid sampling for imbalanced-data (CBHSID) [29].In addition, our method can be used to design multiple custom training sets with different distribution of structure and quality clusters to study the effect of these features on model performance.

CONCLUSION
Segmentation of secondary structures from cryo-EM density maps with medium resolution is still a challenging problem due to quality of the maps at such a resolution range.Although deep learning methods have shown potential to enhance the segmentation, different methods were tested using different data.First of all, there are limited experimental data sets available to test the approaches.Currently there are only two test data sets using cryo-EM maps.One is the set of 43 medium-resolution cryo-EM maps that was used in a 4-fold cross-validation test [3].This suggests that each of the four tests only uses about 11 maps in the test.The other is the set of 28 unique chain regions of cryo-EM map components [5].Moreover, there has been no previous study supporting a method to establish a proper test set for the segmentation problem.Various studies are needed to understand potential factors that influence the segmentation.We developed a method to stratify data with three potential factors and created the sequence identity clusters, structure clusters for secondary structure characteristics, and quality clusters.We proposed an approach, in this paper, to compose data sets with desired specifications related to the three potential factors.Even though this approach has not been applied to study the actual effect of the potential factors, the methodology shown here can be used to attack the challenging problem of benchmark data establishment.
Although our current study only focused on three potential factors, the stratification method could potentially be generalized to other factors that are considered important.
The proposed approach was implemented to create a test set and various training sets to study the effect of the data on the performance of models trained using these data sets with distinct properties.The two training sets with higher content of helix perform better detection of helix overall, as expected, and the training set with lower content of β-sheet performs worse detecting β-sheet.The expected results show the potential of the stratification method for producing more balanced training sets to enhance overall performance of deep learning methods.
The work presented here is the first investigation for the problem of establishing benchmark data for protein secondary structure segmentation from cryo-EM maps at the medium resolution.The focus of this work is the development of a framework to attack the problem.Many more studies are needed, such as constructing larger data sets, understanding variables currently existing in the framework, making the specifications more flexible, and optimizing the selection of candidate data sets.

Figure 1 :
Figure 1: The segmentation of secondary structure problem and clusters of data in three metrics.(A): The atomic structure (ribbon) and its corresponding cryo-EM density map region (gray) for EMD-3850 PDB-5oqm chain 4; (B): Segmented helix (blue) and β sheet (pink) regions detected using DeepSSETracer [5].The density map and the atomic structure are visualized in ChimeraX [22].(C): Clusters created based on three metricssequence similarity, secondary structure content, and quality of structural fit in cryo-EM maps.

Figure 2 :
Figure 2: Constructing testing set (A) and training sets with varying desired content (B).

Figure 3 :
Figure 3: Distribution of chains in the aggregated data over various properties.(A) length, (B) percentage of helix and βstrand residues, (C) number of helices and β-strands, (D) number of chains in each quality cluster.

Figure 4 :
Figure 4: Distribution of cluster size (i.e., number of chains) among the 414 sequence clusters.

Table 2 : Structural characteristics of chains in the four obtained structure clusters
. Avg.: average; res.: residues; #: number.

Table 3 : Performance of four DeepSSETracer models in detection of helices and β-sheets from cryo-EM density map components.
The performance was measured by F1 score (%).The four models were trained using four training sets generated using Specifications 1 and 2 in Table1.Two training sets (A and B) were generated for each specification.N/A: no β-sheet in the chain; Chain ID: <EMDB/PDB/Chain>; H: F1 score of helix detection, β: F1 score of β-sheet detection.