Abstract
Abstract
The work aims to study the application of Deoxyribonucleic Acid (DNA) multi-source data storage in Digital Twins (DT). Based on an investigation of the research status of DT and DNA computing, the work proposes the concept of DNA multi-source data storage for DT. The Raptor code is improved from the perspective of degree distribution function design, and six degree distribution schemes are proposed in the course of describing the research method. Additionally, a quaternary dynamic Huffman coding method is applied to DNA data storage, combined with an improved concatenated code as the error correction code. Considering the cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) content and the distribution of homopolymers in DNA storage, the work proposes and verifies an improved concatenated code algorithm, DNA-Improved Concatenated Code (DNA-ICC). The results show that as the Signal-to-Noise Ratio (SNR) increases, the Bit Error Rate (BER) of all schemes decreases gradually with a similar trend, but the degree distribution functions optimized by the probability transfer method have better anti-interference ability. The BER of the DNA-ICC scheme decreases with the error probability and outperforms other error correction codes. Compared with the original concatenated code, it saves at least 1.65 s and controls homopolymers well: the probability of a homopolymer exceeding 4 nt is only 0.44%. The proposed quaternary dynamic Huffman code and concatenated error correction code show excellent performance.
1 INTRODUCTION
In recent years, with the rapid development of Cloud Computing, Big Data, Internet of Things (IoT), Artificial Intelligence (AI), social networking, and other information technologies, together with the digital transformation of traditional industries, human beings have generated more and more information. This places demands not only on storage capacity but also on the scalability of storage in time and space, fault tolerance, and error correction during storage. With the arrival of the Big Data era, the sharp increase in data volume means that traditional data storage methods can no longer meet demand, and the search for new storage media has received more and more attention [1]. Deoxyribonucleic Acid (DNA) data storage is considered a medium suitable for long-term data storage due to its ultra-high density and sufficiently stable storage performance [2–4]. DNA data storage can reach a storage density of \(10^{18}\) byte/mm³, almost \(10^{6}\) times higher than the current highest-density storage media. It has very good resistance to external high temperature and vibration and can be preserved for a hundred years [5]. The popularity of Digital Twins (DT) has been rising in recent years, attracting much attention from industry. A DT can make full use of physical models, sensor updates, operation history, and other data; integrate multi-disciplinary, multi-physical-quantity, multi-scale, and multi-probability simulation processes; and complete the mapping in virtual space, thus reflecting the whole life cycle of the corresponding physical equipment [6–8]. One of the key features of DT is multi-source heterogeneous data fusion. Visual decision-making systems likewise emphasize the integration and comprehensive application of multi-source heterogeneous data [9].
Tremendous basic data will be generated during the actual operation of various industries, including various map element data, monitoring video data, real-time message data, urban tilt photography data, sensor data, business system data, and various database data. The visual decision-making system based on DT data fusion and DNA data storage can fully integrate the massive data between different departments, industries, systems, and data formats, and provide comprehensive data support for the perception and judgment of operation situation in various fields [10–12].
At present, the hard disks and other storage systems widely used have inherent shortcomings. For example, the storage life of hard disks and flash memory is only a few decades at most, and the storage devices are non-degradable and pollute the environment. It is urgent to develop a new generation of alternative storage technology. DNA computing is a non-traditional computing technology based on DNA and enzymes, which depends on the principles of biochemistry and molecular biology [13]. DNA strands can be used to encode and store information. Researchers have designed a variety of mapping strategies, such as differential mapping and constraint mapping, to meet the biochemical constraints of DNA computing. However, most mapping strategies have limited mapping potential at each nucleotide storage site [14–16]. In addition, error correcting codes, such as single parity check codes and repetition codes, are used in DNA data storage. Wang et al. (2021) [17] studied and discussed the information needs of China's Online Health Community for Corona Virus Disease 2019. These error correction methods often suffer from high complexity and decoding failure. Therefore, a more stable DNA data storage method is needed to solve the existing problems [18]. Any new technology requires a shift from theory to practical application, as is the case with DT. The concept of DT originates from industrial manufacturing. Driven by 5G communication, IoT, Cloud Computing, Big Data, AI, and other new-generation information technologies, the concept of DT is gradually being extended to more industries [19–21].
With the gradual deepening of people's understanding of DNA and the rapid development and maturation of DNA storage technologies, researchers have come to regard DNA as a new information storage medium. Qian et al. [22] pointed out that only a few studies had addressed the relationships among data, and most had been conducted outside the operating environment. Data-driven visual decision-making can be quickly pushed into DT industry applications to help managers improve the capability and efficiency of intelligent decision-making. DNA storage not only has large capacity, high density, and long storage life, but also offers higher security, energy conservation, and environmental protection. As a cross-fusion of information technology and biotechnology, DNA information storage technology plays an important role in saving storage energy and promoting the development of massive data storage. However, research on DNA information storage is still in its infancy in China, and more energy, human, and material resources need to be invested. From the perspective of long-term investment, many manufacturers believe that this technology has high value and is likely to become a breakthrough in the search for new storage media in the future.
Through literature research and algorithm verification, this work studies the problem of DNA data storage for DT. The innovation lies in the improvement of the Raptor code, for which six degree distribution schemes are proposed. Based on the original Huffman code, a quaternary dynamic Huffman code is proposed for the encoding and decoding of DNA data storage. Considering the cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) content and the distribution of homopolymers in DNA storage, an improved concatenated code (ICC) is proposed as the error correction code, and its performance is verified to be excellent.
2 RESEARCH STATUS OF DT AND DNA COMPUTING
The DNA computer used in the DNA computing process has many characteristics, such as small size, large storage capacity, fast operation, low energy consumption, and parallelism, and has advantages in data storage [23–25]. A visual data system of higher quality and stricter requirements can be realized via multi-source heterogeneous data (MHD) fusion technology in DT. In the development of DNA computing, scholars faced complex problems, such as detecting the operation results of Logic Gates and judging whether the Logic Gates were successfully constructed. Later, scholars solved these problems by labeling various Molecular Radicals on the Thymine (T) Base. With the development of DNA computing, Logic Gates have gradually evolved into Logic Circuits, and many experts and scholars have contributed answers in this area.
Most research on DT only focuses on existing explicit frameworks and architectures, which face the challenge of supporting different levels of integration through agile processes [26]. Aheleroff et al. [27] conducted a study to determine the appropriate Industry 4.0 technology and the overall reference architecture model to achieve the most challenging DT application. With the intensification of market competition, the process of product development is accelerated, which requires rapid product innovation and efficient collaboration between design and manufacturing. However, there are still islands of information that hinder the integration of product lifecycle processes. Bionics and DT have been combined as potential solutions to address this problem. Some scholars proposed the concept, framework, and characteristics of DT bionics and expounded on the co-evolution mechanism of product twin, including virtual and physical product and production twin. Li et al. [28] put forward a symbiotic and co-evolutionary mechanism to integrate product development and manufacturing. They concluded that integrating bionics and DT could accelerate the innovation and development of new products and help realize the effective management of production construction.
In addition to data security, long search time for data owners, long data access time, and high system leakage rate may occur when they access data from the cloud environment. Given all the above problems, Namasudra et al. [29] introduced a fast and secure data access control model based on DNA for cloud environment. In the model, cloud service providers needed to maintain a fast and efficient data access table. The authors used a long cipher or key based on 1,024 bits of DNA to encrypt a user's confidential or personal data. They finally verified the effectiveness of the proposed model compared with existing models through experimental results and theoretical analysis. Adithya and Santhi [30] provided a color code encryption strategy of DNA computing to protect data from eavesdroppers. DT is often defined as real-world products, systems, existence, communities, and even cities, using virtual copies of data from their physical counterparts and their environments that are constantly updated. It connects virtual cyberspace with physical entities, so it is considered to be the pillar of industry 4.0 and the innovation in the future. It can be said that DT is created and used in the whole life cycle of the entity it copies, from cradle to grave. Jiang et al. [31] focused on the current situation of DT and its application in industry under the background of intelligent manufacturing, especially from the perspective of plant wide optimization. In this context, the main functions of DT are discussed, such as mirroring, ghosting, and threading.
To sum up, DT and DNA computing have achieved certain research results, and the relevant research conclusions have their own advantages and disadvantages. However, the research on DNA data storage mostly focuses on the data storage security, and pays less attention to the efficiency of data storage itself. This work optimizes and improves different encoding and decoding schemes, which can provide new ideas and research directions for the future research of DNA multi-source data storage.
3 DNA MULTI-SOURCE DATA STORAGE IN DT WITH HUFFMAN ENCODING AND CASCADE ECCS
3.1 DNA Multi-source Data Storage for DT
The DNA data storage units are the four deoxynucleotides, namely Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), also known as Bases. These four Nucleotides can be arranged in different orders to form Oligonucleotides, i.e., DNA Strands. DNA data storage consists of two basic processes: data writing and data reading [32–34]. The data writing part includes encoding, mapping, and synthesis. The data to be stored is input in Binary and encoded by Source Compression and Channel Error Correction. Then, mapping rules transform the Binary Sequence into a Sequence consisting of the four Nucleotide Bases. Finally, DNA chains are synthesized by biochemical methods and stored independently in special containers. Current DNA synthesis techniques can synthesize DNA strands of 200 to 1,000 bases in length. When reading the data, Polymerase Chain Reaction (PCR) [35–37] amplification technology is used to copy the Oligonucleotides in the storage pool. A typical sequencing method relies on the characteristic that Fluorescent Nucleotides emit different colors; the DNA sequence represented by the Oligonucleotides can be read out by detecting these colors. Figure 1 illustrates the specific DNA data storage process.
Fig. 1. Implementation process of DNA information storage.
In Figure 1, the biological read-write technologies involved in DNA data storage include DNA synthesis, PCR amplification, and DNA sequencing. Owing to biological activity and cellular exclusion, DNA synthesis is usually performed artificially in vitro. The synthesis process is divided into three stages. First, the Base Sequence, which is long after the information is encoded, is divided into several short strands. Then, address Bits are added to the segmented short chains for quick search, location, and splicing when the file is subsequently read. Finally, Macromolecular Primers with Nucleotide Sequences are added at both ends of each DNA chain before preservation, to facilitate the splicing of short DNA chains and realize the data access function.
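The mapping step of the writing pipeline can be made concrete with a minimal sketch. It assumes the common fixed 2-bits-per-base convention (00→A, 01→C, 10→G, 11→T), which is one illustrative choice and not necessarily the mapping used in this work:

```python
# Illustrative fixed 2-bit-per-base mapping (a common convention,
# not necessarily the scheme adopted in this paper).
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {b: p for p, b in BIT_PAIR_TO_BASE.items()}

def encode_bits(bits: str) -> str:
    """Map a binary string (even length) to a DNA base sequence."""
    assert len(bits) % 2 == 0
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode_bases(seq: str) -> str:
    """Inverse mapping: DNA base sequence back to the binary string."""
    return "".join(BASE_TO_BIT_PAIR[b] for b in seq)
```

Real storage pipelines add source compression, error correction, and addressing around this mapping, as the writing process above describes.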
3.2 Research on the DNA Information Storage Method based on Raptor Code
A key feature of Fountain Codes is that the information symbols can be recovered with high probability from a received symbol set only slightly larger than the original, i.e., with very small decoding overhead. A Fountain Code can be decoded successfully after receiving a certain number of coded symbols [38]. However, an LT (Luby Transform) Code can be decoded 100% successfully only after all original symbols are recovered during decoding, and its decoding complexity is not linear. A precoding stage is added in the Raptor Code to improve coding performance at the same synthesis cost, effectively resolving the contradiction between encoding/decoding complexity and transmission efficiency. The precoding part primarily uses Low Density Parity Check (LDPC) codes; a Raptor Code is the concatenation of an LDPC code and an LT Code. Assuming that there is a linear block code with a code length of N, in which the number of information symbols is K and the number of check codes is M, then the matrix generated by this block code is defined as \(\mathop G\nolimits_{k \times N}\). In other words, the information symbol matrix \(\mathop {\rm{U}}\nolimits_{k \times 1}\) can be mapped to the block code space through the generation matrix, and the block code matrix \({\rm{C}}\) and check bits M meet Equations (1) and (2): (1) \(\begin{equation} M = N - k \end{equation}\) (2) \(\begin{equation} {\rm{C}} = {\rm{U}} \times G \end{equation}\)
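Equations (1) and (2) can be illustrated with a toy systematic block code; the (7,4) Hamming generator matrix below is a standard textbook example chosen purely for illustration, not a matrix from this work:

```python
# Toy illustration of Equations (1)-(2) over GF(2): the (7,4) Hamming code,
# with k = 4 information symbols, N = 7, and M = N - k = 3 check bits.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(U, G):
    """C = U x G over GF(2), i.e., Equation (2) with mod-2 arithmetic."""
    N = len(G[0])
    return [sum(U[i] * G[i][j] for i in range(len(U))) % 2 for j in range(N)]

U = [1, 0, 1, 1]      # information symbols
C = encode(U, G)      # codeword: 4 info bits followed by M = 3 check bits
```

Because G is systematic, the first k positions of C reproduce U and the remaining M = N − k positions are the check bits of Equation (1).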
For irregular LDPC, the degree value of each node is determined by the Degree Distribution function. Equation (3) describes the Degree Distribution function. (3) \(\begin{equation} \lambda (x) = \sum\limits_{j = 2}^{\mathop d\nolimits_v } {\mathop \lambda \nolimits_j } \mathop x\nolimits^{j - 1} \end{equation}\)
In Equation (3), \(\mathop \lambda \nolimits_j\) represents the ratio of the number of edges sj owned by the variable node with degree value j to the total number of edges z of the bidirectional graph, which can be expressed as Equation (4). (4) \(\begin{equation} \mathop \lambda \nolimits_j = \frac{{\mathop s\nolimits_j }}{z} \end{equation}\)
Besides, dv refers to the maximum degree value of the variable node, satisfying: (5) \(\begin{equation} \sum\limits_{i = 2}^{\mathop d\nolimits_v } {\mathop \lambda \nolimits_i } = 1 \end{equation}\)
Then, the Degree Distribution function of the verification node is expressed as Equation (6). (6) \(\begin{equation} \rho (x) = \sum\limits_{i = 2}^{\mathop d\nolimits_c } {\mathop \rho \nolimits_i } \mathop x\nolimits^{i - 1} \end{equation}\)
In Equation (6), \(\mathop \rho \nolimits_i\) denotes the ratio of the number of edges bi owned by the check node whose degree value is i to the total number of edges z in the bidirectional graph, as shown in Equation (7). (7) \(\begin{equation} \mathop \rho \nolimits_i = \frac{{\mathop b\nolimits_i }}{z} \end{equation}\) In Equation (6), dc signifies the maximum degree value of the verification node, and it satisfies Equation (8). (8) \(\begin{equation} \sum\limits_{i = 2}^{\mathop d\nolimits_c } {\mathop \rho \nolimits_i } = 1 \end{equation}\)
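The edge-perspective fractions \(\lambda_j\) and \(\rho_i\) of Equations (4) and (7), together with the normalization constraints of Equations (5) and (8), can be computed for a toy Tanner graph. The node degrees below are invented for illustration:

```python
from collections import Counter
from fractions import Fraction

# Toy Tanner graph: degree of each variable node and each check node.
# The total edge count z must agree from both sides of the bipartite graph.
var_degrees = [2, 2, 3, 3, 3, 2]   # six variable nodes (illustrative values)
chk_degrees = [5, 5, 5]            # three check nodes
z = sum(var_degrees)
assert z == sum(chk_degrees)       # both sides count the same edges

def edge_fractions(degrees, z):
    """lambda_j (or rho_i): fraction of edges attached to nodes of degree j."""
    cnt = Counter(degrees)
    return {j: Fraction(j * n, z) for j, n in cnt.items()}

lam = edge_fractions(var_degrees, z)   # Equation (4)
rho = edge_fractions(chk_degrees, z)   # Equation (7)
```

Both distributions sum to 1, matching Equations (5) and (8).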
The core of LDPC is to determine the check matrix H to obtain the generation matrix G and then encode the information symbols. The length of the symbol sequence determines the dimension of the check matrix H. Assuming that the length of the sequence to be encoded is n, then there is an m × n dimensional check matrix. The Gaussian elimination method is used to transform this check matrix into the standard form, that is:(9) \(\begin{equation} \mathop H\nolimits^{\rm{\cdot}} = [p|\mathop I\nolimits_m ] \end{equation}\)where p stands for a \(m \times (n - m)\) dimensional check matrix, and Im denotes a m-dimensional identity matrix. After standardization by the Gaussian elimination method, the check bits can be computed as Equation (10). (10) \(\begin{equation} \mathop p\nolimits_i = \sum\limits_{j = 1}^{n - m} {\mathop H\nolimits_{i,j}^{\prime} } \mathop s\nolimits_j + \sum\limits_{j = 1}^{i - 1} {\mathop H\nolimits_{i,j + n - m}^{\prime} } \mathop p\nolimits_j \end{equation}\)
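The standardization of Equation (9) amounts to Gaussian elimination over GF(2). This sketch assumes the last m columns of H are invertible, so no column permutation is needed (a simplifying assumption, not part of the paper's algorithm):

```python
def to_standard_form(H):
    """GF(2) Gaussian elimination bringing an m x n parity-check matrix H
    into the standard form [P | I_m] of Equation (9). Assumes the last m
    columns of H are invertible, so no column swaps are required."""
    H = [row[:] for row in H]          # work on a copy
    m, n = len(H), len(H[0])
    for i in range(m):
        col = n - m + i
        # find a pivot row with a 1 in this column and swap it into place
        pivot = next(r for r in range(i, m) if H[r][col] == 1)
        H[i], H[pivot] = H[pivot], H[i]
        # clear all other 1s in this column by XORing with the pivot row
        for r in range(m):
            if r != i and H[r][col] == 1:
                H[r] = [a ^ b for a, b in zip(H[r], H[i])]
    return H
```

The left m × (n − m) block of the result is the P part of Equation (9), which the check-bit recursion of Equation (10) then uses.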
The Belief Propagation (BP) decoding algorithm is adopted to decode LDPC. Firstly, variable nodes are initialized, and symbols are assigned according to the acceptance conditions in Equation (11). (11) \(\begin{equation} \mathop p\nolimits_i = \left\{{\begin{array}{c@{\quad}c} { + \infty }&{\mathop y\nolimits_i = 0}\\ 0&{\mathop y\nolimits_i = E}\\ { - \infty }&{\mathop y\nolimits_i = 1} \end{array}} \right. \end{equation}\)
In Equation (11), E represents the variable node to be deleted. The Exclusive OR operation is performed to delete the associated edges between all the remaining n variable nodes that have not been deleted and the check nodes connected to these nodes, as presented in Equation (12). (12) \(\begin{equation} \mathop \phi \nolimits_{mn} = 2\mathop {\tanh }\nolimits^{ - 1} \left( {\prod\limits_{\mathop n\nolimits^{\prime} \in N\left( m \right)\left| n \right.} {\tanh \left( {\frac{{\mathop \varphi \nolimits_{\mathop {mn}\nolimits^{\prime} } }}{2}} \right)} } \right) \end{equation}\)
In Equation (12), N(m)|n represents the set of variable nodes connected to check node m, excluding variable node n. Assuming that a particular check node has only one associated edge remaining, the variable node connected to that check node can be restored. After that, the associated edges connected to the restored variable node can be deleted, which can be written as: (13) \(\begin{equation} \mathop \phi \nolimits_{mn} = \mathop \phi \nolimits_{n0} + \sum\limits_{\mathop m\nolimits^{\prime} \in M\left( n \right)\left| m \right.} {\mathop \phi \nolimits_{\mathop m\nolimits^{\prime} n} } \end{equation}\) (14) \(\begin{equation} \mathop \phi \nolimits_n = \mathop \phi \nolimits_{n0} + \sum\limits_{m \in M\left( n \right)} {\mathop \phi \nolimits_{mn} } \end{equation}\) where M(n) denotes the set of check nodes connected to variable node n.
Finally, if no check node with a single associated edge can be found, the decoding is terminated. The whole decoding process is thus a process of eliminating associated edges. Table 1 presents the specific algorithm.
Table 1. BP Decoding Algorithm
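A peeling-style erasure decoder in the spirit of the edge-elimination process described above (repeatedly take a check node with a single unknown neighbor, recover that symbol, delete its edges) can be sketched as follows. The data structures are illustrative, not the paper's implementation:

```python
def peel_decode(checks, received):
    """Erasure peeling sketch. `checks` lists, for each check symbol, the
    indices of the source symbols XORed into it; `received` holds the check
    values. Unknown source symbols start as None and are recovered one by
    one whenever a check has exactly one unknown neighbor."""
    n = 1 + max(i for c in checks for i in c)
    source = [None] * n
    progress = True
    while progress and any(s is None for s in source):
        progress = False
        for idx, neigh in enumerate(checks):
            unknown = [i for i in neigh if source[i] is None]
            if len(unknown) == 1:          # effectively a degree-1 check
                i = unknown[0]
                v = received[idx]
                for j in neigh:
                    if j != i:
                        v ^= source[j]     # peel off the already-known edges
                source[i] = v
                progress = True
    return source
```

If the loop stalls with symbols still unknown (no single-unknown check left), decoding has failed, mirroring the termination condition above.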
Raptor Code is improved by designing the Degree Distribution function. The Degree Distribution function with good performance needs to guarantee the coverage of coded symbols and minimize the average degree value as much as possible to reduce the complexity of encoding and decoding. Therefore, the Distribution function satisfying these two points can be expressed as Equation (15). (15) \(\begin{equation} \left\{ {\begin{array}{@{}*{1}{l}@{}} {1 - d - \mathop e\nolimits^{ - \mathop \mu \nolimits^{\prime} \left( d \right)\left( {1 + \varepsilon } \right)} \ge \gamma \sqrt {\frac{{1 - d}}{k}} }\\ {d \in \left[ {0,1 - \delta } \right]} \end{array}} \right. \end{equation}\)
In Equation (15), \(\varepsilon\) represents the coding and decoding redundancy of Fountain Codes, \(\gamma\) refers to a positive real number, and \(1 - \delta\) indicates the decoding success rate. Besides, k signifies the number of information symbols, and \(\mathop \mu \nolimits^{\prime} ( d )\) is the derivative of \(\mu ( d )\). Then, there is: (16) \(\begin{equation} \left\{ {\begin{array}{@{}*{1}{l}@{}} {A \ge \gamma \sqrt {\left( {1 - d} \right) \cdot k} }\\ {d \in \left[ {0,1 - \delta } \right]} \end{array}} \right. \end{equation}\)where A represents the decoded set received by the receiver. In DNA data storage, long strands are more error-prone, more costly, and more technically complex to synthesize than short strands. In addition, in Raptor Code coding, a long symbol code increases the complexity of the coding process and reduces overall timeliness. Therefore, a short code length is often used to obtain a DNA-Raptor data storage architecture with better performance. Common Degree Distribution functions satisfying the above requirements can be expressed as: (17) \(\begin{equation} \mu (d) = 0.00098d\, +\, 0.459\mathop d\nolimits^2\, +\; 0.211\mathop d\nolimits^3\, +\; 0.113\mathop d\nolimits^4\, +\; 0.1113\mathop d\nolimits^{10}\, +\; 0.0799\mathop d\nolimits^{11}\, +\; 0.0156\mathop d\nolimits^{40} \end{equation}\) (18) \(\begin{align} \mu (d) &= 0.007969d\, + \,0.4935\mathop d\nolimits^2\, +\; 0.166\mathop d\nolimits^3\, +\; 0.073\mathop d\nolimits^4\, + \;0.082\mathop d\nolimits^5\, +\; 0.056\mathop d\nolimits^8\, +\; 0.037\mathop d\nolimits^9\nonumber\\ &\quad + 0.055\mathop d\nolimits^{19}\, + \; 0.025\mathop d\nolimits^{65}\, +\; 0.0031\mathop d\nolimits^{66} \end{align}\)
Since the success rate of LT decoding is positively correlated with redundancy, A slowly increases after reaching the peak value until it is close to full decoding. Therefore, the improved Degree Distribution function can be written as Equation (19). (19) \(\begin{equation} \mathop \tau \nolimits^{\rm{\cdot}} (d) = \left\{{\begin{array}{l@{\quad}l} {\frac{s}{{kd}}}&{d = 1,2,\ldots,\frac{k}{s}}\\ 0 & {d > \frac{k}{s}} \end{array}}\right. \end{equation}\)
Then, the Robust Solitary Wave Distribution (RSWD) \(\mathop \mu \nolimits^{\rm{\cdot}} (d)\) is adopted as Scheme 1, as shown in Equation (20). (20) \(\begin{equation} \mathop{\mathop \mu \nolimits_1}\nolimits^{\rm{\cdot}} (d) = \frac{{\rho (d) + \mathop \tau \nolimits^{\rm{\cdot}} (d)}}{{\sum\nolimits_{d = 1}^k {\rho (d) + \mathop \tau \nolimits^{\rm{\cdot}} (d)} }} \end{equation}\)
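Scheme 1 can be sketched as the standard robust soliton construction behind Equations (19) and (20). Note that the usual construction also places a spike term at d = k/s in addition to the s/(kd) part shown in Equation (19); the parameters c and delta below are illustrative defaults, not values from this work:

```python
from math import log, sqrt

def robust_soliton(k, c=0.05, delta=0.5):
    """Robust soliton distribution in the spirit of Equation (20): the ideal
    soliton component rho plus the tau term of Equation (19) (including the
    conventional spike at d = k/s), then normalised. Parameters c and delta
    are illustrative defaults, not values taken from the paper."""
    s = c * log(k / delta) * sqrt(k)
    rho = [0.0] * (k + 1)
    rho[1] = 1.0 / k
    for d in range(2, k + 1):
        rho[d] = 1.0 / (d * (d - 1))
    tau = [0.0] * (k + 1)
    pivot = int(round(k / s))
    for d in range(1, min(pivot, k + 1)):
        tau[d] = s / (k * d)                 # the s/(kd) part of Equation (19)
    if 1 <= pivot <= k:
        tau[pivot] = s * log(s / delta) / k  # conventional spike at d = k/s
    z = sum(rho) + sum(tau)
    return [(rho[d] + tau[d]) / z for d in range(k + 1)]
```

The final division by z is the normalisation in the denominator of Equation (20).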
In soliton distribution theory, there are two probability distributions: the ideal soliton distribution and the robust soliton distribution. Scheme 2 is obtained by setting the degree values with low probability in the RSWD to 0, as presented in Equation (21). (21) \(\begin{equation} \mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d) = \left\{{\begin{array}{l@{\quad}l} {\mathop \mu \nolimits_2 (d) = 0}& {\mu (d) < \frac{1}{k}} \\ {\mathop \mu \nolimits_2 (d) = \mu (d)}&{\mu (d) \ge \frac{1}{k}} \end{array}} \right. \end{equation}\)
A new Degree Distribution function is obtained after normalization: (22) \(\begin{equation} \mathop \mu \nolimits_2 (d) = \frac{{\mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d)}}{{\sum\nolimits_{d = 1}^k {\mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d)} }} \end{equation}\)
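Scheme 2 (Equations (21) and (22)) reduces to a threshold-and-renormalise step, sketched here with a toy degree distribution (the probabilities are invented for illustration):

```python
def threshold_degrees(mu, k):
    """Scheme 2 sketch (Equations (21)-(22)): zero out degree probabilities
    below 1/k, then renormalise the survivors. `mu` maps degree -> probability."""
    kept = {d: p for d, p in mu.items() if p >= 1.0 / k}   # Equation (21)
    z = sum(kept.values())
    return {d: p / z for d, p in kept.items()}             # Equation (22)

# Toy distribution: degree 3 falls below 1/k = 0.02 and is removed.
mu2 = threshold_degrees({1: 0.50, 2: 0.49, 3: 0.01}, k=50)
```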
Assuming that the number of information symbols for short code lengths is k = 16∼1024, then: (23) \(\begin{equation} \mathop \mu \nolimits^{\rm{\cdot}} (d) \ge \frac{{ - \ln ( {1 - d - \gamma \sqrt {\frac{{1 - d}}{k}} } )}}{{1 + \varepsilon }} \end{equation}\)
Due to the structural characteristics of DNA data storage and the performance characteristics of Raptor Codes, the symbol code length is set as \(k = 256\). Besides, degree values \(\mathop d\nolimits_{33}\) and \(\mathop d\nolimits_{44}\) are added, the probability of occurrence of \(\mathop d\nolimits_{65}\) is transferred to \(\mathop d\nolimits_1\), and then the probability of occurrence of \(\mathop d\nolimits_{66}\) is transferred to \(\mathop d\nolimits_{33}\), which is taken as Scheme 3. The probability of the occurrence of \(\mathop d\nolimits_{65}\)is shifted to \(\mathop d\nolimits_1\), and the probability of occurrence of \(\mathop d\nolimits_{66}\) is shifted to \(\mathop d\nolimits_{44}\), which is regarded as Scheme 4. The occurrence probability of \(\mathop d\nolimits_{65}\) is transferred to \(\mathop d\nolimits_1\), which is taken as Scheme 5. The occurrence probability of \(\mathop d\nolimits_{45}\) is shifted to \(\mathop d\nolimits_2\), and the occurrence probability of \(\mathop d\nolimits_{66}\) is shifted to \(\mathop d\nolimits_{33}\), which is regarded as Scheme 6. 
The specific expressions are as follows: (24) \(\begin{equation} \mathop \mu \nolimits_3 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{33} \end{equation}\) (25) \(\begin{equation} \mathop \mu \nolimits_4 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{44} \end{equation}\) (26) \(\begin{equation} \mathop \mu \nolimits_5 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{66} \end{equation}\) (27) \(\begin{equation} \mathop \mu \nolimits_6 \left( d \right) = 0.029d \,+\, 0.503\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{33} \end{equation}\)
3.3 DNA Information Storage based on Adaptive Huffman Coding and Concatenated ECCs
Huffman Coding is an Entropy Coding algorithm used for lossless data compression. Code words of different lengths are allocated according to the probability of occurrence of the coded characters: the higher the probability, the shorter the code word; the lower the probability, the longer the code word. In this way, the storage density can be improved on average. However, static Huffman coding has disadvantages: poor timeliness, non-generic fields, and low encoding efficiency due to the additional space required to store the Huffman Tree. This paper applies Adaptive Huffman Coding (AHC) to DNA data storage to solve these problems. AHC dynamically adjusts the Huffman Tree every time the encoder reads a symbol to be encoded, updating the corresponding weights and the Huffman Tree with each character read. This ensures that the current output symbol depends only on the currently encoded character and the characters already read, and is independent of characters not yet read. The decoding process is similar to the encoding process. In view of the advantages of quaternary coding, this paper proposes a Quaternary Adaptive Huffman Coding (QAHC) algorithm for DNA data storage, namely DNA-QAHC.
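A full adaptive coder is lengthy, but the quaternary codeword idea can be illustrated with a *static* 4-ary Huffman sketch that emits code words directly over the A/C/G/T alphabet. The static variant and the invented frequencies below are a simplification for illustration, not the adaptive algorithm proposed here:

```python
import heapq
from itertools import count

BASES = "ACGT"

def quaternary_huffman(freqs):
    """Static 4-ary Huffman sketch: returns a prefix-free symbol -> codeword
    map over the bases A/C/G/T. Zero-frequency dummy nodes are added so that
    every merge step combines exactly four subtrees."""
    tick = count()  # tie-breaker so heap never compares the dict payloads
    heap = [(f, next(tick), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # with r = 4 symbols per merge, we need (n - 1) divisible by r - 1 = 3
    while (len(heap) - 1) % 3 != 0:
        heapq.heappush(heap, (0, next(tick), {}))
    while len(heap) > 1:
        merged, total = {}, 0
        for b in BASES:                       # lowest-weight subtree gets 'A', etc.
            f, _, codes = heapq.heappop(heap)
            total += f
            for sym, c in codes.items():
                merged[sym] = b + c           # prepend this level's base
        heapq.heappush(heap, (total, next(tick), merged))
    return heap[0][2]
```

As with binary Huffman coding, frequent symbols receive shorter base strings, which is what raises the average information density per nucleotide.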
DNA multi-source data storage for DT can also be regarded as a process of sending and receiving information, in which different degrees of noise interference will cause errors in the data transmission process. Although such errors are rare, they have profound implications for data recovery. Therefore, error correction is a must. The most common errors in DNA information storage usually occur in the process of data deletion, data insertion, and replacement, collectively known as synchronization errors. Here, Concatenated Codes, Watermark Codes, and non-binary LDPC are combined for error correction. Figure 2 reveals the specific error correction process.
Fig. 2. The specific process of error correction by Concatenated Codes.
As Figure 2 suggests, the decoding process can be defined as Hidden Markov Model (HMM) by associating sparse sequences with LDPC codes.
Suppose that a insertions and b deletions occur between the first transmitted bit t0 and time tj. The range of the drift value xj at point j is: (28) \(\begin{equation} X = \{ - \mathop x\nolimits_{\max } ,\ldots, -1,0,1,\ldots,\mathop x\nolimits_{\max } \} \end{equation}\)where xmax represents the maximum drift value. The transition probability Pa,b is defined as Equation (29) to reduce the complexity of the decoding process. (29) \(\begin{equation} \mathop P\nolimits_{a,b} = P\left(\mathop x\nolimits_{j + 1} = b| {\mathop x\nolimits_j } = a\right) \end{equation}\)
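Equation (29) can be illustrated with a toy drift model in which at most one insertion or deletion occurs per transmitted symbol. This is a simplification (real watermark-code lattices allow multi-event transitions), and the probabilities below are invented for illustration:

```python
def drift_transition(a, b, p_ins=0.01, p_del=0.01):
    """Toy transition probability P_{a,b} of Equation (29): the drift moves
    from state a to state b over one symbol via a single insertion (+1), a
    single deletion (-1), or no synchronisation error (drift unchanged).
    Multi-event transitions are ignored in this simplified sketch."""
    if b == a + 1:
        return p_ins
    if b == a - 1:
        return p_del
    if b == a:
        return 1.0 - p_ins - p_del
    return 0.0
```

Restricting transitions to neighbouring drift states in this way is exactly what keeps the HMM trellis narrow and the decoding complexity low.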
The concatenated code thus obtained has high complexity, large computational data redundancy, and low accuracy. Therefore, this work proposes an ICC algorithm and applies it to the DNA information storage process. Meanwhile, considering the cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) content and homopolymers in DNA storage, it is labeled the DNA-ICC algorithm. In the algorithm, after calculating the forward and backward probabilities at the boundary of each transmitted symbol, only the states with larger values are returned as possible drift states and corresponding symbols for subsequent decoding, limiting the decoding path to a small range of the grid. This prevents paths with very low probability from entering the calculation. The key to improving the concatenated code error correction scheme is to ensure the reliability of channel transmission.
Additionally, the input sequence also affects the error behavior. The guanine and cytosine base content and homopolymers are factors behind the high error probability in DNA synthesis. Therefore, this work adopts the DNA-QAHC algorithm to encode the original file and the concatenated code for error correction. In this way, after several different encodings, the scattered base sequence can be largely homogenized, better guanine and cytosine base content and homopolymer distributions can be obtained, and the error rate of data storage can be reduced.
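The two constraints discussed here, GC content and homopolymer length, can be screened with a short sketch. The helper names are ours, and the 4 nt homopolymer threshold follows the discussion of homopolymer control in this work:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases in a non-empty sequence."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def passes_constraints(seq, gc_lo=0.4, gc_hi=0.6, max_run=4):
    """Illustrative screen: GC content within bounds and no run longer than
    max_run nt (thresholds are illustrative defaults)."""
    return gc_lo <= gc_content(seq) <= gc_hi and max_homopolymer(seq) <= max_run
```

Sequences failing such a screen are exactly the ones the repeated re-encoding described above seeks to avoid.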
3.4 Experimental Verification
Case analysis and performance comparison are conducted to verify the six degree distribution schemes. The length of the symbol code is set as 240, the verification symbol is 16 bits, and the code length is 256. Besides, the row weight of the verification matrix is 16, and the column weight is 1. To verify the performance of the DNA-QAHC algorithm and the DNA-ICC scheme, this paper selects different texts, images, audio, and video for case analysis and uses a Pseudo-Random Sequence as the Watermark Sequence. The operating system is 64-bit Windows 10, the processor is an Intel Core i7-7500U, the memory size is 8 GB, and the running platform is Matlab R2018a. The text, image, and audio input files were collected during daily research; the specific file information is shown in Table 2:
4 RESULTS
4.1 Performance Comparison of Different DNA Information Storage Schemes based on Raptor Codes
Figure 3 compares the bit error rates and the number of successful decodings of the six schemes under different Signal-to-Noise Ratios (SNRs) and Fountain Code encoding and decoding redundancy ε.
Fig. 3. Performance comparison of different Raptor Codes: (a) bit error rate under different SNR, (b) the number of successful decoding.
In Figure 3(a), as the SNR increases, the bit error rate gradually decreases, and the six Raptor Code schemes show a similar downward trend. The SNR is set to 10 dB in the case analysis to reduce the interference caused by SNR. Then, the redundancy of the Fountain Code in the Raptor Code is set to 0.01, 0.03, 0.05, 0.07, and 0.09. Figure 3(b) indicates that, when decoding 100 times under the same SNR, Scheme 1 and Scheme 2 have a higher successful decoding rate than the other schemes, close to 100% when the redundancy is greater than 0.05. When the redundancy of Scheme 6 is greater than 0.09, its decoding rate is close to 90%. Therefore, these schemes have a small average degree, relatively low coding complexity, and a high decoding success rate.
The degree distribution probability, average degree, decoding success rate, and bit error rate of the different Raptor Code schemes are ranked by score, with performance decreasing from 1 to 7, as shown in Figure 4.
Fig. 4. Performance score distribution of different Raptor Code schemes: (a) degree distribution probability, (b) average degree, decoding success rate, and bit error rate performance score.
Figure 4(a) shows that the degree distribution probabilities of the six Raptor Code schemes are very similar: as the degree value increases, the distribution probabilities follow almost the same trend. Figure 4(b) indicates that a higher average degree increases the encoding and decoding complexity but guarantees coverage during signal transmission. The scores suggest that the higher the average-degree score, the lower the corresponding decoding-success-rate score. This result indicates that the decoding success rate also increases with the average degree.
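The trade-off just described hinges on the average degree of the distribution. A hypothetical degree distribution (not one of the paper's six schemes) illustrates how the expected degree is computed and how an encoding symbol's degree is drawn from it:

```python
import random

def average_degree(dist):
    """Expected encoding degree E[d] of a degree distribution {d: P(d)}."""
    return sum(d * p for d, p in dist.items())

def sample_degree(dist, rng=random):
    """Draw one encoding-symbol degree via inverse transform sampling."""
    u, acc = rng.random(), 0.0
    for d, p in sorted(dist.items()):
        acc += p
        if u <= acc:
            return d
    return max(dist)  # guard against floating-point round-off

# Illustrative distribution only: low degrees dominate, with a small
# tail of high-degree symbols to guarantee coverage.
dist = {1: 0.05, 2: 0.5, 3: 0.25, 4: 0.15, 8: 0.05}
d = sample_degree(dist, random.Random(0))
```

Raising the probability mass on high degrees increases `average_degree(dist)`, which improves coverage but makes every encoded symbol costlier to process, matching the score pattern in Figure 4(b).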
4.2 Performance Verification of DNA Data Storage via AHC and ECCs
Figure 5 indicates the influence of different single character lengths on storage density.
Fig. 5. Storage density under different character lengths: (a) storage density under different character lengths, (b) comparison of memory density and runtime for 8-bit and 16-bit characters.
From Figure 5(a), the relationship between storage density and single character length is not linear; instead, the density peaks at a particular character length. This is because the number of required encoding symbols decreases as the character length increases, while the number of symbol types grows, reducing the occurrence probability of each character. According to Figure 5(b), the AHC algorithm achieves satisfactory storage density for different storage files with both 8-bit and 16-bit characters.
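The storage-density metric underlying Figure 5 can be sketched with a simplified model: split the file into fixed-width characters, assume an ideal entropy code, and map every two compressed bits onto one base. This model, including the two-bits-per-base mapping, is an illustrative assumption and not the DNA-QAHC scheme itself.

```python
from collections import Counter
from math import log2

def split_chars(data: bytes, char_bits: int):
    """Split a byte string into fixed-width integer characters."""
    bits = ''.join(f'{b:08b}' for b in data)
    return [int(bits[i:i + char_bits], 2)
            for i in range(0, len(bits) - char_bits + 1, char_bits)]

def density_bits_per_nt(data: bytes, char_bits: int) -> float:
    """Source bits stored per nucleotide: an ideal entropy code spends
    H bits per character and two compressed bits map to one base, so
    the density is 2 * char_bits / H bit/nt.  Incompressible data gives
    exactly 2 bit/nt; structured data can exceed it."""
    chars = split_chars(data, char_bits)
    counts = Counter(chars)
    n = len(chars)
    entropy = -sum(c / n * log2(c / n) for c in counts.values())
    return 2 * char_bits / entropy if entropy else float('inf')

# Uniform byte values are incompressible: density is exactly 2 bit/nt.
print(density_bits_per_nt(bytes(range(256)), 8))  # prints 2.0
```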
Under different insertion and deletion probabilities, the bit error rate of the DNA-ICC scheme is measured at substitution error rates Ps = 0, Ps = 0.1%, Ps = 0.2%, Ps = 0.3%, and Ps = 0.4%. The results are compared with those of other ECCs, such as Bose-Chaudhuri-Hocquenghem (BCH) codes and Grid Matrix codes, as shown in Figure 6.
Fig. 6. Error rate comparison of different ECCs: (a) error rate comparison under different substitution error rates, (b) Ps = 0; (c) Ps = 0.3%; (d) Ps = 0.4%.
In Figure 6, the bit error rate of the DNA-ICC scheme decreases as the error probability decreases, consistent with the trend of the other ECCs, and the DNA-ICC scheme performs best among the compared error correction schemes. The error correction ability of Hamming Codes and RS Codes is poor: as the insertion error probability increases, their bit error rate remains high with no apparent downward trend. Although the original Concatenated Code scheme also reduces the bit error rate, the DNA-ICC scheme has a stronger error correction ability than the original Concatenated Code.
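The insertion/deletion/substitution channel underlying these comparisons can be simulated with a simple per-base model. The independence assumptions and the parameter names (`p_ins`, `p_del`, `p_sub`) are illustrative, not the exact channel model of the experiments.

```python
import random

def ids_channel(seq, p_ins, p_del, p_sub, rng=None):
    """Pass a DNA sequence through an insertion/deletion/substitution
    channel: before each base a random base is inserted with probability
    p_ins; the base itself is then deleted with probability p_del or
    substituted by a different base with probability p_sub."""
    rng = rng or random.Random(0)
    bases = 'ACGT'
    out = []
    for b in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(bases))
        r = rng.random()
        if r < p_del:
            continue                     # base lost
        if r < p_del + p_sub:
            out.append(rng.choice([x for x in bases if x != b]))
        else:
            out.append(b)                # base transmitted correctly
    return ''.join(out)
```

Running an encoded sequence through `ids_channel` at increasing error probabilities and decoding the result reproduces the kind of BER-versus-error-rate curves compared in Figure 6.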
The error correction time and Homopolymer of the DNA-ICC scheme are shown in Figure 7.
Fig. 7. Comparison of error correction time and Homopolymer of different files: (a) comparison of the running time of different error-correction schemes, (b) comparison of Homopolymer.
Figure 7 demonstrates that the DNA-ICC scheme reduces encoding and decoding time and improves the efficiency of DNA information storage. Compared with the original Concatenated Code, it saves at least 1.65 s. In addition, the DNA-ICC scheme controls homopolymers well: when the homopolymer length exceeds 4 nt, the occurrence probability is as low as 0.44%, close to 0.
5 CONCLUSION
With the development and broad application of 5G communication, IoT, cloud computing, big data, AI, and other new-generation information technologies, DT technology has developed rapidly in both theory and application, gradually extending to smart cities, parks, transportation, and other fields. The DNA data storage model combining multi-source data storage and DNA computing in DT is gradually emerging. In this paper, six degree distribution function schemes are proposed for DNA information storage with Raptor Codes. This paper also improves Huffman Coding and the concatenated ECCs, puts forward the quaternary adaptive Huffman encoding and decoding method, and optimizes the Concatenated Code. It is proved that the performance of Raptor Codes is greatly improved. However, the research still has some deficiencies. Although the Raptor Code offers good encoding and decoding performance, its encoding efficiency is still not high enough, at less than 2 bit/nt. Future research will consider neural networks and machine learning for optimization. Moreover, the storage density of hexadecimal AHC could be further improved. In addition, future work will consider targeted coding for different input file types according to the specific content and structural characteristics of the input, and will refer to current communication and storage protocols to realize interoperation between DNA and computer data.
REFERENCES
- [1] 2020. Constraining DNA sequences with a triplet-bases unpaired. IEEE Transactions on Nanobioscience 19, 2 (2020), 299–307.
- [2] 2020. Properties and constructions of constrained codes for DNA-based data storage. IEEE Access 8 (2020), 49523–49531.
- [3] 2021. Minimum free energy coding for DNA storage. IEEE Transactions on Nanobioscience 20, 2 (2021), 212–222.
- [4] 2020. K-means multi-verse optimizer (KMVO) algorithm to construct DNA storage codes. IEEE Access 8 (2020), 29547–29556.
- [5] 2020. On single-error-detecting codes for DNA-based data storage. IEEE Communications Letters 25, 1 (2020), 41–44.
- [6] 2021. Digital twin design for real-time monitoring – a case study of die cutting machine. International Journal of Production Research 59, 21 (2021), 6471–6485.
- [7] 2021. The use of a data-driven digital twin of a smart city: A case study of Alesund, Norway. IEEE Instrumentation & Measurement Magazine 24, 7 (2021), 39–49.
- [8] 2020. Digital twin for the oil and gas industry: Overview, research trends, opportunities, and challenges. IEEE Access 8 (2020), 104175–104197.
- [9] 2020. From LiDAR point cloud towards digital twin city: Clustering city objects based on Gestalt principles. ISPRS Journal of Photogrammetry and Remote Sensing 167 (2020), 418–431.
- [10] 2020. A methodology for digital twin modeling and deployment in industry 4.0. Proceedings of the IEEE 109, 4 (2020), 556–567.
- [11] 2020. Manufacturing blockchain of things for the configuration of a data- and knowledge-driven digital twin manufacturing cell. IEEE Internet of Things Journal 7, 12 (2020), 11884–11894.
- [12] 2020. Virtual factory: Digital twin based integrated factory simulations. Procedia CIRP 93 (2020), 216–221.
- [13] 2020. Sequence-subset distance and coding for error control in DNA-based data storage. IEEE Transactions on Information Theory 66, 10 (2020), 6048–6065.
- [14] 2021. Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE Transactions on Information Theory 67, 6 (2021), 3438–3451.
- [15] 2020. Error rate-based log-likelihood ratio processing for low-density parity-check codes in DNA storage. IEEE Access 8 (2020), 162892–162902.
- [16] 2020. DNA data storage: Automated DNA synthesis and sequencing are key to unlocking virtually unlimited data storage. Computer 53, 4 (2020), 63–67.
- [17] 2021. Information needs mining of COVID-19 in Chinese online health communities. Big Data Research 24 (2021), 100193.
- [18] 2020. Compressed DNA coding using minimum variance Huffman tree. IEEE Communications Letters 24, 8 (2020), 1602–1606.
- [19] 2021. Digital twin for 5G and beyond. IEEE Communications Magazine 59, 2 (2021), 10–15.
- [20] 2021. Enabling technologies and tools for digital twin. Journal of Manufacturing Systems 58 (2021), 3–21.
- [21] 2020. Manufacturing blockchain of things for the configuration of a data- and knowledge-driven digital twin manufacturing cell. IEEE Internet of Things Journal 7, 12 (2020), 11884–11894.
- [22] 2018. Linking empowering leadership to task performance, taking charge, and voice: The mediating role of feedback-seeking. Frontiers in Psychology 9 (2018), 2025.
- [23] 2020. Constraining DNA sequences with a triplet-bases unpaired. IEEE Transactions on Nanobioscience 19, 2 (2020), 299–307.
- [24] 2021. DNA design based on improved ant colony optimization algorithm with Bloch sphere. IEEE Access 9 (2021), 104513–104521.
- [25] 2020. Construction of GC-balanced DNA with deletion/insertion/mutation error correction for DNA storage system. IEEE Access 8 (2020), 140972–140980.
- [26] 2020. From LiDAR point cloud towards digital twin city: Clustering city objects based on Gestalt principles. ISPRS Journal of Photogrammetry and Remote Sensing 167 (2020), 418–431.
- [27] 2021. Digital twin as a service (DTaaS) in industry 4.0: An architecture reference model. Advanced Engineering Informatics 47 (2021), 101225.
- [28] 2021. Digital twin bionics: A biological evolution-based digital twin approach for rapid product development. IEEE Access 9 (2021), 121507–121521.
- [29] 2020. DNA computing and table based data accessing in the cloud environment. Journal of Network and Computer Applications 172 (2020), 102835.
- [30] 2021. Deoxyribonucleic Acid (DNA) computing using two-by-six complementary and color code cipher. Bulletin of Computer Science and Electrical Engineering 2, 1 (2021), 38–45.
- [31] 2021. Industrial applications of digital twins. Philosophical Transactions of the Royal Society A 379, 2207 (2021), 20200360.
- [32] 2020. Securing multimedia by using DNA-based encryption in the cloud computing environment. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 3s (2020), 1–19.
- [33] 2018. Cost-efficient server provisioning for cloud gaming. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 3s (2018), 1–22.
- [34] 2011. Iterative dictionary construction for compression of large DNA data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9, 1 (2011), 137–149.
- [35] 2021. CovidDeep: SARS-CoV-2/Covid-19 test based on wearable medical sensors and efficient neural networks. IEEE Transactions on Consumer Electronics 67, 4 (2021), 244–256.
- [36] 2021. A new few-shot learning method of digital PCR image detection. IEEE Access 9 (2021), 74446–74453.
- [37] 2021. Heat transfer simulation of various material for polymerase chain reaction thermal cycler. Journal of Mechanical Engineering (JMechE) 8, 2 (2021), 27–37.
- [38] 2021. IRTS: An intelligent and reliable transmission scheme for screen updates delivery in DaaS. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 3 (2021), 1–24.