Research Article | Open Access

DNA Computing-Based Multi-Source Data Storage Model in Digital Twins

Published: 24 February 2023


Abstract

The work aims to study the application of Deoxyribonucleic Acid (DNA) multi-source data storage in Digital Twins (DT). Through an investigation of the research status of DT and DNA computing, the work puts forward the concept of DNA multi-source data storage for DT. The Raptor code is improved through the design of its degree distribution function, and six degree distribution schemes are proposed in turn. Additionally, a quaternary dynamic Huffman coding method is applied to DNA data storage, combined with an improved concatenated code serving as the error correction code. Considering the content of cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) and the distribution of homopolymers in DNA storage, the work proposes and verifies an improved concatenated code algorithm, DNA-Improved Concatenated Code (DNA-ICC). The results show that as the Signal-to-Noise Ratio (SNR) increases, the Bit Error Rate (BER) of all schemes decreases gradually with a similar trend, but the degree distribution function optimized by the probability transfer method has better anti-interference ability. The BER of the DNA-ICC scheme decreases with decreasing error probability and outperforms other error correction codes. Compared with the original concatenated code, it saves at least 1.65 s and effectively controls homopolymers: the probability of a homopolymer longer than 4 nt is only 0.44%. The proposed quaternary dynamic Huffman code and concatenated error correction code show excellent performance.


1 INTRODUCTION

In recent years, with the rapid development of Cloud Computing, Big Data, the Internet of Things (IoT), Artificial Intelligence (AI), social networking, and other information technology fields, together with the digital transformation of traditional industries, human beings have generated more and more information. This places demands not only on storage capacity, but also on the scalability of storage time and space, fault tolerance, and the error correction of storage procedures. With the arrival of the Big Data era, the sharp increase in data volume means that traditional storage methods can no longer meet demand, and the search for new storage media has attracted more and more attention [1]. Deoxyribonucleic Acid (DNA) data storage is considered a medium suitable for long-term data storage due to its ultra-high density and sufficiently stable storage performance [2–4]. DNA data storage can reach a storage density of 10^18 bytes/mm³, almost 10^6 times higher than the current highest-density storage media. It has very good resistance to external interference such as high temperature and vibration and can be preserved for a hundred years [5]. The popularity of Digital Twins (DT) has been rising in recent years and has attracted much attention from industry. A DT can make full use of physical models, sensor updates, operation history, and other data, integrate multi-disciplinary, multi-physical-quantity, multi-scale, and multi-probability simulation processes, and complete the mapping in virtual space, thus reflecting the whole life cycle of the corresponding physical equipment [6–8]. One of the key features of DT is multi-source heterogeneous data fusion. Visual decision-making systems likewise emphasize the integration and comprehensive application of multi-source heterogeneous data [9].
Tremendous amounts of basic data are generated during the actual operation of various industries, including map element data, monitoring video data, real-time message data, urban tilt photography data, sensor data, business system data, and various database data. A visual decision-making system based on DT data fusion and DNA data storage can fully integrate the massive data spread across different departments, industries, systems, and data formats, and provide comprehensive data support for the perception and judgment of operational situations in various fields [10–12].

At present, hard disks and other widely used storage systems have inherent shortcomings. For example, the storage life of hard disks and flash memory is at most a few decades, and the storage equipment is non-degradable and pollutes the environment. It is urgent to develop a new generation of alternative storage technology. DNA computing is a non-traditional computing technology based on DNA and enzymes that depends on the principles of biochemistry and molecular biology [13]. DNA strands can be used to encode and store information. Researchers have designed a variety of mapping strategies, such as differential mapping and constraint mapping, to meet the biochemical constraints of DNA computing. However, most mapping strategies have limited mapping potential at each nucleotide storage site [14–16]. In addition, error correcting codes, such as single parity check codes and repetition codes, are used in DNA data storage. Wang et al. (2021) [17] studied and discussed the information needs of China's online health community during Corona Virus Disease 2019 (COVID-19). Existing error correction methods often suffer from high complexity and decoding failure, so a more stable DNA data storage method is needed [18]. Any new technology requires a shift from theory to practical application, and DT is no exception. The concept of DT originates from the industrial manufacturing field. Driven by 5G communication, IoT, Cloud Computing, Big Data, AI, and other new-generation information technologies, the concept of DT is gradually extending to more industry spaces [19–21].

With the gradual deepening of people's understanding of DNA and the rapid development and maturity of DNA storage-related technologies, researchers have gradually come to consider DNA as a new information storage medium. Qian et al. [22] pointed out that only a few studies had addressed the relationships between data, and that most of them were conducted outside the operating environment. Data-driven visual decision-making can quickly be pushed into DT industry applications to help managers improve their intelligent decision-making ability and efficiency. DNA storage not only has large capacity, high density, and long storage life, but also offers higher security, energy conservation, and environmental protection. As a cross-fusion of information technology and biotechnology, DNA information storage plays an important role in saving storage energy and supporting massive data storage. However, research on DNA information storage is still in its infancy in China, and more energy, human, and material resources need to be invested. From the perspective of long-term investment, many manufacturers believe that this technology has high value and is likely to become a breakthrough in the search for new storage media.

Through literature research and algorithm verification, this work studies the problem of DNA data storage for DT. The innovation lies in the improvement of the Raptor code, for which six degree distribution schemes are proposed. Based on the original Huffman code, a quaternary dynamic Huffman code is proposed for the encoding and decoding of DNA data storage. Considering the content of cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) and the distribution of homopolymers in DNA storage, an Improved Concatenated Code (ICC) is proposed as the error correction code, and its performance is verified to be excellent.


2 RESEARCH STATUS OF DT AND DNA COMPUTING

The DNA computers used in the DNA computing process have many advantageous characteristics, such as small size, large storage capacity, fast operation, low energy consumption, and parallelism, and are well suited to data storage [23–25]. A visual data system of higher quality and stricter requirements can be realized via multi-source heterogeneous data (MHD) fusion technology in DT. During the development of DNA computing, scholars faced complex problems such as detecting the operation results of Logic Gates and judging whether the Logic Gates were successfully constructed. Later, scholars solved these problems by labeling various Molecular Radicals on the Thymine (T) Base. With the development of DNA computing, Logic Gates have gradually evolved into Logic Circuits, and many experts and scholars have contributed in this area.

Most research on DT focuses only on existing explicit frameworks and architectures, which face the challenge of supporting different levels of integration through agile processes [26]. Aheleroff et al. [27] conducted a study to determine the appropriate Industry 4.0 technologies and an overall reference architecture model to achieve the most challenging DT applications. With intensifying market competition, the product development process is accelerating, which requires rapid product innovation and efficient collaboration between design and manufacturing. However, information islands still hinder the integration of product lifecycle processes. Bionics and DT have been combined as a potential solution to this problem. Some scholars proposed the concept, framework, and characteristics of DT bionics and expounded on the co-evolution mechanism of product twins, including virtual and physical product and production twins. Li et al. [28] put forward a symbiotic and co-evolutionary mechanism to integrate product development and manufacturing. They concluded that integrating bionics and DT could accelerate the innovation and development of new products and help realize the effective management of production construction.

In addition to data security concerns, long search times for data owners, long data access times, and high system leakage rates may occur when users access data from a cloud environment. Given these problems, Namasudra et al. [29] introduced a fast and secure DNA-based data access control model for the cloud environment, in which cloud service providers maintain a fast and efficient data access table. The authors used a long cipher key based on 1,024 bits of DNA to encrypt a user's confidential or personal data, and verified the effectiveness of the proposed model against existing models through experiments and theoretical analysis. Adithya and Santhi [30] provided a color-code encryption strategy based on DNA computing to protect data from eavesdroppers. A DT is often defined as a constantly updated virtual copy of a real-world product, system, entity, community, or even city, built from data about its physical counterpart and environment. It connects virtual cyberspace with physical entities and is therefore considered a pillar of Industry 4.0 and of future innovation. It can be said that a DT is created and used across the whole life cycle of the entity it copies, from cradle to grave. Jiang et al. [31] focused on the current situation of DT and its industrial applications in the context of intelligent manufacturing, especially from the perspective of plant-wide optimization, discussing the main functions of DT such as mirroring, ghosting, and threading.

To sum up, DT and DNA computing have achieved certain research results, and the relevant research conclusions have their own advantages and disadvantages. However, the research on DNA data storage mostly focuses on the data storage security, and pays less attention to the efficiency of data storage itself. This work optimizes and improves different encoding and decoding schemes, which can provide new ideas and research directions for the future research of DNA multi-source data storage.


3 DNA MULTI-SOURCE DATA STORAGE IN DT WITH HUFFMAN ENCODING AND CASCADE ECCS

3.1 DNA Multi-source Data Storage for DT

The basic DNA data storage units are the four deoxynucleotides, namely Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), also known as Bases. These four nucleotides can be arranged in different orders to form Oligonucleotides, i.e., DNA Strands. DNA data storage comprises the basic processes of data writing, storage, and reading [32–34]. The data writing part includes encoding, mapping, and synthesis. The data to be stored is input in Binary and encoded with Source Compression and Channel Error Correction. Then, mapping rules transform the Binary Sequence into a Sequence over the four Nucleotide Bases. Finally, DNA chains are synthesized by biochemical methods and stored independently in special containers. Current DNA synthesis techniques can produce DNA strands of 200 to 1,000 bases in length. When reading the data, Polymerase Chain Reaction (PCR) amplification [35–37] is used to copy the Oligonucleotides in the storage pool. A typical sequencing method relies on Fluorescent Nucleotides emitting different colors: the DNA sequence represented by the oligonucleotides can be read out by detecting these colors. Figure 1 illustrates the specific DNA data storage process.
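As a minimal illustration of the mapping step, the sketch below uses a simple fixed 2-bits-per-base rule (00→A, 01→C, 10→G, 11→T). This pairing is a common textbook example and an assumption for this sketch only; the coding schemes developed later in this work use variable-length Huffman mappings instead.

```python
# Illustrative fixed mapping between 2-bit groups and bases.
# The pairing 00->A, 01->C, 10->G, 11->T is an assumption for this
# sketch, not the mapping proposed in this work.
BASE_OF = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_OF = {base: bits for bits, base in BASE_OF.items()}

def bits_to_dna(bits: str) -> str:
    """Map a binary string (even length) to a DNA base sequence."""
    assert len(bits) % 2 == 0, "pad the bit stream to an even length first"
    return "".join(BASE_OF[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(seq: str) -> str:
    """Inverse mapping: recover the binary string from the base sequence."""
    return "".join(BITS_OF[base] for base in seq)
```

For example, `bits_to_dna("00011011")` yields `"ACGT"`, and `dna_to_bits` recovers the original bit string.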

Fig. 1.

Fig. 1. Implementation process of DNA information storage.

In Figure 1, the biological read-write technologies involved in DNA data storage include DNA synthesis, PCR amplification, and DNA sequencing. Owing to biological activity and cellular exclusion, DNA synthesis is usually performed artificially in vitro. The synthesis process is divided into three stages. First, the Base Sequence, which is long after the information is encoded, is divided into several short strands. Then, address Bits are added to each segmented short chain to enable quick search, location, and splicing when the file is subsequently read. Finally, Macromolecular Primers with Nucleotide Sequences are added at both ends of each DNA chain before preservation, to facilitate the splicing of short DNA chains and realize data access.

3.2 Research on the DNA Information Storage Method based on Raptor Code

A key feature of Fountain Codes is that they recover the information symbols with high probability from only a small reception overhead: decoding succeeds once a sufficient number of coded symbols has been received [38]. However, an LT (Luby Transform) Code can be decoded 100% successfully only once all original symbols are recovered during decoding, and its decoding complexity is not linear. The Raptor Code adds a precoding stage to improve coding performance at the same synthesis cost, effectively resolving the contradiction between encoding/decoding complexity and transmission efficiency. The precoding part primarily uses a Low-Density Parity-Check (LDPC) code; a Raptor Code is the concatenation of an LDPC code and an LT code. Assume a linear block code with code length N, in which the number of information symbols is k and the number of check symbols is M; the matrix generated by this block code is defined as \(G_{k \times N}\). In other words, the information symbol matrix \(U_{k \times 1}\) can be mapped into the block code space through the generator matrix, and the block code matrix \({\rm{C}}\) and check bits M satisfy Equations (1) and (2): (1) \(\begin{equation} M = N - k \end{equation}\) (2) \(\begin{equation} {\rm{C}} = {\rm{U}} \times G \end{equation}\)
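The block-code mapping of Equations (1)–(2) can be sketched over GF(2) as follows; the 2×3 generator matrix in the usage example is hypothetical and chosen only to keep the arithmetic visible:

```python
def gf2_encode(u, G):
    """Compute the codeword C = U x G over GF(2).

    u: list of k information bits; G: k x N generator matrix given as
    a list of k rows of 0/1 entries. Returns the N-bit codeword.
    """
    k, N = len(G), len(G[0])
    assert len(u) == k, "information word must match the matrix rows"
    # each codeword bit is the mod-2 inner product of u with a column of G
    return [sum(u[i] * G[i][j] for i in range(k)) % 2 for j in range(N)]
```

With the hypothetical systematic generator G = [[1, 0, 1], [0, 1, 1]] (k = 2, N = 3, hence M = N − k = 1 check bit), `gf2_encode([1, 1], G)` returns `[1, 1, 0]`.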

For irregular LDPC, the degree value of each node is determined by the Degree Distribution function. Equation (3) describes the Degree Distribution function. (3) \(\begin{equation} \lambda (x) = \sum\limits_{j = 2}^{\mathop d\nolimits_v } {\mathop \lambda \nolimits_j } \mathop x\nolimits^{j - 1} \end{equation}\)

In Equation (3), \(\mathop \lambda \nolimits_j\) represents the ratio of the number of edges sj owned by the variable node with degree value j to the total number of edges z of the bidirectional graph, which can be expressed as Equation (4). (4) \(\begin{equation} \mathop \lambda \nolimits_j = \frac{{\mathop s\nolimits_j }}{z} \end{equation}\)

Besides, dv refers to the maximum degree value of the variable node, satisfying: (5) \(\begin{equation} \sum\limits_{i = 2}^{\mathop d\nolimits_v } {\mathop \lambda \nolimits_i } = 1 \end{equation}\)

Then, the Degree Distribution function of the verification node is expressed as Equation (6). (6) \(\begin{equation} \rho (x) = \sum\limits_{i = 2}^{\mathop d\nolimits_c } {\mathop \rho \nolimits_i } \mathop x\nolimits^{i - 1} \end{equation}\)

In Equation (6), \(\mathop \rho \nolimits_i\) denotes the ratio of the number of edges bi owned by the check node whose degree value is i to the total number of edges z in the bidirectional graph, as shown in Equation (7). (7) \(\begin{equation} \mathop \rho \nolimits_i = \frac{{\mathop b\nolimits_i }}{z} \end{equation}\) Here, dc in Equation (6) signifies the maximum degree value of the verification node, and it satisfies Equation (8). (8) \(\begin{equation} \sum\limits_{i = 2}^{\mathop d\nolimits_c } {\mathop \rho \nolimits_i } = 1 \end{equation}\)

The core of LDPC is to determine the check matrix H, obtain the generator matrix G, and then encode the information symbols. The length of the information symbols determines the dimension of the check matrix H. Assuming the length of the information symbols waiting to be encoded is n, there is an m × n dimensional check matrix. The Gaussian elimination method transforms this check matrix into the standard form: (9) \(\begin{equation} \mathop H\nolimits^{\prime} = [P|\mathop I\nolimits_m ] \end{equation}\) where P stands for an \(m \times (n - m)\) dimensional check submatrix, and Im denotes an m-dimensional identity matrix. After standardization by the Gaussian elimination method, the check bits can be written as Equation (10). (10) \(\begin{equation} \mathop p\nolimits_i = \sum\limits_{j = 1}^{n - m} {\mathop H\nolimits_{i,j}^{\prime} } \mathop s\nolimits_j + \sum\limits_{j = 1}^{i - 1} {\mathop H\nolimits_{i,j + n - m}^{\prime} } \mathop p\nolimits_j \end{equation}\)

The Belief Propagation (BP) decoding algorithm is adopted to decode LDPC. Firstly, variable nodes are initialized, and symbols are assigned according to the acceptance conditions in Equation (11). (11) \(\begin{equation} \mathop p\nolimits_i = \left\{{\begin{array}{c@{\quad}c} { + \infty }&{\mathop y\nolimits_i = 0}\\ 0&{\mathop y\nolimits_i = E}\\ { - \infty }&{\mathop y\nolimits_i = 1} \end{array}} \right. \end{equation}\)

In Equation (11), E represents the variable node to be deleted. The Exclusive OR operation is performed to delete the associated edges between all the remaining n variable nodes that have not been deleted and the check nodes connected to these nodes, as presented in Equation (12). (12) \(\begin{equation} \mathop \phi \nolimits_{mn} = 2\mathop {\tanh }\nolimits^{ - 1} \left( {\prod\limits_{\mathop n\nolimits^{\prime} \in N\left( m \right)\left| n \right.} {\tanh \left( {\frac{{\mathop \varphi \nolimits_{\mathop {mn}\nolimits^{\prime} } }}{2}} \right)} } \right) \end{equation}\)

In Equation (12), N(m)|n represents the set of variable nodes connected to check node m, excluding variable node n. Assuming that the associated edge of a particular check node is 1, the variable node connected to that check node can be restored; the value of the check node is N(m). After that, the associated edges connected to the restored variable node can be deleted, which can be written as: (13) \(\begin{equation} \mathop \phi \nolimits_{mn} = \mathop \phi \nolimits_{n0} + \sum\limits_{\mathop m\nolimits^{\prime} \in M\left( n \right)\left| m \right.} {\mathop \phi \nolimits_{\mathop m\nolimits^{\prime} n} } \end{equation}\) (14) \(\begin{equation} \mathop \phi \nolimits_n = \mathop \phi \nolimits_{n0} + \sum\limits_{m \in M\left( n \right)} {\mathop \phi \nolimits_{mn} } \end{equation}\)

Finally, if no check node with an associated edge of 1 can be found, the decoding terminates. The whole decoding process is thus a process of eliminating associated edges. Table 1 presents the specific algorithm.

Table 1.
1. Start: initialize the variable nodes and assign the symbols 0, E, 1 according to the reception:
2. \(\mathop p\nolimits_i = \left\{ {\begin{array}{c@{\quad}c} { + \infty }&{\mathop y\nolimits_i = 0}\\ 0&{\mathop y\nolimits_i = E}\\ { - \infty }&{\mathop y\nolimits_i = 1} \end{array}}\right.\)
3. Initialize all check nodes to 0.
4. For all remaining variable nodes, perform XOR and delete the associated edges between them at the same time:
5. \(\mathop \phi \nolimits_{mn} = 2\mathop {\tanh }\nolimits^{ - 1} \left( {\prod\limits_{\mathop n\nolimits^{\prime} \in N( m )| n } {\tanh \big( {\frac{{\mathop \varphi \nolimits_{\mathop {mn}\nolimits^{\prime} } }}{2}} \big)} } \right)\)
6. If the associated edge of a check node is 1:
7. \(\mathop \phi \nolimits_{mn} = \mathop \phi \nolimits_{n0} + \sum\limits_{\mathop m\nolimits^{\prime} \in M( n )| m } {\mathop \phi \nolimits_{\mathop m\nolimits^{\prime} n} }\)
8. Delete the associated edge connected to the variable node.
9. Else if
10. no check node with associated edge 1 is found during decoding:
11. Decoding aborted.
12. End

Table 1. BP Decoding Algorithm
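For the erasure case, the edge-elimination procedure of Table 1 reduces to a peeling decoder: any check node left with exactly one erased neighbor resolves it by XOR. The sketch below is an illustrative simplification (the tanh-based messages of Equation (12) degenerate to this rule when every symbol is either known or erased); the example graph in the usage note is hypothetical.

```python
def peel_decode(check_edges, received):
    """Peeling (edge-elimination) decoder for the erasure channel.

    check_edges: check_edges[m] lists the variable-node indices attached
    to check node m; each check constrains the XOR of its variables to 0.
    received: list of 0, 1, or None, where None marks an erased symbol E.
    Returns the symbol list with every recoverable erasure filled in.
    """
    y = list(received)
    progress = True
    while progress:                 # repeat until no check can be peeled
        progress = False
        for vars_m in check_edges:
            unknown = [v for v in vars_m if y[v] is None]
            if len(unknown) == 1:   # exactly one erased edge remains
                v = unknown[0]
                # the erased bit equals the XOR of the known bits
                y[v] = sum(y[u] for u in vars_m if u != v) % 2
                progress = True
    return y                        # stop: no check with one erased edge
```

With checks [[0, 1, 2], [1, 2, 3]] and received [1, None, 0, None], the first check recovers symbol 1 and the second then recovers symbol 3. Decoding terminates, exactly as in Table 1, when no check node with a single unresolved edge remains; unrecoverable erasures stay None.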

Raptor Code is improved by designing the Degree Distribution function. The Degree Distribution function with good performance needs to guarantee the coverage of coded symbols and minimize the average degree value as much as possible to reduce the complexity of encoding and decoding. Therefore, the Distribution function satisfying these two points can be expressed as Equation (15). (15) \(\begin{equation} \left\{ {\begin{array}{@{}*{1}{l}@{}} {1 - d - \mathop e\nolimits^{ - \mathop \mu \nolimits^{\prime} \left( d \right)\left( {1 + \varepsilon } \right)} \ge \gamma \sqrt {\frac{{1 - d}}{k}} }\\ {d \in \left[ {0,1 - \delta } \right]} \end{array}} \right. \end{equation}\)

In Equation (15), \(\varepsilon\) represents the coding and decoding redundancy of Fountain Codes, \(\gamma\) refers to a positive real number, and \(1 - \delta\) indicates the decoding success rate. Besides, k signifies the number of information symbols, and \(\mathop \mu \nolimits^{\prime} ( d )\) is the derivative of \(\mu ( d )\). Then, there is: (16) \(\begin{equation} \left\{ {\begin{array}{@{}*{1}{l}@{}} {A \ge \gamma \sqrt {\left( {1 - d} \right) \cdot k} }\\ {d \in \left[ {0,1 - \delta } \right]} \end{array}} \right. \end{equation}\) where A represents the decoded set received by the receiver. In DNA data storage, long strands are more error-prone, more costly, and more technically complex to synthesize than short strands. In addition, long symbol codes increase the complexity of the Raptor encoding process and reduce overall timeliness. Therefore, a short code length is often used to obtain a DNA-Raptor data storage architecture with better performance. Common Degree Distribution functions satisfying the above requirements can be expressed as: (17) \(\begin{equation} \mu (d) = 0.00098d\, +\, 0.459\mathop d\nolimits^2\, +\; 0.211\mathop d\nolimits^3\, +\; 0.113\mathop d\nolimits^4\, +\; 0.1113\mathop d\nolimits^{10}\, +\; 0.0799\mathop d\nolimits^{11}\, +\; 0.0156\mathop d\nolimits^{40} \end{equation}\) (18) \(\begin{align} \mu (d) &= 0.007969d\, + \,0.4935\mathop d\nolimits^2\, +\; 0.166\mathop d\nolimits^3\, +\; 0.073\mathop d\nolimits^4\, + \;0.082\mathop d\nolimits^5\, +\; 0.056\mathop d\nolimits^8\, +\; 0.037\mathop d\nolimits^9\nonumber\\ &\quad + 0.055\mathop d\nolimits^{19}\, + \; 0.025\mathop d\nolimits^{65}\, +\; 0.0031\mathop d\nolimits^{66} \end{align}\)

Since the success rate of LT decoding is positively correlated with redundancy, A slowly increases after reaching the peak value until it is close to full decoding. Therefore, the improved Degree Distribution function can be written as Equation (19). (19) \(\begin{equation} \mathop \tau \nolimits^{\rm{\cdot}} (d) = \left\{{\begin{array}{l@{\quad}l} {\frac{s}{{kd}}}&{d = 1,2,\ldots,\frac{k}{s}}\\ 0 & {d > \frac{k}{s}} \end{array}}\right. \end{equation}\)

Then, the Robust Solitary Wave Distribution (RSWD) \(\mathop \mu \nolimits^{\rm{\cdot}} (d)\) is adopted as Scheme 1, as shown in Equation (20). (20) \(\begin{equation} \mathop{\mathop \mu \nolimits_1}\nolimits^{\rm{\cdot}} (d) = \frac{{\rho (d) + \mathop \tau \nolimits^{\rm{\cdot}} (d)}}{{\sum\nolimits_{d = 1}^k {\rho (d) + \mathop \tau \nolimits^{\rm{\cdot}} (d)} }} \end{equation}\)
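A sketch of the Scheme 1 construction of Equations (19)–(20), taking \(\rho(d)\) to be the ideal soliton distribution and normalizing the sum \(\rho(d)+\mathop \tau \nolimits^{\rm{\cdot}}(d)\). The parameters c and delta used to form s are illustrative defaults, not values from this work.

```python
import math

def robust_soliton(k, c=0.05, delta=0.5):
    """Return mu(d) for d = 0..k (index 0 unused), per Eqs. (19)-(20).

    rho(d) is the ideal soliton distribution (rho(1) = 1/k,
    rho(d) = 1/(d(d-1)) for d >= 2); tau(d) = s/(k*d) for d <= k/s.
    c and delta are illustrative tuning parameters, not values
    specified in this work.
    """
    s = c * math.log(k / delta) * math.sqrt(k)
    mu = [0.0] * (k + 1)
    mu[1] = 1.0 / k                          # rho(1)
    for d in range(2, k + 1):
        mu[d] = 1.0 / (d * (d - 1))          # rho(d)
    for d in range(1, int(k / s) + 1):
        mu[d] += s / (k * d)                 # tau(d) of Eq. (19)
    Z = sum(mu)                              # normaliser of Eq. (20)
    return [p / Z for p in mu]
```

After normalization the probabilities sum to one, and the mass is concentrated on small degrees (degree 2 carries the largest weight), which keeps the average degree, and hence the encoding and decoding cost, low.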

Soliton theory provides two probability distributions: the Ideal Soliton Distribution and the Robust Soliton Distribution. Scheme 2 is obtained by setting the degree values with low probability in the RSWD to 0, as presented in Equation (21). (21) \(\begin{equation} \mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d) = \left\{{\begin{array}{l@{\quad}l} {\mathop \mu \nolimits_2 (d) = 0}& {\mu (d) < \frac{1}{k}} \\ {\mathop \mu \nolimits_2 (d) = \mu (d)}&{\mu (d) \ge \frac{1}{k}} \end{array}} \right. \end{equation}\)

A new Degree Distribution function is obtained after normalization: (22) \(\begin{equation} \mathop \mu \nolimits_2 (d) = \frac{{\mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d)}}{{\sum\nolimits_{d = 1}^k {\mathop {\mathop \mu \nolimits_2 }\nolimits^{\rm{\cdot}} (d)} }} \end{equation}\)

Assuming that the number of information symbols of short code length k = 16∼1024, then: (23) \(\begin{equation} \mathop \mu \nolimits^{\rm{\cdot}} (d) \ge \frac{{ - \ln ( {1 - d - \gamma \sqrt {\frac{{1 - d}}{k}} } )}}{{1 + \varepsilon }} \end{equation}\)

Due to the structural characteristics of DNA data storage and the performance characteristics of Raptor Codes, the symbol code length is set as \(k = 256\), and degree values \(\mathop d\nolimits_{33}\) and \(\mathop d\nolimits_{44}\) are added. Transferring the occurrence probability of \(\mathop d\nolimits_{65}\) to \(\mathop d\nolimits_1\) and that of \(\mathop d\nolimits_{66}\) to \(\mathop d\nolimits_{33}\) gives Scheme 3. Transferring the probability of \(\mathop d\nolimits_{65}\) to \(\mathop d\nolimits_1\) and that of \(\mathop d\nolimits_{66}\) to \(\mathop d\nolimits_{44}\) gives Scheme 4. Transferring the probability of \(\mathop d\nolimits_{65}\) to \(\mathop d\nolimits_1\) alone gives Scheme 5. Transferring the probability of \(\mathop d\nolimits_{45}\) to \(\mathop d\nolimits_2\) and that of \(\mathop d\nolimits_{66}\) to \(\mathop d\nolimits_{33}\) gives Scheme 6.
The specific expressions are as follows: (24) \(\begin{equation} \mathop \mu \nolimits_3 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{33} \end{equation}\) (25) \(\begin{equation} \mathop \mu \nolimits_4 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{44} \end{equation}\) (26) \(\begin{equation} \mathop \mu \nolimits_5 \left( d \right) = 0.033d \,+\, 0.492\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{66} \end{equation}\) (27) \(\begin{equation} \mathop \mu \nolimits_6 \left( d \right) = 0.029d \,+\, 0.503\mathop d\nolimits^2 \,+\, 0.167\mathop d\nolimits^3 \,+\, 0.072\mathop d\nolimits^4 \,+\, 0.082\mathop d\nolimits^5 \,+\, 0.056\mathop d\nolimits^8 \,+\, 0.037\mathop d\nolimits^9 \,+\, 0.0556\mathop d\nolimits^{19} \,+\, 0.003\mathop d\nolimits^{33} \end{equation}\)
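During encoding, each output symbol's degree is drawn from the chosen polynomial by inverse-CDF sampling over its coefficients. Below is a sketch using the Scheme 3 coefficients transcribed from Equation (24); the sampling helper itself is illustrative.

```python
import bisect
import random

# Coefficients of mu_3(d), transcribed from Equation (24):
# each key is a degree d, each value the probability of drawing it.
MU3 = {1: 0.033, 2: 0.492, 3: 0.167, 4: 0.072, 5: 0.082,
       8: 0.056, 9: 0.037, 19: 0.0556, 33: 0.003}

def sample_degree(dist, rng=random):
    """Draw one degree from a {degree: probability} table by inverse CDF."""
    degrees = sorted(dist)
    cdf, total = [], 0.0
    for d in degrees:
        total += dist[d]
        cdf.append(total)
    # scale by the actual total so small rounding in the published
    # coefficients (they sum to roughly 0.998) cannot push u past the table
    u = rng.random() * total
    return degrees[bisect.bisect_left(cdf, u)]
```

Over many draws the empirical frequencies approach the polynomial coefficients, so degree 2 dominates, as intended by the low-average-degree design goal.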

3.3 DNA Information Storage based on Adaptive Huffman Coding and Concatenated ECCs

Huffman Coding is an Entropy Coding algorithm for lossless data compression. Code words of different lengths are allocated according to the occurrence probability of the coded characters: the higher the probability, the shorter the code word; the lower the probability, the longer the code word. In this way, the storage density can be improved on average. However, static Huffman coding has disadvantages: poor timeliness, some non-generic fields, and lowered encoding efficiency due to the additional space needed to store the Huffman Tree. This paper applies Adaptive Huffman Coding (AHC) to DNA data storage to solve these problems. AHC dynamically adjusts the Huffman Tree each time the encoder reads a symbol to be encoded, updating the corresponding weights and the tree after every character. This ensures that the current output symbol depends only on the currently encoded character and the characters already read, and not on characters yet to be read. The decoding process is similar to the encoding process. In view of the advantages of quaternary coding, this paper proposes a Quaternary Adaptive Huffman Coding (QAHC) algorithm for DNA data storage, namely DNA-QAHC.
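To illustrate the quaternary (4-ary) side of the idea, the sketch below builds a static 4-ary Huffman tree, where each merge combines the four lightest nodes and each branch emits one base. The adaptive variant described above additionally rebuilds this tree after every symbol, which is omitted here for brevity; the symbol frequencies in the usage note are hypothetical.

```python
import heapq
from itertools import count

BASES = "ACGT"  # one quaternary code digit per DNA base

def quaternary_huffman(freqs):
    """Build a static 4-ary Huffman code: symbol -> DNA string.

    freqs maps symbols to weights. Zero-weight dummy nodes are added
    so that every merge consumes exactly four nodes, the standard
    padding rule for r-ary Huffman trees ((n - 1) mod (r - 1) == 0).
    """
    tick = count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(tick), {sym: ""}) for sym, w in freqs.items()]
    while (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tick), {}))       # dummy leaf
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(4)]
        total, merged = 0, {}
        for (w, _, table), base in zip(group, BASES):
            total += w
            for sym, code in table.items():
                merged[sym] = base + code      # prepend this branch's base
        heapq.heappush(heap, (total, next(tick), merged))
    return heap[0][2]
```

With hypothetical frequencies such as {"a": 5, "b": 2, "c": 1, "d": 1, "e": 1}, the most frequent symbol receives a single-base code, and the resulting code is prefix-free, so a base stream can be decoded greedily.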

DNA multi-source data storage for DT can also be regarded as a process of sending and receiving information, in which different degrees of noise interference will cause errors in the data transmission process. Although such errors are rare, they have profound implications for data recovery. Therefore, error correction is a must. The most common errors in DNA information storage usually occur in the process of data deletion, data insertion, and replacement, collectively known as synchronization errors. Here, Concatenated Codes, Watermark Codes, and non-binary LDPC are combined for error correction. Figure 2 reveals the specific error correction process.

Fig. 2.

Fig. 2. The specific process of error correction by Concatenated Codes.

As Figure 2 suggests, the decoding process can be defined as a Hidden Markov Model (HMM) by associating sparse sequences with LDPC codes.

Suppose a errors are inserted and b errors are deleted between the first bit at time t0 and the bit to be sent at time tj. The range of the drift value xj at point j is: (28) \(\begin{equation} X = \{ - \mathop x\nolimits_{\max } ,\ldots, -1,0,1,\ldots,\mathop x\nolimits_{\max } \} \end{equation}\) where xmax represents the maximum drift value. The transition probability Pa,b is defined as Equation (29) to reduce the complexity of the decoding process. (29) \(\begin{equation} \mathop P\nolimits_{a,b} = P\left(\mathop x\nolimits_{j + 1} = b| {\mathop x\nolimits_j } = a\right) \end{equation}\)
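The drift state space X of Equation (28) and the transition probability \(P_{a,b}\) of Equation (29) can be sketched as a Markov transition matrix. The sketch assumes at most one insertion or one deletion per transmitted symbol and renormalizes at the boundary states; p_ins and p_del are hypothetical channel parameters, not values from this work.

```python
def drift_transition_matrix(x_max, p_ins, p_del):
    """Build P[a][b] = P(x_{j+1} = b | x_j = a) over drift states
    -x_max..x_max, assuming one insertion raises the drift by 1
    (prob p_ins), one deletion lowers it by 1 (prob p_del), and the
    drift otherwise stays put. Rows are renormalised at the boundary
    states so each remains a probability distribution.
    """
    states = list(range(-x_max, x_max + 1))
    index = {a: i for i, a in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for a in states:
        moves = {a: 1.0 - p_ins - p_del, a + 1: p_ins, a - 1: p_del}
        kept = {b: p for b, p in moves.items() if -x_max <= b <= x_max}
        mass = sum(kept.values())
        for b, p in kept.items():
            P[index[a]][index[b]] = p / mass
    return P
```

Restricting transitions to neighboring drift states is exactly what keeps the forward-backward computation over this matrix tractable during decoding.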

The concatenated code thus obtained has high complexity, large computational data redundancy, and low accuracy. Therefore, this work proposes an ICC algorithm and applies it to the DNA information storage process. Meanwhile, considering the content of cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) and homopolymers in DNA storage, it is labeled the DNA-ICC algorithm. In this algorithm, after the forward and backward probabilities are calculated at the boundary of each transmitted symbol, only those with larger values are returned as possible drift states and corresponding symbols for subsequent decoding, limiting the decoding path to a small range of the grid. This prevents paths with very low probability from entering the calculation. The key to improving the concatenated code error correction scheme is to ensure the reliability of channel transmission.

Additionally, the input sequence also affects the error behavior. The guanine and cytosine base content and homopolymer runs are the main factors behind the high error probability during DNA synthesis. Therefore, this work adopts the DNA-QAH algorithm to encode the original file and the concatenated code for error correction. After these successive encodings, the base sequence is homogenized to a great extent, better guanine and cytosine base content and homopolymer characteristics are obtained, and the error rate of data storage is reduced.
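The two synthesis constraints just mentioned, GC content and homopolymer run length, can be screened with a short check. A minimal sketch (the function names are ours, not the paper's):

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence string."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def max_homopolymer(seq):
    """Length in nt of the longest run of identical consecutive bases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best
```

A candidate oligo would typically be accepted only if `gc_content` is near 0.5 and `max_homopolymer` does not exceed the 4 nt threshold discussed in the results.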

3.4 Experimental Verification

Case analysis and performance comparison are conducted to verify the six degree distribution schemes. The code has 240 information symbols and 16 check symbols, for a code length of 256. The row weight of the parity-check matrix is 16, and the column weight is 1. To verify the performance of the DNA-QAHC algorithm and the DNA-ICC scheme, this paper selects different texts, images, audio, and video for case analysis and uses a pseudo-random sequence as the watermark sequence. The operating system is 64-bit Windows 10, the processor is an Intel Core i7-7500U, the memory size is 8 GB, and the running platform is MATLAB R2018a. The selected text, image, and audio input files are data collected daily during the research. The specific file information is shown in Table 2.

Table 2. Specific File Information

Encoding file   Format   Memory (KB)   Note
Text 1          .txt     147.3         Chinese
Text 2          .txt     37.5          English
Image 1         .tif     11.8          Color image
Image 2         .tiff    25.4          Gray image
Audio 1         .wav     10.2
Audio 2         .mp3     13.8


4 RESULTS

4.1 Performance Comparison of Different DNA Information Storage Schemes based on Raptor Codes

Figure 3 compares the bit error rates and the numbers of successful decodings of the six schemes under different Signal-to-Noise Ratios (SNRs) and fountain-code redundancy \(\varepsilon\).

Fig. 3. Performance comparison of different Raptor Codes: (a) bit error rate under different SNR, (b) the number of successful decodings.

In Figure 3(a), as the SNR increases, the bit error rate gradually decreases, and the six Raptor Code schemes show a similar downward trend. To reduce the interference caused by SNR, the SNR is fixed at 10 dB in the case analysis, and the fountain-code redundancy in the Raptor Code is set to 0.01, 0.03, 0.05, 0.07, and 0.09. Figure 3(b) indicates that over 100 decoding attempts at the same SNR, Schemes 1 and 2 achieve higher successful decoding rates than the other schemes, approaching 100% when the redundancy exceeds 0.05. For Scheme 6, the decoding rate approaches 90% when the redundancy exceeds 0.09. These schemes therefore have a small average degree, relatively low coding complexity, and a high decoding success rate.

The degree value distribution probability, average degree, decoding success rate, and bit error rate of the different Raptor Code schemes are ranked by score, with performance decreasing from 1 to 7, as shown in Figure 4.

Fig. 4. Performance score distribution of different Raptor Code schemes: (a) degree distribution probability, (b) average degree, decoding success rate, and bit error rate performance score.

Figure 4(a) shows that the degree value distribution probabilities of the six Raptor Code schemes are very similar: as the degree value increases, the distribution probabilities follow nearly the same trend. Figure 4(b) indicates that a higher average degree increases encoding and decoding complexity but guarantees coverage during signal transmission. The scores suggest that the higher a scheme's average-degree score, the lower its decoding-success-rate score. This result indicates that the decoding success rate also increases with the average degree.
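For reference, the robust soliton distribution, the classical LT/Raptor baseline that degree distribution schemes such as these vary, and its average degree can be computed as follows. This is a sketch: the tuning constants `c` and `delta` are assumed values, and the paper's six schemes are not reproduced here.

```python
import math

def robust_soliton(k, c=0.1, delta=0.5):
    """Robust soliton degree distribution over degrees 1..k,
    normalized so the probabilities sum to 1."""
    s = c * math.log(k / delta) * math.sqrt(k)
    # ideal soliton component rho(d)
    rho = [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]
    # spike component tau(d), concentrated below and at k/s
    pivot = int(round(k / s))
    tau = []
    for d in range(1, k + 1):
        if d < pivot:
            tau.append(s / (k * d))
        elif d == pivot:
            tau.append(s * math.log(s / delta) / k)
        else:
            tau.append(0.0)
    z = sum(rho) + sum(tau)
    return [(r + t) / z for r, t in zip(rho, tau)]

def average_degree(dist):
    """Expected degree E[d] = sum of d * P(d), d starting at 1."""
    return sum(d * p for d, p in enumerate(dist, start=1))
```

With the code length of 256 used in the experiments, the average degree of this baseline stays small, which is the property the scoring in Figure 4(b) trades against decoding success rate.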

4.2 Performance Verification of DNA Data Storage via AHC and ECCs

Figure 5 indicates the influence of different single character lengths on storage density.

Fig. 5. Storage density under different character lengths: (a) storage density under different character lengths, (b) comparison of storage density and runtime for 8-bit and 16-bit characters.

From Figure 5(a), the relationship between storage density and single character length is not linear; the density peaks at a particular character length. This is because the number of required encoding symbols decreases as the character length increases, but the number of symbol types grows, reducing the probability of each character's occurrence. According to Figure 5(b), the AHC algorithm achieves satisfactory storage density on different storage files for both 8-bit and 16-bit characters.
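This tradeoff can be reproduced with an idealized estimate: an entropy-coded quaternary alphabet needs roughly the base-4 entropy of the character distribution in nucleotides per character, so density in bit/nt is bits-per-character divided by nt-per-character. The sketch below is our simplification, not the paper's DNA-QAHC; it ignores error-correction redundancy, and `density_estimate` is a hypothetical name.

```python
from collections import Counter
import math

def density_estimate(data: bytes, width_bytes: int) -> float:
    """Rough storage density (bit/nt) if `data` is split into characters
    of width_bytes bytes and coded with an ideal quaternary entropy code."""
    chunks = [data[i:i + width_bytes] for i in range(0, len(data), width_bytes)]
    freq = Counter(chunks)
    n = len(chunks)
    # base-4 Shannon entropy ~ nucleotides needed per character
    nt_per_char = -sum((c / n) * math.log(c / n, 4) for c in freq.values())
    bits_per_char = 8 * width_bytes
    return bits_per_char / max(nt_per_char, 1e-9)
```

On data whose bytes are uniformly distributed this gives exactly 2 bit/nt (4 nt per byte), while skewed distributions score higher because the entropy code exploits the redundancy; varying `width_bytes` on real files shows the peaked behavior of Figure 5(a).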

Under different insertion and deletion probabilities, the bit error rate of the DNA-ICC scheme is measured at substitution error rates Ps = 0, Ps = 0.1%, Ps = 0.2%, Ps = 0.3%, and Ps = 0.4%. The result is compared with that of other ECCs, such as Bose–Chaudhuri–Hocquenghem (BCH) codes and Grid Matrix codes, as shown in Figure 6.

Fig. 6. Error rate comparison of different ECCs: (a) error rate comparison at different replacement error rates, (b) Ps = 0; (c) Ps = 0.3; (d) Ps = 0.4.

In Figure 6, the bit error rate of the DNA-ICC scheme decreases with the error probability, matching the general trend of the ECCs, and the DNA-ICC scheme performs best among the error correction schemes compared. The error correction ability of Hamming Codes and RS Codes is poor: as the insertion error probability increases, their bit error rates remain high with no apparent downward trend. Although the original Concatenated Code scheme can also reduce the bit error rate, the DNA-ICC scheme has stronger error correction ability.

The error correction time and homopolymer statistics of the DNA-ICC scheme are shown in Figure 7.

Fig. 7. Comparison of error correction time and homopolymers of different files: (a) comparison of the running time of different error correction schemes, (b) comparison of homopolymers.

Figure 7 demonstrates that the DNA-ICC scheme reduces encoding and decoding time and thus improves the efficiency of DNA information storage, saving at least 1.65 s compared with the original Concatenated Code. In addition, the DNA-ICC scheme controls homopolymers well: when the homopolymer size exceeds 4 nt, the occurrence probability of a homopolymer is as low as 0.44%, close to 0.


5 CONCLUSION

With the development and broad application of 5G communication, IoT, cloud computing, big data, AI, and other new-generation information technologies, DT technology has developed rapidly in both theory and application, gradually extending to smart cities, parks, transportation, and other fields. The DNA data storage model combining multi-source data storage and DNA computing in DT is gradually emerging. This paper proposes six degree distribution schemes for DNA information storage based on Raptor Codes. It also improves Huffman Coding and the concatenated ECCs, puts forward a quaternary adaptive Huffman encoding and decoding method, and optimizes the Concatenated Code; the performance of the Raptor Codes is shown to be greatly improved. However, the research has some limitations. Although the Raptor Code offers good encoding and decoding performance, its encoding efficiency is still below 2 bit/nt; future research will consider neural networks and machine learning for optimization. Moreover, the storage density of hexadecimal AHC could be further improved. Future work will also consider targeted coding for different input file types according to their specific content and structural characteristics, and will draw on current communication and storage protocols to realize mutual communication between DNA and computer data.


Published in ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3s (June 2023), 270 pages. ISSN: 1551-6857; EISSN: 1551-6865; DOI: 10.1145/3582887. Editor: Abdulmotaleb El Saddik. Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History: Received 30 January 2022; Revised 29 June 2022; Accepted 1 September 2022; Online AM 13 September 2022; Published 24 February 2023.
