Multi-Language Character Encoding Technique for DNA Storage


Wei Wang*1, Zhengxu Zhao2, Wei Zhang3

1,2 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, China
3 Beijing Aerospace Control Center, Beijing, China

*1wangwei@stdu.edu.cn; 2zhaozx@stdu.edu.cn; 363858606@qq.com

Abstract

In 1994, Dr. Adleman solved a combinatorial problem using DNA as the computational mechanism. He proved the principle that DNA computing can be used to solve computationally complex problems. Over the past 20 years, with the rapid development of biomolecular computers, scientists have established a series of theoretical models and succeeded in biochemical experiments. DNA computing has become an important research direction in computer science and molecular biology. This research presents a novel approach in which characters are encoded by permutations and combinations of the four nitrogenous bases (adenine, guanine, cytosine and thymine) in DNA molecules. The character encoding supports multiple languages and gives each character a unique identifier.

Keywords

DNA Storage; Character Encoding; DNA Computing

International Journal of Automation and Control Engineering, Vol. 4, No. 1, April 2015. 2325-7407/15/01 019-3 © 2015 DEStech Publications, Inc. doi:10.12783/ijace.2015.0401.05

Introduction

With the rapid development of science and the information industry, especially multimedia technology, cloud computing and computer networks, computer storage equipment is expected not only to offer larger capacity, higher data transmission rates and more reliable storage quality, but also to meet higher requirements for economical and safe storage and for extensibility in time and space. The inherent defects of current computer storage systems have been revealed, their further development lacks momentum, and they have become one of the bottlenecks in the advancement of computers. Neither HDD nor optical storage technology can cope with the future demand for computer storage. It is estimated that semiconductor, disk and CD-ROM storage densities will reach their physical limits in the future, so there is an urgent need to develop a new generation of alternative storage technology.

On the other hand, the biomolecular computer, for which Adleman completed the first experimental verification, has developed rapidly. Over nearly two decades, a variety of theoretical models and experimental methods have emerged, such as the Adleman model, the Splicing System model, the Insertion-Deletion System model and the DNA-EC model. DNA storage is an important branch of the field of biomolecular computing because it offers high storage density, low hardware cost, parallelizable access, good scalability and integration, and long-term storage. In the foreseeable future, DNA storage systems are likely to replace traditional storage systems.

The DNA molecule is a powerful and effective natural information storage medium, and it has been widely used since 1985, when a DNA molecule was synthesized for the first time. There are obvious similarities between DNA storage systems and traditional storage systems: both are sequential storage devices, both use special symbols to indicate the beginning and end of a single information section, and both use error correction coding to ensure the integrity of their information. As a result, DNA molecules can be used as a medium in which information is stored.

DNA storage technology is based on the DNA molecule as the storage medium. The four nitrogenous bases (adenine, guanine, cytosine and thymine) contained in the DNA molecule can be used to encode information. With existing biochemical experimental methods, it is easy to clone DNA molecules and to modify the nitrogenous bases encoded in them; these operations are similar to the read and write operations of a traditional storage system. Because of its advantages, such as stable and reliable operation, no wear, huge information capacity, long life, high quality, low cost per bit and parallelizable access, the DNA storage system is seen as high density and large capacity storage. Although the DNA molecule has been proposed as a data storage medium, how to encode the information to be stored in it has not yet been determined. Character encoding is one of the most important foundations of a computer system. This paper presents exploratory research that uses permutations and combinations of the four nitrogenous bases of the DNA molecule to encode character information. The research addresses two major problems: the selection of the storage medium and the coding rules.

Storage Medium

As an information storage medium, the DNA molecule can take many forms. It can be single-stranded or double-stranded, and it can be a long chain or a circular strand; some chains with special biological meaning are called plasmids [6]. These forms have different advantages and disadvantages as information storage media, so these factors must be considered when choosing the storage medium, in order to bring the storage advantages of the DNA molecule and its simplicity of operation into play.

The DNA storage system uses circular single-stranded DNA molecules as the storage medium. Single-stranded and double-stranded DNA each have advantages and disadvantages. Double-stranded DNA is more stable than single-stranded DNA, which is one of the most important reasons why most living organisms use double-stranded DNA as their genetic material, but data stored in double-stranded DNA are difficult to read: the two attached chains must be unzipped into single strands before reading and cloning. Single-stranded DNA can be read using the Watson-Crick complement principle, but it is less stable: it fractures more easily than double-stranded DNA and easily forms self-complementary hairpin structures.

We choose single-stranded DNA because it is easier to read and clone than double-stranded DNA. In addition, the formation of hairpin structures can be avoided by special design of the single strand. Comparing long-chain DNA with circular-strand DNA, a single cut by an endonuclease splits a long chain into two independent segments, whereas a circular strand remains in one piece and, under certain conditions, can even be rejoined into a circular strand. Furthermore, a

long chain is easily degraded from its ends by certain exonucleases, and a circular strand is less susceptible to this degradation than a long chain.

Coding Rules

The DNA molecule is composed of four nitrogenous bases, so permutations and combinations of these four bases can be used to encode the information to be stored in the DNA storage system. The coding rules are as follows.

A Unique Code

To be compatible with a multi-language environment spanning different countries and languages, each character must be defined by a unique code. The coding handles characters in an abstract way, as combinations of adenine, guanine, cytosine and thymine (A, G, C and T for short), and leaves visual presentation, such as font, size, shape, form and style, to application software such as a web browser or word processor.

Use Permutation and Combination of Nitrogenous Bases

The coding rule is built from permutations and combinations of the four nitrogenous bases. To cover the characters of as many countries and languages as possible, the Unicode code points from 0 to 0x10FFFF are used, a total of 1114112 code points. Since 4^10 = 1048576 < 1114112 <= 4^11 = 4194304, 11 nitrogenous bases are needed to give each code point a unique code. To economize on storage space, redundant bases are dropped from the high-order end, so that a code keeps only its significant low-order bases. Adenine (A) stands for '00', guanine (G) for '01', cytosine (C) for '10' and thymine (T) for '11'. Table 1 is the mapping table of nitrogenous bases.

TABLE 1. MAPPING TABLE OF NITROGENOUS BASES

No.  Unicode    Binary                      Sequence
1    0          0                           A
2    0x1        1                           G
3    0x2        10                          C
4    0x3        11                          T
5    0xA        1010                        CC
6    0xAF       1010 1111                   CCTT
7    0x10FFFF   1 0000 1111 1111 1111 1111  GAATTTTTTTT
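The coding rule above can be illustrated with a minimal Python sketch. It assumes only what the text states: the mapping A = '00', G = '01', C = '10', T = '11', a full width of 11 bases, and a simplified form that drops leading A bases. The names `BASES`, `FULL_WIDTH`, `encode_code_point` and `decode_sequence` are hypothetical and are not part of the original work.

```python
BASES = "AGCT"    # index is the 2-bit value: A=00, G=01, C=10, T=11
FULL_WIDTH = 11   # 4**11 = 4194304 >= 1114112 code points


def encode_code_point(code_point: int, simplified: bool = True) -> str:
    """Write a Unicode code point as a sequence of nitrogenous bases."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    digits = []
    for _ in range(FULL_WIDTH):
        digits.append(BASES[code_point & 0b11])  # lowest two bits first
        code_point >>= 2
    full = "".join(reversed(digits))             # high-order base first
    if not simplified:
        return full
    # Drop leading 'A' (00) bases, but keep one base for code point 0.
    return full.lstrip("A") or "A"


def decode_sequence(sequence: str) -> int:
    """Inverse operation: read a base sequence back into a code point."""
    code_point = 0
    for base in sequence:
        code_point = (code_point << 2) | BASES.index(base)
    return code_point


print(encode_code_point(0xA))    # CC
print(encode_code_point(0xAF))   # CCTT
print(decode_sequence("CCTT"))   # 175, i.e. 0xAF
```

Because leading A bases carry the value zero, the simplified and the full 11-base forms decode to the same code point.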



Latin Letters

Computer systems support the basic Latin letters. ISO 8859-1 is a single-byte character set with 256 code positions that covers commonly used characters such as digits, uppercase Latin letters and lowercase Latin letters. The first 256 positions of the character encoding are therefore reserved for the characters included in ISO 8859-1, in order to improve encoding efficiency and compatibility.
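A short Python sketch can make the efficiency argument concrete. It assumes the 2-bit mapping of the coding rules (A = '00', G = '01', C = '10', T = '11') and that leading A bases are dropped. The helper name `to_bases` is hypothetical. Python's `latin-1` codec is used here as a stand-in for ISO 8859-1, since it maps each byte value to the code point with the same number.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11


def to_bases(code_point: int) -> str:
    """Base-4 digits of a code point written as A, G, C, T (no leading A)."""
    sequence = ""
    while True:
        sequence = BASES[code_point & 0b11] + sequence
        code_point >>= 2
        if code_point == 0:
            return sequence


# Every one of the first 256 positions fits in at most 4 bases (4**4 = 256).
for byte_value in range(256):
    char = bytes([byte_value]).decode("latin-1")
    assert ord(char) == byte_value
    assert len(to_bases(ord(char))) <= 4

print(to_bases(ord("A")))  # GAAG
print(to_bases(ord("z")))  # GTCC
print(to_bases(ord("é")))  # TCCG
```

Under these assumptions a character from the reserved zone needs no more than 4 of the 11 bases that an arbitrary code point may require.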


Multi-Language Environment

To improve efficiency and compatibility across multiple languages, the character encoding provides an independent zone for each language. The Unicode planes are a good reference for the character encoding.

Algorithm

The algorithm describes how to perform character encoding with nitrogenous bases. First, import the text file to be transformed into memory. Following the order of the characters in the text, get the Unicode code point of each character one by one. Following the coding rules, transcode the code point to nitrogenous bases. Output the final result to be stored as a DNA sequence. For example, the Unicode code point of the character "A" is 0x41 (01000001); the corresponding nitrogenous bases are AAAAAAAGAAG, and the simplified nitrogenous bases are GAAG.

Initialization
Import the text file
for each character do
    Get Unicode of the character, Cunicode
    Transcode Cunicode to CDNA
    Output CDNA to store DNA sequence
end for

Verification of Algorithm

Import a text file that includes Latin letters, Chinese characters, Japanese characters, numbers and symbols. The application software (Figure 1 is an example) first gets the Unicode code point of each character in binary. Then, following the coding rules, the application software transcodes the code point to nitrogenous bases. By inverting this operation, the application software can also recover the raw text from the DNA sequence.

Conclusions

This paper puts forward a character encoding for DNA storage systems. The character encoding converts characters to sequences of nitrogenous bases, and so implements the encoding and decoding of character information. The encoding is more compatible with a multi-language environment, and every character code is unique.
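The procedure of the Algorithm and Verification of Algorithm sections can be sketched in Python as a round trip from text to bases and back. This is an illustration under stated assumptions, not the authors' application software. The paper does not specify how simplified codes are delimited inside a stored sequence, so the sketch stores the full 11-base form of every character; this choice, the function names `char_to_dna`, `encode_text` and `decode_text`, and the sample string are all assumptions. Reading the text from a file is left out.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11
WIDTH = 11      # bases per character in the full (unsimplified) form


def char_to_dna(char: str) -> str:
    """Transcode one character (Cunicode) to its 11-base code (CDNA)."""
    code_point = ord(char)
    return "".join(
        BASES[(code_point >> shift) & 0b11]
        for shift in range(2 * (WIDTH - 1), -1, -2)  # high-order pair first
    )


def encode_text(text: str) -> str:
    """Encode every character in order and concatenate the results."""
    return "".join(char_to_dna(char) for char in text)


def decode_text(dna: str) -> str:
    """Inverse operation: split into 11-base codes and rebuild the text."""
    if len(dna) % WIDTH != 0:
        raise ValueError("sequence length is not a multiple of 11 bases")
    chars = []
    for start in range(0, len(dna), WIDTH):
        code_point = 0
        for base in dna[start:start + WIDTH]:
            code_point = (code_point << 2) | BASES.index(base)
        chars.append(chr(code_point))
    return "".join(chars)


sample = "A中あ1!"  # Latin letter, Chinese, Japanese, digit, symbol
dna = encode_text(sample)
assert decode_text(dna) == sample
print(char_to_dna("A"))  # AAAAAAAGAAG
```

A fixed width keeps decoding unambiguous at the cost of storage space; using the simplified codes in a stream would need an additional delimiting rule that the paper leaves open.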

ACKNOWLEDGMENT

Dr. Qian Xu and Dr. Yang Guo are gratefully acknowledged for supporting this study. The Laboratory of Complex Network and Visualization made the publication of this article possible.

REFERENCES

Adleman LM., "Molecular Computation of Solutions to Combinatorial Problems," in Science, vol. 266(11), 1994, pp. 1021-1023.

Dietrich A. and Been W., "Memory and DNA," in J theor Biol, vol. 208, 2001, pp. 145-149.

Garzon MH., Neel A., Chen H., "Efficiency and Reliability of DNA Based Memories," in GECCO, 2003, pp. 379-389.

ROBERT F W., Molecular Biology, 2nd ed., Beijing: Science Press, 2003, pp. 642-682.

Wei Dan, "Review of magnetic information storage technology," in Physics, vol. 33(9), 2004, pp. 646-651.

Zhengxu Zhao, Yang Guo, "Scale-free Model in Software Engineering: a New Design Method," in 2013 International Conference on Geo-Informatics in Resource Management & Sustainable Ecosystem, Wuhan, China, 2013. ISSN: 1865-0929. Print ISBN: 978-3-642-41907-2. Online ISBN: 978-3-642-41908-9.

ZINGEL T., "Formal models of DNA computing: a survey," in Proc Estonian Acad Sci Phys Math, vol. 49(2), 2000, pp. 90-99.

FIGURE 1. EXAMPLE OF THE CHARACTER ENCODING

