While the role of DNA as a biological memory is well established, exploring its potential as a data memory is relatively new. Although DNA data memory has not yet reached the stage where a blob of DNA can have wires attached to it to write and read its data content, good progress has been made.
In their latest work, published in Science, Yaniv Erlich and Dina Zielinski of Columbia University and the New York Genome Center mixed some clever biochemistry with leading-edge communications data-encoding techniques and added a dash of processing power. The result, under the heading of "DNA Fountain," is a demonstration of the ability to use DNA to store a complete 1.4-Mbyte operating system, a movie, and other files, for a total of more than 2 Mbytes.
This is now possible because, at the same time, they have brought a new level of efficiency and reliability to the technique. If DNA data memory must have an acronym to fit into the SRAM, DRAM, NVRAM memory spectrum, then biologic archival read-rarely memory (BARRM) might be one choice.
As illustrated in Figure 1, within the DNA helix each cross-linking nucleotide (nt) contains one of the four nucleobases (bases). The ability to selectively place them in order along a DNA helix backbone offers the possibility of a binary data memory of two bits per base or nucleotide (i.e. 00, 01, 10, and 11). The bases linking the DNA spiral backbones are paired by either two or three hydrogen bonds.
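As a toy illustration of that two-bits-per-base packing, a few lines of Python (not from [Ref 1]; the 00→A, 01→C, 10→G, 11→T mapping matches the one used later in Figure 2):

```python
# Minimal sketch: each pair of bits selects one of the four bases.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits: str) -> str:
    """Map a binary string (even length) onto a DNA base sequence."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(bits_to_dna("0001101101001110"))  # -> ACGTCATG
```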
It is suggested in [Ref 1] that DNA would offer an eye-catching memory data density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
At its core, the DNA data memory methodology relies on a technique used in data communication where, instead of repeating the transmission when an erroneous piece of a data stream is received, enough extra data is transmitted to allow the correct data to be extracted statistically. The technique is based on what are called "Fountain" codes.
Fountain codes allow data (such as a file) to be divided into a practically unlimited number of encoded pieces, in a form that allows the original file to be reassembled from any subset of those pieces, provided the subset is a little larger than the original file.
In data communications, and now for memory, a "fountain" of suitably encoded data is fired at a receiver, which reassembles the file by catching enough "droplets" (the pieces of encoded data). It is immaterial which droplets are received or missed: a bucket full of droplets gives you enough information to extract the original data. The water analogy of "fountains," "droplets," and "buckets" is now part of the language of these techniques.
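The recoverability rests on the exclusive-OR (XOR) operation: a droplet formed by XORing several segments gives up its last unknown ingredient as soon as all the others are known. A toy sketch with made-up one-byte values, not from [Ref 1]:

```python
# Two hypothetical one-byte data segments.
s1, s2 = 0b10110001, 0b01101100

droplet = s1 ^ s2        # a degree-2 droplet: the XOR of both segments

# If s1 is learned from another droplet, s2 can be "peeled" out:
assert droplet ^ s1 == s2
```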
Fountain codes are only part of the method of changing a binary data stream into a form suitable for translation into strands of DNA. This latest work adds a new twist that accommodates the special stability needs of a potential DNA data memory: it favours the most desirable sequences and removes undesirable features, such as too high a (GC) content and long runs of the same base, the latter called homopolymer runs (TTTT...).
The target of the memory "Write" process is to turn the original data stream into a series of short DNA strands, called oligonucleotides or "oligos" for short. These can be sent to a company specializing in the manufacture of DNA to order, which returns a small ampoule of the data-encoded DNA.
The memory “Write” process
Figure 2 illustrates, scaled down in segment size, the data memory write process sequence. It starts by breaking a block of the source data (in this case a tarball, like a Zip file) into small segments.
The next step is the bitwise addition (exclusive-OR) of randomly selected segments of the data. The small inset in Figure 2 shows the bimodal histogram from which the number of segments combined into each droplet is drawn. This step goes under the name of a Luby transform, named for the inventor of the first practical fountain code. The number of droplets will be slightly larger than the number of segments.
The next step is to attach a short value described as a "seed" to the data to create what are described as "droplets." The seed is incremented for each droplet and acts to identify it; because the seed also determines the pseudo-random segment selection, it tells the read side exactly which segments went into the droplet.
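Putting the last three steps together, a simplified encoder sketch in Python (an illustration of the Luby-transform idea only: the degree distribution and pseudo-random generator in [Ref 1] differ in detail, and the function and variable names here are invented):

```python
import random
import struct

def make_droplets(segments: list[bytes], n_droplets: int) -> list[bytes]:
    """Form droplets: a 4-byte seed plus the XOR of pseudo-randomly
    chosen segments. The seed alone reproduces the random choices."""
    droplets = []
    for seed in range(n_droplets):            # incrementing seed, as in the text
        rng = random.Random(seed)             # the seed fully determines what follows
        degree = rng.choice([1, 2, 3, 4])     # stand-in for the soliton distribution
        chosen = rng.sample(range(len(segments)), degree)
        payload = bytes(len(segments[0]))     # all-zero starting payload
        for i in chosen:
            payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
        droplets.append(struct.pack(">I", seed) + payload)
    return droplets
```

Because the read side can re-run the same seeded generator, the seed alone reveals which segments were XORed into each droplet; the recipe never has to be stored explicitly.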
The innovation in [Ref 1] is the addition of a new step that is not part of the original fountain code methodology and removes some of the problems and limitations of earlier attempts at using DNA as a data memory. It is a selection, or screening, process that maximises the effectiveness of the encoding and fully realizes the coding potential of each nucleotide. As illustrated in Figure 2, the two-bit pairs of each droplet (00, 01, 10, 11) are converted into nucleobases (A, C, G, T, respectively). This is followed by a screening step that looks for biochemically undesirable features: homopolymer runs of the same base, such as TTTT, or a high GC content.
Biochemical constraints dictate this screening step because high GC content or long homopolymer runs (e.g., TTTTT...) are difficult to synthesize and prone to sequencing (read) errors. The decay of oligos carrying these undesirable features during storage can also lead to their uneven representation in the stored pool.
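A sketch of such a screen (the run-length and GC thresholds below are illustrative assumptions, not the exact values of [Ref 1]); a droplet whose DNA translation fails the test is simply discarded and the fountain asked for another, which is what makes the screening essentially free:

```python
def passes_screen(oligo: str, max_run: int = 3,
                  gc_low: float = 0.45, gc_high: float = 0.55) -> bool:
    """Reject oligos with long homopolymer runs or extreme GC content."""
    run, prev = 0, ""
    for base in oligo:
        run = run + 1 if base == prev else 1
        if run > max_run:
            return False                  # e.g. "TTTT" fails with max_run=3
        prev = base
    gc = sum(base in "GC" for base in oligo) / len(oligo)
    return gc_low <= gc <= gc_high
```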
As illustrated in Figure 3, each oligo in [Ref 1] comprised 38 bytes (152 nt): 32 bytes (128 nt) of data payload, 4 bytes (16 nt) for the random seed (identifying the transform droplet), plus 2 bytes (8 nt) for error checking. Added at each end were Illumina adapters of 6 bytes (24 nt) each.
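In code form, assembling one such oligo before base conversion might look like the sketch below (the two error-check bytes in [Ref 1] are a Reed-Solomon code; a truncated CRC stands in for it here purely for illustration):

```python
import struct
import zlib

def build_oligo_bytes(seed: int, payload: bytes) -> bytes:
    """Pack the 38-byte layout of Figure 3: seed + payload + error check."""
    assert len(payload) == 32                     # 32 bytes = 128 nt of data
    body = struct.pack(">I", seed) + payload      # 4-byte (16 nt) seed first
    check = zlib.crc32(body) & 0xFFFF             # 2-byte (8 nt) stand-in check
    return body + struct.pack(">H", check)        # 38 bytes = 152 nt, before adapters
```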
DNA memory read
To retrieve the information, the oligo pool is first amplified by the polymerase chain reaction (PCR) and the DNA library is then sequenced using one Illumina MiSeq flow cell. PCR is a technique used to create, from a single copy or a few copies of a DNA segment, thousands or even millions of copies of that sequence.
This is a well-established process for DNA analysis. In brief, adapters are added to the ends of the denatured oligos, which then attach themselves, as single strands, to the surface of a lawn seeded with matching adapters. Each attached single strand bends over to bind its free end to a matching adapter on the surface. Using the single strand as a template, a complementary strand grows from the lawn, the process finishing when growth reaches the far end of the strand. The resulting DNA spiral is then denatured (split into single strands) and the process repeated until large clusters of copies of the original single strand have been created, as illustrated in simplified form in Figure 4 for just two of the many strands within one of the multiple strand clusters.
Starting at the primer end, the complementary DNA strand is allowed to grow by immersion in a mixture of all four nucleotide base types, each of which has been tagged with a different-coloured fluorescent dye and a blocker. Once a tagged base has attached, the blocker acts to inhibit any further growth of the strand.
After the first base attaches itself to its complementary base on the chain and blocks any further growth, laser illumination causes each cluster of strands to fluoresce with a colour identifying that particular nucleotide. The dye and the blocker are then removed and the process repeated, slowly building a picture of the nucleotide sequence.
The inherent redundancy of the memory means not all oligos are required for the decoding, so those of doubtful value are discarded, reducing the exposure to erroneous oligos. At each step the colour images are scanned and recorded to produce the sequence chart and, from it, the encoded data; see the inset in Figure 4. This data is then subjected to the reverse transform to obtain the original data. Reading the memory is a destructive process; however, the ease with which the PCR step can make an almost unlimited number of perfect copies removes this as a potential problem.
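To make the reverse transform concrete, here is a peeling decoder matching the earlier make_droplets() sketch (again an illustration of the Luby-transform idea rather than the exact algorithm of [Ref 1]):

```python
import random
import struct

def decode(droplets: list[bytes], n_segments: int) -> list[bytes]:
    """Re-run each seed to learn a droplet's ingredients, then repeatedly
    peel droplets that are one unknown segment away from being solved."""
    pending = []
    for d in droplets:
        seed, payload = struct.unpack(">I", d[:4])[0], bytearray(d[4:])
        rng = random.Random(seed)                  # same choices as the encoder
        degree = rng.choice([1, 2, 3, 4])
        pending.append((set(rng.sample(range(n_segments), degree)), payload))

    solved: dict[int, bytes] = {}
    progress = True
    while progress and len(solved) < n_segments:
        progress = False
        for chosen, payload in pending:
            unknown = chosen - solved.keys()
            if len(unknown) == 1:                  # one ingredient missing: peel it off
                for i in chosen & solved.keys():
                    payload[:] = bytes(a ^ b for a, b in zip(payload, solved[i]))
                solved[unknown.pop()] = bytes(payload)
                progress = True
    return [solved[i] for i in range(n_segments)]  # fails if too few droplets were caught
```

Any sufficiently full bucket of droplets, whichever ones they happen to be, lets the loop run to completion, exactly as the fountain analogy promises.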
At the moment, the write process must be measured in days because of the need to send the oligos to a specialist fabrication facility. The rest of the processing, for both read and write, is measured in minutes and hours. The reported cost was $3,500 per Mbyte. All these values must be considered early-days figures, with room for improvement.
The real success story is how close this latest work came to the theoretical packing density of 2 bits per nucleotide. Biochemical constraints limit the coding potential of DNA memory to 1.98 bits/nt, and Shannon capacity considerations reduce it further; against that, [Ref 1] achieved 1.57 bits/nt, just 14 percent below the Shannon capacity.
—The career of Ron Neale, as a researcher, process developer, and designer of solid-state memory devices, stretches back over 50 years. More recently he has been involved as a consultant, writer, and keen and critical observer of the latest memory developments. His EE Times Progress Reports on the state of play in memory developments have a large following. He has a number of firsts in memory device development and manufacture in the areas of phase change memory (PCM) and programmable read-only memory (PROM), including anti-fuses and programmable vias. He holds 20 patents in the area of memory and programmable interconnect and is a member of The Institute of Physics and a Chartered Physicist. He is also qualified as both a mechanical and electronic engineer. As well as memory device development and research, Ron has also held senior positions in companies involved in computer development and the manufacture of semiconductor fabrication equipment, as well as serving a stint as editor of Electronic Engineering magazine.
The complex write/read processes associated with DNA data memory might be eliminated as the end point of some of the work already underway in interfacing biological material with silicon, which might eventually lead to a silicon chip loaded with a blob of DNA acting as the memory.
It is interesting to note that while solid-state silicon fabrication processes struggle to obtain 10-nm structures, biochemists are able to grow such structures -- might that point the way to the future of lithography?
There are two other possibilities: one good, the other bad. The good is that it might one day be possible to carry, in the perfect environment of your body, a few grams of DNA loaded with a knowledge pool of all the data ever generated. The bad news is that it might be possible to write a DNA data string that unintentionally gets mixed into the human system and acts like an uncontrollable contagious viral disease, endlessly reproducing itself -- a data black death for humans.
[Ref 1] Yaniv Erlich and Dina Zielinski, "DNA Fountain enables a robust and efficient storage architecture," Science, Vol. 355, Issue 6328, pp. 950-954, 3 March 2017. DOI: 10.1126/science.aaj2038