While the role of DNA as a biological memory is well established, exploring its potential as a data memory is relatively new. Although DNA data memory has not yet reached the stage where a blob of DNA can have wires attached to it to write and read its data content, good progress has been made.
The memory “Write” process
Figure 2 illustrates, scaled down in segment size, the sequence of the data memory write process. It starts by breaking a block of the source data (in this case a tarball, similar to a Zip file) into small segments.
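This first step can be sketched as follows; the 32-byte segment size and the zero-padding of the final segment are illustrative assumptions, not details taken from [Ref 1]:

```python
def split_into_segments(blob: bytes, seg_size: int = 32) -> list[bytes]:
    """Break a block of source data into fixed-size segments,
    zero-padding the final segment if it comes up short."""
    segments = []
    for i in range(0, len(blob), seg_size):
        segments.append(blob[i:i + seg_size].ljust(seg_size, b"\x00"))
    return segments

# e.g. a 280-byte blob becomes 9 segments of 32 bytes each
segs = split_into_segments(b"x" * 280)
```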
The next step is the bitwise addition (XOR) of randomly selected segments of the data. The small inset in Figure 2 shows the bimodal histogram for the segment selection; the number of segments subjected to the bitwise addition ranges from 1 to 13, with a peak at 3. This step is known as a Luby transform, named for the inventor of the first practical fountain code. The number of droplets will be slightly larger than the number of segments.
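The XOR combination at the heart of the Luby transform can be sketched as below. The degree distribution here (harmonic weights) is a toy stand-in, not the distribution used in [Ref 1]:

```python
import random

def luby_droplet_payload(segments: list[bytes], rng: random.Random) -> bytes:
    """Form one droplet payload: draw a degree d from an assumed
    distribution, pick d distinct segments, and XOR them bitwise."""
    degrees = list(range(1, len(segments) + 1))
    weights = [1.0 / d for d in degrees]  # placeholder distribution
    d = rng.choices(degrees, weights=weights)[0]
    payload = bytearray(len(segments[0]))
    for idx in rng.sample(range(len(segments)), d):
        for i, b in enumerate(segments[idx]):
            payload[i] ^= b  # bitwise addition of the chosen segment
    return bytes(payload)

segs = [bytes([n]) * 32 for n in range(10)]
payload = luby_droplet_payload(segs, random.Random(0))
```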
The next step is to attach a randomly selected number, described as a “seed,” to the data to create what are described as “droplets.” The seed is incremented for each droplet and acts to identify the droplet.
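The point of the seed is that it lets the decoder replay the same pseudo-random segment selection the encoder used. A minimal sketch, with a 4-byte seed field and a toy degree rule (both illustrative assumptions, not values from [Ref 1]):

```python
import random

SEED_BYTES = 4  # illustrative field width

def segments_for_seed(seed: int, num_segments: int) -> list[int]:
    """Encoder and decoder can both re-derive which segments were
    combined, from the seed alone (toy degree rule)."""
    rng = random.Random(seed)
    d = rng.randint(1, 3)
    return rng.sample(range(num_segments), d)

def frame_droplet(seed: int, payload: bytes) -> bytes:
    """Prepend the seed so each droplet carries its own identity."""
    return seed.to_bytes(SEED_BYTES, "big") + payload

# the same seed always yields the same segment choice:
assert segments_for_seed(7, 100) == segments_for_seed(7, 100)
```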
The innovation in [Ref 1] is the addition of a new step that is not part of the original fountain code methodology and removes some of the problems and limitations of earlier attempts at using DNA as a data memory. It is a selection, or screening, process that maximizes the effectiveness of the process and fully realizes the coding potential of each nucleotide. As illustrated in Figure 2, the droplet bit pairs (00, 01, 10, 11) are converted into nucleobases (A, C, G, T, respectively). This is followed by a screening step that looks for biochemically undesirable features, such as homopolymer runs of the same base (e.g., TTTT) or a high GC content.
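The bit-pair mapping and the screening test can be sketched as below; the run-length and GC thresholds are illustrative, not the exact values used in [Ref 1]:

```python
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits: str) -> str:
    """Map each 2-bit pair of a droplet to one nucleobase."""
    return "".join(BASE_FOR_BITS[bits[i:i + 2]]
                   for i in range(0, len(bits), 2))

def passes_screen(seq: str, max_run: int = 3, gc_limit: float = 0.6) -> bool:
    """Reject sequences with long homopolymer runs or high GC
    content (thresholds are assumptions for illustration)."""
    run, prev = 0, ""
    for base in seq:
        run = run + 1 if base == prev else 1
        if run > max_run:
            return False  # homopolymer run too long, e.g. TTTT
        prev = base
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc <= gc_limit

print(bits_to_dna("00011011"))      # prints "ACGT"
print(passes_screen("ATTTTG"))      # prints False (run of four Ts)
```

Droplets that fail the screen are simply discarded; the fountain code can generate replacement droplets at will, which is what makes this screening affordable.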
Biochemical constraints dictate this screening step: sequences with high GC content or long homopolymer runs (e.g., TTTTT…) are difficult to synthesize and prone to sequencing (read) errors. Oligos carrying these undesirable features also decay preferentially during storage, which can lead to uneven representation of the oligos.
As illustrated in Figure 3, each oligo in the work of [Ref 1] comprises 38 bytes, of which 32 bytes (128 nt) are the data payload, 4 bytes (16 nt) the random seed (for the transform droplet), and 2 bytes (8 nt) error checking. Illumina adapters, each of 6 bytes (24 nt), were added at each end.
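These figures are consistent with the mapping of two bits per nucleotide, i.e., four nucleotides per byte:

```python
NT_PER_BYTE = 4  # 8 bits per byte, 2 bits per nucleotide

payload_nt = 32 * NT_PER_BYTE  # 128 nt data payload
seed_nt    = 4 * NT_PER_BYTE   # 16 nt random seed
check_nt   = 2 * NT_PER_BYTE   # 8 nt error checking
adapter_nt = 6 * NT_PER_BYTE   # 24 nt per Illumina adapter

core_nt  = payload_nt + seed_nt + check_nt  # the 38-byte core
total_nt = core_nt + 2 * adapter_nt         # with an adapter at each end
print(core_nt, total_nt)  # prints "152 200"
```

So the 38-byte core spans 152 nt, and the full oligo with both adapters spans 200 nt.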