Translation Software Enables Efficient Storage of Massive Amounts of Data in DNA Molecules

A Los Alamos National Laboratory team led by Dr. David S. Smith has developed an important technology to translate digital binary files into a four-letter genetic code for molecular data storage.

“Our software, the Adaptive DNA Storage Codec (ADS Codex), translates files from what a PC understands to what biology understands,” said Latchesar Ionkov, a Los Alamos computer scientist and principal investigator of the project. “It’s like translating English to Chinese, but harder.”

This work is vital to the Molecular Information Storage (MIST) program of the Intelligence Advanced Research Projects Activity (IARPA), which aims to provide cheaper, larger, and longer-lasting storage for big-data operations in government and the private sector. MIST’s short-term objective is to write one terabyte (a trillion bytes) and read ten terabytes within 24 hours for $1,000. Other teams are refining the writing and retrieval components (DNA synthesis and sequencing), while Los Alamos focuses on coding and decoding.

“DNA is a promising alternative to tape, the dominant method of cold storage, which dates to 1951,” said Bradley Settlemyer, a Los Alamos storage systems researcher and systems programmer who specializes in high-performance computing. “DNA storage is likely to change how we think about archival storage because its data retention is so long and its data density so high. Instead of storing YouTube on acres and acres of servers, you could keep it all in your fridge. But researchers must first overcome several technological obstacles related to integrating different technologies.”

Not lost in translation

Compared with conventional long-term storage, which uses pizza-sized reels of magnetic tape, DNA storage is potentially less expensive, far more physically compact, more energy efficient, and more durable. DNA can survive for hundreds of years and does not require maintenance. Files stored in DNA can also be copied easily and at negligible cost.

The storage density of DNA is astounding. Humanity will produce an estimated 33 zettabytes of data (33 followed by 21 zeroes) by 2025. All of that information could fit in a ping-pong ball with room to spare. The Library of Congress holds about 74 terabytes of data, which is 74 million million bytes. Six thousand such libraries could fit into a DNA archive the size of a poppy seed. Facebook’s 300 petabytes (300,000 terabytes) could be stored in half a poppy seed.
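
The unit conversions behind these comparisons are easy to check. The following is a small sketch that assumes decimal (SI) units, where one terabyte is 10^12 bytes; the variable names are illustrative and the figures are the ones quoted in this article.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
ZB = 10**21

library_of_congress = 74 * TB            # figure quoted in the article
six_thousand_libraries = 6000 * library_of_congress
facebook = 300 * PB                      # figure quoted in the article
global_data_2025 = 33 * ZB               # estimate quoted in the article

print(f"Library of Congress: {library_of_congress:.2e} bytes")
print(f"6,000 libraries: {six_thousand_libraries // PB} PB")
print(f"Facebook: {facebook // TB:,} TB")
print(f"Global data by 2025: {global_data_2025:.1e} bytes")
```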

A binary file is encoded into a molecule by DNA synthesis. Synthesis is a reasonably well-understood technology that arranges the building blocks of DNA into different orders, which are represented by sequences of the letters A, C, G, and T. This four-letter code provides the instructions to build every living organism on Earth.

ADS Codex, developed by the Los Alamos team, specifies exactly how to convert binary data (all the 0s and 1s) into sequences of the four letters A, C, G, and T. It also handles decoding the sequences back into binary. DNA can be synthesized by several methods, and ADS Codex can accommodate all of them. The Los Alamos team has completed version 1.0 of ADS Codex and plans to use it in November 2021 to evaluate the storage and retrieval systems developed by the other MIST teams.
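
To make the idea of translating binary into a four-letter alphabet concrete, here is a minimal sketch that maps every two bits to one nucleotide. This is only an illustration: the article does not describe the actual ADS Codex encoding, and the mapping table and function names below are assumptions.

```python
# Illustrative 2-bits-per-nucleotide mapping. This is NOT the ADS Codex
# scheme, which the article does not describe in detail.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a string of A, C, G, T (four letters per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    """Turn a string of A, C, G, T back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = b"Hi"
strand = encode(message)
print(strand)
print(decode(strand))
```

A practical codec would add redundancy on top of a mapping like this, since a raw mapping offers no protection against synthesis or sequencing errors.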

ADS Codex addresses two major obstacles when creating DNA data files.

First, molecular storage has a much higher error rate than traditional digital systems, so the team needed to find new ways to correct errors. Second, errors in DNA storage come from a completely different source than errors in the digital realm, which makes them more challenging to correct.

“On a hard drive, binary errors occur when a 0 flips to a 1 or vice versa, but DNA has more problems, caused by insertion and deletion errors,” Ionkov explained. “You’re writing A, C, G, and T, but sometimes you try to write an A and nothing appears, so the sequence of letters shifts to the left, or it types AAA. The standard error correction codes won’t work with this.”
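
The difference between the two kinds of error can be seen in a small sketch. The strands and the `mismatches` helper below are made up for illustration. A substitution corrupts one position, while a single deletion shifts every later letter, so a position-by-position comparison reports many errors.

```python
def mismatches(written: str, read: str) -> int:
    """Count differences when two strands are compared index by index.

    Letters missing from the shorter strand also count as differences.
    """
    differing = sum(a != b for a, b in zip(written, read))
    return differing + abs(len(written) - len(read))

written = "ACGTACGTAC"
substituted = "ACGTTCGTAC"   # one letter changed (index 4: A became T)
deleted = "ACGTCGTAC"        # one letter dropped (the A at index 4)

print("substitution:", mismatches(written, substituted))
print("deletion:", mismatches(written, deleted))
```

Codes designed for flipped bits assume that the undamaged positions still line up, which is exactly what an insertion or deletion breaks.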

ADS Codex adds extra information, called error detection codes, that can be used to validate the data. When the software converts the data back into binary, it checks whether the codes match. If they do not, the software tries removing or adding nucleotides until they do.
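
The check-and-retry idea can be sketched as follows. This is a toy version under stated assumptions: it uses CRC-32 as the error detection code and searches only strands that are one insertion or deletion away from what was read. The article does not say which codes or search strategy ADS Codex uses, and the function names are hypothetical.

```python
import zlib
from typing import Iterator, Optional

BASES = "ACGT"

def checksum(strand: str) -> int:
    """Error detection code for a strand (CRC-32 is an assumption here)."""
    return zlib.crc32(strand.encode("ascii"))

def single_edits(strand: str) -> Iterator[str]:
    """Yield every strand made by removing or adding one nucleotide."""
    for i in range(len(strand)):
        yield strand[:i] + strand[i + 1:]
    for i in range(len(strand) + 1):
        for base in BASES:
            yield strand[:i] + base + strand[i:]

def repair(read: str, expected: int) -> Optional[str]:
    """Return a strand whose checksum matches, or None if none is found."""
    if checksum(read) == expected:
        return read
    for candidate in single_edits(read):
        if checksum(candidate) == expected:
            return candidate
    return None

written = "ACGTACGTAC"
code = checksum(written)      # computed at write time, kept with the data
damaged = "ACGTCGTAC"         # one A was never written
print(repair(damaged, code))
```

An exhaustive search like this grows quickly with the number of errors, so a production codec would need a more careful strategy for long strands with several errors.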

Smart scaling up

Today’s largest data centers are warehouse-sized facilities that can store exabytes of data. An exabyte is a million trillion bytes. These digitally based data centers are expensive to build, power, and run. As the demand for data storage grows exponentially, they may not be the most cost-effective option.

Los Alamos and other organizations with national security missions need to store data for a long time on cheaper media. “At Los Alamos, we have some of the oldest digital data and the largest data stores, dating back to the 1940s,” Settlemyer said. “It still has tremendous value. We’ve been at the forefront of finding a solution for cold storage because we keep data forever.”

Settlemyer believes DNA storage could be disruptive because it crosses multiple fields of innovation. The MIST project has sparked a new alliance among tape manufacturers, DNA synthesis firms, DNA sequencing firms, and high-performance computing organizations such as Los Alamos, which are pushing computers to ever-larger-scale simulations that produce mind-boggling quantities of data.

Deeper Dive into DNA

Most people associate DNA with life, not computers. But DNA is itself a four-letter code that carries information about an organism. DNA molecules are made up of four nucleotides, or bases, each identified by a letter: adenine (A), thymine (T), guanine (G), and cytosine (C).

The bases wrap around each other in a twisted chain, the familiar double helix, to form the molecule. The arrangement of these letters into sequences creates a code that tells an organism how to form. The complete set of DNA molecules makes up the genome. It’s the blueprint for your body.

Researchers have discovered that by synthesizing DNA molecules they can write long strings of the letters A, C, G, and T and then read those sequences back. This process is similar to how a computer stores data using 0s and 1s. Ionkov says that the method has been proven to work, but reading and writing DNA-encoded data currently takes a long time.

“Adding a single nucleotide is very slow. It takes a minute,” Ionkov said. “Imagine writing a single file to a hard disk taking over a decade. This problem is solved by using massive parallelism. You write tens of millions of molecules at once to speed up the process.”
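
A back-of-the-envelope sketch shows why parallelism matters. It takes the one-minute-per-nucleotide figure from the quote and assumes an idealized two bits per nucleotide with no error correction overhead. The 2 MB file size and the number of parallel strands are hypothetical values chosen for illustration.

```python
MINUTES_PER_NUCLEOTIDE = 1      # figure quoted by Ionkov
BITS_PER_NUCLEOTIDE = 2         # assumption: idealized encoding, no overhead
MINUTES_PER_YEAR = 60 * 24 * 365

def write_time_minutes(file_bytes: int, parallel_strands: int = 1) -> float:
    """Minutes needed if the work is split evenly across parallel strands."""
    nucleotides = file_bytes * 8 / BITS_PER_NUCLEOTIDE
    return nucleotides * MINUTES_PER_NUCLEOTIDE / parallel_strands

file_bytes = 2 * 10**6          # hypothetical 2 MB file
serial = write_time_minutes(file_bytes)
parallel = write_time_minutes(file_bytes, parallel_strands=10**7)

print(f"one strand at a time: {serial / MINUTES_PER_YEAR:.1f} years")
print(f"ten million strands at once: {parallel:.1f} minutes")
```

Under these assumptions, a file of only a couple of megabytes written one nucleotide at a time already takes longer than a decade, while ten million strands written simultaneously bring the same job down to under a minute of synthesis time.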
