Scientists have developed what they claim is the most efficient data storage technique ever, with a new DNA-encoding method that approaches the theoretical maximum for information stored per nucleotide.
Using an algorithm called DNA Fountain, the researchers squeezed six files into a single speck of DNA – including a short film, an entire computer OS, and an Amazon gift card – but that's just for starters. The team says the same technique could effectively compress all the world's data into a single room.
Not only is DNA data storage an amazing space saver, the technique could also enable us to preserve knowledge with extreme robustness and longevity – unlike traditional storage media, which are known to succumb to all kinds of faults over time.
"DNA won't degrade over time like cassette tapes and CDs, and it won't become obsolete – if it does, we have bigger problems," says computer scientist Yaniv Erlich from Columbia University.
DNA storage itself isn't new: the technique was pioneered in 2012 by researchers at Harvard University, who figured out how to encode a 53,400-word book into the genetic code of synthetic DNA molecules, and then read the data back using DNA sequencing.
Since then various other teams have been trying to optimise the technique, with Microsoft claiming last year that a method it had come up with was 20 times more efficient than the previous record.
In turn, Erlich and fellow researcher Dina Zielinski from the New York Genome Centre now say their own coding strategy is 100 times more efficient than the 2012 standard, and capable of recording 215 petabytes of data on a single gram of DNA.
For context, just 1 petabyte is equivalent to 13.3 years' worth of high-definition video, so if you feel like glancing disdainfully at the external hard drive on your computer desk right now, we won't judge.
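As a rough back-of-the-envelope check, the two figures quoted above can be multiplied together: at the claimed density, a single gram of DNA would hold roughly 2,860 years of high-definition video. The variable names below are illustrative only.

```python
# Figures taken from the article; the names are illustrative.
PETABYTES_PER_GRAM = 215             # storage density claimed by the researchers
HD_VIDEO_YEARS_PER_PETABYTE = 13.3   # equivalence quoted for 1 petabyte

# Years of HD video that would fit on one gram of DNA at that density.
video_years_per_gram = PETABYTES_PER_GRAM * HD_VIDEO_YEARS_PER_PETABYTE
print(f"{video_years_per_gram:,.1f} years of HD video per gram")
```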
At the heart of the researchers' system is an algorithm originally designed to detect and fix errors in streaming video applications.
According to the researchers, the same kind of mechanism can be used to avoid errors when reading back binary data (made up of 1s and 0s) that's been translated into the four nucleotide bases in DNA: A, G, C, and T.
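To make the translation concrete, here is a minimal sketch of turning binary data into nucleotides and back. The two-bits-per-base table is an assumption chosen for illustration (the article doesn't spell out the exact mapping), and the function names are hypothetical.

```python
# Assumed mapping for illustration: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    """Translate bytes into a string of nucleotides, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(strand: str) -> bytes:
    """Reverse the translation: nucleotides back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = bytes_to_dna(b"Hi")
print(strand)                # CAGACGGC
print(dna_to_bytes(strand))  # b'Hi'
```

With four possible bases, each nucleotide can carry at most two bits, which is the ceiling any real encoding scheme is measured against.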
"[N]ot all DNA molecules are created equally," Erlich told Dexter Johnson at IEEE Spectrum.
"If you have DNA molecules that have a long stretch of the same nucleotide, such as AAA, it is not very favourable for the informatics machinery. It's very hard to read this molecule without an error. So you want to avoid stretches like that."
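The kind of check Erlich describes can be sketched as a simple screen that rejects any candidate strand containing too long a run of one nucleotide. The cut-off `max_run` is a hypothetical parameter, as the article doesn't say where the researchers drew the line, and the function names are illustrative.

```python
from itertools import groupby

def longest_run(strand: str) -> int:
    """Length of the longest stretch of a single repeated nucleotide."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)

def passes_screen(strand: str, max_run: int) -> bool:
    """Accept a strand only if no nucleotide repeats more than max_run times in a row."""
    return longest_run(strand) <= max_run

print(passes_screen("ACGTACGT", max_run=3))    # True
print(passes_screen("ACGTAAAAGT", max_run=3))  # False: contains AAAA
```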
The researchers' algorithm manages to avoid errors when reading back the DNA data by additionally encoding a series of hints about what the information should look like once decoded.
This means that any DNA fragments lost in the process can be recreated, and the encoding is also highly optimised.
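The idea of recovering lost fragments from redundant hints can be illustrated with a toy fountain code. This is a simplified sketch, not the researchers' implementation: the data is split into segments, each "droplet" is the XOR of a few randomly chosen segments, and the seed stored with a droplet tells the decoder which segments were combined. The function names and the crude degree distribution are assumptions made for illustration.

```python
import random

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pick_indices(seed: int, k: int) -> list:
    """Derive from the seed which of the k segments a droplet combines."""
    rng = random.Random(seed)
    degree = rng.choice([1, 1, 2, 2, 3])  # toy distribution, an assumption
    return rng.sample(range(k), degree)

def make_droplet(seed: int, segments: list) -> tuple:
    """A droplet is (seed, XOR of the segments selected by that seed)."""
    payload = bytes(len(segments[0]))
    for i in pick_indices(seed, len(segments)):
        payload = xor_bytes(payload, segments[i])
    return seed, payload

def decode(droplets, k: int):
    """Peeling decoder: resolve segments one at a time as droplets arrive."""
    known, pending = {}, []
    for seed, payload in droplets:
        pending.append((set(pick_indices(seed, k)), payload))
        progress = True
        while progress:
            progress = False
            remaining = []
            for idxs, data in pending:
                # Strip out every segment that has already been recovered.
                for i in [j for j in idxs if j in known]:
                    data = xor_bytes(data, known[i])
                    idxs = idxs - {i}
                if len(idxs) == 1:
                    known[next(iter(idxs))] = data
                    progress = True
                elif len(idxs) > 1:
                    remaining.append((idxs, data))
            pending = remaining
        if len(known) == k:
            return [known[i] for i in range(k)]
    return None  # ran out of droplets before everything was recovered

segments = [bytes([i] * 4) for i in range(8)]
# Simulate loss: every third droplet never reaches the decoder.
stream = (make_droplet(s, segments) for s in range(10_000) if s % 3 != 0)
print(decode(stream, len(segments)) == segments)
```

Because any sufficiently large collection of droplets will do, it doesn't matter which particular ones go missing, only that enough of them arrive.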
"We showed that we can reliably store information on DNA, and that our organising of information approaches 'optimal packing'," Erlich told Katherine Lindemann at ResearchGate, "meaning it is nearly impossible to fit more information on the same amount of DNA material."
To test the system, the team compressed six files: a computer OS; an 1895 French short film, Arrival of a train at La Ciotat; a US$50 Amazon gift card; a computer virus; a Pioneer plaque; and an academic paper by information theorist Claude Shannon.
The overall file size of the complete package was relatively tiny – coming in at just 2MB – but the important thing was testing to see if the DNA Fountain algorithm was able to encode the binary information into genetic data without losing any of the information.
After the digital data – represented in a list of 72,000 DNA strands – was converted into a speck of DNA molecules carried in a vial, the researchers were able to sequence the DNA and recover the files with zero errors.
While it's an impressive result, the team says it will be some time before the expense of storing and reading data in DNA makes sense for the rest of us. For their 2MB package, the researchers spent $7,000 to synthesise the DNA, and another $2,000 to sequence it.
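Put as a rough per-megabyte figure, using only the numbers quoted above, that works out to about $4,500 to write and read back each megabyte. The variable names are illustrative.

```python
# Figures taken from the article; the names are illustrative.
PACKAGE_MB = 2
SYNTHESIS_USD = 7_000   # cost to synthesise (write) the DNA
SEQUENCING_USD = 2_000  # cost to sequence (read) it back

total_usd = SYNTHESIS_USD + SEQUENCING_USD
cost_per_mb = total_usd / PACKAGE_MB
print(f"${cost_per_mb:,.0f} per megabyte")  # $4,500 per megabyte
```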
Erlich thinks it could be more than a decade before DNA storage becomes accessible to the general public.
And even then, the technology might be reserved for things like recording patient data in medical systems, as opposed to being sold to consumers as the latest tech product.
"This is still the early stages of DNA storage. It's basic science," Erlich told Eva Botkin-Kowacki at The Christian Science Monitor.
"It's not that tomorrow you're going to go to Best Buy and get your DNA hard drive."
The findings are reported in Science.