Our records of the human genome may still be missing tens of thousands of 'dark' genes. These hard-to-detect sequences of genetic material can code for tiny proteins, some of them involved in cancer and in immune processes, a global consortium of researchers has confirmed.
They may explain why past estimates of how many genes our genome holds were far larger than what the Human Genome Project discovered 20 years ago.
The new international study, still awaiting peer review, shows our library of human genes very much continues to be a work in progress, as more subtle genetic features are picked up with advances in technology, and as continued exploration uncovers gaps and errors in the record.
These overlooked genes have been hiding away in regions of our DNA thought not to code for proteins. These regions were once dismissed as 'junk DNA' but it turns out small bits of these sequences are still being used as instructions for mini-proteins.
Institute for Systems Biology proteomicist Eric Deutsch and colleagues found a large cache of them by searching genetic data from 95,520 experiments for fragments of protein-coding sequence. These include studies using mass spectrometry to investigate small proteins, as well as catalogues of protein snippets detected by our own immune systems.
A gene normally begins with a long, well-known code that marks its starting point and initiates the reading of DNA instructions for making a protein. These 'dark' genes are instead preceded by shorter versions of those codes, which is why scientists have overlooked them.
Despite these missing parts in their start sequences, the non-canonical open reading frame (ncORF) genes are still used as a template to create RNA, and some of those RNA molecules are then used to make small proteins with only a handful of amino acids. Previous studies have shown cancer cells contain hundreds of such tiny proteins.
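To see why a search tuned to the standard start signal can miss such genes, here is a toy Python sketch of an open reading frame scan. It is not the team's method. The function name `find_orfs`, the example sequence, and the choice of CTG, GTG and TTG as alternative start signals are all assumptions made for illustration; the article does not say which start signals the 'dark' genes use.

```python
# Toy open reading frame (ORF) scan on the forward strand of a DNA string.
# All names and the example sequence are hypothetical, for illustration only.

CANONICAL_STARTS = {"ATG"}                     # the standard start codon
ALTERNATIVE_STARTS = {"CTG", "GTG", "TTG"}     # assumed non-canonical starts
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, starts, min_codons=2, max_codons=100):
    """Return (start_index, codon_count, sequence) for each ORF found.

    codon_count excludes the stop codon; sequence includes it.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                     # three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] in starts:
                # Walk codon by codon until a stop codon or the end of the string.
                j = i + 3
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(dna):          # a stop codon was reached
                    n_codons = (j - i) // 3
                    if min_codons <= n_codons <= max_codons:
                        orfs.append((i, n_codons, dna[i:j + 3]))
                    i = j + 3                  # resume scanning after the stop
                    continue
            i += 3
    return orfs


example = "AACTGGCACCATAAGG"                   # made-up sequence with no ATG

standard_hits = find_orfs(example, CANONICAL_STARTS)
relaxed_hits = find_orfs(example, CANONICAL_STARTS | ALTERNATIVE_STARTS)

print(standard_hits)
print(relaxed_hits)
```

On this made-up sequence, the scan limited to ATG finds nothing, while the relaxed scan picks up a three-codon reading frame that begins with CTG. Real pipelines are far more involved, combining evidence such as mass spectrometry with sequence scans, so this only illustrates the general idea.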
"We believe the identification of these newly-confirmed ncORF proteins is immensely important," the team writes in their paper. "Their proteins… may have direct biomedical relevance, which is manifested in the growing interest in targeting such cryptic peptides with cancer immunotherapy, including cellular therapies and therapeutic vaccines."
Some of the genes that encode these cryptic peptides are transposons that move around our genomes, including sequences inserted into us by viruses.
Others are what the researchers call aberrant. For example, some of the proteins known to exist from mass spectrometry evidence have only ever been located in cancer samples, so their associated genes may not naturally belong in our bodies.
"Thus, it remains possible that certain ncORF peptides reflect aberrant proteins whose existence is deemed out of context with the canonical proteome," Deutsch and team explain.
Out of the 7,264 sets of these non-canonical genes identified, the researchers found at least a quarter of them could create proteins. This amounted to at least 3,000 new peptide-coding genes to add to the human genome, and the team suspects there are tens of thousands more, all missed by previous proteomic techniques.
"It's not every day that you get to open a research direction and say, 'We might have a whole new class of drug targets for patients,'" University of Michigan neuro-oncologist John Prensner told Elizabeth Pennisi at Science.
The tools the team have developed will help other researchers to continue to uncover more of this dark genetic matter.
This research is awaiting peer review on bioRxiv.