Ginormous New 'Index' Shares Data From 100 Million Science Papers For Free : ScienceAlert

There's a vast amount of research out there, and the volume grows with each passing day. But there's a problem.

Not only is a lot of the existing literature hidden behind a paywall, but it can also be difficult to parse and make sense of in a comprehensive, logical way. What's really needed is a super-smart version of Google just for academic papers.

Enter the General Index, a new database of some 107.2 million journal articles, totaling 38 terabytes of data in its uncompressed form. It spans more than 355 billion rows of text, each featuring a key word or phrase plucked from a published paper.

"This is a look-up tool, a dictionary of knowledge, a map to knowledge," says the creator of the Index, archivist Carl Malamud. "A tool that we believe is an essential facility for the practice of science in our modern age."

While we've mentioned Google, this isn't quite a search engine – scientists using the General Index will have to code their own search tools to work with it. Rather, it's a carefully structured catalogue that can be used to probe decades' worth of scientific research.

Its primary purpose is to help with text mining: using computers to quickly scan millions of data points to find and cross-link references to something specific. Human beings can't possibly read through and pick out key bits of data from millions of journal articles, but a computer program connected to the General Index can.
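To make that idea concrete, here is a minimal Python sketch of the kind of cross-linking a researcher might script. It is not the General Index's actual schema or tooling: the two-column layout (`phrase`, `article_id`), the sample rows, and the function names are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: each row pairs a phrase with the ID of the
# article it was extracted from. The real Index layout may differ.
SAMPLE_ROWS = """\
phrase\tarticle_id
limonene\tA1
arabidopsis thaliana\tA1
limonene\tA2
citrus sinensis\tA2
arabidopsis thaliana\tA3
"""


def build_lookup(rows_file):
    """Map each lowercased phrase to the set of article IDs that contain it."""
    lookup = defaultdict(set)
    reader = csv.DictReader(rows_file, delimiter="\t")
    for row in reader:
        lookup[row["phrase"].lower()].add(row["article_id"])
    return lookup


def articles_mentioning_all(lookup, phrases):
    """Return the sorted article IDs that mention every phrase in `phrases`."""
    sets = [lookup.get(p.lower(), set()) for p in phrases]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


lookup = build_lookup(io.StringIO(SAMPLE_ROWS))
print(articles_mentioning_all(lookup, ["limonene"]))
print(articles_mentioning_all(lookup, ["limonene", "arabidopsis thaliana"]))
```

With the sample rows, the first query returns two articles and the second narrows to the one article that mentions both phrases. An in-memory dictionary like this would not cope with hundreds of billions of rows, so a real tool would presumably query a database instead.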

Reaction from other scientists has been positive. One expert, computational biologist Gitanjali Yadav from the University of Cambridge in the UK, says the new database goes some way to solving the problem of restricted access to previously published material.

"There is no way for me – or anyone else – to experimentally analyze or measure the chemical fingerprint of each and every plant species on Earth," Yadav told Nature. "Much of the information we seek already exists, in published literature."

The idea is that the General Index can be used to search for plants, chemicals, genes, proteins, materials, place names and much more – though the team behind it is keen to emphasize that it still needs some tidying up and expanding, and is very much a work in progress (as it probably always will be).

All of this information is available to download and use for free from the General Index portal, with no copyright applied and no restrictions – the Index is just snippets of papers, not the papers themselves. As we've mentioned though, you're going to need some coding skills in order to really make any sense out of it.

Unlike the controversial Sci-Hub portal, the Index isn't hosting papers in their entirety, although questions have been raised as to the project's legality. For Malamud, the project falls well within legal boundaries.

"I am very confident that what I'm doing is legal," Malamud told Nature. "We are not doing this to provoke a lawsuit, we are doing it to advance science."