Researchers from MIT and Stanford University have developed a novel system intended to protect the privacy of people who contribute their genomic data to large-scale biomedical studies. These databases of genomic information currently pose privacy risks, as it is often possible to infer from people's raw genomic data their surnames and perhaps even the shapes of their faces.
As a result, many people have been reluctant to contribute their genomic data to biomedical research projects, and an organization hosting a large repository of genomic data might conduct a months-long review before deciding whether to grant a researcher's request for access.
In a paper appearing today in Nature Biotechnology, the researchers describe the new system, which promises efficient privacy protection for studies conducted over as many as a million genomes.
"As biomedical researchers, we're frustrated by the lack of data and by the access-controlled repositories," says Bonnie Berger, the Simons Professor of Mathematics at MIT and corresponding author on the paper. "We anticipate a future with a landscape of massively distributed genomic data, where private individuals take ownership of their own personal genomes, and institutes as well as hospitals build their own private genomic databases. Our work provides a roadmap for pooling together this vast amount of genomic data to enable scientific progress."
At the core of the system is a technique called secret sharing, which divides sensitive data among multiple servers. To store the number x, for instance, a secret-sharing system might send the random number r to one server and x-r to the other.
Neither server is independently able to infer x. Collectively, however, they can still perform useful operations. If one server stored a bunch of r's and added them together, and the other added up all the corresponding (x-r)'s, then sharing the results and adding them together would yield the sum of all the x's. Neither server, however, would ever observe the value of any one x.
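The two-server arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not the researchers' implementation: the names `MODULUS` and `share` are hypothetical, and the sketch assumes that shares are computed modulo a large prime, a detail the description above leaves out but which additive secret-sharing schemes typically rely on so that each share on its own looks uniformly random.

```python
import secrets

# Assumption for illustration: arithmetic is done modulo a large prime.
MODULUS = 2**61 - 1


def share(x):
    """Split x into two shares: a random r and (x - r) mod MODULUS."""
    r = secrets.randbelow(MODULUS)
    return r, (x - r) % MODULUS


values = [3, 14, 15, 92]            # the private x's
shares = [share(x) for x in values]

# Each server adds up only the shares it holds.
sum_a = sum(r for r, _ in shares) % MODULUS   # server A holds the r's
sum_b = sum(s for _, s in shares) % MODULUS   # server B holds the (x - r)'s

# Only the two partial sums are exchanged and combined.
total = (sum_a + sum_b) % MODULUS
print(total == sum(values))
```

In this sketch, each server sees only random-looking numbers, yet the combined result equals the sum of the original values.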
If both servers are hacked, of course, the attacker could reconstruct all the x's. But so long as one server is trustworthy, the system is secure. Furthermore, that principle generalizes to multiple servers. If data are divided among, say, four servers, an attacker would have to infiltrate all four; hacking any three is insufficient to extract any data.
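The generalization to more servers can be sketched the same way. Again, this is a hedged illustration rather than the system's actual code: `split_secret` and `reconstruct` are hypothetical names, and modular arithmetic over a large prime is an assumption. All but one of the shares are drawn at random, and the last is chosen so that the shares add up to the secret.

```python
import secrets

# Assumption for illustration: arithmetic is done modulo a large prime.
MODULUS = 2**61 - 1


def split_secret(x, n_servers):
    """Split x into n_servers additive shares that sum to x mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    # The final share makes the total come out to x.
    shares.append((x - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the secret; this requires every share."""
    return sum(shares) % MODULUS


parts = split_secret(42, 4)      # one share per server
print(reconstruct(parts) == 42)
```

Under these assumptions, any three of the four shares are just independent random numbers; the secret is pinned down only once the fourth is added.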