Columbia University researchers have developed a computational method that they say can accurately and sensitively quantify protein-DNA binding affinity. Their research was published earlier this week in the Proceedings of the National Academy of Sciences.

"The genomes of even simple organisms such as the fruit fly contain 120 million letters' worth of DNA, much of which has yet to be decoded because the cues it provides have been too subtle for existing tools to pick up," said Richard Mann, a principal investigator at Columbia's Mortimer B. Zuckerman Mind Brain Behavior Institute and a senior author of the paper. "But our new algorithm lets us sweep through these millions of lines of genetic code and pick up even the faintest signals, resulting in a much more complete picture of what DNA encodes."

Mann has studied Hox genes for decades and found that even though each individual Hox gene guided a different feature of growth, the Hox transcription factors were all binding strongly and visibly to the same set of easily identifiable DNA sequences.

However, in 2015, Mann and his team discovered that the Hox transcription factors were also binding at many other locations, just more discreetly, at low-affinity sites. The scientists believed these low-affinity binding sites to be key to the Hox transcription factors' ability to drive one aspect of development versus another. The problem remained how to identify these sites in the genome.

To address this challenge, Dr. Mann and his lab joined forces with the lab of Harmen Bussemaker, a professor in Columbia's department of biological sciences and systems biology and an expert in building computational models of genetic activity.

A few years ago, the two labs developed a genetic sequencing method called SELEX-seq to systematically characterize all Hox binding sites. But their approach still had limitations: It required the same DNA fragment to be sequenced over and over again. With each new round, more pieces of the puzzle were revealed, but information about those critical low-affinity binding sites remained hidden.

To overcome this challenge, Dr. Bussemaker and his team developed a sophisticated new computer algorithm that was able to explain the behavior of all DNA sequences in the SELEX-seq experiment. They called this algorithm No Read Left Behind, or NRLB.

"In simple terms, NRLB allows us to cover the entire spectrum of binding sites, from the highest to the lowest affinity, with a much greater degree of sensitivity and accuracy than any existing method, including state-of-the-art deep learning algorithms," said Dr. Bussemaker, who was the paper's other senior author. "Building on that foundation, we now hope to develop more in-depth biological and computational models to help answer the most complicated questions about the genome."

Image: Gradually eliminating low-affinity binding sites identified by NRLB (from left to right) results in a gradual reduction of gene expression (white). Image courtesy of Mann Lab/Columbia's Zuckerman Institute.