The brains at Google DeepMind in London have developed a new artificial intelligence (AI) tool that can accurately predict how single variants or mutations in human DNA sequences can impact the biological processes regulated by the genes they affect. This is an immensely complex feat that now, thanks to the new model, takes significantly less time and computing power than previous methods.
What is AlphaGenome looking to understand?
For a bit of context: DNA is a “cellular instruction manual” that informs how every part of a living organism should function. It’s effectively a code, made up of four chemical bases – adenine (A), guanine (G), cytosine (C), and thymine (T) – which can be strung together in extremely long sequences, encoding huge amounts of information.
Within this code, it’s possible to find things akin to patterns and even “grammatical rules” that can help us understand its role and function, but this is not an easy task given the huge amounts of “letters” in each passage.
Even small variations in this code can drastically affect biological processes and contribute to disease, yet their effects are very tricky to pin down at a molecular level. For instance, it is extremely difficult to predict what happens when a single A, T, C, or G is replaced by a different base.
The majority of changes (approximately 98 percent) occur in non-coding regions, or the “dark genome”, sections of DNA that don’t produce proteins but can affect gene expression, adding to the challenge.
However, this is the type of task that AI is great at. It’s able to read vast amounts of data, recognize patterns, and then use its “experience” to make predictions.
What is AlphaGenome?
Enter Google DeepMind, the London-based company whose co-founder and CEO, Demis Hassabis, and senior researcher John Jumper were awarded the 2024 Nobel Prize in Chemistry for their groundbreaking AI model, AlphaFold2, which predicts the 3D structure of nearly all known proteins.
Their latest creation is AlphaGenome, a model that can look at long DNA sequences – up to 1 million base pairs – and predict thousands of molecular properties characterizing their regulatory activity.
In the latest demonstration of this ability, researchers have shown how AlphaGenome can simultaneously predict 5,930 human or 1,128 mouse genetic signals that relate to specific functions, such as gene expression, splicing, and modification of proteins.
When pitted against other existing “state-of-the-art models,” AlphaGenome beat them in 25 out of 26 tests.
While the model does have its limitations, the tool should prove useful to scientists hoping to better understand disease biology and, ultimately, develop new treatments. To improve the chances of that becoming a reality, the researchers are releasing AlphaGenome free of charge for non-commercial use.
“This work is an exciting step forward in illuminating the ‘dark genome’. We still have a long way to go in understanding the lengthy sequences of our DNA that don’t directly encode the protein machinery whose constant whirring keeps us healthy,” Professor Rivka Isaacson, Professor of Molecular Biophysics in the Department of Chemistry at King’s College London, who wasn’t involved in the study, said to the Science Media Centre.
“There are so many interwoven possibilities, and complex feedback mechanisms, that I doubt the whole thing will ever be fully untangled. AlphaGenome gives scientists whole new and vast datasets to sift and scavenge for clues,” she added.
Adding a touch of caution to the promising findings, Professor Ben Lehner, Head of Generative and Synthetic Genomics at the Wellcome Sanger Institute, who also wasn't involved in the study, remarked: “AlphaGenome is a great example of how AI is accelerating biological discovery and the development of therapeutics.”
“However, AlphaGenome is far from perfect and there is still a lot of work to do. AI models are only as good as the data used to train them. Most existing data in biology is not very suitable for AI – the datasets are too small and not well standardized. The most important challenge right now is how to generate the data to train the next generation of even more powerful AI models. We need to do this fast, cost-effectively, and in a way that both the data and the resulting models are available for everyone to use,” he added.
The study is published in the journal Nature.





