Nvidia, Evozyne, InstaDeep and researchers from TU Munich present new advances in the use of AI in biology at the JP Morgan Healthcare conference.
Advances in generative AI models for language and images are transforming natural language processing (NLP), art and design. But the underlying technologies, such as transformers, diffusion models and variational autoencoders (VAEs), and methods such as unsupervised learning on gigantic amounts of data are also proving their worth outside these areas.
A promising application area is bioinformatics, where models such as DeepMind’s AlphaFold 2 or Meta’s ESMFold predict protein structures and diffusion models are expected to open a new era in protein design. In 2022 alone, nearly 1,000 scientific papers on the use of AI in biology were published on arXiv. And according to the Gartner report “Innovation Insight for Generative AI”, more than 30% of new drugs and materials could be systematically discovered using generative AI techniques by 2025.
Nvidia partners with startups and researchers to advance bioinformatics
At this year’s JP Morgan Healthcare conference, Nvidia is presenting the results of two collaborations with startups and researchers: the Nucleotide Transformer genomic language model and the ProT-VAE generative protein model.
The Nucleotide Transformer was created through a collaboration between InstaDeep (recently acquired by BioNTech), the Technical University of Munich, and Nvidia. The team trained models of different sizes on data comprising up to 174 billion nucleotides from different species on Nvidia’s Cambridge-1 supercomputer, following the recipe for success of large language models such as GPT-3: large models, a gigantic data set and a lot of computing power.
As expected, the performance of the Nucleotide Transformer increased with model size and data volume. The team tested the model on 19 benchmarks, and on 15 of them it performed as well as or better than other models trained specifically for these tasks. In the future, the transformer should help translate DNA sequences into RNA and proteins, for example.
“We believe these are the first results that clearly demonstrate the feasibility of developing foundation models in genomics that truly generalize to all tasks,” said Karim Beguir, CEO of InstaDeep. “In many ways, these results mirror what we’ve seen in the development of adaptable foundation models in natural language processing over the past few years, and it’s incredibly exciting to see this now applied to such difficult problems in drug discovery and human health.”
The ProT-VAE AI model generates new proteins
Researchers from the startup Evozyne went further: using Nvidia’s BioNeMo platform, they created the ProT-VAE generative model to generate new proteins. While models such as AlphaFold or ESMFold predict the structure of protein sequences, ProT-VAE is designed to derive functions directly from the sequences and thus generate new proteins that perform a specific function.
The ability to design proteins with predetermined functions is a central goal of synthetic biology and has the potential to revolutionize fields like medicine, biochemical engineering or the energy sector. The problem: With natural amino acids alone, there are many more possible proteins than protons in the visible universe.
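A rough calculation shows how quickly this space grows. The sketch below assumes the 20 standard amino acids and the commonly cited order-of-magnitude estimate of about 10^80 protons in the observable universe. Neither figure comes from the researchers, and the function names are purely illustrative.

```python
import math

NUM_AMINO_ACIDS = 20   # the standard natural amino acids
PROTONS_LOG10 = 80     # assumed estimate: roughly 10**80 protons (not from the article)


def log10_sequence_count(length: int) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(NUM_AMINO_ACIDS)


def shortest_length_exceeding(log10_threshold: float) -> int:
    """Return the smallest chain length with more than 10**log10_threshold sequences."""
    length = 1
    while log10_sequence_count(length) <= log10_threshold:
        length += 1
    return length


if __name__ == "__main__":
    length = shortest_length_exceeding(PROTONS_LOG10)
    print(f"Sequences outnumber the proton estimate from length {length} upward")
```

Under these assumptions, a chain of just 62 amino acids already has more possible sequences than the estimated number of protons, and most natural proteins are considerably longer.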
Evozyne sees the solution in “machine learning guided protein engineering” with ProT-VAE. The model sandwiches a VAE network between a protein transformer encoder and decoder pre-trained by Nvidia. The VAE network is then trained for the specific protein family in which new proteins are to be generated. During generation, however, the model can also draw on the full representations of the transformer ProtT5, which Nvidia trained on the amino acid sequences of millions of proteins.
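The data flow of such a sandwich can be illustrated with a toy sketch. This is not Evozyne's or Nvidia's code: the encoder and decoder are trivial stand-ins for the pre-trained transformer, the weights are random rather than trained, and all names, sizes and the example sequence are assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
HIDDEN_DIM = len(AMINO_ACIDS)          # size of the stand-in transformer representation
LATENT_DIM = 4                         # illustrative latent size, not from the article


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def frozen_encoder(sequence):
    """Toy stand-in for the pre-trained encoder: amino acid frequencies."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def frozen_decoder(hidden, length, rng):
    """Toy stand-in for the pre-trained decoder: sample residues from softmax(hidden)."""
    peak = max(hidden)
    weights = [math.exp(h - peak) for h in hidden]
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))


class FamilyVAE:
    """The small middle network that would be trained per protein family."""

    def __init__(self, rng):
        def init(rows, cols):
            # Random weights only; the training loop is omitted in this sketch.
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.rng = rng
        self.w_mu = init(LATENT_DIM, HIDDEN_DIM)
        self.w_log_var = init(LATENT_DIM, HIDDEN_DIM)
        self.w_out = init(HIDDEN_DIM, LATENT_DIM)

    def encode(self, hidden):
        """Map a hidden vector to the mean and log-variance of a Gaussian."""
        return matvec(self.w_mu, hidden), matvec(self.w_log_var, hidden)

    def sample(self, mu, log_var):
        """Reparameterisation: z = mu + sigma * eps with eps drawn from N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, log_var)]

    def decode(self, z):
        """Map a latent vector back to the transformer's hidden space."""
        return matvec(self.w_out, z)


def generate_variant(sequence, vae, rng):
    """Run one sequence through frozen encoder, VAE, and frozen decoder."""
    hidden = frozen_encoder(sequence)
    mu, log_var = vae.encode(hidden)
    z = vae.sample(mu, log_var)
    return frozen_decoder(vae.decode(z), len(sequence), rng)


rng = random.Random(0)
vae = FamilyVAE(rng)
variant = generate_variant("MSTAVLENPG", vae, rng)  # arbitrary toy input sequence
```

The point of the layout is that only the small middle network would need training for each protein family, while the large pre-trained transformer parts are reused unchanged.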
To test their model, the team designed, among other things, a variant of the human PAH protein. Mutations in the PAH gene can limit the protein’s activity and lead to a metabolic disorder that can impair mental development and cause epilepsy. According to the researchers, ProT-VAE engineered a variant with 51 mutations, 85% sequence similarity, and 2.5-fold improved function.
“We envision that the model may offer an extensible and generic platform for machine learning-guided directed evolution campaigns for data-driven design of novel synthetic proteins with ‘supernatural’ function,” the researchers write.
Until recently, this process took months or even years. With ProT-VAE, this time has been reduced to a few weeks.