Mother Nature is widely regarded as the most powerful generative force, having designed the vast and intricate variety of life on Earth using just four genetic letters—A, T, C, and G. But can generative AI build upon her work?

A groundbreaking new algorithm called Evo 2 is pushing the limits of what AI can achieve in the realm of biology. Trained on an immense dataset of roughly 128,000 genomes—9.3 trillion DNA base pairs spanning all domains of life—Evo 2 is now the largest generative AI model ever created for biological research. Developed by scientists at the Arc Institute, Stanford University, and Nvidia, Evo 2 is capable of writing entire chromosomes and small genomes from scratch.

One of Evo 2’s remarkable abilities is its deep understanding of how mutations in DNA can affect proteins, RNA, and overall health. It sheds light on “non-coding” regions of DNA—parts that don’t directly make proteins but are vital in regulating gene activity. These areas have long been a mystery but are now understood to play a role in diseases. By analyzing these regions, Evo 2 offers new insights into how genetic mutations influence cellular functions.

In a significant step forward for scientific collaboration, the team has made Evo 2’s software code and model parameters available to the wider scientific community. Researchers can now access the tool through a simple web interface, allowing them to explore the algorithm’s potential. With Evo 2 as a foundation, scientists have the opportunity to develop specialized AI models. These models could help predict how specific mutations affect protein functions, explain gene behavior across different cell types, and even assist in designing new genomes for synthetic biology projects.

According to Patrick Hsu, a key author of the study, Evo 2 marks a pivotal moment in the emerging field of generative biology. For the first time, machines are capable of reading, writing, and “thinking” in the intricate language of DNA. This development opens doors to a new era of genetic discovery.

Evo 2 builds upon its predecessor, the original Evo model, which was introduced just a year ago. Both models are large language models (LLMs), the same type of algorithm behind popular chatbots like ChatGPT. The original Evo model was trained on about three million genomes, primarily from microbes and viruses.

In contrast, Evo 2 expands its training to include a much broader variety of life forms, from humans and plants to yeast and other eukaryotes. Eukaryotic organisms—those with more complex cells—have far more intricate genomes than bacteria. For example, some DNA sequences in eukaryotes play specific roles in turning genes on and off, or in producing multiple versions of a protein from a single gene.

“These features are crucial for the development of multicellular life, complex traits, and behaviors unique to eukaryotic organisms,” the research team explains in a preprint paper. However, these same regulatory mechanisms make training AI models more challenging. Regulatory elements in DNA can be located far from the genes they control, often hidden in non-coding regions. These sequences may not directly produce proteins, but they are essential for regulating gene expression and maintaining chromosome structure.

The researchers deliberately included these non-coding regions in Evo 2’s training dataset, OpenGenome2, which comprises the DNA sequences of 128,000 genomes across the entire tree of life. By doing so, they ensured the model could identify patterns that are critical for understanding the complexities of eukaryotic DNA.

To maximize its potential, the team trained two versions of Evo 2: a smaller model on 2.4 trillion DNA letters and a larger model on the full 9.3-trillion-letter dataset. Evo 2's ability to process such long sequences is vital for studying eukaryotic cells, whose DNA sequences are much longer and more complex than those of bacteria. Its expanded context window, the stretch of sequence the model can consider at once, lets it pick out crucial patterns across a much wider genetic landscape. In fact, Evo 2 was trained on 30 times more data than its predecessor and can process 8 times more DNA letters at once. Training took several months, running on more than 2,000 Nvidia H100 GPUs.
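The scale-up figures above can be sanity-checked with simple arithmetic. This is a back-of-the-envelope sketch; the numbers for the original Evo (roughly 300 billion training letters and a context window of about 131,000 letters) are assumptions drawn from reporting on the first model, not figures stated in this article.

```python
# Rough check of the reported scale-up from Evo to Evo 2.
# Assumed figures for the original Evo: ~300 billion DNA letters of
# training data and a ~131,072-letter context window.

EVO1_TOKENS = 300e9        # assumed training letters, original Evo
EVO2_TOKENS = 9.3e12       # training letters, Evo 2 (from the article)
EVO1_CONTEXT = 131_072     # assumed context window, original Evo
EVO2_CONTEXT = 1_048_576   # assumed context window, Evo 2 (8x larger)

data_ratio = EVO2_TOKENS / EVO1_TOKENS        # ~31, i.e. "30 times more data"
context_ratio = EVO2_CONTEXT / EVO1_CONTEXT   # exactly 8

print(f"data: {data_ratio:.0f}x, context: {context_ratio:.0f}x")
```

Under these assumptions the ratios line up with the article's "30 times more data" and "8 times more DNA letters at once."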

Once training was complete, Evo 2 demonstrated its superior capabilities by outperforming existing models in predicting the effects of mutations in the BRCA1 gene, which is strongly linked to breast cancer. In particular, Evo 2 excelled when considering both protein-coding and non-coding mutations, showing over 90 percent accuracy in distinguishing between benign and potentially harmful genetic changes.
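A common way genomic language models score variants like these is to compare how likely the model finds the reference sequence versus the mutated one: if the mutation makes the sequence look far less plausible to the model, it is flagged as potentially damaging. The sketch below illustrates that general technique only; `sequence_log_likelihood` is a toy stand-in (a real implementation would sum the trained model's per-letter log-probabilities), and none of this code is Evo 2's actual API.

```python
# Illustration of likelihood-based variant scoring, the general idea
# behind tasks like BRCA1 variant classification.

def sequence_log_likelihood(seq: str) -> float:
    # Toy stand-in for a trained model: counts occurrences of a fixed
    # "reference-like" motif so the example runs without any model.
    motif = "ATGGAT"
    return sum(1.0 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def variant_effect_score(ref_seq: str, pos: int, alt: str) -> float:
    """Log-likelihood difference: mutated sequence minus reference.
    Strongly negative scores suggest a damaging variant."""
    mut_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return sequence_log_likelihood(mut_seq) - sequence_log_likelihood(ref_seq)

ref = "ATGGATCCATGGAT"
# Disrupting one copy of the motif lowers the toy likelihood,
# so the score comes out negative.
print(variant_effect_score(ref, 2, "C"))
```

Classifying a variant then reduces to thresholding this score, which is how a likelihood-based model can separate benign from harmful changes in both coding and non-coding DNA.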

This achievement signals a significant leap in the field of generative biology. Evo 2 not only expands our understanding of the genome but also provides a powerful new tool for exploring how genetic mutations impact human health, evolution, and even the potential for synthetic biology.

In sum, Evo 2 represents a game-changing advancement in the application of AI to biology, with the potential to unlock vast new realms of genetic research and innovation.

By Impact Lab