In vivo DNA mutagenesis as a data strategy for AI-driven protein engineering
A brief perspective on how in vivo DNA mutagenesis is emerging as a scalable data-generation strategy for AI-driven protein engineering, and why broad sequence diversity can outperform tightly controlled synthetic libraries for model training and discovery.
1/30/2026 · 2 min read


Most discussions around CRISPR focus on correcting DNA sequences in human disease. An equally important, but far less discussed, application is the intentional introduction of mutations into genes to enable protein discovery, learning, and optimization.
In protein and cell engineering, the limiting factor is often not screening capacity, but library construction. Designing synthetic mutant libraries requires careful decisions around diversification strategy, oligonucleotide design, cloning workflows, and transformation efficiency. While these approaches offer tight control, they also impose practical limits on scale, throughput, and iteration speed—particularly when the goal is large-scale protein variant dataset generation.
As AI and machine-learning methods become increasingly central to protein engineering, these limitations are becoming more pronounced. Modern protein ML workflows do not simply require “better variants.” They require large, diverse, and functionally annotated datasets that span both functional and non-functional regions of sequence space. In many cases, the objective is not to test a narrowly defined hypothesis, but to enable broad sequence space exploration for machine learning and allow models to learn underlying sequence–function relationships.
This is where in vivo DNA mutagenesis becomes compelling.
In vivo mutagenesis systems use Cas9-based fusion enzymes (for example, base editors or nickase-polymerase fusions) or error-prone DNA polymerases to introduce mutations directly into DNA inside living cells. Instead of constructing diversity ex vivo through repeated rounds of PCR and cloning, sequence variation accumulates autonomously as cells grow and divide. The result is a continuously evolving population of variants generated under biologically relevant conditions, an approach increasingly viewed as autonomous data generation for protein engineering.
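As a rough illustration of how diversity accumulates over successive divisions, the following toy simulation copies each sequence once per generation with a small per-base substitution probability. It is a sketch under simplifying assumptions, not a model of any specific mutagenesis system: the function names, the uniform substitution rate, the constant population size, and the absence of selection are all hypothetical choices made for clarity.

```python
import random

BASES = "ACGT"

def replicate(seq, rate, rng):
    """Copy a sequence, substituting each base with probability `rate`."""
    copied = []
    for base in seq:
        if rng.random() < rate:
            # Assumption: every substitution is equally likely (real systems are biased).
            copied.append(rng.choice([b for b in BASES if b != base]))
        else:
            copied.append(base)
    return "".join(copied)

def evolve(wild_type, generations, pop_size=100, rate=1e-3, seed=0):
    """Return a population after `generations` rounds of error-prone copying."""
    rng = random.Random(seed)
    population = [wild_type] * pop_size
    for _ in range(generations):
        # Assumption: constant population size and no selection between rounds.
        population = [replicate(seq, rate, rng) for seq in population]
    return population

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_distance(population, wild_type):
    """Average number of substitutions per sequence relative to wild type."""
    return sum(hamming(seq, wild_type) for seq in population) / len(population)

if __name__ == "__main__":
    wt = "ATGGCTAAGCTT" * 10  # hypothetical 120-nt target
    for gens in (0, 10, 50):
        pop = evolve(wt, gens)
        print(gens, "generations:", mean_distance(pop, wt), "mean substitutions")
```

In this sketch the average distance from the starting sequence grows with the number of generations, which is the property that makes continuous in vivo diversification attractive as a data source. No library is designed up front; variation simply builds up as the population is propagated.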
From an AI perspective, this represents a fundamental shift:
From precise library design to high-diversity protein datasets for AI
From manually curated variant sets to stochastic mutagenesis for machine learning
From single-round experiments to continuous evolution datasets for protein ML
The primary tradeoff is control. In vivo DNA mutagenesis does not allow precise specification of mutation sites or distributions. Mutations arise stochastically, individual variants are not predesigned, and datasets often contain many multi-mutation variants whose effects interact epistatically. However, for AI/ML-driven protein engineering, this lack of control is often a feature rather than a limitation. Large, diverse variant pools that are not constrained by upfront design choices can provide richer training data than narrowly engineered libraries, particularly when paired with high-throughput functional assays.
For applications such as:
Training predictive models of protein function
Learning protein fitness landscapes across broad sequence space
Identifying non-intuitive structure–function relationships
Generating large labeled datasets for downstream model optimization
in vivo DNA mutagenesis can act as a biological data engine, producing the kind of experimental data that protein machine-learning models require.
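To make the idea of labeled training data concrete, here is a deliberately crude baseline that learns from (sequence, assay score) pairs by averaging the score of every variant that carries a given substitution. It is an illustrative sketch only: the helper names, the example sequences, and the scores are hypothetical, and a real workflow would use a proper statistical or machine-learning model rather than per-mutation averages.

```python
from collections import defaultdict

def mutations(variant, wild_type):
    """Return the set of (position, residue) substitutions relative to wild type."""
    return {(i, v) for i, (v, w) in enumerate(zip(variant, wild_type)) if v != w}

def fit_mutation_effects(dataset, wild_type):
    """Average the assay score over all variants that carry each substitution."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, score in dataset:
        for mut in mutations(seq, wild_type):
            totals[mut] += score
            counts[mut] += 1
    return {mut: totals[mut] / counts[mut] for mut in totals}

def predict(variant, wild_type, effects, wt_score):
    """Predict a score as the mean of known per-mutation averages.

    Falls back to the wild-type score when no substitution has been observed.
    """
    known = [effects[m] for m in mutations(variant, wild_type) if m in effects]
    return sum(known) / len(known) if known else wt_score

if __name__ == "__main__":
    wild_type = "MKTAY"  # hypothetical five-residue protein
    dataset = [          # hypothetical (sequence, functional score) pairs
        ("MKTAY", 1.0),
        ("MRTAY", 0.2),
        ("MKTAF", 0.9),
        ("MRTAF", 0.1),
    ]
    effects = fit_mutation_effects(dataset, wild_type)
    print(effects)
    print(predict("MRTAY", wild_type, effects, wt_score=1.0))
```

Even this simple baseline shows why broad, multi-mutation datasets matter: each substitution's estimate improves as it is observed in more sequence backgrounds, and both functional and non-functional variants contribute information. Averaging ignores epistasis entirely, which is exactly the kind of structure richer models trained on such data are meant to capture.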
At Ranomics, we view library generation not simply as a molecular biology challenge, but as a data strategy decision. In vivo DNA mutagenesis complements traditional synthetic libraries by enabling scalable, continuous diversification when the primary objective is AI training data generation for protein engineering, model development, and discovery at scale.
As AI continues to reshape protein engineering, approaches that prioritize data volume, diversity, and biological relevance will increasingly define what is possible. In vivo DNA mutagenesis is one such approach—and an important component of next-generation AI-enabled protein engineering workflows.
Source: Anna Zimmermann, Julian E. Prieto-Vivas, Karin Voordeckers, Changhao Bi, Kevin J. Verstrepen. Mutagenesis techniques for evolutionary engineering of microbes – exploiting CRISPR-Cas, oligonucleotides, recombinases, and polymerases. Trends in Microbiology, Volume 32, Issue 9, 2024, Pages 884-901.
