Leveraging AI and Deep Mutational Scanning to Engineer Novel Enzymes

7/15/2025 | 16 min read

Introducing Deep Mutational Scanning (DMS): The Core Concept

Deep Mutational Scanning (DMS) has emerged as a powerful experimental paradigm to effectively characterize sequence-function relationships. At its core, DMS is a high-throughput functional genomics method that combines massively parallel mutagenesis with functional selection and deep sequencing to systematically measure the effects of thousands, or even millions, of protein variants in a single experiment.

The central principle of DMS involves three steps. First, a large library of variants of a gene of interest is created, often aiming to generate every possible single amino acid substitution in the protein. Second, this library is introduced into a biological system (such as yeast, bacteria, viruses, or cultured mammalian cells) where a selectable phenotype is physically linked to each variant's genotype. A functional selection is then applied, which enriches variants with high activity and depletes those with low or no activity. Third, high-throughput DNA sequencing is used to quantify the frequency of each variant in the population before and after the selection. By comparing these frequencies, a functional score can be calculated for each mutation, reflecting its impact on the protein's activity. The result is a comprehensive "sequence-function map," an atlas that details the functional consequences of nearly every possible mutation across the protein landscape.

Generation of Variant Libraries

The foundation of any DMS experiment is the creation of a large and diverse library of gene variants. The choice of mutagenesis strategy is critical and is dictated by a trade-off between cost, scale, precision, and the specific biological question being addressed. Several methods are commonly employed.

Error-Prone PCR is a widely used technique due to its relative simplicity and cost-effectiveness. It utilizes low-fidelity DNA polymerases or sub-optimal reaction conditions that introduce random nucleotide substitutions during amplification. The mutation rate can be tuned by altering PCR conditions, such as the concentration of manganese chloride or dNTPs. However, this method suffers from significant drawbacks, most notably mutation bias. Certain polymerases, for instance, favor transitions over transversions or have a preference for mutating A/T base pairs, meaning the resulting library is not truly random. This makes it difficult to achieve comprehensive coverage of all possible amino acid substitutions, especially those requiring multiple nucleotide changes in a single codon.
Oligonucleotide-based Methods offer far greater control and precision, albeit at a higher cost. These strategies rely on the synthesis of DNA oligonucleotides that contain the desired mutations. One approach uses "doped" oligos, which are synthesized with a defined error rate at each position, allowing for a customized distribution of mutations. A more systematic approach uses oligos containing degenerate codons, such as 'NNN' (where N can be any base), 'NNS'/'NNK' (where S is G or C and K is G or T), or the "22c-trick" codon set, to generate all 19 possible amino acid substitutions at each targeted position. These oligos can be used as primers in PCR-based methods or assembled into larger gene fragments.
CRISPR-based Endogenous Mutagenesis represents an in cellulo strategy for DMS, enabling the generation of variants directly at their native genomic locus within a cell. This provides the most biologically relevant context, as the protein is expressed under its endogenous regulatory control. These methods typically use a library of guide RNAs to target CRISPR-Cas9 to specific sites, introducing mutations via error-prone non-homologous end joining or by providing a library of donor DNA templates for homology-directed repair (HDR). While powerful, this approach is technically challenging. The efficiency of HDR is often low, which limits the size of the library that can be generated, and there are persistent concerns about off-target mutations and editing biases. Furthermore, not every cell type is amenable to this workflow.
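
To make the degenerate-codon schemes mentioned above more concrete, the following minimal sketch expands 'NNN', 'NNK', and 'NNS' into their concrete codons and tallies how many amino acids and stop codons each one encodes. It assumes the standard genetic code; the function names (expand, summarize) are illustrative and do not come from any particular library.

```python
from itertools import product

# IUPAC degenerate base codes used by the schemes discussed in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "K": "GT", "S": "CG"}

# Standard genetic code with bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
TRANSLATION = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TRANSLATION)
}


def expand(scheme):
    """Return every concrete codon matched by a degenerate codon such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[base] for base in scheme))]


def summarize(scheme):
    """Return (number of codons, distinct amino acids, stop codons) for a scheme."""
    encoded = [CODON_TABLE[codon] for codon in expand(scheme)]
    amino_acids = set(encoded) - {"*"}
    return len(encoded), len(amino_acids), encoded.count("*")


for scheme in ("NNN", "NNK", "NNS"):
    n_codons, n_aa, n_stop = summarize(scheme)
    print(f"{scheme}: {n_codons} codons, {n_aa} amino acids, {n_stop} stop codon(s)")
```

Both NNK and NNS still cover all 20 amino acids while halving the number of codons and leaving a single stop codon, which is one reason they are often preferred over NNN when library size is limiting.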

The Functional Selection Assay - Linking Genotype to Phenotype

The functional assay is the most important aspect of a DMS experiment, as it provides the selective pressure that distinguishes functional variants from non-functional variants. A robust assay must physically link the DNA sequence of the variant to the protein's function. The choice of assay depends entirely on the protein function being investigated.

Fitness or Survival-Based Assays are among the most direct and economical approaches. In this setup, the protein's function is essential for the survival or proliferation of the host cell or virus under specific conditions. For example, when studying a viral protein, the library of viral variants can be passaged in cell culture; viruses with mutations that enhance replication will increase in frequency, while those with deleterious mutations will be depleted. Similarly, when studying an antibiotic resistance enzyme, the cell library can be grown in the presence of the drug, selecting for variants that confer resistance. While powerful, this approach has a limitation: a "fitness" score can be an indirect measure of the protein's function, because several biological processes, including population dynamics, can contribute to a growth phenotype.
Binding and Stability Assays are designed to provide more direct, mechanistic insights into a mutation's effect. A major advance in DMS has been its coupling with various display technologies, such as yeast display, phage display, and mammalian cell display. In these systems, the library of protein variants is expressed on the surface of a yeast cell, phage particle, or mammalian cell. Selection can then be performed by incubating the library with a fluorescently labeled binding partner. Cells or phages that bind the partner are then physically separated from non-binders using techniques like fluorescence-activated cell sorting (FACS). This allows for a quantitative measure of binding affinity.
A persistent challenge in interpreting DMS data is "biophysical ambiguity"—a single functional score often conflates a mutation's effect on protein stability and abundance with its effect on a specific activity like binding. A mutation might appear to disrupt binding simply because it causes the protein to misfold and be degraded, not because it directly alters the binding interface. To address this, the field has evolved towards more sophisticated, multi-dimensional assays. For example, many DMS studies on viral proteins now measure two phenotypes in parallel: the level of protein expression on the cell surface (a proxy for folding and stability) and its binding affinity to a receptor.

Deep Sequencing and Variant Quantification

After the selection is complete, high-throughput DNA sequencing is employed to read out the results. Samples of the variant library are sequenced from both the pre-selection (input) population and the post-selection (output) population. The goal is to obtain accurate counts for every variant in both pools.

Experimental design is paramount at this stage to ensure high-quality data. Most modern DMS experiments incorporate unique molecular identifiers (UMIs) or random barcodes into the library constructs. These short, random DNA sequences are linked to each initial variant molecule, allowing researchers to count the true number of molecules rather than just the number of sequencing reads, which can be biased by PCR amplification during library preparation. The use of multiple, independent biological replicates is also essential to assess the reproducibility of the selection and to obtain the statistical power needed to confidently identify functional effects.
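
As a rough illustration of why UMIs help, the sketch below counts distinct UMIs per variant instead of raw reads, so that a molecule amplified several times by PCR is counted once. The input format (a list of variant and UMI pairs), the mutation names, and the UMI sequences are assumptions made for this example, and sequencing errors within the UMIs are ignored.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs per variant so PCR duplicates are counted once.

    `reads` is an iterable of (variant, umi) pairs. This input format is an
    assumption for illustration; UMI sequencing errors are not corrected.
    """
    umis_per_variant = defaultdict(set)
    for variant, umi in reads:
        umis_per_variant[variant].add(umi)
    return {variant: len(umis) for variant, umis in umis_per_variant.items()}


# Hypothetical reads: the first three come from one molecule amplified by PCR.
reads = [
    ("A24G", "ACGTAC"),
    ("A24G", "ACGTAC"),
    ("A24G", "ACGTAC"),
    ("A24G", "TTGACA"),
    ("L57P", "GGCATA"),
]
molecule_counts = count_molecules(reads)
print(molecule_counts)
```

Counting raw reads would give A24G four counts here, while the UMI-aware count reports the two original molecules.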

The choice of sequencing platform is also important. Illumina sequencing platforms are most widely used due to their high accuracy and cost-effectiveness, though their relatively short read lengths can pose a challenge for assaying large protein domains in a single read. This can be overcome by barcoding strategies or by splitting the experiment into pools that each cover a smaller region. Emerging long-read technologies, such as PacBio's circular consensus sequencing (CCS) and UMI-based Nanopore sequencing, are achieving higher accuracy and read depth and may become more common for DMS studies in the future.

Data Analysis - From Raw Counts to Functional Scores

The final step in the DMS workflow is the computational analysis that transforms raw sequencing counts into a meaningful, quantitative functional landscape. The fundamental calculation is an enrichment ratio for each variant, which is typically calculated by dividing the frequency of the variant in the post-selection library by its frequency in the pre-selection library. This ratio is often normalized to the score of the wild-type variant, which is set to 1, and log-transformed to produce a more symmetric distribution of scores.
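
The enrichment calculation described above can be sketched in a few lines. This is a minimal example, assuming counts are already available as dictionaries keyed by variant name, that every variant has nonzero counts in both pools, and that the wild type is stored under the hypothetical key "WT". Because the ratio is log-transformed after normalization, a wild-type ratio of 1 corresponds to a score of 0.

```python
import math


def log_enrichment_scores(input_counts, output_counts, wild_type="WT"):
    """Log2 enrichment of each variant relative to wild type.

    Assumes every variant has a nonzero count in both pools; the "WT" key
    and the variant names are illustrative, not a standard format.
    """
    total_in = sum(input_counts.values())
    total_out = sum(output_counts.values())

    def enrichment(variant):
        freq_in = input_counts[variant] / total_in
        freq_out = output_counts[variant] / total_out
        return freq_out / freq_in

    wt_enrichment = enrichment(wild_type)
    return {
        variant: math.log2(enrichment(variant) / wt_enrichment)
        for variant in input_counts
    }


# Hypothetical counts before and after selection.
input_counts = {"WT": 1000, "A24G": 100, "L57P": 100}
output_counts = {"WT": 2000, "A24G": 400, "L57P": 50}
scores = log_enrichment_scores(input_counts, output_counts)
for variant, score in scores.items():
    print(f"{variant}: {score:+.2f}")
```

In this toy data set A24G is enriched twice as strongly as wild type and L57P four times less, so their scores fall on opposite sides of zero.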

However, simple enrichment ratios can be highly susceptible to noise, particularly for variants that are present at low counts in the input library. This has driven the development of more sophisticated statistical modeling frameworks to analyze DMS data. Software packages like dms_tools and Enrich2 have been created to address these challenges. These tools often employ a likelihood-based approach, modeling the counts using statistical distributions and incorporating information from replicates to more accurately infer functional scores and estimate their associated errors.
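
The sensitivity of low-count variants to noise can be illustrated with a simple approximation. The sketch below is not the algorithm used by dms_tools or Enrich2; it assumes each count is Poisson-distributed, so the variance of a log count is roughly 1/count, and it adds a pseudocount of 0.5 (an arbitrary choice for this example) so that zero counts stay finite.

```python
import math


def score_with_error(c_in, c_out, wt_in, wt_out, pseudocount=0.5):
    """Natural-log score relative to wild type plus an approximate standard error.

    Each count is treated as Poisson-distributed, so the variance of its log is
    about 1/count. This is a simplifying assumption, not a published method.
    """
    v_in, v_out, w_in, w_out = (c + pseudocount for c in (c_in, c_out, wt_in, wt_out))
    score = math.log(v_out / v_in) - math.log(w_out / w_in)
    std_err = math.sqrt(1 / v_in + 1 / v_out + 1 / w_in + 1 / w_out)
    return score, std_err


# Two hypothetical variants with roughly 2-fold enrichment but different depth.
deep_score, deep_se = score_with_error(1000, 2000, 50000, 50000)
shallow_score, shallow_se = score_with_error(5, 10, 50000, 50000)
print(f"deep:    score={deep_score:.2f}, SE={deep_se:.2f}")
print(f"shallow: score={shallow_score:.2f}, SE={shallow_se:.2f}")
```

The two variants have similar scores, but the one supported by only a handful of reads carries a much larger standard error, which is the behavior that motivates replicate-aware statistical models.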

The final output of this analysis is a comprehensive dataset of functional scores for thousands of variants. This is most often visualized as a sequence-function map, which is typically a heatmap where the rows represent the positions in the protein sequence, the columns represent the 19 possible amino acid substitutions, and the color of each cell indicates the functional score of that specific mutation. These maps, along with other visualizations like sequence logos, provide an intuitive and information-dense overview of the protein's entire mutational landscape, revealing patterns of constraint and flexibility across its structure.
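
The data structure behind such a heatmap is a position-by-amino-acid grid. The sketch below builds one from scores keyed by mutation names like "A24G" (wild-type residue, position, mutant residue) and prints it as text; the naming convention and the example scores are assumptions for illustration, and a real analysis would typically hand the grid to a plotting library.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_score_matrix(scores):
    """Arrange scores keyed like 'A24G' into a {position: {amino acid: score}} grid.

    Cells with no measurement (including the wild-type residue) are left as None.
    """
    positions = sorted({int(mutation[1:-1]) for mutation in scores})
    matrix = {pos: {aa: None for aa in AMINO_ACIDS} for pos in positions}
    for mutation, score in scores.items():
        position, mutant = int(mutation[1:-1]), mutation[-1]
        matrix[position][mutant] = score
    return matrix


# Hypothetical functional scores for a few substitutions.
scores = {"A24G": 0.1, "A24P": -2.3, "L25V": -0.2, "L25D": -3.0}
matrix = build_score_matrix(scores)

print("     " + " ".join(f"{aa:^5}" for aa in AMINO_ACIDS))
for position, row in matrix.items():
    cells = " ".join("  .  " if row[aa] is None else f"{row[aa]:+5.1f}" for aa in AMINO_ACIDS)
    print(f"{position:>4} {cells}")
```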

The Expanding Frontier: Diverse Applications of Mutational Scanning

DMS’s versatility allows it to be applied to a vast range of fundamental questions across biology. Its ability to generate comprehensive functional landscapes is transforming our understanding of protein function, evolution, and engineering.

Protein Engineering and Optimization: DMS is an exceptionally powerful tool for protein engineering, providing a roadmap for how to rationally modify a protein to enhance its properties. By systematically mapping the effects of mutations on function, stability, or binding affinity, DMS can rapidly identify beneficial mutations that might be missed by rational design or random screening. A prime example is its application to anti-CRISPR (Acr) proteins, which are natural inhibitors of CRISPR-Cas systems and are valuable tools for controlling gene editing. DMS has been used to map the mutational fitness landscape of Acr proteins, revealing a considerable tolerance to mutation and identifying specific substitutions that significantly boost their inhibitory potency against Cas9. This information can be used to engineer more effective and specific "off-switches" for CRISPR-based therapies.

Unraveling Allosteric Regulation: Allostery—the process by which binding at one site on a protein regulates activity at a distant site—is a fundamental mechanism of biological control. However, the communication pathways that mediate allostery are often subtle and distributed throughout the protein structure, making them difficult to dissect with traditional targeted mutagenesis. DMS is uniquely suited to this challenge because it probes the entire protein in an unbiased manner. In a landmark study, DMS was applied to four homologous bacterial allosteric transcription factors (aTFs). The results revealed that "allosteric hotspots"—residues critical for allosteric communication—were not confined to the ligand-binding or DNA-binding sites but were distributed throughout the protein structure. This provided a global, systems-level view of the intramolecular network that governs allosteric function, an insight that would be nearly impossible to achieve with piecemeal approaches.
Designing Novel Therapeutics: Antimicrobial Peptides: The rise of antibiotic resistance has created an urgent need for new classes of antimicrobial drugs. Antimicrobial peptides (AMPs) are a promising class of therapeutics that often work by lysing bacterial membranes, but their clinical development has been hampered by toxicity to mammalian cells. A novel application of DMS, termed deep mutational surface localized antimicrobial display (dmSLAY), was developed to address this challenge. The technique was used to create a comprehensive mutational map of the AMP Protegrin-1, simultaneously assessing its activity against bacterial and mammalian cells. The resulting dataset revealed key sequence features that drive membrane selectivity, such as the avoidance of large aromatic residues and the mutation of cysteine pairs. This knowledge provides clear design principles for engineering next-generation AMPs with high antibacterial potency and low toxicity, paving the way for safer and more effective antibiotics.
A New Horizon: Plant Science: While the adoption of DMS in plant sciences has been slower, its potential to revolutionize crop improvement and our understanding of plant biology is immense. Technical challenges in creating and screening large variant libraries in plants have been a barrier, but pioneering studies are beginning to emerge. In one such study, DMS was applied to the plant sugar transporter AtSWEET13. The experiment yielded a detailed map of how mutations throughout the protein affect its abundance and transport function. This type of information is invaluable for agricultural biotechnology, as it can guide efforts to engineer transporters for improved nutrient uptake, leading to crops with higher yields or enhanced resistance to pathogens that hijack these transporters. As the methods become more established, DMS is poised to become a critical tool for accelerating molecular-level discoveries in the plant sciences.

The Computational Symbiosis: Integrating Machine Learning with High-Throughput Functional Data

The raw output of a DMS experiment—vast tables of variant counts—must be carefully processed before it can fuel a machine learning model. This initial stage of data cleaning and preprocessing is critical for model performance. A key step is normalization, which puts functional scores from different experiments onto a common scale, a necessity when combining data from diverse assays. For instance, scores can be normalized such that the wild-type variant has a score of 1, while nonsense (protein-truncating) mutations have a score of 0. Any missing data points, which can arise from experimental dropouts, are often imputed using the mean value of available scores to create a complete dataset.
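
The normalization and imputation steps described above might look like the following sketch. It assumes raw scores are held in a dictionary with None marking a dropout, and that reference scores for wild type and for the average nonsense mutation have already been measured; the variant names and numbers are made up for illustration.

```python
from statistics import mean


def normalize_scores(raw_scores, wild_type_score, nonsense_score):
    """Rescale scores linearly so wild type maps to 1 and nonsense maps to 0."""
    span = wild_type_score - nonsense_score
    return {
        variant: None if score is None else (score - nonsense_score) / span
        for variant, score in raw_scores.items()
    }


def impute_missing(scores):
    """Replace missing scores (None) with the mean of the observed scores."""
    observed_mean = mean(s for s in scores.values() if s is not None)
    return {v: observed_mean if s is None else s for v, s in scores.items()}


# Hypothetical raw log-enrichment scores; L25D dropped out of the experiment.
raw = {"A24G": 0.5, "A24P": -1.5, "L25V": 0.0, "L25D": None}
normalized = normalize_scores(raw, wild_type_score=0.5, nonsense_score=-3.5)
complete = impute_missing(normalized)
print(complete)
```

Mean imputation is the simplest option and can blur real signal, so it is worth recording which values were imputed if they are later used for model training.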

Once the functional scores are cleaned, the protein sequences themselves must be converted into a machine-readable format. The most common method is one-hot encoding. In this approach, each amino acid in a protein sequence is represented by a vector. For a vocabulary of 20 standard amino acids, this would be a 20-dimensional vector where 19 positions are zero and a single position is one, uniquely identifying that amino acid. A full protein sequence is thus transformed into a matrix of these vectors, providing a numerical representation that machine learning algorithms can process.
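
A minimal one-hot encoder, written with plain Python lists, is sketched below. The alphabetical ordering of the 20 amino acids is an arbitrary choice for this example; any fixed ordering works as long as it is used consistently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 marking its identity."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix


# Encode a short hypothetical peptide; the result is a 3 x 20 matrix.
encoded = one_hot_encode("MKV")
for residue, row in zip("MKV", encoded):
    print(residue, row)
```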

With preprocessed data in hand, the next step is to select an appropriate AI model. The choice depends on the specific goal, whether it's predicting the function of existing variants or designing entirely new ones.

Supervised Learning: For predicting the functional score of a given enzyme sequence, supervised learning models are the standard approach. Deep learning architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have proven particularly effective. A CNN can scan across the one-hot encoded sequence, learning to recognize important local patterns or motifs that are critical for function. RNNs, especially those with Long Short-Term Memory (LSTM) units, are designed to process sequential data and can capture long-range dependencies between amino acids that are distant in the primary sequence but may be close in the folded 3D structure. By training on thousands of variant sequences and their corresponding functional scores from a DMS experiment, these models learn the complex, non-linear "rules" of the enzyme's sequence-function relationship.
Generative Models: More advanced applications, such as designing entirely new proteins, call for generative models. Architectures like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) go beyond simple prediction. A VAE learns a compressed, low-dimensional "latent space" representation of the training data. By sampling from this latent space, it can generate entirely new protein sequences that share the essential characteristics of the functional proteins it was trained on. GANs consist of two competing networks (a generator that creates new sequences and a discriminator that tries to distinguish them from real sequences), which are trained against each other to produce highly realistic and novel protein designs. These models can not only predict function but can also generate entirely new sequences that are likely to be functional, opening up new frontiers in protein engineering.
The process of training an AI model on DMS data can be understood through a powerful analogy: learning a new language. In this view, the 20 amino acids are the alphabet, short functional motifs are the words, and the entire protein sequence is a sentence that encodes a specific structure and function. Just as an AI can learn the grammar and semantics of English by processing vast amounts of text from the internet, a protein language model can learn the "language" of an enzyme by "reading" the comprehensive sequence-function map provided by a DMS experiment.
A single mutation can be like changing a letter in a word—it might render the sentence meaningless (a loss-of-function mutation) or subtly change its meaning (altering activity). By analyzing thousands of these "sentences" and their functional outcomes (the DMS scores), the model learns the underlying rules. It learns which "words" (motifs) are essential, which "grammatical structures" (long-range interactions) are required for stability, and which changes are tolerated. The outcome of this training process is a powerful predictive model that has internalized the enzyme's functional landscape. This model can then be used to accurately score novel sequences—variants that have never been seen or tested—without the need for immediate and costly experimental validation, dramatically accelerating the pace of protein research and engineering.
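
The motif-scanning behavior attributed to convolutional layers above can be illustrated without a deep learning framework. The sketch below slides a single filter along a one-hot encoded sequence and reports one activation per window. The filter weights are set by hand to match the hypothetical motif "HG"; in a real CNN, many such filters would be learned from the DMS scores by gradient descent.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a sequence as a list of 20-element rows of 0.0 and 1.0."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS] for residue in sequence]


def conv1d(encoded, kernel):
    """Slide a (width x 20) kernel along the sequence; one activation per window."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        total = sum(
            weight * value
            for kernel_row, window_row in zip(kernel, window)
            for weight, value in zip(kernel_row, window_row)
        )
        activations.append(total)
    return activations


# Hand-set filter for the hypothetical motif "HG" (illustrative, not learned).
kernel = one_hot("HG")
activations = conv1d(one_hot("MKHGLV"), kernel)
print(activations)
```

The activation peaks at the window where the motif occurs, which is the signal that later layers of a trained network would combine to predict a functional score.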

The Iterative Cycle in Practice: AI-Guided Protein Design

The synergy between DMS and AI is not merely theoretical; it has enabled a powerful, iterative cycle that is actively being used to engineer novel proteins with tailored functions.

Engineering Gene Therapy Vectors. A prominent example is the optimization of Adeno-Associated Virus (AAV) capsids, the protein shells used as vectors in gene therapy. The challenge is to design capsids that can efficiently target specific tissues while evading the patient's immune system. In a series of landmark studies, researchers created massive DMS libraries of AAV capsid variants and administered them to animal models. By sequencing the variants that successfully reached the target organ, they generated a rich in vivo fitness landscape. This data was used to train a machine learning model that learned the sequence rules for successful delivery. The model then generated novel capsid sequences predicted to have even higher performance. When these AI-designed capsids were synthesized and tested, they showed dramatically improved and highly specific organ targeting, creating far more effective vectors for gene therapy.
Designing Selective Antimicrobial Peptides. As mentioned previously, the dmSLAY method used DMS to map the function of the antimicrobial peptide Protegrin-1 against both bacterial and human cells. A machine learning model trained on this dual-objective dataset learned the sequence features that govern selectivity. The model's predictions enabled the in silico design of novel peptide sequences with a predicted high therapeutic index (high antibacterial activity and low toxicity). Subsequent experimental validation confirmed that these AI-designed peptides were indeed highly potent against bacteria while being significantly safer for human cells than the original peptide, demonstrating a clear path to designing better therapeutics.

A Critical Perspective: Navigating the Promise and Pitfalls of DMS

Deep Mutational Scanning has undeniably transformed protein science, providing an unprecedented ability to map sequence to function on a massive scale. However, a balanced perspective requires acknowledging both its revolutionary advantages and its significant practical and conceptual limitations. The future progress of the field will depend on creatively addressing these challenges.

The Transformative Advantages: The primary strengths of DMS are clear and profound. First is its scale and throughput. The ability to functionally assess hundreds of thousands or even millions of protein variants in a single, pooled experiment represents a quantum leap in efficiency over traditional, one-at-a-time methods. Second is its power for unbiased discovery. By systematically testing all possible mutations without preconceived notions of their importance, DMS removes the constraint of hypothesis-driven research and enables the discovery of unexpected functional hotspots, allosteric networks, and complex epistatic interactions. Finally, on a per-variant basis, DMS is remarkably efficient and economical, making large-scale functional genomics accessible to a wider range of laboratories and research questions.
The Inherent Challenges and Limitations: Despite its power, DMS is not without significant challenges. The most critical bottleneck is often the design of the functional assay itself. The entire experiment rests on the development of a robust, scalable selection scheme that accurately recapitulates the biological function of interest. This is a creative and technical hurdle that can be particularly difficult for proteins with complex or subtle cellular functions.
Second are limitations related to scalability and biological context. While DMS can test many variants, there is still a practical limit to library size imposed by factors like the efficiency of transforming the library into cells. A more fundamental limitation is that most DMS experiments are performed using ectopic expression of the variant library in a model system, such as yeast or cultured human cells. This context may not fully mirror the protein's environment and regulation in its native tissue or organism. Performing DMS at endogenous genomic loci is a major goal for the field but remains technically challenging.
Finally, biophysical ambiguity and data noise are persistent issues. As discussed, a single functional score can conflate multiple molecular effects (e.g., stability, binding, catalysis), complicating mechanistic interpretation. The data analysis pipeline is also complex, requiring sophisticated statistical methods to handle sequencing errors, sampling noise, and experimental variance to extract a clean and reliable biological signal from the raw counts.

The Future Promise

The trajectory of Deep Mutational Scanning points toward an increasingly powerful and integrated future. Methodologically, the field will continue to develop more sophisticated, multi-phenotype assays to resolve biophysical ambiguity and expand the toolkit to tackle more complex systems, including whole-genome scanning and in vivo applications.

The ultimate promise of this technology, especially when coupled with machine learning, is the creation of a comprehensive functional atlas of variation. The goal is to build a future where the functional consequence of any possible genetic variant in a clinically relevant gene can be found in a database of empirical functional scores. Such a resource would revolutionize protein engineering and drug design, and provide an unprecedentedly deep and quantitative understanding of the fundamental rules that link a protein's sequence to its biological function. While many challenges remain, DMS has laid the foundation for a new era of discovery in biology and medicine.

Additional Resources on DMS and Functional Assays

Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801-807.
Tareen A, Koşaloğlu-Yalçın Z, Darnell SJ, et al. Integrating deep mutational scanning and low-throughput experimental data for variant effect prediction using a joint Gaussian process model. GigaScience. 2023;12:giad073.
Araya CL, Fowler DM. Measuring the activity of protein variants on a large scale using deep mutational scanning. Methods Mol Biol. 2015;1278:349-365.
Li C, Wang S, Wang Y, et al. Deep mutational scanning: A versatile tool in systematically mapping genotypes to phenotypes. Front Genet. 2023;14:1087267.
Yuan D, Gelman H, Smaill Z, et al. Deep Mutational Scanning Comprehensively Maps How Zika Envelope Protein Mutations Affect Viral Growth and Antibody Escape. J Virol. 2020;94(4):e01291-19.
Dadonaite B, Crawford KHD, D. H. O’Connor S, et al. Deep mutational scanning of whole SARS-CoV-2 spike in an inverted infection system. bioRxiv. Preprint posted online July 18, 2023.
Rocklin GJ, Chidyausiku TM, Goreshnik I, et al. Deep mutational scanning and CRISPR-engineered viruses: tools for evolutionary and functional genomics studies. mSphere. 2024;9(6):e0050824.
Starr TN, Greaney AJ, Addetia A, et al. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in SARS-CoV-2 Omicron BA.1 and BA.2. PLoS Pathog. 2022;18(11):e1010951.
Lee J, Gerasimavicius L, Miller E, et al. popDMS infers mutation effects from deep mutational scanning data. Bioinformatics. 2024;40(8):btae499.
Bloom JD. dms_tools: Software for the analysis and visualization of deep mutational scanning data. PeerJ. 2015;3:e976.
Basile W, List M, Hart T, et al. A deep mutational scanning platform to characterize the fitness landscape of anti-CRISPR proteins. Nucleic Acids Res. 2024;52(22):e103.
Stiffler MA, Hekstra DR, Ranganathan R. Deep mutational scanning and machine learning reveal structural and dynamic rules for allosteric hotspots. Elife. 2023;12:e79932.
Linsky M, Arad O, Goren MG, et al. Deep mutational scanning and machine learning uncover antimicrobial peptide features driving membrane selectivity. bioRxiv. Preprint posted online July 28, 2023.
Lawrence T, Zhao Y, He Y, et al. Deep mutational scanning reveals sequence to function constraints for SWEET family transporters. bioRxiv. Preprint posted online June 28, 2024.
Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction. The Moonlight. Published online October 11, 2023. Accessed July 8, 2025.
Yang KK. Learning the language of proteins. Kevin Kaichuang Yang's Blog. Published March 26, 2018. Accessed July 8, 2025.
Bileschi ML, Belanger D, Sanderson T, et al. ProteInfer, deep neural networks for protein functional inference. Elife. 2023;12:e80942.
Automated Protein Function Prediction Using Machine Learning Techniques. GitHub. Accessed July 8, 2025. https://cepdnaclk.github.io/e16-4yp-Automated-Protein-Function-Prediction/
Al-Tashi Q, Jadid Abdulkadir S, Rais H, et al. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Appl Sci. 2024;15(9):517.
Wang J, Ma Y, You L, et al. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform. 2023;24(5):bbad289.
Ingraham J, Riesselman A, Hie B, et al. Deep Generative Modeling for Protein Design. arXiv. Preprint posted online September 27, 2021.
Satorras R, Eismann S, Ingraham J, et al. Ig-VAE: Generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput Biol. 2023;19(7):e1010271.
Brandes N, Ofer D, Linial M. The promises of large language models for protein design and modeling. Cell Mol Life Sci. 2023;80(12):345.
Chen J, Zhang Y, Zhang Y, et al. A new way of looking at transcription factor assays. Front Bioeng Biotechnol. 2024;12:1365452.

Ranomics